Fetching synonyms from the CZI and SoftwareKG datasets

Because the CZI dataset collects synonyms from the most widely used software websites (PyPI, CRAN, Bioconductor, and SciCrunch), this notebook queries both datasets for every software name in the benchmark and tries to retrieve its synonyms.

First, we save all unique software names (lowercased) as the keys of a dictionary. Then the function get_synonyms_from_CZI finds all synonyms whose software_mention (lowercased) equals a key in the dictionary. The dictionary speeds up the process because each unique name is looked up only once.

Next, the function get_synonyms_from_SoftwareKG queries the SoftwareKG graph for synonyms of each lowercased name. Again, the dictionary speeds up the process by querying only once per name. Names that already received synonyms from CZI are skipped.

The CZI dataset was downloaded from https://datadryad.org/dataset/doi:10.5061/dryad.6wwpzgn2c#methods (disambiguated file).
SoftwareKG SPARQL endpoint: https://data.gesis.org/somesci/sparql

In [None]:
import json
import os

import pandas as pd

# Read the CZI synonyms table
df = pd.read_csv("../CZI/synonyms_matrix.csv")


In [None]:
benchmark_df = pd.read_csv("../temp/v3.2/updated_with_metadata_file.csv")

In [None]:
%pip install sparqlwrapper

Note: you may need to restart the kernel to use updated packages.





In [None]:
# Check for null values in software_mention and drop those rows
print(df["software_mention"].isnull().sum())
df = df.dropna(subset=["software_mention"])
print(df["software_mention"].isnull().sum())
# Convert all names to strings
df["software_mention"] = df["software_mention"].astype(str)
benchmark_df["name"] = benchmark_df["name"].astype(str)


2
0


In [None]:
# Build a dictionary that maps each lowercased benchmark name to an initially empty set of synonyms
benchmark_dictonary = {name.lower(): set() for name in benchmark_df["name"].unique()}


In [None]:
# Retrieve the CZI synonyms for every software name in the dictionary
def get_synonyms_from_CZI(df, dictionary):
    # Lowercase the mentions once instead of once per key
    mentions = df["software_mention"].str.lower()
    for key in dictionary.keys():
        # Skip names that already have synonyms
        if dictionary[key] != set():
            continue
        # Find the rows of df whose software mention matches the dictionary key
        matches = df.loc[mentions == key, "synonym"].tolist()
        # Add the synonyms to the set stored under this key
        dictionary[key].update(matches)
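
The loop above filters the whole CZI table once per benchmark name. A possible alternative, sketched below under the assumption that the table has the software_mention and synonym columns used in this notebook, groups the table once and then answers each name with a dictionary lookup. The helper name build_czi_lookup and the toy rows are hypothetical and serve only as an illustration.

```python
import pandas as pd


def build_czi_lookup(czi_df):
    """Map each lowercased software_mention to the set of its synonyms."""
    # Drop rows that lack either the mention or the synonym
    cleaned = czi_df.dropna(subset=["software_mention", "synonym"])
    lowered = cleaned["software_mention"].astype(str).str.lower()
    # Group the synonyms by lowercased mention and collect each group into a set
    return cleaned.groupby(lowered)["synonym"].apply(set).to_dict()


# Toy data, invented for illustration only
toy_czi = pd.DataFrame({
    "software_mention": ["NumPy", "numpy", "SciPy", None],
    "synonym": ["Numerical Python", "np", "scipy library", "orphan"],
})

lookup = build_czi_lookup(toy_czi)
benchmark_names = ["numpy", "pandas"]
# Names without any CZI entry get an empty set
synonyms = {name: lookup.get(name, set()) for name in benchmark_names}
```

Whether this is noticeably faster depends on the size of the CZI table and the number of benchmark names; it has not been measured here.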

    

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON


def get_synonyms_from_SoftwareKG(dictionary):
    # Define the SPARQL endpoint
    sparql = SPARQLWrapper("https://data.gesis.org/somesci/sparql")
    for key in dictionary.keys():
        # Skip names that already have synonyms (for example, from CZI)
        if dictionary[key] != set():
            continue
        # Build the query for this name
        query = f"""
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX sms: <http://data.gesis.org/somesci/>
PREFIX its: <http://www.w3.org/2005/11/its/rdf#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?synonym
WHERE {{
    # Find the software entity associated with the given spelling
    ?sw_phrase a nif:Phrase ;
               its:taClassRef [ rdfs:subClassOf sms:Software ] ;
               its:taIdentRef ?sw_identity ;
               nif:anchorOf "{key}" .  # The software name to look up

    # Retrieve other spellings linked to the same software identity
    ?other_phrase its:taIdentRef ?sw_identity ;
                  nif:anchorOf ?synonym .

    FILTER (?synonym != "{key}")  # Exclude the original input spelling from results
}}
ORDER BY ?synonym
"""
        try:
            # Set query and return format
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # Process results
            for result in results["results"]["bindings"]:
                synonym = result.get("synonym", {}).get("value")
                if synonym:
                    dictionary[key].add(synonym)

        except Exception as e:
            print(f"Error retrieving synonyms for {key}: {e}")
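
Software names are interpolated directly into the query text, so a name that contains a double quote or a backslash would produce a malformed query. A minimal sketch of escaping the name before interpolation follows; the helper names sparql_string_literal and build_synonym_query are hypothetical, and the query is a shortened version of the one above. Note also that the dictionary keys are lowercased while a plain literal in a triple pattern is matched exactly, so a lowercased key only matches spellings stored in lowercase; the sketch keeps the exact match and does not try to address that.

```python
def sparql_string_literal(text):
    """Return text as a double-quoted SPARQL string literal with special characters escaped."""
    escaped = (
        text.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')        # then double quotes
        .replace("\n", "\\n")       # newlines are not allowed in short literals
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_synonym_query(name):
    """Build a shortened synonym query for one software name (illustrative only)."""
    literal = sparql_string_literal(name)
    return f"""
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX its: <http://www.w3.org/2005/11/its/rdf#>

SELECT DISTINCT ?synonym
WHERE {{
    ?sw_phrase its:taIdentRef ?sw_identity ;
               nif:anchorOf {literal} .
    ?other_phrase its:taIdentRef ?sw_identity ;
                  nif:anchorOf ?synonym .
    FILTER (?synonym != {literal})
}}
ORDER BY ?synonym
"""


# A made-up name that would break naive interpolation
query = build_synonym_query('my "tool"')
```

The resulting string could be passed to sparql.setQuery in place of the inline f-string.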
        

In [None]:
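def get_synonyms(dictionary, CZI=1, SoftwareKG=1):
    # Query each enabled source; both functions update the dictionary in place
    if CZI == 1:
        get_synonyms_from_CZI(df, dictionary)
    if SoftwareKG == 1:
        get_synonyms_from_SoftwareKG(dictionary)
    # Convert the sets to lists so the result can be saved as JSON
    dictionary = {key: list(value) for key, value in dictionary.items()}
    return dictionary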
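
get_synonyms returns lists because the json module cannot serialize Python sets. When a saved dictionary is loaded again, its values are lists, so the check dictionary[key] != set() is true for every cached name (a list never equals a set) and those names are skipped. Names added to the benchmark after the file was written are absent from the loaded dictionary. One possible way to handle both points, sketched with a hypothetical helper merge_with_cache and made-up data, is to start from the current benchmark names and convert the cached lists back to sets.

```python
import json


def merge_with_cache(current_names, cached):
    """Start from the current names and fill in synonyms found in the cache."""
    merged = {name.lower(): set() for name in current_names}
    for name, cached_synonyms in cached.items():
        # Ignore cached names that are no longer part of the benchmark
        if name in merged:
            merged[name] = set(cached_synonyms)
    return merged


# Made-up cache content, as it would look after json.load
cached = json.loads(json.dumps({"numpy": ["np"], "oldtool": ["ot"]}))

merged = merge_with_cache(["NumPy", "Pandas"], cached)

# Convert back to sorted lists before saving, since sets are not JSON serializable
json_ready = {name: sorted(values) for name, values in merged.items()}
saved_text = json.dumps(json_ready, ensure_ascii=False)
```

With this variant, a cached name whose list is empty would be queried again on the next run, which differs from the skip behaviour described above.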

In [None]:
# Load the cached synonym dictionary if it exists; otherwise start from the benchmark names
output_json_path = "./synonym_dictionary.json"
if os.path.exists(output_json_path) and os.path.getsize(output_json_path) > 0:
    with open(output_json_path, "r", encoding="utf-8") as f:
        try:
            benchmark_dictonary = json.load(f)
        except json.JSONDecodeError:
            print("⚠️ Warning: Could not decode existing JSON. Starting with empty cache.")
            benchmark_dictonary = {name.lower(): set() for name in benchmark_df["name"].unique()}
else:
    benchmark_dictonary = {name.lower(): set() for name in benchmark_df["name"].unique()}
benchmark_dictonary = get_synonyms(benchmark_dictonary, 1, 1)
# Save the updated dictionary to a JSON file
with open(output_json_path, "w", encoding="utf-8") as f:
    json.dump(benchmark_dictonary, f, ensure_ascii=False, indent=4)
#print(benchmark_dictonary)
# Add the synonyms column: the comma-joined synonyms of each lowercased name
benchmark_df["synonyms"] = (benchmark_df["name"]
    .str.lower()
    .map(benchmark_dictonary)
    .str.join(",")
)

# Save the benchmark with the new synonyms column
benchmark_df.to_csv("../temp/v3.2/updated_with_metadata_file.csv", index=False)
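
The synonyms column relies on every lowercased name being a key of the dictionary and on every synonym being a string: Series.map returns NaN for a missing key, and .str.join returns NaN for a list that contains a non-string element. A defensive variant, sketched on made-up data with a hypothetical helper join_synonyms, falls back to an empty string instead.

```python
import pandas as pd

# Made-up benchmark rows and synonym dictionary, for illustration only
toy_benchmark = pd.DataFrame({"name": ["NumPy", "Pandas", "Unknown Tool"]})
toy_synonyms = {"numpy": ["np", "Numerical Python"], "pandas": []}


def join_synonyms(name):
    """Return the comma-joined synonyms of a name, or an empty string if there are none."""
    values = toy_synonyms.get(name.lower(), [])
    # Cast each synonym to str so a stray non-string value cannot break the join
    return ",".join(str(value) for value in values)


toy_benchmark["synonyms"] = toy_benchmark["name"].map(join_synonyms)
```

A synonym that itself contains a comma cannot be recovered from the joined string, so a different separator may be preferable if such names occur.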