Fetching synonyms from the CZI and SoftwareKG datasets

The CZI dataset collects synonyms from widely used software repositories (PyPI, CRAN, Bioconductor and SciCrunch). For every software name in the benchmark, this notebook queries the CZI and SoftwareKG datasets and tries to retrieve its synonyms.

First we save all unique software names (lowercased) as the keys of a dictionary. Then, using the function get_synonyms, we collect every synonym whose software_mention (lowercased) equals a key in the dictionary. A dictionary is used to speed up the process.

The CZI dataset was downloaded from https://datadryad.org/dataset/doi:10.5061/dryad.6wwpzgn2c#methods (disambiguated file).

In [14]:
import pandas as pd 
#Read files
df = pd.read_csv("CZI/synonyms_matrix.csv")
benchmark_df = pd.read_excel("Benchmark.xlsx")

In [15]:
#check for null values and drop them
#turning all names into strings
print(df["software_mention"].isnull().sum())
df = df.dropna(subset=["software_mention"])
print(df["software_mention"].isnull().sum())
df['software_mention'] = df['software_mention'].astype(str)  # Convert all values to strings
benchmark_df['name']=benchmark_df['name'].astype(str)


2
0


In [None]:
# Build a dictionary with one empty set per unique (lowercased) benchmark name
benchmark_dict = {name.lower(): set() for name in benchmark_df["name"].unique()}
# TODO: add a second dictionary for SoftwareKG that does not lowercase the names,
# query SoftwareKG with it, and then merge the two dictionaries
# Manually add "excel" as an extra key
benchmark_dict["excel"] = set()
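
The note above proposes a second dictionary that keeps the original casing for SoftwareKG and is later combined with the lowercased one. A minimal sketch of that merge step follows; the helper name `merge_synonym_dicts` and the toy dictionaries are assumptions for illustration, not part of the notebook.

```python
def merge_synonym_dicts(lowered, cased):
    """Fold a case-preserving synonym dict into a lowercased one.

    Hypothetical helper: keys of `cased` are lowercased and their
    synonym sets are united with the sets already in `lowered`.
    """
    # Copy the sets so the input dictionaries are not modified
    merged = {key: set(values) for key, values in lowered.items()}
    for name, synonyms in cased.items():
        merged.setdefault(name.lower(), set()).update(synonyms)
    return merged


# Toy inputs (illustrative values only)
czi_style = {"excel": {"microsoft excel"}}
kg_style = {"Excel": {"EXCEL"}, "EXCEL": {"Excel"}}
merged = merge_synonym_dicts(czi_style, kg_style)
```

Both spellings in the case-preserving dictionary end up under the single key "excel", so the result can be mapped onto the benchmark in the same way as benchmark_dict.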

In [24]:
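# Function that retrieves synonyms for each software mention from the CZI dataset
def get_synonyms_from_CZI(df, dictionary):
    # Lowercase the mentions once instead of on every loop iteration
    mentions = df["software_mention"].str.lower()
    for key in dictionary.keys():
        # Find the rows whose software mention matches the dictionary key
        matches = df.loc[mentions == key, "synonym"].dropna().astype(str)
        # Add every synonym to the key's set; update() takes an iterable,
        # whereas add() would fail on an unhashable list
        dictionary[key].update(matches)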
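
The function above filters the whole table once per benchmark name. As a sketch of an alternative, assuming the rows are available as (software_mention, synonym) pairs, the synonyms can be grouped in a single pass and then looked up per key. The helper names `build_synonym_lookup` and `fill_from_lookup` and the toy rows are hypothetical.

```python
from collections import defaultdict


def build_synonym_lookup(rows):
    """Group synonyms by lowercased mention in one pass over (mention, synonym) pairs."""
    lookup = defaultdict(set)
    for mention, synonym in rows:
        lookup[str(mention).lower()].add(str(synonym))
    return lookup


def fill_from_lookup(dictionary, lookup):
    """Add the synonyms of every key found in the lookup to that key's set."""
    for key in dictionary:
        dictionary[key].update(lookup.get(key, set()))


# Toy rows standing in for the CZI table (illustrative values, not real data)
rows = [("Excel", "Microsoft Excel"), ("EXCEL", "MS Excel"), ("pandas", "Pandas")]
demo = {"excel": set(), "rhino": set()}
fill_from_lookup(demo, build_synonym_lookup(rows))
```

With the real table, the pairs could be taken from df[["software_mention", "synonym"]].itertuples(index=False), after dropping rows whose synonym is missing.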

    

In [25]:
from SPARQLWrapper import SPARQLWrapper, JSON


def get_synonyms_from_SoftwareKG(dictionary):
    # Define the SPARQL endpoint
    sparql = SPARQLWrapper("https://data.gesis.org/somesci/sparql")
    # Build and execute one query per software name
    for key in dictionary.keys():
        query = f"""
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sms: <http://data.gesis.org/somesci/>
PREFIX its: <http://www.w3.org/2005/11/its/rdf#>

SELECT DISTINCT ?synonym
WHERE {{
    # Find the software entity associated with the given spelling
    ?sw_phrase a nif:Phrase ;
               its:taClassRef [ rdfs:subClassOf sms:Software ] ;
               its:taIdentRef ?sw_identity ;
               nif:anchorOf "{key}" .  # The software name currently being queried

    # Retrieve other spellings linked to the same software identity
    ?other_phrase its:taIdentRef ?sw_identity ;
                  nif:anchorOf ?synonym .

    FILTER (?synonym != "{key}")  # Exclude the original input spelling from results
}}
ORDER BY ?synonym
"""
        try:
            # Set query and return format
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # Process results
            for result in results["results"]["bindings"]:
                synonym = result.get("synonym", {}).get("value")
                if synonym:
                    dictionary[key].add(synonym)

        except Exception as e:
            print(f"Error retrieving synonyms for {key}: {e}")
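
The query above places each key directly inside a double-quoted SPARQL literal, so a benchmark name containing a double quote or a backslash would produce a malformed query. A minimal sketch of an escaping step, assuming a hypothetical helper `escape_sparql_literal` that is not part of SPARQLWrapper:

```python
def escape_sparql_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal (hypothetical helper)."""
    # The backslash must be handled first so later escapes are not doubled
    replacements = [
        ("\\", "\\\\"),
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for raw, escaped in replacements:
        text = text.replace(raw, escaped)
    return text


# Example: the escaped value is what would be placed between the quotes
safe_key = escape_sparql_literal('panda "pipeline"')
```

The escaped value would replace the raw key in both places where the key appears in the query text.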
        

In [26]:
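def get_synonyms(dictionary, CZI=1, SoftwareKG=1):
    # Fill the sets in `dictionary` in place from the selected sources
    if CZI == 1:
        get_synonyms_from_CZI(df, dictionary)
    if SoftwareKG == 1:
        get_synonyms_from_SoftwareKG(dictionary)
    # Return the synonyms as one comma-separated string per key
    # (the sets in `dictionary` are left unchanged)
    return {key: ", ".join(sorted(value)) for key, value in dictionary.items()}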

In [32]:
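# Add synonyms column
get_synonyms(benchmark_dict, 0, 1)  # SoftwareKG only; fills the sets in place
print(benchmark_dict)
# Store each set as a comma-separated string so the CSV column is readable
benchmark_df["synonyms"] = benchmark_df["name"].str.lower().map(
    lambda name: ", ".join(sorted(benchmark_dict.get(name, set())))
)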

# Save the enriched benchmark
benchmark_df.to_csv("benchmark_with_synonyms.csv", index=False)

{'sklearn': set(), 'sklearn python package': set(), 'python package sklearn': set(), 'python sklearn library': set(), 'python sklearn': set(), 'sklearn python': set(), 'pandas': set(), 'panda': set(), 'panda (pipeline for analyzing brain diffusion images)': set(), 'activity': set(), 'sets': set(), 'set': set(), 'rhino': set(), 'rhinoceros': set(), 'rhinos': set(), 'excel': {'Office Excel', 'Microsoft Office Excel', 'Excel', 'EXCEL'}}
