Fetching synonyms from the CZI and SoftwareKG datasets

The CZI dataset contains synonyms fetched from the most widely used software websites (PyPI, CRAN, Bioconductor and SciCrunch). For every software name in the benchmark, this notebook queries both datasets and tries to retrieve synonyms.

First, we save all unique software names (lowercased) as keys of a dictionary. Then, using the function get_synonyms_from_CZI, we find all synonyms whose software_mention (lowercased) equals a dictionary key. The dictionary speeds up the process because each distinct name is looked up only once.

Next, we query the SoftwareKG graph for synonyms of each lowercased name. Again, the dictionary speeds up the process by querying only once per distinct name. This is done in the get_synonyms_from_SoftwareKG function.

The CZI dataset was downloaded from https://datadryad.org/dataset/doi:10.5061/dryad.6wwpzgn2c#methods (disambiguated file).
SoftwareKG SPARQL endpoint: https://data.gesis.org/somesci/sparql

In [16]:
import pandas as pd
# Read the CZI synonym table and the benchmark
df = pd.read_csv("CZI/synonyms_matrix.csv")
benchmark_df = pd.read_excel("Benchmark.xlsx")

In [17]:
# Count missing software mentions, drop those rows, then confirm that none remain
print(df["software_mention"].isnull().sum())
df = df.dropna(subset=["software_mention"])
print(df["software_mention"].isnull().sum())
# Convert all names to strings so that lowercasing works on every value
df["software_mention"] = df["software_mention"].astype(str)
benchmark_df["name"] = benchmark_df["name"].astype(str)


2
0


In [18]:
# Map each unique lowercased benchmark name to an (initially empty) set of synonyms
benchmark_dictonary = {name.lower(): set() for name in benchmark_df["name"].unique()}


In [19]:
# Function that retrieves CZI synonyms for each software mention
def get_synonyms_from_CZI(df, dictionary):
    # Lowercase the mentions once instead of once per key
    mentions = df["software_mention"].str.lower()
    for key in dictionary.keys():
        # Find rows whose lowercased software mention matches the dictionary key
        matches = df.loc[mentions == key, "synonym"].dropna().tolist()
        # Add the synonyms to the key's set (duplicates are ignored)
        dictionary[key].update(matches)
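
The per-key filtering above scans the whole CZI table once for every benchmark name. As a minimal sketch of an alternative, assuming the table can be iterated as (software_mention, synonym) pairs, a single pass can build a lookup from lowercased mention to synonyms. The names build_synonym_index and fill_from_index are hypothetical and not part of the notebook; with pandas, the pairs could come from zip(df["software_mention"], df["synonym"]).

```python
from collections import defaultdict


def build_synonym_index(pairs):
    """Map each lowercased software mention to the set of its synonyms.

    `pairs` is any iterable of (software_mention, synonym) tuples.
    """
    index = defaultdict(set)
    for mention, synonym in pairs:
        index[str(mention).lower()].add(synonym)
    return index


def fill_from_index(dictionary, index):
    """Add the indexed synonyms to every key already present in `dictionary`."""
    for key in dictionary:
        dictionary[key].update(index.get(key, set()))


# Toy data for illustration only, not taken from the CZI file
toy_pairs = [
    ("sklearn", "scikit-learn"),
    ("Sklearn", "scikits.learn"),
    ("ggplot2", "ggplot"),
]
toy_dictionary = {"sklearn": set(), "numpy": set()}
fill_from_index(toy_dictionary, build_synonym_index(toy_pairs))
print(toy_dictionary)
```

Names that are absent from the index keep an empty set, which mirrors the behavior of the filtering version.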

    

In [20]:
from SPARQLWrapper import SPARQLWrapper, JSON


def get_synonyms_from_SoftwareKG(dictionary):
    # Define the SPARQL endpoint
    sparql = SPARQLWrapper("https://data.gesis.org/somesci/sparql")
    # Build and run one query per dictionary key
    for key in dictionary.keys():
        query = f"""
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sms: <http://data.gesis.org/somesci/>
PREFIX its: <http://www.w3.org/2005/11/its/rdf#>

SELECT DISTINCT ?synonym
WHERE {{
    # Find the software entity associated with the given spelling
    ?sw_phrase a nif:Phrase ;
               its:taClassRef [ rdfs:subClassOf sms:Software ] ;
               its:taIdentRef ?sw_identity ;
               nif:anchorOf "{key}" .  # Exact, case-sensitive match on the key

    # Retrieve other spellings linked to the same software identity
    ?other_phrase its:taIdentRef ?sw_identity ;
                  nif:anchorOf ?synonym .

    FILTER (?synonym != "{key}")  # Exclude the original input spelling from results
}}
ORDER BY ?synonym
"""
        try:
            # Set query and return format
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # Process results
            for result in results["results"]["bindings"]:
                synonym = result.get("synonym", {}).get("value")
                if synonym:
                    dictionary[key].add(synonym)

        except Exception as e:
            print(f"Error retrieving synonyms for {key}: {e}")
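
Because the key is interpolated directly into the query text, a software name containing a double quote or a backslash would produce an invalid query. A minimal sketch of escaping the name before interpolation follows. escape_sparql_literal is a hypothetical helper, not part of SPARQLWrapper or of this notebook, and it handles only the escapes most likely to matter for software names.

```python
def escape_sparql_literal(text):
    """Escape characters that would break a double-quoted SPARQL string literal."""
    replacements = {
        "\\": "\\\\",
        '"': '\\"',
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    }
    # Each character is mapped independently, so no escape is applied twice
    return "".join(replacements.get(char, char) for char in text)


# Illustrative name only, not taken from the benchmark
name = 'my "tool"'
triple = f'nif:anchorOf "{escape_sparql_literal(name)}" .'
print(triple)  # nif:anchorOf "my \"tool\"" .
```

Inside the query template, the same call would wrap both occurrences of the key, so that the triple pattern and the FILTER stay consistent.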
        

In [21]:
def get_synonyms(dictionary, CZI=1, SoftwareKG=1):
    # Each flag switches one source on (1) or off (0); df is the CZI table loaded above
    if CZI == 1:
        get_synonyms_from_CZI(df, dictionary)
    if SoftwareKG == 1:
        get_synonyms_from_SoftwareKG(dictionary)
    # Collapse each set of synonyms into a single comma-separated string
    return {key: ", ".join(value) for key, value in dictionary.items()}
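
Joining with ", " is compact, but a synonym that itself contains a comma cannot be separated again after the join, and the order of a set is not stable between runs. If the column needs to be parsed later, one option is to store a sorted JSON list instead. This is a sketch of that alternative, not what the notebook does; serialize_synonyms is a hypothetical name.

```python
import json


def serialize_synonyms(dictionary):
    """Turn each set of synonyms into a JSON list with a stable, sorted order."""
    return {
        key: json.dumps(sorted(values), ensure_ascii=False)
        for key, values in dictionary.items()
    }


# Toy data for illustration only
toy = {"r": {"R, version 4", "R language"}, "numpy": set()}
serialized = serialize_synonyms(toy)
print(serialized["r"])              # ["R language", "R, version 4"]
print(json.loads(serialized["r"]))  # the original strings, commas intact
```

Reading the column back then takes one json.loads call per cell rather than a split on ", ".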

In [22]:
# Collect synonyms from both sources and add them as a new column
benchmark_dictonary = get_synonyms(benchmark_dictonary, 1, 1)
print(benchmark_dictonary)
benchmark_df["synonyms"] = benchmark_df["name"].str.lower().map(benchmark_dictonary)

# Save the enriched benchmark
benchmark_df.to_csv("benchmark_with_synonyms.csv", index=False)

{'sklearn': 'sklearn_extra, sklearn Python library, Sklearn API, sklearn Python package, sklear, sklearn.hmm, sklearn‐rvm, sklearn.tree, sklearn python package, sklearn”, Scikit-learn, Python sklearn, sklearn Python, sklearn, Python sklearn package, Python package sklearn, sklearn.utils, scikits.learn, sklearn0, sklearning, sklearn.svm, Python sklearn library, sklearn-fuse', 'sklearn python package': 'learn Python package37, hmmlearn Python package, Nilearn Python package, sklearn Python, sklearn Python package, sklearn package, sklearn python package', 'python package sklearn': 'Python package scipy, Python package “scikit-learn”, Python package seaborn, Python package scikit learn, Python package scikit-learn, Python package Holes, Python package sci-kit learn, Python sklearn package, Python packages, Python package Scikit-learn, Python package scikits.learn, Python package scikit‐learn, Python sklearn', 'python sklearn library': 'Python Scikit-learn library, Python ‘sklearn’ library