This notebook takes a software mention from a paper, the DOI of that paper, and the paragraph surrounding the mention, and retrieves the URLs that the mention refers to. There are two ways to provide the input:
1. Provide a CSV file with the columns name, doi and paragraph (and, optionally, candidate_urls). The notebook reads the file with `;` as the delimiter.
2. Enter name, doi and paragraph interactively, and optionally candidate_urls separated by commas.

**ENTER THE SOFTWARE MENTION EXACTLY AS IT APPEARS IN THE PAPER, TOGETHER WITH THE PARAGRAPH SURROUNDING THE MENTION.**
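
As a rough illustration of option 1, the sketch below builds a semicolon-delimited input file with the expected columns using only the standard library. The helper name `build_input_csv` and the example row (including the placeholder DOI) are made up for illustration; only the column names and the `;` delimiter come from this notebook.

```python
import csv
import io

# Column names expected by the notebook; candidate_urls is optional.
FIELDNAMES = ["name", "doi", "paragraph", "candidate_urls"]


def build_input_csv(rows):
    """Return semicolon-delimited CSV text for a list of mention dicts.

    Hypothetical helper: each dict needs name, doi and paragraph;
    candidate_urls defaults to an empty string when it is missing.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDNAMES, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Fill in the optional column, then let the row's own values win.
        writer.writerow({"candidate_urls": "", **row})
    return buffer.getvalue()


# Illustrative row only; the DOI is a placeholder, not a real paper.
example_rows = [
    {
        "name": "scikit-learn",
        "doi": "10.0000/example",
        "paragraph": "We trained the classifier with scikit-learn in Python.",
    }
]
csv_text = build_input_csv(example_rows)
```

Writing `csv_text` to a file and pointing `input_file` at it should give the notebook the layout it expects. Fields that themselves contain a semicolon are quoted by the `csv` module.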

**IMPORTANT**
For the notebook to work, the following must be present: a CZI folder containing the synonyms_matrix.csv file, the folder with the code, and the model_pipeline file containing the model (all of them are included in the zip that ships with the notebook).
Only URLs belonging to GitHub, PyPI and CRAN are considered valid.
The only things that need to be changed are in the first code cell, right below, where they are clearly marked:
- input_file: path to the input CSV file, if the input is a file
- model_input_path: path to the file that is the input to the model; if not provided, the default (./model_input.csv) is used
- output_path_aggregated_groups: path to the main output file, which contains all mentions with their metadata plus the columns url (URLs the software refers to) and not url (URLs the software does not refer to); if not provided, the default (./aggregated_groups.csv) is used

**OPTIONAL** - the notebook can write several intermediate files that make the process easier to follow. To produce them, change their paths or keep the current defaults; to skip any of them, set its path to None.
- output_file_corpus: path to the file that will contain the software mention(s) with all additional data added (synonyms, language, authors and candidate URLs)
- output_path_pairs: path to the file that will contain the software mention(s) paired with each candidate URL found
- output_path_updated_with_metadata: path to the file that will contain the software mention(s) with all additional data added, as well as the metadata fetched from each URL
- output_path_similarities: path to the file that will contain the software mention(s) with all additional data and metadata, as well as the calculated similarities
- output_path_predictions: path to the file that has the same columns as the similarities file, plus the prediction made by the model
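
Since only GitHub, PyPI and CRAN URLs are accepted, it can help to check user-supplied candidate_urls before running the pipeline. The following is a minimal sketch, not the notebook's own validation (which is not shown here). The helper names are hypothetical, and the host list assumes the usual domains github.com, pypi.org and cran.r-project.org.

```python
from urllib.parse import urlparse

# Assumed host names for the three supported sources.
SUPPORTED_HOSTS = ("github.com", "pypi.org", "cran.r-project.org")


def split_candidate_urls(raw):
    """Split a comma-separated string into a list of trimmed, non-empty URLs."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_supported_url(url):
    """Return True if the URL's host is one of the supported sources.

    URLs without a scheme (e.g. "github.com/user/repo") have no parsed
    host and are therefore rejected.
    """
    host = (urlparse(url.strip()).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in SUPPORTED_HOSTS)


def filter_candidate_urls(raw):
    """Keep only the supported URLs from a comma-separated string."""
    return [url for url in split_candidate_urls(raw) if is_supported_url(url)]
```

For example, `filter_candidate_urls("https://github.com/pandas-dev/pandas, https://example.com/tool")` would keep only the GitHub URL.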

WHAT THE NOTEBOOK DOES:
1. For each software mention/DOI/paragraph triple, it fetches:
    -   language (searches the paragraph for the programming language closest to the software mention)
    -   synonyms (searches CZI for synonyms of the software mention)
    -   authors (uses OpenAlex to get the names of the paper's authors)
    -   candidate URLs (searches GitHub, PyPI and CRAN for URLs the software may refer to)
2. Updates the metadata cache JSON file, which contains all metadata fetched from URLs so far
3. Pairs each software mention with every one of its candidate URLs
4. Adds the metadata fetched from the URLs
5. Calculates similarities for every software/URL pair
    -   software name, author and synonym similarities are calculated using Jaro-Winkler
    -   paragraph and repository description similarity is calculated using a BERT model
6. Selects the columns the model needs and feeds them to the model to obtain predictions
7. Once predictions are available, groups rows by software name, DOI and paragraph, and separates the candidate URLs into two columns, url and not url, based on the prediction

The model used for prediction is a Random Forest.
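
To make step 5 more concrete, here is a self-contained sketch of the textbook Jaro-Winkler similarity. It is not the code used by `compute_similarity_test`, whose implementation is not shown in this notebook. The prefix scale of 0.1 and the maximum prefix length of 4 are the conventional defaults and are assumptions here.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

With these defaults, `jaro_winkler("MARTHA", "MARHTA")` is roughly 0.961. Whether inputs are lowercased or otherwise normalised before comparison in the actual pipeline is not stated here.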

In [None]:
import sys
import pandas as pd

# Path to the input CSV file (optional); leave empty to enter the values interactively
input_file = ""
if input_file is None or input_file == "":
    name = input("Enter the software mention: ")
    if name == "":
        print("No software mention provided. Exiting.")
        sys.exit(1)
    paragraph = input("Enter the paragraph: ")
    if paragraph == "":
        print("No paragraph provided. Exiting.")
        sys.exit(1)
    doi = input("Enter the DOI: ")
    if doi == "":
        print("No DOI provided. Exiting.")
        sys.exit(1)
    candidate_urls = input("Enter the candidate URLs (comma-separated, optional): ")
    input_dataframe = pd.DataFrame({
        'name': [name],
        'paragraph': [paragraph],
        'doi': [doi],
        'candidate_urls': [candidate_urls]
    })
else:
    # The input CSV is read with ';' as the column delimiter
    input_dataframe = pd.read_csv(input_file, delimiter=';')
# Add the path to the output file for file with added languages, synonyms, authors and candidate URLs (optional)
output_file_corpus = "./temp/softwares_with_languages.csv"
# Add the path to the output file for file with pairs of software names with candidate URLs (optional)
output_path_pairs = "./temp/pairs.csv"
# Add the path to the output file for file with added metadata (optional)
output_path_updated_with_metadata = "./temp/updated_with_metadata_file.csv"
# Add the path to the output file for file with calculated similarities (optional)
output_path_similarities = "./temp/similarities.csv"
# Add the path to the output file for file with model input
model_input_path = "./model_input.csv"
if model_input_path is None or model_input_path == "":
    model_input_path = "./model_input.csv"
# Add the path to the output file with predictions (optional)
output_path_predictions = "./temp/predictions.csv"
# Add the path to the output file with aggregated groups
output_path_aggregated_groups = "./aggregated_groups.csv"
if output_path_aggregated_groups is None or output_path_aggregated_groups == "":
    output_path_aggregated_groups = "./aggregated_groups.csv"


candidates_cache_file = "./candidate_urls.json"
synonyms_file = "./synonym_dictionary.json"
metadata_cache_file = "./metadata_cache.json"
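
The default paths above point into ./temp, which may not exist on a fresh copy of the notebook, and pandas `to_csv` does not create missing directories. Unless the helper functions already handle this (their code is not shown here), a small guard like the following sketch can be run once after the configuration cell. The helper name `ensure_parent_dirs` is hypothetical.

```python
import os


def ensure_parent_dirs(paths):
    """Create the parent directory of every non-empty path.

    None and empty strings are skipped, matching the notebook's
    convention for disabled optional outputs. Returns the list of
    directories that had to be created.
    """
    created = []
    for path in paths:
        if not path:
            continue
        parent = os.path.dirname(os.path.abspath(path))
        if not os.path.isdir(parent):
            os.makedirs(parent, exist_ok=True)
            created.append(parent)
    return created
```

A typical call would pass the configured paths, for example `ensure_parent_dirs([output_file_corpus, output_path_pairs, output_path_similarities, output_path_predictions])`.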

In [None]:
import os
import numpy as np
import joblib


# Add ../code to sys.path (1 level up from demo, into code)
sys.path.append(os.path.abspath("../code"))

from preprocessing_corpus import find_nearest_language_for_softwares,get_authors,get_synonyms_from_file, make_pairs, dictionary_with_candidate_metadata, add_metadata,aggregate_group
from fetch_candidates import get_candidate_urls
from similarity_metrics import compute_similarity_test




In [14]:
%load_ext autoreload
%autoreload 2



In [None]:
CZI = pd.read_csv("./CZI/synonyms_matrix.csv")




In [None]:


# Get the synonyms from the file
get_synonyms_from_file(synonyms_file, input_dataframe,CZI_df=CZI)
# Find the nearest language for each software
input_dataframe['language'] = input_dataframe.apply(
    lambda row: find_nearest_language_for_softwares(row['paragraph'], row['name']), axis=1
)
results = input_dataframe['doi'].apply(get_authors)
input_dataframe['authors'] = results.apply(lambda x: ','.join(x.get('authors', [])) if isinstance(x, dict) else '')
# Get candidate URLs for each software
input_dataframe=get_candidate_urls(input_dataframe, candidates_cache_file)
# Normalize all missing values to NaN
input_dataframe.fillna(value=np.nan, inplace=True)
# Save the updated DataFrame to a new CSV file (optional)
if output_file_corpus is not None and output_file_corpus != "":
    input_dataframe.to_csv(output_file_corpus, index=False)



In [18]:
#input_dataframe = pd.read_csv(output_file_corpus)
metadata_cache = dictionary_with_candidate_metadata(input_dataframe, metadata_cache_file)
input_dataframe= make_pairs(input_dataframe,output_path_pairs)

add_metadata(input_dataframe,metadata_cache, output_path_updated_with_metadata)
input_dataframe= compute_similarity_test(input_dataframe,output_path_similarities)

model_input = input_dataframe[['name_metric', 'paragraph_metric','language_metric','synonym_metric','author_metric']].copy()
model_input.to_csv(model_input_path, index=False)

✅ All done — cache saved to ./metadata_cache.json
📄 Updated CSV file saved to ./temp/updated_with_metadata_file.csv
📄 Similarity metrics saved to ./temp/similarities.csv


In [19]:
#Loading model

model = joblib.load("../model_pipeline.joblib")
predictions = model.predict(model_input)
# Add predictions to the input DataFrame
input_dataframe['prediction'] = predictions
# Save the final DataFrame with predictions to a new CSV file (optional)
if output_path_predictions is not None and output_path_predictions != "":
    input_dataframe.to_csv(output_path_predictions, index=False)
grouped = input_dataframe.groupby(['name', 'paragraph', 'doi']).apply(aggregate_group).reset_index()
grouped.to_csv(output_path_aggregated_groups, index=False)
print("Processing complete. Output files generated.")

Processing complete. Output files generated.
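
The final grouping step can be pictured with the plain-Python sketch below. It is not the `aggregate_group` function imported above, whose implementation is not shown. The row key `candidate_url`, the convention that a prediction of 1 means "the mention refers to this URL", and the comma-joined output are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_predictions(rows):
    """Group rows by (name, paragraph, doi) and split URLs by prediction.

    rows: iterable of dicts with the keys name, paragraph, doi,
    candidate_url (hypothetical key) and prediction (assumed 1 or 0).
    """
    groups = defaultdict(lambda: {"url": [], "not url": []})
    for row in rows:
        key = (row["name"], row["paragraph"], row["doi"])
        column = "url" if row["prediction"] == 1 else "not url"
        groups[key][column].append(row["candidate_url"])
    return [
        {
            "name": name,
            "paragraph": paragraph,
            "doi": doi,
            "url": ",".join(split["url"]),
            "not url": ",".join(split["not url"]),
        }
        for (name, paragraph, doi), split in groups.items()
    ]
```

Each output record then corresponds to one row of the aggregated file: one software mention, with its accepted and rejected candidate URLs side by side.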


