# 🔍 Software Mention Disambiguation Notebook

This notebook identifies which software repository (GitHub, PyPI, or CRAN) a software mention from a scientific paper refers to.

---

## 🛠️ Input Options

You can provide the software mention in two ways:

1. **CSV input**: A semicolon-delimited file (the notebook reads it with `delimiter=';'`) that must include the columns:
   - `name` (software mention)
   - `doi` (paper DOI)
   - `paragraph` (context around the mention)
   - `candidate_urls` (optional, comma-separated list of URLs)

2. **Manual input**: If no CSV is provided, you'll be prompted to enter:
   - Software name (as mentioned in the paper)
   - Paragraph
   - DOI
   - Candidate URLs (optional, comma-separated)

⚠️ **Make sure to copy the software mention *exactly as in the paper* and include the surrounding paragraph.**
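
As a minimal sketch of the expected input, the snippet below writes a one-row file with the columns listed above. The `;` delimiter mirrors the `pd.read_csv(input_file, delimiter=';')` call in the configuration cell. The helper names (`write_input_csv`, `split_candidate_urls`) and all row values are hypothetical placeholders, not part of the notebook.

```python
import csv
import io

# Column names come from the list above; ';' matches the delimiter the
# notebook passes to pd.read_csv.
FIELDS = ["name", "doi", "paragraph", "candidate_urls"]


def write_input_csv(rows, handle):
    """Write mention rows as a ';'-delimited CSV with the expected header."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter=";")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def split_candidate_urls(cell):
    """Turn a comma-separated cell into a clean list of URLs (hypothetical helper)."""
    return [url.strip() for url in (cell or "").split(",") if url.strip()]


# Placeholder values for illustration only.
example_rows = [{
    "name": "ExampleTool",
    "doi": "10.0000/example",
    "paragraph": "We analysed the data with ExampleTool, version 1.0.",
    "candidate_urls": "https://github.com/example/exampletool, https://pypi.org/project/exampletool/",
}]

buffer = io.StringIO()
write_input_csv(example_rows, buffer)
print(buffer.getvalue())
```

Because the file is semicolon-delimited, the commas inside `paragraph` and `candidate_urls` need no special quoting.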

---

## 📁 Folder Structure (Expected)

These must be present for the notebook to work:
```
├── demo.ipynb
├── model.pkl                     ← Trained model
├── preprocessing.py             ← Utility functions
├── models.py                    ← ML model utilities
├── CZI/synonyms_matrix.csv      ← Synonym mapping
├── json/
│   ├── candidate_urls.json
│   ├── synonym_dictionary.json
│   └── metadata_cache.json      ← JSON caches
```
If the `json/` folder is missing, it is recreated during execution. To keep the caches between runs, make sure the paths to those files are valid.

Optional output files will be saved in:
```
├── temp/
│   ├── corpus_with_candidates.csv
│   ├── pairs.csv
│   ├── updated_with_metadata.csv
│   ├── similarities.csv
│   └── predictions.csv
```
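
Before running the notebook, the layout above can be checked with a short script. This is only a sketch: `missing_files` and `ensure_dirs` are hypothetical helpers, and creating `temp/` up front is a precaution in case the notebook does not create it itself.

```python
from pathlib import Path

# Paths taken from the folder listing above.
REQUIRED_FILES = [
    "model.pkl",
    "preprocessing.py",
    "models.py",
    "CZI/synonyms_matrix.csv",
]


def missing_files(root, required=REQUIRED_FILES):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


def ensure_dirs(root, names=("json", "temp")):
    """Create the cache and output folders if they are absent."""
    for name in names:
        (Path(root) / name).mkdir(parents=True, exist_ok=True)
```

Calling `missing_files(".")` from the notebook folder should return an empty list when everything is in place.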

---

## ✍️ Configuration (Edit Below)

In the first code cell:
- `input_file`: Path to your CSV input
- `model_path`: Path to model (`./model.pkl` by default)
- `model_input_path`: File for model input
- `output_path_aggregated_groups`: Final file with URLs predicted as relevant (`url`) and irrelevant (`not url`)
- `somef_path`: Path to cloned SOMEF repository

If you do **not** want to save intermediate files, set those output paths to `None`.
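
The optional-path convention can be wrapped in a small helper. The sketch below is an assumption about how one might do it, not how `preprocessing.py` handles it: `save_optional` is a hypothetical name, and it skips writing when the path is `None` or empty.

```python
from pathlib import Path


def save_optional(write, path):
    """Call `write(path)` unless `path` is None or empty; return True if written."""
    if not path:
        return False
    # Create the parent folder (for example ./temp) so the write cannot fail on it.
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    write(path)
    return True
```

In the notebook this could be used as `save_optional(lambda p: input_dataframe.to_csv(p, index=False), output_file_corpus)`.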

---

## 🌐 GitHub Token
GitHub API access requires a token. Instructions:
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token

To use GitHub search functionality, you **must** set an environment variable:
```bash
# macOS/Linux
export GITHUB_TOKEN=your_token_here
# Windows (cmd); run the command without a trailing comment, which cmd would store in the value
set GITHUB_TOKEN=your_token_here
```
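
To confirm from Python that the variable is visible, a check along these lines may help. How `preprocessing.py` actually reads the token is not shown here, so `github_headers` is a hypothetical helper. The `Authorization: Bearer` and `Accept` headers follow the GitHub REST API conventions.

```python
import os


def github_headers(environ=os.environ):
    """Build request headers from GITHUB_TOKEN, failing early if it is unset."""
    token = environ.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; set it before starting Jupyter.")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

The variable must be set in the shell that launches Jupyter, otherwise the kernel will not inherit it.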

---

## 🧩 SOMEF

This notebook requires [SOMEF](https://github.com/KnowledgeCaptureAndDiscovery/somef) to fetch repository metadata.

Clone the repository, follow the installation instructions in its README, and set `somef_path` in the notebook to the cloned folder.
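
For orientation, the sketch below rebuilds the SOMEF command that appears in the notebook's logged output (`poetry run somef describe ...`). The flags and the `0.93` threshold are copied verbatim from that log. The function names are hypothetical, and running the command from `somef_path` is an assumption about how the notebook invokes it.

```python
import subprocess


def somef_command(repo_url, output_json, temp_dir, threshold="0.93"):
    """Return the argument list seen in the notebook log for one repository."""
    return [
        "poetry", "run", "somef", "describe",
        "-r", repo_url,
        "-o", output_json,
        "-t", threshold,
        "-m",
        "-kt", temp_dir,
    ]


def run_somef(repo_url, output_json, somef_path, temp_dir):
    """Run SOMEF inside the cloned repository; return True on a zero exit status.

    Assumes Poetry and SOMEF are installed as described in the SOMEF README.
    """
    cmd = somef_command(repo_url, output_json, temp_dir)
    result = subprocess.run(cmd, cwd=somef_path, capture_output=True, text=True)
    return result.returncode == 0
```

Running the same command by hand in the SOMEF folder is a quick way to see the error behind a "returned non-zero exit status 1" message.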

---

## 🧠 What Happens Inside

1. Extracts extra info for each mention:
   - Language (from paragraph)
   - Synonyms (from CZI)
   - Authors (from OpenAlex)
   - Candidate URLs (from GitHub, PyPI, CRAN)
2. Adds metadata for each candidate URL
3. Computes similarities:
   - Jaro-Winkler (name, authors, synonyms)
   - BERT (paragraph vs. repo description)
4. Predicts with a Random Forest model
5. Aggregates output with predicted `url` and `not url`
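
To make step 3 concrete, here is a textbook Jaro-Winkler implementation in plain Python. It is a sketch for illustration only: the library and any normalization (such as lowercasing) that `preprocessing.py` uses are not shown in this notebook, and the prefix scale of `0.1` with a four-character prefix cap is the standard default, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # classic example: 0.9611
```

The prefix boost rewards names that start the same way, which suits software names that differ only in a suffix or version tag.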

---

## ✅ Output

- Final results are saved to: **`aggregated_groups.csv`**
- Contains original fields + classified URLs
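
The aggregation step can be pictured as follows. This is a sketch under stated assumptions, not the notebook's `aggregate_group`: it assumes a prediction of `1` means "relevant", uses a hypothetical `candidate_url` key for the URL of each pair, and joins URLs with commas. Grouping by `name`, `paragraph`, and `doi` follows the final code cell.

```python
from collections import defaultdict


def aggregate_predictions(rows, url_key="candidate_url"):
    """Group per-pair predictions into one record per mention.

    `rows` is a list of dicts with name, paragraph, doi, prediction and the
    URL stored under `url_key` (a hypothetical column name).
    """
    groups = defaultdict(lambda: {"url": [], "not url": []})
    for row in rows:
        key = (row["name"], row["paragraph"], row["doi"])
        # Assumption: 1 marks a URL predicted as relevant.
        label = "url" if row["prediction"] == 1 else "not url"
        groups[key][label].append(row[url_key])
    return [
        {
            "name": name,
            "paragraph": paragraph,
            "doi": doi,
            "url": ",".join(found["url"]),
            "not url": ",".join(found["not url"]),
        }
        for (name, paragraph, doi), found in groups.items()
    ]
```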


In [4]:
import sys
import pandas as pd

# Path to the input CSV (optional; set to None or "" to enter the mention manually)
input_file = "./input.csv"
if input_file is None or input_file == "":
    name = input("Enter the software mention: ")
    if name == "":
        print("No software mention provided. Exiting.")
        sys.exit(1)
    paragraph = input("Enter the paragraph: ")
    if paragraph == "":
        print("No paragraph provided. Exiting.")
        sys.exit(1)
    doi = input("Enter the DOI: ")
    if doi == "":
        print("No DOI provided. Exiting.")
        sys.exit(1)
    candidate_urls = input("Enter the candidate URLs (comma-separated, optional): ")
    input_dataframe = pd.DataFrame({
        'name': [name],
        'paragraph': [paragraph],
        'doi': [doi],
        'candidate_urls': [candidate_urls]
    })
else:
    input_dataframe = pd.read_csv(input_file, delimiter=';')
# Output path for the corpus enriched with languages, synonyms, authors and candidate URLs (optional)
output_file_corpus = './temp/corpus_with_candidates.csv'
# Output path for the pairs of software names and candidate URLs (optional)
output_path_pairs = "./temp/pairs.csv"
# Output path for the pairs with added metadata (optional)
output_path_updated_with_metadata = "./temp/updated_with_metadata.csv"
# Output path for the calculated similarities (optional)
output_path_similarities = "./temp/similarities.csv"
# Path to the trained model (falls back to the default if left empty)
model_path = "./model.pkl"
if model_path is None or model_path == "":
    model_path = "./model.pkl"
# Output path for the model input (falls back to the default if left empty)
model_input_path = "./model_input.csv"
if model_input_path is None or model_input_path == "":
    model_input_path = "./model_input.csv"
# Output path for the predictions (optional)
output_path_predictions = "./temp/predictions.csv"
# Output path for the aggregated groups (falls back to the default if left empty)
output_path_aggregated_groups = "./aggregated_groups.csv"
if output_path_aggregated_groups is None or output_path_aggregated_groups == "":
    output_path_aggregated_groups = "./aggregated_groups.csv"

# Add the path to the somef repository
somef_path = "D:/MASTER/TMF/somef"


candidates_cache_file = "./json/candidate_urls.json"
synonyms_file = "./json/synonym_dictionary.json"
metadata_cache_file = "./json/metadata_cache.json"

In [8]:
import os
import numpy as np
import cloudpickle


from preprocessing import (
    find_nearest_language_for_softwares, get_authors, get_synonyms_from_file,
    make_pairs, dictionary_with_candidate_metadata, add_metadata,
    aggregate_group, get_candidate_urls, compute_similarity_test,
)
from models import make_model, get_preprocessing_pipeline




In [9]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
CZI = pd.read_csv("./CZI/synonyms_matrix.csv")

In [5]:


# Get the synonyms from the file
get_synonyms_from_file(synonyms_file, input_dataframe, CZI_df=CZI)
# Find the nearest language for each software
input_dataframe['language'] = input_dataframe.apply(
    lambda row: find_nearest_language_for_softwares(row['paragraph'], row['name']), axis=1
)
# Get the paper authors for each DOI
results = input_dataframe['doi'].apply(get_authors)
input_dataframe['authors'] = results.apply(lambda x: ','.join(x.get('authors', [])) if isinstance(x, dict) else '')
# Get candidate URLs for each software
input_dataframe = get_candidate_urls(input_dataframe, candidates_cache_file)
# Fill all missing values with NaN
input_dataframe.fillna(value=np.nan, inplace=True)
# Save the updated DataFrame to a new CSV file (optional)
if output_file_corpus is not None and output_file_corpus != "":
    input_dataframe.to_csv(output_file_corpus, index=False)

[Attempt 1] Rate limited. Sleeping 35s until reset…
[Attempt 2] Rate limited. Sleeping 1s until reset…

In [None]:
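# Reload the enriched corpus saved by the previous cell
input_dataframe = pd.read_csv(output_file_corpus)
# Build the metadata dictionary for the candidate URLs (uses the JSON cache and SOMEF)
metadata_cache = dictionary_with_candidate_metadata(input_dataframe, metadata_cache_file, somef_path)
# Make pairs of software names and candidate URLs
input_dataframe = make_pairs(input_dataframe, output_path_pairs)
# Add the metadata to the pairs
add_metadata(input_dataframe, metadata_cache, output_path_updated_with_metadata)
# Compute the similarity metrics
input_dataframe = compute_similarity_test(input_dataframe, output_path_similarities)

# Select the similarity metrics used as model input
model_input = input_dataframe[['name_metric', 'paragraph_metric', 'language_metric', 'synonym_metric', 'author_metric']].copy()
model_input.to_csv(model_input_path, index=False)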

🔍 Processing: https://github.com/swagger-api/swagger-codegen
Failed to extract metadata for https://github.com/swagger-api/swagger-codegen: Command '['poetry', 'run', 'somef', 'describe', '-r', 'https://github.com/swagger-api/swagger-codegen', '-o', 'C:\\Users\\Jelena\\AppData\\Local\\Temp\\tmpmkc00mrh.json', '-t', '0.93', '-m', '-kt', '\\\\?\\D:\\MASTER\\TMF\\somef\\temp']' returned non-zero exit status 1.
🔍 Processing: https://github.com/LatticeX-Foundation/Rosetta
Failed to extract metadata for https://github.com/LatticeX-Foundation/Rosetta: Command '['poetry', 'run', 'somef', 'describe', '-r', 'https://github.com/LatticeX-Foundation/Rosetta', '-o', 'C:\\Users\\Jelena\\AppData\\Local\\Temp\\tmp90hzh3lw.json', '-t', '0.93', '-m', '-kt', '\\\\?\\D:\\MASTER\\TMF\\somef\\temp']' returned non-zero exit status 1.
🔍 Processing: https://pypi.org/project/sph/
🔍 Processing: https://github.com/danjulio/MPPT-Solar-Charger
🔍 Processing: https://cran.r-project.org/package=STAND
🔍 Processing: http

In [None]:
# Load the trained model
with open(model_path, "rb") as f:
    model = cloudpickle.load(f)
predictions = model.predict(model_input)
# Add predictions to the input DataFrame
input_dataframe['prediction'] = predictions
# Save the DataFrame with predictions to a new CSV file (optional)
if output_path_predictions is not None and output_path_predictions != "":
    input_dataframe.to_csv(output_path_predictions, index=False)
grouped = input_dataframe.groupby(['name', 'paragraph', 'doi']).apply(aggregate_group).reset_index()
grouped.to_csv(output_path_aggregated_groups, index=False)
print("Processing complete. Output files generated.")