# **arxiv-recommendation**

Recommending similar arXiv papers based on the embeddings of their abstracts.

## **Setup**

This notebook is designed to work in both Google Colab and local environments.

**For Google Colab:**
- **Mount Google Drive:** Enables saving files and accessing them across Colab sessions.
    > ⚠ **Warning** <br>
    > This mounts your entire Google Drive, which in principle gives the notebook access to all of your files. The code only touches the project folder, but consider using a dedicated Google account.
- **Clone the repository:** Ensures the latest code and utility modules are available.
- **Add repo to Python path:** Lets us import custom project modules as regular Python packages.

**For local environments:**
- **Add project root to Python path:** Lets us import custom project modules from the parent directory.

**Optionally:**
- **Enable autoreload:** Lets us modify utility modules without having to reload them manually (useful for development).

In [1]:
import os
import sys

def setup_environment(repo_url, dev=False, drive_mount_path="/content/drive"):
    """Sets up the development environment for both Google Colab and local environments."""

    if "google.colab" not in sys.modules:
        # Define local project root
        project_root = os.path.dirname(os.getcwd())

        print("Not running in Google Colab.\nSkipping Colab setup.")

    else:
        # Mount Google Drive
        from google.colab import drive
        drive.mount(drive_mount_path, force_remount=True)

        # Define where within Drive to clone the git repository
        project_parent_dir = os.path.join(drive_mount_path, "MyDrive")
        project_name = repo_url.split("/")[-1].replace('.git', "")
        project_root = os.path.join(project_parent_dir, project_name)

        # Clone the repository if it doesn't exist
        if not os.path.exists(project_root):
            print(f"\nCloning repository into {project_root}")
            original_dir = os.getcwd()  # Remember where we started
            try:
                os.chdir(project_parent_dir)  # Change to the parent directory to clone the repo
                !git clone {repo_url}
            finally:
                os.chdir(original_dir)  # Always change back to the original directory, even if the clone fails
        else:
            print(f"\nRepository already exists at {project_root}")

        print("\nColab setup complete.")

    # Add project to Python path
    if project_root not in sys.path:
        sys.path.insert(0, project_root)
        print(f"\n'{project_root}' added to Python path.")
    else:
        print(f"\n'{project_root}' already in Python path.")

    # Enable autoreload (for development)
    if dev:
        from IPython import get_ipython
        ipython = get_ipython()

        # Load the extension if it is not already loaded
        if "autoreload" not in ipython.extension_manager.loaded:
            ipython.run_line_magic("load_ext", "autoreload")

        # Mode 2 reloads all modules before executing each cell
        ipython.run_line_magic("autoreload", "2")
        print("\nAutoreload extension enabled (mode 2).")

In [2]:
setup_environment("https://github.com/nadrajak/arxiv-semantic-search.git", dev=False)

Mounted at /content/drive

Repository already exists at /content/drive/MyDrive/arxiv-semantic-search

Colab setup complete.

'/content/drive/MyDrive/arxiv-semantic-search' added to Python path.


## **Imports**

In [None]:
!pip install arxiv

In [4]:
import numpy as np
import pandas as pd

import arxiv

import torch

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search


# Custom modules
from utils import config
from utils import data_loader
from utils import preprocessing

In [5]:
# Initialize randomness
np.random.seed(config.RANDOM_SEED);
torch.manual_seed(config.RANDOM_SEED);

## **Load data**


We use the [arXiv dataset from Kaggle](https://www.kaggle.com/Cornell-University/arxiv), which contains metadata and abstracts for scholarly papers across STEM fields.

Below, we load a sample of the dataset and briefly inspect its structure.

In [None]:
# Download dataset from Kaggle
arxiv_dataset_path = data_loader.load_arxiv_dataset()

In [7]:
# Load json file as a pandas DataFrame
recommend_nrows = 1_000
data = pd.read_json(arxiv_dataset_path, lines=True, nrows=(config.FT_NROWS + recommend_nrows))

# Skip the first `config.FT_NROWS` rows because they were used for training
data = data.iloc[config.FT_NROWS:]

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 15000 to 15999
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              1000 non-null   float64
 1   submitter       1000 non-null   object 
 2   authors         1000 non-null   object 
 3   title           1000 non-null   object 
 4   comments        875 non-null    object 
 5   journal-ref     512 non-null    object 
 6   doi             646 non-null    object 
 7   report-no       80 non-null     object 
 8   categories      1000 non-null   object 
 9   license         77 non-null     object 
 10  abstract        1000 non-null   object 
 11  versions        1000 non-null   object 
 12  update_date     1000 non-null   object 
 13  authors_parsed  1000 non-null   object 
dtypes: float64(1), object(13)
memory usage: 109.5+ KB
None
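
The loading pattern above (read `FT_NROWS + recommend_nrows` lines, then drop the first `FT_NROWS`) can be checked on a tiny in-memory example. The sketch below is only an illustration: the records, `toy_jsonl`, and `ft_nrows` are made-up stand-ins, not part of the project. It also passes `dtype={"id": str}`, an optional variation that the notebook does not use. In the output above, `id` was inferred as `float64`, which drops leading zeros such as the one in `0704.0001`.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature JSON Lines input: one record per line, like the arXiv dump
toy_jsonl = "\n".join([
    '{"id": "0704.0001", "title": "A"}',
    '{"id": "0704.0002", "title": "B"}',
    '{"id": "0704.0003", "title": "C"}',
    '{"id": "0704.0004", "title": "D"}',
    '{"id": "0704.0005", "title": "E"}',
])

ft_nrows = 2         # stand-in for config.FT_NROWS (rows reserved for training)
recommend_nrows = 2  # rows we actually want to keep

# Read only as many lines as needed; keep IDs as strings (assumed variation)
toy = pd.read_json(
    StringIO(toy_jsonl),
    lines=True,
    nrows=ft_nrows + recommend_nrows,
    dtype={"id": str},
)

# Drop the rows reserved for training; the original row labels are preserved
toy = toy.iloc[ft_nrows:]
print(toy)
```

Because `iloc` slicing keeps the original index labels, positional access with `iloc` (as used later in the notebook) is the safe way to look rows up by their position in the corpus.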


## **Preprocessing**

In [8]:
# Select essential columns
data = data[["title", "abstract", "categories", "authors"]].copy()

# Apply light preprocessing to text columns
data = preprocessing.normalize_whitespace(data)
data = preprocessing.normalize_abstracts(data)

# Simplify categories
data = preprocessing.truncate_categories(data)

## **Load model**

In [9]:
model = SentenceTransformer("nadrajak/allenai-specter-ft2")

In [10]:
corpus_embeddings = model.encode(data["abstract"].to_list())

## **Recommendation**

In [11]:
def id_to_url(paper_id):
    """Converts an arXiv paper ID (e.g. '704.0001') to its URL, restoring leading zeros."""
    id_a, id_b = paper_id.split(".")
    url = f"https://arxiv.org/abs/{id_a.zfill(4)}.{id_b}"

    return url

In [12]:
def get_paper_info(url):
    """Fetches paper information from arXiv given a paper URL."""

    paper_id = url.split("/")[-1]

    # Create session & look up paper using an API wrapper
    client = arxiv.Client()
    search = arxiv.Search(id_list=[paper_id])

    # Looking up a single ID yields at most one result
    result = next(client.results(search), None)
    if result is None:
        raise ValueError(f"No arXiv paper found for {url}")

    title = result.title
    abstract = result.summary
    categories = " ".join(result.categories)
    authors_parsed = [[a.name] for a in result.authors]

    # Select essential columns
    result_df = pd.DataFrame({
        "title": title,
        "abstract": abstract,
        "categories": categories,
        "authors_parsed": [authors_parsed],
    })

    # Apply the same preprocessing as we did to the corpus
    result_df = preprocessing.normalize_whitespace(result_df)
    result_df = preprocessing.normalize_abstracts(result_df)
    result_df = preprocessing.truncate_categories(result_df)

    return result_df


In [13]:
url_1 = "https://arxiv.org/abs/1605.08386"
url_2 = id_to_url("704.0001")

In [14]:
queries = []

queries.append(get_paper_info(url_1))
queries.append(get_paper_info(url_2))

queries = pd.concat(queries)

In [15]:
query_embeddings = model.encode(queries["abstract"].to_list())

hits = semantic_search(query_embeddings, corpus_embeddings, top_k=5)
hits

[[{'corpus_id': 171, 'score': 0.9468393921852112},
  {'corpus_id': 766, 'score': 0.9418857097625732},
  {'corpus_id': 589, 'score': 0.9412132501602173},
  {'corpus_id': 569, 'score': 0.9411637187004089},
  {'corpus_id': 276, 'score': 0.9386530518531799}],
 [{'corpus_id': 651, 'score': 0.9743863940238953},
  {'corpus_id': 225, 'score': 0.9730255603790283},
  {'corpus_id': 807, 'score': 0.9680969715118408},
  {'corpus_id': 674, 'score': 0.9676692485809326},
  {'corpus_id': 673, 'score': 0.965707004070282}]]
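
For intuition, `semantic_search` scores each query against every corpus embedding (cosine similarity by default) and keeps the `top_k` best matches. The sketch below reproduces that idea in plain NumPy on toy 2-D vectors. The helper `top_k_cosine` and the toy arrays are hypothetical and exist only for illustration; this is not the library's implementation.

```python
import numpy as np


def top_k_cosine(query_emb, corpus_emb, top_k=5):
    """Return, per query, a list of (corpus_index, score) pairs sorted by
    descending cosine similarity."""
    q = np.asarray(query_emb, dtype=float)
    c = np.asarray(corpus_emb, dtype=float)

    # Normalize rows to unit length so the dot product equals cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)

    scores = q @ c.T  # shape: (n_queries, n_corpus)
    k = min(top_k, c.shape[0])

    # Sort each row by descending score and keep the first k indices
    top_idx = np.argsort(-scores, axis=1)[:, :k]
    return [
        [(int(j), float(scores[i, j])) for j in row]
        for i, row in enumerate(top_idx)
    ]


# Toy data: three "corpus" vectors and one "query" vector
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([[2.0, 0.1]])

toy_hits = top_k_cosine(toy_query, toy_corpus, top_k=2)
print(toy_hits)
```

The query points almost along the first axis, so corpus vector 0 ranks first and the diagonal vector 2 ranks second. Vector length does not matter, only direction.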

In [16]:
for i, query_hits in enumerate(hits):
    print(f"{queries.iloc[i]['category']}, {queries.iloc[i]['title']}")

    for hit in query_hits:
        result = data.iloc[hit['corpus_id']]
        print(f"  {hit['score']:.4f}, {result['category']}, {result['title']}")

    # Separate the result blocks with a blank line
    if i != len(hits) - 1:
        print()

math, Heat-bath random walks with Markov bases
  0.9468, math, Exit problems associated with affine reflection groups
  0.9419, math, On the linear fractional self-attracting diffusion
  0.9412, math, The equilibrium states for semigroups of rational maps
  0.9412, math, The logarithmic Sobolev inequality along the Ricci flow
  0.9387, math, The 2-adic valuation of a sequence arising from a rational integral

hep, Calculation of prompt diphoton production cross sections at Tevatron and LHC energies
  0.9744, hep, Tomography for amplitudes of hard exclusive processes
  0.9730, hep, High p_T Top Quarks at the Large Hadron Collider
  0.9681, hep, Measurement of Moments of the Hadronic-Mass and -Energy Spectrum in Inclusive Semileptonic $\bar{B} \to X_{c} \ell^{-} \bar{\nu}$ Decays
  0.9677, hep, Mixings of 4-quark components in light non-singlet scalar mesons in QCD sum rules
  0.9657, hep, KLOE measurement of the charged kaon absolute semileptonic BR's
