# **arxiv-recommendation**

Recommending similar arXiv papers based on the embeddings of their abstracts.

## **Setup**

This notebook is designed to work in both Google Colab and local environments.

**For Google Colab:**
- **Mount Google Drive:** Lets the notebook save files and reuse them across Colab sessions.
    > ⚠ **Warning** <br>
    > This mounts your entire Google Drive, so the notebook could in principle access all of your files. The code only touches the project folder, but consider using a dedicated Google account.
- **Clone the repository:** Clones the project into Drive if it is not already there, making the code and utility modules available.
- **Add repo to Python path:** Lets us import custom project modules as regular Python packages.

**For local environments:**
- **Add project root to Python path:** Lets us import custom project modules from the parent directory of the notebook's working directory.

**Optional:**
- **Enable autoreload:** Lets us modify utility modules without having to reload them manually (useful for development).

In [None]:
import os
import sys

def setup_environment(repo_url, dev=False, drive_mount_path="/content/drive"):
    """Sets up the development environment for both Google Colab and local environments."""

    if "google.colab" not in sys.modules:
        # Define local project root
        project_root = os.path.dirname(os.getcwd())

        print("Not running in Google Colab.\nSkipping Colab setup.")

    else:
        # Mount Google Drive
        from google.colab import drive
        drive.mount(drive_mount_path, force_remount=True)

        # Define where within Drive to clone the git repository
        project_parent_dir = os.path.join(drive_mount_path, "MyDrive")
        project_name = repo_url.split("/")[-1].replace('.git', "")
        project_root = os.path.join(project_parent_dir, project_name)

        # Clone the repository if it doesn't exist
        if not os.path.exists(project_root):
            print(f"\nCloning repository into {project_root}")
            original_dir = os.getcwd()  # Remember where we started
            try:
                os.chdir(project_parent_dir)  # Change to the parent directory to clone the repo
                !git clone {repo_url}
            finally:
                os.chdir(original_dir)  # Always change back to the original directory, even if clone fails
        else:
            print(f"\nRepository already exists at {project_root}")

        print("\nColab setup complete.")

    # Add project to Python path
    if project_root not in sys.path:
        sys.path.insert(0, project_root)
        print(f"\n'{project_root}' added to Python path.")
    else:
        print(f"\n'{project_root}' in Python path.")

    # Enable autoreload (for development)
    if dev:
        from IPython import get_ipython
        ipython = get_ipython()

        # Load extension quietly if not already loaded
        if "autoreload" not in ipython.extension_manager.loaded:
            ipython.run_line_magic("load_ext", "autoreload")

        # Mode 2 reloads all modules before executing each cell
        ipython.run_line_magic("autoreload", "2")
        print("\nAutoreload extension enabled (mode 2).")

In [None]:
setup_environment("https://github.com/nadrajak/arxiv-semantic-search.git", dev=True)

## **Imports**

In [None]:
import numpy as np
import pandas as pd

import torch

from sentence_transformers import SentenceTransformer


# Custom modules
from utils import config
from utils import data_loader
from utils import preprocessing

In [None]:
# Seed the random number generators for reproducibility
np.random.seed(config.RANDOM_SEED);
torch.manual_seed(config.RANDOM_SEED);

## **Load data**


We use the [arXiv dataset from Kaggle](https://www.kaggle.com/Cornell-University/arxiv), which contains metadata and abstracts for scholarly papers across STEM fields.

Below, we load a sample of the dataset and briefly inspect its structure.
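
The loading cell reads the file with `lines=True`, meaning one JSON record per line, so a fixed number of rows can be read without parsing the whole file. As a minimal, dependency-free sketch of the row window used there, the snippet below skips a block of leading rows and keeps the next few. The in-memory file, its field values, and the names `read_window`, `skip`, and `take` are made up for illustration; the notebook itself does this with `pd.read_json` and `config.FT_NROWS`.

```python
import io
import json
from itertools import islice


def read_window(lines, skip, take):
    """Parse `take` JSON records after skipping the first `skip` lines."""
    return [json.loads(line) for line in islice(lines, skip, skip + take)]


# Toy stand-in for the dataset file: five records, one JSON object per line.
toy_file = io.StringIO(
    "\n".join(json.dumps({"id": i, "title": f"paper {i}"}) for i in range(5))
)

# Skip the first 3 records (playing the role of rows already used for training),
# then keep the next 2.
window = read_window(toy_file, skip=3, take=2)
print([record["id"] for record in window])  # prints [3, 4]
```

Reading `skip + take` rows and then discarding the first `skip` keeps the held-out rows disjoint from the ones used earlier.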

In [None]:
# Download dataset from Kaggle
arxiv_dataset_path = data_loader.load_arxiv_dataset()

In [None]:
# Load json file as a pandas DataFrame
recommend_nrows = 1_000
data = pd.read_json(arxiv_dataset_path, lines=True, nrows=(config.FT_NROWS + recommend_nrows))

# Skip the first `config.FT_NROWS` rows because they were used for training
data = data.iloc[config.FT_NROWS:]

# `info()` prints its summary directly and returns None, so no `print` is needed
data.info()

## **Preprocessing**
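
The helpers in `utils.preprocessing` are not shown in this notebook. Purely as an illustration of the kind of cleanup the cell below names, and not the project's actual implementation, here is a sketch under two assumptions: that whitespace normalization collapses runs of spaces and line breaks into single spaces, and that category truncation keeps only the top-level archive of the first listed category (for example, `math.CO cs.CG` becomes `math`). The function names `collapse_whitespace` and `top_level_category` are hypothetical.

```python
def collapse_whitespace(text):
    """Replace any run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


def top_level_category(categories):
    """Keep the archive part of the first category in a space-separated string.

    Assumption: the first listed category is treated as the primary one.
    """
    parts = categories.split()
    if not parts:
        return ""
    return parts[0].split(".")[0]


print(collapse_whitespace("  We study\n  sparse   graphs. "))  # prints We study sparse graphs.
print(top_level_category("math.CO cs.CG"))  # prints math
print(top_level_category("hep-ph"))  # prints hep-ph
```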

In [None]:
# Select essential columns
data = data[["title", "abstract", "categories", "authors"]].copy()

# Apply light preprocessing to text columns
data = preprocessing.normalize_whitespace(data)
data = preprocessing.normalize_abstracts(data)

# Simplify categories
data = preprocessing.truncate_categories(data)

In [None]:
data.head()

## **Recommendation**
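
The recommendation cell below is still a TODO. As a rough sketch of the idea stated at the top of the notebook (recommend papers whose abstract embeddings are closest to those of a query paper), the snippet below ranks items by cosine similarity. It uses tiny hand-written vectors and plain Python so that it runs on its own. The function names `cosine_similarity` and `recommend`, the toy vectors, and the choice of cosine similarity are assumptions, not the project's implementation; in the notebook the vectors would presumably come from encoding abstracts with a `SentenceTransformer` model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_index, embeddings, top_k=2):
    """Return indices of the `top_k` items most similar to the query, excluding the query itself."""
    query = embeddings[query_index]
    scores = [
        (cosine_similarity(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_index
    ]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:top_k]]


# Toy 3-dimensional "embeddings" standing in for real abstract embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],  # paper 0 (the query)
    [0.9, 0.1, 0.0],  # paper 1: nearly parallel to paper 0
    [0.0, 1.0, 0.0],  # paper 2: orthogonal to paper 0
    [0.7, 0.7, 0.0],  # paper 3: in between
]

print(recommend(0, toy_embeddings))  # prints [1, 3]
```

For more than a few thousand papers, the same ranking is usually computed as one matrix product over normalized embeddings rather than a Python loop.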

In [None]:
# TODO: