# arXiv Paper Embedding

## On a Single GPU
This notebook utilizes an NVIDIA T4 on Saturn Cloud.

In [1]:
import cudf
import pandas as pd
import json
import os
import re
import string
import pickle
from pathlib import Path
from typing import List

DATA_PATH = "arxiv-metadata-oai-snapshot.json"
YEAR_PATTERN = r"(19|20[0-9]{2})"

In [2]:
# Morten additions / overwrites

# Used grep to sub-select those with id starting with 20 for 2020
DATA_PATH = "../../../arxiv/2020_arxiv-metadata-oai-snapshot.json"
TEX_EQ_PATH = "./arxiv_src/"
ID_TEX = {}
for path in Path(TEX_EQ_PATH).glob("*.json"):
    with open(path, 'r') as fp:
        ID_TEX.update(json.load(fp))
ID_TEX = {k: v for k, v in ID_TEX.items() if v}

## Step 1: Data pre processing
Before we do anything else, we need to load the papers dataset, do some basic cleaning, and get it into a workable format. Below,
we will use CuDF to house the data and apply seom transformations in a generator, loading from file.

Luckily, we are going to hack away the semantic search on abstracts and replace it with search on LaTeX math so we don't want to remove punctuation. To be honest the HuggingFace / Sbert tokenizer should be doing all the transformations the model was trained with. Unless you specifically know that your text is abnormal in some way it shouldn't be necessary to perform these transforms. There's some weirdness in our text with regard to escape characters and such but not enough to make the semantic search fail.

We are also going to skip every paper where the LaTeX scraping came up empty-handed. Of course you wouldn't do that if you were building a search product to release.

In [3]:
def clean_latex(latex_scrape: List[str]) -> str:
    if not latex_scrape:
        return ""
    latex_scrape = '\n'.join(latex_scrape)
    return latex_scrape

In [4]:
# Generator functions that iterate through the file and process/load papers

def process(paper: dict):
    paper = json.loads(paper)
    latex_scrape = ID_TEX.get(paper["id"], None)
    if not latex_scrape:
        return {}
    # Morten: We shouldn't actually need this for recent papers since the year is in the id now
    # Morten: But lets keep it as it is
    if paper['journal-ref']:
        # Attempt to parse the date using Regex: this could be improved
        years = [int(year) for year in re.findall(YEAR_PATTERN, paper['journal-ref'])]
        years = [year for year in years if (year <= 2022 and year >= 1991)]
        year = min(years) if years else None
    else:
        year = None
    return {
        'id': paper['id'],
        'title': paper['title'],
        'year': year,
        'authors': paper['authors'],
        'categories': ','.join(paper['categories'].split(' ')),
        'abstract': paper['abstract'],
        'latex_scrape': latex_scrape,
        'input': clean_latex(latex_scrape) # input for embedding model
    }

def papers():
    with open(DATA_PATH, 'r') as f:
        for paper in f:
            paper = process(paper)
            # Returns empty dict if we didn't scrape anything
            if paper == {}:
                continue
            # Yield only papers that have a year I could process
            if paper['year']:
                yield paper


In [5]:
# Example
next(papers())

{'id': '2001.00001',
 'title': 'Quantum GestART: Identifying and Applying Correlations between\n  Mathematics, Art, and Perceptual Organization',
 'year': 2020,
 'authors': 'Maria Mannone, Federico Favali, Balandino Di Donato, Luca Turchet',
 'categories': 'math.HO,cs.MM',
 'abstract': '  Mathematics can help analyze the arts and inspire new artwork. Mathematics\ncan also help make transformations from one artistic medium to another,\nconsidering exceptions and choices, as well as artists\' individual and unique\ncontributions. We propose a method based on diagrammatic thinking and quantum\nformalism. We exploit decompositions of complex forms into a set of simple\nshapes, discretization of complex images, and Dirac notation, imagining a world\nof "prototypes" that can be connected to obtain a fine or coarse-graining\napproximation of a given visual image. Visual prototypes are exchanged with\nauditory ones, and the information (position, size) characterizing visual\nprototypes is conn

In [6]:
# Load papers into a CuDF
cdf = cudf.DataFrame(list(papers()))

In [7]:
len(cdf)

31458

In [8]:
# Morten: apparently some of these are not from 2020? Doesn't sound right.
cdf.year.value_counts()

2020    15880
2021    12257
2022     1432
2019      757
2018      219
2017      113
2001       84
2016       75
2011       64
2000       62
2002       56
2015       54
2012       46
2004       43
2006       39
2014       39
2007       38
2005       37
2009       35
2003       34
2013       33
2008       32
2010       29
Name: year, dtype: int32

In [9]:
# Pickle the dataframe to save you time in the future

with open('cdf.pkl', 'wb') as f:
    pickle.dump(cdf, f)
    
# Load pickle
# with open('cdf.pkl', 'rb') as f:
#     cdf = pickle.load(f)

## Step 2: Create sentence embeddings
Here I use a cookie-cutter -- **out of the box** -- model from HuggingFace to transform papers abstracts + titles into vectors.

**This takes a long time**... So best to take a subset. Or use the dask cluster for multi-gpu encoding.

In [10]:
# batch = cdf[:100000].copy()
batch = cdf.copy()

In [11]:
from sentence_transformers import SentenceTransformer

# Morten: Going to use a smaller model to speed this up
# Morten: On second thought that changes the vector length so we need to modify the Redis upload code
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

vectors = model.encode(
    sentences = batch.input.values_host,
    normalize_embeddings = True,
    batch_size = 64,
    show_progress_bar = True
)

Batches:   0%|          | 0/492 [00:00<?, ?it/s]

In [12]:
# Vectors created!
batch['vector'] = cudf.Series(vectors.tolist(), index=batch.index)

In [13]:
batch.head()

Unnamed: 0,id,title,year,authors,categories,abstract,latex_scrape,input,vector
0,2001.00001,Quantum GestART: Identifying and Applying Corr...,2020,"Maria Mannone, Federico Favali, Balandino Di D...","math.HO,cs.MM",Mathematics can help analyze the arts and in...,[\begin{equation}\label{product}\n\begin{footn...,\begin{equation}\label{product}\n\begin{footno...,"[-0.032643456012010574, -0.021090390160679817,..."
1,2001.00011,Dark Energy and Modified Scale Covariant Theor...,2020,"Koijam Manihar Singh, Sanjay Mandal, Longjam P...","gr-qc,hep-th",Taking up four model universes we study the ...,[\begin{equation}\n\label{eqn:1}\ng_{ij}'=\phi...,\begin{equation}\n\label{eqn:1}\ng_{ij}'=\phi^...,"[-0.00736722256988287, -0.031140638515353203, ..."
2,2001.00018,"Connecting optical morphology, environment, an...",2020,John F. Wu,"astro-ph.GA,astro-ph.IM",A galaxy's morphological features encode det...,[\begin{equation}\n {\rm RMSE} \equiv \sqrt...,\begin{equation}\n {\rm RMSE} \equiv \sqrt{...,"[-0.03967193141579628, -0.07747125625610352, 0..."
3,2001.00019,Not all doped Mott insulators have a pseudogap...,2020,"Wei Wu, Mathias S. Scheurer, Michel Ferrero, A...",cond-mat.str-el,The Mott insulating phase of the parent comp...,[\begin{equation}\n \epsilon^{*}_{\vec{k}} ...,\begin{equation}\n \epsilon^{*}_{\vec{k}} =...,"[-0.009624576196074486, -0.04841659963130951, ..."
4,2001.00021,Efficient classical simulation of random shall...,2022,"John Napp, Rolando L. La Placa, Alexander M. D...","quant-ph,cond-mat.stat-mech,cs.CC",Random quantum circuits are commonly viewed ...,"[\begin{equation} #1 \end{equation}, \begin{al...",\begin{equation} #1 \end{equation}\n\begin{ali...,"[-0.02470673806965351, -0.018680408596992493, ..."


In [14]:
del batch["latex_scrape"]

In [15]:
# Dump these to file with pickle or write them to Redis
# Morten: Since it is a pickle we would have to have cudf where we are loading and that is unlikely
# so changing it to a pandas dataframe before saving
with open(f'embeddings_{len(batch)}.pkl', 'wb') as f:
    pickle.dump(batch.to_pandas(), f)