This notebook is used to:
 - Download the arXiv metadata
 - Read the entire metadata file and convert it into a dataframe
 - Save the dataframe as a CSV in your Google Drive for future use, and delete the metadata to free space
 - Keep only the five AI/ML categories
 - Clean the LaTeX-formatted abstract and title columns
 - Instantiate the model, convert abstracts to vectors, and build the Faiss index with IndexFlatL2

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Take a closer look at the hardware specifications to see which CPU and GPU are installed:

In [None]:
!lscpu |grep 'Model name'

Model name:          Intel(R) Xeon(R) CPU @ 2.00GHz


In [None]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-680d0840-aec8-a9b0-db95-2c6a8ca29a84)


In addition, you can check the available RAM and disk space:

In [None]:
!cat /proc/meminfo | grep 'MemAvailable'

MemAvailable:   12403728 kB


In [None]:
!df -h / | awk '{print $4}'

Avail
27G
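
The same two checks can also be done from Python, which is handy when the result should drive a decision in code, for example whether there is room for a multi-gigabyte download. This is a minimal sketch using only the standard library; the helper names `mem_available_kb` and `free_disk_gb` are made up for this example, and the `/proc/meminfo` text format assumes a Linux runtime such as Colab.

```python
import shutil


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12403728 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def free_disk_gb(path="/"):
    """Return the free space of the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


# Parse the sample line shown in the output above
print(mem_available_kb("MemAvailable:   12403728 kB"))  # 12403728

# Free space on the root filesystem of the current machine
print(f"Free disk space: {free_disk_gb('/'):.1f} GiB")
```

On a live Linux machine, the text passed to `mem_available_kb` would come from reading `/proc/meminfo`.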


Finally, you can run the following command to get a snapshot of the current GPU usage. This is useful for checking how much GPU memory is in use when tuning the batch size for training. Note that while a training routine is still running in a notebook, you need to run this command from another Colaboratory notebook to get an immediate response:

In [None]:
!nvidia-smi

Sun Aug  1 05:19:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Download the arXiv metadata with gsutil
We need the gsutil utility from the Google Cloud SDK. First, authenticate yourself in Colab. When you run the code below, it asks you to follow a link to log in and then enter the access token you receive after a successful login.


In [None]:
from google.colab import auth
auth.authenticate_user()

We will use the gsutil command to upload and download files, so we first need to install the Google Cloud SDK.

In [None]:
!curl https://sdk.cloud.google.com | bash
!gcloud init

### Download the JSON metadata from the cloud

In [None]:
!gsutil cp -n gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json /content/gdrive/My\ Drive/arxiv-metadata-oai.json
!ls -l /content/gdrive/My\ Drive


### Reading the entire JSON metadata
This cell may take a few minutes to run, given the volume of data (about 3.1 GB).

In [None]:
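import os
import tqdm
import json

# NOTE: this path must point to where the metadata file was actually saved
# (the download cell above writes to the Drive root under a different file name)
input_file = "/content/gdrive/MyDrive/Arxiv/arxiv-metadata-oai-snapshot.json"

# Read the file line by line (one JSON record per line), tracking progress by size
data = []
with tqdm.tqdm(total=os.path.getsize(input_file)) as pbar:
    with open(input_file, 'r') as f:
        for line in f:
            pbar.update(len(line))
            data.append(json.loads(line))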

100%|██████████| 3109294971/3109294971 [02:21<00:00, 21958727.42it/s]
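
Loading every record into a list keeps the whole file in RAM before the dataframe is even built. If only a few fields are needed, one option is to stream the file and keep just those fields per record. The sketch below assumes the file is in JSON Lines format (one JSON object per line, as the loop above implies); `iter_records` and its `keep` parameter are illustrative names, not part of the notebook.

```python
import json


def iter_records(path, keep=("id", "title", "abstract", "categories")):
    """Yield one dict per non-empty line of a JSON Lines file, keeping only `keep` fields."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Missing fields are filled with None so every row has the same keys
            yield {key: record.get(key) for key in keep}
```

Something like `pd.DataFrame(list(iter_records(input_file)))` would then build a smaller frame. Later cells that display other columns would show fewer fields in that case, so this is a trade-off rather than a drop-in replacement.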


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.DataFrame(data)

In [None]:
data.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"


Rename the id column to arxiv_id and turn the index into an integer id column for easier manipulation

In [None]:
data.rename(columns = {'id':'arxiv_id'}, inplace = True)

In [None]:
print(data.index.name)
data.index.name = 'id'

None


In [None]:
data.reset_index(level=0, inplace=True)

In [None]:
data.head(10)

Unnamed: 0,id,arxiv_id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"
5,5,704.0006,Yue Hin Pong,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,"6 pages, 4 figures, accepted by PRA",,10.1103/PhysRevA.75.043613,,cond-mat.mes-hall,,We study the two-particle wave function of p...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
6,6,704.0007,Alejandro Corichi,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,"16 pages, no figures. Typos corrected to match...","Phys.Rev.D76:044016,2007",10.1103/PhysRevD.76.044016,IGPG-07/03-2,gr-qc,,A rather non-standard quantum representation...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."
7,7,704.0008,Damian Swift,Damian C. Swift,Numerical solution of shock and ramp compressi...,Minor corrections,"Journal of Applied Physics, vol 104, 073536 (2...",10.1063/1.2975338,"LA-UR-07-2051, LLNL-JRNL-410358",cond-mat.mtrl-sci,http://arxiv.org/licenses/nonexclusive-distrib...,A general formulation was developed to repre...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2009-02-05,"[[Swift, Damian C., ]]"
8,8,704.0009,Paul Harvey,"Paul Harvey, Bruno Merin, Tracy L. Huard, Luis...","The Spitzer c2d Survey of Large, Nearby, Inste...",,"Astrophys.J.663:1149-1173,2007",10.1086/518646,,astro-ph,,We discuss the results from the combined IRA...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2010-03-18,"[[Harvey, Paul, ], [Merin, Bruno, ], [Huard, T..."
9,9,704.001,Sergei Ovchinnikov,Sergei Ovchinnikov,"Partial cubes: structures, characterizations, ...","36 pages, 17 figures",,,,math.CO,,Partial cubes are isometric subgraphs of hyp...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Ovchinnikov, Sergei, ]]"


Save the dataframe as a CSV so that the large JSON file does not have to be parsed again

In [None]:
data.to_csv("/content/gdrive/MyDrive/Arxiv/Arxiv_Full.csv",index=False)

Factory reset the runtime to clear the RAM.<br>
Upload the requirements.txt file from the notebooks folder of the repo to your Drive (the install cell below reads it from MyDrive/Arxiv).<br>
Mount the drive again.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
!pip install -r /content/gdrive/MyDrive/Arxiv/requirements.txt

Collecting torch==1.8.1
  Downloading torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
Collecting transformers==3.3.1
  Downloading transformers-3.3.1-py3-none-any.whl (1.1 MB)
Collecting sentence-transformers==0.3.8
  Downloading sentence-transformers-0.3.8.tar.gz (66 kB)
Collecting pandas==1.1.2
  Downloading pandas-1.1.2-cp37-cp37m-manylinux1_x86_64.whl (10.5 MB)
Collecting faiss-cpu==1.6.1
  Downloading faiss_cpu-1.6.1-cp37-cp37m-manylinux2010_x86_64.whl (7.1 MB)
Collecting numpy==1.19.2
  Downloading numpy-1.19.2-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting folium==0.2

Read the dataframe back from the CSV and, if required, delete the arXiv JSON metadata to free space

In [3]:
import pandas as pd
df = pd.read_csv("/content/gdrive/MyDrive/Arxiv/Arxiv_Full.csv")


In [4]:
# Keep only papers whose first-listed category is one of the five AI/ML categories
# cs.CL => Computation and Language
# cs.IR => Information Retrieval
# cs.LG => Machine Learning
# cs.HC => Human-Computer Interaction
# cs.CV => Computer Vision and Pattern Recognition
# str.match anchors the pattern at the start of the string; dots are escaped to match literally
ai_ml_df = df[df.categories.str.match(r'cs\.CL|cs\.IR|cs\.LG|cs\.HC|cs\.CV')]
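
Note that `str.match` anchors the pattern at the start of each string, so the filter above keeps a paper only when its first-listed category is one of the five. The sketch below illustrates the difference with the standard `re` module, on the assumption that `Series.str.match` behaves like `re.match` per element and `Series.str.contains` like `re.search`. The sample category strings are made up for illustration, in the style of the `categories` column.

```python
import re

# Dots are escaped so that "." matches a literal dot rather than any character
PATTERN = re.compile(r"cs\.CL|cs\.IR|cs\.LG|cs\.HC|cs\.CV")

samples = ["cs.LG stat.ML", "stat.ML cs.LG", "math.CO cs.CG", "cs.CV"]

# re.match: the pattern must match at the start of the string (first-listed category)
first_listed = [s for s in samples if PATTERN.match(s)]

# re.search: the pattern may match anywhere (any listed category)
any_listed = [s for s in samples if PATTERN.search(s)]

print(first_listed)  # ['cs.LG stat.ML', 'cs.CV']
print(any_listed)    # ['cs.LG stat.ML', 'stat.ML cs.LG', 'cs.CV']
```

Which behaviour is wanted depends on whether cross-listed papers should count; switching would change the number of papers and therefore every downstream output in this notebook.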

In [5]:
import numpy as np


def vector_search(query, model, index, num_results=10):
    """Transforms a query to a vector using a pretrained, sentence-level
    DistilBERT model and finds similar vectors using FAISS.
    Args:
        query (str or list of str): User query (or queries), ideally more than a sentence long.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (faiss.Index): FAISS index to search.
        num_results (int): Number of results to return.
    Returns:
        D (:obj:`numpy.array` of `float`): Distance between results and query.
        I (:obj:`numpy.array` of `int`): Paper ID of the results.

    """
    # Wrap a single string in a list; list(query) would split it into characters
    queries = [query] if isinstance(query, str) else list(query)
    vector = model.encode(queries)
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I


def id2details(df, I, column):
    """Returns the values of `column` (e.g. the title) for each paper ID in the first row of I."""
    return [list(df[df.id == idx][column]) for idx in I[0]]

In [6]:
ai_ml_df.dtypes

id                 int64
arxiv_id          object
submitter         object
authors           object
title             object
comments          object
journal-ref       object
doi               object
report-no         object
categories        object
license           object
abstract          object
versions          object
update_date       object
authors_parsed    object
dtype: object

In [7]:
!pip install pylatexenc



In [8]:
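# Convert the LaTeX markup in the title and abstract columns to plain Unicode text using pylatexenc
from pylatexenc.latex2text import LatexNodes2Text

latex_converter = LatexNodes2Text()  # reuse one converter for all rows

# LaTeX to Unicode text
clean_abstract = []
clean_title = []
for i, a in ai_ml_df.iterrows():
    # Clean title; fall back to the raw title if LaTeX parsing fails
    try:
        clean_title.append(latex_converter.latex_to_text(a['title']).replace('\n', ' ').strip())
    except Exception:
        clean_title.append(a['title'].replace('\n', ' ').strip())
    # Clean abstract; fall back to the raw abstract if LaTeX parsing fails
    try:
        clean_abstract.append(latex_converter.latex_to_text(a['abstract']).replace('\n', ' ').strip())
    except Exception:
        clean_abstract.append(a['abstract'].replace('\n', ' ').strip())
# ai_ml_df is a filtered slice of df, so pandas emits a SettingWithCopyWarning here;
# calling .copy() on the filtered frame beforehand would avoid it.
ai_ml_df['clean_abstracts'] = clean_abstract
ai_ml_df['clean_title'] = clean_title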

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [9]:
ai_ml_df.to_csv('/content/gdrive/MyDrive/Arxiv/Arxiv_AIML_processed.csv')

In [10]:
print(f"Arxiv articles:{ai_ml_df.id.unique().shape[0]}")

Arxiv articles:117234


In [11]:
import pandas as pd

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

In [12]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

100%|██████████| 245M/245M [00:16<00:00, 14.4MB/s]


cuda:0


In [13]:
# Convert abstracts to vectors
embeddings = model.encode(ai_ml_df.clean_abstracts.to_list(), show_progress_bar=True)

HBox(children=(FloatProgress(value=0.0, description='Batches', max=3664.0, style=ProgressStyle(description_wid…




In [14]:
print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)


In [16]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Convert the IDs to int64, as required by add_with_ids
ids = np.asarray(ai_ml_df.id.astype('int64'))
print(ids)
# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, ids)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

print(f"Number of vectors in the Faiss index: {index.ntotal}")

[   1019    1027    1266 ... 1665369 1665376 1665377]
Number of vectors in the Faiss index: 117234
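
For intuition, `IndexFlatL2` performs an exhaustive search: it compares the query with every stored vector and returns the closest ones, reporting squared Euclidean distances. `IndexIDMap` lets the results be reported under the custom IDs passed to `add_with_ids` rather than under positions 0 to n-1. The dependency-free sketch below mimics that behaviour on toy 2-D vectors. `flat_l2_search` and the sample IDs are hypothetical, and this is not how FAISS is implemented internally.

```python
import heapq


def flat_l2_search(vectors, ids, query, k=3):
    """Exhaustive nearest-neighbour search using squared L2 distance.

    vectors: list of equal-length lists of floats
    ids:     list of integer IDs, one per vector
    Returns (distances, result_ids), closest first.
    """
    scored = (
        (sum((v_i - q_i) ** 2 for v_i, q_i in zip(vec, query)), vec_id)
        for vec, vec_id in zip(vectors, ids)
    )
    nearest = heapq.nsmallest(k, scored)
    return [d for d, _ in nearest], [i for _, i in nearest]


# Toy data: three 2-D vectors stored under arbitrary (made-up) IDs
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
toy_ids = [101, 205, 309]

toy_D, toy_I = flat_l2_search(toy_vectors, toy_ids, query=[0.0, 0.0], k=2)
print(toy_D)  # [0.0, 2.0]
print(toy_I)  # [101, 309]
```

Because the distances are squared, a vector that is 5 units away (such as `[3.0, 4.0]` from the origin) would be reported with distance 25.0.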


In [18]:
ai_ml_df.iloc[5415, 15]

'Three-dimensional particle tracking is an essential tool in studying dynamics under the microscope, namely, fluid dynamics in microfluidic devices, bacteria taxis, cellular trafficking. The 3d position can be determined using 2d imaging alone by measuring the diffraction rings generated by an out-of-focus fluorescent particle, imaged on a single camera. Here I present a ring detection algorithm exhibiting a high detection rate, which is robust to the challenges arising from ring occlusion, inclusions and overlaps, and allows resolving particles even when near to each other. It is capable of real time analysis thanks to its high performance and low memory footprint. The proposed algorithm, an offspring of the circle Hough transform, addresses the need to efficiently trace the trajectories of many particles concurrently, when their number in not necessarily fixed, by solving a classification problem, and overcomes the challenges of finding local maxima in the complex parameter space whi

In [19]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[5415]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nArxiv paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 68.53523254394531, 70.18161010742188, 72.04075622558594, 73.01589965820312, 75.42435455322266, 76.07571411132812, 76.50065612792969, 77.51425170898438, 77.761962890625]

Arxiv paper IDs: [466374, 1457933, 1660222, 1105598, 1481269, 1017831, 622208, 1038120, 389291, 1135725]


In [20]:
# Fetch the paper titles based on their index
id2details(ai_ml_df, I, 'clean_title')

[['Robust and highly performant ring detection algorithm for 3d particle   tracking using 2d microscope imaging'],
 ['Real-time dense 3D Reconstruction from monocular video data captured by   low-cost UAVs'],
 ['Camera Calibration: a USU Implementation'],
 ['Defogging Kinect: Simultaneous Estimation of Object Region and Depth in   Foggy Scenes'],
 ['Self-supervised Depth Estimation Leveraging Global Perception and   Geometric Smoothness Using On-board Videos'],
 ['HMS-Net: Hierarchical Multi-scale Sparsity-invariant Network for Sparse   Depth Completion'],
 ['Noise in Structured-Light Stereo Depth Cameras: Modeling and its   Applications'],
 ['CNN-based Preprocessing to Optimize Watershed-based Cell Segmentation in   3D Confocal Microscopy Images'],
 ['Orientation Determination from Cryo-EM images Using Least Unsquared   Deviation'],
 ['Structure from Motion for Panorama-Style Videos']]

In [21]:
id2details(ai_ml_df, I, 'clean_abstracts')

[['Three-dimensional particle tracking is an essential tool in studying dynamics under the microscope, namely, fluid dynamics in microfluidic devices, bacteria taxis, cellular trafficking. The 3d position can be determined using 2d imaging alone by measuring the diffraction rings generated by an out-of-focus fluorescent particle, imaged on a single camera. Here I present a ring detection algorithm exhibiting a high detection rate, which is robust to the challenges arising from ring occlusion, inclusions and overlaps, and allows resolving particles even when near to each other. It is capable of real time analysis thanks to its high performance and low memory footprint. The proposed algorithm, an offspring of the circle Hough transform, addresses the need to efficiently trace the trajectories of many particles concurrently, when their number in not necessarily fixed, by solving a classification problem, and overcomes the challenges of finding local maxima in the complex parameter space w

## Putting it all together
So far, we've built a Faiss index from the arXiv abstract vectors we encoded with a sentence-DistilBERT model. That's helpful, but in a real-world scenario we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

 - Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
 - Change its data type to float32.
 - Search the index with the encoded query.

In [22]:
user_query = """
This paper describes an efficient reduction of the learning problem of
ranking to binary classification. The reduction guarantees an average pairwise
misranking regret of at most that of the binary classifier regret, improving a
recent result of Balcan et al which only guarantees a factor of 2. Moreover,
our reduction applies to a broader class of ranking loss functions, admits a
simpler proof, and the expected running time complexity of our algorithm in
terms of number of calls to a classifier or preference function is improved
from $\Omega(n^2)$ to $O(n \log n)$. In addition, when the top $k$ ranked
elements only are required ($k \ll n$), as in many applications in information
extraction or search engines, the time complexity of our algorithm can be
further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus
practical for realistic applications where the number of points to rank exceeds
several thousands. Much of our results also extend beyond the bipartite case
previously studied.
Our rediction is a randomized one. To complement our result, we also derive
lower bounds on any deterministic reduction from binary (preference)
classification to ranking, implying that our use of a randomized reduction is
essentially necessary for the guarantees we provide.
"""

In [23]:
# For convenience, I've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nArxiv paper IDs: {I.flatten().tolist()}')

L2 distance: [5.541277885437012, 55.357948303222656, 62.93092346191406, 67.29206848144531, 68.80670166015625, 70.92662048339844, 72.32435607910156, 74.47453308105469, 74.989013671875, 77.26283264160156]

Arxiv paper IDs: [29836, 100170, 353482, 1131386, 1442388, 1204467, 1007661, 1138352, 969530, 407092]


In [24]:
# Fetching the paper titles based on their index
id2details(ai_ml_df, I, 'clean_title')

[['An efficient reduction of ranking to classification'],
 ['The Offset Tree for Learning with Partial Labels'],
 ['Surrogate Regret Bounds for Bipartite Ranking via Strongly Proper Losses'],
 ['Equipping Experts/Bandits with Long-term Memory'],
 ['Adaptive Importance Sampling for Finite-Sum Optimization and Sampling   with Decreasing Step-Sizes'],
 ['A Reduction from Reinforcement Learning to No-Regret Online Learning'],
 ['Acceleration through Optimistic No-Regret Dynamics'],
 ['Online Active Learning of Reject Option Classifiers'],
 ['Online Improper Learning with an Approximation Oracle'],
 ['Adaptive Metric Dimensionality Reduction']]

In [25]:
# Serialise the index and store it as a pickle on Google Drive
with open("/content/gdrive/MyDrive/Arxiv/faiss_index_aiml.pickle", "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)
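
To reuse the index in a later session, the pickle can be read back and passed to `faiss.deserialize_index`, the counterpart of `faiss.serialize_index`. This is a minimal sketch, assuming the file was written by the cell above. `load_faiss_index` is an illustrative helper name, and `faiss` is imported inside the function so that the definition itself does not require the library. As with any pickle, only load files you created yourself.

```python
import pickle


def load_faiss_index(path):
    """Read a pickled, serialized FAISS index from `path` and rebuild the index object."""
    import faiss  # imported lazily; requires faiss-cpu (or faiss-gpu) to be installed

    with open(path, "rb") as h:
        serialized = pickle.load(h)  # the array produced by faiss.serialize_index
    return faiss.deserialize_index(serialized)
```

A call such as `index = load_faiss_index("/content/gdrive/MyDrive/Arxiv/faiss_index_aiml.pickle")` would then restore the index for use with `vector_search`.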