# Analysis of Preprint Papers from the ArXiv

The website [arxiv.org](https://arxiv.org) is a popular database for scientific papers in STEM fields. ArXiv has its own classification system consisting of roughly 150 different categories, which are manually added by the authors whenever a new paper is uploaded. A paper can be assigned multiple categories.

The goal of this project is to develop a machine learning model that can predict the ArXiv categories of a paper from its title and abstract.

We start by importing all the packages we will need and setting up a data directory.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os # used for handling files
from sklearn.decomposition import PCA # dimension reduction of data
import pickle # saving models
from pathlib import Path # to get home directory
from functools import reduce # used to calculate accuracy of model

# local files
import arxiv_scraper
import cleaner
import elmo
import onehot

print("Packages loaded.")

Packages loaded.


The data set used here has been scraped from the [ArXiv API](https://arxiv.org/help/api) over several days, using the Python scraper `arxiv_scraper.py`. To get a sense for how long the scraping takes, you can uncomment and run the script below.

In [45]:
#arxiv_scraper.cat_scrape(
#    max_results_per_cat = 100, # maximum number of papers to download per category (there are ~150 categories)
#    file_path = "arxiv_data", # name of output file
#    batch_size = 100 # size of every batch - lower batch size requires less memory - must be less than 30,000
#)
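
For readers curious about what such a scraper does under the hood, here is a minimal sketch of querying the arXiv API for one category and parsing the Atom response. It is not the code in `arxiv_scraper.py`: the helper names `build_query_url` and `parse_feed` are made up for illustration, and only the endpoint and the `search_query`, `start` and `max_results` parameters are taken from the public API.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace of the Atom feed


def build_query_url(category, start=0, max_results=100):
    """Build an arXiv API URL for one category and one batch (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml):
    """Extract title, abstract and categories from an Atom response (hypothetical helper)."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        papers.append({
            # titles and abstracts are wrapped over several lines, so collapse whitespace
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", "").split()),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        })
    return papers
```

Fetching the URL with `urllib.request.urlopen(url).read()` and passing the result to `parse_feed` would give one batch of papers; increasing `start` in steps of `max_results` pages through a category.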

Alternatively, I have already scraped metadata for about a million papers with this scraper (using `max_results_per_cat = 10000`), which can be freely downloaded below. The full data set takes up about 1 GB of space, however, so I have also included several random samples of it:

* `arxiv` contains the main data set
* `arxiv_sample_1000` contains 1,000 papers
* `arxiv_sample_5000` contains 5,000 papers
* `arxiv_sample_10000` contains 10,000 papers
* `arxiv_sample_25000` contains 25,000 papers
* `arxiv_sample_50000` contains 50,000 papers
* `arxiv_sample_100000` contains 100,000 papers
* `arxiv_sample_200000` contains 200,000 papers
* `arxiv_sample_500000` contains 500,000 papers
* `arxiv_sample_750000` contains 750,000 papers

Choose your favorite by setting `file_name` below. Alternatively, of course, you can set it to the file name of your own scraped data.

In [2]:
file_name = "arxiv_sample_200000"

Next up, we specify the folder in which we will store all our data. Change it to whichever folder you like.

In [3]:
home_dir = str(Path.home())
data_path = os.path.join(home_dir, "pCloudDrive", "public_folder", "scholarly_data")

## Fetching data

We then do some basic setup.

In [48]:
# create path directory and download a list of all arXiv categories
cleaner.setup(data_path)

# download the raw titles and abstracts
cleaner.download_papers(file_name, data_path)

cats.csv is already downloaded.
arxiv_val_1hot.csv is already downloaded.
arxiv_val_1hot_agg.csv is already downloaded.
arxiv_sample_50000.csv is already downloaded.


Next, we store the list of arXiv categories.

In [49]:
# construct category dataframe and array
full_path = os.path.join(data_path, "cats.csv")
cats_df = pd.read_csv(full_path)
cats = np.asarray(cats_df['category'].values)

pd.set_option('display.max_colwidth', 50)
cats_df.head()

Unnamed: 0,category,description
0,astro-ph,Astrophysics
1,astro-ph.CO,Cosmology and Nongalactic Astrophysics
2,astro-ph.EP,Earth and Planetary Astrophysics
3,astro-ph.GA,Astrophysics of Galaxies
4,astro-ph.HE,High Energy Astrophysical Phenomena


## Cleaning the data

We now do some basic cleaning operations on our raw data. We convert strings such as `'[cat_1, cat_2]'` into actual lists `[cat_1, cat_2]`, make everything lower case, remove punctuation, numbers and excess whitespace, and drop NaN rows.
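
As a rough illustration of these steps, the sketch below uses only the standard library. The function names are hypothetical, and the exact rules in `cleaner.py` may well differ (the lemmatised output shown further down, for instance, still contains some punctuation).

```python
import ast
import re


def parse_categories(raw):
    """Turn the string "['math.ST', 'cs.CC']" into the list ['math.ST', 'cs.CC']."""
    return ast.literal_eval(raw)


def basic_clean(text):
    """Lower-case the text, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()
    # replace anything that is not a letter or whitespace with a space
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    return " ".join(text.split())


def clean_rows(rows):
    """Clean (raw_categories, text) pairs, skipping rows with missing values."""
    cleaned = []
    for raw_cats, text in rows:
        if raw_cats is None or text is None:
            continue  # stands in for dropping NaN rows
        cleaned.append((parse_categories(raw_cats), basic_clean(text)))
    return cleaned
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is the safer choice for scraped text.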

Our last text cleaning step is to lemmatise the text, which reduces every word to its base form. For instance, 'eating' is converted into 'eat' and 'better' is converted into 'good'. This usually takes a while to finish, so instead we're simply going to download a lemmatised version of your chosen data set. Alternatively, if you're dealing with your own scraped data set, the cell below precleans and lemmatises it whenever no cleaned file is found.

In [50]:
full_path = os.path.join(data_path, f"{file_name}_clean.csv")
if not os.path.isfile(full_path):
    # preclean raw data and save the precleaned texts and
    # categories to {file_name}_preclean.csv
    cleaner.get_preclean_text(file_name, data_path)

    # lemmatise precleaned data and save lemmatised texts to 
    # {file_name}_clean.csv and delete the precleaned file
    cleaner.lemmatise_file(file_name, batch_size = 1000, path = data_path, confirmation = False)

# load in cleaned text
print("Loading cleaned text...")
full_path = os.path.join(data_path, f"{file_name}_clean.csv")
clean_text = pd.read_csv(full_path, header = None)
clean_df = pd.DataFrame(clean_text)
clean_df.columns = ['category', 'clean_text']

print(f"Shape of clean_df: {clean_df.shape}. Here are some of the lemmatised texts:")
pd.set_option('display.max_colwidth', 1000)
clean_df.head()

Loading cleaned text...
Shape of clean_df: (49183, 2). Here are some of the lemmatised texts:


Unnamed: 0,category,clean_text
0,"['math.ST', 'cs.CC', 'stat.ML', 'stat.TH']","fast sparse least - square regression with non - asymptotic guarantee in this paper , -PRON- study a fast approximation method for { \it large - scale high - dimensional } sparse least - square regression problem by exploit the johnson - lindenstrauss ( jl ) transform , which embe a set of high - dimensional vector into a low - dimensional space . in particular , -PRON- propose to apply the jl transform to the data matrix and the target vector and then to solve a sparse least - square problem on the compress datum with a { \it slightly large regularization parameter}. theoretically , -PRON- establish the optimization error bind of the learn model for two different sparsity - induce regularizer , i.e. , the elastic net and the $ \ell_$ norm . compare with previous relevant work , -PRON- analysis be { \it non - asymptotic and exhibit more insight } on the bound , the sample complexity and the regularization . as an illustration , -PRON- also provide an error bind of the { \it dantzig..."
1,['cs.SD'],"an interesting property of lpcs for sonorant vs fricative discrimination linear prediction ( lp ) technique estimate an optimum all - pole filter of a give order for a frame of speech signal . the coefficient of the all - pole filter , /a(z ) be refer to as lp coefficient ( lpcs ) . the gain of the inverse of the all - pole filter , a(z ) at z = , i.e , at frequency = , a ( ) correspond to the sum of lpcs , which have the property of be low ( high ) than a threshold for the sonorant ( fricative ) . when the inverse - tan of a ( ) , denote as t ( ) , be use a feature and test on the sonorant and fricative frame of the entire timit database , an accuracy of .% be obtain . hence , -PRON- refer to t ( ) as sonorant - fricative discrimination index ( sfdi ) . this property have also be test for -PRON- robustness for additive white noise and on the telephone quality speech of the ntimit database . these result be comparable to , or in some respect , well than the state - of - the - art m..."
2,"['math.ST', 'stat.TH', '62F15, 60F05 (Primary) 65C60 (Secondary)']","on some asymptotic property and an almost sure approximation of the normalize inverse - gaussian process in this paper , -PRON- present some asymptotic property of the normalize inverse - gaussian process . in particular , when the concentration parameter be large , -PRON- establish an analogue of the empirical functional central limit theorem , the strong law of large number and the glivenko - cantelli theorem for the normalize inverse - gaussian process and -PRON- corresponding quantile process . -PRON- also derive a finite sum - representation that converge almost surely to the ferguson and klass representation of the normalize inverse - gaussian process . this almost sure approximation can be use to simulate efficiently the normalize inverse - gaussian process ."
3,['quant-ph'],"road towards fault - tolerant universal quantum computation current experiment be take the first step toward noise - resilient logical qubit . crucially , a quantum computer must not merely store information , but also process -PRON- . a fault - tolerant computational procedure ensure that error do not multiply and spread . this review compare the lead proposal for promote a quantum memory to a quantum processor . -PRON- compare magic state distillation , color code technique and other alternative idea , pay attention to relative resource demand . -PRON- discuss the several no - go result which hold for low - dimensional topological code and outline the potential reward of use high - dimensional quantum ( ldpc ) code in modular architecture ."
4,"['astro-ph.SR', 'astro-ph.EP', 'astro-ph.GA']","model mid - infrared molecular emission line from t tauri star -PRON- introduce a new modelling framework call flit to simulate infrared line emission spectra from protoplanetary disc . this paper focus on the mid - ir spectral region between . um to um for t tauri star . the generate spectra contain several ten of thousand of molecular emission line of ho , oh , co , co , hcn , ch , h and a few other molecule , as well as the forbidden atomic emission line of si , sii , siii , siii , feii , neii , neiii , arii and ariii . in contrast to previously publish work , -PRON- do not treat the abundance of the molecule nor the temperature in the disc as free parameter , but use the complex result of detailed d prodimo disc model concern gas and dust temperature structure , and molecular concentration . flit compute the line emission spectra by ray trace in an efficient , fast and reliable way . the result be broadly consistent with r= spitzer / irs observational datum of t tauri star conc..."


## One-hot encoding of categories

We then one-hot encode the category variable, as this will make training our model easier. We do this by first creating a dataframe with one column per category and a binary value for every paper, and then concatenating our original dataframe with these binary values.
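
Since a paper can carry several categories, each row may contain more than one 1. The following is a minimal sketch of the idea in plain Python, not the implementation in `onehot.py`. The function name `multi_hot` is hypothetical, and silently ignoring labels that are not in the category list (such as the MSC codes that appear in some rows) is an assumption.

```python
def multi_hot(paper_cats, all_cats):
    """Return one 0/1 row per paper, with one column per known category."""
    index = {cat: i for i, cat in enumerate(all_cats)}
    rows = []
    for cats in paper_cats:
        row = [0] * len(all_cats)
        for cat in cats:
            if cat in index:  # assumption: labels outside the list are ignored
                row[index[cat]] = 1
        rows.append(row)
    return rows


all_cats = ["cs.SD", "math.ST", "quant-ph", "stat.TH"]
paper_cats = [
    ["math.ST", "stat.TH", "62F15, 60F05 (Primary) 65C60 (Secondary)"],
    ["cs.SD"],
]
encoded = multi_hot(paper_cats, all_cats)
```

Here the first paper gets a 1 in the `math.ST` and `stat.TH` columns, and the second only in the `cs.SD` column.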

In [11]:
# one-hot encode categories
#onehot.onehot_encode(file_name, data_path)

# load data
print("Loading category data...")
full_path = os.path.join(data_path, f"{file_name}_1hot_agg.csv")
df_1hot = pd.read_csv(full_path, header = None)
print("Category data loaded.")

# show the new columns of the data frame
pd.set_option('display.max_colwidth', 100)
print(f"Dimensions of df_1hot: {df_1hot.shape}.")
df_1hot.head()

Loading category data...
Category data loaded.
Dimensions of df_1hot: (199998, 8).


Unnamed: 0,0,1,2,3,4,5,6,7
0,modulus stable map genus logarithmic geometry pair paper develop framework application logarithm...,0,0,0,0,0,0,0
1,break symmetric cryptosystem quantum period find shor algorithm quantum computer severe threat p...,0,0,0,0,0,0,0
2,improve surgical training phantom hyperrealism deep unpaired image image translation real surger...,0,0,0,0,0,0,0
3,chaotic dynamic bounce coin study dynamic bounce coin motion restrict dimensional plane coin mod...,0,0,0,0,0,0,0
4,geographica benchmark geospatial rdf store geospatial extension sparql like geosparql stsparql r...,0,0,0,0,0,0,0


## ELMo feature extraction

To build our model, we have to extract features from the titles and abstracts. We will be using ELMo, a state-of-the-art NLP model developed by AllenNLP, which converts text input into vectors such that similar words lie closer to each other. We first download the ELMo model. It is over 350 MB in size, so this might take a little while.

In [52]:
elmo.download_elmo_model()

ELMo model already downloaded.
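
To make "closer to each other" concrete, one common way to compare two embedding vectors is cosine similarity. The sketch below is a generic illustration with a hypothetical helper name; whether the model built later relies on this particular measure is not stated in this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Values near 1 mean that the two vectors point in almost the same direction, values near 0 mean that they are roughly orthogonal.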


We now need to extract ELMo features from our cleaned text data. This is done using the `extract` function from `elmo.py`. This usually takes a long time.

In [53]:
full_path = os.path.join(data_path, f"{file_name}_elmo.csv")
if not os.path.isfile(full_path):
    # extract ELMo data
    elmo.extract(
        file_name = file_name,
        path = data_path,
        batch_size = 20, # lower batch size gives less accurate vectors but requires less memory
        doomsday_clock = 50,
        confirmation = False
        )

# load ELMo data
print("Loading ELMo'd text...")
full_path = os.path.join(data_path, f"{file_name}_elmo.csv")
elmo_data = pd.read_csv(full_path, header = None)
print(f"ELMo data loaded from {file_name}_elmo.csv.")

# attach the category column to the ELMo features
elmo_df = elmo_data.join(clean_df['category'])

print(f"Shape of elmo_df: {elmo_df.shape}")
elmo_df.head()

Loading ELMo'd text...
ELMo data loaded from arxiv_sample_50000_elmo.csv.
Shape of elmo_df: (49999, 1025)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1015,1016,1017,1018,1019,1020,1021,1022,1023,category
0,0.032264,0.164495,4.6e-05,-0.008434,0.062614,-0.097488,-0.062888,0.252686,0.09511,-0.187274,...,0.077523,-0.038856,0.107055,0.255856,-0.061011,0.394257,-0.01174,0.138539,-0.013543,['math...
1,0.18023,0.266668,0.042819,-0.038384,0.038294,-0.034775,-0.023941,0.36783,0.07021,-0.084328,...,0.116197,-0.024739,0.109669,0.140959,-0.035091,0.431889,-0.09257,0.226117,-0.007482,['cs.SD']
2,-0.016303,0.067849,0.052801,-0.009187,-0.00715,-0.064071,0.004345,0.117463,-0.027563,-0.178627,...,0.057723,0.033624,0.099338,0.163222,0.029108,0.164415,0.038682,0.215892,0.023488,['math...
3,0.00249,0.063034,-0.025218,-0.051667,0.082643,-0.084306,-0.021061,0.037081,0.052564,-0.164204,...,0.040261,0.004867,0.136856,0.159625,0.058569,0.285389,0.003747,0.125958,-0.007329,['quan...
4,0.217903,0.444179,-0.155039,-0.028567,0.11354,-0.132693,-0.209251,0.191103,0.145439,-0.233822,...,0.064445,-0.042244,0.139357,0.401823,0.134148,0.919678,-0.068214,0.570734,0.007093,['astr...


In [None]:
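# separate the ELMo feature columns from the trailing 'category' column
n = elmo_df.shape[1] - 1
X = np.asarray(elmo_df.iloc[:, :n])

# project the features onto their first two principal components
X_2d = PCA(n_components = 2).fit_transform(X)
elmo_2d = pd.DataFrame(X_2d, columns = ['x', 'y'])

# scatter plot of the 2D projection, with the axis ticks hidden
fig, ax = plt.subplots(1, figsize = (15, 10))
ax.set_xticks([])
ax.set_yticks([])
sns.scatterplot(data = elmo_2d, x = 'x', y = 'y', ax = ax)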

## Analysis of the data

Here is how the categories in our data set are distributed.

In [None]:
# save a series with the number of papers in each category
# (columns from index 1024 onwards are taken to be the one-hot category columns)
sum_cats = df_1hot.iloc[:, 1024:].sum()

# get statistical information about the distribution of the number of papers
sum_cats.describe()

In [None]:
# plot the number of papers in each category
plt.figure(figsize = (20,10))
plt.bar(x = sum_cats.keys(), height = sum_cats.values)
plt.xlabel('Categories', fontsize = 13)
plt.ylabel('Number of papers', fontsize = 13)
plt.title('Distribution of categories in data set', fontsize = 18)
#plt.xticks([])
plt.show()

We see that our data is not uniformly distributed across categories. These are the categories with the most papers in the data set.

In [None]:
# add the counts to the dataframe and sort 
cats_df['count'] = sum_cats.values
cats_df = cats_df.sort_values(by=['count'], ascending = False)

pd.set_option('display.max_colwidth', 50)
cats_df[:5]
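
The same tally can also be made directly from the category lists, without going through the one-hot columns. This is a small sketch using `collections.Counter`. The function name `category_counts` and the toy input are made up for illustration.

```python
from collections import Counter


def category_counts(paper_cats):
    """Count how many papers carry each category label."""
    counts = Counter()
    for cats in paper_cats:
        counts.update(set(cats))  # a paper counts at most once per category
    return counts


toy_papers = [["math.ST", "stat.TH"], ["math.ST"], ["cs.SD"]]
counts = category_counts(toy_papers)
top = counts.most_common(1)
```

`most_common(k)` returns the `k` most frequent categories with their counts, which mirrors the sorted `cats_df[:5]` view above.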

## Building a model

We are now done manipulating our data, and the time has come to build a model.