# Automatic tagging

## Modelling

The goal of this section is to build a system that detects the category of an article based on the abstract. To achieve our goal, we need to use the processed text data from the last section to build a machine learning model. First we load the 20k sample dataset:

In [1]:
import pandas as pd
import os, sys
import joblib

In [10]:
df_20k = pd.read_csv('../input/arxiv_20krows_train.csv', converters={'general_category': pd.eval}) # read general category and make sure it is loaded as a list

In [32]:
df_20k.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed,cats_split,general_category
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[['Balázs', 'C.', ''], ['Berger', 'E. L.', '']...",['hep-ph'],[hep-ph]
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[['Streinu', 'Ileana', ''], ['Theran', 'Louis'...","['math.CO', 'cs.CG']","[math, cs]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[['Pan', 'Hongjun', '']]",['physics.gen-ph'],[physics]
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[['Callan', 'David', '']]",['math.CO'],[math]
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[['Abu-Shammala', 'Wael', ''], ['Torchinsky', ...","['math.CA', 'math.FA']","[math, math]"


## Model selection

Since machine learning models only accept numbers as an input, we need to convert words to numbers. There are many techniques for this task, and one of the simplest ones is to make a vector of abstract using **TF-IDF** method. The numeric representation of each word is proportional to frequency of the word in documents ([more about TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)).

When abstracts are converted to numbers, we feed the the `abstract` as features and `general_category` as labels to the model for training. A simple model that is frequently use for text classification is Naive Bayes model. The Naive Bayes models predicts the category of an article based on prior knowledge of features of a class in training data. The limitation of Naive Bayes model is that is assumes features (in our case words) are not correlated, which might not be true for many cases. Despite this limitation, Naive Bayes is fast and can be used as strong base model.

## Evaluation

In order to evaluate the performance of our model, we need to select a suitable metric. In a classification problem, accuracy, precision and recall are the most common metrics. Since our dataset is imbalanced, accuracy will not be a good choice to measure the performance. Our goal is to accurately identify a label in our predictions (high precision), and also we want to identify most of the samples of a label (high recall). So we select f1-score as our metric, since it is a harmonic mean of the two previous metrics.

So the steps for the model training are:
    
1. Convert labels (`general_category`) to 0's and 1's
2. Convert abstracts to numbers using TF-IDF
3. Fit the model on training data
4. Predict the f1-score on test data to calculate the evaluation metric

To get a better picture of model performance, we use K-fold validation technique to split training dataset into k parts. Then we repeat the above process for each fold of the dataset.

The above steps are implemented in `train.py` script in `src` directory.

Let's train our first model with 3 folds:

In [4]:
dir_path = os.path.dirname(os.getcwd())
SRC_PATH = os.path.join(dir_path, "src")

if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)

In [5]:
import train, config_set
from train import train_model

In [34]:
train_model(3, 'n_bayes') 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.96      0.83      0.89      1403
           1       0.86      0.77      0.82      1312
           2       0.89      0.37      0.52       389
           3       0.00      0.00      0.00         2
           4       0.73      0.31      0.43       366
           5       0.75      0.42      0.54       125
           6       0.00      0.00      0.00        63
           7       0.76      0.57      0.65       599
           8       0.81      0.58      0.67       636
           9       0.85      0.89      0.87      1658
          10       1.00      0.01      0.01       284
          11       0.00      0.00      0.00       141
          12       0.00      0.00      0.00        92
          13       0.80      0.02      0.04       189
          14       0.64      0.02      0.03       554
          15       1.00      0.02      0.04       112
          16       0.00      0.00      0.00        33
          17       0.95    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


              precision    recall  f1-score   support

           0       0.95      0.84      0.89      1338
           1       0.85      0.78      0.82      1266
           2       0.87      0.40      0.55       250
           3       0.00      0.00      0.00         0
           4       0.74      0.28      0.41       372
           5       0.91      0.42      0.57       195
           6       0.00      0.00      0.00        81
           7       0.70      0.62      0.66       629
           8       0.82      0.57      0.67       679
           9       0.85      0.90      0.87      1656
          10       0.00      0.00      0.00       289
          11       0.00      0.00      0.00       149
          12       0.00      0.00      0.00        88
          13       1.00      0.04      0.07       188
          14       0.64      0.01      0.02       553
          15       0.00      0.00      0.00       121
          16       0.00      0.00      0.00        31
          17       0.87    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The f1-score is around 0.31 which is not great. There are a few reasons that the performance is poor:
- Number of samples for each category is not enough. (e.g category 3, 6, 12 and 16)
- The features generated by TF-IDF technique are not a good representative of each category. There is a need for more elaborate feature engineering.
- The Naive Bayes model is a simple model that is not powerful to make good predictions

To overcome the first reason, we load the second dataset with 100k samples, so there are more samples for each category. We postpone feature engineering of the TF-IDF technique to a later time, and we use more powerful transformer models in next section to compare the performance of newer models to Naive Bayes,

So now let's load the second dataset.

In [2]:
sample_df = pd.read_csv('../input/sample_df_2021.csv', converters={'general_category': pd.eval})

In [3]:
sample_df['general_category'][1]

['physics']

In [None]:
import importlib
importlib.reload(config_set)

In [8]:
train_model(3, 'n_bayes') 

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.86      0.86      0.86      3039
           1       0.74      0.78      0.76      3682
           2       0.86      0.89      0.88     13847
           3       0.67      0.01      0.02       401
           4       0.55      0.52      0.53      2559
           5       0.71      0.71      0.71      1140
           6       0.50      0.60      0.55       576
           7       0.93      0.07      0.13       196
           8       0.68      0.78      0.73      1422
           9       0.62      0.65      0.64      1393
          10       0.84      0.83      0.83      9335
          11       0.36      0.09      0.15       860
          12       1.00      0.01      0.01       364
          13       0.43      0.22      0.29       235
          14       0.51      0.35      0.41       436
          15       0.53      0.48      0.50      3516
          16       0.63      0.06      0.11       606
          17       0.85    

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.85      0.85      0.85      2904
           1       0.74      0.79      0.77      3584
           2       0.87      0.89      0.88     13936
           3       0.89      0.02      0.04       391
           4       0.53      0.51      0.52      2546
           5       0.69      0.70      0.70      1187
           6       0.49      0.57      0.53       553
           7       0.86      0.07      0.12       183
           8       0.68      0.75      0.71      1403
           9       0.66      0.66      0.66      1472
          10       0.85      0.82      0.83      9356
          11       0.37      0.09      0.14       911
          12       1.00      0.01      0.02       385
          13       0.55      0.21      0.31       238
          14       0.56      0.37      0.45       429
          15       0.55      0.48      0.51      3651
          16       0.75      0.07      0.12       601
          17       0.83    

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.84      0.87      0.85      2948
           1       0.74      0.78      0.76      3681
           2       0.86      0.89      0.88     13914
           3       0.50      0.01      0.03       340
           4       0.55      0.52      0.53      2545
           5       0.71      0.71      0.71      1124
           6       0.50      0.60      0.54       533
           7       0.88      0.03      0.07       203
           8       0.69      0.78      0.73      1369
           9       0.63      0.66      0.64      1341
          10       0.85      0.83      0.84      9301
          11       0.40      0.09      0.15       851
          12       0.33      0.00      0.01       343
          13       0.49      0.19      0.27       247
          14       0.52      0.33      0.40       439
          15       0.53      0.48      0.50      3534
          16       0.73      0.09      0.16       569
          17       0.75    

It seems that we got a better model since the f1-score improved to 0.47. But still there is room for more improvement. We can continue the model training stage with trying more models, feature engineering, model hyper-parameter optimization, number of features, samples per category, and different word embeddings.

Finally we save the Naive Base as our baseline model.

## Inference

Let's predict the class of two samples using our model.

In [46]:
# source: https://arxiv.org/pdf/2112.00728, category:math
Sample1 = '''
we address the problem of reshaping light in the schrödinger optics regime from the
perspective of optimal control theoryN in technological applicationsL schrödinger optics is often
used to model a slowlyMvarying amplitude of a paraMaxially propagating electric field where the
square of the waveguideGs index of refraction is treated as the potentialN the objective of the
optimal control problem is to find the controlling potential whichL together with the constraining
schrödinger dynamicsL optimally reshape the intensity distribution of schrödinger eigenfunctions
from one end of the waveguide to the otherN this work considers reshaping problems found
in work due to kunkel and legerL and addresses computational needs by adopting tools from
the quantum control literatureN the success of the optimal control approach is demonstrated
numerically.
'''

# source: https://arxiv.org/pdf/2112.00746, category:hep-th
Sample2 = '''
We study quantum fields on an arbitrary, rigid background with boundary. We derive
the action for a scalar in the holographic basis that separates the boundary and bulk de-
grees of freedom. From this holographic action, a relation between Dirichlet and Neumann
propagators valid for any background is obtained. As an application in a warped back-
ground, we derive an exact formula for the flux of bulk modes emitted from the boundary.
We also derive the holographic action in the presence of two boundaries. Integrating out
free bulk modes, we derive a formula for the Casimir pressure on a (d−1)-brane depending
only on the boundary-to-bulk propagators. In AdS2 we find that the quantum force pushes
a point particle toward the AdS2 boundary. In higher dimensional AdSd+1 the quantum
pressure amounts to a detuning of the brane tension, which gets renormalized for even d.
We evaluate the one-loop boundary effective action in the presence of interactions by
means of the heat kernel expansion. We integrate out a heavy scalar fluctuation with scalar
interactions in AdSd+1, obtaining the long-distance effective Lagrangian encoding loop-
generated boundary-localized operators. We integrate out nonabelian vector fluctuations
in AdS4,5,6 and obtain the associated holographic Yang-Mills β functions. Turning to the
expanding patch of dS, following recent proposals, we provide a boundary effective action
generating the perturbative cosmological correlators by analytically continuing from dS to
EAdS. We obtain the “cosmological” heat kernel coefficients in the scalar case and work
out the divergent part of the dS4 effective action which renormalizes the cosmological
correlators. More developments are needed to extract all one-loop information from the
cosmological effective action.
'''

In [39]:
# STEPS
# load model
# load vectorizer
# predict the category

In [40]:
# loading the trained model and vectorizer
model_bin = open(config.MODEL_OUTPUT_PATH, 'rb') 
classifier = joblib.load(model_bin)

vectorizer_bin = open(config.VECTORIZER_PATH, 'rb') 
vectorizer = joblib.load(vectorizer_bin)

In [47]:
vectorizer.get_feature_names_out()

array(['aa', 'ab', 'ab initio', ..., 'zeros', 'zeta', 'zone'],
      dtype=object)

In [48]:
input_str = vectorizer.transform([Sample1, Sample2])

In [49]:
input_str

<2x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 133 stored elements in Compressed Sparse Row format>

In [50]:
predictions = classifier.predict(input_str)
predictions

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [52]:
classifier.classes_

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [51]:
probs=classifier.predict_proba(input_str)
probs

array([[0.0107416 , 0.02031755, 0.3593347 , 0.00967294, 0.0906358 ,
        0.00963807, 0.00978814, 0.00893255, 0.00833041, 0.00673122,
        0.00551098, 0.00664394, 0.01496101, 0.00934039, 0.01150463,
        0.16272638, 0.01251457, 0.0135098 , 0.05720226, 0.00455834],
       [0.02260705, 0.02337614, 0.00987463, 0.01550031, 0.00341146,
        0.13106048, 0.00800314, 0.02474755, 0.03461081, 0.20197394,
        0.06220145, 0.08799639, 0.02911355, 0.01423868, 0.03047882,
        0.0576527 , 0.01702609, 0.01567191, 0.02728399, 0.0027305 ]])

The classifier predicts none of the categories for the two samples, which in this case might be from poor performance of the trained model.