#### 1. Load Data and Libraries
#### First, we import the necessary libraries and load the arxiv_data.csv file using pandas.

In [7]:
!pip install matplotlib
!pip install seaborn
!pip install nltk
!pip install scikit-learn



In [8]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv('arxiv_data.csv')

df = df[['titles', 'summaries', 'terms']]
df.rename(columns={'titles': 'title', 'summaries': 'abstract', 'terms': 'categories'}, inplace=True)

print("Data loaded successfully.")
df.head()

Data loaded successfully.


| | title | abstract | categories |
|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] |


#### 2. Clean the Abstract Text
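#### Next, we define and apply a function that cleans the raw abstract text: it lowercases the text, removes bracketed spans, drops every character other than a-z and whitespace, and removes English stop words. This makes the text uniform for the model.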

In [9]:
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Applies basic text cleaning."""
    text = text.lower() 
    text = re.sub(r'\[.*?\]', '', text) 
    text = re.sub(r'[^a-z\s]', '', text) 
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

df['abstract_cleaned'] = df['abstract'].apply(clean_text)

print("Text cleaning complete. Here's a before and after example:")
print("\nOriginal:\n", df['abstract'].iloc[0])
print("\nCleaned:\n", df['abstract_cleaned'].iloc[0])

Text cleaning complete. Here's a before and after example:

Original:
 Stereo matching is one of the widely used techniques for inferring depth from
stereo images owing to its robustness and speed. It has become one of the major
topics of research since it finds its applications in autonomous driving,
robotic navigation, 3D reconstruction, and many other fields. Finding pixel
correspondences in non-textured, occluded and reflective areas is the major
challenge in stereo matching. Recent developments have shown that semantic cues
from image segmentation can be used to improve the results of stereo matching.
Many deep neural network architectures have been proposed to leverage the
advantages of semantic segmentation in stereo matching. This paper aims to give
a comparison among the state of art networks both in terms of accuracy and in
terms of speed which are of higher importance in real-time applications.

Cleaned:
 stereo matching one widely used techniques inferring depth stereo imag
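To make the effect of each cleaning step concrete, here is a small sketch that applies the same four steps to one sentence. The tiny `DEMO_STOP_WORDS` set and the `clean_text_demo` name are assumptions made for illustration so the snippet runs without downloading the NLTK corpus; the notebook itself uses the full NLTK English list.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
DEMO_STOP_WORDS = {"is", "for", "the", "of", "and", "in", "to"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Same four steps as clean_text, with the stop-word set passed in."""
    text = text.lower()                    # 1. lowercase
    text = re.sub(r'\[.*?\]', '', text)    # 2. drop bracketed spans such as [1]
    text = re.sub(r'[^a-z\s]', '', text)   # 3. keep only a-z and whitespace
    # 4. drop stop words and collapse repeated whitespace
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "Stereo matching [1] is used for 3D, real-time depth."
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that step 3 removes digits and hyphens as well as punctuation, so "3D" is reduced to "d" and "real-time" becomes "realtime". Whether that loss matters depends on how informative such tokens are for the categories.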

#### 3. Prepare the Labels (Multi-Hot Encoding)
#### Finally, we convert the text-based category labels into a numerical (multi-hot) format that the model can understand.

In [10]:
df['categories_list'] = df['categories'].apply(lambda x: x.split())

mlb = MultiLabelBinarizer()

y = mlb.fit_transform(df['categories_list'])

print("\nLabel processing complete.")
print("Shape of the new numerical labels:", y.shape)
print("Example of the first paper's numerical label vector:", y[0])
print("The first 10 class names found:", mlb.classes_[:10])


Label processing complete.
Shape of the new numerical labels: (51774, 1392)
Example of the first paper's numerical label vector: [0 0 0 ... 0 0 0]
The first 10 class names found: ["'00']" "'00-02']" "'00B25']" "'00Bxx'," "'03B52," "'03B70," "'05B45,"
 "'05C20," "'05C21," "'05C50']"]


#### We have now successfully preprocessed our data. We have the cleaned text in df['abstract_cleaned'] and the corresponding numerical labels in the y variable, ready for modeling.
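One thing to check before modeling: the class names printed in step 3 carry stray quotes, commas, and brackets, which suggests that each `categories` cell is a string such as `['cs.CV', 'cs.LG']` and that `split()` cuts it on whitespace rather than parsing it. The sketch below, which assumes every cell is a valid Python list literal, compares `split()` with `ast.literal_eval` on three rows taken from the preview in step 1. The names `raw`, `parsed_labels`, and `mlb_demo` are illustrative only, and the shapes and scores reported in this notebook come from the `split()` version.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Three category strings copied from the df.head() preview in step 1.
raw = ["['cs.CV', 'cs.LG']", "['cs.CV', 'cs.AI', 'cs.LG']", "['cs.CV']"]

# Whitespace split keeps the quotes, commas, and brackets inside each token.
split_labels = [s.split() for s in raw]

# literal_eval parses the string into a real Python list of category names.
parsed_labels = [ast.literal_eval(s) for s in raw]

mlb_demo = MultiLabelBinarizer()
y_demo = mlb_demo.fit_transform(parsed_labels)

print("split():       ", split_labels[0])
print("literal_eval():", parsed_labels[0])
print("classes:", list(mlb_demo.classes_))
print(y_demo)
```

With parsed lists, each column of the multi-hot matrix corresponds to one category, so the label space would likely be much smaller than the 1392 columns reported above.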

#### Next, we create a simple baseline model. It gives a performance score that the more complex deep learning model will need to beat.

#### 4. Split Data for Training and Testing

#### Now we will split the cleaned text (X) and numerical labels (y) into a training set (to teach the model) and a testing set (to evaluate it).

In [11]:
from sklearn.model_selection import train_test_split

X = df['abstract_cleaned']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets.")
print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))

Data split into training and testing sets.
Training samples: 41419
Testing samples: 10355


#### 5. Vectorize Text with TF-IDF

#### Now we convert the text into numerical vectors using TfidfVectorizer. The model can only work with numbers, not raw text.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)

X_test_tfidf = tfidf.transform(X_test)

print("\nText vectorization complete.")
print("Shape of the TF-IDF matrix for training data:", X_train_tfidf.shape)


Text vectorization complete.
Shape of the TF-IDF matrix for training data: (41419, 5000)
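The vectorizer is fitted on the training text only and then reused on the test text, so the test set never influences the vocabulary or the IDF weights. A minimal sketch on a toy corpus (the documents and the `tfidf_demo` name are made up for illustration) shows what that means for words that appear only at test time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["stereo matching depth", "semantic segmentation depth"]
test_docs = ["stereo segmentation transformer"]  # "transformer" never appears in training

tfidf_demo = TfidfVectorizer()
train_matrix = tfidf_demo.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = tfidf_demo.transform(test_docs)        # reuses them; unseen words are ignored

vocab = sorted(tfidf_demo.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray().round(3))
```

Both matrices have one column per training-vocabulary word, which is why the training matrix above has exactly `max_features=5000` columns and the test matrix will have the same width.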


#### 6. Train and Evaluate the Baseline Model

#### Finally, we train a LogisticRegression model, wrapped in a OneVsRestClassifier to handle the multi-label nature of the problem, and evaluate it with the micro-averaged F1 score.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score, accuracy_score

lr = LogisticRegression(solver='liblinear', random_state=42)
clf = OneVsRestClassifier(lr)

print("\nTraining the baseline model...")
clf.fit(X_train_tfidf, y_train)
print("Training complete.")

y_pred = clf.predict(X_test_tfidf)

f1_micro = f1_score(y_test, y_pred, average='micro')
print(f"\nBaseline Model F1-Score (micro): {f1_micro:.4f}")


Training the baseline model...




Training complete.

Baseline Model F1-Score (micro): 0.4405
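To make the metric concrete: micro-averaging pools true positives, false positives, and false negatives over every (paper, label) cell before computing F1, so frequent labels weigh more than rare ones. The sketch below recomputes it by hand on a tiny made-up example; `micro_f1` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over every (sample, label) cell."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return 2 * tp / denom if denom else 0.0

# 3 papers, 3 labels: 3 true positives, 0 false positives, 2 false negatives.
y_true_demo = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
y_pred_demo = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
print(micro_f1(y_true_demo, y_pred_demo))  # 2*3 / (2*3 + 0 + 2) = 0.75
```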


#### Now that we have a baseline score, it's time to build the more powerful deep learning model. This involves preparing the text data in a new way and then building, training, and evaluating the LSTM model.

#### 7. Prepare Data for the Deep Learning Model 
#### We will convert the text into sequences of integers and then pad them so they are all the same length. Neural networks require inputs to have a fixed size.

In [15]:
from keras.src.legacy.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

X_train_text, X_test_text, y_train, y_test = train_test_split(df['abstract_cleaned'], y, test_size=0.2, random_state=42)

vocab_size = 10000
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(X_train_text)

X_train_seq = tokenizer.texts_to_sequences(X_train_text)
X_test_seq = tokenizer.texts_to_sequences(X_test_text)

max_length = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding='post')

print("Text data has been tokenized and padded.")
print("Shape of padded training data:", X_train_pad.shape)

Text data has been tokenized and padded.
Shape of padded training data: (41419, 200)
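As a rough illustration of what tokenizing and padding do, the pure-Python sketch below mimics the behavior used in this step: words are ranked by frequency, only indices below `num_words` are kept, unknown words are dropped, short sequences are zero-padded at the end (`padding='post'`), and long ones keep their last `maxlen` tokens (the `pad_sequences` default, `truncating='pre'`). The helper names are hypothetical and this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [w for w, _ in counts.most_common()]
    # Keep only indices strictly below num_words, i.e. the top num_words - 1 words.
    return {w: i + 1 for i, w in enumerate(ranked) if i + 1 < num_words}

def to_sequence(text, word_index):
    """Map known words to integers and silently drop unknown ones."""
    return [word_index[w] for w in text.split() if w in word_index]

def pad_post(seq, maxlen):
    """Zero-pad at the end; if too long, keep the last maxlen tokens."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

texts = ["stereo matching depth", "stereo depth estimation"]
word_index = fit_word_index(texts, num_words=4)
short = pad_post(to_sequence("stereo depth estimation", word_index), maxlen=4)
long_ = pad_post([1, 2, 3, 1, 2], maxlen=4)
print(word_index)
print(short)
print(long_)
```

With `max_length = 200`, any cleaned abstract longer than 200 tokens loses its opening words under the default truncation, which may or may not be the desired behavior.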


#### 8. Build, Compile, and Train the LSTM Model
#### Now we define the architecture of our LSTM model using Keras, compile it with a loss function and optimizer suited to multi-label output (sigmoid activations with binary cross-entropy), and train it on the prepared data.
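The cell that follows also computes class weights with `compute_class_weight('balanced', ...)` over the flattened label matrix. As a minimal sketch of what that heuristic does, the snippet below applies the same formula, `n_samples / (n_classes * count_per_class)`, to a small made-up multi-hot matrix. `balanced_weights` and `y_toy` are illustrative names, not part of the notebook.

```python
import numpy as np

def balanced_weights(labels):
    """n_samples / (n_classes * count_per_class), the 'balanced' heuristic."""
    labels = np.asarray(labels).ravel()          # pool every (paper, label) cell
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy multi-hot matrix: 4 papers, 5 labels, only 3 positive entries out of 20.
y_toy = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
])
weights_toy = balanced_weights(y_toy)
print(weights_toy)  # 0 -> 20 / (2 * 17), 1 -> 20 / (2 * 3)
```

Because flattening pools all label columns, the result is a single weight for every 0 entry and a single weight for every 1 entry; it does not distinguish rare labels from frequent ones. Whether a two-entry dictionary like this has the intended effect when passed as `class_weight` to a multi-label Keras model is an assumption worth verifying against the Keras documentation.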

In [20]:
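from sklearn.utils import class_weight
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

# Weights for the values 0 and 1, computed over all entries of the flattened label matrix
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(y_train.flatten()),
    y=y_train.flatten()
)
class_weight_dict = {i: weight for i, weight in enumerate(class_weights)}

print("Calculated Class Weights:", class_weight_dict)

# Assumed definition (num_classes is not defined in the cells shown): one output unit per label column
num_classes = y_train.shape[1]

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length),
    LSTM(128, return_sequences=False),
    Dropout(0.5),
    Dense(num_classes, activation='sigmoid')  # independent sigmoid per label
])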

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print("\nTraining the LSTM model with class weights...")
history = model.fit(X_train_pad, y_train,
                    epochs=20, # Keep the increased epochs
                    batch_size=32,
                    validation_split=0.1,
                    class_weight=class_weight_dict) # <-- APPLY THE WEIGHTS HERE
print("Training complete.")

Calculated Class Weights: {0: np.float64(0.5007436386680713), 1: np.float64(336.6847772768681)}

Training the LSTM model with class weights...




Epoch 1/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 152s 129ms/step - accuracy: 0.1415 - loss: 0.0161 - val_accuracy: 0.0000e+00 - val_loss: 0.0054
Epoch 2/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 156s 134ms/step - accuracy: 0.1386 - loss: 0.0055 - val_accuracy: 0.3416 - val_loss: 0.0053
Epoch 3/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 159s 136ms/step - accuracy: 0.1399 - loss: 0.0055 - val_accuracy: 0.3416 - val_loss: 0.0053
Epoch 4/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 148s 127ms/step - accuracy: 0.1416 - loss: 0.0055 - val_accuracy: 0.0000e+00 - val_loss: 0.0054
Epoch 5/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 143s 123ms/step - accuracy: 0.1394 - loss: 0.0055 - val_accuracy: 0.0000e+00 - val_loss: 0.0054
Epoch 6/20
1165/1165 ━━━━━━━━━━━━━━━━━━━━ 145s 124ms/step - accuracy: 0.1313 - loss: 0.0056 - val_accuracy: 0.3416

#### 9. Evaluate the Deep Learning Model
#### Finally, we make predictions on the test set, compute the micro F1 score, and compare it with the baseline score.

In [21]:
import numpy as np

y_pred_probs = model.predict(X_test_pad)

y_pred_dl = (y_pred_probs > 0.5).astype(int)

f1_micro_dl = f1_score(y_test, y_pred_dl, average='micro')

print(f"\nDeep Learning Model F1-Score (micro): {f1_micro_dl:.4f}")
print(f"Baseline Model F1-Score (micro): {f1_micro:.4f}")

improvement = ((f1_micro_dl - f1_micro) / f1_micro) * 100
print(f"Improvement over baseline: {improvement:.2f}%")

324/324 ━━━━━━━━━━━━━━━━━━━━ 13s 40ms/step

Deep Learning Model F1-Score (micro): 0.0000
Baseline Model F1-Score (micro): 0.4405
Improvement over baseline: -100.00%
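A micro F1 of exactly 0.0000 means that no predicted label matched a true label. One plausible cause, which the outputs shown here do not confirm, is that no sigmoid output exceeds the 0.5 threshold, so `y_pred_dl` is entirely zeros. The sketch below uses made-up probabilities (not real model output) to show how one might check the maximum predicted probability and see how the score reacts to the threshold; `micro_f1_at` is a hypothetical helper.

```python
import numpy as np

# Hypothetical sigmoid outputs for 3 papers and 4 labels (not real model output).
probs_demo = np.array([
    [0.30, 0.05, 0.02, 0.01],
    [0.25, 0.20, 0.03, 0.02],
    [0.10, 0.04, 0.35, 0.01],
])
y_true_demo = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
])

def micro_f1_at(probs, y_true, threshold):
    """Threshold the probabilities, then pool TP/FP/FN over all cells."""
    y_pred = (probs > threshold).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print("max probability:", probs_demo.max())
for t in (0.5, 0.15):
    print("threshold", t, "-> micro F1", micro_f1_at(probs_demo, y_true_demo, t))
```

On the real model, the equivalent checks would be `y_pred_probs.max()` and `y_pred_dl.sum()`. If a lower threshold is used, it should be chosen on validation data rather than on the test set.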


In [22]:
import pickle

model.save('arxiv_classifier_model.h5')
print("Keras model saved as arxiv_classifier_model.h5")

with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
print("Tokenizer saved as tokenizer.pickle")

with open('mlb.pickle', 'wb') as handle:
    pickle.dump(mlb, handle, protocol=pickle.HIGHEST_PROTOCOL)
print("MultiLabelBinarizer saved as mlb.pickle")



Keras model saved as arxiv_classifier_model.h5
Tokenizer saved as tokenizer.pickle
MultiLabelBinarizer saved as mlb.pickle
