In [33]:
import pandas as pd
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    TfidfVectorizer,
    CountVectorizer,
)
from sklearn.preprocessing import LabelEncoder
import re
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch.optim as optim

### Import Dataset

In [34]:
train_df = pd.read_csv("../Data/arxiv_train.csv")
test_df = pd.read_csv("../Data/arxiv_test.csv")

### Dataset Overview

In [35]:
research_fields = sorted(train_df["label"].unique())

# Getting number of rows
num_rows = train_df.shape[0]
print(f"Number of articles: {num_rows}\n")

# Printing each unique element in an ordered list format
print("Research Fields")
for i, element in enumerate(research_fields, start=1):
    print(f"{i}. {element}")

Number of articles: 80000

Research Fields
1. astro-ph
2. cond-mat
3. cs
4. eess
5. hep-ph
6. hep-th
7. math
8. physics
9. quant-ph
10. stat


### Data Cleaning

In [36]:
def clean_text(text):
    # Replace every non-word character with a space
    text = re.sub(r"\W", " ", text)

    # Remove single letters surrounded by whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)

    # Remove a single letter at the start of the text
    # (an unescaped ^ anchors the match; \^ would look for a literal caret)
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)

    # Remove a leading 'b' left over from stringified bytes literals
    text = re.sub(r"^b\s+", "", text)

    # Convert to lowercase
    text = text.lower()

    # Remove stopwords (ENGLISH_STOP_WORDS is already a frozenset)
    text = " ".join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)

    return text


# Cleaning the abstracts
train_df["cleaned_abstract"] = train_df["abstract"].apply(clean_text)
test_df["cleaned_abstract"] = test_df["abstract"].apply(clean_text)
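
#### A note on the anchors used in these patterns: inside a regular expression an unescaped `^` matches the start of the string, while `\^` matches a literal caret character. The small sketch below, using a made-up sample string, shows the difference for the single-letter pattern.

```python
import re

sample = "k regular graphs"  # hypothetical input, not taken from the dataset

# Escaped caret: looks for a literal '^', which this text does not contain
escaped = re.sub(r"\^[a-zA-Z]\s+", " ", sample)

# Unescaped caret: anchors at the start of the string and drops the leading letter
anchored = re.sub(r"^[a-zA-Z]\s+", " ", sample)

print(repr(escaped))
print(repr(anchored))
```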

In [23]:
train_df.head(10)

| Unnamed: 0.1 | Unnamed: 0 | abstract | label | cleaned_abstract |
|---|---|---|---|---|
| 0 | 31716 | Automatic meeting analysis is an essential f... | eess | automatic meeting analysis essential fundament... |
| 1 | 89533 | We propose a protocol to encode classical bi... | quant-ph | propose protocol encode classical bits measure... |
| 2 | 82700 | A number of physically intuitive results for... | quant-ph | number physically intuitive results calculatio... |
| 3 | 78830 | In the last decade rare-earth hexaborides ha... | physics | decade rare earth hexaborides investigated fun... |
| 4 | 94948 | We introduce the weak barycenter of a family... | stat | introduce weak barycenter family probability d... |
| 5 | 74849 | Direct Statistical Simulation (DSS) solves t... | physics | direct statistical simulation dss solves equat... |
| 6 | 66424 | We introduce a notion of a girth-regular gra... | math | introduce notion girth regular graph k regular... |
| 7 | 6562 | Planet host stars with well-constrained ages... | astro-ph | planet host stars constrained ages provide rar... |
| 8 | 84292 | Unprecedented increase of complexity and sca... | quant-ph | unprecedented increase complexity scale data e... |
| 9 | 18822 | The usual concepts of topological physics, s... | cond-mat | usual concepts topological physics berry curva... |


### Feature Extraction
#### We extract features with both TF-IDF and CountVectorizer. Because the training and test sets are separate, we call fit_transform on the training data and transform on the test data, so the vocabulary and IDF weights are learned from the training set only and the test set stays unseen.

#### TF-IDF Vectorizer

In [37]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df["cleaned_abstract"])
X_test_tfidf = tfidf_vectorizer.transform(test_df["cleaned_abstract"])

#### CountVectorizer

In [38]:
count_vectorizer = CountVectorizer(max_features=5000)
X_train_count = count_vectorizer.fit_transform(train_df["cleaned_abstract"])
X_test_count = count_vectorizer.transform(test_df["cleaned_abstract"])
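
#### The fit/transform split can be sketched on a toy corpus. The strings and the `toy_*` names below are made up for illustration; the point is that terms appearing only in the test text get no column, because the vocabulary is fixed by the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for the cleaned abstracts (hypothetical data)
toy_train = ["quantum entanglement protocol", "graph theory regular graph"]
toy_test = ["quantum graph neural network"]

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
toy_test_matrix = toy_vectorizer.transform(toy_test)  # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

#### Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.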

### PyTorch Data Preparation
#### We convert feature vectors and labels into PyTorch tensors. For illustration, we convert the TF-IDF vectors and corresponding labels:

In [39]:
# Encode categorical labels to integers
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([train_df["label"], test_df["label"]]))

y_train_encoded = label_encoder.transform(train_df["label"])
y_test_encoded = label_encoder.transform(test_df["label"])

# Convert to PyTorch tensors
X_train_tensor_tfidf = torch.FloatTensor(X_train_tfidf.toarray())
y_train_tensor = torch.LongTensor(y_train_encoded)
X_test_tensor_tfidf = torch.FloatTensor(X_test_tfidf.toarray())
y_test_tensor = torch.LongTensor(y_test_encoded)

# Create TensorDatasets and DataLoaders for TF-IDF
train_data_tfidf = TensorDataset(X_train_tensor_tfidf, y_train_tensor)
test_data_tfidf = TensorDataset(X_test_tensor_tfidf, y_test_tensor)

batch_size = 64
train_loader_tfidf = DataLoader(train_data_tfidf, shuffle=True, batch_size=batch_size)
test_loader_tfidf = DataLoader(test_data_tfidf, batch_size=batch_size)
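
#### Calling `.toarray()` on the full training matrix materializes 80,000 x 5,000 float32 values, roughly 1.6 GB. If memory is tight, one option is to keep the matrix sparse and densify one row at a time inside a custom `Dataset`. The sketch below is an assumption-laden illustration: `SparseRowDataset` is a hypothetical helper, and the toy matrix stands in for `X_train_tfidf`.

```python
import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import Dataset, DataLoader


class SparseRowDataset(Dataset):
    """Hypothetical helper: keeps features sparse and densifies one row per item."""

    def __init__(self, features, labels):
        self.features = features.tocsr()
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # Only this single row is converted to a dense array
        row = self.features[idx].toarray().ravel()
        return torch.from_numpy(row).float(), self.labels[idx]


# Toy stand-in data: 10 samples, 20 features, 3 classes
toy_features = sp.csr_matrix(np.eye(10, 20, dtype=np.float32))
toy_labels = np.arange(10) % 3

toy_loader = DataLoader(SparseRowDataset(toy_features, toy_labels), batch_size=4)
first_x, first_y = next(iter(toy_loader))
print(first_x.shape, first_y.tolist())
```

#### The trade-off is extra work per item in exchange for a much smaller memory footprint; for data that fits in memory, the dense `TensorDataset` above is simpler.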

### Defining the Neural Network Models
### A. Feedforward Neural Network (FFNN)
#### We start by defining a simple FFNN architecture in PyTorch. This model will take the TF-IDF (or CountVectorizer) features as input.

In [40]:
class FFNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 512)  # First hidden layer
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, output_dim)  # Output layer
        self.softmax = nn.Softmax(dim=1)  # Turns logits into class probabilities

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        # Caution: nn.CrossEntropyLoss applies log-softmax itself, so training on
        # these probabilities normalizes twice; returning the fc2 logits avoids that.
        out = self.softmax(out)
        return out

input_dim = X_train_tensor_tfidf.shape[1]  # This will depend on our feature extraction
output_dim = len(label_encoder.classes_)  # Number of unique labels
model_ffnn = FFNN(input_dim, output_dim)

### B. Recurrent Neural Network (RNN)
#### Using an RNN on bag-of-words features is unconventional, since RNNs are typically applied to sequential data such as the original token sequence. In the training loop further down, each TF-IDF vector is fed as a sequence of length 1, so the recurrent connection is never exercised. For educational purposes, or as a starting point for sequential inputs later, here is a basic RNN model in PyTorch:

In [56]:
class BasicRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Expected input shape: [batch_size, seq_len, input_size]
        out, _ = self.rnn(x)
        if out.dim() == 3:  # Batched input -> [batch_size, seq_len, hidden_size]
            out = out[:, -1, :]  # Keep the output of the last time step
        elif out.dim() == 2:
            # A 2D input is read by nn.RNN as one unbatched sequence, so out is
            # [seq_len, hidden_size]: rows are time steps, not independent samples
            pass
        else:
            raise ValueError("Unexpected output shape from RNN layer")

        out = self.fc(out)  # Raw logits, as nn.CrossEntropyLoss expects
        return out

# Note: For RNN, input_dim would typically be the size of the embedding dimension
hidden_dim = 128  # Number of features in the hidden state
num_layers = 1  # Number of stacked RNN layers
model_rnn = BasicRNN(input_dim, hidden_dim, output_dim, num_layers)
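
#### With a sequence length of 1 and a zero initial hidden state, a single-layer `nn.RNN` computes tanh(W_ih x + b_ih + b_hh), which is an ordinary dense layer with a tanh activation. The sketch below checks this on a small, randomly initialized layer; `toy_rnn` and the tensor sizes are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_rnn = nn.RNN(input_size=6, hidden_size=4, num_layers=1, batch_first=True)
x = torch.randn(3, 6)  # 3 samples, 6 features

with torch.no_grad():
    # Feed each sample as a sequence of length 1: [3, 1, 6] -> [3, 1, 4]
    rnn_out, _ = toy_rnn(x.unsqueeze(1))

    # Same computation written as a dense layer; the W_hh term vanishes because h_0 = 0
    manual = torch.tanh(
        x @ toy_rnn.weight_ih_l0.T + toy_rnn.bias_ih_l0 + toy_rnn.bias_hh_l0
    )

outputs_match = torch.allclose(rnn_out[:, -1, :], manual, atol=1e-6)
print(outputs_match)
```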

### Training Our Models
#### Training a model in PyTorch involves setting up a loss function and an optimizer, then looping over our training data to make predictions, compute the loss, and update the model parameters.

### Training FFNN model:

In [44]:
num_epochs = 10  # Number of times to iterate over the entire dataset

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_ffnn.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader_tfidf):
        optimizer.zero_grad()
        output = model_ffnn(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Print average loss for the epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader_tfidf)}")

Epoch 1/10, Loss: 1.534249767780304
Epoch 2/10, Loss: 1.531530926990509
Epoch 3/10, Loss: 1.530596013736725
Epoch 4/10, Loss: 1.5300429203987123
Epoch 5/10, Loss: 1.5296407349586487
Epoch 6/10, Loss: 1.5290078923225403
Epoch 7/10, Loss: 1.5283774554252625
Epoch 8/10, Loss: 1.5278866076469422
Epoch 9/10, Loss: 1.5273891352653504
Epoch 10/10, Loss: 1.5271688702583313
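
#### A possible explanation for the loss staying near 1.53: `nn.CrossEntropyLoss` applies log-softmax to whatever it receives, and the FFNN already applies `nn.Softmax` in `forward`. Because probabilities lie in [0, 1], the second softmax can never become sharply peaked, and for 10 classes the loss cannot drop below ln(e + 9) - 1 ≈ 1.461 even when every prediction is correct. The sketch below checks this floor numerically and contrasts it with the loss on raw logits; the tensors are made-up examples, not model outputs.

```python
import math

import torch
import torch.nn as nn

num_classes = 10
criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])

# Best case after Softmax: probability 1.0 on the correct class
perfect_probs = torch.zeros(1, num_classes)
perfect_probs[0, 0] = 1.0
loss_on_probs = criterion(perfect_probs, target).item()

# Closed form of the same quantity: log(e + (K - 1)) - 1
floor = math.log(math.e + (num_classes - 1)) - 1.0

# Raw logits that strongly favor the correct class
confident_logits = torch.full((1, num_classes), -20.0)
confident_logits[0, 0] = 20.0
loss_on_logits = criterion(confident_logits, target).item()

print(f"loss on probabilities: {loss_on_probs:.4f} (floor {floor:.4f})")
print(f"loss on logits: {loss_on_logits:.6f}")
```

#### Under this reading, returning the `fc2` output directly from `forward` would let the loss fall toward zero. For a given set of weights, `argmax` predictions are the same with or without the softmax.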


### Training RNN model:

In [57]:
num_epochs = 10  # Define the number of epochs

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_rnn.parameters(), lr=0.001)

for epoch in range(num_epochs):
    total_loss = 0  # Running sum of batch losses for this epoch
    for data, targets in train_loader_tfidf:
        optimizer.zero_grad()

        # Add a sequence dimension of length 1: [batch, input_dim] -> [batch, 1, input_dim].
        # Unlike view(batch_size, ...), this also works if the last batch is smaller.
        data = data.unsqueeze(1)

        output = model_rnn(data)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()  # Add up batch loss

    avg_loss = total_loss / len(train_loader_tfidf)  # Average loss for the epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss}")

Epoch 1/10, Average Loss: 0.6999785131573677
Epoch 2/10, Average Loss: 0.4093828803062439
Epoch 3/10, Average Loss: 0.3583139921784401
Epoch 4/10, Average Loss: 0.32706051809191705
Epoch 5/10, Average Loss: 0.30389044005274773
Epoch 6/10, Average Loss: 0.28642616223692896
Epoch 7/10, Average Loss: 0.2716622624397278
Epoch 8/10, Average Loss: 0.25951268746852874
Epoch 9/10, Average Loss: 0.2491700305700302
Epoch 10/10, Average Loss: 0.23984440425634385


### Evaluate FFNN Model

In [61]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Disable gradient computation
with torch.no_grad():
    # Make predictions
    y_pred_ffnn = model_ffnn(X_test_tensor_tfidf).argmax(dim=1).numpy()
    y_true_ffnn = y_test_tensor.numpy()

# Compute metrics
accuracy_ffnn = accuracy_score(y_true_ffnn, y_pred_ffnn)
f1_ffnn = f1_score(y_true_ffnn, y_pred_ffnn, average="macro")
precision_ffnn = precision_score(y_true_ffnn, y_pred_ffnn, average="macro")
recall_ffnn = recall_score(y_true_ffnn, y_pred_ffnn, average="macro")

print(
    f"FFNN Model\nAccuracy: {accuracy_ffnn * 100:.2f}%\nF1 Score: {f1_ffnn * 100:.2f}%\nPrecision: {precision_ffnn * 100:.2f}%\nRecall: {recall_ffnn * 100:.2f}%\n"
)

FFNN Model
Accuracy: 81.74%
F1 Score: 81.58%
Precision: 81.62%
Recall: 81.71%
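
#### All metrics above use `average="macro"`, which computes the metric per class and then takes the unweighted mean, so every research field counts equally regardless of how many test abstracts it has. A small sketch with made-up labels illustrates this for precision.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels for a 3-class problem
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class = precision_score(toy_true, toy_pred, average=None)  # one value per class
macro = precision_score(toy_true, toy_pred, average="macro")  # unweighted mean of those

print(per_class)
print(macro, np.mean(per_class))
```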



### Evaluate RNN Model

In [62]:
# The test batches are passed as 2D tensors of shape [batch_size, input_dim].
# nn.RNN reads a 2D tensor as one unbatched sequence, so each batch is treated here
# as batch_size time steps rather than as independent samples; training instead used
# [batch_size, 1, input_dim]. Calling model_rnn(X_batch.unsqueeze(1)) would evaluate
# on the same shape that was used for training.
with torch.no_grad():
    predictions, true_labels = [], []
    for X_batch, y_batch in test_loader_tfidf:
        y_pred_rnn = model_rnn(X_batch).argmax(dim=1)
        predictions.extend(y_pred_rnn.tolist())
        true_labels.extend(y_batch.tolist())

# Compute metrics for RNN
accuracy_rnn = accuracy_score(true_labels, predictions)
f1_rnn = f1_score(true_labels, predictions, average="macro")
precision_rnn = precision_score(true_labels, predictions, average="macro")
recall_rnn = recall_score(true_labels, predictions, average="macro")

print(
    f"RNN Model\nAccuracy: {accuracy_rnn * 100:.2f}%\nF1 Score: {f1_rnn * 100:.2f}%\nPrecision: {precision_rnn * 100:.2f}%\nRecall: {recall_rnn * 100:.2f}%"
)

RNN Model
Accuracy: 78.21%
F1 Score: 78.32%
Precision: 78.66%
Recall: 78.18%
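
#### The evaluation loop passes 2D batches to the RNN, whereas training used 3D tensors with a sequence length of 1. Assuming a PyTorch version that accepts unbatched input (1.11 or later), `nn.RNN` reads a 2D tensor as a single sequence whose time steps are the rows, so the hidden state carries over from one abstract to the next within a batch. This may account for part of the gap between the low training loss and the test accuracy. The sketch below shows the difference on a small random layer; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_rnn = nn.RNN(input_size=6, hidden_size=4, batch_first=True)
batch = torch.randn(5, 6)  # 5 samples, 6 features each

with torch.no_grad():
    # 2D input: interpreted as ONE unbatched sequence with 5 time steps -> [5, 4]
    as_sequence, _ = toy_rnn(batch)

    # 3D input: 5 independent sequences of length 1 -> [5, 1, 4]
    as_batch, _ = toy_rnn(batch.unsqueeze(1))
    as_batch = as_batch[:, -1, :]

# The first row starts from a zero hidden state in both cases
first_row_matches = torch.allclose(as_sequence[0], as_batch[0], atol=1e-6)
# Later rows differ because the 2D path feeds earlier rows into the hidden state
later_rows_match = torch.allclose(as_sequence[1:], as_batch[1:], atol=1e-6)

print(first_row_matches, later_rows_match)
```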


### Logical Comparison and Insights:
#### Higher FFNN Performance: 
#### The FFNN model outperforms the RNN model on every metric (81.74% vs. 78.21% accuracy, 81.58% vs. 78.32% macro F1). This is a plausible outcome given the data and the task. BoW and TF-IDF turn each abstract into a fixed-length vector in which word order is not preserved, and a feedforward network consumes such a vector directly.

#### RNN's Sequential Nature: 
#### RNNs are designed to process sequential data, capturing dependencies across time steps. A BoW feature vector contains no such sequence, so the RNN's ability to model the order of words or phrases goes unused. During training each vector is fed as a single time step, which reduces the recurrent layer to one tanh layer with 128 hidden units, compared with the FFNN's 512 ReLU units. This helps explain why the RNN does not match the FFNN in this setting.

#### Precision vs. Recall: 
#### The RNN's precision (78.66%) is slightly higher than its recall (78.18%), which could suggest marginally more conservative positive predictions, while the FFNN's precision (81.62%) and recall (81.71%) are nearly equal. These are macro averages over ten classes and the gaps are under half a percentage point, so both models are best described as balanced between precision and recall.

### Conclusions:
#### The FFNN's stronger performance on this task is consistent with the match between the model and the input features. A feedforward network handles the fixed-length, order-free vectors produced by BoW techniques directly, making it a strong choice for non-sequential inputs.

#### The RNN's lower performance highlights the importance of aligning model choice with data characteristics. For sequential data, such as raw text where word order carries meaning, RNNs and their gated variants (LSTMs, GRUs) could outperform FFNNs. When the sequential information is absent or irrelevant, as with BoW features, RNNs lose that advantage.
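
#### For comparison, a sequence-aware setup would feed token ids in their original order rather than a BoW vector. The following is a minimal sketch of that idea, not something trained or evaluated in this notebook: `TokenGRUClassifier`, the toy vocabulary, and all sizes are hypothetical, and padding is handled naively.

```python
import torch
import torch.nn as nn


class TokenGRUClassifier(nn.Module):
    """Hypothetical sequence model: token ids -> embeddings -> GRU -> class logits."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # [batch, seq_len, embed_dim]
        _, last_hidden = self.gru(embedded)  # last_hidden: [1, batch, hidden_dim]
        return self.fc(last_hidden[-1])  # Raw logits: [batch, num_classes]


# Toy vocabulary and a batch of two padded token sequences (id 0 is padding).
# Real use would likely pack the sequences so padding does not reach the final state.
toy_vocab = {"<pad>": 0, "quantum": 1, "graph": 2, "theory": 3}
toy_batch = torch.tensor([[1, 3, 0], [2, 3, 1]])

toy_model = TokenGRUClassifier(len(toy_vocab), embed_dim=8, hidden_dim=16, num_classes=10)
toy_logits = toy_model(toy_batch)
print(toy_logits.shape)
```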

#### These results underscore the principle that no single model is universally best for all tasks. The choice of model should be guided by the nature of the task, the characteristics of the data, and the specific requirements of the application.

