**Lamiah Khan: NLP Project 2**

*  ***YOU DO NOT NEED TO RUN THE FOLLOWING ON COLAB!***: The dataset used for this project is Corpus 1. The first step is to convert the text files to CSV: the training set becomes train.csv and the test set becomes test.csv. I recommend running this script outside of Colab; otherwise you would have to upload every file from Corpus 1 (including those in the folder). I ran it in Spyder (Anaconda). The files the script produced are attached to the email. In case you want to try it yourself, here is the script!





In [None]:
import os
import pandas as pd

# link for reference code: https://www.geeksforgeeks.org/convert-text-file-to-csv-using-python-pandas/

def load_data(article_path, label_file):
    data = []
    with open(label_file, "r") as file:
        lines = file.readlines()
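    for line in lines:
        # Each line is expected to hold "<relative path> <label>"
        parts = line.strip().split()
        if len(parts) < 2:
            continue  # Skip blank or malformed lines instead of raising IndexError
        file_path = os.path.join(article_path, parts[0])
        label = parts[1]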
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                text = file.read()
            data.append({"text": text, "label": label})
        except FileNotFoundError:
            print(f"Warning: The file {file_path} does not exist and will be skipped.")
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    return pd.DataFrame(data)


def main():
    base_path = "./"
    train_label_file = "corpus1_train.labels"
    test_label_file = "corpus1_test.labels"

    try:
        print("Loading training data...")
        train_data = load_data(base_path, train_label_file)
        print("Loading testing data...")
        test_data = load_data(base_path, test_label_file)

        train_data.to_csv("train.csv", index=False)
        test_data.to_csv("test.csv", index=False)
        print("Dataset saved to CSV format")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    main()

Loading training data...
An error occurred: [Errno 2] No such file or directory: 'corpus1_train.labels'
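
As a rough illustration of what load_data expects, each line of a label file is split on whitespace into a relative path and a label. The sketch below isolates that parsing step; the helper name parse_label_line and the sample line are made up for illustration and are not taken from the corpus.

```python
import os

def parse_label_line(line, article_path="./"):
    """Split one label-file line into (file_path, label); return None if it is blank or malformed."""
    parts = line.strip().split()
    if len(parts) < 2:
        return None
    return os.path.join(article_path, parts[0]), parts[1]

# Hypothetical line; the real paths and label names come from the corpus files.
print(parse_label_line("articles/example.txt SomeLabel"))
print(parse_label_line(""))
```

Returning None for short lines is one possible choice; it lets the caller decide whether to skip or report them.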


***1ST CODE BLOCK FOR RUNNING***: The following code block imports all the necessary libraries and sets the device to GPU if one is available (otherwise CPU).

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
import re
import os
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')

# Downloading necessary NLTK data if not already downloaded
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


***2ND CODE BLOCK FOR RUNNING***: Next, we load and process all the files. To pre-process the text, I (1) lowercased it, (2) removed digits, and (3) removed stop words and whitespace-only tokens. To prepare it for the model, I (1) tokenized it, (2) built a vocabulary from the training set, and (3) converted the labels to numerical format; conversion to tensors happens later in TextDataset. I also wrote a function that trains word2vec embeddings for the static-embedding experiments. Index 0 of the vocabulary is reserved for the padding token; PyTorch RNN layers do not pad inputs on their own, so the padding to a fixed length is done in TextDataset.

In [None]:
# Creating the CSV files, and checking if it exists and if it doesn't then create it
train_csv = 'train.csv'
test_csv = 'test.csv'

if not os.path.exists(train_csv):
    print("Creating training CSV...")
    train_df = load_data('./', 'corpus1_train.labels')
    train_df.to_csv(train_csv, index=False)
else:
    train_df = pd.read_csv(train_csv)

if not os.path.exists(test_csv):
    print("Creating testing CSV...")
    test_df = load_data('./', 'corpus1_test.labels')
    test_df.to_csv(test_csv, index=False)
else:
    test_df = pd.read_csv(test_csv)

print("Train dataset shape:", train_df.shape)
print("Test dataset shape:", test_df.shape)


# Preprocessing function to clean and tokenize text
# A lot of this is imported from Project 1
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = str(text).lower()
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and whitespace
    tokens = [token for token in tokens if token not in stop_words and token.strip()]
    return tokens

# Encode labels as integers
label_encoder = LabelEncoder()
train_df['encoded_label'] = label_encoder.fit_transform(train_df['label'])

# Create vocabulary based on preprocessed training text
all_tokens = [token for text in train_df['text'] for token in preprocess_text(text)]
vocab = sorted(set(all_tokens))
word_to_idx = {word: idx + 1 for idx, word in enumerate(vocab)}  # Reserve 0 for padding
word_to_idx['<PAD>'] = 0
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print("Vocabulary size:", len(word_to_idx))

# Function to create Word2Vec embeddings
def create_word2vec_embeddings(texts, embed_dim):
    sentences = [preprocess_text(text) for text in texts]
    model = Word2Vec(sentences, vector_size=embed_dim, window=5, min_count=1, workers=4)
    return model


Train dataset shape: (894, 2)
Test dataset shape: (403, 2)
Vocabulary size: 4656
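
To make the vocabulary step concrete, here is a minimal sketch on toy data. It mirrors the idea of reserving index 0 for `<PAD>` and numbering the sorted tokens from 1; the helper names build_vocab and encode and the toy documents are assumptions for illustration only.

```python
def build_vocab(token_lists):
    """Map each distinct token to an index starting at 1; index 0 is kept for <PAD>."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    word_to_idx = {word: idx + 1 for idx, word in enumerate(vocab)}
    word_to_idx["<PAD>"] = 0
    return word_to_idx

def encode(tokens, word_to_idx):
    """Look up each token; tokens missing from the vocabulary fall back to 0."""
    return [word_to_idx.get(tok, 0) for tok in tokens]

toy_docs = [["stocks", "fell", "sharply"], ["team", "won", "match"]]
word_to_idx = build_vocab(toy_docs)
print(word_to_idx)
print(encode(["team", "fell", "unseen"], word_to_idx))  # [5, 1, 0]
```

Note that with this scheme unseen tokens and padding share index 0, so the model cannot tell them apart.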


***3RD CODE BLOCK FOR RUNNING***:
The TextDataset class is a custom dataset for handling text data in PyTorch, designed to:
* Initialize with text data and labels (optional) and additional parameters, including word_to_idx (for word-to-index mapping), max_len (maximum text length), and word2vec_model (for embeddings).
* Return the dataset length with __len__ to provide the number of samples.
* Retrieve preprocessed text with __getitem__, encoding text into word indices or embeddings, then padding or truncating to max_len.

In __getitem__:
* If use_static_embedding is True, each token is represented by its Word2Vec vector from word2vec_model if available, or a zero vector if not.
* If use_static_embedding is False, each token is encoded by looking up its index in word_to_idx (unknown tokens map to 0).
* Padding is added if the text is shorter than max_len; if it is longer, it is truncated.
* The function returns the encoded text (as a tensor) and, if available, the label as a tensor.


In [None]:
# References: https://discuss.huggingface.co/t/help-understanding-how-to-build-a-dataset-for-language-as-with-the-old-textdataset/5870
# https://github.com/huggingface/transformers/issues/24742

class TextDataset(Dataset):
    def __init__(self, texts, labels=None, word_to_idx=None, max_len=50, word2vec_model=None, use_static_embedding=False):
        self.texts = texts
        self.labels = labels
        self.word_to_idx = word_to_idx
        self.max_len = max_len
        self.word2vec_model = word2vec_model
        self.use_static_embedding = use_static_embedding

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        tokens = preprocess_text(text)
        if self.use_static_embedding:
            # Convert tokens to embeddings directly
            encoded_text = [self.word2vec_model.wv[token] if token in self.word2vec_model.wv else np.zeros(self.word2vec_model.vector_size) for token in tokens]
            # Pad or truncate to `max_len`
            if len(encoded_text) < self.max_len:
                encoded_text.extend([np.zeros(self.word2vec_model.vector_size)] * (self.max_len - len(encoded_text)))
            else:
                encoded_text = encoded_text[:self.max_len]
            encoded_text = np.array(encoded_text, dtype=np.float32)  # Ensure float32 type
        else:
            # Use indices for dynamic embeddings
            encoded_text = [self.word_to_idx.get(token, 0) for token in tokens]
            if len(encoded_text) < self.max_len:
                encoded_text += [0] * (self.max_len - len(encoded_text))
            else:
                encoded_text = encoded_text[:self.max_len]
            encoded_text = np.array(encoded_text, dtype=np.int64)

        if self.labels is not None:
            label = self.labels.iloc[idx]
            return torch.tensor(encoded_text), torch.tensor(label)
        else:
            return torch.tensor(encoded_text)

    def __len__(self):
        return len(self.texts)
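
The pad-or-truncate rule used in __getitem__ can be sketched on its own. This is a simplified stand-in working on plain lists of indices; the function name pad_or_truncate is hypothetical.

```python
def pad_or_truncate(indices, max_len=50, pad_value=0):
    """Return a list of exactly max_len items: cut long inputs, right-pad short ones."""
    if len(indices) < max_len:
        return indices + [pad_value] * (max_len - len(indices))
    return indices[:max_len]

print(pad_or_truncate([7, 3, 9], max_len=5))              # [7, 3, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5))  # [1, 2, 3, 4, 5]
```

Fixing every sample to the same length is what lets the default DataLoader stack samples into a single batch tensor.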


***4TH CODE BLOCK FOR RUNNING***: The TextClassifier class defines the neural network architecture for text classification.

For Initialization:
* Embedding Layer: Either uses pretrained embeddings (static) or trains embeddings (non-static).
* RNN Layer: Initializes a recurrent neural network (LSTM, GRU, or RNN) based on input parameters:
  * rnn_type specifies the type (LSTM/GRU/RNN).
  * bidirectional allows for bidirectional RNNs, enabling processing both forward and backward.
  * num_layers specifies the number of stacked RNN layers.
  * dropout applies dropout for regularization between layers if num_layers > 1.
* A dropout layer and fully connected layer follow the RNN, mapping to the final output classes.

There is also Forward Pass, which:
* Embeds the input if using non-static embedding; otherwise uses the input directly (each token is already a Word2Vec vector).
* Processes the embedded sequence through the RNN.
* Uses only the RNN output at the last time step (rnn_out[:, -1, :]) for prediction.
* Applies dropout and then outputs the final class scores.

In [None]:
# Define text classifier model: https://github.com/claravania/lstm-pytorch/blob/master/model.py
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, rnn_type='LSTM', bidirectional=False, num_layers=1, dropout=0.5, use_static_embedding=True, word2vec_weights=None):
        super(TextClassifier, self).__init__()
        self.use_static_embedding = use_static_embedding
        if use_static_embedding:
            self.embedding = nn.Embedding.from_pretrained(word2vec_weights, freeze=True)
        else:
            self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Initialize RNN (LSTM/GRU/RNN) for later experimentation: https://discuss.pytorch.org/t/bidirectional-3-layer-lstm-hidden-output/41336
        self.rnn_type = rnn_type
        if rnn_type == 'LSTM':
            self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                               bidirectional=bidirectional, batch_first=True, dropout=dropout if num_layers > 1 else 0)
        elif rnn_type == 'GRU':
            self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=bidirectional, batch_first=True, dropout=dropout if num_layers > 1 else 0)
        else:
            self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=bidirectional, batch_first=True, dropout=dropout if num_layers > 1 else 0)

        self.dropout = nn.Dropout(dropout)
        direction_multiplier = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim * direction_multiplier, output_dim)

    def forward(self, x):
        if not self.use_static_embedding:
            embedded = self.embedding(x)
        else:
            embedded = x

        # nn.LSTM, nn.GRU and nn.RNN all return (output, hidden); only the output is needed
        rnn_out, _ = self.rnn(embedded)

        final_hidden_state = rnn_out[:, -1, :]
        dropped = self.dropout(final_hidden_state)
        return self.fc(dropped)
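
A quick shape check can help explain why the fully connected layer takes hidden_dim * 2 inputs in the bidirectional case. The sketch below is not part of the project code: the helper last_step_shape and the batch size of 4 are assumptions, and a zero tensor stands in for an embedded batch.

```python
import torch
import torch.nn as nn

def last_step_shape(bidirectional, batch_size=4, seq_len=50, embed_dim=100, hidden_dim=128):
    """Run a zero batch through an LSTM and return the shape of the last-time-step output."""
    lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=bidirectional)
    x = torch.zeros(batch_size, seq_len, embed_dim)  # stand-in for an embedded batch
    rnn_out, _ = lstm(x)  # rnn_out: (batch, seq_len, hidden_dim * num_directions)
    return tuple(rnn_out[:, -1, :].shape)

print(last_step_shape(bidirectional=False))  # (4, 128)
print(last_step_shape(bidirectional=True))   # (4, 256)
```

With batch_first=True, the forward and backward outputs are concatenated along the last dimension, which is where the factor of 2 comes from.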

***5TH CODE BLOCK FOR RUNNING***: Next, we have the training function. The train_model function trains the TextClassifier model.

Training Loop:
* For each epoch, it iterates over batches, computes the model’s output and loss, and performs backpropagation to update the weights.
* train_loss accumulates the loss per epoch.

Validation Phase:
* After each epoch, it switches to evaluation mode to calculate the validation loss and accuracy without gradient updates.
* Predicts classes, compares with true labels, and calculates accuracy.


In [None]:
def train_model(model, train_loader, val_loader, criterion, optimizer, epochs):

    epoch_accuracies = []  # Store accuracy for each epoch
    # code reference: https://stackoverflow.com/questions/51503851/calculate-the-accuracy-every-epoch-in-pytorch

    for epoch in range(epochs):
        model.train()  # Set model to training mode
        train_loss = 0
        for batch in train_loader:
            texts, labels = batch
            texts, labels = texts.to(device), labels.to(device)

            optimizer.zero_grad()
            output = model(texts)
            loss = criterion(output, labels)
            loss.backward()  # Backpropagation
            optimizer.step()

            train_loss += loss.item()

        # Validation phase (after each epoch)
        model.eval()  # Set model to evaluation mode
        val_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():  # No gradient computation for validation
            for batch in val_loader:
                texts, labels = batch
                texts, labels = texts.to(device), labels.to(device)

                output = model(texts)
                loss = criterion(output, labels)
                val_loss += loss.item()

                # Get predictions and calculate accuracy
                _, predicted = torch.max(output.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        epoch_accuracy = 100 * correct / total
        epoch_accuracies.append(epoch_accuracy)

        # Print concise output for the epoch
        print(f"Epoch {epoch+1}/{epochs}, "
              f"Train Loss: {train_loss / len(train_loader):.4f}, "
              f"Val Loss: {val_loss / len(val_loader):.4f}, "
              f"Val Accuracy: {epoch_accuracy:.2f}%")

    return epoch_accuracies
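
The validation accuracy in train_model comes down to taking the highest-scoring class per sample and comparing it with the true label. A minimal pure-Python sketch of that calculation follows; the function name batch_accuracy and the toy scores are made up for illustration.

```python
def batch_accuracy(scores, labels):
    """Percent of rows whose highest-scoring class equals the true label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return 100 * correct / len(labels)

scores = [[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]]  # one row of class scores per sample
labels = [0, 1, 1, 1]
print(f"{batch_accuracy(scores, labels):.2f}%")  # 75.00%
```

In the training function the same comparison is done on tensors with torch.max and accumulated across batches before dividing by the total.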

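***6TH CODE BLOCK FOR RUNNING***: The evaluate_accuracy function runs the trained model on the test data by:
* Switching to evaluation mode, iterating over batches of test data, predicting a class for each sample, and accumulating the total number of samples processed. Because the test loader is built without labels, the function returns this sample count rather than an accuracy; it serves as a check that the model runs on unseen data.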

In [None]:
# Evaluation function (testing accuracy)
def evaluate_accuracy(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for texts in test_loader:
            texts = texts.to(device)
            outputs = model(texts)
            _, predicted = torch.max(outputs.data, 1)
            total += texts.size(0)
    return total

***7TH CODE BLOCK FOR RUNNING***: Next, we define the loaders and the experiment function so we can run experiments (reference: https://github.com/keras-team/keras/issues/853). The parameters I worked with included:
* embed_dim: Dimension of the word embeddings.
* hidden_dim: Dimension of the RNN’s hidden layer.
* lr: Learning rate for the optimizer.
* epochs: Number of epochs to train the model.
* rnn_type: Type of recurrent layer to use (options: LSTM, GRU, RNN).
* bidirectional: If True, the RNN layer will be bidirectional.
* num_layers: Number of stacked RNN layers.
* dropout: Dropout rate to apply after the RNN layers.
* use_static_embedding: If True, uses static (pre-trained) embeddings; otherwise, trainable embeddings.

I load word embeddings (if using static embeddings):
* If use_static_embedding=True, it calls create_word2vec_embeddings to generate embeddings based on the training text data.
* Each word in word_to_idx is converted into its Word2Vec embedding if available, or a zero vector if not found in word2vec_model.
* word2vec_weights stores the pre-trained Word2Vec embeddings in a PyTorch tensor. This tensor will be used in the TextClassifier as a fixed (non-trainable) embedding layer.
* If use_static_embedding=False, the word2vec_weights is set to None, and a new embedding layer will be learned from scratch during training.

I also create a TextDataset for training, validation, and test datasets:
* If using static embeddings, the word2vec_model is passed into TextDataset, which will use Word2Vec embeddings for each word.
* Otherwise, word_to_idx is used for encoding the text into word indices.
* DataLoader instances are then created for each dataset, batching the data for efficient model training and evaluation.

For the TextClassifier:
word2vec_weights is provided if use_static_embedding=True, and use_static_embedding is set accordingly. If use_static_embedding=False, the model will initialize and learn its own embedding weights.

In [None]:
def run_experiment(embed_dim, hidden_dim, lr, epochs, rnn_type='LSTM', bidirectional=False, num_layers=1, dropout=0.5, use_static_embedding=True):
    # reference code: https://github.com/ultralytics/yolov5/blob/master/classify/train.py

    if use_static_embedding:
        word2vec_model = create_word2vec_embeddings(train_df['text'], embed_dim)
        word_to_idx = {word: idx + 1 for idx, word in enumerate(word2vec_model.wv.key_to_index)}
        word_to_idx['<PAD>'] = 0
        idx_to_word = {idx: word for word, idx in word_to_idx.items()}

        train_dataset = TextDataset(train_texts, train_labels, word_to_idx=word_to_idx, word2vec_model=word2vec_model, use_static_embedding=True)
        val_dataset = TextDataset(val_texts, val_labels, word_to_idx=word_to_idx, word2vec_model=word2vec_model, use_static_embedding=True)
        test_dataset = TextDataset(test_df['text'], word_to_idx=word_to_idx, word2vec_model=word2vec_model, use_static_embedding=True)

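        # Pre-load Word2Vec weights so that row i holds the vector for the word with index i
        # (row 0 stays all zeros for <PAD>)
        weight_matrix = np.zeros((len(word_to_idx), embed_dim), dtype=np.float32)
        for word, idx in word_to_idx.items():
            if word in word2vec_model.wv:
                weight_matrix[idx] = word2vec_model.wv[word]
        word2vec_weights = torch.from_numpy(weight_matrix).to(device)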
    else:
        # Handle dynamic embedding case
        all_tokens = [token for text in train_df['text'] for token in preprocess_text(text)]
        vocab = sorted(set(all_tokens))
        word_to_idx = {word: idx + 1 for idx, word in enumerate(vocab)}
        word_to_idx['<PAD>'] = 0
        idx_to_word = {idx: word for word, idx in word_to_idx.items()}

        word2vec_weights = None  # No pre-trained weights in this case

        train_dataset = TextDataset(train_texts, train_labels, word_to_idx=word_to_idx, use_static_embedding=False)
        val_dataset = TextDataset(val_texts, val_labels, word_to_idx=word_to_idx, use_static_embedding=False)
        test_dataset = TextDataset(test_df['text'], word_to_idx=word_to_idx, use_static_embedding=False)

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=32)
    test_loader = DataLoader(test_dataset, batch_size=32)

    model = TextClassifier(len(word_to_idx), embed_dim, hidden_dim, len(label_encoder.classes_),
                           rnn_type=rnn_type, bidirectional=bidirectional, num_layers=num_layers,
                           dropout=dropout, use_static_embedding=use_static_embedding,
                           word2vec_weights=word2vec_weights).to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    epoch_accuracies = train_model(model, train_loader, val_loader, criterion, optimizer, epochs=epochs)
    avg_accuracy = sum(epoch_accuracies) / len(epoch_accuracies)
    print(f"Average Accuracy: {avg_accuracy:.2f}%")

    total_samples = evaluate_accuracy(model, test_loader)
    return total_samples, avg_accuracy
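
One detail worth checking when building an embedding weight matrix is that row i holds the vector for the word whose index is i, with row 0 left as zeros for `<PAD>`. The sketch below shows that alignment with plain lists; a small dict stands in for the Word2Vec model, and the helper name build_weight_rows is hypothetical.

```python
def build_weight_rows(word_to_idx, vectors, embed_dim):
    """Return a list of rows where rows[i] is the vector for the word with index i."""
    rows = [[0.0] * embed_dim for _ in range(len(word_to_idx))]
    for word, idx in word_to_idx.items():
        if word in vectors:
            rows[idx] = list(vectors[word])
    return rows

toy_vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}  # stand-in for word2vec_model.wv
toy_word_to_idx = {"cat": 1, "dog": 2, "<PAD>": 0}
rows = build_weight_rows(toy_word_to_idx, toy_vectors, embed_dim=2)
print(rows)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
```

Placing each vector by its index, rather than by dictionary iteration order, keeps the matrix consistent with the indices the dataset emits.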


***8TH CODE BLOCK FOR RUNNING***: This is the block where I ran experiments. THE FINAL ARCHITECTURE THAT WORKS BEST HAS BEEN LEFT UNCOMMENTED! I compared vanilla RNNs to LSTMs and also to GRUs, which were not discussed explicitly in class but also reached high accuracy. I compared single-direction LSTMs to bi-LSTMs and stacked two to four recurrent layers. I also tuned hyperparameters and compared a learned embedding layer against static word2vec embeddings. Overall, I had the best results with a bidirectional LSTM with four layers and no static word embedding. I ran this three times and always got an overall accuracy between 72.18% and 75.61% (as shown in the write-up). A bidirectional GRU with 1 layer also reached 78.07% accuracy.
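
Each experiment is a positional tuple passed to run_experiment(*params), so the meaning of each slot depends on the argument order. As an optional sketch, a NamedTuple can label those slots; the class name ExperimentConfig is an assumption and is not used elsewhere in this notebook.

```python
from typing import NamedTuple

class ExperimentConfig(NamedTuple):
    """Field order follows the run_experiment signature."""
    embed_dim: int
    hidden_dim: int
    lr: float
    epochs: int
    rnn_type: str
    bidirectional: bool
    num_layers: int
    dropout: float
    use_static_embedding: bool

# The configuration left uncommented in the experiments list
config = ExperimentConfig(100, 128, 0.0005, 30, 'LSTM', True, 4, 0.3, False)
print(config.rnn_type, config.num_layers, config.use_static_embedding)  # LSTM 4 False
```

Because a NamedTuple is still a tuple, it can be unpacked with * in the same way as the plain tuples.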

In [None]:
# Splitting the data
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df['text'], train_df['encoded_label'], test_size=0.2, random_state=42)

# Running some experiments with different configurations: FINAL CHOICE IS NOT COMMENTED
experiments = [
    # Vanilla RNN
    #(100, 128, 0.001, 20, 'RNN', False, 1, 0.5, False),

    # Increased hidden dimension
    #(100, 256, 0.001, 20, 'LSTM', False, 2, 0.4, True),  # LSTM with more hidden units, static embedding
    #(100, 256, 0.0005, 30, 'GRU', True, 2, 0.4, False),  # GRU, bidirectional, smaller lr

    # More layers
    (100, 128, 0.0005, 30, 'LSTM', True, 4, 0.3, False),  # LSTM, more layers, smaller lr
    #(100, 128, 0.0005, 30, 'LSTM', True, 4, 0.3, True),  # LSTM, more layers, smaller lr
    #(200, 128, 0.0005, 40, 'GRU', False, 3, 0.3, True),   # GRU, increased embed_dim, longer epochs

    # Higher embedding dimensions
    #(300, 128, 0.001, 20, 'LSTM', True, 2, 0.4, True),    # Higher embed_dim with LSTM, bidirectional

    # Hyperparameter tuning
    #(200, 256, 0.0005, 30, 'LSTM', True, 2, 0.6, False),

    # Lower dropout
    #(100, 128, 0.001, 20, 'RNN', True, 1, 0.3, False),    # Vanilla RNN, lower dropout, bidirectional
    #(100, 512, 0.0005, 40, 'GRU', True, 2, 0.3, True),    # Higher hidden dimension, longer epochs, bidirectional

    # Experiment with small learning rate and more epochs
    #(200, 256, 0.0001, 40, 'LSTM', True, 3, 0.4, True),   # Lower lr with LSTM, more layers, static embedding
]


# Running each experiment and storing the results
results = []
for params in experiments:
    total_samples, avg_accuracy = run_experiment(*params)
    results.append((params, total_samples, avg_accuracy))

print("\nExperiment Results:")
for params, total_samples, avg_accuracy in results:
    print(f"Model: {params[4]}, Bidirectional: {params[5]}, Layers: {params[6]}, "
          f"Embed dim: {params[0]}, Hidden dim: {params[1]}, LR: {params[2]}, "
          f"Epochs: {params[3]}, Dropout: {params[7]}, Static Embedding: {params[8]}")
    print(f"Total samples processed: {total_samples}")
    print(f"Average Accuracy across all epochs: {avg_accuracy:.2f}%\n")



Epoch 1/30, Train Loss: 0.6346, Val Loss: 0.6218, Val Accuracy: 67.60%
Epoch 2/30, Train Loss: 0.6106, Val Loss: 0.6279, Val Accuracy: 67.60%
Epoch 3/30, Train Loss: 0.6190, Val Loss: 0.6190, Val Accuracy: 67.60%
Epoch 4/30, Train Loss: 0.6201, Val Loss: 0.6186, Val Accuracy: 67.60%
Epoch 5/30, Train Loss: 0.6173, Val Loss: 0.6186, Val Accuracy: 67.60%
Epoch 6/30, Train Loss: 0.6121, Val Loss: 0.6203, Val Accuracy: 67.60%
Epoch 7/30, Train Loss: 0.6123, Val Loss: 0.6201, Val Accuracy: 67.60%
Epoch 8/30, Train Loss: 0.6111, Val Loss: 0.6168, Val Accuracy: 68.16%
Epoch 9/30, Train Loss: 0.6178, Val Loss: 0.6054, Val Accuracy: 69.27%
Epoch 10/30, Train Loss: 0.6003, Val Loss: 0.5997, Val Accuracy: 69.83%
Epoch 11/30, Train Loss: 0.5952, Val Loss: 0.7101, Val Accuracy: 51.96%
Epoch 12/30, Train Loss: 0.6224, Val Loss: 0.6080, Val Accuracy: 69.83%
Epoch 13/30, Train Loss: 0.5781, Val Loss: 0.5659, Val Accuracy: 70.95%
Epoch 14/30, Train Loss: 0.5370, Val Loss: 0.5807, Val Accuracy: 72.07%
E