## TC 5033
### Word Embeddings

<br>

#### Activity 3b: Text Classification using RNNs and AG_NEWS dataset in PyTorch
<br>

- Objective:
    - Understand the basics of Recurrent Neural Networks (RNNs) and their application in text classification.
    - Learn how to handle a real-world text dataset, AG_NEWS, in PyTorch.
    - Gain hands-on experience in defining, training, and evaluating a text classification model in PyTorch.
    
<br>

- Instructions:
    - Data Preparation: Starter code is provided that loads the AG_NEWS dataset and prepares it for training. Do not modify this part. However, make sure you understand it and comment it; using markdown cells for this is suggested.

    - Model Setup: A skeleton code for the RNN model class will be provided. Complete this class and use it to instantiate your model.

    - Implementing Accuracy Function: Write a function that takes model predictions and ground truth labels as input and returns the model's accuracy.

    - Training Function: Implement a function that performs training on the given model using the AG_NEWS dataset. Your model should achieve an accuracy of at least 80% to get full marks for this part.

    - Text Sampling: Write a function that takes a sample text as input and classifies it using your trained model.

    - Confusion Matrix: Implement a function to display the confusion matrix for your model on the test data.

    - Submission: Submit your completed Jupyter Notebook. Make sure to include a markdown cell at the beginning of the notebook that lists the names of all team members. Teams should consist of 3 to 4 members.
    
<br>

- Evaluation Criteria:

    - Correct setup of all the required libraries and modules (10%)
    - Code Quality (30%): Your code should be well organized, clearly commented, and easy to follow. Also use markdown cells for clarity. Comment all of the provided code as well; this will help you understand its functionality.
    
    - Functionality (60%):
        - All the functions should execute without errors and provide the expected outputs.
        - RNN model class (20%)
        - Accuracy function (10%)
        - Training function (10%)
        - Sampling function (10%)
        - Confusion matrix (10%)

        - The model should achieve at least an 80% accuracy on the AG_NEWS test set for full marks in this criterion.


### Dataset

https://pytorch.org/text/stable/datasets.html#text-classification

https://paperswithcode.com/dataset/ag-news


### Import libraries

In [None]:
# conda install -c pytorch torchtext
# conda install -c pytorch torchdata
# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch


In [None]:
!pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1
!pip install torchtext==0.14.1 torchdata==0.5.1
!pip install scikit-plot
!pip install numpy==1.22.4
!pip install scipy==1.7.3



In [None]:
import numpy as np
import torch
from torchtext.datasets import AG_NEWS
from torch.utils.data import DataLoader, random_split
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
import torch.nn as nn
from torch.nn import functional as F
import scikitplot as skplt
import gc

# Additional libraries
import torchvision.transforms as transforms
import torchvision.io as tv_io
from torchvision.models import vgg16, VGG16_Weights
import glob
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)


cpu


### Get the train and the test datasets and dataloaders

Classes:

* 1 - World

* 2 - Sports

* 3 - Business

* 4 - Sci/Tech

We will convert them to:

* 0 - World

* 1 - Sports

* 2 - Business

* 3 - Sci/Tech

In [None]:
# Load the AG_NEWS dataset
train_iter, test_iter = AG_NEWS()

# Convert iterators to lists to avoid portalocker issues
train_list = list(train_iter)
test_list = list(test_iter)

# Tokenizer
tokenizer = get_tokenizer('basic_english')

# Build vocabulary
def yield_tokens(data_list):
    for _, text in data_list:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_list), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Prepare data
def text_pipeline(x):
    return vocab(tokenizer(x))

def label_pipeline(x):
    return int(x) - 1  # Convert labels to 0-based index

# Create DataLoader
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    lengths = torch.tensor(lengths, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), lengths.to(device)


train_dataset = to_map_style_dataset(train_list)
test_dataset = to_map_style_dataset(test_list)

NUM_TRAIN = int(len(train_dataset) * 0.9)
NUM_VAL = len(train_dataset) - NUM_TRAIN

train_dataset, val_dataset = random_split(train_dataset, [NUM_TRAIN, NUM_VAL])

BATCH_SIZE = 256
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
# Evaluation does not depend on batch order, so the validation and test loaders are not shuffled
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
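
To make the role of `collate_batch` more concrete, the following is a minimal sketch of what `pad_sequence` does to a batch of variable-length sequences. The token ids are made up for illustration and do not come from the AG_NEWS vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three hypothetical token-id sequences of different lengths (made-up ids).
sequences = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7]),
    torch.tensor([3, 9, 4, 6, 1]),
]

# Record the true length of each sequence before padding.
lengths = torch.tensor([seq.size(0) for seq in sequences])

# Pad with zeros on the right so every row is as long as the longest sequence.
padded = pad_sequence(sequences, batch_first=True)

print(padded.shape)  # torch.Size([3, 5])
print(padded)
print(lengths)       # tensor([3, 1, 5])
```

Note that the padding value 0 is also the index of `<unk>` in the vocabulary built above, since the special token is inserted first. This should be harmless here because the model packs the sequences using `lengths`, so padded positions are never processed by the RNN.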


### RNN model

In [None]:
EMBEDDING_SIZE = 64  # Define embedding size
NEURONS = 128  # Define number of neurons
LAYERS = 2  # Define number of layers
NUM_CLASSES = 4  # Define number of classes

# RNN Model
class RNN_Model_1(nn.Module):
    def __init__(self, embed_size, hidden, layers, num_classes):
        super(RNN_Model_1, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_size)
        self.rnn = nn.RNN(embed_size, hidden, num_layers=layers, nonlinearity='relu', batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

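    def forward(self, text, lengths):
        # text: (batch, max_len) padded token ids; lengths: true length of each sequence
        x = self.embedding_layer(text)
        # pack_padded_sequence requires the lengths tensor to live on the CPU
        packed_input = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_output, _ = self.rnn(packed_input)
        rnn_out, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
        # Select the output at the last real (non-padded) time step of each sequence
        batch_idx = torch.arange(rnn_out.size(0), device=rnn_out.device)
        rnn_out = rnn_out[batch_idx, lengths.to(rnn_out.device) - 1]
        return self.fc(rnn_out)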
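
The forward pass above packs the padded batch and then reads the output at position `lengths - 1` for each sequence. The sketch below, with toy dimensions chosen only for illustration, checks that this indexing picks the same vector as the top layer's final hidden state `h_n[-1]` for a unidirectional RNN.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy dimensions chosen only for illustration.
batch = torch.randn(3, 5, 4)        # (batch, max_len, features)
lengths = torch.tensor([3, 1, 5])   # true lengths, kept on the CPU
rnn = nn.RNN(input_size=4, hidden_size=6, num_layers=2, batch_first=True)

# Pack so the RNN skips padded positions; sorting by length is not required.
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = rnn(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)   # (3, 5, 6)

# Pick, for every sequence, the output at its last real time step.
last_out = out[torch.arange(out.size(0)), lengths - 1]       # (3, 6)

# For a unidirectional RNN this should match the top layer's final hidden state.
matches = torch.allclose(last_out, h_n[-1], atol=1e-6)
print(matches)
```

Using `h_n[-1]` directly would therefore be an alternative to the explicit indexing, under the assumption that the RNN stays unidirectional.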


In [None]:
# Define the training function
def train(model, optimizer, epochs=100):
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for label, text, lengths in train_loader:
            optimizer.zero_grad()
            output = model(text, lengths)
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}')


In [None]:
def accuracy(model, loader):
    model.eval()
    total_acc = 0
    with torch.no_grad():
        for label, text, lengths in loader:
            output = model(text, lengths)
            total_acc += (output.argmax(1) == label).sum().item()
    return total_acc / len(loader.dataset)
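
The instructions ask for a function that takes predictions and ground truth labels, while `accuracy` above takes a model and a loader. As a possible complement, here is a minimal sketch of the per-batch computation; the name `batch_accuracy` and the example logits are hypothetical.

```python
import torch

def batch_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()

# Made-up scores for four examples and four classes.
logits = torch.tensor([
    [2.0, 0.1, 0.3, 0.0],
    [0.2, 0.1, 1.5, 0.4],
    [0.0, 3.0, 0.2, 0.1],
    [0.9, 0.1, 0.2, 0.3],
])
labels = torch.tensor([0, 2, 1, 3])

result = batch_accuracy(logits, labels)
print(result)  # 0.75
```

When aggregating over a whole loader, summing the correct counts and dividing by the dataset size, as `accuracy` does, avoids the bias that averaging per-batch accuracies would introduce when the last batch is smaller.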


### Train the model

In [None]:
epochs = 20  # Increase the number of epochs
lr = 0.0005  # Adjust the learning rate

# Instantiate model
rnn_model = RNN_Model_1(EMBEDDING_SIZE, NEURONS, LAYERS, NUM_CLASSES).to(device)
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=lr)

In [None]:
# Train the model
train(rnn_model, optimizer=optimizer, epochs=epochs)
print(f'Accuracy: {accuracy(rnn_model, test_loader):.4f}')

Epoch 1, Loss: 1.0702
Epoch 2, Loss: 0.6039
Epoch 3, Loss: 0.4252
Epoch 4, Loss: 0.3517
Epoch 5, Loss: 0.3098
Epoch 6, Loss: 0.2822
Epoch 7, Loss: 0.2597
Epoch 8, Loss: 0.2380
Epoch 9, Loss: 0.2234
Epoch 10, Loss: 0.2158
Epoch 11, Loss: 0.1876
Epoch 12, Loss: 0.1737
Epoch 13, Loss: 0.1657
Epoch 14, Loss: 0.1574
Epoch 15, Loss: 0.1467
Epoch 16, Loss: 0.1363
Epoch 17, Loss: 0.1524
Epoch 18, Loss: 0.1217
Epoch 19, Loss: 0.1175
Epoch 20, Loss: 0.1046
Accuracy: 0.8857


In [1]:
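def sample_text(model, loader):
    # Print the padded token ids, predictions, and true labels for the first batch of the loader
    model.eval()
    with torch.no_grad():
        for label, text, lengths in loader:  # collate_batch yields (labels, padded text, lengths)
            output = model(text, lengths)
            print(f'Text: {text}, Predicted: {output.argmax(1)}, Actual: {label}')
            break

sample_text(rnn_model, test_loader)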


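
The instructions ask for a function that classifies a raw text string, whereas `sample_text` prints a batch of token ids. The following is a minimal sketch of one way to do that. `classify_text`, `TinyClassifier`, `toy_vocab`, and `toy_pipeline` are hypothetical names introduced only so the example runs on its own; in the notebook, the trained `rnn_model` and the real `text_pipeline` would take their place.

```python
import torch
import torch.nn as nn

CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

class TinyClassifier(nn.Module):
    """Hypothetical stand-in with the same forward(text, lengths) signature."""
    def __init__(self, vocab_size, embed_size=8, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.fc = nn.Linear(embed_size, num_classes)

    def forward(self, text, lengths):
        # Mean of the embeddings; a batch of one has no padding to mask.
        return self.fc(self.embedding(text).mean(dim=1))

def classify_text(model, text, text_pipeline, device="cpu"):
    """Return the predicted class name for one raw string."""
    token_ids = text_pipeline(text)
    if not token_ids:
        raise ValueError("text produced no tokens")
    # Build a batch of size one: shape (1, seq_len).
    batch = torch.tensor([token_ids], dtype=torch.int64, device=device)
    lengths = torch.tensor([len(token_ids)], dtype=torch.int64)
    model.eval()
    with torch.no_grad():
        logits = model(batch, lengths)
    return CLASS_NAMES[logits.argmax(dim=1).item()]

# Toy vocabulary and pipeline (assumptions for this sketch only).
toy_vocab = {"<unk>": 0, "stocks": 1, "fell": 2, "sharply": 3}

def toy_pipeline(text):
    return [toy_vocab.get(tok, 0) for tok in text.lower().split()]

toy_model = TinyClassifier(vocab_size=len(toy_vocab))
prediction = classify_text(toy_model, "Stocks fell sharply today", toy_pipeline)
print(prediction)
```

Because the stand-in model is untrained, the printed class is arbitrary. With the notebook's objects the call would look like `classify_text(rnn_model, "some headline", text_pipeline, device)`.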

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

def plot_confusion_matrix(model, loader):
    class_names = ['World', 'Sports', 'Business', 'Sci/Tech']
    all_preds = []
    all_labels = []
    model.eval()
    # Collect predictions and true labels over the whole loader
    with torch.no_grad():
        for label, text, lengths in loader:
            output = model(text, lengths)
            preds = output.argmax(1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(label.cpu().numpy())
    # Rows are true classes, columns are predicted classes
    cm = confusion_matrix(all_labels, all_preds)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

plot_confusion_matrix(rnn_model, test_loader)
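
To help read the matrix, the sketch below builds a confusion matrix by hand with `torch.bincount` and derives per-class recall from it. The labels and predictions are made up, and `confusion_counts` is a hypothetical helper, not part of scikit-learn or the notebook.

```python
import torch

def confusion_counts(labels, preds, num_classes):
    """Rows are true classes, columns are predicted classes."""
    # Encode each (true, predicted) pair as a single index, then count.
    flat = labels * num_classes + preds
    counts = torch.bincount(flat, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

# Made-up labels and predictions for eight examples.
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
preds = torch.tensor([0, 0, 1, 1, 2, 3, 3, 2])

cm = confusion_counts(labels, preds, num_classes=4)

# Recall per class: correct predictions divided by the row total.
recall = cm.diag().float() / cm.sum(dim=1).float()

print(cm)
print(recall)  # tensor([1.0000, 1.0000, 0.5000, 0.5000])
```

In this toy data, classes 2 and 3 (Business and Sci/Tech in the 0-based mapping) are confused with each other, which shows up as off-diagonal counts in their rows.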
