# ECS289G Project: CNN for Text Classification using PyTorch

* A PyTorch implementation for CNN Text Classification based on [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) (Kim, 2014).

* The skeleton of this notebook is based on this post: https://chriskhanhtran.github.io/posts/cnn-sentence-classification/
    * We use GloVe 6B 300d pretrained embeddings, tune the hyperparameters, and extend the experiments to both the MR and R8 datasets.
    * The datasets and the train/test split are from https://github.com/yao8839836/text_gcn/tree/master/data

In [None]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download("all")
import matplotlib.pyplot as plt
import torch

%matplotlib inline

## Data Preprocessing

### Datasets
Datasets that we will be using for our project are:
* Movie Review (MR): http://www.cs.cornell.edu/people/pabo/movie-review-data/
* R8: https://www.cs.umb.edu/~smimarog/textmining/datasets/
* They can also be downloaded [here](https://github.com/yao8839836/text_gcn/tree/master/data)
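
As a rough illustration of the file layout, the sketch below parses two made-up lines in the tab-separated `label<TAB>text` form that the `load_r8` function in this notebook assumes. The sample lines and the `parse_r8_line` helper are hypothetical and are not taken from the dataset.

```python
# Hypothetical helper: split one "label<TAB>text" line into its two fields.
def parse_r8_line(line):
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text

# Made-up sample lines, only to show the assumed layout.
sample_lines = [
    "earn\tcompany reports higher quarterly profit\n",
    "trade\tministers discuss new import tariffs\n",
]

parsed = [parse_r8_line(line) for line in sample_lines]
for label, text in parsed:
    print(f"{label!r} -> {text!r}")
```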

In [None]:
# Download datasets
!rm -rf ./data
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/mr/text_train.txt -P ./data/mr/
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/mr/text_test.txt -P ./data/mr/
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/mr/label_train.txt -P ./data/mr/
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/mr/label_test.txt -P ./data/mr/
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/R8/train.txt -P ./data/R8/
!wget https://raw.githubusercontent.com/yao8839836/text_gcn/master/data/R8/test.txt -P ./data/R8/

In [None]:
# Preprocessing into "list of texts" and "list of labels"
def load_mr(filename):
    with open(filename, 'rb') as f:
        texts = []
        for line in f:
            item = line.decode(errors='ignore').lower().strip()
            if item == '0' or item == '1':
                item = int(item)
            texts.append(item)

    return np.array(texts)

r8_dict = {
    "acq": 0,
    "crude": 1,	
    "earn": 2,	
    "grain": 3,
    "interest": 4,	
    "money-fx": 5,
    "ship": 6,
    "trade": 7
}
def load_r8(filename):
    with open(filename, 'rb') as f:
        texts = []
        labels = []
        for line in f:
            line = line.decode(errors='ignore').split("\t")
            text, label = line[1], line[0]
            texts.append(text)
            labels.append(r8_dict[label])

    return np.array(texts), np.array(labels)

mr_train_texts = load_mr('./data/mr/text_train.txt')
mr_test_texts = load_mr('./data/mr/text_test.txt')
mr_train_labels = load_mr('./data/mr/label_train.txt')
mr_test_labels = load_mr('./data/mr/label_test.txt')
r8_train_texts, r8_train_labels = load_r8('./data/R8/train.txt') 
r8_test_texts, r8_test_labels = load_r8('./data/R8/test.txt')

## Pretrained Embedding and Encoding
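
The encoding pipeline in this section has three steps: build a vocabulary, pad every sentence to a common length, and map tokens to integer ids. Below is a minimal sketch on two toy sentences. It assumes whitespace splitting in place of NLTK's `word_tokenize` so that it runs without any downloads; the sentences and the `toy_*` names are illustrative only.

```python
# Toy corpus; str.split stands in for word_tokenize (an assumption for brevity).
toy_texts = ["a good movie", "not a good one at all"]
toy_tokens = [text.split() for text in toy_texts]

# Reserve index 0 for padding and 1 for unknown words, as in the notebook.
toy_vocab = {"<pad>": 0, "<unk>": 1}
for sent in toy_tokens:
    for token in sent:
        # Assign the next free index the first time a token is seen.
        toy_vocab.setdefault(token, len(toy_vocab))

toy_max_len = max(len(sent) for sent in toy_tokens)

# Pad on the right, then look up ids; unseen tokens fall back to <unk>.
toy_ids = [
    [toy_vocab.get(tok, toy_vocab["<unk>"])
     for tok in sent + ["<pad>"] * (toy_max_len - len(sent))]
    for sent in toy_tokens
]

print(toy_ids)
```

Every row has the same length, which is what allows the ids to be stacked into a single `(N, max_len)` array.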

In [None]:
# Download Glove Embeddings
URL = "https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip"
FILE = "Glove"

if os.path.isdir(FILE):
    print("Glove exists.")
else:
    !wget -P $FILE $URL
    !unzip $FILE/glove.6B.zip

In [None]:
# tokenization
from nltk.tokenize import word_tokenize
from collections import defaultdict

def tokenize(texts):
    """Tokenize texts, build vocabulary and find maximum sentence length.
    
    Args:
        texts (List[str]): List of text data
    
    Returns:
        tokenized_texts (List[List[str]]): List of list of tokens
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Maximum sentence length
    """

    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Building our vocab from the corpus starting from index 2
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add `tokenized_sent` to `tokenized_texts`
        tokenized_texts.append(tokenized_sent)

        # Add new token to `word2idx`
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update `max_len`
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):
    """Pad each sentence to the maximum sentence length and encode tokens to
    their index in the vocabulary.

    Returns:
        input_ids (np.array): Array of token indexes in the vocabulary with
            shape (N, max_len). It will be the input of our CNN model.
    """

    input_ids = []
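    for tokenized_sent in tokenized_texts:
        # Pad sentences to max_len (build a new list so the caller's tokens
        # are not modified in place)
        padded_sent = tokenized_sent + ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens to input_ids; tokens missing from the vocabulary map to <unk>
        input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_sent]
        input_ids.append(input_id)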
    
    return np.array(input_ids)

In [None]:
# load embeddings
from tqdm import tqdm_notebook

def load_pretrained_vectors(word2idx, fname):
    """Load pretrained vectors and create embedding layers.
    
    Args:
        word2idx (Dict): Vocabulary built from the corpus
        fname (str): Path to pretrained vector file

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d) where N is
            the size of word2idx and d is embedding dimension
    """

    print("Loading pretrained vectors...")
    # Embedding dimension of the GloVe 300d file (GloVe files have no header line)
    d = 300

    # Initialize random embeddings; the <pad> row is all zeros
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
    embeddings[word2idx['<pad>']] = np.zeros((d,))

    # Load pretrained vectors; the `with` block closes the file when done
    count = 0
    with open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        for line in tqdm(fin):
            tokens = line.rstrip().split(' ')
            word = tokens[0]
            if word in word2idx:
                count += 1
                embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")

    return embeddings

In [None]:
train_texts = mr_train_texts
test_texts = mr_test_texts
# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts_train, _, _ = tokenize(train_texts)
tokenized_texts_test, _, _ = tokenize(test_texts)
# Build the vocabulary and max_len from train and test combined so that every
# test token has an index (note: this exposes the test vocabulary to the model).
tokenized_texts, word2idx, max_len = tokenize(np.concatenate((train_texts, test_texts), axis=None))
input_ids_train = encode(tokenized_texts_train, word2idx, max_len)
input_ids_test = encode(tokenized_texts_test, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx, "glove.6B.300d.txt")
embeddings = torch.tensor(embeddings)

## Dataloader

In [None]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, test_inputs, train_labels, test_labels,
                batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, test_inputs, train_labels, test_labels =\
    tuple(torch.tensor(data) for data in
          [train_inputs, test_inputs, train_labels, test_labels])

    # Create DataLoader for training data (shuffled every epoch)
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for test data (fixed order)
    test_data = TensorDataset(test_inputs, test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

    return train_dataloader, test_dataloader

In [None]:
# Load data to PyTorch DataLoader
train_inputs = input_ids_train
test_inputs = input_ids_test
train_labels = mr_train_labels
test_labels = mr_test_labels
train_dataloader, test_dataloader = \
data_loader(train_inputs, test_inputs, train_labels, test_labels, batch_size=50)

## Model
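
Before the model code, it may help to trace tensor shapes through the architecture. With stride 1 and no padding (the `nn.Conv1d` defaults), a 1D convolution with kernel size `h` over a sequence of length `n` yields `n - h + 1` positions. Max-over-time pooling then reduces each filter to a single value, and the pooled values from all branches are concatenated. The helper below is a hypothetical, pure-Python sketch of that arithmetic and is not part of the model; the sequence length 62 is only an example value.

```python
def cnn_feature_shapes(seq_len, filter_sizes=(3, 4, 5), num_filters=(100, 100, 100)):
    """Return per-branch conv output shapes and the concatenated feature size."""
    conv_shapes = []
    for h, n_f in zip(filter_sizes, num_filters):
        if seq_len < h:
            raise ValueError(f"sequence length {seq_len} is shorter than kernel size {h}")
        # Conv1d with stride 1 and no padding: L_out = seq_len - h + 1
        conv_shapes.append((n_f, seq_len - h + 1))
    # Max-over-time pooling keeps one value per filter; branches are concatenated.
    fc_in_features = sum(num_filters)
    return conv_shapes, fc_in_features

conv_shapes, fc_in = cnn_feature_shapes(seq_len=62)
print(conv_shapes)
print(fc_in)
```

The concatenated size depends only on the number of filters, not on the sequence length, which is why the same fully connected layer works for any padded length that is at least as long as the largest kernel.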

In [None]:
# set up GPU
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_NLP(nn.Module):
    """A 1D Convolutional Neural Network for Sentence Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=2,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretrained
                vectors. Default: False
            vocab_size (int): Needs to be specified when pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Needs to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            num_classes (int): Number of classes. Default: 2
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()
        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(int(sum(num_filters)), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                n_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits

In [None]:
import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[3, 4, 5],
                    num_filters=[100, 100, 100],
                    num_classes=2,
                    dropout=0.5,
                    learning_rate=0.01,
                    weight_decay=0):
    """Instantiate a CNN model and an optimizer."""

    assert len(filter_sizes) == len(num_filters), \
        "filter_sizes and num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=num_classes,
                        dropout=dropout)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95,
                               weight_decay=weight_decay)

    return cnn_model, optimizer

## Training and Evaluation
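
One detail worth keeping in mind when reading the evaluation code in this section: averaging per-batch accuracies weights every batch equally, so a smaller final batch counts as much as a full one. The sketch below uses made-up batch results to contrast that with a sample-weighted average. Whether the difference matters in practice depends on the dataset size and the batch size.

```python
# Made-up (num_correct, batch_size) pairs; the last batch is smaller.
batches = [(45, 50), (40, 50), (1, 5)]

# Mean of per-batch accuracies: every batch has equal weight.
mean_of_batch_acc = sum(c / n for c, n in batches) / len(batches) * 100

# Sample-weighted accuracy: every example has equal weight.
total_correct = sum(c for c, _ in batches)
total_examples = sum(n for _, n in batches)
weighted_acc = total_correct / total_examples * 100

print(f"mean of batch accuracies: {mean_of_batch_acc:.2f}%")
print(f"sample-weighted accuracy: {weighted_acc:.2f}%")
```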

In [None]:
import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, test_dataloader=None, epochs=10, model_name=""):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0
    train_time = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Test Loss':^10} | {'Test Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)
        
        train_time += time.time() - t0_epoch

        # =======================================
        #               Evaluation
        # =======================================
        if test_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            test_loss, test_accuracy = evaluate(model, test_dataloader)

            # Track the best accuracy
            if test_accuracy > best_accuracy:
                best_accuracy = test_accuracy
                torch.save(model, "./models/" + model_name + "_best_model.pt")

            # Print this epoch's training loss and test metrics
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {test_loss:^10.6f} | {test_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")
    return best_accuracy, train_time

def evaluate(model, test_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    test_accuracy = []
    test_loss = []

    # For each batch in our validation set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        test_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        test_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    test_loss = np.mean(test_loss)
    test_accuracy = np.mean(test_accuracy)

    return test_loss, test_accuracy

In [None]:
!mkdir -p ./models

In [None]:
# parameters count
def count_params(model):
    pytorch_total_params = sum(p.numel() for p in model.parameters())
    pytorch_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("Total Parameters: " + str(pytorch_total_params))
    print("Trainable Parameters: " + str(pytorch_trainable_params))

In [None]:
# cal std and mean of list
def mean_std(arr):
    arr = np.array(arr)
    mean = np.mean(arr)
    std = np.std(arr)
    return mean, std
    

In [None]:
# need to create dir: "./models"
# CNN-rand: Word vectors are randomly initialized.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                          embed_dim=300,
                                          learning_rate=0.5,
                                          dropout=0.5,
                                          weight_decay=1e-3)
    if i == 0:
        count_params(cnn_rand)
    best_acc, train_time = train(cnn_rand, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="mr_cnn_rand")
    acc.append(best_acc)
    train_t.append(train_time)

# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

In [None]:
# CNN-static: pretrained word vectors are used and frozen during training.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=True,
                                            learning_rate=0.5,
                                            dropout=0.5,
                                            weight_decay=1e-3)
    if i == 0:
        count_params(cnn_static)
    best_acc, train_time = train(cnn_static, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="mr_cnn_static")
    acc.append(best_acc)
    train_t.append(train_time)
    
# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

In [None]:
# CNN-non-static: pretrained word vectors are fine-tuned during training.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                                freeze_embedding=False,
                                                learning_rate=0.5,
                                                dropout=0.5,
                                                weight_decay=1e-3)
    if i == 0:
        count_params(cnn_non_static)
    best_acc, train_time = train(cnn_non_static, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="mr_cnn_non_static")
    acc.append(best_acc)
    train_t.append(train_time)

# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

## R8 Dataset Experiment

In [None]:
train_texts = r8_train_texts
test_texts = r8_test_texts
# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts_train, _, _ = tokenize(train_texts)
tokenized_texts_test, _, _ = tokenize(test_texts)
# Build the vocabulary and max_len from train and test combined so that every
# test token has an index (note: this exposes the test vocabulary to the model).
tokenized_texts, word2idx, max_len = tokenize(np.concatenate((train_texts, test_texts), axis=None))
input_ids_train = encode(tokenized_texts_train, word2idx, max_len)
input_ids_test = encode(tokenized_texts_test, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx, "glove.6B.300d.txt")
embeddings = torch.tensor(embeddings)

# Load data to PyTorch DataLoader
train_inputs = input_ids_train
test_inputs = input_ids_test
train_labels = r8_train_labels
test_labels = r8_test_labels
train_dataloader, test_dataloader = \
data_loader(train_inputs, test_inputs, train_labels, test_labels, batch_size=50)

In [None]:
# CNN-rand: Word vectors are randomly initialized.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                          embed_dim=300,
                                          learning_rate=0.5,
                                          dropout=0.5,
                                          num_classes=8,
                                          weight_decay=1e-3)
    if i == 0:
        count_params(cnn_rand)
    best_acc, train_time = train(cnn_rand, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="r8_cnn_rand")
    acc.append(best_acc)
    train_t.append(train_time)

# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

In [None]:
# CNN-static: pretrained word vectors are used and frozen during training.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=True,
                                            learning_rate=0.5,
                                            dropout=0.5,
                                            num_classes=8,
                                            weight_decay=1e-3)
    if i == 0:
        count_params(cnn_static)
    best_acc, train_time = train(cnn_static, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="r8_cnn_static")
    acc.append(best_acc)
    train_t.append(train_time)

# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

In [None]:
# CNN-non-static: pretrained word vectors are fine-tuned during training.
# set_seed(42)
acc = []
train_t = []
for i in range(10):
    # Re-initialize the model and optimizer so that each run is independent
    cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                                freeze_embedding=False,
                                                learning_rate=0.5,
                                                dropout=0.5,
                                                num_classes=8,
                                                weight_decay=1e-3)
    if i == 0:
        count_params(cnn_non_static)
    best_acc, train_time = train(cnn_non_static, optimizer, train_dataloader, test_dataloader, epochs=20, model_name="r8_cnn_non_static")
    acc.append(best_acc)
    train_t.append(train_time)
    
# cal avg and std of acc and time
acc_mean, acc_std = mean_std(acc)
t_mean, t_std = mean_std(train_t)
print("Average accuracy: " + str(acc_mean) + " Acc std: " + str(acc_std))
print("Average time: " + str(t_mean) + " time std: " + str(t_std))

## Test 
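
The prediction helpers in this section turn the model's logits into class probabilities with a softmax. As a small worked example, the sketch below applies a numerically stable softmax to made-up logits for a two-class (negative, positive) model. The `softmax` function here is a plain-Python stand-in for `F.softmax`, written only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: the positive class is favoured by a factor of 3.
probs = softmax([0.0, math.log(3.0)])
print(f"This review is {probs[1] * 100:.2f}% positive.")
```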

In [None]:
def predict_review(text, model, max_len=62):
    """Predict probability that a review is positive.

    `model` is expected to be on the CPU and in evaluation mode.
    """

    # Tokenize, pad and encode text
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Convert to PyTorch tensors
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    # Compute logits without tracking gradients
    with torch.no_grad():
        logits = model(input_id)

    # Compute probability
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[1] * 100:.2f}% positive.")

# Map class index back to the R8 category name (inverse of `r8_dict`)
r8_id2cat = {
    0: "acq",
    1: "crude",
    2: "earn",
    3: "grain",
    4: "interest",
    5: "money-fx",
    6: "ship",
    7: "trade"
}

def predict_r8(text, model, max_len=62):
    """Predict probability of each category in R8.

    `model` is expected to be on the CPU and in evaluation mode.
    """

    # Tokenize, pad and encode text
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Convert to PyTorch tensors
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    # Compute logits without tracking gradients
    with torch.no_grad():
        logits = model(input_id)

    # Compute probability
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    for i in range(len(r8_id2cat)):
        print(f"This document is {probs[i] * 100:.2f}% {r8_id2cat[i]}.")

In [None]:
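# Note: `word2idx` must hold the MR vocabulary here. After the R8 cells above
# have run it holds the R8 vocabulary, so re-run the MR tokenization cell first.
# On recent PyTorch versions, loading a fully pickled model may also require
# passing `weights_only=False` to `torch.load`.
test_model = torch.load("./models/mr_cnn_non_static_best_model.pt")
test_model.to("cpu")
test_model.eval()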
predict_review("All of friends slept while watching this movie. But I really enjoyed it.", model=test_model)
predict_review("I have waited so long for this movie. I am now so satisfied and happy.", model=test_model)
predict_review("This movie is long and boring.", model=test_model)
predict_review("I don't like the ending.", model=test_model)

### Impact of Document Length (R8)
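
The bucketing used in this analysis can be sketched in isolation. The helper names below are hypothetical, the lengths and correctness flags are made up, and an empty bucket reports `None` instead of dividing by zero.

```python
def length_bucket(n_tokens):
    """Map a document length to one of the four buckets used in this section."""
    if n_tokens > 70:
        return "long"
    if n_tokens >= 50:
        return "medium"
    if n_tokens >= 30:
        return "short"
    return "extreme short"

def accuracy_by_bucket(lengths, correct_flags):
    """Return {bucket: accuracy or None} from per-document results."""
    totals, hits = {}, {}
    for n, ok in zip(lengths, correct_flags):
        b = length_bucket(n)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(ok)
    buckets = ["extreme short", "short", "medium", "long"]
    return {b: (hits[b] / totals[b] if b in totals else None) for b in buckets}

# Made-up example: document lengths and whether each prediction was correct.
result = accuracy_by_bucket([12, 29, 30, 55, 70, 120],
                            [True, False, True, True, False, True])
print(result)
```

Note that a length of exactly 70 falls into the medium bucket, since only lengths strictly greater than 70 count as long.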

In [None]:
def len_text(input_id):
    cnt = 0
    for token in input_id:
        if token != 0:
            cnt += 1
    return cnt

def evaluate_doc_length(model, test_dataloader):
    """
    We divide documents into extremely short (fewer than 30 words),
    short (30-49 words), medium (50-70 words), and long (more than 70 words).
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    c_es = 0
    c_s = 0
    c_m = 0
    c_l = 0
    t_es = 0
    t_s = 0
    t_m = 0
    t_l = 0

    # For each batch in our validation set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)
        
        for input_id, label in zip(b_input_ids, b_labels):

            # Compute logits
            length = len_text(input_id)
            with torch.no_grad():
                input_id = input_id.clone().detach().unsqueeze(dim=0)
                logit = model(input_id)

            # Get the predictions
            pred = torch.argmax(logit)

            # Calculate the accuracy rate
            if length > 70:
                t_l += 1
                if pred == label:
                    c_l += 1
            elif length >= 50:
                t_m += 1
                if pred == label:
                    c_m += 1
            elif length >= 30:
                t_s += 1
                if pred == label:
                    c_s += 1
            else:
                t_es += 1
                if pred == label:
                    c_es += 1
                

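    # Compute accuracy per bucket (guard against empty buckets)
    for name, correct, total in [("extreme short", c_es, t_es),
                                 ("short", c_s, t_s),
                                 ("medium", c_m, t_m),
                                 ("long", c_l, t_l)]:
        if total > 0:
            print(f"Acc of {name}: {correct / total}")
        else:
            print(f"Acc of {name}: n/a (no documents)")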
    
evaluate_doc_length(cnn_static, test_dataloader)
    

## Conclusion
* CNNs appear effective for text classification and are competitive with other architectures such as LSTMs and GNNs.
* We reached around 77-78% test accuracy on the Movie Review dataset and around 97% test accuracy on the R8 dataset, with fast training.
* We tried several hyperparameter settings but did not conduct a systematic hyperparameter search; the accuracy is nonetheless good.
* GloVe pretrained embeddings clearly help, but fine-tuning them (non-static) does not improve accuracy much.