# Chapter 6: Convolutional Neural Networks for Text Classification (Notes)

- RNNs aren't the only models that can be used for text classification; CNNs also work
- RNNs rely on sequential modeling: they maintain a hidden state and step through the text word by word (in order), updating that hidden state at each step
- CNNs do not rely on the sequential element of language; instead they try to learn by looking at each word in the context of the words around it
- CNNs are mostly used for images but work reasonably well on text too
- logic being: the meaning of an individual word in a sentence depends on its context and the words it appears next to

### Exploring CNNs
- basics come from computer vision (CV), but can be extended to NLP
- intuition: a sentence is a left-to-right sequence of words, just as an image is a group of pixels
- individual pixels do not mean much on their own; what matters is how they relate to one another

### Convolution for images
- the basic concept behind CNNs is the convolution
- a convolution is essentially a sliding-window function that's applied to a matrix in order to capture information from the surrounding pixels
- how it works: for a large image, a kernel function slides over the image (treated as a matrix), performs an operation on each window, and produces a new image that contains information about the original matrix of pixels
- for large images/complex sentences, we also add a pooling layer
- the pooling layer further reduces dimensionality (which is helpful)
- the pooling layer applies a function (usually max) to the output of the convolution layer to reduce its dimensionality
- this function is also applied over a sliding window, but unlike convolution windows, pooling windows typically do not overlap
- pooling layers have been shown to reduce the dimensionality of the data effectively while still retaining essential information
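
The sliding-window and pooling steps above can be sketched on a toy matrix. This is a minimal illustration, assuming a made-up 5x5 "image" and a 2x2 kernel that just sums the pixels under it; real CNN kernels are learned, not fixed.

```python
import torch
import torch.nn.functional as F

# Toy 5x5 single-channel "image" (values are arbitrary, for illustration only).
image = torch.tensor([[1., 2., 0., 1., 0.],
                      [0., 1., 3., 1., 2.],
                      [2., 1., 0., 0., 1.],
                      [1., 0., 1., 2., 0.],
                      [0., 2., 1., 1., 3.]])

# Assumed kernel: a 2x2 window that sums the pixels it covers.
kernel = torch.ones(2, 2)

# conv2d expects input as [batch, channels, height, width]
# and weights as [out_channels, in_channels, kernel_h, kernel_w].
feature_map = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 2, 2))
print(feature_map.shape)   # overlapping windows (stride 1): 5x5 -> 4x4

# Max pooling over non-overlapping 2x2 windows (stride defaults to kernel_size).
pooled = F.max_pool2d(feature_map, kernel_size=2)
print(pooled.shape)        # 4x4 -> 2x2
print(pooled.squeeze())
```

Each pooled value is the largest entry of one 2x2 block of the feature map, so the output is four times smaller while keeping the strongest responses.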

Quick summary
- the kernel operation is a sliding window that takes all the pixels in the window and applies an operation to them
- the pooling layer applies a function (usually max) that selects one value from each window, which helps reduce dimensionality

Two main advantages of using convolutions in this context
- 1) able to compose a series of low-level features into higher-level features (feature reduction)
- 2) makes our model location invariant (e.g. in edge detection, the same features are picked up wherever they appear)
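
Location invariance can be illustrated with a 1-D toy case. The sketch below assumes a hand-written kernel that responds to the pattern `[1, 2, 1]` and a hypothetical helper `strongest_response`; it shows that taking the max over the whole feature map gives the same value wherever the pattern sits.

```python
import torch
import torch.nn.functional as F

# Assumed, hand-written "feature detector" for the pattern [1, 2, 1].
kernel = torch.tensor([1., 2., 1.]).view(1, 1, 3)

# The same pattern placed at the start and at the end of a signal.
early = torch.tensor([1., 2., 1., 0., 0., 0., 0., 0.]).view(1, 1, 8)
late = torch.tensor([0., 0., 0., 0., 0., 1., 2., 1.]).view(1, 1, 8)

def strongest_response(signal):
    """Convolve, then max-pool over every position of the feature map."""
    feature_map = F.conv1d(signal, kernel)                     # [1, 1, 6]
    return F.max_pool1d(feature_map, feature_map.shape[2]).item()

# The feature maps peak at different positions...
print(F.conv1d(early, kernel).argmax().item(), F.conv1d(late, kernel).argmax().item())
# ...but the pooled response is identical.
print(strongest_response(early), strongest_response(late))
```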

### Convolutions for NLP
- since words can be represented as vectors and a sentence as a sequence of vectors, our corpus or text can be represented as a matrix
- main logic: if we can convolve over a sentence in a way that allows us to capture the relation of one word to the words around it, we can theoretically detect patterns in language and use this to better classify our sentences.
- convolutions in NLP differ slightly from those for images: we now want to convolve across whole word vectors rather than within word vectors
- a benefit is that there isn't a limit to the n-gram sizes we can convolve over, and we can convolve over several different n-gram sizes simultaneously
- e.g. we can capture both bigrams and trigrams with a suitable architecture
- they also have their drawbacks:
- unlike in images, where a pixel is most likely related only to its surrounding pixels,
- a word in a sentence can be related to its neighbours as well as to something at the end of the sentence (RNNs capture this through longer-term memory dependencies), but CNNs may struggle because they only capture the local surroundings of the target word

- however, CNNs for NLP have been shown to perform well on certain tasks
- the main advantage of CNNs for NLP is their speed and efficiency
- convolutions are easy to implement on GPUs, which allows fast parallelisation
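
Convolving across whole word vectors for several n-gram sizes at once can be sketched as follows. The sizes (7 words, 8-dimensional embeddings, 4 filters) and the random "sentence" are assumptions chosen only to make the shapes easy to read.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

embedding_dim, n_filters, n_words = 8, 4, 7   # illustrative sizes

# A random stand-in for one embedded sentence: [batch, channel, words, embedding_dim].
sentence = torch.randn(1, 1, n_words, embedding_dim)

shapes, pooled = {}, []
for n in (2, 3):   # bigram and trigram filters
    # The kernel spans n whole word vectors, so it slides over words only.
    conv = nn.Conv2d(in_channels=1, out_channels=n_filters,
                     kernel_size=(n, embedding_dim))
    feature_map = F.relu(conv(sentence)).squeeze(3)   # [1, n_filters, n_words - n + 1]
    shapes[n] = tuple(feature_map.shape)
    # Keep the strongest response of each filter across all positions.
    pooled.append(F.max_pool1d(feature_map, feature_map.shape[2]).squeeze(2))

features = torch.cat(pooled, dim=1)   # [1, 2 * n_filters]
print(shapes)
print(features.shape)
```

Because each kernel is as wide as the embedding, the bigram filters see 6 windows and the trigram filters see 5, and both end up as fixed-length vectors that can be concatenated.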

# CODE

In [12]:
# torchtext.legacy is a submodule of torchtext 0.9 to 0.11, not a separate pip package
!pip3 install "torchtext>=0.9,<0.12"

In [14]:
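import random
import time

import numpy as np
import spacy
import torch
import torch.nn as nn
import torch.nn.functional as F
# Field, LabelField, BucketIterator and TREC.splits belong to the legacy API
# (torchtext 0.9 to 0.11), so import data and datasets from torchtext.legacy.
from torchtext.legacy import data, datasets

nlp = spacy.load('en_core_web_sm')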

- aim: build a multi-class text classifier (6 target classes)
- data: a question-answering dataset (TREC, https://trec.nist.gov/data/qa.html)
- it is commonly used to evaluate a model's performance on text-classification tasks
- the model now returns a score for each of the six possible classes (softmax turns these into probabilities; pick the highest)
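
Picking the highest-probability class can be sketched as below. The scores are made up, and the index order of `class_names` is only illustrative (the real order depends on how the label vocabulary is built). The model in these notes returns raw scores, and `nn.CrossEntropyLoss` applies log-softmax internally, so an explicit softmax is only needed when probabilities are wanted.

```python
import torch
import torch.nn.functional as F

# TREC coarse labels; this index order is an assumption for illustration.
class_names = ["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"]

# Hypothetical raw scores (logits) for one question: [batch, n_classes].
logits = torch.tensor([[1.2, -0.3, 0.1, 2.5, 0.0, -1.0]])

probs = F.softmax(logits, dim=1)    # each row now sums to 1
predicted = probs.argmax(dim=1)     # index of the most probable class

print(probs)
print(class_names[predicted.item()])
```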

NOTE:
- normally we can inspect a dataset directly, but here we are dealing with a TorchText dataset object
- use train_data.examples[0].text and train_data.examples[0].label to view its contents
- a neural network will not take raw text as input, so the text must be turned into some form of embedding representation
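
Turning raw tokens into an embedding representation can be sketched in two steps: map tokens to integer indices, then look those indices up in an embedding table. The tiny vocabulary `stoi` and the helper `numericalize` below are hypothetical; in the notebook, building the vocabulary and padding the batches are handled by TorchText.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary; a real one is built from the training data.
stoi = {"<unk>": 0, "<pad>": 1, "what": 2, "is": 3, "a": 4, "convolution": 5}

def numericalize(tokens, length):
    """Map tokens to indices (unknown words -> <unk>) and pad to a fixed length."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens]
    return ids + [stoi["<pad>"]] * (length - len(ids))

batch = torch.tensor([
    numericalize(["what", "is", "a", "convolution"], 5),
    numericalize(["what", "is", "pooling"], 5),        # "pooling" is out of vocabulary
])

# padding_idx keeps the <pad> row at zero and excludes it from gradient updates.
embedding = nn.Embedding(len(stoi), 4, padding_idx=stoi["<pad>"])
vectors = embedding(batch)

print(batch)
print(vectors.shape)   # [batch, seq_len, embedding_dim]
```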

In [10]:
import torchtext.legacy  # also binds `torchtext`; the legacy API exists in torchtext 0.9 to 0.11

In [15]:
questions = torchtext.legacy.data.Field(tokenize = 'spacy',
                                        tokenizer_language = 'en_core_web_sm',
                                        batch_first = True)
labels = torchtext.legacy.data.LabelField(dtype = torch.float)

In [None]:
train_data, _ = datasets.TREC.splits(questions, labels)

train_data, valid_data = train_data.split()

In [None]:
train_data


In [None]:
train_data.examples[0].text

In [None]:
train_data.examples[0].label


In [None]:
print(len(train_data))
print(len(valid_data))

In [None]:

questions.build_vocab(train_data,
                 vectors = "glove.6B.200d", 
                 unk_init = torch.Tensor.normal_)

labels.build_vocab(train_data)

In [None]:

questions.vocab.vectors

In [None]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = 64, 
    device = device)

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx):
        
        super().__init__()
                
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
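    def forward(self, text):
        # text: [batch, seq_len] token indices (batch_first = True)
        # emb: [batch, 1, seq_len, embedding_dim]; the added dim is the single input channel
        emb = self.embedding(text).unsqueeze(1)
        # one feature map per filter size: [batch, n_filters, seq_len - fs + 1]
        conved = [F.relu(c(emb)).squeeze(3) for c in self.convs]
        # max over all positions: [batch, n_filters] per filter size
        pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
        # concatenated features: [batch, len(filter_sizes) * n_filters]
        concat = self.dropout(torch.cat(pooled, dim = 1))
        # raw class scores (logits): [batch, output_dim]
        return self.fc(concat)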

In [None]:
input_dimensions = len(questions.vocab)
output_dimensions = 6
embedding_dimensions = 200
pad_index = questions.vocab.stoi[questions.pad_token]

number_of_filters = 100
filter_sizes = [2,3,4]
dropout_pc = 0.5


model = CNN(input_dimensions, embedding_dimensions, number_of_filters, 
            filter_sizes, output_dimensions, dropout_pc, pad_index)

In [None]:
glove_embeddings = questions.vocab.vectors

model.embedding.weight.data.copy_(glove_embeddings)

In [None]:
unknown_index = questions.vocab.stoi[questions.unk_token]

model.embedding.weight.data[unknown_index] = torch.zeros(embedding_dimensions)
model.embedding.weight.data[pad_index] = torch.zeros(embedding_dimensions)

In [None]:
optimizer = torch.optim.Adam(model.parameters())

criterion = nn.CrossEntropyLoss().to(device)

model = model.to(device)

In [None]:
def multi_accuracy(preds, y):
    pred = torch.max(preds,1).indices
    correct = (pred == y).float()
    acc = correct.sum() / len(correct)
    return acc

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        preds = model(batch.text).squeeze(1)
        loss = criterion(preds, batch.label.long())
        
        acc = multi_accuracy(preds, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    total_epoch_loss = epoch_loss / len(iterator)
    total_epoch_accuracy = epoch_acc / len(iterator)
        
    return total_epoch_loss, total_epoch_accuracy

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            preds = model(batch.text).squeeze(1)
            
            loss = criterion(preds, batch.label.long())
            
            acc = multi_accuracy(preds, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
    total_epoch_loss = epoch_loss / len(iterator)
    total_epoch_accuracy = epoch_acc / len(iterator)
        
    return total_epoch_loss, total_epoch_accuracy

In [None]:
epochs = 10

lowest_validation_loss = float('inf')

for epoch in range(epochs):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    if valid_loss < lowest_validation_loss:
        lowest_validation_loss = valid_loss
        torch.save(model.state_dict(), 'cnn_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {int(end_time - start_time)}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')