# Part A. Tokenising and Embeddings
- Connect to GPU (at least for Part B)

In [1]:
sentence = "hello hello there world"
words = sentence.split()
print(words)

['hello', 'hello', 'there', 'world']


Manually construct a simple dictionary containing the words in the sentence

In [2]:
word_dict = {"hello":0, "world":1, "there":2}

Get the index for each word in the sentence

In [3]:
indices = [word_dict[w] for w in words]
indices

[0, 0, 2, 1]
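
The dictionary above was written by hand. As a minimal sketch, the same kind of word-to-index mapping can be derived from the sentence itself by giving each new word the next free index. Note that this assigns indices in order of first appearance, so `there` and `world` end up swapped compared with the manual `word_dict`.

```python
sentence = "hello hello there world"
words = sentence.split()

# Assign the next unused index to every word the first time it is seen
word_dict = {}
for w in words:
    if w not in word_dict:
        word_dict[w] = len(word_dict)

indices = [word_dict[w] for w in words]
print(word_dict)  # {'hello': 0, 'there': 1, 'world': 2}
print(indices)    # [0, 0, 1, 2]
```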

Convert the word indices of the input sentence to a tensor

In [4]:
import torch

input_tensor = torch.tensor(indices, dtype=torch.long) # normally long used for words
input_tensor

tensor([0, 0, 2, 1])

### Create Embeddings

In [5]:
from torch.nn import Embedding

In [6]:
num_embs, dim = 10, 4

emb1 = Embedding(num_embs, dim) 

emb1
 

Embedding(10, 4)

See the actual values of the embedding

In [7]:
emb1.weight

Parameter containing:
tensor([[ 0.8305,  0.7825,  0.6083, -1.0396],
        [ 0.2254, -0.1195, -0.5939, -0.2896],
        [ 0.8269,  1.3437,  0.9034,  0.9363],
        [ 1.1714, -0.1877, -0.4743, -0.6999],
        [ 0.4091,  0.7670,  0.4651, -0.8560],
        [-0.3699,  0.1732,  0.7964,  1.1244],
        [ 1.6849, -0.7846, -0.1776, -0.5359],
        [-1.6785,  1.8147,  1.2790, -0.0272],
        [-0.3433,  0.3914, -0.2413,  0.1454],
        [ 0.1517,  0.3224,  1.6081, -0.6556]], requires_grad=True)

Since `requires_grad=True`, this entire array is learnable




### Exercise
Create embeddings for our sentence `"hello hello there world"` by passing suitable parameters to the `Embedding` class.
- each word should be converted into a 10-dimensional vector
- you may use `input_tensor` directly for this
- print the embeddings for this sentence

In [None]:
# Q1. Insert your code here
  ## ANSWER: The required code is given in the next cell.

In [8]:
# Create the embeddings for the word list that we have
num_words = len(word_dict)  # the table needs one row per vocabulary word, not per input position
emb_dim = 10

emb = Embedding(num_words, emb_dim)
emb_vecs = emb(input_tensor)
print(emb_vecs)
print(emb_vecs.shape)

tensor([[-0.2722, -1.0138, -1.1229, -0.0488, -0.3080,  1.5382, -0.0425, -1.8042,
         -0.1192, -0.5644],
        [-0.2722, -1.0138, -1.1229, -0.0488, -0.3080,  1.5382, -0.0425, -1.8042,
         -0.1192, -0.5644],
        [-0.7802,  0.9540, -1.5835,  1.5391,  1.9708,  0.5258, -1.2213,  0.7689,
          0.2542,  0.1941],
        [-1.3428, -1.1686, -0.3649, -1.4019, -0.0329, -1.1831,  1.2871, -0.0253,
         -0.0124,  0.1072]], grad_fn=<EmbeddingBackward>)
torch.Size([4, 10])


In [9]:
# Q2. Why are the embeddings the same for the first two rows?
  ## ANSWER: Because the first and second words are both "hello", so they map to the same index and look up the same row of the embedding table.
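
To make the answer to Q2 concrete, the sketch below checks that an embedding lookup is just row indexing into the weight table, which is why a repeated index returns a repeated row. The seed and the 3-row table size are choices made for this illustration only.

```python
import torch
from torch.nn import Embedding

torch.manual_seed(0)  # only to make the random table reproducible

word_dict = {"hello": 0, "world": 1, "there": 2}
indices = torch.tensor([word_dict[w] for w in "hello hello there world".split()])

emb = Embedding(len(word_dict), 10)  # 3 words, 10-dimensional vectors
looked_up = emb(indices)             # forward pass of the layer
sliced = emb.weight[indices]         # plain row indexing into the weight table

same_as_indexing = torch.equal(looked_up, sliced)
rows_repeat = torch.equal(looked_up[0], looked_up[1])
print(same_as_indexing, rows_repeat)  # True True
```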

### Padding
- Padding tokens (used here for out-of-vocabulary or unknown words) map to `padding_idx`; their embedding receives no gradient, so it is never updated

In [10]:
emb2 = Embedding(num_embs, dim, padding_idx=5)
emb2.weight

Parameter containing:
tensor([[ 0.7412,  0.6610,  0.2485, -0.1974],
        [-1.9179,  0.5738, -0.6792,  0.3645],
        [-1.3824,  0.8561, -1.8040,  0.3604],
        [-0.5156,  0.3010,  0.7308, -0.1739],
        [-1.1909, -0.4703,  1.8619, -2.7545],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 2.2418,  1.8003, -0.3473,  1.5860],
        [ 0.3003, -0.0048,  0.4735,  1.3921],
        [ 0.7057, -0.1509,  0.1116,  0.3007],
        [ 0.0543,  1.7655, -0.8394,  0.1166]], requires_grad=True)

In [None]:
# Q3. What is the effect of applying padding_idx=5 to the embedding?
  ## ANSWER: With padding_idx=5, the embedding vector at index 5 does not contribute to the gradient,
  ## so it is not updated during training. It is also initialised to all zeros.

Average out the embedding on the input and compute the gradient

In [11]:
emb2(input_tensor).mean().backward()
emb2.weight.grad

tensor([[0.1250, 0.1250, 0.1250, 0.1250],
        [0.0625, 0.0625, 0.0625, 0.0625],
        [0.0625, 0.0625, 0.0625, 0.0625],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000]])
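
The gradient values above follow from simple counting: `mean()` averages over 4 words × 4 dimensions = 16 values, so every looked-up element contributes 1/16 = 0.0625, and index 0 ("hello") is looked up twice. A minimal sketch of that arithmetic in plain Python, with no autograd involved:

```python
from collections import Counter

indices = [0, 0, 2, 1]   # word indices of "hello hello there world"
num_embs, dim = 10, 4    # same table shape as emb2
padding_idx = 5

n_elements = len(indices) * dim  # mean() averages over 4 * 4 = 16 values
counts = Counter(indices)        # how often each row is looked up

# Each lookup of a row adds 1/16 to every entry of that row's gradient;
# the padding row always keeps a zero gradient.
expected_grad = [
    [0.0 if row == padding_idx else counts[row] / n_elements] * dim
    for row in range(num_embs)
]
for row in expected_grad[:4]:
    print(row)
```

The first three rows come out as 0.125, 0.0625 and 0.0625, and every other row is zero, matching `emb2.weight.grad`.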

## Part B: Text Classification using AG News
- The `AG_NEWS` dataset has news on four topics: World, Sports, Business and Sci/Tech
- Source: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html


### 1. Access to the raw data as an iterator

In [3]:
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

train.csv: 29.5MB [00:00, 86.1MB/s]


View some items in this dataset

In [13]:
next(train_iter)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [15]:
next(train_iter)

(3,
 "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.")

In [None]:
# Q4. What does each value in the result of the last two code cells represent with regards to a news item?
  ## ANSWER: The first value of each tuple is the class label of the news item. The second value is its text.

### 2. Create Vocabulary Using Tokeniser
- Use `get_tokenizer()` with the `basic_english` tokeniser to split text into tokens
- Yield the list of tokens for each text in the training set. More on `yield` [HERE](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do)
- Build the vocabulary using `build_vocab_from_iterator()`

In [4]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"], min_freq=10)
vocab.set_default_index(vocab["<unk>"])


# Q5. What does the last line of code do?
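  ## ANSWER: It sets the default index for OOV tokens, so looking up a token that is not in the vocab returns the index of "<unk>" instead of raising an error.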

In [17]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.get_stoi()
print("The index of '<unk>' is", new_stoi['<unk>'])
print(new_stoi)
new_itos = vocab.get_itos()
print("The token at index 2 is", new_itos[2])
print(new_itos)

# Q6. How many words are in this vocabulary? 
  ## ANSWER: 20644

# Q7. Would vocab size be higher or lower if min_freq in build_vocab_from_iterator() was set to something smaller? Why?
  ## ANSWER: Higher. A smaller min_freq lowers the number of occurrences a word needs in order to be kept, so more rare words enter the vocab.

The length of the new vocab is 20644
The index of '<unk>' is 0
The token at index 2 is the
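
The effect of `min_freq` and of the default index can be sketched on a toy corpus. The `build_vocab` helper below is a hypothetical stand-in written for this illustration, not torchtext's implementation, and its ordering rule (most frequent first, ties alphabetical) is an assumption.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<unk>",)):
    """Toy vocabulary builder (hypothetical helper): maps token -> index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Keep only tokens seen at least min_freq times, most frequent first
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
small = build_vocab(corpus, min_freq=2)  # only "the" and "sat" survive
large = build_vocab(corpus, min_freq=1)  # every token survives
print(len(small), len(large))  # 3 6

# Emulate a default index: unknown tokens fall back to "<unk>"
def lookup(tok):
    return small.get(tok, small["<unk>"])

print(lookup("the"), lookup("cat"))  # 1 0
```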


Test this vocab with an example sentence

In [5]:
for token in tokenizer("Oh, howdy world?"):
    print(vocab.lookup_indices([token]))
    print()

[6122]
[3]
[0]
[50]
[80]


In [None]:
# Q8. How many tokens does this sentence have? Does this match the number of words in the sentence? Why?
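  ## ANSWER: 5 tokens. The sentence has 3 words, so the counts only match if ',' and '?' are counted as words; the tokeniser emits each of them as a separate token.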

# Q9. Looking at the token values, are there any of them not contained in the vocab? Which one(s)?
  ## ANSWER: The third word 'howdy', as its index value is 0
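
The five indices can be reproduced by hand. The regular expression below is only a rough approximation of what the `basic_english` tokeniser does to this particular sentence (lowercasing and splitting off punctuation), and `rough_tokenize` is a hypothetical helper. The toy vocabulary contains just the indices printed in the output above.

```python
import re

def rough_tokenize(text):
    """Rough approximation only: lowercase, then split words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = rough_tokenize("Oh, howdy world?")
print(tokens)  # ['oh', ',', 'howdy', 'world', '?']

# Indices copied from the notebook output; "howdy" is absent on purpose
toy_vocab = {"<unk>": 0, ",": 3, "world": 50, "?": 80, "oh": 6122}
indices = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]
print(indices)  # [6122, 3, 0, 50, 80]
```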

### 3. Prepare Text and Label Pipelines
- the text and label pipelines will be used to process the raw data strings from the dataset iterators

In [19]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

So now we can use the pipeline by sending in a whole sentence

In [20]:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [21]:
label_pipeline(2)

1

### 4. Data Batch and Iterator
- we will use DataLoader that we are already familiar with
- we will write `collate_fn` function to work on a batch of samples generated from DataLoader
- it takes a batch from the DataLoader and processes it according to the data processing pipelines declared previously
- the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of `nn.EmbeddingBag` later
- The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor

In [9]:
from torch.utils.data import DataLoader

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        # Convert the 1-based label to a 0-based class index
        label_list.append(label_pipeline(_label))
        # Convert the text entry to a tensor of token indices
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)  # torch.long
        # Store the token indices in text_list
        text_list.append(processed_text)
        # Record the length of this entry; cumsum() below turns lengths into start offsets
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0) # cumulative sum of the first dimension
    text_list = torch.cat(text_list) # concatenate
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

# Q10. What is "offsets" and why is it required for this dataset?
  ## ANSWER: "offsets" is a tensor of delimiters that gives the beginning index of each sequence in the concatenated text tensor.
  ## It is required because the news texts have different lengths and are concatenated into one 1-D tensor.
  ## Without the offsets, the model could not tell where one text ends and the next begins.

Quick illustration of `cumsum()`

In [10]:
x = torch.arange(0, 6).view(2, 3)
print(x)
print(f'Cumulative sum in 1st dim: \n{x.cumsum(dim=0)}')
print(f'Cumulative sum in 2nd dim: \n{x.cumsum(dim=1)}')

tensor([[0, 1, 2],
        [3, 4, 5]])
Cumulative sum in 1st dim: 
tensor([[0, 1, 2],
        [3, 5, 7]])
Cumulative sum in 2nd dim: 
tensor([[ 0,  1,  3],
        [ 3,  7, 12]])


### 5. Define the NN Model
- consists of an embedding layer and a linear layer
- the `nn.EmbeddingBag` layer computes sums or means of `bags` of embeddings, without instantiating the intermediate embeddings


In [7]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

# Q11. Why is padding not required when using EmbeddingBag()?
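  ## ANSWER: Because the texts are concatenated into one 1-D tensor and "offsets" marks where each text starts,
  ## so EmbeddingBag can pool each bag separately without padding the texts to a common length.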
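
As a small check of what the embedding layer in this model computes, the sketch below compares `nn.EmbeddingBag` in its default `mean` mode with manually averaging the embedding rows of each bag. The table size, indices and seed are made up for the illustration, and `sparse=True` is left out because it only affects how gradients are stored.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the random table reproducible

bag = nn.EmbeddingBag(10, 4, mode="mean")  # 10-token vocab, 4-dim vectors

text = torch.tensor([1, 2, 3, 4, 5])  # two texts concatenated: [1, 2, 3] and [4, 5]
offsets = torch.tensor([0, 3])        # start index of each text

out = bag(text, offsets)              # one pooled vector per text

# Manual equivalent: look up the rows of each bag and average them
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])
matches = torch.allclose(out, manual)
print(out.shape, matches)
```

The two texts have different lengths, yet both end up as a single 4-dimensional vector, so nothing has to be padded.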

Build model instance

In [11]:
train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter])) # derive the unique class labels
vocab_size = len(vocab)
emsize = 64 # embedding size

model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

# Q12. What would be the vector size of each token in the vocabulary?
  ## ANSWER: 64

### 6. Functions for Training and Evaluation

In [12]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

### 7. Split the dataset and run
- the original data was not split into train/validation
- use `torch.utils.data.dataset.random_split` to split the data
- `torch.optim.lr_scheduler.StepLR` decays the learning rate of each parameter group by `gamma` every `step_size` epochs
- `to_map_style_dataset()` converts iterable-style dataset to map-style dataset. This is needed because iterables do not have `__getitem__()` and therefore cannot be indexed
- **NOTE!** If the training code is too slow (you do not see the first output within 1 minute), restart your runtime and start over
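
The difference between iterable-style and map-style data can be sketched with plain Python objects: a generator stands in for an iterable-style dataset and a list for a map-style one. The two samples are shortened versions of the headlines shown earlier, and `iterable_style` is a hypothetical helper.

```python
def iterable_style():
    """Stand-in for an iterable-style dataset: items can only be streamed."""
    yield (3, "Wall St. Bears Claw Back Into the Black")
    yield (3, "Oil and Economy Cloud Stocks' Outlook")

stream = iterable_style()
try:
    stream[0]          # generators do not implement __getitem__
    indexable = True
except TypeError:
    indexable = False

# Materialising the stream gives something with len() and indexing,
# which is what splitting into random subsets relies on
map_style = list(iterable_style())
print(indexable, len(map_style), map_style[1][0])  # False 2 3
```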

In [27]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 32 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

num_train = int(len(train_dataset) * 0.9)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

total_acc = None

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    acc_val = evaluate(valid_dataloader)
    if total_acc is not None and total_acc > acc_val:
        scheduler.step()
    else:
        total_acc = acc_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           acc_val))
    print('-' * 59)

test.csv: 1.86MB [00:00, 41.1MB/s]                  


| epoch   1 |   500/ 3375 batches | accuracy    0.635
| epoch   1 |  1000/ 3375 batches | accuracy    0.827
| epoch   1 |  1500/ 3375 batches | accuracy    0.863
| epoch   1 |  2000/ 3375 batches | accuracy    0.871
| epoch   1 |  2500/ 3375 batches | accuracy    0.881
| epoch   1 |  3000/ 3375 batches | accuracy    0.881
-----------------------------------------------------------
| end of epoch   1 | time: 15.23s | valid accuracy    0.888 
-----------------------------------------------------------
| epoch   2 |   500/ 3375 batches | accuracy    0.903
| epoch   2 |  1000/ 3375 batches | accuracy    0.899
| epoch   2 |  1500/ 3375 batches | accuracy    0.901
| epoch   2 |  2000/ 3375 batches | accuracy    0.905
| epoch   2 |  2500/ 3375 batches | accuracy    0.906
| epoch   2 |  3000/ 3375 batches | accuracy    0.904
-----------------------------------------------------------
| end of epoch   2 | time: 15.63s | valid accuracy    0.899 
--------------------------------------------------
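
Two numbers in this run can be checked with a little arithmetic. The sketch below assumes the AG News training split has 120,000 items (a figure not stated in this notebook) and uses the closed form of a step decay, `LR * gamma ** (steps // step_size)`. Because the loop above calls `scheduler.step()` only when validation accuracy drops, "steps" here counts those calls, not epochs.

```python
import math

n_train_total = 120_000  # assumed size of the AG News training split
BATCH_SIZE = 32
LR, gamma, step_size = 5, 0.1, 1

# 90% of the training data goes to the training subset
num_train = int(n_train_total * 0.9)
batches_per_epoch = math.ceil(num_train / BATCH_SIZE)
print(num_train, batches_per_epoch)  # 108000 3375

def lr_after(n_steps):
    """Learning rate after n_steps calls to scheduler.step() (closed form)."""
    return LR * gamma ** (n_steps // step_size)

schedule = [lr_after(k) for k in range(3)]
print(schedule)  # roughly 5, 0.5, 0.05
```

The 3375 batches per epoch agree with the training log above.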

In [None]:
# Q13. Why are the train and test datasets converted to map style datasets?
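  ## ANSWER: Iterable-style datasets do not have __getitem__() and cannot be indexed. random_split and DataLoader(shuffle=True) need to index samples, which a map-style dataset allows.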

# Q14. Think of ways to further improve the accuracy of the model. 
  ## ANSWER: Use a more expressive model than the current EmbeddingBag + linear layer (e.g. a CNN or RNN) / apply ensembles / make the model bigger