# 4 NLP with PyTorch

## 4.1 Representing text as tensors

If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.

For example, when you type "Hello", the computer sees a sequence of character codes such as [1001000, 1100101, ...], where H -> 1001000 and e -> 1100101.
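The mapping from characters to numbers can be checked directly in Python. The snippet below is a small illustration using the built-in `ord` function; the variable names are chosen here for the example only.

```python
# Show the code point and its 7-bit binary form for each character in "Hello".
text = "Hello"
codes = [ord(ch) for ch in text]                 # integer code points
bits = [format(code, "07b") for code in codes]   # zero-padded binary strings

for ch, code, b in zip(text, codes, bits):
    print(f"{ch!r} -> {code} -> {b}")
```

For these ASCII characters the UTF-8 encoding uses the same values, so `"Hello".encode("utf-8")` yields the same numbers as bytes.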

Humans understand what each letter represents and how characters come together to form the words of a sentence. Computers have no such understanding on their own, so a neural network has to learn the meaning of words during training.

There are several approaches we can use when representing text:

1. Character-level representation, in which we represent text by treating each character as a number. Given that we have C different characters in our text corpus, the word 'Hello' would be represented by a 5×C tensor. Each letter corresponds to one row of that tensor, a one-hot vector of length C.

2. Word-level representation, in which we create a vocabulary of all words in our text and then represent words using one-hot encoding. This approach is somewhat better: a single letter carries little meaning by itself, so using higher-level semantic units (words) simplifies the task for the neural network. However, because the vocabulary is large, we have to deal with high-dimensional sparse tensors.
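As a rough illustration of the character-level case, the sketch below one-hot encodes 'Hello' with `torch.nn.functional.one_hot`. It assumes a toy alphabet made only of the distinct characters in this one word (so C = 4); a real corpus would have a much larger C.

```python
import torch
import torch.nn.functional as F

word = "Hello"
# Toy alphabet (assumption): only the distinct characters of this word.
alphabet = sorted(set(word))
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

# One index per character, then one one-hot row per character.
indices = torch.tensor([char_to_index[ch] for ch in word])
one_hot = F.one_hot(indices, num_classes=len(alphabet))

print(alphabet)
print(one_hot.shape)  # 5 characters by C = 4 alphabet entries
print(one_hot)
```

Each row contains a single 1, and the two 'l' characters produce identical rows.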

## 4.2 Text classification task
In this module, we start with a simple text classification task based on the AG_NEWS dataset: classifying news headlines into one of 4 categories:
- World
- Sports
- Business
- Sci/Tech

### 4.2.1 Download data set

In [26]:
import torch
import torchtext
import os
from collections import Counter, OrderedDict
from torchtext.vocab import vocab

# The dataset is built into torchtext, so we can load it from torchtext.datasets.
path = "/tmp/pytorch/data"
os.makedirs(path, exist_ok=True)
# torchtext.datasets returns iterators
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=path)
# to reuse the data, we convert the iterators to lists
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

### 4.2.2 Explore the data set

After the conversion above, train_dataset and test_dataset are lists of rows. Each row has two columns:
- label (class number from 1 to 4: 1->World, 2->Sports, 3->Business, 4->Sci/Tech)
- text

Below is an example of a row: 3 is the label (Business), and the string is the text.

In [27]:
train_dataset[0]

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [28]:
# print the first 5 rows of the data set
for i, x in zip(range(5), train_dataset):
    print(f"label: {x[0]} -> text: {x[1]}")

label: 3 -> text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
label: 3 -> text: Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
label: 3 -> text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
label: 3 -> text: Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.
label: 3 -> text: Oil pric
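Every row printed above carries label 3, which matches the Business example. Assuming the raw labels are 1-based class numbers, a small helper can translate them into names from the `classes` list. The function `label_to_class` is a hypothetical name introduced here for illustration.

```python
classes = ['World', 'Sports', 'Business', 'Sci/Tech']


def label_to_class(label):
    """Map a raw 1-based label to its class name (hypothetical helper)."""
    return classes[label - 1]


print(label_to_class(3))
```

The same `label - 1` shift is what turns raw labels into 0-based class indices for training.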

### 4.2.3 Transform text to tensors

To make text readable by a neural network, we need to convert it into tensors.

#### 4.2.3.1 Tokenize text and build a vocabulary
The first step is to convert text into numbers. Since we want a word-level representation, we need to do two things:

- use a tokenizer to split text into tokens
- build a vocabulary of those tokens.

In [29]:
# torchtext provides a basic tokenizer; here is an example
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenizer('He said: hello')

['he', 'said', 'hello']

In [30]:
# we use a Counter to record how often each token occurs
counter = Counter()
# iterate over all rows, convert each text to word tokens, and add them to the counter
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
# sort the tokens by frequency, most frequent first
sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
# an OrderedDict preserves that order
words_dict = OrderedDict(sorted_by_token_freq_tuples)
# build the vocabulary from the ordered token counts
vocab1 = vocab(words_dict)
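To make the index assignment concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is not torchtext's implementation: it only assumes that indices follow the insertion order of the frequency-sorted pairs, and the corpus and names are made up for illustration.

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical data).
tokenized = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "end"],
]

counter = Counter()
for tokens in tokenized:
    counter.update(tokens)

# Most frequent first; sorted() is stable, so ties keep first-seen order.
by_freq = sorted(counter.items(), key=lambda kv: kv[1], reverse=True)

# Assign index 0 to the most frequent token, 1 to the next, and so on.
token_to_index = {token: i for i, (token, _) in enumerate(by_freq)}
print(token_to_index)
```

Because frequent tokens get the lowest indices, keeping only the first k indices keeps the k most frequent tokens.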

In [31]:
# check the size of the vocabulary
vocab_size = len(vocab1)
print(f"Vocab size is {vocab_size}")


Vocab size is 95810


In [32]:
# we can easily convert a text to a list of numbers by using the generated vocabulary
# note: vocab1 has no default index, so a token that is not in the vocabulary raises an error
def encode(vocabulary, text):
    return [vocabulary[word] for word in tokenizer(text)]


encode(vocab1, 'I love to play with my words')

[281, 2318, 3, 335, 17, 1299, 2353]
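The `encode` function looks up every token directly, so it relies on each token being present in the vocabulary. One common way to handle unseen tokens is to reserve an index for an unknown-word placeholder. The sketch below shows the idea with a plain dictionary; the `<unk>` token, the toy mapping, and `encode_with_unk` are assumptions for illustration, not part of the code above.

```python
UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens

# Toy token-to-index mapping (made-up values).
token_to_index = {UNK: 0, "i": 1, "love": 2, "words": 3}


def encode_with_unk(mapping, tokens):
    """Map tokens to indices, falling back to the <unk> index."""
    unk_index = mapping[UNK]
    return [mapping.get(token, unk_index) for token in tokens]


print(encode_with_unk(token_to_index, ["i", "love", "tensors"]))
```

Here "tensors" is not in the mapping, so it is encoded with the placeholder index instead of raising an error.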

#### 4.2.3.2 Bag of words text representation

In the previous step, we converted text to numbers. Now we want to convert these numbers to tensors. Bag of words is one way to do so.

Because words represent meaning, sometimes we can figure out the meaning of a text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like weather and snow are likely to indicate a weather forecast, while words like stocks and dollar would count towards financial news.

The Bag of Words (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, and the vector element at that index contains the number of occurrences of the word in a given document.

Note: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.
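The note above can be checked numerically. The sketch below assumes a toy vocabulary of size 6 and a made-up encoded text, and compares the sum of one-hot rows with a direct count of indices via `torch.bincount`.

```python
import torch
import torch.nn.functional as F

toy_vocab_size = 6                     # toy vocabulary size (assumption)
indices = torch.tensor([2, 0, 2, 5])   # made-up encoded text; index 2 occurs twice

# Sum of the one-hot rows, one row per token.
bow_from_one_hot = F.one_hot(indices, num_classes=toy_vocab_size).sum(dim=0)
# Direct count of how often each index occurs.
bow_from_counts = torch.bincount(indices, minlength=toy_vocab_size)

print(bow_from_one_hot)
print(torch.equal(bow_from_one_hot, bow_from_counts))
```

Both routes give the same vector; counting directly avoids materializing the large one-hot matrix.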

In [36]:
# Generate a bag of words by using scikit learn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'I like hot dogs.',
    'The dog ran fast.',
    'Its hot outside.',
]
# fit the vectorizer on the corpus above (this builds its vocabulary)
vectorizer.fit_transform(corpus)

# use the fitted vectorizer to transform a new text
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[1, 1, 0, 2, 0, 0, 0, 0, 0]])
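To see where each number in that array comes from, the counting can be approximated in plain Python. This sketch assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, features in alphabetical order); it is a stand-in for illustration, not scikit-learn's code.

```python
import re

corpus = ['I like hot dogs.', 'The dog ran fast.', 'Its hot outside.']
TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    return TOKEN.findall(text.lower())


# Alphabetically sorted vocabulary built from the corpus.
features = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(features)}


def count_vector(text):
    vec = [0] * len(features)
    for tok in tokenize(text):
        if tok in index:  # tokens outside the fitted vocabulary are ignored
            vec[index[tok]] += 1
    return vec


print(features)
print(count_vector('My dog likes hot dogs on a hot day.'))
```

The columns are dog, dogs, fast, hot, its, like, outside, ran, the. "likes" is not counted because only the exact token "like" is in the vocabulary, and "hot" appears twice.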

The second step is to convert the encoded text to a tensor:

In [None]:
# vocabulary is the vocabulary of all tokens generated from the dataset
# text is the input text that you want to transform
# bow_vocab_size is the length of the BoW vector; tokens with a larger index are ignored
def to_bow(vocabulary, text, bow_vocab_size):
    # create a one-dimensional float tensor of size bow_vocab_size
    result = torch.zeros(bow_vocab_size, dtype=torch.float32)
    # encode converts the text to a list of token indices in the vocabulary
    for i in encode(vocabulary, text):
        if i < bow_vocab_size:
            result[i] += 1
    return result

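Since the vocabulary is often large, we can limit it to the most frequent words. Because vocab1 was built from tokens sorted by frequency, the lowest indices belong to the most frequent tokens, so a smaller bow_vocab_size keeps exactly those. Try lowering the vocab_size value and running the text classifier training code below to see how it affects accuracy. You should expect some accuracy drop, but not a dramatic one, in exchange for higher performance (less memory and computation).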
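The effect of a smaller vector size can be seen on a toy example. The function below is a plain-list analogue of `to_bow` that takes already-encoded indices; the name `to_bow_counts` and the index values are made up for illustration.

```python
def to_bow_counts(indices, bow_vocab_size):
    """Plain-list analogue of to_bow: count only indices below bow_vocab_size."""
    result = [0] * bow_vocab_size
    for i in indices:
        if i < bow_vocab_size:
            result[i] += 1
    return result


# Hypothetical encoded text; indices are frequency ranks (0 = most frequent).
encoded = [0, 3, 0, 7, 1]

full = to_bow_counts(encoded, 8)
small = to_bow_counts(encoded, 4)
print(full)
print(small)
print(sum(full) - sum(small), "token(s) dropped")
```

Halving the vector length here drops only the one rare token with index 7; the counts for frequent tokens are unchanged.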

In [35]:
vocab_size = len(vocab1)

print(to_bow(vocab1, train_dataset[0][1], vocab_size))

tensor([2., 1., 2.,  ..., 0., 0., 0.])


### 4.2.4 Training BoW classifier
Now that we have learned how to build a Bag-of-Words representation of our text, let's train a classifier on top of it.

#### 4.2.4.1 Preparing data
First, we need to convert our dataset so that each text is turned into its bag-of-words representation. This can be achieved by passing the bowify function as the collate_fn parameter to the standard torch DataLoader:

In [45]:
from torch.utils.data import DataLoader
import numpy as np


# this collate function gets a list of batch_size tuples and needs to
# return a pair of label and feature tensors for the whole minibatch
def bowify(news_data_batch):
    return (
        # each item has two elements: item[0] is the label, item[1] is the text
        # labels in the dataset are 1-based, so we subtract 1 to get class indices 0..3
        torch.LongTensor([item[0] - 1 for item in news_data_batch]),
        torch.stack([to_bow(vocab1, item[1], vocab_size) for item in news_data_batch])
    )


train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
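To see what a collate function hands to the training loop, the sketch below runs a DataLoader over a tiny made-up dataset. It assumes rows of (1-based label, already-encoded token indices) and a toy vocabulary size of 5; `toy_bowify` is a hypothetical stand-in for `bowify` that skips tokenization.

```python
import torch
from torch.utils.data import DataLoader

# Toy rows: (1-based label, encoded token indices). Hypothetical data.
toy_dataset = [(1, [0, 2]), (3, [1, 1, 4]), (4, [3]), (2, [0, 0, 0])]
TOY_VOCAB_SIZE = 5


def toy_bowify(batch):
    # Shift labels to 0-based class indices.
    labels = torch.tensor([label - 1 for label, _ in batch], dtype=torch.long)
    # One bag-of-words row per item, stacked into a (batch, vocab) tensor.
    features = torch.stack([
        torch.bincount(torch.tensor(idx), minlength=TOY_VOCAB_SIZE).float()
        for _, idx in batch
    ])
    return labels, features


loader = DataLoader(toy_dataset, batch_size=2, collate_fn=toy_bowify)
labels, features = next(iter(loader))
print(labels)
print(features.shape)
print(features)
```

Each batch is a pair: a 1-D tensor of class indices and a 2-D float tensor with one BoW row per text.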

#### 4.2.4.2 Build nn classifier model

The classifier neural network contains one linear layer. The size of the input vector equals vocab_size, and the output size corresponds to the number of news classes, which is 4 (class indices after the collate function subtracts 1 from the raw labels: 0->World, 1->Sports, 2->Business, 3->Sci/Tech).


Because we are solving a classification task, the final activation function is LogSoftmax().

In [46]:
# note: the model is built with torch.nn.Sequential rather than by subclassing torch.nn.Module
model = torch.nn.Sequential(torch.nn.Linear(vocab_size, 4), torch.nn.LogSoftmax(dim=1))
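A model that ends in LogSoftmax outputs log-probabilities, which is the input that `torch.nn.NLLLoss` expects. The sketch below, using random toy logits, illustrates that this pairing gives the same value as applying `torch.nn.CrossEntropyLoss` to the raw logits.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(3, 4)           # 3 samples, 4 classes (random toy values)
labels = torch.tensor([0, 2, 3])     # made-up target class indices

# Route 1: LogSoftmax followed by NLLLoss.
log_probs = torch.nn.LogSoftmax(dim=1)(logits)
nll = torch.nn.NLLLoss()(log_probs, labels)

# Route 2: CrossEntropyLoss directly on the logits.
ce = torch.nn.CrossEntropyLoss()(logits, labels)

print(nll.item(), ce.item())
print(torch.allclose(nll, ce))
```

If the LogSoftmax layer were removed from the model, CrossEntropyLoss would be the matching loss function.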

#### 4.2.4.3 Build training loop

Now we will define a standard PyTorch training loop.

In [47]:
# model is the neural network that we want to train
# data is the training data as a DataLoader
# lr is the learning rate
# optimizer updates the model parameters from the gradients of the loss (Adam by default)
# loss_fn is the loss function
# epoch_size is the maximum number of samples to process before stopping (None = all data)
# report_freq is the number of batches between progress reports

def train_loop(model, data, lr=0.01, optimizer=None, loss_fn=torch.nn.NLLLoss(), epoch_size=None, report_freq=200):
    # set the optimizer; if none is provided, use the default one
    optimizer = optimizer or torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    # running totals for this call: summed batch losses, correct predictions, samples seen, batches seen
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, features in data:
        optimizer.zero_grad()
        out = model(features)
        loss = loss_fn(out, labels)  # NLLLoss on log-probabilities by default
        loss.backward()
        optimizer.step()
        # accumulate plain Python numbers so no tensors (and their graphs) are kept alive
        total_loss += loss.item()
        _, predicted = torch.max(out, 1)
        acc += (predicted == labels).sum().item()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f"{count}: acc={acc / count}")
        if epoch_size and count > epoch_size:
            break
    return total_loss / count, acc / count

In [48]:
train_loop(model, train_loader, epoch_size=15000)

3200: acc=0.8021875
6400: acc=0.83703125
9600: acc=0.8516666666666667
12800: acc=0.859296875


(0.025697248576800707, 0.8637393390191898)
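The accuracy above is measured on the training data while the model is still being updated. A separate evaluation pass over held-out batches gives a cleaner estimate. The function below is a minimal sketch, assuming batches arrive as (labels, features) pairs like those from the collate function; `evaluate` and the toy model are hypothetical names used only for this illustration.

```python
import torch


def evaluate(model, data):
    """Return the accuracy of model over an iterable of (labels, features) batches."""
    model.eval()
    correct, count = 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for labels, features in data:
            predicted = model(features).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            count += len(labels)
    return correct / count


# Toy check: a hand-set linear layer that copies its input to the output.
toy_model = torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.LogSoftmax(dim=1))
with torch.no_grad():
    toy_model[0].weight.copy_(torch.eye(2))
    toy_model[0].bias.zero_()

toy_batches = [(torch.tensor([0, 1]), torch.tensor([[1.0, 0.0], [0.0, 1.0]]))]
print(evaluate(toy_model, toy_batches))
```

With the real data it could be called as `evaluate(model, test_loader)`, provided every token in the test texts can be encoded with the vocabulary.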