# Deep Learning for text data

First of all, we need to convert text into a numerical representation. This process is called __vectorization__, and it can be done in several ways:

- __convert__ text into words and words into vectors;
- __convert__ text into characters and characters into vectors;
- create __n-grams__ of words and represent them as vectors.

Each of these smaller units of text is called a __token__, hence the process of breaking a text into tokens is called __tokenization__. The most popular approaches to converting a token into a vector are _one-hot encoding_ and _word embedding_.

In one-hot encoding, each token is represented as a vector of size $N$, where $N = |V|$ is the size of the vocabulary (i.e. the total number of unique words in the document). The vector is all zeros except for a single 1 at the index assigned to the token. The one-hot representation of the whole vocabulary therefore has dimension $|V|\times |V|$.

Word embedding is a popular way of representing text data in problems solved by deep learning algorithms. It represents each word as a dense vector of floating-point numbers. The dimension of the vector is a hyper-parameter fixed before training. The representation of a vocabulary becomes $|V|\times D$, where $D$ is the chosen dimension. One way to create word embeddings is to start with a dense vector of random numbers for each token and then train a model such as a document classifier. During training, the numbers representing the tokens get adjusted so that semantically closer words end up at a smaller vector distance.
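
As a minimal sketch of the $|V|\times D$ lookup table described above, the snippet below uses `torch.nn.Embedding`. The toy vocabulary and the value `D = 5` are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = {"the": 0, "movie": 1, "was": 2, "epic": 3}
D = 5  # embedding dimension: a hyper-parameter

# Randomly initialised |V| x D table; its rows are adjusted during training
embedding = nn.Embedding(len(toy_vocab), D)

ids = torch.tensor([toy_vocab["movie"], toy_vocab["epic"]])
vectors = embedding(ids)  # one D-dimensional row per token id

print(embedding.weight.shape)  # torch.Size([4, 5])
print(vectors.shape)           # torch.Size([2, 5])
```

Each token id simply selects one row of the table, so the cost of a lookup does not depend on the vocabulary size.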

In [None]:
import torch

In [None]:
## Tokenization: split text into characters/words
##adjacent string literals are concatenated, so no indentation ends up inside the text
thorTxt = ("The action scenes were top notch in this movie. "
           "Thor has never been this epic in the MCU. He does "
           "some pretty epic shit in this movie and he is "
           "definitely not under-powered anymore. Thor is "
           "unleashed in this, I love that.")

If we want to express tokens as characters, we can use the `list()` function.

In [None]:
thor_char = list(thorTxt)
print(thor_char)

On the other hand, if we want to express tokens as words, we can use the `split()` method of Python string objects.

In [None]:
thor_word = thorTxt.split()
print(thor_word)
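
Note that `split()` only cuts on whitespace, so punctuation stays attached to the words (for example `movie.`). One possible refinement, shown here as a sketch and not as the notebook's method, is a small regular-expression tokenizer. The function name `simple_tokenize` and the pattern are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return word tokens without punctuation."""
    # Words may contain internal hyphens or apostrophes (e.g. "under-powered")
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

sample = "Thor has never been this epic. He is not under-powered anymore!"
print(sample.split()[5])        # 'epic.' (punctuation stays attached)
print(simple_tokenize(sample))
```

With such a tokenizer, `epic` and `epic.` map to the same vocabulary entry instead of two different ones.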

__N-grams__ are groups of $N$ consecutive words extracted from a given text, where $N$ is the number of words grouped together. For $N = 2$ they are also called bigrams.

In [None]:
##n-gram representation N = 2

thor_ngram = [(thor_word[i], thor_word[i+1]) for i in range(len(thor_word)-1)]
print(thor_ngram)
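
The bigram comprehension above can be generalised to any $N$. The helper below is a small sketch; the name `ngrams` is hypothetical and not part of the notebook.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    # zip stops at the shortest slice, which yields exactly len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the action scenes were top notch".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```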

## Vectorization

In [None]:
##one-hot encoding
##we first map each word in the vocabulary to a unique index
word_to_ix = {}

for word in thor_word:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)

print(word_to_ix)

In [None]:
onehot_thor = torch.zeros(len(word_to_ix), len(word_to_ix)) ##onehot tensor of dimension |V|x|V|

for word in word_to_ix:
    ix = word_to_ix[word] ##`ix` avoids shadowing the built-in `id`
    onehot_thor[ix, ix] = 1

def onehot(word, word_to_ix):
    onehot_vect = torch.zeros(len(word_to_ix)) ##onehot tensor of dimension |V| for a word from the vocabulary
    ix = word_to_ix[word]
    onehot_vect[ix] = 1
    return onehot_vect

onehot('were', word_to_ix)

In [None]:
print(onehot_thor)
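
The same tensors can also be obtained without explicit loops. The sketch below uses `torch.nn.functional.one_hot` on a toy vocabulary (the dictionary `toy_word_to_ix` is an assumption made for this example).

```python
import torch
import torch.nn.functional as F

toy_word_to_ix = {"the": 0, "action": 1, "scenes": 2}  # hypothetical vocabulary

# All indices at once -> |V| x |V| matrix with ones on the diagonal
ids = torch.arange(len(toy_word_to_ix))
onehot_matrix = F.one_hot(ids, num_classes=len(toy_word_to_ix)).float()

# A single word -> vector of size |V|
onehot_action = onehot_matrix[toy_word_to_ix["action"]]
print(onehot_matrix)
print(onehot_action)  # tensor([0., 1., 0.])
```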

In [None]:
##TODO: finish!
##Word embedding
##training (Continuous Bag of Words: predicts a word given its context) = pretraining the embedding
import torch
import torch.nn as nn
import numpy as np

##create context tensor
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return tensor

##index of the largest element (the scan starts at 1 because index 0 is the initial candidate)
def get_index_max(input):
    index = 0
    for i in range(1, len(input)):
        if input[i] > input[index]:
            index = i
    return index

def get_max_prob_result(input, ix_to_word):
    return ix_to_word[get_index_max(input)]

class CBOW(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation1 = nn.ReLU()
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation2 = nn.LogSoftmax(dim=-1)
        
    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view((1, -1))
        out = self.linear1(embeds)
        out = self.activation1(out)
        out = self.linear2(out)
        out = self.activation2(out)
        return out
    
    def get_word_embedding(self, word):
        word = torch.LongTensor([word_to_ix[word]])
        return self.embeddings(word).view(1, -1)

In [None]:
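##training
##NOTE: CONTEXT_SIZE, EMBEDDING_DIM, vocab_size and data were not defined in the original cell;
##the definitions below are assumptions that make the cell runnable
CONTEXT_SIZE = 2 ##assumed: two words on each side of the target
EMBEDDING_DIM = 100 ##assumed embedding size
vocab_size = len(word_to_ix)
##(context, target) pairs built from the tokenized review
data = [(thor_word[i-CONTEXT_SIZE:i] + thor_word[i+1:i+CONTEXT_SIZE+1], thor_word[i])
        for i in range(CONTEXT_SIZE, len(thor_word)-CONTEXT_SIZE)]

model = CBOW(vocab_size, EMBEDDING_DIM)
model.train() ##put the model in training mode

loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(50):
    total_loss = 0
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)
        model.zero_grad()
        log_probs = model(context_vector)
        loss = loss_function(log_probs, torch.LongTensor([word_to_ix[target]]))
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() ##.item() returns a Python float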

In [None]:
##test model.eval()
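
The test cell above is still empty. As a hedged sketch of what such an evaluation could look like, the self-contained example below trains a compact stand-in for the CBOW model on a toy corpus and then predicts the most likely word for a context. The corpus, the sizes and the use of `nn.EmbeddingBag` (which sums the context embeddings) are assumptions made for this example, not the notebook's final implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

corpus = "thor is epic in this movie and thor is unleashed in this movie".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for w, i in w2i.items()}

# (context, target) pairs with one word on each side of the target
pairs = [([corpus[i - 1], corpus[i + 1]], corpus[i])
         for i in range(1, len(corpus) - 1)]

# Compact stand-in for CBOW: summed context embeddings -> linear -> log-softmax
model = nn.Sequential(
    nn.EmbeddingBag(len(vocab), 16, mode="sum"),
    nn.Linear(16, len(vocab)),
    nn.LogSoftmax(dim=-1),
)
loss_fn = nn.NLLLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model.train()
for _ in range(100):
    for context, target in pairs:
        ctx = torch.tensor([[w2i[w] for w in context]])  # shape [1, 2]
        opt.zero_grad()
        loss = loss_fn(model(ctx), torch.tensor([w2i[target]]))
        loss.backward()
        opt.step()

# Evaluation: eval mode, no gradient tracking, pick the most likely word
model.eval()
with torch.no_grad():
    ctx = torch.tensor([[w2i["thor"], w2i["epic"]]])
    log_probs = model(ctx)
    predicted = i2w[int(torch.argmax(log_probs, dim=-1))]
print(predicted)
```

`torch.argmax` plays the role of a helper such as `get_index_max`: it returns the index of the largest log-probability, which is then mapped back to a word.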

## CNN for text classification 
We create word embeddings as part of our architecture and train the entire model for prediction. The CNN performs feature learning.

We use `torchtext` to download the `IMDB` dataset, split it into `train` and `test` datasets, tokenize the text and build the vocabulary. The `torchtext` package consists of data processing utilities and popular datasets for natural language.

In [1]:
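##data preparation
##NOTE: Field and BucketIterator belong to the legacy torchtext API
##(moved to torchtext.legacy in version 0.9 and removed in 0.12)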
import torchtext
import torchtext.datasets as datasets
from torchtext import data
from torchtext.vocab import GloVe

TEXT = data.Field(lower=True, batch_first=True, fix_length=20) ##lowercase the text, tokenize it and pad or truncate every example to exactly 20 tokens
LABEL = data.Field(sequential=False)

train, test = datasets.IMDB.splits(TEXT, LABEL) ##LABEL: positive, negative review

##now we build the vocabulary for the data
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), 
                 max_size=10000, min_freq=10) ##pretrained embeddings of dim=300
##max number of words in the vocab = 10000; the special tokens <unk> and <pad> are added
##on top of these, so the vocab has 10002 entries and the largest index is 10001
##words with freq<10 are removed
##GloVe=Global Vectors for word representation
LABEL.build_vocab(train)

##create iterators that generate batches for train and test datasets
train_iter, test_iter = data.BucketIterator.splits((train, test),
                                                   batch_size=128, 
                                                   device=-1, 
                                                   shuffle=False)##device=-1 means CPU, None for GPU
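
To make the roles of `max_size` and `min_freq` concrete, here is a pure-Python sketch of a frequency-based vocabulary. It is not torchtext's actual implementation: the function `build_vocab` and its defaults are assumptions that only mimic the behaviour described in the comments above.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    """Map tokens to indices: specials first, then tokens by decreasing frequency."""
    freqs = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, count in freqs.most_common(max_size) if count >= min_freq]
    itos = list(specials) + kept
    return {tok: i for i, tok in enumerate(itos)}

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
stoi = build_vocab(docs, max_size=2, min_freq=1)
print(stoi)       # {'<unk>': 0, '<pad>': 1, 'good': 2, 'movie': 3}
print(len(stoi))  # 4
```

Because the special tokens are counted on top of `max_size`, a vocabulary built with `max_size=10000` ends up with 10002 entries and a largest index of 10001.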

In [2]:
batch = next(iter(train_iter)) ##inspect one batch
b = batch.text
l = batch.label
print(b.size()) #dim: 128x20 [batch_size, fix_len(from TEXT definition of sequence length)]
#l.size()
print(b)

torch.Size([128, 20])
tensor([[ 2346,     0,     0,  ...,    13,     0,  2346],
        [    9,  4923,     9,  ...,  1045,   158,  1308],
        [ 3360,     0,    14,  ...,    25,    90,   337],
        ...,
        [  828,   872,  1679,  ...,     6,    48,    74],
        [    3,    75,     5,  ...,     2,    49,  4725],
        [    9,   120,   775,  ...,   248,  4342,     3]])
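
Every row of the batch has exactly 20 entries because of `fix_length=20`. The sketch below illustrates the idea in plain Python; the helper name `pad_or_truncate` is hypothetical, and `pad_id=1` assumes that `<pad>` sits at index 1 of the vocabulary.

```python
def pad_or_truncate(token_ids, fix_length, pad_id=1):
    """Return a list of exactly fix_length ids: cut long inputs, pad short ones."""
    clipped = token_ids[:fix_length]
    return clipped + [pad_id] * (fix_length - len(clipped))

print(pad_or_truncate([5, 8, 13], 5))              # [5, 8, 13, 1, 1]
print(pad_or_truncate([5, 8, 13, 21, 34, 55], 5))  # [5, 8, 13, 21, 34]
```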


In [3]:
len(train_iter.dataset.examples)
print(len(test_iter.dataset.examples))

25000


In [None]:
print(TEXT.vocab.vectors) ##shows embeddings
print(TEXT.vocab.freqs) ##shows frequencies of tokens
print(TEXT.vocab.stoi) ##shows the dictionary with words and indices

In [2]:
import torch.nn as nn
import torch.nn.functional as F ##needed for F.dropout in forward
##CNN implementation
class IMDBcnn(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_cat, bs=1, max_len=20, kernel_size=3):
        super(IMDBcnn, self).__init__()
        self.bs = bs
        self.n_cat = n_cat
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim) ##output: [bs, max_len, embedding_dim]
        ##one-dimensional convolution: the max_len token positions are the input channels
        ##and the kernel slides along the embedding dimension
        self.cnn = nn.Conv1d(max_len, embedding_dim, kernel_size)
        self.avg = nn.AdaptiveAvgPool1d(10)
        self.fc = nn.Linear(embedding_dim * 10, n_cat) ##embedding_dim channels x 10 pooled values (1000 for embedding_dim=100)
        self.softmax = nn.LogSoftmax(dim=-1)
        
    def forward(self, input):
        self.bs = input.size(0) ##the last batch can be smaller than the others
        embed = self.embedding(input)
        out = self.cnn(embed)
        out = self.avg(out)
        out = out.view(self.bs, -1)
        ##dropout must be switched off in eval mode, hence training=self.training
        out = F.dropout(self.fc(out), p=0.5, training=self.training)
        return self.softmax(out)
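
To see why the fully connected layer expects 1000 input features, it helps to follow the tensor shapes layer by layer. The sketch below reproduces the layers of the model on random token ids; the toy sizes (`bs = 4`, `vocab_size = 50`) are assumptions.

```python
import torch
import torch.nn as nn

bs, max_len, vocab_size, embedding_dim = 4, 20, 50, 100  # toy sizes (assumed)

tokens = torch.randint(0, vocab_size, (bs, max_len))     # [4, 20]
embed = nn.Embedding(vocab_size, embedding_dim)(tokens)  # [4, 20, 100]

# Conv1d reads dim 1 as channels, so the 20 positions act as input channels
# and the kernel slides along the embedding axis: 100 - 3 + 1 = 98
conv = nn.Conv1d(max_len, embedding_dim, kernel_size=3)(embed)  # [4, 100, 98]
pooled = nn.AdaptiveAvgPool1d(10)(conv)                  # [4, 100, 10]
flat = pooled.view(bs, -1)                               # [4, 1000]

for name, t in [("tokens", tokens), ("embed", embed), ("conv", conv),
                ("pooled", pooled), ("flat", flat)]:
    print(name, tuple(t.shape))
```

The flattened size is the number of output channels times the pooled length, i.e. 100 x 10 = 1000.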

In [3]:
import torch
import torch.optim as optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = IMDBcnn(vocab_size=20000, embedding_dim=100, n_cat=2, bs=128)
model.to(device)

IMDBcnn(
  (embedding): Embedding(20000, 100)
  (cnn): Conv1d(20, 100, kernel_size=(3,), stride=(1,))
  (avg): AdaptiveAvgPool1d(output_size=10)
  (fc): Linear(in_features=1000, out_features=2, bias=True)
  (softmax): LogSoftmax()
)

In [4]:
optimizer = optim.Adam(model.parameters(), lr=0.0001)

In [None]:
##training the model
train_loss, train_acc = [], []

model.train()
for epoch in range(1):
    train_lossTmp = 0.0
    train_accTmp = 0.0
    
    for i, batch in enumerate(train_iter, 0): ##`batch` avoids shadowing the torchtext `data` module
        ##batch.text = inputs [128x20], batch.label = labels [128]
        ##label indices start at 1 (index 0 is reserved for <unk>), hence the -1
        inputs, labels = batch.text, batch.label-1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = F.nll_loss(outputs, labels)
        loss.backward()
        optimizer.step()
        if i % 1000 == 0:
            print('[{0:d}, {1:5d}] loss: {2:3f}'.format(epoch + 1, i + 1, 
                                                        loss.item()))
        train_lossTmp += loss.item()
        
        _, predicted = torch.max(outputs, 1)
        train_accTmp += (predicted == labels).sum().item()
        
    train_loss.append(train_lossTmp/len(train_iter)) ##mean loss per batch
    train_acc.append(train_accTmp/len(train_iter.dataset)) ##fraction of correct predictions
print("Finished training")

[1,     1] loss: 0.663877
[1,  1001] loss: 0.873771
[1,  2001] loss: 0.734582
[1,  3001] loss: 0.660496
[1,  4001] loss: 0.587941
[1,  5001] loss: 0.707496
[1,  6001] loss: 0.706104
[1,  7001] loss: 0.641321
[1,  8001] loss: 0.617158
[1,  9001] loss: 0.453076
[1, 10001] loss: 0.606310
[1, 11001] loss: 0.585477
[1, 12001] loss: 0.478372
[1, 13001] loss: 0.443530
[1, 14001] loss: 0.411518
[1, 15001] loss: 0.571858
[1, 16001] loss: 0.431129
[1, 17001] loss: 0.352848
[1, 18001] loss: 0.339925
[1, 19001] loss: 0.294157
[1, 20001] loss: 0.430539
[1, 21001] loss: 0.408975
[1, 22001] loss: 0.405468
[1, 23001] loss: 0.249852
[1, 24001] loss: 0.240808
[1, 25001] loss: 0.333480
[1, 26001] loss: 0.392163
[1, 27001] loss: 0.353186
[1, 28001] loss: 0.220064
[1, 29001] loss: 0.221836
[1, 30001] loss: 0.299842
[1, 31001] loss: 0.209823
[1, 32001] loss: 0.184576
[1, 33001] loss: 0.218767
[1, 34001] loss: 0.170951
[1, 35001] loss: 0.335098
[1, 36001] loss: 0.198010
[1, 37001] loss: 0.138783
[1, 38001] l

In [None]:
##test
correct = 0 
total = 0 
test_loss, test_acc = [], []

model.cpu()
model.eval()

with torch.no_grad():
    test_lossTmp = 0.0
    
    ##a single pass over the test set is enough: the model does not change during evaluation
    for batch in test_iter:
        inputs, labels = batch.text.cpu(), batch.label.cpu()-1 ##same label shift as in training
        out = model(inputs)
        _, predicted = torch.max(out, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        
        loss = F.nll_loss(out, labels)
        test_lossTmp += loss.item()
        
    test_loss.append(test_lossTmp/len(test_iter)) ##mean loss per batch
    test_acc.append(correct/total)
    
    print('Accuracy of the network on the {0:5d} test reviews: {1:3f} %'.format(total, 
                                                                                100 * correct / total))
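
The accuracy bookkeeping in the cell above boils down to comparing the arg-max of the log-probabilities with the labels. A minimal sketch on hand-made values (the helper name `accuracy` and the numbers are assumptions for illustration):

```python
import torch

def accuracy(log_probs, labels):
    """Fraction of rows whose arg-max class equals the label."""
    predicted = log_probs.argmax(dim=1)
    return (predicted == labels).float().mean().item()

log_probs = torch.log(torch.tensor([[0.9, 0.1],
                                    [0.2, 0.8],
                                    [0.6, 0.4],
                                    [0.3, 0.7]]))
labels = torch.tensor([0, 1, 1, 1])
print(accuracy(log_probs, labels))  # 0.75
```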

In [None]:
inputs.numpy().max()

In [None]:
inputs, labels = b, l-1
inputs = inputs.to(device)
labels = labels.to(device)

In [None]:
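emb = nn.Embedding(len(TEXT.vocab), 100) ##the vocab has 10002 entries (10000 words + <unk> and <pad>), so 10000 rows would be too few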

In [None]:
out = emb(inputs)

In [None]:
conv = nn.Conv1d(20, 100, 3)

In [None]:
out = conv(out)

In [None]:
out = model(inputs)

In [None]:
out.size()

In [None]:
loss = F.nll_loss(out, labels)


In [None]:
loss.backward()

In [None]:
optimizer.step()

In [None]:
loss

In [None]:
labels

In [None]:
print(list(range(len(word_to_ix), 1))) ##empty list: range(start, stop) with start > stop yields nothing

In [None]:
def get_index_max(input):
    index = 0
    for i in range(1, len(input)): ##range(len(input), 1) is empty, so the loop never ran
        if input[i] > input[index]:
            index = i
    return index

In [None]:
get_index_max([10,2,3,4,5,6])

## Sentiment analysis
Classification of sentences as "positive" or "negative" with a CNN (Kim, Y., 2014).
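
As a rough sketch of this kind of architecture, the model below applies convolutions of several window sizes over the token axis, followed by max-over-time pooling, dropout and a linear classifier. All sizes, the class name `KimCNN` and the randomly initialised embeddings are assumptions made for illustration; this is not a faithful reproduction of the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """Sketch of a Kim-style sentence classifier (all sizes are assumptions)."""

    def __init__(self, vocab_size, embedding_dim=100, n_filters=50,
                 kernel_sizes=(3, 4, 5), n_cat=2, p_drop=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # One convolution per window size, sliding over the token axis
        self.convs = nn.ModuleList(
            [nn.Conv1d(embedding_dim, n_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_cat)

    def forward(self, tokens):
        x = self.embedding(tokens).transpose(1, 2)  # [bs, emb_dim, seq_len]
        # ReLU, then max-over-time pooling for each window size
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        out = self.dropout(torch.cat(pooled, dim=1))  # [bs, n_filters * len(kernel_sizes)]
        return F.log_softmax(self.fc(out), dim=-1)

model = KimCNN(vocab_size=1000)
model.eval()
batch = torch.randint(0, 1000, (8, 20))  # 8 sentences of 20 token ids
log_probs = model(batch)
print(log_probs.shape)  # torch.Size([8, 2])
```

Unlike `IMDBcnn` above, the embedding dimension is used as the channel axis here, so each filter looks at a window of consecutive words.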