## Text classification using Neural Networks

The goal of this notebook is to learn to use Neural Networks for text classification.

In this notebook, we will:
- Train a shallow model with learned embeddings
- Download pre-trained embeddings from GloVe
- Use these pre-trained embeddings

However, keep in mind:
- Deep Learning can do better on text classification than simpler ML techniques, but only on very large datasets and with well designed/tuned models.
- We won't be using the most efficient (in terms of computing) techniques, as Keras is good for prototyping but rather inefficient for training small embedding models on text.
- The following projects can replicate similar word embedding models much more efficiently: [word2vec](https://github.com/dav/word2vec) and [gensim's word2vec](https://radimrehurek.com/gensim/models/word2vec.html) (self-supervised learning only), [fastText](https://github.com/facebookresearch/fastText) (both supervised and self-supervised learning), [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki) (supervised learning).
- Plain shallow sparse TF-IDF bigram features without any embedding, combined with Logistic Regression or Multinomial Naive Bayes, are often competitive on small to medium datasets.


### 20 Newsgroups Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: http://qwone.com/~jason/20Newsgroups/

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

In [None]:
sample_idx = 1000
print(newsgroups_train["data"][sample_idx])

In [None]:
target_names = newsgroups_train["target_names"]

target_id = newsgroups_train["target"][sample_idx]
print("Class of previous message:", target_names[target_id])

Here are all the possible classes:

In [None]:
target_names

In [None]:
newsgroups_train["data"][0]

### Preprocessing text for the (supervised) CBOW model

We will implement a simple classification model in PyTorch. Raw text requires (sometimes a lot of) preprocessing.

The following cells preprocess the text:
- a torchtext tokenizer splits each text into tokens (you may use other tokenizers: scikit-learn, NLTK, a custom Python function, etc.), and a vocabulary maps every token seen at least `5` times in the training set to an integer index; all other tokens map to `<unk>`
- sequences have different lengths, so we truncate or pad them with 0s to length `1000` (Keras's `pad_sequences` pads at the beginning by default)
- we convert the output classes to one-hot encodings
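
The tokenize-then-index step can be sketched in plain Python. This is a simplified stand-in for `build_vocab_from_iterator`, not torchtext's implementation: the toy corpus, the whitespace tokenizer, and the helper names `build_vocab` and `encode` are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2, unk="<unk>"):
    """Map each token seen at least `min_freq` times to an integer id; id 0 is reserved for `unk`."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by decreasing frequency, then alphabetically, so that ids are deterministic.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate([unk] + kept)}

def encode(tokens, vocab, unk="<unk>"):
    """Replace each token by its id, falling back to the `unk` id for unknown tokens."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

toy_corpus = ["the cat sat", "the dog sat", "a bird flew"]
toy_tokens = [doc.lower().split() for doc in toy_corpus]
toy_vocab = build_vocab(toy_tokens, min_freq=2)
toy_ids = encode("the bird sat".split(), toy_vocab)

print(toy_vocab)  # {'<unk>': 0, 'sat': 1, 'the': 2}
print(toy_ids)    # [2, 0, 1]
```

Raising `min_freq` shrinks the vocabulary and sends more rare words to `<unk>`, which is the trade-off the `len(vocab)` check in the next cell lets you explore.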

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Get the raw text data.
texts_train = newsgroups_train["data"]
texts_test = newsgroups_test["data"]

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(texts_train), specials=["<unk>"], min_freq=5)
vocab.set_default_index(vocab["<unk>"])

# See the impact of min_freq to reduce vocab size.
len(vocab)

In [None]:
tokenizer(newsgroups_train["data"][0])

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))

In [None]:
print(f"Vocab size : {len(vocab)}.")

Tokenized sequences are converted to lists of token ids (integer codes):

In [None]:
text_pipeline(texts_train[0])

The `vocab` object stores a mapping (vocabulary) from word strings to token ids that can be inverted to reconstruct the original message (without formatting, and with out-of-vocabulary words replaced by `<unk>`):

In [None]:
" ".join(vocab.lookup_tokens(text_pipeline(texts_train[0])))

In [None]:
texts_train[0]

Let's have a closer look at the tokenized sequences:

In [None]:
sequences = [text_pipeline(item) for item in texts_train]
sequences_test = [text_pipeline(item) for item in texts_test]

seq_lens = [len(s) for s in sequences]
print("average length: %0.1f" % np.mean(seq_lens))
print("max length: %d" % max(seq_lens))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(seq_lens, bins=50);

Let's zoom in on the distribution of regular-sized posts. The vast majority of the posts have fewer than 1000 tokens:

In [None]:
plt.hist([l for l in seq_lens if l < 3000], bins=50);

Let's truncate and pad all the sequences to 1000 tokens to build the training set:

In [None]:
from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 1000

# Pad sequences with 0s.
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', x_train.shape)
print('Shape of data test tensor:', x_test.shape)
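
What the padding step does can be sketched with NumPy alone. The helper `pad_or_truncate` below is a hypothetical illustration, not Keras code; it assumes the default behaviour of `pad_sequences` (`padding='pre'` and `truncating='pre'`), so zeros are added on the left and long sequences keep their last tokens.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` ids of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out

toy_padded = pad_or_truncate([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
# [[0 0 5 6]
#  [3 4 5 6]]
```

Note that in this notebook id 0 is also the `<unk>` token, so padding positions and unknown words are indistinguishable to the model.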

In [None]:
from keras.utils import to_categorical

y_train = newsgroups_train["target"]
y_test = newsgroups_test["target"]

y_train = to_categorical(np.asarray(y_train))
print('Shape of label tensor:', y_train.shape)
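
As a small illustration of what one-hot encoding produces, the sketch below rebuilds it with NumPy on made-up labels (`toy_labels` and `n_classes` are assumptions, not notebook variables). Taking `argmax` along the class axis recovers the integer labels, which is how one-hot targets can be turned back into class ids when computing accuracy.

```python
import numpy as np

toy_labels = np.array([2, 0, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes, dtype="float32")[toy_labels]
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)  # [2 0 1 2]
```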

### A simple supervised CBOW model in PyTorch

The following computes a very simple model, as described in [fastText](https://github.com/facebookresearch/fastText):

<img src="https://raw.githubusercontent.com/m2dsupsdlclass/lectures-labs/6ccf141172e47a3d9e367620b4d70e1c9564e9d2/labs/06_deep_nlp/images/fasttext.svg" style="width: 600px;" />

- Build an embedding layer mapping each word to a vector representation
- Compute the vector representation of all words in each sequence and average them
- Add a dense layer to output 20 classes (+ softmax)
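
The three steps above can be sketched as a forward pass in NumPy. This is only an illustration with random, untrained weights and made-up sizes (`vocab_size`, `embed_dim`, `n_classes` and the token ids are assumptions); the notebook's actual model is the PyTorch module defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, n_classes = 10, 4, 3

E = rng.normal(size=(vocab_size, embed_dim))  # embedding table, one row per token id
W = rng.normal(size=(embed_dim, n_classes))   # dense layer weights
b = np.zeros(n_classes)                       # dense layer bias

token_ids = np.array([[1, 2, 3, 0], [4, 4, 5, 6]])  # batch of 2 sequences of length 4

embedded = E[token_ids]              # (batch, seq_len, embed_dim)
doc_vectors = embedded.mean(axis=1)  # (batch, embed_dim): average over the sequence
logits = doc_vectors @ W + b         # (batch, n_classes)

# Numerically stable softmax over the classes.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(probs.shape)  # (2, 3)
```

Because the average ignores word order, this model is a bag-of-words classifier: two documents with the same tokens in a different order get the same prediction.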

In [None]:
import torch
import torch.nn as nn

from tqdm.notebook import tqdm
from torchmetrics import Accuracy
from torch.utils.data import DataLoader

from typing import Optional

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
DEVICE

In [None]:
EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class, embedding_bag: Optional[nn.EmbeddingBag] = None):
        super(TextClassificationModel, self).__init__()
        if embedding_bag:
            # Use the pre-trained embeddings.
            self.embedding = embedding_bag
            self.embedding.weight.requires_grad = False # Freeze embeddings.
        else:
            self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text):
        embedded = self.embedding(text)
        out = self.fc(embedded)
        return out

model = TextClassificationModel(vocab_size=len(vocab), embed_dim=EMBEDDING_DIM, num_class=N_CLASSES)
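
To see why `nn.EmbeddingBag(mode="mean")` implements the averaging step, the sketch below checks on toy data that it matches a plain `nn.Embedding` lookup followed by a mean over the sequence axis. The sizes and the names `toy_bag`, `toy_emb` and `toy_batch` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")
# A plain embedding layer sharing a copy of the same weights.
toy_emb = nn.Embedding.from_pretrained(toy_bag.weight.detach().clone())

toy_batch = torch.tensor([[1, 2, 3, 0], [4, 4, 5, 6]])  # (batch=2, seq_len=4)

with torch.no_grad():
    bag_out = toy_bag(toy_batch)               # each row of a 2D input is one bag
    mean_out = toy_emb(toy_batch).mean(dim=1)  # explicit lookup, then average

print(bag_out.shape)
print(torch.allclose(bag_out, mean_out, atol=1e-6))
```

Every position contributes to the mean, including padding positions, so heavily padded posts are pulled towards the embedding of id 0.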

In [None]:
x_train = torch.from_numpy(x_train)
y_train = torch.from_numpy(y_train)
x_test = torch.from_numpy(x_test)
y_test = torch.from_numpy(y_test)

train_data = []
for i in range(len(x_train)):
    train_data.append([x_train[i], y_train[i]])

test_data = []
for i in range(len(x_test)):
    test_data.append([x_test[i], y_test[i]])


train_dl = DataLoader(train_data, batch_size=256, shuffle=True)
test_dl = DataLoader(test_data, batch_size=256, shuffle=False)  # No need to shuffle for evaluation.

In [None]:
LEARNING_RATE = 1e-2
WEIGHT_DECAY = 0

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss()

N_EPOCHS = 100

model.to(DEVICE)

pbar = tqdm(range(N_EPOCHS))

val_acc = Accuracy().to(DEVICE)
train_acc = Accuracy().to(DEVICE)

for epoch in pbar:
    model.train()
    for b in train_dl:
        x, y = b
        x = x.to(DEVICE)
        y = y.to(DEVICE)

        preds = model(x)

        loss = criterion(preds, y)
        train_acc.update(preds, torch.argmax(y.squeeze(), dim=1))


        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    epoch_train_acc = train_acc.compute()
    train_acc.reset()
    

    model.eval()
    with torch.no_grad():
        for b in test_dl:
            x, y = b
            x = x.to(DEVICE)
            y = y.to(DEVICE)

            val_acc.update(model(x), y)

        epoch_val_acc = val_acc.compute()

        val_acc.reset()

    pbar.set_description(f"Epoch {epoch}, Loss : {loss.item():.2f}, Val : {epoch_val_acc.item():.2f}, Train : {epoch_train_acc.item():.2f}")
    

### Building more complex models

**Exercise**
- From the previous template, build more complex models using:
  - 1d convolution and 1d maxpooling.
  - Recurrent neural networks through LSTM (you will need to reduce the sequence length first)
  
  
<img src="https://raw.githubusercontent.com/m2dsupsdlclass/lectures-labs/6ccf141172e47a3d9e367620b4d70e1c9564e9d2/labs/06_deep_nlp/images/unrolled_rnn_one_output_2.svg" style="width: 600px;" />

**Bonus**
- You may try different architectures with:
  - more intermediate layers, combination of dense, conv, recurrent
  - different recurrent (GRU, RNN)
  - bidirectional LSTMs

Note: The goal is to build working models rather than to get better test accuracy. To achieve much better results, we'd need more computation time and more data. Build your models, and verify that they converge to reasonable results.

### LSTM model !

In [None]:
EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

class LstmTextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class, embedding: Optional[nn.Embedding] = None):
        super(LstmTextClassificationModel, self).__init__()
    
        if embedding is not None:
            # Use the pre-trained embeddings.
            self.embedding = embedding
            self.embedding.weight.requires_grad = False # Freeze embeddings.
        else:
            self.embedding = nn.Embedding(vocab_size, embed_dim)

        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels=embed_dim, out_channels=64, kernel_size=5),
            nn.MaxPool1d(5),
            nn.Dropout(p=0.5),
            nn.ReLU()
        )

        self.conv2 = nn.Sequential(
            nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5),
            nn.MaxPool1d(5),
            nn.Dropout(p=0.5),
            nn.ReLU()
        )

        self.lstm = nn.Sequential(
            nn.LSTM(input_size=64, hidden_size=64, num_layers=1, batch_first=True),
        )
        
        self.fc = nn.Sequential(nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, num_class))

    def forward(self, text):
        x = self.embedding(text)
        x = x.transpose(1, 2) # To have the proper representation for 1d convs.
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.transpose(1, 2)
        x, (ht, ct) = self.lstm(x)
        out = self.fc(ht.reshape(x.shape[0], -1)) # Use the final hidden state of the (single-layer) LSTM.
        return out

model = LstmTextClassificationModel(vocab_size=len(vocab), embed_dim=EMBEDDING_DIM, num_class=N_CLASSES)

In [None]:
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss()

N_EPOCHS = 300

model.to(DEVICE)

pbar = tqdm(range(N_EPOCHS))

val_acc = Accuracy().to(DEVICE)
train_acc = Accuracy().to(DEVICE)

for epoch in pbar:
    model.train()
    for b in train_dl:
        x, y = b
        x = x.to(DEVICE)
        y = y.to(DEVICE)

        preds = model(x)

        loss = criterion(preds, y)
        train_acc.update(preds, torch.argmax(y.squeeze(), dim=1))


        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    epoch_train_acc = train_acc.compute()
    train_acc.reset()
    

    model.eval()
    with torch.no_grad():
        for b in test_dl:
            x, y = b
            x = x.to(DEVICE)
            y = y.to(DEVICE)

            val_acc.update(model(x), y)

        epoch_val_acc = val_acc.compute()

        val_acc.reset()

    pbar.set_description(f"Epoch {epoch}, Loss : {loss.item():.2f}, Val : {epoch_val_acc.item():.2f}, Train : {epoch_train_acc.item():.2f}")

### Conv model !

In [None]:
EMBEDDING_DIM = 50
N_CLASSES = len(target_names)

class ConvTextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class, embedding: Optional[nn.Embedding] = None):
        super(ConvTextClassificationModel, self).__init__()
        
        if embedding is not None:
            # Use the pre-trained embeddings.
            self.embedding = embedding
            self.embedding.weight.requires_grad = False # Freeze embeddings.
        else:
            self.embedding = nn.Embedding(vocab_size, embed_dim)

        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels=embed_dim, out_channels=64, kernel_size=5),
            nn.MaxPool1d(5),
            nn.Dropout(p=0.2),
            nn.ReLU()
        )

        self.conv2 = nn.Sequential(
            nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5),
            nn.MaxPool1d(5),
            nn.Dropout(p=0.2),
            nn.ReLU()
        )
        
        self.fc = nn.Sequential(nn.Flatten(), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64*39, num_class))

    def forward(self, text):
        x = self.embedding(text)
        x = x.transpose(1, 2) # To have the proper representation for 1d convs.
        x = self.conv1(x)
        x = self.conv2(x)
        out = self.fc(x)
        return out

model = ConvTextClassificationModel(vocab_size=len(vocab), embed_dim=EMBEDDING_DIM, num_class=N_CLASSES)
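
The `64*39` input size of the final linear layer comes from how each convolution and pooling layer shortens the sequence. The arithmetic can be checked with the small sketch below; the helper names are assumptions, and the formulas assume no padding, dilation 1, and a pooling stride equal to the pooling kernel size (the `nn.MaxPool1d` default).

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1d convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

def maxpool1d_out_len(length, kernel_size):
    """Output length of max pooling whose stride equals its kernel size."""
    return (length - kernel_size) // kernel_size + 1

length = 1000  # MAX_SEQUENCE_LENGTH
for _ in range(2):  # two conv + pool blocks
    length = conv1d_out_len(length, kernel_size=5)
    length = maxpool1d_out_len(length, kernel_size=5)

print(length)       # 39 time steps
print(64 * length)  # 2496 flattened features (64 channels)
```

If `MAX_SEQUENCE_LENGTH` or the kernel sizes change, the first argument of `nn.Linear` has to be recomputed accordingly.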

In [None]:
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-6

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss()

N_EPOCHS = 300

model.to(DEVICE)

pbar = tqdm(range(N_EPOCHS))

val_acc = Accuracy().to(DEVICE)
train_acc = Accuracy().to(DEVICE)

for epoch in pbar:
    model.train()
    for b in train_dl:
        x, y = b
        x = x.to(DEVICE)
        y = y.to(DEVICE)

        preds = model(x)

        loss = criterion(preds, y)
        train_acc.update(preds, torch.argmax(y.squeeze(), dim=1))


        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    epoch_train_acc = train_acc.compute()
    train_acc.reset()
    

    model.eval()
    with torch.no_grad():
        for b in test_dl:
            x, y = b
            x = x.to(DEVICE)
            y = y.to(DEVICE)

            val_acc.update(model(x), y)

        epoch_val_acc = val_acc.compute()

        val_acc.reset()

    pbar.set_description(f"Epoch {epoch}, Loss : {loss.item():.2f}, Val : {epoch_val_acc.item():.2f}, Train : {epoch_train_acc.item():.2f}")

### Loading pre-trained embeddings

The file `glove100K.100d.txt` is an extract of [GloVe](http://nlp.stanford.edu/projects/glove/) vectors that were trained on English Wikipedia 2014 + Gigaword 5 (6B tokens).

We extracted the `100 000` most frequent words. Each vector has `100` dimensions.

In [None]:
import os

import zipfile
from urllib.request import urlretrieve

Downloading and extracting the GloVe embeddings:

In [None]:
URL_REPRESENTATIONS = "https://github.com/m2dsupsdlclass/lectures-labs/releases/download/0.3/glove100k.100d.zip"
ZIP_REPRESENTATIONS = "glove100k.100d.zip"
FILE_REPRESENTATIONS = "glove100K.100d.txt"

print('Downloading from %s to %s...' % (URL_REPRESENTATIONS, ZIP_REPRESENTATIONS))
urlretrieve(URL_REPRESENTATIONS, './' + ZIP_REPRESENTATIONS)

print("extracting %s..." % ZIP_REPRESENTATIONS)
myzip = zipfile.ZipFile(ZIP_REPRESENTATIONS)
myzip.extractall()

In [None]:
embeddings_index = {}
embeddings_vectors = []
word_idx = 0
# The context manager closes the file even if parsing fails.
with open('glove100K.100d.txt', 'rb') as f:
    for line in f:
        values = line.decode('utf-8').split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = word_idx
        embeddings_vectors.append(vector)
        word_idx = word_idx + 1

inv_index = {v: k for k, v in embeddings_index.items()}
print("found %d different words in the file" % word_idx)

In [None]:
# Stack all embeddings in a large numpy array
glove_embeddings = np.vstack(embeddings_vectors)
glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
glove_embeddings_normed = glove_embeddings / glove_norms
print(glove_embeddings.shape)
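
Dividing each row by its norm is what later allows cosine similarities to be computed with a single matrix product. A minimal check on made-up 2-d vectors (`toy_vectors` is an assumption, not GloVe data):

```python
import numpy as np

toy_vectors = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 3.0]])
toy_normed = toy_vectors / np.linalg.norm(toy_vectors, axis=-1, keepdims=True)

query = toy_normed[0]
cosines = toy_normed @ query  # cosine similarity of every row with the query

# Same value from the textbook definition, for row 1 against row 0.
u, v = toy_vectors[1], toy_vectors[0]
explicit = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosines)
print(explicit)
```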

In [None]:
def get_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings[idx]


def get_normed_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings_normed[idx]

In [None]:
get_emb("computer")

### Finding most similar words

**Exercise**

Build a function to find the most similar words, given a word as query:
- look up the vector for the query word in the GloVe index;
- compute the cosine similarity between this word embedding and all other words;
- display the top 10 most similar words.

**Bonus**

Change your function so that it takes multiple words as input (by averaging them)

In [None]:
# %load solutions/most_similar.py
def most_similar(words, topn=10):
    query_emb = 0
    # If we have a list of words instead of one word
    # (bonus question)
    if isinstance(words, list):
        for word in words:
            query_emb += get_emb(word)
    else:
        query_emb = get_emb(words)
        
    query_emb = query_emb / np.linalg.norm(query_emb)
    
    # Large numpy vector with all cosine similarities
    # between emb and all other words
    cosines = np.dot(glove_embeddings_normed, query_emb)
    
    # topn most similar indexes corresponding to cosines
    idxs = np.argsort(cosines)[::-1][:topn]
    
    # pretty return with word and similarity
    return [(inv_index[idx], cosines[idx]) for idx in idxs]


In [None]:
most_similar("cpu")

In [None]:
most_similar("pitt")

In [None]:
most_similar("jolie")

Predict the future better than tarot:

In [None]:
np.dot(get_normed_emb('aniston'), get_normed_emb('pitt'))

In [None]:
np.dot(get_normed_emb('jolie'), get_normed_emb('pitt'))

In [None]:
most_similar("1")

In [None]:
# bonus: yangtze is a chinese river
most_similar(["river", "chinese"])

### Displaying vectors with t-SNE


In [None]:
from sklearn.manifold import TSNE

word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:1000])

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 40))
axis = plt.gca()
np.set_printoptions(suppress=True)
plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)

for idx in range(1000):
    plt.annotate(inv_index[idx],
                 xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                 xytext=(0, 0), textcoords='offset points')
plt.savefig("tsne.png")
plt.show()

### Using pre-trained embeddings in our model

We want to use these pre-trained embeddings for transfer learning. This process is similar to transfer learning in image recognition: the features learnt on words might help us bootstrap the learning process, and increase performance if we don't have enough training data.
- We initialize the embedding matrix of the model with GloVe embeddings:
 - take all words from our 20 Newsgroups vocabulary, and look up their GloVe embedding
 - place the GloVe embedding at the corresponding index in the matrix
 - if the word is not in the GloVe vocabulary, its row can be filled with zeros (the cell below keeps the random initialization instead)
- We may freeze these embeddings or fine-tune them
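
The initialization described above, including the zero rows for missing words, can be sketched with NumPy on toy data. `toy_itos` and `toy_pretrained` are hypothetical stand-ins for the notebook's vocabulary and GloVe index, and the 3-d vectors are made up.

```python
import numpy as np

# Hypothetical toy data standing in for the vocabulary and the GloVe index.
toy_itos = ["<unk>", "cat", "dog", "xyzzy"]          # id -> token
toy_pretrained = {"cat": np.array([0.1, 0.2, 0.3]),
                  "dog": np.array([0.4, 0.5, 0.6])}  # token -> vector
dim = 3

# Start from zeros so that tokens without a pre-trained vector keep a zero row.
toy_matrix = np.zeros((len(toy_itos), dim), dtype="float32")
n_found = 0
for idx, token in enumerate(toy_itos):
    vector = toy_pretrained.get(token)
    if vector is not None:
        toy_matrix[idx] = vector
        n_found += 1

print(f"{n_found}/{len(toy_itos)} tokens found")  # 2/4 tokens found
print(toy_matrix)
```

A matrix built this way could then be copied into the weights of an embedding layer, with gradients disabled to freeze it or left enabled to fine-tune it.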

In [None]:
EMBEDDING_DIM = 100

embedding_bag = nn.EmbeddingBag(len(vocab), EMBEDDING_DIM)

i = 0

with torch.no_grad():
    for token in list(embeddings_index.keys()):
        if vocab[token] != 0:
            embedding_bag.weight[vocab[token]] = torch.from_numpy(get_emb(token)).float()
            i += 1

# Tokens missing from GloVe keep their random initialization; you may want to try zeros instead.


print(f"Added {i} vectors to the pretrained embeddings.")

### A model with pre-trained Embeddings

Averaging word embeddings pre-trained with GloVe / Word2Vec usually works surprisingly well. However, when averaging more than `10-15` words, the resulting vector becomes too noisy and classification performance is degraded.

In [None]:
model = TextClassificationModel(vocab_size=len(vocab), embed_dim=EMBEDDING_DIM, num_class=N_CLASSES, embedding_bag=embedding_bag)

In [None]:
LEARNING_RATE = 1e-2
WEIGHT_DECAY = 0

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
criterion = torch.nn.CrossEntropyLoss()

N_EPOCHS = 300

model.to(DEVICE)

pbar = tqdm(range(N_EPOCHS))

val_acc = Accuracy().to(DEVICE)
train_acc = Accuracy().to(DEVICE)

for epoch in pbar:
    model.train()
    for b in train_dl:
        x, y = b
        x = x.to(DEVICE)
        y = y.to(DEVICE)

        preds = model(x)

        loss = criterion(preds, y)
        train_acc.update(preds, torch.argmax(y.squeeze(), dim=1))


        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    epoch_train_acc = train_acc.compute()
    train_acc.reset()
    

    model.eval()
    with torch.no_grad():
        for b in test_dl:
            x, y = b
            x = x.to(DEVICE)
            y = y.to(DEVICE)

            val_acc.update(model(x), y)

        epoch_val_acc = val_acc.compute()

        val_acc.reset()

    pbar.set_description(f"Epoch {epoch}, Loss : {loss.item():.2f}, Val : {epoch_val_acc.item():.2f}, Train : {epoch_train_acc.item():.2f}")

### Reality check

On small/medium datasets, simpler classification methods usually perform better, and are much more efficient to compute. Here are two resources to go further:
- Naive Bayes approach, using scikit-learn http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
- Alec Radford (OpenAI) gave a very interesting presentation, showing that you need a VERY large dataset to have real gains from GRU/LSTM in text classification https://www.slideshare.net/odsc/alec-radfordodsc-presentation

However, when looking at features, one can see that classification using simple methods isn't very robust, and won't generalize well to slightly different domains (e.g. forum posts => emails)
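
For reference, a simple baseline of this kind takes only a few lines of scikit-learn. The sketch below uses a made-up four-document corpus (`toy_docs`, `toy_labels`) so that it runs instantly; it is an illustration of the approach, not a result on 20 Newsgroups.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus: class 0 is about graphics hardware, class 1 about hockey.
toy_docs = [
    "the gpu renders graphics fast",
    "graphics card driver update",
    "the team won the hockey game",
    "hockey players skate on ice",
]
toy_labels = [0, 0, 1, 1]

# TF-IDF over unigrams and bigrams, followed by Multinomial Naive Bayes.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
baseline.fit(toy_docs, toy_labels)

toy_pred = baseline.predict(["new graphics driver", "ice hockey game tonight"])
print(toy_pred)  # [0 1]
```

On the real data, the same pipeline can be fitted on `newsgroups_train["data"]` and `newsgroups_train["target"]` and scored on the test split to compare against the neural models above.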

## Going further

- Compare pre-trained embeddings vs specifically trained embeddings
- Train your own word vectors in any language using [gensim's word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
- Install fastText (Linux or macOS only, use the Linux VM if under Windows) and give it a try on the classification example in its repository.