In [None]:
%pip install transformers scikit-learn datasets ipywidgets

In [1]:
import tensorflow as tf
from tensorflow import keras
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(2)

from tqdm import tqdm
import numpy as np

# Challenge 3: Sentiment analysis
***
We are interested in predicting the sentiment of written text.
* For the challenge, we adopt the IMDb dataset of movie reviews:
    * "[...] *this film is very lovable in a way many comedies are not* [...]"

The task is simple:
* Predict whether a review has a positive or a negative sentiment
    * Data: Paragraphs of text (string), each with a binary label (1: positive, 0: negative)
    * Metric: Accuracy
* Examples to get you started:
    * Finetune transformer models, e.g. BERT
    * Word2Vec + Deep Neural Network of your choice

# Hints
***
When training transformers:
* The models are *huge*, so training will run very slowly
    * Running several epochs over the full training data is a costly operation
    * Google Colab requires you to solve a captcha every 2h...


# Loading the data
***
Using Huggingface `datasets` library:

In [3]:
from datasets import load_dataset
raw_datasets = load_dataset("imdb")
print(raw_datasets["train"]["text"][2])

Reusing dataset imdb (C:\Users\nikla\.cache\huggingface\datasets\imdb\plain_text\1.0.0\4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)


Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I'm a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).


# Dataset preprocessing
***
The functions below translate the reviews to token IDs, return the train and test splits (the test split serves as evaluation data), and convert a split into a `tf.data.Dataset`:

In [32]:
def tokenize_and_split(datasets, tokenizer):

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = datasets.map(tokenize_function, batched=True)
    train_dataset = tokenized_datasets["train"]
    eval_dataset = tokenized_datasets["test"]
    return train_dataset, eval_dataset

def as_tf_dataset(dataset, tokenizer, batch_size=8):
    tf_data = dataset.remove_columns(["text"]).with_format("tensorflow")
    train_features = {x: tf_data[x].to_tensor() for x in tokenizer.model_input_names}
    tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_data["label"]))
    tf_dataset = tf_dataset.shuffle(len(tf_dataset)).batch(batch_size)
    return tf_dataset
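
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch. The helper `pad_and_truncate`, the pad ID `0`, and the tiny `max_length` are illustrative assumptions, not the tokenizer's actual implementation. The real tokenizer also handles special tokens and, judging by the dataset shapes printed in the `keras` section, pads every review to 512 tokens for this model.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: cut or pad one sequence to exactly max_length."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_seq = pad_and_truncate([101, 2023, 3185, 102], max_length=6)
long_seq = pad_and_truncate(list(range(1, 11)), max_length=6)
print(short_seq)
print(long_seq)
```

Every example ends up with the same length, which is what allows the features to be stacked into one dense tensor later on.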

# Finetuning transformers
***
Huggingface's `transformers` library provides excellent functionality and many pretrained models
* Let's load a smaller variant of BERT, called `DistilBERT`, and its tokenizer

In [33]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Run tokenization
***
Apply the tokenizer and sample a small subset of 1000 reviews per split to showcase the workflow

In [None]:
train_data, test_data = tokenize_and_split(raw_datasets, tokenizer)
small_train_dataset = train_data.shuffle(seed=42).select(range(1000))
small_eval_dataset = test_data.shuffle(seed=42).select(range(1000))

# Variant 1: Train using `keras`
***
We need to convert to a dataset format that `keras` understands:

In [22]:
tf_train_small = as_tf_dataset(small_train_dataset, tokenizer, batch_size=8)
tf_eval_small = as_tf_dataset(small_eval_dataset, tokenizer, batch_size=8)
print(tf_train_small)

<BatchDataset shapes: ({input_ids: (None, 512), attention_mask: (None, 512)}, (None,)), types: ({input_ids: tf.int64, attention_mask: tf.int64}, tf.int64)>
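
The leading `None` in the printed shapes is the batch dimension, which is not fixed in advance. As a rough illustration of what `shuffle(...).batch(batch_size)` does to the 1000 sampled reviews, here is a NumPy sketch. The helper `shuffle_and_batch` and its seed are assumptions for illustration; `tf.data` reshuffles on every epoch by default, whereas this sketch shuffles once.

```python
import numpy as np

def shuffle_and_batch(n_examples, batch_size, seed=0):
    """Hypothetical helper: return index batches after one full shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    return [order[i:i + batch_size] for i in range(0, n_examples, batch_size)]

batches = shuffle_and_batch(n_examples=1000, batch_size=8)
print(len(batches))       # number of steps per epoch
print(len(batches[-1]))   # size of the final batch
```

With 1000 examples and a batch size of 8, one epoch consists of 125 steps.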


# Loading the model
***

In [14]:
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model.summary(50)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_39', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Model: "tf_distil_bert_for_sequence_classification_1"
__________________________________________________
Layer (type)          Output Shape        Param # 
distilbert (TFDistilB multiple            66362880
__________________________________________________
pre_classifier (Dense multiple            590592  
__________________________________________________
classifier (Dense)    multiple            1538    
__________________________________________________
dropout_39 (Dropout)  multiple            0       
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
__________________________________________________


# Time to train (`keras`)
***

In [20]:
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=5e-5),
    loss= keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(tf_train_small, validation_data=tf_eval_small, epochs=3, verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x2954e23f9d0>

# Evaluation code for `keras`
***
We run the final evaluation on the full test dataset:

In [25]:
tf_eval_full = as_tf_dataset(test_data, tokenizer, batch_size=16)

In [32]:
loss, acc = model.evaluate(tf_eval_full)
print(f"Reached {acc:.3f} accuracy and a loss of {loss:.4f}")

Reached 0.768 accuracy and a loss of 0.5586
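
The reported loss is a mean sparse categorical cross-entropy computed from raw logits, as configured with `from_logits=True`. The NumPy sketch below shows that computation on made-up logits; the function name `sparse_ce_from_logits` and the toy values are assumptions, not the Keras implementation.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    logits = np.asarray(logits, dtype="float64")
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

toy_logits = np.array([[2.0, -1.0], [0.0, 0.0], [-1.0, 3.0]])
toy_labels = np.array([0, 1, 1])
toy_loss = sparse_ce_from_logits(toy_logits, toy_labels)
toy_accuracy = (toy_logits.argmax(axis=1) == toy_labels).mean()
print(f"loss={toy_loss:.4f}, accuracy={toy_accuracy:.3f}")
```

For two classes, a model that always outputs equal logits has a loss of ln 2 ≈ 0.693, which gives a reference point for the loss reported above.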


# Variant 2: Train using PyTorch
***
The `transformers` framework comes with its own `Trainer` class which we can use

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Specifying training arguments
***
The usual hyperparameters can be set using `TrainingArguments`

In [None]:
training_args = TrainingArguments(
    output_dir="results/",
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=50,                # how often to log
    evaluation_strategy="epoch",     # when to run evaluation
)
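
To see how `warmup_steps` interacts with a small dataset, here is a rough sketch of a linear warmup followed by linear decay. It assumes that `Trainer` uses such a linear schedule by default, a peak learning rate of 5e-5, and a single device; the helper `linear_schedule_lr` is hypothetical and not part of `transformers`.

```python
import math

def linear_schedule_lr(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Hypothetical helper: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Small subset from above: 1000 examples, batch size 16, 3 epochs, one device.
steps_per_epoch = math.ceil(1000 / 16)
total_steps = steps_per_epoch * 3
final_lr = linear_schedule_lr(total_steps, total_steps, warmup_steps=500)
print(steps_per_epoch, total_steps, final_lr)
```

Under these assumptions, training on the 1000-example subset ends after 189 steps, before the 500 warmup steps are over, so the learning rate would never reach its peak. A smaller `warmup_steps` may be worth trying when working with the subset.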

# Adding accuracy metric
***
The evaluation workflow is a little different from what we are used to in `keras`:

In [None]:
import numpy as np
from datasets import load_metric  # deprecated in newer `datasets` releases in favor of the `evaluate` library

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
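
To check what `compute_metrics` receives and returns, it can help to run the same logic on toy data. The sketch below is a NumPy-only stand-in; the function `accuracy_from_logits` and the toy logits are assumptions for illustration and do not use the `datasets` metric object.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Hypothetical stand-in for compute_metrics without the `datasets` metric."""
    predictions = np.argmax(logits, axis=-1)   # class with the largest logit
    return {"accuracy": float(np.mean(predictions == labels))}

toy_logits = np.array([[0.2, 1.5], [2.0, -0.3], [0.1, 0.4], [1.0, 0.9]])
toy_labels = np.array([1, 0, 0, 0])
toy_result = accuracy_from_logits(toy_logits, toy_labels)
print(toy_result)
```

Three of the four toy predictions match their labels, so the accuracy is 0.75.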

# Time to train (PyTorch)
***

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=small_train_dataset,   # training dataset
    eval_dataset=small_eval_dataset,     # evaluation dataset
    compute_metrics=compute_metrics,     # code to run accuracy metric
)
trainer.train()

# Evaluation code for PyTorch
***
We simply define a new `Trainer` that evaluates on the complete `test_data`:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=test_data,
    compute_metrics=compute_metrics,
)
results = trainer.evaluate()
loss, acc = results["eval_loss"], results["eval_accuracy"]
print(f"Reached {acc:.3f} accuracy and a loss of {loss:.4f}")

# Applying Word2Vec
***
Before transformers, state-of-the-art NLP models made heavy use of Word2Vec embeddings:
1. Learn a word embedding on a large amount of text (unsupervised)
    * It's also possible to train on task-specific text only
2. Translate all words in an input sequence to their vectors
3. Feed the sequences of word vectors into a network model of your choice
4. ???
5. Profit.

In [None]:
%pip install gensim

# Getting pre-trained vectors
***
`gensim` is a popular framework for training Word2Vec models. It also has a model zoo:

In [4]:
import gensim
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))
# random pick:
w2v = gensim.downloader.load('word2vec-google-news-300')



['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Here is the famous `king - man + woman = queen` example:

In [5]:
w2v.most_similar(positive=["woman", "king"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
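
The arithmetic behind this query can be sketched with hand-made 2-D vectors. Everything below (the vectors, `cosine`, `toy_most_similar`) is a hypothetical illustration; gensim additionally normalizes the vectors, so its scores differ from this sketch.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_most_similar(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(((w, cosine(query, vectors[w])) for w in candidates),
                  key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar(["woman", "king"], ["man"], toy_vectors)
print(ranking)
```

As in the real query, the words that form the query are excluded from the candidates.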

# Data preprocessing
***
For starters, we do the simplest possible tokenization: Splitting by `" "`

In [6]:
def tokenize_w2v(w2v, dataset, sentence_function=None):
    '''
        sentence_function is applied to each review's list of word vectors.
        Aggregating each review right away saves a lot of RAM.
    '''
    x = []
    y = np.array(dataset["label"], "int32")
    for text in tqdm(dataset["text"]):
        paragraph = [w2v[token] for token in text.split(" ") if token in w2v]
        if sentence_function is None:
            paragraph = np.array(paragraph, "float32")
        else: 
            paragraph = sentence_function(paragraph)
        x.append(paragraph)
    return x, y
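
Splitting on `" "` leaves punctuation attached to words, so tokens like `great.` are silently dropped by the `if token in w2v` filter. The sketch below contrasts this with a simple regex tokenizer on a toy vocabulary. The set `toy_vocab` is a hypothetical stand-in for `w2v`, and lowercasing is an assumption to validate: the model zoo output above contains cased entries such as `Queen_Consort`.

```python
import re

toy_vocab = {"this", "movie", "is", "great"}   # hypothetical stand-in for w2v

text = "This movie is great."
space_tokens = text.split(" ")
regex_tokens = re.findall(r"[a-z0-9']+", text.lower())

kept_space = [t for t in space_tokens if t in toy_vocab]
kept_regex = [t for t in regex_tokens if t in toy_vocab]
print(kept_space)
print(kept_regex)
```

On this toy sentence the whitespace split keeps two of the four words, while the regex variant keeps all four.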

# Getting rid of variable length data
***
For a first simple prototype, we simply sum up all word vectors of a review
* Per review, we will get a single vector
    * But is that a good approach?

In [7]:
sum_of_words = lambda vectors: np.sum(vectors, axis=0)
x_train, y_train = tokenize_w2v(w2v, raw_datasets["train"], sum_of_words)
x_train = np.array(x_train)
x_test, y_test = tokenize_w2v(w2v, raw_datasets["test"],sum_of_words)
x_test = np.array(x_test)

100%|██████████████████████████████████| 25000/25000 [00:16<00:00, 1501.29it/s]
100%|██████████████████████████████████| 25000/25000 [00:17<00:00, 1441.55it/s]
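
One way to probe whether summing is a good approach: a sum grows with the number of words, so long and short reviews land on very different scales, whereas a mean does not. The sketch below shows this on random toy vectors; all names and values are assumptions for illustration. Both aggregations also discard word order.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5, 4))         # hypothetical: 5 words, 4 dimensions

short_review = word_vectors                    # 5 tokens
long_review = np.tile(word_vectors, (10, 1))   # same words repeated: 50 tokens

sum_short, sum_long = short_review.sum(axis=0), long_review.sum(axis=0)
mean_short, mean_long = short_review.mean(axis=0), long_review.mean(axis=0)

norm_ratio = np.linalg.norm(sum_long) / np.linalg.norm(sum_short)
print(norm_ratio)                          # the summed vector grows with length
print(np.allclose(mean_short, mean_long))  # the mean is unaffected by length
```

Swapping `np.sum` for `np.mean` in `sum_of_words` would be one small change to try against the validation accuracy.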


# Training with summed word vectors
***


In [8]:
from tensorflow.keras.layers import (
    Input, Dense, Dropout,
)
from tensorflow.keras.models import Model

input_layer = Input(shape=(w2v.vector_size,))
l = input_layer
l = Dense(128, "relu")(l)
l = Dropout(0.4)(l)
l = Dense(2, "softmax")(l)
model = Model(input_layer, l)
model.summary(50)

Model: "functional_1"
__________________________________________________
Layer (type)          Output Shape        Param # 
input_1 (InputLayer)  [(None, 300)]       0       
__________________________________________________
dense (Dense)         (None, 128)         38528   
__________________________________________________
dropout (Dropout)     (None, 128)         0       
__________________________________________________
dense_1 (Dense)       (None, 2)           258     
Total params: 38,786
Trainable params: 38,786
Non-trainable params: 0
__________________________________________________
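
The parameter counts in the summary can be verified by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is an assumption for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(300, 128)   # dense:   300-d input -> 128 units
output = dense_params(128, 2)     # dense_1: 128 units   -> 2 classes
print(hidden, output, hidden + output)   # 38528 258 38786
```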


In [9]:
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(opt, "sparse_categorical_crossentropy", ["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=1000,
          verbose=2, validation_split=0.2)

Epoch 1/1000
157/157 - 1s - loss: 0.9240 - accuracy: 0.7169 - val_loss: 0.6332 - val_accuracy: 0.6886
Epoch 2/1000
157/157 - 0s - loss: 0.4709 - accuracy: 0.7873 - val_loss: 0.5447 - val_accuracy: 0.7446
Epoch 3/1000
157/157 - 0s - loss: 0.4405 - accuracy: 0.8031 - val_loss: 0.6022 - val_accuracy: 0.7082
Epoch 4/1000
157/157 - 0s - loss: 0.4248 - accuracy: 0.8144 - val_loss: 0.3471 - val_accuracy: 0.8662
Epoch 5/1000
157/157 - 0s - loss: 0.4100 - accuracy: 0.8205 - val_loss: 0.5675 - val_accuracy: 0.7412
Epoch 6/1000
157/157 - 0s - loss: 0.4054 - accuracy: 0.8219 - val_loss: 0.5051 - val_accuracy: 0.7778
Epoch 7/1000
157/157 - 0s - loss: 0.3993 - accuracy: 0.8264 - val_loss: 0.4766 - val_accuracy: 0.7964
Epoch 8/1000
157/157 - 0s - loss: 0.3966 - accuracy: 0.8281 - val_loss: 0.4544 - val_accuracy: 0.8004
Epoch 9/1000
157/157 - 0s - loss: 0.3972 - accuracy: 0.8269 - val_loss: 0.6355 - val_accuracy: 0.7110
Epoch 10/1000
157/157 - 0s - loss: 0.3932 - accuracy: 0.8291 - val_loss: 0.5048 - 

KeyboardInterrupt: 
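
Instead of setting `epochs=1000` and interrupting by hand, training can be stopped once the validation loss stops improving. Keras ships a `keras.callbacks.EarlyStopping` callback for this, with `patience` and `restore_best_weights` arguments. The pure-Python helper below is a hypothetical illustration of the stopping rule, not that callback's implementation, and `patience=3` is an arbitrary choice.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of validation losses."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses of epochs 1-9, copied from the log above.
val_losses = [0.6332, 0.5447, 0.6022, 0.3471, 0.5675, 0.5051, 0.4766, 0.4544, 0.6355]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

On the logged losses, this rule stops after epoch 7 and identifies epoch 4 as the best one.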

# Evaluation code for W2V based models
***

In [13]:
loss, acc = model.evaluate(x_test, y_test, batch_size=128)
print(f"Reached {acc:.3f} accuracy and a loss of {loss:.4f}")

Reached 0.765 accuracy and a loss of 0.4770
