# Lecture 3: NLP
***

# Agenda
***
* Recap RL challenge
* Lecture on NLP
    * Word2vec
    * Transformer networks
* Challenge 4: Sentiment analysis

# Reinforcement learning challenge
***
Two environments:
* Easy: `LunarLander-v2`
    * Default gym from OpenAI
* Hard: `LunarLander-hardcore-v2`
    * Adapted version with different starting positions, more terrain and randomization
<center>
<video width="50%" controls src="img/slides/ll_ep50.mp4" type="video/mp4" />
</center>

# Solution
***
Best posted solution:
* Single `Dense` layer with `relu` activation, 256 units
* `Nadam` optimizer, `lr=0.005` on `MSE` loss
* `discount=0.99`
* Trained for 10k episodes

The tricks:
* Run `n` episodes, but only train on the `top_n` runs with highest reward
    * Idea: Give higher influence to positive reinforcement
        * Here: Train on top 2 rewarding episodes of 10 trials
* Fine-tune agents trained in the easy gym on the hard environment
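
The "train only on the best runs" trick could be sketched roughly as follows. This is a minimal illustration, not the posted solution: `select_top_episodes`, the episode format and the reward values are made up for this example.

```python
from typing import List, Tuple

# One episode = (total_reward, transitions). The transition format is a
# placeholder; a real agent would store (state, action, reward, ...) tuples.
Episode = Tuple[float, list]


def select_top_episodes(episodes: List[Episode], top_n: int) -> List[Episode]:
    """Return the top_n episodes with the highest total reward."""
    ranked = sorted(episodes, key=lambda episode: episode[0], reverse=True)
    return ranked[:top_n]


# Hypothetical total rewards of 10 trial episodes (transitions left empty).
trial_rewards = [-120.0, 35.5, 210.0, -40.0, 180.2, 15.0, -300.0, 90.0, 250.1, 5.0]
episodes = [(reward, []) for reward in trial_rewards]

# Keep the 2 most rewarding of the 10 trials; only these would be trained on.
best = select_top_episodes(episodes, top_n=2)
print([reward for reward, _ in best])  # [250.1, 210.0]
```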

# Results
***
Two agents:
* (E) Trained solely in the easy env
* (H) The same model, but finetuned in the hard environment

Easy:
* (E): `Agent achieved 279.70 mean reward over 1000 runs`
* (H): `Agent achieved 258.77 mean reward over 1000 runs`

Hard:
* (E): `Agent achieved 224.53 mean reward over 1000 runs`
* (H): `Agent achieved 270.21 mean reward over 1000 runs`

# Comparison on `easy`
***
<center><div style="display:flex; justify-content: space-evenly;">
    <div style="flex-basis:40%">
        <center>Agent E</center>
        <video width="95%" controls src="img/slides/ll_easy_solved.mp4" type="video/mp4" />
    </div>
    <div style="flex-basis:40%">
        <center>Agent H</center>
        <video width="95%" controls src="img/slides/ll_hard_transfer.mp4" type="video/mp4" />
    </div>
</div>
</center>

# Comparison on `hard`
***
<center><div style="display:flex; justify-content: space-evenly;">
    <div style="flex-basis:40%">
        <center>Agent E</center>
        <video width="95%" controls src="img/slides/ll_easy_transfer.mp4" type="video/mp4" />
    </div>
    <div style="flex-basis:40%">
        <center>Agent H</center>
        <video width="95%" controls src="img/slides/ll_hard_solved.mp4" type="video/mp4" />
    </div>
</div>
</center>

# Natural language processing
***

# What is NLP?
***
Everything we express carries huge amounts of information
* what words we use and in which context
* our tone of voice
* ...

People are (more or less :D) able to understand other people
* BUT computers definitely cannot yet

**Natural Language Processing** is:
* a field of research that aims to make machines understand and derive meaning from human languages
* often ML-based

# Use Cases of NLP: Chatbot
***
<img src="img/slides/chatbot_fails.png" style="float: middle;">



# Use Cases of NLP: Disease Diagnosis
***
<img src="img/slides/diagnose_patients.png" style="float: left;">



# Use Cases of NLP: Knowledge Extraction
***
<img src="img/slides/knowledge_extraction.png" style="float: left;">



# Use Cases of NLP: Translation
***
<img src="img/slides/translation.png" style="float: left;">



# Use Cases of NLP: Sentiment Analysis
***
<img src="img/slides/sentiments_analysis.png" width= 700 style="float: middle;">



# But how to represent the meaning of a word
***
Most common linguistic way of thinking about meaning:
* signifier (symbol) ⟺ signified (idea or thing)

How about synonyms? Use WordNet!
* Example: good
    * goodness
    * honorable
    * beneficial
    * upright
    * well
    * proficient
    * excellent

# Problems with resources like WordNet
***
Great as a resource but missing nuance
* e.g., “proficient” is listed as a synonym for “good” (This is only correct in some contexts)

Missing new meanings of words

Requires human labor to create and adapt

Can’t compute accurate word similarity

# Representing words as discrete symbols
***
Determine a vocabulary $V = \{expect, movie, is, good, but, excellent\}$
* create one-hot encoded vectors

$expect = \begin{pmatrix} 1\\0\\0\\0\\0\\0 \end{pmatrix},\; movie = \begin{pmatrix} 0\\1\\0\\0\\0\\0 \end{pmatrix},\; is = \begin{pmatrix} 0\\0\\1\\0\\0\\0 \end{pmatrix},\; good = \begin{pmatrix} 0\\0\\0\\1\\0\\0 \end{pmatrix},\; but = \begin{pmatrix} 0\\0\\0\\0\\1\\0 \end{pmatrix},\; excellent = \begin{pmatrix} 0\\0\\0\\0\\0\\1 \end{pmatrix} $

To visualize these encodings, think of a 6-dimensional space
* each word occupies one of the dimensions and has nothing to do with the rest
* This means ‘good’ and ‘excellent’ are as different as ‘expect’ and ‘is’, which is not true
* Solution: learn to encode similarity in the vectors themselves
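
A small sketch of why one-hot vectors carry no similarity information: the dot product of any two different words is 0. The helper names `one_hot` and `dot` are chosen just for this example.

```python
vocabulary = ["expect", "movie", "is", "good", "but", "excellent"]


def one_hot(word, vocab):
    """Return a vector with a single 1 at the position of the word."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


def dot(a, b):
    """Dot product, used here as a very simple similarity measure."""
    return sum(x * y for x, y in zip(a, b))


good = one_hot("good", vocabulary)
excellent = one_hot("excellent", vocabulary)
expect = one_hot("expect", vocabulary)
is_ = one_hot("is", vocabulary)

print(good)                  # [0, 0, 0, 1, 0, 0]
print(dot(good, excellent))  # 0
print(dot(expect, is_))      # 0
```

Both pairs get the same similarity of 0, although 'good' and 'excellent' are clearly closer in meaning.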

# Representing words by their context
***
When a word $w \in V$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
* Use the many contexts of $w$ to build up a representation of $w$
* for each $w$ a word vector is built, chosen so that it is similar to vectors of words that appear in similar contexts

**Note**: word vectors are also called word embeddings or (neural) word representations

# Word meaning as a word embedding
***
$is = \begin{pmatrix} 0.784\\ 0.629\\-0.205\\-0.176\\0.211\\-0.563\\0.418\\0.189\\-0.621 \end{pmatrix}$ <img src="img/slides/word_embedding.png" width="700" style="border: 1px black; float: right;">

# Constructing word embeddings: Word2Vec
***
Two possible methods
* Skip-gram
* Continuous Bag of Words (CBOW)

Let's focus on Skip-gram! (It is better for large datasets)
* The Skip-gram model takes the center word as input and predicts the context words

<img src="img/slides/skip_gram.png" style="height: 300px; float: left;">



# Architecture of Word2Vec
***
2-layer neural network
* input layer requires one-hot vectors
* hidden layer is dense layer whose weights will be the word embeddings
* output layer + softmax outputs probabilities for words from the vocabulary

<img src="img/slides/Word2vec.png" width= 700 style="float: middle;">

**Note**: figure reproduced from https://israelg99.github.io/2017-03-23-Word2Vec-Explained/ 
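
The forward pass of such a 2-layer network can be sketched in plain Python. This is only an illustration under assumptions: the sizes (6 words, 3 embedding dimensions), the random weights and the names `W_hidden`, `W_output` and `forward` are made up here, and no training takes place.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dim = 6, 3  # toy sizes, chosen only for illustration

# Hidden-layer weights: row i is the embedding of word i.
W_hidden = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]
# Output-layer weights: map an embedding to one score per vocabulary word.
W_output = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
            for _ in range(embedding_dim)]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def forward(one_hot_input):
    # Multiplying a one-hot vector with W_hidden selects exactly one row.
    hidden = [sum(one_hot_input[i] * W_hidden[i][j] for i in range(vocab_size))
              for j in range(embedding_dim)]
    scores = [sum(hidden[j] * W_output[j][k] for j in range(embedding_dim))
              for k in range(vocab_size)]
    return hidden, softmax(scores)


one_hot_input = [0, 0, 0, 1, 0, 0]  # the fourth word of the vocabulary
hidden, probabilities = forward(one_hot_input)
print(hidden)         # equals W_hidden[3], the embedding of that word
print(probabilities)  # one probability per vocabulary word
```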

# Training of Word2Vec
***

Training
* Train the weights of the hidden layer and the output layer using the skip-gram model
* cut off the output layer to get the embeddings

<img src="img/slides/Word2vec.png" width= 700 style="float: middle;">


# Training of Word2Vec: Example
***
<img src="img/slides/skip_gram.png" style="height:300px; float: right;">

determine the vocabulary $V = \{expect, the, movie, is, good, but\}$

train the hidden weights (= word embedding) for the word "is" for the given input and output with gradient descent

$input: \begin{pmatrix} 0\\0\\0\\1\\0\\0 \end{pmatrix} \; output: \begin{pmatrix} 0\\1\\1\\0\\1\\1 \end{pmatrix}$
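
The input and output vectors above can be reproduced with a few lines of Python. The sentence and the window size of 2 are assumptions made for this sketch (the slide only shows the figure), and `context_words` and `encode` are hypothetical helper names. Many implementations train on one (center, context) pair at a time; the multi-hot output simply summarizes all pairs for one center word.

```python
vocabulary = ["expect", "the", "movie", "is", "good", "but"]
# Assumed example sentence; chosen so that it matches the vectors above.
sentence = ["expect", "the", "movie", "is", "good", "but"]


def context_words(tokens, center_index, window):
    """Return the words within `window` positions of the center word."""
    start = max(0, center_index - window)
    end = min(len(tokens), center_index + window + 1)
    return [tokens[i] for i in range(start, end) if i != center_index]


def encode(words, vocab):
    """Vector with a 1 at the position of every given word, 0 elsewhere."""
    return [1 if word in words else 0 for word in vocab]


center = sentence.index("is")
context = context_words(sentence, center, window=2)

input_vector = encode(["is"], vocabulary)
output_vector = encode(context, vocabulary)

print(context)        # ['the', 'movie', 'good', 'but']
print(input_vector)   # [0, 0, 0, 1, 0, 0]
print(output_vector)  # [0, 1, 1, 0, 1, 1]
```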

# Training of Word2Vec
***
<img src="img/slides/Word2vec.png" width= 600 style="float: middle;">

If different words appear in similar contexts, then Word2Vec should produce similar outputs when these words are passed as inputs
* in order to produce similar outputs, the computed word embeddings for these words have to be similar
* thus Word2Vec is pushed to learn similar word embeddings for words in similar contexts

# What can you do with word embeddings?
***
<img src="img/slides/embedding_example.png" style="">

**Note**: figure reproduced from https://israelg99.github.io/2017-03-23-Word2Vec-Explained/ 

# Language Modeling
***
Based on the Markov assumption: the $(t+1)$-th word depends only on the words that precede it (in practice, on a limited number of them)

Language Modeling is the task of predicting what word comes next
* "The students opened their ____" $\rightarrow$ \[books, laptops, exams, minds\]

Formally:\
Given a sequence of words $x^{(1)}, x^{(2)}, ..., x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$:

$P(x^{(t+1)} \vert x^{(t)}, ..., x^{(1)})$

where $x^{(t+1)}$ can be any word in the vocabulary $V = \{w_1, ..., w_{\vert V \vert}\}$

# You use Language Models every day!
***
<img src="img/slides/lm_google.png" style="float: middle;">

# n-gram Language Models
***
"The students opened their ____"

**Definition**: An n-gram is a chunk of n consecutive words
* unigrams: “the”, “students”, “opened”, ”their”
* bigrams: “the students”, “students opened”, “opened their”
* trigrams: “the students opened”, “students opened their”
* 4-grams: “the students opened their”

**Idea**: Collect statistics about how frequent different n-grams are and use these to predict next word (no deep learning!)

**Hint**: see the IfIS lecture on Information Retrieval

# n-gram Language Models
***
Suppose we are learning a 4-gram Language Model:
<img src="img/slides/n_gram_lm.png" style="">


$P(w | \text{students opened their})=\frac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}$

For example, suppose that in the corpus:
* "students opened their" occurred 1000 times
* "students opened their **books**" occurred 400 times
    * $P(\text{books} | \text{students opened their}) = 0.4$
* "students opened their **exams**" occurred 100 times
    * $P(\text{exams} | \text{students opened their}) = 0.1$
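
The counting idea can be sketched with `collections.Counter`. The tiny corpus below is made up for illustration (so the numbers differ from the example above), and `ngrams` and `next_word_probability` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all chunks of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tiny made-up corpus; a real model would count over a large text collection.
corpus = (
    "the students opened their books . "
    "the students opened their books . "
    "the students opened their exams . "
    "the students opened their laptops ."
).split()

four_gram_counts = Counter(ngrams(corpus, 4))
trigram_counts = Counter(ngrams(corpus, 3))


def next_word_probability(word, context):
    """P(word | context) for a 4-gram model, estimated from counts."""
    context = tuple(context)
    context_count = trigram_counts[context]
    if context_count == 0:
        return None  # the context never occurred in the corpus
    return four_gram_counts[context + (word,)] / context_count


context = ["students", "opened", "their"]
print(next_word_probability("books", context))  # 0.5
print(next_word_probability("exams", context))  # 0.25
print(next_word_probability("books", ["teachers", "opened", "their"]))  # None
```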

# Problems with n-gram Language Models
***
What if "students opened their" never occurred in the data? Then we can’t calculate a probability for any $w$

Need to store counts for all n-grams seen in the corpus
* we may need a larger $n$ (more context words) if we want to model language well
* but a large $n$ requires a lot of storage

**Solution**: As always, neural networks! (Because neural networks are great \*grins\*)

# Neural Language Models
***
<img src="img/slides/neural_LM.png" width=550 style="float: right;">

Input sentence: $w_1, ..., w_5 \rightarrow $ strings

Neural networks cannot work with strings and need numbers\
as input
* \*yay\* we can use our word embeddings
* so feed the input sentence to the neural network with the help \
of the word embeddings

At the end: a classification layer, e.g. for sentiment analysis

Magic based on **CNNs** or **RNNs** or **Transformers** or ... But wait, what are all these things?

Let's start with something we already know: CNNs

# Neural Language Models based on CNNs
***
**idea**: tackle dependencies by applying different kernels to the same sentence
* a kernel of size 2 for example learns relationships between pairs of words, a kernel of size 3 between triplets of words and so on

**problem**: too costly to capture all possible combinations of words in a sentence $\rightarrow$ many large kernels needed

# Neural Language Models based on RNNs
***
<img src="img/slides/RNN.png" width= 550 style="float: right;">

**idea**: consider not only the current word but also the previous words, to remember what came before

**problem**: Sequential processing
* to encode the second word in a sentence, the network needs the previously\
computed hidden state of the first word

**problem**: Short memory
* the encoding of a specific word is retained only for the next time step $\rightarrow$ the encoding of a word strongly affects only the representation of the next word
* its influence is quickly lost after a few time steps (LSTMs (long short-term memory) can improve the memory somewhat)

**Note**: figure reproduced from \
https://medium.com/swlh/simple-explanation-of-recurrent-neural-network-rnn-1285749cc363
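
Both problems show up even in a toy recurrence with scalar states. This is only a sketch: the weights `w_x` and `w_h` are arbitrary values picked for illustration, and a real RNN uses vectors and learned weight matrices.

```python
import math


def rnn_step(x_t, h_prev, w_x=0.8, w_h=0.5):
    """One recurrent step: mix the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev)


def run_rnn(inputs, h0=0.0):
    """Process the inputs strictly one after another."""
    hidden_states = []
    h = h0
    for x_t in inputs:
        # Sequential processing: step t cannot start before step t-1 is done.
        h = rnn_step(x_t, h)
        hidden_states.append(h)
    return hidden_states


# Only the first input is non-zero, so every later state shows how much of
# that first input is still "remembered".
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
for step, state in enumerate(states, start=1):
    print(step, round(state, 3))
```

With these weights the influence of the first input shrinks at every step.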

# Language Models based on Transformers
***
<img src="img/slides/transformer.png" style="width: 700px; float: right;">

Transformers address these issues of CNNs and RNNs with an
encoder-decoder architecture and self-attention

**encoder**:
* input (positional word embeddings) first flows through a self-attention layer
    * self-attention helps the encoder look at other words in the input sentence
* the outputs of the self-attention layer are fed to a feed-forward neural network

**decoder**:
* also self-attention layer and feed-forward neural network
* in between is an attention layer that helps the decoder focus on relevant parts of the input sentence



**Note**: figure reproduced from https://jalammar.github.io/illustrated-transformer/

# Transformer: Self-Attention
***
<img src="img/slides/exbert_input.png" style="">
<img src="img/slides/exbert_output.png" style="">

Example: What does “it” in this sentence refer to?
* self-attention allows the model to look at other positions in the input sequence to get a better encoding for this word

**Note**: creating attention plots: https://exbert.net/exBERT.html?model=bert-base-cased&modelKind=bidirectional&sentence=This%20movie%20is%20terrible%20but%20it%20has%20some%20good%20effects.&corpus=woz&layer=1&heads=..9&threshold=0.5&tokenInd=6&tokenSide=right&maskInds=..&metaMatch=pos&metaMax=pos&displayInspector=null&offsetIdxs=..-1,0,1&hideClsSep=true

# How can we calculate attention scores?
***
Create three vectors from each of the encoder’s input word embeddings: Query vector, Key vector, Value vector
* these vectors are created by multiplying the embedding by three matrices that are learned during training

with these three vectors and after a few more steps... we can calculate an attention score

**Note**: detailed explanation at https://jalammar.github.io/illustrated-transformer/
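
The "few more steps" are commonly written as scaled dot-product attention, $\text{softmax}(QK^T / \sqrt{d_k})\,V$. Below is a minimal single-head sketch in plain Python. The toy embeddings and the identity weight matrices are assumptions for illustration; in a real model the three matrices are learned.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def matvec(vector, matrix):
    """Multiply a row vector with a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def self_attention(embeddings, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention."""
    queries = [matvec(e, W_q) for e in embeddings]
    keys = [matvec(e, W_k) for e in embeddings]
    values = [matvec(e, W_v) for e in embeddings]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for query in queries:
        # Score this word against every word of the sentence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in keys]
        weights = softmax(scores)
        # The new encoding is the attention-weighted sum of the value vectors.
        output = [sum(w * value[j] for w, value in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: 3 words with 2-dimensional embeddings, identity projections.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, attention_weights = self_attention(embeddings, identity, identity, identity)
for weights in attention_weights:
    print([round(w, 3) for w in weights])
```

Each printed row shows how strongly one word attends to every word in the sentence.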

# Pre-Training and Fine-Tuning
***
<img src="img/slides/pre_training_embeddings.png" style="">

**Idea**:
* Only pre-train the embeddings (e.g. Word2Vec)
* Put on top a neural network/classification layer
    * incorporate context while training on a downstream task (e.g. sentiment analysis)

**Problem**: training data for downstream task must be sufficient to teach all contextual aspects of language

# Pre-Training and Fine-Tuning
***
<img src="img/slides/pre_training_all.png" style="float: middle;">

**Idea**:
* all weights are initialized via pre-training
* fine-tune with only a small amount of training data on the downstream task (e.g. sentiment analysis), because the general contextual knowledge was already taught during pre-training

# Language Models: GPT
***
<img src="img/slides/decoders.png" width= 200 style="float: right;">

GPT (Generative Pre-trained Transformer)
* based on pre-trained decoders
* helpful in tasks where the output is a sequence with a vocabulary like that in pre-training
    * e.g. Translation, Summarization, ...

**Note**: GPT-2 and GPT-3 have mostly the same architecture but are trained on more/other tasks and muuuch more data


# Language Models: BERT
***
<img src="img/slides/encoder.png" width= 500 style="float: right;">

BERT (Bidirectional Encoder Representations from Transformers)
* based on pre-trained encoders
* encoders get bidirectional context by using the \[MASK\] token

# Let's go for a little deep dive into BERT!
***
<div style="float: right; display: grid; grid-auto-rows: 50% 50%;">
<img src="img/slides/bert-base-bert-large.png" width= 500 style="float: left;">
<img src="img/slides/bert-base-bert-large-encoders.png" width= 500 style="float: left;">
</div>

two main models
* BERT base
* BERT large

differ in:
* number of encoder layers (12 vs 24)
* hidden size, which also makes the feed-forward networks larger (768 vs 1024 hidden units)
* number of attention heads in multi-head attention (12 vs 16)

**Note**: figure reproduced from http://jalammar.github.io/illustrated-bert/

# BERT: Pre-Training and Fine-Tuning
***
Pre-training on BookCorpus and English Wikipedia
* using two tasks to teach contextual knowledge
    * fill-mask task, next sentence prediction task
* semi-supervised
* **Note**: you can use BERT already after pre-training for the two tasks (fill-mask and next sentence prediction)
    
Fine-tuning on your favourite downstream task
* "Pre-train once, fine-tune many times"
* take pre-trained BERT, delete old classification layer and put a new classification layer for your downstream task on top
* downstream tasks like spam classifier, sentiment analysis, fact checker, special fill-mask task (e.g. to predict facts), ...
* supervised

# BERT: fixed vocabulary
***
<img src="img/slides/BERT_basecased_vocab.png" width= 250 style="float: right;">

The vocabulary of BERT base consists of tokens which BERT can "understand" and can predict for the \[MASK\] token

If a word of the input sentence is not in the vocabulary, BERT can compose it from several sub-word tokens
* these tokens have nothing to do with real syllables
* this extends BERT to words that are not in the vocabulary but can be composed of tokens
* e.g. incredible $\rightarrow$ incred ##ible (Only for explanatory purposes, normally\
"incredible" is in the vocabulary)

BUT BERT can only predict a **single token** of its vocabulary for a \[MASK\] token
* e.g. Edinburgh $\rightarrow$ Edinburgh (can predict "Edinburgh" because it is a single token)
* e.g. incredible $\rightarrow$ incred ##ible (cannot predict "incredible" because two tokens, could only predict "incred" **or** "##ible")

**Note**: whole-word-masking is a good keyword here if you want to do multi-token prediction
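
How a word is split into tokens can be sketched with a greedy longest-match-first search, which is the idea behind BERT's WordPiece tokenizer. This is a simplified illustration, not the real tokenizer: `toy_vocab` is a made-up mini vocabulary and `wordpiece_tokenize` is a hypothetical helper.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown_token]  # the word cannot be composed of tokens
        tokens.append(piece)
        start = end
    return tokens


# Made-up mini vocabulary, following the explanatory example above.
toy_vocab = {"Edinburgh", "incred", "##ible", "good"}

print(wordpiece_tokenize("Edinburgh", toy_vocab))   # ['Edinburgh']
print(wordpiece_tokenize("incredible", toy_vocab))  # ['incred', '##ible']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```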

# Talk to pre-trained BERT: fill-mask task
***
<img src="img/slides/BERT_0.png" width= 700 style="float: middle;">

\[CLS\]: classification token

\[SEP\]: separator token (only needed for e.g. the next sentence prediction task, but it is always added)

\[PAD\]: padding token

**Note**: figures reproduced from http://jalammar.github.io/illustrated-bert/

# Look inside of pre-trained BERT
***
<img src="img/slides/BERT_1.png" width= 700 style="float: middle;">

<img src="img/slides/pre_training_all.png" width= 300 style="float: right;">

Remember the figure of pre-training
* all weights were initialized during pre-training
* BERT learned its own positional embeddings

**Note**: figures reproduced from http://jalammar.github.io/illustrated-bert/

# Get prediction of BERT: fill-mask task
***
<img src="img/slides/BERT_2.png" width= 700 style="float: middle;">

**Note**: figures reproduced from http://jalammar.github.io/illustrated-bert/

# Get prediction of BERT: fill-mask task
***
<img src="img/slides/BERT_3.png" width= 700 style="float: middle;">

**Note**: figures reproduced from http://jalammar.github.io/illustrated-bert/

# BERT: Pre-Training with fill-mask task
***
<img src="img/slides/BERT-language-modeling-masked-lm.png" width= 700 style="float: right;">

1) add classification layer + Softmax (over all tokens of vocabulary) on top of BERT to pre-train BERT on the fill-mask task

2) randomly mask 15\% of the tokens of the input sentence to teach BERT to predict the correct words for the \[MASK\]-tokens
* no need to get labels (= semi-supervised) because we \
only hide words of the sentence, which BERT \
should then predict

**Note**: figure reproduced from\
http://jalammar.github.io/illustrated-bert/
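
Creating training examples for the fill-mask task could look roughly like this. It is a simplified sketch: every token is masked independently with probability 0.15, whereas the original BERT recipe also sometimes keeps the selected token or swaps in a random one. `mask_tokens` is a hypothetical helper name.

```python
import random


def mask_tokens(tokens, mask_probability=0.15, seed=None):
    """Replace random tokens by [MASK] and remember the hidden originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append("[MASK]")
            labels.append(token)   # the label comes from the text itself
        else:
            masked.append(token)
            labels.append(None)    # nothing to predict at this position
    return masked, labels


tokens = "this movie is terrible but it has some good effects".split()
masked, labels = mask_tokens(tokens, seed=42)
print(masked)
print(labels)
```

The labels are taken from the sentence itself, so no manual annotation is needed.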

# BERT: Next sentence prediction task
***
<img src="img/slides/bert-next-sentence-prediction.png" width= 700 style="float: right;">

1) add classification layer + Softmax (two labels: IsNext and NotNext) on top of BERT to pre-train BERT on the next sentence prediction task

2) two sentences as input and label (IsNext or NotNext) as output to teach BERT relationships between multiple sentences
* no need to get labels (= semi-supervised) because we \
use sentences from BookCorpus and Wikipedia where we \
know whether they follow each other

**Note**: figure reproduced from \
http://jalammar.github.io/illustrated-bert/
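
Building labeled sentence pairs from plain text could be sketched as follows. This is a simplification under assumptions: negatives are drawn from the same small text, the 50/50 split follows the usual description of BERT's pre-training, and `make_nsp_examples` and the example sentences are made up here.

```python
import random


def make_nsp_examples(sentences, seed=None):
    """Create (sentence_a, sentence_b, label) triples from ordered sentences."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            examples.append((first, sentences[i + 1], "IsNext"))
        else:
            # Negative example: any sentence except `first` and its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((first, rng.choice(candidates), "NotNext"))
    return examples


# Made-up mini text; needs at least 3 sentences so negatives can be drawn.
text = [
    "The movie starts slowly.",
    "After an hour it gets exciting.",
    "The ending is a surprise.",
    "I would watch it again.",
]
for sentence_a, sentence_b, label in make_nsp_examples(text, seed=1):
    print(label, "|", sentence_a, "|", sentence_b)
```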

# BERT: Fine-Tuning
***
Example downstream task: spam classification

Dataset
<img src="img/slides/spam-labeled-dataset.png" width= 300 style="">

<img src="img/slides/bert-classifier.png" width= 700 style="float: right;">



1) cut off the classification layers of the fill-mask and next sentence prediction tasks

2) add classification layer + Softmax (two labels: Spam and NotSpam) on top of BERT to fine-tune BERT on the spam classification task

3) email message as input and label (Spam or NotSpam) as output to teach BERT to classify spam

**Note**: figures reproduced from http://jalammar.github.io/illustrated-bert/

# Use fine-tuned BERT to classify spam
***
use the fine-tuned BERT consisting of the pre-trained part and the new classification layer
* BERT can now predict whether an email message that was not seen during fine-tuning is spam or not

Example spam message: "Help Prince Mayuko Transfer Huge Inheritance"
<img src="img/slides/BERT-classification-spam.png" width= 700 style="float: middle;">

**Note**: figure reproduced from http://jalammar.github.io/illustrated-bert/


**BUT WAIT**: Didn't we set out to use BERT for sentiment analysis of movie reviews?!


# It's your turn!
***
Challenge
* fine-tune BERT so that it can predict the sentiment of a movie review
* use the IMDb dataset and the Hugging Face framework

Objective
* get the highest precision by playing around with:
    * the models (by now there are many more than BERT base and BERT large)
    * the hyperparameters during fine-tuning
    * the training dataset

# Sources
***
This lecture is based on http://web.stanford.edu/class/cs224n/. It is a really good course! :)