# Introduction to Natural Language Processing

November 18, 2021

[Judit Ács](https://hlt.bme.hu/en/judit),
[Ádám Kovács](https://hlt.bme.hu/en/kovacsadam)

May 17, 2023

[Dávid Márk Nemeskey](https://hlt.bme.hu/en/david)

## Agenda

- Overview of NLP tasks
- SpaCy NLP toolkit
- Neural networks in NLP, sequence modeling
- Pre-trained language models, BERT
- optional: BERT usage examples
- optional: a full sequence classification example with PyTorch

## Why do we need NLP?

- Make the computer understand text
- Extract useful information from it
- A collection of methods that helps us process huge amounts of text
- We have two directions:
    - Analysis: Convert text to a structural representation
    - Generation: Generate text from formal representation

## NLP tasks

### End-user tasks

- Spellchecking
- Machine translation
- Chatbots

### "Real" tasks?

- Low level NLP tasks:
    - tokenization
    - lemmatization
    - POS tagging
    - syntactic parsing
    - dependency parsing

- High level or downstream tasks:
    - summarization
    - question answering
    - information extraction (e.g. NER tagging)
    - relation extraction 
    - chatbots
    - machine translation
    - etc.

## [Spacy](https://spacy.io)

- Open-source NLP library for Python
- We are going to use [spaCy](https://spacy.io/) a lot for demonstrating NLP tasks
- It features many out-of-the-box models for NLP:
    - NER, POS tagging, dependency parsing, vectorization
- Hosts models for many languages

In [None]:
from IPython.display import Image
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn

from transformers import pipeline
from transformers import AutoTokenizer, AutoModel

In [None]:
import spacy
from spacy import displacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

## Tokenization

- Splitting text into words, sentences, documents, etc.
- One of the goals of tokenizing text into words is to create a <strong>vocabulary</strong>

<p><em>Muffins cost <strong>$3.88</strong> in New York. Please buy me two as I <strong>can't</strong> go. <strong>They'll</strong> taste good. I'm going to <strong>Finland's</strong> capital to hear about <strong>state-of-the-art</strong> solutions in NLP.</em></p>

- $3.88 - split on the period?
- can't - can not?
- They'll - they will?
- Finland's - Finland?
- state-of-the-art?

In [None]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

print(sens.split())
print(len(sens.split()))

In [None]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

doc = nlp(sens)

tokens = [token.text for token in doc]
print(tokens)

In [None]:
for sen in doc.sents:
    print(sen)

In [None]:
for token in doc[:5]:
    print(f"{token.text=}, {token.is_alpha=}, {token.is_stop=}")

In [None]:
pd.DataFrame([
    {'text': token.text, 'is_alpha': token.is_alpha, 'is_stop': token.is_stop}
    for token in doc
])

### Lemmatization

- The goal of lemmatization is to find the dictionary form of the words
- Called the "lemma" of a word
- _dogs_ -> _dog_ , _went_ -> _go_
- Ambiguity plays a role: _saw_ -> _see_?
- Needs POS tag to disambiguate

In [None]:
doc = nlp("I saw two dogs yesterday.")

lemmata = [token.lemma_ for token in doc]
print(lemmata)

### Stemming

- Similar to lemmatization, it tries to normalize the text
- Stems are usually substrings (typically prefixes) of the word and need not be dictionary words
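
The idea can be illustrated with a toy suffix stripper. This is a minimal sketch, not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length are hand-picked assumptions.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, hand-picked suffix list


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["dogs", "waited", "waiting", "went", "studies"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Unlike a lemmatizer, this leaves _went_ untouched and turns _studies_ into _studi_, which is not a dictionary word.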

### POS tagging

- Words can be grouped into grammatical categories.
- These are called the part-of-speech (POS) tags of the words.
- Words belonging to the same group are largely interchangeable in a sentence
- Ambiguity: _guard_ (noun or verb)?
- Similar to _szófaj_ in Hungarian grammar

In [None]:
doc = nlp("The white dog went to play football yesterday.")

[token.pos_ for token in doc]

<h3 id="Morphological-analysis">Morphological analysis</h3>
<ul>
<li>Splitting words into morphemes</li>
<li>Morphemes are the smallest meaningful units in a language (part of the words)</li>
<li>friend<span style="color: #e03e2d;">s</span>, wait<span style="color: #e03e2d;">ing</span>, friend<span style="color: #e03e2d;">li</span><span style="color: #3598db;">er</span></li>
<li>Tagging them with morphological tags</li>
<li>Ambiguity: <em>v&aacute;rnak</em></li>
</ul>

### Named entity recognition

- Identify the named entities present in the text (e.g. persons, locations, organizations)

In [None]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

doc = nlp(sens)
for ent in doc.ents:
    print(ent)

displacy.render(doc, style='ent', jupyter=True)

### Language modelling

- One of the most important tasks in NLP
- The goal is to compute the "probability" of a sentence
- Can be used in:
    - Machine Translation
    - Text generation
    - Correcting spelling
    - Word vectors?
- P(the quick brown __fox__) > P(the quick brown __stick__)
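
One simple way to assign such probabilities is a count-based bigram model. The sketch below is only an illustration: the four-sentence corpus is made up, and add-one smoothing is an assumed choice so that unseen word pairs do not get zero probability.

```python
from collections import Counter

# Tiny hypothetical corpus; real language models are trained on far more text.
corpus = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "the quick brown dog barks",
    "the old brown stick breaks",
]

history_counts, bigram_counts, vocab = Counter(), Counter(), set()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    vocab.update(tokens)
    history_counts.update(tokens[:-1])             # how often each word precedes another
    bigram_counts.update(zip(tokens, tokens[1:]))  # how often each word pair occurs


def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


def sentence_prob(sentence: str) -> float:
    """Probability of a sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


p_fox = sentence_prob("the quick brown fox")
p_stick = sentence_prob("the quick brown stick")
print(p_fox > p_stick)
```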

### <center>Lexical Inference, Natural Language Inference</center>



<div class="frame">

| Label             | Premise                                                       | Hypothesis                                               |
|:------------------|:--------------------------------------------------------------|:---------------------------------------------------------|
| **entailment**    | A young family enjoys feeling ocean waves lap at their feet.  | A family is at the beach                                 |
| **contradiction** | There is no man wearing a black helmet and pushing a bicycle  | One man is wearing a black helmet and pushing a bicycle  |
| **neutral**       | An old man with a package poses in front of an advertisement. | A man poses in front of an ad for beer.                  |

</div>

## Demos

- http://e-magyar.hu/hu/parser
- https://demo.allennlp.org/
- https://talktotransformer.com/
- [GPT-3](https://github.com/elyase/awesome-gpt3) (*has 175B parameters*)

# Sequence modeling

## Recurrent neural networks

- In NLP, recurrent neural networks (RNNs) are commonly used to analyse sequences.
- An RNN takes in a sequence of words, one at a time, and produces a hidden state ($h$) after each step.
- RNNs are applied recurrently: at each step we feed in the current word and the hidden state from the previous word.
- Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear (fully connected) layer, $f$, to reduce its dimension to the number of labels.

$$
    h_t = \sigma_h(W_{h} x_t + U_{h} h_{t-1} + b_h) \\
    y_t = \sigma_y(W_{y} h_t + b_y),
$$

where:

* $x_t$ is the input at time $t$
* $y_t$ is the output at time $t$
* $h_t$ is the hidden state at time $t$,
* $W_h, W_y, U_h, b_h, b_y$ are the RNN's parameters
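
The hidden-state recurrence can be sketched in a few lines. This is a minimal illustration on plain Python lists (to stay dependency-free), assuming $\sigma_h = \tanh$; the dimensions, the random weights and the one-hot inputs are made up.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_h, U_h, b_h):
    """h_t = tanh(W_h x_t + U_h h_{t-1} + b_h); tanh is an assumed choice for sigma_h."""
    pre = [a + b + c for a, b, c in zip(matvec(W_h, x_t), matvec(U_h, h_prev), b_h)]
    return [math.tanh(p) for p in pre]


def rand_matrix(rows, cols):
    """Random weights in [-1, 1]; a trained network would learn these."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
input_dim, hidden_dim = 3, 2  # illustrative sizes
W_h, U_h = rand_matrix(hidden_dim, input_dim), rand_matrix(hidden_dim, hidden_dim)
b_h = [0.0] * hidden_dim

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three one-hot "words"
h = [0.0] * hidden_dim  # initial hidden state h_0
for x_t in sequence:
    h = rnn_step(x_t, h, W_h, U_h, b_h)  # the same parameters are reused at every step
print(h)  # final hidden state h_T
```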

![rnn](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment1.png)

_(image from bentrevett)_

## LSTM

- One of the biggest problems of recurrent neural networks is the vanishing gradient problem.
- It happens when the gradient shrinks during backpropagation.
- If it becomes very small, the network stops learning. This mostly happens with long sentences.
- LSTM networks address this problem with an inner memory cell that remembers important information and forgets the rest.
- An LSTM has a similar flow to an RNN: it processes data and passes information on as it propagates forward.

$$
f_t = \sigma_g(W_{f} x_t + U_{f} h_{t-1} + b_f) \\
i_t = \sigma_g(W_{i} x_t + U_{i} h_{t-1} + b_i) \\
o_t = \sigma_g(W_{o} x_t + U_{o} h_{t-1} + b_o) \\
\tilde{c}_t = \sigma_c(W_{c} x_t + U_{c} h_{t-1} + b_c) \\
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t = o_t \circ \sigma_h(c_t)
$$
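
The six equations translate almost line by line into code. The sketch below assumes the usual choices $\sigma_g$ = sigmoid and $\sigma_c = \sigma_h = \tanh$, which the formulas above do not fix; the `params` layout and the hand-set biases are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x_t, h_prev):
    """Compute W x_t + U h_{t-1} + b on plain Python lists."""
    return [
        sum(w * x for w, x in zip(W_row, x_t)) + sum(u * h for u, h in zip(U_row, h_prev)) + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps each of 'f', 'i', 'o', 'c' to a (W, U, b) triple."""
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]        # forget gate
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]        # input gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]        # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]  # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


zeros = [[0.0]]  # 1x1 zero weight matrix (input and hidden size are both 1)
params = {
    "f": (zeros, zeros, [10.0]),   # forget gate close to 1: keep the old cell state
    "i": (zeros, zeros, [-10.0]),  # input gate close to 0: ignore the new candidate
    "o": (zeros, zeros, [10.0]),   # output gate close to 1
    "c": (zeros, zeros, [0.0]),
}
h, c = [0.0], [0.5]
for x_t in [[1.0], [-1.0], [2.0]]:
    h, c = lstm_step(x_t, h, c, params)
print(c)  # stays close to the initial 0.5
```

With the forget gate saturated near 1 and the input gate near 0, the cell state is carried through the sequence almost unchanged, which is the mechanism that lets an LSTM keep information over many steps.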

# Sequence elements

We deal with sequences in NLP:
- a token is a sequence of characters/morphemes
- a sentence is a sequence of tokens
- a paragraph is a sequence of sentences
- a dialogue is a sequence of utterances
- etc.

What are the elements of these sequences?

## Words

Pros:

- More or less well-defined in most languages
- Relatively short sequences (a sentence is rarely longer than 30 tokens)

Cons:
- Difficult tokenization in some languages
- Large vocabulary (100,000+ easily)
- Out-of-vocabulary words are always there regardless of the size of the vocabulary
- Many rare words
    - Hapax: a word that only appears once in the dataset.
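
Vocabulary size, hapaxes and out-of-vocabulary words are easy to inspect with a counter. A small sketch on a made-up sentence with whitespace tokenization:

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the log"  # made-up toy "dataset"
counts = Counter(text.split())

vocabulary_size = len(counts)
hapaxes = sorted(word for word, n in counts.items() if n == 1)
print(vocabulary_size, hapaxes)

# Words of a new sentence that are missing from the vocabulary (out-of-vocabulary words)
oov = [word for word in "the bird sat".split() if word not in counts]
print(oov)
```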

## Characters

Pros:
- Smaller vocabulary although logographic writing systems (Chinese and Japanese) have thousands of characters
- Easy tokenization
- Well defined: Unicode symbols

Cons:
- Long sequences
- Too fine-grained, token level information is lost

## Subwords

- Multiple characters but smaller than words
- Modern language models use subword vocabularies
- We will cover these next week

# Embeddings

## Word embeddings

- map each word to a low-dimensional (around 100-300) continuous vector
- similar words should have similar vectors
    - Cosine similarity
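
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch; the three-dimensional vectors are made up for illustration, real embeddings are learned from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; real embeddings have around 100-300 dimensions.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
sim_dog_puppy = cosine_similarity(embeddings["dog"], embeddings["puppy"])
sim_dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(sim_dog_puppy > sim_dog_car)
```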

## Creating word embeddings

Word embeddings are learned with neural networks. The target can be:

- predict the word given the context - the Continuous Bag Of Words model (CBOW)
- predict the context given a word - the skip-gram model

The training examples are generated from big text corpora.
- no need for expensive manual annotation
- only limited by the availability of textual data

For example from the sentence “The quick brown fox jumps over the lazy dog.” we can generate the following inputs:

![training examples](http://mccormickml.com/assets/word2vec/training_data.png)
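
Generating such (center word, context word) pairs takes only a few lines. The sketch below assumes a window of two words on each side and plain whitespace tokenization; `skipgram_pairs` is a hypothetical helper, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = list(skipgram_pairs(tokens, window=2))
print(pairs[:5])
```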

### Famous static word embeddings for English

- [Word2vec](https://arxiv.org/pdf/1301.3781.pdf)
- [GloVe](https://nlp.stanford.edu/projects/glove/)

# Types of sequence tasks

## Sequence classification

Assign a single label to the full sequence:

<img src="img/tikz/abstract_sequence_classification.png" width="350" />

__Applications__

- Topic classification 
- Sentiment analysis: is this sentence or paragraph a positive (1) or a negative (0) review?

<img src="img/tikz/example_sequence_classification.png" width="500" />

## Sequence tagging

Assign a label to each element of the sequence:

<img src="img/tikz/abstract_sequence_tagging.png">

__Applications__

- part-of-speech tagging
- named entity recognition (NER)

<img src="img/tikz/example_sequence_tagging.png" >

## Seq2seq

<img src="img/tikz/abstract_seq2seq.png" width=600px>

- Maps a source sequence to a target sequence
    - Arbitrary length
    
- Two steps:
    1. Encode: create a representation of the source
    2. Decode: generate the target sequence
        - autoregressive: generate tokens from left-to-right one-by-one (condition on the left context)
        
- Applications:
    - Neural machine translation
    - Morphological inflection

- Usually implemented as two separate neural networks, for example:
    - The encoder is a bidirectional LSTM
    - The decoder is a unidirectional LSTM
    
- Problems:
    - The input sequence is compressed into a single vector
    - The decoding steps rely on the same input representation in every step
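
The encode-then-decode loop can be sketched with stand-in components. `encode` and `next_token` below are hypothetical placeholders for the two neural networks (here they simply reverse the input); only the shape of the autoregressive loop is the point.

```python
def encode(source):
    """Stand-in encoder: a real one would return vectors, here we keep the tokens."""
    return list(source)


def next_token(source_repr, generated):
    """Stand-in decoder step: reverses the source, then emits the end symbol."""
    position = len(generated)
    if position < len(source_repr):
        return source_repr[-1 - position]
    return "</s>"


def greedy_decode(source, max_len=20):
    """Generate the target left to right, one token at a time."""
    source_repr = encode(source)  # step 1: encode the source once
    generated = []
    while len(generated) < max_len:
        token = next_token(source_repr, generated)  # step 2: condition on the left context
        if token == "</s>":
            break
        generated.append(token)
    return generated


print(greedy_decode(["a", "b", "c"]))
```

Note that the target length is decided by the decoder (when it emits the end symbol), not by the source length.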

### Attention mechanism

Attention:
- emphasizes the important part of the input
- and de-emphasizes the rest.
- Mimics cognitive attention.

Method:
- It does this by assigning weights to the elements of the input sequence.
- The weights depend on the current context in the decoder:
    - the current decoder hidden state,
    - the previous output.
- The source vectors are multiplied by the weights and then summed -> **context vector**
- The context vector is used for predicting the next output symbol.
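
The weighting-and-summing step can be written out directly. The sketch assumes dot-product scoring between the decoder state and each source vector, which is only one of several scoring functions in use; the vectors are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, source_vectors):
    """Weight each source vector by its similarity to the decoder state, then sum."""
    scores = [sum(d * s for d, s in zip(decoder_state, vec)) for vec in source_vectors]
    weights = softmax(scores)
    dim = len(source_vectors[0])
    context = [sum(w * vec[k] for w, vec in zip(weights, source_vectors)) for k in range(dim)]
    return weights, context


source_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up encoder outputs
decoder_state = [2.0, 0.0]                               # made-up decoder hidden state
weights, context = attend(decoder_state, source_vectors)
print(weights, context)
```

The second source vector is orthogonal to the decoder state, so it receives the smallest weight and contributes least to the context vector.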

[image source](https://aihub.cloud.google.com/u/0/p/products%2F024b89fd-9bc8-4c24-b8a8-e347479f3270):

In [None]:
Image("img/dl/attention_mechanism.jpg")

#### Problems

Recall that we used recurrent neural cells, specifically LSTMs to encode and decode sequences.

__Problem 1. No parallelism__

LSTMs are recurrent: they rely on their left and right history (horizontal arrows), so the symbols need to be processed in order -> no parallelism.

__Problem 2. Long-range dependencies__

Long-range dependencies are not infrequent in NLP.

"The **people/person** who called and wanted to rent your house when you go away next year **are/is** from California" -- Miller & Chomsky 1963

LSTMs have a problem capturing these because there are too many backpropagation steps between the symbols.

# Transformers

Introduced in [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al., 2017 (30000 citations)

Transformers solve Problem 1 by relying purely on attention instead of recurrence.

Not having recurrent connections means that the model no longer has a built-in notion of sequence position.

Recurrence is replaced by **self attention**.

Each symbol is encoded the following way:

__Step 1__: the encoder 'looks' at the other symbols in the input sequence.

In the example above, the representation of **are/is** depends on **people/person** more than on any other word in the sentence, so **people/person** should receive the highest attention weight.

In [None]:
Image("http://jalammar.github.io/images/t/transformer_self-attention_visualization.png", embed=True)  # from Illustrated Transformers

__Step 2__: the context vector is passed through a feed-forward network which is shared across all symbols.

In [None]:
Image("http://jalammar.github.io/images/t/encoder_with_tensors.png", embed=True)  # from Illustrated Transformers

This visualization is available in the [Tensor2tensor notebook in Google Colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)

## Other components

__Residual connections__

- Also called __skip connections__
- The output of a module is added to the input

$$
\text{output} = \text{layer}(\text{input}) + \text{input}
$$

__Softmax__

- As an output layer, only used at the top of the decoder (a softmax also normalizes the weights inside each attention layer)
- Maps the output vector to a probability distribution
    - In other words it tells us how likely each symbol is.
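
Both components are small enough to write out. In this sketch `double` is a hypothetical stand-in for a real sub-layer, and the three-word vocabulary and its scores are made up.

```python
import math


def residual(layer, x):
    """output = layer(input) + input, applied coordinate-wise."""
    return [y + x_i for y, x_i in zip(layer(x), x)]


def softmax(logits):
    """Map a vector of scores to a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def double(vec):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * v for v in vec]


skip_output = residual(double, [1.0, -2.0])

vocab = ["fox", "stick", "dog"]  # made-up output vocabulary
logits = [2.0, 0.5, 1.0]         # made-up decoder output scores
probs = softmax(logits)
best = vocab[probs.index(max(probs))]
print(skip_output, probs, best)
```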

## Multiple heads and layers

Transformers have a number of additional components summarized in this figure:

In [None]:
Image("img/dl/transformer.png")  # from Vaswani et al. 2017

## Positional encoding

Without recurrence word order information is lost.

Positional information is important:

    John loves Mary.
    Mary loves John.

Transformers apply positional encoding:

$$
\text{PE}_{\text{pos},2i} = \sin(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}), \\
\text{PE}_{\text{pos},2i+1} = \cos(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}),
$$

where:
- $d_{\text{model}}$ is the input dimension to the Transformer, usually the embedding size
- $\text{pos}$ is the position of the symbol in the input sequence i.e. first word, second word etc.
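- $i$ indexes the sine/cosine coordinate pairs of the encoding vector, so $2i$ and $2i+1$ are the coordinate indices ($0 \le i < d_{\text{model}}/2$).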
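
The two formulas can be evaluated directly. A minimal sketch, assuming an even $d_{\text{model}}$ and positions counted from zero; the tiny dimension of 4 is only for readability.

```python
import math


def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal encoding of one position; assumes d_model is even."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even coordinates use the sine
        pe[2 * i + 1] = math.cos(angle)  # odd coordinates use the cosine
    return pe


pe0 = positional_encoding(0, d_model=4)
pe1 = positional_encoding(1, d_model=4)
print(pe0)
print(pe1)
```

Each position gets a different vector, so after adding it to the word embeddings, "John loves Mary" and "Mary loves John" no longer look the same to the model.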

# Contextualized embeddings

- static representations (GloVe, Word2vec)
    - the same vector is assigned to each occurrence of a word
    
But words can have different meanings in different contexts, e.g. the word 'stick':

1. Find some dry <b>sticks</b> and we'll make a campfire.
2. Let's <b>stick</b> with glove embeddings.

Contextualized embeddings take the full sentence as their input.
    
![elmo](http://jalammar.github.io/images/elmo-embedding-robin-williams.png)

_(Peters et al., 2018 in the ELMo paper)_

## ELMo

**E**mbeddings from **L**anguage **Mo**dels

Word representations are functions of the full sentences instead of the word alone.

The outputs of two bidirectional LSTM layers (together with the input token representation) are linearly combined with learned weights.

[Deep contextualized word representations](https://arxiv.org/abs/1802.05365) by Peters et al., 2018, 8400 citations

# BERT

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://www.aclweb.org/anthology/N19-1423/)
by Devlin et al. 2018, 29600 citations

[BERTology](https://huggingface.co/transformers/bertology.html) is the nickname for the growing amount of BERT-related research.

Trained on two tasks:

1. Masked language model:

    1. 15% of the wordpieces (BERT's subword tokens) are selected at the beginning.
    2. 80% of those are replaced with `[MASK]`,
    3. 10% are replaced with a random token,
    4. 10% are kept intact.
    
2. Next sentence prediction:
    - Are sentences A and B consecutive sentences?
    - Training pairs are generated so that 50% are consecutive and 50% are not.
    - Binary classification task.
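
The masking recipe of the first task can be sketched as follows. This is a simplification, not BERT's actual data pipeline: each token is selected independently with probability 0.15 instead of selecting exactly 15% of them, and `mask_for_mlm` and the toy vocabulary are hypothetical.

```python
import random


def mask_for_mlm(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:  # not selected: left alone, nothing to predict
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must predict the original wordpiece
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep intact
    return corrupted, labels


tokens = "the quick brown fox jumps over the lazy dog".split() * 5
vocab = sorted(set(tokens))
corrupted, labels = mask_for_mlm(tokens, vocab)
print(corrupted[:9])
```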

## Embedding layer

In [None]:
Image("img/dl/bert_embedding.png")

## Transformer layers


## Finetuning

1. Take a pre-trained BERT model.
2. Add a small classification layer on top (typically a 2-layer MLP).
3. Train BERT along with the classification layer on an annotated dataset.
    - Much smaller than the data BERT was trained on

Another option: freeze BERT and train the classification layer only.
- Easier training regime.
- Smaller memory footprint.
- Worse performance.

In [None]:
Image("img/dl/bert_encoding_finetuning.png")

## BERT pretrained checkpoints

### BERT-Base

- 12 layers
- 110M parameters

### BERT-Large

- 24 layers
- 340M parameters

### Cased and uncased

Uncased: everything is lowercased. Diacritics are removed.

### Multilingual BERT - mBERT

A 104-language version trained on the 100 largest Wikipedias.

# BERT tokenization

## WordPiece tokenizer

BERT's input **must** be tokenized with BERT's own tokenizer.

A middle ground between word and character tokenization.

Static vocabulary:
- Built with WordPiece, a subword method closely related to byte-pair encoding (BPE), a simple frequency-based algorithm
- Continuation symbols (\#\#symbol)
- Special tokens: `[CLS]`, `[SEP]`, `[MASK]`, `[UNK]`
- It tokenizes everything, falling back to characters and `[UNK]` if necessary
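
At tokenization time a word is split greedily into the longest matching vocabulary entries. The sketch below illustrates this longest-match-first idea on a tiny made-up vocabulary; it is not the Hugging Face implementation, and `wordpiece_tokenize` is a hypothetical helper.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation symbol
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


# Made-up toy vocabulary; BERT's real vocabularies have tens of thousands of entries.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able", "b", "##e", "##a", "##g", "##l"}

for word in ["playing", "unplayable", "beagle", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Frequent words stay whole or split into a few pieces, rarer ones fall back to single characters, and a word with a character missing from the vocabulary becomes `[UNK]`.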

`AutoTokenizer` is a factory class for pretrained tokenizers. Each pretrained checkpoint has a string id; `from_pretrained` instantiates the corresponding tokenizer class and loads its vocabulary:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-uncased')
print(len(t.get_vocab()))

t.tokenize("My beagle's name is Tündérke.")

In [None]:
t.tokenize("Русский")

**Cased** models keep diacritics:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-cased')
print(len(t.get_vocab()))

t.tokenize("My beagle's name is Tündérke.")

It tokenizes Chinese and Japanese character by character but doesn't know all the characters:

In [None]:
t.tokenize("日本語")

Korean is missing from this version:

In [None]:
t.tokenize("한 한국어")

## mBERT tokenization

104 languages, 1 vocabulary

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
len(t.get_vocab())

In [None]:
t.tokenize("My puppy's name is Tündérke.")

In [None]:
t.tokenize("한 한국어")

In [None]:
t.tokenize("日本語")

# Using BERT

## Using `BertModel` directly

`AutoModel`
- each pretrained checkpoint has a string id. `from_pretrained` instantiates the corresponding class and loads the weights:

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')
type(model), type(tokenizer)

In [None]:
sentence = "There are black cats and black dogs."

output = model(**tokenizer(sentence, return_tensors='pt'), return_dict=True)

for k, v in output.items():
    print(f"{k}: {v.size()=}")

In [None]:
import gc

del model
gc.collect()

## BERT applications

### Sequence classification

Pretrained model for sentiment analysis.

Base model: `distilbert-base-uncased`

Finetuned on the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) or SST-2, a popular sentiment analysis dataset.

Model id: `distilbert-base-uncased-finetuned-sst-2-english`

In [None]:
nlp = pipeline("sentiment-analysis")
nlp("This is an amazing class.")

In [None]:
nlp("This is not a good class but it's not too bad either.")

In [None]:
nlp("This is not a class.")

In [None]:
del nlp
gc.collect()

### Sequence tagging/labeling: Named entity recognition

Base model: `bert-large-cased`

Finetuned on [CoNLL-2003 NER](https://www.clips.uantwerpen.be/conll2003/ner/).

In [None]:
nlp = pipeline("ner")

In [None]:
result = nlp("Jupiter is a planet that orbits around James the center of the Universe")
result

In [None]:
result = nlp("George Clooney has a pet pig named Estella.")
result

In [None]:
del nlp
gc.collect()

### Machine translation

In [None]:
nlp = pipeline("translation_en_to_fr")
print(nlp("Hugging Face is a technology company based in New York and Paris", max_length=40))

Even the [blessé - blessed false cognate](https://frenchtogether.com/french-english-false-friends/) is handled correctly:

In [None]:
nlp("I was blessed by God after I injured my head.", max_length=40)

In [None]:
del nlp
gc.collect()

### Summarization

In [None]:
summarizer = pipeline("summarization")
# Adjacent string literals are concatenated, so every sentence ends with an explicit space
text = (
    "Deep learning is used almost exclusively in a Linux environment. "
    "You need to be comfortable using the command line if you are serious about deep learning and NLP. "
    "Most NLP and deep learning libraries have better support for Linux and MacOS than Windows. "
    "Most papers nowadays release the source code for their experiments with Linux support only."
)
summarizer(text, min_length=5)

### Sentiment Analysis
- In the simplest case, decide whether a text is negative or positive.

In [None]:
sentiment = pipeline("sentiment-analysis")
sentiment(['This class is really cool! I would recommend this to anyone!'])

### Question Answering

- Given a context and a question choose the right answer
- Can be extractive or abstractive

In [None]:
question_answerer = pipeline('question-answering')
question_answerer({
    'question': 'Who went to the store ?',
    'context': 'Adam went to the store yesterday.'})

## GPT-2 text generation

Causal language modeling is when the $i^{th}$ token is modeled based on all the previous tokens, as opposed to masked language modeling, where both left and right context are used.
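
The difference comes down to which positions the model may look at when predicting a token. A tiny sketch with 0-based positions; `visible_positions` is a hypothetical helper for illustration.

```python
def visible_positions(i, n, causal):
    """Positions whose tokens may be used when predicting token i of an n-token sequence."""
    if causal:
        return list(range(i))  # causal LM: only the left context
    return [j for j in range(n) if j != i]  # masked LM: left and right context


print(visible_positions(2, 5, causal=True))
print(visible_positions(2, 5, causal=False))
```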

In [None]:
text_generator = pipeline("text-generation")

In [None]:
print(text_generator("This is a serious issue we should address", max_length=50, do_sample=False)[0]['generated_text'])

In [None]:
print(text_generator("Twitter is a bad idea, Jack Dorsey had a bad day when he came up with it", max_length=100, do_sample=False)[0]['generated_text'])

In [None]:
del text_generator
gc.collect()

# Further information

[Official PyTorch Transformer tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- Famous blog post with a detailed gentle introduction to Transformers

[The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- A walkthrough of the original Transformer paper with code and detailed illustrations

[Huggingface Transformers - Summary of tasks](https://huggingface.co/transformers/task_summary.html)

[My blog post about mBERT's tokenizer](http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html)