

# Welcome to Natural Language Processing!

This course is about Natural Language Processing (NLP) with Keras/TensorFlow and the HuggingFace collection of language models. NLP is the field in statistical learning that teaches computers how to 'understand' language. (Or at least, how to make it appear that they do 😉.)

In this course you will learn to work with the most current NLP techniques that build on (deep) neural networks. You will

- Use state-of-the-art Transformer models to train a document classifier with Keras
- Train your own word vector embeddings
- Learn the fundamentals behind autoregressive language models using RNNs
- Use pretrained models to complete sentences
- Let state-of-the-art models from HuggingFace answer questions


If you've taken the Introduction to Deep Learning course, you'll know everything you need to be successful.

Now let's get started!

# Introduction

The `transformers` library gives access to a host of pretrained language models for various NLP tasks:

- `feature-extraction` (get the vector representation of a text)
- `fill-mask`
- `ner` (named entity recognition)
- `question-answering`
- `sentiment-analysis`
- `summarization`
- `text-generation`
- `translation`
- `zero-shot-classification`

The easiest way to access these models is by means of the `pipeline` function. Calling `pipeline` lets you _instantiate_ a model as an object that you can call on a string of text. The object takes care of preprocessing the text (tokenization, etc.), running the model, and turning the output into a format that is easy to understand and use.

Let's take a look at some of these models.

# Sentiment analysis

The first item on the list that we'll try is `sentiment-analysis`.

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


Downloading:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

As you can see, the first thing the call to `pipeline` does (besides complaining that we weren't specific about the model we wanted; more on that later) is download files. These are the pretrained model files that are stored on the [🤗 HuggingFace repository](https://huggingface.co). If they are not (yet) available in your current working environment, they need to be downloaded. The `transformers` library takes care of this for you _automagically_.

Let's see what `classifier` does:

In [2]:
classifier(
    ["It is a new day, the sun is shining, I feel good.", 
     "I love this NLP stuff!",
     "This muscle ache is killing me",
     "I can reassure you that your headaches are now a sorrow of the past."])

[{'label': 'POSITIVE', 'score': 0.9998846054077148},
 {'label': 'POSITIVE', 'score': 0.9998706579208374},
 {'label': 'NEGATIVE', 'score': 0.9995550513267517},
 {'label': 'NEGATIVE', 'score': 0.9915013313293457}]

The last sentence is meant to reassure, yet the model labels it NEGATIVE with high confidence. It's clear that this model is not very good at recognizing double negatives... (But then again, are humans good at that?)
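
To read the labels next to the sentences they belong to, the output list can be paired with the input list. The snippet below is a small sketch that does not call the model: `sentences` and `results` are simply copied from the cell above (scores rounded), and the variable names are chosen here for illustration.

```python
# Inputs and outputs copied from the sentiment-analysis cell above (scores rounded).
sentences = [
    "It is a new day, the sun is shining, I feel good.",
    "I love this NLP stuff!",
    "This muscle ache is killing me",
    "I can reassure you that your headaches are now a sorrow of the past.",
]
results = [
    {"label": "POSITIVE", "score": 0.9998846},
    {"label": "POSITIVE", "score": 0.9998707},
    {"label": "NEGATIVE", "score": 0.9995551},
    {"label": "NEGATIVE", "score": 0.9915013},
]

# The pipeline returns one dict per input, in the same order as the inputs.
for sentence, result in zip(sentences, results):
    print(f"{result['label']:<8} {result['score']:.4f}  {sentence}")

# Collect the sentences that were labeled NEGATIVE.
negatives = [s for s, r in zip(sentences, results) if r["label"] == "NEGATIVE"]
```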

# Question answering

Next we'll try `question-answering`. In question answering, the model takes two strings:

1. A context—a text in which the answer can be found, and
2. A question that is _assumed to be_ answerable with the text in the context string.

The model was trained to predict the start and end of the segment in the _context_ that holds the answer to the _question_ (expressed as the locations of the first and last character of that segment).

So for instance, if

`context` = _"Lucy had named the kitten Purr."_

and the question is _"What was the name of the cat?"_, the model will try to predict `(26,30)`, because those are the start and end of the substring that contains "Purr" (the start index is inclusive and the end index exclusive, so `context[26:30]` is "Purr").
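
These offsets can be checked with plain string indexing. The sketch below does not involve the model at all; it only uses Python's `str.find` on the example context, and the variable names are chosen here for illustration.

```python
context = "Lucy had named the kitten Purr."
answer = "Purr"

# Position of the first character of the answer in the context.
start = context.find(answer)
# One past the last character, so that context[start:end] is the answer.
end = start + len(answer)

print((start, end))        # (26, 30)
print(context[start:end])  # Purr
```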

In [3]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


Downloading:   0%|          | 0.00/473 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/249M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

Notice that a different model is downloaded. Most tasks need their own dedicated model. Some of the models are really large (more than 1 GB), so we won't try all of them here. (But feel free to play around and have some fun with them! Basic usage documentation can be found on [🤗 HuggingFace](https://huggingface.co).)

In [4]:
print(question_answerer(
    context="His daughter Lucy had named the kitten Purr",
    question="What was the name of the cat?",
))
print(question_answerer(
    context="His daughter Lucy had named the kitten Purr",
    question="Who owns the cat?",
))

{'score': 0.5847036838531494, 'start': 32, 'end': 43, 'answer': 'kitten Purr'}
{'score': 0.46665525436401367, 'start': 13, 'end': 17, 'answer': 'Lucy'}


The model does seem to 'know' how to distinguish between the two names in the sentence. Let's try something more complicated:

In [5]:
TEXT = """Key developments in artificial neural networks and deep learning came not from 
    computer science, but from cognitive and mathematical psychology. Computer scientists 
    accelerated the development with more powerful computers and implementation frameworks."""

[
    question_answerer(question="In which academic field was deep learning developed?", context=TEXT),
    question_answerer(question="What was the role of computer science?", context=TEXT),
]


[{'score': 0.4984089136123657,
  'start': 111,
  'end': 148,
  'answer': 'cognitive and mathematical psychology'},
 {'score': 0.8162105083465576,
  'start': 175,
  'end': 186,
  'answer': 'accelerated'}]

# Text generation with GPT2

Remember the complaint about not being specific about the model? We can fix that by naming the model explicitly (instead of relying on the default choice). Here we'll use `distilgpt2`, the 🤗 HuggingFace simplified (distilled) version of the famous GPT2 model (the precursor to GPT3), to generate some text.

In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

Downloading:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/336M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

In [7]:
generator("Welcome to the natural language processing (NLP) module of Deep Learning in Python. In this course, you will learn ",
    max_length=100, num_return_sequences=2, pad_token_id=50256)

[{'generated_text': 'Welcome to the natural language processing (NLP) module of Deep Learning in Python. In this course, you will learn ㅠㅠㅠ, as well as learn how to construct a set of Python words using Python 2.'},
 {'generated_text': 'Welcome to the natural language processing (NLP) module of Deep Learning in Python. In this course, you will learn 他仕好合何好地 (莣状蓤些社, 檻蓤些社, 硾 媻蓤些社, 硾 媻蓤些社, 硾'}]

Not exactly sensible, nor Shakespeare, and not exactly GPT3 either, but you get the drift.


# State of the art NLP models

All of these models (plus the models for the NLP tasks listed earlier) were built using what is known as a _Transformer_ architecture. We'll come back to what that entails in more detail in a later tutorial. For now, let's recognize that these types of language models are currently (in 2022) the absolute state of the art. 

The models we've seen here are not the best available models, but they do share the same basic architecture with the best ones. What sets the latter apart is not only their performance on a litany of tasks, but certainly also their size: the models here were limited to under 500 MB of stored weights, whereas the best models are several GB and larger.

All of these models have been trained on enormous text corpora, such as the [Common Crawl](https://commoncrawl.org/), [Wikipedia (EN)](https://huggingface.co/datasets/wikipedia), or [WebText (EN)](https://paperswithcode.com/dataset/webtext), to represent the statistical dependencies between words, in an unsupervised way.

## How?

You may wonder how, and the answer is relatively simple: by predicting words from context words. This can take various forms, but let's focus on two:

1. Fill in the blank: Predict the (likelihood of the) word that goes in the empty space in, for example, `"the cat ___ the mouse"` (e.g., `ate`, `killed`, `caught` are all likely, but `painted`, `sung`, `adopted` are all unlikely).
2. Predict the next token, given the sequence of tokens so far, for each token in a piece of text. (We say token here instead of word, because a token may also be a punctuation character, or the _End Of Sequence_ token `<EOS>`.) For example, in `"the cat chased the mouse"` the model tries to predict 
  
  $P({\tt the})$,

  $P({\tt cat}|{\tt the})$,

  $P({\tt chased}|{\tt the, cat})$,

  $P({\tt the}|{\tt the, cat, chased})$,

  $P({\tt mouse}|{\tt the, cat, chased, the})$, and

  $P({\tt <\!EOS\!>}|{\tt the, cat, chased, the, mouse})$
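
The list of conditional probabilities above corresponds to one training example per token. The sketch below builds those (context, target) pairs for the running example. It assumes simple whitespace tokenization (real models use subword tokenizers), and `next_token_examples` is a hypothetical helper, not part of any library.

```python
EOS = "<EOS>"

def next_token_examples(sentence):
    """Return one (context, target) pair per token, ending with the EOS token."""
    tokens = sentence.split() + [EOS]  # assumption: whitespace tokenization
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]

examples = next_token_examples("the cat chased the mouse")

# Print each pair in the notation used above.
for context, target in examples:
    if context:
        print(f"P({target} | {', '.join(context)})")
    else:
        print(f"P({target})")
```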

The difference between the two methods isn't very large, but notice that the latter method can learn the probability for the entire sentence: From the rules of conditional probability 

$$P(A,B) = P(A)P(B|A),$$ and more generally, $$P(A,B,C,\ldots) = P(A)P(B,C,\ldots|A).$$

Applying this recursively, we can compute 

$$P(A,B,C,\ldots) = P(A)\cdot P(B|A)\cdot   P(C|A,B)\cdot P(\ldots|A,B,C) \cdot \cdots$$

Hence, if we take a random sentence from, say, the text of the internet, we can write the probability that the sentence is equal to `"the cat chased the mouse"` as

$$P({\tt the, cat, chased, the, mouse, <\!EOS\!>}) =  
P({\tt the})\, 
P({\tt cat}|{\tt the})\,
P({\tt chased}|{\tt the, cat})\, P({\tt the}|{\tt the, cat, chased})\,
P({\tt mouse}|{\tt the, cat, chased, the})\, 
P({\tt <\!EOS\!>}|{\tt the, cat, chased, the, mouse}).$$

Notice that we can do this with any sentence, in principle of any length (in practice, computational problems mount as sentences become longer).
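
The chain-rule factorization can be checked numerically on a toy example. The sketch below estimates every conditional probability by counting prefixes in a tiny, made-up corpus (the five sentences are hypothetical, chosen only for illustration) and compares the product of the conditionals with the relative frequency of the whole sentence. Exact fractions are used so the two numbers can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction

EOS = "<EOS>"

# Hypothetical toy corpus; each sentence counts as one equally likely draw.
corpus = [
    "the cat chased the mouse",
    "the cat chased the mouse",
    "the cat caught the mouse",
    "the dog chased the cat",
    "a cat chased the mouse",
]
sentences = [tuple(s.split()) + (EOS,) for s in corpus]

# Count how often every prefix (including the empty prefix) occurs.
prefix_counts = Counter()
for tokens in sentences:
    for i in range(len(tokens) + 1):
        prefix_counts[tokens[:i]] += 1

def conditional(token, context):
    """Estimate P(token | context) as count(context + token) / count(context)."""
    context = tuple(context)
    return Fraction(prefix_counts[context + (token,)], prefix_counts[context])

def sentence_probability(tokens):
    """Chain rule: multiply P(token | all previous tokens) over the sentence."""
    prob = Fraction(1)
    for i, token in enumerate(tokens):
        prob *= conditional(token, tokens[:i])
    return prob

target = tuple("the cat chased the mouse".split()) + (EOS,)
chain_rule = sentence_probability(target)
direct = Fraction(sentences.count(target), len(sentences))
print(chain_rule, direct)  # 2/5 2/5
```

Both routes give the same number because the intermediate counts cancel in the product, which is the chain rule at work.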

Contrast this to the first method of training (the fill-in-the-blank method): This method only learns the conditional probabilities of the form 

$$P(X_0 | X_{p}, \ldots, X_{-1}, X_1, \ldots, X_q).$$

Here $p$ and $q$ are integers that specify a window around the focus token $X_0$. Usually, $p \lt 0 \le q$, but this doesn't have to be the case. For our running example this would be, for instance,

$$P(X_0 = {\tt chased} | X_{-2}={\tt the}, X_{-1}={\tt cat}, X_1={\tt the}, X_2={\tt mouse}),$$ 

$$P(X_0 = {\tt caught} | X_{-2}={\tt the}, X_{-1}={\tt cat}, X_1={\tt the}, X_2={\tt mouse}),$$ 

$$P(X_0 = {\tt killed} | X_{-2}={\tt the}, X_{-1}={\tt cat}, X_1={\tt the}, X_2={\tt mouse}),$$ 

etc.
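
For comparison, the fill-in-the-blank setup can be sketched in the same way. The snippet below blanks out each focus token $X_0$ in turn and keeps a window from $X_p$ to $X_q$ around it, here with $p=-2$ and $q=2$ as in the running example. Whitespace tokenization and the helper name `masked_examples` are assumptions made for illustration.

```python
def masked_examples(sentence, p=-2, q=2, mask="___"):
    """For each focus token X_0, return (window with X_0 blanked out, X_0)."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    examples = []
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i + p):i]   # X_p, ..., X_{-1}
        right = tokens[i + 1:i + 1 + q]  # X_1, ..., X_q
        examples.append((left + [mask] + right, focus))
    return examples

examples = masked_examples("the cat chased the mouse")
for window, focus in examples:
    print(" ".join(window), "->", focus)
```

Near the start and end of the sentence the window is simply cut short, which is one of several possible design choices.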


# Your Turn

Now that you've seen a few uses of the 🤗 HuggingFace `transformers` library for NLP, it's your turn: use it to teach a neural network to [recognize sentences with the same meaning](https://www.kaggle.com/code/datasniffer/nlp-ex1/).


---

<small>
    
_For Deep Learning in Python (2022)._
    
</small>
