# Natural Language Processing

Hugging Face is like scikit-learn: it provides many built-in functionalities. It is also modular enough to let us explore new models.

## Setup

Before anything else, make sure you have transformers installed:

    pip install datasets evaluate transformers sentencepiece

In [1]:
import transformers
transformers.__version__

'4.25.1'

In [2]:
import torch
torch.__version__

'1.13.0'

## 1. Pipeline

The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.


In [3]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

We can even pass several sentences!

In [4]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

If no `model` is given, this pipeline falls back to a default pretrained model that has been fine-tuned for sentiment analysis in English; here we name that checkpoint explicitly. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model is used and nothing is downloaded again.

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.

Some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Let’s have a look at a few of these!

### Zero shot classification

This pipeline is called `zero-shot` because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

In [5]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445981740951538, 0.11197477579116821, 0.04342705383896828]}
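
The returned dictionary keeps `labels` and `scores` aligned, and in this output they are listed from highest to lowest score. A small sketch of how one might read the result, reusing the values printed above (the variable names are only illustrative):

```python
# Output copied from the cell above; in practice this is the return value of classifier(...).
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445981740951538, 0.11197477579116821, 0.04342705383896828],
}

# Pair each candidate label with its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
best_label = max(label_scores, key=label_scores.get)

# In this output the scores add up to roughly 1.
total = sum(result["scores"])

print(best_label)       # education
print(round(total, 4))  # 1.0
```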

### Text generation

The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.

In [6]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", max_length=30, pad_token_id=0)
generator("In this course, we will teach you how to")

[{'generated_text': 'In this course, we will teach you how to create custom apps using JavaScript, create new services using Angular, and build apps that look good with an'}]

If `model` is not specified, the pipeline falls back to a default model for the task. See the Model Hub (https://huggingface.co/models) for the full list of available models.

Let’s try the `distilgpt2` model! Here’s how to load it in the same pipeline as before:

In [7]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
    pad_token_id=0
)

[{'generated_text': 'In this course, we will teach you how to practice mindfulness and become productive from the beginning!!!!!!!!!!!'},
 {'generated_text': 'In this course, we will teach you how to develop new functions and frameworks to achieve the benefits of the Python programming language with the Python 3 environment.'}]

### Mask filling

This is masked language modeling: the model predicts the most likely tokens to fill the `<mask>` slot.

In [8]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model='distilroberta-base')

# top_k controls how many candidate completions are returned.
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.1961980164051056,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052722826600075,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

### Named entity recognition

This is very similar to NER in spaCy, but with a much wider variety of models to choose from.

In [9]:
from transformers import pipeline

# aggregation_strategy="simple" tells the pipeline to regroup the parts of the sentence that correspond to the same entity:
# here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words.
ner = pipeline("ner", aggregation_strategy="simple", model='dbmdz/bert-large-cased-finetuned-conll03-english')
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
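
The regrouping can be pictured as merging neighbouring tokens that carry the same entity tag. The following is a simplified sketch of that idea, not the library's actual algorithm: the per-word tags are written by hand, and `group_entities` is a hypothetical helper (the real pipeline works on subword tokens and also aggregates scores and offsets).

```python
def group_entities(tagged_tokens):
    """Merge adjacent tokens that share the same entity tag into one group."""
    groups = []
    previous_tag = "O"  # "O" marks tokens outside any entity
    for token, tag in tagged_tokens:
        if tag != "O":
            if tag == previous_tag:
                # Same entity type as the previous token: extend the current group.
                groups[-1]["word"] += " " + token
            else:
                # A new entity starts here.
                groups.append({"entity_group": tag, "word": token})
        previous_tag = tag
    return groups


# Hand-written tags for the example sentence (an assumption for illustration).
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "ORG"), ("Face", "ORG"), ("in", "O"), ("Brooklyn", "LOC"), (".", "O"),
]

entities = group_entities(tagged)
print(entities)
```

With these tags, “Hugging” and “Face” end up in a single `ORG` group, mirroring the `entity_group` entries in the output above.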

### Question Answering

In [10]:
from transformers import pipeline

# Note: this pipeline is extractive, i.e. the answer is a span copied from the context.
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
question_answerer(
    question="Where do I work?",
    context = "My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6949770450592041, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
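
Because the pipeline is extractive, `start` and `end` are character offsets into the context, and the answer is exactly that slice. A quick check using the output printed above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Output copied from the cell above.
result = {"score": 0.6949770450592041, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slice the context with the returned character offsets.
span = context[result["start"]:result["end"]]
print(span)  # Hugging Face
```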

### Summarization

In [11]:
from transformers import pipeline

summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6',  max_length=100)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

### Translation

In [12]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## 2. A bit about transformers

### History

Here is a short history of transformers:

<img src = "../../../figures/historybert.png" width=600>

- June 2018: `GPT`, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
- October 2018: `BERT`, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)
- February 2019: `GPT-2`, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
- October 2019: `DistilBERT`, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance
- October 2019: `BART` and `T5`, two large pretrained models using the same architecture as the original Transformer model (the first to do so)
- May 2020: `GPT-3`, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)

This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

- BERT-like (also called **auto-encoding** or **encoder** Transformer models)
- GPT-like (also called **auto-regressive** or **decoder** Transformer models)
- BART/T5-like (also called **sequence-to-sequence** or **encoder-decoder** Transformer models)

### Encoder

Encoder models use only the encoder of a Transformer model. The pretraining of these models usually revolves around masked language modeling.
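
As a rough illustration of masked language modeling, a training example can be built by hiding one word and asking the model to recover it. This is a toy sketch under simplifying assumptions: `make_masked_example` is a hypothetical helper, the position is chosen by hand, and it works on whole words, whereas real pretraining masks randomly chosen subword tokens.

```python
def make_masked_example(sentence, position, mask_token="<mask>"):
    """Hide the word at `position` and return (masked sentence, hidden word)."""
    words = sentence.split()
    target = words[position]
    masked = words.copy()
    masked[position] = mask_token
    return " ".join(masked), target


masked_text, target = make_masked_example(
    "This course will teach you all about language models.", position=7
)
print(masked_text)  # This course will teach you all about <mask> models.
print(target)       # language
```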

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as **sentence classification**, **named entity recognition (and more generally word classification)**, and **extractive question answering**.

Representatives of this family of models include: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa

### Decoder

Decoder models use only the decoder of a Transformer model. The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving **text generation**.
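
To make "predicting the next word" concrete, here is a toy sketch that counts which word follows which in a tiny made-up corpus and then completes a prompt greedily, one word at a time. It only illustrates the generation loop; a real decoder model predicts the next token with a neural network conditioned on the whole preceding text, not on bigram counts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration).
corpus = (
    "we will teach you how to train models . "
    "we will teach you how to use models . "
    "we will teach you how to train models ."
).split()

# Count, for every word, which words follow it.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1


def complete(prompt, max_new_words=3):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = next_word_counts[words[-1]]
        if not candidates:
            break  # the last word was never seen with a successor
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


completion = complete("we will teach you how to")
print(completion)  # we will teach you how to train models .
```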

Representatives of this family of models include: CTRL, GPT, GPT-2, Transformer XL

### Encoder-decoder

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture.

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
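
The span-replacement objective can be sketched as follows. This is a simplified illustration: `corrupt_span` is a hypothetical helper, the span is chosen by hand rather than at random, and only a single span is replaced. The `<extra_id_0>` and `<extra_id_1>` names follow the sentinel-token convention of the T5 tokenizer.

```python
def corrupt_span(sentence, start, length):
    """Replace `length` words starting at `start` with one sentinel token.

    Returns the corrupted input and the target the model should generate.
    """
    words = sentence.split()
    span = words[start:start + length]
    corrupted = words[:start] + ["<extra_id_0>"] + words[start + length:]
    # The target repeats the sentinel, then the hidden words, then a closing sentinel.
    target = ["<extra_id_0>"] + span + ["<extra_id_1>"]
    return " ".join(corrupted), " ".join(target)


model_input, target = corrupt_span(
    "Thank you for inviting me to your party last week", start=2, length=2
)
print(model_input)  # Thank you <extra_id_0> me to your party last week
print(target)       # <extra_id_0> for inviting <extra_id_1>
```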

Sequence-to-sequence models are best suited for tasks revolving around **generating new sentences depending on a given input**, such as **summarization**, **translation**, or **generative question answering**.

Representatives of this family of models include: BART, mBART, Marian, T5
  
### Transformers are language models

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a **self-supervised fashion**.

During **pretraining**, this type of model develops a statistical understanding of the language it has been trained on, but that alone is not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called **transfer learning** or **fine-tuning**: the model is trained further in a supervised way (that is, using human-annotated labels) on a given task.

Note that fine-tuning is almost always better than training from scratch. Since the pretrained model already captures much of the basic statistical structure of the language, fine-tuning requires fewer resources and less time.

## 3. Bias and Limitations

Since a pretrained model is trained on real-world datasets, it can unintentionally pick up human biases. For example:

In [19]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model='distilroberta-base')
result = unmasker("This man works as a <mask>.")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a <mask>.")
print([r["token_str"] for r in result])

[' translator', ' consultant', ' bartender', ' waiter', ' courier']
[' waitress', ' translator', ' nurse', ' bartender', ' consultant']
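
One simple way to compare the two lists is to look at the suggestions that appear for only one of the prompts. A small sketch using the outputs printed above:

```python
# Outputs copied from the cell above (leading spaces stripped for readability).
man_predictions = [s.strip() for s in [" translator", " consultant", " bartender", " waiter", " courier"]]
woman_predictions = [s.strip() for s in [" waitress", " translator", " nurse", " bartender", " consultant"]]

# Suggestions that appear for one prompt but not the other.
only_man = sorted(set(man_predictions) - set(woman_predictions))
only_woman = sorted(set(woman_predictions) - set(man_predictions))

print(only_man)    # ['courier', 'waiter']
print(only_woman)  # ['nurse', 'waitress']
```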


## 4. Behind the pipeline

Let’s take a look at what happens behind the pipeline, so that we can build our own, more flexible version.

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model='distilbert-base-uncased-finetuned-sst-2-english')
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

This pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

<img src = "../../../figures/pipeline.png" width = 700>

### Tokenizer

The tokenizer is responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
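
These three steps can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: the vocabulary and its IDs are invented, `toy_tokenize` is a hypothetical helper, and it splits on whole words, whereas the real tokenizer uses a learned subword vocabulary.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer ships with tens of thousands of entries.
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    # Split into lowercase words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Additional inputs: wrap the sequence in special start/end tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Map each token to an integer, falling back to the unknown token.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


tokens, ids = toy_tokenize("I hate this so much!")
print(tokens)  # ['[CLS]', 'i', 'hate', 'this', 'so', 'much', '!', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```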

To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

Since the default checkpoint of the sentiment-analysis pipeline is `distilbert-base-uncased-finetuned-sst-2-english`, we run the following:

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a dictionary that’s ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument:

In [4]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")  # "pt" returns PyTorch tensors
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
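
In the output above, the second sentence is shorter, so its `input_ids` are filled with `0` up to the length of the first one, and the `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). A plain-Python sketch of that padding step (`pad_batch` is a hypothetical helper, not a library function):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the matching attention masks."""
    max_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs copied from the output above, without the padding.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

input_ids, attention_mask = pad_batch([first, second])
print(input_ids[1])
print(attention_mask[1])
```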


### Model

We can download our pretrained model the same way we did with our tokenizer. Transformers provides an `AutoModel` class which also has a `from_pretrained()` method.

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the head.

In [1]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

# The warning below is expected: AutoModel loads only the base Transformer, so the checkpoint's classification-head weights are not used.

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The vector output by the Transformer module is usually large. It generally has three dimensions:

- `Batch size`: The number of sequences processed at a time (2 in our example).
- `Sequence length`: The length of the numerical representation of the sequence (16 in our example).
- `Hidden size`: The vector dimension of each model input (768 in our example).

We can see this if we feed the inputs we preprocessed to our model:

In [11]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

#Note that the outputs of Transformers models behave like namedtuples or dictionaries. 
# You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), 
# or even by index if you know exactly where the thing you are looking for is (outputs[0]).

torch.Size([2, 16, 768])


### Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

<img src = "../../../figures/heads.png" width = 700>

The output of the Transformer model is sent directly to the model head to be processed.

In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
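
To give a feel for what a head does, here is a minimal sketch of a single linear layer that projects one hidden-state vector onto two label scores. All numbers are invented for illustration, the hidden size is 4 instead of 768, and `linear` is a hypothetical helper; the actual head of a given model may contain more layers.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for a single input vector."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical hidden state for one sentence (hidden size 4).
hidden_state = [0.5, -1.0, 2.0, 0.0]

# Hypothetical head parameters: 2 labels x 4 hidden dimensions, plus one bias per label.
weights = [
    [1.0, 0.0, -1.0, 0.5],
    [-1.0, 0.0, 1.0, 0.5],
]
bias = [0.1, -0.1]

logits = linear(hidden_state, weights, bias)
print(len(hidden_state), "->", len(logits))  # 4 -> 2
print(logits)
```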

There are many different architectures available, with each one designed around tackling a specific task. Here is a non-exhaustive list:

- `Model` (retrieve the hidden states)
- `ForCausalLM`
- `ForMaskedLM`
- `ForMultipleChoice`
- `ForQuestionAnswering`
- `ForSequenceClassification`
- `ForTokenClassification`

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [12]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [None]:
print(outputs.logits.shape)  # (batch size, number of labels): two sentences, two labels

torch.Size([2, 2])

### Postprocessing the output

In [None]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

Our model predicted [-1.5607, 1.6123] for the first sentence and [ 4.1692, -3.3464] for the second one. Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a `SoftMax` layer (all Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [13]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
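
For reference, softmax exponentiates each logit and divides by the sum of the exponentials in the same row. A plain-Python sketch, applied to the rounded logits printed earlier, gives the same probabilities up to rounding:

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and does not change the result.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied (rounded) from the output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```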

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [14]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
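
Keeping only the most likely label for each sentence gives back the same kind of result the pipeline returned at the start of this section. A short sketch using the rounded probabilities and the `id2label` mapping shown above:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Rounded probabilities from above: one row per sentence, one column per label.
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

results = []
for row in probabilities:
    # Index of the highest probability in this row.
    best_id = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best_id], "score": row[best_id]})

print(results)
# [{'label': 'POSITIVE', 'score': 0.9598}, {'label': 'NEGATIVE', 'score': 0.9995}]
```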