# Huggingface - NLP Course - Chapter 1: Transformer Models

I'll be following along with: https://huggingface.co/learn/nlp-course/chapter1/

## Demo of `pipeline()` capabilities

In [1]:
# requirements
# ---
# !pip install transformers[sentencepiece]
# !pip install sentencepiece

In [2]:
from transformers import pipeline

In [3]:
classifier = pipeline("sentiment-analysis")
classifier([
  "I've been waiting for a HuggingFace course my whole life.",
  "I hate this so much!"
])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [4]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445982336997986, 0.1119748130440712, 0.043426983058452606]}

In [5]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to develop a fully automated production model using an open source software framework, which comes with all the tools needed to efficiently work with your environment.\n\nI will show you two ways you can configure how you'}]

In [6]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use an HTML5-based app on HTML5-powered tablets. In this course, we will'},
 {'generated_text': 'In this course, we will teach you how to start DGX.'}]

In [7]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.1961977779865265,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.0405273362994194,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

In [8]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [9]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949770450592041, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [10]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [11]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]

## Bias and limitations

In [12]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


# Huggingface - NLP Course - Chapter 2: Using 🤗 Transformers
I'll be following along with: https://huggingface.co/learn/nlp-course/chapter2/

## Re-creating the `pipeline()`
We are going to replicate what this `pipeline()` call does, step by step.

In [13]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### 1- Tokenizer
`pipeline()` first pre-processes the inputs into tokens.

In [14]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


### 2- Model
The tokens are passed to a pre-trained model.

In [16]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [17]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


The output has shape `torch.Size([2, 16, 768])`, i.e. `batch_size = 2`, `sequence_length = 16`, and `hidden_size = 768`. These are hidden states rather than class scores, because `AutoModel` loads only the base model without the task-specific head.

To get the output of the head, we use `AutoModelForSequenceClassification` instead.

In [18]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [19]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


The output tensor has shape `[2, 2]` (two sentences, two labels). Its values are not yet probabilities (each row does not sum to 1), because the model returns logits.

### 3- Postprocessing
Apply `softmax()` to the logits to turn them into probabilities that sum to 1, then look up the label names in `model.config.id2label`.

In [20]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [21]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [22]:
[[f'{model.config.id2label[i]} : {pred:.2%}' for i, pred in enumerate(prediction)] for prediction in predictions]

[['NEGATIVE : 4.02%', 'POSITIVE : 95.98%'],
 ['NEGATIVE : 99.95%', 'POSITIVE : 0.05%']]
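
To make the post-processing step concrete, here is a minimal sketch of softmax in plain Python, applied to the logits printed above. The helper name `softmax` and the hard-coded `id2label` mapping are written out here for illustration only; in the notebook they come from `torch.nn.functional.softmax` and `model.config.id2label`.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Values copied from the cells above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(f"{id2label[best]} : {probs[best]:.2%}")
```

The predicted label is simply the index of the largest probability, which is also the index of the largest logit, so softmax is only needed when the scores themselves matter.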

## Model and a checkpoint
Instantiating a model with no checkpoint makes a model with random weights.

In [23]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [24]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.32.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Let's load a pre-trained model from a checkpoint instead.

In [25]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Models can be saved to disk using `.save_pretrained()`.

In [26]:
model.save_pretrained("directory_on_my_computer")

## Tokenizer
Models can only work with numbers, so text inputs are converted to numbers before being fed to the model. The 3 main tokenization algorithms are:
- Word based
- Character based
- Subword based

### Word based tokenization
Split the text into words (for example by splitting on whitespace), then map each word to a unique id.

e.g.: `'I like turtle'` -> `['I', 'like', 'turtle']` -> `[17, 494, 237]`

Each word carries a lot of context and semantic meaning, but the approach has limitations. `dog` is almost the same as `dogs`, yet the two are treated as completely unrelated tokens. We also need an id for every word in the vocabulary, which creates a very large mapping and bloats the model. Finally, any word missing from the vocabulary is treated as `out_of_vocabulary`, which loses information.

In [27]:
vocabulary = {
  '[OUT OF VOCABULARY]': 0,
  'I': 17,
  'turtle': 237,
  'like': 494,
}
text = 'I like turtle'
tokens = text.split()
print(tokens)
numerical_tokens = [vocabulary.get(t, 0) for t in tokens]
print(numerical_tokens)


['I', 'like', 'turtle']
[17, 494, 237]
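
The out-of-vocabulary limitation can be seen with the same toy vocabulary. This is only a sketch: the `encode` and `decode` helpers are hypothetical names introduced here, not part of any library.

```python
UNK_ID = 0
vocabulary = {"[OUT OF VOCABULARY]": UNK_ID, "I": 17, "turtle": 237, "like": 494}

def encode(text):
    """Whitespace-split the text and map unknown words to UNK_ID."""
    return [vocabulary.get(token, UNK_ID) for token in text.split()]

def decode(ids):
    """Map ids back to their tokens."""
    id_to_token = {i: token for token, i in vocabulary.items()}
    return " ".join(id_to_token[i] for i in ids)

ids = encode("I like turtles")
print(ids)          # [17, 494, 0]
print(decode(ids))  # I like [OUT OF VOCABULARY]
```

`turtles` differs from `turtle` by a single letter, but the round trip cannot recover it.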


### Character based tokenization
Split text into characters.

e.g.: `'I like turtle'` -> `['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'u', 'r', 't', 'l', 'e']` -> `[73, 32, 108, 105, 107, 101, 32, 116, 117, 114, 116, 108, 101]`

The vocabulary is much smaller and covers every word that can be spelled with your charset. However, each token carries less meaning and holds less information.

In [28]:
vocabulary = {c: i for i, c in enumerate([chr(n) for n in range(255)])}
text = 'I like turtle'
tokens = list(text)
print(tokens)
numerical_tokens = [vocabulary.get(t, 0) for t in tokens]
print(numerical_tokens)

['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'u', 'r', 't', 'l', 'e']
[73, 32, 108, 105, 107, 101, 32, 116, 117, 114, 116, 108, 101]


### Subword based tokenization
It's a compromise between word based and character based tokenization.

Frequent words are not split, while rare words are decomposed into meaningful subwords.

e.g.:
- `dog` -> `dog\w`
- `dogs` -> [`dog`, `s\w`]
- `tokenization` -> [`token`, `##ization`]
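
As a rough illustration of how a word can be decomposed, here is a sketch of greedy longest-match splitting in the WordPiece style (continuation pieces prefixed with `##`). The tiny `toy_vocab` and the function name `wordpiece_tokenize` are assumptions made up for this example; real tokenizers learn their vocabulary from data and handle many more cases.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"dog", "token", "##s", "##ization"}

print(wordpiece_tokenize("dog", toy_vocab))           # ['dog']
print(wordpiece_tokenize("dogs", toy_vocab))          # ['dog', '##s']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
```

With this scheme `dog` and `dogs` share the `dog` piece, which is the information a pure word based tokenizer throws away.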

In [29]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [30]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [31]:
sequence = "Using a Transformer network is simple"

# tokenize step by step
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

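# or do both steps at once (note: this also adds the special tokens [CLS] and [SEP])
tokenizer(sequence)

# decoding (these hard-coded ids differ from `ids` above: 11303, 1200 instead of 13809, 23763,
# which is why the decoded text below has a lowercase "transformer")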
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
[7993, 170, 13809, 23763, 2443, 1110, 3014]
Using a transformer network is simple


## Tokenizing multiple inputs at once
When tokenizing multiple inputs, all sequences must end up the same length because tensors need regular dimensions, so shorter inputs are padded up to a common size. Alongside the padded ids we pass an attention mask so that the attention layers ignore the padding tokens.

In [32]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


For batched requests we need to pad the inputs to be the same size.

In [33]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
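
The padded ids and the attention mask above were written by hand. A minimal sketch of building both from ragged lists of ids, in plain Python, could look like the following; `pad_batch` is a hypothetical helper, and `pad_token_id=0` is an assumption that matches the padding value visible in the tokenizer output earlier.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]])
print(batched_ids)     # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

In practice `tokenizer(..., padding=True)` does this for you, as shown in the Tokenizer step earlier.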


# Huggingface - NLP Course - Chapter 3: Fine-Tuning a Pretrained Model
I'll be following along with: https://huggingface.co/learn/nlp-course/chapter3/

In [34]:
# !pip install datasets evaluate transformers[torch] scipy scikit-learn

## Dataset
The dataset is stored on disk using [Apache Arrow](https://arrow.apache.org/docs/python/dataset.html). Only the requested rows are loaded into memory, so we can manipulate big datasets without running out of memory (OOM).

## Load the dataset
Here we are using a dataset coming from a [Microsoft paper](https://aclanthology.org/I05-5002.pdf). It provides a corpus of "Sentential Paraphrases" (i.e. pairs of equivalent sentences).

Additional datasets can be found at: https://huggingface.co/datasets

In [35]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [36]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

`.features['label']` provides the correspondence between the label id and its human-readable meaning. Here `id 0 = 'not_equivalent'` and `id 1 = 'equivalent'`.

In [37]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

## Tokenize the entries

### With fixed padding
This is the easiest solution: pad every entry in the dataset to a fixed size (here 128). Fixed shapes are fast on TPU.

Note: Dynamic padding can provide a speedup on CPU/GPU for the batches with only smaller entries.

In [38]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

### With dynamic padding

In [39]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
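
To get a feel for why dynamic padding can help on CPU/GPU, the sketch below counts how many token slots each strategy processes. The sequence lengths are made-up numbers for illustration, and `padded_token_count` is a hypothetical helper, not part of the library.

```python
def padded_token_count(lengths, batch_size, fixed_length=None):
    """Count token slots (real tokens + padding) over all batches."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        # Fixed padding uses the same width everywhere; dynamic padding
        # only pads up to the longest sequence in the current batch.
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total

# Made-up token counts for 8 examples, batched by 4.
lengths = [12, 40, 7, 128, 33, 20, 15, 9]

fixed = padded_token_count(lengths, batch_size=4, fixed_length=128)
dynamic = padded_token_count(lengths, batch_size=4)
print(fixed, dynamic)  # 1024 644
```

The saving depends entirely on how the lengths are distributed across batches: a single long example forces its whole batch to that width.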

### Clean up the data
Remove the unused columns and rename `label` to `labels`, which is the argument name the Hugging Face model expects.

In [40]:
tokenized_datasets = tokenized_datasets.remove_columns(['idx', 'sentence1', 'sentence2'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets = tokenized_datasets.with_format('torch')
tokenized_datasets['train']

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [41]:
# Note: we can get a smaller subset of the data by using `.select()`
# small_train_dataset = tokenized_datasets['train'].select(range(100))

## Train

In [42]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification
from transformers import Trainer

training_args = TrainingArguments("test-trainer")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.5974
1000,0.4761


TrainOutput(global_step=1377, training_loss=0.490657354838275, metrics={'train_runtime': 81.1125, 'train_samples_per_second': 135.663, 'train_steps_per_second': 16.976, 'total_flos': 405324636337200.0, 'train_loss': 0.490657354838275, 'epoch': 3.0})

### Evaluate the result

In [44]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [45]:
import numpy as np
import evaluate

preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8308823529411765, 'f1': 0.881646655231561}
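
The two numbers reported above are accuracy and F1. As a sketch of what they measure, here they are computed by hand in plain Python on a tiny made-up example, treating label 1 (`equivalent`) as the positive class. The function names are hypothetical, and this is not the `evaluate` implementation.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def f1_score(predictions, references, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels, for illustration only.
preds = [1, 0, 1, 1, 0, 1]
refs = [1, 0, 0, 1, 1, 1]

print(accuracy(preds, refs))  # 4 correct out of 6
print(f1_score(preds, refs))  # 0.75
```

F1 ignores true negatives, which is why it can be noticeably higher than accuracy when most examples are positive.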

### Train + Evaluate

In [46]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.343839,0.852941,0.896552
2,0.492500,0.60714,0.848039,0.89701
3,0.289700,0.715746,0.857843,0.899306


TrainOutput(global_step=1377, training_loss=0.32119023773924715, metrics={'train_runtime': 80.515, 'train_samples_per_second': 136.67, 'train_steps_per_second': 17.102, 'total_flos': 405540469624800.0, 'train_loss': 0.32119023773924715, 'epoch': 3.0})

## Train by hand with PyTorch

In [47]:
from accelerate import Accelerator
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]
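
The `"linear"` schedule used above decays the learning rate over the course of training. The sketch below reproduces that behaviour in plain Python, assuming the usual definition (linear warmup, then linear decay to 0); `linear_schedule_lr` is a hypothetical helper written for this note, not the library function.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        decay_steps = max(1, num_training_steps - num_warmup_steps)
        factor = max(0.0, remaining / decay_steps)
    return base_lr * factor

# Same settings as the training loop above: lr=3e-5, 1377 steps, no warmup.
num_training_steps = 1377
for step in (0, 459, 918, 1377):
    print(step, linear_schedule_lr(step, 3e-5, num_training_steps))
```

With no warmup the rate starts at its full value and reaches 0 on the last step, so later batches make progressively smaller updates.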

# Huggingface - NLP Course - Chapter 4: Sharing Models and Tokenizers
I'll be following along with: https://huggingface.co/learn/nlp-course/chapter4/

Browse https://huggingface.co/models to find models; you can filter by task, training set, or backbone. Use the widget on a model's page to test it directly from the browser.

# Huggingface - NLP Course - Chapter 5: The 🤗 Datasets Library
I'll be following along with: https://huggingface.co/learn/nlp-course/chapter5/

## Load a dataset from outside the 🤗 ecosystem

In [48]:
# !wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
# !wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

### Read from File

In [49]:
from datasets import load_dataset

# assumes the .gz archives downloaded above were decompressed first (e.g. with `gzip -dk`)
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

### Read from Compressed File

In [50]:
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

### Read from URL

In [51]:
from datasets import load_dataset

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

## Slice and Dice the data

### Shuffle the data
This prevents the model from learning an artificial ordering of the data.

In [52]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
squad_shuffled = squad.shuffle(seed=666)

print([s['id'] for s in squad.select(range(5))])
print([s['id'] for s in squad_shuffled.select(range(5))])

['5733be284776f41900661182', '5733be284776f4190066117f', '5733be284776f41900661180', '5733be284776f41900661181', '5733be284776f4190066117e']
['5727cc873acd2414000deca9', '5730b096396df919000962a0', '5729125daf94a219006aa029', '5727e9e0ff5b5019007d9852', '5731cca5e17f3d140042240a']


### Create a random train/test split

In [53]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
dataset = squad.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 78839
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8760
    })
})
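
The split sizes above add up to 87,599 rows. Here is a sketch of how a random 90/10 split can be produced in plain Python. The rounding rule (test size rounded up, the rest goes to train) is an assumption that happens to reproduce the numbers printed above, and both helper names are hypothetical.

```python
import math
import random

def split_sizes(num_rows, test_size):
    """Number of train and test rows for a fractional test_size."""
    n_test = math.ceil(num_rows * test_size)  # assumed rounding rule
    return num_rows - n_test, n_test

def train_test_indices(num_rows, test_size, seed=0):
    """Shuffle the row indices, then cut them into train and test parts."""
    n_train, _ = split_sizes(num_rows, test_size)
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

print(split_sizes(87599, 0.1))  # (78839, 8760)

train_idx, test_idx = train_test_indices(20, 0.1, seed=666)
print(len(train_idx), len(test_idx))  # 18 2
```

Fixing the seed makes the split reproducible, which matters when comparing runs.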

### Select specific rows from the dataset

In [54]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
squad_filtered = squad.filter(lambda x: x['title'].startswith('L'))
print([s['title'] for s in squad_filtered.shuffle().select(range(5))])

['List_of_numbered_streets_in_Manhattan', 'LaserDisc', 'List_of_numbered_streets_in_Manhattan', 'Light-emitting_diode', 'LaserDisc']


### Flatten

In [55]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
print(squad.features.keys())
squad_flattened = squad.flatten()
print(squad_flattened.features.keys())

dict_keys(['id', 'title', 'context', 'question', 'answers'])
dict_keys(['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'])
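
Flattening turns nested fields into top-level columns with dotted names. A sketch of the same idea on a single record, in plain Python, might look like this; `flatten_record` is a hypothetical helper and the example record is made up.

```python
def flatten_record(record, parent_key="", sep="."):
    """Recursively lift nested dict fields to dotted top-level keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Made-up record shaped like a SQuAD row.
example = {
    "id": "abc",
    "question": "Where do I work?",
    "answers": {"text": ["Hugging Face"], "answer_start": [33]},
}

print(list(flatten_record(example).keys()))
# ['id', 'question', 'answers.text', 'answers.answer_start']
```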


### Convert Dataset to Pandas

In [56]:
from datasets import load_dataset

dataset = load_dataset('swiss_judgment_prediction', 'all', split='train')
dataset.set_format('pandas')
dataset[0]

Unnamed: 0,id,year,text,label,language,region,canton,legal area,source_language
0,2,2000,A.- Der 1955 geborene V._ war seit 1. Septembe...,0,de,Zürich,zh,insurance law,


In [57]:
df = dataset[:]
df.head()

Unnamed: 0,id,year,text,label,language,region,canton,legal area,source_language
0,2,2000,A.- Der 1955 geborene V._ war seit 1. Septembe...,0,de,Zürich,zh,insurance law,
1,3,2000,"Ansprüche nach OHG, hat sich ergeben: A.- X._ ...",1,de,Central Switzerland,lu,public law,
2,4,2000,Art. 4 aBV (Strafverfahren wegen falschen Zeug...,0,de,Northwestern Switzerland,ag,public law,
3,5,2000,"Art. 5 Ziff. 1 EMRK (Haftentlassung), hat sich...",1,de,,,public law,
4,6,2000,"Mietvertrag, hat sich ergeben: A.- Die CT Cond...",0,de,,,civil law,


In [58]:
df.groupby('region')['language'].value_counts()

region                    language
Central Switzerland       de           4778
                          it              1
Eastern Switzerland       de           5650
                          it             57
Espace Mittelland         de           5150
                          fr           3104
                          it              3
Federation                de           1011
                          fr            227
                          it             70
Northwestern Switzerland  de           5654
                          fr              1
Région lémanique          fr          13100
                          de            336
Ticino                    it           2249
                          de              6
Zürich                    de           8785
                          fr              3
n/a                       fr           4744
                          de           4088
                          it            692
Name: count, dtype: int64

In [59]:
df['legal area'].value_counts()

legal area
public law       15173
penal law        11795
civil law        11477
insurance law    11142
social law        9727
other              395
Name: count, dtype: int64

In [60]:
# when done, get the format back to default
dataset.reset_format()

### Saving/Reloading a dataset

In [61]:
from datasets import load_dataset

ds = load_dataset('allocine')
# print(ds.cache_files)

#### Apache Arrow
The default format; efficient to save and reload.

In [62]:
ds.save_to_disk('local-allocine')

Saving the dataset (0/1 shards):   0%|          | 0/160000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/20000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/20000 [00:00<?, ? examples/s]

In [63]:
from datasets import load_from_disk

arrow_ds = load_from_disk('local-allocine')

#### CSV

In [64]:
for split, dataset in ds.items():
  dataset.to_csv(f'local-allocine-{split}.csv', index=None)

Creating CSV from Arrow format:   0%|          | 0/160 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

In [65]:
data_files = {
  'train': 'local-allocine-train.csv',
  'validation': 'local-allocine-validation.csv',
  'test': 'local-allocine-test.csv',
}

csv_ds = load_dataset('csv', data_files=data_files)

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

#### JSON

In [66]:
for split, dataset in ds.items():
  dataset.to_json(f'local-allocine-{split}.jsonl', index=None)

Creating json from Arrow format:   0%|          | 0/160 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

In [67]:
data_files = {
  'train': 'local-allocine-train.jsonl',
  'validation': 'local-allocine-validation.jsonl',
  'test': 'local-allocine-test.jsonl',
}

json_ds = load_dataset('json', data_files=data_files)

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

#### Parquet
For long-term, size-efficient storage.

In [68]:
for split, dataset in ds.items():
  dataset.to_parquet(f'local-allocine-{split}.parquet')

Creating parquet from Arrow format:   0%|          | 0/160 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

In [69]:
data_files = {
  'train': 'local-allocine-train.parquet',
  'validation': 'local-allocine-validation.parquet',
  'test': 'local-allocine-test.parquet',
}

parquet_ds = load_dataset('parquet', data_files=data_files)

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

## Semantic Search with FAISS
Using an encoder (e.g. BERT) we can convert an input sentence into an "embedding" (i.e. a vector of floating-point numbers). We can measure how similar two embeddings are by comparing their directions, e.g. with the dot product or the angle between them (a smaller angle means more similar).

The example takes the comments of GitHub issues, keeps those longer than 15 words, concatenates each comment with its issue's title and body into one string, and computes the embedding of that string.
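
To make the "angle between embeddings" idea concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up illustration values, not real model outputs (real embeddings here have 768 dimensions), and `cosine_similarity` is a hypothetical helper, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, for illustration only)
query = [0.2, 0.9, 0.1]
doc_close = [0.25, 0.8, 0.05]   # points in almost the same direction as the query
doc_far = [0.9, -0.1, 0.4]      # points in a very different direction

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

The first value is close to 1 and the second is much smaller, so `doc_close` would be ranked as the better match.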

In [70]:
# !pip install faiss-cpu  # or faiss-gpu on a machine with CUDA

In [71]:
from datasets import load_dataset

issues_dataset = load_dataset('lewtun/github-issues', split='train')
issues_dataset = issues_dataset.filter(lambda x: (x['is_pull_request'] == False and len(x['comments']) > 0))
columns = issues_dataset.column_names
columns_to_keep = ['title', 'body', 'html_url', 'comments']
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset.set_format('pandas')
df = issues_dataset[:]
comments_df = df.explode('comments', ignore_index=True)
comments_df.head(4)

Repo card metadata block was not found. Setting CardData to empty.


Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...


In [72]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
# keep comments with more than 15 words
comments_dataset = comments_dataset.map(lambda x: {'comment_length': len(x['comments'].split())})
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)

def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }
comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

In [73]:
from transformers import AutoTokenizer, AutoModel
import torch

model_ckpt = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# device = torch.device('cuda')
# windows doesn't have faiss-gpu ...
device = torch.device('cpu')

model.to(device)

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors='pt')
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

embedding = get_embeddings(comments_dataset['text'][0])
embedding.shape

torch.Size([1, 768])

In [74]:
# truncate the dataset because CPU takes 35min to compute the embeddings otherwise
comments_dataset = comments_dataset.select(range(100))
embeddings_dataset = comments_dataset.map(
    lambda x: {'embeddings': get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [75]:
embeddings_dataset.add_faiss_index(column='embeddings')

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 100
})

In [76]:
question = 'How can I load a dataset offline?'
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [80]:
import pandas as pd

scores, samples = embeddings_dataset.get_nearest_examples('embeddings', question_embedding, k=5)
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [81]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Thanks for reporting ! #2852 fixed this error

We'll do a new release of `datasets` soon :)
SCORE: 43.938785552978516
TITLE: Cannot load linnaeus dataset
URL: https://github.com/huggingface/datasets/issues/2821

COMMENT: > * For the platform, we need to know the operating system of your machine. Could you please run the command `datasets-cli env` and copy-and-paste its output below?
> * In relation with the error, you just gave us the error type and message (`TypeError: 'NoneType' object is not callable`). Could you please copy-paste the complete stack trace, so that we know exactly which part of the code threw the error?

1. For the platform, here are the output:
        - datasets` version: 1.11.0
        - Platform: Windows-10-10.0.19041-SP0
        - Python version: 3.7.10
        - PyArrow version: 5.0.0
2. For the code and error：
     ```python
     from datasets import load_dataset, load_metric
     dataset = load_dataset("glue", "cola")
    ```
    ```python
    Trac

# Huggingface - NLP Course - Chapter 6: The 🤗 Tokenizers Library
I'll be following along with: https://huggingface.co/learn/nlp-course/chapter6/

Training a new tokenizer is worthwhile when your corpus differs from the one the existing tokenizer was trained on: a different language, character set, domain (e.g. medical) or style (e.g. Old English).

## Train a Tokenizer for Python code

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset('code_search_net', 'python')

In [3]:
from transformers import AutoTokenizer

def get_training_corpus():
  dataset = raw_datasets['train']
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx: start_idx + 1000]
    yield samples['whole_func_string']

training_corpus = get_training_corpus()
old_tokenizer = AutoTokenizer.from_pretrained('gpt2')
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000) # 52000 is the vocabulary's size
new_tokenizer.save_pretrained('code-search-net-tokenizer')
# new_tokenizer.push_to_hub('code-search-net-tokenizer')

('code-search-net-tokenizer\\tokenizer_config.json',
 'code-search-net-tokenizer\\special_tokens_map.json',
 'code-search-net-tokenizer\\vocab.json',
 'code-search-net-tokenizer\\merges.txt',
 'code-search-net-tokenizer\\added_tokens.json',
 'code-search-net-tokenizer\\tokenizer.json')

### Before and after comparison

In [6]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [7]:
tokens = new_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


## Recreating `aggregation_strategy='simple'`

In [27]:
example = 'My name is Sylvain and I work at Hugging Face in Brooklyn.'
checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"

### Goal
Given the raw per-token predictions, group the tokens that belong to the same entity together.

For example, `['S', '##yl', '##va', '##in']` -> `['Sylvain']`.

In [28]:
from transformers import pipeline

token_classifier = pipeline('token-classification', model=checkpoint, tokenizer=checkpoint)
tokens = token_classifier(example)
print(f'{example=}\n[' + '\n'.join(str(t) for t in tokens) + ']')

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


example='My name is Sylvain and I work at Hugging Face in Brooklyn.'
[{'entity': 'I-PER', 'score': 0.99938285, 'index': 4, 'word': 'S', 'start': 11, 'end': 12}
{'entity': 'I-PER', 'score': 0.99815494, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14}
{'entity': 'I-PER', 'score': 0.99590707, 'index': 6, 'word': '##va', 'start': 14, 'end': 16}
{'entity': 'I-PER', 'score': 0.99923277, 'index': 7, 'word': '##in', 'start': 16, 'end': 18}
{'entity': 'I-ORG', 'score': 0.9738931, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35}
{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40}
{'entity': 'I-ORG', 'score': 0.9887976, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45}
{'entity': 'I-LOC', 'score': 0.9932106, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


In [29]:
from transformers import pipeline

token_classifier = pipeline('token-classification', model=checkpoint, tokenizer=checkpoint, aggregation_strategy='simple')
tokens = token_classifier(example)
print(f'{example=}\n[' + '\n'.join(str(t) for t in tokens) + ']')



example='My name is Sylvain and I work at Hugging Face in Brooklyn.'
[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


### Manual-ish re-implementation

In [37]:
from transformers import pipeline
import numpy as np

results = []
token_classifier = pipeline('token-classification', model=checkpoint, tokenizer=checkpoint)
tokens = token_classifier(example)

idx = 0
while idx < len(tokens):
    # Remove the B- or I- prefix
    label = tokens[idx]['entity'][2:]
    start = tokens[idx]['start']
    end = tokens[idx]['end']
    # Always consume the first token of the group, so a B- tag cannot stall the loop
    all_scores = [tokens[idx]['score']]
    last_index = tokens[idx]['index']
    idx += 1

    # Grab the directly following tokens labeled I-label
    # (the pipeline drops 'O' tokens, so check that the token indices are contiguous)
    while (
        idx < len(tokens)
        and tokens[idx]['entity'] == f'I-{label}'
        and tokens[idx]['index'] == last_index + 1
    ):
        all_scores.append(tokens[idx]['score'])
        end = tokens[idx]['end']
        last_index = tokens[idx]['index']
        idx += 1

    # The score is the mean of all the scores of the tokens in that grouped entity
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )

print(f'{example=}\n[' + '\n'.join(str(t) for t in results) + ']')



example='My name is Sylvain and I work at Hugging Face in Brooklyn.'
[{'entity_group': 'PER', 'score': 0.9981694221496582, 'word': 'Sylvain', 'start': 11, 'end': 18}
{'entity_group': 'ORG', 'score': 0.9796019196510315, 'word': 'Hugging Face', 'start': 33, 'end': 45}
{'entity_group': 'LOC', 'score': 0.9932106137275696, 'word': 'Brooklyn', 'start': 49, 'end': 57}]


## Find the Answer to a Question in a Context

### Goal

In [38]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9802601933479309,
 'start': 78,
 'end': 106,
 'answer': 'Jax, PyTorch, and TensorFlow'}

### Manual-ish re-implementation

In [41]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
# start_logits scores each token as the first token of the answer, end_logits as its last token.
print(start_logits.shape, end_logits.shape)

torch.Size([1, 67]) torch.Size([1, 67])


In [45]:
import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
start_probabilities

tensor([4.4531e-07, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 8.1185e-06, 1.3470e-05,
        2.4369e-07, 2.1236e-06, 1.3220e-06, 3.7722e-04, 6.9220e-03, 1.0237e-05,
        4.3289e-06, 1.5143e-05, 3.2464e-07, 4.1933e-06, 1.6808e-04, 9.9179e-01,
        8.6288e-06, 3.8557e-04, 5.9956e-06, 4.3725e-06, 5.8977e-07, 3.0929e-06,
        3.8999e-06, 2.9493e-06, 2.1940e-04, 5.4713e-06, 7.1354e-06, 2.3212e-05,
        5.2711e-06, 4.7788e-07, 2.4292e-07, 4.4467e-07, 1.4879e-08, 4.8133e-08,
        3.7169e-07, 7.1242e-08, 3.1735e-07, 2.2365e-07, 1.3685e-06, 2.4093e-08,
        1.1470e-08, 4.4891e-07, 2.2828e-08, 5.2562e-07, 5.8093e-07, 1.6419e-06,
        1.4114e-08, 2.0591e-07, 2.0161e-08, 2.5390e-07, 2.3251e-08, 1.4667e-08,
        5.4533e-08, 2.4235e-08, 5.5390e-09, 1.8524e-08, 3.6818e-08, 3.4721e-08,
        0.0000e+00], grad_fn=<SelectBackward0>)
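
The effect of filling masked positions with `-10000` before the softmax can be illustrated without a model. This is a small sketch in plain Python; `masked_softmax` and the four toy logits are assumptions made for illustration, not part of the notebook's pipeline.

```python
import math

def masked_softmax(logits, mask, fill=-10000.0):
    """Softmax where positions with mask=True are replaced by a very low logit first."""
    masked = [fill if m else x for x, m in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: positions 1 and 2 play the role of question tokens and are masked
logits = [2.0, 5.0, 1.0, 3.0]
mask = [False, True, True, False]
probs = masked_softmax(logits, mask)
print(probs)
```

Even though position 1 had the largest raw logit, it receives zero probability after masking, and the remaining probability mass is shared by the unmasked positions.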


#### Enforce an invariant
We want to make sure that `start_idx <= end_idx`. So we compute the probability of every `(start, end)` pair, zero out the pairs where `start_idx > end_idx`, and pick the pair with the highest score.

In [57]:
# Compute all combinations
scores = start_probabilities[:, None] * end_probabilities[None, :]
# Zero out the lower triangle, where start > end
scores = torch.triu(scores)

# Get the index of the winner
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(f'{start_index=} {end_index=} {scores[start_index, end_index]=}')

start_index=23 end_index=35 scores[start_index, end_index]=tensor(0.9803, grad_fn=<SelectBackward0>)
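
The same selection can be written as an explicit double loop, which may make the role of the upper triangle easier to see. This is a sketch with made-up probabilities; `best_span` is a hypothetical helper, and the loop only visits pairs with `end >= start`, which mirrors what `torch.triu` keeps.

```python
def best_span(start_probs, end_probs):
    """Return (start, end, score) maximising start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, -1.0)
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):  # only the upper triangle: e >= s
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy probabilities (assumed values): the unconstrained best pair would be
# start=2, end=0, which is not a valid span.
start_probs = [0.1, 0.2, 0.6, 0.1]
end_probs = [0.5, 0.1, 0.1, 0.3]

start, end, score = best_span(start_probs, end_probs)
print(start, end, score)

# A flat argmax index over an n x n matrix maps back to (row, col) with divmod
n = len(end_probs)
flat_index = start * n + end
print(divmod(flat_index, n))
```

The `divmod` call at the end is the same arithmetic as `max_index // scores.shape[1]` and `max_index % scores.shape[1]`.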


In [60]:
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)

{'answer': 'Jax, PyTorch, and TensorFlow', 'start': 78, 'end': 106, 'score': tensor(0.9803, grad_fn=<SelectBackward0>)}


## Handling long context
If the question plus the context has more tokens than the model accepts, we can split the context into chunks and query the model once per chunk. To keep the answer from being cut in two at a chunk boundary, we generate overlapping chunks.
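
As a rough sketch of the chunking idea, here is a sliding window over a list of tokens in plain Python. `chunk_with_overlap` is a hypothetical helper and the sentence is split on whitespace for simplicity; the real tokenizer also reserves room for special tokens and, in the question answering case, for the question.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks

tokens = "This sentence is not too long but we are going to split it anyway .".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two tokens of the previous one, so a short answer that straddles a boundary still appears whole in at least one chunk.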

### Goal

In [61]:
long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)

{'score': 0.9714871048927307,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

### Manual-ish re-implementation

#### Small examples for Intuition

In [63]:
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]


In [90]:
import torch

starts = torch.tensor([1, 2, 3, 4])
ends = torch.tensor([5, 6, 7, 8])
print(f'{starts=} reshaped as a column {starts[:, None]}')

# Compute the outer product
# [[1],                        [[1 * 5,   1 * 6,   1 * 7,   1 * 8],
#  [2],  x  [[5, 6, 7, 8]]  =   [2 * 5,   2 * 6,   2 * 7,   2 * 8],
#  [3],                         [3 * 5,   3 * 6,   3 * 7,   3 * 8],
#  [4]]                         [4 * 5,   4 * 6,   4 * 7,   4 * 8]]
cross = starts[:, None] * ends[None, :]
print(f'cross product:\n{cross}')

# Zero out the bottom triangle
cross = torch.triu(cross)
print(f'upper triangle:\n{cross}')

starts=tensor([1, 2, 3, 4]) reshaped as a column tensor([[1],
        [2],
        [3],
        [4]])
cross product:
tensor([[ 5,  6,  7,  8],
        [10, 12, 14, 16],
        [15, 18, 21, 24],
        [20, 24, 28, 32]])
upper triangle:
tensor([[ 5,  6,  7,  8],
        [ 0, 12, 14, 16],
        [ 0,  0, 21, 24],
        [ 0,  0,  0, 32]])


#### Actual Code

In [64]:
inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

torch.Size([2, 384])


In [65]:
outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

torch.Size([2, 384]) torch.Size([2, 384])


In [66]:
sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

In [67]:
candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

[(0, 18, 0.33866992592811584), (173, 184, 0.9714871048927307)]


In [68]:
for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33866992592811584}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.9714871048927307}


## Normalization

Clean up the raw input before tokenization so that equivalent strings map to the same tokens. Common normalization steps include removing accents, font variants and repeated whitespace, lowercasing, and Unicode normalization (e.g. NFC, NFD, NFKC, NFKD).

Note: Normalization risks altering the original meaning (e.g. `un père indigné` -> `un pere indigne`).
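
To see what such a normalizer does under the hood, here is a minimal sketch using only Python's standard `unicodedata` module. The `normalize` function is a hypothetical helper: it applies NFD decomposition, drops combining marks (Unicode category `Mn`) and lowercases, which approximates, but is not guaranteed to match, what a given tokenizer's normalizer does.

```python
import unicodedata

def normalize(text):
    """NFD-decompose, strip combining accent marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)  # 'é' becomes 'e' + combining acute accent
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()

print(normalize("Héllò hôw are ü?"))
print(normalize("un père indigné"))
```

The second call shows the risk mentioned in the note: once the accents are gone, the two French phrases can no longer be told apart.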

In [93]:
from transformers import AutoTokenizer

original = "Héllò hôw are ü?"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
normalized = tokenizer.backend_tokenizer.normalizer.normalize_str(original)
print(f'{original=} {normalized=}')

original='Héllò hôw are ü?' normalized='hello how are u?'


## Pre-Tokenizer
Split the input into chunks ("pre-tokens") based on some rules, e.g. split on whitespace or punctuation, and either discard the whitespace or replace it with a marker character.

Note: Some pre-tokenizers are lossy (e.g. BERT's discards whitespace, so the original input cannot be reconstructed from the pre-tokens alone).
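
A whitespace-and-punctuation splitter that also tracks character offsets can be sketched with a regular expression. `pre_tokenize` is a hypothetical helper written for illustration; it is not the implementation used by any of the tokenizers in this section.

```python
import re

def pre_tokenize(text):
    """Split into runs of word characters and single punctuation marks, keeping (start, end) offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

print(pre_tokenize("Hello, how are  you?"))
```

Whitespace never ends up in a pre-token here, but the gap in the offsets between `are` and `you` still records that two spaces were present.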

In [98]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('how', (7, 10)),
 ('are', (11, 14)),
 ('you', (16, 19)),
 ('?', (19, 20))]

In [95]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('Hello', (0, 5)),
 (',', (5, 6)),
 ('Ġhow', (6, 10)),
 ('Ġare', (10, 14)),
 ('Ġ', (14, 15)),
 ('Ġyou', (15, 19)),
 ('?', (19, 20))]

In [97]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")

[('▁Hello,', (0, 6)),
 ('▁how', (7, 10)),
 ('▁are', (11, 14)),
 ('▁you?', (16, 20))]

## Popular Tokenizer Algorithms

### BPE
Start from individual characters and count how often each pair of adjacent tokens occurs in the corpus. Merge the most frequent pair into a new token, replace its occurrences with the merged token, and repeat until the vocabulary reaches the desired size.
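
The merge loop can be sketched in a few lines of plain Python. The toy word frequencies and the helper names (`count_pairs`, `merge_pair`, `train_bpe`) are assumptions made for illustration; a real BPE trainer works on bytes or pre-tokenized words and stops at a target vocabulary size rather than a fixed number of merges.

```python
from collections import Counter

def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, splits):
    """Replace every occurrence of `pair` in every word with the merged symbol."""
    left, right = pair
    for word, symbols in list(splits.items()):
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

def train_bpe(word_freqs, num_merges):
    splits = {word: list(word) for word in word_freqs}  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(splits, word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merge_pair(best, splits)
        merges.append(best)
    return merges, splits

# Toy corpus: word -> number of occurrences (assumed values)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)
print(splits)
```

On this toy corpus the first pair to be merged is `('u', 'g')`, because it occurs 20 times, more than any other pair.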

### WordPiece
Start from individual characters, adding a `##` prefix to those that do not start a word (e.g. `hug` -> `['h', '##u', '##g']`). As in BPE, list all pairs of adjacent tokens, but instead of merging the most frequent pair, compute a score for each pair and merge the one with the highest score:

$score = \frac{freq\_of\_pair}{freq\_of\_first\_elem \times freq\_of\_second\_elem}$
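
To get a feel for how this score differs from raw pair frequency, here is a small sketch that computes it on a toy corpus. The word frequencies and the helpers `wordpiece_split` and `pair_scores` are assumptions made for illustration only.

```python
from collections import Counter

def wordpiece_split(word):
    """'hug' -> ['h', '##u', '##g']"""
    return [word[0]] + ["##" + ch for ch in word[1:]]

def pair_scores(word_freqs):
    """score(pair) = freq(pair) / (freq(first element) * freq(second element))"""
    elem_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        symbols = wordpiece_split(word)
        for symbol in symbols:
            elem_freq[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_freq[(left, right)] += freq
    return {
        pair: freq / (elem_freq[pair[0]] * elem_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

# Toy corpus: word -> number of occurrences (assumed values)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores = pair_scores(word_freqs)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

The pair `('##u', '##g')` is the most frequent one, but `('##g', '##s')` gets the highest score because `##s` never appears outside that pair: the score favours pairs whose parts are rare on their own.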

### Unigram
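Start with a very large vocabulary and prune it until it reaches the desired size. At each step, compute for every token how much the unigram loss over the corpus would increase if that token were removed, and prune the tokens that increase it the least.

$loss = \sum_{word} freq(word) \times (-\log(P(word)))$

Here $P(word)$ is the probability of the most likely segmentation of the word, which can be found efficiently with the [Viterbi Algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm).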
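
A minimal sketch of the Viterbi search for the most likely segmentation of one word follows. The token frequencies are toy values and `viterbi_segment` is a hypothetical helper; it returns the best token sequence together with its negative log-likelihood, which is the per-word term of the loss.

```python
import math

def viterbi_segment(word, token_freqs):
    """Return (tokens, negative log-likelihood) of the most likely segmentation, or (None, inf)."""
    total = sum(token_freqs.values())
    # best[i] = (lowest cost to segment word[:i], start index of the last token used)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in token_freqs and best[start][0] < math.inf:
                cost = best[start][0] - math.log(token_freqs[token] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == math.inf:
        return None, math.inf  # the word cannot be built from the vocabulary
    # Walk the back-pointers from the end of the word to recover the tokens
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[-1][0]

# Toy vocabulary: token -> frequency (assumed values)
token_freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
tokens, cost = viterbi_segment("unhug", token_freqs)
print(tokens, cost)
```

Multiplying each word's cost by its frequency and summing over the corpus gives the loss in the formula above.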

## Tokenizer from scratch
Recreate BERT's WordPiece tokenizer from its building blocks.

### Create a Dataset

In [13]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name='wikitext-2-raw-v1', split='train')

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]['text']

### Model

In [14]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

### Normalizer

In [15]:
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()])
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


### Pre-Tokenizer

In [16]:
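# Attach the pre-tokenizer to the tokenizer; a standalone variable is never used during training.
# (The token lists recorded further down come from a run without it, hence the spaces inside tokens.)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()])
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")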

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

### Trainer

In [17]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [18]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['l', '##et', "##'s t", '##est ', '##this ', '##to', '##ken', '##iz', '##er', '##.']


### Post-Processor

In [19]:
cls_token_id = tokenizer.token_to_id('[CLS]')
sep_token_id = tokenizer.token_to_id('[SEP]')

tokenizer.post_processor = processors.TemplateProcessing(
    single='[CLS]:0 $A:0 [SEP]:0',
    pair='[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1',
    special_tokens=[('[CLS]', cls_token_id), ('[SEP]', sep_token_id)],
)

In [20]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['[CLS]', 'l', '##et', "##'s t", '##est ', '##this ', '##to', '##ken', '##iz', '##er', '##.', '[SEP]']


### Decoder

In [21]:
tokenizer.decoder = decoders.WordPiece(prefix="##")

In [22]:
tokenizer.decode(encoding.ids)

"let's test this tokenizer."

### Save / Load

In [23]:
tokenizer.save("tokenizer.json")

In [24]:
new_tokenizer = Tokenizer.from_file("tokenizer.json")

### Wrap it as a fast tokenizer for Transformers

In [25]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)