<a href="https://colab.research.google.com/github/lazarust/JupyterNotebooks/blob/huggingface-course/%20HuggingFaceCourse/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [HuggingFace Course](https://huggingface.co)

This notebook is where I experiment with concepts and ideas from the course. Some of the code here is taken from the course itself, so all credit goes to 🤗!

In [None]:
!pip install datasets transformers[sentencepiece]

## Chapter 1

### Pipelines

In [37]:
from transformers import pipeline

#### Zero-Shot Classification

In [38]:
# Can Zero-Shot Classification Detect Political Bias?
string = "The president has a long history of exaggerating stories about himself. Most recently, he recounted for the fifth time during his presidency a heartfelt yet factually challenged story about an Amtrak employee during a speech in New Jersey. The employee Biden frequently mentions actually died a year before the story was said to have taken place."
labels = ["democrat", "republican", "center"]
classifier = pipeline("zero-shot-classification")
classifier(
    string,
    candidate_labels=labels,
)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['center', 'republican', 'democrat'],
 'scores': [0.8460571765899658, 0.09381183981895447, 0.060130972415208817],
 'sequence': 'The president has a long history of exaggerating stories about himself. Most recently, he recounted for the fifth time during his presidency a heartfelt yet factually challenged story about an Amtrak employee during a speech in New Jersey. The employee Biden frequently mentions actually died a year before the story was said to have taken place.'}

In [39]:
classifier = pipeline("zero-shot-classification", model='valhalla/distilbart-mnli-12-6')
classifier(
    string,
    candidate_labels=labels,
)

{'labels': ['center', 'democrat', 'republican'],
 'scores': [0.7452044486999512, 0.14780858159065247, 0.10698693990707397],
 'sequence': 'The president has a long history of exaggerating stories about himself. Most recently, he recounted for the fifth time during his presidency a heartfelt yet factually challenged story about an Amtrak employee during a speech in New Jersey. The employee Biden frequently mentions actually died a year before the story was said to have taken place.'}

In [40]:
classifier = pipeline("zero-shot-classification", model='vicgalle/xlm-roberta-large-xnli-anli')
classifier(
    string,
    candidate_labels=labels,
)

{'labels': ['center', 'republican', 'democrat'],
 'scores': [0.5324409604072571, 0.30717146396636963, 0.16038765013217926],
 'sequence': 'The president has a long history of exaggerating stories about himself. Most recently, he recounted for the fifth time during his presidency a heartfelt yet factually challenged story about an Amtrak employee during a speech in New Jersey. The employee Biden frequently mentions actually died a year before the story was said to have taken place.'}
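
All three models rank `center` first, but with very different confidence. As a rough way to compare the runs, the sketch below copies the labels and scores printed above (rounded to four decimals) into plain Python and reports the top label together with its margin over the runner-up. The `summarize` helper is hypothetical and only exists for this comparison; it is not part of `transformers`.

```python
# Labels and scores copied from the three outputs above, rounded to 4 decimals.
results = {
    "facebook/bart-large-mnli": {
        "labels": ["center", "republican", "democrat"],
        "scores": [0.8461, 0.0938, 0.0601],
    },
    "valhalla/distilbart-mnli-12-6": {
        "labels": ["center", "democrat", "republican"],
        "scores": [0.7452, 0.1478, 0.1070],
    },
    "vicgalle/xlm-roberta-large-xnli-anli": {
        "labels": ["center", "republican", "democrat"],
        "scores": [0.5324, 0.3072, 0.1604],
    },
}


def summarize(result):
    """Return (top label, margin between the best and second-best score)."""
    ranked = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    (top_label, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_label, top_score - second_score


summary = {name: summarize(result) for name, result in results.items()}
for name, (label, margin) in summary.items():
    print(f"{name}: top={label}, margin={margin:.4f}")
```

In each run the scores add up to roughly 1, so they can be read as a distribution over the candidate labels rather than as independent judgments.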

#### Fill-Mask

In [41]:
unmasker = pipeline("fill-mask")
unmasker("I like using <mask> for data analysis.", top_k=5)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.05414031445980072,
  'sequence': 'I like using Python for data analysis.',
  'token': 31886,
  'token_str': ' Python'},
 {'score': 0.04662960395216942,
  'sequence': 'I like using Excel for data analysis.',
  'token': 27241,
  'token_str': ' Excel'},
 {'score': 0.045307811349630356,
  'sequence': 'I like using graphs for data analysis.',
  'token': 36386,
  'token_str': ' graphs'},
 {'score': 0.02621152251958847,
  'sequence': 'I like using MySQL for data analysis.',
  'token': 46097,
  'token_str': ' MySQL'},
 {'score': 0.024079876020550728,
  'sequence': 'I like using JSON for data analysis.',
  'token': 47192,
  'token_str': ' JSON'}]

## Chapter 2

### Behind Pipelines

#### AutoTokenizer

In [42]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "This course is super interesting!",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[ 101, 2023, 2607, 2003, 3565, 5875,  999,  102],
        [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]])}


#### AutoModel

In [43]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 8, 768])


In [44]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

tensor([[-4.1537,  4.4618],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [45]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[1.8123e-04, 9.9982e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
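
To make the softmax step less of a black box, here is a minimal pure-Python sketch that recomputes the probabilities from the logits printed earlier. The logits are the rounded values from the output above, so the results only match the tensor output to a few significant digits; the `softmax` helper is written for illustration and is not the PyTorch implementation.

```python
import math


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and
    # does not change the result.
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


# Rounded logits copied from the AutoModelForSequenceClassification output.
logits = [
    [-4.1537, 4.4618],
    [4.1692, -3.3464],
]

probs = [softmax(row) for row in logits]
for row in probs:
    print([f"{p:.4e}" for p in row])
```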


In [46]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
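
The mapping above is what turns a row of probabilities into a readable label. A small sketch, using plain lists with the values printed earlier instead of tensors, and a hypothetical `top_label` helper:

```python
# Copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [
    [1.8123e-04, 9.9982e-01],
    [9.9946e-01, 5.4418e-04],
]


def top_label(row):
    """Return the label with the highest probability and that probability."""
    best_index = max(range(len(row)), key=lambda index: row[index])
    return id2label[best_index], row[best_index]


labelled = [top_label(row) for row in probabilities]
for number, (label, score) in enumerate(labelled, start=1):
    print(f"Sentence {number}: {label} ({score:.4f})")
```

With the actual tensor, `predictions.argmax(dim=-1)` should give the same indices.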

In [47]:
for sentence_num, sentence in enumerate(predictions, start=1):
    print(f'Sentence {sentence_num}: NEGATIVE:{sentence[0]}, POSITIVE:{sentence[1]}')

Sentence 1: NEGATIVE:0.00018123067275155336, POSITIVE:0.999818742275238
Sentence 2: NEGATIVE:0.9994558691978455, POSITIVE:0.0005441842367872596


### Tokenizers

In [48]:
from transformers import AutoTokenizer

#### BERT Base Cased

In [49]:
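# Sentence tokenized by each of the tokenizers in this section
sequence = "I've been waiting for a HuggingFace course my whole life."

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize(sequence)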
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

decoded_string = tokenizer.decode(ids)
print(decoded_string)

['I', "'", 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', '.']
[146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119]
I've been waiting for a HuggingFace course my whole life.


#### GPT2

In [50]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize(sequence)
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

decoded_string = tokenizer.decode(ids)
print(decoded_string)

['I', "'ve", 'Ġbeen', 'Ġwaiting', 'Ġfor', 'Ġa', 'ĠHug', 'ging', 'Face', 'Ġcourse', 'Ġmy', 'Ġwhole', 'Ġlife', '.']
[40, 1053, 587, 4953, 329, 257, 12905, 2667, 32388, 1781, 616, 2187, 1204, 13]
I've been waiting for a HuggingFace course my whole life.


#### Google Pegasus

In [51]:
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
tokens = tokenizer.tokenize(sequence)
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

decoded_string = tokenizer.decode(ids)
print(decoded_string)

['▁I', "'", 've', '▁been', '▁waiting', '▁for', '▁a', '▁H', 'ugging', 'Face', '▁course', '▁my', '▁whole', '▁life', '.']
[125, 131, 261, 174, 1838, 118, 114, 1176, 73940, 28795, 422, 161, 664, 271, 107]
I've been waiting for a HuggingFace course my whole life.
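
The three tokenizers mark word boundaries differently: BERT prefixes word continuations with `##`, GPT-2 prefixes tokens that follow a space with `Ġ`, and Pegasus (SentencePiece) uses `▁` for the same purpose. The sketch below rebuilds text from the token lists printed above using three hypothetical helpers. It is a simplification for illustration only: the real `decode` methods also handle special tokens, byte-level encoding, and spacing cleanup.

```python
def join_wordpiece(tokens):
    """Glue '##' continuation pieces onto the previous word (BERT style)."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


def join_byte_level(tokens):
    """Concatenate tokens and turn the 'Ġ' marker back into a space (GPT-2 style)."""
    return "".join(tokens).replace("Ġ", " ")


def join_sentencepiece(tokens):
    """Concatenate tokens and turn the '▁' marker back into a space (SentencePiece style)."""
    return "".join(tokens).replace("▁", " ").strip()


# Token lists copied from the outputs above.
bert_tokens = ['I', "'", 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging',
               '##F', '##ace', 'course', 'my', 'whole', 'life', '.']
gpt2_tokens = ['I', "'ve", 'Ġbeen', 'Ġwaiting', 'Ġfor', 'Ġa', 'ĠHug', 'ging',
               'Face', 'Ġcourse', 'Ġmy', 'Ġwhole', 'Ġlife', '.']
pegasus_tokens = ['▁I', "'", 've', '▁been', '▁waiting', '▁for', '▁a', '▁H',
                  'ugging', 'Face', '▁course', '▁my', '▁whole', '▁life', '.']

bert_text = join_wordpiece(bert_tokens)
gpt2_text = join_byte_level(gpt2_tokens)
pegasus_text = join_sentencepiece(pegasus_tokens)

print(bert_text)
print(gpt2_text)
print(pegasus_text)
```

The WordPiece result keeps spaces around punctuation (`I ' ve ... life .`) because the tokens carry no information about the original spacing; the tokenizer's own `decode` removes those spaces in a separate cleanup step.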


### Multiple Sequences

In [52]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


In [53]:
batched_ids = [ids, ids]

input_ids = torch.tensor(batched_ids)
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
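
Both rows in the batch above have the same length, which is why `torch.tensor` accepts them directly. Sequences of different lengths would have to be padded to a common length first, with an attention mask telling the model which positions are padding. The following is a minimal pure-Python sketch of that idea; `pad_batch` is a hypothetical helper, and the padding id of 0 is an assumption (with a real tokenizer, read it from `tokenizer.pad_token_id`).

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad lists of token ids to the same length and build attention masks.

    pad_token_id=0 is an assumption made for this sketch.
    """
    max_length = max(len(sequence) for sequence in sequences)
    input_ids = []
    attention_mask = []
    for sequence in sequences:
        padding = max_length - len(sequence)
        input_ids.append(list(sequence) + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(sequence) + [0] * padding)
    return input_ids, attention_mask


# Made-up ids of different lengths, purely for illustration.
example_ids = [[200, 200, 200], [200, 200]]
padded_ids, mask = pad_batch(example_ids)

print("Input IDs:", padded_ids)
print("Attention mask:", mask)
```

In the notebook itself, calling the tokenizer with `padding=True` (as in the AutoTokenizer cell earlier) handles both steps.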


## Chapter 3