# Hugging Face
Your go-to tool for using any pretrained model

Pretrained models are now the default starting point for most NLP work, so Hugging Face is a natural first choice.

In [1]:
import transformers #huggingface
transformers.__version__

#pip install transformers[sentencepiece]
#pip install transformers
#pip install sentencepiece

'4.23.1'

In [3]:
import evaluate ##a tool to help compute metrics
evaluate.__version__

'0.4.0'

In [4]:
import datasets ##huggingface datasets
datasets.__version__

'2.9.0'

In [5]:
import accelerate ##make your training faster!
accelerate.__version__

'0.16.0'

In [None]:
#uncomment this if you are not using our department puffer
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

## 1. Pipeline

The most basic thing in Hugging Face: you pass in a pretrained model and use it directly for inference.

In [12]:
from transformers import pipeline

clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
clf("I love huggingface so much")

[{'label': 'POSITIVE', 'score': 0.9994008541107178}]

In [13]:
clf(["I love huggingface so much", 'The movie is annoyingly good'])

[{'label': 'POSITIVE', 'score': 0.9997593760490417},
 {'label': 'POSITIVE', 'score': 0.9995754361152649}]

### Zero shot classification

This pipeline is called `zero-shot` because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

In [18]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
classifier(
    "This is a NLP course at AIT",
    candidate_labels=["education", "politics", "business","tech",'finance'],
)

{'sequence': 'This is a NLP course at AIT',
 'labels': ['tech', 'education', 'business', 'finance', 'politics'],
 'scores': [0.698082447052002,
  0.25426194071769714,
  0.023365696892142296,
  0.01272545475512743,
  0.01156448945403099]}
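
A rough sketch of where such scores can come from, assuming the underlying model is an NLI model (as `facebook/bart-large-mnli` is): each candidate label is treated as a hypothesis, the model scores how strongly the input entails it, and the entailment scores are normalized across labels with a softmax. The logits below are made-up numbers for illustration, not model outputs.

```python
import math

def softmax(values):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits, one per candidate label (invented for this sketch).
entailment_logits = {"tech": 2.0, "education": 1.0, "politics": -2.0}

labels = list(entailment_logits)
probs = softmax([entailment_logits[label] for label in labels])

# Sort labels from most to least likely, mirroring the order of the pipeline output.
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

Because the softmax runs over the candidate labels, the scores always sum to 1, so adding or removing a label changes every other score.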

### Text generation

The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text.

In [17]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator("Harry Potter is", max_length=100, num_return_sequences = 2, pad_token_id=0, eos_token_id=0)

[{'generated_text': "Harry Potter is set to star again in Hollywood this year, though he couldn't resist directing an adaptation of Harry Potter.\n\n\n\n\n\n\nThe Potter: Potter: the Deathly Hallows is currently in development.\n\n\n\nA co-production by Gwyneth Paltrow, Tom Hiddleston, Harry Potter: The Cursed Child and The Chronicles of Narnia, will be produced by Disney XD, The Hollywood Reporter said.\n\n\nThe Potter:"},
 {'generated_text': "Harry Potter is a student of English at The College of Edinburgh and is a student of French at The University of Leicester.We've decided to create two awesome videos for YouTube that I want to help people find a more fun way to get to the epic battle of war between the United States and Iraq over land.\n\n\nThe battle of the Soviet Union is so old and old, as US Presidents Dwight Eisenhower and George W. Bush declared war without Congressional backing after a massive military battle. The"}]
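
The auto-complete idea can be sketched as a loop that repeatedly asks for the next token and appends it to the text so far. This is a toy illustration only: the `NEXT_TOKEN` table is a hypothetical stand-in for a trained language model, and it always picks a single fixed continuation, whereas a real model predicts a probability distribution over its vocabulary and typically samples from it.

```python
# Hypothetical next-token table standing in for a trained language model.
NEXT_TOKEN = {
    "Harry": "Potter",
    "Potter": "is",
    "is": "a",
    "a": "wizard",
    "wizard": "<eos>",
}

def generate(prompt, max_length=10):
    """Greedily extend the prompt one token at a time."""
    tokens = prompt.split()
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":  # stop when the toy "model" signals the end
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("Harry Potter is"))
```

Here `max_length` plays the same role as in the pipeline call: it caps the total number of tokens, prompt included.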

### Mask filling

This is essentially masked language modeling: the model predicts the token hidden behind `<mask>`.

In [22]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model='distilroberta-base')

#The top_k argument controls how many possibilities you want to be displayed. 
unmasker("Chaky loves to teach deep <mask>.", top_k=3)

[{'score': 0.15232981741428375,
  'token': 2239,
  'token_str': ' learning',
  'sequence': 'Chaky loves to teach deep learning.'},
 {'score': 0.10400044918060303,
  'token': 9589,
  'token_str': ' breathing',
  'sequence': 'Chaky loves to teach deep breathing.'},
 {'score': 0.0700930655002594,
  'token': 30079,
  'token_str': ' truths',
  'sequence': 'Chaky loves to teach deep truths.'}]

### Question Answering

In [23]:
from transformers import pipeline

#Note that this pipeline is extractive: the answer is a span copied from the context
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
question_answerer(
    question="Where do Chaky work?",
    context = "My name is Chaky and I love at AIT"
)

{'score': 0.9845138788223267, 'start': 31, 'end': 34, 'answer': 'AIT'}
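
Since the pipeline is extractive, `start` and `end` are character offsets into the context, and slicing the context with them recovers the answer. A small check using the values copied from the output above:

```python
context = "My name is Chaky and I love at AIT"
result = {"start": 31, "end": 34, "answer": "AIT"}  # copied from the output above

# The answer is the slice of the context between the two offsets.
span = context[result["start"]:result["end"]]
print(span)
```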

### Summarization

In [24]:
from transformers import pipeline

summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6',  max_length=100)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

### Bias and Limitations

Because pretrained models are trained on large real-world datasets, they can unintentionally pick up human biases. For example, compare the occupations predicted for "man" and for "woman":

In [25]:
##gender bias
from transformers import pipeline

unmasker = pipeline("fill-mask", model='distilroberta-base')

#No top_k is passed here, so the pipeline returns its default of 5 candidates.
result = unmasker("This man works as a <mask>")
print([r["token_str"] for r in result])

[' courier', ' translator', ' waiter', ' consultant', ' bartender']

In [27]:
##gender bias
from transformers import pipeline

unmasker = pipeline("fill-mask", model='distilroberta-base')

#No top_k is passed here, so the pipeline returns its default of 5 candidates.
result = unmasker("This woman works as a <mask>")
print([r["token_str"] for r in result])

[' waitress', ' translator', ' nurse', ' prostitute', ' bartender']


## 2. Tokenizer

The tokenizer is the first component of the pipeline. It is responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

Since the default checkpoint of the sentiment-analysis pipeline is `distilbert-base-uncased-finetuned-sst-2-english`, we run the following:

In [29]:
from transformers import AutoTokenizer
#AutoTokenizer picks the right tokenizer class for whatever checkpoint name you give it

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [30]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") #pt for pytorch
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [31]:
tokenizer.decode([101,15775,4801])

'[CLS] chaky'
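
The three steps listed at the start of this section (split, map to integers, add extra inputs) can be sketched with a toy tokenizer. Everything here is a simplified assumption: the vocabulary `VOCAB` is hypothetical, and the split is on whole words and punctuation, whereas the real DistilBERT tokenizer uses a learned subword vocabulary.

```python
import re

# Hypothetical toy vocabulary; real tokenizers ship one with tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "love": 5, "nlp": 6, "so": 7, "much": 8, "!": 9}

def tokenize(text):
    """Step 1: split lowercased text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text):
    """Steps 2 and 3: map tokens to ids, then add the special tokens."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokenize(text)]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

def encode_batch(texts):
    """Pad every sequence to the longest one and build the attention mask."""
    encoded = [encode(t) for t in texts]
    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = longest - len(ids)
        input_ids.append(ids + [VOCAB["[PAD]"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["I love NLP so much!", "I love NLP"])
print(batch)
```

The shorter sentence is padded with the `[PAD]` id, and its attention mask is 0 at those positions, which is the same pattern as in the `input_ids` and `attention_mask` tensors printed earlier.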

## 3. Model

The second component of the pipeline (after the tokenizer)

In [32]:
from transformers import AutoModel

checkpoint  = "distilbert-base-uncased-finetuned-sst-2-english"
model       = AutoModel.from_pretrained(checkpoint)

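#The warning below appears because AutoModel loads only the base DistilBERT model,
#so the classification-head weights stored in this checkpoint are not used.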

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [33]:
outputs = model(**inputs) #** unpacks the dict into keyword arguments (input_ids, attention_mask)
print(outputs.last_hidden_state.shape) #[batch size, seq len, hid dim]

#Note that the outputs of Transformers models behave like namedtuples or dictionaries. 
# You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), 
# or even by index if you know exactly where the thing you are looking for is (outputs[0]).

torch.Size([2, 16, 768])


## 4. Postprocessing

The last step of the pipeline (after the model): turning the raw logits into probabilities and labels

In [34]:
from transformers import AutoModelForSequenceClassification #base model plus a classification head

checkpoint  = "distilbert-base-uncased-finetuned-sst-2-english"
model       = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs     = model(**inputs)

In [36]:
print(outputs.logits)  #shape [batch size, num labels]: two sentences, two labels

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [38]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

In [39]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
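
Putting the pieces together, postprocessing amounts to a softmax over each row of logits followed by a lookup of the most likely index in `id2label`. The following is a plain-Python sketch of that idea using the logits printed above; it is not the pipeline's actual implementation.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}        # copied from model.config.id2label

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The resulting list of `label`/`score` dictionaries has the same form as what the `sentiment-analysis` pipeline returned in section 1.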