
In [1]:
!pip install transformers
!pip install huggingface_hub



## Inside the Pipeline Function

Let's look at what actually happens when we execute the following code:

In [2]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much"
])


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9516071081161499},
 {'label': 'NEGATIVE', 'score': 0.9995144605636597}]

The pipeline has three stages:


**Tokenizer** → **Model** → **Postprocessing**



---



*Tokenizer*

* Using a tokenizer, we convert the raw text into numbers the model can make sense of.
* This course is amazing! → [101, 2023, 2607, 2003, 6429, 999, 102]


*Model*

* These numbers go through the model, which outputs logits.
* [101, 2023, 2607, 2003, 6429, 999, 102] → [-4.3630, 4.6859]


*Postprocessing*

* The postprocessing step converts those logits into labels and scores by applying a softmax.
* [-4.3630, 4.6859] → [NEGATIVE: 0.01%, POSITIVE: 99.99%]



### Stage 1 : Tokenization



1.   First, the text is split into small chunks called tokens. These can be words, parts of words, or punctuation symbols.
2.   Then the tokenizer adds special tokens such as [CLS] and [SEP] (if the model expects them).
3.   Lastly, the tokenizer maps each token to its unique ID in the vocabulary of the pretrained model.
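
The three steps above can be sketched with a toy tokenizer. This is only an illustration, not the subword algorithm the real checkpoint uses: `toy_vocab` is a hypothetical seven-entry excerpt whose IDs are copied from the "This course is amazing!" example in the overview, and `toy_tokenize` and `toy_encode` are made-up helpers that split on words and punctuation only.

```python
import re

# Hypothetical vocabulary excerpt; IDs copied from the overview example.
toy_vocab = {
    "[CLS]": 101,
    "[SEP]": 102,
    "this": 2023,
    "course": 2607,
    "is": 2003,
    "amazing": 6429,
    "!": 999,
}


def toy_tokenize(text):
    """Step 1: lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: add the special tokens a BERT-style model expects.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map every token to its ID in the vocabulary.
    # Raises KeyError for any token outside the toy vocabulary.
    return tokens, [toy_vocab[token] for token in tokens]


tokens, ids = toy_encode("This course is amazing!")
print(tokens)
print(ids)
```

With the real tokenizer, `tokenizer.tokenize(...)` and `tokenizer.convert_tokens_to_ids(...)` expose steps 1 and 3 separately, which can help when inspecting how a sentence is split.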



*The AutoTokenizer class can load the tokenizer for any checkpoint*

In [6]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
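
The zeros at the end of the second row of `input_ids` are padding, and `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). Below is a minimal plain-Python sketch of that logic. It assumes a pad ID of 0, as seen in the output above, and `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs copied from the tokenizer output above, with the padding removed.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The shorter sentence ends up with eight pad positions, so both rows have length 15 and can be stacked into a single tensor.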


### Stage 2 : Model



The AutoModel class loads a model without its pretraining head.

In [10]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])


The AutoModel API instantiates only the body of the model, i.e., the part that is left once the pretraining head is removed.

It outputs a high-dimensional tensor that represents the sentences passed in. Here the shape is [2, 15, 768]: 2 sentences, 15 tokens per padded sentence, and a hidden vector of size 768 for each token.

In [12]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits)

tensor([[-1.4683,  1.5105],
        [ 4.2141, -3.4158]], grad_fn=<AddmmBackward0>)


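Each AutoModelForXxxx class loads a model suitable for a specific task. Here, AutoModelForSequenceClassification adds a sequence classification head on top of the body, so the output is one logit per label for each sentence.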
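
To make the difference between the body and a task head concrete, here is a toy sketch in plain Python. It assumes the head reads only the first token's hidden vector and applies a single linear layer; the real head for this checkpoint may contain more layers. The dimensions, weights, and the `toy_classification_head` name are all made up for illustration.

```python
def toy_classification_head(hidden_states, weight, bias):
    """Map [batch, seq_len, hidden] activations to [batch, num_labels] logits."""
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]  # hidden vector at the first ([CLS]) position
        logits.append([
            sum(w * h for w, h in zip(row, first_token)) + b
            for row, b in zip(weight, bias)
        ])
    return logits


# Made-up activations: batch of 2 sentences, 3 tokens each, hidden size 4.
hidden_states = [
    [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.0, 1.0]],
    [[2.0, -1.0, 0.0, 1.0], [0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0]],
]
# Made-up parameters: one weight row and one bias per label (2 labels).
weight = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
bias = [0.5, -0.5]

logits = toy_classification_head(hidden_states, weight, bias)
print(logits)
```

The point is the change of shape: the body returns one vector per token, while the head reduces each sentence to one number per label.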

### Stage 3 : Postprocessing

In [20]:
import torch

# Print tensors with 4 decimal places and without scientific notation.
torch.set_printoptions(precision=4, sci_mode=False)

To go from logits to probabilities, we apply a softmax layer.

In [22]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)

tensor([[    0.0484,     0.9516],
        [    0.9995,     0.0005]], grad_fn=<SoftmaxBackward0>)
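
Softmax exponentiates each logit and divides by the sum over the row, so every row becomes a probability distribution. The sketch below redoes the computation in plain Python on the logits printed earlier, as a cross-check of the `torch` result. `softmax` here is a hand-written helper, and subtracting the row maximum is a common stabilisation trick rather than something this notebook does explicitly.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]  # shifting leaves the result unchanged
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the AutoModelForSequenceClassification output above.
logits = [[-1.4683, 1.5105], [4.2141, -3.4158]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Rounded to four decimals, the rows agree with the tensor above, and each row sums to 1.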


In [25]:
predicted_labels = torch.argmax(predictions, dim=1)
predicted_labels

tensor([1, 0])

In [27]:
sentiment_labels = ["positive" if label == 1 else "negative" for label in predicted_labels]
sentiment_labels

['positive', 'negative']
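
The list comprehension above hard-codes which index means which label. The model configuration also stores this mapping as `model.config.id2label`. The sketch below assumes that for this checkpoint it is `{0: "NEGATIVE", 1: "POSITIVE"}`, which is consistent with the pipeline output at the top of this notebook, and rebuilds a pipeline-style result in plain Python. `postprocess` is a hypothetical helper.

```python
# Assumed to equal model.config.id2label for this checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [[0.0484, 0.9516], [0.9995, 0.0005]]


def postprocess(rows, id2label):
    """Pick the most probable class per row and attach its label and score."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = postprocess(probabilities, id2label)
print(results)
```

Reading the labels from the configuration instead of hard-coding them keeps the same code usable with checkpoints that have other label sets.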