<a href="https://colab.research.google.com/github/not-sid-29/transformers_huggingface/blob/main/2_Breakdown_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP) - 2
## Breaking down the `pipeline` function

- The `pipeline` function is the highest-level API of the `transformers` library. It chains together the steps of a language task, from raw text to final prediction.<br>
**Components of the `pipeline` function:**  <br>

a. *Tokenizer* - splits the raw input text into tokens and converts each token into its numerical input ID.<br>
b. *Model* - takes the input IDs and produces logits (raw, unnormalized scores) as output.<br>
c. *Post-Processor* - converts the logits into probabilities or predictions by applying an activation function such as softmax.


In [None]:
!pip install transformers



In [None]:
from transformers import pipeline

#Going through the Sentiment-Analysis pipeline function:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [None]:
#Using some examples:

classifier(
    ['He was a benevolent king',
     'He could not perform well in his final exams'
     ]
)

[{'label': 'POSITIVE', 'score': 0.9995684027671814},
 {'label': 'NEGATIVE', 'score': 0.9997758269309998}]

### Applying each component independently:

**A. Tokenizers:**

In [None]:
from transformers import AutoTokenizer

# Same checkpoint that the sentiment-analysis pipeline defaulted to above
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
inputs = [
    'He was a benevolent king',
    'He could not perform well in his final exams'
]

tokens = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
print(tokens)

{'input_ids': tensor([[  101,  2002,  2001,  1037, 25786,  2332,   102,     0,     0,     0,
             0],
        [  101,  2002,  2071,  2025,  4685,  2092,  1999,  2010,  2345, 13869,
           102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
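
The `padding=True` behaviour seen in the output above can be illustrated without the library. The sketch below is only an approximation of what the tokenizer does internally: it reuses the token IDs printed above, assumes right-padding with ID 0 (as the output suggests), and the helper name `pad_batch` is made up for this example.

```python
# Token IDs copied from the tokenizer output above (101 = [CLS], 102 = [SEP]).
sequences = [
    [101, 2002, 2001, 1037, 25786, 2332, 102],
    [101, 2002, 2071, 2025, 4685, 2092, 1999, 2010, 2345, 13869, 102],
]
PAD_ID = 0  # padding ID seen in the output above


def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(s) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids)
print(attention_mask)
```

The attention mask tells the model which positions hold real tokens, so the padding added to the shorter sentence does not influence its prediction.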


**B. Model:**

In [None]:
from transformers import AutoModel
pretrained_model = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModel.from_pretrained(pretrained_model)

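- `AutoModel` loads only the base Transformer, without a task-specific head. Its output is a tensor of *hidden states*: one high-dimensional vector per input token.<br>
- This tensor has 3 dimensions:<br>
a. batch size (number of input sentences),<br>
b. sequence length (number of tokens per sentence, after padding),<br>
c. hidden size (length of the vector that represents each token).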

In [None]:
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)

torch.Size([2, 11, 768])


-> batch-size = 2, sequence-length = 11, hidden-size = 768

**Using the `AutoModelForSequenceClassification` class to generate outputs for sentiment analysis:**

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(pretrained_model)

This model adds a classification head on top of the base model, so its output is a much smaller tensor of *logits* with shape (batch size, number of labels), here 2 x 2.

In [None]:
outputs = model(**tokens)
print(outputs.logits)

tensor([[-3.7530,  3.9945],
        [ 4.6498, -3.7531]], grad_fn=<AddmmBackward0>)
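
To see why the shape drops from `[2, 11, 768]` to `[2, 2]`, here is a toy sketch of a classification head in plain Python. It is not the real head: the weights are random, the hidden size is shrunk to 4 for readability, and the names `toy_head`, `weights`, and `bias` are invented for illustration. The trained head in the actual model may contain additional layers.

```python
import random

random.seed(0)

# HIDDEN is 768 in the real model; 4 keeps the toy example small.
BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 2, 11, 4, 2

# Stand-in for last_hidden_state: shape [BATCH, SEQ_LEN, HIDDEN], random values.
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
    for _ in range(BATCH)
]

# Hypothetical head parameters: one row of HIDDEN weights per label, plus a bias.
weights = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def toy_head(states):
    """Keep only the first token's vector per sentence, then apply a linear layer."""
    logits = []
    for sentence in states:
        first_token = sentence[0]  # the [CLS] position summarizes the sentence
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)
        ])
    return logits


logits = toy_head(hidden_states)
print(len(logits), len(logits[0]))  # 2 2
```

The sequence dimension disappears because only one vector per sentence is kept, and the hidden dimension is projected down to one score per label.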


**C. Post-Processor:**

In [None]:
# The post-processor applies the softmax activation function to the logits,
# normalizing them into probabilities that sum to 1 for each sentence
import torch
preds = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(preds)

tensor([[4.3161e-04, 9.9957e-01],
        [9.9978e-01, 2.2417e-04]], grad_fn=<SoftmaxBackward0>)


**Breakdown of the scores:**<br>
a. Sentence 1: NEGATIVE = 0.000432, POSITIVE = 0.99957<br>
b. Sentence 2: NEGATIVE = 0.99978, POSITIVE = 0.000224
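
As a cross-check, the same post-processing can be reproduced by hand in plain Python. This is a minimal sketch, not the library's implementation: the logits are the rounded values printed earlier, and the label order in `ID2LABEL` is an assumption chosen to match the pipeline output at the top of the notebook.

```python
import math

# Logits copied from the printed output above (rounded to four decimals).
logits = [
    [-3.7530, 3.9945],
    [4.6498, -3.7531],
]

# Assumed label order, chosen to match the pipeline output shown earlier.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top probability
    results.append({"label": ID2LABEL[best], "score": probs[best]})

print(results)
```

Each entry keeps only the most probable label and its probability, which is the same shape of result that `classifier(...)` returned at the start. Because the logits here are rounded, the scores agree with the pipeline output only to about five decimal places.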