# Behind the pipeline (PyTorch)

### This notebook is my simplified version of the tokenizer demo in the Hugging Face NLP Course, which shows how tokenizers work: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%pip install datasets evaluate transformers[sentencepiece]

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Simplest implementation of Sentiment Analysis

In [2]:
from transformers import pipeline
import pandas as pd

# Build a ready-made sentiment-analysis pipeline from the fine-tuned checkpoint
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
out = classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much so that life can be much easier without your constant interruptions!",
    ]
)
pd.DataFrame(out)



Unnamed: 0,label,score
0,POSITIVE,0.959805
1,NEGATIVE,0.997773


## Load tokenizer

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Study the tokenizer output and the attention mask it generates. Modify the sentences and see how the output changes.

In [4]:

import pandas as pd
import numpy as np

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much so that life can be much easier without your constant interruptions!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="np")

# Sequence length after padding. Read it from the array shape: the earlier
# loop used `i += 1` on each row, which modified input_ids in place and
# shifted every token ID up by one.
seq_len = inputs["input_ids"].shape[1]
print(seq_len)

# One column per token position (numbered from 1); token IDs first, then masks
df = pd.DataFrame(
    np.vstack([inputs["input_ids"], inputs["attention_mask"]]),
    columns=range(1, seq_len + 1),
)
df.insert(0, "Variable", ["Tokenized ID A", "Tokenized ID B", "Attention Mask A", "Attention Mask B"])

df

20


Unnamed: 0,Variable,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,20
0,Tokenized ID A,101,1045,1005,2310,2042,3403,2005,1037,17662,...,2607,2026,2878,2166,1012,102,0,0,0,0
1,Tokenized ID B,101,1045,5223,2023,2061,2172,2061,2008,2166,...,2022,2172,6082,2302,2115,5377,24191,2015,999,102
2,Attention Mask A,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,0,0,0,0
3,Attention Mask B,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
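
What `padding=True` does to a batch can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `pad_batch` is hypothetical, the token IDs are made-up toy values, and the pad ID of 0 is an assumption (it is the `[PAD]` ID in BERT-style uncased vocabularies).

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padded positions get pad_id
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Toy ID sequences of different lengths (hypothetical values)
ids, mask = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]])
print(ids)
print(mask)
```

The shorter sequence is extended to the length of the longer one, and its attention mask ends in zeros at exactly the padded positions.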


##  Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. The cell below shows the output without a head: AutoModel loads only the base model, so it returns the raw hidden states.

In [5]:
from transformers import AutoModel
import pandas as pd
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)



raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much so that life can be much easier without your constant interruptions!",
]

inputs1 = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs1)
print(outputs)
print(outputs.last_hidden_state.shape)

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [-0.1878,  0.1431,  0.4369,  ..., -0.0848,  0.4192,  0.1138],
         [-0.1030,  0.3895,  0.3599,  ..., -0.1300,  0.2746,  0.2901],
         [-0.2132,  0.1402,  0.3943,  ..., -0.0903,  0.4004,  0.0896]],

        [[-0.4200,  0.7560, -0.2024,  ..., -0.0954, -0.4335,  0.0514],
         [ 0.0833,  1.0050, -0.2039,  ..., -0.2464, -0.2403,  0.4341],
         [-0.0034,  0.8134, -0.1011,  ..., -0.1996, -0.5014,  0.4725],
         ...,
         [-0.3173,  0.6448, -0.1439,  ...,  0.1067, -0.3979, -0.3808],
         [ 0.1764,  1.0865, -0.3391,  ..., -0.0470, -0.4173, -0.0399],
         [ 0.2042,  0.3926, -0.1452,  ..., -0.1849, -0.5751, -0.1281]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)
t

The cell below shows the output after giving the model a head. The dimensionality is now much lower: the model head takes the high-dimensional vectors we saw before as input and outputs vectors containing two values (one per label).
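
The idea of projecting a hidden vector down to one value per label can be sketched as a single linear layer in plain Python. This is a toy illustration under stated assumptions: the function name `linear_head`, the 4-dimensional hidden state, and the weights are all made up, and a real classification head may contain additional layers.

```python
def linear_head(hidden, weight, bias):
    """Project one hidden vector of length H to len(bias) logits: W @ h + b."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


hidden = [0.5, -1.0, 2.0, 0.0]      # toy hidden state (H = 4)
weight = [
    [1.0, 0.0, 0.5, 0.0],           # weights producing the logit for label 0
    [0.0, 1.0, 0.0, 1.0],           # weights producing the logit for label 1
]
bias = [0.1, -0.1]

logits = linear_head(hidden, weight, bias)
print(len(hidden), "->", len(logits))
print(logits)
```

Whatever the size of the hidden vector, the output has exactly as many values as there are labels.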

In [6]:
import pandas as pd
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Same checkpoint, this time loaded with its sequence-classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much so that life can be much easier without your constant interruptions!",
]

inputs1 = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs1)
pd.DataFrame(outputs.logits.detach().numpy())

Unnamed: 0,0,1
0,-1.560701,1.612286
1,3.315524,-2.789292


The values above are not probabilities but logits: the raw, unnormalized scores output by the last layer of the model. To convert them to probabilities, they need to go through a SoftMax layer. All 🤗 Transformers models output logits, because the training loss function generally fuses the last activation function (such as SoftMax) with the actual loss function (such as cross entropy).

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.
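
Why a loss function can work directly on logits can be illustrated with a small calculation. The sketch below is plain Python, not the PyTorch implementation; the function name `cross_entropy_from_logits` is hypothetical, and the logits are the first row of the table above. It compares a fused computation (log-sum-exp minus the target logit) with the two-step route (SoftMax, then negative log).

```python
import math


def cross_entropy_from_logits(logits, target):
    """Cross entropy computed straight from logits with a stable log-sum-exp."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [-1.560701, 1.612286]  # first row of the logits table
target = 1                      # index of the true label, assumed here for illustration

# Fused: never materializes the probabilities
fused = cross_entropy_from_logits(logits, target)

# Two-step: SoftMax first, then the negative log of the target probability
exps = [math.exp(z) for z in logits]
two_step = -math.log(exps[target] / sum(exps))

print(fused, two_step)
```

Both routes give the same loss up to floating-point error, which is why the model itself can stop at the logits.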

## Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves.

In [7]:
pd.DataFrame(outputs.logits.detach().numpy())

Unnamed: 0,0,1
0,-1.560701,1.612286
1,3.315524,-2.789292


As noted above, these are logits rather than probabilities. Applying SoftMax over the last dimension converts each row into probabilities that sum to 1.

In [8]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
pd.DataFrame(predictions.detach().numpy())

Unnamed: 0,0,1
0,0.040195,0.959805
1,0.997773,0.002227
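
The SoftMax step can be reproduced by hand to check these numbers. This is a plain-Python sketch (the helper name `softmax` is my own, not the PyTorch function) applied to the logits printed earlier.

```python
import math


def softmax(row):
    """Exponentiate each score and normalize so the row sums to 1."""
    m = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


logits = [
    [-1.560701, 1.612286],
    [3.315524, -2.789292],
]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 6) for p in row])
```

Each row is normalized independently, which corresponds to `dim=-1` in the PyTorch call.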


In [9]:
pd.DataFrame(model.config.id2label, index=[0])

Unnamed: 0,0,1
0,NEGATIVE,POSITIVE


Column 0 is NEGATIVE and column 1 is POSITIVE.
To make this table of probabilities easier to read, I label the columns accordingly.

In [10]:
df = pd.DataFrame(predictions.detach().numpy())
df.columns = ["NEGATIVE", "POSITIVE"]
df

Unnamed: 0,NEGATIVE,POSITIVE
0,0.040195,0.959805
1,0.997773,0.002227
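
The last step, turning each row of probabilities into a label and a score, can be sketched in plain Python. This is an illustration rather than the pipeline's actual code; the probabilities and the `id2label` mapping are copied from the outputs above.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probs = [
    [0.040195, 0.959805],
    [0.997773, 0.002227],
]

results = []
for row in probs:
    # Index of the largest probability in this row
    best = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best], "score": row[best]})

print(results)
```

The labels and scores match what the sentiment-analysis pipeline returned in the first cell of this notebook.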
