<a href="https://colab.research.google.com/github/ladylazy9x/huggingface-learn/blob/main/course/en/chapter2/section2_pt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
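
The tokenizer output above shows the second sentence padded with zeros to the length of the first, with an `attention_mask` marking which positions are real tokens. The following is a minimal pure-Python sketch of that padding step, assuming right-padding with a pad ID of 0 as seen in the output. `pad_batch` is a hypothetical helper written for illustration, not part of the Transformers API.

```python
# Pure-Python sketch of padding; `pad_batch` is a hypothetical helper,
# not a function from the Transformers library.
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above (before padding).
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["attention_mask"][1])
```

The mask lets the model ignore the padded positions, so padding should not change the prediction for the shorter sentence.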


In [6]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [7]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])




## Obtaining model outputs and their shape

The two lines in the cell above run the pretrained model on the tokenized inputs and print the shape of its last hidden state.

`outputs = model(**inputs)`

- This line feeds the preprocessed inputs (the numerical representations of the sentences) to the model.
- `**inputs` unpacks the `inputs` dictionary into keyword arguments. The model expects arguments such as `input_ids` and `attention_mask`, and the unpacking supplies them from the dictionary's keys.
- The results are stored in `outputs`. Their exact contents depend on the model class; for `AutoModel` they include the hidden states the model computed for each input token.

`print(outputs.last_hidden_state.shape)`

- This line reads the `last_hidden_state` attribute of `outputs` and prints its shape.
- `last_hidden_state` is a tensor (a multi-dimensional array) holding the final representation of each input token after it has passed through all of the model's layers.
- The printed shape, `torch.Size([2, 16, 768])`, has three dimensions:
  - Batch size: the number of input sequences (2 here, one per sentence).
  - Sequence length: the number of tokens per sequence after padding (16 here).
  - Hidden size: the dimensionality of the vector that represents each token (768 here; it depends on the model architecture).

In simpler terms: the text goes into the model, the model returns one vector per token, and the shape shows how many sentences, tokens, and features that representation contains.
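
The `**inputs` unpacking described above can be tried in isolation with plain Python. This is only a sketch: `toy_model` is a hypothetical stand-in for a real model, and the token IDs are shortened placeholders.

```python
# `toy_model` is a hypothetical stand-in for a real model; an actual
# Transformers model accepts more keyword arguments than these two.
def toy_model(input_ids=None, attention_mask=None):
    """Report whether a mask was received and the (batch, sequence) shape."""
    batch_size = len(input_ids)
    seq_len = len(input_ids[0])
    return {"received_mask": attention_mask is not None, "shape": (batch_size, seq_len)}


# Shortened placeholder inputs with the same keys the tokenizer returns.
inputs = {
    "input_ids": [[101, 1045, 102], [101, 5223, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# These two calls are equivalent: ** turns each dictionary key into a keyword argument.
unpacked = toy_model(**inputs)
explicit = toy_model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)
print(unpacked["shape"])
```

Calling `toy_model(inputs)` without the `**` would pass the whole dictionary as the first positional argument instead, which is not what the model expects.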





In [8]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [9]:
print(outputs.logits.shape)

torch.Size([2, 2])


In [10]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [11]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [13]:
# Print the probabilities as plain decimals instead of scientific notation ('e').
predictions_no_e = predictions.tolist()
for i, prediction in enumerate(predictions_no_e):
    # Round to 4 decimal places for readability; adjust as needed.
    print(f"Prediction {i+1}: {[round(x, 4) for x in prediction]}")

Prediction 1: [0.0402, 0.9598]
Prediction 2: [0.9995, 0.0005]


In [12]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
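
With `id2label` available, the remaining postprocessing can be sketched in plain Python: apply softmax to each row of logits, take the highest-scoring index, and look up its label. The logits below are copied from the rounded values printed earlier, so the scores should agree with the pipeline output only to about four decimal places. The `softmax` helper is written here for illustration; the notebook itself uses `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Values copied from the notebook output above (logits are rounded there).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    results.append({"label": id2label[best], "score": round(probs[best], 4)})

print(results)
```

This mirrors the structure of what `pipeline("sentiment-analysis")` returned at the start of the notebook: one label and one score per input sentence.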



## AutoModelForSequenceClassification vs. AutoModel

For sentiment analysis, `AutoModelForSequenceClassification` can be used instead of `AutoModel`, and it is the more direct choice.

- `AutoModelForSequenceClassification` loads the base model together with a classification head on top of it. Its output contains one logit per class (here, negative and positive), which softmax then converts to probabilities.
- `AutoModel` is more general-purpose. It returns the outputs of the underlying transformer (such as `last_hidden_state`), so you would need to add your own classification head to perform sentiment analysis.

## Why AutoModel appears in this notebook

`AutoModel` is most likely included for educational purposes. Using it first shows:

- How to access `last_hidden_state` and inspect the model's internal representations.
- The building blocks of a transformer model, before the task-specific `AutoModelForSequenceClassification` is introduced.

In a real sentiment analysis project you would typically load `AutoModelForSequenceClassification` directly, because it returns class logits without any extra layers of your own.
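
To make the idea of a classification head concrete, here is a deliberately simplified pure-Python sketch: a single linear layer applied to the first token's hidden state of each sequence. Everything in it is an assumption for illustration. The hidden size is 4 rather than 768, the weights are made up, and the head in the actual checkpoint is learned and has more layers than this.

```python
# Simplified illustration only: made-up sizes and weights, not the real head.
def linear(vector, weights, bias):
    """Apply one linear layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


# Fake `last_hidden_state` with shape [batch=2, seq_len=3, hidden=4].
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
    [[0.4, 0.3, 0.2, 0.1], [1.0, 1.0, 1.0, 1.0], [0.0, 0.1, 0.0, 0.1]],
]

# Hypothetical head parameters with shape [num_labels=2, hidden=4].
head_weights = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
head_bias = [0.0, 0.0]

# Use the first token's vector of each sequence as the sentence summary.
logits = [linear(sequence[0], head_weights, head_bias) for sequence in last_hidden_state]
print(len(logits), len(logits[0]))
```

The head reduces a `[batch, seq_len, hidden]` tensor to `[batch, num_labels]`, which is why `outputs.logits.shape` is `torch.Size([2, 2])` while `last_hidden_state` has shape `[2, 16, 768]`.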

