In [1]:
!pip install datasets evaluate transformers[sentencepiece]


##The code starts by importing the necessary modules: `torch` for tensor operations and `AutoTokenizer` and `AutoModelForSequenceClassification` from the Transformers library.

##Next, a pre-trained model and tokenizer for sentiment analysis are loaded using the `from_pretrained` method. In this case, the `"distilbert-base-uncased-finetuned-sst-2-english"` checkpoint is used, which is a DistilBERT model fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset for binary sentiment classification.

##The input sequence to be classified is defined as "I've been waiting for a HuggingFace course my whole life."

##The input sequence is then tokenized using the `tokenizer.tokenize` method, which splits the sequence into individual tokens. The tokens are converted to their corresponding IDs using `tokenizer.convert_tokens_to_ids`, and the resulting IDs are converted to a PyTorch tensor using `torch.tensor`.

##Finally, the code passes the `input_ids` tensor to the model with `model(input_ids)`. This call fails because `input_ids` is a one-dimensional tensor holding a single sequence, while Transformers models expect a batch: a two-dimensional tensor of shape `(batch_size, sequence_length)`.

##To run the model, add a batch dimension, for example with `torch.tensor([ids])`. Calling the tokenizer directly, as in `tokenizer(sequence, return_tensors="pt")`, handles tokenization, ID conversion, special tokens, and the batch dimension in a single call.

##This code covers the basic steps of sequence classification with the Transformers library: loading a pre-trained model and tokenizer, tokenizing an input sequence, and converting the IDs to a tensor. The remaining step, shaping the input as a batch, is what the cells that follow address.
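
##A minimal sketch of the shape difference, using plain Python lists rather than tensors. The `nested_shape` helper is hypothetical, written only for this illustration, and the IDs are the ones this notebook prints for the example sentence. A flat list of IDs corresponds to a one-dimensional tensor, while wrapping it in another list adds the leading batch dimension that models expect.

```python
def nested_shape(values):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)

# Token IDs for the example sentence, as printed later in this notebook
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
       2026, 2878, 2166, 1012]

print(nested_shape(ids))    # (14,)   one sequence, no batch dimension
print(nested_shape([ids]))  # (1, 14) a batch containing one sequence
```

##`torch.tensor([ids])` gives the same `(1, 14)` shape as a tensor.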

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained model and tokenizer for sentiment analysis
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Define the input sequence
sequence = "I've been waiting for a HuggingFace course my whole life."

# Tokenize the input sequence
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)

# Attempt to pass the input_ids to the model (this line fails: input_ids has no batch dimension)
model(input_ids)

In [3]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


##This cell repeats the imports, the model and tokenizer loading, and the input sequence from the previous cell, then fixes the call that failed.

##The input sequence is tokenized with `tokenizer.tokenize`, and the tokens are converted to their IDs with `tokenizer.convert_tokens_to_ids`. The list of IDs is wrapped in an outer list before it is passed to `torch.tensor`, which produces a tensor of shape `(1, sequence_length)`: a batch containing one sequence, the format the model expects. Because the IDs are built by hand, they do not include the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102) that calling the tokenizer directly adds.

##The `input_ids` tensor is printed for reference, then passed to the model with `model(input_ids)`, and the output logits are printed. Logits are the raw, unnormalized scores for each class, before softmax is applied. Because the model is fine-tuned for binary sentiment classification, there are two logits: for this checkpoint, index 0 corresponds to the negative class and index 1 to the positive class.

##To interpret the output, apply softmax to the logits, for example with `torch.nn.functional.softmax(output.logits, dim=-1)`, and select the class with the highest probability; `model.config.id2label` maps each index to its label. The model's forward pass returns only logits and does not apply softmax itself.
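
##As a rough sketch of that interpretation step, softmax can be computed by hand on the logits this notebook prints for the example sentence. The logit values are copied from the notebook output, and the `id2label` mapping is an assumption based on this checkpoint's configuration; in practice, read it from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    # Subtract the largest logit before exponentiating for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the notebook output for the example sentence
logits = [-2.7276, 2.8789]

# Assumed label mapping; the real one lives in model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted], round(probs[predicted], 3))  # POSITIVE 0.996
```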

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained model and tokenizer for sentiment analysis
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Define the input sequence
sequence = "I've been waiting for a HuggingFace course my whole life."

# Tokenize the input sequence
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

# Convert token IDs to a PyTorch tensor
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

# Pass the input_ids to the model and get the output logits
output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
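
##The IDs printed here are two shorter than the `input_ids` the tokenizer returned earlier. A small check, using only values printed in this notebook, suggests that the difference is the pair of special tokens the tokenizer adds around the sequence. The names `[CLS]` and `[SEP]` for IDs 101 and 102 come from the BERT uncased vocabulary and are stated here as an assumption.

```python
# IDs returned by tokenizer(sequence, return_tensors="pt"), copied from the earlier output
full_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
            2607, 2026, 2878, 2166, 1012, 102]

# IDs built by hand with tokenize() and convert_tokens_to_ids()
manual_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
              2026, 2878, 2166, 1012]

CLS_ID, SEP_ID = 101, 102  # assumed: [CLS] and [SEP] in the BERT uncased vocabulary

# Wrapping the manual IDs in the two special tokens reproduces the tokenizer output
rebuilt = [CLS_ID] + manual_ids + [SEP_ID]
print(rebuilt == full_ids)              # True
print(len(full_ids) - len(manual_ids))  # 2
```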


In [5]:
batched_ids = [
    [200, 200, 200],
    [200, 200],
]


In [6]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

##The code starts by reloading the pre-trained DistilBERT model for sequence classification with `AutoModelForSequenceClassification.from_pretrained`.

##Next, three inputs are defined:

##*1. `sequence1_ids`: A sequence of length 3, represented as a list of token IDs.*
##*2. `sequence2_ids`: A sequence of length 2, represented as a list of token IDs.*
##*3. `batched_ids`: A batch of the two sequences, where the second sequence is padded to the length of the first with `tokenizer.pad_token_id`.*

##The padding is necessary because a tensor must be rectangular, so all sequences in a batch need the same length. `tokenizer.pad_token_id` is the ID of the special token used to pad shorter sequences to that length.

##Finally, each input is converted with `torch.tensor` and passed to the model, and the output logits are printed.

##The logits are the raw scores for each class, before softmax is applied. For this binary classifier there are two per sequence: one for the negative class (index 0) and one for the positive class (index 1).

##Comparing the outputs reveals a problem. The first row of the batched logits matches the logits of `sequence1_ids`, but the second row differs from the logits of `sequence2_ids` on its own, even though both represent the same sequence. The model attends to the padding token as if it were real input. Passing an `attention_mask` that marks padding positions with 0 tells the model to ignore them, which is what the warning printed with the output recommends.

In [7]:
# Load the pre-trained DistilBERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Define input sequences with different lengths
sequence1_ids = [[200, 200, 200]]  # Sequence of length 3
sequence2_ids = [[200, 200]]  # Sequence of length 2
batched_ids = [
    [200, 200, 200],  # Sequence of length 3
    [200, 200, tokenizer.pad_token_id],  # Sequence of length 2, padded to length 3
]
# Pass the input sequences to the model and print the output logits
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)
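
##The warning above recommends an attention mask. A minimal sketch of building one alongside the padded batch, in plain Python: `pad_batch` is a hypothetical helper written for this illustration, and the padding ID of 0 is an assumption (use `tokenizer.pad_token_id` in practice).

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest length and build a matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)      # real tokens, then padding
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return padded, mask

# pad_id=0 is an assumed value for illustration
batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batched_ids)     # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

##Both lists can then be converted with `torch.tensor` and passed together, as in `model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))`, so that the padded position should no longer influence the second sequence's logits.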
