# Hugging Face Introduction

See the [`README.md`](./README.md) file for more information.

Table of contents:

- [Introduction and Quick Start](#introduction-and-quick-start)
  - [Setup](#setup)

## Introduction and Quick Start

In [1]:
import torch

In [2]:
torch.__version__

'1.13.0+cu117'

### Setup

First, install either TensorFlow/Keras or PyTorch in an environment. Then install the `transformers` library from [Hugging Face](https://huggingface.co/):

```bash
# Install/activate a basic environment
conda env create -f conda.yaml
conda activate ds

# Pytorch: Windows + CUDA 11.7
# Update your NVIDIA drivers: https://www.nvidia.com/Download/index.aspx
# I have version 12.1, but it works with older versions, e.g. 11.7
# Check your CUDA version with: nvidia-smi.exe
# In case of any runtime errors, check version compatibility tables:
# https://github.com/pytorch/vision#installation
python -m pip install torch==1.13+cu117 torchvision==0.14+cu117 torchaudio torchtext==0.14 --index-url https://download.pytorch.org/whl/cu117

# Install the transformers library
pip install transformers datasets accelerate evaluate

# For CPU support only:
pip install 'transformers[torch]' datasets accelerate evaluate
```

### Pipeline

The [pipeline](https://huggingface.co/docs/transformers/main/main_classes/pipelines) does 3 things:

- Preprocesses the text: tokenization
- Feeds the preprocessed text to the model
- Postprocesses the output, e.g., applies the labels and packs the result
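
The three steps can be sketched in plain Python. This is only a toy illustration: the vocabulary, the per-token scores, and the `toy_pipeline` function are hypothetical stand-ins, not the actual `transformers` implementation.

```python
# Toy illustration of the three pipeline stages (all names and numbers are hypothetical).
import math

VOCAB = {"great": 0, "awful": 1, "movie": 2}   # hypothetical vocabulary
WORD_SCORES = {0: 2.0, 1: -2.0, 2: 0.0}        # hypothetical per-token "model" scores
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: tokenize the text and map known tokens to ids."""
    return [VOCAB[tok] for tok in text.lower().split() if tok in VOCAB]


def model(token_ids):
    """Step 2: produce one logit per class (here: a simple sum of scores)."""
    score = sum(WORD_SCORES[i] for i in token_ids)
    return [-score, score]  # [negative logit, positive logit]


def postprocess(logits):
    """Step 3: softmax over the logits, then attach the winning label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Chain the three steps, mimicking the shape of a classification pipeline result."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("Great movie"))
```

The returned structure (a list with one `label`/`score` dictionary per input) mirrors what the sentiment-analysis pipeline prints in this notebook.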

We pass the pipeline the **task** we want to carry out and, optionally, the **model** (plus its **revision**) we would like to use. The call returns a complete pipeline object which performs the 3 steps above. There are many [tasks](https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.pipeline.task):

- `sentiment-analysis`
- `text-generation`
- `question-answering`
- `translation`
- `zero-shot-classification`
- `audio-classification`
- `image-to-text`
- `object-detection`
- `image-segmentation`
- `summarization`
- ...

Notes:

- When a model is used for the first time, it needs to be downloaded.
- Always pass the `model` and `revision` explicitly for reproducibility!
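
A minimal sketch of pinning both values could look as follows. The model name and revision are the ones reported in the log output of the default sentiment-analysis pipeline in this notebook; the `PINNED_CLASSIFIER` dictionary and the `build_classifier` helper are hypothetical names introduced only for illustration.

```python
# Pin both the model and its revision so that reruns load the same weights.
PINNED_CLASSIFIER = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",  # revision reported in this notebook's pipeline log
}


def build_classifier(config=PINNED_CLASSIFIER):
    """Create the pipeline; the import is deferred until the function is called."""
    from transformers import pipeline

    return pipeline(**config)


# Usage (downloads the model on the first call):
# classifier = build_classifier()
# print(classifier("I've been waiting for a HuggingFace course my whole life."))
```

Keeping the configuration in one place makes it easy to log or version it together with the results.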

In [3]:
from transformers import pipeline

In [4]:
# Vanilla sentiment analysis with a pipeline object:
# https://huggingface.co/docs/transformers/main/main_classes/pipelines
# We pass the task (see link above for a complete list of tasks)
classifier = pipeline(task="sentiment-analysis") # Task; no model selected - default used
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)
# [{'label': 'POSITIVE', 'score': 0.9598049521446228}]

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Using C:\Users\Msagardi\AppData\Local\torch_extensions\torch_extensions\Cache\py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Failed to load CUDA kernels. Mra requires custom CUDA kernels. Please verify that compatible versions of PyTorch and CUDA Toolkit are installed: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


In [5]:
# Now, we pass the model explicitly
classifier = pipeline(task="sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)
# [{'label': 'POSITIVE', 'score': 0.9598049521446228}]

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


In [6]:
# Vanilla text generation
generator = pipeline(task="text-generation",
                     model="distilgpt2")
res = generator("In this course, I will teach you how to",
                max_length=30,
                num_return_sequences=2)
print(res)
# [{'generated_text': 'In this course, I will teach you how to solve the problems in various languages.\n\n\nPlease do not miss this course:\nIf you'}, {'generated_text': 'In this course, I will teach you how to make a good job for your children:\n\n\n\n1. Make a decision about the future'}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, I will teach you how to use a non-trivial approach to a class. However, the lessons will be presented in'}, {'generated_text': 'In this course, I will teach you how to take advantage of my favorite teaching methods.\nThe most important thing about that course is that you can'}]


In [7]:
# Vanilla zero-shot-classification
classifier = pipeline(task="zero-shot-classification",
                      model="facebook/bart-large-mnli")
res = classifier("This course is about Python list comprehensions",
                 candidate_labels=["education", "politics", "business"])
print(res)
# {'sequence': 'This course is about Python list comprehensions', 'labels': ['education', 'business', 'politics'], 'scores': [0.8270175457000732, 0.12526951730251312, 0.04771287366747856]}

{'sequence': 'This course is about Python list comprehensions', 'labels': ['education', 'business', 'politics'], 'scores': [0.8270175457000732, 0.12526951730251312, 0.04771287366747856]}


### Tokenizer and Models

The **tokenizer** is the first step in the **pipeline**. We can access and use it on its own, and we can even pass our own tokenizer to the pipeline. The same applies to the **models**.

- [Hugging Face: Tokenizer](https://huggingface.co/docs/transformers/main/main_classes/tokenizer)
- [Hugging Face: Models](https://huggingface.co/docs/transformers/main/main_classes/model)

In [8]:
from transformers import pipeline
# Generic classes for tokenization & sequence classification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Specific classes for tokenization & sequence classification
from transformers import BertTokenizer, BertForSequenceClassification, BertModel

In [9]:
# Now, we can create instances of the tokenizer/model
# It is important to take the objects using from_pretrained()
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline(task="sentiment-analysis",
                      model=model,
                      tokenizer=tokenizer)

res = classifier("I've been waiting for a HuggingFace course my whole life.")
print(res)
# [{'label': 'POSITIVE', 'score': 0.9598049521446228}]

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]


In [10]:
# We can also use the tokenizer separately
sequence = "Using transformers is simple with HuggingFace."

# We get the token ids and the attention_mask: 0 ignored by attention layer
res = tokenizer(sequence)
print(res)
# {'input_ids': [101, 2478, 19081, 2003, 3722, 2007, 17662, 12172, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# We get the token strings
tokens = tokenizer.tokenize(sequence)
print(tokens)
# ['using', 'transformers', 'is', 'simple', 'with', 'hugging', '##face', '.']

# We get the ids of the token strings
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# [2478, 19081, 2003, 3722, 2007, 17662, 12172, 1012]

# We decode the ids back to words
decoded_string = tokenizer.decode(ids)
print(decoded_string)
# using transformers is simple with huggingface.


{'input_ids': [101, 2478, 19081, 2003, 3722, 2007, 17662, 12172, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['using', 'transformers', 'is', 'simple', 'with', 'hugging', '##face', '.']
[2478, 19081, 2003, 3722, 2007, 17662, 12172, 1012]
using transformers is simple with huggingface.
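
The split of `huggingface` into `hugging` and `##face` comes from WordPiece-style subword tokenization. The sketch below imitates the greedy longest-match-first idea with a tiny vocabulary built from the ids printed above. It is a simplified assumption-laden illustration: the real vocabulary has tens of thousands of entries and the real tokenizer performs more normalization; `TINY_VOCAB`, `wordpiece`, and `toy_encode` are hypothetical names.

```python
import re

# Tiny vocabulary taken from the ids shown in the output above.
TINY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "using": 2478, "transformers": 19081,
    "is": 2003, "simple": 3722, "with": 2007, "hugging": 17662,
    "##face": 12172, ".": 1012,
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(text, vocab=TINY_VOCAB):
    """Lowercase, split into words/punctuation, apply WordPiece, add special tokens."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [piece for word in words for piece in wordpiece(word, vocab)]
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    return tokens, ids


tokens, ids = toy_encode("Using transformers is simple with HuggingFace.")
print(tokens)
print(ids)
```

With this vocabulary the sketch reproduces the tokens and ids shown above, including the `101`/`102` special tokens that `tokenizer(sequence)` adds at both ends.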


### Combining the Code with PyTorch and Tensorflow

This section shows how to combine the HuggingFace `transformers` library with PyTorch and TensorFlow. The following examples deal with PyTorch, but the code for TensorFlow is very similar, often just adding the `TF` prefix to the classes.

In this example, instead of using the `pipeline`, we use the `tokenizer` and the `model` directly, as if they were PyTorch objects.

In [11]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

In [12]:
# We create instances of the tokenizer/model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Our input
X = ["I've been waiting for a HuggingFace course my whole life.",
     "Python is great!",
     "But I don't like at all the fact that there is no notebook available."]

batch = tokenizer(X,
                  padding=True,
                  truncation=True,
                  max_length=512,
                  return_tensors="pt") # PyTorch tensor format
print(batch) # {'input_ids': ..., 'attention_mask': ...}

# Feed the model and process the output
with torch.no_grad():
    outputs = model(**batch) # unpack the dictionary into keyword arguments
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels) # [1, 1, 0]: positive, positive, negative

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102,     0,     0,     0],
        [  101, 18750,  2003,  2307,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  2021,  1045,  2123,  1005,  1056,  2066,  2012,  2035,  1996,
          2755,  2008,  2045,  2003,  2053, 14960,  2800,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-4.2745,  4.6111],
        [ 3.5185, -2.9038]]), hidden_states=None, attentions=None)
tensor([[4.0195e-02, 9.5981e-01],
        [1.3835e-04, 9.9986e-01],
        [9.9838e-01, 1.6223e-03]])
tensor([1, 1, 0])
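
The softmax and argmax steps can be verified by hand without PyTorch. The sketch below reuses the logits printed above; the helper name `softmax` is just an illustrative plain-Python reimplementation.

```python
import math

# Logits copied from the SequenceClassifierOutput printed above.
logits = [[-1.5607, 1.6123], [-4.2745, 4.6111], [3.5185, -2.9038]]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shifted = [x - max(row) for x in row]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
labels = [max(range(len(row)), key=row.__getitem__) for row in probs]

for row in probs:
    print([round(p, 4) for p in row])
print(labels)  # [1, 1, 0]
```

The probability of the positive class for the first sentence is about 0.9598, the same score the `pipeline` reported earlier for that sentence.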


### Save and Load

It is possible to load HuggingFace objects (tokenizer, model, etc.), to fine-tune them (see section below) and save them. Then, we can load them when needed and use them.

In [13]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [14]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Now, we can use these objects
# and even fine-tune the model!
# After that, we would save the objects

In [15]:
# Save
save_directory = "./output/"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
# In the save_directory, several files are created
# - related to the tokenizer: vocab.txt, tokenizer.json, ...
# - model: pytorch_model.bin (or model.safetensors in newer versions)
# - general: config.json

# Load
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)

### Model Hub

There are more than 200k [HuggingFace Models](https://huggingface.co/models) available, built by the community, FAANG companies, etc. Notes to take into account:

- We can filter the models by
  - Task: Image Classification, Text Classification, etc.
  - Libraries: PyTorch, Tensorflow, JAX, Keras, etc.
  - Datasets
  - Languages: English, German, Spanish, etc.
  - Licenses
  - Other

Some of them have code examples; in any case, we just need to click on a desired one and copy the model name to use it in the `pipeline` (use the icon/button).

Some example models:

- [Text classification (default): distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
- [Summarization: facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)

As with PyTorch, the pre-trained models are cached under `~/.cache/`, specifically in `~/.cache/huggingface`. We can change that as follows:

- Setting the environment variable `HF_HOME`.
- Specifying it when loading: `AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir="/path/to/custom/cache/dir")`.
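
A minimal sketch of the environment-variable option is shown below. The path is a placeholder and `load_model` is a hypothetical helper; the important assumption is that the variable is set before `transformers` is imported, since the cache location is read at import time.

```python
import os

# Placeholder location; replace it with a directory that exists on your machine.
CUSTOM_CACHE = "/path/to/custom/cache/dir"

# Set the variable before importing transformers, because the cache
# location is read when the library is first imported.
os.environ["HF_HOME"] = CUSTOM_CACHE


def load_model(model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Load a model; its files are looked up and stored under HF_HOME."""
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(model_name)


# Usage (downloads into the custom cache on the first call):
# model = load_model()
```

Setting the variable in the shell or in the environment configuration achieves the same effect without touching the code.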

In [16]:
from transformers import pipeline

# Example of text summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
# [{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


### Finetuning Pre-Trained Models with Our Datasets

Once we have prepared our dataset, fine-tuning a HuggingFace model is as simple as using the `Trainer` class.

These notes follow the official tutorial [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/main/training), in which a model that predicts review ratings from text is fine-tuned on the [Yelp Reviews dataset](https://huggingface.co/datasets/yelp_review_full).

Summary of steps for fine-tuning:

1. Prepare the dataset
2. Load the pre-trained tokenizer and call it on the dataset to obtain the encodings
3. Build a PyTorch Dataset with the encodings
4. Load the pre-trained model
5. Train / fine-tune, using one of two options:
    - Create a `Trainer` and call its `train()` method
    - Write a native PyTorch training loop

In [12]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [13]:
# Load the Yelp Reviews Dataset
# https://huggingface.co/datasets/yelp_review_full
# The nice thing is that HuggingFace datasets
# are stored in the cache folder, but only the requested instances
# are loaded into memory. Usually, you have dictionary-like
# samples.
dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Found cached dataset yelp_review_full (C:/Users/Msagardi/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)



{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [14]:
# Information on the features
dataset["train"].features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None)}

In [15]:
# We instantiate the tokenizer and pack it into a function
# which is mapped to the items of the dataset
# We could apply several functions to the tokenized_datasets
# tokenized_datasets.remove_columns(...)
# tokenized_datasets.rename_columns(...)
# tokenized_datasets.with_format("torch")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at C:\Users\Msagardi\.cache\huggingface\datasets\yelp_review_full\yelp_review_full\1.0.0\e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-aad1af4c7095bfa1.arrow
Loading cached processed dataset at C:\Users\Msagardi\.cache\huggingface\datasets\yelp_review_full\yelp_review_full\1.0.0\e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-29f27748f0b54d01.arrow


In [16]:
# We create a smaller sub-dataset in this example
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at C:\Users\Msagardi\.cache\huggingface\datasets\yelp_review_full\yelp_review_full\1.0.0\e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-11a7619c6a3c070f.arrow
Loading cached shuffled indices for dataset at C:\Users\Msagardi\.cache\huggingface\datasets\yelp_review_full\yelp_review_full\1.0.0\e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-3c5c2a245be1b332.arrow


In [17]:
# Check all the features/elements in a sample
tokenized_datasets["train"].features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [9]:
# Get sample 0 and see its values: label, text, input_ids, etc.
# We could apply several functions to the tokenized_datasets
# tokenized_datasets.remove_columns(...)
# tokenized_datasets.rename_column("label", "labels")
# tokenized_datasets.with_format("torch")
tokenized_datasets["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 'input_ids': [101, 173, 1197, 119, 2284, 2953, 3272, 1917, 178, 1440, 1111, 1107, 170, 1704,
  22351, 119, 1119, 112, 188, 3505, 1105, 3123, 1106, 2037, 1106, 1443, 1217, 10063, 4404, 132,
  1119, 112, 188, 1579, 1113, 1159, 1107, 3195, 1117, 4420, 132, 1119, 112, 188, 6559, 1114,
  170, 1499, 118, 23555, 2704, 113, 183, 9379, 114, 1

#### Training with the PyTorch Trainer

In [34]:
import numpy as np
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import evaluate

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
# We need to define a metric using the HuggingFace library evaluate
metric = evaluate.load("accuracy")


In [33]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
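
To make concrete what `compute_metrics` returns, here is a dependency-free sketch of the same argmax-then-accuracy computation. The toy logits and labels are made up for illustration, and `accuracy_from_logits` is a hypothetical helper, not part of the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per sample and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Three made-up samples with 5 classes each (as in the 5-star Yelp setting).
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0],  # predicted class 1
    [1.5, 0.2, 0.1, 0.0, 0.0],  # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],  # predicted class 4
]
toy_labels = [1, 0, 3]

print(accuracy_from_logits(toy_logits, toy_labels))
```

Two of the three toy predictions match their labels, so the reported accuracy is 2/3.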

In [35]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

In [36]:
# Create a Trainer instance with all defined components
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [37]:
# TRAIN!
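# WARNING: by default, the Trainer reports to the installed logging integrations (e.g., MLflow and W&B)
# W&B asks for an API key; pass report_to="none" to TrainingArguments to disable this logging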
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

#### Training in Native PyTorch

In [18]:
tokenized_datasets["train"].features

{'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'text': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [19]:
# In order to be able to run the native PyTorch training loop
# we need to modify the dataset as follows
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [20]:
# Now, we have a modified dataset
tokenized_datasets["train"].features

{'labels': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [21]:
# We take a smaller subset
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [22]:
# Data Loader
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

In [24]:
# Model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
# Optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [33]:
# Scheduler
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
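
The effect of the linear schedule can be worked out by hand. The sketch below recomputes the number of training steps from the values used in this notebook (1000 samples, batch size 8, 3 epochs, learning rate 5e-5, no warmup) and evaluates a plain-Python version of a linear decay. The function `linear_lr` is an illustrative reimplementation under these assumptions, not the library code.

```python
import math

# Values used in this notebook.
num_samples, batch_size, num_epochs = 1000, 8, 3
base_lr, num_warmup_steps = 5e-5, 0

steps_per_epoch = math.ceil(num_samples / batch_size)  # equals len(train_dataloader)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step):
    """Linear warmup (if any) followed by a linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


print(num_training_steps)
for step in (0, 125, 250, 375):
    print(step, linear_lr(step))
```

With these values there are 375 optimizer steps in total, and the learning rate falls from 5e-5 at step 0 to 0 at the last step.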

In [31]:
# Device
import torch

#device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device = "cpu"
print(device)
model.to(device)

cpu


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
# Training loop
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

In [None]:
# Evaluate
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

### Finetuning Pre-Trained Models with Custom Datasets - Example with IMDB

Source: [Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.2.0/custom_datasets.html).

In [38]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2023-07-21 10:34:42--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu... 171.64.68.10
Connecting to ai.stanford.edu|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: `aclImdb_v1.tar.gz'

     0K .......... .......... .......... .......... ..........  0%  100K 13m41s
    50K .......... .......... .......... .......... ..........  0%  290K 9m12s
   100K .......... .......... .......... .......... ..........  0%  276K 7m47s
   150K .......... .......... .......... .......... ..........  0%  320K 6m54s
   200K .......... .......... .......... .......... ..........  0%  298K 6m26s
   250K .......... .......... .......... .......... ..........  0% 3.33M 5m25s
   300K .......... .......... .......... .......... ..........  0%  322K 5m15s
   350K .......... .......... ...

In [39]:
!tar -xf aclImdb_v1.tar.gz

In [3]:
# This data is organized into pos and neg folders with one text file per example
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')


In [4]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [5]:
# We’ll eventually train a classifier using pre-trained DistilBert,
# so let’s use the DistilBert tokenizer.
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [6]:
# Now we can simply pass our texts to the tokenizer.
# We’ll pass truncation=True and padding=True,
# which will ensure that all of our sequences are padded to the same length
# and are truncated to be no longer than the model’s maximum input length.
# This will allow us to feed batches of sequences into the model at the same time.
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
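
What truncation and padding do to a batch can be sketched in plain Python. The token ids below are made up, `pad_and_truncate` is a hypothetical helper, and the truncation is simplified to plain slicing (a real tokenizer also keeps the closing special token); the pad id 0 matches the padding visible in the tokenizer output earlier in this notebook.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut sequences to max_length, then pad all of them to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths.
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 6251, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded)
```

After this step every sequence in the batch has the same length, which is what allows the batch to be stacked into a single tensor.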

In [8]:
# Let’s turn our labels and encodings into a Dataset object
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [None]:
# Fine-tuning with Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

In [None]:
# Fine-tuning with native PyTorch
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated

#device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device = "cpu"

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()