# Setup and Disclaimer

Run the following cell to install the required packages.

We will use PyTorch as the backend.

In [1]:
! pip install transformers datasets peft evaluate "transformers[torch]"



# Pipeline

`pipeline()` is the simplest way to use a Hugging Face model. **Model** is an umbrella term describing an **architecture** (i.e. the skeleton/structure of the model) and a **checkpoint** (i.e. the set of weights that we load on it).

For example, BERT is an architecture while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say both “the BERT model” and “the `bert-base-cased` model”.

In [2]:
from transformers import pipeline

import torch

print(torch.cuda.is_available())
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline("sentiment-analysis", checkpoint)
classifier(["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"])

True




[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [3]:
generator = pipeline("text-generation")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to create an online service that allows you to communicate with the Internet anonymously. This service will provide online shopping'},
 {'generated_text': 'In this course, we will teach you how to make a new keyboard:\n\nCreating a keyboard at Unity Engine, in HTML5, CSS,'}]

As you can see from the log above, if you provide no model checkpoint, `pipeline()` falls back to a default checkpoint for the task (here `openai-community/gpt2`).

See the [Hugging Face course](https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt) for other usage examples of the `pipeline()` function.

Now use `pipeline()` to perform a different task.

In [4]:
### IMP CODE HERE ###

checkpoint = "bert-base-uncased"
unmasker = pipeline("fill-mask", checkpoint)
unmasker("This course will teach you all about [MASK] models.", top_k=10)

######################

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3366122841835022,
  'token': 2535,
  'token_str': 'role',
  'sequence': 'this course will teach you all about role models.'},
 {'score': 0.08371730148792267,
  'token': 2449,
  'token_str': 'business',
  'sequence': 'this course will teach you all about business models.'},
 {'score': 0.05718289315700531,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'this course will teach you all about the models.'},
 {'score': 0.04392532259225845,
  'token': 2115,
  'token_str': 'your',
  'sequence': 'this course will teach you all about your models.'},
 {'score': 0.026684647426009178,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': 'this course will teach you all about fashion models.'},
 {'score': 0.017819639295339584,
  'token': 2166,
  'token_str': 'life',
  'sequence': 'this course will teach you all about life models.'},
 {'score': 0.015942931175231934,
  'token': 2122,
  'token_str': 'these',
  'sequence': 'this course will teach you all about these models.'},
 {'s

# Behind the Pipeline

The `pipeline()` function takes care of all three steps: tokenization, the forward pass through the model, and post-processing. We now go one level lower to gain a better understanding of how everything works.

### Tokenizer

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sequence = "I've been waiting for a HuggingFace course my whole life."

print(tokenizer(sequence))

tokens = tokenizer.tokenize(sequence) # You can also use the __call__ function directly

print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

decoded_string = tokenizer.decode(ids)

print(decoded_string)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
i've been waiting for a huggingface course my whole life.


What are the tokens with indices 101 and 102? Find out!

[CLS] and [SEP].

### Padding
Putting several inputs together is called batching. To batch sequences, they must all have the same length, so shorter inputs need to be **padded**.

In [6]:
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# This will work

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(batched_ids)).logits)

# But this will fail: the two rows have different lengths
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

print(model(torch.tensor(batched_ids)).logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[-0.1976, -0.1566],
        [-0.2282, -0.1736]], grad_fn=<AddmmBackward0>)


ValueError: expected sequence of length 3 at dim 1 (got 2)

### Attention Mask
Remember that the model attends to all tokens, and we don't want our predictions to depend on the padding tokens. We use **attention masks** to indicate which tokens should be attended to (1) and which should be ignored (0).

In [7]:
# This prediction
sentence_1_ids = [[200, 200, 200]]
sentence_2_ids = [[200, 200]]

# will be different from this one
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

# so we add the attention masks
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print(f"{model(torch.tensor(sentence_1_ids)).logits}\n"\
      f"{model(torch.tensor(sentence_2_ids)).logits}")
print(model(torch.tensor(batched_ids)).logits)
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[-0.1976, -0.1566]], grad_fn=<AddmmBackward0>)
tensor([[-0.1632, -0.1320]], grad_fn=<AddmmBackward0>)
tensor([[-0.1976, -0.1566],
        [-0.2282, -0.1736]], grad_fn=<AddmmBackward0>)
tensor([[-0.1976, -0.1566],
        [-0.1632, -0.1320]], grad_fn=<AddmmBackward0>)
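To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch that builds both at once. The helper name `pad_and_mask` is hypothetical, and `pad_id=0` is an assumption that matches the padded outputs printed in this notebook; in practice the tokenizer does this for you.

```python
def pad_and_mask(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Same toy ids as in the cell above (pad_id=0 is an assumption).
padded = pad_and_mask([[200, 200, 200], [200, 200]], pad_id=0)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The two lists have the same shape, which is why they can be converted to tensors and passed to the model together.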


### Truncate
Most Transformer models handle sequences of up to 512 or 1024 tokens and will crash when asked to process longer ones. There are two solutions to this problem: either (1) use a model with a longer supported sequence length or (2) truncate your sequences.
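As a rough illustration of truncation, the sketch below trims a list of token ids to a maximum length while keeping the special tokens at both ends. The helper `truncate_ids` is hypothetical, and the ids 101 and 102 are assumed to be the [CLS] and [SEP] ids seen earlier; in practice you would pass `truncation=True` to the tokenizer instead.

```python
def truncate_ids(ids, max_length, cls_id=101, sep_id=102):
    """Trim the tokens between [CLS] and [SEP] so the result fits max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = ids[1:-1]        # assumes ids starts with [CLS] and ends with [SEP]
    keep = max_length - 2   # room left once both special tokens are counted
    return [cls_id] + body[:keep] + [sep_id]


# Ids of the example sentence, as printed by the tokenizer earlier.
ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
short_ids = truncate_ids(ids, max_length=8)
print(short_ids)
```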

In [8]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# model_inputs is a dictionary of lists: `input_ids` holds the token ids of each sequence and `attention_mask` holds the matching attention masks
print(f"model_inputs with padding according to longest sequence: {model_inputs}")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print(f"model_inputs with padding according to max_length specified (512 for BERT and DistilBERT): {model_inputs}")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(f"model_inputs with padding according to max_length specified (here 8): {model_inputs}")

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print(f"model_inputs with truncation: {model_inputs}")

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print(f"model_inputs with truncation with max length 8: {model_inputs}")

model_inputs with padding according to longest sequence: {'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
model_inputs with padding according to max_length specified (512 for BERT and DistilBERT): {'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

### Through the Model

Why does the following code fail? Can you fix it?

By default, models expect a batch of sequences, so we need to add a batch dimension to our `input_ids`.

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

### IMP CODE HERE ###

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids).unsqueeze(0)
########################

# Without the batch dimension added above, this line would fail.
model(input_ids).logits

tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)

Now let's apply everything we have learned and run the model on `raw_inputs` without using the `pipeline()` function. We placed some imports as hints. We're using the same inputs and model checkpoint, so you should get the same output as before!

In [10]:
from transformers import AutoModelForSequenceClassification
from torch import argmax
from torch.nn.functional import softmax

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

### IMP CODE HERE ###
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_inputs = tokenizer(raw_inputs, padding='longest')

input_ids = torch.tensor(model_inputs['input_ids'])

attention_masks = torch.tensor(model_inputs['attention_mask'])

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

output = model(input_ids, attention_mask=attention_masks)
print(output.logits)

probs = softmax(output.logits, dim=1)
prob_ = argmax(probs, dim=1)
sentiment = []
for i, idx in enumerate(prob_):
    if idx==1:
        label = 'Positive'
    else:
        label = 'Negative'
    sentiment.append({'label': label, 'score': probs[i,idx].item()})

print(sentiment)

######################

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)
[{'label': 'Positive', 'score': 0.9598048329353333}, {'label': 'Negative', 'score': 0.9994558691978455}]
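The post-processing step can be checked by hand. The sketch below applies a plain-Python softmax to the logits printed above and picks the most likely class. The label mapping (index 1 is positive) follows the code in the cell; the upper-case label names are an assumption borrowed from the pipeline output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed mapping, see lead-in

# Logits copied from the output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

Up to the rounding of the printed logits, the scores agree with the ones returned by the model and by `pipeline()`.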


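Your code will very likely fail if you use the class `AutoModel` instead of `AutoModelForSequenceClassification`. Why?

`AutoModel` loads only the base model without the classification head, so its output contains hidden states and has no `logits` attribute.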



# Training

We start by loading a dataset with the `datasets` library.

In [11]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)


raw_train_dataset = raw_datasets['train']
print(raw_train_dataset.features.keys())

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
dict_keys(['sentence1', 'sentence2', 'label', 'idx'])


In [12]:
print(raw_datasets['validation']['sentence1'])



GLUE MRPC is a sentence-pair classification task: (sentence1, sentence2) -> are they semantically equivalent?

By running the tokenizer we can observe that the model expects inputs of the form [CLS] sentence1 [SEP] sentence2 [SEP]. The `token_type_ids` indicate which part is sentence1 and which part is sentence2.

Do not worry if you don't see token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.

In [13]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs)

print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
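The layout of a sentence pair can be reproduced by hand. This is a minimal sketch, assuming the [CLS] and [SEP] ids 101 and 102 seen in the output above; the helper `encode_pair` is hypothetical and only mirrors what the tokenizer returns for this example.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Build [CLS] A [SEP] B [SEP] together with segment ids and attention mask."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids of the two sentences, copied from the output above.
first = [2023, 2003, 1996, 2034, 6251, 1012]
second = [2023, 2003, 1996, 2117, 2028, 1012]
encoded = encode_pair(first, second)
print(encoded)
```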


Tokenizing the whole dataset with a single tokenizer call would require holding the entire tokenized dataset in memory. Instead, we apply the tokenizer with the `map()` function and pass `batched=True` so that the function receives several examples per call.

This also lets us use dynamic padding through a `DataCollatorWithPadding`, i.e. we pad to the maximum length within each batch and not within the entire dataset.

Define a `tokenize_function()` which we can then apply to the entire dataset. Don't forget, each sample is made of two sentences!

In [14]:
from transformers import DataCollatorWithPadding


### IMP CODE HERE ###
def tokenize_function(example):
    # Each sample is a sentence pair, so pass both sentences to the tokenizer.
    return tokenizer(example['sentence1'], example['sentence2'])
######################

# batched=True passes several examples to tokenize_function per call
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

samples = tokenized_datasets["train"][:8]

samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
print(f'samples:{samples.items()}')

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Print shapes for one batch
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

samples:dict_items([('label', [1, 0, 1, 0, 1, 1, 0, 1]), ('input_ids', [[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], [101, 9805, 3540, 11514, 2050, 3079, 11282, 2243, 1005, 1055, 2077, 4855, 1996, 4677, 2000, 3647, 4576, 1999, 2687, 2005, 1002, 1016, 1012, 1019, 4551, 1012, 102, 9805, 3540, 11514, 2050, 4149, 11282, 2243, 1005, 1055, 1999, 2786, 2005, 1002, 6353, 2509, 2454, 1998, 2853, 2009, 2000, 3647, 4576, 2005, 1002, 1015, 1012, 1022, 4551, 1999, 2687, 1012, 102], [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102, 2006, 2238, 2184, 1010, 1996, 2911, 1005, 1055, 5608, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 1010, 5378, 1996, 14792, 2005, 5096

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
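To see why padding per batch is cheaper than padding to a dataset-wide maximum, the sketch below counts how many token slots each strategy allocates. The function `padded_token_count` and the sequence lengths are hypothetical; only the idea (pad to the longest sequence in each batch) comes from the text above.

```python
def padded_token_count(lengths, batch_size, dynamic=True):
    """Total number of token slots after padding, for a list of sequence lengths."""
    if not dynamic:
        # Static padding: every sequence is padded to the longest in the dataset.
        return max(lengths) * len(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += max(chunk) * len(chunk)  # pad only to the longest in this batch
    return total


lengths = [50, 59, 47, 67, 12, 9, 14, 11]  # hypothetical sequence lengths
dynamic_slots = padded_token_count(lengths, batch_size=4, dynamic=True)
static_slots = padded_token_count(lengths, batch_size=4, dynamic=False)
print(dynamic_slots, static_slots)
```

The saving depends on how the lengths are distributed across batches, so the numbers here are only illustrative.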

Now that we can pad batches dynamically, we can feed them to the model to train it. We use the high-level `Trainer` class.

In [15]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

training_args = TrainingArguments("test-trainer") #hyperparams for the training

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

{'loss': 0.505, 'grad_norm': 8.064994812011719, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}
{'loss': 0.2662, 'grad_norm': 27.9360408782959, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}
{'train_runtime': 506.2692, 'train_samples_per_second': 21.735, 'train_steps_per_second': 2.72, 'train_loss': 0.31558861659762644, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.31558861659762644, metrics={'train_runtime': 506.2692, 'train_samples_per_second': 21.735, 'train_steps_per_second': 2.72, 'total_flos': 405114969714960.0, 'train_loss': 0.31558861659762644, 'epoch': 3.0})

Note that `Trainer` doesn't report evaluation metrics by default. We can load the metric associated with the dataset through `evaluate` and compute it on our model's predictions.


In [16]:
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

  0%|          | 0/51 [00:00<?, ?it/s]

{'accuracy': 0.8578431372549019, 'f1': 0.9016949152542373}
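The two reported numbers are standard accuracy and F1 on the positive class. As a hedged illustration (this is not the `evaluate` implementation), they can be computed from scratch as follows; the helper name and the toy predictions are made up.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 for the positive class, written from scratch."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy data (hypothetical): three of four predictions are correct.
scores = accuracy_and_f1(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
print(scores)
```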

Use `evaluate.load()` and `metric.compute()` to complete the `compute_metrics()` function. We can then pass it to our `Trainer` object to track these metrics during training.

In [17]:
import numpy as np
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

### IMP CODE HERE ###

def compute_metrics(predictions):

    preds = np.argmax(predictions.predictions, axis=-1)


    metric = evaluate.load("glue", "mrpc")
    return metric.compute(predictions=preds, references=predictions.label_ids)

######################


trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.5063608288764954, 'eval_accuracy': 0.7769607843137255, 'eval_f1': 0.8557844690966719, 'eval_runtime': 7.7911, 'eval_samples_per_second': 52.367, 'eval_steps_per_second': 6.546, 'epoch': 1.0}
{'loss': 0.5294, 'grad_norm': 2.508415460586548, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}


  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.5048918128013611, 'eval_accuracy': 0.8382352941176471, 'eval_f1': 0.8865979381443299, 'eval_runtime': 7.4654, 'eval_samples_per_second': 54.652, 'eval_steps_per_second': 6.832, 'epoch': 2.0}
{'loss': 0.3255, 'grad_norm': 6.627474784851074, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}


  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.6442598104476929, 'eval_accuracy': 0.8431372549019608, 'eval_f1': 0.891891891891892, 'eval_runtime': 7.8868, 'eval_samples_per_second': 51.732, 'eval_steps_per_second': 6.467, 'epoch': 3.0}
{'train_runtime': 593.5836, 'train_samples_per_second': 18.538, 'train_steps_per_second': 2.32, 'train_loss': 0.36934398687136893, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.36934398687136893, metrics={'train_runtime': 593.5836, 'train_samples_per_second': 18.538, 'train_steps_per_second': 2.32, 'total_flos': 405114969714960.0, 'train_loss': 0.36934398687136893, 'epoch': 3.0})

# Behind the Trainer

We now go one level below the `Trainer` class and implement the training procedure in plain PyTorch.

Starting from the same setup:

In [18]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'])

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We set up the PyTorch DataLoaders:

In [19]:
from torch.utils.data import DataLoader

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

We set up an optimizer and a learning-rate scheduler:

In [20]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
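The number of steps and the shape of the linear schedule can be verified with a few lines of arithmetic. The sketch below is a from-scratch approximation of a linear schedule with warmup, not the library code; `linear_lr` is a hypothetical helper. It assumes 3668 training rows, batch size 8 and a base learning rate of 5e-5, as used in this notebook.

```python
import math


def linear_lr(step, base_lr=5e-5, num_training_steps=1377, num_warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to 0 (sketch)."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


steps_per_epoch = math.ceil(3668 / 8)  # the last, smaller batch still counts as a step
total_steps = 3 * steps_per_epoch
print(total_steps)

for step in (0, 500, 1000, 1377):
    print(step, linear_lr(step))
```

Assuming the `Trainer` logs every 500 steps, the values at steps 500 and 1000 line up with the `learning_rate` entries in the `Trainer` logs earlier (about 3.18e-05 and 1.37e-05).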




Now write the PyTorch code inside the training loop.

In [1]:
from tqdm.auto import tqdm
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:

        ### IMP CODE HERE ###
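        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # the batch contains labels, so the loss is computed
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        lr_scheduler.step()  # advance the linear schedule defined above
        optimizer.zero_grad()
        progress_bar.update(1)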
 
        ######################

NameError: name 'model' is not defined

Does it work? Congrats! You have implemented a training loop from scratch and can now run the final evaluation. You should get results similar to those of the `Trainer` class.

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}

# Parameter-Efficient Fine-Tuning

You're given the same setup:

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification


raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We provide you with two functions: one counts total and trainable parameters, and the other defines the evaluation metrics.

In [3]:
def count_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {'Total': total_params, 'Trainable': trainable_params}

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Use LoRA by defining a configuration (a `LoraConfig` object) and use it with the `Trainer` class to fine-tune our model efficiently.
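For intuition about what `r` and `lora_alpha` control, here is a minimal pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added to its output. All names and the tiny matrices are illustrative assumptions; this is not the PEFT implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pretrained projection
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]    # frozen weight, shape (d_out=2, d_in=2)
A = [[1, 1]]            # trainable, shape (r=1, d_in=2)
B_zero = [[0], [0]]     # zero B: the adapter changes nothing
B_trained = [[1], [0]]  # hypothetical values after some training
x = [2, 3]

print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=2))
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=2))
```

Only `A` and `B` would be trained, which is why the number of trainable parameters drops so sharply.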

In [4]:
from peft import LoraConfig, TaskType, get_peft_model
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np


### CODE HERE ###

lora_config = LoraConfig(r=16, lora_alpha=16, task_type=TaskType.SEQ_CLS)

######################


# Count parameters before LoRA
before_lora_count = count_parameters(model)
print(f"Before LoRA:\n{before_lora_count}")

# Apply LoRA to the model
lora_model = get_peft_model(model, lora_config)

# Count parameters after LoRA
after_lora_count = count_parameters(lora_model)
print(f"After LoRA:\n{after_lora_count}")

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")  # Hyperparams for the training

### CODE HERE ###

trainer = Trainer(
    lora_model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

######################

trainer.train()

Before LoRA:
{'Total': 109483778, 'Trainable': 109483778}
After LoRA:
{'Total': 110075140, 'Trainable': 591362}


  0%|          | 0/1377 [00:00<?, ?it/s]

  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.6200363039970398, 'eval_accuracy': 0.6838235294117647, 'eval_f1': 0.8122270742358079, 'eval_runtime': 7.6062, 'eval_samples_per_second': 53.641, 'eval_steps_per_second': 6.705, 'epoch': 1.0}
{'loss': 0.6363, 'grad_norm': 2.6974666118621826, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}




  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.5219185948371887, 'eval_accuracy': 0.7426470588235294, 'eval_f1': 0.832535885167464, 'eval_runtime': 7.9429, 'eval_samples_per_second': 51.367, 'eval_steps_per_second': 6.421, 'epoch': 2.0}
{'loss': 0.5776, 'grad_norm': 4.589046001434326, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}




  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.5013276934623718, 'eval_accuracy': 0.7696078431372549, 'eval_f1': 0.8478964401294499, 'eval_runtime': 7.8577, 'eval_samples_per_second': 51.923, 'eval_steps_per_second': 6.49, 'epoch': 3.0}
{'train_runtime': 326.8629, 'train_samples_per_second': 33.665, 'train_steps_per_second': 4.213, 'train_loss': 0.5822257912496596, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.5822257912496596, metrics={'train_runtime': 326.8629, 'train_samples_per_second': 33.665, 'train_steps_per_second': 4.213, 'total_flos': 407912107244064.0, 'train_loss': 0.5822257912496596, 'epoch': 3.0})
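The parameter counts printed before training can be reproduced with simple arithmetic. The accounting below is one that matches the printed numbers. It assumes that `bert-base-uncased` has 12 layers with hidden size 768, that LoRA adapters are attached to the query and value projections of every layer, and that the classification head stays trainable as a separate copy. The last two points are assumptions about PEFT's defaults, not something stated in this notebook.

```python
hidden_size = 768      # bert-base-uncased hidden size
num_layers = 12        # bert-base-uncased encoder layers
r = 16                 # LoRA rank from the LoraConfig above
targets_per_layer = 2  # assumption: query and value projections

# Each adapted projection gets A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden_size + hidden_size * r
lora_params = lora_per_module * targets_per_layer * num_layers

# Classification head: one linear layer from hidden_size to 2 labels, plus bias.
classifier_params = hidden_size * 2 + 2

trainable = lora_params + classifier_params
total_before = 109_483_778  # printed "Before LoRA" total
total_after = total_before + trainable

print(lora_params, classifier_params, trainable, total_after)
print(f"trainable share: {trainable / total_after:.2%}")
```

Under these assumptions, roughly half a percent of the parameters receive gradients.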