# Hugging Face Introduction

Hugging Face is a company and open-source platform that has become a leader in natural language processing (NLP) and machine learning (ML). It provides pre-trained models, datasets, and APIs that make it easy for developers and researchers to work with state-of-the-art machine learning models.

Its libraries cover pretrained models, datasets, tokenizers, training, fine-tuning, and inference.

For more information, see https://huggingface.co/.

This lecture introduces the basic usage of the Hugging Face API. Because the API is very extensive, it is not possible to cover everything within the limited time of the course. When studying or working on projects, I encourage you to actively use the official Hugging Face documentation, Google search, and tools like ChatGPT to find answers and explore additional features.

In this lecture, we cover:

1. How to download pretrained models and tokenizers.
2. How to generate text from pretrained language models.
3. How to download, prepare, and preprocess language datasets.
4. How to fine-tune language models.

This lecture requires a GPU runtime.

You can enable a GPU in Colab with the following steps:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Let us check that a GPU is correctly assigned to your runtime.

In [1]:
!nvidia-smi

Thu Apr 10 04:04:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   50C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install datasets -q 2> /dev/null


In [3]:
# helper library for pretty printing dictionary and tensors
import rich
from rich.pretty import pprint

try:
    import os
    from google.colab import drive
    drive.mount("/content/drive")
    os.chdir("/content/drive/MyDrive/lec11")
except Exception as e:
    print(e)

Mounted at /content/drive


## 1. Loading Pretrained Transformer Model and Tokenizer

In this example, we will learn how to download pretrained language models from the Hugging Face Hub.

The library for loading pretrained language models is `transformers`. First install transformers with `pip install transformers`.

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

Models and tokenizers can be loaded with the `.from_pretrained` method.

Let us download the GPT-2 model!

In [5]:
model_id = "openai-community/gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:

def print_example(tokenizer, text):
    ids = tokenizer.encode(text)
    print("======Tokenizer example======")
    print(f'encode("{text}")\n= {ids}')
    print()
    print(f"decode({ids})\n=" , [tokenizer.decode([tok]) for tok in ids])
    print("=============================")

# Tokenizing a text
text = "My name is ChatGPT. I am an AI assistant."
print_example(tokenizer, text)

encode("My name is ChatGPT. I am an AI assistant.")
= [3666, 1438, 318, 24101, 38, 11571, 13, 314, 716, 281, 9552, 8796, 13]

decode([3666, 1438, 318, 24101, 38, 11571, 13, 314, 716, 281, 9552, 8796, 13])
= ['My', ' name', ' is', ' Chat', 'G', 'PT', '.', ' I', ' am', ' an', ' AI', ' assistant', '.']


When using a tokenizer with Hugging Face `transformers`, we usually pass `return_tensors="pt"`
to get a PyTorch tensor rather than a Python list.

In [7]:
rich.print("tensor output:", tokenizer.encode(text, return_tensors="pt"), sep="\n")

By calling the `tokenizer` directly, we can obtain outputs such as `input_ids` and `attention_mask` in dictionary format. These can be passed directly to a transformer model's `forward` function.

- `input_ids` contains the tokenized representation of the input text. Its shape is $(\text{batch size}, \text{sequence length})$.
- `attention_mask` is a binary (0-1) mask of the same shape as `input_ids`. A value of 0 indicates that the corresponding token (typically padding) should not be attended to by any query token.

One convenient feature of the tokenizer output is that we can move **all tensors to a device (e.g., GPU)** in a single line using `.to(device)`:

```python
inputs = tokenizer(text, return_tensors="pt").to(device)
```

Transformer models in Hugging Face accept multiple keyword arguments like `input_ids` and `attention_mask`. These tokenized outputs are typically passed to the model using unpacking:

```python
outputs = model(**inputs)
```

For example, see the [GPT-2 model’s source code](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L1036) to explore how these inputs are handled internally.

In [8]:
rich.print("tokenizer output:", tokenizer(text, return_tensors="pt"), sep="\n")

You can encode multiple texts into a batched form using the tokenizer.

When encoding multiple texts with varying lengths, you can enable **padding** to make all sequences the same length. Padding is done using the tokenizer's `pad_token`.

If the tokenizer does not have a predefined `pad_token`, it's common to use the `eos_token` as a substitute.

Note that `attention_mask` is set to 0 at the padded positions.
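To make the padding rule concrete, here is a minimal sketch in plain Python. It is a toy re-implementation for illustration, not the tokenizer's actual code; the helper name `pad_batch` and the short token id lists are hypothetical.

```python
# Toy re-implementation of batch padding in plain Python (illustration only;
# the real tokenizer does this internally when padding=True).
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, real = [pad_id] * n_pad, [1] * len(seq)
        if padding_side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + real)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(real + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; 50256 is the <|endoftext|> id that this notebook
# uses as the pad token for GPT-2.
batch = pad_batch([[11, 12, 13, 14], [21, 22]], pad_id=50256)
print(batch["input_ids"])       # [[11, 12, 13, 14], [50256, 50256, 21, 22]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Left padding keeps the real tokens at the end of each row, which is convenient for generation because the model continues from the last position.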

In [9]:
text1 = "My name is ChatGPT. I am an AI assistant."
text2 = "Hello, world!"

print("length of tokenized text1:", len(tokenizer.encode(text1)))
print("length of tokenized text2:", len(tokenizer.encode(text2)))

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

print(f"Padding token: {tokenizer.pad_token}({tokenizer.pad_token_id})")

rich.print("batched tokenization:", tokenizer([text1, text2], return_tensors="pt", padding=True), sep="\n")

length of tokenized text1: 13
length of tokenized text2: 4
Padding token: <|endoftext|>(50256)


Tokenizer outputs can be passed directly to the model's `forward` function.

A Hugging Face model's `forward` function returns a dictionary-like output object,
including `logits` with shape `(Batch Size, Sequence Length, Vocabulary Size)`.

In [10]:
device = torch.device("cuda:0")
inputs = tokenizer([text1, text2], return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)
print("model output keys:", outputs.keys(), "\n")
print("logit shape:", outputs.logits.shape, "= (Batch Size, Sequence Length, Vocabulary Size)")

model output keys: odict_keys(['logits', 'past_key_values']) 

logit shape: torch.Size([2, 13, 50257]) = (Batch Size, Sequence Length, Vocabulary Size)


In many Hugging Face transformer models (especially for tasks like language modeling, sequence classification, etc.), if you provide a `labels` argument during the forward pass, the model will automatically compute the negative log likelihood loss for you.

The `labels` argument is an integer tensor with the same shape as `input_ids`,
in which each element is either $-100$ (ignored by the loss) or a token ID in $[0, 1, \dots, \texttt{vocab_size} - 1]$.
In causal language modeling (e.g., GPT-style models), the model is trained to predict the next token in the sequence.
The logits at position $t$, computed from the tokens up to position $t$ in `input_ids`, are used to predict the token at position $t+1$ in `labels`.

You do not need to shift anything yourself: pass `labels` aligned with `input_ids` (typically a copy), and the model shifts them by one position internally.


For example, the loss is computed inside the model code as follows.
```python
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()  # labels equal to -100 are ignored (the default ignore_index)
    loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))
```


This is particularly useful during training, as you don’t need to manually implement the loss function for common tasks.
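The shifted loss can be reproduced by hand on a toy example. The following is a minimal sketch in plain Python, assuming a single sequence and list-based logits; the function names are hypothetical and this is not the library's implementation.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped in the loss

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def causal_lm_loss(logits, labels):
    """Mean negative log-likelihood with the one-position shift.

    logits: one list of vocabulary scores per position
    labels: one integer per position (same length as logits)
    """
    shift_logits = logits[:-1]   # predictions made at positions 0..T-2
    shift_labels = labels[1:]    # targets are the tokens at positions 1..T-1
    losses = [
        -log_softmax(row)[target]
        for row, target in zip(shift_logits, shift_labels)
        if target != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)

# Toy example: vocabulary of 4 tokens, sequence of 3 positions, uniform logits.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_labels = [2, 1, 3]
loss = causal_lm_loss(toy_logits, toy_labels)
print(loss, math.log(4))  # uniform predictions give a loss of log(vocab size)
```

Note that the first label is never used as a target and the logits at the last position are never scored, which is exactly the effect of the two slices.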

In [11]:
inputs = tokenizer(text1, return_tensors="pt", padding=True).to(device)
labels = inputs['input_ids'].clone()
inputs['labels'] = labels

rich.print("Example inputs:")
rich.print(inputs)

outputs = model(**inputs)
rich.print("Keys in outputs", outputs.keys())
rich.print("loss value", outputs.loss)

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


## 2. Generating a Text

Hugging Face models have a `model.generate` method, which implements several of the decoding strategies we learned,
such as greedy decoding, top-$k$ sampling, top-$p$ sampling, and beam search.

The generation parameters are handled by the `GenerationConfig` class.
See the [Hugging Face Text Generation Documentation](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/text_generation#transformers.GenerationConfig)
for the full parameter list and explanations.

Let's see some examples!

### 2.1. Greedy Search

Calling `model.generate` without specifying a decoding strategy performs greedy search by default.

In [20]:
from transformers import GenerationConfig

inputs = tokenizer("Machine Learning is", return_tensors="pt").to(device)
config = GenerationConfig(
    stop_strings="\n", # stop when newline character is generated
    max_new_tokens=30, # max number of tokens to generate
    pad_token_id=tokenizer.eos_token_id, # pad_token_id is set only to suppress a warning message
)
outputs = model.generate(**inputs, generation_config=config, tokenizer=tokenizer)
print("Output with greedy decoding:", tokenizer.decode(outputs[0]))

Output with greedy decoding: Machine Learning is a new approach to machine learning that uses machine learning to learn from data.



### 2.2. Beam Search

Setting `num_beams=<n>` with $n>1$ performs beam search.

In [18]:
inputs = tokenizer("Machine Learning is", return_tensors="pt").to(device)
config = GenerationConfig(
    num_beams=10,      # beam search with beam size 10
    stop_strings="\n", # stop when newline character is generated
    max_new_tokens=30, # max number of tokens to generate
    pad_token_id=tokenizer.eos_token_id, # pad_token_id is set only to suppress a warning message
)
outputs = model.generate(**inputs, generation_config=config, tokenizer=tokenizer)
print("Output with beam search:", tokenizer.decode(outputs[0]))

Output with beam search: Machine Learning is an open-source, open-source, open-source, open-source, open-source, open-source, open-source, open
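To see how beam search differs from greedy search, here is a minimal sketch on a toy model in plain Python. The probability table `NEXT` is made up for illustration, and the sketch leaves out details that `model.generate` handles (such as length penalties), so treat it as a simplified picture rather than the library's algorithm.

```python
import math

# Hypothetical toy "language model": next-token probabilities depend only on
# the previous token. "<s>" starts a sequence and "</s>" ends it.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def beam_search(num_beams, max_new_tokens=5):
    """Keep the num_beams highest-scoring prefixes; num_beams=1 is greedy."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished beams are carried over
                candidates.append((score, tokens))
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Sort by score (highest first) and keep the best num_beams prefixes.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    best_score, best_tokens = beams[0]
    return best_tokens, math.exp(best_score)

greedy_tokens, greedy_prob = beam_search(num_beams=1)
beam_tokens, beam_prob = beam_search(num_beams=2)
print(greedy_tokens, greedy_prob)  # greedy commits to "a" and ends near 0.3
print(beam_tokens, beam_prob)      # two beams find the "b z" path near 0.36
```

Greedy search picks the locally best first token and cannot recover, while keeping a second beam lets the search find a sequence with higher overall probability.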


### 2.3. Top-$k$ sampling

Setting `top_k=<k>` together with `do_sample=True` performs top-$k$ sampling.

In [26]:
inputs = tokenizer("Machine Learning is", return_tensors="pt").to(device)
config = GenerationConfig(
    top_k=100,         # top-k sampling with k=100
    do_sample=True,    # use sampling rather than deterministic decoding
    stop_strings="\n", # stop when newline character is generated
    max_new_tokens=30, # max number of tokens to generate
    pad_token_id=tokenizer.eos_token_id, # pad_token_id is set only to suppress a warning message
)
outputs = model.generate(**inputs, generation_config=config, tokenizer=tokenizer)
print("Output with top-k sampling:", tokenizer.decode(outputs[0]))

Output with top-k sampling: Machine Learning is an excellent way to explore questions used to solve specific problems. In this article we discuss how to visualize and visualize one-dimensional solutions to multiple problems such
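The filtering step behind top-$k$ sampling can be sketched in a few lines of plain Python. The distribution `next_token_probs` and the helper names are hypothetical; the library operates on logits and tensors, so this is only an illustration of the idea.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def sample(probs, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution over a five-word vocabulary.
next_token_probs = {"a": 0.4, "the": 0.3, "an": 0.15, "fast": 0.1, "zebra": 0.05}

filtered = top_k_filter(next_token_probs, k=2)
print(filtered)  # only "a" and "the" remain, rescaled to sum to 1

rng = random.Random(0)  # fixed seed so the sketch is reproducible
print([sample(filtered, rng) for _ in range(5)])
```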


### 2.4. Top-$p$ sampling

Setting `top_p=<p>` together with `do_sample=True` performs top-$p$ sampling.

In [28]:
inputs = tokenizer("Machine Learning is", return_tensors="pt").to(device)
config = GenerationConfig(
    top_p=0.95,        # top-p sampling with p=0.95
    do_sample=True,    # use sampling rather than deterministic decoding
    stop_strings="\n", # stop when newline character is generated
    max_new_tokens=30, # max number of tokens to generate
    pad_token_id=tokenizer.eos_token_id, # pad_token_id is set only to suppress a warning message
)
outputs = model.generate(**inputs, generation_config=config, tokenizer=tokenizer)
print("Output with top-p sampling:", tokenizer.decode(outputs[0]))

Output with top-p sampling: Machine Learning is a framework to build a visual user interface from HTML to CSS. Using a basic JavaScript interface, it is possible to build robust, scalable and scalable interactive
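Top-$p$ (nucleus) filtering keeps a variable number of tokens depending on how concentrated the distribution is. Here is a minimal sketch in plain Python with a made-up distribution; the helper name is hypothetical, and the library's handling of the boundary token may differ in detail.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers probability p
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution over a five-word vocabulary.
next_token_probs = {"a": 0.4, "the": 0.3, "an": 0.15, "fast": 0.1, "zebra": 0.05}

nucleus = top_p_filter(next_token_probs, p=0.8)
print(sorted(nucleus))  # ['a', 'an', 'the']
```

After filtering, the next token is sampled from the renormalized distribution, so low-probability tokens outside the nucleus can never be chosen.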


## 3. Downloading and Using Text Datasets

The `datasets` library provides access to several popular datasets.

For this practice session, we will download the **[Alpaca Dataset](https://huggingface.co/datasets/tatsu-lab/alpaca)** from Hugging Face.

The Alpaca dataset is a collection of instruction-response pairs used to fine-tune large language models for instruction-following tasks. It was originally created by researchers at Stanford University to train Alpaca, a lightweight alternative to OpenAI’s instruction-tuned models.

To download and prepare the dataset, use the `load_dataset` function.

In [29]:
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca")


In [30]:
rich.print("dataset object:", dataset, sep="\n")
print()
rich.print("train split:", dataset['train'], sep="\n")
print()
rich.print("example datapoint:", dataset['train'][0], sep="\n")







The dataset consists of four fields.

- instruction: describes the task the model should perform. Each of the 52K instructions is unique.
- input: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- output: the answer to the instruction.
- text: the instruction, input and output formatted with the prompt template.

You can index rows and columns in either order, for example:

In [31]:
dataset['train'][0]['text']

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'

In [32]:
dataset['train']['text'][0]

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'

### 3.1. Loading Custom Datasets

While Hugging Face provides many well-known datasets, you may need to prepare a dataset on your own.

For practice, we’ll manually load the **Alpaca dataset**.

Let's see how we can load a dataset from files!

The Alpaca dataset is provided in the `alpaca_data.json` file.

Dataset files in formats such as JSON or CSV can be loaded with the `load_dataset` function; other sources, such as SQL databases, have their own loaders described in the documentation linked below.

For more information, see [Hugging Face Datasets Load Documentation](https://huggingface.co/docs/datasets/loading).

In [33]:
dataset = load_dataset("json", data_files="alpaca_data.json")
rich.print("dataset object:", dataset)


In [34]:
rich.print("example datapoint:\n", dataset['train'][0])

For more fine-grained dataset manipulation, first create the data as an in-memory object such as a dictionary or list, then convert it into a `Dataset` object.

Here's an example.

In [35]:
from datasets import Dataset
import json

data_list = [
    {
        "input": "Hello",
        "output": "world!"
    },
    {
        "input": "Welcome to",
        "output": "Deep Learning class!"
    }
]

rich.print("example datapoint:", data_list[0], sep="\n")
print()
test_dataset = Dataset.from_list(data_list)
rich.print("dataset object:", test_dataset, sep="\n")




### 3.2. Preprocessing the Dataset

You can preprocess a dataset using functions like `shuffle`, `map`, and `filter`.

For more information, refer to the [Hugging Face Datasets Processing Documentation](https://huggingface.co/docs/datasets/en/process).

- The `map` and `filter` functions take a **preprocessing function** that is applied to each individual data point in the dataset.

- In the case of `map`, the preprocessing function should return a dictionary containing new or modified fields. These fields will be **automatically merged** into the original data point.

- The `filter` function expects a preprocessing function that returns a **boolean**. Data points for which the function returns `False` will be **removed** from the dataset.
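The semantics of `map` and `filter` described above can be emulated on a plain list of dictionaries. This is only a sketch of the per-example behavior; `map_rows`, `filter_rows`, `add_text`, and `is_short` are hypothetical helpers, not part of the `datasets` library.

```python
def map_rows(rows, fn):
    """Emulate per-example map: merge fn's returned dict into each row."""
    return [{**row, **fn(row)} for row in rows]

def filter_rows(rows, predicate):
    """Emulate filter: keep only rows for which predicate returns True."""
    return [row for row in rows if predicate(row)]

rows = [
    {"input": "Hello", "output": "world!"},
    {"input": "Welcome to", "output": "Deep Learning class!"},
]

def add_text(row):
    # Returns only the new field; the original fields are kept by the merge.
    return {"text": row["input"] + " " + row["output"]}

def is_short(row):
    return len(row["text"]) < 15

mapped = map_rows(rows, add_text)
kept = filter_rows(mapped, is_short)
print(mapped[0])  # {'input': 'Hello', 'output': 'world!', 'text': 'Hello world!'}
print(len(kept))  # 1
```

With a real `Dataset` object the same two functions would be passed as `dataset.map(add_text)` and `dataset.filter(is_short)`.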

As you may have noticed, the Alpaca dataset we loaded from `alpaca_data.json` does not contain a `"text"` field, unlike the version downloaded from the Hub.

In this example, we will perform the following preprocessing steps:

1. **Shuffle** the training split.
2. **Construct instruction prompts** for instruction fine-tuning.
3. **Tokenize** the prompts into input tokens.
4. **Filter out** examples whose tokenized length exceeds the GPT-2 model’s maximum sequence length.

In [36]:
# Prompt template for examples that have a non-empty "input" field.
with_input_prompt = \
"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Prompt template for examples whose "input" field is empty.
without_input_prompt = \
"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""

def create_prompt(x):
    if x['input'] == '':
        prompt = without_input_prompt.format(
            instruction=x['instruction'],
            output=x['output']
        )
    else:
        prompt = with_input_prompt.format(
            instruction=x['instruction'],
            input=x['input'],
            output=x['output'],
        )

    return {'text': prompt}

def tokenize_function(x):
    tokenized = tokenizer(x['text'])
    return tokenized

def filter_long(x):
    return len(x['input_ids']) < tokenizer.model_max_length

tokenized_dataset = dataset["train"].shuffle(seed=42).map(create_prompt).map(tokenize_function).filter(filter_long)
print("example datapoint after preprocessing:")
pprint(tokenized_dataset[0], max_length=10)

Token indices sequence length is longer than the specified maximum sequence length for this model (1510 > 1024). Running this sequence through the model will result in indexing errors


example datapoint after preprocessing:


## 4. Fine-tuning Language Models

Hugging Face's `Trainer` API simplifies training and fine-tuning transformer models by handling training loops, evaluation, and optimization automatically.

Let's fine-tune the GPT-2 model!

In [37]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

First, let us see whether the GPT-2 model answers correctly when given an Alpaca prompt.

In [38]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

Here's an example prompt!

In [39]:
prompt = \
"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain what is Computer Science.

### Response:
"""
print("PROMPT:", prompt, sep="\n\n")
inputs = tokenizer(prompt, return_tensors='pt').to(device)

PROMPT:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain what is Computer Science.

### Response:



In [40]:
config = GenerationConfig(
    top_p=0.95,
    do_sample=True,
    max_new_tokens=80,
    pad_token_id=tokenizer.eos_token_id,
)
output = model.generate(**inputs, generation_config=config)
output = tokenizer.decode(output[0])
rich.print("MODEL OUTPUT Before Finetuning:", output, sep="\n\n")

As you can see, this model does not follow the instruction and generates irrelevant text.
The reason is that the model was only trained to predict the next token; it was never trained on Alpaca instruction prompts.

Let us fine-tune this model to follow Alpaca instructions!

### 4.1. Handling Training Configurations

Similar to `GenerationConfig` for the `model.generate` function,
the Hugging Face `Trainer` handles training configurations in the
`TrainingArguments` class.

See the full parameter list and explanations in the
[official documentation](https://huggingface.co/docs/transformers/v4.50.0/en/main_classes/trainer#transformers.TrainingArguments).

In the cells below, we show a toy example that fine-tunes the
GPT-2 model on the Alpaca dataset.

In [41]:
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
training_args = TrainingArguments(
    output_dir="./gpt2_finetuned",          # Directory to save the model and checkpoints
    per_device_train_batch_size=1,          # Batch size per device (e.g., GPU) during training
    max_steps=500,                          # Total number of training steps
    save_strategy="steps",                  # Save model checkpoint every few steps
    logging_dir="./logs",                   # Directory to store training logs
    logging_steps=50,                       # Log training metrics every 50 steps
    gradient_accumulation_steps=2,          # Accumulate gradients over 2 steps before updating weights (so the effective batch size is 1 * 2 = 2)
    bf16=True,                              # Use bfloat16 precision
    report_to="none",                       # disable wandb logging
)

Data collators are objects that form a batch from a list of dataset elements. We will use a data collator provided by Hugging Face.

It pads each batch dynamically and adds a `labels` field so that the model can compute the negative log-likelihood loss for fine-tuning.

In [42]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [43]:
example_batch = [
    inst
    for inst in
    tokenized_dataset.remove_columns(["instruction", "input", "output", "text"]).select([0,1])
    # To use the DataCollatorForLanguageModeling class,
    # you need to keep only the "input_ids" and "attention_mask" columns
    # and remove the rest. Trainer object automatically handles it during the training.
]
print("Example batch:")
pprint(example_batch, max_length=10)
print("Batch size:", len(example_batch))
for i in range(2):
    print(f"input {i} length:", len(example_batch[i]["input_ids"]))
print()

collator_output = data_collator(example_batch)

print("Output after applying data collator:")
print("input shape:", collator_output["input_ids"].shape)
rich.print(f"input: {collator_output['input_ids']}")
rich.print(f"attention mask: {collator_output['attention_mask']}")
rich.print(f"labels: {collator_output['labels']}")

Example batch:


Batch size: 2
input 0 length: 84
input 1 length: 109

Output after applying data collator:
input shape: torch.Size([2, 109])


You can check that `input_ids`, `attention_mask`, and `labels` are correctly padded!
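What a causal language modeling collator does to a batch can be sketched in plain Python. This toy version pads on the left, following the `padding_side` set earlier in this notebook, and masks only the padded label positions with $-100$; the function name is hypothetical and the library collator may differ in details, for example in how it identifies pad positions.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def collate_causal_lm(examples, pad_id):
    """Left-pad to the longest example and derive labels from input_ids."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        ids = list(ex["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append([pad_id] * n_pad + ids)
        batch["attention_mask"].append([0] * n_pad + [1] * len(ids))
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append([IGNORE_INDEX] * n_pad + ids)
    return batch

# Hypothetical token ids, with 0 standing in for the pad token id.
toy_batch = collate_causal_lm([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}], pad_id=0)
print(toy_batch["input_ids"])       # [[5, 6, 7], [0, 0, 8]]
print(toy_batch["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]
print(toy_batch["labels"])          # [[5, 6, 7], [-100, -100, 8]]
```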

Now let us do the fine-tuning!
To fine-tune, we first need to create a `Trainer` object.

In [44]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

You can start training with the `.train()` method.

In [46]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 50   | 1.8316 |
| 100  | 1.7618 |
| 150  | 1.742  |
| 200  | 1.7783 |
| 250  | 1.849  |
| 300  | 1.8231 |
| 350  | 1.7807 |
| 400  | 1.7707 |
| 450  | 1.8025 |
| 500  | 1.8546 |


TrainOutput(global_step=500, training_loss=1.7994193572998047, metrics={'train_runtime': 87.8008, 'train_samples_per_second': 11.389, 'train_steps_per_second': 5.695, 'total_flos': 55657754496000.0, 'train_loss': 1.7994193572998047, 'epoch': 0.019230769230769232})
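The numbers in the `TrainOutput` above can be cross-checked with a little arithmetic. The sketch below copies the values from `TrainingArguments` and the reported metrics, and assumes a single GPU; the implied number of steps per epoch is an inference from the reported `epoch` value, not something the output states directly.

```python
# Values copied from TrainingArguments and the TrainOutput above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
max_steps = 500
train_runtime = 87.8008              # seconds
train_samples_per_second = 11.389
reported_epoch = 0.019230769230769232

# Assumption: a single GPU, so no extra factor for the number of devices.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = max_steps * effective_batch_size

# Cross-check against the reported throughput.
samples_from_throughput = train_samples_per_second * train_runtime

# Number of optimizer steps in one epoch implied by the reported epoch fraction.
implied_steps_per_epoch = max_steps / reported_epoch

print(effective_batch_size)            # 2
print(samples_seen)                    # 1000
print(round(samples_from_throughput))  # 1000
print(round(implied_steps_per_epoch))  # 26000
```

In other words, this toy run saw about 1,000 examples, which is under 2% of one pass over the dataset, so the training loss is not expected to drop much.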

In [47]:
prompt = \
"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain what is Computer Science.

### Response:
"""
print("PROMPT:", prompt, sep="\n\n")
inputs = tokenizer(prompt, return_tensors='pt').to(device)

PROMPT:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain what is Computer Science.

### Response:



Let's see whether our fine-tuned model can answer the question about Computer Science!

In [50]:
config = GenerationConfig(
    top_p=0.95,
    do_sample=True,
    max_new_tokens=80,
    pad_token_id=tokenizer.eos_token_id,
)
output = model.generate(**inputs, generation_config=config)
output = tokenizer.decode(output[0])
rich.print("MODEL OUTPUT After Finetuning:", output, sep="\n\n")