# Chapter 6 Fine-tuning the language model

- [1. The Importance of Fine-tuning Language Models](#1-the-importance-of-fine-tuning-language-models)
- [2. Fine-tune a language model using Hugging Face](#2-fine-tune-a-language-model-using-hugging-face)
  - [2.1 Data Processing](#21-data-processing)
  - [2.2 Model Training](#22-model-training)
- [3. Real-time viewing and analysis of model training results](#3-real-time-viewing-and-analysis-of-model-training-results)

## 1. The Importance of Fine-tuning Language Models

![compare-method.png](../../figures/compare-method.png)

Training a language model from scratch is time-consuming and resource-intensive, and evaluating such models is just as demanding. It is therefore important to monitor the training process closely and to save `checkpoints` so that you can recover from unexpected problems. A dashboard is a valuable tool here: it displays training progress and metrics and gives access to `checkpoints` when needed. This helps you understand the model's performance as training proceeds and confirm that training is going smoothly.

Fine-tuning an existing language model is a more cost-effective way to get a model that fits your task, especially when computing resources are limited. Evaluation still requires care, however. Depending on what you want the model to do, develop an evaluation strategy that verifies the model reaches the desired level of performance.

## 2. Fine-tune a language model using Hugging Face

In this course, we will show how to fine-tune a language model using Hugging Face. To keep training practical on a CPU, we will use a small language model from the `TinyStories` family, `roneneldan/TinyStories-33M`, which has 33 million parameters. We will fine-tune this lightweight model on a dataset of character backstories from the Dungeons & Dragons game world.

In [1]:
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
import transformers
transformers.set_seed(42)

import wandb

In [2]:
wandb.login(anonymous="allow")

[34m[1mwandb[0m: Currently logged in as: [33manony-moose-980007700204230807[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

The Auto classes infer the architecture to use from the name or path of the pretrained model passed to the `from_pretrained()` method.

`AutoConfig`, `AutoModel`, and `AutoTokenizer` automatically retrieve the relevant configuration, model, and tokenizer from that name or path, which can be either:

- Remote: a model ID hosted in a huggingface.co repository, either at the root level, such as `bert-base-uncased`, or namespaced under a user or organization name, such as `roneneldan/TinyStories-33M`
- Local: a directory saved with the `save_pretrained()` method

In [3]:
model_checkpoint = "roneneldan/TinyStories-33M"

### 2.1 Data Processing

First, we will load a dataset of Dungeons & Dragons character backstories from the Hugging Face Hub.

In [4]:
ds = load_dataset('MohamedRashad/characters_backstories')

In [5]:
# Let's look at an example
ds["train"][400]

{'text': 'Generate Backstory based on following information\nCharacter Name: Dewin \nCharacter Race: Halfling\nCharacter Class: Sorcerer bard\n\nOutput:\n',
 'target': 'Dewin thought he was a wizard, but it turned out it was the draconic blood in his veins that brought him eldritch power.  Music classes in wizarding college taught him yet another use for his power, and when he was expelled he took up adventuring'}

This dataset contains two columns: `text`, which holds the instruction asking the model to generate a backstory together with the character's name, race, and class; and `target`, which holds the character's backstory.
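
To make the layout of the `text` column concrete, here is a minimal sketch of a helper that rebuilds a prompt in the same shape as the example record above. The function name `build_prompt` is hypothetical, and the template is copied from that single record, so treat it as an assumption about the rest of the dataset rather than a documented format.

```python
def build_prompt(name: str, race: str, char_class: str) -> str:
    """Format character attributes like the `text` field of the example record.

    Hypothetical helper: the template mirrors the one record shown above.
    """
    return (
        "Generate Backstory based on following information\n"
        f"Character Name: {name}\n"
        f"Character Race: {race}\n"
        f"Character Class: {char_class}\n"
        "\n"
        "Output:\n"
    )


# Rebuild the prompt for the character from the example record.
prompt = build_prompt("Dewin", "Halfling", "Sorcerer bard")
print(prompt)
```

Keeping prompts at generation time in the same format as the training data generally makes it easier for a fine-tuned model to continue them in the expected way.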

We will split the dataset to create a validation set.

In [6]:
# Since this dataset does not have a pre-split validation set, we need to create one ourselves
ds = ds["train"].train_test_split(test_size=0.2, seed=42)

Before training the model, we need to concatenate `text` (the character information) and `target` (the background story), then tokenize the result and pad or truncate it to a fixed length.

For causal language modeling, the labels are simply a copy of the input tokens. Because the model must predict the next token in the sequence, Hugging Face causal language models shift the labels by one position internally when computing the loss, so the prediction made at each position is compared with the token that follows it. We therefore do not need to shift the labels ourselves.
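
The pairing of each position with its next-token target can be illustrated in plain Python. This is a toy sketch, not the library's code: it uses word strings in place of real token IDs, and the function name `next_token_pairs` is hypothetical.

```python
def next_token_pairs(token_ids):
    """Pair the token at each position with the token it should predict."""
    inputs = token_ids[:-1]   # every position except the last makes a prediction
    targets = token_ids[1:]   # the target is the token one step ahead
    return list(zip(inputs, targets))


# Toy "tokens" (whole words) standing in for real token IDs.
tokens = ["Dewin", "thought", "he", "was", "a", "wizard"]
for current, target in next_token_pairs(tokens):
    print(f"{current!r} -> {target!r}")
```

A sequence of `n` tokens therefore yields `n - 1` training targets, since the last token has nothing after it to predict.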

[AutoTokenizer](https://huggingface.co/learn/nlp-course/zh-CN/chapter6/1?fw=pt) gives us the tokenizer that matches the pre-trained model, so that text is processed the same way it was when the model was originally trained.

In [7]:
# We will create a tokenizer from the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

# We need to pad the samples so that we have sequences of the same length in the batch
tokenizer.pad_token = tokenizer.eos_token

# First concatenate the text and the target. Then use the tokenizer to tokenize the concatenated string.
def tokenize_function(example):
    merged = example["text"] + " " + example["target"]
    batch = tokenizer(merged, padding='max_length', truncation=True, max_length=128)
    batch["labels"] = batch["input_ids"].copy()
    return batch

# Apply this to our dataset and remove the text columns
tokenized_datasets = ds.map(tokenize_function, remove_columns=["text", "target"])

Map:   0%|          | 0/465 [00:00<?, ? examples/s]

> You may get some warnings here, that's okay
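
Because `tokenize_function` copies `input_ids` into `labels`, the padded positions are also used as training targets. One common variation, which this chapter does not apply, is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea in plain Python under two assumptions: padding is added on the right, and the helper names and the toy `EOS_ID` are hypothetical. Since the pad token here is the same as the EOS token, the mask is derived from the attention mask rather than from the token ID.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncate long sequences
    n_pad = max_length - len(kept)                   # how many pad tokens to add
    input_ids = kept + [pad_id] * n_pad              # right-padding (assumption)
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


EOS_ID = 0  # hypothetical ID, used here as both EOS and pad token
ids, mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=EOS_ID)
labels = mask_padding_labels(ids, mask)
print(ids)     # [11, 12, 13, 0, 0]
print(mask)    # [1, 1, 1, 0, 0]
print(labels)  # [11, 12, 13, -100, -100]
```

Whether this masking helps depends on the task; with short sequences padded to 128 tokens, it keeps the loss from being dominated by padding.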

Before we start training, we will check a prepared sample to make sure everything works as expected. When we decode the token IDs, we should first see the instruction and character information, followed by the background story. If everything looks good, we can continue.

In [8]:
# Let's look at a prepared example
print(tokenizer.decode(tokenized_datasets["train"][900]['input_ids']))

Generate Backstory based on following information
Character Name: Mr. Gale
Character Race: Half-orc
Character Class: Cleric

Output:
 Growing up the only half-orc in a small rural town was rough. His mother didn't survive childbirth and so was raised in a church in a high mountain pass, his attention was always drawn by airships passing through, and dreams of an escape. Leaving to strike out on his own as early as he could he made a living for most of his life as an airship sailor, and occasionally a pirate. A single storm visits him throughout his life, marking every major


### 2.2 Model Training
We will use Hugging Face Transformers and its `wandb` integration to fine-tune a pre-trained language model on the dataset.

The model we load is a causal language model, that is, an autoregressive language model similar to GPT whose task is to predict the next token in a sequence. We will start a new Weights & Biases run with the job type set to `training`. Next, we need to define some training parameters, such as the number of training epochs, the learning rate, and the weight decay. Most importantly, we will set `wandb` as the reporting mechanism, so that all results are collected in one centralized dashboard. That single setting is all it takes to stream metrics to the dashboard as training runs.

[AutoModelForCausalLM](https://huggingface.co/learn/nlp-course/zh-CN/chapter2/3) will automatically load the corresponding causal language model from the checkpoint.

In [9]:
# We will train a causal (autoregressive) language model based on the pre-trained checkpoints
model = AutoModelForCausalLM.from_pretrained(model_checkpoint);

In [10]:
# Start a new wandb run
run = wandb.init(project='dlai_lm_tuning', job_type="training", anonymous="allow")

In [11]:
# Define training parameters
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-characters-backstories",
    report_to="wandb", # This one line is all we need to track the experiment in wandb
    num_train_epochs=1,
    logging_steps=1,
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    weight_decay=0.01,
    no_cuda=True, # Force CPU use; this argument is being renamed to use_cpu
)
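
The arguments above set only a starting `learning_rate`, yet the learning rate logged for this run falls to 0.0 by the last step. That is consistent with a linear decay schedule without warmup, which we assume here to be the `Trainer` default. The sketch below is a plain-Python illustration of such a schedule, not the library's implementation; the function name `linear_decay_lr` is hypothetical, and the total of 233 steps is taken from the training output reported in this chapter.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate after `step` optimizer updates under linear decay to zero."""
    remaining = max(0, total_steps - step)  # never go below zero after the end
    return base_lr * remaining / total_steps


BASE_LR = 1e-4      # learning_rate from the TrainingArguments above
TOTAL_STEPS = 233   # global_step reported by this chapter's training run

# Print the learning rate at a few points across the single epoch.
for step in (0, 58, 116, 175, 233):
    print(step, f"{linear_decay_lr(step, TOTAL_STEPS, BASE_LR):.2e}")
```

Under this schedule the learning rate is halved roughly midway through the epoch and reaches zero exactly at the final step.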

In [12]:
# We will use HF Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

In [13]:
# Start training
trainer.train()



| Epoch | Training Loss | Validation Loss |
|:------:|:------:|:------:|
| 1 | 2.8045 | 3.350718 |


TrainOutput(global_step=233, training_loss=3.7419421836541957, metrics={'train_runtime': 1912.6777, 'train_samples_per_second': 0.971, 'train_steps_per_second': 0.122, 'total_flos': 40423258718208.0, 'train_loss': 3.7419421836541957, 'epoch': 1.0})
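
The numbers reported above can be sanity-checked with a little arithmetic. The sketch below converts the validation loss into perplexity, assuming the loss is the mean per-token cross-entropy in nats, and cross-checks the throughput figures against the runtime. All input values are copied from the output above; the variable names are our own.

```python
import math

eval_loss = 3.350718         # validation loss after epoch 1
train_runtime = 1912.6777    # seconds, from train_runtime
samples_per_second = 0.971   # from train_samples_per_second
steps_per_second = 0.122     # from train_steps_per_second

# Perplexity is exp(mean cross-entropy); lower is better.
perplexity = math.exp(eval_loss)

# Throughput multiplied by runtime recovers approximate totals.
approx_samples = samples_per_second * train_runtime
approx_steps = steps_per_second * train_runtime

print(f"validation perplexity ~ {perplexity:.1f}")
print(f"training samples seen ~ {approx_samples:.0f}")
print(f"optimizer steps       ~ {approx_steps:.0f}")
```

The step estimate agrees with the reported `global_step=233`, and the ratio of samples to steps suggests a batch size of about 8, which is an inference from these rounded figures rather than a value stated in the output.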

## 3. Real-time viewing and analysis of model training results

While the model is training, we can view the results in real time by clicking on the link to the run. We can watch the various metrics change over time. The one we care about most is the training loss, and we keep an eye on its trend. When debugging a training run, it is very useful to check whether the loss keeps decreasing: you want to see this curve slope downward as it moves to the right.
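
The same check can be done in code rather than by eye. The following is a minimal sketch that compares the average loss over the first few logged steps with the average over the last few; averaging is used because per-step losses are noisy. The function name `is_trending_down`, the window size, and the toy loss values are all hypothetical and are not taken from the run in this chapter.

```python
from statistics import mean


def is_trending_down(losses, window=5):
    """True if the mean of the last `window` losses is below the mean of the first."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    return mean(losses[-window:]) < mean(losses[:window])


# Hypothetical per-step losses, made up for illustration only.
toy_losses = [6.1, 5.2, 4.8, 4.9, 4.1, 3.9, 4.0, 3.5, 3.6, 3.2]
print(is_trending_down(toy_losses))
```

A check like this is deliberately crude; it only flags runs whose loss is not falling at all, which is the case most worth catching early.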

For some very large language models, training may take days or even weeks, so being able to view the charts remotely is very helpful. It lets us confirm that the model is still improving and that GPU resources are being used effectively rather than wasted.

![charts.png](../../figures/charts.png)

| loss0 | loss1 |
|:------:|:------:|
| ![loss0](../../figures/loss0.png) | ![loss1](../../figures/loss1.png) |

After training the model, we will use it to generate samples. We define some prompts for generating character backstories and create a new table to record the results. We then call `model.generate` on each prompt to produce the corresponding text. Various parameters can be passed here, such as `top_p` or `temperature`, to steer generation. Each result is added to the table, and the table is finally logged to `wandb`.

In [14]:
transformers.logging.set_verbosity_error() # Suppress tokenizer warnings

prefix = "Generate Backstory based on following information Character Name: "

prompts = [
    "Frogger Character Race: Aarakocra Character Class: Ranger Output: ",
    "Smarty Character Race: Aasimar Character Class: Cleric Output: ",
    "Volcano Character Race: Android Character Class: Paladin Output: ",
]

table = wandb.Table(columns=["prompt", "generation"])
# Call "model.generate" on each prompt to generate the corresponding text.
for prompt in prompts:
    input_ids = tokenizer.encode(prefix + prompt, return_tensors="pt")
    output = model.generate(input_ids, do_sample=True, max_new_tokens=50, top_p=0.3)
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)
    table.add_data(prefix + prompt, output_text)

wandb.log({'tiny_generations': table})



**NOTE**: An LLM does not always generate the same results. Your generated characters and backstories may differ from those in the video.
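
The variation comes from sampling (`do_sample=True`), and `temperature` and `top_p` control how much of it there is. The sketch below illustrates both ideas on a toy distribution in plain Python. It is a simplified illustration, not the implementation used by `model.generate`; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, top_p=1.0, temperature=1.0, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print(top_p_filter(softmax_with_temperature(toy_logits), top_p=0.3))
print([sample_token(toy_logits, top_p=0.9, rng=rng) for _ in range(10)])
```

With `top_p=0.3`, as used in the cell above, only the single most likely token survives whenever its probability already exceeds 0.3, so generation behaves almost greedily at those steps.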

In [15]:
wandb.finish()


| Metric | History |
|:------:|:------:|
| eval/loss | ▁ |
| eval/runtime | ▁ |
| eval/samples_per_second | ▁ |
| eval/steps_per_second | ▁ |
| train/epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████ |
| train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████ |
| train/learning_rate | ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁ |
| train/loss | █▅▃▂▃▃▂▃▃▃▂▂▃▃▂▃▂▂▃▃▂▃▂▂▂▂▃▃▂▃▂▂▂▂▁▂▂▁▂▂ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |

| Metric | Final value |
|:------:|:------:|
| eval/loss | 3.35072 |
| eval/runtime | 158.3652 |
| eval/samples_per_second | 2.936 |
| eval/steps_per_second | 0.373 |
| train/epoch | 1.0 |
| train/global_step | 233.0 |
| train/learning_rate | 0.0 |
| train/loss | 2.8045 |
| train/total_flos | 40423258718208.0 |
| train/train_loss | 3.74194 |


We can see the prompts and the generated samples in the dashboard. They show that this small model has some issues, which is understandable because we chose a setup optimized for speed rather than output quality. By comparing the prompts with the outputs, you can judge whether the model is performing well. We encourage you to come up with metrics relevant to your specific use case, implement them, and log them. For example, you could measure the number of unique words. The sample generated from the third prompt uses only three distinct words, `the tribe of`, which is not a very good output.
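
As a starting point for such a metric, here is a minimal sketch that counts unique words in a generation. The function name `unique_word_stats`, the simple regular-expression word splitting, and the sample string are all hypothetical; the sample only imitates a repetitive output and is not copied from the actual run.

```python
import re


def unique_word_stats(text):
    """Count total and unique lowercase words in a piece of generated text."""
    words = re.findall(r"[a-z']+", text.lower())
    unique = set(words)
    ratio = len(unique) / len(words) if words else 0.0
    return {"n_words": len(words), "n_unique": len(unique), "unique_ratio": ratio}


# Hypothetical degenerate generation that keeps repeating the same three words.
sample = "the tribe of the tribe of the tribe of"
stats = unique_word_stats(sample)
print(stats)
```

A low `unique_ratio` flags repetitive generations. The values could be stored as extra columns of the `wandb.Table` used earlier, so that they appear next to each prompt in the dashboard.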

So the next time you train or fine-tune your model, we hope that you can take advantage of these tools to get better results faster.

![tiny-generations.png](../../figures/tiny-generations.png)