# [Fine-tuning for Summarization](https://finetuningllms.vercel.app)

**Author**: Jose Cols (<jcols@uw.edu>)

**Date**: 11/13/2025

This notebook is part of a [workshop](https://finetuningllms.vercel.app/) on fine-tuning LLMs. It shows how to use **next-token** prediction to build a **summarizer** of biomedical articles for non-expert readers using [LLaMA 3.2 3B Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct). It's divided into three sections: [data preprocessing](#data), [model inference](#inference), and [model fine-tuning](#finetuning).

Throughout the notebook, there are emojis to indicate the following:

- 💡 **Idea**: An important note for a code section, to answer questions like "Why is this line here?"
- 🔬 **Technical details**: Adds more technical information or changes to explore.
- 📘 **Further reading**: Provides links to external resources on a topic.
- ✅ **Check mark**: Indicates completion of one of the notebook sections.

---

*Tested on Google Colab with a T4 GPU.*

# Configuration

Before we start, we will need a few Python libraries to load the dataset, load the base model, and fine-tune it. Let's ensure we have everything installed:

In [None]:
%%capture
# This cell should take a couple of minutes to run. The `%%capture` magic hides the output and must be the first line of the cell.
!pip install "unsloth[colab-new]" "huggingface_hub[hf_xet]" triton==3.2.0 rouge_score

In [None]:
from datetime import datetime


SEED = 1234
MAX_SEQ_LENGTH = 4096
DATE_STRING = datetime.today().strftime("%d %b %Y")

These constants are used throughout the notebook for the following purposes:

- `SEED`: This value helps ensure reproducibility. You can change this to any number you like.
- `MAX_SEQ_LENGTH`: Controls the **context window**. Ideally, this number should be **larger** than your longest input, but **smaller** than the model's context window.
- `DATE_STRING`: Sets the current date in the model's prompts. The model's chat template inserts it into the system prompt as `Today Date`.

🔬 — You can estimate the `max_seq_length` by tokenizing the data and identifying the longest token sequence plus some margin.
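As a rough illustration of that estimate, the sketch below finds the longest tokenized text, adds a margin, and rounds up. The helper `estimate_max_seq_length`, the 10% margin, and the rounding to a multiple of 64 are all assumptions made for this example. The whitespace tokenizer is only a stand-in: with the real tokenizer you would pass something like `tokenizer.encode`, and you would measure the fully formatted prompts (system prompt, abstract, and summary) rather than the raw abstracts.

```python
import math


def estimate_max_seq_length(texts, tokenize, margin=0.1, multiple=64):
    """Return the longest token count plus a safety margin, rounded up to a multiple."""
    longest = max(len(tokenize(text)) for text in texts)
    padded = longest * (1 + margin)
    return math.ceil(padded / multiple) * multiple


# Stand-in tokenizer for illustration only: whitespace splitting undercounts
# the subword tokens that a real LLM tokenizer would produce.
sample_texts = [
    "a short abstract",
    "a somewhat longer abstract with more words in it",
]
estimate = estimate_max_seq_length(sample_texts, str.split)
print(estimate)
```

Rounding up to a multiple is a convenience, not a requirement; the key point is that the value should cover the longest formatted example with some room to spare.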

🔬 — What happens if we set a `max_seq_length` smaller than our longest text?

We also need to download the Punkt sentence tokenizer data with the [NLTK library](https://www.nltk.org/). The `rouge_score` package uses it to split summaries into sentences when computing `ROUGE`, the **summarization evaluation** metric that we will use.

In [None]:
import nltk


nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

<a name="data"></a>
# 1 - Preparing the data

We will fetch the [BioLaySumm](https://biolaysumm.org/) 2025 [PLOS dataset](https://huggingface.co/datasets/BioLaySumm/BioLaySumm2025-PLOS) from [Hugging Face](https://huggingface.co/) using the [datasets](https://huggingface.co/docs/datasets/en/index) library. The data preparation process consists of the following steps:

1. Download the dataset.
2. Randomly **sample** the `train` and `test` splits to create smaller subsets.
3. Define functions to format the data as **model instructions**.
4. Extract the **abstracts** of the articles to use them as input for lay summarization.

The `train` subset contains 300 samples and the `test` subset has 10. Note that the code below draws the `test` subset from the dataset's `validation` split.

🔬 — Increasing the size of the `train` subset will extend the **training runtime**, while increasing the `test` subset will extend the **inference runtime**.

In [None]:
from datasets import load_dataset


plos_train = (
    load_dataset("BioLaySumm/BioLaySumm2025-PLOS", split="train")
    .shuffle(seed=SEED)
    .select(range(300))
)
plos_test = (
    load_dataset("BioLaySumm/BioLaySumm2025-PLOS", split="validation")
    .shuffle(seed=SEED)
    .select(range(10))
)

Let's examine the training data:

In [None]:
plos_train.to_pandas().head()

| | article | summary | section_headings | keywords | year | title |
|---|---|---|---|---|---|---|
| 0 | Fungal pathogens exploit diverse mechanisms to... | Treating fungal infections is challenging due ... | [Abstract, Introduction, Results, Discussion, ... | [infectious, diseases/fungal, infections, infe... | 2010 | PKC Signaling Regulates Drug Resistance of the... |
| 1 | Compositional data consist of vectors of propo... | Data from many fields are available primarily ... | [Abstract, Introduction, Methods, BAnOCC:, Bay... | [ecology, and, environmental, sciences, microb... | 2017 | A Bayesian method for detecting pairwise assoc... |
| 2 | Sand fly saliva has an array of pharmacologica... | Parasites of the genus Leishmania cause a vari... | [Abstract, Introduction, Materials, and, Metho... | [immunology/immunomodulation, infectious, dise... | 2007 | Enhanced Leishmania braziliensis Infection Fol... |
| 3 | Brucellosis is a highly contagious zoonosis an... | Brucellosis is one of the most widespread zoon... | [Abstract, Introduction, Materials, and, Metho... | [biotechnology, medicine, infectious, diseases... | 2013 | Development and Validation of a Novel Diagnost... |
| 4 | Plasmodium falciparum employs antigenic variat... | Plasmodium falciparum is a protist parasite th... | [Abstract, Introduction, Results, Discussion, ... | [infectious, diseases/protozoal, infections, g... | 2011 | Expression of P. falciparum var Genes Involves... |


In [None]:
plos_train["article"][0]

"Fungal pathogens exploit diverse mechanisms to survive exposure to antifungal drugs . This poses concern given the limited number of clinically useful antifungals and the growing population of immunocompromised individuals vulnerable to life-threatening fungal infection . To identify molecules that abrogate resistance to the most widely deployed class of antifungals , the azoles , we conducted a screen of 1 , 280 pharmacologically active compounds . Three out of seven hits that abolished azole resistance of a resistant mutant of the model yeast Saccharomyces cerevisiae and a clinical isolate of the leading human fungal pathogen Candida albicans were inhibitors of protein kinase C ( PKC ) , which regulates cell wall integrity during growth , morphogenesis , and response to cell wall stress . Pharmacological or genetic impairment of Pkc1 conferred hypersensitivity to multiple drugs that target synthesis of the key cell membrane sterol ergosterol , including azoles , allylamines , and mo

In [None]:
plos_train["summary"][0]

'Treating fungal infections is challenging due to the emergence of drug resistance and the limited number of clinically useful antifungal drugs . We screened a library of 1 , 280 pharmacologically active compounds to identify those that reverse resistance of the leading human fungal pathogen , Candida albicans , to the most widely used antifungals , the azoles . This revealed a new role for protein kinase C ( PKC ) signaling in resistance to drugs targeting the cell membrane , including azoles , allylamines , and morpholines . We dissected mechanisms through which PKC regulates resistance in C . albicans and the model yeast Saccharomyces cerevisiae . PKC enabled survival of cell membrane stress at least in part through the mitogen-activated protein kinase ( MAPK ) cascade in both species . In S . cerevisiae , inhibition of PKC signaling blocked activation of a key regulator of membrane stress responses , calcineurin . In C . albicans , Pkc1 and calcineurin independently regulate resist

Since the entire article is **too long** to process in this notebook, we will use only its **abstract** as input to the summarization model.

In [None]:
def extract_abstracts(samples: dict) -> dict:
    """
    Extracts the abstracts from the given article samples.

    Args:
        samples: The batch of article samples.

    Returns: A new column with the extracted abstracts.
    """
    abstracts = [article.split("\n")[0].strip() for article in samples["article"]]

    return {"abstract": abstracts}


def apply_chat_ml(text: str, summary: str | None = None) -> list:
    """
    Applies Chat Markup Language structure to an input text for lay biomedical summarization.

    Args:
        text: The input article text.
        summary: The ground-truth summary.

    Returns: A list of message dictionaries in ChatML format.
    """

    messages = [
        {
            "role": "system",
            "content": "You are a specialist medical communicator responsible for translating biomedical articles into a clear, accurate 10-20 sentence summary for non-experts. The summary should be at a Flesch–Kincaid grade level of 10–14 and explain any technical terms.",
        },
        {
            "role": "user",
            "content": text,
        },
        {
            "role": "assistant",
            "content": f"Summary:{summary or ''}",
        },
    ]

    return messages


def batch_chat_ml(samples: dict, tokenizer) -> dict:
    """
    Formats a batch of samples into ChatML messages.

    Args:
        samples: The batch of samples to format.
        tokenizer: A pre-trained tokenizer.

    Returns: A new column with the formatted messages.
    """

    pairs = zip(samples["abstract"], samples["summary"])
    messages = [
        tokenizer.apply_chat_template(
            apply_chat_ml(text, label),
            date_string=DATE_STRING,
            tokenize=False,
            continue_final_message=True,  # 💡 Continue the `assistant` message added in `apply_chat_ml`.
        )
        for text, label in pairs
    ]

    return {"text": messages}


📘 **Further Reading**: [Chat Templates](https://huggingface.co/blog/chat-templates)

In [None]:
plos_train = plos_train.map(extract_abstracts, batched=True)
plos_test = plos_test.map(extract_abstracts, batched=True)

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
plos_train["abstract"][0]

'Fungal pathogens exploit diverse mechanisms to survive exposure to antifungal drugs . This poses concern given the limited number of clinically useful antifungals and the growing population of immunocompromised individuals vulnerable to life-threatening fungal infection . To identify molecules that abrogate resistance to the most widely deployed class of antifungals , the azoles , we conducted a screen of 1 , 280 pharmacologically active compounds . Three out of seven hits that abolished azole resistance of a resistant mutant of the model yeast Saccharomyces cerevisiae and a clinical isolate of the leading human fungal pathogen Candida albicans were inhibitors of protein kinase C ( PKC ) , which regulates cell wall integrity during growth , morphogenesis , and response to cell wall stress . Pharmacological or genetic impairment of Pkc1 conferred hypersensitivity to multiple drugs that target synthesis of the key cell membrane sterol ergosterol , including azoles , allylamines , and mo

✅ — The `train` and `test` splits now have an **abstract** column. We still need to apply the ChatML formatting in the following sections.

<a name="inference"></a>
# 2 - Using the model for inference

In this section, we will generate abstractive summaries for the `test` split using the base model **without fine-tuning**. The steps we will follow are:

1. Load the **model** and its **tokenizer**.
2. Use the previously defined functions to format the data in the Chat Markup Language, which is the format this [specific model](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct) requires.
3. Make **predictions** on the `test` split and **evaluate** the model using `ROUGE`.

📘 **Further Reading**: [ROUGE](https://github.com/google-research/google-research/tree/master/rouge)

In [None]:
%%capture
import torch
import numpy as np
from tqdm.auto import tqdm
from rouge_score import rouge_scorer
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template


# 💡 We set the default value for `max_seq_length` using the constant defined at the beginning.
def load_model(max_seq_length=MAX_SEQ_LENGTH, load_in_4bit=True) -> tuple:
    """
    Loads a pre-trained LLaMA 3.2 3B Instruct model and its tokenizer with quantization.

    Args:
        max_seq_length: Maximum number of tokens in the input sequence.
        load_in_4bit: Whether to load the model in 4-bit precision.

    Returns: A tuple containing the loaded model and tokenizer.
    """

    # 💡 The `Instruct` suffix indicates the model is fine-tuned for following instructions.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-3B-Instruct",
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,  # Loading in 4-bit reduces memory usage.
    )
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="llama-3.2",
    )

    return model, tokenizer


def tokenize_messages(tokenizer, messages: list) -> tuple:
    """
    Tokenizes a list of ChatML-formatted messages using the given tokenizer.

    Args:
        tokenizer: A tokenizer with a chat template loaded.
        messages: A list of message dictionaries in ChatML format.

    Returns: Tokenized input tensor and its sequence length.
    """
    inputs = tokenizer.apply_chat_template(
        messages,
        date_string=DATE_STRING,
        tokenize=True,
        continue_final_message=True,  # 💡 Continue the `assistant` message added in `apply_chat_ml`.
        return_tensors="pt",
    ).to("cuda")  # 💡 Assumes an NVIDIA CUDA GPU is available.

    return inputs, inputs.shape[1]


def make_predictions(model, tokenizer, data) -> list:
    """
    Runs inference on a dataset to generate lay summaries of biomedical abstracts.

    Args:
        model: The language model used for inference.
        tokenizer: The tokenizer used to encode and decode text.
        data: A dataset containing an `abstract` field with the raw abstract text.

    Returns: A list of model predictions (decoded text).
    """

    # 💡 Preparing the model for inference disables training and enables faster generation.
    FastLanguageModel.for_inference(model)

    predictions = []
    for text in tqdm(data["abstract"], desc="Generating summaries"):
        messages = apply_chat_ml(text)
        inputs, input_length = tokenize_messages(tokenizer, messages)

        # `max_new_tokens` sets the maximum number of tokens to generate, ignoring the input tokens.
        outputs = model.generate(input_ids=inputs, max_new_tokens=256, use_cache=True)

        # 💡 The `generate` function will repeat the input tokens in its output.
        # We use the `input_length` to skip input tokens and decode only the generated summary.
        prediction = tokenizer.batch_decode(
            outputs[:, input_length:], skip_special_tokens=True
        )[0].strip()

        predictions.append(prediction)

    return predictions


def evaluate_rouge(predictions: list[str], references: list[str]) -> dict:
    """
    Calculate average ROUGE scores across multiple text pairs.

    Args:
        predictions: List of predicted summaries.
        references: List of ground-truth summaries.

    Returns: Average of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
    """

    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeLsum"], use_stemmer=True, split_summaries=True
    )
    # Compute the scores for all prediction/reference pairs.
    # 💡 `score` expects the reference (target) first, then the prediction.
    scores = [scorer.score(ref, pred) for pred, ref in zip(predictions, references)]

    rouge1 = np.mean([score["rouge1"].fmeasure for score in scores])
    rouge2 = np.mean([score["rouge2"].fmeasure for score in scores])
    rougeL = np.mean([score["rougeLsum"].fmeasure for score in scores])
    average = np.mean([rouge1, rouge2, rougeL])

    return {
        "rouge1": rouge1.item(),
        "rouge2": rouge2.item(),
        "rougeL": rougeL.item(),
        "average": average.item(),
    }

🔬 — `model.generate` runs the transformer blocks and the language modeling head, so it supports many parameters related to decoding, such as `temperature` and `top_p`.
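To build some intuition for `temperature` and `top_p`, here is a minimal NumPy sketch of how they could reshape a next-token distribution. It is a simplified illustration, not the `transformers` implementation; the helper `apply_temperature_and_top_p` and the toy logits are made up for this example.

```python
import numpy as np


def apply_temperature_and_top_p(logits, temperature=1.0, top_p=1.0):
    """Turn raw logits into a sampling distribution using temperature and nucleus (top-p) filtering."""
    # Dividing by the temperature sharpens (< 1) or flattens (> 1) the distribution.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of highest-probability tokens whose cumulative mass reaches top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]

    # Zero out everything else and renormalize.
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


toy_logits = [2.0, 1.0, 0.5, -1.0]  # Hypothetical scores for a 4-token vocabulary.
sharp = apply_temperature_and_top_p(toy_logits, temperature=0.5)
flat = apply_temperature_and_top_p(toy_logits, temperature=2.0)
nucleus = apply_temperature_and_top_p(toy_logits, top_p=0.8)
print(sharp.round(3), flat.round(3), nucleus.round(3))
```

A lower temperature concentrates probability on the top token, a higher one spreads it out, and `top_p=0.8` removes the low-probability tail before sampling. In `model.generate`, these parameters only take effect when sampling is enabled with `do_sample=True`.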

📘 **Further Reading**:
- [What is Quantization?](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
- [Text generation configuration](https://huggingface.co/docs/transformers/en/main_classes/text_generation)

Load the model and its tokenizer:

In [None]:
# This cell should take about a minute to run the first time.
model, tokenizer = load_model()

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Run inference using the `make_predictions` function we defined above:

In [None]:
# This cell should take about three minutes to run.
plos_pred_summaries = make_predictions(model, tokenizer, plos_test)

Generating summaries:   0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
evaluate_rouge(plos_pred_summaries, plos_test["summary"])

{'rouge1': 0.37434481758439525,
 'rouge2': 0.07131477371597801,
 'rougeL': 0.3445218981302151,
 'average': 0.2633938298101961}

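🔬 — `rouge1` measures the overlap of single words (unigrams), ignoring word order. `rouge2` captures the overlap of adjacent word pairs (bigrams). `rougeL` is based on the longest common subsequence, so it rewards words that appear in the same order even when they are not adjacent; the code above uses the `rougeLsum` variant, which computes it sentence by sentence. Since we are using a **stemmer**, the overlaps use the stem forms of the words instead of exact matches.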
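The differences between these metrics are easier to see on a toy pair of sentences. The sketch below re-implements the core ideas in plain Python, assuming whitespace tokenization and no stemming, so its numbers will not match `rouge_score` exactly. The helpers `ngram_f1` and `lcs_length` and the two sentences are invented for this illustration.

```python
from collections import Counter


def ngram_f1(prediction, reference, n):
    """F1 over clipped n-gram overlap, the idea behind ROUGE-N (no stemming here)."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

    pred = ngrams(prediction.lower().split())
    ref = ngrams(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, the quantity behind ROUGE-L."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


reference = "the drug blocks the fungal enzyme"
prediction = "the fungal enzyme blocks the drug"

rouge1 = ngram_f1(prediction, reference, n=1)
rouge2 = ngram_f1(prediction, reference, n=2)
lcs = lcs_length(prediction.split(), reference.split())
print(rouge1, rouge2, lcs)
```

Both sentences contain exactly the same words, so the unigram score is perfect even though the meaning is reversed. The bigram score drops because one word pair no longer matches, and the longest common subsequence covers only three of the six words.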

In [None]:
(plos_pred_summaries[0], "", plos_test["summary"][0])

("Researchers studied the circadian clock in a type of algae called Synechococcus to understand how a key protein called KaiC works. KaiC helps regulate the clock's genes, but it's not clear which version of the protein is active when it's doing its job. To figure this out, the scientists created a computer model that simulated different ways the KaiC protein could work. They tested 32 different models, and only one was able to accurately predict the effects of changing the KaiC protein's activity. The model showed that the KaiC protein's activity is controlled by two types of feedback mechanisms: one that activates the gene and one that suppresses it. The scientists also found similarities between their model and the way the mammalian clock is regulated, which could help us better understand how our own internal clocks work. \n\nTechnical terms explained:\n\n- Circadian model organism: Synechococcus is a type of algae that is used to study the circadian clock, a biological process tha

✅ — We got relatively low F1 scores for the overlap of the summaries. Additionally, the model generated a section of "Technical terms" that does not appear in the structure of the ground-truth summary.

<a name="finetuning"></a>
# 3 - Fine-tuning the model

In this section, we will fine-tune the model using the training data and evaluate it using the same benchmark we used in [Section 2](#inference). This section includes the following:

1. Format the training data using the functions defined above.
2. Reload the model and prepare it for fine-tuning using `PEFT`.
3. Create the `trainer` manager using `SFTTrainer` from [trl](https://huggingface.co/docs/trl/v0.17.0/en/index) (another Hugging Face library).
4. Run the training process (fine-tune the model).
5. Make predictions with the **fine-tuned model** and evaluate them.

Before fine-tuning the model, we must format the training data using the functions defined in [Section 1](#data).

In [None]:
formatted_plos_train = plos_train.map(
    batch_chat_ml, batched=True, fn_kwargs={"tokenizer": tokenizer}
)
formatted_plos_train = formatted_plos_train.select_columns(["text"])
formatted_plos_train["text"][0]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 13 Nov 2025\n\nYou are a specialist medical communicator responsible for translating biomedical articles into a clear, accurate 10-20 sentence summary for non-experts. The summary should be at a Flesch–Kincaid grade level of 10–14 and explain any technical terms.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFungal pathogens exploit diverse mechanisms to survive exposure to antifungal drugs . This poses concern given the limited number of clinically useful antifungals and the growing population of immunocompromised individuals vulnerable to life-threatening fungal infection . To identify molecules that abrogate resistance to the most widely deployed class of antifungals , the azoles , we conducted a screen of 1 , 280 pharmacologically active compounds . Three out of seven hits that abolished azole resistance of a resistant mutant of the model yeast Saccharomyces cer

In [None]:
formatted_plos_train.to_pandas()

Unnamed: 0,text
0,<|begin_of_text|><|start_header_id|>system<|en...
1,<|begin_of_text|><|start_header_id|>system<|en...
2,<|begin_of_text|><|start_header_id|>system<|en...
3,<|begin_of_text|><|start_header_id|>system<|en...
4,<|begin_of_text|><|start_header_id|>system<|en...
...,...
295,<|begin_of_text|><|start_header_id|>system<|en...
296,<|begin_of_text|><|start_header_id|>system<|en...
297,<|begin_of_text|><|start_header_id|>system<|en...
298,<|begin_of_text|><|start_header_id|>system<|en...


🔬 — This LLM only does **next-token** prediction. The "system" and "user" prompts simply format the input text that the model will autocomplete.
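To make this concrete, here is a hand-written sketch that flattens a list of messages into one string, loosely mirroring the formatted example shown above. The function `render_llama3_style` is a simplified stand-in written for this illustration, not the tokenizer's actual chat template, and the message contents are placeholders.

```python
def render_llama3_style(messages, date_string, continue_final_message=False):
    """Simplified stand-in for the chat template; not the tokenizer's real template."""
    text = "<|begin_of_text|>"
    for index, message in enumerate(messages):
        content = message["content"]
        if message["role"] == "system":
            # The date lines are prepended to the system prompt.
            content = (
                "Cutting Knowledge Date: December 2023\n"
                f"Today Date: {date_string}\n\n{content}"
            )
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n{content}"
        is_last = index == len(messages) - 1
        # Leaving the final message open lets the model keep writing after "Summary:".
        if not (is_last and continue_final_message):
            text += "<|eot_id|>"
    return text


messages = [
    {"role": "system", "content": "You summarize biomedical articles for non-experts."},
    {"role": "user", "content": "Fungal pathogens exploit diverse mechanisms ..."},
    {"role": "assistant", "content": "Summary:"},
]
prompt = render_llama3_style(messages, "13 Nov 2025", continue_final_message=True)
print(prompt)
```

The result is a single string that ends with `Summary:`, and the model simply predicts the tokens that follow it.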

We reload the model to start from a fresh copy of the base weights, since the previous instance was switched to inference mode by `FastLanguageModel.for_inference`:

In [None]:
model, tokenizer = load_model()

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Just as we prepared the model for `inference` when making predictions, we need to configure it for `fine-tuning`. PEFT, or **"Parameter-Efficient Fine-Tuning"**, implements several techniques that **significantly reduce** the number of parameters we need to train. Without them, fine-tuning a 3B-parameter model would not fit within the time available in this workshop.

🔬 — The main optimization we will use is called **QLoRA (Quantized Low-Rank Adaptation)**. This method freezes the original weights of the model and trains **new** parameters by representing weight updates as the product of two, much smaller, matrices. These new parameters are called **adapter weights**.
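The low-rank idea can be sketched in a few lines of NumPy. This is a toy illustration under assumed sizes, not how Unsloth or PEFT implement it: the dimensions, the initialization scale, and the helper `lora_forward` are chosen for the example, and the quantization of the frozen weights (the "Q" in QLoRA) is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, lora_alpha = 64, 48, 4, 8  # Toy sizes, chosen only for illustration.

W = rng.normal(size=(d_out, d_in))      # Frozen pre-trained weight.
A = rng.normal(size=(r, d_in)) * 0.01   # Trainable adapter matrix, small random init.
B = np.zeros((d_out, r))                # Trainable adapter matrix, zero init.


def lora_forward(x):
    """Frozen path plus the scaled low-rank update (lora_alpha / r) * B @ A."""
    return x @ W.T + (lora_alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
full_params = W.size
adapter_params = A.size + B.size
print(full_params, adapter_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original one, and training only has to update the `r * (d_in + d_out)` adapter values instead of the full `d_out * d_in` matrix.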

📘 **Further Reading**:

- [LoRA Conceptual Guide](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
- [LoRA Hyperparameters Guide](https://docs.unsloth.ai/get-started/fine-tuning-guide/lora-hyperparameters-guide)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    random_state=SEED,
    use_gradient_checkpointing="unsloth",
)

In [None]:
model.print_trainable_parameters()

trainable params: 24,313,856 || all params: 3,237,063,680 || trainable%: 0.7511


The **PEFT** model takes the following arguments:

- `r`: The LoRA rank, which controls the size of the decomposition matrices. Lower rank -> fewer trainable parameters -> less memory usage.
- `lora_alpha`: Scaling factor to balance the influence of the new weights with the original model weights.
- `target_modules`: The modules of the original model to which we want to apply LoRA. In the code above, we are targeting all the attention and MLP projection layers.
- `random_state`: Fixed seed value for reproducibility.
- `use_gradient_checkpointing`: Enable Unsloth optimization to reduce memory usage.
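As a sanity check on the trainable parameter count printed above, we can add up the adapter sizes by hand. Each adapted projection contributes `r * (in_features + out_features)` parameters. The layer shapes below are assumptions about the Llama 3.2 3B architecture (hidden size 3072, key/value width 1024, MLP width 8192, 28 decoder layers) rather than values read from the loaded model.

```python
# Assumed (in_features, out_features) for each targeted projection in one decoder layer.
PROJECTION_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # Assumed number of decoder layers.
RANK = 16        # Same value as `r` in `get_peft_model`.


def lora_params(in_features, out_features, rank):
    """Parameters in the two adapter matrices: (rank x in) plus (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,}")  # 24,313,856
```

Under these assumptions the total matches the `trainable params` value reported by `print_trainable_parameters()`. Halving `r` would halve this number.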


🔬 — The ratio $\alpha / r$ (`lora_alpha` / `r`) scales the low-rank update before it is added to the original model weights. A ratio of 0 would discard the new weights, 1 adds them unscaled, and a ratio larger than 1 amplifies them relative to the existing parameters. With `lora_alpha=32` and `r=16`, the ratio used here is 2.

📘 **Further Reading**: [Unsloth Gradient Checkpointing](https://unsloth.ai/blog/long-context)

After configuring the model for fine-tuning, we need to create an object to manage the **training process**. We will use the [SFTTrainer](https://huggingface.co/docs/trl/v0.17.0/en/sft_trainer#trl.SFTTrainer) for this purpose. SFT stands for **Supervised Fine-Tuning**, and we will use this trainer to handle batching the data, saving checkpoints, adjusting the learning rate, etc.

In [None]:
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from trl import SFTTrainer
from unsloth import is_bfloat16_supported


trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_plos_train,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=SEED,
        output_dir="outputs",
        report_to="none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/300 [00:00<?, ? examples/s]

🔬 — Unsloth provides a `train_on_responses_only` function that tells the `trainer` to compute the loss only on the tokens of the assistant's response, instead of using the entire message (`input + output`). The trainer above does not call it, so it trains on the entire message.
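Restricting the loss to the response is usually done by masking the labels of the prompt tokens. The sketch below shows the idea in plain Python; the helper `mask_prompt_labels` and the token ids are hypothetical, and this is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default.


def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, hiding every position before the response."""
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])


# Hypothetical token ids: the first four stand for the prompt, the last three for the summary.
input_ids = [101, 7, 8, 9, 42, 43, 44]
labels = mask_prompt_labels(input_ids, response_start=4)
print(labels)
```

Positions labeled `-100` contribute nothing to the loss, so the gradient comes only from the summary tokens.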

Some of the key arguments used to configure the **trainer** are described below.

- `train_dataset`: The training data. At this point, it should be formatted using the ChatML template.
- `dataset_text_field`: The name of the input field in the dataset.
- `max_seq_length`: The context window as defined above.
- `data_collator`: Object to dynamically handle batching and padding in the dataset.
- `per_device_train_batch_size`: Number of samples per batch per GPU.
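- `gradient_accumulation_steps`: Number of batches whose gradients are accumulated before each optimizer step.
- `num_train_epochs`: Number of full passes over the training data. One step processes `per_device_train_batch_size` × `gradient_accumulation_steps` samples per GPU.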
- `report_to`: This can be used to enable logging with WandB or other tools.
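To see how the batch settings in the trainer configuration translate into training steps, we can do the arithmetic by hand. This is only an estimate: it assumes a single GPU, and the exact step count depends on how the trainer handles the last partial batch.

```python
import math

num_samples = 300                # Size of the `train` subset.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                  # Assumption: a single GPU, as on Colab.

# Samples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
approx_steps_per_epoch = math.ceil(num_samples / effective_batch_size)
print(effective_batch_size, approx_steps_per_epoch)
```

With these values, each optimizer step sees 8 samples, so one epoch over 300 samples takes roughly 38 steps.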

Let's run the fine-tuning process:

In [None]:
# This cell should take about five minutes to run.
trainer_stats = trainer.train()

Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.5887
2,2.6047
3,2.661
4,2.3729
5,2.4103
6,2.4792
7,2.2197
8,2.2886
9,2.0976
10,2.2388


Evaluate the fine-tuned model using the same benchmark as in [Section 2](#inference):

In [None]:
# This cell should take about three minutes to run.
plos_pred_summaries = make_predictions(model, tokenizer, plos_test)

Generating summaries:   0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
evaluate_rouge(plos_pred_summaries, plos_test["summary"])

{'rouge1': 0.4943382428298432,
 'rouge2': 0.19325402177640388,
 'rougeL': 0.447780194348962,
 'average': 0.3784574863184031}

In [None]:
(plos_pred_summaries[0], "", plos_test["summary"][0])

('Synechococcus, a cyanobacterium, is a model organism for circadian biology. The KaiC protein is central to the circadian clock, and its activity is regulated by phosphorylation. KaiC is activated and repressed by phosphorylation and dephosphorylation. The positive feedback of kaiBC expression is mediated by transcriptional and translational activation by phosphorylated KaiC. Conversely, unphosphorylated KaiC represses kaiBC expression. In this study, we developed a TTFL/PTO model that simulates the effects of removal or overexpression of kai genes. We found that the TTFL/PTO model is able to reproduce existing experimental observations. We discuss parallels between our proposed TTFL/PTO model and two-loop feedback structures found in the mammalian clock.',
 '',
 'Many organisms possess a true circadian clock and coordinate their activities into daily cycles . Among the simplest organisms harboring such a 24 h-clock are cyanobacteria . Interactions among three proteins , KaiA , KaiB ,

✅ — Done!

📘 **Further Reading**:
- [Unsloth Fine-tuning Guide](https://docs.unsloth.ai/get-started/fine-tuning-guide)
- [Hugging Face LLM Course](https://huggingface.co/learn/llm-course/chapter0/1)