# Homework 3 part 2: Instruction tuning LLMs

## Learning objectives
After completing this assignment, students will be able to:     
* Implement instruction tuning of LLMs using Hugging Face tools
* Evaluate instruction-tuned LLMs against base models

## Overview
In this part of the assignment, you will instruction-tune an LLM of your choice for multilingual tasks. You will evaluate its performance on a custom test set of prompts that you come up with. Complete your code in this Jupyter notebook. You can run it on the Pitt CRCD's GPUs with the class conda environment, or with a custom Python environment if you choose.

**Please note that instruction tuning can take a significant amount of time, up to 2 hours.** Sessions on the CRCD teach cluster (jupyter.crc.pitt.edu) are limited to 3 hours. If you are using the teach cluster, we recommend selecting the most powerful GPU, the Nvidia L4. If you need more time, see the CRCD's [Open OnDemand User Guide](https://crc-pages.pitt.edu/user-manual/web-portals/jupyter-ondemand/).

## Deliverables
1. Your code: the Jupyter notebook for part 2. Submit:
    * your .ipynb file
    * a **.html export of your notebook**. To get a .html version, click File > Save and Export Notebook As... > HTML from within JupyterLab. 
2. A PDF report with answers to questions provided in the template notebook. Please name your report `hw3_{your pitt email id}.pdf`. No need to include @pitt.edu, just use the email ID before that part. For example: `report_mmyoder_hw3.pdf`. **Please make only one PDF report, containing answers to part 1 and part 2.** Make sure to include the following information:
    * answers to all the numbered questions below
    * any additional resources, references, or web pages you've consulted
    * anyone with whom you've discussed the assignment, and the nature of those discussions
    * any generative AI tool used, and how it was used
    * any unresolved issues or problems

Please submit all of this material on Canvas. We will grade your report and may look over your code.

## Recommended Guide
- [Causal language modeling (Hugging Face guide)](https://huggingface.co/docs/transformers/en/tasks/language_modeling)

**To get started, start filling in this Jupyter notebook.**


## Section 4: Multilingual instruction tuning

LLMs generally perform much better for English and other high-resource languages than for languages with less text available on the Internet. Here you will fine-tune a base LLM on a dataset of instructions and desired answers in a language of your choice from Cohere's [Aya Dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset).

In [None]:
from datasets import load_dataset

aya_ds = load_dataset("CohereLabs/aya_dataset", split="train")
aya_ds

Let's take a look at all of the languages available in the dataset and how much instruction data is present for each.

In [None]:
import pandas as pd

pd.set_option('display.max_rows', None)

aya_df = aya_ds.to_pandas()
aya_df.language.value_counts()

## Custom test set

Choose a language from the set of languages included in the Aya dataset to use for fine-tuning a model. Ideally it's a language you're familiar with. 

* **Problem 4.1**: Provide the name of the language you chose in the report (and the reason you chose it, if you wish)

* **Problem 4.2**: Come up with a custom test set of 5 instructions or questions that you'd like an LLM to be able to answer, along with a "correct" answer for each one. Please list them here and in your report. If you're not familiar with the language you chose, you can use translation tools such as Google Translate to translate a few instructions and desired responses.

### Filter the Aya dataset to your language and preprocess
Please complete the following steps:
* Filter the Aya dataset to your chosen language
* Concatenate the `inputs` and `targets` together to form a combined text field to be used as the input text for instruction tuning
* Split that filtered dataset into 90% training and 10% test set. **This will be your provided test set**.

See the [recommended Hugging Face guide](https://huggingface.co/docs/transformers/en/tasks/language_modeling) and the [Hugging Face Dataset class documentation](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset). Feel free to use generative AI to find functions for this (but not to write the complete code). Add cells below to complete these steps.

## Base models
Now it's time to choose which base model(s) you will fine-tune on your dataset of instructions in the language of your choice. Choose one that is small enough to run in a reasonable amount of time on the CRCD resources, ideally with fewer than 1 billion parameters. Here are some options, but feel free to test others. You only need to choose one model, though you are welcome to test more than one.

* [BLOOM 560m](https://huggingface.co/bigscience/bloom-560m)
* [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)

**Problem 4.3**: Provide the name of the model(s) you chose in your report.

Add cells below to tokenize your dataset.

Provided below is a function for grouping texts that may exceed the model's token limit. Apply this to your tokenized dataset.

In [None]:
# Handle texts that are too long
# From https://huggingface.co/docs/transformers/en/tasks/language_modeling

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
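
To see what this grouping does, the sketch below runs the same logic on a tiny made-up batch. The name `group_texts_demo`, the `block_size=4` default, and the toy token IDs are assumptions for illustration, chosen so the result is easy to read; the function is renamed so it does not overwrite the provided `group_texts`.

```python
def group_texts_demo(examples, block_size=4):
    # Same logic as the provided group_texts, with block_size as a parameter
    # so the demo can use a tiny value.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three made-up tokenized texts of lengths 3, 4, and 3 (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_demo(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The 10 tokens are concatenated and cut into two blocks of 4, and the last 2 tokens are dropped. In the Hugging Face guide, the function is applied with `Dataset.map(group_texts, batched=True)`. Because it concatenates every column as a list, the dataset passed to it should contain only tokenizer output columns, not the original text columns.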

## Instruction tuning
Ok, it's time to prepare the data for model training and start doing the instruction tuning! Add cells below to use the Hugging Face `Trainer` class or another approach to fine-tune your chosen base model on your set of instruction data. Note that if you need to test things or otherwise keep training time short, the `num_train_epochs` parameter of `TrainingArguments` can be set to a float less than 1.
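
To get a feel for how a fractional `num_train_epochs` shortens training, the following back-of-the-envelope sketch estimates the number of optimizer steps. The helper name `approx_optimizer_steps` and the example numbers are hypothetical, and the formula is only an approximation that assumes a single GPU and no dropped final batch; the step count that `Trainer` reports is the authoritative one.

```python
import math

def approx_optimizer_steps(num_examples, batch_size, num_train_epochs, grad_accum_steps=1):
    """Rough estimate of optimizer steps for a training run."""
    # Batches needed to see every training example once.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches make up one optimizer step.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_train_epochs * updates_per_epoch)

# Hypothetical setup: 9,000 training blocks with a batch size of 8.
full_run = approx_optimizer_steps(9000, 8, num_train_epochs=1.0)
quick_test = approx_optimizer_steps(9000, 8, num_train_epochs=0.1)
print(full_run, quick_test)
```

Timing a short run like this and scaling up gives a rough estimate of whether a full run will fit in your session limit.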

## Provided test set evaluation

* **Problem 4.4**: Please provide the inputs from your provided test set (the 10% held out from your language's portion of the Aya dataset) to a base model, and quantitatively compare its responses against the reference targets using a text comparison metric such as ROUGE, ChrF, or BERTScore. Then measure your instruction-tuned model's responses against the same reference text. Did instruction tuning improve the responses? Provide an evaluation table in your report as well as a discussion.
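
To make the idea of a text comparison metric concrete, here is a simplified unigram-overlap F1 score, similar in spirit to ROUGE-1. It is a sketch only: the function names and example sentences are made up, and whitespace tokenization is an assumption that does not suit languages written without spaces. For your report, use an established implementation of ROUGE, ChrF, or BERTScore instead.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """F1 over shared words between a model response and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_score(predictions, references):
    """Average the per-example scores over a test set."""
    scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

# Made-up example: one reference and one response from each model.
references = ["the capital of france is paris"]
base_outputs = ["france is a country in europe"]
tuned_outputs = ["the capital is paris"]

print("base :", round(mean_score(base_outputs, references), 3))
print("tuned:", round(mean_score(tuned_outputs, references), 3))
```

Averaging one score per test example for each model gives the two rows of a simple evaluation table.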

## Custom test set evaluation
* **Problem 4.5**: Gather responses from both your base model and your instruction-tuned model to your custom set of 5 instructions. Qualitatively compare the responses from the two models, using a machine translation service such as Google Translate if you are unfamiliar with the language you chose. Did instruction tuning improve the responses? It's okay if it didn't, but either way, discuss any patterns you see in your report.