
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# 04L - Fine-tuning LLMs
In this lab, we will apply the fine-tuning learnings from the demo Notebook. The aim of this lab is to fine-tune an instruction-following LLM.

### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives
1. Prepare a novel dataset
1. Fine-tune the pythia-70m-deduped model to follow instructions.
1. Evaluate the generated responses with ROUGE metrics.

In [0]:
assert "gpu" in spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"), "THIS LAB REQUIRES THAT A GPU MACHINE AND RUNTIME IS UTILIZED."

## Classroom Setup

In [0]:
%pip install rouge_score==0.1.2

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py): started
  Building wheel for rouge_score (setup.py): finished with status 'done'
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24936 sha256=fb58580244928033d306d29f6bba7b8809f0b01298c0b78d70df6a19ba485b6a
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [0]:
%run ../Includes/Classroom-Setup


Resetting the learning environment:
| enumerating serving endpoints...found 0...(0 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/labuser4470716@vocareum.com/large-language-models"...(1 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/large-language-models/v01"

Validating the locally installed datasets:
| listing local files...(7 seconds)
| removing extra path: /models--EleutherAI--pythia-70m-deduped/.no_exist/e93a9faa9c77e5d09219f6c868bfc7a1bd65593c/...(0 seconds)
| removing extra path: /models--EleutherAI--pythia-70m-deduped/blobs/...(0 seconds)
| removing extra path: /models--EleutherAI--pythia-70m-deduped/snapshots/e93a9faa9c77e5d09219f6c868bfc7a1bd65593c/...(1 seconds)
| fixed 3 issues...(8 seconds total)


Importing lab testing framework.



Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: /dbfs/mnt/dbacademy-users/labuser4470716@vocareum.com/large-language-models
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/labuser4470716@vocareum.com/large-language-models/database.db
| DA.paths.datasets:    /dbfs/mnt/dbacademy-datasets/large-language-models/v01

Setup completed (17 seconds)

The models developed or used in this course are for demonstration and learning purposes only.
Models may occasionally output offensive, inaccurate, biased information, or harmful instructions.


In [0]:
print(f"Username:          {DA.username}")
print(f"Working Directory: {DA.paths.working_dir}")

Username:          labuser4470716@vocareum.com
Working Directory: /dbfs/mnt/dbacademy-users/labuser4470716@vocareum.com/large-language-models


In [0]:
%load_ext autoreload
%autoreload 2

Create a local temporary directory on the driver. It serves as the root directory for the intermediate model checkpoints created during training. The final model will be persisted to DBFS.

In [0]:
import tempfile

tmpdir = tempfile.TemporaryDirectory()
local_training_root = tmpdir.name
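
The lifecycle of such a directory can be sketched with the standard library alone. The names `demo_tmpdir` and `checkpoint-500.bin` are hypothetical stand-ins chosen for this illustration, so that running the sketch does not interfere with the lab's own `tmpdir`.

```python
import os
import tempfile

# Create a separate temporary root for the demonstration.
demo_tmpdir = tempfile.TemporaryDirectory()
demo_root = demo_tmpdir.name

# Hypothetical stand-in for an intermediate checkpoint file.
checkpoint_file = os.path.join(demo_root, "checkpoint-500.bin")
with open(checkpoint_file, "wb") as f:
    f.write(b"\x00" * 16)

exists_before_cleanup = os.path.exists(checkpoint_file)

# cleanup() deletes the directory and everything in it, which is why the
# final model has to be saved to a persistent location first.
demo_tmpdir.cleanup()
exists_after_cleanup = os.path.exists(demo_root)

print(exists_before_cleanup, exists_after_cleanup)
```

Anything still needed after `cleanup()` has to be copied out beforehand; in this lab that role is played by the final save to DBFS.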

## Fine-Tuning

In [0]:
import os
import pandas as pd
from datasets import load_dataset
from transformers import (
    TrainingArguments,
    AutoTokenizer,
    AutoConfig,
    Trainer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
)

import evaluate
import nltk
from nltk.tokenize import sent_tokenize



### Question 1: Data Preparation
For the instruction-following use cases we need a dataset that consists of prompt/response pairs along with any contextual information that can be used as input when training the model. The [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) is one such dataset that provides high-quality, human-generated prompt/response pairs. 

Let's start by loading this dataset using the `load_dataset` functionality.

In [0]:
# TODO
ds = load_dataset('databricks/databricks-dolly-15k')

Found cached dataset json (/root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)



In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_1(ds)

PASSED: All tests passed for lesson4, question1
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 2: Select pre-trained model

The model that we are going to fine-tune is [pythia-70m-deduped](https://huggingface.co/EleutherAI/pythia-70m-deduped). This model is part of the Pythia suite of models, which was developed to support interpretability research.

Let's define the pre-trained model checkpoint.

In [0]:
# TODO
model_checkpoint = "EleutherAI/pythia-70m-deduped"

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_2(model_checkpoint)

PASSED: All tests passed for lesson4, question2
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 3: Load and Configure

The next task is to load and configure the tokenizer for this model. The instruction-following process builds a body of text that contains the instruction, context input, and response values from the dataset. The body of text also includes some special tokens to identify the sections of the text. These tokens are generally configurable, and need to be added to the tokenizer.

Let's go ahead and load the tokenizer for the pre-trained model.

In [0]:
# TODO
# load the tokenizer that was used for the model
tokenizer = AutoTokenizer.from_pretrained(
  model_checkpoint, cache_dir=DA.paths.datasets
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["### End", "### Instruction:", "### Response:\n"]}
)

3
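
To see why the section markers are registered as special tokens, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `toy_tokenize` is a hypothetical helper that splits on whitespace but keeps registered markers as single tokens, which approximates the effect of `add_special_tokens` on those markers.

```python
import re

SPECIAL_TOKENS = ["### End", "### Instruction:", "### Response:\n"]


def toy_tokenize(text, special_tokens=()):
    """Split text on whitespace, keeping any registered special token whole."""
    if not special_tokens:
        return text.split()
    # Try longer markers first so a longer marker wins over a shorter one.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)  # marker stays a single token
        else:
            tokens.extend(piece.split())
    return tokens


text = "### Instruction:\nName a fish.\n### Response:\nTope\n### End"
plain = toy_tokenize(text)
with_markers = toy_tokenize(text, SPECIAL_TOKENS)
print(len(plain), plain)
print(len(with_markers), with_markers)
```

Without registration each marker is broken into several pieces; with registration each marker maps to one token, so the model sees a single, unambiguous signal for where a section starts or ends.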

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_3(tokenizer)

PASSED: All tests passed for lesson4, question3
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 4: Tokenize

The `tokenize` function below builds the body of text for each prompt/response pair and tokenizes it.

In [0]:
remove_columns = ["instruction", "response", "context", "category"]


def tokenize(x: dict, max_length: int = 1024) -> dict:
    """
    Build the prompt text from a record's instruction, context, and response,
    then tokenize it into a dictionary of input IDs and an attention mask.
    """
    instr = x["instruction"]
    resp = x["response"]
    context = x["context"]

    instr_part = f"### Instruction:\n{instr}"
    context_part = ""
    if context:
        context_part = f"\nInput:\n{context}\n"
    resp_part = f"### Response:\n{resp}"

    text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
    {instr_part}
    {context_part}
    {resp_part}
    ### End
    """
    return tokenizer(text, max_length=max_length, truncation=True)
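
The text this template produces can be inspected without a tokenizer. The sketch below uses a hypothetical helper, `build_text`, that copies the string assembly from `tokenize` and returns the raw text instead of token IDs. One thing it makes visible: the indentation inside the triple-quoted string is part of the text, so the marker lines carry four leading spaces.

```python
def build_text(instr, resp, context=""):
    """Hypothetical helper: reproduces the string assembled inside `tokenize`."""
    instr_part = f"### Instruction:\n{instr}"
    context_part = f"\nInput:\n{context}\n" if context else ""
    resp_part = f"### Response:\n{resp}"
    # The leading spaces on the following lines end up in the text itself.
    text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
    {instr_part}
    {context_part}
    {resp_part}
    ### End
    """
    return text


example = build_text("Which is a species of fish? Tope or Rope", "Tope")
lines = example.split("\n")
for line in lines:
    print(repr(line))

with_context = build_text("Summarize the text.", "A summary.", "Some reference text.")
```

Records without a context leave a whitespace-only line where the `Input:` section would otherwise appear.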

Let's `tokenize` the Dolly training dataset.

In [0]:
# TODO
tokenized_dataset = ds.map(tokenize, remove_columns=remove_columns)

Loading cached processed dataset at /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-c0237e934fb7db47.arrow


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_4(tokenized_dataset)

PASSED: All tests passed for lesson4, question4
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 5: Set Up Training

To set up the fine-tuning process we need to define the `TrainingArguments`.

Let's configure the training to have **10** training epochs (`num_train_epochs`) with a per device batch size of **8**. The optimizer (`optim`) to be used should be `adamw_torch`. Finally, the reporting (`report_to`) list should be set to *tensorboard*.

In [0]:
# TODO
checkpoint_name = "test-trainer-lab"
local_checkpoint_path = os.path.join(local_training_root, checkpoint_name)
training_args = TrainingArguments(
  local_checkpoint_path,
  num_train_epochs=10,
  per_device_train_batch_size=8,
  optim="adamw_torch",
  report_to=['tensorboard']
)

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_5(training_args)

PASSED: All tests passed for lesson4, question5
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 6: AutoModelForCausalLM

The pre-trained `pythia-70m-deduped` model can be loaded using the [AutoModelForCausalLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) class.

In [0]:
# TODO
# load the pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, cache_dir=DA.paths.datasets)

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_6(model)

PASSED: All tests passed for lesson4, question6
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 7: Initialize the Trainer

Unlike the IMDB dataset used in the earlier notebook, the Dolly dataset contains only a single *train* split. Let's go ahead and create a [`train_test_split`](https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) of the train dataset.

Also, let's initialize the [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) with model, training arguments, the train & test datasets, tokenizer, and data collator. Here we will use the [`DataCollatorForLanguageModeling`](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
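
For intuition, the sketch below mimics in plain Python what this collator does when `mlm=False` and `pad_to_multiple_of=8`: it pads every sequence in a batch to a common length rounded up to a multiple of 8, and it copies the input IDs into `labels` with padded positions set to `-100` so the loss ignores them. It is an illustration, not the library's implementation; `collate`, `PAD_ID`, and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips
PAD_ID = 0           # hypothetical pad token ID for this sketch


def collate(examples, pad_id=PAD_ID, pad_to_multiple_of=8):
    """Pad a list of token-ID lists and derive causal-LM labels."""
    longest = max(len(ex) for ex in examples)
    # Round the longest length up to the next multiple of pad_to_multiple_of.
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = target - len(ex)
        input_ids = list(ex) + [pad_id] * n_pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(ex) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad ID is ignored.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]
        )
    return batch


batch = collate([[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28, 29]])
print(len(batch["input_ids"][0]), batch["labels"][0])
```

Because the masking is keyed on the pad token ID, and this lab sets `pad_token` to `eos_token`, any genuine end-of-sequence token in the inputs would be ignored by the loss in the same way.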

In [0]:
# TODO
TRAINING_SIZE = 6000
SEED = 42

# used to assist the trainer in batching the data
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, return_tensors="pt", pad_to_multiple_of=8
)

split_dataset = tokenized_dataset['train'].train_test_split(train_size=TRAINING_SIZE, seed=SEED)
print(split_dataset)
trainer = Trainer(
    model,
    training_args,
    train_dataset=split_dataset['train'],
    eval_dataset=split_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-def4e132fe756ab6.arrow and /root/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-6488ea99144e6f9d.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 6000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 9011
    })
})
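
The split sizes above, together with the training arguments from Question 5, determine how many optimizer steps a full run takes. The sketch below works through that arithmetic. It assumes a single device and no gradient accumulation, neither of which is stated in this notebook, and `count_optimizer_steps` is a hypothetical helper.

```python
import math


def count_optimizer_steps(num_rows, per_device_batch_size, num_epochs, num_devices=1):
    """Steps for a full run, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return steps_per_epoch * num_epochs


train_rows, test_rows = 6000, 9011
total_rows = train_rows + test_rows

steps_per_epoch = count_optimizer_steps(train_rows, per_device_batch_size=8, num_epochs=1)
total_steps = count_optimizer_steps(train_rows, per_device_batch_size=8, num_epochs=10)
print(total_rows, steps_per_epoch, total_steps)
```

With more than one device, pass the device count as `num_devices`; the number of steps shrinks accordingly.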


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_7(trainer)

PASSED: All tests passed for lesson4, question7
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


### Question 8: Train

Before starting the training process, let's turn on TensorBoard. This will allow us to monitor the training process as checkpoint logs are created.

In [0]:
tensorboard_display_dir = f"{local_checkpoint_path}/runs"

In [0]:
%load_ext tensorboard
%tensorboard --logdir '{tensorboard_display_dir}'

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
Your log directory might be ephemeral to the cluster, which will be deleted after cluster termination or restart. You can choose a log directory under `/dbfs/` to persist your logs in DBFS.
Tensorboard may not be displayed in the notebook cell output when 'Third-party iFraming prevention' is disabled. You can still use Tensorboard by clicking the link below to open Tensorboard in a new tab. To enable Tensorboard in notebook cell output, please ask your workspace admin to enable 'Third-party iFraming prevention'.


Reusing TensorBoard on port 6006 (pid 1507), started 0:22:51 ago. (Use '!kill 1507' to kill it.)

Start the fine-tuning process!

In [0]:
# TODO
# invoke training - note this will take approx. 30min
trainer.train()

# save model to the local checkpoint
trainer.save_model()
trainer.save_state()

You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,3.4753
1000,2.9626
1500,2.8081
2000,2.5244
2500,2.4189
3000,2.2899
3500,2.0302
4000,1.8928
4500,1.8162
5000,1.5657
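
One way to read the logged values is to convert them to perplexity. The sketch below assumes the logged training loss is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`. The numbers are copied from the log above.

```python
import math

# Step -> training loss, copied from the log above.
logged_losses = {
    500: 3.4753,
    1000: 2.9626,
    1500: 2.8081,
    2000: 2.5244,
    2500: 2.4189,
    3000: 2.2899,
    3500: 2.0302,
    4000: 1.8928,
    4500: 1.8162,
    5000: 1.5657,
}

# Perplexity is the exponential of the mean cross-entropy (in nats).
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>5}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.2f}")
```

These are training-set figures, so a falling value does not by itself show that the model generalizes; the held-out split would be needed for that.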


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_8(trainer)

PASSED: All tests passed for lesson4, question8
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


In [0]:
# persist the fine-tuned model to DBFS
final_model_path = f"{DA.paths.working_dir}/llm04_fine_tuning/{checkpoint_name}"
trainer.save_model(output_dir=final_model_path)

In [0]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

In [0]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained(final_model_path)

Recall that the model was trained using a body of text that contained an instruction and its response. A similar body of text, or prompt, needs to be provided when testing the model. The prompt that is provided only contains an instruction though. The model will `generate` the response accordingly.

In [0]:
import re


def to_prompt(instr: str, max_length: int = 1024) -> dict:
    text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instr}

### Response:
"""
    return tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)


def to_response(prediction):
    decoded = tokenizer.decode(prediction)
    # extract the Response from the decoded sequence
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", decoded, flags=re.DOTALL)
    res = "Failed to find response"
    if m:
        res = m.group(1).strip()
    else:
        m = re.search(r"#+\s*Response:\s*(.+)", decoded, flags=re.DOTALL)
        if m:
            res = m.group(1).strip()
    return res
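
The extraction logic in `to_response` can be tried on plain strings, without a model or tokenizer. The sketch below applies the same two regular expressions to hand-written inputs; `extract_response` and the sample strings are hypothetical.

```python
import re


def extract_response(decoded):
    """Same regex logic as `to_response`, applied to an already decoded string."""
    # First try: text between the response marker and the end marker.
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", decoded, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # Fallback: no end marker was generated, take everything after the marker.
    m = re.search(r"#+\s*Response:\s*(.+)", decoded, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    return "Failed to find response"


with_end = "### Instruction:\nName a fish.\n\n### Response:\nTope\n\n### End"
without_end = "### Instruction:\nName a fish.\n\n### Response:\nTope is a fish"
repeated_marker = "### Response:\nTope ### Response: Tope is a fish"
no_marker = "Name a fish."

for sample in (with_end, without_end, repeated_marker, no_marker):
    print(repr(extract_response(sample)))
```

The third sample shows the fallback keeping a second `### Response:` marker in the extracted text when the model repeats the marker and never emits `### End`.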

In [0]:
import re
# NOTE: this cell can take up to 5mins
res = []
for i in range(100):
    instr = ds["train"][i]["instruction"]
    resp = ds["train"][i]["response"]
    inputs = to_prompt(instr)
    pred = fine_tuned_model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.pad_token_id,
        max_new_tokens=128,
    )
    res.append((instr, resp, to_response(pred[0])))

In [0]:
pdf = pd.DataFrame(res, columns=["instruction", "response", "generated"])
display(pdf)

instruction,response,generated
When did Virgin Australia start operating?,"Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",Virgin Australia started operating its operating plan in the year 2000.  ### Response: Virgin Australia started operating its operating plan in the year 2000.
Which is a species of fish? Tope or Rope,Tope,What is a fish?  ### Response: Tope is a species of fish in the ocean.
Why can camels survive for long without water?,Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.,1. They are a family of insects that live in the desert. 2. They are able to eat food from the desert. 3. They are able to eat food from the desert. 4. They are able to eat food from the desert.  ### Response: 1. They are able to eat food from the desert 2. They are able to eat food from the desert
"Alice's parents have three daughters: Amy, Jessy, and what’s the name of the third daughter?",The name of the third daughter is Alice,My mum bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter. i bought me a fake first daughter and gave me a fake first daughter.
When was Tomoaki Komorida born?,"Tomoaki Komorida was born on July 10,1981.","No, myoh kokomori Kokomori born on Tukoori Prefecture, Japan  ### Response: Tomoaki Komorida is a prefecture, Japan, in Japan."
"If I have more pieces at the time of stalemate, have I won?",No. Stalemate is a drawn position. It doesn't matter who has captured more pieces or is in a winning position,",...  ### Response: - -.. -.. -.. -.. -.. -.. -.. -.. -.."
"Given a reference text about Lollapalooza, where does it take place, who started it and what is it?","Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.",Lollapalooza is a dance and dance style in Lollapalooza. It takes place on the beach in Lollapalooza.  ### Response: Lollapalooza is a dance and dance style in Lollapalooza.
Who gave the UN the land in NY to build their HQ,John D Rockerfeller,The UN has built a tower and skyscrapers - what are some of its tallest structures in the city of New York City  Input: The UN has built a tower and skyscrapers - what are some of its tallest structures in the city of New York City  Input: The UN has built a tower and skyscrapers - what are some of its tallest structures in the city of New York City  Input: The UN has built a tower and skyscrapers - what are some of its tallest structures in the city of New
Why mobile is bad for human,We are always engaged one phone which is not good.,"It is because of human heart disease that causes death to millions of people.  ### Response: It is because of human heart disease, or disease has been spreading to other parts of the world such as Europe, Asia, and Africa."
Who was John Moses Browning?,"John Moses Browning is one of the most well-known designer of modern firearms. He started building firearms in his father's shop at the age of 13, and was awarded his first patent when he was 24. He designed the first reliable automatic pistol, and the first gas-operated firearm, as well inventing or improving single-shot, lever-action, and pump-action rifles and shotguns. Today, he is most well-known for the M1911 pistol, the Browning Automatic Rifle, and the Auto-5 shotgun, all of which are in still in current production in either their original design, or with minor changes. His M1911 and Hi-Power pistols designs are some of the most reproduced firearms in the world today.",John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Browning John Moses Brow


**CONGRATULATIONS**

You have just taken the first step toward fine-tuning your own slimmed down version of [Dolly](https://github.com/databrickslabs/dolly)! 

Unfortunately, the generated responses are often repetitive or off-topic, as the sample above shows. With additional training and data, the model could become more capable.

### Question 9: Evaluation

Although the current model is under-trained, it is worth evaluating the responses to get a general sense of how far off the model is at this point.

Let's compute the ROUGE metrics between the reference responses and the generated responses.
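
As a rough guide to what these numbers mean, the sketch below computes a simplified ROUGE-1 F-measure by hand: the overlap of unigrams between a generated and a reference string, combined into precision, recall, and F1. It is a deliberate simplification (lowercasing and whitespace splitting, no stemming), so its values will not match the library's exactly; `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(generated, reference):
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("tope is a species of fish", "tope")
print(round(score, 4))
```

A long generation that merely contains the short reference gets full recall but low precision, which pulls the F-measure down. Repetitive outputs are penalized in the same way.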

In [0]:
nltk.download("punkt")

rouge_score = evaluate.load("rouge")


def compute_rouge_score(generated, reference):
    """
    Compute ROUGE scores on a batch of responses.

    This is a convenience function wrapping Hugging Face `rouge_score`,
    which expects sentences to be separated by newlines.

    :param generated: Responses (list of strings) produced by the model
    :param reference: Ground-truth responses (list of strings) for comparison
    """
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



In [0]:
# TODO
generated_responses = pdf["generated"].tolist()
ground_truth_responses = pdf["response"].tolist()
rouge_scores = compute_rouge_score(generated_responses, ground_truth_responses)
display(rouge_scores)

{'rouge1': 0.160272275002856,
 'rouge2': 0.05688290149023008,
 'rougeL': 0.14037162464180356,
 'rougeLsum': 0.1513184890432173}

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion4_9(rouge_scores)

PASSED: All tests passed for lesson4, question9
RESULTS RECORDED: Click `Submit` when all questions are completed to log the results.


## Clean up Classroom

Run the following cell to remove lesson-specific assets created during this lesson.

In [0]:
tmpdir.cleanup()

## Submit your Results (edX Verified Only)

To get credit for this lab, click the submit button in the top right to report the results. If you run into any issues, click `Run` -> `Clear state and run all`, and make sure all tests have passed before re-submitting. If you accidentally deleted any tests, take a look at the notebook's version history to recover them or reload the notebooks.

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>