## Instruction Tuning for Text Generation using PyTorch and Hugging Face LLMs

This notebook demonstrates how to instruction-tune pretrained large language models (LLMs) from [Hugging Face](https://huggingface.co) using datasets from the [Hugging Face Datasets catalog](https://huggingface.co/datasets) or a custom dataset.

Please install the dependencies from [setup.md](/notebooks/setup.md) before executing this notebook.

The notebook performs the following steps:
1. [Import dependencies and setup parameters](#1.-Import-dependencies-and-setup-parameters)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
    1. [Option A: Use a Hugging Face dataset](#Option-A:-Use-a-Hugging-Face-dataset)
    2. [Option B: Use a custom dataset](#Option-B:-Use-a-custom-dataset)
    3. [Map and tokenize the dataset](#Map-and-tokenize-the-dataset)
    
3. [Prepare the model and test domain knowledge](#3.-Prepare-the-model-and-test-domain-knowledge)
4. [Transfer learning](#4.-Transfer-learning)
5. [Retest domain knowledge](#5.-Retest-domain-knowledge)

## 1. Import dependencies and setup parameters

This notebook assumes that you have already followed the instructions in [setup.md](/notebooks/setup.md) to set up a PyTorch environment with all the dependencies required to run the notebook.

In [None]:
import math
import os
import sys
import torch
import torch.nn as nn
import urllib.request  # the request submodule must be imported explicitly before urlopen() is used
import warnings

import datasets
from datasets import load_dataset
from datasets import logging as datasets_logging

from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    DataCollatorForSeq2Seq, 
    TrainingArguments,
    GenerationConfig,
    Trainer
)

datasets_logging.set_verbosity_error()
warnings.filterwarnings('ignore')
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

There is an additional [PEFT module](https://github.com/huggingface/peft) required to train models with low-rank adaptation (LoRA).

In [None]:
!pip install peft

from peft import LoraConfig, TaskType, get_peft_model, PeftModelForCausalLM

Specify the name of the pretrained causal language model to load from Hugging Face. See the [causal language modeling guide](https://huggingface.co/docs/transformers/tasks/language_modeling) for background.

Examples:
* distilgpt2
* EleutherAI/gpt-j-6b
* bigscience/bloom-560m
* bigscience/bloomz-560m
* bigscience/bloomz-3b

In [None]:
model_name = "EleutherAI/gpt-j-6b"

# Define an output directory (defaults to ~/output when OUTPUT_DIR is not set)
output_dir = os.environ.get("OUTPUT_DIR", os.path.join(os.path.expanduser("~"), "output"))

# Define a dataset directory (defaults to ~/dataset when DATASET_DIR is not set)
dataset_dir = os.environ.get("DATASET_DIR", os.path.join(os.path.expanduser("~"), "dataset"))

print("Model name:", model_name)
print("Output directory:", output_dir)
print("Dataset directory:", dataset_dir)

## 2. Prepare the dataset

The notebook has two options for getting a dataset:
* Option A: Use a dataset from the [Hugging Face Datasets catalog](https://huggingface.co/datasets)
* Option B: Use a custom dataset (downloaded from another source or from your local system)

In both cases, we define objects for the train and (optional) validation splits and tokenize them.

### Option A: Use a Hugging Face dataset

[Hugging Face Datasets](https://huggingface.co/datasets) has a catalog of datasets that can be specified by name. The catalog page for each dataset describes it, including its size and splits. For instruction-tuning, choose a dataset with fields for "task"/"context"/"output", "instruction"/"context"/"response", or similar.

```
{
    "instruction": "Convert this sentence into a question.",
    "context": "He read the book.",
    "response": "Did he read the book?"
}
```

For example: 
* databricks/databricks-dolly-15k
* togethercomputer/RedPajama-Data-Instruct 
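
As a rough illustration of how the field names of a record map onto the three roles, the sketch below guesses the schema from a record's keys. The `FIELD_ALIASES` table and the `guess_schema` helper are hypothetical; they are not part of this notebook or of any library, and the alias lists are assumptions that you may need to extend for your dataset.

```python
# Hypothetical helper: the alias lists are assumptions, not an official standard.
FIELD_ALIASES = {
    "instruction_key": ["instruction", "task"],
    "context_key": ["context", "input"],
    "response_key": ["response", "output"],
}

def guess_schema(record):
    """Map each role to the first alias found among the record's keys."""
    schema = {}
    for role, aliases in FIELD_ALIASES.items():
        match = next((name for name in aliases if name in record), None)
        if match is None:
            raise KeyError(f"No field found for {role}; tried {aliases}")
        schema[role] = match
    return schema

example_record = {
    "instruction": "Convert this sentence into a question.",
    "context": "He read the book.",
    "response": "Did he read the book?",
}
print(guess_schema(example_record))
```

Always confirm the guessed keys against a few real records before relying on them.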

The next cell gets a dataset from the Hugging Face datasets API. If the notebook is executed multiple times, the dataset is loaded from the local Hugging Face cache instead of being downloaded again, which shortens the run time.

In [None]:
dataset_name = 'databricks/databricks-dolly-15k'
dataset = load_dataset(dataset_name)

In [None]:
# If the dataset does not have a validation split, create one:
# the first 25% of the train split becomes validation, the rest stays train
if 'validation' not in dataset.keys():
    dataset["validation"] = load_dataset(dataset_name, split="train[:25%]")
    dataset["train"] = load_dataset(dataset_name, split="train[25%:]")

In [None]:
# Inspect a sample record (index 3 is an arbitrary choice)
dataset['train'][3]

In [None]:
# Adjust this dictionary for the keys used in your dataset
dataset_schema = {
    "instruction_key": "instruction", 
    "context_key": "context",
    "response_key": "response"
}

Skip ahead to [mapping and tokenizing](#Map-and-tokenize-the-dataset) the dataset.

### Option B: Use a custom dataset

Instead of using a dataset from the Hugging Face dataset catalog, a custom JSON file from your local system or a download can be used.

In this example, we download an instruction text dataset example, where each record of the dataset contains text fields for "instruction", "input", and "output" like the following:
```
{
    "instruction": "Convert this sentence into a question.",
    "input": "He read the book.",
    "output": "Did he read the book?"
}
```
If you are using a custom or downloaded dataset with similarly formatted JSON, you can use the same code as below.

In [None]:
# Choose a URL to download or skip this cell and provide a local path in the next cell
url = "https://raw.githubusercontent.com/sahil280114/codealpaca/master/data/code_alpaca_2k.json"

filename = os.path.basename(url)
destination = os.path.join(dataset_dir, filename)

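# Make sure the dataset directory exists before writing to it
os.makedirs(dataset_dir, exist_ok=True)

# If we don't already have the JSON file, download it
if not os.path.exists(destination):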
    response = urllib.request.urlopen(url)
    data = response.read().decode("utf-8")
    with open(destination, "w") as file:
        file.write(data)
    print('Downloaded file to {}'.format(destination))
else:
    print('Using existing file found at {}'.format(destination))

In [None]:
# Customize these variables if you want to load data from pre-existing local files
train_file = destination
validation_file = None

data_files = {}
dataset_args = {}
data_files["train"] = train_file
if validation_file is not None:
    data_files["validation"] = validation_file
extension = (
    train_file.split(".")[-1]
    if train_file is not None
    else validation_file.split(".")[-1]
)
if extension == "txt":
    extension = "text"

dataset = load_dataset(extension, data_files=data_files)

In [None]:
# If no validation file was provided, hold out the first 25% of the train split
if 'validation' not in dataset.keys():
    dataset["validation"] = load_dataset(extension, data_files=data_files, split="train[:25%]")
    dataset["train"] = load_dataset(extension, data_files=data_files, split="train[25%:]")

In [None]:
# Inspect a sample record (index 3 is an arbitrary choice)
dataset['train'][3]

In [None]:
# Adjust this dictionary for the keys used in your dataset
dataset_schema = {
    "instruction_key": "instruction", 
    "context_key": "input",
    "response_key": "output"
}

### Map and tokenize the dataset

After describing the schema of your dataset, create formatted prompts out of each example for instruction-tuning. Then tokenize the prompts with the model's tokenizer and concatenate them together into longer sequences to speed up fine-tuning.

In [None]:
PROMPT_DICT = {
    "prompt_with_context": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{{{instruction_key}}}\n\n### Context:\n{{{context_key}}}\n\n### Response:\n{{{response_key}}}".format(
        **dataset_schema)
    ),
    "prompt_without_context": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{{{instruction_key}}}\n\n### Response:\n{{{response_key}}}".format(**dataset_schema)
    ),
}
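
The triple braces in the templates can be confusing, so here is a small standalone sketch of the two formatting stages. The `schema` and `record` values are made up for illustration; the template is a shortened version of the one above.

```python
# Stage 1: str.format fills in the field names. "{{" and "}}" become literal
# braces, so "{{{instruction_key}}}" turns into the placeholder "{instruction}".
schema = {"instruction_key": "instruction", "response_key": "output"}
template = "### Instruction:\n{{{instruction_key}}}\n\n### Response:\n{{{response_key}}}".format(**schema)

# Stage 2: format_map fills the placeholders from a single dataset record.
record = {"instruction": "Add 2 and 3.", "output": "5"}
prompt = template.format_map(record)

print(template)
print("---")
print(prompt)
```

Stage 1 runs once per schema, while stage 2 runs once per example, which is why the dataset keys only need to be declared in `dataset_schema`.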

In [None]:
def create_prompts(examples):
    """Format every example with the template that matches its fields."""
    context_key = dataset_schema['context_key']
    prompts = []
    for example in examples:
        # Treat a missing, None, or empty context as "no context"
        has_context = bool(example.get(context_key))
        prompt_template = PROMPT_DICT["prompt_with_context"] if has_context \
            else PROMPT_DICT["prompt_without_context"]
        prompts.append(prompt_template.format_map(example))
    return prompts

In [None]:
for key in dataset:
    prompts = create_prompts(dataset[key])
    columns_to_be_removed = list(dataset[key].features.keys())
    dataset[key] = dataset[key].add_column("prompts", prompts)
    dataset[key] = dataset[key].remove_columns(columns_to_be_removed)
    
dataset['train'][3]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = 0  # Some tokenizers define no pad token, so use token ID 0 for padding
tokenizer.padding_side = "left"

In [None]:
max_seq_length = 512

def tokenize(prompt, add_eos_token=True):
    results = tokenizer(
        prompt,
        truncation=True,
        max_length=max_seq_length,
        padding=False,
        return_tensors=None,
    )
    for i in range(len(results["input_ids"])):
        if (
            results["input_ids"][i][-1] != tokenizer.eos_token_id
            and len(results["input_ids"][i]) < max_seq_length
            and add_eos_token
        ):
            results["input_ids"][i].append(tokenizer.eos_token_id)
            results["attention_mask"][i].append(1)

    results["labels"] = results["input_ids"].copy()

    return results

def preprocess_function(examples):
    return tokenize(examples["prompts"])
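
To make the end-of-sequence handling in `tokenize` easier to follow, the sketch below applies the same rule to plain lists of token IDs. The `append_eos` helper and the EOS ID `99` are made up for illustration, and unlike the cell above the helper also tolerates an empty sequence.

```python
def append_eos(token_ids, eos_id, max_len):
    """Return a copy of token_ids with eos_id appended when there is room.

    The EOS token is skipped when the sequence already ends with it or when
    the sequence has already reached max_len (for example after truncation).
    """
    ids = list(token_ids)
    if ids and ids[-1] != eos_id and len(ids) < max_len:
        ids.append(eos_id)
    return ids

EOS = 99  # made-up EOS token ID

print(append_eos([5, 6, 7], EOS, max_len=8))   # room left: EOS is appended
print(append_eos([5, 6, 99], EOS, max_len=8))  # already ends with EOS: unchanged
print(append_eos([5, 6, 7], EOS, max_len=3))   # already at max_len: unchanged
```

A truncated prompt therefore carries no EOS token, which tells the model that the text was cut off rather than finished.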

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=True)

In [None]:
def concatenate_data(dataset, max_seq_length):
    concatenated_dataset = {}
    for column in dataset.features:
        concatenated_data = [item for sample in dataset[column] for item in sample]
        reshaped_data = [concatenated_data[i*max_seq_length:(i+1)*max_seq_length] \
            for i in range(len(concatenated_data) // max_seq_length)]
        concatenated_dataset[column] = reshaped_data
    return datasets.Dataset.from_dict(concatenated_dataset)

tokenized_dataset_ = tokenized_dataset["train"].remove_columns("prompts")
tokenized_dataset["train"] = concatenate_data(tokenized_dataset_, max_seq_length)
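
The effect of the concatenation step is easiest to see on a toy input. The sketch below repeats the same flatten-and-chunk logic on plain lists; `pack_sequences` and the toy token IDs are illustrative names and values, not part of the notebook.

```python
def pack_sequences(sequences, block_size):
    """Flatten the sequences and cut the result into full blocks of block_size."""
    flat = [token for sequence in sequences for token in sequence]
    num_blocks = len(flat) // block_size  # a trailing partial block is dropped
    return [flat[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

toy_sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(toy_sequences, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Nine tokens yield two blocks of four, and the leftover token is discarded. Packing avoids padding, at the cost of blocks that can span the boundary between two prompts.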

In [None]:
train_dataset = tokenized_dataset["train"]
validation_dataset = tokenized_dataset["validation"]

## 3. Prepare the model and test domain knowledge

This notebook uses the Hugging Face Transformers API to download a model for causal language modeling and its associated tokenizer. Get the model and look at some output for a sample prompt.

In [None]:
resume_from_checkpoint = False  # Set to True to reload a model saved in model_output_dir
experiment_identifier = 'bf16'  # Subdirectory name for this experiment; adjust as needed

model_output_dir = os.path.join(output_dir, model_name, experiment_identifier)
print("Model will be saved to:", model_output_dir)

In [None]:
if resume_from_checkpoint:
    try:
        model = AutoModelForCausalLM.from_pretrained(model_output_dir)
    except OSError:
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model = PeftModelForCausalLM.from_pretrained(model, model_output_dir)
else:
    model = AutoModelForCausalLM.from_pretrained(model_name)

print('Check the model class: {}'.format(type(model)))
print('Check the model data type: {}'.format(model.dtype))

Use this sample prompt or write your own. Tokenize it, send it to the model for text generation, and then decode and print the response.

In [None]:
# For code generation custom dataset
prompt_template = PROMPT_DICT["prompt_with_context"]
test_example = {dataset_schema['instruction_key']: 'Write a Python function that sorts the following list.',
               dataset_schema['context_key']: '[3, 2, 1]',
               dataset_schema['response_key']: ''}

In [None]:
test_prompt = prompt_template.format_map(test_example)

encoded_input = tokenizer(test_prompt, padding=True, return_tensors='pt')
# input_ids has shape (batch_size, sequence_length), so count tokens along dim 1
num_tokens = encoded_input['input_ids'].shape[1]
encoded_input

In [None]:
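# Note: temperature, top_p, and top_k only take effect when do_sample=True;
# as configured here, generation uses beam search with 4 beams
generation_config = GenerationConfig(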
    temperature=1.0,
    top_p=0.75,
    top_k=40,
    repetition_penalty=1.0,
    num_beams=4
)

max_new_tokens = 128

output = model.generate(input_ids=encoded_input['input_ids'], 
                        generation_config=generation_config, 
                        max_new_tokens=max_new_tokens)

test_output = tokenizer.batch_decode(output)
print(test_output[0])

## 4. Transfer learning

Set up the LoRA parameters and get the PEFT model.

In [None]:
# Pad the inputs and labels of each batch to a common length that is a multiple of 8
data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

from llm_utils import hf_model_map
model_info = hf_model_map[model_name]
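
Conceptually, a padding collator with `pad_to_multiple_of=8` brings every sequence in a batch to the same length, rounded up to a multiple of 8. The sketch below illustrates that idea on plain lists. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the label padding value of -100 is assumed here because it is the value PyTorch's cross-entropy loss ignores by default.

```python
import math

LABEL_PAD_ID = -100  # assumed label padding value; ignored by the loss

def pad_batch(batch_ids, pad_id, multiple_of=8, side="left"):
    """Pad every sequence to the batch maximum, rounded up to multiple_of."""
    longest = max(len(ids) for ids in batch_ids)
    target = math.ceil(longest / multiple_of) * multiple_of
    padded = []
    for ids in batch_ids:
        fill = [pad_id] * (target - len(ids))
        padded.append(fill + ids if side == "left" else ids + fill)
    return padded

batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
inputs = pad_batch(batch, pad_id=0)
labels = pad_batch(batch, pad_id=LABEL_PAD_ID)

print(inputs[0])  # [0, 0, 0, 0, 0, 11, 12, 13]
print(labels[0])  # [-100, -100, -100, -100, -100, 11, 12, 13]
```

Padding the labels with a value the loss ignores keeps the padded positions from contributing to the training signal.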

In [None]:
lora_rank = 8  # Rank parameter 
lora_alpha = 32  # Alpha parameter
lora_dropout = 0.05  # Dropout parameter 

# PEFT settings
peft_config = LoraConfig(
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
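
To get a feel for why LoRA trains so few parameters, the following back-of-the-envelope sketch counts the weights that a rank-8 adapter adds to a single linear layer. The 4096 x 4096 layer size is an assumption chosen for illustration, and `lora_param_count` is a hypothetical helper; the actual numbers for your model are the ones reported by `print_trainable_parameters()`.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

rank, alpha = 8, 32
d_in = d_out = 4096  # assumed layer size, for illustration only

full = d_in * d_out                         # weights in the frozen layer
added = lora_param_count(d_in, d_out, rank) # trainable adapter weights
scaling = alpha / rank                      # factor applied to the low-rank update

print(f"Frozen weights:  {full:,}")
print(f"Adapter weights: {added:,} ({100 * added / full:.2f}% of the layer)")
print(f"Update scaling (alpha / rank): {scaling}")
```

With these values the adapter adds 65,536 weights to a layer of 16,777,216, which is about 0.39%.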

Set up Hugging Face training arguments. For improved training time on Intel® fourth generation Xeon processors, you can experiment with `bf16=True` and `use_ipex=True`:

In [None]:
epochs = 3
do_eval = False  # Use the validation dataset to evaluate perplexity
bf16 = True  # Train with bfloat16 precision
use_ipex = False  # Use Intel® Optimization for PyTorch (IPEX)
max_train_samples = None  # Option to truncate training samples for faster sanity checking

In [None]:
training_args = TrainingArguments(
    output_dir=model_output_dir, 
    num_train_epochs=epochs,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    bf16=bf16,
    use_ipex=use_ipex,
    no_cuda=True,
)
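
When choosing `epochs` and the batch size, it can help to estimate how many optimizer steps a run will take. The sketch below is only an estimate under stated assumptions: a single training process, no gradient accumulation, and a final partial batch that is kept. `estimate_training_steps` is a hypothetical helper and the sample counts are made up.

```python
import math

def estimate_training_steps(num_samples, batch_size, epochs):
    """Estimate optimizer steps, assuming one process and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # last partial batch counts
    return steps_per_epoch * epochs

print(estimate_training_steps(num_samples=1000, batch_size=8, epochs=3))  # 375
print(estimate_training_steps(num_samples=1001, batch_size=8, epochs=3))  # 378
```

The progress bar shown by `trainer.train()` reports the actual step count for your configuration.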

In [None]:
# Optional: truncate the training set for a faster sanity check
if max_train_samples is not None:
    train_dataset = train_dataset.select(range(max_train_samples))

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset if do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Train the model
trainer.train()

In [None]:
if do_eval:
    eval_results = trainer.evaluate()
    print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
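
Perplexity is the exponential of the mean per-token cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The sketch below shows the relationship on made-up token probabilities; the `perplexity` helper is illustrative only.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    losses = [-math.log(p) for p in token_probs]
    mean_loss = sum(losses) / len(losses)
    return math.exp(mean_loss)

# A model that assigns probability 0.25 to every correct token is as uncertain
# as a uniform guess among 4 tokens, so its perplexity is about 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 4))

# More confident predictions give a lower perplexity.
print(round(perplexity([0.9, 0.8, 0.95]), 4))
```

Lower is better, and a perplexity of 1 would mean the model predicts every token with certainty.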

In [None]:
# Save the model to the model_output_dir
model.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)

## 5. Retest domain knowledge

Run inference with the same test prompt to see whether the fine-tuned model gives a better response. You may need to train for at least 3 epochs to see an improvement.

In [None]:
model.eval()

# encoded_input['input_ids'] is already a tensor, so pass it to generate() directly
if model.dtype == torch.bfloat16:
    # Run generation under bfloat16 autocast on the CPU
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        output = model.generate(input_ids=encoded_input['input_ids'],
                                generation_config=generation_config,
                                max_new_tokens=max_new_tokens)
else:
    output = model.generate(input_ids=encoded_input['input_ids'],
                            generation_config=generation_config,
                            max_new_tokens=max_new_tokens)
    
retest_output = tokenizer.batch_decode(output)
print(retest_output[0])

## Citations

<b>[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)</b> - Copyright (2023) Databricks, Inc. This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license. Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license: Wikipedia (various pages) - https://www.wikipedia.org/ Copyright © Wikipedia editors and contributors.


```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

```
@misc{codealpaca,
  author = {Sahil Chaudhary},
  title = {Code Alpaca: An Instruction-following LLaMA model for code generation},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/sahil280114/codealpaca}},
}
```