# Finetuning RedPajama ("OpenLlama")
by John Robinson 05/15/2023 [Follow @johnrobinsn on Twitter](https://twitter.com/johnrobinsn)

[<img src="https://www.storminthecastle.com/img/github.svg">](https://github.com/johnrobinsn/redpajama/blob/main/finetune_redpajama_clean.ipynb) [<img src="https://www.storminthecastle.com/img/colab.svg">](https://colab.research.google.com/github/johnrobinsn/redpajama/blob/main/finetune_redpajama_clean.ipynb)

For more details please see the [blog article](https://www.storminthecastle.com/posts/finetune_redpajama/).

## Setup
Configure the base model and a few other variables that we'll use later.

In [None]:
model = '3B' #'7B' # Pick your poison

if model == '7B':
    model_name = ("togethercomputer/RedPajama-INCITE-Base-7B-v0.1","togethercomputer/RedPajama-INCITE-Base-7B-v0.1")
    run_name = 'redpj7B-lora-int8-alpaca'
    dataset = 'johnrobinsn/alpaca-cleaned'
    peft_name = 'redpj7B-lora-int8-alpaca'
    output_dir = 'redpj7B-lora-int8-alpaca-results'
else: #3B
    model_name = ("togethercomputer/RedPajama-INCITE-Base-3B-v1","togethercomputer/RedPajama-INCITE-Base-3B-v1")
    run_name = 'redpj3B-lora-int8-alpaca'
    dataset = 'johnrobinsn/alpaca-cleaned'
    peft_name = 'redpj3B-lora-int8-alpaca'
    output_dir = 'redpj3B-lora-int8-alpaca-results'

model_name[1],dataset,peft_name,run_name

Install the required dependencies.

In [None]:
def install_dependencies():
    !pip install -Uqq  git+https://github.com/huggingface/peft.git
    !pip install -Uqq "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet
    !pip install -Uqq wandb
    !pip install -Uqq protobuf==3.20

# comment out the following line if the required dependencies are already installed
install_dependencies()

__Note: If you just want to do inference, run the Setup cells above and then jump all the way down to the ["Evaluate"](#evaluate) cell and start running from there to download my adapter weights from the HF hub and try some prompts through the finetuned model.__

But if you want to train keep going... 

## Setting Up Tracking and Monitoring using Weights and Biases

This notebook supports logging the training run to [Weights & Biases (wandb)](https://wandb.ai/site). This makes it very easy to track, monitor and annotate your training sessions from anywhere.

Run the next cell and follow the directions given to authenticate with wandb.

In [None]:
report_to = "wandb" # "none"

if report_to != "none":
    import wandb
    wandb.login()

After authenticating, we have to initialize wandb.  We add a few key-value pairs about the run to the information that will be logged to the wandb dashboard.  

_Note: You can add more key/values if you'd like._

In [None]:
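# only initialize wandb when logging is enabled (wandb is not imported otherwise)
if report_to != "none":
    wandb.init(project=run_name, config={
        "model": model_name[1],
        "dataset": dataset,
    })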

After you get training started below, you can revisit the wandb links shown above to monitor the status of your training run from anywhere with Internet connectivity.

_Note: I like to send the link (View run) to my phone so that I can monitor on the go..._

## Tokenizer
The tokenizer converts text into a list/tensor of integer token ids so that the model can process it. Each language model is trained with a specific tokenizer, and the same one has to be used for finetuning and inference. If your base model is already supported by HuggingFace, the transformers library makes it very easy to load the correct tokenizer: just pass the model name to the AutoTokenizer class.

In [None]:
from transformers import AutoTokenizer

print("Loading tokenizer for model: ", model_name[1])
tokenizer = AutoTokenizer.from_pretrained(model_name[1],add_eos_token=True)
tokenizer.pad_token_id = 0  

One problem that I've found with many of the finetuning scripts and notebooks found online is that the "end-of-stream" handling is not done correctly, so in many cases the finetuned models don't know when to stop emitting tokens and tend to "blabber" on. Since we are finetuning on an instruction-following task, we would like the model to respond to the instruction prompt succinctly and then stop. There are a number of ways to approach this, but the way I approach it here is to explicitly add a new token, `<eos>`, to represent end-of-stream and to append that token to every training example to teach the model when it should stop. Then during inference, we can use that token to recognize when the model is done responding.

In [None]:
tokenizer.add_special_tokens({'eos_token':'<eos>'})
print('eos_token_id:',tokenizer.eos_token_id)
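
As a rough illustration of what adding a special token does, here is a toy word-level tokenizer. It is not the subword tokenizer that RedPajama uses, and `ToyTokenizer` and its methods are hypothetical names made up for this sketch. The point is only that a newly added token gets the next free id and is then matched as a single unit.

```python
class ToyTokenizer:
    """Toy word-level tokenizer (hypothetical; real tokenizers work on subwords)."""

    def __init__(self, words):
        # id 0 is reserved for padding, mirroring pad_token_id = 0 above
        self.vocab = {"<pad>": 0}
        for word in words:
            self.vocab.setdefault(word, len(self.vocab))
        self.eos_token = None

    def add_eos_token(self, token):
        # a new special token is appended to the vocabulary with the next free id
        self.vocab.setdefault(token, len(self.vocab))
        self.eos_token = token
        return self.vocab[token]

    def encode(self, text):
        # split the special token off so that it is matched as one unit
        if self.eos_token:
            text = text.replace(self.eos_token, f" {self.eos_token} ")
        return [self.vocab[word] for word in text.split()]


toy = ToyTokenizer(["hi", "there"])
eos_id = toy.add_eos_token("<eos>")
print(eos_id, toy.encode("hi there<eos>"))
```

In this toy, unknown words simply raise a `KeyError`; a real tokenizer would fall back to smaller subword pieces instead.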

In [None]:
CUTOFF_LEN = 256  # 256 accounts for about 96% of the data in the alpaca dataset

def tokenize(prompt, tokenizer):
    # tokenize one templated prompt, truncated/padded to exactly CUTOFF_LEN tokens
    result = tokenizer(
        prompt+"<eos>",  # add the end-of-stream token
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }
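
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal pure-Python sketch of the same bookkeeping. It assumes right-side truncation and padding, and `pad_and_truncate` and the id `99` are hypothetical stand-ins, not part of the transformers API.

```python
def pad_and_truncate(ids, max_length, pad_id=0):
    """Right-truncate, then right-pad, a list of token ids."""
    ids = ids[:max_length]                      # what truncation=True does
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,    # what padding="max_length" does
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


EOS = 99  # hypothetical id standing in for <eos>
short = pad_and_truncate([5, 6, 7, EOS], max_length=6)
long = pad_and_truncate([5, 6, 7, 8, 9, 10, 11, EOS], max_length=6)
print(short)
print(EOS in long["input_ids"])
```

One consequence worth noticing: an example longer than the cutoff loses its trailing end-of-stream token, so the few percent of examples that exceed `CUTOFF_LEN` don't teach the model when to stop.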


Let's give it a quick try and note the `<eos>` token id at the end of the sequence.

In [None]:
tokenizer('hi there<eos>')

## Dataset

When finetuning your model, the dataset that you choose has to be aligned with your downstream task. We're using a popular instruction-following dataset called Alpaca. For convenience, I have published a cleaned copy of the Alpaca dataset on the HuggingFace hub. We can just download it and access it from cache using the load_dataset API shown below.

In [None]:
from datasets import load_dataset

# Load dataset from the hub
data = load_dataset(dataset)
data

Here we can see that the dataset consists of 51,942 rows with the following features ['instruction','input','output'].  Let's take a look at one.

In [None]:
data['train'][5]

Here we can see an item that includes an 'instruction' to direct our model, an optional 'input' which provides context for the instruction, and an expected 'output' for the model.

Our goal in finetuning is to use this dataset to train our model to "behave" in a similar way: given an instruction, it should produce an appropriate response, drawing on the knowledge already encoded in the base model.

But we can't directly use this JSON object to train our model.  Our model can only process an ordered sequence of tokens that represent words.  So we use a "prompt template" to convert each of these JSON objects in our dataset into a sequence of words.  The prompt template follows a consistent pattern.

In [None]:
def generate_prompt(data_point):
    # the template text is kept flush-left so that no indentation leaks into the prompt
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:
{data_point["output"]}"""


Let's see what our example looks like when "templatized".

In [None]:
print(generate_prompt(data['train'][5]))

The exact wording of the template is somewhat arbitrary.  It's more of a consistent pattern that after training will drive the model into responding similarly when exposed to a similar prompt.  You should be able to pick out the "instruction", "input", and "output" from the example.  

It is important that the output from the dataset is at the end of the templatized prompt, since at inference time we will only provide the prompt up to **but not including the output**.  We'll expect our model to respond to our instruction on its own.
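
Put differently, the inference prompt is a prefix of the training text. Here is a small sketch of that relationship; `split_prompt` is a hypothetical helper and the example instruction is made up for illustration.

```python
SENTINEL = "### Response:"


def split_prompt(training_text):
    """Split a templated training example into (inference prompt, expected output)."""
    head, sep, tail = training_text.partition(SENTINEL)
    if not sep:
        raise ValueError("template marker not found")
    return head + sep, tail.strip()


# made-up example in the no-input form of the template
example = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nName a primary color.\n\n"
    "### Response:\nRed"
)
prompt, output = split_prompt(example)
print(repr(prompt[-14:]), repr(output))
```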

We now split a validation dataset out of our training dataset. This lets us track how well the finetuning process is learning to generalize to unseen prompts, and it lets us pick the best checkpoint by validation loss rather than by training loss.

In [None]:
VAL_SET_SIZE = 2000  # we set aside 2000 items from our dataset for validation during training

train_val = data["train"].train_test_split(
    test_size=VAL_SET_SIZE, shuffle=True, seed=42
)
train_data = train_val["train"]
val_data = train_val["test"]
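
For intuition, a seeded split boils down to shuffling the row indices reproducibly and carving off the first `test_size` of them. The sketch below uses the standard library rather than the datasets implementation, so the actual rows chosen will differ; `split_indices` is a hypothetical name and the row count is the one reported above.

```python
import random


def split_indices(n_rows, test_size, seed=42):
    """Shuffle row indices with a fixed seed and carve off a held-out set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[test_size:], indices[:test_size]


train_idx, val_idx = split_indices(n_rows=51_942, test_size=2_000)
print(len(train_idx), len(val_idx))
```

Because the seed is fixed, rerunning the split gives the same partition, which keeps validation losses comparable across runs.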

We prepare the training dataset and the validation dataset by running the data through the prompt templating process and then by tokenizing the prompts.

In [None]:
train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))
val_data = val_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))

## Load and Configure the Model for Training

Load the specified RedPajama base model from the HuggingFace hub.

_Note: Llama, RedPajama and other decoder-only models are supported by the AutoModelForCausalLM class.  But for encoder-decoder models such as the [**google/t5**](https://huggingface.co/google/flan-t5-xxl) models you'll need to use the AutoModelForSeq2SeqLM class, and the training details are a little bit different.  Here is a [similar notebook](https://github.com/johnrobinsn/flan_ul2/blob/main/train-peft-flan-ul2-int8-alpaca.ipynb) for finetuning t5* models._

In [None]:
from transformers import AutoModelForCausalLM

print("Loading model for model: ", model_name[0])

model = AutoModelForCausalLM.from_pretrained(
    model_name[0],
    load_in_8bit=True,
    device_map="auto",
)

Now, we can prepare our model for the LoRA int-8 training using the HF peft library.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config 
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

After installing the LoRA adapters into the model, notice the significant reduction in the number of trainable parameters.
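
Here is a back-of-the-envelope sketch of where the trainable parameter count comes from. The hidden size and layer count below are my assumptions about the 3B model, not values taken from this notebook, so treat the result as an estimate to compare against what `print_trainable_parameters()` reports.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed shapes for the 3B model (not taken from the notebook):
HIDDEN = 2560        # hidden size
N_LAYERS = 32        # number of transformer blocks
R = 8                # matches r in the LoraConfig above

# query_key_value is a fused projection from HIDDEN to 3 * HIDDEN
per_layer = lora_param_count(HIDDEN, 3 * HIDDEN, R)
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions only a few million parameters are trained, against roughly three billion frozen ones.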

We'll leverage the training loop from the transformers library since it does a pretty good job with handling the details.

In [None]:
import transformers
eval_steps = 200
save_steps = 200
logging_steps = 20

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        num_train_epochs=3,
        learning_rate=3e-4,
        logging_steps=logging_steps,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=eval_steps,
        save_steps=save_steps,
        output_dir=output_dir,
        report_to=report_to if report_to else "none",
        save_total_limit=3,
        load_best_model_at_end=True,
        push_to_hub=False,
        auto_find_batch_size=True
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
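
With `mlm=False`, the data collator builds the training labels from the inputs. As I understand it, it copies `input_ids` and replaces padding positions with `-100` so that the loss ignores them. Here is a minimal pure-Python sketch of that step; `make_causal_lm_labels` is a hypothetical name and `99` stands in for the `<eos>` id.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def make_causal_lm_labels(input_ids, pad_token_id=0):
    """Copy input_ids into labels and mask out the padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


input_ids = [11, 12, 13, 99, 0, 0]   # 99 stands in for the <eos> id (hypothetical)
print(make_causal_lm_labels(input_ids))
```

This is one reason to keep the end-of-stream token distinct from the pad token: if they shared an id, the stop token would be masked out of the loss along with the padding.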

## Train
Run the training loop.  This will take quite a while: on a Titan RTX the 3B took X and the 7B took Y.

In [None]:
trainer.train() 
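
To get a feel for the length of the run, here is a rough step count. It assumes a single GPU, no gradient accumulation, and a per-device batch size of 8; the batch size is an assumption, since `auto_find_batch_size=True` may lower it if memory runs out. `training_steps` is a hypothetical helper, not part of the Trainer API.

```python
import math


def training_steps(n_train, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


n_train = 51_942 - 2_000      # rows left after the validation split
total = training_steps(n_train, batch_size=8, epochs=3)   # batch_size=8 is assumed
print(total, "steps,", total // 200, "evaluations/checkpoints")
```

With `eval_steps` and `save_steps` both at 200, every evaluation lines up with a checkpoint, which is what `load_best_model_at_end=True` relies on.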

## Save the Trained Adapter Model to Disk

Now that we've trained the model we'll want to save our weights.  First I demonstrate how to save them to disk.

In [None]:
# Save our LoRA model & tokenizer results
trainer.model.save_pretrained(peft_name)
tokenizer.save_pretrained(peft_name)

# if you want to save the base model to disk call
# trainer.model.base_model.save_pretrained(peft_model_id)

## Push the Trained Adapter Model to the HuggingFace Hub

Even better than saving your trained weights to disk, you can push them up to the HuggingFace Hub.  This makes it super easy to share your trained adapter with others or to set up your model for inference on other devices.

In [None]:
!pip install -Uqq huggingface_hub
import huggingface_hub
huggingface_hub.login()

In [None]:
# If you don't already have the git extension for large file storage (git-lfs), you might have to install it now.
# Here is how you can do this for Linux from the shell.  For other OSs please refer to the git-lfs documentation.
# sudo apt install git-lfs

In [None]:
repo_id = f'{huggingface_hub.whoami()["name"]}/{peft_name}'
trainer.model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

You should now be able to check HuggingFace and see your LoRA adapter model.

## Free Up Memory

We likely used a lot of memory during training, and we'll need that memory back to try the model out, so we take a few steps to free up VRAM here.

In [None]:
import torch
import gc
config = None
model = None
tokenizer=None
trainer=None
gc.collect()
torch.cuda.empty_cache()

## Evaluate
Here we'll try out the model for inference.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Load peft config for pre-trained checkpoint etc. 
#peft_model_id = "results"
#config = PeftConfig.from_pretrained(peft_model_id)

# load base LLM model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name[0],
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name[1])
tokenizer.pad_token_id = 0
tokenizer.add_special_tokens({'eos_token':'<eos>'})
# tokenizer.eos_token_id = 2

model.eval()


In [None]:
#from transformers import GenerationConfig

Here is the prompt template we'll use for inference.

_Note: It's important that it's identical to the one we used for training above, except that it omits the "output/response", as our model will generate that for us._

In [None]:
def generate_prompt(data_point):
    # the template text is kept flush-left so that no indentation leaks into the prompt
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:"""


Here is a small utility function that lets us easily prompt our model with an instruction and an optional input.  It handles templating the prompt, tokenizing the templatized prompt, generating, decoding the result, and finally stripping the prompt off so that we are left with just the model's response.

In [None]:
def generate(instruction,input=None,maxTokens=256):
    prompt = generate_prompt({'instruction':instruction,'input':input})
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=maxTokens, 
                             do_sample=True, top_p=0.9,pad_token_id=tokenizer.eos_token_id,
                             forced_eos_token_id=tokenizer.eos_token_id)
    outputs = outputs[0].tolist()
    # Stop decoding when hitting the EOS token
    if tokenizer.eos_token_id in outputs:
        eos_index = outputs.index(tokenizer.eos_token_id)
        decoded = tokenizer.decode(outputs[:eos_index])
        # Don't show the prompt template
        sentinel = "### Response:"
        sentinelLoc = decoded.find(sentinel)
        if sentinelLoc >= 0:
            print(decoded[sentinelLoc+len(sentinel):])
        else:
            print('Warning: Expected prompt template to be emitted.  Ignoring output.')
    else:
        print('Warning: no <eos> detected ignoring output')
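
The post-processing half of that function can be exercised without a model. The sketch below isolates it: cut the generated ids at the first end-of-stream id, decode, and keep only what follows the template marker. Unlike the function above, it returns the text instead of printing it and strips surrounding whitespace; `extract_response` and the toy `decode` are hypothetical stand-ins.

```python
def extract_response(token_ids, eos_id, decode, sentinel="### Response:"):
    """Return the text after the sentinel, cut at the first EOS id, or None."""
    if eos_id not in token_ids:
        return None                       # the model never signalled it was done
    text = decode(token_ids[:token_ids.index(eos_id)])
    _, found, response = text.partition(sentinel)
    return response.strip() if found else None


# Toy stand-ins (hypothetical): ids map to words and 3 plays the role of <eos>.
words = {0: "###", 1: "Response:", 2: "Hello", 3: "<eos>", 4: "ignored"}
decode = lambda ids: " ".join(words[i] for i in ids)

print(extract_response([0, 1, 2, 3, 4], eos_id=3, decode=decode))
print(extract_response([0, 1, 2], eos_id=3, decode=decode))
```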

### Generating using the Base Model

This demonstrates the behavior of the RedPajama model with no finetuning applied.  Below I'll load the LoRA adapter that has been trained on the finetuning task to demonstrate the change in behavior.

In [None]:
torch.manual_seed(42)
generate('Write a short story in third person narration about a protagonist who has to make an important career decision.',maxTokens=300)
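
The seed is fixed because `generate` samples with `do_sample=True` and `top_p=0.9`, so its output is random. As a rough illustration of nucleus (top-p) sampling, here is a pure-Python sketch over a made-up next-token distribution. It is a simplified version of the idea, not the transformers implementation, and `top_p_sample` is a hypothetical name.

```python
import random


def top_p_sample(probs, top_p=0.9, rng=random):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in nucleus]   # random.choices renormalizes these
    return rng.choices(nucleus, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]   # made-up next-token distribution
rng = random.Random(42)
samples = {top_p_sample(probs, 0.9, rng) for _ in range(1000)}
print(sorted(samples))
```

The least likely token falls outside the nucleus and is never drawn, which is how top-p trims the unreliable tail of the distribution while still allowing variety.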

### Load the LoRA Adapter

As you can see the generated text doesn't seem very responsive to the prompt.  Now let's load the trained LoRA adapter and see what happens.

_Note: Here you can either load my pretrained LoRA adapter from the HuggingFace hub or, if you trained your own adapter above, uncomment the specified line below to load your adapter from disk._

In [None]:
peft_model_id = f'johnrobinsn/{peft_name}' # By default use my pretrained adapter weights
#peft_model_id = peft_name # Uncomment to use locally saved adapter weights if you trained above

# Load the LoRA model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

print("Peft model adapter loaded")
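
For intuition about what the adapter changes, here is a toy sketch of the LoRA update for a single layer: the frozen weight `W` is effectively replaced by `W + (alpha / r) * B @ A`. The 2x2 matrices and the "trained" values are made up, and the helper names are hypothetical. PEFT keeps the adapter as a separate branch rather than forming this sum by default, but the result is mathematically the same.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Effective weight W + (alpha / r) * B @ A for one adapted layer."""
    scale = alpha / r
    delta = matmul(b, a)                       # (d_out x r) @ (r x d_in)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0], [3.0, 4.0]]       # frozen base weight (toy 2x2)
a = [[0.5, -0.5]]                  # A: r x d_in with r = 1
b_zero = [[0.0], [0.0]]            # B starts at zero, so the adapter is a no-op
b_trained = [[1.0], [2.0]]         # made-up values standing in for a trained B

print(lora_effective_weight(w, a, b_zero, alpha=2, r=1))
print(lora_effective_weight(w, a, b_trained, alpha=2, r=1))
```

The scale `alpha / r` here is 2, the same ratio as `lora_alpha=16` and `r=8` in the training configuration.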

Let's try the same prompt again.

In [None]:
torch.manual_seed(42)
generate('Write a short story in third person narration about a protagonist who has to make an important career decision.',maxTokens=300)

### A Few More Prompts

In [None]:
generate('Who was the first man to walk on the moon and tell me where he was born.')

In this example, we provide not only an instruction but also some context for it, in this case a list of possible answers.

In [None]:
generate('Identify the odd one out and explain why.','Twitter, Instagram, Telegram')

In [None]:
generate('Write a poem about a cat',maxTokens=1000)