<a href="https://colab.research.google.com/github/prupat/LLMs/blob/main/fine_tune_llama_2_Assignment_Prudence_Brou.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

❗**WARNING**❗

You cannot do this lab locally. If you bought GPU credits last time, you should still have plenty left over. Use the T4 GPU or TPU.

If you cannot get GPUs for whatever reason, let Michelle know.


⭐ **BEFORE YOU BEGIN**

You can access [LLaMa2 via Hugging Face](https://huggingface.co/docs/transformers/main/model_doc/llama2) (which is what we will do). However, you MUST [request access to it from Facebook](https://ai.meta.com/llama/). There's a request link on the Hugging Face model card as well.

_"Weights for the Llama2 models can be obtained by filling out this form"_

The email you use to log in to Hugging Face and the email you use to request from Facebook must be the same email. The request is typically granted in less than 5 minutes.



## Fine Tuning lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** November 13


### Background
The purpose of this lab is to see how to fine-tune [the LLaMa2 model from Meta/Facebook](https://huggingface.co/meta-llama/Llama-2-7b) for question answering. LLaMa2 is a popular, open-source foundation model that is small enough to run on a laptop and to be fine-tuned on consumer-grade GPUs.

We will do a partial retraining because it is nearly as effective as a full retraining, much faster, and needs far fewer GPU resources. We'll use the LoRA method.

We will use the 7B model because it is the smallest and therefore the fastest for retraining.

You can interact with the [LLaMa2 model via the Chat API](https://www.llama2.ai/). Change the settings with the 'Settings' button in the upper right. We'll use the 7B, so I recommend switching to 7B to compare this model versus our fine tuned model.


### Notes
The way this lab is set up, it's hard to change the dataset. However, if you are looking for data to do a project around fine-tuning, here are some good options:
1. [Instructions from Databricks - the dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
2. [Question answering with OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)


### References
The code for this lab is almost completely from [this colab written by Maxime Labonne, Aug 2023](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing)  

### Install packages
*Version numbers* are pinned because it's best practice and because, if you don't specify one for `bitsandbytes`, the lab won't work.

* `accelerate` allows PyTorch to run in a distributed way
* `peft` is Parameter-Efficient Fine-Tuning
* `bitsandbytes` gives us quantization, which lets us run this code with far less memory. Quantization is the process of mapping values from a large set (for example, 16- or 32-bit floats) to a much smaller set (here, 4-bit values).
* `transformers` is the Hugging Face library we've been using to access the models
* `trl` is Transformer Reinforcement Learning. It is built for reinforcement learning, but we'll use it for the supervised fine-tuning step.

In [1]:
!pip install accelerate==0.21.0
!pip install peft==0.4.0
!pip install bitsandbytes==0.40.2
!pip install transformers==4.31.0
!pip install trl==0.4.7

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0
Collecting peft==0.4.0
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting transformers (from peft==0.4.0)
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting safetensors (from peft==0.4.0)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,
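
Because the pins matter, it can be worth confirming what actually got installed; the log above shows `transformers-4.35.0` being downloaded as a dependency of `peft` before the pinned `transformers==4.31.0` install runs. The following is a minimal sketch of such a check, not part of the original lab. The helper names `installed_version` and `find_mismatches` are made up for illustration; only the standard-library `importlib.metadata` module is used.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
EXPECTED_VERSIONS = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.31.0",
    "trl": "0.4.7",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Print one line per package whose installed version differs from the pin.
for package, (wanted, found) in find_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {found}")
```

If nothing is printed, every pinned version is in place. If a mismatch shows up, restarting the Colab runtime after the install cell is usually the first thing to try.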

**Packages**
* PyTorch (`torch`)
* `AutoModelForCausalLM` is a model class for anything with a causal language model head (the head is the final part of the LLM, which turns hidden states into next-token scores). LLaMa2 is this type of model
* `AutoTokenizer` automatically detects which type of tokenizer the model used, so the tokenization of the new data you add will match
* `BitsAndBytesConfig` is just the configuration for the quantization
* `HfArgumentParser` parses command-line arguments into argument classes such as `TrainingArguments` (it is imported here but not used in this lab)
* `TrainingArguments` collects the hyperparameters used for training
* `pipeline` makes the Hugging Face code easier to work with, especially for tasks like Q&A, Named Entity Recognition, Sentiment Analysis, etc.
* `logging` lets us control how detailed we want the library's messages to be.
* `SFTTrainer` is the supervised fine-tuning trainer from `trl`

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

#### The model
We have to call the model and the dataset from Hugging Face via the API, and give the new model a name. You have to be logged in to Hugging Face for this to work.

In [3]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
trained_model = "llama-2-7b-miniguanaco"


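This is for the LoRA process. LoRA freezes the original weights and learns a pair of small, low-rank matrices whose product is added to them; `lora_r` sets the rank (the shared inner dimension) of those matrices. These are the matrices we're learning from the training data.

The other parameters tune the LoRA process: `lora_alpha` scales the learned update and `lora_dropout` regularizes it.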

In [4]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
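
To see why a rank of 64 makes training so much cheaper, the sketch below counts the trainable parameters for one weight matrix and builds a tiny low-rank update by hand. It is an illustration only, not the `peft` implementation: the hidden size of 4096 is an assumption about the 7B model, and the helper names are hypothetical.

```python
def lora_param_counts(d_in, d_out, r):
    """Compare trainable parameters: full weight matrix vs. the two LoRA factors."""
    full = d_in * d_out          # W has shape d_out x d_in
    lora = r * d_in + d_out * r  # A has shape r x d_in, B has shape d_out x r
    return full, lora


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(B, A, alpha, r):
    """Low-rank update (alpha / r) * B @ A that is added to the frozen weight."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


# Assumed hidden size of 4096 for one square projection matrix, with lora_r = 64.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=64)
print(f"full: {full:,}  lora: {lora:,}  share: {lora / full:.2%}")

# Tiny rank-1 example: a 2x2 update built from a 2x1 factor and a 1x2 factor.
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(lora_delta(B, A, alpha=2, r=1))
```

The update has the same shape as the original matrix, but only the two thin factors are trained, which is where the savings come from.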

Quantization makes the fine-tuning process that much more memory efficient. The goal is to store each weight in fewer bits (here 4 instead of 16), mapping the large set of possible values to a much smaller one. This introduces a small rounding error in each weight, in exchange for a model that takes roughly a quarter of the memory.

In [5]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
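
As a rough picture of what 4-bit quantization does to a handful of weights, here is a toy example. Note that it is deliberately simplified: it uses uniform "absmax" rounding on a single list, whereas the `nf4` type selected above uses non-uniform levels and block-wise scaling. The function names and the sample weights are made up for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes using one shared scale (toy version)."""
    levels = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed codes
    scale = max(abs(w) for w in weights) / levels   # assumes at least one non-zero weight
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.10, 0.02, -0.61]  # made-up values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("max rounding error:", round(max(errors), 3))
```

Each weight is now one of only 15 integer codes plus a shared scale, and the restored values are close to, but not exactly, the originals.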


This cell sets the parameters for the training process: number of epochs, learning rate, etc. We won't go through all of these, but they are all hyperparameters that have to do with training.

In [6]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25
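
A few of these settings interact, so it helps to work out how many optimizer steps one epoch will take. The arithmetic below is a sketch under two assumptions: the dataset has 1,000 examples (as the name `guanaco-llama2-1k` suggests) and training runs on a single GPU.

```python
import math

num_examples = 1000              # assumption: size of the training split
num_gpus = 1                     # assumption: a single Colab GPU
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
logging_steps = 25
save_steps = 25

# Examples consumed per optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a loss value is logged and a checkpoint is written.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print("effective batch size:", effective_batch)
print("total optimizer steps:", total_steps)
print("loss is logged at steps:", logged_steps)
print("number of checkpoints:", len(checkpoint_steps))
```

Under these assumptions the run takes 250 steps and logs the loss ten times, which lines up with the training log further down, where the last entry is at step 250.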

These hyperparameters are specific to the supervised fine tuning method.

In [7]:

# SFT parameters

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
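
The `packing` option is switched off here, but the idea is easy to picture: instead of padding every short example, the tokenized examples are joined into one long stream and cut into equal-length chunks. The sketch below illustrates that idea only; it is not the `trl` implementation, and the function name, token ids, and separator id are all made up.

```python
def pack_examples(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (separated by eos_id) and cut into fixed-size chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens + [eos_id])
    # Keep only complete chunks; the leftover tail is dropped in this sketch.
    last_start = len(stream) - seq_length
    return [stream[i:i + seq_length] for i in range(0, last_start + 1, seq_length)]


examples = [[5, 6, 7], [8, 9], [10]]  # made-up token ids
packed = pack_examples(examples, seq_length=4, eos_id=2)
print(packed)
```

Every packed sequence is completely filled with real tokens, which is where the efficiency gain comes from. The trade-off is that one sequence can contain pieces of several unrelated examples.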

Finally, load the training split of the dataset and build the quantization config from the settings that we specified before.

In [8]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)


Load the base model in 4-bit, again using the quantization config and device map we already established.

In [9]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

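
A quick back-of-the-envelope calculation shows why loading in 4-bit matters on a T4. The sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of about 6.74 billion is an assumption about the 7B model, and the estimate ignores activations, optimizer state, and the small overhead of the quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 6.74e9  # assumption: approximate parameter count of the 7B model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

On a GPU with roughly 16 GB of memory, the 16-bit weights would leave almost no room for training, while the 4-bit weights leave most of the memory free for the LoRA matrices, activations, and optimizer state.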

Call the tokenizer using AutoTokenizer, which will automatically detect the type of tokenizer that LLaMa uses.

In [10]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

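
The tokenizer cell reuses the end-of-sequence token as the padding token and pads on the right. To make that concrete, here is a small sketch of what right-padding a batch looks like. It is plain Python rather than the tokenizer's own code; the token ids are made up, and the end-of-sequence id of 2 is an assumption about the LLaMA vocabulary.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_padding = [0] * len(padding)
        if padding_side == "right":
            input_ids.append(seq + padding)
            attention_mask.append([1] * len(seq) + mask_padding)
        else:
            input_ids.append(padding + seq)
            attention_mask.append(mask_padding + [1] * len(seq))
    return input_ids, attention_mask


EOS_ID = 2  # assumption: id of the end-of-sequence token
batch = [[1, 15043, 3186], [1, 15043]]  # made-up token ids for illustration
ids, mask = pad_batch(batch, pad_id=EOS_ID)
print(ids)
print(mask)
```

The attention mask marks the padded positions with 0 so the model can ignore them, even though the padding id is the same as the end-of-sequence id.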

Now load all the LoRA settings we established earlier.

In [11]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

Now load the training parameters we established earlier.

In [12]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)


Now pass the model, dataset, tokenizer, and all of these settings to the supervised fine-tuning trainer.

In [13]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)




Finally, train the model. Notice that if you trace back through the past 3 cells, you can follow what the name `trainer` refers to. It seems like a simple call, but the setup builds progressively.

This will train for one epoch, so it won't take long. If you want to improve the model, feel free to modify the earlier settings (e.g., decrease the learning rate or batch size, increase the number of epochs, etc.)

Saving the trained model is essential. If you do not save the trained model, you will not be able to use it for prediction later, because the model only exists in active memory until you save it. Active memory is cleared whenever the Colab runtime restarts or disconnects, so you never want to rely on a model that lives only there.

In [14]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(trained_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25 | 1.3478 |
| 50 | 1.6183 |
| 75 | 1.2104 |
| 100 | 1.4322 |
| 125 | 1.177 |
| 150 | 1.3557 |
| 175 | 1.1706 |
| 200 | 1.4527 |
| 225 | 1.1533 |
| 250 | 1.5222 |
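
The logged losses bounce around from one entry to the next, so the trend is easier to judge from averages. The snippet below is a small optional check using only the values from the log above; it is not part of the original lab.

```python
from statistics import mean

# (step, training loss) pairs copied from the training log above.
loss_log = [
    (25, 1.3478), (50, 1.6183), (75, 1.2104), (100, 1.4322), (125, 1.177),
    (150, 1.3557), (175, 1.1706), (200, 1.4527), (225, 1.1533), (250, 1.5222),
]

losses = [loss for _, loss in loss_log]
overall = mean(losses)
first_half = mean(losses[:5])    # steps 25 to 125
second_half = mean(losses[5:])   # steps 150 to 250

print(f"overall mean loss:     {overall:.4f}")
print(f"mean loss, first half: {first_half:.4f}")
print(f"mean loss, second half: {second_half:.4f}")
```

With only ten logged values and a single epoch, a small difference between the two halves should be read as a hint rather than as evidence of convergence.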




Setting the logging verbosity to `CRITICAL` just tells the library not to complain unless it absolutely must.

Here is your model! Change the question to see how it behaves. What do you think?



In [16]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is public interest technology?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is public interest technology? [/INST] Public interest technology is a term used to describe technology that is developed and used to benefit society as a whole, rather than to benefit a single individual or organization. This can include things like open-source software, public data sets, and other technologies that are made available to the public for free or at a low cost.

Public interest technology is often developed by non-profit organizations, government agencies, or other groups that are committed to using technology to benefit society. These groups may work to develop and promote technologies that address important social issues, such as poverty, inequality, and environmental degradation.

Some examples of public interest technology include:

1. Open-source software: This is software that is made available to the public for free or at a low cost, and can be modified and distributed by anyone. Examples of open-source software include Linux, Apache, and Word
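
The string passed to the pipeline above wraps the question in `<s>[INST] ... [/INST]`, which is the chat format the LLaMa2 chat models expect. If you want to try several questions, a small helper keeps the formatting consistent. This is a sketch: the function name is made up, and the optional system-prompt wrapping with `<<SYS>>` follows the commonly documented Llama 2 chat template rather than anything used in this lab.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap a user message (and optional system prompt) in the Llama 2 chat format."""
    user_message = user_message.strip()
    if system_prompt:
        user_message = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{user_message}"
    return f"<s>[INST] {user_message} [/INST]"


questions = [
    "What is public interest technology?",
    "What is a large language model?",
]
for question in questions:
    print(build_llama2_prompt(question))
```

Each resulting string can be passed to `pipe(...)` in place of the f-string used in the cell above.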


Now try this question on the [LLaMa2 API interface](https://www.llama2.ai/). Did your fine tuned model do better? What are the main differences?


Here is the answer from the LLaMa2 API:

Public interest technology refers to the use of technology and digital tools to address social, economic, or political issues that affect society as a whole, rather than just a specific group or industry. It encompasses a wide range of activities, including:
1. Digital inclusion: Ensuring that everyone has access to technology and the skills necessary to use it effectively, regardless of their socio-economic status or geographic location.
2. Open source software: Making software code available for anyone to use, modify, and distribute, with the goal of promoting collaboration and innovation in the tech sector.
3. Data journalism: Using data analysis and visualization techniques to uncover and tell important stories that might otherwise go unreported.
4. Election technology: Improving the transparency, security, and accessibility of voting systems, with the goal of ensuring that every citizen's vote counts.
5. Cybersecurity: Developing technologies and strategies to protect individuals and organizations from cyber threats, such as hacking and identity theft.
6. Artificial intelligence: Applying AI to solve complex social problems, such as healthcare disparities, climate change, and education inequality.
7. Privacy protection: Advocating for policies and technologies that safeguard individuals' personal information and prevent unwarranted surveillance and discrimination.
8. Digital rights: Defending the rights of marginalized communities online, including freedom of expression, privacy, and safety.
9. Accessible design: Creating products and services that are usable by people of all abilities, including those with disabilities.
10. Tech for good: Supporting non-profits and social enterprises that leverage technology to address social issues, such as poverty, inequality, and environmental degradation.
By focusing on these areas, public interest technology aims to harness the power of technology to improve society and promote greater equality, justice, and democracy.

The fine-tuned model did a good job of answering the question. The downside is that it didn't elaborate the way LLaMa2 did, likely in part because `max_length=200` cut its answer off mid-list. Other than that, our model is consistent.