<a href="https://colab.research.google.com/github/jackychh7878/Colab_AI_Project/blob/main/Llama2_LoRA_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama LoRA Finetuning Project

In this project, we fine-tune Llama 2 7B Chat for text generation using LoRA (low-rank adaptation) on top of a 4-bit quantized base model.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


In [None]:
!pip list

Package                            Version
---------------------------------- -------------------
absl-py                            1.4.0
accelerate                         1.1.1
aiohappyeyeballs                   2.4.3
aiohttp                            3.11.2
aiosignal                          1.3.1
alabaster                          1.0.0
albucore                           0.0.19
albumentations                     1.4.20
altair                             4.2.2
annotated-types                    0.7.0
anyio                              3.7.1
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.5.1
arviz                              0.20.0
astropy                            6.1.6
astropy-iers-data                  0.2024.11.18.0.35.2
astunparse                         1.6.3
async-timeout                      4.0.3
atpublic                           4.1.0
attrs                              24.2.0
audioread           

# Step 1: Import model and instruction dataset

In [None]:
import torch
import os
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, PeftModel
from trl import SFTTrainer

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)

# Fine-tuned model name
new_model = "Llama-2-7b-chat-new-finetune"

In [None]:
from datasets import load_dataset

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

dataset = load_dataset(dataset_name, split="train")

dataset

README.md:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

(…)-00000-of-00001-9ad84bb9cf65a42f.parquet:   0%|          | 0.00/967k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 1000
})

In [None]:
dataset[0]

{'text': '<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>'}
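
Each row of this dataset is a single string that already follows the single-turn Llama 2 chat template, `<s>[INST] prompt [/INST] response </s>`. As a minimal sketch, assuming you wanted to add your own examples in the same shape, a helper like the one below could build them. The name `format_llama2_example` is hypothetical and is not part of `datasets` or `trl`.

```python
def format_llama2_example(instruction: str, response: str) -> str:
    """Wrap one instruction/response pair in the single-turn Llama 2 chat template."""
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"


example = format_llama2_example("What is LoRA?", "A low-rank fine-tuning method.")
print(example)
```

Keeping custom rows in exactly this layout means they can sit in the same `text` column that `SFTTrainer` reads via `dataset_text_field="text"`.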

# Step 2: Configure the LoRA parameters

In [None]:
# LoRA parameters
##################

# LoRA attention dimension
lora_r = 64


# Alpha parameter for LoRA scaling
lora_alpha = 16


# Dropout probability for LoRA layers
lora_dropout = 0.1


##############################
# bitsandbytes parameters
###############################

# Activate 4-bit precision base model loading
use_4bit = True


# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"


# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"


# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False


##################################
# TrainingArguments parameters
##################################

# Output directory where the model predictions and checkpoints will be stored
# output_dir = "./results"
output_dir = "./Llama-2-7b-finetuned-model"

# Number of training epochs
num_train_epochs = 1


# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False


# Batch size per GPU for training
per_device_train_batch_size = 4



# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25


#######################
# SFT parameters
#######################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
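
To get a feel for what `lora_r` and `lora_alpha` imply, the rough arithmetic below estimates the LoRA scaling factor and the number of trainable parameters. It is a sketch under assumptions that are not stated in this notebook: a hidden size of 4096, 32 decoder layers, and adapters attached only to `q_proj` and `v_proj` (assumed to be the PEFT default for Llama when `target_modules` is not set).

```python
# Assumed Llama 2 7B shapes (not stated in this notebook)
hidden_size = 4096
num_layers = 32
target_modules = ["q_proj", "v_proj"]  # assumed PEFT default targets for Llama

lora_r = 64
lora_alpha = 16

# LoRA scales the low-rank update B @ A by alpha / r
scaling = lora_alpha / lora_r

# Each adapted square projection adds A (r x hidden) and B (hidden x r)
params_per_module = lora_r * hidden_size + hidden_size * lora_r
trainable_params = params_per_module * len(target_modules) * num_layers

print(f"scaling = {scaling}")
print(f"trainable LoRA parameters = {trainable_params:,}")
print(f"adapter size in fp32 ~ {trainable_params * 4 / 1e6:.0f} MB")
```

Under these assumptions the adapter comes to roughly 134 MB in fp32, which is consistent with the 134M `adapter_model.bin` downloaded in Step 5.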

In [None]:
# Build the bitsandbytes quantization config used to load the base model in 4-bit
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
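
The point of 4-bit loading is memory. The back-of-the-envelope estimate below compares weight storage at different precisions. It assumes roughly 6.74 billion parameters for Llama 2 7B and ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so real usage will be higher.

```python
# Assumed parameter count for Llama 2 7B (roughly 6.74 billion)
num_params = 6.74e9

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate weight memory in GB, ignoring quantization overhead and activations."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision:>8}: ~{weight_memory_gb(num_params, precision):.1f} GB")
```

The float16 figure lines up with the two checkpoint shards (9.98G + 3.50G) downloaded in Step 3.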

# Check GPU compatibility before training

In [None]:
torch.cuda.get_device_capability()

(8, 0)

In [None]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

Your GPU supports bfloat16: accelerate training with bf16=True
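
The check above only prints a hint; the `fp16` and `bf16` flags in Step 2 both stay `False`. One possible policy, sketched here with a hypothetical helper `pick_mixed_precision`, is to derive the flags from the compute capability directly. The rule that a major version of 8 or higher implies bfloat16 support mirrors the check above.

```python
def pick_mixed_precision(capability: tuple[int, int]) -> dict[str, bool]:
    """Choose fp16/bf16 flags from a CUDA compute capability (major, minor)."""
    major, _minor = capability
    use_bf16 = major >= 8  # same threshold as the check above
    return {"fp16": not use_bf16, "bf16": use_bf16}


print(pick_mixed_precision((8, 0)))  # the capability reported above
print(pick_mixed_precision((7, 5)))  # an older GPU, e.g. a T4
```

In the notebook, the returned values could be passed on as `fp16=` and `bf16=` when building `TrainingArguments`.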


# Step 3: Train the model

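The original Llama checkpoints must be converted to the Hugging Face format with the script below. The `NousResearch/Llama-2-7b-chat-hf` checkpoint loaded in the next cell is already in that format, so no conversion is needed here.
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py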

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf"

# Load the base model with the 4-bit quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    use_auth_token=True,
)
model.config.use_cache = False  # the generation cache is not needed during training
model.config.pretraining_tp = 1

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]



model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Fix weird overflow issue with fp16 training
# Note: do not assign `tokenizer.push_to_hub = True`; that would shadow the
# tokenizer's `push_to_hub()` method. Hub upload is configured in TrainingArguments.

tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

In [None]:
from google.colab import userdata

# Set up the LoRA config
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set up the training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
    push_to_hub=True,
    hub_token=userdata.get('HF_TOKEN_WRITE')
)

trainer=SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/JackBrain/Llama-2-7b-finetuned-model into local empty directory.


In [None]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 1.4082 |
| 50   | 1.663  |
| 75   | 1.2145 |
| 100  | 1.4447 |
| 125  | 1.1764 |
| 150  | 1.3652 |
| 175  | 1.1735 |
| 200  | 1.4672 |
| 225  | 1.1581 |
| 250  | 1.5414 |


TrainOutput(global_step=250, training_loss=1.3612234802246095, metrics={'train_runtime': 245.0593, 'train_samples_per_second': 4.081, 'train_steps_per_second': 1.02, 'total_flos': 8755214190673920.0, 'train_loss': 1.3612234802246095, 'epoch': 1.0})
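
The `global_step=250` reported above follows from the dataset size and batch settings. The sketch below reproduces that count and shows how a linear warmup followed by a cosine decay would shape the learning rate. It assumes a single GPU and that warmup steps are rounded up from `warmup_ratio`; the function `lr_at` is illustrative and is not the scheduler object the trainer builds internally.

```python
import math

num_examples = 1000
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
learning_rate = 2e-4
warmup_ratio = 0.03

# Optimizer updates per epoch on a single device
steps_per_epoch = math.ceil(
    num_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding


def lr_at(step: int) -> float:
    """Linear warmup from 0, then cosine decay to 0 at the final step."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With `logging_steps = 25`, 250 steps also explains why the loss table has exactly ten rows.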

In [None]:
trainer.model.save_pretrained(new_model)

In [None]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



# Step 4: Deploy to the Hugging Face Hub

In [None]:
from google.colab import userdata

trainer.push_to_hub(token=userdata.get('HF_TOKEN_WRITE'))

To https://huggingface.co/JackBrain/Llama-2-7b-finetuned-model
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://huggingface.co/JackBrain/Llama-2-7b-finetuned-model'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

OSError: To https://huggingface.co/JackBrain/Llama-2-7b-finetuned-model
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://huggingface.co/JackBrain/Llama-2-7b-finetuned-model'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.


# Step 5: Inference

In [1]:
!pip install transformers accelerate



In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model_id="JackBrain/Llama-2-7b-finetuned-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = transformers.pipeline(
  "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]


ValueError: Could not load model JackBrain/Llama-2-7b-finetuned-model with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>, <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>).

In [2]:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and fine-tuned model
base_model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
model = PeftModel.from_pretrained(base_model, "JackBrain/Llama-2-7b-finetuned-model")

# Load the tokenizer
model_id = "JackBrain/Llama-2-7b-finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/453 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

In [7]:
# Define the prompt
prompt = "what is a large language model?"

# Tokenize the input
input_ids = tokenizer(
    f'<s>[INST] {prompt} [/INST]',
    return_tensors="pt"
).input_ids

# Generate sequences
sequences = model.generate(
    input_ids=input_ids,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)

# Decode and print the output
output = tokenizer.decode(sequences[0], skip_special_tokens=True)
print(output)

[INST] what is a large language model? [/INST] A large language model is a type of artificial intelligence model that is trained on a large corpus of text data, such as a language's entire corpus of books, articles, and other written works. (In the case of English, this would be the entirety of the English language, including all books, articles, and other written works ever written in the English language, such as Shakespeare's plays, the Bible, and so on.)

The goal of training a large language model is to create a model that can understand and generate human-like language, and to be able to perform a wide range of natural language processing tasks, such as language translation, text summarization, and sentiment analysis.

Some of the most well-known large language models include:

* BERT (Bidirectional Encoder Representations from Transformers): A language model developed by Google that
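
The decoded text above still contains the prompt, because `generate` returns the input tokens followed by the continuation. As a minimal sketch, a small helper can keep only the part after the last `[/INST]` marker. The name `extract_response` is hypothetical, and the sample string is shortened for illustration rather than copied from the model output.

```python
def extract_response(decoded: str, marker: str = "[/INST]") -> str:
    """Return the text after the last instruction marker, or the whole string if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


decoded = "[INST] what is a large language model? [/INST] A large language model is a type of AI model."
print(extract_response(decoded))
```

Splitting on the last marker rather than the first keeps the helper usable if a prompt ever contains several `[INST] ... [/INST]` turns.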


In [8]:
prompt = "Who is the president of US?"

input_ids = tokenizer(
    f'<s>[INST] {prompt} [/INST]',
    return_tensors="pt"
).input_ids

sequences = model.generate(
    input_ids=input_ids,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)

output = tokenizer.decode(sequences[0], skip_special_tokens=True)
print(output)

[INST] Who is the president of US? [/INST] The current president of the United States is Joe Biden.

Joe Biden was born on November 20, 1942, in Scranton, Pennsylvania. He served as the 46th Vice President of the United States from 2009 to 2017 under President Barack Obama. He was elected as the 46th President of the United States in the 2020 presidential election, defeating incumbent President Donald Trump.

Biden's political career spans over 40 years, during which he has served in various roles, including six terms in the United States Senate and as the 47th Vice President of the United States. He has been a strong advocate for progressive policies, including healthcare reform, climate change, and social justice.

As President, Biden
