# **Fine-tuning Llama-2-7b-hf on a psychology dataset**

**Install all required libraries**

In [None]:
!pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install transformers==4.31
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install evaluate
!pip install -qqq trl==0.7.1
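
Because some of these packages are pinned and others are installed from git, it can help to confirm which versions actually ended up in the runtime. The following is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` tuple are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Packages installed by the cell above; extend the tuple as needed.
PACKAGES = (
    "bitsandbytes", "transformers", "peft",
    "accelerate", "datasets", "evaluate", "trl",
)


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:>14}: {version or 'not installed'}")
```

Running this right after the install cell gives a record of the environment to compare against if the notebook behaves differently on a later run.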



**Import all required modules**



In [None]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
import random
from peft import LoraConfig, PeftModel, AutoPeftModelForCausalLM, prepare_model_for_kbit_training
from trl import SFTTrainer

**Load the dataset from the Hugging Face Hub**

In [None]:
psychology_dataset = "jkhedri/psychology-dataset"

# Load the train split of the dataset above from the Hugging Face Hub

dataset = load_dataset(psychology_dataset, split="train")
dataset

Dataset({
    features: ['question', 'response_j', 'response_k'],
    num_rows: 9846
})

## **BitsAndBytesConfig**
- **load_in_4bit:** Loads the model with its weights quantized to 4 bits, so that a large model fits in limited GPU memory.

- **Note:** Once a model has been loaded in 4-bit, it is currently not possible to push the quantized weights to the Hub. The 4-bit weights themselves cannot be trained either, as this is not supported yet. You can, however, use a 4-bit model to train extra parameters (such as LoRA adapters), which is covered in the LoRA section below.

- **Training:** According to the QLoRA paper, training on top of a 4-bit base model (e.g. using LoRA adapters) should use bnb_4bit_quant_type='nf4'.

- **NF4 (Normal Float 4) data type:** A 4-bit data type adapted for weights that have been initialized using a normal distribution. It is selected with bnb_4bit_quant_type='nf4'.

- **Use nested quantization for more memory-efficient inference:**<br>
Nested (double) quantization, enabled with bnb_4bit_use_double_quant=True, is also advised. It saves more memory at no additional performance cost. From empirical observations, this enables fine-tuning a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1 and 4 gradient accumulation steps.
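
To make the memory saving from nested quantization concrete, here is a small arithmetic sketch. It assumes the block sizes described in the QLoRA paper (one scaling constant per 64 weights, and second-level constants shared across 256 first-level constants); the defaults in a given bitsandbytes version may differ, and the 7e9 weight count is a round illustrative figure rather than a measurement.

```python
def bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Average storage per weight for blockwise 4-bit quantization."""
    weight_bits = 4.0
    if double_quant:
        # First-level constants are stored in 8 bits; each group of
        # `second_block` of them shares one 32-bit second-level constant.
        const_bits = 8 / block_size + 32 / (block_size * second_block)
    else:
        # One 32-bit scaling constant per block of weights.
        const_bits = 32 / block_size
    return weight_bits + const_bits


plain = bits_per_param(double_quant=False)
nested = bits_per_param(double_quant=True)

n_weights = 7e9  # illustrative round number for a 7B model
saved_bytes = (plain - nested) * n_weights / 8

print(f"without nested quantization: {plain:.3f} bits per weight")
print(f"with nested quantization:    {nested:.3f} bits per weight")
print(f"estimated saving:            {saved_bytes / 1e9:.2f} GB")
```

Under these assumptions the overhead of the quantization constants drops from 0.5 to roughly 0.127 bits per weight, which is a modest saving for a 7B model and grows proportionally with model size.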



In [None]:

model_id =  "NousResearch/Llama-2-7b-hf"                       # use this model for finetuning
# model_id =  "TinyPixel/Llama-2-7B-bf16-sharded"


# Load the tokenizer and model with the QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)



#load base model
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map = "auto")

# load llama tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


model.config.use_cache = False
model.config.pretraining_tp = 1






**We can check the memory footprint of the model (in bytes) with the get_memory_footprint method**

In [None]:
print(model.get_memory_footprint())

3829936128
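
The reported number can be sanity-checked by hand. The sketch below assumes the standard Llama-2-7B shapes (the same shapes appear in the print(model) output later in this notebook), that every linear layer inside the decoder blocks is stored at 4 bits per weight, and that the embeddings, the output head and the norms stay in fp16. The untied output head and the final norm are assumptions, since they are not visible in the truncated printout.

```python
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000

attn = 4 * HIDDEN * HIDDEN             # q_proj, k_proj, v_proj, o_proj
mlp = 3 * HIDDEN * INTERMEDIATE        # gate_proj, up_proj, down_proj
linear_weights = LAYERS * (attn + mlp)

embeddings = 2 * VOCAB * HIDDEN        # embed_tokens + lm_head (assumed untied)
norms = LAYERS * 2 * HIDDEN + HIDDEN   # two RMSNorms per layer + a final norm
other_weights = embeddings + norms

quantized_bytes = linear_weights // 2  # 4 bits = half a byte per weight
fp16_bytes = other_weights * 2         # 2 bytes per weight
estimate = quantized_bytes + fp16_bytes

reported = 3829936128                  # value printed by get_memory_footprint()
print(f"estimate: {estimate:,} bytes")
print(f"reported: {reported:,} bytes")
print(f"estimate / reported = {estimate / reported:.3f}")
```

The estimate covers about 98% of the reported footprint. The remaining gap of roughly 64 MiB is plausibly made up of non-parameter buffers and quantization metadata, which this sketch does not model.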


## **Total parameters of the model and trainable parameters during fine-tuning**

In [None]:

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():

        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )




model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

print(model)


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


## **Common LoRA parameters in PEFT**
- Instantiate a base model.
- Create a configuration (LoraConfig) where you define LoRA-specific parameters.
- Wrap the base model with get_peft_model() to get a trainable PeftModel.
- Train the PeftModel as you normally would train the base model.

## **LoraConfig allows you to control how LoRA is applied to the base model through the following parameters:**

- **r:** The rank of the update matrices, expressed as an int. A lower rank results in smaller update matrices with fewer trainable parameters.
- **target_modules:** The modules (for example, attention blocks) to which the LoRA update matrices are applied.
- **lora_alpha:** The LoRA scaling factor.
- **bias:** Specifies whether the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.
- **modules_to_save:** List of modules apart from the LoRA layers to be set as trainable and saved in the final checkpoint. These typically include the model's custom head that is randomly initialized for the fine-tuning task.
- **layers_to_transform:** List of layers to be transformed by LoRA. If not specified, all layers in target_modules are transformed.
- **layers_pattern:** Pattern to match layer names in target_modules, if layers_to_transform is specified. By default PeftModel looks for common layer patterns (layers, h, blocks, etc.); set it for exotic and custom models.
- **rank_pattern:** The mapping from layer names or regular expressions to ranks that differ from the default rank specified by r.
- **alpha_pattern:** The mapping from layer names or regular expressions to alphas that differ from the default alpha specified by lora_alpha.
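
To illustrate how r and lora_alpha interact, the sketch below computes the effective weight W + (lora_alpha / r) * B @ A for a toy layer in plain Python. The 3x3 matrices and the helper names are hypothetical and chosen only to keep the arithmetic readable; PEFT applies the same idea to the large projection matrices of the real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy frozen weight (identity) and a rank-1 update; numbers are illustrative.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[1.0, 2.0, 3.0]]        # shape (r, in_features)
lora_b = [[0.5], [0.0], [-0.5]]   # shape (out_features, r)

merged = lora_effective_weight(w, lora_a, lora_b, r=1, lora_alpha=2)
for row in merged:
    print(row)
```

Only lora_a and lora_b would be trained (6 values here instead of 9), while w stays frozen. With the configuration used in this notebook (r=16, lora_alpha=64) the scale factor is 64 / 16 = 4.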

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # LoRA attention dimension (rank of the update matrices)
    lora_alpha=64,             # LoRA scaling parameter
    # target_modules=["query_key_value"],
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],       # Llama 2 attention projections that receive LoRA adapters; module names differ between model architectures
    # target_modules = ["q_proj","v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
print_trainable_parameters(model)



trainable params: 16777216 || all params: 3517190144 || trainable%: 0.477006226934315
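
Both numbers in this output can be reproduced with a little arithmetic. The sketch below assumes the standard Llama-2-7B shapes, an untied output head, and that bitsandbytes stores 4-bit weights packed two per element, so that numel() reports half the true weight count for quantized layers. The last point is an assumption about the storage layout, which would explain why "all params" is close to 3.5 billion rather than 7 billion.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA pair: A is (r, in_features), B is (out_features, r)."""
    return r * (in_features + out_features)


HIDDEN, INTERMEDIATE, LAYERS, VOCAB, R = 4096, 11008, 32, 32000, 16

# Four adapted projections (q, k, v, o) in each of the 32 decoder layers.
trainable = LAYERS * 4 * lora_param_count(HIDDEN, HIDDEN, R)

linear = LAYERS * (4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE)
packed_linear = linear // 2  # assumption: two 4-bit weights per stored element
other = 2 * VOCAB * HIDDEN + LAYERS * 2 * HIDDEN + HIDDEN
all_params = packed_linear + other + trainable

true_total = linear + other + trainable
print(f"trainable params: {trainable:,}")
print(f"all params as counted by numel(): {all_params:,}")
print(f"trainable share of the true weight count: {100 * trainable / true_total:.2f}%")
```

Under these assumptions the sketch reproduces the printed values of 16777216 and 3517190144. Measured against the unpacked weight count, the trainable share is closer to 0.25% than to the 0.48% shown above.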


In [None]:
our_finetune_model = "Llama-2-7b-chat-finetune-psycology_model"
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

In [None]:

training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=30,             # log the training loss every 30 steps; lower it (e.g. 5 or 10) for more frequent logging
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=1,
    # evaluation_strategy="steps",
    # eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=output_dir,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
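
The arguments above imply an effective batch size, a number of optimizer steps and a learning-rate curve. The sketch below works these out in plain Python. It assumes a single GPU and a step-counting rule that mirrors how the Trainer is generally understood to count steps, so the exact total may differ by one depending on the library version. The `lr_at` helper is illustrative and implements linear warmup followed by a half-cosine decay.

```python
import math

NUM_EXAMPLES = 9846      # rows in the train split
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
EPOCHS = 1
WARMUP_RATIO = 0.05
BASE_LR = 1e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # single GPU assumed
batches_per_epoch = math.ceil(NUM_EXAMPLES / PER_DEVICE_BATCH)
steps_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps} (warmup: {warmup_steps})")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With logging_steps=30, the first logged loss falls almost exactly at the end of the warmup phase, which is worth keeping in mind when reading the first entry of the training log.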





In [None]:
# The dataset has several fields (question, response_j, response_k), so each example is formatted into a single text string:
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Response J: {example['response_j'][i]}\n ### Response K: {example['response_k'][i]}"
        output_texts.append(text)
    # print(output_texts)
    return output_texts
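
To see what the trainer receives, the formatting function can be called on a small batch by hand. The sketch below repeats the function so that it runs on its own and feeds it a hypothetical two-row batch; the toy questions and responses are placeholders, not rows from the dataset.

```python
def formatting_prompts_func(example):
    """Turn a batch (dict of column name -> list) into one training string per row."""
    output_texts = []
    for i in range(len(example["question"])):
        text = (
            f"### Question: {example['question'][i]}\n"
            f" ### Response J: {example['response_j'][i]}\n"
            f" ### Response K: {example['response_k'][i]}"
        )
        output_texts.append(text)
    return output_texts


# Hypothetical batch in the same column layout as the dataset.
toy_batch = {
    "question": ["Q1?", "Q2?"],
    "response_j": ["J1", "J2"],
    "response_k": ["K1", "K2"],
}

texts = formatting_prompts_func(toy_batch)
print(len(texts))
print(texts[0])
```

Each training example therefore contains the question followed by both responses, so the model learns to continue a question with a "Response J" section and then a "Response K" section.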


**SFT/RLHF:** Both methods are used for fine-tuning; this notebook uses SFT (supervised fine-tuning).<br>
**Start fine-tuning the model on the psychology dataset**

In [None]:
# Set supervised Finetuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    # dataset_text_field="text",
    formatting_func=formatting_prompts_func,
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

# Start training the model
trainer.train()

# After training, save the trained LoRA adapter
trainer.model.save_pretrained(our_finetune_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 30 | 1.3708 |
| 60 | 0.7723 |
| 90 | 0.704 |
| 120 | 0.7242 |
| 150 | 0.646 |
| 180 | 0.7119 |
| 210 | 0.657 |
| 240 | 0.6465 |
| 270 | 0.672 |
| 300 | 0.5998 |


In [None]:


%load_ext tensorboard
%tensorboard --logdir results/runs

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)


prompt = "I'm feeling really anxious lately and I don't know why."
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)

print(result[0]['generated_text'])

I'm feeling really anxious lately and I don't know why. hopefully this will help me figure it out.
I'm feeling really anxious lately and I don't know why.
It's possible that you're experiencing anxiety without a specific trigger. This can be caused by a variety of factors, such as genetics, stress, or a past traumatic experience.
It's important to talk to a therapist or counselor about your feelings and explore any underlying causes. They can help you develop coping strategies and work through any underlying issues.
It's also important to practice self-care and engage in activities that make you feel calm and relaxed. This can include exercise, meditation, or spending time in nature.
It's important to remember that anxiety is a common experience and there are many resources available to help you manage it.
I'm feeling really anxious
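
The output above echoes the prompt and starts repeating itself once the answer is finished. One possible mitigation, sketched below, is to prompt with the same template that was used for training and to cut the generation at the next section marker. The `build_prompt` and `extract_response` helpers are hypothetical, the sample generation is made up for illustration, and whether this actually improves answers from this model has not been tested here.

```python
def build_prompt(question):
    """Format a question like the training template, ending where the answer starts."""
    return f"### Question: {question}\n ### Response J:"


def extract_response(generated_text, prompt):
    """Drop the echoed prompt and anything from the next '###' marker onward."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    return completion.split("###", 1)[0].strip()


prompt = build_prompt("I'm feeling really anxious lately and I don't know why.")

# Made-up generation, standing in for result[0]['generated_text'].
fake_generation = prompt + " Try slow breathing.\n ### Response K: Just ignore it."
print(extract_response(fake_generation, prompt))
```

In the notebook, the string returned by build_prompt would be passed to pipe(...) in place of the raw question, and extract_response would be applied to the generated text.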


In [None]:
# The commands below matter because the Google Colab GPU memory is still held by the
# following variables; delete them from the runtime so the GPU can be used in the next step

# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

804

In [None]:
# Reload the model in FP16 and merge it with the LoRA weights
import locale
our_finetune_model_dir = "/content/Llama-2-7b-chat-finetune-psycology_model"
model_id =  "NousResearch/Llama-2-7b-hf"
# model_id =  "TinyPixel/Llama-2-7B-bf16-sharded"


base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage = True,
    # return_dict = True,
    torch_dtype = torch.float16,
    device_map = {"": 0}

)

# The code below must stay in the same cell; if you paste it into another cell you may hit an error
# like: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

#########################################################
merge_model = PeftModel.from_pretrained(base_model,our_finetune_model_dir)
merge_model = merge_model.merge_and_unload()


# Reload the tokenizer so it can be pushed together with the model
tokenizer = AutoTokenizer.from_pretrained(model_id,trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'
#########################################################

# Push the fine-tuned model to the Hugging Face Hub
# 1: you need an access token (write)
# 2: you need a repository to push your fine-tuned model to
locale.getpreferredencoding = lambda: "UTF-8"
!huggingface-cli login
merge_model.push_to_hub("LangChain12/FineTune_psychologyist_chatbot", check_pr=True)
tokenizer.push_to_hub("LangChain12/FineTune_psychologyist_chatbot", check_pr=True)



CommitInfo(commit_url='https://huggingface.co/LangChain12/FineTune_psychologyist_chatbot/commit/f9423876e6a360e0752e591886fb946f89f48d7f', commit_message='Upload tokenizer', commit_description='', oid='f9423876e6a360e0752e591886fb946f89f48d7f', pr_url=None, pr_revision=None, pr_num=None)