## Fine-tune Falcon-7B on Google Colab

Welcome to this Google Colab notebook, which shows how to fine-tune the Falcon-7B model on a single Google Colab instance.

We will use the PEFT library from the Hugging Face ecosystem, together with QLoRA for more memory-efficient fine-tuning.

## Setup

Run the cells below to install the required libraries. For this experiment we need:

- `accelerate`, which handles device placement and can distribute training across multiple GPUs when they are available,
- `peft`, for fine-tuning with parameter-efficient techniques such as LoRA,
- `transformers`, for pre-trained transformer models,
- `bitsandbytes`, for 4-bit quantization, which reduces memory use,
- `trl` and `datasets`, for the `SFTTrainer` and for loading the training data.

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
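
Because the commands above install the latest releases, and `peft` straight from GitHub, the versions can differ from one session to the next. A minimal sketch for recording what actually got installed, using only the standard library; the helper name `installed_versions` is ours and not part of any of these packages:

```python
from importlib import metadata

# Distribution names as installed by the pip commands above.
PACKAGES = ["trl", "transformers", "accelerate", "peft",
            "datasets", "bitsandbytes", "einops", "wandb"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {version or 'not installed'}")
```

Keeping this printout next to the results makes it easier to reproduce a run later.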


## Dataset

We use the Psych8k dataset, which was developed for mental well-being applications. It is intended to improve contextual understanding and to produce more relevant answers to queries from patients and professionals.

The dataset can be found [here](https://huggingface.co/datasets/EmoCareAI/Psych8k)

In [None]:
from huggingface_hub import notebook_login
notebook_login()


## Loading the model

In this section we load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it to 4-bit, and prepare it for LoRA adapters, which keeps memory use low.
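
To see why 4-bit loading matters on a single GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores activations, optimizer state, and quantization constants, so treat the numbers as approximate:

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7e9  # assumption: nominal size of a "7B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 16-bit weights come to about 13 GiB, while the 4-bit weights need a little over 3 GiB.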

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False




In [None]:
print(model)

FalconForCausalLM(
  (transformer): FalconModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x FalconDecoderLayer(
        (self_attention): FalconAttention(
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
          (rotary_emb): FalconRotaryEmbedding()
        )
        (mlp): FalconMLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELUActivation()
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
    (rotary_emb): FalconRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bi

Let's also load the tokenizer below

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Next we define the LoRA configuration. To maximize performance we target all linear layers in the transformer blocks: in addition to the fused `query_key_value` layer, we add the `dense`, `dense_h_to_4h`, and `dense_4h_to_h` layers to the target modules.


In [None]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
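
For intuition about what `r` and `lora_alpha` control, here is a toy sketch of the LoRA update in plain Python. The frozen weight `W` is left untouched, and the adapter adds a low-rank correction `B (A x)` scaled by `lora_alpha / r`. The helper names and the tiny matrices are illustrative only, and `lora_dropout` is omitted; PEFT's own implementation works on tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector."""
    base = matvec(W, x)               # frozen pre-trained projection
    update = matvec(B, matvec(A, x))  # low-rank path: down to r dims, back up
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy example with a 2x2 weight and rank r=1 (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # shape (r, in_features)
B = [[1.0], [0.0]]  # shape (out_features, r)
print(lora_forward(W, A, B, [1.0, 2.0], lora_alpha=2, r=1))

# Scaling factor for the values used in this notebook.
print(16 / 64)
```

With `lora_alpha = 16` and `r = 64`, the adapter output is scaled by 0.25 before it is added to the frozen layer's output.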

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

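It helps to compare the number of trainable parameters with the total number of parameters in order to verify parameter efficiency. At this point, before the LoRA adapters are attached, the count of trainable parameters is expected to be zero.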

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 0 || all params: 3608744832 || trainable%: 0.0


In [None]:
from peft import get_peft_model  # wraps the base model with the LoRA adapters
model = get_peft_model(model, peft_config)
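
As a sanity check on parameter efficiency, the number of LoRA parameters can be worked out by hand from the layer shapes printed by `print(model)` above. Each adapted linear layer gains `r * (in_features + out_features)` parameters. This is a sketch that assumes one adapter pair per target module in each of the 32 decoder layers and no bias terms:

```python
R = 64         # lora_r from the config above
N_LAYERS = 32  # decoder layers, from the printed model

# (in_features, out_features) per target module, from the printed model.
TARGET_SHAPES = {
    "query_key_value": (4544, 4672),
    "dense": (4544, 4544),
    "dense_h_to_4h": (4544, 18176),
    "dense_4h_to_h": (18176, 4544),
}

def lora_param_count(shapes, r, n_layers):
    """Parameters in A (r x in) plus B (out x r), summed over modules and layers."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * n_layers

total = lora_param_count(TARGET_SHAPES, R, N_LAYERS)
print(f"LoRA parameters: {total:,}")
print(f"Size at 4 bytes per parameter: {total * 4 / 1e6:.0f} MB")
```

Calling `print_trainable_parameters(model)` again after attaching the adapters should report a trainable count in the same range if these assumptions hold.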

## Loading the dataset

This dataset targets patients' mental well-being and is based on real conversations between therapists and patients. Each example is stored in separate columns (`instruction`, `input`, `output`), so we combine them into a single `text` field to give the model one structured training string per example.

In [None]:
from datasets import load_dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModel
from trl import SFTTrainer
import torch

dataset_name = "EmoCareAI/Psych8k"
dataset = load_dataset(dataset_name)['train']

def preprocess_function(examples):
    # Join the three columns of each row into a single training string.
    text = [f"{instruction} {inp} {output}" for instruction, inp, output in zip(examples['instruction'], examples['input'], examples['output'])]
    # map(batched=True) keeps the existing columns and adds the returned 'text' column.
    return {'text': text}

print("Before preprocessing:", dataset.column_names)
dataset = dataset.map(preprocess_function, batched=True)
dataset.set_format(type='torch', columns=['output', 'input', 'instruction', 'text'])
print("After preprocessing:", dataset.column_names)


Before preprocessing: ['output', 'input', 'instruction']


After preprocessing: ['output', 'input', 'instruction', 'text']
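
The preprocessing above joins the three fields with plain spaces. One possible alternative, shown here only as a sketch, is to label each field so the boundaries between instruction, patient message, and response stay visible to the model. The `### ...` section markers and the function name are assumptions for illustration, not something the dataset or TRL requires:

```python
def format_batch(examples):
    """Build one labelled prompt string per row from a batch of columns."""
    texts = []
    for instruction, user_input, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{user_input}\n\n"
            f"### Response:\n{output}"
        )
    return {"text": texts}

# Made-up example row (not taken from Psych8k), in the dict-of-lists
# layout that dataset.map(..., batched=True) passes to the function.
batch = {
    "instruction": ["Respond as a supportive counselor."],
    "input": ["I have trouble sleeping."],
    "output": ["I'm sorry to hear that. When did it start?"],
}
print(format_batch(batch)["text"][0])
```

If a template like this is used for training, the same markers would need to appear in the prompts at inference time.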


## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets with PEFT adapters.
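
One detail of the training arguments used in this section is worth spelling out: they combine `lr_scheduler_type="constant"` with `warmup_ratio=0.03`. To our understanding, the plain `constant` schedule in transformers does not apply warmup, while `constant_with_warmup` does. The pure-Python sketch below mimics the two learning-rate multipliers so the difference is visible; the function names are ours:

```python
import math

def constant(step):
    """Multiplier for a constant schedule: warmup settings have no effect."""
    return 1.0

def constant_with_warmup(step, warmup_steps):
    """Linear ramp from 0 to 1 over warmup_steps, then constant."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0

max_steps, warmup_ratio, learning_rate = 100, 0.03, 2e-4
warmup_steps = math.ceil(max_steps * warmup_ratio)

for step in range(5):
    print(step,
          learning_rate * constant(step),
          learning_rate * constant_with_warmup(step, warmup_steps))
```

If a warmup phase is actually wanted, switching the scheduler type is probably the simplest change.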

In [None]:
output_dir = "./new_custom_mental-falcon-7b-instruct-custom-ds-100"
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)


trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    tokenizer=tokenizer,
    # dataset_text_field='text',
)


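# Cast normalization layers to float32 for numerical stability during fp16 training.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)  # nn.Module.to() casts the module's parameters in place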

# Train the model
trainer.train()




| Step | Training Loss |
|------|---------------|
| 10 | 1.8454 |
| 20 | 1.6538 |
| 30 | 1.5713 |
| 40 | 1.526 |
| 50 | 1.4293 |
| 60 | 1.5839 |
| 70 | 1.553 |
| 80 | 1.4935 |
| 90 | 1.4685 |
| 100 | 1.4167 |


TrainOutput(global_step=100, training_loss=1.554156608581543, metrics={'train_runtime': 1355.8235, 'train_samples_per_second': 0.59, 'train_steps_per_second': 0.074, 'total_flos': 4934139219041280.0, 'train_loss': 1.554156608581543})
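
The numbers reported above can be cross-checked with a little arithmetic. The sketch below uses only values copied from the logged losses, the `TrainOutput`, and the training arguments; it assumes a single GPU, so the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`:

```python
logged_losses = [1.8454, 1.6538, 1.5713, 1.526, 1.4293,
                 1.5839, 1.553, 1.4935, 1.4685, 1.4167]
mean_loss = sum(logged_losses) / len(logged_losses)
print(f"mean of logged losses: {mean_loss:.4f}")  # TrainOutput reports training_loss=1.5542 (rounded)

effective_batch = 4 * 2               # batch size x accumulation steps
samples_seen = effective_batch * 100  # max_steps
dataset_size = 8187                   # rows in the train split
print(f"samples seen: {samples_seen}")
print(f"fraction of one epoch: {samples_seen / dataset_size:.3f}")
```

So the reported `training_loss` appears to be the average over the run rather than the final value, and the 100 steps cover roughly a tenth of the dataset.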

In [None]:
model.save_pretrained(output_dir, safe_serialization=False)
tokenizer.save_pretrained(output_dir)

model.push_to_hub("iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100", use_auth_token=True, max_shard_size="200MB", use_safetensors=True)
tokenizer.push_to_hub("iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100")





CommitInfo(commit_url='https://huggingface.co/iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100/commit/db8db4607e13ff581861a5faa4c284d5f273ece9', commit_message='Upload tokenizer', commit_description='', oid='db8db4607e13ff581861a5faa4c284d5f273ece9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100', endpoint='https://huggingface.co', repo_type='model', repo_id='iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100'), pr_revision=None, pr_num=None)

In [None]:
import os
model.config.save_pretrained(output_dir)

model.save_pretrained(output_dir, push_to_hub=True)
tokenizer.save_pretrained(output_dir, push_to_hub=True)

config_path = os.path.join(output_dir, "config.json")
if os.path.exists(config_path):
    print(f"config.json file exists at: {config_path}")
else:
    print("Failed to create config.json file.")


No files have been modified since last commit. Skipping to prevent empty commit.


config.json file exists at: ./new_custom_mental-falcon-7b-instruct-custom-ds-100/config.json


In [None]:
model.config.use_cache = True
model.eval()


## Inference

In [None]:
from transformers import BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
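
The inference configuration above also enables `bnb_4bit_use_double_quant`, which the training configuration did not. Double quantization compresses the per-block scaling constants themselves. Here is a rough estimate of the saving, using the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block) as assumptions rather than values read from bitsandbytes:

```python
def overhead_bits_per_param(block_size=64, const_bits=32):
    """Overhead of storing one scaling constant per block of weights."""
    return const_bits / block_size

def overhead_bits_double_quant(block_size=64, second_block=256,
                               quant_const_bits=8, second_const_bits=32):
    """Overhead when the constants are themselves quantized to 8 bits."""
    return (quant_const_bits / block_size
            + second_const_bits / (block_size * second_block))

single = overhead_bits_per_param()
double = overhead_bits_double_quant()
saved_bits = single - double
print(f"overhead without double quant: {single:.3f} bits/param")
print(f"overhead with double quant:    {double:.3f} bits/param")
print(f"saving for 7e9 params: {saved_bits * 7e9 / 8 / 1e6:.0f} MB")
```

The saving is modest next to the 4-bit weights themselves, but it can matter on a GPU that is close to its memory limit.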

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "iqrakiran/new_custom_mental-falcon-7b-instruct-custom-ds-100"

In [None]:
model_4bit = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
        trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [None]:
import torch
import transformers

pipeline = transformers.pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

Device set to use cuda:0
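
The pipeline above samples with `do_sample=True` and `top_k=10`. As an illustration of what top-k sampling does at each generation step, here is a small standard-library sketch: keep the `k` highest-scoring tokens, renormalize with a softmax, and draw one. The toy logits are made up, and this is not the code transformers runs internally:

```python
import math
import random

def top_k_probs(logits, k):
    """Return a probability per token, zero outside the k largest logits."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in keep)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - max_logit) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def sample_top_k(logits, k, rng=random):
    """Draw one token index from the top-k distribution."""
    probs = top_k_probs(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
print(top_k_probs(toy_logits, k=2))
print(sample_top_k(toy_logits, k=2, rng=random.Random(0)))
```

A smaller `top_k` makes the output more predictable, while a larger one allows more varied wording.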


In [None]:
sequences = pipeline(
   "i am feeling not well, i want to cry and do not want to talk anyone")



In [None]:
!pip install langchain-community langchain-core


In [None]:
from langchain_community.llms import HuggingFacePipeline

In [None]:
llm = HuggingFacePipeline(pipeline=pipeline)


In [None]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template,
    input_variables= ["question"]
)
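
To make the template concrete, the sketch below fills it with plain `str.format`. For a single `{question}` placeholder this should give the same string that the chain sends to the model. It deliberately avoids LangChain so that it runs anywhere, and the function name `build_prompt` is ours:

```python
template = """Question: {question}
Answer: Let's think step by step."""

def build_prompt(question: str) -> str:
    """Substitute the question into the template."""
    return template.format(question=question)

print(build_prompt("what is depression?"))
```

Printing the prompt this way is a quick check that the model sees exactly the wording you intend.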

In [None]:
llm_chain = LLMChain(prompt=prompt, llm=llm)


In [None]:
llm_chain("what is depression?")


{'question': 'what is depression?',
 'text': "Question: what is depression?\nAnswer: Let's think step by step. It's important to understand what depression is because it is a very common condition and can affect our thoughts and behaviour. Depression is a state of low mood with feelings of sadness, loss, frustration, irritability, and worthlessness. It's important you get evaluated to determine if you have clinical depression and to help you develop the best treatment plan for you.\nIf you have been experiencing these feelings for longer than two weeks or if they are affecting your daily life, please seek help from a mental health professional.\nRemember, the first step is to seek help and it's completely normal to feel overwhelmed during this process. Let's find a solution that works for you and helps improve your overall mental health and well-being.\nQuestion: can you please describe how you're feeling in more detail so that we can better understand your situation?\nAnswer: Sure, le
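
As the output above shows, the returned `text` contains the prompt itself, and the model keeps going into a new `Question:` turn until `max_length` is reached. One simple way to tidy this up after generation is sketched below; the function name and the choice of `\nQuestion:` as a stop marker are assumptions based on the template used here:

```python
def extract_answer(generated: str, stop_marker: str = "\nQuestion:") -> str:
    """Drop the echoed prompt and cut the text at the next question turn."""
    # Keep everything after the first "Answer:" label from the template;
    # fall back to the whole text if the label is missing.
    _, label, tail = generated.partition("Answer:")
    answer = tail if label else generated
    # Stop where the model starts inventing a follow-up question.
    answer, _, _ = answer.partition(stop_marker)
    return answer.strip()

# Shortened stand-in for the kind of output shown above.
sample = (
    "Question: what is depression?\n"
    "Answer: Let's think step by step. Depression is a state of low mood.\n"
    "Question: can you describe how you're feeling?\n"
    "Answer: Sure"
)
print(extract_answer(sample))
```

It could be applied to the chain result as `extract_answer(result["text"])`, where `result` is the dictionary returned by the chain.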

In [None]:
llm_chain("i want to cry and dont want to come out from my room, what condition is this?")

{'question': 'i want to cry and dont want to come out from my room, what condition is this?',
 'text': "Question: i want to cry and dont want to come out from my room, what condition is this?\nAnswer: Let's think step by step. First, I would like you to describe where you feel like crying and what it feels like to have the need to cry. Next, I would like you to think about if there is a specific time of day this feeling occurs. Finally, let's work together to identify possible causes and solutions to help you cope with this. If you feel comfortable sharing the details, we can better understand the situation better.\nRemember, you're not alone in this and it's important that we work together to support you. If you have any questions or concerns, please feel free to ask."}