# Finetuning Falcon

We will finetune the [Falcon-7b-instruct-sharded](https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded) model. This is a sharded version of [Falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct): the same model with its checkpoint split into smaller files, so it can be loaded more easily in environments with limited memory. [Falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) is a 7B-parameter decoder-only model based on [Falcon-7b](https://huggingface.co/tiiuae/falcon-7b) and finetuned on a mixture of chat/instruct datasets. (Check the links provided for more information.)

As mentioned above, we will finetune Falcon-7b-instruct-sharded for a specific application: beautifying the input. Given an input sequence, the model should return a prettier version of that sequence. Here's an example:
- input: Don't worry, be happy!
- response: Let go of your worries and embrace happiness!

We will train our model on a dataset with only 200 rows. Since we are finetuning a model that has already been instruction-finetuned, we can expect reasonable results even with that little data.

In [2]:
!pip install -qqq bitsandbytes
!pip install -qqq torch
!pip install -qqq transformers
!pip install -qqq peft
!pip install -qqq accelerate
!pip install -qqq datasets
!pip install -qqq loralib
!pip install -qqq einops


First, let's import everything that we need.

In [3]:
import pandas as pd
import torch
import torch.nn as nn
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import transformers
from datasets import load_dataset
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

We need a GPU to run the 4-bit quantization, so let's check that one is available.

In [4]:
if torch.cuda.is_available():
  device = torch.device("cuda:0")
else:
  device = "cpu"

print(device)

cuda:0


## Loading the Falcon model

First, we create a `BitsAndBytesConfig`, which is important for running this model in environments with limited computational resources. We use the `NF4` (4-bit NormalFloat) variant of 4-bit quantization together with `bnb_4bit_use_double_quant`, which applies a second quantization to the quantization constants produced by the first one. For faster training, we use `bfloat16` as the compute dtype. Then we can load the Falcon model and tokenizer.
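To get a feel for why 4-bit loading matters, here is a rough back-of-the-envelope sketch of weight memory at different precisions. It assumes a round 7 billion parameters and the block sizes usually described for this style of double quantization (64 weights per block, 256 constants per second-level block). These are assumptions for illustration, not values read from this notebook, and real usage is higher because some layers stay unquantized and activations need memory too.

```python
# Rough weight-memory estimate; all constants are illustrative assumptions.
N_PARAMS = 7e9   # assumed round parameter count for a "7B" model
GIB = 1024 ** 3

def weight_gib(bits_per_param: float) -> float:
    """Memory needed to store N_PARAMS weights at the given bit width."""
    return N_PARAMS * bits_per_param / 8 / GIB

# Overhead of the quantization constants, expressed in bits per weight.
BLOCK = 64    # assumed number of weights sharing one quantization constant
BLOCK2 = 256  # assumed number of constants per second-level block
single_quant_overhead = 32 / BLOCK                         # one fp32 constant per block
double_quant_overhead = 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # 8-bit constants + fp32 second level

for label, bits in [
    ("float32", 32),
    ("bfloat16", 16),
    ("nf4", 4 + single_quant_overhead),
    ("nf4 + double quant", 4 + double_quant_overhead),
]:
    print(f"{label:>20}: {weight_gib(bits):5.1f} GiB")
```

Under these assumptions the weights shrink from roughly 26 GiB in float32 to under 4 GiB in NF4, with double quantization trimming a little more off the constants' overhead.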



In [5]:
MODEL_NAME = "vilsonrodrigues/falcon-7b-instruct-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
def print_trainable_parameters(model):
  """
  Prints the number of trainable parameters in the model.
  """
  trainable_params = 0
  all_param = 0
  for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
      trainable_params += param.numel()
  print(
      f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
  )

We use the `prepare_model_for_kbit_training` method from `PEFT` to prepare the quantized model for training. We then create a `LoraConfig` and pass it along with the model to `get_peft_model` to create a `PeftModel`. This is much more efficient to train because only the small LoRA adapter weights are updated, as we can see by printing the trainable parameters.

In [7]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [8]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 3613463424 || trainable%: 0.13058363808693696
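The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the architecture values from Falcon-7B's published configuration (hidden size 4544, 71 query heads of dimension 64, a single shared key/value head, 32 decoder layers). These numbers are not stated in this notebook, so treat them as assumptions.

```python
# Reproduce the LoRA parameter count. Architecture values are assumptions
# taken from Falcon-7B's published config, not from this notebook.
HIDDEN = 4544                  # assumed hidden size
N_HEADS = 71                   # assumed number of query heads
HEAD_DIM = HIDDEN // N_HEADS   # 64
N_KV_HEADS = 1                 # assumed multi-query attention: one shared K and V head
N_LAYERS = 32                  # assumed number of decoder layers
R = 16                         # LoRA rank, as in the LoraConfig above

# The fused query_key_value projection maps HIDDEN -> Q + K + V features.
qkv_out = N_HEADS * HEAD_DIM + 2 * N_KV_HEADS * HEAD_DIM

# LoRA adds A (R x in_features) and B (out_features x R) per adapted layer.
lora_params_per_layer = R * HIDDEN + qkv_out * R
trainable = N_LAYERS * lora_params_per_layer

all_params = 3613463424  # "all params" value reported in the output above
print(f"trainable params: {trainable}")
print(f"trainable%: {100 * trainable / all_params}")
```

Under these assumptions the total matches the printed `trainable params` value. The reported `all params` figure is well below 7B, most likely because the 4-bit weights are stored packed, so `numel()` undercounts the quantized layers.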


## Preparing dataset

We don't hold out test or validation sets here, so all of our data is used for training.

In [9]:
data = load_dataset("csv", data_files="messages_final.csv")
data

DatasetDict({
    train: Dataset({
        features: ['input', 'response'],
        num_rows: 200
    })
})

In [10]:
data["train"][0]

{'input': "I don't agree with you",
 'response': 'Regrettably, I must dissent on this matter, as our perspectives diverge gracefully. While I value your insights, my reflections lead me to a different conclusion.'}
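The contents of `messages_final.csv` are not shown, but the features listed above imply a two-column layout. Here is a minimal sketch of building a file in that shape with the standard library, using a hypothetical file name `messages_example.csv` and the two example pairs quoted in this notebook:

```python
import csv
import os
import tempfile

rows = [
    {"input": "Don't worry, be happy!",
     "response": "Let go of your worries and embrace happiness!"},
    {"input": "I don't agree with you",
     "response": "Regrettably, I must dissent on this matter, as our "
                 "perspectives diverge gracefully. While I value your "
                 "insights, my reflections lead me to a different conclusion."},
]

# Hypothetical file name, written to a temp directory for illustration.
path = os.path.join(tempfile.mkdtemp(), "messages_example.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "response"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and the row count.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(list(loaded[0].keys()), len(loaded))
```

A file like this could then be passed to `load_dataset("csv", data_files=...)` in the same way as in the cell above.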

In [11]:
def generate_prompt(data_point):
  return f"""
    {data_point["input"]}
    {data_point["response"]}
    """.strip()

def generate_and_tokenize_prompt(data_point):
  full_prompt = generate_prompt(data_point)
  tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
  return tokenized_full_prompt
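It helps to see exactly what string `generate_prompt` produces. Because the template lines inside the f-string are indented, `.strip()` only removes whitespace at the two ends, so the response line keeps its leading spaces. The sketch below reproduces the function and compares it with a hypothetical alternative, `generate_prompt_flat` (not part of the original notebook), that joins the two fields with a bare newline.

```python
def generate_prompt(data_point):
    # Same template as the notebook cell above.
    return f"""
    {data_point["input"]}
    {data_point["response"]}
    """.strip()

def generate_prompt_flat(data_point):
    # Hypothetical variant: no indentation is carried into the prompt.
    return f'{data_point["input"]}\n{data_point["response"]}'

example = {
    "input": "Don't worry, be happy!",
    "response": "Let go of your worries and embrace happiness!",
}

# repr() makes the newline and any leading spaces visible.
print(repr(generate_prompt(example)))
print(repr(generate_prompt_flat(example)))
```

Whichever format is used for training, the prompts passed to the model at inference time should follow the same layout.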

In [12]:
data = data["train"].shuffle().map(generate_and_tokenize_prompt)

In [13]:
data

Dataset({
    features: ['input', 'response', 'input_ids', 'attention_mask'],
    num_rows: 200
})

## Finetuning the model

Finally, we can train our model. Let's set up the training arguments and start training!

In [14]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir="experiments",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1    | 2.5098        |
| 2    | 2.2129        |
| 3    | 2.3978        |
| 4    | 2.7203        |
| 5    | 3.0851        |
| 6    | 3.2609        |
| 7    | 2.651         |
| 8    | 3.4483        |
| 9    | 2.7218        |
| 10   | 3.4754        |


TrainOutput(global_step=50, training_loss=2.418608965873718, metrics={'train_runtime': 281.8986, 'train_samples_per_second': 0.709, 'train_steps_per_second': 0.177, 'total_flos': 121139974375680.0, 'train_loss': 2.418608965873718, 'epoch': 1.0})
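The `global_step=50` in the output follows from the training arguments: 200 examples, a per-device batch size of 1, and 4 gradient-accumulation steps give 50 optimizer updates in one epoch. The sketch below works through that arithmetic and also approximates the learning-rate curve, assuming a single GPU and the usual linear-warmup-then-cosine shape. The helper `lr_at` and the rounding rule for warmup steps are illustrative assumptions, not the library's implementation.

```python
import math

NUM_ROWS = 200
BATCH_SIZE = 1
GRAD_ACCUM = 4
EPOCHS = 1
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05

# One optimizer update per GRAD_ACCUM micro-batches (single device assumed).
steps_per_epoch = NUM_ROWS // (BATCH_SIZE * GRAD_ACCUM)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # assumed rounding rule

def lr_at(step: int) -> float:
    """Illustrative schedule: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With so few steps, the warmup covers only the first handful of updates, and the learning rate spends most of the run on the decaying part of the curve.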

## Saving the model

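We can easily save and load our new model. Calling `save_pretrained` on a `PeftModel` saves only the LoRA adapter, so to load it we first load the base model and then attach the adapter with `PeftModel.from_pretrained`.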

In [15]:
model.save_pretrained("trained-model")

In [16]:
NEW_MODEL_NAME = "./trained-model/"

config = PeftConfig.from_pretrained(NEW_MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, NEW_MODEL_NAME)

Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Run the finetuned model

Let's test our finetuned model and see how it performs!

In [17]:
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
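To make the `temperature` and `top_p` settings concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy set of logits. It illustrates the idea only and is not the library's code; the function name `nucleus_distribution` is hypothetical. Note that in `transformers` these two settings only take effect when sampling is enabled: if `do_sample` is left at its default of `False`, decoding is greedy and they are ignored.

```python
import math

def nucleus_distribution(logits, temperature=0.7, top_p=0.7):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(nucleus_distribution(toy_logits, temperature=0.7, top_p=0.7))
print(nucleus_distribution(toy_logits, temperature=0.7, top_p=0.9))
```

Lower values of either setting narrow the set of candidate tokens, which makes the output more predictable at the cost of variety.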

In [38]:
%%time

inputs = [
    """
    Please tell me if I can help you with something.
    """.strip(),
    """
    You must be crazy to think that!
    """.strip(),
    """
    I would like to tell you how I feel.
    """.strip(),
    """
    I don't think that is a smart decision.
    """.strip()
]

with torch.inference_mode():
    outputs = []
    for input_text in inputs:
        encoding = tokenizer(input_text, return_tensors="pt").to(device)
        output = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config
        )
        outputs.append(output[0])

for output in outputs:
    decoded_output = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{decoded_output}\n')

Please tell me if I can help you with something.
Can I assist you with anything?

You must be crazy to think that!
I'm sorry, but I don't think that's a rational thought.

I would like to tell you how I feel.
I would like to share my feelings with you.

I don't think that is a smart decision.
I believe that this decision is not a wise choice.

CPU times: user 6.87 s, sys: 0 ns, total: 6.87 s
Wall time: 8.19 s
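As the output shows, the decoded text contains the prompt followed by the model's rewrite. If only the rewrite is wanted, one option is to drop the echoed prompt. A minimal sketch, using a hypothetical helper `extract_response` that is not part of the original notebook:

```python
def extract_response(decoded_output: str, prompt: str) -> str:
    """Return the generated text without the echoed prompt."""
    text = decoded_output.strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()

# Example pair taken from the output above.
prompt = "Please tell me if I can help you with something."
decoded = prompt + "\nCan I assist you with anything?"

print(extract_response(decoded, prompt))
```

An alternative would be to slice the generated token ids at the prompt length before decoding, which avoids relying on string matching.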


The results are pretty good, considering we only used 200 rows of data for finetuning! It shows how capable the Falcon model is.

It's really great to be able to run a model like this on low-end hardware. If you want to learn more about making LLMs accessible, [this article](https://huggingface.co/blog/4bit-transformers-bitsandbytes) can help you.