# Fine-tuning the Falcon 7B LLM on Interview QA Data

## Installing the required dependencies

In [None]:
!pip install -Uqqq pip --progress-bar off

!pip install -qqq bitsandbytes==0.42.0 --progress-bar off
!pip install -qqq torch==2.1.2 --progress-bar off
!pip install -qqq -U transformers==4.39.3 --progress-bar off
!pip install -qqq -U peft==0.10.0 --progress-bar off
!pip install -qqq -U accelerate==0.29.3 --progress-bar off
!pip install -qqq datasets==2.18.0 --progress-bar off
!pip install -qqq loralib==0.1.2 --progress-bar off
!pip install -qqq einops==0.7.0 --progress-bar off

## Importing the Libraries

In [1]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import torch
import transformers
from datasets import load_dataset, DatasetDict
from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)

from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.


## Data

In [4]:
from google.colab import files
uploaded = files.upload()

Saving interview_qa_data.json to interview_qa_data (1).json


### Let’s open the JSON file and take a look at the data

In [6]:
with open("interview_qa_data.json") as json_file:
    data = json.load(json_file)

data

[{'input': 'Tell me about yourself.',
  'output': 'Start by mentioning your education, key projects, internships, and future career goals in a concise way.'},
 {'input': 'Why do you want to join our company?',
  'output': "Highlight your skills matching the company mission and your excitement about the company's growth."},
 {'input': 'What are your greatest strengths?',
  'output': 'Talk about 2-3 strengths relevant to the job role, with short examples demonstrating them.'},
 {'input': 'What are your weaknesses?',
  'output': 'Mention a real weakness but focus on how you are actively improving it.'},
 {'input': 'Where do you see yourself in five years?',
  'output': 'Show ambition but tie it to growing with the company and gaining new skills.'},
 {'input': 'Describe a challenging situation you faced and how you handled it.',
  'output': 'Use the STAR method: Situation, Task, Action, Result.'},
 {'input': 'How do you handle pressure and stress?',
  'output': 'Explain that you stay calm,

In [7]:
pprint(data[0], sort_dicts=False)

{'input': 'Tell me about yourself.',
 'output': 'Start by mentioning your education, key projects, internships, and '
           'future career goals in a concise way.'}
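Before training, it can help to sanity-check that every record has the two fields the rest of the notebook relies on. The following is a minimal sketch, assuming each record is a dict with non-empty string `input` and `output` values. The `find_bad_records` helper and the inline sample are hypothetical and not part of the original notebook.

```python
import json


def find_bad_records(records):
    """Return indices of records lacking a non-empty string 'input' or 'output'."""
    bad = []
    for i, rec in enumerate(records):
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(key), str) and rec[key].strip()
            for key in ("input", "output")
        )
        if not ok:
            bad.append(i)
    return bad


# Hypothetical inline sample standing in for interview_qa_data.json
sample = json.loads("""
[
  {"input": "Tell me about yourself.", "output": "Start by mentioning your education."},
  {"input": "What are your weaknesses?", "output": ""},
  {"question": "missing keys"}
]
""")

bad_indices = find_bad_records(sample)
print(bad_indices)
```

In the notebook itself, the same helper could be called on the `data` list loaded above, and any reported indices inspected by hand before building the dataset.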


## Load the Model
### To load the model and tokenizer, we’ll use the AutoModelForCausalLM and AutoTokenizer classes from the 🤗 Transformers library. We’ll also set the pad_token to the eos_token to avoid issues with padding.

In [None]:
MODEL_NAME = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

#### Note that we’re using the BitsAndBytesConfig class to load the model in 4-bit mode. The nf4 quantization type selects the 4-bit NormalFloat data type introduced by QLoRA, and bnb_4bit_compute_dtype=torch.bfloat16 means the 4-bit weights are dequantized to bfloat16 for the actual matrix multiplications. The bnb_4bit_use_double_quant parameter enables double quantization, which also quantizes the quantization constants themselves to save a little more memory per parameter.
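To get a feel for what double quantization saves, here is a rough back-of-the-envelope sketch. It assumes the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block) and a round 7 billion quantized parameters. Both are assumptions for illustration: in practice some layers, such as the embeddings, are not quantized.

```python
PARAMS = 7e9  # approximate parameter count (assumption for illustration)

BLOCK_SIZE = 64            # weights per quantization block (QLoRA paper)
SECOND_LEVEL_BLOCK = 256   # constants per second-level block (QLoRA paper)

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_plain = 32 / BLOCK_SIZE

# With double quantization: one 8-bit constant per block, plus one 32-bit
# constant shared by every 256 first-level blocks.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


def footprint_gb(bits_per_param, n_params=PARAMS):
    """Approximate weight memory in GB for a given number of bits per parameter."""
    return n_params * bits_per_param / 8 / 1e9


print(f"overhead without double quant: {overhead_plain:.3f} bits/param")
print(f"overhead with double quant:    {overhead_double:.3f} bits/param")
print(f"4-bit weights, plain:  {footprint_gb(4 + overhead_plain):.2f} GB")
print(f"4-bit weights, double: {footprint_gb(4 + overhead_double):.2f} GB")
```

Under these assumptions, the saving is roughly 0.37 bits per parameter, which is small per weight but adds up to a few hundred megabytes for a model of this size.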

## Let’s prepare the model for training:

In [9]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


In [10]:
def print_trainable_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())  # Total parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)  # Trainable parameters

    trainable_percent = (trainable_params / total_params) * 100  # Percentage of trainable parameters

    print(f"trainable params: {trainable_params} || all params: {total_params} || trainable%: {trainable_percent}")

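### The gradient_checkpointing_enable method enables gradient checkpointing, a technique that trades compute for memory: instead of storing every intermediate activation for the backward pass, some of them are recomputed when needed. The prepare_model_for_kbit_training function prepares the quantized model for training, for example by freezing the base model weights and casting some layers, such as layer norms, to float32 for numerical stability. The print_trainable_parameters helper defined above reports how many parameters will receive gradients.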

In [11]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 3613463424 || trainable%: 0.13058363808693696


#### The LoraConfig class is used to define the configuration for LoRA, and the following parameters are set:

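- r=16: Specifies the rank of the low-rank update matrices, which controls the number of trainable parameters in the adapted layers.
- lora_alpha=32: Sets the scaling factor for the LoRA update, which is scaled by lora_alpha / r (here 32 / 16 = 2).
- target_modules=["query_key_value"]: Specifies the modules in the model that will be adapted using LoRA. In this case, only the “query_key_value” module will be adapted.
- lora_dropout=0.05: Applies dropout with probability 0.05 to the input of the LoRA layers during training.
- bias="none": Leaves all bias terms frozen.
- task_type="CAUSAL_LM": Specifies the type of task as causal language modeling.
#### After configuring LoRA, the get_peft_model function is called to wrap the model according to the provided configuration. Note that we’re training only about 0.13% of the model’s parameters. The total reported above is roughly 3.6 billion rather than 7 billion because the 4-bit weights are stored packed, two per byte, which halves the element count of the quantized layers.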
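The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the published Falcon-7B architecture values (hidden size 4544, 32 decoder layers, head dimension 64, multi-query attention), which are not stated in this notebook. The `lora_params` helper is hypothetical.

```python
# Assumed Falcon-7B architecture values (not stated in this notebook)
HIDDEN_SIZE = 4544
NUM_LAYERS = 32
HEAD_DIM = 64

# With multi-query attention, the fused projection outputs the full query
# plus a single shared key head and a single shared value head.
QKV_OUT = HIDDEN_SIZE + 2 * HEAD_DIM  # 4672

R = 16
LORA_ALPHA = 32


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * in_features + out_features * r


per_layer = lora_params(HIDDEN_SIZE, QKV_OUT, R)
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_layer, total, scaling)
```

Under these assumptions the total comes to 4,718,592, which matches the `trainable params` value reported by `print_trainable_parameters`.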

## HuggingFace Dataset
### To train the model, we’ll convert our JSON data into a dataset that is compatible with the Transformers trainer. Luckily, HuggingFace provides a load_dataset() function that can be used to load a dataset from a JSON file:

In [12]:
data = load_dataset("json", data_files="interview_qa_data.json")
data

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['output', 'input'],
        num_rows: 503
    })
})

## The next step is to convert each question and answer pair to a prompt and pass it to the tokenizer

In [13]:
def generate_prompt(data_point):
    return f"""
<human>: {data_point["input"]}
<assistant>: {data_point["output"]}
""".strip()


def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data["train"].shuffle().map(generate_and_tokenize_prompt)
data

Map:   0%|          | 0/503 [00:00<?, ? examples/s]

Dataset({
    features: ['output', 'input', 'input_ids', 'attention_mask'],
    num_rows: 503
})

## Training
##### Disclaimer: The training was done on a Tesla T4 GPU (16 GB VRAM) with the High-RAM option turned on in Google Colab. Depending on your hardware, you might be able to increase the batch size.

### Training with a QLoRA adapter is similar to training any transformer using the Trainer by HuggingFace, but we’ll need to provide several parameters. The TrainingArguments class is used to define the training parameters:

In [14]:
OUTPUT_DIR = "OUTPUT"

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=80,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    report_to="tensorboard",
)

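### Because max_steps=80 is set, it takes precedence over num_train_epochs: training runs for 80 optimizer steps, which is about 0.64 of an epoch on this dataset (80 steps × 4 accumulated examples out of 503). We use a cosine learning rate scheduler and a paged 8-bit AdamW optimizer, a memory-saving optimizer variant introduced alongside QLoRA. The report_to argument specifies that we want to log the training metrics to TensorBoard.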
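The run length and learning rate curve implied by these arguments can be worked out with plain arithmetic. The sketch below assumes a single GPU and that max_steps overrides num_train_epochs when both are set. The `lr_at` function is a simplified approximation of a linear-warmup, half-cosine-decay schedule, not the library's own code.

```python
import math

NUM_EXAMPLES = 503      # rows in the training split
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
MAX_STEPS = 80
BASE_LR = 2e-4
WARMUP_RATIO = 0.05

# Each optimizer step consumes batch_size * accumulation_steps examples.
examples_per_step = PER_DEVICE_BATCH * GRAD_ACCUM
examples_seen = examples_per_step * MAX_STEPS
epochs_covered = examples_seen / NUM_EXAMPLES

warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Approximate learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate
        return BASE_LR * step / max(1, warmup_steps)
    # Half-cosine decay from the base learning rate down to 0
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(f"examples seen: {examples_seen}, epochs covered: {epochs_covered:.2f}")
for step in (0, 2, 4, 40, 80):
    print(step, f"{lr_at(step):.2e}")
```

The epoch fraction computed this way agrees with the `'epoch': 0.64` value in the training output.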

## Let’s use the Trainer class to train our model

In [15]:
trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1    | 3.3939        |
| 2    | 2.9406        |
| 3    | 3.2842        |
| 4    | 3.3364        |
| 5    | 3.4238        |
| 6    | 3.3262        |
| 7    | 2.9916        |
| 8    | 3.2645        |
| 9    | 2.7334        |
| 10   | 2.349         |


TrainOutput(global_step=80, training_loss=1.9826859697699546, metrics={'train_runtime': 443.6219, 'train_samples_per_second': 0.721, 'train_steps_per_second': 0.18, 'total_flos': 466727471312640.0, 'train_loss': 1.9826859697699546, 'epoch': 0.64})

## Upload the Trained Model
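### After training, we can save the model in two places. First, we can save it locally using the save_pretrained() method. Because this is a PEFT model, only the LoRA adapter weights and configuration are written, not the full base model: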

In [16]:
model.save_pretrained("trained-model")



### Next, we can upload the model to the HuggingFace Hub using the push_to_hub() method:

In [None]:
notebook_login()

In [None]:
model.push_to_hub(
    "Pranav06/falcon-7b-qlora-interview_qa-support-bot"
)

## Load the Trained Model
### To load the fine-tuned model, we first load the base model with code similar to what we used for the original Falcon 7B model, and then attach the trained adapter to it:

In [None]:
PEFT_MODEL = "Pranav06/falcon-7b-qlora-interview_qa-support-bot"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)

## Evaluation

In [20]:
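generation_config = model.generation_config
generation_config.max_new_tokens = 200
# Note: temperature and top_p only affect sampling. They are ignored unless
# do_sample=True is also set, because generate() decodes greedily by default.
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id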
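To illustrate what the temperature and top_p settings mean when sampling is enabled, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. The `top_p_filter` function and the toy logits are hypothetical, and this is not the library's implementation.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.7):
    """Return {token_index: renormalized probability} for the kept tokens."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(top_p_filter(toy_logits))             # tight nucleus
print(top_p_filter(toy_logits, top_p=0.9))  # wider nucleus
```

A sampler would then draw the next token from the returned distribution. With a low top_p the nucleus can shrink to a single token, which behaves much like greedy decoding.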

In [21]:
DEVICE = "cuda:0"

prompt = f"""
<human>: Why do you want to join our company?
<assistant>:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



<human>: Why do you want to join our company?
<assistant>: Talk about your interest in the company's mission, values, and culture.
"I'm excited to join your company because I believe in your mission to help people live healthier lives. I'm also passionate about your values of integrity, collaboration, and innovation, and I'm eager to learn from your team."
"I'm excited to join your company because I believe in your mission to help people live healthier lives. I'm also passionate about your values of integrity, collaboration, and innovation, and I'm eager to learn from your team."
"I'm excited to join your company because I believe in your mission to help people live healthier lives. I'm also passionate about your values of integrity, collaboration, and innovation, and I'm eager to learn from your team."
"I'm excited to join your company because I believe in your mission to help people live healthier lives. I'm also passionate about your values of integrity,


## Helper function to make generating responses easier

In [None]:
def generate_response(question: str) -> str:
    prompt = f"""
<human>: {question}
<assistant>:
""".strip()
    encoding = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "<assistant>:"
    response_start = response.find(assistant_start)
    response = response[response_start + len(assistant_start) :].strip()
    response_lines = response.split("\n")
    final_response = response_lines[0].strip()

    return final_response
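The post-processing step in this helper can be tried without a GPU. The sketch below isolates the same idea: keep only the first line after the `<assistant>:` marker, which discards the repeated continuation seen in the raw output earlier. The `extract_first_answer` name is hypothetical, the sample string is abbreviated from that output, and the fallback for a missing marker is a choice made for this sketch rather than behavior of the original helper.

```python
def extract_first_answer(decoded: str, marker: str = "<assistant>:") -> str:
    """Return the first line of text that follows the assistant marker."""
    start = decoded.find(marker)
    if start == -1:
        # Marker missing: return the stripped text unchanged (sketch-only choice)
        return decoded.strip()
    answer = decoded[start + len(marker):].strip()
    return answer.split("\n")[0].strip()


# Abbreviated sample of decoded model output
decoded = (
    "<human>: Why do you want to join our company?\n"
    "<assistant>: Talk about your interest in the company's mission, values, and culture.\n"
    "\"I'm excited to join your company because I believe in your mission "
    "to help people live healthier lives.\"\n"
)

print(extract_first_answer(decoded))
```

Keeping only the first line is a simple heuristic: it works here because the training answers are single-line, but it would truncate any legitimately multi-line answer.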

In [25]:
prompt = "What do you think is the most important quality for a successful team?"
print(generate_response(prompt))



Talk about communication, collaboration, and trust.


In [26]:
prompt = "Why do you think you are a good fit for this role?"
print(generate_response(prompt))

Talk about your relevant experience, skills, and values.
