## <b><span style='color:#9146ff'>|</span> Introduction </b>

Welcome to this notebook on fine-tuning the Meta LLaMA-3 model on a WAZZUF dataset 🎉

In this notebook, you will learn how to:

* Set up the environment and install necessary dependencies.
* Prepare and preprocess the Arabic dataset for model training.
* Configure and fine-tune the Meta LLaMA-3 model.
* Quantize the model for efficiency.
* Use Parameter-Efficient Fine-Tuning (PEFT) with LoRA.
* Utilize the SFT Trainer for fine-tuning.
* Choose appropriate hyperparameters for training.
* Test the performance of the fine-tuned model.

Note: You can adapt this notebook to any other question-answering instruction dataset for chatbots.

![Llama-3](https://pc-tablet.co.in/wp-content/uploads/2024/04/Llama-3.webp)


## <b>1 <span style='color:#9146ff'>|</span> Installation and Login </b>

In [1]:
import os
from huggingface_hub import login

# Read the access token from the HF_TOKEN environment variable instead of hard-coding it in the notebook
login(token=os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
%pip install \
    evaluate \
    rouge_score \
    loralib \
    accelerate \
    bitsandbytes \
    trl \
    peft \
    -U --quiet


In [4]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import pandas as pd
import re
import numpy as np
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.pipeline import Pipeline
import evaluate

In [None]:
# !pip install -q -U git+https://github.com/huggingface/peft.git

In [5]:
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import transformers


In [6]:
# import transformers

# torch.backends.cuda.enable_mem_efficient_sdp(False)
# torch.backends.cuda.enable_flash_sdp(False)

## <b>2 <span style='color:#9146ff'>|</span> Model Configuration and Quantization </b>

* Load the model with `AutoModelForCausalLM` and its tokenizer with `AutoTokenizer`, both from the Hugging Face `transformers` library.
* Apply quantization to reduce the model's size and memory usage. This compression technique is what makes it practical to run large models on hardware with limited memory.

**Detailed Code Explanation :**
- `AutoTokenizer`: This class loads a pre-trained tokenizer from the Hugging Face model hub.
- `from_pretrained`: This method loads the tokenizer for the "meta-llama/Meta-Llama-3-8B-Instruct" model. The tokenizer is responsible for converting text into tokens that the model can process.
- `getattr`: This function dynamically gets an attribute from the `torch` module. Here it retrieves `torch.float16`, so computations use 16-bit floating point precision. This is typically used to reduce memory usage and increase computation speed.
- `BitsAndBytesConfig`: This class is used to configure the quantization parameters.
> - `load_in_4bit=True`: Loads the model weights with 4-bit quantization. Each weight is stored in 4 bits instead of the usual 16 or 32, which cuts weight memory to roughly a quarter of 16-bit storage at the cost of some precision.
> - `bnb_4bit_quant_type="nf4"`: Specifies the quantization type. "nf4" (4-bit NormalFloat) is a data type introduced with QLoRA and designed for weights that are approximately normally distributed.
> - `bnb_4bit_compute_dtype=compute_dtype`: Sets the computation data type to `torch.float16`. The weights are stored in 4 bits, but they are dequantized and the computations are performed in 16-bit floating point precision.
> - `bnb_4bit_use_double_quant=True`: Enables double quantization, which also quantizes the quantization constants to further reduce the memory footprint.
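
To make the memory savings concrete, here is a back-of-the-envelope sketch of the weight memory at different precisions. It assumes a round 8 billion parameters, and it assumes the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block). The values actually used by `bitsandbytes` in your installed version may differ, so treat the numbers as estimates.

```python
# Rough weight-memory estimate for an 8B-parameter model (assumed round figure).
N_PARAMS = 8_000_000_000


def weight_gigabytes(bits_per_param: float, n_params: int = N_PARAMS) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Overhead of the per-block quantization constants, in bits per parameter.
BLOCK = 64         # weights per quantization block (assumption, per the QLoRA paper)
SUPER_BLOCK = 256  # constants per second-level block (assumption, per the QLoRA paper)
overhead_single = 32 / BLOCK                              # one fp32 constant per block
overhead_double = 8 / BLOCK + 32 / (BLOCK * SUPER_BLOCK)  # 8-bit constants + fp32 second level

for name, bits in [
    ("float32", 32),
    ("float16", 16),
    ("nf4", 4 + overhead_single),
    ("nf4 + double quant", 4 + overhead_double),
]:
    print(f"{name:>20}: {weight_gigabytes(bits):6.2f} GB")
```

Under these assumptions, double quantization lowers the overhead of the constants from 0.5 to about 0.127 bits per parameter. Activations, optimizer states, and the KV cache are not included in this estimate.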

In [8]:
from transformers import AutoTokenizer

# model_id = "microsoft/Phi-3-mini-128k-instruct"
# model_id = "google/gemma-2-9b-it"
# model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
# Use the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [10]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 1050939392
all model parameters: 4540600320
percentage of trainable model parameters: 23.15%
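
The totals above may look surprising for an "8B" model. The sketch below reproduces them by arithmetic, assuming the published Llama-3-8B shapes (vocabulary 128,256; hidden size 4096; MLP size 14,336; 32 layers; key/value projection width 1024) and assuming that `bitsandbytes` packs two 4-bit values into each stored byte, so `numel()` reports half the logical count for quantized layers. These shapes and the packing behaviour are assumptions, not values read from the loaded model.

```python
# Llama-3-8B shapes (assumed from the published model configuration).
VOCAB, HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 128_256, 4096, 14_336, 32, 1024

# Linear layers inside one transformer block; these are the layers that get quantized.
linear_per_layer = (
    2 * HIDDEN * HIDDEN          # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM        # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
)
quantized_weights = LAYERS * linear_per_layer

# Parameters left unquantized: token embeddings, output head, and RMSNorm weights.
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

# Assumption: two 4-bit values share one stored byte, halving the reported element count.
reported_total = unquantized + quantized_weights // 2
logical_total = unquantized + quantized_weights

print(f"unquantized parameters: {unquantized}")
print(f"reported total:         {reported_total}")
print(f"logical total:          {logical_total}")
```

Under these assumptions the "trainable" figure corresponds to the unquantized embeddings, output head, and norm weights, which still have `requires_grad=True` at this point. The quantized linear layers are frozen and are counted at half their logical size.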


## <b>3 <span style='color:#9146ff'>|</span> Data Preparation </b>


In [26]:
import pandas as pd

df = pd.read_csv('/content/sampled_jobs.csv')

In [27]:
import re

def remove_html_tags(text):
    # Define a regex pattern to match HTML tags
    clean = re.compile('<.*?>')
    # Substitute the matched HTML tags with an empty string
    text = re.sub(clean, '', text)
    text = text.replace('\n', ' ')
    return text

df['description'] = df['description'].apply(remove_html_tags)
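
The regex above removes tags but leaves HTML entities such as `&nbsp;` in place, and replacing tags with an empty string can glue neighbouring words together (both effects are visible in the sample rows printed in this notebook). A minimal alternative sketch, using only the standard library, is shown here. The helper name `clean_html` is hypothetical and is not used elsewhere in the notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_html(text: str) -> str:
    """Strip tags, decode entities, and collapse whitespace (illustrative helper)."""
    text = TAG_PATTERN.sub(" ", text)  # a space keeps adjacent blocks from being glued together
    text = html.unescape(text)         # "&nbsp;" becomes a non-breaking space, "&amp;" becomes "&"
    return " ".join(text.split())      # split() also treats the non-breaking space as whitespace


sample = "<p>Science Teacher - August 2024</p><h2>Job Description</h2>We need&nbsp;you"
print(clean_html(sample))
```

If you adopt something like this, apply it with `df['description'].apply(clean_html)` in place of `remove_html_tags`.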

In [28]:
df.drop('requirements', axis=1,inplace = True)

index_drop = df[df['description'] == ''].index
df.drop(index_drop, axis= 0, inplace=True)
# concatenate job_title with career_level
df['job_title'] = df['job_title'] + ': ' + df['career_level']
# Concatenate "Give me description about " with job_title
df['job_title'] = "Give me description about " + df['job_title']
df.drop(['career_level'], axis=1, inplace=True)

In [29]:
df.head()

Unnamed: 0,job_title,description
0,Give me description about Machine Learning Eng...,Siemens Digital Industries Software is a globa...
1,Give me description about Head OF Design: Seni...,We are looking to hire an exceptional head of ...
2,Give me description about Science Teacher - Au...,Science Teacher - August 2024Job DescriptionWe...
3,Give me description about English Customer Ser...,Etisalat Global Services is hiring English spe...
4,Give me description about Senior Software Engi...,Crossover is the world's #1 source of full-tim...


## <b>4 <span style='color:#9146ff'>|</span> Data Preprocessing </b>


**Detailed Code Explanation :**

- `tokenizer(question, ...)`: This uses the tokenizer to convert the question string into token IDs.
- `padding="max_length"`: Pads the sequences to the maximum length specified by `max_length`.
- `truncation=True`: Truncates the sequences if they exceed the `max_length`.
- `max_length`: Specifies the maximum length of the tokenized sequence.
- `return_tensors="pt"`: Returns the tokenized sequences as PyTorch tensors.
- `input_ids[0]`: Retrieves the token IDs from the tensor and assigns them to `row['input_ids']`.
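
As a toy illustration of what `truncation=True` and `padding="max_length"` do to a sequence of token IDs, consider the sketch below. It is not the tokenizer's implementation. The pad ID 128009 is an assumption taken from the end-of-sequence ID reported in this notebook's generation log, and right-side padding matches the `padding_side = "right"` setting used earlier.

```python
PAD_ID = 128009  # assumed pad id (the eos id reported in this notebook's generation log)


def pad_or_truncate(token_ids: list[int], max_length: int, pad_id: int = PAD_ID) -> list[int]:
    """Toy version of truncation=True followed by right-side padding to max_length."""
    clipped = token_ids[:max_length]                          # drop anything beyond max_length
    return clipped + [pad_id] * (max_length - len(clipped))  # fill the remainder with pad ids


short = pad_or_truncate([128000, 36227, 757], max_length=5)
long = pad_or_truncate(list(range(10)), max_length=5)
print(short)  # [128000, 36227, 757, 128009, 128009]
print(long)   # [0, 1, 2, 3, 4]
```

Every row therefore ends up with exactly `max_length` IDs, which is why the `input_ids` and `labels` columns can later be stacked into fixed-size tensors.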

In [30]:
def tokenize_function(row):
    # Tokenize the prompt (job title) and the target (description)
    question = ' '.join(row["job_title"]) if isinstance(row["job_title"], list) else row["job_title"]

    row['input_ids'] = tokenizer(question, padding="max_length", truncation=True, max_length = 128, return_tensors="pt").input_ids[0]

    # The "description" column is already a string, so no conversion is needed
    row['labels'] = tokenizer(row["description"], padding="max_length", truncation=True, max_length = 256, return_tensors="pt").input_ids[0]

    return row


# Tokenize the DataFrame
tokenized_df = df.apply(tokenize_function, axis=1)

In [31]:
# Convert columns to list
tokenized_df['input_ids'] = tokenized_df['input_ids'].apply(lambda x: x.tolist())
tokenized_df['labels'] = tokenized_df['labels'].apply(lambda x: x.tolist())

In [32]:
tokenized_df

Unnamed: 0,job_title,description,input_ids,labels
0,Give me description about Machine Learning Eng...,Siemens Digital Industries Software is a globa...,"[128000, 36227, 757, 4096, 922, 13257, 21579, ...","[128000, 22771, 73837, 14434, 37528, 4476, 374..."
1,Give me description about Head OF Design: Seni...,We are looking to hire an exceptional head of ...,"[128000, 36227, 757, 4096, 922, 11452, 3083, 7...","[128000, 1687, 527, 3411, 311, 18467, 459, 253..."
2,Give me description about Science Teacher - Au...,Science Teacher - August 2024Job DescriptionWe...,"[128000, 36227, 757, 4096, 922, 10170, 30169, ...","[128000, 36500, 30169, 482, 6287, 220, 2366, 1..."
3,Give me description about English Customer Ser...,Etisalat Global Services is hiring English spe...,"[128000, 36227, 757, 4096, 922, 6498, 12557, 5...","[128000, 32960, 285, 121754, 8121, 8471, 374, ..."
4,Give me description about Senior Software Engi...,Crossover is the world's #1 source of full-tim...,"[128000, 36227, 757, 4096, 922, 19903, 4476, 2...","[128000, 34, 38272, 374, 279, 1917, 596, 674, ..."
...,...,...,...,...
995,Give me description about Advance NDT Inspecto...,Minimum Diploma/Degree in&nbsp;Engineering PC...,"[128000, 36227, 757, 4096, 922, 47396, 452, 10...","[128000, 32025, 77131, 15302, 27874, 304, 2829..."
996,Give me description about Senior Service Sales...,Senior Service Sales Engineer page is loaded S...,"[128000, 36227, 757, 4096, 922, 19903, 5475, 1...","[128000, 48195, 5475, 16207, 29483, 2199, 374,..."
997,Give me description about Junior Marketing Ana...,Role Overview:Job DescriptionWhat you will be ...,"[128000, 36227, 757, 4096, 922, 31870, 18729, ...","[128000, 9207, 35907, 25, 12524, 7817, 3923, 4..."
998,Give me description about Assistant Outlet Man...,Organization- Hyatt Regency Al KoutSummaryYou ...,"[128000, 36227, 757, 4096, 922, 22103, 76749, ...","[128000, 42674, 12, 10320, 1617, 3263, 2301, 1..."


In [33]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()

In [35]:
from datasets import Dataset

# Assuming `tokenized_df` is your pandas DataFrame
dataset = Dataset.from_pandas(tokenized_df[:1000])

In [36]:
dataset

Dataset({
    features: ['job_title', 'description', 'input_ids', 'labels', '__index_level_0__'],
    num_rows: 994
})

In [37]:
tokenized_datasets = dataset.map(tokenize_function)# batched=True, # batch_size=...
tokenized_datasets = tokenized_datasets.remove_columns(['job_title', 'description'])

Map:   0%|          | 0/994 [00:00<?, ? examples/s]

## <b>5 <span style='color:#9146ff'>|</span> Model Training and Fine-tuning </b>

### LoRA (Low-Rank Adaptation) :
LoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique that freezes the pretrained weights and adds small trainable low-rank matrices alongside them, so only a tiny fraction of the parameters is updated.

![LoRa](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/133_trl_peft/step2.png)
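
The idea can be sketched in a few lines of plain Python. For a frozen weight matrix `W`, LoRA learns two small matrices `A` (shape `r x d_in`) and `B` (shape `d_out x r`) and computes `h = W x + (alpha / r) * B (A x)`. The toy dimensions and initial values below are made up for illustration. Real implementations such as PEFT operate on tensors and differ in detail.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


d_out, d_in, r, alpha = 3, 4, 2, 16
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen pretrained weight (toy values)
A = [[0.01 * (i + 1)] * d_in for i in range(r)]                   # r x d_in, small non-zero init
B = [[0.0] * r for _ in range(d_out)]                             # d_out x r, zero init

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True: zero-initialised B changes nothing yet
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the low-rank correction.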


In [38]:
# Load LoRA configuration
peft_args = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM",
)

### Training Arguments :

**Parameter Explanations**

1. `output_dir="./results"`:
Directory where the model checkpoints and other outputs will be saved.

2. `num_train_epochs=1`:
Number of epochs to train the model. An epoch is one full pass through the training dataset.

3. `per_device_train_batch_size=2`:
Batch size per GPU/TPU core/CPU for training. This means that each device will process 2 samples per forward/backward pass.

4. `gradient_accumulation_steps=1`:
Number of forward/backward passes whose gradients are accumulated before each optimizer update. Values above 1 increase the effective batch size without using more memory per step; with 1, every batch triggers an update.

5. `optim="paged_adamw_32bit"`:
Specifies the optimizer to use. `paged_adamw_32bit` is the bitsandbytes AdamW variant that keeps optimizer states in 32-bit precision and pages them to CPU memory when the GPU runs short, which helps avoid out-of-memory errors.

6. `save_steps=100`:
Number of steps between model checkpoint saves. The model will be saved every 100 steps.

7. `logging_steps=100`:
Number of steps between logging outputs. Training progress will be logged every 100 steps.

8. `learning_rate=2e-5`:
Initial learning rate for the optimizer. This controls how much to adjust the model weights with respect to the loss gradient.

9. `weight_decay=0.001`:
Weight decay applied to the model parameters (AdamW applies it decoupled from the gradient update). Helps prevent overfitting by penalizing large weights.

10. `fp16=True`:
Enable 16-bit (half-precision) training to reduce memory usage and speed up training.

11. `bf16=False`:
Disable bfloat16 training. Bfloat16 is another 16-bit format with a wider exponent range than float16; it is supported on TPUs and on recent GPUs (NVIDIA Ampere and later).

12. `max_grad_norm=0.3`:
Maximum norm for gradient clipping. This helps prevent exploding gradients by scaling gradients that exceed this norm.

13. `warmup_ratio=0.03`:
Ratio of total training steps used for linear learning rate warmup. This gradually increases the learning rate from 0 to the initial learning rate over the first 3% of the training steps.

14. `group_by_length=True`:
Whether to group sequences of roughly the same length into the same batch. This minimizes the amount of padding and makes training more efficient.

15. `lr_scheduler_type="cosine"`:
Type of learning rate scheduler to use. "cosine" refers to a cosine annealing schedule, which gradually decreases the learning rate following a cosine curve.

16. `report_to="tensorboard"`:
Specifies where to report training metrics. "tensorboard" will log metrics to TensorBoard, a visualization tool for monitoring training.
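
To see how several of these settings interact, the sketch below works out the number of optimizer steps and the learning rate at a given step. It assumes the 994-row dataset built earlier, a single device, and a linear-warmup-then-cosine shape. It mirrors the general behaviour of the "cosine" scheduler rather than reproducing the library's code.

```python
import math

NUM_SAMPLES = 994    # rows in the training dataset built earlier
BATCH_SIZE = 2       # per_device_train_batch_size (single device assumed)
GRAD_ACCUM = 1       # gradient_accumulation_steps
EPOCHS = 1           # num_train_epochs
BASE_LR = 2e-5       # learning_rate
WARMUP_RATIO = 0.03  # warmup_ratio

steps_per_epoch = math.ceil(NUM_SAMPLES / (BATCH_SIZE * GRAD_ACCUM))
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)  # 497 15
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step):.2e}")
```

With these values the run consists of 497 optimizer steps, of which the first 15 are warmup. Since `save_steps` and `logging_steps` are both 100, that gives four checkpoints and four logged loss values.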

In [39]:
# Set training parameters
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    # per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
#     evaluation_strategy="epoch",
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=100,
    learning_rate=2e-5,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
#     max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard"
)

In [None]:
# # Set up training arguments
# training_args = TrainingArguments(
#     output_dir="./results",
#     num_train_epochs=3,
#     per_device_train_batch_size=16,  # Adjust according to your device and global batch size
#     gradient_accumulation_steps=2,  # Adjust according to your device and global batch size
#     logging_dir='./logs',
#     logging_steps=10,
#     evaluation_strategy="steps",
#     save_steps=10,
#     # save_total_limit=2,
#     learning_rate=2e-5,
#     lr_scheduler_type="cosine",
#     warmup_ratio=0.1,
#     fp16=True,  # Use bf16 if your hardware supports it
#     optim="adamw_torch_fused",  # Use "adamw_torch_fused" for speedup
#     report_to="tensorboard"
# )

In [40]:
from peft import get_peft_model, TaskType

peft_model = get_peft_model(model,
                            peft_args)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3407872
all model parameters: 4544008192
percentage of trainable model parameters: 0.07%
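
The trainable count above can be checked by hand. The sketch assumes that, with no `target_modules` given, PEFT attaches adapters to the `q_proj` and `v_proj` layers of a Llama model, and it assumes the Llama-3-8B shapes (hidden size 4096, key/value projection width 1024, 32 layers). Both are assumptions, not values read from the model object.

```python
HIDDEN, KV_DIM, LAYERS = 4096, 1024, 32  # assumed Llama-3-8B shapes
RANK = 8                                 # r in the LoraConfig


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """One adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)  # q_proj + v_proj
trainable = LAYERS * per_layer
print(trainable)  # 3407872
```

This matches the reported 3,407,872 trainable parameters, which is also the difference between the "all model parameters" totals before and after `get_peft_model`.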


### SFTTrainer:

- Supervised Fine-tuning (SFT): `SFTTrainer` from TRL is built for fine-tuning pre-trained language models on supervised (prompt and response) data, including small datasets.
- Simpler interface: It wraps the `transformers` `Trainer` and handles dataset formatting and tokenization, so there is less to configure.
- Efficient memory usage: Integrates with PEFT (e.g. LoRA) and supports sequence packing to reduce memory consumption during training.
- Faster training: Packing and PEFT can shorten training time compared with a plain `Trainer` setup; the effect on accuracy depends on the task and data.

In [41]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
#     eval_dataset=test_dataset,
    peft_config=peft_args,
    dataset_text_field="text",
#     max_seq_length=256,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


In [None]:
# # import torch_optimizer as optim
# from transformers import AdamW
# from transformers.optimization import get_cosine_schedule_with_warmup

# # trainer.args.fsdp = "full_shard auto_wrap"  # Configure FSDP if required

# # Initialize optimizer and scheduler
# optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)
# num_training_steps = len(tokenized_datasets) // training_args.per_device_train_batch_size // training_args.gradient_accumulation_steps * training_args.num_train_epochs
# lr_scheduler = get_cosine_schedule_with_warmup(
#     optimizer,
#     num_warmup_steps=int(0.1 * num_training_steps),
#     num_training_steps=num_training_steps,
# )

# # Enable Flash Attention v2
# # flash_attention_v2_enabled = True  # Assume this is integrated in your model/library


### Training

In [42]:
trainer.train()

Step,Training Loss
100,5.5365
200,2.8644
300,2.6292
400,2.4837




TrainOutput(global_step=497, training_loss=3.2089448617977396, metrics={'train_runtime': 382.413, 'train_samples_per_second': 2.599, 'train_steps_per_second': 1.3, 'total_flos': 5731800997429248.0, 'train_loss': 3.2089448617977396, 'epoch': 1.0})

In [None]:
# import gc
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()

### Save model & Publish

In [43]:
trainer.model.save_pretrained("./llama-3-8B-WAZZUF")
tokenizer.save_pretrained("./llama-3-8B-WAZZUF")



('./llama-3-8B-WAZZUF/tokenizer_config.json',
 './llama-3-8B-WAZZUF/special_tokens_map.json',
 './llama-3-8B-WAZZUF/tokenizer.json')

In [None]:
# model.push_to_hub("")
# tokenizer.push_to_hub("")

## <b>6 <span style='color:#9146ff'>|</span> Testing the model performance on a single inference </b>


In [45]:
def single_inference(question):
    messages = [
        {"role": "system", "content": "provide the job seeker with personalized career advice based on their targeted job title."},
    ]

    messages.append({"role": "user", "content": question})


    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = model.generate(
        input_ids,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=False,  # greedy decoding; sampling parameters such as temperature do not apply
    )
    response = outputs[0][input_ids.shape[-1]:]
    output = tokenizer.decode(response, skip_special_tokens=True)
    return output
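
To make `apply_chat_template` less of a black box, here is a hand-written approximation of the prompt string that the Llama 3 Instruct template builds before tokenization. The function name `format_llama3_prompt` is hypothetical, and the layout is reproduced from the documented Llama 3 prompt format. The tokenizer's own template remains the source of truth.

```python
def format_llama3_prompt(messages: list[dict]) -> str:
    """Approximate the Llama 3 Instruct prompt layout, ending with an open assistant turn."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    # add_generation_prompt=True corresponds to appending an assistant header for the model to complete
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = format_llama3_prompt([
    {"role": "system", "content": "You are a career advisor."},
    {"role": "user", "content": "Give me description about machine learning job"},
])
print(example)
```

Generation stops when the model emits `<|eot_id|>`, which is why that token is added to `terminators` in the function above.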

In [46]:
question = """Give me description about machine learning job"""

answer = single_inference(question)

print(f'INPUT QUESTION:\n{question}')
print(f'\n\nModel Answer:\n{answer}')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


INPUT QUESTION:
Give me description about machine learning job


Model Answer:
Here's a description of a Machine Learning Engineer job:

**Job Title:** Machine Learning Engineer

**Job Summary:**

We are seeking a highly skilled Machine Learning Engineer to join our team. As a Machine Learning Engineer, you will be responsible for designing, developing, and deploying machine learning models that drive business growth and improve customer experiences. You will work closely with cross-functional teams to identify business needs, design and implement machine learning solutions, and ensure the scalability and reliability of our models.

**Responsibilities:**

* Design and develop machine learning models using various algorithms and techniques (e.g., supervised and unsupervised learning, deep learning, natural language processing)
* Collaborate with data scientists and engineers to identify business needs and develop solutions that meet those needs
* Develop and maintain large-scale machine

In [None]:
question = """   """

answer = single_inference(question)

print(f'INPUT QUESTION:\n{question}')
print(f'\n\nModel Answer:\n{answer}')