<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/mlflow/qlora/LLama3_2_3B_fine_tuning_QLORA_DORA_customer_service.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Open-Source LLM using QLoRA with MLflow and PEFT
- meta-llama/Llama-3.2-3B-Instruct: The Llama 3.2 collection of multilingual large language models (LLMs) consists of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open-source and closed chat models on common industry benchmarks.
- QLoRA is a method for fine-tuning large foundation models with limited GPU resources. It reduces the number of trainable parameters by learning pairs of low-rank decomposition matrices, and it applies 4-bit quantization to the frozen pretrained model to further reduce the memory footprint.
- PEFT is a Hugging Face library that lets developers apply parameter-efficient fine-tuning methods to pretrained models available on the Hugging Face Hub. With PEFT, you can apply QLoRA to a pretrained model with a few lines of configuration and run fine-tuning just like normal Transformers model training.
- MLflow tracks the rapidly growing number of configurations, assets, and metrics produced during LLM training. It is natively integrated with Transformers and PEFT, and plays a crucial role in organizing the fine-tuning cycle.
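
To make the "rank-decomposition matrices" idea concrete, here is a minimal sketch of the parameter arithmetic behind a single LoRA adapter. The layer dimensions are illustrative assumptions, not values read from the model, and the helper names are hypothetical.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_out * d_in


# Hypothetical square projection layer; the sizes are illustrative only.
d_out, d_in, rank = 3072, 3072, 64

dense = full_params(d_out, d_in)
adapter = lora_trainable_params(d_out, d_in, rank)

print(f"dense weights:   {dense:,}")
print(f"LoRA adapter:    {adapter:,}")
print(f"trainable share: {adapter / dense:.2%}")
```

Only the adapter matrices receive gradients, which is why the optimizer state stays small even though the frozen base weights are much larger.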


# Dataset

### Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset


According to its dataset card, this hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral and OpenELM, and was generated using Bitext's NLP/NLG technology and automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be achieved with Bitext's two-step approach to LLM fine-tuning. For example, if you are [ACME Company], you can create your own customized LLM by first fine-tuning a model on this dataset, and then further fine-tuning it with a small amount of your own data. An overview of this approach can be found in "From General-Purpose LLMs to Verticalized Enterprise Models".

The dataset has the following specs:

- Use Case: Intent Detection
- Vertical: Customer Service
- 27 intents assigned to 10 categories
- 26872 question/answer pairs, around 1000 per intent
- 30 entity/slot types
- 12 different types of language generation tags

In [None]:
%pip install -U transformers -q
%pip install -U datasets  -q
%pip install -U accelerate  -q
%pip install -U peft -q
%pip install -U trl -q
%pip install -U bitsandbytes -q
%pip install mlflow pyngrok -q

In [None]:
from google.colab import userdata
import mlflow
import os

MLFLOW_TRACKING_URI="databricks"
# Specify the workspace hostname and token
DATABRICKS_HOST="https://adb-2467347032368999.19.azuredatabricks.net/"
DATABRICKS_TOKEN=userdata.get('DATABRCKS_TTOKEN')

In [None]:
if "MLFLOW_TRACKING_URI" not in os.environ:
    os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI
if "DATABRICKS_HOST" not in os.environ:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
if "DATABRICKS_TOKEN" not in os.environ:
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN
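
The "set only if missing" pattern in the cell above can also be written with `os.environ.setdefault`. The sketch below uses placeholder values (the host URL is hypothetical) rather than the real workspace settings.

```python
import os

# Placeholder values for illustration; substitute your own workspace settings.
defaults = {
    "MLFLOW_TRACKING_URI": "databricks",
    "DATABRICKS_HOST": "https://example.azuredatabricks.net/",
}

for key, value in defaults.items():
    # setdefault only writes the value when the variable is not already set,
    # so anything exported before the notebook started takes precedence.
    os.environ.setdefault(key, value)

print({key: os.environ[key] for key in defaults})
```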

In [None]:

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

mlflow.set_experiment("/Users/pepe@kk.com/llama3.2_finetuning")

In [None]:
import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import os
import torch
from datasets import load_dataset
from trl import SFTTrainer  # wandb is not imported: metrics are reported to MLflow


# Model
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
# Dataset
https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset

In [None]:
base_model = "meta-llama/Llama-3.2-3B-Instruct"
new_model = "llama-3.2-3b-it-Ecommerce-ChatBot"
dataset_name = "bitext/Bitext-customer-support-llm-chatbot-training-dataset"

In [None]:
# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

In [None]:
attn_implementation
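
The selection logic above depends only on the major CUDA compute capability. As a rough sketch, the same decision can be expressed as a pure function that is easy to check without a GPU; `pick_precision` is a hypothetical helper, and it returns names as strings rather than real `torch` dtypes.

```python
def pick_precision(compute_capability_major: int) -> tuple[str, str]:
    """Return (dtype name, attention implementation) for a CUDA capability."""
    if compute_capability_major >= 8:
        # Ampere and newer GPUs support bfloat16 and FlashAttention-2.
        return "bfloat16", "flash_attention_2"
    # Older GPUs such as the T4 (capability 7.5) fall back to float16.
    return "float16", "eager"


for name, major in [("T4", 7), ("A100", 8), ("H100", 9)]:
    print(name, pick_precision(major))
```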

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# For 8-bit quantization:
# quantization_config = BitsAndBytesConfig(load_in_8bit=True,
#                                          llm_int8_threshold=200.0)

# For 4-bit quantization (NF4 with double quantization)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,  # bfloat16 on Ampere or newer, float16 otherwise
)

# Use the dtype and attention implementation selected in the cell above
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch_dtype,
    attn_implementation=attn_implementation,
)
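
To get a feel for what 4-bit quantization and double quantization save, here is a back-of-the-envelope estimate. The parameter count and block sizes are assumptions chosen for illustration, and the figures ignore layers that stay in higher precision, so treat the result as a rough order of magnitude.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given bit width."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 3.2e9   # assumed round figure for a ~3B model
BLOCK = 64         # assumed quantization block size
BLOCK2 = 256       # assumed block size for the second quantization level

# One fp32 scale per block of 64 weights.
overhead_single = 32 / BLOCK
# Double quantization: 8-bit scales, plus one fp32 scale per 256 of them.
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

for label, bits in [
    ("16-bit", 16),
    ("4-bit", 4 + overhead_single),
    ("4-bit + double quant", 4 + overhead_double),
]:
    print(f"{label:<22} {bits:6.3f} bits/param  {weight_memory_gib(N_PARAMS, bits):5.2f} GiB")
```

Under these assumptions, double quantization trims the per-parameter overhead of the scaling constants from 0.5 bits to roughly 0.127 bits.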

In [None]:
# Load the dataset
dataset_raw = load_dataset(dataset_name, split="train")
dataset = dataset_raw.shuffle(seed=65).select(range(1000))  # Only use 1000 samples for a quick demo
instruction = """You are a top-rated customer service agent named John.
    Be polite to customers and answer all their questions.
    """

def format_chat_template(row):
    # Build a system/user/assistant conversation and render it with the model's chat template
    row_json = [{"role": "system", "content": instruction},
                {"role": "user", "content": row["instruction"]},
                {"role": "assistant", "content": row["response"]}]

    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)
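
For intuition about what `apply_chat_template` writes into the `text` column, the following is a simplified, hand-written rendering of Llama 3 style chat markup. It is an approximation for illustration only: `render_llama3_chat` is a hypothetical helper, and the real template bundled with the tokenizer adds further details (for example the date lines in the system message).

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified rendering of Llama 3 style chat markup (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a top-rated customer service agent named John."},
    {"role": "user", "content": "I want to cancel my order"},
    {"role": "assistant", "content": "Of course, I can help with that."},
]
print(render_llama3_chat(example))
```

For training, the assistant turn is included in the text; for inference, the conversation stops after the user turn and `add_generation_prompt=True` leaves the assistant header open.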

In [None]:
dataset_raw

In [None]:
# Splitting the dataset (optional)

split_data = dataset.train_test_split(test_size=0.2, shuffle=True)


train_dataset = split_data["train"]
test_dataset = split_data["test"]



print(f"Training Set Size: {len(train_dataset)}")
print(f"Evaluation Set Size: {len(test_dataset)}")

In [None]:
import bitsandbytes as bnb

trained_model_id = "Llama-3.2-3B-sft-lora-bitext"
output_dir = '/content/' + trained_model_id


# LoRA Config
https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig

In [None]:
# LoRA configuration with Weight-Decomposed Low-Rank Adaptation (DoRA) enabled
# DoRA paper: https://arxiv.org/pdf/2402.09353
peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        use_dora=True,  # set to False to train with plain LoRA
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
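
The toy example below sketches the weight decomposition described in the DoRA paper: the merged weight is a learned per-column magnitude times the normalized direction `W0 + (alpha / r) * B @ A`. It uses plain Python lists and made-up numbers, and it is not the PEFT implementation, which may differ in tensor layout and other details.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def dora_weight(w0, lora_b, lora_a, magnitude, alpha, r):
    """Merged weight: magnitude * (W0 + s*B@A) / column norm, with s = alpha / r."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    direction = [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]
    norms = column_norms(direction)
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in direction]


# Toy 2x2 frozen weight with a rank-1 adapter; all values are made up.
w0 = [[3.0, 0.0], [4.0, 2.0]]
lora_a = [[0.5, -0.5]]        # r x d_in
lora_b = [[0.0], [0.0]]       # d_out x r, zero-initialised as in LoRA
magnitude = column_norms(w0)  # the magnitude starts at the column norms of W0

merged = dora_weight(w0, lora_b, lora_a, magnitude, alpha=16, r=64)
print("scaling alpha/r:", 16 / 64)
print("merged weight at init:", merged)
```

With `B` initialised to zero the merged weight equals the frozen weight, so training starts from the pretrained model. With `lora_alpha=16` and `r=64` the low-rank update is scaled by 0.25.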

In [None]:
# Hyperparameters
training_args = TrainingArguments(
    output_dir=new_model,
    overwrite_output_dir=True,
    per_device_eval_batch_size=1, # originally set to 8
    per_device_train_batch_size=1, # originally set to 8
    push_to_hub=True,
    hub_model_id=trained_model_id,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="mlflow"
)
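
A quick sanity check of what these settings imply for the training schedule, assuming a single GPU and the 800-sample training split produced earlier. The arithmetic is a sketch of how the step counts relate, not a reproduction of the trainer's internal bookkeeping.

```python
import math

# Values taken from the cells above; a single GPU is assumed.
train_samples = 800              # 80% of the 1000-sample subset
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 1
eval_steps_ratio = 0.2

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# A float below 1 is interpreted as a fraction of the total training steps.
eval_every = math.ceil(total_steps * eval_steps_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"evaluate every:       {eval_every} steps")
```

Because `logging_steps=1`, every optimizer step also produces a logged training-loss point in MLflow.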

In [None]:
trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        dataset_text_field="text",
        processing_class=tokenizer,
        packing=False,
        peft_config=peft_config,
        max_seq_length=tokenizer.model_max_length,
    )

In [None]:
# trainer.processing_class=tokenizer
tokenizer.pad_token = tokenizer.eos_token

In [None]:
from datetime import datetime

# Name the run with a timestamp so repeated runs are easy to tell apart
name = "fine_tuning_" + datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
with mlflow.start_run(run_name=name) as run:
    mlflow.log_params(training_args.__dict__)
    trainer.train()

In [None]:
# Model inference
messages = [{"role": "system", "content": instruction},
    {"role": "user", "content": "I bought the same item twice, cancel order {{Order Number}}"}]


prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt",padding=True, truncation=True).to("cuda")


outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text.split("assistant")[1])
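
Splitting the decoded text on the word "assistant" works for this example but can truncate replies that contain the same word. A possible alternative, sketched here with toy token ids instead of real tensors, is to drop the prompt tokens before decoding; `completion_ids` is a hypothetical helper.

```python
def completion_ids(output_ids, prompt_length):
    """Keep only the token ids generated after the prompt."""
    return output_ids[prompt_length:]


# Toy token ids standing in for real tokenizer output.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]   # generate() returns prompt + completion

print(completion_ids(output_ids, len(prompt_ids)))

# Why string splitting is fragile: the word may also occur in the reply.
decoded = "user\n\nCancel my order\nassistant\n\nOur assistant team will cancel it."
print(repr(decoded.split("assistant")[1]))
print(repr(decoded.split("assistant", 1)[1]))
```

With the tensors from the cell above, the analogous slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, decoded with `skip_special_tokens=True`. This assumes a decoder-only model whose `generate` output begins with the prompt tokens.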

In [None]:
prompt

In [None]:
import pprint
pprint.pprint(text.split("assistant")[1])

In [None]:
format_chat_template(train_dataset[1])

In [None]:
format_chat_template(train_dataset[1])['text']

In [None]:
from mlflow.models import infer_signature

sample = format_chat_template(train_dataset[1])

# MLflow infers schema from the provided sample input/output/params
signature = infer_signature(
    model_input=sample['text'],
    model_output=sample["response"],
    # Parameters are saved with default values if specified
    params={"max_new_tokens": 256, "repetition_penalty": 1.15, "return_full_text": False},
)
signature

In [None]:
# Basically the same format as we applied to the dataset. However, the template only accepts the {prompt} variable, so the whole user message has to be passed in through it.
prompt_template = """You are a top-rated customer service agent named John.
    Be polite to customers and answer all their questions.
{prompt}

### Response:
"""

In [None]:


import datetime
now = datetime.datetime.now()
now.strftime("%Y-%m-%d_%H:%M:%S")
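
Before the model is logged, it may help to see how a `{prompt}` template like the `prompt_template` defined above gets filled in. This is a minimal sketch using plain `str.format` on a shortened copy of the template; the user message is a hypothetical input.

```python
# Same shape as the notebook's template, shortened for illustration.
demo_template = """You are a top-rated customer service agent named John.
{prompt}

### Response:
"""

user_message = "I want to change my shipping address"  # hypothetical input
filled = demo_template.format(prompt=user_message)
print(filled)
```

Only the template string is parsed for placeholders, so braces inside the user message itself (such as `{{Order Number}}`) are passed through unchanged.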

In [None]:
import mlflow

# Get the ID of the MLflow Run that was automatically created above
last_run_id = mlflow.last_active_run().info.run_id

# Save a tokenizer without padding because it is only needed for training
tokenizer_no_pad = AutoTokenizer.from_pretrained(base_model, add_bos_token=True)

# If you interrupt the training, uncomment the following line to stop the MLflow run
# mlflow.end_run()


# Start an MLflow run context and log the fine-tuned Llama 3.2 model along with the param-included signature to
# allow for overriding parameters at inference time
now = datetime.datetime.now()

description = """fine tuning Llama3.2 model PEFT
"""
with mlflow.start_run(run_id=last_run_id, description=description) as run:
    mlflow.log_params(peft_config.to_dict())
    mlflow.transformers.log_model(
        # Log the tokenizer without padding that was loaded above
        transformers_model={"model": trainer.model, "tokenizer": tokenizer_no_pad},
        prompt_template=prompt_template,
        signature=signature,
        artifact_path="model",  # This is a relative path to save model files within MLflow run
    )

In [None]:
run.to_dictionary()

In [None]:
# import torch
# import gc
# try:
#   del trainer
#   del model
# except:
#   pass
# with torch.no_grad():
#     torch.cuda.empty_cache()
# gc.collect()

In [None]:
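# The run ID below comes from the author's run; replace it with your own (for example, last_run_id)
mlflow_model = mlflow.pyfunc.load_model("runs:/939788e363b84b4aaf8182196f8a2bbc/model")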

In [None]:
prompt="""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 11 Dec 2024

You are a top-rated customer service agent named John.
    Be polite to customers and answer all their questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

I don't know what to do to change to the {{Account Type}} account<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Not a problem at all! I understand that you're uncertain about the steps required to switch to the {{Account Type}} account. Allow me to provide you with a clear and concise guide:

1. Log into your account: Start by accessing our platform through the login page.

2. Navigate to Account Settings: Once you're logged in, locate the Account Settings or Profile section. This is where you can manage and make changes to your account.

3. Find the Upgrade option: Within the Account Settings or Profile section, look for an option labeled "Upgrade" or "Switch Account Type." Click on it to proceed.

4. Select the Free account: From the list of available account types, choose the "Free" account to switch to it.

5. Confirm the changes: Follow the on-screen instructions to confirm your decision and finalize the switch to the {{Account Type}} account.

If you encounter any difficulties or have further questions, please don't hesitate to reach out to our dedicated customer support team. They're available {{Customer Service Hours}} via {{Customer Support Phone Number}} or through the Live Chat on our website at {{Website URL}}. We're here to assist you every step of the way and ensure a smooth transition to your desired account type.<|eot_id|>"""

In [None]:
mlflow_model.predict(prompt)

In [None]:
mlflow_model.metadata.to_dict()