In [None]:
'''
# Direct Preference Optimization (DPO) in LLMs:-
- Offers a stable, performant, and computationally lightweight approach to fine-tuning language models.

Note:- RLHF is complex and often unstable, requiring significant hyperparameter tuning and sampling from the language model.

## What is Direct Preference Optimization (DPO) ?
-> Direct Supervised Learning. 
-> Algorithm used: Simple loss function. 

## Reinforcement Learning (RL) ?
-> Reward based learning. 
-> Algorithm used: PPO, Q-Learning, Actor-Critic. 
-> Removes the reward model fitting.

# The Two Stages of DPO
1. Supervised Fine-tuning (SFT):
- Supervised fine-tuning ensures that the model’s responses align more closely with specific requirements and preferences.
- For example, a conversational AI model could be finetuned to provide user-friendly and detailed responses that resonate with a company’s tone and branding.

2. Preference Learning:
- After the supervised fine-tuning stage, the model undergoes preference learning using preference data. 
- Preference data consists of curated options or alternatives related to a specific prompt. Annotators rank these options based on human preferences, providing insights into fine-tuning the model to produce outputs that align with human expectations.

The simplicity of DPO lies in its direct definition of the preference loss as a function of the policy, eliminating the need for training a reward model first. During the fine-tuning phase, DPO treats the language model itself as the reward model. It optimizes the policy using a binary cross-entropy objective, leveraging human preference data to determine preferred responses. By comparing the model’s responses to the preferred ones, the policy is adjusted to enhance its performance.
'''

In [2]:
# Supervised Fine-Tuning (SFT):
# Begin by training the SFT model using a dataset that is in-distribution and relevant to the task at hand. This step sets the stage for the DPO algorithm to work effectively.

'''
from transformers import AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("your-domain-dataset", split="train")
model = AutoModelForCausalLM.from_pretrained("your-foundation-model-of-choice")
trainer = SFTTrainer(model, train_dataset=dataset, dataset_text_field="text", max_seq_length=512)
trainer.train()
'''

# Understanding the Dataset Format:
# The DPO trainer requires a specific dataset format that reflects the preference between two sentences. The dataset should include entries for the prompt, chosen responses, and rejected responses.

'''
dpo_dataset_dict = {
    "prompt": ["hello", "how are you", ...],
    "chosen": ["hi, nice to meet you", "I am fine", ...],
    "rejected": ["leave me alone", "I am not fine", ...],
}
'''

# Leveraging the DPOTrainer:
# Initialize the DPOTrainer, specifying the model to be trained, a reference model (used to compute implicit rewards), the beta hyperparameter for the implicit reward, and the dataset.
'''
dpo_trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=0.1,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
)
'''

# Training and Monitoring:
# Begin the training process with dpo_trainer.train() Monitor the model’s performance using reward metrics such as rewards/chosen, rewards/rejected, rewards/accuracies, and rewards/margins.
'''
dpo_trainer.train()
'''

# Advantages of DPO over RLHF

# Direct Preference Optimization (DPO) offers several advantages over Reinforcement Learning from Human Feedback (RLHF). DPO is stable, performant, and computationally lightweight, eliminating the need for reward model fitting and extensive hyperparameter tuning. Additionally, DPO has shown promising results in controlling sentiment, improving response quality in summarization and single-turn dialogue, and simplifying the implementation and training process.

'\ndpo_trainer.train()\n'

In [1]:
!pip install transformers trl peft accelerate datasets bitsandbytes

Collecting trl
  Using cached trl-0.18.2-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.18.2-py3-none-any.whl (366 kB)
Installing collected packages: trl
Successfully installed trl-0.18.2


In [34]:
from huggingface_hub import HfApi, login
from dotenv import load_dotenv
import os 

load_dotenv()
hf_token = os.getenv("HF_TOKEN")
login(token=hf_token)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [45]:
# Check for MPS (Metal Performance Shaders) GPU support
import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using Apple GPU via MPS")
else:
    device = torch.device("cpu")
    print("Using CPU")

Using Apple GPU via MPS


In [46]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from datasets import load_dataset, Dataset
from trl import SFTTrainer, DPOConfig, DPOTrainer
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_config
import torch
from accelerate import Accelerator

In [47]:
!pip install -U bitsandbytes



In [48]:
## parameter for the SFT
sft_config={
            "model_ckpt": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", #This is my Pretrain model
            "load_in_4bit": True,
            "device_map": {"": Accelerator().local_process_index},
            "torch_dtype": torch.float16,
            "torch_dtype": torch.float16,
            "trust_remote_code": True,

            "use_lora": True,
            "r": 16,
            "lora_alpha": 16,
            "lora_dropout": 0.05,
            "bias": "none",
            "task_type": "CAUSAL_LM",
            "target_modules": ["q_proj", "v_proj"],

            "output_dir": "sft-tiny-chatbot",
            "per_device_train_batch_size": 1,
            "gradient_accumulation_steps": 1,
            "optim": "paged_adamw_32bit",
            "learning_rate": 2e-4,
            "lr_scheduler_type": "cosine",
            "save_strategy": "epoch",
            "logging_steps": 100,
            "num_train_epochs": 1,
            "max_steps": 250,
            "fp16": True,
            "push_to_hub": True,
            "packing": False,
            "max_seq_length": 512,
            "neftune_noise_alpha": 5
}

In [49]:
class TrainSFT:
    def __init__(self, data, config) -> None:
        self.data = data
        self.config = config

    def prepare_lora_model(self):
        self.lora_config = LoraConfig(
            r = self.config["r"],
            lora_alpha = self.config["lora_alpha"],
            lora_dropout = self.config["lora_dropout"],
            bias = self.config["bias"],
            task_type = self.config["task_type"],
            target_modules = self.config["target_modules"]
        )

        self.model = get_peft_config(
            self.model,
            self.lora_config
        )
        
    def load_model_tokenizer(self):
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config["model_ckpt"],
            load_in_4bit = self.config["load_in_4bit"],
            device_map = self.config["device_map"],
            torch_dtype = self.config["torch_dtype"]
        )

        self.model.config.use_cache = False
        self.model.config.pretraining_tp = 1
        self.model = prepare_model_for_kbit_training(self.model)

    def set_training_arguments(self):
        return TrainingArguments(
            output_dir = self.config["output_dir"],
            per_device_train_batch_size = self.config["per_device_train_batch_size"],
            gradient_accumulation_steps = self.config["gradient_accumulation_steps"],
            optim = self.config["optim"],
            learning_rate = self.config["learning_rate"],
            lr_scheduler_type = self.config["lr_scheduler_type"],
            save_strategy = self.config["save_strategy"],
            logging_steps = self.config["logging_steps"],
            num_train_epochs = self.config["num_train_epochs"],
            max_steps = self.config["max_steps"],
            fp16 = self.config["fp16"],
            push_to_hub = self.config["push_to_hub"],
            neftune_noise_alpha = self.config["neftune_noise_alpha"],
            report_to = "none"
        )

    def create_trainer(self):
        self.load_model_tokenizer()

        if self.config["use_lora"]:
            print(self.model.print_trainable_parameter())
            self.trainer = SFTTrainer(
                model = self.model,
                train_dataset = self.data, 
                perf_config = self.lora_config, 
                args = self.set_training_args(), 
                tokenizer = self.tokenizer
            )
                
    def train_and_save_model(self):
        self.create_trainer()
        self.trainer.train()
        self.trainer.save_model(self.config["output_dir"])
        self.tokenizer.save_pretrained(self.config["output_dir"])

In [51]:
def create_data():
    data = load_dataset("tatsu-lab/alpaca", split="train")
    data_df = data.to_pandas()
    data_df = data_df[:700]
    data_df['text'] = data_df[["input", "instruction", "output"]].apply(lambda x: "Human" + x["instruction"] + " " + x["input"] + " Assistant: " + x["output"], axis=1)
    data = Dataset.from_pandas(data_df)
    return data

data = create_data()

In [52]:
data

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 700
})

In [53]:
data[0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'text': 'HumanGive three tips for staying healthy.  Assistant: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

In [54]:
train_sft = TrainSFT(data, sft_config)
train_sft.train_and_save_model()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

#### **Preference Alignment - DPO**

Will train or finetune our model using PEFT(SFT) then will retrain for preference alignment for controlled response using DPO. 

In [9]:
dpo_config = {
    
}

In [11]:
class TrainDPO:
    def __init__(self):
        pass

    def prepare_lora_model(self):
        pass

    def load_model_tokenizer(self):
        pass

    def set_training_args(self):
        return DPOConfig()
    
    def create_trainer(self):
        pass

    def train_and_save_model(self):
        pass

In [None]:
def create_data():
    return data

data = create_data()

In [None]:
train_dpo = TrainDPO(data, dpo_config)
train_dpo.train_and_save_model()

NameError: name 'data' is not defined

1. Unsupervised pretraining
2. SFT
3. DPO
4. Inferencing