
PEFT = Base Model (frozen) + Adapter Weights + Standard Supervised Loss
RLHF = Base Model (frozen or fine-tuned) + Preference-based Loss + Reward Modeling (optional)

| Method                | Base Model Frozen?           | Adapter Use | Notes                                                                    |
| --------------------- | ---------------------------- | ----------- | ------------------------------------------------------------------------ |
| **PEFT (LoRA/QLoRA)** | ✅ Yes                        | ✅ Yes       | Only adapter weights are trained. Efficient, low-rank updates.           |
| **RLHF (PPO)**        | ❌ No (usually)               | ⚠️ Optional | Full model is updated with policy gradients. LoRA *can* be used, though. |
| **DPO / IPO**         | ✅ Often                      | ✅ Often     | Can work with frozen or adapter-based models. Efficient.                 |
| **SPIN**              | ⚠️ One frozen, one trainable | ⚠️ Optional | Two models: reference (frozen) vs tuned. Learns margin between them.     |




PEFT

| Method          | What it changes                                                     | How it changes it                                            | Direct weight update?                      |
| --------------- | ------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------ |
| **PEFT / LoRA** | Model's response to input (knowledge, task-specific patterns)       | By supervised fine-tuning on labeled data                    | ✅ Yes (adapters only, via backprop)        |
| **RLHF / DPO**  | Model's *preferences*, decision policy (how to rank/choose outputs) | By optimizing behavior to match human preferences or rewards | ✅ Yes (via reward gradients / preferences) |


PEFT techniques:

LoRA:
=====

# The pretrained weights form a real matrix of shape n×m
# W ∈ ℝⁿˣᵐ
# LoRA adds two small trainable matrices of shapes n×r and r×m, with rank r ≪ min(n, m)
# A ∈ ℝⁿˣʳ
# B ∈ ℝʳˣᵐ
# W' (fine-tuned model) = W (frozen weights) + α ⋅ AB   (many implementations use α/r as the scale)
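
To make the shapes concrete, here is a minimal NumPy sketch of the update W' = W + α ⋅ AB. The sizes, the value of `alpha`, the initialization scheme, and the helper name `lora_forward` are all assumptions chosen for illustration, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 32, 4            # rank r is much smaller than n and m
alpha = 0.5                    # illustrative scaling factor

W = rng.normal(size=(n, m))            # frozen pretrained weights
A = rng.normal(size=(n, r)) * 0.01     # trainable, small random init (assumed)
B = np.zeros((r, m))                   # trainable, zero init so that W' == W at the start

def lora_forward(x, W, A, B, alpha):
    """Compute x @ (W + alpha * A @ B) without building the n x m update."""
    return x @ W + alpha * (x @ A) @ B

x = rng.normal(size=(8, n))                # batch of 8 input vectors
y_init = lora_forward(x, W, A, B, alpha)   # equals x @ W because B is zero

B = rng.normal(size=(r, m)) * 0.01         # stand-in for a B that has been trained
y = lora_forward(x, W, A, B, alpha)
W_merged = W + alpha * A @ B               # adapter folded into the weights

full_params = W.size                       # n * m
lora_params = A.size + B.size              # r * (n + m)
print(f"full: {full_params} params, LoRA: {lora_params} params")
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is why the trainable parameter count drops from n⋅m to r⋅(n + m).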

QLoRA:
======
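QLoRA quantizes the frozen base weights to 4 bits by mapping each weight to the nearest entry of a small codebook (the NF4 data type), which cuts memory use. LoRA adapters are then trained in higher precision on top of the quantized model.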
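
The codebook idea can be sketched in a few lines of NumPy. This is an illustration only: it uses a uniform 16-level codebook rather than the NF4 levels, stores indices in `uint8` instead of packing two per byte, and the function names and block size are assumptions.

```python
import numpy as np

def quantize_blockwise(w, codebook, block_size=64):
    """Map each weight to the index of its nearest codebook entry, block by block."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # absmax scale per block
    scales[scales == 0] = 1.0                            # guard against all-zero blocks
    normed = blocks / scales                             # values now lie in [-1, 1]
    idx = np.abs(normed[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantize_blockwise(idx, scales, codebook, shape):
    """Look up codebook values and undo the per-block scaling."""
    return (codebook[idx] * scales).reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64)).astype(np.float32)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # 16 levels = 4 bits (uniform, not NF4)

idx, scales = quantize_blockwise(W, codebook)
W_hat = dequantize_blockwise(idx, scales, codebook, W.shape)
max_err = float(np.abs(W - W_hat).max())
print("max reconstruction error:", max_err)
```

Each weight is now represented by a 4-bit index plus one shared scale per block, and the reconstruction error is bounded by half a codebook step times the block scale.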

## Unsloth techniques


| Component         | Role                                          | Effect                                                          |
| ----------------- | --------------------------------------------- | --------------------------------------------------------------- |
| **Full Model**    | Pretrained base model (large, high-capacity)  | Strong foundation, high accuracy                                |
| **QLoRA**         | Quantized weights + low-rank adapters         | Memory & compute savings; efficient fine-tuning                 |
| **Distillation**  | Train a smaller model to mimic a bigger model | Smaller, faster model that retains much of the original quality |
| **Pruning**       | Remove unimportant weights                    | Smaller model, less computation                                 |
| **Other Methods** | E.g., weight sharing, knowledge transfer      | Further compression and efficiency                              |
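
As one example from the table, pruning by weight magnitude can be sketched as follows. This is a minimal, unstructured variant; the function name `magnitude_prune` and the 50% sparsity level are assumptions for illustration.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(w.size * sparsity)                  # number of weights to remove
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold                # True for weights that survive
    return w * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print("kept weights:", int(mask.sum()), "of", W.size)
```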

RLHF/DPO/IPO/SPIN

In [None]:
####### Step 1: Set up a toy multi-armed bandit environment


import numpy as np
import matplotlib.pyplot as plt

class MultiArmedBandit:
    def __init__(self, probs):
        """
        probs: list of probabilities of reward=1 for each action
        """
        self.probs = probs
        self.n_actions = len(probs)

    def step(self, action):
        reward = 1 if np.random.rand() < self.probs[action] else 0
        return reward


In [None]:
######## Step 2: Define a simple policy

class SoftmaxPolicy:
    def __init__(self, n_actions, lr=0.1):
        self.logits = np.zeros(n_actions)  # initialized to zero (uniform)
        self.lr = lr

    def get_probs(self):
        exp_logits = np.exp(self.logits - np.max(self.logits))
        return exp_logits / exp_logits.sum()

    def select_action(self):
        probs = self.get_probs()
        action = np.random.choice(len(probs), p=probs)
        return action, probs[action]

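    def update(self, action, reward, baseline=0):
        """REINFORCE step: logits += lr * (reward - baseline) * grad log pi(action)."""
        probs = self.get_probs()
        # gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs
        grad = -probs
        grad[action] += 1
        # policy gradient update scaled by (reward - baseline)
        self.logits += self.lr * grad * (reward - baseline)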
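
The update above relies on the gradient of log π(action) with respect to the logits being one_hot(action) − probs. A quick finite-difference check, sketched here with standalone helper functions (hypothetical names, not part of the class), supports that identity:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def analytic_grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. logits: one_hot(action) - probs."""
    grad = -softmax(logits)
    grad[action] += 1.0
    return grad

def numeric_grad_log_prob(logits, action, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        up, down = logits.copy(), logits.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (np.log(softmax(up)[action]) - np.log(softmax(down)[action])) / (2 * eps)
    return grad

logits = np.array([0.3, -1.2, 0.8])
g_analytic = analytic_grad_log_prob(logits, action=1)
g_numeric = numeric_grad_log_prob(logits, action=1)
print(g_analytic, g_numeric)
```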


In [None]:
########## Step 3: Training loop with reward adjustment options

def train_agent(env, policy, n_steps=500, reward_adjust='none', scale=1.0, clip_val=None, temp=1.0):
    rewards_history = []
    adjusted_rewards = []
    cumulative_rewards = []
    cum_reward = 0

    for step in range(n_steps):
        action, _ = policy.select_action()  # the selection probability is not needed here
        reward = env.step(action)

        # Reward adjustments
        adj_reward = reward

        # 1. Scaling
        if reward_adjust == 'scale':
            adj_reward = reward * scale

        # 2. Clipping
        elif reward_adjust == 'clip':
            adj_reward = np.clip(reward, 0, clip_val)

        # 3. Temperature-like transform (illustrative, not a standard technique)
        elif reward_adjust == 'temp':
            # Maps reward 1 -> 1 and reward 0 -> exp(-1/temp), so the adjusted
            # reward lies in (0, 1]; a smaller temp pushes the reward-0 value toward 0.
            adj_reward = np.exp(reward / temp)
            adj_reward = adj_reward / np.exp(1 / temp)

        policy.update(action, adj_reward)

        cum_reward += reward
        rewards_history.append(reward)
        adjusted_rewards.append(adj_reward)
        cumulative_rewards.append(cum_reward)

    return rewards_history, adjusted_rewards, cumulative_rewards
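
The training loop always calls `policy.update` with the default `baseline=0`. As a sketch of what a non-zero baseline might look like, the standalone function below subtracts a running mean of past rewards before the update. The function name `run_bandit`, the step count, and the choice of a running-mean baseline are assumptions for illustration.

```python
import numpy as np

def run_bandit(probs, n_steps=2000, lr=0.1, use_baseline=False, seed=0):
    """REINFORCE on a Bernoulli bandit; optionally subtract a running-mean baseline."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(probs))
    baseline = 0.0
    for t in range(1, n_steps + 1):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        action = rng.choice(len(p), p=p)
        reward = float(rng.random() < probs[action])
        grad = -p
        grad[action] += 1.0                       # one_hot(action) - p
        advantage = reward - (baseline if use_baseline else 0.0)
        logits += lr * grad * advantage
        baseline += (reward - baseline) / t       # incremental mean of rewards seen so far
    p = np.exp(logits - logits.max())
    return p / p.sum()

final_plain = run_bandit([0.2, 0.5, 0.8], use_baseline=False)
final_base = run_bandit([0.2, 0.5, 0.8], use_baseline=True)
print("no baseline:  ", final_plain)
print("with baseline:", final_base)
```

With rewards in {0, 1} and no baseline, every update is non-negative for the chosen arm; subtracting the mean lets below-average outcomes push the chosen arm's logit down as well.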


In [None]:
########  Step 4: Run and compare

env = MultiArmedBandit([0.2, 0.5, 0.8])  # the third arm has the highest payoff

n_steps = 500
results = {}

for adj in ['none', 'scale', 'clip', 'temp']:
    policy = SoftmaxPolicy(env.n_actions, lr=0.1)
    if adj == 'scale':
        r, ar, cr = train_agent(env, policy, n_steps, reward_adjust=adj, scale=2.0)
    elif adj == 'clip':
        r, ar, cr = train_agent(env, policy, n_steps, reward_adjust=adj, clip_val=0.5)
    elif adj == 'temp':
        r, ar, cr = train_agent(env, policy, n_steps, reward_adjust=adj, temp=0.5)
    else:
        r, ar, cr = train_agent(env, policy, n_steps, reward_adjust=adj)
    results[adj] = (r, ar, cr)


In [None]:
########  Step 5: Plot

plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
for adj in results:
    plt.plot(results[adj][0], label=adj)
plt.title("Original Rewards")
plt.xlabel("Step")
plt.ylabel("Reward")
plt.legend()

plt.subplot(1, 3, 2)
for adj in results:
    plt.plot(results[adj][1], label=adj)
plt.title("Adjusted Rewards")
plt.xlabel("Step")
plt.ylabel("Adjusted Reward")
plt.legend()

plt.subplot(1, 3, 3)
for adj in results:
    plt.plot(results[adj][2], label=adj)
plt.title("Cumulative Rewards")
plt.xlabel("Step")
plt.ylabel("Sum of rewards")
plt.legend()

plt.tight_layout()
plt.show()


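| Method | Reward Signal Type                                  | Training Style                                   | Label Requirement                        |
| ------ | --------------------------------------------------- | ------------------------------------------------ | ---------------------------------------- |
| PPO    | Explicit scalar rewards                             | Policy gradient (RL)                             | Needs a reward function or reward model  |
| DPO    | Pairwise preferences                                | Direct optimization (logistic loss)              | Needs preference pairs                   |
| IPO    | Pairwise preferences                                | Direct optimization (squared loss on the margin) | Needs preference pairs                   |
| SPIN   | Self-generated pairs (human data vs. model samples) | Iterative self-play with a DPO-style loss        | SFT demonstrations, no preference labels |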


PPO  
  
Policy π_old  --> Sample actions in env --> Collect rewards R  
         ↓  
Compute advantage A = R - baseline  
         ↓  
Compute clipped surrogate objective L_clip(π, π_old, A), which limits how far π can move from π_old in one update  
         ↓  
Gradient ascent on L_clip to update policy π_new  
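
The clipped surrogate step can be sketched in NumPy as follows. The probabilities, advantages, and the clipping range `clip_eps=0.2` are illustrative assumptions, and `ppo_clip_objective` is a hypothetical helper name.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    ratio = np.exp(logp_new - logp_old)                  # pi_new(a) / pi_old(a)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Three sampled actions with their probabilities under the old and new policy.
logp_old = np.log(np.array([0.25, 0.25, 0.50]))
logp_new = np.log(np.array([0.50, 0.10, 0.40]))
advantages = np.array([1.0, 1.0, -1.0])

obj = ppo_clip_objective(logp_new, logp_old, advantages)
print("clipped surrogate objective:", obj)
```

For the first sample the ratio is 2.0, but with a positive advantage the objective only credits it up to 1.2, so there is no incentive to move the policy further in a single update.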



DPO  
  
Policy π  --> Generate outputs  
         ↓  
Get pairwise preferences from annotators or a reward model (accepted/rejected pair); the implicit reward is the log-likelihood difference between the two  
         ↓  
Compute preference loss L_pref from the log-likelihood margin between preferred and rejected outputs  
         ↓  
Update π by minimizing L_pref (no policy gradient or RL needed)  

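Math

$\pi_0$ = original pretrained model (fixed reference)
$\pi_\theta$ = current model being trained/updated

The loss for one pair ($x^{+}$ preferred, $x^{-}$ rejected) is:

$$L(\theta) = -\log \sigma\Big(\alpha\,\big[\log \pi_\theta(x^{+}) - \log \pi_\theta(x^{-})\big]\Big)$$

Where:

$\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

$\alpha > 0$ is a scaling factor controlling how "strongly" we push the preference margin

$\log \pi_\theta(\cdot)$ is the log-likelihood of the model output

This form omits $\pi_0$. In the full DPO objective each log-likelihood is measured relative to the reference, so $\log \pi_\theta(\cdot)$ is replaced by $\log \pi_\theta(\cdot) - \log \pi_0(\cdot)$.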
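
A minimal NumPy sketch of the pairwise loss above. The helper names and the optional `ref_logp_*` arguments are assumptions for illustration: leaving them at zero gives the simplified form written here, while passing the reference model's log-likelihoods gives the reference-relative margin used in the standard DPO objective.

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def preference_loss(logp_pos, logp_neg, alpha=1.0, ref_logp_pos=0.0, ref_logp_neg=0.0):
    """-log sigmoid(alpha * margin) for one (preferred, rejected) pair."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -log_sigmoid(alpha * margin)

loss_tie = preference_loss(logp_pos=-2.0, logp_neg=-2.0)      # no margin yet
loss_good = preference_loss(logp_pos=-1.0, logp_neg=-3.0)     # preferred output more likely
loss_bad = preference_loss(logp_pos=-3.0, logp_neg=-1.0)      # rejected output more likely
print(loss_tie, loss_good, loss_bad)
```

The loss equals log 2 when the model is indifferent, shrinks toward 0 as the preferred output becomes more likely than the rejected one, and grows roughly linearly when the ordering is wrong.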
