## Setup

In [1]:
!pip install -U transformers datasets accelerate bitsandbytes trl

Collecting transformers
  Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting datasets
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting accelerate
  Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting trl
  Downloading trl-0.24.0-py3-none-any.whl.metadata (11 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecti

In [2]:
import wandb
import os
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login, whoami

user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HF")
wandb_key = user_secrets.get_secret("wandb")

login(token=hf_token)
user_info = whoami()
print(f"Logged in as: {user_info['name']}")

os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()



Logged in as: PT-10


[34m[1mwandb[0m: Currently logged in as: [33mgrasgor10[0m ([33mgrasgor10-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [3]:
from datasets import load_dataset

# Load the Steve Jobs interviews dataset
dataset = load_dataset("Hypersniper/Steve_Jobs_Interviews")

dataset['train']


Dataset({
    features: ['output', 'instruction'],
    num_rows: 427
})

In [4]:
dataset['train'][0]

{'output': 'There are different answers for different people. In business, that question is easy to answer: You really can prepare documents much faster and at a higher quality level, and you can do many things to increase office productivity. A computer frees people from much of the menial work. Besides that, you are giving them a tool that encourages them to be creative. Remember, computers are tools. Tools help us do our work better. In education, computers are the first thing to come along since books that will sit there and interact with you endlessly, without judgment. Socratic education isn’t available anymore, and computers have the potential to be a real breakthrough in the educational process when used in conjunction with enlightened teachers. We’re in most schools already.',
 'instruction': 'How about some concrete reasons to buy a computer today? An executive in your industry recently said, “We’ve given people computers, but we haven’t shown them what to do with them. I can

In [5]:
# For a model with a chat template, the data should be in TRL's
# conversational prompt-completion format:
# {"prompt": [{"role": "user", "content": "What color is the sky?"}],
#  "completion": [{"role": "assistant", "content": "It is blue."}]}

# def preprocess_function(row):
#     return {
#         "prompt": [{"role": "user", "content": row["instruction"]}],
#         "completion": [
#             {"role": "assistant", "content": row["output"]}
#         ],
#     }

# dataset = dataset.map(preprocess_function, remove_columns=["output", "instruction"])

In [6]:
# def formatting_func(example):
#     user_msg = example["prompt"][0]["content"]
#     assistant_msg = example["completion"][0]["content"]
#     return {
#         "input_text": f"### Question:\n{user_msg}\n\n### Steve Jobs' Response:\n",
#         "labels": assistant_msg
#     }
# The conversational format above only works with an instruct model whose tokenizer ships with a chat template.

In [None]:
def preprocess_function(row):
    return {
        "prompt": row["instruction"],      # simple string
        "completion": row["output"]        # simple string
    }

dataset = dataset.map(preprocess_function, remove_columns=["instruction", "output"])

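# Llama-3.2-1B is a base model without a chat template, so prompt and completion are passed
# as plain strings rather than as nested role/message lists. With completion_only_loss=True,
# SFTTrainer then computes the loss on the completion only.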
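To make the idea behind `completion_only_loss=True` concrete, here is a minimal sketch of completion-only masking: prompt positions get the label `-100` (the index that PyTorch's cross-entropy loss ignores by default), so only completion tokens contribute to the loss. The whitespace tokenizer, the toy vocabulary, and the `build_labels` helper are hypothetical stand-ins for illustration; this is not TRL's actual implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer: maps each word to an integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_labels(prompt, completion, vocab):
    """Return (input_ids, labels) with every prompt position masked out."""
    prompt_ids = toy_tokenize(prompt, vocab)
    completion_ids = toy_tokenize(completion, vocab)
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_labels("Why buy a computer?", "Computers are tools.", vocab)
for token_id, label in zip(input_ids, labels):
    print(f"input id {token_id} -> label {label}")
```

The model still sees the full prompt as context; the mask only controls which positions are scored.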

In [8]:
print(next(iter(dataset["train"])))

{'prompt': 'How about some concrete reasons to buy a computer today? An executive in your industry recently said, “We’ve given people computers, but we haven’t shown them what to do with them. I can balance my checkbook faster by hand than on my computer.” Why should a person buy a computer?', 'completion': 'There are different answers for different people. In business, that question is easy to answer: You really can prepare documents much faster and at a higher quality level, and you can do many things to increase office productivity. A computer frees people from much of the menial work. Besides that, you are giving them a tool that encourages them to be creative. Remember, computers are tools. Tools help us do our work better. In education, computers are the first thing to come along since books that will sit there and interact with you endlessly, without judgment. Socratic education isn’t available anymore, and computers have the potential to be a real breakthrough in the educationa

## Train

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import LoraConfig

model_name = "meta-llama/Llama-3.2-1B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# BitsAndBytes (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={'':torch.cuda.current_device()}
)

model.gradient_checkpointing_enable()

# LoRA config
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "up_proj",
        "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM"
)

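As a rough sanity check on how small the adapter is, the sketch below counts the LoRA parameters implied by `peft_config`. Each targeted linear layer of shape `(in_features, out_features)` gets two low-rank matrices with `r * (in_features + out_features)` parameters in total. The model dimensions are assumptions about Llama-3.2-1B and should be verified against `model.config`; after training setup, `trainer.model.print_trainable_parameters()` would give the authoritative number.

```python
# Assumed Llama-3.2-1B dimensions (verify against model.config before relying on them).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
KV_DIM = 512         # num_key_value_heads * head_dim, assumed 8 * 64
NUM_LAYERS = 16      # num_hidden_layers

R = 8            # LoRA rank, as in peft_config
LORA_ALPHA = 32  # as in peft_config

# (in_features, out_features) for each module listed in target_modules
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling (alpha/r):  {scaling}")
```

Under these assumed dimensions the adapter is a few million parameters, a small fraction of the base model.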

In [None]:
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    completion_only_loss=True,         # Compute the loss on the completion (Steve Jobs' answers) only
    output_dir="./llama3.2_jobs_sft",
    gradient_accumulation_steps=2,     # Effective batch = per-device batch size x 2 (16 sequences if the default of 8 applies)
    learning_rate=1e-4,                # Conservative choice for only 427 rows
    num_train_epochs=3,                # No eval set is configured, so overfitting has to be judged from the training loss
    lr_scheduler_type="cosine",        # Smooth decay suits a short run
    warmup_ratio=0.05,                 # Warm up over the first 5% of steps
    logging_steps=20,                  # Log the training loss every 20 steps
    fp16=True,                         # float16 mixed-precision training
    save_strategy="epoch",             # Save a checkpoint at the end of each epoch
    save_total_limit=3,                # Keep at most 3 checkpoints
    max_length=3072,                   # Maximum sequence length in tokens; longer examples are truncated
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=sft_config,
    peft_config=peft_config,
)

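The `lr_scheduler_type="cosine"` and `warmup_ratio=0.05` settings describe a learning rate that rises linearly during warmup and then follows half a cosine wave down to zero. The sketch below reproduces that shape in plain Python. It is an approximation for intuition, not the library's scheduler, and `total_steps=100` is an arbitrary illustrative value rather than the step count of this run.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup followed by cosine decay to zero (simplified sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # illustrative value only
for step in (0, 2, 5, 25, 50, 75, 100):
    print(f"step {step:3d}: lr = {lr_at_step(step, total_steps):.2e}")
```

The peak learning rate is reached at the end of warmup, and the decay is slowest near the start and end of the cosine phase.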

In [11]:
trainer.train()
trainer.model.save_pretrained("./llama3.2-jobs-sft-final")

| Step | Training Loss |
|------|---------------|
| 20   | 2.6174        |
| 40   | 2.4469        |
| 60   | 2.3222        |
| 80   | 2.2707        |
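The logged steps (20 through 80) can be cross-checked with some quick arithmetic. This is a simplified estimate that assumes a single GPU and a per-device batch size of 8, which is the library default as far as I know and is not set explicitly in the config; the trainer's own step accounting may differ slightly in edge cases.

```python
import math

num_rows = 427                   # size of the train split
per_device_batch_size = 8        # assumption: default value, not set in the config
gradient_accumulation_steps = 2  # from the SFTConfig
num_epochs = 3
logging_steps = 20

effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_rows / effective_batch)
total_steps = steps_per_epoch * num_epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(f"Effective batch size: {effective_batch} sequences")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Loss logged at steps: {logged_at}")
```

If the assumption holds, the run ends one step after the last logged value, which matches a log that stops at step 80.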


## Inference

In [17]:
from transformers import pipeline

pipe = pipeline("text-generation", model="./llama3.2-jobs-sft-final", tokenizer=tokenizer, return_full_text=False)

prompt = "Is there an inevitable break between being an entrepreneur and a businessman? Are the people who get things going different?"

result = pipe(
    prompt,
    max_new_tokens=3072,
    temperature=0.8,
    do_sample=True,
    top_k=50,                  
    top_p=0.9,                 
    repetition_penalty=1.2     
)
print(result[0]["generated_text"])


Device set to use cuda:0


The difference is that in business you're trying to make money, not something. You want your company to be successful--not just one or two individuals within it. And the reason we do this is because these are very personal endeavors for us; they have deep meaning. But if I had been able to go into my basement last night at midnight with no idea what was about to happen but know exactly where all of our chips were laid out on the table before me, would I take any chances right now?
Of course!
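The `temperature`, `top_k`, and `top_p` arguments shape which tokens can be sampled at each step. The sketch below shows the idea on a toy four-token distribution: temperature rescales the logits, top-k keeps the most likely tokens, and top-p keeps the smallest set of those whose cumulative probability reaches the threshold. It is a simplified illustration with hypothetical helper names, it ignores `repetition_penalty`, and the library's logits processors differ in detail.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top_k tokens, then the smallest prefix whose renormalized mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx] / mass_k
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates form a valid distribution
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
probs = softmax(logits, temperature=0.8)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.9)
print(candidates)
```

Sampling then draws the next token from `candidates`, so tokens removed by either filter can never be generated at that step.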


## Save Model

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the unquantized base model for merging; torch_dtype="auto" uses the dtype stored in the checkpoint config
base_model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto"
)

# Load your SFT-trained LoRA adapter
model = PeftModel.from_pretrained(model, "./llama3.2-jobs-sft-final")

# Merge LoRA weights into base model
model = model.merge_and_unload()

# Save merged model and tokenizer
save_dir = "./llama3.2-jobs-sft-merged"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)


('./llama3.2-jobs-sft-merged/tokenizer_config.json',
 './llama3.2-jobs-sft-merged/special_tokens_map.json',
 './llama3.2-jobs-sft-merged/tokenizer.json')
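Conceptually, merging folds the low-rank update into each targeted weight matrix, `W_merged = W + (alpha / r) * B @ A`, so that the adapter is no longer needed at inference time. The sketch below shows this on tiny hand-written matrices using plain Python lists. The `matmul` and `merge_lora` helpers are hypothetical and only illustrate the arithmetic; they are not what `merge_and_unload()` runs internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
lora_a = [[1.0, 2.0]]              # A: rank 1, shape (1, 2)
lora_b = [[0.5], [0.25]]           # B: shape (2, 1)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)

# The merged weight gives the same output as base path plus adapter path
x = [[1.0], [1.0]]
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged)
print(matmul(merged, x) == separate)
```

This is also why the base model is loaded unquantized before merging: the update is added directly to the stored weights.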

## DPO set generation

In [32]:
prompt = dataset['train'][0]['prompt']

result = pipe(
    prompt,
    max_new_tokens=3072,
    temperature=0.8,
    do_sample=True,
    top_k=50,                  
    top_p=0.9,                 
    repetition_penalty=1.2     
)
print(result[0]["generated_text"])

You won’t find it any easier or more convenient to use one than you would an old typewriter and paper. In fact, if anything the experience is worse because everything has gotten so much better. We’re able now to write longer documents, edit our own work far quicker, take pictures, store huge amounts of data—well beyond those two megabytes that everybody’s talking about—and sort through all this stuff much more efficiently. And there are no excuses for using something like DOS anymore. There was only one major commercial product coming out when IBM announced its PC that wasn’t totally awful: DOS. But since then Microsoft and others have built great applications around DOS—you know, Lotus 1-2-3—which make life much easier. So right away these things were very obvious advantages.


In [None]:
from datasets import Dataset
import random
from tqdm import tqdm

# Prepare dataset dict
dpo_dataset = dataset["train"].to_dict()

# Pre-create 3 columns for 3 variations, filled with None
dpo_dataset["rejected_1"] = [None] * len(dpo_dataset["prompt"])
dpo_dataset["rejected_2"] = [None] * len(dpo_dataset["prompt"])
dpo_dataset["rejected_3"] = [None] * len(dpo_dataset["prompt"])

# Sample generation parameters at random so the three responses per prompt differ
def sample_generation_params():
    return {
        "max_new_tokens": random.choice([2048, 2560, 3072]),
        "temperature": random.choice([0.7, 0.8, 0.9, 1.0]),
        "top_k": random.choice([40, 50, 60]),
        "top_p": random.choice([0.85, 0.9, 0.95]),
        "repetition_penalty": random.choice([1.1, 1.2, 1.3]),
        "do_sample": True,
        # early_stopping is omitted: it only affects beam search, which is not used here
    }

# Loop over prompts with tqdm
for i, prompt in enumerate(tqdm(dpo_dataset["prompt"], desc="Generating synthetic responses")):
    for j in range(3):
        gen_params = sample_generation_params()
        result = pipe(prompt, **gen_params)
        # Assign to the appropriate column
        dpo_dataset[f"rejected_{j+1}"][i] = result[0]["generated_text"]
    
    # Save checkpoint every 10 prompts
    if (i + 1) % 10 == 0:
        tmp_ds = Dataset.from_dict(dpo_dataset)
        tmp_ds.save_to_disk("./dpo_dataset_checkpoint")
        print(f"Checkpoint saved at prompt {i+1}")

# Save final dataset
dpo_dataset = Dataset.from_dict(dpo_dataset)
dpo_dataset.save_to_disk("./dpo_dataset_final")
print("Final dataset saved.")

Checkpoint saved at prompt 10
Checkpoint saved at prompt 20
Checkpoint saved at prompt 30
Checkpoint saved at prompt 40
Checkpoint saved at prompt 50
Checkpoint saved at prompt 60
Checkpoint saved at prompt 70
Checkpoint saved at prompt 80
Checkpoint saved at prompt 90
Checkpoint saved at prompt 100
Checkpoint saved at prompt 110
Checkpoint saved at prompt 120
Checkpoint saved at prompt 130
Checkpoint saved at prompt 140

Generating synthetic responses:  34%|███▍      | 146/427 [32:25<49:06, 10.49s/it]
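Once generation has finished, the three `rejected_*` columns still need to be flattened into preference pairs, with the original interview answer (`completion`) as the preferred response. The sketch below shows one way to do that on a toy column dict. The `to_preference_pairs` helper is hypothetical, and the `prompt` / `chosen` / `rejected` field names are an assumption about what the downstream DPO trainer expects, so check them against its documentation.

```python
def to_preference_pairs(columns, rejected_keys=("rejected_1", "rejected_2", "rejected_3")):
    """Flatten per-prompt rejected columns into (prompt, chosen, rejected) records."""
    pairs = []
    for i, prompt in enumerate(columns["prompt"]):
        chosen = columns["completion"][i]
        for key in rejected_keys:
            rejected = columns[key][i]
            # Skip rows that were never generated or that duplicate the chosen answer
            if rejected is None or rejected.strip() == chosen.strip():
                continue
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy stand-in for dpo_dataset; the second prompt has no generations yet
toy_columns = {
    "prompt": ["Q1", "Q2"],
    "completion": ["A1", "A2"],
    "rejected_1": ["sampled answer x", None],
    "rejected_2": ["sampled answer y", None],
    "rejected_3": ["A1", None],
}

pairs = to_preference_pairs(toy_columns)
for pair in pairs:
    print(pair)
```

Skipping `None` entries also makes the helper usable on a partially filled checkpoint from an interrupted run.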