# On-device Direct Preference Optimization (DPO) 

This notebook can be used instead of running `02_src/train_dpo.py` from the command line.
- Launch Jupyter from the repo root so paths resolve.
- Set your Hugging Face token in `00_configs/secrets.toml` or `HF_TOKEN`.
- Request access approval on Hugging Face for the model you would like to use.


## Check the repository root

### Locate the repository root on your machine

In [2]:
from pathlib import Path
import sys

REPO_ROOT = Path.cwd()
if not (REPO_ROOT / "00_configs").exists():
    for parent in REPO_ROOT.parents:
        if (parent / "00_configs").exists():
            REPO_ROOT = parent
            break

if not (REPO_ROOT / "00_configs").exists():
    raise RuntimeError("Could not find repo root containing 00_configs.")

SRC_DIR = REPO_ROOT / "02_src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print("Repo root:", REPO_ROOT)
print("Source dir:", SRC_DIR)


Repo root: c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning
Source dir: c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\02_src


## Check the hardware, CUDA, and project modules

### Check the hardware condition and whether CUDA is available.

In [4]:
import torch
print("cuda_available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device_name:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"mem_free/total_GB: {free/1e9:.2f}/{total/1e9:.2f}")

cuda_available: True
device_name: NVIDIA RTX A1000 Laptop GPU
mem_free/total_GB: 3.46/4.29


### Import the project modules

In [7]:
import importlib

train_dpo = importlib.import_module("train_dpo")
run_inference = importlib.import_module("run_inference")
merge_lora = importlib.import_module("merge_lora")
data_utils = importlib.import_module("utils.data_utils")
formatting = importlib.import_module("utils.formatting")

# eval_module = importlib.import_module("eval.evaluate")
print("Imports OK")




Imports OK


### Confirm the configuration

You can adjust the paths and hyperparameters in `00_configs`.

In [9]:
# Purpose: load the DPO training config from 00_configs/dpo.json.
CONFIG_PATH = REPO_ROOT / "00_configs" / "dpo.json"
config = train_dpo.load_config(CONFIG_PATH)
config

Configuration loaded from c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\dpo.json


{'model_name': 'meta-llama/Llama-3.2-1B-Instruct',
 'dataset_hf': '01_data\\dpo\\train.jsonl',
 'output_dir': '04_models\\adapters\\output_dpo',
 'max_seq_length': 512,
 'max_prompt_length': 256,
 'max_target_length': 256,
 'num_train_epochs': 1,
 'per_device_train_batch_size': 1,
 'gradient_accumulation_steps': 4,
 'learning_rate': 5e-05,
 'lr_scheduler_type': 'cosine',
 'warmup_ratio': 0.03,
 'weight_decay': 0.01,
 'dataloader_num_workers': 2,
 'logging_steps': 10,
 'save_steps': 200,
 'save_total_limit': 2,
 'fp16': True,
 'bf16': False,
 'optim': 'paged_adamw_8bit',
 'gradient_checkpointing': True,
 'lora_r': 16,
 'lora_alpha': 32,
 'lora_dropout': 0.05,
 'lora_target_modules': ['q_proj',
  'k_proj',
  'v_proj',
  'o_proj',
  'gate_proj',
  'up_proj',
  'down_proj'],
 'load_in_4bit': True,
 'bnb_4bit_use_double_quant': True,
 'bnb_4bit_quant_type': 'nf4',
 'bnb_4bit_compute_dtype': 'float16',
 'dpo_beta': 0.1,
 'seed': 42,
 'dataset_split': 'train',
 'max_train_samples': None,
 'da
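
To make the role of `dpo_beta` concrete, here is a minimal sketch of the standard per-example DPO loss. It is not taken from `train_dpo.py`; the function name and the argument names are illustrative, and each argument stands for the summed log-probability of a full response under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# The policy starts as a copy of the reference, so the margin is 0.
initial = dpo_loss(-10.0, -10.0, -10.0, -10.0, beta=0.1)

# The policy has moved toward the chosen answer and away from the rejected one.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1)
```

A larger `beta` penalizes drift from the reference model more strongly for the same margin; the value 0.1 used here matches the `dpo_beta` entry in the config above.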

### HF token check

In [11]:
hf_token = train_dpo.resolve_hf_token(config)
train_dpo.preflight_checks(config, hf_token)

Secrets loaded from C:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\secrets.toml
HuggingFace token configured

 - 4-bit bitsandbytes is unreliable on native Windows; use WSL or disable load_in_4bit.
 - GPU reports 4.0 GB total; may OOM with current settings.
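
The two warnings above can be reproduced with a small sketch. This is not the project's `preflight_checks`; the function name, the `min_gib` threshold, and the message wording are assumptions made for illustration. It also shows why the earlier cell printed 4.29 (decimal GB) while the warning says 4.0: dividing the same byte count by 1024³ instead of 10⁹ gives about 4.0.

```python
import platform

GIB = 1024 ** 3  # bytes per binary gigabyte


def preflight_warnings(config, total_bytes, system=None, min_gib=6.0):
    """Return human-readable warnings for risky settings (illustrative only)."""
    system = system or platform.system()
    warnings = []
    # bitsandbytes 4-bit kernels are the concern on native Windows.
    if config.get("load_in_4bit") and system == "Windows":
        warnings.append("4-bit bitsandbytes is unreliable on native Windows")
    # Hypothetical threshold: flag GPUs below min_gib of total memory.
    total_gib = total_bytes / GIB
    if total_gib < min_gib:
        warnings.append(f"GPU reports {total_gib:.1f} GB total; may OOM")
    return warnings


msgs = preflight_warnings({"load_in_4bit": True}, 4.29e9, system="Windows")
```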



## Dry run (no training)
Run this to validate config, dataset path, and GPU before starting a full run.

In [16]:
import sys

orig_argv = sys.argv[:]
sys.argv = ["train_dpo.py", "--config", str(CONFIG_PATH), "--dry_run"]
try:
    train_dpo.main()
finally:
    sys.argv = orig_argv



Starting Llama-3.2-1B DPO Training with QLoRA

Step 1/8: Loading configuration...
Configuration loaded from c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\dpo.json
Secrets loaded from C:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\secrets.toml
HuggingFace token configured
Running preflight checks...

 - 4-bit bitsandbytes is unreliable on native Windows; use WSL or disable load_in_4bit.
 - GPU reports 4.0 GB total; may OOM with current settings.

Dry run complete. Exiting before model/dataset load.
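
The save-and-restore of `sys.argv` used in the cell above can be wrapped in a context manager so the same pattern is not repeated for every run. This is a sketch; `patched_argv` is a hypothetical helper, not part of the project.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(*args):
    """Temporarily replace sys.argv, restoring it even if the body raises."""
    original = sys.argv[:]
    sys.argv = list(args)
    try:
        yield
    finally:
        sys.argv = original
```

With this helper, the dry run would read `with patched_argv("train_dpo.py", "--config", str(CONFIG_PATH), "--dry_run"): train_dpo.main()`.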


## Start training
This will launch DPO training and write logs to `05_logs/training.log`.


In [17]:
import sys

orig_argv = sys.argv[:]
sys.argv = ["train_dpo.py", "--config", str(CONFIG_PATH)]
try:
    train_dpo.main()
finally:
    sys.argv = orig_argv



Starting Llama-3.2-1B DPO Training with QLoRA

Step 1/8: Loading configuration...
Configuration loaded from c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\dpo.json
Secrets loaded from C:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\00_configs\secrets.toml
HuggingFace token configured
Running preflight checks...

 - 4-bit bitsandbytes is unreliable on native Windows; use WSL or disable load_in_4bit.
 - GPU reports 4.0 GB total; may OOM with current settings.


Step 2/8: Setting up 4-bit quantization...
BitsAndBytes config created (4-bit quantization enabled)

Step 3/8: Loading policy model...
Loading base model: meta-llama/Llama-3.2-1B-Instruct
This may take a few minutes...


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
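
One way to act on the preflight advice ("use WSL or disable load_in_4bit") is to derive a modified config before retrying. The sketch below is an assumption-laden illustration: `without_bitsandbytes` is a hypothetical helper, the parameter count of roughly 1.2 billion is approximate, and whether `train_dpo` honors the changed keys has not been verified here. The estimate covers weights only; activations, gradients, and optimizer state need additional memory.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


def without_bitsandbytes(config: dict) -> dict:
    """Return a copy of the config that avoids bitsandbytes features."""
    patched = dict(config)
    patched["load_in_4bit"] = False
    # Assumption: "adamw_torch" is passed through to the Hugging Face trainer.
    patched["optim"] = "adamw_torch"
    return patched


N_PARAMS = 1.2e9  # approximate size of a 1B-class model
fp16_gb = weight_memory_gb(N_PARAMS, 16)   # about 2.4 GB
four_bit_gb = weight_memory_gb(N_PARAMS, 4)  # about 0.6 GB
```

Since the 4-bit weights alone are well under the 3.46 GB reported free, the error may stem from the Windows bitsandbytes issue flagged in the preflight warning rather than from raw capacity, although this notebook does not confirm the cause. Loading in fp16 leaves little headroom on this GPU, so running under WSL is likely the safer option.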

## Merge
Merge the trained LoRA adapters into a full, standalone model.
This is optional; skip if you only need adapters for inference.
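
Conceptually, merging folds each low-rank update into its base weight matrix: `W_merged = W + (lora_alpha / lora_r) * B @ A`. The pure-Python sketch below illustrates that arithmetic on a tiny matrix; it is not the code inside `merge_lora.merge_lora_to_base`, and the helper names are hypothetical. The example uses `alpha=2, r=1`, which gives the same scale of 2.0 as the config's `lora_alpha=32` and `lora_r=16`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora_weight(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying the inputs."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Tiny example: a 2x2 base weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (out=2, r=1)
lora_a = [[0.5, -0.5]]    # shape (r=1, in=2)
merged = merge_lora_weight(w, lora_a, lora_b, alpha=2, r=1)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the result can be loaded like any other checkpoint.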


In [9]:
# Purpose: merge LoRA adapters into the base model for a standalone checkpoint.
from pathlib import Path

BASE_MODEL = config.get("model_name", "meta-llama/Llama-3.2-1B-Instruct")
ADAPTER_PATH = Path(config.get("output_dir", REPO_ROOT / "04_models" / "adapters" / "output_dpo"))
if not ADAPTER_PATH.is_absolute():
    ADAPTER_PATH = REPO_ROOT / ADAPTER_PATH

OUTPUT_PATH = REPO_ROOT / "04_models" / "merged" / "merged_model_dpo"

merge_lora.merge_lora_to_base(
    base_model_name=BASE_MODEL,
    adapter_path=ADAPTER_PATH,
    output_path=OUTPUT_PATH,
    push_to_hub=False,
)



Merging LoRA adapters into base model

Loading base model...




Base model loaded

Loading LoRA adapters...




LoRA adapters loaded

Merging adapters...
Adapters merged successfully

Loading tokenizer...
Tokenizer loaded

Saving merged model to: c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\04_models\merged\merged_model_dpo
Merged model saved

Merge complete!
Merged model location: c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\04_models\merged\merged_model_dpo
Use like any HF model:
  model = AutoModelForCausalLM.from_pretrained("c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\04_models\merged\merged_model_dpo")
  tokenizer = AutoTokenizer.from_pretrained("c:\Users\Minseok Jung\Desktop\Programming\asap_finetuning\04_models\merged\merged_model_dpo")


# Run inference with the merged model

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

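# Build the path from REPO_ROOT so it resolves regardless of the working directory or OS.
model_path = str(REPO_ROOT / "04_models" / "merged" / "merged_model_dpo")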
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [15]:
prompt = "### Instruction:\nWhat is upstream in the oil and gas industry?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


### Instruction:
What is upstream in the oil and gas industry?

### Response:
In the oil and gas industry, "upstream" refers to the exploration and production of crude oil and natural gas. This includes the activities involved in finding, extracting, and processing these resources.

### Example:
- A company like ExxonMobil is an example of an oil and gas company that operates upstream.
- A company like Chevron is also an example of an oil and gas company that operates upstream.

### Key points to note:
- Upstream activities typically involve drilling, exploration, and production of crude oil and natural gas.
- These activities can be performed in various locations, such as onshore or offshore, depending on the geology
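
The decoded output above echoes the prompt and stops mid-sentence because generation is capped at `max_new_tokens=128`. A small sketch for building the prompt and keeping only the generated part is shown below; both helper names are hypothetical, and the prefix check assumes the tokenizer round-trips the prompt text unchanged.

```python
def build_prompt(instruction: str) -> str:
    """Format an instruction with the same template used in the cell above."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def extract_response(decoded: str, prompt: str) -> str:
    """Drop the echoed prompt from decoded text, if it is present."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()


prompt = build_prompt("What is upstream in the oil and gas industry?")
```

With the model loaded, `extract_response(tokenizer.decode(outputs[0], skip_special_tokens=True), prompt)` would return only the text after `### Response:`.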


# Explanations