# Fine-tune a small LM on a customer-support dataset (Local MacBook)

This notebook is adapted to run on a local MacBook (CPU or Apple MPS). It uses a small model (`distilgpt2`) by default and is conservative with batch sizes/epochs so it can run locally for demonstration. If the public dataset is unavailable the notebook falls back to a tiny synthetic dataset.

### How to use
1. (Optional) In the first code cell uncomment the `!pip install` line to install dependencies.
2. Adjust settings in the `User settings` cell (model, dataset, epochs, batch_size).
3. Run cells top-to-bottom. Training on CPU/MPS is slow; expect longer runtimes.


In [1]:
%pip install --upgrade transformers datasets accelerate peft bitsandbytes trl

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
# Prevent tokenizers parallelism warnings & deadlocks
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

import torch
# If using MPS or CPU, pin_memory should be False to avoid warnings
# We'll set a flag used in TrainingArguments below
use_pin_memory = False if (not torch.cuda.is_available()) else True

In [3]:
import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

# Device selection (robust)
if torch.cuda.is_available():
    device = 'cuda'
elif getattr(torch.backends, 'mps', None) is not None and torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

# Choose dtype depending on device to avoid unsupported dtypes on CPU
if device == 'cuda' or device == 'mps':
    model_dtype = torch.float16
else:
    model_dtype = torch.float32

print(f'Using device: {device} (model dtype={model_dtype})')


  from .autonotebook import tqdm as notebook_tqdm


Using device: mps (model dtype=torch.float16)


In [4]:
# ----------------- User settings (adjust before run) -----------------
MODEL_ID = os.environ.get('MODEL_ID', 'microsoft/Phi-3-mini-4k-instruct') # CHANGED
DATASET_ID = os.environ.get('DATASET_ID', 'bitext/Bitext-customer-support-llm-chatbot-training-dataset')
OUTPUT_DIR = os.environ.get('OUTPUT_DIR', './local_ft_output') # CHANGED OUTPUT DIR to avoid mixing checkpoints
EPOCHS = int(os.environ.get('EPOCHS', '3'))
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', '1')) # Adjusted BATCH_SIZE for larger model
MAX_LENGTH = int(os.environ.get('MAX_LENGTH', '256')) # Increased length for modern model
USE_LORA = os.environ.get('USE_LORA', 'true').lower() in ('1', 'true', 'yes')

print('Settings:')
print(f' MODEL_ID={MODEL_ID}')
print(f' DATASET_ID={DATASET_ID}')
print(f' OUTPUT_DIR={OUTPUT_DIR}')
print(f' EPOCHS={EPOCHS}, BATCH_SIZE={BATCH_SIZE}, MAX_LENGTH={MAX_LENGTH}, USE_LORA={USE_LORA}')
# --------------------------------------------------------------------

Settings:
 MODEL_ID=microsoft/Phi-3-mini-4k-instruct
 DATASET_ID=bitext/Bitext-customer-support-llm-chatbot-training-dataset
 OUTPUT_DIR=./local_ft_output
 EPOCHS=3, BATCH_SIZE=1, MAX_LENGTH=256, USE_LORA=True


In [5]:
def safe_load_customer_dataset(dataset_id):
    try:
        ds = load_dataset(dataset_id)
        if isinstance(ds, dict) and 'train' in ds:
            return ds['train']
        return ds
    except Exception as e:
        print(f'Could not load dataset {dataset_id}: {e}')
        print('Falling back to a tiny synthetic customer support dataset for demo.')
        samples = [
            {'customer': "My order hasn't arrived, it's been 10 days.", 'agent': "I'm sorry. Can you share your order id?"},
            {'customer': 'I was charged twice for the same order.', 'agent': "I can help. Please share the transaction id."},
            {'customer': 'How do I return an item?', 'agent': "You can start a return from your orders page."},
        ]
        return Dataset.from_list(samples)

def build_prompt(row):
    if 'customer' in row and 'agent' in row:
        return f"Human: {row['customer']}\nAssistant: {row['agent']}\n"
    if 'input' in row and 'output' in row:
        return f"Human: {row['input']}\nAssistant: {row['output']}\n"
    if 'text' in row:
        return row['text'] + "\n"
    return str(row)

print('Helper functions defined')

Helper functions defined


In [6]:
raw_ds = safe_load_customer_dataset(DATASET_ID)
print(f'Loaded dataset size: {len(raw_ds)} (showing first 2 examples)')
for i,ex in enumerate(raw_ds[:2]):
    print('\n--- example', i, '---')
    print(ex)

# Map to text prompts
if isinstance(raw_ds[0], dict):
    def map_to_prompt(example):
        return {'text': build_prompt(example)}
    ds = raw_ds.map(map_to_prompt)
else:
    ds = raw_ds.map(lambda x: {'text': str(x)})

# Split and reduce for local run
if len(ds) > 2000:
    ds = ds.train_test_split(test_size=0.05, shuffle=True, seed=42)
    train_ds = ds['train'].select(range(4096))
    eval_ds = ds['test'].select(range(128))
else:
    split = ds.train_test_split(test_size=0.1, seed=42)
    train_ds = split['train']
    eval_ds = split['test']

print(f'Train size: {len(train_ds)}, Eval size: {len(eval_ds)}')

Loaded dataset size: 26872 (showing first 2 examples)

--- example 0 ---
flags

--- example 1 ---
instruction

--- example 2 ---
category

--- example 3 ---
intent

--- example 4 ---
response
Train size: 4096, Eval size: 128


In [7]:

# --- Prompt template setup (inserted) ---
TEMPLATE = "A"  # "A" = Human/Assistant, "B" = Instruction/Input/Response

def build_prompt_for_training(row):
    if TEMPLATE == "A":
        if isinstance(row, dict) and 'customer' in row and 'agent' in row:
            prompt = f"Human: {row['customer']}\nAssistant: {row['agent']}\n"
        elif isinstance(row, dict) and 'input' in row and 'output' in row:
            prompt = f"Human: {row['input']}\nAssistant: {row['output']}\n"
        elif isinstance(row, dict) and 'text' in row:
            prompt = row['text']
        else:
            prompt = str(row)
    else:
        instruction = row.get('instruction', "Answer the customer support query.") if isinstance(row, dict) else "Answer the customer support query."
        input_text = row.get('customer') or row.get('input') or row.get('context') or "" if isinstance(row, dict) else ""
        response = row.get('agent') or row.get('output') or row.get('response') or "" if isinstance(row, dict) else ""
        prompt = f"Instruction: {instruction}\nInput: {input_text}\nResponse: {response}\n"
    return {'text': prompt}

# Example and mapping (only run this mapping when raw_ds exists)
try:
    print("Mapping dataset to single 'text' column using TEMPLATE =", TEMPLATE)
    ds = raw_ds.map(build_prompt_for_training)
    print("Mapping complete. Columns:", ds.column_names)
except NameError:
    print("raw_ds not found in this scope; ensure you run the cell after loading raw_ds.")


Mapping dataset to single 'text' column using TEMPLATE = A
Mapping complete. Columns: ['flags', 'instruction', 'category', 'intent', 'response', 'text']


In [8]:
# ----------------- Tokenization and Model Loading -----------------
# Load tokenizer and model with a dtype appropriate for the device

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# ensure pad token exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '<|padding|>'})

# Load the model. Avoid forcing float16 on CPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=model_dtype,
    low_cpu_mem_usage=True
)

# Resize token embeddings if we added special tokens
model.resize_token_embeddings(len(tokenizer))

print('Model and tokenizer loaded.')

# Tokenization helper that uses the 'text' column (created earlier by map_to_prompt)
# --- Robust tokenize_for_lm (inserted) ---
def tokenize_for_lm(examples):
    # examples['text'] should be a list[str] after mapping
    texts = examples.get('text', [])
    normalized = []
    for t in texts:
        if t is None:
            normalized.append("")
        elif isinstance(t, str):
            normalized.append(t)
        elif isinstance(t, (list, tuple)):
            flat = []
            for x in t:
                if isinstance(x, (list, tuple)):
                    flat += [str(y) for y in x]
                else:
                    flat.append(str(x))
            normalized.append(" ".join(flat))
        else:
            normalized.append(str(t))
    outputs = tokenizer(
        normalized,
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length"
    )
    outputs["labels"] = outputs["input_ids"].copy()
    return outputs

# Sanity check helper
try:
    print("Sample ds[0]['text'] type:", type(ds[0]['text']))
    print("Sample text preview:", repr(ds[0]['text'])[:200])
except NameError:
    print("ds not available yet - run this after mapping the dataset.")


print('Tokenizing datasets (this may take a bit)...')
train_tok = train_ds.map(tokenize_for_lm, batched=True, remove_columns=[c for c in train_ds.column_names if c!='text'])
eval_tok = eval_ds.map(tokenize_for_lm, batched=True, remove_columns=[c for c in eval_ds.column_names if c!='text'])

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
print('Tokenization complete. Examples:')
for i in range(min(2, len(train_tok))):
    print(train_tok[i]['text'])


`torch_dtype` is deprecated! Use `dtype` instead!
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00,  5.58s/it]


Model and tokenizer loaded.
Sample ds[0]['text'] type: <class 'str'>
Sample text preview: '{\'flags\': \'B\', \'instruction\': \'question about cancelling order {{Order Number}}\', \'category\': \'ORDER\', \'intent\': \'cancel_order\', \'response\': "I\'ve understood you have a question re
Tokenizing datasets (this may take a bit)...
Tokenization complete. Examples:
{'flags': 'BILQ', 'instruction': 'what do i have to do to recover my profile key', 'category': 'ACCOUNT', 'intent': 'recover_password', 'response': 'Indeed! I\'m here to assist you in recovering your profile key. Let\'s tackle this together:\n\n1. Access our platform\'s "{{Login Page URL}}" to initiate the recovery process.\n2. Locate the "{{Forgot Password}}" option and select it to proceed.\n3. You will be prompted to provide your email address associated with your profile. Kindly input the relevant information.\n4. Keep an eye on your inbox as you should receive an email containing detailed instructions on how to recover

In [9]:
# --- Safe model initialization before training ---
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

print("Preparing model for training...")

if 'MODEL_ID' not in globals():
    raise RuntimeError("Please define MODEL_ID (e.g. 'gpt2', 'distilgpt2', etc.) before running this cell.")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

# Pick device and dtype
device = "cuda" if torch.cuda.is_available() else ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu")
dtype = torch.float16 if device in ("cuda", "mps") else torch.float32
print(f"Using device={device}, dtype={dtype}")

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

# Resize embeddings (in case tokenizer was modified)
model.resize_token_embeddings(len(tokenizer))
model.to(device)

print("✅ Model loaded and moved to", device)


Preparing model for training...
Using device=mps, dtype=torch.float16


Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  9.14s/it]


✅ Model loaded and moved to mps


In [10]:
# ----------------- PEFT / LoRA Configuration (optional) -----------------
use_peft = False
if USE_LORA:
    try:
        from peft import LoraConfig, get_peft_model
        use_peft = True
        lora_config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], 
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        print('Applied LoRA adapter to model (PEFT).')
    except Exception as e:
        print(f'PEFT/LoRA unavailable or failed: {e}. Continuing without LoRA.')

print('use_peft =', use_peft)


'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Applied LoRA adapter to model (PEFT).
use_peft = True


  warn("The installed version of bitsandbytes was compiled without GPU support. "


In [11]:

# --- Ensure model & tokenizer exist before Trainer ---
# This cell is inserted to guarantee `model` and `tokenizer` are defined.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

if 'MODEL_ID' not in globals():
    raise RuntimeError("Please define MODEL_ID before running this cell (e.g., MODEL_ID='gpt2').")

print("Preparing model from MODEL_ID =", MODEL_ID)

# Load tokenizer (and ensure pad token)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token":"<|pad|>"})

# Choose device and dtype
device = "cuda" if torch.cuda.is_available() else ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu")
dtype = torch.float16 if device in ("cuda","mps") else torch.float32
print("Device:", device, "dtype:", dtype)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

# Resize embeddings to tokenizer (just in case)
model.resize_token_embeddings(len(tokenizer))
model.to(device)

print("Model and tokenizer are ready. Model on device:", next(model.parameters()).device)


Preparing model from MODEL_ID = microsoft/Phi-3-mini-4k-instruct
Device: mps dtype: torch.float16


Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.72s/it]


Model and tokenizer are ready. Model on device: mps:0


In [12]:
# ----------------- Training Arguments and Trainer -----------------
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_steps=10,
    save_total_limit=2,
    fp16=(device in ('cuda', 'mps')),
    remove_unused_columns=False,
    push_to_hub=False,
    dataloader_num_workers=2,
    dataloader_pin_memory=use_pin_memory
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
    data_collator=data_collator,
)

print('Starting training... (this may be slow on CPU/MPS)')

Starting training... (this may be slow on CPU/MPS)


In [None]:
# Better generation: multi-sample + repetition controls
from transformers import AutoTokenizer
import torch, textwrap

tokenizer = tokenizer  # already loaded in notebook
model = model          # already attached and on device

device = next(model.parameters()).device
print("Model device:", device)

# Use instruction-style prompt for clearer behaviour
prompt = (
    "Instruction: You are a friendly, helpful customer support assistant.\n"
    "Input: I haven't received my refund after 10 days. What should I do?\n"
    "Response:"
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

gen_kwargs = dict(
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.92,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    num_return_sequences=3,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=getattr(tokenizer, "pad_token_id", tokenizer.eos_token_id),
)

outputs = model.generate(input_ids, **gen_kwargs)

print("\n=== Candidates ===\n")
for i, out in enumerate(outputs):
    txt = tokenizer.decode(out, skip_special_tokens=True)
    # trim to response portion (after "Response:" or "Assistant:")
    if "Response:" in txt:
        txt = txt.split("Response:",1)[1].strip()
    elif "Assistant:" in txt:
        txt = txt.split("Assistant:",1)[1].strip()
    print(f"--- Candidate {i+1} ---\n{txt}\n")


Device: mps dtype: torch.float16
Tokenizer size: 50258
Base config.vocab_size (original): 50257
Loading base model with ignore_mismatched_sizes=True ...
Base model loaded. Embedding shape (before resize): torch.Size([50257, 768])
Resizing token embeddings: 50257 -> 50258


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding shape (after resize): torch.Size([50258, 768])
Attaching adapter from OUTPUT_DIR ...
Adapter attached.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model on device: mps:0
=== Generated ===
Human: I haven't received my refund after 10 days. What should I do?
Assistant: I have to answer the following questions. First, I have to provide a proof of account information. If you have a refund, I will provide it to you as soon as I have provided the information to you. Second, I am not responsible for any inconvenience. I would appreciate any inconvenience. Third, I will provide you with the information I have provided to you in advance of my refund. I will not be responsible for any inconvenience. Fourth, I will not be responsible for any inconvenience. I will not be responsible for any inconvenience. Fifth, I will not be responsible for any inconvenience. If
