# Phase-1: Fine-Tuning Mistral-7B-Instruct on a Synthetic Marketing KPI Dataset

**Goal:** Fine-tune Mistral-7B-Instruct-v0.2 with QLoRA on synthetic marketing KPI data so that it reports a campaign's conversion rate and returns a matching recommendation.

**Scope of Phase-1:**
- Generate a synthetic dataset of 3,000 campaign records
- Format each record as an instruction prompt and tokenize it
- Apply QLoRA (LoRA adapters + 4-bit quantization)
- Train without evaluation or early stopping
- Save LoRA adapters & tokenizer

**Outcome:** A LoRA adapter and tokenizer saved to disk, ready to be attached to the base model for marketing KPI analysis in a portfolio project.

## Step 0: Install Dependencies

In [1]:
!pip install -q transformers datasets peft accelerate bitsandbytes


## Step 1: Imports & GPU Check

In [2]:
import torch, json, random
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model

print("CUDA:", torch.cuda.is_available())


CUDA: True


## Step 2: Quantization Config (QLoRA)

In [3]:
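bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matrix multiplications
    bnb_4bit_use_double_quant=True         # also quantize the quantization constants
)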
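
To see why 4-bit loading matters on a single Kaggle GPU, the sketch below gives a rough back-of-the-envelope estimate of weight storage at 16-bit versus 4-bit precision. The helper `weight_memory_gib` is hypothetical, and the parameter count is taken from the `print_trainable_parameters()` output in Step 4. The estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate storage for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Base-model parameter count: "all params" minus the trainable LoRA params (Step 4 output).
BASE_PARAMS = 7_248_547_840 - 6_815_744

fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
nf4_gib = weight_memory_gib(BASE_PARAMS, 4)

print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```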

## Step 3: Load Base Model & Tokenizer

In [4]:
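model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse EOS as the padding token so sequences can be padded to a fixed length.
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit and let accelerate place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# The KV cache only speeds up generation, so disable it for training.
model.config.use_cache = False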


## Step 4: Attach LoRA

In [5]:
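lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor (alpha / r = 2)
    target_modules=["q_proj", "v_proj"],  # adapt only the attention query and value projections
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM"
)

# Wrap the quantized base model; only the LoRA matrices remain trainable.
model = get_peft_model(model, lora_config)
model.train()
model.print_trainable_parameters()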

trainable params: 6,815,744 || all params: 7,248,547,840 || trainable%: 0.0940
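
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the published Mistral-7B architecture (hidden size 4096, 32 decoder layers, 8 key/value heads with head dimension 128); these figures are not stated in this notebook, so they are labelled as assumptions in the code. The helper `lora_params` is hypothetical.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


R = 16             # from lora_config
HIDDEN = 4096      # assumption: Mistral-7B hidden size
KV_DIM = 8 * 128   # assumption: 8 key/value heads x head dimension 128
N_LAYERS = 32      # assumption: number of decoder layers
TOTAL = 7_248_547_840  # "all params" from the output above

# q_proj maps HIDDEN -> HIDDEN; v_proj maps HIDDEN -> KV_DIM (grouped-query attention).
per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
trainable = per_layer * N_LAYERS

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / TOTAL:.4f}")
```

Under these assumptions the arithmetic matches the reported 6,815,744 trainable parameters, which suggests the adapters were attached to `q_proj` and `v_proj` in every layer as intended.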


## Step 5: Generate Synthetic Marketing KPI Dataset

In [6]:
random.seed(42)  # arbitrary fixed seed so the generated dataset is reproducible

dataset = []

for _ in range(3000):
    # Draw each KPI from a small fixed set of values.
    impressions = random.choice([10000, 25000, 50000, 75000, 100000])
    conversions = random.choice([10, 50, 150, 300, 500])
    budget = random.choice([500, 1000, 2000, 5000, 10000])
    cr = conversions / impressions * 100  # conversion rate in percent

    # Rule-based target: the recommendation depends only on the conversion rate.
    if cr < 0.5:
        suggestion = "Conversion rate is low. Optimize creative and targeting."
    elif cr < 1.5:
        suggestion = "Conversion rate is moderate. Improve CTA and A/B testing."
    else:
        suggestion = "Conversion rate is good. Consider scaling budget."

    dataset.append({
        "instruction": "Analyze campaign performance and suggest improvements.",
        "input": f"Impressions: {impressions}, Conversions: {conversions}, Budget: ${budget}",
        "output": f"Conversion rate is {cr:.2f}%. {suggestion}"
    })

# Write one JSON object per line (JSONL).
with open("/kaggle/working/marketing_kpi.jsonl", "w") as f:
    for record in dataset:
        f.write(json.dumps(record) + "\n")
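
Because each KPI is drawn from a five-value list, the 3,000 records contain heavy duplication. The sketch below enumerates every possible combination to show how many distinct inputs and targets the generator can produce. The helper `suggestion_for` is a hypothetical refactoring of the rule above, not part of the notebook.

```python
from itertools import product

IMPRESSIONS = [10000, 25000, 50000, 75000, 100000]
CONVERSIONS = [10, 50, 150, 300, 500]
BUDGETS = [500, 1000, 2000, 5000, 10000]


def suggestion_for(cr: float) -> str:
    """Same thresholds as the generator above."""
    if cr < 0.5:
        return "Conversion rate is low. Optimize creative and targeting."
    if cr < 1.5:
        return "Conversion rate is moderate. Improve CTA and A/B testing."
    return "Conversion rate is good. Consider scaling budget."


inputs, outputs = set(), set()
for impressions, conversions, budget in product(IMPRESSIONS, CONVERSIONS, BUDGETS):
    cr = conversions / impressions * 100
    inputs.add(f"Impressions: {impressions}, Conversions: {conversions}, Budget: ${budget}")
    outputs.add(f"Conversion rate is {cr:.2f}%. {suggestion_for(cr)}")

print(len(inputs), "distinct inputs")
print(len(outputs), "distinct outputs")
```

Two things follow from this: the model sees each distinct input many times over three epochs, and the budget never influences the target, so the adapter cannot learn anything budget-dependent from this data.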

## Step 6: Load Dataset

In [7]:
dataset = load_dataset(
    "json",
    data_files="/kaggle/working/marketing_kpi.jsonl",
    split="train"
)


## Step 7: Tokenization

In [8]:
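def tokenize_fn(examples):
    # Build one Alpaca-style prompt per record (the batch arrives as column lists).
    texts = []
    for inst, inp, out in zip(
        examples["instruction"],
        examples["input"],
        examples["output"]
    ):
        texts.append(
            f"### Instruction:\n{inst}\n\n"
            f"### Input:\n{inp}\n\n"
            f"### Response:\n{out}"
        )

    # Pad or truncate every sequence to exactly 512 tokens.
    tokens = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512
    )
    # Labels mirror input_ids; the collator in Step 8 rebuilds them and masks padding.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens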


tokenized_dataset = dataset.map(
    tokenize_fn,
    batched=True,
    remove_columns=dataset.column_names
)

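
The prompt template used in `tokenize_fn` has to be reproduced exactly at inference time, stopping right after `### Response:\n` so the model continues from there. A minimal sketch of a shared formatter is shown below; `build_prompt` is a hypothetical helper that is not part of this notebook.

```python
from typing import Optional


def build_prompt(instruction: str, input_text: str, output: Optional[str] = None) -> str:
    """Format a record with the training template; omit `output` for inference."""
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n"
    )
    return prompt if output is None else prompt + output


INSTRUCTION = "Analyze campaign performance and suggest improvements."
KPI_INPUT = "Impressions: 10000, Conversions: 300, Budget: $2000"

# Training text includes the target; inference text stops after the response header.
train_text = build_prompt(
    INSTRUCTION,
    KPI_INPUT,
    "Conversion rate is 3.00%. Conversion rate is good. Consider scaling budget.",
)
infer_text = build_prompt(INSTRUCTION, KPI_INPUT)

print(infer_text)
```

Keeping one formatter for both phases avoids subtle mismatches (an extra space or missing newline) between the training and inference prompts.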

## Step 8: Data Collator

In [9]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal (next-token) language modeling, not masked LM
)
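
With `mlm=False`, the collator rebuilds `labels` from `input_ids` and replaces every padding position with `-100`, the index that the loss ignores. The pure-Python sketch below imitates that masking step on made-up token ids. `causal_lm_labels` is a hypothetical stand-in for the collator's behavior, the EOS id of 2 is an assumption, and padding is shown on the right for illustration only.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, replacing padding positions with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


EOS_ID = 2        # assumption: EOS token id
PAD_ID = EOS_ID   # Step 3 sets pad_token = eos_token, so both share one id

# Made-up ids: BOS, three content tokens, then padding up to length 8.
input_ids = [1, 415, 9137, 28723, PAD_ID, PAD_ID, PAD_ID, PAD_ID]
labels = causal_lm_labels(input_ids, PAD_ID)

print(labels)  # [1, 415, 9137, 28723, -100, -100, -100, -100]
```

One consequence of sharing the id: any genuine EOS token in a sequence is masked as well, so the model receives no training signal for when to stop. Capping `max_new_tokens` at inference is a simple workaround; a dedicated pad token would be the more thorough option.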

## Step 9: Training Arguments

In [10]:
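training_args = TrainingArguments(
    output_dir="/kaggle/working/mistral-marketing-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4 per device
    fp16=True,                      # mixed-precision training
    logging_steps=10,
    save_steps=500,                 # checkpoint every 500 optimizer steps
    save_total_limit=2,             # keep only the two most recent checkpoints
    optim="paged_adamw_8bit",       # paged 8-bit AdamW from bitsandbytes
    report_to="none"                # disable external experiment trackers
)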

## Step 10: Trainer

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

## Step 11: Train

In [12]:
trainer.train()

| Step | Training Loss |
|-----:|--------------:|
| 10 | 2.1294 |
| 20 | 1.224 |
| 30 | 0.6806 |
| 40 | 0.495 |
| 50 | 0.4566 |
| 60 | 0.4203 |
| 70 | 0.3603 |
| 80 | 0.3609 |
| 90 | 0.3493 |
| 100 | 0.345 |
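
The logged training loss is a mean cross-entropy per token (in nats), so exponentiating it gives a per-token perplexity that is sometimes easier to interpret. The sketch below converts the logged values; it assumes the loss is averaged over all unmasked tokens, prompt and response alike, which means the fixed template text contributes to how quickly the number drops.

```python
import math

# Training loss as logged every 10 steps (values copied from the log above).
logged_loss = {
    10: 2.1294, 20: 1.224, 30: 0.6806, 40: 0.495, 50: 0.4566,
    60: 0.4203, 70: 0.3603, 80: 0.3609, 90: 0.3493, 100: 0.345,
}

# Perplexity = exp(mean cross-entropy).
perplexity = {step: math.exp(loss) for step, loss in logged_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>3}: loss {logged_loss[step]:.4f} -> perplexity {ppl:.2f}")
```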


TrainOutput(global_step=2250, training_loss=0.2477432197994656, metrics={'train_runtime': 9503.8088, 'train_samples_per_second': 0.947, 'train_steps_per_second': 0.237, 'total_flos': 1.9678397202432e+17, 'train_loss': 0.2477432197994656, 'epoch': 3.0})
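
The figures in `TrainOutput` can be cross-checked against the training arguments. The sketch below recomputes the step count and throughput, assuming a single GPU (the device count is not stated in this notebook).

```python
NUM_EXAMPLES = 3000          # records generated in Step 5
PER_DEVICE_BATCH = 1         # per_device_train_batch_size
GRAD_ACCUM = 4               # gradient_accumulation_steps
EPOCHS = 3                   # num_train_epochs
NUM_DEVICES = 1              # assumption: single GPU
TRAIN_RUNTIME_S = 9503.8088  # train_runtime from TrainOutput

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = NUM_EXAMPLES // effective_batch
total_steps = steps_per_epoch * EPOCHS

samples_per_s = NUM_EXAMPLES * EPOCHS / TRAIN_RUNTIME_S
steps_per_s = total_steps / TRAIN_RUNTIME_S

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps     : {total_steps}")
print(f"samples/s           : {samples_per_s:.3f}")
print(f"steps/s             : {steps_per_s:.3f}")
print(f"runtime             : {TRAIN_RUNTIME_S / 3600:.2f} h")
```

Under the single-GPU assumption the recomputed values agree with the reported `global_step=2250`, `train_samples_per_second=0.947`, and `train_steps_per_second=0.237`.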

## Step 12: Save LoRA Adapters

In [13]:
OUTPUT_DIR = "/kaggle/working/marketing_kpi_lora"

# Save LoRA adapter weights
model.save_pretrained(OUTPUT_DIR)

# Save tokenizer
tokenizer.save_pretrained(OUTPUT_DIR)

print("Phase 1 completed successfully.")
print(f"LoRA adapter saved to: {OUTPUT_DIR}")

Phase 1 completed successfully.
LoRA adapter saved to: /kaggle/working/marketing_kpi_lora


# Phase-1 Completed ✅

**Checkpoint saved:**
- LoRA adapters: `/kaggle/working/marketing_kpi_lora/`
- Tokenizer: `/kaggle/working/marketing_kpi_lora/`

The base model is intentionally not saved. Phase-2 reloads it and attaches the adapters for inference and downstream GenAI tasks.

**Next Steps (Phase-2):**
- Load LoRA adapters
- Evaluate model on new KPI prompts
- Generate structured insights for portfolio showcase
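
As a starting point for Phase-2, the sketch below shows one way the saved adapter could be reloaded and queried. Both helpers (`load_finetuned`, `generate_insight`) are hypothetical and untested here, the quantization settings simply mirror Step 2, and the heavy imports are placed inside the functions so that defining them does not require a GPU environment.

```python
def load_finetuned(adapter_dir: str,
                   base_name: str = "mistralai/Mistral-7B-Instruct-v0.2"):
    """Reload the 4-bit base model and attach the saved LoRA adapter (sketch)."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same quantization settings as in Step 2.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    # The tokenizer was saved alongside the adapter in Step 12.
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base = AutoModelForCausalLM.from_pretrained(
        base_name, quantization_config=bnb_config, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()
    return model, tokenizer


def generate_insight(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a completion and return only the newly generated text."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # hard cap on the response length
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens so only the response remains.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A typical call would be `model, tokenizer = load_finetuned("/kaggle/working/marketing_kpi_lora")` followed by `generate_insight(model, tokenizer, prompt)`, where `prompt` uses the Step 7 template and ends with `### Response:\n`.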

_Phase-1 is frozen for reproducibility and portfolio use._