In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
# financial qna dataset
# https://huggingface.co/datasets/ibm-research/finqa
from datasets import load_dataset
dataset = load_dataset("ibm-research/finqa", split="train", trust_remote_code=True)

In [3]:
dataset

Dataset({
    features: ['id', 'pre_text', 'post_text', 'table', 'question', 'answer', 'final_result', 'program_re', 'gold_inds'],
    num_rows: 6251
})

##### 🔧 What Does This Do?

```python
def format_instruction(example):
    return {
        "text": f"### Question: {example['question']}\n### Context: {example['table']}\n### Answer: {example['answer']}"
    }

dataset = dataset.map(format_instruction)
```

---

##### 💡 Purpose:

This function **reformats the raw dataset entries into a text format that matches what you want the LLM to learn** — using **instruction-style prompts**.

---

##### 🧠 Step-by-Step Explanation:
**Formatted Output (Instruction Tuning Style):**

   ```python
   "text": f"### Question: {example['question']}\n### Context: {example['table']}\n### Answer: {example['answer']}"
   ```

   * This creates a string that clearly separates:

     * The **question** being asked
     * The **context** (the financial table relevant to answering it)
     * The **expected answer**
   * You’re turning structured fields into one long, readable **prompt-response training pair**.

   Illustrative output (made-up values; the real `table` field is rendered however Python prints it):

   ```
   ### Question: What is the net income for Q4?
   ### Context: {"Revenue": "500M", "Cost": "300M", "Quarter": "Q4"}
   ### Answer: 200M
   ```

**Mapping Over the Dataset:**

   ```python
   dataset = dataset.map(format_instruction)
   ```

   * Applies the formatting function to **every example** in your dataset.
   * Adds a `"text"` field with the formatted string that can be used directly for fine-tuning.
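
To make the effect of the mapping concrete, here is a minimal sketch that applies the same function to a single hand-written record. The `sample` dictionary and its values are hypothetical, and it assumes the `table` field is a list of rows, in which case the f-string simply embeds Python's list representation:

```python
def format_instruction(example):
    # Same template as in the notebook: question, context (table), answer
    return {
        "text": f"### Question: {example['question']}\n### Context: {example['table']}\n### Answer: {example['answer']}"
    }

# Hypothetical record, not taken from the dataset
sample = {
    "question": "What is the net income for Q4?",
    "table": [["Quarter", "Revenue", "Cost"], ["Q4", "500M", "300M"]],
    "answer": "200M",
}

formatted = format_instruction(sample)
print(formatted["text"])
```

With `dataset.map`, the returned dictionary is merged into each row, so the existing columns stay in place and `text` is added next to them.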

---

##### ✅ Why Is This Important?

* Causal LLMs are trained on **plain text sequences**, not on structured records.
* Your raw dataset is **structured** (with fields like `question`, `table`, `answer`), so it is not immediately usable by a model like Mistral.
* By reformatting, you're turning it into an **instruction-tuning format**: each example pairs a prompt (question + context) with the response the model should learn to produce.

---

##### 🏁 What Happens After This?

Once this `"text"` field is prepared:

* You **tokenize** this field (`dataset["text"]`) using a tokenizer.
* Then, you **fine-tune the model** to learn to map from question + context → answer.

In [5]:
def format_instruction(example):
    return {
        "text": f"### Question: {example['question']}\n### Context: {example['table']}\n### Answer: {example['answer']}"
    }

dataset = dataset.map(format_instruction)

Map:   0%|          | 0/6251 [00:00<?, ? examples/s]

##### Load Model with LoRA Support (QLoRA)

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

In [None]:
import torch
# choose a good base model for finetuning
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Configure bitsandbytes 4-bit quantization to reduce GPU memory usage:
# - load_in_4bit: store the model weights in 4-bit precision
# - bnb_4bit_compute_dtype: run the matrix multiplications in float16
# - bnb_4bit_use_double_quant: also quantize the quantization constants to save a little more memory
# - bnb_4bit_quant_type: use "nf4" (4-bit NormalFloat) quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                         
    bnb_4bit_compute_dtype=torch.float16,       
    bnb_4bit_use_double_quant=True,             
    bnb_4bit_quant_type="nf4"                   
)

# Load the tokenizer for the specified model
# 'use_fast=True' enables the use of the fast Rust-based tokenizer (recommended for speed)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Load the pre-trained causal language model with the quantization config
# device_map="auto" decides automatically where each layer is placed (GPU/CPU)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"                
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [23]:
print("Allocated:", torch.cuda.memory_allocated() / 1e9, "GB")
print("Reserved: ", torch.cuda.memory_reserved() / 1e9, "GB")

Allocated: 4.23515136 GB
Reserved:  4.737466368 GB
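
As a rough sanity check on these numbers, the sketch below estimates the memory needed for the weights alone at two precisions. The parameter count is the one this notebook reports via `print_trainable_parameters()`. The helper name is hypothetical, and the estimate deliberately ignores quantization constants, layers kept in higher precision, and the CUDA context, which is probably why the measured allocation is somewhat higher than the 4-bit figure:

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Total parameter count reported in this notebook (includes the small LoRA adapters)
NUM_PARAMS = 7_251_431_424

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

On this estimate, 4-bit loading cuts weight memory by about a factor of four. That is what makes a 7B model practical on a single consumer GPU.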


#### Configure LoRA (Using Rank 8, Alpha 16)

In [None]:
# Configure LoRA through PEFT (Parameter-Efficient Fine-Tuning); the imports are in an earlier cell
# LoRA (Low-Rank Adaptation) fine-tunes only a small number of additional parameters
# instead of updating the full model weights, making training efficient and lightweight.
# We do so because large language models have billions of parameters and training them fully is resource-intensive.
# LoRA injects trainable low-rank matrices into certain layers (like attention), reducing memory usage and training time.

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1
)

# Apply the LoRA configuration to the original model.
# This wraps the base model with LoRA adapters so only the relevant weights are made trainable.

model = get_peft_model(model, peft_config)

# Print the number of trainable parameters compared to total parameters in the model.
# This helps verify that LoRA is working correctly by updating only a small portion of the model.

model.print_trainable_parameters()


trainable params: 3,407,872 || all params: 7,251,431,424 || trainable%: 0.0470
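
The trainable-parameter count can be reproduced by hand. The sketch below rests on two assumptions that the notebook does not state. First, because `target_modules` is not set, PEFT falls back to its default for Mistral, taken here to be the `q_proj` and `v_proj` attention projections. Second, the model uses the published Mistral-7B dimensions: 32 layers, hidden size 4096, and a key/value projection width of 1024.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    # LoRA adds two matrices per adapted linear layer:
    # A with shape (r, in_features) and B with shape (out_features, r)
    return r * in_features + out_features * r

R = 8              # LoRA rank used in this notebook
NUM_LAYERS = 32    # assumed number of transformer layers
HIDDEN = 4096      # assumed hidden size (q_proj: 4096 -> 4096)
KV_DIM = 1024      # assumed key/value width (v_proj: 4096 -> 1024)

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * NUM_LAYERS

print(total)                                  # 3407872
print(f"{100 * total / 7_251_431_424:.4f}%")  # 0.0470%
```

Under these assumptions the result matches the printed `trainable params` exactly. Note that `lora_alpha` and `lora_dropout` do not change the parameter count: only the rank and the set of adapted layers do.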




#### Tokenize Dataset


##### 🔄 Tokenization vs. Embedding

##### 🔹 1. **Tokenization**

* Converts raw **text → tokens → token IDs (integers)**.
* Example (illustrative; real tokens and IDs depend on the tokenizer's vocabulary):

  ```
  Text: "What is ROI?"
  Tokens: ["What", "is", "ROI", "?"]
  Token IDs: [1547, 318, 2543, 30]
  ```
* This step is **required before** feeding data into any LLM.
* You do this using the model's **AutoTokenizer**.

##### 🔹 2. **Embedding**

* Converts token IDs into **dense vector representations (floats)** in high-dimensional space.
* Inside an LLM, the embedding layer is the first learned layer.
* Standalone embedding models (such as `langchain.embeddings.OpenAIEmbeddings`) instead return one vector per piece of text, used to **measure semantic similarity**, often in:

  * RAG pipelines
  * Search
  * Clustering
  * Similarity scoring

---

###### ⚠️ Key Differences:

| Feature      | Tokenization                           | Embedding (OpenAIEmbeddings etc.)              |
| ------------ | -------------------------------------- | ---------------------------------------------- |
| Converts     | Text → token IDs                       | Token IDs → dense vectors (e.g., 1536 dims)    |
| Used in      | Preprocessing for model training/infer | Semantic search / similarity / vector stores   |
| Produces     | Integers (ids)                         | Float vectors                                  |
| Required for | LLM model training/fine-tuning         | RAG, vector search (not LLM training directly) |
| Library used | `AutoTokenizer` from Transformers      | `OpenAIEmbeddings`, `HuggingFaceEmbeddings`    |
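
The two steps can be sketched with a toy example in plain Python. Everything here is hypothetical: the four-word vocabulary, the whitespace-based `toy_tokenize` helper, and the random embedding table. A real tokenizer works on subword units, and a real model learns its embedding table during training.

```python
import random

# Step 1: tokenization (text -> tokens -> integer IDs)
toy_vocab = {"What": 0, "is": 1, "ROI": 2, "?": 3}

def toy_tokenize(text):
    # Crude stand-in for a real tokenizer: split on whitespace and detach "?"
    return text.replace("?", " ?").split()

tokens = toy_tokenize("What is ROI?")
token_ids = [toy_vocab[token] for token in tokens]

# Step 2: embedding (integer IDs -> float vectors via a lookup table)
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(toy_vocab))
]
vectors = [embedding_table[token_id] for token_id in token_ids]

print(tokens)
print(token_ids)
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

For fine-tuning, only step 1 is done explicitly. The embedding lookup happens inside the model's forward pass.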


##### We are using the **tokenizer that was trained alongside the Mistral model**.

---

##### ✅ **Why using the same tokenizer as the model is critical:**

| Reason                 | Explanation                                                                                                                                                             |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Vocabulary match**   | Mistral was trained using a specific tokenizer with its own vocabulary. Using a different tokenizer may produce token IDs that don't align with what the model expects. |
| **Token ID alignment** | Mismatched token IDs → unpredictable output or degraded performance during fine-tuning/inference.                                                                       |
| **Special tokens**     | The tokenizer also handles special tokens (such as beginning-of-sequence, end-of-sequence, and padding tokens), whose strings and IDs are model-specific.               |
| **Formatting**         | Models like Mistral often use specific prompts or separator tokens — the tokenizer ensures they're handled properly.                                                    |

---

##### 🧠 Pro Tip:

You should **always** match the tokenizer and model versions (Mistral → Mistral's tokenizer) unless you're doing research with multi-tokenizer settings.


In [None]:
# Define a tokenize function to preprocess the dataset
# We apply truncation and padding to ensure all sequences are of the same length (512 tokens here),
# which is necessary for batch training and fits within memory limits of the model.

def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

# Set the tokenizer's padding token to the end-of-sequence token (EOS)
# We do so because some pretrained models (like Mistral) may not have a defined pad token.
# This prevents errors during padding and ensures consistency in sequence endings.

tokenizer.pad_token = tokenizer.eos_token

# Apply the tokenize function to the entire dataset using batched processing
# This transforms all samples into tokenized format (input_ids, attention_mask), ready for training.

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/6251 [00:00<?, ? examples/s]
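
To show what `truncation=True` and `padding="max_length"` do to a single sequence, here is a minimal pure-Python sketch. The helper `pad_or_truncate` is hypothetical and assumes right-side padding. The tokenizer's actual padding side is configurable and may differ.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    # Truncation: keep at most max_length tokens from the start
    ids = token_ids[:max_length]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(ids)
    # Padding: fill the remainder with the pad token (right-side padding assumed)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=2)
long = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=2)

print(short)  # {'input_ids': [5, 6, 7, 2, 2], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [5, 6, 7, 8, 9], 'attention_mask': [1, 1, 1, 1, 1]}
```

One consequence is worth checking on the real data. In this prompt template the answer comes last, so truncation drops the end of any example that exceeds 512 tokens, possibly including the `### Answer:` part.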

##### Train the Model

In [11]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling




In [None]:
# Enable gradient checkpointing to save GPU memory during backpropagation
# This trades compute for memory by re-computing intermediate activations on the fly
# Especially useful when fine-tuning large models on limited VRAM (e.g., 8GB GPUs)

model.gradient_checkpointing_enable()

# Ensure input embeddings have `requires_grad=True` 
# This is necessary when using PEFT (e.g., LoRA) so gradients are correctly computed for adapter layers
# Without this, you might encounter `RuntimeError: element 0 of tensors does not require grad...`

model.enable_input_require_grads()

In [None]:
# Set up the training arguments for the Hugging Face Trainer API
# - output_dir: where to save the model
# - per_device_train_batch_size: smaller batch to fit in limited VRAM
# - gradient_accumulation_steps: accumulate gradients over multiple steps to simulate a larger batch size
# - fp16: enable mixed precision training for faster performance and lower memory use
# - save_strategy: save model at the end of every epoch
# - report_to: disables logging to external tools (e.g., WandB)

training_args = TrainingArguments(
    output_dir="./finetuned_mistral",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    report_to="none"
)

# Create a data collator for Causal Language Modeling (CLM)
# - mlm=False: disables masked language modeling (used for BERT-style training)
# - Suitable for models like Mistral, GPT which are trained as autoregressive generators

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize the Hugging Face Trainer API with model, data, and training args
# - Trainer handles training loop, gradient accumulation, logging, saving, etc.
# - Works well with PEFT/LoRA for efficient fine-tuning

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

# Start the training process
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,1.7477
20,1.5909
30,1.3434
40,1.3127
50,1.216
60,1.2027
70,1.1768


KeyboardInterrupt: 
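
For context on how far this interrupted run got, the sketch below estimates the planned number of optimizer steps from the training arguments. It assumes a single GPU. The exact rounding of steps per epoch depends on the `transformers` version, so treat the totals as approximate.

```python
NUM_EXAMPLES = 6251        # rows in the training split
PER_DEVICE_BATCH = 1       # per_device_train_batch_size
GRAD_ACCUM = 4             # gradient_accumulation_steps
NUM_EPOCHS = 3             # num_train_epochs
NUM_DEVICES = 1            # assumption: one GPU

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = NUM_EXAMPLES // effective_batch   # floor; some versions round up
total_steps = steps_per_epoch * NUM_EPOCHS

completed = 70             # last logged step before the interrupt
print("effective batch size:", effective_batch)
print("approx. total optimizer steps:", total_steps)
print(f"completed: {100 * completed / total_steps:.1f}% of the planned run")
```

On this estimate, the run stopped after a small fraction of the first epoch. The adapter saved afterwards has therefore seen only a few hundred examples.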

In [None]:
# Save the LoRA adapter weights and the tokenizer
model.save_pretrained("./finetuned_mistral_lora")
tokenizer.save_pretrained("./finetuned_mistral_lora")

('./finetuned_mistral_lora\\tokenizer_config.json',
 './finetuned_mistral_lora\\special_tokens_map.json',
 './finetuned_mistral_lora\\tokenizer.model',
 './finetuned_mistral_lora\\added_tokens.json',
 './finetuned_mistral_lora\\tokenizer.json')

In [None]:
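# Merge the LoRA adapter weights into the base model weights
# This is useful when exporting the model for standalone inference:
# after merging, the model no longer depends on the PEFT (LoRA) framework
# Note: the base model was loaded in 4-bit above, so the merge is applied to quantized weights
# and may introduce small rounding differences compared with merging into a float16 base model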
merged_model = model.merge_and_unload()

# Save the merged model to disk so it can be used without PEFT/LoRA at inference time
# The saved model directory will include the model config and weights
merged_model.save_pretrained("./finetuned_mistral_merged")

# Save the tokenizer associated with the model
# This ensures consistent tokenization during inference
tokenizer.save_pretrained("./finetuned_mistral_merged")



('./finetuned_mistral_merged\\tokenizer_config.json',
 './finetuned_mistral_merged\\special_tokens_map.json',
 './finetuned_mistral_merged\\tokenizer.model',
 './finetuned_mistral_merged\\added_tokens.json',
 './finetuned_mistral_merged\\tokenizer.json')

In [None]:
# Prepare the input prompt in the format used during fine-tuning
# Use special markers like ### Question and ### Context to guide the model’s response
# Replace <your table> with the actual table/contextual information from your domain
input_text = "### Question: What is the total revenue in 2022?\n### Context: <your table>"

# Tokenize the input and move it to the GPU (cuda) for inference
# 'return_tensors="pt"' converts input into PyTorch tensors suitable for model input
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate the model's response using causal language modeling
# 'max_new_tokens' controls how many tokens the model can generate in its answer
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode the output tokens into readable text
# 'skip_special_tokens=True' removes any padding or special tokens used during generation
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### Question: What is the total revenue in 2022?
### Context: <your table>

To find the total revenue in 2022, we need to sum up the revenue for each month in 2022. Here's how you can do it using SQL:

```sql
SELECT SUM(revenue) AS TotalRevenue
FROM your_table
WHERE YEAR(date_column) = 2022;
```

Replace `your_table` with the name of your table and
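
The response above does not follow the `### Answer:` format used in training. One likely contributor is that the prompt stops after the context and never includes the answer marker. The very short, interrupted training run is probably another. A minimal sketch of a prompt builder that ends with the marker, plus a helper to pull the answer out of the generated text, is shown below. Both helper names are hypothetical, and the `generated` string is a made-up stand-in for model output.

```python
ANSWER_MARKER = "### Answer:"

def build_prompt(question, table):
    # Mirror the training template and end with the answer marker,
    # so the model's continuation is the answer itself
    return f"### Question: {question}\n### Context: {table}\n{ANSWER_MARKER}"

def extract_answer(generated_text):
    # Keep only what follows the first answer marker; empty string if it is missing
    _, _, tail = generated_text.partition(ANSWER_MARKER)
    return tail.strip()

prompt = build_prompt(
    "What is the total revenue in 2022?",
    [["Year", "Revenue"], ["2022", "500M"]],
)

# Made-up stand-in for tokenizer.decode(model.generate(...)) output
generated = prompt + " 500M"
print(extract_answer(generated))  # 500M
```

The prompt string would then be passed to the tokenizer and `generate` exactly as in the cell above.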
