# **Financial QA - LLM based solutions**

## **Task**

We would like to see what you can build with the data provided. Feel free to present the results in any format you prefer and explore any additional ideas you have with the dataset. You may use any model or architecture of your choice. The goal is to demonstrate your knowledge and experience. We are particularly interested in the logic and reasoning behind your choice of accuracy metrics and your ability to communicate your solutions and ideas effectively.

We’d like you to demonstrate an LLM-driven prototype that can answer questions based on financial documents (text, tables, figures, etc.).

Here is a snippet of the JSON, which contains the question as well as the correct answer that your solution should aim to produce:

"**qa**": {

"**question**": "what was the percentage change in the net cash from operating activities from 2008 to 2009",

"**answer**": "14.1%",

}

## **Dataset**

We have been provided with the labelled ConvFinQA dataset (https://github.com/czyssrs/ConvFinQA) and have been advised to use train.json for this task.

General fields for all data:

"**pre_text**": the texts before the table;

"**post_text**": the text after the table;

"**table**": the table;

"**id**": unique example id.

If the conversation is a Type I simple conversation, i.e., the decomposition of one FinQA question, the "annotation" field contains the following:

"**annotation**":

{
  
  "**original_program**": the program of the original FinQA question;
  
  "**dialogue_break**": the conversation, as a list of question turns.
  
  "**turn_program**": the ground truth program for each question, corresponding to the list in "dialogue_break".
  
  "**qa_split**": this field indicates the source of each question turn - 0 if from the decomposition of the first FinQA question, 1 if from the second. For the Type I simple conversations, this field is all 0s.
  
  "**exe_ans_list**": the execution results of each question turn.

}
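The per-turn lists in "annotation" are aligned by index, so a conversation can be walked one turn at a time. The following is a minimal sketch on a hypothetical toy record; the questions, programs and values are invented for illustration and are not taken from train.json, and `iter_turns` is a helper name introduced here.

```python
# Hypothetical toy record that mimics the "annotation" layout described above.
toy_annotation = {
    "dialogue_break": [
        "what was the revenue in 2019?",
        "and in 2018?",
        "what was the difference?",
    ],
    "turn_program": ["120.0", "100.0", "subtract(120.0, 100.0)"],
    "qa_split": [0, 0, 0],  # all zeros for a Type I conversation
    "exe_ans_list": [120.0, 100.0, 20.0],
}


def iter_turns(annotation):
    """Yield (turn_idx, question, program, answer) for each aligned turn."""
    turns = zip(
        annotation["dialogue_break"],
        annotation["turn_program"],
        annotation["exe_ans_list"],
    )
    for turn_idx, (question, program, answer) in enumerate(turns):
        yield turn_idx, question, program, answer


for turn_idx, question, program, answer in iter_turns(toy_annotation):
    print(f"turn {turn_idx}: {question} -> {program} = {answer}")
```

Note that the later questions ("and in 2018?") only make sense given the earlier turns, which is why the conversation history has to be part of the model input.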



## **Strategy**

We break down the problem as follows:

* Given a financial context, which is built from the text before the table, the text after the table, and the table itself, answer the question from the context.

* The answer is not always stated directly in the context; the model (LLM) may need to work out the computation and then derive the answer from the context. For example, if the context gives the sales volume for two consecutive years and the question asks for the percentage change from the first year to the second, the answer has to be computed. The answer may also depend on previous turns of the conversation.
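As a worked instance of such a derived answer, the percentage change between two yearly figures is (new - old) / old. The sketch below uses hypothetical figures, chosen only so that the result lines up with the 14.1% sample answer shown earlier; `percentage_change` is a helper name introduced here.

```python
def percentage_change(old: float, new: float) -> float:
    """Return the relative change from old to new as a fraction."""
    if old == 0:
        raise ValueError("percentage change is undefined when the base value is 0")
    return (new - old) / old


# Hypothetical figures for two consecutive years (illustrative only).
cash_2008 = 181001.0
cash_2009 = 206588.0

change = percentage_change(cash_2008, cash_2009)
print(f"{change:.1%}")  # prints 14.1%
```

The model has to pick the two figures out of the table, choose the right base year, and then carry out this arithmetic, which is where most errors are likely to arise.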

Because the task can be fairly complex, we need a model (LLM) with strong reasoning abilities. The proprietary OpenAI models **gpt-4o** and **gpt-4o-mini** are quick and efficient for such tasks. **gpt-4o** is much more expensive than **gpt-4o-mini**, so we compare the two.

Another approach is to fine-tune an open-source model on the task. In that scenario, we choose a model that is pre-trained and evaluated on common sense, language understanding, and logical reasoning. One such model is **Phi-2** by Microsoft, which showed nearly state-of-the-art performance among models with fewer than 13 billion parameters. We also prefer a small model so that it can run in a Google Colab environment and allows quick iteration.

So our final strategy is laid out here:

1. Solve the task with appropriate data pre-processing and prompt engineering, using a well-suited proprietary model, and evaluate its performance.
2. Design a fine-tuning pipeline for a small LLM such as Phi-2 on the given task.

## **Environment Loading and Data Loader Classes**

In [None]:
# Cell 1: Setup and Dependencies with Progress Tracking
!pip install transformers datasets torch pandas numpy scikit-learn tqdm wandb python-dotenv accelerate huggingface_hub sentencepiece bitsandbytes peft tensorboard


In [None]:
# Cell 2: Import Libraries with Monitoring Tools
import json
from typing import List, Dict, Any, Tuple
from datasets import Dataset
import pandas as pd
import numpy as np
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb
import huggingface_hub
from google.colab import drive
import psutil
import gc
from tqdm.notebook import tqdm
import wandb
from datetime import datetime

Create the base path and the directories used for loading train.json and saving the results files.

In [None]:
BASE_PATH = "/content/financial_qa"
DATA_DIR = f"{BASE_PATH}/ConvFinQA dataset"
OUTPUT_DIR = f"{BASE_PATH}/results"
LOG_DIR = f"{BASE_PATH}/logs"

# Create directories
os.makedirs(BASE_PATH, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(LOG_DIR, exist_ok=True)

**NOTE:** Upload train.json into the ConvFinQA dataset folder by dragging and dropping it in the file browser in the left pane.

In [None]:
# Cell 3: Data Loader Implementation
class ConvFinQADataLoader:
    def __init__(self, data_path: str):
        """
        Initialize the data loader.

        Args:
            data_path (str): Path to the JSON data file
        """
        self.data_path = data_path
        self.data = None
        self.train_data = None
        self.dev_data = None

    def load_data(self) -> List[Dict[str, Any]]:
        """
        Load the JSON data file.

        Returns:
            List[Dict[str, Any]]: List of data examples
        """
        with open(self.data_path, 'r') as f:
            self.data = json.load(f)
        return self.data

    def prepare_dataset(self, data: List[Dict[str, Any]]) -> Dataset:
      """
      Prepare the data for training by combining pre_text, table, post_text and QA information.

      Args:
          data (List[Dict[str, Any]]): List of data examples

      Returns:
          Dataset: HuggingFace Dataset object
      """
      processed_data = []

      for example in data:
          # Combine pre_text, table, and post_text into a single text
          pre_text = " ".join(example["pre_text"])
          post_text = " ".join(example["post_text"])

          # Convert table to string representation
          table = example["table"]
          table_str = "\n".join(["\t".join(str(cell) for cell in row) for row in table])  # Convert all cells to strings

          # Get the conversation turns and their corresponding programs
          annotation = example["annotation"]
          dialogue_break = annotation["dialogue_break"]
          turn_programs = annotation["turn_program"]
          exe_ans_list = annotation["exe_ans_list"]

          # Process each turn in the conversation
          for turn_idx, (question, program, answer) in enumerate(zip(dialogue_break, turn_programs, exe_ans_list)):
              # Create context with previous turns
              context = f"{pre_text}\n{table_str}\n{post_text}"
              if turn_idx > 0:
                  # Add previous Q&A pairs for context
                  prev_qa = "\n".join([
                      f"Previous Question: {dialogue_break[i]}\nAnswer: {str(exe_ans_list[i])}"  # Convert answer to string
                      for i in range(turn_idx)
                  ])
                  context = f"{context}\n\nPrevious conversation:\n{prev_qa}"

              # Ensure all fields are strings
              processed_example = {
                  "text": str(context),
                  "question": str(question),
                  "program": str(program),
                  "answer": str(answer),
                  "turn_idx": turn_idx,
                  "example_id": str(example["id"])
              }
              processed_data.append(processed_example)

      # Create DataFrame first to ensure consistent types
      df = pd.DataFrame(processed_data)
      print(df.shape)
      # Convert to Dataset
      return Dataset.from_pandas(df)

    def get_train_dev_split(self, train_ratio: float = 0.8, use_subset: float = 0.2) -> Tuple[Dataset, Dataset]:
        """
        Split the data into training and development sets.

        Args:
            train_ratio (float): Ratio of training data
            use_subset (float): Fraction of the shuffled data to keep before splitting

        Returns:
            Tuple[Dataset, Dataset]: Training and development datasets
        """
        if self.data is None:
            self.load_data()

        # Shuffle the data
        np.random.shuffle(self.data)
        # Keep only a fraction of the data (20% by default) for quick iteration
        self.data = self.data[:int(len(self.data) * use_subset)]

        # Split the data
        split_idx = int(len(self.data) * train_ratio)
        train_data = self.data[:split_idx]
        dev_data = self.data[split_idx:]

        # Convert to datasets
        self.train_data = self.prepare_dataset(train_data)
        self.dev_data = self.prepare_dataset(dev_data)

        return self.train_data, self.dev_data

    def main(self) -> Tuple[Dataset, Dataset]:
        """
        Main function to load and prepare the dataset.

        Returns:
            Tuple[Dataset, Dataset]: Training and development datasets
        """
        # Load the data
        self.load_data()

        # Create train/dev split
        train_dataset, dev_dataset = self.get_train_dev_split()

        return train_dataset, dev_dataset

In [None]:
DATA_PATH = f"{DATA_DIR}/train.json"

In [None]:
# Load and Prepare Data
# At the moment we are only considering 20% of the data for a quick iteration over the data
data_loader = ConvFinQADataLoader(DATA_PATH)
print("Loading datasets...")
with tqdm(total=2) as pbar:
    train_dataset, dev_dataset = data_loader.get_train_dev_split(use_subset = 0.2)
    pbar.update(2)

print(f"Training set size: {len(train_dataset)}")
print(f"Development set size: {len(dev_dataset)}")

# Show an example
print("\nExample from training set:")
print(train_dataset[0])

Loading datasets...


  0%|          | 0/2 [00:00<?, ?it/s]

(1808, 6)
(455, 6)
Training set size: 1808
Development set size: 455

Example from training set:
{'text': 'republic services , inc . notes to consolidated financial statements 2014 ( continued ) 12 . share repurchases and dividends share repurchases share repurchase activity during the years ended december 31 , 2018 and 2017 follows ( in millions except per share amounts ) : .\n\t2018\t2017\nnumber of shares repurchased\t10.7\t9.6\namount paid\t$ 736.9\t$ 610.7\nweighted average cost per share\t$ 69.06\t$ 63.84\nas of december 31 , 2018 , there were no repurchased shares pending settlement . in october 2017 , our board of directors added $ 2.0 billion to the existing share repurchase authorization that now extends through december 31 , 2020 . share repurchases under the program may be made through open market purchases or privately negotiated transactions in accordance with applicable federal securities laws . while the board of directors has approved the program , the timing of any pu

In [None]:
dev_dataset

Dataset({
    features: ['text', 'question', 'program', 'answer', 'turn_idx', 'example_id'],
    num_rows: 455
})
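Because the split is made over whole conversations before they are expanded into turns, no conversation should contribute rows to both sets. The following is a minimal sketch of that sanity check on hypothetical toy rows; the ids are invented and `overlapping_ids` is a helper name introduced here.

```python
from typing import Dict, Iterable, Set


def overlapping_ids(train_rows: Iterable[Dict], dev_rows: Iterable[Dict]) -> Set[str]:
    """Return the example ids that occur in both splits."""
    train_ids = {row["example_id"] for row in train_rows}
    dev_ids = {row["example_id"] for row in dev_rows}
    return train_ids & dev_ids


# Hypothetical toy rows (ids invented for illustration); one row per turn.
toy_train = [
    {"example_id": "doc_A", "turn_idx": 0},
    {"example_id": "doc_A", "turn_idx": 1},
    {"example_id": "doc_B", "turn_idx": 0},
]
toy_dev = [
    {"example_id": "doc_C", "turn_idx": 0},
    {"example_id": "doc_C", "turn_idx": 1},
]

leaked = overlapping_ids(toy_train, toy_dev)
print(f"conversations shared between splits: {len(leaked)}")
```

The same function should also work on `train_dataset` and `dev_dataset` directly, since iterating over a Hugging Face `Dataset` yields one dict per row.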

## **Approach 1 - OpenAI models and Prompt Based Task**

In [None]:
from openai import OpenAI
from typing import Dict, List, Optional
import os
from tqdm import tqdm
import time

class OpenAIFinancialQA:
    def __init__(self, api_key: str = None, model: str = "gpt-4o-mini"):
        """
        Initialize OpenAI Financial QA model

        Args:
            api_key (str): OpenAI API key
            model (str): Model to use (e.g. 'gpt-4o-mini' or 'gpt-4o')
        """
        self.api_key = api_key or os.getenv("OPENAI_API_KEY")
        if not self.api_key:
            raise ValueError("OpenAI API key is required")

        self.client = OpenAI(api_key=self.api_key)
        self.model = model

    def predict(self, text: str, question: str) -> Dict:
        """
        Generate prediction for a single question
        """
        prompt = f"""You are a financial analyst assistant. Your task is to answer questions about financial data with precise numerical answers.

Context:
{text}

Question:
{question}

Please follow these steps:
1. Identify the relevant answer from the context
2. Explain your reasoning step by step
3. Provide the final numerical answer

Format your response as:
NUMBERS: [List the relevant numbers you identified]
REASONING: [Your step by step explanation]
ANSWER: [The final numerical answer]

Remember to:
- Be precise with numerical calculations
- Show your work clearly
- Express the final answer in the same format/unit as the question implies"""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a precise financial calculator that provides accurate numerical answers."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1,  # Low temperature for consistent, precise answers
                max_tokens=500
            )

            return {
                'full_response': response.choices[0].message.content,
                'parsed_response': self._parse_response(response.choices[0].message.content)
            }

        except Exception as e:
            print(f"Error in prediction: {str(e)}")
            return None

    def _parse_response(self, response: str) -> Dict:
        """Parse structured response into components"""
        parts = {}

        try:
            # Split by sections
            current_section = None
            section_content = []

            for line in response.split('\n'):
                if line.startswith('NUMBERS:'):
                    if current_section and section_content:
                        parts[current_section.lower()] = '\n'.join(section_content)
                        section_content = []
                    current_section = 'NUMBERS'
                    section_content.append(line.replace('NUMBERS:', '').strip())
                elif line.startswith('REASONING:'):
                    if current_section and section_content:
                        parts[current_section.lower()] = '\n'.join(section_content)
                        section_content = []
                    current_section = 'REASONING'
                    section_content.append(line.replace('REASONING:', '').strip())
                elif line.startswith('ANSWER:'):
                    if current_section and section_content:
                        parts[current_section.lower()] = '\n'.join(section_content)
                        section_content = []
                    current_section = 'ANSWER'
                    section_content.append(line.replace('ANSWER:', '').strip())
                elif line.strip():
                    section_content.append(line.strip())

            # Add the last section
            if current_section and section_content:
                parts[current_section.lower()] = '\n'.join(section_content)

            return parts
        except Exception as e:
            print(f"Error parsing response: {str(e)}")
            return {'raw_response': response}

    def evaluate_dataset(self, dataset: Dataset, num_samples: Optional[int] = None) -> Dict:
        """
        Evaluate model on a dataset
        """
        results = []
        total_samples = len(dataset) if num_samples is None else min(num_samples, len(dataset))

        for i in tqdm(range(total_samples)):
            example = dataset[i]

            # Add delay to respect API rate limits
            time.sleep(1)

            prediction = self.predict(example['text'], example['question'])

            if prediction:
                result = {
                    'example_id': i,
                    'question': example['question'],
                    'true_answer': example['answer'],
                    'predicted': prediction['parsed_response'].get('answer', ''),
                    'reasoning': prediction['parsed_response'].get('reasoning', ''),
                    'numbers_used': prediction['parsed_response'].get('numbers', ''),
                    'full_response': prediction['full_response']
                }
                results.append(result)

        # Calculate metrics
        metrics = self._calculate_metrics(results)

        return {
            'results': results,
            'metrics': metrics
        }

    def _calculate_metrics(self, results: List[Dict]) -> Dict:
        """Calculate evaluation metrics"""
        def normalize_number(text: str) -> float:
            """Extract and normalize numerical value"""
            import re
            numbers = re.findall(r'[-+]?\d*\.?\d+', text)
            return float(numbers[0]) if numbers else None

        metrics = {
            'total_samples': len(results),
            #'successful_predictions': 0,
            'exact_matches': 0,
            'numeric_matches': 0
        }

        total_numeric_samples = 0
        for result in results:
            if result['predicted']:
                #metrics['successful_predictions'] += 1

                # Check exact match
                if result['predicted'].strip() == result['true_answer'].strip():
                    metrics['exact_matches'] += 1

                # Check numeric match
                pred_num = normalize_number(result['predicted'])
                true_num = normalize_number(result['true_answer'])

                if true_num is not None:
                    total_numeric_samples += 1
                # Compare against None explicitly so that a value of 0.0 still counts
                if pred_num is not None and true_num is not None:
                    if abs(pred_num - true_num) < 1e-6:
                        metrics['numeric_matches'] += 1

        # Calculate percentages (guard against empty denominators)
        total = metrics['total_samples']
        metrics.update({
            #'success_rate': (metrics['successful_predictions'] / total) * 100,
            'exact_match_rate': (metrics['exact_matches'] / total) * 100 if total else 0.0,
            'numeric_match_rate': (metrics['numeric_matches'] / total_numeric_samples) * 100 if total_numeric_samples else 0.0
        })

        return metrics



In [None]:
import getpass

In [None]:
# Usage example
def compare_with_baseline(key, dev_dataset, num_samples=10):
  overall_results = {}
  # Initialize OpenAI model
  for mod in ['gpt-4o-mini','gpt-4o']:
    print(f"Testing model {mod}...")
    openai_qa = OpenAIFinancialQA(api_key=key, model=mod)
    #openai_qa = OpenAIFinancialQA(api_key=key, model="gpt-4o-mini")

    # Evaluate
    print("Evaluating OpenAI model...")
    results = openai_qa.evaluate_dataset(dev_dataset, num_samples=num_samples)

    # Print results
    print("\nResults:")
    print("========")
    print("\nMetrics:")
    for metric, value in results['metrics'].items():
        if isinstance(value, float):
            print(f"{metric}: {value:.2f}%")
        else:
            print(f"{metric}: {value}")

    # Print some examples
    print("\nExample Predictions:")
    for result in results['results'][:3]:
        print("\nQuestion:", result['question'])
        print("True Answer:", result['true_answer'])
        print("Predicted Answer:", result['predicted'])
        print("Reasoning:", result['reasoning'])
        print("-" * 80)

    overall_results[mod] = results

  return overall_results

# Run comparison
#results = compare_with_baseline(dev_dataset, num_samples=10)

# Run comparison
key = getpass.getpass("Enter your Open AI key")
results = compare_with_baseline(key, dev_dataset, num_samples=50)

Enter your Open AI key··········
Testing model gpt-4o-mini...
Evaluating OpenAI model...


100%|██████████| 50/50 [03:19<00:00,  3.99s/it]



Results:

Metrics:
total_samples: 50
exact_matches: 19
numeric_matches: 33
exact_match_rate: 38.00%
numeric_match_rate: 66.00%

Example Predictions:

Question: what was the change in weighted average common shares outstanding for basic computations from 2016 to 2017?
True Answer: -11.5
Predicted Answer: 11.5 million
Reasoning: 
1. From the context, the weighted average common shares outstanding for basic computations in 2016 is 299.3 million.
2. The weighted average common shares outstanding for basic computations in 2017 is 287.8 million.
3. To find the change in weighted average common shares outstanding from 2016 to 2017, we subtract the 2017 figure from the 2016 figure:
Change = 2016 shares - 2017 shares
Change = 299.3 million - 287.8 million
Change = 11.5 million
--------------------------------------------------------------------------------

Question: and how much does that change represent percentually in relation to the weighted average common shares outstanding for basic com

100%|██████████| 50/50 [03:29<00:00,  4.18s/it]


Results:

Metrics:
total_samples: 50
exact_matches: 14
numeric_matches: 32
exact_match_rate: 28.00%
numeric_match_rate: 64.00%

Example Predictions:

Question: what was the change in weighted average common shares outstanding for basic computations from 2016 to 2017?
True Answer: -11.5
Predicted Answer: 11.5 million
Reasoning: 
1. Identify the weighted average common shares outstanding for basic computations for the years 2016 and 2017 from the context.
2. For 2016, the weighted average common shares outstanding was 299.3 million.
3. For 2017, the weighted average common shares outstanding was 287.8 million.
4. To find the change, subtract the 2017 value from the 2016 value: 299.3 million - 287.8 million.
5. Perform the subtraction: 299.3 - 287.8 = 11.5 million.
--------------------------------------------------------------------------------

Question: and how much does that change represent percentually in relation to the weighted average common shares outstanding for basic computati




### **Results**

Model - **gpt-4o-mini**

* exact_match_rate: **38.00%**
* numeric_match_rate: **66.00%**

Model - **gpt-4o**

* exact_match_rate: **28.00%**
* numeric_match_rate: **64.00%**

A few observations: on this 50-sample subset, **gpt-4o-mini** performs about the same as **gpt-4o** (in fact slightly better). There is also scope for a few more deterministic checks in post-processing. For example, in the first example from **gpt-4o-mini**, the true answer is -11.5 and the predicted answer is 11.5 million. The magnitude is the same; only the sign convention and the unit suffix differ, so the post-processing of the response needs to handle such cases.
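One possible shape for such a post-processing check is sketched below. It is an assumption-laden illustration rather than the metric used for the numbers above: `parse_number` and `lenient_match` are hypothetical helpers, the 1% relative tolerance is an arbitrary choice, and whether a sign flip should count as correct is a design decision, so it is left as an explicit opt-in flag.

```python
import math
import re
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Extract the first number from text, treating 'x%' as x / 100."""
    cleaned = text.replace(",", "").replace("$", "")
    match = re.search(r"[-+]?\d*\.?\d+", cleaned)
    if match is None:
        return None
    value = float(match.group())
    # A percent sign right after the number means the value is in percent.
    if cleaned[match.end():].lstrip().startswith("%"):
        value /= 100.0
    return value


def lenient_match(predicted: str, truth: str, rel_tol: float = 0.01,
                  ignore_sign: bool = False) -> bool:
    """Compare two answers numerically, ignoring unit words such as 'million'."""
    pred, true = parse_number(predicted), parse_number(truth)
    if pred is None or true is None:
        return False
    if ignore_sign:
        pred, true = abs(pred), abs(true)
    return math.isclose(pred, true, rel_tol=rel_tol, abs_tol=1e-9)


print(lenient_match("14.1%", "0.14136"))
print(lenient_match("11.5 million", "-11.5"))
print(lenient_match("11.5 million", "-11.5", ignore_sign=True))
```

Reporting both a strict and a lenient rate side by side would make it easier to tell formatting mismatches apart from genuine calculation errors.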


## **Approach 2 - Fine-tuning Phi-2 on the task**

Now we look at the second approach: fine-tuning a relatively small language model, Phi-2.

**NOTE** - We use an A100 GPU cluster here for quick and efficient compute.

We apply QLoRA, i.e. 4-bit quantization of the base model combined with LoRA adapters (via the peft library), for memory-efficient fine-tuning.
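To give a feel for why this is cheap, LoRA keeps a weight matrix of shape (d_out x d_in) frozen and trains two small matrices of rank r instead, which adds r * (d_in + d_out) parameters per adapted layer. The sketch below works through that arithmetic; the layer width of 2560 is an assumed value used only for illustration, and `lora_param_count` is a helper name introduced here.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


hidden = 2560  # assumed layer width, for illustration only
rank = 8       # same rank as the LoRA configuration used in this notebook

full_params = hidden * hidden
added_params = lora_param_count(hidden, hidden, rank)

print(f"frozen weights in one square layer: {full_params}")
print(f"trainable LoRA weights for that layer: {added_params}")
print(f"trainable fraction: {added_params / full_params:.4f}")
```

The actual trainable percentage for the whole model depends on which modules receive adapters, so the figure printed when the model is loaded is the one to rely on.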

In [None]:
# Cell 4: Memory Management Functions
def print_gpu_utilization():
    """Print GPU memory usage."""
    if torch.cuda.is_available():
        print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
        print(f"GPU memory cached: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")

def clear_memory():
    """Clear unused memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

In [None]:
# Cell 4: Enhanced Model Implementation
class FinancialQAModel:
    def __init__(self, model_name: str = "microsoft/phi-2"):
        """
        Initialize the financial QA model.

        Args:
            model_name (str): Name of the model to use
        """
        self.model_name = model_name
        self.tokenizer = None
        self.model = None
        self.trainer = None

        # Create cache directory
        self.cache_dir = os.path.expanduser("~/.cache/huggingface")
        os.makedirs(self.cache_dir, exist_ok=True)

        # Load model and tokenizer with quantization
        self._load_model()

    def _load_model(self):
        """Load the model and tokenizer from cache or download if needed."""
        try:
            # Quantization configuration
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.bfloat16
            )

            # Load tokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                cache_dir=self.cache_dir,
                trust_remote_code=True
            )

            # Set up padding token
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
                self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

            # Load model with quantization
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name,
                cache_dir=self.cache_dir,
                trust_remote_code=True,
                quantization_config=bnb_config,
                device_map="auto"
            )

            # Update model config with pad token
            self.model.config.pad_token_id = self.tokenizer.pad_token_id
            # padding_side is a tokenizer setting, not a model config field
            self.tokenizer.padding_side = 'right'

            # Prepare model for k-bit training
            self.model = prepare_model_for_kbit_training(self.model)

            # LoRA configuration in Phi-2
            lora_config = LoraConfig(
                r=8,
                lora_alpha=16,
                lora_dropout=0.1,
                bias="none",
                task_type="CAUSAL_LM"
            )

            # Apply LoRA
            self.model = get_peft_model(self.model, lora_config)

            # Print trainable parameters info
            print("Trainable parameters:")
            model_parameters = 0
            all_parameters = 0
            for name, param in self.model.named_parameters():
                all_parameters += param.numel()
                if param.requires_grad:
                    model_parameters += param.numel()
                    #print(f"Parameter name: {name}, Shape: {param.shape}")
            print(f"Trainable parameters: {model_parameters}")
            print(f"All parameters: {all_parameters}")
            print(f"Percentage of trainable parameters: {100 * model_parameters / all_parameters:.2f}%")

        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def prepare_dataset(self, dataset: Dataset) -> Dataset:
        """
        Prepare the dataset for training by tokenizing the text and creating input-output pairs.
        """
        def tokenize_function(examples):
            # Format input with context and question
            inputs = []
            for text, question, turn_idx in zip(
                examples["text"],
                examples["question"],
                examples["turn_idx"]
            ):
                #Question {turn_idx + 1}: {question}
                prompt = f"""You are a financial question-answering assistant. Your task is to analyze financial documents and answer questions about them.

Document Context:
{text}

Question : {question}

Instructions:
1. Analyze the financial information in the context
2. Determine the required calculation to compute the answer
3. Compute and provide the final answer

Format your response as:
NUMBERS: [List the relevant numbers you identified]
REASONING: [Your compute logic/program]
ANSWER: [The final numerical answer]

Remember to:
- Be precise with numerical calculations
- Show your work clearly
- Express the final answer in the same format/unit as the question implies"""

                inputs.append(prompt)

            # Tokenize inputs
            model_inputs = self.tokenizer(
                inputs,
                padding='max_length',  # Use max_length padding
                truncation=True,
                max_length=512,
                return_tensors="pt"
            )

            # Format targets (programs and answers)
            targets = []
            for program, answer in zip(examples["program"], examples["answer"]):

                # Keep ANSWER at the start of its own line so parse_response can find it
                target = f"REASONING: {program}\n\nANSWER: {answer}"

                targets.append(target)

            # Tokenize targets
            with self.tokenizer.as_target_tokenizer():
                labels = self.tokenizer(
                    targets,
                    padding='max_length',  # Use max_length padding
                    truncation=True,
                    max_length=512,  # Use same max_length as inputs
                    return_tensors="pt"
                )

            # Convert labels to list for manipulation
            labels_list = labels["input_ids"].tolist()

            # Replace padding token ids with -100 in labels
            labels_list = [
                [-100 if token == self.tokenizer.pad_token_id else token for token in label]
                for label in labels_list
            ]

            # Convert back to tensor
            model_inputs["labels"] = torch.tensor(labels_list)

            return model_inputs

        # Process the dataset
        tokenized_dataset = dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=dataset.column_names,
            batch_size=8  # Process smaller batches at a time
        )

        return tokenized_dataset

    def train(self, train_dataset: Dataset, dev_dataset: Dataset):
        """
        Train the model on the provided datasets.
        """
        # Prepare datasets
        train_dataset = self.prepare_dataset(train_dataset)
        dev_dataset = self.prepare_dataset(dev_dataset)

        # Modified training arguments
        training_args = TrainingArguments(
            output_dir="./results",
            num_train_epochs=1,
            per_device_train_batch_size=2,
            per_device_eval_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=100,
            weight_decay=0.02,
            logging_dir="./logs",
            logging_steps=10,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            optim="paged_adamw_32bit",
            fp16=True,
            max_grad_norm=0.3,              # Added gradient clipping
            learning_rate=5e-5,             # Adjusted learning rate
        )

        # Initialize trainer
        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=dev_dataset,
        )

        # Train the model
        self.trainer.train()

    def parse_response(self, response: str) -> Dict:
        """Parse a structured response into its NUMBERS / REASONING / ANSWER sections.

        Returns a dict keyed by lowercase section name. Sections the model did
        not emit are absent, so an unstructured response yields {}.
        """
        if not isinstance(response, str):
            return {'raw_response': response}

        sections = {}
        current_section = None

        for raw_line in response.split('\n'):
            line = raw_line.strip()
            header = next(
                (h for h in ('NUMBERS', 'REASONING', 'ANSWER') if line.startswith(f'{h}:')),
                None,
            )
            if header is not None:
                # Start a new section with whatever follows the header on this line
                current_section = header.lower()
                sections[current_section] = [line[len(header) + 1:].strip()]
            elif line and current_section is not None:
                # Continuation line; text before the first header is ignored
                sections[current_section].append(line)

        return {name: '\n'.join(content).strip() for name, content in sections.items()}

    def evaluate(self, dataset: Dataset, num_samples: Optional[int] = None) -> Dict[str, float]:
        """
        Evaluate the model on the provided dataset.

        Args:
            dataset (Dataset): Dataset to evaluate on
            num_samples (Optional[int]): Evaluate only the first num_samples examples (None for all)

        Returns:
            Dict[str, float]: Evaluation metrics
        """
        if self.trainer is None:
            raise ValueError("Model must be trained before evaluation")

        # Optionally restrict evaluation to a subset to limit memory use
        if num_samples is not None:
            dataset = dataset.select(range(min(num_samples, len(dataset))))

        # Prepare dataset
        prep_dataset = self.prepare_dataset(dataset)

        eval_results = {}

        # For a causal LM, trainer.predict typically returns logits rather than token ids,
        # so take the most likely token at each position before decoding
        predictions = self.trainer.predict(prep_dataset)
        logits = predictions.predictions
        if isinstance(logits, tuple):
            logits = logits[0]
        pred_texts = self.tokenizer.batch_decode(logits.argmax(axis=-1), skip_special_tokens=True)

        ref_texts = dataset['answer']
        parsed_pred_text = [self.parse_response(text) for text in pred_texts]

        # Answer accuracy: the parsed ANSWER section must equal the reference answer
        answer_correct = 0
        for p, r in zip(parsed_pred_text, ref_texts):
            if 'answer' in p and str(p['answer']).strip() == str(r).strip():
                answer_correct += 1

        # Reported under the original "program_accuracy" key; guard against an empty dataset
        eval_results["program_accuracy"] = answer_correct / max(len(pred_texts), 1)
        return eval_results

    def predict(self, text: str, question: str, max_new_tokens: int = 256) -> str:
      """
      Generate a prediction for the given context and question.

      Args:
          text (str): Context text
          question (str): Question to answer
          max_new_tokens (int): Maximum number of new tokens to generate

      Returns:
          str: Generated prediction
      """
      # Get model's device
      device = next(self.model.parameters()).device

      # Format input
      input_text = f"""You are a financial question-answering assistant. Your task is to analyze financial documents and answer questions about them.

      Context: {text}

      Question: {question}

Instructions:
1. Analyze the financial information in the context
2. Determine the required calculation to compute the answer
3. Compute and provide the final answer

Format your response as:
NUMBERS: [List the relevant numbers you identified]
REASONING: [Your compute logic/program]
ANSWER: [The final numerical answer]

Remember to:
- Be precise with numerical calculations
- Show your work clearly
- Express the final answer in the same format/unit as the question implies"""

      # Tokenize input and move to correct device.
      # Note: truncation keeps only the first max_length tokens, so for long contexts
      # the instructions at the end of the prompt are cut off.
      inputs = self.tokenizer(
          input_text,
          return_tensors="pt",
          padding=True,
          truncation=True,
          max_length=512
      )
      # Move inputs to the same device as model
      inputs = {k: v.to(device) for k, v in inputs.items()}

      # Generate prediction (no gradients are needed at inference time)
      with torch.no_grad():
          outputs = self.model.generate(
              inputs["input_ids"],
              attention_mask=inputs["attention_mask"],
              max_new_tokens=max_new_tokens,
              do_sample=False,           # Deterministic generation; temperature is ignored when not sampling
              num_beams=2,               # Beam search for better quality
              early_stopping=True,
              pad_token_id=self.tokenizer.pad_token_id,
              eos_token_id=self.tokenizer.eos_token_id
          )

      # Decode only the newly generated tokens. Comparing the decoded text with
      # input_text is unreliable: once the prompt has been truncated, the decoded
      # prompt no longer matches input_text and is never stripped.
      prompt_length = inputs["input_ids"].shape[1]
      prediction = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

      return prediction.strip()

In [None]:
#!wandb login --relogin
#initialise wandb tracking
#wandb.init(project="financial-qa", name="qlora-training")

In [None]:
#wandb.finish()

In [None]:
# Cell 8: Training and Evaluation
clear_memory()
print("Initial GPU state:")
print_gpu_utilization()

# Initialize model
model = FinancialQAModel()

# Train model
print("\nStarting training...")
model.train(train_dataset, dev_dataset)

print("\nFinal GPU state:")
print_gpu_utilization()


Initial GPU state:
GPU memory allocated: 0.00 MB
GPU memory cached: 0.00 MB


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Trainable parameters:
Trainable parameters: 9175040
All parameters: 1530567680
Percentage of trainable parameters: 0.60%

Starting training...


Map:   0%|          | 0/1808 [00:00<?, ? examples/s]



Map:   0%|          | 0/455 [00:00<?, ? examples/s]

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: prashk (prashk-independent) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.2526        | 3.308136        |




Final GPU state:
GPU memory allocated: 2363.09 MB
GPU memory cached: 3934.00 MB


In [None]:
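# Run garbage collection first so unreferenced tensors are released
gc.collect()
# Then return cached, unused GPU memory to the driver
torch.cuda.empty_cache()
# Ask the CUDA caching allocator to use expandable segments to reduce fragmentation.
# Note: PyTorch reads this variable when the allocator is initialised, so it is
# safest to set it at the top of the notebook, before any CUDA work is done.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'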

In [None]:
print("\nFinal GPU state:")
print_gpu_utilization()


Final GPU state:
GPU memory allocated: 2363.09 MB
GPU memory cached: 2468.00 MB


In [None]:
def create_subset_for_evaluation(dataset, num_samples=10):
    """
    Create a smaller subset of the dataset for evaluation

    Args:
        dataset: The original dataset
        num_samples: Number of samples to select

    Returns:
        Dataset: A smaller subset of the original dataset
    """
    # Ensure we don't try to select more samples than available
    num_samples = min(num_samples, len(dataset))

    # Create list of indices
    indices = list(range(num_samples))

    # Select subset
    subset = dataset.select(indices)

    print(f"Created subset with {len(subset)} samples from original {len(dataset)} samples")
    return subset

# Use the function
dev_dataset2 = create_subset_for_evaluation(dev_dataset, num_samples=10)

Created subset with 10 samples from original 447 samples


In [None]:
# Evaluate model
print("\nEvaluating model...")
eval_results = model.evaluate(dev_dataset)
print(f"Evaluation results: {eval_results}")



Evaluating model...




Map:   0%|          | 0/455 [00:00<?, ? examples/s]



OutOfMemoryError: CUDA out of memory. Tried to allocate 16.80 GiB. GPU 0 has a total capacity of 39.56 GiB of which 16.52 GiB is free. Process 2645 has 23.03 GiB memory in use. Of the allocated memory 19.11 GiB is allocated by PyTorch, and 3.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
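
One plausible contributor to the out-of-memory error above is that `Trainer.predict` collects the logits for every token of every evaluation example, and for a causal LM each token carries one value per vocabulary entry. The sketch below is only a back-of-the-envelope estimate: the sequence length of 512 and the vocabulary size of 51,200 are assumptions for illustration (the actual values depend on `prepare_dataset` and the model config), and `logits_size_gib` is a hypothetical helper, not part of the notebook.

```python
def logits_size_gib(num_examples: int, seq_len: int, vocab_size: int,
                    bytes_per_value: int = 4) -> float:
    """Approximate size of a full (examples x tokens x vocab) logits tensor in GiB."""
    return num_examples * seq_len * vocab_size * bytes_per_value / 1024 ** 3


# 455 examples comes from the Map progress line above;
# seq_len and vocab_size are assumed values, not measured ones.
estimate = logits_size_gib(num_examples=455, seq_len=512, vocab_size=51_200)
print(f"Estimated logits size: {estimate:.1f} GiB")
```

Under these assumptions the full logits tensor alone would be larger than the GPU's total memory. Options that may help include evaluating a subset, reducing logits to token ids early via the `preprocess_logits_for_metrics` argument of `Trainer`, or setting `eval_accumulation_steps` in `TrainingArguments` so accumulated outputs are moved to the CPU periodically.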

In [None]:
SECTION_HEADERS = ('NUMBERS', 'REASONING', 'ANSWER')

def parse_response2(response: str) -> Dict:
    """Parse a structured response into its NUMBERS / REASONING / ANSWER sections.

    Returns a dict keyed by lowercase section name. Sections the model did
    not emit are absent, so an unstructured response yields {}.
    """
    if not isinstance(response, str):
        return {'raw_response': response}

    sections = {}
    current_section = None

    for raw_line in response.split('\n'):
        line = raw_line.strip()
        header = next((h for h in SECTION_HEADERS if line.startswith(f'{h}:')), None)
        if header is not None:
            # Start a new section with whatever follows the header on this line
            current_section = header.lower()
            sections[current_section] = [line[len(header) + 1:].strip()]
        elif line and current_section is not None:
            # Continuation line; text before the first header is ignored
            sections[current_section].append(line)

    return {name: '\n'.join(content).strip() for name, content in sections.items()}

def evaluate_using_predict(model, dataset, num_samples=None):
    """
    Evaluate model using predict method one sample at a time

    Args:
        model: The trained model
        dataset: Dataset to evaluate
        num_samples: Number of samples to evaluate (None for all)
    """
    results = []
    total_samples = len(dataset) if num_samples is None else min(num_samples, len(dataset))

    print(f"Starting evaluation of {total_samples} samples...")

    for i in tqdm(range(total_samples)):
        try:
            # Get example
            example = dataset[i]

            # Get prediction
            prediction = model.predict(
                text=example['text'],
                question=example['question']
            )

            # Store results
            result = {
                'example_id': i,
                'question': example['question'],
                'predicted': prediction,
                'true_program': example['program'],
                'true_answer': example['answer']
            }
            results.append(result)

            # Clear memory
            torch.cuda.empty_cache()

        except Exception as e:
            print(f"\nError processing example {i}: {str(e)}")
            continue

    # Calculate metrics
    metrics = calculate_metrics(results)

    return results, metrics

def calculate_metrics(results):
    """Calculate evaluation metrics"""
    metrics = {
        'total_samples': len(results),
        #'successful_predictions': 0,
        'exact_matches': 0,
        'answer_matches': 0
    }

    for result in results:
        pred = parse_response2(result['predicted'])  # dict of parsed sections
        print("parsed_pred", pred)
        true_answer = str(result['true_answer']).strip()
        true_program = str(result['true_program']).strip()

        # Exact match: the parsed REASONING section reproduces the reference program
        if true_program and pred.get('reasoning', '').strip() == true_program:
            metrics['exact_matches'] += 1

        # Answer match: the reference answer appears in the parsed ANSWER section
        # (checking the raw prediction could match numbers echoed from the context)
        if true_answer and true_answer in pred.get('answer', ''):
            metrics['answer_matches'] += 1

    # Calculate percentages (guard against division by zero when every example failed)
    total = max(metrics['total_samples'], 1)
    metrics['exact_match_rate'] = (metrics['exact_matches'] / total) * 100
    metrics['answer_match_rate'] = (metrics['answer_matches'] / total) * 100

    return metrics

# Use the evaluation
print("Starting evaluation using predict...")
results, metrics = evaluate_using_predict(model, dev_dataset, num_samples=1)  # Set num_samples=None to evaluate the full dataset

# Print metrics
print("\nEvaluation Metrics:")
print("===================")
for metric, value in metrics.items():
    if isinstance(value, float):
        print(f"{metric}: {value:.2f}%")
    else:
        print(f"{metric}: {value}")


# Print some example predictions
print("\nExample Predictions:")
print("===================")
for result in results[:3]:  # Show first 3 examples
    print(f"\nQuestion: {result['question']}")
    print(f"Predicted: {result['predicted'][:200]}...")  # Show first 200 chars
    #print(f"True Program: {result['true_program']}")
    print(f"True Answer: {result['true_answer']}")
    print("-" * 80)

"""

Starting evaluation using predict...
Starting evaluation of 1 samples...


100%|██████████| 1/1 [00:26<00:00, 26.85s/it]

parsed_pred {}

Evaluation Metrics:
total_samples: 1
exact_matches: 0
answer_matches: 0
exact_match_rate: 0.00%
answer_match_rate: 0.00%

Example Predictions:

Question: what was the change in weighted average common shares outstanding for basic computations from 2016 to 2017?
Predicted: You are a financial question-answering assistant. Your task is to analyze financial documents and answer questions about them.

      Context: note 2 2013 earnings per share the weighted average numbe...
True Answer: -11.5
--------------------------------------------------------------------------------
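
The answer check in `calculate_metrics` compares strings, which is brittle for numerical answers: `0.1413` and `14.1%` would count as a mismatch even though they express the same quantity. Below is a minimal sketch of a tolerance-based comparison. The helper names, the 1% relative tolerance, and the decision to accept a factor-of-100 difference (percentage versus fraction) are all assumptions made for illustration, not part of the original evaluation.

```python
import math
import re
from typing import Optional

# Matches an optionally negative number with optional thousands separators and decimals
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*\.?\d*")


def extract_number(text: str) -> Optional[float]:
    """Return the first number found in the text, or None if there is none."""
    match = NUMBER_PATTERN.search(str(text))
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def answers_match(predicted: str, reference: str, rel_tol: float = 0.01) -> bool:
    """Compare two answers numerically, tolerating rounding and percent/fraction scaling."""
    pred_value = extract_number(predicted)
    ref_value = extract_number(reference)
    if pred_value is None or ref_value is None:
        return False
    # Assumption: a prediction may be given as a fraction or as a percentage
    candidates = (pred_value, pred_value * 100, pred_value / 100)
    return any(
        math.isclose(candidate, ref_value, rel_tol=rel_tol, abs_tol=1e-9)
        for candidate in candidates
    )


print(answers_match("0.1413", "14.1%"))   # fraction vs percentage
print(answers_match("-11.5", "-11.5"))    # identical values
print(answers_match("12", "-11.5"))       # different values
```

Accepting scaled values makes the metric more lenient, so it is worth reporting alongside the strict string match rather than replacing it.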





## **Comments**

This Approach-2 is a conceptual pipeline for fine-tuning an open-source model; we used the Phi-2 LLM here. Training ran for only 1 epoch and updated only 0.6% of the parameters. In addition, to keep iteration quick, we used only 20% of the available data.

At the moment the model fails to follow the instructions and its responses are not coherent: in the single example evaluated above, the output echoed the prompt and no `NUMBERS`, `REASONING` or `ANSWER` section could be parsed. One option is to simplify the task further and only extract the answer from the context, rather than generating both the program and the answer. Alternatively, the task could be limited to predicting the program, which is then passed to another LLM for execution. This experiment was not exhaustive and needs to be run at full scale before any meaningful comparison with the OpenAI model can be made.
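
If the task is reduced to predicting the program, the execution step does not necessarily need a second LLM; a small deterministic interpreter may be enough. The sketch below assumes programs in the FinQA-style notation used by `turn_program` (for example `subtract(a, b), divide(#0, b)`, where `#n` refers to the result of step `n` and `const_100` is a constant). It covers only a subset of arithmetic operations, `execute_program` is a hypothetical helper, and the numbers in the example are illustrative.

```python
import re

# Assumed subset of operations; table operations are not handled in this sketch
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "exp": lambda a, b: a ** b,
}

# One step looks like: name(arg1, arg2)
STEP_PATTERN = re.compile(r"(\w+)\(([^()]*)\)")


def resolve(token: str, results: list) -> float:
    """Turn an argument token into a number."""
    token = token.strip()
    if token.startswith("#"):                      # reference to an earlier step
        return results[int(token[1:])]
    if token.startswith("const_"):                 # const_100 -> 100, const_m1 -> -1
        return float(token[len("const_"):].replace("m", "-"))
    if token.endswith("%"):                        # assumption: "5%" means 0.05
        return float(token[:-1]) / 100
    return float(token)


def execute_program(program: str) -> float:
    """Execute the steps in order and return the result of the last one."""
    results = []
    for op_name, arg_text in STEP_PATTERN.findall(program):
        left, right = (resolve(arg, results) for arg in arg_text.split(","))
        results.append(OPS[op_name](left, right))
    if not results:
        raise ValueError(f"No executable steps found in: {program!r}")
    return results[-1]


change = execute_program("subtract(206588, 181001), divide(#0, 181001)")
print(f"{change:.1%}")
```

With this split, the model is scored on the program it produces and the arithmetic is always exact, which separates reasoning errors from calculation errors.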