# Rule Generation Process
This notebook iteratively generates financial reasoning rules from model failures using Ollama, then verifies each generated rule with Llama 3.3 70B.

**Note:** Each failed result now carries an `id` equal to its row number in `results.jsonl`, which is also the record index in `train.json`.
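A minimal sketch of that indexing convention, assuming each failure record has an `id` field holding its `train.json` index. `attach_dataset_rows` is a hypothetical helper (not part of `src.rule_generator`), and the records below are made-up stand-ins for the real files:

```python
import json

def attach_dataset_rows(failure_lines, dataset):
    """Pair each failure record with the dataset record its `id` points to."""
    pairs = []
    for line in failure_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        failure = json.loads(line)
        row_idx = failure["id"]
        if not 0 <= row_idx < len(dataset):
            raise IndexError(f"failure id {row_idx} is outside the dataset")
        pairs.append((failure, dataset[row_idx]))
    return pairs

# Illustrative stand-ins for train.json and the failures file (assumed shapes).
toy_dataset = [
    {"qa": {"question": "q0"}},
    {"qa": {"question": "q1"}},
    {"qa": {"question": "q2"}},
]
toy_failures = [
    '{"id": 2, "parsed_prediction": "n/a"}',
    '{"id": 0, "parsed_prediction": "38.0"}',
]

pairs = attach_dataset_rows(toy_failures, toy_dataset)
for failure, item in pairs:
    print(failure["id"], item["qa"]["question"])
```

The lookup is by `id`, not by position in the failures file, so the failures can be stored in any order.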

In [2]:
ls

curate_dataset_tools.ipynb  rule_generation_process.ipynb
proof_of_concept.ipynb


In [1]:
import json
import os
import sys

# Adds the root directory (ruledistill-main) to the path
sys.path.append(os.path.abspath(os.path.join('..')))

from src.rule_generator import load_dataset, load_failures, get_rule_prompt, generate_rules_for_range, verify_rule_with_llama

dataset = load_dataset()
failures = load_failures()


Loading dataset from /root/hsin_research/FinQA-main/dataset/train.json...
Loading failures from /root/hsin_research/ruledistill-main/data/failed_results_with_ids.jsonl...


## 1. Select Range of Failures
Specify the first and last row indices in `failed_results_with_ids.jsonl` that you want to analyze. Both bounds are inclusive.
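A minimal sketch of turning an inclusive index range into a Python slice, matching the `range(START_IDX, END_IDX + 1)` loop used in this section's cell. `select_failures` is a hypothetical helper, not a function from `src.rule_generator`:

```python
def select_failures(failures, start_idx, end_idx):
    """Return failures[start_idx..end_idx], treating both bounds as inclusive."""
    if start_idx < 0 or end_idx < start_idx:
        raise ValueError(f"invalid range: [{start_idx}, {end_idx}]")
    if end_idx >= len(failures):
        raise IndexError(
            f"end index {end_idx} exceeds the last failure slot {len(failures) - 1}"
        )
    # Python slices exclude the stop index, hence the + 1.
    return failures[start_idx:end_idx + 1]

# Placeholder failure records, for illustration only.
toy_failures = [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}]
selected = select_failures(toy_failures, 0, 2)
print([f["id"] for f in selected])
```

Validating the bounds up front avoids a silent short slice when `END_IDX` runs past the end of the failures list.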

In [2]:
START_IDX = 0  # first failure slot to analyze (inclusive)
END_IDX = 2    # last failure slot to analyze (inclusive)
OLLAMA_MODEL = "gemini-3-pro-preview:latest"

print(f"Analyzing failures in 'failed_results_with_ids.jsonl' slice [{START_IDX}:{END_IDX+1}]...")
for i in range(START_IDX, END_IDX + 1):
    fail = failures[i]
    row_idx = fail['id']
    item = dataset[row_idx]
    print(f"[Failure Slot {i}] Dataset Index: {row_idx} | Q: {item['qa']['question'][:100]}...")

Analyzing failures in 'failed_results_with_ids.jsonl' slice [0:3]...
[Failure Slot 0] Dataset Index: 0 | Q: what is the the interest expense in 2009?...
[Failure Slot 1] Dataset Index: 1 | Q: during the 2012 year , did the equity awards in which the prescribed performance milestones were ach...
[Failure Slot 2] Dataset Index: 2 | Q: what was the total operating expenses in 2018 in millions...


## 2. Generate Rules via Ollama
This step requires the `ollama` Python package and a running Ollama server.
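Remote model calls can fail partway through a batch, for example when the server rejects a request. A minimal sketch of recording per-row errors separately so a partial run can be inspected and resumed. `generate_one` is a hypothetical callable standing in for whatever issues the request; it is not a function from `src.rule_generator` or the `ollama` package:

```python
def generate_with_error_log(row_indices, generate_one):
    """Call generate_one(row_idx) for each row, keeping successes and errors apart."""
    rules, errors = [], []
    for row_idx in row_indices:
        try:
            rules.append({"row_index": row_idx, "rule": generate_one(row_idx)})
        except Exception as exc:  # keep going so one bad row does not abort the batch
            errors.append({"row_index": row_idx, "error": str(exc)})
    return rules, errors

def fake_generate(row_idx):
    """Stand-in generator that fails on one row, for illustration only."""
    if row_idx == 2:
        raise RuntimeError("request limit reached (status code: 403)")
    return f"rule for row {row_idx}"

rules, errors = generate_with_error_log([0, 1, 2], fake_generate)
print(f"{len(rules)} rules generated, {len(errors)} rows failed")
```

The `errors` list gives the row indices to retry once the underlying problem is resolved.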

In [4]:
generated_rules = generate_rules_for_range(START_IDX, END_IDX, model=OLLAMA_MODEL)

for entry in generated_rules:
    print(f"\n--- Rules for Dataset Index {entry['row_index']} ---")
    print(entry['rule'])

Loading dataset from /root/hsin_research/FinQA-main/dataset/train.json...
Loading failures from /root/hsin_research/failed_results_with_ids.jsonl...
Processing failures in 'failed_results_with_ids.jsonl' slice [0:3]...
[0] Generating rule for row index 0...
[1] Generating rule for row index 1...
[2] Generating rule for row index 2...
Error generating rule for row index 2: you've reached your premium model request limit (status code: 403)

--- Rules for Dataset Index 0 ---
**Gap Analysis**

The model failed because it interpreted the context literally as only providing "changes" (sensitivity data) rather than absolute values. It concluded that the specific "interest expense in 2009" was missing ("insufficient data"). However, in financial reasoning, specifically regarding variable rate instruments, providing the sensitivity of interest expense to rate changes allows for the calculation of the underlying **Notional Amount** (Principal).

The Ground Truth Answer (380) is derived from the 

## 3. Verify Generated Rules with Llama 3.3 70B
Next, check whether each generated rule actually corrects the model's answer on the sample it was derived from.
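How `verify_rule_with_llama` scores an answer is not shown in this notebook. A minimal sketch of a tolerance-based numeric comparison, assuming predictions and ground truths arrive as numbers or numeric strings. `answers_match` and the 0.1% relative tolerance are illustrative choices, and the figures come from the sensitivity example in this notebook's output (3.8 change in expense, 1% change in rate, ground truth 380):

```python
import math

def answers_match(prediction, ground_truth, rel_tol=1e-3):
    """Return True when both values parse as numbers and agree within rel_tol."""
    try:
        pred = float(prediction)
        truth = float(ground_truth)
    except (TypeError, ValueError):
        return False  # non-numeric predictions such as "n/a" never match
    return math.isclose(pred, truth, rel_tol=rel_tol)

# Notional amount = change in interest expense / change in interest rate.
notional = 3.8 / 0.01

print(answers_match(notional, 380.0))  # True: floating-point noise is within tolerance
print(answers_match(38.0, 380.0))      # False: off by a factor of ten
print(answers_match("n/a", 380.0))     # False: not a number
```

A relative tolerance keeps floating-point division from being marked wrong while still catching order-of-magnitude mistakes such as dividing by 0.1 instead of 0.01.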

In [7]:
verification_results = []
for entry in generated_rules:
    row_idx = entry['row_index']
    rule_text = entry['rule']
    orig_fail = entry['original_failure']

    # Show the rule under test, then re-run the model on the same sample with it.
    print(rule_text)
    print(f"Verifying rule for Row Index {row_idx}...")
    res = verify_rule_with_llama(dataset[row_idx], rule_text)

    if res:
        print(f"  Result: {'CORRECT ✅' if res['is_correct'] else 'STILL WRONG ❌'} ({res['error_category']})")
        print(f"  Original Prediction: {orig_fail['parsed_prediction']}")
        print(f"  New Prediction:      {res['parsed_prediction']}")
        print(f"  Ground Truth:        {res['ground_truth']}")
        # Keep the original failure next to the new result for later comparison.
        verification_results.append({
            "row_index": row_idx,
            "original": orig_fail,
            "new_verification": res
        })
    else:
        # Report rows that returned no result instead of skipping them silently.
        print(f"  No verification result returned for Row Index {row_idx}.")

**Gap Analysis**

The model failed because it interpreted the context literally as only providing "changes" (sensitivity data) rather than absolute values. It concluded that the specific "interest expense in 2009" was missing ("insufficient data"). However, in financial reasoning, specifically regarding variable rate instruments, providing the sensitivity of interest expense to rate changes allows for the calculation of the underlying **Notional Amount** (Principal).

The Ground Truth Answer (380) is derived from the formula:
$$ \text{Notional Amount} = \frac{\text{Change in Interest Expense}}{\text{Change in Interest Rate}} $$
Given:
*   Change in Expense = 3.8
*   Change in Rate = 100 basis points = 1% = 0.01

Calculation: $3.8 / 0.01 = 380$.

The model failed to recognize that sensitivity statements are solvable math problems for determining principal/notional balances.

**Rule Synthesis**

<Rule id="sensitivity_to_principal_inference" phase="generation" confidence="1" source="failu

  Result: STILL WRONG ❌ (computation error)
  Original Prediction: n/a
  New Prediction:      38.0
  Ground Truth:        380.0
Here is the analysis and the synthesized rule.

### 1. Gap Analysis
The model failed because it treated a **Boolean comparison question** as a **numerical calculation task**.

*   **The Question:** "Did the equity awards... exceed the equity award compensation expense...?" This requires a Yes/No answer based on a comparison of two values.
*   **The Logic Required:**
    1.  Calculate Total Fair Value: $607 \text{ (shares in thousands)} \times 18.13 = \$11,004.91 \text{ (in thousands)} \rightarrow \$11,004,910$.
    2.  Identify Expense: $\$3.3 \text{ million}$.
    3.  Compare: Is $\$11,004,910 > \$3,300,000$?
    4.  Final Output: **Yes**.
*   **The Model's Failure:** The model performed step 1 (calculating $607 \times 18.13 \approx 11,000$), but it stopped there. It output the numerical result of the first variable rather than completing the logic to compare

## 4. Save Final Session Data

In [None]:
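# Name the output file after the inclusive failure range analyzed in this session.
output_path = f"rule_verification_session_{START_IDX}_{END_IDX}.json"
with open(output_path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps non-ASCII characters in the rule text readable.
    json.dump({
        "generated_rules": generated_rules,
        "verification_results": verification_results
    }, f, indent=2, ensure_ascii=False)
print(f"Session results saved to {output_path}")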
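A minimal sketch of summarizing a saved session, assuming the `verification_results` structure built in the verification cell (`row_index` plus `new_verification["is_correct"]`). `summarize_session` is a hypothetical helper, and the records below are placeholders rather than results from this run:

```python
def summarize_session(verification_results):
    """Split verified rows into those the rule fixed and those still wrong."""
    fixed = [
        r["row_index"]
        for r in verification_results
        if r["new_verification"]["is_correct"]
    ]
    still_wrong = [
        r["row_index"]
        for r in verification_results
        if not r["new_verification"]["is_correct"]
    ]
    return {
        "total": len(verification_results),
        "fixed": fixed,
        "still_wrong": still_wrong,
    }

# Placeholder records for illustration only.
toy_results = [
    {"row_index": 10, "new_verification": {"is_correct": False}},
    {"row_index": 11, "new_verification": {"is_correct": True}},
]

summary = summarize_session(toy_results)
print(summary)
```

Listing the still-wrong rows makes it easy to feed them back into the next rule-generation iteration.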