


#### Tiers.
1. Tier 1 - Definitely Process (likely ~15-20% of data): has_numbers=True AND (likely_kpi=True OR the sentence contains $, "million", "billion", or %)
2. Tier 2 - Skip Completely (likely ~70% of data): no numbers, no financial terms, pure legal text
3. Tier 3 - Smart Sample (remaining ~10-15%): has numbers but the context is unclear; sample 20% of these
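
A minimal sketch of the tier rules above in plain Python. The names `assign_tier` and `keep_tier3` are hypothetical, and two things are assumptions rather than part of the notes: every sentence without numbers is routed to Tier 2, and the 20% Tier 3 sample is drawn by hashing the sentence ID so the same sentences are kept on every run.

```python
import hashlib

# Markers listed in the Tier 1 rule; matching is case-insensitive.
FINANCIAL_MARKERS = ("$", "million", "billion", "%")


def assign_tier(sentence: str, has_numbers: bool, likely_kpi: bool) -> int:
    """Return 1 (process), 2 (skip) or 3 (sample) for one sentence."""
    text = sentence.lower()
    has_marker = any(marker in text for marker in FINANCIAL_MARKERS)
    if has_numbers and (likely_kpi or has_marker):
        return 1
    if not has_numbers:
        # Assumption: anything without numbers is treated as Tier 2.
        return 2
    return 3


def keep_tier3(sentence_id: str, rate: float = 0.20) -> bool:
    """Deterministically keep roughly `rate` of Tier 3 sentences."""
    digest = hashlib.sha256(sentence_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

Hashing instead of random sampling means a restarted run selects the same Tier 3 subset, which keeps it compatible with checkpointing.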

#### One clear failure case:
1. Key failures: the model missed the BOE reduction (85 million) and the tax assets. It needs clearer prompting for non-monetary metrics.

#### Prompt should guide extraction of:
1. Value clusters - all numbers in context
2. Temporal anchors - years, quarters, periods
3. Causal relationships - "due to", "driven by", "because of"
4. Comparative context - "increased from", "compared to"
5. Operational context - what the number represents
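
To make the first four items concrete, here is a rough regex sketch that tags the cues in a sentence. It is only an illustration of what the prompt should surface, not part of the pipeline; `tag_context` and the patterns are hypothetical, and the cue phrases are just the ones listed above. Operational context (item 5) is left to the LLM.

```python
import re
from typing import Dict, List

# Numbers with optional $ prefix, thousands separators, decimals and % suffix.
NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")
# Four-digit years, Q1-Q4, or "first quarter" style phrases.
TEMPORAL = re.compile(
    r"\b(?:(?:19|20)\d{2}|Q[1-4]|(?:first|second|third|fourth) quarter)\b", re.I
)
CAUSAL = re.compile(r"\b(?:due to|driven by|because of)\b", re.I)
COMPARATIVE = re.compile(r"\b(?:increased from|compared to)\b", re.I)


def tag_context(sentence: str) -> Dict[str, List[str]]:
    """Collect value, temporal, causal and comparative cues from a sentence."""
    return {
        "values": NUMBER.findall(sentence),
        "temporal": TEMPORAL.findall(sentence),
        "causal": CAUSAL.findall(sentence),
        "comparative": COMPARATIVE.findall(sentence),
    }


example = "Revenue increased from $650 million to $750 million in 2017 due to higher volumes."
print(tag_context(example))
```

Note that a year such as 2017 shows up both as a value and as a temporal anchor; separating the two is exactly the kind of judgment the prompt needs to ask the model for.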

#### Checkpoint Strategy
1. Save state every N sentences: last processed index, KPIs extracted so far, timestamp, and possibly a resume token (open question).
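
One way the checkpoint idea could look, as a sketch using only the standard library. The function names, the JSON file layout and the `every` parameter are assumptions for illustration; the resume token is left out because the notes have not settled on it.

```python
import json
import os
from datetime import datetime, timezone
from typing import Callable, Dict, List


def save_checkpoint(path: str, last_index: int, kpis: Dict[str, list]) -> None:
    """Write state to a temp file, then swap it in so a crash cannot corrupt it."""
    state = {
        "last_processed_index": last_index,
        "kpis": kpis,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return saved state, or an empty state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"last_processed_index": -1, "kpis": {}}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_with_checkpoints(sentences: List[dict], extract_fn: Callable[[dict], list],
                         path: str, every: int = 100) -> Dict[str, list]:
    """Process sentences in order, resuming after the last saved index."""
    state = load_checkpoint(path)
    kpis = state["kpis"]
    for i in range(state["last_processed_index"] + 1, len(sentences)):
        sentence = sentences[i]
        kpis[sentence["sentence_id"]] = extract_fn(sentence)
        # Save every N sentences and once more at the very end.
        if (i + 1) % every == 0 or i == len(sentences) - 1:
            save_checkpoint(path, i, kpis)
    return kpis
```

After an interruption, calling `run_with_checkpoints` again with the same `path` skips everything up to the last saved index; at most `every - 1` sentences are redone.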


```
{
  "kpis": [
    {
      "category": "dynamic_from_llm",  // Let LLM decide
      "subcategory": "optional",        // More granular if needed
      "values": [                        // Multiple values per KPI
        {"amount": 750, "unit": "USD_millions"},
        {"amount": 2.35, "unit": "percent"}
      ],
      "period": "2017-05",              // Standardized when possible
      "period_text": "May 2017",        // Original text
      "comparison": {                   // Optional
        "type": "YoY",
        "prior_value": 650,
        "change": 15.4
      },
      "explanation": "issued fixed-rate notes",  // Context
      "confidence": 0.85,
      "sentence_span": [0, 150]         // Character positions
    }
  ],
  "metadata": {
    "sentenceID": "xxx",
    "extraction_timestamp": "2024-10-21T10:30:00Z",
    "model_version": "qwen2.5-7b"
  }
}
```
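
The schema above can be checked mechanically before a record is stored. The sketch below is an assumption about which keys to treat as required (the schema only marks subcategory and comparison as optional); `validate_kpi` and `pct_change` are hypothetical helpers. `pct_change` also shows where the example's change of 15.4 comes from: (750 - 650) / 650 × 100.

```python
from typing import Any, Dict, List

# Assumed required keys; "subcategory" and "comparison" are optional in the schema.
REQUIRED_KPI_KEYS = ("category", "values", "period_text", "confidence", "sentence_span")


def pct_change(current: float, prior: float) -> float:
    """Percentage change from prior to current, rounded to one decimal."""
    return round((current - prior) / prior * 100, 1)


def validate_kpi(kpi: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KPI_KEYS if key not in kpi]
    for entry in kpi.get("values", []):
        if not isinstance(entry.get("amount"), (int, float)) or "unit" not in entry:
            problems.append(f"bad value entry: {entry}")
    confidence = kpi.get("confidence")
    if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    span = kpi.get("sentence_span")
    if isinstance(span, list) and len(span) == 2 and span[0] > span[1]:
        problems.append("sentence_span start after end")
    return problems


example_kpi = {
    "category": "debt_issuance",
    "values": [{"amount": 750, "unit": "USD_millions"}, {"amount": 2.35, "unit": "percent"}],
    "period": "2017-05",
    "period_text": "May 2017",
    "comparison": {"type": "YoY", "prior_value": 650, "change": pct_change(750, 650)},
    "explanation": "issued fixed-rate notes",
    "confidence": 0.85,
    "sentence_span": [0, 150],
}
print(validate_kpi(example_kpi))
```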

### PRE FILTERING STEPS: 

In [1]:
# simple_prefilter.py
import polars as pl
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger(__name__)

def filter_for_kpi_extraction(data_path: str) -> pl.DataFrame:
    """Simple 3-filter approach using lazy evaluation"""
    
    logger.info("Loading data with lazy evaluation...")
    
    # Use scan for lazy evaluation - won't load into memory
    df_lazy = pl.scan_parquet(data_path)
    
    # Apply just 3 filters
    filtered = df_lazy.filter(
        (pl.col("likely_kpi") == True) | 
        (pl.col("has_numbers") == True) | 
        (pl.col("sentence").str.contains(r'\d'))  # Any digit at all
    )
    
    # Count without collecting full data
    total_count = df_lazy.select(pl.count()).collect().item()
    filtered_count = filtered.select(pl.count()).collect().item()
    
    logger.info(f"Original: {total_count:,} sentences")
    logger.info(f"After filtering: {filtered_count:,} sentences")
    logger.info(f"Reduction: {((total_count - filtered_count) / total_count * 100):.1f}%")
    logger.info(f"Processing time estimate: {(filtered_count * 15 / 3600):.1f} hours")
    
    # Now collect the filtered data
    logger.info("Collecting filtered data...")
    result_df = filtered.collect()
    
    return result_df

# Run it
if __name__ == "__main__":
    data_path = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample.parquet"
    
    filtered_df = filter_for_kpi_extraction(data_path)
    
    # Save it
    output_path = data_path.replace('.parquet', '_filtered.parquet')
    filtered_df.write_parquet(output_path)
    logger.info(f"Saved to: {output_path}")

2025-10-21 13:15:02,945 - Loading data with lazy evaluation...
(Deprecated in version 0.20.5)
  total_count = df_lazy.select(pl.count()).collect().item()
(Deprecated in version 0.20.5)
  filtered_count = filtered.select(pl.count()).collect().item()
2025-10-21 13:15:03,249 - Original: 1,003,534 sentences
2025-10-21 13:15:03,249 - After filtering: 564,551 sentences
2025-10-21 13:15:03,249 - Reduction: 43.7%
2025-10-21 13:15:03,249 - Processing time estimate: 2352.3 hours
2025-10-21 13:15:03,249 - Collecting filtered data...
2025-10-21 13:15:05,252 - Saved to: D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample_filtered.parquet


In [None]:

## 1. Inspect Random Sample of Filtered Data. Inspect Columns. 
import polars as pl

# Load the filtered data
data_path = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample_filtered.parquet"

df = pl.read_parquet(data_path)

# Print ALL column names
print("="*100)
print("ALL COLUMNS IN DATASET:")
print("="*100)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print(f"\n{'='*100}")
print(f"Total columns: {len(df.columns)}")
print(f"Total rows: {len(df):,}")
print("="*100)

# Also print first row to see data structure
print("\nFIRST ROW SAMPLE:")
print("="*100)
first_row = df.head(1).to_dicts()[0]
for key, value in first_row.items():
    # Truncate long values for display
    val_str = str(value)
    if len(val_str) > 100:
        val_str = val_str[:100] + "..."
    print(f"{key}: {val_str}")

print("\n" + "="*100)

ALL COLUMNS IN DATASET:
 1. sample_id
 2. cik
 3. sentence
 4. section
 5. labels
 6. filingDate
 7. name
 8. docID
 9. sentenceID
10. sentenceCount
11. tickers
12. exchanges
13. entityType
14. sic
15. stateOfIncorporation
16. tickerCount
17. acceptanceDateTime
18. form
19. reportDate
20. returns
21. cik_int
22. report_year
23. temporal_bin
24. sampling_rate_pct
25. char_count
26. word_count_approx
27. sample_created_at
28. last_modified_date
29. sample_version
30. source_file_path
31. load_method
32. record_status
33. row_hash
34. section_priority
35. likely_kpi
36. has_numbers
37. is_table_like
38. has_forward_looking
39. has_comparison
40. is_material
41. mentions_years
42. is_recent
43. has_risk_language
44. is_safe_harbor
45. retrieval_signal_score

Total columns: 45
Total rows: 564,551

FIRST ROW SAMPLE:
sample_id: 4
cik: 0000109198
sentence: Wright, is the middle to upper-middle income shopper, with the same profile as a department or speci...
section: 0
labels: {'1d': 0, '5d': 

In [None]:


## Load, Use filtered data and Save Sample to JSON for LLM Processing.

import polars as pl
import json

# Load the filtered data
data_path = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample_filtered.parquet"

df = pl.read_parquet(data_path)

# Filter for sentences with BOTH flags true
high_quality = df.filter(
    (pl.col("likely_kpi") == True) & 
    (pl.col("has_numbers") == True)
).head(30)

print("="*120)
print(f"Found {len(high_quality)} high-quality KPI sentences (both likely_kpi=True and has_numbers=True)")
print("="*120)

# Create structured list for JSON
sentences_data = []
for row in high_quality.iter_rows(named=True):
    sentences_data.append({
        "sentence_id": row['sentenceID'],
        "sentence_text": row['sentence'],
        "metadata": {
            "company_name": row['name'],
            "cik": row['cik'],
            "document_id": row['docID'],
            "section": row['section'],
            "report_year": row['report_year'],
            "filing_date": row['filingDate']
        }
    })

# Create final JSON structure
output_json = {
    "dataset": "sec_10k_high_quality_kpi_sentences",
    "filter_criteria": {
        "likely_kpi": True,
        "has_numbers": True
    },
    "total_sentences": len(sentences_data),
    "sentences": sentences_data
}

# Save to JSON file
output_file = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_30_kpi_sentences.json"
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(output_json, f, indent=2, ensure_ascii=False)

print(f"\n✓ Saved {len(sentences_data)} sentences to JSON")
print(f"  File: {output_file}")

# Print first 3 as preview
print(f"\n{'='*120}")
print("PREVIEW - First 3 sentences:")
print("="*120)
print(json.dumps(output_json["sentences"][:3], indent=2, ensure_ascii=False))

print(f"\n{'='*120}")
print("JSON Structure:")
print("="*120)
print(f"  - dataset: metadata about the collection")
print(f"  - filter_criteria: what filters were applied")
print(f"  - total_sentences: count")
print(f"  - sentences[]: array of sentence objects")
print(f"      - sentence_id: unique identifier")
print(f"      - sentence_text: full text (no truncation)")
print(f"      - metadata: company, document, section info")
print(f"\n✓ Ready for LLM consumption!")

Found 30 high-quality KPI sentences (both likely_kpi=True and has_numbers=True)

✓ Saved 30 sentences to JSON
  File: D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_30_kpi_sentences.json

PREVIEW - First 3 sentences:
[
  {
    "sentence_id": "0000109198_10-K_2006_section_7_7",
    "sentence_text": "RESULTS OF OPERATIONS Fiscal 2006 Overview: - Net sales for fiscal 2006 were $16.1 billion, an 8% increase over fiscal 2005.",
    "metadata": {
      "company_name": "TJX COMPANIES INC /DE/",
      "cik": "0000109198",
      "document_id": "0000109198_10-K_2006",
      "section": 8,
      "report_year": 2006,
      "filing_date": "2006-03-29"
    }
  },
  {
    "sentence_id": "0000109198_10-K_2006_section_7_19",
    "sentence_text": "- Fourth quarter results for fiscal 2006 were stronger than earlier quarters, with same store sales that increased 3% and pre-tax margins that grew from 6.1% last year to 7.5% this year.",
    "metadata": {
 



### A few errors:
- The grammar was first passed as a plain string (JSON_GRAMMAR), so the sampler tried to access ._grammar on a string and crashed.
- llama-cpp-python expects a LlamaGrammar object, not a raw string; building it with LlamaGrammar.from_string avoids the crash.

In [7]:
## MAIN ATTEMPT: packing sentences together to attempt bulk querying.



"""
PACKED KPI Extraction using llama-cpp-python
--------------------------------------------
- No server, no async.
- Two packs of 12 sentences each (from filtered dataset).
- Includes grammar and safe delimiters to reduce mixups.
- Saves sampled dataset automatically with timestamp.
"""

import polars as pl
import json
import time
import re
from datetime import datetime
from typing import List, Dict, Tuple
from llama_cpp import Llama, LlamaGrammar


# =============================================================================
# CONFIGURATION
# =============================================================================
DATA_PATH = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample_filtered.parquet"
MODEL_PATH = r"C:\llama_server\models\qwen2p5_7b\Qwen2.5-7B-Instruct-Q5_K_M.gguf"

N_GPU_LAYERS = -1
N_CTX = 4096
TEMPERATURE = 0.1
MAX_TOKENS = 512  

# =============================================================================
# LOAD MODEL
# =============================================================================
print("Loading model... (this will take 10–20s)")
t0 = time.time()

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=N_GPU_LAYERS,
    n_ctx=N_CTX,
    verbose=False,
)


print(f"✓ Model loaded in {time.time() - t0:.1f}s\n")

# =============================================================================
# STEP 1: LOAD FILTERED DATA AND SAMPLE 24
# =============================================================================
print("Loading filtered dataset...")
df = pl.read_parquet(DATA_PATH)
df_filt = df.filter(
    (pl.col("likely_kpi") == True) &
    (pl.col("has_numbers") == True)
).sample(n=24, seed=int(time.time()))

# Save sampled JSON with timestamp
now = datetime.now().strftime("%Y%m%d_%H%M%S")
output_json_path = rf"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_24_kpi_sentences_{now}.json"

sentences = []
for row in df_filt.iter_rows(named=True):
    sentences.append({
        "sentence_id": row['sentenceID'],
        "sentence_text": row['sentence'],
        "metadata": {
            "company_name": row['name'],
            "ticker": row['tickers'],
            "cik": row['cik'],
            "doc_id": row['docID'],
            "section": row['section'],
            "report_year": row['report_year'],
            "filing_date": row['filingDate']
        }
    })

with open(output_json_path, 'w', encoding='utf-8') as f:
    json.dump(sentences, f, indent=2, ensure_ascii=False)
print(f"✓ Saved sample of 24 sentences → {output_json_path}\n")

# =============================================================================
# STEP 2: DEFINE PACKING LOGIC
# =============================================================================

PACK_PROMPT = """You are a financial KPI extractor.
For each input block identified by <<ID>>, extract KPIs into JSON under that same key.
Use the EXACT IDs given (do not rename or add).
Only use numbers explicitly present in that block.
Include both the numeric value and the original substring.

Each KPI fields:
- category
- value
- value_raw
- unit
- year
- quarter
- period_text
- explanation
- company
- ticker
- evidence_sentence_id

Input blocks:
{pairs}

Return only JSON in this form (MINIFIED, one line; no spaces, no newlines):
{{
  "<<ID1>>": [{{"category": "...", "value": 2500, "unit": "USD_millions", "company": "...", "ticker": "..."}}],
  "<<ID2>>": []
}}
"""

# Simple JSON grammar (helps keep model outputs parseable)
JSON_GRAMMAR = r"""
root   ::= object
object ::= "{" ws (pair ("," ws pair)*)? ws "}"
pair   ::= string ws ":" ws value
array  ::= "[" ws (value ("," ws value)*)? ws "]"
value  ::= object | array | string | number | "true" | "false" | "null"
string ::= "\"" ([^"\\] | "\\" .)* "\""
number ::= "-"? [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n\r]*
"""

GRAMMAR_OBJ = LlamaGrammar.from_string(JSON_GRAMMAR, "root")

def build_packs(data: List[Dict], pack_size: int = 12) -> List[List[Dict]]:
    """Split data into even packs of N"""
    return [data[i:i + pack_size] for i in range(0, len(data), pack_size)]

def build_block(pack: List[Dict]) -> str:
    """Format sentences for input block with IDs and company info"""
    lines = []
    for item in pack:
        sid = item['sentence_id']
        text = item['sentence_text'].replace("\n", " ").strip()
        meta = item['metadata']
        company = meta.get('company_name', 'Unknown')
        ticker = meta.get('ticker', '')
        lines.append(f"<<{sid}>> (Company: {company}, Ticker: {ticker}) {text}")
    return "\n\n".join(lines)



def extract_pack(pack: List[Dict]) -> Dict[str, List[Dict]]:
    """Extract KPIs from a single packed batch."""
    block = build_block(pack)
    prompt = PACK_PROMPT.format(pairs=block)

    resp = llm(
        prompt,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        grammar=GRAMMAR_OBJ,   # LlamaGrammar object
        # no stop token; avoid truncating valid JSON
    )
    text = resp["choices"][0]["text"].strip()

    # 1) Primary parse (fast path)
    try:
        return json.loads(text)
    except Exception:
        pass

    # 2) Strip code fences if present (```json ... ```)
    t = text
    if t.startswith("```"):
        t = t.strip("`").strip()
        if t.lower().startswith("json"):
            t = t[4:].lstrip()

    # 3) Balanced-brace salvage: take the longest complete {...}
    best = ""
    depth = 0
    start = None
    for i, ch in enumerate(t):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0 and start is not None:
                cand = t[start:i+1]
                if len(cand) > len(best):
                    best = cand

    if best:
        try:
            return json.loads(best)
        except Exception:
            pass

    # 4) Last-resort regex (may still catch partials)
    m = re.search(r"\{.*\}", t, re.S)
    if m:
        try:
            return json.loads(m.group(0))
        except Exception:
            pass

    # 5) Nothing parseable
    return {}


# =============================================================================
# STEP 3: RUN EXTRACTION (2 PACKS x 12)
# =============================================================================

packs = build_packs(sentences, 12)
print(f"Total sentences: {len(sentences)} → {len(packs)} packs of 12\n")

results = {}
start = time.time()
for i, pack in enumerate(packs, 1):
    print(f"→ Running Pack {i}/{len(packs)} ({len(pack)} sentences)...")
    pack_result = extract_pack(pack)
    results.update(pack_result)

elapsed = time.time() - start
print(f"\n✓ Extraction completed in {elapsed:.1f}s for {len(sentences)} sentences "
      f"({elapsed/len(sentences):.2f}s/sentence avg)\n")

# =============================================================================
# STEP 4: SHOW OUTPUT SUMMARY
# =============================================================================
total_kpis = sum(len(v) for v in results.values())
print("="*100)
print(f"SUMMARY: {total_kpis} KPIs extracted across {len(results)} sentences")
print("="*100)

for sid, items in list(results.items())[:3]:
    print(f"\n--- {sid} ---")
    for item in items:
        print({
            "category": item.get("category"),
            "value": item.get("value"),
            "unit": item.get("unit"),
            "company": item.get("company"),
            "ticker": item.get("ticker"),
            "year": item.get("year"),
            "explanation": item.get("explanation")
        })

# Optionally save results
out_results_path = output_json_path.replace(".json", "_kpi_results.json")
with open(out_results_path, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print(f"\n✓ Saved extraction results to {out_results_path}")


Loading model... (this will take 10–20s)


llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


✓ Model loaded in 1.4s

Loading filtered dataset...
✓ Saved sample of 24 sentences → D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_24_kpi_sentences_20251021_180011.json

Total sentences: 24 → 2 packs of 12

→ Running Pack 1/2 (12 sentences)...
→ Running Pack 2/2 (12 sentences)...

✓ Extraction completed in 582.9s for 24 sentences (24.29s/sentence avg)

SUMMARY: 0 KPIs extracted across 0 sentences

✓ Saved extraction results to D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_24_kpi_sentences_20251021_180011_kpi_results.json


#### Issue 1: the grammar may be too restrictive
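
Before blaming the grammar alone, it may be worth ruling out truncation: with MAX_TOKENS = 512 shared across 12 sentences, the completion could have been cut off before the closing brace, in which case every parse path in extract_pack returns {}. The sketch below classifies a completion response; `diagnose_completion` is a hypothetical helper, and it assumes the response carries an OpenAI-style `finish_reason` field that is "length" when the token limit is hit.

```python
import json


def diagnose_completion(resp: dict) -> str:
    """Classify a completion as 'ok', 'truncated' or 'invalid_json'."""
    choice = resp["choices"][0]
    text = choice.get("text", "").strip()
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError:
        pass
    # Assumption: finish_reason == "length" means max_tokens was exhausted.
    if choice.get("finish_reason") == "length":
        return "truncated"
    return "invalid_json"


# Hand-written stand-ins for model responses, for illustration only.
print(diagnose_completion({"choices": [{"text": '{"id1": []}', "finish_reason": "stop"}]}))
print(diagnose_completion({"choices": [{"text": '{"id1": [{"cat', "finish_reason": "length"}]}))
```

If most packs come back as "truncated", raising max_tokens or shrinking the pack size is the cheaper experiment; if they come back as "invalid_json", the grammar and prompt are the better suspects.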



In [4]:
"""
Revised KPI Extraction using llama-cpp-python
---------------------------------------------
- Processes sentences one at a time (the 12-sentence packs returned nothing parseable)
- Few-shot prompt with a richer, flexible schema
- No grammar (suspected cause of the empty results)
- Regex-based JSON array extraction with light cleanup
"""

import polars as pl
import json
import time
import re
from datetime import datetime
from typing import List, Dict
from llama_cpp import Llama

# =============================================================================
# CONFIGURATION
# =============================================================================
DATA_PATH = r"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sec_finrag_1M_sample_filtered.parquet"
MODEL_PATH = r"C:\llama_server\models\qwen2p5_7b\Qwen2.5-7B-Instruct-Q5_K_M.gguf"

N_GPU_LAYERS = -1
N_CTX = 8192  # Increased for longer prompts
TEMPERATURE = 0.0
MAX_TOKENS = 1024  # Increased for multiple sentence responses

# =============================================================================
# LOAD MODEL
# =============================================================================
print("Loading model...")
t0 = time.time()

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=N_GPU_LAYERS,
    n_ctx=N_CTX,
    verbose=False,
)

print(f"✓ Model loaded in {time.time() - t0:.1f}s\n")

# =============================================================================
# STEP 1: LOAD FILTERED DATA AND SAMPLE N
# =============================================================================
print("Loading filtered dataset...")
df = pl.read_parquet(DATA_PATH)

num = 8  # sentences to sample
df_filt = df.filter(
    (pl.col("likely_kpi") == True) &
    (pl.col("has_numbers") == True)
).sample(n=num, seed=42)  # Fixed seed for reproducibility

# Save sampled JSON with timestamp
now = datetime.now().strftime("%Y%m%d_%H%M%S")
output_json_path = rf"D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_{num}_kpi_sentences_{now}.json"

sentences = []
for row in df_filt.iter_rows(named=True):
    sentences.append({
        "sentence_id": row['sentenceID'],
        "sentence_text": row['sentence'],
        "metadata": {
            "company_name": row['name'],
            "ticker": row['tickers'],
            "cik": row['cik'],
            "doc_id": row['docID'],
            "section": row['section'],
            "report_year": row['report_year'],
            "filing_date": row['filingDate']
        }
    })

with open(output_json_path, 'w', encoding='utf-8') as f:
    json.dump(sentences, f, indent=2, ensure_ascii=False)
print(f"✓ Saved sample of {num} sentences → {output_json_path}\n")

# =============================================================================
# STEP 2: SINGLE-SENTENCE EXTRACTION (one sentence per prompt, no packing)
# =============================================================================

def extract_single_sentence(sentence_data: Dict) -> List[Dict]:
    """Extract KPIs with RICH schema and dynamic categorization"""
    
    sid = sentence_data['sentence_id']
    text = sentence_data['sentence_text']
    meta = sentence_data['metadata']
    
    # IMPROVED: Comprehensive prompt with flexible schema
    prompt = f"""Extract ALL financial and operational metrics from this SEC 10-K sentence.

Company: {meta['company_name']}
Ticker: {meta['ticker']}
Sentence: {text}

YOU ARE A FINANCIAL ANALYST. Extract both monetary AND non-monetary metrics.

For EACH metric found, create a JSON object with these fields:

REQUIRED FIELDS:
- "category": Use standard terms if obvious (revenue, debt, employees, etc.). If not obvious, study and infer a descriptive 2-4 word category that captures the business meaning (e.g., "valuation_allowance_release", "inter_segment_sales", "restructuring_charges", "basis_point_change")
- "value": The numeric value (convert to consistent units - see below)
- "value_raw": The EXACT text from sentence showing this number (e.g., "$392 million", "11 basis points", "1,810 employees")
- "unit": Choose from: percent, basis_points, USD_millions, USD_billions, USD_thousands, count, ratio, BOE (barrels oil equivalent), or create appropriate unit

OPTIONAL FIELDS (include if present):
- "year": Fiscal year if mentioned (2017, 2018, etc.)
- "quarter": If mentioned (Q1, Q2, Q3, Q4)
- "period_text": Original period description ("fiscal year ended December 31, 2019")
- "comparison_year": If comparing to prior period (e.g., "compared to 2016")
- "metric_type": "absolute" or "change" or "rate" or "percentage_change"
- "context": Brief explanation of what this metric represents (max 10 words)

UNIT CONVERSION RULES:
- "$25 million" → {{"value": 25, "unit": "USD_millions"}}
- "$41,057" or "$41 thousand" → {{"value": 0.041, "unit": "USD_millions"}} 
- "11 basis points" → {{"value": 11, "unit": "basis_points"}}
- "1,810 employees" → {{"value": 1810, "unit": "count"}}
- "2.63%" → {{"value": 2.63, "unit": "percent"}}

EXAMPLES OF GOOD EXTRACTION:

Input: "Does not include inter-segment sales of $392 million and $405 million in 2017 and 2016."
Output: [
  {{"category": "inter_segment_sales_excluded", "value": 392, "value_raw": "$392 million", "unit": "USD_millions", "year": 2017, "context": "excluded from segment totals"}},
  {{"category": "inter_segment_sales_excluded", "value": 405, "value_raw": "$405 million", "unit": "USD_millions", "year": 2016, "context": "excluded from segment totals"}}
]

Input: "The $25 million deposit is included in Cash segregated under regulatory requirements."
Output: [
  {{"category": "regulatory_cash_deposit", "value": 25, "value_raw": "$25 million", "unit": "USD_millions", "context": "segregated cash requirement"}}
]

Input: "The net gain of $41,057 is included in Other Income."
Output: [
  {{"category": "other_income_gain", "value": 0.041, "value_raw": "$41,057", "unit": "USD_millions", "context": "net gain in other income"}}
]

Input: "FTE rate of return on securities was 2.63% in 2018, up by 11 basis points."
Output: [
  {{"category": "return_on_securities_fte", "value": 2.63, "value_raw": "2.63%", "unit": "percent", "year": 2018, "metric_type": "rate"}},
  {{"category": "return_on_securities_change", "value": 11, "value_raw": "11 basis points", "unit": "basis_points", "metric_type": "change", "context": "year over year increase"}}
]

Input: "Employee separation charges relate to severance for approximately 1,810 and 2,720 employees in 2019 and 2018."
Output: [
  {{"category": "employee_separations", "value": 1810, "value_raw": "1,810 employees", "unit": "count", "year": 2019, "context": "restructuring headcount"}},
  {{"category": "employee_separations", "value": 2720, "value_raw": "2,720 employees", "unit": "count", "year": 2018, "context": "restructuring headcount"}}
]

NOW EXTRACT FROM THIS SENTENCE. Be comprehensive - capture EVERY number with business meaning.

JSON array:"""
    

    try:
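        resp = llm(
            prompt,
            max_tokens=MAX_TOKENS,    # set in CONFIGURATION (1024)
            temperature=TEMPERATURE,  # set in CONFIGURATION (0.0, deterministic)
            # Caution: with "```" as a stop string, generation ends immediately
            # (empty output) if the model opens its answer with a code fence.
            stop=["```", "\n\nInput:", "\n\nNOW EXTRACT"],
            repeat_penalty=1.1,
        )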
        
        output = resp["choices"][0]["text"].strip()
        
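        # Strip any code-fence markers the model may have emitted
        output = re.sub(r'```json\s*|\s*```', '', output)
        
        # Find the first JSON array (non-greedy: stops at the first "]")
        array_match = re.search(r'\[.*?\]', output, re.DOTALL)
        if array_match:
            json_str = array_match.group(0)
            
            # Remove trailing commas before a closing brace or bracket
            json_str = re.sub(r',(\s*[}\]])', r'\1', json_str)
            
            try:
                kpis = json.loads(json_str)
            except json.JSONDecodeError:
                # Fallback: single to double quotes. Only used when strict
                # parsing fails, because it also rewrites apostrophes in text.
                kpis = json.loads(json_str.replace("'", '"'))
            
            # Keep only well-formed KPI objects
            if not isinstance(kpis, list):
                return []
            kpis = [kpi for kpi in kpis if isinstance(kpi, dict)]
            
            # Add sentence-level metadata to each KPI
            for kpi in kpis:
                kpi['sentence_id'] = sid
                kpi['company'] = meta['company_name']
                kpi['ticker'] = meta['ticker']
                kpi['source_sentence'] = text  # Keep full sentence for context
            
            return kpis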
    
    except json.JSONDecodeError as e:
        print(f"  ⚠ JSON parse error: {str(e)[:50]}")
        print(f"  Raw output: {output[:200]}...")
    except Exception as e:
        print(f"  ⚠ Error on {sid}: {str(e)[:50]}")
    
    return []

# =============================================================================
# STEP 3: PROCESS ALL SENTENCES
# =============================================================================

print(f"Processing {len(sentences)} sentences individually...")
print("="*100)

all_results = {}
start_time = time.time()

for i, sent in enumerate(sentences, 1):
    sid = sent['sentence_id']
    company = sent['metadata']['company_name']
    
    print(f"\n[{i}/{len(sentences)}] {company[:40]}...")
    print(f"  Text: {sent['sentence_text'][:80]}...")
    
    kpis = extract_single_sentence(sent)
    all_results[sid] = kpis
    
    if kpis:
        print(f"  ✓ Found {len(kpis)} KPIs")
        for kpi in kpis:
            print(f"    • {kpi.get('category')}: {kpi.get('value')} {kpi.get('unit')}")
    else:
        print(f"  • No KPIs extracted")

elapsed = time.time() - start_time
avg_time = elapsed / len(sentences)

# =============================================================================
# STEP 4: SUMMARY & SAVE
# =============================================================================

print("\n" + "="*100)
print("EXTRACTION COMPLETE")
print("="*100)

total_kpis = sum(len(v) for v in all_results.values())
sentences_with_kpis = sum(1 for v in all_results.values() if v)

print(f"Total sentences:       {len(sentences)}")
print(f"Sentences with KPIs:   {sentences_with_kpis}/{len(sentences)} ({sentences_with_kpis/len(sentences)*100:.1f}%)")
print(f"Total KPIs extracted:  {total_kpis}")
print(f"Average per sentence:  {total_kpis/len(sentences):.2f}")
print(f"\nProcessing time:       {elapsed:.1f}s")
print(f"Average per sentence:  {avg_time:.2f}s")
print(f"Throughput:            {60/avg_time:.1f} sentences/minute")

# Extrapolate
print(f"\nExtrapolation for 564,551 filtered sentences:")
print(f"  Sequential:          {564551 * avg_time / 3600:.1f} hours")
print(f"  With 10 parallel:    {564551 * avg_time / 3600 / 10:.1f} hours")

# Save results
out_results_path = output_json_path.replace(".json", "_kpi_results.json")
with open(out_results_path, 'w', encoding='utf-8') as f:
    json.dump(all_results, f, indent=2, ensure_ascii=False)

print(f"\n✓ Results saved to: {out_results_path}")
print("="*100)



Loading model...


llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


✓ Model loaded in 0.9s

Loading filtered dataset...
✓ Saved sample of 24 sentences → D:\JoelDesktop folds_24\NEU FALL2025\MLops IE7374 Project\finrag-insights-mlops\data\exports\sample_8_kpi_sentences_20251022_092310.json

Processing 8 sentences individually...

[1/24] ELI LILLY & Co...
  Text: In 2017, we recognized $1.33 billion of asset impairment, restructuring, and oth...
  • No KPIs extracted

[2/24] GENWORTH FINANCIAL INC...
  Text: The net proceeds of $397 million from the issuance of the 2024 Notes, together w...
  ✓ Found 3 KPIs
    • net_proceeds_from_issuance: 397 USD_millions
    • contribution_to_gmic: 100 USD_millions
    • contribution_to_us_mortgage_holding_company: 300 USD_millions

[3/24] GENWORTH FINANCIAL INC...
  Text: The following table sets forth the increase (decrease) in amortization of DAC re...
  • No KPIs extracted

[4/24] WASHINGTON TRUST BANCORP INC...
  Text: Approximately 80% of the net client outflows were associated with the loss of ce...
  • No KPIs

KeyboardInterrupt: 

#### Issue 2:
- Quantization drift / context drift: output quality appears to degrade as the run proceeds.
- No few-shot guidance.
- State accumulation: even though no history is passed explicitly, the model's internal state may affect subsequent generations.
- Successful patterns can create a bias that leads to later failures.
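
One way to test the accumulation hypothesis is to clear the model state before every sentence and see whether the later failures go away. The sketch below is duck-typed so it runs without a model; `run_isolated` is a hypothetical helper, and it assumes the `llm` object exposes a `reset()` method (llama-cpp-python's `Llama` class has one, as far as I know) and returns the same completion dictionary used elsewhere in this notebook.

```python
from typing import Any, List


def run_isolated(llm: Any, prompts: List[str], **gen_kwargs: Any) -> List[str]:
    """Generate one completion per prompt, resetting model state before each call."""
    outputs = []
    for prompt in prompts:
        llm.reset()  # Assumption: clears cached token state from the previous call.
        resp = llm(prompt, **gen_kwargs)
        outputs.append(resp["choices"][0]["text"])
    return outputs


class EchoModel:
    """Stand-in for a real model so the sketch can be run as-is."""

    def __init__(self) -> None:
        self.resets = 0

    def reset(self) -> None:
        self.resets += 1

    def __call__(self, prompt: str, **kwargs: Any) -> dict:
        return {"choices": [{"text": prompt.upper()}]}


demo = EchoModel()
print(run_isolated(demo, ["first sentence", "second sentence"], max_tokens=8))
```

If results with and without the reset are identical at temperature 0.0, state accumulation is unlikely to be the cause and the prompt or stop strings deserve a closer look.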
