
# 🦙 PDF Financial Insight Extractor (Ollama + LangChain)

This notebook extracts key financial insights (growth prospects, business changes, key triggers, material impacts) from investor call transcripts (PDF format) using **LangChain + Ollama**.

---

## 🚀 Features:
1. PDF Text Extraction using **PyMuPDF (fitz)**
2. Text Cleaning & Chunking
3. Local LLM Processing via **Ollama + LangChain**
4. Per-Chunk Summaries (structured bullet points under each category)
5. Optional Export to `.txt`

---

## 🔥 Requirements:
- **Ollama installed & running locally** ([https://ollama.com/download](https://ollama.com/download))
- Python Packages:
  ```bash
  pip install langchain langchain-community PyMuPDF pandas tqdm
  ```

---


In [None]:

# Install required packages (comment out this line if they are already installed)
%pip install langchain langchain-community PyMuPDF pandas tqdm



In [None]:

import fitz  # PyMuPDF
import pandas as pd
from tqdm import tqdm
import re

from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate


In [None]:

def extract_text_from_pdf(pdf_path):
    """Extracts the plain text of every page in a PDF file using PyMuPDF."""
    # The context manager closes the document even if reading a page raises.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


In [None]:

def clean_text(text, min_line_length=30):
    """Drops short lines (typically page numbers, headers, footers) and joins the rest.

    Note: short but legitimate lines, such as speaker names, are discarded too.
    """
    lines = (line.strip() for line in text.split("\n"))
    # Keep only lines longer than the threshold; shorter ones are treated as noise.
    cleaned_lines = [line for line in lines if len(line) > min_line_length]
    return " ".join(cleaned_lines)
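
The length filter above is a blunt instrument: it removes page numbers, but it also removes short answers and speaker labels. One possible alternative, sketched here under the assumption that running headers and footers repeat verbatim across pages, is to drop lines by how often they recur. The helper name `drop_repeated_lines`, the `min_repeats` threshold, and the per-page input list are all hypothetical and not part of the notebook; the sample pages are made up for illustration.

```python
from collections import Counter


def drop_repeated_lines(pages, min_repeats=3):
    """Remove lines that occur on at least `min_repeats` pages (likely headers/footers).

    `pages` is a list of per-page text strings (a hypothetical input shape;
    the notebook itself concatenates all pages into one string).
    """
    # Count each distinct non-empty line once per page.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.split("\n") if line.strip()})

    kept = []
    for page in pages:
        for line in page.split("\n"):
            stripped = line.strip()
            if stripped and counts[stripped] < min_repeats:
                kept.append(stripped)
    return " ".join(kept)


# Made-up sample pages, each carrying the same header and footer.
sample_pages = [
    "ACME Ltd Q4 Call\nModerator: Welcome everyone.\nPage footer",
    "ACME Ltd Q4 Call\nCFO: Revenue grew this quarter.\nPage footer",
    "ACME Ltd Q4 Call\nAnalyst: Thank you.\nPage footer",
]
print(drop_repeated_lines(sample_pages))
```

Unlike the length threshold, this keeps short speaker turns, at the cost of needing the text page by page.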


In [None]:

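def chunk_text(text, max_chars=3000):
    """Splits text into sentence-aligned chunks of at most `max_chars` characters.

    The limit is measured in characters, not model tokens. A single sentence
    longer than the limit becomes its own oversized chunk.
    """
    sentences = re.split(r'(?<=[.!?]) +', text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chars:
            current_chunk += sentence + " "
        else:
            # Flush the current chunk, skipping the empty one that would
            # otherwise be emitted when the first sentence exceeds the limit.
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks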
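
Sentence-based chunking cannot bound the chunk size when a single "sentence" is longer than the limit, which happens when extracted PDF text has little punctuation. A minimal sketch of one way to handle this is to pre-split oversized sentences at word boundaries before chunking. The helper `split_oversized` and its `max_chars` parameter are hypothetical names introduced for illustration.

```python
import textwrap


def split_oversized(sentences, max_chars=3000):
    """Break any sentence longer than `max_chars` into word-aligned pieces."""
    pieces = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces.append(sentence)
        else:
            # textwrap.wrap splits on whitespace and keeps each piece <= width.
            pieces.extend(textwrap.wrap(sentence, width=max_chars))
    return pieces


# A 49-character run with no sentence-ending punctuation, limited to 20 characters.
long_sentence = ("word " * 10).strip()
for piece in split_oversized(["Short one.", long_sentence], max_chars=20):
    print(repr(piece))
```

Note that `textwrap.wrap` normalizes whitespace inside the pieces it produces, which is harmless here because the cleaned text is already space-joined.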


In [None]:

# Prompt shared by every chunk; {chunk} is filled in per call.
PROMPT_TEMPLATE = """You are a financial analyst. Summarize the following earnings call transcript text into these categories:
1. Future Growth Prospects
2. Key Changes in Business
3. Key Triggers
4. Material Information Affecting Next Year’s Earnings & Growth

Provide bullet points under each category.

Transcript:
{chunk}
"""

# Build the model client and chain once rather than on every call.
llm = ChatOllama(model="llama2")  # You can change to 'mistral', 'codellama', etc.
chain = ChatPromptTemplate.from_template(PROMPT_TEMPLATE) | llm


def extract_key_info(chunk):
    """Uses a local Ollama model to extract financial insights from a text chunk."""
    response = chain.invoke({"chunk": chunk})
    return response.content
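
The model returns free text, so anything downstream has to parse it. The sketch below groups bullet lines under the four category headings. It assumes the model repeats the category names as headings and marks bullets with `-`, `*`, or `•`; local models do not always follow the requested format, so treat this as a best-effort parser. `CATEGORIES` and `parse_summary` are hypothetical names, and the sample response is made up.

```python
import re

CATEGORIES = [
    "Future Growth Prospects",
    "Key Changes in Business",
    "Key Triggers",
    "Material Information",
]


def parse_summary(text):
    """Group bullet lines under the most recent category heading seen."""
    result = {name: [] for name in CATEGORIES}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        bullet = re.match(r"^[-*•]\s+(.*)", stripped)
        if bullet:
            # Bullets that appear before any heading are ignored.
            if current:
                result[current].append(bullet.group(1).strip())
            continue
        # Non-bullet line: treat it as a heading if it names a category.
        for name in CATEGORIES:
            if name.lower() in stripped.lower():
                current = name
                break
    return result


# Made-up response in the format the prompt asks for.
sample = """1. Future Growth Prospects
- Capacity expansion planned
2. Key Changes in Business
* New CFO appointed
3. Key Triggers
4. Material Information Affecting Next Year's Earnings & Growth
- Input costs rising
"""
for category, bullets in parse_summary(sample).items():
    print(category, bullets)
```

A category with no bullets maps to an empty list, which makes it easy to spot chunks where the model found nothing relevant.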


In [None]:

from tkinter import Tk
from tkinter.filedialog import askopenfilename

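# Select PDF file via a native file dialog (requires a desktop session)
print("Select PDF file:")
root = Tk()
root.withdraw()  # Hide the empty root window
pdf_path = askopenfilename(filetypes=[("PDF files", "*.pdf")])
root.destroy()
if not pdf_path:
    raise ValueError("No PDF file selected.")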

# Extract & clean text
raw_text = extract_text_from_pdf(pdf_path)
cleaned_text = clean_text(raw_text)

# Chunk text
chunks = chunk_text(cleaned_text)

# Process chunks one at a time, showing a progress bar
all_results = []
for chunk in tqdm(chunks, desc="Processing Chunks"):
    all_results.append(extract_key_info(chunk))

# Display output
for idx, res in enumerate(all_results, start=1):
    print(f"\n\n--- Chunk {idx} Summary ---\n")
    print(res)
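
The loop above yields one summary per chunk, so a long transcript produces many partly overlapping lists. A minimal sketch of combining them into a single view follows. It assumes each chunk summary has already been converted into a dict mapping a category name to a list of bullet strings; that structure and the helper `merge_summaries` are hypothetical, and the example data is made up.

```python
def merge_summaries(chunk_summaries):
    """Merge per-chunk {category: [bullets]} dicts, dropping repeated bullets."""
    merged = {}
    seen = set()
    for summary in chunk_summaries:
        for category, bullets in summary.items():
            target = merged.setdefault(category, [])
            for bullet in bullets:
                # Compare case-insensitively so near-identical bullets collapse.
                key = (category, bullet.strip().lower())
                if key not in seen:
                    seen.add(key)
                    target.append(bullet.strip())
    return merged


# Made-up per-chunk summaries with one duplicate bullet.
example = [
    {"Key Triggers": ["Price hike in Q2"], "Future Growth Prospects": ["New plant"]},
    {"Key Triggers": ["price hike in Q2", "Export orders"]},
]
for category, bullets in merge_summaries(example).items():
    print(category)
    for bullet in bullets:
        print(f"  - {bullet}")
```

Exact-match deduplication only catches bullets that differ in case or surrounding whitespace; paraphrased duplicates would need a second LLM pass or a similarity measure.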


In [None]:

# Save results to a text file (UTF-8 so non-ASCII characters in model output are preserved)
with open("financial_summary.txt", "w", encoding="utf-8") as f:
    for idx, res in enumerate(all_results, start=1):
        f.write(f"\n\n--- Chunk {idx} Summary ---\n")
        f.write(res)

print("\nSummary saved as 'financial_summary.txt'")



---

## 📝 Notes:
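- Ensure **Ollama is running locally** before executing, and that the model has been downloaded (for example, `ollama pull llama2`).
- You can switch models by changing `'llama2'` to `'mistral'`, `'codellama'`, etc.
- Any investor call transcript in PDF format can be used as input; the cleaning and chunking steps are reusable across documents.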
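
A quick way to confirm the first note before running the pipeline is to probe the Ollama server over HTTP. The sketch below assumes Ollama's default address `http://localhost:11434`; adjust `base_url` if your setup differs. The helper name `ollama_is_running` is hypothetical.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2):
    """Return True if the server at `base_url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if not ollama_is_running():
    print("Ollama does not appear to be reachable; start it before running the notebook.")
```

This only checks that the server responds; it does not verify that the chosen model has been pulled.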

---
