# Team Details & Contribution

---

1. `Sarit Ghosh (2023AC05131) (100% contribution)`

2. `Soumen Choudhury (2023AC05143) (100% contribution)`

3. `Dhiman Kundu (2023AC05129) (100% contribution)`

4. `Patil Omkar Mahesh (2023AC05085) (100% contribution)`

5. `Kulkarni Siddharth Prasad (2023AC05082) (100% contribution)`

# Task I: Data Collection & Preprocessing

---

- Obtain financial statements for the last two years (publicly available or from a group member’s company).
  
- Convert documents (PDF, Excel, HTML) to plain text using OCR or appropriate parsers.

- Clean text by removing noise like headers, footers, and page numbers.

- Segment reports into logical sections (e.g., income statement, balance sheet).

- Construct at least `50` question-answer (Q/A) pairs reflecting the financial data.

- Example:
    - `Q: What was the company’s revenue in 2023?`
    - `A: The company’s revenue in 2023 was $4.13 billion`

# Data Collection

- **Company:** `PHILLIPS EDISON & COMPANY, INC.`
  
- **FY:** `Dec 31, 2023 to Sept 30, 2024`
  - **HTML Data Link:** `https://www.sec.gov/ix?doc=/Archives/edgar/data/0001476204/000147620425000065/peco-20250331.htm`
    
  - **Page No:** `1 to 35`.
    
- **FY:** `Dec 31, 2024 to June 30, 2025`
  - **HTML Data Link:** `https://www.sec.gov/Archives/edgar/data/1476204/000147620425000092/peco-20250630.htm`

  - **Page No:** `1 to 39`.
    
- `Saved the HTML files as PDFs (only the first 35 and 39 pages, respectively, contain the FINANCIAL STATEMENTS)`.

# Importing Libraries

In [101]:
import os, requests, re, json, pytesseract
from pdf2image import convert_from_path
from PIL import Image
from tqdm import tqdm
from groq import Groq
from textwrap import dedent
import pandas as pd

# Parsing: PDF → Text (OCR)

In [5]:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def clean_line(line):
    """
    Task: Cleans OCR'd lines by removing headers, footers and page numbers.
    Args: line (str): A line of text.
    Returns: str: Cleaned line or empty string if filtered.
    """
    line = line.strip()
    if not line:
        return ""
    if re.match(r"^\s*\d+\s*$", line):
        return ""
    # Substring match on the lowercased line; the company name is spelled "Phillips"
    if any(keyword in line.lower() for keyword in ["phillips", "peco-", "unaudited"]):
        return ""
    return line


def extract_and_save_full_clean_text(pdf_path, output_txt_path, poppler_path = None):
    """
    Task: Extracts full OCR text (including text and tables) from a scanned/image PDF using pytesseract, applies cleaning rules and saves to a text file.
    Args:
        pdf_path (str): Path to input PDF.
        output_txt_path (str): Path to save the cleaned OCR output.
        poppler_path (str, optional): Path to Poppler binaries (needed on Windows). Defaults to None, which relies on the system PATH.
    """
    print(f"📄 Processing: {pdf_path}")
    
    try:
        pages = convert_from_path(pdf_path, dpi = 300, poppler_path = poppler_path)
    except Exception as e:
        print(f"[Fatal Error] Couldn't convert PDF: {e}")
        return

    full_text = []

    for i, image in enumerate(tqdm(pages, desc = f"OCR'ing {os.path.basename(pdf_path)}")):
        try:
            page_text = pytesseract.image_to_string(image)
            lines = page_text.splitlines()

            for line in lines:
                cleaned = clean_line(line)
                if cleaned:
                    full_text.append(cleaned)

        except Exception as e:
            print(f"[Error] Page {i+1}: {e}")
            continue

    try:
        with open(output_txt_path, 'w', encoding = 'utf-8') as f:
            f.write("\n".join(full_text))
        print(f"✅ Saved extracted text to: {output_txt_path}")
    except Exception as e:
        print(f"[Error] Couldn't save file: {e}")

In [7]:
extract_and_save_full_clean_text(
    pdf_path = "Philips_FD_2023-24.pdf",
    output_txt_path = "Philips_FD_2023-24.txt",
    poppler_path = r"C:\Users\Asus\Desktop\College Stuff\MTech\Sem 3\CAI\A2\poppler-24.08.0\Library\bin"
)

📄 Processing: Philips_FD_2023-24.pdf


OCR'ing Philips_FD_2023-24.pdf: 100%|██████████| 35/35 [01:51<00:00,  3.20s/it]

✅ Saved extracted text to: Philips_FD_2023-24.txt





In [9]:
extract_and_save_full_clean_text(
    pdf_path = "Philips_FD_2024-25.pdf",
    output_txt_path = "Philips_FD_2024-25.txt",
    poppler_path = r"C:\Users\Asus\Desktop\College Stuff\MTech\Sem 3\CAI\A2\poppler-24.08.0\Library\bin"
)

📄 Processing: Philips_FD_2024-25.pdf


OCR'ing Philips_FD_2024-25.pdf: 100%|██████████| 39/39 [02:08<00:00,  3.30s/it]

✅ Saved extracted text to: Philips_FD_2024-25.txt





**Insights:**

1. The OCR pipeline renders each PDF page at `300 DPI` and extracts its text with Tesseract; the higher resolution is intended to improve recognition accuracy.

2. The cleaning function drops blank lines, bare page numbers and any line containing a header/footer keyword, giving a cleaner dataset.

3. Processing loop handles per‑page errors gracefully without stopping the full extraction.

4. Output is stored as a single consolidated `.txt` file, preserving the document's logical flow.

5. Poppler integration enables reliable PDF-to-image conversion, crucial for Tesseract OCR on Windows.
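
The cleaning rules summarized above can be exercised without running OCR at all. The following is a minimal sketch, not the notebook's own code: `clean_line_sketch`, its `keywords` parameter and the sample lines are hypothetical and only mirror the rules in `clean_line`. It also shows that keyword filtering is a plain substring test, so the spelling of each keyword matters.

```python
import re

def clean_line_sketch(line, keywords=("phillips", "peco-", "unaudited")):
    """Return the stripped line, or "" if it is blank, a bare page number,
    or contains one of the header/footer keywords (case-insensitive)."""
    line = line.strip()
    if not line or re.fullmatch(r"\d+", line):
        return ""
    lowered = line.lower()
    if any(keyword in lowered for keyword in keywords):
        return ""
    return line

# Hypothetical OCR lines; the figures are made up for illustration.
sample_lines = [
    "  12  ",                            # bare page number
    "PHILLIPS EDISON & COMPANY, INC.",   # running header
    "(Unaudited)",                       # header/footer marker
    "Total revenues 100 90",             # data row that should survive
]
kept = [cleaned for cleaned in map(clean_line_sketch, sample_lines) if cleaned]
print(kept)

# A single-"l" keyword would not match the double-"l" company name.
print("philips" in "phillips edison & company")    # False
print("phillips" in "phillips edison & company")   # True
```

Note that keyword filtering removes the whole line, so a data row that happens to mention a keyword would also be dropped.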

# Segmenting Reports Into Logical Sections (Groq LLMs)

---

#### Note: This step asks the LLM to condense long passages, calculations and insights into short bullet points.

In [54]:
def segment_financial_text_with_groq(raw_text, file_label, output_path):
    """
    Task: Sends raw OCR text to Groq API to segment and format it into structured financial sections.
    Args:
        raw_text (str): Unstructured raw OCR text.
        file_label (str): Label like '2023-24' for headings.
        output_path (str): File path to save the structured result.
    """

    prompt = f"""
GIVEN:
- You are given raw OCR extracted text from a financial report (year: {file_label}).
- The text may be noisy but contains the financial statements, disclosures and supporting notes.

OBJECTIVE:
- Segment the content into well-structured, labeled financial sections.

RESPONSE FORMAT:
<BRD>
================== Section 1: Income Statement ==================
- Bullet points summarizing key metrics like revenue, operating income, net income, EPS, etc.
================== Section 2: Balance Sheet ==================
- Bullet points capturing total assets, liabilities, equity, key ratios, etc.
================== Section 3: Cash Flow Statement ==================
- Bullet points highlighting operating, investing, and financing cash flows.
================== Section 4: Real Estate Portfolio ==================
- Bullet points on number/type of properties, acquisitions/dispositions, geographic exposure, ABR, lease structure, etc.
================== Section 5: Leasing & Occupancy ==================
- Bullet points about lease types, occupancy trends, tenant mix, rent structures, lease term, etc.
================== Section 6: Debt & Financing ==================
- Bullet points about total debt, interest rates, maturities, fixed/variable split, credit facilities, etc.
================== Section 7: Equity & Distributions ==================
- Bullet points on equity structure, dividends, OP units, buybacks, APIC, AOCI, etc.
================== Section 8: Risk Factors & Strategic Outlook ==================
- Bullet points on macroeconomic, tenant, environmental risks, and strategic goals.
</BRD>

INSTRUCTIONS:
- You MUST include all the 8 logical financial sections.
- Use the content and labels from the raw OCR to identify appropriate additional sections.
- Summarize in clear bullet points using exact figures where available. There must be minimum 15 bullet points per section.
- Use original financial terms when possible (e.g., "FFO", "NOI", "ABR", "straight-line rent").
- Do NOT fabricate or infer data not present in the text.
- Maintain the sequence and structure above, even if some sections are sparse.

RAW OCR TEXT:
{raw_text}
"""


    # Read the API key from the environment instead of hardcoding it
    client = Groq(api_key = os.environ.get("GROQ_API_KEY"))

    completion = client.chat.completions.create(
        model = "deepseek-r1-distill-llama-70b",
        messages = [
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature = 0.3,
        top_p = 0.9,
        stream = False,
        stop = None,
        seed= 123,
    )

    full_response = completion.choices[0].message.content

    # Extract content between <BRD> and </BRD>
    match = re.search(r"<BRD>(.*?)</BRD>", full_response, re.DOTALL)
    if match:
        structured_text = match.group(1).strip()
    else:
        raise ValueError("❌ Could not find <BRD> ... </BRD> section in the response.")

    with open(output_path, 'w', encoding = 'utf-8') as f:
        f.write(structured_text)

    print(f"✅ Saved structured version to {output_path}")

In [47]:
with open("Philips_FD_2023-24.txt", "r", encoding = "utf-8") as f:
    raw_text_2023 = f.read()

segment_financial_text_with_groq(
    raw_text = raw_text_2023,
    file_label = "2023-24",
    output_path = "Cleaned_Philips_FD_2023-24.txt"
)

✅ Saved structured version to Cleaned_Philips_FD_2023-24.txt


In [49]:
with open("Philips_FD_2024-25.txt", "r", encoding = "utf-8") as f:
    raw_text_2024 = f.read()

segment_financial_text_with_groq(
    raw_text = raw_text_2024,
    file_label = "2024-25",
    output_path = "Cleaned_Philips_FD_2024-25.txt"
)

✅ Saved structured version to Cleaned_Philips_FD_2024-25.txt


**Insights:**

1. Used `Groq’s` `deepseek-r1-distill-llama-70b` to transform noisy OCR output into a `structured 8‑section financial report`.

2. Enforces a strict template with labeled sections, ensuring uniform segmentation across reports.

3. Requires at least `15` factual bullet points per section, encouraging thorough extraction of figures.

4. Retains original financial terminology `(e.g., FFO, NOI, ABR)` to preserve domain accuracy.

5. Saves the cleaned, structured output to a specified file path for downstream processing.
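
Because the template is only requested in the prompt, the model may still omit a section or return fewer bullets than asked. One way to check a saved response is sketched below, assuming the section headers follow the `=== Section N: Title ===` pattern from the prompt. The helper names and the sample text are hypothetical and not part of the original notebook.

```python
import re

# Section titles copied from the prompt template.
EXPECTED_SECTIONS = [
    "Income Statement", "Balance Sheet", "Cash Flow Statement",
    "Real Estate Portfolio", "Leasing & Occupancy", "Debt & Financing",
    "Equity & Distributions", "Risk Factors & Strategic Outlook",
]

SECTION_HEADER = re.compile(r"^=+\s*Section\s+\d+:\s*(.+?)\s*=+$")

def count_bullets_per_section(structured_text):
    """Map each section title to the number of '- ' bullet lines under it."""
    counts = {}
    current = None
    for raw_line in structured_text.splitlines():
        line = raw_line.strip()
        header = SECTION_HEADER.match(line)
        if header:
            current = header.group(1)
            counts[current] = 0
        elif current is not None and line.startswith("- "):
            counts[current] += 1
    return counts

def find_template_problems(structured_text, min_bullets=15):
    """List expected sections that are missing or have too few bullets."""
    counts = count_bullets_per_section(structured_text)
    problems = []
    for title in EXPECTED_SECTIONS:
        if title not in counts:
            problems.append(f"missing section: {title}")
        elif counts[title] < min_bullets:
            problems.append(f"{title}: only {counts[title]} bullet(s)")
    return problems

# Hypothetical two-section response, far shorter than a real one.
sample = """\
================== Section 1: Income Statement ==================
- Placeholder bullet A
- Placeholder bullet B
================== Section 2: Balance Sheet ==================
- Placeholder bullet C
"""
section_counts = count_bullets_per_section(sample)
problems = find_template_problems(sample, min_bullets=2)
print(section_counts)
print(problems)
```

Running such a check on each `Cleaned_*.txt` file before combining them would surface a sparse or missing section early.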

# Combining Two Segmented Text Files Into Single Text File

In [88]:
def combine_text_files(file_paths):
    """
    Task: Combines the content of multiple text files into one string.
    Args: file_paths (list): List of file paths to combine.
    Returns: str: Combined text content.
    """
    combined_text = []

    for path in file_paths:
        try:
            with open(path, 'r', encoding = 'utf-8') as f:
                content = f.read()
                combined_text.append(f"--- Content from {os.path.basename(path)} ---\n" + content)
        except Exception as e:
            print(f"[Error] Could not read '{path}': {e}")

    return "\n===================================================================================\n".join(combined_text)

In [90]:
combined_text = combine_text_files([
    "Cleaned_Philips_FD_2023-24.txt",
    "Cleaned_Philips_FD_2024-25.txt"
])

# Generating 75 QnA Pairs For The Combined Text (At Least 50 Required)

In [92]:
def call_groq_for_qna(text):
    """
    Task: Calls Groq API to generate Q/A pairs from extracted text.
    Args:
        text (str): Combined, segmented financial text.
    Returns:
        list: List of 75 Q/A dicts.
    """

    prompt = f"""ROLE: You are a financial analyst assistant.

---------------
TASK: From the financial statement text provided below, generate exactly **75 diverse question-answer pairs**. These questions should test understanding of the financial content, including **numeric values, interpretations, trends, concepts, and general context** related to the financial statement.

---------------
EXAMPLE FORMAT (One-Shot Example):
Q: What was the company’s revenue in 2023?
A: The company’s revenue in 2023 was $4.13 billion.

---------------
INSTRUCTIONS:

1. All 75 questions must be **diverse**:
- Some should ask about **specific financial figures** (e.g., revenue, profit, EPS, operating margin, assets, cash flow, etc.) with the **associated financial year**.
- Others should be **theoretical or descriptive**, asking about financial concepts, trends, comparisons between years, or the meaning of certain financial terms or sections.
- Include a few **general or contextual questions** that test understanding beyond raw numbers (e.g., "What does the income statement reveal about the company's operations?" or "What can be inferred from the change in liabilities?").

2. Answers must be **complete and informative**, not just single words or numbers. For numeric-based questions, include the **value with the year**, but also **briefly explain what the figure represents**.

3. Use **natural, formal, and precise language** suitable for finance professionals.

4. Return output STRICTLY in the following JSON object format:
{{
    "qna_pairs": [
        {{"generated_q1": "<question 1>", "generated_a1": "<answer 1>"}},
        {{"generated_q2": "<question 2>", "generated_a2": "<answer 2>"}},
        ...
        {{"generated_q75": "<question 75>", "generated_a75": "<answer 75>"}}
    ]
}}

5. Do NOT include any explanations, notes, or extra commentary — only the JSON object as shown above.

---------------
INPUT TEXT:
{text}
"""

    # Read the API key from the environment instead of hardcoding it
    client = Groq(api_key = os.environ.get("GROQ_API_KEY"))

    completion = client.chat.completions.create(
        model = "deepseek-r1-distill-llama-70b",
        messages = [{"role": "user", "content": prompt}],
        temperature = 0.3,
        top_p = 0.9,
        stream = False,
        response_format = {"type": "json_object"},
        stop = None,
        seed = 123,
    )

    raw_response = completion.choices[0].message.content.strip()
    output_data = json.loads(raw_response)
    return output_data["qna_pairs"]

In [94]:
qna_pairs = call_groq_for_qna(combined_text)
qna_pairs

[{'generated_q1': 'What was the company’s total revenue for Q3 2024?',
  'generated_a1': 'The company’s total revenue for Q3 2024 was $165.53 million, representing an 8.6% increase from Q3 2023.'},
 {'generated_q2': 'What was the year-to-date revenue growth in 2024 compared to 2023?',
  'generated_a2': 'The year-to-date revenue growth in 2024 was 7.2% compared to 2023.'},
 {'generated_q3': 'What was the rental income for Q3 2024?',
  'generated_a3': 'The rental income for Q3 2024 was $161.78 million.'},
 {'generated_q4': 'What were the operating expenses for Q3 2024?',
  'generated_a4': 'The operating expenses for Q3 2024 were $126.54 million, up from $112.39 million in Q3 2023.'},
 {'generated_q5': 'What was the net income attributable to stockholders in Q3 2024?',
  'generated_a5': 'The net income attributable to stockholders in Q3 2024 was $11.60 million.'},
 {'generated_q6': 'What was the EPS for Q3 2024?',
  'generated_a6': 'The EPS for Q3 2024 was $0.09.'},
 {'generated_q7': 'Wha

**Insights:**

1. Uses `Groq's deepseek-r1-distill-llama-70b` to create exactly `75` structured financial `Q/A pairs` from extracted statements.

2. Ensures question diversity across numeric facts, descriptive insights, trend analysis and contextual interpretation.

3. Enforces complete, explanatory answers rather than standalone numbers for richer learning value.
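
The prompt asks for exactly `75` pairs, but an LLM response is not guaranteed to honor that count or to fill in every question and answer. A small validation step, sketched below under the assumption that each item carries one `generated_q…` key and one `generated_a…` key as in the output shown above, can flag such gaps before the data is saved. `validate_qna_pairs` and the sample payload are hypothetical.

```python
import json

def validate_qna_pairs(raw_response, expected_count=75):
    """Parse the JSON response and return (pairs, problems)."""
    pairs = json.loads(raw_response).get("qna_pairs", [])
    problems = []
    if len(pairs) != expected_count:
        problems.append(f"expected {expected_count} pairs, got {len(pairs)}")
    for index, item in enumerate(pairs, start=1):
        # A pair is usable only if it has a non-empty question and answer.
        has_question = any(k.startswith("generated_q") and v for k, v in item.items())
        has_answer = any(k.startswith("generated_a") and v for k, v in item.items())
        if not (has_question and has_answer):
            problems.append(f"pair {index}: missing question or answer")
    return pairs, problems

# Hypothetical response with two items, the second lacking an answer.
sample_response = json.dumps({
    "qna_pairs": [
        {"generated_q1": "Placeholder question?", "generated_a1": "Placeholder answer."},
        {"generated_q2": "Another placeholder question?"},
    ]
})
sample_pairs, sample_problems = validate_qna_pairs(sample_response, expected_count=2)
print(sample_problems)
```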

# Storing Generated QnA Dataset → CSV

In [103]:
def save_qna_to_csv(qna_list, file_path):
    """
    Task: Saves Q/A pairs to a CSV file.
    Args:
        qna_list (list): List of Q/A dicts.
        file_path (str): Path to save CSV.
    """
    formatted_data = []
    for item in qna_list:
        for q_key, q_val in item.items():
            if q_key.startswith("generated_q"):
                # Pair each question key with its matching answer key
                num = q_key.replace("generated_q", "")
                a_key = f"generated_a{num}"
                question = q_val
                answer = item.get(a_key, "")
                formatted_data.append({"Question": question, "Answer": answer})

    # Create DataFrame
    df = pd.DataFrame(formatted_data)

    # Save to CSV at the requested path
    df.to_csv(file_path, index = False, encoding = 'utf-8-sig')

    print(f"CSV saved as '{file_path}'")

In [105]:
save_qna_to_csv(qna_pairs, "financial_qna_pairs.csv")

CSV saved as 'financial_qna_pairs.csv'
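
The CSV is written with the `utf-8-sig` encoding, which prepends a byte-order mark (BOM) to the file. The sketch below uses only the standard library, with a temporary file and a made-up row, to show why the file should be read back with the same encoding: with plain `utf-8` the BOM ends up attached to the first column name.

```python
import csv
import os
import tempfile

# Hypothetical single row standing in for the generated dataset.
rows = [{"Question": "Placeholder question?", "Answer": "Placeholder answer."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qna_sketch.csv")

    # Write with a BOM, as the notebook does via encoding='utf-8-sig'.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["Question", "Answer"])
        writer.writeheader()
        writer.writerows(rows)

    # Reading with utf-8-sig strips the BOM and recovers clean column names.
    with open(path, newline="", encoding="utf-8-sig") as f:
        loaded = list(csv.DictReader(f))

    # Reading with plain utf-8 leaves the BOM glued to the first header.
    with open(path, newline="", encoding="utf-8") as f:
        first_header = next(csv.reader(f))[0]

print(loaded)
print(repr(first_header))
```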
