<a href="https://colab.research.google.com/github/md-marop-hossain/Filesure-Internship-Take-Home-Assignment/blob/main/extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Some of the input-field text in the form section of **"Form ADT-1-29092023_signed.pdf"** **cannot be selected or copied** because it is stored as a **scanned image** rather than as text. Libraries such as **PyMuPDF, pdfplumber, and pdfminer.six** only extract **machine-readable text**; they cannot read:
*   Pixel-based text in scanned documents
*   Text embedded within images
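
One way to decide whether a PDF needs OCR at all is to look at how much text its text layer yields. The sketch below is a rough heuristic, not part of the original notebook: `looks_scanned` and the `min_chars_per_page` threshold are hypothetical, and `page_texts` is assumed to be a list of per-page strings from a text-layer extractor (for example pdfplumber's `page.extract_text()`, which can return `None` or an empty string).

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Return True when the text layer is too sparse to be the real page content.

    page_texts: one string (or None) per page from a text-layer extractor.
    min_chars_per_page: assumed threshold; tune it for your documents.
    """
    if not page_texts:
        return True
    # Count only non-whitespace-trimmed characters; None counts as empty.
    total = sum(len((text or "").strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


print(looks_scanned(["", None, "  \n"]))
print(looks_scanned(["Notice to the Registrar by company for appointment of auditor"]))
```

For a mixed file like this one, where only some fields are images, a page-level character count can still look healthy, so treat this as a first-pass check only.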

To handle this unselectable content, I used an **OCR pipeline built on Poppler, PDF2Image, and Tesseract**.

**PDF2Image** calls **Poppler** to render each PDF page as an image, and **Tesseract** (through `pytesseract`) then recognizes the text on each image. This approach works on multi-page PDFs and on form-style layouts.
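
OCR accuracy on small form text often depends on the rendering resolution and on Tesseract's page segmentation mode. The following is a minimal sketch of how those two knobs could be set explicitly. The function names and the values `dpi=300` and `psm=6` are assumptions for illustration, not settings used in this notebook; `dpi` is a parameter of `pdf2image.convert_from_path`, and `config` is a parameter of `pytesseract.image_to_string`.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build the option string passed to Tesseract through pytesseract's `config`."""
    return f"--oem {oem} --psm {psm}"


def ocr_pdf(pdf_path, dpi=300, lang="eng", psm=6):
    """Rasterize each page with Poppler, then run Tesseract on every page image."""
    # Imported lazily so build_tesseract_config works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # higher dpi: sharper glyphs, slower OCR
    config = build_tesseract_config(psm=psm)
    return "\n".join(
        pytesseract.image_to_string(page, lang=lang, config=config) for page in pages
    )
```

Calling `ocr_pdf("Form ADT-1-29092023_signed.pdf")` would return the text of all pages joined by newlines. Whether these values improve on the defaults for this form has not been tested here.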


# **Install System Packages and Python Libraries**

In [9]:
!apt-get install -y tesseract-ocr
!apt-get install -y poppler-utils
!pip install pytesseract pdf2image Pillow

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
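
Outside Colab, a common failure is that the Python packages install but the system tools they wrap are missing. As a small sanity check, the sketch below looks for the `tesseract` binary and for `pdftoppm` (one of the Poppler tools that `pdf2image` calls) on the `PATH`. The helper name `missing_ocr_binaries` is hypothetical.

```python
import shutil


def missing_ocr_binaries(required=("tesseract", "pdftoppm")):
    """Return the names of required command-line tools that are not on PATH."""
    return [name for name in required if shutil.which(name) is None]


missing = missing_ocr_binaries()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("Tesseract and Poppler are available.")
```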


**Install the Groq client for the AI model**

In [10]:
!pip install -q -U groq

# **Extract Structured Data from the PDF**

In [11]:
import os
import re
import json
import pytesseract
from pdf2image import convert_from_path

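def extract_text_with_ocr(pdf_path):
    """Render each PDF page to an image and return the OCR text of all pages."""
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"File not found: {pdf_path}")
    try:
        # pdf2image uses Poppler to rasterize every page into a PIL image.
        images = convert_from_path(pdf_path)
        # Run Tesseract on each page and keep the pages in order.
        page_texts = [pytesseract.image_to_string(img, lang='eng') for img in images]
        return "".join(text + "\n" for text in page_texts)
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback.
        raise RuntimeError(f"OCR extraction failed: {e}") from e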

def parse_and_clean_adt1_data(text):
    data = {
        "company_name": "",
        "cin": "",
        "registered_office": "",
        "appointment_date": "",
        "auditor_name": "",
        "auditor_address": "",
        "auditor_frn_or_membership": "",
        "appointment_type": "New Appointment"
    }

    def find_between(start, end, block):
        pattern = re.escape(start) + r'(.*?)' + re.escape(end)
        match = re.search(pattern, block, re.DOTALL | re.IGNORECASE)
        return match.group(1) if match else ""

    def search_pattern(pat, block):
        # Return the first capture group if the pattern defines one, else the whole match.
        match = re.search(pat, block, re.IGNORECASE)
        if not match:
            return ""
        return match.group(1) if match.groups() else match.group(0)

    raw_cin       = search_pattern(r'(U74999[A-Z0-9]{15})', text)
    raw_company   = find_between("Name of the company", "Address of the registered office", text)
    raw_office    = find_between("Address of the registered office", "email id of the company", text)
    raw_date      = search_pattern(r'Date of appointment\s*([\d/]+)', text)
    raw_auditor   = find_between("Name of the auditor or auditor's firm", "Membership Number", text)
    raw_frn       = find_between("firm's registration number", "Address of the Auditor", text)
    raw_address   = find_between("Address of the Auditor", "email id of the auditor", text)

    def advanced_clean(s):
        if not s:
            return ""
        cleaned = re.sub(r'\s+', ' ', s).strip()
        cleaned = re.sub(r'\bor auditor\'s firm\b', '', cleaned, flags=re.IGNORECASE)
        cleaned = re.sub(r'\([a-z]\)', '', cleaned)
        cleaned = re.sub(r'[\[\]]', '', cleaned)
        cleaned = re.sub(r'\*+', '', cleaned)
        cleaned = re.sub(r'Pre-fill', '', cleaned, flags=re.IGNORECASE)
        cleaned = re.sub(r'of the company', '', cleaned, flags=re.IGNORECASE)
        cleaned = re.sub(r'[“”]', '', cleaned)
        return re.sub(r'\s+', ' ', cleaned).strip()

    def clean_address(s):
        if not s:
            return ""
        # Drop "Line I" / "Line II" labels; OCR may render the numeral as I, l, 1 or |.
        cleaned = re.sub(r'Line\s*[Il1|]+\s*/?', '', s, flags=re.IGNORECASE)
        cleaned = re.sub(r'\*City|\*State|Country|\*Pin code', '', cleaned, flags=re.IGNORECASE)
        cleaned = re.sub(r'\bor auditor\'s firm\b', '', cleaned, flags=re.IGNORECASE)
        cleaned = re.sub(r'\([a-z]\)', '', cleaned)
        cleaned = re.sub(r'\|+', '', cleaned)
        cleaned = re.sub(r'/+', ' ', cleaned)
        cleaned = re.sub(r'(\d+)\s+(\d+),', r'\1/\2,', cleaned)
        cleaned = re.sub(r'[“”]', '', cleaned)
        return re.sub(r'\s+', ' ', cleaned).strip()

    data["cin"]                     = advanced_clean(raw_cin)
    data["company_name"]            = advanced_clean(raw_company)
    data["registered_office"]       = advanced_clean(raw_office)
    data["appointment_date"]        = advanced_clean(raw_date)
    data["auditor_name"]            = advanced_clean(raw_auditor)
    data["auditor_frn_or_membership"] = re.sub(r'[^A-Z0-9]', '', raw_frn).strip()
    data["auditor_address"]         = clean_address(raw_address)

    if re.search(r"tenure of previous appointment", text, re.IGNORECASE):
        data["appointment_type"] = "Reappointment"

    return data

def main():
    pdf_file    = "Form ADT-1-29092023_signed.pdf"
    output_json = "output.json"

    ocr_text    = extract_text_with_ocr(pdf_file)
    structured  = parse_and_clean_adt1_data(ocr_text)

    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(structured, f, indent=4, ensure_ascii=False)

    print(f"Extraction complete. Data saved to '{output_json}'")

if __name__ == "__main__":
    main()


Extraction complete. Data saved to 'output.json'
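
Because regex extraction over OCR text fails silently (a missed label just produces an empty string), it can help to validate the record before using it. The sketch below is one possible check, not part of the original notebook. It assumes the general CIN layout of 21 characters (L or U, a 5-digit industry code, a 2-letter state code, a 4-digit year, a 3-letter company class, and a 6-digit registration number) and assumes the appointment date is written as DD/MM/YYYY. `validate_record` and the sample values are hypothetical.

```python
import re
from datetime import datetime

# Assumed general CIN layout (21 characters); adjust if your filings differ.
CIN_PATTERN = re.compile(r"[LU]\d{5}[A-Z]{2}\d{4}[A-Z]{3}\d{6}")


def validate_record(record):
    """Return a list of human-readable problems found in an extracted record."""
    problems = []
    # Any blank field usually means a label was not found in the OCR text.
    for field, value in record.items():
        if not str(value).strip():
            problems.append(f"{field} is empty")

    cin = record.get("cin", "")
    if cin and not CIN_PATTERN.fullmatch(cin):
        problems.append(f"cin has an unexpected format: {cin!r}")

    date = record.get("appointment_date", "")
    if date:
        try:
            datetime.strptime(date, "%d/%m/%Y")  # assumed DD/MM/YYYY
        except ValueError:
            problems.append(f"appointment_date is not DD/MM/YYYY: {date!r}")
    return problems


# Hypothetical record with made-up values, used only to exercise the checks.
sample = {
    "company_name": "Example Foods Private Limited",
    "cin": "U12345KA2020PTC123456",
    "appointment_date": "31/02/2023",
    "auditor_name": "",
}
for problem in validate_record(sample):
    print(problem)
```

Running `validate_record` on the dictionary loaded from `output.json` would list any empty or malformed fields before they reach the summary step.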


# **Generate an AI-style summary based on the JSON using the Groq API**

In [12]:
import os
import json
from groq import Groq
from google.colab import userdata
from google.colab import files

try:
    api_key = userdata.get('GROQ_API_KEY')
    client = Groq(api_key=api_key)
    print("Groq API Key loaded and client initialized successfully.")
except userdata.SecretNotFoundError:
    print("Error: Secret 'GROQ_API_KEY' not found.")
    print("Add it in the Colab Secrets panel and grant this notebook access to it.")
except Exception as e:
    print(f"An unexpected error occurred while loading the secret: {e}")


json_file_path = "/content/output.json"
print(f"Path set to '{json_file_path}'.")
print("Please ensure the file exists at this path before proceeding.")


def generate_summary_with_groq(json_data):
    if not json_data:
        return "No data to summarize."
    prompt = f"""
    Based on the following structured data, please generate a 3-5 line summary.
    This summary should sound like it came from an AI assistant explaining a corporate filing
    to a non-technical person.

    JSON Data:
    {json.dumps(json_data, indent=2)}

    Example Summary Format:
    "XYZ Pvt Ltd has appointed M/s Rao & Associates as its statutory auditor for FY 2023-24,
    effective from 1 July 2023. The appointment has been disclosed via Form ADT-1,
    with all supporting documents submitted."
    """

    try:
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that summarizes corporate filings for a non-technical audience."
                },
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model="llama3-8b-8192",
            temperature=0.7,
            max_tokens=150,
        )
        return chat_completion.choices[0].message.content.strip()
    except Exception as e:
        # Raise instead of returning the message, so an error is never saved as the summary.
        raise RuntimeError(f"Groq API call failed: {e}") from e

if json_file_path and 'client' in locals():
    try:
        with open(json_file_path, 'r', encoding='utf-8') as f:
            extracted_data = json.load(f)

        print("Generating summary with Groq...")
        summary = generate_summary_with_groq(extracted_data)

        print("\n--- AI-Generated Summary ---")
        print(summary)
        print("--------------------------\n")
        summary_filename = "summary.txt"
        with open(summary_filename, "w", encoding="utf-8") as f:
            f.write(summary)
        print(f"Summary saved to '{summary_filename}'.")
        print("Initiating download...")
        files.download(summary_filename)

    except FileNotFoundError:
        print(f"Error: The file '{json_file_path}' was not found. Please upload it before running this cell.")
    except json.JSONDecodeError:
        print(f"Error: Could not decode JSON from the file '{json_file_path}'.")
    except Exception as e:
        print(f"An unexpected error occurred during execution: {e}")
else:
    print("Cannot proceed: the Groq client was not initialized.")

Groq API Key loaded and client initialized successfully.
Path set to '/content/output.json'.
Please ensure the file exists at this path before proceeding.
Generating summary with Groq...

--- AI-Generated Summary ---
Here is a summary of the corporate filing:

"Alupa Foods Private Limited has reappointed Mallya & Mallya as its statutory auditor, with effect from September 26, 2022. The auditor's address is 29/2, 1st Floor, Parijatha Complex, Race Course Road, Bangalore, Karnataka. This appointment has been disclosed through a regulatory filing."
--------------------------

Summary saved to 'summary.txt'.
Initiating download...
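
A language model can paraphrase, reorder, or occasionally drop facts, so a quick check that the summary still mentions the key extracted values can catch obvious misses. The sketch below is a crude substring check under stated assumptions: `missing_facts` is a hypothetical helper, the field names come from the JSON schema above, and the sample values are taken from the example summary in the prompt rather than from the real filing.

```python
def missing_facts(summary, record, fields=("company_name", "auditor_name")):
    """Return the fields whose extracted value does not appear in the summary text."""
    # Normalize case and whitespace so line breaks in the summary do not matter.
    haystack = " ".join(summary.lower().split())
    missing = []
    for field in fields:
        value = " ".join(str(record.get(field, "")).lower().split())
        if value and value not in haystack:
            missing.append(field)
    return missing


# Illustrative values only, borrowed from the example summary in the prompt.
example_record = {"company_name": "XYZ Pvt Ltd", "auditor_name": "M/s Rao & Associates"}
example_summary = "XYZ Pvt Ltd has appointed M/s Rao & Associates as its statutory auditor."
print(missing_facts(example_summary, example_record))
```

An exact-substring test will miss legitimate rewordings (for example an abbreviated company name), so a non-empty result is a prompt to review the summary, not proof that it is wrong.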

