# 📘 Proof of Concept: Automated Proposal Section Extraction (Executive Summary Focus)

## 🧩 Objective
This POC aims to explore an automated method to:
- Extract raw text from consulting proposals (PDFs)
- Clean and preprocess the content
- Identify and extract key high-level sections, including:
  - **Executive Summary**
  - **Background and Context (Situation)**
  - **Problem Statement (Complication)**
  - **Proposed Solution**
  - **Approach / Methodology**
  - **Costing and Pricing**

Because proposals differ in structure and use inconsistent headings, detecting the correct sections reliably is the central problem.

---

## 🔍 Challenge
Consulting proposals do *not* have a consistent format.  
For example:
- The executive summary may appear under “About Us”, “Overview”, or “Introduction”.
- Background might be mixed with Approach or Case Study.
- Pricing may appear as tables, lists, or embedded within paragraphs.

Rule-based extraction keyed to fixed heading names tends to fail because of this variability.

---

## 🚀 Proposed Approach (RAG-Based Section Classification)

To solve this, we explore a **RAG-style pipeline**:

1. **Extract text from each PDF page**  
   Convert each page to plain text for processing.

2. **Preprocess and clean text**  
   Normalize whitespace, remove artifacts, fix line breaks, etc.

3. **Page-Level Section Classification**  
   For each page, we use an LLM classifier to determine *which section* the text belongs to.

4. **Tag and Store Sections**  
   Each page is assigned a label such as:  
   `executive_summary`, `background`, `problem`, `solution`, `approach`, `pricing`, etc.

5. **Vector Store Ingestion**  
   The tagged and cleaned sections are inserted into a vector store for:
   - Semantic retrieval  
   - Query answering  
   - Proposal generation  
   - Automated structuring  
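
Step 2 above (preprocess and clean text) is not implemented in the cells that follow, so here is a minimal sketch of what it could look like. The function name `clean_page_text` and the specific rules are assumptions for illustration, not part of the notebook; real proposals may need extra rules for headers, footers, and tables.

```python
import re


def clean_page_text(raw: str) -> str:
    """Normalize text extracted from one PDF page (hypothetical helper)."""
    # Use a single newline convention.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words split across a line break: "implemen-\ntation" -> "implementation".
    # Caveat: this also merges genuinely hyphenated words that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line so whitespace-only lines count as blank lines.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Blank lines separate paragraphs; single line breaks inside a paragraph are joined.
    paragraphs = [" ".join(p.split("\n")).strip() for p in re.split(r"\n{2,}", text)]
    return "\n\n".join(p for p in paragraphs if p)
```

In the extraction loop, this could be applied to the result of `page.extract_text()` before the page is stored.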

---

## 🎯 Focus of This POC
For simplicity, this POC focuses **only on extracting the Executive Summary**.  
Once validated, the same classification logic will be extended to the remaining sections.

The goal is to validate:
- Can we reliably detect the Executive Summary even when the PDF uses non-standard headings?
- Can RAG-style classification improve consistency across different proposal formats?
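
One way to make "reliably detect" measurable is to compare the pages the pipeline tags against a small hand-labelled set and report precision and recall. The sketch below assumes pages are identified by a `(file name, page number)` pair; the function name and the sample data are hypothetical, not results from this POC.

```python
from typing import Set, Tuple

PageId = Tuple[str, int]  # (file name, zero-based page number)


def precision_recall(predicted: Set[PageId], actual: Set[PageId]) -> Tuple[float, float]:
    """Return (precision, recall) for page-level section detection."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example data: pages the pipeline tagged vs. pages a reviewer labelled.
predicted_pages = {("a.pdf", 0), ("a.pdf", 1), ("b.pdf", 2)}
labelled_pages = {("a.pdf", 0), ("b.pdf", 1), ("b.pdf", 2)}
precision, recall = precision_recall(predicted_pages, labelled_pages)
```

Running the same comparison on proposals with very different layouts would give a rough answer to the consistency question as well.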

---

## 📌 Outcome
This POC will help determine whether a hybrid of:
- Page-level extraction  
- LLM-based classification  
- Vector-store-backed retrieval  

…is the right foundation for a scalable **Proposal Understanding Engine**.



In [None]:
import os
from PyPDF2 import PdfReader
from dotenv import load_dotenv

In [None]:
load_dotenv()

In [None]:
raw_pdf_dir = "../data/raw/proposals"

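# Keep only PDFs, in a stable order, so stray files in the directory are not passed to PdfReader.
pdf_files = sorted(os.path.join(raw_pdf_dir, f) for f in os.listdir(raw_pdf_dir) if f.lower().endswith(".pdf"))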

In [None]:
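extracted_pdf_files = []

for pdf_file in pdf_files:
    pdf_reader = PdfReader(pdf_file)

    # page_num is zero-based and restarts for each PDF, so the source path is stored with it.
    for page_num, page in enumerate(pdf_reader.pages):
        text = page.extract_text() or ""  # image-only pages may yield no text
        extracted_pdf_files.append({'source': pdf_file, 'page_num': page_num, 'text': text})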

In [None]:
extracted_pdf_files

In [None]:
executive_summary_page = []

executive_summary_keywords = ("About", "ABOUT", "Overview", "OVERVIEW", "Introduction", "INTRODUCTION",
                              "Summary", "SUMMARY", "Company", "COMPANY", "Executive","EXECUTIVE")

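# Pre-filter: keep pages among the first four of a PDF (page_num 0-3) whose text contains a keyword.
# The substring match is case-sensitive, which is why each keyword is listed in two casings.
for pf in extracted_pdf_files:
    if pf['page_num'] < 4 and any(keyword in pf['text'] for keyword in executive_summary_keywords):
        executive_summary_page.append(pf)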
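
The filter above matches keywords anywhere in the page text. A possible refinement, sketched below as an assumption rather than the notebook's method, is to look for keywords only in short lines, which are more likely to be headings. The helper name `has_heading_keyword` and the `max_words` threshold are hypothetical.

```python
import re

HEADING_KEYWORDS = ("about", "overview", "introduction", "summary", "company", "executive")
# Whole-word, case-insensitive match, so one lowercase list covers every casing.
_KEYWORD_RE = re.compile(r"\b(" + "|".join(HEADING_KEYWORDS) + r")\b", re.IGNORECASE)


def has_heading_keyword(page_text: str, max_words: int = 8) -> bool:
    """Return True if a short, heading-like line contains one of the keywords."""
    for line in page_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and long lines, which are probably body text.
        if not stripped or len(stripped.split()) > max_words:
            continue
        if _KEYWORD_RE.search(stripped):
            return True
    return False
```

This trades some recall for fewer false positives: a page whose only keyword sits inside a long sentence is no longer selected.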

In [None]:
executive_summary_page

In [None]:
from langchain_groq import ChatGroq

In [None]:
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    api_key=os.getenv('GROQ_API_KEY'),
    temperature=0,
    model_kwargs={
        "top_p": 1    
    }
)

In [None]:
from langchain_core.prompts import PromptTemplate

text_relevant_prompt = PromptTemplate.from_template("""
You are a classifier. Determine whether this section is acting as an 
executive summary in a consulting proposal.

Executive summary indicators:
- About the firm and what the firm does
- High-level overview of the client’s goals or needs
- High-level description of the provider and solution
- Explains value, outcomes, and strategic approach
- Non-technical, business-focused language

Return only "YES" or "NO".
Return "YES" if at least one indicator applies; otherwise return "NO".

Section Content:
{context}
""")

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

relevance_rag_chain = (
    {"context": RunnablePassthrough()}
    | text_relevant_prompt
    | llm
    | StrOutputParser()
    
)

In [None]:
for summary in executive_summary_page:
    answer = relevance_rag_chain.invoke(summary['text'])
    # Normalize the reply: the model may add whitespace, punctuation, or different casing.
    if answer.strip().upper().startswith('YES'):
        summary['type'] = "Executive Summary"

In [None]:
executive_summary_page

In [None]:
from langchain_core.documents import Document

splits = [Document(page_content=executive['text'], metadata={'id': str(i)})
          for i, executive in enumerate(executive_summary_page)
          if executive.get('type') == 'Executive Summary']

In [None]:
len(splits)

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

In [None]:
bge_embedding = HuggingFaceBgeEmbeddings(
    model_name='BAAI/bge-large-en-v1.5',
    model_kwargs={"device": "cpu"},
)

In [None]:
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(
    documents=splits,
    embedding=bge_embedding,
    persist_directory="./bge_chroma_store"
)

In [None]:
dense_retriever = db.as_retriever(search_kwargs={"k": 10})

In [None]:
dense_retriever.invoke("Kumar Digital is a results-driven digital marketing agency specializing in SEO, social media, and performance marketing. We blend data-backed strategies with creative execution to help businesses increase visibility and generate qualified leads. With a focus on measurable outcomes, our team delivers transparent, optimized, and growth-focused marketing solutions.")