An end-to-end AI pipeline that converts unstructured company data into investment-grade M&A teaser decks.
This is a deterministic, reproducible pipeline designed to automate the early-stage investment analysis workflow typically performed by junior investment banking and private equity teams.
It ingests raw company data (PDFs, Excel models, Markdown briefs), enriches it with targeted public-web intelligence, extracts high-density, non-marketing investment facts using a rigorously constrained LLM agent, and programmatically renders:
- a native, editable PowerPoint teaser deck
- a citation audit document tracing data sources
No screenshots. No slide templates. No manual formatting.
Early-stage M&A analysis suffers from three structural problems:
- **Unstructured inputs**: company data arrives as PDFs, Excel sheets, one-pagers, and notes.
- **Low signal-to-noise summaries**: generic AI summaries produce marketing fluff rather than investment facts.
- **Manual slide production**: analysts spend hours formatting decks instead of thinking.
The pipeline solves this by enforcing:
- data density over prose
- strict schemas over free-form text
- code-driven slide construction
- **Ingests private company data**: PDFs, Excel files, Markdown, and text files.
- **Augments missing context via public web search**: focused queries for products, certifications, customers, and financials.
- **Builds a unified "truth context"**: private and public data fused into a single analysis source.
- **Runs a highly constrained LLM extraction agent**: outputs structured JSON only, no prose.
- **Renders investment slides programmatically**: business overview, financials, KPIs, and thesis.
- **Generates a citation audit document**: ensures traceability of claims.
The system follows a GenAI-adapted ETL pipeline that converts unstructured company data into investment-grade outputs via structured LLM extraction and deterministic rendering.
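The stage sequence above can be sketched as a simple orchestration function. This is an illustrative sketch, not the project's actual API: the function names, the dependency-injection style, and the company-name derivation are all assumptions.

```python
from pathlib import Path

def run_pipeline(input_path: Path, ingest, analyze, render_ppt, write_citations):
    """Illustrative orchestration: each pipeline stage is injected as a callable."""
    # Assumption: a path with a file suffix is a single file; otherwise a data pack.
    company = input_path.stem if input_path.suffix else input_path.name
    context = ingest(input_path)                      # unstructured files -> text corpus
    analysis = analyze(context)                       # text corpus -> structured JSON dict
    deck_path = render_ppt(company, analysis)         # JSON -> .pptx
    audit_path = write_citations(company, analysis)   # JSON -> .docx
    return deck_path, audit_path
```

Injecting the stages as callables keeps each step independently testable, which matches the deterministic, reproducible design goal.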
.
├── analyze.py              # LLM agent & extraction logic
├── ingest.py               # Data ingestion + web enrichment
├── ppt_engine.py           # Programmatic PowerPoint renderer
├── generate_citations.py   # Citation audit document generator
├── main.py                 # CLI entrypoint & pipeline orchestration
├── check_models.py         # Gemini model availability checker
├── utils.py                # Image download helper (Pexels)
├── requirements.txt        # Python dependencies
├── examples/               # INPUT: files or data-pack folders
└── Final_Submissions/      # OUTPUT: PPT + citations
Purpose:
Convert heterogeneous inputs into a single, analyzable text corpus.
Supported Inputs
- `.pdf`: parsed via LlamaParse (handles tables & scanned docs)
- `.xlsx` / `.xls`: flattened using pandas
- `.md` / `.txt`: read directly
- Folder-based "data packs" (multiple files per company)
Public Web Augmentation
- Uses Tavily Search
- Executes targeted queries for:
- product capabilities
- certifications & awards
- revenue / geography indicators
Key Design Choice
The pipeline never assumes private data is complete.
Public data is always used to fill analytical gaps.
All extracted text is saved to: {CompanyName}_FULL_CONTEXT.txt for full transparency and debugging.
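The fusion step can be sketched as follows. The section labels and function name are illustrative, not the project's actual format:

```python
def build_truth_context(company: str, private_chunks: list[str], public_chunks: list[str]) -> str:
    """Fuse private and public text into one labeled corpus (labels are illustrative)."""
    parts = [f"=== PRIVATE DATA: {company} ==="]
    parts.extend(private_chunks)
    parts.append("=== PUBLIC WEB DATA ===")
    parts.extend(public_chunks)
    return "\n\n".join(parts)
```

The resulting string is what would be written to `{CompanyName}_FULL_CONTEXT.txt`; labeling the private and public sections keeps the provenance of each chunk visible during debugging.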
This is the core intelligence layer.
Model
- Google Gemini (gemini-2.5-flash or compatible), chosen for:
  - very large context windows
  - reliable JSON compliance
Critical Constraints Enforced
- No marketing language
- No vague adjectives
- No "N/A"
- Numbers preferred over words
- Explicit inference, tagged with (est.)
- Slide-ready sentences (≤ 20 words)
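Constraints like these can also be checked after the fact. A minimal post-validation sketch, assuming a simplified banned-word list (the real agent enforces these via its prompt, not necessarily via code like this):

```python
def violates_constraints(sentence: str) -> list[str]:
    """Check one extracted sentence against the output rules (simplified)."""
    problems = []
    if len(sentence.split()) > 20:
        problems.append("over 20 words")
    if "n/a" in sentence.lower():
        problems.append("contains N/A")
    # Illustrative subset of marketing vocabulary to reject.
    banned = {"innovative", "world-class", "cutting-edge", "leading"}
    if any(word.strip(".,").lower() in banned for word in sentence.split()):
        problems.append("marketing language")
    return problems
```

Running such checks on every extracted sentence turns prompt-level rules into hard guarantees before anything reaches a slide.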
Output: a rigid, predefined JSON schema covering:
- business overview
- infrastructure metrics
- product capabilities
- applications & certifications
- financial indicators
- investment thesis
Saved as: {CompanyName}_ANALYSIS.json
This file is the single source of truth for all downstream steps.
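Because everything downstream depends on this file, it is worth validating eagerly. A sketch of a fail-fast check, where the section key names are assumptions based on the list above, not the project's actual schema:

```python
import json

# Assumed key names mirroring the sections listed above.
REQUIRED_SECTIONS = [
    "business_overview", "infrastructure_metrics", "product_capabilities",
    "applications_certifications", "financial_indicators", "investment_thesis",
]

def validate_analysis(raw: str) -> dict:
    """Parse the agent's JSON output and fail fast on missing sections."""
    data = json.loads(raw)  # raises ValueError if the agent emitted prose
    missing = [key for key in REQUIRED_SECTIONS if key not in data]
    if missing:
        raise ValueError(f"analysis JSON missing sections: {missing}")
    return data
```

Failing here, rather than during rendering, keeps slide generation deterministic: it never has to guess around an incomplete schema.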
Purpose:
Render investment-grade slides using code, not templates.
Key Characteristics
- Uses python-pptx
- Draws:
- vector shapes
- text boxes
- metric tiles
- native charts
- All text remains editable in PowerPoint
Slides Generated
- **Business Overview**
  - company profile
  - infrastructure highlights
  - product & capability grid
- **Financial Performance**
  - revenue & margin cards
  - revenue growth chart
  - operational KPIs
- **Investment Thesis**
  - evidence-backed investment hooks
- **Legal Disclaimer**
Design Philosophy
- No fragile XML hacks
- No version-specific PowerPoint features
- Clean, conservative "consulting-grade" layout
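Code-driven layout means positions are computed, not dragged. A sketch of how metric-tile coordinates could be derived (python-pptx measures shapes in English Metric Units; the function and parameter names here are illustrative, not the renderer's actual code):

```python
# python-pptx positions shapes in English Metric Units: 914400 EMU per inch.
EMU_PER_INCH = 914400

def tile_grid(n_tiles: int, cols: int, left_in: float, top_in: float,
              tile_w_in: float, tile_h_in: float, gap_in: float = 0.2):
    """Compute (left, top) EMU coordinates for a row-major grid of metric tiles."""
    positions = []
    for i in range(n_tiles):
        row, col = divmod(i, cols)
        left = left_in + col * (tile_w_in + gap_in)
        top = top_in + row * (tile_h_in + gap_in)
        positions.append((int(left * EMU_PER_INCH), int(top * EMU_PER_INCH)))
    return positions
```

Each (left, top) pair could then feed a `slide.shapes.add_textbox(left, top, width, height)` call, giving identical layouts on every run.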
Purpose:
Create an audit trail for extracted insights.
- Reads the citations field from the analysis JSON
- Generates a .docx file listing data sources
- Intended for:
- internal review
- compliance checks
- analyst validation
Output: Final_Submissions/{CompanyName}_Citations.docx
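The formatting step can be sketched independently of the .docx writer. The field names (`claim`, `source`) and the line format are assumptions, not the project's actual citation schema:

```python
def format_citations(citations: list[dict]) -> list[str]:
    """Render citation entries as numbered audit lines (field names are assumed)."""
    lines = []
    for i, entry in enumerate(citations, start=1):
        claim = entry.get("claim", "")
        source = entry.get("source", "unknown source")
        lines.append(f"[{i}] {claim} ({source})")
    return lines
```

Each formatted line would then become one paragraph in the generated Word document, so reviewers can trace every slide claim back to a file or URL.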
This is the CLI controller.
Capabilities
- Detects whether input is:
- a single file
- a multi-file data pack folder
- Automatically derives company name
- Runs:
- analysis
- slide generation
- citation generation
Users interact via a simple numeric menu.
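The detection logic behind that menu can be sketched as follows. This is an illustrative version; the real CLI may differ, and the supported-extension set is assumed from the ingestion section above:

```python
from pathlib import Path

# Assumed from the ingestion section: extensions the pipeline can parse.
SUPPORTED = {".pdf", ".xlsx", ".xls", ".md", ".txt"}

def list_inputs(examples_dir: Path) -> list[tuple[str, str]]:
    """Enumerate single files and data-pack folders, as the menu would show them."""
    items = []
    for path in sorted(examples_dir.iterdir()):
        if path.is_dir():
            items.append(("FOLDER", path.name))
        elif path.suffix.lower() in SUPPORTED:
            items.append(("FILE", path.name))
    return items
```

Sorting the entries keeps menu numbering stable across runs, in line with the pipeline's reproducibility goal.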
- Python 3.10+
- API keys for:
- Google Gemini
- Tavily Search
- LlamaParse
- Pexels (optional, for images)
git clone https://github.com/neepun06/AI-ML-GC-RND.git
cd AI-ML-GC-RND
python -m venv venv
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
pip install -r requirements.txt
Create a .env file in the project root:
GEMINI_API_KEY=your_gemini_key
TAVILY_API_KEY=your_tavily_key
LLAMA_CLOUD_API_KEY=your_llama_key
PEXELS_API_KEY=your_pexels_key
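For reference, a minimal .env loader looks like the sketch below. The project may well use python-dotenv instead; this stdlib-only version is an assumption:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments, existing env vars win."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # fail-soft: missing .env means keys come from the shell environment
```

Using `setdefault` means variables already exported in the shell take precedence over the file, which is the usual convention.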
Option A β Single File
examples/
└── Company-OnePager.md
Option B β Data Pack Folder
examples/
└── Company_Data/
    ├── annual_report.pdf
    ├── financials.xlsx
    └── notes.md
python main.py
--- GENERATOR ---
Files and Data Pack Folders found:
[1] [FOLDER] Company_Data
[2] [FILE] Company-OnePager.md
Select Item #: 1
Final_Submissions/
├── Company_Teaser_Atomic.pptx
└── Company_Citations.docx
- Strict JSON schema was chosen over flexibility to guarantee slide safety.
- Estimation over N/A reflects real analyst behavior.
- Programmatic slides avoid template lock-in.
- Public web enrichment reduces dependency on perfect private data.
- Fail-soft design ensures partial data never breaks the pipeline.
- API keys are excluded via .gitignore
- No data is transmitted except to configured APIs
- Generated outputs are local only
Private project / hackathon submission. Not intended for public commercial redistribution.
This is a deterministic analytical system that treats LLMs as controlled extraction engines, not creative writers. This design choice is intentional.
