# **Startup Pitch Text Evaluation with NLP**

**Hello, I’m Mohan Raj Murugesan, a student at Coimbatore Institute of Technology, contactable at [Mohan Raj Murugesan](https://www.linkedin.com/in/mohan-raj-m-450560225). My project, *Startup Pitch Text Evaluation with NLP*, is an pipeline designed to evaluate startup pitch decks for investors, developed as part of the `Startup_PitchTextEvaluation_WithNLP` notebook.**

---

The purposed pipeline processes PDF pitch decks, extracting text, categorizing it into investor-relevant sections (e.g., Problem, Market), and scoring them using advanced NLP techniques. It computes a normalized final score (0–100) by combining BART’s zero-shot quality scores (70% weight) and VADER’s sentiment scores (30% weight) across six dimensions—Problem, Market, Traction, Team, Business Model, Vision/Moat, plus a deck-wide Confidence score. The system also generates summaries and insights with T5, classifies industries using BART with a keyword-based fallback, clusters decks with KMeans, and presents results through an interactive Dash dashboard featuring radar charts, bar charts, heatmaps, and word clouds.


# Downloading Dependencies

In [1]:
%pip install pdfplumber
%pip install vaderSentiment
%pip install dash
!pip install pdfplumber spacy vaderSentiment transformers pandas plotly==5.15.0 kaleido==0.2.1 pytesseract
!apt-get install -y tesseract-ocr libtesseract-dev
!python -m spacy download en_core_web_sm

Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.7-py3-none-any.whl (60 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.0/60.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

# Import libraries


The README provides a clear summary:
- **Inputs**: Pitch deck PDFs in a folder (`/content/pdf_decks`).
- **Processes**: Text extraction, categorization, sentiment analysis, scoring, summarization, and clustering.
- **Outputs**: A DataFrame with analysis results, visualizations (radar, bar, heatmap, word cloud), and an interactive dashboard.
- **Dependencies**: Libraries for PDF processing (`pdfplumber`, `pytesseract`), NLP (`spacy`, `vaderSentiment`, `transformers`), visualization (`plotly`, `dash`, `wordcloud`), and clustering (`scikit-learn`).

Key Components and Methods
  - **pdfplumber, pytesseract, PIL**: Handle PDF text extraction and OCR for image-based content.
  - **spacy, vaderSentiment, transformers**: Enable NLP tasks (entity recognition, sentiment analysis, scoring, summarization).
  - **pandas, numpy**: Manage and process data (e.g., storing scores, summaries, and metadata in a DataFrame).
  - **plotly, dash, matplotlib, wordcloud**: Create visualizations (radar charts, bar charts, heatmaps, word clouds) and an interactive dashboard.
  - **os, pathlib, re**: Handle file operations and text preprocessing.
  - **sklearn.cluster.KMeans**: Groups pitch decks into clusters for comparative analysis.
  - **logging**: Tracks pipeline progress and errors.
  - **google.colab.files**: Supports file downloads in Colab environments.


In [1]:
import pdfplumber
import pandas as pd
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline, AutoTokenizer
import plotly.express as px
import plotly.graph_objects as go
import dash
from dash import dcc, html
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os
from pathlib import Path
import re
import numpy as np
import pytesseract
from PIL import Image
import io
import base64
from sklearn.cluster import KMeans
import logging
try:
    from google.colab import files  # For Colab downloads
except ImportError:
    files = None

# Input

In [2]:
from google.colab import files
uploaded = files.upload()
!mkdir -p /content/pdf_decks
!mv *.pdf /content/pdf_decks

Saving 6737d05825e11f73f6d5a289_Ndc8GMUtaMNHOXDfqftyW1Jb7b5h2JE_ThY_Joc5Cf8.pdf to 6737d05825e11f73f6d5a289_Ndc8GMUtaMNHOXDfqftyW1Jb7b5h2JE_ThY_Joc5Cf8.pdf
Saving doordash-pitch-deck.pdf to doordash-pitch-deck.pdf
Saving FACEBOOK.pdf to FACEBOOK.pdf
Saving Pitch-Example-Air-BnB-PDF.pdf to Pitch-Example-Air-BnB-PDF.pdf
Saving uber-pitch-deck.pdf to uber-pitch-deck.pdf


# Initialize NLP tools

### Why These Methods?
- **spaCy (`en_core_web_sm`)**: Chosen for its lightweight efficiency and robust entity recognition, ideal for parsing structured pitch deck sections.
- **VADER**: Selected for its speed and suitability for short, persuasive texts, providing a quick sentiment metric relevant to investor perception.
- **BART (`facebook/bart-large-mnli`)**: Used for its zero-shot flexibility, enabling scoring across diverse dimensions without training data, a key advantage for a generalizable pipeline.
- **T5 (`t5-base`)**: Preferred for its abstractive summarization capabilities, delivering concise, investor-friendly summaries.
- **BART Tokenizer**: Essential for preprocessing text to match BART’s requirements, ensuring accurate scoring.
- **Zero-Shot Classification (BART)**: Using BART for scoring pitch decks is experimental because zero-shot classification is less common in this domain than supervised methods. It’s innovative as it eliminates the need for a labeled dataset of scored decks, which is often unavailable. However, its accuracy depends on well-crafted prompts, and it may require validation against human investor scores.
- **T5 Summarization**: While T5 is a standard choice for summarization, applying it to pitch decks (which have varied structures and jargon) is somewhat experimental. The model’s ability to generate abstractive summaries ensures flexibility but may need fine-tuning for domain-specific terms.

In [3]:
nlp = spacy.load("en_core_web_sm")
vader = SentimentIntensityAnalyzer()
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
summarizer = pipeline("summarization", model="t5-base")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cpu


# Placeholder for extract_text_hybrid_easyocr (assumed to return list of slide texts)

- **Text Extraction (`extract_text_hybrid_easyocr`)**:
  - This function is the first step in the pipeline, converting raw PDFs into usable text. It feeds into subsequent steps like section parsing (via spaCy), sentiment analysis (VADER), scoring (BART), and summarization (T5).
  - It combines `pdfplumber` for text-based PDFs and `pytesseract` for image-based content, ensuring comprehensive text extraction. Grayscale conversion and specific Tesseract configurations optimize OCR for pitch deck layouts.
  - **Logic**: Ensures all content is captured, whether text-based or image-based, making the pipeline robust to diverse PDF formats.
- **Text Cleaning (`clean_text`)**:
  - Prepares extracted text for NLP tasks by removing noise, ensuring models focus on meaningful content.
  - Removes formatting noise (newlines, page numbers, fractions) to prepare text for NLP tasks, focusing on meaningful content.
  - **Logic**: Cleansed text improves the accuracy of downstream tasks like summarization and scoring by reducing irrelevant tokens.

The hybrid extraction approach maximizes coverage across PDF types, while the cleaning function ensures high-quality input for summarization and scoring. The methods are practical, leveraging lightweight tools (`pdfplumber`, `pytesseract`, regex) suitable for pitch deck analysis.

In [4]:
def extract_text_hybrid_easyocr(pdf_file):
    """Extract text from PDF using hybrid method (pdfplumber + EasyOCR)."""
    text = []
    try:
        with pdfplumber.open(pdf_file) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text and len(page_text.strip()) > 10:
                    text.append(page_text.strip())
                else:
                    # Fallback to OCR (placeholder for EasyOCR)
                    try:
                        img = page.to_image(resolution=300).original
                        img = img.convert("L")  # Grayscale
                        page_text = pytesseract.image_to_string(img, config='--psm 6 --oem 3')
                        text.append(page_text.strip())
                    except Exception as e:
                        logging.warning(f"OCR failed for {pdf_file} on page {page.page_number}: {e}")
                        text.append("")
        return text if text else ["No text extracted"]
    except Exception as e:
        logging.error(f"Error processing {pdf_file}: {e}")
        return ["No text extracted"]

# Placeholder for clean_text (assumed to clean text for summarization)
def clean_text(text):
    """Clean text for summarization."""
    text = re.sub(r"\n\s*\n", "\n", text)
    text = re.sub(r"\d+/\d+", "", text)
    text = re.sub(r"(\bpage\b|\b\d+\b)", "", text, flags=re.IGNORECASE)
    return text.strip()

# Helper function to truncate text to max tokens

In [5]:
# Initialize logging
logging.basicConfig(level=logging.INFO)

The `truncate_to_max_tokens` function tokenizes input text using the BART tokenizer, truncates it to fit within a 512-token limit, and decodes it back to a string, ensuring compatibility with transformer models like BART and T5. It’s a critical preprocessing step to prevent model errors due to excessive input length.
- **The function uses the model-specific BART tokenizer to ensure accurate tokenization for scoring, with a default 512-token limit suitable for pitch deck content. Decoding ensures the output is usable for downstream tasks. The approach is simple and robust, aligning with the pipeline’s need to process concise, structured text.**
- **Using the BART tokenizer for both BART and T5 tasks is a pragmatic simplification but could be refined by adding a T5-specific tokenizer for better summarization performance.**

In [6]:
def truncate_to_max_tokens(text, max_tokens=512):
    """Truncate text to fit within max_tokens for the model."""
    tokens = tokenizer(text, truncation=True, max_length=max_tokens, return_tensors="pt")
    truncated_text = tokenizer.decode(tokens["input_ids"][0], skip_special_tokens=True)
    return truncated_text

# Step 1: Categorize Text

### Purpose
The `categorize_text` function processes extracted text from pitch deck pages (e.g., from `extract_text_hybrid_easyocr`) and organizes it into predefined sections (Problem, Solution, Market, etc.) using a combination of keyword matching and spaCy’s named entity recognition (NER). This structured output facilitates downstream tasks like scoring (BART) and summarization (T5).

---

The function `extract_text_hybrid_easyocr` follows text extraction and cleaning `clean_text`, and precedes scoring (BART) and summarization (T5). It structures raw text into investor-relevant sections for targeted analysis.
- **By organizing text into sections like "Problem" or "Market", the function enables:**
  - **Scoring**: BART can score specific sections (e.g., "Problem Clarity" based on the "Problem" text).
  - **Summarization**: T5 can summarize individual sections or the entire deck using categorized text.
  - **Analysis**: Sections like "Traction" or "Business Model" provide focused data for investor insights.
- **Example Workflow**:
  1. Extract text from a PDF (`extract_text_hybrid_easyocr`).
  2. Clean the text (`clean_text`).
  3. Categorize the text into sections (`categorize_text`).
  4. Truncate section text (`truncate_to_max_tokens`) and feed to BART for scoring or T5 for summarization.

In [7]:
def categorize_text(pages):
    """Categorize text into sections using keywords and spaCy entities."""
    sections = {
        "Problem": ["problem", "challenge", "pain point", "issue", "need", "obstacle", "difficulty", "barrier"],
        "Solution": ["solution", "product", "offering", "solve", "platform", "service", "technology", "app"],
        "Market": ["market", "TAM", "SAM", "SOM", "opportunity", "industry", "segment", "potential", "demand"],
        "Traction": ["traction", "growth", "users", "revenue", "metrics", "customers", "sales", "adoption", "engagement"],
        "Team": ["team", "founder", "experience", "background", "advisor", "staff", "leadership", "executive", "member"],
        "Business Model": ["business model", "revenue", "monetization", "pricing", "subscription", "income", "profit", "model"],
        "Vision/Moat": ["vision", "moat", "advantage", "IP", "patent", "defensibility", "strategy", "differentiation", "unique"]
    }
    categorized = {key: [] for key in sections}
    for page in pages:
        if page == "No text extracted":
            continue
        doc = nlp(page)
        current_section = None
        for line in page.split("\n"):
            line_lower = line.lower()
            for section, keywords in sections.items():
                if any(keyword in line_lower for keyword in keywords):
                    current_section = section
                    break
            if current_section:
                categorized[current_section].append(line)
        for ent in doc.ents:
            if ent.label_ in ["PERSON", "ORG"] and any(keyword in page.lower() for keyword in sections["Team"]):
                categorized["Team"].append(ent.text)
            elif ent.label_ in ["MONEY", "CARDINAL"] and any(keyword in page.lower() for keyword in sections["Market"]):
                categorized["Market"].append(ent.text)
    return {key: " ".join(set(val)) for key, val in categorized.items() if val}

# Step 2: Score Each Deck

### Overview
The `score_dimension` function scores a single section of a pitch deck (e.g., Problem, Market) based on its text, using VADER for sentiment analysis and BART for zero-shot classification. The `score_deck` function aggregates scores across multiple dimensions and computes a normalized final score, incorporating a Confidence score for the entire deck. These functions rely on text categorized by `categorize_text`, cleaned by `clean_text`, and extracted by `extract_text_hybrid_easyocr`.

---
### How Scoring Is Done
1. **Per-Dimension Scoring (`score_dimension`)**:
   - **Input**: Text for a section (e.g., Problem text from `categorize_text`).
   - **Process**:
     - **Sentiment**: VADER computes a `compound` score (-1 to +1) for the first 500 characters, reflecting tone.
     - **Quality**: BART evaluates the first 500 characters against criteria (e.g., "clear and specific"), producing a probability (0–1) scaled to 0–10.
     - **Combined Score**:
       \[
       dimension_score = 0.7 x (BART score x 10) + 0.3 x(VADER compound + 1) x 5
       \]
     - Weights prioritize quality (70%) over sentiment (30%).
   - **Output**: A score (0–10) for the dimension.
   - **Example**:
     - Text: "Our platform solves inefficiencies in healthcare delivery."
     - VADER: `compound = 0.4` → `(0.4 + 1) * 5 = 7.0`.
     - BART: `scores["clear and specific"] = 0.8` → `0.8 * 10 = 8.0`.
     - Score: \( 0.7 x 8.0 + 0.3 x 7.0 = 5.6 + 2.1 = 7.7 \).

2. **Deck Scoring (`score_deck`)**:
   - **Input**: Dictionary of section texts from `categorize_text`.
   - **Process**:
     - Scores six dimensions (Problem, Market, Traction, Team, Business Model, Vision/Moat) using `score_dimension`.
     - Computes a Confidence score for the entire deck’s text:
       \[
       text{Confidence} = (VADER compound x 5) + 5
       \]
     - Sums all seven scores (6 dimensions + Confidence) to get `total` (0–70).
     - Normalizes to 0–100:
       \[
       normalized = (total/70) x 100
       \]
   - **Output**: A dictionary of scores and a normalized final score.
   - **Example**:
     - Scores: Problem = 7.7, Market = 8.0, Traction = 6.5, Team = 7.0, Business Model = 6.8, Vision/Moat = 7.2, Confidence = 6.0.
     - Total: \( 7.7 + 8.0 + 6.5 + 7.0 + 6.8 + 7.2 + 6.0 = 49.2 \).
     - Normalized: \( (49.2 / 70) x 100 = 70.3 \).

---

### How the Final Score Is Calculated
The final score (`normalized`) is a normalized percentage (0–100) based on the sum of seven scores (six dimensions + Confidence), each ranging from 0–10.

1. **Dimension Scores**:
   - Each of the six dimensions (Problem, Market, Traction, Team, Business Model, Vision/Moat) is scored using `score_dimension`:
     \[
     dimension_score = 0.7 x (BART score x 10) + 0.3 x (VADER compound + 1) x 5
     \]
   - Range: 0–10 per dimension.
   - Total for six dimensions: 0–60.

2. **Confidence Score**:
   - Computed for the entire deck’s text:
     \[
     Confidence = (VADER compound x 5) + 5
     \]
   - Range: 0–10 (since `compound` is -1 to +1, scaled to 0–10).

3. **Total Score**:
   - Sum of all seven scores:
     \[
     total = Problem + Market + Traction + Team + Business Model + Vision/Moat + Confidence
     \]
   - Maximum: \( 10 x 7 = 70 \).

4. **Normalized Final Score**:
   - Scaled to 0–100:
     \[
     normalized = (total/70) x 100
     \]
   - Rounded to one decimal place for readability.

5. **Example Calculation**:
   - Assume scores: Problem = 7.7, Market = 8.0, Traction = 6.5, Team = 7.0, Business Model = 6.8, Vision/Moat = 7.2, Confidence = 6.0.
   - Total: \( 7.7 + 8.0 + 6.5 + 7.0 + 6.8 + 7.2 + 6.0 = 49.2 \).
   - Normalized: \( (49.2 / 70) x = 70.3 \).
   - Output: `scores = {"Problem": 7.7, "Market": 8.0, ..., "Confidence": 6.0}, normalized = 70.3`.

---

In [10]:
def score_dimension(text, dimension, criteria):
    """Score a dimension using NLP."""
    if not text or text == "No text extracted":
        return 0
    sentiment = vader.polarity_scores(text[:500])["compound"]
    try:
        scores = zero_shot(text[:500], candidate_labels=criteria, multi_label=False)
        quality_score = scores["scores"][0] * 10
    except Exception as e:
        print(f"Zero-shot classification failed for {dimension}: {e}")
        quality_score = 0
    return round(0.7 * quality_score + 0.3 * (sentiment + 1) * 5, 1)

def score_deck(sections):
    """Score a deck across all dimensions."""
    criteria = {
        "Problem": ["clear and specific", "vague", "generic"],
        "Market": ["large and quantified", "small", "unfocused"],
        "Traction": ["strong metrics", "weak metrics", "no data"],
        "Team": ["experienced and relevant", "inexperienced", "generic"],
        "Business Model": ["clear monetization", "unclear", "unsustainable"],
        "Vision/Moat": ["defensible and scalable", "generic", "weak"]
    }
    scores = {}
    for dim in criteria:
        scores[dim] = score_dimension(sections.get(dim, ""), dim, criteria[dim])
    overall_text = " ".join(sections.values())
    scores["Confidence"] = round(vader.polarity_scores(overall_text)["compound"] * 5 + 5, 1) if overall_text else 0
    total = sum(scores.values())
    normalized = round((total / 70) * 100, 1)
    return scores, normalized

# Step 3: Process All Decks

### Overview
The `process_decks` function is the core of the pitch deck analysis pipeline, orchestrating the following steps for each PDF in a specified folder:
1. **Text Extraction**: Uses `extract_text_hybrid_easyocr` to extract text from PDFs.
2. **Text Categorization**: Uses `categorize_text` to organize text into investor-relevant sections.
3. **Scoring**: Uses `score_deck` to compute dimension scores and a normalized final score.
4. **Summarization**: Uses `summarizer` (T5) to generate summaries and insights.
5. **Industry Classification**: Uses `zero_shot` (BART) and keyword-based fallback to classify the deck’s industry.
6. **Clustering**: Uses KMeans to group decks based on scores.
7. **Output**: Returns a DataFrame with results, including scores, summaries, industry labels, suggestions, and cluster assignments.

The final score, as computed by `score_deck`, is a normalized 0–100 score based on seven dimension scores (Problem, Market, Traction, Team, Business Model, Vision/Moat, Confidence).

---
#### Logic Breakdown
1. **Setup**:
   - Creates the `output_folder` if it doesn’t exist (`os.makedirs(output_folder, exist_ok=True)`).
   - Defines a hardcoded list of `deck_names` (e.g., "Pitch-Example-Air-BnB-PDF", "uber-pitch-deck").
   - Initializes an empty `results` list to store analysis for each deck.

2. **Text Extraction**:
   - Constructs the PDF file path (`pdf_file`) for each deck using `Path`.
   - Checks if the PDF exists; if not, logs a warning and sets `text = ["No text extracted"]`.
   - Otherwise, calls `extract_text_hybrid_easyocr` to extract text per page.
   - Saves extracted text to a `.txt` file in `output_folder`, with each slide labeled (e.g., `---SLIDE 1---`).

3. **Text Categorization**:
   - Calls `categorize_text` to organize extracted text into sections (e.g., Problem, Market).

4. **Scoring**:
   - Calls `score_deck` to compute dimension scores (Problem, Market, Traction, Team, Business Model, Vision/Moat, Confidence) and a normalized final score (0–100).
   - **Final Score Calculation** (from `score_deck`):
     - Each dimension score (0–10) combines BART zero-shot quality (70%) and VADER sentiment (30%):
       \[
       dimension_score = 0.7 x(BART score x 10) + 0.3 x (VADER compound + 1) x 5
       \]
     - Confidence score (0–10):
       \[
       Confidence = (VADER compound x 5) + 5
       \]
     - Total score (0–70):
       \[
       total = summation( dimension score + Confidence)
       \]
     - Normalized final score (0–100):
       \[
       final_score = (total/70) x 100
       \]
   - Example:
     - Scores: Problem = 7.7, Market = 8.0, Traction = 6.5, Team = 7.0, Business Model = 6.8, Vision/Moat = 7.2, Confidence = 6.0.
     - Total: \( 7.7 + 8.0 + 6.5 + 7.0 + 6.8 + 7.2 + 6.0 = 49.2 \).
     - Final Score: \( (49.2 / 70) x 100 = 70.3 \).

5. **Summarization**:
   - Concatenates all page texts, cleans the first 1000 characters (`clean_text`), and uses T5 (`summarizer`) to generate:
     - A summary (20–50 tokens).
     - An insight (5–10 tokens, based on the first 200 characters).
   - Handles errors by logging and setting fallback messages.

6. **Industry Classification**:
   - Concatenates all text, truncates to 512 tokens (`truncate_to_max_tokens`), and uses BART (`zero_shot`) to classify the deck’s industry (e.g., Fintech, Social Media).
   - **Fallback**: If the predicted label isn’t in the candidate list, uses a keyword-based approach:
     - Counts weighted keywords (e.g., "payment" for Fintech) in the text.
     - Boosts scores with spaCy NER (e.g., ORG, PRODUCT entities).
     - Selects the industry with the highest score.
   - Sets "Unknown" if classification fails.

7. **Improvement Suggestions**:
   - For each dimension with a score < 5, adds a suggestion to improve clarity or metrics.
   - Joins suggestions with semicolons or sets a default message if none apply.

8. **Results Storage**:
   - Appends a dictionary to `results` with:
     - Deck name, dimension scores, final score, summary, insight, industry, suggestions.

9. **Clustering**:
   - Converts `results` to a DataFrame (`df`).
   - If multiple decks exist, applies KMeans clustering (up to 3 clusters) on dimension scores, filling missing values with 0.
   - Adds a "Cluster" column to group similar decks.

10. **Output**:
    - Returns the DataFrame with all results.


In [11]:
def process_decks(folder_path, output_folder):
    """Process all PDFs, extract text, and analyze."""
    os.makedirs(output_folder, exist_ok=True)
    deck_names = [
        "Pitch-Example-Air-BnB-PDF",
        "uber-pitch-deck",
        "6737d05825e11f73f6d5a289_Ndc8GMUtaMNHOXDfqftyW1Jb7b5h2JE_ThY_Joc5Cf8",
        "FACEBOOK",
        "doordash-pitch-deck"
    ]
    results = []

    for deck_name in deck_names:
        pdf_file = Path(folder_path) / f"{deck_name}.pdf"
        txt_out_path = Path(output_folder) / f"{deck_name}.txt"

        if not pdf_file.exists():
            logging.warning(f"File {pdf_file} not found")
            text = ["No text extracted"]
        else:
            text = extract_text_hybrid_easyocr(pdf_file)
            with open(txt_out_path, "w", encoding="utf-8") as f:
                for i, slide in enumerate(text):
                    f.write(f"---SLIDE {i+1}---\n{slide}\n\n")

        sections = categorize_text(text)
        scores, final_score = score_deck(sections)

        try:
            clean_summary_text = clean_text(" ".join(text)[:1000])
            summary = summarizer(clean_summary_text, max_length=50, min_length=20, do_sample=False, max_new_tokens=None)[0]["summary_text"]
            insight = f"Insight: {summarizer(clean_summary_text[:200], max_length=10, min_length=5, do_sample=False, max_new_tokens=None)[0]['summary_text']}"
        except Exception as e:
            logging.error(f"Error summarizing {deck_name}: {e}")
            summary = f"Summary not available for {deck_name}"
            insight = f"Insight not available for {deck_name}"

        try:
            full_text = " ".join(text)
            classification_input = truncate_to_max_tokens(full_text, max_tokens=512)
            candidate_labels = ["Fintech", "HealthTech", "SaaS", "B2C", "Social Media", "Food Delivery", "Ride-Sharing"]
            industry = zero_shot(classification_input, candidate_labels=candidate_labels, multi_label=False)
            industry_label = industry["labels"][industry["scores"].index(max(industry["scores"]))]
            # Fallback: Keyword-based classification
            if industry_label not in ["Fintech", "HealthTech", "SaaS", "B2C", "Social Media", "Food Delivery", "Ride-Sharing"]:
                doc = nlp(full_text)
                keywords = {
                    "Fintech": [("payment", 2), ("finance", 1.5), ("banking", 1.5), ("transaction", 1)],
                    "HealthTech": [("health", 2), ("medical", 1.5), ("patient", 1), ("care", 1)],
                    "SaaS": [("software", 2), ("subscription", 1.5), ("cloud", 1), ("tool", 1)],
                    "B2C": [("consumer", 2), ("marketplace", 1.5), ("booking", 1), ("sharing", 1)],
                    "Social Media": [("social", 2), ("network", 1.5), ("connect", 1), ("community", 1)],
                    "Food Delivery": [("delivery", 2), ("restaurant", 1.5), ("food", 1.5), ("courier", 1)],
                    "Ride-Sharing": [("ride", 2), ("transport", 1.5), ("driver", 1), ("car", 1)]
                }
                keyword_scores = {label: 0 for label in candidate_labels}
                for label, kws in keywords.items():
                    for kw, weight in kws:
                        keyword_scores[label] += weight * full_text.lower().count(kw)
                for ent in doc.ents:
                    if ent.label_ in ["ORG", "PRODUCT"]:
                        for label, kws in keywords.items():
                            if any(kw[0] in ent.text.lower() for kw in kws):
                                keyword_scores[label] += 2
                max_score = max(keyword_scores.values(), default=0)
                if max_score > 0:
                    industry_label = max(keyword_scores, key=keyword_scores.get)
        except Exception as e:
            logging.error(f"Industry classification failed for {deck_name}: {e}")
            industry_label = "Unknown"

        # Improvement suggestions
        suggestions = []
        for dim, score in scores.items():
            if score < 5:
                suggestions.append(f"Improve {dim}: Ensure clear, specific details and strong metrics.")

        results.append({
            "Deck": deck_name,
            **scores,
            "Final Score": final_score,
            "Summary": summary,
            "Insight": insight,
            "Industry": industry_label,
            "Suggestions": "; ".join(suggestions) if suggestions else "No major improvements needed"
        })

    df = pd.DataFrame(results)
    # Clustering decks
    if len(df) > 1:
        dimensions = ["Problem", "Market", "Traction", "Team", "Business Model", "Vision/Moat", "Confidence"]
        X = df[dimensions].fillna(0)
        kmeans = KMeans(n_clusters=min(3, len(df)), random_state=42)
        df["Cluster"] = kmeans.fit_predict(X)
    return df

# Step 4: Visualizations

### Overview
The visualization functions (`create_radar_chart`, `create_bar_chart`, `create_heatmap`, `create_word_cloud`) take the DataFrame output from `process_decks` and create:
1. **Radar Chart**: Compares dimension scores (Problem, Market, etc.) across decks.
2. **Bar Chart**: Displays final scores for each deck.
3. **Heatmap**: Shows correlations between dimension scores.
4. **Word Cloud**: Visualizes the frequency of industry labels.

These visualizations enhance interpretability of the pitch deck analysis, leveraging the final score (normalized 0–100 from `score_deck`) and dimension scores to provide insights for investors.

---

### Experimental Aspects
- **Radar Chart**: Visualizing seven dimensions in a radar chart is experimental, as it assumes equal importance of dimensions. Weighting dimensions (e.g., Market > Team) could improve relevance.
- **Heatmap**: Correlation analysis assumes linear relationships, which may not hold for subjective scores. Non-linear methods (e.g., mutual information) could be explored.
- **Word Cloud**: Using frequencies for industry labels is simple but may overemphasize common industries. Alternative visualizations (e.g., pie charts) could provide clearer insights.


In [12]:
def create_radar_chart(df, output_folder):
    """Create a radar chart comparing decks and save to file."""
    dimensions = ["Problem", "Market", "Traction", "Team", "Business Model", "Vision/Moat", "Confidence"]
    fig = go.Figure()
    for _, row in df.iterrows():
        fig.add_trace(go.Scatterpolar(
            r=[row[dim] for dim in dimensions],
            theta=dimensions,
            fill="toself",
            name=row["Deck"]
        ))
    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[0, 10])),
        showlegend=True,
        title="Pitch Deck Comparison (Radar Chart)"
    )
    fig.show()
    # Save to file
    output_path = os.path.join(output_folder, "radar_chart.png")
    try:
        fig.write_image(output_path)
        logging.info(f"Radar chart saved to {output_path}")
    except Exception as e:
        logging.error(f"Failed to save radar chart: {e}")
    return fig

def create_bar_chart(df, output_folder):
    """Create a bar chart of final scores and save to file."""
    fig = px.bar(df, x="Deck", y="Final Score", title="Final Scores by Deck", color="Deck")
    fig.show()
    # Save to file
    output_path = os.path.join(output_folder, "bar_chart.png")
    try:
        fig.write_image(output_path)
        logging.info(f"Bar chart saved to {output_path}")
    except Exception as e:
        logging.error(f"Failed to save bar chart: {e}")
    return fig

def create_heatmap(df, output_folder):
    """Create a correlation heatmap of dimensions and save to file."""
    dimensions = ["Problem", "Market", "Traction", "Team", "Business Model", "Vision/Moat", "Confidence"]
    corr = df[dimensions].corr()
    fig = px.imshow(corr, text_auto=True, title="Dimension Correlation Heatmap")
    fig.show()
    # Save to file
    output_path = os.path.join(output_folder, "heatmap.png")
    try:
        fig.write_image(output_path)
        logging.info(f"Heatmap saved to {output_path}")
    except Exception as e:
        logging.error(f"Failed to save heatmap: {e}")
    return fig

def create_word_cloud(df, output_folder):
    """Create a word cloud from industry labels and save to file."""
    if "Industry" not in df.columns or df["Industry"].isna().all():
        logging.warning("No industry labels available for word cloud")
        return None
    # Create frequency dictionary of industry labels
    industry_counts = df["Industry"].value_counts().to_dict()
    # Generate word cloud with frequencies
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color="white",
        min_font_size=10,
        max_font_size=100
    ).generate_from_frequencies(industry_counts)
    # Convert to Plotly figure
    fig = go.Figure()
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title("Word Cloud of Industry Labels")
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    buf.seek(0)
    img_str = "data:image/png;base64," + base64.b64encode(buf.read()).decode()
    # Save word cloud as image
    output_path = os.path.join(output_folder, "word_cloud.png")
    try:
        wordcloud.to_file(output_path)
        logging.info(f"Word cloud saved to {output_path}")
    except Exception as e:
        logging.error(f"Failed to save word cloud: {e}")
    buf.close()
    plt.close()
    fig.add_layout_image(
        dict(
            source=img_str,
            xref="paper", yref="paper",
            x=0, y=1,
            sizex=1, sizey=1,
            xanchor="left", yanchor="top",
            layer="below"
        )
    )
    fig.update_layout(
        title="Word Cloud of Industry Labels",
        xaxis=dict(visible=False),
        yaxis=dict(visible=False),
        width=800,
        height=400
    )
    fig.show()
    return fig



### Overview
----

The `create_dashboard` function creates an interactive web-based dashboard using Dash, a Python framework for building analytical applications. The dashboard consolidates:
- **Visualizations**: Radar chart, bar chart, heatmap, and word cloud from the respective visualization functions.
- **Table**: A detailed table displaying all columns from the `process_decks` DataFrame (e.g., Deck, dimension scores, Final Score, Summary, Insight, Industry, Suggestions, Cluster).
- **Purpose**: Provides a user-friendly interface for investors to explore pitch deck analysis results, leveraging the final score (0–100) and dimension scores (0–10) computed by `score_deck` and processed by `process_decks`.

#### Purpose
Creates an interactive Dash dashboard to display visualizations and a results table, consolidating the analysis from `process_decks` for investor review.
- **Dashboard Functionality**:
  - Displays a radar chart (dimension scores), bar chart (final scores), heatmap (dimension correlations), word cloud (industry labels), and a table of all DataFrame columns.
  - Integrates results from `process_decks`, with final scores (0–100) from `score_deck` shown in the bar chart and table, and dimension scores (0–10) in the radar chart and table.

In [13]:
def create_dashboard(df, output_folder):
    """Create an interactive Dash dashboard."""
    app = dash.Dash(__name__)
    word_cloud_fig = create_word_cloud(df, output_folder)
    app.layout = html.Div([
        html.H1("Pitch Deck Evaluation Dashboard"),
        html.H2("Scores Table"),
        dcc.Graph(figure=create_radar_chart(df, output_folder)),
        dcc.Graph(figure=create_bar_chart(df, output_folder)),
        dcc.Graph(figure=create_heatmap(df, output_folder)),
        html.H2("Word Cloud of Industry Labels"),
        dcc.Graph(figure=word_cloud_fig) if word_cloud_fig else html.P("No word cloud available"),
        html.H2("Detailed Results"),
        html.Table([
            html.Thead(html.Tr([html.Th(col) for col in df.columns])),
            html.Tbody([
                html.Tr([html.Td(df.iloc[i][col]) for col in df.columns])
                for i in range(len(df))
            ])
        ])
    ])
    return app

# Main Execution

### Overview
The main execution block is the entry point for running the pitch deck analysis pipeline. It:
1. **Processes PDFs**: Calls `process_decks` to extract text, categorize, score, summarize, classify industries, and cluster decks.
2. **Saves Results**: Stores the DataFrame to a CSV file.
3. **Prints Insights**: Displays the DataFrame, top/bottom 3 decks by final score, and formatted summaries/insights.
4. **Launches Dashboard**: Runs the Dash dashboard via `create_dashboard`.
5. **Downloads Files**: Downloads the CSV and visualization PNGs in Google Colab.

The main block uses the `Final Score` to rank decks and display results.

---

#### Logic Breakdown
1. **Folder Setup**:
   - Defines `folder_path = "/content/pdf_decks"` (input directory for PDFs) and `output_folder = "/content/output"` (output directory for CSVs and PNGs).
   - These paths are typical for Google Colab, where `/content` is the default working directory.

2. **Process Decks**:
   - Calls `process_decks(folder_path, output_folder)` to:
     - Extract text from PDFs (e.g., Airbnb, Uber) using `extract_text_hybrid_easyocr`.
     - Categorize text into sections (e.g., Problem, Market) with `categorize_text`.
     - Score dimensions and compute the final score with `score_deck`.
     - Generate summaries and insights with `summarizer` (T5).
     - Classify industries with `zero_shot` (BART) and keyword fallback.
     - Cluster decks with KMeans.
   - Returns a DataFrame (`df`) with columns: Deck, Problem, Market, Traction, Team, Business Model, Vision/Moat, Confidence, Final Score, Summary, Insight, Industry, Suggestions, Cluster.

3. **Save Results**:
   - Saves the DataFrame to `results.csv` in `output_folder` using `df.to_csv(csv_path, index=False)`.
   - Logs the save operation with `logging.info`.

4. **Print Results**:
   - **Header**: Prints "=== Pitch Deck Evaluation Dashboard ===".
   - **Full DataFrame**: Displays `df` with all columns and rows.
   - **Top 3 Decks**: Uses `df.nlargest(3, "Final Score")` to show the top 3 decks by Final Score, displaying Deck, Final Score, and Industry.
   - **Bottom 3 Decks**: Uses `df.nsmallest(3, "Final Score")` to show the bottom 3 decks.
   - **Summaries and Insights**:
     - Iterates through `df` rows, printing for each deck:
       - Deck name and Industry (e.g., "uber-pitch-deck (Ride-Sharing)").
       - Final Score (e.g., 70.3).
       - Insight (from `process_decks`).
       - Summary, formatted with bullet points (replaces ". " with ".\n- " for readability).

5. **Launch Dashboard**:
   - Calls `create_dashboard(df, output_folder)` to create a Dash app with:
     - Radar chart (dimension scores), bar chart (final scores), heatmap (dimension correlations), word cloud (industry labels), and a results table.
   - Runs the app with `app.run(debug=True)`, launching a web server (default: `http://127.0.0.1:8050`).

6. **Download Files (Colab)**:
   - Checks if `files` (from `google.colab.files`) is defined to confirm Colab environment.
   - Downloads `results.csv`, `radar_chart.png`, `bar_chart.png`, `heatmap.png`, and `word_cloud.png` using `files.download`.


---

### Conclusion
- **Execution Flow**:
  - Processes PDFs with `process_decks`, saves results to `results.csv`, prints the DataFrame, top/bottom 3 decks by Final Score, and formatted summaries/insights, launches the Dash dashboard, and downloads outputs in Colab.
- **Final Score Integration**:
  - Used to rank decks (`nlargest`, `nsmallest`), displayed in the bar chart and table via `create_dashboard`.
  - Example: Uber (Final Score: 70.3) appears in the bar chart and top/bottom rankings.



In [14]:
if __name__ == "__main__":
    folder_path = "/content/pdf_decks"
    output_folder = "/content/output"
    df = process_decks(folder_path, output_folder)
    # Save DataFrame to CSV
    csv_path = os.path.join(output_folder, "results.csv")
    df.to_csv(csv_path, index=False)
    logging.info(f"DataFrame saved to {csv_path}")
    print("=== Pitch Deck Evaluation Dashboard ===")
    print(df)
    print("\nTop 3 Decks:")
    print(df.nlargest(3, "Final Score")[["Deck", "Final Score", "Industry"]])
    print("\nBottom 3 Decks:")
    print(df.nsmallest(3, "Final Score")[["Deck", "Final Score", "Industry"]])
    print("\n=== Summaries and Insights ===")
    for _, row in df.iterrows():
        print(f"Deck: {row['Deck']} ({row['Industry']})")
        print(f"Final Score: {row['Final Score']}")
        print(f"Insight: {row['Insight']}")
        summary_formatted = row["Summary"].replace(". ", ".\n- ")
        print(f"Summary:\n- {summary_formatted}")
        print()
    app = create_dashboard(df,output_folder)
    app.run(debug=True)
    if files is not None:  # Running in Colab
        files.download(csv_path)
        files.download(os.path.join(output_folder, "radar_chart.png"))
        files.download(os.path.join(output_folder, "bar_chart.png"))
        files.download(os.path.join(output_folder, "heatmap.png"))
        files.download(os.path.join(output_folder, "word_cloud.png"))

=== Pitch Deck Evaluation Dashboard ===
                                                Deck  Problem  Market  \
0                          Pitch-Example-Air-BnB-PDF      6.4     9.0   
1                                    uber-pitch-deck      6.2     6.8   
2  6737d05825e11f73f6d5a289_Ndc8GMUtaMNHOXDfqftyW...      6.4     7.8   
3                                           FACEBOOK      0.0     6.0   
4                                doordash-pitch-deck      0.0     0.0   

   Traction  Team  Business Model  Vision/Moat  Confidence  Final Score  \
0       7.7   0.0             5.0          7.4         9.4         64.1   
1       8.4   9.5             7.5          5.5        10.0         77.0   
2       0.0   0.0             0.0          0.0         1.6         22.6   
3       8.6   7.4             6.2          4.0        10.0         60.3   
4       8.6   0.0             0.0          0.0         8.5         24.4   

                                             Summary  \
0  no easy way





<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Thank you for the opportunity to discuss my pitch deck analysis pipeline. Given the time constraints, I developed a robust system that extracts text from PDFs, categorizes it into investor-relevant sections, scores dimensions like Problem and Market using BART and VADER, and computes a normalized final score (0–100) based on a weighted sum of seven dimensions. It also generates summaries, industry classifications, and interactive visualizations via a Dash dashboard, providing a comprehensive tool for investors as of July 24, 2025.

With more time, I would focus on three key improvements to enhance accuracy and usability:

1. **Dynamic File Processing**: Replace hardcoded deck names with dynamic file discovery using `os.listdir` to scale the pipeline for any number of PDFs, improving flexibility for real-world applications.

2. **Advanced NLP Models**: Fine-tune a domain-specific model like Llama 3 for scoring and industry classification, replacing the general-purpose BART model to boost accuracy for pitch deck-specific language.

3. **Interactive Dashboard**: Add filters to the Dash dashboard, such as deck or industry selectors, to allow investors to focus on specific results, enhancing usability and engagement.

These enhancements would make the pipeline more scalable, precise, and investor-friendly, aligning with the needs of modern startup evaluation.
