The commands below set up the environment for this project. Here’s what each command does:  

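### **1. Installing Required Python Packages**  
- **`pip install -r requirements.txt`**: Installs every Python package listed in `requirements.txt`.
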
In [None]:
!pip install -r requirements.txt

### **2. Installing Poppler for PDF Processing**  
- **`sudo apt update`**: Refreshes the package lists so that `apt` installs the latest available versions.
- **`sudo apt install -y poppler-utils`**: Installs `poppler-utils`, a set of command-line tools for reading and rendering PDFs.
- **Why is `poppler-utils` needed?**  
  - `pdf2image`, used later in this notebook, relies on Poppler's `pdftoppm` and `pdftocairo` tools to render PDF pages as images.
  - It also provides `pdftotext` for converting text-based PDFs directly to text.


In [None]:
!sudo apt update && sudo apt install -y poppler-utils
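
After installation, it can be worth confirming that the Poppler tools are actually on the `PATH` before running the conversion step. The following is a minimal sketch using only the standard library; the helper name `find_poppler_tools` is illustrative and not part of `pdf2image`.

```python
import shutil

# Command-line tools from poppler-utils that pdf2image can call.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_poppler_tools().items():
    print(f"{tool}: {path or 'not found'}")
```

If either tool is reported as not found, re-running the install cell above is the first thing to try.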

### **Setting Up Your GEMINI API Key**  

To authenticate with **Google Gemini AI services**, set your **Gemini API key** as the `GEMINI_API_KEY` environment variable. Replace the placeholder in the cell below with your own key before running it.  



In [None]:
import os
os.environ["GEMINI_API_KEY"] = "<YOUR_GEMINI_API>"
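
Because a forgotten placeholder only surfaces later as an authentication error, a small check right after setting the key can save time. This is a sketch under stated assumptions: `require_api_key` is a hypothetical helper written for this notebook, not a function from any Gemini SDK.

```python
import os

PLACEHOLDER = "<YOUR_GEMINI_API>"  # placeholder string used in the cell above


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set the {name} environment variable to a real API key.")
    return key


# Example with an explicit mapping instead of the real environment
print(require_api_key({"GEMINI_API_KEY": "example-key"}))
```

Calling `require_api_key()` with no arguments checks the real environment instead of the example mapping.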

### **Extracting Text from PDF using Gemini AI**  

This script extracts text from a **PDF document** using **Google Gemini AI** for OCR (Optical Character Recognition). It converts the PDF into images and then processes them to extract structured text while preserving tables in **Markdown format**.

---

### **Workflow Breakdown**  

#### **1. Convert PDF to Images**  
- The `convert_pdf_to_images` function:
  - Uses `pdf2image` to **convert each page** of the PDF into an image (300 DPI by default).
  - Saves these images as `page_1.jpg`, `page_2.jpg`, and so on in a specified **output folder**.

#### **2. Perform OCR using Gemini AI**  
- The `ocr_with_gemini` function:
  - Loads the images and **sends them to Gemini AI** (`gemini-1.5-pro`) in a single request for text extraction.
  - Instructs the model to keep tables in **Markdown format**.
  - Requests **only textual content**, without citations or external links.

#### **3. Save Extracted Text**  
- The extracted text is **saved** as a `.txt` file in a specified output directory.



In [None]:
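import os
from pdf2image import convert_from_path
from PIL import Image
import google.generativeai as genai

# Pass the key explicitly rather than relying on the SDK's own environment lookup
genai.configure(api_key=os.environ["GEMINI_API_KEY"])


def convert_pdf_to_images(pdf_path, output_folder, dpi=300):
    """Convert a PDF file to images and save them to a folder."""
    os.makedirs(output_folder, exist_ok=True)  # Create the folder if it does not exist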
    
    images = convert_from_path(pdf_path, dpi=dpi)
    image_paths = []
    for i, image in enumerate(images):
        image_path = os.path.join(output_folder, f'page_{i+1}.jpg')
        image.save(image_path, 'JPEG')
        image_paths.append(image_path)
    
    return image_paths

def ocr_with_gemini(image_paths, instruction):
    """Send the page images to Gemini (gemini-1.5-pro) and return the extracted text."""
    images = [Image.open(path) for path in image_paths]

    prompt = f"""
    {instruction}

    Extract **only the text** from the document images. Do not provide citations or external links.
    Preserve **tables** using Markdown format, **bold** headers, and maintain **paragraph breaks**.
    """

    model = genai.GenerativeModel("gemini-1.5-pro")  # Use a stronger model for OCR if needed
    response = model.generate_content([prompt, *images])

    if response and hasattr(response, "text"):
        return response.text
    else:
        return "Error: OCR extraction failed."

# Define paths
pdf_path = os.path.abspath("input_documents/airbnb-original-deck-2008.pdf")  # Adjust as needed
output_folder = os.path.abspath("output_images")  # Ensures absolute path usage

# Convert PDF to images
image_paths = convert_pdf_to_images(pdf_path, output_folder)

# Define OCR instruction
instruction = "Maintain the table structure using markdown table format"

# Perform OCR using Gemini
extracted_text = ocr_with_gemini(image_paths, instruction)

extraction_output_folder = os.path.abspath("extracted_text_documents")  # Ensures absolute path usage
os.makedirs(extraction_output_folder, exist_ok=True)  # Create the folder before writing to it

# Save extracted text
output_text_file = os.path.join(extraction_output_folder, "extracted_text.txt")
with open(output_text_file, "w", encoding="utf-8") as f:
    f.write(extracted_text)

print(f"OCR completed. Extracted text saved to {output_text_file}")

OCR completed. Extracted text saved to /workspaces/AI-Startup-Pitch-Analysis-Model/extracted_text_documents/extracted_text.txt


### **Reading Extracted Text from File**  

Once the OCR process is completed, the extracted text is stored in a **text file**. The following steps ensure that the text can be read and displayed.

---

### **Steps to Read the Extracted Text**  

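1. **Define the Extraction Output Folder**  
   - The script resolves the absolute path of the folder that holds the extracted text files.
2. **Open the Text File**  
   - `extracted_text.txt` is opened in read mode with UTF-8 encoding, matching how it was written.
3. **Load and Display the Content**  
   - The text is read into `content`, which later cells reuse, and printed for inspection.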


In [None]:
extraction_output_folder = os.path.abspath("extracted_text_documents")  # Ensures absolute path usage

# Read the extracted text back from disk
output_text_file = os.path.join(extraction_output_folder, "extracted_text.txt")
with open(output_text_file, "r", encoding="utf-8") as f:
    content = f.read()

print(content)

# Welcome

AirBed&Breakfast™

Book rooms with locals, rather than hotels.


# Problem

Price is an important concern for customers booking travel online.

Hotels leave you disconnected from the city and its culture.

No easy way exists to book a room with a local or become a host.


# Solution

A web platform where users can rent out their space to host travelers to:

| SAVE MONEY | MAKE MONEY | SHARE CULTURE |
|---|---|---|
| when traveling | when hosting | local connection to the city |


# Market Validation

Couchsurfing.com

660,000 

total users*

Craigslist.com

50,000

temporary housing listings per
week in the US. 07/09 – 07/16*



# Market Size


2 Billion+
TRIPS BOOKED (WORLDWIDE)

Total Available Market

560M
BUDGET&ONLINE
Serviceable Available Market

84M
TRIPS W/A&B&B
Share of Market
15% of Available Market



# Product

SEARCH BY CITY  -> REVIEW LISTINGS -> BOOK IT!



# Business Model

We take a 10% commission on each transaction.


$84M
TRIPS W/A&B&B
Share of Market
15%


# 🚀 Pitch Scoring System  

## 📌 Overview  
This script evaluates a **startup pitch deck** based on **eight key criteria** and assigns a final score out of **100**. Each criterion has a **weighted contribution**, ensuring a balanced assessment of the startup’s potential.

---

## 📊 Scoring Criteria  
The scoring system considers the following aspects:

| **Criterion**        | **Weight (%)** | **Description** |
|----------------------|--------------|---------------|
| **Problem Statement**  | 15%  | Measures the clarity and relevance of the problem being solved. |
| **Solution**          | 20%  | Evaluates uniqueness and feasibility of the proposed solution. |
| **Business Model**    | 10%  | Assesses scalability and sustainability of the business. |
| **Traction**         | 10%  | Checks user growth and adoption (e.g., based on user count). |
| **Financials**       | 10%  | Analyzes funding raised and financial stability. |
| **Pitch Quality**    | 10%  | Evaluates design, clarity, and engagement in presentation. |
| **Market Analysis**  | 15%  | Looks at competition and market positioning. |
| **Team**            | 10%  | Considers the experience and diversity of the team. |

---
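
As a quick sanity check on the table above, the weights can be summed to confirm that they cover the full 100%. A minimal sketch; the dictionary name `CRITERION_WEIGHTS` is illustrative.

```python
# Weights copied from the table above, expressed as fractions of the final score.
CRITERION_WEIGHTS = {
    "problem_statement": 0.15,
    "solution": 0.20,
    "business_model": 0.10,
    "traction": 0.10,
    "financials": 0.10,
    "pitch_quality": 0.10,
    "market_analysis": 0.15,
    "team": 0.10,
}

total_weight = sum(CRITERION_WEIGHTS.values())

# Compare with a tolerance because the weights are floating-point values.
assert abs(total_weight - 1.0) < 1e-9, "Weights must add up to 100%"
print(f"Total weight: {total_weight:.0%}")  # Total weight: 100%
```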



In [None]:
import os
from typing import List

from google import genai
from pydantic import BaseModel, Field

class ProblemStatement(BaseModel):
    clarity: int = Field(..., ge=1, le=5, description="How well is the problem explained? (1-5 scale)")
    relevance: int = Field(..., ge=1, le=5, description="Is the problem significant in the market? (1-5 scale)")
    feedback: str = Field(..., description="Suggestions to improve problem clarity and relevance")

class Solution(BaseModel):
    uniqueness: int = Field(..., ge=1, le=5, description="How unique is the solution? (1-5 scale)")
    feasibility: int = Field(..., ge=1, le=5, description="Can this solution be implemented effectively? (1-5 scale)")
    feedback: str = Field(..., description="Suggestions to improve uniqueness and feasibility")

class MarketAnalysis(BaseModel):
    market_size: str = Field(..., description="TAM, SAM, and SOM details")
    competitors: List[str] = Field(..., description="List of competitors")
    competitive_advantage: str = Field(..., description="What differentiates this startup from competitors?")
    feedback: str = Field(..., description="Recommendations on market positioning and competition")

class BusinessModel(BaseModel):
    revenue_streams: List[str] = Field(..., description="How does the startup generate revenue?")
    scalability: int = Field(..., ge=1, le=5, description="How scalable is the business model? (1-5 scale)")
    feedback: str = Field(..., description="Suggestions for improving scalability and revenue strategy")

class Traction(BaseModel):
    users: int = Field(..., description="Number of active users/customers")
    revenue: float = Field(..., description="Revenue generated so far")
    partnerships: List[str] = Field(..., description="List of partnerships/collaborations")
    feedback: str = Field(..., description="Suggestions to enhance traction and partnerships")

class Financials(BaseModel):
    funding_raised: float = Field(..., description="Total funding raised (USD)")
    burn_rate: float = Field(..., description="Monthly cash burn rate (USD)")
    revenue_projection: List[str] = Field(..., description="Projected revenue for upcoming years (e.g., Year 1, Year 2)")
    feedback: str = Field(..., description="Improvements for financial planning and projections")

class TeamMember(BaseModel):
    name: str = Field(..., description="Full name of the team member")
    role: str = Field(..., description="Role in the startup")
    experience: str = Field(..., description="Relevant experience")

class FundingAsk(BaseModel):
    amount_requested: float = Field(..., description="Funding amount requested (USD)")
    equity_offered: float = Field(..., ge=0, le=100, description="Percentage of equity offered")
    funding_usage: List[str] = Field(..., description="How the funds will be used")
    feedback: str = Field(..., description="Recommendations for refining the funding request")

class PitchQuality(BaseModel):
    design: int = Field(..., ge=1, le=5, description="Visual appeal of the pitch deck (1-5)")
    clarity: int = Field(..., ge=1, le=5, description="Clarity and conciseness of the pitch (1-5)")
    engagement: int = Field(..., ge=1, le=5, description="How engaging is the pitch? (1-5)")
    feedback: str = Field(..., description="Suggestions for improving pitch clarity, engagement, and design")

class StrengthWeaknessAnalysis(BaseModel):
    strengths: List[str] = Field(..., description="Key strengths of the startup and pitch deck")
    weaknesses: List[str] = Field(..., description="Key weaknesses or areas needing improvement")
    suggested_improvements: List[str] = Field(..., description="Personalized feedback on improving weaknesses")

class StartupPitchDeck(BaseModel):
    startup_name: str = Field(..., description="Name of the startup")
    industry: str = Field(..., description="Industry the startup operates in")
    problem_statement: ProblemStatement
    solution: Solution
    market_analysis: MarketAnalysis
    business_model: BusinessModel
    traction: Traction
    financials: Financials
    team: List[TeamMember]
    funding_ask: FundingAsk
    risks_and_challenges: List[str] = Field(..., description="Potential risks and challenges")
    pitch_quality: PitchQuality
    strength_weakness_analysis: StrengthWeaknessAnalysis
    final_evaluation: str = Field(..., description="Overall assessment of the startup")

prompt = f"""
  You are an AI working for a company that specializes in analyzing startup pitch decks.
  Given the following startup pitch deck, analyze it and extract the requested information:

  Startup Pitch Deck:
  {content}

  """
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=prompt,
    config={
        'response_mime_type': 'application/json',
        'response_schema': StartupPitchDeck,
    },
)
print(response.text)


{
  "startup_name": "AirBed&Breakfast",
  "industry": "Online Travel & Hospitality",
  "problem_statement": {
    "clarity": 4,
    "relevance": 5,
    "feedback": "The problem statement clearly identifies the issues with traditional hotel stays: high costs, disconnection from local culture, and lack of a platform for local hosting. It effectively highlights the need for an alternative solution in the travel accommodation market."
  },
  "solution": {
    "uniqueness": 4,
    "feasibility": 5,
    "feedback": "The solution is a web platform that allows users to rent out their spaces to travelers. This leverages the sharing economy and offers a unique value proposition by providing cost savings, income opportunities, and cultural exchange. The feasibility is high, considering the rise of online platforms and the increasing demand for alternative travel accommodations."
  },
  "market_analysis": {
    "market_size": "TAM: 2 Billion+ trips booked worldwide; SAM: 560M Budget & Online trips

## 🔢 Score Calculation  
Each category is **scored out of 5**, then **normalized to 100** and **weighted accordingly**.  
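
To make the arithmetic concrete, here is a small worked sketch of the normalize-then-weight step for a single criterion. The helper name `weighted_points` is illustrative; the input values are the problem statement and solution scores shown in the JSON output above.

```python
def weighted_points(score_out_of_5, weight):
    """Convert a 0-5 score to a 0-100 scale, then apply the criterion weight."""
    normalized = (score_out_of_5 / 5) * 100
    return normalized * weight


# Problem statement: clarity 4 and relevance 5 average to 4.5, weighted at 15%
problem_points = weighted_points((4 + 5) / 2, 0.15)
print(round(problem_points, 2))  # 13.5

# Solution: uniqueness 4 and feasibility 5 average to 4.5, weighted at 20%
solution_points = weighted_points((4 + 5) / 2, 0.20)
print(round(solution_points, 2))  # 18.0
```

The final score is the sum of these weighted points across all eight criteria.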

### **Key Features in Score Computation**  
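✅ **Averages sub-scores** for problem statement, solution, and pitch quality  
✅ **Scores traction with a binary check**: 2 if any users are reported, otherwise 0  
✅ **Scores financials with a binary check**: 2 if any funding has been raised, otherwise 0  
✅ **Uses a fixed market analysis score** of 4 rather than deriving it from the data  
✅ **Scores the team by size**: 5 for three or more members, otherwise 3  
✅ **Expects every schema field to be present** in the JSON, since keys are accessed directly  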

In [None]:
import json

def compute_pitch_score(data):
    # Define weights for each criterion (total should sum to 1 or 100%)
    weights = {
        "problem_statement": 0.15,  # 15%
        "solution": 0.20,  # 20%
        "business_model": 0.10,  # 10%
        "traction": 0.10,  # 10%
        "financials": 0.10,  # 10%
        "pitch_quality": 0.10,  # 10%
        "market_analysis": 0.15,  # 15%
        "team": 0.10  # 10%
    }

    # Extract scores from JSON (normalized to 5)
    problem_score = (data["problem_statement"]["clarity"] + data["problem_statement"]["relevance"]) / 2
    solution_score = (data["solution"]["uniqueness"] + data["solution"]["feasibility"]) / 2
    business_model_score = data["business_model"]["scalability"]
    traction_score = 2 if data["traction"]["users"] > 0 else 0  # Simple binary traction check
    financials_score = 2 if data["financials"]["funding_raised"] > 0 else 0  # Funding presence check
    pitch_quality_score = (data["pitch_quality"]["design"] + data["pitch_quality"]["clarity"] + data["pitch_quality"]["engagement"]) / 3
    market_analysis_score = 4  # Fixed placeholder; not derived from the extracted data
    team_score = 5 if len(data["team"]) >= 3 else 3  # Size-based proxy: teams of 3+ score higher

    # Normalize scores to 100 scale (assuming max score for each is 5)
    normalized_scores = {
        "problem_statement": (problem_score / 5) * 100,
        "solution": (solution_score / 5) * 100,
        "business_model": (business_model_score / 5) * 100,
        "traction": (traction_score / 5) * 100,
        "financials": (financials_score / 5) * 100,
        "pitch_quality": (pitch_quality_score / 5) * 100,
        "market_analysis": (market_analysis_score / 5) * 100,
        "team": (team_score / 5) * 100
    }

    # Compute weighted score
    overall_score = sum(normalized_scores[key] * weights[key] for key in weights)

    return round(overall_score, 2)  # Round to 2 decimal places

# Parse the JSON returned by Gemini in the previous cell
data = json.loads(response.text)

# Compute and print the score
pitch_score = compute_pitch_score(data)
print(f"Overall Pitch Score: {pitch_score}/100")

Overall Pitch Score: 73.5/100
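
The traction check in the cell above is binary, and keys are read directly from the JSON. If a graded score or tolerance for missing fields is wanted, one possible variant is sketched below. The helper names `safe_get` and `traction_score` and the user-count thresholds are assumptions made for illustration, not values taken from this project.

```python
def safe_get(data, *keys, default=0):
    """Walk nested dicts, returning `default` if any key along the path is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical tiers: (minimum users, score out of 5), highest tier first.
TRACTION_TIERS = [(100_000, 5), (10_000, 4), (1_000, 3), (1, 2)]


def traction_score(users):
    """Map a user count to a 0-5 score using the first tier it reaches."""
    for minimum_users, score in TRACTION_TIERS:
        if users >= minimum_users:
            return score
    return 0


sample = {"traction": {"users": 12_500}}
print(traction_score(safe_get(sample, "traction", "users")))  # 4
print(traction_score(safe_get({}, "traction", "users")))      # 0
```

Swapping this in would change the overall score, so the thresholds should be chosen to suit the decks being compared.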


### **Markdown Table Generation**  

✅ Converts **JSON keys** into **readable titles**  
✅ Supports **nested dictionaries** and **lists**  
✅ Auto-generates **tables** for structured data  
✅ Handles **multiple data types** gracefully  
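
One detail worth keeping in mind when writing free-text values into table cells: a `|` or a line break inside a value breaks the Markdown row. A minimal sketch of an escaping step follows; `escape_cell` is a hypothetical helper, not part of the notebook's code.

```python
def escape_cell(value):
    """Make a value safe to place inside a single Markdown table cell."""
    text = str(value)
    text = text.replace("|", "\\|")      # a bare pipe would start a new column
    text = " ".join(text.splitlines())   # a line break would end the table row
    return text


feedback = "Costs | culture\nNo hosting platform"
print(f"| Feedback | {escape_cell(feedback)} |")
```

Such a helper could be applied to each value just before it is written into a table row.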


In [None]:
import json

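def generate_markdown_table(data):
    """Render the parsed pitch-deck JSON as a Markdown report."""
    # Use the extracted startup name in the title instead of hard-coding it
    markdown = f"# {data.get('startup_name', 'Startup')} Pitch Evaluation\n\n"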

    for category, details in data.items():
        markdown += f"## {category.replace('_', ' ').title()}\n\n"

        if isinstance(details, dict):
            markdown += "| Key | Value |\n"
            markdown += "| --- | ----- |\n"
            for key, value in details.items():
                markdown += f"| {key.replace('_', ' ').title()} | {format_value(value)} |\n"
        
        elif isinstance(details, list):
            if not details:
                # An empty list (e.g. no team members found) has no rows to render
                markdown += "_None listed_\n"
            elif all(isinstance(item, dict) for item in details):
                headers = list(details[0].keys())
                markdown += "| " + " | ".join(headers) + " |\n"
                markdown += "| " + " | ".join(['-' * len(h) for h in headers]) + " |\n"

                for row in details:
                    markdown += "| " + " | ".join(format_value(row[h]) for h in headers) + " |\n"
            else:
                markdown += "- " + "\n- ".join(str(item) for item in details) + "\n"

        else:
            markdown += f"{format_value(details)}\n"

        markdown += "\n"

    return markdown

def format_value(value):
    """Formats values for better Markdown readability."""
    if isinstance(value, list):
        return ", ".join(str(v) for v in value)
    return str(value)

# Parse the JSON returned by Gemini
data = json.loads(response.text)

markdown_output = generate_markdown_table(data)
print(markdown_output)

# AirBed&Breakfast Pitch Evaluation

## Startup Name

AirBed&Breakfast

## Industry

Online Travel & Hospitality

## Problem Statement

| Key | Value |
| --- | ----- |
| Clarity | 4 |
| Relevance | 5 |
| Feedback | The problem statement clearly identifies the issues with traditional hotel stays: high costs, disconnection from local culture, and lack of a platform for local hosting. It effectively highlights the need for an alternative solution in the travel accommodation market. |

## Solution

| Key | Value |
| --- | ----- |
| Uniqueness | 4 |
| Feasibility | 5 |
| Feedback | The solution is a web platform that allows users to rent out their spaces to travelers. This leverages the sharing economy and offers a unique value proposition by providing cost savings, income opportunities, and cultural exchange. The feasibility is high, considering the rise of online platforms and the increasing demand for alternative travel accommodations. |

## Market Analysis

| Key | Value |
| --- | -----