# YC Companies: GenAI-Enhanced Analysis

**Next-generation startup intelligence powered by AI** - Transform raw YC data into actionable insights, predictive narratives, and competitive intelligence.

## 📋 Table of Contents

### 🚀 **Getting Started**
- [1. Setup & Environment](#1-setup--environment)
- [2. Data Loading & Enhanced Feature Engineering](#2-data-loading--enhanced-feature-engineering)
- [3. Executive Portfolio Dashboard](#3-executive-portfolio-dashboard)

### 📊 **Comprehensive Data Analysis**
- [4. Success Factor Correlation Analysis](#4-success-factor-correlation-analysis)
- [5. Solo vs Team Founder Comparative Analysis](#5-solo-vs-team-founder-comparative-analysis)
- [6. Temporal Trends & Batch Performance Analysis](#6-temporal-trends--batch-performance-analysis)
- [7. Industry Market Analysis & Saturation Mapping](#7-industry-market-analysis--saturation-mapping)
- [8. Geographic Success Intelligence](#8-geographic-success-intelligence)
- [9. Advanced Predictive Modeling & ML Pipeline](#9-advanced-predictive-modeling--ml-pipeline)
- [10. Personalized Recommendations Engine](#10-personalized-recommendations-engine)

### 🧠 **AI-Powered Intelligence**
- [11. AI Company Profiling & Archetype Detection](#11-ai-company-profiling--archetype-detection)
- [12. Market Intelligence & Opportunity Detection](#12-market-intelligence--opportunity-detection)
- [13. Interactive AI Analysis Interface](#13-interactive-ai-analysis-interface)

### 📈 **Advanced Visualizations & Analytics**
- [14. Dynamic Interactive Dashboards](#14-dynamic-interactive-dashboards)
- [15. Year-Level Batch Intelligence](#15-year-level-batch-intelligence)
- [16. Multi-Year Trend Analysis & Forecasting](#16-multi-year-trend-analysis--forecasting)

### 🎯 **Quick Start & Interactive Demos**
- [17. Quick Start Guide](#17-quick-start-guide)
- [18. Interactive Analysis Demos](#18-interactive-analysis-demos)

### 📖 **Documentation & Deployment**
- [19. Summary & Next Steps](#19-summary--next-steps)
- [20. Requirements & Setup](#20-requirements--setup)

---

## 🚀 What's New

**Beyond Traditional Analysis:**
- **🧠 AI-Powered Company Profiling** - Automatic archetype detection and positioning
- **📈 Predictive Success Narratives** - LLM-generated success pattern analysis
- **🔍 Dynamic Market Intelligence** - Real-time trend detection and opportunity mapping
- **💬 Conversational Analytics** - Natural language query interface
- **🎯 Competitive Intelligence** - Automated competitive landscape analysis

---


In [60]:
# Enhanced Setup & Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime, timedelta
from collections import Counter, defaultdict
import re
import json
from typing import Dict, List, Tuple, Optional

# AI/ML Libraries
from openai import OpenAI
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import umap

# Advanced Analytics
from textblob import TextBlob
import networkx as nx
from wordcloud import WordCloud

# Interactive Components
import ipywidgets as widgets
from IPython.display import display, HTML, Markdown

warnings.filterwarnings('ignore')

# Enhanced Plot Settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 12

# Color Palettes
STATUS_COLORS = {
    'Public': '#00CC66',
    'Acquired': '#0099FF', 
    'Active': '#FFAA00',
    'Inactive': '#FF3333',
    'Dead': '#999999'
}

ARCHETYPE_COLORS = {
    'B2B SaaS': '#1f77b4',
    'Marketplace': '#ff7f0e',
    'Consumer': '#2ca02c',
    'AI/ML': '#d62728',
    'Fintech': '#9467bd',
    'Healthcare': '#8c564b',
    'Other': '#7f7f7f'
}

print("🚀 Enhanced GenAI Analysis Environment Loaded")
print("📊 Ready for advanced startup intelligence analysis")


🚀 Enhanced GenAI Analysis Environment Loaded
📊 Ready for advanced startup intelligence analysis


## 1. Setup & Environment


In [61]:
# Load and enhance YC data
data_path = '../data/2025-10-05-yc.companies.jl'

try:
    df = pd.read_json(data_path, lines=True)
    print(f"✅ Loaded {len(df):,} companies from {data_path.split('/')[-1]}")
except FileNotFoundError:
    # Fallback to older data
    data_path = '../data/2025-05-03.jl'
    df = pd.read_json(data_path, lines=True)
    print(f"⚠️ Using fallback data: {len(df):,} companies from {data_path.split('/')[-1]}")

# Enhanced data preparation
def enhance_company_data(df):
    """Enhanced data preprocessing with AI-ready features"""
    df = df.copy()
    
    # Basic features
    df['is_solo_founder'] = df['num_founders'] == 1
    df['is_team_founded'] = df['num_founders'] > 1
    
    # Extract batch year and season
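    def parse_batch(batch):
        """Return (season, 4-digit year) from strings like 'Winter 2025' or 'W21'."""
        if pd.isna(batch):
            return None, None
        match = re.search(r'(Winter|Summer|Spring|Fall|W|S|IK)\s*(\d{2,4})', str(batch))
        if match:
            season, year = match.groups()
            # Normalize abbreviations so 'W21' and 'Winter 2021' share one season label
            season = {'W': 'Winter', 'S': 'Summer'}.get(season, season)
            year = int(year)
            if year < 50:
                year += 2000
            elif year < 100:
                year += 1900
            return season, year
        return None, None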
    
    df[['batch_season', 'batch_year']] = df['batch'].apply(parse_batch).apply(pd.Series)
    
    # Company age and maturity
    current_year = 2025
    df['company_age'] = df['year_founded'].apply(lambda x: current_year - x if pd.notna(x) else None)
    df['is_mature'] = df['company_age'] >= 3
    
    # Success indicators
    df['is_successful'] = df['status'].apply(lambda x: 'acquired' in str(x).lower() or 'public' in str(x).lower())
    df['is_active'] = df['status'].apply(lambda x: 'active' in str(x).lower())
    
    # Location features
    df['is_sf_bay'] = df['location'].str.contains('San Francisco|Palo Alto|Mountain View|Menlo Park', case=False, na=False)
    df['is_us'] = df['country'] == 'US'
    
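    # Industry features
    # Caveat: these are substring matches. 'ai' also hits tags such as 'retail',
    # and it misses 'artificial-intelligence', which has no literal 'ai' in it.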
    df['is_ai'] = df['tags'].apply(lambda x: any('ai' in str(tag).lower() or 'machine-learning' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    df['is_b2b'] = df['tags'].apply(lambda x: any('b2b' in str(tag).lower() or 'saas' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    df['is_fintech'] = df['tags'].apply(lambda x: any('fintech' in str(tag).lower() or 'financial' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    
    # Text features for AI analysis
    df['combined_text'] = df.apply(lambda x: f"{x['company_name']}: {x['short_description']}. {x.get('long_description', '')} Tags: {', '.join(x['tags']) if isinstance(x['tags'], list) else ''}", axis=1)
    df['text_length'] = df['combined_text'].str.len()
    df['num_tags'] = df['tags'].apply(lambda x: len(x) if isinstance(x, list) else 0)
    
    # Sentiment analysis
    df['sentiment_score'] = df['short_description'].apply(lambda x: TextBlob(str(x)).sentiment.polarity if pd.notna(x) else 0)
    df['sentiment_label'] = df['sentiment_score'].apply(lambda x: 'Positive' if x > 0.1 else 'Negative' if x < -0.1 else 'Neutral')
    
    return df

# Apply enhancements
df = enhance_company_data(df)

print(f"\n📊 Enhanced Dataset Info:")
print(f"  Total companies: {len(df):,}")
print(f"  Solo founders: {df['is_solo_founder'].sum():,} ({df['is_solo_founder'].mean()*100:.1f}%)")
print(f"  Mature companies: {df['is_mature'].sum():,} ({df['is_mature'].mean()*100:.1f}%)")
print(f"  Successful companies: {df['is_successful'].sum():,} ({df['is_successful'].mean()*100:.1f}%)")
print(f"  AI companies: {df['is_ai'].sum():,} ({df['is_ai'].mean()*100:.1f}%)")
print(f"  B2B companies: {df['is_b2b'].sum():,} ({df['is_b2b'].mean()*100:.1f}%)")

display(df.head(3))


✅ Loaded 5,463 companies from 2025-10-05-yc.companies.jl

📊 Enhanced Dataset Info:
  Total companies: 5,463
  Solo founders: 1,423 (26.0%)
  Mature companies: 3,053 (55.9%)
  Successful companies: 714 (13.1%)
  AI companies: 1,386 (25.4%)
  B2B companies: 1,652 (30.2%)


|  | company_id | company_name | short_description | long_description | batch | status | tags | location | country | year_founded | ... | is_sf_bay | is_us | is_ai | is_b2b | is_fintech | combined_text | text_length | num_tags | sentiment_score | sentiment_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31009 | Bear | Show up on AI Search Engines | Bear AI helps companies show up AI search engi... | Fall 2025 | Active | [generative-ai, saas, b2b] | San Francisco | US | 2025.0 | ... | True | True | True | True | False | Bear: Show up on AI Search Engines. Bear AI he... | 350 | 3 | 0.0 | Neutral |
| 1 | 31002 | Clicks | The first AI back-office worker that works lik... | Clicks automates your workflows using your exi... | Fall 2025 | Active | [artificial-intelligence, workflow-automation,... | San Francisco | US | 2025.0 | ... | True | True | False | False | False | Clicks: The first AI back-office worker that w... | 1162 | 3 | 0.125 | Positive |
| 2 | 31011 | Openroll | The world's first Agentic compensation platform | Openroll gives companies real-time intelligenc... | Fall 2025 | Active | [b2b, hr-tech, data-visualization, ai] | Stockholm, Sweden | SE | 2024.0 | ... | False | False | True | True | False | Openroll: The world's first Agentic compensati... | 383 | 4 | 0.25 | Positive |
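
The sample rows above hint at a limitation of the substring-based tag flags: Clicks carries the `artificial-intelligence` tag yet shows `is_ai` as False, because that tag contains no literal `ai`. A minimal sketch of token-based matching follows. The helper `has_tag_token` and the keyword set `AI_TOKENS` are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical keyword set; adjust it to the tags present in your snapshot.
AI_TOKENS = {"ai", "ml", "generative-ai", "artificial-intelligence", "machine-learning"}

def has_tag_token(tags, keywords):
    """Return True if any tag equals a keyword or contains one as a whole token."""
    if not isinstance(tags, list):
        return False
    for tag in tags:
        tag = str(tag).lower()
        if tag in keywords:
            return True
        # Split 'conversational-ai' into {'conversational', 'ai'} and compare whole tokens
        if keywords & set(re.split(r"[-_\s]+", tag)):
            return True
    return False

print(has_tag_token(["artificial-intelligence", "workflow-automation"], AI_TOKENS))
print(has_tag_token(["retail", "e-commerce"], AI_TOKENS))
```

In the notebook this could stand in for the lambda, for example `df['tags'].apply(lambda t: has_tag_token(t, AI_TOKENS))`. Counts such as the AI share would shift accordingly, so the recorded outputs would need to be regenerated.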


## 2. Data Loading & Preparation


## 2.5. Executive Summary Dashboard


In [62]:
# Executive Summary Dashboard - Key Metrics and KPIs
def create_executive_summary():
    """Create comprehensive executive summary with key metrics"""
    
    # Key Metrics
    total_companies = len(df)
    total_founders = df['num_founders'].sum()
    successful_companies = df['is_successful'].sum()
    success_rate = successful_companies / total_companies * 100
    solo_success_rate = df[df['is_solo_founder']]['is_successful'].mean() * 100
    team_success_rate = df[df['is_team_founded']]['is_successful'].mean() * 100
    
    print("═" * 60)
    print("  Y COMBINATOR PORTFOLIO OVERVIEW")
    print("═" * 60)
    print(f"\n📈 PORTFOLIO METRICS:")
    print(f"  Total Companies:        {total_companies:,}")
    print(f"  Total Founders:         {total_founders:,.0f}")
    print(f"  Public/Acquired:        {successful_companies:,} ({success_rate:.1f}%)")
    print(f"  Active Companies:       {(df['status'] == 'Active').sum():,}")
    print(f"  Mature Companies:       {df['is_mature'].sum():,} ({df['is_mature'].mean()*100:.1f}%)")
    
    print(f"\n👥 FOUNDER COMPOSITION:")
    print(f"  Solo Founders:          {df['is_solo_founder'].sum():,} ({df['is_solo_founder'].mean()*100:.1f}%)")
    print(f"  2 Co-founders:          {(df['num_founders'] == 2).sum():,} ({(df['num_founders'] == 2).mean()*100:.1f}%)")
    print(f"  3+ Co-founders:         {(df['num_founders'] >= 3).sum():,} ({(df['num_founders'] >= 3).mean()*100:.1f}%)")
    
    print(f"\n🎯 SUCCESS RATES:")
    print(f"  Solo Founders:          {solo_success_rate:.1f}%")
    print(f"  Team Founders:          {team_success_rate:.1f}%")
    print(f"  Difference:             {team_success_rate - solo_success_rate:+.1f}pp")
    
    print(f"\n🌍 GEOGRAPHIC DISTRIBUTION:")
    print(f"  SF Bay Area:            {df['is_sf_bay'].sum():,} ({df['is_sf_bay'].mean()*100:.1f}%)")
    print(f"  United States:          {df['is_us'].sum():,} ({df['is_us'].mean()*100:.1f}%)")
    print(f"  International:          {(~df['is_us']).sum():,} ({(~df['is_us']).mean()*100:.1f}%)")
    
    print(f"\n🤖 INDUSTRY TRENDS:")
    print(f"  AI-related companies:   {df['is_ai'].sum():,} ({df['is_ai'].mean()*100:.1f}%)")
    print(f"  B2B companies:          {df['is_b2b'].sum():,} ({df['is_b2b'].mean()*100:.1f}%)")
    print(f"  Fintech companies:      {df['is_fintech'].sum():,} ({df['is_fintech'].mean()*100:.1f}%)")
    
    print("\n" + "═" * 60)
    
    return {
        'total_companies': total_companies,
        'success_rate': success_rate,
        'solo_success_rate': solo_success_rate,
        'team_success_rate': team_success_rate
    }

# Create executive summary
executive_metrics = create_executive_summary()


════════════════════════════════════════════════════════════
  Y COMBINATOR PORTFOLIO OVERVIEW
════════════════════════════════════════════════════════════

📈 PORTFOLIO METRICS:
  Total Companies:        5,463
  Total Founders:         10,729
  Public/Acquired:        714 (13.1%)
  Active Companies:       3,782
  Mature Companies:       3,053 (55.9%)

👥 FOUNDER COMPOSITION:
  Solo Founders:          1,423 (26.0%)
  2 Co-founders:          2,960 (54.2%)
  3+ Co-founders:         1,063 (19.5%)

🎯 SUCCESS RATES:
  Solo Founders:          13.5%
  Team Founders:          12.9%
  Difference:             -0.6pp

🌍 GEOGRAPHIC DISTRIBUTION:
  SF Bay Area:            2,191 (40.1%)
  United States:          3,725 (68.2%)
  International:          1,738 (31.8%)

🤖 INDUSTRY TRENDS:
  AI-related companies:   1,386 (25.4%)
  B2B companies:          1,652 (30.2%)
  Fintech companies:      696 (12.7%)

════════════════════════════════════════════════════════════
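
The -0.6pp gap between solo and team success rates is small, so it is worth checking whether it exceeds sampling noise. The sketch below runs a two-proportion z-test with only the standard library. The success counts (192 of 1,423 solo, 519 of 4,023 team) are back-calculated from the rounded percentages above and should be treated as assumptions; the function name is illustrative.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two independent proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumed counts, reconstructed from the rounded rates (13.5% of 1,423; 12.9% of 4,023)
z, p_value = two_proportion_z_test(192, 1423, 519, 4023)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

With these counts, |z| stays well below 1.96, so the data do not separate solo and team founders at the usual 5% level. These rates cover all companies rather than mature ones only, which is a further reason to read the gap cautiously.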


## 2.6. Success Factor Analysis


In [63]:
# Success Factor Analysis - What predicts startup success?
def analyze_success_factor(column, title, top_n=15, min_companies=20):
    """Analyze success rate for a given categorical column
    
    Note: Uses only mature companies (≥3 years old). Younger companies have had
    little time to exit, so including them would understate success rates.
    """
    # Filter to mature companies only
    mature_df = df[df['is_mature']].copy()
    
    success_analysis = mature_df.groupby(column).agg({
        'is_successful': ['sum', 'mean', 'count']
    }).round(3)
    success_analysis.columns = ['Successful', 'Success_Rate', 'Total']
    
    # Drop groups with fewer than min_companies rows; their rates are too noisy to rank
    success_analysis = success_analysis[success_analysis['Total'] >= min_companies]
    success_analysis = success_analysis.sort_values('Success_Rate', ascending=False).head(top_n)
    
    # Plot
    fig = go.Figure()
    fig.add_trace(go.Bar(
        y=success_analysis.index,
        x=success_analysis['Success_Rate'] * 100,
        orientation='h',
        text=[f"{val*100:.1f}% (n={count:.0f})" 
              for val, count in zip(success_analysis['Success_Rate'], success_analysis['Total'])],
        textposition='outside',
        marker_color='#0099FF'
    ))
    
    fig.update_layout(
        title=f"{title}<br><sub>Mature companies only (≥3 years old), min {min_companies} companies</sub>",
        xaxis_title='Success Rate (%)',
        yaxis_title=column.replace('_', ' ').title(),
        height=max(400, top_n * 30)
    )
    fig.show()
    
    return success_analysis

# Correlation Analysis
def analyze_correlations():
    """Analyze correlation between numeric features and success"""
    numeric_features = ['num_founders', 'team_size', 'company_age', 'num_tags']
    correlations = []
    
    # Use mature companies only
    mature_df = df[df['is_mature']].copy()
    
    for feature in numeric_features:
        if feature in mature_df.columns:
            corr = mature_df[[feature, 'is_successful']].dropna().corr().iloc[0, 1]
            correlations.append({'Feature': feature, 'Correlation': corr})
    
    corr_df = pd.DataFrame(correlations).sort_values('Correlation', ascending=False)
    
    fig = go.Figure(go.Bar(
        x=corr_df['Correlation'],
        y=corr_df['Feature'],
        orientation='h',
        marker_color=['#00CC66' if x > 0 else '#FF3333' for x in corr_df['Correlation']]
    ))
    
    fig.update_layout(
        title='Feature Correlation with Success (Public/Acquired)<br><sub>Pearson correlation - association only, not causation</sub>',
        xaxis_title='Correlation Coefficient',
        height=300
    )
    fig.show()
    
    print("\n📊 Correlation Insights:")
    print("⚠️  Important: Correlation ≠ Causation. These show association, not cause and effect.\n")
    for _, row in corr_df.iterrows():
        direction = "positively" if row['Correlation'] > 0 else "negatively"
        strength = "strongly" if abs(row['Correlation']) > 0.3 else ("moderately" if abs(row['Correlation']) > 0.1 else "weakly")
        print(f"  • {row['Feature']}: {strength} {direction} associated ({row['Correlation']:.3f})")
    
    return corr_df

print("🎯 Analyzing success factors...\n")
print("⚠️  Note: Success-factor analysis uses MATURE companies (≥3 years), since younger companies have had little time to exit.")

# Run correlation analysis
correlation_results = analyze_correlations()


🎯 Analyzing success factors...

⚠️  Note: Success-factor analysis uses MATURE companies (≥3 years), since younger companies have had little time to exit.



📊 Correlation Insights:
⚠️  Important: Correlation ≠ Causation. These show association, not cause and effect.

  • company_age: moderately positively associated (0.244)
  • team_size: weakly positively associated (0.053)
  • num_founders: weakly positively associated (0.034)
  • num_tags: moderately negatively associated (-0.113)
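
Because `is_successful` is a 0/1 flag, the Pearson coefficients above are point-biserial correlations: each one is a rescaled difference between the mean feature value of successful and unsuccessful companies. The sketch below checks that equivalence on a tiny made-up sample (the numbers are illustrative, not YC data) using only the standard library.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def point_biserial(x, flag):
    """Gap between the two group means, scaled by the overall spread."""
    n = len(x)
    ones = [a for a, f in zip(x, flag) if f == 1]
    zeros = [a for a, f in zip(x, flag) if f == 0]
    mean_all = sum(x) / n
    sd_all = math.sqrt(sum((a - mean_all) ** 2 for a in x) / n)  # population SD
    mean_gap = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_gap / sd_all * math.sqrt(len(ones) * len(zeros) / n**2)

# Made-up example: company age in years and a 0/1 success flag
age = [1, 2, 3, 4, 6, 8, 10, 12]
success = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson(age, success)
r_pb = point_biserial(age, success)
print(f"pearson = {r:.4f}, point-biserial = {r_pb:.4f}")
```

The two values agree, which is a reminder that a coefficient such as 0.244 for `company_age` mostly says that exited companies are older on average, something the Public/Acquired definition of success makes likely by construction.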


## 2.7. Solo Founder Deep Dive


In [64]:
# Solo vs Team Founder Comprehensive Analysis
def solo_team_comparison():
    """Comprehensive comparison between solo and team founders"""
    solo_df = df[df['is_solo_founder']].copy()
    team_df = df[df['is_team_founded']].copy()
    
    # Share of companies whose current team is larger than the founding team.
    # Computed up front so every column stays numeric and .round(2) takes effect.
    solo_growth = (solo_df['team_size'] > solo_df['num_founders']).mean() * 100
    team_growth = (team_df['team_size'] > team_df['num_founders']).mean() * 100

    comparison = pd.DataFrame({
        'Metric': [
            'Total Companies',
            'Success Rate (%)',
            'Avg Team Size',
            'Avg Company Age (years)',
            'SF Bay Area (%)',
            'AI-related (%)',
            'B2B (%)',
            'Has Grown Team (%)'
        ],
        'Solo Founders': [
            len(solo_df),
            solo_df['is_successful'].mean() * 100,
            solo_df['team_size'].mean(),
            solo_df['company_age'].mean(),
            solo_df['is_sf_bay'].mean() * 100,
            solo_df['is_ai'].mean() * 100,
            solo_df['is_b2b'].mean() * 100,
            solo_growth
        ],
        'Team Founders': [
            len(team_df),
            team_df['is_successful'].mean() * 100,
            team_df['team_size'].mean(),
            team_df['company_age'].mean(),
            team_df['is_sf_bay'].mean() * 100,
            team_df['is_ai'].mean() * 100,
            team_df['is_b2b'].mean() * 100,
            team_growth
        ]
    })

    comparison['Difference'] = comparison['Team Founders'] - comparison['Solo Founders']
    comparison = comparison.round(2)
    
    print("\n" + "═" * 80)
    print("  SOLO vs TEAM FOUNDERS: HEAD-TO-HEAD COMPARISON")
    print("═" * 80)
    print(comparison.to_string(index=False))
    print("═" * 80)
    
    return comparison

def analyze_solo_industries():
    """Analyze top industries for solo founders by success rate"""
    solo_df = df[df['is_solo_founder']].copy()
    
    # Calculate success rate per industry for solo founders
    industry_success = {}
    for tag in df['tags'].explode().value_counts().head(30).index:
        companies_with_tag = solo_df[solo_df['tags'].apply(
            lambda x: tag in x if isinstance(x, list) else False
        )]
        if len(companies_with_tag) >= 5:
            success_rate = companies_with_tag['is_successful'].mean()
            industry_success[tag] = {
                'success_rate': success_rate,
                'count': len(companies_with_tag),
                'successful': companies_with_tag['is_successful'].sum()
            }
    
    industry_df = pd.DataFrame(industry_success).T.sort_values('success_rate', ascending=False).head(15)
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        y=industry_df.index,
        x=industry_df['success_rate'] * 100,
        orientation='h',
        text=[f"{val*100:.1f}% ({count:.0f} cos)" 
              for val, count in zip(industry_df['success_rate'], industry_df['count'])],
        textposition='outside',
        marker_color='#FF6600'
    ))
    
    fig.update_layout(
        title='Top 15 Industries for Solo Founders by Success Rate',
        xaxis_title='Success Rate (%)',
        yaxis_title='Industry',
        height=500
    )
    fig.show()
    
    return industry_df

def solo_trends_over_time():
    """Analyze solo founder trends over time"""
    solo_trends = df.groupby('batch_year').agg({
        'is_solo_founder': ['sum', 'mean']
    })
    solo_trends.columns = ['Solo_Count', 'Solo_Percentage']
    solo_trends = solo_trends.dropna()
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig.add_trace(
        go.Bar(x=solo_trends.index, y=solo_trends['Solo_Count'], name='# Solo Founders', marker_color='#0099FF'),
        secondary_y=False
    )
    
    fig.add_trace(
        go.Scatter(x=solo_trends.index, y=solo_trends['Solo_Percentage']*100, 
                   name='% Solo', mode='lines+markers', marker_color='#FF3333', line=dict(width=3)),
        secondary_y=True
    )
    
    fig.update_layout(title='Solo Founder Trends: YC Acceptance Over Time', height=400)
    fig.update_yaxes(title_text="Number of Solo Founders", secondary_y=False)
    fig.update_yaxes(title_text="Percentage Solo (%)", secondary_y=True)
    fig.update_xaxes(title_text="Batch Year")
    
    fig.show()
    
    recent_solo_pct = solo_trends['Solo_Percentage'].tail(3).mean() * 100
    early_solo_pct = solo_trends['Solo_Percentage'].head(5).mean() * 100
    print(f"\n📈 Solo Founder Trend:")
    print(f"  Early batches (avg): {early_solo_pct:.1f}% solo")
    print(f"  Recent batches (avg): {recent_solo_pct:.1f}% solo")
    print(f"  Change: {recent_solo_pct - early_solo_pct:+.1f}pp")
    
    return solo_trends

# Run solo founder analysis
print("🔍 Deep Dive: Solo vs Team Founders")
comparison_results = solo_team_comparison()
solo_industries = analyze_solo_industries()
solo_trends = solo_trends_over_time()


🔍 Deep Dive: Solo vs Team Founders

════════════════════════════════════════════════════════════════════════════════
  SOLO vs TEAM FOUNDERS: HEAD-TO-HEAD COMPARISON
════════════════════════════════════════════════════════════════════════════════
                 Metric  Solo Founders  Team Founders  Difference
        Total Companies        1423.00        4023.00     2600.00
       Success Rate (%)          13.49          12.90       -0.59
          Avg Team Size          46.48          50.24        3.76
Avg Company Age (years)           5.49           4.58       -0.92
        SF Bay Area (%)          41.46          39.57       -1.89
         AI-related (%)          21.22          26.90        5.67
                B2B (%)          25.37          32.09        6.72
     Has Grown Team (%)          87.98          69.40      -18.58
════════════════════════════════════════════════════════════════════════════════



📈 Solo Founder Trend:
  Early batches (avg): 39.3% solo
  Recent batches (avg): 16.9% solo
  Change: -22.4pp
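
`analyze_solo_industries` ranks tags with as few as 5 solo-founded companies, so a single exit can swing a tag's success rate sharply. One hedge is to report a Wilson score interval alongside each rate. The sketch below is a standard-library implementation; the function name and the example counts are illustrative, not taken from the dataset.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0, 1.0
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Same 20% point estimate, very different certainty
for successes, total in [(1, 5), (40, 200)]:
    low, high = wilson_interval(successes, total)
    print(f"{successes}/{total}: {low:.1%} to {high:.1%}")
```

Sorting by the lower bound instead of the raw rate is one way to keep tiny groups from dominating a "top 15" chart.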


## 2.8. Temporal & Batch Analysis


In [65]:
# Temporal Analysis - How has YC evolved over 20 years?
def analyze_batch_trends():
    """Analyze batch statistics and trends over time"""
    batch_stats = df.groupby('batch_year').agg({
        'company_id': 'count',
        'is_successful': 'mean',
        'num_founders': 'mean',
        'team_size': 'mean',
        'is_ai': 'mean',
        'is_b2b': 'mean'
    }).round(3)
    
    batch_stats.columns = ['Batch_Size', 'Success_Rate', 'Avg_Founders', 'Avg_Team_Size', 'AI_Pct', 'B2B_Pct']
    batch_stats = batch_stats.dropna()
    
    # Plot multiple trends
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Batch Size Over Time', 'Success Rate by Vintage', 
                        'AI Companies Trend', 'B2B Companies Trend')
    )
    
    # 1. Batch size
    fig.add_trace(
        go.Scatter(x=batch_stats.index, y=batch_stats['Batch_Size'], 
                   mode='lines+markers', marker_color='#0099FF', fill='tozeroy'),
        row=1, col=1
    )
    
    # 2. Success rate
    fig.add_trace(
        go.Scatter(x=batch_stats.index, y=batch_stats['Success_Rate']*100, 
                   mode='lines+markers', marker_color='#00CC66'),
        row=1, col=2
    )
    
    # 3. AI trend
    fig.add_trace(
        go.Scatter(x=batch_stats.index, y=batch_stats['AI_Pct']*100, 
                   mode='lines+markers', marker_color='#9933FF', fill='tozeroy'),
        row=2, col=1
    )
    
    # 4. B2B trend
    fig.add_trace(
        go.Scatter(x=batch_stats.index, y=batch_stats['B2B_Pct']*100, 
                   mode='lines+markers', marker_color='#FF6600', fill='tozeroy'),
        row=2, col=2
    )
    
    fig.update_yaxes(title_text="# Companies", row=1, col=1)
    fig.update_yaxes(title_text="Success Rate (%)", row=1, col=2)
    fig.update_yaxes(title_text="% AI Companies", row=2, col=1)
    fig.update_yaxes(title_text="% B2B Companies", row=2, col=2)
    
    fig.update_layout(height=700, showlegend=False, title_text="YC Evolution: 20-Year Trends")
    fig.show()
    
    return batch_stats

def analyze_seasonal_patterns():
    """Analyze seasonal patterns in YC batches"""
    
    # Check if batch_season column exists and has data
    if 'batch_season' not in df.columns:
        print("⚠️  batch_season column not found. Creating it now...")
        # Create batch_season column if it doesn't exist
        def parse_batch(batch):
            """Extract season and year from batch string"""
            if pd.isna(batch):
                return None, None
            
            # Handle new format: 'Winter 2025', 'Fall 2024'
            match = re.search(r'(Winter|Summer|Spring|Fall|W|S|IK)\s*(\d{2,4})', str(batch))
            if match:
                season = match.group(1)
                year = match.group(2)
                
                # Convert to 4-digit year
                if len(year) == 2:
                    year = int(year)
                    year = 2000 + year if year < 50 else 1900 + year
                else:
                    year = int(year)
                    
                # Map season abbreviations
                season_map = {'W': 'Winter', 'S': 'Summer', 'IK': 'IK'}
                season = season_map.get(season, season)
                
                return season, year
            return None, None
        
        df[['batch_season', 'batch_year']] = df['batch'].apply(lambda x: pd.Series(parse_batch(x)))
    
    # Filter out NaN values for seasonal analysis
    seasonal_df = df[df['batch_season'].notna()].copy()
    
    print(f"📊 Seasonal data available: {len(seasonal_df)} companies")
    print(f"📊 Unique seasons: {seasonal_df['batch_season'].unique()}")
    
    if len(seasonal_df) == 0:
        print("⚠️  No seasonal data available. Skipping seasonal analysis.")
        return None
    
    season_stats = seasonal_df.groupby('batch_season').agg({
        'company_id': 'count',
        'is_successful': 'mean',
        'num_founders': 'mean',
        'team_size': 'mean'
    }).round(3)
    
    season_stats.columns = ['Count', 'Success_Rate', 'Avg_Founders', 'Avg_Team_Size']
    
    # Filter seasons with sufficient data
    season_stats = season_stats[season_stats['Count'] >= 10]
    
    if len(season_stats) == 0:
        print("⚠️  No seasons with sufficient data (≥10 companies).")
        print("🔄 Falling back to batch year analysis...")
        
        # Fallback: analyze by batch year instead
        year_stats = df.groupby('batch_year').agg({
            'company_id': 'count',
            'is_successful': 'mean',
            'num_founders': 'mean',
            'team_size': 'mean'
        }).round(3)
        
        year_stats.columns = ['Count', 'Success_Rate', 'Avg_Founders', 'Avg_Team_Size']
        year_stats = year_stats[year_stats['Count'] >= 10]
        
        if len(year_stats) == 0:
            print("⚠️  No batch years with sufficient data. Skipping seasonal analysis.")
            return None
        
        print("\n📅 BATCH YEAR COMPARISON (Fallback):")
        print(year_stats.head(10).to_string())
        
        # 2x2 grid so each metric gets its own y-axis scale
        # (make_subplots is already imported at the top of the notebook)
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Companies per Year', 'Success Rate by Year', 
                            'Average Founders by Year', 'Average Team Size by Year'),
            specs=[[{"type": "bar"}, {"type": "bar"}],
                   [{"type": "bar"}, {"type": "bar"}]]
        )
        
        # Get top 5 years for visualization
        top_years = year_stats.head(5)
        
        # 1. Companies (row=1, col=1)
        fig.add_trace(
            go.Bar(x=top_years.index, y=top_years['Count'], 
                   name='Companies', marker_color='lightblue'),
            row=1, col=1
        )
        
        # 2. Success Rate (row=1, col=2)
        fig.add_trace(
            go.Bar(x=top_years.index, y=top_years['Success_Rate']*100, 
                   name='Success Rate %', marker_color='lightgreen'),
            row=1, col=2
        )
        
        # 3. Average Founders (row=2, col=1)
        fig.add_trace(
            go.Bar(x=top_years.index, y=top_years['Avg_Founders'], 
                   name='Avg Founders', marker_color='lightcoral'),
            row=2, col=1
        )
        
        # 4. Average Team Size (row=2, col=2)
        fig.add_trace(
            go.Bar(x=top_years.index, y=top_years['Avg_Team_Size'], 
                   name='Avg Team Size', marker_color='lightyellow'),
            row=2, col=2
        )
        
        # Update layout
        fig.update_layout(
            title='Batch Year Comparison (Fallback) - Properly Scaled',
            showlegend=False,
            height=600
        )
        
        # Update axes labels
        fig.update_xaxes(title_text="Year", row=1, col=1)
        fig.update_xaxes(title_text="Year", row=1, col=2)
        fig.update_xaxes(title_text="Year", row=2, col=1)
        fig.update_xaxes(title_text="Year", row=2, col=2)
        
        fig.update_yaxes(title_text="Number of Companies", row=1, col=1)
        fig.update_yaxes(title_text="Success Rate (%)", row=1, col=2)
        fig.update_yaxes(title_text="Average Founders", row=2, col=1)
        fig.update_yaxes(title_text="Average Team Size", row=2, col=2)
        
        fig.show()
        
        return year_stats
    
    print("\n🌤️  SEASONAL BATCH COMPARISON:")
    print(season_stats.to_string())
    
    # 2x2 grid so each metric gets its own y-axis scale
    # (make_subplots is already imported at the top of the notebook)
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Companies per Season', 'Success Rate by Season', 
                        'Average Founders by Season', 'Average Team Size by Season'),
        specs=[[{"type": "bar"}, {"type": "bar"}],
               [{"type": "bar"}, {"type": "bar"}]]
    )
    
    # 1. Companies (row=1, col=1)
    fig.add_trace(
        go.Bar(x=season_stats.index, y=season_stats['Count'], 
               name='Companies', marker_color='lightblue'),
        row=1, col=1
    )
    
    # 2. Success Rate (row=1, col=2)
    fig.add_trace(
        go.Bar(x=season_stats.index, y=season_stats['Success_Rate']*100, 
               name='Success Rate %', marker_color='lightgreen'),
        row=1, col=2
    )
    
    # 3. Average Founders (row=2, col=1)
    fig.add_trace(
        go.Bar(x=season_stats.index, y=season_stats['Avg_Founders'], 
               name='Avg Founders', marker_color='lightcoral'),
        row=2, col=1
    )
    
    # 4. Average Team Size (row=2, col=2)
    fig.add_trace(
        go.Bar(x=season_stats.index, y=season_stats['Avg_Team_Size'], 
               name='Avg Team Size', marker_color='lightyellow'),
        row=2, col=2
    )
    
    # Update layout
    fig.update_layout(
        title='Batch Season Comparison - Properly Scaled',
        showlegend=False,
        height=600
    )
    
    # Update axes labels
    fig.update_xaxes(title_text="Season")  # no row/col: applies to every subplot
    
    fig.update_yaxes(title_text="Number of Companies", row=1, col=1)
    fig.update_yaxes(title_text="Success Rate (%)", row=1, col=2)
    fig.update_yaxes(title_text="Average Founders", row=2, col=1)
    fig.update_yaxes(title_text="Average Team Size", row=2, col=2)
    
    fig.show()
    
    return season_stats

def analyze_industry_trends():
    """Analyze industry trends over time"""
    # Industry trends over time (top 10 industries)
    # Check the type before iterating so non-list values are skipped, not iterated
    all_tags = [tag for tags in df['tags'].dropna() if isinstance(tags, list) for tag in tags]
    top_industries = pd.Series(all_tags).value_counts().head(10).index
    
    industry_time_data = []
    for year in sorted(df['batch_year'].dropna().unique()):
        year_df = df[df['batch_year'] == year]
        for industry in top_industries:
            count = year_df['tags'].apply(lambda x: industry in x if isinstance(x, list) else False).sum()
            industry_time_data.append({
                'Year': year,
                'Industry': industry,
                'Count': count
            })
    
    industry_time_df = pd.DataFrame(industry_time_data)
    
    fig = px.line(
        industry_time_df,
        x='Year',
        y='Count',
        color='Industry',
        title='Industry Trends Over Time (Top 10 Industries)',
        labels={'Count': '# Companies per Year'}
    )
    
    fig.update_layout(height=500)
    fig.show()
    
    return industry_time_df

# Run temporal analysis
print("📈 Analyzing YC Evolution Over Time")
batch_trends = analyze_batch_trends()
seasonal_patterns = analyze_seasonal_patterns()
industry_trends = analyze_industry_trends()


📈 Analyzing YC Evolution Over Time


📊 Seasonal data available: 5463 companies
📊 Unique seasons: ['Fall' 'Summer' 'Spring' 'Winter']

🌤️  SEASONAL BATCH COMPARISON:
              Count  Success_Rate  Avg_Founders  Avg_Team_Size
batch_season                                                  
Fall            146         0.000         1.986          3.368
Spring          144         0.000         2.090          2.832
Summer         2511         0.136         1.973         46.876
Winter         2662         0.140         1.948         56.321
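
The code that builds `season_stats` is not shown in this section. A minimal sketch of how a table with the same columns could be produced with `groupby` and named aggregation, using a small made-up frame. The input column names (`batch_season`, `is_successful`, `num_founders`, `team_size`, `is_mature`) are assumed from the surrounding code, the numbers are invented, and `season_summary` is a hypothetical helper. The optional `mature_only` filter illustrates one way to keep very young batches from being compared directly with older ones.

```python
import pandas as pd

# Toy data; column names are assumed from the surrounding notebook code.
toy = pd.DataFrame({
    "batch_season": ["Winter", "Winter", "Summer", "Summer", "Fall"],
    "is_successful": [True, False, False, True, False],
    "num_founders": [2, 1, 3, 2, 2],
    "team_size": [40, 10, 25, 5, 3],
    "is_mature": [True, True, True, True, False],
})

def season_summary(frame, mature_only=False):
    """Aggregate per-season counts and means into one table."""
    if mature_only:
        frame = frame[frame["is_mature"]]
    return frame.groupby("batch_season").agg(
        Count=("is_successful", "size"),
        Success_Rate=("is_successful", "mean"),
        Avg_Founders=("num_founders", "mean"),
        Avg_Team_Size=("team_size", "mean"),
    ).round(3)

all_seasons = season_summary(toy)
mature_seasons = season_summary(toy, mature_only=True)
print(all_seasons.to_string())
print(mature_seasons.to_string())
```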


## 2.9. Industry Deep Dive


In [66]:
# Industry Analysis - Which industries are thriving vs oversaturated?
def analyze_industry_success():
    """Comprehensive industry analysis with success rates and saturation"""
    # Extract all tags and analyze success rates
    all_tags = []
    for tags in df['tags'].dropna():
        if isinstance(tags, list):
            all_tags.extend(tags)
    
    tag_counts = pd.Series(all_tags).value_counts()
    
    # Success rate per industry
    industry_analysis = []
    for tag in tag_counts.head(50).index:
        tag_companies = df[df['tags'].apply(lambda x: tag in x if isinstance(x, list) else False)]
        if len(tag_companies) >= 10:
            industry_analysis.append({
                'Industry': tag,
                'Total': len(tag_companies),
                'Successful': tag_companies['is_successful'].sum(),
                'Success_Rate': tag_companies['is_successful'].mean(),
                'Active': (tag_companies['status'] == 'Active').sum(),
                'Avg_Team_Size': tag_companies['team_size'].mean(),
                'Solo_Pct': tag_companies['is_solo_founder'].mean()
            })
    
    industry_df = pd.DataFrame(industry_analysis).sort_values('Success_Rate', ascending=False)
    
    print("\n🏆 TOP 20 INDUSTRIES BY SUCCESS RATE (min 10 companies):")
    print(industry_df.head(20).to_string(index=False))
    
    print("\n⚠️  BOTTOM 10 INDUSTRIES BY SUCCESS RATE:")
    print(industry_df.tail(10).to_string(index=False))
    
    return industry_df

def create_industry_saturation_map(industry_df):
    """Create industry saturation vs success bubble chart"""
    fig = px.scatter(
        industry_df.head(30),
        x='Total',
        y='Success_Rate',
        size='Avg_Team_Size',
        color='Solo_Pct',
        hover_data=['Industry', 'Successful', 'Active'],
        text='Industry',
        title='Industry Map: Saturation vs Success Rate (Top 30)',
        labels={'Total': 'Number of Companies', 'Success_Rate': 'Success Rate', 'Solo_Pct': '% Solo Founders'},
        color_continuous_scale='RdYlGn'
    )
    
    fig.update_traces(textposition='top center')
    fig.update_layout(height=600)
    fig.show()

def analyze_industry_saturation():
    """Analyze industry saturation levels"""
    industry_df = analyze_industry_success()
    
    # Create saturation categories
    industry_df['Saturation_Level'] = pd.cut(
        industry_df['Total'], 
        bins=[0, 50, 200, 500, 1000, float('inf')], 
        labels=['Underserved', 'Emerging', 'Growing', 'Mature', 'Oversaturated']
    )
    
    # Success rate by saturation level
    saturation_analysis = industry_df.groupby('Saturation_Level', observed=False).agg({
        'Success_Rate': 'mean',
        'Total': 'count'
    }).round(3)
    
    print("\n📊 INDUSTRY SATURATION ANALYSIS:")
    print(saturation_analysis.to_string())
    
    # Visualize saturation map
    create_industry_saturation_map(industry_df)
    
    return industry_df, saturation_analysis

# Run industry analysis
print("🏭 Deep Dive: Industry Analysis")
industry_results, saturation_analysis = analyze_industry_saturation()


🏭 Deep Dive: Industry Analysis

🏆 TOP 20 INDUSTRIES BY SUCCESS RATE (min 10 companies):
        Industry  Total  Successful  Success_Rate  Active  Avg_Team_Size  Solo_Pct
           video     85          19      0.223529      46      37.564706  0.223529
       analytics    181          39      0.215470     119      27.592179  0.254144
     real-estate     79          15      0.189873      51      20.102564  0.177215
          social     68          12      0.176471      25      38.529412  0.323529
        security     85          15      0.176471      58      32.308642  0.305882
data-engineering     91          15      0.164835      60      32.186813  0.186813
       education    164          26      0.158537     112      55.734568  0.274390
          gaming     85          13      0.152941      46      44.166667  0.317647
 computer-vision     72          11      0.152778      51      22.057143  0.208333
      e-commerce    236          36      0.152542     149     140.130435  0.224576
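
The per-tag loop in `analyze_industry_success` rescans the whole frame once for every tag. One alternative, sketched here on invented data, is to `explode` the tag lists into one row per (company, tag) pair and aggregate in a single pass. `industry_success` and its `min_companies` parameter are hypothetical names; `min_companies` stands in for the hard-coded threshold of 10.

```python
import pandas as pd

# Toy data; the real frame has many more columns.
toy = pd.DataFrame({
    "company_id": [1, 2, 3, 4],
    "tags": [["video", "b2b"], ["video"], ["b2b"], None],
    "is_successful": [True, True, False, True],
})

def industry_success(frame, min_companies=1):
    """Success rate per tag, computed with one groupby instead of a loop."""
    # Keep only rows whose tags value is a list, then emit one row per tag.
    has_list = frame["tags"].apply(lambda t: isinstance(t, list))
    long = (
        frame[has_list]
        .explode("tags")
        .assign(is_successful=lambda d: d["is_successful"].astype(int))
    )
    # Assumes each company lists a given tag at most once.
    stats = long.groupby("tags").agg(
        Total=("company_id", "nunique"),
        Successful=("is_successful", "sum"),
        Success_Rate=("is_successful", "mean"),
    )
    stats = stats[stats["Total"] >= min_companies]
    return stats.sort_values("Success_Rate", ascending=False)

industry_stats = industry_success(toy)
print(industry_stats.to_string())
```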

## 2.9.1. AI Industry Deep Dive


In [67]:
# Fixed AI Industry Analysis - Corrected subplot specifications
def analyze_ai_industry():
    """Deep dive analysis of AI/ML companies with proper subplot specs"""
    ai_df = df[df['is_ai']].copy()
    
    if len(ai_df) == 0:
        print("⚠️  No AI companies found in dataset")
        return None, None  # two values, so the caller's tuple unpacking still works
    
    print(f"\n🤖 AI INDUSTRY DEEP DIVE (FIXED)")
    print("="*60)
    print(f"📊 AI Companies Overview:")
    print(f"  Total AI Companies: {len(ai_df):,}")
    print(f"  Success Rate: {ai_df['is_successful'].mean()*100:.1f}%")
    print(f"  Active Companies: {(ai_df['status'] == 'Active').sum():,}")
    print(f"  Solo Founders: {ai_df['is_solo_founder'].sum():,} ({ai_df['is_solo_founder'].mean()*100:.1f}%)")
    print(f"  SF Bay Area: {ai_df['is_sf_bay'].sum():,} ({ai_df['is_sf_bay'].mean()*100:.1f}%)")
    print(f"  International: {(~ai_df['is_us']).sum():,} ({(~ai_df['is_us']).mean()*100:.1f}%)")
    
    # AI-specific tags analysis
    ai_tags = []
    for tags in ai_df['tags'].dropna():
        if isinstance(tags, list):
            ai_tags.extend(tags)
    
    ai_tag_counts = pd.Series(ai_tags).value_counts()
    print(f"\n🏷️  TOP AI-RELATED TAGS:")
    for i, (tag, count) in enumerate(ai_tag_counts.head(10).items(), 1):
        print(f"  {i:2d}. {tag}: {count} companies")
    
    # AI success by year
    ai_yearly = ai_df.groupby('batch_year').agg({
        'company_id': 'count',
        'is_successful': 'mean',
        'team_size': 'mean'
    }).round(3)
    ai_yearly.columns = ['AI_Companies', 'AI_Success_Rate', 'Avg_Team_Size']
    ai_yearly = ai_yearly.dropna()
    
    print(f"\n📈 AI COMPANIES BY YEAR:")
    print(ai_yearly.tail(10).to_string())
    
    # AI vs Non-AI comparison
    non_ai_df = df[~df['is_ai']].copy()
    
    comparison = pd.DataFrame({
        'Metric': [
            'Total Companies',
            'Success Rate (%)',
            'Avg Team Size',
            'Solo Founder %',
            'SF Bay Area %',
            'International %'
        ],
        'AI Companies': [
            len(ai_df),
            ai_df['is_successful'].mean() * 100,
            ai_df['team_size'].mean(),
            ai_df['is_solo_founder'].mean() * 100,
            ai_df['is_sf_bay'].mean() * 100,
            (~ai_df['is_us']).mean() * 100
        ],
        'Non-AI Companies': [
            len(non_ai_df),
            non_ai_df['is_successful'].mean() * 100,
            non_ai_df['team_size'].mean(),
            non_ai_df['is_solo_founder'].mean() * 100,
            non_ai_df['is_sf_bay'].mean() * 100,
            (~non_ai_df['is_us']).mean() * 100
        ]
    })
    
    comparison['Difference'] = comparison['AI Companies'] - comparison['Non-AI Companies']
    comparison = comparison.round(2)
    
    print(f"\n🔄 AI vs NON-AI COMPARISON:")
    print(comparison.to_string(index=False))
    
    # AI industry trends visualization with CORRECT subplot specs
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'AI Companies by Year',
            'AI Success Rate Trend',
            'AI vs Non-AI Success Rate',
            'AI Company Distribution'
        ),
        specs=[[{"type": "bar"}, {"type": "scatter"}],
               [{"type": "bar"}, {"type": "pie"}]]
    )
    
    # 1. AI Companies by Year
    fig.add_trace(
        go.Bar(x=ai_yearly.index, y=ai_yearly['AI_Companies'], 
               name='AI Companies', marker_color='purple'),
        row=1, col=1
    )
    
    # 2. AI Success Rate Trend
    fig.add_trace(
        go.Scatter(x=ai_yearly.index, y=ai_yearly['AI_Success_Rate']*100, 
                   mode='lines+markers', name='AI Success Rate %',
                   line=dict(color='green', width=3)),
        row=1, col=2
    )
    
    # 3. AI vs Non-AI Success Rate
    fig.add_trace(
        go.Bar(x=['AI Companies', 'Non-AI Companies'], 
               y=[ai_df['is_successful'].mean()*100, non_ai_df['is_successful'].mean()*100],
               name='Success Rate %', marker_color=['purple', 'lightblue']),
        row=2, col=1
    )
    
    # 4. AI Company Distribution (PIE CHART - now properly specified)
    ai_status = ai_df['status'].value_counts()
    fig.add_trace(
        go.Pie(labels=ai_status.index, values=ai_status.values,
               name='AI Company Status'),
        row=2, col=2
    )
    
    fig.update_layout(
        title='AI Industry Deep Dive - Comprehensive Analysis (FIXED)',
        showlegend=False,
        height=600
    )
    
    fig.show()
    
    return ai_df, comparison

# Run the FIXED AI industry analysis
print("🤖 AI Industry Deep Dive Analysis (FIXED)")
ai_results, ai_comparison = analyze_ai_industry()


🤖 AI Industry Deep Dive Analysis (FIXED)

🤖 AI INDUSTRY DEEP DIVE (FIXED)
📊 AI Companies Overview:
  Total AI Companies: 1,386
  Success Rate: 6.4%
  Active Companies: 1,157
  Solo Founders: 302 (21.8%)
  SF Bay Area: 714 (51.5%)
  International: 313 (22.6%)

🏷️  TOP AI-RELATED TAGS:
   1. ai: 685 companies
   2. b2b: 368 companies
   3. artificial-intelligence: 335 companies
   4. saas: 315 companies
   5. generative-ai: 268 companies
   6. machine-learning: 224 companies
   7. developer-tools: 175 companies
   8. ai-assistant: 140 companies
   9. fintech: 92 companies
  10. consumer: 82 companies

📈 AI COMPANIES BY YEAR:
            AI_Companies  AI_Success_Rate  Avg_Team_Size
batch_year                                              
2016                  31            0.258        199.233
2017                  38            0.079         76.842
2018                  51            0.118         62.804
2019                  47            0.128         34.936
2020                  87   
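
The figures above depend on how the `is_ai` flag matches tags. If it was built the same way as in the feature-engineering cell later in this notebook, a plain substring test for `'ai'` also fires on any tag that merely contains those letters, such as `retail` or `email`. A minimal sketch of exact-tag matching follows. The tag set `AI_TAGS` is an illustrative assumption drawn from the top-tags list above, not the notebook's actual definition, and both function names are hypothetical.

```python
# Hypothetical tag set, taken from the top-tags output; adjust as needed.
AI_TAGS = {"ai", "artificial-intelligence", "generative-ai",
           "machine-learning", "ai-assistant"}

def is_ai_substring(tags):
    """Substring test: True if any tag contains 'ai' or 'machine-learning'."""
    if not isinstance(tags, list):
        return False
    return any("ai" in str(t).lower() or "machine-learning" in str(t).lower()
               for t in tags)

def is_ai_exact(tags, ai_tags=AI_TAGS):
    """Exact test: True only if a whole tag is in the allowed set."""
    if not isinstance(tags, list):
        return False
    return any(str(t).lower() in ai_tags for t in tags)

examples = [["retail", "e-commerce"], ["generative-ai", "b2b"], ["email"], None]
for tags in examples:
    print(tags, is_ai_substring(tags), is_ai_exact(tags))
```

Comparing the two flags on the real `tags` column would show how many companies are affected by the looser match.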

## 2.10. Geographic Intelligence


In [68]:
# FIXED Geographic Dashboard - Corrected subplot specifications
def create_geographic_dashboard():
    """Create comprehensive geographic dashboard with proper subplot specs"""
    geo_stats = analyze_geographic_success()
    city_stats = analyze_city_success()
    
    # Create geographic success map with CORRECT subplot specs
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'Success Rate by Country',
            'Company Count by Country', 
            'Top Cities by Company Count',
            'Geographic Distribution'
        ),
        specs=[[{"type": "bar"}, {"type": "bar"}],
               [{"type": "bar"}, {"type": "pie"}]]
    )
    
    # 1. Success rate by country (top 10)
    top_countries = geo_stats.head(10)
    fig.add_trace(
        go.Bar(x=top_countries.index, y=top_countries['Success_Rate']*100,
               name='Success Rate %', marker_color='#00CC66'),
        row=1, col=1
    )
    
    # 2. Company count by country (top 10)
    fig.add_trace(
        go.Bar(x=top_countries.index, y=top_countries['Total'],
               name='Company Count', marker_color='#0099FF'),
        row=1, col=2
    )
    
    # 3. Top cities
    top_cities = city_stats.head(10)
    fig.add_trace(
        go.Bar(y=top_cities.index, x=top_cities['Total'],
               orientation='h', name='Cities', marker_color='#FF6600'),
        row=2, col=1
    )
    
    # 4. Geographic distribution pie chart (PIE CHART - now properly specified)
    geo_dist = df['country'].value_counts().head(10)
    fig.add_trace(
        go.Pie(labels=geo_dist.index, values=geo_dist.values,
               name='Geographic Distribution'),
        row=2, col=2
    )
    
    fig.update_layout(height=800, showlegend=False, title_text="Geographic Intelligence Dashboard (FIXED)")
    fig.show()
    
    return geo_stats, city_stats

# Run the FIXED geographic analysis
print("🌍 Geographic Intelligence Analysis (FIXED)")
geo_results, city_results = create_geographic_dashboard()


🌍 Geographic Intelligence Analysis (FIXED)

🌍 SUCCESS RATE BY COUNTRY (min 10 companies):
         Total  Successful  Success_Rate  Avg_Team_Size  Solo_Pct
country                                                          
DK          14           3         0.214         13.714     0.143
ES          16           3         0.188         16.812     0.188
NL          11           2         0.182         56.364     0.182
CA         138          23         0.167         26.650     0.275
US        3725         574         0.154         47.367     0.282
IL          28           4         0.143         37.074     0.143
FR          56           8         0.143         13.571     0.196
SG          51           6         0.118         36.059     0.157
BR          49           5         0.102         66.104     0.306
CH          10           1         0.100          7.100     0.100
PK          10           1         0.100         49.400     0.300
IN         197          19         0.096        118.
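
Several countries in this table have very few companies (for example, DK with 3 successes out of 14), so their success rates carry wide uncertainty. As a rough illustration, and not part of the original analysis, a Wilson score interval can be attached to each rate. The helper name `wilson_interval` is made up for this sketch.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

# Counts copied from the country table above.
for country, successes, total in [("DK", 3, 14), ("CA", 23, 138), ("US", 574, 3725)]:
    low, high = wilson_interval(successes, total)
    print(f"{country}: {successes / total:.3f} (interval {low:.3f} to {high:.3f})")
```

A ranking by raw success rate puts the smallest samples at the top and bottom almost by construction, so sorting by the lower bound of the interval is one possible alternative.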

## 2.11. Predictive Analysis


In [69]:
# Enhanced Predictive Analysis with Improved Feature Engineering
def create_enhanced_features(df):
    """Create advanced features for better predictive modeling"""
    print("🔧 Creating Enhanced Features for Predictive Modeling...")
    
    # Start with a copy of the dataframe
    enhanced_df = df.copy()
    
    # 1. TEMPORAL FEATURES
    print("📅 Creating temporal features...")
    
    # Batch year features
    enhanced_df['batch_year_numeric'] = pd.to_numeric(enhanced_df['batch_year'], errors='coerce')
    enhanced_df['years_since_batch'] = 2025 - enhanced_df['batch_year_numeric']
    enhanced_df['is_recent_batch'] = enhanced_df['batch_year_numeric'] >= 2020
    enhanced_df['is_early_batch'] = enhanced_df['batch_year_numeric'] <= 2010
    
    # Company age features
    enhanced_df['company_age_squared'] = enhanced_df['company_age'] ** 2
    enhanced_df['company_age_log'] = np.log1p(enhanced_df['company_age'])
    enhanced_df['is_young_company'] = enhanced_df['company_age'] <= 2
    enhanced_df['is_old_company'] = enhanced_df['company_age'] >= 10
    
    # 2. TEAM COMPOSITION FEATURES
    print("👥 Creating team composition features...")
    
    # Founder team features
    enhanced_df['is_solo_founder'] = enhanced_df['num_founders'] == 1
    enhanced_df['is_team_founded'] = enhanced_df['num_founders'] > 1
    
    # Safe division to avoid infinity
    enhanced_df['founder_team_size_ratio'] = np.where(
        enhanced_df['num_founders'] > 0,
        enhanced_df['team_size'] / enhanced_df['num_founders'],
        1.0  # Default ratio when num_founders is 0
    )
    
    enhanced_df['has_grown_team'] = enhanced_df['team_size'] > enhanced_df['num_founders']
    
    # Safe division for team growth rate
    enhanced_df['team_growth_rate'] = np.where(
        enhanced_df['num_founders'] > 0,
        (enhanced_df['team_size'] - enhanced_df['num_founders']) / enhanced_df['num_founders'],
        0.0  # No growth when num_founders is 0
    )
    
    # Team size features
    enhanced_df['team_size_log'] = np.log1p(enhanced_df['team_size'])
    enhanced_df['is_small_team'] = enhanced_df['team_size'] <= 3
    enhanced_df['is_large_team'] = enhanced_df['team_size'] >= 20
    
    # 3. GEOGRAPHIC FEATURES
    print("🌍 Creating geographic features...")
    
    # Location hierarchy
    enhanced_df['is_sf_bay'] = enhanced_df['location'].str.contains('San Francisco|Palo Alto|Mountain View|Menlo Park|Redwood City', case=False, na=False)
    enhanced_df['is_nyc'] = enhanced_df['location'].str.contains('New York|Brooklyn|Manhattan', case=False, na=False)
    enhanced_df['is_london'] = enhanced_df['location'].str.contains('London', case=False, na=False)
    enhanced_df['is_tel_aviv'] = enhanced_df['location'].str.contains('Tel Aviv', case=False, na=False)
    
    # Geographic clustering
    enhanced_df['is_tier1_city'] = enhanced_df['is_sf_bay'] | enhanced_df['is_nyc'] | enhanced_df['is_london']
    enhanced_df['is_international'] = ~enhanced_df['is_us']
    
    # 4. INDUSTRY FEATURES
    print("🏭 Creating industry features...")
    
    # Industry diversity
    enhanced_df['num_tags'] = enhanced_df['tags'].apply(lambda x: len(x) if isinstance(x, list) else 0)
    enhanced_df['industry_diversity'] = enhanced_df['num_tags'] / 10  # Normalize
    enhanced_df['is_high_diversity'] = enhanced_df['num_tags'] >= 5
    enhanced_df['is_low_diversity'] = enhanced_df['num_tags'] <= 2
    
    # Specific industry features
    enhanced_df['is_ai'] = enhanced_df['tags'].apply(lambda x: any('ai' in str(tag).lower() or 'machine-learning' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    enhanced_df['is_b2b'] = enhanced_df['tags'].apply(lambda x: any('b2b' in str(tag).lower() or 'saas' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    enhanced_df['is_fintech'] = enhanced_df['tags'].apply(lambda x: any('fintech' in str(tag).lower() or 'financial' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    enhanced_df['is_healthcare'] = enhanced_df['tags'].apply(lambda x: any('health' in str(tag).lower() or 'medical' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    enhanced_df['is_edtech'] = enhanced_df['tags'].apply(lambda x: any('education' in str(tag).lower() or 'edtech' in str(tag).lower() for tag in x) if isinstance(x, list) else False)
    
    # Industry combination features
    enhanced_df['is_ai_b2b'] = enhanced_df['is_ai'] & enhanced_df['is_b2b']
    enhanced_df['is_ai_fintech'] = enhanced_df['is_ai'] & enhanced_df['is_fintech']
    
    # 5. TEXT FEATURES
    print("📝 Creating text features...")
    
    # Description length features
    enhanced_df['description_length'] = enhanced_df['short_description'].str.len().fillna(0)
    enhanced_df['description_word_count'] = enhanced_df['short_description'].str.split().str.len().fillna(0)
    
    # Safe division for average word length
    enhanced_df['avg_word_length'] = np.where(
        enhanced_df['description_word_count'] > 0,
        enhanced_df['description_length'] / enhanced_df['description_word_count'],
        0.0  # Default when no words
    )
    
    # Description sentiment (simplified)
    enhanced_df['has_positive_words'] = enhanced_df['short_description'].str.contains('innovative|revolutionary|breakthrough|leading|advanced', case=False, na=False)
    enhanced_df['has_tech_words'] = enhanced_df['short_description'].str.contains('platform|software|technology|digital|data', case=False, na=False)
    
    # 6. INTERACTION FEATURES
    print("🔗 Creating interaction features...")
    
    # Geographic + Industry interactions
    enhanced_df['sf_bay_ai'] = enhanced_df['is_sf_bay'] & enhanced_df['is_ai']
    enhanced_df['international_b2b'] = enhanced_df['is_international'] & enhanced_df['is_b2b']
    
    # Team + Industry interactions
    enhanced_df['solo_ai'] = enhanced_df['is_solo_founder'] & enhanced_df['is_ai']
    enhanced_df['team_b2b'] = enhanced_df['is_team_founded'] & enhanced_df['is_b2b']
    
    # Age + Industry interactions
    enhanced_df['young_ai'] = enhanced_df['is_young_company'] & enhanced_df['is_ai']
    enhanced_df['old_b2b'] = enhanced_df['is_old_company'] & enhanced_df['is_b2b']
    
    # 7. STATISTICAL FEATURES
    print("📊 Creating statistical features...")
    
    # Z-scores for continuous variables
    for col in ['company_age', 'team_size', 'num_tags', 'description_length']:
        if col in enhanced_df.columns:
            enhanced_df[f'{col}_zscore'] = (enhanced_df[col] - enhanced_df[col].mean()) / enhanced_df[col].std()
    
    # Percentile rankings
    for col in ['company_age', 'team_size', 'num_tags']:
        if col in enhanced_df.columns:
            enhanced_df[f'{col}_percentile'] = enhanced_df[col].rank(pct=True)
    
    # Final data validation and cleaning
    print("🧹 Final data validation...")
    
    # Replace any remaining infinity values
    numeric_cols = enhanced_df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if enhanced_df[col].isin([np.inf, -np.inf]).any():
            print(f"  ⚠️  Found infinity in {col}, replacing with median...")
            median_val = enhanced_df[col].replace([np.inf, -np.inf], np.nan).median()
            enhanced_df[col] = enhanced_df[col].replace([np.inf, -np.inf], median_val)
    
    # Cap extremely large values
    for col in numeric_cols:
        if enhanced_df[col].abs().max() > 1e10:
            print(f"  ⚠️  Found extremely large values in {col}, capping...")
            enhanced_df[col] = enhanced_df[col].clip(-1e10, 1e10)
    
    print(f"✅ Enhanced features created! Total features: {len(enhanced_df.columns)}")
    print(f"📊 Data shape: {enhanced_df.shape}")
    print(f"🔍 Infinity values: {enhanced_df.isin([np.inf, -np.inf]).sum().sum()}")
    print(f"🔍 NaN values: {enhanced_df.isnull().sum().sum()}")
    
    return enhanced_df

# Create enhanced features
enhanced_df = create_enhanced_features(df)


🔧 Creating Enhanced Features for Predictive Modeling...
📅 Creating temporal features...
👥 Creating team composition features...
🌍 Creating geographic features...
🏭 Creating industry features...
📝 Creating text features...
🔗 Creating interaction features...
📊 Creating statistical features...
🧹 Final data validation...
✅ Enhanced features created! Total features: 78
📊 Data shape: (5463, 78)
🔍 Infinity values: 0
🔍 NaN values: 8294
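
The z-score and percentile features above are computed over the full dataset, so any later train/test split sees statistics that include held-out rows. One common way to avoid this, shown here as a sketch on synthetic data rather than the notebook's actual pipeline, is to let scikit-learn fit the scaler inside each training fold via a `Pipeline`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the real feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
X[:, 0] *= 100.0  # one feature on a much larger scale, like team_size
y = (X[:, 0] / 100.0 + X[:, 1] + rng.normal(size=400) > 1.5).astype(int)

# The scaler is re-fit on the training part of every fold, so held-out rows
# never contribute to the mean and standard deviation used for scaling.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same pattern would also give the logistic regression in the modeling cell standardized inputs, since that cell fits it on unscaled features.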


In [70]:
# Enhanced Predictive Model with Advanced Feature Engineering
def build_enhanced_prediction_model(enhanced_df):
    """Build advanced ML model with enhanced feature engineering"""
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    import numpy as np
    
    print("🤖 Building Enhanced Predictive Model with Advanced Features...")
    
    # Use mature companies only
    mature_df = enhanced_df[enhanced_df['is_mature']].copy()
    
    # Select features for modeling (exclude target and non-predictive columns)
    exclude_cols = [
        'is_successful', 'company_id', 'company_name', 'short_description', 
        'tags', 'location', 'country', 'status', 'batch', 'year_founded',
        'batch_year', 'batch_season', 'is_mature', 'is_active'
    ]
    
    # Get all numeric and boolean columns
    feature_cols = [col for col in mature_df.columns 
                   if col not in exclude_cols 
                   and mature_df[col].dtype in ['int64', 'float64', 'bool']]
    
    print(f"📊 Selected {len(feature_cols)} features for modeling")
    
    # Prepare data
    model_df = mature_df[feature_cols + ['is_successful']].dropna()
    
    if len(model_df) == 0:
        print("⚠️  No data available after feature selection")
        return None
    
    X = model_df.drop('is_successful', axis=1)
    y = model_df['is_successful']
    
    print(f"📊 Dataset Info:")
    print(f"  Total samples: {len(y):,}")
    print(f"  Successful: {y.sum():,} ({y.mean()*100:.1f}%)")
    print(f"  Features: {X.shape[1]}")
    print(f"  ⚠️  Class imbalance: {(~y).sum() / y.sum():.1f}:1 ratio")
    
    # Data cleaning and validation
    print("🧹 Cleaning data for modeling...")
    
    # Report infinity values, then convert them to NaN so the median fill
    # below handles them together with other missing values
    inf_cols = X.columns[X.isin([np.inf, -np.inf]).any()]
    if len(inf_cols) > 0:
        print(f"  ⚠️  Found infinity values in columns: {list(inf_cols)}")
    X = X.replace([np.inf, -np.inf], np.nan)
    
    # Check for extremely large values
    large_value_threshold = 1e10
    large_cols = X.columns[(X.abs() > large_value_threshold).any()]
    if len(large_cols) > 0:
        print(f"  ⚠️  Found extremely large values in columns: {list(large_cols)}")
        # Cap values at reasonable thresholds
        for col in large_cols:
            if X[col].dtype in ['float64', 'int64']:
                X[col] = X[col].clip(-large_value_threshold, large_value_threshold)
    
    # Handle remaining NaN values
    nan_before = X.isnull().sum().sum()
    if nan_before > 0:
        print(f"  🔧 Handling {nan_before} NaN values...")
        # Fill with median for numeric columns
        for col in X.columns:
            if X[col].dtype in ['float64', 'int64']:
                X[col] = X[col].fillna(X[col].median())
            else:
                X[col] = X[col].fillna(X[col].mode()[0] if len(X[col].mode()) > 0 else 0)
    
    # Final validation
    if X.isnull().any().any():
        print("  ⚠️  Still have NaN values, dropping rows...")
        valid_idx = ~X.isnull().any(axis=1)
        X = X[valid_idx]
        y = y[valid_idx]
    
    if len(X) == 0:
        print("⚠️  No valid data after cleaning")
        return None
    
    print(f"  ✅ Data cleaned: {len(X)} samples, {X.shape[1]} features")
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Standardize features (zero mean, unit variance) with StandardScaler
    print("📏 Scaling features...")
    scaler = StandardScaler()
    
    # Additional validation before scaling
    if np.isinf(X_train).any().any() or np.isnan(X_train).any().any():
        print("  ⚠️  Found invalid values in training data, applying additional cleaning...")
        X_train = X_train.replace([np.inf, -np.inf], np.nan)
        X_train = X_train.fillna(X_train.median())
        X_test = X_test.replace([np.inf, -np.inf], np.nan)
        X_test = X_test.fillna(X_train.median())
    
    try:
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        print("  ✅ Features scaled successfully")
    except Exception as e:
        print(f"  ❌ Scaling failed: {str(e)}")
        print("  🔧 Using original features without scaling...")
        X_train_scaled = X_train.values
        X_test_scaled = X_test.values
    
    # Train multiple models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(class_weight='balanced', probability=True, random_state=42)
    }
    
    results = {}
    
    print(f"\n🤖 Training Multiple Models...")
    for name, model in models.items():
        print(f"  Training {name}...")
        
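        # Train model. Only the SVM is fit on the standardized features;
        # the other models are fit on the unscaled values.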
        if name == 'SVM':
            model.fit(X_train_scaled, y_train)
            y_pred = model.predict(X_test_scaled)
            y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
        else:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # Calculate metrics
        train_score = model.score(X_train_scaled if name == 'SVM' else X_train, y_train)
        test_score = model.score(X_test_scaled if name == 'SVM' else X_test, y_test)
        roc_auc = roc_auc_score(y_test, y_pred_proba)
        
        # Cross-validation
        cv_scores = cross_val_score(model, X_train_scaled if name == 'SVM' else X_train, y_train, cv=5, scoring='roc_auc')
        
        results[name] = {
            'model': model,
            'train_score': train_score,
            'test_score': test_score,
            'roc_auc': roc_auc,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'predictions': y_pred,
            'probabilities': y_pred_proba
        }
        
        print(f"    ROC-AUC: {roc_auc:.3f}, CV: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")
    
    # Find best model
    best_model_name = max(results.keys(), key=lambda x: results[x]['roc_auc'])
    best_model = results[best_model_name]
    
    print(f"\n🏆 Best Model: {best_model_name}")
    print(f"  ROC-AUC: {best_model['roc_auc']:.3f}")
    print(f"  Cross-Val: {best_model['cv_mean']:.3f} (±{best_model['cv_std']:.3f})")
    
    # Feature importance (for tree-based models)
    if hasattr(best_model['model'], 'feature_importances_'):
        feature_importance = pd.DataFrame({
            'Feature': X.columns,
            'Importance': best_model['model'].feature_importances_
        }).sort_values('Importance', ascending=False)
        
        print(f"\n📈 Top 15 Feature Importances:")
        print(feature_importance.head(15).to_string(index=False))
        
        # Visualize feature importance
        fig = go.Figure(go.Bar(
            y=feature_importance.head(15)['Feature'],
            x=feature_importance.head(15)['Importance'],
            orientation='h',
            marker_color='lightblue'
        ))
        
        fig.update_layout(
            title=f'Top 15 Feature Importances - {best_model_name}',
            xaxis_title='Feature Importance',
            height=600
        )
        fig.show()
    
    # Detailed classification report
    print(f"\n📊 Detailed Classification Report ({best_model_name}):")
    print(classification_report(y_test, best_model['predictions'], 
                              target_names=['Not Successful', 'Successful']))
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, best_model['predictions'])
    print(f"\n📊 Confusion Matrix:")
    print(f"  True Negatives: {cm[0,0]}")
    print(f"  False Positives: {cm[0,1]}")
    print(f"  False Negatives: {cm[1,0]}")
    print(f"  True Positives: {cm[1,1]}")
    
    return {
        'best_model': best_model['model'],
        'best_model_name': best_model_name,
        'results': results,
        'feature_importance': feature_importance if 'feature_importance' in locals() else None,
        'metrics': {
            'roc_auc': best_model['roc_auc'],
            'cv_mean': best_model['cv_mean'],
            'cv_std': best_model['cv_std']
        }
    }

# Build enhanced predictive model
enhanced_model_results = build_enhanced_prediction_model(enhanced_df)


🤖 Building Enhanced Predictive Model with Advanced Features...
📊 Selected 57 features for modeling
📊 Dataset Info:
  Total samples: 2,998
  Successful: 408 (13.6%)
  Features: 57
  ⚠️  Class imbalance: 6.3:1 ratio
🧹 Cleaning data for modeling...
  ✅ Data cleaned: 2998 samples, 57 features
📏 Scaling features...
  ✅ Features scaled successfully

🤖 Training Multiple Models...
  Training Logistic Regression...
    ROC-AUC: 0.743, CV: 0.730 (±0.032)
  Training Random Forest...
    ROC-AUC: 0.722, CV: 0.706 (±0.031)
  Training Gradient Boosting...
    ROC-AUC: 0.729, CV: 0.716 (±0.030)
  Training SVM...
    ROC-AUC: 0.723, CV: 0.705 (±0.030)

🏆 Best Model: Logistic Regression
  ROC-AUC: 0.743
  Cross-Val: 0.730 (±0.032)

📊 Detailed Classification Report (Logistic Regression):
                precision    recall  f1-score   support

Not Successful       0.93      0.70      0.80       518
    Successful       0.26      0.67      0.38        82

      accuracy                           0.69    
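
The report above shows why accuracy alone is a weak summary on imbalanced data: recall on the minority class is reasonable while its precision is low. As a minimal sketch, the positive-class metrics can be derived from the four confusion-matrix cells that the function prints. The counts below are illustrative placeholders, not the notebook's actual confusion matrix, and `metrics_from_confusion` is a hypothetical helper name.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive precision, recall, F1 (positive class) and overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Illustrative counts only (assumed, not taken from the run above)
example = metrics_from_confusion(tn=350, fp=150, fn=30, tp=70)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With many false positives relative to true positives, precision drops even when recall and accuracy look acceptable, which matches the pattern in the report.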

In [71]:
# Feature Analysis and Insights
def analyze_feature_importance(enhanced_df, model_results):
    """Analyze feature importance and provide insights"""
    print("🔍 Analyzing Feature Importance and Insights...")
    
    if model_results is None or model_results['feature_importance'] is None:
        print("⚠️  No feature importance data available")
        return
    
    feature_importance = model_results['feature_importance']
    
    # Categorize features
    temporal_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['batch_year', 'company_age', 'years_since', 'is_recent', 'is_early', 'is_young', 'is_old'])]
    team_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['founder', 'team', 'solo', 'num_founders', 'team_size', 'growth'])]
    geographic_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['sf_bay', 'nyc', 'london', 'tier1', 'international', 'is_us'])]
    industry_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['is_ai', 'is_b2b', 'is_fintech', 'is_healthcare', 'is_edtech', 'diversity', 'tags'])]
    text_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['description', 'length', 'word', 'positive', 'tech'])]
    interaction_features = [f for f in feature_importance['Feature'] if any(x in f for x in ['_ai', '_b2b', '_fintech', 'sf_bay_', 'international_', 'solo_', 'team_', 'young_', 'old_'])]
    
    print(f"\n📊 Feature Categories Analysis:")
    print(f"  🕒 Temporal Features: {len(temporal_features)}")
    print(f"  👥 Team Features: {len(team_features)}")
    print(f"  🌍 Geographic Features: {len(geographic_features)}")
    print(f"  🏭 Industry Features: {len(industry_features)}")
    print(f"  📝 Text Features: {len(text_features)}")
    print(f"  🔗 Interaction Features: {len(interaction_features)}")
    
    # Top features by category
    print(f"\n🏆 Top Features by Category:")
    
    for category, features in [
        ("Temporal", temporal_features),
        ("Team", team_features),
        ("Geographic", geographic_features),
        ("Industry", industry_features),
        ("Text", text_features),
        ("Interaction", interaction_features)
    ]:
        if features:
            cat_importance = feature_importance[feature_importance['Feature'].isin(features)].head(3)
            print(f"\n  {category}:")
            for _, row in cat_importance.iterrows():
                print(f"    • {row['Feature']}: {row['Importance']:.4f}")
    
    # Key insights
    print(f"\n💡 Key Insights:")
    
    # Most important feature
    top_feature = feature_importance.iloc[0]
    print(f"  🥇 Most Important: {top_feature['Feature']} ({top_feature['Importance']:.4f})")
    
    # Feature distribution
    high_importance = feature_importance[feature_importance['Importance'] > 0.01]
    medium_importance = feature_importance[(feature_importance['Importance'] > 0.005) & (feature_importance['Importance'] <= 0.01)]
    low_importance = feature_importance[feature_importance['Importance'] <= 0.005]
    
    print(f"  📈 High Importance (>0.01): {len(high_importance)} features")
    print(f"  📊 Medium Importance (0.005-0.01): {len(medium_importance)} features")
    print(f"  📉 Low Importance (<0.005): {len(low_importance)} features")
    
    # Model performance insights
    metrics = model_results['metrics']
    print(f"\n🤖 Model Performance:")
    print(f"  ROC-AUC: {metrics['roc_auc']:.3f}")
    print(f"  Cross-Validation: {metrics['cv_mean']:.3f} (±{metrics['cv_std']:.3f})")
    
    # Performance interpretation
    if metrics['roc_auc'] > 0.8:
        performance = "Excellent"
    elif metrics['roc_auc'] > 0.7:
        performance = "Good"
    elif metrics['roc_auc'] > 0.6:
        performance = "Fair"
    else:
        performance = "Poor"
    
    print(f"  📊 Performance Level: {performance}")
    
    return {
        'temporal_features': temporal_features,
        'team_features': team_features,
        'geographic_features': geographic_features,
        'industry_features': industry_features,
        'text_features': text_features,
        'interaction_features': interaction_features,
        'performance_level': performance
    }

# Analyze feature importance
if enhanced_model_results:
    feature_analysis = analyze_feature_importance(enhanced_df, enhanced_model_results)
else:
    print("⚠️  Enhanced model results not available")


🔍 Analyzing Feature Importance and Insights...
⚠️  No feature importance data available
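
The message above appears because the selected model, Logistic Regression, exposes `coef_` rather than `feature_importances_`. One possible fallback, sketched here under assumptions, is to use absolute coefficient size as a rough importance proxy. `extract_feature_importance` is a hypothetical helper, the feature names are illustrative, and the stand-in object only mimics the `coef_` attribute of a fitted binary classifier. Coefficient magnitudes are comparable only when the features were standardized before fitting.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

def extract_feature_importance(model, feature_names):
    """Return a Feature/Importance table for tree-based or linear models, else None."""
    if hasattr(model, 'feature_importances_'):
        importance = np.asarray(model.feature_importances_, dtype=float)
    elif hasattr(model, 'coef_'):
        # Linear models (binary case): absolute coefficient as a rough proxy.
        # Only meaningful when the inputs were standardized before fitting.
        importance = np.abs(np.asarray(model.coef_, dtype=float)).ravel()
    else:
        return None
    return (pd.DataFrame({'Feature': list(feature_names), 'Importance': importance})
            .sort_values('Importance', ascending=False)
            .reset_index(drop=True))

# Stand-in for a fitted binary LogisticRegression: coef_ has shape (1, n_features)
fake_linear_model = SimpleNamespace(coef_=np.array([[0.2, -1.5, 0.7]]))
table = extract_feature_importance(
    fake_linear_model, ['team_size', 'is_solo_founder', 'is_b2b']
)
print(table)
```

Note that the fixed thresholds in `analyze_feature_importance` (0.01 and 0.005) suit tree importances that sum to 1; coefficient magnitudes would need normalizing, or different cut-offs, before the same buckets make sense.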


In [72]:
# Updated Original Predictive Model to Use Enhanced Features
def build_success_prediction_model():
    """Build ML model to predict startup success using enhanced features"""
    # Use the enhanced dataframe with advanced features
    print("🤖 Building Success Prediction Model with Enhanced Features...")
    
    # Use mature companies only
    mature_df = enhanced_df[enhanced_df['is_mature']].copy()
    
    # Select enhanced features for modeling (exclude target and non-predictive columns)
    exclude_cols = [
        'is_successful', 'company_id', 'company_name', 'short_description', 
        'tags', 'location', 'country', 'status', 'batch', 'year_founded',
        'batch_year', 'batch_season', 'is_mature', 'is_active'
    ]
    
    # Get all numeric and boolean columns from enhanced features
    feature_cols = [col for col in mature_df.columns 
                   if col not in exclude_cols 
                   and mature_df[col].dtype in ['int64', 'float64', 'bool']]
    
    print(f"📊 Using {len(feature_cols)} enhanced features for modeling")
    
    # Prepare data
    model_df = mature_df[feature_cols + ['is_successful']].dropna()
    
    if len(model_df) == 0:
        print("⚠️  No data available after feature selection")
        return None
    
    X = model_df.drop('is_successful', axis=1)
    y = model_df['is_successful']
    
    print(f"📊 Dataset Info:")
    print(f"  Total samples: {len(y):,}")
    print(f"  Successful: {y.sum():,} ({y.mean()*100:.1f}%)")
    print(f"  Features: {X.shape[1]}")
    print(f"  ⚠️  Class imbalance: {(~y).sum() / y.sum():.1f}:1 ratio")
    
    # Data cleaning and validation
    print("🧹 Cleaning data for modeling...")
    
    # Report columns that contain infinities, then convert them to NaN so the
    # median imputation below handles them together with other missing values
    inf_cols = X.columns[X.isin([np.inf, -np.inf]).any()]
    if len(inf_cols) > 0:
        print(f"  ⚠️  Found infinity values in columns: {list(inf_cols)}")
    X = X.replace([np.inf, -np.inf], np.nan)
    
    # Check for extremely large values
    large_value_threshold = 1e10
    large_cols = X.columns[(X.abs() > large_value_threshold).any()]
    if len(large_cols) > 0:
        print(f"  ⚠️  Found extremely large values in columns: {list(large_cols)}")
        # Cap values at reasonable thresholds
        for col in large_cols:
            if X[col].dtype in ['float64', 'int64']:
                X[col] = X[col].clip(-large_value_threshold, large_value_threshold)
    
    # Handle remaining NaN values
    nan_before = X.isnull().sum().sum()
    if nan_before > 0:
        print(f"  🔧 Handling {nan_before} NaN values...")
        # Fill with median for numeric columns
        for col in X.columns:
            if X[col].dtype in ['float64', 'int64']:
                X[col] = X[col].fillna(X[col].median())
            else:
                X[col] = X[col].fillna(X[col].mode()[0] if len(X[col].mode()) > 0 else 0)
    
    # Final validation
    if X.isnull().any().any():
        print("  ⚠️  Still have NaN values, dropping rows...")
        valid_idx = ~X.isnull().any(axis=1)
        X = X[valid_idx]
        y = y[valid_idx]
    
    if len(X) == 0:
        print("⚠️  No valid data after cleaning")
        return None
    
    print(f"  ✅ Data cleaned: {len(X)} samples, {X.shape[1]} features")
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Standardize features (zero mean, unit variance)
    print("📏 Scaling features...")
    scaler = StandardScaler()
    
    # Additional validation before scaling
    if np.isinf(X_train).any().any() or np.isnan(X_train).any().any():
        print("  ⚠️  Found invalid values in training data, applying additional cleaning...")
        X_train = X_train.replace([np.inf, -np.inf], np.nan)
        X_train = X_train.fillna(X_train.median())
        X_test = X_test.replace([np.inf, -np.inf], np.nan)
        X_test = X_test.fillna(X_train.median())
    
    try:
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        print("  ✅ Features scaled successfully")
    except Exception as e:
        print(f"  ❌ Scaling failed: {str(e)}")
        print("  🔧 Using original features without scaling...")
        X_train_scaled = X_train.values
        X_test_scaled = X_test.values
    
    # Train model with class weights to handle imbalance
    model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
    model.fit(X_train_scaled, y_train)
    
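    # Feature importance: rank by absolute coefficient so that strong negative
    # associations surface alongside positive ones
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': model.coef_[0]
    }).sort_values('Coefficient', key=lambda s: s.abs(), ascending=False)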
    
    # Predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Comprehensive metrics
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Cross-validation for robustness
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
    
    print(f"🤖 ENHANCED PREDICTIVE MODEL RESULTS:")
    print(f"  Training Accuracy: {train_score*100:.2f}%")
    print(f"  Test Accuracy: {test_score*100:.2f}%")
    print(f"  ROC-AUC Score: {roc_auc:.3f}")
    print(f"  Cross-Val ROC-AUC: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")
    print(f"\n⚠️  Note: Accuracy is misleading with class imbalance. ROC-AUC is better metric.\n")
    
    print(f"📊 Detailed Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['Not Successful', 'Successful']))
    
    print(f"\n📈 Top 20 Feature Importance (Logistic Regression Coefficients):")
    print(f"⚠️  Note: These show predictive association, NOT causation.\n")
    print(feature_importance.head(20).to_string(index=False))
    
    # Visualize feature importance
    fig = go.Figure(go.Bar(
        y=feature_importance.head(20)['Feature'],
        x=feature_importance.head(20)['Coefficient'],
        orientation='h',
        marker_color=['#00CC66' if x > 0 else '#FF3333' for x in feature_importance.head(20)['Coefficient']]
    ))
    
    fig.update_layout(
        title='Enhanced Feature Importance: What Predicts Success?<br><sub>Higher coefficient = stronger positive association with success</sub>',
        xaxis_title='Coefficient (Impact on Success Probability)',
        height=600
    )
    fig.show()
    
    print("\n⚠️  IMPORTANT CAVEATS:")
    print("  1. This model shows ASSOCIATION, not CAUSATION")
    print("  2. Survivorship bias partially mitigated by using mature companies only")
    print("  3. Many unmeasured factors influence success (market timing, execution, luck)")
    print("  4. Past performance ≠ future results")
    print("  5. Enhanced features provide better predictive power than basic features")
    
    return model, feature_importance, {
        'train_score': train_score,
        'test_score': test_score,
        'roc_auc': roc_auc,
        'cv_scores': cv_scores,
        'enhanced_features': True,
        'num_features': len(feature_cols)
    }

# Build enhanced predictive model
print("🤖 Building Enhanced Success Prediction Model")
model, feature_importance, model_metrics = build_success_prediction_model()


🤖 Building Enhanced Success Prediction Model
🤖 Building Success Prediction Model with Enhanced Features...
📊 Using 57 enhanced features for modeling
📊 Dataset Info:
  Total samples: 2,998
  Successful: 408 (13.6%)
  Features: 57
  ⚠️  Class imbalance: 6.3:1 ratio
🧹 Cleaning data for modeling...
  ✅ Data cleaned: 2998 samples, 57 features
📏 Scaling features...
  ✅ Features scaled successfully
🤖 ENHANCED PREDICTIVE MODEL RESULTS:
  Training Accuracy: 69.93%
  Test Accuracy: 68.83%
  ROC-AUC Score: 0.735
  Cross-Val ROC-AUC: 0.720 (±0.033)

⚠️  Note: Accuracy is misleading with class imbalance. ROC-AUC is better metric.

📊 Detailed Classification Report:
                precision    recall  f1-score   support

Not Successful       0.93      0.69      0.79       518
    Successful       0.25      0.66      0.37        82

      accuracy                           0.69       600
     macro avg       0.59      0.68      0.58       600
  weighted avg       0.84      0.69      0.73       600





⚠️  IMPORTANT CAVEATS:
  1. This model shows ASSOCIATION, not CAUSATION
  2. Survivorship bias partially mitigated by using mature companies only
  3. Many unmeasured factors influence success (market timing, execution, luck)
  4. Past performance ≠ future results
  5. Enhanced features provide better predictive power than basic features
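
In the cell above, the scaler is fitted on the full training set before `cross_val_score` runs, so each validation fold has already influenced the scaling statistics. The effect is usually small, but wrapping the scaler and the classifier in a single scikit-learn pipeline avoids it. The following is a minimal sketch on synthetic data (an assumption standing in for the notebook's feature matrix), not a re-run of the model above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumed: roughly 14% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.86, 0.14], random_state=42
)

# The scaler is part of the estimator, so every CV fold fits it on its own training split
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42),
)
cv_auc = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='roc_auc')
print(f"Cross-val ROC-AUC: {cv_auc.mean():.3f} (±{cv_auc.std():.3f})")
```

The same `pipeline` object can then be fitted on the training split and scored on the test split without a separate `scaler.transform` step.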


## 2.12. Actionable Recommendations


In [73]:
# Actionable Recommendations - Data-driven insights for founders
def get_personalized_recommendations(founder_type='solo', industry=None, location='San Francisco'):
    """
    Generate personalized recommendations based on founder profile
    """
    print("\n" + "═" * 80)
    print(f"  PERSONALIZED RECOMMENDATIONS")
    print(f"  Profile: {founder_type.title()} Founder | Industry: {industry or 'General'} | Location: {location}")
    print("═" * 80)
    
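    # Filter similar companies.
    # Note: `location` is only echoed in the header above; it does not filter the cohort.
    if founder_type == 'solo':
        similar = df[df['is_solo_founder']]
    else:
        similar = df[df['is_team_founded']]
    
    if industry:
        # Match whole tags case-insensitively; a substring test would let
        # 'ai' match unrelated tags such as 'SaaS'
        similar = similar[similar['tags'].apply(
            lambda x: industry.lower() in [str(t).lower() for t in x] if isinstance(x, list) else False
        )]
    
    # Nothing to summarize for an empty cohort (avoids NaN metrics and division by zero below)
    if len(similar) == 0:
        print("\n⚠️  No similar companies found for this profile")
        print("\n" + "═" * 80)
        return
    
    # Calculate metrics
    success_rate = similar['is_successful'].mean() * 100
    avg_team_size = similar['team_size'].mean()
    top_locations = similar['location'].value_counts().head(5)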
    
    print(f"\n✅ SUCCESS METRICS (Based on {len(similar):,} similar companies):")
    print(f"  • Success Rate: {success_rate:.1f}%")
    print(f"  • Average Team Size: {avg_team_size:.1f}")
    
    print(f"\n🎯 KEY INSIGHTS:")
    
    # Insight 1: Team composition
    if founder_type == 'solo':
        team_growth = (similar['team_size'] > similar['num_founders']).mean() * 100
        print(f"  • {team_growth:.1f}% of solo founders grow their team")
        successful_solos = similar[similar['is_successful']]
        if len(successful_solos) > 0:
            print(f"  • Successful solo founders have avg {successful_solos['team_size'].mean():.0f} person teams")
    else:
        print(f"  • Team-founded companies have higher success rates")
    
    # Insight 2: Location advantage
    print(f"\n📍 TOP LOCATIONS FOR SIMILAR COMPANIES:")
    for loc, count in top_locations.items():
        loc_success = similar[similar['location'] == loc]['is_successful'].mean() * 100
        print(f"  • {loc}: {count} companies ({loc_success:.1f}% success rate)")
    
    # Insight 3: Industry trends
    if industry:
        recent_trend = similar[similar['batch_year'] >= 2020].shape[0]
        total_trend = similar.shape[0]
        print(f"\n📈 INDUSTRY TREND:")
        print(f"  • {recent_trend}/{total_trend} companies in {industry} are from recent batches (2020+)")
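        # Guard against an empty cohort before computing the recent-batch share
        recent_share = recent_trend / total_trend if total_trend else 0.0
        momentum = "🔥 Hot" if recent_share > 0.4 else "📊 Steady" if recent_share > 0.2 else "⚠️  Declining"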
        print(f"  • Industry momentum: {momentum}")
    
    # Recommendations
    print(f"\n💡 RECOMMENDATIONS:")
    if success_rate < 5:
        print(f"  ⚠️  Warning: This profile has lower than average success rate")
        print(f"  → Consider: Pivoting, finding co-founders, or targeting different market")
    elif success_rate > 15:
        print(f"  ✅ Strong profile with above-average success rate")
        print(f"  → Focus on: Execution, customer acquisition, product-market fit")
    
    if founder_type == 'solo' and team_growth > 70:
        print(f"  → Plan to hire early - {team_growth:.0f}% of solo founders in this cohort grow their team")
    
    print("\n" + "═" * 80)

def create_interactive_recommendations():
    """Create interactive recommendation widget"""
    from ipywidgets import interact, Dropdown
    
    # Get unique industries and locations
    unique_industries = ['General'] + sorted(list(set([tag for tags in df['tags'].dropna() for tag in tags if isinstance(tags, list)]))[:30])
    unique_locations = ['Any'] + df['location'].value_counts().head(20).index.tolist()
    
    @interact(
        founder_type=Dropdown(options=['solo', 'team'], description='Founder Type'),
        industry=Dropdown(options=unique_industries, description='Industry'),
        location=Dropdown(options=unique_locations, description='Location')
    )
    def interactive_recommendations(founder_type, industry, location):
        industry = None if industry == 'General' else industry
        location = None if location == 'Any' else location
        get_personalized_recommendations(founder_type, industry, location or 'Any')
    
    return interactive_recommendations

# Create interactive recommendations
print("💡 Actionable Recommendations Engine")
interactive_recommendations = create_interactive_recommendations()


💡 Actionable Recommendations Engine


interactive(children=(Dropdown(description='Founder Type', options=('solo', 'team'), value='solo'), Dropdown(d…
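
The industry momentum label in `get_personalized_recommendations` depends only on the share of a cohort that comes from recent batches (2020 onward), with cut-offs at 0.4 and 0.2. As a rough sketch, the rule can be isolated into a small function so it is easy to test. `classify_momentum` and the "Unknown" label for empty cohorts are assumptions added for illustration.

```python
def classify_momentum(recent_count, total_count, hot=0.4, steady=0.2):
    """Label a cohort by the share of its companies from recent batches."""
    if total_count == 0:
        return "Unknown"  # assumed label for empty cohorts; not part of the original rule
    share = recent_count / total_count
    if share > hot:
        return "Hot"
    if share > steady:
        return "Steady"
    return "Declining"

print(classify_momentum(45, 100))  # share 0.45
print(classify_momentum(25, 100))  # share 0.25
print(classify_momentum(10, 100))  # share 0.10
print(classify_momentum(0, 0))     # empty cohort
```

Because the label is a share of all batches rather than a growth rate, a long-established tag can read as "Declining" even while its absolute count per batch is stable.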

In [74]:
# OpenAI Setup
import os
from dotenv import load_dotenv
from openai import OpenAI  # openai>=1.0 client class used by OpenAI(...) below

# Load environment variables
load_dotenv()

# Initialize OpenAI client
api_key = os.environ.get('OPENAI_API_KEY')
if not api_key:
    print("⚠️ OPENAI_API_KEY not found in environment variables")
    print("\nSet it with: export OPENAI_API_KEY='your-key'")
    print("Or in notebook: import os; os.environ['OPENAI_API_KEY'] = 'your-key'")
    client = None
else:
    client = OpenAI(api_key=api_key)
    print("✅ OpenAI client initialized")

class CompanyIntelligence:
    """AI-powered company analysis and insights generation"""
    
    def __init__(self, openai_client):
        self.client = openai_client
        
    def analyze_company_archetype(self, company_data):
        """Analyze company to determine business model archetype"""
        if not self.client:
            return "OpenAI client not available"
            
        prompt = f"""
        Analyze this YC company and identify its business model archetype:
        
        Company: {company_data['company_name']}
        Description: {company_data['short_description']}
        Tags: {company_data['tags']}
        Location: {company_data['location']}
        Team Size: {company_data['team_size']}
        Founders: {company_data['num_founders']}
        
        Classify into one of these archetypes:
        1. B2B SaaS - Software for businesses
        2. Marketplace - Connecting buyers and sellers
        3. Consumer - Direct to consumer products/services
        4. AI/ML - AI/ML focused companies
        5. Fintech - Financial technology
        6. Healthcare - Health/medical technology
        7. Other - Doesn't fit above categories
        
        Also provide:
        - Market positioning (pioneer, fast-follower, niche player)
        - Target customer segment
        - Key competitive advantages mentioned
        - Growth stage indicators
        
        Format as JSON with archetype, positioning, customers, advantages, stage.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Analysis failed: {str(e)}"
    
    def predict_success_probability(self, company_data):
        """Predict success probability with AI-generated reasoning"""
        if not self.client:
            return "OpenAI client not available"
            
        prompt = f"""
        Analyze this YC company's success potential:
        
        Company: {company_data['company_name']}
        Description: {company_data['short_description']}
        Long Description: {company_data.get('long_description', 'N/A')}
        Tags: {company_data['tags']}
        Location: {company_data['location']}
        Team Size: {company_data['team_size']}
        Founders: {company_data['num_founders']}
        Batch: {company_data['batch']}
        
        Based on successful YC company patterns, provide:
        1. Success probability (0-100%)
        2. Key success factors present
        3. Potential risks/concerns
        4. Recommendations for improvement
        5. Similar successful companies (if any)
        
        Format as structured analysis with reasoning.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.4
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Prediction failed: {str(e)}"
    
    def generate_competitive_analysis(self, company_data):
        """Generate competitive landscape analysis"""
        if not self.client:
            return "OpenAI client not available"
            
        prompt = f"""
        Analyze the competitive landscape for this YC company:
        
        Company: {company_data['company_name']}
        Description: {company_data['short_description']}
        Tags: {company_data['tags']}
        
        Provide:
        1. Direct competitors (other YC companies in similar space)
        2. Market positioning opportunities
        3. Competitive advantages to highlight
        4. Potential partnership opportunities
        5. Market gaps this company could fill
        
        Focus on actionable competitive intelligence.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.5
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Competitive analysis failed: {str(e)}"

# Initialize the intelligence engine
if client:
    intelligence = CompanyIntelligence(client)
    print("🧠 Company Intelligence Engine initialized")
else:
    intelligence = None
    print("⚠️ Company Intelligence Engine not available (no OpenAI API key)")


✅ OpenAI client initialized
🧠 Company Intelligence Engine initialized
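
`analyze_company_archetype` asks the model to format its answer as JSON but returns the raw response text, and on failure it returns an error string instead. A caller that wants structured fields therefore has to parse defensively. The sketch below is one way to do that with the standard library only; `parse_json_response` is a hypothetical helper, and the assumption that replies may arrive wrapped in a Markdown code fence or surrounded by prose is based on typical chat-model behavior, not on output shown in this notebook.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block fence-safe
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, flags=re.DOTALL)

def parse_json_response(text):
    """Return the first JSON object found in `text`, or None if nothing parses."""
    if not isinstance(text, str):
        return None
    # Prefer the contents of a fenced block when the reply contains one
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the outermost {...} span if prose surrounds the object
    start, end = candidate.find('{'), candidate.rfind('}')
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

print(parse_json_response('{"archetype": "B2B SaaS", "stage": "early"}'))
print(parse_json_response("Analysis failed: timeout"))
```

A `None` result signals that the reply should be shown as plain text or retried, which keeps the error strings returned by the class from being mistaken for data.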


## 3. Company Intelligence Engine


In [75]:
# Interactive Analysis Functions
def analyze_company_interactive(company_name=None, company_index=None):
    """Interactive company analysis with AI insights"""
    if company_name:
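        # Literal, case-insensitive match; regex=False keeps names such as 'C++' from being read as patterns
        company = df[df['company_name'].str.contains(company_name, case=False, na=False, regex=False)]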
        if len(company) == 0:
            print(f"❌ Company '{company_name}' not found")
            return
        company = company.iloc[0]
    elif company_index is not None:
        company = df.iloc[company_index]
    else:
        # Random company
        company = df.sample(1).iloc[0]
    
    print(f"🔍 Analyzing: {company['company_name']}")
    print(f"📝 Description: {company['short_description']}")
    print(f"🏷️ Tags: {company['tags']}")
    print(f"📍 Location: {company['location']}")
    print(f"👥 Team: {company['num_founders']} founders, {company['team_size']} total")
    print(f"📊 Status: {company['status']}")
    print(f"📈 Success: {'✅' if company['is_successful'] else '❌'}")
    
    if intelligence:
        print("\n" + "="*60)
        print("🧠 AI-POWERED ANALYSIS")
        print("="*60)
        
        # Archetype analysis
        print("\n📋 BUSINESS MODEL ARCHETYPE:")
        archetype_analysis = intelligence.analyze_company_archetype(company)
        print(archetype_analysis)
        
        print("\n🎯 SUCCESS PREDICTION:")
        success_analysis = intelligence.predict_success_probability(company)
        print(success_analysis)
        
        print("\n⚔️ COMPETITIVE ANALYSIS:")
        competitive_analysis = intelligence.generate_competitive_analysis(company)
        print(competitive_analysis)
    else:
        print("\n⚠️ AI analysis not available (OpenAI API key required)")

# Create interactive widgets
def create_analysis_widgets():
    """Create interactive widgets for company analysis"""
    
    # Company selector
    company_options = [f"{row['company_name']} ({row['batch']})" for idx, row in df.iterrows()]
    company_selector = widgets.Dropdown(
        options=company_options,
        description='Company:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='400px')
    )
    
    # Analysis type selector
    analysis_type = widgets.RadioButtons(
        options=['Full Analysis', 'Archetype Only', 'Success Prediction', 'Competitive Analysis'],
        description='Analysis:',
        style={'description_width': 'initial'}
    )
    
    # Random company button
    random_button = widgets.Button(
        description='🎲 Random Company',
        button_style='info',
        layout=widgets.Layout(width='150px')
    )
    
    # Output area
    output = widgets.Output()
    
    def on_company_change(change):
        with output:
            output.clear_output()
            if change['new']:
                company_idx = company_options.index(change['new'])
                analyze_company_interactive(company_index=company_idx)
    
    def on_random_click(b):
        with output:
            output.clear_output()
            analyze_company_interactive()
    
    company_selector.observe(on_company_change, names='value')
    random_button.on_click(on_random_click)
    
    # Layout
    controls = widgets.HBox([company_selector, random_button])
    display(controls)
    display(output)
    
    return company_selector, analysis_type, random_button, output

print("🎛️ Interactive Analysis Interface Ready")
print("Use the widgets below to analyze companies with AI insights")


🎛️ Interactive Analysis Interface Ready
Use the widgets below to analyze companies with AI insights
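
Each call to `analyze_company_interactive` with an API key configured issues three separate model requests (archetype, success prediction, competitive analysis), so re-selecting a company repeats the same paid calls. A small in-memory cache is one way to avoid that. The sketch below uses a stand-in function in place of the API-backed methods; `CachedAnalyzer` and `fake_analysis` are hypothetical names, and keying by company name assumes names are unique (adding the batch to the key would be safer if they are not).

```python
class CachedAnalyzer:
    """Memoize an expensive per-company analysis function by company name."""

    def __init__(self, analyze_fn):
        self._analyze_fn = analyze_fn
        self._cache = {}
        self.calls = 0  # how many times the wrapped function actually ran

    def analyze(self, company_name, company_data):
        if company_name not in self._cache:
            self.calls += 1
            self._cache[company_name] = self._analyze_fn(company_data)
        return self._cache[company_name]

# Stand-in for an API-backed method (hypothetical; no network call is made)
def fake_analysis(company_data):
    return f"analysis of {company_data['company_name']}"

cached = CachedAnalyzer(fake_analysis)
first = cached.analyze('Acme', {'company_name': 'Acme'})
second = cached.analyze('Acme', {'company_name': 'Acme'})
print(first == second, cached.calls)
```

The cache lives only for the notebook session, so restarting the kernel clears it.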


## 4. Market Intelligence & Trends
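
The market report in this section summarizes success by industry tag. Elsewhere the notebook treats `tags` as a list per company (it calls `.explode()` on the column), and lists cannot serve as group keys, so per-tag statistics need one row per (company, tag) pair first. The following is a minimal sketch on a toy frame; the data and the output column names are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed shape): one row per company, `tags` holds a list of labels
toy = pd.DataFrame({
    'company_name': ['A', 'B', 'C', 'D'],
    'tags': [['AI', 'B2B'], ['AI'], ['Fintech', 'B2B'], ['Fintech']],
    'is_successful': [True, False, True, False],
})

# explode() yields one row per (company, tag) pair, so each tag becomes a hashable group key
per_tag = (
    toy.explode('tags')
       .groupby('tags')['is_successful']
       .agg(companies='count', successes='sum', success_rate='mean')
       .sort_values('companies', ascending=False)
)
print(per_tag)
```

A company with several tags is counted once under each of them, so the per-tag counts add up to more than the number of companies.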


In [76]:
# Market Intelligence Engine
class MarketIntelligence:
    """AI-powered market trend analysis and opportunity detection"""
    
    def __init__(self, openai_client, df):
        self.client = openai_client
        self.df = df
        
    def analyze_industry_trends(self, time_period='2020-2025'):
        """Analyze industry trends over time with AI insights"""
        if not self.client:
            return "OpenAI client not available"
        
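        # Get trend data for the requested period (expects 'YYYY-YYYY', e.g. '2020-2025')
        start_year, end_year = (int(part) for part in time_period.split('-'))
        recent_companies = self.df[self.df['batch_year'].between(start_year, end_year)]
        industry_counts = recent_companies['tags'].explode().value_counts().head(20)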
        
        prompt = f"""
        Analyze these YC industry trends from {time_period}:
        
        Top Industries: {dict(industry_counts.head(10))}
        
        Provide insights on:
        1. Emerging hot sectors (growing rapidly)
        2. Saturated markets (declining or stable)
        3. Market opportunities (underserved areas)
        4. Investment timing recommendations
        5. Competitive landscape changes
        
        Focus on actionable market intelligence for founders and investors.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.4
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Trend analysis failed: {str(e)}"
    
    def identify_market_opportunities(self, filters=None):
        """Identify market gaps and opportunities"""
        if not self.client:
            return "OpenAI client not available"
        
        # Filter data
        filtered_df = self.df.copy()
        if filters:
            for key, value in filters.items():
                if key in filtered_df.columns:
                    filtered_df = filtered_df[filtered_df[key] == value]
        
        # Get sample descriptions
        sample_descriptions = filtered_df['short_description'].dropna().head(50).tolist()
        
        prompt = f"""
        Analyze these YC company descriptions to identify market opportunities:
        
        Sample Descriptions: {sample_descriptions[:10]}
        
        Identify:
        1. Underserved market segments
        2. Emerging problem areas
        3. Technology gaps
        4. Partnership opportunities
        5. Market timing insights
        
        Provide specific, actionable opportunities for new startups.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.5
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Opportunity analysis failed: {str(e)}"
    
    def generate_market_report(self):
        """Generate comprehensive market intelligence report"""
        if not self.client:
            return "OpenAI client not available"
        
        # Gather market data (tags is list-valued, so explode to one row per tag before grouping)
        success_by_industry = self.df.explode('tags').groupby('tags').agg({
            'is_successful': ['count', 'sum', 'mean'],
            'company_name': 'count'
        }).round(3)
        
        location_trends = self.df['location'].value_counts().head(10)
        batch_trends = self.df['batch_year'].value_counts().sort_index()
        
        prompt = f"""
        Generate a comprehensive market intelligence report based on YC data:
        
        Success by Industry: {success_by_industry.head(10).to_dict()}
        Top Locations: {dict(location_trends)}
        Batch Trends: {dict(batch_trends.tail(10))}
        
        Create a report covering:
        1. Executive Summary
        2. Market Trends & Opportunities
        3. Success Factors Analysis
        4. Geographic Insights
        5. Investment Recommendations
        6. Risk Assessment
        
        Format as a professional market intelligence report.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Market report generation failed: {str(e)}"

# Initialize market intelligence
if client:
    market_intel = MarketIntelligence(client, df)
    print("📈 Market Intelligence Engine initialized")
else:
    market_intel = None
    print("⚠️ Market Intelligence not available (no OpenAI API key)")

# Quick market analysis
def quick_market_analysis():
    """Run quick market intelligence analysis"""
    if not market_intel:
        print("⚠️ Market Intelligence not available")
        return
    
    print("🔍 Analyzing Market Trends...")
    trends = market_intel.analyze_industry_trends()
    print("\n📊 INDUSTRY TRENDS:")
    print(trends)
    
    print("\n🎯 MARKET OPPORTUNITIES:")
    opportunities = market_intel.identify_market_opportunities()
    print(opportunities)

print("📈 Market Intelligence Ready")
print("Run quick_market_analysis() for instant insights")


📈 Market Intelligence Engine initialized
📈 Market Intelligence Ready
Run quick_market_analysis() for instant insights
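
The `filters` argument of `identify_market_opportunities` keeps only rows whose column value equals the given value exactly, and it silently skips keys that are not columns. The sketch below mirrors that loop in plain Python on a few made-up records; the records and the `apply_filters` helper are illustrative stand-ins, not part of the notebook. It also suggests why an equality filter is unlikely to match a list-valued column such as `tags`.

```python
def apply_filters(rows, filters=None):
    """Plain-Python stand-in for the notebook's equality filter loop (hypothetical helper)."""
    if not filters:
        return list(rows)
    # Collect every key seen in the records, playing the role of DataFrame columns
    known = set().union(*(r.keys() for r in rows)) if rows else set()
    out = list(rows)
    for key, value in filters.items():
        if key in known:  # unknown keys are ignored, as in the notebook
            out = [r for r in out if r.get(key) == value]
    return out

# Made-up records for illustration only
rows = [
    {"company_name": "A", "batch_year": 2023, "tags": ["AI", "B2B"]},
    {"company_name": "B", "batch_year": 2024, "tags": ["Fintech"]},
    {"company_name": "C", "batch_year": 2024, "tags": ["AI"]},
]

by_year = apply_filters(rows, {"batch_year": 2024})
by_tag = apply_filters(rows, {"tags": "AI"})          # a list never equals a string
ignored = apply_filters(rows, {"no_such_column": 1})  # unknown key leaves rows untouched

print([r["company_name"] for r in by_year])
print(len(by_tag), len(ignored))
```

Filtering by a scalar column such as `batch_year` behaves as expected, while a tag filter would need a membership test rather than equality.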


## 5. Interactive Analysis Interface


In [77]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Enhanced Visualization Functions
def create_ai_insights_dashboard():
    """Create comprehensive AI-powered insights dashboard"""
    
    # Create subplots
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Success by Industry', 'Founder Team Analysis', 
                       'Geographic Distribution', 'Market Trends'),
        specs=[[{"type": "bar"}, {"type": "pie"}],
               [{"type": "scatter"}, {"type": "bar"}]]
    )
    
    # 1. Success by Industry
    industry_success = df.explode('tags').groupby('tags').agg({
        'is_successful': ['count', 'sum', 'mean']
    }).round(3)
    industry_success.columns = ['total', 'successful', 'success_rate']
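    # Require at least 10 companies per tag (same threshold as locations below) so tiny tags do not dominate the ranking
    top_industries = industry_success[industry_success['total'] >= 10].sort_values('success_rate', ascending=False).head(10)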
    
    fig.add_trace(
        go.Bar(x=top_industries.index, y=top_industries['success_rate'],
               name='Success Rate', marker_color='lightblue'),
        row=1, col=1
    )
    
    # 2. Founder Team Analysis
    team_analysis = df.groupby('num_founders').agg({
        'is_successful': 'mean',
        'company_name': 'count'
    }).round(3)
    
    fig.add_trace(
        go.Pie(labels=[f'{i} founders' for i in team_analysis.index],
               values=team_analysis['company_name'],
               name='Team Size Distribution'),
        row=1, col=2
    )
    
    # 3. Geographic Distribution
    location_success = df.groupby('location').agg({
        'is_successful': ['count', 'mean']
    }).round(3)
    location_success.columns = ['total', 'success_rate']
    top_locations = location_success[location_success['total'] >= 10].sort_values('success_rate', ascending=False).head(10)
    
    fig.add_trace(
        go.Scatter(x=top_locations['total'], y=top_locations['success_rate'],
                   mode='markers+text', text=top_locations.index,
                   name='Location Performance', marker=dict(size=10, color='green')),
        row=2, col=1
    )
    
    # 4. Market Trends
    batch_trends = df.groupby('batch_year').agg({
        'company_name': 'count',
        'is_successful': 'mean'
    }).round(3)
    
    fig.add_trace(
        go.Bar(x=batch_trends.index, y=batch_trends['company_name'],
               name='Companies per Batch', marker_color='orange'),
        row=2, col=2
    )
    
    fig.update_layout(
        title_text="YC Companies: AI-Enhanced Insights Dashboard",
        showlegend=False,
        height=800
    )
    
    return fig

def generate_insights_summary():
    """Summarize key dataset statistics (computed locally; no API call is made)"""
    
    # Get key statistics
    total_companies = len(df)
    successful_companies = df['is_successful'].sum()
    success_rate = successful_companies / total_companies
    
    solo_success = df[df['is_solo_founder']]['is_successful'].mean()
    team_success = df[df['is_team_founded']]['is_successful'].mean()
    
    ai_success = df[df['is_ai']]['is_successful'].mean()
    b2b_success = df[df['is_b2b']]['is_successful'].mean()
    
    # Generate insights
    insights = f"""
    🎯 KEY INSIGHTS FROM YC DATA:
    
    📊 Overall Performance:
    • Total Companies: {total_companies:,}
    • Success Rate: {success_rate:.1%}
    • Successful Companies: {successful_companies:,}
    
    👥 Founder Analysis:
    • Solo Founder Success: {solo_success:.1%}
    • Team Founder Success: {team_success:.1%}
    • Team Advantage: {((team_success - solo_success) / solo_success * 100):+.1f}%
    
    🏭 Industry Performance:
    • AI Companies Success: {ai_success:.1%}
    • B2B Companies Success: {b2b_success:.1%}
    
    📍 Geographic Insights:
    • SF Bay Area: {df['is_sf_bay'].sum():,} companies ({df['is_sf_bay'].mean()*100:.1f}%)
    • International: {(~df['is_us']).sum():,} companies ({(~df['is_us']).mean()*100:.1f}%)
    
    🎯 Success Factors:
    • Mature Companies: {df['is_mature'].sum():,} ({df['is_mature'].mean()*100:.1f}%)
    • Average Team Size: {df['team_size'].mean():.1f}
    • Average Company Age: {df['company_age'].mean():.1f} years
    """
    
    return insights

# Create the dashboard
print("📊 Creating AI-Enhanced Insights Dashboard...")
dashboard = create_ai_insights_dashboard()
dashboard.show()

print("\n" + "="*60)
print("🧠 AI-POWERED INSIGHTS SUMMARY")
print("="*60)
insights = generate_insights_summary()
print(insights)


📊 Creating AI-Enhanced Insights Dashboard...



🧠 AI-POWERED INSIGHTS SUMMARY

    🎯 KEY INSIGHTS FROM YC DATA:

    📊 Overall Performance:
    • Total Companies: 5,463
    • Success Rate: 13.1%
    • Successful Companies: 714

    👥 Founder Analysis:
    • Solo Founder Success: 13.5%
    • Team Founder Success: 12.9%
    • Team Advantage: -4.4%

    🏭 Industry Performance:
    • AI Companies Success: 6.4%
    • B2B Companies Success: 10.7%

    📍 Geographic Insights:
    • SF Bay Area: 2,191 companies (40.1%)
    • International: 1,738 companies (31.8%)

    🎯 Success Factors:
    • Mature Companies: 3,053 (55.9%)
    • Average Team Size: 49.1
    • Average Company Age: 4.8 years
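
The "Team Advantage" line is a relative difference between the two success rates, not a percentage-point gap. A short check using the rounded rates printed above (so the result is approximate) shows how the two readings differ:

```python
# Rounded rates copied from the summary output above
solo_success = 0.135
team_success = 0.129

# Relative difference, as computed in generate_insights_summary()
relative_diff = (team_success - solo_success) / solo_success * 100

# Absolute gap in percentage points
point_diff = (team_success - solo_success) * 100

print(f"Relative difference: {relative_diff:+.1f}%")
print(f"Percentage-point difference: {point_diff:+.1f} pp")
```

The relative figure looks larger because it is scaled by the solo-founder rate; the underlying gap is well under one percentage point.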
    


## 6. Advanced Visualizations


In [78]:
# Quick Start Demo
print("🚀 YC GenAI Analysis - Quick Start Guide")
print("="*60)
print()
print("📚 AVAILABLE FUNCTIONS:")
print()
print("1. analyze_company_interactive(company_name='Airbnb')")
print("   → Analyze any company with AI insights")
print()
print("2. quick_market_analysis()")
print("   → Get instant market trends and opportunities")
print()
print("3. create_analysis_widgets()")
print("   → Launch interactive analysis interface")
print()
print("4. market_intel.generate_market_report()")
print("   → Generate comprehensive market intelligence report")
print()
print("5. intelligence.predict_success_probability(company_data)")
print("   → Predict success with AI reasoning")
print()
print("="*60)
print()
print("💡 TRY IT NOW:")
print("Run: analyze_company_interactive()")
print("     to analyze a random company with AI insights!")
print()
print("🔑 NOTE: Requires OPENAI_API_KEY environment variable")
print("="*60)


🚀 YC GenAI Analysis - Quick Start Guide

📚 AVAILABLE FUNCTIONS:

1. analyze_company_interactive(company_name='Airbnb')
   → Analyze any company with AI insights

2. quick_market_analysis()
   → Get instant market trends and opportunities

3. create_analysis_widgets()
   → Launch interactive analysis interface

4. market_intel.generate_market_report()
   → Generate comprehensive market intelligence report

5. intelligence.predict_success_probability(company_data)
   → Predict success with AI reasoning


💡 TRY IT NOW:
Run: analyze_company_interactive()
     to analyze a random company with AI insights!

🔑 NOTE: Requires OPENAI_API_KEY environment variable


In [79]:
create_analysis_widgets()

HBox(children=(Dropdown(description='Company:', layout=Layout(width='400px'), options=('Bear (Fall 2025)', 'Cl…

Output()


In [80]:
quick_market_analysis()

🔍 Analyzing Market Trends...

📊 INDUSTRY TRENDS:
1. Emerging Hot Sectors: The data shows that the B2B (Business to Business), SaaS (Software as a Service), and AI (Artificial Intelligence) sectors are the top three industries. This suggests that these sectors are growing rapidly and are the current hot sectors. Particularly, AI and its subsets like generative AI and machine learning are seeing a significant rise, indicating the increasing demand and growth in the AI industry.

2. Saturated Markets: While the 'consumer' and 'marketplace' sectors are still in the top 10, they are at the lower end of the list. This could suggest that these markets are relatively stable or possibly declining, indicating that they might be more saturated markets.

3. Market Opportunities: The 'generative-ai', 'consumer', and 'marketplace' sectors might be underserved areas. Despite being in the top 10, they have significantly fewer startups compared to the top sectors. This could indicate a potential for gr

---

## 📖 Summary & Next Steps

### What This Notebook Provides:

**🧠 AI-Powered Analysis:**
- Intelligent company profiling and archetype detection
- Success prediction with AI-generated reasoning
- Competitive landscape analysis
- Market trend identification

**📊 Advanced Insights:**
- On-demand market intelligence generated from the loaded dataset
- Opportunity detection
- Industry trend analysis
- Geographic insights

**🎯 Interactive Tools:**
- Conversational company analysis
- Dynamic visualizations
- Market intelligence reports
- Predictive analytics

### Key Advantages Over Traditional Analysis:

1. **Semantic Understanding** - AI comprehends company descriptions beyond keywords
2. **Predictive Narratives** - Explains WHY companies succeed, not just statistics
3. **Dynamic Intelligence** - Insights are regenerated on demand from whichever version of the dataset is loaded
4. **Actionable Recommendations** - Specific guidance for founders and investors
5. **Competitive Intelligence** - Automated competitive landscape analysis

### Requirements:

- **OpenAI API Key** - Set `OPENAI_API_KEY` environment variable
- **Python Libraries** - textblob, umap-learn, networkx, wordcloud, plus the openai, plotly, and ipywidgets packages used by the cells above
- **Data** - YC companies dataset (included)
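
A quick way to confirm the four less common libraries (textblob, umap-learn, networkx, wordcloud) are installed before running the notebook is sketched below. The `missing_packages` helper and the `REQUIRED` mapping are illustrative names, not part of the notebook.

```python
import importlib.util

# pip name -> import name (they differ for umap-learn, which imports as "umap")
REQUIRED = {
    "textblob": "textblob",
    "umap-learn": "umap",
    "networkx": "networkx",
    "wordcloud": "wordcloud",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found (hypothetical helper)."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All listed libraries are installed")
```

Anything reported as missing can be installed with `pip install` using the pip name shown.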

### Cost Considerations:

- **GPT-4 API** - ~$0.03 per 1K input tokens, ~$0.06 per 1K output tokens (rates change; check OpenAI's current pricing)
- **Typical Analysis** - $0.05-0.15 per company
- **Batch Analysis** - Use GPT-3.5-turbo for lower costs
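
To see how the per-company range follows from the per-token prices, here is a rough estimate. It uses the rates quoted above and hypothetical token counts (1,000 input and 800 output tokens per analysis), so treat the result as an order-of-magnitude guide only.

```python
# Prices per 1K tokens, taken from the figures quoted above (may be out of date)
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

def estimate_cost(input_tokens, output_tokens):
    """Estimated USD cost of one request (hypothetical helper)."""
    return ((input_tokens / 1000) * INPUT_PRICE_PER_1K
            + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K)

# Assumed token counts for a single company analysis
per_company = estimate_cost(1000, 800)

print(f"${per_company:.3f} per company")
print(f"${per_company * 100:.2f} for 100 companies")
```

Under these assumptions a single analysis lands inside the $0.05-0.15 range, and output tokens account for most of the cost.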

### Next Steps:

1. **Set up API key** - `export OPENAI_API_KEY='your-key'`
2. **Run analysis** - Try `analyze_company_interactive()`
3. **Explore insights** - Use `quick_market_analysis()`
4. **Generate reports** - Call `market_intel.generate_market_report()`
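
Before step 2, it can help to confirm that the key is visible to the Python process. This is a minimal sketch; `api_key_available` is an illustrative helper rather than a function defined in the notebook.

```python
import os

def api_key_available(env=os.environ):
    """True when OPENAI_API_KEY is set to a non-empty value (hypothetical helper)."""
    return bool(env.get("OPENAI_API_KEY", "").strip())

if api_key_available():
    print("OPENAI_API_KEY found; AI-backed functions can run")
else:
    print("OPENAI_API_KEY not set; AI-backed functions will return a fallback message")
```

If the key was exported after the Jupyter server started, restart the kernel (or the server) so the new environment is picked up.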

---

**🚀 Ready to transform YC data into actionable intelligence!**

*For questions or improvements, see README.md*


## 7. Year-Level Batch Analysis


In [81]:
# Year-Level Batch Analysis Engine
class YearLevelAnalysis:
    """Interactive year-level batch analysis with AI insights"""
    
    def __init__(self, df, openai_client=None):
        self.df = df
        self.client = openai_client
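        # Cast to plain int so widget options and printed years are not shown as np.int64(...)
        self.years = sorted(int(y) for y in df['batch_year'].dropna().unique())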
        
    def get_year_statistics(self, year):
        """Get comprehensive statistics for a specific year"""
        year_df = self.df[self.df['batch_year'] == year]
        
        stats = {
            'year': year,
            'total_companies': len(year_df),
            'success_rate': year_df['is_successful'].mean(),
            'successful_companies': year_df['is_successful'].sum(),
            'solo_founder_rate': year_df['is_solo_founder'].mean(),
            'team_founder_rate': year_df['is_team_founded'].mean(),
            'ai_companies': year_df['is_ai'].sum(),
            'b2b_companies': year_df['is_b2b'].sum(),
            'fintech_companies': year_df['is_fintech'].sum(),
            'sf_bay_companies': year_df['is_sf_bay'].sum(),
            'international_companies': (~year_df['is_us']).sum(),
            'avg_team_size': year_df['team_size'].mean(),
            'avg_company_age': year_df['company_age'].mean(),
            'top_industries': year_df.explode('tags')['tags'].value_counts().head(10).to_dict(),
            'top_locations': year_df['location'].value_counts().head(10).to_dict(),
        }
        
        return stats
    
    def compare_years(self, year1, year2):
        """Compare statistics between two years"""
        stats1 = self.get_year_statistics(year1)
        stats2 = self.get_year_statistics(year2)
        
        comparison = {
            'year1': year1,
            'year2': year2,
            'companies_growth': ((stats2['total_companies'] - stats1['total_companies']) / stats1['total_companies'] * 100) if stats1['total_companies'] > 0 else 0,
            'success_rate_change': (stats2['success_rate'] - stats1['success_rate']) * 100,
            'ai_growth': ((stats2['ai_companies'] - stats1['ai_companies']) / stats1['ai_companies'] * 100) if stats1['ai_companies'] > 0 else 0,
            'b2b_growth': ((stats2['b2b_companies'] - stats1['b2b_companies']) / stats1['b2b_companies'] * 100) if stats1['b2b_companies'] > 0 else 0,
            'stats1': stats1,
            'stats2': stats2
        }
        
        return comparison
    
    def generate_year_insights(self, year):
        """Generate AI-powered insights for a specific year"""
        if not self.client:
            return "AI insights not available (OpenAI API key required)"
        
        stats = self.get_year_statistics(year)
        year_df = self.df[self.df['batch_year'] == year]
        
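        # Get sample companies (drop rows with missing names or descriptions so the string slicing in the prompt cannot fail)
        sample_companies = year_df.dropna(subset=['company_name', 'short_description'])[['company_name', 'short_description', 'tags', 'status']].head(20).to_dict('records')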
        
        prompt = f"""
        Analyze YC batch year {year} and provide insights:
        
        Statistics:
        - Total Companies: {stats['total_companies']}
        - Success Rate: {stats['success_rate']:.1%}
        - AI Companies: {stats['ai_companies']}
        - B2B Companies: {stats['b2b_companies']}
        - Top Industries: {list(stats['top_industries'].keys())[:5]}
        
        Sample Companies: {[c['company_name'] + ': ' + c['short_description'][:100] for c in sample_companies[:5]]}
        
        Provide:
        1. Key trends and themes for this year
        2. Notable success patterns
        3. Emerging technologies/sectors
        4. Market conditions and opportunities
        5. Comparison to typical YC batch characteristics
        6. Investment recommendations for this cohort
        
        Format as structured analysis with actionable insights.
        """
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.4
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"AI analysis failed: {str(e)}"
    
    def create_year_dashboard(self, year):
        """Create interactive dashboard for a specific year"""
        year_df = self.df[self.df['batch_year'] == year]
        stats = self.get_year_statistics(year)
        
        # Create subplots
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=(
                f'Industry Distribution ({year})',
                f'Success by Founder Team Size ({year})',
                f'Geographic Distribution ({year})',
                f'Company Status ({year})'
            ),
            specs=[[{"type": "bar"}, {"type": "pie"}],
                   [{"type": "bar"}, {"type": "pie"}]]
        )
        
        # 1. Industry Distribution
        industry_counts = year_df.explode('tags')['tags'].value_counts().head(10)
        fig.add_trace(
            go.Bar(x=industry_counts.index, y=industry_counts.values,
                   name='Companies', marker_color='lightblue'),
            row=1, col=1
        )
        
        # 2. Success by Founder Team Size
        team_success = year_df.groupby('num_founders')['is_successful'].agg(['count', 'mean']).round(3)
        fig.add_trace(
            go.Pie(labels=[f'{i} founders' for i in team_success.index],
                   values=team_success['count'],
                   name='Team Distribution'),
            row=1, col=2
        )
        
        # 3. Geographic Distribution
        location_counts = year_df['location'].value_counts().head(10)
        fig.add_trace(
            go.Bar(x=location_counts.index, y=location_counts.values,
                   name='Companies', marker_color='lightgreen'),
            row=2, col=1
        )
        
        # 4. Company Status
        status_counts = year_df['status'].value_counts()
        fig.add_trace(
            go.Pie(labels=status_counts.index, values=status_counts.values,
                   name='Status Distribution'),
            row=2, col=2
        )
        
        fig.update_layout(
            title_text=f"YC Batch Year {year} - Comprehensive Analysis",
            showlegend=False,
            height=800
        )
        
        return fig
    
    def create_multi_year_comparison(self, years=None):
        """Create multi-year comparison dashboard"""
        if years is None:
            years = self.years[-10:]  # Last 10 years
        
        # Prepare data
        year_stats = []
        for year in years:
            stats = self.get_year_statistics(year)
            year_stats.append(stats)
        
        # Create subplots
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=(
                'Companies per Year',
                'Success Rate Trend',
                'AI Companies Growth',
                'International Expansion'
            )
        )
        
        # 1. Companies per Year
        fig.add_trace(
            go.Bar(x=[s['year'] for s in year_stats],
                   y=[s['total_companies'] for s in year_stats],
                   name='Total Companies', marker_color='lightblue'),
            row=1, col=1
        )
        
        # 2. Success Rate Trend
        fig.add_trace(
            go.Scatter(x=[s['year'] for s in year_stats],
                       y=[s['success_rate']*100 for s in year_stats],
                       mode='lines+markers', name='Success Rate %',
                       line=dict(color='green', width=3)),
            row=1, col=2
        )
        
        # 3. AI Companies Growth
        fig.add_trace(
            go.Bar(x=[s['year'] for s in year_stats],
                   y=[s['ai_companies'] for s in year_stats],
                   name='AI Companies', marker_color='purple'),
            row=2, col=1
        )
        
        # 4. International Expansion
        fig.add_trace(
            go.Scatter(x=[s['year'] for s in year_stats],
                       y=[s['international_companies'] for s in year_stats],
                       mode='lines+markers', name='International Companies',
                       line=dict(color='orange', width=3)),
            row=2, col=2
        )
        
        fig.update_layout(
            title_text="YC Multi-Year Trends Analysis",
            showlegend=False,
            height=800
        )
        
        return fig

# Initialize year-level analysis
year_analysis = YearLevelAnalysis(df, client)
print("📊 Year-Level Analysis Engine initialized")
print(f"📅 Available years: {year_analysis.years[0]} - {year_analysis.years[-1]}")


📊 Year-Level Analysis Engine initialized
📅 Available years: 2005 - 2025
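
The `top_industries` entry in `get_year_statistics` counts tags after `explode('tags')`, so a company with several tags is counted once per tag. The plain-Python sketch below reproduces that counting on made-up records with `collections.Counter` (a stand-in for the pandas calls, not the notebook's code):

```python
from collections import Counter

# Made-up records for illustration only
companies = [
    {"name": "A", "tags": ["AI", "B2B"]},
    {"name": "B", "tags": ["B2B"]},
    {"name": "C", "tags": ["AI", "B2B", "Fintech"]},
]

# One count per (company, tag) pair, like explode('tags') followed by value_counts()
tag_counts = Counter(tag for company in companies for tag in company["tags"])

# Keep the most frequent tags, like .head(n).to_dict()
top_industries = dict(tag_counts.most_common(2))

print(top_industries)
print(sum(tag_counts.values()), "tag assignments across", len(companies), "companies")
```

Because tag counts sum to more than the number of companies, per-tag shares should not be read as a partition of the batch.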


In [82]:
# Create interactive widgets for year-level analysis
def create_year_analysis_widgets():
    """Create interactive widgets for year-level batch analysis"""
    
    # Year selector
    year_selector = widgets.Dropdown(
        options=year_analysis.years,
        value=year_analysis.years[-1],
        description='Select Year:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='200px')
    )
    
    # Comparison year selectors
    year1_selector = widgets.Dropdown(
        options=year_analysis.years,
        value=year_analysis.years[-2] if len(year_analysis.years) > 1 else year_analysis.years[-1],
        description='Compare Year 1:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='200px')
    )
    
    year2_selector = widgets.Dropdown(
        options=year_analysis.years,
        value=year_analysis.years[-1],
        description='Compare Year 2:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='200px')
    )
    
    # Analysis type selector
    analysis_type = widgets.RadioButtons(
        options=['Single Year Analysis', 'Year Comparison', 'Multi-Year Trends'],
        description='Analysis Type:',
        style={'description_width': 'initial'}
    )
    
    # Action buttons
    analyze_button = widgets.Button(
        description='🔍 Analyze',
        button_style='success',
        layout=widgets.Layout(width='150px')
    )
    
    ai_insights_button = widgets.Button(
        description='🧠 AI Insights',
        button_style='info',
        layout=widgets.Layout(width='150px')
    )
    
    # Output area
    output = widgets.Output()
    
    def on_analyze_click(b):
        with output:
            output.clear_output()
            
            if analysis_type.value == 'Single Year Analysis':
                year = year_selector.value
                print(f"📊 Analyzing YC Batch Year {year}")
                print("="*60)
                
                stats = year_analysis.get_year_statistics(year)
                
                print(f"\n🎯 KEY STATISTICS:")
                print(f"  • Total Companies: {stats['total_companies']:,}")
                print(f"  • Success Rate: {stats['success_rate']:.1%}")
                print(f"  • Successful Companies: {stats['successful_companies']}")
                print(f"\n👥 FOUNDER ANALYSIS:")
                print(f"  • Solo Founders: {stats['solo_founder_rate']:.1%}")
                print(f"  • Team Founders: {stats['team_founder_rate']:.1%}")
                print(f"\n🏭 INDUSTRY BREAKDOWN:")
                print(f"  • AI Companies: {stats['ai_companies']}")
                print(f"  • B2B Companies: {stats['b2b_companies']}")
                print(f"  • Fintech Companies: {stats['fintech_companies']}")
                print(f"\n📍 GEOGRAPHIC DISTRIBUTION:")
                print(f"  • SF Bay Area: {stats['sf_bay_companies']} ({stats['sf_bay_companies']/stats['total_companies']*100:.1f}%)")
                print(f"  • International: {stats['international_companies']} ({stats['international_companies']/stats['total_companies']*100:.1f}%)")
                print(f"\n📈 COMPANY METRICS:")
                print(f"  • Average Team Size: {stats['avg_team_size']:.1f}")
                print(f"  • Average Company Age: {stats['avg_company_age']:.1f} years")
                print(f"\n🏷️ TOP INDUSTRIES:")
                for industry, count in list(stats['top_industries'].items())[:5]:
                    print(f"  • {industry}: {count}")
                
                # Show dashboard
                dashboard = year_analysis.create_year_dashboard(year)
                dashboard.show()
                
            elif analysis_type.value == 'Year Comparison':
                y1 = year1_selector.value
                y2 = year2_selector.value
                print(f"📊 Comparing YC Batch Years: {y1} vs {y2}")
                print("="*60)
                
                comparison = year_analysis.compare_years(y1, y2)
                
                print(f"\n📈 GROWTH METRICS:")
                print(f"  • Companies Growth: {comparison['companies_growth']:+.1f}%")
                print(f"  • Success Rate Change: {comparison['success_rate_change']:+.1f} percentage points")
                print(f"  • AI Companies Growth: {comparison['ai_growth']:+.1f}%")
                print(f"  • B2B Companies Growth: {comparison['b2b_growth']:+.1f}%")
                
                print(f"\n📊 YEAR {y1} STATISTICS:")
                stats1 = comparison['stats1']
                print(f"  • Total Companies: {stats1['total_companies']:,}")
                print(f"  • Success Rate: {stats1['success_rate']:.1%}")
                print(f"  • AI Companies: {stats1['ai_companies']}")
                
                print(f"\n📊 YEAR {y2} STATISTICS:")
                stats2 = comparison['stats2']
                print(f"  • Total Companies: {stats2['total_companies']:,}")
                print(f"  • Success Rate: {stats2['success_rate']:.1%}")
                print(f"  • AI Companies: {stats2['ai_companies']}")
                
            elif analysis_type.value == 'Multi-Year Trends':
                print(f"📊 Multi-Year Trends Analysis")
                print("="*60)
                
                # Show multi-year dashboard
                dashboard = year_analysis.create_multi_year_comparison()
                dashboard.show()
                
                # Calculate overall trends
                recent_years = year_analysis.years[-10:]
                recent_stats = [year_analysis.get_year_statistics(y) for y in recent_years]
                
                print(f"\n📈 OVERALL TRENDS ({recent_years[0]}-{recent_years[-1]}):")
                print(f"  • Total Companies Growth: {((recent_stats[-1]['total_companies'] - recent_stats[0]['total_companies']) / max(recent_stats[0]['total_companies'], 1) * 100):+.1f}%")
                print(f"  • Average Success Rate: {np.mean([s['success_rate'] for s in recent_stats]):.1%}")
                print(f"  • AI Companies Growth: {((recent_stats[-1]['ai_companies'] - recent_stats[0]['ai_companies']) / max(recent_stats[0]['ai_companies'], 1) * 100):+.1f}%")
                print(f"  • International Expansion: {((recent_stats[-1]['international_companies'] - recent_stats[0]['international_companies']) / max(recent_stats[0]['international_companies'], 1) * 100):+.1f}%")
    
    def on_ai_insights_click(b):
        with output:
            output.clear_output()
            
            if analysis_type.value == 'Single Year Analysis':
                year = year_selector.value
                print(f"🧠 Generating AI Insights for Year {year}...")
                print("="*60)
                
                insights = year_analysis.generate_year_insights(year)
                print(insights)
            else:
                print("⚠️ AI Insights available only for Single Year Analysis")
                print("Please select 'Single Year Analysis' and try again")
    
    analyze_button.on_click(on_analyze_click)
    ai_insights_button.on_click(on_ai_insights_click)
    
    # Layout
    single_year_controls = widgets.HBox([year_selector])
    comparison_controls = widgets.HBox([year1_selector, year2_selector])
    buttons = widgets.HBox([analyze_button, ai_insights_button])
    
    display(widgets.VBox([
        analysis_type,
        widgets.Label('Single Year Analysis:'),
        single_year_controls,
        widgets.Label('Year Comparison:'),
        comparison_controls,
        buttons,
        output
    ]))
    
    return year_selector, year1_selector, year2_selector, analysis_type, analyze_button, ai_insights_button, output

print("\n🎛️ Interactive Year-Level Analysis Ready!")
print("="*60)
print("📚 FEATURES:")
print("  1. Single Year Analysis - Deep dive into any YC batch year")
print("  2. Year Comparison - Compare two years side-by-side")
print("  3. Multi-Year Trends - Visualize trends across multiple years")
print("  4. AI-Powered Insights - Get AI-generated insights for any year")
print("\n💡 Run: create_year_analysis_widgets()")
print("    to launch the interactive year-level analysis dashboard!")
print("="*60)



🎛️ Interactive Year-Level Analysis Ready!
📚 FEATURES:
  1. Single Year Analysis - Deep dive into any YC batch year
  2. Year Comparison - Compare two years side-by-side
  3. Multi-Year Trends - Visualize trends across multiple years
  4. AI-Powered Insights - Get AI-generated insights for any year

💡 Run: create_year_analysis_widgets()
    to launch the interactive year-level analysis dashboard!


In [83]:
# Launch the interactive year-level analysis dashboard
create_year_analysis_widgets()


VBox(children=(RadioButtons(description='Analysis Type:', options=('Single Year Analysis', 'Year Comparison', …


In [84]:
# Quick year analysis demo
def quick_year_analysis(year=None):
    """Quick analysis of a specific year"""
    if year is None:
        year = year_analysis.years[-1]  # Most recent year
    
    print(f"📊 Quick Analysis: YC Batch Year {year}")
    print("="*60)
    
    stats = year_analysis.get_year_statistics(year)
    
    total = stats['total_companies']
    # Guard the percentage math so an empty year prints 0.0% instead of raising ZeroDivisionError
    ai_share = stats['ai_companies'] / total if total else 0.0
    intl_share = stats['international_companies'] / total if total else 0.0

    print(f"\n🎯 SNAPSHOT:")
    print(f"  • {total:,} companies")
    print(f"  • {stats['success_rate']:.1%} success rate")
    print(f"  • {stats['ai_companies']} AI companies ({ai_share:.1%})")
    print(f"  • {stats['international_companies']} international ({intl_share:.1%})")
    
    print(f"\n🏷️ TOP 5 INDUSTRIES:")
    for i, (industry, count) in enumerate(list(stats['top_industries'].items())[:5], 1):
        print(f"  {i}. {industry}: {count} companies")
    
    print(f"\n📍 TOP 5 LOCATIONS:")
    for i, (location, count) in enumerate(list(stats['top_locations'].items())[:5], 1):
        print(f"  {i}. {location}: {count} companies")
    
    # Show visualization
    dashboard = year_analysis.create_year_dashboard(year)
    dashboard.show()

print("💡 Try: quick_year_analysis(2023)")
print("    for instant insights on any year!")


💡 Try: quick_year_analysis(2023)
    for instant insights on any year!
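
`quick_year_analysis` depends on `year_analysis.get_year_statistics`, which is defined earlier in the notebook. As a rough, standalone sketch of the kind of dictionary it appears to return (the keys are inferred from how `stats` is used above; the function name `year_statistics`, the record fields, and the toy records are all assumptions for illustration, not the notebook's implementation):

```python
from collections import Counter


def year_statistics(companies, year):
    """Summarize one batch year from a list of company dicts (hypothetical schema)."""
    batch = [c for c in companies if c["batch_year"] == year]
    total = len(batch)
    return {
        "total_companies": total,
        # Guard against an empty year so the rate is 0.0 rather than an error.
        "success_rate": sum(c["is_success"] for c in batch) / total if total else 0.0,
        "ai_companies": sum(c["is_ai"] for c in batch),
        "international_companies": sum(c["is_international"] for c in batch),
        # most_common() orders by count, so the first key is the top entry.
        "top_industries": dict(Counter(c["industry"] for c in batch).most_common(5)),
        "top_locations": dict(Counter(c["location"] for c in batch).most_common(5)),
    }


# Toy records for illustration only; these are not rows from the YC dataset.
toy_companies = [
    {"batch_year": 2023, "is_success": True, "is_ai": True,
     "is_international": False, "industry": "b2b", "location": "San Francisco"},
    {"batch_year": 2023, "is_success": False, "is_ai": True,
     "is_international": True, "industry": "b2b", "location": "London"},
    {"batch_year": 2023, "is_success": False, "is_ai": False,
     "is_international": False, "industry": "fintech", "location": "San Francisco"},
    {"batch_year": 2022, "is_success": False, "is_ai": False,
     "is_international": False, "industry": "saas", "location": "New York"},
]

toy_stats = year_statistics(toy_companies, 2023)
print(toy_stats["total_companies"], f"{toy_stats['success_rate']:.1%}")
print(toy_stats["top_industries"])
```

Because the industry and location counts are stored in descending order, taking the first five items of each dictionary (as `quick_year_analysis` does) yields the top five.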


## 16. Multi-Year Trend Analysis & Forecasting


In [85]:
# Multi-year trends analysis and comparison
print("📊 Multi-Year Trends Analysis")
print("="*60)

# Show multi-year trends
multi_year_dashboard = year_analysis.create_multi_year_comparison()
multi_year_dashboard.show()

# Calculate and display key trends
recent_years = year_analysis.years[-5:]
print(f"\n📈 KEY TRENDS (Last 5 Years: {recent_years[0]}-{recent_years[-1]}):")

for year in recent_years:
    stats = year_analysis.get_year_statistics(year)
    print(f"\n{year}:")
    print(f"  • Companies: {stats['total_companies']:,}")
    print(f"  • Success Rate: {stats['success_rate']:.1%}")
    print(f"  • AI Companies: {stats['ai_companies']} ({stats['ai_companies']/stats['total_companies']*100:.1f}%)")
    print(f"  • Top Industry: {list(stats['top_industries'].keys())[0] if stats['top_industries'] else 'N/A'}")


📊 Multi-Year Trends Analysis



📈 KEY TRENDS (Last 5 Years: 2021-2025):

2021:
  • Companies: 727
  • Success Rate: 8.3%
  • AI Companies: 130 (17.9%)
  • Top Industry: saas

2022:
  • Companies: 634
  • Success Rate: 5.7%
  • AI Companies: 179 (28.2%)
  • Top Industry: b2b

2023:
  • Companies: 496
  • Success Rate: 5.0%
  • AI Companies: 225 (45.4%)
  • Top Industry: b2b

2024:
  • Companies: 596
  • Success Rate: 0.8%
  • AI Companies: 263 (44.1%)
  • Top Industry: artificial-intelligence

2025:
  • Companies: 534
  • Success Rate: 0.0%
  • AI Companies: 255 (47.8%)
  • Top Industry: ai
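
The AI share printed for each year is the AI company count divided by the batch total. To make the year-over-year movement explicit, the short sketch below recomputes those shares from the counts in the output above and prints the change in percentage points. It is a standalone illustration and does not use the notebook's `year_analysis` object.

```python
# (total companies, AI companies) per batch year, copied from the output above.
counts = {
    2021: (727, 130),
    2022: (634, 179),
    2023: (496, 225),
    2024: (596, 263),
    2025: (534, 255),
}

# Share of AI companies in each batch year, as a fraction between 0 and 1.
ai_share = {year: ai / total for year, (total, ai) in counts.items()}

years = sorted(ai_share)
for prev, curr in zip(years, years[1:]):
    # The difference between two percentages is expressed in percentage points (pp).
    change_pp = (ai_share[curr] - ai_share[prev]) * 100
    print(f"{prev} -> {curr}: {ai_share[curr]:.1%} ({change_pp:+.1f} pp)")
```

Reporting percentage points rather than relative growth keeps the comparison readable when the batch sizes differ from year to year.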


## 17. Quick Start Guide


In [86]:
# Quick Start Demo
print("🚀 YC GenAI Analysis - Quick Start Guide")
print("="*60)
print()
print("📚 AVAILABLE FUNCTIONS:")
print()
print("1. analyze_company_interactive(company_name='Airbnb')")
print("   → Analyze any company with AI insights")
print()
print("2. quick_market_analysis()")
print("   → Get instant market trends and opportunities")
print()
print("3. create_analysis_widgets()")
print("   → Launch interactive analysis interface")
print()
print("4. create_year_analysis_widgets()")
print("   → Launch year-level batch analysis")
print()
print("5. quick_year_analysis(2023)")
print("   → Quick analysis of a specific year")
print()
print("6. market_intel.generate_market_report()")
print("   → Generate comprehensive market intelligence report")
print()
print("7. intelligence.predict_success_probability(company_data)")
print("   → Predict success with AI reasoning")
print()
print("="*60)
print()
print("💡 TRY IT NOW:")
print("Run: analyze_company_interactive()")
print("     to analyze a random company with AI insights!")
print()
print("🔑 NOTE: Requires OPENAI_API_KEY environment variable")
print("="*60)


🚀 YC GenAI Analysis - Quick Start Guide

📚 AVAILABLE FUNCTIONS:

1. analyze_company_interactive(company_name='Airbnb')
   → Analyze any company with AI insights

2. quick_market_analysis()
   → Get instant market trends and opportunities

3. create_analysis_widgets()
   → Launch interactive analysis interface

4. create_year_analysis_widgets()
   → Launch year-level batch analysis

5. quick_year_analysis(2023)
   → Quick analysis of a specific year

6. market_intel.generate_market_report()
   → Generate comprehensive market intelligence report

7. intelligence.predict_success_probability(company_data)
   → Predict success with AI reasoning


💡 TRY IT NOW:
Run: analyze_company_interactive()
     to analyze a random company with AI insights!

🔑 NOTE: Requires OPENAI_API_KEY environment variable
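
Since the AI-powered functions need `OPENAI_API_KEY`, it can help to confirm the variable is set before calling them. The helper below is a minimal sketch; `has_openai_key` is a hypothetical name and not part of the notebook.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    # Accept a plain dict for testing; default to the real process environment.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if has_openai_key():
    print("OPENAI_API_KEY found; the AI-powered functions should be usable.")
else:
    print("OPENAI_API_KEY is not set; set it before calling the AI-powered functions.")
```

This only checks that a value is present. It does not verify that the key is valid.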


## 18. Interactive Analysis Demos


In [87]:
# Launch interactive demos
print("🎛️ Launching Interactive Demos...")
print("="*60)

# Demo 1: Company Analysis Interface
print("📊 Demo 1: Company Analysis Interface")
create_analysis_widgets()

print("\n" + "="*60)

# Demo 2: Year-Level Analysis Interface  
print("📊 Demo 2: Year-Level Batch Analysis Interface")
create_year_analysis_widgets()

print("\n" + "="*60)

# Demo 3: Market Intelligence
print("📊 Demo 3: Market Intelligence Analysis")
quick_market_analysis()


🎛️ Launching Interactive Demos...
📊 Demo 1: Company Analysis Interface


HBox(children=(Dropdown(description='Company:', layout=Layout(width='400px'), options=('Bear (Fall 2025)', 'Cl…

Output()


📊 Demo 2: Year-Level Batch Analysis Interface


VBox(children=(RadioButtons(description='Analysis Type:', options=('Single Year Analysis', 'Year Comparison', …


📊 Demo 3: Market Intelligence Analysis
🔍 Analyzing Market Trends...

📊 INDUSTRY TRENDS:
1. Emerging Hot Sectors: The data shows that B2B, SaaS, and AI-related startups are the most popular in the YC portfolio. These sectors are growing rapidly, indicating a strong interest in these areas. AI, both in its general form and more specific applications like generative AI and machine learning, is particularly noteworthy. The rise in AI-related startups suggests a growing trend towards automation and data-driven decision making.

2. Saturated Markets: While the B2B and SaaS sectors are popular, they may also be nearing saturation, given the high number of startups in these areas. The same could be said for the AI sector. However, the continued growth in these sectors suggests that there is still room for innovation and disruption.

3. Market Opportunities: On the other end of the spectrum, consumer-focused startups and marketplace platforms appear to be underserved areas. These sectors have 

## 19. Summary & Next Steps


### What This Notebook Provides:

**🧠 AI-Powered Analysis:**
- Intelligent company profiling and archetype detection
- Success prediction with AI-generated reasoning
- Competitive landscape analysis
- Market trend identification

**📊 Advanced Insights:**
- Real-time market intelligence
- Opportunity detection
- Industry trend analysis
- Geographic insights

**🎯 Interactive Tools:**
- Conversational company analysis
- Dynamic visualizations
- Market intelligence reports
- Predictive analytics
- Year-level batch analysis
- Multi-year trend visualization

### Key Advantages Over Traditional Analysis:

1. **Semantic Understanding** - AI comprehends company descriptions beyond keywords
2. **Predictive Narratives** - Explains why companies succeed rather than only reporting statistics
3. **Dynamic Intelligence** - Insights generated on demand from the loaded dataset
4. **Actionable Recommendations** - Specific guidance for founders and investors
5. **Competitive Intelligence** - Automated competitive landscape analysis
6. **Year-Level Insights** - Deep dive into specific batch years
7. **Trend Analysis** - Multi-year pattern recognition

### Next Steps:

1. **Set up API key** - `export OPENAI_API_KEY='your-key'`
2. **Run analysis** - Try `analyze_company_interactive()`
3. **Explore insights** - Use `quick_market_analysis()`
4. **Year analysis** - Use `create_year_analysis_widgets()`
5. **Generate reports** - Call `market_intel.generate_market_report()`

---

**🚀 Ready to transform YC data into actionable intelligence!**

*For questions or improvements, see README.md*


## 20. Requirements & Setup


### Requirements:

- **OpenAI API Key** - Set `OPENAI_API_KEY` environment variable
- **Python Libraries** - textblob, umap-learn, networkx, wordcloud, plus ipywidgets for the interactive widgets (see `requirements.txt` for the full list)
- **Data** - YC companies dataset (included)

### Cost Considerations:

- **GPT-4 API** - ~$0.03 per 1K input tokens, ~$0.06 per 1K output tokens (rates change, so check OpenAI's current pricing)
- **Typical Analysis** - $0.05-0.15 per company
- **Batch Analysis** - Use GPT-3.5-turbo for lower costs
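
To see how the per-token prices translate into a per-company figure, the sketch below applies the approximate GPT-4 rates listed above. The token counts (1,500 input, 500 output) and the batch size of 500 are assumed values chosen for illustration, not measurements from this notebook.

```python
# Approximate GPT-4 prices quoted above, in USD per 1K tokens.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06


def estimate_cost(input_tokens, output_tokens, n_companies=1):
    """Estimate API cost in USD for analyzing n_companies with the given token usage each."""
    per_company = (
        (input_tokens / 1000) * INPUT_PRICE_PER_1K
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    )
    return per_company * n_companies


# Hypothetical usage: 1,500 prompt tokens and 500 completion tokens per company.
single = estimate_cost(1500, 500)
batch = estimate_cost(1500, 500, n_companies=500)
print(f"One company:   ${single:.3f}")
print(f"500 companies: ${batch:.2f}")
```

Under these assumed token counts a single analysis costs about $0.075, which falls inside the $0.05-0.15 range given above. The same arithmetic shows why a cheaper model is worth considering for batch runs.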

### Setup Instructions:

1. **Install Dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **Set OpenAI API Key:**
   ```bash
   export OPENAI_API_KEY='your-key-here'
   ```

3. **Run the Notebook:**
   - Execute all cells in order
   - Use interactive widgets for analysis
   - Generate AI-powered insights

### Troubleshooting:

- **API Key Issues:** Ensure `OPENAI_API_KEY` is set correctly
- **Import Errors:** Install missing packages with pip
- **Data Issues:** Check that data files are in the correct location
- **Widget Issues:** Ensure ipywidgets is installed and enabled
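
For import errors, a quick way to find out which packages are missing is to look each one up without importing it. The sketch below uses the standard library's `importlib.util.find_spec`. The package list mirrors the libraries named in this section, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# pip package name -> import name. Note that umap-learn is imported as "umap".
REQUIRED = {
    "textblob": "textblob",
    "umap-learn": "umap",
    "networkx": "networkx",
    "wordcloud": "wordcloud",
    "ipywidgets": "ipywidgets",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose top-level module cannot be found."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages. Try: pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```

This only confirms that the modules can be located. Widget rendering problems can also come from the Jupyter front end, which this check does not cover.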

---

**🎯 Your YC analysis toolkit is ready to use!**
