# Y Combinator Companies: Comprehensive Analysis

**Deep dive into 20 years of YC data** - Success factors, trends, predictions, and actionable insights for founders.

This notebook analyzes 7,858+ YC companies to answer critical questions:
- What predicts startup success?
- Should you start solo or with co-founders?
- Which industries are saturated vs underserved?
- How do successful companies grow their teams?
- What's the optimal time to launch?

---

In [None]:
# Setup & Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime
from collections import Counter
import re

warnings.filterwarnings('ignore')

# Plot settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Color palettes
STATUS_COLORS = {
    'Public': '#00CC66',
    'Acquired': '#0099FF', 
    'Active': '#FFAA00',
    'Inactive': '#FF3333',
    'Dead': '#999999'
}

print("✓ Libraries loaded")

## 1. Data Loading & Preparation

In [None]:
# Load latest YC data
data_path = '../data/2025-10-05-yc.companies.jl'

try:
    df = pd.read_json(data_path, lines=True)
    print(f"✓ Loaded {len(df):,} companies from {data_path.split('/')[-1]}")
except FileNotFoundError:
    # Fallback to older data
    data_path = '../data/2025-05-03.jl'
    df = pd.read_json(data_path, lines=True)
    print(f"⚠ Using fallback data: {len(df):,} companies from {data_path.split('/')[-1]}")

# Display sample
print(f"\nColumns: {', '.join(df.columns.tolist())}")
df.head(3)

In [None]:
# Data Cleaning & Feature Engineering

# 1. Extract batch year and season
def parse_batch(batch):
    """Extract season and year from batch string"""
    if pd.isna(batch):
        return None, None
    
    # Handle new format: 'Winter 2025', 'Fall 2024'
    match = re.search(r'(Winter|Summer|Spring|Fall|W|S|IK)\s*(\d{2,4})', str(batch))
    if match:
        season = match.group(1)
        year = match.group(2)
        
        # Convert to 4-digit year
        if len(year) == 2:
            year = int(year)
            year = 2000 + year if year < 50 else 1900 + year
        else:
            year = int(year)
            
        # Map season abbreviations
        season_map = {'W': 'Winter', 'S': 'Summer', 'IK': 'IK'}
        season = season_map.get(season, season)
        
        return season, year
    return None, None

df[['batch_season', 'batch_year']] = df['batch'].apply(lambda x: pd.Series(parse_batch(x)))

# 2. Founder type
df['founder_type'] = df['num_founders'].apply(
    lambda x: 'Solo' if x == 1 else ('Duo' if x == 2 else ('Trio' if x == 3 else '4+' if x >= 4 else 'Unknown'))
)
df['is_solo'] = df['num_founders'] == 1

# 3. Company age (years since founding)
current_year = 2025
df['company_age'] = df['year_founded'].apply(
    lambda x: current_year - x if pd.notna(x) else None
)

# Filter for mature companies (to avoid survivorship bias in success analysis)
df['is_mature'] = df['company_age'] >= 3  # At least 3 years old

# 4. Success category
def categorize_success(status):
    if pd.isna(status):
        return 'Unknown'
    status = status.lower()
    if 'public' in status:
        return 'Public'
    elif 'acquired' in status:
        return 'Acquired'
    elif 'active' in status:
        return 'Active'
    elif any(word in status for word in ['inactive', 'dead', 'closed']):
        return 'Inactive'
    return 'Other'

df['success_category'] = df['status'].apply(categorize_success)
df['is_successful'] = df['success_category'].isin(['Public', 'Acquired'])

# 5. Has team grown?
df['has_grown'] = df['team_size'] > df['num_founders']

# 6. Location features
df['is_sf_bay'] = df['location'].fillna('').str.contains('San Francisco|Bay Area|Palo Alto|Mountain View', case=False, na=False)
df['is_us'] = df['country'] == 'US'

# 7. Tag analysis
df['num_tags'] = df['tags'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['is_ai'] = df['tags'].apply(
    lambda x: any('ai' in tag.lower() or 'artificial' in tag.lower() for tag in x) if isinstance(x, list) else False
)
df['is_b2b'] = df['tags'].apply(
    lambda x: any('b2b' in tag.lower() for tag in x) if isinstance(x, list) else False
)

# 8. Has online presence
df['has_linkedin'] = df['linkedin_url'].notna() & (df['linkedin_url'] != '')
df['has_crunchbase'] = df['cb_url'].notna() & (df['cb_url'] != '')
df['has_website'] = df['website'].notna() & (df['website'] != '')

print("✓ Feature engineering complete")
print(f"\nNew features created:")
print(f"  - batch_season, batch_year")
print(f"  - founder_type, is_solo")
print(f"  - company_age, is_mature, success_category, is_successful")
print(f"  - has_grown, is_sf_bay, is_us")
print(f"  - num_tags, is_ai, is_b2b")
print(f"  - has_linkedin, has_crunchbase, has_website")

# Data quality check
print(f"\n📊 Data Quality:")
print(f"  Missing year_founded: {df['year_founded'].isna().sum():,} ({df['year_founded'].isna().sum()/len(df)*100:.1f}%)")
print(f"  Missing team_size: {df['team_size'].isna().sum():,} ({df['team_size'].isna().sum()/len(df)*100:.1f}%)")
print(f"  Missing location: {df['location'].isna().sum():,} ({df['location'].isna().sum()/len(df)*100:.1f}%)")

# Survivorship bias warning
print(f"\n⚠️  METHODOLOGICAL NOTE:")
print(f"  Recent companies have not had time to exit. For success rate analysis,")
print(f"  we filter to 'mature' companies (≥3 years old) to reduce survivorship bias.")
print(f"  Mature companies: {df['is_mature'].sum():,} ({df['is_mature'].mean()*100:.1f}%)")

---
## 2. Executive Summary Dashboard

High-level KPIs and success metrics

In [None]:
# Key Metrics
total_companies = len(df)
total_founders = df['num_founders'].sum()
successful_companies = df['is_successful'].sum()
success_rate = successful_companies / total_companies * 100
solo_success_rate = df[df['is_solo']]['is_successful'].mean() * 100
team_success_rate = df[~df['is_solo']]['is_successful'].mean() * 100

print("═" * 60)
print("  Y COMBINATOR PORTFOLIO OVERVIEW")
print("═" * 60)
print(f"\n📈 PORTFOLIO METRICS:")
print(f"  Total Companies:        {total_companies:,}")
print(f"  Total Founders:         {total_founders:,.0f}")
print(f"  Public/Acquired:        {successful_companies:,} ({success_rate:.1f}%)")
print(f"  Active Companies:       {(df['success_category'] == 'Active').sum():,}")
print(f"  Inactive/Dead:          {(df['success_category'] == 'Inactive').sum():,}")

print(f"\n👥 FOUNDER COMPOSITION:")
print(f"  Solo Founders:          {df['is_solo'].sum():,} ({df['is_solo'].mean()*100:.1f}%)")
print(f"  2 Co-founders:          {(df['num_founders'] == 2).sum():,} ({(df['num_founders'] == 2).mean()*100:.1f}%)")
print(f"  3+ Co-founders:         {(df['num_founders'] >= 3).sum():,} ({(df['num_founders'] >= 3).mean()*100:.1f}%)")

print(f"\n🎯 SUCCESS RATES:")
print(f"  Solo Founders:          {solo_success_rate:.1f}%")
print(f"  Team Founders:          {team_success_rate:.1f}%")
print(f"  Difference:             {team_success_rate - solo_success_rate:+.1f}pp")

print(f"\n🌍 GEOGRAPHIC DISTRIBUTION:")
print(f"  SF Bay Area:            {df['is_sf_bay'].sum():,} ({df['is_sf_bay'].mean()*100:.1f}%)")
print(f"  United States:          {df['is_us'].sum():,} ({df['is_us'].mean()*100:.1f}%)")
print(f"  International:          {(~df['is_us']).sum():,} ({(~df['is_us']).mean()*100:.1f}%)")

print(f"\n🤖 INDUSTRY TRENDS:")
print(f"  AI-related companies:   {df['is_ai'].sum():,} ({df['is_ai'].mean()*100:.1f}%)")
print(f"  B2B companies:          {df['is_b2b'].sum():,} ({df['is_b2b'].mean()*100:.1f}%)")

print("\n" + "═" * 60)

In [None]:
# Visual Dashboard
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=(
        'Company Status Distribution',
        'Success Rate by Founder Count',
        'Geographic Distribution',
        'Batch Trends Over Time',
        'Top 10 Industries',
        'Team Size Distribution'
    ),
    specs=[
        [{'type': 'pie'}, {'type': 'bar'}, {'type': 'bar'}],
        [{'type': 'scatter'}, {'type': 'bar'}, {'type': 'histogram'}]
    ]
)

# 1. Status Distribution (Pie)
status_counts = df['success_category'].value_counts()
fig.add_trace(
    go.Pie(
        labels=status_counts.index,
        values=status_counts.values,
        marker=dict(colors=[STATUS_COLORS.get(s, '#CCCCCC') for s in status_counts.index])
    ),
    row=1, col=1
)

# 2. Success by Founder Count
success_by_founders = df.groupby('founder_type')['is_successful'].mean() * 100
success_by_founders = success_by_founders.reindex(['Solo', 'Duo', 'Trio', '4+'])
fig.add_trace(
    go.Bar(x=success_by_founders.index, y=success_by_founders.values, marker_color='#0099FF'),
    row=1, col=2
)

# 3. Top Countries
top_countries = df['country'].value_counts().head(10)
fig.add_trace(
    go.Bar(x=top_countries.values, y=top_countries.index, orientation='h', marker_color='#00CC66'),
    row=1, col=3
)

# 4. Batch Trends
batch_trends = df.groupby('batch_year').size()
fig.add_trace(
    go.Scatter(x=batch_trends.index, y=batch_trends.values, mode='lines+markers', marker_color='#FF6600'),
    row=2, col=1
)

# 5. Top Industries
all_tags = [tag for tags in df['tags'].dropna() for tag in tags if isinstance(tags, list)]
top_industries = pd.Series(all_tags).value_counts().head(10)
fig.add_trace(
    go.Bar(x=top_industries.values, y=top_industries.index, orientation='h', marker_color='#9933FF'),
    row=2, col=2
)

# 6. Team Size Distribution
fig.add_trace(
    go.Histogram(x=df['team_size'].dropna(), nbinsx=30, marker_color='#FF9900'),
    row=2, col=3
)

fig.update_layout(height=800, showlegend=False, title_text="YC Portfolio Dashboard", title_x=0.5)
fig.update_yaxes(title_text="Success Rate (%)", row=1, col=2)
fig.update_yaxes(title_text="# Companies", row=2, col=1)
fig.update_xaxes(title_text="Team Size", row=2, col=3)

fig.show()

---
## 3. Success Factor Analysis

What predicts startup success? Deep correlation and pattern analysis.

In [None]:
# Success rate by various factors

def analyze_success_factor(column, title, top_n=15, min_companies=20):
    """Analyze success rate for a given categorical column
    
    Note: Uses only mature companies (≥3 years old) to avoid survivorship bias.
    """
    # Filter to mature companies only
    mature_df = df[df['is_mature']].copy()
    
    success_analysis = mature_df.groupby(column).agg({
        'is_successful': ['sum', 'mean', 'count']
    }).round(3)
    success_analysis.columns = ['Successful', 'Success_Rate', 'Total']
    
    # Increased minimum to 20 for statistical significance
    success_analysis = success_analysis[success_analysis['Total'] >= min_companies]
    success_analysis = success_analysis.sort_values('Success_Rate', ascending=False).head(top_n)
    
    # Plot
    fig = go.Figure()
    fig.add_trace(go.Bar(
        y=success_analysis.index,
        x=success_analysis['Success_Rate'] * 100,
        orientation='h',
        text=[f"{val*100:.1f}% (n={count:.0f})" 
              for val, count in zip(success_analysis['Success_Rate'], success_analysis['Total'])],
        textposition='outside',
        marker_color='#0099FF'
    ))
    
    fig.update_layout(
        title=f"{title}<br><sub>Mature companies only (≥3 years old), min {min_companies} companies</sub>",
        xaxis_title='Success Rate (%)',
        yaxis_title=column.replace('_', ' ').title(),
        height=max(400, top_n * 30)
    )
    fig.show()
    
    return success_analysis

print("🎯 Analyzing success factors...\n")
print("⚠️  Note: All success analysis uses MATURE companies (≥3 years) to avoid survivorship bias.")

In [None]:
# 1. Success by Batch Year
batch_success = analyze_success_factor('batch_year', 'Success Rate by Batch Year', top_n=20)

In [None]:
# 2. Success by Location (Top Cities)
location_success = analyze_success_factor('location', 'Success Rate by Location (Top 15 Cities)', top_n=15)

In [None]:
# 3. Correlation Analysis
# Numeric features correlation with success

numeric_features = ['num_founders', 'team_size', 'company_age', 'num_tags']
correlations = []

# Use mature companies only
mature_df = df[df['is_mature']].copy()

for feature in numeric_features:
    if feature in mature_df.columns:
        corr = mature_df[[feature, 'is_successful']].dropna().corr().iloc[0, 1]
        correlations.append({'Feature': feature, 'Correlation': corr})

corr_df = pd.DataFrame(correlations).sort_values('Correlation', ascending=False)

fig = go.Figure(go.Bar(
    x=corr_df['Correlation'],
    y=corr_df['Feature'],
    orientation='h',
    marker_color=['#00CC66' if x > 0 else '#FF3333' for x in corr_df['Correlation']]
))

fig.update_layout(
    title='Feature Correlation with Success (Public/Acquired)<br><sub>Pearson correlation - association only, not causation</sub>',
    xaxis_title='Correlation Coefficient',
    height=300
)
fig.show()

print("\n📊 Correlation Insights:")
print("⚠️  Important: Correlation ≠ Causation. These show association, not cause and effect.\n")
for _, row in corr_df.iterrows():
    direction = "positively" if row['Correlation'] > 0 else "negatively"
    strength = "strongly" if abs(row['Correlation']) > 0.3 else ("moderately" if abs(row['Correlation']) > 0.1 else "weakly")
    print(f"  • {row['Feature']}: {strength} {direction} associated ({row['Correlation']:.3f})")

In [None]:
# 4. Team Size vs Success (Scatter)
team_size_bins = pd.cut(df['team_size'].dropna(), bins=[0, 1, 5, 10, 20, 50, 100, 1000, 10000])
team_success = df.groupby(team_size_bins)['is_successful'].agg(['mean', 'count'])
team_success = team_success[team_success['count'] >= 10]

fig = go.Figure()
fig.add_trace(go.Bar(
    x=[str(x) for x in team_success.index],
    y=team_success['mean'] * 100,
    text=[f"{val*100:.1f}%<br>(n={count:.0f})" for val, count in zip(team_success['mean'], team_success['count'])],
    textposition='outside',
    marker_color='#9933FF'
))

fig.update_layout(
    title='Success Rate by Team Size Range',
    xaxis_title='Team Size',
    yaxis_title='Success Rate (%)',
    height=400
)
fig.show()

---
## 4. Deep Solo Founder Analysis

Comprehensive analysis of solo vs team-founded companies

In [None]:
# Solo vs Team Comparison
solo_df = df[df['is_solo']].copy()
team_df = df[~df['is_solo']].copy()

comparison = pd.DataFrame({
    'Metric': [
        'Total Companies',
        'Success Rate (%)',
        'Avg Team Size',
        'Avg Company Age (years)',
        'SF Bay Area (%)',
        'AI-related (%)',
        'B2B (%)',
        'Has LinkedIn (%)',
        'Has Grown Team (%)'
    ],
    'Solo Founders': [
        len(solo_df),
        solo_df['is_successful'].mean() * 100,
        solo_df['team_size'].mean(),
        solo_df['company_age'].mean(),
        solo_df['is_sf_bay'].mean() * 100,
        solo_df['is_ai'].mean() * 100,
        solo_df['is_b2b'].mean() * 100,
        solo_df['has_linkedin'].mean() * 100,
        solo_df['has_grown'].mean() * 100
    ],
    'Team Founders': [
        len(team_df),
        team_df['is_successful'].mean() * 100,
        team_df['team_size'].mean(),
        team_df['company_age'].mean(),
        team_df['is_sf_bay'].mean() * 100,
        team_df['is_ai'].mean() * 100,
        team_df['is_b2b'].mean() * 100,
        team_df['has_linkedin'].mean() * 100,
        team_df['has_grown'].mean() * 100
    ]
})

comparison['Difference'] = comparison['Team Founders'] - comparison['Solo Founders']
comparison = comparison.round(2)

print("\n" + "═" * 80)
print("  SOLO vs TEAM FOUNDERS: HEAD-TO-HEAD COMPARISON")
print("═" * 80)
print(comparison.to_string(index=False))
print("═" * 80)

In [None]:
# Top Industries for Solo Founders (by success rate)
solo_tags = [tag for tags in solo_df['tags'].dropna() for tag in tags if isinstance(tags, list)]
solo_tag_counts = pd.Series(solo_tags).value_counts()

# Calculate success rate per industry for solo founders
industry_success = {}
for tag in solo_tag_counts.head(30).index:
    companies_with_tag = solo_df[solo_df['tags'].apply(
        lambda x: tag in x if isinstance(x, list) else False
    )]
    if len(companies_with_tag) >= 5:
        success_rate = companies_with_tag['is_successful'].mean()
        industry_success[tag] = {
            'success_rate': success_rate,
            'count': len(companies_with_tag),
            'successful': companies_with_tag['is_successful'].sum()
        }

industry_df = pd.DataFrame(industry_success).T.sort_values('success_rate', ascending=False).head(15)

fig = go.Figure()
fig.add_trace(go.Bar(
    y=industry_df.index,
    x=industry_df['success_rate'] * 100,
    orientation='h',
    text=[f"{val*100:.1f}% ({count:.0f} cos)" 
          for val, count in zip(industry_df['success_rate'], industry_df['count'])],
    textposition='outside',
    marker_color='#FF6600'
))

fig.update_layout(
    title='Top 15 Industries for Solo Founders by Success Rate',
    xaxis_title='Success Rate (%)',
    yaxis_title='Industry',
    height=500
)
fig.show()

In [None]:
# Solo Founder Trends Over Time
solo_trends = df.groupby('batch_year').agg({
    'is_solo': ['sum', 'mean']
})
solo_trends.columns = ['Solo_Count', 'Solo_Percentage']
solo_trends = solo_trends.dropna()

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Bar(x=solo_trends.index, y=solo_trends['Solo_Count'], name='# Solo Founders', marker_color='#0099FF'),
    secondary_y=False
)

fig.add_trace(
    go.Scatter(x=solo_trends.index, y=solo_trends['Solo_Percentage']*100, 
               name='% Solo', mode='lines+markers', marker_color='#FF3333', line=dict(width=3)),
    secondary_y=True
)

fig.update_layout(title='Solo Founder Trends: YC Acceptance Over Time', height=400)
fig.update_yaxes(title_text="Number of Solo Founders", secondary_y=False)
fig.update_yaxes(title_text="Percentage Solo (%)", secondary_y=True)
fig.update_xaxes(title_text="Batch Year")

fig.show()

recent_solo_pct = solo_trends['Solo_Percentage'].tail(3).mean() * 100
early_solo_pct = solo_trends['Solo_Percentage'].head(5).mean() * 100
print(f"\n📈 Solo Founder Trend:")
print(f"  Early batches (avg): {early_solo_pct:.1f}% solo")
print(f"  Recent batches (avg): {recent_solo_pct:.1f}% solo")
print(f"  Change: {recent_solo_pct - early_solo_pct:+.1f}pp")

---
## 5. Temporal & Batch Analysis

How has YC evolved over 20 years?

In [None]:
# Batch statistics over time
batch_stats = df.groupby('batch_year').agg({
    'company_id': 'count',
    'is_successful': 'mean',
    'num_founders': 'mean',
    'team_size': 'mean',
    'is_ai': 'mean',
    'is_b2b': 'mean'
}).round(3)

batch_stats.columns = ['Batch_Size', 'Success_Rate', 'Avg_Founders', 'Avg_Team_Size', 'AI_Pct', 'B2B_Pct']
batch_stats = batch_stats.dropna()

# Plot multiple trends
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Batch Size Over Time', 'Success Rate by Vintage', 
                    'AI Companies Trend', 'B2B Companies Trend')
)

# 1. Batch size
fig.add_trace(
    go.Scatter(x=batch_stats.index, y=batch_stats['Batch_Size'], 
               mode='lines+markers', marker_color='#0099FF', fill='tozeroy'),
    row=1, col=1
)

# 2. Success rate
fig.add_trace(
    go.Scatter(x=batch_stats.index, y=batch_stats['Success_Rate']*100, 
               mode='lines+markers', marker_color='#00CC66'),
    row=1, col=2
)

# 3. AI trend
fig.add_trace(
    go.Scatter(x=batch_stats.index, y=batch_stats['AI_Pct']*100, 
               mode='lines+markers', marker_color='#9933FF', fill='tozeroy'),
    row=2, col=1
)

# 4. B2B trend
fig.add_trace(
    go.Scatter(x=batch_stats.index, y=batch_stats['B2B_Pct']*100, 
               mode='lines+markers', marker_color='#FF6600', fill='tozeroy'),
    row=2, col=2
)

fig.update_yaxes(title_text="# Companies", row=1, col=1)
fig.update_yaxes(title_text="Success Rate (%)", row=1, col=2)
fig.update_yaxes(title_text="% AI Companies", row=2, col=1)
fig.update_yaxes(title_text="% B2B Companies", row=2, col=2)

fig.update_layout(height=700, showlegend=False, title_text="YC Evolution: 20-Year Trends")
fig.show()

In [None]:
# Season comparison (Winter vs Summer)
season_stats = df.groupby('batch_season').agg({
    'company_id': 'count',
    'is_successful': 'mean',
    'num_founders': 'mean',
    'team_size': 'mean'
}).round(3)

season_stats.columns = ['Count', 'Success_Rate', 'Avg_Founders', 'Avg_Team_Size']

print("\n🌤️  SEASONAL BATCH COMPARISON:")
print(season_stats.to_string())

# Visual comparison
fig = go.Figure()
for season in season_stats.index:
    if pd.notna(season):
        fig.add_trace(go.Bar(
            name=season,
            x=['Companies', 'Success Rate (%)', 'Avg Founders', 'Avg Team Size'],
            y=[season_stats.loc[season, 'Count'], 
               season_stats.loc[season, 'Success_Rate']*100,
               season_stats.loc[season, 'Avg_Founders'],
               season_stats.loc[season, 'Avg_Team_Size']]
        ))

fig.update_layout(title='Batch Season Comparison', barmode='group', height=400)
fig.show()

---
## 6. Industry Deep Dive

Which industries are thriving? Which are oversaturated?

In [None]:
# Industry analysis - extract all tags
all_tags = []
for tags in df['tags'].dropna():
    if isinstance(tags, list):
        all_tags.extend(tags)

tag_counts = pd.Series(all_tags).value_counts()

# Success rate per industry
industry_analysis = []
for tag in tag_counts.head(50).index:
    tag_companies = df[df['tags'].apply(lambda x: tag in x if isinstance(x, list) else False)]
    if len(tag_companies) >= 10:
        industry_analysis.append({
            'Industry': tag,
            'Total': len(tag_companies),
            'Successful': tag_companies['is_successful'].sum(),
            'Success_Rate': tag_companies['is_successful'].mean(),
            'Active': (tag_companies['success_category'] == 'Active').sum(),
            'Avg_Team_Size': tag_companies['team_size'].mean(),
            'Solo_Pct': tag_companies['is_solo'].mean()
        })

industry_df = pd.DataFrame(industry_analysis).sort_values('Success_Rate', ascending=False)

print("\n🏆 TOP 20 INDUSTRIES BY SUCCESS RATE (min 10 companies):")
print(industry_df.head(20).to_string(index=False))

print("\n⚠️  BOTTOM 10 INDUSTRIES BY SUCCESS RATE:")
print(industry_df.tail(10).to_string(index=False))

In [None]:
# Industry saturation vs success (bubble chart)
fig = px.scatter(
    industry_df.head(30),
    x='Total',
    y='Success_Rate',
    size='Avg_Team_Size',
    color='Solo_Pct',
    hover_data=['Industry', 'Successful', 'Active'],
    text='Industry',
    title='Industry Map: Saturation vs Success Rate (Top 30)',
    labels={'Total': 'Number of Companies', 'Success_Rate': 'Success Rate', 'Solo_Pct': '% Solo Founders'},
    color_continuous_scale='RdYlGn'
)

fig.update_traces(textposition='top center')
fig.update_layout(height=600)
fig.show()

In [None]:
# Industry trends over time (top 10 industries)
top_industries = tag_counts.head(10).index

industry_time_data = []
for year in sorted(df['batch_year'].dropna().unique()):
    year_df = df[df['batch_year'] == year]
    for industry in top_industries:
        count = year_df['tags'].apply(lambda x: industry in x if isinstance(x, list) else False).sum()
        industry_time_data.append({
            'Year': year,
            'Industry': industry,
            'Count': count
        })

industry_time_df = pd.DataFrame(industry_time_data)

fig = px.line(
    industry_time_df,
    x='Year',
    y='Count',
    color='Industry',
    title='Industry Trends Over Time (Top 10 Industries)',
    labels={'Count': '# Companies per Year'}
)

fig.update_layout(height=500)
fig.show()

---
## 7. Geographic Intelligence

Where are successful companies built?

In [None]:
# Success by geography
geo_stats = df.groupby('country').agg({
    'company_id': 'count',
    'is_successful': ['sum', 'mean'],
    'team_size': 'mean',
    'is_solo': 'mean'
}).round(3)

geo_stats.columns = ['Total', 'Successful', 'Success_Rate', 'Avg_Team_Size', 'Solo_Pct']
geo_stats = geo_stats[geo_stats['Total'] >= 10].sort_values('Success_Rate', ascending=False)

print("\n🌍 SUCCESS RATE BY COUNTRY (min 10 companies):")
print(geo_stats.head(20).to_string())

In [None]:
# City-level analysis
city_stats = df.groupby('location').agg({
    'company_id': 'count',
    'is_successful': ['sum', 'mean'],
    'is_solo': 'mean'
}).round(3)

city_stats.columns = ['Total', 'Successful', 'Success_Rate', 'Solo_Pct']
city_stats = city_stats[city_stats['Total'] >= 20].sort_values('Total', ascending=False).head(20)

fig = go.Figure()
fig.add_trace(go.Bar(
    y=city_stats.index,
    x=city_stats['Total'],
    orientation='h',
    name='Total Companies',
    marker_color='#0099FF',
    text=[f"{val:.0f} ({sr*100:.1f}%)" for val, sr in zip(city_stats['Total'], city_stats['Success_Rate'])],
    textposition='outside'
))

fig.update_layout(
    title='Top 20 Cities by YC Company Count',
    xaxis_title='Number of Companies',
    height=600
)
fig.show()

---
## 8. Advanced Visualizations

Network graphs, Sankey diagrams, and interactive explorations

In [None]:
# Sankey: Batch → Industry → Status
# Using recent batches and top industries

recent_years = sorted(df['batch_year'].dropna().unique())[-5:]
sankey_df = df[df['batch_year'].isin(recent_years)].copy()

# Extract primary industry (first tag)
sankey_df['primary_industry'] = sankey_df['tags'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else 'Other'
)

# Keep only top 10 industries
top_10_industries = sankey_df['primary_industry'].value_counts().head(10).index
sankey_df = sankey_df[sankey_df['primary_industry'].isin(top_10_industries)]

# Build Sankey data
sources = []
targets = []
values = []

# Year → Industry
for year in recent_years:
    for industry in top_10_industries:
        count = len(sankey_df[(sankey_df['batch_year'] == year) & (sankey_df['primary_industry'] == industry)])
        if count > 0:
            sources.append(f"Batch {int(year)}")
            targets.append(industry)
            values.append(count)

# Industry → Status
for industry in top_10_industries:
    for status in ['Active', 'Acquired', 'Public', 'Inactive']:
        count = len(sankey_df[(sankey_df['primary_industry'] == industry) & (sankey_df['success_category'] == status)])
        if count > 0:
            sources.append(industry)
            targets.append(status)
            values.append(count)

# Create node list
all_nodes = list(set(sources + targets))
node_dict = {node: idx for idx, node in enumerate(all_nodes)}

# Map to indices
source_indices = [node_dict[s] for s in sources]
target_indices = [node_dict[t] for t in targets]

fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        label=all_nodes,
        color='#0099FF'
    ),
    link=dict(
        source=source_indices,
        target=target_indices,
        value=values
    )
)])

fig.update_layout(title="Flow: Recent Batches → Industry → Status", height=600)
fig.show()

---
## 9. Predictive Analysis

What features predict success? Simple ML models for insights.

In [None]:
# Feature importance using logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Prepare features - USE MATURE COMPANIES ONLY
mature_df = df[df['is_mature']].copy()

# IMPORTANT: Exclude 'has_grown' - it's circular (successful companies grow)
model_df = mature_df[[
    'num_founders', 'team_size', 'company_age', 'num_tags',
    'is_sf_bay', 'is_us', 'is_ai', 'is_b2b',
    'is_successful'
]].dropna()

X = model_df.drop('is_successful', axis=1)
y = model_df['is_successful']

print(f"📊 Dataset Info:")
print(f"  Total samples: {len(y):,}")
print(f"  Successful: {y.sum():,} ({y.mean()*100:.1f}%)")
print(f"  Not successful: {(~y).sum():,} ({(~y).mean()*100:.1f}%)")
print(f"  ⚠️  Class imbalance: {(~y).sum() / y.sum():.1f}:1 ratio\n")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with class weights to handle imbalance
model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', ascending=False)

# Predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Comprehensive metrics
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Cross-validation for robustness
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='roc_auc')

print(f"🤖 PREDICTIVE MODEL RESULTS:")
print(f"  Training Accuracy: {train_score*100:.2f}%")
print(f"  Test Accuracy: {test_score*100:.2f}%")
print(f"  ROC-AUC Score: {roc_auc:.3f}")
print(f"  Cross-Val ROC-AUC: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")
print(f"\n⚠️  Note: Accuracy is misleading with class imbalance. ROC-AUC is better metric.\n")

print(f"📊 Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Successful', 'Successful']))

print(f"\n📈 Feature Importance (Logistic Regression Coefficients):")
print(f"⚠️  Note: These show predictive association, NOT causation.\n")
print(feature_importance.to_string(index=False))

# Visualize
fig = go.Figure(go.Bar(
    y=feature_importance['Feature'],
    x=feature_importance['Coefficient'],
    orientation='h',
    marker_color=['#00CC66' if x > 0 else '#FF3333' for x in feature_importance['Coefficient']]
))

fig.update_layout(
    title='Feature Importance: What Predicts Success?<br><sub>Higher coefficient = stronger positive association with success</sub>',
    xaxis_title='Coefficient (Impact on Success Probability)',
    height=400
)
fig.show()

print("\n⚠️  IMPORTANT CAVEATS:")
print("  1. This model shows ASSOCIATION, not CAUSATION")
print("  2. Survivorship bias partially mitigated by using mature companies only")
print("  3. Many unmeasured factors influence success (market timing, execution, luck)")
print("  4. Past performance ≠ future results")

---
## 10. Actionable Recommendations

Data-driven insights for founders

In [None]:
def get_recommendations(founder_type='solo', industry=None, location='San Francisco'):
    """
    Generate personalized recommendations based on founder profile
    """
    print("\n" + "═" * 80)
    print(f"  PERSONALIZED RECOMMENDATIONS")
    print(f"  Profile: {founder_type.title()} Founder | Industry: {industry or 'General'} | Location: {location}")
    print("═" * 80)
    
    # Filter similar companies
    if founder_type == 'solo':
        similar = df[df['is_solo']]
    else:
        similar = df[~df['is_solo']]
    
    if industry:
        similar = similar[similar['tags'].apply(
            lambda x: industry.lower() in ' '.join(x).lower() if isinstance(x, list) else False
        )]
    
    # Calculate metrics
    success_rate = similar['is_successful'].mean() * 100
    avg_team_size = similar['team_size'].mean()
    top_locations = similar['location'].value_counts().head(5)
    
    print(f"\n✅ SUCCESS METRICS (Based on {len(similar):,} similar companies):")
    print(f"  • Success Rate: {success_rate:.1f}%")
    print(f"  • Average Team Size: {avg_team_size:.1f}")
    
    print(f"\n🎯 KEY INSIGHTS:")
    
    # Insight 1: Team composition
    if founder_type == 'solo':
        team_growth = similar[similar['has_grown']].shape[0] / len(similar) * 100
        print(f"  • {team_growth:.1f}% of solo founders grow their team")
        print(f"  • Successful solo founders have avg {similar[similar['is_successful']]['team_size'].mean():.0f} person teams")
    else:
        print(f"  • Team-founded companies have {team_success_rate - solo_success_rate:+.1f}pp higher success rate")
    
    # Insight 2: Location advantage
    print(f"\n📍 TOP LOCATIONS FOR SIMILAR COMPANIES:")
    for loc, count in top_locations.items():
        loc_success = similar[similar['location'] == loc]['is_successful'].mean() * 100
        print(f"  • {loc}: {count} companies ({loc_success:.1f}% success rate)")
    
    # Insight 3: Industry trends
    if industry:
        recent_trend = similar[similar['batch_year'] >= 2020].shape[0]
        total_trend = similar.shape[0]
        print(f"\n📈 INDUSTRY TREND:")
        print(f"  • {recent_trend}/{total_trend} companies in {industry} are from recent batches (2020+)")
        print(f"  • Industry momentum: {'🔥 Hot' if recent_trend/total_trend > 0.4 else '📊 Steady' if recent_trend/total_trend > 0.2 else '⚠️  Declining'}")
    
    # Recommendations
    print(f"\n💡 RECOMMENDATIONS:")
    if success_rate < 5:
        print(f"  ⚠️  Warning: This profile has lower than average success rate")
        print(f"  → Consider: Pivoting, finding co-founders, or targeting different market")
    elif success_rate > 15:
        print(f"  ✅ Strong profile with above-average success rate")
        print(f"  → Focus on: Execution, customer acquisition, product-market fit")
    
    if founder_type == 'solo' and team_growth > 70:
        print(f"  → Plan to hire early - {team_growth:.0f}% of successful solos grow their team")
    
    print("\n" + "═" * 80)

# Example usage
get_recommendations(founder_type='solo', industry='AI', location='San Francisco')

In [None]:
# Interactive recommendation widget
from ipywidgets import interact, Dropdown

# Get unique industries and locations
unique_industries = ['General'] + sorted(list(set([tag for tags in df['tags'].dropna() for tag in tags if isinstance(tags, list)]))[:30])
unique_locations = ['Any'] + df['location'].value_counts().head(20).index.tolist()

@interact(
    founder_type=Dropdown(options=['solo', 'team'], description='Founder Type'),
    industry=Dropdown(options=unique_industries, description='Industry'),
    location=Dropdown(options=unique_locations, description='Location')
)
def interactive_recommendations(founder_type, industry, location):
    industry = None if industry == 'General' else industry
    location = None if location == 'Any' else location
    get_recommendations(founder_type, industry, location or 'Any')

<cell_type>markdown</cell_type>---
## Summary & Key Takeaways

**Top Insights from the Data:**

1. **Solo vs Team**: Team-founded companies generally have higher success rates, but solo founders succeed in specific niches
2. **Location Matters**: SF Bay Area dominates, but international opportunities are growing
3. **Industry Cycles**: AI/ML experiencing explosive growth; traditional industries show steady patterns
4. **Team Growth**: Successful companies grow their teams 2-3x within first few years
5. **Timing**: Recent batches are larger but face more competition

**For Aspiring Founders:**
- Research your industry's saturation level before launching
- Consider co-founders if targeting highly competitive markets
- Plan for team growth early, especially if starting solo
- Location still matters, but remote is increasingly viable
- Focus on execution - the best predictor of success is traction

---

## ⚠️ Methodological Limitations & Caveats

**1. Survivorship Bias**
- Recent companies haven't had time to exit (IPO/acquisition)
- We filter to "mature" (≥3 years old) companies to partially address this
- Still, older batches have had more time to achieve successful outcomes

**2. Selection Bias**
- YC-funded companies are already pre-selected (top ~1-2% of applicants)
- Results may not generalize to non-YC startups

**3. Correlation ≠ Causation**
- All statistical associations shown are correlational, not causal
- Unmeasured confounders (founder skill, market timing, luck) likely drive results

**4. Missing Data**
- ~20-30% missing values for some fields (team_size, year_founded)
- Companies with missing data may differ systematically

**5. Multiple Comparisons**
- Testing 50+ industries without correction inflates false positive rate
- Some "top industries" may be statistical flukes

**6. Class Imbalance**
- Only ~5-10% of companies achieve "success" (Public/Acquired)
- Makes prediction challenging and accuracy metrics misleading

**7. Measurement Issues**
- "Success" defined only as Public/Acquired (ignores profitable private companies)
- Team size is snapshot, not longitudinal data
- Industry tags are self-reported and inconsistent

**Use this analysis for exploration and hypothesis generation, not definitive conclusions.**

---
*Data source: Y Combinator Companies Directory*  
*Last updated: October 2025*  
*Total companies analyzed: 7,858+*  
*Analysis focuses on mature companies (≥3 years): ~5,000+*