<cell_type>markdown</cell_type># YC Companies Analysis for Solo Entrepreneurs

This interactive notebook helps solo founders explore trends, **success factors**, and actionable insights from the Y Combinator companies dataset. Use the filters and visualizations to benchmark solo-founded companies against team-founded ones and to spot opportunities.

⚠️ **Important Note:** This analysis has limitations due to survivorship bias (recent companies haven't had time to exit). For rigorous statistical analysis, see `yc_comprehensive_analysis.ipynb`.

In [1]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, widgets
import warnings
warnings.filterwarnings('ignore')

# Set plot style
sns.set(style='whitegrid')

## Load Data
Specify the path to your YC companies JSON Lines file (e.g., '../data/2025-05-03.jl'), with one company record per line.

In [2]:
import pandas as pd

# Change the path if your data file lives elsewhere (JSON Lines: one record per line)
json_path = '../data/2025-05-03.jl'
try:
    df = pd.read_json(json_path, lines=True)
    print(f'Loaded {len(df)} records from {json_path}')
except Exception as e:
    print(f'Error loading file: {e}')
    raise  # Stop here: every later cell needs df to exist

Loaded 7858 records from ../data/2025-05-03.jl


## Data Cleaning & Preparation
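- Mark solo founders (num_founders == 1)
- Extract year from batch
- Compute company age and flag mature companies (≥3 years old)
- Flag successful companies (status contains "acquired" or "public")
- Parse tags as lists if needed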


In [None]:
import re

df['is_solo_founder'] = df['num_founders'] == 1

# Extract year from batch - handle both old (W25, S24) and new (Winter 2025, Fall 2024) formats
def parse_batch_year(batch):
    if pd.isna(batch):
        return None
    match = re.search(r'(\d{2,4})', str(batch))
    if match:
        year = match.group(1)
        if len(year) == 2:
            year = int(year)
            return 2000 + year if year < 50 else 1900 + year
        return int(year)
    return None

df['year'] = df['batch'].apply(parse_batch_year)

# Calculate company age
current_year = 2025
df['company_age'] = df['year_founded'].apply(lambda x: current_year - x if pd.notna(x) else None)

# Define success (for companies ≥3 years old to reduce survivorship bias)
df['is_mature'] = df['company_age'] >= 3
# Case-insensitive match on status; missing values become 'nan' and count as not successful
df['is_successful'] = df['status'].astype(str).str.contains('acquired|public', case=False)

# If tags are stored as strings, convert to lists
if df['tags'].dtype == object and isinstance(df['tags'].iloc[0], str):
    import ast
    df['tags'] = df['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else x)

print('✓ Data prepared')
print(f'  Total companies: {len(df):,}')
print(f'  Solo founders: {df["is_solo_founder"].sum():,} ({df["is_solo_founder"].mean()*100:.1f}%)')
print(f'  Mature companies (≥3 years): {df["is_mature"].sum():,}')
print('\n⚠️  Note: Success analysis uses only mature companies to reduce survivorship bias')
display(df.head())

## Interactive Filters
Use the widgets below to filter by year, industry, and solo/team founders.

In [None]:
from ipywidgets import interact, widgets

years = sorted(df['year'].dropna().unique().astype(int))
min_year, max_year = int(min(years)), int(max(years))
industries = sorted({tag for tags in df['tags'].dropna() for tag in tags})

@interact(
    year_range=widgets.IntRangeSlider(value=[min_year, max_year], min=min_year, max=max_year, step=1, description='Year Range'),
    industry=widgets.SelectMultiple(options=industries, description='Industry', layout=widgets.Layout(width='50%')),
    solo_only=widgets.Checkbox(value=True, description='Solo Founders Only')
)
def filter_data(year_range, industry, solo_only):
    filtered = df[(df['year'] >= year_range[0]) & (df['year'] <= year_range[1])]
    if industry:
        filtered = filtered[filtered['tags'].apply(lambda tags: any(i in tags for i in industry) if isinstance(tags, list) else False)]
    if solo_only:
        filtered = filtered[filtered['is_solo_founder']]
    print(f'Companies found: {len(filtered)}')
    display(filtered[['company_name', 'batch', 'status', 'tags', 'location', 'year', 'num_founders', 'team_size', 'website']].head(10))

## Visualizations
Explore trends and success metrics for solo founders.

In [None]:
# Solo vs. Team-Founded Companies
plt.figure(figsize=(5,3))
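# value_counts() sorts by frequency, so pin the order (False=Team, True=Solo) to keep labels correct
solo_counts = df['is_solo_founder'].value_counts().reindex([False, True], fill_value=0)
solo_counts.plot(kind='pie', labels=['Team', 'Solo'], autopct='%1.1f%%', startangle=90, colors=['#66b3ff','#ff9999'])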
plt.title('Proportion of Solo vs. Team-Founded Companies')
plt.ylabel('')
plt.show()

# Top Industries for Solo Founders
solo = df[df['is_solo_founder']]
top_industries = pd.Series([i for tags in solo['tags'].dropna() for i in tags]).value_counts().head(10)
plt.figure(figsize=(8,4))
sns.barplot(x=top_industries.values, y=top_industries.index, palette='viridis')
plt.title('Top Industries for Solo Founders (by count)')
plt.xlabel('Number of Companies')
plt.show()

# Status Breakdown for Solo Founders
plt.figure(figsize=(6,3))
solo['status'].value_counts().plot(kind='bar', color='#ffcc99')
plt.title('Status of Solo-Founded Companies')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

# ===== NEW: SUCCESS RATE ANALYSIS (MATURE COMPANIES ONLY) =====
print("\n" + "="*60)
print("SUCCESS RATE COMPARISON (Mature Companies ≥3 years)")
print("="*60)

mature_df = df[df['is_mature']].copy()
mature_solo = mature_df[mature_df['is_solo_founder']]
mature_team = mature_df[~mature_df['is_solo_founder']]

solo_success_rate = mature_solo['is_successful'].mean() * 100
team_success_rate = mature_team['is_successful'].mean() * 100

print(f"\nSolo Founders:")
print(f"  Sample size: {len(mature_solo):,} companies")
print(f"  Success rate: {solo_success_rate:.1f}%")
print(f"  Successful: {mature_solo['is_successful'].sum():,} companies")

print(f"\nTeam Founders:")
print(f"  Sample size: {len(mature_team):,} companies")
print(f"  Success rate: {team_success_rate:.1f}%")
print(f"  Successful: {mature_team['is_successful'].sum():,} companies")

print(f"\nDifference: {team_success_rate - solo_success_rate:+.1f} percentage points")
print(f"⚠️  Note: This is correlation, not causation. Many factors affect success.")
print("="*60)

# Visualization
fig, ax = plt.subplots(figsize=(7,4))
categories = ['Solo Founders', 'Team Founders']
success_rates = [solo_success_rate, team_success_rate]
colors_bar = ['#ff9999', '#66b3ff']

bars = ax.bar(categories, success_rates, color=colors_bar, alpha=0.7)
ax.set_ylabel('Success Rate (%)')
ax.set_title('Success Rate: Solo vs Team Founders\n(Mature companies ≥3 years only)')
ax.set_ylim(0, max(success_rates) * 1.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.1f}%',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [10]:
import plotly.express as px

# Assuming df is already loaded and 'tags' is a list of industries
df_exploded = df.explode('tags')

# Group and count companies per year and industry
industry_trend = df_exploded.groupby(['year_founded', 'tags']).size().reset_index(name='count')

# Create an interactive line plot
fig = px.line(
    industry_trend,
    x='year_founded',
    y='count',
    color='tags',
    title='Industry Trend Over the Years',
    labels={'year_founded': 'Year Founded', 'count': 'Number of Companies', 'tags': 'Industry'},
    hover_data=['tags', 'count', 'year_founded']
)

fig.update_layout(
    legend_title_text='Industry',
    xaxis=dict(dtick=1),
    template='plotly_white',
    legend=dict(
        orientation="v",
        yanchor="top",
        y=1,
        xanchor="left",
        x=1.05
    )
)

fig.show()

<cell_type>markdown</cell_type>## Actionable Insights & Analysis

### 🎯 Key Findings (Mature Companies ≥3 years):

**Solo vs Team Success:**
- The analysis above shows the actual success rates (Public/Acquired) for mature companies
- Team-founded companies typically show higher success rates, but solo founders can succeed in specific niches
- Success depends on many factors: industry, timing, execution, market conditions, founder skill, and luck
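
Whether the printed gap between solo and team success rates is larger than chance alone would produce can be checked with a two-proportion z-test. The sketch below is one simple option, not the method used in the comprehensive notebook; the function name and the counts are hypothetical and chosen only for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled standard error)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-success or all-failure: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (NOT taken from the dataset): 40 of 400 solo vs. 450 of 3000 team exits.
z, p_value = two_proportion_z_test(40, 400, 450, 3000)
print(f'z = {z:.2f}, p = {p_value:.4f}')
```

In the notebook, the inputs would come from `mature_solo['is_successful'].sum()`, `len(mature_solo)`, and the matching values for `mature_team`. A small p-value still says nothing about causation.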

**Industry Selection:**
- Top industries by count ≠ top industries by success rate
- For success rate by industry, see the comprehensive analysis notebook
- Consider market saturation vs opportunity when choosing an industry
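
To see how a ranking by count can differ from a ranking by success rate, the sketch below computes a per-tag success rate among mature companies and drops tags with too few companies. It assumes columns shaped like the notebook's `tags`, `is_mature`, and `is_successful`; the toy rows, the `success_rate_by_tag` helper, and the `min_companies` threshold are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's df; these rows are invented, not real YC data.
toy = pd.DataFrame({
    'tags': [['SaaS', 'B2B'], ['SaaS'], ['Fintech'], ['Fintech', 'B2B'], ['SaaS'], ['Fintech']],
    'is_mature': [True, True, True, True, True, False],
    'is_successful': [True, False, False, True, False, False],
})

def success_rate_by_tag(frame, min_companies=2):
    """Per-tag company count and success rate, restricted to mature companies."""
    # One row per (company, tag) pair; companies without tags are dropped.
    exploded = frame[frame['is_mature']].explode('tags').dropna(subset=['tags'])
    summary = exploded.groupby('tags')['is_successful'].agg(
        companies='size', success_rate='mean'
    )
    # Remove tags with too few companies for the rate to mean much.
    summary = summary[summary['companies'] >= min_companies]
    return summary.sort_values('success_rate', ascending=False)

by_tag = success_rate_by_tag(toy)
print(by_tag)
```

In this toy data SaaS has the most companies but the lowest success rate. A realistic threshold for the full dataset would be much higher than 2; the caveats below suggest at least 20.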

**Team Growth Patterns:**
- Most successful solo founders eventually hire and grow their teams
- Early team growth correlates with success, but that correlation does not show that growth causes success

---

### ⚠️ Important Caveats

**Limitations of This Analysis:**

1. **Survivorship Bias**: Recent companies haven't had time to exit (strictly speaking, this is right-censoring of outcomes). We partially address this by restricting the success analysis to companies ≥3 years old.

2. **Correlation ≠ Causation**: All patterns shown are correlational. We cannot prove that having co-founders *causes* success.

3. **Selection Bias**: YC companies are already pre-selected (top ~1-2% of applicants). Results may not generalize.

4. **Small Samples**: Some industry/location combinations have <20 companies, making statistics unreliable.

5. **Definition of Success**: "Success" = Public or Acquired. This ignores profitable private companies that never exit.

6. **Missing Data**: ~20-30% of companies have missing team_size or year_founded data.
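
The small-samples caveat can be made concrete with a confidence interval around each success rate. The sketch below uses the Wilson score interval, one common choice for proportions; the helper name and the counts are hypothetical, not results from the dataset.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # No data: the rate could be anything.
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical counts with the same 20% observed rate but very different sample sizes.
small = wilson_interval(3, 15)
large = wilson_interval(60, 300)
print(f'n=15:  {small[0]:.1%} to {small[1]:.1%}')
print(f'n=300: {large[0]:.1%} to {large[1]:.1%}')
```

The interval for the 15-company group is several times wider than the one for 300 companies, which is why rates for small industry or location groups should be read with caution.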

---

### 💡 Recommendations for Solo Founders

**Before deciding to go solo:**
1. Analyze your specific industry's solo success rate (not just overall)
2. Assess your own skills: do you cover technical, business, and sales?
3. Consider: is the problem suited for a solo founder or does it need diverse expertise?
4. Plan for eventual hiring - most successful solos grow teams within 1-2 years

**For rigorous statistical analysis:**
- See `yc_comprehensive_analysis.ipynb` for:
  - Success rate by industry
  - Geographic analysis  
  - Predictive modeling
  - Statistical significance testing
  - Comprehensive caveats and limitations

Use these insights for **exploration and hypothesis generation**, not definitive decisions.