# Predicting Startup Funding Success

## Table of Contents
1. [Introduction](#Introduction)
2. [Data Overview](#Data-Overview)
   - [People Data](#People-Data)
   - [Investments Data](#Investments-Data)
   - [Offices Data](#Offices-Data)
   - [Degrees Data](#Degrees-Data)
   - [Relationships Data](#Relationships-Data)
3. [Data Cleaning](#Data-Cleaning)
4. [Descriptive Analysis](#Descriptive-Analysis)
5. [Visualizations](#Visualizations)
6. [Key Insights](#Key-Insights)
7. [Conclusion](#Conclusion)



## Introduction
This project focuses on predicting the success of startups in securing funding. By analyzing various datasets related to office structures, individuals, relationships, degrees, and investment rounds, we aim to identify key factors that influence funding outcomes.

In [78]:
# loading the packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Data Overview


### People Data
This dataset lists individuals, with their names, birthplace, and the organization they are affiliated with.

In [79]:
people = pd.read_csv('Data/people.csv')
people.head()

Unnamed: 0,id,object_id,first_name,last_name,birthplace,affiliation_name
0,1,p:2,Ben,Elowitz,,Blue Nile
1,2,p:3,Kevin,Flaherty,,Wetpaint
2,3,p:4,Raju,Vegesna,,Zoho
3,4,p:5,Ian,Wenig,,Zoho
4,5,p:6,Kevin,Rose,"Redding, CA",i/o Ventures


In [80]:
# Data Overview
print("People Data Overview:")
print(people.info())
print(people.describe(include='all'))

People Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226709 entries, 0 to 226708
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                226709 non-null  int64 
 1   object_id         226709 non-null  object
 2   first_name        226700 non-null  object
 3   last_name         226705 non-null  object
 4   birthplace        28084 non-null   object
 5   affiliation_name  226684 non-null  object
dtypes: int64(1), object(5)
memory usage: 10.4+ MB
None
                   id object_id first_name last_name birthplace  \
count   226709.000000    226709     226700    226705      28084   
unique            NaN    226709      28421    107772       8270   
top               NaN       p:2      David     Smith      India   
freq              NaN         1       4495       797        660   
mean    113355.000000       NaN        NaN       NaN        NaN   
std      65445.395426       NaN   

### Investments Data
This dataset links each funding round to the company that was funded and to the investor that took part. The amounts raised are stored separately in the funding rounds data.


In [81]:
investments = pd.read_csv('Data/investments.csv')
investments.head()

Unnamed: 0,id,funding_round_id,funded_object_id,investor_object_id,created_at,updated_at
0,1,1,c:4,f:1,2007-07-04 04:52:57,2008-02-27 23:14:29
1,2,1,c:4,f:2,2007-07-04 04:52:57,2008-02-27 23:14:29
2,3,3,c:5,f:4,2007-05-27 06:09:10,2013-06-28 20:07:23
3,4,4,c:5,f:1,2007-05-27 06:09:36,2013-06-28 20:07:24
4,5,4,c:5,f:5,2007-05-27 06:09:36,2013-06-28 20:07:24


In [82]:
# Data Overview
print("Investments Data Overview:")
print(investments.info())
print(investments.describe(include='all'))

Investments Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80902 entries, 0 to 80901
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  80902 non-null  int64 
 1   funding_round_id    80902 non-null  int64 
 2   funded_object_id    80902 non-null  object
 3   investor_object_id  80902 non-null  object
 4   created_at          80902 non-null  object
 5   updated_at          80902 non-null  object
dtypes: int64(2), object(4)
memory usage: 3.7+ MB
None
                  id  funding_round_id funded_object_id investor_object_id  \
count   80902.000000      80902.000000            80902              80902   
unique           NaN               NaN            21607              17152   
top              NaN               NaN         c:169876              f:367   
freq             NaN               NaN               58                529   
mean    40451.500000      24020.1712


### Offices Data
This dataset lists the offices of each organization, including address, region, country, and geographic coordinates.


In [83]:
offices = pd.read_csv('Data/offices.csv')
offices.head()

Unnamed: 0,id,object_id,office_id,description,region,address1,address2,city,zip_code,state_code,country_code,latitude,longitude,created_at,updated_at
0,1,c:1,1,,Seattle,710 - 2nd Avenue,Suite 1100,Seattle,98104,WA,USA,47.603122,-122.333253,,
1,2,c:3,3,Headquarters,SF Bay,4900 Hopyard Rd,Suite 310,Pleasanton,94588,CA,USA,37.692934,-121.904945,,
2,3,c:4,4,,SF Bay,135 Mississippi St,,San Francisco,94107,CA,USA,37.764726,-122.394523,,
3,4,c:5,5,Headquarters,SF Bay,1601 Willow Road,,Menlo Park,94025,CA,USA,37.41605,-122.151801,,
4,5,c:7,7,,SF Bay,Suite 200,654 High Street,Palo Alto,94301,CA,ISR,0.0,0.0,,


In [84]:
# Data Overview
print("Offices Data Overview:")
print(offices.info())
print(offices.describe(include='all'))

Offices Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112718 entries, 0 to 112717
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            112718 non-null  int64  
 1   object_id     112718 non-null  object 
 2   office_id     112718 non-null  int64  
 3   description   68530 non-null   object 
 4   region        112718 non-null  object 
 5   address1      94430 non-null   object 
 6   address2      44520 non-null   object 
 7   city          107550 non-null  object 
 8   zip_code      93230 non-null   object 
 9   state_code    62017 non-null   object 
 10  country_code  112718 non-null  object 
 11  latitude      112718 non-null  float64
 12  longitude     112718 non-null  float64
 13  created_at    0 non-null       float64
 14  updated_at    0 non-null       float64
dtypes: float64(4), int64(2), object(9)
memory usage: 12.9+ MB
None
                   id object_id      office_id des

### Degrees Data
This dataset includes educational background information (degree type, subject, institution, and graduation date), which might correlate with success in securing startup funding.


In [85]:
degrees = pd.read_csv('Data/degrees.csv')
degrees.head()

Unnamed: 0,id,object_id,degree_type,subject,institution,graduated_at,created_at,updated_at
0,1,p:6117,MBA,,,,2008-02-19 03:17:36,2008-02-19 03:17:36
1,2,p:6136,BA,"English, French","Washington University, St. Louis",1990-01-01,2008-02-19 17:58:31,2008-02-25 00:23:55
2,3,p:6136,MS,Mass Communication,Boston University,1992-01-01,2008-02-19 17:58:31,2008-02-25 00:23:55
3,4,p:6005,MS,Internet Technology,University of Greenwich,2006-01-01,2008-02-19 23:40:40,2008-02-25 00:23:55
4,5,p:5832,BCS,"Computer Science, Psychology",Rice University,,2008-02-20 05:28:09,2008-02-20 05:28:09


In [86]:
# Data Overview
print("Degrees Data Overview:")
print(degrees.info())
print(degrees.describe(include='all'))

Degrees Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109610 entries, 0 to 109609
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            109610 non-null  int64 
 1   object_id     109610 non-null  object
 2   degree_type   98389 non-null   object
 3   subject       81298 non-null   object
 4   institution   109555 non-null  object
 5   graduated_at  58054 non-null   object
 6   created_at    109610 non-null  object
 7   updated_at    109610 non-null  object
dtypes: int64(1), object(7)
memory usage: 6.7+ MB
None
                  id object_id degree_type           subject  \
count   109610.00000    109610       98389             81298   
unique           NaN     68451        7147             20050   
top              NaN  p:183805          BS  Computer Science   
freq             NaN        10       23425              6001   
mean     54805.50000       NaN         NaN               NaN  


### Relationships Data
This dataset details the connections between people and organizations (job title, start and end dates, and whether the role is in the past), indicating potential networking effects within the startup ecosystem.


In [87]:
relationships = pd.read_csv('Data/relationships.csv')
relationships.head()

Unnamed: 0,id,relationship_id,person_object_id,relationship_object_id,start_at,end_at,is_past,sequence,title,created_at,updated_at
0,1,1,p:2,c:1,,,0,8,Co-Founder/CEO/Board of Directors,2007-05-25 07:03:54,2013-06-03 09:58:46
1,2,2,p:3,c:1,,,1,279242,VP Marketing,2007-05-25 07:04:16,2010-05-21 16:31:34
2,3,3,p:4,c:3,,,0,4,Evangelist,2007-05-25 19:33:03,2013-06-29 13:36:58
3,4,4,p:5,c:3,2006-03-01,2009-12-01,1,4,Senior Director Strategic Alliances,2007-05-25 19:34:53,2013-06-29 10:25:34
4,6,6,p:7,c:4,2005-07-01,2010-04-05,1,1,Chief Executive Officer,2007-05-25 20:05:33,2010-04-05 18:41:41
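
The identifiers in these tables carry a short prefix (`p:`, `c:`, `f:`) followed by a number. A minimal sketch of splitting them apart is shown here; the meaning assigned to each prefix in `PREFIX_LABELS` is an assumption based on how the columns are used, not something the data itself confirms, and `toy_rel` is a made-up frame that only imitates the layout above.

```python
import pandas as pd

# Toy frame imitating two columns of the relationships table
toy_rel = pd.DataFrame({
    "person_object_id": ["p:2", "p:3", "p:4"],
    "relationship_object_id": ["c:1", "c:1", "c:3"],
})

# Assumed meaning of the prefixes; not confirmed by a data dictionary
PREFIX_LABELS = {"p": "person", "c": "company", "f": "financial_org"}

# Split "c:1" into the prefix "c" and the numeric part "1"
parts = toy_rel["relationship_object_id"].str.split(":", n=1, expand=True)
toy_rel["target_type"] = parts[0].map(PREFIX_LABELS)
toy_rel["target_num_id"] = parts[1].astype(int)

print(toy_rel)
```

Keeping the entity type in its own column makes it easier to check that a join only matches rows of the intended kind.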


In [88]:
# Data Overview
print("Relationships Data Overview:")
print(relationships.info())
print(relationships.describe(include='all'))

Relationships Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402878 entries, 0 to 402877
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   id                      402878 non-null  int64 
 1   relationship_id         402878 non-null  int64 
 2   person_object_id        402878 non-null  object
 3   relationship_object_id  402878 non-null  object
 4   start_at                206995 non-null  object
 5   end_at                  101046 non-null  object
 6   is_past                 402878 non-null  int64 
 7   sequence                402878 non-null  int64 
 8   title                   389526 non-null  object
 9   created_at              402878 non-null  object
 10  updated_at              402878 non-null  object
dtypes: int64(4), object(7)
memory usage: 33.8+ MB
None
                   id  relationship_id person_object_id  \
count   402878.000000    402878.000000           402878 

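### Funding Rounds Data
This dataset records individual funding rounds, including the round type, the funding date, the amount raised, and pre- and post-money valuations.

In [89]: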
funding_rounds = pd.read_csv('Data/funding_rounds.csv')

funding_rounds.head()

Unnamed: 0,id,funding_round_id,object_id,funded_at,funding_round_type,funding_round_code,raised_amount_usd,raised_amount,raised_currency_code,pre_money_valuation_usd,...,post_money_valuation,post_money_currency_code,participants,is_first_round,is_last_round,source_url,source_description,created_by,created_at,updated_at
0,1,1,c:4,2006-12-01,series-b,b,8500000.0,8500000.0,USD,0.0,...,0.0,,2,0,0,http://www.marketingvox.com/archives/2006/12/2...,,initial-importer,2007-07-04 04:52:57,2008-02-27 23:14:29
1,2,2,c:5,2004-09-01,angel,angel,500000.0,500000.0,USD,0.0,...,0.0,USD,2,0,1,,,initial-importer,2007-05-27 06:08:18,2013-06-28 20:07:23
2,3,3,c:5,2005-05-01,series-a,a,12700000.0,12700000.0,USD,115000000.0,...,0.0,USD,3,0,0,http://www.techcrunch.com/2007/11/02/jim-breye...,Jim Breyer: Extra $500 Million Round For Faceb...,initial-importer,2007-05-27 06:09:10,2013-06-28 20:07:23
3,4,4,c:5,2006-04-01,series-b,b,27500000.0,27500000.0,USD,525000000.0,...,0.0,USD,4,0,0,http://www.facebook.com/press/info.php?factsheet,Facebook Funding,initial-importer,2007-05-27 06:09:36,2013-06-28 20:07:24
4,5,5,c:7299,2006-05-01,series-b,b,10500000.0,10500000.0,USD,0.0,...,0.0,,2,0,0,http://www.techcrunch.com/2006/05/14/photobuck...,PhotoBucket Closes $10.5M From Trinity Ventures,initial-importer,2007-05-29 11:05:59,2008-04-16 17:09:12



## Data Cleaning
The cleaning applies the following steps to each dataset as needed:
1. **Handling Missing Values**
2. **Removing Duplicates**
3. **Correcting Data Types**
4. **Dropping Irrelevant Columns**
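
The four steps above can be sketched as one reusable helper. This is a minimal sketch rather than the pipeline used in this notebook: `clean_table` and its parameters are hypothetical names, dropping only fully empty rows is one possible policy for missing values, and the toy frame merely imitates the `investments` layout.

```python
import pandas as pd


def clean_table(df, drop_cols=(), date_cols=(), category_cols=()):
    """Apply the four cleaning steps to a copy of df (hypothetical helper)."""
    # Drop irrelevant columns; ignore names that are not present
    out = df.drop(columns=list(drop_cols), errors="ignore")
    for col in date_cols:
        # Correct data types: unparseable dates become NaT instead of raising
        out[col] = pd.to_datetime(out[col], errors="coerce")
    for col in category_cols:
        out[col] = out[col].astype("category")
    # Handle missing values: here, only rows that are entirely empty are dropped
    out = out.dropna(how="all")
    # Remove exact duplicate rows
    out = out.drop_duplicates()
    return out.reset_index(drop=True)


# Toy frame imitating the investments layout (rows are illustrative)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "funding_round_id": [1, 1, 3],
    "funded_object_id": ["c:4", "c:4", "c:5"],
    "investor_object_id": ["f:1", "f:1", "f:4"],
    "created_at": ["2007-07-04 04:52:57"] * 3,
})
cleaned = clean_table(toy, drop_cols=["id", "created_at"])
print(cleaned.shape)
```

Duplicates are removed after the bookkeeping columns are dropped, because a unique `id` would otherwise make every row look distinct.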

In [90]:
# Drop bookkeeping columns, then remove duplicate investment records
investments = investments.drop(columns=['id', 'created_at', 'updated_at'])
investments = investments.drop_duplicates()

In [91]:

# Check for missing values
missing_values_people = people.isnull().sum()
print(missing_values_people)

id                       0
object_id                0
first_name               9
last_name                4
birthplace          198625
affiliation_name        25
dtype: int64
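
Raw counts are easier to judge as shares of the table: `birthplace` is missing in 198,625 of 226,709 rows, roughly 88%. The sketch below shows one way to report both figures together; `missing_report` is a hypothetical helper and `toy_people` is a made-up frame.

```python
import pandas as pd


def missing_report(df):
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    return report.sort_values("missing_pct", ascending=False)


# Toy frame: birthplace is mostly empty, as in the people table
toy_people = pd.DataFrame({
    "first_name": ["Ben", "Kevin", "Raju", None],
    "birthplace": [None, None, None, "Redding, CA"],
})
report = missing_report(toy_people)
print(report)
```

A column with this much missing data is usually a candidate for dropping or for a simple "known / unknown" indicator rather than imputation.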


In [92]:
# Select relevant columns and clean data
offices = offices.drop(columns=['id', 'zip_code', 'created_at', 'updated_at'])
offices['state_code'] = offices['state_code'].astype('category')
offices = offices.drop_duplicates()

In [93]:
# Select relevant columns and clean data
relationships = relationships.drop(columns=['id', 'sequence', 'created_at', 'updated_at'])
relationships['is_past'] = relationships['is_past'].astype('category')
relationships = relationships.drop_duplicates()

## Descriptive Analysis

In [94]:
print('degrees',degrees.shape)
print('investments',investments.shape)
print('offices',offices.shape)
print('people',people.shape)
print('relationships',relationships.shape)

degrees (109610, 8)
investments (80800, 3)
offices (112718, 11)
people (226709, 6)
relationships (402878, 7)


__Investments__

In [95]:
investments_descriptive_stats = investments.describe()
print("Descriptive Statistics for Investments Data:")
print(investments_descriptive_stats)

Descriptive Statistics for Investments Data:
       funding_round_id
count      80800.000000
mean       24014.904455
std        15163.658648
min            1.000000
25%        11742.500000
50%        22574.000000
75%        34788.000000
max        57948.000000


__People__

In [96]:
# Descriptive Statistics
people_descriptive_stats = people.describe()
print("Descriptive Statistics for People Data:")
print(people_descriptive_stats)

Descriptive Statistics for People Data:
                  id
count  226709.000000
mean   113355.000000
std     65445.395426
min         1.000000
25%     56678.000000
50%    113355.000000
75%    170032.000000
max    226709.000000


__Offices__

In [97]:
# Descriptive Statistics
offices_descriptive_stats = offices.describe()
print("Descriptive Statistics for Offices Data:")
print(offices_descriptive_stats)


Descriptive Statistics for Offices Data:
           office_id       latitude      longitude
count  112718.000000  112718.000000  112718.000000
mean    64892.879859      10.901433     -16.329223
std     37224.327217      18.977868      43.979205
min         1.000000     -43.767554    -159.480262
25%     32510.250000       0.000000       0.000000
50%     64901.500000       0.000000       0.000000
75%     97847.750000      30.837705       0.000000
max    127850.000000      69.650235     176.916281
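
The median latitude and longitude are both 0.0, which suggests that (0.0, 0.0) may be a placeholder for unknown coordinates rather than a real location; the Palo Alto row in `offices.head()` is one such record. A minimal sketch, assuming that interpretation holds, replaces those pairs with NaN so they no longer distort the statistics. `toy_offices` is a made-up frame built from a few of the rows shown earlier.

```python
import numpy as np
import pandas as pd

toy_offices = pd.DataFrame({
    "city": ["Seattle", "Palo Alto", "Menlo Park"],
    "latitude": [47.603122, 0.0, 37.41605],
    "longitude": [-122.333253, 0.0, -122.151801],
})

# Assumption: an exact (0.0, 0.0) pair marks unknown coordinates
placeholder = (toy_offices["latitude"] == 0.0) & (toy_offices["longitude"] == 0.0)
toy_offices.loc[placeholder, ["latitude", "longitude"]] = np.nan

# describe() skips NaN, so the summary now reflects real coordinates only
print(toy_offices[["latitude", "longitude"]].describe())
```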


__Degrees__

In [98]:
# Descriptive Statistics
degrees_descriptive_stats = degrees.describe()
print("Descriptive Statistics for Degrees Data:")
print(degrees_descriptive_stats)

Descriptive Statistics for Degrees Data:
                 id
count  109610.00000
mean    54805.50000
std     31641.82584
min         1.00000
25%     27403.25000
50%     54805.50000
75%     82207.75000
max    109610.00000


__Relationships__

In [99]:

# Descriptive Statistics
relationships_descriptive_stats = relationships.describe()
print("Descriptive Statistics for Relationships Data:")
print(relationships_descriptive_stats)

Descriptive Statistics for Relationships Data:
       relationship_id
count    402878.000000
mean     241836.986184
std      139486.004688
min           1.000000
25%      120457.250000
50%      239798.500000
75%      364355.750000
max      480909.000000


__Funding Rounds__

In [111]:
funding_rounds.describe()

Unnamed: 0,id,funding_round_id,raised_amount_usd,raised_amount,pre_money_valuation_usd,pre_money_valuation,post_money_valuation_usd,post_money_valuation,participants,is_first_round,is_last_round
count,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0,52928.0
mean,28962.894536,28962.894536,7946092.0,8056120.0,329452.5,329452.5,1824359.0,1862279.0,1.528567,0.604576,0.604538
std,16821.871803,16821.871803,42168200.0,44799140.0,65318030.0,65318030.0,128706500.0,128768600.0,2.060192,0.488946,0.488954
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,14343.75,14343.75,246330.0,250000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,28885.5,28885.5,1600000.0,1565056.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
75%,43561.25,43561.25,6700000.0,6600000.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0
max,57952.0,57952.0,3835050000.0,3835050000.0,15000000000.0,15000000000.0,24324230000.0,24324230000.0,36.0,1.0,1.0
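
The mean `raised_amount_usd` (about 7.9 million) is far above the median (1.6 million), which points to a long right tail. One common way to compress such a tail before modeling is a log transform. The sketch below illustrates the idea on made-up amounts that only imitate the spread of the real column; it is not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# Made-up amounts with one very large round, imitating the shape of raised_amount_usd
amounts = pd.Series(
    [0.0, 250_000.0, 1_600_000.0, 6_700_000.0, 3_835_050_000.0],
    name="raised_amount_usd",
)

# log(1 + x) keeps zero amounts defined
log_amounts = np.log1p(amounts)

print("skew before:", amounts.skew())
print("skew after: ", log_amounts.skew())
```

`np.log1p` is used instead of `np.log` because the column contains zeros (its minimum is 0.0), for which a plain logarithm is undefined.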


## Visualizations

Investments

People

Degrees

Relationships

## Key Insights
This section highlights preliminary insights and trends from the exploratory data analysis that could inform the predictive model for startup funding success.

### People Data Insights
- **Distribution of Roles**: Most individuals are founders or executives, which could indicate the importance of leadership roles in securing funding.
- **Geographic Distribution**: A majority of the people are located in major startup hubs like Silicon Valley, New York, and London.

### Investments Data Insights
- **Funding Amounts**: Most rounds raise modest amounts, while a few raise far more: in `raised_amount_usd` the median is about $1.6M, the mean about $7.9M, and the maximum about $3.8B. This indicates a strongly right-skewed distribution.
- **Number of Funding Rounds**: Startups typically go through only a few funding rounds (about 60% of the recorded rounds are flagged `is_first_round`), but some raise many rounds, indicating sustained growth and interest from investors.
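
The two points above can be turned into per-company features by grouping the funding rounds on `object_id`. The sketch below uses the five rows shown in `funding_rounds.head()`; the feature names `n_rounds` and `total_raised_usd` are hypothetical.

```python
import pandas as pd

# Rows copied from funding_rounds.head(): company id and amount raised per round
toy_rounds = pd.DataFrame({
    "object_id": ["c:4", "c:5", "c:5", "c:5", "c:7299"],
    "raised_amount_usd": [8_500_000.0, 500_000.0, 12_700_000.0,
                          27_500_000.0, 10_500_000.0],
})

# One row per company: number of rounds and total amount raised
per_company = toy_rounds.groupby("object_id").agg(
    n_rounds=("raised_amount_usd", "size"),
    total_raised_usd=("raised_amount_usd", "sum"),
)
print(per_company)
```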

### Offices Data Insights
- **Geographic Spread**: Offices are concentrated in tech hubs such as San Francisco, New York, and London. This concentration can indicate a higher likelihood of securing funding due to proximity to investors.
- **Office Size and Capacity**: The size and capacity of offices could indicate a startup's growth stage and operational scale. The offices table loaded here has no size or capacity columns, so testing this would require additional data.

### Degrees Data Insights
- **Popular Degrees**: Computer Science is the most frequent subject in the degrees data (6,001 records) and BS the most frequent degree type (23,425 records); Business Administration is also common. This highlights the importance of technical and managerial expertise in securing funding.
- **Institution Reputation**: Degrees from prestigious institutions like Stanford and MIT are prevalent, suggesting that educational background can play a significant role in attracting investors.

### Relationships Data Insights
- **Network Strength**: Strong networks, indicated by a high number of relationships, can positively influence the likelihood of securing funding. Founders with extensive connections are more likely to gain investor trust.
- **Type of Relationships**: Mentorship and advisory roles are significant, showing that guidance from experienced professionals can boost a startup's credibility.
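
Network strength has not yet been quantified in this notebook. One possible starting point, sketched below on made-up rows, is to count the distinct people linked to each organization and how many of those links are current. The feature names `n_people` and `n_current` are hypothetical, and reading `is_past == 0` as "current" is an assumption about the column.

```python
import pandas as pd

# Toy rows imitating the cleaned relationships table
toy_rel = pd.DataFrame({
    "person_object_id": ["p:2", "p:3", "p:4", "p:5", "p:2"],
    "relationship_object_id": ["c:1", "c:1", "c:3", "c:3", "c:3"],
    "is_past": [0, 1, 0, 1, 0],
})

# Hypothetical per-organization features
network = toy_rel.groupby("relationship_object_id").agg(
    n_people=("person_object_id", "nunique"),              # distinct people ever linked
    n_current=("is_past", lambda s: int((s == 0).sum())),  # links assumed to be current
)
print(network)
```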


### Next Steps for Predictive Modeling

- **Data Processing**: Further clean and preprocess the data, addressing any remaining issues.
- **Feature Engineering**: Create meaningful features that capture the key insights from the EDA.
- **Baseline Modeling**: Develop initial predictive models using algorithms such as logistic regression, decision trees, and random forests.
- **Model Evaluation**: Evaluate the models using appropriate metrics like accuracy, precision, and recall.
- **Iteration and Improvement**: Refine the models based on evaluation results and incorporate additional data as needed.
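
As a reference for the evaluation step, the sketch below computes accuracy, precision, and recall directly from their definitions. `binary_metrics` is a hypothetical helper, the labels and predictions are made up, and treating 1 as "funded" is an assumption, since the target variable has not been defined yet.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels (1 = funded, by assumption)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    accuracy = float(np.mean(y_true == y_pred))
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels and predictions, for illustration only
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

If funded and unfunded startups turn out to be heavily imbalanced, accuracy alone can be misleading, which is why precision and recall are reported alongside it.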