# Crunchbase Investor Fundraising Analysis

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups went to the site and released the data online. Since the information on the startups and their fundraising rounds is always changing, the dataset we'll be using isn't completely up to date.

The dataset of investments we'll be exploring is from October 2013. You can download it from GitHub. Here's a preview:

| company_permalink    | company_name | company_category_code | company_country_code | company_state_code | company_region        | company_city     | investor_permalink    | investor_name       | investor_category_code | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at   | funded_month | funded_quarter | funded_year | raised_amount_usd |
|----------------------|--------------|-----------------------|----------------------|-------------------|-----------------------|------------------|-----------------------|---------------------|------------------------|-----------------------|---------------------|----------------|---------------|--------------------|-------------|--------------|----------------|-------------|-------------------|
| /company/advercar     | AdverCar     | advertising           | USA                  | CA                | SF Bay                | San Francisco    | /company/1-800-flowers-com | 1-800-FLOWERS.COM   | NaN                    | USA                   | NY                  | New York       | New York      | series-a           | 2012-10-30  | 2012-10      | 2012-Q4        | 2012.0      | 2000000.0         |
| /company/launchgram   | LaunchGram   | news                  | USA                  | CA                | SF Bay                | Mountain View    | /company/10xelerator     | 10Xelerator         | finance                | USA                   | OH                  | Columbus       | Columbus      | other              | 2012-01-23  | 2012-01      | 2012-Q1        | 2012.0      | 20000.0           |
| /company/utap         | uTaP         | messaging             | USA                  | NaN               | United States - Other | NaN              | /company/10xelerator     | 10Xelerator         | finance                | USA                   | OH                  | Columbus       | Columbus      | other              | 2012-01-01  | 2012-01      | 2012-Q1        | 2012.0      | 20000.0           |

Throughout this project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv consumes 10.3 megabytes of disk space, we know that pandas often requires 4 to 6 times as much space in memory as the file does on disk (especially when there are multiple string columns).
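One way to pick a chunk size for a given memory budget is to measure a small sample and extrapolate. The sketch below is only an illustration: the helper name `estimate_chunksize`, the `safety` factor, and the toy CSV are assumptions, not part of the original notebook, which simply uses `chunksize=5000`.

```python
import io
import pandas as pd

def estimate_chunksize(csv_source, memory_budget_mb, sample_rows=1000, safety=0.5, **read_kwargs):
    """Estimate how many rows fit in the memory budget, based on a small sample.

    `safety` (hypothetical parameter) reserves part of the budget for temporary copies.
    """
    sample = pd.read_csv(csv_source, nrows=sample_rows, **read_kwargs)
    # deep=True counts the actual string payloads, not just the pointers
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    budget_bytes = memory_budget_mb * 1024 * 1024 * safety
    return max(1, int(budget_bytes // bytes_per_row))

# Toy stand-in for crunchbase-investments.csv
toy_csv = "name,city,amount\n" + "\n".join(
    f"company_{i},San Francisco,{i * 1000.0}" for i in range(200)
)
rows = estimate_chunksize(io.StringIO(toy_csv), memory_budget_mb=1, sample_rows=100)
print(rows)
```

With the real file, the path and `encoding='ISO-8859-1'` would be passed the same way as in `pd.read_csv`.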


## Missing values and memory

Let's take a look at the missing values and the memory usage. We will also drop columns that either have too many missing values or would not be useful for analysis.

In [1]:
import pandas as pd
pd.options.display.max_columns = 99
chunked_df = pd.read_csv('D:/Library/datasci/datasets/crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

In [2]:
missing = []

for chunk in chunked_df:
    missing.append(chunk.isnull().sum())
    
missing_df = pd.concat(missing)
missing_df.groupby(by = missing_df.index).sum().sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
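To judge which columns are mostly empty, the raw counts can be divided by the total number of rows. The following is a minimal sketch of that idea on toy data; the helper name `missing_fraction` and the 0.7 threshold are illustrative assumptions.

```python
import io
import pandas as pd

def missing_fraction(chunks):
    """Return the fraction of missing values per column, accumulated across chunks."""
    missing_total = None
    n_rows = 0
    for chunk in chunks:
        counts = chunk.isnull().sum()
        missing_total = counts if missing_total is None else missing_total + counts
        n_rows += len(chunk)
    return (missing_total / n_rows).sort_values()

# Toy data: column b is missing in 3 of 4 rows, column c in 1 of 4
toy_csv = "a,b,c\n1,,x\n2,,y\n3,,\n4,5,z\n"
fractions = missing_fraction(pd.read_csv(io.StringIO(toy_csv), chunksize=2))
print(fractions)

# Columns above the (hypothetical) threshold are candidates for dropping
mostly_missing = fractions[fractions > 0.7].index.tolist()
print(mostly_missing)
```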

In [3]:
chunked_df = pd.read_csv('D:/Library/datasci/datasets/crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
memory = pd.Series(dtype = 'float64')
i = 0 
for chunk in chunked_df:
    if i == 0:
        memory = chunk.memory_usage(deep = True)
    else:
        memory += chunk.memory_usage(deep = True)
    i += 1
        
memory.drop('Index', inplace = True)
memory

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411545
company_city              3505886
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

In [4]:
memory.sum()/(1024 * 1024)

56.98753070831299

In [5]:
# Drop columns representing URLs or containing too many missing values (>90% missing)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
keep_cols


Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at', 'funded_month',
       'funded_quarter', 'funded_year', 'raised_amount_usd'],
      dtype='object')

## Unify data types



In [6]:

col_types = {}
chunked_df = pd.read_csv('D:/Library/datasci/datasets/crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols = keep_cols.tolist())

for chunk in chunked_df:
    for col in chunk:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))


In [7]:
unique_cols = {}

for key, value in col_types.items():
    unique_cols[key] = set(col_types[key])

unique_cols
    

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
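Mixed dtypes like these typically appear when one chunk happens to contain only missing values in a column, or no missing values in an integer column, so pandas infers a different type for that chunk. A minimal sketch of the effect and one possible fix (passing explicit dtypes to `pd.read_csv`) follows; the toy CSV and the helper `chunk_dtypes` are assumptions for illustration.

```python
import io
import pandas as pd

# Toy data: the first chunk has no city values and no missing years
toy_csv = "investor_city,funded_year\n,2012\n,2011\nColumbus,\nNew York,2013\n"

def chunk_dtypes(**read_kwargs):
    """Collect the set of dtypes seen for each column across chunks."""
    seen = {}
    for chunk in pd.read_csv(io.StringIO(toy_csv), chunksize=2, **read_kwargs):
        for col, dtype in chunk.dtypes.items():
            seen.setdefault(col, set()).add(str(dtype))
    return seen

# Inferred types differ from chunk to chunk
print(chunk_dtypes())

# Explicit types keep every chunk consistent
explicit = {"investor_city": "object", "funded_year": "float64"}
print(chunk_dtypes(dtype=explicit))
```

Using `float64` for `funded_year` keeps room for missing values; the notebook's own approach of converting after cleaning works as well.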

In [8]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')
size = []

categorical_cols = {
    'company_name': 'category',
    'company_category_code': 'category',
    'company_country_code': 'category',
    'company_state_code': 'category',
    'company_region': 'category',
    'company_city': 'category',
    'investor_name': 'category',
    'investor_country_code': 'category',
    'investor_state_code': 'category',
    'investor_region': 'category',
    'investor_city': 'category',
    'funding_round_type': 'category',
}

chunk_df = pd.read_csv('D:/Library/datasci/datasets/crunchbase-investments.csv',
                       chunksize=5000, encoding='ISO-8859-1',
                      dtype = categorical_cols,
                      parse_dates = ['funded_at'])

for chunk in chunk_df:
    # Note: dropna() runs before the column drop, so rows missing any value,
    # including the mostly empty investor_category_code, are discarded here.
    chunk = chunk.dropna()
    chunk = chunk.drop(drop_cols, axis=1)
    # Keep only the month number and the quarter label (e.g. '10' and 'Q4').
    funded_month = chunk["funded_month"].str.split("-").str[-1]
    funded_quarter = chunk["funded_quarter"].str.split("-").str[-1]
    chunk["funded_month"] = pd.to_numeric(funded_month, downcast="signed")
    chunk["funded_year"] = chunk["funded_year"].astype("int32")
    chunk["funded_quarter"] = funded_quarter.astype("category")
    # Track per-chunk memory (in MB) and append the chunk to the SQLite table.
    size.append(chunk.memory_usage(deep=True).sum() / (1024 ** 2))
    chunk.to_sql("investments", conn, if_exists="append", index=False)

print("size of data {:.2f} mb".format(sum(size)))


size of data 4.93 mb
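One thing to watch with `if_exists='append'`: re-running the loading cell adds every row to the `investments` table again, which would inflate the counts and sums in later queries. A minimal sketch of one way to guard against this, dropping the table before loading, is shown below on toy data. The helper names `load_chunks` and `toy_chunks` are assumptions, not part of the original notebook.

```python
import sqlite3
import pandas as pd

def load_chunks(chunks, conn, table="investments"):
    """Write chunks to `table`, first removing any rows left from an earlier run."""
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    total = 0
    for chunk in chunks:
        chunk.to_sql(table, conn, if_exists="append", index=False)
        total += len(chunk)
    return total

conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({"company_name": ["A", "B", "C"],
                    "raised_amount_usd": [1.0, 2.0, 3.0]})

def toy_chunks():
    """Yield the toy frame two rows at a time, mimicking chunked reading."""
    for start in range(0, len(toy), 2):
        yield toy.iloc[start:start + 2]

load_chunks(toy_chunks(), conn)
load_chunks(toy_chunks(), conn)  # simulate re-running the cell

count = pd.read_sql("SELECT COUNT(*) AS n FROM investments", conn).iloc[0, 0]
print(count)  # 3, not 6
```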


In [9]:
results_df = pd.read_sql('PRAGMA table_info(investments);', conn)
print(results_df)

    cid                   name     type  notnull dflt_value  pk
0     0           company_name     TEXT        0       None   0
1     1  company_category_code     TEXT        0       None   0
2     2   company_country_code     TEXT        0       None   0
3     3     company_state_code     TEXT        0       None   0
4     4         company_region     TEXT        0       None   0
5     5           company_city     TEXT        0       None   0
6     6          investor_name     TEXT        0       None   0
7     7  investor_country_code     TEXT        0       None   0
8     8    investor_state_code     TEXT        0       None   0
9     9        investor_region     TEXT        0       None   0
10   10          investor_city     TEXT        0       None   0
11   11     funding_round_type     TEXT        0       None   0
12   12              funded_at     TEXT        0       None   0
13   13           funded_month     TEXT        0       None   0
14   14         funded_quarter     TEXT 

# Analysis

Now we will run some queries and see if we can answer the following:
- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
- Which category of company attracted the most investments?
- Which investor contributed the most money (across all startups)?
- Which investors contributed the most money per startup?
- Which funding round was the most popular? Which was the least popular?

### Top funding amounts
Let's find out the total amount of funding that went to the top 1% and top 10% of funded companies.

In [33]:


query = """
SELECT *
FROM investments
LIMIT 5
"""
pd.read_sql(query,conn)

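|   | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd |
|---|--------------|-----------------------|----------------------|--------------------|----------------|--------------|---------------|-----------------------|---------------------|-----------------|---------------|--------------------|-----------|--------------|----------------|-------------|-------------------|
| 0 | AdverCar | advertising | USA | CA | SF Bay | San Francisco | 1-800-FLOWERS.COM | USA | NY | New York | New York | series-a | 2012-10-30 | 2012-10 | 2012-Q4 | 2012 | 2000000.0 |
| 1 | LaunchGram | news | USA | CA | SF Bay | Mountain View | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-23 | 2012-01 | 2012-Q1 | 2012 | 20000.0 |
| 2 | uTaP | messaging | USA |  | United States - Other |  | 10Xelerator | USA | OH | Columbus | Columbus | other | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 20000.0 |
| 3 | ZoopShop | software | USA | OH | Columbus | columbus | 10Xelerator | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 2012-02 | 2012-Q1 | 2012 | 20000.0 |
| 4 | eFuneral | web | USA | OH | Cleveland | Cleveland | 10Xelerator | USA | OH | Columbus | Columbus | other | 2011-09-08 | 2011-09 | 2011-Q3 | 2011 | 20000.0 |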


In [34]:
query = """
SELECT company_name,
CAST(sum(raised_amount_usd) as double)/(SELECT sum(raised_amount_usd) FROM investments) AS fund_percent,
CAST(sum(raised_amount_usd) as bigint) AS total_raised
FROM investments 
GROUP BY company_name
ORDER BY total_raised DESC
LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * 0.1 as int) FROM investments)
"""
pd.read_sql(query,conn)

|      | company_name | fund_percent | total_raised |
|------|--------------|--------------|--------------|
| 0 | Clearwire | 0.116978 | 341560000000 |
| 1 | Groupon | 0.013281 | 38777200000 |
| 2 | Nanosolar | 0.006651 | 19420000000 |
| 3 | Facebook | 0.006337 | 18502500000 |
| 4 | LivingSocial | 0.005926 | 17303000000 |
| ... | ... | ... | ... |
| 1152 | Carbonite | 0.000184 | 538000000 |
| 1153 | Agilyx | 0.000184 | 538000000 |
| 1154 | Zeltiq Aesthetics | 0.000184 | 537400000 |
| 1155 | Rally.org | 0.000184 | 537200000 |


In [67]:
query = """
SELECT SUM(total_raised)
FROM (
    SELECT company_name, CAST(SUM(raised_amount_usd) as bigint) AS total_raised
    FROM investments 
    GROUP BY company_name
    ORDER BY total_raised DESC
    LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * 0.1 as int) FROM investments)
) AS subquery
"""
top_10_sum = pd.read_sql(query,conn)
print("Top 10% of companies were funded a total of {:.5} billion".format(top_10_sum.iloc[0,0]/1000000000))

Top 10% of companies were funded a total of 2066.9 billion


In [68]:
query = """
SELECT SUM(total_raised)
FROM (
    SELECT company_name, CAST(SUM(raised_amount_usd) as bigint) AS total_raised
    FROM investments 
    GROUP BY company_name
    ORDER BY total_raised DESC
    LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * 0.01 as int) FROM investments)
) AS subquery
"""
top_1_sum = pd.read_sql(query,conn)
print("Top 1% of companies were funded a total of {:.4} billion".format(top_1_sum.iloc[0,0]/1000000000))

Top 1% of companies were funded a total of 953.8 billion


### Bottom funding amounts
Similarly, we can see how much the bottom companies received.

In [71]:
query = """
SELECT SUM(total_raised)
FROM (
    SELECT company_name, CAST(SUM(raised_amount_usd) as bigint) AS total_raised
    FROM investments 
    WHERE raised_amount_usd IS NOT NULL AND raised_amount_usd != ''
    GROUP BY company_name
    HAVING total_raised IS NOT NULL
    ORDER BY total_raised ASC
    LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * 0.1 as int) FROM investments WHERE raised_amount_usd IS NOT NULL AND raised_amount_usd != '')
) AS subquery
WHERE total_raised IS NOT NULL

"""
bottom_10_sum = pd.read_sql(query,conn)
print("Bottom 10% of companies were funded a total of {:.4} billion".format(bottom_10_sum.iloc[0,0]/1000000000))

Bottom 10% of companies were funded a total of 0.7396 billion


In [72]:
query = """
SELECT SUM(total_raised)
FROM (
    SELECT company_name, CAST(SUM(raised_amount_usd) as bigint) AS total_raised
    FROM investments 
    WHERE raised_amount_usd IS NOT NULL AND raised_amount_usd != ''
    GROUP BY company_name
    HAVING total_raised IS NOT NULL
    ORDER BY total_raised ASC
    LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * 0.01 as int) FROM investments WHERE raised_amount_usd IS NOT NULL AND raised_amount_usd != '')
) AS subquery
WHERE total_raised IS NOT NULL

"""
bottom_1_sum = pd.read_sql(query,conn)
print("Bottom 1% of companies were funded a total of {:.2} billion".format(bottom_1_sum.iloc[0,0]/1000000000))

Bottom 1% of companies were funded a total of 0.0061 billion


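There is a gigantic discrepancy between the highest- and lowest-funded companies in total money raised: by the totals above, the top 10% of companies raised about 2,066.9 billion while the bottom 10% raised about 0.74 billion, and the top 1% raised about 953.8 billion against roughly 0.0061 billion for the bottom 1%.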
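To express these totals as proportions of all funds raised, each sum can be divided by the overall total in the same query. The sketch below shows the idea on a toy table; the helper `funding_share` and the `SHARE_QUERY` template are illustrative assumptions that reuse the grouping logic of the queries above.

```python
import sqlite3
import pandas as pd

SHARE_QUERY = """
SELECT SUM(total_raised) * 1.0 / (SELECT SUM(raised_amount_usd) FROM investments) AS share
FROM (
    SELECT company_name, SUM(raised_amount_usd) AS total_raised
    FROM investments
    WHERE raised_amount_usd IS NOT NULL
    GROUP BY company_name
    ORDER BY total_raised {direction}
    LIMIT (SELECT CAST(COUNT(DISTINCT company_name) * ? AS int)
           FROM investments WHERE raised_amount_usd IS NOT NULL)
)
"""

def funding_share(conn, fraction, top=True):
    """Share of all funds raised by the top (or bottom) `fraction` of companies."""
    direction = "DESC" if top else "ASC"
    query = SHARE_QUERY.format(direction=direction)
    return pd.read_sql(query, conn, params=(fraction,)).iloc[0, 0]

# Toy table: ten companies raising 1, 2, ..., 10 (total 55)
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "company_name": [f"c{i}" for i in range(1, 11)],
    "raised_amount_usd": [float(i) for i in range(1, 11)],
})
toy.to_sql("investments", conn, index=False)

top_share = funding_share(conn, 0.1, top=True)
bottom_share = funding_share(conn, 0.1, top=False)
print(f"top 10%: {top_share:.3f}, bottom 10%: {bottom_share:.3f}")
```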

### Most investments by Company Type

Let's look at which industries attracted the most investments by count and which received the most money.


In [81]:
query = """
SELECT 
    company_category_code, 
    COUNT(*) AS investment_frequency, 
    SUM(raised_amount_usd) AS total_raised, 
    SUM(raised_amount_usd) / COUNT(*) AS average_investment
FROM investments 
GROUP BY company_category_code
ORDER BY investment_frequency DESC
;
"""
pd.read_sql(query,conn)

|   | company_category_code | investment_frequency | total_raised | average_investment |
|---|-----------------------|----------------------|--------------|--------------------|
| 0 | software | 27952 | 293872200000.0 | 10513460.0 |
| 1 | biotech | 19197 | 428360900000.0 | 22313950.0 |
| 2 | web | 19077 | 165185200000.0 | 8658866.0 |
| 3 | enterprise | 17306 | 182004700000.0 | 10516850.0 |
| 4 | mobile | 15388 | 482424100000.0 | 31350670.0 |
| 5 | advertising | 12326 | 102832500000.0 | 8342727.0 |
| 6 | ecommerce | 8253 | 95530060000.0 | 11575190.0 |
| 7 | cleantech | 7588 | 206890600000.0 | 27265500.0 |
| 8 | games_video | 7208 | 77001030000.0 | 10682720.0 |
| 9 | analytics | 7072 | 52529030000.0 | 7427747.0 |


From the table above, software and then biotech attracted the most investments by count. Among the categories shown, however, mobile received the most total funding, followed by biotech, and mobile also had the highest average investment, followed by cleantech and then biotech.

In general, software, biotech, and mobile all seem like strong contenders for investment.

### Top investor

We can now take a look at the top investors by frequency and by total investment.

In [85]:
query = """
SELECT investor_name, count(*) as frequency
    FROM investments
    GROUP BY investor_name
    HAVING investor_name IS NOT NULL
    ORDER BY  frequency DESC
    LIMIT 5
"""
pd.read_sql(query,conn)

|   | investor_name | frequency |
|---|---------------|-----------|
| 0 | Techstars | 2310 |
| 1 | New Enterprise Associates | 1780 |
| 2 | Kleiner Perkins Caufield & Byers | 1572 |
| 3 | Y Combinator | 1512 |
| 4 | Draper Fisher Jurvetson (DFJ) | 1440 |


In [86]:
query = """
SELECT investor_name, COUNT(*) as frequency, SUM(raised_amount_usd) as investment
    FROM investments
    GROUP BY investor_name
    HAVING investor_name IS NOT NULL
    ORDER BY  investment DESC
    LIMIT 5
"""
pd.read_sql(query,conn)

|   | investor_name | frequency | investment |
|---|---------------|-----------|------------|
| 0 | Intel | 207 | 77599600000.0 |
| 1 | Google | 268 | 75514400000.0 |
| 2 | Time Warner | 147 | 74490000000.0 |
| 3 | Comcast | 117 | 73697000000.0 |
| 4 | BrightHouse | 26 | 61100000000.0 |
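The question of which investors contributed the most money per startup can be approached by dividing each investor's total by the number of distinct companies they backed. The following is a minimal sketch on toy data. It assumes, as the queries above do, that each row's `raised_amount_usd` can be attributed to the investor on that row. The `PER_STARTUP_QUERY` name and the toy rows are illustrative.

```python
import sqlite3
import pandas as pd

PER_STARTUP_QUERY = """
SELECT investor_name,
       COUNT(DISTINCT company_name) AS startups,
       SUM(raised_amount_usd) AS investment,
       SUM(raised_amount_usd) / COUNT(DISTINCT company_name) AS investment_per_startup
FROM investments
WHERE investor_name IS NOT NULL AND raised_amount_usd IS NOT NULL
GROUP BY investor_name
ORDER BY investment_per_startup DESC
LIMIT 5
"""

# Toy table: X backs two startups (three rounds), Y backs one, Z backs two
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "investor_name": ["X", "X", "X", "Y", "Z", "Z"],
    "company_name": ["A", "A", "B", "C", "D", "E"],
    "raised_amount_usd": [100.0, 50.0, 50.0, 150.0, 10.0, 20.0],
})
toy.to_sql("investments", conn, index=False)

per_startup = pd.read_sql(PER_STARTUP_QUERY, conn)
print(per_startup)
```

Counting distinct companies rather than rows keeps an investor who joins several rounds of the same startup from being counted as backing several startups.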


### Funding Round Popularity

In [88]:


query="""
SELECT 
    funding_round_type,
    COUNT(*) AS frequency
    FROM investments
    GROUP BY funding_round_type
    ORDER BY frequency DESC
    LIMIT 5
    """
pd.read_sql(query,conn)


|   | funding_round_type | frequency |
|---|--------------------|-----------|
| 0 | series-a | 51499 |
| 1 | series-c+ | 42413 |
| 2 | angel | 34902 |
| 3 | venture | 34250 |
| 4 | series-b | 34054 |


In [94]:
query="""
SELECT 
    funding_round_type,
    COUNT(*) AS frequency
    FROM investments
    WHERE funding_round_type IS NOT NULL
    GROUP BY funding_round_type
    ORDER BY frequency ASC
    LIMIT 5
    """
pd.read_sql(query,conn)

|   | funding_round_type | frequency |
|---|--------------------|-----------|
| 0 | crowdfunding | 29 |
| 1 | post-ipo | 246 |
| 2 | private-equity | 1395 |
| 3 | other | 4015 |
| 4 | series-b | 34054 |


The most popular funding round type is series-a (51,499 investments), and the least popular is crowdfunding (29).

# Conclusions

Above, we worked with Crunchbase data on startup funding deals. We processed the data in chunks to reduce memory usage, which can be extremely important when working with particularly large datasets.

Once the data was processed, we performed a simple analysis summarizing which kinds of companies attract funding, how much funding is out there, who provides it, and which funding rounds are the most popular.
