# Analyzing Startup Fundraising Deals from Crunchbase

In this course, we explored several ways to work with larger datasets in pandas. In this guided project, we'll practice some of those techniques by analyzing startup investments from Crunchbase.com.
Throughout this guided project, we'll practice working under different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv takes up 10.3 megabytes of disk space, we know from earlier missions that pandas often needs 4 to 6 times as much space in memory as the file occupies on disk (especially when there are many string columns).
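
As a quick sanity check, the rule of thumb above can be turned into a back-of-the-envelope estimate. This is only a rough sketch: the 4x and 6x multipliers come from the text, and the variable names are made up for illustration.

```python
# Rough in-memory estimate from the on-disk size, using the 4x to 6x rule of thumb.
disk_mb = 10.3           # size of crunchbase-investments.csv on disk, from the text
memory_budget_mb = 10    # the memory constraint assumed in this step

low, high = 4 * disk_mb, 6 * disk_mb
fits_in_budget = high <= memory_budget_mb

print(f"Estimated footprint: {low:.1f} to {high:.1f} MB")
print("Fits in budget:", fits_in_budget)
```

Even the low end of the estimate is several times the budget, which is why the file has to be processed in chunks.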

## Introduction

Because the dataset contains over 50,000 rows, you'll need to read it into dataframes in 5,000-row chunks so that each chunk stays under 10 megabytes of memory.
Across all of the chunks, become familiar with:
* Each column's missing value counts
* Each column's memory footprint
* The total memory footprint of all of the chunks combined
* Which column(s) we can drop because they aren't useful for analysis
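
One way to collect the first three items in a single pass is sketched below. The helper name `profile_chunks` and the tiny in-memory CSV are made up for illustration; with the real file you would pass the iterator returned by `pd.read_csv(..., chunksize=5000)` instead.

```python
import io
import pandas as pd

def profile_chunks(chunk_iter):
    """Accumulate per-column missing counts and memory (MB) across chunks."""
    missing = None
    memory_mb = None
    for chunk in chunk_iter:
        chunk_missing = chunk.isnull().sum()
        chunk_memory = chunk.memory_usage(deep=True) / (1024 * 1024)
        # fill_value=0 keeps a label that is absent on one side from becoming NaN
        missing = chunk_missing if missing is None else missing.add(chunk_missing, fill_value=0)
        memory_mb = chunk_memory if memory_mb is None else memory_mb.add(chunk_memory, fill_value=0)
    return missing.astype(int), memory_mb

# Made-up stand-in for the real CSV so the sketch runs on its own.
sample_csv = io.StringIO(
    "company_name,raised_amount_usd\n"
    "Acme,1000\n"
    "Beta,\n"
    ",2500\n"
    "Delta,\n"
)
missing, memory_mb = profile_chunks(pd.read_csv(sample_csv, chunksize=2))
print(missing)
print("Total MB:", memory_mb.sum())
```

Only one chunk is held in memory at a time, so the running totals stay small no matter how large the file is.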

In [2]:
import pandas as pd
import numpy as np

In [3]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/(1024*1024))  

5.579195022583008
5.528186798095703
5.535004615783691
5.528162956237793
5.5243072509765625
5.553412437438965
5.531391143798828
5.509613037109375
5.396090507507324
4.63945198059082
2.663668632507324


In [4]:
sum_chunk = 0
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
for chunk in chunk_iter:
    sum_chunk += chunk.memory_usage(deep=True).sum()/(1024*1024)  
print(sum_chunk)    
      

56.988484382629395


The memory footprint of each column

In [13]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
memory_series = None
for chunk in chunk_iter:
    chunk_memory = chunk.memory_usage(deep=True)/(1024*1024)
    if memory_series is None:
        memory_series = chunk_memory
    else:
        memory_series += chunk_memory

memory_series

Index                     0.000877
company_permalink         3.869808
company_name              3.424955
company_category_code     3.262619
company_country_code      3.025223
company_state_code        2.962161
company_region            3.253541
company_city              3.343512
investor_permalink        4.749821
investor_name             3.734270
investor_category_code    0.593590
investor_country_code     2.524654
investor_state_code       2.361876
investor_region           3.238946
investor_city             2.751430
funding_round_type        3.252704
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
raised_amount_usd         0.403366
dtype: float64

Each column's missing value counts

In [15]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

mv_list = []
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())
    
combined_mv_vc = pd.concat(mv_list)
unique_combined_mv_vc = combined_mv_vc.groupby(combined_mv_vc.index).sum()
unique_combined_mv_vc.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64

Which columns can we drop because they aren't useful for analysis?

In [16]:
# Drop columns that hold URLs or are missing in more than 90% of rows
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
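
The 90% rule in the comment above could also be applied programmatically rather than by eye. The sketch below assumes you already have a Series of missing counts and know the total row count; `mostly_missing` is a hypothetical helper, and the counts are invented for a 1,000-row table rather than taken from the Crunchbase file.

```python
import pandas as pd

def mostly_missing(missing_counts, total_rows, threshold=0.9):
    """Return the column names whose share of missing values exceeds threshold."""
    share = missing_counts / total_rows
    return share[share > threshold].index.tolist()

# Hypothetical counts for a 1,000-row table, not the real Crunchbase figures.
example_counts = pd.Series(
    {"investor_category_code": 950, "company_city": 10, "raised_amount_usd": 70}
)
to_drop = mostly_missing(example_counts, total_rows=1000)
print(to_drop)
```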

## Selecting Data Types

Now that we have a good sense of the missing values, let's get familiar with the column types before adding the data into SQLite.

Identify the types for each column.

In [17]:
# Key: Column name, Value: List of types
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

            
uniq_col_types = {}
for k,v in col_types.items():
    uniq_col_types[k] = set(col_types[k])
uniq_col_types

{'company_category_code': {'object'},
 'company_city': {'object'},
 'company_country_code': {'object'},
 'company_name': {'object'},
 'company_region': {'object'},
 'company_state_code': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'funding_round_type': {'object'},
 'investor_city': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_name': {'object'},
 'investor_region': {'object'},
 'investor_state_code': {'float64', 'object'},
 'raised_amount_usd': {'float64'}}
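
Several columns show up with two different types. A likely cause is that pandas infers types per chunk, and a column that happens to be entirely empty within one chunk is read as float64 (all NaN). The sketch below reproduces this with made-up data; the column names mirror the real file but the rows are invented.

```python
import io
import pandas as pd

# Made-up data: investor_city is empty in the first two rows and text afterwards.
sample_csv = (
    "investor_name,investor_city\n"
    "Fund A,\n"
    "Fund B,\n"
    "Fund C,Boston\n"
    "Fund D,Austin\n"
)

# Without a dtype, each chunk is inferred on its own.
seen = set()
for chunk in pd.read_csv(io.StringIO(sample_csv), chunksize=2):
    seen.add(str(chunk["investor_city"].dtype))
print(seen)  # the all-empty chunk comes back as float64, the other as a string type

# Passing an explicit dtype makes every chunk agree.
fixed = set()
for chunk in pd.read_csv(io.StringIO(sample_csv), chunksize=2,
                         dtype={"investor_city": "category"}):
    fixed.add(str(chunk["investor_city"].dtype))
print(fixed)
```

This is one reason to settle on explicit types before loading the chunks into a database.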

Identify the numeric columns we can represent using more space-efficient types.

In [18]:
chunk

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,
50005,HItviews,advertising,USA,NY,New York,New York City,multiple parties,,,unknown,,angel,2007-11-29,2007-11,2007-Q4,2007,485000.0
50006,LockerDome,social,USA,MO,Saint Louis,St. Louis,multiple parties,,,unknown,,angel,2012-04-17,2012-04,2012-Q2,2012,300000.0
50007,ThirdLove,ecommerce,USA,CA,SF Bay,San Francisco,Munjal Shah,,,unknown,,series-a,2012-12-01,2012-12,2012-Q4,2012,5600000.0
50008,Hakia,search,USA,,TBD,,Murat Vargi,,,unknown,,series-a,2006-11-01,2006-11,2006-Q4,2006,16000000.0
50009,bookacoach,sports,USA,IN,Indianapolis,Indianapolis,Myles Grote,,,unknown,,angel,2012-11-01,2012-11,2012-Q4,2012,


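Columns to change to more space-efficient types: **funded_year** and **raised_amount_usd** are the only numeric columns, while **funded_at**, **funded_month**, and **funded_quarter** are date-like strings that can be parsed as datetimes.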
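
For a numeric column such as funded_year, one option is `pd.to_numeric` with the `downcast` argument. The sketch below uses a handful of made-up year values. Note that in the real data funded_year has a few missing values, so it is read as float64 in at least one chunk and would need those handled first (for example with pandas' nullable `Int16` type). Dollar amounts are a less obvious candidate, since float32 keeps only about 7 significant digits.

```python
import pandas as pd

# Made-up sample values in the style of funded_year, with no missing entries.
funded_year = pd.Series([2007, 2008, 2010, 2012], dtype="int64")
smaller = pd.to_numeric(funded_year, downcast="integer")

bytes_before = funded_year.memory_usage(index=False)
bytes_after = smaller.memory_usage(index=False)
print(funded_year.dtype, bytes_before)
print(smaller.dtype, bytes_after)
```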

In [19]:
# How many unique values are there in each string column? Which string columns have a unique-value count below 20% of the roughly 50,000 rows?
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

uniques = {}
for chunk in chunk_iter:
    strings_only = chunk.select_dtypes(include=['object'])
    cols = strings_only.columns
    for c in cols:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

uniques_combined = {}
unique_stats = {
    'column_name': [],
    'total_values': [],
    'unique_values': [],
}

useful_obj_cols = []

for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    if (u_group.shape[0]/50000) < 0.2:
        useful_obj_cols.append(col)
        print(col, u_group.shape[0])

company_region 546
investor_region 585
investor_state_code 50
company_city 1229
company_state_code 50
company_country_code 2
funded_at 2808
investor_city 990
investor_country_code 72
funded_quarter 72
funding_round_type 9
company_category_code 43
funded_month 192


In [20]:
## Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)
str_cols_vc = {}
for chunk in chunk_iter:
    str_cols = chunk.select_dtypes(include=['object'])
    for col in str_cols.columns:
        current_col_vc = str_cols[col].value_counts()
        if col in str_cols_vc:
            str_cols_vc[col].append(current_col_vc)
        else:
            str_cols_vc[col] = [current_col_vc]

In [21]:
## Combine the value counts.
combined_vcs = {}

for col in str_cols_vc:
    combined_vc = pd.concat(str_cols_vc[col])
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    combined_vcs[col] = final_vc

In [22]:
for col in useful_obj_cols:
    print(col)
    print(combined_vcs[col])
    print("-----------")

company_region
2008                    1
Akron                  11
Alachua                19
Albuquerque            83
Allentown              20
Alliance                1
Ames                    1
Amherst                 9
Angier                  4
Appleton                3
Asheville               3
Ashford                 1
Ashland                 4
Atlanta               558
Atlantic Highlands      8
Auburn Hlls             2
Augusta                 1
Aurora                  3
Austin                947
Avon                    1
B                       5
Bakersfield             1
Bala Cynwyd             6
Baltimore              95
Bangalore               3
Bar Harbor              1
Barre                   2
Baton Rouge            14
Battleground            3
Bedford                 8
                     ... 
West Trenton            2
Westfield              14
Westport               17
Whippany                6
White River             3
Whiting                 1
Wilbraham              

Columns to change to category (with their unique value counts):
* company_region (546)
* investor_region (585)
* investor_state_code (50)
* company_city (1229)
* company_state_code (50)
* company_country_code (2)
* investor_city (990)
* investor_country_code (72)
* funding_round_type (9)
* company_category_code (43)
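
To see why low-cardinality string columns benefit from the category type, here is a small sketch on made-up data. A categorical column stores each distinct string once plus a compact integer code per row, so the saving grows with the number of repeats.

```python
import pandas as pd

# Made-up column with only three distinct values repeated many times.
round_type = pd.Series(["series-a", "angel", "series-b"] * 1000)
as_category = round_type.astype("category")

before = round_type.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"strings: {before} bytes, category: {after} bytes")
```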

Let's convert the following to datetime:
* funded_at
* funded_month
* funded_quarter
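
If you prefer not to rely on automatic date inference for the month and quarter labels, they can be converted explicitly. This is a sketch under the assumption that the labels always look like `2012-10` and `2012-Q4`; `quarter_start` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def quarter_start(label):
    """Convert a label such as '2012-Q4' to the first day of that quarter."""
    year, quarter = label.split("-Q")
    month = 3 * (int(quarter) - 1) + 1
    return pd.Timestamp(year=int(year), month=month, day=1)

# Made-up sample labels in the same style as funded_month and funded_quarter.
months = pd.to_datetime(pd.Series(["2012-10", "2007-11"]), format="%Y-%m")
quarters = pd.Series(["2012-Q4", "2008-Q2"]).map(quarter_start)
print(months.tolist())
print(quarters.tolist())
```

Each value becomes an 8-byte timestamp instead of a Python string, which is where the memory saving comes from.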


In [23]:
convert_col_dtypes = {
    "company_region": "category", "investor_region": "category",
    "investor_state_code": "category", "company_city": "category",
    "company_state_code": "category", "company_country_code": "category",
    "investor_city": "category", "investor_country_code": "category",
    "funding_round_type": "category", "company_category_code": "category",
}

In [24]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, dtype=convert_col_dtypes, 
                         parse_dates=["funded_at", "funded_month", "funded_quarter"])

new_mem_usage = []

for chunk in chunk_iter:
    new_mem_usage.append(chunk.memory_usage(deep=True).sum()/ 1024 ** 2)



In [25]:
print("New Memory Usage: ", sum(new_mem_usage))


New Memory Usage:  11.08881664276123


In [26]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, dtype=convert_col_dtypes, 
                         parse_dates=["funded_at", "funded_month", "funded_quarter"])

memory_series = None
for chunk in chunk_iter:
    chunk_memory = chunk.memory_usage(deep=True)/(1024*1024)
    if memory_series is None:
        memory_series = chunk_memory
    else:
        memory_series += chunk_memory

memory_series

Index                    0.000877
company_name             3.424955
company_category_code    0.091980
company_country_code     0.051188
company_state_code       0.091649
company_region           0.317376
company_city             0.624051
investor_name            3.734270
investor_country_code    0.079145
investor_state_code      0.079806
investor_region          0.217028
investor_city            0.300412
funding_round_type       0.059248
funded_at                0.403366
funded_month             0.403366
funded_quarter           0.403366
funded_year              0.403366
raised_amount_usd        0.403366
dtype: float64

In [30]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, dtype=convert_col_dtypes, 
                         parse_dates=["funded_at", "funded_month", "funded_quarter"])

chunk

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10-01,2012-10-01,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10-01,2007-10-01,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04-01,2008-04-01,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01-01,2010-01-01,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01-01,2010-01-01,2010,
50005,HItviews,advertising,USA,NY,New York,New York City,multiple parties,,,unknown,,angel,2007-11-29,2007-11-01,2007-10-01,2007,485000.0
50006,LockerDome,social,USA,MO,Saint Louis,St. Louis,multiple parties,,,unknown,,angel,2012-04-17,2012-04-01,2012-04-01,2012,300000.0
50007,ThirdLove,ecommerce,USA,CA,SF Bay,San Francisco,Munjal Shah,,,unknown,,series-a,2012-12-01,2012-12-01,2012-10-01,2012,5600000.0
50008,Hakia,search,USA,,TBD,,Murat Vargi,,,unknown,,series-a,2006-11-01,2006-11-01,2006-10-01,2006,16000000.0
50009,bookacoach,sports,USA,IN,Indianapolis,Indianapolis,Myles Grote,,,unknown,,angel,2012-11-01,2012-11-01,2012-10-01,2012,


## Loading Chunks into SQLite

* Create and connect to a new SQLite database file.
* Expand on the existing chunk processing code to export each chunk to a new table in the SQLite database.
* Query the table and make sure the data types match up to what you had in mind for each column.
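
The three steps can be tried end to end on a toy example first. The sketch below uses an in-memory SQLite database and an invented three-row CSV, so nothing is written to disk. The `DROP TABLE IF EXISTS` line is an addition of ours: with `if_exists='append'`, re-running the load against a database file would otherwise insert every row a second time.

```python
import io
import sqlite3
import pandas as pd

# Made-up rows standing in for the real CSV.
sample_csv = io.StringIO(
    "company_name,funding_round_type,raised_amount_usd\n"
    "Acme,angel,500000\n"
    "Beta,series-a,3000000\n"
    "Gamma,series-a,\n"
)

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute("DROP TABLE IF EXISTS investments")  # start clean so re-runs do not duplicate rows
for chunk in pd.read_csv(sample_csv, chunksize=2):
    chunk.to_sql("investments", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM investments").fetchone()[0]
schema = pd.read_sql("PRAGMA table_info(investments);", conn)
print(row_count)
print(schema[["name", "type"]])
```

The table is created from the first chunk, so the column types SQLite reports reflect the dtypes of that chunk.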

In [31]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, dtype=convert_col_dtypes, 
                         parse_dates=["funded_at", "funded_month", "funded_quarter"])

for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index=False)
    
results_df = pd.read_sql("PRAGMA table_info(investments);", conn)
print(results_df)

    cid                   name       type  notnull dflt_value  pk
0     0           company_name       TEXT        0       None   0
1     1  company_category_code       TEXT        0       None   0
2     2   company_country_code       TEXT        0       None   0
3     3     company_state_code       TEXT        0       None   0
4     4         company_region       TEXT        0       None   0
5     5           company_city       TEXT        0       None   0
6     6          investor_name       TEXT        0       None   0
7     7  investor_country_code       TEXT        0       None   0
8     8    investor_state_code       TEXT        0       None   0
9     9        investor_region       TEXT        0       None   0
10   10          investor_city       TEXT        0       None   0
11   11     funding_round_type       TEXT        0       None   0
12   12              funded_at  TIMESTAMP        0       None   0
13   13           funded_month  TIMESTAMP        0       None   0
14   14   

## Next Steps

Use the pandas SQLite workflow to answer the following questions:

* What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
* Which category of company attracted the most investments?
* Which investor contributed the most money (across all startups)?
* Which investors contributed the most money per startup?
* Which funding round was the most popular? Which was the least popular?
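
For the first question, one possible split of the work is to let SQLite aggregate funding per company and let pandas compute the shares. The sketch below runs against a made-up ten-company table; `share_of_total` is a hypothetical helper. It excludes rows with a missing amount before ranking, which is an assumption you may want to revisit.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# Made-up rows: ten companies raising 1, 2, ..., 10 million USD.
rows = pd.DataFrame({
    "company_name": [f"company_{i}" for i in range(1, 11)],
    "raised_amount_usd": [i * 1_000_000.0 for i in range(1, 11)],
})
rows.to_sql("investments", conn, index=False)

totals = pd.read_sql(
    """
    SELECT company_name, SUM(raised_amount_usd) AS funding_amount
    FROM investments
    WHERE raised_amount_usd IS NOT NULL
    GROUP BY company_name
    ORDER BY funding_amount DESC
    """,
    conn,
)

def share_of_total(totals, fraction, top=True):
    """Share of all funding raised by the top (or bottom) fraction of companies."""
    n = max(int(len(totals) * fraction), 1)
    subset = totals.head(n) if top else totals.tail(n)
    return subset["funding_amount"].sum() / totals["funding_amount"].sum()

top_share = share_of_total(totals, 0.10, top=True)
bottom_share = share_of_total(totals, 0.10, top=False)
print(top_share, bottom_share)
```

The same helper covers the top and bottom 1% by passing `fraction=0.01`.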

In [32]:
# Top 10% fundraiser analysis

query="""
      select iv.company_name,
      cast(sum(raised_amount_usd) as double)/(select cast(sum(raised_amount_usd) as bigint) from investments) as percentage_funding,
      cast(sum(raised_amount_usd) as bigint) as funding_amount
      from investments as iv
      group by iv.company_name 
      order by funding_amount desc
      limit (select cast(count(distinct company_name)*.1 as int) from investments)
      """

In [33]:
top_10_raised = pd.read_sql(query, conn)

In [44]:
print("Funding raised by top 10 percent $%.1f billion"%(top_10_raised["funding_amount"].sum()/1000000000))


Funding raised by top 10 percent $457.6 billion


In [45]:
# Top 1% fundraiser analysis
query="""
      select iv.company_name,
      cast(sum(raised_amount_usd) as double)/(select cast(sum(raised_amount_usd) as bigint) from investments) as percentage_funding,
      cast(sum(raised_amount_usd) as bigint) as funding_amount
      from investments as iv
      group by iv.company_name 
      order by funding_amount desc
      limit (select cast(count(distinct company_name)*.01 as int) from investments)
      """

In [46]:
top_1_raised=pd.read_sql(query,conn)

print("Funding raised by top 1 percent $%.1f billion"%(top_1_raised["funding_amount"].sum()/1000000000))


Funding raised by top 1 percent $178.7 billion


In [47]:
# Bottom 10% fundraiser analysis

query="""
      select iv.company_name,
      round(cast(sum(raised_amount_usd) as double)/(select cast(sum(raised_amount_usd) as double) from investments),6) as percentage_funding,
      cast(sum(raised_amount_usd) as bigint) as funding_amount
      from investments as iv
      group by iv.company_name
      having funding_amount is not Null
      order by funding_amount asc
      limit (select cast(count(distinct company_name)*.1 as int) from investments)
      """

In [48]:
btm_10_raised=pd.read_sql(query,conn)
print("Funding raised by bottom 10 percent $%.9f billion"%(btm_10_raised["funding_amount"].sum()/1000000000))


Funding raised by bottom 10 percent $0.252174228 billion


In [49]:
# Company category that attracted the most investments

query="""
      select iv.company_category_code,count(*) as frequency
      from investments as iv
      group by iv.company_category_code
      order by frequency desc
      limit 1
      """

In [50]:
investment=pd.read_sql(query,conn)
print("Category: %s , frequency_investment: %d"%(investment["company_category_code"][0],investment["frequency"][0]))


Category: software , frequency_investment: 7243


In [51]:
# Investor with the largest number of investments (a count of rows, not dollars)

query="""
      select iv.investor_name,count(*) as frequency
      from investments as iv
      group by iv.investor_name
      having investor_name is not Null
      order by frequency desc
      limit 1
      """

In [52]:
investor_name=pd.read_sql(query,conn)
investor_name


Unnamed: 0,investor_name,frequency
0,New Enterprise Associates,445


In [53]:
# Investor that contributed the most money across all startups (sum of raised_amount_usd)
query="""
      select iv.investor_name,count(*) as frequency,
      sum(raised_amount_usd) as investment
      from investments as iv
      group by iv.investor_name
      order by investment desc
      limit 1
      """

In [54]:
investor_money=pd.read_sql(query,conn)
print("investor: %s , investment in tens of billions of USD: %d"%(investor_money["investor_name"][0],investor_money["investment"][0]/10000000000))


investor: Kleiner Perkins Caufield & Byers , investment in tens of billions of USD: 1
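
The query above ranks investors by total dollars. For the "most money per startup" question, one possible reading is total dollars divided by the number of distinct companies an investor backed. The sketch below shows that interpretation on made-up rows; the `usd_per_startup` alias is ours.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# Made-up rows for illustration only.
pd.DataFrame({
    "investor_name": ["Fund A", "Fund A", "Fund B"],
    "company_name": ["Acme", "Beta", "Gamma"],
    "raised_amount_usd": [1_000_000.0, 3_000_000.0, 5_000_000.0],
}).to_sql("investments", conn, index=False)

per_startup = pd.read_sql(
    """
    SELECT investor_name,
           SUM(raised_amount_usd) * 1.0 / COUNT(DISTINCT company_name) AS usd_per_startup
    FROM investments
    WHERE investor_name IS NOT NULL AND raised_amount_usd IS NOT NULL
    GROUP BY investor_name
    ORDER BY usd_per_startup DESC
    """,
    conn,
)
print(per_startup)
```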


In [55]:
#What funding round was most popular?

query="""
      select iv.funding_round_type as f_r_t,
      count(*) as frequency
      from investments as iv
      group by f_r_t
      order by frequency desc 
      limit 1
      """

In [56]:
funding_pop=pd.read_sql(query,conn)
print("funding popular: %s"%(funding_pop["f_r_t"][0]))


funding popular: series-a
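
The query above returns only the most popular round. To answer the second half of the question, one option is to drop the `limit 1` and read both ends of the ranking, as sketched here on a made-up table.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# Made-up rows: three series-a, two angel, one series-b.
pd.DataFrame({
    "funding_round_type": ["series-a", "series-a", "series-a", "angel", "angel", "series-b"],
}).to_sql("investments", conn, index=False)

counts = pd.read_sql(
    """
    SELECT funding_round_type, COUNT(*) AS frequency
    FROM investments
    WHERE funding_round_type IS NOT NULL
    GROUP BY funding_round_type
    ORDER BY frequency DESC
    """,
    conn,
)
most_popular = counts.iloc[0]["funding_round_type"]
least_popular = counts.iloc[-1]["funding_round_type"]
print(most_popular, least_popular)
```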
