
# Goal 
Practice using some of the techniques we learned to analyze startup investments from Crunchbase.com.

## [Dataset](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv)

In [1]:
import pandas as pd
import numpy as np
import sqlite3

Because the dataset contains over 50,000 rows, we'll read it into dataframes in 5,000-row chunks so that each chunk consumes well under 10 megabytes of memory.

In [2]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

In [3]:
for chunk in cb_iter:
    size = chunk.memory_usage(deep=True).sum() / (1024 * 1024)
    print(size)

5.579195022583008
5.528186798095703
5.535004615783691
5.528162956237793
5.5243072509765625
5.553412437438965
5.531391143798828
5.509613037109375
5.396090507507324
4.63945198059082
2.663668632507324
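The per-chunk check above can be wrapped in a small helper so it is easy to repeat with other chunk sizes. This is only a sketch: `chunk_sizes_mb` is a hypothetical helper name, and the CSV text is synthetic stand-in data rather than the Crunchbase file.

```python
import io

import pandas as pd


def chunk_sizes_mb(source, chunksize, **read_csv_kwargs):
    """Return the deep memory usage of every chunk, in megabytes."""
    reader = pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs)
    return [chunk.memory_usage(deep=True).sum() / (1024 * 1024) for chunk in reader]


# Synthetic stand-in data: 25 rows read in chunks of 10 rows.
rows = ["company_{},city_{},{}".format(i, i % 7, i * 1000) for i in range(25)]
csv_text = "company_name,company_city,raised_amount_usd\n" + "\n".join(rows)

sizes = chunk_sizes_mb(io.StringIO(csv_text), chunksize=10)
print(len(sizes), all(size < 10 for size in sizes))
```

For the real file, the same helper would presumably be called with the file path, `chunksize=5000`, and `encoding='ISO-8859-1'`.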


Across all of the chunks, let's become familiar with:
 - Each column's missing value counts
 - Each column's memory footprint
 - The total memory footprint of all of the chunks combined
 - Which column(s) we can drop because they aren't useful for analysis

Missing value counts

In [4]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
missing = []
for chunk in cb_iter:
    missing.append(chunk.isnull().sum())

combined_missing = pd.concat(missing)
final_missing = combined_missing.groupby(combined_missing.index).sum().sort_values()
print(final_missing)

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
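As an alternative to collecting every chunk's counts and grouping them afterwards, the totals can be accumulated as a running sum. The sketch below uses a tiny synthetic CSV (not the Crunchbase file) and assumes `Series.add` with `fill_value=0` is an acceptable way to merge the per-chunk counts.

```python
import io

import pandas as pd

# Synthetic data: four rows, some fields left empty on purpose.
csv_text = "name,city,amount\nA,NY,1\nB,,2\n,SF,\nD,,4\n"

total_missing = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = chunk.isnull().sum()
    if total_missing is None:
        total_missing = counts
    else:
        # fill_value=0 guards against a column missing from one of the Series.
        total_missing = total_missing.add(counts, fill_value=0)

total_missing = total_missing.astype(int).sort_values()
print(total_missing)
```

This keeps only one Series in memory at a time, which matters little here but can help when there are many chunks.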


Column memory footprint (in MB)

In [5]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
memory = []
for chunk in cb_iter:
    memory.append(chunk.memory_usage(deep=True))

combined_memory = pd.concat(memory)
final_memory = combined_memory.groupby(combined_memory.index).sum().sort_values(ascending=False) / (1024 * 1024)
print(final_memory)

investor_permalink        4.749821
company_permalink         3.869808
investor_name             3.734270
company_name              3.424955
funded_at                 3.378091
company_city              3.343512
company_category_code     3.262619
company_region            3.253541
funding_round_type        3.252704
investor_region           3.238946
funded_month              3.226837
funded_quarter            3.226837
company_country_code      3.025223
company_state_code        2.962161
investor_city             2.751430
investor_country_code     2.524654
investor_state_code       2.361876
investor_category_code    0.593590
raised_amount_usd         0.403366
funded_year               0.403366
Index                     0.000877
dtype: float64


Total memory footprint (in MB)

In [6]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
size = 0
for chunk in cb_iter:
    size += chunk.memory_usage(deep=True).sum() / (1024 * 1024)
print(size)

56.988484382629395


#### Drop the columns that contain links (investor_permalink, company_permalink) or have too many missing values (investor_category_code). We can also drop funded_month, funded_quarter, and funded_year, because the funded_at column fully covers them.

In [7]:
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code', 'funded_month', 'funded_quarter', 'funded_year']
keep_cols = chunk.columns.drop(drop_cols)

In [8]:
keep_cols.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'raised_amount_usd']

### Selecting Data Types
Let's first determine which columns shift types across chunks. Note that we only lay the groundwork for this step.

In [9]:
col_types = {}
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)
for chunk in cb_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk[col].dtype)]
        else:
            if str(chunk[col].dtype) not in col_types[col]:
                col_types[col].append(str(chunk[col].dtype))

In [10]:
col_types

{'company_category_code': ['object'],
 'company_city': ['object'],
 'company_country_code': ['object'],
 'company_name': ['object'],
 'company_region': ['object'],
 'company_state_code': ['object'],
 'funded_at': ['object'],
 'funding_round_type': ['object'],
 'investor_city': ['object', 'float64'],
 'investor_country_code': ['object', 'float64'],
 'investor_name': ['object'],
 'investor_region': ['object'],
 'investor_state_code': ['object', 'float64'],
 'raised_amount_usd': ['float64']}

In [13]:
to_be_converted = []

for col in col_types:
    if len(col_types[col]) > 1:
        print('{}: '.format(col), col_types[col])
        to_be_converted.append(col)

investor_state_code:  ['object', 'float64']
investor_country_code:  ['object', 'float64']
investor_city:  ['object', 'float64']
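A likely reason these columns change type is that some chunks contain only missing values in them, in which case pandas infers `float64` for that chunk. The sketch below reproduces the effect on synthetic data and shows one possible remedy, passing `dtype` to `read_csv`; `dtypes_seen` is a hypothetical helper, and the column values are made up.

```python
import io

import pandas as pd

# Synthetic data: investor_city is filled in the first chunk, empty in the second.
csv_text = "investor_name,investor_city\nA,Columbus\nB,New York\nC,\nD,\n"


def dtypes_seen(**read_csv_kwargs):
    """Collect the distinct dtypes of investor_city across all chunks."""
    seen = set()
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, **read_csv_kwargs)
    for chunk in reader:
        seen.add(str(chunk["investor_city"].dtype))
    return seen


inferred = dtypes_seen()                             # type inferred per chunk
pinned = dtypes_seen(dtype={"investor_city": str})   # type fixed up front
print(inferred, pinned)
```

Pinning the type at read time avoids a separate conversion step and leaves missing values as missing.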


In [14]:
chunk

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,
50005,HItviews,advertising,USA,NY,New York,New York City,multiple parties,,,unknown,,angel,2007-11-29,485000.0
50006,LockerDome,social,USA,MO,Saint Louis,St. Louis,multiple parties,,,unknown,,angel,2012-04-17,300000.0
50007,ThirdLove,ecommerce,USA,CA,SF Bay,San Francisco,Munjal Shah,,,unknown,,series-a,2012-12-01,5600000.0
50008,Hakia,search,USA,,TBD,,Murat Vargi,,,unknown,,series-a,2006-11-01,16000000.0
50009,bookacoach,sports,USA,IN,Indianapolis,Indianapolis,Myles Grote,,,unknown,,angel,2012-11-01,


Convert these columns to the object type and check the total memory footprint:
 - investor_state_code:  ['object', 'float64']
 - investor_country_code:  ['object', 'float64']
 - investor_city:  ['object', 'float64']

In [16]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)
size = 0
for chunk in cb_iter:
    chunk[to_be_converted] = chunk[to_be_converted].astype('str')
    size += chunk.memory_usage(deep=True).sum() / (1024 * 1024)
print(size)

42.561177253723145
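One side effect worth checking: depending on the pandas version, `astype('str')` may turn missing values into the literal string `'nan'`, which would then be stored as text rather than as NULL. A minimal sketch of an alternative, assuming `astype(object)` is acceptable for these columns, is shown below on made-up rows with an in-memory SQLite database (`demo_conn` and the `demo` table are illustrative names).

```python
import sqlite3

import numpy as np
import pandas as pd

# Made-up chunk in which investor_city is entirely missing (float64 by default).
demo_chunk = pd.DataFrame({
    "investor_name": ["Investor A", "Investor B"],
    "investor_city": [np.nan, np.nan],
})

# Cast to object without stringifying, so NaN stays a missing value.
demo_chunk["investor_city"] = demo_chunk["investor_city"].astype(object)

demo_conn = sqlite3.connect(":memory:")
demo_chunk.to_sql("demo", demo_conn, index=False)

null_rows = demo_conn.execute(
    "SELECT COUNT(*) FROM demo WHERE investor_city IS NULL"
).fetchone()[0]
demo_conn.close()
print(null_rows)
```

With missing values kept as NULL, SQL aggregates such as `COUNT(column)` skip them automatically.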


### Load the dataset into a database

In [18]:
conn = sqlite3.connect('cb.db')

In [34]:
cb_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)
for chunk in cb_iter:
    chunk[to_be_converted] = chunk[to_be_converted].astype('str')
    chunk.to_sql("crunchbase", conn, if_exists='append', index=False)

In [35]:
!ls -l --block-size=M  cb.db

-rw-r--r-- 1 dq dq 13M Oct 15 06:18 cb.db
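Because the load uses `if_exists='append'`, running that cell a second time would append every row again. One way to make the load safe to rerun, sketched here on synthetic data with an in-memory database, is to replace the table on the first chunk and append afterwards. `load_csv`, `demo_conn`, and the `demo` table are hypothetical names.

```python
import io
import sqlite3

import pandas as pd

# Synthetic stand-in for the CSV file: five rows.
csv_text = "company_name,raised_amount_usd\nA,1\nB,2\nC,3\nD,4\nE,5\n"


def load_csv(conn, table):
    """Load the CSV in chunks; the first chunk replaces any existing table."""
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
    for i, chunk in enumerate(reader):
        mode = "replace" if i == 0 else "append"
        chunk.to_sql(table, conn, if_exists=mode, index=False)


demo_conn = sqlite3.connect(":memory:")
load_csv(demo_conn, "demo")
load_csv(demo_conn, "demo")  # simulate rerunning the cell

row_count = demo_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
demo_conn.close()
print(row_count)
```

Note that with this approach the table schema is derived from the first chunk, so column types should be stable across chunks before loading.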


In [38]:
db = pd.read_sql("SELECT * FROM crunchbase", conn)

In [39]:
db

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,raised_amount_usd
0,AdverCar,advertising,USA,CA,SF Bay,San Francisco,1-800-FLOWERS.COM,USA,NY,New York,New York,series-a,2012-10-30,2000000.0
1,LaunchGram,news,USA,CA,SF Bay,Mountain View,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-23,20000.0
2,uTaP,messaging,USA,,United States - Other,,10Xelerator,USA,OH,Columbus,Columbus,other,2012-01-01,20000.0
3,ZoopShop,software,USA,OH,Columbus,columbus,10Xelerator,USA,OH,Columbus,Columbus,angel,2012-02-15,20000.0
4,eFuneral,web,USA,OH,Cleveland,Cleveland,10Xelerator,USA,OH,Columbus,Columbus,other,2011-09-08,20000.0
5,Tackk,web,USA,OH,Cleveland,Cleveland,10Xelerator,USA,OH,Columbus,Columbus,other,2012-02-01,20000.0
6,Acclaimd,analytics,USA,OH,Columbus,Columbus,10Xelerator,USA,OH,Columbus,Columbus,angel,2012-06-01,20000.0
7,Acclaimd,analytics,USA,OH,Columbus,Columbus,10Xelerator,USA,OH,Columbus,Columbus,angel,2012-08-07,70000.0
8,ToVieFor,ecommerce,USA,NY,New York,New York,2010 NYU Stern Business Plan Competition,,,unknown,,angel,2010-04-01,75000.0
9,OHK Labs,sports,USA,FL,Palm Beach,Boca Raton,22Hundred Group,,,unknown,,angel,2011-09-01,100000.0


### Use the pandas SQLite workflow to answer the following questions:
 - What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
 - Which category of company attracted the most investments?
 - Which investor contributed the most money (across all startups)?
 - Which investors contributed the most money per startup?
 - Which funding round was the most popular? Which was the least popular?

In [40]:
cur = conn.cursor()

#### What proportion of the total amount of funds did the top 10% raise? What about the top 1%?

In [100]:
cur.execute("""SELECT SUM(raised_amount_usd) FROM crunchbase""")
total_funds = cur.fetchall()

In [101]:
cur.execute("""SELECT SUM(raised_amount_usd) FROM (SELECT * FROM crunchbase ORDER BY raised_amount_usd DESC LIMIT 4927)""")
top_ten_funds = cur.fetchall()

In [102]:
top_ten_funds[0][0] / total_funds[0][0]

0.4985610752406796

In [103]:
cur.execute("""SELECT SUM(raised_amount_usd) FROM (SELECT * FROM crunchbase ORDER BY raised_amount_usd DESC LIMIT 493)""")
top_one_funds = cur.fetchall()

In [104]:
top_one_funds[0][0] / total_funds[0][0]

0.19299778563809128
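The `LIMIT` values 4927 and 493 correspond to 10% and 1% of the rows with a known amount. A sketch of deriving the limit from the data instead of hard-coding it is shown below, using an in-memory table with made-up amounts 1 to 100 plus two NULLs. `top_share` is a hypothetical helper, and the `demo` table stands in for `crunchbase`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (raised_amount_usd REAL)")
amounts = [(float(v),) for v in range(1, 101)] + [(None,), (None,)]
demo_conn.executemany("INSERT INTO demo VALUES (?)", amounts)


def top_share(conn, fraction):
    """Share of all funds raised by the top `fraction` of non-NULL rows."""
    # COUNT(column) ignores NULLs, so rows without an amount are excluded.
    n_rows = conn.execute("SELECT COUNT(raised_amount_usd) FROM demo").fetchone()[0]
    limit = round(n_rows * fraction)
    total = conn.execute("SELECT SUM(raised_amount_usd) FROM demo").fetchone()[0]
    top = conn.execute(
        "SELECT SUM(raised_amount_usd) FROM "
        "(SELECT raised_amount_usd FROM demo ORDER BY raised_amount_usd DESC LIMIT ?)",
        (limit,),
    ).fetchone()[0]
    return top / total


top_10 = top_share(demo_conn, 0.10)
top_1 = top_share(demo_conn, 0.01)
demo_conn.close()
print(top_10, top_1)
```

In this toy table the top 10 rows sum to 955 out of 5050, so the helper returns 955/5050 for the top 10%.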

#### Compare these values to the proportions the bottom 10% and bottom 1% raised.

In [105]:
bottom_ten_funds = pd.read_sql("SELECT raised_amount_usd FROM crunchbase ORDER BY 1", conn)

In [106]:
bottom_ten_funds = bottom_ten_funds.dropna()

In [107]:
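# Ratio of the bottom 10% to the top 10% total (divide by total_funds[0][0] for the share of all funds).
bottom_ten_funds['raised_amount_usd'].head(round(len(bottom_ten_funds) / 10)).sum() / top_ten_funds[0][0]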

0.00527409025208126

In [108]:
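# Ratio of the bottom 1% to the top 10% total (divide by total_funds[0][0] for the share of all funds).
bottom_ten_funds['raised_amount_usd'].head(round(len(bottom_ten_funds) / 100)).sum() / top_ten_funds[0][0]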

2.6612958705066928e-05
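The two ratios above use the top 10% total as the denominator. To make the distinction between "share of all funds" and "ratio to the top group" explicit, here is a small pandas sketch on made-up amounts (1 to 100 plus two missing values). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up amounts: 1..100 plus two missing values.
amounts = pd.Series(
    list(range(1, 101)) + [np.nan, np.nan],
    dtype="float64",
    name="raised_amount_usd",
)

ranked = amounts.dropna().sort_values()   # ascending, so head() is the bottom
n_bottom = round(len(ranked) / 10)        # size of the bottom (and top) 10%

bottom_sum = ranked.head(n_bottom).sum()
bottom_share_of_total = bottom_sum / ranked.sum()            # share of all funds
bottom_vs_top = bottom_sum / ranked.tail(n_bottom).sum()     # ratio to the top 10%
print(bottom_share_of_total, bottom_vs_top)
```

Both numbers are useful, but they answer different questions, so it helps to name them separately.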

#### Which category of company attracted the most investments?

In [45]:
cur.execute("""
            SELECT company_category_code, SUM(raised_amount_usd) FROM crunchbase
            GROUP BY 1
            ORDER BY 2 DESC
            """)
top_categories = cur.fetchmany(10)

In [46]:
top_categories

[('biotech', 110396423062.0),
 ('software', 73084516724.0),
 ('mobile', 64777379752.0),
 ('cleantech', 52705225028.0),
 ('enterprise', 45860927273.0),
 ('web', 40143264989.0),
 ('medical', 25367105281.0),
 ('advertising', 25076661879.0),
 ('ecommerce', 22567220071.0),
 ('network_hosting', 22419683840.0)]

#### Which investor contributed the most money (across all startups)?

In [48]:
cur.execute("""
            SELECT investor_name, SUM(raised_amount_usd) FROM crunchbase
            GROUP BY 1
            ORDER BY 2 DESC
            """)
top_investors = cur.fetchmany(10)

In [49]:
top_investors

[('Kleiner Perkins Caufield & Byers', 11217826376.0),
 ('New Enterprise Associates', 9692542344.0),
 ('Accel Partners', 6472126199.0),
 ('Goldman Sachs', 6375459000.0),
 ('Sequoia Capital', 6039402410.0),
 ('Intel', 5969200000.0),
 ('Google', 5808800000.0),
 ('Time Warner', 5730000000.0),
 ('Comcast', 5669000000.0),
 ('Greylock Partners', 4960982939.0)]

#### Which investors contributed the most money per startup?

In [50]:
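# Note: this divides by the number of investment rows, so it is the average per investment, not per distinct startup.
cur.execute("""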
            SELECT investor_name, SUM(raised_amount_usd) / COUNT(raised_amount_usd) FROM crunchbase
            GROUP BY 1
            ORDER BY 2 DESC
            """)
top_investors_per_company = cur.fetchmany(10)

In [51]:
top_investors_per_company

[('Marlin Equity Partners', 2600000000.0),
 ('BrightHouse', 2350000000.0),
 ('GI Partners', 1050000000.0),
 ('Sprint Nextel', 833333333.3333334),
 ('Siemens PLM Software', 750000000.0),
 ('Comcast', 629888888.8888888),
 ('Eagle River Holdings', 614250000.0),
 ('Time Warner', 520909090.90909094),
 ('Laurel Crown Partners', 450000000.0),
 ('Intel', 397946666.6666667)]
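The query above averages over investment rows. If "per startup" is read as "per distinct company", dividing by `COUNT(DISTINCT company_name)` is one possible interpretation. The sketch below compares the two on a made-up in-memory table; the investor and company names are invented for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo (investor_name TEXT, company_name TEXT, raised_amount_usd REAL)"
)
rows = [
    ("Investor A", "X", 10.0),  # two rounds in the same company
    ("Investor A", "X", 20.0),
    ("Investor A", "Y", 30.0),
    ("Investor B", "Z", 25.0),
]
demo_conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)

per_startup = demo_conn.execute("""
    SELECT investor_name,
           SUM(raised_amount_usd) / COUNT(raised_amount_usd) AS per_round,
           SUM(raised_amount_usd) / COUNT(DISTINCT company_name) AS per_startup
    FROM demo
    GROUP BY investor_name
    ORDER BY per_startup DESC
""").fetchall()
demo_conn.close()
print(per_startup)
```

Here Investor A ranks first per startup but second per round, which shows that the two definitions can produce different rankings.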

#### Which funding round was the most popular? Which was the least popular?

In [58]:
cur.execute("""
            SELECT funding_round_type, COUNT(funding_round_type) FROM crunchbase
            GROUP BY 1
            ORDER BY 2 DESC
            """)
top_rounds = cur.fetchmany(3)

In [59]:
top_rounds

[('series-a', 13938), ('series-c+', 10870), ('angel', 8989)]

In [60]:
cur.execute("""
            SELECT funding_round_type, COUNT(funding_round_type) FROM crunchbase
            GROUP BY 1
            ORDER BY 2
            """)
bottom_rounds = cur.fetchmany(3)

In [61]:
bottom_rounds

[(None, 0), ('crowdfunding', 5), ('post-ipo', 33)]
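The `(None, 0)` entry is the group of rows whose `funding_round_type` is NULL: `COUNT(column)` skips NULLs, so that group always counts as zero. Judging by the output above, the least popular named round is `crowdfunding`. A sketch of filtering the NULL group out, on a made-up in-memory table, might look like this:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (funding_round_type TEXT)")
rounds = ["series-a"] * 3 + ["angel"] * 2 + ["crowdfunding"] + [None] * 2
demo_conn.executemany("INSERT INTO demo VALUES (?)", [(r,) for r in rounds])

# Without a filter, the NULL group sorts first with a count of 0.
with_nulls = demo_conn.execute("""
    SELECT funding_round_type, COUNT(funding_round_type) FROM demo
    GROUP BY 1
    ORDER BY 2
    LIMIT 1
""").fetchall()

# Excluding NULLs leaves only named funding rounds.
without_nulls = demo_conn.execute("""
    SELECT funding_round_type, COUNT(funding_round_type) FROM demo
    WHERE funding_round_type IS NOT NULL
    GROUP BY 1
    ORDER BY 2
    LIMIT 1
""").fetchall()
demo_conn.close()
print(with_nulls, without_nulls)
```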