# Analyzing Startup Fundraising Deals from Crunchbase

In this project, I will be analyzing a dataset from Crunchbase, which is a website that crowdsources information on the fundraising rounds of many startups. The dataset can be found [here](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv) on GitHub.

I will be working under the assumption that I have only 10 megabytes of available memory. As this dataset, `crunchbase-investments.csv`, consumes 10.3 megabytes of disk space, I will have to use batch processing. I will read the dataset into pandas dataframes using 5000-row chunks to ensure that each chunk consumes much less than 10 megabytes of memory.
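
The 5000-row chunk size is a judgment call. As a minimal sketch of how such a number could be estimated, the snippet below measures the memory used per row in a small sample and divides a memory budget by it. The helper name `rows_within_budget`, the 5 megabyte budget, and the tiny in-memory CSV are all assumptions made for illustration; they are not part of the original analysis.

```python
import io
import pandas as pd

def rows_within_budget(csv_source, budget_bytes, sample_rows=1000, **read_kwargs):
    """Estimate how many rows fit in budget_bytes, based on a small sample."""
    sample = pd.read_csv(csv_source, nrows=sample_rows, **read_kwargs)
    # deep=True counts the actual string contents, not just the pointers
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return int(budget_bytes // bytes_per_row)

# Tiny made-up CSV standing in for the real file
demo_csv = "name,amount\n" + "\n".join(f"company_{i},{i * 1000}" for i in range(200))
max_rows = rows_within_budget(io.StringIO(demo_csv), budget_bytes=5 * 2**20)
print(max_rows)
```

An estimate like this is only as good as the sample: if later rows hold longer strings, a chunk can use more memory than predicted, so it is worth leaving generous headroom.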

I will start by becoming familiar with each of the following:

* The total memory footprint of all of the chunks combined
* Each column's memory footprint
* Each column's missing value counts
* Which column(s) we can drop because they aren't useful for analysis

In [1]:
import sqlite3 
import pandas as pd
import json

## Memory footprints

First, I will look at the memory footprint of each chunk and the total memory footprint of all chunks combined. Then I will look at the memory footprint of each column.

In [2]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
list_mem = []
print("Megabytes of total memory usage in each chunk:")
for chunk in chunk_iter:
    mem_bytes = chunk.memory_usage(deep=True).sum()
    mem_megabytes = mem_bytes/(2**20)
    print(mem_megabytes)
    list_mem.append(mem_megabytes)

print("\nTotal memory usage across all chunks:")
print(sum(list_mem))

Megabytes of total memory usage in each chunk:
5.579195022583008
5.528186798095703
5.535004615783691
5.528162956237793
5.5243072509765625
5.553412437438965
5.531391143798828
5.509613037109375
5.396090507507324
4.63945198059082
2.663668632507324

Total memory usage across all chunks:
56.988484382629395


We can see that each chunk uses less than 6 megabytes of memory, which keeps us on the safe side of the 10 megabyte limit. Now I will determine how much memory is taken up by each column.

In [3]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
counter = 0
mem_series = pd.Series(dtype='int64')  # placeholder; replaced by the first chunk's usage
for chunk in chunk_iter:
    if counter == 0:
        mem_series = chunk.memory_usage(deep=True)
    else:
        mem_series += chunk.memory_usage(deep=True)
    counter += 1

print("Bytes of total memory usage in each column:")
mem_series = mem_series.drop('Index')
print(mem_series)

print("\nTotal megabytes of memory usage:")
print(sum(mem_series/(2**20)))

Bytes of total memory usage in each column:
company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

Total megabytes of memory usage:
56.9876070022583


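The total memory usage (~57 megabytes) is essentially the same whether it is summed by chunk or by column. The per-column total is smaller by less than a kilobyte, most likely because the `Index` entry was dropped before summing.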
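
To illustrate where a small gap between the two totals can come from, here is a minimal sketch on a made-up three-row dataframe (the amounts are invented for the example). `memory_usage(deep=True)` returns one entry per column plus an `Index` entry, so a total that drops `Index` is slightly smaller than a total that keeps it.

```python
import pandas as pd

# Made-up rows, only used to show the shape of memory_usage() output
df = pd.DataFrame({
    "name": ["AdverCar", "LaunchGram", "uTaP"],
    "amount": [2_000_000.0, 20_000.0, None],
})

usage = df.memory_usage(deep=True)        # one entry per column plus 'Index'
with_index = usage.sum()                  # what the per-chunk totals add up
columns_only = usage.drop("Index").sum()  # what the per-column totals add up

print(usage)
print("index bytes:", with_index - columns_only)
```

The `amount` column also shows that a float64 column costs 8 bytes per row, whether or not the value is missing.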

## Missing value counts

In [4]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
counter = 0
missing_val = pd.Series(dtype='int64')  # placeholder; replaced by the first chunk's counts
for chunk in chunk_iter:
    if counter == 0:
        missing_val = chunk.isnull().sum()
    else:
        missing_val += chunk.isnull().sum()
    counter += 1

print("Number of missing values in each column:")
print(missing_val.sort_values())

Number of missing values in each column:
company_permalink             1
company_name                  1
company_country_code          1
company_region                1
investor_permalink            2
investor_name                 2
investor_region               2
funded_year                   3
funded_quarter                3
funded_month                  3
funded_at                     3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64


## Columns we can drop

Now I will look at the first ten rows of the dataset to determine which columns we can drop.

In [5]:
first_ten = pd.read_csv('crunchbase-investments.csv', nrows=10, encoding='ISO-8859-1')
print(first_ten)

     company_permalink company_name company_category_code  \
0    /company/advercar     AdverCar           advertising   
1  /company/launchgram   LaunchGram                  news   
2        /company/utap         uTaP             messaging   
3    /company/zoopshop     ZoopShop              software   
4    /company/efuneral     eFuneral                   web   
5       /company/tackk        Tackk                   web   
6    /company/acclaimd     Acclaimd             analytics   
7    /company/acclaimd     Acclaimd             analytics   
8    /company/toviefor     ToVieFor             ecommerce   
9    /company/ohk-labs     OHK Labs                sports   

  company_country_code company_state_code         company_region  \
0                  USA                 CA                 SF Bay   
1                  USA                 CA                 SF Bay   
2                  USA                NaN  United States - Other   
3                  USA                 OH               

It seems like the following columns will not be necessary for analysis:

* `company_permalink` (already have `company_name`)
* `investor_permalink` (already have `investor_name`)
* `investor_category_code` (>90% missing values)

I will drop these columns. 
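
The ">90%" figure can be sanity-checked by turning missing-value counts into fractions. The notebook does not print the row count, but if the 422,960 bytes reported for the float64 `raised_amount_usd` column correspond to 8 bytes per row, the file has 52,870 rows, and 50,427 missing values would be roughly 95% of them. The sketch below shows one way to compute such fractions chunk by chunk; the helper name `missing_fraction` and the four-row CSV are made up for illustration.

```python
import io
import pandas as pd

def missing_fraction(chunks):
    """Return the share of missing values per column across all chunks."""
    missing = None
    total_rows = 0
    for chunk in chunks:
        counts = chunk.isnull().sum()
        # fill_value=0 keeps the sum well defined if a column is absent in one chunk
        missing = counts if missing is None else missing.add(counts, fill_value=0)
        total_rows += len(chunk)
    return missing / total_rows

# Made-up data: 'category' is missing in 3 of 4 rows
demo_csv = "name,category\nA,web\nB,\nC,\nD,\n"
fractions = missing_fraction(pd.read_csv(io.StringIO(demo_csv), chunksize=2))
print(fractions.sort_values(ascending=False))
```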

In [6]:
# Use the dataframe from the previous cell rather than a leftover loop variable
useful_cols = first_ten.columns.drop(['company_permalink', 'investor_permalink',
                                      'investor_category_code'])
print(useful_cols)

Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at', 'funded_month',
       'funded_quarter', 'funded_year', 'raised_amount_usd'],
      dtype='object')


## Determining column types

Since I have a good sense of the memory footprints and missing values, I can now get familiar with the column types before adding the data into SQLite. 

In [7]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, 
                         encoding='ISO-8859-1', usecols=useful_cols)        
col_types = {}      
for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk[col].dtypes)]
        elif str(chunk[col].dtypes) not in col_types[col]:
            col_types[col].append(str(chunk[col].dtypes))
            
print("Column types:")
print(json.dumps(col_types, indent=2))

Column types:
{
  "company_country_code": [
    "object"
  ],
  "company_region": [
    "object"
  ],
  "investor_city": [
    "object",
    "float64"
  ],
  "funded_year": [
    "int64",
    "float64"
  ],
  "funding_round_type": [
    "object"
  ],
  "funded_month": [
    "object"
  ],
  "investor_region": [
    "object"
  ],
  "funded_at": [
    "object"
  ],
  "investor_country_code": [
    "object",
    "float64"
  ],
  "company_category_code": [
    "object"
  ],
  "company_state_code": [
    "object"
  ],
  "investor_state_code": [
    "object",
    "float64"
  ],
  "funded_quarter": [
    "object"
  ],
  "raised_amount_usd": [
    "float64"
  ],
  "company_city": [
    "object"
  ],
  "company_name": [
    "object"
  ],
  "investor_name": [
    "object"
  ]
}


I can see that some columns have more than one data type. I will now see how many unique values there are in each column, since columns with few unique values could be candidates for the `category` data type.
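
A likely reason for the mixed types is that pandas infers types separately for every chunk. A chunk in which a text column is entirely empty is read as `float64` (all `NaN`), and an integer column becomes `float64` as soon as a chunk contains a missing value. The following sketch reproduces both effects on a made-up four-row CSV.

```python
import io
import pandas as pd

# Made-up data: the second chunk has a missing year and no city at all
demo_csv = "year,city\n2012,Austin\n2013,Boston\n,\n2011,\n"

dtypes_per_chunk = []
for chunk in pd.read_csv(io.StringIO(demo_csv), chunksize=2):
    dtypes_per_chunk.append({col: str(dtype) for col, dtype in chunk.dtypes.items()})

for i, dtypes in enumerate(dtypes_per_chunk):
    print(i, dtypes)
```

This matches the pattern above, where `funded_year` appears as both `int64` and `float64` and the sparsely filled investor columns appear as both `object` and `float64`.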

In [8]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, 
                         encoding='ISO-8859-1', usecols=useful_cols)  
uniques = {}
for chunk in chunk_iter:
    for c in chunk.columns:
        # A set per column gives fast membership checks;
        # dropna() leaves missing values out of the count
        uniques.setdefault(c, set()).update(chunk[c].dropna().unique())
                    
lengths = {}
for column in uniques:
    lengths[column] = len(uniques[column])
print("Number of unique values in each column:\n")
print(json.dumps(lengths, indent=2))

Number of unique values in each column:

{
  "company_country_code": 2,
  "company_region": 546,
  "investor_city": 990,
  "funding_round_type": 9,
  "funded_month": 192,
  "investor_region": 585,
  "funded_at": 2808,
  "investor_country_code": 72,
  "company_category_code": 43,
  "company_state_code": 50,
  "investor_state_code": 50,
  "funded_quarter": 72,
  "raised_amount_usd": 1458,
  "company_city": 1229,
  "company_name": 11573,
  "funded_year": 20,
  "investor_name": 10465
}


## Optimizing column types

It looks like the following columns with few unique values could be `category` data types:

* `company_country_code`
* `funding_round_type`
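
As a rough illustration of why low-cardinality columns benefit from `category`, the sketch below compares the memory used by the same values stored as `object` and as `category`. The two country codes and the row count are made up for the example.

```python
import pandas as pd

# Made-up column with only two distinct values, repeated many times
codes = pd.Series(["USA", "CAN"] * 2500, dtype="object")

as_object = codes.memory_usage(deep=True)
as_category = codes.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

The saving comes from storing each distinct string once and keeping only a small integer code per row.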

I can also optimize the data types of the following columns:

* `investor_state_code` (object)
* `investor_country_code` (object)
* `investor_city` (object)
* `raised_amount_usd` (int where possible; chunks with missing values stay float)
* `funded_at` (datetime)
* `funded_month` (datetime)
* `funded_year` (datetime)
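
One caveat for `raised_amount_usd`: `pd.to_numeric(..., downcast='integer')` can only produce an integer type when every value can be represented as an integer. Since this column has missing values, the downcast will probably leave most or all chunks as `float64`. A minimal sketch with made-up amounts:

```python
import numpy as np
import pandas as pd

complete = pd.Series([2_000_000.0, 20_000.0, 500_000.0])
with_gap = pd.Series([2_000_000.0, np.nan, 500_000.0])

# No missing values: the floats are whole numbers, so an integer type is chosen
complete_dtype = pd.to_numeric(complete, downcast="integer").dtype
# A NaN cannot be stored in a NumPy integer array, so the dtype is left alone
with_gap_dtype = pd.to_numeric(with_gap, downcast="integer").dtype

print(complete_dtype)
print(with_gap_dtype)
```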

In [9]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, 
                         encoding='ISO-8859-1', usecols=useful_cols,
                         parse_dates=['funded_at', 'funded_month', 'funded_year'])  

print("Megabytes of total memory usage in each chunk:\n")
list_mem = []
for chunk in chunk_iter:
    chunk['company_country_code'] = chunk['company_country_code'].astype('category')
    chunk['funding_round_type'] = chunk['funding_round_type'].astype('category')
    chunk['investor_state_code'] = chunk['investor_state_code'].astype('object')
    chunk['investor_country_code'] = chunk['investor_country_code'].astype('object')
    chunk['investor_city'] = chunk['investor_city'].astype('object')
    chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'], downcast='integer')
    mem_bytes = chunk.memory_usage(deep=True).sum()
    mem_megabytes = mem_bytes/(2**20)
    print(mem_megabytes)
    list_mem.append(mem_megabytes)

print("\nTotal memory usage across all chunks:")
print(sum(list_mem))

Megabytes of total memory usage in each chunk:

3.44927978515625
3.518195152282715
3.5227088928222656
3.5202817916870117
3.508152961730957
3.531367301940918
3.515960693359375
3.506814956665039
3.4132614135742188
3.090296745300293
1.7744121551513672

Total memory usage across all chunks:
36.35073184967041


Optimizing the column types brings the total memory usage down from about 57 megabytes to about 36 megabytes, a saving of roughly 20 megabytes.


## Loading data into SQLite database

Now that the data is in good shape to start exploring and analyzing, I can load each chunk into a table in a SQLite database to begin querying. I will create and connect to a new SQLite database file called `cb_investments.db`, and the table will be called `investments`.
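
One thing to keep in mind with `if_exists='append'` is that re-running the load adds every row a second time. The sketch below shows the same chunked load against an in-memory database, dropping any earlier copy of the table first and comparing the number of rows written with the number of rows stored. The table name `investments_demo` and the five-row CSV are made up for illustration.

```python
import io
import sqlite3
import pandas as pd

# Made-up data standing in for the real file
demo_csv = "company_name,raised_amount_usd\nA,100\nB,200\nC,300\nD,400\nE,500\n"

conn_demo = sqlite3.connect(":memory:")
# Drop any earlier copy so that re-running the load does not duplicate rows
conn_demo.execute("drop table if exists investments_demo")

rows_written = 0
for chunk in pd.read_csv(io.StringIO(demo_csv), chunksize=2):
    chunk.to_sql("investments_demo", conn_demo, if_exists="append", index=False)
    rows_written += len(chunk)

rows_in_table = conn_demo.execute("select count(*) from investments_demo").fetchone()[0]
print(rows_written, rows_in_table)
conn_demo.close()
```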

In [12]:
conn = sqlite3.connect('cb_investments.db')

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, 
                         encoding='ISO-8859-1', usecols=useful_cols,
                         parse_dates=['funded_at', 'funded_month', 'funded_year'])  

for chunk in chunk_iter:
    chunk['company_country_code'] = chunk['company_country_code'].astype('category')
    chunk['funding_round_type'] = chunk['funding_round_type'].astype('category')
    chunk['investor_state_code'] = chunk['investor_state_code'].astype('object')
    chunk['investor_country_code'] = chunk['investor_country_code'].astype('object')
    chunk['investor_city'] = chunk['investor_city'].astype('object')
    chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'], downcast='integer')
    chunk.to_sql('investments', conn, if_exists='append', index=False)

I have now created a table called `investments` that I can query to make sure the data types match up with what I expected. 

In [13]:
q1 = '''
    select * 
    from investments 
    limit 10
'''
results1 = pd.read_sql(q1, conn)
print(results1)

  company_name company_category_code company_country_code company_state_code  \
0     AdverCar           advertising                  USA                 CA   
1   LaunchGram                  news                  USA                 CA   
2         uTaP             messaging                  USA               None   
3     ZoopShop              software                  USA                 OH   
4     eFuneral                   web                  USA                 OH   
5        Tackk                   web                  USA                 OH   
6     Acclaimd             analytics                  USA                 OH   
7     Acclaimd             analytics                  USA                 OH   
8     ToVieFor             ecommerce                  USA                 NY   
9     OHK Labs                sports                  USA                 FL   

          company_region   company_city  \
0                 SF Bay  San Francisco   
1                 SF Bay  Mountai

In [14]:
q2 = 'pragma table_info(investments)'
results2 = pd.read_sql(q2, conn)
print(results2)

    cid                   name       type  notnull dflt_value  pk
0     0           company_name       TEXT        0       None   0
1     1  company_category_code       TEXT        0       None   0
2     2   company_country_code       TEXT        0       None   0
3     3     company_state_code       TEXT        0       None   0
4     4         company_region       TEXT        0       None   0
5     5           company_city       TEXT        0       None   0
6     6          investor_name       TEXT        0       None   0
7     7  investor_country_code       TEXT        0       None   0
8     8    investor_state_code       TEXT        0       None   0
9     9        investor_region       TEXT        0       None   0
10   10          investor_city       TEXT        0       None   0
11   11     funding_round_type       TEXT        0       None   0
12   12              funded_at  TIMESTAMP        0       None   0
13   13           funded_month  TIMESTAMP        0       None   0
14   14   

It seems like each of the types is what I expected. I will now use the pandas SQLite workflow to answer several business questions.
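
For reference, here is a small sketch of how pandas column types end up as declared SQLite types when writing through a plain `sqlite3` connection. The dataframe and the table name `types_demo` are made up; the check uses the same `pragma table_info` query as above.

```python
import sqlite3
import pandas as pd

# Made-up rows with one text, one datetime and one float column
demo = pd.DataFrame({
    "company_name": ["A", "B"],
    "funded_at": pd.to_datetime(["2012-09-04", "2013-01-15"]),
    "raised_amount_usd": [1_000_000.0, None],
})

conn_demo = sqlite3.connect(":memory:")
demo.to_sql("types_demo", conn_demo, index=False)

declared = pd.read_sql("pragma table_info(types_demo)", conn_demo)
type_by_column = dict(zip(declared["name"], declared["type"]))
print(type_by_column)
conn_demo.close()
```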

### Which category of company attracted the most investments?

In [20]:
q = '''
    select 
        company_category_code as "Company Category",
        count(company_category_code) as "Number of Investments"
    from investments
    group by 1
    order by 2 desc
    limit 1
'''
results = pd.read_sql(q, conn)
print(results)

  Company Category  Number of Investments
0         software                   7243


### Which investor contributed the most money (across all startups)?

In [22]:
q = '''
    select 
        investor_name as "Investor",
        sum(raised_amount_usd) as "Total Investments"
    from investments
    group by 1
    order by 2 desc
    limit 1
'''
results = pd.read_sql(q, conn)
print(results)

                           Investor  Total Investments
0  Kleiner Perkins Caufield & Byers       1.121783e+10
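
One caveat when reading this total, under the assumption (not confirmed by the source) that `raised_amount_usd` holds the size of the whole funding round rather than each investor's own share: a round with several investors is then counted in full for each of them. The sketch below uses made-up rows to show the effect.

```python
import sqlite3
import pandas as pd

# Made-up rows: one 10M round with two investors, one 4M round with one investor
demo = pd.DataFrame({
    "company_name": ["X", "X", "Y"],
    "investor_name": ["Fund A", "Fund B", "Fund A"],
    "raised_amount_usd": [10_000_000, 10_000_000, 4_000_000],
})

conn_demo = sqlite3.connect(":memory:")
demo.to_sql("investments_demo", conn_demo, index=False)

totals = pd.read_sql(
    '''
    select investor_name, sum(raised_amount_usd) as total
    from investments_demo
    group by 1
    order by 2 desc
    ''',
    conn_demo,
)
print(totals)
conn_demo.close()
```

In this toy example the per-investor totals add up to 24 million even though only 14 million was raised, so under that assumption the sum is better read as "total size of the rounds the investor took part in".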


### Which investors contributed the most money per startup?

In [28]:
q = '''
    select 
        company_name as "Company Name",
        investor_name as "Investor",
        max(raised_amount_usd) as "Max Amount Invested"
    from investments
    group by 1
    order by 3 desc
    limit 10
'''
results = pd.read_sql(q, conn)
print(results)

             Company Name                   Investor  Max Amount Invested
0               Clearwire                BrightHouse         3.200000e+09
1               sigmacare     Marlin Equity Partners         2.600000e+09
2                Facebook   Digital Sky Technologies         1.500000e+09
3          Wave Broadband                GI Partners         1.050000e+09
4                     AOL                     Google         1.000000e+09
5                 Groupon        Andreessen Horowitz         9.500000e+08
6  University of Maryland       Siemens PLM Software         7.500000e+08
7            Vivint, Inc.              Goldman Sachs         5.650000e+08
8                Solyndra  U.S. Department of Energy         5.350000e+08
9       Fisker Automotive  U.S. Department of Energy         5.290000e+08
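
The query above selects `investor_name` without grouping by it or aggregating it. SQLite allows this, and when the query contains a single `max()` it takes such bare columns from the row that holds the maximum. This behaviour is specific to SQLite; many other databases reject the query. A minimal sketch on a made-up table:

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table investments_demo (company_name, investor_name, raised_amount_usd)"
)
# Made-up rows: company X has two investors with different amounts
conn_demo.executemany(
    "insert into investments_demo values (?, ?, ?)",
    [
        ("X", "Fund A", 1_000_000),
        ("X", "Fund B", 9_000_000),
        ("Y", "Fund C", 3_000_000),
    ],
)

rows = conn_demo.execute(
    '''
    select company_name, investor_name, max(raised_amount_usd)
    from investments_demo
    group by 1
    order by 3 desc
    '''
).fetchall()
print(rows)
conn_demo.close()
```

If several investors share the maximum amount for the same company, only one of them is returned, so the investor shown for each startup above may be one of several.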
