## Processing Crunchbase Data in Chunks and Loading into SQLite

### Introduction

In this project, we will look at startup investment data from Crunchbase.com. Every year, thousands of startup companies raise financing from investors. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The data set of investments we'll be exploring is from October 2013 and can be found on [GitHub](https://github.com/datahoarder/crunchbase-october-2013).

Throughout this project, we'll work under a memory constraint and assume we only have 10 megabytes of available memory. Because the data set contains over 50,000 rows, we'll read it into dataframes in chunks of 5,000 rows so that each chunk stays under 10 megabytes of memory. Finally, we will load each chunk into a table in a SQLite database.

Let's start by importing pandas and computing each column's missing value counts.

In [3]:
import pandas as pd
chunks = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

In [4]:
missing_value_list = []
for chunk in chunks:
    missing_value_list.append(chunk.isnull().sum())
    
combined_missing_value_counts = pd.concat(missing_value_list)
unique_combined_missing_value_counts = combined_missing_value_counts.groupby(combined_missing_value_counts.index).sum()
unique_combined_missing_value_counts.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64
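
The raw counts are easier to judge as a share of all rows. Below is a minimal sketch that accumulates the missing counts and the total row count chunk by chunk, then divides one by the other. The inline `csv_text` sample is made up for illustration and stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for crunchbase-investments.csv
csv_text = (
    "company_name,investor_city\n"
    "A,Boston\n"
    "B,\n"
    "C,\n"
    "D,Austin\n"
    ",Denver\n"
)

total_rows = 0
missing_counts = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    counts = chunk.isnull().sum()
    # Keep a running total instead of storing one Series per chunk
    missing_counts = counts if missing_counts is None else missing_counts + counts

# Fraction of rows with a missing value, per column
missing_share = (missing_counts / total_rows).sort_values()
print(missing_share)
```

Applied to the real file, the same approach would show at a glance that a column such as `investor_category_code`, with 50,427 missing values, is almost entirely empty.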

Let's take a look at the memory usage for each column.

In [6]:
chunks = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
series_memory = None
for chunk in chunks:
    # Accumulate per-column memory usage (in bytes) across chunks
    chunk_memory = chunk.memory_usage(deep=True)
    if series_memory is None:
        series_memory = chunk_memory
    else:
        series_memory += chunk_memory

# Drop memory usage calculation for the index
series_memory = series_memory.drop('Index')
series_memory

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411585
company_city              3505926
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

Let's add the memory usage for all columns to get the total memory usage of all the data (in megabytes).

In [7]:
series_memory.sum() / (1024 * 1024)

56.9876070022583
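
The 10 megabyte budget applies to each chunk rather than to the whole file, so it can be worth checking the footprint chunk by chunk as well. The following is a minimal sketch under stated assumptions: `sample_csv` is made-up data standing in for the real file, and `chunk_memory_mb` is a hypothetical helper, not part of the project.

```python
import io
import pandas as pd

# Hypothetical stand-in for crunchbase-investments.csv (25 made-up rows)
sample_csv = io.StringIO(
    "company_name,funded_year,raised_amount_usd\n"
    + "\n".join(f"Company {i},2012,{i * 1000.0}" for i in range(25))
)

def chunk_memory_mb(chunks):
    """Return the deep memory footprint of each chunk in megabytes."""
    return [chunk.memory_usage(deep=True).sum() / (1024 * 1024) for chunk in chunks]

sizes_mb = chunk_memory_mb(pd.read_csv(sample_csv, chunksize=10))

# Every chunk should fit within the 10 MB budget
print(len(sizes_mb), max(sizes_mb) < 10)
```

For the real data, a total of about 57 megabytes spread over 5,000-row chunks works out to roughly 5 megabytes per chunk on average, which is consistent with the budget.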

Based on the missing value counts, we can drop the investor_category_code column since most of its values are missing. We can also drop the columns containing URLs, as they are not useful for our analysis.

In [8]:
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
# Keep every column except the ones we decided to drop
use_cols = chunk.columns.drop(drop_cols).tolist()
use_cols

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

### Data Types

Now that we have a good sense of the missing values, let's get familiar with each column's data types and look at the first 5 rows of a chunk before loading the data into SQLite.

In [12]:
# Key: Column name, Value: Data types
data_types = {}
chunks = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=use_cols)

for chunk in chunks:
    for col in chunk.columns:
        if col not in data_types:
            data_types[col] = [str(chunk.dtypes[col])]
        else:
            data_types[col].append(str(chunk.dtypes[col]))

# Collapse the per-chunk lists into the set of distinct types seen for each column
unique_data_types = {col: set(types) for col, types in data_types.items()}
unique_data_types

{'company_category_code': {'object'},
 'company_city': {'object'},
 'company_country_code': {'object'},
 'company_name': {'object'},
 'company_region': {'object'},
 'company_state_code': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'funding_round_type': {'object'},
 'investor_city': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_name': {'object'},
 'investor_region': {'object'},
 'investor_state_code': {'float64', 'object'},
 'raised_amount_usd': {'float64'}}
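
Columns with more than one type, such as `investor_city`, are most likely a side effect of chunked reading: pandas infers types separately for each chunk, so a chunk in which a text column is entirely missing comes back as `float64` (all `NaN`), and an integer column with a missing value is promoted to `float64`. A small sketch with made-up data illustrates the effect; the `csv_text` sample is an assumption for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample: the second chunk has no investor_city values
# and one missing funded_year
csv_text = (
    "investor_city,funded_year\n"
    "Boston,2012\n"
    "Austin,2011\n"
    ",2010\n"
    ",\n"
)

dtypes_seen = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for col, dtype in chunk.dtypes.items():
        # Record every distinct dtype inferred for this column
        dtypes_seen.setdefault(col, set()).add(str(dtype))

print(dtypes_seen)
```

Passing an explicit `dtype` mapping to `pd.read_csv` is one way to keep the types consistent across chunks.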

In [13]:
chunk.head()

| | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50000 | NuORDER | fashion | USA | CA | Los Angeles | West Hollywood | Mortimer Singer | NaN | NaN | unknown | NaN | series-a | 2012-10-01 | 2012-10 | 2012-Q4 | 2012 | 3060000.0 |
| 50001 | ChaCha | advertising | USA | IN | Indianapolis | Carmel | Morton Meyerson | NaN | NaN | unknown | NaN | series-b | 2007-10-01 | 2007-10 | 2007-Q4 | 2007 | 12000000.0 |
| 50002 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2008-04-18 | 2008-04 | 2008-Q2 | 2008 | 500000.0 |
| 50003 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | 750000.0 |
| 50004 | Unified Color | software | USA | CA | SF Bay | South San Frnacisco | Mr. Andrew Oung | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | NaN |


### Loading Chunks into SQLite

In order to start analyzing the data, we will load each chunk into a table in a SQLite database so we can query the full data set.

In [None]:
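import sqlite3
conn = sqlite3.connect('crunchbase.db')
# Read only the columns we decided to keep
chunks = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=use_cols)

for chunk in chunks:
    # Append each chunk to the same table so the full data set can be queried
    chunk.to_sql("investments", conn, if_exists='append', index=False)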
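
Once the chunks are loaded, a quick sanity check is to inspect the table schema and the row count. The sketch below uses an in-memory database and a tiny made-up dataframe so that it runs on its own; the `sample` rows are assumptions for illustration, and only the table name `investments` comes from the project.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical sample rows standing in for the Crunchbase chunks
sample = pd.DataFrame({
    "company_name": ["A", "B", "C", "D"],
    "funded_year": [2010, 2011, 2012, 2013],
    "raised_amount_usd": [1.0e6, None, 2.5e6, 5.0e5],
})

# Append in two pieces, mimicking the chunked load
for start in range(0, len(sample), 2):
    sample.iloc[start:start + 2].to_sql("investments", conn, if_exists="append", index=False)

# PRAGMA table_info returns one row per column: (cid, name, type, ...)
schema = conn.execute("PRAGMA table_info(investments)").fetchall()
column_types = {row[1]: row[2] for row in schema}
row_count = conn.execute("SELECT COUNT(*) FROM investments").fetchone()[0]

print(column_types, row_count)
conn.close()
```

Running the same two queries against `crunchbase.db` would confirm that all rows arrived and show which SQLite type each column was given when the first chunk created the table.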