# Analyzing Startup Fundraising Deals from Crunchbase

In this guided project, we'll practice different ways of working with larger datasets in pandas by analyzing startup investments from Crunchbase.com.
 
Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

Throughout this guided project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv consumes 10.3 megabytes of disk space, we know from earlier missions that pandas often requires 4 to 6 times as much space in memory as the file does on disk (especially when there are many string columns).
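
As a rough check on that rule of thumb, the arithmetic can be sketched as follows. The file size and the 4x to 6x multipliers come from the paragraph above; the variable names are illustrative only.

```python
# Rough in-memory estimate for crunchbase-investments.csv,
# using the 4x to 6x rule of thumb quoted above.
disk_mb = 10.3    # file size on disk, from the text
budget_mb = 10    # memory budget assumed for this step

low_estimate = disk_mb * 4
high_estimate = disk_mb * 6

print(f"Estimated footprint: {low_estimate:.1f} to {high_estimate:.1f} MB")
print(f"Low estimate fits in budget: {low_estimate <= budget_mb}")
```

Even the low end of that range is several times the 10 megabyte budget, which is why the file has to be processed in chunks rather than loaded whole.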

solution:
https://github.com/dataquestio/solutions/blob/master/Mission167Solutions.ipynb

In [1]:
import pandas as pd
import numpy as np


Because the data set contains over 50,000 rows, you'll need to read it into dataframes in 5,000-row chunks to ensure that each chunk consumes much less than 10 megabytes of memory.


Across all of the chunks, become familiar with:
- Each column's missing value counts
- Each column's memory footprint
- The total memory footprint of all of the chunks combined
- Which column(s) we can drop because they aren't useful for analysis


In [None]:
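# Running totals, accumulated across all chunks
mv_by_col = {}      # column name -> missing value count
memory_by_col = {}  # column name -> memory footprint in bytes
total_memory = 0    # combined memory footprint of all chunks, in bytes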
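
One way to fill in those accumulators is sketched below. This is a minimal sketch, not the reference solution: the helper name summarize_chunks and the tiny inline sample are made up for illustration, and the sample stands in for the real file so the cell runs on its own.

```python
import io
import pandas as pd

def summarize_chunks(source, chunksize=5000, **read_csv_kwargs):
    """Accumulate missing-value counts and memory use, one chunk at a time."""
    mv_by_col = {}
    memory_by_col = {}
    total_memory = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        missing = chunk.isnull().sum()
        # deep=True measures the actual size of string columns.
        memory = chunk.memory_usage(deep=True, index=False)
        for col in chunk.columns:
            mv_by_col[col] = mv_by_col.get(col, 0) + int(missing[col])
            memory_by_col[col] = memory_by_col.get(col, 0) + int(memory[col])
        # The running total also counts each chunk's index.
        total_memory += int(chunk.memory_usage(deep=True).sum())
    return mv_by_col, memory_by_col, total_memory

# Tiny made-up sample standing in for the real CSV file.
sample_csv = "name,amount\nA,100\nB,\n,300\nD,400\n"
mv, mem, total = summarize_chunks(io.StringIO(sample_csv), chunksize=2)
print(mv)     # missing values per column
print(mem)    # bytes per column
print(total)  # bytes across all chunks
```

For the project itself, the call would take the path crunchbase-investments.csv in place of the sample. If pandas raises a decoding error, passing an encoding such as encoding="ISO-8859-1" through to read_csv may help; that encoding is an assumption about the file, not something stated here. Dividing the byte counts by 2 ** 20 converts them to megabytes.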

Let's get familiar with the column types before adding the data into SQLite.

- Identify the types for each column.
- Identify the numeric columns we can represent using more space-efficient types.
- For text columns:
  - Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
  - See if we can clean any text columns and separate them into multiple numeric columns without adding any overhead when querying.
- Make your changes to the code from the last step so that the overall memory the data consumes stays under 10 megabytes.
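
The steps above can be sketched as two small helpers: one that counts distinct text values across chunks, and one that shrinks a chunk's dtypes. Both function names, the sample columns, and the choice of which columns to drop or convert are hypothetical; the real decisions depend on what the counts show for the Crunchbase file.

```python
import io
import pandas as pd

def count_unique_text_values(source, chunksize=5000, **read_csv_kwargs):
    """Count distinct non-null values per non-numeric column across all chunks."""
    seen = {}
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        for col in chunk.columns:
            if not pd.api.types.is_numeric_dtype(chunk[col]):
                seen.setdefault(col, set()).update(chunk[col].dropna().unique())
    return {col: len(values) for col, values in seen.items()}

def shrink_chunk(chunk, category_cols=(), drop_cols=()):
    """Drop unneeded columns and use smaller dtypes for the rest."""
    chunk = chunk.drop(columns=list(drop_cols))
    # Downcast integer columns to the smallest integer type that fits.
    for col in chunk.select_dtypes(include="integer").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
    # Low-cardinality text columns are stored more compactly as categories.
    for col in category_cols:
        chunk[col] = chunk[col].astype("category")
    return chunk

# Tiny made-up sample standing in for the real CSV file.
sample_csv = (
    "company,round,year,note\n"
    "A,seed,2008,x\nB,seed,2009,y\nA,angel,2008,z\nC,seed,2010,w\n"
)
unique_counts = count_unique_text_values(io.StringIO(sample_csv), chunksize=2)
first_chunk = next(pd.read_csv(io.StringIO(sample_csv), chunksize=2))
small = shrink_chunk(first_chunk, category_cols=["round"], drop_cols=["note"])
print(unique_counts)
print(small.dtypes)
```

The category type tends to pay off only when a column has few distinct values relative to its row count. The set-based counter also holds every distinct value in memory, so it is itself a cost to watch under a tight budget.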


The next step is to load each chunk into a table in a SQLite database so we can query the full data set.

- Create and connect to a new SQLite database file.
- Expand on the existing chunk processing code to export each chunk to a new table in the SQLite database.
- Query the table and make sure the data types match up to what you had in mind for each column.
- Use the !wc IPython command to return the file size of the database.
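
A minimal sketch of that export loop follows, assuming a table named investments and using an in-memory database with made-up rows so the cell runs on its own. The helper name load_chunks_to_sqlite is hypothetical.

```python
import io
import sqlite3
import pandas as pd

def load_chunks_to_sqlite(source, conn, table, chunksize=5000, **read_csv_kwargs):
    """Append every chunk of a CSV to one SQLite table; return rows written."""
    rows_written = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        # if_exists="append" creates the table on the first chunk,
        # then adds rows to it on every later chunk.
        chunk.to_sql(table, conn, if_exists="append", index=False)
        rows_written += len(chunk)
    return rows_written

# Tiny made-up sample standing in for the real CSV file.
sample_csv = "company,year,amount\nA,2008,1.5\nB,2009,2.0\nC,2010,3.25\n"
conn = sqlite3.connect(":memory:")  # the project would pass a file path here
rows_written = load_chunks_to_sqlite(
    io.StringIO(sample_csv), conn, "investments", chunksize=2
)

# PRAGMA table_info lists each column with the type SQLite recorded for it.
schema = pd.read_sql("PRAGMA table_info(investments);", conn)
print(rows_written)
print(schema[["name", "type"]])
```

With a database stored on disk, os.path.getsize on the database file is a pure-Python alternative to !wc for checking the file size in bytes.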


Now that the data is in SQLite, we can use the pandas SQLite workflow we learned in the last mission to explore and analyze startup investments. Remember that each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.

Use the pandas SQLite workflow to answer the following questions:
- What proportion of the total amount of funds did the top 10% raise? What about the top 1%? Compare these values to the proportions the bottom 10% and bottom 1% raised.
- Which category of company attracted the most investments?
- Which investor contributed the most money (across all startups)?
- Which investors contributed the most money per startup?
- Which funding round was the most popular? Which was the least popular?
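
The questions above can be approached with queries along these lines. This is a sketch under stated assumptions: the column names (company_name, investor_name, company_category_code, funding_round_type, raised_amount_usd) are assumed rather than confirmed here, the rows are made up, and share_of_total is a hypothetical helper.

```python
import sqlite3
import pandas as pd

# Made-up rows; the column names are assumptions about the investments table.
sample = pd.DataFrame({
    "company_name": [f"c{i}" for i in range(10)],
    "investor_name": ["inv_a", "inv_b"] * 5,
    "company_category_code": ["software"] * 6 + ["biotech"] * 4,
    "funding_round_type": ["series-a"] * 7 + ["angel"] * 3,
    "raised_amount_usd": [100.0, 50.0, 10.0, 10.0, 10.0, 5.0, 5.0, 5.0, 3.0, 2.0],
})
conn = sqlite3.connect(":memory:")
sample.to_sql("investments", conn, index=False)

def share_of_total(totals, fraction, top=True):
    """Share of all funds raised by the top (or bottom) fraction of companies."""
    ordered = totals.sort_values(ascending=False)
    n = max(1, int(len(ordered) * fraction))
    subset = ordered.head(n) if top else ordered.tail(n)
    return subset.sum() / ordered.sum()

# Total raised per company, since one company can span many rows.
totals = pd.read_sql(
    "SELECT company_name, SUM(raised_amount_usd) AS total "
    "FROM investments GROUP BY company_name;", conn
).set_index("company_name")["total"]
top_10 = share_of_total(totals, 0.10)
bottom_10 = share_of_total(totals, 0.10, top=False)

top_category = pd.read_sql(
    "SELECT company_category_code, COUNT(*) AS n_investments "
    "FROM investments GROUP BY company_category_code "
    "ORDER BY n_investments DESC LIMIT 1;", conn)

top_investor = pd.read_sql(
    "SELECT investor_name, SUM(raised_amount_usd) AS total "
    "FROM investments GROUP BY investor_name ORDER BY total DESC LIMIT 1;", conn)

per_startup = pd.read_sql(
    "SELECT investor_name, "
    "SUM(raised_amount_usd) / COUNT(DISTINCT company_name) AS avg_per_startup "
    "FROM investments GROUP BY investor_name ORDER BY avg_per_startup DESC;", conn)

# First row is the most popular round type, last row the least popular.
rounds = pd.read_sql(
    "SELECT funding_round_type, COUNT(*) AS n FROM investments "
    "GROUP BY funding_round_type ORDER BY n DESC;", conn)

print(top_10, bottom_10)
print(top_category, top_investor, per_startup, rounds, sep="\n")
```

Grouping in SQL and doing the percentile arithmetic in pandas keeps only one small aggregated result in memory at a time, which fits the memory constraint of this project.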


Here are some ideas for further exploration:

- Repeat the tasks in this guided project using stricter memory constraints (under 1 megabyte).
- Clean and analyze the other Crunchbase data sets from the same GitHub repo.
- Understand which columns the data sets share, and how the data sets are linked.
- Create a relational database design that links the data sets together and reduces the overall disk space the database file consumes.
- Use pandas to populate each table in the database, create the appropriate indexes, and so on.
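
As one possible starting point for the relational design idea, the sketch below moves repeated company text into its own table, links it by an integer key, and adds an index on that key. The table names, column names, and rows are all hypothetical, and a real design would depend on what the other data sets contain.

```python
import sqlite3
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
investments = pd.DataFrame({
    "company_name": ["c0", "c0", "c1"],
    "company_category_code": ["software", "software", "biotech"],
    "investor_name": ["inv_a", "inv_b", "inv_a"],
    "raised_amount_usd": [60.0, 40.0, 50.0],
})

# One row per company, with a surrogate integer key.
companies = (
    investments[["company_name", "company_category_code"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
companies.insert(0, "company_id", range(1, len(companies) + 1))

# Replace the repeated company text with the integer key.
facts = investments.merge(
    companies, on=["company_name", "company_category_code"]
)[["company_id", "investor_name", "raised_amount_usd"]]

conn = sqlite3.connect(":memory:")
companies.to_sql("companies", conn, index=False)
facts.to_sql("investments", conn, index=False)
# An index on the join key speeds up lookups by company.
conn.execute("CREATE INDEX idx_investments_company ON investments (company_id);")

joined = pd.read_sql(
    "SELECT c.company_name, SUM(i.raised_amount_usd) AS total "
    "FROM investments AS i JOIN companies AS c ON c.company_id = i.company_id "
    "GROUP BY c.company_name ORDER BY total DESC;", conn)
print(joined)
```

Storing each company's text once and referring to it by an integer is what reduces disk space; the same pattern could apply to investors.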