# 100_load_startup_datasets

## Purpose
In this notebook we begin our analysis of the Crunchbase datasets, focusing on loading and reviewing the raw data. The aim of this notebook is to produce two separate datasets, one for each group of research questions (RQs):
* The first dataset will be used to analyse the first RQ: 
 * (RQ1: Is there a correlation between a company’s industry and location and the amount of funding it receives?)
* The second dataset will be used to analyse the second and third RQs: 
 * (RQ2: Does a founder’s education affect the amount of funding their company receives from venture capitalists?) and 
 * (RQ3: Founders with previous experience vs. founders with no experience.)


## Datasets
* _Input_: people.csv, degrees.csv, jobs.csv, organizations.csv
* _Output_: 100_dataset1.pkl, 100_dataset2.pkl

In [2]:
import os
import re
import sys
import hashlib
import pandas as pd
import numpy as np
%matplotlib inline
pd.set_option('display.max_columns', None)
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)

## Loading the Datasets
The datasets are all in standard CSV format, so we can read them in using pandas' handy `read_csv` function as follows.

In [3]:
people_df = pd.read_csv('../../data/raw/people.csv', header=0)
degrees_df = pd.read_csv("../../data/raw/degrees.csv", header=0)
jobs_df = pd.read_csv("../../data/raw/jobs.csv", header=0)

org_df = pd.read_csv("../../data/raw/organizations.csv", header=0, low_memory=False)

We now have our four datasets loaded, one for each category (people, degrees, jobs, and organisations). Each dataset has a different number of rows.

- org_df is the first dataset that we will use

### Merging raw datasets to form our second dataset to be used for analysis:

In the cell below we rename columns in order to enable successful merging. We then merge people_df, degrees_df, and jobs_df using the 'person_uuid' column, and merge the result with org_df using the 'org_uuid' column. The merged dataset is the basis for our second dataset.

In [4]:
## renaming of columns to allow for merging
people_df = people_df.rename(columns={'uuid':'person_uuid'})
org_df = org_df.rename(columns={'uuid':'org_uuid'})

## Merged to result in raw Dataset 2
merged_df = people_df.merge(degrees_df, on='person_uuid').merge(jobs_df, on='person_uuid').merge(org_df, on='org_uuid')
merged_df.shape

(808014, 65)
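
Two defaults of `DataFrame.merge` help explain the shape and column names seen here: the join is an inner join, so rows without a match on the key are dropped, and columns that share a name in both frames receive `_x` and `_y` suffixes. A minimal sketch on toy data (the frames and values below are made up for illustration and are not taken from the Crunchbase files):

```python
import pandas as pd

# Toy stand-ins for people_df, jobs_df and org_df (hypothetical values).
toy_people = pd.DataFrame({
    "person_uuid": ["p1", "p2", "p3"],
    "city": ["Dublin", "London", "Paris"],
})
toy_jobs = pd.DataFrame({
    "person_uuid": ["p1", "p1", "p2"],
    "org_uuid": ["o1", "o2", "o2"],
    "title": ["Founder", "Advisor", "CEO"],
})
toy_orgs = pd.DataFrame({
    "org_uuid": ["o1", "o2"],
    "city": ["Cork", "Berlin"],
})

# Default how="inner": only keys present on both sides survive each merge.
toy_merged = (
    toy_people
    .merge(toy_jobs, on="person_uuid")
    .merge(toy_orgs, on="org_uuid")
)

print(toy_merged.columns.tolist())  # 'city' is split into 'city_x' and 'city_y'
print(toy_merged.shape)
```

In this sketch `p3` has no job record and so does not appear in the result, while `p1` appears twice, once per job. The same mechanism is why merged_df can hold several rows per person, and why columns such as `country_code` show up as `country_code_x` (person) and `country_code_y` (organisation).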

# Saving our two Datasets

The first dataset will be used to analyse the following research question:

* Is there a correlation between a company’s industry and location and the amount of funding it receives?

Our second dataset will be used for the following research questions:

* Does a founder’s education affect the amount of funding their company receives from venture capitalists?
* Founders with previous experience vs. founders with no experience.

## Saving 1st Dataset
As mentioned above, this dataset will be used to analyse the first research question:
* Is there a correlation between a company’s industry and location and the amount of funding it receives?

The results below give us an insight into the columns available in the org_df that we created above. This is useful because it allows us to select the columns needed for analysis.

In [5]:
org_df.columns

Index(['company_name', 'roles', 'permalink', 'domain', 'homepage_url',
       'country_code', 'state_code', 'region', 'city', 'address', 'status',
       'short_description', 'category_list', 'category_group_list',
       'funding_rounds', 'funding_total_usd', 'founded_on', 'last_funding_on',
       'closed_on', 'employee_count', 'email', 'phone', 'facebook_url',
       'linkedin_url', 'cb_url', 'logo_url', 'twitter_url', 'aliases',
       'org_uuid', 'created_at', 'updated_at', 'primary_role', 'type'],
      dtype='object')

For this dataset we save only the columns from the organisations dataset that are needed for our analysis, without any unnecessary data. The columns we identified as necessary are collected in the list below.

In [6]:
cols_to_save = [ 
            'company_name', 'roles', 'country_code', 'state_code', 'region', 'city', 'status', # company/startup details
            'category_list', 'category_group_list',                   # startups industry details
            'funding_rounds', 'funding_total_usd', 'last_funding_on', # startups funding details
            'founded_on', 'employee_count',                           # extra details
            'org_uuid',
            'primary_role', 'type']
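
Selecting with a list raises a `KeyError` if any name is misspelled or absent from the frame, so it can help to check the list against the columns first. A small sketch, assuming a hypothetical helper `missing_columns` and a toy frame (neither is part of the original notebook):

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are not columns of `df` (hypothetical helper)."""
    return [col for col in wanted if col not in df.columns]

# Toy stand-in for org_df (made-up values).
toy_org = pd.DataFrame({
    "company_name": ["Acme"],
    "country_code": ["IRL"],
    "funding_total_usd": [1000000.0],
})

wanted = ["company_name", "funding_total_usd", "employee_count"]
print(missing_columns(toy_org, wanted))  # names that would trigger a KeyError

# Keep only the requested columns that actually exist.
present = [col for col in wanted if col in toy_org.columns]
subset = toy_org[present]
print(subset.columns.tolist())
```

An empty list from `missing_columns(org_df, cols_to_save)` would indicate that the selection is safe to run.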

### Saving first resulting dataset from 100_load_startup_datasets to pickle

In [7]:
org_df[cols_to_save].to_pickle("../../data/processed/100_dataset1.pkl")

## Saving 2nd Dataset
As mentioned above, this dataset will be used to analyse the second and third research questions:
* Does a founder’s education affect the amount of funding their company receives from venture capitalists?
* Founders with previous experience vs. founders with no experience.


In [8]:
merged_df.columns

Index(['first_name', 'last_name', 'country_code_x', 'state_code_x', 'city_x',
       'cb_url_x', 'logo_url_x', 'twitter_url_x', 'facebook_url_x',
       'linkedin_url_x', 'primary_affiliation_organization',
       'primary_affiliation_title', 'primary_organization_uuid', 'gender',
       'person_uuid', 'created_at_x', 'updated_at_x', 'degree_uuid',
       'institution_uuid', 'degree_type', 'subject', 'started_on_x',
       'completed_on', 'is_completed', 'created_at_y', 'updated_at_y',
       'org_uuid', 'job_uuid', 'started_on_y', 'ended_on', 'is_current',
       'title', 'job_type', 'company_name', 'roles', 'permalink', 'domain',
       'homepage_url', 'country_code_y', 'state_code_y', 'region', 'city_y',
       'address', 'status', 'short_description', 'category_list',
       'category_group_list', 'funding_rounds', 'funding_total_usd',
       'founded_on', 'last_funding_on', 'closed_on', 'employee_count', 'email',
       'phone', 'facebook_url_y', 'linkedin_url_y', 'cb_url_y', 'log

The results above give us an insight into the columns available in the merged_df that we created earlier. This is useful because it allows us to select the columns needed for analysis.

In [9]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 808014 entries, 0 to 808013
Data columns (total 65 columns):
first_name                          808014 non-null object
last_name                           808014 non-null object
country_code_x                      504181 non-null object
state_code_x                        374996 non-null object
city_x                              488219 non-null object
cb_url_x                            808014 non-null object
logo_url_x                          808014 non-null object
twitter_url_x                       297147 non-null object
facebook_url_x                      148717 non-null object
linkedin_url_x                      661426 non-null object
primary_affiliation_organization    742514 non-null object
primary_affiliation_title           745807 non-null object
primary_organization_uuid           742514 non-null object
gender                              807448 non-null object
person_uuid                         808014 non-null object
crea

Once again, the result above gives us a good insight into the columns we are dealing with. It is important to know the type of each column, as well as its number of non-null values. For example, if a column has very few non-null values, it might not have sufficient data for analysis.
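
The non-null counts reported by `info()` can also be expressed as a share of rows, which makes sparse columns easier to spot. A minimal sketch on a made-up frame; the 0.5 threshold is an arbitrary assumption for illustration, not a value used in this notebook:

```python
import pandas as pd

# Toy frame with deliberately sparse columns (made-up values).
toy = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cat", "Dan"],
    "state_code": ["CA", None, None, None],
    "twitter_url": ["a", "b", None, None],
})

# notna() gives a boolean frame; the column mean is the fraction of non-null rows.
non_null_share = toy.notna().mean().sort_values()
print(non_null_share)

# Columns where fewer than half of the rows hold a value (assumed threshold).
sparse_cols = non_null_share[non_null_share < 0.5].index.tolist()
print(sparse_cols)
```

Applied to merged_df, the same expression would rank columns such as `facebook_url_x` and `twitter_url_x` by how complete they are.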

For this dataset we save only the columns from the merged dataset that are needed for our analysis, without any unnecessary data. The columns we identified as necessary are collected in the list below.

In [10]:
cols_to_save2 = [
    'first_name', 'last_name', 'gender',                                    # person details
    'company_name', 'funding_rounds', 'funding_total_usd', 'primary_role',  # startup details which person involved with
    'country_code_y', 'state_code_y', 'city_y',
    'title','job_type',                                                     # job details in startup
    'subject', 'degree_type',                                               # degree details
    'person_uuid', 'degree_uuid', 'institution_uuid', 'org_uuid']           # uuid including unique institution id

### Saving second resulting dataset from 100_load_startup_datasets to pickle

In [11]:
merged_df[cols_to_save2].to_pickle("../../data/processed/100_dataset2.pkl")
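
Pickle preserves column dtypes and the index, so a later notebook can load the file with `pd.read_pickle` and get the frame back unchanged. A quick round-trip sketch using a temporary directory and a made-up frame (the file name is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy frame with a missing value (made-up data).
toy = pd.DataFrame({
    "person_uuid": ["p1", "p2"],
    "funding_total_usd": [1500000.0, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes, treating NaNs in the same position as equal.
print(restored.equals(toy))
print(restored.dtypes)
```

The same pattern, `pd.read_pickle("../../data/processed/100_dataset2.pkl")`, should load the dataset saved above in a later notebook.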