This file contains the code for cleaning the rounds2 file of the project.

## Code Style
    - Case: 
        - snake_case for objects
        - camelCase for functions and classes
    - Double quotes first, then single quotes

## Libraries used
    - Pandas
    - Numpy

## The Workflow
    - Check parsing of variables
    - Treat missing values

In [1]:
# import the required libraries
import numpy as np # req for pandas. version 1.15.0
import pandas as pd # for data wrangling. version 0.23.4



In [7]:
# import matplotlib for a few plots
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# read in data to rounds data frame
rounds = pd.read_csv("../../Data/rounds2.csv", sep = ",", encoding = "ISO-8859-1")
# using ISO-8859-1 because the file contains bytes that do not decode as UTF-8
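The encoding choice can be illustrated with a single byte. This is a minimal sketch, assuming the file contains Latin-1 bytes such as 0xE9 ("é"); the sample bytes are made up and not taken from rounds2.csv.

```python
# Hypothetical sample: a permalink ending in "é", encoded as ISO-8859-1 (byte 0xE9)
raw = b"/organization/caf\xe9"

# Decoding as UTF-8 fails: 0xE9 opens a multi-byte sequence that is never completed
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every possible byte to a character, so decoding never raises
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)
print(latin1_text)
```

Because ISO-8859-1 never raises, it can also silently produce wrong characters if the file was actually written in some other encoding, so it is worth spot-checking a few non-ASCII names after loading.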

In [3]:
rounds.info()  # prints its own report and returns None, so it needs no print() wrapper
print(rounds.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114949 entries, 0 to 114948
Data columns (total 6 columns):
company_permalink          114949 non-null object
funding_round_permalink    114949 non-null object
funding_round_type         114949 non-null object
funding_round_code         31140 non-null object
funded_at                  114949 non-null object
raised_amount_usd          94959 non-null float64
dtypes: float64(1), object(5)
memory usage: 5.3+ MB
company_permalink           object
funding_round_permalink     object
funding_round_type          object
funding_round_code          object
funded_at                   object
raised_amount_usd          float64
dtype: object


In [4]:
rounds.columns

Index(['company_permalink', 'funding_round_permalink', 'funding_round_type',
       'funding_round_code', 'funded_at', 'raised_amount_usd'],
      dtype='object')

## Some initial information about the data 

The DF contains 6 variables and 114,949 rows. Of the 6 variables, only one (raised_amount_usd) is numeric (64-bit floating point). The other 5 are objects (mainly strings).

In [5]:
rounds.head(10);

In [6]:
# statistics about raised_amount_usd, the only numeric var
rounds.raised_amount_usd.describe()

count    9.495900e+04
mean     1.042687e+07
std      1.148212e+08
min      0.000000e+00
25%      3.225000e+05
50%      1.680511e+06
75%      7.000000e+06
max      2.127194e+10
Name: raised_amount_usd, dtype: float64

In [7]:
# number of unique values in each of the 6 variables (NaN counts as a value)
rounds.nunique(dropna = False)

company_permalink           90247
funding_round_permalink    114949
funding_round_type             14
funding_round_code              9
funded_at                    5033
raised_amount_usd           22096
dtype: int64

We can see that funding_round_type contains 14 unique values and funding_round_code contains 8 distinct codes plus missing values (9 unique values when NaN is counted), while funded_at holds dates stored as strings. Next, we'll move on to the missing values.

In [9]:
# column wise percentage of missing values sorted in descending order
rounds.isnull().mean().sort_values(ascending = False)

funding_round_code         0.729097
raised_amount_usd          0.173903
funded_at                  0.000000
funding_round_type         0.000000
funding_round_permalink    0.000000
company_permalink          0.000000
dtype: float64

funding_round_code has by far the most missing values (about 73% of rows), while raised_amount_usd is missing in about 17% of rows.

## Missing Value Treatment
There are two columns that contain missing values: funding_round_code which contains 73% missing values and raised_amount_usd which contains 17.4% missing values. Of the two, raised_amount_usd is far more important to our analysis than funding_round_code. So, it's better to ignore funding_round_code for now and focus on treating the missing values in raised_amount_usd. 

The missing values can be treated in one of two ways:
    - Ignore the rows with missing values by excluding them from the analysis.
    - Impute the missing values using the mean, the median, a weighted mean or the results from a model. The mean and the median are by far the simplest, so we'll choose between those two.
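The two options above can be sketched on a toy column. The values are made up for illustration, and the mean is used as the fill value only as an example.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two of the five amounts are missing
toy = pd.DataFrame({"raised_amount_usd": [1.0e6, np.nan, 5.0e6, np.nan, 3.0e6]})

# Option 1: exclude the rows with a missing amount
dropped = toy.dropna(subset=["raised_amount_usd"])

# Option 2: impute the missing amounts with the column mean (NaN is skipped by mean())
fill_value = toy["raised_amount_usd"].mean()
imputed = toy.fillna({"raised_amount_usd": fill_value})

print(len(dropped), len(imputed))
print(imputed["raised_amount_usd"].tolist())
```

Dropping shrinks the data set, while imputing keeps every row at the cost of adding values that were never observed.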

In [6]:
# statistics of raised_amount_usd
rounds.raised_amount_usd.describe()

count    9.495900e+04
mean     1.042687e+07
std      1.148212e+08
min      0.000000e+00
25%      3.225000e+05
50%      1.680511e+06
75%      7.000000e+06
max      2.127194e+10
Name: raised_amount_usd, dtype: float64

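Now, that's a heavily right-skewed distribution: the mean (about 10.4 million USD) is roughly six times the median (about 1.68 million USD), while the maximum exceeds 21 billion USD.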
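A rough feel for this kind of skew can be had without a plot by bucketing the amounts by order of magnitude. The sketch below uses made-up amounts, not the real column.

```python
import numpy as np
import pandas as pd

# Toy amounts (made up) spanning several orders of magnitude
amounts = pd.Series([5e4, 2e5, 3e5, 8e5, 1.5e6, 2e6, 7e6, 4e7, 2e9])

# Positive skewness means a long right tail, which pulls the mean above the median
print(amounts.skew() > 0, amounts.mean() > amounts.median())

# Bucket by order of magnitude (floor of log10) to get a rough text histogram
magnitude = np.floor(np.log10(amounts)).astype(int)
counts = magnitude.value_counts().sort_index()
for power, n in counts.items():
    print(f"1e{power}: " + "#" * int(n))
```

On the real data, zero amounts would need to be filtered out first, since log10(0) is not finite.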

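To figure out the imputation method, compare how far each candidate centre (the mean and the median) lies from the max with how far it lies from the min, i.e. (max - centre) against (centre - min).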

In [None]:
raised_amt_median = rounds.raised_amount_usd.median()
raised_amt_min = rounds.raised_amount_usd.min()
raised_amt_max = rounds.raised_amount_usd.max()
raised_amt_mean = rounds.raised_amount_usd.mean()

In [None]:
# ratio of (max - centre) to (centre - min); a value of 1 would mean the centre sits mid-range
print((raised_amt_max - raised_amt_mean) / (raised_amt_mean - raised_amt_min))
print((raised_amt_max - raised_amt_median) / (raised_amt_median - raised_amt_min))

The mean is nowhere near the centre of the range of raised_amount_usd: the distance from the mean to the max is about 2,000 times the distance from the min to the mean. The median fares even worse on this measure, with a ratio of roughly 12,700 (both figures follow from the describe() output above). By this measure, we're better off imputing using the mean.
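The two ratios can be reproduced by hand from the summary statistics printed earlier. This sketch copies the values from the describe() output instead of reading the data, so the results are only as precise as those rounded figures.

```python
# Summary statistics copied from the describe() output above
raised_min = 0.0
raised_median = 1.680511e06  # the 50% row
raised_mean = 1.042687e07
raised_max = 2.127194e10

# Distance to the maximum divided by distance to the minimum, for each centre
ratio_mean = (raised_max - raised_mean) / (raised_mean - raised_min)
ratio_median = (raised_max - raised_median) / (raised_median - raised_min)

print(round(ratio_mean))    # about 2,039
print(round(ratio_median))  # about 12,657
```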

Having settled on the mean, we can do better than filling every gap with the overall mean: each missing value can be imputed with the mean of its own funding category (for example, its funding_round_type).
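Category-wise imputation can be sketched with a group-by. This is one possible approach on made-up data, assuming funding_round_type is the category used for grouping.

```python
import numpy as np
import pandas as pd

# Toy data (made up); the column names follow the rounds data frame
toy = pd.DataFrame({
    "funding_round_type": ["seed", "seed", "seed", "venture", "venture", "venture"],
    "raised_amount_usd": [1.0e5, 3.0e5, np.nan, 4.0e6, np.nan, 6.0e6],
})

# Mean of each funding type, broadcast back onto the rows of that type
type_mean = toy.groupby("funding_round_type")["raised_amount_usd"].transform("mean")

# Fill only the missing amounts; observed amounts are left unchanged
toy["raised_amount_usd"] = toy["raised_amount_usd"].fillna(type_mean)

print(toy)
```

A funding type with no observed amounts at all has a NaN mean, so its rows would stay missing and need a fallback such as the overall mean.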

# Checkpoints
    - Checkpoint 1: Data Cleaning
    - Checkpoint 2: Funding Type Analysis
    - Checkpoint 3: Country Analysis
    - Checkpoint 4: Sector Analysis 1
    - Checkpoint 5: Sector Analysis 2

## Checkpoint 1: Data Cleaning
In this checkpoint, we're supposed to answer 5 questions, 3 of which pertain to the rounds data frame. They are:
    - How many unique companies are in rounds2?
    - Are there any companies in rounds2 that are not present in companies?
    - Merge the companies and rounds2 DFs so that all the variables in companies are added to rounds2.
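The last two questions can be approached with a single left merge. The sketch below uses made-up toy frames; the key column name `permalink` and the `name` column in companies are assumptions, and lower-casing both keys is a precaution in case the same company appears with different letter case.

```python
import pandas as pd

# Toy frames (made up); "permalink" and "name" are assumed column names for companies
rounds_toy = pd.DataFrame({
    "company_permalink": ["/organization/a", "/ORGANIZATION/B", "/organization/c"],
    "raised_amount_usd": [1.0e6, 2.0e6, 3.0e6],
})
companies_toy = pd.DataFrame({
    "permalink": ["/Organization/A", "/organization/b"],
    "name": ["A", "B"],
})

# Normalize the case of both keys so the same company always matches
rounds_toy["company_permalink"] = rounds_toy["company_permalink"].str.lower()
companies_toy["permalink"] = companies_toy["permalink"].str.lower()

# A left merge keeps every funding round; indicator=True records where each row matched
master = rounds_toy.merge(
    companies_toy, how="left",
    left_on="company_permalink", right_on="permalink", indicator=True,
)

# Rounds whose company is absent from companies
missing = master.loc[master["_merge"] == "left_only", "company_permalink"]
print(missing.tolist())
```

An empty `missing` list would mean every company in rounds2 is present in companies.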

Before this can be done, however, the data frame needs to be cleaned. This means treating the missing values, which needs to be done with future checkpoints in mind.

For most of the checkpoints, funding_round_code, the variable with the highest percentage of missing values, is not required. So, this variable can simply be ignored for the rest of the analysis.

The first question is a simple count: the number of unique companies present in the dataset. This can be determined using the permalink, which acts as an ID for each organization.

In [4]:
# number of distinct company_permalink strings exactly as they appear in the file
print(rounds.company_permalink.nunique())

# number of unique companies after unifying the case of all company permalinks
print(rounds.company_permalink.str.lower().nunique())

90247
66370


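The raw column holds 90,247 distinct permalink strings, but many of them differ only in letter case: after converting to lower case there are 66,370 unique companies. The dataset contains nearly 115K observations, which means that many companies have raised more than one funding round.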

Another quick option is to use a regex to extract the company name from company_permalink, so let's check the number of unique values that gives.

In [45]:
# extract the company name (the part after "/organization/") from the company permalink column
rounds["company_name"] = rounds.company_permalink.str.extract(r"/(?:organization|ORGANIZATION)/(.*)", expand = False)

In [46]:
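# unique names among permalinks that use the lower-case "/organization/" prefix only;
# rows with any other prefix case become NaN and are dropped, and the names keep their original case
len(rounds.company_permalink.str.extract(r"/organization/(.*)", expand = False).dropna().unique())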

45145
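For comparison, the extraction can be made insensitive to case by lower-casing the permalink before applying the regex. This is a sketch on made-up permalinks, not the notebook's result, and it assumes that two names differing only in case refer to the same company.

```python
import pandas as pd

# Toy permalinks (made up) with mixed-case prefixes and names
permalinks = pd.Series([
    "/organization/acme",
    "/ORGANIZATION/ACME",
    "/Organization/Beta-Labs",
    "/organization/beta-labs",
])

# Lower-case first, then capture everything after "/organization/"
names = permalinks.str.lower().str.extract(r"/organization/(.*)", expand=False)

print(names.nunique())
```

With this normalization the count of unique names should agree with the count of unique lower-cased permalinks, provided every permalink has the "/organization/" prefix.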