This file contains the code for cleaning the rounds2 file of the project.

## Code Style
    - Case: 
        - snake_case for objects
        - camelCase for functions and classes
    - Double quotes first, then single quotes

## Libraries used
    - Pandas
    - Numpy

## The Workflow
    - Check parsing of variables
    - Treat missing values

In [1]:
# import the required libraries
import numpy as np # req for pandas. version 1.15.0
import pandas as pd # for data wrangling. version 0.23.4



In [2]:
# read in data to rounds data frame
rounds = pd.read_csv("../../Data/rounds2.csv", sep = ",", encoding = "ISO-8859-1")
# read as ISO-8859-1 because the file contains bytes that do not decode as UTF-8
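The encoding choice can be checked rather than assumed. The following is a minimal sketch, not the notebook's own method: it uses a hypothetical one-row sample held in memory (the helper name `read_with_fallback` and the sample row are illustrative assumptions) and only falls back to ISO-8859-1 when the bytes fail to decode as UTF-8.

```python
import io
import pandas as pd

# Hypothetical one-row sample; "é" is stored as the single byte 0xE9 (ISO-8859-1).
raw = "company_permalink,raised_amount_usd\n/organization/café,1000.0\n".encode("ISO-8859-1")

def read_with_fallback(buf):
    """Read CSV bytes as UTF-8 if they decode cleanly, else as ISO-8859-1."""
    try:
        buf.decode("utf-8")
        enc = "utf-8"
    except UnicodeDecodeError:
        # 0xE9 followed by "," is not a valid UTF-8 sequence.
        enc = "ISO-8859-1"
    return pd.read_csv(io.BytesIO(buf), encoding=enc), enc

sample, used_encoding = read_with_fallback(raw)
print(used_encoding, sample.company_permalink[0])
```

Note that ISO-8859-1 maps every byte to a character, so a successful read does not prove the file was written in that encoding; it is worth spot-checking a few non-ASCII names after loading.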

In [3]:
rounds.info()  # prints its summary directly and returns None
print(rounds.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114949 entries, 0 to 114948
Data columns (total 6 columns):
company_permalink          114949 non-null object
funding_round_permalink    114949 non-null object
funding_round_type         114949 non-null object
funding_round_code         31140 non-null object
funded_at                  114949 non-null object
raised_amount_usd          94959 non-null float64
dtypes: float64(1), object(5)
memory usage: 5.3+ MB
company_permalink           object
funding_round_permalink     object
funding_round_type          object
funding_round_code          object
funded_at                   object
raised_amount_usd          float64
dtype: object


In [4]:
rounds.columns

Index(['company_permalink', 'funding_round_permalink', 'funding_round_type',
       'funding_round_code', 'funded_at', 'raised_amount_usd'],
      dtype='object')

## Some initial information about the data 

The DF contains 6 variables and 114,949 rows. Of these variables, only one is numeric (64-bit floating point). The other 5 are objects (mainly strings).

In [5]:
rounds.head(10)

Unnamed: 0,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd
0,/organization/-fame,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,B,05-01-2015,10000000.0
1,/ORGANIZATION/-QOUNTER,/funding-round/22dacff496eb7acb2b901dec1dfe5633,venture,A,14-10-2014,
2,/organization/-qounter,/funding-round/b44fbb94153f6cdef13083530bb48030,seed,,01-03-2014,700000.0
3,/ORGANIZATION/-THE-ONE-OF-THEM-INC-,/funding-round/650b8f704416801069bb178a1418776b,venture,B,30-01-2014,3406878.0
4,/organization/0-6-com,/funding-round/5727accaeaa57461bd22a9bdd945382d,venture,A,19-03-2008,2000000.0
5,/ORGANIZATION/004-TECHNOLOGIES,/funding-round/1278dd4e6a37fa4b7d7e06c21b3c1830,venture,,24-07-2014,
6,/organization/01games-technology,/funding-round/7d53696f2b4f607a2f2a8cbb83d01839,undisclosed,,01-07-2014,41250.0
7,/ORGANIZATION/0NDINE-BIOMEDICAL-INC,/funding-round/2b9d3ac293d5cdccbecff5c8cb0f327d,seed,,11-09-2009,43360.0
8,/organization/0ndine-biomedical-inc,/funding-round/954b9499724b946ad8c396a57a5f3b72,venture,,21-12-2009,719491.0
9,/ORGANIZATION/0XDATA,/funding-round/383a9bd2c04f7038bb543ccef5ba3eae,seed,,22-05-2013,3000000.0


In [6]:
# statistics about raised_amount_usd, the only numeric var
rounds.raised_amount_usd.describe()

count    9.495900e+04
mean     1.042687e+07
std      1.148212e+08
min      0.000000e+00
25%      3.225000e+05
50%      1.680511e+06
75%      7.000000e+06
max      2.127194e+10
Name: raised_amount_usd, dtype: float64

In [7]:
# number of distinct values in each variable, counting NaN as a value
rounds.nunique(dropna = False)

company_permalink           90247
funding_round_permalink    114949
funding_round_type             14
funding_round_code              9
funded_at                    5033
raised_amount_usd           22096
dtype: int64

In [8]:
# print the distinct values of each variable
# unique must be called: x.unique without parentheses prints the bound method, not the values
for col in rounds.columns:
    print(col, rounds[col].unique())

We can see that funding_round_type contains 14 unique values and funding_round_code contains 9 distinct entries (8 codes plus the missing-value marker), while funded_at is a date stored as text. Next, we'll move on to the missing values.
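Since funded_at is read in as an object column, it can be parsed into a proper datetime. This is a minimal sketch on three values copied from the head() output above; the day-month-year format is an assumption inferred from values such as 30-01-2014, which cannot be month-first.

```python
import pandas as pd

# Three funded_at values copied from the head() output above.
funded_at = pd.Series(["05-01-2015", "14-10-2014", "30-01-2014"])

# Assumption: day-month-year. An explicit format avoids silent month/day swaps
# on ambiguous values such as "05-01-2015".
parsed = pd.to_datetime(funded_at, format="%d-%m-%Y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

On the full data frame the same call would be applied to `rounds.funded_at`; passing `errors="coerce"` would turn unparseable entries into NaT instead of raising.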

In [9]:
# column wise percentage of missing values sorted in descending order
rounds.isnull().mean().sort_values(ascending = False)

funding_round_code         0.729097
raised_amount_usd          0.173903
funded_at                  0.000000
funding_round_type         0.000000
funding_round_permalink    0.000000
company_permalink          0.000000
dtype: float64

funding_round_code has the highest share of missing values (about 73%), while raised_amount_usd is missing in about 17% of the rows.

# Checkpoints
    - Checkpoint 1: Data Cleaning
    - Checkpoint 2: Funding Type Analysis
    - Checkpoint 3: Country Analysis
    - Checkpoint 4: Sector Analysis 1
    - Checkpoint 5: Sector Analysis 2

## Checkpoint 1: Data Cleaning
In this checkpoint, we're supposed to answer 5 questions, and 3 of them pertain to the rounds data frame. They are:
    - How many unique companies are in rounds2?
    - Are there any companies in rounds2 that are not present in companies?
    - Merge the companies and rounds2 DFs so that all the variables in companies are added to rounds2.

Before this can be done, however, the data frame needs to be cleaned. This means handling missing values, and how they are handled needs to take future checkpoints into account.

For most of the checkpoints, funding_round_code, the variable with the highest percentage of missing values, is not required. So this variable can simply be ignored for the rest of the analysis.
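Ignoring a sparse column can be made explicit by dropping it. The sketch below works on a hypothetical three-row stand-in for the rounds data frame (the name `toy` and its values are assumptions); it picks the column with the largest share of missing values, which in the missing-value table above is funding_round_code.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the rounds data frame.
toy = pd.DataFrame({
    "funding_round_type": ["venture", "seed", "venture"],
    "funding_round_code": ["B", np.nan, np.nan],
    "raised_amount_usd": [10000000.0, 700000.0, np.nan],
})

# Share of missing values per column, then drop the sparsest column.
missing_share = toy.isnull().mean()
sparsest = missing_share.idxmax()
toy_clean = toy.drop(columns = [sparsest])

print(sparsest, list(toy_clean.columns))
```

Dropping by name (`rounds.drop(columns = ["funding_round_code"])`) is the more readable choice once the column has been identified; `idxmax` is used here only to show how it was found.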

The first question is a simple count: the number of unique companies in the dataset. This can be determined from the permalink, which acts as an id for each organization.

In [10]:
# number of unique companies in the dataset
len(rounds.company_permalink.unique())

90247

There are 90,247 unique companies in this dataset, which contains nearly 115K observations. This means that some companies have had more than one funding round.
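One caveat on this count: the head() output shows the same permalink in both upper and lower case (for example /ORGANIZATION/-QOUNTER and /organization/-qounter), and a plain unique count is case-sensitive. The sketch below, on five permalinks copied from that output, compares the two counts and tallies rounds per company. Treating differently-cased permalinks as the same company is an assumption, not something the data confirms.

```python
import pandas as pd

# Permalinks copied from the head() output; mixed case as in the data.
permalinks = pd.Series([
    "/organization/-fame",
    "/ORGANIZATION/-QOUNTER",
    "/organization/-qounter",
    "/ORGANIZATION/0NDINE-BIOMEDICAL-INC",
    "/organization/0ndine-biomedical-inc",
])

raw_count = permalinks.nunique()          # case-sensitive count
normalized = permalinks.str.lower()       # assumption: case differences are noise
normalized_count = normalized.nunique()   # case-insensitive count

# Number of funding rounds recorded per (normalized) company.
rounds_per_company = normalized.value_counts()

print(raw_count, normalized_count)
print(rounds_per_company.max())
```

If the normalized count on the full data turns out lower than 90,247, the permalinks should be lower-cased before counting or merging.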

To answer the second question, we need the companies dataset. Let's not get into the details now. 
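For orientation, the second and third questions can both be approached with a single left merge. This is only a sketch on hypothetical toy frames: the companies column names `permalink` and `name`, and all row values, are assumptions, since the companies dataset is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for rounds2.
rounds_toy = pd.DataFrame({
    "company_permalink": ["/organization/a", "/organization/b", "/organization/c"],
    "raised_amount_usd": [10000000.0, 3000000.0, 50000.0],
})

# Hypothetical stand-in for companies; `permalink` and `name` are assumed column names.
companies_toy = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b"],
    "name": ["Company A", "Company B"],
})

# A left merge keeps every round; indicator=True adds a `_merge` column that
# marks rows with no match in companies as "left_only".
merged = rounds_toy.merge(
    companies_toy,
    how = "left",
    left_on = "company_permalink",
    right_on = "permalink",
    indicator = True,
)

not_in_companies = merged.loc[merged["_merge"] == "left_only", "company_permalink"].tolist()
print(not_in_companies)
```

An empty `not_in_companies` list would mean every company in rounds2 is present in companies.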

The third and probably the most important variable for this analysis is the amount of money raised. Here's the description of this variable.

In [11]:
rounds.raised_amount_usd.describe()

count    9.495900e+04
mean     1.042687e+07
std      1.148212e+08
min      0.000000e+00
25%      3.225000e+05
50%      1.680511e+06
75%      7.000000e+06
max      2.127194e+10
Name: raised_amount_usd, dtype: float64

To choose an imputation method, compare the distance from the center to the max with the distance from the center to the min, using first the mean and then the median as the center.

In [53]:
raised_amt_median = rounds.raised_amount_usd.median()
raised_amt_min = rounds.raised_amount_usd.min()
raised_amt_max = rounds.raised_amount_usd.max()
raised_amt_mean = rounds.raised_amount_usd.mean()

In [55]:
print((raised_amt_max - raised_amt_mean) / (raised_amt_mean - raised_amt_min))
print((raised_amt_max - raised_amt_median) / (raised_amt_median - raised_amt_min))

2039.1075641766872
12657.015924918076


The mean is far from the center of the range of raised_amount_usd: the distance from the mean to the max is about 2,000 times the distance from the min to the mean. The median is even further off center, with a ratio of about 12,657. Both ratios reflect a strongly right-skewed distribution. Since the mean sits closer to the middle of the range, we impute using the mean.
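The effect of the two imputation choices can be seen on a small example. This is a sketch on a hypothetical right-skewed series (the values are made up for illustration and are not taken from rounds2); it fills the missing entries once with the mean and once with the median so the results can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed amounts with two missing entries.
amounts = pd.Series([100.0, 200.0, 300.0, np.nan, 400.0, 9000.0, np.nan])

mean_val = amounts.mean()       # NaN is skipped; pulled up by the single large value
median_val = amounts.median()   # NaN is skipped; unaffected by how large the max is

filled_mean = amounts.fillna(mean_val)
filled_median = amounts.fillna(median_val)

print(mean_val, median_val)
print(filled_mean.mean(), filled_median.mean())
```

Filling with the mean leaves the column mean unchanged, while filling with the median pulls it down. On the real data the equivalent call would be `rounds.raised_amount_usd.fillna(raised_amt_mean)`.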