This file contains the code for cleaning the rounds2 file of the project.

## Code Style
    - Case: 
        - snake_case for objects
        - camelCase for functions and classes
    - Double quotes first, then single quotes

## Libraries used
    - Pandas
    - Numpy

## Objectives of Analysis
Identify the most heavily invested main sectors in each of the three countries (for funding type FT and an investment range of 5-15 M USD).

Business objective: Identify the best: a. Sectors; b. Countries; c. Investment rounds for Spark Funds.

[This means that we need to focus on just a few variables]

## Metric
Mean amount of money invested in a particular country. 
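A minimal sketch of how this metric could be computed with pandas. The column names `country_code` and `raised_amount_usd` come from the two datasets used in this notebook; the values and the frame name `toy` are invented for illustration only.

```python
import pandas as pd

# Toy data (invented values) with the two columns the metric needs.
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "IND", "GBR", "IND"],
    "raised_amount_usd": [10e6, 6e6, 5e6, 8e6, None],
})

# Mean raised amount per country; mean() skips missing values by default.
mean_by_country = toy.groupby("country_code")["raised_amount_usd"].mean()
print(mean_by_country)
```

Because missing amounts are skipped rather than counted as zero, how the missing values in `raised_amount_usd` are treated directly affects this metric.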

## The Workflow
The workflow for this analysis is simple: answer the questions asked in the checkpoints. Accordingly, the code in this .ipynb is organized by checkpoint, with a clear heading marking the start of each checkpoint and question.

In [1]:
# importing dependencies
# numpy
import numpy as np # version: 1.15.0

# pandas
import pandas as pd # version: 0.23.4



# Checkpoint 1: Data Cleaning
There are five tasks in this checkpoint:
    - Number of unique companies in rounds2.csv
    - Number of unique companies in companies.tsv
    - Key column from the companies dataset that can be used to merge it with rounds data
    - Organizations in rounds2 that are missing from companies
    - Merge the two datasets

## Importing the data
The first step of the analysis is to import the two main datasets that we will be needing for the analysis: companies and rounds. 

In [31]:
# import companies.tsv as companies
companies = pd.read_csv("../../Data/companies.tsv", sep = "\t", encoding = "ISO-8859-1") 

# import rounds2.csv as rounds
rounds = pd.read_csv("../../Data/rounds2.csv", sep = ",", encoding = "ISO-8859-1")
# ISO-8859-1 is used because some characters in the files are not valid UTF-8

In [10]:
#information of the companies dataset
print(companies.info()); print("shape of dataset: ", companies.shape); print("variable dtypes:\n", companies.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66368 entries, 0 to 66367
Data columns (total 10 columns):
permalink        66368 non-null object
name             66367 non-null object
homepage_url     61310 non-null object
category_list    63220 non-null object
status           66368 non-null object
country_code     59410 non-null object
state_code       57821 non-null object
region           58338 non-null object
city             58340 non-null object
founded_at       51147 non-null object
dtypes: object(10)
memory usage: 5.1+ MB
None
shape of dataset:  (66368, 10)
variable dtypes:
 permalink        object
name             object
homepage_url     object
category_list    object
status           object
country_code     object
state_code       object
region           object
city             object
founded_at       object
dtype: object


In [11]:
# information about the rounds dataset
print(rounds.info()); print("shape of dataset: ", rounds.shape); print("variable dtypes:\n", rounds.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114949 entries, 0 to 114948
Data columns (total 6 columns):
company_permalink          114949 non-null object
funding_round_permalink    114949 non-null object
funding_round_type         114949 non-null object
funding_round_code         31140 non-null object
funded_at                  114949 non-null object
raised_amount_usd          94959 non-null float64
dtypes: float64(1), object(5)
memory usage: 5.3+ MB
None
shape of dataset:  (114949, 6)
variable dtypes:
 company_permalink           object
funding_round_permalink     object
funding_round_type          object
funding_round_code          object
funded_at                   object
raised_amount_usd          float64
dtype: object


Since we will be focusing mostly on four variables only, let's remove all the extraneous variables from both the datasets. 
We'll remove from the companies dataset the following variables:
    - state_code
    - region
    - city
    - homepage_url
    - founded_at
    - name

In [32]:
# removing unnecessary columns from companies
companies.drop(["state_code", "region", "city", "homepage_url", "founded_at", "name"], axis = 1, inplace = True)

We'll remove the following from rounds:
    - funded_at
    - funding_round_code
    - funding_round_permalink

In [33]:
# removing unnecessary columns from rounds
rounds.drop(["funded_at", "funding_round_code", "funding_round_permalink"], axis = 1, inplace = True)

## Checkpoint 1 Q1: Number of unique companies in rounds
To do this, we'll use the company_permalink column. However, instead of doing this directly, we'll first convert the company_permalink to lowercase and then determine the number of unique records.

In [18]:
# converting company_permalink to lower case and getting number of unique records.
rounds.company_permalink.str.lower().nunique()

66370

There seem to be 66370 unique companies in rounds. Since rounds has 114949 rows, this means that some companies had more than one round of funding. [Important Observation]
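To illustrate why the permalinks are lower-cased before counting, here is a small sketch on invented data. The permalinks and the name `toy_rounds` are hypothetical and not taken from rounds2.csv.

```python
import pandas as pd

# Invented permalinks: the first two differ only in letter case.
toy_rounds = pd.DataFrame({
    "company_permalink": [
        "/organization/acme",
        "/ORGANIZATION/ACME",
        "/organization/globex",
    ]
})

# Counting directly treats the two spellings of the same company as distinct.
raw_count = toy_rounds["company_permalink"].nunique()

# Lower-casing first collapses them into one company.
lower_count = toy_rounds["company_permalink"].str.lower().nunique()

print(raw_count, lower_count)
```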

## Checkpoint 1 Q2: Number of unique companies in companies
This time, we'll use the permalink, which is supposed to be the UID of a company. As with rounds.company_permalink, we'll first convert to lower case and then proceed to count the number of unique records.

In [19]:
companies.permalink.str.lower().nunique()

66368

There seems to be a discrepancy between the number of unique records in companies and rounds. Does this mean that there are at least 2 companies in rounds that are not present in companies?

## Checkpoint 1 Q3: Key column to merge companies and rounds
This is pretty easy. From the data dictionary we know that companies.permalink and rounds.company_permalink are the UIDs of each company in the two datasets. So, we'll use companies.permalink as the key to merge with rounds.

## Checkpoint 1 Q4: Mismatches between rounds and companies
Ok. Now, we're required to find out if there are any records that are unique to rounds only. That is, organizations that are present in rounds but not in companies.

We can do this by merging on companies.permalink and rounds.company_permalink. But, we'll take a slightly different approach here. 

First off, we'll create a new column in each dataset: company_name in rounds and name in companies. Each holds the lower-cased last segment of the permalink. Then, we'll check which values of rounds.company_name do not appear in companies.name. If there are any, then there are companies which are unique to rounds only.

In [34]:
# creating companies.name
# raw string avoids invalid escape sequences; expand = False returns a Series
companies["name"] = companies.permalink.str.lower().str.extract(r"/[A-Za-z]*/(.*)", expand = False)

In [35]:
# creating rounds.company_name
# raw string avoids invalid escape sequences; expand = False returns a Series
rounds["company_name"] = rounds.company_permalink.str.lower().str.extract(r"/[A-Za-z]*/(.*)", expand = False)

In [57]:
# checking if there are any unique records.
rounds.company_name[~rounds.company_name.isin(companies.name)].dropna()

29597                                 e-cãbica
31863            energystone-games-çµç³æ¸¸æ
45176                    huizuche-com-æ ç§ÿè½¦
58473                  magnet-tech-ç£ç³ç§æ
101036    tipcat-interactive-æ²èÿä¿¡æ¯ç§æ
109969                 weiche-tech-åè½¦ç§æ
113839                     zengame-ç¦æ¸¸ç§æ
Name: company_name, dtype: object

(Look at this stackoverflow answer: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe for the full explanation of the code used above.)
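The merge-based alternative mentioned at the start of this question can be sketched as follows, using the `indicator` parameter of `DataFrame.merge`. The data and the names `toy_companies`, `toy_rounds`, `checked` and `only_in_rounds` are invented for illustration; this is not the approach used in the cells above.

```python
import pandas as pd

# Invented data: "initech" appears in rounds but not in companies.
toy_companies = pd.DataFrame({
    "permalink": ["/organization/acme", "/organization/globex"],
})
toy_rounds = pd.DataFrame({
    "company_permalink": [
        "/organization/acme",
        "/organization/initech",
        "/organization/acme",
    ],
})

# A left merge keeps every row of rounds; indicator=True adds a "_merge"
# column that says whether a match was found in companies.
checked = toy_rounds.merge(
    toy_companies,
    how="left",
    left_on="company_permalink",
    right_on="permalink",
    indicator=True,
)

# Rows marked "left_only" exist in rounds but have no match in companies.
only_in_rounds = checked.loc[checked["_merge"] == "left_only", "company_permalink"]
print(only_in_rounds.tolist())
```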

So, there seem to be 7 companies that are in rounds but not in companies, and the answer to the fourth question is yes. Note, however, that all seven names contain mis-decoded non-ASCII characters, so the mismatch may be an encoding artifact rather than genuinely missing companies.
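The seven names listed in the output contain garbled non-ASCII characters. The sketch below shows how such garbling can arise when UTF-8 bytes are decoded as ISO-8859-1, and one possible way to make the two spellings comparable by dropping non-ASCII characters. The example permalink and the helper `ascii_only` are hypothetical, and it is an assumption that this is what happened in the source files.

```python
import pandas as pd

# Hypothetical permalink containing non-ASCII characters.
original = "/organization/magnet-tech-磁石科技"

# Decoding UTF-8 bytes as ISO-8859-1 yields a different, garbled string.
garbled = original.encode("utf-8").decode("ISO-8859-1")


def ascii_only(series):
    """Drop every non-ASCII character from a Series of strings."""
    return series.str.encode("utf-8").str.decode("ascii", errors="ignore")


names = pd.Series([original, garbled])
cleaned = ascii_only(names)

# After cleaning, both spellings reduce to the same ASCII prefix.
print(cleaned.nunique())
```

This normalization is lossy: two different companies whose names differ only in their non-ASCII part would become identical, so any matches it produces should be reviewed.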

## Checkpoint 1 Q5: Merge the two dataframes
This is the basis of all our analysis. Merging the two DataFrames will give us a single data frame which contains all the data needed. After this step, we can finally start treating the missing values.
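A minimal sketch of such a merge on invented data, assuming a left merge that keeps every funding round and lower-cases both key columns first. The frame names (`toy_companies`, `toy_rounds`, `merged`) and the choice of `how="left"` are assumptions for illustration, not necessarily what the notebook does next.

```python
import pandas as pd

# Invented data; "initech" has a funding round but no companies record.
toy_companies = pd.DataFrame({
    "permalink": ["/organization/acme", "/organization/globex"],
    "country_code": ["USA", "IND"],
})
toy_rounds = pd.DataFrame({
    "company_permalink": [
        "/Organization/Acme",
        "/organization/globex",
        "/organization/initech",
    ],
    "raised_amount_usd": [10e6, 5e6, 1e6],
})

# Normalize the case of both keys so that equal permalinks actually match.
toy_companies["permalink"] = toy_companies["permalink"].str.lower()
toy_rounds["company_permalink"] = toy_rounds["company_permalink"].str.lower()

# Left merge: one output row per funding round, company columns attached
# where a match exists and left as NaN otherwise.
merged = toy_rounds.merge(
    toy_companies,
    how="left",
    left_on="company_permalink",
    right_on="permalink",
)
print(merged)
```

With a left merge, unmatched rounds show up as rows with missing company columns, which makes them easy to find during missing-value treatment.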