This file contains the code for cleaning the rounds2 file of the project.

## Code Style
    - Case: 
        - snake_case for objects
        - camelCase for functions and classes
    - Double quotes first, then single quotes

## Libraries used
    - Pandas
    - Numpy

## Objectives of Analysis
Identify the most heavily invested main sectors in each of the three countries (for funding type FT and an investment range of 5-15 M USD).

Business objective: Identify the best: a. Sectors; b. Countries; c. Investment rounds for Spark Funds.

[This means that we need to focus on just a few variables]

## Metric
Mean amount of money invested in a particular country. 

## The Workflow
The workflow for this analysis is rather simple. Focus on answering the questions asked in the checkpoints. Following this flow, the code in this .ipynb is organized according to the checkpoints. There will be a clear heading indicating the starting and ending of each checkpoint and question.

In [None]:
# importing dependencies
# numpy
import numpy as np # version: 1.15.0

# pandas
import pandas as pd # version: 0.23.4

# Checkpoint 1: Data Cleaning
There are five tasks in this checkpoint:
    - Number of unique companies in rounds2.csv
	- Number of unique companies in companies.tsv
	- Key column from the companies dataset that can be used to merge it with rounds data
	- Organizations in companies that are missing in rounds2.
    - Merge the two datasets.

## Importing the data
The first step of the analysis is to import the two main datasets that we will be needing for the analysis: companies and rounds. 

In [None]:
# import companies.tsv as companies
companies = pd.read_csv("../../Data/companies.tsv", sep = "\t", encoding = "ISO-8859-1") 

# import rounds2.csv as rounds
rounds = pd.read_csv("../../Data/rounds2.csv", sep = ",", encoding = "ISO-8859-1")
# ISO-8859-1 is used because the files do not decode cleanly as UTF-8

In [None]:
# information about the companies dataset
companies.info()  # info() prints its summary itself and returns None
print("shape of dataset: ", companies.shape)
print("variable dtypes:\n", companies.dtypes)

In [None]:
# information about the rounds dataset
rounds.info()  # info() prints its summary itself and returns None
print("shape of dataset: ", rounds.shape)
print("variable dtypes:\n", rounds.dtypes)

Since we will be focusing mostly on four variables only, let's remove all the extraneous variables from both the datasets. 
We'll remove from the companies dataset the following variables:
    - state_code
    - region
    - city
    - homepage_url
    - founded_at
    - name

In [None]:
# removing unnecessary columns from companies
companies.drop(["state_code", "region", "city", "homepage_url", "founded_at", "name"], axis = 1, inplace = True)

We'll remove the following from rounds:
    - funded_at
    - funding_round_code
    - funding_round_permalink

In [None]:
# removing unnecessary columns from rounds
rounds.drop(["funded_at", "funding_round_code", "funding_round_permalink"], axis = 1, inplace = True)

## Checkpoint 1 Q1: Number of unique companies in rounds
To do this, we'll use the company_permalink column. However, instead of doing this directly, we'll first convert the company_permalink to lowercase and then determine the number of unique records.

In [None]:
# converting company_permalink to lower case and getting number of unique records.
rounds.company_permalink.str.lower().nunique()

There seem to be 66370 unique companies in the dataset, far fewer than the number of rows in rounds. This means that some companies had more than one round of funding. [Important Observation]
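
To see why the permalinks are lower-cased before counting, here is a minimal sketch on made-up data. The permalinks below are hypothetical and are not taken from rounds2.csv; the point is only that two spellings differing in letter case would otherwise be counted as two companies.

```python
import pandas as pd

# Hypothetical permalinks: the first two differ only by letter case
toy_rounds = pd.DataFrame({
    "company_permalink": [
        "/organization/acme",
        "/ORGANIZATION/ACME",
        "/organization/globex",
    ]
})

# counting without and with case unification
raw_count = toy_rounds["company_permalink"].nunique()
lower_count = toy_rounds["company_permalink"].str.lower().nunique()

print(raw_count, lower_count)
```

On this toy frame the raw count is one higher than the case-unified count, which is the kind of over-counting the lower-casing step guards against.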

## Checkpoint 1 Q2: Number of unique companies in companies
This time, we'll use the permalink, which is supposed to be the UID of a company. As with rounds.company_permalink, we'll first convert to lower case and then proceed to count the number of unique records.

In [None]:
companies.permalink.str.lower().nunique()

There seems to be a discrepancy between the number of unique records in companies and rounds. Does this mean that there are at least 2 companies in rounds that are not present in companies?

## Checkpoint 1 Q3: Key column to merge companies and rounds
This is pretty easy. From the data dictionary we know that companies.permalink and rounds.company_permalink are the UIDs of each company in the dataset. So, we'll use companies.permalink as the key to merge with rounds.
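
Before relying on a column as a merge key, it can help to confirm that it really is unique on the companies side. The sketch below uses two hypothetical miniature tables (not the project data) and the `validate` argument of `pd.merge`, which raises an error if the right-hand key turns out to have duplicates.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_companies = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b"],
    "country_code": ["USA", "IND"],
})
toy_rounds = pd.DataFrame({
    "company_permalink": ["/organization/a", "/organization/a", "/organization/b"],
    "raised_amount_usd": [1.0, 2.0, 3.0],
})

# a key column should have no repeated values
key_is_unique = toy_companies["permalink"].is_unique

# validate="many_to_one" makes pandas check that the right-hand key is unique
merged = pd.merge(
    toy_rounds,
    toy_companies,
    how="inner",
    left_on="company_permalink",
    right_on="permalink",
    validate="many_to_one",
)

print(key_is_unique, merged.shape)
```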

## Checkpoint 1 Q4: Mismatches between rounds and companies
Ok. Now, we're required to find out if there are any records that are unique to rounds only. That is, organizations that are present in rounds but not in companies.

We can do this by merging on companies.permalink and rounds.company_permalink. But, we'll take a slightly different approach here. 

First off, we'll create two new columns in rounds and companies called company_name and name resp. Then, we'll merge based on those columns and check for missing values. If there are missing values, then there are companies which are unique to rounds only.

### Update
So, I've found out that using a case-unified form of the permalinks produces the same result as using the names. Thus, to avoid creating unnecessary variables and excess storage consumption, I'll hold off on creating those extra columns and use the lower-case permalinks themselves.

In [None]:
# converting rounds.company_permalink and companies.permalink to lower case
companies["permalink"] = companies.permalink.str.lower()

rounds["company_permalink"] = rounds.company_permalink.str.lower()

In [None]:
# checking whether any companies in rounds are missing from companies
print(rounds.company_permalink[~rounds.company_permalink.isin(companies.permalink)].dropna())

(Look at this stackoverflow answer: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe for the full explanation of the code used above.)

There seem to be 7 companies that are in rounds but not in companies. So, the answer to the fourth question is yes. There are organizations that are present in rounds but not in companies.
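
The merge-based approach mentioned earlier in this question can be sketched as follows. This is an alternative to the `isin` check, shown on hypothetical toy tables; `indicator=True` adds a `_merge` column that says which side each row came from.

```python
import pandas as pd

# Hypothetical toy tables: "/organization/z" exists only in rounds
toy_rounds = pd.DataFrame({
    "company_permalink": ["/organization/a", "/organization/b", "/organization/z"]
})
toy_companies = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b"]
})

# left join keeps every round; _merge flags rows without a match in companies
checked = pd.merge(
    toy_rounds,
    toy_companies,
    how="left",
    left_on="company_permalink",
    right_on="permalink",
    indicator=True,
)
rounds_only = checked.loc[checked["_merge"] == "left_only", "company_permalink"]

print(rounds_only.tolist())
```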

## Checkpoint 1 Q5: Merge the two dataframes
This is the basis of all our analysis. Merging the two DataFrames will give us a single data frame which contains all the data needed. After this step, we can finally start treating the missing values.

So, after some thought it was decided unanimously that the best approach would be to drop all missing values from the raised_amount_usd column. Therefore, that's what I'll do.

In [None]:
# dropping every row of rounds that has a missing value in any of the remaining columns
rounds.dropna(inplace = True)

In [None]:
rounds.shape # new shape of rounds: (94959, 3)

Dropping those missing values leaves us with 95K observations. So, we lost about 20K observations. Not a big problem though. The next step is to combine the two datasets.

In [None]:
master = pd.merge(rounds, companies, how = "inner", left_on = "company_permalink", 
                  right_on = "permalink")

So, performing this merge resulted in a DataFrame with 94958 rows and 7 columns.

In [None]:
# checking for missing values
master.isnull().mean().sort_values(ascending = False)

There are two variables with missing values: country_code (an important variable) has 6% values missing and category_list has 1% values missing. 

The next thing to do is to check for duplicates. If there are duplicates, just drop them.

In [None]:
# checking for duplicates in the master DataFrame
master.duplicated().sum() # there seem to be 1311 duplicated rows.

In [None]:
master.drop_duplicates(keep = "first", inplace = True)

So, dropping duplicates resulted in the loss of 1311 rows (94958 - 1311 = 93647). We're now left with about 93.6K rows. This is about 81% of the initial observations. I don't think that deleting any more values is a good idea. This would result in the loss of more data, bringing the total down even more. 

Still, if we can retain more than 75% of the rows, we will have pretty reliable results.

### Treating the remaining missing values
The remaining missing values are both present in Series of dtype object. This means that we can do something cool. We can treat the missing value itself as a new category. This is sweet. This means that we don't need to lose any more data. So, let's go ahead and do that. What we'll do is replace the NaN's with a new sentinel value that tells us if we know the country_code of an organization or not.

In [None]:
# replacing missing values in country_code with new sentinel value
# (assigning the result back avoids relying on an in-place fill of a column selection)
master["country_code"] = master["country_code"].fillna("unknown")

In [None]:
# on running the following code, we see that the missing value percentage has gone down
master.isnull().mean().sort_values(ascending = False)

Now only category_list still has missing values, in about 1% of its rows. 1% of 93647 is about 940 rows. We can go ahead and delete these rows. We'll still be left with 80.5% of the total observations.

In [None]:
master.dropna(inplace = True) # Now that we have this DataFrame, we can finally start with the analysis.

# Checkpoint 2: Funding Type Analysis

In [None]:
# checking the information about the master table after performing all the operations
master.info()
print(master.shape)

-  ### Subtask 2.1: Change the unit of 'raised_amount_usd' column

Convert the unit of the `raised_amount_usd` from `$` to `million $`.

In [None]:
# code for unit conversion here
master["raised_amount_usd"] = master["raised_amount_usd"] / 1000000

-  ### Subtask 2.2: Calculate the average investment amount for each of the four funding types (venture, angel, seed, and private equity) 

In [None]:
master.groupby("funding_round_type").raised_amount_usd.mean().sort_values()

### Table 2.1: Average Values of Investments for Each of these Funding Types 
| Funding type | Average funding amount (million USD) |
| --- | --- |
| venture | 11.682595 |
| angel | 0.961039 |
| seed | 0.721313 |
| private equity | 73.350449 |

Considering that Spark Funds wants to invest between 5 and 15 million USD per investment round, which investment type is the most suitable for it?

venture

# Checkpoint 3: Country Analysis

 -  ### Subtask 3.1: Find the top nine countries with highest total funding for the given investment = "venture"

    -- Spark Funds wants to see the top nine countries which have received the highest total funding (across ALL sectors for the chosen investment type)

    -- For the chosen investment type, make a data frame named top9 with the top nine countries (based on the total investment amount each country has received)
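
As a rough illustration of this subtask, the sketch below totals venture funding per country on a hypothetical toy frame and keeps the largest totals with `Series.nlargest`. The numbers are invented, and `nlargest(2)` stands in for the top nine on the real data.

```python
import pandas as pd

# Hypothetical toy data; amounts are in million USD
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "GBR", "IND", "CHN", "CHN"],
    "funding_round_type": ["venture", "seed", "venture", "venture", "venture", "venture"],
    "raised_amount_usd": [10.0, 1.0, 4.0, 3.0, 5.0, 2.0],
})

# total venture funding per country
venture_totals = (
    toy.loc[toy["funding_round_type"] == "venture"]
    .groupby("country_code")["raised_amount_usd"]
    .sum()
)

# nlargest(n) sorts in descending order and keeps the first n entries
top2 = venture_totals.nlargest(2)

print(top2)
```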

In [None]:
#groupby command here to find the country-wise total funding for the investment type 
master_grpby_country= master.loc[master.funding_round_type == "venture", :].groupby("country_code")

In [None]:
#code to find the top nine countries with highest total funding
top9 = master_grpby_country["raised_amount_usd"].sum().sort_values(ascending=False)[:9]

In [None]:
top9

-  ### Subtask 3.2: Identify the top three English-speaking countries in the data frame top9.

### Table 3.1: Analysing the Top 3 English-Speaking Countries
Based on the list of countries where English is an official language - the top three English-speaking countries are:

 1. Top English-speaking country:    USA (United States)             
 2. Second English-speaking country: GBR (United Kingdom)
 3. Third English-speaking country:  IND (India)

# Checkpoint 4: Sector Analysis 1
So, let's start off with the sector analysis. Here's what needs to be done in this part of the analysis.
    - Extract the primary sector of each category list from the category_list column.
    - Use the mapping file 'mapping.csv' to map each primary sector to one of the eight main sectors (Note that ‘Others’ is also considered one of the main sectors)

The primary sector is the string that appears before the first pipe "|" in the category_list variable. So, let's get on with it.
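
As a small illustration of that rule, here is a sketch on a hypothetical three-row frame. The category strings are made up; the last row shows that a missing category_list simply stays missing.

```python
import pandas as pd

# Hypothetical category lists, including a single-category row and a missing one
toy = pd.DataFrame({
    "category_list": ["Media|News|Publishing", "Biotechnology", None]
})

# split on the pipe and keep the first piece
toy["primary_sector"] = toy["category_list"].str.split("|").str.get(0)

print(toy)
```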

## Extracting the primary sector from category_list
First off, let's start by getting a good look at the dataframe. After that, we'll create a new variable called primary_sector to store the primary sector of each organization in the dataframe.

In [None]:
master.info()
print(master.shape)

In [None]:
# creating the primary sector
master["primary_sector"] = master.category_list.str.split("|").str.get(0)

Now that we have the primary sector, the next to-do item on our list is to map each of those primary sectors to a main sector. For that, we need the mapping data. We'll load that up and proceed to map the primary sectors to a main sector.

In [None]:
# loading mapping.csv as mapping
mapping = pd.read_csv("../../Data/mapping.csv", sep = ",", encoding = "ISO-8859-1")

In [None]:
# some basic information about mapping
mapping.info()

In [None]:
mapping.head()

OK. So, this mapping file has turned out to be a one-to-one sparse matrix that maps each primary sector to one of the eight main sectors. What I have to do now is get a main sector for each primary sector. I'm not aware of any native pandas methods that enable me to do this. My first impulse is to write a function which will allow me to do just this. Then there's relational algebra which can produce results really quickly. I'll try out my function first.

As it turns out, this operation is what Hadley Wickham calls gathering the columns. (read more about it here: https://r4ds.had.co.nz/tidy-data.html).

The main task here is this: convert the wide (and sparse) representation of the mapping into a long representation. This means bringing all the columns under one roof. It's way easier to demonstrate than to explain. 

Luckily, pandas does provide a function to do this: pd.melt() (just like reshape2::melt() from R). Check the docs to understand how awesome this function is. (For a tutorial on pd.melt() go here: https://www.ibm.com/developerworks/community/blogs/jfp/entry/Tidy_Data_In_Python?lang=en)
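
Here is a minimal sketch of the wide-to-long conversion on a hypothetical three-row mapping. The category names and their sector assignments below are invented for illustration and may not match mapping.csv.

```python
import pandas as pd

# Hypothetical wide mapping: one row per category, one 0/1 column per main sector
wide = pd.DataFrame({
    "category_list": ["3D", "Analytics", "Biotechnology"],
    "Manufacturing": [1, 0, 0],
    "Health": [0, 0, 1],
    "Social, Finance, Analytics, Advertising": [0, 1, 0],
})

# gather the sector columns into (main_sector, yes_no) pairs
long = pd.melt(wide, id_vars=["category_list"], var_name="main_sector", value_name="yes_no")

# keep only the pairs that are switched on
tidy = long.loc[long["yes_no"] == 1, ["category_list", "main_sector"]].reset_index(drop=True)

print(tidy)
```

The long frame has one row per (category, sector) pair, and filtering on the flag leaves exactly one main sector per category.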

In [None]:
# dropping the Blanks column since this just serves as a flag to identify NaN's
mapping.drop("Blanks", axis = 1, inplace = True)

# dropping the single NaN at the head of the dataset
mapping.dropna(axis = 0, inplace = True)

In [None]:
# gathering all the columns under one roof
mapping_long = pd.melt(mapping, id_vars = ["category_list"], var_name = "main_sector", value_name = "yes_no")

# tidying mapping_long to produce the final version of the mapping dataset
mapping_tidy = mapping_long.loc[mapping_long.yes_no == 1, ["category_list", "main_sector"]]

To finish up, let's take one look at the mapping_tidy dataset just to make sure everything is alright

In [None]:
mapping_tidy.info()
print(mapping_tidy.isnull().mean()) # no null values. We can proceed to merge the two datasets!

Now, all that's left is to merge the two datasets and perform a series of checks. Here are the final steps to finish up checkpoint 4.
1. Merge master and mapping_tidy
2. Check for: a. Missing values; b. Duplicates
3. Treat missing values and drop duplicates

## Issue with merging
Now, while merging the two datasets, master and mapping_tidy, one of two types of joins can be used: either the inner or the outer. There are the left and right joins, but we'll leave them out of the picture for now. 

If the inner join is used during the merge, there is a loss of data. There are some sub-sectors that are present in master.primary_sector but aren't present in mapping_tidy.category_list. 

If the outer join is used during the merge, NULL values are inserted into the dataset. Again, dropping these is just equivalent to using the inner join. One other way to treat them is to manually assign them to one of the 8 main sectors. 
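
The trade-off between the two joins can be seen on a toy example. The frames below are hypothetical stand-ins for master and mapping_tidy, with only one of three primary sectors present in the mapping.

```python
import pandas as pd

# Hypothetical stand-ins: only "Biotechnology" has a mapping entry
toy_master = pd.DataFrame({"primary_sector": ["Analytics", "Biotechnology", "Cannabis"]})
toy_mapping = pd.DataFrame({
    "category_list": ["Biotechnology"],
    "main_sector": ["Cleantech / Semiconductors"],
})

# inner join: unmatched rows disappear
inner = pd.merge(toy_master, toy_mapping, how="inner",
                 left_on="primary_sector", right_on="category_list")

# outer join: unmatched rows stay, with main_sector left missing
outer = pd.merge(toy_master, toy_mapping, how="outer",
                 left_on="primary_sector", right_on="category_list")

n_missing = int(outer["main_sector"].isnull().sum())
print(len(inner), len(outer), n_missing)
```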

What do you guys think?

In [None]:
# primary sectors present in master.primary_sector but not present in mapping_tidy.category_list
master.primary_sector[~master.primary_sector.isin(mapping_tidy.category_list)].dropna().unique()

From the result of the code above, it's clear that there are 89 sub-sectors that are not included in the mapping dataset. So, if we're manually imputing the values, it would mean the addition of at most 89 lines of code. This would preserve the data, but decrease the readability of the file quite a bit. Since we're graded on code readability quite a bit, what do you think we should do?

To deal with this merging issue, here's what I've decided. By observation, I've found out that the following primary_sectors constitute about 90% of the missing values in the dataset. So, I've decided to place these primary_sectors into one of the 8 main_sectors, thereby significantly reducing the data loss that would occur if we dropped the missing data. 

These are primary sectors:
Social, Analytics, Finance, Advertising:
    - Analytics 
    - Finance 
    - Financial Services 
    - Finance Technology 
    - Business Analytics 
    - Big Data Analytics 
    - Investment Management 
    - Social Media Advertising 
    - Personal Finance 
    - Predictive Analytics 
    - Financial Exchanges
    - Mobile Analytics 
    - Social Media Management
    - Promotional
Cleantech / Semiconductors:
    - Waste Management
    - Natural Gas Uses
    - Biotechnology and Semiconductor
    - Green Tech
    - Energy Management
    - Natural Resources
Health:
    - Alternative Medicine
    - Cannabis
    - Medical Professionals
    - Personal Health
    - Mobile Emergency&Health
(I've segregated each of these primary sectors into one of the eight main sectors based on what I think is appropriate. Please go through the list and let me know if there are any changes.)

In [None]:
# creating lists of primary_sectors that fall under a main sector
# Social, Finance, Analytics, Advertising
social_analytics = ["Analytics", "Finance", "Financial Services", "Finance Technology", "Business Analytics", 
"Big Data Analytics", "Investment Management", "Social Media Advertising", "Personal Finance", 
"Predictive Analytics", "Financial Exchanges", "Mobile Analytics", "Social Media Management", "Promotional"]

# Cleantech / Semiconductors
cleantech_semiconductors = ["Waste Management", "Natural Gas Uses", "Natural Resources", 
"Biotechnology and Semiconductor", "GreenTech", "Energy Management"]

# Health
health = ["Alternative Medicine", "Cannabis", "Medical Professionals",
"Personal Health", "Mobile Emergency&Health"]

Now that the bins are ready, we'll perform the following steps in order:
1. Merge the mapping_tidy dataset with the master dataset using a left join
2. Impute the missing main_sector values in the merged master dataset
3. Replace the other missing values in main_sector with the "Others" flag.

### Why I think deleting missing values is a bad idea at this stage
The answers to the questions in Checkpoints 2 and 3 were found with a dataset containing 92607 rows. Because of that, deleting rows with missing values now might bias the analysis and produce erroneous results. Therefore, I feel that it is better to flag the remaining 1000 rows as "Others" than to remove those rows.

In [None]:
# merging the master and mapping_tidy
master = pd.merge(master, mapping_tidy, how = "left", left_on = "primary_sector", right_on = "category_list")

In [None]:
# removing the category_list_y variable and renaming category_list_x as category_list
master.drop("category_list_y", axis = 1, inplace = True)

master.rename(columns = {"category_list_x": "category_list"}, inplace = True)

In [None]:
# filling up the main_sector with appropriate values.
# social, analytics, finance, advertising
master.loc[master.primary_sector.isin(social_analytics), "main_sector"] = "Social, Finance, Analytics, Advertising"

# cleantech / semiconductors
master.loc[master.primary_sector.isin(cleantech_semiconductors), "main_sector"] = "Cleantech / Semiconductors"

# health
master.loc[master.primary_sector.isin(health), "main_sector"] = "Health"

# filling up the remaining missing values with others
master.loc[master.main_sector.isnull(), "main_sector"] = "Others"
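
A more compact way to express the same imputation is to turn the bins into one lookup dictionary and use `Series.map`. This is only a sketch on a hypothetical toy frame with shortened bins, not the code used for the results in this notebook. Unlike the `.loc` assignments above, it only fills rows whose main_sector is still missing, which gives the same result as long as the binned primary sectors are absent from the mapping file.

```python
import pandas as pd

# Shortened, hypothetical versions of the manual bins
manual_bins = {
    "Social, Finance, Analytics, Advertising": ["Analytics", "Finance"],
    "Cleantech / Semiconductors": ["Waste Management"],
    "Health": ["Cannabis"],
}

# invert the bins into a primary_sector -> main_sector lookup
sector_lookup = {
    primary: main
    for main, primaries in manual_bins.items()
    for primary in primaries
}

# toy frame as it might look right after the left join
toy = pd.DataFrame({
    "primary_sector": ["Analytics", "Cannabis", "Biotechnology", "Unmapped Thing"],
    "main_sector": [None, None, "Cleantech / Semiconductors", None],
})

# keep existing values, then fill from the lookup, then fall back to "Others"
toy["main_sector"] = (
    toy["main_sector"]
    .fillna(toy["primary_sector"].map(sector_lookup))
    .fillna("Others")
)

print(toy)
```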

Now, the dataset can be used for the analysis.

In [None]:
# sample code. Not for the final file. 
# converting raised_amount_usd to millions (divide 1000000)
# master["raised_amount_usd"] = master.raised_amount_usd / 1000000
# master.groupby("main_sector").raised_amount_usd.mean().sort_values(ascending = False)
master.groupby("funding_round_type").raised_amount_usd.mean().sort_values(ascending = False)

# Checkpoint 5: Sector Analysis 2

The aim is to find out the most heavily invested main sectors in each of the three countries (for funding type FT and an investment range of 5-15 M USD).

Now that we have the top 3 countries, and the preferred funding type, the last step is to identify the most preferred sectors in each country. To do this, we're required to create three dataframes, one for each country and get the total investment and counts for all of the main sectors. Here's how we'll tackle this:
1. Create a dataframe for each country with the preferred funding type.
2. Add the total investments and counts of investments in each sector to the dataframes.
3. Fill out the table with the results we get.
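
The steps above can be sketched as a single helper. Note that the cells in this notebook filter on country and funding type only; the 5-15 M USD range filter shown here is an assumption about how the stated objective could be applied, and the figures reported in this checkpoint were not produced with it. The function name `sector_summary` and the toy data are hypothetical.

```python
import pandas as pd

def sector_summary(df, country, funding_type="venture", low=5, high=15):
    """Total and count of investments per main sector for one country.

    Amounts are assumed to be in million USD; between() is inclusive on both ends.
    """
    mask = (
        (df["country_code"] == country)
        & (df["funding_round_type"] == funding_type)
        & df["raised_amount_usd"].between(low, high)
    )
    return (
        df.loc[mask]
        .groupby("main_sector")["raised_amount_usd"]
        .agg(["sum", "count"])
        .sort_values("count", ascending=False)
    )

# Hypothetical toy data
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "USA", "USA", "GBR"],
    "funding_round_type": ["venture", "venture", "venture", "seed", "venture"],
    "raised_amount_usd": [6.0, 10.0, 20.0, 7.0, 8.0],
    "main_sector": ["Others", "Others", "Health", "Others", "Health"],
})

usa_toy_summary = sector_summary(toy, "USA")
print(usa_toy_summary)
```

Calling the helper once per country would avoid repeating the same three blocks for USA, GBR and IND.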

In [None]:
# creating the three dataframes
# usa
usa = master.loc[(master.country_code == "USA") & (master.funding_round_type == "venture"), :]

# great britain
gbr = master.loc[(master.country_code == "GBR") & (master.funding_round_type == "venture"), :]

# india
ind = master.loc[(master.country_code == "IND") & (master.funding_round_type == "venture"), :]

In [None]:
# getting the total investments and adding them to the dataframes
#usa
usa_summary = usa.groupby("main_sector").raised_amount_usd.agg(["sum", "count"])

usa = pd.merge(usa, usa_summary, how = "left", on = "main_sector")

# great britain
gbr_summary = gbr.groupby("main_sector").raised_amount_usd.agg(["sum", "count"])

gbr = pd.merge(gbr, gbr_summary, how = "left", on = "main_sector")

# india
ind_summary = ind.groupby("main_sector").raised_amount_usd.agg(["sum", "count"])

ind = pd.merge(ind, ind_summary, how = "left", on = "main_sector")

Now that we have the dataframes, it's time to answer the questions asked.

In [None]:
# total number of investments in countries
# (printing each shape, since a cell only displays its last expression)
# usa
print(usa.shape)

# great britain
print(gbr.shape)

# india
print(ind.shape)

Total number of investments in each country:
USA: 35292
GBR: 2027
IND: 813

In [None]:
# total size of investments in countries
top3 = master.loc[(master.country_code.isin(["USA", "GBR", "IND"])) & (master.funding_round_type == "venture"), :]
top3.groupby("country_code").raised_amount_usd.agg(["sum", "count"])

Total amount invested in each country:
USA: 411102.768986
GBR: 19931.867246
IND: 14134.008718

In [None]:
# top three sectors based on count of investments
# (printing each result, since a cell only displays its last expression)
# usa
print(usa.groupby("main_sector").raised_amount_usd.count().sort_values(ascending = False))

# great britain
print(gbr.groupby("main_sector").raised_amount_usd.count().sort_values(ascending = False))

# india
print(ind.groupby("main_sector").raised_amount_usd.count().sort_values(ascending = False))

Top Sector Based on Count of investment:
USA: Others
GBR: Others
IND: Others

Second best sector based on count of investments:
USA: Cleantech / Semiconductors
GBR: Cleantech / Semiconductors
IND: Social, Finance, Analytics, Advertising

Third best sector based on count of investments:
USA: Social, Finance, Analytics, Advertising
GBR: Social, Finance, Analytics, Advertising
IND: News, Search and Messaging

Number of investments in top sector
USA:8521
GBR:526
IND:285

Number of investments in second best sector
USA:7723
GBR:436
IND:144

Number of investments in third best sector
USA:6984
GBR:414
IND:139

In [None]:
# For the top sector count-wise (point 3), which company received the highest investment?
# (sorted in ascending order, so the top company is the last row of each result)
# usa
print(usa[usa.main_sector == "Others"].groupby("permalink").raised_amount_usd.sum().sort_values())

# gbr
print(gbr[gbr.main_sector == "Others"].groupby("permalink").raised_amount_usd.sum().sort_values())

# ind
print(ind[ind.main_sector == "Others"].groupby("permalink").raised_amount_usd.sum().sort_values())

For the top sector count-wise (point 3), which company received the highest investment?
USA: social-finance 
GBR: oneweb
IND: flipkart
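
If only the single top company is needed, sorting the whole series is not required. A minimal sketch, on hypothetical toy data, uses `idxmax` to return the permalink with the largest total directly.

```python
import pandas as pd

# Hypothetical toy data; amounts are in million USD
toy = pd.DataFrame({
    "main_sector": ["Others", "Others", "Others", "Health"],
    "permalink": ["/organization/a", "/organization/b", "/organization/a", "/organization/c"],
    "raised_amount_usd": [5.0, 12.0, 9.0, 50.0],
})

# total raised per company within one main sector
totals = (
    toy[toy["main_sector"] == "Others"]
    .groupby("permalink")["raised_amount_usd"]
    .sum()
)

# idxmax returns the index label (here the permalink) of the largest total
top_company = totals.idxmax()

print(top_company, totals.max())
```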

In [None]:
# For the second-best sector which company received the highest investment?
# (sorted in ascending order, so the top company is the last row of each result)
# usa
print(usa[usa.main_sector == "Cleantech / Semiconductors"].groupby("permalink").raised_amount_usd.sum().sort_values())

# gbr
print(gbr[gbr.main_sector == "Cleantech / Semiconductors"].groupby("permalink").raised_amount_usd.sum().sort_values())

# ind
print(ind[ind.main_sector == "Social, Finance, Analytics, Advertising"].groupby("permalink").raised_amount_usd.sum().sort_values())

For the second-best sector count-wise (point 4), which company received the highest investment?
USA: Freescale
GBR: ImmunoCore
IND: Shopclues.com