### Investments Case study - EDA

In [1]:
# Supress Warnings
import warnings
warnings.filterwarnings('ignore')

#### Input files for analysis of all the sections

Files provided with the case study: 

   `companies.txt`  (file ingested in Section 1.1)<br>
   `rounds2.csv`    (file ingested in Section 1.1)<br> 
   `mapping.csv`    (file ingested in Section 5.1) <br>

#### Code Developed on 24-October-2018
`def get_encoding(file):` <br>
>   `file = open(file, 'rb')` <br>
>  `tmp_file = file.read()` <br>
>   `file.close()` <br>
>   `detectedValue = detector.detect(tmp_file)` <br>
>    `encoding = detectedValue['encoding']` <br>
>    `return encoding` <br>

#### Code Optimized on 01-November-2018 ; entire code executes in  0:00:33.246118 seconds

##### https://pypi.org/project/python-magic/ <br>
##### conda install -c conda-forge python-magic   <br>
##### import magic <br>

print(magic.from_file("companies.txt")) <br>
`UTF-8 Unicode text, with very long lines` <br>
#### Windows and Linux had errors when UTF-8 was typed; this worked when we changed everything to ISO 8859-1
print(magic.from_file("rounds2.csv")) <br>
`Non-ISO extended-ASCII text, with CRLF line terminators`
print(magic.from_file("mapping.csv")) <br>
`ASCII text, with CRLF line terminators` <br>

##### extended-ASCII is usually foreign language ; ISO 8859-1, Latin-1, mac_roman works; we have used age-old IBM way of reading encoding <br>
###### https://en.wikipedia.org/wiki/Extended_ASCII <br>
companies = pd.read_csv("companies.txt", encoding= 'UTF-8',sep='\t',header=0)  <br>
rounds2 = pd.read_csv("rounds2.csv" , encoding = 'ISO 8859-1') <br>
mapping = pd.read_csv("mapping.csv", encoding ='ISO 8859-1') <br>



#### Special instructions for Mac OSX users to install iso3166
##### I had this issue on Mac OSX High Sierra 10.13.6 with installation of iso3166 on conda. Just fire up terminal and `run pip install iso3166` instead of `conda install -c mcrot iso3166=0.7`:
1. Run `conda install -c conda-forge pypdf2`<br>
2. Run `pip install iso3166` on command line <br>

In [2]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd
import chardet as detector # to detect the encoding of file
# Use  [ conda install -c conda-forge pypdf2]  on MAC OSX 10.13.6 (17G65) to install this package
import PyPDF2 #read pdf
import re #Find and replace
from iso3166 import countries #Country names for code

from datetime import datetime

# https://pypi.org/project/python-magic/
#  conda install -c conda-forge python-magic  
import magic

# Mark the start timestamp of code execution
startTime = datetime.now()



print(magic.from_file("companies.txt"))
#UTF-8 Unicode text, with very long lines
print(magic.from_file("rounds2.csv"))
print(magic.from_file("mapping.csv"))
#Non-ISO extended-ASCII text, with CRLF line terminators
# extended-ASCII is usually foreign language ; ISO 8859-1, Latin-1, mac_roman works; we have used age-old IBM way of reading encoding
# https://en.wikipedia.org/wiki/Extended_ASCII
#Please change encoding='ISO 8859-1' ; Windows10/Mac/Linux would show UTF-8 , 
# however, code does not compile using UTF-8; changing it to 'ISO 8859-1' worked across Windows 10, Mac and Linux
#one of our developer running Windows had issues when 'UTF-8' was used
companies = pd.read_csv("companies.txt", encoding= 'ISO 8859-1',sep='\t',header=0) 
rounds2 = pd.read_csv("rounds2.csv" , encoding = 'ISO 8859-1') # Write your code for importing the csv file here
#mapping = pd.read_csv("mapping.csv", encoding ='ISO 8859-1')

#Safety first - Ensure to get correct encoding value
#def get_encoding(file):
#    file = open(file, 'rb')
#    tmp_file = file.read()
#    file.close()
#    detectedValue = detector.detect(tmp_file)
#    encoding = detectedValue['encoding']
#    return encoding
#Ensure output of 2 decimals
#pd.options.display.float_format = "{:.2f}".format


UTF-8 Unicode text, with very long lines
Non-ISO extended-ASCII text, with CRLF line terminators
ASCII text, with CRLF line terminators


#### Unicode/non-ascii special character clean-up

> We went through all the codecs (and also checked auto-detection of encoding which gave `{'encoding': 'Windows-1254', 'confidence': 0.4186155476629225, 'language': 'Turkish'}` and tried to evaluate the charset which did not miss data and skipped the error;  <br>

The most suited encoding for the data set provided is `ISO 8859-1`<br> 
> https://docs.python.org/2/library/codecs.html#standard-encodings <br>

#### Loading TSV

> https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe

In [3]:
#Load TSV. SOURCE: https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe
#companies = pd.read_csv("companies.txt", encoding = get_encoding("companies.txt"),sep='\t',header=0) 
#Load rounds CSV
#Auto detection of encoding gave {'encoding': 'Windows-1254', 'confidence': 0.4186155476629225, 'language': 'Turkish'}
#This resulted in error. Manual check revealed latin-1 did not miss data and skipped error.
#rounds2 = pd.read_csv("rounds2.csv", encoding = 'latin-1') # Write your code for importing the csv file here
#Following code was used to understand the columns and encoding issues
#print(companies.columns) #get columns
#companies
# Only permalink, city and name seem to have encoding issues based on output. Fixing the same
# Converting all to uppercase and filtering any non al
companies[['permalink', 'name','city']]=companies[['permalink', 'name','city']].astype(str).applymap(lambda x: ''.join(filter(str.isalnum, x.encode('utf-8').decode('ascii', 'ignore'))).upper())
#Following code was used to understand the columns and encoding issues
#print(rounds2.columns) #get columns
#rounds2                             
#Only company_permalink needs to be decoded from utf-8 and encoded in ascii
#Cleanse rounds data frame. The singleton list for key avoids 'Series' object has no attribute 'applymap' exception
rounds2[['company_permalink']]=rounds2[['company_permalink']].astype(str).applymap(lambda x: ''.join(filter(str.isalnum, x.encode('utf-8').decode('ascii', 'ignore'))).upper())
#Initial unique rows count in companies by permalink using describe()
# companies.permalink.describe() #select unique
original_rows_companies= 66290
#rounds2.company_permalink.describe() #select unique
original_rows_rounds = 114949


### Understand the Data Set

Here are our assumptions while trying to ingest, slice, dice and merge the data <br>

#### Identify null percentages / drop the ones causing distortion if the percentage of drops is reasonable

 permalink         0.00<br>
 name              0.00<br>
 status            0.00<br>
 homepage_url      7.62 -Ignore these<br>
 `category_list     4.74 -Drop these as these cannot be analysed`<br>
 `country_code     10.48 -- attempt filling in values based on other fields?`<br>
 `state_code       12.88 -- attempt filling in the values based on other fields ; Countries without state like singapore exist.`<br>
 `region           12.10 -Drop these as these cannot be analysed`<br>
 `city             12.10 -Drop these as these cannot be analysed`<br>
 `founded_at       22.93 This is a big number - Do not touch!`<br>



    1. Some companies that don't have home-page but still contain investment, we cannot ignore them 
    2. 6,238 companies are closed, we don't want to include closed companies in our analysis as it is a background noise that can be eliminated
    3. 6,958 companies are not incorporated in any country, it is risky to consider them in our analysis. Moreover the URL is not sufficient to determine the country. For example, http://www.ardana.co.uk belongs to UK ; however majority of these contain commercial incorporated url's such as "http://beansaround.com" and information is not sufficient to evaluate the country of incorporation other than someone manually logging in to the url and then determining the country

`companies.groupby(['status']).permalink.describe()` <br>


>status |  count |	unique|	top|	freq  <br>
> acquired|	5549|	5549	|/Organization/Enigma-Digital|	1 <br>
> `closed|	6238|	6238`	|/Organization/Couchone	|1 <br>
> ipo|	1547	|1547|	/Organization/Nitromed|	1 <br>
> operating	|53034	|53034|	/Organization/Statace	| <br>


    
    
 `companies.isnull().sum()` <br> 

>permalink            0 <br>
>name                 0 <br>
>homepage_url      5058 <br>
>category_list     3148 <br>
>status               0 <br>
>`country_code      6958`  <br>
>state_code        8547<br>
>region            8030<br>
>city              8028 <br>
>founded_at       15221 <br>
> dtype: int64<br>



#### There are 156 duplicate entries with Same name and different locations
398  ***   ORGANIZATION5MINUTES     ***    5MINUTES     ...    ***        Shanghai  ***       NaN<br> 
451  ***      ORGANIZATION5MINUTES  ***       5MINUTES     ...  ***            London *** 19-03-2011<br>
#### Duplicate entries with and without nulls:
 5875  ***        ORGANIZATIONBARKCO   ***        BARKCO     ...  ***          New York***  01-01-2011<br>
 5876 ***         ORGANIZATIONBARKCO  ***         BARKCO     ...  ***          New York  ***       NaN<br>
 59911 ***     ORGANIZATIONTVCOMPASS  ***      TVCOMPASS     ...   ***          Chicago***  01-01-2003<br>
 59922 ***     ORGANIZATIONTVCOMPASS  ***      TVCOMPASS     ...   ***          Chicago *** 01-01-2003<br>
#### Different permalink and name but same company due to homepage_url
 66028    ***               ORGANIZATIONZINGBOX   ***                ZINGBOX     ...  ***    Mountain View    ***     NaN<br>
 66030    ***            ORGANIZATIONZINGBOXLTD   ***             ZINGBOXLTD     ...  ***              NaN *** 11-11-2014<br>
 65927    ***               ORGANIZATIONZHAOPIN    ***               ZHAOPIN     ...  ***          Beijing *** 01-01-1997<br>
 65943    ***        ORGANIZATIONZHILIANZHAOPIN    ***        ZHILIANZHAOPIN     ...   ***         Beijing   ***      NaN<br>


#### Null RAISED_AMOUNT_USD

round(100*(rounds2.isnull().sum()/len(rounds2.index)), 2) <br>
company_permalink           0.00 <br>
funding_round_permalink     0.00 <br>
funding_round_type          0.00 <br>
funding_round_code         72.91 <br>
funded_at                   0.00 <br>
`raised_amount_usd          17.39` - NOT INTERESTED IN THESE <br>
##### Companies with null homepage_url and founded_at are ok to be present in the dataframe, but, not the rest

In [4]:
#Clean-up
# First find null percentages... Uncomment to see result!
#round(100*(companies.isnull().sum()/len(companies.index)), 2)
# permalink         0.00
# name              0.00
# status            0.00
# homepage_url      7.62 -Ignore these
# category_list     4.74 -Drop these as nothing can be done in anlysis
# country_code     10.48 -- Fill in values based on other fields?
# state_code       12.88 -- Fill in values based on other fields? Countries without state like singapore exist.
# region           12.10 -Drop these as nothing can be done in anlysis
# city             12.10 -Drop these as nothing can be done in anlysis
# founded_at       22.93 This is a big number - Do not touch!
# Finding duplicates : Uncomment below code if you wish to see results
# pd.concat(c for _, c in companies.groupby("permalink") if len(c) > 1)
# reference: https://stackoverflow.com/questions/14657241/how-do-i-get-a-list-of-all-the-duplicate-items-using-pandas-in-python
# There are 156 duplicate entries.
# SAME NAME DIFFERENT LOCATIONS
#398         ORGANIZATION5MINUTES         5MINUTES     ...            Shanghai         NaN
#451         ORGANIZATION5MINUTES         5MINUTES     ...              London  19-03-2011
# Duplicate entries with and without nulls:
# 5875          ORGANIZATIONBARKCO           BARKCO     ...            New York  01-01-2011
# 5876          ORGANIZATIONBARKCO           BARKCO     ...            New York         NaN
# 59911      ORGANIZATIONTVCOMPASS        TVCOMPASS     ...             Chicago  01-01-2003
# 59922      ORGANIZATIONTVCOMPASS        TVCOMPASS     ...             Chicago  01-01-2003
# Different permalink and name but same company due to homepage_url
# 66028                   ORGANIZATIONZINGBOX                   ZINGBOX     ...      Mountain View         NaN
# 66030                ORGANIZATIONZINGBOXLTD                ZINGBOXLTD     ...                NaN  11-11-2014
# 65927                   ORGANIZATIONZHAOPIN                   ZHAOPIN     ...            Beijing  01-01-1997
# 65943            ORGANIZATIONZHILIANZHAOPIN            ZHILIANZHAOPIN     ...            Beijing         NaN

# Any null values here?
#round(100*(rounds2.isnull().sum()/len(rounds2.index)), 2)
#company_permalink           0.00
#funding_round_permalink     0.00
#funding_round_type          0.00
#funding_round_code         72.91
#funded_at                   0.00
#raised_amount_usd          17.39 - NOT INTERESTED IN THESE
# We can have companies with null homepage_url and founded_at, but, not the rest.
companies.dropna(subset=['category_list','region','city'],inplace=True)
#We need to remove entries where permalink or homepage is same!
companies.drop_duplicates(subset=['permalink'], keep=False, inplace=True)
companies.drop_duplicates(subset=['homepage_url'], keep=False, inplace=True)
#Status is closed... What to do? Drop them... Not needed...
companies = companies[companies.status != 'closed']
#Drop data on companies which have not generated funds...
rounds2.dropna(subset=['raised_amount_usd'],inplace=True)
rows_retained_companies = len(companies.isnull().sum(axis=1).index)
rows_retained_rounds = len(rounds2.isnull().sum(axis=1).index)
rows_retained_companies_percentage = round((rows_retained_companies)/original_rows_companies*100,2)
rows_retained_rounds_percentage =round((rows_retained_rounds)/original_rows_rounds*100,2)
print ("Rows retained in companies = ",rows_retained_companies,"and Percentage of rows retained = ",rows_retained_companies_percentage)
print ("Rows retained in rounds2 = ",rows_retained_rounds,"and Percentage of rows retained = ",rows_retained_rounds_percentage)
#Still having 74.31 % of rows in companies and 82.61 % of rows in rounds. Good to go!

Rows retained in companies =  49259 and Percentage of rows retained =  74.31
Rows retained in rounds2 =  94959 and Percentage of rows retained =  82.61


## Checkpoint 1
#### Code and answer for Section 1.1

`Note:` <br>

`1.The Coding cell is immedeately followed by Markdown table, team has updated results in the markdown table as per requirements in addition to updating results in the spreadsheet` <br>
`2.Code is very clearly commented and print statements put up in the same manner for avoiding any confusion`

In [5]:
# NOTE: We have already filtered the data for most possible anamolies. Hence the numbers in answers are for the subset we consider analysable.
# Code for Section 1.1 goes here. The next cell has Markdown table, please update results in the markdown table as well
# ****************************************************************************
# Understand the Data Set
#****************************************************************************
#How many unique companies are present in rounds2?
print(len(rounds2.company_permalink.unique()),"unique rows in rounds2.")
#unique                     53861
#-------------------------------------------------------------------------------
# How many unique companies are present in companies?   
# One possible way to ask is companies.describe
# Take care to check how many companies are repeating. Same home page - 119 Companies, Same name - 265 Companies
# Otherwise, it is 66368 based on just permalink
print(len(companies.permalink.unique()),"unique rows in companies.")
# unique                    49259
#-------------------------------------------------------------------------------
#In the companies data frame, which column can be used as the unique key for each company? Write the name of the column.
# companies.describe()
#premalink is uinque and not null. Hence it should be used as unique key. URL is repeated for some.
companies.set_index("permalink", inplace = True)
#-------------------------------------------------------------------------------
# Are there any companies in the rounds2 file which are not present in companies? Answer yes or no: Y/N	 
# Answer : Yes
print("Difference of rounds2 and companies tables (after cleaning):", len([x for x in rounds2.company_permalink if x not in companies.index]), "Companies")
# Write your code here...
#-------------------------------------------------------------------------------
# Merge the two data frames so that all variables (columns) in the companies frame are added to the rounds2 data frame. Name the merged frame master_frame. How many observations are present in master_frame?
master_frame = rounds2.merge(companies,how='inner',left_on = 'company_permalink',right_on = 'permalink')
#Convert amount to billions of usd - do not round as it counts small amounts
master_frame['raised_amount_usd'] = master_frame.raised_amount_usd/1000000
master_frame.rename( columns= {"raised_amount_usd":"raised_amount_million_usd"}, inplace=True)
print("Total number of observations in master_frame is",len(master_frame))
#-------------------------------------------------------------------------------

53861 unique rows in rounds2.
49259 unique rows in companies.
Difference of rounds2 and companies tables (after cleaning): 16560 Companies
Total number of observations in master_frame is 78399



#### Answer for Section 1.1
#### Table 1.1: Understand the Data Set 


<table border ="1" width=700>
    <tr>
        <td>How many unique companies are present in rounds2?</td> 
        <td> 53861 </td>
    </tr>
    <tr>
        <td>How many unique companies are present in companies?	</td> 
        <td> 49259 </td>
    </tr>
    <tr>
        <td>In the companies data frame, which column can be used as the unique key for each company? Write the name of the column.</td> 
        <td>  permalink </td>
    </tr>
    <tr>
        <td>Are there any companies in the rounds2 file which are not present in companies? Answer yes or no: Y/N</td> 
        <td> Y </td>
    </tr>
    <tr>
        <td>Merge the two data frames so that all variables (columns) in the companies frame are added to the rounds2 data frame. Name the merged frame master_frame. How many observations are present in master_frame?</td> 
        <td> 78399 </td>
    </tr>
    
</table>



## Checkpoint 2

#### Code and answer for Section 2.1

In [6]:
# Code for Section 2.1 goes here. The next cell has Markdown table, please update results in the markdown table as well
# ****************************************************************************
#Table 2.1: Average Values of Investments for Each of these Funding Types 
# ****************************************************************************

#Average funding amount of venture type	
print (master_frame.groupby('funding_round_type').mean().loc['venture'])
#raised_amount_usd    11.88
#-------------------------------------------------------------------------------
# Average funding amount of angel type	 
print (master_frame.groupby('funding_round_type').mean().loc['angel'])
#raised_amount_usd    0.99
#-------------------------------------------------------------------------------
# Average funding amount of seed type	 
print (master_frame.groupby('funding_round_type').mean().loc['seed'])
#raised_amount_usd    0.76
#-------------------------------------------------------------------------------
# Average funding amount of private equity type	 
print (master_frame.groupby('funding_round_type').mean().loc['private_equity'])
#raised_amount_usd    73.33
#-------------------------------------------------------------------------------
# Considering that Spark Funds wants to invest between 5 to 15 million USD per investment round, which investment type is the most suitable for it?
# Venture...
                                                                                       

raised_amount_million_usd    11.879979
Name: venture, dtype: float64
raised_amount_million_usd    0.994762
Name: angel, dtype: float64
raised_amount_million_usd    0.762296
Name: seed, dtype: float64
raised_amount_million_usd    73.331531
Name: private_equity, dtype: float64



#### Answer for Section 2.1
#### Table 2.1: Average Values of Investments for Each of these Funding Types 


<table border ="1" width=700>
    <tr>
        <td>Average funding amount of venture type</td> 
         <td> 
        | Type of Investment    | Million USD | <br>
        |-----------------------|:-----------:|  <br>
        | venture               | 11.88       |  <br>
       </td>
    </tr>
     <tr>
        <td>Average funding amount of angel type</td> 
        <td> 
        | Type of Investment    | Million USD | <br>
        |-----------------------|:-----------:|  <br>
        | angel               | 0.99       |  <br>
        </td>
    </tr>
    <tr>
    <tr>
        <td>Average funding amount of seed type</td> 
        <td> 
        | Type of Investment    | Million USD | <br>
        |-----------------------|:-----------:|  <br>
        | seed               | 0.76      |  <br>
        </td>
    </tr>
    <tr>
        <td>Average funding amount of private equity type	 </td> 
        <td> 
        | Type of Investment    | Million USD | <br>
        |-----------------------|:-----------:|  <br>
        | private_equity               | 73.33       |  <br>
        </td>
    </tr>
    <tr>
        <td>Considering that Spark Funds wants to invest between 5 to 15 million USD per investment round, which investment type is the most suitable for it?</td> 
        <td> Venture, 11.88M US$ </td>
    </tr>
    
</table>


## Checkpoint 3
#### Code and answer for Section 3.1

Use `PyPDF2` to read `Countries_where_English_is_an_official_language.pdf` and print top 9 countries

In [7]:
# Code for Section 3.1 goes here. The next cell has Markdown table, please update results in the markdown table as well
#Spark Funds wants to see the top nine countries which have received the highest total funding (across ALL sectors for the chosen investment type - venture)
#For the chosen investment type, make a data frame named top9 with the top nine countries (based on the total investment amount each country has received)
top9= master_frame[master_frame['funding_round_type'] == 'venture'].groupby('country_code').sum().sort_values(by = 'raised_amount_million_usd', ascending = False)[:9]
#index is ['USA', 'CHN', 'GBR', 'IND', 'CAN', 'FRA', 'ISR', 'DEU', 'SWE']
#Make a pretty display
print(top9)
#Load list of english speaking countries
pdfFileObj = open('Countries_where_English_is_an_official_language.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
pageObj = pdfReader.getPage(0)
englist_continents_countries= ' '.join(re.sub(r"(\w)([A-Z])", r"\1 \2",pageObj.extractText()).splitlines()[1:])
pdfFileObj.close()
top9_country_names={' '.join( countries.get(country).apolitical_name.split()[:2]):country for country in top9.index}
top3_english_country_names=[name for name in list(top9_country_names.keys()) if name in  englist_continents_countries][:3]
# ****************************************************************************
# Table 3.1: Analysing the Top 3 English-Speaking Countries
# ****************************************************************************
#-------------------------------------------------------------------------------
#Top English-speaking country	              
print("Top English speaking country:",top3_english_country_names[0])
#-------------------------------------------------------------------------------
#Second English-speaking country	 
print("Second English speaking country:",top3_english_country_names[1])
#------------------------------------------------------------------------------    
#Third English-speaking country	 
print("Third English speaking country:",top3_english_country_names[2])



              raised_amount_million_usd
country_code                           
USA                       375805.380141
CHN                        34432.116713
GBR                        16762.922982
IND                        13456.066718
CAN                         8244.403575
FRA                         6169.320568
ISR                         6011.867458
DEU                         5833.951235
SWE                         2842.573022
Top English speaking country: United States
Second English speaking country: United Kingdom
Third English speaking country: India


#### Answer for section 3.1
#### Table 3.1: Analysing the Top 3 English-Speaking Countries


<table border ="1" width=700>
    <tr>
        <td>Top English speaking country</td> 
         <td> 
UNITED STATES OF AMERICA [375,805.38 Mil USD]
       </td>
    </tr>
    <tr>
        <td>Second English speaking country</td> 
        <td> UNITED KINGDOM	[16,762.92 Mil USD] </td>
    </tr>
    <tr>
        <td>Third English speaking country	 </td> 
        <td>  INDIA [	13,456.07  Mil USD]   </td>
    </tr>

</table>




## Checkpoint 5
#### Code for Section 5.1 - Dataframe preparation for analysis in section 5.2

This section leverages on data preparation done in all the cells above. In addition `Mapping.csv` is ingested to merge with the earlier dataframe; Also note that `32` categories were missing and we had to fix them

`These missing categories could not be ignored as they had quite a number of rounds of investments associated with them. Including them gives us more leverage on the data and hence the fortified analysis`

In [8]:
#Data Prep Sector Analysis 1 (5.1)
#Read Mapping file into a data frame
mapping = pd.read_csv("mapping.csv", encoding ='ISO 8859-1')
# mapping = pd.read_csv("mapping.csv", encoding = get_encoding("mapping.csv"))
mapping.dropna(subset=['category_list'],inplace=True)
#Someone did a find and replace of na with 0... Also words are not in proper case...
#Replace na
print ("Fixing 0 in category_list...")
mapping['category_list'] = list(map( lambda x: re.sub(r"0", r"na", x), mapping.category_list))
#Bruteforce to upper
print ("Fixing case in category_list...")
mapping['category_list'] = mapping['category_list'].str.upper()

#extract first column from master_Frame ; numerous columns are present separated by a '|'
master_frame['primary_sector'] = master_frame['category_list'].str.split('|',1,expand=True)[0]
print ('Created primary_sector in master_frame')

#Some categories are missing...
print("Missing number of categories:",len({x for x in master_frame['primary_sector'].str.upper() if x not in list(mapping.category_list)}), "found")
print ("Fixing missing categories...")

#Fixing them
labels =['category_list','Automotive & Sports', 'Blanks', 'Cleantech / Semiconductors', 'Entertainment', 'Health', 'Manufacturing', 'News, Search and Messaging', 'Others','Social, Finance, Analytics, Advertising']

extras = [('GROUP EMAIL',0,0,0,0,0,0,1,0,0), 
 ('KINECT',0,0,0,1,0,0,0,0,0), 
 ('ENTERPRISE 2.0',0,0,0,0,0,1,0,0,0), 
 ('REGISTRARS',0,0,0,0,0,0,0,1,0), 
 ('VACATION RENTALS',0,0,0,0,0,0,0,0,1), 
 ('SPECIALTY RETAIL',0,0,0,0,0,0,0,0,1), 
 ('INTERNET TECHNOLOGY',0,0,0,0,0,0,1,0,0), 
 ('RAPIDLY EXPANDING',0,0,0,0,0,0,0,1,0), 
 ('NATURAL GAS USES',0,0,0,0,0,0,0,1,0), 
 ('ENTERPRISE HARDWARE',0,0,0,0,0,1,0,0,0), 
 ('LINGERIE',0,0,0,0,0,0,0,1,0), 
 ('GREENTECH',0,0,1,0,0,0,0,0,0), 
 ('CAUSE MARKETING',0,0,0,0,0,0,0,0,1), 
 ('SPAS',0,0,0,0,1,0,0,0,0), 
 ('PSYCHOLOGY',0,0,0,0,1,0,0,0,0), 
 ('INTERNET TV',0,0,0,1,0,0,0,0,0), 
 ('GOOGLE GLASS',0,0,0,1,0,0,0,0,0), 
 ('DEEP INFORMATION TECHNOLOGY',0,0,0,0,0,0,1,0,0), 
 ('SOCIAL MEDIA ADVERTISING',0,0,0,0,0,0,0,0,1), 
 ('REAL ESTATE INVESTORS',0,0,0,0,0,0,0,0,1), 
 ('BIOTECHNOLOGY AND SEMICONDUCTOR',0,0,1,0,0,0,0,0,0), 
 ('SKILL GAMING',1,0,0,0,0,0,0,0,0), 
 ('SWIMMING',1,0,0,0,0,0,0,0,0), 
 ('ADAPTIVE EQUIPMENT',1,0,0,0,0,0,0,0,0), 
 ('GOLF EQUIPMENT',1,0,0,0,0,0,0,0,0), 
 ('GENERATION Y-Z',0,0,0,0,0,0,0,0,1), 
 ('TOYS',0,0,0,0,0,0,0,1,0), 
 ('MOBILE EMERGENCY&HEALTH',0,0,0,0,1,0,0,0,0),
 ('ENGLISH-SPEAKING',0,0,0,0,0,0,0,0,1),
 ('NIGHTLIFE',0,0,0,1,0,0,0,0,0),
 ('SEX INDUSTRY',0,0,0,1,0,0,0,0,0),
 ('SPONSORSHIP',0,0,0,0,0,0,0,1,0)
]
mapping_extras = pd.DataFrame.from_records(extras,columns=labels)
mapping = pd.concat([mapping,mapping_extras])
mapping.set_index('category_list',inplace=True)

#map sector
master_frame['main_sector'] = list(map(lambda x : list(mapping.columns[mapping.loc[x.upper()] == 1])[0], master_frame['primary_sector']))
print ('mapped primary_sector to main_sector in master_frame')
#drop Blanks
print ('Dropping blanks in master frame...')
master_frame = master_frame[master_frame['main_sector'] != 'Blanks']


Fixing 0 in category_list...
Fixing case in category_list...
Created primary_sector in master_frame
Missing number of categories: 32 found
Fixing missing categories...
mapped primary_sector to main_sector in master_frame
Dropping blanks in master frame...


#### Section 5.2  - Sector analysis

##### 1. Please note that  favourite `venture` investment ( best investment between `5 Million-15 Million USD`) is chosen within each round of investment markup planning
##### 2. `English as official language` countries are chosen to facilitate the ease of business/investment



In [9]:
#Sector analysis -2 code goes here
#Conditions as per problem statement
country_1 = top9_country_names['United States']
country_2 = top9_country_names['United Kingdom']
country_3 = top9_country_names['India']
ft='venture'

#Creating filters. Reusable this way...
filter1 = (master_frame['country_code'] == country_1) &  (master_frame['funding_round_type']==ft) & (master_frame['raised_amount_million_usd'] >= 5.0) & (master_frame['raised_amount_million_usd'] <= 15.0)
filter2 = (master_frame['country_code'] == country_2) &  (master_frame['funding_round_type']==ft) & (master_frame['raised_amount_million_usd'] >= 5.0) & (master_frame['raised_amount_million_usd'] <= 15.0)
filter3 = (master_frame['country_code'] == country_3) &  (master_frame['funding_round_type']==ft) & (master_frame['raised_amount_million_usd'] >= 5.0) & (master_frame['raised_amount_million_usd'] <= 15.0)

#Total investment by main sector...
total_investment1=master_frame[['main_sector','raised_amount_million_usd']][filter1].groupby('main_sector').sum().sort_values(by='raised_amount_million_usd',ascending=False)
total_investment2=master_frame[['main_sector','raised_amount_million_usd']][filter2].groupby('main_sector').sum().sort_values(by='raised_amount_million_usd',ascending=False)
total_investment3=master_frame[['main_sector','raised_amount_million_usd']][filter3].groupby('main_sector').sum().sort_values(by='raised_amount_million_usd',ascending=False)
number_investment1=master_frame[['main_sector','funded_at']][filter1].groupby('main_sector').count().sort_values(by='funded_at',ascending=False)
number_investment2=master_frame[['main_sector','funded_at']][filter2].groupby('main_sector').count().sort_values(by='funded_at',ascending=False)
number_investment3=master_frame[['main_sector','funded_at']][filter3].groupby('main_sector').count().sort_values(by='funded_at',ascending=False)

# Create D1, D2 and D3
D1 = master_frame [filter1]
D2 = master_frame [filter2]
D3 = master_frame [filter3]

d1_meta = pd.merge(total_investment1,number_investment1,on='main_sector')
d1_meta.rename(index=str, columns={'raised_amount_million_usd':'total_invested',  'funded_at':'number_investments'},inplace=True)
d2_meta = pd.merge(total_investment2,number_investment2,on='main_sector')
d2_meta.rename(index=str, columns={'raised_amount_million_usd':'total_invested',  'funded_at':'number_investments'},inplace=True)
d3_meta = pd.merge(total_investment3,number_investment3,on='main_sector')
d3_meta.rename(index=str, columns={'raised_amount_million_usd':'total_invested',  'funded_at':'number_investments'},inplace=True)
print ('\nValue','\t\t\t\t\t\t\tUnited States','\tUnited Kingdom','\tIndia')
# 1. Total number of investments (count)
print('Total number of investments\t\t\t\t', d1_meta.number_investments.sum(),'\t\t',d2_meta.number_investments.sum(),'\t\t',d3_meta.number_investments.sum())
# 2. Total amount of investment (USD)
print('Total amount(Million USD) of investments\t\t', round(d1_meta.total_invested.sum(),2),'\t',round(d2_meta.total_invested.sum(),2),'\t',round(d3_meta.total_invested.sum(),2))
# 3. Top sector (based on count of investments)
#(note that ‘Other’ is one of the eight main sectors)
print('Top sector (based on investments count)\t\t\t', d1_meta.number_investments.index[0],'\n\t\t\t\t\t\t\t\t\t',d2_meta.number_investments.index[0],'\n\t\t\t\t\t\t\t\t\t\t\t',d3_meta.number_investments.index[0])
# 4. Second-best sector (based on count of investments)
print('Second-best sector (based on investments count)\t\t', d1_meta.number_investments.index[1],'\n\t\t\t\t\t\t\t\t\t',d2_meta.number_investments.index[1],'\n\t\t\t\t\t\t\t\t\t\t\t',d3_meta.number_investments.index[1])
# 5. Third-best sector (based on count of investments)
print('Third-best sector (based on investments count)\t\t', d1_meta.number_investments.index[2],'\n\t\t\t\t\t\t\t\t\t',d2_meta.number_investments.index[2],'\n\t\t\t\t\t\t\t\t\t\t\t',d3_meta.number_investments.index[2])
# 6. Number of investments in the top sector (refer to point 3)
print('Total number of investments in the top sector\t\t', d1_meta.number_investments.loc[d1_meta.number_investments.index[0]],'\t\t',d2_meta.number_investments.loc[d1_meta.number_investments.index[0]],'\t\t',d3_meta.number_investments.loc[d1_meta.number_investments.index[0]])
# 7. Number of investments in the second-best sector (refer to point 4)
print('Total number of investments in the second-best sector\t', d1_meta.number_investments.loc[d1_meta.number_investments.index[1]],'\t\t',d2_meta.number_investments.loc[d1_meta.number_investments.index[1]],'\t\t',d3_meta.number_investments.loc[d1_meta.number_investments.index[1]])
# 8. Number of investments in the third-best sector (refer to point 5)
print('Total number of investmentsin the third-best sector\t', d1_meta.number_investments.loc[d1_meta.number_investments.index[2]],'\t\t',d2_meta.number_investments.loc[d1_meta.number_investments.index[2]],'\t\t',d3_meta.number_investments.loc[d1_meta.number_investments.index[2]])
top_d1_companies = D1[['company_permalink','name','raised_amount_million_usd']][D1['main_sector']==d1_meta.number_investments.index[0]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)
top_d2_companies = D2[['company_permalink','name','raised_amount_million_usd']][D2['main_sector']==d2_meta.number_investments.index[0]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)
top_d3_companies = D3[['company_permalink','name','raised_amount_million_usd']][D3['main_sector']==d3_meta.number_investments.index[0]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)
second_d1_companies = D1[['company_permalink','name','raised_amount_million_usd']][D1['main_sector']==d1_meta.number_investments.index[1]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)
second_d2_companies = D2[['company_permalink','name','raised_amount_million_usd']][D2['main_sector']==d2_meta.number_investments.index[1]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)
second_d3_companies = D3[['company_permalink','name','raised_amount_million_usd']][D3['main_sector']==d3_meta.number_investments.index[1]].groupby(['name','company_permalink']).sum().sort_values('raised_amount_million_usd',ascending=False)

# 9. For the top sector count-wise (point 3), which company received the highest investment?
print('company with highest investment in top sector\t\t',top_d1_companies.index[0][0],'\t',top_d2_companies.index[0][0],'\t',top_d3_companies.index[0][0])

#10. For the second-best sector count-wise (point 4), which company received the highest investment?
print('company with highest investment in second-best sector\t',second_d1_companies.index[0][0],'\t',second_d2_companies.index[0][0],'\t',second_d3_companies.index[0][0])


Value 							United States 	United Kingdom 	India
Total number of investments				 10698 		 528 		 307
Total amount(Million USD) of investments		 95678.87 	 4604.06 	 2739.67
Top sector (based on investments count)			 Others 
									 Others 
											 Others
Second-best sector (based on investments count)		 Social, Finance, Analytics, Advertising 
									 Social, Finance, Analytics, Advertising 
											 Social, Finance, Analytics, Advertising
Third-best sector (based on investments count)		 Cleantech / Semiconductors 
									 Cleantech / Semiconductors 
											 News, Search and Messaging
Total number of investments in the top sector		 2563 		 125 		 102
Total number of investments in the second-best sector	 2504 		 123 		 57
Total number of investmentsin the third-best sector	 2064 		 103 		 20
company with highest investment in top sector		 VIRTUSTREAM 	 ELECTRICCLOUD 	 FIRSTCRYCOM
company with highest investment in second-best sector	 SSTINCFORMERLYSHOTSPOTTER 	 CELLT

#### Table 5.1 : Sector-wise Investment Analysis
<table border ="1" width=700>
    <th>
        <td>United States</td>
        <td>United Kingdom</td>
        <td>India</td>
    </th>
    <tr>
        <td>1. Total number of investments</td>
        <td>10698</td>
        <td>528</td>
        <td>307</td>
    </tr>
    <tr>
        <td>2. Total amount(Million USD) of investments</td>
        <td>95678.87</td>
        <td>4604.06</td>
        <td>2739.67</td>
    </tr>
    <tr>
        <td>3. Top sector (based on investments count)</td>
        <td>Others</td>
        <td>Others</td>
        <td>Others</td>
    </tr>
    <tr>
        <td>4. Second-best sector (based on investments count)</td>
        <td>Social, Finance, Analytics, Advertising</td>
        <td>Social, Finance, Analytics, Advertising</td>
        <td>Social, Finance, Analytics, Advertising</td>
    </tr>
    <tr>
        <td>5. Third-best sector (based on investments count)</td>
        <td>Cleantech \ Semiconductors</td>
        <td>Cleantech \ Semiconductors</td>
        <td>News, Search and Messaging</td>
    </tr>
    <tr><td>6. Total number of investments in the top sector</td>
        <td>2563</td>
        <td>125</td>
        <td>102</td>
    </tr>
    <tr>
        <td>7. Total number of investments in the second-best sector</td>
        <td>2504</td>
        <td>123</td>
        <td>57</td>
    </tr>
    <tr>
        <td>8. Total number of investmentsin the third-best sector</td>
        <td>2064</td>
        <td>103</td>
        <td>20</td>
    </tr>
    <tr>
        <td>9. Company with highest investment in top sector</td>
        <td>Virtustream</td>
        <td>Electric Cloud</td>
        <td>First Cry</td>
    </tr>
    <tr>
        <td>10. Company with highest investment in second-best sector</td>
        <td>SST Inc. Formerly Shot Spotter</td>
        <td>Cell Tick Technologies</td>
        <td>Manthan Systems</td>
    </tr>
</table>

### Presentation -  Output files for Tableau plot

In [10]:
# Output three dataframes to three csv files that will be imported in Tableau for further analysis
#
top9['Country Name']=[name.upper() for name in list(top9_country_names.keys())]
top3 = (pd.concat([D1, D2, D3 ], axis = 0))
top3_country_names={' '.join( countries.get(country).apolitical_name.split()[:2]):country for country in top3.country_code}
#print(top3_country_names)
#{'United States': 'USA', 'United Kingdom': 'GBR', 'India': 'IND'}
D1['Country Name'] = 'UNITED STATES OF AMERICA'
D2['Country Name'] = 'UNITED KINGDOM'
D3['Country Name'] = 'INDIA'
top3 = (pd.concat([D1, D2, D3 ], axis = 0))
#
# Output master frame from section 2.1
master_frame.to_csv("total_investments.csv", sep=',')
# Output top 9 countries from section 3.1 
top9.to_csv("top9_analysis.csv", sep=',')
# Output top 3 country data frames that has additional data from sections 5.1 and 5.2
top3.to_csv("top3_analysis.csv", sep=',')

In [11]:
print('\nTime elasped: ', datetime.now() - startTime)
#


Time elasped:  0:00:35.007022


#### Assess the time taken for code execution
##### For larger data frames this would be very good checkmark to optimize code execution; for dataset of this size, we believe that the performance for the analysis questions in this workbook is optimum

`Time elasped:  0:00:30.659647`
