## Investment Case Group Project 
>Submitted By       : ***Srinivasan. G, Naveed.J, Denny.J and Kumar.A***<br>
>Date of Submission : ***04-November-2018***
### Project Details
>    An asset management company, Spark Funds, wants to invest in a few companies. The CEO of Spark Funds wants to understand global investment trends so that she can make investment decisions effectively.

### Constraints for Investments
>1. Spark Funds wants to invest between `5 to 15 million` USD per round of investment.
>2. It wants to invest only in `English-speaking` countries because of the ease of communication with the companies.
>3. A country is considered English-speaking only if English is one of its official languages.

### Business objective
>1. The overall strategy is to invest where others are investing, implying that the `Best` sectors and countries are the ones `where most investors are investing`.
### Goals
>1. ***Investment type analysis:*** Comparing the typical investment amounts in the `venture, seed, angel, private equity` etc. types so that Spark Funds can choose the type best suited to its strategy.
>2. ***Country analysis:*** Identifying the countries that have been the most heavily invested in in the past.  
>3. ***Sector analysis:*** Understanding the distribution of investments across the eight main sectors. (Note that we are interested in the eight `'main sectors'`.)
<br>
### Data Files Used for Analysis
>1. **`companies.txt`** : A file with basic data about companies.
>2. **`rounds2.csv`**    : A file with details of company funding rounds.
>3. **`mapping.csv`**   : A file with sectors classified into 8 broad categories.
>4. **`Countries_where_English_is_an_official_language.pdf`** : A list of countries where English is an official language.

### Checkpoint 1: Data Cleaning 
>  Load the companies and rounds data into two data frames and name them `companies` and `rounds2` respectively.<br>

###  Results Expected: Table-1.1
>1. How many unique companies are present in rounds2?
>2. How many unique companies are present in companies?
>3. In the companies data frame, which column can be used as the unique key for each  company? Write the name of the column.
>4. Are there any companies in the rounds2 file which are not present in companies? Answer yes or no: Y/N
>5. Merge the two data frames so that all variables (columns) in the companies frame are added to the rounds2 data frame. Name the merged frame master_frame. How many observations are present in master_frame?

### To Do List - Check Point-1
>1. Load the companies and rounds2 files into data frames and handle encoding issues<br>
>2. Check for duplicates in companies and rounds2  
>3. Check for null values in companies and rounds2
>4. Check the unique counts of companies in companies and rounds2
>5. Identify the unique key column
>6. Check for companies present in rounds2 but not in companies 
>7. Merge companies and rounds2 on the key column as master_frame
>8. Remove unnecessary and duplicate columns that are not required for analysis
>9. Check the percentage of nulls (NaN) in each of the remaining merged columns
>10. Remove rows with nulls in columns that have a high percentage of nulls, keeping nulls to a minimum
>11. If required, impute null values in columns that are absolutely required for analysis
>12. Ensure a good percentage of clean data remains for analysis

### Assumptions
>- The data provided includes companies with “Closed” status.
   - These have not been removed from the analysis, as the number of closed companies in the filtered data was found to be negligible.
   - It is assumed that their effect on the analysis can be ignored.
>- The data provided includes companies that were last funded as far back as 1982.
   - The business goal is to invest where others are investing, but the data in this older category was found to be negligible.

#### IMPORTANT DISCLAIMER!!<br>
- The package **`PDFMINER`** must be installed to extract data from the PDF file provided in the case study, ***`Countries_where_English_is_an_official_language.pdf`***. <br><br>

- **`PDFMINER`** can be installed with the command ***`pip install pdfminer.six`***.<br><br>

- The consolidated list of country codes and names is taken from ***`https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv`***, as referenced in the upGrad discussion forum, and is loaded inline as a dictionary named **`countries_code`**.  <br><br>

In [1]:
# Import the NumPy and pandas packages and the IPython modules used for interactive display.

import numpy as np
import pandas as pd
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Some pandas set_options 
pd.set_option('display.max_rows',8000)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)

# Rounding of float format value used for column such as raised_amount_usd
pd.options.display.float_format = '{:,.2f}'.format

## Load Data<br>  
**Note:** Load `companies.txt` (tab-delimited) and `rounds2.csv` into data frames. The files are encoded in `latin-1/iso-8859-1` and contain a few special characters.

In [2]:
# Read companies & rounds2 data.

companies = pd.read_csv("Data/companies.txt",sep="\t",encoding='iso-8859-1')
rounds2 = pd.read_csv("Data/rounds2.csv",sep=",",encoding='iso-8859-1')

In [3]:
# Analyse the shape and column details of the companies and rounds2 data frames.

companies.shape
companies.info()

rounds2.shape
rounds2.info()

(66368, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66368 entries, 0 to 66367
Data columns (total 10 columns):
permalink        66368 non-null object
name             66367 non-null object
homepage_url     61310 non-null object
category_list    63220 non-null object
status           66368 non-null object
country_code     59410 non-null object
state_code       57821 non-null object
region           58338 non-null object
city             58340 non-null object
founded_at       51147 non-null object
dtypes: object(10)
memory usage: 5.1+ MB


(114949, 6)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114949 entries, 0 to 114948
Data columns (total 6 columns):
company_permalink          114949 non-null object
funding_round_permalink    114949 non-null object
funding_round_type         114949 non-null object
funding_round_code         31140 non-null object
funded_at                  114949 non-null object
raised_amount_usd          94959 non-null float64
dtypes: float64(1), object(5)
memory usage: 5.3+ MB


***Result:***
- Companies.txt - **`66368 rows`** and **`10 columns`** 
- rounds2.csv -   **`114949 rows`** and **`6 columns`**.

In [4]:
# view data.

companies.tail()
rounds2.tail()

,permalink,name,homepage_url,category_list,status,country_code,state_code,region,city,founded_at
66363,/Organization/Zznode-Science-And-Technology-Co-Ltd,ZZNode Science and Technology,http://www.zznode.com,Enterprise Software,operating,CHN,22,Beijing,Beijing,
66364,/Organization/Zzzzapp-Com,Zzzzapp Wireless ltd.,http://www.zzzzapp.com,Advertising|Mobile|Web Development|Wireless,operating,HRV,15,Split,Split,13-05-2012
66365,/Organization/ÃEron,ÃERON,http://www.aeron.hu/,,operating,,,,,01-01-2011
66366,/Organization/ÃAsys-2,Ãasys,http://www.oasys.io/,Consumer Electronics|Internet of Things|Telecommunications,operating,USA,CA,SF Bay Area,San Francisco,01-01-2014
66367,/Organization/Ä°Novatiff-Reklam-Ve-Tanä±Tä±M-Hizmetleri-Tic,Ä°novatiff Reklam ve TanÄ±tÄ±m Hizmetleri Tic,http://inovatiff.com,Consumer Goods|E-Commerce|Internet,operating,,,,,


,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd
114944,/organization/zzzzapp-com,/funding-round/8f6d25b8ee4199e586484d817bceda05,convertible_note,,01-03-2014,41313.0
114945,/ORGANIZATION/ZZZZAPP-COM,/funding-round/ff1aa06ed5da186c84f101549035d4ae,seed,,01-05-2013,32842.0
114946,/organization/ãeron,/funding-round/59f4dce44723b794f21ded3daed6e4fe,venture,A,01-08-2014,
114947,/ORGANIZATION/ÃASYS-2,/funding-round/35f09d0794651719b02bbfd859ba9ff5,seed,,01-01-2015,18192.0
114948,/organization/ä°novatiff-reklam-ve-tanä±tä±m-hizmetleri-tic,/funding-round/af942869878d2cd788ef5189b435ebc4,grant,,01-10-2013,14851.0


#### Observation: #### 
>- There are some special characters in the files. These special characters have to be removed from both companies and rounds2 (e.g., rows 66365, 66366 and 66367 in the `permalink` and `name` columns of companies, and rows 114946, 114947 and 114948 in the `company_permalink` column of rounds2).<br>

#### Inference: #### 
>- The special characters should be `removed` from the impacted columns of the companies and rounds2 data frames.
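
One way to remove such characters is an encode/decode round trip that discards every non-ASCII byte. The following is a minimal sketch of that idea on a tiny sample; the sample values are hypothetical and are not taken from `companies.txt` or `rounds2.csv`.

```python
import pandas as pd

# Hypothetical permalinks; the first one contains a non-ASCII character.
sample = pd.Series(["/organization/café-1", "/organization/plain"])

# Encode each string to UTF-8 bytes, then decode as ASCII and silently
# discard every byte that is not valid ASCII.
cleaned = sample.str.encode("utf-8").str.decode("ascii", errors="ignore")

print(cleaned.tolist())
```

Note that this approach deletes the affected characters rather than transliterating them, so two different names could in principle collapse to the same key after cleaning.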

In [5]:
# Strip the special (non-ASCII) characters from the companies data frame.
# Inference : The special characters in the permalink & name columns
#             come from the Western European character set.
#
# Encoding to UTF-8 and decoding as ASCII with errors='ignore' drops them.

companies['permalink'] = companies.permalink.str.encode('utf-8').\
                  str.decode('ascii',errors='ignore')
companies['name'] = companies.name.str.encode('utf-8').\
                  str.decode('ascii',errors='ignore')
companies.tail()

,permalink,name,homepage_url,category_list,status,country_code,state_code,region,city,founded_at
66363,/Organization/Zznode-Science-And-Technology-Co-Ltd,ZZNode Science and Technology,http://www.zznode.com,Enterprise Software,operating,CHN,22,Beijing,Beijing,
66364,/Organization/Zzzzapp-Com,Zzzzapp Wireless ltd.,http://www.zzzzapp.com,Advertising|Mobile|Web Development|Wireless,operating,HRV,15,Split,Split,13-05-2012
66365,/Organization/Eron,ERON,http://www.aeron.hu/,,operating,,,,,01-01-2011
66366,/Organization/Asys-2,asys,http://www.oasys.io/,Consumer Electronics|Internet of Things|Telecommunications,operating,USA,CA,SF Bay Area,San Francisco,01-01-2014
66367,/Organization/Novatiff-Reklam-Ve-TanTM-Hizmetleri-Tic,novatiff Reklam ve Tantm Hizmetleri Tic,http://inovatiff.com,Consumer Goods|E-Commerce|Internet,operating,,,,,


In [6]:
# Strip the special (non-ASCII) characters from the rounds2 data frame.
# Inference : The special characters in the company_permalink column
#             come from the Western European character set.
# The company_permalink and funding_round_permalink columns are cleaned as below.

rounds2['company_permalink'] = rounds2.company_permalink.str.encode('utf-8').\
                  str.decode('ascii',errors='ignore')

rounds2['funding_round_permalink'] = rounds2.funding_round_permalink.\
                                       str.encode('utf-8').\
                                       str.decode('ascii',errors='ignore')

rounds2.tail()

,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd
114944,/organization/zzzzapp-com,/funding-round/8f6d25b8ee4199e586484d817bceda05,convertible_note,,01-03-2014,41313.0
114945,/ORGANIZATION/ZZZZAPP-COM,/funding-round/ff1aa06ed5da186c84f101549035d4ae,seed,,01-05-2013,32842.0
114946,/organization/eron,/funding-round/59f4dce44723b794f21ded3daed6e4fe,venture,A,01-08-2014,
114947,/ORGANIZATION/ASYS-2,/funding-round/35f09d0794651719b02bbfd859ba9ff5,seed,,01-01-2015,18192.0
114948,/organization/novatiff-reklam-ve-tantm-hizmetleri-tic,/funding-round/af942869878d2cd788ef5189b435ebc4,grant,,01-10-2013,14851.0


#### Observation: #### 
>- The `'company_permalink'` column in rounds2 contains the same value in different rows with different letter case. For example, both `'/ORGANIZATION/0XDATA'` and `'/organization/0xdata'` occur. The values therefore need to be converted to a common case (e.g., lower case) before the column is used as the merge key.<br>
>- Similarly, the `'permalink'` column in the companies data frame has mixed-case values, so it is converted to lower case for the join to work correctly.

#### Inference: #### 
>- Converting the key column values to a common `case` lets the data frames merge correctly.
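
The effect of mixed-case keys on unique counts can be seen on a small example. This is only a sketch: the frame `rounds_demo` is hypothetical, although the two `0xdata` spellings are the ones quoted in the observation.

```python
import pandas as pd

# Hypothetical miniature of rounds2 with one key written in two cases.
rounds_demo = pd.DataFrame(
    {
        "company_permalink": [
            "/ORGANIZATION/0XDATA",
            "/organization/0xdata",
            "/organization/abc",
        ]
    }
)

# Before normalisation the two spellings count as different companies.
unique_before = rounds_demo["company_permalink"].nunique()

# Lower-casing maps both spellings to the same key.
rounds_demo["company_permalink"] = rounds_demo["company_permalink"].str.lower()
unique_after = rounds_demo["company_permalink"].nunique()

print(unique_before, unique_after)
```

Without this step, a case-sensitive join would miss matches and the unique-company count would be overstated.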

In [7]:
# Convert the key column of companies (permalink) to lower case
companies['permalink'] = companies['permalink'].str.lower()

# Convert  key columns of rounds2 (company_permalink and funding_round_permalink) 
# to lower case
rounds2['company_permalink'] = rounds2['company_permalink'].str.lower()
rounds2['funding_round_permalink'] = rounds2['funding_round_permalink'].str.lower()

# Verify companies & rounds 
companies.tail()
rounds2.tail()

,permalink,name,homepage_url,category_list,status,country_code,state_code,region,city,founded_at
66363,/organization/zznode-science-and-technology-co-ltd,ZZNode Science and Technology,http://www.zznode.com,Enterprise Software,operating,CHN,22,Beijing,Beijing,
66364,/organization/zzzzapp-com,Zzzzapp Wireless ltd.,http://www.zzzzapp.com,Advertising|Mobile|Web Development|Wireless,operating,HRV,15,Split,Split,13-05-2012
66365,/organization/eron,ERON,http://www.aeron.hu/,,operating,,,,,01-01-2011
66366,/organization/asys-2,asys,http://www.oasys.io/,Consumer Electronics|Internet of Things|Telecommunications,operating,USA,CA,SF Bay Area,San Francisco,01-01-2014
66367,/organization/novatiff-reklam-ve-tantm-hizmetleri-tic,novatiff Reklam ve Tantm Hizmetleri Tic,http://inovatiff.com,Consumer Goods|E-Commerce|Internet,operating,,,,,


,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd
114944,/organization/zzzzapp-com,/funding-round/8f6d25b8ee4199e586484d817bceda05,convertible_note,,01-03-2014,41313.0
114945,/organization/zzzzapp-com,/funding-round/ff1aa06ed5da186c84f101549035d4ae,seed,,01-05-2013,32842.0
114946,/organization/eron,/funding-round/59f4dce44723b794f21ded3daed6e4fe,venture,A,01-08-2014,
114947,/organization/asys-2,/funding-round/35f09d0794651719b02bbfd859ba9ff5,seed,,01-01-2015,18192.0
114948,/organization/novatiff-reklam-ve-tantm-hizmetleri-tic,/funding-round/af942869878d2cd788ef5189b435ebc4,grant,,01-10-2013,14851.0


#### Observations: <br>
>- There appear to be a large number of null values (NaN) and duplicates in both companies and rounds2. 
  - Find the columns that contain nulls and their null counts. 
  - Find the number of duplicates. 
  - Find the unique items in the key columns (`permalink` and `company_permalink`).
  
#### Inference: <br>
>- The duplicates and null values should be removed as part of data cleansing.
>- Keep the key columns unique for joining.

In [8]:
# Count the missing values (NaN) in each column
# of the companies and rounds2 data frames.

companies.isnull().sum()
rounds2.isnull().sum()

permalink            0
name                 1
homepage_url      5058
category_list     3148
status               0
country_code      6958
state_code        8547
region            8030
city              8028
founded_at       15221
dtype: int64

company_permalink              0
funding_round_permalink        0
funding_round_type             0
funding_round_code         83809
funded_at                      0
raised_amount_usd          19990
dtype: int64

###### Results for Table 1.1 - Understand the Dataset.
**Question 1:** How many unique companies are present in rounds2?<br>

In [9]:
# Results for Table 1.1 Understand the Dataset.
# Inference as below.

# Number of duplicate companies present in rounds2 data frame.
print("The Number of duplicate company_permalink present in rounds2 is :",\
                             rounds2.company_permalink.duplicated().sum())

# Number unique companies (permalink) present in rounds2.

print("The Number of unique company_permalink present in rounds2 is :",\
                                       rounds2['company_permalink'].nunique(dropna=False))

The Number of duplicate company_permalink present in rounds2 is : 48581
The Number of unique company_permalink present in rounds2 is : 66368


**Answer:**
>1. The number of duplicated company entries (company_permalink) in rounds2: `48581`<br>
>2. The total number of unique companies present in rounds2: `66368`<br>

###### Results for Table 1.1 - Understand the Dataset.
>**Question 2:** How many unique companies are present in companies?<br>

In [10]:
# Number of duplicate companies present in companies data frame.
# Inference as Below.
print("The Number of duplicate permalink present in companies",\
                        companies.permalink.duplicated().sum())

# Number unique companies (permalink) present in companies.
print("The Number of unique permalink present in companies",\
                               companies.permalink.nunique(dropna=False))

The Number of duplicate permalink present in companies 0
The Number of unique permalink present in companies 66368


**Answer:**
>1. The number of duplicated company entries in companies: `0`<br>
>2. The total number of unique companies present in companies: `66368`<br>

###### Results for Table 1.1 - Understand the Dataset.
>**Question 3:** In the companies data frame, which column can be used as the unique key for each company? Write the name of the column.<br>
>**Answer:** `permalink`<br>

###### Results for Table 1.1 - Understand the Dataset.
>**Question 4:** Are there any companies in the rounds2 file which are not present in companies?<br>

In [11]:
# Verify that every company_permalink in rounds2 is present in the permalink column
# of companies, to ensure nothing is missing in the key columns.
# Inference : An empty result indicates that no key in rounds2 is missing from companies.

rounds2.loc[~rounds2['company_permalink'].isin(companies['permalink'])]

,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd


>**Answer (Yes or No):** `No`<br>
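
The same question can also be answered with a left join and pandas' merge indicator, which labels each row by where it matched. The sketch below uses two hypothetical miniature frames (`companies_demo`, `rounds_demo`) rather than the case-study data.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
companies_demo = pd.DataFrame({"permalink": ["/organization/a", "/organization/b"]})
rounds_demo = pd.DataFrame(
    {"company_permalink": ["/organization/a", "/organization/a", "/organization/c"]}
)

# Left-join rounds onto companies; the _merge column records where each row matched.
checked = rounds_demo.merge(
    companies_demo,
    left_on="company_permalink",
    right_on="permalink",
    how="left",
    indicator=True,
)

# Rows flagged "left_only" exist in rounds but have no matching company.
missing = checked.loc[checked["_merge"] == "left_only", "company_permalink"]
print(missing.tolist())
```

An empty `missing` series would correspond to the answer `No` given above.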

###### Results for Table 1.1 - Understand the Dataset.
>**Question 5:** Merge the two data frames so that all variables (columns) in the companies frame are added to the rounds2 data frame. Name the merged frame master_frame. How many observations are present in master_frame?<br> 

In [12]:
# How many observations are present in master_frame?
# Inference: The two frames are joined on their key columns using an inner join.

master_frame = pd.merge(left=rounds2,right=companies,left_on='company_permalink',\
                                       right_on='permalink',how="inner")
master_frame.shape

(114949, 16)

>**Answer:** The rounds2 and companies data frames are merged with an `Inner Join`, using `company_permalink` and `permalink` respectively as keys. The resulting `master_frame` has **`114949`** rows and 16 columns.
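
A merge like this keeps the row count of the rounds frame only if the company key is unique. As a hedged sketch, pandas' `validate` argument can assert that many-to-one relationship at merge time; the frames below are hypothetical stand-ins for `rounds2` and `companies`.

```python
import pandas as pd

# Hypothetical miniature frames: several rounds per company, one row per company.
companies_demo = pd.DataFrame(
    {"permalink": ["/organization/a", "/organization/b"], "name": ["A", "B"]}
)
rounds_demo = pd.DataFrame(
    {
        "company_permalink": ["/organization/a", "/organization/a", "/organization/b"],
        "raised_amount_usd": [1.0e6, 2.0e6, 5.0e5],
    }
)

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand
# key (permalink) is not unique, so a duplicated company cannot inflate rows.
merged = pd.merge(
    rounds_demo,
    companies_demo,
    left_on="company_permalink",
    right_on="permalink",
    how="inner",
    validate="many_to_one",
)

# With a unique right-hand key and no unmatched rounds, the row count is preserved.
print(merged.shape)
```

This mirrors the result above, where the merged frame has the same 114949 rows as rounds2.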

## Data Cleansing

In [13]:
# Check the columns have missing data - 
# Total number of NaN values in each column in master_frame.

master_frame.isnull().sum()

company_permalink              0
funding_round_permalink        0
funding_round_type             0
funding_round_code         83809
funded_at                      0
raised_amount_usd          19990
permalink                      0
name                           1
homepage_url                6134
category_list               3410
status                         0
country_code                8678
state_code                 10946
region                     10167
city                       10164
founded_at                 20521
dtype: int64

In [14]:
# Evaluate the missing data in the master data frame columns as percentages

round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2)

company_permalink          0.00
funding_round_permalink    0.00
funding_round_type         0.00
funding_round_code        72.91
funded_at                  0.00
raised_amount_usd         17.39
permalink                  0.00
name                       0.00
homepage_url               5.34
category_list              2.97
status                     0.00
country_code               7.55
state_code                 9.52
region                     8.84
city                       8.84
founded_at                17.85
dtype: float64

####  Observation :
>- Several columns in the master frame have a high percentage of null values.<br>
>- For example, the columns `'funding_round_code'`, `'raised_amount_usd'` and `'founded_at'` have a high percentage of missing data.

#### Inference: 
>- Columns with a large number of nulls should be identified and cleaned.
>- Remove the columns with a high percentage of nulls if they are not necessary for the analysis.<br>
>- Remove rows with nulls in the remaining columns that have a high null percentage.<br>
>- Based on the above, the following columns are either not required for this case study or are duplicates, and can be dropped:
  - `funding_round_code`
  - `founded_at`
  - `company_permalink (duplicate of permalink)`
  - `homepage_url`
  - `state_code`
  - `region`
  - `city`
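
The null percentages can also be computed as the mean of the boolean null mask, and candidate columns can be listed against a cut-off. This is a sketch on hypothetical data; the `threshold` value is an assumption for illustration, since the notebook chooses the columns to drop by relevance to the analysis rather than by a fixed cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column.
demo = pd.DataFrame(
    {
        "permalink": ["a", "b", "c", "d"],
        "funding_round_code": [np.nan, np.nan, np.nan, "A"],
        "raised_amount_usd": [1.0, np.nan, 3.0, 4.0],
    }
)

# The mean of a boolean null mask is the fraction of nulls in each column.
null_pct = (demo.isnull().mean() * 100).round(2)

threshold = 50.0  # hypothetical cut-off, not a value used in this case study
to_drop = null_pct[null_pct > threshold].index.tolist()

# Drop the flagged columns and keep the rest.
trimmed = demo.drop(columns=to_drop)
print(null_pct.to_dict(), to_drop)
```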

In [15]:
# Columns dropped from master_frame, with the reason for each:
# 1. funding_round_code - Highest percentage of NaN values.
# 2. founded_at - Not useful for case study analysis
# 3. company_permalink - Duplicate column of permalink (post merge of rounds2 & companies)
# 4. homepage_url - Not used for analysis (contains null values)
# 5. state_code - Not used (contains null values)
# 6. region - Not used 
# 7. city  - Not used


column_list = ['funding_round_code',
               'founded_at',
               'company_permalink',
               'homepage_url',
               'state_code',
               'region',
               'city']
# columns= already targets the column axis, so axis=1 is not needed.
master_frame.drop(columns=column_list,inplace=True)

In [16]:
# Find out the missing data again for percentage missing data 
# after removal of columns above.

round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2)

funding_round_permalink    0.00
funding_round_type         0.00
funded_at                  0.00
raised_amount_usd         17.39
permalink                  0.00
name                       0.00
category_list              2.97
status                     0.00
country_code               7.55
dtype: float64

####  Observation :
>- Highest Null percentage is for column `raised_amount_usd`.
>- Followed by `country_code` and `category_list`.

#### Inference: 
>- `raised_amount_usd` is a critical column in our analysis, so its missing values must either be removed or imputed.
>- Since the percentage of NaN values in `raised_amount_usd` (~17%) is comparatively low against the total, the rows with NaN values can be removed safely.<br>
>- Rows with NaN in this column are not useful for the analysis, because the column is critical.<br><br>

In [17]:
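# Keep only the rows where raised_amount_usd is not null (NaN).
# .copy() makes master_frame an independent frame, so the later in-place
# edits do not trigger pandas' SettingWithCopyWarning.
master_frame = master_frame[master_frame['raised_amount_usd'].notnull()].copy()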

# Rerun percentage NaN Values on columns.
master_frame.isnull().sum()
round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2)

funding_round_permalink       0
funding_round_type            0
funded_at                     0
raised_amount_usd             0
permalink                     0
name                          1
category_list              1044
status                        0
country_code               5851
dtype: int64

funding_round_permalink   0.00
funding_round_type        0.00
funded_at                 0.00
raised_amount_usd         0.00
permalink                 0.00
name                      0.00
category_list             1.10
status                    0.00
country_code              6.16
dtype: float64

#### Observation :  
>- The null percentage is `0` now for raised_amount_usd column.
>- There are `1, 1044 and 5851` rows with nulls for `'name', 'category_list' and 'country_code'` columns respectively.

#### Inference   :
>- The single missing `name` value can be imputed with the company name taken from the corresponding `permalink` column.
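
Instead of typing the replacement by hand, the fallback name can be derived from the last path segment of the permalink. A minimal sketch, assuming permalinks of the form `/organization/<slug>`; the frame `demo` and its second row are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the second company is made up for illustration.
demo = pd.DataFrame(
    {
        "permalink": ["/organization/tell-it-in", "/organization/acme"],
        "name": [None, "Acme"],
    }
)

# The last path segment of the permalink serves as a stand-in company name.
fallback = demo["permalink"].str.rsplit("/", n=1).str[-1]

# fillna aligns on the index, so only the missing names are replaced.
demo["name"] = demo["name"].fillna(fallback)
print(demo["name"].tolist())
```

Existing names are left untouched, which makes the same line usable if more than one name were missing.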

In [18]:
# Identify the one row with a NaN name
master_frame.loc[master_frame.name.isnull(),:]

,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name,category_list,status,country_code
98692,/funding-round/9c987e616755a78c51a4aa67c27a2a93,seed,01-03-2012,25000.0,/organization/tell-it-in,,Startups,closed,USA


In [19]:
# Replace the value with permalink company name.
master_frame.loc[master_frame.name.isnull(),'name'] = 'tell-it-in'
master_frame.loc[master_frame.name.isnull(),:]
master_frame.isnull().sum()
master_frame.shape

,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name,category_list,status,country_code


funding_round_permalink       0
funding_round_type            0
funded_at                     0
raised_amount_usd             0
permalink                     0
name                          0
category_list              1044
status                        0
country_code               5851
dtype: int64

(94959, 9)

In [20]:
# Calculate the percentage wise null for each column of master.
round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2)

funding_round_permalink   0.00
funding_round_type        0.00
funded_at                 0.00
raised_amount_usd         0.00
permalink                 0.00
name                      0.00
category_list             1.10
status                    0.00
country_code              6.16
dtype: float64

In [21]:
# Percentage of rows left after removing Nan and not so useful 
# columns for the analysis from the master data frame.

print("The percentage of row left after cleansing: ",\
        round(100*(len(master_frame.index)/114949),2),"%")

The percentage of row left after cleansing:  82.61 %


In [22]:
# The columns country_code & category_list still have NaN values.

master_frame['country_code'].isnull().sum()
master_frame['category_list'].isnull().sum()

5851

1044

#### Observation:
>- Null values still exist in the `country_code` and `category_list` columns, in 5851 and 1044 rows respectively.

#### Inference :
>- To proceed with the data cleansing process, the following has been decided for now:<br>
>- The conclusion below follows the discussion forum thread https://learn.upgrad.com/v/course/208/question/93880 (updated on 30-Oct-2018).
>- The percentage of NaN values in ***`country_code` & `category_list`*** is small, and these columns cannot be imputed with any useful values.
>- Hence these rows can be safely dropped from the master frame.

In [23]:
# Drop the rows with NaN values in country_code & category_list from master.

master_frame.dropna(axis='index',subset=['country_code','category_list'],inplace=True)

# Verify that the NaN rows were dropped for country_code and category_list

master_frame.shape
len(master_frame.loc[master_frame.country_code.isnull(),:].index)
len(master_frame.loc[master_frame.category_list.isnull(),:].index)

(88529, 9)

0

0

In [24]:
# Calculate the percentage wise null for each column of master.

round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2)

funding_round_permalink   0.00
funding_round_type        0.00
funded_at                 0.00
raised_amount_usd         0.00
permalink                 0.00
name                      0.00
category_list             0.00
status                    0.00
country_code              0.00
dtype: float64

In [25]:
# Finally, view the percentage rows left after complete cleaning.

print("The percentage of row left after cleansing: ",\
                  round(100*(len(master_frame.index)/114949),2),"%")

#View the number of rows and columns of master.

master_frame.shape

The percentage of row left after cleansing:  77.02 %


(88529, 9)

#### Observation:
>- Unnecessary or duplicate columns, and rows with null values that are not useful for the analysis, have been removed.

#### Inference:
>- About ***`23%`*** of the total rows were removed by dropping rows with null values, and `77.02%` of the merged master frame remains.
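
The retained and removed shares follow directly from the row counts reported by the cells above. The short calculation below simply restates that arithmetic; the variable names are illustrative.

```python
rows_merged = 114949       # master_frame rows right after the merge
rows_with_amount = 94959   # rows left after dropping null raised_amount_usd
rows_final = 88529         # rows left after dropping null country_code / category_list

# Share of the merged rows that survives each cleansing step.
pct_after_amount = round(100 * rows_with_amount / rows_merged, 2)
pct_final = round(100 * rows_final / rows_merged, 2)

# Share of the merged rows removed overall.
pct_removed = round(100 - 100 * rows_final / rows_merged, 2)

print(pct_after_amount, pct_final, pct_removed)
```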

In [26]:
# Verify master_frame data.
# master_frame

In [27]:
# Export the data to csv file.

master_frame.to_csv("Data/master_frame.csv")


### Check-point 2 - Funding Type Analysis (Investment type Analysis)
>- Spark Funds wants to choose one of these four investment types for each potential investment they will make.
   - Venture
   - Angel
   - Seed
   - Private equity 
>- Calculate the average investment amount for each of the above four funding types (venture, angel, seed, and private equity) and report the answers in Table 2.1.
>- Based on the average investment amount calculated above, which investment type do you think is the most suitable for Spark Funds?

>- ***CONSTRAINT*** : Only Invest between 5 to 15 million USD per round of investment.

###  Results Expected: Table 2.1
>1. Average funding amount of venture type 
>2. Average funding amount of angel type
>3. Average funding amount of seed type
>4. Average funding amount of private equity type
>5. Considering that Spark Funds wants to invest between 5 to 15 million USD per investment round, which investment type is the most suitable for them?

### To Do List - Check Point-2
>1. Group the master frame by `funding_round_type` for the raised amount 
>2. Calculate the average funding amount for each type 
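
The two steps, followed by the 5 to 15 million USD constraint, can be sketched as below. The rounds in `demo` are hypothetical values chosen for illustration and are not the case-study data.

```python
import pandas as pd

# Hypothetical funding rounds, for illustration only.
demo = pd.DataFrame(
    {
        "funding_round_type": [
            "venture", "venture", "seed", "seed", "angel", "private_equity",
        ],
        "raised_amount_usd": [10e6, 12e6, 0.5e6, 1.0e6, 0.9e6, 70e6],
    }
)

# Step 1 and 2: group by funding type and average the raised amount.
avg_by_type = demo.groupby("funding_round_type")["raised_amount_usd"].mean()

# Keep the types whose average falls in the 5 to 15 million USD band.
# Series.between is inclusive on both ends by default.
suitable = avg_by_type[avg_by_type.between(5e6, 15e6)]
print(suitable)
```

On this toy data only `venture` falls inside the band; the real answer depends on the averages computed from `master_frame`.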

In [28]:
# Check: this returns the actual values of the raised amount

master_frame.loc[:,"raised_amount_usd"].head()

0   10,000,000.00
2      700,000.00
4    2,000,000.00
6       41,250.00
7       43,360.00
Name: raised_amount_usd, dtype: float64

In [29]:
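# The columns of interest are raised_amount_usd and funding_round_type.
# Find the average of raised_amount_usd for each funding type.
# Selecting the numeric column explicitly keeps the text columns out of the
# mean, which recent pandas versions would otherwise reject.

master_frame.groupby('funding_round_type')[['raised_amount_usd']].mean()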

funding_round_type,raised_amount_usd
angel,971573.89
convertible_note,1337186.65
debt_financing,17167653.47
equity_crowdfunding,509897.97
grant,4512698.29
non_equity_assistance,480753.38
post_ipo_debt,169451789.77
post_ipo_equity,66077058.57
private_equity,73938486.28
product_crowdfunding,1353226.91


In [30]:
# Group master data on funding type and calculate the average of funding amount

inv_frame_all_avg=master_frame.groupby('funding_round_type')['raised_amount_usd']\
                    .mean().sort_values(ascending=False)

inv_frame_all_avg

funding_round_type
post_ipo_debt           169,451,789.77
secondary_market         84,438,532.25
private_equity           73,938,486.28
post_ipo_equity          66,077,058.57
debt_financing           17,167,653.47
undisclosed              15,891,661.39
venture                  11,724,222.69
grant                     4,512,698.29
product_crowdfunding      1,353,226.91
convertible_note          1,337,186.65
angel                       971,573.89
seed                        747,793.68
equity_crowdfunding         509,897.97
non_equity_assistance       480,753.38
Name: raised_amount_usd, dtype: float64

In [31]:
# Group master data on funding type and calculate the median on funding raised amount

inv_frame_all_med = master_frame.groupby('funding_round_type')['raised_amount_usd']\
                        .median().sort_values(ascending=False)
master_frame.shape
inv_frame_all_med

(88529, 9)

funding_round_type
secondary_market        45,850,000.00
private_equity          20,000,000.00
post_ipo_debt           19,900,000.00
post_ipo_equity         12,262,852.50
venture                  5,000,000.00
undisclosed              1,100,000.00
debt_financing           1,096,653.00
angel                      414,906.00
seed                       300,000.00
convertible_note           300,000.00
grant                      225,000.00
product_crowdfunding       211,500.00
equity_crowdfunding         85,000.00
non_equity_assistance       60,000.00
Name: raised_amount_usd, dtype: float64

In [32]:
# Group master data on funding type and calculate the count of investments 
# made in each funding type

inv_frame_all_count = master_frame.groupby('funding_round_type')['raised_amount_usd']\
                        .count().sort_values(ascending=False)
inv_frame_all_count

funding_round_type
venture                  47809
seed                     21095
debt_financing            6506
angel                     4400
grant                     1939
private_equity            1820
undisclosed               1345
convertible_note          1320
equity_crowdfunding       1128
post_ipo_equity            598
product_crowdfunding       330
post_ipo_debt              151
non_equity_assistance       60
secondary_market            28
Name: raised_amount_usd, dtype: int64

In [33]:
# To find the Average/Median for Investment types venture, angel, seed and private equity.

inv_frame = master_frame.set_index(['funding_round_type'])
fil_frame = inv_frame.loc[inv_frame.index\
                    .isin(['venture','angel','seed','private_equity'])]

In [34]:
# Verify data frame with four funding types.

fil_frame.shape
fil_frame.info()
fil_frame.index.value_counts()

(75124, 8)

<class 'pandas.core.frame.DataFrame'>
Index: 75124 entries, venture to seed
Data columns (total 8 columns):
funding_round_permalink    75124 non-null object
funded_at                  75124 non-null object
raised_amount_usd          75124 non-null float64
permalink                  75124 non-null object
name                       75124 non-null object
category_list              75124 non-null object
status                     75124 non-null object
country_code               75124 non-null object
dtypes: float64(1), object(7)
memory usage: 5.2+ MB


venture           47809
seed              21095
angel              4400
private_equity     1820
Name: funding_round_type, dtype: int64

In [35]:
# Group fil_frame data on funding type and 
# calculate the average of funding amount for 4 funding types

inv_frame_four_avg = fil_frame.groupby('funding_round_type')['raised_amount_usd'].\
                                            mean().sort_values(ascending=False)
inv_frame_four_avg

# Group fil_frame data on funding type and 
# calculate the median of funding amount for the 4 funding types

inv_frame_four_med = fil_frame.groupby('funding_round_type')['raised_amount_usd'].\
                                            median().sort_values(ascending=False)
inv_frame_four_med

funding_round_type
private_equity   73,938,486.28
venture          11,724,222.69
angel               971,573.89
seed                747,793.68
Name: raised_amount_usd, dtype: float64

funding_round_type
private_equity   20,000,000.00
venture           5,000,000.00
angel               414,906.00
seed                300,000.00
Name: raised_amount_usd, dtype: float64

In [36]:
# Group fil_frame data on funding type and 
# calculate the count of investments for the 4 funding types

inv_frame_four_count = fil_frame.groupby('funding_round_type')['raised_amount_usd'].\
                                            count().sort_values(ascending=False)
inv_frame_four_count

funding_round_type
venture           47809
seed              21095
angel              4400
private_equity     1820
Name: raised_amount_usd, dtype: int64

In [37]:
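# Summarise raised_amount_usd (mean, median, sum, std) for the four funding types.
# Selecting the numeric column explicitly avoids aggregating the text columns.
avg_fund = fil_frame.groupby('funding_round_type')[['raised_amount_usd']].\
                                agg(['mean','median','sum','std'])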
avg_fund

| funding_round_type | raised_amount_usd (mean) | raised_amount_usd (median) | raised_amount_usd (sum) | raised_amount_usd (std) |
|---|---|---|---|---|
| angel | 971573.89 | 414906.0 | 4274925121.0 | 7710904.33 |
| private_equity | 73938486.28 | 20000000.0 | 134568045021.0 | 201776467.38 |
| seed | 747793.68 | 300000.0 | 15774707732.0 | 2288317.64 |
| venture | 11724222.69 | 5000000.0 | 560523362596.0 | 88215713.61 |


#### Observation:
>- The mean and median are calculated for the four funding types: seed, angel, venture and private equity.

#### Inference :
>- Based on the above metrics, `venture` is the ideal investment type for Spark Funds: its median is `$5 million` and its average is approximately `$11.7 million`, both within the 5 to 15 million USD budget per round.
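The gap between the mean and the median in the outputs above is what a few very large rounds would produce. The sketch below illustrates this with made-up round sizes (they are not taken from the data set); it uses only the standard library.

```python
from statistics import mean, median

# Made-up round sizes in USD: mostly small rounds plus one very large outlier.
rounds = [2_000_000, 3_000_000, 5_000_000, 6_000_000, 100_000_000]

print(mean(rounds))    # pulled upward by the single large round
print(median(rounds))  # the middle value, unaffected by the outlier
```

This is why both figures are reported for each funding type: the median describes a typical round, while the mean is sensitive to the largest rounds.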

***IMPORTANT!!!*** ::: 
>- Plot a suitable graph to depict the funding type against the average raised amount, for all funding types and for the four major funding types.

### Results Expected: Table 2.1
>***Question1:*** Average funding amount of venture type?<br>
>***Answer:*** The mean amount is `$11,724,222.69` and the median amount is `$5,000,000.00`.<br><br>
>***Question2:*** Average funding amount of angel type?<br>
>***Answer:*** The mean amount is `$971,573.89` and the median amount is `$414,906.00`.<br><br>
>***Question3:*** Average funding amount of seed type?<br>
>***Answer:*** The mean amount is `$747,793.68` and the median amount is `$300,000.00`.<br><br>
>***Question4:*** Average funding amount of private equity type?<br>
>***Answer:*** The mean amount is `$73,938,486.28` and the median amount is `$20,000,000.00`.<br><br>
>***Question5:*** Considering that Spark Funds wants to invest between 5 to 15 million USD per investment round, which investment type is the most suitable for them?<br>
>***Answer:*** Based on the above metrics, `venture` is the ideal investment type for Spark Funds, with a median of `$5 million` and an average of approximately `$11.7 million` (both fall within the 5 to 15 million USD per round of investment they have budgeted for).<br>
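The selection rule behind Question 5 can be sketched in a few lines. This is an illustrative check only: the `typical_amounts` dictionary and the `fits_budget` helper are hypothetical names introduced here, and the figures are copied from the mean and median outputs above.

```python
# Mean and median raised_amount_usd per funding type, copied from the outputs above.
typical_amounts = {
    "venture":        {"mean": 11_724_222.69, "median": 5_000_000.00},
    "angel":          {"mean": 971_573.89,    "median": 414_906.00},
    "seed":           {"mean": 747_793.68,    "median": 300_000.00},
    "private_equity": {"mean": 73_938_486.28, "median": 20_000_000.00},
}

LOW, HIGH = 5_000_000, 15_000_000  # Spark Funds' per-round budget in USD


def fits_budget(stats, low=LOW, high=HIGH):
    """Return True when both the mean and the median lie inside [low, high]."""
    return all(low <= stats[key] <= high for key in ("mean", "median"))


suitable = [ftype for ftype, stats in typical_amounts.items() if fits_budget(stats)]
print(suitable)
```

Requiring both statistics to sit inside the band is a deliberately strict reading of the constraint; a check on the median alone would be a reasonable alternative.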

In [38]:
## For further checkpoints and analysis we will focus only on
# the venture funding type.

# Filter master_frame directly on its own funding_round_type column.
venture_frame = master_frame.loc[master_frame['funding_round_type'] == 'venture'].copy()

In [39]:
# verify Top funding type data frame.
venture_frame.shape
venture_frame.head()

(47809, 9)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name,category_list,status,country_code
0,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,05-01-2015,10000000.0,/organization/-fame,#fame,Media,operating,IND
4,/funding-round/5727accaeaa57461bd22a9bdd945382d,venture,19-03-2008,2000000.0,/organization/0-6-com,0-6.com,Curated Web,operating,CHN
8,/funding-round/954b9499724b946ad8c396a57a5f3b72,venture,21-12-2009,719491.0,/organization/0ndine-biomedical-inc,Ondine Biomedical Inc.,Biotechnology,operating,CAN
10,/funding-round/3bb2ee4a2d89251a10aaa735b1180e44,venture,09-11-2015,20000000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA
11,/funding-round/ae2a174c06517c2394aed45006322a7e,venture,03-01-2013,1700000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA


In [40]:
# Export the VENTURE investment type companies data to csv file.

venture_frame.to_csv("Data/venture_frame.csv")

### Checkpoint 3: Country Analysis
>We now know the type of investment best suited for Spark Funds, i.e., `venture`. Next, Spark Funds wants to invest in countries with the highest amount of funding for the chosen investment type. This is part of its broader strategy to invest where most investments are occurring.

>- Spark Funds wants to see the top nine countries which have received the highest total funding (across ALL sectors for the chosen investment type)
>- For the chosen investment type, make a data frame named `top9` with the top nine countries (based on the total investment amount each country has received).
>- Identify the `top3` English-speaking countries in the data frame `top9`.<br>

 ***Constraints*** : 
>   - Use only the `venture_frame` data frame created in Checkpoint 2, which contains only rounds of the chosen funding type (i.e., `venture`).
>   - Consider only `English-speaking countries` from the master_frame.

###  Results Expected: Table 3.1
>Code for the data frame `top9`, along with:
 1. Top English-speaking country
 2. Second English-speaking country
 3. Third English-speaking country


### To Do List - Check Point-3
>1. Based on the total investment amount, filter down to the top 9 countries for the chosen `venture` investment type.
>2. Identify the top 3 English-speaking countries, with their investment amounts, from the top 9.
>3. ***Use the pdfminer Python package to parse the PDF document*** `Countries_where_English_is_an_official_language.pdf` to get the names of the English-speaking countries.
>4. Join the English-speaking country names with `venture_frame` to generate the top 9 destination countries for investment.
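The first two steps of this list can be sketched on a toy frame. Everything in the example is an assumption made for illustration: `toy_venture` stands in for `venture_frame`, the amounts are made up, and `english_codes` is a hand-written subset rather than the list parsed from the PDF.

```python
import pandas as pd

# Toy stand-in for venture_frame; the amounts are made up for illustration.
toy_venture = pd.DataFrame({
    "country_code": ["USA", "USA", "CHN", "GBR", "IND", "FRA"],
    "raised_amount_usd": [10e6, 5e6, 12e6, 4e6, 3e6, 2e6],
})
# Hypothetical subset of codes for English-speaking countries.
english_codes = ["USA", "GBR", "IND"]

# Total funding per country, largest first, keeping at most nine countries.
top9 = (toy_venture.groupby("country_code")["raised_amount_usd"]
        .sum().sort_values(ascending=False).head(9))

# Keep only English-speaking countries and take the first three.
top3 = top9[top9.index.isin(english_codes)].head(3)
print(top3.index.tolist())
```

Ranking first and filtering second follows the order of the list; filtering to English-speaking countries before ranking gives the same top three whenever they all appear among the top nine.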

In [41]:
## This dictionary contains All country code and country names details from online location
# https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv

countries_code={'name': ['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'The Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czechia', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Swaziland', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'The Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', "Korea (Democratic People's Republic of)", 'Korea (Republic of)', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Federated States of Micronesia', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'], 'alpha-3': ['AFG', 'ALA', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATA', 'ATG', 'ARG', 'ARM', 'ABW', 'AUS', 'AUT', 'AZE', 'BHS', 'BHR', 'BGD', 'BRB', 'BLR', 'BEL', 'BLZ', 'BEN', 
'BMU', 'BTN', 'BOL', 'BES', 'BIH', 'BWA', 'BVT', 'BRA', 'IOT', 'BRN', 'BGR', 'BFA', 'BDI', 'CPV', 'KHM', 'CMR', 'CAN', 'CYM', 'CAF', 'TCD', 'CHL', 'CHN', 'CXR', 'CCK', 'COL', 'COM', 'COG', 'COD', 'COK', 'CRI', 'CIV', 'HRV', 'CUB', 'CUW', 'CYP', 'CZE', 'DNK', 'DJI', 'DMA', 'DOM', 'ECU', 'EGY', 'SLV', 'GNQ', 'ERI', 'EST', 'SWZ', 'ETH', 'FLK', 'FRO', 'FJI', 'FIN', 'FRA', 'GUF', 'PYF', 'ATF', 'GAB', 'GMB', 'GEO', 'DEU', 'GHA', 'GIB', 'GRC', 'GRL', 'GRD', 'GLP', 'GUM', 'GTM', 'GGY', 'GIN', 'GNB', 'GUY', 'HTI', 'HMD', 'VAT', 'HND', 'HKG', 'HUN', 'ISL', 'IND', 'IDN', 'IRN', 'IRQ', 'IRL', 'IMN', 'ISR', 'ITA', 'JAM', 'JPN', 'JEY', 'JOR', 'KAZ', 'KEN', 'KIR', 'PRK', 'KOR', 'KWT', 'KGZ', 'LAO', 'LVA', 'LBN', 'LSO', 'LBR', 'LBY', 'LIE', 'LTU', 'LUX', 'MAC', 'MKD', 'MDG', 'MWI', 'MYS', 'MDV', 'MLI', 'MLT', 'MHL', 'MTQ', 'MRT', 'MUS', 'MYT', 'MEX', 'FSM', 'MDA', 'MCO', 'MNG', 'MNE', 'MSR', 'MAR', 'MOZ', 'MMR', 'NAM', 'NRU', 'NPL', 'NLD', 'NCL', 'NZL', 'NIC', 'NER', 'NGA', 'NIU', 'NFK', 'MNP', 'NOR', 'OMN', 'PAK', 'PLW', 'PSE', 'PAN', 'PNG', 'PRY', 'PER', 'PHL', 'PCN', 'POL', 'PRT', 'PRI', 'QAT', 'REU', 'ROU', 'RUS', 'RWA', 'BLM', 'SHN', 'KNA', 'LCA', 'MAF', 'SPM', 'VCT', 'WSM', 'SMR', 'STP', 'SAU', 'SEN', 'SRB', 'SYC', 'SLE', 'SGP', 'SXM', 'SVK', 'SVN', 'SLB', 'SOM', 'ZAF', 'SGS', 'SSD', 'ESP', 'LKA', 'SDN', 'SUR', 'SJM', 'SWE', 'CHE', 'SYR', 'TWN', 'TJK', 'TZA', 'THA', 'TLS', 'TGO', 'TKL', 'TON', 'TTO', 'TUN', 'TUR', 'TKM', 'TCA', 'TUV', 'UGA', 'UKR', 'ARE', 'GBR', 'USA', 'UMI', 'URY', 'UZB', 'VUT', 'VEN', 'VNM', 'VGB', 'VIR', 'WLF', 'ESH', 'YEM', 'ZMB', 'ZWE']}
load_countries=pd.DataFrame(countries_code)


## Run pip install pdfminer.six before executing the code below


In [42]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io

In [43]:
#Reading the PDF file of the english speaking countries
fileopen = open("Data/Countries_where_English_is_an_official_language.pdf", 'rb')
PdfRsrMgr = PDFResourceManager()
retstr = io.StringIO()
lp = LAParams()
device = TextConverter(PdfRsrMgr, retstr, codec='utf-8', laparams=lp)

In [44]:
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(PdfRsrMgr, device)

In [45]:
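# Process each page contained in the document.
for page in PDFPage.get_pages(fileopen):
    interpreter.process_page(page)

# Collect the extracted text once all pages are processed, then release resources.
data = retstr.getvalue()
device.close()
fileopen.close()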

In [46]:
#The resultant PDF extract
data

'List\xa0of\xa0countries\xa0where\xa0English\xa0is\xa0an\xa0official\xa0language\xa0\n\n\xa0\nAsia\xa0\nIndia\xa0\nPakistan\xa0\nPhilippines\xa0\nSingapore\xa0\n\xa0\nAustralia/Oceania\xa0\nAustralia\xa0\nFiji\xa0\nKiribati\xa0\nMarshall\xa0Islands\xa0\nFederated\xa0States\xa0of\xa0Micronesia\xa0\nNauru\xa0\nNew\xa0Zealand\xa0\nPalau\xa0\xa0\nPapua\xa0New\xa0Guinea\xa0\nSamoa\xa0\nSolomon\xa0Islands\xa0\nTonga\xa0\nTuvalu\xa0\nVanuatu\xa0\n\xa0\nEurope\xa0\nIreland\xa0\nMalta\xa0\nUnited\xa0Kingdom\xa0\xa0\n\xa0\n\xa0\n\xa0\n\xa0\n\n\xa0\nAfrica\xa0\nBotswana\xa0\xa0\nCameroon\xa0\nEthiopia\xa0\nEritrea\xa0\nThe\xa0Gambia\xa0\nGhana\xa0\nKenya\xa0\nLesotho\xa0\nLiberia\xa0\nMalawi\xa0\nMauritius\xa0\nNamibia\xa0\nNigeria\xa0\nRwanda\xa0\nSeychelles\xa0\nSierra\xa0Leone\xa0\nSouth\xa0Africa\xa0\nSouth\xa0Sudan\xa0\nSudan\xa0\nSwaziland\xa0\nTanzania\xa0\xa0\nUganda\xa0\nZambia\xa0\nZimbabwe\xa0\n\xa0\nAmericas\xa0\nAntigua\xa0and\xa0Barbuda\xa0\nThe\xa0Bahamas\xa0\nBarbados\xa0\nBelize\

In [47]:
#replacing the non-breaking spaces (\xa0) in the extract with regular spaces
removeSpecialChars = data.replace("\xa0", " ")

#removing the trailing spaces
removeRspace=removeSpecialChars.rstrip()

#converting the string into a list
value_pdf=removeRspace.split('\n')
# value_pdf

In [48]:
#creating a dataframe of the extracted information
pdf_countries1=pd.DataFrame({'name':value_pdf})
pdf_countries1.head()

Unnamed: 0,name
0,List of countries where English is an official language
1,
2,
3,Asia
4,India


In [49]:
#removing the leading and trailing spaces from each value in the dataframe
pdf_countries1['name'] = pdf_countries1['name'].str.strip()

#replacing the blanks with NaN values (assigned back rather than replaced in place)
pdf_countries1['name'] = pdf_countries1['name'].replace('', np.nan)
pdf_countries1.head()

Unnamed: 0,name
0,List of countries where English is an official language
1,
2,
3,Asia
4,India


In [50]:
#Calculating the sum of null values
pdf_countries1['name'].isnull().sum()

# Evaluate the percentage of missing data
round(100*(pdf_countries1.isnull().sum()/len(pdf_countries1.index)), 2)

11

name   14.47
dtype: float64

In [51]:
#Remove the null rows
pdf_countries1 = pdf_countries1[pdf_countries1['name'].notnull()]

# Re-evaluate the percentage of missing data
round(100*(pdf_countries1.isnull().sum()/len(pdf_countries1.index)), 2)

#Analysing the shape and information of the pdf_countries1 dataframe
pdf_countries1.shape
pdf_countries1.info()

name   0.00
dtype: float64

(65, 1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65 entries, 0 to 75
Data columns (total 1 columns):
name    65 non-null object
dtypes: object(1)
memory usage: 1.0+ KB


In [52]:
#remove the 0th row (i.e. "List of countries where English is an official language") and 
# the continent headings Asia, Australia/Oceania, Europe, Africa & Americas
pdf_countries1=pdf_countries1[~pdf_countries1.name.\
        isin(['Asia','Australia/Oceania','Europe','Africa','Americas',
              'List of countries where English is an official language'])]
pdf_countries=pdf_countries1.copy()

In [53]:
#Load Countries code and name data from dictionary
load_countries.head(5)

Unnamed: 0,name,alpha-3
0,Afghanistan,AFG
1,Åland Islands,ALA
2,Albania,ALB
3,Algeria,DZA
4,American Samoa,ASM


In [54]:
#Merging the two Dataframe to obtain the country code
consolidate=pd.merge(pdf_countries,load_countries,how='left',on='name')

print("\n\nThe below countries has to be renamed\n",\
       consolidate[consolidate['alpha-3'].isnull()])



The below countries has to be renamed
 Empty DataFrame
Columns: [name, alpha-3]
Index: []


In [55]:
#remapping to the load_countries to validate if the desired results are obtained
country3_consolidate=pd.merge(pdf_countries,load_countries,how='left',on='name')

#validating for null records
# print(country3_consolidate[country3_consolidate['alpha-3'].isnull()])
country3_consolidate.sort_values(by='name').tail(10)

Unnamed: 0,name,alpha-3
25,The Gambia,GMB
15,Tonga,TON
57,Trinidad and Tobago,TTO
16,Tuvalu,TUV
42,Uganda,UGA
20,United Kingdom,GBR
58,United States,USA
17,Vanuatu,VUT
43,Zambia,ZMB
44,Zimbabwe,ZWE


#### Observation:
>- PyPDF2 did not work for this document: it mishandles the `white spaces` in the PDF and produces unstructured, scrambled output.
>- We have used the ***`pdfminer`*** package instead, which can be installed from the command line using pip:
    - ***pip install pdfminer.six***
>- The final consolidated list of all English-speaking countries, with codes and names, is stored in the data frame ***`country3_consolidate`***.

#### Inference:
>- The English-speaking country names are extracted from the given PDF and mapped to country codes through a static dictionary of country codes and names. The dictionary data (standard ISO codes) is taken from https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv.

#### Note:: 
>- Installing the ***`pdfminer`*** package is a mandatory prerequisite before proceeding to identify the top 9 English-speaking countries.
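The text clean-up applied to the PDF extract can be condensed into one small function. The sketch below is an assumption-laden illustration rather than the notebook's own code: `extract_countries` and `NOT_COUNTRIES` are hypothetical names, and `sample` is an abridged excerpt of the extracted text shown earlier.

```python
# Abridged excerpt of the extracted text shown above (non-breaking spaces included).
sample = ("List\xa0of\xa0countries\xa0where\xa0English\xa0is\xa0an\xa0official\xa0language\xa0\n\n"
          "\xa0\nAsia\xa0\nIndia\xa0\nPakistan\xa0\nPhilippines\xa0\nSingapore\xa0\n"
          "\xa0\nEurope\xa0\nIreland\xa0\nMalta\xa0\nUnited\xa0Kingdom\xa0\xa0\n")

# Lines that are headings rather than country names.
NOT_COUNTRIES = {"Asia", "Australia/Oceania", "Europe", "Africa", "Americas",
                 "List of countries where English is an official language"}


def extract_countries(text):
    """Normalise whitespace, drop blank lines and headings, return country names."""
    lines = (line.replace("\xa0", " ").strip() for line in text.split("\n"))
    return [line for line in lines if line and line not in NOT_COUNTRIES]


countries = extract_countries(sample)
print(countries)
```

Doing the clean-up on plain strings first keeps the later data frame free of blank rows, so the null-handling steps become a safety check rather than a necessity.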

In [56]:
# Merge the venture_frame and country3_consolidate

venture_countries = pd.merge(left=venture_frame,right=country3_consolidate,\
                                    left_on='country_code',\
                                       right_on='alpha-3',how="left",indicator=True)

In [57]:
# Check the unique country codes in the merged frame which are not marked as
# 'both' (i.e., 'left_only')

venture_countries.loc[venture_countries['_merge'] != 'both',:].country_code.unique()

# Country codes that are not 'both' found no match in the English-speaking
# list and are ignored. Note: a few of them (e.g., 'ROM', 'TAN', 'BAH') are not
# ISO 3166 alpha-3 codes, so they could never match the lookup table.

# Filter out only english speaking from the venture list.

venture_english = venture_countries.loc[venture_countries['_merge'] == 'both',:].copy()
venture_english.country_code.unique()

#venture_english.loc[venture_english.country_code.
#                      isin(venture_english.country_code.unique()),\
#                      ['country_code','name_y']]

array(['CHN', 'FRA', 'ROM', 'KOR', 'SWE', 'NLD', 'RUS', 'BEL', 'ESP',
       'HUN', 'JPN', 'DEU', 'ITA', 'HKG', 'BRA', 'FIN', 'CHE', 'PRT',
       'SVN', 'THA', 'DNK', 'TWN', 'ISR', 'NOR', 'LTU', 'ISL', 'MEX',
       'AUT', 'ARG', 'MNE', 'MYS', 'TUR', 'POL', 'LVA', 'GGY', 'EST',
       'LBN', 'GRC', 'IDN', 'CYP', 'SVK', 'ARE', 'EGY', 'ARM', 'TUN',
       'COL', 'CZE', 'PRI', 'CYM', 'PER', 'ECU', 'CHL', 'VNM', 'URY',
       'HRV', 'LUX', 'UKR', 'BMU', 'BGR', 'PAN', 'MMR', 'JOR', 'KAZ',
       'MAR', 'LIE', 'GTM', 'SAU', 'TAN', 'SEN', 'MCO', 'BAH', 'KWT',
       'LAO', 'BGD', 'MAF', 'GIB'], dtype=object)

array(['IND', 'CAN', 'USA', 'GBR', 'IRL', 'SGP', 'AUS', 'NZL', 'PHL',
       'ZAF', 'KEN', 'CMR', 'NGA', 'PAK', 'MUS', 'TTO', 'KNA', 'MLT',
       'GHA', 'UGA', 'BWA'], dtype=object)
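How the `_merge` indicator separates matched from unmatched rows can be seen on a toy example. The frames below are made-up stand-ins for `venture_frame` and `country3_consolidate`; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins; the amounts are illustrative only.
rounds = pd.DataFrame({
    "country_code": ["USA", "CHN", "IND", "FRA"],
    "raised_amount_usd": [10e6, 12e6, 3e6, 2e6],
})
english = pd.DataFrame({
    "name": ["United States", "India"],
    "alpha-3": ["USA", "IND"],
})

merged = pd.merge(rounds, english, how="left",
                  left_on="country_code", right_on="alpha-3", indicator=True)

# 'both' marks rows that found a match in the English-speaking list;
# 'left_only' marks rows whose code is absent from that list.
matched = merged.loc[merged["_merge"] == "both", "country_code"].tolist()
unmatched = merged.loc[merged["_merge"] == "left_only", "country_code"].tolist()
print(matched, unmatched)
```

Because the join is an exact string match, a code spelled differently from the lookup table ends up in `left_only` regardless of which country it denotes, so the unmatched list is worth scanning by eye.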

In [58]:
# The venture_english data frame will be used for finding out the top9_english & top3_english
# speaking countries after cleaning up the venture_english data frame.

# Rename the merged name_y column.

venture_english.rename(columns={'name_y':'country_name'},inplace=True)

# drop alpha-3 (duplicate of country_code) and _merge (unused)
venture_english =venture_english.drop(columns=['alpha-3','_merge'])
venture_english.head()

# export the data frame to csv for plotting graph.
venture_english.shape
venture_english.to_csv("Data/venture_english.csv")

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name
0,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,05-01-2015,10000000.0,/organization/-fame,#fame,Media,operating,IND,India
2,/funding-round/954b9499724b946ad8c396a57a5f3b72,venture,21-12-2009,719491.0,/organization/0ndine-biomedical-inc,Ondine Biomedical Inc.,Biotechnology,operating,CAN,Canada
3,/funding-round/3bb2ee4a2d89251a10aaa735b1180e44,venture,09-11-2015,20000000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States
4,/funding-round/ae2a174c06517c2394aed45006322a7e,venture,03-01-2013,1700000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States
5,/funding-round/e1cfcbe1bdf4c70277c5f29a3482f24e,venture,19-07-2014,8900000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States


(40824, 10)

In [59]:
# Then top9 is created with 9 English speaking countries based on
# Total Investment made (i.e., raised_amount_usd) from the 
# venture_english data frame created above

top9_english = venture_english.\
           groupby(['country_code','country_name','funding_round_type'])\
                   ['raised_amount_usd'].sum().sort_values(ascending=False).head(9)
top9_english

country_code  country_name    funding_round_type
USA           United States   venture              420,068,029,342.00
GBR           United Kingdom  venture               20,072,813,004.00
IND           India           venture               14,261,508,718.00
CAN           Canada          venture                9,482,217,668.00
SGP           Singapore       venture                2,793,917,856.00
IRL           Ireland         venture                1,669,285,543.00
AUS           Australia       venture                1,319,028,698.00
NZL           New Zealand     venture                  448,316,383.00
ZAF           South Africa    venture                  233,713,106.00
Name: raised_amount_usd, dtype: float64

In [60]:
# Identify the top three English-speaking countries and their total investments.

top3_english = venture_english.loc[venture_english['country_code'].\
                            isin(['USA','GBR','IND']),:].copy()
top3_english.shape

# Now, Group the top3_english by country_code and find out the total investment
# across each country code

top3_english.groupby(['country_code','country_name'])\
                     ['raised_amount_usd'].sum().sort_values(ascending=False)

(38803, 10)

country_code  country_name  
USA           United States    420,068,029,342.00
GBR           United Kingdom    20,072,813,004.00
IND           India             14,261,508,718.00
Name: raised_amount_usd, dtype: float64

#### Observation:
>- The top fund type data `venture_frame` is merged with `country3_consolidate` (the English-speaking countries obtained from the PDF document) to create `top9_english`, the top 9 English-speaking destinations by total investment.

>- The top 3 countries based on total investment are then obtained from it.

#### Inference   : 
>Based on the results above, `USA`, `GBR` and `IND` are identified as the top investment destinations for Spark Funds. 

***IMPORTANT!!!*** ::: 
>- Plot a suitable graph to depict the top 3 countries by total investment.

### Results Expected: Table 3.1
>***Question1:*** Top English-speaking country?<br>
>***Answer:*** It is `USA` with a total investment of `$420,068,029,342.00`<br><br>
>***Question2:*** Second English-speaking country?<br>
>***Answer:*** It is `GBR` with a total investment of `$20,072,813,004.00`<br><br>
>***Question3:*** Third English-speaking country?<br>
>***Answer:*** It is `IND` with a total investment of `$14,261,508,718.00`<br><br>

### Checkpoint 4: Sector Analysis 1

>The sector analysis refers to the eight main sectors listed in the mapping file.<br>
>The business rule is that the first string before the vertical bar in the `category_list` column of the master data frame is considered the `Primary Sector`.<br>

***CONSTRAINT*** : 
>- Use only the `top3_english` data frame created in Checkpoint 3, which holds the top 3 English-speaking destinations by total investment for the chosen fund type `venture`.

### Results Expected: Table 3.1
>- Code for a merged data frame with each primary sector mapped to its main sector (the primary sector should be present in a separate column).

### To Do List - Check Point-4
>- Extract the primary sector of each category list from the `category_list` column of `top3_english`.
>- Use the mapping file `mapping.csv` to map each primary sector to one of the eight main sectors.
>- (Note: 'Others' is also considered one of the main sectors.)
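The primary-sector rule is a single string split. A minimal sketch, assuming a hypothetical helper named `primary_sector` and two `category_list` values of the kind found in the data:

```python
def primary_sector(category_list):
    """Return the text before the first '|' in a category_list value."""
    # maxsplit=1 stops after the first bar, since only the first piece is needed.
    return category_list.split("|", 1)[0]


examples = [
    "Enterprise Software|Location Based Services|Mobile|Wireless",
    "Biotechnology",  # no bar: the whole value is the primary sector
]
print([primary_sector(value) for value in examples])
```

A value without any vertical bar is returned unchanged, so single-category companies need no special handling.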

In [61]:
#The business rule is that the first string before the vertical bar in the category_list 
# will be considered as the Primary Sector.

top3_english['primary_sector']=top3_english['category_list']\
                           .apply(lambda x:x.split('|',1)[0])
top3_english.sort_values(by='country_name',ascending=False).head()

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector
24017,/funding-round/a72758fc10661a4f1b7aa17be32ef7ab,venture,01-01-2008,8151750.0,/organization/loc-aid,Locaid,Enterprise Software|Location Based Services|Mobile|Wireless,acquired,USA,United States,Enterprise Software
31443,/funding-round/81daf9d027abc85f843864639650f074,venture,18-03-2014,24000000.0,/organization/percolate,Percolate,Brand Marketing|Content|Enterprise Software|Information Technology|Infrastructure|Sales and Marketing|Social Media,operating,USA,United States,Brand Marketing
31465,/funding-round/1142ec6e781675ab51b7bbdcfa85152b,venture,09-04-2014,6149845.0,/organization/perficient,Perficient,Consulting|Information Technology|Internet,ipo,USA,United States,Consulting
31473,/funding-round/ee4995c5f5a747102300403e53182bea,venture,15-01-2010,3000000.0,/organization/performable,Performable,Advertising|Analytics|Marketing Automation|Optimization|Sales and Marketing|Software|Web Design,acquired,USA,United States,Advertising
31476,/funding-round/97e7cfb2ce5b9425f1fd9d8179ea4a7e,venture,06-11-2015,6500000.0,/organization/performance-indicator,Performance Indicator,Biotechnology,operating,USA,United States,Biotechnology


In [62]:
## Load the mapping.csv file in to a new data frame.
mapping_frame = pd.read_csv("Data/mapping.csv",encoding='iso-8859-1')
mapping_frame.head()

Unnamed: 0,category_list,Automotive & Sports,Blanks,Cleantech / Semiconductors,Entertainment,Health,Manufacturing,"News, Search and Messaging",Others,"Social, Finance, Analytics, Advertising"
0,,0,1,0,0,0,0,0,0,0
1,3D,0,0,0,0,0,1,0,0,0
2,3D Printing,0,0,0,0,0,1,0,0,0
3,3D Technology,0,0,0,0,0,1,0,0,0
4,Accounting,0,0,0,0,0,0,0,0,1


In [63]:
# Using melt function to unpivot mapping_frame having variable(main sector names) 
# & values (1 and 0) for the given category_list. 

mapping_pivot = pd.melt(mapping_frame,id_vars=['category_list'])              
mapping_pivot.head()

Unnamed: 0,category_list,variable,value
0,,Automotive & Sports,0
1,3D,Automotive & Sports,0
2,3D Printing,Automotive & Sports,0
3,3D Technology,Automotive & Sports,0
4,Accounting,Automotive & Sports,0


In [64]:
# Rename the variable and category list to main_sector and primary_sector to join
mapping_pivot.rename(columns={'variable':'main_sector',\
                              'category_list':'primary_sector'},\
                              inplace=True)
mapping_pivot.head()

Unnamed: 0,primary_sector,main_sector,value
0,,Automotive & Sports,0
1,3D,Automotive & Sports,0
2,3D Printing,Automotive & Sports,0
3,3D Technology,Automotive & Sports,0
4,Accounting,Automotive & Sports,0


In [65]:
# Ignore the rows with value 0 and keep only the rows that map a
# primary sector to one of the 9 main-sector columns.

mapping_pivot = mapping_pivot[mapping_pivot.value == 1]

# drop the value column, which is no longer needed.
mapping_pivot = mapping_pivot.drop(columns='value')
mapping_pivot.shape

(688, 2)
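The unpivot-and-filter steps can be followed end to end on a two-row, two-sector excerpt in the same shape as `mapping.csv`. The frame names `wide`, `long` and `mapping` are hypothetical; the two category-to-sector pairs mirror rows shown in the outputs above.

```python
import pandas as pd

# Tiny wide-format table in the same shape as mapping.csv (two sectors only).
wide = pd.DataFrame({
    "category_list": ["3D", "Accounting"],
    "Manufacturing": [1, 0],
    "Social, Finance, Analytics, Advertising": [0, 1],
})

# Unpivot: one row per (category, sector) pair with its 0/1 flag.
long = pd.melt(wide, id_vars=["category_list"],
               var_name="main_sector", value_name="value")

# Keep only the pairs flagged 1, drop the flag column and rename the key.
mapping = (long[long["value"] == 1]
           .drop(columns="value")
           .rename(columns={"category_list": "primary_sector"})
           .reset_index(drop=True))
print(mapping)
```

Passing `var_name` to `pd.melt` names the sector column directly, which would make the separate rename of `variable` unnecessary.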

In [66]:
mapping_pivot.sort_values(by='primary_sector').head()

Unnamed: 0,primary_sector,main_sector
1847,0notechnology,Cleantech / Semiconductors
1848,0tural Language Processing,Cleantech / Semiconductors
1849,0tural Resources,Cleantech / Semiconductors
4602,0vigation,"News, Search and Messaging"
3441,3D,Manufacturing


In [67]:
# While analysing mapping_pivot it was found that many primary_sector
# values contain '0' in place of the string 'na'
# For e.g., 0notechnology, Chi0 Internet, Cloud Ma0gement, etc.

# See how many such primary_sector values have '0' instead of 'na'
mapping_pivot.primary_sector.str.contains('[A-Za-z]{1,}0{1}[A-Za-z]{1,}').sum()
mapping_pivot.primary_sector.str.contains('^0[A-Za-z]{1,}').sum()

47

4

In [68]:
#Replace the '0' in the identified primary sectors with the string 'na'

mapping_pivot['primary_sector'] = mapping_pivot['primary_sector'].str.replace('0','na')

# Verify again the presence of '0' instead of 'na'
mapping_pivot.primary_sector.str.contains('[A-Za-z]{1,}0[A-Za-z]{1,}').sum()
mapping_pivot.primary_sector.str.contains('^0[A-Za-z]{1,}').sum()

0

0
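A blanket replacement of `'0'` also rewrites any zero that is a genuine digit. A more targeted pattern is sketched below as one possible refinement, not as the notebook's method: it replaces a `'0'` only when it is not adjacent to another digit or a decimal point. The helper `restore_na` is a hypothetical name, and `'Enterprise 2.0'` is an assumed example of a value that should stay untouched.

```python
import re

# Replace '0' only when it is not next to another digit or a decimal point.
ZERO_AS_NA = re.compile(r"(?<![\d.])0(?![\d.])")


def restore_na(value):
    """Turn a stray '0' back into 'na' while leaving numeric zeros alone."""
    return ZERO_AS_NA.sub("na", value)


samples = ["0notechnology", "Cloud Ma0gement", "0vigation", "Enterprise 2.0"]
print([restore_na(sample) for sample in samples])
```

With pandas the same pattern could be passed to `Series.str.replace` with `regex=True`; the restored values come back in lower case at the start of a word, which does not matter here because the sectors are lower-cased before the join.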

In [69]:
# Sort it again on index and see changes are getting reflected.
mapping_pivot.sort_index().head()

Unnamed: 0,primary_sector,main_sector
8,Adventure Travel,Automotive & Sports
14,Aerospace,Automotive & Sports
45,Auto,Automotive & Sports
46,Automated Kiosk,Automotive & Sports
47,Automotive,Automotive & Sports


In [70]:
# Inspect the top3 and mapping_pivot data frames before joining them on the key primary_sector
top3_english.info()
mapping_pivot.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38803 entries, 0 to 47806
Data columns (total 11 columns):
funding_round_permalink    38803 non-null object
funding_round_type         38803 non-null object
funded_at                  38803 non-null object
raised_amount_usd          38803 non-null float64
permalink                  38803 non-null object
name_x                     38803 non-null object
category_list              38803 non-null object
status                     38803 non-null object
country_code               38803 non-null object
country_name               38803 non-null object
primary_sector             38803 non-null object
dtypes: float64(1), object(10)
memory usage: 3.6+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 688 entries, 8 to 6167
Data columns (total 2 columns):
primary_sector    687 non-null object
main_sector       688 non-null object
dtypes: object(2)
memory usage: 16.1+ KB


In [71]:
# To join on primary sector, ensure all values use the same case (e.g., lower case).
# The values appear in mixed case in both data frames (top3_english & mapping_pivot).

top3_english['primary_sector'] = top3_english['primary_sector'].str.lower()
mapping_pivot['primary_sector'] = mapping_pivot['primary_sector'].str.lower()
top3_english.primary_sector.head()
mapping_pivot.primary_sector.head()

0        media
3    analytics
4    analytics
5    analytics
6         apps
Name: primary_sector, dtype: object

8     adventure travel
14           aerospace
45                auto
46     automated kiosk
47          automotive
Name: primary_sector, dtype: object

In [72]:
# Now merge top3_english and mapping_pivot using a left join on primary_sector.
# Since it is a left join, use the indicator parameter to flag any primary_sector
# that is missing from the mapping file.
import pandas as pd  # required by pd.merge below

top3_sectors = pd.merge(left=top3_english, right=mapping_pivot, how='left',
                        on='primary_sector', indicator=True)
top3_sectors.shape

# Check how many rows are only in top3_english but not in mapping.

top3_merge_left_only = top3_sectors[top3_sectors['_merge'] != 'both']
len(top3_merge_left_only)

In [73]:
# List the primary_sector values from left_only to check their presence in mapping
top3_merge_left_only.primary_sector.sort_values(ascending=True).unique()

array(['adaptive equipment', 'biotechnology and semiconductor',
       'enterprise 2.0', 'greentech', 'natural gas uses',
       'product search', 'racing', 'rapidly expanding', 'retirement',
       'specialty retail'], dtype=object)

In [74]:
# Confirm above primary_sectors are not present in mapping data frame.
missing_sectors = ['adaptive equipment', 'biotechnology and semiconductor',
                  'enterprise 2.0', 'greentech', 'natural gas uses',
                  'product search', 'racing', 'rapidly expanding', 'retirement',
                  'specialty retail']
len(mapping_pivot.loc[mapping_pivot.primary_sector.isin(missing_sectors),:])

0
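
The unmatched-key check in the last few cells can be condensed into one step. The sketch below uses small made-up frames (`rounds` and `mapping` are placeholders, not the notebook's data) and adds `validate="many_to_one"`, an optional pandas safeguard that raises if the mapping side has duplicate keys.

```python
import pandas as pd

# Toy stand-ins for top3_english and mapping_pivot (values are illustrative).
rounds = pd.DataFrame({
    "permalink": ["/a", "/b", "/c"],
    "primary_sector": ["media", "racing", "apps"],
})
mapping = pd.DataFrame({
    "primary_sector": ["media", "apps"],
    "main_sector": ["Entertainment", "News, Search and Messaging"],
})

# Left join; _merge records whether each row found a partner in mapping.
merged = rounds.merge(mapping, how="left", on="primary_sector",
                      indicator=True, validate="many_to_one")

# Keys present only on the left side have no main sector.
unmatched = merged.loc[merged["_merge"] == "left_only", "primary_sector"].unique().tolist()

# Keep the matched rows and drop the helper column.
matched = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")
print(unmatched, len(matched))
```

Listing `unmatched` before dropping rows makes it easy to see whether a key is truly absent from the mapping or merely spelled differently.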

In [75]:
# The missing primary sectors are confirmed to be absent from the mapping data frame,
# so we can safely drop them from the merged data frame.

top3_sectors = top3_sectors[top3_sectors['_merge'] == 'both']

# drop the indicator column '_merge'
top3_sectors = top3_sectors.drop('_merge', axis=1)

In [76]:
# Print the cleaned-up merged data frame
# (with the primary_sector and main_sector columns merged in).
top3_sectors.shape
top3_sectors.head()

(38788, 12)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector
0,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,05-01-2015,10000000.0,/organization/-fame,#fame,Media,operating,IND,India,media,Entertainment
1,/funding-round/3bb2ee4a2d89251a10aaa735b1180e44,venture,09-11-2015,20000000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
2,/funding-round/ae2a174c06517c2394aed45006322a7e,venture,03-01-2013,1700000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
3,/funding-round/e1cfcbe1bdf4c70277c5f29a3482f24e,venture,19-07-2014,8900000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
4,/funding-round/b952cbaf401f310927430c97b68162ea,venture,17-03-2015,5000000.0,/organization/1-mainstream,1 Mainstream,Apps|Cable|Distribution|Software,acquired,USA,United States,apps,"News, Search and Messaging"


#### Observation:
>- The first string before the vertical bar in `category_list` is split off and added as a new column, `primary_sector`, in the `top3_english` data frame.
>- The `mapping.csv` file is loaded into `mapping_frame` and unpivoted with the `melt` function into a new data frame called `mapping_pivot`.
>- Many `primary_sector` values in `mapping_pivot` contain `0` in place of the string `na`, either at the beginning or in the middle of the string.
>- The `primary_sector` values are cleaned up by replacing `0` with `na` using `str.replace`.
>- The key column `primary_sector` is converted to lower case in both `top3_english` and `mapping_pivot` so that the join works correctly.
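
The first two observations (taking the primary sector from `category_list` and unpivoting the mapping file) can be illustrated on a tiny frame. This is a minimal sketch: the column name `category_list` for the mapping file and the 0/1 values shown are assumptions made for the example.

```python
import pandas as pd

# The primary sector is the text before the first vertical bar.
primary = "Apps|Cable|Distribution|Software".split("|")[0]

# Hypothetical wide mapping: one row per primary sector, one 0/1 column per main sector.
wide = pd.DataFrame({
    "category_list": ["3D", "Accounting", "Aerospace"],
    "Automotive & Sports": [0, 0, 1],
    "Manufacturing": [1, 0, 0],
    "Social, Finance, Analytics, Advertising": [0, 1, 0],
})

# Unpivot to long form, keep only the flagged pairs, then tidy the columns.
long_form = wide.melt(id_vars="category_list", var_name="main_sector", value_name="value")
long_form = (long_form[long_form["value"] == 1]
             .drop(columns="value")
             .rename(columns={"category_list": "primary_sector"}))
print(primary)
print(long_form)
```

On a whole column, the same split would be `frame['category_list'].str.split('|').str[0]`.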


#### Inference   :
>- Based on the results above, 10 primary sectors present in `top3_english` do not exist in the `mapping` frame; their rows are dropped from the merged data frame.
>- Plot a suitable graph to depict `top3_sectors` based on country-wise sector investments.

#### Result:
>- The resultant merged data frame is called `top3_sectors`.

### Checkpoint 5: Sector Analysis 2

>- The `top3_sectors` data frame now has each company's main sector merged in alongside its primary sector.
For sector analysis, every company falls under one of the eight main sectors identified above.<br>
>- Also, `top3_sectors` contains only the top 3 English-speaking countries 
and the funding type most suitable for Spark Funds to invest in.<br>
>- The three identified countries are classified as:
   - country1(D1) - `USA`
   - country2(D2) - `GBR`
   - country3(D3) - `IND`
>- Funding Type (FT) - `venture`<br>
>- The range of funding selected is between `5 million and 15 million USD`.
   
#### Objective:
>- Find out the most heavily invested main sectors in each of the three countries (for funding type FT and investments range of 5-15 M USD).<br>
>- Create three separate data frames D1, D2 and D3, one for each of the three countries, containing the observations of funding type FT falling within the 5-15 million USD range. The three data frames should contain:
    1. All the columns of the master_frame along with the primary sector and the main sector
    2.  The total number (or count) of investments for each main sector in a separate column (main_sector_total_count)
    3.  The total amount invested in each main sector in a separate column (main_sector_total_amount) <br>

#### Results Expected: Table 5.1
>- Three data frames D1, D2 and D3 
>- Table 5.1: Based on the analysis of the sectors, which main sectors and countries would you recommend Spark Funds to invest in? Present your conclusions in the presentation. The conclusions are subjective (i.e. there may be no ‘one right answer’), but it should be based on the basic strategy — invest in sectors where most investments are occurring. 
>- Two new columns in D1, D2 and D3: `main_sector_total_count` and `main_sector_total_amount`.

In [77]:
# Verify the Top3 sectors obtained in Check Point 4 - Sector Analysis.

top3_sectors.shape
top3_sectors.country_code.value_counts()
top3_sectors.head()

(38788, 12)

USA    35929
GBR     2040
IND      819
Name: country_code, dtype: int64

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector
0,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,05-01-2015,10000000.0,/organization/-fame,#fame,Media,operating,IND,India,media,Entertainment
1,/funding-round/3bb2ee4a2d89251a10aaa735b1180e44,venture,09-11-2015,20000000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
2,/funding-round/ae2a174c06517c2394aed45006322a7e,venture,03-01-2013,1700000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
3,/funding-round/e1cfcbe1bdf4c70277c5f29a3482f24e,venture,19-07-2014,8900000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
4,/funding-round/b952cbaf401f310927430c97b68162ea,venture,17-03-2015,5000000.0,/organization/1-mainstream,1 Mainstream,Apps|Cable|Distribution|Software,acquired,USA,United States,apps,"News, Search and Messaging"


In [78]:
# One of the constraints is that the investment amount for funding type 'venture'
# should be in the range of 5 million to 15 million USD.
# So, find out how many rows in top3_sectors fall outside that range.

# Total number of rows in top3_sectors (USA, GBR and IND)
len(top3_sectors)

# Number of rows with an amount less than 5 million
len(top3_sectors.loc[top3_sectors.raised_amount_usd < 5000000].index)

# Number of rows with an amount greater than 15 million
len(top3_sectors.loc[top3_sectors.raised_amount_usd > 15000000].index)

38788

18471

7305

In [79]:
# Filter out the rows that are out of range (below the lower bound or above the upper bound)
# as per the constraint.

top3_sectors = top3_sectors.\
               drop(top3_sectors.loc[top3_sectors.raised_amount_usd < 5000000].index)
top3_sectors = top3_sectors.\
               drop(top3_sectors.loc[top3_sectors.raised_amount_usd > 15000000].index)

# Number of rows in top3_sectors with an amount in the range of 5 million to 15 million.
len(top3_sectors)

13012
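
The counts are consistent: 38788 - 18471 - 7305 = 13012. The same filter can be written in one step with `Series.between`, which includes both bounds by default. The sketch below runs on made-up amounts; `LOWER` and `UPPER` are names introduced for the example.

```python
import pandas as pd

LOWER, UPPER = 5_000_000, 15_000_000  # investment range in USD, both ends included

# Toy amounts: two fall outside the range, three fall inside (bounds included).
df = pd.DataFrame({
    "raised_amount_usd": [1_700_000.0, 5_000_000.0, 8_900_000.0, 15_000_000.0, 20_000_000.0],
})

# between() keeps rows where LOWER <= amount <= UPPER.
in_range = df[df["raised_amount_usd"].between(LOWER, UPPER)]
print(in_range)
```

This keeps the same rows as dropping everything below 5 million and then everything above 15 million.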

In [80]:
# 1. Create three separate data frames D1, D2 and D3, one for each of the three countries,
# containing the observations of funding type FT
# falling within the 5-15 million USD range (as filtered above).

# 2. Each includes all the columns of the master_frame along with
#    the primary sector and the main sector.

D1 = top3_sectors.loc[top3_sectors.country_code == 'USA',:].copy()
D2 = top3_sectors.loc[top3_sectors.country_code == 'GBR',:].copy()
D3 = top3_sectors.loc[top3_sectors.country_code == 'IND',:].copy()

D1.shape
D2.shape
D3.shape
D1.head()

# Export the data for top3 sectors.
#top3_sectors.to_csv("Data/Top3_Sectors.csv")
#D1.to_csv("Data/Top3_Sectors_USA.csv")
#D2.to_csv("Data/Top3_Sectors_GBR.csv")
#D3.to_csv("Data/Top3_Sectors_IND.csv")

(12063, 12)

(621, 12)

(328, 12)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector
3,/funding-round/e1cfcbe1bdf4c70277c5f29a3482f24e,venture,19-07-2014,8900000.0,/organization/0xdata,H2O.ai,Analytics,operating,USA,United States,analytics,"Social, Finance, Analytics, Advertising"
4,/funding-round/b952cbaf401f310927430c97b68162ea,venture,17-03-2015,5000000.0,/organization/1-mainstream,1 Mainstream,Apps|Cable|Distribution|Software,acquired,USA,United States,apps,"News, Search and Messaging"
17,/funding-round/fb6216a30cb566ede89e0bee0623a634,venture,16-12-2014,11999347.0,/organization/128-technology,128 Technology,Service Providers|Technology,operating,USA,United States,service providers,Others
20,/funding-round/424129ce1235cfab2655ee81305f7c2b,venture,15-10-2013,15000000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing
21,/funding-round/6d3f3797371956ece035b8478c1441b2,venture,09-04-2015,5000000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing


In [81]:
# The total number (or count) of investments
# for each main sector in a separate column (USA).

D1 = D1.join(D1.groupby('main_sector')['raised_amount_usd'].count(),\
                               on='main_sector',rsuffix='_first')

# The total amount invested in each main sector in a separate column

D1 = D1.join(D1.groupby('main_sector')['raised_amount_usd'].sum(),\
                               on='main_sector',rsuffix='_second')

# Rename the new columns added.
D1.rename(columns={'raised_amount_usd_first':'main_sector_total_count'},inplace=True)
D1.rename(columns={'raised_amount_usd_second':'main_sector_total_amount'},inplace=True)

In [82]:
# The total number (or count) of investments
# for each main sector in a separate column (GBR).

D2 = D2.join(D2.groupby('main_sector')['raised_amount_usd'].count(),\
                               on='main_sector',rsuffix='_first')

# The total amount invested in each main sector in a separate column

D2 = D2.join(D2.groupby('main_sector')['raised_amount_usd'].sum(),\
                               on='main_sector',rsuffix='_second')

# Rename the new columns added.
D2.rename(columns={'raised_amount_usd_first':'main_sector_total_count'},inplace=True)
D2.rename(columns={'raised_amount_usd_second':'main_sector_total_amount'},inplace=True)

In [83]:
# The total number (or count) of investments
# for each main sector in a separate column (IND).

D3 = D3.join(D3.groupby('main_sector')['raised_amount_usd'].count(),\
                               on='main_sector',rsuffix='_first')

# The total amount invested in each main sector in a separate column

D3 = D3.join(D3.groupby('main_sector')['raised_amount_usd'].sum(),\
                               on='main_sector',rsuffix='_second')

# Rename the new columns added.
D3.rename(columns={'raised_amount_usd_first':'main_sector_total_count'},inplace=True)
D3.rename(columns={'raised_amount_usd_second':'main_sector_total_amount'},inplace=True)
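
The three cells above repeat the same join-and-rename steps for each country. One possible alternative is `groupby(...).transform`, which writes the per-sector totals straight back onto every row. The helper `add_sector_totals` and the toy frame are illustrative only.

```python
import pandas as pd


def add_sector_totals(frame):
    """Return a copy of frame with per-main-sector count and amount columns."""
    out = frame.copy()
    grouped = out.groupby("main_sector")["raised_amount_usd"]
    # transform() returns one value per original row, aligned on the index.
    out["main_sector_total_count"] = grouped.transform("count")
    out["main_sector_total_amount"] = grouped.transform("sum")
    return out


toy = pd.DataFrame({
    "main_sector": ["Others", "Others", "Health"],
    "raised_amount_usd": [5_000_000.0, 10_000_000.0, 7_000_000.0],
})
result = add_sector_totals(toy)
print(result)
```

With such a helper the three cells would reduce to `D1 = add_sector_totals(D1)` and likewise for `D2` and `D3`.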

In [84]:
D1[D1.main_sector=='Manufacturing'].head(5)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector,main_sector_total_count,main_sector_total_amount
20,/funding-round/424129ce1235cfab2655ee81305f7c2b,venture,15-10-2013,15000000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing,799,7258553378.0
21,/funding-round/6d3f3797371956ece035b8478c1441b2,venture,09-04-2015,5000000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing,799,7258553378.0
22,/funding-round/786f61aa9866f4471151285f5c56be36,venture,03-02-2010,5150000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing,799,7258553378.0
23,/funding-round/82ace97530965cd2be8f262836b43ff5,venture,27-03-2008,12400000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing,799,7258553378.0
24,/funding-round/ab99fc5a53717b1b53fd6aa5687c5fa9,venture,16-12-2010,6000000.0,/organization/1366-technologies,1366 Technologies,Manufacturing,operating,USA,United States,manufacturing,Manufacturing,799,7258553378.0


In [85]:
D2[D2.main_sector=='Manufacturing'].head(5)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector,main_sector_total_count,main_sector_total_amount
2257,/funding-round/0f5b23faa3fc155f50e679237f2b38f8,venture,06-06-2007,10000000.0,/organization/antenova,Antenova,Hardware + Software,operating,GBR,United Kingdom,hardware + software,Manufacturing,42,361940335.0
2258,/funding-round/1750b93d061a4539de278a4384ed220f,venture,20-01-2005,12000000.0,/organization/antenova,Antenova,Hardware + Software,operating,GBR,United Kingdom,hardware + software,Manufacturing,42,361940335.0
2259,/funding-round/47c0b4eac099171d3c5c0f9170428485,venture,21-10-2008,6500000.0,/organization/antenova,Antenova,Hardware + Software,operating,GBR,United Kingdom,hardware + software,Manufacturing,42,361940335.0
4996,/funding-round/3eb3fd60aa846d24a440a6e816540c89,venture,17-01-2001,6686126.0,/organization/blueheath,Blueheath Holdings,Groceries|Leisure|Retail|Wholesale,operating,GBR,United Kingdom,groceries,Manufacturing,42,361940335.0
5914,/funding-round/752f5792832c5ea9ccac167213a4ffbd,venture,24-03-2013,6832317.0,/organization/cambridge-communication-systems,Cambridge Communication Systems,Hardware + Software,operating,GBR,United Kingdom,hardware + software,Manufacturing,42,361940335.0


In [86]:
D3[D3.main_sector=='Manufacturing'].head(5)

Unnamed: 0,funding_round_permalink,funding_round_type,funded_at,raised_amount_usd,permalink,name_x,category_list,status,country_code,country_name,primary_sector,main_sector,main_sector_total_count,main_sector_total_amount
3872,/funding-round/9725f0dc5dc4bf295cd047b4887a10f5,venture,23-02-2015,9600000.0,/organization/bakers-circle,Bakers Circle,Food Processing,operating,IND,India,food processing,Manufacturing,21,200900000.0
6857,/funding-round/d40413cd302a9501f87ebcc648675295,venture,16-09-2015,10000000.0,/organization/chai-point,Chai Point,Food Processing,operating,IND,India,food processing,Manufacturing,21,200900000.0
10076,/funding-round/4706b97ec7264fc01c951a6c0c8de6b9,venture,28-04-2008,9350000.0,/organization/dixon-technologies,Dixon Technologies,Hardware + Software,operating,IND,India,hardware + software,Manufacturing,21,200900000.0
10935,/funding-round/c7087296aaa1e6ee33e9d174022ad444,venture,26-08-2013,6000000.0,/organization/electronic-payment-and-services,Electronic Payment and Services (EPS),Hardware + Software,operating,IND,India,hardware + software,Manufacturing,21,200900000.0
10936,/funding-round/cabd3c8428576ef3018e1c91812a732e,venture,17-12-2013,5000000.0,/organization/electronic-payment-and-services,Electronic Payment and Services (EPS),Hardware + Software,operating,IND,India,hardware + software,Manufacturing,21,200900000.0


 ### Table 5.1 : Sector-wise Investment Analysis

In [87]:
#Table 5.1 : Sector-wise Investment Analysis
# Question-1: Total number of investments (count) for Country1, Country2 and Country3?
print("Total Number of investments for country1 (USA) - ",D1.raised_amount_usd.count())
print("Total Number of investments for country2 (GBR) - ",D2.raised_amount_usd.count())
print("Total Number of investments for country3 (IND) - ",D3.raised_amount_usd.count())

Total Number of investments for country1 (USA) -  12063
Total Number of investments for country2 (GBR) -  621
Total Number of investments for country3 (IND) -  328


In [88]:
#Table 5.1 : Sector-wise Investment Analysis
#Question-2: Total amount of investments (USD) for Country1, Country2 and Country3?
print("Total amount of investments for country1 (USA) - $",D1.raised_amount_usd.sum())
print("Total amount of investments for country2 (GBR) - $",D2.raised_amount_usd.sum())
print("Total amount of investments for country3 (IND) - $",D3.raised_amount_usd.sum())

Total amount of investments for country1 (USA) - $ 107757097294.0
Total amount of investments for country2 (GBR) - $ 5379078691.0
Total amount of investments for country3 (IND) - $ 2949543602.0


 ### Table 5.1 : Sector-wise Investment Analysis
>The questions below apply to the top 3 countries: country1, country2 and country3.<br>
>**Question-1:** Total number of investments (count) ?<br>
>**Answer:** 
  - For Country1 `USA`, the total number of investments (count) is `12063`<br>
  - For Country2 `GBR`, the total number of investments (count) is `621`<br>
  - For Country3 `IND`, the total number of investments (count) is `328`<br>
  
>**Question-2:** Total amount of investment (USD) ?<br>
>**Answer:** 
  - For Country1 `USA`, the Total amount of investments (USD) is `$107757097294.00`<br>
  - For Country2 `GBR`, the Total amount of investments (USD) is `$5379078691.00`<br>
  - For Country3 `IND`, the Total amount of investments (USD) is `$2949543602.00`<br>

In [89]:
#Table 5.1 : Sector-wise Investment Analysis
#Question-3: Top sector (based on count of investments) ?


print('Top Sectors counts based on Investments for United States (USA)')
D1.groupby('main_sector')['raised_amount_usd'].count().sort_values(ascending=False)
print('Top Sectors counts based on Investments for Country Great Britain (GBR)')
D2.groupby('main_sector')['raised_amount_usd'].count().sort_values(ascending=False)
print('Top Sectors counts based on Investments For Country India (IND)')
D3.groupby('main_sector')['raised_amount_usd'].count().sort_values(ascending=False)

Top Sectors counts based on Investments for United States (USA)


main_sector
Others                                     2950
Social, Finance, Analytics, Advertising    2714
Cleantech / Semiconductors                 2350
News, Search and Messaging                 1583
Health                                      909
Manufacturing                               799
Entertainment                               591
Automotive & Sports                         167
Name: raised_amount_usd, dtype: int64

Top Sectors counts based on Investments for Country Great Britain (GBR)


main_sector
Others                                     147
Social, Finance, Analytics, Advertising    133
Cleantech / Semiconductors                 130
News, Search and Messaging                  73
Entertainment                               56
Manufacturing                               42
Health                                      24
Automotive & Sports                         16
Name: raised_amount_usd, dtype: int64

Top Sectors counts based on Investments For Country India (IND)


main_sector
Others                                     110
Social, Finance, Analytics, Advertising     60
News, Search and Messaging                  52
Entertainment                               33
Manufacturing                               21
Cleantech / Semiconductors                  20
Health                                      19
Automotive & Sports                         13
Name: raised_amount_usd, dtype: int64

In [90]:
# Using pivot_table function of data frame.

# For Country USA
print('For United States (USA)')
D1_pivot = D1.pivot_table(index=['main_sector'],\
                          values='raised_amount_usd',\
                          aggfunc = ('count','sum')).\
                          sort_values(by='count',ascending=False)

D1_pivot
# For Country GBR
print('For Country Great Britain (GBR)')
D2_pivot = D2.pivot_table(index=['main_sector'],\
                          values='raised_amount_usd',\
                          aggfunc = ('count','sum')).\
                          sort_values(by='count',ascending=False)
D2_pivot

# For Country IND
print('For Country India (IND)')
D3_pivot = D3.pivot_table(index=['main_sector'],\
                          values='raised_amount_usd',\
                          aggfunc = ('count','sum')).\
                          sort_values(by='count',ascending=False)
D3_pivot

For United States (USA)


main_sector,count,sum
Others,2950,26321007002.0
"Social, Finance, Analytics, Advertising",2714,23807376964.0
Cleantech / Semiconductors,2350,21633430822.0
"News, Search and Messaging",1583,13971567428.0
Health,909,8211859357.0
Manufacturing,799,7258553378.0
Entertainment,591,5099197982.0
Automotive & Sports,167,1454104361.0


For Country Great Britain (GBR)


main_sector,count,sum
Others,147,1283624289.0
"Social, Finance, Analytics, Advertising",133,1089404014.0
Cleantech / Semiconductors,130,1163990056.0
"News, Search and Messaging",73,615746235.0
Entertainment,56,482784687.0
Manufacturing,42,361940335.0
Health,24,214537510.0
Automotive & Sports,16,167051565.0


For Country India (IND)


main_sector,count,sum
Others,110,1013409507.0
"Social, Finance, Analytics, Advertising",60,550549550.0
"News, Search and Messaging",52,433834545.0
Entertainment,33,280830000.0
Manufacturing,21,200900000.0
Cleantech / Semiconductors,20,165380000.0
Health,19,167740000.0
Automotive & Sports,13,136900000.0


### Table 5.1 : Sector-wise Investment Analysis
>The questions below apply to the top 3 countries: country1(D1), country2(D2) and country3(D3).<br>
>**Question-3:** Top sector (based on count of investments)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the top sector based on count is `Others`
  - For Country2(D2) `GBR`, the top sector based on count is `Others`
  - For Country3(D3) `IND`, the top sector based on count is `Others`

>**Question-4:** Second-best sector (based on count of investments)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the second-best sector based on count is 
`Social, Finance, Analytics, Advertising`
  - For Country2(D2) `GBR`, the second-best sector based on count is 
`Social, Finance, Analytics, Advertising`
  - For Country3(D3) `IND`, the second-best sector based on count is 
`Social, Finance, Analytics, Advertising`<br>

>**Question-5:** Third-best sector (based on count of investments)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the third-best sector based on count is 
`Cleantech / Semiconductors`
  - For Country2(D2) `GBR`, the third-best sector based on count is 
`Cleantech / Semiconductors`
  - For Country3(D3) `IND`, the third-best sector based on count is 
`News, Search and Messaging`<br>

>**Question-6:** Number of investments in the top sector (refer to point 3)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the number of investments in the top sector is `2950`
  - For Country2(D2) `GBR`, the number of investments in the top sector is `147`
  - For Country3(D3) `IND`, the number of investments in the top sector is `110`<br>

>**Question-7:** Number of investments in the second-best sector (refer to point 4)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the number of investments in the second-best sector is `2714`
  - For Country2(D2) `GBR`, the number of investments in the second-best sector is `133`
  - For Country3(D3) `IND`, the number of investments in the second-best sector is `60`<br>

>**Question-8:** Number of investments in the third-best sector (refer to point 5)?<br>
>**Answer:** 
  - For Country1(D1) `USA`, the number of investments in the third-best sector is `2350`
  - For Country2(D2) `GBR`, the number of investments in the third-best sector is `130`
  - For Country3(D3) `IND`, the number of investments in the third-best sector is `52`<br>
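
The answers to Questions 3 to 8 are read off the sorted tables above. As a rough sketch, the same ranking could be produced by one small helper per country; `top_sectors_by_count` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def top_sectors_by_count(frame, n=3):
    """Return the n main sectors with the most investments (count and total amount)."""
    summary = frame.groupby("main_sector")["raised_amount_usd"].agg(["count", "sum"])
    return summary.sort_values("count", ascending=False).head(n)


# Toy data: 4 rounds in Others, 3 in Health, 2 in Manufacturing, 1 in Entertainment.
toy = pd.DataFrame({
    "main_sector": ["Others"] * 4 + ["Health"] * 3 + ["Manufacturing"] * 2 + ["Entertainment"],
    "raised_amount_usd": [5_000_000.0] * 10,
})
ranking = top_sectors_by_count(toy)
print(ranking)
```

Calling `top_sectors_by_count(D1)`, `top_sectors_by_count(D2)` and `top_sectors_by_count(D3)` would return the top, second-best and third-best sectors together with their counts.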

In [91]:
#Table 5.1 : Sector-wise Investment Analysis
# Question-9: For the top sector count-wise (point 3), which company received 
# the highest investment?

Top_Sector = 'Others'

# For Country USA
print('Top sector: highest-funded company for United States (USA)')
# The top sector count-wise is 'Others' - calculate the company-wise investments.
D1.loc[D1.main_sector == Top_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)
# For Country GBR
print('Top sector: highest-funded company for Great Britain (GBR)')
# The top sector count-wise is 'Others' - calculate the company-wise investments.
D2.loc[D2.main_sector == Top_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)

# For Country IND
print('Top sector: highest-funded company for India (IND)')
# The top sector count-wise is 'Others' - calculate the company-wise investments.
D3.loc[D3.main_sector == Top_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)

Top sector: highest-funded company for United States (USA)


permalink
/organization/virtustream   64,300,000.00
Name: raised_amount_usd, dtype: float64

Top sector: highest-funded company for Great Britain (GBR)


permalink
/organization/electric-cloud   37,000,000.00
Name: raised_amount_usd, dtype: float64

Top sector: highest-funded company for India (IND)


permalink
/organization/firstcry-com   39,000,000.00
Name: raised_amount_usd, dtype: float64

In [92]:
#Table 5.1 : Sector-wise Investment Analysis
# Question-9: For the top sector count-wise (point 3), which company received 
# the highest investment?

# Calculation using pivot_table function.

# The Top Sector count-wise (as per Question-3 above)
Top_Sector = 'Others'

#For country USA
print('For United States (USA)')
D1_company_pivot1 = D1.loc[D1.main_sector == Top_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})
D1_company_pivot1.sort_values(by='sum',ascending=False).head(3)

#For country GBR
print('For Country Great Britain (GBR)')
D2_company_pivot1 = D2.loc[D2.main_sector == Top_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})
D2_company_pivot1.sort_values(by='sum',ascending=False).head(3)

#For country IND 
print('For Country India (IND)')
D3_company_pivot1 = D3.loc[D3.main_sector == Top_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})
D3_company_pivot1.sort_values(by='sum',ascending=False).head(3)

For United States (USA)


permalink,sum
/organization/virtustream,64300000.0
/organization/capella,54968051.0
/organization/airtight-networks,54201907.0


For Country Great Britain (GBR)


permalink,sum
/organization/electric-cloud,37000000.0
/organization/sensage,36250000.0
/organization/enigmatic,32500000.0


For Country India (IND)


permalink,sum
/organization/firstcry-com,39000000.0
/organization/myntra,38000000.0
/organization/commonfloor,32900000.0


>**Question-9:** For the top sector count-wise (point 3), which company received the highest investment?<br>
>**Answer:** 
  - For Country1 `USA`, top sector count wise (`Others`), the company `virtustream` has received the highest investment of `$64,300,000.00`
  - For Country2 `GBR`, top sector count wise (`Others`), the company `electric-cloud` has received the highest investment of `$37,000,000.00`
  - For Country3 `IND`, top sector count wise (`Others`) , the company `firstcry-com` has received the highest investment of `$39,000,000.00`
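
In the cells above, "highest investment" means the largest sum over a company's rounds that fall inside the 5-15 million USD range, not its single largest round. A minimal sketch of that lookup, with a hypothetical helper `top_company` and made-up rows:

```python
import pandas as pd


def top_company(frame, sector):
    """Return (permalink, total amount) of the best-funded company in sector."""
    in_sector = frame.loc[frame["main_sector"] == sector]
    totals = in_sector.groupby("permalink")["raised_amount_usd"].sum()
    return totals.idxmax(), totals.max()


# Company a has two mid-sized rounds; company b has one larger round.
toy = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/a", "/organization/b", "/organization/c"],
    "main_sector": ["Others", "Others", "Others", "Health"],
    "raised_amount_usd": [6_000_000.0, 7_000_000.0, 12_000_000.0, 15_000_000.0],
})
name, amount = top_company(toy, "Others")
print(name, amount)
```

Here company `a` wins with two rounds totalling 13 million even though company `b` has the largest single round, which is why the reported totals can exceed the 15 million per-round cap.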

In [93]:
# Table 5.1 : Sector-wise Investment Analysis
# Question-10: For the second-best sector count-wise (point 4), 
# which company received the highest investment?


Second_Best_Sector = 'Social, Finance, Analytics, Advertising'

# For country USA
print('Top-funded company in the second-best sector: United States (USA)')
D1.loc[D1.main_sector == Second_Best_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)
# For country GBR
print('Top-funded company in the second-best sector: Great Britain (GBR)')
D2.loc[D2.main_sector == Second_Best_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)

# For country IND
print('Top-funded company in the second-best sector: India (IND)')

D3.loc[D3.main_sector == Second_Best_Sector,:].groupby('permalink')['raised_amount_usd'].\
                                                sum().sort_values(ascending=False).head(1)

Top-funded company in the second-best sector: United States (USA)


permalink
/organization/shotspotter   67,933,006.00
Name: raised_amount_usd, dtype: float64

Top-funded company in the second-best sector: Great Britain (GBR)


permalink
/organization/celltick-technologies   37,500,000.00
Name: raised_amount_usd, dtype: float64

Top-funded company in the second-best sector: India (IND)


permalink
/organization/manthan-systems   50,700,000.00
Name: raised_amount_usd, dtype: float64
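
The three lookups above differ only in the data frame they filter, so they could be folded into one helper. The following is a minimal sketch, assuming each frame carries the `main_sector`, `permalink` and `raised_amount_usd` columns used in `D1`, `D2` and `D3`. The helper name `top_companies` and the toy rows are hypothetical and are not part of the project data.

```python
import pandas as pd


def top_companies(df, sector, n=1):
    """Return the n companies with the largest total funding in one sector."""
    # Keep only the rows that belong to the requested main sector.
    in_sector = df.loc[df["main_sector"] == sector]
    # Sum every funding round per company, then rank from largest to smallest.
    totals = in_sector.groupby("permalink")["raised_amount_usd"].sum()
    return totals.sort_values(ascending=False).head(n)


# Toy stand-in for D1/D2/D3 (made-up rows, for illustration only).
toy = pd.DataFrame(
    {
        "permalink": [
            "/organization/a",
            "/organization/a",
            "/organization/b",
            "/organization/c",
        ],
        "main_sector": ["Others", "Others", "Others", "Health"],
        "raised_amount_usd": [6e6, 7e6, 12e6, 9e6],
    }
)

top_others = top_companies(toy, "Others")
print(top_others)
```

With the real frames, the same call could be repeated for each country, for example `top_companies(D1, Second_Best_Sector)`, so that the filtering logic lives in one place.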

In [94]:
# Table 5.1 : Sector-wise Investment Analysis
# Question-10: For the second-best sector count-wise (point 4), 
# which company received the highest investment?

# Second Best Sector.
Second_Best_Sector = 'Social, Finance, Analytics, Advertising'

#Using pivot_table function.

#For country USA
print('For United States (USA)')
D1_company_pivot2 = D1.loc[D1.main_sector == Second_Best_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})

D1_company_pivot2.sort_values(by='sum',ascending=False).head(3)

#For country GBR
print('For Country Great Britain (GBR)')
D2_company_pivot2 = D2.loc[D2.main_sector == Second_Best_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})

D2_company_pivot2.sort_values(by='sum',ascending=False).head(3)

#For country IND 
print('For Country India (IND)')
D3_company_pivot2 = D3.loc[D3.main_sector == Second_Best_Sector,:]\
                         .pivot_table(index='permalink',\
                                      values='raised_amount_usd',
                                      aggfunc={'sum'})

D3_company_pivot2.sort_values(by='sum',ascending=False).head(3)

For United States (USA)


permalink,sum
/organization/shotspotter,67933006.0
/organization/demandbase,63000000.0
/organization/intacct,61800000.0


For Country Great Britain (GBR)


permalink,sum
/organization/celltick-technologies,37500000.0
/organization/mythings,34000000.0
/organization/zopa,32900000.0


For Country India (IND)


permalink,sum
/organization/manthan-systems,50700000.0
/organization/komli-media,28000000.0
/organization/shopclues-com,25000000.0


**Question-10:** 
>For the second-best sector count-wise (point 4), which company received the highest investment?<br>
**Answer:** 
>- For Country1 `USA`, second-best sector count-wise (`Social, Finance, Analytics, Advertising`), the company `shotspotter` has received the highest investment of `$67,933,006.00`<br>
>- For Country2 `GBR`, second-best sector count-wise (`Social, Finance, Analytics, Advertising`), the company `celltick-technologies` has received the highest investment of `$37,500,000.00`<br>
>- For Country3 `IND`, second-best sector count-wise (`Social, Finance, Analytics, Advertising`), the company `manthan-systems` has received the highest investment of `$50,700,000.00`<br>
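
The answer to Question-10 was reached by two routes, `groupby` followed by `sum` and `pivot_table`, and both should yield the same per-company totals. The sketch below checks this on a few made-up rows; the toy frame and the variable names are hypothetical. It passes the string `aggfunc='sum'`, so the resulting column is named after the value field, whereas the notebook cell passes `{'sum'}` and sorts by a `sum` column.

```python
import pandas as pd

# Made-up funding rows, for illustration only.
toy = pd.DataFrame(
    {
        "permalink": [
            "/organization/a",
            "/organization/b",
            "/organization/a",
            "/organization/c",
        ],
        "raised_amount_usd": [5e6, 11e6, 8e6, 10e6],
    }
)

# Route 1: groupby + sum gives a Series indexed by permalink.
by_groupby = toy.groupby("permalink")["raised_amount_usd"].sum()

# Route 2: pivot_table with a string aggfunc gives a one-column DataFrame
# whose column keeps the name of the aggregated field.
by_pivot = toy.pivot_table(
    index="permalink", values="raised_amount_usd", aggfunc="sum"
)

# Compare the per-company totals and pick the top-funded company.
same_totals = by_groupby.to_dict() == by_pivot["raised_amount_usd"].to_dict()
top_permalink = by_groupby.idxmax()
print(same_totals, top_permalink)
```

Either route is adequate for this question; `groupby` returns a Series that is convenient for a single total, while `pivot_table` returns a DataFrame that is easier to extend with further aggregations.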