# Uncovering Trends and Opportunities in the Indian Startup Ecosystem: A Python-Based Approach


# Description


In this project, Python is used to analyze and visualize industry data and 
identify key trends and opportunities in the Indian startup market. 
The analysis covers funding trends, the geographic distribution of 
start-ups, funding sources, and the industrial sectors in which start-ups operate. 
The insights gained from this project are intended to help venture capitalists 
identify promising investment opportunities early.


# Questions


1-What is the funding trend in the Indian start-up ecosystem over the past few years?

2-Which industries have received the most funding year on year?

3-Who are the top investors and what initiatives do they typically invest in?

4-Where are the start-ups located and in what industries?

5-What is the performance of the Indian start-up ecosystem in terms of funding rounds, 
and how does this vary across the years?


# Hypothesis

Hypothesis:
The location of a start-up has an impact on the amount of funding it is able to secure.

Null hypothesis:
The location of a start-up has no significant influence on the amount of funding it raises.

Alternative hypothesis:
The location of a start-up significantly influences the amount of funding it raises.
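
Once the amounts are numeric, one way such a hypothesis could be checked is a permutation test on the difference in mean funding between two cities. The sketch below is only an illustration: the notebook does not name a statistical test, `permutation_p_value` is a hypothetical helper, and the figures are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean funding."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the city labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up funding amounts (USD) for two hypothetical cities.
city_a = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
city_b = [101_000_000, 102_000_000, 103_000_000, 104_000_000, 105_000_000]
p_value = permutation_p_value(city_a, city_b)
```

A small `p_value` would count as evidence against the null hypothesis; with more than two cities, a test designed for several groups would be the more natural choice.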


# INSTALLATION

In [1]:
pip install jupyter-summarytools

Note: you may need to restart the kernel to use updated packages.


# IMPORTING LIBRARIES

In [2]:
# libraries to use
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from summarytools import dfSummary
import warnings
warnings.filterwarnings('ignore')

# LOADING DATA 

NB: Only the columns needed to answer the questions posed above were loaded.

In [3]:
# load only the columns needed; the 2018 file uses different column names and has no investor column
su_18 = pd.read_csv('startup_funding2018.csv', usecols=['Company Name', 'Location', 'Industry', 'Amount', 'Round/Series'])
su_19 = pd.read_csv('startup_funding2019.csv', usecols=['Company/Brand', 'HeadQuarter', 'Sector', 'Investor', 'Amount($)', 'Stage'])
su_20 = pd.read_csv('startup_funding2020.csv', usecols=['Company/Brand', 'HeadQuarter', 'Sector', 'Investor', 'Amount($)', 'Stage'])
su_21 = pd.read_csv('startup_funding2021.csv', usecols=['Company/Brand', 'HeadQuarter', 'Sector', 'Investor', 'Amount($)', 'Stage'])

# EXPLORATORY DATA ANALYSIS:EDA

Here we inspect the datasets in depth, year by year and column by column, in order to clean and process them.

<b>2018 dataset inspection and cleaning</b>

In [4]:
su_18.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India"
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India"
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India"
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India"
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India"


In [5]:
su_18["Company Name"].value_counts()

TheCollegeFever                      2
NIRAMAI Health Analytix              1
Drivezy                              1
Hush - Speak Up. Make Work Better    1
The Souled Store                     1
                                    ..
Qandle                               1
iChamp                               1
Credy                                1
Survaider                            1
Netmeds                              1
Name: Company Name, Length: 525, dtype: int64

In [6]:
su_19["Company/Brand"].value_counts()

Kratikal            2
Licious             2
Bombay Shaving      1
KredX               1
Euler Motors        1
                   ..
HungerBox           1
Fireflies .ai       1
Toffee Insurance    1
Seekify             1
Ess Kay Fincorp     1
Name: Company/Brand, Length: 87, dtype: int64

<b>First, add a Year column to each dataset to record the year it covers.</b>

In [7]:
su_18['Year']='2018'
su_19['Year']='2019'
su_20['Year']='2020'
su_21['Year']='2021'


In [8]:
dfSummary(su_18, is_collapsible = True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Company Name [object],1. TheCollegeFever 2. NIRAMAI Health Analytix 3. Drivezy 4. Hush - Speak Up. Make Work Bet 5. The Souled Store 6. Perspectico 7. Kogta Financial India Limited 8. Hospals 9. UrbanClap 10. Square Off 11. other,2 (0.4%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 515 (97.9%),,0 (0.0%)
2,Industry [object],"1. — 2. Financial Services 3. Education 4. Information Technology 5. Health Care, Hospital 6. Finance, Financial Services 7. Fitness, Health Care, Wellness 8. Internet 9. Artificial Intelligence 10. Health Care 11. other",30 (5.7%) 15 (2.9%) 8 (1.5%) 7 (1.3%) 5 (1.0%) 5 (1.0%) 4 (0.8%) 4 (0.8%) 4 (0.8%) 4 (0.8%) 440 (83.7%),,0 (0.0%)
3,Round/Series [object],1. Seed 2. Series A 3. Angel 4. Venture - Series Unknown 5. Series B 6. Series C 7. Debt Financing 8. Private Equity 9. Corporate Round 10. Pre-Seed 11. other,280 (53.2%) 73 (13.9%) 37 (7.0%) 37 (7.0%) 20 (3.8%) 16 (3.0%) 13 (2.5%) 10 (1.9%) 8 (1.5%) 6 (1.1%) 26 (4.9%),,0 (0.0%)
4,Amount [object],"1. — 2. 1000000 3. 500000 4. 2000000 5. ₹50,000,000 6. ₹20,000,000 7. 4000000 8. 5000000 9. 250000 10. ₹40,000,000 11. other",148 (28.1%) 24 (4.6%) 13 (2.5%) 12 (2.3%) 9 (1.7%) 8 (1.5%) 7 (1.3%) 7 (1.3%) 6 (1.1%) 6 (1.1%) 286 (54.4%),,0 (0.0%)
5,Location [object],"1. Bangalore, Karnataka, India 2. Mumbai, Maharashtra, India 3. Bengaluru, Karnataka, India 4. Gurgaon, Haryana, India 5. New Delhi, Delhi, India 6. Pune, Maharashtra, India 7. Chennai, Tamil Nadu, India 8. Hyderabad, Andhra Pradesh, Ind 9. Delhi, Delhi, India 10. Noida, Uttar Pradesh, India 11. other",102 (19.4%) 94 (17.9%) 55 (10.5%) 52 (9.9%) 51 (9.7%) 20 (3.8%) 19 (3.6%) 18 (3.4%) 16 (3.0%) 15 (2.9%) 84 (16.0%),,0 (0.0%)
6,Year [object],1. 2018,526 (100.0%),,0 (0.0%)


<b>Issues identified with the 2018 dataset</b>

-The Amount column mixes Indian rupees, US dollars, and figures with no currency symbol.

-The Amount column is stored as an object; it has to be converted to a numeric type before numerical insights can be drawn from it.

-The Location column lists the city, state and country. It has to be stripped down to the city name only.

-Some locations appear under their official name in some rows and their unofficial name in others. These have to be harmonized.

-The 2018 dataset does not have a column for investors; one will have to be created to aid the analysis.

-The Industry column lists several industries per row. Each row has to be reduced to a single industry.

-The Industry column has slightly different spellings of the same industry.

-Some column names have to be changed to be consistent with those of 2019, 2020 and 2021.
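
The currency issue in the list above could be handled by a small parsing function. This is a minimal sketch under stated assumptions: `parse_amount` and `INR_PER_USD` are hypothetical names, the exchange rate is a placeholder rather than a value from the notebook, and figures without a symbol are assumed to be in US dollars already.

```python
import re

# Placeholder rate: an assumption for illustration, not a value from the notebook.
INR_PER_USD = 70.0


def parse_amount(raw, inr_per_usd=INR_PER_USD):
    """Return the amount in USD as a float, or None when it is undisclosed."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Drop currency symbols, thousands separators and any other non-numeric text.
    digits = re.sub(r"[^\d.]", "", text)
    try:
        value = float(digits)
    except ValueError:
        return None  # covers dash placeholders and 'Undisclosed'
    # Assumption: figures without a rupee sign are already in USD.
    return value / inr_per_usd if text.startswith("₹") else value
```

Applied to the amount column with `Series.map(parse_amount)`, this would yield a numeric column with missing values where the amount was not disclosed.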


<b>NB: THE ISSUES SPECIFIC TO THE 2018 DATASET ARE DEALT WITH BEFORE IT IS INTEGRATED WITH THE OTHER DATASETS FOR HOLISTIC INSPECTION AND CLEANING.</b>

In [9]:
#harmonizing the 2018 dataset columns to match the other three datasets
su_18.columns=['Company/Brand','Sector','Stage','Amount($)','HeadQuarter','Year']

su_18

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",2018
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India",2018
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",2018
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",2018
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018


SECTOR COLUMN

<b>The aim here is to keep only the first description listed in the Sector column and drop the rest.</b>

In [10]:
df_18=su_18['Sector'].str.split(pat=',', n=1, expand=True)
su_18['industry1']=df_18[0]

su_18

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,Year,industry1
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms
...,...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India",2018,B2B
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",2018,Tourism
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology


In [11]:
su_18.drop('Sector', axis=1, inplace=True)


In [12]:
su_18=su_18.rename(columns={'industry1':'Sector'})
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),HeadQuarter,Year,Sector
0,TheCollegeFever,Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing
1,Happy Cow Dairy,Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture
2,MyLoanCare,Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit
3,PayMe India,Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services
4,Eunimart,Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,"Bangalore, Karnataka, India",2018,B2B
522,Happyeasygo Group,Series A,—,"Haryana, Haryana, India",2018,Tourism
523,Mombay,Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage
524,Droni Tech,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology


HEADQUARTER COLUMN

<b>The aim is to extract the city name from the string in the HeadQuarter column.</b>

In [13]:
su18=su_18['HeadQuarter'].str.split(pat=',', n=1, expand=True)
su_18['location']=su18[0]

su_18

Unnamed: 0,Company/Brand,Stage,Amount($),HeadQuarter,Year,Sector,location
0,TheCollegeFever,Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services,Noida
4,Eunimart,Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...,...
521,Udaan,Series C,225000000,"Bangalore, Karnataka, India",2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,"Haryana, Haryana, India",2018,Tourism,Haryana
523,Mombay,Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology,Mumbai


In [14]:
su_18.drop('HeadQuarter', axis=1, inplace=True)
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),Year,Sector,location
0,TheCollegeFever,Seed,250000,2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,2018,Financial Services,Noida
4,Eunimart,Seed,—,2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,2018,Tourism,Haryana
523,Mombay,Seed,7500,2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000",2018,Information Technology,Mumbai


In [15]:
su_18=su_18.rename(columns={'location':'HeadQuarter'})
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),Year,Sector,HeadQuarter
0,TheCollegeFever,Seed,250000,2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,2018,Financial Services,Noida
4,Eunimart,Seed,—,2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,2018,Tourism,Haryana
523,Mombay,Seed,7500,2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000",2018,Information Technology,Mumbai
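
The split, drop and rename steps used above for Sector and HeadQuarter could also be condensed into one helper that overwrites each column in place. This is a sketch on a two-row toy frame built from values shown in the outputs; `first_token` is a hypothetical helper name.

```python
import pandas as pd


def first_token(series):
    """Keep the text before the first comma and trim surrounding spaces."""
    return series.str.split(",", n=1).str[0].str.strip()


toy = pd.DataFrame({
    "Sector": ["Agriculture, Farming", "Information Technology"],
    "HeadQuarter": ["Mumbai, Maharashtra, India", "Noida, Uttar Pradesh, India"],
})

# Overwrite each column with its first comma-separated entry.
for column in ["Sector", "HeadQuarter"]:
    toy[column] = first_token(toy[column])
```

Unlike the drop-and-rename approach, overwriting keeps the original column order.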


# MERGING THE FOUR DATASETS

NB: The four datasets are combined into a single DataFrame called df_startup, whose columns will then be inspected and cleaned. Besides Year, it holds six relevant columns: Company/Brand, Stage, Amount($), Sector, HeadQuarter and Investor.

In [16]:
df_startup=pd.concat([su_18,su_19,su_20,su_21],ignore_index=True)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
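
The 2018 data has no Investor column, yet the combined frame does. The toy sketch below shows why: `pd.concat` aligns frames on column names and fills the gaps with NaN. The frames and the investor value are placeholders, not real data.

```python
import pandas as pd

frame_2018 = pd.DataFrame({"Company/Brand": ["TheCollegeFever"], "Year": ["2018"]})
frame_2019 = pd.DataFrame({
    "Company/Brand": ["Kratikal"],
    "Investor": ["Hypothetical Investor"],  # placeholder value, not real data
    "Year": ["2019"],
})

# Columns missing from one frame are created and filled with NaN for its rows.
combined = pd.concat([frame_2018, frame_2019], ignore_index=True)
```

This is why every 2018 row in df_startup has a missing Investor value.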

In [17]:
dfSummary(df_startup, is_collapsible = True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Company/Brand [object],1. BharatPe 2. Nykaa 3. Zomato 4. Spinny 5. Zetwerk 6. MPL 7. Trell 8. Vedantu 9. Dunzo 10. Unacademy 11. other,"10 (0.3%) 7 (0.2%) 7 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 5 (0.2%) 5 (0.2%) 2,815 (97.8%)",,0 (0.0%)
2,Stage [object],1. nan 2. Seed 3. Series A 4. Pre-series A 5. Series B 6. Series C 7. Seed Round 8. Pre series A 9. Pre-seed 10. Series D 11. other,938 (32.6%) 606 (21.0%) 305 (10.6%) 211 (7.3%) 134 (4.7%) 114 (4.0%) 69 (2.4%) 62 (2.2%) 58 (2.0%) 50 (1.7%) 332 (11.5%),,938 (32.6%)
3,Amount($) [object],"1. Undisclosed 2. — 3. $1,000,000 4. $Undisclosed 5. $2,000,000 6. $1000000 7. $3,000,000 8. $5,000,000 9. $2000000 10. $10,000,000 11. other","298 (10.4%) 148 (5.1%) 93 (3.2%) 73 (2.5%) 64 (2.2%) 52 (1.8%) 46 (1.6%) 44 (1.5%) 40 (1.4%) 35 (1.2%) 1,986 (69.0%)",,6 (0.2%)
4,Year [object],1. 2021 2. 2020 3. 2018 4. 2019,"1,209 (42.0%) 1,055 (36.6%) 526 (18.3%) 89 (3.1%)",,0 (0.0%)
5,Sector [object],1. FinTech 2. EdTech 3. Financial Services 4. Fintech 5. Edtech 6. E-commerce 7. Automotive 8. AgriTech 9. Food & Beverages 10. Logistics 11. other,"175 (6.1%) 148 (5.1%) 88 (3.1%) 85 (3.0%) 74 (2.6%) 73 (2.5%) 54 (1.9%) 43 (1.5%) 39 (1.4%) 37 (1.3%) 2,063 (71.7%)",,18 (0.6%)
6,HeadQuarter [object],1. Bangalore 2. Mumbai 3. Gurugram 4. New Delhi 5. nan 6. Chennai 7. Pune 8. Delhi 9. Noida 10. Gurgaon 11. other,866 (30.1%) 474 (16.5%) 239 (8.3%) 232 (8.1%) 114 (4.0%) 106 (3.7%) 105 (3.6%) 88 (3.1%) 86 (3.0%) 80 (2.8%) 489 (17.0%),,114 (4.0%)
7,Investor [object],1. nan 2. Inflection Point Ventures 3. Venture Catalysts 4. Mumbai Angels Network 5. Angel investors 6. Undisclosed 7. Tiger Global 8. Titan Capital 9. Unicorn India Ventures 10. Better Capital 11. other,"626 (21.7%) 36 (1.3%) 32 (1.1%) 17 (0.6%) 15 (0.5%) 13 (0.5%) 12 (0.4%) 11 (0.4%) 10 (0.3%) 9 (0.3%) 2,098 (72.9%)",,626 (21.7%)


<b>ISSUES IDENTIFIED WITH THE NEW DATAFRAME THAT NEED TO BE ADDRESSED</b>

-df_startup has 25 duplicate entries.

-The Amount column is stored as an object datatype; it has to be converted to a float.

-The Amount column mixes Indian rupees, US dollars, and figures with no currency symbol. This has to be harmonized.

-In the Stage column some of the stages are spelt slightly differently; these need to be harmonised to a single spelling.

<b>CHECK FOR DUPLICATE ENTRIES IN THE NEW DATASET df_startup</b>

In [None]:
#CHECKING FOR THE DUPLICATES

df_startup.duplicated().value_counts()

In [None]:
#DROP THE DUPLICATE ENTRIES IN PLACE, KEEPING THE FIRST OCCURRENCE OF EACH

df_startup.drop_duplicates(keep='first', inplace=True)

In [None]:
#CONFIRM THERE ARE NO DUPLICATE ENTRIES IN THE DATAFRAME df_startup

df_startup.duplicated().value_counts()
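
One detail worth keeping in mind is that `drop_duplicates` preserves the original index labels, which matters because later cells drop rows by label. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Company/Brand": ["A", "A", "B"],
    "Stage": ["Seed", "Seed", "Seed"],
})

# Count rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Without inplace=True a new frame is returned and toy is left unchanged.
deduped = toy.drop_duplicates(keep="first")
```

Here `deduped` keeps labels 0 and 2; the labels are not renumbered unless `reset_index` is called.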

<b>STAGE COLUMN</b>

<b>Harmonising the column entries.</b>

In [None]:
# harmonise stage labels that differ only in spelling or wording
# 'Seed Round' and 'Pre-series A' are included because both appear in the summary above
df_startup.replace(to_replace=['Seed Round','Seed round','Seed funding','Early seed','Seed fund'], value='Seed', inplace=True)
df_startup.replace(to_replace=['Debt Financing'], value='Debt', inplace=True)
df_startup.replace(to_replace=['Venture - Series Unknown'], value='Undisclosed', inplace=True)
df_startup.replace(to_replace=['Angel Round'], value='Angel', inplace=True)
df_startup.replace(to_replace=['Pre-series A'], value='Pre series A', inplace=True)
df_startup.replace(to_replace=['Pre-Series B','Pre-series B'], value='Pre series B', inplace=True)
df_startup.replace(to_replace=['Pre-seed'], value='Pre Seed', inplace=True)
df_startup.replace(to_replace=['Seis A'], value='Series A', inplace=True)
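
The replacements above list each spelling by hand. As an alternative sketch, the labels could be normalized first (folding case, hyphens and extra spaces) and then looked up once. `normalize_stage` and `CANONICAL_STAGE` are hypothetical names; the canonical labels follow the replacements above.

```python
import re

# Keys are normalized forms (lower case, single spaces, no hyphens).
CANONICAL_STAGE = {
    "seed round": "Seed",
    "seed funding": "Seed",
    "early seed": "Seed",
    "seed fund": "Seed",
    "debt financing": "Debt",
    "angel round": "Angel",
    "pre seed": "Pre Seed",
    "pre series b": "Pre series B",
    "seis a": "Series A",
}


def normalize_stage(raw):
    """Return the canonical stage label, or the trimmed input if it is unknown."""
    if not isinstance(raw, str):
        return raw  # leave NaN/None untouched
    key = re.sub(r"[\s\-]+", " ", raw).strip().lower()
    return CANONICAL_STAGE.get(key, raw.strip())
```

With this approach, variants such as 'Seed Round', 'seed round' and 'Seed-round' all resolve to the same label without being listed individually.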

<b>Drop the rows that are not fit for purpose.</b>

In [22]:
df_startup.drop([178,1768,2208,2221,2244,2247], inplace=True)

<b>We replace NaN in the Stage column with Undisclosed.</b>

In [37]:
df_startup['Stage'] = df_startup['Stage'].fillna('Undisclosed')


SECTOR COLUMN

<b>NB: The sectors are harmonised here to make them consistent. Some sectors were spelt slightly differently, which meant they were counted as separate groups.</b>

In [61]:
df_startup.replace(to_replace=['Fintech'], value='FinTech', inplace=True)
df_startup.replace(to_replace=['EdTech Startup','Edtech','EdtTech'], value='EdTech', inplace=True)
df_startup.replace(to_replace=['Insurance','Banking','Credit','Consumer Lending','Accounting','Finance company','Finance','Capital Markets','Venture Capital & Private Equity','Venture capital',], value='Financial Services', inplace=True)
df_startup.replace(to_replace=['E-Commerce','Ecommerce','Social e-commerce','E-marketplace'], value='E-commerce', inplace=True)
df_startup.replace(to_replace=['Automotive & Rentals','Automobiles'], value='Automotive', inplace=True)
df_startup.replace(to_replace=['Agritech','B2B Agritech'], value='AgriTech', inplace=True)
df_startup.replace(to_replace=['Food and Beverage','Beverages','Beverage'], value='Food & Beverages', inplace=True)
df_startup.replace(to_replace=['Logistics & Supply Chain'], value='Logistics', inplace=True)
df_startup.replace(to_replace=['Information Technology & Services','Tech','Internet','IT'], value='Information Technology', inplace=True)
df_startup.replace(to_replace=['Gaming startup','Computer Games'], value='Gaming', inplace=True)
df_startup.replace(to_replace=['HealthCare','Health Care','Health,Wellness & Fitness','Fitness','Hospital & Health Care','Health','Fitness startup','Yoga & wellness','Health & Wellness','Helath care','Health Diagnostics','Heathcare'], value='Healthcare', inplace=True)
df_startup.replace(to_replace=['HealthTech'], value='Healthtech', inplace=True)
df_startup.replace(to_replace=['SaaS startup','SaaS platform'], value='SaaS', inplace=True)
df_startup.replace(to_replace=['FMCG'], value='Consumer Goods', inplace=True)
df_startup.replace(to_replace=['Online Media'], value='Media', inplace=True)
df_startup.replace(to_replace=['Retail startup','Consumer'], value='Retail', inplace=True)
df_startup.replace(to_replace=['Apps','Tech Company','Technology','Information Services','Tech startup'], value='Tech Startup', inplace=True)
df_startup.replace(to_replace=['AI','Artificial Intelligence','AI Company','AI startup'], value='AI Startup', inplace=True)
df_startup.replace(to_replace=['Computer softwre','Software','Computer','Software Startup'], value='Computer Software', inplace=True)
df_startup.replace(to_replace=['Apparel & Fashion','Fashion startup'], value='Fashion', inplace=True)
df_startup.replace(to_replace=['B2B Service','B2B marketplace','B2B Ecommerce','B2B E-commerce','B2B startup','B2B service'], value='B2B', inplace=True)
df_startup.replace(to_replace=['Food','Foodtech','Food tech'], value='FoodTech', inplace=True)
df_startup.replace(to_replace=['Internet of Things'], value='IoT', inplace=True)
df_startup.replace(to_replace=['Farming'], value='Agriculture', inplace=True)
df_startup.replace(to_replace=['Deeptech'], value='DeepTech', inplace=True)
df_startup.replace(to_replace=['Insuretech','Insurance technology'], value='InsureTech', inplace=True)
df_startup.replace(to_replace=['Rental space'], value='Rental', inplace=True)
df_startup.replace(to_replace=['Food Delivery','Delivery Service'], value='Delivery', inplace=True)
df_startup.replace(to_replace=['Marketing & Advertising','Brand Marketing','Market Research','Marketing startup'], value='Marketing', inplace=True)
df_startup.replace(to_replace=['Biotechnology'], value='BioTechnology', inplace=True)
df_startup.replace(to_replace=['Cleantech'], value='CleanTech', inplace=True)
df_startup.replace(to_replace=['Crypto'], value='Cryptocurrency', inplace=True)
df_startup.replace(to_replace=['Interior design'], value='Interior Design', inplace=True)
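
The long run of `replace` calls above could instead be driven from a single dictionary. The sketch below copies only a few of the groups from the cell above; `SECTOR_GROUPS`, `SECTOR_LOOKUP` and `harmonize_sector` are hypothetical names.

```python
# A few of the groups from the cell above, keyed by the canonical sector name.
SECTOR_GROUPS = {
    "FinTech": ["Fintech"],
    "EdTech": ["EdTech Startup", "Edtech", "EdtTech"],
    "AgriTech": ["Agritech", "B2B Agritech"],
    "Logistics": ["Logistics & Supply Chain"],
}

# Invert to variant -> canonical so each value needs a single dictionary lookup.
SECTOR_LOOKUP = {
    variant: canonical
    for canonical, variants in SECTOR_GROUPS.items()
    for variant in variants
}


def harmonize_sector(value):
    """Map a known variant to its canonical name; pass everything else through."""
    return SECTOR_LOOKUP.get(value, value)
```

Keeping the groups in one structure makes duplicates and conflicting mappings easier to spot, and `Series.map(harmonize_sector)` would apply it to the Sector column alone.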

<b>HEADQUARTER COLUMN</b>

<b>ISSUES WITH THE HEADQUARTER COLUMN</b>

-Names of the same locations are spelt differently.

-Some locations listed are outside India.

-Some of the locations listed are Indian states rather than cities, whereas the analysis is at city level.

-Some locations are districts of cities that are already listed.

<b>Replacing the HeadQuarter names with the correct ones.</b>

In [68]:
df_startup.replace(to_replace=['Bangalore City','Bangalore'], value='Bengaluru', inplace=True)
df_startup.replace(to_replace=['New Delhi','Azadpor'], value='Delhi', inplace=True)
df_startup.replace(to_replace=['Ahmadabad'], value='Ahmedabad', inplace=True)
df_startup.replace(to_replace=['Kochi'], value='Cochin', inplace=True)
df_startup.replace(to_replace=['Kormangala'], value='Koramangala', inplace=True)
df_startup.replace(to_replace=['Jaipur, Rajastan'], value='Jaipur', inplace=True)
df_startup.replace(to_replace=['Faridabad, Haryana'], value='Faridabad', inplace=True)
df_startup.replace(to_replace=['Powai','Worli'], value='Mumbai', inplace=True)
df_startup.replace(to_replace=['Small Towns, Andhra Pradesh'], value='Andhra Pradesh', inplace=True)
df_startup.replace(to_replace=['Hyderebad'], value='Hyderabad', inplace=True)
df_startup.replace(to_replace=['Gurugram\t#REF!'], value='Gurugram', inplace=True)
df_startup.replace(to_replace=['Orissia'], value='Orissa', inplace=True)
df_startup.replace(to_replace=['Samstipur','Samastipur, Bihar','Samsitpur'], value='Samastipur', inplace=True)
df_startup.replace(to_replace=['The Nilgiris'], value='Nilgiris', inplace=True)
df_startup.replace(to_replace=['Dhindsara, Haryana'], value='Dhingsara', inplace=True)
df_startup.replace(to_replace=['Tirunelveli'], value='Tirunelveli', inplace=True)
df_startup.replace(to_replace=['Mylapore'], value='Chennai', inplace=True)
df_startup.replace(to_replace=['Rajastan'], value='Rajasthan', inplace=True)
df_startup.replace(to_replace=['Trivandrum, Kerala, India'], value='Trivandrum', inplace=True)
df_startup.replace(to_replace=['Mangalore'], value='Mangaluru', inplace=True)
df_startup.replace(to_replace=['Tumkur, Karnataka'], value='Tumkur', inplace=True)


<b>Now we drop the HeadQuarter entries that are not fit for the analysis.</b>

In [71]:
df_startup.drop([42,59,706,781,791,801,838,840,844,845,847,855,860,875,876,877,879,880,888,889,894,902,907,915,916,918,921,932,984,988,999,1001,1003,1005,1006,1012,1014,1015,1035,1072,1073,1074,1098,1911,1912,1925,1926,2422,2571,2590,2770,2843,2846], inplace=True)
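
Hard-coded index labels only stay valid while the row order and earlier drops are unchanged. As a hedged alternative, the same kind of filter could be expressed as a condition on the column values. The exclusion list below is hypothetical; the notebook identifies the dropped rows by label only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Company/Brand": ["A", "B", "C"],
    "HeadQuarter": ["Mumbai", "London", "Pune"],
})

# Hypothetical exclusion list, for illustration only.
not_in_scope = {"London"}

# Keep the rows whose HeadQuarter is not in the exclusion list.
kept = toy[~toy["HeadQuarter"].isin(not_in_scope)]
```

A value-based filter documents why each row is removed and keeps working if the data is reloaded in a different order.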

In [77]:
df_startup.Investor

0                                                     NaN
1                                                     NaN
2                                                     NaN
3                                                     NaN
4                                                     NaN
5                                                     NaN
6                                                     NaN
7                                                     NaN
8                                                     NaN
9                                                     NaN
10                                                    NaN
11                                                    NaN
12                                                    NaN
13                                                    NaN
14                                                    NaN
15                                                    NaN
16                                                    NaN
17            

In [81]:
df_startup.columns[2]

'Amount($)'