# Uncovering Trends and Opportunities in the Indian Startup Ecosystem: A Python-Based Approach


# Description


In this project, Python is used to analyze and visualize industry data and
identify key trends and opportunities in the Indian start-up market.
The analysis covers funding trends, the geographic distribution of
the start-ups, funding sources, and the industry sectors in which the start-ups operate.
The insights gained from this project are intended to help venture capitalists
stay ahead of the curve and identify promising investment opportunities.


# Questions


1. What is the funding trend in the Indian start-up ecosystem over the past few years?

2. Which industries have received the most funding year on year?

3. Who are the top investors and what initiatives do they typically invest in?

4. Where are the start-ups located and in what industries?

5. What is the performance of the Indian start-up ecosystem in terms of funding rounds,
and how does this vary across the years?


# Hypothesis

Hypothesis:
The location of a start-up has an impact on the amount of funding it is able to secure.

Null hypothesis:
The location of a start-up has no significant influence on the amount of funding it raises.

Alternate hypothesis:
The location of a start-up significantly influences the amount of funding it raises.
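
Once the amounts are numeric, one possible way to test the null hypothesis is a permutation test on the difference in median funding between two locations. The sketch below is not part of the original analysis: `permutation_pvalue` is a hypothetical helper, and the two lists hold made-up figures used only to show the mechanics.

```python
import numpy as np

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in medians of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(np.median(a) - np.median(b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled amounts, then split them back into two groups
        shuffled = rng.permutation(pooled)
        diff = abs(np.median(shuffled[:a.size]) - np.median(shuffled[a.size:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_perm + 1)

# Made-up funding amounts in USD for two hypothetical groups of start-ups
city_a = [250_000, 1_000_000, 2_000_000, 5_000_000, 12_000_000, 30_000_000]
city_b = [100_000, 300_000, 500_000, 750_000, 1_000_000, 1_500_000]

p_value = permutation_pvalue(city_a, city_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value would count as evidence against the null hypothesis for that pair of locations. Comparing many locations at once would need a different test or a correction for multiple comparisons.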


# INSTALLATION

In [1]:
pip install jupyter-summarytools

Note: you may need to restart the kernel to use updated packages.


# IMPORTING LIBRARIES

In [2]:
# libraries to use
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from summarytools import dfSummary
import warnings
warnings.filterwarnings('ignore')

# LOADING DATA 

NB: Only the columns needed to answer the questions posed above were loaded.

In [3]:
su_18=pd.read_csv('startup_funding2018.csv', usecols=['Company Name','Location','Industry','Amount','Round/Series'])
su_19=pd.read_csv('startup_funding2019.csv', usecols=['Company/Brand','HeadQuarter','Sector','Investor','Amount($)','Stage'])
su_20=pd.read_csv('startup_funding2020.csv',usecols=['Company/Brand','HeadQuarter','Sector','Investor','Amount($)','Stage'])
su_21=pd.read_csv('startup_funding2021.csv', usecols=['Company/Brand','HeadQuarter','Sector','Investor','Amount($)','Stage'])

# EXPLORATORY DATA ANALYSIS (EDA)

Here we inspect the datasets in depth, year by year and column by column, in order to clean and process them.

<b>2018 Dataset inspection and cleaning</b>

In [4]:
su_18.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India"
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India"
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India"
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India"
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India"


<b>First, add a Year column to each dataset to record the year it covers.</b>

In [5]:
su_18['Year']='2018'
su_19['Year']='2019'
su_20['Year']='2020'
su_21['Year']='2021'


In [6]:
dfSummary(su_18, is_collapsible = True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Company Name [object],1. TheCollegeFever 2. NIRAMAI Health Analytix 3. Drivezy 4. Hush - Speak Up. Make Work Bet 5. The Souled Store 6. Perspectico 7. Kogta Financial India Limited 8. Hospals 9. UrbanClap 10. Square Off 11. other,2 (0.4%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 1 (0.2%) 515 (97.9%),,0 (0.0%)
2,Industry [object],"1. — 2. Financial Services 3. Education 4. Information Technology 5. Health Care, Hospital 6. Finance, Financial Services 7. Fitness, Health Care, Wellness 8. Internet 9. Artificial Intelligence 10. Health Care 11. other",30 (5.7%) 15 (2.9%) 8 (1.5%) 7 (1.3%) 5 (1.0%) 5 (1.0%) 4 (0.8%) 4 (0.8%) 4 (0.8%) 4 (0.8%) 440 (83.7%),,0 (0.0%)
3,Round/Series [object],1. Seed 2. Series A 3. Angel 4. Venture - Series Unknown 5. Series B 6. Series C 7. Debt Financing 8. Private Equity 9. Corporate Round 10. Pre-Seed 11. other,280 (53.2%) 73 (13.9%) 37 (7.0%) 37 (7.0%) 20 (3.8%) 16 (3.0%) 13 (2.5%) 10 (1.9%) 8 (1.5%) 6 (1.1%) 26 (4.9%),,0 (0.0%)
4,Amount [object],"1. — 2. 1000000 3. 500000 4. 2000000 5. ₹50,000,000 6. ₹20,000,000 7. 4000000 8. 5000000 9. 250000 10. ₹40,000,000 11. other",148 (28.1%) 24 (4.6%) 13 (2.5%) 12 (2.3%) 9 (1.7%) 8 (1.5%) 7 (1.3%) 7 (1.3%) 6 (1.1%) 6 (1.1%) 286 (54.4%),,0 (0.0%)
5,Location [object],"1. Bangalore, Karnataka, India 2. Mumbai, Maharashtra, India 3. Bengaluru, Karnataka, India 4. Gurgaon, Haryana, India 5. New Delhi, Delhi, India 6. Pune, Maharashtra, India 7. Chennai, Tamil Nadu, India 8. Hyderabad, Andhra Pradesh, Ind 9. Delhi, Delhi, India 10. Noida, Uttar Pradesh, India 11. other",102 (19.4%) 94 (17.9%) 55 (10.5%) 52 (9.9%) 51 (9.7%) 20 (3.8%) 19 (3.6%) 18 (3.4%) 16 (3.0%) 15 (2.9%) 84 (16.0%),,0 (0.0%)
6,Year [object],1. 2018,526 (100.0%),,0 (0.0%)


<b>Issues identified with the 2018 dataset</b>

- The Amount column mixes values in rupees, values in US dollars, and figures with no currency symbol.

- The Amount column is stored as an object, so it must be converted to a numeric type before any numerical analysis can be done.

- The Location column lists the city, state and country. It has to be reduced to the city name only.

- Some cities appear under both their official and unofficial names (for example, Bengaluru and Bangalore). These have to be harmonized.

- The 2018 dataset has no column for investors; one will have to be created to aid the analysis.

- The Industry column lists several industries per row. Each row has to be reduced to a single industry.

- The Industry column contains slightly different spellings of the same industry.

- Some column names have to be changed to be consistent with those of the 2019, 2020 and 2021 datasets.
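
The first two issues in the list could be handled roughly as sketched below. This is only an illustration under stated assumptions: `amount_to_usd` is a hypothetical helper, `INR_TO_USD` is an illustrative rate that this notebook does not state, and figures with no currency symbol are assumed to be in US dollars.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: illustrative conversion rate, not taken from the notebook
INR_TO_USD = 0.0146

def amount_to_usd(value, inr_to_usd=INR_TO_USD):
    """Convert one raw Amount entry to a float in US dollars, or NaN."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    is_rupee = text.startswith('₹')
    # Drop currency symbols and thousands separators
    digits = text.replace('₹', '').replace('$', '').replace(',', '')
    # Placeholders such as '—' or 'Undisclosed' become NaN
    number = pd.to_numeric(digits, errors='coerce')
    if pd.isna(number):
        return np.nan
    # ASSUMPTION: values without a symbol are already in US dollars
    return float(number) * inr_to_usd if is_rupee else float(number)

raw = pd.Series(['250000', '₹40,000,000', '$1,200,000', '—', 'Undisclosed'])
clean = raw.map(amount_to_usd)
print(clean)
```

The result is a float column, so placeholders count as missing values and the usual numeric summaries become available.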


<b>NB: THE ISSUES SPECIFIC TO THE 2018 DATASET ARE DEALT WITH BEFORE IT IS INTEGRATED WITH THE OTHER DATASETS FOR HOLISTIC INSPECTION AND CLEANING.</b>

In [7]:
#harmonizing the 2018 dataset columns to match the other three datasets
su_18.columns=['Company/Brand','Sector','Stage','Amount($)','HeadQuarter','Year']

su_18

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",2018
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India",2018
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",2018
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",2018
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018


SECTOR COLUMN

<b>The aim here is to keep only the first industry listed in the Sector column as its own column, then drop the rest.</b>

In [8]:
# split each Sector string at the first comma; column 0 holds the first industry listed
df_18=su_18['Sector'].str.split(pat=',', n=1, expand=True)
su_18['industry1']=df_18[0]

su_18

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,Year,industry1
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms
...,...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India",2018,B2B
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",2018,Tourism
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology


In [9]:
su_18.drop('Sector', axis=1, inplace=True)


In [10]:
su_18=su_18.rename(columns={'industry1':'Sector'})
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),HeadQuarter,Year,Sector
0,TheCollegeFever,Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing
1,Happy Cow Dairy,Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture
2,MyLoanCare,Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit
3,PayMe India,Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services
4,Eunimart,Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,"Bangalore, Karnataka, India",2018,B2B
522,Happyeasygo Group,Series A,—,"Haryana, Haryana, India",2018,Tourism
523,Mombay,Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage
524,Droni Tech,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology
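
The split, drop and rename steps above can also be collapsed into a single expression. The following is a minimal sketch on a small made-up Series, not the notebook's own code; treating the dash placeholder as missing is an added assumption.

```python
import pandas as pd

sectors = pd.Series([
    'Brand Marketing, Event Promotion, Marketing',
    'Agriculture, Farming',
    'Information Technology',
    '—',
])

# Keep the text before the first comma and trim stray whitespace
first_sector = sectors.str.split(',', n=1).str[0].str.strip()

# ASSUMPTION: the dash placeholder means "not reported", so mask it to NaN
first_sector = first_sector.mask(first_sector == '—')
print(first_sector.tolist())
```

Because the result can be assigned straight back to the `Sector` column, no temporary column is needed.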


HEADQUARTER COLUMN

<b>The aim is to extract the city name from the string in the HeadQuarter column.</b>

In [11]:
# split "City, State, Country" at the first comma; column 0 holds the city
hq_parts=su_18['HeadQuarter'].str.split(pat=',', n=1, expand=True)
su_18['location']=hq_parts[0]

su_18

Unnamed: 0,Company/Brand,Stage,Amount($),HeadQuarter,Year,Sector,location
0,TheCollegeFever,Seed,250000,"Bangalore, Karnataka, India",2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000","Mumbai, Maharashtra, India",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000","Gurgaon, Haryana, India",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,"Noida, Uttar Pradesh, India",2018,Financial Services,Noida
4,Eunimart,Seed,—,"Hyderabad, Andhra Pradesh, India",2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...,...
521,Udaan,Series C,225000000,"Bangalore, Karnataka, India",2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,"Haryana, Haryana, India",2018,Tourism,Haryana
523,Mombay,Seed,7500,"Mumbai, Maharashtra, India",2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000","Mumbai, Maharashtra, India",2018,Information Technology,Mumbai


In [12]:
su_18.drop('HeadQuarter', axis=1, inplace=True)
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),Year,Sector,location
0,TheCollegeFever,Seed,250000,2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,2018,Financial Services,Noida
4,Eunimart,Seed,—,2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,2018,Tourism,Haryana
523,Mombay,Seed,7500,2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000",2018,Information Technology,Mumbai


In [13]:
su_18=su_18.rename(columns={'location':'HeadQuarter'})
su_18

Unnamed: 0,Company/Brand,Stage,Amount($),Year,Sector,HeadQuarter
0,TheCollegeFever,Seed,250000,2018,Brand Marketing,Bangalore
1,Happy Cow Dairy,Seed,"₹40,000,000",2018,Agriculture,Mumbai
2,MyLoanCare,Series A,"₹65,000,000",2018,Credit,Gurgaon
3,PayMe India,Angel,2000000,2018,Financial Services,Noida
4,Eunimart,Seed,—,2018,E-Commerce Platforms,Hyderabad
...,...,...,...,...,...,...
521,Udaan,Series C,225000000,2018,B2B,Bangalore
522,Happyeasygo Group,Series A,—,2018,Tourism,Haryana
523,Mombay,Seed,7500,2018,Food and Beverage,Mumbai
524,Droni Tech,Seed,"₹35,000,000",2018,Information Technology,Mumbai
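
With only the city left, the official and unofficial spellings noted earlier still have to be harmonized. One way to do this is a lookup table, sketched below. `CITY_ALIASES` is a hypothetical mapping; which spelling to treat as canonical, and whether Delhi and New Delhi should be merged at all, are assumptions to revisit against the data.

```python
import pandas as pd

# ASSUMPTION: hypothetical mapping from alternate names to one canonical spelling
CITY_ALIASES = {
    'Bengaluru': 'Bangalore',
    'Gurgaon': 'Gurugram',
    'Delhi': 'New Delhi',
}

cities = pd.Series(['Bangalore', 'Bengaluru', ' Gurgaon', 'Gurugram', 'Delhi', 'Mumbai'])

# Trim whitespace first so that ' Gurgaon' matches its key, then map exact values
harmonized = cities.str.strip().replace(CITY_ALIASES)
print(harmonized.value_counts())
```

Values that are not in the mapping, such as Mumbai, pass through unchanged.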


In [14]:
su_21

Unnamed: 0,Company/Brand,HeadQuarter,Sector,Investor,Amount($),Stage,Year
0,Unbox Robotics,Bangalore,AI startup,"BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,Mumbai,EdTech,"Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,Mumbai,EdTech,"GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,Mumbai,B2B E-commerce,"CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,Gurugram,FinTech,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021
...,...,...,...,...,...,...,...
1204,Gigforce,Gurugram,Staffing & Recruiting,Endiya Partners,$3000000,Pre-series A,2021
1205,Vahdam,New Delhi,Food & Beverages,IIFL AMC,$20000000,Series D,2021
1206,Leap Finance,Bangalore,Financial Services,Owl Ventures,$55000000,Series C,2021
1207,CollegeDekho,Gurugram,EdTech,"Winter Capital, ETS, Man Capital",$26000000,Series B,2021


# MERGING THE FOUR DATASETS

NB: The four datasets have been combined into a single DataFrame called df_startup. The individual columns in this new DataFrame will be inspected and cleaned. It holds seven columns: Company/Brand, Stage, Amount($), Year, Sector, HeadQuarter and Investor.

In [19]:
df_startup=pd.concat([su_18,su_19,su_20,su_21],ignore_index=True)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
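
`pd.concat` aligns the frames by column name, so a column that is absent from one frame is filled with NaN for that frame's rows. This is why no separate step is needed to create an Investor column for 2018. The sketch below shows the behaviour on two tiny frames whose values are copied from the outputs above.

```python
import pandas as pd

# The 2018-style frame has no Investor column
frame_2018 = pd.DataFrame({
    'Company/Brand': ['TheCollegeFever'],
    'Stage': ['Seed'],
    'Year': ['2018'],
})
frame_2021 = pd.DataFrame({
    'Company/Brand': ['Unbox Robotics'],
    'Investor': ['BEENEXT, Entrepreneur First'],
    'Stage': ['Pre-series A'],
    'Year': ['2021'],
})

# Columns are matched by name; ignore_index renumbers the rows from 0
combined = pd.concat([frame_2018, frame_2021], ignore_index=True)
print(combined)
```

The 2018 row ends up with NaN under `Investor`, which matches the empty Investor cells in the merged output.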

In [24]:
df_startup

Unnamed: 0,Company/Brand,Stage,Amount($),Year,Sector,HeadQuarter,Investor
0,TheCollegeFever,Seed,250000,2018,Brand Marketing,Bangalore,
1,Happy Cow Dairy,Seed,"₹40,000,000",2018,Agriculture,Mumbai,
2,MyLoanCare,Series A,"₹65,000,000",2018,Credit,Gurgaon,
3,PayMe India,Angel,2000000,2018,Financial Services,Noida,
4,Eunimart,Seed,—,2018,E-Commerce Platforms,Hyderabad,
5,Hasura,Seed,1600000,2018,Cloud Infrastructure,Bengaluru,
6,Tripshelf,Seed,"₹16,000,000",2018,Internet,Kalkaji,
7,Hyperdata.IO,Angel,"₹50,000,000",2018,Market Research,Hyderabad,
8,Freightwalla,Seed,—,2018,Information Services,Mumbai,
9,Microchip Payments,Seed,—,2018,Mobile Payments,Bangalore,


In [16]:
dfSummary(df_startup, is_collapsible = True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Company/Brand [object],1. BharatPe 2. Nykaa 3. Zomato 4. Spinny 5. Zetwerk 6. MPL 7. Trell 8. Vedantu 9. Dunzo 10. Unacademy 11. other,"10 (0.3%) 7 (0.2%) 7 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 5 (0.2%) 5 (0.2%) 2,815 (97.8%)",,0 (0.0%)
2,Stage [object],1. nan 2. Seed 3. Series A 4. Pre-series A 5. Series B 6. Series C 7. Seed Round 8. Pre series A 9. Pre-seed 10. Series D 11. other,938 (32.6%) 606 (21.0%) 305 (10.6%) 211 (7.3%) 134 (4.7%) 114 (4.0%) 69 (2.4%) 62 (2.2%) 58 (2.0%) 50 (1.7%) 332 (11.5%),,938 (32.6%)
3,Amount($) [object],"1. Undisclosed 2. — 3. $1,000,000 4. $Undisclosed 5. $2,000,000 6. $1000000 7. $3,000,000 8. $5,000,000 9. $2000000 10. $10,000,000 11. other","298 (10.4%) 148 (5.1%) 93 (3.2%) 73 (2.5%) 64 (2.2%) 52 (1.8%) 46 (1.6%) 44 (1.5%) 40 (1.4%) 35 (1.2%) 1,986 (69.0%)",,6 (0.2%)
4,Year [object],1. 2021 2. 2020 3. 2018 4. 2019,"1,209 (42.0%) 1,055 (36.6%) 526 (18.3%) 89 (3.1%)",,0 (0.0%)
5,Sector [object],1. FinTech 2. EdTech 3. Financial Services 4. Fintech 5. Edtech 6. E-commerce 7. Automotive 8. AgriTech 9. Food & Beverages 10. Logistics 11. other,"175 (6.1%) 148 (5.1%) 88 (3.1%) 85 (3.0%) 74 (2.6%) 73 (2.5%) 54 (1.9%) 43 (1.5%) 39 (1.4%) 37 (1.3%) 2,063 (71.7%)",,18 (0.6%)
6,HeadQuarter [object],1. Bangalore 2. Mumbai 3. Gurugram 4. New Delhi 5. nan 6. Chennai 7. Pune 8. Delhi 9. Noida 10. Gurgaon 11. other,866 (30.1%) 474 (16.5%) 239 (8.3%) 232 (8.1%) 114 (4.0%) 106 (3.7%) 105 (3.6%) 88 (3.1%) 86 (3.0%) 80 (2.8%) 489 (17.0%),,114 (4.0%)
7,Investor [object],1. nan 2. Inflection Point Ventures 3. Venture Catalysts 4. Mumbai Angels Network 5. Angel investors 6. Undisclosed 7. Tiger Global 8. Titan Capital 9. Unicorn India Ventures 10. Better Capital 11. other,"626 (21.7%) 36 (1.3%) 32 (1.1%) 17 (0.6%) 15 (0.5%) 13 (0.5%) 12 (0.4%) 11 (0.4%) 10 (0.3%) 9 (0.3%) 2,098 (72.9%)",,626 (21.7%)
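
The summary lists FinTech and Fintech, and EdTech and Edtech, as separate sectors. A possible way to merge labels that differ only by letter case is sketched below on made-up sample values. Keeping the first spelling seen in each group is an arbitrary choice made for this illustration.

```python
import pandas as pd

sector = pd.Series(['FinTech', 'Fintech', 'EdTech', 'Edtech', 'E-commerce', None])

# Build a case-insensitive key for each label; missing values stay missing
key = sector.str.strip().str.lower()

# For every key, keep the first spelling that appears in the data
canonical = sector.groupby(key).first()

# Map each row back to its canonical spelling
normalized = key.map(canonical)
print(normalized.tolist())
```

The same approach would not merge labels that differ by more than case, such as "Seed" and "Seed Round" in the Stage column. Those need an explicit mapping.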


In [17]:
df_startup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company/Brand  2879 non-null   object
 1   Stage          1941 non-null   object
 2   Amount($)      2873 non-null   object
 3   Year           2879 non-null   object
 4   Sector         2861 non-null   object
 5   HeadQuarter    2765 non-null   object
 6   Investor       2253 non-null   object
dtypes: object(7)
memory usage: 157.6+ KB
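
The non-null counts above understate how much of `Amount($)` is missing, because placeholder strings such as "Undisclosed" count as values. A rough sketch of measuring the real share of missing data follows. The sample frame is made up, and the `PLACEHOLDERS` list is an assumption based on the values visible in the summary.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: placeholder strings taken from the summary output above
PLACEHOLDERS = ['—', 'Undisclosed', '$Undisclosed']

sample = pd.DataFrame({
    'Amount($)': ['250000', '—', 'Undisclosed', '$1,000,000', None],
    'Stage': ['Seed', None, 'Series A', None, 'Seed'],
})

# Replace placeholder strings with NaN so they count as missing
cleaned = sample.replace(PLACEHOLDERS, np.nan)

# Share of missing values per column, in percent
missing_pct = cleaned.isna().mean().mul(100).round(1)
print(missing_pct)
```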
