# Indian Startup Data Exploration

India is one of the fastest-growing economies in the world. Many innovative startups are emerging in the region, and a lot of funding is flowing to them.

Want to know what types of startups have been funded in the last few years?

Want to know who the important investors are?

Want to know which hot fields attract a lot of funding these days?

Investors and startup founders have these questions in mind too. There are two main scenarios:

* Investors form a partnership with the startups they choose to invest in: if the company turns a profit, investors make returns proportionate to their equity in the startup; if the startup fails, they lose the money they invested. So they want to know which startups to invest in.

* Startups often look to angel investors or other investors to raise much-needed capital to get their business off the ground, but how does one value a brand-new company?

This dataset is a chance to explore the Indian startup scene: dive deep into the funding data, derive insights to answer the questions above, and peek into the future of the market.

The data contains features such as date, industry vertical, startup location, investment type, amount of investment, investor names, etc.

**Source:** [Kaggle](https://www.kaggle.com/sudalairajkumar/indian-startup-funding?select=startup_funding.csv), scraped from [trak.in](https://trak.in/)

Let's start exploring and analyzing the data!

### To visualize plots in this notebook please click [here](https://nbviewer.jupyter.org/github/hirenhk15/ga-learner-dst-repo/blob/master/glabs_ds_learn/4_Startup_data_analysis/notebook/Startup_data_analysis.ipynb)

## Importing Libraries

In [1]:
import os
import string
import datetime
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
# import seaborn as sns
# color = sns.color_palette()

%matplotlib inline

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

### Load the data

In [2]:
df = pd.read_csv('../data/startup_data.csv')
df.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,City,InvestorsName,InvestmentType,AmountInUSD,Remarks,year,yearmonth,CleanedAmount
0,1,09/01/2020,BYJU’S,E-Tech,E-learning,Bangalore,Tiger Global Management,Private Equity Round,200000000,,2020,2020-01-01,200000000.0
1,2,13/01/2020,Shuttl,Transportation,App based shuttle service,NCR,Susquehanna Growth Equity,Series C,8048394,,2020,2020-01-01,8048394.0
2,3,09/01/2020,Mamaearth,E-commerce,Retailer of baby and toddler products,Bangalore,Sequoia Capital India,Series B,18358860,,2020,2020-01-01,18358860.0
3,4,02/01/2020,https://www.wealthbucket.in/,FinTech,Online Investment,NCR,Vinod Khatumal,Preseries A,3000000,,2020,2020-01-01,3000000.0
4,5,02/01/2020,Fashor,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Funding,1800000,,2020,2020-01-01,1800000.0
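The `year` and `yearmonth` columns are already present in the loaded file; the preprocessing that produced them is not shown in this notebook. A minimal sketch of how such columns could be derived from `Date`, assuming the dates are in day/month/year order as the sample rows suggest (the `sample` frame below is illustrative, not the real data):

```python
import pandas as pd

# Three dates copied from the df.head() output above (day/month/year order).
sample = pd.DataFrame({"Date": ["09/01/2020", "13/01/2020", "02/01/2020"]})

# An explicit format makes sure 09/01/2020 is read as 9 January, not 1 September.
parsed = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

sample["year"] = parsed.dt.year
# Snap each date to the first day of its month, matching values like 2020-01-01.
sample["yearmonth"] = parsed.dt.to_period("M").dt.to_timestamp()

print(sample)
```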


## Number Of Fundings


### Can we get an overview of how the number of funding deals has changed over time?

In [3]:
# Getting the number of fundings according to year
num_funding_deals = df.year.value_counts().sort_index(ascending=False)
num_funding_deals

2020      7
2019    111
2018    310
2017    687
2016    993
2015    936
Name: year, dtype: int64

In [4]:
# Set the bar graph
trace = go.Bar(
    x=num_funding_deals.index,
    y=num_funding_deals.values
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Number of funding deals over the years', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

#### Insights:


* 2015 and 2016 saw more funding deals than the more recent years.

* There is a clear declining trend in the number of funding deals from 2016 onward. The exact reason is unclear; one possibility is that not all recent funding deals have been captured in the data.

* 2020 has the lowest number of deals, possibly due to the COVID pandemic, although only the early part of the year is covered by the data.
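To put numbers on the trend, here is a small sketch that computes the year-over-year percentage change from the counts printed above. The helper name `yoy_change` is made up for this illustration.

```python
# Deal counts per year, copied from the value_counts() output above.
deals_per_year = {2015: 936, 2016: 993, 2017: 687, 2018: 310, 2019: 111, 2020: 7}


def yoy_change(counts):
    """Return {year: percentage change versus the previous year}."""
    years = sorted(counts)
    return {
        year: round(100 * (counts[year] - counts[prev]) / counts[prev], 1)
        for prev, year in zip(years, years[1:])
    }


changes = yoy_change(deals_per_year)
for year, pct in changes.items():
    print(f"{year}: {pct:+.1f}%")
```

Keep in mind that the 2020 figure compares a partial year with a full one.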


In [5]:
# Getting the number of fundings based on yearmonth
yearmonth = df.yearmonth.value_counts().sort_index()

# Set the line plot
trace = go.Scatter(x=yearmonth.index, 
                   y=yearmonth.values,
                   mode='lines+markers'
                  )

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Number of funding deals - month on month', x=0.5),
    font=dict(size=14),
    xaxis=dict(title='Years - (Monthly data)'),
    yaxis=dict(title='Number of Fundings')
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

#### Insights:

We can see a steady decline here as well, although the number of deals seems to be increasing in the last few months.

## Let's see whether the decrease in deals has had any impact on the amount invested

In [6]:
sum_amount = df.groupby('yearmonth')['CleanedAmount'].sum()
sum_amount.head()

yearmonth
2015-01-01    370158781.0
2015-02-01    394249613.0
2015-03-01    463314013.0
2015-04-01    980989011.0
2015-05-01    326168017.0
Name: CleanedAmount, dtype: float64

In [7]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Set the line plot and bar graph
fig.add_trace(
    go.Scatter(x=yearmonth.index, y=yearmonth.values, mode='lines+markers', name='Fundings'),
    secondary_y=False
)

fig.add_trace(
    go.Bar(x=sum_amount.index, y=sum_amount.values, name='Investment'),
    secondary_y=True
)

# Add figure title
fig.update_layout(
    go.Layout(
        title=go.layout.Title(text='Funding deals and amount invested - month on month', x=0.5),
        font=dict(size=14),
        xaxis=dict(title='Years - (Monthly data)')
    )
)

# Set y-axes titles
fig.update_yaxes(title_text="Number of Fundings", secondary_y=False)
fig.update_yaxes(title_text="Total Amount Invested", secondary_y=True)

fig.show()

#### Insights:

* As the number of funding deals decreased over the years, it seems a few startups received very large chunks of funding.

* One possible explanation is that investors have started concentrating on a few highly profitable startups.
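One way to check this reading is to look at the average deal size per month alongside the deal count. The sketch below uses made-up toy values (not the real dataset) with the notebook's column names; `monthly` and `avg_deal_size` are illustrative names.

```python
import pandas as pd

# Toy data with hypothetical amounts, using the notebook's column names.
toy = pd.DataFrame({
    "yearmonth": ["2015-01-01", "2015-01-01", "2015-01-01", "2019-08-01"],
    "CleanedAmount": [1_000_000.0, 2_000_000.0, 3_000_000.0, 60_000_000.0],
})

# Count the deals and sum the amounts per month in one pass.
monthly = toy.groupby("yearmonth")["CleanedAmount"].agg(deals="count", total="sum")

# Fewer deals with a larger total show up as a higher average deal size.
monthly["avg_deal_size"] = monthly["total"] / monthly["deals"]
print(monthly)
```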

## Funding Values

### Can we get an overview of the funding values investors usually invest?

In [8]:
# Let's convert the amount from string to numeric
# df.CleanedAmount = pd.to_numeric(df.AmountInUSD.str.replace(',', ''), errors='coerce')
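The conversion above is commented out because `CleanedAmount` already exists in the loaded file. For reference, a minimal sketch of what that line does, run on hypothetical raw strings (the values below are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw amounts: comma-formatted numbers, a non-numeric entry, and a missing value.
raw = pd.Series(["2,00,00,000", "8,048,394", "undisclosed", None])

# Strip the thousands separators, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)
```

With `errors='coerce'`, unparseable entries become `NaN` instead of raising, so later sums and means skip them.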

In [9]:
investor_fund_dist = df.groupby('InvestorsName')['CleanedAmount'].sum().sort_values(ignore_index=True)

# Set the scatter plot
trace = go.Scatter(x=investor_fund_dist.index, 
                   y=investor_fund_dist.values,
                   mode='markers',
                   marker_color='rgba(0, 0, 235, .7)'
                  )

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Overview of the funding values investors usually invest', x=0.5),
    font=dict(size=14),
    xaxis=dict(title='Investors'),
    yaxis=dict(title='Amount (USD)')
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

There are some extreme values at the right. Let us see which startups received these very large amounts.

In [10]:
# Top 5 deals by amount
df.nlargest(5, 'CleanedAmount')

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,City,InvestorsName,InvestmentType,AmountInUSD,Remarks,year,yearmonth,CleanedAmount
60,61,27/08/2019,Rapido Bike Taxi,Transportation,Bike Taxi,Bangalore,Westbridge Capital,Series B,3900000000,,2019,2019-08-01,3900000000.0
651,652,11/08/2017,Flipkart,E-Commerce,Online Marketplace,Bangalore,Softbank,Private Equity,2500000000,,2017,2017-08-01,2500000000.0
830,831,18/05/2017,Paytm,E-Commerce,Mobile Wallet & ECommerce platform,Bangalore,SoftBank Group,Private Equity,1400000000,,2017,2017-05-01,1400000000.0
966,967,21/03/2017,Flipkart,E-Commerce,ECommerce Marketplace,Bangalore,"Microsoft, eBay, Tencent Holdings",Private Equity,1400000000,,2017,2017-03-01,1400000000.0
31,32,25/11/2019,Paytm,FinTech,Mobile Wallet,NCR,Vijay Shekhar Sharma,Funding Round,1000000000,,2019,2019-11-01,1000000000.0


### Insights:

* Rapido Bike Taxi appears to lead the pack by raising 3.9 billion USD. But wait, this looks fishy. In fact, Rapido raised 3.9 billion INR, not USD, which is around 54 million USD. This also shows that the data is not fully accurate and should be used with caution.
    
* The next four largest deals are split between Flipkart and Paytm, with two each, which is to be expected.
* Also, Swiggy raised 1 billion USD last year, which is not in the data.

We will correct the Rapido amount (rounding it to roughly 50 million USD) before continuing with the analysis.
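The currency conversion behind this correction can be sketched as follows. The exchange rate of 72 INR per USD is an assumed round figure for illustration, not a value from the dataset, and `inr_to_usd` is a hypothetical helper.

```python
# Assumed exchange rate for illustration only (not taken from the dataset).
INR_PER_USD = 72.0


def inr_to_usd(amount_inr, inr_per_usd=INR_PER_USD):
    """Convert an INR amount to USD at the given rate."""
    return amount_inr / inr_per_usd


rapido_usd = inr_to_usd(3.9e9)
print(f"{rapido_usd / 1e6:.1f} million USD")
```

At this assumed rate the result lands close to the 54 million USD quoted above; the notebook then rounds it to 50 million.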

In [11]:
# Correct the Rapido row (index 60): the amount was in INR, so set it to roughly 50 million USD
df.loc[60, 'CleanedAmount'] = 50e6
df.loc[60, 'CleanedAmount']

50000000.0

In [12]:
# Getting the sum, count and mean values of funding
amount_df = df.groupby('year')['CleanedAmount'].agg(['sum', 'count', 'mean']).sort_index()

# Set the bar graph
trace = go.Bar(
    x=amount_df['sum'].index,
    y=amount_df['sum'].values,
    marker=dict(color="#1E90FF")
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Total investments each year', x=0.5),
    font=dict(size=14),
    xaxis=dict(title='Years'),
    yaxis=dict(title='Amount (USD)')
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* Though 2016 has the highest number of funding deals, it has the lowest total amount (excluding 2020, which is not yet complete).
* 2017 has the highest total funding of the last 5 years. Of the roughly 10B raised in 2017, 5.3B was raised by Flipkart and Paytm in 3 deals, which we can see in the table above the plot.

In [13]:
# Set the bar graph
trace = go.Bar(
    x=amount_df['mean'].index,
    y=amount_df['mean'].values,
    marker=dict(color="#1E90FF")
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Average investment each year', x=0.5),
    font=dict(size=14),
    xaxis=dict(title='Years'),
    yaxis=dict(title='Amount (USD)')
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* When it comes to the mean funding value, 2020 leads the pack with an average of 55 million USD.
* But the year has only just started, so should the 2020 mean be trusted, or are we missing something? Only 7 deals are recorded for 2020, which is too few for a reliable average.
* We will therefore treat 2019 as the most recent year with valid data for mean funding.
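To see how a single large deal can dominate the mean of a small sample, here is a sketch using only the five 2020 rows visible in the `df.head()` output at the top of the notebook (so it covers part of the 2020 deals, not all of them):

```python
from statistics import mean, median

# CleanedAmount values of the five 2020 rows shown in the df.head() output.
amounts_2020 = [200_000_000, 8_048_394, 18_358_860, 3_000_000, 1_800_000]

sample_mean = mean(amounts_2020)
sample_median = median(amounts_2020)

print(f"mean:   {sample_mean:,.0f} USD")
print(f"median: {sample_median:,.0f} USD")
```

The median is far less sensitive to the single 200 million USD deal, which makes it a useful companion to the mean when only a handful of deals are available.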

## Investment Type

Now let us explore the investment type of the funding deals, such as seed funding, private equity and so on.


### Can we get an idea about the number and value of funding deals with respect to the investment type?

In [14]:
# Get the counts for investment type
inestment_type = df.InvestmentType.value_counts().sort_values(ascending=False)[:10]

# Set the bar graph
trace = go.Bar(
    y=inestment_type.index[::-1],
    x=inestment_type.values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Number of funding deals by investment type', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* Seed funding tops the chart, closely followed by Private Equity and seed angel funding.
    
* We can clearly see the decreasing number of deals as we move up the stages of funding rounds like Series A, B, C & D

In [15]:
# Get the total investments for investment type
inestment_type_sum = df.groupby('InvestmentType')['CleanedAmount'].agg(['size', 'sum', 'mean']).sort_values(by='size', ascending=False)[:10]

# Set the bar graph
trace = go.Bar(
    y=inestment_type_sum['sum'].index[::-1],
    x=inestment_type_sum['sum'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Total investment by its type', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* Private equity has a high number of deals and also the highest total amount raised, at 26.7B.
* Though seed funding has 1388 deals, the total raised is only about 775M, since these rounds happen during the very early stages of a startup.
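One detail to keep in mind when reading these aggregates: `size` counts every row in a group, while `sum` and `mean` skip missing amounts. The sketch below shows the difference on toy data; the values are hypothetical, and it is an assumption that the real `CleanedAmount` column contains missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one seed deal has an undisclosed amount (NaN).
toy = pd.DataFrame({
    "InvestmentType": ["Seed Funding", "Seed Funding", "Seed Funding", "Private Equity"],
    "CleanedAmount": [500_000.0, 1_500_000.0, np.nan, 40_000_000.0],
})

# 'size' includes the NaN row; 'count', 'sum' and 'mean' ignore it.
stats = toy.groupby("InvestmentType")["CleanedAmount"].agg(["size", "count", "sum", "mean"])
print(stats)
```

So dividing `sum` by `size` would not necessarily reproduce `mean` when some deals have no disclosed amount.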

Now let us look at the average amount raised by startups in each of these funding rounds.

In [16]:
# Set the bar graph
trace = go.Bar(
    y=inestment_type_sum['mean'].index[::-1],
    x=inestment_type_sum['mean'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Average investment by investment type', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* We can see a clear increase in the mean funding value as we go up the funding round ladder from Seed funding to Series D as expected.

## Location

### Find out about the major startup hubs in India.

Now let us explore the location of the startups that got funded. This can help us to understand the startup hubs of India.

Since there are multiple locations in the data, let us plot the top 10 locations. We will also club New Delhi, Gurgaon & Noida together to form NCR for the below chart.
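The loaded file already shows `NCR` in the `City` column, so this clubbing appears to have been done before the notebook starts. A minimal sketch of how it could be done, assuming the raw labels are spelled exactly as below (the `cities` series is illustrative):

```python
import pandas as pd

# Hypothetical raw city labels; the exact spellings in the source data are an assumption.
cities = pd.Series(["New Delhi", "Gurgaon", "Noida", "Bangalore", "Mumbai"])

# Map each NCR city to the common label and leave every other city untouched.
ncr_cities = ["New Delhi", "Gurgaon", "Noida"]
clubbed = cities.replace({city: "NCR" for city in ncr_cities})

print(clubbed.value_counts())
```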

In [17]:
# Total number of cities
df.City.nunique()

102

In [18]:
# Get the total investments by city
startup_hubs = df.groupby('City')['CleanedAmount'].agg(['size', 'sum', 'mean']).sort_values(by='size', ascending=False)[:10]
startup_hubs

City,size,sum,mean
NCR,892,8461624000.0,9486126.0
Bangalore,842,14640450000.0,17387710.0
Mumbai,568,4940369000.0,8697833.0
Pune,105,633048000.0,6029029.0
Hyderabad,99,401049300.0,4051003.0
Chennai,97,718745000.0,7409742.0
Ahmedabad,38,113625000.0,2990132.0
Jaipur,30,152719000.0,5090634.0
Kolkata,21,15972010.0,760572.0
Indore,13,4664008.0,358769.8


In [19]:
# Set the bar graph
trace = go.Bar(
    y=startup_hubs['size'].index[::-1],
    x=startup_hubs['size'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Number of funding deals in each location', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:


* NCR and Bangalore are almost level in the number of funding deals (892 and 842), followed by Mumbai in third place (568).

* Chennai, Hyderabad and Pune form the next set of cities that are catching up.

In [20]:
# Set the bar graph
trace = go.Bar(
    y=startup_hubs['sum'].index[::-1],
    x=startup_hubs['sum'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Total Investment by City', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insight:

* Though NCR tops the number of funding deals, Bangalore leads by a huge margin when it comes to the total funding value by location.

In [21]:
# Set the bar graph
trace = go.Bar(
    y=startup_hubs['mean'].index[::-1],
    x=startup_hubs['mean'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Average Investment by City', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insight:

* Bangalore tops the list here again.
* Jaipur takes the sixth spot in mean funding value, ahead of Hyderabad despite having far fewer deals.
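The bar chart keeps the cities in deal-count order (the order of `startup_hubs`), so ranks by mean are easy to misread by eye. A small sketch that re-ranks the ten cities by mean funding, using the values from the `startup_hubs` table above:

```python
# Mean funding per city (USD), copied from the startup_hubs table above.
mean_by_city = {
    "NCR": 9_486_126.0, "Bangalore": 17_387_710.0, "Mumbai": 8_697_833.0,
    "Pune": 6_029_029.0, "Hyderabad": 4_051_003.0, "Chennai": 7_409_742.0,
    "Ahmedabad": 2_990_132.0, "Jaipur": 5_090_634.0, "Kolkata": 760_572.0,
    "Indore": 358_769.8,
}

# Sort city names by their mean funding, largest first.
ranked = sorted(mean_by_city, key=mean_by_city.get, reverse=True)

for rank, city in enumerate(ranked, start=1):
    print(f"{rank}. {city}: {mean_by_city[city]:,.0f} USD")
```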

In [22]:
# Subsetting the data according to year, location and amount
temp_df = df.groupby(['City', 'year'])['CleanedAmount'].agg(["size", "mean"]).reset_index().sort_values('size', ascending=False)

# Selecting the major cities 
cities_to_use = ['Bangalore', 'NCR', 'Mumbai', 'Chennai', 'Pune', 'Hyderabad']

# Subsetting the data according to our needs
temp_df = temp_df[temp_df['City'].isin(cities_to_use)]

# Plotting a graph
fig = px.scatter(temp_df, x='year', y='City', color='City', size='size')

layout = go.Layout(
    title=go.layout.Title(text="Number of funding deals by location over time", x=0.5),
    font=dict(size=14),
    showlegend=False
)

fig.update_layout(layout)
fig.show()

In [23]:
# Plotting a graph
fig = px.scatter(temp_df, x='year', y='City', color='City', size='mean')

layout = go.Layout(
    title=go.layout.Title(text="Average investment by location over time", x=0.5),
    font=dict(size=14),
    showlegend=False
)

fig.update_layout(layout)
fig.show()

## Industry Vertical

Let us now have a look at the industry verticals and the number of funding deals for each vertical.


### Can we get an overview of the Industry verticals and the number of funding deals?

In [24]:
# Total number of industry verticals
df.IndustryVertical.nunique()

819

In [25]:
industry_vertical = df.groupby('IndustryVertical')['CleanedAmount'].agg(['size', 'sum', 'mean']).sort_values(by='size', ascending=False)[:10]
industry_vertical

IndustryVertical,size,sum,mean
Consumer Internet,941,6252733000.0,6644774.0
Technology,478,2229540000.0,4664310.0
E-Commerce,276,7889354000.0,28584620.0
Healthcare,70,381192000.0,5445600.0
Finance,62,1971433000.0,31797310.0
Logistics,32,242836000.0,7588625.0
Education,24,301138000.0,12547420.0
Food & Beverage,23,42228010.0,1836000.0
Ed-Tech,14,33546640.0,2396189.0
E-commerce,12,88912020.0,7409335.0


In [26]:
# Set the bar graph
trace = go.Bar(
    y=industry_vertical['size'].index[::-1],
    x=industry_vertical['size'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Number of funding deals in each industry', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* Consumer Internet is the most preferred industry segment for funding followed by Technology and E-commerce.
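Note that the table above lists both `E-Commerce` and `E-commerce` as separate verticals, which suggests some of the 819 unique values differ only in spelling or case. A minimal sketch of collapsing case variants, using three of the deal counts from the table (the lower-casing rule is an assumption; other variants such as `Ed-Tech` versus `Education` would need a manual mapping):

```python
import pandas as pd

# Deal counts copied from the industry_vertical table above.
counts = pd.Series({"E-Commerce": 276, "E-commerce": 12, "Consumer Internet": 941})

# Group on a lower-cased label so variants that differ only in case are merged.
merged = counts.groupby(lambda label: label.lower()).sum().sort_values(ascending=False)
print(merged)
```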


In [27]:
# Set the bar graph
trace = go.Bar(
    y=industry_vertical['sum'].index[::-1],
    x=industry_vertical['sum'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Total investment in each industry', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* E-Commerce received the highest total funding, followed by Consumer Internet and Technology.

In [28]:
# Set the bar graph
trace = go.Bar(
    y=industry_vertical['mean'].index[::-1],
    x=industry_vertical['mean'].values[::-1],
    orientation='h'
)

# Set the layout
layout = go.Layout(
    title=go.layout.Title(text='Average investment in each industry', x=0.5),
    font=dict(size=14)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

### Insights:

* Finance got the highest average investment followed by E-Commerce and Education.