# Investigating the Indian startup ecosystem

## Description
The objective of this project is to analyse the indian start-up investment data over the course of four years (2018-2021) to find out which funding stages are very attractive to investors and at what risk level. 

# GOAL
The goal of this project is to propose the best indian start-up to invest.

### Null Hypothesis 
Investment amounts received by start-ups have no relation to sectors they operate in.


### Alternate Hypothesis
There is a relationship between investment amounts received by start-ups and the sectors they operate in

## Analytical Questions
1. Does location affect the amount of funding or investments?
2. Does the sector of start up affect the fundings?
3. How many companies need funding and are at what level of funding ?
4. Which stages give out the highest investment amounts?
5. Which cities have the highest number of startups and at what levels?
6. What are the levels of funding the startups are receiving per year?

In [None]:
#Libraries imported
import MySQLdb
import sqlalchemy as sa
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import warnings 


In [None]:
env_variables= dotenv_values('logins.env')
database= env_variables.get('database')
server = env_variables.get('server')
username = env_variables.get('username')
password = env_variables.get('password')



# Data Understanding

- There are four data sources to work with (2 SQL and 2 CSVs)
- Explore data
- Verify data quality

### Connecting to the dapDB to extract the 2020 and 2021 data

In [None]:
#Connecting to the database to analyse the 2020-2021 data

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
connection = pyodbc.connect(connection_string)

In [None]:
#query the 2020 startup funding data

query = "SELECT * FROM LP1_startup_funding2020"

data_2020 = pd.read_sql(query, connection)
data_2020.columns

In [None]:
data_2020.head(2)
data_2020['Investment_year'] = '2020'

In [None]:
#Checking the datatypes of the columns
datatypes = data_2020.dtypes
datatypes

In [None]:
data_2020.Amount.unique()

In [None]:
#query the 2021 startup funding data
query = "SELECT * FROM LP1_startup_funding2021"

data_2021 = pd.read_sql(query, connection)
data_2021.head(1)


In [None]:
data_2021.Amount.unique()
data_2021.shape

In [None]:
#Checking the datatypes of the columns
datatypes = data_2021.dtypes
datatypes

In [None]:
data_2021[data_2021['Amount'].str.contains('$', na = False)]
data_2021['Amount']=data_2021.Amount.str.replace('\W', '', regex=True)
data_2021.head(2)

In [None]:
#combining the Stage and Amount columns since there are mixed data entry errors
data_2021['new'] =  data_2021['Investor'].fillna('inv') +data_2021['Amount'].astype(str) + data_2021['Stage'].fillna('ab') 
#data_2021['new']

In [None]:
data_2021['Amount_new']=data_2021['new'].str.extract('(\d+)')
data_2021.head(1)

In [None]:
#Removing the dollar symbol from the 2021 amount
data_2021.drop(columns=['new','Amount'],inplace=True)



In [None]:
data_2021.head(2)
data_2021['Investment_year'] = '2021'

In [None]:
#Renaming the Amount_new column
data_2021.rename(columns={'Amount_new':'Amount'}, inplace=True)

#2021 data cleaned!!!!!!!!!!!

In [None]:
#Reading 2018 data from the csv files
#assump
data_2018=pd.read_csv('startup_funding_2018_2019\startup_funding2018.csv')
data_2018.head(5)

In [None]:
#Changing column names of 2018 data to match all other datasets
data_2018.rename(columns={'Company Name':'Company_Brand','Industry':'Sector', 'Round/Series':'Stage', 'Location':'HeadQuarter', 'About Company':'What_it_does'}, inplace=True)
data_2018.head(5)

In [None]:
def curr_converter(df,rate):
    amount_new=[]
    for a in df:
        if a.startswith('$'):
            
            amount_new.append(a.split('$')[1].replace(',',''))
        elif a.startswith('₹'): 
            
            amount_new.append(float((a.split('₹')[1]).replace(',',''))/rate )  
        else :
            amount_new.append(a)    
    return amount_new

In [None]:
#Removing the Lahk symbol
exch_rate= 158.38
amount_list=data_2018.Amount.tolist()
amount_new=curr_converter(amount_list,exch_rate)
data_2018['Amount']=amount_new
data_2018['Investment_year'] = '2018'

In [None]:
#Reading 2019 data from the csv files
data_2019=pd.read_csv('startup_funding_2018_2019\startup_funding2019.csv')
data_2019.head(5)

In [None]:
#Changing column names of 2019 data to match all other datasets
data_2019.rename(columns={'Company/Brand':'Company_Brand', 'What it does':'What_it_does', 'Amount($)':'Amount'}, inplace=True)
data_2019.columns

In [None]:
#Removing the Lahk symbol
exch_rate= 177.13
amount_list=data_2019.Amount.tolist()
amount_new=curr_converter(amount_list,exch_rate)
data_2019['Amount']=amount_new
data_2019['Investment_year'] = '2019'

### Merging datasets

In [None]:
#Concating 2020 and 2021 data since they have a similar structure
pd.set_option('display.max_rows', None)
final_df = pd.concat([data_2021,data_2020,data_2019,data_2018],axis=0,ignore_index=True)
final_df.head(5)
final_df.shape

In [None]:
#Saving the combined dataset to xlsx
final_df.to_csv("startup_funding_2018_2019\combined.csv",index=False
             ) 

# Data Cleaning & Exploration

In [None]:
#Considering the columns of interest and reindexing
df= pd.read_csv('startup_funding_2018_2019\combined.csv')
df.info()


In [None]:
df.shape

#Merging the removed symbols with the original dataset
#f_df = pd.concat([df,dollar_investment,lahk_investment],axis=0,ignore_index=True)
f_df.shape


#Dropping duplicates from the original dataset to maintain the 'removed symbols' rows
f_df.drop_duplicates(subset=['Stage','Founded','Founders','Company_Brand','HeadQuarter','Investor','Sector','What_it_does','column10'],  keep='last', inplace=True, ignore_index=False)
f_df.shape

In [None]:
#split HQ column into two columns
df[['Town', 'Other']] = df['HeadQuarter'].str.split(',', 1, expand=True)
df[['City', 'Country']] = df['Other'].str.split(',', 1, expand=True)
#df


In [None]:
# Function to fill new column based on conditions
def fill_new_column(row):
    if row['Town'] != '':
        return row['Town']
    else:
        return row['City']
    
# Create a new column filled with values from one column or default value if empty
df['Headquarter_City'] = df.apply(fill_new_column, axis=1)    

df.head(2)

Technology & Software: Startups primarily focused on developing software, AI, IT solutions, and technology-related services.

Examples: AI startup, Tech Startup, IT startup, Software Startup, SaaS startup, MLOps platform, Digital platform, Blockchain startup, Automation, Podcast, Trading platform, Social network, Mobile Games, Computer software.

E-commerce & Retail: Startups involved in online retail, e-commerce platforms, marketplace solutions, and retail-focused businesses.

Examples: E-commerce, B2B E-commerce, Retail, Social commerce, Food Industry, Content commerce, B2B Manufacturing, Home Decor, Consumer Electronics, Apparel & Fashion.

Finance & FinTech: Startups operating in financial services, banking, financial technology (FinTech), cryptocurrency, and related areas.

Examples: FinTech, Banking, Financial Services, Trading platform, Cryptocurrency, Digital mortgage, Venture Capital, Insurance, Trading platform, Insuretech, Consumer finance.

Healthcare & HealthTech: Startups in the healthcare industry, including health technology (HealthTech), telemedicine, medical devices, and healthcare services.

Examples: HealthTech, Healthcare, Hospital & Health Care, Health, Health care, Healtcare, Insuretech, Medical, Hospital & Health Care.
Others: Startups that do not fit directly into the above categories or have unique business models.

Examples: EdTech (Education Technology), AgriTech (Agricultural Technology), Food & Beverages, Hospitality, Logistics & Supply Chain, 
Transportation, Renewable Energy, Robotics, Aerospace, Automotive, Gaming, Fashion, Real Estate, Media, Social Media, Consumer Goods, Industrial Automation, Lifestyle, Food delivery, LegalTech, Rental, Recruitment, Construction, Sports, Spirituality, Pet care, Music, Tobacco, Advisory firm, Pollution control equipment, Consulting, BioTechnology, Innovation Management, Location Analytics, Computer & Network Security, Apparel & Fashion, Automotive, Computer Games, Lifestyle, Environmental Services, Facilities Services, Marketing & Advertising, Job discovery platform, D2C (Direct-to-Consumer), E-learning, OTT (Over-the-Top media), Music, Fitness, Eyewear, NFT Marketplace, Online storytelling, SpaceTech, Online Media, Fishery, Advisory firm, Apparel & Fashion, Environmental service, Commercial Real Estate, AR startup.

In [None]:
#Categorizing the startups into 5 groups
sector = df.Sector.tolist()
cat=[]
cat = ['Finance & FinTech' if 'bank' in str(a).lower() or 'trading' in str(a).lower() or 'fintech' in str(a).lower() or 'vent' in str(a).lower() or 'capital' in str(a).lower() or 'insure' in str(a).lower() or 'crypt' in str(a).lower() or 'nft' in str(a).lower()
         else 'Technology & Software' if 'ai startup' in str(a).lower() or 'it' in str(a).lower() or 'chain' in str(a).lower() or 'augment' in str(a).lower() or 'robot' in str(a).lower() or 'gamin' in str(a).lower() or 'information' in str(a).lower() or 'mobile' in str(a).lower() or 'mlops' in str(a).lower() or 'biotech' in str(a).lower() or 'space' in str(a).lower()
         else 'E-commerce & Retail' if 'b2b' in str(a).lower() or 'e-com' in str(a).lower() or 'reta' in str(a).lower() or 'soci' in str(a).lower() or 'food' in str(a).lower() or 'content' in str(a).lower() or 'decor' in str(a).lower() or 'electronics' in str(a).lower() or 'apparel' in str(a).lower() or 'fashion' in str(a).lower() or 'food' in str(a).lower() or  'eye' in str(a).lower() or 'merchand' in str(a).lower()
         else 'Education' if 'edtech' in str(a).lower() or 'learn' in str(a).lower() or 'working' in str(a).lower() or 'story' in str(a).lower()
         else 'Healthcare & HealthTech' if 'heal' in str(a).lower() or 'medi' in str(a).lower() or 'care' in str(a).lower() or 'hael' in str(a).lower() or 'nutri' in str(a).lower()
         else 'Agriculture' if 'farm' in str(a).lower() or 'fish' in str(a).lower() or 'annabi' in str(a).lower() or 'pollution' in str(a).lower()
         else 'Energy' if 'energy' in str(a).lower() or 'petro' in str(a).lower() or 'crude' in str(a).lower() or 'batter' in str(a).lower() or 'ev start' in str(a).lower() or 'solar' in str(a).lower()
         else 'Hospitality' if 'hospitalit' in str(a).lower() or 'mortg' in str(a).lower() or 'estate' in str(a).lower() or 'touri' in str(a).lower()
         else 'Other'
         for a in sector]


In [143]:
sector

['AI startup',
 'EdTech',
 'EdTech',
 'B2B E-commerce',
 'FinTech',
 'Home services',
 'HealthTech',
 'HealthTech',
 'Tech Startup',
 'E-commerce',
 'HealthTech',
 'B2B service',
 'Helathcare',
 'Renewable Energy',
 'E-commerce',
 'Electronics',
 'Renewable Energy',
 'IT startup',
 'EdTech',
 'FinTech',
 'FinTech',
 'EdTech',
 'Food & Beverages',
 'Aeorspace',
 'FinTech',
 'Deep Tech',
 'HealthTech',
 'HealthTech',
 'Dating',
 'EdTech',
 'Gaming',
 'Robotics',
 'Retail',
 'Food',
 'Oil and Energy',
 'FinTech',
 'Tech Startup',
 'AgriTech',
 'Electronics',
 'Food & Beverages',
 'Telecommuncation',
 'Milk startup',
 'EdTech',
 'AI Chatbot',
 'IT',
 'Logistics',
 'Hospitality',
 'Fashion',
 'E-commerce',
 'FinTech',
 'Marketing',
 'Transportation',
 'LegalTech',
 'Food delivery',
 'EdTech',
 'EdTech',
 'Automotive',
 'FinTech',
 'FinTech',
 'SaaS startup',
 'HealthTech',
 'Renewable Energy',
 'FinTech',
 'Fantasy sports',
 'Food & Beverages',
 'FinTech',
 'FinTech',
 'FinTech',
 'FinTech'

In [None]:
#Categorizing the Stages into 5 groups
stages = df.Stage.tolist()
stage=[]
stage = ['Series A' if 'series a' in str(a).lower() 
         else 'Series B' if 'series b' in str(a).lower() 
         else 'Series C' if 'series c' in str(a).lower()
         else 'Series D' if 'series d' in str(a).lower() 
         else 'Series E' if 'series e' in str(a).lower() 
         else 'Series F' if 'series f' in str(a).lower() 
         else 'Series G' if 'series g' in str(a).lower() 
         else 'Seed' if 'seed' in str(a).lower() 
         else 'Other'
         for a in stages]
stage

In [None]:
df['Category']=cat
df['Categorised_stage']=stage

In [None]:
df.head(2)

In [None]:
df.info()

In [None]:
#Dropping the Town,Other,City and column10 columns 
df.drop(['Town','Other','City','column10'],axis =1,inplace=True)

In [None]:
#Converting the Amount column to numeric
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')

In [None]:
df.Category.unique()
df.head(1)

In [None]:
# Filling the nan with the average of their respective category
cat_median_dict={}
for cat in df.Category.unique(): #.Amount.mean()
    cat_median_dict[cat] = df[df.Category == cat].Amount.median()

cat_median_dict

In [None]:
# Filling  the Amount values less than 1000 with the average of their respective category
cat_average_dict={}
for cat in df.Category.unique(): #.Amount.mean()
    cat_average_dict[cat] = df[df.Category == cat].Amount.mean()

cat_average_dict

In [None]:
#Filling the Amount values less than 1000 with the average of their category
index_list = df.query("Amount < 1000").index.tolist()
for nums in index_list:
    cat=df.iloc[nums].Category
    old=df.iloc[nums].Amount
    df.replace(old,cat_average_dict[cat],inplace=True)

In [None]:
#Fill na with their respective averages
index_list = df.query("Amount == 'nan'").index.tolist()
for nums in index_list:
    cat=df.iloc[nums].Category
    old=df.iloc[nums].Amount
    df.replace(old,cat_median_dict[cat],inplace=True)


In [132]:
# Sum of nulls using .isna()
df.isna().sum()

Company_Brand           0
Founded               769
HeadQuarter           114
Sector                 18
What_it_does            0
Founders              545
Investor              626
Stage                 938
Amount                545
Investment_year         0
Country              2339
Headquarter_City      114
Category                0
Categorised_stage       0
dtype: int64

In [133]:
# Find nulls using .isnull()
df=df.replace(r'^s*$', float('NaN'), regex = True)

In [134]:
# Sum of nulls using .isna()
df.isna().sum()

Company_Brand           0
Founded               769
HeadQuarter           114
Sector                 18
What_it_does            0
Founders              545
Investor              626
Stage                 938
Amount                545
Investment_year         0
Country              2339
Headquarter_City      114
Category                0
Categorised_stage       0
dtype: int64

In [135]:
df.dropna(subset = ['Amount'], inplace = True) 

In [None]:
#Checking Normality of the data 
from scipy import stats
def check_normality(data):
    test_stat_normality, p_value_normality=stats.shapiro(data)
    print("p value:%.4f" % p_value_normality)
    if p_value_normality <0.05:
        print("Reject null hypothesis >> The data is not normally distributed")
    else:
        print("Fail to reject null hypothesis >> The data is normally distributed")

In [136]:
fintech=df[df.Category=='Finance & FinTech']
tech = df[df.Category=='Technology & Software']
commerce=df[df.Category=='E-commerce & Retail']
health=df[df.Category=='Healthcare & HealthTech']
edu=df[df.Category=='Education']
other=df[df.Category=='Other']


In [137]:
fintech
check_normality(fintech.Amount)
check_normality(tech.Amount)
check_normality(commerce.Amount)
check_normality(health.Amount)
check_normality(edu.Amount)
check_normality(other.Amount)

p value:0.0000
Reject null hypothesis >> The data is not normally distributed
p value:0.0000
Reject null hypothesis >> The data is not normally distributed
p value:0.0000
Reject null hypothesis >> The data is not normally distributed
p value:0.0000
Reject null hypothesis >> The data is not normally distributed
p value:0.0000
Reject null hypothesis >> The data is not normally distributed
p value:0.0000
Reject null hypothesis >> The data is not normally distributed


In [138]:
#Use an ANOVA test (READ ABOUT IT) (Non parametric tests)
#Using the P-Levene to test the Hypothesis
stat, pvalue_levene= stats.levene(fintech.Amount, commerce.Amount, health.Amount,tech.Amount,edu.Amount,other.Amount)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.2760
Fail to reject null hypothesis >> The variances of the samples are same.


In [139]:
#T-test
stats.f_oneway(fintech.Amount, commerce.Amount, health.Amount,tech.Amount,edu.Amount,other.Amount)

F_onewayResult(statistic=1.2655804543028475, pvalue=0.2760112593145255)

In [141]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('Amount~C(Category)', data=df).fit()
model.summary() 

0,1,2,3
Dep. Variable:,Amount,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,1.266
Date:,"Fri, 08 Mar 2024",Prob (F-statistic):,0.276
Time:,16:39:38,Log-Likelihood:,-65741.0
No. Observations:,2334,AIC:,131500.0
Df Residuals:,2328,BIC:,131500.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.275e+10,2.31e+10,2.712,0.007,1.74e+10,1.08e+11
C(Category)[T.Education],-6.272e+10,3.56e+10,-1.760,0.079,-1.33e+11,7.17e+09
C(Category)[T.Finance & FinTech],-6.221e+10,3.36e+10,-1.852,0.064,-1.28e+11,3.65e+09
C(Category)[T.Healthcare & HealthTech],-6.273e+10,3.99e+10,-1.572,0.116,-1.41e+11,1.55e+10
C(Category)[T.Other],-6.272e+10,2.7e+10,-2.320,0.020,-1.16e+11,-9.7e+09
C(Category)[T.Technology & Software],-6.273e+10,3.02e+10,-2.078,0.038,-1.22e+11,-3.52e+09

0,1,2,3
Omnibus:,7332.951,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,522587162.757
Skew:,48.08,Prob(JB):,0.0
Kurtosis:,2319.117,Cond. No.,7.92


In [142]:
anova = sm.stats.anova_lm(model, typ=1)
anova

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
C(Category),5.0,1.083918e+24,2.167836e+23,1.26558,0.276011
Residual,2328.0,3.987673e+26,1.712918e+23,,


### Analytical Questions

#### 1. Does location affect the amount of funding or investments?

In [None]:
df.head(2)

In [None]:
#Converting Amount to millions of dollars
df['Amount']=df.Amount/1000
pd.options.display.float_format = '${:,.2f}'.format


In [None]:
startup_loc= df.groupby('Headquarter_City')['Amount'].mean()

startup_loc

#### 2. Does the Sector/Category of start up affect the fundings?

In [None]:
startup_cat= df.groupby('Category')['Amount'].max()
startup_cat

In [None]:
#### Graph
##Conclusion

#### 3. How many companies need funding and are at what level of funding ?

In [None]:
startup_stages= df.groupby(['Category','Categorised_stage'])['Sector'].count().sort_values(ascending=False)
startup_stages

#### 4. Which stages give out the highest investment amounts?

In [None]:
startup_sectors= df.groupby('Categorised_stage')['Amount'].max()
startup_sectors

#### 5. Which cities have the highest number of startups and at what levels?

In [None]:
startup_cities= df.groupby('Headquarter_City')['Categorised_stage'].count().sort_values(ascending=False)
startup_cities

#### 6. What are the levels of funding the startups are receiving per year?

In [None]:
startup_cities= df.groupby('Investment_year')['Amount'].sum().sort_values(ascending=True)
startup_cities