<h1 align="center">Title: Indian Startup Funding Analysis</h1>

## Business Objective:
The goal of this project is to investigate the Indian Startup Ecosystem to better understand and provide valuable insight into the opportunites and challenges in the ecosystem in order to help stakeholders who plan on venturing into the startup ecosystem in India to make informed decisions based on findings from analyzing the dataset from 2018 to 2021.

### Import all necessary libraries

In [516]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
import re

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Database Libraries
import pyodbc

# Other Utilities
from dotenv import dotenv_values
from warnings import filterwarnings
filterwarnings('ignore')

### Create connection for SQL Server

In [517]:
# Load Environment Variables
config = dotenv_values('.env')

Server_name = config.get('Server')
Database_name = config.get('Database')
Username = config.get('Login')
PassWord = config.get('Password')

# Create Database Connection
connection_string = f"DRIVER={{SQL Server}};SERVER={Server_name};DATABASE={Database_name};UID={Username};PWD={PassWord}"

connection = pyodbc.connect(connection_string)

### Load tables from SQL and save a copy of the dataset

In [518]:
# Query table as dataframe
query = "SELECT * FROM dbo.LP1_startup_funding2020"
df_2020 = pd.read_sql(query, connection)

# Preview the dataset
df_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [519]:
df_2020.shape

(1055, 10)

In [520]:
# Query table as dataframe
query = "SELECT * FROM dbo.LP1_startup_funding2021"
df_2021 = pd.read_sql(query, connection)

# Preview the dataset
df_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [521]:
df_2021.shape

(1209, 9)

In [522]:
# Save the dataset as csv files for each year
# df_2021.to_csv('startup_funding2021.csv')
# df_2020.to_csv('startup_funding2020.csv')

In [523]:
df_2021.shape

(1209, 9)

### Load other dataset in csv file

In [524]:
# Read csv file
df_2018 = pd.read_csv('startup_funding2018.csv')
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [525]:
df_2018.shape

(526, 6)

In [526]:
# Read csv file
df_2019 = pd.read_csv('startup_funding2019.csv')
df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [527]:
df_2019.shape

(89, 9)

<h3>Hypothesis</h3>
<p>Null: Tech startups do not receive funding from investors.</p>
<p>Alternative: Tech startups receive funding from investors.</p>

<h3>Questions</h3>

1. What is the total amount of funding received from investors from 2018 to 2021?
2. How many startups emerged from 2018 to 2021?
3. What is the level of funding for startups based on their sector?
4. How dispersed are the startup firms over the locations?
5. What is the trend of funding based on the years of funding?
6. Is there a relationship between stage and funding receive?

## Rename columns, generate columns where needed and merge datasets

In [528]:
# Add Funding Year for all the dataset
df_2021['Funding_Year']=2021; df_2020['Funding_Year']=2020; df_2019['Funding_Year']=2019; df_2018['Funding_Year']=2018

In [529]:
# Adjust the column names of the columns in the datasets

df_2021.rename(columns={'Amount': 'Amount($)'},inplace=True)

df_2020.rename(columns={'Amount': 'Amount($)'},inplace=True)

df_2019.rename(columns={'Company/Brand': 'Company_Brand',
                        'What it does': 'What_it_does'}, inplace=True)

df_2018.rename(columns={'Company Name': 'Company_Brand',
                'Amount': 'Amount($)',
                'About Company': 'What_it_does',
                'Round/Series': 'Stage'}, inplace=True)

In [530]:
df_2021.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021


In [531]:
df_2020.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,column10,Funding_Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020


In [532]:
df_2019.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019


In [533]:
df_2018.head(1)

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018


Observing a preview of all the dataset, we will consider the 2021 dataset as our base format the columns in all our dataset. The 2018 dataset do not follow the agreed base dataset and must be corrected to meet the standard. We see for instance from the Location column that the value in the first index position matches the values for HeadQuarters in the other datasets. The Location column in the 2018 dataset must be dropped and the industry column name changed to Sector.

### Extract HeadQuarter from Location column in the 2018 dataset

In [534]:
# Add HeadQuarter column from the location column
df_2018['HeadQuarter'] = df_2018['Location'].str.split(',').str[0]
df_2018.head()

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year,HeadQuarter
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,Bangalore
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,Mumbai
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,Gurgaon
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,Noida
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018,Hyderabad


In [535]:
df_2018.rename(columns={'Industry':'Sector'}, inplace=True)

df_2018.drop('Location', axis=1, inplace=True)

df_2018.head()

Unnamed: 0,Company_Brand,Sector,Stage,Amount($),What_it_does,Funding_Year,HeadQuarter
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"TheCollegeFever is a hub for fun, fiesta and f...",2018,Bangalore
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000",A startup which aggregates milk from dairy far...,2018,Mumbai
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000",Leading Online Loans Marketplace in India,2018,Gurgaon
3,PayMe India,"Financial Services, FinTech",Angel,2000000,PayMe India is an innovative FinTech organizat...,2018,Noida
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,Eunimart is a one stop solution for merchants ...,2018,Hyderabad


Concatenate the dataset into one full dataset

In [536]:
full_df = pd.concat([df_2021, df_2020, df_2019, df_2018],ignore_index=True)
full_df.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year,column10
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021,
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021,
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021,
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021,
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021,


## Exploratory Data Analysis (EDA)

In [537]:
print('The rows and columns in the dataset are', full_df.shape, 'respectively.')

The rows and columns in the dataset are (2879, 11) respectively.


In [538]:
# Check structure of the dataset
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2879 non-null   object 
 1   Founded        2110 non-null   float64
 2   HeadQuarter    2765 non-null   object 
 3   Sector         2861 non-null   object 
 4   What_it_does   2879 non-null   object 
 5   Founders       2334 non-null   object 
 6   Investor       2253 non-null   object 
 7   Amount($)      2622 non-null   object 
 8   Stage          1941 non-null   object 
 9   Funding_Year   2879 non-null   int64  
 10  column10       2 non-null      object 
dtypes: float64(1), int64(1), object(9)
memory usage: 247.5+ KB


We see the datatype for each column in the dataset and the number of non-null values in each column.

<h3>Check for missing values</h3>

In [539]:
# Check for total missing values in the columns
full_df.isna().sum()

Company_Brand       0
Founded           769
HeadQuarter       114
Sector             18
What_it_does        0
Founders          545
Investor          626
Amount($)         257
Stage             938
Funding_Year        0
column10         2877
dtype: int64

* We see the total number of missing values in the columns.
* Handling of missing values will be done in the data cleaning phase when the individual columns are inspected.

<h3>Check for duplicate values</h3>

In [540]:
f'There are {full_df.duplicated().sum()} duplicate values in the 2021 dataset'

'There are 23 duplicate values in the 2021 dataset'

* Handing of duplicate values will be done in the data cleaning phase when the  individual columns are inspected.

<h3>Check the descriptive statistics of all the columns in each dataset</h3>

In [541]:
full_df.describe(include=['number', 'object']).T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Company_Brand,2879.0,2214.0,BharatPe,10.0,,,,,,,
Founded,2110.0,,,,2016.079621,4.368006,1963.0,2015.0,2017.0,2019.0,2021.0
HeadQuarter,2765.0,141.0,Bangalore,866.0,,,,,,,
Sector,2861.0,873.0,FinTech,173.0,,,,,,,
What_it_does,2879.0,2691.0,Provides online learning classes,5.0,,,,,,,
Founders,2334.0,1980.0,"Ashneer Grover, Shashvat Nakrani",7.0,,,,,,,
Investor,2253.0,1777.0,Inflection Point Ventures,36.0,,,,,,,
Amount($),2622.0,774.0,—,148.0,,,,,,,
Stage,1941.0,75.0,Seed,606.0,,,,,,,
Funding_Year,2879.0,,,,2020.023619,1.086974,2018.0,2020.0,2020.0,2021.0,2021.0


<h4>From the descriptive tables above, we are able to observe the minimum, maximum, and other statistics of the columns. The categorical columns however do not have most of these values as we cannot get most of the statistical measures on categorical values. Further investigation on the columns will be proceeded with in order to get a wholistic view of the dataset.</h4>

## Data Cleaning

### Correct duplicates in the dataset

In [542]:
f'We have {full_df.duplicated().sum()} duplicated rows in the dataset and since there are no unique identifiers assigned to the values we drop the duplicated values.'

'We have 23 duplicated rows in the dataset and since there are no unique identifiers assigned to the values we drop the duplicated values.'

In [543]:
full_df.drop_duplicates(keep='first', inplace=True)

full_df.duplicated().sum()

0

### Drop Column 10 from the dataset

Considering the number of missing values in the column, imputing is with any value will be in-valid hence we must drop the column.

In [544]:
full_df.drop('column10', axis=1, inplace=True)
full_df.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021


### Check the amount column for inconsistencies

In [545]:
# Check the unique values in the column
full_df['Amount($)'].unique()

array(['$1,200,000', '$120,000,000', '$30,000,000', '$51,000,000',
       '$2,000,000', '$188,000,000', '$200,000', 'Undisclosed',
       '$1,000,000', '$3,000,000', '$100,000', '$700,000', '$9,000,000',
       '$40,000,000', '$49,000,000', '$400,000', '$300,000',
       '$25,000,000', '$160,000,000', '$150,000', '$1,800,000',
       '$5,000,000', '$850,000', '$53,000,000', '$500,000', '$1,100,000',
       '$6,000,000', '$800,000', '$10,000,000', '$21,000,000',
       '$7,500,000', '$26,000,000', '$7,400,000', '$1,500,000',
       '$600,000', '$800,000,000', '$17,000,000', '$3,500,000',
       '$15,000,000', '$215,000,000', '$2,500,000', '$350,000,000',
       '$5,500,000', '$83,000,000', '$110,000,000', '$500,000,000',
       '$65,000,000', '$150,000,000,000', '$300,000,000', '$2,200,000',
       '$35,000,000', '$140,000,000', '$4,000,000', '$13,000,000', None,
       '$Undisclosed', '$2000000', '$800000', '$6000000', '$2500000',
       '$9500000', '$13000000', '$5000000', '$8000000',

* We consider the base value for the amount column to be dollars so any value in the amount column with no sign will be treated as a dollar value.
* Certain rows in the amount column have wrong inputs which are values of other columns and these rows must also be corrected.
* The undisclosed values in the column should be replaced with NaN since we can't replace it with zero as that will mean the amount raised was 0 dollars.
* The rows with INR currency sign will be converted using the exchange rate for the year in which funding was received.
* The column must be converted to a float dtype after correcting the inconsistencies

In [546]:
# Check for null values in the Amount column
f"There are {full_df['Amount($)'].isnull().sum()} total missing values in the amount column."

'There are 256 total missing values in the amount column.'

The missing values in the amount column cannot be replace with the mean or with zeros. Dealing with these missing values, the assumption is that all the missing values are undisclosed values in our dataset and hence we can drop these values.

In [547]:
# Drop missing values in the dataset where the Amount is null
full_df.dropna(axis=0, subset='Amount($)', inplace=True)

In [548]:
# Check missing values in the columns of the dataset
full_df.isnull().sum()

Company_Brand      0
Founded          700
HeadQuarter       97
Sector            16
What_it_does       0
Founders         539
Investor         612
Amount($)          0
Stage            787
Funding_Year       0
dtype: int64

In [549]:
# Assign the index of the rows to a string
symbols_string = full_df.index[full_df['Amount($)'].isin(['$','—'])]
symbols = ['$','—']

# Replace the columns with a dollar sign or ' in the amount value with NA
full_df.loc[symbols_string, ['Amount($)']] = full_df.loc[symbols_string, ['Amount($)']].replace(symbols, pd.NA, inplace=True)

In [550]:
# Preview rows with amount vales showing $Undisclosed, $undisclosed, and Undisclosed
undisclosed_list = ['$Undisclosed', '$undisclosed', 'Undisclosed']

# Get the index for all rows with undisclosed
undisclosed_index = full_df.index[full_df['Amount($)'].isin(undisclosed_list)]

# Replace undisclosed values with NA
full_df.loc[undisclosed_index, ['Amount($)']] = full_df.loc[undisclosed_index, ['Amount($)']].replace(undisclosed_list, pd.NA)

In [551]:
# Change the values in the amount column with the original amount values in the Stage column
amount_list = ['ITO Angel Network, LetsVenture', 'JITO Angel Network, LetsVenture', 'ah! Ventures', 'Upsparks']

amount_index = full_df.index[full_df['Amount($)'].isin(amount_list)]

stage_values = full_df.loc[amount_index, ['Amount($)']].values

full_df.loc[amount_index, ['Amount($)']] = full_df.loc[amount_index, ['Stage']].values

full_df.loc[amount_index, ['Stage']] = stage_values
full_df.loc[amount_index]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
98,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",$1200000,Upsparks,2021
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in ...,Vishal Gupta,$300000,ah! Ventures,2021
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employ...,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale",$300000,"ITO Angel Network, LetsVenture",2021
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...",Pedagogy,Sushil Agarwal,$1000000,"JITO Angel Network, LetsVenture",2021


In [552]:
# Change the values in the amount column with the original amount values in the Stage column
stage_list = ['Seed', 'Pre-series A', 'Series C']

stage_index = full_df.index[full_df['Amount($)'].isin(stage_list)]

full_df.loc[stage_index, ['Stage']] = full_df.loc[stage_index, ['Amount($)']].values

full_df.loc[stage_index, ['Amount($)']] = full_df.loc[stage_index, ['Investor']].values

full_df.loc[stage_index, ['Investor']] = full_df.loc[stage_index, ['Investor']].replace(stage_list, pd.NA, inplace=True)
full_df.loc[stage_index]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,,$22000000,Series C,2021
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",,$5000000,Seed,2021
545,AdmitKard,2016.0,Noida,EdTech,A tech solution for end to end career advisory...,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",,$1000000,Pre-series A,2021
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics...,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",,1000000\t#REF!,Seed,2021


In [553]:
full_df.shape

(2600, 10)

In [554]:
full_df.isnull().sum()

Company_Brand      0
Founded          700
HeadQuarter       97
Sector            16
What_it_does       0
Founders         539
Investor         616
Amount($)        299
Stage            783
Funding_Year       0
dtype: int64

A total of 299 firms didn't disclose or failed to provide information on the amount of funding received. We will proceed with dropping these values since there do not give us insight into the funding amount received.

In [555]:
full_df.dropna(subset='Amount($)', inplace=True)

In [556]:
full_df.shape

(2301, 10)

In [557]:
# Change the data type of the amount column to string and replace the $ and , with an empty string
full_df['Amount($)'] = full_df['Amount($)'].astype(str)

full_df['Amount($)'] = full_df['Amount($)'].apply(lambda x: x.replace('$',''))

full_df['Amount($)'] = full_df['Amount($)'].apply(lambda x: x.replace(',',''))

In [558]:
# Get the index position of the amount values that contain ₹
INR_index = full_df.index[full_df['Amount($)'].str.contains('₹')]
full_df.loc[INR_index, ['Amount($)']]

Unnamed: 0,Amount($)
2354,₹40000000
2355,₹65000000
2359,₹16000000
2360,₹50000000
2368,₹100000000
...,...
2866,₹1130000000
2867,₹810000000
2869,₹1400000000
2870,₹10000000


* According to [Exchange Rates UK](https://www.exchangerates.org.uk/INR-USD-spot-exchange-rates-history-2018.html), the average rate for 1 INR to USD in 2018 was `0.0146` and that is the rate we are using to convert the INR amount to dollars.

In [559]:
# Change the INR symbol and other special characters and replace them with an empty string
INR_index = full_df.index[full_df['Amount($)'].str.contains('₹')]

full_df['Amount($)'] = full_df['Amount($)'].apply(lambda x: x.replace('₹',''))
full_df['Amount($)'] = full_df['Amount($)'].apply(lambda x: x.replace('\t#REF!',''))

# Convert the amount column to a float datatype
full_df['Amount($)'] = full_df['Amount($)'].astype(float)

# Apply the exchange rate to the amount with the INR symbol
full_df.loc[INR_index, ['Amount($)']] = full_df.loc[INR_index, ['Amount($)']].values * 0.0146

full_df.loc[INR_index, ['Amount($)']]

Unnamed: 0,Amount($)
2354,584000.0
2355,949000.0
2359,233600.0
2360,730000.0
2368,1460000.0
...,...
2866,16498000.0
2867,11826000.0
2869,20440000.0
2870,146000.0


### Check the headquarter column for inconsistencies

In [560]:
full_df['HeadQuarter'].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Thane', 'Pune', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'Pharmaceuticals\t#REF!', 'Gurugram\t#REF!',
       'Kolkata', 'Ahmedabad', 'Indore', 'Powai', 'Ghaziabad', 'Nagpur',
       'West Bengal', 'Patna', 'Samsitpur', 'Lucknow', 'Telangana',
       'Haryana', 'Silvassa', 'Faridabad', 'Ambernath', 'Panchkula',
       'Surat', 'Andheri', 'Telugana', 'Bhubaneswar', 'Kottayam',
       'Beijing', 'Panaji', 'Coimbatore', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Trivandrum', 'Bhilwara', 'Kochi', 'London',
       'Information Technology & Services', 'The Nilgiris', 'Gandhinagar',
       'Belgaum', 'Tirunelveli, Tamilnadu', 'Singapore',
       'Jaipur, Rajastan', 'Delhi', 'California', '

* Foriegn values in the HeadQuarter column must be corrected
* Some HeadQuarters have values which are suppose to be trimmed down for consistency

In [561]:
sector_values = ['Information Technology & Services','Food & Beverages','Pharmaceuticals\t#REF!', 'Gurugram\t#REF!']
sector_values_index = full_df.index[full_df['HeadQuarter'].isin(sector_values)]
full_df.loc[sector_values_index]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
241,MasterChow,2020.0,Food & Beverages,Hauz Khas,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,461000.0,Seed,2021
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,,22000000.0,Series C,2021
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",,5000000.0,Seed,2021
1176,Peak,2014.0,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,Atul Sharma,SoftBank Vision Fund 2,75000000.0,Series C,2021


* At index position 242 and 257 the Investor, Founders, What_it_does and Sector values have been shifted to the left by one value and must be corrected
* At index position 1176, HeadQuarter value have been filled with Sector values which needs to be corrected
* From researching on the company names, it was realized that there were company's who had wrong values inputed and these values were corrected with the right information

In [562]:
full_df.loc[[32, 98, 241, 242, 1173, 1176, 257,1190,1393, 1483,2395, 2412]]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
32,SuperK,2019.0,"Small Towns, Andhra Pradesh",Retail,SuperK is a full-stack solution to empower sma...,Neeraj Menta,STRIVE VC,800000.0,Seed,2021
98,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",1200000.0,Upsparks,2021
241,MasterChow,2020.0,Food & Beverages,Hauz Khas,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,461000.0,Seed,2021
242,Fullife Healthcare,2009.0,Pharmaceuticals\t#REF!,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,,22000000.0,Series C,2021
1173,moneyHOP,2018.0,London,Financial Services,moneyHOP is India’s first cross-border neo bank.,Mayank Goyal,,1200000.0,Seed,2021
1176,Peak,2014.0,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,Atul Sharma,SoftBank Vision Fund 2,75000000.0,Series C,2021
257,MoEVing,2021.0,Gurugram\t#REF!,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",,5000000.0,Seed,2021
1190,Prolgae,2016.0,The Nilgiris,Biotechnology,Prolgae Spirulina Supplies Pvt. Ltd. is a Nord...,Aakas Sadasivam,Vijayan,200000.0,Seed,2021
1393,Dhurnia,2018.0,"Dhingsara, Haryana",EdTech,Developer of an online learning platform inten...,"Ajay Kumar, Murari Singh","Chandigarh Angels Network, Modulor Capital",100000.0,,2020
1483,Antaios,2016.0,France,Tech company,Developer of memory-based technology intended ...,Jean Pierre Nozieres,,11000000.0,,2020


* HeadQuarter values for these rows will be corrected with the right values

In [563]:
# Change the value of Sector to value of HeadQuarter value at index position 241
full_df.loc[241, ['Sector']] = full_df.loc[241, ['HeadQuarter']].values

# Change the values of Investor, Founders, and What_it_does to value with right values at index positions 242, and 257
full_df.loc[[242, 257], ['Investor']] = full_df.loc[[242, 257], ['Founders']].values
full_df.loc[[242, 257], ['Founders']] = full_df.loc[[242, 257], ['What_it_does']].values
full_df.loc[[242, 257], ['What_it_does']] = full_df.loc[[242, 257], ['Sector']].values

# Swap the Sector and HeadQuarter values at index position 1176
hq_value_1176 = full_df.loc[1176, ['Sector']].values
full_df.loc[[242, 1176], ['Sector']] = full_df.loc[[242, 1176], ['HeadQuarter']].values

full_df.loc[1176, ['HeadQuarter']] = hq_value_1176

# Correct wrong values in the column with the right values
correction_values = ['Bangalore', 'Bangalore', 'New Delhi', 'Mumbai', 'Bengaluru', 'Manchester', 'Bangalore', 'Nilgiris', 'Guruguram', 'Grenoble','Lucknow', 'New Delhi']
correction_values_index = [32, 98, 241, 242, 1173, 1176, 257,1190,1393, 1483,2395, 2412]
full_df.loc[correction_values_index, ['HeadQuarter']] = correction_values

# Clean special characters in the column
full_df['HeadQuarter'] = full_df['HeadQuarter'].str.replace('\\t#REF!', '', regex=True)

* Certain values in the HeadQuarter column have more information than need. After inspecting the values it was realized that the correct values are in the first postion of the string so this inconsistency must be corrected too.

In [564]:
# Get the values in the first position of the strings in all the rows
full_df['HeadQuarter'] = full_df['HeadQuarter'].str.split(',').str.get(0)

In [565]:
full_df.loc[full_df['HeadQuarter'] == 'Bangalore City']

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
2476,FeedMyPockets,,Bangalore City,"Advertising, Human Resources, Marketing",On Demand Staffing Platform,,,642400.0,Seed,2018
2591,Trell,,Bangalore City,—,Trell is a location based network which helps ...,,,1250000.0,Seed,2018
2688,BetterPlace Safety Solutions Pvt. Ltd.,,Bangalore City,"Human Resources, Security, Training",BetterPlace provides businesses with contracto...,,,3000000.0,Series A,2018
2721,Idha Skin Clinic,,Bangalore City,"Beauty, Cosmetics, Health Care, Service Industry",Idha skin clinic jayanagar is dedicated to res...,,,73000.0,Seed,2018
2842,GoGaga,,Bangalore City,"Dating, Private Social Networking",GoGaga is a new age dating app that has remove...,,,40000.0,Non-equity Assistance,2018


In [566]:
# Replace Bangalore City with Bangalore
bangalore_city_index =  full_df.index[full_df['HeadQuarter'] == 'Bangalore City']

full_df.loc[bangalore_city_index, ['HeadQuarter']] = 'Bangalore'

* Fill null values in the column with Unstated to help describe the data adequately

In [567]:
f"There are {full_df['HeadQuarter'].isnull().sum()} missing values in the HeadQuarter column and we will replace them with Unstated"

'There are 94 missing values in the HeadQuarter column and we will replace them with Unstated'

In [568]:
imputer = SimpleImputer(missing_values=pd.NA, strategy='constant', fill_value='Unstated')
imputer.fit(full_df['HeadQuarter'].values.reshape(-1, 1))

In [569]:
full_df['HeadQuarter'] = imputer.transform(full_df['HeadQuarter'].values.reshape(-1, 1)).reshape(-1)

full_df['HeadQuarter'].isnull().sum()

0

In [570]:
full_df.loc[257, ['Investor']] = full_df.loc[257, ['Founders']].values

full_df.loc[257, ['Founders']] = full_df.loc[257, ['What_it_does']].values

full_df.loc[257, ['What_it_does']] = full_df.loc[257, ['Sector']].values

full_df.loc[257, ['Sector']] = full_df.loc[257, ['Sector']].replace(stage_list, pd.NA, inplace=True)
full_df.loc[257]

Company_Brand                                              MoEVing
Founded                                                     2021.0
HeadQuarter                                              Bangalore
Sector                                                        None
What_it_does     MoEVing is India's only Electric Mobility focu...
Founders         MoEVing is India's only Electric Mobility focu...
Investor                               Vikash Mishra, Mragank Jain
Amount($)                                                5000000.0
Stage                                                         Seed
Funding_Year                                                  2021
Name: 257, dtype: object

### Check the Stage column for inconsistencies

Now, we have to correct the various stages. The assumption for this column are as follows:
* Pre-seed: This is the earliest stage of funding for a startup. Group all pre seed values under this value.
* Seed: This is the first round of institutional funding for a startup.
* Pre-series: This term is used to describe a funding round that is raised before a Series A round. Group all pre series values under this value
* Series A: This is the second round of institutional funding for a startup.
* Series B: This is the third round of institutional funding for a startup.
* Series C: This is the fourth round of institutional funding for a startup.
* Later Stage: Series D, Series E, Series F are subsequent rounds of institutional funding after a Series C round.
* Equity: This is a form of funding that is provided in exchange for an ownership stake in the company and it also constitutes PE and Grant.
* Others: Every other value that is present in the column.
* Unstated: These are undisclosed stages and null values in the column

In [571]:
# View unique variables in the Stage column
full_df['Stage'].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt',
       'Upsparks', 'Bridge', 'Seed+', 'Series F2', 'Series A+',
       'Series G', 'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', 'ah! Ventures', 'ITO Angel Network, LetsVenture',
       'Series D1', 'JITO Angel Network, LetsVenture', 'Seies A',
       'Pre-series', 'Series A2', 'Series I', 'Pre-series C', 'Series B2',
       'Pre- series A', 'Edge', 'Pre-Series B', 'Seed A', 'Series A-1',
       'Seed round', 'Seed Round & Series A', 'Pre Series A',
       'Pre series A1', 'Series E2', 'Pre series A', 'Seed Round',
       'Pre series B', 'Pre series C', 'Angel Round', 'Mid series',
       'Pre seed round', 'Seed funding', 'Seed Funding', nan,
       'Fresh funding', 'Post series A', 'Seed fund', 'Series B+',
       'Angel', 'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO De

In [572]:
# Check null values in the column
f'''Considering we have {full_df['Stage'].isnull().sum()} missing values in the Stge column, replacing it with the mode will reduce the quality of our dataset, thus, we will replace the missing values with 'Unstated' to give wide perspective to the dataset.'''

"Considering we have 708 missing values in the Stge column, replacing it with the mode will reduce the quality of our dataset, thus, we will replace the missing values with 'Unstated' to give wide perspective to the dataset."

In [573]:
# Fill nulll values using simple imputer
imputer = SimpleImputer(missing_values=pd.NA, strategy='constant',  fill_value='Unstated')
imputer.fit(full_df['Stage'].values.reshape(-1, 1))

In [574]:
# Check for null values in the column
full_df['Stage'] = imputer.transform(full_df['Stage'].values.reshape(-1, 1)).reshape(-1)

full_df['Stage'].isnull().sum()

0

In [575]:
# Create a function which regroups the stage column values into our assumed values

def stage_cleaner(data, column_name='column'):
    stage_group = []
    for row in data[column_name]:
        if re.search(r'Pre[- ]?seed', row, re.IGNORECASE):
            stage_group.append('Pre seed')
        elif re.search(r'Seed|\bAngel\b', row, re.IGNORECASE):
            stage_group.append('Seed')
        elif re.search(r'Pre[- ]?ser(?:ies)?', row, re.IGNORECASE):
            stage_group.append('Pre series')
        elif re.search(r'Series[- ]?A(?:\D|$)', row, re.IGNORECASE):
            stage_group.append('Series A')
        elif re.search(r'Series[- ]?B(?:\D|$)', row, re.IGNORECASE):
            stage_group.append('Series B')
        elif re.search(r'Series[- ]?C(?:\D|$)', row, re.IGNORECASE):
            stage_group.append('Series C')
        elif re.search(r'Series[- ]?[DEFGH]', row, re.IGNORECASE):
            stage_group.append('Later stage')
        elif re.search(r'\bGrant\b|\bEquity\b|PE', row, re.IGNORECASE):
            stage_group.append('Equity')
        elif re.search(r'\bund(?:isclosed)?\b|\bUnstated\b', row, re.IGNORECASE):
            stage_group.append('Unstated')
        else:
            stage_group.append('Others')
    return stage_group

full_df['Stage'] = stage_cleaner(full_df, column_name='Stage')

In [578]:
# Check null values to ensure values have been regrouped
full_df['Stage'].unique()

array(['Pre series', 'Unstated', 'Later stage', 'Series C', 'Seed',
       'Series B', 'Pre seed', 'Series A', 'Others', 'Equity'],
      dtype=object)

### Check the Investor, Founders, and What_it_does columns for inconsistencies

In [None]:
full_df['Investor'].unique()

array(['BEENEXT, Entrepreneur First',
       'Unilazer Ventures, IIFL Asset Management',
       'GSV Ventures, Westbridge Capital', 'CDC Group, IDG Capital',
       'Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal', 'Vy Capital',
       'CIIE.CO, KIIT-TBI', None,
       '9Unicorns Accelerator Fund, Metaform Ventures',
       'SucSEED Indovation, IIM Calcutta Innovation Park',
       'Safe Planet Medicare', 'Impact Partners, C4D Partners',
       'Tiger Global Management, InnoVen Capital', 'Novo Tellus Capital',
       'Raintree Family Office, ADB arm', 'Inflection Point Ventures',
       'Mumbai Angels, Narendra Shyamsukha', 'Paradigm, Kunal Shah',
       'Matrix Partners India, GIC', 'Mumbai Angels Network, Expert DOJO',
       'GVFL', 'Kotak Mahindra Bank, FMO', 'Kalaari Capital',
       'NB Ventures, IAN Fund',
       'Sequoia Capital India, Hummingbird Ventures',
       'Gaurav Munjal, Snehil Khanor', 'JITO Angel Network, SOSV',
       'Chiratae Ventures, YourNest Venture Capital', '

In [None]:
full_df['Founders'].unique()

array(['Pramod Ghadge, Shahid Memon',
       'Mayank Kumar, Phalgun Kompalli, Ravijot Chugh, Ronnie Screwvala',
       'Smita Deorah, Sumeet Mehta',
       'Aniket Deb, Ankit Tomar, Sachin Agrawal', 'Kapil Banwari',
       'Abhiraj Singh Bhal, Raghav Chandra, Varun Khaitan', 'Gururaj KB',
       'Nidhi Ramachandran, Sachin Chhabra', 'Dr Arbinder Singal',
       'Konark Sharma, Sneh Soni',
       'Dr Mohender Narula, Dr AnandKrishna, Dr Girish Rao',
       'Radhika Choudary, Saurabh Marda',
       'Sankar Bora, Sourjyendu Medda, Vineet Rao',
       'P Raja Manickam, Srinivas Chinamilli, Veerappan V',
       'Arjun P Gupta, Ujjal Majumdar, Sidhartha Gupta',
       'Swapnil Jain, Sujit Das Biswas', 'Chandraprakash Joshi',
       'Ashish Singhal, Govind Soni, Vimal Sagar Tiwari',
       'Harshil Mathur, Shashank Kumar', 'Madhav Kasturia',
       'Anil Yekkala, Dharin Shah, Kuldeep Saxena, Purvi Shah, Sandeep Shah',
       'Kshama Fernandes',
       'Bhaktha Keshavachar, Ravi Prasad Sharma,

In [None]:
full_df['What_it_does'].unique()

array(['Unbox Robotics builds on-demand AI-driven warehouse robotics solutions, which can be deployed using limited foot-print, time, and capital.',
       'UpGrad is an online higher education platform.',
       'LEAD School offers technology based school transformation system that assures excellent learning for every child.',
       ..., 'International education loans for high potential students.',
       'Collegedekho.com is Student’s Partner, Friend & Confidante, To Help Him Take a Decision and Move On to His Career Goals.',
       'India’s first socially distributed full stack financial services platform for small town India'],
      dtype=object)

After inspecting the investor, founders, and what_it_does columns it is observed that they will not be relevant to our analysis so we drop them.

In [None]:
full_df.drop(['Investor', 'Founders', 'What_it_does'], axis=1, inplace=True)

In [None]:
full_df.shape

(1048, 8)

### Check the Sector column for inconsistencies

In [None]:
full_df['Sector'].unique()

array(['AI startup', 'EdTech', 'B2B E-commerce', 'FinTech',
       'Home services', 'HealthTech', 'E-commerce', 'B2B service',
       'Helathcare', 'Renewable Energy', 'Electronics', 'IT startup',
       'Food & Beverages', 'Aeorspace', 'Deep Tech', 'Dating', 'Gaming',
       'Robotics', 'Retail', 'Food', 'Oil and Energy', 'Tech Startup',
       'AgriTech', 'Telecommuncation', 'Milk startup', 'AI Chatbot', 'IT',
       'Logistics', 'Hospitality', 'Fashion', 'Marketing',
       'Transportation', 'LegalTech', 'Food delivery', 'Automotive',
       'SaaS startup', 'Fantasy sports', 'Video communication',
       'Social Media', 'Skill development', 'Rental', 'Recruitment',
       'Sports', 'Computer Games', 'Consumer Goods',
       'Information Technology', 'Apparel & Fashion',
       'Logistics & Supply Chain', 'Healthcare', 'SportsTech', 'HRTech',
       'Wine & Spirits', 'Mechanical & Industrial Engineering',
       'Spiritual', 'Financial Services', 'Industrial Automation',
       'Heal

Clean up the Sector Column
- Some of the values in the column has values separated with commas which convey the same information so we split the rows and select the name in the first position.

In [None]:
full_df['Sector'] = full_df['Sector'].str.split(',').str.get(0).astype(str)
full_df.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,Amount($),Stage,Funding_Year,New_Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,1200000,Pre-series A,2021,Pre-Series
1,upGrad,2015.0,Mumbai,EdTech,120000000,Unstated,2021,Unstated
2,Lead School,2012.0,Mumbai,EdTech,30000000,Series D,2021,Later Stage
3,Bizongo,2015.0,Mumbai,B2B E-commerce,51000000,Series C,2021,Series C
4,FypMoney,2021.0,Gurugram,FinTech,2000000,Seed,2021,Seed


In [None]:
sector_regex_mapping = {
    "Information Technology": r"AI|EdTech|IT|Deep|Chatbot|Tech|SaaS|Fantasy|Robotics|Telecomunication|Software|MLOps|Location|Consumer|Blockchain|Computer (?:startup|company|platform)",
    "Health Care": r"(?:Heal(?:th|care|tech)|Pharma(?:ceuticals|cy)|Nutrition|Veterinary|FemTech|InsureTech)",
    "Financials": r"(?:Fin(?:Tech|ancial)|Venture|Insurance|Cryptocurrency|Equity|Investment|Capital|Banking) (?:Services|Management|Markets)",
    "Consumer Discretionary": r"(?:E(?:-commerce|nt)|Gaming|Retail|Food|Fashion|Cosmetics|Beverages|Beauty|Footwear|Clothing|Celebrity|Matrimony|Tourism|Entertainment) (?:Media|Engagement)",
    "Industrials": r"(?:Aero(?:space)|Renewable|Electronics|Automation|Mechanical|Pollution|Manufacturing|Clean(?:Tech)|Renewable|Maritime|Transportation|Construction|Automobile|Environmental) (?:Services|Engineering|Equipment|Startup)",
    "Communication Services": r"(?:Med(?:ia)|Social|Online|Mobile|Content|Audio|Storytelling|Community|Advertisement) (?:Games|Commerce|Publishing)",
    "Energy": r"Oil(?: and Energy)|Renewable|EV|Solar (?:Energy|Power)",
    "Materials": r"(?:Pollution|Textiles|Environmental) (?:Services|Equipment)",
    "Real Estate": r"(?:Real|Housing|Furniture|Interior|Construction) (?:Estate|Marketplace|Rental|Design)",
    "Utilities": r"water(?: purification)?",
}

def sector_cleaner(data, column='column_name'):
    corrected_sector = []
    for row in data[column]:
        for sector, regex in sector_regex_mapping.items():
            if re.search(regex, row):
                corrected_sector.append(sector)
                break
        else:
            corrected_sector.append('Consumer Staples')

    return corrected_sector

full_df['new_column'] = sector_cleaner(full_df, 'Sector')
full_df.tail(20)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,Amount($),Stage,Funding_Year,New_Stage,new_column
1188,Elixia Tech Solutions,2011.0,Mumbai,Information Technology & Services,1000000,Pre-series A,2021,Pre-Series,Information Technology
1189,bitsCrunch,2020.0,Chennai,Blockchain,700000,Seed,2021,Seed,Information Technology
1190,Prolgae,2016.0,Nilgiris,Biotechnology,200000,Seed,2021,Seed,Consumer Staples
1191,Biddano,2016.0,Pune,Health,2000000,Pre-series A1,2021,Pre-Series,Health Care
1192,Geniemode,2021.0,Gurugram,B2B,2000000,Unstated,2021,Unstated,Consumer Staples
1194,Neokred,2019.0,Bangalore,FinTech,500000,Seed,2021,Seed,Information Technology
1195,Delhivery,2011.0,Gurugram,Logistics & Supply Chain,76000000,Series I,2021,Other terms,Consumer Staples
1196,Flipspaces,2011.0,Mumbai,Design,2000000,Pre-series B,2021,Pre-Series,Consumer Staples
1197,Fleek,2021.0,Bangalore,Internet,1000000,Seed,2021,Seed,Consumer Staples
1198,GoKwik,2020.0,New Delhi,Information Technology & Services,5000000,Pre-series A,2021,Pre-Series,Information Technology


Group the sector column into the 11 major industry sectors based on GSIC standards.

SyntaxError: invalid syntax (1933637684.py, line 1)

#### `2020 Dataset`

Change the label of the amount column to Amount($)

In [None]:
df_2020.rename(columns={'Amount': 'Amount($)'},inplace=True)

In [None]:
df_2020.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,column10,Funding_Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020


Correct duplicates in the dataset

In [None]:
f"There are {df_2020.duplicated().sum()} duplicate rows in the dataset which must be dropped."

'There are 3 duplicate rows in the dataset which must be dropped.'

In [None]:
df_2020.drop_duplicates(keep='first', inplace=True)

Check unique values in ech column of the dataset

In [None]:
df_2020['HeadQuarter'].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',


There are a few unique values in the headquarter column which are not capturing the correct values which must be corrected.

In [None]:
df_2020['HeadQuarter'] = df_2020['HeadQuarter'].str.split(',').str.get(0)
df_2020['HeadQuarter'].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli', 'Thane', None, 'Singapore',
       'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur', 'Delhi',
       'Frisco', 'California', 'Dhingsara', 'New York', 'Patna',
       'San Francisco', 'San Ramon', 'Paris', 'Plano', 'Sydney',
       'San Francisco Bay Area', 'Bangaldesh', 'London', 'Milano',
       'Palmwoods', 'France', 'Trivandrum', 'Cochin', 'Samastipur',
       'Irvine', 'Tumkur', 'Newcastle Upon Tyne', 'Shanghai', 'Jiaxing',
       'Rajastan', 'Kochi', 'Ludhiana', 'Dehradun', 'San Franciscao',
       'Tangerang', 'Berlin', 'Seattle', 'Riyadh', 'Seoul', 'Bangkok',
       'Kanpur', 'Chandigarh', 'Warangal', 'Hyderebad', 'Odisha', 'Bihar',
       'Goa', 'Tamil Nadu', 'Uttar Pradesh', 'Bhopal', 'Banglore',
       'Coimbatore', 'Bengaluru'], dtype=object)

In [None]:
df_2020['Amount($)']

0         200000.0
1         100000.0
2              NaN
3         400000.0
4         340000.0
           ...    
1050     1500000.0
1051    13200000.0
1052     8000000.0
1053     8043000.0
1054     9000000.0
Name: Amount($), Length: 1052, dtype: float64

* We consider the base value for the amount column to be dollars so any value in the amount column with no sign will be treated as a dollar value. In view of this, we have to rename the amount column Amount($). This is the case because most of our values in the Amount columns for all the dataset are in dollars.
* Since the amount column is already a float dtype, we don't have any data cleaning to do for now except to drop the null values since imputing with mean will temper with the quality of our dataset.
* We assume they are undisclosed values and drop them.

In [None]:
f'A total of {df_2020["Amount($)"].isna().sum()} missing values are in the Amount column which constitutes 24% of the rows hence can be dropped.'

'A total of 253 missing values are in the Amount column which constitutes 24% of the rows hence can be dropped.'

In [None]:
df_2020['Amount($)'].dropna(axis=0, inplace=True)

Check the headquarters column for inconsistencies

In [None]:
df_2020['HeadQuarter'].unique()

array(['Chennai', 'Bangalore', 'Pune', 'New Delhi', 'Indore', 'Hyderabad',
       'Gurgaon', 'Belgaum', 'Noida', 'Mumbai', 'Andheri', 'Jaipur',
       'Ahmedabad', 'Kolkata', 'Tirunelveli, Tamilnadu', 'Thane', None,
       'Singapore', 'Gurugram', 'Gujarat', 'Haryana', 'Kerala', 'Jodhpur',
       'Jaipur, Rajastan', 'Delhi', 'Frisco, Texas, United States',
       'California', 'Dhingsara, Haryana', 'New York, United States',
       'Patna', 'San Francisco, California, United States',
       'San Francisco, United States', 'San Ramon, California',
       'Paris, Ile-de-France, France', 'Plano, Texas, United States',
       'Sydney', 'San Francisco Bay Area, Silicon Valley, West Coast',
       'Bangaldesh', 'London, England, United Kingdom',
       'Sydney, New South Wales, Australia', 'Milano, Lombardia, Italy',
       'Palmwoods, Queensland, Australia', 'France',
       'San Francisco Bay Area, West Coast, Western US',
       'Trivandrum, Kerala, India', 'Cochin', 'Samastipur, Bihar',


#### `2019 Dataset`

In [None]:
df_2019.rename(columns={'Company/Brand': 'Company_Brand',
                        'What it does': 'What_it_does'}, inplace=True)

In [None]:
df_2019.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019


Correct duplicated in the dataset

In [None]:
df_2019['Stage'].unique()

: 

* The missing values in this column will be replaced with 'Unstated' and cleaning of some values as well.

#### `2018 Dataset`

Change the label of the amount, company name, about company and round/series columns to Amount($), Company_Brand, What_it_does, and Stage.

In [None]:
df_2018.rename(columns={'Company Name': 'Company_Brand',
                'Amount': 'Amount($)',
                'About Company': 'What_it_does',
                'Round/Series': 'Stage'}, inplace=True)

In [None]:
df_2018.head(1)

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018


Create new column named HeadQuarter and extract city name from Location

In [None]:
df_2018['HeadQuarter'] = df_2018['Location'].str.split(',').str[0]

: 

In [None]:
df_2018.head(1)

: 

Check for inconsistencies in the Amount($) column and implement the necessary cleaning

In [None]:
df_2018['Amount($)'].unique()

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

In [None]:
# Replace ',', and '$' with '' and also replace '—' with NA

df_2018['Amount($)'] = df_2018['Amount($)'].apply(lambda x: x.replace(',', ''))
df_2018['Amount($)'] = df_2018['Amount($)'].apply(lambda x: x.replace('$', ''))

# Get the index of all the amount values with '₹'
INR_rows = df_2018.index[df_2018['Amount($)'].str.contains('₹')]


In [None]:
# Replace '—' with NA
df_2018['Amount($)'] = df_2018['Amount($)'].replace('—', pd.NA)

In [None]:
# Replace '₹' with ''
df_2018['Amount($)'] = df_2018['Amount($)'].str.replace('₹', '')

In [None]:
# View total missing values in the columns
df_2018.isna().sum()

Company_Brand      0
Industry           0
Stage              0
Amount($)        148
Location           0
What_it_does       0
Funding_Year       0
dtype: int64

* The 148 missing values present in the amount column represent rows which had '—'. We assume they are rows which represent undisclosed amount and drop these missing values.

In [None]:
df_2018.dropna(inplace=True)

In [None]:
# Convert the 'Amount($)' column to a float
df_2018['Amount($)'] = df_2018['Amount($)'].astype(float)

# Multiply the rows with had '₹' in the amount column with the average exchange rate for the year 2018
df_2018.loc[INR_rows, ['Amount($)']] = df_2018.loc[INR_rows, ['Amount($)']].values * 0.0146

In [None]:
df_2018.head()

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018
5,Hasura,"Cloud Infrastructure, PaaS, SaaS",Seed,1600000.0,"Bengaluru, Karnataka, India",Hasura is a platform that allows developers to...,2018


In [None]:
df_2018['Stage'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Private Equity',
       'Venture - Series Unknown', 'Grant', 'Debt Financing',
       'Post-IPO Debt', 'Series H', 'Series C', 'Series E', 'Pre-Seed',
       'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Corporate Round', 'Post-IPO Equity',
       'Secondary Market', 'Non-equity Assistance', 'Funding Round'],
      dtype=object)

In [None]:
# def stage_corrector(data, column='column_name'):
#     for row in data[column]:
#         if re.search('Series A|Series-A|series A', row):
#             return 'Series A'
#         elif re.search('Series B', row):
#             return 'Series B'
#         elif re.search('Series C', row):
#             return 'Series C'
#         elif re.search('.*Ang.*', row):
#             return 'Angel'
#         elif re.search('.*qui.*', row):
#             return 'Equity'
#         elif re.search('.*rant.*', row):
#             return 'Grant'
#         elif re.search('Seed|seed', row):
#             return 'Seed'
#         elif re.search('.*Und.*|.*Unk.*|.*unk.*', row):
#             return 'Unstated'
#         else:
#             return 'Other Stage'
    
     

# df_2018['new'] = stage_corrector(df_2018, 'Stage')
# df_2018.head(50)

In [None]:
def stage_corrector(data, column='Stage'):
    corrected_stages = []
    for row in data[column]:
        if re.search('Series A|Series-A|series A', row):
            corrected_stages.append('Series A')
        elif re.search('Series B', row):
            corrected_stages.append('Series B')
        elif re.search('Series C', row):
            corrected_stages.append('Series C')
        elif re.search('.*Ang.*', row):
            corrected_stages.append('Angel')
        elif re.search('.*qui.*', row):
            corrected_stages.append('Equity')
        elif re.search('.*rant.*', row):
            corrected_stages.append('Grant')
        elif re.search('Seed|seed', row):
            corrected_stages.append('Seed')
        elif re.search('.*Und.*|.*Unk.*|.*unk.*', row):
            corrected_stages.append('Unstated')
        else:
            corrected_stages.append('Other Stage')

    return corrected_stages

df_2018['corrected_stages'] = stage_corrector(df_2018, 'Stage')
df_2018.tail()

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year,new,corrected_stages
520,SlicePay,"FinTech, Internet, Payments, Service Industry",Series A,14900000.0,"Bengaluru, Karnataka, India",SlicePay is an AI based instant credit app for...,2018,Seed,Series A
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000.0,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",2018,Seed,Series C
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500.0,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,2018,Seed,Seed
524,Droni Tech,Information Technology,Seed,511000.0,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,2018,Seed,Seed
525,Netmeds,"Biotechnology, Health Care, Pharmaceutical",Series C,35000000.0,"Chennai, Tamil Nadu, India",Welcome to India's most convenient pharmacy!,2018,Seed,Series C


In [None]:
df_2018['corrected_stages'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Equity', 'Unstated',
       'Grant', 'Other Stage', 'Series C'], dtype=object)