<h1 align="center">Title: Indian Startup Funding Analysis</h1>

## Business Objective:
The goal of this project is to investigate the Indian startup ecosystem and provide insight into its opportunities and challenges, so that stakeholders who plan to enter the ecosystem can make informed decisions based on findings from the 2018 to 2021 funding datasets.

### Import all necessary libraries

In [390]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Database Libraries
import pyodbc

# Other Utilities
from dotenv import dotenv_values
from warnings import filterwarnings
filterwarnings('ignore')

### Create connection for SQL Server

In [391]:
# Load Environment Variables
config = dotenv_values('.env')

Server_name = config.get('Server')
Database_name = config.get('Database')
Username = config.get('Login')
PassWord = config.get('Password')

# Create Database Connection
connection_string = f"DRIVER={{SQL Server}};SERVER={Server_name};DATABASE={Database_name};UID={Username};PWD={PassWord}"

connection = pyodbc.connect(connection_string)
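
`config.get(...)` returns `None` when a key is absent from `.env`, and the f-string would then silently embed the text `None` in the connection string. The sketch below shows one way to fail early instead. The names `REQUIRED_KEYS` and `build_connection_string` are hypothetical and not part of the original notebook; the key names are the ones read above.

```python
# Hypothetical helper (not in the original notebook): validate the settings
# loaded from .env before building the ODBC connection string.
REQUIRED_KEYS = ("Server", "Database", "Login", "Password")


def build_connection_string(config):
    """Return an ODBC connection string, or raise if a setting is missing."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={config['Server']};"
        f"DATABASE={config['Database']};"
        f"UID={config['Login']};"
        f"PWD={config['Password']}"
    )
```

The returned string could then be passed to `pyodbc.connect` exactly as in the cell above.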

### Load tables from SQL and save a copy of the dataset

In [392]:
# Query table as dataframe
query = "SELECT * FROM dbo.LP1_startup_funding2020"
df_2020 = pd.read_sql(query, connection)

# Preview the dataset
df_2020.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [393]:
df_2020.shape

(1055, 10)

In [394]:
# Query table as dataframe
query = "SELECT * FROM dbo.LP1_startup_funding2021"
df_2021 = pd.read_sql(query, connection)

# Preview the dataset
df_2021.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [395]:
df_2021['Sector'].unique()

array(['AI startup', 'EdTech', 'B2B E-commerce', 'FinTech',
       'Home services', 'HealthTech', 'Tech Startup', 'E-commerce',
       'B2B service', 'Helathcare', 'Renewable Energy', 'Electronics',
       'IT startup', 'Food & Beverages', 'Aeorspace', 'Deep Tech',
       'Dating', 'Gaming', 'Robotics', 'Retail', 'Food', 'Oil and Energy',
       'AgriTech', 'Telecommuncation', 'Milk startup', 'AI Chatbot', 'IT',
       'Logistics', 'Hospitality', 'Fashion', 'Marketing',
       'Transportation', 'LegalTech', 'Food delivery', 'Automotive',
       'SaaS startup', 'Fantasy sports', 'Video communication',
       'Social Media', 'Skill development', 'Rental', 'Recruitment',
       'HealthCare', 'Sports', 'Computer Games', 'Consumer Goods',
       'Information Technology', 'Apparel & Fashion',
       'Logistics & Supply Chain', 'Healthtech', 'Healthcare',
       'SportsTech', 'HRTech', 'Wine & Spirits',
       'Mechanical & Industrial Engineering', 'Spiritual',
       'Financial Services', 'I

In [396]:
df_2021.shape

(1209, 9)

In [397]:
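# Save a CSV copy of each year's dataset
# (index=False keeps the row index from being written as an extra unnamed column)
df_2021.to_csv('startup_funding2021.csv', index=False)
df_2020.to_csv('startup_funding2020.csv', index=False)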

### Load the other datasets from CSV files

In [399]:
# Read csv file
df_2018 = pd.read_csv('startup_funding2018.csv')
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [400]:
df_2018.shape

(526, 6)

In [401]:
# Read csv file
df_2019 = pd.read_csv('startup_funding2019.csv')
df_2019.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [402]:
df_2019.shape

(89, 9)

<h3>Hypothesis</h3>
<p>Null: Tech startups do not receive funding from investors.</p>
<p>Alternative: Tech startups receive funding from investors.</p>

<h3>Questions</h3>

1. What is the total amount of funding received from investors from 2018 to 2021?
2. How many startups emerged from 2018 to 2021?
3. What is the level of funding for startups based on their sector?
4. How dispersed are the startup firms over the locations?
5. What is the trend of funding based on the years of funding?
6. Is there a relationship between stage and funding received?

### Exploratory Data Analysis (EDA)
* 2021 Dataset

In [403]:
print('The rows and columns in the 2021 dataset are', df_2021.shape, 'respectively.')

The rows and columns in the 2021 dataset are (1209, 9) respectively.


In [404]:
# Check structure of the dataset
df_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


We see the datatype for each column in the dataset and the number of non-null values in each column.

<h3>Check for missing values</h3>

In [405]:
# Check for total missing values in the columns
df_2021.isna().sum()

Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64

* The output shows the number of missing values in each column.
* The missing values in the HeadQuarter, Founders, Investor and Stage columns must be replaced with 'Unstated'.
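
The replacement described above can be sketched as follows. This is a minimal illustration on a made-up frame standing in for `df_2021`; the `sample` and `text_columns` names are assumptions, and only the column names and the 'Unstated' label come from the notebook.

```python
import pandas as pd

# Toy frame standing in for df_2021; the values are invented for illustration.
sample = pd.DataFrame({
    "HeadQuarter": ["Mumbai", None],
    "Founders": [None, "A. Founder"],
    "Investor": ["Some Fund", None],
    "Stage": [None, "Seed"],
}, dtype=object)

# Fill only the text columns named in the notes; numeric columns are left alone.
text_columns = ["HeadQuarter", "Founders", "Investor", "Stage"]
sample[text_columns] = sample[text_columns].fillna("Unstated")
```

Restricting `fillna` to an explicit column list avoids writing a string label into numeric columns such as Founded.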

<h3>Check for duplicate values</h3>

In [406]:
print(f'There are {df_2021.duplicated().sum()} duplicate rows in the 2021 dataset')

There are 19 duplicate rows in the 2021 dataset
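
Note that `duplicated()` flags fully identical rows (every repeat after the first occurrence), not repeated values in a single column. A possible follow-up, shown here on a made-up frame, is to drop those repeats; whether and when the notebook does this is not stated in this section, so treat it as an assumption.

```python
import pandas as pd

# Toy frame with one fully repeated row; values are invented for illustration.
sample = pd.DataFrame({
    "Company_Brand": ["A", "A", "B"],
    "Stage": ["Seed", "Seed", "Seed"],
})

# Counts repeats after the first occurrence of each identical row.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```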


<h3>Inspect Individual Columns</h3>

In [407]:
df_2021['Amount'].unique()

array(['$1,200,000', '$120,000,000', '$30,000,000', '$51,000,000',
       '$2,000,000', '$188,000,000', '$200,000', 'Undisclosed',
       '$1,000,000', '$3,000,000', '$100,000', '$700,000', '$9,000,000',
       '$40,000,000', '$49,000,000', '$400,000', '$300,000',
       '$25,000,000', '$160,000,000', '$150,000', '$1,800,000',
       '$5,000,000', '$850,000', '$53,000,000', '$500,000', '$1,100,000',
       '$6,000,000', '$800,000', '$10,000,000', '$21,000,000',
       '$7,500,000', '$26,000,000', '$7,400,000', '$1,500,000',
       '$600,000', '$800,000,000', '$17,000,000', '$3,500,000',
       '$15,000,000', '$215,000,000', '$2,500,000', '$350,000,000',
       '$5,500,000', '$83,000,000', '$110,000,000', '$500,000,000',
       '$65,000,000', '$150,000,000,000', '$300,000,000', '$2,200,000',
       '$35,000,000', '$140,000,000', '$4,000,000', '$13,000,000', None,
       '$Undisclosed', '$2000000', '$800000', '$6000000', '$2500000',
       '$9500000', '$13000000', '$5000000', '$8000000',

* The preview of the amount column shows currency symbols and commas, which is why the column has the object dtype.
* We take dollars as the base currency, so any value in the amount column without a currency symbol is treated as a dollar value. For this reason, the amount column will be renamed 'Amount($)'.
* Some rows in the amount column hold values that belong in other columns; these rows must be corrected.
* The undisclosed values should be replaced with NaN rather than zero, because zero would mean the amount raised was 0 dollars.
* After these inconsistencies are corrected, the column must be converted to a float dtype.
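
The cleaning steps listed above can be sketched on a small made-up series. This is one possible approach under stated assumptions, not necessarily the one the notebook applies later: `pd.to_numeric(..., errors="coerce")` turns anything that is not a number into NaN, so misplaced values should be inspected and moved to their proper columns before coercion, otherwise that information is lost.

```python
import pandas as pd

# Made-up values mimicking the patterns seen in the 2021 amount column;
# "Seed" stands in for a value that belongs in another column.
raw = pd.Series(
    ["$1,200,000", "Undisclosed", "$Undisclosed", "$2000000", None, "Seed"],
    dtype=object,
)

# Strip the dollar sign and thousands separators as literal characters.
cleaned = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
)

# Non-numeric leftovers ("Undisclosed", misplaced text, missing) become NaN.
amount = pd.to_numeric(cleaned, errors="coerce")
```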

* 2020 Dataset

In [408]:
print('The rows and columns in the 2020 dataset are', df_2020.shape, 'respectively.')

The rows and columns in the 2020 dataset are (1055, 10) respectively.


In [409]:
# Check structure of the dataset
df_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


We see the datatype for each column in the dataset and the number of non-null values in each column.

<h3>Check for missing values</h3>

In [410]:
# Check for total missing values in the columns
df_2020.isna().sum()

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

* The output shows the number of missing values in each column.
* column10 is missing in 1,053 of the 1,055 rows, so it carries almost no information and must be dropped.
* The missing values in the Sector, Founders, Investor and HeadQuarter columns must be replaced with 'Unstated'.
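
One way to identify columns like column10 programmatically is to compute the share of missing values per column. The sketch below uses a made-up frame, and the 0.5 cut-off is an arbitrary illustration, not a threshold taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for df_2020; values are invented for illustration.
sample = pd.DataFrame({
    "Company_Brand": ["A", "B", "C", "D"],
    "Sector": ["AgriTech", None, "EdTech", "FinTech"],
    "column10": [None, None, None, "x"],
}, dtype=object)

# Fraction of missing values in each column.
missing_share = sample.isna().mean()

# Assumed cut-off: drop columns that are missing in more than half of the rows.
sparse_columns = missing_share[missing_share > 0.5].index.tolist()
trimmed = sample.drop(columns=sparse_columns)
```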

<h3>Check for duplicate values</h3>

In [411]:
print(f'There are {df_2020.duplicated().sum()} duplicate rows in the 2020 dataset')

There are 3 duplicate rows in the 2020 dataset


<h3>Inspect Individual Columns</h3>

In [412]:
df_2020['Amount']

0         200000.0
1         100000.0
2              NaN
3         400000.0
4         340000.0
           ...    
1050     1500000.0
1051    13200000.0
1052     8000000.0
1053     8043000.0
1054     9000000.0
Name: Amount, Length: 1055, dtype: float64

* We consider the base value for the amount column to be dollars so any value in the amount column with no sign will be treated as a dollar value. In view of this, we have to rename the amount column 'Amount($)'.

In [413]:
df_2020['Stage'].unique()

array([None, 'Pre-seed', 'Seed', 'Pre-series A', 'Pre-series', 'Series C',
       'Series A', 'Series B', 'Debt', 'Pre-series C', 'Pre-series B',
       'Series E', 'Bridge', 'Series D', 'Series B2', 'Series F',
       'Pre- series A', 'Edge', 'Series H', 'Pre-Series B', 'Seed A',
       'Series A-1', 'Seed Funding', 'Pre-Seed', 'Seed round',
       'Pre-seed Round', 'Seed Round & Series A', 'Pre Series A',
       'Pre seed Round', 'Angel Round', 'Pre series A1', 'Series E2',
       'Pre series A', 'Seed Round', 'Bridge Round', 'Pre seed round',
       'Pre series B', 'Pre series C', 'Seed Investment', 'Series D1',
       'Mid series', 'Series C, D', 'Seed funding'], dtype=object)

* The stage column contains inconsistencies, such as variant spellings of the same label (for example 'Pre-series A', 'Pre series A' and 'Pre- series A') and incorrect input values, which must be corrected.
* The missing values in this column will be replaced with 'Unstated'.
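
Many of the variants above differ only in hyphens, spacing, capitalisation, or a trailing word such as "round" or "funding". A rough sketch of a normaliser for those cases follows. The function name `normalize_stage` and its rules are assumptions for illustration; labels such as 'Seed Round & Series A' or 'Series C, D' still need a manual decision, and hyphenated suffixes such as 'Series A-1' lose their hyphen under these rules.

```python
import re

import pandas as pd


def normalize_stage(label):
    """Collapse spelling variants of a funding-stage label (illustrative rules)."""
    if pd.isna(label):
        return "Unstated"
    # Treat hyphens and runs of whitespace as a single space, then lowercase.
    text = re.sub(r"[-\s]+", " ", str(label)).strip().lower()
    # Drop a trailing filler word so 'Seed round' and 'Seed funding' both become 'seed'.
    text = re.sub(r"\s+(round|funding|investment)$", "", text)
    return text.title()


# Made-up sample using labels that appear in the 2020 output above.
stages = pd.Series(["Pre- series A", "Pre-series A", "Seed Funding", None], dtype=object)
normalized = stages.map(normalize_stage)
```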

* 2019 Dataset

In [414]:
print('The rows and columns in the 2019 dataset are', df_2019.shape, 'respectively.')

The rows and columns in the 2019 dataset are (89, 9) respectively.


In [415]:
# Check structure of the dataset
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


* The output shows the data type of each column and the number of non-null values in each column.
* The 'Company/Brand' and 'What it does' columns must be renamed to 'Company_Brand' and 'What_it_does', the names used in the 2020 and 2021 datasets.

<h3>Check for missing values</h3>

In [416]:
# Check for total missing values in the columns
df_2019.isna().sum()

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

* The missing values in the Sector, Founders and HeadQuarter columns must be replaced with 'Unstated'. The Investor column has no missing values in this dataset.

<h3>Check for duplicate values</h3>

In [417]:
print(f'There are {df_2019.duplicated().sum()} duplicate rows in the 2019 dataset')

There are 0 duplicate rows in the 2019 dataset


<h3>Inspect Individual Columns</h3>

In [418]:
df_2019['Amount($)'].unique()

array(['$6,300,000', '$150,000,000', '$28,000,000', '$30,000,000',
       '$6,000,000', 'Undisclosed', '$1,000,000', '$20,000,000',
       '$275,000,000', '$22,000,000', '$5,000,000', '$140,500',
       '$540,000,000', '$15,000,000', '$182,700', '$12,000,000',
       '$11,000,000', '$15,500,000', '$1,500,000', '$5,500,000',
       '$2,500,000', '$140,000', '$230,000,000', '$49,400,000',
       '$32,000,000', '$26,000,000', '$150,000', '$400,000', '$2,000,000',
       '$100,000,000', '$8,000,000', '$100,000', '$50,000,000',
       '$120,000,000', '$4,000,000', '$6,800,000', '$36,000,000',
       '$5,700,000', '$25,000,000', '$600,000', '$70,000,000',
       '$60,000,000', '$220,000', '$2,800,000', '$2,100,000',
       '$7,000,000', '$311,000,000', '$4,800,000', '$693,000,000',
       '$33,000,000'], dtype=object)

* The amount column contains 'Undisclosed' values, which must be replaced with NaN.
* The dollar signs and commas must be removed from the column, which must then be converted to a float dtype.

In [419]:
df_2019['Stage'].unique()

array([nan, 'Series C', 'Fresh funding', 'Series D', 'Pre series A',
       'Series A', 'Series G', 'Series B', 'Post series A',
       'Seed funding', 'Seed fund', 'Series E', 'Series F', 'Series B+',
       'Seed round', 'Pre-series A'], dtype=object)

* The stage column contains variant spellings of the same label (for example 'Seed funding', 'Seed fund' and 'Seed round'), which must be corrected.
* The missing values in this column will be replaced with 'Unstated'.

* 2018 Dataset

In [420]:
df_2018.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


* 'Company Name' and 'About Company' must be renamed to 'Company_Brand' and 'What_it_does'.

In [421]:
print('The rows and columns in the 2018 dataset are', df_2018.shape, 'respectively.')

The rows and columns in the 2018 dataset are (526, 6) respectively.


In [422]:
# Check structure of the dataset
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


We see the datatype for each column in the dataset and the number of non-null values in each column.

<h3>Check for missing values</h3>

In [423]:
# Check for total missing values in the columns
df_2018.isna().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64

There are no missing values in the 2018 Dataset

<h3>Check for duplicate values</h3>

In [424]:
print(f'There is {df_2018.duplicated().sum()} duplicate row in the 2018 dataset')

There is 1 duplicate row in the 2018 dataset


<h3>Inspect Individual Columns</h3>

In [425]:
df_2018['Amount'].unique()

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

* The preview of the amount column shows currency symbols and commas, which is why the column has the object dtype; this must be corrected.
* We take dollars as the base currency, so any value in the amount column without a currency symbol is treated as a dollar value. For this reason, the amount column will be renamed 'Amount($)'.
* Values quoted in another currency will be converted to dollars using the conversion rate for the year in which the funding was raised.
* After these inconsistencies are corrected, the column must be converted to a float dtype.

In [426]:
df_2018['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Secondary Market', 'Post-IPO Equity',
       'Non-equity Assistance', 'Funding Round'], dtype=object)

* The column must be renamed to Stage.
* The inconsistent entries in the column, such as the URL that appears as a stage, must be corrected to match the labels used in the later datasets.
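
Entries that are clearly not stage labels, such as the URL above, can be flagged with a pattern match. In the sketch below, mapping both the URL and 'Undisclosed' to 'Unstated' is an assumption that mirrors how missing stages are labelled for the other years; the sample values and the example URL are made up.

```python
import pandas as pd

# Made-up sample; the URL is a placeholder, not the one in the dataset.
stages = pd.Series(
    ["Seed", "https://example.com/sheet", "Undisclosed", "Series A"],
    dtype=object,
)

# Flag entries that start with an http(s) scheme.
looks_like_url = stages.str.contains(r"^https?://", regex=True)

# Assumption: treat URLs and 'Undisclosed' the same way as missing stages.
cleaned = stages.mask(looks_like_url | stages.eq("Undisclosed"), "Unstated")
```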

In [427]:
df_2018['Industry'].unique()

array(['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
       'Agriculture, Farming',
       'Credit, Financial Services, Lending, Marketplace',
       'Financial Services, FinTech',
       'E-Commerce Platforms, Retail, SaaS',
       'Cloud Infrastructure, PaaS, SaaS',
       'Internet, Leisure, Marketplace', 'Market Research',
       'Information Services, Information Technology', 'Mobile Payments',
       'B2B, Shoes', 'Internet',
       'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
       'Food Delivery', 'Industrial Automation',
       'Automotive, Search Engine, Service Industry',
       'Finance, Internet, Travel',
       'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
       'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
       'Internet of Things, Waste Management',
       'Air Transportation, Freight Service, Logistics, Marine Transport

* The comma-separated values in the Industry column must be split and a Sector selected from them.
* The resulting Sector column must be created and added to the dataset.
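
A minimal sketch of the split is shown below on values taken from the output above. Taking the first listed industry as the Sector is an assumption made for illustration; the notes do not say which element should be selected.

```python
import pandas as pd

# Sample values copied from the Industry output above.
industry = pd.Series(
    ["Agriculture, Farming", "Financial Services, FinTech", "Internet"],
    dtype=object,
)

# Assumption: use the first comma-separated entry as the Sector.
sector = industry.str.split(",").str[0].str.strip()
```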

In [428]:
df_2018['Location']

0           Bangalore, Karnataka, India
1            Mumbai, Maharashtra, India
2               Gurgaon, Haryana, India
3           Noida, Uttar Pradesh, India
4      Hyderabad, Andhra Pradesh, India
                     ...               
521         Bangalore, Karnataka, India
522             Haryana, Haryana, India
523          Mumbai, Maharashtra, India
524          Mumbai, Maharashtra, India
525          Chennai, Tamil Nadu, India
Name: Location, Length: 526, dtype: object

* The column must be renamed to HeadQuarter, with each value taken from the first comma-separated element of the string (the city).

<h2>Data Cleaning</h2>

Rename columns in the 2018, 2019, 2020 and 2021 datasets

In [429]:
df_2018.rename(columns={'Company Name': 'Company_Brand',
                'Amount': 'Amount($)',
                'About Company': 'What_it_does',
                'Round/Series': 'Stage'}, inplace=True)

In [430]:
df_2018.head(1)

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."


In [431]:
df_2019.rename(columns={'Company/Brand': 'Company_Brand',
                        'What it does': 'What_it_does'}, inplace=True)

In [432]:
df_2019.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",


In [433]:
df_2020.rename(columns={'Amount': 'Amount($)'},inplace=True)

In [434]:
df_2020.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,


In [435]:
df_2021.rename(columns={'Amount': 'Amount($)'},inplace=True)

In [436]:
df_2021.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A


Create a new column for the funding year
* This makes it possible to track the funding raised in each year

In [437]:
df_2021['Funding_Year'] = 2021
df_2020['Funding_Year'] = 2020
df_2019['Funding_Year'] = 2019
df_2018['Funding_Year'] = 2018

In [438]:
df_2021.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount($),Stage,Funding_Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021


Create a new column named HeadQuarter and extract the city name from the Location column of the 2018 dataset

In [439]:
df_2018['HeadQuarter'] = df_2018['Location'].str.split(',').str[0]

In [440]:
df_2018.head(1)

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year,HeadQuarter
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,Bangalore


Clean inconsistencies in the amount column for the 2018 dataset
* According to [Exchange Rates UK](https://www.exchangerates.org.uk/INR-USD-spot-exchange-rates-history-2018.html), the average rate for 1 INR to USD in 2018 was `0.0146` and that is the rate we are using to convert the INR amount to dollars.
* Replace '—' with NA, and also '$', ',', and '₹' with an empty string ''.
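
As a worked check of the conversion rule, the sketch below converts single raw strings rather than the whole column. The helper name `to_usd` and the constant name `INR_TO_USD_2018` are hypothetical; the rate and the handling of '—', '$', ',' and '₹' follow the description above.

```python
INR_TO_USD_2018 = 0.0146  # average 2018 INR-to-USD rate quoted above


def to_usd(raw, inr_rate=INR_TO_USD_2018):
    """Convert one raw 2018 amount string to dollars, or None if undisclosed."""
    text = raw.strip()
    if text == "—":
        return None
    # Remember the currency before the symbol is stripped.
    is_inr = text.startswith("₹")
    number = float(text.lstrip("₹$").replace(",", ""))
    return number * inr_rate if is_inr else number
```

For example, `to_usd("₹40,000,000")` evaluates 40,000,000 × 0.0146, which is about 584,000 dollars and agrees with the Happy Cow Dairy amount shown after cleaning. Values without a symbol, such as `"250000"`, pass through unchanged as dollars.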

In [446]:
# Remove ',' and '$' from the amount strings

df_2018['Amount($)'] = df_2018['Amount($)'].apply(lambda x: x.replace(',', ''))
df_2018['Amount($)'] = df_2018['Amount($)'].apply(lambda x: x.replace('$', ''))

# Record the index of the rows quoted in rupees before '₹' is stripped
INR_rows = df_2018.index[df_2018['Amount($)'].str.contains('₹')]


In [447]:
# Replace '—' with NA
df_2018['Amount($)'] = df_2018['Amount($)'].replace('—', pd.NA)

In [448]:
# Replace '₹' with ''
df_2018['Amount($)'] = df_2018['Amount($)'].str.replace('₹', '')

In [449]:
# View total missing values in the columns
df_2018.isna().sum()

Company_Brand      0
Industry           0
Stage              0
Amount($)        148
Location           0
What_it_does       0
Funding_Year       0
HeadQuarter        0
dtype: int64

* The 148 missing values in the amount column correspond to rows that contained '—'. We assume these rows represent undisclosed amounts and drop them.

In [450]:
df_2018.dropna(inplace=True)

In [451]:
# Convert the 'Amount($)' column to a float
df_2018['Amount($)'] = df_2018['Amount($)'].astype(float)

In [452]:
# Multiply the rows that had '₹' in the amount column by the average exchange rate for 2018
df_2018.loc[INR_rows, ['Amount($)']] = df_2018.loc[INR_rows, ['Amount($)']].values * 0.0146

In [453]:
df_2018.head()

Unnamed: 0,Company_Brand,Industry,Stage,Amount($),Location,What_it_does,Funding_Year,HeadQuarter
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018,Bangalore
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018,Mumbai
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018,Gurgaon
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018,Noida
5,Hasura,"Cloud Infrastructure, PaaS, SaaS",Seed,1600000.0,"Bengaluru, Karnataka, India",Hasura is a platform that allows developers to...,2018,Bengaluru


Clean inconsistencies in the stage column for the 2018 dataset
* According to [Indeed Career Guide](https://www.indeed.com/career-advice/career-development/startup-funding-stages), startup funding progresses through the following stages:
1. Pre-seed funding
2. Seed funding stage
3. Series A funding
4. Series B funding
5. Series C funding
6. IPO

In [456]:
df_2018['Stage'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Private Equity',
       'Venture - Series Unknown', 'Grant', 'Debt Financing',
       'Post-IPO Debt', 'Series H', 'Series C', 'Series E', 'Pre-Seed',
       'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Corporate Round', 'Post-IPO Equity',
       'Secondary Market', 'Non-equity Assistance', 'Funding Round'],
      dtype=object)