# Indian Start-Up Funding Ecosystem

## Business Understanding
Objective: The primary objective is to prepare for entry into the Indian start-up ecosystem by understanding its landscape: the start-ups themselves, their funding patterns, and the key players, both start-ups and investors.

Problem Definition: Analyze the funding received by Indian start-ups from 2018 to 2021. This means identifying funding trends, the sectors or industries that attract the most investment, and the major investors in the ecosystem.

Data Understanding: The data covers details of the start-ups, the funding amounts they received, and information about the investors. It is supplied separately for each year from 2018 to 2021 through three sources: a CSV file hosted on GitHub (2018), a local CSV file (2019), and a SQL Server database (2020 and 2021).

Plan: Clean and preprocess the data, conduct exploratory data analysis to understand trends and patterns, and possibly build predictive models to forecast future trends in the start-up ecosystem.

Success Criteria: The project succeeds if the team gains insights that support informed decisions about entering the Indian start-up ecosystem, such as identifying promising sectors, understanding the competitive landscape, and recognizing potential investment opportunities.

## Hypothesis Test
Null Hypothesis (H0): The location of a start-up in India does not affect the amount of funding it receives.
Alternative Hypothesis (H1): The location of a start-up in India does affect the amount of funding it receives.

Null Hypothesis (H0): There is no difference in the amount of funding received by start-ups across different sectors.
Alternative Hypothesis (H1): There is a difference in the amount of funding received by start-ups across different sectors.

Null Hypothesis (H0): The size of a start-up (in terms of employees or customers) does not affect the amount of funding it receives.
Alternative Hypothesis (H1): The size of a start-up (in terms of employees or customers) does affect the amount of funding it receives.

Null Hypothesis (H0): All investors contribute equally to the funding of start-ups.
Alternative Hypothesis (H1): Some investors contribute more to the funding of start-ups than others.

Null Hypothesis (H0): The amount of funding received by Indian start-ups has not changed from 2018 to 2021.
Alternative Hypothesis (H1): The amount of funding received by Indian start-ups has increased or decreased from 2018 to 2021.
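
One way a hypothesis of this kind could be checked is a permutation test. The sketch below is an illustration only, not the method this notebook commits to: it compares median funding between two locations, and the function name `permutation_test`, the resample count, and the sample amounts are all assumptions made up for the example. The median is used because funding amounts are heavily skewed.

```python
import random
from statistics import median


def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the absolute difference in medians."""
    rng = random.Random(seed)
    observed = abs(median(group_a) - median(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign the location labels at random and recompute the statistic
        rng.shuffle(pooled)
        diff = abs(median(pooled[:n_a]) - median(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (extreme + 1) / (n_resamples + 1)


# Illustrative funding amounts in USD (not taken from the cleaned dataset)
location_a = [250_000, 1_600_000, 5_000_000, 13_400_000, 225_000_000]
location_b = [7_500, 511_000, 584_000, 2_000_000, 35_000_000]

p_value = permutation_test(location_a, location_b)
print(f"p-value: {p_value:.3f}")
```

A small p-value would count as evidence against H0 for that pair of locations. With more than two groups, a rank-based test across all groups may be a better fit.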
## Relevant Questions
1 - How has the funding trend for Indian start-ups changed from 2018 to 2021? Are there any noticeable patterns or trends?
2 - Which sectors or industries received the most funding? Are there any sectors that are emerging as new favorites for investors?
3 - Who are the major investors in the Indian start-up ecosystem? Are there any investors who are particularly active or influential?
4 - Are there any specific cities or regions in India that are attracting more start-ups or funding?
5 - Is there a correlation between the size of the start-up (in terms of employees or customers) and the amount of funding received?
6 - Which start-ups have shown the most growth in terms of funding received over the years?
7 - Is there a correlation between the amount of funding received and the success of the start-up?
8 - Can we identify common investment strategies among the major investors?
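
As a rough sketch of how question 1 might be approached once the yearly datasets are cleaned and combined, the snippet below summarizes funding per year. The frame `funding` and its values are placeholders invented for illustration; only the column names `Year` and `Amount` follow the notebook.

```python
import pandas as pd

# Placeholder data standing in for the combined, cleaned dataset
funding = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2020],
    'Amount': [1.0e6, 3.0e6, 5.0e6, None],  # None marks an undisclosed amount
})

# count ignores missing amounts, so it reports deals with a known value
summary = funding.groupby('Year')['Amount'].agg(['count', 'sum', 'median'])
print(summary)
```

Reporting the median next to the sum helps because a single very large round can dominate a yearly total.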

In [2]:
# Import requisite libraries
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyodbc
from dotenv import dotenv_values

# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')

## Data Importation

### Importing and Cleaning 2018 Dataset

In [3]:
# Access the 2018 dataset
url = 'https://raw.githubusercontent.com/Azubi-Africa/Career_Accelerator_LP1-Data_Analysis/main/startup_funding2018.csv'

dat1 = pd.read_csv(url)
dat1

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...
...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif..."
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...
524,Droni Tech,Information Technology,Seed,"₹35,000,000","Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...


In [4]:
# adding years
year_1 = 2018
dat1['Year'] = year_1


In [5]:
dat1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
 6   Year           526 non-null    int64 
dtypes: int64(1), object(6)
memory usage: 28.9+ KB


In [6]:
dat1["Amount"].unique()

array(['250000', '₹40,000,000', '₹65,000,000', '2000000', '—', '1600000',
       '₹16,000,000', '₹50,000,000', '₹100,000,000', '150000', '1100000',
       '₹500,000', '6000000', '650000', '₹35,000,000', '₹64,000,000',
       '₹20,000,000', '1000000', '5000000', '4000000', '₹30,000,000',
       '2800000', '1700000', '1300000', '₹5,000,000', '₹12,500,000',
       '₹15,000,000', '500000', '₹104,000,000', '₹45,000,000', '13400000',
       '₹25,000,000', '₹26,400,000', '₹8,000,000', '₹60,000', '9000000',
       '100000', '20000', '120000', '₹34,000,000', '₹342,000,000',
       '$143,145', '₹600,000,000', '$742,000,000', '₹1,000,000,000',
       '₹2,000,000,000', '$3,980,000', '$10,000', '₹100,000',
       '₹250,000,000', '$1,000,000,000', '$7,000,000', '$35,000,000',
       '₹550,000,000', '$28,500,000', '$2,000,000', '₹240,000,000',
       '₹120,000,000', '$2,400,000', '$30,000,000', '₹2,500,000,000',
       '$23,000,000', '$150,000', '$11,000,000', '₹44,000,000',
       '$3,240,000', '₹60

In [9]:
dat1[dat1["Amount"].str.startswith("$")]

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Year
86,WHR,"Health Care, Information Technology",Seed,$143145,"Pune, Maharashtra, India",WHR is to make affordable healthcare a reality...,2018
90,SBI Life,Insurance,Private Equity,$742000000,"Mumbai, Maharashtra, India",SBI Life is one of the life insurance company ...,2018
93,NoPaperForms Solutions Pvt. Ltd.,"EdTech, Education, Information Services, SaaS",Series B,$3980000,"New Delhi, Delhi, India","NoPaperForms is a marketing automation, lead n...",2018
95,AuthMetrik,"B2B, Biometrics, Cyber Security, Fraud Detecti...",Grant,$10000,"Gurgaon, Haryana, India","SaaS, B2B, Security, Stop account sharing, Fra...",2018
101,Swiggy,"Food Delivery, Food Processing, Internet",Series H,$1000000000,"Bangalore, Karnataka, India",Swiggy is a food ordering and delivery company...,2018
102,Milkbasket,"E-Commerce, Food and Beverage, Internet",Series A,$7000000,"Haryana, Haryana, India","Milkbasket delivers milk, bread, eggs, butter,...",2018
104,Toppr,"EdTech, Education, Knowledge Management",Series C,$35000000,"Mumbai, Maharashtra, India",Toppr.com is an online preparation platform fo...,2018
106,Vivriti Capital,Financial Services,Venture - Series Unknown,$28500000,"Chennai, Tamil Nadu, India",Vivriti Capital is an online platform for inst...,2018
108,Impact Guru,"Creative Agency, Crowdfunding, EdTech, Health ...",Series A,$2000000,"Mumbai, Maharashtra, India",We're a Harvard incubated crowdfunding platfor...,2018
114,OneAssist,"Financial Services, SaaS, Security",Debt Financing,$2400000,"Mumbai, Maharashtra, India",OneAssist is a protection & assistance service...,2018


In [10]:

# Remove thousands separators so the amounts can be parsed as numbers
dat1['Amount'] = dat1['Amount'].str.replace(',', '', regex=False)
dat1

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,₹40000000,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,₹65000000,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018
...,...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",2018
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,2018
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,2018
524,Droni Tech,Information Technology,Seed,₹35000000,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,2018


In [11]:
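# Convert rupee amounts to US dollars.
# 0.0146 is the USD-per-INR rate this notebook applies to the 2018 data.
def convert_to_dollars(Amount):
    if Amount.startswith('₹'):
        return float(Amount[1:]) * 0.0146
    else:
        # '$'-prefixed, unmarked, and '—' values pass through unchanged as strings,
        # so the column still holds mixed types after this step
        return Amount

# Apply the conversion function to the 'Amount' column
dat1['Amount'] = dat1['Amount'].apply(convert_to_dollars)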
dat1

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f...",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,"Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,"Gurgaon, Haryana, India",Leading Online Loans Marketplace in India,2018
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...,2018
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...,2018
...,...,...,...,...,...,...,...
521,Udaan,"B2B, Business Development, Internet, Marketplace",Series C,225000000,"Bangalore, Karnataka, India","Udaan is a B2B trade platform, designed specif...",2018
522,Happyeasygo Group,"Tourism, Travel",Series A,—,"Haryana, Haryana, India",HappyEasyGo is an online travel domain.,2018
523,Mombay,"Food and Beverage, Food Delivery, Internet",Seed,7500,"Mumbai, Maharashtra, India",Mombay is a unique opportunity for housewives ...,2018
524,Droni Tech,Information Technology,Seed,511000.0,"Mumbai, Maharashtra, India",Droni Tech manufacture UAVs and develop softwa...,2018
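
The output above still mixes floats, plain digit strings, `$`-prefixed strings, and the `—` placeholder. A possible follow-up, sketched here under assumptions, parses every value to a float in one pass. The helper name `amount_to_usd` is hypothetical, the rate is the same 0.0146 used above, and treating unmarked values as US dollars is an assumption that the source data does not confirm.

```python
import numpy as np
import pandas as pd

INR_TO_USD_2018 = 0.0146  # same rate as the cell above


def amount_to_usd(value, inr_rate=INR_TO_USD_2018):
    """Return the amount as a float in USD, or NaN when it cannot be parsed."""
    if not isinstance(value, str):
        return float(value) if pd.notna(value) else np.nan
    text = value.strip().replace(',', '')
    rate = 1.0
    if text.startswith('₹'):
        rate, text = inr_rate, text[1:]
    elif text.startswith('$'):
        text = text[1:]
    try:
        # Unmarked values are assumed to already be in USD
        return float(text) * rate
    except ValueError:
        # Covers the '—' placeholder and any other non-numeric text
        return np.nan


sample = pd.Series(['250000', '₹40,000,000', '$143,145', '—'])
cleaned = sample.apply(amount_to_usd)
print(cleaned)
```

Applied to the real column, this would look like `dat1['Amount'] = dat1['Amount'].apply(amount_to_usd)`, leaving a float column with `NaN` for undisclosed amounts.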


### Importing and Cleaning 2019 Dataset

In [12]:
# Access the 2019 dataset

dat2 = pd.read_csv('dataset/startup_funding2019.csv')
dat2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",
...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",


In [13]:
# adding years
year_2 = 2019
dat2['Year'] = year_2


In [14]:
# Rename the amount column in the 2019 dataset

dat2 = dat2.rename(columns={'Amount($)': 'Amount'})
dat2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019
...,...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A,2019
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",,2019
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B,2019
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",,2019


In [15]:
# Redefine the rupee-to-dollar conversion with the rate applied to the 2019 data (0.0142).
# The rows displayed for 2019 contain only '$' amounts, which pass through unchanged.
def convert_to_dollars(Amount):
    if Amount.startswith('₹'):
        return float(Amount[1:]) * 0.0142
    else:
        return Amount

# Apply the conversion function to the 'Amount' column
dat2['Amount'] = dat2['Amount'].apply(convert_to_dollars)
dat2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",,2019
...,...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A,2019
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",,2019
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B,2019
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",,2019


In [16]:
# Remove the dollar sign; regex=False treats '$' literally rather than as an end-of-string anchor
dat2['Amount'] = dat2['Amount'].str.replace('$', '', regex=False)
dat2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000,,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000,Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000,Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...",30000000,Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000,,2019
...,...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...",20000000,Series A,2019
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...",693000000,,2019
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,5000000,Series B,2019
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...",50000000,,2019


In [17]:
# Remove thousands separators
dat2['Amount'] = dat2['Amount'].str.replace(',', '', regex=False)

dat2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount         89 non-null     object 
 8   Stage          43 non-null     object 
 9   Year           89 non-null     int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 7.1+ KB


In [20]:
dat2['Amount'].unique()

array(['6300000', '150000000', '28000000', '30000000', '6000000',
       'Undisclosed', '1000000', '20000000', '275000000', '22000000',
       '5000000', '140500', '540000000', '15000000', '182700', '12000000',
       '11000000', '15500000', '1500000', '5500000', '2500000', '140000',
       '230000000', '49400000', '32000000', '26000000', '150000',
       '400000', '2000000', '100000000', '8000000', '100000', '50000000',
       '120000000', '4000000', '6800000', '36000000', '5700000',
       '25000000', '600000', '70000000', '60000000', '220000', '2800000',
       '2100000', '7000000', '311000000', '4800000', '693000000',
       '33000000'], dtype=object)

In [19]:
dat2_undisclosed = dat2[dat2['Amount']=='Undisclosed']
dat2_undisclosed

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
5,FlytBase,,Pune,Technology,A drone automation platform,Nitin Gupta,Undisclosed,Undisclosed,,2019
6,Finly,,Bangalore,SaaS,It builds software products that makes work si...,"Vivek AG, Veekshith C Rai","Social Capital, AngelList India, Gemba Capital...",Undisclosed,,2019
10,Cub McPaws,2010.0,Mumbai,E-commerce & AR,A B2C brand that focusses on premium and comf...,"Abhay Bhat, Kinnar Shah",Venture Catalysts,Undisclosed,,2019
14,Open Secret,,,Food tech,It produces and sells top quality snacks,"Ahana Gautam, Udit Kejriwal",Matrix Partners,Undisclosed,,2019
19,Azah Personal Care Pvt. Ltd.,2018.0,Gurugram,Health,Aims to solve some problems in the feminine hy...,"Mohammed, Shashwat Diesh","Kunal Bahl, Rohit Bansal.",Undisclosed,Pre series A,2019
23,DROR Labs Pvt. Ltd,2018.0,Delhi,Safety tech,It uses technology to create a trust-based net...,"Dhiraj Naubhar, Dheeraj Bansal",Inflection Point Ventures,Undisclosed,,2019
32,Pumpkart,2014.0,Chandigarh,E-marketplace,B2B model for appliances and electrical products,KS Bhatia,Dinesh Dua,Undisclosed,,2019
45,Afinoz,,Noida,Fintech,Online financial marketplace for customized ra...,Rachna Suneja,Fintech innovation lab,Undisclosed,,2019
54,Ninjacart,2015.0,,B2B Supply Chain,It connects producers of food directly to reta...,"Thirukumaran Nagarajanin, Vasudevan Chinnathambi","Walmart, Flipkart",Undisclosed,,2019
55,Binca Games,2014.0,Mumbai,Games,It offers games that are available across Indi...,"Rubianca Wadhwa, Sahil Wadhwa",Sunil Gavaskar,Undisclosed,,2019


In [21]:
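# Drop rows with an undisclosed amount; .copy() makes later column assignments act on an independent frame
dat2 = dat2[dat2['Amount'] != 'Undisclosed'].copy()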
dat2['Amount'].unique()

array(['6300000', '150000000', '28000000', '30000000', '6000000',
       '1000000', '20000000', '275000000', '22000000', '5000000',
       '140500', '540000000', '15000000', '182700', '12000000',
       '11000000', '15500000', '1500000', '5500000', '2500000', '140000',
       '230000000', '49400000', '32000000', '26000000', '150000',
       '400000', '2000000', '100000000', '8000000', '100000', '50000000',
       '120000000', '4000000', '6800000', '36000000', '5700000',
       '25000000', '600000', '70000000', '60000000', '220000', '2800000',
       '2100000', '7000000', '311000000', '4800000', '693000000',
       '33000000'], dtype=object)

In [22]:
# convert amount to float
dat2['Amount']= dat2['Amount'].astype(float)
dat2

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000.0,Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000.0,Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...",30000000.0,Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000.0,,2019
...,...,...,...,...,...,...,...,...,...,...
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...",20000000.0,Series A,2019
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...",693000000.0,,2019
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,5000000.0,Series B,2019
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...",50000000.0,,2019
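
Dropping the `Undisclosed` rows also removes those companies from any later count by sector or headquarters. An alternative, sketched below on a small made-up series, keeps the rows and marks the amount as missing instead. `pd.to_numeric` with `errors='coerce'` is a standard pandas call; the sample values are illustrative.

```python
import pandas as pd

amounts = pd.Series(['$6,300,000', 'Undisclosed', '$150,000,000'])

# Strip '$' and ',' in one pass, then turn anything non-numeric into NaN
numeric = pd.to_numeric(
    amounts.str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)
print(numeric)
```

Which option is preferable depends on the question: totals and medians ignore `NaN` by default, while deal counts per sector or city would keep the undisclosed rounds.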


### Importing and Cleaning 2020 and 2021 Datasets

In [23]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
server = environment_variables.get("SERVER_NAME")
database = environment_variables.get("DATABASE_NAME")
username = environment_variables.get("DB_USERNAME")
password = environment_variables.get("PASSWORD")
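
Because `.get()` returns `None` for a missing key, a typo in the `.env` file only surfaces later as a confusing connection error. A small guard such as the one below could catch it earlier. This is a sketch: `require_settings` is a hypothetical helper, and the dictionary stands in for the result of `dotenv_values('.env')` with placeholder values.

```python
REQUIRED = ["SERVER_NAME", "DATABASE_NAME", "DB_USERNAME", "PASSWORD"]


def require_settings(settings, keys):
    """Return the requested settings, raising if any are missing or empty."""
    missing = [key for key in keys if not settings.get(key)]
    if missing:
        raise ValueError(f"Missing values in .env: {', '.join(missing)}")
    return {key: settings[key] for key in keys}


# Stand-in for dotenv_values('.env'); the values here are placeholders
example = {
    "SERVER_NAME": "example-server",
    "DATABASE_NAME": "example-db",
    "DB_USERNAME": "example-user",
    "PASSWORD": None,
}

try:
    require_settings(example, REQUIRED)
except ValueError as err:
    print(err)
```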

In [24]:

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"


In [25]:
connection = pyodbc.connect(connection_string)

In [26]:
# select data from 2020
query = "SELECT * FROM LP1_startup_funding2020"

dat3 = pd.read_sql(query, connection)
dat3.head()

In [27]:
# adding years
year_3 = 2020
dat3['Year'] = year_3
dat3

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,,2020
...,...,...,...,...,...,...,...,...,...,...,...
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,,,2020
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,,2020
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,,,2020
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C,,2020


In [28]:
# loading data from 2021
query = "SELECT * FROM LP1_startup_funding2021"

dat4 = pd.read_sql(query, connection)
dat4.head()

In [29]:
# adding years
year_4 = 2021
dat4['Year'] = year_4
dat4


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A,2021
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",,2021
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D,2021
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C,2021
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed,2021
...,...,...,...,...,...,...,...,...,...,...
1204,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,2021
1205,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,2021
1206,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,2021
1207,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,2021


In [30]:
dat3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
 10  Year           1055 non-null   int64  
dtypes: float64(2), int64(1), object(8)
memory usage: 90.8+ KB


In [31]:
dat4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
 9   Year           1209 non-null   int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 94.6+ KB


In [32]:
dat3=dat3.drop('column10', axis=1)
dat3

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020
...,...,...,...,...,...,...,...,...,...,...
1050,Leverage Edu,,Delhi,Edtech,AI enabled marketplace that provides career gu...,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures",1500000.0,,2020
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,2020
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,,2020
1053,Shuttl,2015.0,Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C,2020


## Merging datasets

In [33]:
#merging 2020 and 2021 datasets
merged_table= pd.concat([dat3, dat4], ignore_index=True)
merged_table

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020
...,...,...,...,...,...,...,...,...,...,...
2259,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,2021
2260,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,2021
2261,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,2021
2262,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,2021


In [34]:
#renaming company name columns
dat1.rename(columns={'Company Name':'Company_Brand'}, inplace= True)

dat2.rename(columns={'Company/Brand':'Company_Brand'}, inplace= True)

In [35]:
#renaming Round/Series to Stage
dat1.rename(columns={'Round/Series':'Stage'}, inplace= True)

In [36]:
#renaming Industry to name Sector
dat1.rename(columns={'Industry':'Sector'}, inplace= True)

In [37]:
#renaming About Company to What it does
dat1.rename(columns={'About Company':'What it does'}, inplace= True)

In [38]:
#renaming Location to HeadQuarter
dat1.rename(columns={'Location':'HeadQuarter'}, inplace= True)

In [39]:
# extracting the locations in dat1
dat1['HeadQuarter']= dat1['HeadQuarter'].str.split(',').str[0]
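
As a small illustration of the split above, the sketch below keeps only the text before the first comma and also trims stray whitespace. The sample values are made up for the example and are not taken from `dat1`.

```python
import pandas as pd

# Illustrative values only; the real 2018 locations may look different.
locations = pd.Series(["Bangalore, Karnataka, India", " Mumbai ", None])

# Keep the part before the first comma, then trim surrounding whitespace.
city = locations.str.split(",").str[0].str.strip()
print(city.tolist())
```

Missing locations stay missing, so no rows are lost at this step.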

In [40]:
#printing columns to compare if the names are matching
print(dat1.columns)
print(dat2.columns)
print(merged_table.columns)


Index(['Company_Brand', 'Sector', 'Stage', 'Amount', 'HeadQuarter',
       'What it does', 'Year'],
      dtype='object')
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Year'],
      dtype='object')
Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'Year'],
      dtype='object')
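
The individual renames above could also be collected into one mapping and the result checked against the shared schema programmatically. This is only a sketch: `RENAME_2018`, `harmonise`, `target` and the toy frame are hypothetical names introduced for illustration, and the toy values are invented.

```python
import pandas as pd

# Hypothetical mapping from the 2018 column names to the shared schema.
RENAME_2018 = {
    "Company Name": "Company_Brand",
    "Round/Series": "Stage",
    "Industry": "Sector",
    "About Company": "What it does",
    "Location": "HeadQuarter",
}

def harmonise(df, mapping):
    """Return a copy of df with its columns renamed according to mapping."""
    return df.rename(columns=mapping)

# Toy stand-in for dat1 (illustrative values only).
toy_2018 = pd.DataFrame({
    "Company Name": ["A"],
    "Industry": ["FinTech"],
    "Round/Series": ["Seed"],
    "Amount": ["1000"],
    "Location": ["Bangalore"],
    "About Company": ["Example"],
})
toy_2018 = harmonise(toy_2018, RENAME_2018)

# Column names used by the 2019-2021 frames after renaming.
target = {"Company_Brand", "Founded", "HeadQuarter", "Sector", "What it does",
          "Founders", "Investor", "Amount", "Stage", "Year"}

unexpected = set(toy_2018.columns) - target  # names that are not in the schema
missing = target - set(toy_2018.columns)     # columns this frame does not provide
print(sorted(unexpected), sorted(missing))
```

An empty `unexpected` set means every column will line up during concatenation; anything in `missing` will simply be filled with NaN for that year.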


In [41]:
dat1.head(3)

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,Mumbai,A startup which aggregates milk from dairy far...,2018
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,Gurgaon,Leading Online Loans Marketplace in India,2018


In [42]:
dat2['HeadQuarter'].unique()

array([nan, 'Mumbai', 'Chennai', 'Telangana', 'Noida', 'Delhi',
       'Bangalore', 'Ahmedabad', 'Haryana', 'Gurugram', 'Jaipur', 'Pune',
       'New Delhi', 'Surat', 'Uttar pradesh', 'Hyderabad', 'Rajasthan'],
      dtype=object)

In [43]:
dat2.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000.0,Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000.0,Fresh funding,2019


In [44]:
merged_table.head(3)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020


In [45]:
#renaming What_it_does to What it does
merged_table.rename(columns={'What_it_does':'What it does'}, inplace= True)
merged_table

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,2020
...,...,...,...,...,...,...,...,...,...,...
2259,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,2021
2260,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,2021
2261,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,2021
2262,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,2021


In [46]:
#merging 2019, 2020, 2021
merger= pd.concat([dat2, merged_table], ignore_index=True)
merger

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount,Stage,Year
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000.0,,2019
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000.0,Series C,2019
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000.0,Fresh funding,2019
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...",30000000.0,Series D,2019
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000.0,,2019
...,...,...,...,...,...,...,...,...,...,...
2336,Gigforce,2019.0,Gurugram,Staffing & Recruiting,A gig/on-demand staffing company.,"Chirag Mittal, Anirudh Syal",Endiya Partners,$3000000,Pre-series A,2021
2337,Vahdam,2015.0,New Delhi,Food & Beverages,VAHDAM is among the world’s first vertically i...,Bala Sarda,IIFL AMC,$20000000,Series D,2021
2338,Leap Finance,2019.0,Bangalore,Financial Services,International education loans for high potenti...,"Arnav Kumar, Vaibhav Singh",Owl Ventures,$55000000,Series C,2021
2339,CollegeDekho,2015.0,Gurugram,EdTech,"Collegedekho.com is Student’s Partner, Friend ...",Ruchir Arora,"Winter Capital, ETS, Man Capital",$26000000,Series B,2021


In [47]:
#merging 2018,2019,2020,2021
data_df= pd.concat([dat1, merger], ignore_index=True)
data_df

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,584000.0,Mumbai,A startup which aggregates milk from dairy far...,2018,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,949000.0,Gurgaon,Leading Online Loans Marketplace in India,2018,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000,Noida,PayMe India is an innovative FinTech organizat...,2018,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,,,
...,...,...,...,...,...,...,...,...,...,...
2862,Gigforce,Staffing & Recruiting,Pre-series A,$3000000,Gurugram,A gig/on-demand staffing company.,2021,2019.0,"Chirag Mittal, Anirudh Syal",Endiya Partners
2863,Vahdam,Food & Beverages,Series D,$20000000,New Delhi,VAHDAM is among the world’s first vertically i...,2021,2015.0,Bala Sarda,IIFL AMC
2864,Leap Finance,Financial Services,Series C,$55000000,Bangalore,International education loans for high potenti...,2021,2019.0,"Arnav Kumar, Vaibhav Singh",Owl Ventures
2865,CollegeDekho,EdTech,Series B,$26000000,Gurugram,"Collegedekho.com is Student’s Partner, Friend ...",2021,2015.0,Ruchir Arora,"Winter Capital, ETS, Man Capital"
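
The NaN values in `Founded`, `Founders` and `Investor` for the 2018 rows follow from how `pd.concat` aligns frames by column name. The toy frames below (invented values, hypothetical names `a` and `b`) sketch that behaviour, and also show why `Amount` ends up as an object column holding both strings and floats.

```python
import pandas as pd

# Toy frames with different column sets and different Amount types.
a = pd.DataFrame({"Company_Brand": ["A"], "Amount": ["250000"], "Year": [2018]})
b = pd.DataFrame({"Company_Brand": ["B"], "Founded": [2015.0],
                  "Amount": [6300000.0], "Year": [2019]})

combined = pd.concat([a, b], ignore_index=True)

print(combined.columns.tolist())            # union of columns, first-seen order
print(combined["Founded"].isna().tolist())  # rows from `a` have no Founded value
print({type(v).__name__ for v in combined["Amount"]})  # mixed value types
```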


In [48]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2867 entries, 0 to 2866
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2867 non-null   object 
 1   Sector         2849 non-null   object 
 2   Stage          1938 non-null   object 
 3   Amount         2610 non-null   object 
 4   HeadQuarter    2756 non-null   object 
 5   What it does   2867 non-null   object 
 6   Year           2867 non-null   int64  
 7   Founded        2102 non-null   float64
 8   Founders       2322 non-null   object 
 9   Investor       2241 non-null   object 
dtypes: float64(1), int64(1), object(8)
memory usage: 224.1+ KB


### Data Cleaning by Columns

### Cleaning the Amount column

In [49]:
data_df['Amount'].unique()

array(['250000', 584000.0, 949000.0, '2000000', '—', '1600000', 233600.0,
       730000.0, 1460000.0, '150000', '1100000', 7300.0, '6000000',
       '650000', 511000.0, 934400.0, 292000.0, '1000000', '5000000',
       '4000000', 438000.0, '2800000', '1700000', '1300000', 73000.0,
       182500.0, 219000.0, '500000', 1518400.0, 657000.0, '13400000',
       365000.0, 385440.0, 116800.0, 876.0, '9000000', '100000', '20000',
       '120000', 496400.0, 4993200.0, '$143145', 8760000.0, '$742000000',
       14600000.0, 29200000.0, '$3980000', '$10000', 1460.0, 3650000.0,
       '$1000000000', '$7000000', '$35000000', 8030000.0, '$28500000',
       '$2000000', 3504000.0, 1752000.0, '$2400000', '$30000000',
       36500000.0, '$23000000', '$150000', '$11000000', 642400.0,
       '$3240000', 876000.0, '$540000000', 9490000.0, 23360000.0,
       '$900000', '$10000000', '$1500000', 1022000.0, '$1000000',
       '$5000000', '$14000000', 1496500.0, '$100000000', 17520.0,
       75920000.0, '$800000'

In [50]:
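# removing the $ (regex=False so that '$' is treated as a literal character)
# note: .str methods return NaN for non-string entries, so the float amounts become NaN here
data_df['Amount']= data_df['Amount'].str.replace('$', '', regex=False)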

data_df['Amount'].unique()

array(['250000', nan, '2000000', '—', '1600000', '150000', '1100000',
       '6000000', '650000', '1000000', '5000000', '4000000', '2800000',
       '1700000', '1300000', '500000', '13400000', '9000000', '100000',
       '20000', '120000', '143145', '742000000', '3980000', '10000',
       '1000000000', '7000000', '35000000', '28500000', '2400000',
       '30000000', '23000000', '11000000', '3240000', '540000000',
       '900000', '10000000', '1500000', '14000000', '100000000', '800000',
       '1041000', '15000', '1400000', '1200000', '2200000', '1800000',
       '3600000', '300000', '6830000', '200000', '4300000', '364846',
       '400000', '13200000', '50000', '3000000', '1250000', '180000',
       '4200000', '175000', '1450000', '4500000', '600000', '15000000',
       '125000', '130000', '17200000', '3500000', '12000000', '40000000',
       '50000000', '41900000', '3530000', '3300000', '210000000',
       '37680000', '22000000', '70000', '185000000', '65000000', '700000',
       '75

In [51]:
# removing the ,
data_df['Amount']= data_df['Amount'].str.replace(',', '')

data_df['Amount'].unique()

array(['250000', nan, '2000000', '—', '1600000', '150000', '1100000',
       '6000000', '650000', '1000000', '5000000', '4000000', '2800000',
       '1700000', '1300000', '500000', '13400000', '9000000', '100000',
       '20000', '120000', '143145', '742000000', '3980000', '10000',
       '1000000000', '7000000', '35000000', '28500000', '2400000',
       '30000000', '23000000', '11000000', '3240000', '540000000',
       '900000', '10000000', '1500000', '14000000', '100000000', '800000',
       '1041000', '15000', '1400000', '1200000', '2200000', '1800000',
       '3600000', '300000', '6830000', '200000', '4300000', '364846',
       '400000', '13200000', '50000', '3000000', '1250000', '180000',
       '4200000', '175000', '1450000', '4500000', '600000', '15000000',
       '125000', '130000', '17200000', '3500000', '12000000', '40000000',
       '50000000', '41900000', '3530000', '3300000', '210000000',
       '37680000', '22000000', '70000', '185000000', '65000000', '700000',
       '75
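
Because `Amount` mixes strings and floats, the `.str.replace` calls above only act on the string entries; the outputs show the float values (for example `584000.0`) replaced by `nan`. One possible alternative, sketched here with a hypothetical helper `clean_amount` and invented sample values, is to clean strings element by element and pass everything else through unchanged.

```python
import pandas as pd

# Illustrative mix of the value types seen in the Amount column.
raw = pd.Series(["250000", 584000.0, "$1,200,000", "Undisclosed", None],
                dtype=object)

def clean_amount(value):
    """Strip '$' and ',' from strings; leave numbers and missing values as-is."""
    if isinstance(value, str):
        return value.replace("$", "").replace(",", "").strip()
    return value

cleaned = raw.map(clean_amount)

# For comparison: the accessor-based version turns the float into NaN.
via_str = raw.str.replace("$", "", regex=False)

print(cleaned.tolist())
print(via_str.tolist())
```

With this approach the numeric amounts from the earlier years would survive the cleaning step instead of being dropped.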

In [52]:
# Filter the records whose Amount is exactly 'Undisclosed'
undisclosed_data = data_df[data_df['Amount'] == 'Undisclosed']
undisclosed_data


Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
1665,Qube Health,HealthTech,Pre-series A,Undisclosed,Mumbai,India's Most Respected Workplace Healthcare Ma...,2021,2016.0,Gagan Kapur,Inflection Point Ventures
1666,Vitra.ai,Tech Startup,,Undisclosed,Bangalore,Vitra.ai is an AI-based video translation plat...,2021,2020.0,Akash Nidhi PS,Inflexor Ventures
1679,Uable,EdTech,Seed,Undisclosed,Bangalore,Uable offers role based programmes to empower ...,2021,2020.0,Saurabh Saxena,"Chiratae Ventures, JAFCO Asia"
1697,TruNativ,Food & Beverages,Seed,Undisclosed,Mumbai,TruNativ Foods & Beverages Pvt Ltd,2021,2019.0,"Pranav Malhotra, Mamta Malhotra",9Unicorns
1712,AntWak,EdTech,Seed,Undisclosed,Bangalore,AntWak provides a video platform for e-learnin...,2021,2019.0,"Basav Nagur, Joybroto Ganguly, Sudhanshu Shekh...","Vaibhav Domkundwar, Kunal Shah"
...,...,...,...,...,...,...,...,...,...,...
2786,Leverage Edu,Higher Education,,Undisclosed,New Delhi,India's Most Trusted Study Abroad Platform,2021,2017.0,Akshay Chaturvedi,"Vijay Shekhar Sharma, Rohit Kapoor, Amanpreet ..."
2818,Atomberg Technologies,Consumer Electronics,,Undisclosed,Mumbai,A maker of energy-efficient smart fans,2021,2012.0,"Manoj Meena, Sibabrata Das",Ka Enterprises
2819,Genext Students,EdTech,,Undisclosed,Mumbai,LIVE online classes with expert tutors for K-1...,2021,2013.0,"Ali Asgar Kagzi, Piyush Dhanuka",Navneet Education
2824,OckyPocky,EdTech,Seed,Undisclosed,Gurugram,OckyPocky is India's 1st interactive English l...,2021,2015.0,Amit Agrawal,"Sujeet Kumar, SucSEED Indovation Fund"


In [53]:
undisclosed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 116 entries, 1665 to 2851
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  116 non-null    object 
 1   Sector         116 non-null    object 
 2   Stage          59 non-null     object 
 3   Amount         116 non-null    object 
 4   HeadQuarter    116 non-null    object 
 5   What it does   116 non-null    object 
 6   Year           116 non-null    int64  
 7   Founded        116 non-null    float64
 8   Founders       116 non-null    object 
 9   Investor       111 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 10.0+ KB


In [54]:
# Remove the undisclosed records from data_df
data_df = data_df[data_df['Amount']!='Undisclosed']
data_df['Amount'].unique()

array(['250000', nan, '2000000', '—', '1600000', '150000', '1100000',
       '6000000', '650000', '1000000', '5000000', '4000000', '2800000',
       '1700000', '1300000', '500000', '13400000', '9000000', '100000',
       '20000', '120000', '143145', '742000000', '3980000', '10000',
       '1000000000', '7000000', '35000000', '28500000', '2400000',
       '30000000', '23000000', '11000000', '3240000', '540000000',
       '900000', '10000000', '1500000', '14000000', '100000000', '800000',
       '1041000', '15000', '1400000', '1200000', '2200000', '1800000',
       '3600000', '300000', '6830000', '200000', '4300000', '364846',
       '400000', '13200000', '50000', '3000000', '1250000', '180000',
       '4200000', '175000', '1450000', '4500000', '600000', '15000000',
       '125000', '130000', '17200000', '3500000', '12000000', '40000000',
       '50000000', '41900000', '3530000', '3300000', '210000000',
       '37680000', '22000000', '70000', '185000000', '65000000', '700000',
       '75

In [55]:
# checking rows where the Amount column contains 'Upsparks'
data_df[data_df['Amount'] == 'Upsparks']


Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
1756,FanPlay,Computer Games,$1200000,Upsparks,Computer Games,A real money game app specializing in trivia g...,2021,2020.0,YC W21,"Pritesh Kumar, Bharat Gupta"
1769,FanPlay,Computer Games,$1200000,Upsparks,Computer Games,A real money game app specializing in trivia g...,2021,2020.0,YC W21,"Pritesh Kumar, Bharat Gupta"


###### The cell above returned two rows that are exact duplicates, so at least one of them would have to be removed.
###### In both rows the Sector, Stage, Amount and HeadQuarter values are misaligned, so there is not enough justification to keep either of them.


In [56]:
data_df = data_df[data_df['Amount'] != 'Upsparks']
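
Exact duplicates like the two FanPlay rows can also be found across the whole frame rather than one value at a time. The following is a minimal sketch on a toy frame (invented subset of columns), using `duplicated` to inspect and `drop_duplicates` to remove.

```python
import pandas as pd

# Toy frame with one fully duplicated row.
df = pd.DataFrame({
    "Company_Brand": ["FanPlay", "FanPlay", "Uable"],
    "Amount": ["Upsparks", "Upsparks", "Undisclosed"],
    "Year": [2021, 2021, 2021],
})

# keep=False marks every member of a duplicate group, which helps inspection.
dupes = df[df.duplicated(keep=False)]

# Keep the first occurrence of each duplicated row and renumber the index.
deduped = df.drop_duplicates(ignore_index=True)

print(len(dupes), len(deduped))
```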

In [57]:
# Remove dashes from the 'Amount' column
data_df['Amount'] = data_df['Amount'].replace('—', '')
data_df['Amount'].unique()

array(['250000', nan, '2000000', '', '1600000', '150000', '1100000',
       '6000000', '650000', '1000000', '5000000', '4000000', '2800000',
       '1700000', '1300000', '500000', '13400000', '9000000', '100000',
       '20000', '120000', '143145', '742000000', '3980000', '10000',
       '1000000000', '7000000', '35000000', '28500000', '2400000',
       '30000000', '23000000', '11000000', '3240000', '540000000',
       '900000', '10000000', '1500000', '14000000', '100000000', '800000',
       '1041000', '15000', '1400000', '1200000', '2200000', '1800000',
       '3600000', '300000', '6830000', '200000', '4300000', '364846',
       '400000', '13200000', '50000', '3000000', '1250000', '180000',
       '4200000', '175000', '1450000', '4500000', '600000', '15000000',
       '125000', '130000', '17200000', '3500000', '12000000', '40000000',
       '50000000', '41900000', '3530000', '3300000', '210000000',
       '37680000', '22000000', '70000', '185000000', '65000000', '700000',
       '750

In [58]:
# Replace empty strings in the 'Amount' column with 0
data_df['Amount'] = data_df['Amount'].replace('', 0)
data_df['Amount'].unique()

array(['250000', nan, '2000000', 0, '1600000', '150000', '1100000',
       '6000000', '650000', '1000000', '5000000', '4000000', '2800000',
       '1700000', '1300000', '500000', '13400000', '9000000', '100000',
       '20000', '120000', '143145', '742000000', '3980000', '10000',
       '1000000000', '7000000', '35000000', '28500000', '2400000',
       '30000000', '23000000', '11000000', '3240000', '540000000',
       '900000', '10000000', '1500000', '14000000', '100000000', '800000',
       '1041000', '15000', '1400000', '1200000', '2200000', '1800000',
       '3600000', '300000', '6830000', '200000', '4300000', '364846',
       '400000', '13200000', '50000', '3000000', '1250000', '180000',
       '4200000', '175000', '1450000', '4500000', '600000', '15000000',
       '125000', '130000', '17200000', '3500000', '12000000', '40000000',
       '50000000', '41900000', '3530000', '3300000', '210000000',
       '37680000', '22000000', '70000', '185000000', '65000000', '700000',
       '7500
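
Whether a missing amount should count as 0 is a modelling choice made at this step. The toy numbers below (invented, not from the dataset) sketch how that choice affects an average compared with leaving the value as NaN.

```python
import numpy as np
import pandas as pd

# Two disclosed amounts and one round with no stated amount.
as_zero = pd.Series([100000.0, 300000.0, 0.0])
as_missing = pd.Series([100000.0, 300000.0, np.nan])

print(as_zero.mean())     # the zero is averaged in
print(as_missing.mean())  # NaN is skipped by default (skipna=True)
```

Totals are unaffected by the choice, but means and medians are, which may matter for the funding comparisons planned in the hypothesis tests.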

In [59]:
# checking rows where the Amount column contains 'Series C'
data_df[data_df['Amount'] == 'Series C']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
1900,Fullife Healthcare,Primary Business is Development and Manufactur...,,Series C,Pharmaceuticals\t#REF!,Varun Khanna,2021,2009.0,Morgan Stanley Private Equity Asia,$22000000
1914,Fullife Healthcare,Primary Business is Development and Manufactur...,,Series C,Pharmaceuticals\t#REF!,Varun Khanna,2021,2009.0,Morgan Stanley Private Equity Asia,$22000000


In [60]:
# Remove rows where the 'Amount' column is equal to 'Series C'
# These rows have been mislabeled and are duplicated.
data_df = data_df[data_df['Amount'] != 'Series C']


In [61]:
data_df[data_df['Amount'] == 'Seed']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
1915,MoEVing,MoEVing is India's only Electric Mobility focu...,,Seed,Gurugram\t#REF!,"Vikash Mishra, Mragank Jain",2021,2021.0,"Anshuman Maheshwary, Dr Srihari Raju Kalidindi",$5000000
2806,Godamwale,Logistics & Supply Chain,,Seed,Mumbai,Godamwale is tech enabled integrated logistics...,2021,2016.0,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",1000000\t#REF!


In [62]:
# Remove rows where the 'Amount' column is equal to 'Seed'
# These rows have been mislabeled 
data_df = data_df[data_df['Amount'] != 'Seed']


In [63]:
# Convert undisclosed_data to DataFrame if it's not already
if not isinstance(undisclosed_data, pd.DataFrame):
    undisclosed_data = pd.DataFrame(undisclosed_data)

# Filter the additional records where the amount column contains 'Undisclosed' (both capital and lowercase)
additional_undisclosed_data = data_df[data_df['Amount'].str.lower() == 'undisclosed']

# Append the additional undisclosed records to the existing undisclosed_data
undisclosed_data = pd.concat([undisclosed_data, additional_undisclosed_data], ignore_index=True)


In [64]:
#remove undisclosed from amount
data_df=  data_df[data_df['Amount'].str.lower() != 'undisclosed']
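
The exact-match and lower-case filters above could be combined into a single case-insensitive mask. This is a sketch on a toy frame; `is_undisclosed`, `undisclosed` and `disclosed` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Toy frame mixing both spellings, a numeric string and the integer 0.
df = pd.DataFrame({
    "Company_Brand": ["A", "B", "C", "D"],
    "Amount": ["Undisclosed", "undisclosed", "500000", 0],
})

# Non-string entries become NaN under .str.lower(), and .eq() treats them as False.
is_undisclosed = df["Amount"].str.lower().eq("undisclosed")

undisclosed = df[is_undisclosed].copy()  # kept aside for reference
disclosed = df[~is_undisclosed].copy()   # continues through the cleaning steps

print(undisclosed["Company_Brand"].tolist(), disclosed["Company_Brand"].tolist())
```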


In [65]:
data_df[data_df['Amount'] == 'ah! Ventures']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2196,Little Leap,EdTech,$300000,ah! Ventures,New Delhi,Soft Skills that make Smart Leaders,2021,2020.0,Holistic Development Programs for children in ...,Vishal Gupta


In [66]:
#remove ah! Ventures from amount
data_df=  data_df[data_df['Amount'] != 'ah! Ventures']


In [67]:
data_df[data_df['Amount'] == 'Pre-series A']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2203,AdmitKard,EdTech,,Pre-series A,Noida,A tech solution for end to end career advisory...,2021,2016.0,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",$1000000


In [68]:
#remove Pre-series A from amount
data_df=  data_df[data_df['Amount'] != 'Pre-series A']

In [69]:
data_df[data_df['Amount'] == 'ITO Angel Network LetsVenture']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2209,BHyve,Human Resources,$300000,ITO Angel Network LetsVenture,Mumbai,A Future of Work Platform for diffusing Employ...,2021,2020.0,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale"


In [70]:
#remove ITO Angel Network LetsVenture from amount
data_df=  data_df[data_df['Amount'] != 'ITO Angel Network LetsVenture']

In [71]:
data_df[data_df['Amount'] == 'JITO Angel Network LetsVenture']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2335,Saarthi Pedagogy,EdTech,$1000000,JITO Angel Network LetsVenture,Ahmadabad,"India's fastest growing Pedagogy company, serv...",2021,2015.0,Pedagogy,Sushil Agarwal


In [72]:
#remove JITO Angel Network LetsVenture from amount
data_df=  data_df[data_df['Amount'] != 'JITO Angel Network LetsVenture']

In [73]:
# Convert 'Amount' column to float 
data_df['Amount'] = data_df['Amount'].astype(float)
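
The misplaced values removed one by one above (`'Upsparks'`, `'Series C'`, `'ah! Ventures'` and so on) could also be listed in a single pass before converting. A minimal sketch, assuming a toy `amounts` series with invented contents, uses `pd.to_numeric` with `errors="coerce"` to flag everything that is present but not numeric.

```python
import pandas as pd

# Toy Amount values: valid numbers, misplaced text and a missing entry.
amounts = pd.Series(["250000", "Upsparks", "Series C", None, "3000000",
                     "ah! Ventures"], dtype=object)

# Unparseable entries become NaN instead of raising an error.
as_number = pd.to_numeric(amounts, errors="coerce")

# Present in the original but not parseable: these need manual review.
bad_mask = as_number.isna() & amounts.notna()
print(amounts[bad_mask].unique().tolist())
```

Reviewing this list first makes it less likely that a stray label is missed before the `astype(float)` call.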

### Cleaning the Company_Brand column


In [74]:
# Get unique values in the Company_Brand column and sort them
sorted(data_df['Company_Brand'].unique())



['&ME',
 '1Bridge',
 '1Crowd',
 '1K Kirana Bazaar',
 '1MG',
 '21K School',
 '3SC',
 '3one4 Capital',
 '4Fin',
 '4baseCare',
 '5C Network',
 '6Degree',
 '88academics',
 '8i Ventures',
 '90+ My Tuition App',
 '91springboard',
 '9stacks',
 'ABL Workspaces',
 'ACKO',
 'AFK Gaming',
 'AMPM',
 'ANS Commerce',
 'ANSR',
 'APAC Financial Services',
 'ASQI Advisors',
 'Aadhar',
 'Aagey',
 'Aarav Unmanned Systems',
 'Aashiyaan Housing and Development Finance',
 'Aavas Financiers',
 'Aavenir',
 'Able Jobs',
 'Acculi Labs',
 'AcknoLedger',
 'Acko',
 'Acko General Insurance',
 'Adda247',
 'Adiuvo Diagnostics',
 'AdonMo',
 'Advantage Club',
 'Aerchain',
 'Aerostrovilos',
 'Aesthetic Nutrition',
 'Aether Biomedical',
 'AgNext',
 'AgNext Technologies',
 'Agnikul',
 'Agricxlab Private Limited',
 'Agrix',
 'Agro2o',
 'AgroStar',
 'AgroWave',
 'Ahaguru',
 'Aibono',
 'Airblack',
 'Airmeet',
 'AjnaLens',
 'Aker Foods',
 'Akna Medical',
 'Akudo',
 'Aldopay',
 'AlgoBulls',
 'Alpha Coach',
 'AlphaVector',
 'Al

In [75]:
# Get unique values in the Company_Brand column
unique_company_brands = data_df['Company_Brand'].unique()

# Convert the unique values array to a list
unique_company_brands_list = unique_company_brands.tolist()

# Print the list of unique company brands
unique_company_brands_list


['TheCollegeFever',
 'Happy Cow Dairy',
 'MyLoanCare',
 'PayMe India',
 'Eunimart',
 'Hasura',
 'Tripshelf',
 'Hyperdata.IO',
 'Freightwalla',
 'Microchip Payments',
 'BizCrum Infotech Pvt. Ltd.',
 'Emojifi',
 'Flock',
 'Freshboxx',
 'Wide Mobility Mechatronics',
 'Pitstop',
 'Mihuru',
 'Fyle',
 'AppWharf',
 'Antariksh Waste Ventures Pvt ltd',
 'Cogoport',
 'PaisaDukan',
 'Sleepy Owl Coffee',
 'BlueJack',
 'PregBuddy',
 'AgNext Technologies',
 'Pando',
 'Mintifi',
 'Carcrew',
 'NicheAI',
 'Chariot Tech',
 'Ideal Insurance Brokers/ 121Policy.com',
 'Loanzen',
 'Mojro Technologies',
 'Elemential',
 'Loadshare',
 'Yumlane',
 'Kriger Campus',
 'Pipa+Bella',
 'Kaleidofin',
 'Chakr Innovation',
 'IndigoLearn',
 'UClean',
 'Coutloot',
 'Letstrack',
 'Pooltoo',
 'Finzy',
 'Fitternity',
 'Keito',
 'Tolet for Students',
 'Chai Kings',
 'Dainik Jagran',
 'Playtoome',
 'ONGO Framework',
 'Notesgen',
 'Arogya MedTech',
 'Propshop24',
 'myUpchar',
 'MissMalini Entertainment',
 'Rooter',
 'ZestMoney'
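
The sorted list above contains names that differ only in capitalisation, such as `'ACKO'` and `'Acko'`. One way to surface such pairs is to group on a case-folded key; the sketch below uses a small sample of names from the output, and `key`, `counts` and `collisions` are illustrative variable names.

```python
import pandas as pd

# A few brand names taken from the sorted output above.
brands = pd.Series(["ACKO", "Acko", "AgNext", "AgNext Technologies", "1MG"])

# Normalise whitespace and case to build a comparison key.
key = brands.str.strip().str.casefold()

# Keys that map to more than one distinct spelling.
counts = brands.groupby(key).nunique()
collision_keys = counts[counts > 1].index

collisions = brands[key.isin(collision_keys)]
print(collisions.tolist())
```

This only catches case and whitespace variants; names such as `'AgNext'` and `'AgNext Technologies'` would still need a manual decision.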

### Cleaning the Sector column

In [76]:
data_df['Sector'].unique()

array(['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
       'Agriculture, Farming',
       'Credit, Financial Services, Lending, Marketplace',
       'Financial Services, FinTech',
       'E-Commerce Platforms, Retail, SaaS',
       'Cloud Infrastructure, PaaS, SaaS',
       'Internet, Leisure, Marketplace', 'Market Research',
       'Information Services, Information Technology', 'Mobile Payments',
       'B2B, Shoes', 'Internet',
       'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
       'Food Delivery', 'Industrial Automation',
       'Automotive, Search Engine, Service Industry',
       'Finance, Internet, Travel',
       'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
       'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
       'Internet of Things, Waste Management',
       'Air Transportation, Freight Service, Logistics, Marine Transport

In [77]:
# Get unique values in the Sector column
unique_sectors = data_df['Sector'].unique()

# Convert the unique values array to a list
unique_sectors_list = unique_sectors.tolist()

# Print the list of unique sectors
unique_sectors_list


['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
 'Agriculture, Farming',
 'Credit, Financial Services, Lending, Marketplace',
 'Financial Services, FinTech',
 'E-Commerce Platforms, Retail, SaaS',
 'Cloud Infrastructure, PaaS, SaaS',
 'Internet, Leisure, Marketplace',
 'Market Research',
 'Information Services, Information Technology',
 'Mobile Payments',
 'B2B, Shoes',
 'Internet',
 'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
 'Food Delivery',
 'Industrial Automation',
 'Automotive, Search Engine, Service Industry',
 'Finance, Internet, Travel',
 'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
 'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
 'Internet of Things, Waste Management',
 'Air Transportation, Freight Service, Logistics, Marine Transportation',
 'Financial Services',
 'Food and Beverage',
 'Autonomous Vehicles',
 'Enterprise Software

In [78]:
# Filter the data where the "Sector" column is equal to "Manchester, Greater Manchester"
data_df[data_df['Sector'] == "Manchester, Greater Manchester"]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2834,Peak,"Manchester, Greater Manchester",Series C,75000000.0,Information Technology & Services,Peak helps the world's smartest companies put ...,2021,2014.0,Atul Sharma,SoftBank Vision Fund 2


In [79]:
# Swap values between 'Sector' and 'HeadQuarter' columns for the row where 'Company_Brand' is 'Peak'
data_df.loc[data_df['Company_Brand'] == 'Peak', ['Sector', 'HeadQuarter']] = data_df.loc[data_df['Company_Brand'] == 'Peak', ['HeadQuarter', 'Sector']].values

# Filter the data where the "Company_Brand" column is equal to Peak
data_df[data_df['Company_Brand'] == "Peak"]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2834,Peak,Information Technology & Services,Series C,75000000.0,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,2021,2014.0,Atul Sharma,SoftBank Vision Fund 2


In [81]:
# Get unique values in the Sector column
unique_sectors = data_df['Sector'].unique()

# Convert the unique values array to a list
unique_sectors_list = unique_sectors.tolist()

# Print the list of unique sectors
unique_sectors_list

['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
 'Agriculture, Farming',
 'Credit, Financial Services, Lending, Marketplace',
 'Financial Services, FinTech',
 'E-Commerce Platforms, Retail, SaaS',
 'Cloud Infrastructure, PaaS, SaaS',
 'Internet, Leisure, Marketplace',
 'Market Research',
 'Information Services, Information Technology',
 'Mobile Payments',
 'B2B, Shoes',
 'Internet',
 'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
 'Food Delivery',
 'Industrial Automation',
 'Automotive, Search Engine, Service Industry',
 'Finance, Internet, Travel',
 'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
 'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
 'Internet of Things, Waste Management',
 'Air Transportation, Freight Service, Logistics, Marine Transportation',
 'Financial Services',
 'Food and Beverage',
 'Autonomous Vehicles',
 'Enterprise Software

### Cleaning the Stage Column

In [82]:
# Get unique values in the Stage column
unique_stages = data_df['Stage'].unique()

# Convert the unique values array to a list
unique_stages_list = unique_stages.tolist()

# Print the list of unique stages
unique_stages_list

['Seed',
 'Series A',
 'Angel',
 'Series B',
 'Pre-Seed',
 'Private Equity',
 'Venture - Series Unknown',
 'Grant',
 'Debt Financing',
 'Post-IPO Debt',
 'Series H',
 'Series C',
 'Series E',
 'Corporate Round',
 'Undisclosed',
 'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
 'Series D',
 'Secondary Market',
 'Post-IPO Equity',
 'Non-equity Assistance',
 'Funding Round',
 nan,
 'Fresh funding',
 'Pre series A',
 'Series G',
 'Post series A',
 'Seed funding',
 'Seed fund',
 'Series F',
 'Series B+',
 'Seed round',
 'Pre-series A',
 None,
 'Pre-seed',
 'Pre-series',
 'Debt',
 'Pre-series C',
 'Pre-series B',
 'Bridge',
 'Series B2',
 'Pre- series A',
 'Edge',
 'Pre-Series B',
 'Seed A',
 'Series A-1',
 'Seed Funding',
 'Pre-seed Round',
 'Seed Round & Series A',
 'Pre Series A',
 'Pre seed Round',
 'Angel Round',
 'Pre series A1',
 'Series E2',
 'Seed Round',
 'Bridge Round',
 'Pre seed round',
 'Pre series B',
 'Pre series C',


In [83]:
data_df[data_df['Stage'] == 'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
178,BuyForexOnline,Travel,https://docs.google.com/spreadsheets/d/1x9ziNe...,2000000.0,Bangalore,BuyForexOnline.com is India's first completely...,2018,,,


In [84]:
# Set the values in the 'Stage' column to an empty string where the value is equal to the specified URL
data_df.loc[data_df['Stage'] == 'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593', 'Stage'] = ''
data_df[data_df['Company_Brand'] == 'BuyForexOnline']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
178,BuyForexOnline,Travel,,2000000.0,Bangalore,BuyForexOnline.com is India's first completely...,2018,,,


In [85]:
data_df[data_df['Stage'] == '$6000000']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2332,MYRE Capital,Commercial Real Estate,$6000000,,Mumbai,Democratising Real Estate Ownership,2021,2020.0,Own rent yielding commercial properties,Aryaman Vir


In [86]:
# Set the 'Stage' column to an empty string where the value is equal to '$6000000'
data_df.loc[data_df['Stage'] == '$6000000', 'Stage'] = ''
data_df[data_df['Company_Brand'] == 'MYRE Capital']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2332,MYRE Capital,Commercial Real Estate,,,Mumbai,Democratising Real Estate Ownership,2021,2020.0,Own rent yielding commercial properties,Aryaman Vir


In [87]:
# Set 'Amount' to 6000000 (the '$6000000' value that had been misplaced in 'Stage') for the row where 'Company_Brand' is 'MYRE Capital'
data_df.loc[data_df['Company_Brand'] == 'MYRE Capital', 'Amount'] = 6000000
data_df[data_df['Company_Brand'] == 'MYRE Capital']

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2332,MYRE Capital,Commercial Real Estate,,6000000.0,Mumbai,Democratising Real Estate Ownership,2021,2020.0,Own rent yielding commercial properties,Aryaman Vir


In [88]:
# Select rows where the 'Stage' column contains NaN values
data_df[pd.isna(data_df['Stage'])]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
526,Bombay Shaving,Ecommerce,,,,Provides a range of male grooming products,2019,,Shantanu Deshpande,Sixth Sense Ventures
530,Nu Genes,AgriTech,,,Telangana,"It is a seed company engaged in production, pr...",2019,2004.0,Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA)
534,Appnomic,SaaS,,,Bangalore,"It is a self-healing enterprise, the IT operat...",2019,,D Padmanabhan,Avataar Ventures
536,JobSquare,HR tech,,,Ahmedabad,Technology-based platform that is connecting s...,2019,2019.0,Ishit Jethwa,Titan Capital
537,LivFin,Fintech,,,Delhi,"Grants small business loans, supply chain fina...",2019,2017.0,Rakesh Malhotra,German development finance institution DEG
...,...,...,...,...,...,...,...,...,...,...
2828,Lido Learning,E-learning,,10000000.0,Mumbai,LIDO is an ed-tech company revolutionizing fo...,2021,2019.0,Sahil Sheth,Unilazer Ventures
2830,Peppermint,Industrial Automation,,600000.0,Pune,Intelligent Housekeeping Robots for public and...,2021,2019.0,"Runal Dahiwade, Miraj C Vora","Venture Catalysts, Indian Angel Network"
2840,Sugar.fit,Health,,10000000.0,Bangalore,"Innovative technology, compassionate diabetes ...",2021,2021.0,"Shivtosh Kumar, Madan Somasundaram","Cure.fit, Endiya Partners, Tanglin Venture"
2850,Geniemode,B2B,,2000000.0,Gurugram,Transforming global sourcing for retailers & s...,2021,2021.0,"Amit Sharma, Tanuj Gangwani",Info Edge Ventures


In [89]:
# Convert NaN values in the 'Stage' column to an empty string ('')
data_df.loc[pd.isna(data_df['Stage']), 'Stage'] = ''

In [90]:
# Get unique values in the Stage column
unique_stages = data_df['Stage'].unique()

# Convert the unique values array to a list
unique_stages_list = unique_stages.tolist()

# Print the list of unique stages
unique_stages_list

['Seed',
 'Series A',
 'Angel',
 'Series B',
 'Pre-Seed',
 'Private Equity',
 'Venture - Series Unknown',
 'Grant',
 'Debt Financing',
 'Post-IPO Debt',
 'Series H',
 'Series C',
 'Series E',
 'Corporate Round',
 'Undisclosed',
 '',
 'Series D',
 'Secondary Market',
 'Post-IPO Equity',
 'Non-equity Assistance',
 'Funding Round',
 'Fresh funding',
 'Pre series A',
 'Series G',
 'Post series A',
 'Seed funding',
 'Seed fund',
 'Series F',
 'Series B+',
 'Seed round',
 'Pre-series A',
 'Pre-seed',
 'Pre-series',
 'Debt',
 'Pre-series C',
 'Pre-series B',
 'Bridge',
 'Series B2',
 'Pre- series A',
 'Edge',
 'Pre-Series B',
 'Seed A',
 'Series A-1',
 'Seed Funding',
 'Pre-seed Round',
 'Seed Round & Series A',
 'Pre Series A',
 'Pre seed Round',
 'Angel Round',
 'Pre series A1',
 'Series E2',
 'Seed Round',
 'Bridge Round',
 'Pre seed round',
 'Pre series B',
 'Pre series C',
 'Seed Investment',
 'Series D1',
 'Mid series',
 'Series C, D',
 'Seed+',
 'Series F2',
 'Series A+',
 'Series B3',
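
The list above still contains several spellings of the same round, for example 'Seed funding', 'Seed round' and 'Seed Round', or 'Pre series A', 'Pre-series A' and 'Pre- series A'. The sketch below shows one possible way to collapse such variants. The `normalize_stage` helper, the `CANONICAL` table and its rules are assumptions made for illustration; they are not part of the original notebook and would need review against the full data.

```python
import re
import pandas as pd

# Hypothetical lookup table: normalised key -> label to keep
CANONICAL = {
    'seed': 'Seed',
    'pre-seed': 'Pre-Seed',
    'pre-series a': 'Pre-Series A',
    'pre-series b': 'Pre-Series B',
    'pre-series c': 'Pre-Series C',
    'angel': 'Angel',
    'bridge': 'Bridge',
}

def normalize_stage(value):
    """Map spelling variants of a funding stage to one label; unknown labels pass through."""
    if pd.isna(value) or str(value).strip() == '':
        return ''  # the notebook uses '' for a missing stage
    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', str(value).strip())
    key = text.lower()
    # "pre series a", "pre- series a", "pre-series a" -> "pre-series a"
    key = re.sub(r'^pre[\s-]+', 'pre-', key)
    # Drop a trailing filler word such as "round" or "funding"
    key = re.sub(r' (round|funding|fund|investment)$', '', key)
    return CANONICAL.get(key, text)

stages = pd.Series(['Seed funding', 'Seed Round', 'Pre series A',
                    'Pre- series A', 'Post-IPO Debt', None])
normalized = stages.map(normalize_stage)
print(normalized.tolist())
```

Labels that are not in the table, such as 'Post-IPO Debt', are returned unchanged, so the mapping can be extended one entry at a time.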

### Cleaning the HeadQuarter Column

In [91]:
# Get unique values in the HeadQuarter column
unique_HeadQuarter = data_df['HeadQuarter'].unique()

# Convert the unique values array to a list
unique_HeadQuarter_list = unique_HeadQuarter.tolist()

# Print the list of unique HeadQuarter
unique_HeadQuarter_list

['Bangalore',
 'Mumbai',
 'Gurgaon',
 'Noida',
 'Hyderabad',
 'Bengaluru',
 'Kalkaji',
 'Delhi',
 'India',
 'Hubli',
 'New Delhi',
 'Chennai',
 'Mohali',
 'Kolkata',
 'Pune',
 'Jodhpur',
 'Kanpur',
 'Ahmedabad',
 'Azadpur',
 'Haryana',
 'Cochin',
 'Faridabad',
 'Jaipur',
 'Kota',
 'Anand',
 'Bangalore City',
 'Belgaum',
 'Thane',
 'Margão',
 'Indore',
 'Alwar',
 'Kannur',
 'Trivandrum',
 'Ernakulam',
 'Kormangala',
 'Uttar Pradesh',
 'Andheri',
 'Mylapore',
 'Ghaziabad',
 'Kochi',
 'Powai',
 'Guntur',
 'Kalpakkam',
 'Bhopal',
 'Coimbatore',
 'Worli',
 'Alleppey',
 'Chandigarh',
 'Guindy',
 'Lucknow',
 nan,
 'Telangana',
 'Gurugram',
 'Surat',
 'Uttar pradesh',
 'Rajasthan',
 'Tirunelveli, Tamilnadu',
 None,
 'Singapore',
 'Gujarat',
 'Kerala',
 'Jaipur, Rajastan',
 'Frisco, Texas, United States',
 'California',
 'Dhingsara, Haryana',
 'New York, United States',
 'Patna',
 'San Francisco, California, United States',
 'San Francisco, United States',
 'San Ramon, California',
 'Paris, Ile

In [92]:
# Replace 'Online Media\t#REF!' with 'Online Media'
# (assigning the result back avoids calling inplace=True on a column selection)
data_df['HeadQuarter'] = data_df['HeadQuarter'].replace('Online Media\t#REF!', 'Online Media')

# Replace 'Manchester, Greater Manchester' with 'Manchester'
data_df['HeadQuarter'] = data_df['HeadQuarter'].replace('Manchester, Greater Manchester', 'Manchester')


In [93]:
# Get unique values in the HeadQuarter column
unique_HeadQuarter = data_df['HeadQuarter'].unique()

# Convert the unique values array to a list
unique_HeadQuarter_list = unique_HeadQuarter.tolist()

# Print the list of unique HeadQuarter
unique_HeadQuarter_list

['Bangalore',
 'Mumbai',
 'Gurgaon',
 'Noida',
 'Hyderabad',
 'Bengaluru',
 'Kalkaji',
 'Delhi',
 'India',
 'Hubli',
 'New Delhi',
 'Chennai',
 'Mohali',
 'Kolkata',
 'Pune',
 'Jodhpur',
 'Kanpur',
 'Ahmedabad',
 'Azadpur',
 'Haryana',
 'Cochin',
 'Faridabad',
 'Jaipur',
 'Kota',
 'Anand',
 'Bangalore City',
 'Belgaum',
 'Thane',
 'Margão',
 'Indore',
 'Alwar',
 'Kannur',
 'Trivandrum',
 'Ernakulam',
 'Kormangala',
 'Uttar Pradesh',
 'Andheri',
 'Mylapore',
 'Ghaziabad',
 'Kochi',
 'Powai',
 'Guntur',
 'Kalpakkam',
 'Bhopal',
 'Coimbatore',
 'Worli',
 'Alleppey',
 'Chandigarh',
 'Guindy',
 'Lucknow',
 nan,
 'Telangana',
 'Gurugram',
 'Surat',
 'Uttar pradesh',
 'Rajasthan',
 'Tirunelveli, Tamilnadu',
 None,
 'Singapore',
 'Gujarat',
 'Kerala',
 'Jaipur, Rajastan',
 'Frisco, Texas, United States',
 'California',
 'Dhingsara, Haryana',
 'New York, United States',
 'Patna',
 'San Francisco, California, United States',
 'San Francisco, United States',
 'San Ramon, California',
 'Paris, Ile

In [94]:
# Filter the data where the "HeadQuarter" column is equal to "The Nilgiris"
data_df[data_df['HeadQuarter'] == "The Nilgiris"]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
2848,Prolgae,Biotechnology,Seed,200000.0,The Nilgiris,Prolgae Spirulina Supplies Pvt. Ltd. is a Nord...,2021,2016.0,Aakas Sadasivam,Vijayan
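
The unique values listed above include several names for the same place ('Bangalore', 'Bengaluru' and 'Bangalore City'; 'Gurgaon' and 'Gurugram'; 'Cochin' and 'Kochi') as well as strings that append a state or country to the city. A minimal sketch of one way to harmonise them follows. The `CITY_ALIASES` table and the `normalize_headquarter` helper are assumptions for illustration only, and the choice of which spelling to keep is arbitrary.

```python
import pandas as pd

# Hypothetical alias table: variant spelling -> spelling to keep
CITY_ALIASES = {
    'Bengaluru': 'Bangalore',
    'Bangalore City': 'Bangalore',
    'Gurugram': 'Gurgaon',
    'Cochin': 'Kochi',
    'Uttar pradesh': 'Uttar Pradesh',
}

def normalize_headquarter(value):
    """Keep the first comma-separated part and apply the alias table."""
    if pd.isna(value):
        return value  # leave missing values untouched
    city = str(value).split(',')[0].strip()
    return CITY_ALIASES.get(city, city)

sample = pd.Series(['Bengaluru', 'Gurugram', 'Jaipur, Rajastan',
                    'San Francisco, California, United States', None])
cleaned = sample.map(normalize_headquarter)
print(cleaned.tolist())
```

Keeping only the first comma-separated part discards the state or country, which may matter if the analysis later needs to separate start-ups headquartered outside India.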


### Cleaning the What it does Column

In [96]:
# Get unique values in the What it does column
unique_WiD = data_df['What it does'].unique()

# Convert the unique values array to a list
unique_WiD_list = unique_WiD.tolist()

# Print the list of unique descriptions
unique_WiD_list

['TheCollegeFever is a hub for fun, fiesta and frolic of Colleges.',
 'A startup which aggregates milk from dairy farmers in rural Maharashtra.',
 'Leading Online Loans Marketplace in India',
 'PayMe India is an innovative FinTech organization which offers short term financial suport to corporate employees.',
 'Eunimart is a one stop solution for merchants to create a difference by selling globally.',
 'Hasura is a platform that allows developers to build, deploy, and host cloud-native applications quickly.',
 'Tripshelf is an online market place for holiday packages.',
 'Hyperdata combines advanced machine learning with human intelligence.',
 'Freightwalla is an international forwarder thats helps you manage supply chain by providing online tools including instant quotations.',
 'Microchip payments is a mobile-based payment application and point-of-sale device',
 'Building Transactionary B2B Marketplaces',
 'Emojifi is an app that provides live emoji, stickers & GIFs suggestions based

### Cleaning the Year Column

In [97]:
# Get unique values in the Year column
unique_Year = data_df['Year'].unique()

# Convert the unique values array to a list
unique_Year_list = unique_Year.tolist()

# Print the list of unique Year
unique_Year_list

[2018, 2019, 2020, 2021]

### Cleaning the Founded Column

In [98]:
# Get unique values in the Founded column
unique_Founded = data_df['Founded'].unique()

# Convert the unique values array to a list
unique_Founded_list = unique_Founded.tolist()

# Print the list of unique Founded
unique_Founded_list

[nan,
 2014.0,
 2004.0,
 2013.0,
 2010.0,
 2018.0,
 2019.0,
 2017.0,
 2011.0,
 2015.0,
 2016.0,
 2012.0,
 2008.0,
 2020.0,
 1998.0,
 2007.0,
 1982.0,
 2009.0,
 1995.0,
 2006.0,
 1978.0,
 1999.0,
 1994.0,
 2005.0,
 1973.0,
 2002.0,
 2001.0,
 2021.0,
 1993.0,
 1989.0,
 2000.0,
 2003.0,
 1991.0,
 1984.0,
 1963.0]

In [103]:
# Filter the data where the "Founded" column contains NaN values
data_df[data_df['Founded'].isna()]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",2018,,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,,Mumbai,A startup which aggregates milk from dairy far...,2018,,,
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,,Gurgaon,Leading Online Loans Marketplace in India,2018,,,
3,PayMe India,"Financial Services, FinTech",Angel,2000000.0,Noida,PayMe India is an innovative FinTech organizat...,2018,,,
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,0.0,Hyderabad,Eunimart is a one stop solution for merchants ...,2018,,,
...,...,...,...,...,...,...,...,...,...,...
1646,Quicko,Taxation,,,Ahmedabad,Online tax planning and filing platform,2020,,Vishvajit Sonagara,"Zerodha fintech fund, Rainmatter"
1647,Satin Creditcare,Fintech,,,Gurgaon,A micro finance company,2020,,,Austrian Bank
1653,Leverage Edu,Edtech,,,Delhi,AI enabled marketplace that provides career gu...,2020,,Akshay Chaturvedi,"DSG Consumer Partners, Blume Ventures"
1654,EpiFi,Fintech,Seed Round,,,It offers customers with a single interface fo...,2020,,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital"


In [104]:
# Calculate the average of non-NaN values in the 'Founded' column
average_founded = data_df['Founded'].mean()

# Fill NaN values in the 'Founded' column with the calculated average
# (assigning the result back avoids calling inplace=True on a column selection)
data_df['Founded'] = data_df['Founded'].fillna(average_founded)

# Confirm that no NaN values remain in the "Founded" column
data_df[data_df['Founded'].isna()]

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What it does,Year,Founded,Founders,Investor


In [105]:
data_df['Founded'].unique()

array([2015.9667349, 2014.       , 2004.       , 2013.       ,
       2010.       , 2018.       , 2019.       , 2017.       ,
       2011.       , 2015.       , 2016.       , 2012.       ,
       2008.       , 2020.       , 1998.       , 2007.       ,
       1982.       , 2009.       , 1995.       , 2006.       ,
       1978.       , 1999.       , 1994.       , 2005.       ,
       1973.       , 2002.       , 2001.       , 2021.       ,
       1993.       , 1989.       , 2000.       , 2003.       ,
       1991.       , 1984.       , 1963.       ])

In [106]:
# Round all values in the 'Founded' column to the nearest whole number
data_df['Founded'] = data_df['Founded'].round()

# View unique values in the 'Founded' column after rounding
data_df['Founded'].unique()


array([2016., 2014., 2004., 2013., 2010., 2018., 2019., 2017., 2011.,
       2015., 2012., 2008., 2020., 1998., 2007., 1982., 2009., 1995.,
       2006., 1978., 1999., 1994., 2005., 1973., 2002., 2001., 2021.,
       1993., 1989., 2000., 2003., 1991., 1984., 1963.])

In [107]:
# Convert all values in the 'Founded' column to integers
data_df['Founded'] = data_df['Founded'].astype(int)

# View unique values in the 'Founded' column after conversion
data_df['Founded'].unique()


array([2016, 2014, 2004, 2013, 2010, 2018, 2019, 2017, 2011, 2015, 2012,
       2008, 2020, 1998, 2007, 1982, 2009, 1995, 2006, 1978, 1999, 1994,
       2005, 1973, 2002, 2001, 2021, 1993, 1989, 2000, 2003, 1991, 1984,
       1963])
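
Filling with the mean assigns 2016 to every start-up whose founding year is unknown, and after the conversion to `int` those rows can no longer be told apart from start-ups that really were founded in 2016. The sketch below shows two alternatives on a small made-up series: pandas' nullable `Int64` dtype, which keeps whole-number years while leaving gaps as `<NA>`, and a median fill paired with a flag column. The sample values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for data_df['Founded']
founded = pd.Series([np.nan, 2014.0, 2004.0, np.nan, 2019.0], name='Founded')

# Option 1: nullable integers; missing years stay missing
founded_int = founded.astype('Int64')

# Option 2: record which rows were missing, then fill with the median,
# which is less sensitive than the mean to old outliers such as 1963
was_missing = founded.isna()
founded_filled = founded.fillna(founded.median()).round().astype(int)

print(founded_int.tolist())
print(founded_filled.tolist())
print(was_missing.tolist())
```

With the flag column in place, later analysis can exclude or separately report the imputed rows.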

### Cleaning the Founders Column

In [108]:
# Get unique values in the Founders column
unique_Founders = data_df['Founders'].unique()

# Convert the unique values array to a list
unique_Founders_list = unique_Founders.tolist()

# Print the list of unique Founders
unique_Founders_list

[nan,
 'Shantanu Deshpande',
 'Adamas Belva Syah Devara, Iman Usman.',
 'Jatin Solanki',
 'Srikanth Iyer, Rama Harinath',
 'Narayana Reddy Punyala',
 'Pavan Kushwaha, Paratosh Bansal, Dip Jung Thapa',
 'Renuka Ramnath',
 'Peyush Bansal, Amit Chaudhary, Sumeet Kapahi',
 'D Padmanabhan',
 'Puneet Gupta, Sucharita Mukherjee',
 'Ishit Jethwa',
 'Rakesh Malhotra',
 'Byju Raveendran',
 'Chapman, Priya Sharma, Ashish Anantharaman',
 'Amit Modi',
 'Renato Araujo',
 'Harsimarbir Singh, Dr Vaibhav Kapoor, Dr Garima Sawhney',
 'Gautam Tambay, Parul Gupta',
 'Tushar Kumar, Prashant Singh',
 'Arihant Jain, Ajeet Kushwaha',
 'Nishant Jain, Rohan Kumar',
 'Sam Udotong',
 'Sandipan Mitra, Uttam Kumar',
 'Nukul Upadhye, Mahesh Jakhotia, Jitender Bedwal, Daya Rai, Nikhil Tripathi',
 'Vivek Gupta, Abhay Hanjura',
 'Babu Dayal, Pramod Uniyal, Lalit Mehta',
 'Neel Mehta, Nihar Vartak',
 'Deepak Garg, Gazal Kalra',
 'Vivek Prabhakar, Boris Zha',
 'Amit Acharya, Srinath Ramakkrushnan',
 'Swapnil',
 'Rajendra

### Cleaning the Investor Column

In [109]:
data_df['Investor'].unique()

array([nan, 'Sixth Sense Ventures', 'General Atlantic', ...,
       'Owl Ventures', 'Winter Capital, ETS, Man Capital',
       '3one4 Capital, Kalaari Capital'], dtype=object)

### Checking to see if everything is correct

In [110]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2719 entries, 0 to 2866
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2719 non-null   object 
 1   Sector         2701 non-null   object 
 2   Stage          2719 non-null   object 
 3   Amount         1441 non-null   float64
 4   HeadQuarter    2608 non-null   object 
 5   What it does   2719 non-null   object 
 6   Year           2719 non-null   int64  
 7   Founded        2719 non-null   int32  
 8   Founders       2174 non-null   object 
 9   Investor       2099 non-null   object 
dtypes: float64(1), int32(1), int64(1), object(7)
memory usage: 223.0+ KB


#### Missing values in the Amount column are imputed with the column mean

In [111]:
# Calculate the mean of non-null values in the 'Amount' column
mean_amount = data_df['Amount'].mean()

# Fill missing values in the 'Amount' column with the mean
# (assigning the result back avoids calling inplace=True on a column selection)
data_df['Amount'] = data_df['Amount'].fillna(mean_amount)

data_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2719 entries, 0 to 2866
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2719 non-null   object 
 1   Sector         2701 non-null   object 
 2   Stage          2719 non-null   object 
 3   Amount         2719 non-null   float64
 4   HeadQuarter    2608 non-null   object 
 5   What it does   2719 non-null   object 
 6   Year           2719 non-null   int64  
 7   Founded        2719 non-null   int32  
 8   Founders       2174 non-null   object 
 9   Investor       2099 non-null   object 
dtypes: float64(1), int32(1), int64(1), object(7)
memory usage: 223.0+ KB
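
Before this step only 1,441 of the 2,719 rows had an Amount, so 1,278 rows now share a single imputed value. Because a few very large rounds can pull the mean upwards, it may be worth comparing the result with a median computed per funding stage and keeping a flag for imputed rows. The sketch below does this on a made-up table; the data and the column names `Amount_was_missing`, `Amount_mean_fill` and `Amount_stage_median_fill` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: two stages with very different typical amounts
toy = pd.DataFrame({
    'Stage': ['Seed', 'Seed', 'Seed', 'Series C', 'Series C', 'Series C'],
    'Amount': [200_000.0, 400_000.0, np.nan, 50_000_000.0, 70_000_000.0, np.nan],
})

# Remember which rows were imputed
toy['Amount_was_missing'] = toy['Amount'].isna()

# Single overall mean, as in the cell above
toy['Amount_mean_fill'] = toy['Amount'].fillna(toy['Amount'].mean())

# Median within each stage, aligned row by row with transform
stage_median = toy.groupby('Stage')['Amount'].transform('median')
toy['Amount_stage_median_fill'] = toy['Amount'].fillna(stage_median)

print(toy)
```

In this toy example the overall mean gives the missing Seed round the same amount as the missing Series C round, while the per-stage median keeps them on their own scales.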


## Export Dataframe to CSV

In [None]:
# Export the DataFrame to a CSV file
data_df.to_csv('./dataset/data.csv', index=False)


In [None]:
# Creating dataframes based on each dataset

data1 = 'startup_funding2018.csv'
data2 = 'startup_funding2019.csv'
data3 = 'startup_funding2020.csv'
data4 = 'startup_funding2021.csv'

df1 = pd.read_csv(data1)
df2 = pd.read_csv(data2)
df3 = pd.read_csv(data3)
df4 = pd.read_csv(data4)

In [None]:
# Adding a year column to each dataset so that rows are not lost should it become necessary to merge all datasets later on
year_1 = 2018
df1['Year'] = year_1

year_2 = 2019
df2['Year'] = year_2

year_3 = 2020
df3['Year'] = year_3

year_4 = 2021
df4['Year'] = year_4

# Validate the year values by parsing them as dates; .dt.year returns them as integers again
df1['Year'] = pd.to_datetime(df1['Year'], format='%Y').dt.year

df2['Year'] = pd.to_datetime(df2['Year'], format='%Y').dt.year

df3['Year'] = pd.to_datetime(df3['Year'], format='%Y').dt.year

df4['Year'] = pd.to_datetime(df4['Year'], format='%Y').dt.year

### Data Cleaning

In [None]:
# Preview of df1
df1.head()

In [None]:
# Preview of df2
df2.head()

In [None]:
# Preview of df3
df3.head()

In [None]:
# Preview of df4
df4.head()

In [None]:
# Information on df1
df1.info()

In [None]:
# Rename columns in df1 to match columns with similar content in df2, df3 and df4
df1 = df1.rename(columns = {'Company Name' : 'Company_Brand', 'Round/Series' : 'Stage', 'About Company' : 'What_it_does', 'Amount' : 'Amount($)'})

In [None]:
# Extract Headquarters information from df1's Location column
df1['Headquarters'] = df1['Location'].str.split(',').str[0]
df1['Headquarters']

In [None]:
df1['Sector'] = df1['Industry'].str.split(',').str[0]
df1[df1['Sector'] == 'Agritech']

In [None]:
# Convert Amount from object to float data type

# First, remove the currency symbols ₹ and $ (no currency conversion is applied here)
df1['Amount($)'] = df1['Amount($)'].str.replace('₹', '', regex=False)
df1['Amount($)'] = df1['Amount($)'].str.replace('$', '', regex=False)

# Remove the comma (,) symbol
df1['Amount($)'] = df1['Amount($)'].str.replace(',', '', regex=False)

# Remove the dash (—) placeholder
df1['Amount($)'] = df1['Amount($)'].str.replace('—', '', regex=False)

# Remove all white spaces
df1['Amount($)'] = df1['Amount($)'].str.strip()

# Replace empty strings with '0'. Series.replace matches whole values, whereas
# .str.replace('', '0') would insert a '0' between every character
df1['Amount($)'] = df1['Amount($)'].replace('', '0')

# Replace all strings longer than 9 characters with '0'
max_length = 9
df1['Amount($)'] = df1['Amount($)'].apply(lambda x: '0' if len(x) > max_length else x)

df1['Amount($)'] = df1['Amount($)'].astype(np.float64).round(2)
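
An empty-string pattern is an easy pitfall in this kind of clean-up: in plain Python, `'250'.replace('', '0')` returns `'0205000'`, because the empty pattern matches between every character. The sketch below shows a stricter alternative that parses one value at a time and returns NaN, rather than 0, for anything that is not a plain number. The `parse_amount` helper is hypothetical, and it performs no currency conversion.

```python
import re
import numpy as np
import pandas as pd

def parse_amount(value):
    """Return the amount as a float, or NaN when the text is not a plain number."""
    if pd.isna(value):
        return np.nan
    # Drop currency symbols, thousands separators and whitespace
    text = re.sub(r'[₹$,\s]', '', str(value))
    # Accept only digits with an optional decimal part
    if re.fullmatch(r'\d+(\.\d+)?', text):
        return float(text)
    return np.nan

# Made-up sample; '\u2014' is the dash placeholder used for missing amounts
raw = pd.Series(['₹40,000,000', '$1,500,000', '\u2014', 'Undisclosed', ' 250000 '])
amounts = raw.map(parse_amount)
print(amounts.tolist())
```

Returning NaN keeps "unknown" separate from a genuine amount of zero, so the decision on how to fill the gaps can be made later and in one place.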

In [None]:
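# Drop the original columns now that Sector and Headquarters have been extracted
# (drop returns a new DataFrame, so the result is assigned back)
df1 = df1.drop(columns = ['Industry', 'Location'])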

In [None]:
# Checking information on df1 after cleaning the data
df1.info()

In [None]:
df2.Sector

In [None]:
# Checking df2 information before data cleaning
df2.info()

In [None]:
# Fill missing values in Founded column
df2['Founded'] = df2['Founded'].fillna(0)

In [None]:
# Convert Founded column from float to int
df2['Founded'] = df2['Founded'].astype(int)

In [None]:
# Rename Company/Brand, What it does and HeadQuarter to match the other datasets
df2 = df2.rename(columns = {'Company/Brand' : 'Company_Brand', 'What it does' : 'What_it_does', 'HeadQuarter' : 'Headquarters'})

In [None]:
# Convert Amount column from object to float data type

# Remove currency symbol $ (regex=False treats '$' as a literal character, not a regex anchor)
df2['Amount($)'] = df2['Amount($)'].str.replace('$', '', regex=False)

# Remove the comma (,) symbol
df2['Amount($)'] = df2['Amount($)'].str.replace(',', '', regex=False)

# Replace all strings longer than 9 characters with '0'
max_length = 9
df2['Amount($)'] = df2['Amount($)'].apply(lambda x: '0' if len(x) > max_length else x)

# Remove all white spaces
df2['Amount($)'] = df2['Amount($)'].str.strip()

df2['Amount($)'] = df2['Amount($)'].astype(np.float64).round(2)

In [None]:
df2.isna().sum()

In [None]:
df2[['Headquarters', 'Sector', 'Founders', 'Stage']] = df2[['Headquarters', 'Sector', 'Founders', 'Stage']].fillna('')

In [None]:
df2.isna().sum()

In [None]:
# Checking df2 after data cleaning
df2.info()

In [None]:
# df3 information before data cleaning
df3.info()

In [None]:
# Rename HeadQuarter to Headquarters and Amount to Amount($)
df3 = df3.rename(columns = {'HeadQuarter' : 'Headquarters', 'Amount' : 'Amount($)'})

In [None]:
# Fill missing values in Founded column with 0, as for df2 and df4
df3['Founded'] = df3['Founded'].fillna(0)

In [None]:
# Convert Founded column from float to int
df3['Founded'] = df3['Founded'].astype(int)

In [None]:
# # Convert Amount($) column from object to float data type

# # Remove currency symbol $
# df3['Amount($)'] = df3['Amount($)'].str.replace('$', '')

# # Remove the comma (,) symbol
# df3['Amount($)'] = df3['Amount($)'].str.replace(',', '')

# # Replace all strings longer than 9 character
# max_length = 9
# df3['Amount($)'] = df3['Amount($)'].apply(lambda x: '0' if len(x) > max_length else x)

# # Remove all white spaces
# df3['Amount($)'] = df3['Amount($)'].str.strip()

# df3['Amount($)'] = df3['Amount($)'].astype(np.float64).round(2)
df3['Amount($)'].isna().sum()

In [None]:
df3['Amount($)'] = df3['Amount($)'].fillna(0)

In [None]:
df3['Stage'].describe()

In [None]:
df3.isna().sum()

In [None]:
df3['Stage'] = df3['Stage'].fillna('Unknown')
df3['Stage'].isna().sum()

In [None]:
df3.tail(20)

In [None]:
# df3 after renaming column
df3.info()

In [None]:
# df4 information before data cleaning
df4.info()

In [None]:
# Rename HeadQuarter to Headquarters and Amount to Amount($)
df4 = df4.rename(columns = {'HeadQuarter' : 'Headquarters', 'Amount' : 'Amount($)'})

In [None]:
df4['Founded'] = df4['Founded'].fillna(0)

In [None]:
# Convert Founded column from float to int
df4['Founded'] = df4['Founded'].astype(int)

In [None]:
df4['Stage'] = df4['Stage'].fillna('')

In [None]:
# Strip currency symbols and commas from the Amount($), Stage and Investor columns

# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df4['Amount($)'] = df4['Amount($)'].str.replace('$', '', regex=False)
df4['Stage'] = df4['Stage'].str.replace('$', '', regex=False)
df4['Stage'] = df4['Stage'].str.replace(',', '', regex=False)
df4['Investor'] = df4['Investor'].str.replace('$', '', regex=False)
df4['Investor'] = df4['Investor'].str.replace(',', '', regex=False)
df4['Investor'] = df4['Investor'].str.strip()


In [None]:
# Define a function to check if a string represents an integer
def is_string_integer(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

# Apply the function to the 'Stage' column to create a boolean mask
condition = df4['Stage'].apply(is_string_integer)

# Swap values between 'Stage' and 'Amount($)' columns where condition is True
temp = df4.loc[condition, 'Stage'].copy()
df4.loc[condition, 'Stage'] = df4.loc[condition, 'Amount($)']
df4.loc[condition, 'Amount($)'] = temp



# df4[['Amount($)', 'Stage']]
df4

In [None]:
df4.groupby(by = 'Amount($)', as_index = False).sum()

In [None]:
df4['Amount($)'].isna().sum()

In [None]:
df4['Amount($)'] = df4['Amount($)'].fillna('0')

In [None]:
# Stage labels such as 'Series A' that were entered in the 'Amount($)' column
pattern = 'Series'

# Identify rows where the 'Amount($)' column contains the specified pattern
condition = df4['Amount($)'].str.contains(pattern)

# Move the values from 'Amount($)' to 'Stage' where the pattern is found
df4.loc[condition, 'Stage'] = df4.loc[condition, 'Amount($)']

# Replace the values in 'Amount($)' with NaN where the pattern is found
df4.loc[condition, 'Amount($)'] = pd.NA

df4.groupby(by = 'Amount($)', as_index = False).sum()

In [None]:
df4['Amount($)'] = df4['Amount($)'].fillna('0')

In [None]:
# Stage labels such as 'Seed' that were entered in the 'Amount($)' column
pattern = 'Seed'

# Identify rows where the 'Amount($)' column contains the specified pattern
condition = df4['Amount($)'].str.contains(pattern)

# Move the values from 'Amount($)' to 'Stage' where the pattern is found
df4.loc[condition, 'Stage'] = df4.loc[condition, 'Amount($)']

# Replace the values in 'Amount($)' with NaN where the pattern is found
df4.loc[condition, 'Amount($)'] = pd.NA

df4.groupby(by = 'Amount($)', as_index = False).sum()

In [None]:
df4['Amount($)'] = df4['Amount($)'].fillna('0')

In [None]:
# Stage labels such as 'Pre-series A' that were entered in the 'Amount($)' column
pattern = 'Pre-s'

# Identify rows where the 'Amount($)' column contains the specified pattern
condition = df4['Amount($)'].str.contains(pattern)

# Move the values from 'Amount($)' to 'Stage' where the pattern is found
df4.loc[condition, 'Stage'] = df4.loc[condition, 'Amount($)']

# Replace the values in 'Amount($)' with NaN where the pattern is found
df4.loc[condition, 'Amount($)'] = pd.NA

df4.groupby(by = 'Amount($)', as_index = False).sum()
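
The three cells above repeat the same steps for 'Series', 'Seed' and 'Pre-s'. A sketch of how they could be folded into one helper is shown below. The function name `move_stage_labels` and its parameters are hypothetical; `na=False` makes missing amounts count as "no match", so the intermediate `fillna('0')` calls are not needed, and `regex=False` matches each pattern as literal text.

```python
import pandas as pd

def move_stage_labels(frame, patterns, amount_col='Amount($)', stage_col='Stage'):
    """Move stage-like text found in amount_col into stage_col; returns a copy."""
    out = frame.copy()
    for pattern in patterns:
        # Rows whose amount text contains the pattern (missing values never match)
        mask = out[amount_col].str.contains(pattern, regex=False, na=False)
        out.loc[mask, stage_col] = out.loc[mask, amount_col]
        out.loc[mask, amount_col] = pd.NA
    return out

# Made-up sample with two misplaced stage labels
frame = pd.DataFrame({
    'Stage': ['', '', 'Seed'],
    'Amount($)': ['Series C', 'Pre-series A', '2000000'],
})
result = move_stage_labels(frame, ['Series', 'Seed', 'Pre-s'])
print(result)
```

Working on a copy leaves the input untouched, which makes it easy to compare the frame before and after the move.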

In [None]:
df4.groupby(by = 'Investor', as_index = False).sum()

In [None]:
df4['Investor'] = df4['Investor'].fillna('')

In [None]:
# Identify rows where an amount ('1000000') was entered in the 'Investor' column
condition = df4['Investor'].str.contains('1000000')

# Copy the misplaced amount from 'Investor' to 'Amount($)'
df4.loc[condition, 'Amount($)'] = df4.loc[condition, 'Investor']

# Remove any '$' symbol from the 'Amount($)' column (regex=False treats '$' literally)
df4['Amount($)'] = df4['Amount($)'].str.replace('$', '', regex=False)

In [None]:
# Identify rows where an amount ('5000000') was entered in the 'Investor' column
condition = df4['Investor'].str.contains('5000000')

# Copy the misplaced amount from 'Investor' to 'Amount($)'
df4.loc[condition, 'Amount($)'] = df4.loc[condition, 'Investor']

# Remove any '$' symbol from the 'Amount($)' column (regex=False treats '$' literally)
df4['Amount($)'] = df4['Amount($)'].str.replace('$', '', regex=False)

In [None]:
# Identify rows where an amount ('22000000') was entered in the 'Investor' column
condition = df4['Investor'].str.contains('22000000')

# Copy the misplaced amount from 'Investor' to 'Amount($)'
df4.loc[condition, 'Amount($)'] = df4.loc[condition, 'Investor']

# Remove any '$' symbol from the 'Amount($)' column (regex=False treats '$' literally)
df4['Amount($)'] = df4['Amount($)'].str.replace('$', '', regex=False)

# Mark the misplaced '1000000' entries in 'Investor' as 'NA'. The longer string is
# replaced first; otherwise the '\t#REF!' suffix would be left behind
df4['Investor'] = df4['Investor'].str.replace('1000000\t#REF!', 'NA', regex=False)
df4['Investor'] = df4['Investor'].str.replace('1000000', 'NA', regex=False)

In [None]:
df4.groupby(by = 'Investor', as_index = False).sum()

In [None]:
df4.Investor.isna().sum()

In [None]:
df4.describe().T

In [None]:
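# Convert Amount column from object to float data type

# Fill missing values so that the string operations below apply to every row
df4['Amount($)'] = df4['Amount($)'].fillna('0')

# Remove currency symbol $ (regex=False treats '$' literally)
df4['Amount($)'] = df4['Amount($)'].str.replace('$', '', regex=False)

# Remove the comma (,) symbol
df4['Amount($)'] = df4['Amount($)'].str.replace(',', '', regex=False)

# Remove all white spaces
df4['Amount($)'] = df4['Amount($)'].str.strip()

# Replace empty strings with '0'. Series.replace matches whole values, whereas
# .str.replace('', '0') would insert a '0' between every character
df4['Amount($)'] = df4['Amount($)'].replace('', '0')

# Replace all strings longer than 9 characters with '0'
max_length = 9
df4['Amount($)'] = df4['Amount($)'].apply(lambda x: '0' if len(x) > max_length else x)

df4['Amount($)'] = df4['Amount($)'].astype(np.float64).round(2)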

In [None]:
df4.isna().sum()

In [None]:
df4.info()

In [None]:
# Concatenate all DataFrames along rows
df = pd.concat([df1, df2, df3, df4], ignore_index=True)

# If there are overlapping columns, you can handle them separately if needed

# Print the concatenated DataFrame
df

In [None]:
df.drop(columns=['column10'], inplace=True)
df

In [None]:
df.describe().T.round(2)

In [None]:
df.isna().sum()

In [None]:
df['Amount($)'].mean().round(2)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# df.duplicated().sum()

In [None]:
# df.drop_duplicates()

### Visualizing Characteristics of the Dataset

In [None]:
# Visualize the distribution of funding amounts by year with a boxplot
df.plot.box(column='Amount($)', by='Year')

In [None]:
# Visualize the distribution of funding amounts by founding year with a boxplot
df.plot.box(column='Amount($)', by='Founded')

In [None]:
# Distribution of the variables
df.hist(figsize=(20, 15))