# Title
Indian Companies Initial Startup Capital Analysis

# Project Description
Team niece has been tasked with understanding the Indian start-up landscape and how a new business can best attract funding from investors, and with proposing a workable course of action.

# Hypothesis
### Null Hypothesis, H0
There are no exact key factors that correlate with the exact amount of funding contributed by the investors.
### Alternative Hypothesis, H1
There are exact key factors that correlate with the exact amount of funding contributed by investors.

# Questions
1. Which year has the most funding?
2. Which industry has the most funding?
3. Which industry is fastest growing in funding?
4. Which area has the most funding in India?
5. What is the average funding per year?


## Importation
Here is the section to import all the packages/libraries that will be used throughout this notebook.

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Visualisation (Matplotlib, Plotly, Seaborn, etc. )
import seaborn as sns
import matplotlib.pyplot as plt

# EDA (pandas-profiling, etc. )
...

# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages


Ellipsis

# Data Loading
Here is the section to load the datasets: one CSV file per funding year, from 2018 to 2021.

In [3]:
sheet1=pd.read_csv("startup_funding2018.csv")
sheet2=pd.read_csv("startup_funding2019.csv")
sheet3=pd.read_csv("startup_funding2020.csv")
sheet4=pd.read_csv("startup_funding2021.csv")
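
Questions 1 and 5 depend on the funding year, which appears only in the file names. The following is a minimal sketch of one way to carry the year into the data. The `Year` column and the `tag_year` helper are assumptions (neither is part of this notebook), and the two frames are tiny stand-ins rather than the real files.

```python
import pandas as pd

def tag_year(frames_by_year):
    """Return one frame per year with a hypothetical 'Year' column added."""
    # assign() returns a copy, so the original frames are left untouched
    return {year: df.assign(Year=year) for year, df in frames_by_year.items()}

# Stand-in frames; in the notebook these would be sheet1 ... sheet4
demo = {
    2018: pd.DataFrame({"Company Name": ["TheCollegeFever"]}),
    2019: pd.DataFrame({"Company/Brand": ["Bombay Shaving"]}),
}
tagged = tag_year(demo)
print(tagged[2019])
```

Keeping the year as a column means it survives if the sheets are later combined into one table.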

# Exploratory Data Analysis: EDA
Here is the section to **inspect** the datasets in depth, **present** them, make **hypotheses** and **think** through the *cleaning, processing and feature creation*.

## Dataset overview

Have a look at the loaded datasets using the following methods: `.head()`, `.info()`

In [4]:
#checking the shape of each file
print(sheet1.shape,sheet2.shape,sheet3.shape,sheet4.shape)

(526, 6) (89, 9) (1055, 10) (1209, 9)


In [5]:
sheet1.head(2)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...


In [6]:
sheet2.head(2)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C


In [7]:
sheet3.head(2)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Aqgromalin,2019,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,"$200,000",,
1,Krayonnz,2019,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,"$100,000",Pre-seed,


In [8]:
sheet4.head(2)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",


## Issues with the data
1. column names of sheet1 are different from those of the other sheets
2. sheet3 has an unnecessary column ('Unnamed: 9')
3. sheet1 doesn't have the columns Founders, Investor and Founded
4. sheet1 amount column has rupees, dollars and commas, and is stored as a string
5. amount columns of the other sheets have dollar signs and commas, and are stored as strings
6. there are null values in the Founded, HeadQuarter, Sector and Stage columns
7. the location column of sheet1 holds more information (city, state and country) than HeadQuarter in the other sheets
8. the values in the sector column are not consistent across the sheets
9. amount column has null values
10. datatypes are mostly object
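
Issues 1 to 3 were spotted by reading the previews. As a rough sketch, the same check can be done programmatically by comparing column sets. The helper `column_differences` is hypothetical, and the frames below are empty stand-ins that carry only the column names shown in the previews.

```python
import pandas as pd

def column_differences(reference, other):
    """Return (missing, extra): columns of `reference` absent from `other`, and the reverse."""
    ref_cols, other_cols = set(reference.columns), set(other.columns)
    return sorted(ref_cols - other_cols), sorted(other_cols - ref_cols)

# Empty stand-in frames that carry only the column names from the previews
cols_2018 = ["Company Name", "Industry", "Round/Series", "Amount", "Location", "About Company"]
cols_2019 = ["Company/Brand", "Founded", "HeadQuarter", "Sector", "What it does",
             "Founders", "Investor", "Amount($)", "Stage"]
s1 = pd.DataFrame(columns=cols_2018)
s2 = pd.DataFrame(columns=cols_2019)
s3 = pd.DataFrame(columns=cols_2019 + ["Unnamed: 9"])

# Compare the 2020 layout against the 2019 layout
missing, extra = column_differences(s2, s3)
print(missing, extra)
```

In the notebook, the same call on `sheet2` and `sheet1` would list every column that has to be renamed or added.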

## How we intend to handle each issue identified
1. change column names of sheet1 to match the others
2. drop unnecessary column in sheet3
3. add missing columns to sheet1 with null values
4. convert sheet1 amount column to dollars in float
5. convert amount of other sheets to dollars in float
6. we'll leave the null values in these columns since they are not numbers
7. split the HeadQuarter column of sheet1 on commas and keep only the first part (the city)
8. make the values consistent, for example: "Ecommerce" and "E-Commerce Platforms" should both be "E-commerce"
9. replace null values in amount column by calculating the mean or median
10. change datatypes accordingly for each column
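
Step 8 is not implemented in the cells shown here. The following is a minimal sketch of one possible approach, assuming a hand-written lookup table. `SECTOR_MAP` and `normalise_sector` are hypothetical names, and the mapping covers only a few labels for illustration.

```python
import pandas as pd

# Hypothetical mapping from lower-cased raw labels to one canonical label
SECTOR_MAP = {
    "ecommerce": "E-commerce",
    "e-commerce": "E-commerce",
    "e-commerce platforms": "E-commerce",
    "edtech": "EdTech",
    "foodtech": "Food tech",
    "food tech": "Food tech",
}

def normalise_sector(value):
    """Map a raw sector label to its canonical form; unknown labels are only stripped."""
    if pd.isna(value):
        return value  # leave missing sectors missing
    cleaned = str(value).strip()
    return SECTOR_MAP.get(cleaned.lower(), cleaned)

raw = pd.Series(["Ecommerce", "E-Commerce Platforms", "Edtech", "AgriTech", None])
normalised = raw.map(normalise_sector)
print(normalised.tolist())
```

The sketch handles single labels only. Entries in sheet1 that list several sectors separated by commas would need an extra rule, for example keeping the first label before the lookup.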

In [9]:
sheet1.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [10]:
sheet2.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [11]:
sheet3.head(1)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage,Unnamed: 9
0,Aqgromalin,2019,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,"$200,000",,


In [12]:
sheet4.head(1)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A


In [23]:
sheet1['Sector'].unique()

array(['Brand Marketing, Event Promotion, Marketing, Sponsorship, Ticketing',
       'Agriculture, Farming',
       'Credit, Financial Services, Lending, Marketplace',
       'Financial Services, FinTech',
       'E-Commerce Platforms, Retail, SaaS',
       'Cloud Infrastructure, PaaS, SaaS',
       'Internet, Leisure, Marketplace', 'Market Research',
       'Information Services, Information Technology', 'Mobile Payments',
       'B2B, Shoes', 'Internet',
       'Apps, Collaboration, Developer Platform, Enterprise Software, Messaging, Productivity Tools, Video Chat',
       'Food Delivery', 'Industrial Automation',
       'Automotive, Search Engine, Service Industry',
       'Finance, Internet, Travel',
       'Accounting, Business Information Systems, Business Travel, Finance, SaaS',
       'Artificial Intelligence, Product Search, SaaS, Service Industry, Software',
       'Internet of Things, Waste Management',
       'Air Transportation, Freight Service, Logistics, Marine Transport

In [None]:
sheet2['Sector'].unique()

array(['Ecommerce', 'Edtech', 'Interior design', 'AgriTech', 'Technology',
       'SaaS', 'AI & Tech', 'E-commerce', 'E-commerce & AR', 'Fintech',
       'HR tech', 'Food tech', 'Health', 'Healthcare', 'Safety tech',
       'Pharmaceutical', 'Insurance technology', 'AI', 'Foodtech', 'Food',
       'IoT', 'E-marketplace', 'Robotics & AI', 'Logistics', 'Travel',
       'Manufacturing', 'Food & Nutrition', 'Social Media', nan,
       'E-Sports', 'Cosmetics', 'B2B', 'Jewellery', 'B2B Supply Chain',
       'Games', 'Food & tech', 'Accomodation', 'Automotive tech',
       'Legal tech', 'Mutual Funds', 'Cybersecurity', 'Automobile',
       'Sports', 'Healthtech', 'Yoga & wellness', 'Virtual Banking',
       'Transportation', 'Transport & Rentals',
       'Marketing & Customer loyalty', 'Infratech', 'Hospitality',
       'Automobile & Technology', 'Banking'], dtype=object)

In [None]:
sheet1.isnull().sum()

Company/Brand      0
Sector             0
Stage              0
Amount($)          0
HeadQuarter        0
What it does       0
Founded          526
Founders         526
Investor         526
dtype: int64

In [None]:
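#checking the number of null values in the amount column,
#because isnull() didn't show them: amountCleaner stored missing amounts as the string 'NaN'
#the column mixes strings and floats, so a substring test such as `'NaN' in x` raises
#"TypeError: argument of type 'float' is not iterable"; coercing to numeric avoids this
nullSum = pd.to_numeric(sheet1['Amount($)'], errors='coerce').isna().sum()
nullSum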

In [None]:
sheet2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


## Verify Data Quality
The data is dirty and needs a lot of cleaning before we can use it. It also has many null values: in sheet2, for example, only 43 of 89 rows have a Stage and only 60 of 89 have a Founded year. So the quality is low.
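
To put a number on the missing data, the share of nulls per column can be computed directly. This is a minimal sketch: `missing_share` is a hypothetical helper, and the stand-in frame reuses the `Founded`, `Stage` and `Sector` values of the first four sheet2 rows shown earlier.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)

# Stand-in frame; in the notebook this would be called on sheet1 ... sheet4
demo = pd.DataFrame({
    "Founded": [np.nan, 2014.0, np.nan, 2014.0],
    "Stage": [np.nan, "Series C", "Fresh funding", "Series D"],
    "Sector": ["Ecommerce", "Edtech", "Edtech", "Interior design"],
})
report = missing_share(demo)
print(report)
```

Note that `isna()` only counts true nulls. Placeholders stored as text, such as the '—' in the sheet1 amount column, are not counted until they are converted.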

## Data Cleaning

1. change column names of sheet1 to match the others

In [15]:
#changing column names
sheet1.rename(columns={'Company Name': 'Company/Brand', 'Industry': 'Sector',
                       'Round/Series': 'Stage', 'Amount': 'Amount($)',
                       'Location': 'HeadQuarter', 'About Company': 'What it does'},
              inplace=True)

2. drop unnecessary column in sheet3

In [16]:
sheet3.drop(columns=['Unnamed: 9'], inplace=True)


3. add missing columns to sheet1 with null values

In [17]:
sheet1["Founded"] = np.nan
sheet1["Founders"] = np.nan
sheet1["Investor"] = np.nan
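
As an alternative to adding the columns one by one, `DataFrame.reindex(columns=...)` adds any absent column filled with NaN and also puts the columns in a shared order. This is a sketch only: `TARGET_COLUMNS` is a hypothetical name, and the frame is a one-row stand-in for sheet1 after renaming, with abbreviated values.

```python
import pandas as pd

# Column layout shared by the 2019-2021 sheets (taken from the previews)
TARGET_COLUMNS = ["Company/Brand", "Founded", "HeadQuarter", "Sector", "What it does",
                  "Founders", "Investor", "Amount($)", "Stage"]

# Stand-in for sheet1 after its columns have been renamed (values abbreviated)
demo = pd.DataFrame({
    "Company/Brand": ["TheCollegeFever"],
    "Sector": ["Brand Marketing"],
    "Stage": ["Seed"],
    "Amount($)": ["250000"],
    "HeadQuarter": ["Bangalore, Karnataka, India"],
    "What it does": ["..."],
})

# reindex adds the absent columns (filled with NaN) and applies the target order
aligned = demo.reindex(columns=TARGET_COLUMNS)
print(aligned.columns.tolist())
```

A shared column order is convenient if the four sheets are later stacked into a single table.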

4. convert sheet1 amount column to dollars in float

In [18]:
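def amountCleaner(sheet):
    # mark the '—' placeholder as 'NaN' and strip commas and dollar signs;
    # the values are still strings after this step
    sheet['Amount($)'] = sheet['Amount($)'].apply(
        lambda x: str(x).replace('—', 'NaN').replace(',', '').replace('$', ''))

amountCleaner(sheet1)

#convert rupees to dollars if it has rupees sign (fixed rate: 1 rupee = 0.012 dollars)
#values without the rupee sign are left unchanged
sheet1['Amount($)'] = sheet1['Amount($)'].apply(
    lambda x: float(str(x).replace('₹', '')) * 0.012 if '₹' in x else x)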

5. convert amount of other sheets to dollars in float

In [19]:
amountCleaner(sheet2)
amountCleaner(sheet3)
amountCleaner(sheet4)

In [20]:
sheet2.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,6300000,
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,150000000,Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey",28000000,Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...",30000000,Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),6000000,
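
After `amountCleaner`, the amounts are still strings. The following is a minimal sketch of the remaining conversion to float, followed by step 9 of the plan. `amount_to_float` is a hypothetical helper, and the values are stand-ins in the shape `amountCleaner` leaves them. The median is used here only as one of the two options named in the plan.

```python
import pandas as pd

def amount_to_float(series):
    """Convert cleaned amount strings to floats; anything unparseable becomes NaN."""
    return pd.to_numeric(series, errors="coerce")

# Stand-in values in the shape amountCleaner leaves them
cleaned = pd.Series(["6300000", "150000000", "NaN", "28000000"])
amounts = amount_to_float(cleaned)

# Step 9 of the plan: fill missing amounts, here with the median
filled = amounts.fillna(amounts.median())
print(filled.tolist())
```

With `errors="coerce"`, any leftover text in the column becomes NaN instead of raising an error, so it is worth checking the null count before and after the conversion.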


7. split the HeadQuarter column of sheet1 on commas and keep only the first part (the city)

In [27]:
#keep only the first element (the city) of "City, State, Country";
#this also works when a row does not have exactly three parts
sheet1['HeadQuarter'] = sheet1['HeadQuarter'].str.split(', ').str[0]

In [28]:
sheet1.head(2)

Unnamed: 0,Company/Brand,Sector,Stage,Amount($),HeadQuarter,What it does,Founded,Founders,Investor
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f...",,,
1,Happy Cow Dairy,"Agriculture, Farming",Seed,480000.0,Mumbai,A startup which aggregates milk from dairy far...,,,
