# Investigating the Indian Start-up Ecosystem to Propose the Best Indian Start-up to Invest In

## Description
The objective of this project is to analyse Indian start-up investment data over four years (2018-2021) to find out which funding stages are most attractive to investors and at what level of risk.

# GOAL
The goal of this project is to propose the best Indian start-up to invest in.

## Null Hypothesis 
The average investment amount received by start-ups has no relation to the sector they operate in.


## Alternate Hypothesis
There is a relationship between the average investment amount received by start-ups and the sector they operate in.
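
One way to put the two hypotheses to a test is a one-way ANOVA, which compares how far the sector means sit from each other with how much amounts vary inside each sector. The sketch below computes the F statistic by hand with pandas on made-up numbers; the function name `one_way_anova_f` and the toy figures are assumptions for illustration, not results from the funding data.

```python
import pandas as pd

def one_way_anova_f(frame, group_col, value_col):
    """Return the one-way ANOVA F statistic for value_col grouped by group_col."""
    groups = frame.groupby(group_col)[value_col]
    grand_mean = frame[value_col].mean()
    n_total = frame[value_col].count()
    k = groups.ngroups
    # Variation of the group means around the grand mean
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Variation of each observation around its own group mean
    ss_within = ((frame[value_col] - groups.transform('mean')) ** 2).sum()
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up amounts (in millions) for three hypothetical sectors
toy = pd.DataFrame({
    'Sector': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Amount': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
f_stat = one_way_anova_f(toy, 'Sector', 'Amount')
print(f_stat)
```

A large F suggests that the sector means differ by more than the within-sector spread would explain. Funding amounts tend to be heavily skewed, so a log transform or a rank-based test may suit the real data better.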

## Analytical Questions
1. Does location affect the amount of funding a start-up receives?
2. Does a start-up's sector affect the funding it receives?
3. How many companies need funding, and at what funding level are they?
4. Which sectors receive the highest investment amounts?
5. Which cities have the highest number of start-ups, and at what funding levels?
6. What levels of funding are start-ups receiving?
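
Several of these questions reduce to group-by aggregations once the data is clean. The sketch below illustrates questions 4 and 5 on a made-up frame; the rows are invented, and only the column names follow the ones used later in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Sector': ['FinTech', 'FinTech', 'EdTech', 'EdTech', 'AgriTech'],
    'HeadQuarter': ['Mumbai', 'Bangalore', 'Bangalore', 'Bangalore', 'Pune'],
    'Stage': ['Seed', 'Series A', 'Seed', 'Seed', 'Seed'],
    'Amount': [2_000_000, 10_000_000, 500_000, 1_500_000, 300_000],
})

# Question 4: total and average amount per sector, largest total first
by_sector = (toy.groupby('Sector')['Amount']
                .agg(['sum', 'mean'])
                .sort_values('sum', ascending=False))

# Question 5: number of start-ups per city and funding stage
by_city_stage = toy.groupby(['HeadQuarter', 'Stage']).size()

print(by_sector)
print(by_city_stage)
```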

In [2]:
#Libraries imported
import MySQLdb
import sqlalchemy as sa
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import warnings 


In [3]:
env_variables= dotenv_values('logins.env')
database= env_variables.get('database')
server = env_variables.get('server')
username = env_variables.get('username')
password = env_variables.get('password')
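
Because `.get` returns `None` for a key that is missing from the file, a quick check before connecting can catch an incomplete `logins.env`. This is a minimal sketch; the helper name `missing_keys` and the example values are hypothetical.

```python
def missing_keys(config, required=('database', 'server', 'username', 'password')):
    """Return the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]

# Hypothetical values standing in for the contents of logins.env
example_config = {'database': 'demo_db', 'server': 'demo.example.com', 'username': 'analyst'}
print(missing_keys(example_config))  # ['password']
```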



# Data Understanding

- There are four data sources to work with (2 SQL and 2 CSVs)
- Explore data
- Verify data quality

### Connecting to the dapDB to extract the 2020 and 2021 data

In [4]:
#Connecting to the database to analyse the 2020-2021 data

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
connection = pyodbc.connect(connection_string)

OperationalError: ('08001', '[08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]SQL Server does not exist or access denied. (17) (SQLDriverConnect); [08001] [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionOpen (Connect()). (53); [08001] [Microsoft][ODBC SQL Server Driver]Invalid connection string attribute (0)')
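
The error above reports both an unreachable server and an invalid connection string attribute. One way to make the string easier to inspect is to assemble it from a dictionary, so that optional attributes can be dropped one at a time while debugging. The helper name `build_connection_string` and the example values are assumptions for illustration; the sketch only builds the string and does not connect to anything.

```python
def build_connection_string(driver, server, database, username, password, extras=None):
    """Assemble an ODBC connection string from its parts."""
    parts = {
        'DRIVER': '{' + driver + '}',
        'SERVER': server,
        'DATABASE': database,
        'UID': username,
        'PWD': password,
    }
    # Optional attributes are appended after the required ones
    parts.update(extras or {})
    return ';'.join(f'{key}={value}' for key, value in parts.items()) + ';'

conn_str = build_connection_string(
    'SQL Server', 'demo.example.com', 'demo_db', 'analyst', 'secret',
    extras={'MARS_Connection': 'yes'},
)
print(conn_str)
```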

In [None]:
#query the 2020 startup funding data

query = "SELECT * FROM LP1_startup_funding2020"

data_2020 = pd.read_sql(query, connection)
data_2020.columns



Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [None]:
data_2020.head(2)
data_2020['Investment_year'] = '2020'

In [None]:
#Checking the datatypes of the columns
datatypes = data_2020.dtypes
datatypes

Company_Brand       object
Founded            float64
HeadQuarter         object
Sector              object
What_it_does        object
Founders            object
Investor            object
Amount             float64
Stage               object
column10            object
Investment_year     object
dtype: object

In [None]:
data_2020.Amount.unique()

array([2.0000000e+05, 1.0000000e+05,           nan, 4.0000000e+05,
       3.4000000e+05, 6.0000000e+05, 4.5000000e+07, 1.0000000e+06,
       2.0000000e+06, 1.2000000e+06, 6.6000000e+08, 1.2000000e+05,
       7.5000000e+06, 5.0000000e+06, 5.0000000e+05, 3.0000000e+06,
       1.0000000e+07, 1.4500000e+08, 1.0000000e+08, 2.1000000e+07,
       4.0000000e+06, 2.0000000e+07, 5.6000000e+05, 2.7500000e+05,
       4.5000000e+06, 1.5000000e+07, 3.9000000e+08, 7.0000000e+06,
       5.1000000e+06, 7.0000000e+08, 2.3000000e+06, 7.0000000e+05,
       1.9000000e+07, 9.0000000e+06, 4.0000000e+07, 7.5000000e+05,
       1.5000000e+06, 7.8000000e+06, 5.0000000e+07, 8.0000000e+07,
       3.0000000e+07, 1.7000000e+06, 2.5000000e+06, 4.0000000e+04,
       3.3000000e+07, 3.5000000e+07, 3.0000000e+05, 2.5000000e+07,
       3.5000000e+06, 2.0000000e+08, 6.0000000e+06, 1.3000000e+06,
       4.1000000e+06, 5.7500000e+05, 8.0000000e+05, 2.8000000e+07,
       1.8000000e+07, 3.2000000e+06, 9.0000000e+05, 2.5000000e

In [None]:
#query the 2021 startup funding data
query = "SELECT * FROM LP1_startup_funding2021"

data_2021 = pd.read_sql(query, connection)
data_2021.head(1)




Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A


In [None]:
data_2021.Amount.unique()
data_2021.shape

(1209, 9)

In [None]:
#Checking the datatypes of the columns
datatypes = data_2021.dtypes
datatypes

Company_Brand     object
Founded          float64
HeadQuarter       object
Sector            object
What_it_does      object
Founders          object
Investor          object
Amount            object
Stage             object
dtype: object

In [None]:
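#'$' is a regex end-of-string anchor, so it is matched literally here
data_2021[data_2021['Amount'].str.contains('$', regex=False, na = False)]
#Removing every non-word character (currency symbols, commas) from Amount
data_2021['Amount']=data_2021.Amount.str.replace(r'\W', '', regex=True)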
data_2021.head(2)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000,Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000,
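
In pandas, `str.contains` treats its pattern as a regular expression by default, and `$` on its own is the end-of-string anchor rather than a dollar sign. The short sketch below, on made-up values, shows the difference between the two readings.

```python
import pandas as pd

amounts = pd.Series(['$1,200,000', 'Undisclosed', None])

# As a regex, '$' matches the end of every string
as_regex = amounts.str.contains('$', na=False)
# As a literal, it matches only values that contain a dollar sign
as_literal = amounts.str.contains('$', regex=False, na=False)

print(as_regex.tolist())    # [True, True, False]
print(as_literal.tolist())  # [True, False, False]
```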


In [None]:
#Combining the Investor, Amount and Stage columns, since data entry errors left some amounts in the wrong column
data_2021['new'] =  data_2021['Investor'].fillna('inv') +data_2021['Amount'].astype(str) + data_2021['Stage'].fillna('ab') 
#data_2021['new']

In [None]:
#Taking the first run of digits in the combined string as the amount
data_2021['Amount_new']=data_2021['new'].str.extract(r'(\d+)')
data_2021.head(1)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,new,Amount_new
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000,Pre-series A,"BEENEXT, Entrepreneur First1200000Pre-series A",1200000
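
Taking the first run of digits from the combined string would also pick up any digits that happen to appear in an investor name. A possible alternative, sketched here on made-up rows, parses each column separately and falls back from `Amount` to `Stage` to `Investor`. The helper `to_dollars` is hypothetical, and the sketch assumes the raw `$1,200,000` format rather than the already stripped values.

```python
import pandas as pd

def to_dollars(series):
    """Parse values such as '$1,200,000' into floats; anything else becomes NaN."""
    cleaned = series.astype(str).str.replace(r'[$,]', '', regex=True).str.strip()
    return pd.to_numeric(cleaned, errors='coerce')

# Made-up rows: the second one has its amount in the Stage column
toy = pd.DataFrame({
    'Investor': ['BEENEXT', '9Unicorns', 'Angel investors'],
    'Amount': ['$1,200,000', 'Seed', 'Undisclosed'],
    'Stage': ['Pre-series A', '$300,000', None],
})

# Prefer Amount, then fall back to Stage and Investor when Amount is not numeric
toy['Amount_new'] = (to_dollars(toy['Amount'])
                     .fillna(to_dollars(toy['Stage']))
                     .fillna(to_dollars(toy['Investor'])))
print(toy['Amount_new'].tolist())
```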


In [None]:
#Dropping the helper column and the original Amount column
data_2021.drop(columns=['new','Amount'],inplace=True)



In [None]:
data_2021.head(2)
data_2021['Investment_year'] = '2021'

In [None]:
#Renaming the Amount_new column
data_2021.rename(columns={'Amount_new':'Amount'}, inplace=True)

#2021 data cleaned

In [None]:
#Reading 2018 data from the csv files

data_2018=pd.read_csv(r'startup_funding_2018_2019\startup_funding2018.csv')
data_2018.head(5)

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [None]:
#Changing column names of 2018 data to match all other datasets
data_2018.rename(columns={'Company Name':'Company_Brand','Industry':'Sector', 'Round/Series':'Stage', 'Location':'HeadQuarter', 'About Company':'What_it_does'}, inplace=True)
data_2018.head(5)

Unnamed: 0,Company_Brand,Sector,Stage,Amount,HeadQuarter,What_it_does
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [None]:
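def curr_converter(df,rate):
    #df is a list of raw amounts; rate is the number of rupees per dollar
    amount_new=[]
    for a in df:
        if not isinstance(a, str):
            #Missing or already numeric values are kept unchanged
            amount_new.append(a)
        elif a.startswith('$'):
            #Dollar amounts only lose the symbol and the thousands separators
            amount_new.append(a.split('$')[1].replace(',',''))
        elif a.startswith('₹'):
            #Rupee amounts are converted to dollars
            amount_new.append(float((a.split('₹')[1]).replace(',',''))/rate)
        else:
            amount_new.append(a)
    return amount_new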
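
The converter above returns a mix of strings and floats. A variant that always returns a float, or `None` when nothing can be parsed, could look like the sketch below. The name `convert_amount` and the rate of 80.0 are placeholders chosen for round numbers, not values taken from this analysis.

```python
def convert_amount(value, rupees_per_dollar):
    """Return the amount in dollars as a float, or None when it cannot be parsed."""
    if not isinstance(value, str):
        return None
    text = value.strip().replace(',', '')
    try:
        if text.startswith('₹'):
            return float(text[1:]) / rupees_per_dollar
        if text.startswith('$'):
            return float(text[1:])
        return float(text)  # no symbol: assumed to be dollars already
    except ValueError:
        return None

# 80.0 is a placeholder rate, not a real exchange rate
print(convert_amount('₹40,000,000', 80.0))  # 500000.0
print(convert_amount('$6,300,000', 80.0))   # 6300000.0
print(convert_amount('Undisclosed', 80.0))  # None
```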

In [None]:
#Converting rupee (₹) amounts to dollars and stripping the currency symbols
exch_rate= 158.38
amount_list=data_2018.Amount.tolist()
amount_new=curr_converter(amount_list,exch_rate)
data_2018['Amount']=amount_new
data_2018['Investment_year'] = '2018'

In [None]:
#Reading 2019 data from the csv files
data_2019=pd.read_csv(r'startup_funding_2018_2019\startup_funding2019.csv')
data_2019.head(5)

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [None]:
#Changing column names of 2019 data to match all other datasets
data_2019.rename(columns={'Company/Brand':'Company_Brand', 'What it does':'What_it_does', 'Amount($)':'Amount'}, inplace=True)
data_2019.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [None]:
#Converting rupee (₹) amounts to dollars and stripping the currency symbols
exch_rate= 177.13
amount_list=data_2019.Amount.tolist()
amount_new=curr_converter(amount_list,exch_rate)
data_2019['Amount']=amount_new
data_2019['Investment_year'] = '2019'

### Merging datasets

In [None]:
#Concatenating the 2018-2021 data now that the column names match
pd.set_option('display.max_rows', None)
final_df = pd.concat([data_2021,data_2020,data_2019,data_2018],axis=0,ignore_index=True)
final_df.head(5)
final_df.shape

(2879, 11)
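
`pd.concat` aligns the frames by column name, so a column that exists in only some years (for example `Founded`, which the 2018 data lacks) is filled with NaN for the others. A minimal sketch on made-up frames:

```python
import pandas as pd

older = pd.DataFrame({'Company_Brand': ['A'], 'Amount': [100], 'Investment_year': ['2018']})
newer = pd.DataFrame({'Company_Brand': ['B'], 'Founded': [2019.0], 'Amount': [200],
                      'Investment_year': ['2021']})

# Columns are matched by name; rows from `older` get NaN in Founded
combined = pd.concat([newer, older], axis=0, ignore_index=True)
print(combined.shape)                       # (2, 4)
print(combined['Founded'].isna().tolist())  # [False, True]
```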

In [None]:
#Saving the combined dataset to csv
final_df.to_csv(r"startup_funding_2018_2019\combined.csv",index=False)

# Data Cleaning & Exploration

In [5]:
#Reading the combined dataset back in and inspecting it
df= pd.read_csv(r'startup_funding_2018_2019\combined.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Company_Brand    2879 non-null   object 
 1   Founded          2110 non-null   float64
 2   HeadQuarter      2765 non-null   object 
 3   Sector           2861 non-null   object 
 4   What_it_does     2879 non-null   object 
 5   Founders         2334 non-null   object 
 6   Investor         2253 non-null   object 
 7   Stage            1941 non-null   object 
 8   Amount           2494 non-null   object 
 9   Investment_year  2879 non-null   int64  
 10  column10         2 non-null      object 
dtypes: float64(1), int64(1), object(9)
memory usage: 247.5+ KB
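
The summary above shows `Investment_year` as int64 even though it was assigned as a string, because `read_csv` infers types again on load, while `Amount` stays non-numeric because it still holds text values. The sketch below reproduces this on a made-up frame using an in-memory buffer, then coerces the amounts with `pd.to_numeric`; the column `Amount_num` is a hypothetical name.

```python
import io
import pandas as pd

before = pd.DataFrame({'Amount': ['1200000', 'Undisclosed'],
                       'Investment_year': ['2021', '2020']})

# Round-trip through CSV in memory instead of a file on disk
buffer = io.StringIO()
before.to_csv(buffer, index=False)
buffer.seek(0)
after = pd.read_csv(buffer)

print(before['Investment_year'].dtype, '->', after['Investment_year'].dtype)

# Text that is not a number becomes NaN instead of raising an error
after['Amount_num'] = pd.to_numeric(after['Amount'], errors='coerce')
print(after)
```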


In [None]:
df.shape

(2879, 11)

In [None]:
#Dropping duplicate rows and keeping the last occurrence. The concat that originally built f_df
#is commented out (dollar_investment and lahk_investment are not defined above),
#so f_df starts as a copy of df.
#f_df = pd.concat([df,dollar_investment,lahk_investment],axis=0,ignore_index=True)
f_df = df.copy()
f_df.drop_duplicates(subset=['Stage','Founded','Founders','Company_Brand','HeadQuarter','Investor','Sector','What_it_does','column10'],  keep='last', inplace=True, ignore_index=False)
f_df.shape
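
As a small illustration of how `subset` and `keep='last'` interact, the made-up frame below has two rows for the same company and stage that differ only in how the amount is written; only the later row survives.

```python
import pandas as pd

toy = pd.DataFrame({
    'Company_Brand': ['Acme', 'Acme', 'Beta'],
    'Stage': ['Seed', 'Seed', 'Series A'],
    'Amount': ['$1,000,000', '1000000', '5000000'],
})

# Rows count as duplicates when the subset columns match; the last one is kept
deduped = toy.drop_duplicates(subset=['Company_Brand', 'Stage'], keep='last')
print(deduped)
```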

In [6]:
#Splitting the HeadQuarter column at the first comma, then splitting the remainder again
df[['Town', 'Other']] = df['HeadQuarter'].str.split(',', n=1, expand=True)
df[['City', 'Country']] = df['Other'].str.split(',', n=1, expand=True)
#df


In [7]:
# Function to fill the new column: use Town when present, otherwise fall back to City
def fill_new_column(row):
    if pd.notna(row['Town']) and row['Town'].strip() != '':
        return row['Town']
    else:
        return row['City']
    
# Create a new column filled with values from one column or default value if empty
df['Headquarter_City'] = df.apply(fill_new_column, axis=1)    

df.head(2)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,Amount,Investment_year,column10,Town,Other,City,Country,Headquarter_City
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,1200000,2021,,Bangalore,,,,Bangalore
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,120000000,2021,,Mumbai,,,,Mumbai
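
The split step can be tried on a few made-up values to see what each piece contains. The sketch below also shows a shorter route to the city that skips the helper columns; it is an alternative under the assumption that the city is always the first comma-separated piece.

```python
import pandas as pd

hq = pd.Series(['Bangalore, Karnataka, India', 'Mumbai', None])

# n=1 splits only at the first comma, so the remainder stays in one piece
parts = hq.str.split(',', n=1, expand=True)
print(parts)

# Taking the first piece directly gives the city without helper columns
city = hq.str.split(',').str[0].str.strip()
print(city.tolist())
```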


Technology & Software: Startups primarily focused on developing software, AI, IT solutions, and technology-related services.

Examples: AI startup, Tech Startup, IT startup, Software Startup, SaaS startup, MLOps platform, Digital platform, Blockchain startup, Automation, Podcast, Trading platform, Social network, Mobile Games, Computer software.

E-commerce & Retail: Startups involved in online retail, e-commerce platforms, marketplace solutions, and retail-focused businesses.

Examples: E-commerce, B2B E-commerce, Retail, Social commerce, Food Industry, Content commerce, B2B Manufacturing, Home Decor, Consumer Electronics, Apparel & Fashion.

Finance & FinTech: Startups operating in financial services, banking, financial technology (FinTech), cryptocurrency, and related areas.

Examples: FinTech, Banking, Financial Services, Trading platform, Cryptocurrency, Digital mortgage, Venture Capital, Insurance, Insuretech, Consumer finance.

Healthcare & HealthTech: Startups in the healthcare industry, including health technology (HealthTech), telemedicine, medical devices, and healthcare services.

Examples: HealthTech, Healthcare, Hospital & Health Care, Health, Health care, Healtcare, Insuretech, Medical.

Others: Startups that do not fit directly into the above categories or have unique business models.

Examples: EdTech (Education Technology), AgriTech (Agricultural Technology), Food & Beverages, Hospitality, Logistics & Supply Chain, Transportation, Renewable Energy, Robotics, Aerospace, Automotive, Gaming, Fashion, Real Estate, Media, Social Media, Consumer Goods, Industrial Automation, Lifestyle, Food delivery, LegalTech, Rental, Recruitment, Construction, Sports, Spirituality, Pet care, Music, Tobacco, Advisory firm, Pollution control equipment, Consulting, BioTechnology, Innovation Management, Location Analytics, Computer & Network Security, Apparel & Fashion, Computer Games, Environmental Services, Facilities Services, Marketing & Advertising, Job discovery platform, D2C (Direct-to-Consumer), E-learning, OTT (Over-the-Top media), Fitness, Eyewear, NFT Marketplace, Online storytelling, SpaceTech, Online Media, Fishery, Commercial Real Estate, AR startup.

In [None]:
df['Category'] = df['Sector'].map({'AI startup':'Technology & Software','EdTech':'Education',
 'B2B E-commerce':'E-commerce & Retail',
 'FinTech':'Finance & FinTech',
 'Home services':'E-commerce & Retail',
 'HealthTech':'Healthcare & HealthTech',
 'Tech Startup':'Technology & Software',
 'E-commerce':'E-commerce & Retail',
 'B2B service':'E-commerce & Retail',
 'Helathcare':'Healthcare & HealthTech',
 'Renewable Energy':'Energy & Utilities',
 'Electronics':'E-commerce & Retail',
 'IT startup':'Technology & Software',
 'Food & Beverages':'Food & Beverages',
 'Aeorspace':'Others',
 'Deep Tech':'Others',
 'Dating': 'Others',
 'Gaming':'Technology & Software',
 'Robotics':'Technology & Software',
 'Retail':'E-commerce & Retail',
 'Food':'Food & Beverages',
 'Oil and Energy':'Energy & Utilities',
 'AgriTech':'Others',
 'Telecommuncation':'Others',
 'Milk startup':'Others',
 'AI Chatbot':'Technology & Software',
 'IT':'Technology & Software',
 'Logistics':'E-commerce & Retail',
 'Hospitality':'Others',
 'Fashion':'E-commerce & Retail',
 'Marketing':'E-commerce & Retail',
 'Transportation':'E-commerce & Retail',
 'LegalTech':'Others',
 'Food delivery':'E-commerce & Retail',
 'Automotive':'E-commerce & Retail',
 'SaaS startup':'Finance & FinTech',
 'Renewable Energy':'Energy & Utilities',
 'Fantasy sports':'Others',
 'Video communication':'Technology & Software',
 'Social Media':'Others',
 'Skill development':'Education',
 'Rental':'Others',
 'Tech Startup':'Technology & Software',
 'Recruitment':'Education',
 'E-commerce':'E-commerce & Retail',
 'Sports':'Others',
 'Computer Games':'E-commerce & Retail',
 'Consumer Goods':'E-commerce & Retail',
 'Information Technology':'Technology & Software',
 'Apparel & Fashion':'E-commerce & Retail',
 'Logistics & Supply Chain':'E-commerce & Retail',
 'SportsTech':'Others',
 'HRTech':'Others',
 'Healthcare':'Healthcare & HealthTech',
 'Wine & Spirits':'E-commerce & Retail',
 'Mechanical & Industrial Engineering':'Others',
 'Spiritual':'Others',
 'Consumer Goods':'E-commerce & Retail',
 'Industrial Automation':'Others',
 'Lifestyle':'Others',
 'IoT':'Technology & Software',
 'Banking':'Finance & FinTech',
 'Computer software':'Technology & Software',
 'Automotive':'Others',
 'Digital mortgage':'Others',
 'Hospitality':'E-commerce & Retail',
 'Location Analytics':'Others',
 'Media':'Others',
 'Transportation':'Others',
'Tobacco':'E-commerce & Retail',
'MLOps platform':'Education',
 'Insuretech':'Finance & FinTech',
'Venture Capital':'Finance & FinTech',
 'Pet care':'Others',
 'Drone':'Others',
'E-learning':'Education',
 'Computer & Network Security':'Technology & Software',
 'Capital Markets':'Finance & FinTech',
 'Social network':'Others',
'Venture Capital & Private Equity':'Finance & FinTech',
'Furniture':'E-commerce & Retail',
'Wholesale':'E-commerce & Retail',
 'Health, Wellness & Fitness':'Healthcare & HealthTech',
 'OTT':'Others',
 'Hospital & Health Care':'Healthcare & HealthTech',
 'Information Technology & Services':'Technology & Software',
 'Construction':'Others',
 'Media':'Others',
 'E-learning':'Education',
 'Music':'Others',
 'Information Technology & Services':'E-commerce & Retail',
 'B2B marketplace':'E-commerce & Retail',
 'Financial Services':'Finance & FinTech',
 'Healtcare':'Healthcare & HealthTech',
'Education Management':'Education',
 'E-learning':'Education',
 'Music':'Others',
'Social commerce':'Others',

'Insurance':'Finance & FinTech',
  'Social audio':'Others',
  'Content commerce':'Others',
   'Celebrity Engagement':'Others',
 'Trading platform':'Finance & FinTech',
 'Innovation Management':'Others',
'Advisory firm':'Others',
'Vehicle repair startup':'Others',
 'Beverages':'Food & Beverages',
 'EV startup':'Others',
  'Home Decor':'E-commerce & Retail',
 'Solar':'E-commerce & Retail',
 'Cannabis startup':'E-commerce & Retail',
  'Helathcare':'Healthcare & HealthTech',
 'Water purification':'Healthcare & HealthTech',
'Cosmetics':'Others',
'CRM':'Others',
'Job discovery platform':'Others',
'Aviation':'Others',
'SpaceTech':'Others',
'NFT Marketplace':'Finance & FinTech',
'Human Resources':'Others',
 'D2C':'Others',
 'Pollution control equiptment':'Others',
'BioTechnology':'Others',
 'Software Startup':'Technology & Software',
 'Mobile Games':'Technology & Software',
 'Podcast':'Others',
 'Content publishing':'Others',
 'Blockchain startup':'Others',
 'Social network':'Others',
 'Insuretech':'Finance & FinTech',
 'Company-as-a-Service':'Others',
 'Eyewear':'E-commerce & Retail',
 'Textiles':'E-commerce & Retail',
 'Matrimony':'Others',
 'Blockchain':'Finance & FinTech',
 'Merchandise':'E-commerce & Retail',
 'Facilities Services':'E-commerce & Retail',
 'Farming':'Healthcare & HealthTech',
 'Internet':'Others','Online Media':'Others',
 'Social community':'Others','Consumer Electronics':'Others',
 'Fishery':'Healthcare & HealthTech','Deeptech':'Others',
 'Renewables & Environment':'Others','Tech startup':'Technology & Software',
 'Online storytelling':'Others','Digital platform':'Others',
 'Nutrition':'Healthcare & HealthTech','Health':'Healthcare & HealthTech',
 'Augmented reality':'Technology & Software','Online Media':'Others',
 'Co-working':'Others','HealthCare':'Healthcare & HealthTech',
 'Blockchain startup':'Finance & FinTech','Healthtech':'Healthcare & HealthTech'})
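
Two behaviours are worth keeping in mind with a mapping like the one above: `Series.map` returns NaN for any sector that is not a key in the dictionary, and when a key is repeated in a dictionary literal the last value wins. The sketch below shows both on made-up data; sending unmapped sectors to `'Others'` is one possible choice, not necessarily the one intended here.

```python
import pandas as pd

sectors = pd.Series(['FinTech', 'EdTech', 'Quantum computing', None])
mapping = {'FinTech': 'Finance & FinTech', 'EdTech': 'Education'}

# Sectors missing from the dictionary come back as NaN
mapped = sectors.map(mapping)
# One option is to send them to the catch-all bucket
category = mapped.fillna('Others')
print(category.tolist())

# Listing the unmapped sector names shows which keys still need adding
unmapped = sectors[mapped.isna() & sectors.notna()].tolist()
print(unmapped)

# A repeated key keeps only its last value
dup = {'Hospitality': 'Others', 'Hospitality': 'E-commerce & Retail'}
print(dup)
```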



In [46]:
sector = df.Sector.tolist()

#Keyword lists are checked in order; the first category with a matching substring wins
#Note: these are plain substring tests, so 'it' also matches names such as 'Hospitality'
keyword_map = [
    ('Finance & FinTech', ['bank', 'trading', 'fintech', 'vent', 'capital', 'insure', 'crypt']),
    ('Technology & Software', ['ai startup', 'technology', 'it', 'chain', 'augment', 'robot', 'gamin', 'information', 'mobile']),
    ('E-commerce & Retail', ['b2b', 'e-com', 'reta', 'soci', 'food', 'content', 'decor', 'electronics', 'apparel', 'fashion', 'farm', 'fish', 'annabi', 'eye']),
    ('Education', ['edtech', 'learn', 'working', 'story']),
    ('Healthcare & HealthTech', ['heal', 'medi', 'care', 'hael']),
]

def categorise(a):
    name = str(a).lower()
    for category, keywords in keyword_map:
        if any(k in name for k in keywords):
            return category
    return 'Other'

cat = [categorise(a) for a in sector]


In [44]:
df['Category']=cat
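
Short keywords can match inside unrelated words when tested as plain substrings. A possible refinement, sketched below with the hypothetical helper `has_keyword`, is to require a whole-word match for the short ones using a regular expression.

```python
import re

def has_keyword(name, keyword):
    """True when keyword appears in name as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return re.search(pattern, name, flags=re.IGNORECASE) is not None

print('it' in 'Hospitality'.lower())     # True: plain substring match
print(has_keyword('Hospitality', 'it'))  # False
print(has_keyword('IT startup', 'it'))   # True
```

Whole-word matching would not work for the deliberate prefixes such as 'reta' or 'heal', so it would only apply to keywords that are complete words.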

In [47]:
df.head(2)

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Stage,Amount,Investment_year,column10,Town,Other,City,Country,Headquarter_City,Category
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",Pre-series A,1200000,2021,,Bangalore,,,,Bangalore,Technology & Software
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",,120000000,2021,,Mumbai,,,,Mumbai,Education


In [None]:
final_df.Investor.unique()

array(['BEENEXT, Entrepreneur First',
       'Unilazer Ventures, IIFL Asset Management',
       'GSV Ventures, Westbridge Capital', ...,
       'Norwest Venture Partners, General Catalyst, Fundamentum, Accel Partners',
       'TPG, Norwest Venture Partners, Evolvence India', nan],
      dtype=object)
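
Each entry in `Investor` can list several names separated by commas, so counting individual investors needs one row per name. A minimal sketch, using a few of the names from the output above in made-up combinations:

```python
import pandas as pd

investors = pd.Series([
    'BEENEXT, Entrepreneur First',
    'Norwest Venture Partners, General Catalyst',
    'TPG, Norwest Venture Partners',
    None,
])

# One row per investor name, with surrounding whitespace removed
names = investors.dropna().str.split(',').explode().str.strip()
counts = names.value_counts()
print(counts.head())
```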