# Investigating the indian ecosystem and to propose the best indian start-up to invest

## Description
The objective of this project is to analyse the indian start-up investment data over the course of four years (2018-2021) to find out which funding stages are very attractive to investors and at what risk level. 

## Assumptions
- All currencies is in USD
- 82 Lahk/USD

## Null Hypothesis 
Average Investment amounts received by start-ups have no relation to sectors they operate in.


# Alternate Hypo
There is a relationship between average investment amounts received by start-ups and the sectors they operate in

## Analytical Questions
1. Does location affect the amount of funding or investments?
2. Does the sector of start up affect the fundings?
3. How many companies need funding and are at what level of funding ?
4. Which sectors receive the highest investment amounts?
5. Which cities have the highest number of startups and at what levels?
6. What are the levels of funding the startups are receiving?

# GOAL
`
The goal of this project is to propose the best indian start-up to invest.

In [None]:
#Libraries imported
import MySQLdb
import sqlalchemy as sa
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import warnings 


In [None]:
env_variables= dotenv_values('logins.env')
database= env_variables.get('database')
server = env_variables.get('server')
username = env_variables.get('username')
password = env_variables.get('password')

# Data Understanding

- There are four data sources to work with (2 SQL and 2 CSVs)
- Explore data
- Verify data quality

### Connecting to the dapDB to extract the 2020 and 2021 data

In [None]:
#Connecting to the database to analyse the 2020-2021 data

connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
connection = pyodbc.connect(connection_string)

In [None]:
#query the 2020 startup funding data

query = "SELECT * FROM LP1_startup_funding2020"

data_2020 = pd.read_sql(query, connection)
data_2020.columns

In [None]:
data_2020.head(2)

In [None]:
data_2020.Amount.unique()

In [None]:
#query the 2021 startup funding data
query = "SELECT * FROM LP1_startup_funding2021"

data_2021 = pd.read_sql(query, connection)
data_2021.head(5)


In [None]:
data_2021.Amount.unique()

In [None]:
#Checking for data entry errors
test=data_2021.query('Amount==["SeriesC","Seed","Upsparks"]')
test

In [None]:
#Reading 2018 and 2019 data from the csv files

data_2018=pd.read_csv('startup_funding_2018_2019\startup_funding2018.csv')
data_2018.head(5)

In [None]:
data_2019=pd.read_csv('startup_funding_2018_2019\startup_funding2019.csv')
data_2019.info()

In [None]:
#Changing column names of 2019 data to match all other datasets
data_2019.rename(columns={'Company/Brand':'Company_Brand', 'What it does':'What_it_does', 'Amount($)':'Amount'}, inplace=True)
data_2019.columns

In [None]:
#Changing column names of 2018 data to match all other datasets
data_2018.rename(columns={'Company Name':'Company_Brand','Industry':'Sector', 'Round/Series':'Stage', 'Location':'HeadQuarter', 'About Company':'What_it_does'}, inplace=True)
data_2018.head(5)

### Merging datasets

In [None]:
#Concating 2020 and 2021 data since they have a similar structure
pd.set_option('display.max_rows', None)
final_df = pd.concat([data_2021,data_2020,data_2019,data_2018],axis=0,ignore_index=True)
final_df.head(5)

In [None]:
#Saving the combined dataset to xlsx
final_df.to_excel("startup_funding_2018_2019\combined.xlsx",index=False,
             sheet_name='2018_to_2021') 

# Data Cleaning & Exploration

In [136]:
#Considering the columns of interest and reindexing
df= pd.read_excel('startup_funding_2018_2019\complete_data.xlsx')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2879 entries, 0 to 2878
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  2879 non-null   object 
 1   Founded        2110 non-null   float64
 2   HeadQuarter    2764 non-null   object 
 3   Sector         2861 non-null   object 
 4   What_it_does   2879 non-null   object 
 5   Founders       2334 non-null   object 
 6   Investor       1264 non-null   object 
 7   Amount         1743 non-null   object 
 8   Stage          1262 non-null   object 
 9   column10       3 non-null      object 
dtypes: float64(1), object(9)
memory usage: 225.0+ KB


In [138]:
df.shape

(2879, 10)

In [139]:
#Removing the Lahk symbol
lahk_investment=df[df['Amount'].str.contains('₹', na = False)]
lahk_investment.Amount=lahk_investment.Amount.str.replace('\W', '', regex=True).astype(float)
lahk_investment.Amount=(lahk_investment.Amount )/ 82 #Converting to dollar denomination
lahk_investment.head(2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lahk_investment.Amount=lahk_investment.Amount.str.replace('\W', '', regex=True).astype(float)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lahk_investment.Amount=(lahk_investment.Amount )/ 82 #Converting to dollar denomination


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
2354,Happy Cow Dairy,,"Mumbai, Maharashtra, India","Agriculture, Farming",A startup which aggregates milk from dairy far...,,,487804.878049,Seed,
2355,MyLoanCare,,"Gurgaon, Haryana, India","Credit, Financial Services, Lending, Marketplace",Leading Online Loans Marketplace in India,,,792682.926829,Series A,


In [140]:
#Removing the dollar symbol
dollar_investment=df[df['Amount'].str.contains('$', na = False)]
dollar_investment.Amount=dollar_investment.Amount.str.replace('\W', '', regex=True)
dollar_investment.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dollar_investment.Amount=dollar_investment.Amount.str.replace('\W', '', regex=True)


Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000,Pre-series A,
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000,,


In [172]:
#Merging the removed symbols with the original dataset
f_df = pd.concat([df,dollar_investment,lahk_investment],axis=0,ignore_index=True)
f_df.shape


(3023, 10)

In [173]:
#Dropping duplicates from the original dataset to maintain the 'removed symbols' rows
f_df.drop_duplicates(subset=['Stage','Founded','Founders','Company_Brand','HeadQuarter','Investor','Sector','What_it_does','column10'],  keep='last', inplace=True, ignore_index=False)
f_df.shape

(2843, 10)

In [187]:
#Finding the number of nan in each column
f_df.isna().sum()



Company_Brand       0
Founded           625
HeadQuarter        50
Sector             10
What_it_does        0
Founders          530
Investor          542
Amount              0
Stage               0
column10         1141
dtype: int64

In [191]:
#Dropping empty rows in the Stage and Amount columns
wd=f_df.dropna(  how='any', subset=['Amount','Stage'])
wd.dropna(subset = ['Amount','Stage'], inplace = True) 
wd.shape


(1142, 10)

In [123]:
for m in am:
    v=str(m)
    if v.startswith('₹'):
        b=v.split("₹")[1]

        
        print(b)

40,000,000
65,000,000
16,000,000
50,000,000
100,000,000
500,000
35,000,000
64,000,000
20,000,000
30,000,000
40,000,000
5,000,000
20,000,000
40,000,000
20,000,000
12,500,000
15,000,000
104,000,000
45,000,000
25,000,000
26,400,000
8,000,000
60,000
34,000,000
342,000,000
600,000,000
1,000,000,000
2,000,000,000
2,000,000,000
1,000,000,000
100,000
250,000,000
2,000,000,000
550,000,000
30,000,000
240,000,000
120,000,000
2,500,000,000
44,000,000
60,000,000
2,500,000,000
650,000,000
1,600,000,000
50,000,000
70,000,000
16,000,000
102,500,000
550,000,000
20,000,000
1,200,000
250,000,000
5,200,000,000
100,000
50,000,000
100,000,000
9,500,000
150,000,000
7,000,000
1,400,000
50,000,000
10,000,000
22,500,000
5,000,000
50,000,000
140,200,000
30,000,000
19,200,000
103,000,000
40,000,000
35,000,000
100,000,000
200,000
16,600,000
12,000,000
20,000,000
30,000,000
33,000,000
34,900,000
72,000,000
50,000,000
120,000,000
35,000,000
32,000,000
250,000,000
135,000,000
15,000,000
20,000,000
10,000,000
135,000,

In [None]:
#Not done yet
final_df.query('Stage==["$6000000","$300000","$1000000"]')


In [None]:
final_df.What_it_does.unique()

In [None]:
final_df.HeadQuarter.unique()

In [None]:
final_df.Founders.unique()

In [None]:
final_df.Investor.unique()

In [None]:
test

for row in test.index:
    print(row)
    row_list = test.loc[row, :].values.flatten().tolist()
    row_list
#row_list = test.loc[98, :].values.flatten().tolist() 


In [None]:
row_list   


In [None]:

for i in row_list:
    # Remove letters using string manipulation
    
    if isinstance(i, str):
       s = ''.join(filter(str.isdigit, i))
      
       
    else: 
       pass  
print(row_list)

    