# SECTION A: Python & Data Cleaning



---

1. Data Loading & Cleaning:

   - Loaded the dataset using pandas.

   - Renamed columns for consistency (Startup Name → Startup_Name, etc.).

   - Dropped unnecessary columns like Sr No.


---



In [405]:
import pandas as pd
import numpy as np

In [406]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/PROJECT_DS/CSV/startup_funding.csv')
df.head()

Unnamed: 0,Sr No,Date dd/mm/yyyy,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,InvestmentnType,Amount in USD,Remarks
0,1,09/01/2020,BYJU’S,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000,
1,2,13/01/2020,Shuttl,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394,
2,3,09/01/2020,Mamaearth,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860,
3,4,02/01/2020,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000,
4,5,02/01/2020,Fashor,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000,


In [407]:
df.drop(columns='Sr No', inplace=True)
df.rename(columns={
    'Date dd/mm/yyyy': 'Date',
    'Startup Name': 'Startup_Name',
    'Industry Vertical': 'Industry_Vertical',
    'SubVertical': 'SubVertical',
    'City  Location': 'City_Location',
    'Investors Name': 'Investors_Name',
    'InvestmentnType': 'Investment_Type',
    'Amount in USD': 'Amount_USD',
    'Remarks': 'Remarks'
}, inplace=True)
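The underscore renames above can also be produced programmatically. The sketch below is a minimal illustration on an empty demo frame, not the notebook's own method; `normalize_columns` is a hypothetical helper name. It only collapses whitespace into underscores, so semantic renames such as `Amount in USD` to `Amount_USD` or `InvestmentnType` to `Investment_Type` would still need an explicit mapping.

```python
import pandas as pd


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names have whitespace runs replaced by '_'."""
    out = frame.copy()
    out.columns = out.columns.str.strip().str.replace(r"\s+", "_", regex=True)
    return out


# Demo frame with column names similar to the raw CSV header (assumed for illustration)
demo = pd.DataFrame(columns=["Startup Name", "City  Location", "Amount in USD"])
normalized = normalize_columns(demo)
print(list(normalized.columns))
```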



---


2. Display Structure and Inspect Data Types


---



In [408]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Date               3044 non-null   object
 1   Startup_Name       3044 non-null   object
 2   Industry_Vertical  2873 non-null   object
 3   SubVertical        2108 non-null   object
 4   City_Location      2864 non-null   object
 5   Investors_Name     3020 non-null   object
 6   Investment_Type    3040 non-null   object
 7   Amount_USD         2084 non-null   object
 8   Remarks            419 non-null    object
dtypes: object(9)
memory usage: 214.2+ KB


In [409]:
df.dtypes

Unnamed: 0,0
Date,object
Startup_Name,object
Industry_Vertical,object
SubVertical,object
City_Location,object
Investors_Name,object
Investment_Type,object
Amount_USD,object
Remarks,object


In [410]:
df.shape

(3044, 9)

In [411]:
df.describe()

Unnamed: 0,Date,Startup_Name,Industry_Vertical,SubVertical,City_Location,Investors_Name,Investment_Type,Amount_USD,Remarks
count,3044,3044,2873,2108,2864,3020,3040,2084,419
unique,1035,2459,821,1942,112,2412,55,471,72
top,30/11/2016,Swiggy,Consumer Internet,Online Lending Platform,Bangalore,Undisclosed Investors,Private Equity,1000000,Series A
freq,11,8,941,11,700,39,1356,165,175


describe() currently reports only count, unique, top, and freq because every column is stored as object (string); the dataset has no numeric columns yet.

Once Amount_USD is converted to a numeric type, describe() will report numerical statistics for it:
- Count
- Mean
- Standard deviation
- Minimum and maximum
- Percentiles (25%, 50%, 75%)
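The difference can be seen on a small made-up frame. This is a sketch with invented values, assuming a simple comma-stripping conversion; it is not the conversion used later in this notebook.

```python
import pandas as pd

# Toy data (invented): amounts are stored as strings, like in the raw CSV
demo = pd.DataFrame({
    "Startup_Name": ["A", "B", "C"],
    "Amount_USD": ["1,000", "2,500", "undisclosed"],
})

# All columns are object, so describe() returns count/unique/top/freq
text_summary = demo.describe()
print(text_summary)

# Strip commas and coerce non-numeric text such as 'undisclosed' to NaN
demo["Amount_USD"] = pd.to_numeric(
    demo["Amount_USD"].str.replace(",", "", regex=False), errors="coerce"
)

# With a numeric column present, describe() switches to numeric statistics
numeric_summary = demo.describe()
print(numeric_summary)
```

Passing `include="all"` to `describe()` would show both kinds of summary side by side.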



---

3. Handling Missing Values:
   - Removed rows with missing Industry Vertical, since it's a crucial feature.
   - Filled missing values in SubVertical, City Location, and Investors Name with "Unknown" to maintain consistency.



---



In [412]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,171
SubVertical,936
City_Location,180
Investors_Name,24
Investment_Type,4
Amount_USD,960
Remarks,2625


In [413]:
missing_percent = df.isnull().mean() * 100
print(missing_percent.sort_values(ascending=False))

Remarks              86.235217
Amount_USD           31.537451
SubVertical          30.749014
City_Location         5.913272
Industry_Vertical     5.617608
Investors_Name        0.788436
Investment_Type       0.131406
Date                  0.000000
Startup_Name          0.000000
dtype: float64


In [414]:
df=df[~df['Industry_Vertical'].isnull()]
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,765
City_Location,9
Investors_Name,24
Investment_Type,4
Amount_USD,920
Remarks,2522


- About 5.6% of the rows (171 of 3,044) had a missing Industry_Vertical.
- All 171 of those rows were also missing SubVertical and City_Location, and 40 of them were missing Amount_USD.
- Because they carry so little information, we dropped them to keep the dataset clean.
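One way to check how much the missing values overlap before dropping rows is to look at the null rate of every column inside the subset that lacks Industry_Vertical. The frame below is invented toy data, used only to show the pattern.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two rows lack Industry_Vertical and most other fields
demo = pd.DataFrame({
    "Industry_Vertical": ["FinTech", np.nan, np.nan, "EdTech"],
    "SubVertical": ["Lending", np.nan, np.nan, np.nan],
    "City_Location": ["Mumbai", np.nan, np.nan, "Pune"],
})

# Restrict to rows without a vertical, then measure missingness per column
no_vertical = demo[demo["Industry_Vertical"].isna()]
overlap_percent = no_vertical.isna().mean() * 100
print(overlap_percent)
```

A value of 100 for a column means every row without a vertical is also missing that column.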




In [415]:
# Handling missing values.
df['SubVertical'] = df['SubVertical'].fillna('Unknown')
df['City_Location'] = df['City_Location'].fillna('Unknown')
df['Investors_Name'] = df['Investors_Name'].fillna('Unknown')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['SubVertical'] = df['SubVertical'].fillna('Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['City_Location'] = df['City_Location'].fillna('Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Investors_Name'] = df['Investors_Name'].fillna('Unknown')
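The SettingWithCopyWarning messages above most likely come from the earlier boolean filter on Industry_Vertical: pandas cannot tell whether the filtered frame is a view or a copy of the original, so it warns on every later column assignment. A common remedy is to take an explicit copy right after filtering. The sketch below uses invented data and the illustrative names `raw` and `clean`.

```python
import numpy as np
import pandas as pd

# Toy data (invented)
raw = pd.DataFrame({
    "Industry_Vertical": ["FinTech", np.nan, "EdTech"],
    "SubVertical": [np.nan, np.nan, "Test prep"],
})

# .copy() makes the filtered frame independent of `raw`,
# so later assignments are unambiguous and raise no warning
clean = raw[raw["Industry_Vertical"].notna()].copy()
clean["SubVertical"] = clean["SubVertical"].fillna("Unknown")
print(clean)
```

The original frame is left untouched, which also makes it easier to re-run cells in any order.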




---


4. Data Formatting:
 - Handled categorical issues like inconsistent city names (Bangalore, Bengaluru, etc.).
 - Dates were converted to proper datetime format.
 - Currency values (Amount in USD) were cleaned (commas removed, converted to float).
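As an illustration of the currency step, the sketch below parses a few amount strings in the comma-grouped style seen in this dataset. `parse_amount` is a hypothetical helper written for this example; it is not the notebook's own cleaning function and it ignores other kinds of noise the real column may contain.

```python
import numpy as np
import pandas as pd


def parse_amount(value) -> float:
    """Parse one raw amount string; return NaN when no number can be read."""
    if pd.isna(value):
        return np.nan
    text = str(value).replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        # Placeholders such as 'undisclosed' or 'unknown' end up here
        return np.nan


raw_amounts = pd.Series(["20,00,00,000", "48,89,975.54", "undisclosed", np.nan])
amounts = raw_amounts.map(parse_amount)
print(amounts)
```

Removing every comma works for both Indian and Western digit grouping, because the position of the separators carries no numeric information.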



---



In [416]:
df_NewCopy=df.copy()

In [417]:
df['Investment_Type'].unique()

array(['Private Equity Round', 'Series C', 'Series B', 'Pre-series A',
       'Seed Round', 'Series A', 'Series D', 'Seed', 'Series F',
       'Series E', 'Debt Funding', 'Series G', 'Series H', 'Venture',
       'Seed Funding', nan, 'Funding Round', 'Corporate Round',
       'Maiden Round', 'pre-series A', 'Seed Funding Round',
       'Single Venture', 'Venture Round', 'Pre-Series A', 'Angel',
       'Series J', 'Angel Round', 'pre-Series A',
       'Venture - Series Unknown', 'Bridge Round', 'Private Equity',
       'Debt and Preference capital', 'Inhouse Funding',
       'Seed/ Angel Funding', 'Debt', 'Pre Series A', 'Equity',
       'Debt-Funding', 'Mezzanine', 'Series B (Extension)',
       'Equity Based Funding', 'Private Funding', 'Seed / Angel Funding',
       'Seed/Angel Funding', 'Seed funding', 'Seed / Angle Funding',
       'Angel / Seed Funding', 'Private', 'Structured Debt', 'Term Loan',
       'PrivateEquity', 'Angel Funding'], dtype=object)

In [418]:
investment_type_map = {
    # Seed / Angel
    'Seed': 'Seed Funding',
    'Seed Round': 'Seed Funding',
    'Seed funding': 'Seed Funding',
    'Seed Funding Round': 'Seed Funding',
    'Seed / Angel Funding': 'Seed Funding',
    'Seed/Angel Funding': 'Seed Funding',
    'Seed/ Angel Funding': 'Seed Funding',
    'Seed / Angle Funding': 'Seed Funding',
    'Angel': 'Angel Funding',
    'Angel Round': 'Angel Funding',
    'Angel Funding': 'Angel Funding',
    'Angel / Seed Funding': 'Seed Funding',

    # Series
    'Series A': 'Series A',
    'Pre-Series A': 'Pre-Series A',
    'Pre-series A': 'Pre-Series A',
    'pre-Series A': 'Pre-Series A',
    'pre-series A': 'Pre-Series A',
    'Pre Series A': 'Pre-Series A',
    'Series B': 'Series B',
    'Series B (Extension)': 'Series B',
    'Series C': 'Series C',
    'Series D': 'Series D',
    'Series E': 'Series E',
    'Series F': 'Series F',
    'Series G': 'Series G',
    'Series H': 'Series H',
    'Series J': 'Series J',

    # Venture/Private Equity
    'Venture': 'Venture Round',
    'Venture Round': 'Venture Round',
    'Venture - Series Unknown': 'Venture Round',
    'Private Equity': 'Private Equity',
    'Private Equity Round': 'Private Equity',
    'Private': 'Private Equity',
    'Private Funding': 'Private Equity',
    'PrivateEquity': 'Private Equity',

    # Debt
    'Debt Funding': 'Debt Funding',
    'Debt': 'Debt Funding',
    'Debt-Funding': 'Debt Funding',
    'Structured Debt': 'Debt Funding',
    'Term Loan': 'Debt Funding',
    'Debt and Preference capital': 'Debt Funding',

    # Other types
    'Funding Round': 'Other',
    'Corporate Round': 'Other',
    'Maiden Round': 'Other',
    'Single Venture': 'Other',
    'Bridge Round': 'Other',
    'Inhouse Funding': 'Other',
    'Equity': 'Other',
    'Equity Based Funding': 'Other',
    'Mezzanine': 'Other',
}
df['Investment_Type'] = df['Investment_Type'].str.strip()
df['Investment_Type'] = df['Investment_Type'].replace(investment_type_map)
df['Investment_Type'].value_counts()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Investment_Type'] = df['Investment_Type'].str.strip()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Investment_Type'] = df['Investment_Type'].replace(investment_type_map)


Unnamed: 0_level_0,count
Investment_Type,Unnamed: 1_level_1
Seed Funding,1454
Private Equity,1280
Debt Funding,30
Series A,24
Series B,21
Series C,14
Series D,12
Other,11
Pre-Series A,9
Venture Round,4


In [419]:
df['City_Location'].unique()

array(['Bengaluru', 'Gurgaon', 'New Delhi', 'Mumbai', 'Chennai', 'Pune',
       'Noida', 'Faridabad', 'San Francisco', 'San Jose,', 'Amritsar',
       'Delhi', 'Kormangala', 'Tulangan', 'Hyderabad', 'Burnsville',
       'Menlo Park', 'Gurugram', 'Palo Alto', 'Santa Monica', 'Singapore',
       'Taramani', 'Andheri', 'Chembur', 'Nairobi', 'Haryana', 'New York',
       'Karnataka', 'Mumbai/Bengaluru', 'Bhopal',
       'Bengaluru and Gurugram', 'India/Singapore', 'Jaipur', 'India/US',
       'Nagpur', 'Indore', 'New York, Bengaluru', 'California', 'India',
       'Ahemadabad', 'Rourkela', 'Srinagar', 'Bhubneswar', 'Chandigarh',
       'Delhi & Cambridge', 'Kolkatta', 'Kolkata', 'Coimbatore',
       'Bangalore', 'Udaipur', 'Unknown', 'Ahemdabad', 'Bhubaneswar',
       'Ahmedabad', 'Surat', 'Goa', 'Uttar Pradesh', 'Nw Delhi', 'Gaya',
       'Vadodara', 'Trivandrum', 'Missourie', 'Panaji', 'Gwalior',
       'Karur', 'Udupi', 'Kochi', 'Agra', 'Bangalore/ Bangkok', 'Hubli',
       'Kerala', 'K

In [420]:
# Step 1: Remove literal escape residue such as '\\xc2\\xa0' and trim.
# Matching exactly two hex digits after 'x' keeps the city name intact;
# a looser pattern such as r'\\x[a-zA-Z0-9]+' also swallows the letters that follow.
df['City_Location'] = (
    df['City_Location'].astype(str)
    .str.replace(r'\\+x[0-9a-fA-F]{2}', '', regex=True)
    .str.strip()
)

# Step 2: Standardize entries using a mapping dictionary
city_map = {
    "Bangalore": "Bangalore",
    "Hyderabad": "Hyderabad",
    "Mumbai": "Mumbai",
    "Gurgaon": "Gurgaon",
    "Noida": "Noida",
    "Goa": "Goa",
    "Chennai": "Chennai",
    "Bangalore / SFO": "Bangalore",
    "New Delhi": "New Delhi",
    "Pune": "Pune",
    "New Delhi / California": "New Delhi",
    "Delhi": "New Delhi",
    "Unknown": "Unknown",
    "Chandigarh": "Chandigarh",
    "Pune / US": "Pune",
    "Udaipur": "Udaipur",
    "Ahmedabad": "Ahmedabad",
    "\\xc2\\xa0Gurgaon": "Gurgaon",
    "\\xc2\\xa0New Delhi": "New Delhi",
    "\\xc2\\xa0Bangalore": "Bangalore",
    "\\xc2\\xa0Noida": "Noida",
    "Coimbatore": "Coimbatore",
    "Nagpur": "Nagpur",
    "\\xc2\\xa0Mumbai": "Mumbai",
    "Kanpur": "Kanpur",
    "India / US": "Unknown",
    "Mumbai / Global": "Mumbai",
    "Kolkata": "Kolkata",
    "Jaipur": "Jaipur",
    "New Delhi/ Houston": "New Delhi",
    "New Delhi / US": "New Delhi",
    "Bangalore / USA": "Bangalore",
    "Gurgaon / SFO": "Gurgaon",
    "US/India": "Unknown",
    "Bhopal": "Bhopal",
    "New York/ India": "Unknown",
    "Bangalore / San Mateo": "Bangalore",
    "Pune / Singapore": "Pune",
    "Singapore": "Unknown",
    "Chennai/ Singapore": "Chennai",
    "Belgaum": "Belgaum",
    "Goa/Hyderabad": "Goa",
    "Bangalore/ Bangkok": "Bangalore",
    "Noida / Singapore": "Noida",
    "USA/India": "Unknown",
    "Vadodara": "Vadodara",
    "Gwalior": "Gwalior",
    "Mumbai / NY": "Mumbai",
    "Boston": "Unknown",
    "Hyderabad/USA": "Hyderabad",
    "Bangalore / Palo Alto": "Bangalore",
    "Udupi": "Udupi",
    "Dallas / Hyderabad": "Hyderabad",
    "London": "Unknown",
    "Jodhpur": "Jodhpur",
    "Mumbai / UK": "Mumbai",
    "Indore": "Indore",
    "Pune / Dubai": "Pune",
    "Varanasi": "Varanasi",
    "Pune/Seattle": "Pune",
    "Seattle / Bangalore": "Bangalore",
    "Agra": "Agra",
    "Kochi": "Kochi",
    "Hubli": "Hubli",
    "Kerala": "Unknown",
    "\\\\" : "Unknown",
    "\\\\ Delhi": "New Delhi",
    "Kozhikode": "Kozhikode",
    "US": "Unknown",
    "Siliguri": "Siliguri",
    "USA": "Unknown",
    "Lucknow": "Lucknow",
    "Trivandrum": "Trivandrum",
    "SFO / Bangalore": "Bangalore",
    "Bengaluru": "Bangalore",
    "Panaji": "Panaji",
    "Missourie": "Unknown",
    "Surat": "Surat",
    "Gaya": "Gaya",
    "Uttar Pradesh": "Unknown",
    "Nw Delhi": "New Delhi",
    "Karur": "Karur",
    "Gurugram": "Gurgaon",
    "Bhubneswar": "Bhubaneswar",
    "Kolkatta": "Kolkata",
    "Delhi & Cambridge": "New Delhi",
    "India/US": "Unknown",
    "Rourkela": "Rourkela",
    "India": "Unknown",
    "Mumbai/Bengaluru": "Mumbai",
    "California": "Unknown",
    "New York, Bengaluru": "Bangalore",
    "Ahemadabad": "Ahmedabad",
    "Ahemdabad": "Ahmedabad",
    "Faridabad": "Faridabad",
    "Bhubaneswar": "Bhubaneswar",
    "Srinagar": "Srinagar",
    "Nairobi": "Unknown",
    "Haryana": "Unknown",
    "Andheri": "Mumbai",
    "Chembur": "Mumbai",
    "Taramani": "Chennai",
    "Kormangala": "Bangalore",
    "Santa Monica": "Unknown",
    "San Francisco": "Unknown",
    "Karnataka": "Unknown",
    "San Jose,": "Unknown",
    "Menlo Park": "Unknown",
    "Burnsville": "Unknown",
    "New York": "Unknown",
    "Bengaluru and Gurugram": "Bangalore",
    "Tulangan": "Tulangan",
    "Amritsar": "Amritsar",
    "Palo Alto": "Unknown",
    "India/Singapore": "Unknown"
}



df['City_Location'] = df['City_Location'].replace(city_map)

# Optional: Capitalize first letters and clean up any whitespace again
df['City_Location'] = df['City_Location'].str.title().str.strip()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['City_Location'] = df['City_Location'].astype(str).str.replace(r'\\x[a-zA-Z0-9]+', '', regex=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['City_Location'] = df['City_Location'].str.replace(r'\\\\xc2\\\\xa0', '', regex=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['City_Locat

In [421]:
df[df['Investment_Type'].isna()]

Unnamed: 0,Date,Startup_Name,Industry_Vertical,SubVertical,City_Location,Investors_Name,Investment_Type,Amount_USD,Remarks
30,19/11/2019,Furtados School of Music,Education,Music Education,Tulangan,IAN Fund and DSG Consumer Partners,,200000000.0,
76,03/06/2019,FabHotels,E-Commerce,Hospitality,Gurgaon,"Goldman Sachs, Accel Partners and Qualcomm",,4889975.54,
81,06/06/2019,Sistema.bio,Agriculture,Hybrid Reactor Biodigestor,Unknown,"Shell Foundation, DILA CAPITAL, Engie RDE Fund...",,2739034.68,
1562,21/07/2016,Drums Food,Food & Beverage,Yogurt and Ice Cream maker,Mumbai,"Verlinvest, DSG Consumer Partners",,,


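- Four rows have a missing Investment_Type; they belong to the Education, E-Commerce, Agriculture, and Food & Beverage verticals.
- To fill them, we use the mode (most common value) of Investment_Type within each Industry_Vertical, and fall back to 'Seed Funding', the most common type overall, when no per-vertical mode was computed.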
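The per-vertical mode fill can also be computed directly instead of being hard-coded. The sketch below is one possible approach on invented data; `group_mode` and `overall_mode` are illustrative names, and the fallback to the overall mode is an assumption about how groups without any known value should be handled.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one gap in a vertical with history, one in a vertical without
demo = pd.DataFrame({
    "Industry_Vertical": ["E-Commerce"] * 4 + ["Education"],
    "Investment_Type": ["Private Equity", "Private Equity", "Seed Funding", np.nan, np.nan],
})

# Fallback for verticals whose Investment_Type is entirely missing
overall_mode = demo["Investment_Type"].mode().iloc[0]


def group_mode(values: pd.Series) -> str:
    """Most frequent non-null value in the group, or the overall mode if none exists."""
    modes = values.mode()  # mode() ignores NaN
    return modes.iloc[0] if not modes.empty else overall_mode


# One fill value per vertical, then aligned back to the rows
mode_by_vertical = demo.groupby("Industry_Vertical")["Investment_Type"].agg(group_mode)
fill_values = demo["Industry_Vertical"].map(mode_by_vertical)
demo["Investment_Type"] = demo["Investment_Type"].fillna(fill_values)
print(demo)
```

When several values tie for the mode, `mode()` returns them sorted, so this sketch picks the alphabetically first one.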

In [422]:
# Select the verticals whose Investment_Type distribution we want to inspect
DF_1 = df[df['Industry_Vertical'].isin(['E-Commerce', 'Agriculture', 'Food & Beverages'])]
DF_1.head()

Unnamed: 0,Date,Startup_Name,Industry_Vertical,SubVertical,City_Location,Investors_Name,Investment_Type,Amount_USD,Remarks
8,06/12/2019,CarDekho,E-Commerce,Automobile,Gurgaon,Ping An Global Voyager Fund,Series D,70000000,
12,16/12/2019,Licious,E-Commerce,Online Meat And Seafood Ordering Startup,Bangalore,Vertex Growth Fund,Series E,30000000,
16,20/12/2019,Lenskart.com,E-Commerce,Online Eyewear Shopping Portal,Faridabad,SoftBank Vision Fund,Series G,231000000,
27,19/11/2019,Digital Mall Asia,E-Commerce,Virtual e-commerce platform,New Delhi,Amour Infrastructure,Seed Funding,220000000,
44,01/08/2019,CarDekho,E-Commerce,Automotive,Gurgaon,SC GG India Mobility Holdings LLC,Series C,20000000,


In [423]:
DF_IType_MODE= DF_1.groupby('Industry_Vertical')['Investment_Type'].value_counts()
DF_IType_MODE

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Industry_Vertical,Investment_Type,Unnamed: 2_level_1
Agriculture,Seed Funding,1
E-Commerce,Private Equity,11
E-Commerce,Seed Funding,6
E-Commerce,Series D,4
E-Commerce,Debt Funding,2
E-Commerce,Series C,2
E-Commerce,Series E,1
E-Commerce,Series F,1
E-Commerce,Series G,1
Food & Beverages,Seed Funding,3


In [424]:
def fill_investment_type(row):
    if pd.isnull(row['Investment_Type']):
        if row['Industry_Vertical'] == 'Agriculture':
            return 'Seed Funding'
        elif row['Industry_Vertical'] == 'E-Commerce':
            return 'Private Equity'
        elif row['Industry_Vertical'] == 'Food & Beverages':
            return 'Seed Funding'  # mode for this vertical in the counts above
        else:
            return 'Seed Funding'
    else:
        return row['Investment_Type']

# Apply the function to fill missing values
df['Investment_Type'] = df.apply(fill_investment_type, axis=1)

In [425]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,0
City_Location,0
Investors_Name,0
Investment_Type,0
Amount_USD,920
Remarks,2522


In [426]:
df['Industry_Vertical'].value_counts()

Unnamed: 0_level_0,count
Industry_Vertical,Unnamed: 1_level_1
Consumer Internet,941
Technology,478
eCommerce,186
Healthcare,70
Finance,62
...,...
Developer Portfolio Showcase platform,1
Doctors Network Mobile App,1
End-to-End Lending platform,1
on-demand healthcare marketplace,1


In [427]:
industry_mapping = {
    # --- FOOD & BEVERAGE, DELIVERY, RESTAURANTS ---
    "Online Food Delivery": "FoodTech",
    "Online Food Ordering": "FoodTech",
    "Online Food Ordering Marketplace": "FoodTech",
    "Online Food ordering & Delivery": "FoodTech",
    "Online Food ordering & Delivery service": "FoodTech",
    "Online Food Ordering & Delivery platform": "FoodTech",
    "Food Delivery Platform": "FoodTech",
    "Food Delivery platform": "FoodTech",
    "Food Discovery App": "FoodTech",
    "Food Discovery & Delivery Mobile app": "FoodTech",
    "Chinese food delivery": "FoodTech",
    "Healthy Food Online Community": "FoodTech",
    "Healthy Food Manufacturer": "FoodTech",
    "Gourmet Food Discovery & Delivery platform": "FoodTech",
    "Gourmet Meals Delivery": "FoodTech",
    "Hyperlocal Grocery App": "FoodTech",
    "Hyperlocal Grocery Delivery": "FoodTech",
    "Hyperlocal Grocery Delivery Service": "FoodTech",
    "Raw Meat & Ready to eat food etailer": "FoodTech",
    "Home Cooked Food marketplace & Delivery": "FoodTech",
    "Home Made Food Marketplace": "FoodTech",
    "Food Subscription platform": "FoodTech",
    "Hyperlocal food & grocery store": "FoodTech",
    "Online Grocery platform": "FoodTech",
    "Hyper-Local Online/Mobile Grocery": "FoodTech",
    "Online Grocers": "FoodTech",
    "Grocery Delivery platform": "FoodTech",
    "Used Vehicle Marketplace": "Auto",
    "Used Car Marketplace": "Auto",
    "Used Bikes Marketplace": "Auto",
    "Online Grocery Store": "FoodTech",
    "Online Meat Ordering platform": "FoodTech",
    "Hyper-local Grocery Delivery platform": "FoodTech",
    "Hyperlocal Online Home services provider": "Services",
    "Hyperlocal Goods marketplace": "Marketplace",
    "Affordable Hotel Booking Online": "Hospitality",
    "Budget Hotel Accommodation": "Hospitality",
    # ------------------ LOGISTICS, TRANSPORT, RENTALS ------------
    "Last Mile Delivery Service": "Logistics",
    "Last Mile Logistics": "Logistics",
    "Logistics Intelligence": "Logistics",
    "On-Demand Logistics Service": "Logistics",
    "Freight logistics platform": "Logistics",
    "End-to-end Logistics platform": "Logistics",
    "Logistics Automation Platform": "Logistics",
    "Logistics service platform": "Logistics",
    "Logistics Service Provider Marketplace": "Logistics",
    "Logistics Solution Provider": "Logistics",
    "B2B logistics delivery platform": "Logistics",
    "Express local delivery platform": "Logistics",
    "Hyperlocal Delivery Services": "Logistics",
    "Hyperlocal Delivery Platform": "Logistics",
    "Hyperlocal Logistics Service": "Logistics",
    "Hyperlocal Logistics Service Provider": "Logistics",
    "On-demand delivery service": "Logistics",
    "On-Demand Delivery Logistics": "Logistics",
    "Delivery & Logistics Service provider": "Logistics",
    "Logistics Services Provider": "Logistics",
    "Hyper local logistics platform": "Logistics",
    "On-Demand Local Logistics provider": "Logistics",
    "Logistics Tech": "Logistics",
    "Logistics Tech Platform": "Logistics",
    "Online Logistics Platform": "Logistics",
    "Cab Aggregator": "Transport",
    "Cab Booking app platform": "Transport",
    "Taxi Rental platform": "Transport",
    "Cab rental Mobile app": "Transport",
    "Car & Bike ecommerce platform": "Auto",
    "Bike Rental Platform": "Transport",
    "Self-driven Car rental": "Transport",
    "Self Driven Car rental": "Transport",
    "Self-driven vehicle rental": "Transport",
    "Self Driven Rental Car Platform": "Transport",
    "Car Aggregator & Retail Mobile App": "Auto",
    "On-Demand Handyman Services": "Services",
    "On demand cleaning & fixing services": "Services",
    "On-Demand Beauty Service": "Beauty",
    "On-Demand Beauty Services": "Beauty",
    "Beauty & Wellness Mobile App": "Beauty",
    "Beauty & Lifestyle Mobile Marketplace": "Beauty",
    "Beauty services Mobile Marketplace": "Beauty",
    "Beauty and Wellness platform": "Beauty",
    "Beauty and Wellness Marketplace": "Beauty",
    "Beauty and Wellness Platform": "Beauty",
    "Health, Wellness & Beauty Services App": "Beauty",
    # --------------- E-COMMERCE -------------------
    "eCommerce platform": "ECommerce",
    "ecommerce": "ECommerce",
    "E-Commerce": "ECommerce",
    "ECommerce": "ECommerce",
    "Ecommerce": "ECommerce",
    "E-Commerce & M-Commerce platform": "ECommerce",
    "FMCG": "Consumer Goods",
    "Ecommerce Marketplace": "ECommerce",
    "Online Marketplace": "ECommerce",
    "Online Marketplace for Industrial Goods": "ECommerce",
    "Hyper-Local Ecommerce": "ECommerce",
    "Hyperlocal Shopping App": "ECommerce",
    "Industrial Supplies B2B ecommerce": "ECommerce",
    "Hyperlocal electronics repair Service": "ECommerce",
    "Online Shopping Assistant Mobile app": "ECommerce",
    "Ecommerce Brands’ Full Service Agency": "ECommerce",
    "Ecommerce Marketing Software Platform": "ECommerce",
    "Ecommerce returns etailer": "ECommerce",
    "Digital / Mobile Wallet": "FinTech",
    "Mobile Commerce Platform": "ECommerce",
    "Mobile Only Shopping Assistant": "ECommerce",
    "Fashion jewelry and accessories e-tailer": "Fashion",
    "Online Branded Furniture etailer": "ECommerce",
    "Indian Ethnic Crafts Etailer": "ECommerce",
    "Mobile Commerce for Farmers": "ECommerce",
    "Online Grocery Delivery": "FoodTech",
    "Online home décor marketplace": "Home Decor",
    "Online Home D\xe9cor": "Home Decor",
    "Home Furnishing Solutions": "Home Decor",
    "Online Furniture Marketplace": "Home Decor",
    "Online Furniture ecommerce": "Home Decor",
    "Custom Furniture Marketplace": "Home Decor",
    "Pre-owned games Marketplace": "ECommerce",
    # --------------- EDUCATION --------------------
    "Online Education Platform": "EdTech",
    "Online Learning Platform": "EdTech",
    "Online Education Marketplace": "EdTech",
    "Education Marketplace": "EdTech",
    "Education Content Provider": "EdTech",
    'E-Tech':'EdTech',
    "EdTech": "EdTech",
    "Ed-Tech": "EdTech",
    "Edtech": "EdTech",
    "Education": "EdTech",
    "Online Ed-Tech Platform": "EdTech",
    "Competitive exam learning platform": "EdTech",
    "Teacher empowerment platform": "EdTech",
    "Exam Preparation Platform": "EdTech",
    "Online Certification Courses": "EdTech",
    "Online Student & Campus Social Networking platform": "EdTech",
    "Skill Training & Placement Platform": "EdTech",
    "Multilingual Test Preparation Platform": "EdTech",
    "Government Test Preparation platform": "EdTech",
    "Test Automation SAAS platform": "EdTech",
    "Technology": "Technology",
    # ------------ HEALTHCARE, WELLNESS, LIFE SCIENCES ------
    "Healthcare": "Healthcare",
    "Health Care": "Healthcare",
    "healthcare": "Healthcare",
    "Health": "Healthcare",
    "Health-Tech platform": "Healthcare",
    "Healthcare Consulting platform": "Healthcare",
    "Online Pharmacy": "Healthcare",
    "Home Healthcare Services platform,": "Healthcare",
    "Home Medical Care Services": "Healthcare",
    "Healthcare IT Solutions & services": "Healthcare",
    "Healthcare Services Discovery platform": "Healthcare",
    "Diagnostics Labs aggregator platform": "Healthcare",
    "Prepaid Bill manager App": "FinTech",
    "App Analytics platform": "Analytics",
    "Online Medical Diagnostic": "Healthcare",
    "Preventive healthcare services": "Healthcare",
    "Preventive Healthcare Services": "Healthcare",
    # ------------------- FASHION, LIFESTYLE, BEAUTY -----------
    "Online Fashion Aggregator": "Fashion",
    "Fashion Ecommerce store": "Fashion",
    "Fashion ECommerce": "Fashion",
    "Private Label lingerie Ecommerce": "Fashion",
    "Online Lingerie platform": "Fashion",
    "Online Lingerie Marketplace": "Fashion",
    "Womens Fashion Wear Portal": "Fashion",
    "Designer fashion Jewellery Marketplace": "Fashion",
    "Designer Merchandize Marketplace": "Fashion",
    "Online Jewelry Store": "Fashion",
    "Online Jewellery Store": "Fashion",
    "Online Jewellery etailer": "Fashion",
    "Celebrity Fashion Brand": "Fashion",
    "Fashion Search & Review Platform": "Fashion",
    "Online Apparels Fashion brand": "Fashion",
    "Luxury Apparel rental": "Fashion",
    "pre-owned Luxury online apparel seller": "Fashion",
    "Private label Fashion eTailer": "Fashion",
    "Personalized Styling platform": "Fashion",
    "Fashion Discovery platform": "Fashion",
    "Ethnic/ Traditional Fashion Store": "Fashion",
    "Women Lifestyle Marketplace": "Fashion",
    "Women Ethnic Wear Online Marketplace": "Fashion",
    "Ethnic Beverages manufacturer": "Food & Beverage",
    # ------------------- MARKETPLACE & GENERAL ----------------
    "Marketplace": "Marketplace",
    "Home rental platform": "Marketplace",
    "Professional Services Marketplace": "Marketplace",
    "Performance based Wholesale Marketplace": "Marketplace",
    "Curated Freelancer Marketplace": "Marketplace",
    "Fund Raising Platform": "FinTech",
    "Startup Funding platform": "FinTech",
    "FinTech": "FinTech",
    "Fintech": "FinTech",
    "Financial Inclusion platform": "FinTech",
    "Financial Tech": "FinTech",
    "Financial Services Platform": "FinTech",
    "Financial Services Portal": "FinTech",
    "Finance": "FinTech",
    "NBFC": "FinTech",
    "BFSI": "FinTech",
    # ---------- CONSUMER INTERNET, OTHERS, GENERAL -----------
    "Consumer Internet": "Consumer Internet",
    "Consumer internet": "Consumer Internet",
    "Consumer Interne": "Consumer Internet",
    "Consumer Goods": "Consumer Goods",
    "Consumer Portal": "Consumer Internet",
    "Internet Network Infrastructure Services": "Technology",
    # ------------ TECHNOLOGY: IT, CLOUD, DATA, SAAS ---------
    "Cloud Data Integration Platform": "Technology",
    "Tech": "Technology",
    "IT": "Technology",
    "IT / Customer Engagement Consulting": "Technology",
    "Digital Media Platform": "Media",
    "Digital Media publishing platform": "Media",
    "Media": "Media",
    "Publishing": "Media",
    "Cloud Based Collaboration platform": "Technology",
    "AI": "Artificial Intelligence",
    "Artificial Intelligence": "Artificial Intelligence",
    # ------------ OTHER BUCKETS, TERMS, & PLACEHOLDER LABELS ----
    "Others": "Others",
    "Reality": "Real Estate",
    "Real Estate": "Real Estate",
    "Real Estate focused Tech platform": "Real Estate",
    "Real Estate Broker network App": "Real Estate",
    "Real Estate Broker Platform App": "Real Estate",
    "Real Estate Intelligence Platform": "Real Estate",
    "Food": "FoodTech",
    "Food and Beverage": "FoodTech",
    "Food-Tech": "FoodTech",
    "Food and Beverages": "FoodTech",
    "Food & Beverage": "FoodTech",
    "Food & Beverages": "FoodTech",
    "Hospitality": "Hospitality",
    "Retail": "Retail",
    "SaaS": "SaaS",
    "Saas": "SaaS",
    "SaaS, Ecommerce": "SaaS",
    "B2B": "B2B",
    "B2B Platform": "B2B",
    "B2B-focused foodtech startup": "FoodTech",
    "Online Education": "EdTech",
    "Services Platform": "Services",
    "Services": "Services",
    "Automation": "Automation",
    "Clean Tech": "CleanTech",
    "Clean-tech": "CleanTech",
    "Big Data & Analytics Services": "Analytics",
    "Big Data & Analytics platform": "Analytics",
    "Big Data Analytics Platform": "Analytics",
    "Big Data Management Platform": "Analytics",
    "Data Analytics Platform": "Analytics",
    "Analytics": "Analytics",
    "Video Games": "Gaming",

}


df['Industry_Vertical'] = df['Industry_Vertical'].map(industry_mapping).fillna('Others')
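Because `.map(...).fillna('Others')` silently sends every label without an exact key to 'Others', it can help to list the unmapped labels first. For example, the value counts above show `eCommerce` with 186 rows, and the dictionary as listed appears to have no key with that exact spelling. The sketch below uses a tiny invented mapping and the hypothetical helper `unmapped_labels`; it would need to run before the column is overwritten.

```python
import pandas as pd


def unmapped_labels(values: pd.Series, mapping: dict) -> pd.Series:
    """Count labels that have no exact key in `mapping`, most frequent first."""
    missing = values[~values.isin(list(mapping.keys()))]
    return missing.value_counts()


# Toy mapping and labels (invented) that mimic the spelling variants in the data
demo_mapping = {"ecommerce": "ECommerce", "E-Commerce": "ECommerce"}
demo_values = pd.Series(["eCommerce", "eCommerce", "E-Commerce", "Drone maker"])

report = unmapped_labels(demo_values, demo_mapping)
print(report)
```

Frequent labels in the report are candidates for new dictionary keys; rare one-off labels can reasonably stay in 'Others'.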


In [428]:
# List every Date value that does not match the expected dd/mm/yyyy pattern

import re  # also used by later cleaning cells

date_pattern = r'^\d{2}/\d{2}/\d{4}$'
invalid_format = df[~df['Date'].astype(str).str.match(date_pattern)]

print(invalid_format['Date'].unique())

['05/072018' '01/07/015' '\\\\xc2\\\\xa010/7/2015' '12/05.2015'
 '13/04.2015']


In [429]:
df['Date'].iloc[[192,2571,2606]]

Unnamed: 0,Date
192,05/072018
2571,01/07/015
2606,\\xc2\\xa010/7/2015


In [430]:
df.loc[192, 'Date'] = '05/07/2018'
df.loc[2571, 'Date'] = '01/07/2015'
df.loc[2606, 'Date'] = '10/07/2015'

In [431]:
df['Date'].iloc[[192,2571,2606]]

Unnamed: 0,Date
192,05/07/2018
2571,01/07/2015
2606,10/07/2015


In [432]:
# 1. Remove literal escape junk such as '\\xc2\\xa0': one or more backslashes,
#    'x', then exactly two hex digits, so the date digits that follow are kept
df['Date'] = df['Date'].str.replace(r'\\+x[0-9a-fA-F]{2}', '', regex=True)

# 2. Remove any leading/trailing whitespace
df['Date'] = df['Date'].str.strip()

# 3. Replace dots with slashes: 12/05.2015 → 12/05/2015
df['Date'] = df['Date'].str.replace('.', '/', regex=False)

# 4. Insert the missing slash in 'dd/mmyyyy': 05/072018 → 05/07/2018
df['Date'] = df['Date'].str.replace(r'^(\d{2})/(\d{2})(\d{4})$', r'\1/\2/\3', regex=True)

# 5. Fix 3-digit years: 01/07/015 → 01/07/2015
df['Date'] = df['Date'].str.replace(r'/0(\d{2})$', r'/20\1', regex=True)

# 6. Convert to datetime; values that still cannot be parsed become NaT
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
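
The cleanup rules in this cell can be exercised on a handful of sample strings before they are applied to the whole column. The sketch below is one possible way to do that. `normalize_date` is a hypothetical helper, and the sample values are modeled on the malformed entries printed earlier rather than read from the CSV.

```python
import re

import pandas as pd


def normalize_date(text: str) -> str:
    """Hypothetical helper: coerce a few known malformed shapes to dd/mm/yyyy."""
    # Drop literal \xNN escapes (one or more backslashes, 'x', two hex digits).
    text = re.sub(r"\\+x[0-9a-fA-F]{2}", "", text).strip()
    # 12/05.2015 -> 12/05/2015
    text = text.replace(".", "/")
    # 05/072018 -> 05/07/2018 (slash missing between month and year)
    text = re.sub(r"^(\d{2})/(\d{2})(\d{4})$", r"\1/\2/\3", text)
    # 01/07/015 -> 01/07/2015 (three-digit year)
    text = re.sub(r"/0(\d{2})$", r"/20\1", text)
    # 10/7/2015 -> 10/07/2015 (single-digit month)
    text = re.sub(r"/(\d)/", r"/0\1/", text)
    return text


# Sample values modeled on the malformed entries shown earlier (assumed shapes).
dates = pd.Series(
    ["09/01/2020", "05/072018", "01/07/015", "12/05.2015", r"\\xc2\\xa010/7/2015"]
)

parsed = pd.to_datetime(dates.map(normalize_date), format="%d/%m/%Y", errors="coerce")
print(parsed)
```

Passing an explicit `format` together with `errors="coerce"` means any string the rules did not repair shows up as `NaT`, which is easy to count with `parsed.isna().sum()`.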

In [433]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,0
City_Location,0
Investors_Name,0
Investment_Type,0
Amount_USD,920
Remarks,2522


In [434]:
df['Amount_USD'].unique()

array(['20,00,00,000', '80,48,394', '1,83,58,860', '30,00,000',
       '18,00,000', '90,00,000', '15,00,00,000', '60,00,000',
       '7,00,00,000', '5,00,00,000', '2,00,00,000', '1,20,00,000',
       '3,00,00,000', '59,00,000', '20,00,000', '23,10,00,000',
       '4,86,000', '15,00,000', 'undisclosed', '2,60,00,000',
       '1,74,11,265', '13,00,000', '13,50,00,000', '3,00,000',
       '22,00,00,000', '1,58,00,000', '28,30,00,000', '1,00,00,00,000',
       '4,50,00,000', '58,50,00,000', 'unknown', '45,00,000', '33,00,000',
       '50,00,000', '1,80,00,000', '10,00,000', '1,00,00,000',
       '45,00,00,000', '16,00,000', '14,00,00,000', '3,80,80,000',
       '12,50,00,000', '1,10,00,000', '5,10,00,000', '3,70,00,000',
       '5,00,000', '11,00,00,000', '1,50,00,000', '65,90,000',
       'Undisclosed', '3,90,00,00,000', '1,90,00,000', '25,00,000',
       '1,45,000', '6,00,00,000', '1,60,00,000', '57,50,000', '3,19,605',
       '48,89,975.54', '7,50,00,000', '27,39,034.68', '1,51,09,500.0

In [435]:
import re

import numpy as np
import pandas as pd

# Clean the raw strings in df['Amount_USD'] and convert them to floats
def clean_amount(val):
    if pd.isna(val):
        return np.nan

    val = str(val).strip()

    # Remove unwanted encodings and whitespace.
    # Note: both patterns are greedy, so digits that directly follow an
    # escape sequence are removed along with it.
    val = re.sub(r'\\x[a-zA-Z0-9]+', '', val)       # Remove \x-type encodings
    val = re.sub(r'\\\\[a-zA-Z0-9]+', '', val)      # Remove \\xc2\\xa0 etc.
    val = val.replace('+', '')                      # Remove '+' sign
    val = val.replace(',', '')                      # Remove all commas
    val = val.strip().rstrip('.')                   # Strip whitespace and any trailing dot

    # Convert known placeholder labels to NaN
    if val.lower() in ['undisclosed', 'unknown', 'n/a', 'na', '']:
        return np.nan

    try:
        return float(val)
    except ValueError:
        return np.nan

# Apply the function
df['Amount_USD'] = df['Amount_USD'].apply(clean_amount)
df['Amount_USD']

Unnamed: 0,Amount_USD
0,200000000.0
1,8048394.0
2,18358860.0
3,3000000.0
4,1800000.0
...,...
2868,3500000.0
2869,
2870,400000.0
2871,500000.0
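
For comparison, much of the same cleanup can be written without a row-wise function by letting `pd.to_numeric(..., errors="coerce")` turn anything non-numeric into `NaN`. This is only a sketch of that alternative. The sample values are illustrative (the `'14,342,000+'` entry is an assumed shape, not one confirmed in the dataset), and it does not handle the escape-sequence junk that `clean_amount` targets.

```python
import pandas as pd

# Illustrative raw values: Indian-style digit grouping, placeholders, a missing entry.
amounts = pd.Series(
    ["20,00,00,000", "undisclosed", "48,89,975.54", "14,342,000+", "Unknown", None]
)

numeric = pd.to_numeric(
    amounts.str.replace(r"[,+]", "", regex=True).str.strip(),  # drop commas and '+'
    errors="coerce",  # placeholders such as 'undisclosed' become NaN
)
print(numeric)
```

Since every comma is removed before conversion, the grouping style (lakh/crore or thousands) does not affect the parsed value.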


In [436]:
df['Amount_USD'] = pd.to_numeric(df['Amount_USD'], errors='coerce')

In [437]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,0
City_Location,0
Investors_Name,0
Investment_Type,0
Amount_USD,938
Remarks,2522


In [438]:
df2= df.groupby('Investment_Type')['Amount_USD'].mean()
print(df2)

Investment_Type
Angel Funding     2.323025e+05
Debt Funding      6.600799e+06
Other             1.464471e+08
Pre-Series A      5.546500e+06
Private Equity    2.579555e+07
Seed Funding      1.529113e+06
Series A          9.236364e+06
Series B          2.288188e+08
Series C          7.462274e+07
Series D          1.234832e+08
Series E          1.650000e+07
Series F          4.500000e+07
Series G          2.310000e+08
Series H          1.500000e+08
Series J          1.000000e+06
Venture Round     6.956200e+06
Name: Amount_USD, dtype: float64


In [439]:
# Mean Amount_USD per investment type (values copied from the groupby output above)
investment_avg = {
    'Angel Funding': 2.323025e+05,
    'Debt Funding': 6.600799e+06,
    'Other': 1.464471e+08,
    'Pre-Series A': 5.546500e+06,
    'Private Equity': 2.579555e+07,
    'Seed Funding': 1.529113e+06,
    'Series A': 9.236364e+06,
    'Series B': 2.288188e+08,
    'Series C': 7.462274e+07,
    'Series D': 1.234832e+08,
    'Series E': 1.650000e+07,
    'Series F': 4.500000e+07,
    'Series G': 2.310000e+08,
    'Series H': 1.500000e+08,
    'Series J': 1.000000e+06,
    'Venture Round': 6.956200e+06
}

# Fill missing Amount_USD values with the mean for the row's Investment_Type
df['Amount_USD'] = df['Amount_USD'].fillna(
    df['Investment_Type'].map(investment_avg)
)
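
Copying the group means into a dictionary by hand works, but the values are rounded and have to be re-copied whenever the data changes. A possible alternative is `groupby(...).transform('mean')`, which returns a Series aligned with the original rows. The sketch below uses a made-up `toy` frame, not the startup dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the two columns involved in the imputation.
toy = pd.DataFrame({
    "Investment_Type": ["Seed Funding", "Seed Funding", "Seed Funding",
                        "Series A", "Series A"],
    "Amount_USD": [1_000_000.0, 3_000_000.0, np.nan, 8_000_000.0, np.nan],
})

# Per-row mean of the row's own investment type (NaN values are ignored).
group_mean = toy.groupby("Investment_Type")["Amount_USD"].transform("mean")

# Fill only the missing amounts; observed amounts are left untouched.
toy["Amount_USD"] = toy["Amount_USD"].fillna(group_mean)
print(toy)
```

Swapping `"mean"` for `"median"` would be a one-word change and may be worth considering for funding amounts, where a few very large rounds can pull the mean upward.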


In [440]:
df['Remarks'] = df['Remarks'].fillna('No Remark')

In [441]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,0
City_Location,0
Investors_Name,0
Investment_Type,0
Amount_USD,0
Remarks,0


In [442]:
df['Amount_USD'].describe()

Unnamed: 0,Amount_USD
count,2873.0
mean,15738870.0
std,103159100.0
min,18000.0
25%,1000000.0
50%,1529113.0
75%,10000000.0
max,3900000000.0




---
5. Feature Engineering:

  - Extracted the year from the Date column for trend analysis.
  - Sorted the rows by year, most recent first.
  - Categorized Amount_USD as Low, Medium, or High.
  - Counted the number of investors by splitting the Investors_Name field on commas.


---




In [443]:
df['Year'] = df['Date'].dt.year
df=df.sort_values(by='Year', ascending=False)
df.head()

Unnamed: 0,Date,Startup_Name,Industry_Vertical,SubVertical,City_Location,Investors_Name,Investment_Type,Amount_USD,Remarks,Year
0,2020-01-09,BYJU’S,EdTech,E-learning,Bangalore,Tiger Global Management,Private Equity,200000000.0,No Remark,2020
3,2020-01-02,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-Series A,3000000.0,No Remark,2020
4,2020-01-02,Fashor,Others,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Funding,1800000.0,No Remark,2020
1,2020-01-13,Shuttl,Others,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394.0,No Remark,2020
2,2020-01-09,Mamaearth,Others,Retailer of baby and toddler products,Bangalore,Sequoia Capital India,Series B,18358860.0,No Remark,2020


In [444]:
# Check the median before adding a column that categorizes Amount_USD as Low, Medium, or High
df['Amount_USD'].median()

1529113.0

In [445]:
def amount_category_fun(amount):
    # Low: under 1M USD; Medium: 1M to 10M USD inclusive; High: above 10M USD
    if amount < 1_000_000:
        return 'Low'
    elif amount <= 10_000_000:
        return 'Medium'
    else:
        return 'High'

df['amount_category'] = df['Amount_USD'].apply(amount_category_fun)

In [446]:
df['amount_category'].value_counts()

Unnamed: 0_level_0,count
amount_category,Unnamed: 1_level_1
Medium,1502
Low,697
High,674
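
The same Low/Medium/High thresholds can also be expressed without a row-wise `apply`, for example with `np.select`, which evaluates the conditions in order and takes the first match. This is a sketch on made-up boundary values. As with the function above, a `NaN` amount would fall through to the default label, so it assumes missing values were filled beforehand.

```python
import numpy as np
import pandas as pd

# Made-up amounts chosen to sit on and around the two thresholds.
toy = pd.DataFrame({"Amount_USD": [500_000.0, 1_000_000.0, 10_000_000.0, 10_000_001.0]})

conditions = [
    toy["Amount_USD"] < 1_000_000,    # Low: strictly under 1M
    toy["Amount_USD"] <= 10_000_000,  # Medium: 1M up to and including 10M
]
toy["amount_category"] = np.select(conditions, ["Low", "Medium"], default="High")
print(toy)
```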


In [447]:
def investor_count_func(name):
    # Number of comma-separated entries in the Investors_Name string
    return len(name.split(','))

df['investor_count'] = df['Investors_Name'].apply(investor_count_func)
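
Counting with `len(name.split(','))` treats every comma-separated piece as an investor, including empty pieces left by a trailing or doubled comma. Whether such entries occur in this dataset has not been checked here, so the following is only a sketch of a stricter count. `count_investors` and the sample strings are hypothetical.

```python
import pandas as pd


def count_investors(names: str) -> int:
    """Hypothetical helper: count the non-empty, comma-separated investor names."""
    return sum(1 for part in names.split(",") if part.strip())


# Made-up examples, including one with an empty entry and a trailing comma.
investors = pd.Series([
    "Sequoia Capital India, Accel Partners",
    "Tiger Global Management",
    "Kalaari Capital, , Ratan Tata,",
])

counts = investors.map(count_investors).tolist()
naive_counts = investors.map(lambda s: len(s.split(","))).tolist()
print(counts, naive_counts)
```

Comparing the two lists on the real column would show whether the distinction matters for this data.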

In [448]:
df.head()

Unnamed: 0,Date,Startup_Name,Industry_Vertical,SubVertical,City_Location,Investors_Name,Investment_Type,Amount_USD,Remarks,Year,amount_category,investor_count
0,2020-01-09,BYJU’S,EdTech,E-learning,Bangalore,Tiger Global Management,Private Equity,200000000.0,No Remark,2020,High,1
3,2020-01-02,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-Series A,3000000.0,No Remark,2020,Medium,1
4,2020-01-02,Fashor,Others,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Funding,1800000.0,No Remark,2020,Medium,1
1,2020-01-13,Shuttl,Others,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394.0,No Remark,2020,Medium,1
2,2020-01-09,Mamaearth,Others,Retailer of baby and toddler products,Bangalore,Sequoia Capital India,Series B,18358860.0,No Remark,2020,High,1


In [449]:
df.isnull().sum()

Unnamed: 0,0
Date,0
Startup_Name,0
Industry_Vertical,0
SubVertical,0
City_Location,0
Investors_Name,0
Investment_Type,0
Amount_USD,0
Remarks,0
Year,0


In [450]:
df.describe()

Unnamed: 0,Date,Amount_USD,Year,investor_count
count,2873,2873.0,2873.0,2873.0
mean,2016-10-26 20:11:26.668987136,15738870.0,2016.314306,1.92064
min,2015-04-03 00:00:00,18000.0,2015.0,1.0
25%,2015-12-11 00:00:00,1000000.0,2015.0,1.0
50%,2016-08-24 00:00:00,1529113.0,2016.0,1.0
75%,2017-07-03 00:00:00,10000000.0,2017.0,2.0
max,2020-01-13 00:00:00,3900000000.0,2020.0,10.0
std,,103159100.0,1.106952,1.422548
