<a href="https://colab.research.google.com/github/priyank21112000/Statistical-analysis-of-Startup-Data/blob/main/Statistical%20Analysis%20of%20Startup%20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Objective

The objective of this project is to employ various statistical methods to determine the factors that contribute to the success or failure of startups. By analyzing a dataset containing 48 relevant features, including funding rounds, industry type, and relationships, the project aims to identify significant predictors of startup success. This report highlights the statistical methods employed and provides insights into the key findings.

## Introduction

Startups play a significant role in economic growth, bringing new ideas, spurring innovation, and creating employment. However, they face high uncertainty, and failure rates can be high. As a result, predicting the success of a startup is of utmost importance for investors seeking to identify companies with the potential for rapid growth. This project aims to predict startup success by analyzing various factors, including funding rounds, industry type, and relationships, among others.

The dataset used in this project consists of 48 features, encompassing diverse aspects related to individual companies, investment insights, and industry trends. These features were carefully selected to capture the relevant information needed to predict startup success. The data was collected from various reliable sources and underwent preprocessing to ensure consistency and quality.

This project employs various statistical methods, including Comparing Two Samples, Analysis of Variance (ANOVA), Analysis of Categorical Data, Linear Regression, Resampling Methods, Linear Model Selection and Regularization, and Moving Beyond Linearity. Advanced regression techniques such as polynomial regression and decision trees are also employed.

The project's findings provide valuable insights into the factors that contribute to startup success and highlight the importance of analyzing various factors when predicting startup success. Overall, this project demonstrates the power of statistical analysis in identifying promising investment opportunities in the startup ecosystem.






## Data Description

The data used in this project was obtained from Crunchbase, a platform that tracks startup companies, their funding rounds, and industry trends. Each row describes one startup together with its eventual status (acquired or closed).

The dataset has 48 columns/features, including both quantitative and categorical variables. Some of the important features include age_first_funding_year, age_last_funding_year, relationships, funding_rounds, funding_total_usd, milestones, age_first_milestone_year, age_last_milestone_year, state_code, category_code, has_VC, has_angel, has_roundA, has_roundB, has_roundC, has_roundD, avg_participants, is_top500, and status.

Overall, the dataset provides a comprehensive view of startup companies and their characteristics. It includes information on the funding rounds, industry trends, and individual company information. The data is suitable for various statistical analyses, including comparing two samples, analysis of variance, analysis of categorical data, linear regression, resampling methods, linear model selection and regularization, and moving beyond linearity.

## Methodology

1. Shapiro-Wilk Test
2. Z-Test
3. ANOVA
4. Chi-Squared Test
5. K-Fold Cross-Validation

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [None]:
#Loading the dataset
df = pd.read_csv('startup_data.csv')

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,state_code,latitude,longitude,zip_code,id,city,Unnamed: 6,name,labels,...,object_id,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,1005,CA,42.35888,-71.05682,92101,c:6669,San Diego,,Bandsintown,1,...,c:6669,0,1,0,0,0,0,1.0,0,acquired
1,204,CA,37.238916,-121.973718,95032,c:16283,Los Gatos,,TriCipher,1,...,c:16283,1,0,0,1,1,1,4.75,1,acquired
2,1001,CA,32.901049,-117.192656,92121,c:65620,San Diego,San Diego CA 92121,Plixi,1,...,c:65620,0,0,1,0,0,0,4.0,1,acquired
3,738,CA,37.320309,-122.05004,95014,c:42668,Cupertino,Cupertino CA 95014,Solidcore Systems,1,...,c:42668,0,0,0,1,1,1,3.3333,1,acquired
4,1002,CA,37.779281,-122.419236,94105,c:65806,San Francisco,San Francisco CA 94105,Inhale Digital,0,...,c:65806,1,1,0,0,0,0,1.0,1,closed


In [None]:
df.describe()

Unnamed: 0.1,Unnamed: 0,latitude,longitude,labels,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,...,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500
count,923.0,923.0,923.0,923.0,923.0,923.0,771.0,771.0,923.0,923.0,...,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0
mean,572.297941,38.517442,-103.539212,0.646804,2.23563,3.931456,3.055353,4.754423,7.710726,2.310943,...,0.00325,0.32286,0.326111,0.254605,0.508126,0.392199,0.232936,0.099675,2.838586,0.809317
std,333.585431,3.741497,22.394167,0.478222,2.510449,2.96791,2.977057,3.212107,7.265776,1.390922,...,0.056949,0.467823,0.469042,0.435875,0.500205,0.488505,0.422931,0.299729,1.874601,0.393052
min,1.0,25.752358,-122.756956,0.0,-9.0466,-9.0466,-14.1699,-7.0055,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,283.5,37.388869,-122.198732,0.0,0.5767,1.66985,1.0,2.411,3.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5,1.0
50%,577.0,37.779281,-118.374037,1.0,1.4466,3.5288,2.5205,4.4767,5.0,2.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.5,1.0
75%,866.5,40.730646,-77.214731,1.0,3.57535,5.56025,4.6863,6.7534,10.0,3.0,...,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.8,1.0
max,1153.0,59.335232,18.057121,1.0,21.8959,21.8959,24.6849,24.6849,63.0,10.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,16.0,1.0


In [None]:
df.columns

Index(['Unnamed: 0', 'state_code', 'latitude', 'longitude', 'zip_code', 'id',
       'city', 'Unnamed: 6', 'name', 'labels', 'founded_at', 'closed_at',
       'first_funding_at', 'last_funding_at', 'age_first_funding_year',
       'age_last_funding_year', 'age_first_milestone_year',
       'age_last_milestone_year', 'relationships', 'funding_rounds',
       'funding_total_usd', 'milestones', 'state_code.1', 'is_CA', 'is_NY',
       'is_MA', 'is_TX', 'is_otherstate', 'category_code', 'is_software',
       'is_web', 'is_mobile', 'is_enterprise', 'is_advertising',
       'is_gamesvideo', 'is_ecommerce', 'is_biotech', 'is_consulting',
       'is_othercategory', 'object_id', 'has_VC', 'has_angel', 'has_roundA',
       'has_roundB', 'has_roundC', 'has_roundD', 'avg_participants',
       'is_top500', 'status'],
      dtype='object')

### Numerical Data

In [None]:
numeric=['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num=df.select_dtypes(include=numeric)
df_num.head(3)

Unnamed: 0.1,Unnamed: 0,latitude,longitude,labels,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,...,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500
0,1005,42.35888,-71.05682,1,2.2493,3.0027,4.6685,6.7041,3,3,...,0,1,0,1,0,0,0,0,1.0,0
1,204,37.238916,-121.973718,1,5.126,9.9973,7.0055,7.0055,9,4,...,0,0,1,0,0,1,1,1,4.75,1
2,1001,32.901049,-117.192656,1,1.0329,1.0329,1.4575,2.2055,5,1,...,0,0,0,0,1,0,0,0,4.0,1


### Categorical Data

In [None]:
df_cat=df.select_dtypes(include='object')
df_cat.head(3)

Unnamed: 0,state_code,zip_code,id,city,Unnamed: 6,name,founded_at,closed_at,first_funding_at,last_funding_at,state_code.1,category_code,object_id,status
0,CA,92101,c:6669,San Diego,,Bandsintown,1/1/2007,,4/1/2009,1/1/2010,CA,music,c:6669,acquired
1,CA,95032,c:16283,Los Gatos,,TriCipher,1/1/2000,,2/14/2005,12/28/2009,CA,enterprise,c:16283,acquired
2,CA,92121,c:65620,San Diego,San Diego CA 92121,Plixi,3/18/2009,,3/30/2010,3/30/2010,CA,web,c:65620,acquired


In [None]:
# Summary of the categorical columns: count, number of unique values, most frequent value (top) and its frequency (freq)
top_count = df.describe(include=["O"])
top_count.T.style.background_gradient()

Unnamed: 0,count,unique,top,freq
state_code,923,35,CA,488
zip_code,923,382,94107,30
id,923,922,c:28482,2
city,923,221,San Francisco,128
Unnamed: 6,430,252,San Francisco CA 94105,12
name,923,922,Redwood Systems,2
founded_at,923,217,1/1/2003,55
closed_at,335,202,6/1/2013,25
first_funding_at,923,585,1/1/2008,18
last_funding_at,923,680,1/1/2008,11


### Calculating Missing Values

In [None]:
null=pd.DataFrame(df.isnull().sum(),columns=["Null Values"])
null["% Missing Values"]=(df.isna().sum()/len(df)*100)
null = null[null["% Missing Values"] > 0]
null.style.background_gradient()

Unnamed: 0,Null Values,% Missing Values
Unnamed: 6,493,53.412784
closed_at,588,63.705309
age_first_milestone_year,152,16.468039
age_last_milestone_year,152,16.468039
state_code.1,1,0.108342


__Columns Unnamed: 6 and closed_at are each missing in more than 50% of the rows__

In [None]:
# Checking Missing Values Column
df[["Unnamed: 6", "closed_at", "age_first_milestone_year", "age_last_milestone_year", "state_code.1", "status"]].head(10)

Unnamed: 0,Unnamed: 6,closed_at,age_first_milestone_year,age_last_milestone_year,state_code.1,status
0,,,4.6685,6.7041,CA,acquired
1,,,7.0055,7.0055,CA,acquired
2,San Diego CA 92121,,1.4575,2.2055,CA,acquired
3,Cupertino CA 95014,,6.0027,6.0027,CA,acquired
4,San Francisco CA 94105,10/1/2012,0.0384,0.0384,CA,closed
5,Mountain View CA 94043,2/15/2009,5.0027,5.0027,CA,closed
6,,,3.0,6.6082,CA,acquired
7,,,5.6055,7.3616,CA,acquired
8,Williamstown MA 1267,,8.0055,9.9945,MA,acquired
9,,,2.9178,6.1151,CA,acquired


## Handling Missing Values

The analysis revealed that the column "Unnamed: 6" is a composite of several other columns: "city", "state_code", and "zip_code". As a solution, we overwrite the "Unnamed: 6" column by concatenating these related columns, which also fills in its missing values.

In [None]:
# Rebuilding the column as "<city> <state_code> <zip_code>" with vectorized string concatenation
df['Unnamed: 6'] = df['city'] + " " + df['state_code'] + " " + df['zip_code']
df.head()

Unnamed: 0.1,Unnamed: 0,state_code,latitude,longitude,zip_code,id,city,Unnamed: 6,name,labels,...,object_id,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,1005,CA,42.35888,-71.05682,92101,c:6669,San Diego,San Diego CA 92101,Bandsintown,1,...,c:6669,0,1,0,0,0,0,1.0,0,acquired
1,204,CA,37.238916,-121.973718,95032,c:16283,Los Gatos,Los Gatos CA 95032,TriCipher,1,...,c:16283,1,0,0,1,1,1,4.75,1,acquired
2,1001,CA,32.901049,-117.192656,92121,c:65620,San Diego,San Diego CA 92121,Plixi,1,...,c:65620,0,0,1,0,0,0,4.0,1,acquired
3,738,CA,37.320309,-122.05004,95014,c:42668,Cupertino,Cupertino CA 95014,Solidcore Systems,1,...,c:42668,0,0,0,1,1,1,3.3333,1,acquired
4,1002,CA,37.779281,-122.419236,94105,c:65806,San Francisco,San Francisco CA 94105,Inhale Digital,0,...,c:65806,1,1,0,0,0,0,1.0,1,closed


In [None]:
totalNull = df['Unnamed: 6'].isnull().sum()
print('Total Missing Values column "Unnamed: 6": ', totalNull)

Total Missing Values column "Unnamed: 6":  493


After analyzing the data, it was found that the columns 'age_first_milestone_year' and 'age_last_milestone_year' contain null values, which indicate that certain startups do not have any milestones. This was further confirmed by examining the 'milestones' column, where a value of 0 was accompanied by null values in 'age_first_milestone_year' and 'age_last_milestone_year'. Therefore, we decided to impute the null values in these columns with a value of 0.
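
The claim that missing milestone ages coincide with `milestones == 0` can be checked directly before imputing. The sketch below is only an illustration: it uses a small hypothetical frame (`toy`) in place of `df`. The column names match the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: values are invented for illustration only.
toy = pd.DataFrame({
    "milestones": [3, 0, 1, 0],
    "age_first_milestone_year": [4.67, np.nan, 1.46, np.nan],
    "age_last_milestone_year": [6.70, np.nan, 2.21, np.nan],
})

milestone_cols = ["age_first_milestone_year", "age_last_milestone_year"]

# True for rows where both milestone ages are missing.
missing_ages = toy[milestone_cols].isna().all(axis=1)

# Imputing 0 is only justified if missing ages occur exactly where milestones == 0.
nulls_match_zero = bool((missing_ages == (toy["milestones"] == 0)).all())
print("Missing ages coincide with zero milestones:", nulls_match_zero)

# Impute with the numeric 0 (not the string "0") so the columns stay float.
toy[milestone_cols] = toy[milestone_cols].fillna(0)
```

Running the same comparison on `df` would confirm or refute the assumption for the real data.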

In [None]:
# Filling with the numeric 0 (not the string "0") keeps both columns as floats
df['age_first_milestone_year'] = df['age_first_milestone_year'].fillna(value=0)
df['age_last_milestone_year'] = df['age_last_milestone_year'].fillna(value=0)

In [None]:
df.drop(["state_code.1"], axis=1, inplace=True)

In [None]:
#Checking Duplicate values
duplicate = df[df.duplicated()]
print("Duplicate Rows :")

Duplicate Rows :


In [None]:
#Checking for negative values
age = ["age_first_funding_year", "age_last_funding_year"]

for col in age:
    print("Any negative value in '{}' column  : {} ".format(col, (df[col] < 0).any()))

Any negative value in 'age_first_funding_year' column  : True 
Any negative value in 'age_last_funding_year' column  : True 


In [None]:
#Dropping negative values
df=df.drop(df[df.age_first_funding_year<0].index)
df=df.drop(df[df.age_last_funding_year<0].index)

## Shapiro-Wilk Test for Normality

The Shapiro-Wilk test is a statistical method used to assess whether a dataset follows a normal distribution. This report presents the implementation and interpretation of the Shapiro-Wilk test as part of a project aimed at analyzing the normality of a given dataset. The hypothesis testing framework is employed to determine if the data can be considered normally distributed.

The null hypothesis (Ho) states that the sample is drawn from a population that follows a normal distribution. The alternative hypothesis (Ha) suggests that the sample is not from a normal distribution.

__Ho__: The sample is from a normal distribution.
__Ha__: The sample is not from a normal distribution.

The Shapiro-Wilk test statistic is computed based on the ordered values of the dataset. It measures the degree of departure from normality. The test statistic is denoted as W.


The p-value is determined from the test statistic (W) and the sample size. It is the probability of obtaining a value of W at least as extreme as the one observed if the null hypothesis (Ho) is true. A p-value less than or equal to the significance level (typically 0.05) is taken as evidence to reject the null hypothesis.

After performing the Shapiro-Wilk test, the obtained p-value is compared to the significance level (α = 0.05).
If the p-value is greater than 0.05 (α), there is not sufficient evidence to reject the null hypothesis. In this case, the data are consistent with a normal distribution, although failing to reject Ho does not prove that the population is normal.
If the p-value is less than or equal to 0.05 (α), there is significant evidence to reject the null hypothesis. In this scenario, it can be concluded that the sample is not drawn from a population that follows a normal distribution.

In short, the Shapiro-Wilk test provides a statistical method to assess the normality of a dataset by comparing the obtained p-value to the significance level.
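
To make the decision rule concrete, the following sketch applies `scipy.stats.shapiro` to two synthetic samples, one drawn from a normal distribution and one from an exponential distribution. The samples, the seed, and the variable names are assumptions for illustration and are not part of the startup data.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples (assumptions, not the startup data).
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

alpha = 0.05
results = {}
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w_stat, p_value = shapiro(sample)
    results[name] = (w_stat, p_value)
    # Apply the decision rule described above.
    decision = "reject Ho" if p_value <= alpha else "fail to reject Ho"
    print(f"{name}: W={w_stat:.4f}, p={p_value:.3g} -> {decision}")
```

A strongly right-skewed sample such as the exponential one typically yields a W well below 1 and a very small p-value, which is the same pattern seen for funding_total_usd.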

In [None]:
from scipy.stats import shapiro
shapiro(df['funding_total_usd'])

ShapiroResult(statistic=0.05385714769363403, pvalue=0.0)

In [None]:
shapiro(df['funding_rounds'])

ShapiroResult(statistic=0.8272395133972168, pvalue=1.0356720625546895e-29)

__Results:__ The null hypothesis is rejected for both samples, so neither funding_total_usd nor funding_rounds can be considered normally distributed <br>

## Comparing two samples

Comparing two samples in statistical methods refers to a technique of analyzing two sets of data to determine if they are significantly different from each other. The goal of comparing two samples is to determine whether there is a significant difference between two groups or populations based on the data collected from each group.  
Some common statistical methods used for comparing two samples:

1. __Independent samples t-test:__ This test is used when the two samples being compared are independent of each other. It is used to compare the means of the two samples to determine if they are significantly different from each other.

2. __Paired samples t-test:__ This test is used when the two samples being compared are dependent on each other. It is used to compare the means of the two samples that are related in some way, such as before and after measurements or matched samples.

3. __Analysis of variance (ANOVA):__ ANOVA is used to compare means across two or more groups. It determines if there is a significant difference between the means of the groups.

4. __Mann-Whitney U test:__ This test is a nonparametric test used to compare two independent samples. It tests whether values in one sample tend to be larger than values in the other, which is often interpreted as a difference in medians.

5. __Wilcoxon signed-rank test:__ This test is a nonparametric test used to compare two dependent samples. It compares the medians of the two related samples to determine if they are significantly different from each other.

6. __Chi-square test:__ This test is used to compare the distribution of categorical data between two groups. It is used to determine if there is a significant difference between the proportions of the two groups.
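
Because the Shapiro-Wilk results above indicate non-normal data, a nonparametric comparison such as the Mann-Whitney U test is a natural candidate from this list. The sketch below shows the call on synthetic right-skewed samples; `group_a` and `group_b` are hypothetical stand-ins for two funding groups and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)  # fixed seed for repeatability

# Synthetic right-skewed samples (assumptions, not the startup data).
group_a = rng.lognormal(mean=1.0, sigma=1.0, size=200)
group_b = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# Two-sided test: Ho says neither group tends to have larger values.
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U={u_stat:.1f}, p={p_value:.3g}")
```

With the real data, the two groups would be the `funding_total_usd` values of closed and non-closed startups.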

In [None]:
from scipy.stats import norm

# Filtering the dataset to only include successful and failed startups
df = df[df['status'].isin(['acquired', 'ipo', 'closed'])]

# Separating the funding total usd data into two groups
success_funding = df[df['status'] != 'closed']['funding_total_usd']
failure_funding = df[df['status'] == 'closed']['funding_total_usd']

# Calculating the mean and standard deviation for each group
success_mean = success_funding.mean()
failure_mean = failure_funding.mean()
success_std = success_funding.std()
failure_std = failure_funding.std()

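# Calculating the pooled standard deviation as the root of the average of the two variances
# (this simple average weights both groups equally, regardless of their sample sizes)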
pooled_std = np.sqrt((success_std**2 + failure_std**2) / 2)

# Calculating the z-score
z_score = (success_mean - failure_mean) / (pooled_std * np.sqrt(1/len(success_funding) + 1/len(failure_funding)))

# Calculating the p-value
p_value = norm.sf(abs(z_score)) * 2

# Setting the significance level
alpha = 0.05

# State the null and alternative hypotheses
# Null hypothesis (Ho): There is no significant difference in funding total usd between successful and failed startups
# Alternative hypothesis (Ha): There is a significant difference in funding total usd between successful and failed startups

# Checking if the p-value is less than the significance level
if p_value < alpha:
    print('Reject the null hypothesis (Ho)')
    print('There is a significant difference in funding total usd between successful and failed startups')
else:
    print('Fail to reject the null hypothesis (Ho)')
    print('There is no significant difference in funding total usd between successful and failed startups')

Fail to reject the null hypothesis (Ho)
There is no significant difference in funding total usd between successful and failed startups


### Important finding: mean total funding does not differ significantly between successful and failed startups

Note that this implementation assumes a two-tailed test (i.e. Ha is a two-sided alternative hypothesis).
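
The cell above averages the two group variances before scaling by the sample sizes. A common alternative, sketched here as an assumption rather than as the method used in this project, is the unpooled standard error, in which each variance is divided by its own sample size. The function name `two_sample_z` and the input values are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def two_sample_z(x, y):
    """Two-sided z-test for a difference in means with an unpooled standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standard error of the difference: each variance is scaled by its own sample size.
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    p = 2 * norm.sf(abs(z))  # two-tailed p-value
    return z, p


# Hypothetical values, for illustration only.
z, p = two_sample_z([10.0, 12.0, 9.0, 11.0, 13.0], [8.0, 7.0, 9.0, 10.0])
print(f"z={z:.3f}, p={p:.3f}")
```

The two formulas agree when both groups have the same size; they diverge when group sizes and variances differ, as they may for acquired and closed startups.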

In [None]:
#ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols


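# Loading the raw dataset; `startup_data` is otherwise only defined in the Resampling Methods section
startup_data = pd.read_csv('startup_data.csv')

# Specifying ANOVA model (rows with missing values are dropped by the formula interface)
model = ols('age_first_funding_year ~ age_last_funding_year + age_first_milestone_year', data=startup_data).fit()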

# Performing ANOVA test
anova_table = sm.stats.anova_lm(model, typ=2)

# Printing ANOVA table
print(anova_table)

                               sum_sq     df           F        PR(>F)
age_last_funding_year      797.641156    1.0  334.194243  2.941334e-62
age_first_milestone_year    64.804423    1.0   27.151639  2.419163e-07
Residual                  1833.031000  768.0         NaN           NaN


From the ANOVA table, we can see that the F-statistics for both predictors, age_last_funding_year and age_first_milestone_year, are statistically significant, with p-values far below the significance level of 0.05. __This suggests that both variables are significantly associated with age_first_funding_year.__

Overall, the ANOVA table suggests that there is a significant relationship between age_last_funding_year, age_first_milestone_year, and age_first_funding_year in the startup dataset.
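
The table above comes from a regression with continuous predictors. For the group-comparison form of ANOVA described earlier (comparing means across several groups), a minimal sketch with `scipy.stats.f_oneway` could look as follows. The three groups and their values are hypothetical and do not come from the dataset.

```python
from scipy.stats import f_oneway

# Hypothetical ages at first funding (years) for three made-up categories.
software = [1.2, 2.5, 0.8, 3.1, 1.9]
biotech = [3.4, 4.1, 2.9, 5.0, 3.8]
mobile = [0.9, 1.4, 2.0, 1.1, 1.6]

# Ho: all group means are equal; Ha: at least one group mean differs.
f_stat, p_value = f_oneway(software, biotech, mobile)
print(f"F={f_stat:.3f}, p={p_value:.4f}")
```

On the real data, the groups could be formed by splitting `age_first_funding_year` by `category_code`.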

## Categorical Data Analysis

There are several types of tests used for categorical data analysis. Here are some of the most commonly used ones along with their examples:

1. __Chi-Square Test__: This test is used to determine whether there is a significant association between two categorical variables. For example, if we want to determine whether there is a relationship between gender and political affiliation, we could use a chi-square test.<br>

2. __Fisher's Exact Test__: This test is used when the sample size is small or when the expected cell counts are less than 5. For example, if we want to determine whether there is a relationship between smoking and lung cancer in a small sample of 20 people, we could use Fisher's exact test.<br>

3. __McNemar's Test__: This test is used to determine whether there is a significant difference between two paired (related) binary measurements. For example, if we want to determine whether the proportion of people who prefer coffee over tea changes when the same people are surveyed before and after a tasting session, we could use McNemar's test.<br>

4. __Cochran's Q Test__: This test extends McNemar's test to three or more related binary measurements. For example, if the same group of people is asked whether they like coffee, tea, and soda (yes/no for each), we could use Cochran's Q test to determine whether the proportion of "yes" answers differs across the three drinks.<br>

5. __G-test__: This likelihood-ratio test compares the observed frequencies of a categorical variable to the expected frequencies and is an alternative to the chi-square test. For example, if we want to determine whether the observed counts of tea and coffee drinkers among men and women differ from those expected under independence, we could use a G-test.<br>

These tests are used to analyze categorical data and determine whether there is a significant association or difference between the variables. The choice of the test depends on the type of data and the research question.

__Problem Statement__ - State (categorical) vs. Status (acquired/closed) (categorical): to determine if there is a significant association between the state in which a startup operates and its ultimate status as either acquired or closed. <br>

__Ho__ - There is no association between the state in which a startup operates and its ultimate status as either acquired or closed<br>
__Ha__ - There is association between the state in which a startup operates and its ultimate status as either acquired or closed

Considering alpha = 0.05

Based on the given problem statement we will be using the __chi-squared test__

A chi-squared test is a statistical test used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

It is typically used when we want to test the independence of two categorical variables or to test the goodness-of-fit of a model to a set of observed data.

The chi-square test of homogeneity tests whether different columns (or rows) of data in a table come from the same population or not (i.e., whether the differences are consistent with being explained by sampling error alone).
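
The comparison of observed and expected frequencies can be worked through by hand. Under independence, the expected count in each cell is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below uses a hypothetical 3x2 table (not the state/status table from this project) and cross-checks the manual result against `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 3x2 table: rows are states, columns are (acquired, closed).
observed = np.array([[30, 10],
                     [20, 20],
                     [10, 30]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Chi-squared statistic, degrees of freedom, and p-value computed by hand.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

# Cross-check against SciPy (correction=False disables the Yates continuity correction).
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed, correction=False)
print(f"manual: chi2={chi2_manual:.3f}, p={p_manual:.3g}, dof={dof}")
print(f"scipy:  chi2={chi2_scipy:.3f}, p={p_scipy:.3g}, dof={dof_scipy}")
```

The `expected` array is also worth inspecting on real data, since the chi-squared approximation is usually considered less reliable when many expected counts are small.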

In [None]:
from scipy.stats import chi2_contingency

In [None]:
# Subset the relevant columns from the dataframe
state_status_df = df[["state_code", "status"]]
state_status_df

Unnamed: 0,state_code,status
0,CA,acquired
1,CA,acquired
2,CA,acquired
3,CA,acquired
4,CA,closed
...,...,...
918,CA,acquired
919,MA,closed
920,CA,closed
921,CA,acquired


In [None]:
# Creating a contingency table to represent the frequency of each combination of State and Status
cat_table = pd.crosstab(state_status_df["state_code"], state_status_df["status"])
cat_table

status,acquired,closed
state_code,Unnamed: 1_level_1,Unnamed: 2_level_1
AR,0,1
AZ,1,1
CA,316,147
CO,14,5
CT,0,3
DC,2,1
FL,2,4
GA,6,5
ID,0,1
IL,9,9


In [None]:
# Performing the chi-squared test and print the results
chi2_stat, p_val, dof, ex = chi2_contingency(cat_table)
print("Chi-squared statistic:", chi2_stat)
print("p-value:", p_val)
print("Degrees of freedom:", dof)

Chi-squared statistic: 68.37338151633624
p-value: 0.000427215077337809
Degrees of freedom: 34


__Conclusion__ - Since the p-value is less than the chosen significance level of 0.05, we can reject the null hypothesis that there is no association between the state in which a startup operates and its ultimate status as either acquired or closed. <br>

__Therefore, we can conclude that there is a significant association between the state and status of a startup.__

## Resampling Methods

Resampling methods are a class of statistical techniques that involve repeatedly drawing samples from a dataset in order to perform inference or estimate the properties of a statistical model. These methods are useful in situations where traditional statistical methods may not be applicable, such as when the data violates certain assumptions or when the sample size is small.



__Cross-validation__: Cross-validation is a technique used to evaluate the performance of a model by partitioning the data into subsets, called folds, and then using each fold in turn as a validation set while the other folds are used to train the model. The results across folds are then averaged. Because every prediction is scored on data the model did not see during training, this helps to detect overfitting and gives a more reliable estimate of the model's performance. <br>

__Bootstrapping__: Bootstrapping is a resampling method used to estimate the variability of a statistic or a parameter by resampling the data with replacement. The idea is to create multiple samples of the same size as the original dataset, drawn from the original dataset with replacement. The statistic or parameter of interest is then computed for each sample, and the distribution of these estimates is used to estimate the variability of the original statistic or parameter. <br>

__Jackknife__: Jackknife is a resampling method used to estimate the bias and variance of a statistic or a parameter by repeatedly leaving one observation out of the dataset and computing the statistic or parameter of interest for each reduced dataset. The estimates are then averaged, and the bias and variance are estimated from the differences between the average and the full dataset estimate.<br>

__Monte Carlo simulation__: Monte Carlo simulation is a method used to estimate the distribution of a variable or the performance of a model by generating random samples from a probability distribution or a model. The idea is to simulate the data many times, using different random draws each time, and then compute the desired statistic or parameter for each simulated dataset. The distribution of these estimates provides an estimate of the distribution of the original variable or the performance of the model.
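
As an illustration of the bootstrap described above, the following sketch estimates a 95% percentile confidence interval for a sample mean. The data are synthetic, and the sample size, seed, and number of resamples are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for repeatability

# Synthetic right-skewed sample (an assumption, not the startup data).
sample = rng.lognormal(mean=0.0, sigma=1.0, size=300)

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, keeping the original sample size.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile bootstrap interval: middle 95% of the resampled means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean={sample.mean():.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

The same loop works for other statistics, such as the median, by replacing `resample.mean()`.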

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

In [None]:
startup_data = pd.read_csv('startup_data.csv')

### One Hot Encoding categorical features

In [None]:
# Selecting categorical features to encode
cat_cols = ['state_code','zip_code','id','city','Unnamed: 6','name','founded_at','closed_at', 'first_funding_at','last_funding_at','state_code.1','category_code','object_id']

# Using get_dummies to one-hot encode the categorical features
encoded_data = pd.get_dummies(startup_data, columns=cat_cols)

encoded_data = encoded_data.dropna()

# Print the first few rows of the encoded dataset
print(encoded_data.head())

   Unnamed: 0   latitude   longitude  labels  age_first_funding_year  \
0        1005  42.358880  -71.056820       1                  2.2493   
1         204  37.238916 -121.973718       1                  5.1260   
2        1001  32.901049 -117.192656       1                  1.0329   
3         738  37.320309 -122.050040       1                  3.1315   
4        1002  37.779281 -122.419236       0                  0.0000   

   age_last_funding_year  age_first_milestone_year  age_last_milestone_year  \
0                 3.0027                    4.6685                   6.7041   
1                 9.9973                    7.0055                   7.0055   
2                 1.0329                    1.4575                   2.2055   
3                 5.3151                    6.0027                   6.0027   
4                 1.6685                    0.0384                   0.0384   

   relationships  funding_rounds  ...  object_id_c:9669  object_id_c:967  \
0              3

In [None]:
# Splitting the data into features (X) and target (y)
X = encoded_data.drop('status', axis=1)
y = encoded_data['status']

In [None]:
X  # Feature matrix: every column except the target 'status'

Unnamed: 0.1,Unnamed: 0,latitude,longitude,labels,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,...,object_id_c:9669,object_id_c:967,object_id_c:971,object_id_c:9730,object_id_c:975,object_id_c:9772,object_id_c:9782,object_id_c:9816,object_id_c:9822,object_id_c:997
0,1005,42.358880,-71.056820,1,2.2493,3.0027,4.6685,6.7041,3,3,...,0,0,0,0,0,0,0,0,0,0
1,204,37.238916,-121.973718,1,5.1260,9.9973,7.0055,7.0055,9,4,...,0,0,0,0,0,0,0,0,0,0
2,1001,32.901049,-117.192656,1,1.0329,1.0329,1.4575,2.2055,5,1,...,0,0,0,0,0,0,0,0,0,0
3,738,37.320309,-122.050040,1,3.1315,5.3151,6.0027,6.0027,5,3,...,0,0,0,0,0,0,0,0,0,0
4,1002,37.779281,-122.419236,0,0.0000,1.6685,0.0384,0.0384,2,2,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,352,37.740594,-122.376471,1,0.5178,0.5178,0.5808,4.5260,9,1,...,0,0,0,0,0,0,0,0,0,0
919,721,42.504817,-71.195611,0,7.2521,9.2274,6.0027,6.0027,1,3,...,0,0,0,0,0,0,0,0,0,0
920,557,37.408261,-122.015920,0,8.4959,8.4959,9.0055,9.0055,5,1,...,0,0,0,0,0,0,0,0,0,0
921,589,37.556732,-122.288378,1,0.7589,2.8329,0.7589,3.8356,12,2,...,0,0,0,0,0,0,0,0,0,0


In [None]:
y #Target column

0      acquired
1      acquired
2      acquired
3      acquired
4        closed
         ...   
918    acquired
919      closed
920      closed
921    acquired
922    acquired
Name: status, Length: 771, dtype: object

In [None]:
# Defining the logistic regression model
logistic_regression = LogisticRegression()

In [None]:
# Defining the k-fold cross-validation object
kfold = KFold(n_splits=10, shuffle=True, random_state=21)
# Defining a list to store the evaluation results
results = []

In [None]:
# Performing k-fold cross-validation
for train_idx, test_idx in kfold.split(X):
    # Split the data into training and testing sets
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
    # Fit the logistic regression model on the training set
    logistic_regression.fit(X_train, y_train)

    # Predict the target variable on the testing set
    y_pred = logistic_regression.predict(X_test)

    # Evaluate the accuracy of the predictions
    accuracy = accuracy_score(y_test, y_pred)

    # Store the evaluation result
    results.append(accuracy)

In [None]:
# Calculating the mean and standard deviation of the evaluation results
mean_accuracy = np.mean(results)
std_accuracy = np.std(results)

In [None]:
# Printing the mean and standard deviation of the evaluation results
print('Mean accuracy:', mean_accuracy)
print('Standard deviation of accuracy:', std_accuracy)

Mean accuracy: 0.7133366633366635
Standard deviation of accuracy: 0.03842873258716713
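
The manual loop above can also be written more compactly with `cross_val_score`. The sketch below is an alternative formulation under stated assumptions, not the procedure that produced the reported accuracy: it uses synthetic data from `make_classification`, scales the features inside a pipeline, and uses stratified folds so each fold keeps the class proportions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (an assumption, not the startup data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=21)

# Scaling inside the pipeline means the scaler is fitted on the training folds only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 10-fold split, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Mean accuracy:", scores.mean())
print("Standard deviation of accuracy:", scores.std())
```

Keeping preprocessing inside the pipeline avoids leaking information from the validation fold into the fitted scaler.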


## Conclusion
In conclusion, this project aimed to predict the success or failure of a startup based on a dataset containing industry trends, investment insights, and individual company information. To this end, we applied the Shapiro-Wilk test, a two-sample z-test, ANOVA, a chi-squared test of independence, and logistic regression evaluated with 10-fold cross-validation.

Our analysis showed that funding_total_usd and funding_rounds are not normally distributed, and that mean total funding does not differ significantly between successful and failed startups. The ANOVA table indicated that age_last_funding_year and age_first_milestone_year are significantly related to age_first_funding_year, and the chi-squared test found a significant association between state and status (p ≈ 0.0004). The logistic regression model reached a mean cross-validated accuracy of about 71%, with a standard deviation of about 4% across folds.

Overall, our analysis demonstrates the power of statistical methods in predicting the success or failure of a startup. By employing these methods, investors can make more informed decisions and increase their chances of success in the highly competitive world of startups.