This dataset is a collection of attributes of startup companies, each labeled with an outcome (acquired or closed) that serves as the measure of success. It can be found at https://www.kaggle.com/datasets/manishkc06/startup-success-prediction, with data provided by Ramkishan Panthena.

In this notebook, the dataset is modeled with tree-based classifiers, starting with a basic decision tree classifier and then using a grid search to tune the hyperparameters of a random forest model.

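Models will be scored by accuracy, F1 score, and ROC AUC.
* Accuracy is the fraction of observations that are classified correctly.
* F1 score is the harmonic mean of precision and recall, which makes it more informative than accuracy when the classes are imbalanced.
* ROC AUC is the area under the curve of true positive rate against false positive rate across all classification thresholds; 0.5 corresponds to random guessing and 1.0 to a perfect ranking.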

For all three measures, the threshold for 'acceptable' is set at 0.7, with higher values being more desirable. The achievable maximum, however, depends on how much predictive signal the dataset actually contains.
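
As a quick illustration of how the three metrics differ, here is a minimal sketch on a handful of made-up labels and predicted probabilities. The values are illustrative only and are not drawn from the startup dataset; the 0.5 cut-off used to turn probabilities into class predictions is also an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative, not from the dataset).
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]

# Hard class predictions at an assumed 0.5 cut-off.
y_pred = [int(p >= 0.5) for p in y_prob]

acc = accuracy_score(y_true, y_pred)   # fraction classified correctly
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)    # uses the probabilities, not the cut-off

print(f'Accuracy: {acc:.3f}\nF1 score: {f1:.3f}\nROC AUC: {auc:.3f}')
```

Accuracy and F1 depend on the chosen cut-off, whereas ROC AUC is computed from the ranking of the probabilities and so summarizes performance across all cut-offs.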

In [83]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

In [63]:
df = pd.read_csv('startup_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,state_code,latitude,longitude,zip_code,id,city,Unnamed: 6,name,labels,...,object_id,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,1005,CA,42.35888,-71.05682,92101,c:6669,San Diego,,Bandsintown,1,...,c:6669,0,1,0,0,0,0,1.0,0,acquired
1,204,CA,37.238916,-121.973718,95032,c:16283,Los Gatos,,TriCipher,1,...,c:16283,1,0,0,1,1,1,4.75,1,acquired
2,1001,CA,32.901049,-117.192656,92121,c:65620,San Diego,San Diego CA 92121,Plixi,1,...,c:65620,0,0,1,0,0,0,4.0,1,acquired
3,738,CA,37.320309,-122.05004,95014,c:42668,Cupertino,Cupertino CA 95014,Solidcore Systems,1,...,c:42668,0,0,0,1,1,1,3.3333,1,acquired
4,1002,CA,37.779281,-122.419236,94105,c:65806,San Francisco,San Francisco CA 94105,Inhale Digital,0,...,c:65806,1,1,0,0,0,0,1.0,1,closed


Many of these columns are categorical or otherwise string-valued and cannot be fed into the model directly. At this stage, they will be dropped, along with redundant attributes (such as the multiple location columns) and irrelevant ones.

Since the target variable is categorical, it will be converted to a binary integer (1 for acquired, 0 for closed).

In [64]:
# Mapping the target variable to integer values. 
df['status'] = df['status'].map({'acquired': 1, 'closed': 0})

# Dropping the object-type attributes. 
objects = df.select_dtypes(include=['object']).columns
others = ['age_first_milestone_year', 'age_last_milestone_year', 'Unnamed: 0', 'labels', 'latitude', 'longitude']
df.drop(objects, axis=1, inplace=True)
df.drop(others, axis=1, inplace=True)
print(df.shape)
df.head()

(923, 30)


Unnamed: 0,age_first_funding_year,age_last_funding_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,...,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,2.2493,3.0027,3,3,375000,3,1,0,0,0,...,1,0,1,0,0,0,0,1.0,0,1
1,5.126,9.9973,9,4,40100000,1,1,0,0,0,...,0,1,0,0,1,1,1,4.75,1,1
2,1.0329,1.0329,5,1,2600000,2,1,0,0,0,...,0,0,0,1,0,0,0,4.0,1,1
3,3.1315,5.3151,5,3,40000000,1,1,0,0,0,...,0,0,0,0,1,1,1,3.3333,1,1
4,0.0,1.6685,2,2,1300000,1,1,0,0,0,...,0,1,1,0,0,0,0,1.0,1,0
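
One caveat with `Series.map` and a dictionary is that any label missing from the dictionary becomes `NaN`, so it can be worth checking the mapped target before modeling. The sketch below uses a toy series; the `'operating'` label is hypothetical and is included only to show the behavior.

```python
import pandas as pd

# Toy series; 'operating' is a hypothetical label the mapping does not cover.
status = pd.Series(['acquired', 'closed', 'operating', 'acquired'])
mapped = status.map({'acquired': 1, 'closed': 0})

# Labels absent from the dictionary come back as NaN.
n_unmapped = int(mapped.isna().sum())
print(f'Unmapped labels: {n_unmapped}')

# Class balance among the labels that did map.
balance = mapped.dropna().astype(int).value_counts(normalize=True)
print(balance)
```

The same two checks could be run on `df['status']` right after the mapping step; the class balance is also useful context when reading the F1 scores later on.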


In [72]:
# Splitting the dataset into features (X) and a 1-D target array (y).
X, y = df.iloc[:, :-1], df.iloc[:, -1].to_numpy()
print(X.shape)
print(y.shape)

(923, 29)
(923,)


Below, a basic decision tree classifier is fitted as a baseline against which later models are compared.

In [78]:
base = DecisionTreeClassifier()
# Performing cross-validation
scores = cross_validate(base, X, y, cv=5, scoring=('accuracy', 'f1', 'roc_auc'), return_train_score=True)
test_accuracy = np.mean(scores['test_accuracy'])
test_f1 = np.mean(scores['test_f1'])
test_roc = np.mean(scores['test_roc_auc'])
print(f'Test accuracy: {test_accuracy}\nTest F1 score: {test_f1}\nTest ROC AUC: {test_roc}')

Test accuracy: 0.6836251468860165
Test F1 score: 0.7526811774804638
Test ROC AUC: 0.6579291296938355


Only the F1 score clears the 0.7 threshold; accuracy and ROC AUC fall just short of it. However, simply switching to a random forest classifier, as below, raises all three scores by roughly 0.10 to 0.15.

In [74]:
mod = RandomForestClassifier()
# Performing cross-validation
scores = cross_validate(mod, X, y, cv=5, scoring=('accuracy', 'f1', 'roc_auc'), return_train_score=True)
test_accuracy = np.mean(scores['test_accuracy'])
test_f1 = np.mean(scores['test_f1'])
test_roc = np.mean(scores['test_roc_auc'])
print(f'Test accuracy: {test_accuracy}\nTest F1 score: {test_f1}\nTest ROC AUC: {test_roc}')

Test accuracy: 0.7909048178613396
Test F1 score: 0.8489618790615576
Test ROC AUC: 0.8122198879551821


Both models above were evaluated with 5-fold cross-validation (cv=5), and the reported test accuracy, F1 score, and ROC AUC are the means across the five folds.
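
When `cv` is given as an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling. The sketch below makes the splitter explicit and fixes the random seeds so that the scores are reproducible from run to run. It runs on synthetic data from `make_classification` as a stand-in for `X` and `y`; the seed values and `n_estimators=50` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the notebook's X and y (not the startup data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Explicit stratified 5-fold splitter with shuffling and a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores_demo = cross_validate(model, X_demo, y_demo, cv=cv,
                             scoring=('accuracy', 'f1', 'roc_auc'))

# Report the mean and the spread across folds for each metric.
for name in ('test_accuracy', 'test_f1', 'test_roc_auc'):
    fold_scores = scores_demo[name]
    print(f'{name}: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}')
```

Reporting the standard deviation alongside the mean gives a rough sense of whether a difference between two models is larger than the fold-to-fold variation.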

In an effort to improve the predictive power of the model, a grid search was conducted over hyperparameter ranges close to the default or typical values.
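
Before running an exhaustive search, it can help to count how many model fits it implies. The sketch below sizes the same ranges used in the next cell with `ParameterGrid`; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same ranges as the grid search cell that follows.
param_grid_demo = {
    'n_estimators': np.arange(80, 110),     # 30 values
    'max_depth': np.arange(3, 15),          # 12 values
    'min_samples_leaf': np.arange(1, 15),   # 14 values
}

n_candidates = len(ParameterGrid(param_grid_demo))
n_folds = 3  # matches cv=3 in the grid search
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidate settings, {n_fits} cross-validation fits')
```

With this many fits, the search can take a while on a single core; passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel.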

In [34]:
parameters = {'n_estimators': np.arange(80,110), 'max_depth': np.arange(3, 15), 'min_samples_leaf': np.arange(1,15)}
mod = RandomForestClassifier()
grid = GridSearchCV(mod, parameters, cv=3)
grid.fit(X, y)

In [75]:
print(grid.best_params_)
print(grid.best_score_)

{'max_depth': 7, 'min_samples_leaf': 3, 'n_estimators': 85}
0.7974040075017274


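As seen, this accuracy (the default score `GridSearchCV` reports for a classifier) is not much of an improvement over the untuned random forest model. Note that the search used 3-fold rather than 5-fold cross-validation, so the two figures are not strictly comparable. Nevertheless, these calculated optimal parameters will be passed to a new model.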

In [76]:
mod_params = RandomForestClassifier(max_depth=7, min_samples_leaf=3, n_estimators=85)
# Performing cross-validation
scores_params = cross_validate(mod_params, X, y, cv=5, scoring=('accuracy', 'f1', 'roc_auc'),
                               return_train_score=True)
test_accuracy_params = np.mean(scores_params['test_accuracy'])
test_f1_params = np.mean(scores_params['test_f1'])
test_roc_params = np.mean(scores_params['test_roc_auc'])
print(f'Test accuracy: {test_accuracy_params}\nTest F1 score: {test_f1_params}\nTest ROC AUC: {test_roc_params}')

Test accuracy: 0.7919682726204466
Test F1 score: 0.8537232602903609
Test ROC AUC: 0.8228548706195766


In summary, random forest provides substantially more predictive power than the single decision tree classifier, improving every score by roughly 0.10 to 0.15. A graphical representation of the base decision tree was too large to display here, but can be produced if desired.

The hyperparameters obtained from the grid search yielded only a marginal improvement over the default random forest (about 0.01 or less on each metric), which is small enough that it may simply reflect run-to-run variation, since no random seed was fixed. Both random forests scored comparably, and both scored higher than any of the prior logistic regression models. This comes, however, at the cost of interpretability.
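
The comparison can be made concrete by tabulating the differences between the scores printed above. The sketch below hard-codes those printed values, rounded to four decimals; the dictionary keys and helper name are chosen here for illustration.

```python
# Mean test scores as printed in the cells above, rounded to four decimals.
reported = {
    'decision tree': {'accuracy': 0.6836, 'f1': 0.7527, 'roc_auc': 0.6579},
    'random forest (default)': {'accuracy': 0.7909, 'f1': 0.8490, 'roc_auc': 0.8122},
    'random forest (tuned)': {'accuracy': 0.7920, 'f1': 0.8537, 'roc_auc': 0.8229},
}

def gains(new, old):
    """Per-metric score difference between two models in `reported`."""
    return {m: reported[new][m] - reported[old][m] for m in reported[new]}

gain_forest = gains('random forest (default)', 'decision tree')
gain_tuned = gains('random forest (tuned)', 'random forest (default)')

for label, g in (('forest vs. tree', gain_forest), ('tuned vs. default', gain_tuned)):
    print(label + ': ' + ', '.join(f'{m} {d:+.4f}' for m, d in g.items()))
```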