This dataset collects attributes of startup companies together with a target variable that records each company's success. It can be found at https://www.kaggle.com/datasets/manishkc06/startup-success-prediction, with data provided by Ramkishan Panthena.

In this notebook, the dataset is processed minimally and then passed to a scikit-learn logistic regression model with default settings. The purpose is to establish a baseline that can guide later EDA and serve as a comparison for later models.

In [25]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

In [26]:
df = pd.read_csv('startup_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,state_code,latitude,longitude,zip_code,id,city,Unnamed: 6,name,labels,...,object_id,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status
0,1005,CA,42.35888,-71.05682,92101,c:6669,San Diego,,Bandsintown,1,...,c:6669,0,1,0,0,0,0,1.0,0,acquired
1,204,CA,37.238916,-121.973718,95032,c:16283,Los Gatos,,TriCipher,1,...,c:16283,1,0,0,1,1,1,4.75,1,acquired
2,1001,CA,32.901049,-117.192656,92121,c:65620,San Diego,San Diego CA 92121,Plixi,1,...,c:65620,0,0,1,0,0,0,4.0,1,acquired
3,738,CA,37.320309,-122.05004,95014,c:42668,Cupertino,Cupertino CA 95014,Solidcore Systems,1,...,c:42668,0,0,0,1,1,1,3.3333,1,acquired
4,1002,CA,37.779281,-122.419236,94105,c:65806,San Francisco,San Francisco CA 94105,Inhale Digital,0,...,c:65806,1,1,0,0,0,0,1.0,1,closed


Many of these columns are categorical and cannot be fed into the model directly. Later, they will be one-hot encoded and added to the model; at this stage, they are dropped. Redundant attributes (such as multiple location attributes) and irrelevant attributes are dropped as well.

Since the target variable is categorical, it will be converted to a numeric binary.

In [27]:
# One-hot encode the target and keep only status_acquired (1 = acquired)
df = pd.get_dummies(df, columns=['status'])
df.drop('status_closed', axis=1, inplace=True)

# Drop every string-typed column, plus redundant or irrelevant numeric columns
objects = df.select_dtypes(include=['object']).columns
others = ['age_first_milestone_year', 'age_last_milestone_year', 'Unnamed: 0', 'labels', 'latitude', 'longitude']
df.drop(objects, axis=1, inplace=True)
df.drop(others, axis=1, inplace=True)
print(df.shape)
df.head()

(923, 30)


Unnamed: 0,age_first_funding_year,age_last_funding_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,...,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status_acquired
0,2.2493,3.0027,3,3,375000,3,1,0,0,0,...,1,0,1,0,0,0,0,1.0,0,1
1,5.126,9.9973,9,4,40100000,1,1,0,0,0,...,0,1,0,0,1,1,1,4.75,1,1
2,1.0329,1.0329,5,1,2600000,2,1,0,0,0,...,0,0,0,1,0,0,0,4.0,1,1
3,3.1315,5.3151,5,3,40000000,1,1,0,0,0,...,0,0,0,0,1,1,1,3.3333,1,1
4,0.0,1.6685,2,2,1300000,1,1,0,0,0,...,0,1,1,0,0,0,0,1.0,1,0
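The target conversion described above can also be written as an explicit comparison, which avoids creating and then dropping the `status_closed` column. The following is a minimal sketch on a tiny hypothetical frame (`demo` is not part of the dataset), assuming `status` only takes the values `acquired` and `closed`, as the rows shown above suggest.

```python
import pandas as pd

# Hypothetical three-row frame that mimics only the `status` column.
demo = pd.DataFrame({"status": ["acquired", "closed", "acquired"]})

# 1 where the company was acquired, 0 otherwise.
demo["status_acquired"] = (demo["status"] == "acquired").astype(int)

# Remove the original string column once the binary target exists.
demo = demo.drop(columns="status")
print(demo)
```

Unlike `pd.get_dummies`, this always yields an integer column, regardless of the pandas version's default dummy dtype.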


In [28]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1].values  # 1-D target array, no hard-coded row count
print(X.shape)
print(y.shape)

(923, 29)
(923,)


In [29]:
mod = LogisticRegression()
scores = cross_validate(mod, X, y, cv=5, scoring=('accuracy', 'f1', 'roc_auc'), return_train_score=True)
test_accuracy = np.mean(scores['test_accuracy'])
test_f1 = np.mean(scores['test_f1'])
test_roc = np.mean(scores['test_roc_auc'])
print(f'Test accuracy: {test_accuracy}\nTest F1 score: {test_f1}\nTest ROC AUC: {test_roc}')

Test accuracy: 0.6468037602820212
Test F1 score: 0.7855244648709909
Test ROC AUC: 0.6554068676421618


Five-fold cross-validation was run on the data, and the mean test accuracy, F1 score, and ROC AUC are displayed above. With accuracy and ROC AUC both near 0.65, these scores serve as a baseline but leave much room for improvement, which will be addressed next.
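One way to put the F1 score in context is to compare it with a trivial classifier that predicts "acquired" for every row. Such a classifier has recall 1 and precision equal to the share of acquired companies. The sketch below assumes that this share is close to the reported accuracy, which the notebook does not verify. The function name `f1_always_positive` is introduced here for illustration only.

```python
def f1_always_positive(prevalence):
    """F1 of a classifier that labels every row positive.

    Precision equals the positive-class prevalence and recall is 1.
    """
    precision = prevalence
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the cross-validation output above.
reported_accuracy = 0.6468037602820212
reported_f1 = 0.7855244648709909

# ASSUMPTION: the share of acquired companies is close to the reported accuracy.
baseline_f1 = f1_always_positive(reported_accuracy)
print(f"Always-positive F1: {baseline_f1:.6f}")
print(f"Reported F1:        {reported_f1:.6f}")
```

If the two values agree closely, the model may be predicting the positive class for nearly every row. Checking `y.mean()` and the predicted labels on the real data would confirm or rule this out.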

In [37]:
mod.fit(X, y)
coeffs = mod.coef_[0]
coeffs

array([ 9.38422598e-16,  1.00801794e-14,  6.73856844e-14,  9.36093378e-15,
        2.02889742e-08,  1.40888861e-14,  2.12688317e-15,  8.58390108e-16,
        6.80362443e-16, -7.00421681e-17, -7.38860527e-16,  5.92679107e-16,
        6.68785781e-16,  2.58160539e-16,  6.50883459e-16,  4.61690480e-16,
        5.16276493e-17, -1.42985599e-16, -7.62173129e-17,  1.03848875e-17,
        3.50185971e-16, -6.06303303e-17,  6.61852123e-16,  3.20571405e-15,
        2.38946794e-15,  1.23193093e-15,  6.25245206e-16,  1.39111338e-14,
        4.41504266e-15])
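Raw coefficients are hard to compare when features sit on very different scales: `funding_total_usd` is measured in dollars, while most other columns are 0/1 flags or small counts. A possible follow-up, not part of the original notebook, is to standardize the features inside a pipeline before fitting. The sketch below uses synthetic data (`X_demo`, `y_demo`, and their two columns are made up for illustration) rather than the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: one dollar-scale feature and one small-count feature.
funding = rng.lognormal(mean=15, sigma=1.5, size=n)
relationships = rng.poisson(lam=6, size=n)
X_demo = np.column_stack([funding, relationships])

# By construction, the label depends only on the small-count feature.
y_demo = (relationships + rng.normal(0, 2, size=n) > 6).astype(int)

# Standardize each feature to zero mean and unit variance, then fit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

scaled_coefs = scaled_model.named_steps["logisticregression"].coef_[0]
print(dict(zip(["funding", "relationships"], scaled_coefs)))
```

After standardization, each coefficient describes the effect of a one-standard-deviation change, so magnitudes can be compared across features. Whether this changes the ranking on the real data would need to be checked by running the pipeline on `X` and `y`.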

In [44]:
# Pair each feature name with its coefficient, sorted from largest to smallest
attr_coeffs = dict(zip(X.columns, coeffs))
attr_coeffs = dict(sorted(attr_coeffs.items(), key=lambda item: item[1], reverse=True))
attr_coeffs

{'funding_total_usd': 2.0288974162067822e-08,
 'relationships': 6.738568436262268e-14,
 'milestones': 1.4088886057956349e-14,
 'avg_participants': 1.3911133769033538e-14,
 'age_last_funding_year': 1.0080179387668835e-14,
 'funding_rounds': 9.36093377611379e-15,
 'is_top500': 4.41504266127909e-15,
 'has_roundA': 3.205714050470279e-15,
 'has_roundB': 2.389467944976437e-15,
 'is_CA': 2.1268831714347958e-15,
 'has_roundC': 1.2319309283489718e-15,
 'age_first_funding_year': 9.38422598001452e-16,
 'is_NY': 8.583901084814993e-16,
 'is_MA': 6.803624426481485e-16,
 'is_web': 6.687857806563945e-16,
 'has_angel': 6.618521229337517e-16,
 'is_enterprise': 6.508834592216633e-16,
 'has_roundD': 6.252452055845917e-16,
 'is_software': 5.9267910734619e-16,
 'is_advertising': 4.616904797614783e-16,
 'is_othercategory': 3.5018597084042284e-16,
 'is_mobile': 2.5816053859604856e-16,
 'is_gamesvideo': 5.1627649345797804e-17,
 'is_consulting': 1.0384887490944869e-17,
 'has_VC': -6.063033032946824e-17,
 'is_TX