# Machine learning models

We will start by importing the modules and loading the PCA dataframe:

In [1]:
import pandas as pd
# from lib.unsupervised_learning import *

from sklearn import svm, tree, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.model_selection import train_test_split

import tqdm.notebook as tqdm

In [2]:
pca_df = pd.read_csv('data/dataframes/pca_df.csv').iloc[:,1:]

print(f'PCA dataframe shape: {pca_df.shape}')

PCA dataframe shape: (10070, 1093)


In [3]:
cat_cols = ['company_name', 'company_about','founded', 'business model','employees','product stage','status','funding stage','succeeded']
num_cols = ['total_raised','total_rounds', 'investors','ipo_price', 'geo_market_per']

pca_cols = [col for col in pca_df.columns if col not in cat_cols]

print(f"Categorical cols : {len(cat_cols)}")
print(f"Numerical cols : {len(num_cols)}")
print(f"Total PCA cols : {len(pca_cols)}")


Categorical cols : 9
Numerical cols : 5
Total PCA cols : 1084


In this notebook, we will train a few machine learning models on our dataset to find the model that best predicts the target variable.  
The models we will use are:
- [Logistic Regression](#lr)
- [K-Nearest Neighbours](#knn)
- [Support Vector Machine](#svm)
- [Gaussian Naive Bayes](#gnb)
- [Decision Tree](#dt)
- [Random Forest](#rf)
- [Multi-layer Perceptron](#mlp)  


First we need to prepare our data.  
As we saw in the visualizations, almost 70% of the companies in the dataset are successful.  
To avoid biased results, we want to train the models on data in which the succeeded column is evenly distributed,  
i.e., the training data should contain 50% successful companies and 50% failed companies.


We will build this balanced sample from the PCA dataframe:

In [4]:
print("\nPCA data:")

pca_succeeded = pca_df[pca_df['succeeded'] == 1]
pca_failed = pca_df[pca_df['succeeded'] == 0]

size = min(pca_succeeded.shape[0], pca_failed.shape[0])

pca_df_succeded_sampled = pca_succeeded.sample(n = size , random_state = 42)
pca_df_failed_sampled = pca_failed.sample(n = size , random_state = 42)

pca_equal_df = pd.concat([pca_df_succeded_sampled, pca_df_failed_sampled])

print(f'PCA equal dataframe shape: {pca_equal_df.shape}')
print(f"PCA equal dataframe succeeded companies: {pca_equal_df['succeeded'].sum()}")


PCA data:
PCA equal dataframe shape: (6122, 1093)
PCA equal dataframe succeeded companies: 3061.0
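
The undersampling step above can also be written as a small reusable helper. The following is only a sketch: `balance_classes` and the toy dataframe are hypothetical names introduced here for illustration, and it assumes a pandas version that provides `GroupBy.sample` (1.1 or later).

```python
import pandas as pd

def balance_classes(df, target, random_state=42):
    """Undersample every class in `target` down to the size of the smallest class."""
    smallest = df[target].value_counts().min()
    # GroupBy.sample draws `n` rows from each class independently
    return df.groupby(target).sample(n=smallest, random_state=random_state)

# Toy data with a 70/30 split, mimicking the imbalance described above
toy = pd.DataFrame({'feature': range(10), 'succeeded': [1] * 7 + [0] * 3})
balanced = balance_classes(toy, 'succeeded')
print(balanced['succeeded'].value_counts().to_dict())
```

Applied to the notebook's data, the call would presumably look like `balance_classes(pca_df, 'succeeded')`.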


We will create one function per model that trains it and returns its scores on the test set:

### Logistic Regression
We will use **GridSearchCV** with a **ShuffleSplit** cross-validation splitter to tune the regularization strength `C`. 
<a id='lr'></a>

In [5]:
def train_logistic_regression(XTrain, yTrain, XTest, yTest):
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=150)

    lr = GridSearchCV(clf, param_grid={'C': [0.1, 1, 10]}, cv=cv, scoring='accuracy')
    lr.fit(XTrain, yTrain)
    y_pred = lr.predict(XTest)

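    # NOTE: f1_score defaults to average='binary' (F1 of the positive class),
    # even though the value is stored under the key 'test_f1_macro'.
    # The other train_* functions below do the same.
    f1 = metrics.f1_score(yTest, y_pred)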
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}
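
To make the role of `ShuffleSplit` more concrete, here is a minimal sketch on toy data (the array `X_toy` is an assumption made for illustration, not part of the project data). With the same settings as in the function above, each of the 10 iterations draws a fresh random 80/20 partition; unlike K-fold, the validation sets of different iterations may overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(50).reshape(-1, 1)  # 50 toy samples with a single feature

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
# Record (train size, validation size) for every random split
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X_toy)]

print(len(split_sizes))  # number of random splits
print(split_sizes[0])    # sizes of the first split
```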

### K Nearest Neighbours
We will use **GridSearchCV** to find the best hyper-parameters for the KNN algorithm. 
<a id='knn'></a>

In [28]:
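def train_knn(XTrain, yTrain, XTest, yTest, parameters=None):
    # Build the default grid inside the function to avoid a mutable default argument
    if parameters is None:
        parameters = {'n_neighbors': range(2, 50, 4), 'weights': ['uniform', 'distance']}

    knn = KNeighborsClassifier()
    clf = GridSearchCV(knn, parameters, scoring='accuracy')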
    clf.fit(XTrain, yTrain)
    y_pred = clf.predict(XTest)

    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Support Vector Machine
We will use **GridSearchCV** to find the best hyper-parameters for the SVC algorithm. 
<a id='svm'></a>

In [7]:
def train_svm(XTrain, yTrain, XTest, yTest):
    parameters = {'C':[0.1,1,10]}
    s = svm.SVC()
    clf = GridSearchCV(s, parameters,scoring='accuracy')
    clf.fit(XTrain, yTrain)
    y_pred = clf.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Gaussian Naive Bayes
<a id='gnb'></a>

In [8]:
def train_gnb(XTrain, yTrain, XTest, yTest):
    gnb = GaussianNB()
    gnb.fit(XTrain, yTrain)
    y_pred = gnb.predict(XTest)
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)

    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Decision Tree
<a id='dt'></a>

In [9]:
def train_dt(XTrain, yTrain, XTest, yTest):
    parameters = {'max_depth':range(3,21,3), 'min_samples_split':range(2,18,4), 'min_samples_leaf':range(2,18,4)}
    clf = tree.DecisionTreeClassifier()
    dt = GridSearchCV(clf, parameters ,scoring='accuracy')
    dt.fit(XTrain, yTrain)
    y_pred = dt.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Random Forest Classifier
<a id='rf'></a>

In [10]:
def train_rf(XTrain, yTrain, XTest, yTest):
    parameters = {'n_estimators':range(2,50,4), 'min_samples_split':range(2,22,4), 'min_samples_leaf':range(2,22,4)}
    clf = RandomForestClassifier()
    rf = GridSearchCV(clf, parameters ,scoring='accuracy')
    rf.fit(XTrain, yTrain)
    y_pred = rf.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Neural Network - Multi-layer Perceptron
<a id='mlp'></a>

In [27]:
def train_mlp(XTrain, yTrain, XTest, yTest):
    parameters = {'hidden_layer_sizes':[(100,), (50,)], 'activation':['logistic', 'relu']}
    clf = MLPClassifier(max_iter=500, random_state=42)
    mlp = GridSearchCV(clf, parameters ,scoring='accuracy')
    mlp.fit(XTrain, yTrain)
    y_pred = mlp.predict(XTest)

    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}
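
All seven functions end with the same scoring lines, so the scoring could be factored into a shared helper. The sketch below is one possible shape, not the notebook's actual code: `score_model` and the synthetic data are hypothetical. It also reports the binary and the macro-averaged F1 side by side, since `metrics.f1_score` computes the binary variant unless `average='macro'` is passed.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def score_model(model, X_test, y_test):
    """Hypothetical helper: score an already fitted classifier on the test set."""
    y_pred = model.predict(X_test)
    return {
        # Default average='binary': F1 of the positive class (label 1) only
        'test_f1_binary': metrics.f1_score(y_test, y_pred),
        # Macro F1: unweighted mean of the per-class F1 scores
        'test_f1_macro': metrics.f1_score(y_test, y_pred, average='macro'),
        'test_accuracy': metrics.accuracy_score(y_test, y_pred),
    }

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

demo_scores = score_model(GaussianNB().fit(X_tr, y_tr), X_te, y_te)
print(demo_scores)
```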

Now we will run every algorithm on the PCA dataset and compare the results.  
We will use the **timeit** library to report the cumulative elapsed time after each model finishes training.

In [12]:
# runtime ~ 55 minutes
import tqdm.notebook as tqdm
import timeit
from datetime import timedelta

dfs_scores = {}
t0 = timeit.default_timer()

# dfs ={'pca_df':(pca_equal_df, pca_cols), 'bin_df': (bin_equal_df,bin_cols)}#'pca_2d_df': (pca_2d_df,pca_2d_cols), 'pca_3d_df': (pca_3d_df,pca_3d_cols)}
scores = {} 

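# NOTE: this split uses the full pca_df (about 70% successful companies),
# not the balanced pca_equal_df built earlier.
XTrain, XTest, yTrain, yTest = train_test_split(pca_df[pca_cols], pca_df['succeeded'], test_size=0.2, random_state=42, stratify=pca_df['succeeded'])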

scores['LogisticRegression'] = train_logistic_regression(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nLogisticRegression: \n{scores["LogisticRegression"]}\n')

scores['KNN'] = train_knn(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nKNN: \n{scores["KNN"]}\n')

scores['GNB'] = train_gnb(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nGNB: \n{scores["GNB"]}\n')

scores['DT'] = train_dt(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nDT: \n{scores["DT"]}\n')

scores['RF'] = train_rf(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nRF: \n{scores["RF"]}\n')

scores['MLP'] = train_mlp(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nMLP: \n{scores["MLP"]}\n')

scores['SVM'] = train_svm(XTrain, yTrain, XTest, yTest)
print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\nSVM: \n{scores["SVM"]}\n')
# RUNTIME > 700 minutes for SVM model.


dfs_scores['pca_df'] = scores


Elapsed time: 0:00:10.025544
LogisticRegression: 
{'test_f1_macro': 0.2794292508917955, 'test_accuracy': 0.39821251241310823}

Elapsed time: 0:01:13.040229
KNN: 
{'test_f1_macro': 0.7998629198080878, 'test_accuracy': 0.7100297914597815}

Elapsed time: 0:01:13.357502
GNB: 
{'test_f1_macro': 0.18414322250639387, 'test_accuracy': 0.36643495531281034}

Elapsed time: 0:34:13.599809
DT: 
{'test_f1_macro': 0.8121859296482412, 'test_accuracy': 0.7030784508440914}

Elapsed time: 1:34:19.797035
RF: 
{'test_f1_macro': 0.8228915662650603, 'test_accuracy': 0.7080436941410129}

Elapsed time: 1:37:08.231955
MLP: 
{'test_f1_macro': 0.8208430913348946, 'test_accuracy': 0.6961271102284012}

Elapsed time: 1:44:36.875713
SVM: 
{'test_f1_macro': 0.8208430913348946, 'test_accuracy': 0.6961271102284012}
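
The repeated train-and-print pattern in the cell above can be condensed into a loop over a dictionary of training functions. This is a sketch under assumptions: `run_all` and `dummy_trainer` are hypothetical names, and in the notebook the dictionary would map model names to `train_logistic_regression`, `train_knn`, and so on.

```python
import timeit
from datetime import timedelta

def run_all(trainers, XTrain, yTrain, XTest, yTest):
    """Call each training function in turn and print the cumulative elapsed time."""
    t0 = timeit.default_timer()
    results = {}
    for name, train_fn in trainers.items():
        results[name] = train_fn(XTrain, yTrain, XTest, yTest)
        elapsed = timedelta(seconds=timeit.default_timer() - t0)
        print(f'Elapsed time: {elapsed}\n{name}: \n{results[name]}\n')
    return results

# Stand-in trainer so the sketch runs on its own; it ignores its inputs
def dummy_trainer(XTrain, yTrain, XTest, yTest):
    return {'test_f1_macro': 0.0, 'test_accuracy': 0.0}

demo_results = run_all({'Dummy': dummy_trainer}, None, None, None, None)
```

Keeping the trainers in a dictionary also makes it easy to skip a slow model, such as SVM, by leaving it out of the dictionary.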



Now that we have the results for all the algorithms, we will compare them and find the best model.  
First we will divide the results into two groups, **F1 score** and **accuracy**, then we will plot the data using the plotly library. 


In [13]:
model_scores = {'LogisticRegression': [], 'KNN': [], 'GNB': [], 'DT': [], 'RF': [], 'MLP': [], 'SVM': []}

for key, scores in dfs_scores.items():
    for model, score in scores.items():
        model_scores[model].append(score)


In [14]:
f1_values = {}
accuracy_values = {}

for model, scores in model_scores.items():
    f1_values[model] = [score['test_f1_macro'] for score in scores]
    accuracy_values[model] = [score['test_accuracy'] for score in scores]
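
As an alternative to separate dictionaries, the nested scores can be viewed as a table. The sketch below assumes the scores are available as a dictionary of dictionaries (in the notebook that would be `dfs_scores['pca_df']`); here the values are copied from the run above and rounded to four decimals so the block runs on its own. The name `scores_df` is introduced only for illustration.

```python
import pandas as pd

# Scores copied from the run above, rounded to four decimals
reported = {
    'LogisticRegression': {'test_f1_macro': 0.2794, 'test_accuracy': 0.3982},
    'KNN': {'test_f1_macro': 0.7999, 'test_accuracy': 0.7100},
    'GNB': {'test_f1_macro': 0.1841, 'test_accuracy': 0.3664},
    'DT': {'test_f1_macro': 0.8122, 'test_accuracy': 0.7031},
    'RF': {'test_f1_macro': 0.8229, 'test_accuracy': 0.7080},
    'MLP': {'test_f1_macro': 0.8208, 'test_accuracy': 0.6961},
    'SVM': {'test_f1_macro': 0.8208, 'test_accuracy': 0.6961},
}

# One row per model, one column per metric
scores_df = pd.DataFrame.from_dict(reported, orient='index')
print(scores_df.sort_values('test_accuracy', ascending=False))

# idxmax returns, for each metric, the model with the highest value
print(scores_df.idxmax())
```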
    


We will plot the data in 2 ways:
- For each scoring function, we will plot the algorithms' scores in a bar chart.
- For each algorithm, we will plot its scores in a bar chart.

In [15]:
! pip install plotly
import plotly.graph_objects as go

x = ['PCA DATA']
titles = ['F1 Score', 'Accuracy Score']

for title, score in zip(titles, [f1_values, accuracy_values]):
    # One bar per model, in the insertion order of the score dictionary
    fig = go.Figure(
        data=[go.Bar(name=model, x=x, y=values) for model, values in score.items()],
        layout=go.Layout(title=title, title_x=0.5, barmode='group', xaxis=dict(title='DataFrame'),
                         yaxis=dict(title=title), width=1200, height=500))
    fig.update_yaxes(range=[0, 1])
    fig.show()





In [16]:
import plotly.graph_objects as go
algs = ['LogisticRegression','KNN','GNB','DT','RF','MLP','SVM']

for i, alg in enumerate(algs):
    fig = go.Figure(data=[
    go.Bar(name='F1 Score', x=x, y=list(f1_values[alg]), text='F1 Score', textfont_color='black', textposition='outside', textfont_size=16),
    go.Bar(name='Accuracy_value', x=x, y=list(accuracy_values[alg]), text='Accuracy_value', textfont_color='black', textposition='outside', textfont_size=16)
    ], layout=go.Layout(title=algs[i],title_x = 0.5,  barmode='group', xaxis=dict(title='DataFrame'), yaxis=dict(title=algs[i]), width=600, height=500))
    fig.update_yaxes(range=[0,1])
    fig.show()


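At first glance:  
Logistic Regression and GNB did not work well on the PCA dataset, with accuracy of about 0.40 and 0.37 and F1 scores below 0.30.  
The rest of the models did better, returning F1 scores of about 0.80 to 0.82 and accuracy of about 0.70 to 0.71 on the test set.  
Note that MLP and SVM report identical scores, and that with almost 70% successful companies, an accuracy near 0.70 is close to what always predicting success would achieve.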
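
Because almost 70% of the companies are labelled successful, it can help to read these scores against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on toy labels with the same 70/30 ratio; the toy arrays are assumptions made for illustration, not the project data.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Toy labels with 70% positives, mirroring the class ratio described earlier
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

baseline_accuracy = metrics.accuracy_score(y_toy, y_base)
baseline_f1 = metrics.f1_score(y_toy, y_base)
print(f'accuracy={baseline_accuracy:.3f}, f1={baseline_f1:.3f}')
```

On these toy labels the baseline reaches an accuracy of 0.70 and a binary F1 of about 0.82 (precision 0.7, recall 1.0), which is a useful reference point when judging the scores above.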

The last step is to find the model with the best scores.  
We will find the model with the highest F1 score and the model with highest Accuracy score:

In [30]:
best_f1_model = ''
best_accuracy_model = ''

best_f1 = 0
best_accuracy = 0

for model, scores in model_scores.items():
    f1 = [score['test_f1_macro'] for score in scores]
    accuracy = [score['test_accuracy'] for score in scores]
    max_f1 = max(f1)
    max_accuracy = max(accuracy)
    if max_f1 > best_f1:
        best_f1 = max_f1
        best_f1_model = model
    if max_accuracy > best_accuracy:
        best_accuracy = max_accuracy
        best_accuracy_model = model

print("-----Find the best model-----")

print(f'Best model according to accuracy score is {best_accuracy_model} with accuracy score {best_accuracy}')

-----Find the best model-----
Best model according to accuracy score is KNN with accuracy score 0.7100297914597815


### We see that KNN is the most accurate algorithm!
Let's try to improve it...

In [32]:
knn_params = {'n_neighbors':range(2,50,2), 'weights':['uniform', 'distance'], 'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']}

knn = KNeighborsClassifier()
clf = GridSearchCV(knn, knn_params, scoring='accuracy')
clf.fit(XTrain, yTrain)
print(f'Best parameters for KNN: {clf.best_params_}')
print(f'Best score for KNN: {clf.best_score_}')

Best parameters for KNN: {'algorithm': 'ball_tree', 'n_neighbors': 48, 'weights': 'uniform'}
Best score for KNN: 0.7153663630776623
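
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the score of every parameter combination in `cv_results_`, which shows how close the runners-up are. The following is a sketch on synthetic data with a deliberately small grid; the data and the names `search` and `results` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the sketch runs quickly on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [3, 7, 11], 'weights': ['uniform', 'distance']},
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per parameter combination
results = pd.DataFrame(search.cv_results_)
cols = ['param_n_neighbors', 'param_weights', 'mean_test_score',
        'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

If the best few rows differ by less than their `std_test_score`, the choice among them is probably not significant.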


In [35]:
y_pred = clf.predict(XTest)
print(f'Accuracy score for KNN: {metrics.accuracy_score(yTest, y_pred)}')
print(f'F1 score for KNN: {metrics.f1_score(yTest, y_pred)}')


Accuracy score for KNN: 0.7110228401191658
F1 score for KNN: 0.8013651877133107


# Conclusion
In this notebook, we trained several machine learning models on our dataset and compared the results.  
We found that the model that best predicts the target variable (success or failure) is:  
### **K Nearest Neighbours** with **Accuracy of ~0.71 and F1 score of ~0.80**. 

This project was a great opportunity to learn about machine learning and how to use the different algorithms.  
The long journey we took taught us a lot about data in general, and specifically about different methods to deal with it.  

We were very pleased with the results of the project, and they may have real-world implications.