# Machine learning models

We will start by importing the modules and loading the three dataframes:

In [1]:
import pandas as pd
# from lib.unsupervised_learning import *

from sklearn import svm, tree, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate

import tqdm.notebook as tqdm

In [2]:
bin_df = pd.read_csv('data/dataframes/df_after_cols_reduction.csv').iloc[:,1:]
pca_df = pd.read_csv('data/dataframes/pca_df.csv').iloc[:,1:]
df_no_outliers = pd.read_csv('data/dataframes/cleaned_no_outliers.csv').iloc[:,1:]
# pca_2d_df = pd.read_csv('data/dataframes/pca_2d_df.csv').iloc[:,1:]
# pca_3d_df = pd.read_csv('data/dataframes/pca_3d_df.csv').iloc[:,1:]

print(f'Binary dataframe shape: {bin_df.shape}')
print(f'PCA dataframe shape: {pca_df.shape}')
print(f'No outliers dataframe shape: {df_no_outliers.shape}')
# print(f'PCA 2D dataframe shape: {pca_2d_df.shape}')
# print(f'PCA 3D dataframe shape: {pca_3d_df.shape}')


Binary dataframe shape: (10070, 1927)
PCA dataframe shape: (10070, 32)
No outliers dataframe shape: (10070, 1995)


In [3]:
cat_cols = ['company_name', 'company_about','founded', 'business model','employees','product stage','status','funding stage','succeeded']
num_cols = ['total_raised','total_rounds', 'investors','ipo_price', 'geo_market_per']
tag_cols = [col for col in bin_df.columns if col.startswith('tag_')]
targetmarket_cols = [col for col in bin_df.columns if col.startswith('targetmarket_')]
sector_list = [col for col in bin_df.columns if col.startswith("sector_")]
target_ind_list = [col  for col in bin_df.columns if col.startswith("industry_")]
technology_list = [col  for col in bin_df.columns if col.startswith("technology_")]


bin_cols = tag_cols + targetmarket_cols + sector_list + target_ind_list + technology_list
pca_cols = [col for col in pca_df.columns if col not in cat_cols]
# pca_2d_cols = [col for col in pca_2d_df.columns if col not in cat_cols and col not in num_cols]
# pca_3d_cols = [col for col in pca_3d_df.columns if col not in cat_cols and col not in num_cols]

In [4]:
print(f"Categorical cols : {len(cat_cols)}")
print(f"Numerical cols : {len(num_cols)}")
print(f"Tag cols : {len(tag_cols)}")
print(f"Targetmarket cols : {len(targetmarket_cols)}")
print(f"Sector cols : {len(sector_list)}")
print(f"Industry cols : {len(target_ind_list)}")
print(f"Technology cols : {len(technology_list)}")
print('---- Totals ----')
print(f"Total binary cols : {len(bin_cols)}")
print(f"Total PCA cols : {len(pca_cols)}")
# print(f"Toatl PCA 2D cols : {len(pca_2d_cols)}")
# print(f"Total PCA 3D cols : {len(pca_3d_cols)}")



Categorical cols : 9
Numerical cols : 5
Tag cols : 1599
Targetmarket cols : 117
Sector cols : 41
Industry cols : 81
Technology cols : 75
---- Totals ----
Total binary cols : 1913
Total PCA cols : 23


In this notebook, we will train several machine learning models on our datasets to find the one that best predicts the target variable.  
The models we will use are:
- [Logistic Regression](#lr)
- [K-Nearest Neighbours](#knn)
- [Support Vector Machine](#svc)
- [Gaussian Naive Bayes](#gnb)
- [Decision Tree](#dt)
- [Random Forest](#rf)
- [Multi-layer Perceptron](#mlp)  


First we need to prepare our data.  
As we saw in the visualizations, almost 70% of the companies in the dataset are successful.  
To avoid biased results, we need to train the models on data in which the succeeded column is evenly distributed,  
i.e. the training data should contain 50% successful companies and 50% failed companies.


In [5]:
print("\nBinary data:")
bin_df_succeeded = bin_df[bin_df['succeeded'] == 1]
bin_df_failed = bin_df[bin_df['succeeded'] == 0]

size = min(bin_df_succeeded.shape[0], bin_df_failed.shape[0])

bin_df_succeded_sampled = bin_df_succeeded.sample(n = size , random_state = 42)
bin_df_failed_sampled = bin_df_failed.sample(n = size , random_state = 42)

bin_equal_df = pd.concat([bin_df_succeded_sampled, bin_df_failed_sampled])

print(f'Binary equal dataframe shape: {bin_equal_df.shape}')
print(f"Binary equal dataframe succeeded companies: {bin_equal_df['succeeded'].sum()}")


Binary data:
Binary equal dataframe shape: (6122, 1927)
Binary equal dataframe succeeded companies: 3061.0


We will do the same for the PCA dataframe:

In [6]:
print("\nPCA data:")

pca_succeeded = pca_df[pca_df['succeeded'] == 1]
pca_failed = pca_df[pca_df['succeeded'] == 0]

size = min(pca_succeeded.shape[0], pca_failed.shape[0])

pca_df_succeded_sampled = pca_succeeded.sample(n = size , random_state = 42)
pca_df_failed_sampled = pca_failed.sample(n = size , random_state = 42)

pca_equal_df = pd.concat([pca_df_succeded_sampled, pca_df_failed_sampled])

print(f'PCA equal dataframe shape: {pca_equal_df.shape}')
print(f"PCA equal dataframe succeeded companies: {pca_equal_df['succeeded'].sum()}")


PCA data:
PCA equal dataframe shape: (6122, 32)
PCA equal dataframe succeeded companies: 3061.0
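
The binary and PCA cells above repeat the same undersampling steps. A minimal sketch of a reusable helper follows, assuming a 0/1 target column; the name `undersample_balanced` and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd


def undersample_balanced(df, target="succeeded", random_state=42):
    """Return a frame with equally many rows per target class (hypothetical helper)."""
    # The smallest class decides how many rows every class keeps.
    size = df[target].value_counts().min()
    # Sample `size` rows from each class, then stack the pieces back together.
    parts = [
        group.sample(n=size, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    return pd.concat(parts)


# Toy data: 7 "succeeded" rows and 3 "failed" rows.
toy_df = pd.DataFrame({"feature": range(10), "succeeded": [1] * 7 + [0] * 3})
balanced_df = undersample_balanced(toy_df)
print(balanced_df.shape)
```

With such a helper, each dataframe could be balanced with a single call instead of a copied cell.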


In [7]:
# print("\n2D PCA data:")

# pca_2d_df_succeeded = pca_2d_df[pca_2d_df['succeeded'] == 1]
# pca_2d_df_failed = pca_2d_df[pca_2d_df['succeeded'] == 0]

# size = min(pca_2d_df_succeeded.shape[0], pca_2d_df_failed.shape[0])

# pca_2d_df_succeded_sampled = pca_2d_df_succeeded.sample(n = size , random_state = 42)
# pca_2d_df_failed_sampled = pca_2d_df_failed.sample(n = size , random_state = 42)

# pca_2d_equal_df = pd.concat([pca_2d_df_succeded_sampled, pca_2d_df_failed_sampled])

# print(f'2D PCA equal dataframe shape: {pca_2d_equal_df.shape}')
# print(f"2D PCA equal dataframe succeeded companies: {pca_2d_equal_df['succeeded'].sum()}")

# print("\n3D PCA data:")

# pca_3d_df_succeeded = pca_3d_df[pca_3d_df['succeeded'] == 1]
# pca_3d_df_failed = pca_3d_df[pca_3d_df['succeeded'] == 0]

# size = min(pca_3d_df_succeeded.shape[0], pca_3d_df_failed.shape[0])

# pca_3d_df_succeded_sampled = pca_3d_df_succeeded.sample(n = size , random_state = 42)
# pca_3d_df_failed_sampled = pca_3d_df_failed.sample(n = size , random_state = 42)

# pca_3d_equal_df = pd.concat([pca_3d_df_succeded_sampled, pca_3d_df_failed_sampled])

# print(f'3D PCA equal dataframe shape: {pca_3d_equal_df.shape}')
# print(f"3D PCA equal dataframe succeeded companies: {pca_3d_equal_df['succeeded'].sum()}")

In [8]:
bin_XTrain, bin_XTest, bin_yTrain, bin_yTest = train_test_split(bin_equal_df[bin_cols], bin_equal_df['succeeded'], test_size=0.2, random_state=42, stratify=bin_equal_df['succeeded'])
pca_XTrain, pca_XTest, pca_yTrain, pca_yTest = train_test_split(pca_equal_df[pca_cols], pca_equal_df['succeeded'], test_size=0.2, random_state=42, stratify=pca_equal_df['succeeded'])
# pca_2d_XTrain, pca_2d_XTest, pca_2d_yTrain, pca_2d_yTest = train_test_split(pca_2d_equal_df[pca_2d_cols], pca_2d_equal_df['succeeded'], test_size=0.2, random_state=42, stratify=pca_2d_equal_df['succeeded'])
# pca_3d_XTrain, pca_3d_XTest, pca_3d_yTrain, pca_3d_yTest = train_test_split(pca_3d_equal_df[pca_3d_cols], pca_3d_equal_df['succeeded'], test_size=0.2, random_state=42, stratify=pca_3d_equal_df['succeeded'])


print(f"bin_XTrain shape: {bin_XTrain.shape}")
print(f"bin_yTrain shape: {bin_yTrain.shape}")

print(f"bin_XTest shape: {bin_XTest.shape}")
print(f"bin_yTest shape: {bin_yTest.shape}")

print(f"pca_XTrain shape: {pca_XTrain.shape}")
print(f"pca_yTrain shape: {pca_yTrain.shape}")

print(f"pca_XTest shape: {pca_XTest.shape}")
print(f"pca_yTest shape: {pca_yTest.shape}")

bin_XTrain shape: (4897, 1913)
bin_yTrain shape: (4897,)
bin_XTest shape: (1225, 1913)
bin_yTest shape: (1225,)
pca_XTrain shape: (4897, 23)
pca_yTrain shape: (4897,)
pca_XTest shape: (1225, 23)
pca_yTest shape: (1225,)


We will create functions that train the different models and return their scores:

In [9]:
scoring = 'accuracy'

### Logistic Regression
We will use **ShuffleSplit** cross-validation inside a grid search to pick the regularization strength `C` of the model. 
<a id='lr'></a>
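
To illustrate what ShuffleSplit does, here is a small sketch on toy data (the array `X_demo` is an assumption made for the example). Each split is an independent random train/test partition, so test sets from different splits may overlap, unlike in K-fold:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
split_sizes = []
for train_idx, test_idx in splitter.split(X_demo):
    # With 10 samples and test_size=0.2, every split has 8 train and 2 test rows.
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)
```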

In [10]:
def train_logistic_regression(XTrain, yTrain, XTest, yTest):
    cv = ShuffleSplit(n_splits=15, test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=150)

    lr = GridSearchCV(clf, param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}, cv=cv, scoring=scoring)
    lr.fit(XTrain, yTrain)
    y_pred = lr.predict(XTest)

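    # Note: f1_score defaults to average='binary', so despite the key name below
    # this is the F1 of the positive class (succeeded == 1), not a macro average.
    f1 = metrics.f1_score(yTest, y_pred)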
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}
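
The helpers in this notebook store `metrics.f1_score(yTest, y_pred)` under the key `test_f1_macro`. With default arguments scikit-learn computes the binary F1 of the positive class, which can differ from the macro average. A small sketch on made-up labels (the two lists are assumptions for illustration only):

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]  # one positive sample is missed

binary_f1 = f1_score(y_true, y_pred)                  # default: average='binary'
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of the per-class F1 scores

print(round(binary_f1, 3), round(macro_f1, 3))
```

A model that rarely predicts the positive class can therefore show a very low binary F1 while its accuracy stays near 0.5 on a balanced test set.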

### K Nearest Neighbours
We will use **GridSearch** to find the best hyper-parameters for the KNN algorithm. 
<a id='knn'></a>
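
The training helpers only return test scores. As a sketch of how the chosen hyper-parameters could also be inspected, the example below runs a small grid search on synthetic data (the data from `make_classification` and the reduced grid are assumptions, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook's real frames are not needed here.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination; best_score_ is its mean CV accuracy.
print(search.best_params_, round(search.best_score_, 3))
```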

In [11]:
def train_knn(XTrain, yTrain, XTest, yTest):

    parameters = {'n_neighbors':range(2,50,2), 'weights':['uniform', 'distance'], }
    knn = KNeighborsClassifier()
    clf = GridSearchCV(knn, parameters,scoring=scoring)
    clf.fit(XTrain, yTrain)
    y_pred = clf.predict(XTest)

    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Support Vector Machine
We will use **GridSearch** to find the best hyper-parameters for the SVC algorithm. 
<a id='svc'></a>

In [12]:
def train_svm(XTrain, yTrain, XTest, yTest):
    parameters = {'C':[0.1,1,10], 'kernel':['linear', 'rbf']}
    s = svm.SVC()
    clf = GridSearchCV(s, parameters,scoring=scoring)
    clf.fit(XTrain, yTrain)
    y_pred = clf.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Gaussian Naive Bayes
<a id='gnb'></a>

In [13]:
def train_gnb(XTrain, yTrain, XTest, yTest):
    gnb = GaussianNB()
    gnb.fit(XTrain, yTrain)
    y_pred = gnb.predict(XTest)
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)

    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Decision Tree
<a id='dt'></a>

In [14]:
def train_dt(XTrain, yTrain, XTest, yTest):
    parameters = {'max_depth':range(2,20,2), 'min_samples_split':range(2,20,2), 'min_samples_leaf':range(2,20,2)}
    clf = tree.DecisionTreeClassifier()
    dt = GridSearchCV(clf, parameters ,scoring=scoring)
    dt.fit(XTrain, yTrain)
    y_pred = dt.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

### Random Forest Classifier
<a id='rf'></a>

In [15]:
def train_rf(XTrain, yTrain, XTest, yTest):
    parameters = {'n_estimators':range(2,50,2), 'max_depth':range(2,20,2), 'min_samples_split':range(2,20,2), 'min_samples_leaf':range(2,20,2)}
    clf = RandomForestClassifier()
    rf = GridSearchCV(clf, parameters ,scoring=scoring)
    rf.fit(XTrain, yTrain)
    y_pred = rf.predict(XTest)
    
    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}
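
A quick count of how many models this grid asks `GridSearchCV` to fit helps explain why Random Forest dominates the run time in the log further down. The sketch assumes 5-fold cross-validation, which is the default in recent scikit-learn versions; older versions may differ:

```python
# Same grid as in train_rf above.
parameters = {
    "n_estimators": range(2, 50, 2),
    "max_depth": range(2, 20, 2),
    "min_samples_split": range(2, 20, 2),
    "min_samples_leaf": range(2, 20, 2),
}

# Number of hyper-parameter combinations = product of the list lengths.
n_candidates = 1
for values in parameters.values():
    n_candidates *= len(values)

cv_folds = 5  # assumption: GridSearchCV default in recent scikit-learn versions
n_fits = n_candidates * cv_folds  # a final refit on the full training set comes on top

print(n_candidates, n_fits)
```

A coarser grid or a randomized search would shrink this number considerably.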

### Neural Network - Multi-layer Perceptron
<a id='mlp'></a>

In [16]:
def train_mlp(XTrain, yTrain, XTest, yTest):
    parameters = {'hidden_layer_sizes':[(100,), (50,), (25,), (10,)], 'activation':['identity', 'logistic', 'tanh', 'relu'], 'solver':['adam']}
    clf = MLPClassifier(max_iter=500, random_state=42)
    mlp = GridSearchCV(clf, parameters ,scoring=scoring)
    mlp.fit(XTrain, yTrain)
    y_pred = mlp.predict(XTest)

    f1 = metrics.f1_score(yTest, y_pred)
    accuracy = metrics.accuracy_score(yTest, y_pred)
    
    return  {'test_f1_macro': f1, 'test_accuracy': accuracy}

Now we will run every algorithm on each dataset and compare the results.  
We will use the **tqdm** and **timeit** libraries to track progress and measure the time it takes to train all the models.
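
The timing pattern used in the next cell can be sketched as a loop over a dictionary of training functions. The names `run_trainers` and `dummy_trainer` are hypothetical; the stand-in trainer only exists so the sketch runs without data:

```python
import timeit
from datetime import timedelta


def run_trainers(trainers, *data):
    """Call each trainer in turn and record its scores and the elapsed time."""
    results = {}
    t0 = timeit.default_timer()
    for name, train_fn in trainers.items():
        scores = train_fn(*data)
        elapsed = timedelta(seconds=timeit.default_timer() - t0)
        print(f"Elapsed time: {elapsed}\nModel {name}:\n{scores}\n")
        results[name] = scores
    return results


# Hypothetical stand-in trainer with the same signature as the helpers above.
def dummy_trainer(XTrain, yTrain, XTest, yTest):
    return {"test_f1_macro": 0.0, "test_accuracy": 0.0}


demo_scores = run_trainers({"Dummy": dummy_trainer}, None, None, None, None)
```

In the notebook, the dictionary could map names such as `'KNN'` to `train_knn`, which would replace the repeated call-and-print blocks.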

In [17]:
# runtime ~ 55 minutes
import tqdm.notebook as tqdm
import timeit
from datetime import timedelta

dfs_scores = {}
t0 = timeit.default_timer()

dfs ={'pca_df':(pca_equal_df, pca_cols), 'bin_df': (bin_equal_df,bin_cols)}#'pca_2d_df': (pca_2d_df,pca_2d_cols), 'pca_3d_df': (pca_3d_df,pca_3d_cols)}
with tqdm.tqdm(total=len(dfs)*7) as pbar:
    for key, df in dfs.items():
        scores = {} 
        
        XTrain, XTest, yTrain, yTest = train_test_split(df[0][df[1]], df[0]['succeeded'], test_size=0.2, random_state=42, stratify=df[0]['succeeded'])

        scores['LogisticRegression'] = train_logistic_regression(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model LogisticRegression: \n{scores["LogisticRegression"]}\n')
        pbar.update(1)

        scores['KNN'] = train_knn(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model KNN: \n{scores["KNN"]}\n')
        pbar.update(1)

        # scores['SVM'] = train_svm(XTrain, yTrain, XTest, yTest)
        # print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model SVM: \n{scores["SVM"]}\n')
        # RUNTIME > 700 minutes for SVM model.
        pbar.update(1)

        scores['GNB'] = train_gnb(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model GNB: \n{scores["GNB"]}\n')
        pbar.update(1)

        scores['DT'] = train_dt(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model DT: \n{scores["DT"]}\n')
        pbar.update(1)

        scores['RF'] = train_rf(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model RF: \n{scores["RF"]}\n')
        pbar.update(1)

        scores['MLP'] = train_mlp(XTrain, yTrain, XTest, yTest)
        print(f'Elapsed time: {timedelta(seconds = timeit.default_timer() - t0)}\n{key} Model MLP: \n{scores["MLP"]}\n')
        pbar.update(1)

        dfs_scores[key] = scores


  0%|          | 0/14 [00:00<?, ?it/s]

Elapsed time: 0:00:03.294875
pca_df Model LogisticRegression: 
{'test_f1_macro': 0.4703010577705451, 'test_accuracy': 0.4685714285714286}

Elapsed time: 0:00:28.785793
pca_df Model KNN: 
{'test_f1_macro': 0.6867749419953596, 'test_accuracy': 0.6693877551020408}

Elapsed time: 0:00:28.799727
pca_df Model GNB: 
{'test_f1_macro': 0.14641288433382135, 'test_accuracy': 0.5240816326530612}

Elapsed time: 0:03:14.136090
pca_df Model DT: 
{'test_f1_macro': 0.666110183639399, 'test_accuracy': 0.673469387755102}

Elapsed time: 3:31:28.448658
pca_df Model RF: 
{'test_f1_macro': 0.7278431372549019, 'test_accuracy': 0.7167346938775511}

Elapsed time: 3:32:18.305181
pca_df Model MLP: 
{'test_f1_macro': 0.6027397260273972, 'test_accuracy': 0.6212244897959184}



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Elapsed time: 3:33:50.350537
bin_df Model LogisticRegression: 
{'test_f1_macro': 0.6086956521739131, 'test_accuracy': 0.6253061224489795}

Elapsed time: 3:35:08.214554
bin_df Model KNN: 
{'test_f1_macro': 0.3757030371203599, 'test_accuracy': 0.5469387755102041}

Elapsed time: 3:35:08.551014
bin_df Model GNB: 
{'test_f1_macro': 0.6097867001254705, 'test_accuracy': 0.4922448979591837}

Elapsed time: 4:02:09.994221
bin_df Model DT: 
{'test_f1_macro': 0.5240847784200384, 'test_accuracy': 0.5967346938775511}



KeyboardInterrupt: 

In [18]:
dfs_scores

{'pca_df': {'LogisticRegression': {'test_f1_macro': 0.4703010577705451,
   'test_accuracy': 0.4685714285714286},
  'KNN': {'test_f1_macro': 0.6867749419953596,
   'test_accuracy': 0.6693877551020408},
  'GNB': {'test_f1_macro': 0.14641288433382135,
   'test_accuracy': 0.5240816326530612},
  'DT': {'test_f1_macro': 0.666110183639399,
   'test_accuracy': 0.673469387755102},
  'RF': {'test_f1_macro': 0.7278431372549019,
   'test_accuracy': 0.7167346938775511},
  'MLP': {'test_f1_macro': 0.6027397260273972,
   'test_accuracy': 0.6212244897959184}}}

The run was interrupted before the binary dataset finished, so `dfs_scores` holds complete results only for the PCA dataset. We will compare these results and find the best model.  
First we will divide the results into 2 groups (F1 score and accuracy) and then we will plot the data using plotly. 


In [19]:
model_scores = {'LogisticRegression': [], 'KNN': [], 'GNB': [], 'DT': [], 'RF': [], 'MLP': []}

for key, scores in dfs_scores.items():
    for model, score in scores.items():
        model_scores[model].append(score)


In [20]:
f1_values = {}
accuracy_values = {}

for model, scores in model_scores.items():
    f1_values[model] = [score['test_f1_macro'] for score in scores]
    accuracy_values[model] = [score['test_accuracy'] for score in scores]
    



We will plot the data in 2 ways:
- For each scoring function, we will plot the algorithms' scores in a bar chart.
- For each Algorithm, we will plot the scores in a bar chart.

In [22]:
! pip install plotly
import plotly.graph_objects as go
x = ['PCA']
titles = ['F1 Score', 'Accuracy Score']
for i,score in enumerate([f1_values,accuracy_values]):
    fig = go.Figure(data=[
    go.Bar(name='Logistic Regression', x=x, y=list(score.values())[0]),
    go.Bar(name='KNN', x=x, y=list(score.values())[1]),
    go.Bar(name='GNB', x=x, y=list(score.values())[2]),
    go.Bar(name='DT', x=x, y=list(score.values())[3]),
    go.Bar(name='RF', x=x, y=list(score.values())[4]),
    go.Bar(name='MLP', x=x, y=list(score.values())[5])
    ], layout=go.Layout(title=titles[i],title_x = 0.5,  barmode='group', xaxis=dict(title='DataFrame'), yaxis=dict(title=titles[i]), width=1200, height=500))
    fig.show()



In [34]:
import plotly.graph_objects as go
algs = ['LogisticRegression','KNN','GNB','DT','RF','MLP']

for i, alg in enumerate(algs):
    fig = go.Figure(data=[
    go.Bar(name='F1 Score', x=x, y=list(f1_values[alg]), text='F1 Score', textfont_color='black', textposition='outside', textfont_size=16),
    go.Bar(name='Accuracy_value', x=x, y=list(accuracy_values[alg]), text='Accuracy_value', textfont_color='black', textposition='outside', textfont_size=16)
    ], layout=go.Layout(title=algs[i],title_x = 0.5,  barmode='group', xaxis=dict(title='DataFrame'), yaxis=dict(title=algs[i]), width=600, height=500))
    fig.update_yaxes(range=[0,1])
    fig.show()


At first glance:  
Logistic Regression performed poorly on the PCA dataset (accuracy of ~0.47, below chance on a balanced test set).  
GNB also failed on the PCA dataset (F1 of ~0.15, accuracy of ~0.52); in the partial binary run logged above it reached an F1 of ~0.61 but an accuracy of only ~0.49.  
The remaining models scored between ~0.60 and ~0.73 on the PCA test set, with Random Forest the highest on both F1 and accuracy.

The last step is to find the model with the best scores.  
We will find the model with the highest F1 score and the model with highest Accuracy score:

In [32]:
best_f1_model = ''
best_accuracy_model = ''

best_f1 = 0
best_accuracy = 0

for model, scores in model_scores.items():
    f1 = [score['test_f1_macro'] for score in scores]
    accuracy = [score['test_accuracy'] for score in scores]
    max_f1 = max(f1)
    max_accuracy = max(accuracy)
    if max_f1 > best_f1:
        best_f1 = max_f1
        best_f1_model = model
    if max_accuracy > best_accuracy:
        best_accuracy = max_accuracy
        best_accuracy_model = model

print("-----Find the best model-----")
if best_f1_model == best_accuracy_model:
    print(f'Best model is {best_f1_model}!')
    print(f'It has f1 score of {best_f1} and accuracy score of {best_accuracy}')
else:
    print(f'Best model according to f1 score is {best_f1_model} with f1 score {best_f1}')
    print(f'Best model according to accuracy score is {best_accuracy_model} with accuracy score {best_accuracy}')

-----Find the best model-----
Best model is RF!
It has f1 score of 0.7278431372549019 and accuracy score of 0.7167346938775511
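
The same selection can be written more compactly with `max` and a key function. This sketch copies the PCA scores from the `dfs_scores` output above; the variable names are chosen for the example:

```python
# Scores copied from the `dfs_scores` output above (PCA dataset only).
pca_scores = {
    "LogisticRegression": {"test_f1_macro": 0.4703010577705451, "test_accuracy": 0.4685714285714286},
    "KNN": {"test_f1_macro": 0.6867749419953596, "test_accuracy": 0.6693877551020408},
    "GNB": {"test_f1_macro": 0.14641288433382135, "test_accuracy": 0.5240816326530612},
    "DT": {"test_f1_macro": 0.666110183639399, "test_accuracy": 0.673469387755102},
    "RF": {"test_f1_macro": 0.7278431372549019, "test_accuracy": 0.7167346938775511},
    "MLP": {"test_f1_macro": 0.6027397260273972, "test_accuracy": 0.6212244897959184},
}

# Pick the model name whose score is largest for each metric.
best_by_f1 = max(pca_scores, key=lambda m: pca_scores[m]["test_f1_macro"])
best_by_accuracy = max(pca_scores, key=lambda m: pca_scores[m]["test_accuracy"])

print(best_by_f1, best_by_accuracy)
```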


# Conclusion
In this notebook, we have trained several machine learning models on our datasets and compared the results.  
Among the runs that completed (the PCA dataset), the model that best predicts the target variable (success or failure) is:  
### **Random Forest** with **F1 score of ~0.73 and Accuracy of ~0.72**. 

This project was a great opportunity to learn about machine learning and how to use the different algorithms.  
The long journey we took taught us a lot about data in general, and specifically about different methods for dealing with it.  

We were very pleased with the results of the project, and they may even have real-world implications. 