This notebook walks through training and evaluating the ML model.

In [10]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
import os
from joblib import load
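# Workaround: allow the process to continue if more than one copy of the OpenMP runtime is loaded.
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'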

In [5]:
# Load the survey data; "Y" is the target and every other column is a feature.
data = pd.read_csv("./ACME-HappinessSurvey2020.csv")
y = data["Y"]
columns = list(data.columns)
columns.remove("Y")
X = data[columns]

### Model Training

We train a gradient boosting classifier for this task on all of the data.

In [14]:
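from joblib import dump
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=1.1, min_samples_leaf=12, random_state=5)
gbc = gbc.fit(X, y)
# Save the fitted model so the Evaluate section can load it from 'gbc.joblib'.
dump(gbc, 'gbc.joblib')
# Accuracy on the same data the model was fitted on (not a held-out estimate).
train_preds = gbc.predict(X)
acc_score = accuracy_score(y, train_preds)
print(acc_score)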

0.9365079365079365


The model reaches about 93.7% accuracy when evaluated on the training data. Because this is the same data the model was fitted on, the figure is optimistic. To limit overfitting, `min_samples_leaf` is set to 12, well above the scikit-learn default of 1.
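Accuracy measured on the rows used for fitting tends to overstate how a model does on new data. One way to see the gap is to hold out part of the data, as in the minimal sketch below. It uses synthetic data from `make_classification` as a stand-in for the survey file; the 126 rows, 6 features, 25% test share, and the `X_demo`/`y_demo` names are assumptions for illustration, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumed shape: 126 rows, 6 features).
X_demo, y_demo = make_classification(
    n_samples=126, n_features=6, n_informative=3, n_redundant=0, random_state=5
)

# Hold out 25% of the rows, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=5
)

# Same hyperparameters as the training cell in this notebook.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.1, min_samples_leaf=12, random_state=5
)
model.fit(X_train, y_train)

# Compare accuracy on the rows seen during fitting with accuracy on unseen rows.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

A large difference between the two printed numbers is a sign of overfitting; the exact values depend on the data and the random seed.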

In [7]:
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=1.1, min_samples_leaf=12, random_state=5)
val_scores_gbc = cross_val_score(gbc, X, y, scoring="accuracy", cv=6)
np.mean(val_scores_gbc)

0.6428571428571428

When the same model is evaluated with 6-fold cross-validation, it achieves a mean accuracy of about 64%, well below the training accuracy. Given the data, this was the best cross-validated score obtained.
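The mean hides how much the individual folds differ, and with small folds each fold's accuracy can move in coarse steps. The sketch below prints the per-fold scores and their spread. It again uses synthetic stand-in data (126 rows and 6 features are assumptions), and it passes an explicit shuffled `StratifiedKFold`, which is a variation on the notebook's `cv=6`; for a classifier, an integer `cv` uses stratified folds without shuffling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the survey data (assumed shape: 126 rows, 6 features).
X_demo, y_demo = make_classification(
    n_samples=126, n_features=6, n_informative=3, n_redundant=0, random_state=5
)

# Same hyperparameters as the cross-validation cell in this notebook.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.1, min_samples_leaf=12, random_state=5
)

# Six stratified folds; shuffling with a fixed seed keeps the split reproducible.
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=5)
fold_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)

print("per-fold accuracy:", fold_scores.round(3))
print(f"mean: {fold_scores.mean():.3f}, std: {fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two model configurations is larger than the fold-to-fold noise.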

#### Evaluate 

To evaluate this model, first run the training cell, then load your data and run the cells below.

In [11]:
#Load the trained model
gbc = load('gbc.joblib')
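The `load` call above assumes a file named `gbc.joblib` already exists on disk. As a minimal sketch of the save-and-load round trip, the example below fits a model on synthetic stand-in data, writes it with `joblib.dump`, reads it back, and checks that the restored model predicts identically. The temporary directory and the `gbc_demo.joblib` file name are hypothetical and chosen so the sketch does not touch any real model file.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the survey data (assumed shape: 126 rows, 6 features).
X_demo, y_demo = make_classification(
    n_samples=126, n_features=6, n_informative=3, n_redundant=0, random_state=5
)
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.1, min_samples_leaf=12, random_state=5
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, kept inside a throwaway directory.
    model_path = os.path.join(tmp_dir, "gbc_demo.joblib")
    dump(model, model_path)      # serialize the fitted estimator
    restored = load(model_path)  # read it back into memory

# The restored model should give the same predictions as the original.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same_predictions)
```

A saved model is generally best loaded with the same scikit-learn version that produced it.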

In [None]:
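# Replace the placeholder path with your own CSV file. It must contain a "Y" column
# and the same feature columns that were used for training.
test_data = pd.read_csv("test_data_file_path_here")
test_y = test_data["Y"]
columns = list(test_data.columns)
columns.remove("Y")
test_X = test_data[columns]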

test_preds = gbc.predict(test_X)
test_acc_score = accuracy_score(test_y, test_preds)
print(test_acc_score)