# Human-built Machine Learning Ensemble Model
### The following machine learning models were evaluated to construct an ensemble model for prediction:
- kNN
- Naive Bayes
- Logistic Regression
- Random Forest
- Gradient Boosting Tree
- Support Vector Machine
- Artificial Neural Networks

## (1) Data Overview

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import svm
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

cfg_col_X = ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
cfg_col_Y = 'target'

# Load data
iris = load_iris()
col_names = iris['feature_names'] + ['target']
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns=col_names)

# shuffle dataset so that it is not ordered by target
df = df.sample(frac=1, random_state=1)

In [2]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
14,5.8,4.0,1.2,0.2,0.0
98,5.1,2.5,3.0,1.1,1.0
75,6.6,3.0,4.4,1.4,1.0
16,5.4,3.9,1.3,0.4,0.0
131,7.9,3.8,6.4,2.0,2.0


In [3]:
df.describe().round(2)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
count,150.0,150.0,150.0,150.0,150.0
mean,5.84,3.06,3.76,1.2,1.0
std,0.83,0.44,1.77,0.76,0.82
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [4]:
# construct training and testing sets
train, test = train_test_split(df,test_size=0.15,random_state=602)

In [5]:
# extract target from features
X_train = train.iloc[:,:4]
Y_train = train.iloc[:,-1]
X_test = test.iloc[:,:4]
Y_test = test.iloc[:,-1]

## (2) kNN

In [6]:
# grid search with 10-fold cv
parameters = {'n_neighbors':[3, 5, 7, 10, 15]}
clf_KNN = GridSearchCV(KNeighborsClassifier(weights='uniform',n_jobs=-1),parameters,scoring='f1_macro',cv=10,n_jobs=-1)
clf_KNN.fit(X_train,Y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=-1,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 7, 10, 15]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)

In [7]:
# mean f1 score of kNN for each of the 5 candidate numbers of neighbors
clf_KNN.cv_results_['mean_test_score']

array([0.95304375, 0.95304375, 0.9532708 , 0.95151932, 0.95264986])

In [8]:
# prediction on testing set
Y_pred = clf_KNN.predict(X_test)

In [9]:
# f1 score of testing set
f1_score(Y_test, Y_pred, average='macro')

1.0

In [10]:
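# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
# (each fold is 15 consecutive rows of the shuffled dataframe, selected by position)
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # mask over row positions: df.index holds the shuffled labels, so it cannot be compared with positions
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name so that later prediction columns are never used as labels
    Y_cross_train = cross_train[cfg_col_Y]
    clf_KNN = KNeighborsClassifier(weights='uniform',n_jobs=-1,n_neighbors=7)
    clf_KNN.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_KNN.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['kNN Prediction'] = Y_pred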

In [11]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0


The grid search over the number of neighbors shows that all five candidate values score within about 0.002 of each other under 10-fold cross validation; n_neighbors=7 gives the highest mean F1 score, 0.9533. On the testing set, the model shows an F1 score of 1.0.
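
The fold-by-fold loop used above to record validation predictions can also be expressed with scikit-learn's `cross_val_predict`. The following is a minimal sketch, assuming position-based folds of 15 rows (`KFold` without shuffling) and the `n_neighbors=7` setting from the loop. The names `demo_df`, `feature_cols`, `folds`, and `oof_accuracy` are illustrative and do not appear in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Rebuild a shuffled iris dataframe (illustrative copy, not the notebook's df)
iris = load_iris()
feature_cols = list(iris["feature_names"])
demo_df = pd.DataFrame(iris["data"], columns=feature_cols)
demo_df["target"] = iris["target"]
demo_df = demo_df.sample(frac=1, random_state=1)

# KFold without shuffling splits by row position: rows 0-14, 15-29, ...
folds = KFold(n_splits=10, shuffle=False)

# Every row is predicted by a model fitted on the other nine folds only
demo_df["kNN Prediction"] = cross_val_predict(
    KNeighborsClassifier(n_neighbors=7, weights="uniform"),
    demo_df[feature_cols],
    demo_df["target"],
    cv=folds,
)

# Share of rows whose out-of-fold prediction matches the true label
oof_accuracy = (demo_df["kNN Prediction"] == demo_df["target"]).mean()
print(round(oof_accuracy, 4))
```

Because the folds are defined by position and the target is passed explicitly, this form cannot accidentally train on held-out rows or on a previously added prediction column.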

## (3) Naive Bayes

In [12]:
# cross-validation for Naive Bayes model with 10 folds
clf_NB = GaussianNB()
cross_val_score(clf_NB, X_train, Y_train, cv=10, scoring='f1_macro', n_jobs=-1).mean()

0.9613347763347763

In [13]:
# f1 score on testing set
clf_NB.fit(X_train, Y_train)
Y_pred = clf_NB.predict(X_test)
f1_score(Y_test, Y_pred, average='macro')

0.9027777777777778

In [14]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # the last column is now 'kNN Prediction', so the target must be selected by name
    Y_cross_train = cross_train[cfg_col_Y]
    clf_NB = GaussianNB()
    clf_NB.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_NB.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['Naive Bayes Prediction'] = Y_pred

In [15]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0


By running 10-fold cross validation on the Naive Bayes model, we get an F1 score of 0.9613. On the testing set, the model shows an F1 score of 0.9028. Even though the cross-validation score of Naive Bayes is slightly higher than that of the kNN model, the gap between its cross-validation and testing scores suggests a less stable performance.
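
One way to put a number on "stability" is to look at the spread of the per-fold scores rather than only their mean. The sketch below is an assumption-laden illustration: it splits the raw iris arrays directly (without the notebook's earlier shuffle), so its figures will not reproduce the scores above exactly, and `fold_scores` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=602)

# One macro-F1 score per fold instead of a single averaged number
fold_scores = cross_val_score(GaussianNB(), X_tr, y_tr, cv=10, scoring="f1_macro")

print("mean: {:.4f}  std: {:.4f}".format(fold_scores.mean(), fold_scores.std()))
print("min:  {:.4f}  max: {:.4f}".format(fold_scores.min(), fold_scores.max()))
```

A large standard deviation or a wide min/max range across folds would support the reading that the model's score depends strongly on which samples it is evaluated on.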

## (4) Logistic Regression

In [16]:
# cross-validation for Logistic Regression model with 10 folds
clf_LR = LogisticRegression(n_jobs=-1)
cross_val_score(clf_LR, X_train, Y_train, cv=10, scoring='f1_macro', n_jobs=-1).mean()

0.9539923039923041

In [17]:
# f1 score on testing set
clf_LR.fit(X_train, Y_train)
Y_pred = clf_LR.predict(X_test)
f1_score(Y_test, Y_pred, average='macro')

0.952136752136752

In [18]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    clf_LR = LogisticRegression(n_jobs=-1)
    clf_LR.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_LR.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['Logistic Regression Prediction'] = Y_pred

In [19]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0


By running 10-fold cross validation on the Logistic Regression model, we get an F1 score of 0.9540, and the model shows an F1 score of 0.9521 on the testing set. With cross-validation and testing scores this close, the model appears to have relatively low bias and low variance here, which is consistent with Logistic Regression remaining one of the most popular simple classification models among practitioners.

## (5) Random Forest

In [20]:
# grid search with 10-fold cv
parameters = {'n_estimators':[10, 50, 100, 200, 300, 400, 500],'max_depth':[1,2,4,8,16]}
clf_RF = GridSearchCV(RandomForestClassifier(n_jobs=-1,random_state=1),parameters,scoring='f1_macro',cv=10,n_jobs=-1)
clf_RF.fit(X_train,Y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False, random_state=1,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_

In [21]:
# mean f1 score of Random Forest for each of the 35 combinations of n_estimators and max_depth (7 x 5)
clf_RF.cv_results_['mean_test_score']

array([0.76920271, 0.94442645, 0.9364729 , 0.95304375, 0.95304375,
       0.95304375, 0.9609973 , 0.93768184, 0.94442645, 0.95242545,
       0.94442645, 0.94442645, 0.94442645, 0.94442645, 0.92872103,
       0.92789841, 0.9265054 , 0.93669731, 0.93669731, 0.93669731,
       0.93669731, 0.92872103, 0.93585196, 0.92789841, 0.92789841,
       0.92789841, 0.92789841, 0.92789841, 0.92872103, 0.93585196,
       0.92789841, 0.92789841, 0.92789841, 0.92789841, 0.92789841])

In [22]:
# prediction on testing set
Y_pred = clf_RF.predict(X_test)

In [23]:
# f1 score of testing set
f1_score(Y_test, Y_pred, average='macro')

0.952136752136752

In [24]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    clf_RF = RandomForestClassifier(n_estimators=100,n_jobs=-1,max_depth=4,random_state=1)
    clf_RF.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_RF.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['Random Forest Prediction'] = Y_pred

In [25]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction,Random Forest Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0,2.0


The grid search over the number of trees in the forest (n_estimators) and the maximum depth of each tree (max_depth) cross-validates 35 combinations (7 x 5). The highest 10-fold mean F1 score in the results above is 0.9610, and the fold predictions recorded in the dataframe use n_estimators=100 and max_depth=4. On the testing set, the refitted grid-search model shows an F1 score of 0.9521, close to the cross-validation scores.
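
A flat array of mean scores is hard to map back to parameter combinations. As a hedged sketch, `cv_results_` can be loaded into a dataframe and pivoted so that each score sits next to its settings. The grid here is deliberately much smaller than the one in the notebook so that the example runs quickly, and `small_grid`, `search`, and `score_table` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Reduced grid (assumption for speed): 2 x 2 instead of the notebook's 7 x 5
small_grid = {"n_estimators": [10, 50], "max_depth": [1, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    small_grid,
    scoring="f1_macro",
    cv=5,
)
search.fit(X, y)

# One row per combination, then one cell per (max_depth, n_estimators) pair
results = pd.DataFrame(search.cv_results_)
score_table = results.pivot(
    index="param_max_depth",
    columns="param_n_estimators",
    values="mean_test_score",
)

print(score_table.round(4))
print(search.best_params_, round(search.best_score_, 4))
```

`best_params_` and `best_score_` report the top-scoring combination directly, which avoids reading it off the array by position.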

## (6) Gradient Boosting Tree

In [26]:
from sklearn.ensemble import GradientBoostingClassifier
learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    clf_GBT = GradientBoostingClassifier(n_estimators=20, learning_rate = learning_rate, random_state = 0)
    clf_GBT.fit(X_train, Y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(clf_GBT.score(X_train, Y_train)))
    print("Accuracy score (test): {0:.3f}".format(clf_GBT.score(X_test, Y_test)))
    print()

Learning rate:  0.05
Accuracy score (training): 0.992
Accuracy score (test): 0.957

Learning rate:  0.1
Accuracy score (training): 1.000
Accuracy score (test): 0.957

Learning rate:  0.25
Accuracy score (training): 1.000
Accuracy score (test): 0.957

Learning rate:  0.5
Accuracy score (training): 1.000
Accuracy score (test): 0.957

Learning rate:  0.75
Accuracy score (training): 1.000
Accuracy score (test): 0.957

Learning rate:  1
Accuracy score (training): 1.000
Accuracy score (test): 0.957



In [27]:
clf_GBT = GradientBoostingClassifier(n_estimators=20, learning_rate = 0.1, random_state = 0)
clf_GBT.fit(X_train, Y_train)
predictions = clf_GBT.predict(X_test)

# f1 score of testing set
f1_score(Y_test, predictions, average='macro')

0.952136752136752
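
The learning-rate loop above compares rates by their accuracy on the testing set. A possible alternative, sketched here under the assumption that the same candidate rates and `n_estimators=20` are kept, is to choose the rate by cross validation on the training data only and touch the testing set once at the end. The split below is made on the raw iris arrays, so the numbers will differ from the notebook's, and `rate_grid`, `chosen_rate`, and `test_f1` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=602)

# Same candidate learning rates as in the loop above
rate_grid = {"learning_rate": [0.05, 0.1, 0.25, 0.5, 0.75, 1]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    rate_grid,
    scoring="f1_macro",
    cv=5,
)
search.fit(X_tr, y_tr)  # only training data is used to pick the rate

chosen_rate = search.best_params_["learning_rate"]
# score() applies the search's scoring metric (macro F1) to the refitted best model
test_f1 = search.score(X_te, y_te)
print(chosen_rate, round(test_f1, 4))
```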

In [28]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    clf_GBT = GradientBoostingClassifier(n_estimators=20, learning_rate = 0.1, random_state = 0)
    clf_GBT.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_GBT.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['Gradient Boosting Tree Prediction'] = Y_pred

In [29]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction,Random Forest Prediction,Gradient Boosting Tree Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0,2.0,2.0


## (7) Support Vector Machine

In [30]:
#Import svm model
from sklearn import svm
clf_SVM = svm.SVC()
clf_SVM.fit(X_train, Y_train)
Y_pred = clf_SVM.predict(X_test)
# f1 score of testing set
f1_score(Y_test, Y_pred, average='macro')

1.0

In [31]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    clf_SVM = svm.SVC()
    clf_SVM.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_SVM.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['SVM Prediction'] = Y_pred
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction,Random Forest Prediction,Gradient Boosting Tree Prediction,SVM Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


## (8) Artificial Neural Networks

In [32]:
from sklearn.neural_network import MLPClassifier
clf_ANN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 5), random_state=1)
clf_ANN.fit(X_train, Y_train)
Y_pred = clf_ANN.predict(X_test)
# f1 score of testing set
f1_score(Y_test, Y_pred, average='macro')

0.952136752136752

In [33]:
# record out-of-fold predictions from a manual 10-fold cross validation into the original dataframe
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    clf_ANN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 5), random_state=1)
    clf_ANN.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = clf_ANN.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['ANN Prediction'] = Y_pred
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction,Random Forest Prediction,Gradient Boosting Tree Prediction,SVM Prediction,ANN Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


## (9) Ensemble Model

In [34]:
# vote for the most predicted label of each sample and choose it as the output of our ensemble model
# calculate the 10-fold cv score of our ensemble model
prediction = df.iloc[:,5:]
answer = df['target']
cv_score = np.zeros(10)
for i in range(10):
    current_prediction = prediction.iloc[(i*15):((i+1)*15),:]
    current_answer = answer.iloc[(i*15):((i+1)*15)]
    # mode(axis=1) returns one column per tied label; column 0 holds the smallest most-voted label,
    # so taking it gives f1_score a single label per sample even when the votes are tied
    mode_prediction = current_prediction.mode(axis=1)[0]
    cv_score[i] = f1_score(current_answer, mode_prediction, average='macro')
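
The vote above relies on `DataFrame.mode(axis=1)`, which returns one column per tied label. That matters whenever the models split their votes evenly. Here is a small sketch with made-up votes and hypothetical model names (`model_a` to `model_d`), not results from the notebook:

```python
import pandas as pd

# Made-up votes from four hypothetical models for three samples
votes = pd.DataFrame(
    {
        "model_a": [0.0, 1.0, 2.0],
        "model_b": [0.0, 1.0, 1.0],
        "model_c": [0.0, 2.0, 2.0],
        "model_d": [1.0, 2.0, 2.0],
    }
)

# The second sample is a 2-2 tie between labels 1 and 2, so mode() needs two columns;
# rows without a tie are padded with NaN in the extra column
modes = votes.mode(axis=1)
print(modes)

# Column 0 always holds a valid label: the smallest of the most-voted labels
majority = modes[0]
print(majority.tolist())  # [0.0, 1.0, 2.0]
```

Taking column 0 resolves ties in favour of the smallest label, which is a simple but arbitrary rule. An odd number of models makes ties less likely but, with three classes, does not rule them out.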

In [35]:
from sklearn.ensemble import VotingClassifier
estimators = []
estimators.append(('knn', clf_KNN))
estimators.append(('Naive Bayes', clf_NB))
estimators.append(('Logistic Regression', clf_LR))
estimators.append(('Random Forest', clf_RF))
estimators.append(('Gradient Boosting Tree', clf_GBT))
estimators.append(('SVM', clf_SVM))
estimators.append(('Artificial Neural Networks', clf_ANN))
# create the ensemble model
ensemble = VotingClassifier(estimators)
ensemble.fit(X_train, Y_train)
Y_pred = ensemble.predict(X_test)
# f1 score of testing set
f1_score(Y_test, Y_pred, average='macro')

0.952136752136752
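
`VotingClassifier` defaults to hard voting, where each model casts one vote for a label. It also supports soft voting, which averages the predicted class probabilities. The sketch below compares the two on a reduced, illustrative ensemble of three models. The member choice, the `max_iter=1000` setting, and the helper `make_members` are assumptions for this example, and the split is made on the raw iris arrays, so scores will differ from the notebook's.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=602)


def make_members():
    """Return fresh (name, estimator) pairs; all three support predict_proba."""
    return [
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]


scores = {}
for voting in ("hard", "soft"):
    ensemble = VotingClassifier(make_members(), voting=voting)
    ensemble.fit(X_tr, y_tr)
    scores[voting] = f1_score(y_te, ensemble.predict(X_te), average="macro")

print(scores)
```

Soft voting requires every member to expose `predict_proba`; for example, `svm.SVC` would need `probability=True` before it could take part.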

In [36]:
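# record out-of-fold predictions of the ensemble from a manual 10-fold cross validation
Y_pred = []
row_pos = np.arange(len(df))
for i in range(10):
    # select the fold by row position, since df.index holds the shuffled labels
    in_fold = (row_pos >= i*15) & (row_pos < (i+1)*15)
    X_cross_test = df.loc[in_fold, cfg_col_X]
    cross_train = df[~in_fold]
    X_cross_train = cross_train[cfg_col_X]
    # select the target by name; the last column is a prediction column at this point
    Y_cross_train = cross_train[cfg_col_Y]
    ensemble.fit(X_cross_train,Y_cross_train)
    Y_cross_pred = ensemble.predict(X_cross_test)
    Y_pred.extend(Y_cross_pred)
df['Ensemble Prediction'] = Y_pred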
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,kNN Prediction,Naive Bayes Prediction,Logistic Regression Prediction,Random Forest Prediction,Gradient Boosting Tree Prediction,SVM Prediction,ANN Prediction,Ensemble Prediction
14,5.8,4.0,1.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,5.1,2.5,3.0,1.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75,6.6,3.0,4.4,1.4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
16,5.4,3.9,1.3,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131,7.9,3.8,6.4,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
