# Experiment 1

## Model choice w/ hyperparam optimization on diagnosing emotion

Due to the past assignments I have been lead to believe that extreem gradient boosting (XGB) will lead to large improvements over other model choices, though I will be trying several out of the box models within sklearn and use Bayesian Optimization to tune the hyperparameters.

### Bayesian Optimization

Bayesian Optimization is a non-gradient-based arbitrary function optimization algorithm that I will be using to tune the hyperparameters of each model (to the extent that they have them). This is particularly useful in tuning the large amount of hyperparameters in XGB/any algorithm with a large amount of hyperparameters.

## Models to be used
The models to be experimented with are as follows:
- SVC
- XGB
- Naive Bayes
- Random Forest Classifier
- Feed Forward Neural Network

In [1]:
from scipy.io import arff
import pandas
from sklearn import svm, naive_bayes, ensemble, neural_network, metrics
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
import numpy as np
from bayes_opt import BayesianOptimization
import xgboost as xgb
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [2]:
data, meta = arff.loadarff('emobase2010.old.arff')

In [3]:
df = pandas.DataFrame.from_records(data)

In [4]:
df.columns = data.dtype.names

In [5]:
# remove neutral, unknown and other classes
a = df['class']!=b'NEU'
b = df['class']!=b'UNK'
c = df['class']!=b'OTH'
df = df.loc[a&b&c]

In [6]:
df['class'].value_counts()

b'DIS'    467
b'SUR'    452
b'ACC'    450
b'ANT'    412
b'SAD'    285
b'FEA'    239
b'JOY'    226
b'ANG'    212
Name: class, dtype: int64

In [7]:
adata = df.as_matrix()

In [8]:
features, labels = np.split(adata, [-1], axis=1)
labels = [s for s in labels]

In [9]:
print(np.shape(labels))
le = LabelEncoder()
labels = [s[0] for s in labels]
le.fit(labels)
labels = le.transform(labels)
print(labels)

(2743, 1)
[0 3 0 ..., 1 2 1]


## SVC

In [10]:
wclf = svm.SVC(kernel='linear', class_weight='balanced')

In [67]:
predicted = cross_val_predict(wclf, features, labels)

In [68]:
metrics.accuracy_score(labels, predicted)

0.24462267590229675

In [14]:
# rough baseline -- majority class
print(467/len(labels), 1/6)

0.17025154939846884 0.16666666666666666


Alright so this is a pretty good score considering the rough baseline, unfortunately using Bayesian Optimization proved to be computationally prohibitive for the SVC. I provided the code to run it below, but was unable to finish it on my laptop.

In [11]:
def svceval(C, gamma):
    
    params['C'] = float(C)
    params['gamma'] = float(gamma)
    
    wclf = svm.SVC(kernel='linear', class_weight='balanced', **params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [None]:
num_rounds = 3000
random_state = 2017
num_iter = 10
init_points = 5
params = {}

xgbBO = BayesianOptimization(svceval, {'C': (0.001, 100), 
                                       'gamma': (0.0001, 0.1)
                                        })

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m-----------------------------------------------------[0m
 Step |   Time |      Value |         C |     gamma | 


## XGB

In [11]:
wclf = xgb.XGBClassifier()

In [12]:
predicted = cross_val_predict(wclf, features, labels)

In [13]:
metrics.accuracy_score(labels, predicted)

0.30441122858184472

Ok, so we can see that the regular XGB improves the baseline pretty dramatically. Now let's define a function for our Bayesian Optimizer to optimize.

In [31]:
def xgbeval(min_child_weight,
                 colsample_bytree,
                 max_depth,
                 subsample,
                 gamma):

    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    wclf = xgb.XGBClassifier(**params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

Now we optimize the function, warning: this will take a super long time to run. It may be advisable to run it on a server or just simply look at the output attatched to this notebook.

In [32]:
num_rounds = 3000
random_state = 2017
num_iter = 10
init_points = 5
params = {}

xgbBO = BayesianOptimization(xgbeval, {'min_child_weight': (1, 20),
                                                'colsample_bytree': (0.1, 1),
                                                'max_depth': (5, 15),
                                                'subsample': (0.5, 1),
                                                'gamma': (0, 10)
                                                })
xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m---------------------------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |   colsample_bytree |     gamma |   max_depth |   min_child_weight |   subsample | 
    1 | 04m48s | [35m   0.31134[0m | [32m            0.3366[0m | [32m   2.1001[0m | [32m     9.4543[0m | [32m            6.0659[0m | [32m     0.8636[0m | 
    2 | 08m54s |    0.29202 |             0.8072 |    0.7104 |      8.9388 |             2.7824 |      0.6164 | 
    3 | 03m07s |    0.30587 |             0.2928 |    5.1080 |      9.3488 |             9.3586 |      0.5261 | 
    4 | 06m07s |    0.29821 |             0.7249 |    8.5479 |     14.4441 |            12.0732 |      0.5231 | 
    5 | 03m26s |    0.30623 |             0.4091 |    2.7943 |     10.7828 |            19.6571 |      0.6007 | 
[31mBayesian Optimization[0m
[94m---------------------------------------------------------------------------------------

  " state: %s" % convergence_dict)


   14 | 02m44s |    0.31061 |             0.1076 |    2.6737 |     14.6929 |             7.0232 |      0.9564 | 
   15 | 13m30s |    0.30806 |             0.9887 |    1.2976 |     11.8162 |             8.8511 |      0.9865 | 


In [34]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results
XGBOOST: 0.311338
Best Params: {'subsample': 0.86358306525373618, 'colsample_bytree': 0.33658558928849269, 'max_depth': 9.4543347834429881, 'min_child_weight': 6.0659314521001315, 'gamma': 2.1000619104831797}


Well, a little dissapointing that a couple hours only got us about another percent of accuracy out of it. Lets repeat the process for other models.

## Naive Bayes

In [36]:
wclf = naive_bayes.GaussianNB()

In [37]:
predicted = cross_val_predict(wclf, features, labels)

In [38]:
metrics.accuracy_score(labels, predicted)

0.20889537003281078

Ok so that's not so good, and the downside is that because the Naive Bayes is so simple it doesn't allow for hyperparameter tuning as there are no hyperparameters to tune. We can use this as a good baseline however. Let's move on to something more interesting.

## Random Forest Classifier
Kind of the little brother of XGB, Random Forests represent a very reasonable model choice for a lot of tasks. With the addition of many hyperparameters we can see how much more accuracy we can squeeze out with Bayesian Optimization.

In [39]:
wclf = ensemble.RandomForestClassifier()

In [40]:
predicted = cross_val_predict(wclf, features, labels)

In [41]:
metrics.accuracy_score(labels, predicted)

0.23769595333576377

Ok so that's with entirely default parameters, now let's tune.

In [44]:
def rfeval(n_estimators,
          max_depth,
          min_samples_split,
          min_samples_leaf,
          min_weight_fraction_leaf,
          min_impurity_split):
    
    params['n_estimators'] = int(n_estimators)
    params['max_depth'] = int(max_depth)
    params['min_samples_split'] = float(min_samples_split)
    params['min_samples_leaf'] = float(min_samples_leaf)
    params['min_weight_fraction_leaf'] = float(min_weight_fraction_leaf)
    params['min_impurity_split'] = float(min_impurity_split)
    wclf = ensemble.RandomForestClassifier(**params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [50]:
num_iter = 25
init_points = 10
params = {}

xgbBO = BayesianOptimization(rfeval, {'n_estimators': (1, 20),
                                                'max_depth': (5, 15),
                                                'min_samples_split': (1e-10, 1),
                                                'min_samples_leaf': (1e-10, 0.5),
                                                'min_weight_fraction_leaf': (1e-10, 0.5),
                                                'min_impurity_split': (1e-10,5)
                                                })
xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m-----------------------------------------------------------------------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |   max_depth |   min_impurity_split |   min_samples_leaf |   min_samples_split |   min_weight_fraction_leaf |   n_estimators | 
    1 | 00m03s | [35m   0.17025[0m | [32m     8.2271[0m | [32m              1.3648[0m | [32m            0.1369[0m | [32m             0.7024[0m | [32m                    0.4156[0m | [32m        6.2759[0m | 
    2 | 00m03s |    0.17025 |      8.9579 |               4.0618 |             0.4170 |              0.7289 |                     0.4430 |        19.2143 | 
    3 | 00m02s |    0.17025 |      5.0506 |               3.1692 |             0.4092 |              0.5973 |                     0.4241 |        11.2262 | 
    4 | 00m03s | [35m   0.23077[0m | [32m    11.5825[0m | [32m              0.7003[0m | [32m            0.0880[0

  " state: %s" % convergence_dict)


   20 | 00m13s |    0.17025 |     15.0000 |               5.0000 |             0.0000 |              0.0000 |                     0.5000 |        20.0000 | 
   21 | 00m16s | [35m   0.24353[0m | [32m    11.4489[0m | [32m              0.0000[0m | [32m            0.0000[0m | [32m             0.0000[0m | [32m                    0.5000[0m | [32m       11.4352[0m | 
   22 | 00m15s |    0.24098 |      7.3848 |               0.0000 |             0.0000 |              0.0000 |                     0.5000 |        20.0000 | 
   23 | 00m14s |    0.23697 |      5.3645 |               0.0000 |             0.0000 |              0.3338 |                     0.5000 |        20.0000 | 
   24 | 00m11s |    0.22895 |      8.7176 |               0.0000 |             0.0000 |              0.0000 |                     0.5000 |        11.8383 | 
   25 | 00m13s |    0.17025 |     10.8394 |               0.8673 |             0.0000 |              0.0000 |                     0.5000 |        19.653

In [None]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

We were able to improve it a decent amount, now let's try it on a neural network.

## Multilayer Perceptron

In [14]:
wclf = neural_network.MLPClassifier()

In [15]:
predictions = cross_val_predict(wclf, features, labels)

In [16]:
metrics.accuracy_score(labels, predicted)

0.30441122858184472

Ok, doing approximately the same as the XGB. Now let's see how much we can improve this baseline with hyperparameter tuning.

In [17]:
def nneval(hidden_layer_sizes,
          alpha,
          max_iter,
          momentum
          ):
    
    params['hidden_layer_sizes'] = int(hidden_layer_sizes)
    params['alpha'] = float(alpha)
    params['max_iter'] = int(max_iter)
    params['momentum'] = float(momentum)
    wclf = neural_network.MLPClassifier(**params)
    
    predictions = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [18]:
num_iter = 25
init_points = 10
params = {}

xgbBO = BayesianOptimization(nneval, {'hidden_layer_sizes': (1, 2000),
                                                'alpha': (0, 1),
                                                'max_iter': (1, 1000),
                                                'momentum': (0, 1),
                                                })

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |     alpha |   hidden_layer_sizes |   max_iter |   momentum | 
    1 | 00m22s | [35m   0.30441[0m | [32m   0.5979[0m | [32m            295.0528[0m | [32m  102.4887[0m | [32m    0.3906[0m | 
    2 | 01m41s |    0.30441 |    0.3655 |            1507.8652 |   306.6683 |     0.4306 | 
    3 | 00m13s |    0.30441 |    0.9130 |             191.5456 |   470.4478 |     0.0219 | 
    4 | 02m02s |    0.30441 |    0.3403 |            1969.1696 |   483.9889 |     0.0304 | 
    5 | 01m29s |    0.30441 |    0.8284 |            1329.8292 |   538.2416 |     0.7872 | 
    6 | 00m22s |    0.30441 |    0.4696 |             336.2538 |   387.0172 |     0.2317 | 
    7 | 01m50s |    0.30441 |    0.8471 |            1496.2960 |   371.1876 |     0.5711 | 
    8 | 01m40s |    0.30441 |    0.6670 |            1248.6818 |   666.7233 |     0.2567 | 
   



[31mBayesian Optimization[0m
[94m------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |     alpha |   hidden_layer_sizes |   max_iter |   momentum | 




   11 | 03m13s |    0.30441 |    0.2509 |            1995.2197 |   996.0094 |     0.6781 | 




   12 | 00m44s |    0.30441 |    0.0326 |             944.7270 |     6.6165 |     0.0500 | 




   13 | 01m20s |    0.30441 |    0.9516 |             781.6838 |   994.5673 |     0.9263 | 




   14 | 00m17s |    0.30441 |    0.8792 |               5.3752 |   954.7635 |     0.8511 | 




   15 | 00m19s |    0.30441 |    0.4225 |              11.8761 |    11.9248 |     0.8719 | 




   16 | 01m19s |    0.30441 |    0.4933 |            1453.3378 |   994.7019 |     0.6449 | 




   17 | 00m53s |    0.30441 |    0.2218 |             735.5566 |   398.9625 |     0.5321 | 




   18 | 01m50s |    0.30441 |    0.7276 |            1996.9570 |    20.6846 |     0.8805 | 




   19 | 00m39s |    0.30441 |    0.7178 |             539.5538 |    28.6458 |     0.0440 | 




   20 | 02m01s |    0.30441 |    0.8036 |            1720.8840 |   807.6312 |     0.2702 | 




   21 | 01m09s |    0.30441 |    0.0302 |            1433.6540 |     9.7319 |     0.0121 | 




   22 | 00m35s |    0.30441 |    0.6554 |             556.6435 |   421.0155 |     0.4299 | 




   23 | 00m31s |    0.30441 |    0.7291 |             361.8453 |   640.5830 |     0.2992 | 




   24 | 03m22s |    0.30441 |    0.8326 |            1864.7208 |   493.5720 |     0.4627 | 




   25 | 01m21s |    0.30441 |    0.5694 |             963.4945 |   445.7804 |     0.5866 | 




   26 | 01m04s |    0.30441 |    0.7661 |             912.7508 |   116.4773 |     0.0510 | 




   27 | 00m18s |    0.30441 |    0.0948 |             111.0786 |   838.6054 |     0.7529 | 




   28 | 01m19s |    0.30441 |    0.2078 |            1221.6622 |   867.6947 |     0.2643 | 




   29 | 01m43s |    0.30441 |    0.8030 |            1185.6962 |   624.9690 |     0.6572 | 




   30 | 01m22s |    0.30441 |    0.7703 |             967.0266 |   437.7335 |     0.3943 | 




   31 | 00m22s |    0.30441 |    0.0801 |              96.5543 |   254.3581 |     0.9190 | 




   32 | 00m50s |    0.30441 |    0.4010 |             804.5094 |   870.6812 |     0.2646 | 




   33 | 01m42s |    0.30441 |    0.0065 |            1440.7139 |   754.4995 |     0.6870 | 




   34 | 00m34s |    0.30441 |    0.2522 |             407.6040 |   163.0261 |     0.2908 | 




   35 | 00m26s |    0.30441 |    0.1964 |             131.2935 |   273.7879 |     0.1751 | 


In [19]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results
XGBOOST: 0.304411
Best Params: {'hidden_layer_sizes': 295.05280471785909, 'momentum': 0.39058478805905594, 'max_iter': 102.48867041393048, 'alpha': 0.59790180250810609}


Interestingly Bayesian Optimization doesn't seem to help the neural network for several reasons, one of which is that when predicting the level of the max iterations it can cause non-convergence if it sets it too low. It seems that, for neural networks at least, it helps much more to actually know what you're doing when setting the hyperparameters. Bayesian Optimization can theoretically help with setting the proper hidden layer size, which I will test below.

In [20]:
def nneval(hidden_layer_sizes):
    
    params['hidden_layer_sizes'] = int(hidden_layer_sizes)
    wclf = neural_network.MLPClassifier(**params)
    
    predictions = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [23]:
num_iter = 5
init_points = 3
params = {}

xgbBO = BayesianOptimization(nneval, {'hidden_layer_sizes': (1, 2000)})

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m----------------------------------------------------[0m
 Step |   Time |      Value |   hidden_layer_sizes | 
    1 | 00m22s | [35m   0.30441[0m | [32m            452.1221[0m | 
    2 | 00m21s |    0.30441 |             322.6510 | 
    3 | 00m59s |    0.30441 |            1072.1705 | 




[31mBayesian Optimization[0m
[94m----------------------------------------------------[0m
 Step |   Time |      Value |   hidden_layer_sizes | 




    4 | 02m47s |    0.30441 |            1999.9397 | 




    5 | 00m13s |    0.30441 |               1.0021 | 




    6 | 01m38s |    0.30441 |            1582.0588 | 




    7 | 00m45s |    0.30441 |             769.3950 | 




    8 | 01m37s |    0.30441 |            1337.1105 | 


In [24]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results
XGBOOST: 0.304411
Best Params: {'hidden_layer_sizes': 452.12210527070141}


It actually looks like the hidden layer size has extremely little affect on the overall performance of the model.

## Ensembling Models

Let's just do one last experiment to see how high of an accuracy we can feasibly get. I'll build an ensemble model of 3 XGB's with the optimized hyperparameters.

In [16]:
params = {'subsample': 0.86358306525373618, 
          'colsample_bytree': 0.33658558928849269, 
          'max_depth': 9.4543347834429881, 
          'min_child_weight': 6.0659314521001315, 
          'gamma': 2.1000619104831797}

params['min_child_weight'] = int(params['min_child_weight'])
params['colsample_bytree'] = max(min(params['colsample_bytree'], 1), 0)
params['max_depth'] = int(params['max_depth'])
params['subsample'] = max(min(params['subsample'], 1), 0)
params['gamma'] = max(params['gamma'], 0)

clf1 = xgb.XGBClassifier(**params)
clf2 = xgb.XGBClassifier(**params)
clf3 = xgb.XGBClassifier(**params)

In [17]:
wclf = ensemble.VotingClassifier(estimators=[('xgb', clf1), ('xgb', clf2), ('xgb', clf3)])
predictions = cross_val_predict(wclf, features, labels)
metrics.accuracy_score(labels, predictions)

0.3113379511483777

Looks like we don't really get a boost with the ensembling unfortunately.

# Results

SVC | SVC w/ BO | XGB | XGB w/ BO | NB | RF | RF w/ BO | MLP | MLP w/ BO | XGB Ensemble
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0.245  |  dnf  |  0.304  |  0.311  |  0.208  |  0.237  | 0.269  | 0.304  |  0.304 | 0.311

## Conclusion

As we can see from the above results, XGB has again proved to be the highest performing model. Beyond that, we have seen that Bayesian Optimization can improve the performance of all decision tree based algorithms, but does not seem to be effective for the MLP. In the next experiment, we will take these results and see how feature normalization/dimensionality reduction will affect our accuracy.