# Experiment 1

## Model choice w/ hyperparam optimization on diagnosing emotion

Due to the past assignments I have been lead to believe that extreem gradient boosting (XGB) will lead to large improvements over other model choices, though I will be trying several out of the box models within sklearn and use Bayesian Optimization to tune the hyperparameters.

### Bayesian Optimization

Bayesian Optimization is a non-gradient-based arbitrary function optimization algorithm that I will be using to tune the hyperparameters of each model (to the extent that they have them). This is particularly useful in tuning the large amount of hyperparameters in XGB/any algorithm with a large amount of hyperparameters.

## Models to be used
The models to be experimented with are as follows:
- SVC
- XGB
- Naive Bayes
- Random Forest Classifier
- Feed Forward Neural Network

In [4]:
from scipy.io import arff
import pandas
from sklearn import svm, naive_bayes, ensemble, neural_network, metrics
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
import numpy as np
from bayes_opt import BayesianOptimization
import xgboost as xgb
import matplotlib.pyplot as plt

In [5]:
data, meta = arff.loadarff('emobase2010.arff')

In [6]:
df = pandas.DataFrame.from_records(data)

In [7]:
df.columns = data.dtype.names

In [8]:
# remove neutral, unknown and other classes
a = df['class']!=b'NEU'
b = df['class']!=b'UNK'
c = df['class']!=b'OTH'
df = df.loc[a&b&c]

In [9]:
df['class'].value_counts()

b'DIS'    467
b'SUR'    452
b'ACC'    450
b'ANT'    412
b'SAD'    285
b'FEA'    239
b'JOY'    226
b'ANG'    212
Name: class, dtype: int64

In [10]:
adata = df.as_matrix()

In [11]:
features, labels = np.split(adata, [-1], axis=1)
labels = [s for s in labels]

In [12]:
print(np.shape(labels))
le = LabelEncoder()
labels = [s[0] for s in labels]
le.fit(labels)
labels = le.transform(labels)
print(labels)

(2743, 1)
[0 3 0 ..., 1 2 1]


## SVC

In [64]:
wclf = svm.SVC(kernel='linear', class_weight='balanced')

In [67]:
predicted = cross_val_predict(wclf, features, labels)

In [68]:
metrics.accuracy_score(labels, predicted)

0.24462267590229675

In [14]:
# rough baseline -- majority class
print(467/len(labels), 1/6)

0.17025154939846884 0.16666666666666666


Alright so this is a pretty good score considering the rough baseline, now let's try optimizing the hyperparameters with Bayesian Optimization. Just as a warning this is going to run the above code around 15 times, so either be prepared to wait a couple of hours (I ran it overnight) or just observe the results from this notebook.

In [11]:
def svceval(C, gamma):
    
    params['C'] = float(C)
    params['gamma'] = float(gamma)
    
    wclf = svm.SVC(kernel='linear', class_weight='balanced', **params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [None]:
num_rounds = 3000
random_state = 2017
num_iter = 10
init_points = 5
params = {}

xgbBO = BayesianOptimization(svceval, {'C': (0.001, 100), 
                                       'gamma': (0.0001, 0.1)
                                        })

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m-----------------------------------------------------[0m
 Step |   Time |      Value |         C |     gamma | 


## XGB

In [13]:
wclf = xgb.XGBClassifier()

In [14]:
predicted = cross_val_predict(wclf, features, labels)

KeyboardInterrupt: 

In [15]:
metrics.accuracy_score(labels, predicted)

NameError: name 'predicted' is not defined

Ok, so we can see that the regular XGB improves the baseline pretty dramatically. Now let's define a function for our Bayesian Optimizer to optimize.

In [31]:
def xgbeval(min_child_weight,
                 colsample_bytree,
                 max_depth,
                 subsample,
                 gamma):

    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    wclf = xgb.XGBClassifier(**params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

Now we optimize the function, warning: this will take a super long time to run. It may be advisable to run it on a server or just simply look at the output attatched to this notebook.

In [32]:
num_rounds = 3000
random_state = 2017
num_iter = 10
init_points = 5
params = {}

xgbBO = BayesianOptimization(xgbeval, {'min_child_weight': (1, 20),
                                                'colsample_bytree': (0.1, 1),
                                                'max_depth': (5, 15),
                                                'subsample': (0.5, 1),
                                                'gamma': (0, 10)
                                                })
xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m---------------------------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |   colsample_bytree |     gamma |   max_depth |   min_child_weight |   subsample | 
    1 | 04m48s | [35m   0.31134[0m | [32m            0.3366[0m | [32m   2.1001[0m | [32m     9.4543[0m | [32m            6.0659[0m | [32m     0.8636[0m | 
    2 | 08m54s |    0.29202 |             0.8072 |    0.7104 |      8.9388 |             2.7824 |      0.6164 | 
    3 | 03m07s |    0.30587 |             0.2928 |    5.1080 |      9.3488 |             9.3586 |      0.5261 | 
    4 | 06m07s |    0.29821 |             0.7249 |    8.5479 |     14.4441 |            12.0732 |      0.5231 | 
    5 | 03m26s |    0.30623 |             0.4091 |    2.7943 |     10.7828 |            19.6571 |      0.6007 | 
[31mBayesian Optimization[0m
[94m---------------------------------------------------------------------------------------

  " state: %s" % convergence_dict)


   14 | 02m44s |    0.31061 |             0.1076 |    2.6737 |     14.6929 |             7.0232 |      0.9564 | 
   15 | 13m30s |    0.30806 |             0.9887 |    1.2976 |     11.8162 |             8.8511 |      0.9865 | 


In [34]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results
XGBOOST: 0.311338
Best Params: {'subsample': 0.86358306525373618, 'colsample_bytree': 0.33658558928849269, 'max_depth': 9.4543347834429881, 'min_child_weight': 6.0659314521001315, 'gamma': 2.1000619104831797}


Well, a little dissapointing that a couple hours only got us about another percent of accuracy out of it. Lets repeat the process for other models.

## Naive Bayes

In [36]:
wclf = naive_bayes.GaussianNB()

In [37]:
predicted = cross_val_predict(wclf, features, labels)

In [38]:
metrics.accuracy_score(labels, predicted)

0.20889537003281078

Ok so that's not so good, and the downside is that because the Naive Bayes is so simple it doesn't allow for hyperparameter tuning as there are no hyperparameters to tune. We can use this as a good baseline however. Let's move on to something more interesting.

## Random Forest Classifier
Kind of the little brother of XGB, Random Forests represent a very reasonable model choice for a lot of tasks. With the addition of many hyperparameters we can see how much more accuracy we can squeeze out with Bayesian Optimization.

In [39]:
wclf = ensemble.RandomForestClassifier()

In [40]:
predicted = cross_val_predict(wclf, features, labels)

In [41]:
metrics.accuracy_score(labels, predicted)

0.23769595333576377

Ok so that's with entirely default parameters, now let's tune.

In [44]:
def rfeval(n_estimators,
          max_depth,
          min_samples_split,
          min_samples_leaf,
          min_weight_fraction_leaf,
          min_impurity_split):
    
    params['n_estimators'] = int(n_estimators)
    params['max_depth'] = int(max_depth)
    params['min_samples_split'] = float(min_samples_split)
    params['min_samples_leaf'] = float(min_samples_leaf)
    params['min_weight_fraction_leaf'] = float(min_weight_fraction_leaf)
    params['min_impurity_split'] = float(min_impurity_split)
    wclf = ensemble.RandomForestClassifier(**params)
    
    predicted = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [50]:
num_iter = 25
init_points = 10
params = {}

xgbBO = BayesianOptimization(rfeval, {'n_estimators': (1, 20),
                                                'max_depth': (5, 15),
                                                'min_samples_split': (1e-10, 1),
                                                'min_samples_leaf': (1e-10, 0.5),
                                                'min_weight_fraction_leaf': (1e-10, 0.5),
                                                'min_impurity_split': (1e-10,5)
                                                })
xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m-----------------------------------------------------------------------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |   max_depth |   min_impurity_split |   min_samples_leaf |   min_samples_split |   min_weight_fraction_leaf |   n_estimators | 
    1 | 00m03s | [35m   0.17025[0m | [32m     8.2271[0m | [32m              1.3648[0m | [32m            0.1369[0m | [32m             0.7024[0m | [32m                    0.4156[0m | [32m        6.2759[0m | 
    2 | 00m03s |    0.17025 |      8.9579 |               4.0618 |             0.4170 |              0.7289 |                     0.4430 |        19.2143 | 
    3 | 00m02s |    0.17025 |      5.0506 |               3.1692 |             0.4092 |              0.5973 |                     0.4241 |        11.2262 | 
    4 | 00m03s | [35m   0.23077[0m | [32m    11.5825[0m | [32m              0.7003[0m | [32m            0.0880[0

  " state: %s" % convergence_dict)


   20 | 00m13s |    0.17025 |     15.0000 |               5.0000 |             0.0000 |              0.0000 |                     0.5000 |        20.0000 | 
   21 | 00m16s | [35m   0.24353[0m | [32m    11.4489[0m | [32m              0.0000[0m | [32m            0.0000[0m | [32m             0.0000[0m | [32m                    0.5000[0m | [32m       11.4352[0m | 
   22 | 00m15s |    0.24098 |      7.3848 |               0.0000 |             0.0000 |              0.0000 |                     0.5000 |        20.0000 | 
   23 | 00m14s |    0.23697 |      5.3645 |               0.0000 |             0.0000 |              0.3338 |                     0.5000 |        20.0000 | 
   24 | 00m11s |    0.22895 |      8.7176 |               0.0000 |             0.0000 |              0.0000 |                     0.5000 |        11.8383 | 
   25 | 00m13s |    0.17025 |     10.8394 |               0.8673 |             0.0000 |              0.0000 |                     0.5000 |        19.653

In [None]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

We were able to improve it a decent amount, now let's try it on a neural network.

## Multilayer Perceptron

In [51]:
wclf = neural_network.MLPClassifier()

In [52]:
predictions = cross_val_predict(wclf, features, labels)

In [53]:
metrics.accuracy_score(labels, predicted)

0.23769595333576377

Ok, doing approximately the same as the Random Forest. Now let's see how much we can improve this baseline with hyperparameter tuning.

In [57]:
def nneval(hidden_layer_sizes,
          alpha,
          max_iter,
          momentum
          ):
    
    params['hidden_layer_sizes'] = int(hidden_layer_sizes)
    params['alpha'] = float(alpha)
    params['max_iter'] = int(max_iter)
    params['momentum'] = float(momentum)
    wclf = neural_network.MLPClassifier(**params)
    
    predictions = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [58]:
num_iter = 25
init_points = 10
params = {}

xgbBO = BayesianOptimization(nneval, {'hidden_layer_sizes': (1, 2000),
                                                'alpha': (0, 1),
                                                'max_iter': (1, 1000),
                                                'momentum': (0, 1),
                                                })

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |     alpha |   hidden_layer_sizes |   max_iter |   momentum | 
    1 | 00m23s | [35m   0.23770[0m | [32m   0.8330[0m | [32m            447.3926[0m | [32m  564.8413[0m | [32m    0.9552[0m | 
    2 | 00m31s |    0.23770 |    0.4811 |             581.0781 |   808.9416 |     0.5633 | 
    3 | 01m41s |    0.23770 |    0.0356 |            1801.2926 |   984.3828 |     0.9246 | 
    4 | 01m09s |    0.23770 |    0.9591 |            1416.2374 |   767.6943 |     0.5836 | 
    5 | 01m44s |    0.23770 |    0.8029 |            1822.3162 |   474.8421 |     0.2187 | 
    6 | 01m03s |    0.23770 |    0.2403 |            1754.1954 |   216.1525 |     0.0673 | 
    7 | 00m13s |    0.23770 |    0.7234 |             241.7600 |   895.3285 |     0.7591 | 
    8 | 00m11s |    0.23770 |    0.4555 |             150.0496 |   687.2552 |     0.7285 | 
   



[31mBayesian Optimization[0m
[94m------------------------------------------------------------------------------------------[0m
 Step |   Time |      Value |     alpha |   hidden_layer_sizes |   max_iter |   momentum | 




   11 | 00m20s |    0.23770 |    0.5110 |               4.7368 |     2.3946 |     0.3529 | 




   12 | 00m18s |    0.23770 |    0.3663 |            1999.0257 |     1.9800 |     0.1381 | 




   13 | 00m19s |    0.23770 |    0.7063 |            1299.1278 |     2.2767 |     0.6072 | 




   14 | 00m16s |    0.23770 |    0.1634 |               1.4881 |   994.9635 |     0.0829 | 




   15 | 00m14s |    0.23770 |    0.5768 |             502.7849 |     2.8394 |     0.1410 | 




   16 | 01m04s |    0.23770 |    0.1559 |            1051.6425 |   999.8829 |     0.4360 | 




   17 | 01m30s |    0.23770 |    0.1050 |            1997.0634 |   976.8648 |     0.2026 | 




   18 | 00m18s |    0.23770 |    0.4589 |               4.7814 |   322.2557 |     0.5049 | 




   19 | 01m14s |    0.23770 |    0.5727 |            1409.7515 |   351.3520 |     0.1234 | 




   20 | 01m37s |    0.23770 |    0.8234 |            1987.6209 |   363.8301 |     0.2746 | 




   21 | 00m29s |    0.23770 |    0.1833 |             920.0945 |     7.6615 |     0.7636 | 




   22 | 00m21s |    0.23770 |    0.9506 |             251.7709 |   139.8792 |     0.0859 | 




   23 | 00m40s |    0.23770 |    0.1681 |             712.0102 |   991.9642 |     0.2549 | 




   24 | 01m31s |    0.23770 |    0.1502 |            1984.9260 |   695.8043 |     0.3932 | 




   25 | 00m36s |    0.23770 |    0.6451 |             502.0858 |   329.6018 |     0.4787 | 




   26 | 00m39s |    0.23770 |    0.6072 |             934.3617 |   735.3286 |     0.7345 | 




   27 | 01m16s |    0.23770 |    0.9751 |            1127.5203 |   256.5993 |     0.5351 | 




   28 | 00m16s |    0.23770 |    0.0787 |             180.2970 |    50.2458 |     0.6834 | 




   29 | 00m59s |    0.23770 |    0.8436 |            1194.9331 |   431.5333 |     0.0181 | 




   30 | 01m20s |    0.23770 |    0.7209 |            1864.1997 |    82.5007 |     0.4808 | 




   31 | 00m47s |    0.23770 |    0.4402 |             774.5422 |   771.6745 |     0.8210 | 




   32 | 00m11s |    0.23770 |    0.3048 |              72.0848 |   373.1665 |     0.0597 | 




   33 | 00m13s |    0.23770 |    0.3147 |              50.8687 |   761.2084 |     0.3827 | 




   34 | 00m13s |    0.23770 |    0.7960 |              37.1463 |   136.3000 |     0.8085 | 




   35 | 00m39s |    0.23770 |    0.4523 |            1171.8966 |     9.5495 |     0.7053 | 


In [59]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results
XGBOOST: 0.237696
Best Params: {'alpha': 0.83300613352786079, 'hidden_layer_sizes': 447.39261592521166, 'max_iter': 564.84132865762592, 'momentum': 0.95519854732881171}


Interestingly Bayesian Optimization doesn't seem to help the neural network for several reasons, one of which is that when predicting the level of the max iterations it can cause non-convergence if it sets it too low. It seems that, for neural networks at least, it helps much more to actually know what you're doing when setting the hyperparameters. Bayesian Optimization can theoretically help with setting the proper hidden layer size, which I will test below.

In [60]:
def nneval(hidden_layer_sizes):
    
    params['hidden_layer_sizes'] = int(hidden_layer_sizes)
    wclf = neural_network.MLPClassifier(**params)
    
    predictions = cross_val_predict(wclf, features, labels)
    
    return metrics.accuracy_score(labels, predicted)

In [61]:
num_iter = 10
init_points = 5
params = {}

xgbBO = BayesianOptimization(nneval, {'hidden_layer_sizes': (1, 2000)})

xgbBO.maximize(init_points=init_points, n_iter=num_iter)

[31mInitialization[0m
[94m----------------------------------------------------[0m
 Step |   Time |      Value |   hidden_layer_sizes | 
    1 | 00m24s | [35m   0.23770[0m | [32m            304.9642[0m | 
    2 | 02m21s |    0.23770 |            1866.4436 | 
    3 | 03m19s |    0.23770 |            1746.9608 | 
    4 | 00m16s |    0.23770 |             175.0697 | 
    5 | 01m22s |    0.23770 |            1417.1520 | 




    6 | 01m08s |    0.23770 |            1362.9882 | 




    7 | 00m56s |    0.23770 |            1735.9870 | 


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-61-dd7f2def7b04>", line 7, in <module>
    xgbBO.maximize(init_points=init_points, n_iter=num_iter)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bayes_opt/bayesian_optimization.py", line 249, in maximize
    self.init(init_points)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bayes_opt/bayesian_optimization.py", line 104, in init
    y_init.append(self.f(**dict(zip(self.keys, x))))
  File "<ipython-input-60-c1bc828c1717>", line 6, in nneval
    predictions = cross_val_predict(wclf, features, labels)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/model_selection/_validation.py", line 401, in cross_val_predict


KeyboardInterrupt: 

In [62]:
print('Final Results')
print('XGBOOST: %f' % xgbBO.res['max']['max_val'])
print('Best Params: {}'.format(xgbBO.res['max']['max_params']))

Final Results


TypeError: a float is required

It actually looks like the hidden layer size has extremely little affect on the overall performance of the model.

# Results

SVC | SVC w/ BO | XGB | XGB w/ BO | NB | RF | RF w/ BO | MLP | MLP w/ BO 
--- | --- | --- | --- | --- | --- | --- | --- | ---
0.245  |    |  0.304  |  0.311  |  0.208  |  0.237  | 0.269  | 0.237  |  0.237

## Conclusion

As we can see from the above results, XGB has again proved to be the highest performing model. Beyond that, we have seen that Bayesian Optimization can improve the performance of all decision tree based algorithms, but beyond that does not seem to be effective for the MLP. In the next experiment, we will take these results and see how feature normalization/dimensionality reduction will affect our accuracy.