# Model Selection - 1
- **k-fold cross-validation** using ```scikit-learn wrapper```
- **grid search** using ```scikit-learn wrapper```
- **random search** using ```scikit-learn wrapper```

## **k-fold cross validation**
- One of the most common methods for model validation and selection
    - Partition the training dataset into ```k``` subsets (folds); in each round, hold out one fold as the **validation set** and use the remaining ```k - 1``` folds as the **training set**
    - Train the model on the **training set** and evaluate it on the **validation set**
    - Repeat for ```k``` rounds so that each fold serves as the validation set exactly once, then average the ```k``` validation results
    - Compare the averaged results across candidate models or hyperparameter settings
    
<img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg" style="width: 500px"/>
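The partitioning steps listed above can be sketched in plain Python. This is a minimal illustration of the idea, not scikit-learn's implementation; the helper name `k_fold_indices` is hypothetical, and the folds are contiguous with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (no shuffling)."""
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Toy run: 10 samples, 5 folds; each sample is used for validation exactly once.
folds = list(k_fold_indices(10, 5))
for round_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print('round', round_no, 'validate on', val_idx, 'train on', len(train_idx), 'samples')
```

In a real run, a model would be trained on `train_idx` and scored on `val_idx` in every round, and the `k` scores would then be averaged.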

## **Grid search & Random search**
- **Grid search** and **random search** are two methods for **hyperparameter tuning**
    - **Grid search** exhaustively tries every combination of the specified hyperparameter values
    - **Random search** runs a fixed number of trials, each using hyperparameter values sampled from the specified distributions (or lists of candidate values)
    
<img src="https://cdn-images-1.medium.com/max/1600/1*ZTlQm_WRcrNqL-nLnx6GJA.png" style="width: 500px"/>
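The difference between the two strategies can be sketched with the standard library alone. The search space below (`units`, `dropout`) is hypothetical and chosen only for illustration; it is not the space tuned later in this notebook.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
grid = {'units': [50, 100], 'dropout': [0.1, 0.3, 0.5]}

# Grid search: enumerate the Cartesian product of all listed values.
names = sorted(grid)
grid_trials = [dict(zip(names, combo))
               for combo in itertools.product(*(grid[name] for name in names))]

# Random search: draw a fixed number of trials from per-parameter distributions.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_trials = 4
random_trials = [{'units': rng.randint(50, 100),     # integer in [50, 100], both ends inclusive
                  'dropout': rng.uniform(0.1, 0.5)}  # float between 0.1 and 0.5
                 for _ in range(n_trials)]

print(len(grid_trials), 'grid trials;', len(random_trials), 'random trials')
```

The number of grid trials grows multiplicatively with every added hyperparameter, while the number of random trials stays at whatever budget is chosen.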

### Load dataset
- ```imdb``` dataset

In [5]:
from keras.datasets import imdb
from keras.preprocessing import sequence

In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 5000)

In [11]:
# print the maximum & minimum review lengths (number of word indices per review)
print(len(max(X_train, key = len)))
print(len(max(X_test, key = len)))
print(len(min(X_train, key = len)))
print(len(min(X_test, key = len)))

2494
2315
11
7


In [12]:
X_train = sequence.pad_sequences(X_train, maxlen = 500)
X_test = sequence.pad_sequences(X_test, maxlen = 500)

In [13]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(25000, 500)
(25000, 500)
(25000,)
(25000,)
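Since the raw reviews range from 7 to 2494 tokens, `pad_sequences` has to both truncate and pad to reach a fixed length of 500. The sketch below imitates that behaviour for a single sequence, assuming the Keras defaults (`padding='pre'`, `truncating='pre'`, `value=0`); the helper `pad_pre` is hypothetical and only illustrates the idea.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of seq, left-padding with `value` if it is shorter."""
    truncated = seq[-maxlen:]  # drop items from the front if the sequence is too long
    return [value] * (maxlen - len(truncated)) + truncated


short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

With these defaults, long reviews keep their final tokens and short reviews receive leading zeros.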


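- Models built with the ```Sequential``` API can be wrapped with the Keras ```scikit-learn wrapper```, which lets scikit-learn tools such as ```cross_val_score``` treat them as estimators
- Define a function that builds and compiles the model, and pass it to ```KerasClassifier```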

In [21]:
from keras.models import Sequential
from keras.layers import Embedding, CuDNNLSTM, CuDNNGRU, Dense, Activation, Dropout
from keras import optimizers
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_val_score

In [34]:
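def imdb_model(embed_dim = 100, lstm = True, lr = 0.01):
    # build and compile a binary sentiment classifier; the arguments are the tunable hyperparameters
    model = Sequential()
    # vocabulary size 5000 matches num_words used when loading the dataset
    model.add(Embedding(5000, embed_dim))
    # recurrent layer: CuDNN-backed LSTM or GRU (both require a GPU)
    if lstm:
        model.add(CuDNNLSTM(100))
    else:
        model.add(CuDNNGRU(100))

    # two fully connected blocks with dropout for regularization
    model.add(Dense(100))
    model.add(Activation('relu'))
    model.add(Dropout(0.3))
    model.add(Dense(100))
    model.add(Activation('relu'))
    model.add(Dropout(0.3))
    # single sigmoid unit outputs the probability of the positive class
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    adam = optimizers.Adam(lr = lr)
    model.compile(loss = 'binary_crossentropy', optimizer = adam, metrics = ['accuracy'])
    return model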

In [29]:
model = KerasClassifier(build_fn = imdb_model, epochs = 10, batch_size = 128, verbose = 1)

### ```k-fold cross-validation``` using ```scikit-learn wrapper```

In [None]:
# set k = 5
results = cross_val_score(model, X_train, y_train, cv = 5)

In [33]:
# print out results
# in most cases, average accuracy and standard deviation are meaningful metrics
print('Cross validation Results: ')
for i in range(len(results)):
    print('{}th round accuracy: {}'.format(i+1, results[i]))
print('Average accuracy: ', results.mean())
print('Standard deviation: ', results.std())

Cross validation Results: 
1th round accuracy: 0.8576
2th round accuracy: 0.8768
3th round accuracy: 0.8642
4th round accuracy: 0.8698
5th round accuracy: 0.8654
Average accuracy:  0.86676
Standard deviation:  0.00636163500996
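The summary statistics can be re-derived from the five fold accuracies printed above. The sketch below uses only the standard library and shows that the reported value corresponds to the population standard deviation (dividing by `k`), which is what NumPy's `std()` computes by default.

```python
import statistics

# Fold accuracies copied from the printed output above.
fold_acc = [0.8576, 0.8768, 0.8642, 0.8698, 0.8654]

mean_acc = statistics.mean(fold_acc)
pop_std = statistics.pstdev(fold_acc)    # divides by k, like NumPy's default std (ddof=0)
sample_std = statistics.stdev(fold_acc)  # divides by k - 1, so it is slightly larger

print('mean:', round(mean_acc, 5))
print('population std:', round(pop_std, 6))
print('sample std:', round(sample_std, 6))
```

With only five folds the two standard deviations differ noticeably, so it helps to state which one is being reported.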


### ```grid search``` using ```scikit-learn wrapper```

In [39]:
from sklearn.model_selection import GridSearchCV
import numpy as np  # needed below for np.argmax

In [35]:
# first define hyperparameter grid
embed_dim = [100, 300]
lstm = [True, False]
lr = [0.001, 0.01]
batch_size = [64, 128, 256]

In [38]:
hyperparam_grid = {'embed_dim': embed_dim, 'lstm': lstm, 'lr': lr, 'batch_size': batch_size}

In [40]:
model = KerasClassifier(build_fn = imdb_model, epochs = 10, verbose = 1)
clf = GridSearchCV(estimator = model, param_grid = hyperparam_grid)
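Before launching the search, it is worth estimating how many models will be trained. The sketch below counts the candidates in the grid defined above. The value of `n_splits` is an assumption, because the default number of folds used by `GridSearchCV` depends on the scikit-learn version; the extra fit assumes the default `refit=True`.

```python
# Candidate values copied from the grid defined above.
hyperparam_grid = {'embed_dim': [100, 300], 'lstm': [True, False],
                   'lr': [0.001, 0.01], 'batch_size': [64, 128, 256]}

n_candidates = 1
for values in hyperparam_grid.values():
    n_candidates *= len(values)

n_splits = 5  # assumption: number of CV folds; the default depends on the scikit-learn version
epochs = 10   # as passed to KerasClassifier above

n_fits = n_candidates * n_splits + 1  # +1 for the final refit on all data (refit=True)
print(n_candidates, 'candidates,', n_fits, 'model fits,', n_fits * epochs, 'epochs in total')
```

Every extra value added to any one hyperparameter multiplies this cost, which is the main motivation for switching to a fixed-budget random search.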

In [None]:
grid_result = clf.fit(X_train, y_train)

In [46]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
params = clf.cv_results_['params']

In [47]:
# displaying best results & parameter settings
# ('test' here refers to the held-out folds of cross-validation, not to X_test)
max_idx = np.argmax(means)
print('Best test accuracy: ', means[max_idx])
print('Standard Deviation of Accuracies: ', stds[max_idx])
print('Parameter Setting: ', params[max_idx])

Best test accuracy:  0.867440000012
Standard Deviation of Accuracies:  0.00621257665123
Parameter Setting:  {'batch_size': 64, 'embed_dim': 300, 'lr': 0.001, 'lstm': False}


### ```random search``` using ```scikit-learn wrapper```

In [48]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import numpy as np

In [49]:
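# scipy.stats distributions are sampled once per trial
embed_dim = randint(100, 300)    # integers in [100, 300); the upper bound is exclusive
lstm = [True, False]             # a list is sampled uniformly
lr = uniform(0.001, 0.1)         # uniform(loc, scale) covers [loc, loc + scale] = [0.001, 0.101]
batch_size = randint(64, 256)    # integers in [64, 256)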

In [50]:
hyperparam_dist = {'embed_dim': embed_dim, 'lstm': lstm, 'lr': lr, 'batch_size': batch_size}

In [51]:
model = KerasClassifier(build_fn = imdb_model, epochs = 10, verbose = 1)
clf = RandomizedSearchCV(estimator = model, param_distributions = hyperparam_dist, n_iter = 20)

In [None]:
random_result = clf.fit(X_train, y_train)

In [54]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
params = clf.cv_results_['params']

In [55]:
# displaying best results & parameter settings
# ('test' here refers to the held-out folds of cross-validation, not to X_test)
max_idx = np.argmax(means)
print('Best test accuracy: ', means[max_idx])
print('Standard Deviation of Accuracies: ', stds[max_idx])
print('Parameter Setting: ', params[max_idx])

Best test accuracy:  0.861520008204
Standard Deviation of Accuracies:  0.00517948234862
Parameter Setting:  {'batch_size': 153, 'embed_dim': 121, 'lr': 0.0052380206595373227, 'lstm': True}
