# Convolutional Neural Networks

This notebook introduces convolutional neural networks (CNNs), a more powerful classification model than the Neural Bag-of-Words (BOW) model you explored earlier.

## Outline

- **Part (a):** Model Architecture
- **Part (b):** Implementing the CNN Model
- **Part (c):** Tuning

In [3]:
from __future__ import division
import os, sys, re, json, time, datetime, shutil
import itertools, collections
from importlib import reload
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz, treeviz
from w266_common import patched_numpy_io
# Code for this assignment
import sst

# Monkey-patch NLTK with better Tree display that works on Cloud or other display-less server.
print("Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.")
treeviz.monkey_patch(nltk.tree.Tree, node_style_fn=sst.sst_node_style, format='svg')

Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.


## (a) Model Architecture

CNNs are a more sophisticated neural model for sentence classification than the Neural BOW model we saw in the last section. CNNs operate by sweeping a collection of filters over a text. Each filter produces a sequence of feature values known as a _feature map_. In one of the most basic formulations, introduced by [Kim (2014)](https://www.aclweb.org/anthology/D14-1181), a single layer of _pooling_ is used to summarize the _feature maps_ as a fixed-length vector. The fixed-length vector is then fed to the output layer in order to produce classification labels. A popular choice for the pooling operation is to take the maximum feature value from each _feature map_.

![Convolutional Neural Network from Kim 2014](kim_2014_figure_1_cnn.png)
*CNN model architecture, Figure 1 from Kim (2014)*

We'll use the following notation:
- $w^{(i)} \in \mathbb{Z}$, the word id for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)} \in \mathbb{R}^d$ for the vector representation (embedding) of $w^{(i)}$
- $x^{i:i+j}$ is the concatenation of $x^{(i)}, x^{(i+1)}, \ldots, x^{(i+j)}$
- $c^{(i)}_{k}$ is the value of the $k^{th}$ feature map at position $i$ along the word sequence; each filter applies over a window of $h$ words and uses non-linearity $f$.
- $\hat{c}_{k}$ is the value of the $k^{th}$ feature after pooling the feature map over the whole sequence.
- $\hat{C}$ is the concatenation of pooled feature maps. 
- $y$ for the target label ($\in 1,\ldots,\mathtt{num\_classes}$)

Our model is defined as:
- **Embedding layer:** $x^{(i)} = W_{embed}[w^{(i)}]$
- **Convolutional layer:** $c^{(i)}_{k} = f(x^{i:i+h-1} W_k + b)$
- **Pooling layer:** $\hat{c}_{k} = \max(c^{(0)}_{k}, c^{(1)}_{k}, \ldots)$
- **Output layer:** $\hat{y} = \hat{P}(y) = \mathrm{softmax}(\hat{C} W_{out} + b_{out})$


We'll refer to the first part of this model (**Embedding layer**, **Convolutional layer**, and **Pooling layer**) as the **Encoder**: it has the role of encoding the input sequence into a fixed-length vector representation that we pass to the output layer.
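
To make the layer definitions above concrete, here is a minimal NumPy sketch of the forward pass for a single sequence. It is not the Keras implementation used later in this notebook: the function name `cnn_forward`, the `(h, d, filters)` weight layout, the per-filter bias, the toy sizes, and the random weights are all assumptions chosen for illustration, with ReLU standing in for $f$ and no padding.

```python
import numpy as np

def cnn_forward(word_ids, W_embed, conv_weights, conv_biases, W_out, b_out):
    """Embed -> convolve + ReLU -> max-pool over positions -> softmax, for one sequence."""
    x = W_embed[word_ids]                        # (N, d) embedding lookup
    pooled = []
    for W_k, b_k in zip(conv_weights, conv_biases):
        h, d, n_filters = W_k.shape              # window size, embedding dim, filter count
        n_windows = x.shape[0] - h + 1           # no padding, stride 1
        # Flatten each window x^{i:i+h-1} so one matrix product applies every filter.
        windows = np.stack([x[i:i + h].reshape(-1) for i in range(n_windows)])
        c = np.maximum(0.0, windows @ W_k.reshape(h * d, n_filters) + b_k)
        pooled.append(c.max(axis=0))             # max over positions -> (n_filters,)
    C_hat = np.concatenate(pooled)               # fixed-length encoding
    logits = C_hat @ W_out + b_out
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return C_hat, exp / exp.sum()

# Toy sizes and random weights (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
V, N, d, num_classes = 50, 10, 4, 2
kernel_sizes, filters = [3, 4, 5], 6
W_embed = rng.normal(size=(V, d))
conv_weights = [rng.normal(size=(h, d, filters)) for h in kernel_sizes]
conv_biases = [np.zeros(filters) for _ in kernel_sizes]
W_out = rng.normal(size=(filters * len(kernel_sizes), num_classes))
b_out = np.zeros(num_classes)
word_ids = rng.integers(0, V, size=N)

C_hat, y_hat = cnn_forward(word_ids, W_embed, conv_weights, conv_biases, W_out, b_out)
print(C_hat.shape, y_hat.shape)
```

Whatever the input length, `C_hat` has one entry per filter, which is what lets the output layer use a fixed-size weight matrix.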

We'll also use these as shorthand for important dimensions:
- `V`: the vocabulary size (equal to `ds.vocab.size`)
- `N`: the maximum number of tokens in the input text
- `embed_dim`: the embedding dimension $d$
- `kernel_size`: a list of filter lengths
- `filters`: the number of filters per filter length
- `num_classes`: the number of target classes (2 for the binary task)
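
These dimensions determine how many trainable parameters each layer has. The helper functions below are a rough sketch (their names are made up for this example), assuming each convolutional filter and each output unit carries one bias term. The example values are arbitrary and only illustrate the arithmetic.

```python
def embedding_params(V, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return V * embed_dim

def conv1d_params(kernel_size, embed_dim, filters):
    # Each filter spans kernel_size words of embed_dim values, plus one bias.
    return kernel_size * embed_dim * filters + filters

def dense_params(in_dim, out_dim):
    # Weight matrix plus one bias per output unit.
    return in_dim * out_dim + out_dim

embed_dim, filters, num_classes = 10, 128, 7
kernel_sizes = [3, 4, 5]

for k in kernel_sizes:
    print("kernel_size={}: {} weights per filter, {} parameters in the layer".format(
        k, k * embed_dim, conv1d_params(k, embed_dim, filters)))

pooled_dim = filters * len(kernel_sizes)
print("output layer:", dense_params(pooled_dim, num_classes))
```

Note that the weight matrix alone (without biases) has `in_dim * out_dim` entries, so whether a count includes biases depends on how the question is phrased.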

## (a) Short Answer Questions

When answering these questions in the answers file, assume
`embed_dim = 10`, `kernel_size = [3, 4, 5]`, `filters = 128`, `N = 10`, and `num_classes = 7`.

1. In terms of these values, the vocabulary size `V` and the maximum sequence length `N`, what are the
   shapes of the following variables:
   $c^{(i)}_{kernel\_size=3}$, $c^{(i)}_{kernel\_size=4}$, $c^{(i)}_{kernel\_size=5}$, $\hat{c}^{(i)}_{kernel\_size=3}$, $\hat{c}^{(i)}_{kernel\_size=4}$, $\hat{c}^{(i)}_{kernel\_size=5}$, and $\hat{C}$? Assume a stride of 1 and no padding (e.g., for tf.nn.max_pool and tf.nn.conv1d, setting padding='VALID').
<p>
2. What are the shapes of $c^{(i)}_{kernel\_size=3}$ and $\hat{c}^{(i)}_{kernel\_size=3}$ when padding is used
      (e.g., for tf.nn.max_pool and tf.nn.conv1d, setting padding='SAME')?
<p>
3. How many parameters are in each of the convolutional filters, $W_{filter\_length=3}$, $W_{filter\_length=4}$, $W_{filter\_length=5}$? And the output layer, $W_{out}$?
<p>
<p>
4. Historically NLP models made heavy use of manual feature engineering. In relation to systems with manually engineered features, describe what type of operation is being performed by the convolutional filters.
<p>
5. Suppose that we have two examples, `[foo bar baz]` and `[baz bar foo]`. Will this model make the same predictions on these? Why or why not?
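
For the shape questions, the length of a feature map follows from the window arithmetic alone. The helper below is a small sketch of that arithmetic (the function name is made up for this example), assuming the usual conventions that 'valid' uses only windows that fit inside the sequence and 'same' pads so that a stride-1 output has the input's length.

```python
import math

def conv_output_length(n, kernel_size, stride=1, padding="valid"):
    """Number of positions in a 1D feature map for an input of length n."""
    padding = padding.lower()
    if padding == "valid":
        # Only windows that fit entirely inside the sequence are used.
        return (n - kernel_size) // stride + 1
    if padding == "same":
        # The input is padded so that, at stride 1, the output length equals n.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

N, filters = 10, 128
kernel_sizes = [3, 4, 5]
for k in kernel_sizes:
    valid_len = conv_output_length(N, k, padding="valid")
    same_len = conv_output_length(N, k, padding="same")
    print("kernel_size={}: valid -> ({}, {}), same -> ({}, {})".format(
        k, valid_len, filters, same_len, filters))

# Max pooling over positions leaves one value per filter, and concatenating
# across kernel sizes gives the length of C_hat.
pooled_dim = filters * len(kernel_sizes)
print("C_hat length:", pooled_dim)
```

Padding changes only the number of positions in each feature map; the pooled vector has one value per filter either way.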

In [21]:
# Question 1.1 (/2): What is the dimension of ck3?
convolutional_neural_networks_a_1_1: [8, 128]

# Question 1.2 (/2): What is the dimension of ck4?
convolutional_neural_networks_a_1_2: [7, 128]

# Question 1.3 (/2): What is the dimension of ck5?
convolutional_neural_networks_a_1_3: [6, 128]

# Question 1.4 (/2): What is the dimension of chatk3?
convolutional_neural_networks_a_1_4: [128, 1]

# Question 1.5 (/3): What is the dimension of chatk4?
convolutional_neural_networks_a_1_5: [128, 1]

# Question 1.6 (/3): What is the dimension of chatk5?
convolutional_neural_networks_a_1_6: [128, 1]

# Question 1.7 (/3): What is the dimension of Chat?
convolutional_neural_networks_a_1_7: [384, 1]

# Question 2.1 (/3): What is the dimension of ck3?
convolutional_neural_networks_a_2_1: [10, 128]

# Question 2.2 (/3): What is the dimension of chatk3?
convolutional_neural_networks_a_2_2: [128, 1]

# Question 3.1 (/3): How many parameters are there in Wfilter=3?
convolutional_neural_networks_a_3_1: 30

# Question 3.2 (/3): How many parameters are there in Wfilter=4?
convolutional_neural_networks_a_3_2: 40

# Question 3.3 (/3): How many parameters are there in Wfilter=5?
convolutional_neural_networks_a_3_3: 50

# Question 3.4 (/3): How many parameters are there in Wout?
convolutional_neural_networks_a_3_4: 2688

# Question 4 (/1): Compare kernels to feature engineering.
# This question is a candidate for discussion in live session.
convolutional_neural_networks_a_4: your answer

# Question 5.1 (/2): Would the two predictions be the same?
convolutional_neural_networks_a_5_1: False

# Question 5.2 (/0): Why or why not?
convolutional_neural_networks_a_5_2: These words are close enough in proximity that the filters will see the change in word order. If 'bar' negates, then the sentiment could be reversed.

SyntaxError: invalid syntax (<ipython-input-21-63ce3f091862>, line 42)
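
Question 5 can also be checked numerically. The sketch below uses hand-picked toy embeddings and a single hand-picked filter of width 2 (all hypothetical values, not learned ones) to compare a bag-of-words sum, which ignores order, with a max-pooled convolutional feature, which generally does not.

```python
import numpy as np

# Toy 2-dimensional embeddings (hypothetical values chosen for illustration).
emb = {
    "foo": np.array([1.0, 0.0]),
    "bar": np.array([0.0, 1.0]),
    "baz": np.array([-1.0, 1.0]),
}
# One filter covering a window of h=2 words with d=2 values each.
w = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def bow_feature(tokens):
    # Summing embeddings discards word order.
    return sum(emb[t] for t in tokens)

def conv_max_feature(tokens, w):
    # Slide the filter over the sequence, apply ReLU, then max-pool over positions.
    x = np.stack([emb[t] for t in tokens])
    h = w.shape[0]
    feature_map = [max(0.0, float(np.sum(x[i:i + h] * w)))
                   for i in range(len(tokens) - h + 1)]
    return max(feature_map)

a = ["foo", "bar", "baz"]
b = ["baz", "bar", "foo"]
print(np.array_equal(bow_feature(a), bow_feature(b)))    # True
print(conv_max_feature(a, w), conv_max_feature(b, w))    # 2.0 0.0
```

This only shows that the features can differ. For particular weights (or for filters of width 1) the two orderings could still receive the same prediction.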

In [26]:
# Specify model hyperparameters.
epochs = 1
embed_dim = 10
num_classes = 7
num_filters = [128, 128, 128]
kernel_sizes = [3, 4, 5]
dense_layer_dims = []
dropout_rate = 0.9
max_input_length=10


wordids = keras.layers.Input(shape=(max_input_length,))

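# The first Embedding argument is the vocabulary size. max_input_length is reused here
# only as a stand-in, since this cell is just for inspecting layer shapes.
h = keras.layers.Embedding(max_input_length, embed_dim, input_length=max_input_length)(wordids)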

conv_layers_for_all_kernel_sizes = []
for filters, kernel_size in zip(num_filters, kernel_sizes):
    conv_layer = keras.layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu', padding='VALID')(h)
    conv_layer = keras.layers.GlobalMaxPooling1D()(conv_layer)
    conv_layers_for_all_kernel_sizes.append(conv_layer)

h = keras.layers.concatenate(conv_layers_for_all_kernel_sizes, axis=1)

h = keras.layers.Dropout(rate=dropout_rate)(h)

prediction = keras.layers.Dense(num_classes, activation='softmax')(h)
print(prediction.shape)

model = keras.Model(inputs=wordids, outputs=prediction)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # From information theory notebooks.
              metrics=['accuracy'])        # What metric to output as we train.

print(model.summary())

(?, 7)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_23 (InputLayer)           (None, 10)           0                                            
__________________________________________________________________________________________________
embedding_22 (Embedding)        (None, 10, 10)       100         input_23[0][0]                   
__________________________________________________________________________________________________
conv1d_57 (Conv1D)              (None, 8, 128)       3968        embedding_22[0][0]               
__________________________________________________________________________________________________
conv1d_58 (Conv1D)              (None, 7, 128)       5248        embedding_22[0][0]               
__________________________________________________________________________________________________
con

## (b) Implementing the CNN Model

We'll implement our CNN model below. Our implementation will differ from [Kim (2014)](https://www.aclweb.org/anthology/D14-1181) in that we will support using multiple dense hidden layers after the convolutional layers.

**Before you start**, be sure to answer the short-answer questions above!

In [1]:
import sst

# Load SST dataset
ds = sst.SSTDataset(V=20000).process(label_scheme="binary")
max_len = 40
train_x, train_ns, train_y = ds.as_padded_array('train', max_len=max_len, root_only=True)
dev_x,   dev_ns,   dev_y   = ds.as_padded_array('dev',   max_len=max_len, root_only=True)
test_x,  test_ns,  test_y  = ds.as_padded_array('test',  max_len=max_len, root_only=True)


Loading SST from data/sst/trainDevTestTrees_PTB.zip
Training set:     8,544 trees
Development set:  1,101 trees
Test set:         2,210 trees
Building vocabulary - 16,474 words
Processing to phrases...  Done!
Splits: train / dev / test : 98,794 / 13,142 / 26,052


In [36]:
# Specify model hyperparameters.
epochs = 10
embed_dim = 5
num_filters = [2, 2, 2]
kernel_sizes = [2, 3, 4]
dense_layer_dims = []
dropout_rate = 0.8
num_classes = len(ds.target_names)

# Construct the convolutional neural network.
# The form of each keras layer function is as follows:
#    result = keras.layers.LayerType(arguments for the layer)(layer(s) it should use as input)
# concretely,
#    this_layer_output = keras.layers.Dense(100, activation='relu')(prev_layer_vector)
# performs this_layer_output = relu(prev_layer_vector x W + b) where W has 100 columns.

# Input is a special "layer".  It defines a placeholder that will be overwritten by the training data.
# In our case, we are accepting a list of wordids (padded out to max_len).
wordids = keras.layers.Input(shape=(max_len,))

# Embed the wordids.
# Recall that an embedding lookup is mathematically equivalent to a linear layer applied to a one-hot
# encoding of each word id.
h = keras.layers.Embedding(ds.vocab.size, embed_dim, input_length=max_len)(wordids)
print("First h: {}".format(h.shape))

# Construct `filters` randomly initialized filters of width `kernel_size` for each filter size we want.
# With the hyperparameters above, we construct 2 filters for each of the sizes 2, 3, and 4.  As in the image above,
# each filter is wide enough to span the whole word embedding (this is why the convolution is "1d" as seen in the
# function name below).
conv_layers_for_all_kernel_sizes = []
for filters, kernel_size in zip(num_filters, kernel_sizes):
    conv_layer = keras.layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu')(h)
    conv_layer = keras.layers.GlobalMaxPooling1D()(conv_layer)
    conv_layers_for_all_kernel_sizes.append(conv_layer)

# Concat the feature maps from each different size.
h = keras.layers.concatenate(conv_layers_for_all_kernel_sizes, axis=1)
print("Second h: {}".format(h.shape))

# Dropout can help with overfitting (improve generalization) by randomly 0-ing different subsets of values
# in the vector.
# See https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf for details.
h = keras.layers.Dropout(rate=dropout_rate)(h)
print("Third h: {}".format(h.shape))

### YOUR CODE HERE
# Add a fully connected layer for each dense layer dimension in dense_layer_dims.

print(type(h))

### END YOUR CODE

prediction = keras.layers.Dense(num_classes, activation='softmax')(h)

model = keras.Model(inputs=wordids, outputs=prediction)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # From information theory notebooks.
              metrics=['accuracy'])        # What metric to output as we train.
print(model.summary())

First h: (?, 40, 5)
Second h: (?, 6)
Third h: (?, 6)
<class 'tensorflow.python.framework.ops.Tensor'>
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_26 (InputLayer)           (None, 40)           0                                            
__________________________________________________________________________________________________
embedding_25 (Embedding)        (None, 40, 5)        82370       input_26[0][0]                   
__________________________________________________________________________________________________
conv1d_66 (Conv1D)              (None, 39, 2)        22          embedding_25[0][0]               
__________________________________________________________________________________________________
conv1d_67 (Conv1D)              (None, 38, 2)        32          embedding_25[0][0]               
_______

In [29]:
model.reset_states()
model.fit(train_x, train_y, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd91f28ebe0>

## Evaluation

Call [evaluate](https://keras.io/models/model/#evaluate) on your model.

In [37]:
#### YOUR CODE HERE ####
model.evaluate(dev_x, dev_y)

#### END(YOUR CODE) ####



[0.6929376087057482, 0.5091743]
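
Keras's `evaluate` returns the loss followed by each compiled metric, so the pair above is `[loss, accuracy]`. As a sanity check on how to read such numbers, the sketch below (plain Python, with made-up labels and helper names that mirror the loss name used in `compile`) computes both quantities for a binary classifier that assigns probability 0.5 to each class.

```python
import math

def sparse_categorical_crossentropy(probs, labels):
    # Mean negative log-probability assigned to the true class.
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    # Fraction of examples whose highest-probability class is the true class.
    correct = sum(1 for p, y in zip(probs, labels)
                  if max(range(len(p)), key=p.__getitem__) == y)
    return correct / len(labels)

# Hypothetical labels and a classifier that is maximally unsure on every example.
labels = [0, 1, 1, 0]
uniform = [[0.5, 0.5] for _ in labels]

loss = sparse_categorical_crossentropy(uniform, labels)
acc = accuracy(uniform, labels)
print(loss, acc)  # loss is ln 2 (about 0.693), accuracy is 0.5
```

A loss near 0.693 with accuracy near 0.5 is therefore what chance-level binary predictions look like, which is consistent with evaluating a model whose weights have not been trained.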

## (c) Tuning Your Model

We'll once again want to optimize hyperparameters for our model to see if we can improve performance. The CNN model includes a number of new parameters that can significantly influence model performance.

In this section, you will be asked to describe the new parameters as well as use them to attempt to improve the performance of your model.

## (c) Short Answer Questions

  1. Choose two parameters unique to the CNN model, perform at least 10 runs with different combinations of values for these parameters, and then report the dev set results below. ***Hint: Consider wrapping the training code above in a for loop that examines the different values.***  To do this efficiently, you should consider [this paper](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) from Bergstra and Bengio.  [This blog post](https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/) also has a less formal treatment of the same topic.
  2. Describe any trends you see in experiments above (e.g., can you identify good ranges for the individual parameters; are there any interesting interactions?)
  3. Pick the three best configurations according to the dev set and evaluate them on the test data. Is the ranking of the three best models the same on the dev and test sets?
  4. What was the best accuracy you achieved on the test set?
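
One practical detail when running a small random search is avoiding repeated configurations, since each repeat spends a training run without exploring anything new. The sketch below is one possible way to draw distinct configurations from a finite set of candidate values. It is a simplification of random search (which more generally samples from distributions), and the function name and parameter values are hypothetical.

```python
import itertools
import random

def sample_configs(param_grid, n_iter, seed=0):
    """Draw up to n_iter distinct configurations from a finite grid."""
    keys = sorted(param_grid)
    all_configs = [dict(zip(keys, values))
                   for values in itertools.product(*(param_grid[k] for k in keys))]
    rng = random.Random(seed)  # seeded so the same search can be repeated
    return rng.sample(all_configs, min(n_iter, len(all_configs)))

# Hypothetical candidate values for two CNN-specific parameters plus dropout.
param_grid = {
    "dropout_rate": [0.5, 0.8, 0.9],
    "kernel_sizes": [[2, 3, 4], [3, 4, 5], [2, 4, 6]],
    "filters": [32, 64],
}

configs = sample_configs(param_grid, n_iter=10)
for config in configs:
    print(config)
```

Each sampled dictionary could then be passed to a model-building function, with the dev set score recorded alongside it.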

In [65]:
from sklearn.model_selection import RandomizedSearchCV
import random

# Specify model hyperparameters.

def create_model(
    epochs = 10,
    embed_dim = 5,
    num_filters = [2, 2, 2],
    kernel_sizes = [2, 3, 4],
    dense_layer_dims = [],
    dropout_rate = 0.8
):
    num_classes = len(ds.target_names)
    # Construct the convolutional neural network.
    # The form of each keras layer function is as follows:
    #    result = keras.layers.LayerType(arguments for the layer)(layer(s) it should use as input)
    # concretely,
    #    this_layer_output = keras.layers.Dense(100, activation='relu')(prev_layer_vector)
    # performs this_layer_output = relu(prev_layer_vector x W + b) where W has 100 columns.

    # Input is a special "layer".  It defines a placeholder that will be overwritten by the training data.
    # In our case, we are accepting a list of wordids (padded out to max_len).
    wordids = keras.layers.Input(shape=(max_len,))

    # Embed the wordids.
    # Recall that an embedding lookup is mathematically equivalent to a linear layer applied to a one-hot
    # encoding of each word id.
    h = keras.layers.Embedding(ds.vocab.size, embed_dim, input_length=max_len)(wordids)

    # Construct `filters` randomly initialized filters of width `kernel_size` for each filter size we want.
    # With the default hyperparameters, we construct 2 filters for each of the sizes 2, 3, and 4.  As in the image above,
    # each filter is wide enough to span the whole word embedding (this is why the convolution is "1d" as seen in the
    # function name below).
    conv_layers_for_all_kernel_sizes = []
    for filters, kernel_size in zip(num_filters, kernel_sizes):
        conv_layer = keras.layers.Conv1D(filters=filters, kernel_size=kernel_size, activation='relu')(h)
        conv_layer = keras.layers.GlobalMaxPooling1D()(conv_layer)
        conv_layers_for_all_kernel_sizes.append(conv_layer)

    # Concat the feature maps from each different size.
    h = keras.layers.concatenate(conv_layers_for_all_kernel_sizes, axis=1)

    # Dropout can help with overfitting (improve generalization) by randomly 0-ing different subsets of values
    # in the vector.
    # See https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf for details.
    h = keras.layers.Dropout(rate=dropout_rate)(h)
    

    ### YOUR CODE HERE
    # Add a fully connected layer for each dense layer dimension in dense_layer_dims.

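    # Layer names must be unique within a model, so the layer index is included
    # in case the same dimension appears more than once.
    for i, dense_layer_dim in enumerate(dense_layer_dims):
        h = keras.layers.Dense(dense_layer_dim,
                  use_bias=True,
                  activation='relu',
                  kernel_initializer='glorot_normal',
                  bias_initializer='zeros',
                  kernel_regularizer=None,
                  name='Dense_Encoder_{}_{}'.format(i, dense_layer_dim))(h)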

    ### END YOUR CODE

    prediction = keras.layers.Dense(num_classes, activation='softmax')(h)

    model = keras.Model(inputs=wordids, outputs=prediction)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',  # From information theory notebooks.
                  metrics=['accuracy'])        # What metric to output as we train.
    
    return model

def train_and_score(model, epochs=10):
    model.reset_states()
    model.fit(train_x, train_y, epochs=epochs)
#     print(model.summary())
    dev_score = model.evaluate(dev_x, dev_y)
    test_score = model.evaluate(test_x, test_y)
    
    return dev_score, test_score

param_dist = dict(    
    dropout_rate=[0.80, 0.9, 0.95],
    num_filters=[[10, 20, 30], [40, 50, 60], [20, 20, 20]],
    kernel_sizes=[[2, 3, 4], [2, 4, 6], [3, 4, 5]],
    dense_layer_dims=[[10, 20, 30, 40], [10], [400]]
)

""" Random Search """
n_iter_search = 10
num_epochs = 10
# random_model = keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model())
# random_search = RandomizedSearchCV(estimator=random_model, 
#                                    param_distributions=param_dist,
#                                    n_iter=n_iter_search,
#                                    n_jobs=1,
#                                    cv=2,
#                                    verbose=5,
#                                    scoring="accuracy"
#                                   )
# random_search.fit(train_x, train_y)

# # Show the results
# print("Best: %f using %s" % (random_search.best_score_, random_search.best_params_))
# means = random_search.cv_results_['mean_test_score']
# stds = random_search.cv_results_['std_test_score']
# params = random_search.cv_results_['params']
# for mean, stdev, param in zip(means, stds, params):
#     print("%f (%f) with: %r" % (mean, stdev, param))

def random_search():
    results = []
    for n in range(n_iter_search):
        print("Iter {} of {}".format(n+1, n_iter_search))
        dropout_rate = random.choice(param_dist['dropout_rate'])
        kernel_sizes = random.choice(param_dist['kernel_sizes'])
        dense_layer_dims = random.choice(param_dist['dense_layer_dims'])
        # num_filters is held fixed at [40, 50, 60] so that only the other parameters vary.
        num_filters = param_dist['num_filters'][1]  # random.choice(param_dist['num_filters'])
        
        model = create_model(epochs=num_epochs,
                             dropout_rate=dropout_rate, 
                             num_filters=num_filters, 
                             kernel_sizes=kernel_sizes, 
                             dense_layer_dims=dense_layer_dims,
                             )
        
        dev_score, test_score = train_and_score(model, epochs=num_epochs)
        
        results.append({"accuracy_dev": dev_score[1],
                        "accuracy_test": test_score[1],
                        "loss": dev_score[0],
                        "dropout_rate": dropout_rate, 
                        "num_filters": num_filters, 
                        "kernel_sizes": kernel_sizes, 
                        "dense_layer_dims": dense_layer_dims})
        
    return results
    
""" Grid Search """
def grid_search():
    dropout_rates=[0.82]
    num_filterses=[[120, 180, 240, 300, 360]]
    kernel_sizeses=[[2, 3, 4, 5, 6], [2, 4, 6, 8, 10]]
    denses=[[300, 400]]

    results = []
    for dr in dropout_rates:
        for nf in num_filterses:
            for ks in kernel_sizeses:
                for dense in denses:
                    model = create_model(dropout_rate=dr, num_filters=nf, kernel_sizes=ks, dense_layer_dims=dense)
                    dev_score, test_score = train_and_score(model)
                    results.append({"accuracy_dev": dev_score[1],
                                    "accuracy_test": test_score[1], 
                                    "loss": dev_score[0], 
                                    "dropout_rate": dr, 
                                    "num_filters": nf, 
                                    "kernel_sizes": ks, 
                                    "dense_layer_dims": dense})
    return results


random_search_results = random_search()
print(random_search_results)

# grid_search_results = grid_search()
# print(grid_search_results)

Iter 1 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 2 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 3 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 4 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 5 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 6 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 7 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Iter 8 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 9 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Iter 10 of 10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
[{'accuracy_dev': 0.76376146, 'accuracy_test': 0.7693575, 'loss': 1.1139778947338053, 'dropout_rate': 0.8, 'num_filters': [40, 50, 60], 'kernel_sizes': [2, 4, 6], 'dense_layer_dims': [10]}, {'accuracy_dev': 0.7821101, 'accuracy_test': 0.76276773, 'loss': 0.8253861922736562, 'dropout_rate': 0.9, 'num_filters': [40, 50, 60], 'kernel_sizes': [2, 3, 4], 'dense_layer_dims': [10]}, {'accuracy_dev': 0.7293578, 'accuracy_test': 0.7435475, 'loss': 0.8906924483972952, 'dropout_rate': 0.9, 'num_filters': [40, 50, 60], 'kernel_sizes': [2, 3, 4], 'dense_layer_dims': [10]}, {'accuracy_dev': 0.76261467, 'accuracy_test': 0.76771003, 'l

In [66]:
sorted_results = sorted(random_search_results, key=lambda k: k['accuracy_dev'], reverse=True) 

top_res=sorted_results[0]

print("Best accuracy: {} accuracy_test: {} dropout rate: {}  number of filters: {} kernel sizes: {} dense: {}\n".format(
    top_res['accuracy_dev'], 
    top_res['accuracy_test'], 
    top_res['dropout_rate'], 
    top_res['num_filters'], 
    top_res['kernel_sizes'], 
    top_res['dense_layer_dims']))

for result in sorted_results:
    print("accuracy: {} accuracy_test: {}  dropout rate: {}  number of filters: {} kernel sizes: {} dense: {}".format(
         result['accuracy_dev'], result['accuracy_test'], result['dropout_rate'], result['num_filters'], result['kernel_sizes'], result['dense_layer_dims']))
#     print(result)

# Best accuracy: 0.7775229215621948 dropout rate: 0.85  number of filters: [240, 240, 240, 240] kernel sizes: [2, 4, 6, 8] dense: [300, 400]
# Best accuracy: 0.7798165082931519 dropout rate: 0.82  number of filters: [120, 180, 240, 300, 360] kernel sizes: [2, 3, 4, 5, 6] dense: [300, 400]

Best accuracy: 0.7821100950241089 accuracy_test: 0.7627677321434021 dropout rate: 0.9  number of filters: [40, 50, 60] kernel sizes: [2, 3, 4] dense: [10]

accuracy: 0.7821100950241089 accuracy_test: 0.7627677321434021  dropout rate: 0.9  number of filters: [40, 50, 60] kernel sizes: [2, 3, 4] dense: [10]
accuracy: 0.76949542760849 accuracy_test: 0.7649642825126648  dropout rate: 0.8  number of filters: [40, 50, 60] kernel sizes: [2, 3, 4] dense: [400]
accuracy: 0.7660550475120544 accuracy_test: 0.7693575024604797  dropout rate: 0.9  number of filters: [40, 50, 60] kernel sizes: [3, 4, 5] dense: [10]
accuracy: 0.7649082541465759 accuracy_test: 0.7594727873802185  dropout rate: 0.8  number of filters: [40, 50, 60] kernel sizes: [2, 4, 6] dense: [400]
accuracy: 0.7637614607810974 accuracy_test: 0.7693575024604797  dropout rate: 0.8  number of filters: [40, 50, 60] kernel sizes: [2, 4, 6] dense: [10]
accuracy: 0.7626146674156189 accuracy_test: 0.7677100300788879  dropout rate: 0.8  number
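
To compare dev and test rankings (question 3 above), the sketch below re-enters the five complete rows from the sorted output and checks how the top three dev configurations order on the test set. The variable names are chosen for this example, and rows that are cut off in the output are left out.

```python
# The five fully visible rows of the sorted results above.
results = [
    {"accuracy_dev": 0.7821100950241089, "accuracy_test": 0.7627677321434021,
     "dropout_rate": 0.9, "kernel_sizes": [2, 3, 4], "dense_layer_dims": [10]},
    {"accuracy_dev": 0.76949542760849, "accuracy_test": 0.7649642825126648,
     "dropout_rate": 0.8, "kernel_sizes": [2, 3, 4], "dense_layer_dims": [400]},
    {"accuracy_dev": 0.7660550475120544, "accuracy_test": 0.7693575024604797,
     "dropout_rate": 0.9, "kernel_sizes": [3, 4, 5], "dense_layer_dims": [10]},
    {"accuracy_dev": 0.7649082541465759, "accuracy_test": 0.7594727873802185,
     "dropout_rate": 0.8, "kernel_sizes": [2, 4, 6], "dense_layer_dims": [400]},
    {"accuracy_dev": 0.7637614607810974, "accuracy_test": 0.7693575024604797,
     "dropout_rate": 0.8, "kernel_sizes": [2, 4, 6], "dense_layer_dims": [10]},
]

# Top three by dev accuracy, then their order when re-ranked by test accuracy.
top3 = sorted(results, key=lambda r: r["accuracy_dev"], reverse=True)[:3]
test_order = sorted(range(len(top3)),
                    key=lambda i: top3[i]["accuracy_test"], reverse=True)

for dev_rank, r in enumerate(top3):
    print("dev rank {}: dev={:.4f} test={:.4f}".format(
        dev_rank + 1, r["accuracy_dev"], r["accuracy_test"]))
print("dev ranks ordered by test accuracy:", [i + 1 for i in test_order])  # [3, 2, 1]
```

For these rows the test ranking is the reverse of the dev ranking, although the gaps are under one point. The raw results list above also shows the same configuration scoring 0.782 and 0.729 on dev in two separate runs, so differences this small may well be run-to-run noise.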