# Lesson 6 - TensorFlow and Keras

## 6.6.1 History of Artificial Neural Networks

Most of the history is inapplicable to the practical skills that these notes focus on. Basically, ANNs saw little practical use until recently, when Google Brain helped make them popular again. A key tool used by Google Brain was DistBelief, an internal (closed-source) system.

In 2015, DistBelief was reworked into a new iteration, __TensorFlow__, whose popularity rose in part because it was released as open source. The competitor tool to TensorFlow was __Keras__, a higher-level library that started as an independent project. Nowadays, however, Keras and TensorFlow are integrated, allowing Keras to be used natively with TensorFlow.

Good introductory source: [A. Rethinavel Subramanian, _Int'l Journal of Engineering Research and Applications_, www.ijera.com, ISSN : 2248-9622, Vol. 4, Issue 1 (Version 2), January 2014, pp.237-241](https://www.academia.edu/5886469/AF4102237242)

## 6.6.2 How TensorFlow Works

Tensors are arrays that have ___ranks.___ A rank is the number of dimensions of a tensor-array (or simply, a "tensor"). To convert a tensor to an actual NumPy array in TensorFlow 1.x, call `.eval()` on it while a session is active (or pass it to `sess.run()`).

In [1]:
# the main import
import tensorflow as tf


In [2]:
[3, 2, 1]                  # rank 1 (single dimensional vector)
[[3, 2], [1, 3]]           # rank 2 (two dimensions)
[[[1], [2]], [[1], [2]]]   # rank 3
2                          # rank 0 (scalar values have no dimensionality)

2
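
As a quick check of the rank idea, here is a small sketch in plain Python (no TensorFlow needed). The helper name `rank` is made up for this note, and it assumes regular, non-empty nested lists.

```python
def rank(value):
    """Count nesting depth: how many indices are needed to reach a scalar."""
    depth = 0
    while isinstance(value, list):
        value = value[0]   # assumes every level is non-empty and regular
        depth += 1
    return depth

print(rank(2))                         # 0
print(rank([3, 2, 1]))                 # 1
print(rank([[3, 2], [1, 3]]))          # 2
print(rank([[[1], [2]], [[1], [2]]]))  # 3
```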

### Nodes

__Thinkful Definition:__ "Key object that is a place where things can happen in our model." Cool.

#### Node - The "Constant"

In [2]:
node_const = tf.constant(70)
print(node_const) # notice how it prints the node itself, not the constant "70"

Tensor("Const:0", shape=(), dtype=int32)


#### Node - The "Mathematical Operator" (Add, Multiply, etc.)

In [7]:
node_add = tf.add(node_const, node_const)

print(node_add) # every time you re-run the cell, a new node is added, so the name becomes Add_n+1
# The ":0" suffix doesn't change: it is the index of the op's output (its first and only one), not the rank

Tensor("Add_3:0", shape=(), dtype=int32)


#### Node - The "Placeholder"

In [8]:
node_place = tf.placeholder(tf.int32)

print(node_place)

Tensor("Placeholder_1:0", dtype=int32)


#### Node - The "Variable"

Like a placeholder, a variable stands in for a value in the graph, but it stores that value itself: the value persists between `sess.run()` calls and can be updated (for example, by an optimizer), whereas a placeholder's value must be fed in on every run.

Also, you need to manually initialize them (ugh).

In [12]:
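# `sess` is the tf.Session() created under "Sessions" below (cell In [9] ran first)
# dtype must be passed by keyword: the second positional argument of tf.Variable is `trainable`
q = tf.Variable([0.], dtype=tf.float32)
init = tf.global_variables_initializer()
sess.run(init)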

Instructions for updating:
Colocations handled automatically by placer.


### Sessions

Instead of simply identifying the node, a "session" actually runs the node and generates an output.

In [9]:
sess = tf.Session()

sess.run(node_const)

70

In [10]:
# utilizing all nodes as an example

a = tf.placeholder(tf.int32)

# Create an operator node that takes our placeholder and a constant node
multiply_by_2 = tf.multiply(a, tf.constant(2))

# Run the node to return our output
sess.run(multiply_by_2, {a : 3}) # this is what you use a placeholder for

6

In [11]:
# the beauty of tensors is that the same node works elementwise on whole arrays for you!
# (something good for image data analysis? (wink hint wink?))
sess.run(multiply_by_2, {a : [[3, 4, 81], [2, 31, 13]]})

array([[  6,   8, 162],
       [  4,  62,  26]])
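
To make the "build the graph first, run it later" idea concrete, here is a toy sketch in plain Python. It is not how TensorFlow is implemented; the `Node` class and the `constant`, `placeholder`, `multiply`, and `run` helpers are hypothetical names invented for this note.

```python
class Node:
    """A step in the graph; nothing is computed until run() is called."""
    def __init__(self, op, inputs=()):
        self.op = op          # "placeholder", "const", or "multiply"
        self.inputs = inputs  # upstream Node objects
        self.value = None     # only used by "const" nodes

def constant(value):
    node = Node("const")
    node.value = value
    return node

def placeholder():
    return Node("placeholder")

def multiply(a, b):
    return Node("multiply", (a, b))

def run(node, feed=None):
    """Evaluate a node, looking up placeholder values in the feed dict."""
    feed = feed or {}
    if node.op == "const":
        return node.value
    if node.op == "placeholder":
        return feed[node]     # KeyError if the value was never fed
    left, right = (run(n, feed) for n in node.inputs)
    return left * right

a = placeholder()
times_2 = multiply(a, constant(2))
print(run(times_2, {a: 3}))   # 6
print(run(times_2, {a: 10}))  # 20
```

Defining `times_2` computes nothing; the value only exists once `run` is called with a feed, which mirrors the role of `sess.run(...)` and the placeholder feed dictionary above.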

## 6.6.3 - Example TensorFlow Model

In [13]:
# setting variables
# Pass dtype by keyword: the second positional argument of tf.Variable is
# `trainable`, so tf.Variable([1.], tf.float32) only gets float32 from the "1."
b = tf.Variable([1.], dtype=tf.float32)
m = tf.Variable([1.], dtype=tf.float32)
x = tf.placeholder(tf.float32)

# Implement a linear model with shorthand for tf.add() by using '+'
# and tf.multiply with '*'
linear_model = m * x + b

# New variables means we have to initialize again
init = tf.global_variables_initializer()
sess.run(init)

In [14]:
print(sess.run(linear_model, {x:[1, 2, 3, 4]}))

[2. 3. 4. 5.]


In [15]:
# getting fancier...
# creating a loss function and thereby determining accuracy

y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1, 2, 3, 4], y:[.1, -.9, -1.9, -2.9]}))

# That's a high sum squared loss! Large error!

116.04001
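
The loss value above can be reproduced by hand. This is a plain-Python sketch of the same arithmetic, using the initial values `m = 1` and `b = 1` from the cell above.

```python
m, b = 1.0, 1.0
xs = [1, 2, 3, 4]
ys = [0.1, -0.9, -1.9, -2.9]

predictions = [m * x + b for x in xs]              # [2.0, 3.0, 4.0, 5.0]
deltas = [p - y for p, y in zip(predictions, ys)]  # about [1.9, 3.9, 5.9, 7.9]
loss = sum(d ** 2 for d in deltas)                 # about 3.61 + 15.21 + 34.81 + 62.41
print(round(loss, 2))                              # 116.04
```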


In [16]:
# You can go back and adjust the model with the node
#########TF.ASSIGN!

fixm = tf.assign(m, [-1.])
fixb = tf.assign(b, [1.1])
sess.run([fixm, fixb])
print(sess.run(loss, {x:[1, 2, 3, 4], y:[.1, -.9, -1.9, -2.9]}))

4.9960036e-16


^^^The error is basically zero here, which is great!<br>
But how did we know to re-assign m and b to be -1 and 1.1, respectively? Thinkful cheated...

What we _can_ do is to create a gradient descent function to minimize the loss function. We can set the gradient descent to be a tensorflow node.

In [17]:
# Set your learning rate in Gradient Descent - 0.01 is just fine
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# reset values to incorrect defaults.
# Otherwise, the "session" would work off of itself - VERY IMPORTANT!*******
sess.run(init) 

# Loop for 100 iterations, trying to find optimal values
for i in range(100):
    sess.run(train, {x:[1, 2, 3, 4], y:[.1, -.9, -1.9, -2.9]})

print(sess.run([m, b]))

Instructions for updating:
Use tf.cast instead.
[array([-0.9286998], dtype=float32), array([0.89036876], dtype=float32)]
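
For intuition about what the optimizer node does, here is a minimal plain-Python sketch of gradient descent on the same data and loss. It is an illustration under stated assumptions (hand-derived gradients, a hypothetical `fit` helper), not TensorFlow's implementation.

```python
xs = [1, 2, 3, 4]
ys = [0.1, -0.9, -1.9, -2.9]

def fit(steps, learning_rate=0.01, m=1.0, b=1.0):
    """Plain gradient descent on loss = sum((m*x + b - y)**2)."""
    for _ in range(steps):
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        grad_m = sum(2 * e * x for e, x in zip(errors, xs))  # d(loss)/dm
        grad_b = sum(2 * e for e in errors)                  # d(loss)/db
        # update both parameters from gradients taken at the same point
        m, b = m - learning_rate * grad_m, b - learning_rate * grad_b
    return m, b

print(fit(100))    # still short of the target after 100 steps
print(fit(5000))   # close to m = -1, b = 1.1
```

After 100 steps the values are still some way from `m = -1`, `b = 1.1`, which matches the TensorFlow output above; running more iterations closes the gap.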


## 6.6.4 Keras Introduction

Keras is more accessible but can do less than TensorFlow (according to Thinkful, of course). Instead of TensorFlow's "nodes" and "variables", Keras has:

__Layers:__ Like layers in a typical neural network model (a set of perceptron nodes).
- A layer is called a "dense" (fully connected) layer if every node in it is connected to every node in the previous layer.

__Models:__ The structure of your layering. You "make a model" by stacking the layers (see example below). Two flavors of model:
- _Sequential_: Stacks of layers in a linear progression.
- _Complex_: Non-sequential layering (branches, multiple inputs or outputs), built with Keras's functional `Model` API. Obviously such a model is more...ummm...'complex'.

In [19]:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()

model.add(Dense(units=100, input_dim=100)) # "adding" on top of one another 
model.add(Activation('relu'))              # forms a sequential model
model.add(Dense(units=10))
model.add(Activation('softmax'))

# ...I did it. Models need to be compiled and fit later
# This will be demonstrated in the next section:

Using TensorFlow backend.
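
To show what a dense layer followed by `relu` and `softmax` actually computes, here is a tiny plain-Python sketch. The weights and inputs are made-up numbers, and the helper names are hypothetical; Keras does the same arithmetic with learned weights and optimized array code.

```python
import math

def dense(inputs, weights, biases):
    """One output per unit: weighted sum of ALL inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_w, inputs)) + bias
            for unit_w, bias in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shifted = [v - max(values) for v in values]  # shift for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, -2.0]                                  # 2 input features
hidden = relu(dense(x, [[0.5, 0.5], [1.0, -1.0], [-1.0, 0.0]], [0.0, 0.0, 0.5]))
probs = softmax(dense(hidden, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]))
print(hidden)      # [0.0, 3.0, 0.0]
print(sum(probs))  # 1.0 up to floating-point rounding
```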


## 6.6.5 Keras MNIST Guided Example 

__Goal:__ Using the MNIST dataset, we will use neural networks to classify handwritten numbers as the proper digits, 0-9. Note that this data is sparse!!!!!

MNIST is a good example of neural network usage. Stacking multiple layers lets the network learn useful features for this supervised classification task on its own. __Neural networks need to be fed A LOT of data to work well__ (albeit, a lot of sparse data :( ...).

In [5]:
import tensorflow as tf
import keras

In [2]:
# Import the dataset
from keras.datasets import mnist # easy to load!!!!!!!!

# Import various components for model building
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers import LSTM, Input, TimeDistributed
from keras.models import Model
from keras.optimizers import RMSprop

# Import the backend
from keras import backend as K

Using TensorFlow backend.


In [22]:
(x_train, y_train), (x_test, y_test) = mnist.load_data() # split and load data
x_train

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 

In [24]:
print(28*28)        # MNIST picture resolution
print(len(x_train)) # array size

784
60000


In [26]:
# need to compress/shape the data into flat vectors for each digit

# Change shape 
# Need to reshape 60,000 arrays to length 784, one for each image
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

# Convert to float32 for type consistency
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Normalize pixel values from the 0-255 range (256 possible values) to 0-1
x_train /= 255
x_test /= 255

# Print sample sizes
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
# So instead of one column with 10 values, 
# create 10 binary columns (one for each number!!!!)
from keras.utils import to_categorical

y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

60000 train samples
10000 test samples
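
The two preprocessing steps above (scaling and one-hot encoding) are simple enough to sketch in plain Python. The `one_hot` helper is a hypothetical stand-in written for this note, not the Keras function itself.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 each."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

pixels = [0, 128, 255]
scaled = [p / 255 for p in pixels]   # 0-255 becomes 0.0-1.0
print(scaled[0], scaled[-1])         # 0.0 1.0

encoded = one_hot([5, 0, 4], num_classes=10)
print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```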


In [27]:
# Instantiating our Keras model
'''
Keras model will use "dense layers" and "dropouts"

***Dense layers*** - layers that are fully connected
***Dropout layers*** - randomly drop out a fraction of the perceptrons' outputs during training to prevent overfitting
'''

#---------------------------------------------------

# Start with a simple sequential model
model = Sequential()

'''
 Add dense layers to create a fully connected MLP
 Note that we specify an input shape for the first layer, but
***ONLY*** the first layer!!!!!!!!!
'''

# Relu = "Rectified Linear Unit (standard activation function)"
model.add(Dense(64, activation='relu', input_shape=(784,)))
# Dropout layers randomly zero a fraction of the previous layer's outputs during training to fight overfitting
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))

# End with a number of units equal to the number of classes we have for our outcome
model.add(Dense(10, activation='softmax'))

model.summary()

# Compile the model to put it all together.
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 64)                50240     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                650       
Total params: 55,050
Trainable params: 55,050
Non-trainable params: 0
_________________________________________________________________
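
The parameter counts in the summary can be checked by hand: each dense unit has one weight per input plus one bias, and dropout layers add no parameters. A short sketch of that arithmetic (the `dense_params` helper is just a name chosen for this note):

```python
def dense_params(n_inputs, n_units):
    """Each unit has one weight per input plus one bias."""
    return n_inputs * n_units + n_units

layers = [(784, 64), (64, 64), (64, 10)]   # (inputs, units) per Dense layer
counts = [dense_params(i, u) for i, u in layers]
print(counts)       # [50240, 4160, 650]
print(sum(counts))  # 55050
```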


In [28]:
# fitting the instantiated model
'''
BATCH_SIZE = number of samples used for each weight update (step)
----larger size = fewer steps per epoch, so each epoch runs faster, but
----it can hurt accuracy (the model gets fewer updates per pass over the data)

----Batch size (128) is independent of the layer widths (64) chosen above
----2^x sizes are just a common choice that suits parallelized computation

EPOCH = one full pass of the model through the training data - 
-----each epoch improves on what was learned in the previous iteration/epoch
'''

# ---------------------------------------------------------------

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=10,     # essentially iterations of the model going through the data
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.08778246227339842
Test accuracy: 0.9749
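
To see how batch size and epochs translate into weight updates for this run, here is the arithmetic as a short sketch. It assumes the final, smaller batch in each epoch is still used.

```python
import math

n_samples, batch_size, epochs = 60000, 128, 10
steps_per_epoch = math.ceil(n_samples / batch_size)  # last batch is smaller
print(steps_per_epoch)           # 469
print(steps_per_epoch * epochs)  # 4690
```

Doubling the batch size would roughly halve the number of updates per epoch, which is the trade-off described in the comment block above.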


---

### Convolutional Neural Networks

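__Convolution:__ Sliding a small window of weights (a kernel) across the input in overlapping steps and computing a weighted sum at each position; the model learns the kernel weights that pick out useful local features.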

Here is an example:

In [33]:
from IPython.display import Image
Image(url='https://cdn-images-1.medium.com/max/1200/1*GcI7G-JLAQiEoCON7xFbhg.gif')

<IPython.core.display.Image object>


In [34]:
Image(url='https://cdn-images-1.medium.com/max/1000/1*yHKCrrgpdewt30JcE7016g.png')

# more good info: https://www.peculiar-coding-endeavours.com/2018/mlp_vs_cnn/

__STEPS TO CONVOLUTION__
1. Define shape of input data (most often, first chunk of code is reshaping the loaded data)
2. Create the tiles (or ___kernels___)
3. Reduce the sample size into a pooling layer (a process called ___downsampling___)
4. Flatten the downsampled data, place flattened data into dense layers and run model
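
Steps 2 to 4 above can be sketched in plain Python on a tiny made-up image. This is only an illustration: the kernel here is hand-picked rather than learned, and the `conv2d` and `max_pool` helpers are hypothetical names, not Keras functions.

```python
def conv2d(image, kernel):
    """'Valid' convolution (cross-correlation, as deep-learning libraries do it)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Downsample by keeping the max of each non-overlapping size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

image = [[0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0]]
edge_kernel = [[-1, 1]]     # responds where brightness rises left to right

features = conv2d(image, edge_kernel)   # 5 rows x 4 columns
pooled = max_pool(features)             # 2 rows x 2 columns
flat = [v for row in pooled for v in row]
print(features[0])  # [0, 1, 0, -1]
print(flat)         # [1, 0, 1, 0]
```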

In [8]:
# 1. LOAD DATA, SHUFFLE, AND RESHAPE

# input image dimensions, from our data
img_rows, img_cols = 28, 28
num_classes = 10

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train)
print(y_train)

# K is the Keras backend; image_data_format() says whether the channel axis comes first or last

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

print(y_train)
print(type(y_train))

#---------------------------------------------------
# 2. MODEL INSTANTIATION

# Building the Model
model = Sequential()
# First convolutional layer, note the specification of shape
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

#----------------------------------------------------------
# 3. MODEL FIT & RUN

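# (this block can be commented out to skip the long training run, but then
# `score` is undefined and the prints below raise a NameError)
model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])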

# convolutional NNs take A LONG LONG TIME to run, so be patient...

[[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 ...

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]
[5 0 4 ... 5 6 8]
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 

NameError: name 'score' is not defined

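The convolutional model takes much longer to train than the dense model above, but it yielded a higher accuracy. The `NameError` in the recorded output only reflects that the cell was last executed without the `model.fit` and `model.evaluate` calls, so `score` was never assigned and no accuracy was printed for this run.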
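
Part of the reason the convolutional model is slower is its size. The sketch below works through the layer shapes and parameter counts for the architecture defined above, assuming the Keras `Conv2D` defaults of stride 1 and no padding.

```python
def conv_out(size, kernel):   # 'valid' padding, stride 1 (the Conv2D defaults)
    return size - kernel + 1

side = 28
side = conv_out(side, 3)      # Conv2D(32, (3, 3)) -> 26 x 26 x 32
side = conv_out(side, 3)      # Conv2D(64, (3, 3)) -> 24 x 24 x 64
side = side // 2              # MaxPooling2D((2, 2)) -> 12 x 12 x 64
flat = side * side * 64       # Flatten
print(side, flat)             # 12 9216

conv1 = 3 * 3 * 1 * 32 + 32         # 320 (weights + biases)
conv2 = 3 * 3 * 32 * 64 + 64        # 18496
dense1 = flat * 128 + 128           # 1179776
dense2 = 128 * 10 + 10              # 1290
print(conv1 + conv2 + dense1 + dense2)  # 1199882
```

That is roughly 1.2 million parameters against 55,050 for the dense model, and the convolutions themselves are applied at every image position.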

---

### Hierarchical Recurrent Neural Networks

Recurrent NNs are different from the above NNs, which are all __feed-forward__ - that is, data flows in one direction until it reaches the end.

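Recurrent NNs instead "cycle" through the network: each step's output is fed back in along with the next input. The example below is built with the functional `Model` API rather than `Sequential` (simple stacks of recurrent layers can still go in a `Sequential` model). Also, this is more complex than feed-forward, meaning it will take even longer (yay...)!

An example of a Recurrent Neural Network is below: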

In [31]:

# Training parameters.
batch_size = 64
num_classes = 10
epochs = 3      # notice this is much lower than the 10 - 
                # don't wanna take forever!

# Embedding dimensions.
row_hidden = 32
col_hidden = 32

#------------------------------------------------------------

# The data, shuffled, and split between train and test sets.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshapes data to 4D for Hierarchical RNN.
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Converts class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

row, col, pixel = x_train.shape[1:]

# 4D input.
x = Input(shape=(row, col, pixel))

# Encodes a row of pixels using TimeDistributed Wrapper. - THIS IS THE BASIS OF THE LSTM RNN!!!!!
encoded_rows = TimeDistributed(LSTM(row_hidden))(x)

# Encodes columns of encoded rows.
encoded_columns = LSTM(col_hidden)(encoded_rows)

# Final predictions and model.
prediction = Dense(num_classes, activation='softmax')(encoded_columns)
model = Model(x, prediction)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Training.
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# Evaluation.
scores = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Test loss: 0.23966183296740054
Test accuracy: 0.9229
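
The "feeding back" idea can be shown with a much simpler cell than an LSTM. This plain-Python sketch uses a single hidden unit with made-up weights; the `rnn_last_state` helper is hypothetical and leaves out the gates that an LSTM adds.

```python
import math

def rnn_last_state(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Simple (non-LSTM) recurrent cell with one hidden unit."""
    hidden = 0.0
    for value in sequence:
        # the previous hidden state is fed back in alongside the new input
        hidden = math.tanh(w_in * value + w_rec * hidden + bias)
    return hidden

print(rnn_last_state([1.0, 0.0, 0.0]))
print(rnn_last_state([0.0, 0.0, 1.0]))  # same values, different order, different state
```

Because the hidden state carries over from step to step, the order of the inputs changes the result, which is what makes recurrent layers suited to sequences such as the rows of pixels in the example above.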


## OVERVIEW

[This is a good example of how to do a CNN Keras model](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6).

__Steps to do a Keras Model:__
1. Do basic imports (including various Keras imports) and import data.


2. Split and clean the data.


3. Normalize the data _values_ (i.e., rescale pixel values to \[0,1\], since neural networks tend to converge faster that way).
    - In the MNIST dataset, pixel values range from 0 to 255, depending on the pixel's darkness level. To normalize this data, simply divide every pixel column by 255: 
        - `X_train = [['pixel_1', 'pixel_2', ... 'pixel_784']]`
        - `X_train = X_train / 255`


4. Reshape the data into the input shape the Keras model expects (flat vectors for dense layers, image-shaped arrays for convolutional layers)
    - `X_train = X_train.values.reshape(-1, 28, 28, 1)` representing 28x28 pixels and 1 for the 'channel' dimension (grayscale)
    - For different neural networks, reshaping syntax may need to be altered (see above examples)
    - If target variable is not encoded into numbers, make sure to do that
    - The numeric labels then __HAVE TO__ be one-hot encoded: instead of a one-dimensional array of all the values, create one binary (dummy) column for each different categorical value
        - You can do the above via `from keras.utils import to_categorical` and later `Y_train = to_categorical(Y_train, num_classes = 10)`. It's Keras law.


5. Setting up the Keras model architecture:
    - adding neural network layers:
        - `model = Sequential()`
        - Add as many layers as you want: `model.add(`_(type of NN layer)_`)`
        - After adding desired convolutional layers, do `model.add(Flatten())` and then `model.add(Dense(...))` and `model.add(Dropout(...))`
    - Modify model for efficiency:
        - Add optimizer (the algorithm that updates the model's weights; its learning rate sets the step size)
            - `optimizer = RMSprop(...)`
            - `model.compile(optimizer=optimizer, loss= ..., metrics=['accuracy'])`
        - Add annealer (helps convergence by reducing the learning rate when the monitored metric stops improving)
            - `learningratereduction = ReduceLROnPlateau(...)`
        
        
6. Set up Data augmentation (if the model runs into an overfitting problem, otherwise, ignore)


7. Fit the model (and adjust parameters, including the annealer in "callbacks=" parameter if necessary)


8. Test model's accuracy.


9. If working on a classification problem (labeled data), use a confusion matrix to check accuracy for each individual class.
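
For step 9, the idea behind a confusion matrix fits in a few lines of plain Python. The labels below are made up for illustration, and the helpers are hypothetical stand-ins for library versions such as scikit-learn's.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly (recall)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

y_true = [0, 0, 1, 1, 2, 2, 2]   # made-up labels for illustration
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
print(per_class_accuracy(cm))  # [0.5, 1.0, 0.6666666666666666]
```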

## Possible errors:

1. Error when calling "cross_entropy" 
    - __Answer:__ shaping error when analyzing the target variable. You need to do y_train (or test) = `keras.utils.to_categorical(y_train, num_classes=` _(num of categories in target variable)_ `)`
    
    
2. Array of size __X__ cannot be reshaped using (your dimensions)
    - __Answer:__ The product of your dimensions does not equal the total number of elements in the original dataset. If you are having trouble figuring out what dimensions will work, reshape to the original dimensions of the dataset (yes, it is still a "reshaping" in the eyes of Keras)


3. Layer dense_21 expected (1,) but got (10,)
    - __Answer:__ As shown in the above example, the final layer must be the same dimensionality as that of the target variable. If appropriately one-hot encoded (see Possible Error 1 above), then the number of categories encoded to will be the number found in the final model.add(Dense(# here)) layer when creating the Keras model.
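
The reshape rule behind Possible Error 2 is just arithmetic, so it can be checked before touching the data. Here is a minimal sketch; `resolve_shape` is a hypothetical helper written for this note that mimics the rule, including a single `-1` being inferred.

```python
from math import prod

def resolve_shape(n_elements, shape):
    """Mimic the reshape rule: the dimensions must multiply to n_elements.
    A single -1 is inferred from the remaining dimensions."""
    known = prod(d for d in shape if d != -1)
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    if -1 in shape:
        if known == 0 or n_elements % known:
            raise ValueError(f"cannot reshape {n_elements} elements into {shape}")
        return tuple(n_elements // known if d == -1 else d for d in shape)
    if known != n_elements:
        raise ValueError(f"cannot reshape {n_elements} elements into {shape}")
    return tuple(shape)

n = 60000 * 28 * 28
print(resolve_shape(n, (60000, 784)))      # (60000, 784)
print(resolve_shape(n, (-1, 28, 28, 1)))   # (60000, 28, 28, 1)
```

A shape such as `(60000, 785)` raises a `ValueError` here, for the same reason the real reshape fails.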

# Which Specific Neural Network to Use?

__source:__ ["When to use MLP, CNN, and RNN Neural Networks", Jason Brownlee, Jul. 23, 2018](https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/)

Three main types of ANNs:
1. __Multi-layer Perceptrons ("Classical" or "Feed forward" network):__ You can use sklearn's version of this. Use this one for:
    - Classification prediction problems
    - Regression prediction problems
    - Best with __tabular data__ (as opposed to image, audio, or other medium of data)
        - If working with non-tabular data, you should still use this ANN for "baseline" prediction values. The baseline shows whether a more complex ANN is a real improvement or is just overfitting!!!!!!!
        
2. __Convolutional Neural Networks (CNNs):__ These models develop an internal representation of a 2D image. That lets the model pick up structures regardless of their position and scale in the image. Use this one for:
    - Classification prediction problems
    - Regression prediction problems
    - Image data
        - The reason this is designed for image data is because the model is good at constructing a __"spatial relationship"__ with the internal representation. This makes them good not just for image data, but for (1) word-order relationship in text and (2) the ordered relationship in steps of a time-series.
        
3. __Recurrent Neural Networks (aka "Feedback" networks)__: Uses a "non-linear" approach, where the outputs of a layer are fed back in as inputs at the next step, so the data does not flow in only one direction within the network. This makes them good for __sequence prediction__ problems, where each data observation might have "multiple steps". The __"Long Short-Term Memory" (LSTM)__ RNN is a successful RNN because its gating overcomes much of the difficulty of training recurrent networks on long sequences. Use this one for:
    - Text data
    - Classification prediction problems
    - Regression prediction problems
    - Speech data
    - Generative models
    - Do __NOT__ use RNNs for:
        - tabular data
        - image data
        - MLPs or even simple classification/regression models fare better at working with tabular and image data than  RNNs do. 
   
### BUT WHY CHOOSE ONE WHEN YOU CAN HYBRIDIZE THEM ALL TOGETHER???
   
4. __CNN LSTM Architecture:__ This model starts with CNN layers at the input, LSTM in the middle, then MLP at the output. Such a model would be perfect for _video data_ (image analysis, then sequential analysis, then classification analysis).
    


In [35]:
# good example of the various things you can do.
# Again, this all depends on (1) THE DATA and (2) THE DESIRED OUTPUT
Image(url='https://cdn-images-1.medium.com/max/1000/1*S7Q2Rh7ba0jW5pQJovekSw.png')

In [None]:
# good miscellaneous things you can add for analysis purposes:

# Look at confusion matrix for categorical things

import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict the values from the validation dataset
# (X_val / Y_val are assumed to be a held-out validation split created earlier)
Y_pred = model.predict(X_val)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1)
# Convert one-hot validation labels to class indices
Y_true = np.argmax(Y_val,axis = 1)
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10)) 

In [None]:
# Visualizing some error results 

# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0) # see above for Y_true and Y_pred_classes

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)

In [None]:
# wrap-up conditions
import pandas as pd

# predict results (`test` is assumed to be the preprocessed, unlabeled test set)
results = model.predict(test)

# select the index with the maximum probability
results1 = np.argmax(results, axis=1)

results2 = pd.Series(results1, name="Label")

submission = pd.concat([pd.Series(range(1,28001), name="ImageId"), results2], axis=1)

submission.to_csv("cnn_mnist_datagen.csv", index=False)