# Convolutional Neural Networks

## Standard Imports

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly

# https://stackoverflow.com/questions/57658935/save-jupyter-notebook-with-plotly-express-widgets-displaying
plotly.offline.init_notebook_mode()

## Data
- We'll use some Keras data functionality this time
    - Again we get the nice local data management -- no need for repeated downloads

In [5]:
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [None]:
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

In [None]:
# https://stackoverflow.com/questions/51235508/in-which-folder-on-pc-windows-10-does-load-data-save-a-dataset-in-keras
! ls ~/.keras/datasets/cifar-10-batches-py

## Let's take a look at our ~~images~~ ~~data~~ tensors
- Keras loads all the data into memory: *not* a TFDS
    - In fact, not even tensors: they're `numpy.ndarray`'s
  
  
- *Still, images are 'tensors' in the 'neural network' sense (though not in the physics/linear algebra sense)*


In [None]:
type(train_images)

In [None]:
train_images.shape

- `...` provides convenient indexing expressivity

In [None]:
image_index = 0 
train_images[image_index].shape, train_images[image_index,...].shape

- `np.newaxis` provides convenient dimensionality expressivity

In [None]:
# https://stackoverflow.com/questions/17394882/how-can-i-add-new-dimensions-to-a-numpy-array
train_images[np.newaxis,image_index,...].shape

# Humans See This
(using code borrowed from the [TF CNN tutorial](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb#scrollTo=K3PAELE2eSU9))

In [None]:
# https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb#scrollTo=K3PAELE2eSU9
def plot_cifar10(x,y):
    class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']
    plt.subplot(111);plt.xticks([]);plt.yticks([]);plt.grid(False)
    plt.imshow(x, cmap=plt.cm.binary);plt.xlabel(class_names[y[0]])
    
plot_cifar10(train_images[image_index],train_labels[image_index])

## Computers See This
- note that the scale of the data is 0-255 *and likely needs to be changed*

In [None]:
#train_images[image_index].shape

In [None]:
rgb = {v:i for i,v in enumerate("rgb")}
print(rgb)

image_index = 0 
train_images[image_index][..., rgb['r']]

In [None]:
# https://plotly.com/python/3d-scatter-plots/
# https://plotly.com/python/colorscales/
# https://plotly.com/python/creating-and-updating-figures/
# https://community.plotly.com/t/plotly-express-multiple-plots-overlay/31984

def plot_channel(x, h, color, alpha, loc, i=[0], j=[0]):

    i_grid,j_grid = np.meshgrid(*[range(i) for i in x[0].shape[:2]])

    frames = []
    for x_,h_,i_,j_ in zip(x,h,i,j):
        tmp = pd.DataFrame({'i': i_+i_grid.ravel(), 'j': j_+j_grid.ravel(), 'h': h_+0*i_grid.ravel()})
        tmp['c'] = x_.ravel()
        frames.append(tmp)
    # DataFrame.append was removed in pandas 2.0; concatenate the pieces once instead
    df = pd.concat(frames, ignore_index=True)
    
    fig.append_trace(go.Scatter3d(x=df.i, y=df.j, z=df.h, opacity=alpha, 
                           mode='markers', marker={'color':df.c, 'colorscale': color}),*loc)

# a few other helpful posts which ultimately led to my solution above
# https://community.plotly.com/t/specifying-a-color-for-each-point-in-a-3d-scatter-plot/12652
# https://www.reddit.com/r/rstats/comments/g3tulu/is_it_possible_to_vary_alphaopacity_by_group_in_a/
# https://plotly.com/python/3d-scatter-plots/
# .. mode='markers', marker=dict(color='(255,0,0)', size=10) ...
# https://stackoverflow.com/questions/53875880/convert-a-pandas-dataframe-of-rgb-colors-to-hex
    

In [None]:
# https://stackoverflow.com/questions/46750462/subplot-with-plotly-with-multiple-traces
fig = plotly.subplots.make_subplots(rows=1,cols=2, specs=[[{'type': 'scene'}, {'type': 'scene'}]])

plot_channel([train_images[image_index][..., rgb['r']]/255], [0], 'Reds', 0.33, [1,1])
plot_channel([train_images[image_index][..., rgb['g']]/255], [1], 'Greens', 0.33, [1,1])
plot_channel([train_images[image_index][..., rgb['b']]/255], [2], 'Blues', 0.33, [1,1])

plot_channel([train_images[image_index][..., rgb['r']]/255,
              train_images[image_index][..., rgb['g']]/255,
              train_images[image_index][..., rgb['b']]/255], [0,1,2], 'Greys', 0.33, [1,2], [0]*3, [0]*3)

fig.show()

# So what are "Convolutional" Kernels?
- "Convolutional" is in quotes because 
    - the operation these *kernels* perform is actually a "cross-correlation" (i.e., similarity measure)
        - *interestingly... "kernel" is also a highly overloaded term...* but in the CNN context it means the [matrix](https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn/188216) doing "cross-correlations" over the data
    - but the idea of a convolution is to move one function across another, which is reasonable as an intuition for CNNs
        - [Mathematical Convolution](https://lpsa.swarthmore.edu/Convolution/CI.html)
        - [CNN Kernel "Convolution"](https://ezyang.github.io/convolution-visualizer/index.html)
        
        
**Not an official textbook; but, [Dive Into Deep Learning](http://d2l.ai/chapter_convolutional-neural-networks/channels.html) is another free resource that I've found useful for finding quick answers to things**
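
To make the cross-correlation versus convolution distinction concrete, here is a minimal 1D sketch using NumPy's `np.correlate` and `np.convolve`. The `signal` and `kernel` values are made up for illustration; the only point is that a true convolution flips the kernel before sliding it, while a cross-correlation does not.

```python
import numpy as np

# Made-up 1D "signal" and asymmetric "kernel" (illustrative values only)
signal = np.array([0., 1., 2., 3., 4., 5.])
kernel = np.array([1., 0., -1.])

# Cross-correlation slides the kernel as-is (what CNN layers compute)
xcorr = np.correlate(signal, kernel, mode='valid')

# A true convolution flips the kernel before sliding it
conv = np.convolve(signal, kernel, mode='valid')

# The two agree once the kernel is flipped by hand
conv_flipped = np.convolve(signal, kernel[::-1], mode='valid')

print(xcorr)                             # [-2. -2. -2. -2.]
print(conv)                              # [2. 2. 2. 2.]
print(np.allclose(xcorr, conv_flipped))  # True
```

Since the kernel weights are learned anyway, the flip makes no practical difference to a CNN, which is presumably why the two terms get used interchangeably.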

In [None]:
# using the terms kernel/filter as recommended on: https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn/188216
number_filters = 10
number_channels = 3
kernel_width = 5

kernels = np.random.rand(number_filters*number_channels*kernel_width*kernel_width)
kernels = kernels.reshape((kernel_width,kernel_width,number_channels,number_filters))-0.5
print(kernels.shape)

# using the terms kernel/filter as recommended on: https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn/188216
filter_index = 0
channel_index = rgb['r']
kernels[...,channel_index,filter_index]

And so God commanded...

## Image Patch

In [None]:
image_index = 0
i,j = 0,0

for channel in 'rgb':
    channel_index = rgb[channel]
    print(channel)
    print(train_images[image_index,
                       i:(i+kernel_width),
                       j:(j+kernel_width), 
                       channel_index])
    print()

## Meet Kernel
(for each color channel)

In [None]:
filter_index = 0

for channel in 'rgb':
    channel_index = rgb[channel]
    print(channel)
    print(kernels[:,:,channel_index,filter_index])
    print()

## And Multiply (and Sum (across color channels, too))

In [None]:
filter_index = 0
image_index = 0
i,j = 0,0#1,1

(\
train_images[image_index, 
             i:(i+kernel_width), 
             j:(j+kernel_width), 
             :] * kernels[...,filter_index]\
).sum()

## So what's this doing to the big picture as a TF/Keras layer?


In [None]:
train_images.shape, kernels.shape

In [None]:
# will result in
50000,32-5+1,32-5+1,10
# assuming no padding, stride, etc.

- Transforms each image into a "10 channel" 28 by 28 image
- Each "channel" captures "hotness" of a certain "feature"
    - captured via cross-correlation with the kernel characterizing the feature
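
The `32-5+1` arithmetic above generalizes to padding and stride. A small sketch of the usual output-size formula follows; the helper name `conv_output_size` and its arguments are hypothetical, not part of Keras.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# The case above: 32x32 images, 5x5 kernels, no padding, stride 1
print(conv_output_size(32, 5))             # 28
# Padding of 2 on each side keeps the 32x32 size for a 5x5 kernel
print(conv_output_size(32, 5, padding=2))  # 32
# A stride of 2 roughly halves the output size
print(conv_output_size(32, 5, stride=2))   # 14
```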

## So "Kernels" are features (you say)? How so (pray tell)?

- E.g., horizontal edge detection...

In [None]:
kernels[:2,...,-1]=-1
kernels[2,...,-1]=0
kernels[3:,...,-1]=1
kernels[...,0,-1]

- or vertical edge detection...

In [None]:
kernels[:,:2,:,-2]=-1
kernels[:,2,:,-2]=0
kernels[:,3:,:,-2]=1
kernels[...,0,-2]

## Which gives us this sort of thing

In [None]:
convolutional_layer_output = np.zeros(list(np.array(train_images[image_index].shape[:2])-kernel_width+1)+[number_filters])
print(convolutional_layer_output.shape)

for filter_index in range(number_filters):
    for i in range(convolutional_layer_output.shape[0]):
        for j in range(convolutional_layer_output.shape[1]):
            convolutional_layer_output[i,j,filter_index] = \
            (kernels[...,filter_index] *
             (train_images[image_index,
                          i:(i+kernel_width),
                          j:(j+kernel_width), 
                          :]/255-0.5)).sum() # again, notice that this sums over color channels;
                                             # although, each color channel gets its own kernel.
                                             # ALSO NOTE the standardizing: but this only changes the output scale

- i.e., completing the full "convolution" over the whole image, gives us

In [None]:
# https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.subplots.html
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(15,10))

ax1.imshow(convolutional_layer_output[...,-1])
ax2.imshow(train_images[image_index], cmap=plt.cm.binary)
ax3.imshow(convolutional_layer_output[...,-2])
ax4.imshow(kernels[...,0,-1])
ax5.axis('off')
ax6.imshow(kernels[...,0,-2])

The vertical edge detection kernel
- detects "left-to-right" "low-to-high" as positive
- detects "left-to-right" "high-to-low" as negative
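
A toy check of these two sign claims, as a sketch: the `step` image and the helper `cross_correlate_row` are made up for illustration, and the kernel copies the single-channel pattern of the vertical edge kernel defined earlier.

```python
import numpy as np

# Toy 5x8 grayscale image: dark (0) on the left, bright (1) on the right
step = np.zeros((5, 8))
step[:, 4:] = 1.0

# Same pattern as the vertical edge kernel, for a single channel
vertical_kernel = np.zeros((5, 5))
vertical_kernel[:, :2] = -1
vertical_kernel[:, 3:] = 1

def cross_correlate_row(image, kernel):
    """Valid cross-correlation; the image height equals the kernel height here."""
    kw = kernel.shape[1]
    return np.array([(image[:, j:j + kw] * kernel).sum()
                     for j in range(image.shape[1] - kw + 1)])

low_to_high = cross_correlate_row(step, vertical_kernel)
high_to_low = cross_correlate_row(step[:, ::-1], vertical_kernel)
print(low_to_high)  # [ 5. 10. 10.  5.]
print(high_to_low)  # [ -5. -10. -10.  -5.]
```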

In [None]:
np.array([[0],[1],[0]]).dot(np.array([[0,1,0]]))

In [None]:
# might try to animate this later:
# https://plotly.com/python/animations/
# https://plotly.com/python/v3/gapminder-example/

fig = plotly.subplots.make_subplots(rows=1,cols=2, specs=[[{'type': 'scene'}, {'type': 'scene'}]])

plot_channel([train_images[image_index][..., rgb['r']]/255,
              train_images[image_index][..., rgb['g']]/255,
              train_images[image_index][..., rgb['b']]/255], [0,1,2], 'Greys', 0.33, [1,1], [0]*3, [0]*3)

for filter_index in range(number_filters):
    plot_channel([convolutional_layer_output[...,filter_index]], [3*filter_index], 
                 px.colors.named_colorscales()[filter_index], 0.1, [1,2])


filter_index=5
i,j=17,17
plot_channel(np.moveaxis(kernels[...,filter_index],-1,0), list(range(3)), 
             px.colors.named_colorscales()[filter_index], 1, [1,1], i=[i]*3, j=[j]*3)

plot_channel([np.array([[0],[1],[0]]).dot(np.array([[0,1,0]]))], [3*filter_index], 
             px.colors.named_colorscales()[filter_index], 1, [1,2], i=[i-1], j=[j-1])

fig.show()

## Summary/Review

- Each filter *outputs* a distinct new "feature layer" (i.e., channel)
- Within each filter, the kernel is [different for each *input channel*](https://stackoverflow.com/questions/43306323/keras-conv2d-and-input-channels)
- *A bias (intercept) term (ONE for EACH filter) is usually included after all the multiplying and summing*
    
    - `#BIAS_TERM = number_filters*[0]`
    
    - `# + BIAS_TERM[filter_index]`
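
As a sketch of how the multiply-and-sum plus a per-filter bias might look without explicit loops, the block below uses `np.lib.stride_tricks.sliding_window_view` and `np.einsum`. The names `image`, `kernels_demo`, and `bias_term` are stand-ins with random values, laid out like `train_images[image_index]` and `kernels` above; this is an illustration, not how Keras implements `Conv2D`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins with the same layout as one image and the kernels array above
image = rng.random((32, 32, 3))                  # (height, width, channels)
kernels_demo = rng.random((5, 5, 3, 10)) - 0.5   # (kh, kw, channels, filters)
bias_term = np.zeros(10)                         # hypothetical: ONE bias per filter

# All 5x5 patches at once: shape (28, 28, channels, kh, kw)
patches = np.lib.stride_tricks.sliding_window_view(image, (5, 5), axis=(0, 1))

# Multiply and sum over kernel rows (a), columns (b), and channels (c); then add the bias
output = np.einsum('ijcab,abck->ijk', patches, kernels_demo) + bias_term
print(output.shape)  # (28, 28, 10)

# Spot check against the explicit multiply-and-sum at one location
i, j, k = 3, 7, 2
manual = (image[i:i+5, j:j+5, :] * kernels_demo[..., k]).sum() + bias_term[k]
print(np.isclose(output[i, j, k], manual))  # True
```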


**What does it mean when we start chaining these things?**

## Is *LEARNING* these *KERNELS* hard for NNs... gradients and such?


1. It's just a set of linear transformations; so
2. It can be represented as a matrix multiplication; this also means
3. The partial derivatives (and hence the gradient) with respect to the weights are easy to calculate
4. And can again be represented as a ([transposed](https://stats.stackexchange.com/questions/335332/why-use-matrix-transpose-in-gradient-descent)) matrix multiplication
    - i.e., encoded into the tensor graph and quickly computed by TF

$$\huge 
\begin{align*}
Y'_{ijk} = {}& g_{ijk}(X)\\
= {}& \sum_{c}\sum_{i_0 = 0}^{kw-1}\sum_{j_0 = 0}^{kw-1} X_{(i+i_0)(j+j_0)c} K_{i_0j_0ck}\\
Y_{ijk} = {}& f(g_{ijk}(X)) \\
\frac{\partial Y_{ijk}}{\partial K_{abck}} = {} & \frac{\partial Y_{ijk}}{\partial Y'_{ijk}} \frac{\partial Y'_{ijk}}{\partial K_{abck}}\\
= {}& \frac{\partial Y_{ijk}}{\partial Y'_{ijk}} X_{(i+a)(j+b)c} \\
\frac{\partial}{\partial K_{abck}} \sum_{i,j} Y_{ijk} = {} & \sum_{i,j} \frac{\partial Y_{ijk}}{\partial Y'_{ijk}} \frac{\partial Y'_{ijk}}{\partial K_{abck}}\\
= {}& \sum_{i,j} \frac{\partial Y_{ijk}}{\partial Y'_{ijk}} X_{(i+a)(j+b)c} \\
\end{align*}$$
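
A numerical sanity check of this gradient, as a sketch under stated assumptions: a single filter (so the index $k$ is dropped), a kernel with an explicit channel axis of shape `(kw, kw, channels)`, and `tanh` as a stand-in for the activation $f$. All arrays are random and made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 8, 3))           # small made-up input (height, width, channels)
K = rng.random((3, 3, 3)) - 0.5     # one filter: (kw, kw, channels)
kw = K.shape[0]
n = X.shape[0] - kw + 1             # output width with no padding

def pre_activation(K):
    """Y'_{ij}: multiply-and-sum of each patch with the kernel."""
    return np.array([[(X[i:i+kw, j:j+kw, :] * K).sum() for j in range(n)]
                     for i in range(n)])

def loss(K):
    """Sum over i, j of Y_{ij} = f(Y'_{ij}), with f = tanh as an assumed activation."""
    return np.tanh(pre_activation(K)).sum()

# Analytic gradient: sum_{i,j} f'(Y'_{ij}) * X_{(i+a)(j+b)c}
f_prime = 1 - np.tanh(pre_activation(K)) ** 2
analytic = np.zeros_like(K)
for a in range(kw):
    for b in range(kw):
        analytic[a, b, :] = (f_prime[:, :, None] * X[a:a+n, b:b+n, :]).sum(axis=(0, 1))

# Numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(K)
for idx in np.ndindex(K.shape):
    step = np.zeros_like(K)
    step[idx] = eps
    numeric[idx] = (loss(K + step) - loss(K - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Note that the analytic gradient for each kernel entry is itself a multiply-and-sum of the input against the upstream derivatives, which is why it is as cheap to compute as the forward pass.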

# (Max) Pooling
- Creates local translation invariance
    - which is good when absolute locations are less important than approximate locations

In [None]:
pooling_width = 4

pooling_layer_result = np.zeros(list((np.array(convolutional_layer_output.shape[:2])/pooling_width).astype(int))+[convolutional_layer_output.shape[2]])
print(pooling_layer_result.shape)

i_pooling_layer_result = 0*pooling_layer_result
j_pooling_layer_result = 0*pooling_layer_result

for k in range(number_filters):
    for ii,i in enumerate(range(0, convolutional_layer_output.shape[0], pooling_width)):
        for jj,j in enumerate(range(0, convolutional_layer_output.shape[1], pooling_width)):
            pooling_layer_result[ii,jj,k] = convolutional_layer_output[i:(i+pooling_width),j:(j+pooling_width),k].max()
            i_pooling_layer_result[ii,jj,k] = ii
            j_pooling_layer_result[ii,jj,k] = jj
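
The same non-overlapping max pooling can be sketched with a reshape instead of loops, which also makes the "local translation invariance" easy to see. The helper `max_pool` and the toy `feature_map` below are hypothetical illustrations.

```python
import numpy as np

def max_pool(feature_map, width):
    """Non-overlapping max pooling over the first two axes of (H, W, filters)."""
    h, w, f = feature_map.shape
    h_out, w_out = h // width, w // width
    trimmed = feature_map[:h_out * width, :w_out * width, :]
    blocks = trimmed.reshape(h_out, width, w_out, width, f)
    return blocks.max(axis=(1, 3))

# Made-up feature map with a single strong activation
feature_map = np.zeros((28, 28, 1))
feature_map[9, 9, 0] = 1.0

# Shift the activation by one pixel; it stays inside the same 4x4 pooling cell
shifted = np.roll(feature_map, shift=1, axis=1)

pooled = max_pool(feature_map, 4)
pooled_shifted = max_pool(shifted, 4)
print(pooled.shape)                            # (7, 7, 1)
print(np.array_equal(pooled, pooled_shifted))  # True
```

The invariance is only local: a shift that moves the activation across a pooling cell boundary would change the pooled output.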

In [None]:
filter_index = 8

tmp = [convolutional_layer_output[i:(i+pooling_width),j:(j+pooling_width),filter_index]/255
       for j in range(0,28,8) for i in range(4,28,8)]
tmp_i = [i for j in range(0,28,8) for i in range(4,28,8)]
tmp_j = [j for j in range(0,28,8) for i in range(4,28,8)]
tmp += [convolutional_layer_output[i:(i+pooling_width),j:(j+pooling_width),filter_index]/255
        for i in range(0,28,8) for j in range(4,28,8)]
tmp_i += [i for i in range(0,28,8) for j in range(4,28,8)]
tmp_j += [j for i in range(0,28,8) for j in range(4,28,8)]


jmp = [pooling_layer_result[i:(i+1),j:(j+1),filter_index]/255
       for j in range(0,7,2) for i in range(1,7,2)]
jmp_i = [i for j in range(0,7,2) for i in range(1,7,2)]
jmp_j = [j for j in range(0,7,2) for i in range(1,7,2)]
jmp += [pooling_layer_result[i:(i+1),j:(j+1),filter_index]/255
        for i in range(0,7,2) for j in range(1,7,2)]
jmp_i += [i for i in range(0,7,2) for j in range(1,7,2)]
jmp_j += [j for i in range(0,7,2) for j in range(1,7,2)]

filter_index = 2  # second highlighted filter; the patches built below index with filter_index

tmp = [tmp]
tmp_i = [tmp_i]
tmp_j = [tmp_j]

tmp.append([convolutional_layer_output[i:(i+pooling_width),j:(j+pooling_width),filter_index]/255
       for j in range(0,28,8) for i in range(4,28,8)])
tmp_i.append([i for j in range(0,28,8) for i in range(4,28,8)])
tmp_j.append([j for j in range(0,28,8) for i in range(4,28,8)])
tmp[-1] += [convolutional_layer_output[i:(i+pooling_width),j:(j+pooling_width),filter_index]/255
        for i in range(0,28,8) for j in range(4,28,8)]
tmp_i[-1] += [i for i in range(0,28,8) for j in range(4,28,8)]
tmp_j[-1] += [j for i in range(0,28,8) for j in range(4,28,8)]

jmp = [jmp]
jmp_i = [jmp_i]
jmp_j = [jmp_j]

jmp.append([pooling_layer_result[i:(i+1),j:(j+1),filter_index]/255
       for j in range(0,7,2) for i in range(1,7,2)])
jmp_i.append([i for j in range(0,7,2) for i in range(1,7,2)])
jmp_j.append([j for j in range(0,7,2) for i in range(1,7,2)])
jmp[-1] += [pooling_layer_result[i:(i+1),j:(j+1),filter_index]/255
        for i in range(0,7,2) for j in range(1,7,2)]
jmp_i[-1] += [i for i in range(0,7,2) for j in range(1,7,2)]
jmp_j[-1] += [j for i in range(0,7,2) for j in range(1,7,2)]





In [None]:
fig = plotly.subplots.make_subplots(rows=1,cols=2, specs=[[{'type': 'scene'}, {'type': 'scene'}]])

for filter_index in range(number_filters):
    plot_channel([convolutional_layer_output[...,filter_index]/255], [3*filter_index], 
                 'Greys', 0.1, [1,1])

for filter_index in range(number_filters):
    plot_channel([pooling_layer_result[...,filter_index]/255], [3*filter_index], 
                 'Greys', 0.2, [1,2])


for k,filter_index in enumerate([8,2]):
    plot_channel(tmp[k], [3*filter_index]*len(tmp[k]), px.colors.named_colorscales()[filter_index], 
                 1, [1,1], i=tmp_i[k], j=tmp_j[k])    
    plot_channel(jmp[k], [3*filter_index]*len(jmp[k]), px.colors.named_colorscales()[filter_index], 
                 1, [1,2], i=jmp_i[k], j=jmp_j[k])    
    
fig.show()

- If we used more filters, and ["max pooled" across filters](https://stackoverflow.com/questions/36817868/tensorflow-how-to-pool-over-depth/36853403) we could create "kernel invariance"

![kernel invariance](https://www.programmersought.com/images/568/1d315847250c7ae6ecfddd8928f2ca90.png)
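
A minimal sketch of max pooling across filters rather than across space: the 12-filter `feature_maps` array and the grouping of 3 adjacent filters are made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((7, 7, 12))   # made-up layer output with 12 filters

# Pool groups of 3 adjacent filters into one channel each (hypothetical grouping)
group_size = 3
grouped = feature_maps.reshape(7, 7, 12 // group_size, group_size)
pooled_over_filters = grouped.max(axis=-1)
print(pooled_over_filters.shape)  # (7, 7, 4)
```

Each pooled channel responds if any filter in its group responds, so it no longer matters which of the grouped kernels matched.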

# Coding Lecture
- [TensorFlow CNN Tutorial](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/cnn.ipynb) achieves ~70% accuracy for CIFAR-10
    - See also the [TensorFlow advanced quickstart](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/advanced.ipynb); although, it's not for CIFAR-10
- How might we improve this?
    - Increase the network size?
        - Bigger?  Wider?  Deeper?
            - `padding='same'`?
    - Regularize the network?
        - Dropout?
        - Kernel Shrinkage?
        - Batch Normalization?
        - Data Augmentation?
        - Noise? 
- We can get up to nearly 90% accuracy [like this](https://appliedmachinelearning.blog/2018/03/24/achieving-90-accuracy-in-object-recognition-task-on-cifar-10-dataset-with-keras-convolutional-neural-networks/)
    - That's ["state of the art" for 2015](https://benchmarks.ai/cifar-10)
    - What's the epochs specification?
    - We've changed the application order of the BatchNormalization/ReLU steps following the advice of the book, but it seems [BatchNormalization application order is still an open question](https://www.reddit.com/r/MachineLearning/comments/67gonq/d_batch_normalization_before_or_after_relu/)
    - We've changed `elu` to `relu` but this [might not be the best choice](https://www.reddit.com/r/MachineLearning/comments/6g15si/d_elu_vs_relu_any_new_benchmarks/)



In [None]:
import numpy as np
mean = np.mean(train_images,axis=(0,1,2,3))
std = np.std(train_images,axis=(0,1,2,3))
train_images = (train_images-mean)/(std+1e-7)
test_images = (test_images-mean)/(std+1e-7)


In [None]:
# Data Augmentation
# https://keras.io/api/preprocessing/image/

from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

datagen.fit(train_images)

In [None]:
# https://appliedmachinelearning.blog/2018/03/24/achieving-90-accuracy-in-object-recognition-task-on-cifar-10-dataset-with-keras-convolutional-neural-networks/

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers

weight_decay = 1e-4
model = Sequential()

# Layer 1: Initial Convolution
model.add(Conv2D(32, (3,3), padding='same', input_shape=(32, 32, 3),
                 kernel_regularizer=regularizers.l2(weight_decay)))
#model.add(Activation('elu'))
#model.add(BatchNormalization())
model.add(BatchNormalization())
model.add(Activation('relu'))

# Layer 2: Convolution on Initial Convolution followed by Pooling with Dropout
model.add(Conv2D(32, (3,3), padding='same', 
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
 
# Layer 3: Plain Convolution again

model.add(Conv2D(64, (3,3), padding='same', 
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(BatchNormalization())
model.add(Activation('relu'))
# Layer 4: Convolution on Convolution followed by Pooling with Dropout, again

model.add(Conv2D(64, (3,3), padding='same', 
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.3))
 
# Layer 5: another Plain Convolution
model.add(Conv2D(128, (3,3), padding='same', 
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(BatchNormalization())
model.add(Activation('relu'))

# Layer 6: another Convolution-Pooling-Dropout layer

model.add(Conv2D(128, (3,3), padding='same', 
                 kernel_regularizer=regularizers.l2(weight_decay)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.4))
 
# Layer 7: final
model.add(Flatten())
model.add(Dense(10))
 


In [None]:
from keras.callbacks import LearningRateScheduler


def lr_schedule(epoch):
    lrate = 0.0005
    if epoch > 75:
        lrate = 0.00025
    if epoch > 125:
        lrate = 0.00001
    return lrate    

In [None]:
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, decay=1e-6),#'adam', 
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),#tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])


history = model.fit(datagen.flow(train_images,  train_labels, batch_size=64), 
                    epochs=150, validation_data=(test_images, test_labels),
                    callbacks=[LearningRateScheduler(lr_schedule)])

In [None]:
# SPARSE REPRESENTATION WON'T WORK
# WE NEED TO CONVERT TO CATEGORICAL (run this cell before the model.fit cell above)
# https://keras.io/api/preprocessing/image/
# https://appliedmachinelearning.blog/2018/03/24/achieving-90-accuracy-in-object-recognition-task-on-cifar-10-dataset-with-keras-convolutional-neural-networks/
# tf.keras.utils.to_categorical avoids the np_utils module, which newer Keras releases no longer provide
train_labels = tf.keras.utils.to_categorical(train_labels, 10)
test_labels = tf.keras.utils.to_categorical(test_labels, 10)


# After reading ch 9 you should be aware of


## the key details and specifics of CNNs

- each convolutional kernel extracts a specific feature
- by CNN we actually mean the application of multiple parallel different kernels
- *striding* is often used to downsample the created features, which can of course be computationally beneficial
- *zero-padding* can keep the network from shrinking with each layer
    - *valid* convolutions don't zero pad so each output is based on the same number of inputs and they all behave "regularly" in this sense
    - *same* convolutions keep the output shape the same as the input shape, so the network can be arbitrarily deep, but also, edge/border inputs are mapped to fewer nodes in the next layer and so are less represented in the model
    - *full* "overpads" from the get go to make sure each input is equitably passed into the next layer in terms of its number of connections, but this means at the next layer the edge units are based on different numbers of input units... 
        - the last point causes the third option to not usually perform particularly well, so somewhere between the first two options is generally preferred...
        
- biases are typically attached according to the natural structures of the convolutional architecture, but they could also be set to be unique at each locality, for example. 

    - *unshared* convolutions are a way to parameterize local structure feature extraction, i.e., create location specific features
    - *tiled* convolutions are another variant where parameters governing connections are selectively specified according to some predefined patterns
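
The three padding options can be compared in 1D with `np.convolve`, whose `mode` argument happens to use the same names. This is a sketch with a made-up `signal` and `kernel`; the helper `connections` is hypothetical and simply counts how many outputs each input position feeds into.

```python
import numpy as np

signal = np.arange(1., 11.)   # made-up input of length 10
kernel = np.ones(3)           # made-up kernel of width 3

# Output length under each padding mode
output_lengths = {mode: len(np.convolve(signal, kernel, mode=mode))
                  for mode in ('valid', 'same', 'full')}
print(output_lengths)  # {'valid': 8, 'same': 10, 'full': 12}

def connections(mode, n=10, k=3):
    """Number of outputs that each input position contributes to."""
    counts = np.zeros(n, dtype=int)
    for pos in range(n):
        impulse = np.zeros(n)
        impulse[pos] = 1.0
        counts[pos] = np.count_nonzero(np.convolve(impulse, np.ones(k), mode=mode))
    return counts

print(connections('valid'))  # [1 2 3 3 3 3 3 3 2 1]
print(connections('same'))   # [2 3 3 3 3 3 3 3 3 2]
print(connections('full'))   # [3 3 3 3 3 3 3 3 3 3]
```

Only *full* connects every input to the same number of outputs, at the cost of a growing output; *valid* shrinks the output, and *same* preserves the size while under-representing the border inputs.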



   
## what the Benefits of CNNs are

- *sparse interactions* (*sparse connectivity* or *sparse weights*): when inputs to the next hidden layer are defined by a kernel that is smaller than the inputs 
- *parameter sharing* (*tied weights*): when the weights applied at each application of the kernel are the same (i.e. repeatedly re-used) during each application of the kernel (i.e., as the kernel is "swept" over the input) 
- *equivariant representations*:
    - convolution $f$ is *equivariant* to translation $g$ since (for image $x$) translating the convolution $g(f(x))$ is the same as the convolution of the translation $f(g(x))$: $g(f(x)) = f(g(x))$ 
    - convolution is not *equivariant* to other transformations like scaling/stretching and rotating.
    - sometimes we *don't* actually want *translation equivariance*... e.g., the useful features might be different in different parts of an image
- *invariance*:
    - *"Pooling"* (after the initial convolution and the so-called "detector stage", i.e., activations on the convolution) aggregates "local" convolutions in close proximity, which results in the output being *invariant* to small translations of the inputs
        - some features, e.g., eyes, are like this -- *exact* location is not important; however, other targets, e.g., lines forming a corner, can benefit from exact locations 
    - *max pooling* takes the largest of the activation values, but averaging or an $L^2$ norm of the local (within some rectangle) activation values are other choices
    - pooling is of course also computationally advantageous if the local regions are *strided* so as to form a partition, with pooling then acting as a local reducer function
    - we could also apply pooling over the outputs of different convolutions: this would mean that these convolutions can all measure the same output differently (e.g., when max pooling always just picks the max of them), so the output would then be *invariant to the different convolutions*
    
- **variable sized inputs**: since convolution is "swept" over or across the input, the input shapes don't initially matter in that you can still extract features across the input using a kernel; and using a predefined fixed number of pooling partitions, with implied regions determined by relative percentages that rescale for different input sizes, means your pooling function will always produce the same sized output 
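
The equivariance property $g(f(x)) = f(g(x))$ can be checked numerically. The sketch below works in 1D with wrap-around borders so the shift is exact; `circular_cross_correlate` and `translate` are hypothetical helpers, and a mirror flip is used as a simple stand-in for the non-translation transformations.

```python
import numpy as np

def circular_cross_correlate(x, kernel):
    """1D cross-correlation with wrap-around borders, so shifts are exact."""
    return sum(w * np.roll(x, -offset) for offset, w in enumerate(kernel))

def translate(x, shift=2):
    """The translation g: move every value `shift` positions to the right (wrapping)."""
    return np.roll(x, shift)

rng = np.random.default_rng(0)
x = rng.random(16)                   # made-up 1D "image"
kernel = np.array([1., -2., 0.5])    # made-up asymmetric kernel

conv_then_shift = translate(circular_cross_correlate(x, kernel))   # g(f(x))
shift_then_conv = circular_cross_correlate(translate(x), kernel)   # f(g(x))
print(np.allclose(conv_then_shift, shift_then_conv))  # True

# A mirror flip does not commute with this asymmetric kernel
flip_after = circular_cross_correlate(x, kernel)[::-1]
flip_before = circular_cross_correlate(x[::-1], kernel)
print(np.allclose(flip_after, flip_before))           # False
```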

### the Interpretation of  CNNs as a Bayesian *prior*
- CNNs simply implement useful prior knowledge that surely can help, e.g.:
    - translation invariance prior
    - feature extraction locality prior 
    - whereas they do not implement a *permutation invariant* prior (i.e., as a fully-connected network does) which can theoretically learn topology (the relationship between the different parts of the input) without relying upon natural input structure (e.g., images permuted in the same manner)



## what *convolutions* are

- Convolutions are useful for grid-like topology... 1D, 2D, etc....
- "Convolutions" are mathematical operations that (in discrete, finite contexts) can be computed as particular forms of matrix multiplication
    - NN "Convolutions" actually though tend to (more technically correctly) refer to *cross-correlations*
    - The traditional definition of a convolution of the *input function* $f(t)$ with *kernel function* $k(r)$ is the *feature map* $m(t_0) = \int f(t) k(t_0-t) dt$ where the *kernel function* is evaluated at the displaced value $t_0-t$ relative to the *input function* which is evaluated at $t$
        - this effectively makes $t_0$ "the origin" and "weights" values $f(t)$ by how far they are "before the origin" $t_0-t$ according to the "(kernel) weighting function" $k(t_0-t)$
        - this means $f(t)$ is integrated against and over every value of $k(r)$ evaluated relative to an origin point $t_0$
    - Proper convolutions following this definition are usually written in shorthand as $m(t) = (f*k)(t)$
    - In practical applied settings *convolutions* will be evaluated in a discrete manner $m(t_0) = (f*k)(t_0) = \sum_t f(t) k(t_0-t)$
        - Over higher dimensions this looks like this:  $m(t_0,s_0) = (f*k)(t_0,s_0) = \sum_t \sum_s f(t,s) k(t_0-t, s_0-s)$
- *Cross-correlation* (which is actually what is often meant by "convolution" in NN contexts) is a slight variant of this: $(f*k)(t_0,s_0) = \sum_t \sum_s f(t_0+t,s_0+s) k(t, s)$
    
    - Implementing these *cross-correlations* as matrix multiplications entails usage of a special *Toeplitz matrix* [in which "each row of the matrix is constrained to be equal to the row above shifted by one element"] as the "kernel" matrix. 
    - A two-dimensional matrix which implements a discrete convolutional calculation is called a *doubly block circulant matrix*
    - Because these sorts of matrices are often very sparse, these matrix multiplications are done in a numerically efficient manner:
        - sparse matrix multiplication can avoid all the "0" multiplications, and memory referencing values in the sparse matrix can reuse repeated values of the same kernels from the same location in memory as opposed to storing the same values everywhere in the sparse matrix
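
A small sketch of the Toeplitz construction for the 1D, no-padding case; the `signal` and `kernel` values are made up, and the dense matrix is built only for illustration (a real implementation would not materialize the zeros).

```python
import numpy as np

signal = np.array([1., 2., 3., 4., 5., 6.])   # made-up input
kernel = np.array([1., 0., -1.])               # made-up kernel
n, k = len(signal), len(kernel)

# Each row holds the kernel, shifted one element to the right of the row above
toeplitz = np.zeros((n - k + 1, n))
for row in range(n - k + 1):
    toeplitz[row, row:row + k] = kernel

as_matmul = toeplitz @ signal
as_sliding = np.correlate(signal, kernel, mode='valid')
print(toeplitz)
print(np.allclose(as_matmul, as_sliding))  # True
```

The matrix has 24 entries but only 3 distinct nonzero values, which is the sparsity and parameter reuse described above.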
        
## some further considerations regarding *convolutions* 

- gradients of neural networks are implemented with similar, but distinct, formulas to the "convolution" formula (page 351)
- convolutions whose outputs carry labels on the pixels produce mask and object detection models
    - choices about output size should be handled intentionally with padding
    - An RNN structure can be used to repeatedly refine masking predictions as given in figure 9.17
- Gabor functions are a model for kernels that seem to model human V1 vision cells       
- learning, rather than predefining pooling in network architectures seems potentially interesting
    - pooling can present specific challenges within certain network contexts, which will be addressed later    


        
## how very computationally demanding CNNs are

- Computational efficiency can be very important
    - Fourier transforms can help speed things up
    - *Separable* kernels can be processed univariately and as well are faster that way, when applicable
    - instead of the computation needed for fitting these things, sometimes kernels might just be specified randomly (which works surprisingly approximately well, perhaps at least for comparing potential competing architectures), designed by hand, or they can be created in an unsupervised manner
        - e.g., K-means on the local data within a kernel region can be used as the kernel matrices at those locations
            - this might provide regularization; or, it might just enable larger, richer network architectures
    - training layers in isolation sequentially (i.e., greedy pretraining, see ch 8) can be another potential strategy to help fit layers, or even small patches can be used to fit the kernel without even using the full model (although these approaches are less popular these days given the increasingly powerful compute available)
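
To illustrate the *separable* kernel point, here is a sketch showing that a 2D kernel formed as an outer product of two 1D kernels gives the same result as two 1D passes. The image is random, the kernel is a Sobel-style edge filter chosen for illustration, and `correlate2d_valid` is a hypothetical loop-based helper.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((12, 12))         # made-up single-channel image

col = np.array([1., 2., 1.])         # vertical 1D kernel
row = np.array([-1., 0., 1.])        # horizontal 1D kernel
kernel_2d = np.outer(col, row)       # separable 3x3 kernel (Sobel-style edge filter)

def correlate2d_valid(img, kernel):
    """Plain 2D valid cross-correlation with explicit loops."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

full_2d = correlate2d_valid(image, kernel_2d)

# Two 1D passes: along each row first, then along each column
rows_pass = np.apply_along_axis(np.correlate, 1, image, row, mode='valid')
two_pass = np.apply_along_axis(np.correlate, 0, rows_pass, col, mode='valid')

print(np.allclose(full_2d, two_pass))  # True
```

For a $k \times k$ separable kernel this replaces $k^2$ multiplications per output with roughly $2k$, which is where the speedup comes from.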
    
