# 8 Convolutional neural networks
The goal of this exercise is to learn the basics of [convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNN or ConvNet). In the previous exercises the building blocks were mostly simple operations followed by some kind of activation function, and each layer was usually fully connected to the previous one. CNNs take into account the spatial nature of the input data, e.g. an image, and they process it by applying one or more [kernels](https://en.wikipedia.org/wiki/Kernel_%28image_processing%29). In the case of images, this processing, i.e. convolving, is also known as filtering. Processing the input with a single kernel produces a single channel, but a convolutional layer usually involves several kernels, which then produce several channels. These channels are often called **feature maps** because each kernel specializes in extracting a certain kind of feature from the input. The feature maps are then combined into a single tensor that can be viewed as an image with multiple channels and passed on to further convolutional layers.

For example, if the input is a grayscale image, i.e. an image with only one channel, and a $5\times 5$ kernel is applied, the result is a single feature map. The borders of the input image are usually padded with zeros to ensure that the resulting feature map has the same number of rows and columns as the input image.
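The effect of zero padding on the size of a feature map can be checked with a little arithmetic. The sketch below is only an illustration: the helper names `conv_output_size` and `same_padding` are made up for this example, and it assumes a stride of 1 and an odd kernel size.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Number of output rows (or columns) of a convolution along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Zeros to add on each side so that a stride-1 convolution
    with an odd kernel size keeps the input size."""
    return (kernel_size - 1) // 2


# a 28 X 28 image and a 5 X 5 kernel without padding: the map shrinks
print(conv_output_size(28, 5))                   # 24
# the same kernel with 2 zeros added on each side: the size is preserved
print(conv_output_size(28, 5, same_padding(5)))  # 28
```

Without padding every $5\times 5$ convolution would remove 4 rows and 4 columns, so a deep stack of layers would quickly run out of pixels.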

If the input is a color image, i.e. an image with three channels, and a $5\times 5$ kernel is applied, what is actually applied is a $5\times 5\times 3$ kernel that processes all three channels simultaneously, and the result is again a single feature map. However, if e.g. 16 kernels are applied, the result is 16 feature maps. Should they be passed to another convolutional layer, **each** of its kernels would simultaneously process **all** feature maps, so their sizes would be e.g. $3\times 3\times 16$ or $5\times 5\times 16$, where the 16 is needed to reach all feature maps simultaneously.
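The shapes involved can be made concrete with a naive NumPy implementation. This is a minimal sketch and not how TensorFlow implements convolution: the function name `conv2d_same` and the random test data are assumptions made for this example. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np


def conv2d_same(image, kernels, biases):
    """Naive stride-1 convolution with zero padding.

    image:   (rows, cols, in_channels)
    kernels: (k, k, in_channels, n_kernels), k odd
    biases:  (n_kernels,)
    returns: (rows, cols, n_kernels), one feature map per kernel
    """
    k = kernels.shape[0]
    pad = (k - 1) // 2
    # pad only the two spatial axes with zeros, not the channel axis
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    rows, cols, _ = image.shape
    out = np.zeros((rows, cols, kernels.shape[3]))
    for r in range(rows):
        for c in range(cols):
            patch = padded[r:r + k, c:c + k, :]  # (k, k, in_channels)
            # every kernel sees all input channels of the patch at once
            out[r, c, :] = np.tensordot(patch, kernels, axes=3) + biases
    return out


rng = np.random.default_rng(0)
color_image = rng.random((28, 28, 3))

# 16 kernels of size 5 X 5 X 3 give 16 feature maps
maps = conv2d_same(color_image, rng.standard_normal((5, 5, 3, 16)), np.zeros(16))
print(maps.shape)    # (28, 28, 16)

# each kernel of the next layer has to span all 16 feature maps
second = conv2d_same(maps, rng.standard_normal((3, 3, 16, 32)), np.zeros(32))
print(second.shape)  # (28, 28, 32)
```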

The convolution is usually followed by an element-wise non-linear operation applied to each of the values in the feature maps. What often follows is the summarization, i.e. pooling, of the information in the feature maps in order to reduce the spatial dimensions and keep only the more important information. A common approach used here is the so-called max pooling. It is a non-linear downsampling where the input is divided into a set of non-overlapping rectangles and for each of them only the maximum value inside of it is kept.

![Max pooling](cnn_img/max_pooling_2x2.png)
<center>Figure 1. Max pooling with $2\times 2$ rectangles (taken from [Wikipedia](https://en.wikipedia.org/wiki/File:Max_pooling.png)).</center>
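The two steps described above, the element-wise non-linearity and max pooling with $2\times 2$ rectangles, can be sketched in NumPy as follows. The helper names `relu` and `max_pool_2x2` and the example values are made up for this illustration, and the sketch assumes that the feature map has an even number of rows and columns.

```python
import numpy as np


def relu(feature_map):
    """Element-wise non-linearity: negative values are replaced by zero."""
    return np.maximum(feature_map, 0)


def max_pool_2x2(feature_map):
    """Keeps the maximum of every non-overlapping 2 X 2 rectangle."""
    rows, cols = feature_map.shape
    # blocks[i, a, j, b] == feature_map[2*i+a, 2*j+b]
    blocks = feature_map.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[ 1, -1,  2,  4],
                        [ 5,  6, -7,  8],
                        [-3,  2,  1,  0],
                        [ 1,  2, -3, -4]])

pooled = max_pool_2x2(relu(feature_map))
print(pooled)
# [[6 8]
#  [2 1]]
```

The $4\times 4$ map is reduced to $2\times 2$, i.e. to a quarter of the original number of values.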

After several convolutional layers, the values of all feature maps are usually put into a single vector, which is then passed on to fully connected or other kinds of layers.

The number of parameters in a convolutional layer depends on the number of input feature maps, the number of kernels, and the sizes of the kernels. For example, if a convolutional layer with 32 kernels of nominal size $3\times 3$ receives 16 feature maps on its input, it will require $16\times 3\times 3\times 32+32=4640$ parameters, where the last 32 covers the kernel biases.
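The same count can be written as a small function, which makes it easy to try other configurations. The function name `conv_layer_parameters` is hypothetical and the sketch assumes square kernels with one bias per kernel.

```python
def conv_layer_parameters(in_channels, kernel_size, n_kernels):
    """Number of trainable parameters of a convolutional layer
    with square kernels and one bias per kernel."""
    weights = in_channels * kernel_size * kernel_size * n_kernels
    biases = n_kernels
    return weights + biases


# the example from the text: 16 input feature maps, 32 kernels of size 3 X 3
print(conv_layer_parameters(16, 3, 32))  # 4640
# a single 5 X 5 kernel on a grayscale image
print(conv_layer_parameters(1, 5, 1))    # 26
# a single 5 X 5 kernel on a color image, i.e. a 5 X 5 X 3 kernel
print(conv_layer_parameters(3, 5, 1))    # 76
```

Note that the count does not depend on the number of rows and columns of the input, because the same kernel weights are reused at every position.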


## 8.1 The MNIST dataset revisited (2)
In one of the previous exercises the MNIST dataset was used to demonstrate the use of a multilayer perceptron. Here we are going to apply a convolutional neural network to the problem of digit classification. We will use the following layers to build our model:

* [tf.nn.relu](https://www.tensorflow.org/api_docs/python/tf/nn/relu)
* [tf.layers.conv2d](https://www.tensorflow.org/api_docs/python/tf/layers/conv2d)
* [tf.layers.max_pooling2d](https://www.tensorflow.org/api_docs/python/tf/layers/max_pooling2d)
* [tf.layers.dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense)

The [tf.layers.dense](https://www.tensorflow.org/api_docs/python/tf/layers/dense) layer has the same effect as the matrix multiplication of the fully connected layers that were used in the previous exercise with the MNIST dataset.
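As a rough NumPy illustration of what a dense layer without an activation computes, the sketch below flattens a batch of feature maps into vectors and multiplies them by a weight matrix. The shapes mirror the model used in this exercise (64 feature maps of size $7\times 7$ and 128 fully connected units), but the random values and variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# a batch of 2 examples, each with 64 feature maps of size 7 X 7
feature_maps = rng.random((2, 7, 7, 64))

# putting the values of all feature maps into a single vector per example
flat = feature_maps.reshape(2, -1)
print(flat.shape)    # (2, 3136)

# a dense layer is a matrix multiplication followed by adding the biases
n_units = 128
weights = rng.standard_normal((flat.shape[1], n_units)) * 0.01
biases = np.zeros(n_units)
output = flat @ weights + biases
print(output.shape)  # (2, 128)
```

Every output unit is connected to all 3136 inputs, which is why this layer alone holds $3136\times 128+128$ parameters.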

**Tasks**

1. Study and run the code below. How does the accuracy compare to the ones obtained in the previous exercises with MNIST?
2. Try to change the number and size of convolutional and fully connected layers. What has the greatest impact on the accuracy?
3. What happens to the accuracy if another non-linearity is used instead of ReLU?

In [None]:
#use MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets("mnist/", one_hot=True)

import tensorflow as tf

#settings
learning_rate=0.001
training_epochs_count=5
batch_size=100
batches_count=int(mnist.train.num_examples/batch_size)
display_step=1

activation_function=tf.nn.relu
optimizer_type=tf.train.AdamOptimizer

#architecture
input_size=784
n_channels_1=32
n_channels_2=64
n_classes=10
n_fully_connected=128
kernel_size=5

#data input
x=tf.placeholder(tf.float32, [None, input_size])

#reshaping the input to its image form so that we can apply convolution
layer=tf.reshape(x, [-1, 28, 28, 1])
y=tf.placeholder(tf.float32, [None, n_classes])

#first convolutional layer
#we will apply n_channels_1 kernels of size kernel_size X kernel_size
#we are padding the input in order for the result to have the same number of rows and columns
layer=tf.layers.conv2d(layer, n_channels_1, kernel_size, padding="SAME")
#applying the non-linearity chosen in the settings
layer=activation_function(layer)
#now we downsample the feature maps from 28 X 28 to 14 X 14
layer=tf.layers.max_pooling2d(layer, 2, 2)

#second convolutional layer
#we will apply n_channels_2 kernels of size kernel_size X kernel_size
layer=tf.layers.conv2d(layer, n_channels_2, kernel_size, padding="SAME")
#again, we apply the non-linearity chosen in the settings
layer=activation_function(layer)
#and max pooling again, now each feature map will be of size 7 X 7
layer=tf.layers.max_pooling2d(layer, 2, 2)

#we have n_channels_2 maps of size 7 X 7
#now reshape them into a single vector
layer=tf.reshape(layer, [-1, 7*7*n_channels_2])
#a fully connected layer
layer=tf.layers.dense(layer, n_fully_connected)
#non-linearity
layer=activation_function(layer)

#final classification
y_predicted=tf.layers.dense(layer, n_classes)

cost=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_predicted, labels=y))
optimizer=optimizer_type(learning_rate=learning_rate).minimize(cost)

session=tf.Session()
session.run(tf.global_variables_initializer())

correct_y_predicted=tf.equal(tf.argmax(y_predicted, 1), tf.argmax(y, 1))
accuracy=tf.reduce_mean(tf.cast(correct_y_predicted, tf.float32))

for epoch in range(training_epochs_count):
    for i in range(batches_count):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        session.run(optimizer, feed_dict={x:batch_x, y:batch_y})
    if ((epoch+1)%display_step==0):
        print("Epoch #"+str(epoch+1)+" "+str(session.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})))

session.close()

## 8.2 Image classification
Image classification is a challenging computer vision problem, with the best-known competition being [The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)](http://www.image-net.org/challenges/LSVRC/), which is based on the ImageNet dataset with over a million training images that are typically resized to around $224\times 224$ before being passed to a network. The class names in one of the tasks there can be found [here](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a). One of the most important breakthroughs came in 2012, when the convolutional neural network [AlexNet](https://en.wikipedia.org/wiki/AlexNet) won first place. Since then many highly successful convolutional neural network architectures have been proposed, e.g. [VGG-16](https://arxiv.org/abs/1409.1556), [VGG-19](https://arxiv.org/abs/1409.1556), [ResNet](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), [Inception](https://arxiv.org/abs/1409.4842), etc. Training such networks requires a lot of time because they have many layers with millions of parameters. In this exercise we are going to experiment with pre-trained models of some of the best-known architectures. In order to keep things simple, we are going to use [Keras](https://keras.io/), *"a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano."* To install Keras, it is enough to type
```
conda install keras
```
in your command prompt/terminal. Alternatively, you can type
```
pip install keras --upgrade
```
and, should there be any error, first type
```
conda install pip
```
to refresh your pip and then repeat the pip command. Keras already includes implementations of many well-known architectures. Let's first try to classify some images.

### 8.2.1 Using pre-trained models
Try running the following code:

In [None]:
from keras.preprocessing import image
import numpy as np

#choose the architecture
architecture="resnet"
#architecture="vgg16"
#architecture="vgg19"
#architecture="inceptionv3"

#every architecture module has its own preprocess_input and decode_predictions
#target_size is the input size that the pre-trained ImageNet weights expect
if (architecture=="resnet"):
    from keras.applications.resnet50 import ResNet50
    from keras.applications.resnet50 import preprocess_input, decode_predictions
    model=ResNet50(weights="imagenet")
    target_size=(224, 224)
elif (architecture=="vgg16"):
    from keras.applications.vgg16 import VGG16
    from keras.applications.vgg16 import preprocess_input, decode_predictions
    model=VGG16(weights="imagenet")
    target_size=(224, 224)
elif (architecture=="vgg19"):
    from keras.applications.vgg19 import VGG19
    from keras.applications.vgg19 import preprocess_input, decode_predictions
    model=VGG19(weights="imagenet")
    target_size=(224, 224)
elif (architecture=="inceptionv3"):
    from keras.applications.inception_v3 import InceptionV3
    from keras.applications.inception_v3 import preprocess_input, decode_predictions
    model=InceptionV3(weights="imagenet")
    #Inception-v3 expects 299 X 299 input images
    target_size=(299, 299)
else:
    raise ValueError("Unknown architecture: "+architecture)

#images to be classified
image_paths=["cnn_img/badger.jpg", "cnn_img/rabbit.jpg", "cnn_img/sundial.jpg", "cnn_img/pineapple.jpg", "cnn_img/can.jpg"]
for path in image_paths:
    #loading the image and rescaling it to fit the input size of the chosen architecture
    img=image.load_img(path, target_size=target_size)
    x=image.img_to_array(img)
    #the network expects a batch of images, so we add the batch dimension
    x=np.expand_dims(x, axis=0)
    x=preprocess_input(x)

    print("Processing image "+path+"...")
    predictions=model.predict(x)
    #printing the name of the most probable class
    print("\t"+decode_predictions(predictions, top=1)[0][0][1])

**Tasks**
1. Is there any significant difference between the results of different architectures?
2. Try to classify several other images that you choose on your own. Which cases are problematic?

### 8.2.2 Creating your own classifier - pincers vs. scissors
Although ImageNet has a lot of classes, sometimes they do not cover some desired cases. Let's assume that we want to tell images with pincers apart from the ones with scissors. Neither pincers nor scissors are among ImageNet classes. Nevertheless, we can still use some parts of the pre-trained models.

Various layers of a deep convolutional network have different tasks. The ones closest to the original input image usually look for low-level features such as edges and corners. After them there are layers that look for middle-level features such as circular objects, special curves, etc. Next, there are usually fully connected layers that create high-level semantic features by combining the information from the previous layers. These features are then used by the last layer, which performs the actual classification. What we can do here is simply discard the last layer, i.e. not calculate the class of an image, but extract the values in one of the fully connected layers instead. This effectively means that we are going to use the network only as an extractor of high-level features that we would hardly be able to engineer on our own. Let's first see which layers can be found in the VGG-16 network:


In [None]:
#VGG-16 is used here because the text refers to its fully connected layers fc1 and fc2
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
import numpy as np

base_model=VGG16(weights="imagenet")

for layer in base_model.layers:
    print(layer.name)

At the end you can see fc1 and fc2, which stand for fully connected layers. For example, we can extract the values of fc2 by using the following code:

In [None]:
#the last layer before the classification layer
model=Model(inputs=base_model.input, outputs=base_model.layers[-2].output)

img_path="cnn_img/rabbit.jpg"
img=image.load_img(img_path, target_size=(224, 224))
x=image.img_to_array(img)
x=np.expand_dims(x, axis=0)
x=preprocess_input(x)

features=model.predict(x)
print(features.shape)
feature_layer_size=features.shape[1]

These values can now be used as features with another classifier. Let's first extract the features for our pincers and scissors images.

In [None]:
def create_numbered_paths(home_dir, n):
    return [home_dir+str(i)+".jpg" for i in range(n)]

def create_paired_numbered_paths(first_home_dir, second_home_dir, n):
    image_paths=[]
    for p in zip(create_numbered_paths(first_home_dir, n), create_numbered_paths(second_home_dir, n)):
        image_paths.extend(p)
    return image_paths
        
def create_features(paths, verbose=True):
    n=len(paths)
    #one row of features for every image
    features=np.zeros((n, feature_layer_size))
    for i in range(n):
        if verbose:
            print("\t%2d / %2d"%(i+1, n))
        img=image.load_img(paths[i], target_size=(224, 224))
        img=image.img_to_array(img)
        img=np.expand_dims(img, axis=0)
        #the image has to be preprocessed before it is passed to the network
        img=preprocess_input(img)
        features[i, :]=model.predict(img)[0]
    
    return features

pincers_dir="cnn_img/pincers/"
scissors_dir="cnn_img/scissors/"

individual_n=50

#combining all image paths
image_paths=create_paired_numbered_paths(pincers_dir, scissors_dir, individual_n)

#marking their classes
image_classes=[]
for i in range(individual_n):
    #0 stands for the pincers image and 1 stands for the scissors image
    image_classes.extend((0, 1))

#number of all images
n=100
#number of training images
n_train=50
#number of test images
n_test=n-n_train

print("Creating training features...")
#here we will store the features of training images
x_train=create_features(image_paths[:n_train])
#train classes
y_train=np.array(image_classes[:n_train])

print("Creating test features...")
#here we will store the features of test images
x_test=create_features(image_paths[n_train:])
#test classes
y_test=np.array(image_classes[n_train:])

The images have already been divided into a training and a test set above, and for each image we now have its features. Next we will use a linear SVM classifier to classify them.

In [None]:
from sklearn import svm

def create_svm_classifier(x, y):
    #we will use linear SVM
    C=1.0
    classifier=svm.SVC(kernel="linear", C=C)
    classifier.fit(x, y)
    return classifier

def calculate_accuracy(classifier, x, y):
    predicted=classifier.predict(x)
    return np.sum(y==predicted)/y.size

#training the model
classifier=create_svm_classifier(x_train, y_train)

#checking the model's accuracy
print("Accuracy: %.2lf%%"%(100*calculate_accuracy(classifier, x_test, y_test)))

**Tasks**

1. How small does the training set have to be for the accuracy to drop significantly?
2. Is there any significant gain if more complex SVM models are used?
3. What happens if we extract features from another layer, e.g. fc1?

### 8.2.3 Creating your own classifier - healthy vs. unhealthy food
The previous example was relatively simple because all images were of the same size and each of them had a white background, which allowed the extractor to concentrate only on the features of the actual objects. In this example we will use a slightly more complicated case: we will tell images of healthy food apart from the ones of unhealthy food. First let's repeat the same process as in the previous example and create the features:

In [None]:
healthy_dir="cnn_img/healthy/"
unhealthy_dir="cnn_img/unhealthy/"

individual_n=100

#combining all image paths
image_paths=create_paired_numbered_paths(healthy_dir, unhealthy_dir, individual_n)

#marking their classes
image_classes=[]
for i in range(individual_n):
    #0 stands for the healthy food image and 1 stands for the unhealthy food image
    image_classes.extend((0, 1))

#number of all images
n=200
#number of training images
n_train=100
#number of test images
n_test=n-n_train

print("Creating training features...")
#here we will store the features of training images
x_train=create_features(image_paths[:n_train])
#train classes
y_train=np.array(image_classes[:n_train])

print("Creating test features...")
#here we will store the features of test images
x_test=create_features(image_paths[n_train:])
#test classes
y_test=np.array(image_classes[n_train:])

Now let's train a model and test its accuracy:

In [None]:
classifier=create_svm_classifier(x_train, y_train)
print("Accuracy: %.2lf%%"%(100*calculate_accuracy(classifier, x_test, y_test)))

**Tasks**
1. What is the effect of choosing some other layers for feature extraction?
2. Try the whole food classification with another network as feature extractor.
3. What kind of test images are problematic?