# Lecture 8
# Deep Learning for Computer Vision

## Computer vision
Most of the early successes of deep learning were computer vision (CV) applications
* ImageNet classification (2011 – )
    * 1000 classes
    * 1 million examples

Other computer vision applications
* OCR, image search, autonomous driving, medical diagnosis, etc.

A convolutional network (convnet) is often used in CV tasks
* Cf. a densely connected network (DCN), also called a fully connected network (FCN)

**First example of convnet**

In [1]:
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(28, 28, 1)) ## (height, width, channels); different from the input of a densely connected network
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary() ## Check the output of the last conv layer

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (N

In [2]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)) ## Check the shape
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1)) ## Check the shape
test_images = test_images.astype("float32") / 255
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(train_images, train_labels, epochs=5, batch_size=64)


Epoch 1/5



Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Dense layers learn global patterns that involve every input pixel; conv layers learn local patterns found within a small window (e.g., a 3 x 3 kernel)
* Assumption: an image can be broken into local patterns such as edges, textures, etc.
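
To make "local pattern within the window" concrete, here is a minimal NumPy sketch of what a single filter computes. The helper `conv2d_valid`, the toy image, and the hand-picked vertical-edge kernel are illustrative assumptions, not part of the lecture code; a real `Conv2D` layer learns its kernel values during training. Like Keras, the sketch slides the kernel without flipping it (cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]   # local 3 x 3 patch
            out[i, j] = np.sum(window * kernel)  # one output value per patch
    return out

# Toy 6 x 6 image: dark left half, bright right half
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Hand-picked kernel that responds to dark-to-bright vertical edges
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=np.float32)

response = conv2d_valid(image, vertical_edge)
print(response.shape)  # (4, 4)
print(response[0])     # [0. 3. 3. 0.]
```

The response is nonzero only where the window straddles the edge, which is the sense in which a conv filter detects a local pattern.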

Key characteristics of convnets:

* Translation-invariant: a pattern learned at one location can be recognized at any other location in the image
* Can learn spatial hierarchies of patterns
* Efficient learning of increasingly complex and abstract concepts
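
The translation property can be checked with a small experiment. The sketch below is an illustration under assumed toy data (an 8 x 8 canvas, a plus-shaped pattern, and a helper `conv2d_valid` that is not in the lecture code): the same filter is applied to two images that contain the same pattern at different positions. Strictly speaking, the conv output shifts along with the pattern; taking the maximum over the whole feature map then gives a location-independent score.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def peak_location(feature_map):
    """Return (row, col) of the strongest response as plain ints."""
    idx = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return tuple(int(v) for v in idx)

# A plus-shaped pattern, used both as the object and as the filter
pattern = np.array([[0, 1, 0],
                    [1, 1, 1],
                    [0, 1, 0]], dtype=np.float32)

image_a = np.zeros((8, 8), dtype=np.float32)
image_a[0:3, 0:3] = pattern          # pattern in the top-left corner

image_b = np.zeros((8, 8), dtype=np.float32)
image_b[4:7, 3:6] = pattern          # same pattern, moved elsewhere

resp_a = conv2d_valid(image_a, pattern)
resp_b = conv2d_valid(image_b, pattern)

print(peak_location(resp_a), resp_a.max())  # (0, 0) 5.0
print(peak_location(resp_b), resp_b.max())  # (4, 3) 5.0
```

One set of weights detects the pattern in both places, whereas a dense layer would have to learn it separately for each position.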

Convolution operates over rank-3 tensors, called feature maps
* Feature map: height x width x depth (or channel)
* Feature map for the input layer: for MNIST, 28 x 28 x 1 and for CIFAR10, 32 x 32 x 3
    * depth is 1 for grayscale images and 3 for RGB images
* Output is also a rank-3 feature map, usually of a different shape
    * Height and width can change (e.g., 28 x 28 x 1 becomes 26 x 26 x 32 after the first conv layer)
    * Depth is the number of filters/kernels in the conv layer
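
The shapes and parameter counts printed by `model.summary()` can be reproduced by hand. The sketch below assumes the Keras defaults used in the first example (no padding, stride 1 for `Conv2D`, and `MaxPooling2D` with stride equal to `pool_size`); the helper names are made up for illustration.

```python
def conv_output_size(size, kernel_size):
    """Height/width after a conv layer with no padding and stride 1."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Height/width after max pooling with stride equal to pool_size."""
    return size // pool_size

def conv_param_count(kernel_size, in_channels, filters):
    """Weights per filter (k * k * in_channels) plus one bias, times filters."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

size, channels = 28, 1          # MNIST input: 28 x 28 x 1
shapes, params = [], []
for i, filters in enumerate([32, 64, 128]):
    params.append(conv_param_count(3, channels, filters))
    size = conv_output_size(size, 3)
    channels = filters          # output depth = number of filters
    shapes.append((size, size, channels))
    if i < 2:                   # pooling follows only the first two conv layers
        size = pool_output_size(size, 2)
        shapes.append((size, size, channels))

print(shapes)
# [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 128)]
print(params)
# [320, 18496, 73856]
```

Each 3 x 3 conv trims one pixel from every border, and each pooling layer halves the height and width (rounding down), while the depth is set entirely by the number of filters.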

In [19]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.992


What happens if we remove the max pooling layers?
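
As a reminder of what the layer in question does, here is a minimal NumPy sketch of 2 x 2 max pooling on a single-channel feature map. The function name `max_pool_2x2` and the toy input are assumptions for illustration; the behavior mirrors `MaxPooling2D(pool_size=2)` with its default settings, where a trailing odd row or column is dropped.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2 x 2 window."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]   # drop a trailing odd row/column
    # Split into (h2, 2, w2, 2) blocks and take the max inside each block
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

feature_map = np.arange(25).reshape(5, 5)
pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (2, 2)
print(pooled)
# [[ 6  8]
#  [16 18]]
```

Pooling has no trainable parameters; its only job is to downsample the feature map while keeping the strongest response in each window.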

In [20]:
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs) ## No max pooling
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x) ## No max pooling
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model_no_max_pool = keras.Model(inputs=inputs, outputs=outputs)
## Look at the feature map sizes
model_no_max_pool.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 conv2d_4 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 conv2d_5 (Conv2D)           (None, 22, 22, 128)       73856     
                                                                 
 flatten_1 (Flatten)         (None, 61952)             0         
                                                                 
 dense_1 (Dense)             (None, 10)                619530    
                                                                 
Total params: 712,202
Trainable params: 712,202
Non-trainab

In [21]:
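model_no_max_pool.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model_no_max_pool.fit(train_images, train_labels, epochs=5, batch_size=64) ## Train the model without max pooling, not `model`
test_loss, test_acc = model_no_max_pool.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")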

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 0.991


Two problems

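* The final feature map before the dense layer is too large (22 x 22 x 128 = 61,952 values vs 3 x 3 x 128 = 1,152 with max pooling)
    * Too many parameters for the dense layer to learn (619,530 in `dense_1`)
* No spatial hierarchy of patterns is learned
    * Without downsampling, each value of the final feature map sees only a small window of the input; it should contain more global information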
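
Both problems can be quantified with a few lines of arithmetic. The sketch below uses the standard receptive-field recurrence (each layer widens the window by `(kernel_size - 1)` times the accumulated stride); the helper names and the `(kernel_size, stride)` layer encoding are assumptions for illustration.

```python
def receptive_field(layers):
    """Input window (in pixels, per side) seen by one value of the last layer.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # widen by the kernel, scaled by stride so far
        jump *= stride                  # distance in input pixels between neighbors
    return rf

def dense_param_count(n_inputs, n_units):
    """One weight per input plus one bias, for each unit."""
    return (n_inputs + 1) * n_units

CONV = (3, 1)   # 3 x 3 conv, stride 1
POOL = (2, 2)   # 2 x 2 max pooling, stride 2

rf_with_pool = receptive_field([CONV, POOL, CONV, POOL, CONV])
rf_without_pool = receptive_field([CONV, CONV, CONV])
print(rf_with_pool, rf_without_pool)        # 18 7

dense_with_pool = dense_param_count(3 * 3 * 128, 10)
dense_without_pool = dense_param_count(22 * 22 * 128, 10)
print(dense_with_pool, dense_without_pool)  # 11530 619530
```

Under these assumptions, a value in the final feature map covers an 18 x 18 region of the 28 x 28 input with pooling but only 7 x 7 without it, and the dense layer is about 54 times larger without pooling.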

## Dogs vs Cats

In [3]:

## Upload your Kaggle API token (the kaggle.json file) before running this cell
!kaggle competitions download -c dogs-vs-cats
!unzip train.zip

403 - Forbidden
unzip:  cannot find or open train.zip, train.zip.zip or train.zip.ZIP.
