<a href="https://colab.research.google.com/github/mahima-c/deep-learning/blob/main/CNN2DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**CLASSIFICATION AND LOCALIZATION**

In the classification and localization task not only do you have to report the class of object found in the image, but also the coordinates of the bounding box where the object appears in the image. This type of task assumes that there is only one instance of the object in an image.

This can be achieved by attaching a "regression head" in addition to the "classification head" in a typical classification network. Recall that in a classification network, the final output of convolution and pooling operations, called the feature map, is fed into a fully connected network that produces a vector of class probabilities. This fully connected network is called the classification head, and it is tuned using a categorical loss function (Lc) such as categorical cross entropy.

Similarly, a regression head is another fully connected network that takes the feature map and produces a vector (x, y, w, h) representing the top-left x and y coordinates, width, and height of the bounding box. It is tuned using a continuous loss function (Lr) such as mean squared error. The entire network is tuned using a linear combination of the two losses, that is:

$$L = \alpha L_c + (1 - \alpha) L_r$$

Here $\alpha$ is a hyperparameter and can take a value between 0 and 1. Unless the value is determined by some domain knowledge about the problem, it can be set to 0.5.
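
As a rough illustration of how the two head losses could be blended, the following plain-Python sketch assumes the combined loss takes the form alpha * Lc + (1 - alpha) * Lr, with categorical cross entropy for Lc and mean squared error for Lr. The function names and the sample values are hypothetical and chosen only for this example.

```python
import math

def categorical_cross_entropy(probs, one_hot):
    """Cross entropy between predicted class probabilities and a one-hot target."""
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))

def mean_squared_error(pred_box, true_box):
    """Mean squared error over the four box values (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box)) / len(true_box)

def combined_loss(probs, one_hot, pred_box, true_box, alpha=0.5):
    """Assumed form: alpha * Lc + (1 - alpha) * Lr, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l_c = categorical_cross_entropy(probs, one_hot)
    l_r = mean_squared_error(pred_box, true_box)
    return alpha * l_c + (1.0 - alpha) * l_r

# Hypothetical predictions for one image with three classes.
probs = [0.7, 0.2, 0.1]
one_hot = [1, 0, 0]
pred_box = [10.0, 12.0, 50.0, 40.0]
true_box = [12.0, 12.0, 48.0, 44.0]

loss = combined_loss(probs, one_hot, pred_box, true_box)
print(round(loss, 4))
```

Setting alpha to 1 reduces the result to the classification loss alone, and setting it to 0 leaves only the regression loss. In a real network the box coordinates would usually be normalized so that the two terms have comparable scale.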

The following figure shows a typical classification and localization network architecture. As you can see, the only difference with respect to a typical CNN classification network is the additional regression head on the top right:

**SEMANTIC SEGMENTATION**

Another class of problem that builds on the basic classification idea is "semantic segmentation." Here the aim is to assign every single pixel in the image to one class.

An initial implementation could build a classifier network for each pixel, where the input is a small neighborhood around that pixel. In practice this approach is slow, because neighboring patches overlap and most of the computation is repeated. An improvement is to run the whole image through convolutions that increase the feature depth while keeping the image width and height constant. Each pixel then has a feature vector that can be sent through a fully connected network that predicts the class of the pixel. However, this is also quite expensive in practice, and it is not normally used.
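
Whatever network produces the per-pixel class scores, the final step is the same: pick the highest-scoring class at every pixel. The sketch below shows that step in plain Python. The scores and the class meanings are made up for illustration.

```python
def segmentation_map(scores):
    """Turn an H x W grid of per-class scores into an H x W grid of class ids."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]

# Hypothetical 2 x 2 image with three classes (0 = background, 1 = cat, 2 = dog).
scores = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.3, 0.4, 0.3]],
]

labels = segmentation_map(scores)
print(labels)  # [[0, 1], [2, 1]]
```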

A third approach is to use a CNN encoder-decoder network, where the encoder decreases the width and height of the image but increases its depth (number of features), while the decoder uses transposed convolution operations to increase the width and height and decrease the depth. A transposed convolution is a learned form of upsampling: it maps a small feature map to a larger one, reversing the change in spatial size that a normal strided convolution produces. The input to this network is the image and the output is the segmentation map.
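
To make the size change in the decoder concrete, here is a minimal one-dimensional transposed convolution in plain Python, assuming no padding. It is only a sketch of the arithmetic (the function name is hypothetical); real decoders use 2-D layers with learned kernels.

```python
def conv_transpose_1d(x, kernel, stride):
    """Transposed 1-D convolution without padding.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

upsampled = conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
print(upsampled)  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

With a stride of 2 and a kernel of length 2, three input values become six output values, which is the doubling of resolution that a decoder stage typically performs.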

A popular encoder-decoder architecture is the U-Net (a good implementation is available at https://github.com/jakeret/tf_unet). It was originally developed for biomedical image segmentation and adds skip connections between corresponding layers of the encoder and decoder.

**OBJECT DETECTION**

The object detection task is similar to the classification and localization task. The big difference is that now there are multiple objects in the image, and for each one we need to find the class and bounding box coordinates. In addition, neither the number of objects nor their size is known in advance. As you can imagine, this is a difficult problem and a fair amount of research has gone into it.

A first approach to the problem might be to create many random crops of the input image and for each crop, apply the classification and localization networks we described earlier. However, such an approach is very wasteful in terms of computing and unlikely to be very successful.
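
The cost of the random-crop idea is easy to see once the crops are enumerated. The following sketch generates random (x, y, w, h) crops in plain Python. The image size, crop count, and minimum crop size are arbitrary assumptions, and each crop would still need its own pass through the classification and localization network.

```python
import random

def random_crops(img_w, img_h, n, min_size=32, seed=0):
    """Return n random crops as (x, y, w, h) tuples that fit inside the image."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    crops = []
    for _ in range(n):
        w = rng.randint(min_size, img_w)
        h = rng.randint(min_size, img_h)
        x = rng.randint(0, img_w - w)
        y = rng.randint(0, img_h - h)
        crops.append((x, y, w, h))
    return crops

crops = random_crops(img_w=640, img_h=480, n=1000)
print(len(crops), "forward passes would be needed for a single image")
```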

A more practical approach would be to use a tool such as Selective Search (Selective Search for Object Recognition, by Uijlings et al., http://www.huppelen.nl/publications/selectiveSearchDraft.pdf), which uses traditional computer vision techniques to find areas in the image that might contain objects. These regions are called "Region Proposals," and the network that classified them was called the "Region-based CNN," or R-CNN. In the original R-CNN, the regions were resized and fed into a network to yield image vectors.


**Classifying Fashion-MNIST with a tf.keras estimator model**



In [None]:
import os
import time
import tensorflow as tf
import numpy as np
# Number of categories to predict (class labels 0-9)
LABEL_DIMENSIONS = 10
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()
TRAINING_SIZE = len(train_images)
TEST_SIZE = len(test_images)
train_images = np.asarray(train_images, dtype=np.float32) / 255
# Reshape the train images to add a channel dimension
train_images = train_images.reshape((TRAINING_SIZE, 28, 28, 1))
test_images = np.asarray(test_images, dtype=np.float32) / 255
# Reshape the test images to add a channel dimension
test_images = test_images.reshape((TEST_SIZE, 28, 28, 1))
train_labels  = tf.keras.utils.to_categorical(train_labels, LABEL_DIMENSIONS)
test_labels = tf.keras.utils.to_categorical(test_labels, LABEL_DIMENSIONS)
# Cast the labels to float
train_labels = train_labels.astype(np.float32)
test_labels = test_labels.astype(np.float32)
print(train_labels.shape)
print(test_labels.shape)

Now let's build a convolutional model with the tf.keras functional API:



In [None]:
inputs = tf.keras.Input(shape=(28,28,1))  
x = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
x = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')(x)
x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
x = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
predictions = tf.keras.layers.Dense(LABEL_DIMENSIONS, activation='softmax')(x)
model = tf.keras.Model(inputs=inputs, outputs=predictions)
model.summary()
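
As a cross-check on what `model.summary()` reports, the spatial sizes and parameter counts of this stack can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming the Keras defaults used above (stride-1 convolutions with 'valid' padding, 2x2 pooling with stride 2). The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_out(size, pool=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - pool) // stride + 1

def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

size = 28
size = conv_out(size)   # Conv2D(32):   28 -> 26
size = pool_out(size)   # MaxPooling2D: 26 -> 13
size = conv_out(size)   # Conv2D(64):   13 -> 11
size = pool_out(size)   # MaxPooling2D: 11 -> 5
size = conv_out(size)   # Conv2D(64):    5 -> 3
flattened = size * size * 64

total_params = (
    conv_params(1, 32) + conv_params(32, 64) + conv_params(64, 64)
    + dense_params(flattened, 64) + dense_params(64, 10)
)
print(flattened, total_params)  # 576 93322
```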

In [None]:
optimizer = tf.keras.optimizers.SGD()
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

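Define a strategy, which is None for now because we run on CPUs first. Note that the tf.estimator API used from here on is deprecated and was removed in TensorFlow 2.16, so these cells need an earlier 2.x release: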



In [None]:
strategy = None
#strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

Now let's convert the tf.keras model into a convenient Estimator:



In [None]:
estimator = tf.keras.estimator.model_to_estimator(model, config=config)


The next step is to define input functions for training and for testing, which is pretty easy if we use tf.data:

In [None]:
def input_fn(images, labels, epochs, batch_size):
    # Convert the inputs to a Dataset
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    # Shuffle, repeat, and batch the examples.
    SHUFFLE_SIZE = 5000
    dataset = dataset.shuffle(SHUFFLE_SIZE).repeat(epochs).batch(batch_size)
    # Let tf.data choose the prefetch buffer size automatically.
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    # Return the dataset.
    return dataset

We are ready to start the training with the following code:



In [None]:
BATCH_SIZE = 512
EPOCHS = 50
estimator_train_result = estimator.train(
    input_fn=lambda: input_fn(train_images, train_labels,
                              epochs=EPOCHS,
                              batch_size=BATCH_SIZE))
print(estimator_train_result)
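
It can help to know how many steps this training run will take. Because `input_fn` calls `repeat(epochs)` before `batch(batch_size)`, the stream of examples is batched as one long sequence, so batches may straddle epoch boundaries and only the very last batch can be short. The sketch below does that arithmetic in plain Python, using the standard Fashion-MNIST split sizes (60,000 training and 10,000 test images). The helper name is hypothetical.

```python
import math

def num_batches(num_examples, epochs, batch_size):
    """Batches produced by repeat(epochs) followed by batch(batch_size)."""
    return math.ceil(num_examples * epochs / batch_size)

train_steps = num_batches(60000, epochs=50, batch_size=512)
eval_steps = num_batches(10000, epochs=1, batch_size=512)
print(train_steps, eval_steps)  # 5860 20
```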

In [None]:
estimator.evaluate(lambda:input_fn(test_images, 
                                   test_labels,
                                   epochs=1,
                                   batch_size=BATCH_SIZE))

**Run the Fashion-MNIST tf.keras estimator model on GPUs**

In this section we run the estimator on GPUs. All we need to do is change the strategy to MirroredStrategy(). This strategy uses one replica per device and synchronous replication across multiple GPUs:
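
The idea behind synchronous replication can be shown with a toy example: each replica computes a gradient on its own shard of the batch, the gradients are averaged, and every replica applies the same update. The plain-Python sketch below uses a one-parameter linear model and two hypothetical replicas. It illustrates the principle only and is not how tf.distribute is implemented.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def synchronous_step(w, shards, learning_rate):
    """One data-parallel SGD step with gradient averaging across replicas."""
    replica_grads = [mse_gradient(w, xs, ys) for xs, ys in shards]
    averaged = sum(replica_grads) / len(replica_grads)  # stands in for all-reduce
    return w - learning_rate * averaged

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # two equally sized shards

w_parallel = synchronous_step(0.0, shards, learning_rate=0.01)
w_single = 0.0 - 0.01 * mse_gradient(0.0, xs, ys)
print(w_parallel, w_single)
```

With equally sized shards the averaged gradient equals the gradient over the full batch, so the replicated step matches the single-device step.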

In [None]:
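# Switch to MirroredStrategy and rebuild the estimator with the new config.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model, config=config)
BATCH_SIZE = 512
EPOCHS = 50
estimator_train_result = estimator.train(
    input_fn=lambda: input_fn(train_images, train_labels,
                              epochs=EPOCHS,
                              batch_size=BATCH_SIZE))
print(estimator_train_result)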

In [None]:
estimator.evaluate(lambda:input_fn(test_images, 
                                   test_labels,
                                   epochs=1,
                                   batch_size=BATCH_SIZE))

**Answering questions about images (VQA)**


In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
# IMAGE
#
# Define CNN for visual processing
cnn_model = models.Sequential()
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
cnn_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D(2, 2))
cnn_model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn_model.add(layers.Conv2D(128, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D(2, 2))
cnn_model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
cnn_model.add(layers.Conv2D(256, (3, 3), activation='relu'))
cnn_model.add(layers.Conv2D(256, (3, 3), activation='relu'))
cnn_model.add(layers.MaxPooling2D(2, 2))
cnn_model.add(layers.Flatten())
cnn_model.summary()
# define the visual_model with proper input
image_input = layers.Input(shape=(224, 224, 3))
visual_model = cnn_model(image_input)
Text can be encoded with an RNN –