# From Classification to Semantics with CNNs

**Author:** [Liam O'Driscoll](https://github.com/lodriscoll)<br>
**Date created:** 06/04/2023<br>
**Last modified:** 06/15/2023<br>


# Preface

This notebook was developed on Google Compute Engine, so runtimes and download times may vary with your runtime of choice. Students can contact the professor for free Google Compute Engine coupons to experiment with the code without extremely long runtimes.

# Introduction

This exercise provides a comprehensive review of image classification then implements semantic segmentation.

## Setup

We begin by importing the necessary packages. We will be using [TensorFlow](https://www.tensorflow.org/api_docs/python/tf) and Keras to implement our models.

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Flatten, Dense, Conv2D, MaxPooling2D, BatchNormalization, Dropout, Resizing, Conv2DTranspose, Concatenate, UpSampling2D
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.metrics import MeanIoU
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet import preprocess_input

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import os
from tqdm import tqdm

## Data

For the classification portion of this exercise we will train and evaluate on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), a labeled subset of the 80 Million Tiny Images dataset (like its sibling CIFAR-100). It contains 60,000 32x32 color images in 10 classes, split into 50,000 training and 10,000 test images. It served as a benchmark in the early era of big data, and its diverse classes contributed to advancing computer vision research. Today we will use CIFAR-10 as a benchmark to test our classification networks. Keras already provides a load_data() function for this dataset, so we will leverage it.

In [None]:
# Load CIFAR-10 dataset
(x_train_raw, y_train_raw), (x_test_raw, y_test_raw) = cifar10.load_data()

# Print shape of the loaded data
print("Training data shape:", x_train_raw.shape)  # (50000, 32, 32, 3)
print("Training labels shape:", y_train_raw.shape)  # (50000, 1)
print("Test data shape:", x_test_raw.shape)  # (10000, 32, 32, 3)
print("Test labels shape:", y_test_raw.shape)  # (10000, 1)

# Convert labels to categorical format
num_classes = 10
y_train = tf.keras.utils.to_categorical(y_train_raw, num_classes)
y_test = tf.keras.utils.to_categorical(y_test_raw, num_classes)

# Normalize input data
x_train = x_train_raw.astype('float32') / 255.0
x_test = x_test_raw.astype('float32') / 255.0


## Visualization
Let's take a look at a few examples by sampling the training set randomly and displaying each image with its corresponding label.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_sample_images(x, y, class_names, num_images=6):
    # Generate random indices to select sample images from the dataset
    indices = np.random.randint(0, len(x), num_images)

    # Create a subplot grid based on the number of images
    num_rows = (num_images + 2) // 3
    fig, axes = plt.subplots(num_rows, 3, figsize=(8, 4 * num_rows), squeeze=False)

    # Iterate over the images and plot them with their labels
    for i, index in enumerate(indices):
        row = i // 3
        col = i % 3

        image = x[index]
        label = y[index]
        class_name = class_names[np.argmax(label)]

        axes[row, col].imshow(image)
        axes[row, col].axis('off')
        axes[row, col].set_title(f"Label: {class_name}")

    # Adjust the spacing and display the plot
    plt.tight_layout()
    plt.show()

# Define the class names
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Plot 6 sample images with their labels
plot_sample_images(x_train, y_train, class_names, num_images=6)

# Classification

We can now build our first classification network. Let's start with a fully connected neural network (also called a multilayer perceptron, or MLP for short).

## Fully Connected Neural Network (MLP)

This model has 4 hidden layers, is optimized using SGD with momentum, and is trained for 10 epochs with a batch size of 32.

In [None]:
def build_mlp_model(input_shape=(32,32,3)):

    # Build model
    model = Sequential()

    # Input shape matches the dimensions of images
    model.add(Input(shape=(input_shape)))

    # Flatten to pass into dense layer
    model.add(Flatten())

    # Four hidden layers
    model.add(Dense(units=256, activation='relu'))
    model.add(Dense(units=128, activation='relu'))
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=32, activation='relu'))

    # Output layer with 10 units corresponding to classes with softmax activation
    model.add(Dense(units=10, activation='softmax'))

    return model

# define modular training function
def train_model(model, x_train, y_train, optimizer, loss, metrics, epochs=10, batch_size=32):

    # compile model with specified hyperparameters
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=metrics)

    # train model
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)

    #return history
    return history

# define modular evaluation function
def evaluate_model(model, x_test, y_test):

    # Evaluate the model
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test loss: {loss:.4f}")
    print(f"Test accuracy: {accuracy:.4f}")


mlp_model = build_mlp_model(input_shape=(32,32,3))

# Use momentum SGD as optimizer and categorical_crossentropy
# because the labels were one-hot encoded

history = train_model(mlp_model, x_train, y_train,
                   optimizer=SGD(learning_rate=0.01, momentum=0.9),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

evaluate_model(mlp_model, x_test, y_test)


### Discussion

As we can see, our current model is struggling to fit the training set. To improve its performance, we can consider implementing the following techniques:

- Adding more layers to the model, which can increase its capacity to learn complex patterns. [Regularization](https://keras.io/api/layers/regularizers/), such as L1 or L2 regularization, can also be applied to prevent overfitting.

- Utilizing [batch normalization](https://keras.io/api/layers/normalization_layers/batch_normalization/), which normalizes the activations of the previous layer, helping the model to converge faster and generalize better.

- Introducing [dropout layers](https://keras.io/api/layers/regularization_layers/dropout/) to randomly deactivate some neurons during training, which reduces overfitting and encourages the network to learn more robust features.

- Optimizing hyperparameters using techniques like [random search or grid search](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/), which explore the hyperparameter space, by random sampling or exhaustively over a grid, to find the best combination.
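
To make two of the techniques above concrete, here is a minimal pure-Python sketch of what batch normalization and dropout compute during training. The helper names `batch_norm` and `dropout` and the `eps` value are illustrative assumptions, not the Keras API; the real Keras layers additionally learn a scale and offset and keep running statistics for inference.

```python
import random

def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate` and scale
    the survivors by 1 / (1 - rate) so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
dropped = dropout(activations, rate=0.2, rng=random.Random(0))
print(normalized)
print(dropped)
```

Each dropped-out value is either zero or the original value divided by 0.8, which is why no rescaling is needed when dropout is switched off at test time.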

Let's implement these changes and see if our model improves.


In [None]:
def build_better_mlp_model(input_shape=(32,32,3)):
    # Build model
    model = Sequential()

    # Input shape matches the dimensions of images
    model.add(Input(shape=input_shape))

    # Flatten to pass into dense layer
    model.add(Flatten())

    # Four hidden layers
    model.add(Dense(units=256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(units=128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(units=32, activation='relu'))
    model.add(Dropout(0.2))

    # Output layer with 10 units corresponding to classes with softmax activation
    model.add(Dense(units=10, activation='softmax'))

    return model


better_mlp_model = build_better_mlp_model(input_shape=(32,32,3))

history = train_model(better_mlp_model, x_train, y_train,
                      optimizer=SGD(learning_rate=0.01, momentum=0.9),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

evaluate_model(better_mlp_model, x_test, y_test)


### Limitations
As we can see, even with these improvements, fully connected networks struggle to classify images with accuracy above 50% on the CIFAR-10 dataset. For image classification, MLPs suffer from two significant and related problems:

1) **Large number of parameters:** In our case, each of the 32x32x3 = 3,072 input values requires a separate input neuron, so the first hidden layer alone needs 3,072 x 256 weights. The parameter count grows quickly with image size and layer width, which leads to computational inefficiency and can cause overfitting.

2) **Location-dependent pattern recognition:** Fully connected networks can only detect patterns in the exact same x, y location where they were learned. This limitation hampers their ability to generalize and recognize patterns in different spatial locations.
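
As a rough check on the first point, the following sketch counts the weights and biases of the first MLP defined earlier (3,072 inputs, hidden layers of 256, 128, 64 and 32 units, 10 outputs). The helper `dense_param_count` is a hypothetical name introduced only for this illustration.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 32 * 32 * 3 flattened inputs, four hidden layers, 10 outputs
mlp_sizes = [32 * 32 * 3, 256, 128, 64, 32, 10]
first_layer = dense_param_count(mlp_sizes[:2])
total = dense_param_count(mlp_sizes)
print(first_layer, total)  # 786688 830250
```

Almost 95% of the parameters sit in the first layer, which is the layer whose size depends directly on the number of pixels.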

### Convolutional Neural Networks

To address the limitations of the fully connected network, we can introduce an **inductive bias**, also known as a human prior, to improve the model's capabilities. We want improved efficiency and location-independent pattern recognition, so we introduce the convolution operation. It slides small windows, each with its own set of weights, across the image, which offers two main advantages:

1) **Parameter sharing:** The same set of weights is used across different locations in the input, resulting in improved efficiency. This reduces the number of parameters required to learn from the data.

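2) **Translation invariance:** Because the same filter is applied at every position, a pattern produces the same response wherever it appears in the image (strictly speaking, the convolution itself is translation equivariant, and pooling adds a degree of invariance). This enhances the model's ability to generalize and detect features at different locations.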
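
Both properties can be seen in a small pure-Python sketch. It is an illustration under simplifying assumptions (single channel, no padding, stride 1), and the helpers `conv2d_valid`, `argmax2d` and `blank` are hypothetical names, not Keras functions. The same four kernel weights are reused at every position, and the strongest response moves with the pattern.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), reusing the same weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def argmax2d(grid):
    """Return (row, col) of the largest entry."""
    return max(((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
               key=lambda rc: grid[rc[0]][rc[1]])

def blank(size):
    """Square image filled with zeros."""
    return [[0] * size for _ in range(size)]

# A 2x2 bright patch placed at two different locations in a 6x6 image
pattern = [[1, 1], [1, 1]]
image_a, image_b = blank(6), blank(6)
for i in range(2):
    for j in range(2):
        image_a[1 + i][1 + j] = 1   # patch with top-left corner at (1, 1)
        image_b[3 + i][2 + j] = 1   # same patch shifted to (3, 2)

peak_a = argmax2d(conv2d_valid(image_a, pattern))
peak_b = argmax2d(conv2d_valid(image_b, pattern))
print(peak_a, peak_b)  # (1, 1) (3, 2)
```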

To leverage the benefits of CNNs, we construct a new model with 3 convolution and max-pooling blocks and the better MLP classifier stacked on top. This new architecture allows us to increase the channel depth while compressing spatial information, enabling the network to capture relevant features and hierarchies present in the data more effectively.
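
The trade of spatial resolution for channel depth can be traced with a few lines of arithmetic. This is a sketch that assumes 'same'-padded convolutions and 2x2 max pooling with stride 2, with 64, 128 and 256 filters per block; `conv_block_shapes` is a hypothetical helper, not part of Keras.

```python
def conv_block_shapes(height, width, channels, filters_per_block):
    """Track (height, width, channels) through conv blocks that use
    'same'-padded convolutions followed by 2x2 max pooling with stride 2."""
    shapes = [(height, width, channels)]
    for filters in filters_per_block:
        channels = filters                       # 'same' convs keep height and width
        height, width = height // 2, width // 2  # pooling halves each spatial dim
        shapes.append((height, width, channels))
    return shapes

shapes = conv_block_shapes(32, 32, 3, [64, 128, 256])
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)      # [(32, 32, 3), (16, 16, 64), (8, 8, 128), (4, 4, 256)]
print(flattened)   # 4096
```

The classifier on top therefore receives 4,096 values per image, slightly more than the 3,072 raw inputs, but organized as 256 feature channels on a 4x4 grid.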

In [None]:
def build_cnn_model():

    # create conv model
    conv_model = Sequential()

    # Block 1
    conv_model.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(32,32,3)))
    conv_model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    conv_model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 2
    conv_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    conv_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    conv_model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

    # Block 3
    conv_model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    conv_model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    conv_model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

    # the return model
    model = Sequential()

    # stack better MLP classifier on top of conv model
    # with input shape as output shape of last conv block
    model.add(conv_model)
    model.add(build_better_mlp_model(input_shape=conv_model.output.shape[1:]))

    return model


cnn_model = build_cnn_model()

# same optimizer and loss
history = train_model(cnn_model, x_train, y_train,
                      optimizer=SGD(learning_rate=0.01, momentum=0.9),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

evaluate_model(cnn_model, x_test, y_test)

### Discussion

The comparison between the accuracies of the CNN and fully connected network clearly demonstrates the superiority of CNN in performance. As discussed earlier, the convolution operation provides more robust generalization for image data.

### Limitations

Although the CNN is a significant improvement over the MLP classifier, it still has inherent limitations. In fact, these limitations are direct byproducts of our inductive bias. One of these limitations is the lack of access to the context of the original image in later layers. This issue arises due to the simultaneous spatial compression and expansion of channel depth. To overcome this limitation, a concept called residual layers is introduced.

**Residual layers** act as skip connections between a layer and its previous layer, allowing the propagation of context throughout the network when it is optimal to do so.
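
A minimal sketch of the idea, using plain Python lists in place of feature maps: the block returns its input plus a learned transformation of that input. The name `residual_block` and the toy transforms are assumptions made for illustration only.

```python
def residual_block(x, transform):
    """Return x + transform(x), element by element (the skip connection adds the input back)."""
    return [xi + ti for xi, ti in zip(x, transform(x))]

features = [0.5, -1.0, 2.0]

# If the learned transform contributes nothing, the block passes its input through unchanged
identity_out = residual_block(features, lambda x: [0.0] * len(x))

# Otherwise the transform only has to learn a correction on top of the input
corrected_out = residual_block(features, lambda x: [0.1 * xi for xi in x])
print(identity_out, corrected_out)
```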

## Go Larger and Deeper - ResNet50

ResNet50 is a deep convolutional neural network architecture that significantly improves the performance and training of CNNs. It introduces the concept of residual learning, where residual layers are added to the network to mitigate the issue of context loss in deeper layers. The ResNet50 architecture is 50 layers deep and combines convolutional layers, pooling layers, and a fully connected output layer. It achieved state-of-the-art results on various image classification tasks when it was introduced.

The ResNet50 model was originally trained on the [ImageNet](https://www.image-net.org/) dataset, the second large-scale dataset that we will be looking at. ImageNet contains millions of labeled images from thousands of categories and serves as a benchmark dataset for training and evaluating deep learning models for image classification. The dataset spans a wide range of visual concepts, including objects, animals, and scenes.

Pre-training ResNet50 on ImageNet allows the model to extract meaningful and discriminative features from images. These learned features can then be fine-tuned or transferred to other datasets, including those with specific classes like the CIFAR-10 dataset.

By incorporating residual layers and leveraging their skip connections, ResNet50 allows the network to learn more effectively and capture both low-level and high-level features in an image. This architecture has been widely adopted and serves as a powerful tool for various computer vision tasks.

For further details and implementation specifics, please refer to the [ResNet50 paper](https://arxiv.org/abs/1512.03385). We will adapt this architecture to our specific needs, in this case classifying the CIFAR-10 dataset. We will use a transfer learning approach -- the following code should look familiar if you took CSE144.



In [None]:

def build_transfer_model():

    # download resnet architecture without top (the original classification layers with shape 1000)
    # with imagenet weights, and input shape that matches CIFAR-10
    res_model = ResNet50(include_top=False, weights="imagenet", input_shape=(32, 32, 3))

    # freeze all weights in res_model because we want to use transfer learning
    for layer in res_model.layers:
       layer.trainable = False

    # build sequential model with mlp classifier stacked on top of resnet
    model = Sequential()

    model.add(res_model)
    model.add(build_better_mlp_model(input_shape=res_model.output.shape[1:]))

    return model


# keras' resnet implementation requires you to preprocess the input
# so we define a wrapper function for train_model
#
# [RGB] -> [BGR]
# zero-centers each channel wrt imagenet without scaling
#
# link: https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
#
def train_transfer_model(transfer_model, x_train, y_train, optimizer, loss, metrics, epochs=10, batch_size=32):

    # preprocess data as specified in docs
    x_train_proc = preprocess_input(np.copy(x_train))

    # use modular train_model function from earlier to get history
    history = train_model(transfer_model, x_train_proc, y_train, optimizer, loss, metrics, epochs, batch_size)

    return history


transfer_model = build_transfer_model()

history = train_transfer_model(transfer_model, x_train, y_train,
                              optimizer=SGD(learning_rate=0.001, momentum=.9),
                              loss='categorical_crossentropy',
                              metrics=['accuracy'])

# the test images need the same preprocessing as the training images
evaluate_model(transfer_model, preprocess_input(np.copy(x_test)), y_test)


### Discussion

As we can see, ResNet offers another significant improvement over the traditional CNN model. The addition of skip connections, also known as residual connections, plays a crucial role in enhancing the model's performance. These skip connections allow for the direct flow of information from earlier layers to later layers, mitigating the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) and enabling the network to learn more effectively. By preserving and propagating gradients, the skip connections help to alleviate the challenges of training very deep networks. Additionally, the skip connections introduce shortcut paths that bypass a few layers, enabling the model to capture both low-level and high-level features more efficiently. This improved information flow and feature preservation contribute to the enhanced performance of ResNet on various tasks, including image classification.
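
The effect on gradients can be illustrated with a toy scalar model. This is a sketch under strong simplifying assumptions: every layer is a scalar function with the same small local derivative, whereas real networks multiply Jacobian matrices. The function `end_to_end_gradient` is hypothetical.

```python
def end_to_end_gradient(local_grads, skip):
    """Chain-rule product for a stack of scalar layers.

    Without skips each layer is y = f(x), contributing f'(x).
    With skips each layer is y = x + f(x), contributing 1 + f'(x).
    """
    grad = 1.0
    for g in local_grads:
        grad *= (1.0 + g) if skip else g
    return grad

local_grads = [0.1] * 20          # 20 layers, each with a small local derivative
plain = end_to_end_gradient(local_grads, skip=False)
with_skips = end_to_end_gradient(local_grads, skip=True)
print(plain, with_skips)
```

Without skip connections the product shrinks towards zero, while the identity path in each residual layer keeps the end-to-end gradient from vanishing.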

### Limitations

One of the main limitations of ResNet is its computational cost. The deep structure of ResNet, especially in variants like ResNet50, requires significant computational resources and training time. Training such models on large-scale datasets or with limited computing power can be challenging. Additionally, the large number of parameters in ResNet increases the risk of overfitting, especially when the training dataset is small or lacks diversity. Regularization techniques such as dropout and weight decay can help mitigate overfitting to some extent.

While ResNet has shown impressive performance, it has a limitation in terms of contextual information utilization. The skip connections in ResNet only propagate information from the previous layer to the next layer. As a result, later layers may not have direct access to the entire context of the original image, which can limit their ability to effectively utilize global information. This limitation can impact the model's performance, particularly in tasks that require capturing long-range dependencies or contextual understanding.

##### Side Note

[DenseNet](https://arxiv.org/abs/1608.06993) is an architecture that addresses the limitation of contextual information utilization in ResNet by introducing dense connections. In DenseNet, each layer is connected to all previous layers, allowing for direct information flow from all preceding layers. This design enables dense connections to propagate richer and more comprehensive information throughout the network. By incorporating dense connections, DenseNet facilitates better information flow, gradient propagation, and feature reuse across different layers. This architecture is beyond the scope of this exercise so let us move on from classification.




# Semantics

After comprehensively reviewing classification, we are now equipped with the tools to move on to a more challenging beast: semantic segmentation. As we learned, image classification is the task of assigning a single label or class to an entire image. The goal is to determine what is present in the image and categorize it into one of several predefined classes. The output of image classification is a single class label or a probability distribution over class labels.

On the other hand, semantic segmentation is the task of pixel-level labeling, where the objective is to assign a class label to each individual pixel in an image. The goal is to classify and differentiate the various objects or regions within an image. Semantic segmentation provides a detailed understanding of the scene by segmenting the image into different meaningful regions. For example, in an image containing a dog and a cat, semantic segmentation would aim to assign a specific label to each pixel belonging to the dog and the cat separately. The output of semantic segmentation is a pixel-wise classification map, where each pixel is assigned a class label.
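
The difference in output structure can be sketched with made-up probabilities (three classes, a 2x2 image); none of these numbers come from a trained model.

```python
def argmax(probabilities):
    """Index of the largest probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

# Classification: one distribution over classes for the whole image
image_probs = [0.1, 0.7, 0.2]
image_label = argmax(image_probs)

# Segmentation: one distribution per pixel, here for a tiny 2x2 image
pixel_probs = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.3, 0.6, 0.1]],
]
label_map = [[argmax(p) for p in row] for row in pixel_probs]
print(image_label)  # 1
print(label_map)    # [[0, 1], [2, 1]]
```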

## Data

To achieve this, let's take a look at the third and final large-scale dataset: [Cityscapes](https://paperswithcode.com/dataset/cityscapes). This dataset is widely used in semantic segmentation and serves as a benchmark for computer vision systems in autonomous driving. We will download a subset of this dataset using the [Kaggle API](https://www.kaggle.com/general/156610) to streamline the process. To run these cells you must download your API token (a JSON file) from the account tab in Kaggle and upload it to this notebook. Further instructions and clarifications are available at that link.

In [None]:
# download token and upload to this notebook's working directory BEFORE running the following cell

In [None]:
! pip install kaggle

! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

! kaggle datasets download dansbecker/cityscapes-image-pairs

! unzip -q cityscapes-image-pairs.zip

## Further Preprocessing

Each downloaded file contains the raw image and its labeled mask side by side. We must now split these concatenated images into our x_train (raw image) and y_train (labeled mask).

In [None]:
# define ID map and category map for each pixel (from the Cityscapes documentation)

id_map = {
    0: (0, 0, 0), # unlabelled
    1: (111, 74,  0), #dynamic
    2: ( 81,  0, 81), #ground
    3: (128, 64,128), #road
    4: (244, 35,232), #sidewalk
    5: (250,170,160), #parking
    6: (230,150,140), #rail track
    7: (70, 70, 70), #building
    8: (102,102,156), #wall
    9: (190,153,153), #fence
    10: (180,165,180), #guard rail
    11: (150,100,100), #bridge
    12: (150,120, 90), #tunnel
    13: (153,153,153), #pole
    14: (153,153,153), #polegroup
    15: (250,170, 30), #traffic light
    16: (220,220,  0), #traffic sign
    17: (107,142, 35), #vegetation
    18: (152,251,152), #terrain
    19: ( 70,130,180), #sky
    20: (220, 20, 60), #person
    21: (255,  0,  0), #rider
    22: (  0,  0,142), #car
    23: (  0,  0, 70), #truck
    24: (  0, 60,100), #bus
    25: (  0,  0, 90), #caravan
    26: (  0,  0,110), #trailer
    27: (  0, 80,100), #train
    28: (  0,  0,230), #motorcycle
    29: (119, 11, 32), #bicycle
    30: (  0,  0,142) #license plate
}

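# maps each label id to one of 8 coarse categories:
# 0 void, 1 flat, 2 construction, 3 object, 4 nature, 5 sky, 6 human, 7 vehicle
category_map = {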
    0: 0,
    1: 0,
    2: 0,
    3: 1,
    4: 1,
    5: 1,
    6: 1,
    7: 2,
    8: 2,
    9: 2,
    10: 2,
    11: 2,
    12: 2,
    13: 3,
    14: 3,
    15: 3,
    16: 3,
    17: 4,
    18: 4,
    19: 5,
    20: 6,
    21: 6,
    22: 7,
    23: 7,
    24: 7,
    25: 7,
    26: 7,
    27: 7,
    28: 7,
    29: 7,
    30: 7
}

In [None]:
def preprocess(image_path):
    # Open the image
    image = Image.open(image_path)

    # Crop and resize two regions of the image
    region1 = image.crop((0, 0, 256, 256)).resize((128, 128))
    region2 = image.crop((256, 0, 512, 256)).resize((128, 128))

    # Convert region1 to a numpy array and normalize its values
    normalized_region1 = np.array(region1) / 255.

    # Convert region2 to a numpy array
    region2_array = np.array(region2)

    # Create an empty mask with the same shape as region2
    mask = np.zeros(shape=(region2_array.shape[0], region2_array.shape[1]), dtype=np.uint32)

    # Iterate over each pixel in region2
    for row in range(region2_array.shape[0]):
        for col in range(region2_array.shape[1]):
            pixel_value = region2_array[row, col, :]
            final_key = None
            final_distance = None

            # Find the closest matching key in the id_map dictionary
            for key, value in id_map.items():
                distance = np.sum(np.sqrt(np.power(pixel_value - value, 2)))

                # Update the final_key and final_distance if it's the first iteration
                if final_key is None:
                    final_distance = distance
                    final_key = key
                # Update the final_key and final_distance if the current distance is smaller
                elif distance < final_distance:
                    final_distance = distance
                    final_key = key

            # Map the closest label id to its category and store it in the mask
            mask[row, col] = category_map[final_key]

    # Reshape the mask to have an additional dimension
    mask = np.reshape(mask, (mask.shape[0], mask.shape[1], 1))

    # Clean up the region2 variable
    del region2_array

    # Return the preprocessed image and mask
    return normalized_region1, mask

def get_train_val_split_from_filenames(train_path, val_path):
    # Initialize empty lists for training and validation data
    train_images = []
    train_masks = []
    val_images = []
    val_masks = []

    # Process training images and masks
    for file in tqdm(os.listdir(train_path)):
        image, mask = preprocess(f"{train_path}/{file}")
        train_images.append(image)
        train_masks.append(mask)

    # Process validation images and masks
    for file in tqdm(os.listdir(val_path)):
        image, mask = preprocess(f"{val_path}/{file}")
        val_images.append(image)
        val_masks.append(mask)

    # Return the training and validation data
    return np.array(train_images), np.array(train_masks), np.array(val_images), np.array(val_masks)
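
The per-pixel Python loops in `preprocess` are the main reason this step is slow. As a possible alternative, the nearest-color lookup can be vectorized with NumPy broadcasting. The sketch below uses the same distance as the loop (sum of absolute channel differences) on a hypothetical three-color palette; `nearest_label_ids` is an illustrative helper, not a drop-in replacement for the cell above.

```python
import numpy as np

def nearest_label_ids(rgb_mask, palette):
    """Map each pixel of an (H, W, 3) color mask to the index of the closest palette color.

    Uses the sum of absolute channel differences, computed with broadcasting
    instead of Python loops.
    """
    pixels = rgb_mask.astype(np.int32)[:, :, None, :]      # (H, W, 1, 3)
    colors = palette.astype(np.int32)[None, None, :, :]    # (1, 1, K, 3)
    distances = np.abs(pixels - colors).sum(axis=-1)       # (H, W, K)
    return distances.argmin(axis=-1)                       # (H, W)

# Hypothetical three-color palette and a 2x2 mask with slightly noisy colors
palette = np.array([(0, 0, 0), (128, 64, 128), (70, 130, 180)])
rgb_mask = np.array([[(2, 1, 0), (126, 66, 127)],
                     [(71, 128, 182), (0, 0, 3)]], dtype=np.uint8)
label_ids = nearest_label_ids(rgb_mask, palette)
print(label_ids)
```

With the notebook's `id_map`, the palette could be built as `np.array([id_map[k] for k in sorted(id_map)])`, so that palette indices coincide with label ids.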

In [None]:
# this will take a while

x_train, y_train, x_val, y_val = get_train_val_split_from_filenames("cityscapes_data/train/", "cityscapes_data/val/")

## Fully Convolutional Network (FCN)

Let's begin by creating our first fully convolutional network. We will use the ResNet architecture as our base model. It will be initialized with the same ImageNet weights and, like last time, we will remove the top so we can modify the architecture to our needs.

In [None]:
def build_fcn_model():

    res_model = ResNet50(include_top=False, weights="imagenet", input_shape=(128, 128, 3))

    # freeze all weights in res_model because we want to use transfer learning
    for layer in res_model.layers:
       layer.trainable = False

    # build sequential model with semantic segmentation classifier
    # stacked on top of resnet
    model = Sequential()
    model.add(res_model)

    # one filter per category in category_map (8 categories, ids 0-7)
    model.add(Conv2D(8, (1, 1), activation='softmax', padding='same'))

    # resize to shape of input images so we can compute loss
    model.add(Resizing(128, 128))

    return model

fcn_model = build_fcn_model()

# have to use driver function again because we are using resnet as network backbone
history = train_transfer_model(fcn_model, x_train, y_train,
                              optimizer=SGD(learning_rate=0.001, momentum=.9),
                              loss='sparse_categorical_crossentropy',  # masks hold integer class ids
                              metrics=['accuracy'])

# use validation data as test set for semantics
# apply the same preprocessing that was used during training
evaluate_model(fcn_model, preprocess_input(np.copy(x_val)), y_val)


## Better FCN

The previous model can only provide coarse segmentation masks. This is due to the decision to upsample the resulting feature map without any learned parameters: resizing interpolates the coarse 4x4 map and cannot recover fine details. To improve upon this, we introduce the [transposed convolution](https://towardsdatascience.com/understand-transposed-convolutions-and-build-your-own-transposed-convolution-layer-from-scratch-4f5d97b2967) operation, which lets the network learn how to upsample the coarse segmentation mask and produce fine-grained details. Let's implement these changes.
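
To see what a transposed convolution does, here is a minimal one-dimensional sketch in pure Python. The helper `conv_transpose_1d` is hypothetical and computes the full, untrimmed output; Keras' `Conv2DTranspose` with `padding='same'` trims the result so that the output size is the input size times the stride.

```python
def conv_transpose_1d(signal, kernel, stride):
    """Each input value adds a scaled copy of the kernel to the output, `stride` positions apart."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

coarse = [1.0, 2.0]
kernel = [1.0, 1.0, 1.0]   # in a trained layer these weights are learned
upsampled = conv_transpose_1d(coarse, kernel, stride=2)
print(upsampled)  # [1.0, 1.0, 3.0, 2.0, 2.0]
```

Because the kernel weights are trainable, the network can learn an upsampling filter suited to the data instead of relying on fixed interpolation.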

In [None]:
def build_better_fcn_model():
    res_model = ResNet50(include_top=False, weights="imagenet", input_shape=(128, 128, 3))

    # Freeze all weights in res_model because we want to use transfer learning
    for layer in res_model.layers:
       layer.trainable = False

    # Build sequential model with better semantic segmentation classifier
    # stacked on top of ResNet
    model = Sequential()
    model.add(res_model)

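    # Use transposed convolutions (Conv2DTranspose) instead of resizing.
    # ResNet50 reduces the 128x128 input to a 4x4 feature map, so five
    # stride-2 layers are needed to get back to 128x128 (4 * 2**5 = 128).
    model.add(Conv2DTranspose(256, (3, 3), strides=(2, 2), activation='relu', padding='same'))
    model.add(Conv2DTranspose(128, (3, 3), strides=(2, 2), activation='relu', padding='same'))
    model.add(Conv2DTranspose(64, (3, 3), strides=(2, 2), activation='relu', padding='same'))
    model.add(Conv2DTranspose(32, (3, 3), strides=(2, 2), activation='relu', padding='same'))
    model.add(Conv2DTranspose(16, (3, 3), strides=(2, 2), activation='relu', padding='same'))

    # Use Conv2D to reduce to the number of categories in category_map (8, ids 0-7)
    model.add(Conv2D(8, (1, 1), activation='softmax', padding='same'))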

    return model

better_fcn_model = build_better_fcn_model()

history = train_transfer_model(better_fcn_model, x_train, y_train,
                              optimizer=SGD(learning_rate=0.001, momentum=.9),
                              loss='sparse_categorical_crossentropy',  # masks hold integer class ids
                              metrics=['accuracy'])

# apply the same preprocessing that was used during training
evaluate_model(better_fcn_model, preprocess_input(np.copy(x_val)), y_val)


## U-Net

These results are impressive, but our better FCN suffers from one large and inherent issue. Since we froze the weights and used only the final output of the backbone, our segmentation classifier loses access to context that was available in earlier layers. We can solve this issue with a technique similar to the skip connections of the ResNet architecture. By adding skip connections between early and later convolutional layers with the same shape, we can increase performance by giving the model more context in later layers. Further information can be found in the [original U-Net paper](https://arxiv.org/abs/1505.04597).
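
Skip connections by concatenation only work when the spatial sizes on both sides match. The following sketch traces sizes and channel counts for a U-Net with 64 base filters, four pooling steps and a 128x128 input, the same configuration as the model defined next; `unet_shapes` is a hypothetical helper written for this illustration.

```python
def unet_shapes(size, base_filters=64, depth=4):
    """Trace (spatial size, channels) through a U-Net with `depth` pooling steps."""
    encoder = []
    filters = base_filters
    for _ in range(depth):
        encoder.append((size, filters))   # saved for the skip connection
        size //= 2                        # 2x2 max pooling
        filters *= 2
    bottleneck = (size, filters)

    merges = []
    for skip_size, skip_filters in reversed(encoder):
        size *= 2                         # 2x upsampling
        filters //= 2                     # conv after upsampling halves the channels
        assert size == skip_size          # concatenation needs matching spatial sizes
        merges.append((size, filters + skip_filters))  # channels after concatenation
    return encoder, bottleneck, merges

encoder, bottleneck, merges = unet_shapes(128)
print(encoder)     # [(128, 64), (64, 128), (32, 256), (16, 512)]
print(bottleneck)  # (8, 1024)
print(merges)      # [(16, 1024), (32, 512), (64, 256), (128, 128)]
```

Each merged tensor has twice the channels of its decoder branch, which is why every concatenation is followed by convolutions that reduce the channel count again.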

In [None]:
def build_unet_model(input_shape, num_classes):
    inputs = Input(shape=input_shape)

    # Contraction path - in better FCN we used resnet to accomplish this
    conv1 = Conv2D(64, 3, activation='relu', padding='same')(inputs)
    conv1 = Conv2D(64, 3, activation='relu', padding='same')(conv1)
    pool1 = MaxPooling2D()(conv1)

    conv2 = Conv2D(128, 3, activation='relu', padding='same')(pool1)
    conv2 = Conv2D(128, 3, activation='relu', padding='same')(conv2)
    pool2 = MaxPooling2D()(conv2)

    conv3 = Conv2D(256, 3, activation='relu', padding='same')(pool2)
    conv3 = Conv2D(256, 3, activation='relu', padding='same')(conv3)
    pool3 = MaxPooling2D()(conv3)

    conv4 = Conv2D(512, 3, activation='relu', padding='same')(pool3)
    conv4 = Conv2D(512, 3, activation='relu', padding='same')(conv4)
    pool4 = MaxPooling2D()(conv4)

    conv5 = Conv2D(1024, 3, activation='relu', padding='same')(pool4)
    conv5 = Conv2D(1024, 3, activation='relu', padding='same')(conv5)

    # Expansion Path - this is the semantic classifier (or top) from previous network
    up6 = Conv2D(512, 2, activation='relu', padding='same')(UpSampling2D()(conv5))
    merge6 = Concatenate()([conv4, up6])
    conv6 = Conv2D(512, 3, activation='relu', padding='same')(merge6)
    conv6 = Conv2D(512, 3, activation='relu', padding='same')(conv6)

    up7 = Conv2D(256, 2, activation='relu', padding='same')(UpSampling2D()(conv6))
    merge7 = Concatenate()([conv3, up7])
    conv7 = Conv2D(256, 3, activation='relu', padding='same')(merge7)
    conv7 = Conv2D(256, 3, activation='relu', padding='same')(conv7)

    up8 = Conv2D(128, 2, activation='relu', padding='same')(UpSampling2D()(conv7))
    merge8 = Concatenate()([conv2, up8])
    conv8 = Conv2D(128, 3, activation='relu', padding='same')(merge8)
    conv8 = Conv2D(128, 3, activation='relu', padding='same')(conv8)

    up9 = Conv2D(64, 2, activation='relu', padding='same')(UpSampling2D()(conv8))
    merge9 = Concatenate()([conv1, up9])
    conv9 = Conv2D(64, 3, activation='relu', padding='same')(merge9)
    conv9 = Conv2D(64, 3, activation='relu', padding='same')(conv9)

    # Output layer
    outputs = Conv2D(num_classes, 1, activation='softmax')(conv9)

    model = Model(inputs=inputs, outputs=outputs)
    return model

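# 128x128 RGB inputs and one output channel per category in category_map (8, ids 0-7)
unet_model = build_unet_model(input_shape=(128, 128, 3), num_classes=8)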

history = train_model(unet_model, x_train, y_train,
                      optimizer=SGD(learning_rate=0.001, momentum=.9),
                      loss='sparse_categorical_crossentropy',  # masks hold integer class ids
                      metrics=['accuracy'])

evaluate_model(unet_model, x_val, y_val)
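
The setup cell imports `MeanIoU`, while the models above are compiled with pixel accuracy only. As a rough illustration of what intersection-over-union measures, here is a pure-Python sketch on flattened toy masks. The function `mean_iou` is a hypothetical helper, and its handling of absent classes is an assumption that may differ from the Keras metric.

```python
def mean_iou(true_labels, predicted_labels, num_classes):
    """Average intersection-over-union across classes that appear in either input."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(t == cls and p == cls
                           for t, p in zip(true_labels, predicted_labels))
        union = sum(t == cls or p == cls
                    for t, p in zip(true_labels, predicted_labels))
        if union > 0:               # skip classes absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Flattened 2x3 masks with three classes
truth      = [0, 0, 1, 1, 2, 2]
prediction = [0, 1, 1, 1, 2, 0]
score = mean_iou(truth, prediction, num_classes=3)
print(score)
```

Here four of six pixels are correct (accuracy of about 0.67), but the mean IoU is 0.5, because every misclassified pixel penalizes both the true class and the predicted class.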


# Summary

You have now seen how to implement image classification and semantic segmentation using convolutional networks and transfer learning. You have also been exposed to modular network design and three large-scale benchmark datasets. In the classification section, we learned how to assign a probability distribution to the entire image. In the semantic segmentation section, we learned how to assign a probability distribution to each pixel in the input image. You are now equipped with the tools to tackle more advanced computer vision tasks.