### Introduction

In classification problem we have an input image and our task is to label that image only. But sometimes we might want to know the label of the image as well as the location. This prroblem is known as classification and locaalizatioon. For example in our example the cat is on a field. Our task is to label the image as cat and the location as field. We already know the classifcation process. Here, localization refers to the bounding box of the object (in our example bounding box around the cat).

The differences between object detection and "classification and localization" is that in the former problem we may have multiple object but in latter case there is only one object.

### Procedure

We take our input image and pass it through a giant convolutional neural network(in our example: Alex Net) which will summarize the input image into a vector. Now unlike classification problem, we will feed this vector into two separate fully connceted layers instead of one. One will give class scores and the other will give four numbers height, width and the coordinates of the bounding box x and y. 

Now we will have two losses for this two different outputs. Moreover, we do this task in a fully supervised way. We assume that each of out training images are annoted with both a category label and also a ground truth bounding box for that category in the image. So, now we have two loss functions as well. One is famous softmax loss and the other is simply L2 loss. We might take other loss functions for the second case like L1, smooth L1 etc..


![alt text](images/cl1.PNG)

Now we have two scalars(two losses) for gradient computing and we want to minimize both. To do so we will take another hyperparameters that gives us some weights. Now we will take wighted sum of these loss functions to give our final scalar loss. Then we will take gradients w.r.t. this new scalar(weighted loss).

This is also tricky because, this additional hyperparameter is to be set. We can apply different set of this hyperparameter and observe which performs better.

This proess of classification and localization can be applied to human pose estimation. In this case, the input is an image of a person and we want to output the positions of the joints for that person. This will allow the network to prdict where the his arms were, where his legs were staff like that.

We assume every person has same number of joints naturally. However it might not be the case all the time, but it works for the network. Generally the datasets for this problem defines a person's pose by 14 joint positions, their feet, knees, their hips etc. Therefore our task will be to output 14 numbers giving the (x,y) co-ordinates for each of the joints. We will apply regression loss(other than softmax and cross entropy loss) on each of those 14 predicted points and train the network with back propagation again.

![alt text](images/cl2.PNG) ![alt text](images/cl3.PNG) 

### Implementation

Now, it's time for implementation of what we learned about calssification and localization.

In [6]:
#import libraries
import tensorflow as tf
from tensorflow import keras
import numpy as np
import tensorflow_datasets as tfds
from tensorflow.keras import layers, Model
from tensorflow.keras.utils import to_categorical

In [7]:
# Load PASCAL VOC dataset
(ds_train, ds_test), ds_info = tfds.load('voc',
 split=['train', 'validation'],
 with_info=True)


In [8]:
print(ds_info)

tfds.core.DatasetInfo(
    name='voc',
    full_name='voc/2007/4.0.0',
    description="""
    This dataset contains the data from the PASCAL Visual Object Classes Challenge,
    corresponding to the Classification and Detection competitions.
    
    In the Classification competition, the goal is to predict the set of labels
    contained in the image, while in the Detection competition the goal is to
    predict the bounding box and label of each individual object.
    annotations.
    """,
    config_description="""
    This dataset contains the data from the PASCAL Visual Object Classes Challenge
    2007, a.k.a. VOC2007.
    
    A total of 9963 images are included in this dataset, where each image
    contains a set of objects, out of 20 different classes, making a total of
    24640 annotated objects.
    
    """,
    homepage='http://host.robots.ox.ac.uk/pascal/VOC/voc2007/',
    data_path='C:\\Users\\klikh\\tensorflow_datasets\\voc\\2007\\4.0.0',
    file_format=tfrecord,
   

In [9]:
# Function to filter examples with a single object
def filter_single_object_examples(dataset):
    single_object_images = []
    single_object_annotations = []
    single_object_classes = []

    for example in dataset:
        if len(example['objects']['bbox']) == 1:  # Check if there's a single object
            single_object_images.append(example['image'])
            single_object_annotations.append(example['objects']['bbox'][0])  # Select the only annotation
            single_object_classes.append(example['labels'])  # Assuming label is the key for the class

    return single_object_images, single_object_classes, single_object_annotations


In [10]:
# Filter train and test datasets for single object examples
x_train, y_train_class,y_train_annotations= filter_single_object_examples(ds_train)
x_test, y_test_class,y_test_annotations = filter_single_object_examples(ds_test)

# one hot encoding for the class labels
y_train_class=to_categorical(y_train_class,num_classes=20)
y_test_class=to_categorical(y_test_class,num_classes=20)

In [11]:
# Preprocess images (resize, normalize, etc.)
def preprocess_images(images):
    processed_images = []
    for image in images:
        processed_image = tf.image.resize(image, (224,224))  # Resize the image
        # Add more preprocessing steps if required
        processed_images.append(processed_image)
    return processed_images

In [12]:
# Preprocess images
x_train = np.array(preprocess_images(x_train))
x_test = np.array(preprocess_images(x_test))

# Convert annotations to numpy arrays
# y_train = np.array(y_train)
# y_test = np.array(y_test)

y_train_class = np.array(y_train_class)
y_train_annotations = np.array(y_train_annotations)
y_test_class = np.array(y_test_class)
y_test_annotations = np.array(y_test_annotations)

In [14]:
print(x_train.shape)
print(y_train_class.shape)
print(y_train_annotations.shape)
print(y_test_annotations.shape)

(905, 224, 224, 3)
(905, 20)
(905, 4)
(960, 4)


In [16]:
# Check if the shapes are consistent before proceeding to model training
print(f"Train Images: {len(x_train)}, Train class: {len(y_train_class)},  Train Annotations: {len(y_train_annotations)}")
print(f"Test Images: {len(x_test)}, Test class: {len(y_test_class)} Test Annotations: {len(y_test_annotations)}")




Train Images: 905, Train class: 905,  Train Annotations: 905
Test Images: 960, Test class: 960 Test Annotations: 960


In [18]:
# Define the model architecture
def create_classification_localization_model(input_shape):
    input_layer = layers.Input(shape=input_shape)

    # Convolutional base for feature extraction
    base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_tensor=input_layer)

    # Classification branch
    classification_branch = layers.GlobalAveragePooling2D()(base_model.output)
    classification_branch = layers.Dense(256, activation='relu')(classification_branch)
    classification_output = layers.Dense(20, activation='softmax', name='class_output')(classification_branch)

    # Localization branch
    localization_branch = layers.GlobalAveragePooling2D()(base_model.output)
    localization_branch = layers.Dense(128, activation='relu')(localization_branch)
    localization_output = layers.Dense(4, name='loc_output')(localization_branch)  # 4 values for x, y, width, height

    model = Model(inputs=input_layer, outputs=[classification_output, localization_output])

    return model

In [19]:
# Define the combined loss function
def combined_loss(w_closs, w_lloss):
    def loss(y_true, y_pred):
        # Access class and annotation tensors directly
        y_class_true = y_true[0]
        y_annotation_true = y_true[1]
        y_class_pred = y_pred[0]
        y_annotation_pred = y_pred[1]

        # Classification loss
        class_loss = tf.keras.losses.CategoricalCrossentropy()(y_class_true, y_class_pred)

        # Localization loss
        loc_loss = tf.keras.losses.MeanSquaredError()(y_annotation_true, y_annotation_pred)

        # Combine both losses
        total_loss = (w_closs * class_loss) + (w_lloss * loc_loss)
        return total_loss

    return loss


In [20]:
# Example usage:
input_shape = (224, 224, 3)
num_classes = 20
w_closs=.5
w_lloss=.5

In [21]:
# Create the model
model = create_classification_localization_model(input_shape)

In [23]:
model.compile(
    loss=combined_loss(w_closs=w_closs,w_lloss=w_lloss),
    optimizer=keras.optimizers.Adam(learning_rate=.001),
    metrics=["accuracy"],

)

In [25]:
#model fit
model.fit(x_train,[y_train_class,y_train_annotations],batch_size=32, epochs=30, verbose=2)

Epoch 1/30
29/29 - 396s - loss: 23.5030 - class_output_loss: 2.4033 - loc_output_loss: 21.0998 - class_output_accuracy: 0.1370 - loc_output_accuracy: 0.2000 - 396s/epoch - 14s/step
Epoch 2/30
29/29 - 345s - loss: 11.1334 - class_output_loss: 2.1523 - loc_output_loss: 8.9810 - class_output_accuracy: 0.1459 - loc_output_accuracy: 0.4840 - 345s/epoch - 12s/step
Epoch 3/30
29/29 - 335s - loss: 6.8068 - class_output_loss: 1.8455 - loc_output_loss: 4.9613 - class_output_accuracy: 0.1569 - loc_output_accuracy: 0.5238 - 335s/epoch - 12s/step
Epoch 4/30
29/29 - 334s - loss: 3.1176 - class_output_loss: 1.7436 - loc_output_loss: 1.3740 - class_output_accuracy: 0.1127 - loc_output_accuracy: 0.5249 - 334s/epoch - 12s/step
Epoch 5/30
29/29 - 334s - loss: 4.2058 - class_output_loss: 2.0108 - loc_output_loss: 2.1951 - class_output_accuracy: 0.1293 - loc_output_accuracy: 0.5669 - 334s/epoch - 12s/step
Epoch 6/30
29/29 - 335s - loss: 3.8247 - class_output_loss: 1.9173 - loc_output_loss: 1.9074 - class_o

<keras.src.callbacks.History at 0x12dc5fc50d0>

In [26]:
#model evaluate
model.evaluate(x_test,[y_test_class,y_test_annotations],batch_size=32, verbose=2)

30/30 - 74s - loss: 2.9551 - class_output_loss: 1.5868 - loc_output_loss: 1.3683 - class_output_accuracy: 0.1125 - loc_output_accuracy: 0.5635 - 74s/epoch - 2s/step


[2.955080270767212,
 1.5867671966552734,
 1.368312954902649,
 0.11249999701976776,
 0.5635416507720947]