# Data Set and Feature Extraction

In this notebook, I will explain how to load and pre-process images using the PyTorch `Dataset` and `DataLoader` classes. Then, I will extract encoded features for each image using a CNN.

Outline of this notebook:
- [Step 1](#step1): Writing custom PyTorch Dataset
- [Step 2](#step2): Using the Data Loader to obtain Batches
- [Step 3](#step3): Extracting features of all dataset images using CNN Encoder

<a id='step1'></a>
## Step 1: Writing Custom PyTorch Dataset

I wrote a custom PyTorch [Dataset](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) to recursively load all images in a directory with their paths. This dataset is an instance of my custom `ImagesDataset` class in **images_dataset.py**.  If you are unfamiliar with data loaders and datasets, you are encouraged to review [my post](http://www.sefidian.com/2022/03/09/writing-custom-datasets-and-dataloader-in-pytorch/) or [this PyTorch tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

#### Exploring the `__getitem__` Method

The `__getitem__` method in the `ImagesDataset` class determines how an image-path pair is pre-processed before being incorporated into a batch.  This is true for all `Dataset` classes in PyTorch; if this is unfamiliar to you, please review [this link](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html). 

For a given index, this method first obtains the file path of the image (`image_path`). It then loads and pre-processes the image and returns it together with its path.

#### Image Pre-Processing 

Image pre-processing is relatively straightforward (from the `__getitem__` method in the `ImagesDataset` class):
```python
# Convert image to tensor and pre-process using transform
image = Image.open(image_path).convert("RGB")
if self.transform is not None:
    image = self.transform(image)
```

The `ImagesDataset` class takes a number of arguments, which you can explore by opening **images_dataset.py**. Take the time to look at them now.
1. **`transform`** - an [image transform](http://pytorch.org/docs/master/torchvision/transforms.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder.  I will define the transforms in the `transformer` variable.
2. **`directory`** - determines the directory to search for image files.
3. **`extensions`** - image file extensions to search for within the directory. 

After an image is loaded from the directory, it is pre-processed using the transform that was supplied when instantiating the dataset.
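To make the role of `directory` and `extensions` concrete, here is a minimal sketch of how a recursive search for image files could work. It is not the code in **images_dataset.py**; the helper name `find_image_paths` and the default extensions are assumptions made for illustration.

```python
from pathlib import Path


# Hypothetical helper, not taken from images_dataset.py.
def find_image_paths(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Return a sorted list of image file paths found anywhere below `directory`."""
    # Compare extensions case-insensitively so that "photo.JPG" is also matched
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")  # rglob walks all sub-directories
        if path.is_file() and path.suffix.lower() in allowed
    )
```

A dataset built on such a list can use its length for `__len__` and index into it in `__getitem__`, which is one simple way to return each image together with its path.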

In [19]:
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from images_dataset import ImagesDataset
from model import EncoderCNN
import pandas as pd

%load_ext autoreload
%autoreload 2



In [30]:
# import configs
from configs import images_dir, batch_size, embedding_size, image_resize

In [31]:
# Define a transform to pre-process the training images.
transformer = transforms.Compose(
    [
        transforms.Resize((image_resize, image_resize)),
        # convert the PIL Image to a tensor
        transforms.ToTensor(),
        # normalize image for pre-trained model
        transforms.Normalize(
            (0.485, 0.456, 0.406),
            (0.229, 0.224, 0.225),
        ),
    ]
)
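To see what the last two steps of this transform do numerically, the sketch below reproduces them with plain NumPy: `ToTensor` scales `uint8` pixel values to the range [0, 1] and moves the channel axis first, and `Normalize` then applies `(x - mean) / std` per channel. The function name and the uniform gray test image are assumptions for illustration only.

```python
import numpy as np

# Per-channel statistics used in the transform above
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_tensor_and_normalize(image_hwc_uint8):
    """Mimic ToTensor followed by Normalize for an H x W x 3 uint8 array."""
    # ToTensor: scale [0, 255] -> [0.0, 1.0] and reorder H x W x C -> C x H x W
    chw = image_hwc_uint8.astype(np.float32).transpose(2, 0, 1) / 255.0
    # Normalize: per-channel (x - mean) / std
    return (chw - mean[:, None, None]) / std[:, None, None]


# A uniform mid-gray 4 x 4 image (hypothetical input)
gray = np.full((4, 4, 3), 128, dtype=np.uint8)
normalized = to_tensor_and_normalize(gray)
print(normalized.shape)  # (3, 4, 4)
```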

In the next code cell, we define a `device` that we will use to move PyTorch tensors to the GPU (if CUDA is available).  Run this code cell before continuing.

In [32]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In the code cells below, I will initialize the dataset and data loader.  

In [33]:
dataset = ImagesDataset(directory=images_dir, transform=transformer)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
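The sketch below shows what a `DataLoader` yields when each dataset item is an `(image, path)` pair: the image tensors are stacked along a new batch dimension, while the paths are collected into a sequence of strings. `ToyImagesDataset` is a hypothetical stand-in that returns random tensors, so it runs without any image files; it is not the `ImagesDataset` class used in this notebook.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyImagesDataset(Dataset):
    """Hypothetical stand-in for ImagesDataset: random tensors instead of image files."""

    def __init__(self, num_images=5, image_size=8):
        self.paths = [f"images/sample_{i}.png" for i in range(num_images)]
        self.image_size = image_size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # A real dataset would open and transform the file at self.paths[index]
        image = torch.rand(3, self.image_size, self.image_size)
        return image, self.paths[index]


toy_loader = DataLoader(ToyImagesDataset(), batch_size=2, shuffle=False)
for images_batch, paths in toy_loader:
    # images_batch has shape (batch, 3, 8, 8); the last batch may be smaller
    print(images_batch.shape, list(paths))
```

With `shuffle=False`, the batches follow the dataset order, which keeps the extracted features aligned with a stable ordering of the image paths.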

In [34]:
len(dataset)

26

<a id='step2'></a>
## Step 2: Using the Data Loader to obtain Batches

The CNN encoder is implemented in the **model.py** file. The `EncoderCNN` class takes `embedding_size` as an input argument. For this project, I incorporated a pre-trained CNN into the encoder. Specifically, I used the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector before being passed through a `Linear` layer that projects it to a feature vector of size `embedding_size`.

![Encoder](assets/encoder.png)

You can amend the encoder in **model.py** to experiment with other architectures. In particular, using a [different pre-trained model architecture](http://pytorch.org/docs/master/torchvision/models.html) could be a good option. If you decide to modify the `EncoderCNN` class, save **model.py** and re-execute the code cell below.

Run the code cell below to instantiate the CNN encoder in `encoder`.  

In [35]:
# Initializing the encoder
encoder = EncoderCNN(embedding_size=embedding_size)
encoder.to(device)
encoder.eval()

EncoderCNN(
  (resnet): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          (0): Conv2d(64

<a id='step3'></a>
## Step 3: Extracting features of all dataset images using CNN Encoder

In this step, I will pass the batches of pre-processed images produced by the data loader through the encoder from **Step 2**, and then store the `features` of each image in a dataframe indexed by image path.

In [36]:
batch_dfs = []
for images_batch, paths in data_loader:
    images = images_batch.to(device)

    # Passing the inputs through the CNN model (no gradients are needed for inference)
    with torch.no_grad():
        features = encoder(images)

    # One row per image, indexed by the image path
    batch_dfs.append(pd.DataFrame(features.cpu().numpy(), index=paths))

# DataFrame.append was removed in pandas 2.0, so concatenate the per-batch frames instead
full_features_df = pd.concat(batch_dfs)

In [37]:
full_features_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| images/class_1/sample_202 (copy).png | -0.098064 | 0.196833 | 0.335805 | 0.242302 | -0.080422 | -0.125721 | 0.005529 | 0.117378 | -0.242434 | 0.03191 | ... | 1.004789 | -0.097422 | 0.318797 | -0.009749 | -0.412016 | 0.084764 | 0.57494 | -0.577919 | 0.289771 | 0.154638 |
| images/class_1/sample_202.png | -0.098064 | 0.196833 | 0.335805 | 0.242302 | -0.080422 | -0.125721 | 0.005529 | 0.117378 | -0.242434 | 0.03191 | ... | 1.004789 | -0.097422 | 0.318797 | -0.009749 | -0.412016 | 0.084764 | 0.57494 | -0.577919 | 0.289771 | 0.154638 |
| images/class_1/sample_296 (copy).png | -0.06136 | 0.310623 | 0.369337 | 0.361454 | 0.114996 | -0.385473 | 0.754855 | -0.300193 | -0.155391 | -0.039086 | ... | 0.555133 | -0.241673 | 0.067022 | 0.020646 | -0.088895 | 0.040543 | 0.482544 | -0.151116 | 0.308696 | -0.012785 |
| images/class_1/sample_296.png | -0.06136 | 0.310623 | 0.369337 | 0.361454 | 0.114996 | -0.385473 | 0.754855 | -0.300193 | -0.155391 | -0.039086 | ... | 0.555133 | -0.241673 | 0.067022 | 0.020646 | -0.088895 | 0.040543 | 0.482544 | -0.151116 | 0.308696 | -0.012785 |
| images/class_1/sample_326 (copy).png | 0.375747 | -0.083963 | 0.266924 | 0.060957 | 0.045091 | -0.139031 | 0.358977 | 0.306269 | -0.419578 | -0.157577 | ... | 0.668993 | -0.460801 | 0.15802 | 0.102513 | -0.066617 | -0.083304 | 0.616494 | -0.258457 | -0.010556 | 0.559132 |


In [38]:
full_features_df.shape

(26, 256)

In [39]:
full_features_df.to_pickle("features.pkl")

In the next notebook I will deduplicate images using these extracted features.
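As a preview of why these features are useful for deduplication, the sketch below computes pairwise cosine similarity between feature rows and reports the pairs that exceed a threshold. It is only one possible approach under stated assumptions, not necessarily the method used in the next notebook: the function names, the toy vectors, and the `0.99` threshold are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def near_duplicate_pairs(features, threshold=0.99):
    """Return index pairs (i, j) with i < j whose similarity is at least `threshold`."""
    sims = cosine_similarity_matrix(features)
    # Keep only the strict upper triangle so each pair is reported once
    rows, cols = np.where(np.triu(sims, k=1) >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy feature vectors: rows 0 and 1 are identical, row 2 is unrelated
toy = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [-3.0, 0.5, 1.0]])
print(near_duplicate_pairs(toy))
```

Applied to the saved dataframe, the same idea would use `full_features_df.to_numpy()` as the input and map the returned indices back to `full_features_df.index` to recover the image paths.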