# Dataset and Feature Extraction

In this notebook, I will explain how to load and pre-process images using the PyTorch `Dataset` and `DataLoader` classes. Then, I will extract encoded features for each image using a CNN.

Outline of this notebook:
- [Step 1](#step1): Writing Custom PyTorch Dataset
- [Step 2](#step2): Using the Data Loader to obtain Batches
- [Step 3](#step3): Extracting features of all dataset images using CNN Encoder

<a id='step1'></a>
## Step 1: Writing Custom PyTorch Dataset

I wrote a custom PyTorch [Dataset](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) that recursively loads all images in a directory together with their paths. This dataset is an instance of my custom `ImagesDataset` class in **images_dataset.py**. If you are unfamiliar with data loaders and datasets, you are encouraged to review [my post](http://www.sefidian.com/2022/03/09/writing-custom-datasets-and-dataloader-in-pytorch/) or [this PyTorch tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

#### Exploring the `__getitem__` Method

The `__getitem__` method in the `ImagesDataset` class determines how each (image, path) pair is pre-processed before being incorporated into a batch. This is true for all `Dataset` classes in PyTorch; if this is unfamiliar to you, please review [this link](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

#### Image Pre-Processing 

Image pre-processing is relatively straightforward (from the `__getitem__` method in the `ImagesDataset` class):
```python
# Convert image to tensor and pre-process using transform
image = Image.open(image_path).convert("RGB")
if self.transform is not None:
    image = self.transform(image)
```

The `ImagesDataset` class takes the following input arguments, which are defined in **images_dataset.py**. Take the time to explore these arguments now by opening that file.
1. **`transform`** - an [image transform](http://pytorch.org/docs/master/torchvision/transforms.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder. I will define the transforms in the `transformer` variable.
2. **`directory`** - determines the directory to search for image files.
3. **`extensions`** - image file extensions to search for within the directory. 

After an image is loaded from the directory, it is pre-processed using the transform that was supplied when instantiating the dataset.
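To make the three arguments concrete, here is a minimal sketch of what such a dataset could look like. It is not the code in **images_dataset.py**: the class name `ImagesDatasetSketch`, the default extensions, and the assumption that extensions are written with a leading dot are all hypothetical.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ImagesDatasetSketch(Dataset):
    """Hypothetical stand-in for ImagesDataset (not the code in images_dataset.py)."""

    def __init__(self, directory, extensions=(".jpg", ".png"), transform=None):
        self.transform = transform
        # Assumption: extensions are given with a leading dot, e.g. ".jpg".
        allowed = {ext.lower() for ext in extensions}
        # Walk the directory tree and keep files with a matching extension.
        self.image_paths = sorted(
            str(path)
            for path in Path(directory).rglob("*")
            if path.is_file() and path.suffix.lower() in allowed
        )

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        image_path = self.image_paths[index]
        # Same pre-processing as the snippet shown earlier in this step.
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # Return the path with the image so features can be traced back to files.
        return image, image_path
```

Returning the path alongside the image is what later allows each feature vector to be indexed by its file name.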

In [1]:
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from images_dataset import ImagesDataset
from model import EncoderCNN
import pandas as pd
from tqdm.notebook import tqdm

%load_ext autoreload
%autoreload 2

In [2]:
# import configs
from configs import images_dir, batch_size, embedding_size, image_resize, extensions

In [3]:
# Define a transform to pre-process the images.
transformer = transforms.Compose(
    [
        transforms.Resize((image_resize, image_resize)),
        # convert the PIL Image to a tensor
        transforms.ToTensor(),
        # normalize image for pre-trained model
        transforms.Normalize(
            (0.485, 0.456, 0.406),
            (0.229, 0.224, 0.225),
        ),
    ]
)
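The two tuples passed to `Normalize` are the per-channel mean and standard deviation commonly used with ImageNet pre-trained models. As a rough illustration of the arithmetic only, the sketch below reproduces the normalization in NumPy on a made-up 2x2 image; the array `image` is a stand-in for the output of `ToTensor()`, not real data from this notebook.

```python
import numpy as np

# Per-channel mean and standard deviation, as passed to Normalize above.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

# Stand-in for the output of ToTensor(): channels first, values in [0, 1].
image = np.full((3, 2, 2), 0.5, dtype=np.float32)

# Normalize computes (value - mean) / std independently for each channel.
normalized = (image - mean) / std
print(normalized[:, 0, 0])
```

Each channel is shifted and scaled independently, so a uniform gray input ends up with a different value in each channel.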

In the next code cell, we define a `device` that we will use to move PyTorch tensors to the GPU (if CUDA is available). Run this code cell before continuing.

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In the code cells below, I will initialize the dataset and data loader.  

In [5]:
dataset = ImagesDataset(
    directory=images_dir, extensions=extensions, transform=transformer
)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

In [6]:
len(dataset)

8
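Because each dataset item is an (image, path) pair, the `DataLoader` yields batches made of a stacked image tensor and a matching sequence of paths. The sketch below illustrates this with a hypothetical toy dataset (`ToyPairs`, with made-up file names and blank 8x8 images) so that it runs without any image files.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyPairs(Dataset):
    """Hypothetical dataset returning (image tensor, path) pairs."""

    def __init__(self, n_items):
        self.paths = [f"images/img_{i}.jpg" for i in range(n_items)]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # A blank 3-channel image stands in for a pre-processed photo.
        return torch.zeros(3, 8, 8), self.paths[index]


loader = DataLoader(ToyPairs(5), batch_size=2, shuffle=False)

# Image tensors are stacked along a new batch dimension; paths stay strings.
batches = [(images.shape, paths) for images, paths in loader]
for shape, paths in batches:
    print(tuple(shape), list(paths))
```

With five items and a batch size of two, the last batch holds a single item, which is why downstream code should not assume a fixed batch dimension.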

<a id='step2'></a>
## Step 2: Using the Data Loader to obtain Batches

The implementation of the CNN encoder is in the **model.py** file. The `EncoderCNN` class takes `embedding_size` as an input argument. For this project, I incorporated a pre-trained CNN into the encoder. Specifically, I used the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector before being passed through a `Linear` layer that transforms the feature vector to size `embedding_size`.

![Encoder](assets/encoder.png)

You can amend the encoder in **model.py** to experiment with other architectures. In particular, using a [different pre-trained model architecture](http://pytorch.org/docs/master/torchvision/models.html) could be a good option. If you decide to modify the `EncoderCNN` class, save **model.py** and re-execute the code cell.

Run the code cell below to instantiate the CNN encoder in `encoder`.  

In [7]:
# Initializing the encoder
encoder = EncoderCNN(embedding_size=embedding_size)
encoder.to(device)
encoder.eval()

EncoderCNN(
  (resnet): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          (0): Conv2d(64

<a id='step3'></a>
## Step 3: Extracting features of all dataset images using CNN Encoder

In this step, I will pass the pre-processed images from the batch in **Step 2** of this notebook through the encoder, and then store `features` for each image in a dataframe.

In [8]:
# Collect one DataFrame per batch and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0).
batch_dfs = []
for images_batch, paths in tqdm(data_loader):
    images = images_batch.to(device)

    # Passing the inputs through the CNN model; no gradients are needed for inference
    with torch.no_grad():
        features = encoder(images)
    # One row per image, indexed by the image path
    batch_dfs.append(pd.DataFrame(features.cpu().numpy(), index=list(paths)))
full_features_df = pd.concat(batch_dfs)

  0%|          | 0/1 [00:00<?, ?it/s]

In [9]:
full_features_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| images/photo_2022-06-16_17-29-00.jpg | 0.199145 | 0.246505 | 0.0028 | 0.323195 | 0.234513 | 0.09212 | -0.434587 | 0.268099 | -0.224246 | -0.208385 | ... | -0.248163 | 0.146117 | 0.035693 | -0.022854 | -0.256254 | 0.216303 | -0.33438 | 0.647641 | -0.04866 | 0.621903 |
| images/photo_2022-06-16_17-29-01 (2) (3rd copy).jpg | -0.297781 | 0.430951 | 0.131061 | 0.092694 | 0.336686 | -0.097079 | -0.509302 | -0.057603 | -0.46872 | -0.120154 | ... | 0.184711 | 0.129619 | -0.344921 | 0.355836 | -0.158022 | 0.177382 | -0.185563 | 0.337501 | -0.328419 | 0.472292 |
| images/photo_2022-06-16_17-29-01 (2) (another copy).jpg | -0.297781 | 0.430951 | 0.131061 | 0.092694 | 0.336686 | -0.097079 | -0.509302 | -0.057603 | -0.46872 | -0.120154 | ... | 0.184711 | 0.129619 | -0.344921 | 0.355836 | -0.158022 | 0.177382 | -0.185563 | 0.337501 | -0.328419 | 0.472292 |
| images/photo_2022-06-16_17-29-01 (2) (copy).jpg | -0.297781 | 0.430951 | 0.131061 | 0.092694 | 0.336686 | -0.097079 | -0.509302 | -0.057603 | -0.46872 | -0.120154 | ... | 0.184711 | 0.129619 | -0.344921 | 0.355836 | -0.158022 | 0.177382 | -0.185563 | 0.337501 | -0.328419 | 0.472292 |
| images/photo_2022-06-16_17-29-01 (2).jpg | -0.297781 | 0.430951 | 0.131061 | 0.092694 | 0.336686 | -0.097079 | -0.509302 | -0.057603 | -0.46872 | -0.120154 | ... | 0.184711 | 0.129619 | -0.344921 | 0.355836 | -0.158022 | 0.177382 | -0.185563 | 0.337501 | -0.328419 | 0.472292 |


In [10]:
full_features_df.shape

(8, 256)

In [11]:
full_features_df.to_pickle("features.pkl")
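The saved file can be read back with `pd.read_pickle`. The sketch below shows the round trip on a small made-up feature table written to a temporary directory, so it does not touch the real **features.pkl**; the paths and values are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in feature table: two image paths, four feature columns.
features_df = pd.DataFrame(
    np.arange(8, dtype=np.float32).reshape(2, 4),
    index=["images/a.jpg", "images/b.jpg"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "features.pkl")
    features_df.to_pickle(pickle_path)
    # Reading the pickle restores the values, dtypes, and the path index.
    restored_df = pd.read_pickle(pickle_path)

print(restored_df.equals(features_df))
```

Pickle preserves the path index and the float dtype, which keeps the link between each feature vector and its image file intact.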

In the next notebook, I will deduplicate images using these extracted features.