# Feature Extraction
See this on [GitHub](https://github.com/yinleon/doppler_tutorials/blob/master/2-feature-extraction.ipynb), [NbViewer](https://nbviewer.jupyter.org/github/yinleon/doppler_tutorials/blob/master/2-feature-extraction.ipynb)<br>
By Jansen Derr 2021-02-22<br>

In order to power the functions of the Doppler, we need to transform the images we just downloaded into searchable features. We use a neural network that has already been trained on a task to create convolutional features, which this tutorial calls logits. Logits are learned representations of [shapes, colors and patterns](https://distill.pub/2017/feature-visualization/) that neural networks use to differentiate between types of images; in the original network they feed a final linear (fully connected) classification layer. We discard that final layer, so we are left with just the logits. The distance between the logits of a new image and those of all existing images determines relevance in the image search engine. These same relationships are used to cluster and grid images, which we use for mosaic analysis.
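
To make the search idea concrete, here is a minimal sketch of ranking images by the distance between feature vectors. The three-value vectors and file names are made up for illustration (real ResNet50 features have 2048 values), and Euclidean distance is an assumption; the metric used by the search engine is not specified in this section.

```python
import math

# Hypothetical, tiny feature vectors keyed by image name.
# Real ResNet50 features would have 2048 values each.
features = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.9, 0.7],
}

def rank_by_distance(query, features):
    """Return image names sorted from closest to farthest from `query`."""
    distances = {name: math.dist(query, vec) for name, vec in features.items()}
    return sorted(distances, key=distances.get)

# Features of a new (hypothetical) query image.
query = [0.9, 0.1, 0.05]
ranking = rank_by_distance(query, features)
print(ranking)
```

The closest entries in the ranking are the most relevant search results for the query image.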

For this step, called `feature extraction`, we use ResNet50 pre-trained on ImageNet.

In [None]:
import os
import json
import copy
import time
import requests
import shutil
from io import BytesIO

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from sklearn.neighbors import NearestNeighbors
import joblib

import torch
from torch import nn
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

from config import cols_conv_feats, skip_hash
from image_utils import read_image, read_and_transform_image

In [None]:
# this notebook needs version >= 0.4.0
torch.__version__

In [None]:
# Are we using a GPU? If not, the device falls back to the CPU.
# Note: 'cuda:1' selects the second GPU; use 'cuda:0' on a single-GPU machine.
device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
device

In [None]:
import config

In [None]:
for _dir in [config.working_dir, config.media_dir]:
    os.makedirs(_dir, exist_ok=True)

# Feature Extraction <a id='features'></a>
This section converts raw data into ML-friendly data. In this context, that means reading images (from disk or the web) and transforming them into logits formatted as PyTorch Tensors.

In [None]:
df = pd.read_csv(config.image_lookup_file, 
                  compression='gzip')

In [None]:
df = df[~df['d_hash'].isin(skip_hash)] 
len(df)

In [None]:
df.head(2)

In order to read images into PyTorch, they need to be [Tensors](https://pytorch.org/docs/stable/tensors.html) with standardized dimensions.<br>For batches of images, PyTorch models expect the dimensions (`batch_size`, `number_of_color_channels`, `height`, `width`).

When using models that have already been trained, the new inputs need to resemble the input of the original model. For ResNet50, each image is (3 x 224 x 224). The first dimension (`batch_size`) can be adjusted.

torchvision's `transforms` submodule is useful for resizing images, normalizing values and converting the image (which is read into Pillow and NumPy) into a PyTorch tensor.

In [None]:
# The image needs to be specific dimensions, normalized, and converted to a Tensor to be read into a PyTorch model.
scaler = transforms.Resize((224, 224))
to_tensor = transforms.ToTensor()
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                  std=[0.229, 0.224, 0.225])

# this is the order of operations that will occur on each image.
transformations = transforms.Compose([scaler, 
                                      to_tensor, 
                                      normalizer])
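
As a rough illustration of what `to_tensor` and `normalizer` do to pixel values, the sketch below mirrors the arithmetic in plain Python on a made-up 1 x 2 RGB image. The helper names are hypothetical and this is not how torchvision is implemented; it only follows the documented behavior: `ToTensor` rescales 8-bit values to the range [0, 1] and reorders height x width x channel data into channel x height x width, and `Normalize` computes `(value - mean) / std` per channel. Resizing is omitted.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def to_tensor_like(image_hwc):
    """Scale 0-255 values to 0-1 and reorder (H, W, C) -> (C, H, W)."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [
        [[image_hwc[h][w][c] / 255 for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]

def normalize_like(image_chw, mean, std):
    """Apply (value - mean) / std separately to each channel."""
    return [
        [[(value - m) / s for value in row] for row in channel]
        for channel, m, s in zip(image_chw, mean, std)
    ]

# A made-up image with 1 row and 2 pixels: pure white, then pure black.
image = [[[255, 255, 255], [0, 0, 0]]]

scaled = to_tensor_like(image)
normalized = normalize_like(scaled, MEAN, STD)

# Number of channels, height, width after the layout change.
print(len(normalized), len(normalized[0]), len(normalized[0][0]))
# Red channel: white pixel, then black pixel.
print(normalized[0][0])
```

The mean and standard deviation values are the ones used in the cell above, so that new images are scaled the same way as the images the pre-trained model saw.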

These operations are called within the `read_and_transform_image` function (imported from `image_utils`), which can operate on images on disk or on the web.
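
The source of `read_and_transform_image` is not shown in this section. As a loose sketch of the "disk or web" part only, a loader might branch on whether the source looks like a URL. The names `is_url` and `load_image_bytes` are hypothetical, and the standard library's `urllib` is used here to keep the example dependency-free; the actual helper may work differently.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen

def is_url(source):
    """True when `source` looks like an http(s) URL rather than a local path."""
    return urlparse(str(source)).scheme in ("http", "https")

def load_image_bytes(source):
    """Return the raw bytes of an image, from a URL or a local file, as BytesIO."""
    if is_url(source):
        with urlopen(source, timeout=10) as response:
            return BytesIO(response.read())
    with open(source, "rb") as f:
        return BytesIO(f.read())
```

With Pillow, the returned buffer could then be passed to `Image.open` before applying `transformations`.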

Using `read_and_transform_image`, we can convert this local image...

In [None]:
img_file = df.f_img.iloc[0]
read_image(img_file)

into a PyTorch Tensor for ResNet50

In [None]:
read_and_transform_image(img_file, transformations)

But the above only operates on one image. To efficiently transform many images, use [datasets and dataloaders](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html). Although typically used in training, datasets and dataloaders help parallelize transformations and iterate through the input in batches.

In [None]:
class Feature_Extraction_Dataset(Dataset):
    """Dataset wrapping images and file names
    img_col is the column for the image to be read
    index_col is a unique value to index the extracted features
    """
    def __init__(self, df, img_col, index_col):
        # drop duplicate images (rows sharing a d_hash) and reset the index
        # so that positional lookups in __getitem__ work.
        self.X_train = df.drop_duplicates(subset='d_hash').reset_index(drop=True)
        self.files = self.X_train[img_col]
        self.idx = self.X_train[index_col]

    def __getitem__(self, index):
        img_idx = self.idx[index]
        img_file = self.files[index]
        try:
            img = read_and_transform_image(img_file, transformations)
            return img, img_file, img_idx
        except Exception:
            # NOTE: an unreadable image yields None, which the DataLoader's
            # default collate function cannot batch.
            return None

    def __len__(self):
        return len(self.X_train.index)

In [None]:
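# hashes whose features were already extracted, so that reruns skip them
abd = []
if os.path.exists(config.logits_file):
    # the file is written with gzip compression below, so read it the same way
    abd = pd.read_csv(config.logits_file, 
                      index_col=0,
                      compression='gzip').index.tolist()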

In [None]:
dataset = Feature_Extraction_Dataset(df[~df['d_hash'].isin(abd)], 
                                     img_col='f_img', 
                                     index_col='d_hash')
data_loader = DataLoader(dataset,
                         batch_size=config.batch_size,
                         shuffle=False,
                         num_workers=config.num_workers)
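
Because `__getitem__` returns `None` when an image cannot be read, PyTorch's default batching step would fail on a batch that contains such an item. One possible workaround, sketched below, is a collate function that drops failed items before batching. `make_skip_none_collate` is a hypothetical helper, not part of PyTorch; the base collate function is passed in as an argument so the sketch runs without PyTorch installed.

```python
def make_skip_none_collate(base_collate):
    """Wrap `base_collate` so that items equal to None are dropped first.

    Returns None for a batch in which every item failed, so the caller
    has to skip such batches.
    """
    def collate(batch):
        kept = [item for item in batch if item is not None]
        if not kept:
            return None
        return base_collate(kept)
    return collate

# Stand-in for PyTorch's default collate: turn a list of
# (img, img_file, idx) items into three parallel tuples.
def toy_collate(batch):
    return tuple(zip(*batch))

collate = make_skip_none_collate(toy_collate)
batch = [("img_a", "a.jpg", "hash_a"), None, ("img_c", "c.jpg", "hash_c")]
print(collate(batch))
```

With PyTorch, the wrapper could be passed to the loader as `collate_fn=make_skip_none_collate(default_collate)`, where `default_collate` is imported from `torch.utils.data.dataloader`.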

Next, load ResNet50 pre-trained on ImageNet.

In [None]:
def load_resnet_for_feature_extraction():
    # Load a pre-trained model
    # (newer torchvision releases prefer the `weights=` argument over `pretrained=True`)
    res50_model = models.resnet50(pretrained=True)

    # Drop the final fully connected layer. What remains outputs the
    # pooled convolutional features.
    res50_conv = nn.Sequential(*list(res50_model.children())[:-1])
    res50_conv.to(device)

    # Freeze the weights so that no gradients are computed.
    for param in res50_conv.parameters():
        param.requires_grad = False

    # We won't be training the model, only running inference, so switch to "eval" mode.
    res50_conv.eval()
    
    return res50_conv

In [None]:
res50_conv = load_resnet_for_feature_extraction()

Now iterate through the dataset using a `data_loader`, and convert each batch of images into convolutional features. If memory is an issue, reduce `batch_size` in the `data_loader`. Data loaders are iterators. In most use cases they return an input (`X`) and a target (`y`) to fit a PyTorch model. We, however, are not fitting a model, but rather using the data loader as a crucial transformation step in our data pipeline. Thus we return unusual values instead of a target: the path of the image (`img_file`) and the hash (`idx`). `X` is a batch of image Tensors.

In [None]:
for (X, img_file, idx) in tqdm(data_loader):
    X = X.to(device)
    logits = res50_conv(X)
    # logits.size() -> [batch_size, 2048, 1, 1]
    
    # remove the two trailing singleton dimensions
    logits = logits.squeeze(2)
    logits = logits.squeeze(2)
    # logits.size() -> [batch_size, 2048]
    
    n_dimensions = logits.size(1)
    logits_dict = dict(zip(idx, logits.detach().cpu().numpy()))
    # {'d_hash': np.array([x0, x1, ... x2047])}
    
    df_conv = pd.DataFrame.from_dict(logits_dict, 
                                     columns=cols_conv_feats, 
                                     orient='index')
    # add a column for the filename of images...
    df_conv['f_img'] = img_file
    
    # write to file
    if os.path.exists(config.logits_file):
        df_conv.to_csv(config.logits_file, mode='a', 
                       header=False, compression='gzip')
    else:
        df_conv.to_csv(config.logits_file, compression='gzip')

**NOTE**: Re-run feature extraction whenever new images are added; their features are appended to the `logits_file` CSV.
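
The append-and-resume pattern can be sketched with the standard library alone. The example below writes two batches to a gzip-compressed CSV in append mode, then reads back the first column to find which hashes are already stored. The file name, hashes, values, and helper names are made up; the notebook itself uses pandas for both steps. In this sketch, appending works because each write adds a separate gzip member, and Python's `gzip` module reads concatenated members as one stream.

```python
import csv
import gzip
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a gzip CSV, writing the header only for a new file."""
    is_new = not os.path.exists(path)
    with gzip.open(path, "at", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)

def stored_hashes(path):
    """Return the set of values in the first column, skipping the header."""
    if not os.path.exists(path):
        return set()
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return {row[0] for row in reader}

path = os.path.join(tempfile.mkdtemp(), "logits_demo.csv.gz")
header = ["d_hash", "conv_0", "conv_1"]

# Two separate runs, each appending one batch of features.
append_rows(path, header, [["hash_a", 0.1, 0.2], ["hash_b", 0.3, 0.4]])
append_rows(path, header, [["hash_c", 0.5, 0.6]])

# On the next run, only hashes that are not stored yet need processing.
done = stored_hashes(path)
todo = [h for h in ["hash_a", "hash_c", "hash_d"] if h not in done]
print(sorted(done), todo)
```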

Now each image is converted into an array of floats. We keep the hash (`d_hash`) in the index and the file name in the `f_img` column to refer back to the metadata later.