# Data Processing

- Load image data using PyTorch
- Image transformations
- Preprocess images (resize, crop, normalize)

### Set up Drive

Run the following cell to mount your Drive in Colab. Open the URL it prints, log in, then copy and paste the authorization code when prompted. Once mounting finishes, "drive" should appear in the Files tab on the left.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Click the little triangle next to "drive" and navigate to the "AI4All Chest X-Ray Project" folder. Hover over the folder and click the 3 dots that appear on the right. Select "copy path" and replace `PASTE PATH HERE` with the path to your folder.

In [None]:
cd "PASTE PATH HERE"

### Import necessary libraries
Torchvision is PyTorch's computer vision package. It provides popular datasets, model architectures, and common image transformations.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random

from torch.utils.data import random_split, Subset

import torchvision
from torchvision import datasets, transforms

from utils.plotting import imshow_dataset
from utils.datahelper import calc_dataset_stats, get_random_image

### Set up paths
Define the paths and load the metadata.

In [None]:
path_to_dataset = 'data'

path_to_images = os.path.join(path_to_dataset, 'images')

metadata = pd.read_csv(os.path.join(path_to_dataset, 'metadata_train.csv'))

### Load images

**Torchvision's `ImageFolder` loads the data using sub-folder names as class labels**

Navigate to the "images" folder to see what this means.
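
As a rough illustration of this convention, the sketch below mimics how class labels can be derived from sub-folder names using only the standard library. The folder names `class_a` and `class_b` are hypothetical placeholders (not the folders in this project), and `folder_names_to_labels` is a helper made up for this example. It is a simplified stand-in for the sorted-name-to-index mapping that `ImageFolder` builds, not torchvision's actual code.

```python
import os
import tempfile


def folder_names_to_labels(root):
    """Map each sub-folder of `root` to an integer label, in sorted order."""
    class_names = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    return {name: index for index, name in enumerate(class_names)}


# Build a throwaway directory tree with two hypothetical class folders.
with tempfile.TemporaryDirectory() as root:
    for name in ("class_b", "class_a"):
        os.makedirs(os.path.join(root, name))
    example_mapping = folder_names_to_labels(root)

print(example_mapping)  # {'class_a': 0, 'class_b': 1}
```

The labels follow the alphabetical order of the folder names, not the order in which the folders were created.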


In [None]:
dataset = datasets.ImageFolder(path_to_images, transform=None)
dataset

In [None]:
# EXERCISE: Use the attribute .class_to_idx to see what our classes are


**Now let's take a look at the images themselves!**

Note: The `imshow_dataset` function is defined in the file `utils/plotting.py`.

In [None]:
# plots the first 5 images
imshow_dataset(dataset, n=5)

In [None]:
# plots 5 random images
imshow_dataset(dataset, n=5, rand=True)

> **Discuss with each other**
>
> What do you notice about the images? What are their dimensions?



### Transformations
The `transforms` module in torchvision defines various transformations that can be performed on an image.

Image transformations are used to pre-process images as well as to "augment" the data. (We will discuss data augmentation in another section.)


**Resize the image using transforms**

In [None]:
# get a random image from the dataset and resize it
im = get_random_image(dataset)
im = transforms.Resize(100)(im)
im

In [None]:
transforms.Resize(50)(im)
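
When `transforms.Resize` is given a single integer, as in the cells above, it matches the shorter edge of the image to that value and scales the longer edge to keep the aspect ratio. The sketch below approximates the resulting size in plain Python. The helper `resized_size` and the example sizes are hypothetical, and the exact rounding inside torchvision may differ slightly between versions.

```python
def resized_size(width, height, size):
    """Approximate the (width, height) produced by Resize(size) for an int size.

    The shorter edge becomes `size`; the longer edge is scaled by the same
    factor and truncated to an integer.
    """
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


# Hypothetical image sizes, given as (width, height).
print(resized_size(1024, 768, 100))  # (133, 100)
print(resized_size(768, 1024, 100))  # (100, 133)
print(resized_size(500, 500, 50))    # (50, 50)
```

Note that the output is only square when the input is square, which is why resizing alone does not give every image the same shape.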

**Try out other transformations**

How do these transformations alter the image?
- `transforms.ColorJitter`
- `transforms.RandomAffine`
- `transforms.RandomHorizontalFlip`  

You can [read more about these transformations here](https://pytorch.org/docs/stable/torchvision/transforms.html)




In [None]:
# EXERCISE: Apply different transformations to images and check out the output
#
# HINT: Use the code above as an example and try transforms functions such as RandomAffine



> **Discuss with each other**
> 
> Which transformations could be useful to normalize the dataset? Which transformations could be useful to add diversity to the dataset?

### Examine image dimensions

Run the code below to find the number of dimensions reported for each image.

> **Discuss with each other**
>
> Based on the image dimension, are the images greyscale or color images?

In [None]:
# PIL's .size is a (width, height) tuple for each image
im_sizes = [d[0].size for d in dataset]

# number of entries in each size tuple
dimensions = {len(s) for s in im_sizes}

print(f'Dimensions in dataset: {dimensions}')

Compare the X-ray images to a color photograph.

In [None]:
# Answer the above question before running this block!

from skimage import io
color_image = io.imread('https://unsplash.com/photos/twukN12EN7c/download')
io.imshow(color_image)
print(f'Random color image shape: {color_image.shape}')
print(f'Random xray image shape: {get_random_image(dataset).size}')
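
One caveat when reading the comparison above: a PIL image's `.size` reports only width and height, while a NumPy array's `.shape` also includes the channel axis for color images. The sketch below uses two tiny synthetic images (not images from this dataset) to show where the channel count is visible. Also note that `ImageFolder`'s default loader converts each file to RGB when loading, so a greyscale X-ray can end up with three identical channels.

```python
import numpy as np
from PIL import Image

# Two tiny synthetic images, 4 pixels wide and 3 pixels tall.
grey = Image.new("L", (4, 3))     # "L" = 8-bit greyscale, one channel
color = Image.new("RGB", (4, 3))  # three channels: red, green, blue

for name, im in [("grey", grey), ("color", color)]:
    # .size is always (width, height);
    # the array shape is (height, width) or (height, width, channels).
    print(name, im.size, im.mode, np.asarray(im).shape)

# Converting to RGB adds channels without changing .size.
grey_as_rgb = grey.convert("RGB")
print(grey_as_rgb.size, np.asarray(grey_as_rgb).shape)
```

Checking `.mode` or the array shape is therefore a more direct way to tell greyscale from color than the length of `.size`.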

**How much do image shapes and sizes vary in the dataset?**

Run the code below to print the image dimensions for a set of random images.

In [None]:
im_num = 10
rand_indices = random.sample(range(len(dataset)), im_num)
subset = Subset(dataset, rand_indices)

print(f'Image dimensions for {im_num} random images')
for d in subset:
    print(d[0].size)

**Smallest dimension measurements**

Calculate the smallest image width and height in the dataset.

In [None]:
# EXERCISE: calculate the smallest image width and smallest image height in the
# dataset
#
# HINT: look at blocks above for useful code, use min() to find minimum in a list


> **Discuss with each other**
> 
> How should we resize and crop the images? How do the smallest image width and smallest image height constrain our strategy?

### Resize and crop

To make the images the same shape and size for the learning model, we can apply image transformations when loading the data.

`transforms.Compose` chains a list of image transformations into a single transform, which applies them to each image in the order given.
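
To see why the order matters, here is a minimal sketch that tracks only the image size, as `(width, height)`, through a pipeline. `ComposeSketch`, `resize_short_edge`, and `center_crop` are hypothetical stand-ins written for this example, not torchvision code, and the values 300 and 200 are arbitrary rather than suggested answers to the exercise.

```python
class ComposeSketch:
    """Apply a list of callables one after another, like a pipeline."""

    def __init__(self, steps):
        self.steps = steps

    def __call__(self, value):
        for step in self.steps:
            value = step(value)
        return value


def resize_short_edge(size):
    """Return a step that scales (width, height) so the shorter edge is `size`."""
    def step(shape):
        width, height = shape
        if width <= height:
            return size, int(size * height / width)
        return int(size * width / height), size
    return step


def center_crop(size):
    """Return a step that crops (width, height) to a size x size square."""
    def step(shape):
        width, height = shape
        # This sketch assumes the input is at least size x size.
        assert width >= size and height >= size
        return size, size
    return step


resize_then_crop = ComposeSketch([resize_short_edge(300), center_crop(200)])
crop_then_resize = ComposeSketch([center_crop(200), resize_short_edge(300)])

print(resize_then_crop((1024, 768)))  # (200, 200)
print(crop_then_resize((1024, 768)))  # (300, 300)
```

Both pipelines produce a square image, but they keep different parts of the original: cropping first discards most of a large image before it is scaled.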

In [None]:
# EXERCISE: set resize and crop parameters based on your observations above

resize_value = # HERE #
crop_value = # HERE #


# compose transformations
data_transforms = transforms.Compose([
        transforms.Resize(resize_value),
        transforms.CenterCrop(crop_value)])

dataset = datasets.ImageFolder(path_to_images, transform=data_transforms)

In [None]:
# EXERCISE: compare the images with and without transformation applied. 



In [None]:
# EXERCISE: try applying another list of transformations and compare the results



### Normalize images

**Calculate the pixel intensity mean and standard deviation across all images in the dataset.**

Note: This code takes some time to run. The expected output is:

- Mean: 0.544
- Standard Deviation: 0.237

In [None]:
data_transforms = transforms.Compose([
        transforms.Resize(resize_value),
        transforms.CenterCrop(crop_value),
        transforms.ToTensor()])

dataset = datasets.ImageFolder(path_to_images, transform=data_transforms)

data_mean, data_std = calc_dataset_stats(dataset)
print(f'Mean: {data_mean:.3f}, Standard Deviation: {data_std:.3f}')
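
The helper `calc_dataset_stats` is defined in `utils/datahelper.py` and is not shown here. As a rough idea of what such a calculation involves, the sketch below computes the mean and standard deviation over every pixel of every image using running sums. The function `pixel_mean_and_std` and the two tiny example images are hypothetical, and the real helper may work differently.

```python
import numpy as np


def pixel_mean_and_std(images):
    """Mean and (population) standard deviation over all pixels of all images.

    Accumulates running sums so only one image is processed at a time.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for im in images:
        pixels = np.asarray(im, dtype=np.float64)
        count += pixels.size
        total += pixels.sum()
        total_sq += np.square(pixels).sum()
    mean = total / count
    std = np.sqrt(total_sq / count - mean ** 2)
    return mean, std


# Two hypothetical 2x2 single-channel images with values already in [0, 1].
example_images = [
    np.array([[0.0, 0.5], [0.5, 1.0]]),
    np.array([[0.25, 0.25], [0.75, 0.75]]),
]
example_mean, example_std = pixel_mean_and_std(example_images)
print(f'Mean: {example_mean:.3f}, Standard Deviation: {example_std:.3f}')
```

Because every image has to be loaded and transformed once, a calculation like this is slow on a full dataset.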

**Add normalization to the transformation list**

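`transforms.Normalize` operates on tensors, so it is added after the `transforms.ToTensor` step. `transforms.Grayscale` is added first so that each image has a single channel, matching the single mean and standard deviation.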
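
The arithmetic behind these two steps can be sketched with NumPy. `transforms.ToTensor` scales 8-bit pixel values from 0..255 to 0..1 (and moves the channel axis to the front), and `transforms.Normalize` then subtracts the mean and divides by the standard deviation for each channel. The 2x2 image below is a made-up example; the mean and standard deviation are the values reported earlier in this section.

```python
import numpy as np

# Values reported for this dataset earlier in the section.
mean_value, std_value = 0.544, 0.237

# A hypothetical 2x2 greyscale image with 8-bit pixel values.
raw = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Step 1 (the value scaling done by ToTensor): 0..255 becomes 0..1.
scaled = raw.astype(np.float32) / 255.0

# Step 2 (what Normalize does): subtract the mean, divide by the std.
normalized = (scaled - mean_value) / std_value

print(normalized)
```

After normalization, pixels darker than the dataset mean are negative and brighter pixels are positive, so the values are roughly centered on zero.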

In [None]:
data_transforms = transforms.Compose([
        transforms.Grayscale(),
        transforms.Resize(resize_value),
        transforms.CenterCrop(crop_value),
        transforms.ToTensor(),
        transforms.Normalize(mean=[data_mean], std=[data_std])])

dataset = datasets.ImageFolder(path_to_images, transform=data_transforms)

In [None]:
# EXERCISE: compare the images with all the transformations applied.

