TODO: Prepare dataset to cut-out faces

C4W3L04 Convolutional Implementation Sliding Windows - Idea: https://www.youtube.com/watch?v=XdsmlBGOK-k
- HRNetV2
- OpenSeeFace
- Dlib facial landmark predictor
- Very interesting: https://ai.plainenglish.io/facial-landmarks-detection-using-xception-net-908b8b80f758
- https://github.com/zmurez/MediaPipePyTorch

## Challenges
NaimishNet didn't work well. It didn't seem to understand what a face with mouth, nose, eyes and eyebrows really is. It seems that it simply learnt the average positions of the facial keypoints. It captured neither the exact positions of the keypoints nor their orientation, e.g. when a person turns their head or leans to the side. Likely reasons:

- The images in the NaimishNet paper were cropped such that only faces with almost no background were presented to the NN.
- NaimishNet had to learn only 15 keypoints and not 68. Particularly the outer keypoints often look randomly placed. It is often hard to see any features, especially when neck and chin are hard to distinguish.
- So there is much more noise in our pictures.
- The quality of a few pictures is very poor and the keypoints are hardly visible, even to human eyes.

Consequently...
- The final pipeline will find the faces first and then look for the facial keypoints.
- So the images are cropped to the face areas using the keypoints before training.
- Design and train the keypoint detector for 128x128 rather than 224x224 pixels.
- Decided to train with color images, because using grayscale only neglects valuable information.
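
The cropping step can be pictured with a small NumPy sketch. This is a simplified variant for illustration, not the `FaceCrop` transform used in this notebook: the helper name `square_crop_box`, the `margin` parameter and the dummy data are assumptions.

```python
import numpy as np

def square_crop_box(key_pts, img_w, img_h, margin=1.5):
    """Return (left, top, side) of a square box around the keypoints, kept inside the image."""
    x_min, y_min = key_pts.min(axis=0)
    x_max, y_max = key_pts.max(axis=0)
    # side length: larger keypoint extent times the margin, but never larger than the image
    side = int(min(max(x_max - x_min, y_max - y_min) * margin, img_w, img_h))
    # center the box on the keypoint bounding box, then shift it back inside the image
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    left = int(np.clip(cx - side / 2, 0, img_w - side))
    top = int(np.clip(cy - side / 2, 0, img_h - side))
    return left, top, side

image = np.zeros((200, 300, 3), dtype=np.uint8)  # dummy RGB image, 300 wide and 200 high
key_pts = np.array([[100.0, 60.0], [140.0, 60.0], [120.0, 100.0]])  # dummy (x, y) keypoints

left, top, side = square_crop_box(key_pts, img_w=300, img_h=200)
face = image[top:top + side, left:left + side]
cropped_pts = key_pts - [left, top]  # keypoints move with the crop origin
print(face.shape, cropped_pts)
```

The keypoints have to be shifted by the crop origin, otherwise they no longer line up with the cropped image.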


## Define the Convolutional Neural Network

After you've looked at the data you're working with and, in this case, know the shapes of the images and of the keypoints, you are ready to define a convolutional neural network that can *learn* from this data.

In this notebook and in `models.py`, you will:
1. Define a CNN with images as input and keypoints as output
2. Construct the transformed FaceKeypointsDataset, just as before
3. Train the CNN on the training data, tracking loss
4. See how the trained model performs on test data
5. If necessary, modify the CNN structure and model hyperparameters, so that it performs *well* **\***

**\*** What does *well* mean?

"Well" means that the model's loss decreases during training **and**, when applied to test image data, the model produces keypoints that closely match the true keypoints of each face. And you'll see examples of this later in the notebook.

---


## CNN Architecture

Recall that CNNs are defined by a few types of layers:
* Convolutional layers
* Maxpooling layers
* Fully-connected layers

You are required to use the above layers and encouraged to add multiple convolutional layers and things like dropout layers that may prevent overfitting. You are also encouraged to look at literature on keypoint detection, such as [this paper](https://arxiv.org/pdf/1710.00977.pdf), to help you determine the structure of your network.


### TODO: Define your model in the provided `models.py` file

This file is mostly empty but contains the expected name and some TODO's for creating your model.

---

## PyTorch Neural Nets

To define a neural network in PyTorch, you define the layers of a model in the function `__init__` and define the feedforward behavior of a network that employs those initialized layers in the function `forward`, which takes in an input image tensor, `x`. The structure of this Net class is shown below and left for you to fill in.

Note: During training, PyTorch will be able to perform backpropagation by keeping track of the network's feedforward behavior and using autograd to calculate the update to the weights in the network.
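
As a rough illustration of this `__init__`/`forward` split, here is a minimal sketch. It is not the network defined in `models.py`: the class name `TinyKeypointNet`, the layer sizes, the 3x128x128 input and the 14 output keypoints are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyKeypointNet(nn.Module):
    """Illustrative only: 3x128x128 input, 14 keypoints (28 values) output."""

    def __init__(self, in_channels=3, n_keypoints=14):
        super().__init__()
        # layers with trainable weights (and dropout) are created once here
        self.conv1 = nn.Conv2d(in_channels, 32, 3)  # 128 -> 126
        self.conv2 = nn.Conv2d(32, 64, 3)           # 63 -> 61
        self.pool = nn.MaxPool2d(2, 2)              # halves the size (rounding down)
        self.drop = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(64 * 30 * 30, n_keypoints * 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # -> 32 x 63 x 63
        x = self.pool(F.relu(self.conv2(x)))  # -> 64 x 30 x 30
        x = torch.flatten(x, 1)               # keep the batch dimension
        return self.fc1(self.drop(x))         # one (x, y) pair per keypoint, flattened

net = TinyKeypointNet()
out = net(torch.zeros(2, 3, 128, 128))
print(out.shape)
```

The final layer has no activation because the keypoint coordinates are regression targets.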

#### Define the Layers in ` __init__`
As a reminder, a conv/pool layer may be defined like this (in `__init__`):
```
# 1 input image channel (for grayscale images), 32 output channels/feature maps, 3x3 square convolution kernel
self.conv1 = nn.Conv2d(1, 32, 3)

# maxpool that uses a square window of kernel_size=2, stride=2
self.pool = nn.MaxPool2d(2, 2)      
```

#### Refer to Layers in `forward`
Layers are then referred to in the `forward` function like this, in which the conv1 layer has a ReLU activation applied to it before maxpooling is applied:
```
x = self.pool(F.relu(self.conv1(x)))
```

Best practice is to place any layers whose weights will change during the training process in `__init__` and refer to them in the `forward` function; any layers or functions that always behave in the same way, such as a pre-defined activation function, should appear *only* in the `forward` function.
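
When sizing the first fully-connected layer, it helps to trace the spatial size through the conv/pool pair above. A small sketch of the standard output-size formula (no dilation); the helper name `conv_out` and the 224x224 input are assumptions for illustration:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=3)            # nn.Conv2d(1, 32, 3): 224 -> 222
size = conv_out(size, kernel=2, stride=2)  # nn.MaxPool2d(2, 2):  222 -> 111
flat_features = 32 * size * size           # 32 feature maps of 111 x 111
print(size, flat_features)
```

`flat_features` would be the `in_features` of an `nn.Linear` layer placed directly after this single conv/pool pair.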

#### Why models.py

You are tasked with defining the network in the `models.py` file so that any models you define can be saved and loaded by name in different notebooks in this project directory. For example, by defining a CNN class called `Net` in `models.py`, you can then create that same architecture in this and other notebooks by simply importing the class and instantiating a model:
```
    from models import Net
    net = Net()
```

In [None]:
# import the usual resources
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import numpy as np

# watch for any changes in models.py; if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2

In [None]:
## TODO: Define the Net in models.py

import torch
import torch.nn as nn
import torch.nn.functional as F

## TODO: Once you've defined the network, you can instantiate it
# one example conv layer has been provided for you
from models import Net

net = Net()
print(net)

In [None]:
import gc
torch.cuda.empty_cache()
gc.collect()

## Transform the dataset 

To prepare for training, create a transformed dataset of images and keypoints.

### TODO: Define a data transform

In PyTorch, a convolutional neural network expects a torch image of a consistent size as input. For efficient training, and so your model's loss does not blow up during training, it is also suggested that you normalize the input images and keypoints. The necessary transforms have been defined in `data_load.py` and you **do not** need to modify these; take a look at this file (you'll see the same transforms that were defined and applied in Notebook 1).

To define the data transform below, use a [composition](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#compose-transforms) of:
1. Rescaling and/or cropping the data, such that you are left with a square image (the suggested size is 224x224px)
2. Normalizing the images and keypoints; turning each RGB image into a grayscale image with a color range of [0, 1] and transforming the given keypoints into a range of [-1, 1]
3. Turning these images and keypoints into Tensors

These transformations have been defined in `data_load.py`, but it's up to you to call them and create a `data_transform` below. **This transform will be applied to the training data and, later, the test data**. It will change how you go about displaying these images and keypoints, but these steps are essential for efficient training.

As a note, should you want to perform data augmentation (which is optional in this project), and randomly rotate or shift these images, a square image size will be useful; rotating a 224x224 image by 90 degrees will result in the same shape of output.
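
To make the normalization step concrete, here is a NumPy sketch of what it does to one sample and how it is inverted for display. The keypoint constants (subtract 100, divide by 50) are the ones quoted in this notebook's `show_keypoints` helper; the function names and the channel-mean grayscale conversion are assumptions and may differ from `Normalize` in `data_load.py`.

```python
import numpy as np

KP_MEAN, KP_STD = 100.0, 50.0  # constants quoted in show_keypoints

def normalize_sample(image_rgb, key_pts):
    # stand-in grayscale conversion: channel mean, scaled to [0, 1]
    gray = image_rgb.astype(np.float64).mean(axis=2) / 255.0
    return gray, (key_pts - KP_MEAN) / KP_STD

def denormalize_keypoints(key_pts):
    # inverse of the keypoint scaling, needed before plotting
    return key_pts * KP_STD + KP_MEAN

image = np.full((224, 224, 3), 128, dtype=np.uint8)   # dummy mid-gray image
key_pts = np.array([[60.0, 80.0], [150.0, 140.0]])    # dummy (x, y) keypoints in pixels

gray, norm_pts = normalize_sample(image, key_pts)
restored = denormalize_keypoints(norm_pts)
print(norm_pts, restored)
```

With these constants, keypoints between 50 and 150 pixels map to [-1, 1]; points outside that band fall slightly outside the range.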

In [None]:
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils, models

# the dataset we created in Notebook 1 is copied in the helper file `data_load.py`
from data_load import FacialKeypointsDataset
# the transforms we defined in Notebook 1 are in the helper file `data_load.py`
from data_load import Rescale, RandomCrop, CenterCrop, Normalize, ToTensor, ToTensorRGB


## DONE: define the data_transform using transforms.Compose([all tx's, . , .])
# order matters! i.e. rescaling should come before a smaller crop
data_transform = None
# data_transform = transforms.Compose([Rescale(250),
#                                      RandomCrop(224),
#                                      Normalize(),
#                                      ToTensor()])
data_transform = transforms.Compose([Rescale(250),
                                     CenterCrop(224),
                                     ToTensorRGB()])

# testing that you've defined a transform
assert(data_transform is not None), 'Define a data_transform'

In [None]:
import cv2

class FaceCrop(object):
    """Rotate the image, then crop a square around the face using the keypoints.

    Args:
        output_size (float): Scale factor applied to the larger extent of the
            keypoint bounding box to get the side length of the square crop,
            e.g. 1.5. It is a multiplier, not a size in pixels.
    """

    def __init__(self, output_size, random_rotate_deg=0.0, random_scale=0.0, random_crop_percentage=0):
        # __call__ multiplies the keypoint extent by output_size, so keep it a scalar
        self.output_size = float(output_size)
        # stored for the planned random rotation; __call__ still uses a fixed angle
        self.random_rotate_deg = random_rotate_deg

    def __call__(self, sample):
        image_np, key_pts = sample['image'], sample['keypoints']
        h, w = image_np.shape[:2]
        #print(image_np.shape[:2])
        # angle = np.random.uniform(-self.random_rotate_deg, self.random_rotate_deg)
        angle = 20.0
        # Rotation of the keypoints
        keypoints_rotation_matrix = np.array([
                [+np.cos(np.radians(angle)), -np.sin(np.radians(angle))], 
                [+np.sin(np.radians(angle)), +np.cos(np.radians(angle))]
            ])
        image_rot_np = rotate_image(image_np, angle)
        h_rot, w_rot = image_rot_np.shape[:2]
        center = (w/2, h/2)  # get the center coordinates of the image to create the 2D rotation matrix
        key_pts = key_pts - center
        key_pts = np.matmul(key_pts, keypoints_rotation_matrix)
        key_pts = key_pts + center + ((w_rot-w)/2, (h_rot-h)/2)

        # Face cropping using the keypoints
        h, w = h_rot, w_rot  # image.shape[:2]
        #print(image.shape[:2])
        #new_h, new_w = h, w  # self.output_size
        weighted_center_0, weighted_center_1 = (key_pts[:, 0]).mean(), (key_pts[:, 1]).mean()
        new_h = (key_pts[:, 1].max() - key_pts[:, 1].min()) * self.output_size # * rand_scale
        new_w = (key_pts[:, 0].max() - key_pts[:, 0].min()) * self.output_size # * rand_scale
        new_wh_max = int(min(
            max(new_h, new_w),
            w, h
        ))
        #print(weighted_center_0)
        #top = int((h - new_h)/2)
        #left = int((w - new_w)/2)
        top = max(0, int(min(weighted_center_1 - new_wh_max/2, key_pts[:, 1].min())))
        left = max(0, int(min(weighted_center_0 - new_wh_max/2, key_pts[:, 0].min())))
        new_wh_max = min(new_wh_max, h - top - 1, w - left - 1)
        bottom = top + new_wh_max
        right = left + new_wh_max
        if right >= w:
            print('right!')
            right = w
        if bottom >= h:
            print('bottom!')
            bottom = h
        
        #print('top:{0}, left:{1}'.format(top,left))
        image_rot_np = image_rot_np[top:bottom, left:right]

        key_pts = key_pts - [left, top]

        return {'image': image_rot_np, 'keypoints': key_pts}


# self.transform = transforms.ColorJitter(brightness, contrast, saturation, hue)
# return self.transform(image), landmarks

In [None]:
# For training grayscale images
data_transform = transforms.Compose([
    #TODO RandomRotation(20),
    # TODO Randomize saturation and brightness.
    FaceCrop(1.5),  # TODO: add random range
    Rescale(128),
    Normalize(),
    ToTensor()])

In [None]:
# For training color RGB images
data_transform = transforms.Compose([
    #TODO RandomRotation(20),
    # TODO Randomize saturation and brightness.
    FaceCrop(1.5),  # TODO: add random range
    Rescale(128),
    ToTensorRGB()])

In [None]:
# This doesn't work well for the training
data_transform = transforms.Compose([Rescale(150),
                                     CenterCrop(128),
                                     ToTensorRGB()])

In [None]:
data_transform = transforms.Compose([#Rescale(250),
                                     #CenterCrop(224),
                                     ToTensorRGB()])

In [None]:
# create the transformed dataset
training_keypoints = 'data/training_frames_keypoints.csv'
training_files = 'data/training/'
#training_keypoints='data/minitest_frames_keypoints.csv'
#training_files = 'data/test/'
transformed_dataset = FacialKeypointsDataset(csv_file=training_keypoints,
                                             root_dir=training_files,
                                             transform=data_transform)

print('Number of images: ', len(transformed_dataset))

# iterate through the transformed dataset and print some stats about the first few samples
import pandas as pd
for i in range(10):
    sample = transformed_dataset[i]
    if (pd.Series(list(sample['image'].size()[1:])) > 128).any():
        print(i, sample['image'].size(), sample['keypoints'].size())

In [None]:
def rotate_image(mat, angle):
    """
    Rotates an image (angle in degrees) and expands image to avoid cropping
    https://stackoverflow.com/questions/43892506/opencv-python-rotate-image-without-cropping-sides
    """

    height, width = mat.shape[:2] # image shape has 3 dimensions
    image_center = (width/2, height/2) # getRotationMatrix2D needs coordinates in reverse order (width, height) compared to shape

    rotation_mat = cv2.getRotationMatrix2D(image_center, angle, 1.)

    # rotation calculates the cos and sin, taking absolutes of those.
    abs_cos = abs(rotation_mat[0,0]) 
    abs_sin = abs(rotation_mat[0,1])

    # find the new width and height bounds
    bound_w = int(height * abs_sin + width * abs_cos)
    bound_h = int(height * abs_cos + width * abs_sin)

    # subtract the old image center (bringing the image back to the origin) and add the new image center coordinates
    rotation_mat[0, 2] += bound_w/2 - image_center[0]
    rotation_mat[1, 2] += bound_h/2 - image_center[1]

    # rotate image with the new bounds and translated rotation matrix
    rotated_mat = cv2.warpAffine(mat, rotation_mat, (bound_w, bound_h))
    return rotated_mat

In [None]:
sample = transformed_dataset[113]
print('Shape: ' + str(sample['image'].shape))
image = sample['image'].clone()
show_keypoints(image, sample['keypoints'], normalize=True)

In [None]:
key_pts = sample['keypoints'].numpy().copy()
key_pts = key_pts*50 + 100
image_np = image.numpy().transpose((1, 2, 0))

h, w = image_np.shape[:2]
print(image_np.shape[:2])
# angle = np.random.uniform(-self.random_rotate_deg, self.random_rotate_deg)
angle = 20.0
# Rotation of the keypoints
keypoints_rotation_matrix = np.array([
        [+np.cos(np.radians(angle)), -np.sin(np.radians(angle))], 
        [+np.sin(np.radians(angle)), +np.cos(np.radians(angle))]
    ])
image_rot_np = rotate_image(image_np, angle)
h_rot, w_rot = image_rot_np.shape[:2]
center = (w/2, h/2)  # get the center coordinates of the image to create the 2D rotation matrix
key_pts = key_pts - center
key_pts = np.matmul(key_pts, keypoints_rotation_matrix)
key_pts = key_pts + center + ((w_rot-w)/2, (h_rot-h)/2)

key_pts = (key_pts - 100) / 50
#key_pts = key_pts + (h_rot,w_rot) - (h,w)

In [None]:
image_torch = torch.from_numpy(image_rot_np.transpose((2, 0, 1)))
keypoints_torch = torch.from_numpy(key_pts)

In [None]:
show_keypoints(image_torch, keypoints_torch, normalize=True)

In [None]:
image_torch.shape

## Batching and loading data

Next, having defined the transformed dataset, we can use PyTorch's DataLoader class to load the training data in batches of any size and to shuffle the data for training the model. You can read more about the parameters of the DataLoader in [this documentation](http://pytorch.org/docs/master/data.html).

#### Batch size
Decide on a good batch size for training your model. Try both small and large batch sizes and note how the loss decreases as the model trains.

**Note for Windows users**: Please change the `num_workers` to 0 or you may face some issues with your DataLoader failing.
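
The effect of the batch size on the number of optimizer steps per epoch can be checked on dummy data. A minimal sketch, assuming 50 samples of 3x32x32 images with 14 keypoints each; note that `TensorDataset` yields tuples, whereas the dataset in this notebook yields dicts with `'image'` and `'keypoints'` entries.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 50 dummy samples: 3x32x32 images and 14 (x, y) keypoints each
dataset = TensorDataset(torch.zeros(50, 3, 32, 32), torch.zeros(50, 14, 2))

batches_per_epoch = {}
for batch_size in (8, 16):
    # num_workers=0 loads in the main process, which also works on Windows
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    batches_per_epoch[batch_size] = len(loader)
    images, key_pts = next(iter(loader))
    print(batch_size, len(loader), tuple(images.shape), tuple(key_pts.shape))
```

Smaller batches give more (and noisier) weight updates per epoch; larger batches give fewer, smoother ones and need more memory.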

In [None]:
# load training data in batches
batch_size = 16

train_loader = DataLoader(transformed_dataset, 
                          batch_size=batch_size,
                          shuffle=True, 
                          num_workers=4)


## Test the DataLoader

In [None]:
def imshow(image, ax=None, title=None, normalize=True):
    """Imshow for Tensor."""
    if ax is None:
        fig, ax = plt.subplots()
    image = image.numpy().transpose((1, 2, 0))

    if normalize:
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        image = std * image + mean
        image = np.clip(image, 0, 1)

    return image


def show_keypoints(image, key_pts, normalize=True):
    """Show image with keypoints"""
    key_pts_copy = np.copy(key_pts)
    # Invert keypoint normalization in data_load.py:Normalize: key_pts_copy = (key_pts_copy - 100)/50.0
    key_pts_copy = key_pts_copy*50.0 + 100.0

    image = image.numpy().transpose((1, 2, 0))

    if normalize:
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        image = std * image + mean
        image = np.clip(image, 0, 1)

    #plt.imshow(image, cmap='gray')
    plt.imshow(image)
    plt.scatter(key_pts_copy[:, 0], key_pts_copy[:, 1], s=20, marker='.', c='m')


In [None]:
# Test the data loader
images = next(iter(train_loader))
if images['image'][0].shape[0] == 3:  # it's a color (RGB) image
    show_keypoints(images['image'][0], images['keypoints'][0], normalize=True)
else:
    show_keypoints(images['image'][0], images['keypoints'][0], normalize=False)

## Before training

Take a look at how this model performs before it trains. You should see that the keypoints it predicts start off in one spot and don't match the keypoints on a face at all! It's interesting to visualize this behavior so that you can compare it to the model after training and see how the model has improved.

#### Load in the test dataset

The test dataset is one that this model has *not* seen before, meaning it has not trained with these images. We'll load in this test data and see how your model performs on it, both before and after training!

To visualize this test data, we have to go through some un-transformation steps to turn our images into python images from tensors and to turn our keypoints back into a recognizable range. 
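
For the image part of this un-transformation, a NumPy sketch could look like the following. It assumes the RGB tensors were normalized with the mean and standard deviation values that the display helpers in this notebook undo; the function name `tensor_image_to_display` is hypothetical.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def tensor_image_to_display(chw, normalized=True):
    """Convert a C x H x W float array into an H x W x C array in [0, 1]."""
    hwc = np.transpose(chw, (1, 2, 0))  # channels last, as matplotlib expects
    if normalized:
        hwc = STD * hwc + MEAN          # undo (x - mean) / std per channel
    return np.clip(hwc, 0.0, 1.0)

chw = np.zeros((3, 128, 128))  # an all-zero normalized image equals the channel means
img = tensor_image_to_display(chw)
print(img.shape, img[0, 0])
```

For a real batch, each tensor would first be moved to the CPU and converted with `.numpy()`.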

In [None]:
# load in the test data, using the dataset class
# AND apply the data_transform you defined above

# create the test dataset
test_dataset = FacialKeypointsDataset(csv_file='data/test_frames_keypoints.csv',
                                             root_dir='data/test/',
                                             transform=data_transform)

In [None]:
# load test data in batches
batch_size = 10

test_loader = DataLoader(test_dataset, 
                          batch_size=batch_size,
                          shuffle=True, 
                          num_workers=4)

In [None]:
import os
[ele.name for ele in os.scandir('data/display')]

In [None]:
import pandas as pd
test_pdf = pd.read_csv('data/test_frames_keypoints.csv')
test_pdf[test_pdf['Unnamed: 0'].isin(
    #[ele.name for ele in os.scandir('data/display')]
    ['Agbani_Darego_50.jpg', 'Alexandra_Pelosi_30.jpg', 'Warren_Buffett_51.jpg', 'Wesley_Clark_40.jpg', 'Tom_Hanks_51.jpg',
     'Hassan_Nasrallah_51.jpg', 'Jaime_Pressly_41.jpg', 'Benedita_da_Silva_20.jpg', 'Carlos_Iturgaitz_20.jpg']
)].to_csv(
    'data/display_test_frames_keypoints.csv', index=False)

In [None]:
# Subset of test data for displaying
display_dataset = FacialKeypointsDataset(csv_file='data/display_test_frames_keypoints.csv',
                                             root_dir='data/test/',
                                             transform=data_transform)

batch_size = 9

display_loader = DataLoader(
    display_dataset, 
    batch_size=batch_size,
    shuffle=False, 
    num_workers=4)

## Apply the model on a test sample

To test the model on a test sample of data, you have to follow these steps:
1. Extract the image and ground truth keypoints from a sample
2. Make sure the image is a FloatTensor, which the model expects.
3. Forward pass the image through the net to get the predicted, output keypoints.

This function tests how the network performs on the first batch of test data. It returns the transformed images (moved to the device), the predicted keypoints (produced by the model), and the ground truth keypoints.

In [None]:
# test the model on a batch of test images

def net_sample_output(my_loader=test_loader):
    
    # iterate through the test dataset
    for i, sample in enumerate(my_loader):
        
        # get sample data: images and ground truth keypoints
        images = sample['image']
        key_pts = sample['keypoints']

        # convert images to FloatTensors
        images = images.type(torch.FloatTensor)
        images = images.to(device)

        # forward pass to get net output
        output_pts = net(images)
        
        # reshape to batch_size x 14 x 2 pts
        output_pts = output_pts.view(output_pts.size()[0], 14, -1)
        
        # break after first batch of images is tested
        if i == 0:
            return images, output_pts, key_pts


#### Debugging tips

If you get a size or dimension error here, make sure that your network outputs the expected number of keypoints! Or if you get a Tensor type error, look into changing the above code that casts the data into float types: `images = images.type(torch.FloatTensor)`.
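
One way to catch a wrong output size early is to check it explicitly before reshaping. The sketch below uses NumPy and a hypothetical helper `to_keypoints`; the count of 14 keypoints matches the reshape in `net_sample_output`. A reshape with `-1`, such as `view(batch, 14, -1)`, only fails when the number of values is not divisible by 14, so a network with the wrong number of outputs can slip through unnoticed.

```python
import numpy as np

N_KEYPOINTS = 14

def to_keypoints(flat_output, n_keypoints=N_KEYPOINTS):
    """Reshape (batch, 2 * n_keypoints) predictions to (batch, n_keypoints, 2)."""
    batch, n_values = flat_output.shape
    if n_values != n_keypoints * 2:
        raise ValueError(f"expected {n_keypoints * 2} values per image, got {n_values}")
    return flat_output.reshape(batch, n_keypoints, 2)

flat = np.arange(2 * 28, dtype=np.float32).reshape(2, 28)  # dummy network output
pts = to_keypoints(flat)
print(pts.shape, pts[0, :2])
```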

In [None]:
import torch.optim as optim

## Pre-defined architectures from torchvision

In [None]:
from fkpmodels.classifiers import Network

In [None]:
model_name = 'densenet121'
net = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)

# Better use this because it could fit other image models better.
model_classifier = Network(1024, 14*2, [1024, 512, 224])  # Added 224

checkpoint_name = 'checkpoint_densenet121.pth'

In [None]:
model_name = 'vgg11'
net = getattr(models, model_name)(weights=models.VGG11_Weights.DEFAULT)

# Better use this because it could fit other image models better.
model_classifier = Network(25088, 14*2, [6272, 3136])

In [None]:
model_name = 'vgg16'
net = getattr(models, model_name)(weights=models.VGG16_Weights.DEFAULT)

model_classifier = Network(25088, 14*2, [6272, 3136])

net.classifier[0].in_features

In [None]:
# Freeze parameters so we don't backprop through them
for param in net.parameters():
    param.requires_grad = False

In [None]:
net.classifier = model_classifier

# Only train the classifier parameters, feature parameters are frozen
optimizer = optim.Adam(net.classifier.parameters(), lr=1e-4)
#optimizer = optim.Adam(net.parameters(), lr=1e-4)

## Xception Net model

In [None]:
from fkpmodels.xceptionnet import XceptionNet

net, model_name = XceptionNet(), 'XceptionNet'

optimizer = optim.Adam(net.parameters(), lr = 0.0008)
# net.cuda()

# Epoch 30/30.. Train loss: 0.003.. 
# Epoch 30/30.. Train loss: 0.002.. 
# Epoch 30/30.. Train loss: 0.002.. 
# Epoch 30/30.. Train loss: 0.002.. 
# Epoch 30/30.. Train loss: 0.002.. 
# Epoch 30/30.. Train loss: 0.002.. 
# Epoch 30/30.. Train loss: 0.001.. 
# CPU times: user 25min 10s, sys: 44.5 s, total: 25min 55s
# Wall time: 26min 17s

## Yet another NaimishNet

In [None]:
from fkpmodels.naimishnet import YaNaimishNet1
net, model_name = YaNaimishNet1(), 'YaNaimishNet1'  # Not too bad when showing it faces only. Loss 0.001

In [None]:
from fkpmodels.naimishnet import YaNaimishNet2
net, model_name = YaNaimishNet2(), 'YaNaimishNet2'  # Works

In [None]:
from fkpmodels.naimishnet import YaNaimishNet3
net, model_name = YaNaimishNet3(), 'YaNaimishNet3'

In [None]:
# Run this for all NaimishNets
optimizer = optim.Adam(net.parameters(), lr=1e-4)

## Select Criterion and move NN to device

In [None]:
# Use GPU if it's available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = 'cpu'   # uncomment when training on the GPU fails, e.g. due to "out of memory"

criterion = nn.MSELoss()  #.to(device)

net.to(device);

## Check chosen architecture

In [None]:
# call the above function; will fail if there is a problem in the pipeline
# returns: test images, test predicted keypoints, test ground truth keypoints
test_images, test_outputs, gt_pts = net_sample_output()

# print out the dimensions of the data to see if they make sense
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.size())

In [None]:
# gt_pts.view(gt_pts.size()[0], 68, -1).size()
gt_pts.view(gt_pts.size()[0], 14, -1).size()

## Visualize the predicted keypoints

Once we've had the model produce some predicted output keypoints, we can visualize these points in a way that's similar to how we've displayed this data before, only this time, we have to "un-transform" the image/keypoint data to display it.

Note that I've defined a *new* function, `show_all_keypoints` that displays a grayscale image, its predicted keypoints and its ground truth keypoints (if provided).

In [None]:
def show_all_keypoints(image, predicted_key_pts, gt_pts=None):
    """Show image with predicted keypoints"""
    # image is grayscale
    plt.imshow(image, cmap='gray')
    plt.scatter(predicted_key_pts[:, 0], predicted_key_pts[:, 1], s=20, marker='.', c='m')
    # plot ground truth points as green pts
    if gt_pts is not None:
        plt.scatter(gt_pts[:, 0], gt_pts[:, 1], s=20, marker='.', c='g')


#### Un-transformation

Next, you'll see a helper function, `visualize_output`, that takes in a batch of images, predicted keypoints, and ground truth keypoints and displays a set of those images and their true/predicted keypoints.

This function's main role is to take batches of image and keypoint data (the input and output of your CNN), and transform them into numpy images and un-normalized keypoints (x, y) for normal display. The un-transformation process turns keypoints and images into numpy arrays from Tensors *and* it undoes the keypoint normalization done in the Normalize() transform; it's assumed that you applied these transformations when you loaded your test data.

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=train_loader)

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=test_loader)

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=display_loader)

In [None]:
def show_all_keypoints_ax(ax, image, predicted_key_pts=None, gt_pts=None):
    """Show image with predicted keypoints"""
    # image is grayscale
    ax.imshow(image, cmap='gray')
    if predicted_key_pts is not None:
        ax.scatter(predicted_key_pts[:, 0], predicted_key_pts[:, 1], s=50, marker='.', c='m')
    # plot ground truth points as green pts
    if gt_pts is not None:
        ax.scatter(gt_pts[:, 0], gt_pts[:, 1], s=20, marker='.', c='g')
    ax.axis('off')

# visualize the output
# by default this shows a batch of 10 images
def visualize_output(test_images, test_outputs=None, gt_pts=None, batch_size=10, normalize=True, figsize=(15.0,10.0), nrows_ncols=(5, 5)):
    fig = plt.figure(figsize=figsize)
    grid = ImageGrid(fig, 111, nrows_ncols=nrows_ncols, axes_pad = 0.1)
    for i in range(batch_size):
        #plt.figure(figsize=(20,10))
        #ax = plt.subplot(1, batch_size, i+1)

        # un-transform the image data
        image = test_images[i].data   # get the image from its wrapper
        image = image.cpu().numpy()   # convert to numpy array from a Tensor
        image = np.transpose(image, (1, 2, 0))   # transpose to go from torch to numpy image

        if normalize:
            mean = np.array([0.485, 0.456, 0.406])
            std = np.array([0.229, 0.224, 0.225])
            image = std * image + mean
            image = np.clip(image, 0, 1)

        predicted_key_pts = None
        if test_outputs is not None:
            # un-transform the predicted key_pts data
            predicted_key_pts = test_outputs[i].data
            predicted_key_pts = predicted_key_pts.cpu().numpy()
            # undo normalization of keypoints  
            predicted_key_pts = predicted_key_pts*50.0+100
        
        # plot ground truth points for comparison, if they exist
        ground_truth_pts = None
        if gt_pts is not None:
            ground_truth_pts = gt_pts[i]         
            ground_truth_pts = ground_truth_pts*50.0+100

        # call show_all_keypoints
        show_all_keypoints_ax(grid[i], np.squeeze(image), predicted_key_pts, ground_truth_pts)
            
        plt.axis('off')

    plt.show()
    
# call it
# visualize_output(test_images, test_outputs, gt_pts, batch_size=9, normalize=True, figsize=(6,6), nrows_ncols=(3, 3))
if test_images[1].shape[0] == 3:  # it's a color (RGB) image
    visualize_output(test_images, test_outputs, gt_pts, batch_size=9, normalize=True, figsize=(6,6), nrows_ncols=(3, 3))
else:
    visualize_output(test_images, test_outputs, gt_pts, batch_size=9, normalize=False, figsize=(6,6), nrows_ncols=(3, 3))

In [None]:
images = next(iter(train_loader))
if images['image'][0].shape[0] == 3:  # it's a color (RGB) image
    show_keypoints(images['image'][0], images['keypoints'][0], normalize=True)
else:
    show_keypoints(images['image'][0], images['keypoints'][0], normalize=False)

## Training

#### Loss function
Training a network to predict keypoints is different than training a network to predict a class; instead of outputting a distribution of classes and using cross entropy loss, you may want to choose a loss function that is suited for regression, which directly compares a predicted value and target value. Read about the various kinds of loss functions (like MSE or L1/SmoothL1 loss) in [this documentation](https://pytorch.org/docs/master/nn.html#loss-functions).
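
To see how these regression losses differ, here is a NumPy sketch of MSE, L1 and SmoothL1 (mean reduction, `beta=1.0`, which as far as I know are also PyTorch's defaults) on a dummy prediction with one outlier. The function names and the data are illustrative assumptions.

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def l1(pred, target):
    return np.mean(np.abs(pred - target))

def smooth_l1(pred, target, beta=1.0):
    d = np.abs(pred - target)
    # quadratic for small errors, linear for large ones
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

target = np.zeros(4)
pred = np.array([0.1, -0.1, 0.1, 3.0])  # the last value is an outlier

print(mse(pred, target), l1(pred, target), smooth_l1(pred, target))
```

The single outlier dominates the MSE because its error is squared, while L1 and SmoothL1 grow only linearly with it. That can matter when some keypoint annotations are noisy.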

### TODO: Define the loss and optimization

Next, you'll define how the model will train by deciding on the loss function and optimizer.

---

In [None]:
# # Use GPU if it's available
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = "cpu"

# ## TODO: Define the loss and optimization
# import torch.optim as optim

# criterion = nn.MSELoss().to(device)

# #optimizer = optim.Adam(net.parameters(), lr=5e-5)
# optimizer = optim.Adamax(net.parameters(), lr=1e-4)

# net.to(device);

## Training and Initial Observation

Now, you'll train on your batched training data from `train_loader` for a number of epochs. 

To quickly observe how your model is training and decide on whether or not you should modify its structure or hyperparameters, you're encouraged to start off with just one or two epochs at first. As you train, note how your model's loss behaves over time: does it decrease quickly at first and then slow down? Does it take a while to decrease in the first place? What happens if you change the batch size of your training data or modify your loss function? etc.

Use these initial observations to make changes to your model and decide on the best architecture before you train for many epochs and create a final model.

### Run the following cell to start training the classifier with the selected model

In [None]:
%%time
# This training loop was inspired by the "Transfer learning exercise" too.
# https://github.com/udacity/DSND_Term1/blob/1196aafd48a2278b02eff85510b582fd7e2a9d2d/lessons/DeepLearning/new-intro-to-pytorch/Part%208%20-%20Transfer%20Learning%20(Solution).ipynb
# Note: In case of an "out of memory" of the GPU, try reducing the batch_size in the dataloaders.
epochs = 30
steps = 0
running_loss = 0
print_every = 20
for epoch in range(epochs):
    for data in train_loader:
        steps += 1

        images = data['image']
        key_pts = data['keypoints']

        # flatten pts
        key_pts = key_pts.view(key_pts.size(0), -1)

        # convert variables to floats for regression loss
        key_pts = key_pts.type(torch.FloatTensor)
        images = images.type(torch.FloatTensor)
        # Move input and label tensors to the default device
        images, key_pts = images.to(device), key_pts.to(device)
        
        optimizer.zero_grad()
        
        # forward pass: the network regresses keypoint coordinates (not log-probabilities)
        output_pts = net(images)
        loss = criterion(output_pts, key_pts)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            net.eval()
            # with torch.no_grad():
            #     for data in test_loader:
            #         inputs, labels = inputs.to(device), labels.to(device)
            #         logps = model.forward(inputs)
            #         batch_loss = criterion(logps, labels)
                    
            #         test_loss += batch_loss.item()
                    
            #         # Calculate accuracy
            #         ps = torch.exp(logps)
            #         top_p, top_class = ps.topk(1, dim=1)
            #         equals = top_class == labels.view(*top_class.shape)
            #         accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
                    
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  #f"Test loss: {test_loss/len(dataloaders['valid']):.3f}.. "
                  #f"Test accuracy: {accuracy/len(dataloaders['valid']):.3f}"
                 )
            running_loss = 0
            net.train()

In [None]:
def train_net(n_epochs, device=None, print_every=5):
    step = 0
    # prepare the net for training
    net.train()
    #global output_pts
    #global images
    for epoch in range(n_epochs):  # loop over the dataset multiple times
        
        running_loss = 0.0

        # train on batches of data, assumes you already have train_loader
        for batch_i, data in enumerate(train_loader):
            step += 1
            # get the input images and their corresponding labels
            images = data['image']
            key_pts = data['keypoints']

            # flatten pts
            key_pts = key_pts.view(key_pts.size(0), -1)

            # convert variables to floats for regression loss
            key_pts = key_pts.type(torch.FloatTensor)
            images = images.type(torch.FloatTensor)

            if device is not None:
                # Move input and label tensors to the default device
                images, key_pts = images.to(device), key_pts.to(device)
            # zero the parameter (weight) gradients
            optimizer.zero_grad()

            # forward pass to get outputs
            # output_pts = net(images).to(device)
            output_pts = net.forward(images) #.to(device)

            # calculate the loss between predicted and target keypoints
            #print(images.shape)
            #print(output_pts.shape)
            #print(key_pts.shape)
            loss = torch.sqrt(criterion(output_pts, key_pts))

            # zero the parameter (weight) gradients
            #optimizer.zero_grad()
            
            # backward pass to calculate the weight gradients
            loss.backward()

            # update the weights
            optimizer.step()
            # print loss statistics
            # to convert loss into a scalar and add it to the running_loss, use .item()
            running_loss += loss.item()
            #if batch_i % 10 == 9:    # print every 10 batches
            if step % print_every == 0:
                # average over the number of batches accumulated since the last print
                print('Epoch: {}, Batch: {}, Avg. Loss: {}'.format(epoch + 1, batch_i+1, running_loss/print_every))
                running_loss = 0.0

    print('Finished Training')

In [None]:
%%time
# train your network
n_epochs = 30  # start small, and increase when you've decided on your model structure and hyperparams

train_net(n_epochs, device=device, print_every=5)

In [None]:
minitest_images, minitest_outputs, minigt_pts = net_sample_output(my_loader=train_loader)

In [None]:
visualize_output(minitest_images, minitest_outputs, minigt_pts)

## Test data

See how your model performs on previously unseen test data. We've already loaded and transformed this data, similar to the training data. Next, run your trained model on these images to see what kind of keypoints are produced. You should be able to see whether your model is fitting each new face it sees, whether the points are distributed randomly, or whether the model has overfitted the training data and does not generalize.
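
Besides inspecting the predictions visually, a single number for the test loss can help when comparing models. The following is a minimal sketch under stated assumptions: the function name `evaluate` is hypothetical, the batches are assumed to be dictionaries with `'image'` and `'keypoints'` entries as in the training loop, and a tiny stand-in model and random data are used so that the block runs on its own.

```python
import torch
import torch.nn as nn


def evaluate(net, loader, criterion, device="cpu"):
    """Return the mean per-batch loss of `net` over `loader` without updating weights."""
    was_training = net.training
    net.eval()                     # disable dropout, use running batch-norm statistics
    total, n_batches = 0.0, 0
    with torch.no_grad():          # no gradients are needed for evaluation
        for data in loader:
            images = data['image'].float().to(device)
            key_pts = data['keypoints'].float().to(device)
            key_pts = key_pts.view(key_pts.size(0), -1)   # flatten (N, 68, 2) -> (N, 136)
            total += criterion(net(images), key_pts).item()
            n_batches += 1
    net.train(was_training)        # restore the previous mode
    return total / max(n_batches, 1)


# Tiny stand-in model and random data (assumptions, not the notebook's network or dataset)
toy_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 136))
toy_loader = [{'image': torch.rand(4, 3, 8, 8), 'keypoints': torch.rand(4, 68, 2)}
              for _ in range(3)]
toy_loss = evaluate(toy_net, toy_loader, nn.MSELoss())
print(f"mean toy loss: {toy_loss:.4f}")
```

With the notebook's objects this would presumably be called as `evaluate(net, test_loader, criterion, device)`.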

In [None]:
# get a sample of test data again

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=train_loader)

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=test_loader)

In [None]:
test_images, test_outputs, gt_pts = net_sample_output(my_loader=display_loader)

In [None]:
print(test_images.data.size())
print(test_outputs.data.size())
print(gt_pts.size())

In [None]:
test_images.shape

In [None]:
## TODO: visualize your test output
# you can use the same function as before, by un-commenting the line below:

if test_images[1].shape[0] == 3:  # it's a color (RGB) image
    visualize_output(test_images, test_outputs, gt_pts, batch_size=9, normalize=True, figsize=(6,6), nrows_ncols=(3, 3))
else:
    visualize_output(test_images, test_outputs, gt_pts, batch_size=9, normalize=False, figsize=(6,6), nrows_ncols=(3, 3))

Once you've found a good model (or two), save your model so you can load it and use it later!
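
Saving only the `state_dict` means the architecture has to be re-created before the parameters can be loaded again. The round trip can be sketched as follows; the `nn.Linear` model and the file name `example_model.pt` are placeholders chosen for illustration, not the notebook's network or naming scheme.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; in the notebook this would be the trained `net`
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pt")   # hypothetical file name
    torch.save(model.state_dict(), path)

    # Re-create the same architecture, then load the saved parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))
    restored.eval()   # switch to inference mode before predicting

# Both models should now produce the same output for the same input
x = torch.rand(1, 4)
same = torch.allclose(model(x), restored(x))
print(same)
```

Passing `map_location="cpu"` lets parameters that were saved from a GPU be loaded on a machine without one.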

In [None]:
## TODO: change the name to something unique for each new model
model_dir = 'saved_models/'


# after training, save your model parameters in the dir 'saved_models'
torch.save(net.state_dict(), model_dir+model_name+'.pt')

After you've trained a well-performing model, answer the following questions so that we have some insight into your training and architecture selection process. Answering all questions is required to pass this project.

### Question 1: What optimization and loss functions did you choose and why?


**Answer**: write your answer here (double click to edit this cell)

### Question 2: What kind of network architecture did you start with and how did it change as you tried different architectures? Did you decide to add more convolutional layers or any layers to avoid overfitting the data?

**Answer**: write your answer here

### Question 3: How did you decide on the number of epochs and batch_size to train your model?

**Answer**: write your answer here

## Feature Visualization

Neural networks are sometimes thought of as a black box: given some input, they learn to produce some output. CNNs, however, learn to recognize a variety of spatial patterns, and you can visualize what each convolutional layer has been trained to recognize by looking at the weights that make up each convolutional kernel and applying them one at a time to a sample image. This technique is called feature visualization, and it's useful for understanding the inner workings of a CNN.

In the cell below, you can see how to extract a single filter (by index) from your first convolutional layer. The filter should appear as a grayscale grid.
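
It helps to know how PyTorch lays out convolutional weights before indexing into them: the tensor has shape `(out_channels, in_channels, kernel_height, kernel_width)`. The sketch below uses a fresh, untrained layer whose sizes (3 input channels, 32 filters, 5x5 kernels) are assumptions for illustration and need not match your `conv1`.

```python
import torch.nn as nn

# Fresh, untrained layer used only to show the weight layout (not the notebook's conv1)
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5)

weights = conv.weight.data   # shape: (out_channels, in_channels, kH, kW)
filter_index, channel_index = 0, 0
single_kernel = weights[filter_index, channel_index]   # one 2D kernel

print(tuple(weights.shape))
print(tuple(single_kernel.shape))
```

For a color input, each filter therefore consists of three 2D kernels, one per input channel, which is why the cells below index both the filter and the channel.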

In [None]:
# visTensor from https://stackoverflow.com/questions/55594969/how-to-visualise-filters-in-a-cnn-with-pytorch

from torchvision import utils

def visTensor(tensor, ch=0, allkernels=False, nrow=8, padding=1): 
    n,c,w,h = tensor.shape

    if allkernels: tensor = tensor.view(n*c, -1, w, h)
    elif c != 3: tensor = tensor[:,ch,:,:].unsqueeze(dim=1)

    rows = np.min((tensor.shape[0] // nrow + 1, 64))    
    grid = utils.make_grid(tensor, nrow=nrow, normalize=True, padding=padding)
    plt.figure( figsize=(nrow,rows) )
    plt.imshow(grid.numpy().transpose((1, 2, 0)))

In [None]:
def plot_filters(w, columns=8, figsize=12, cmap=None):
    # use the `columns` argument instead of overwriting it with a fixed value
    rows = w.shape[0] // columns
    fig=plt.figure(figsize=(figsize, figsize*rows/columns))
    for i in range(0, columns*rows):
        fig.add_subplot(rows, columns, i+1)
        if w.shape[1] == 3:  # color filter
            image = w[i].transpose(1,2,0)
            mean = np.array([0.485, 0.456, 0.406])
            std = np.array([0.229, 0.224, 0.225])
            image = std * image + mean
            image = np.clip(image, 0, 1)
            plt.imshow(image)
        else:
            plt.imshow(w[i][0], cmap='gray')

# Get the weights in the first conv layer, "conv1"
weights = net.conv1.weight.data
w = weights.cpu().clone().numpy()
visTensor(weights.cpu().clone())
# plot_filters(w)

print('First convolutional layer')
plt.show()

# weights = net.conv2.weight.data
# w = weights.cpu().numpy()

In [None]:
weights = net.conv1.weight.data
w = weights.cpu().clone().numpy()
visTensor(weights.cpu().clone()[:,0:1,:,:])

In [None]:
visTensor(weights.cpu().clone()[:,1:2,:,:])

In [None]:
# Get the weights in the first conv layer, "conv1"
# if necessary, change this to reflect the name of your first conv layer
weights1 = net.conv1.weight.data

w = weights1.cpu().numpy()

filter_index = 0

print(w[filter_index][0])
print(w[filter_index][0].shape)

# display the filter weights
plt.imshow(w[filter_index][0], cmap='gray')


## Feature maps

Each CNN has at least one convolutional layer that is composed of stacked filters (also known as convolutional kernels). As a CNN trains, it learns what weights to include in its convolutional kernels, and when these kernels are applied to some input image, they produce a set of **feature maps**. So, feature maps are just sets of filtered images; they are the images produced by applying a convolutional kernel to an input image. These maps show us the features that the different layers of the neural network learn to extract. For example, you might imagine a convolutional kernel that detects the vertical edges of a face or another one that detects the corners of eyes. You can see what kind of features each of these kernels detects by applying them to an image. One such example is shown below; from the way it brings out the lines in the image, you might characterize this as an edge detection filter.

<img src='images/feature_map_ex.png' width=50% height=50%/>


Next, choose a test image and filter it with one of the convolutional kernels in your trained CNN; look at the filtered output to get an idea what that particular kernel detects.
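
To make the idea of a feature map concrete, the sketch below filters a small synthetic image with a hand-written, Sobel-style kernel. It uses a plain NumPy sliding window instead of OpenCV; `correlate2d_valid` is a hypothetical helper written for this illustration. `cv2.filter2D` computes the same kind of correlation but also pads the borders so that the output keeps the input size.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide `kernel` over a 2D `image` (no padding) and return the filtered map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Synthetic 6x6 image: dark left half, bright right half (a vertical edge)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel-style kernel that responds to horizontal intensity changes
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

feature_map = correlate2d_valid(image, kernel)
print(feature_map)
```

The response is nonzero only where the window straddles the edge and zero in the flat regions, which is the behavior you would describe as edge detection.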

### TODO: Filter an image to see the effect of a convolutional kernel
---

In [None]:
##TODO: load in and display any image from the transformed test dataset

## TODO: Using cv's filter2D function,
## apply a specific set of filter weights (like the one displayed above) to the test image


In [None]:
import cv2

In [None]:
weights = net.conv1.weight.data
w = weights.cpu().clone().numpy()
visTensor(weights.cpu().clone())
# plot_filters(w)

In [None]:
w.shape

In [None]:
image_id = 1


def apply_kernels_to_image(image, w):
    """
    Apply each kernel in w (n_kernels, channels, kernel_height, kernel_width) to the image
    (channels, height, width), channel by channel, using cv2.filter2D.
    """
    filtered_image = image.data.cpu().clone().numpy() # convert to numpy array from a Tensor
    n_image_channels, width_image, height_image = filtered_image.shape
    n_kernels, n_kernel_channels = w.shape[0], w.shape[1]
    assert n_kernel_channels == n_image_channels, "The number of image channels and kernel channels must match"
    n_channels = n_kernel_channels
    #filtered_image = np.transpose(image, (1, 2, 0))   # transpose to go from torch to numpy image
    filtered_images = np.zeros((n_kernels, n_channels, width_image, height_image))
    for kernel_num in range(n_kernels):
        image_copy = filtered_image.copy()
        for channel_num in range(n_channels):  # Apply kernels to each color channel
            filtered_images[kernel_num][channel_num] = cv2.filter2D(image_copy[channel_num], -1, w[kernel_num][channel_num]) 
    
    return torch.FloatTensor(filtered_images)


filtered_images_tt = apply_kernels_to_image(test_images[image_id], w)

In [None]:
if test_images[1].shape[0] == 3:  # it's a color (RGB) image
    visualize_output(filtered_images_tt, batch_size=32, normalize=True, figsize=(12,6), nrows_ncols=(4, 8))
else:
    visualize_output(filtered_images_tt, batch_size=32, normalize=False, figsize=(6,6), nrows_ncols=(4, 8))

### Question 4: Choose one filter from your trained CNN and apply it to a test image; what purpose do you think it plays? What kind of feature do you think it detects?


**Answer**: (does it detect vertical lines or does it blur out noise, etc.) write your answer here

---
## Moving on!

Now that you've defined and trained your model (and saved the best model), you are ready to move on to the last notebook, which combines a face detector with your saved model to create a facial keypoint detection system that can predict the keypoints on *any* face in an image!