# LT2316 H20 Assignment B - image autoencoder

## Introduction

In this assignment, you will define an image autoencoder model by changing/adding a limited amount of code to this notebook.  We will mark off what you will change.  

An autoencoder is a machine learning system/network that attempts to reconstruct the input (or a proxy for the input such as its context, as in neural language modeling) after compressing it to a smaller "bottleneck" representation. We can then extract the compressed representation for the input by running the input through the part of the trained model that ends in the compressing hidden layer.  For images, the input and output of the full model usually have the same shape, and the loss is the pixel-by-pixel colour channel error.  The embeddings of the images are extracted from some middle layer as a vector of much lower dimensionality.

We can then examine how "good" the embeddings are not only by the training loss but also by other techniques, such as clustering the embeddings. We will just test on the training data for simplicity (this is not always wrong if our goal is merely to get a generalized/compressed representation of a fixed amount of data).

Below we will mark off what you need to change and what you can change in markdown and code comments. The rest should remain untouched when you submit.  You are recommended to develop in a copy of the notebook you will submit and then port over your changes to the "final" notebook. That way, you can modify our code to test your code in your private notebook.

You will submit a saved notebook directly to Canvas.  

## Loading the data

In [1]:
from pycocotools.coco import COCO
import numpy as np
import pandas as pd
import random 
from PIL import Image
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
# You may add imports you feel you need for the notebook

In [2]:
%matplotlib inline

In [3]:
coco = COCO(annotation_file="/scratch/lt2316-h18-resources/coco/annotations/instances_train2017.json")

loading annotations into memory...
Done (t=18.43s)
creating index...
index created!


In [4]:
my_device = "cuda:1" # you can change the device to another GPU or the cpu for testing

In [5]:
def get_data(meta, datadir="/scratch/lt2316-h18-resources/coco/train2017"):
    return [(x['file_name'], Image.open("{}/{}".format(datadir, x['file_name'])).resize((100,100)).convert('RGB')) for x in meta]

def cat_img_load(category, coco, trainsize):
    catids = coco.getCatIds(catNms=category)
    imgids = coco.getImgIds(catIds=catids)
    
    random.shuffle(imgids)
    imgids = imgids[:trainsize]
    imgmeta = coco.loadImgs(ids=imgids)
    imgdata = get_data(imgmeta)
    
    imgdf = pd.DataFrame()
    imgnames = [x[0] for x in imgdata]
    imgarrays = [x[1] for x in imgdata]
    imgdf['imgs'] = imgarrays
    imgdf['filename'] = imgnames
    imgdf['class'] = category
    
    return imgdf

def get_tensors(*imgdfs):
    bigdf = pd.concat(imgdfs)
    print(len(bigdf))
    X = np.array([np.array(x) for x in bigdf['imgs']])
    print(X.shape)
    y = bigdf['class']
    filenames = bigdf['filename']
    X_scaled = StandardScaler().fit_transform(X.reshape(len(X),30000)).reshape(len(X), 100, 100, 3)
    X_tensor = torch.Tensor(X_scaled).to(my_device)
    return X_tensor, y, filenames

**We may change these MS COCO categories when testing as well as the number of retrieved items.**

In [6]:
airplanedf = cat_img_load("airplane", coco, 1000)
skateboarddf = cat_img_load("skateboard", coco, 1000)
mousedf = cat_img_load("mouse", coco, 1000)

In [7]:
len(airplanedf), len(skateboarddf), len(mousedf)

(1000, 1000, 1000)

In [8]:
X_tensor, y, filenames = get_tensors(airplanedf, skateboarddf, mousedf)

3000
(3000, 100, 100, 3)


## Batching and shuffling

**There should be no reason to edit this.**

In [9]:
class Batcher:
    def __init__(self, X, device, batch_size=50, max_iter=None):
        self.X = X
        self.device = device
        self.batch_size=batch_size
        self.max_iter = max_iter
        self.curr_iter = 0
        
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.curr_iter == self.max_iter:
            raise StopIteration
        permutation = torch.randperm(self.X.size()[0], device=self.device)
        permX = self.X[permutation]
        splitX = torch.split(permX, self.batch_size)
        
        self.curr_iter += 1
        return splitX

## Autoencoder

**Here is where you will make the main changes.**

You will get **10 points** for making changes that run and produce representations that are *emb_size* in width when we extract the embeddings by calling *emb* on the training data and represent a good-faith attempt at building a simple autoencoder.  

We will give **1 to 3 points** for model design.

In [41]:
class ImageAutoencoder(nn.Module):
    # You may ADD hyperparameters here (and make requisite adaptations in the training loop below).
    # Document them under "Your analysis"
    def __init__(self, emb_size, height, width):
        super(ImageAutoencoder, self).__init__()
        self.emb_size = emb_size
        self.height = height
        self.width = width
        
        #input_size = height*width*3
        input_size = width * 3
        #input_size = 3
        
        # Define your network structure here using PyTorch.
        self.enc1 = nn.Linear(in_features=input_size, out_features=input_size//2)
        self.enc2 = nn.Linear(in_features=input_size//2, out_features=input_size//4)
        self.enc3 = nn.Linear(in_features=input_size//4, out_features=emb_size)
        self.dec1 = nn.Linear(in_features=emb_size, out_features=input_size//4)
        self.dec2 = nn.Linear(in_features=input_size//4, out_features=input_size//2)
        self.dec3 = nn.Linear(in_features=input_size//2, out_features=input_size)
        
        
    def forward(self, batch):
        # Apply the model to the batch here.  Watch out for the shapes!
        batch = batch.flatten(start_dim=2)
        
        en1 = self.enc1(batch)
        en1 = torch.relu(en1)
        en2 = self.enc2(en1)
        en2 = torch.relu(en2)
        en3 = self.enc3(en2)
        en3 = torch.relu(en3)
        de1 = self.dec1(en3)
        de1 = torch.relu(de1)
        de2 = self.dec2(de1)
        de2 = torch.relu(de2)
        de3 = self.dec3(de2)
        de3 = torch.relu(de3)
        
        out = de3.view(-1, self.height, self.width, 3)
    
        return out
        
    def emb(self, batch):
        # This should return the inner representation of the images, including for arbitrary unseen images
        # of the correct shape.
        batch = batch.flatten(start_dim=2)
        en1 = self.enc1(batch)
        en1 = torch.relu(en1)
        en2 = self.enc2(en1)
        en2 = torch.relu(en2)
        en3 = self.enc3(en2)
        en3 = torch.relu(en3)
        
        out = en3.view(-1, self.height, self.emb_size)
        out = out.flatten(start_dim=1)
        return out
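As written, `emb` returns `height * emb_size` values per image rather than `emb_size`, because every pixel row is encoded separately. The following is a minimal sketch of an alternative that flattens the whole image first, as the commented-out `height*width*3` line suggests. The class name `FlatImageAutoencoder` and the `hidden_size` parameter are hypothetical and not part of the assignment; this is one possible design, not the reference solution.

```python
import torch
import torch.nn as nn


class FlatImageAutoencoder(nn.Module):
    """Hypothetical variant: encodes the whole flattened image, not single rows."""

    def __init__(self, emb_size, height, width, hidden_size=512):
        super().__init__()
        self.emb_size = emb_size
        self.height = height
        self.width = width
        input_size = height * width * 3  # every pixel and colour channel of one image

        self.encoder = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, emb_size),
        )
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Linear(emb_size, hidden_size),
            nn.ReLU(),
            # No activation after the last layer: the standardized targets
            # contain negative values, which a final ReLU could never produce.
            nn.Linear(hidden_size, input_size),
        )

    def emb(self, batch):
        # (B, H, W, 3) -> (B, H*W*3) -> (B, emb_size)
        return self.encoder(batch.flatten(start_dim=1))

    def forward(self, batch):
        out = self.decoder(self.emb(batch))
        return out.view(-1, self.height, self.width, 3)


# Shape check on random data (small sizes so it runs quickly on the CPU).
toy = torch.randn(4, 10, 10, 3)
toy_model = FlatImageAutoencoder(emb_size=16, height=10, width=10, hidden_size=32)
print(toy_model(toy).shape)      # torch.Size([4, 10, 10, 3])
print(toy_model.emb(toy).shape)  # torch.Size([4, 16])
```

With 100x100 images the first linear layer alone has 30000 * `hidden_size` weights, so `hidden_size` is the main knob for trading memory against capacity.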

## Training

In [42]:
import torch.optim as optim

**You may make limited changes here.**

You can adapt *train()* slightly to handle any hyperparameters you added to *ImageAutoencoder*.  We may test by changing the values of the hyperparameters when we grade the assignment.

In [43]:
def train(X, batch_size, epochs, device, model=None):
    b = Batcher(X, device, batch_size=batch_size, max_iter=epochs)
    if not model:
        # We may change the embedding size by hand here.
        m = ImageAutoencoder(400, X[0].size()[0], X[0].size()[1]).to(device)
    else:
        m = model
    loss = nn.MSELoss()
    optimizer = optim.Adam(m.parameters(), lr=0.005)
    epoch = 0
    for split in b:
        tot_loss = 0
        for batch in split:
            optimizer.zero_grad()
            o = m(batch)
            l = loss(o, batch)
            tot_loss += l.item()  # add a plain float so the epoch total does not hold on to each batch's autograd graph
            l.backward()
            optimizer.step()
        print("Total loss in epoch {} is {}.".format(epoch, tot_loss))
        epoch += 1
    return m

## Running the model and checking the output

We're leaving the results of running our own simple model here just so you know what it might look like, but there's no guarantee or requirement that the performance of your model will be similar.  It's reasonably likely that it might even be better...whatever we mean by better. But it will very likely be different, especially as there is some randomness involved.

In [44]:
#toy_tensor = X_tensor[:2]
model = train(X_tensor, 30, 100, my_device) 
# You can add hyperparameters also here, change the number of epochs, batch size, etc.

Total loss in epoch 0 is 77.28620910644531.
Total loss in epoch 1 is 69.22488403320312.
Total loss in epoch 2 is 67.10160827636719.
Total loss in epoch 3 is 65.97740173339844.
Total loss in epoch 4 is 65.85813903808594.
Total loss in epoch 5 is 65.11811065673828.
Total loss in epoch 6 is 64.63735961914062.
Total loss in epoch 7 is 64.27767181396484.
Total loss in epoch 8 is 64.11898803710938.
Total loss in epoch 9 is 63.89195251464844.
Total loss in epoch 10 is 63.77454376220703.
Total loss in epoch 11 is 63.827762603759766.
Total loss in epoch 12 is 63.478363037109375.
Total loss in epoch 13 is 63.40165328979492.
Total loss in epoch 14 is 63.106842041015625.
Total loss in epoch 15 is 63.09342575073242.
Total loss in epoch 16 is 63.019775390625.
Total loss in epoch 17 is 62.75934600830078.
Total loss in epoch 18 is 62.981109619140625.
Total loss in epoch 19 is 62.73892593383789.
Total loss in epoch 20 is 62.74154281616211.
Total loss in epoch 21 is 62.59525680541992.
Total loss in epoc

In [45]:
everything = model(X_tensor)

RuntimeError: CUDA out of memory. Tried to allocate 458.00 MiB (GPU 1; 10.92 GiB total capacity; 4.20 GiB already allocated; 10.50 MiB free; 404.35 MiB cached)

In [None]:
everything.shape
#X_tensor.shape

In [None]:
sample = everything[1511].cpu().detach().numpy()

In [None]:
plt.imshow(sample)

In [None]:
sample_true = X_tensor[1511].cpu().detach().numpy()

In [None]:
plt.imshow(sample_true)
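Both `sample` and `sample_true` hold standardized values (roughly zero mean, including negatives), while `plt.imshow` expects float RGB data in the range 0 to 1 and clips everything outside it. A small sketch of a min-max rescale for viewing follows; the helper name `to_displayable` is hypothetical. It does not restore the true colours, because `StandardScaler` standardized each pixel position and channel separately across the dataset. Undoing that would need the fitted scaler's `inverse_transform`, which `get_tensors` does not return.

```python
import numpy as np


def to_displayable(img):
    """Min-max rescale one (H, W, 3) float array into [0, 1] for plt.imshow."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # constant image: avoid dividing by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


# Random stand-in for a standardized image.
standardized = np.random.default_rng(0).normal(size=(100, 100, 3))
shown = to_displayable(standardized)
print(shown.min(), shown.max())  # 0.0 1.0
```

In the notebook this could be used as `plt.imshow(to_displayable(sample))`.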

In [None]:
embs = model.emb(X_tensor)

In [None]:
print(X_tensor.shape)
print(embs.shape)
#embs=embs.flatten(start_dim=1)  # cheating here, figure out what to do with dimensions!!
print(embs.shape)

In [None]:
embs = embs.cpu().detach().numpy()

In [None]:
kmeans = KMeans(3, random_state=700).fit(embs)

In [None]:
kmeans.labels_

In [None]:
truncated = TruncatedSVD(2).fit_transform(embs)

In [None]:
truncated.shape

In [None]:
plt.rcParams['figure.figsize'] = (12,12)

In [None]:
plt.scatter(truncated[:,0], truncated[:,1], c=[{0:'b',1:'g',2:'r'}[x] for x in kmeans.labels_], 
            s=[{"mouse":10, "skateboard":50, "airplane":100}[x] for x in y])

## Your analysis

**Informally analyze the performance of your model and clustering by examining cluster members to see the quality of the clusters and by experimenting with hyperparameters.  You can show your investigations here with markdown write-up.**

This will be graded on good-faith effort. **5 points** absolute for a reasonable effort (think 2-3 paragraphs of discussion) and **3 points** used to identify effort quality.

I decided to use a chain of linear layers for my autoencoder since I saw it suggested in several articles and it's a relatively straightforward approach. The parameters I'm going to change are input size, batch size, and number of epochs. Changing the input size of course changes how the data is presented to the neural network. I'll try it once with an input size of 3, representing the three colour channels of a single pixel, and once with an input size of 300, representing one row of pixels. For my analysis I look at random images and their reconstructions, as well as the cluster plot above.
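To make the "rows of pixels" representation concrete, here is a small shape trace that mirrors the layer sizes of `ImageAutoencoder` for `width = 100` and `emb_size = 400`. It uses random data and a standalone `nn.Sequential` as a stand-in for the encoder, so it only illustrates the shapes, not the trained model.

```python
import torch
import torch.nn as nn

batch = torch.randn(2, 100, 100, 3)   # stand-in for two 100x100 RGB images
rows = batch.flatten(start_dim=2)     # merges the width and channel axes
print(rows.shape)                     # torch.Size([2, 100, 300])

# nn.Linear only transforms the last axis, so each row is encoded on its own.
emb_size = 400
enc = nn.Sequential(nn.Linear(300, 150), nn.ReLU(),
                    nn.Linear(150, 75), nn.ReLU(),
                    nn.Linear(75, emb_size), nn.ReLU())
per_row = enc(rows)
print(per_row.shape)                  # torch.Size([2, 100, 400])

# emb() then concatenates the 100 row codes of every image.
per_image = per_row.view(-1, 100, emb_size).flatten(start_dim=1)
print(per_image.shape)                # torch.Size([2, 40000])
```

With these settings each image is represented by 40000 values, which is more than its 30000 input values, so the embedding returned by `emb` is wider than `emb_size` and is not compressed overall.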

For an input size of 300, a batch size of 30, and 100 epochs, the reconstructed images include the correct shapes, and the overall color scheme is correct, but there are colored vertical lines running throughout the image. Since the clusters lie very close together, I'm struggling to see distinctive patterns there (in the plot, small dots are mice, medium dots are skateboards, and big dots are airplanes). Overall, the blue cluster tends to have many small points and some medium ones, the red one predominantly consists of medium ones, and the big dots cluster mainly in green. The grouping is not very clear, though: many dots appear outside of these groups.

Using the same input size and batch size, but a significantly lower number of epochs (20), unsurprisingly yields worse results. The images still broadly contain the same shapes, but the colors are completely off and there are even more of the stripes. In the cluster plot, the colors are distributed differently, and they are less clearly grouped. For example, the green cluster contains a semi-cluster of big points and directly next to it a semi-cluster of small points. The blue and red clusters seem to contain a mix of all three sizes. So 20 epochs obviously weren't enough for the model to learn a good way of representing the images.

In a third experiment, I used 100 epochs again and reduced the batch size to 5. The results with these hyperparameters are very similar to the ones with few epochs and larger batches, both in the images and in the cluster. 

When I change the input size to 3, with the same settings for batch size (30) and number of epochs (100), I get very interesting results. The images are reconstructed relatively accurately regarding the shapes of objects in them, but the colors are completely off (yellow, to be more specific). This might also be caused by some error in my network, though; maybe I accidentally swap the RGB values at some point. The cluster plot looks even stranger. There are three clusters visible, but all the dots are big, so it seems like only pictures of airplanes are included in this clustering. Overall, it does not seem like the input size of 3 works well for this model and data (or I messed up the data when adjusting the layer sizes).

I was surprised how good the reconstructed images were with an input size of 300, a batch size of 30, and 100 epochs, since the model I chose is very simple. Even though the colourful stripes are of course not ideal, the subjects of the pictures are still recognizable. However, they don't seem to be recognizable to the clustering algorithm, which clearly doesn't do very well with the encoded data of my model.

Out of curiosity, I played around with the embedding size too in the end, since the default of 400 seemed very high to me. However, I found out that the image quality significantly decreases with smaller embedding sizes and the clustering gets even less clear, with all three clusters overlapping in the middle of the graph. 




## Your analysis (bonus)

**Search for and apply a method to analyze cluster purity relative to ground truth (4 bonus points), and apply it to hyperparameter and model variants (3 bonus points).**
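One common way to score clusters against ground truth is purity: each cluster is assigned its most frequent true class, and purity is the fraction of items that match their cluster's majority class. The following is a minimal sketch with a hypothetical helper `cluster_purity` and made-up toy labels; it is one possible measure, not necessarily the one the graders have in mind.

```python
import numpy as np


def cluster_purity(true_labels, cluster_ids):
    """Fraction of items that carry the majority true label of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    majority_total = 0
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        _, counts = np.unique(members, return_counts=True)
        majority_total += counts.max()  # size of the largest class in this cluster
    return majority_total / len(true_labels)


# Toy data: cluster 0 is pure, cluster 1 holds two mice and one airplane.
toy_true = ["airplane", "airplane", "mouse", "mouse", "airplane", "skateboard"]
toy_pred = [0, 0, 1, 1, 1, 2]
print(cluster_purity(toy_true, toy_pred))  # 5 of 6 items match, about 0.83
```

With the notebook's variables the call would presumably be `cluster_purity(y, kmeans.labels_)`. Purity rises trivially with the number of clusters, so it is most meaningful when comparing runs that use the same `k`.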

## Scratch area for your convenience

We will ignore anything after this line.

In [29]:
b = Batcher(X_tensor, device=my_device, batch_size=10, max_iter=3)
for split in b:
    for batch in split:
        print(batch.shape)
        print(batch)
        break
    break

torch.Size([10, 100, 100, 3])
tensor([[[[ 0.3837,  0.3632,  0.2218],
          [ 0.7108,  0.4557,  0.0433],
          [ 0.7174,  0.4365,  0.0641],
          ...,
          [ 0.8126,  0.5518,  0.2043],
          [ 0.7894,  0.5309,  0.1840],
          [ 0.7771,  0.5312,  0.1488]],

         [[ 0.3340,  0.3374,  0.2324],
          [ 0.7019,  0.4503,  0.0547],
          [ 0.6872,  0.4311,  0.0159],
          ...,
          [ 0.7742,  0.5522,  0.2168],
          [ 0.7803,  0.5212,  0.1739],
          [ 0.7681,  0.5105,  0.1657]],

         [[ 0.3540,  0.3620,  0.2329],
          [ 0.6599,  0.5088,  0.1207],
          [ 0.7330,  0.4749,  0.0522],
          ...,
          [ 0.8089,  0.5470,  0.2202],
          [ 0.7805,  0.5240,  0.1982],
          [ 0.7745,  0.5145,  0.1744]],

         ...,

         [[ 0.2015,  0.2574,  0.2929],
          [-0.6168, -0.7760, -0.8583],
          [ 0.1837, -0.2710, -0.6108],
          ...,
          [ 0.1372,  0.0344, -0.1008],
          [ 0.1844,  0.1562, -0

In [30]:
# solve problem with CUDA and dimensions.. height*width*3 makes more sense!
# try to add different/other layers to the model, try different layer sizes..
# analysis part..