In [None]:
import io
import itertools
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split

# Before running this notebook
1. Download the grid search result files below and put them in the same folder as this notebook:
   - https://github.com/msai437-group3/hw2/blob/mikky/classification_params_grid_search.csv
   - https://github.com/msai437-group3/hw2/blob/mikky/autoencoder_params_grid_search.csv
2. Set `DATA_PATH` in the next cell to the location of your dataset.

In [None]:
DATA_PATH = '/Users/mikky/Downloads/train-00000-of-00001-38cc4fa96c139e86.parquet'
emojis = pd.read_parquet(DATA_PATH, engine='pyarrow')

In [None]:
image = Image.open(io.BytesIO(emojis.iloc[0]['image']['bytes']))
image.size

In [None]:
plt.imshow(image)

# Question 1
(5.0points) Implement and train your autoencoder on the subset of the Emoji dataset that you selected and augmented:
a. describe your dataset and the steps that you used to create it,
b. provide a summary of your architecture(see Adversarial Examples Notebook)
c. discuss and explain your design choices,
d. list hyper-parameters used in the model,
e. plot learning curves for training and validation loss as a function of training epochs,
f. provide the final average error of your autoencoder on your test set, and
g. discuss any decisions or observations that you find relevant.

In [None]:
from collections import Counter

all_words = emojis['text'].str.split(' ').explode()
word_counts = Counter(all_words)
word_counts_df = pd.DataFrame(word_counts.items(), columns=['Word', 'Count'])
# Rank the words by count from big to small
word_counts_df = word_counts_df.sort_values('Count', ascending=False).reset_index(drop=True)

word_counts_df[:20]

We only consider the 20 most frequent words, to ensure there is enough data for training.
From these 20, we shortlist the words that name an entity: man, woman, flag, face, male, female, hand, person.
Let's take a look at these words and their corresponding pictures.

In [None]:
candicate_list = ['man', 'woman', 'flag', 'face', 'male', 'female', 'hand', 'person']

def format_text(text, max_words=4):
    # format the text, max_words: maximum number of words in one line
    words = text.split()
    lines = [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return '\n'.join(lines)

def show_emojis(filtered_df, fig_size=(25,25), subplot=(10,10), num=100, maximum_word=4):
    # Show up to `num` emoji pictures in a grid, each captioned with its description
    # Slice before iterating so that only the displayed images are decoded and resized
    resized_images = [{index:Image.open(io.BytesIO(row['image']['bytes'])).resize((128, 128))} for index, row in filtered_df.head(num).iterrows()]
    
    plt.figure(figsize=fig_size)
    for i, img_dict in enumerate(resized_images, start=1):
        ax = plt.subplot(subplot[0], subplot[1], i)
        ax.axis('off')
        plt.imshow(np.array(list(img_dict.values())[0]))
        ax.text(0.5, -0.5, format_text(filtered_df['text'][list(img_dict.keys())[0]], max_words=maximum_word), fontsize=12, ha='center', transform=ax.transAxes)
    
    plt.subplots_adjust(wspace=0.1, hspace=1.2)
    plt.show()

Let's look at the example pictures of each word:

### 1. man

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'man' in x.split())]
show_emojis(filtered_df)

### 2. woman

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'woman' in x.split())]
show_emojis(filtered_df)

### 3. flag

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'flag' in x.split())]
show_emojis(filtered_df)

### 4. face

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'face' in x.split())]
show_emojis(filtered_df)

### 5. male

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'male' in x.split())]
show_emojis(filtered_df)

### 6. female

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'female' in x.split())]
show_emojis(filtered_df)

### 7. hand

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'hand' in x.split())]
show_emojis(filtered_df)

### 8. person

In [None]:
filtered_df = emojis[emojis['text'].apply(lambda x: 'person' in x.split())]
show_emojis(filtered_df)

## Conclusion

Candidates: ['man', 'woman', 'flag', 'face', 'male', 'female', 'hand', 'person']
<br>

Selection criteria:
1. The data share common characteristics.
2. For Question 2, the data should have clear classifications.
3. For Question 3, the data should be easy to blend with each other.
<br>

Based on the above criteria, we shortlist the following candidates:
man, woman, face, male, female, person
<br>

Reasons why we do not choose hand and flag:
- Flags are too similar to one another, which makes them unsuitable for the Question 2 classification.
- Blending different hand gestures produces distorted, unrecognizable hands.

From the shortlist, we pick face.

Let's take a look at the full face dataset.

In [None]:
face_df = emojis[emojis['text'].apply(lambda x: 'face' in x.split())]
show_emojis(face_df, fig_size=(30,40), subplot=(12,16), num=len(face_df), maximum_word=2)

Clock faces share only the word "face" with the other emojis and look completely different from them, so we remove clock faces from the dataset.
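
The filter in the next cell matches whole words rather than substrings. A minimal sketch of that behaviour, using a few hypothetical descriptions written in the style of the dataset's `text` column (the helper name `is_face_emoji` is also an assumption made for illustration):

```python
# Hypothetical descriptions; only their format mirrors the dataset.
descriptions = [
    "grinning face",
    "clock face one oclock",
    "face palm",
    "surface wave",  # contains "face" only as a substring
]

def is_face_emoji(text):
    """Keep whole-word 'face' matches and drop anything mentioning 'clock'."""
    words = text.split()
    return "face" in words and "clock" not in words

kept = [d for d in descriptions if is_face_emoji(d)]
print(kept)  # ['grinning face', 'face palm']
```

Splitting on whitespace first is what prevents a description such as "surface wave" from being picked up by accident.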

In [None]:
face_df = emojis[emojis['text'].apply(lambda x: 'face' in x.split() and 'clock' not in x.split())].copy()
len(face_df)

In [None]:
face_texts = face_df['text']
face_imgs = np.array([np.array(Image.open(io.BytesIO(row['image']['bytes']))) for index, row in face_df.iterrows()])

In [None]:
# Fix a seed for reproducibility
seed_value = 42 
# Numpy RNG
np.random.seed(seed_value)
# PyTorch RNGs
torch.manual_seed(seed_value)
torch.cuda.manual_seed(seed_value)

In [None]:
class CustomDataset(Dataset):
    def __init__(self, data, transform=None):
        """
        Custom dataset for the autoencoder
        Args:
            data (numpy.ndarray): A matrix containing your data.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        sample = Image.fromarray(sample)
        if self.transform:
            sample = self.transform(sample)
        return sample

## Autoencoder
For the autoencoder, we try two different networks and compare them: a fully connected network and a convolutional network (CNN).

### Autoencoder Choice 1: Fully Connected Network
1. Given the small dataset size, we start with a relatively simple model to avoid overfitting. The encoder has four layers: 12288 (the flattened 64 x 64 x 3 input) to 1024, 1024 to 256, 256 to 64, and 64 to the latent space dimension. The latent space dimension should be large enough to capture relevant features but small enough to enforce meaningful compression. We tune it in the grid search later.
2. The decoder reconstructs the image by applying the same layer sizes in reverse order.
3. Because the output must be a valid image, we want it to be bounded. Therefore, we add a Sigmoid at the end of the decoder to limit the values to between 0 and 1.
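
To get a feel for the size of this model, the sketch below counts the encoder's trainable parameters from the layer sizes listed above. The helper names are hypothetical, and the bottleneck sizes 16, 32 and 64 are the values tried in the grid search later in this notebook.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def encoder_params(bottleneck_dim):
    """Total trainable parameters of the four-layer encoder."""
    sizes = [64 * 64 * 3, 1024, 256, 64, bottleneck_dim]
    return sum(linear_params(a, b) for a, b in zip(sizes, sizes[1:]))

for dim in (16, 32, 64):
    print(dim, encoder_params(dim))

# Share of the parameters that sit in the first layer alone
first_layer_share = linear_params(64 * 64 * 3, 1024) / encoder_params(16)
```

Almost all of the parameters sit in the first layer (12288 to 1024), so changing the bottleneck dimension barely changes the model size.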

In [None]:
class EmojiAutoencoderLinear(nn.Module):
    
    def __init__(self, params):
        super(EmojiAutoencoderLinear, self).__init__()
        self.params = params
        self.encoder = nn.Sequential(
            nn.Linear(64 * 64 * 3, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, params['bottleneck_dim'])
        )
        self.decoder = nn.Sequential(
            nn.Linear(params['bottleneck_dim'], 64),
            nn.ReLU(),
            nn.Linear(64, 256),
            nn.ReLU(),
            nn.Linear(256, 1024),
            nn.ReLU(),
            nn.Linear(1024, 64 * 64 * 3),
            nn.Sigmoid()  # Use sigmoid to ensure output values are between 0 and 1
        )

    def forward(self, x):
        bottleneck = self.encoder(x)
        x = self.decoder(bottleneck)
        return x

### Autoencoder Choice 2: CNN Network
Compared with plain fully connected layers, convolutional layers are more effective at capturing spatial hierarchies in image data.
1. With a 3x3 kernel, a stride of 2 and a padding of 1, each convolutional layer halves the spatial dimensions of its input (from 64x64 to 32x32, then to 16x16, 8x8 and 4x4). This down-sampling is part of what helps the network compress the input into a more manageable set of features.
2. ReLU introduces non-linearity, allowing the network to learn more complex patterns. It is favored in CNNs for its computational efficiency and because it helps mitigate the vanishing gradient problem.
3. Because the output must be a valid image, we want it to be bounded. Therefore, we add a Sigmoid at the end of the decoder to limit the values to between 0 and 1.
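
The halving in item 1 can be checked with the standard output-size formulas for convolution and transposed convolution (with dilation 1). This is a small sketch in plain Python with hypothetical helper names, using the kernel size, stride, padding and output padding of the layers defined in the next cell.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_transpose_out(n, kernel=3, stride=2, padding=1, output_padding=1):
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

size = 64
encoder_sizes = [size]
for _ in range(4):  # four encoder layers
    size = conv_out(size)
    encoder_sizes.append(size)

decoder_sizes = [size]
for _ in range(4):  # four decoder layers
    size = conv_transpose_out(size)
    decoder_sizes.append(size)

print(encoder_sizes)  # [64, 32, 16, 8, 4]
print(decoder_sizes)  # [4, 8, 16, 32, 64]
```

Setting `output_padding=1` is what lets each transposed convolution exactly double the size; without it the decoder would produce 7, 13, 25 and 49 instead.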

In [None]:
class EmojiAutoencoderCNN(nn.Module):
    
    def __init__(self, params):
        super(EmojiAutoencoderCNN, self).__init__()
        self.params = params
        if self.params['batch_normalization'] is True:
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),  # Output: 16 x 32 x 32
                nn.BatchNorm2d(16),
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # Output: 32 x 16 x 16
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # Output: 64 x 8 x 8
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # Output: 128 x 4 x 4
                nn.BatchNorm2d(128),
                nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 64 x 8 x 8
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 32 x 16 x 16
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 16 x 32 x 32
                nn.BatchNorm2d(16),
                nn.ReLU(),
                nn.ConvTranspose2d(16, 3, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 3 x 64 x 64
                nn.Sigmoid(),
            )
        else:
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),  # Output: 16 x 32 x 32
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # Output: 32 x 16 x 16
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # Output: 64 x 8 x 8
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # Output: 128 x 4 x 4
                nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 64 x 8 x 8
                nn.ReLU(),
                nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 32 x 16 x 16
                nn.ReLU(),
                nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 16 x 32 x 32
                nn.ReLU(),
                nn.ConvTranspose2d(16, 3, kernel_size=3, stride=2, padding=1, output_padding=1),  # Output: 3 x 64 x 64
                nn.Sigmoid(),
            )

    def forward(self, x):
        bottleneck = self.encoder(x)
        x = self.decoder(bottleneck)
        return x

Now we will construct our autoencoder and start training.

Since the dataset is small, we adopt regularization techniques to prevent overfitting: L2 weight decay and, for the CNN, batch normalization.
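
As a rough illustration of what L2 weight decay does, the sketch below applies it in a plain gradient descent step. This is a deliberate simplification: the notebook trains with Adam, where the `weight_decay` term is likewise added to the gradient but the update is then rescaled adaptively. The function name, learning rate and step count are assumptions chosen for the example.

```python
def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    """One plain gradient descent step with an L2 penalty.

    The penalty adds weight_decay * w to the gradient, which pulls
    the weight towards zero on every step.
    """
    return w - lr * (grad + weight_decay * w)

w_plain = 1.0
w_decay = 1.0
for _ in range(100):
    # A zero data gradient isolates the effect of the penalty
    w_plain = sgd_step(w_plain, grad=0.0)
    w_decay = sgd_step(w_decay, grad=0.0, weight_decay=1e-2)

print(w_plain, round(w_decay, 4))
```

With no data gradient the undecayed weight stays where it is, while the decayed weight shrinks by a factor of `1 - lr * weight_decay` per step.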

In [None]:
class EmojiAutoencoder(nn.Module):
    def __init__(self, params):
        super(EmojiAutoencoder, self).__init__()
        self.params = params
        if self.params['auto_encoder'] == 'linear':
            self.auto_encoder = EmojiAutoencoderLinear(self.params)
        elif self.params['auto_encoder'] == 'CNN':
            self.auto_encoder = EmojiAutoencoderCNN(self.params)
        else:
            self.auto_encoder = EmojiAutoencoderCNN(self.params)
        self.criterion = nn.MSELoss()
        self.optimizer = optim.Adam(self.parameters(), lr=self.params['learning_rate'], weight_decay=self.params['weight_decay'])
        self.train_loader, self.valid_loader, self.test_loader = load_data(self.params)
    
    def forward(self, img):
        return self.auto_encoder.forward(img)

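    def train(self):
        # Note: this method shadows nn.Module.train(mode), so training/evaluation
        # modes are set on the wrapped network (self.auto_encoder) directly.
        train_losses = []
        validation_losses = []

        for epoch in range(self.params['epoch']):
            self.auto_encoder.train()  # Training mode: batch normalization uses batch statistics
            train_loss = 0
            for data in self.train_loader:
                img = data
                if self.params['auto_encoder'] == 'linear':
                    img = img.view(img.size(0), -1)  # Flatten the images
                self.optimizer.zero_grad()
                outputs = self.forward(img)
                loss = self.criterion(outputs, img)
                loss.backward()
                self.optimizer.step()
                train_loss += loss.item()
            train_loss /= len(self.train_loader)
            train_losses.append(train_loss)

            # Validation
            self.auto_encoder.eval()  # Evaluation mode: batch normalization uses running statistics
            validation_loss = 0
            with torch.no_grad():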
                for data in self.valid_loader:
                    img = data
                    if self.params['auto_encoder'] == 'linear':
                        img = img.view(img.size(0), -1)
                    outputs = self.forward(img)
                    loss = self.criterion(outputs, img)
                    validation_loss += loss.item()
            validation_loss /= len(self.valid_loader)
            validation_losses.append(validation_loss)
            print(f'Epoch {epoch + 1}/{self.params["epoch"]}, Train Loss: {train_loss:.4f}, Validation Loss: {validation_loss:.4f}')

        plt.figure(figsize=[8, 6])
        plt.plot(train_losses, label='Training Loss')
        plt.plot(validation_losses, label='Validation Loss')
        plt.xlabel('Epochs', fontsize=14)
        plt.ylabel('Loss', fontsize=14)
        plt.title('Training and Validation Loss Curves', fontsize=16)
        plt.legend()
        # fig_name = f'{self.params["auto_encoder"]}_learning_curve_{self.params["batch_size"]}_{self.params["epoch"]}_{self.params["learning_rate"]}_{self.params["weight_decay"]}_{self.params["batch_normalization"]}_{self.params["bottleneck_dim"]}.jpg'
        # plt.savefig(fig_name, dpi=300, bbox_inches='tight')
        plt.show()

    def test(self):
        total_mse_error = 0.0
        total_samples = 0
        with torch.no_grad():
            for data in self.test_loader:
                img = data
                if self.params['auto_encoder'] == 'linear':
                    img = img.view(img.size(0), -1)
                reconstructed = self.auto_encoder.forward(img)
                mse_error = self.criterion(reconstructed, img)
                total_mse_error += mse_error.item() * img.size(0)  # Multiply by batch size to accumulate error correctly
                total_samples += img.size(0)
            average_mse_error = total_mse_error / total_samples
            print(f'Test Average MSE Error: {average_mse_error:.4f}')
            self.visual_inspection()
            return average_mse_error

    def visual_inspection(self):
        # Visual inspect the difference between original images and reconstructed images
        dataiter = iter(self.test_loader)  
        images = next(dataiter)
        if self.params['auto_encoder'] == 'linear':
            images = images.view(images.size(0), -1)
        reconstructed = self.auto_encoder.forward(images)
        images = images.view(-1, 3, 64, 64)  # Reshape original images to proper shape
        reconstructed = reconstructed.view(-1, 3, 64, 64)
        images = images.numpy()
        reconstructed = reconstructed.detach().numpy()

        fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 4))
        for i in range(5):
            ax = axes[0, i]
            image_clipped = np.clip(np.transpose(images[i], (1, 2, 0)), 0, 1)
            ax.imshow(image_clipped)  # Convert from (C, H, W) to (H, W, C)
            ax.set_title('Original')
            ax.axis('off')
            ax = axes[1, i]
            reconstructed_clipped = np.clip(np.transpose(reconstructed[i], (1, 2, 0)), 0, 1)
            ax.imshow(reconstructed_clipped)
            ax.set_title('Reconstructed')
            ax.axis('off')
        # fig_name = f'{self.params["auto_encoder"]}_reconstructed_{self.params["batch_size"]}_{self.params["epoch"]}_{self.params["learning_rate"]}_{self.params["weight_decay"]}_{self.params["batch_normalization"]}_{self.params["bottleneck_dim"]}.jpg'
        # plt.savefig(fig_name, dpi=300, bbox_inches='tight')
        plt.show()
        
    def encode(self, img):
        return self.auto_encoder.encoder(img)
    
    def decode(self, encoded_img):
        return self.auto_encoder.decoder(encoded_img)

In [None]:
def load_data(params):
    # Read data
    emojis = pd.read_parquet(DATA_PATH, engine='pyarrow')
    face_df = emojis[emojis['text'].apply(lambda x: 'face' in x.split() and 'clock' not in x.split())]
    face_imgs = np.array([np.array(Image.open(io.BytesIO(row['image']['bytes']))) for index, row in face_df.iterrows()])

    # Split dataset
    total_length = len(face_df)
    train_length = int(total_length * 0.6)
    valid_length = int(total_length * 0.2)
    test_length = total_length - train_length - valid_length
    train_dataset, valid_dataset, test_dataset = random_split(face_imgs, [train_length, valid_length, test_length])

    # Augment training data
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomAffine(
            degrees=15,  # Rotation: A small degree, since large rotations might make emojis unrecognizable.
            translate=(0.1, 0.1),  # Translation: Shift the image by 10% of its height/width in any direction.
            scale=(0.9, 1.1),  # Scale: Slightly zoom in or out by 10%.
            shear=5  # Shear: Apply a small shearing of 5 degrees.
        ),
        transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # Randomly change the brightness, contrast, and saturation of an image.
        transforms.ToTensor(),  # Convert a PIL Image or numpy.ndarray to tensor.
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    # Augment validation data
    validation_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ColorJitter(brightness=0.1, contrast=0.1),  # Mild augmentations
        transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ToTensor(),
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    # Test data: resize only, no augmentation
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        # transforms.ColorJitter(brightness=0.1, contrast=0.1),  # Mild augmentations
        # transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        # transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        # transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ToTensor(),
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    training_dataset = CustomDataset(train_dataset, transform=train_transform)
    validation_dataset = CustomDataset(valid_dataset, transform=validation_transform)
    testing_dataset = CustomDataset(test_dataset, transform=test_transform)
    train_loader = DataLoader(training_dataset, batch_size=params['batch_size'], shuffle=True)
    valid_loader = DataLoader(validation_dataset, batch_size=params['batch_size'], shuffle=False)
    test_loader = DataLoader(testing_dataset, batch_size=params['batch_size'], shuffle=False)
    return train_loader, valid_loader, test_loader

To find the best parameter combinations for the autoencoder, we define a grid search function.

In [None]:
def grid_search():
    params = {
        'epoch': None,
        'batch_size': None,
        'learning_rate': None,
        'bottleneck_dim': None,
        'weight_decay': None,
        'auto_encoder': None,
        'batch_normalization': None
    }
    epoch = [100, 200]
    batch_size = [16, 32, 64]
    learning_rate = [0.0001, 0.001, 0.01, 0.1]
    bottleneck_dim = [16, 32, 64]
    weight_decay = [1e-2, 1e-3, 1e-4, 1e-5]
    batch_normalization = [True, False]
    CNN_param_combinations = list(itertools.product(epoch, batch_size, learning_rate, weight_decay, batch_normalization, ['CNN']))
    linear_param_combinations = list(itertools.product(epoch, batch_size, learning_rate, weight_decay, bottleneck_dim, ['linear']))
    # CNN network
    grid_search_data = []
    for combo in CNN_param_combinations:
        keys = ['epoch', 'batch_size', 'learning_rate', 'weight_decay', 'batch_normalization', 'auto_encoder']
        params_update = dict(zip(keys, combo))
        params.update(params_update)
        autoencoder = EmojiAutoencoder(params)
        autoencoder.train()
        params['error_rate'] = autoencoder.test()
        print(params)
        grid_search_data.append(params.copy())
    # linear network
    for combo in linear_param_combinations:
        keys = ['epoch', 'batch_size', 'learning_rate', 'weight_decay', 'bottleneck_dim', 'auto_encoder']
        params_update = dict(zip(keys, combo))
        params.update(params_update)
        autoencoder = EmojiAutoencoder(params)
        autoencoder.train()
        params['error_rate'] = autoencoder.test()
        print(params)
        grid_search_data.append(params.copy())
    df = pd.DataFrame(grid_search_data)
    # df.to_csv('CNN_params_grid_search_final.csv', index=True)
    return df

Since it takes too long to run grid search, we have prepared the results here: https://github.com/msai437-group3/hw2/blob/mikky/autoencoder_params_grid_search.csv
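
To see why the full search is slow, the sketch below counts the configurations implied by the value lists in `grid_search`. It only counts runs and epochs and does not train anything.

```python
import itertools

# Value lists copied from grid_search above
epoch = [100, 200]
batch_size = [16, 32, 64]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
bottleneck_dim = [16, 32, 64]
weight_decay = [1e-2, 1e-3, 1e-4, 1e-5]
batch_normalization = [True, False]

cnn_runs = list(itertools.product(
    epoch, batch_size, learning_rate, weight_decay, batch_normalization))
linear_runs = list(itertools.product(
    epoch, batch_size, learning_rate, weight_decay, bottleneck_dim))

# The first element of every combination is its epoch count
total_epochs = sum(run[0] for run in cnn_runs + linear_runs)

print(len(cnn_runs), len(linear_runs), total_epochs)  # 192 288 72000
```

That is 480 separate training runs and 72,000 training epochs in total, which is why the notebook loads the saved results instead.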

In [None]:
result_df = pd.read_csv('autoencoder_params_grid_search.csv')
# result_df = grid_search()

## Observations

From the grid search results, we make the following observations:

1. The best autoencoder achieves a test MSE of 0.015.
2. The CNN outperforms the fully connected network. The top 50 autoencoders all use the CNN.
3. Hyperparameters:
- Batch size: Smaller batch sizes perform better. They introduce noise into the training process, which can have a regularizing effect and lead to better generalization. This is particularly beneficial for small datasets, where overfitting is a significant concern.
- Learning rate: The best configuration uses a learning rate of 0.01 rather than the smallest values in the grid. A larger step size can propel the parameters out of poor local minima, leading to better solutions.
- Regularization:
    - A smaller weight decay is more suitable here. Values such as 1e-4 or 1e-5 are commonly used with the Adam optimizer: they are small enough not to distort the learned weights too harshly while still helping to prevent overfitting.
    - Batch normalization helps to decrease the error rate. First, it reduces the internal covariate shift, which is the change in the distribution of network activations caused by the change in network parameters during training. By normalizing the inputs across mini-batches, it makes training faster and more stable. Second, because it stabilizes learning, it allows higher learning rates, which speeds up training without the risk of divergence. Third, it adds slight noise to the activations within each batch, which acts as a form of regularization and helps to prevent overfitting.
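
The normalization step at the core of batch normalization can be sketched in a few lines of plain Python. This is a simplified illustration with hypothetical activation values: it normalizes one feature across a mini-batch and leaves out the learnable scale and shift and the running statistics that `nn.BatchNorm2d` also maintains.

```python
from statistics import fmean, pvariance

def batch_norm(values, eps=1e-5):
    """Normalize one feature across a mini-batch to zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)  # biased (population) variance of the batch
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

batch = [0.2, 0.4, 0.9, 0.5]  # hypothetical activations of one feature
normalized = batch_norm(batch)

print([round(v, 3) for v in normalized])
```

Because the mean and variance are computed per mini-batch, the same activation is normalized slightly differently depending on which other samples share its batch. This is the source of the noise mentioned above.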

In [None]:
result_df.sort_values("error_rate", ascending=True)[:10]

In [None]:
average_values = result_df.groupby('auto_encoder')['error_rate'].mean()
average_values

In [None]:
# Count the autoencoder types among the top 50 performers
result_df.sort_values("error_rate", ascending=True)[:50].groupby('auto_encoder').size()

We can also take a look at the best performers among the fully connected networks. Their best MSE of 0.077 is about five times the best CNN error (0.015), which is poor for reconstruction.

In [None]:
result_df[result_df['auto_encoder'] == 'linear'].sort_values("error_rate", ascending=True)[:10]

In [None]:
cnn_df = result_df[result_df['auto_encoder'] == 'CNN']
cnn_df = cnn_df.loc[:, cnn_df.columns != 'auto_encoder']
correlation = cnn_df.corrwith(cnn_df['error_rate'])
correlation

In [None]:
linear_df = result_df[result_df['auto_encoder'] == 'linear']
linear_df = linear_df.loc[:, linear_df.columns != 'auto_encoder']
correlation = linear_df.corrwith(linear_df['error_rate'])
correlation

## Results 
Let's take a look at the best parameter combination:
1. training and validation loss learning curves
2. final average MSE error on the test set
3. reconstructed images
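
For item 2, the `test` method weights each batch's MSE by its batch size before averaging, whereas the training and validation losses in `train` are divided by the number of batches. The sketch below shows, on hypothetical one-value samples, why the two can differ when the last batch is smaller than the others.

```python
def mse(pred, target):
    """Mean squared error over one batch of scalar samples."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical (prediction, target) batches: one of size 2, one of size 1
batches = [
    ([0.0, 0.0], [1.0, 1.0]),  # batch MSE = 1.0
    ([0.0], [0.0]),            # batch MSE = 0.0
]

# Weight each batch by its size: equals the mean over all three samples
weighted = sum(mse(p, t) * len(p) for p, t in batches) / sum(len(p) for p, _ in batches)

# Plain mean of batch means: the small batch counts as much as the large one
unweighted = sum(mse(p, t) for p, t in batches) / len(batches)

print(round(weighted, 4), unweighted)  # 0.6667 0.5
```

The size-weighted average is the per-sample error, so it does not depend on how the test set happens to be split into batches.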

In [None]:
params = {
    'batch_size': 16,
    'learning_rate': 0.01,
    'epoch': 200,
    'auto_encoder': 'CNN',
    'weight_decay': 1e-5,
    'batch_normalization': False
}
autoencoder = EmojiAutoencoder(params)
autoencoder.train()
autoencoder.test()

# Question 2

(5.0points) Separate your dataset into two or more classes using Emoji descriptions and assign labels. Repeat Step 1 adding image classification as an auxiliary task to MSE with a lambda of your choosing. You can choose any classification technique.
a. describe how you separated your dataset into classes,
b. describe your classification technique and hyperparameters,
c. plot learning curves for training and validation loss for MSE and classification accuracy,
d. discuss how incorporating classification as an auxiliary tasks impacts the performance of your autoencoder,
e. speculate why performance changed and recommend (but do not implement) an experiment to confirm or reject your speculation.

## Classes

Based on face attributes, we manually split the face dataset into four classes:
human face, person, animal, other
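
One possible way to turn such a class-to-descriptions dictionary into integer labels is sketched below. It uses a shortened, illustrative copy of the dictionary defined in the next cell. The names `label_dict_small`, `class_to_index` and `text_to_label`, as well as the alphabetical ordering of the class indices, are assumptions made for this example.

```python
# Shortened, illustrative copy of label_dict
label_dict_small = {
    'animal': ['cat face', 'dog face'],
    'human face': ['grinning face', 'winking face'],
    'other': ['robot face'],
    'person': ['face palm'],
}

# Assign each class an integer index (alphabetical order is an arbitrary choice)
class_names = sorted(label_dict_small)
class_to_index = {name: i for i, name in enumerate(class_names)}

# Invert the dictionary: description -> class index
text_to_label = {
    text: class_to_index[name]
    for name, texts in label_dict_small.items()
    for text in texts
}

print(class_to_index)
print(text_to_label['robot face'])  # 2
```

A lookup table of this shape can then be applied to the `text` column to obtain one label per image.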

In [None]:
label_dict = {
    'animal': ['smiling cat face with open mouth',
               'grinning cat face with smiling eyes',
               'cat face with tears of joy',
               'smiling cat face with heart shaped eyes',
               'cat face with wry smile',
               'kissing cat face with closed eyes',
               'weary cat face',
               'crying cat face',
               'pouting cat face',
               'monkey face',
               'dog face',
               'wolf face',
               'fox face',
               'cat face',
               'lion face',
               'tiger face',
               'horse face',
               'unicorn face',
               'zebra face',
               'cow face',
               'pig face',
               'giraffe face',
               'mouse face',
               'hamster face',
               'rabbit face',
               'bear face',
               'panda face',
               'frog face',
               'dragon face'],
    'human face': ['grinning face',
               'smiling face with open mouth',
               'winking face',
               'smiling face with smiling eyes',
               'smiling face with halo',
               'smiling face with smiling eyes and three hearts',
               'smiling face with heart shaped eyes',
               'grinning face with star eyes',
               'face throwing a kiss',
               'kissing face',
               'white smiling face',
               'kissing face with closed eyes',
               'smiling face with open mouth and smiling eyes',
               'kissing face with smiling eyes',
               'face savouring delicious food',
               'face with stuck out tongue',
               'face with stuck out tongue and winking eye',
               'grinning face with one large and one small eye',
               'face with stuck out tongue and tightly closed eyes',
               'money mouth face',
               'hugging face',
               'smiling face with smiling eyes and hand covering mouth',
               'face with finger covering closed lips',
               'grinning face with smiling eyes',
               'thinking face',
               'zipper mouth face',
               'face with one eyebrow raised',
               'neutral face',
               'expressionless face',
               'face without mouth',
               'smirking face',
               'unamused face',
               'face with rolling eyes',
               'grimacing face',
               'smiling face with open mouth and tightly closed eyes',
               'lying face',
               'relieved face',
               'pensive face',
               'sleepy face',
               'drooling face',
               'sleeping face',
               'face with medical mask',
               'face with thermometer',
               'face with head bandage',
               'nauseated face',
               'smiling face with open mouth and cold sweat',
               'face with open mouth vomiting',
               'sneezing face',
               'overheated face',
               'freezing face',
               'face with uneven eyes and wavy mouth',
               'dizzy face',
               'shocked face with exploding head',
               'face with cowboy hat',
               'face with party horn and party hat',
               'smiling face with sunglasses',
               'nerd face',
               'face with monocle',
               'confused face',
               'worried face',
               'slightly frowning face',
               'white frowning face',
               'face with open mouth',
               'hushed face',
               'astonished face',
               'flushed face',
               'face with tears of joy',
               'face with pleading eyes',
               'frowning face with open mouth',
               'anguished face',
               'fearful face',
               'face with open mouth and cold sweat',
               'disappointed but relieved face',
               'crying face',
               'loudly crying face',
               'face screaming in fear',
               'confounded face',
               'slightly smiling face',
               'persevering face',
               'disappointed face',
               'face with cold sweat',
               'weary face',
               'tired face',
               'face with look of triumph',
               'pouting face',
               'angry face',
               'serious face with symbols covering mouth',
               'smiling face with horns',
               'upside down face'],
    'other': ['robot face',
               'new moon with face',
               'first quarter moon with face',
               'last quarter moon with face',
               'full moon with face',
               'sun with face',
               'wind blowing face',
               'clown face'],
    'person': ['face massage',
               'face massage',
               'face massage',
               'face massage',
               'face massage',
               'face massage',
               'man getting face massage',
               'man getting face massage type 1 2',
               'man getting face massage type 3',
               'man getting face massage type 4',
               'man getting face massage type 5',
               'man getting face massage type 6',
               'woman getting face massage',
               'woman getting face massage type 1 2',
               'woman getting face massage type 3',
               'woman getting face massage type 4',
               'woman getting face massage type 5',
               'woman getting face massage type 6',
               'person with pouting face',
               'person with pouting face',
               'person with pouting face',
               'person with pouting face',
               'person with pouting face',
               'person with pouting face',
               'face with no good gesture',
               'face with no good gesture',
               'face with no good gesture',
               'face with no good gesture',
               'face with no good gesture',
               'face with no good gesture',
               'face with ok gesture',
               'face with ok gesture',
               'face with ok gesture',
               'face with ok gesture',
               'face with ok gesture',
               'face with ok gesture',
               'face palm',
               'face palm',
               'face palm',
               'face palm',
               'face palm',
               'face palm']}

### Face class 1. human face

In [None]:
filtered_df = face_df[face_df['text'].isin(label_dict['human face'])]
show_emojis(filtered_df, fig_size=(30,40), subplot=(12,16), num=len(face_df), maximum_word=3)

### Face class 2. person

In [None]:
filtered_df = face_df[face_df['text'].isin(label_dict['person'])]
show_emojis(filtered_df, fig_size=(30,40), subplot=(12,16), num=len(face_df), maximum_word=3)

### Face class 3. animal

In [None]:
filtered_df = face_df[face_df['text'].isin(label_dict['animal'])]
show_emojis(filtered_df, fig_size=(30,40), subplot=(12,16), num=len(face_df), maximum_word=3)

### Face class 4. other

In [None]:
filtered_df = face_df[face_df['text'].isin(label_dict['other'])]
show_emojis(filtered_df, fig_size=(30,40), subplot=(12,16), num=len(face_df), maximum_word=3)

In [None]:
class CustomDatasetClassification(Dataset):
    # new custom dataset for classification problem
    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        feature = self.features[idx]
        feature = Image.fromarray(feature)
        if self.transform:
            feature = self.transform(feature)
        return feature, self.labels[idx]

In [None]:
def load_data_classification(params):
    # Read data
    emojis = pd.read_parquet(DATA_PATH, engine='pyarrow')
    face_df = emojis[emojis['text'].apply(lambda x: 'face' in x.split() and 'clock' not in x.split())].copy()
    text_label_mapping = {item: label for label, text_list in label_dict.items() for item in text_list}
    face_df['label'] = face_df['text'].replace(text_label_mapping)
    face_features = np.array([Image.open(io.BytesIO(item.get('bytes'))) for item in face_df['image'].values])
    face_labels = np.array(face_df['label'].values)
    unique_labels = sorted(set(face_labels))  # sort so label indices are stable across runs
    label_to_index = {label: index for index, label in enumerate(unique_labels)}
    label_indices = np.array([label_to_index[label] for label in face_labels])
    full_dataset = CustomDatasetClassification(face_features, label_indices) 

    # Split dataset
    total_length = len(full_dataset)
    train_length = int(total_length * 0.6)
    valid_length = int(total_length * 0.2)
    test_length = total_length - train_length - valid_length
    train_data, valid_data, test_data = random_split(full_dataset, [train_length, valid_length, test_length])

    # Augment training data
    train_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.RandomAffine(
            degrees=15,  # Rotation: A small degree, since large rotations might make emojis unrecognizable.
            translate=(0.1, 0.1),  # Translation: Shift the image by 10% of its height/width in any direction.
            scale=(0.9, 1.1),  # Scale: Slightly zoom in or out by 10%.
            shear=5  # Shear: Apply a small shearing of 5 degrees.
        ),
        transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        # Randomly change the brightness, contrast, and saturation of an image.
        transforms.ToTensor(),  # Convert a PIL Image or numpy.ndarray to tensor.
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    # Augment validation data: Define light transformations for augmentation
    valid_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        transforms.ColorJitter(brightness=0.1, contrast=0.1),  # Mild augmentations
        transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ToTensor(),
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    # Augment test data: Define light transformations for augmentation
    test_transform = transforms.Compose([
        transforms.Resize((64, 64)),
        # transforms.ColorJitter(brightness=0.1, contrast=0.1),  # Mild augmentations
        # transforms.RandomHorizontalFlip(),  # Horizontally flip the image with a given probability.
        # transforms.RandomVerticalFlip(p=0.5),  # Vertically flip the image with a given probability.
        # transforms.RandomRotation(20),  # Rotate the image by angle.
        transforms.ToTensor(),
        # transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])  # Normalize a tensor image with mean and standard deviation.
    ])

    train_dataset = CustomDatasetClassification(train_data.dataset.features[train_data.indices],
                                  train_data.dataset.labels[train_data.indices], train_transform)
    valid_dataset = CustomDatasetClassification(valid_data.dataset.features[valid_data.indices],
                                  valid_data.dataset.labels[valid_data.indices], valid_transform)
    test_dataset = CustomDatasetClassification(test_data.dataset.features[test_data.indices],
                                 test_data.dataset.labels[test_data.indices], test_transform)
    train_loader = DataLoader(train_dataset, batch_size=params['batch_size'], shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=params['batch_size'], shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=params['batch_size'], shuffle=False)
    return train_loader, valid_loader, test_loader

## Structure

- Since the CNN outperformed the fully connected models in Question 1, we use the CNN as our autoencoder.
- The encoder output feeds one or more fully connected layers that form the classification network.
- The final classification layer has four outputs, one per class label.
- We run a grid search to decide how many fully connected layers to use for this auxiliary classification task.
- To prevent overfitting, we add dropout between the fully connected layers.
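
As a rough illustration of the structure described above, the sketch below attaches a small classification head to a shared encoder. `ToyEncoder` is a hypothetical stand-in (the real encoder is `EmojiAutoencoderCNN` from Question 1); its layer sizes are assumptions chosen only so that a 64x64 RGB input flattens to 2048 features, matching the `nn.Linear(2048, ...)` layers in the class below.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    # Hypothetical stand-in for the Question 1 encoder (layer sizes are assumed).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

encoder = ToyEncoder()
head = nn.Sequential(            # same layout as classification_layer_num == 2
    nn.Linear(2048, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 4),           # four face classes
)

x = torch.rand(5, 3, 64, 64)                    # a fake batch of 5 images
bottleneck = encoder(x)                         # shape (5, 32, 8, 8)
logits = head(bottleneck.flatten(start_dim=1))  # shape (5, 4)
```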

In [None]:
class EmojiAutoencoderWithClassifier(nn.Module):
    def __init__(self, params, num_classes):
        super(EmojiAutoencoderWithClassifier, self).__init__()
        self.params = params
        self.auto_encoder = EmojiAutoencoderCNN(self.params)

        if self.params['classification_layer_num'] == 1:
            self.classifier = nn.Sequential(
                nn.Linear(2048, num_classes)
            )
        elif self.params['classification_layer_num'] == 2:
            self.classifier = nn.Sequential(
                nn.Linear(2048, 256),
                nn.ReLU(),
                nn.Dropout(self.params['dropout']),
                nn.Linear(256, num_classes)
            )
        elif self.params['classification_layer_num'] == 3:
            self.classifier = nn.Sequential(
                nn.Linear(2048, 256),
                nn.ReLU(),
                nn.Dropout(self.params['dropout']),
                nn.Linear(256, 32),
                nn.ReLU(),
                nn.Dropout(self.params['dropout']),
                nn.Linear(32, num_classes)
            )
        self.optimizer = optim.Adam(self.parameters(), lr=self.params['learning_rate'], weight_decay=self.params['weight_decay'])
        self.train_loader, self.valid_loader, self.test_loader = load_data_classification(self.params)
        self.mse_loss_function = nn.MSELoss()
        self.cross_entropy_loss_function = nn.CrossEntropyLoss()

    def forward(self, x):
        bottleneck = self.auto_encoder.encoder(x)
        decoded_features = self.auto_encoder.decoder(bottleneck)
        class_logits = self.classifier(bottleneck.view(bottleneck.size(0), -1))
        return decoded_features, class_logits

    def train(self):
        train_losses = []
        train_accuracies = []
        valid_losses = []
        valid_accuracies = []

        for epoch in range(self.params['epoch']):
            # Enable dropout in the classifier head for training. self.train()/self.eval()
            # cannot be used here because this class overrides nn.Module.train().
            self.classifier.train()
            train_loss = 0
            correct_train = 0
            total_train = 0
            for data in self.train_loader:
                imgs, labels = data
                self.optimizer.zero_grad()
                decoded_features, class_logits = self.forward(imgs)
                # Calculate total loss
                reconstruction_loss = self.mse_loss_function(decoded_features, imgs)
                classification_loss = self.cross_entropy_loss_function(class_logits, labels)
                loss = reconstruction_loss + self.params["classification_lambda"] * classification_loss
                loss.backward()
                self.optimizer.step()
                train_loss += loss.item()
                # Calculate accuracy
                _, predicted = torch.max(class_logits.data, 1)
                total_train += labels.size(0)
                correct_train += (predicted == labels).sum().item()
            train_loss /= len(self.train_loader)
            train_losses.append(train_loss)
            train_accuracies.append(100 * correct_train / total_train)

            # Validation
            self.classifier.eval()  # disable dropout in the classifier head while evaluating
            valid_loss = 0
            correct_valid = 0
            total_valid = 0
            with torch.no_grad():
                for data in self.valid_loader:
                    imgs, labels = data
                    decoded_features, class_logits = self.forward(imgs)
                    # Calculate total loss
                    reconstruction_loss = self.mse_loss_function(decoded_features, imgs)
                    classification_loss = self.cross_entropy_loss_function(class_logits, labels)
                    loss = reconstruction_loss + self.params["classification_lambda"] * classification_loss
                    valid_loss += loss.item()
                    # Calculate accuracy
                    _, predicted = torch.max(class_logits.data, 1)
                    total_valid += labels.size(0)
                    correct_valid += (predicted == labels).sum().item()
            valid_loss /= len(self.valid_loader)
            valid_losses.append(valid_loss)
            valid_accuracies.append(100 * correct_valid / total_valid)
            print(f'Epoch {epoch + 1}/{self.params["epoch"]}, '
                  f'Train Loss: {train_losses[-1]:.4f}, '
                  f'Train Accuracy: {train_accuracies[-1]:.2f}%, '
                  f'Validation Loss: {valid_losses[-1]:.4f}, '
                  f'Validation Accuracy: {valid_accuracies[-1]:.2f}%')

        plt.figure(figsize=(12, 5))
        plt.subplot(1, 2, 1)
        plt.plot(train_losses, label='Train Total Loss')
        plt.plot(valid_losses, label='Validation Total Loss')
        plt.title('Training and Validation Total Loss (MSE + lambda * Cross-Entropy)')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(train_accuracies, label='Train Accuracy')
        plt.plot(valid_accuracies, label='Validation Accuracy')
        plt.title('Training and Validation Classification Accuracy')
        plt.xlabel('Epochs')
        plt.ylabel('Accuracy (%)')
        plt.legend()

        plt.tight_layout()
        # fig_name = f'classification_learning_curve_{self.params["batch_size"]}_{self.params["epoch"]}_{self.params["learning_rate"]}_{self.params["dropout"]}_{self.params["classification_lambda"]}_{self.params["classification_layer_num"]}_{self.params["batch_normalization"]}_{self.params["weight_decay"]}.jpg'
        # plt.savefig(fig_name, dpi=300, bbox_inches='tight')
        plt.show()

    def test(self):
        total_mse_error = 0.0
        total_samples = 0
        total_test = 0
        correct_test = 0
        self.classifier.eval()  # disable dropout in the classifier head during evaluation
        with torch.no_grad():
            for data in self.test_loader:
                imgs, labels = data
                decoded_features, class_logits = self.forward(imgs)
                reconstruction_loss = self.mse_loss_function(decoded_features, imgs)
                total_mse_error += reconstruction_loss.item() * imgs.size(0)  # Multiply by batch size to accumulate error correctly
                _, predicted = torch.max(class_logits.data, 1)
                total_test += labels.size(0)
                correct_test += (predicted == labels).sum().item()
                total_samples += imgs.size(0)
            # performance for autoencoder
            average_mse_error = total_mse_error / total_samples
            print(f'Test Average MSE Error: {average_mse_error:.4f}')
            self.visual_inspection()
            # performance for classifier
            accuracy = 100 * correct_test / total_test
            print(f'Test Average Accuracy: {accuracy:.2f}%')
            return average_mse_error, accuracy

    def visual_inspection(self):
        # Visual inspect the difference between original images and reconstructed images
        dataiter = iter(self.test_loader)
        images, labels = next(dataiter)
        reconstructed = self.auto_encoder.forward(images)
        images = images.view(-1, 3, 64, 64)
        reconstructed = reconstructed.view(-1, 3, 64, 64)
        images = images.numpy()
        reconstructed = reconstructed.detach().numpy()

        fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10, 4))
        for i in range(5):
            # Display original images
            ax = axes[0, i]
            image_clipped = np.clip(np.transpose(images[i], (1, 2, 0)), 0, 1)
            ax.imshow(image_clipped)
            ax.set_title('Original')
            ax.axis('off')
            # Display reconstructed images
            ax = axes[1, i]
            reconstructed_clipped = np.clip(np.transpose(reconstructed[i], (1, 2, 0)), 0, 1)
            ax.imshow(reconstructed_clipped)
            ax.set_title('Reconstructed')
            ax.axis('off')
        # fig_name = f'classification_reconstructed_{self.params["batch_size"]}_{self.params["epoch"]}_{self.params["learning_rate"]}_{self.params["dropout"]}_{self.params["classification_lambda"]}_{self.params["classification_layer_num"]}_{self.params["batch_normalization"]}_{self.params["weight_decay"]}.jpg'
        # plt.savefig(fig_name, dpi=300, bbox_inches='tight')
        plt.show()
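
`test()` multiplies each batch's MSE by the batch size before dividing by the number of samples, while `train()` divides the summed batch losses by the number of batches. The two averages differ whenever the last batch is smaller than the rest. A minimal sketch with made-up per-batch losses (the numbers are illustrative, not results from this notebook):

```python
# (mean loss of the batch, number of samples in the batch); values are made up.
batches = [(0.020, 16), (0.030, 16), (0.100, 4)]

# Average of batch means: every batch counts equally (the style used in train()).
mean_of_batch_means = sum(loss for loss, _ in batches) / len(batches)

# Sample-weighted average: every image counts equally (the style used in test()).
total_samples = sum(n for _, n in batches)
sample_weighted_mean = sum(loss * n for loss, n in batches) / total_samples

print(f'{mean_of_batch_means:.4f} vs {sample_weighted_mean:.4f}')
```

The gap is small when batches are nearly equal in size, but it is one reason the training-curve values and the final test error are not strictly comparable.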

In [None]:
def grid_search_classification():
    params = {
        'epoch': None,
        'batch_size': None,
        'learning_rate': None,
        'dropout': None,
        'classification_lambda': None,
        'classification_layer_num': None
    }
    epoch = [100, 200]
    batch_size = [16, 32]
    learning_rate = [0.0001, 0.001, 0.01]
    dropout = [0, 0.3, 0.5]
    classification_lambda = [0.1, 0.5, 0.7]
    classification_layer_num = [1, 2, 3]
    batch_normalization = [True, False]
    weight_decay = [1e-3, 1e-4, 1e-5]
    param_combinations = list(itertools.product(epoch, batch_size, learning_rate, dropout, classification_lambda, classification_layer_num, batch_normalization, weight_decay))
    grid_search_data = []
    for combo in param_combinations:
        keys = ['epoch', 'batch_size', 'learning_rate', 'dropout', 'classification_lambda', 'classification_layer_num', 'batch_normalization', 'weight_decay']
        params_update = dict(zip(keys, combo))
        params.update(params_update)
        classifier = EmojiAutoencoderWithClassifier(params, 4)
        classifier.train()
        params['reconstruction_error_rate'], params['classification_accuracy'] = classifier.test()
        print(params)
        grid_search_data.append(params.copy())
    df = pd.DataFrame(grid_search_data)
    # df.to_csv('classification_params_grid_search.csv', index=True)
    return df
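
For a sense of scale, the grid above can be counted without training anything. The sketch below only enumerates the combinations; the value lists are copied from `grid_search_classification`, and the `grid` dictionary is just a convenient way to hold them.

```python
import itertools

grid = {
    'epoch': [100, 200],
    'batch_size': [16, 32],
    'learning_rate': [0.0001, 0.001, 0.01],
    'dropout': [0, 0.3, 0.5],
    'classification_lambda': [0.1, 0.5, 0.7],
    'classification_layer_num': [1, 2, 3],
    'batch_normalization': [True, False],
    'weight_decay': [1e-3, 1e-4, 1e-5],
}

# One dict per combination, keyed by parameter name.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(combos))  # 1944 full training runs
```

Each combination is a complete training run of 100 or 200 epochs, which is presumably why the notebook loads the saved CSV instead of rerunning the search.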

In [None]:
classification_result_df = pd.read_csv('classification_params_grid_search.csv')
# classification_result_df = grid_search_classification()

Let's take a look at the classification accuracy on the test data. The best-performing model reaches 100%.

In [None]:
classification_result_df.sort_values(['classification_accuracy', 'reconstruction_error_rate'], ascending=[False, True])[:20]

After adding the auxiliary classification task, the best reconstruction error rate rises from 0.015 to 0.0213, which suggests that incorporating classification as an auxiliary task hurts the reconstruction performance of the autoencoder.

In [None]:
classification_result_df.sort_values('reconstruction_error_rate', ascending=True)[:5]

In [None]:
correlation_error_rate = classification_result_df.corrwith(classification_result_df['reconstruction_error_rate'])
correlation_error_rate

In [None]:
correlation_accuracy = classification_result_df.corrwith(classification_result_df['classification_accuracy'])
correlation_accuracy

## Observations
1. Although classification accuracy can reach 100%, adding the auxiliary classification task raises the best reconstruction error rate from 0.015 to 0.022.
2. Classification accuracy is negatively correlated with reconstruction MSE loss.

### Speculation 
Possible reasons for the negative impact of the auxiliary task on the autoencoder:
- Competing Objectives: The autoencoder tries to learn a compact representation that best reconstructs the input data, while the classifier tries to learn the features that are most discriminative for the classification task. If these objectives don't align perfectly, the shared representation has to compromise between reconstruction quality and classification accuracy, which can lead to worse reconstructions. Note that the negative correlation between accuracy and MSE across grid-search runs does not by itself show this conflict (it says that runs with higher accuracy tend to have lower MSE); the evidence is the increase in the best reconstruction error compared with Question 1.
- Training Dynamics: The way the model is trained can also affect its performance. If the autoencoder was pre-trained before adding the classifier, and then both parts were fine-tuned together, the dynamics of learning might shift, impacting the quality of reconstruction. This can be due to changes in gradients and updates that now have to accommodate both tasks.
- Overfitting to Classification: Adding a classifier introduces more parameters to the model, increasing its complexity. If the classifier overfits to the training data, it may lead the entire model, including the autoencoder part, to overfit as well. This means that while the classification accuracy might improve, the autoencoder could lose its ability to generalize well to unseen data, worsening reconstruction quality.
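
For reference, the joint objective minimized in `train()` can be written as follows, where $x$ is the input image, $\hat{x}$ its reconstruction, $y$ the class label, $z$ the classifier logits, and $\lambda$ is `classification_lambda`:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{x}, x) + \lambda \, \mathcal{L}_{\text{CE}}(z, y)
$$

Both terms backpropagate through the same encoder weights, so a larger $\lambda$ pushes the bottleneck more strongly toward class-discriminative features.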


### Experiment to confirm speculation
#### Experiment 1
Sequential Training: First, train the autoencoder alone until it reaches satisfactory reconstruction quality. Then, freeze the weights of the encoder (and possibly the decoder) and train only the classifier. 
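
A minimal sketch of how Experiment 1 could be set up, assuming the encoder has already been trained. The tiny `encoder` and `classifier` modules here are hypothetical placeholders, not the notebook's models; the point is only the freezing pattern.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical placeholder modules standing in for a pre-trained encoder and a new head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32), nn.ReLU())
classifier = nn.Linear(32, 4)

# Freeze the encoder so that only the classifier receives gradient updates.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

optimizer = optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

imgs = torch.rand(8, 3, 64, 64)        # fake batch of images
labels = torch.randint(0, 4, (8,))     # fake labels

encoder_before = [p.clone() for p in encoder.parameters()]
optimizer.zero_grad()
with torch.no_grad():
    features = encoder(imgs)           # no graph is built for the frozen part
loss = loss_fn(classifier(features), labels)
loss.backward()
optimizer.step()

# The encoder weights are identical before and after the update step.
encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(encoder_before, encoder.parameters())
)
```
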
#### Experiment 2
More Regularization: Apply regularization techniques to prevent overfitting, especially in the classifier part. Since we have already implemented dropout, L2 regularization and data augmentation, the next step is early stopping.
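
Early stopping could look roughly like the helper below. `EarlyStopping`, `patience` and `min_delta` are names introduced here for illustration; nothing like it exists in the notebook yet, and the validation losses are made up.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.bad_epochs = 0

    def step(self, valid_loss):
        # Returns True when training should stop.
        if valid_loss < self.best_loss - self.min_delta:
            self.best_loss = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, valid_loss in enumerate([0.50, 0.40, 0.35, 0.36, 0.37, 0.30], start=1):
    if stopper.step(valid_loss):
        stopped_at = epoch
        break
```

Inside `train()`, `stopper.step(valid_loss)` would be called once per epoch after the validation pass.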
#### Experiment 3
Instead of using the same encoded representation for both reconstruction and classification, we can have separate branches after the initial layers of the encoder: one branch for reconstruction and one for classification. This allows each branch to learn features relevant to its specific objective without interfering too much with each other.

## Results

Let’s take a look at the best parameter combination:
- training and validation loss learning curves
- training and validation learning curves for classification accuracy
- final average MSE error on the test set
- reconstructed images

In [None]:
params = {
    'epoch': 200,
    'batch_size': 16,
    'learning_rate': 0.001,
    'dropout': 0.3,
    'classification_lambda': 0.1,
    'classification_layer_num': 1,
    'weight_decay': 0.00001,
    'batch_normalization': False
}
classifier = EmojiAutoencoderWithClassifier(params, 4)
classifier.train()
classifier.test()

# Question 3
(5.0 points) Select an attribute from the Emoji dataset (internal or external to your selected subset) to compose with any image from your selected subset. Use vector arithmetic on latent representations to generate a composite image that expresses the attribute. For example, I chose to add the glasses from “nerd face” to the “face with stuck out tongue”.
a. specify which attribute you selected, the vector arithmetic applied and the resulting image(s) as displayed above,
b. provide a qualitative evaluation of your composite image, and
c. discuss ways to improve the quality of your generated image.

We will use the best autoencoder from Question 1.

In [None]:
def get_latent_representation(autoencoder, emoji_text):
    emoji_info = emojis[emojis['text'] == emoji_text].iloc[0]
    img = Image.open(io.BytesIO(emoji_info['image']['bytes']))
    transform = transforms.Compose([transforms.Resize((64, 64)),transforms.ToTensor()])
    emoji_tensor = transform(img)
    latent_representation = autoencoder.encode(emoji_tensor)
    return latent_representation

In [None]:
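def get_reconstructed_image(latent_representation):
    # Decode with the global `autoencoder` and return an HxWxC array for plotting.
    reconstructed = autoencoder.decode(latent_representation).view(-1, 3, 64, 64).detach().numpy()[0]
    # Clip to [0, 1] so plt.imshow handles values pushed out of range by latent arithmetic.
    reconstructed = np.clip(np.transpose(reconstructed, (1, 2, 0)), 0, 1)
    return reconstructed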

In [None]:
def show_reconstructed_emojis(emoji_list, fig_size=(25,25), subplot=(10,10), maximum_word=4):
    plt.figure(figsize=fig_size)
    for i, emoji_text in enumerate(emoji_list, start=1):
        latent_representation = get_latent_representation(autoencoder, emoji_text)
        reconstructed = get_reconstructed_image(latent_representation)
        ax = plt.subplot(subplot[0], subplot[1], i)
        ax.axis('off')
        ax.imshow(reconstructed)
        ax.text(0.5, -0.5, format_text(emoji_text, max_words=maximum_word), fontsize=12, ha='center', transform=ax.transAxes)
    
    plt.subplots_adjust(wspace=0.1, hspace=1.2)
    plt.show()

Let's first try the example in the homework file.

In [None]:
latent_1 = get_latent_representation(autoencoder, "nerd face")
latent_2 = get_latent_representation(autoencoder, "smiling face with open mouth")
latent_3 = get_latent_representation(autoencoder, "face with stuck out tongue")
latent_combined = latent_1 - latent_2 + latent_3
image_combined = get_reconstructed_image(latent_combined)
plt.imshow(image_combined)

The result is poor. It suggests that we cannot simply do arithmetic on the latent representations of whole faces. Instead, to combine features from different faces, we extract each specific feature (e.g. a grinning mouth) as the difference from an average face and then add those features to the average face. The average face should be a neutral face with the most common mouth, eyes and eyebrows. Here, we pick "slightly smiling face" or "face without mouth".

In [None]:
average_face_list = ['slightly smiling face', 'face without mouth']
show_reconstructed_emojis(average_face_list)

Next, let's pick some distinctive mouths and eyes.

In [None]:
featured_mouth_list = ['grinning face', 'smiling face with open mouth', 
              'smiling face with open mouth and smiling eyes', 'face with stuck out tongue', 'face with stuck out tongue and tightly closed eyes', 'grinning face with smiling eyes', 
              'smiling face with open mouth and tightly closed eyes', 'sleepy face', 'face with medical mask',
              'smiling face with open mouth and cold sweat', 'face with open mouth vomiting', 'face with tears of joy',
              'weary face', 'tired face', 'face with look of triumph']
show_reconstructed_emojis(featured_mouth_list)

In [None]:
featured_eye_list = ['grinning face with star eyes',
            'face with stuck out tongue and winking eye', 
            'face with rolling eyes', 'dizzy face', 'smiling face with sunglasses',
            'flushed face', 'face with pleading eyes']
show_reconstructed_emojis(featured_eye_list)

Now let's generate a grinning face with sunglasses.

In [None]:
average_face = get_latent_representation(autoencoder, "slightly smiling face")

# get sunglasses feature
latent_1 = get_latent_representation(autoencoder, "smiling face with sunglasses")
latent_sunglasses = latent_1 - average_face

# get grinning mouth
latent_2 = get_latent_representation(autoencoder, "grinning face")
latent_grinning_mouth = latent_2 - average_face

# add features to an average face
latent_combined = average_face + latent_sunglasses + latent_grinning_mouth
image_combined = get_reconstructed_image(latent_combined)
# plt.imshow(image_combined)

# plot faces
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(3, 1))
axes[0].imshow(get_reconstructed_image(latent_1))
axes[1].imshow(get_reconstructed_image(latent_2))
axes[2].imshow(get_reconstructed_image(latent_combined))
axes[0].axis('off')
axes[1].axis('off')
axes[2].axis('off')
plt.show()

Pretty good! Let's try the example in the homework again.

In [None]:
average_face = get_latent_representation(autoencoder, "slightly smiling face")

# get glasses feature
latent_1 = get_latent_representation(autoencoder, "nerd face")
latent_glasses = latent_1 - average_face

# get smiling open mouth feature
latent_2 = get_latent_representation(autoencoder, "smiling face with open mouth")
latent_smiling_mouth = latent_2 - average_face

# get stuck out tongue feature
latent_3 = get_latent_representation(autoencoder, "face with stuck out tongue")
latent_stuck_out_tongue = latent_3 - average_face

# combine: nerd face minus the open-mouth feature plus the tongue feature (the average_face terms cancel)
latent_combined = latent_1 - latent_smiling_mouth + latent_stuck_out_tongue
image_combined = get_reconstructed_image(latent_combined)
# plt.imshow(image_combined)

# plot faces
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(4, 1))
axes[0].imshow(get_reconstructed_image(latent_1))
axes[1].imshow(get_reconstructed_image(latent_2))
axes[2].imshow(get_reconstructed_image(latent_3))
axes[3].imshow(get_reconstructed_image(latent_combined))
axes[0].axis('off')
axes[1].axis('off')
axes[2].axis('off')
axes[3].axis('off')
plt.show()

Horrible!
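
One possible reason this attempt looks no better than the first one: in `latent_1 - latent_smiling_mouth + latent_stuck_out_tongue`, the two `average_face` terms cancel, so the expression reduces to the plain `latent_1 - latent_2 + latent_3` tried earlier. A quick check with random stand-in vectors (the 2048-dimensional size is an assumption; any size works):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for the four latent vectors used in the cell above.
average_face, latent_1, latent_2, latent_3 = rng.normal(size=(4, 2048))

latent_smiling_mouth = latent_2 - average_face
latent_stuck_out_tongue = latent_3 - average_face

with_average = latent_1 - latent_smiling_mouth + latent_stuck_out_tongue
naive = latent_1 - latent_2 + latent_3
same = np.allclose(with_average, naive)  # True: the average face cancels out

# Variant that mirrors the sunglasses cell: add features to the average face.
latent_glasses = latent_1 - average_face
alternative = average_face + latent_glasses + latent_stuck_out_tongue
```

Whether `alternative` decodes to a better image has not been tested here; the sketch only shows that it is a different point in latent space from the one plotted above.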

### Qualitative evaluation:
1. Visual Quality: 
    - The clarity and resolution are not ideal, which reflects the limited reconstruction ability of the autoencoder.
    - The model is poor at expressing colors. For example, in "face with stuck out tongue" the tongue should be pink but comes out grey in the reconstructed image.
2. Inconsistent integration effect: Some features, such as the grinning mouth, smiling mouth and sunglasses, are easy to integrate, while others, such as glasses, produce weird contours and unnatural overlaps. A likely reason is that we have 6 emojis with a grinning mouth but only 1 emoji with glasses, so the model learns "grinning mouth" better than "glasses". Attributes that are not well represented in the latent vectors blend poorly.

### Ways to improve the quality of your generated image:
1. Improve the autoencoder:
    - Data: 
        - Increase data: Our dataset is still small, which makes the model prone to overfitting. We can enrich the data with external emoji datasets to improve generalization.
        - Balanced Attributes: Ensure that the attributes are well-represented in the latent space. For example, if blending "smiling" and "glasses" into a face, both attributes should be distinct and significant in their latent vectors.
    - Change the structure of the autoencoder by adding more CNN layers to capture the detailed features.
    - Add more color-sensitive network layers to capture the color features of the emojis.
    - Increase the size of the latent space to capture more details.
2. Interpolation and Smoothing: Instead of directly adding or subtracting feature vectors, consider using interpolation (such as spherical linear interpolation or SLERP) between vectors for smoother transitions.
3. Average Attribute Vectors: If aiming to add common attributes (like "smiling"), consider averaging several vectors representing that attribute from different images to create a more general representation.
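
Item 2 could be sketched as below. `slerp` is a helper written for this illustration and operates on flattened latent vectors as NumPy arrays; it has not been run against the notebook's autoencoder.

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between vectors v0 and v1 for t in [0, 1]."""
    v0 = np.asarray(v0, dtype=float)
    v1 = np.asarray(v1, dtype=float)
    # Angle between the two vectors, computed on normalized copies.
    cos_omega = np.dot(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        # (Anti)parallel vectors: fall back to plain linear interpolation.
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)  # halfway along the arc between a and b
```

Unlike a straight average, the midpoint keeps the same norm as the endpoints when they have equal norm, which may keep interpolated latents closer to the region the decoder was trained on.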
#### Advanced
4. Use of GANs: Consider using Generative Adversarial Networks (GANs) for attribute blending. GANs can produce more realistic and higher-quality images and can be trained specifically for tasks like face attribute modification.
5. Semantic Disentanglement: Work towards disentangling the latent space semantically, ensuring that different dimensions control distinct and interpretable aspects of the generated images.
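
Returning to item 3 (average attribute vectors), a minimal sketch with random stand-in latents. In the notebook the rows would come from `get_latent_representation` for several emojis that share the attribute; the shapes and the helper name `attribute_direction` are assumptions made for this example.

```python
import numpy as np

def attribute_direction(with_attribute, without_attribute):
    """Mean latent of emojis that have the attribute minus mean latent of those that lack it."""
    with_attribute = np.asarray(with_attribute, dtype=float)
    without_attribute = np.asarray(without_attribute, dtype=float)
    return with_attribute.mean(axis=0) - without_attribute.mean(axis=0)

rng = np.random.default_rng(0)
true_direction = np.zeros(16)
true_direction[3] = 2.0                       # pretend dimension 3 encodes the attribute

without_attr = rng.normal(size=(6, 16))       # six stand-in latents without the attribute
with_attr = rng.normal(size=(6, 16)) + true_direction

direction = attribute_direction(with_attr, without_attr)
composite = without_attr[0] + direction       # add the attribute to one latent
```

Averaging over several examples reduces the influence of whatever else differs between any single pair of emojis, which is the motivation given in item 3.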