# CSE 204 Exam 2

J.B. Scoggins

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jbscoggi/teaching/blob/master/Polytechnique/CSE204/Exam_2.ipynb) 

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jbscoggi/teaching/master?filepath=Polytechnique%2FCSE204%2FExam_2.ipynb)


## Introduction

In this exam, you will apply what you have learned in previous lab exercises to build a 2-dimensional embedding for a multi-label classification dataset that you have not seen before.  The dataset is already loaded and split into `labels` and `features` dataframes for you.  Be sure to inspect the labels and features carefully before continuing.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import pandas as pd

# Load the dataset and split into labels and features dataframes
data = pd.read_csv('https://raw.githubusercontent.com/jbscoggi/teaching/master/Polytechnique/CSE204/data/Scene_6.csv')
labels = data.loc[:,'beach':'urban']
features = data.loc[:,'Att1':]

As you can see, the dataset has 6 labels and 294 features per example.  Each label takes the value 0 or 1, indicating whether the example belongs to that label (1) or not (0).  Note that an example may belong to several labels at once.  In other words, the labels are not mutually exclusive.  This is known as a multi-label classification problem.
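One way to check that labels are not mutually exclusive is to count how many labels each example carries and how often each combination occurs. The sketch below uses a small hypothetical stand-in for `labels` (the column `other` and all values are made up for illustration); the same two calls should apply to the real dataframe.

```python
import pandas as pd

# Hypothetical toy stand-in for the `labels` dataframe (values are made up).
toy_labels = pd.DataFrame(
    {"beach": [1, 0, 1, 0], "other": [0, 1, 1, 0], "urban": [0, 0, 0, 1]}
)

# Number of active labels per example; a value above 1 marks a multi-label row.
labels_per_example = toy_labels.sum(axis=1)

# How often each distinct combination of labels occurs.
combination_counts = toy_labels.value_counts()

print(labels_per_example.tolist())
print(combination_counts)
```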

### Creating a unique floating point value for every combination of labels
Since examples can belong to 1 or more of 6 possible labels, it may be useful to assign a unique floating point value to each possible combination of labels, in order to provide a color scale when plotting the 2D embeddings.  You can use the following formula,
$$
\mathrm{color\_scale} = \log_2 \left( \sum_{i=0}^{5} y_i 2^i \right),
$$
where $y_i \in \{0, 1\}$ is the true value of the $i$-th label (counting from $i = 0$) for the given example.  A small code snippet that implements this formula is provided below.

In [None]:
# Create a floating point color value for each combination of labels
pow2 = np.array([np.power(2,i) for i in range(len(labels.columns))])
colors = [np.log2(pow2.dot(row)) for index, row in labels.iterrows()]
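To make the formula concrete, here is a small worked example on hypothetical label rows (made up for illustration, not taken from the dataset). It also computes the values with a single matrix product instead of a Python loop; this vectorized form is an alternative sketch, not the notebook's own code.

```python
import numpy as np

# Hypothetical label rows for 6 labels (made up for illustration).
toy = np.array([
    [1, 0, 0, 0, 0, 0],  # only label 0    -> log2(1)  = 0
    [0, 1, 0, 0, 0, 0],  # only label 1    -> log2(2)  = 1
    [1, 1, 0, 0, 0, 0],  # labels 0 and 1  -> log2(3), about 1.585
    [0, 0, 0, 0, 0, 1],  # only label 5    -> log2(32) = 5
])

pow2 = 2 ** np.arange(toy.shape[1])  # [1, 2, 4, 8, 16, 32]
toy_colors = np.log2(toy @ pow2)     # one value per row
print(toy_colors)
```

Because each combination of labels gives a different integer inside the logarithm, each combination gets a distinct color value. A row with no active label would give $\log_2(0) = -\infty$, so it is worth checking whether such rows exist before plotting.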

### Plotting embeddings

As you saw during the autoencoders lab, we can plot our dataset in a lower-dimensional space.  Recall that autoencoders are a dimension-reduction technique, where the aim is to produce a low-dimensional encoding of the high-dimensional feature space, such that we can reconstruct that space to a desired degree of accuracy.  Embeddings are similar, in that they are low-dimensional encodings, but they are optimized to reconstruct the labels rather than the features.  In this exam, you are asked to produce a 2D embedding of the given dataset and plot the embedding (a low-dimensional transformation of the features), as you did during the autoencoders lab.  As an example of such a plot, the code below plots the first two principal components obtained by applying PCA to the feature space. (Note the use of our `colors` list from above.)

In [None]:
pca_transform = PCA(2).fit_transform(features)
plt.scatter(pca_transform[:,0], pca_transform[:,1], c=colors, cmap='rainbow')
plt.colorbar()

## Problem Description

Your task is to build and train a model that produces a 2D embedding of the given dataset and then uses this embedding to predict the multi-label classification.  Note that this is similar to the autoencoders you have already built, but instead of trying to reproduce the input at the output, you want to output the class probabilities.  For example, if you were to use a simple linear autoencoder-like structure, it could have an input layer of 294 nodes, a code layer of 2 nodes, and an output layer of 6 nodes with sigmoid activations.  However, you are not restricted to using this model.
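To illustrate the shapes involved in the example structure above, here is a minimal NumPy sketch of an untrained forward pass. It assumes that both layers have bias terms and uses random weights and random toy inputs; all names (`W_enc`, `W_dec`, `forward`) are hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_code, n_labels = 294, 2, 6

# Randomly initialised (untrained) weights; bias terms are an assumption.
W_enc = rng.normal(scale=0.1, size=(n_features, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_code, n_labels))
b_dec = np.zeros(n_labels)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    """Return the 2D code and the per-label probabilities for a batch X."""
    code = X @ W_enc + b_enc                 # linear code layer (the embedding)
    probs = sigmoid(code @ W_dec + b_dec)    # one independent sigmoid per label
    return code, probs

X_toy = rng.normal(size=(5, n_features))     # 5 made-up examples
code, probs = forward(X_toy)

# Parameter count with biases: (294*2 + 2) + (2*6 + 6) = 608
n_params = W_enc.size + b_enc.size + W_dec.size + b_dec.size
print(code.shape, probs.shape, n_params)
```

Sigmoid outputs are used rather than a softmax because each label is predicted independently, which matches the multi-label setting.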

### Grading 

You will be graded based on a short report that you must submit on Moodle as a PDF.  The report should contain:

1. The code and a description of your model, data preprocessing, and training procedure.  The model description should include the total number of trainable parameters in your model.
2. A plot of your 2D embedding, like the scatter plot shown above.
3. The best classification accuracy you achieved for each label.  The least accurate label will be the one considered during grading.
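Per-label accuracy, as asked for in the third item, could be computed along the following lines. This is a sketch on made-up predictions; the 0.5 decision threshold and the variable names are assumptions, not part of the exam statement.

```python
import numpy as np

# Made-up true labels and predicted probabilities (4 examples, 3 labels).
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.4, 0.6, 0.3],
                   [0.8, 0.4, 0.1],
                   [0.1, 0.3, 0.6]])

# Assumed threshold: a label is predicted when its probability is >= 0.5.
y_pred = (y_prob >= 0.5).astype(int)

# Fraction of correctly classified examples, computed separately per label.
per_label_accuracy = (y_pred == y_true).mean(axis=0)
worst_label_accuracy = per_label_accuracy.min()

print(per_label_accuracy, worst_label_accuracy)
```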

The first and second points above will be given a maximum of 5 points each.  Another 5 points will be determined based on how well your accuracy and number of model parameters compare with those of your classmates.  You will receive points on a linear scale, such that the student with the best accuracy will get 3 points, while the student with the worst accuracy will get 0.  Likewise, the student with the lowest number of parameters will get 2 points, and the student with the highest will receive 0.  Thus, a total of 15 points is possible.
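One plausible reading of the linear scale is a straight-line interpolation between the worst and best values in the class. The sketch below shows that reading only; the helper `linear_points` and all numbers are hypothetical, and the actual grading may differ in its details.

```python
def linear_points(value, worst, best, max_points):
    """Linearly map `value` from [worst, best] onto [0, max_points].

    Assumes best != worst. `worst` may be larger than `best`, as is the
    case for parameter counts, where fewer parameters is better.
    """
    return max_points * (value - worst) / (best - worst)

# Hypothetical class results: accuracies range from 0.70 to 0.90.
accuracy_points = linear_points(0.85, worst=0.70, best=0.90, max_points=3)

# Hypothetical parameter counts: from 608 (best) to 5000 (worst).
parameter_points = linear_points(2804, worst=5000, best=608, max_points=2)

print(accuracy_points, parameter_points)  # about 2.25 and 1.0
```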

## Bonus Point

The following code loads a dataset into a 2D NumPy array `X`. Build a 2D representation of this data (e.g., by modifying one of your previous architectures) that is interesting to visualize. Plotting code is provided, including code that annotates a set of randomly selected rows.
Evaluation will be done by an expert in the domain through visual inspection of your plot. Note that there is no guarantee that a very good visual representation can be obtained, which is why this is a bonus point.

```python
X = np.loadtxt('https://raw.githubusercontent.com/jbscoggi/teaching/master/Polytechnique/CSE204/data/DATASET_X.csv')

# TODO: create a 2D representation/encoding Z with the same number of rows as X.

Z = 
fig, ax = plt.subplots()
ax.scatter(Z[:,0], Z[:,1])

rows_of_interest = [5, 15, 88, 20, 66, 21] 
for i in rows_of_interest:
    ax.annotate(str(i), (Z[i,0], Z[i,1]), fontsize=16, color='red')

plt.show()
```