# **Exercise on Autoencoders**

<p>Within the scope of this exercise, you will examine how an autoencoder works for the classification of a dataset.</p>



In [None]:
# train autoencoder for classification with no compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras
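# import everything through tensorflow.keras so the layers and the Model class
# come from the same Keras installation
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization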
from matplotlib import pyplot

We will use the scikit-learn <i>make_classification()</i> function to define a synthetic classification dataset (2 classes) with 100 input features (columns) and 1,000 samples (rows).

Note: In this problem, 90 of the 100 features are redundant (scikit-learn generates them as random linear combinations of the 10 informative features), which later allows the autoencoder to learn a genuinely useful compressed representation.

In [None]:
# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# number of input columns
n_inputs = X.shape[1]
# summarize the dataset
print(X.shape, y.shape)

(1000, 100) (1000,)
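
To make the redundancy concrete, the following sketch counts how many independent directions the data matrix actually contains. It repeats the `make_classification()` call from the cell above; the name `effective_rank` and the relative threshold of `1e-8` are choices made only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# same call as in the cell above
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)

# singular values of the data matrix, largest first
singular_values = np.linalg.svd(X, compute_uv=False)

# count the directions that are not numerically zero
# (the 1e-8 relative threshold is an arbitrary choice for this sketch)
effective_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(effective_rank)
```

Because the redundant columns are linear combinations of the informative ones, the count should equal the number of informative features rather than the number of columns.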


Next, we will develop an autoencoder model based on a [Multilayer Perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron).

The model takes each sample as input and is trained to output the same values, so it learns to reconstruct its own input.

The autoencoder consists of two parts: the encoder and the decoder.

Once the autoencoder is trained, the decoder will not be used again, and we keep only the encoder to compress input samples into the vectors produced by the bottleneck.

In this first autoencoder, we will not compress the input at all: the bottleneck layer will have the same number of nodes as the input layer (that is, the number of dimensions of a sample). With this setup the model should learn to reconstruct the input almost perfectly, which lets us confirm that it has been implemented correctly.

Before defining and fitting the model, we will split the data into training and test sets and scale the input data by normalizing the values to the range 0–1.


In [None]:
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
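
The scaling step above can also be written out by hand. The sketch below uses small random arrays (`toy_train` and `toy_test` are hypothetical stand-ins, not the exercise data) to show that the minimum and range are taken from the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data standing in for X_train / X_test (hypothetical values)
toy_train = rng.normal(size=(6, 3))
toy_test = rng.normal(size=(4, 3))

# "fit": per-column minimum and range, taken from the training data only
col_min = toy_train.min(axis=0)
col_range = toy_train.max(axis=0) - col_min

# "transform": the same affine map is applied to both sets
toy_train_scaled = (toy_train - col_min) / col_range
toy_test_scaled = (toy_test - col_min) / col_range

print(toy_train_scaled.min(axis=0), toy_train_scaled.max(axis=0))
print(toy_test_scaled.min(), toy_test_scaled.max())
```

The training columns end up spanning exactly 0 to 1. Test values may fall slightly outside that range, because the test set played no part in choosing the minimum and maximum.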

We will define the encoder to have two hidden layers: the first with twice the number of input dimensions (i.e., 200) and the second with the original number of dimensions (100).


In [None]:
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)

Next, we define the decoder. The decoder will have the reverse structure of the encoder.

In [None]:
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
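
To get a feel for the size of this network, the sketch below counts its parameters by hand for a given bottleneck width. The helper names (`dense_params`, `batchnorm_params`, `autoencoder_params`) are hypothetical and exist only for this illustration. It assumes that each Dense layer has a weight matrix plus a bias vector and that each BatchNormalization layer stores four vectors per feature (scale, offset, moving mean, moving variance).

```python
def dense_params(n_in, n_out):
    # weight matrix plus one bias per output node
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # scale, offset, moving mean and moving variance (assumed 4 vectors)
    return 4 * n_features

def autoencoder_params(n_inputs, n_bottleneck):
    total = 0
    # encoder level 1 and level 2
    total += dense_params(n_inputs, 2 * n_inputs) + batchnorm_params(2 * n_inputs)
    total += dense_params(2 * n_inputs, n_inputs) + batchnorm_params(n_inputs)
    # bottleneck
    total += dense_params(n_inputs, n_bottleneck)
    # decoder level 1 and level 2
    total += dense_params(n_bottleneck, n_inputs) + batchnorm_params(n_inputs)
    total += dense_params(n_inputs, 2 * n_inputs) + batchnorm_params(2 * n_inputs)
    # output layer
    total += dense_params(2 * n_inputs, n_inputs)
    return total

for width in (100, 50, 25):
    print(width, autoencoder_params(100, width))
```

Under these assumptions, the value for a bottleneck of 100 should agree with the total reported by `model.summary()`. Only the two Dense layers that touch the bottleneck change size when its width is reduced.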

Next, we compile our model.

In [None]:
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
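
For reference, the `mse` loss is the mean squared difference between each input and its reconstruction. The following NumPy sketch computes it on two made-up samples; the arrays `x_true` and `x_pred` are hypothetical and are not output of the model.

```python
import numpy as np

# hypothetical inputs and reconstructions: 2 samples, 3 features
x_true = np.array([[0.0, 0.5, 1.0],
                   [0.2, 0.4, 0.6]])
x_pred = np.array([[0.1, 0.5, 0.8],
                   [0.2, 0.5, 0.6]])

# squared error averaged over the features of each sample
per_sample = np.mean((x_true - x_pred) ** 2, axis=1)
# averaged again over the samples to give a single loss value
mse = per_sample.mean()
print(per_sample, mse)
```

A perfect reconstruction gives a loss of zero, which is the value the uncompressed autoencoder is expected to approach.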

Next, we can train the model to reproduce the input data and monitor the model’s performance on the test set.

Question 1: What is the difference between mini-batches and epochs?


In [None]:
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test,X_test))
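
The numbers in the `fit()` call above translate into a concrete amount of training work. The sketch below does this arithmetic in plain Python. It assumes that `train_test_split` rounds the test share up to a whole number of samples and that the last, smaller batch of each epoch is still processed.

```python
import math

n_samples = 1000
test_size = 0.33
batch_size = 16
epochs = 200

# samples held out for testing, and what remains for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# weight updates per pass over the training data (last batch is smaller)
steps_per_epoch = math.ceil(n_train / batch_size)
# weight updates over the whole training run
total_updates = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_updates)
```

With `verbose=2`, Keras prints one line per epoch, so the training log should contain 200 such lines.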

Below are the learning curves for the train and test datasets.

In [None]:
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend(title = "datasets")
pyplot.title("Loss")
pyplot.show()

Question 2: Try running the model with a bottleneck of (a) 50 and (b) 25. What do you observe regarding the model’s results?

Question 3: Try using three layers (the two existing ones and one additional layer with a bottleneck of 50). What differences do you observe?