# Contractive Autoencoder (CAE) Tutorial

## Introduction

A Contractive Autoencoder (CAE) is an autoencoder whose loss includes a regularization term that makes the learned features robust to small variations in the input data. The term penalizes the squared Frobenius norm of the Jacobian matrix of the encoder's activations with respect to the input, so that small changes in the input produce only small changes in the hidden representation.

## Architecture

A CAE consists of two main parts:
1. **Encoder**: Compresses the input into a latent-space representation.
2. **Decoder**: Reconstructs the input from the latent-space representation.

### Encoder

The encoder function, $h = f(x)$, maps the input $x$ to a hidden representation $h$:

$$
h = f(x) = \sigma(Wx + b)
$$

where:
- $W$ is a weight matrix
- $b$ is a bias vector
- $\sigma$ is an activation function (e.g., ReLU, sigmoid)
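
The encoder equation can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a sigmoid activation; the function name `encode` and the layer sizes are placeholders chosen for the example, not part of the definition above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Compute h = sigma(W x + b) for a single input vector x."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
d_in, d_hidden = 6, 3                      # illustrative sizes (assumption)
W = rng.normal(scale=0.1, size=(d_hidden, d_in))
b = np.zeros(d_hidden)
x = rng.normal(size=d_in)

h = encode(x, W, b)
print(h.shape)  # (3,)
```

With a sigmoid, every component of `h` lies strictly between 0 and 1.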

### Decoder

The decoder function, $\hat{x} = g(h)$, maps the hidden representation $h$ back to the original input space:

$$
\hat{x} = g(h) = \sigma(W'h + b')
$$

where:
- $W'$ is a weight matrix
- $b'$ is a bias vector
- $\sigma$ is an activation function
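
A matching sketch of the decoder, again assuming a sigmoid activation. The names `decode`, `W_dec`, and `b_dec` stand in for $g$, $W'$, and $b'$ and are chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def decode(h, W_dec, b_dec):
    """Compute x_hat = sigma(W' h + b') for a single hidden vector h."""
    return sigmoid(W_dec @ h + b_dec)

rng = np.random.default_rng(0)
d_in, d_hidden = 6, 3                      # illustrative sizes (assumption)
W_dec = rng.normal(scale=0.1, size=(d_in, d_hidden))
b_dec = np.zeros(d_in)
h = rng.uniform(size=d_hidden)             # stand-in for an encoder output

x_hat = decode(h, W_dec, b_dec)
print(x_hat.shape)  # (6,)
```

Because a sigmoid output is confined to the interval (0, 1), this choice of output activation only makes sense when the inputs are scaled to that range.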

### Loss Function

The loss function in a Contractive Autoencoder consists of two parts:
1. **Reconstruction Loss**: Measures how well the decoder reconstructs the input.
2. **Contractive Loss**: Penalizes the sensitivity of the encoder activations to changes in the input.

The total loss is:

$$
L = \text{Reconstruction Loss} + \lambda \text{Contractive Loss}
$$

where $\lambda$ is a regularization parameter that controls the weight of the contractive loss.
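
To make the combination concrete, here is a sketch of the total loss for a single input. It assumes sigmoid activations in both layers (which gives the encoder Jacobian a simple closed form) and uses the hypothetical names `cae_loss`, `W_dec`, `b_dec`, and `lam` for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_loss(x, W, b, W_dec, b_dec, lam):
    """Return (total, reconstruction, contractive) for one input vector x."""
    h = sigmoid(W @ x + b)                     # encoder
    x_hat = sigmoid(W_dec @ h + b_dec)         # decoder
    reconstruction = np.mean((x - x_hat) ** 2)
    # Jacobian of a sigmoid encoder: row i of W scaled by h_i * (1 - h_i)
    J = (h * (1.0 - h))[:, None] * W
    contractive = np.sum(J ** 2)               # squared Frobenius norm
    return reconstruction + lam * contractive, reconstruction, contractive

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
b = np.zeros(3)
W_dec = rng.normal(scale=0.5, size=(5, 3))
b_dec = np.zeros(5)
x = rng.uniform(size=5)

total, rec, con = cae_loss(x, W, b, W_dec, b_dec, lam=0.1)
print(total, rec, con)
```

Setting `lam=0` recovers a plain autoencoder loss; increasing it trades reconstruction accuracy for a flatter encoder.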

#### Reconstruction Loss

The reconstruction loss is typically the mean squared error (MSE) between the input $x$ and its reconstruction $\hat{x}$, averaged over the $n$ input dimensions:

$$
\text{Reconstruction Loss} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2
$$
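
A small worked example of the MSE formula, using made-up values for `x` and `x_hat`:

```python
import numpy as np

x = np.array([1.0, 0.0, 0.5, 0.25])        # input (illustrative values)
x_hat = np.array([0.5, 0.0, 0.5, 0.75])    # reconstruction (illustrative values)

# Squared errors are [0.25, 0, 0, 0.25]; their mean over n = 4 entries is 0.125
mse = np.mean((x - x_hat) ** 2)
print(mse)  # 0.125
```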

#### Contractive Loss

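The contractive loss is the squared Frobenius norm of the Jacobian matrix of the encoder's activations with respect to the input:

$$
\text{Contractive Loss} = \|J\|_F^2 = \sum_{i,j} J_{ij}^2 = \text{Tr}(J^T J)
$$

where $J$ is the Jacobian matrix of the encoder's activations. For the single-layer encoder defined above, each entry has a closed form:

$$
J_{ij} = \frac{\partial h_i}{\partial x_j} = \sigma'\big((Wx + b)_i\big) \, W_{ij}
$$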
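
The sketch below computes the Jacobian of a sigmoid encoder analytically, checks it against central finite differences, and confirms that the sum of squared entries equals $\text{Tr}(J^T J)$. The sigmoid activation and the helper names `encoder_jacobian` and `contractive_loss` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(x, W, b):
    """J = diag(sigma'(Wx + b)) W, with sigma'(a) = h * (1 - h) for a sigmoid."""
    h = sigmoid(W @ x + b)
    return (h * (1.0 - h))[:, None] * W

def contractive_loss(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian."""
    return np.sum(encoder_jacobian(x, W, b) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

J = encoder_jacobian(x, W, b)
penalty = contractive_loss(x, W, b)
trace_form = np.trace(J.T @ J)

# Central finite differences: perturb one input coordinate at a time
eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(x.size):
    step = np.zeros_like(x)
    step[j] = eps
    J_fd[:, j] = (sigmoid(W @ (x + step) + b) - sigmoid(W @ (x - step) + b)) / (2 * eps)

print(np.max(np.abs(J - J_fd)), penalty, trace_form)
```

For deeper encoders no such closed form exists, and the Jacobian is usually obtained through automatic differentiation instead.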

## Training the Contractive Autoencoder

Training a CAE involves minimizing the total loss function, which includes both the reconstruction loss and the contractive loss. This is typically done using gradient descent.

### Derivatives

Let's derive the gradients for the encoder weights $W$.

#### Gradient of the Reconstruction Loss

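Write $L_{\text{rec}}$ for the reconstruction loss, $a = Wx + b$ and $a' = W'h + b'$ for the pre-activations (so $h = \sigma(a)$ and $\hat{x} = \sigma(a')$), and $\odot$ for the element-wise product. By the chain rule, the gradient of the reconstruction loss with respect to the encoder weights is:

$$
\frac{\partial L_{\text{rec}}}{\partial W} = \frac{\partial L_{\text{rec}}}{\partial \hat{x}} \cdot \frac{\partial \hat{x}}{\partial h} \cdot \frac{\partial h}{\partial W}
$$

where:

$$
\frac{\partial L_{\text{rec}}}{\partial \hat{x}} = -\frac{2}{n} (x - \hat{x})
$$

$$
\frac{\partial \hat{x}}{\partial h} = \text{diag}(\sigma'(a')) \, W'
$$

$$
\frac{\partial h_i}{\partial W_{ij}} = \sigma'(a_i) \, x_j
$$

So, defining the backpropagated error signals $\delta' = -\frac{2}{n} (x - \hat{x}) \odot \sigma'(a')$ and $\delta = (W'^T \delta') \odot \sigma'(a)$:

$$
\frac{\partial L_{\text{rec}}}{\partial W} = \delta \, x^T
$$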
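
Hand-derived gradients are easy to get wrong, so it helps to compare them with finite differences. The sketch below backpropagates the MSE through a sigmoid decoder and a sigmoid encoder, including the activation-derivative factors, and compares the result with a numerical gradient. The sigmoid activations and all function and variable names are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_loss(x, W, b, W_dec, b_dec):
    h = sigmoid(W @ x + b)
    x_hat = sigmoid(W_dec @ h + b_dec)
    return np.mean((x - x_hat) ** 2)

def reconstruction_grad_W(x, W, b, W_dec, b_dec):
    """Gradient of the MSE with respect to the encoder weights W."""
    h = sigmoid(W @ x + b)
    x_hat = sigmoid(W_dec @ h + b_dec)
    # Error at the decoder pre-activation: dL/dx_hat times sigma'(a')
    delta_out = (2.0 / x.size) * (x_hat - x) * x_hat * (1.0 - x_hat)
    # Error at the encoder pre-activation: propagate through W', times sigma'(a)
    delta_hid = (W_dec.T @ delta_out) * h * (1.0 - h)
    return np.outer(delta_hid, x)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
b = rng.normal(size=3)
W_dec = rng.normal(scale=0.5, size=(5, 3))
b_dec = rng.normal(size=5)
x = rng.uniform(size=5)

grad = reconstruction_grad_W(x, W, b, W_dec, b_dec)

# Central finite differences, one weight at a time
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        grad_fd[i, j] = (reconstruction_loss(x, W_plus, b, W_dec, b_dec)
                         - reconstruction_loss(x, W_minus, b, W_dec, b_dec)) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))
```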

#### Gradient of the Contractive Loss

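The gradient of the contractive loss with respect to the encoder weights is:

$$
\frac{\partial (\text{Contractive Loss})}{\partial W} = \frac{\partial (\text{Tr}(J^T J))}{\partial W}
$$

Since the encoder is $h = \sigma(a)$ with pre-activation $a = Wx + b$:

$$
J = \frac{\partial h}{\partial x} = \text{diag}(\sigma'(a)) \, W
$$

the penalty separates over the hidden units:

$$
\text{Tr}(J^T J) = \sum_{i} \sigma'(a_i)^2 \sum_{j} W_{ij}^2
$$

$W$ enters this expression both directly and through $a$, so the product rule gives two terms:

$$
\frac{\partial (\text{Tr}(J^T J))}{\partial W_{ij}} = 2 \, \sigma'(a_i)^2 \, W_{ij} + 2 \, \sigma'(a_i) \, \sigma''(a_i) \left( \sum_{k} W_{ik}^2 \right) x_j
$$

Thus, in matrix form, with $r_i = \sum_k W_{ik}^2$ and $\odot$ denoting the element-wise product:

$$
\frac{\partial (\text{Contractive Loss})}{\partial W} = 2 \, \text{diag}(\sigma'(a)^2) \, W + 2 \left( \sigma'(a) \odot \sigma''(a) \odot r \right) x^T
$$

For the sigmoid, $\sigma'(a_i) = h_i (1 - h_i)$ and $\sigma''(a_i) = h_i (1 - h_i)(1 - 2 h_i)$.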
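
The following sketch evaluates the gradient of the contractive penalty for a sigmoid encoder and checks it numerically. It uses the closed form $\sum_i \sigma'(a_i)^2 \sum_j W_{ij}^2$ of the penalty, where $a = Wx + b$, and accounts for the fact that $W$ affects the penalty both directly and through $a$. The sigmoid activation and the names `contractive_loss` and `contractive_grad_W` are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_loss(x, W, b):
    """||J||_F^2 for a sigmoid encoder, using J = diag(h * (1 - h)) W."""
    h = sigmoid(W @ x + b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def contractive_grad_W(x, W, b):
    h = sigmoid(W @ x + b)
    s1 = h * (1.0 - h)                  # sigma'(a)
    s2 = s1 * (1.0 - 2.0 * h)           # sigma''(a)
    r = np.sum(W ** 2, axis=1)          # squared row norms of W
    direct = 2.0 * (s1 ** 2)[:, None] * W            # W appears explicitly in J
    through_a = 2.0 * np.outer(s1 * s2 * r, x)       # W also changes sigma'(a)
    return direct + through_a

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

grad = contractive_grad_W(x, W, b)

# Central finite differences, one weight at a time
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        grad_fd[i, j] = (contractive_loss(x, W_plus, b)
                         - contractive_loss(x, W_minus, b)) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))
```

In practice an automatic differentiation framework would compute this gradient; the manual version is shown only to make the two contributions visible.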

### Gradient Descent Update

The weights and biases are updated using the gradients:

$$
W \leftarrow W - \eta \left(\frac{\partial \text{Reconstruction Loss}}{\partial W} + \lambda \frac{\partial (\text{Contractive Loss})}{\partial W}\right)
$$

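$$
b \leftarrow b - \eta \left(\frac{\partial \text{Reconstruction Loss}}{\partial b} + \lambda \frac{\partial (\text{Contractive Loss})}{\partial b}\right)
$$

where $\eta$ is the learning rate. The contractive term appears in the bias update as well, because $J$ depends on $b$ through $\sigma'(Wx + b)$. The decoder parameters $W'$ and $b'$ do not enter $J$, so they are updated with the reconstruction gradient alone.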
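
Putting the pieces together, the sketch below runs full-batch gradient descent on a small synthetic dataset. It is an illustration under several assumptions rather than a reference implementation: sigmoid activations in both layers, losses averaged over the batch, made-up data, and arbitrary values for `eta` and `lam`. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, W, b, W_dec, b_dec, lam):
    """Batch-averaged CAE loss and its gradients for a sigmoid encoder/decoder."""
    m, n = X.shape
    H = sigmoid(X @ W.T + b)                  # hidden codes, shape (m, d_hidden)
    X_hat = sigmoid(H @ W_dec.T + b_dec)      # reconstructions, shape (m, n)
    s1 = H * (1.0 - H)                        # sigma'(a)
    s2 = s1 * (1.0 - 2.0 * H)                 # sigma''(a)
    r = np.sum(W ** 2, axis=1)                # squared row norms of W

    rec = np.mean((X - X_hat) ** 2)
    con = np.mean(np.sum(s1 ** 2 * r, axis=1))
    loss = rec + lam * con

    # Reconstruction term: standard backpropagation
    d_out = (2.0 / (m * n)) * (X_hat - X) * X_hat * (1.0 - X_hat)
    d_hid = (d_out @ W_dec) * s1
    # Contractive term: derivative with respect to the encoder pre-activation
    c_hid = (2.0 / m) * s1 * s2 * r
    con_direct = (2.0 / m) * np.sum(s1 ** 2, axis=0)[:, None] * W

    grads = {
        "W": d_hid.T @ X + lam * (con_direct + c_hid.T @ X),
        "b": d_hid.sum(axis=0) + lam * c_hid.sum(axis=0),
        "W_dec": d_out.T @ H,                 # decoder sees only reconstruction
        "b_dec": d_out.sum(axis=0),
    }
    return loss, grads

rng = np.random.default_rng(0)
# Synthetic inputs in (0, 1) generated from a 2-D latent factor (made-up data)
X = sigmoid(rng.normal(size=(32, 2)) @ rng.normal(size=(2, 8)))

params = {
    "W": rng.normal(scale=0.1, size=(3, 8)),
    "b": np.zeros(3),
    "W_dec": rng.normal(scale=0.1, size=(8, 3)),
    "b_dec": np.zeros(8),
}
eta, lam = 0.5, 0.1                           # illustrative values, not tuned

losses = []
for step in range(200):
    loss, grads = loss_and_grads(X, **params, lam=lam)
    losses.append(loss)
    for name in params:
        params[name] -= eta * grads[name]     # W <- W - eta * dL/dW, etc.

print(losses[0], losses[-1])
```

Real training would typically use mini-batches and an automatic differentiation library; the explicit gradients here simply mirror the update rules.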

## Advantages and Drawbacks

### Advantages
- **Robust Feature Learning**: The contractive penalty pushes the encoder toward features that change little under small perturbations of the input.
- **Better Generalization**: Smooth, stable features can help the model generalize to data it has not seen.
- **Reduced Overfitting**: The regularization term discourages encoder mappings that are overly sensitive to individual training inputs.

### Drawbacks
- **Computational Cost**: Computing the Jacobian penalty and its gradient adds work on top of the reconstruction loss, which slows training. The cost grows for deeper encoders, where the Jacobian has no simple closed form.
- **Hyperparameter Tuning**: The regularization strength $\lambda$ must be tuned: too small and the penalty has little effect, too large and reconstruction quality suffers.


