### Lab: Carbon Capture Simulations with VAEs

#### Problem Statement
Efforts to reduce carbon emissions and capture carbon dioxide from the environment rely increasingly on large tabular datasets, which are hard to interpret directly because of their complexity. Variational Autoencoders (VAEs) offer a way to model such complex, high-dimensional data distributions by learning a compact latent representation. In this lab you build a VAE that models a carbon capture dataset and generates synthetic data from it.


#### Objective
The objective of this lab is to:
- Build and understand a Variational Autoencoder (VAE) architecture in TensorFlow, tailored for tabular data.
- Apply the VAE to a carbon capture dataset to capture key data patterns.
- Train the VAE model and evaluate its ability to generate new synthetic data samples.
- Learn the principles of reconstruction and latent space regularization.


#### System Requirements

**To train the model on a local system**

**Hardware:**
- GPU (Recommended for faster training, e.g., Nvidia CUDA-compatible GPUs)
- RAM: 8 GB or more
- Storage: At least 5 GB of free space for datasets and model storage

**Software:**
- Python 3.x
- TensorFlow 2.x
- Pandas, NumPy
- Matplotlib (Optional for visualization)
- Jupyter Notebook or any Python IDE
- GPU drivers and CUDA (if available)

Google Colab is recommended, since it requires no local environment setup.

#### Dataset
The dataset used in this lab is carbon_capture_dataset.csv, the file loaded in the code below.


#### Code Explanation

**Import Libraries** <br>
The code begins by importing essential libraries required for VAE model creation and data handling.<br>
**tensorflow** is used to define and train the VAE model.<br>
**numpy** assists in numerical data operations.<br>
**pandas** is used for handling the tabular data format.<br>

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

**GPU Configuration**

In [None]:
# Optional: useful when an Nvidia GPU is available locally or on a cloud platform such as Kaggle.
# Check for a GPU and let TensorFlow allocate its memory as needed rather than all at once.
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    print("Using GPU")
else:
    print("No GPU found, using CPU")

**Load Dataset**<br>
Load the carbon capture data from a CSV file. The code assumes every column is a numeric feature and every row is one observation.

In [None]:
df = pd.read_csv("carbon_capture_dataset.csv")
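# Cast to float32, the default dtype of Keras layers (assumes every column is numeric)
data = df.values.astype("float32")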
num_features = data.shape[1]
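
The reconstruction loss used later is a mean squared error, so features on very different scales can dominate training. The lab itself trains on the raw values. As an optional step, the sketch below standardizes each column and shows how to map values back to the original units. The array `raw` and its numbers are hypothetical stand-ins for `df.values`, not values from the dataset.

```python
import numpy as np

# Hypothetical stand-in for df.values: 5 rows, 3 numeric features on different scales
raw = np.array([
    [400.0, 0.12, 25.0],
    [420.0, 0.15, 30.0],
    [390.0, 0.10, 22.0],
    [450.0, 0.20, 35.0],
    [410.0, 0.13, 27.0],
], dtype="float32")

col_mean = raw.mean(axis=0)
col_std = raw.std(axis=0) + 1e-8  # small constant guards against constant columns

scaled = (raw - col_mean) / col_std      # zero mean, unit variance per column
restored = scaled * col_std + col_mean   # inverse transform back to original units

print("scaled column means:", scaled.mean(axis=0))
print("scaled column stds: ", scaled.std(axis=0))
print("round trip matches: ", np.allclose(restored, raw, atol=1e-3))
```

If the model is trained on `scaled`, anything the Decoder produces has to go through the same inverse transform before it is read as physical quantities.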

**Define the Encoder Model**<br>
The Encoder is the first half of the VAE. It compresses the input into a latent representation.

In [None]:
class Encoder(tf.keras.Model):
    def __init__(self, latent_dim):
        super(Encoder, self).__init__()
        self.dense1 = layers.Dense(64, activation='relu')
        self.dense2 = layers.Dense(32, activation='relu')
        self.dense_mean = layers.Dense(latent_dim) # Mean of latent distribution
        self.dense_log_var = layers.Dense(latent_dim) # Log variance of latent distribution

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        mean = self.dense_mean(x)
        log_var = self.dense_log_var(x)
        return mean, log_var


**Layers:**<br>
- Dense layers (dense1 and dense2) pass the input through 64 and then 32 ReLU units, narrowing the representation before the latent layer.
- dense_mean and dense_log_var output the mean and log variance of the latent distribution for each input. Both are needed to sample latent vectors during training.
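
To make the shapes concrete, here is a NumPy-only sketch of the same forward pass with randomly initialised weights. It is not the Keras model; the helper `dense` and the sizes `num_features = 8` and `batch = 4` are assumptions chosen for illustration. It also shows why the second head predicts a log variance: exponentiating it always gives a positive standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, latent_dim, batch = 8, 2, 4  # illustrative sizes, not taken from the dataset

def dense(x, units, rng, relu=False):
    """Randomly initialised dense layer: x @ W + b, optionally followed by ReLU."""
    w = rng.normal(scale=0.1, size=(x.shape[1], units))
    b = np.zeros(units)
    out = x @ w + b
    return np.maximum(out, 0.0) if relu else out

x = rng.normal(size=(batch, num_features))
h = dense(x, 64, rng, relu=True)   # plays the role of dense1
h = dense(h, 32, rng, relu=True)   # plays the role of dense2
mean = dense(h, latent_dim, rng)      # plays the role of dense_mean
log_var = dense(h, latent_dim, rng)   # plays the role of dense_log_var

std = np.exp(0.5 * log_var)  # positive for any real-valued log_var

print(mean.shape, log_var.shape)
print(bool((std > 0).all()))
```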

**Define the Decoder Model**<br>
The Decoder reconstructs data from the compressed latent representation produced by the Encoder.

In [None]:
class Decoder(tf.keras.Model):
    def __init__(self, latent_dim):
        super(Decoder, self).__init__()
        self.dense1 = layers.Dense(32, activation='relu')
        self.dense2 = layers.Dense(64, activation='relu')
        self.output_layer = layers.Dense(num_features) # Reconstruct original data dimensions

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.output_layer(x)  

**Layers:**<br>
- Dense layers (dense1 and dense2) decode the latent representation gradually back into the original feature space.
- output_layer matches the original data dimensions for reconstruction.
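
As a rough sense of model size, the layer widths in the Encoder and Decoder classes determine the number of trainable parameters. The sketch below does that arithmetic for a dense layer with a bias term. The value `num_features = 8` is an assumption for illustration, since the real column count depends on the CSV file.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out

num_features, latent_dim = 8, 2  # num_features is an assumed value

encoder_params = (
    dense_params(num_features, 64)      # dense1
    + dense_params(64, 32)              # dense2
    + 2 * dense_params(32, latent_dim)  # dense_mean and dense_log_var
)
decoder_params = (
    dense_params(latent_dim, 32)        # dense1
    + dense_params(32, 64)              # dense2
    + dense_params(64, num_features)    # output_layer
)

print(encoder_params, decoder_params, encoder_params + decoder_params)
```

With these assumed sizes the whole model has a few thousand parameters, which is why the lab also trains comfortably on a CPU.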

**Define the VAE Loss Function**<br>
This custom loss function combines Reconstruction Loss (ensuring data similarity) and KL Divergence (latent space regularization).

In [None]:
def vae_loss(original, reconstructed, mean, log_var):
    # Mean Squared Error for tabular data reconstruction loss
    reconstruction_loss = tf.reduce_mean(tf.keras.losses.mse(original, reconstructed))
    # KL divergence for latent space regularization
    kl_loss = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mean) - tf.exp(log_var))
    return reconstruction_loss + kl_loss


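- **Reconstruction Loss**: Mean squared error between the original and reconstructed data, averaged over features and samples.
- **KL Divergence**: Regularizes the latent space by pulling each encoded distribution toward a standard normal (zero mean, unit variance). In this implementation it is averaged over latent dimensions and samples rather than summed over latent dimensions.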
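
The KL term has a closed form for a diagonal Gaussian measured against a standard normal. The NumPy sketch below evaluates it on two hand-picked latent codes to show that it is zero when the encoded distribution already is a standard normal and grows as the mean moves away from zero. The function name `kl_to_standard_normal` is hypothetical. The sketch sums over latent dimensions, which is the textbook form, and then compares that with the all-element mean used in `vae_loss`.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """Per-sample KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mean**2 - np.exp(log_var), axis=1)

mean = np.array([[0.0, 0.0],    # already a standard normal
                 [1.0, -1.0]])  # shifted mean, unit variance
log_var = np.zeros((2, 2))

kl_per_sample = kl_to_standard_normal(mean, log_var)
summed_then_averaged = kl_per_sample.mean()

# Same reduction as the lab's vae_loss: mean over every element
mean_over_all = -0.5 * np.mean(1.0 + log_var - mean**2 - np.exp(log_var))

print(kl_per_sample)
print(summed_then_averaged, mean_over_all)
```

The two reductions differ by a constant factor of `latent_dim`, so the version in `vae_loss` effectively weights the KL term less than the summed form would.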

**Model Initialization and Training Configuration**<br>
Set the latent dimension and initialize the Encoder and Decoder.

In [None]:
# Model setup
latent_dim = 2
encoder = Encoder(latent_dim)
decoder = Decoder(latent_dim)
optimizer = tf.keras.optimizers.Adam()


- **latent_dim**: Sets the dimensionality of the compressed representation.
- **optimizer**: Adam with its default learning rate (0.001), an adaptive gradient-based optimizer that usually works without manual tuning.

**Define the Training Step**<br>
The training step function performs forward and backward passes and updates weights.

In [None]:
@tf.function
def train_step(data_batch):
    with tf.GradientTape() as tape:
        mean, log_var = encoder(data_batch)
        epsilon = tf.random.normal(shape=tf.shape(mean))
        z = mean + tf.exp(log_var * 0.5) * epsilon
        reconstructed = decoder(z)
        loss = vae_loss(data_batch, reconstructed, mean, log_var)
    gradients = tape.gradient(loss, encoder.trainable_variables + decoder.trainable_variables)
    optimizer.apply_gradients(zip(gradients, encoder.trainable_variables + decoder.trainable_variables))
    return loss


- **Gradient Tape**: Tracks operations for automatic differentiation.
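- **Reparameterization Trick**: Writes the latent sample as z = mean + exp(0.5 * log_var) * epsilon, with epsilon drawn from a standard normal. The randomness is isolated in epsilon, so gradients can flow through mean and log_var.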
- **Loss Calculation**: Combines reconstruction and KL losses to guide training.
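
The reparameterization used in `train_step` can be checked numerically outside TensorFlow. The NumPy sketch below draws many samples with the same formula and confirms that they have the requested mean and standard deviation. The specific `mean` and `log_var` values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

mean = np.array([1.5, -0.5])
log_var = np.array([np.log(0.25), np.log(4.0)])  # variances 0.25 and 4.0, so std 0.5 and 2.0

# Noise comes from a fixed standard normal; mean and log_var only shift and scale it
eps = rng.standard_normal(size=(100_000, 2))
z = mean + np.exp(0.5 * log_var) * eps

print("sample mean:", z.mean(axis=0))
print("sample std: ", z.std(axis=0))
```

Because `z` is a deterministic function of `mean`, `log_var`, and `eps`, automatic differentiation can compute gradients with respect to the encoder outputs even though `z` is random.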

**Training Loop**<br>
Iterate over epochs and batches, calculating the loss for each batch and updating the model. The loss printed after each epoch is that of the final batch only.

In [None]:
epochs = 10
batch_size = 32
# Reshuffle the rows every epoch, then group them into batches
dataset = tf.data.Dataset.from_tensor_slices(data).shuffle(len(data)).batch(batch_size)

for epoch in range(epochs):
    print(f"Epoch: {epoch+1}/{epochs}")
    for data_batch in dataset:
        loss = train_step(data_batch)
    print(f"Loss (last batch): {float(loss):.4f}")


- **Epochs**: Number of complete passes through the dataset.
- **Batch Size**: Number of samples used for each weight update. Larger batches give smoother gradient estimates, while smaller batches give more updates per epoch.
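
To see how epochs and batch size interact, the sketch below splits a placeholder array into shuffled batches with plain NumPy and counts the updates per epoch. The helper `iterate_batches` and the 100-row array are assumptions for illustration; the lab itself relies on `tf.data` for batching.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8)).astype("float32")  # placeholder for the real dataset
batch_size = 32

def iterate_batches(array, batch_size, rng):
    """Yield the rows of `array` in random order, batch_size rows at a time."""
    order = rng.permutation(len(array))
    for start in range(0, len(array), batch_size):
        yield array[order[start:start + batch_size]]

batch_sizes = [len(batch) for batch in iterate_batches(data, batch_size, rng)]
steps_per_epoch = math.ceil(len(data) / batch_size)

print(batch_sizes)
print(steps_per_epoch)
```

With 100 rows and a batch size of 32, the last batch holds only 4 rows, so a loss reported from that batch alone is a noisy indicator of progress. Averaging the per-batch losses over the epoch gives a steadier number.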

**Generate New Samples**<br>
Generate synthetic data samples from random latent vectors.

In [None]:
num_samples = 10
random_latent_vectors = tf.random.normal(shape=(num_samples, latent_dim))
generated_samples = decoder(random_latent_vectors)

print("Generated samples:")
print(generated_samples.numpy())


- **Generated Samples**: Random latent vectors sampled from a standard Gaussian are fed into the Decoder to create new synthetic data, representing simulated carbon capture data.
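
One simple way to judge the generated samples is to compare their summary statistics with those of the real data. The sketch below is a minimal version of such a check, assuming both sets are available as NumPy arrays with the same columns. The function `summary_gap` is hypothetical, and the two arrays are random placeholders, not output of the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholders: in the lab these would be `data` and `generated_samples.numpy()`
real = rng.normal(loc=[10.0, 0.5], scale=[2.0, 0.1], size=(1000, 2))
synthetic = rng.normal(loc=[10.2, 0.5], scale=[1.8, 0.1], size=(1000, 2))

def summary_gap(real, synthetic):
    """Absolute gaps in per-column mean and std, plus the largest correlation gap."""
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synthetic, rowvar=False)
    return {
        "mean_gap": np.abs(real.mean(axis=0) - synthetic.mean(axis=0)),
        "std_gap": np.abs(real.std(axis=0) - synthetic.std(axis=0)),
        "corr_gap": np.abs(corr_real - corr_synth).max(),
    }

gaps = summary_gap(real, synthetic)
for name, value in gaps.items():
    print(name, value)
```

Small gaps suggest the VAE has captured the marginal statistics and pairwise correlations. They do not guarantee that individual samples are physically plausible, so domain checks are still needed.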