<a href="https://colab.research.google.com/github/pinatics/datacution/blob/master/vae_german_credit_complete_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VAE Analysis on German Credit Dataset
This project demonstrates the implementation of a Variational Autoencoder (VAE) on the German Credit Dataset using TensorFlow and Keras in a Google Colab environment.

### Project Objectives
- Understand the dataset structure
- Preprocess the dataset
- Build and train a VAE model
- Visualize the results

This notebook is a portfolio project that shows how a VAE can be applied to tabular financial data.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

print('Libraries imported successfully.')

### Step 1: Load the Dataset
The German Credit dataset can be found on the UCI Machine Learning Repository. Let's load the data directly into a pandas DataFrame and display the first few rows to understand its structure.

In [None]:
# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
columns = ['Status', 'Duration', 'Credit_history', 'Purpose', 'Credit_amount', 'Savings',
           'Employment', 'Installment_rate', 'Personal_status', 'Other_debtors', 'Residence_since',
           'Property', 'Age', 'Other_installment_plans', 'Housing', 'Existing_credits', 'Job',
           'Num_dependents', 'Telephone', 'Foreign_worker', 'Target']
df = pd.read_csv(url, delimiter=' ', header=None, names=columns)

# Display the first few rows of the dataset
df.head()

### Step 2: Data Preprocessing
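Before training the VAE, we need to preprocess the data:
- Encode categorical variables with one-hot encoding
- Split the data into training and testing sets
- Standardize the features using statistics computed on the training set only, so no information from the test set leaks into training

In [None]:
# Encode categorical variables using one-hot encoding
df = pd.get_dummies(df, drop_first=True)

# Separate the features from the label and cast to float for Keras
features = df.drop(columns=['Target']).astype('float32')

# Split first so the scaler is fit on training rows only (avoids test-set leakage)
X_train, X_test = train_test_split(features, test_size=0.2, random_state=42)

# Standardize using the training-set mean and standard deviation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Data preprocessing completed.')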
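
The encode, split, and scale steps can be tried in isolation on a small frame. The sketch below is illustrative only: the `toy` DataFrame and its values are stand-ins, not rows taken from the real dataset, and it assumes the scaler should be fit on the training rows alone.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the credit data (illustrative values only).
toy = pd.DataFrame({
    "Duration": [6, 48, 12, 42, 24, 36, 24, 36, 12, 30],
    "Housing": ["A152", "A152", "A152", "A153", "A153",
                "A153", "A152", "A151", "A152", "A152"],
})

# One-hot encode; drop_first=True drops the first level (A151) as the baseline.
encoded = pd.get_dummies(toy, drop_first=True).astype(float)
train, test = train_test_split(encoded, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(train)     # statistics come from training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # reuses the training mean and std

print(encoded.columns.tolist())
print(train_scaled.mean(axis=0).round(6))  # approximately zero per column
```

Because the test rows are transformed with the training statistics, their column means will generally not be exactly zero, which is expected.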

### Step 3: Build the VAE Model
We will create a VAE with three main components:
- An Encoder
- A latent-space sampling function
- A Decoder

In [None]:
# Define the VAE architecture
original_dim = X_train.shape[1]
input_shape = (original_dim, )
latent_dim = 2  # Number of latent space dimensions
intermediate_dim = 64  # Number of neurons in the hidden layer

# Encoder
inputs = Input(shape=input_shape, name='encoder_input')
h = Dense(intermediate_dim, activation='relu')(inputs)
z_mean = Dense(latent_dim, name='z_mean')(h)
z_log_var = Dense(latent_dim, name='z_log_var')(h)

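# Sampling function (reparameterization trick): z = mean + sigma * epsilon
def sampling(args):
    z_mean, z_log_var = args
    # epsilon ~ N(0, I), matching the standard normal prior assumed by the KL term
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim), mean=0., stddev=1.)
    # z_log_var is a log-variance, so the standard deviation is exp(0.5 * z_log_var)
    return z_mean + K.exp(0.5 * z_log_var) * epsilon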

z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])

# Decoder
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim)  # linear output: standardized features are not confined to [0, 1]
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)

# VAE model
vae = Model(inputs, x_decoded_mean)
print('VAE model defined.')
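
The sampling step can be checked outside Keras. The following is a minimal NumPy sketch, assuming the encoder's second output is a log-variance (which is how the KL term of the loss treats it); the function name `reparameterize` is hypothetical and not part of the notebook.

```python
import numpy as np

def reparameterize(z_mean, z_log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(size=z_mean.shape)
    sigma = np.exp(0.5 * z_log_var)  # log-variance -> standard deviation
    return z_mean + sigma * eps

rng = np.random.default_rng(0)
z_mean = np.full((100_000, 2), 1.5)
z_log_var = np.full((100_000, 2), np.log(0.25))  # variance 0.25, std 0.5

z = reparameterize(z_mean, z_log_var, rng)
print(z.mean(axis=0), z.std(axis=0))  # close to 1.5 and 0.5 respectively
```

Writing the draw this way keeps the randomness in `eps`, so gradients can flow through `z_mean` and `z_log_var` during training.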

### Step 4: Define the VAE Loss Function
The loss function for a VAE is composed of two parts:
- Reconstruction loss: measures how well the decoder reconstructs the input data
- KL divergence loss: penalizes the encoder's latent distribution for deviating from a standard normal prior, which keeps the latent space regular

The total VAE loss is the sum of these two components.
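
Written out for a single sample, and assuming a squared-error reconstruction term, a diagonal Gaussian encoder with mean $\mu$ and variance $\sigma^2$, and a standard normal prior, the objective in its standard form is:

$$
\mathcal{L}(x) = \sum_{i=1}^{D} \left(x_i - \hat{x}_i\right)^2 \;-\; \frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$

Here $D$ corresponds to `original_dim`, $d$ to `latent_dim`, $\mu_j$ to `z_mean`, and $\log \sigma_j^2$ to `z_log_var`; the training loss averages this quantity over the batch.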

In [None]:
# Define the VAE loss
def vae_loss(x, x_decoded_mean):
    # Reconstruction loss: mean squared error rescaled to a per-sample sum over features
    reconstruction_loss = tf.reduce_mean(tf.square(x - x_decoded_mean)) * original_dim

    # KL divergence loss: summed over latent dimensions, averaged over the batch
    kl_loss = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))

    # Total loss
    return reconstruction_loss + kl_loss

# Compile the VAE model
vae.add_loss(vae_loss(inputs, x_decoded_mean))
vae.compile(optimizer='adam')
print('VAE model compiled with custom loss function.')
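
As a sanity check on the KL term, its closed form can be evaluated on inputs where the answer is known. This is a sketch in NumPy rather than TensorFlow, and the helper name `kl_to_standard_normal` is hypothetical: a latent distribution that already equals the standard normal should give zero, and shifting each of two latent means to 1 should give 1.0.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), one value per row."""
    return -0.5 * np.sum(
        1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1
    )

z_mean = np.array([[0.0, 0.0], [1.0, 1.0]])
z_log_var = np.zeros((2, 2))  # unit variance in every latent dimension

kl = kl_to_standard_normal(z_mean, z_log_var)
print(kl)
```

The value is never negative, so a negative KL during training usually points to a sign or scaling mistake in the loss.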

### Step 5: Train the VAE Model
We will train the model using the training data for a specified number of epochs. The loss function values will help us evaluate how well the model is learning.

In [None]:
# Train the VAE model
history = vae.fit(X_train, X_train, epochs=50, batch_size=32, validation_data=(X_test, X_test))
print('Training completed.')

### Step 6: Visualize Training Loss
We'll plot the training and validation loss over the epochs to see how well the model converges.

In [None]:
# Plot training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss Over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

### Step 7: Visualize Latent Space
By projecting the data into the latent space, we can gain insights into how the VAE represents the data in a lower-dimensional form.

In [None]:
# Define an encoder model to project data into the latent space
encoder = Model(inputs, z_mean)
X_train_encoded = encoder.predict(X_train)

# Visualize the latent space
plt.figure(figsize=(10, 6))
plt.scatter(X_train_encoded[:, 0], X_train_encoded[:, 1], alpha=0.5)
plt.title('Latent Space Visualization')
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')
plt.show()
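
The scatter plot above does not show which points are good or bad credit risks. One way to relate the latent coordinates to the label is to compare per-class centroids. The sketch below uses synthetic stand-ins (`latent` and `target` are randomly generated, not outputs of this notebook) and assumes the UCI coding of `Target` as 1 = good and 2 = bad.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the encoded points and their labels.
latent = rng.normal(size=(200, 2))
target = rng.choice([1, 2], size=200, p=[0.7, 0.3])

for label in np.unique(target):
    mask = target == label
    centroid = latent[mask].mean(axis=0)
    print(f"Target={label}: n={mask.sum()}, centroid={centroid.round(3)}")
```

To apply this to the real data, the label column would have to be split together with the features so that rows stay aligned; `train_test_split` accepts several arrays at once and splits them consistently.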

### Step 8: Generate New Samples
Finally, we can use the decoder part of the VAE to generate new synthetic samples from the latent space.

In [None]:
# Define a decoder model
decoder_input = Input(shape=(latent_dim,))
h_decoded = decoder_h(decoder_input)
x_decoded_mean = decoder_mean(h_decoded)
generator = Model(decoder_input, x_decoded_mean)

# Generate new samples by sampling from the latent space
new_samples = generator.predict(np.random.normal(size=(10, latent_dim)))
new_samples_rescaled = scaler.inverse_transform(new_samples)
print('Generated new samples:', new_samples_rescaled)
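
After inverse scaling, the one-hot columns of a generated row hold continuous scores rather than clean 0/1 values. A possible way to turn them back into category codes is sketched below; the helper `decode_dummies`, the 0.5 threshold, and the example scores are all assumptions for illustration. Because `drop_first=True` removes one level per variable, the sketch falls back to that baseline level when no remaining dummy scores highly.

```python
import pandas as pd

def decode_dummies(row, prefix, baseline, threshold=0.5):
    """Map continuous dummy scores for one variable back to a category code.

    Picks the highest-scoring dummy column for `prefix` and falls back to
    `baseline` (the level removed by drop_first=True) if that score is
    below `threshold`.
    """
    cols = [c for c in row.index if c.startswith(prefix + "_")]
    best = row[cols].astype(float).idxmax()
    if row[best] < threshold:
        return baseline
    return best[len(prefix) + 1:]

# Hypothetical generated scores for the Housing dummies (A151 is the dropped level).
generated = pd.DataFrame(
    {"Housing_A152": [0.91, 0.10], "Housing_A153": [0.05, 0.20]}
)

labels = generated.apply(decode_dummies, axis=1, prefix="Housing", baseline="A151")
print(labels.tolist())
```

The threshold is a design choice: a lower value assigns fewer rows to the baseline level, so it is worth checking the resulting category frequencies against the original data.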

## Conclusion
In this project, we implemented a Variational Autoencoder (VAE) on the German Credit Dataset, trained it on one-hot-encoded and standardized features, inspected the two-dimensional latent space, and used the decoder to generate new samples. The same workflow can serve as a starting point for exploring other tabular datasets and for generating synthetic data.