<a href="https://colab.research.google.com/github/pratikagithub/DS-Case-Studies/blob/main/Synthetic_Data_Generation_with_Generative_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Synthetic Data Generation with Generative AI**

Synthetic data is artificially generated data that mimics real-world data. It is created by algorithms, models, or simulations rather than being collected from actual events or real-world scenarios.

To get started with synthetic data generation, we need a dataset to train a Generative Adversarial Network (GAN). The GAN learns to generate new samples that resemble the original data and preserve the relationships between its features.

I found an ideal dataset for this task: it contains daily records of app usage (usage time, notifications received, and times opened). Our goal is to generate synthetic data that mimics the original dataset, preserving its statistical properties while protecting the privacy of users’ actual usage behaviour.
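
One way to make "same statistical properties" concrete is to compare per-column means, standard deviations, and the correlation matrix of the real and synthetic tables. The sketch below is only an illustration: `compare_statistics` is a hypothetical helper, not part of this notebook, and the two arrays are random stand-ins rather than the screentime data.

```python
import numpy as np

def compare_statistics(real, synthetic):
    """Return per-column mean and std gaps, plus the largest gap between correlation matrices."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0))
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    return mean_gap, std_gap, corr_gap

# Illustrative stand-in data only (3 numeric columns, like the features used later)
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
synthetic = real + rng.normal(scale=0.01, size=real.shape)

mean_gap, std_gap, corr_gap = compare_statistics(real, synthetic)
print("mean gap per column:", mean_gap)
print("std gap per column:", std_gap)
print("largest correlation gap:", corr_gap)
```

Small gaps suggest the synthetic table tracks the real one on these summary statistics; they do not by themselves establish any privacy guarantee.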

Now, let’s get started with synthetic data generation using Generative AI by importing the necessary Python libraries:

In [None]:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import MinMaxScaler

from google.colab import files
uploaded = files.upload()

Saving screentime_analysis.csv to screentime_analysis.csv


In [None]:
data = pd.read_csv('screentime_analysis.csv')
data.head()

Unnamed: 0,Date,App,Usage (minutes),Notifications,Times Opened
0,2024-08-07,Instagram,81,24,57
1,2024-08-08,Instagram,90,30,53
2,2024-08-26,Instagram,112,33,17
3,2024-08-22,Instagram,82,11,38
4,2024-08-12,Instagram,59,47,16


The dataset contains the following columns:

Date: The date of the screentime data.

Usage (minutes): Total time spent in the app, in minutes.

Notifications: The number of notifications received.

Times Opened: The number of times the app was opened.

App: The name of the app.

To create a Generative AI model using GANs for generating synthetic data, we need to:

Drop unnecessary columns: We will not generate the Date or App fields, as they are specific identifiers. Instead, we’ll focus on Usage (minutes), Notifications, and Times Opened. If you want to include the App column, convert its values to numerical form first (for example, by encoding each app name as a number).
Normalize the data: GANs perform better with normalized data, usually between 0 and 1.
Prepare the dataset for training: Ensure the remaining columns are numeric and ready for the model.
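
If the App column were kept, its names would first have to become numbers. The sketch below shows one possible approach, integer codes followed by a one-hot matrix, using NumPy only. The app names other than Instagram are made up for illustration, and this encoding step is an assumption rather than something the notebook does.

```python
import numpy as np

# Illustrative app names; only "Instagram" appears in the data preview above
apps = np.array(["Instagram", "WhatsApp", "Instagram", "X"])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(apps, return_inverse=True)
codes = codes.ravel()

# One-hot encode: one column per category, a single 1.0 per row
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

One-hot columns already lie in the 0 to 1 range, so they could sit alongside the min-max scaled numeric features. In pandas, `pd.get_dummies` produces a comparable encoding.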

Let’s apply these preprocessing steps to the data:

In [None]:
# drop unnecessary columns
data_gan = data.drop(columns = ['Date', 'App'])

# initialize a MinMaxScaler to normalize the data between 0 and 1
scaler = MinMaxScaler()

# normalize the data
normalized_data = scaler.fit_transform(data_gan)

# convert back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data_gan.columns)
normalized_df.head()

Unnamed: 0,Usage (minutes),Notifications,Times Opened
0,0.677966,0.163265,0.571429
1,0.754237,0.204082,0.530612
2,0.940678,0.22449,0.163265
3,0.686441,0.07483,0.377551
4,0.491525,0.319728,0.153061


The three remaining columns, Usage (minutes), Notifications, and Times Opened, are now scaled to values between 0 and 1. Now, let’s move on to building the GAN model.

**Using GANs to Build a Generative AI Model for Synthetic Data Generation**

Here’s the process to define and train the GAN:

The generator will be trained to produce data similar to the normalized Usage, Notifications, and Times opened columns.
The discriminator will be trained to distinguish between the real and generated data.
Next, we will alternate between training the discriminator and the generator. The discriminator will be trained to classify real vs fake data, and the generator will be trained to fool the discriminator.
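
For reference, this alternating scheme corresponds to the standard GAN minimax objective. The notation here is an addition, not taken from the notebook: $x$ is a real (normalized) row, $z$ a noise vector, $G$ the generator, and $D$ the discriminator.

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
$$

Training the generator with binary cross-entropy against "real" labels, as the training loop in this notebook does, amounts to minimizing $-\log D(G(z))$, a commonly used non-saturating variant of the generator term.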

Let’s start building the GAN. The generator takes a latent noise vector as input and produces a synthetic sample with the same three features as the data. We use LeakyReLU activations for better gradient flow:

In [None]:
latent_dim = 100  # latent space dimension (size of the random noise input)

def build_generator(latent_dim):
  model = Sequential([
      Dense(128, input_dim=latent_dim),
      LeakyReLU(alpha=0.01),
      BatchNormalization(momentum=0.8),
      Dense(256),
      LeakyReLU(alpha=0.01),
      BatchNormalization(momentum=0.8),
      Dense(512),
      LeakyReLU(alpha=0.01),
      BatchNormalization(momentum=0.8),
      Dense(3, activation='sigmoid')
  ])
  return model

# create the generator
generator = build_generator(latent_dim)
generator.summary()


Here’s an example of generating data using the generator network:

In [None]:
# generate random noise for 1000 samples
noise = np.random.normal(0, 1, (1000, latent_dim))

# generate synthetic data using the generator
generated_data = generator.predict(noise)

# display the first five generated rows
generated_data[:5]

32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


array([[0.5140989 , 0.60596377, 0.50901747],
       [0.5432038 , 0.50396097, 0.42585033],
       [0.45073977, 0.6120297 , 0.44924968],
       [0.40830863, 0.6460937 , 0.45432606],
       [0.57500684, 0.513053  , 0.49516222]], dtype=float32)
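
The generator’s sigmoid output lies between 0 and 1, so generated rows have to be mapped back to the original units before they can be read as minutes or counts. The sketch below works through the min-max arithmetic by hand on the five rows shown in the data preview. Note that the minimum and maximum here come from those five rows only, not from the full dataset. In the notebook itself, `scaler.inverse_transform(generated_data)` should perform the same mapping using the fitted scaler.

```python
import numpy as np

# First five rows from the data preview:
# Usage (minutes), Notifications, Times Opened
original = np.array([
    [81, 24, 57],
    [90, 30, 53],
    [112, 33, 17],
    [82, 11, 38],
    [59, 47, 16],
], dtype=float)

# Per-column range (from these five rows only, for illustration)
data_min = original.min(axis=0)
data_max = original.max(axis=0)

# Forward min-max scaling to [0, 1]
scaled = (original - data_min) / (data_max - data_min)

# Inverse scaling back to the original units
restored = scaled * (data_max - data_min) + data_min

# Minutes and counts are whole numbers, so round after inverting
restored_counts = np.rint(restored).astype(int)
print(restored_counts)
```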

Now, the discriminator will take a real or synthetic data sample and classify it as real or fake:

In [None]:
def build_discriminator():
  model = Sequential([
      Dense(512, input_shape=(3,)),
      LeakyReLU(alpha=0.01),
      Dense(256),
      LeakyReLU(alpha=0.01),
      Dense(128),
      LeakyReLU(alpha=0.01),
      Dense(1, activation='sigmoid')
  ])
  model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
  return model

# create the discriminator
discriminator = build_discriminator()
discriminator.summary()

Next, we will freeze the discriminator’s weights when training the generator to ensure only the generator is updated during those training steps:

In [None]:
def build_gan(generator, discriminator):
  # freeze the discriminator's weights while training the generator
  discriminator.trainable = False

  model = Sequential([generator, discriminator])
  model.compile(loss='binary_crossentropy', optimizer=Adam())
  return model

# create the GAN
gan = build_gan(generator, discriminator)
gan.summary()
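
Both the discriminator and the combined model are compiled with binary cross-entropy, and only the labels differ between the two training steps. The NumPy sketch below uses a hand-written `binary_cross_entropy` helper (an illustration, not the Keras implementation) and made-up discriminator scores to show how the same scores give a low loss for the discriminator and a high loss for the generator.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

# Made-up discriminator scores for three generated samples (low = "looks fake")
d_on_fake = np.array([0.1, 0.2, 0.3])

# Discriminator step: fake samples are labeled 0, so low scores give a low loss
d_loss_fake = binary_cross_entropy(np.zeros(3), d_on_fake)

# Generator step: the same samples are labeled 1, so low scores give a high loss
g_loss = binary_cross_entropy(np.ones(3), d_on_fake)

print(f"discriminator loss on fakes: {d_loss_fake:.4f}")
print(f"generator loss: {g_loss:.4f}")
```

The generator can only lower its loss by producing samples that the discriminator scores closer to 1.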

Now, we will train the GAN using the following steps:

1. Generate random noise.

2. Use the generator to create fake data.

3. Train the discriminator on both real and fake data.

4. Train the generator via the GAN to fool the discriminator.

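The next cells rebuild the generator, discriminator, and combined GAN for 28×28 MNIST images. Note that they overwrite the generator, discriminator, gan, and normalized_data variables defined above for the screentime data.

In [None]: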
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape

latent_dim = 100  # Size of the random noise vector

generator = Sequential([
    Dense(128, activation='relu', input_dim=latent_dim),
    Dense(256, activation='relu'),
    Dense(784, activation='sigmoid'),  # Output size depends on the data; for MNIST, use 28x28=784
    Reshape((28, 28, 1))  # Reshape to match image format
])



In [None]:
from tensorflow.keras.layers import Flatten, LeakyReLU
from tensorflow.keras.optimizers import Adam

discriminator = Sequential([
    Flatten(input_shape=(28, 28, 1)),  # Match generator output shape
    Dense(256),
    LeakyReLU(alpha=0.2),
    Dense(128),
    LeakyReLU(alpha=0.2),
    Dense(1, activation='sigmoid')  # Output: probability (real or fake)
])

discriminator.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])



In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

# Freeze the discriminator during generator training
discriminator.trainable = False

gan_input = Input(shape=(latent_dim,))
gan_output = discriminator(generator(gan_input))
gan = Model(gan_input, gan_output)
gan.compile(optimizer=Adam(), loss='binary_crossentropy')


In [None]:
# Example for MNIST data:
from tensorflow.keras.datasets import mnist

# Load MNIST data and preprocess
(x_train, _), (_, _) = mnist.load_data()
x_train = (x_train / 255.0).reshape(-1, 28, 28, 1)  # Normalize and reshape
normalized_data = x_train


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[1m11490434/11490434[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


In [None]:
def train_gan(gan, generator, discriminator, data, epochs=10000, batch_size=128, latent_dim=100):
    for epoch in range(epochs):
        # Select a random batch of real data
        idx = np.random.randint(0, data.shape[0], batch_size)
        real_data = data[idx]

        # Generate a batch of fake data
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_data = generator.predict(noise, verbose=0)  # verbose=0 suppresses the per-call progress bar

        # Labels for real and fake data
        real_labels = np.ones((batch_size, 1))  # Real data has label 1
        fake_labels = np.zeros((batch_size, 1))  # Fake data has label 0

        # Train the discriminator
        d_loss_real = discriminator.train_on_batch(real_data, real_labels)
        d_loss_fake = discriminator.train_on_batch(fake_data, fake_labels)

        # Train the generator via the GAN
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        valid_labels = np.ones((batch_size, 1))
        g_loss = gan.train_on_batch(noise, valid_labels)

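        # Print progress every 1000 iterations; each d_loss_* is [loss, accuracy]
        # because the discriminator is compiled with metrics=['accuracy']
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}: D Loss: {0.5 * np.add(d_loss_real, d_loss_fake)}, G Loss: {g_loss}")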
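
The batch construction inside `train_gan` relies on NumPy integer-array indexing, so `data` should be a NumPy array. With a pandas DataFrame, `data[idx]` would be interpreted as column labels rather than row positions, so something like `normalized_df.to_numpy()` would be needed first. The sketch below reproduces just the sampling and labeling steps on a random stand-in array; no model is involved, and the array is not the notebook’s data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the normalized training array (values in [0, 1], 3 features)
data = rng.random((200, 3))
batch_size, latent_dim = 128, 100

# Sample row positions with replacement, like np.random.randint in train_gan
idx = rng.integers(0, data.shape[0], batch_size)
real_batch = data[idx]

# Noise that would be passed to the generator to produce the fake batch
noise = rng.normal(0, 1, (batch_size, latent_dim))

# Targets for the discriminator: 1 for real rows, 0 for generated rows
real_labels = np.ones((batch_size, 1))
fake_labels = np.zeros((batch_size, 1))

print(real_batch.shape, noise.shape, real_labels.shape, fake_labels.shape)
```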
