<a href="https://colab.research.google.com/github/pritamrp/Synthetic-tabular-data/blob/main/Synthetic_Tabular_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Synthetic Data Generation with GAN and Transformer**

In [None]:
# Mount Google Drive and load the CSV file into a DataFrame
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive')

df = pd.read_csv('/content/gdrive/MyDrive/online_gaming_behavior_dataset.csv')

Mounted at /content/gdrive


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40034 entries, 0 to 40033
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PlayerID                   40034 non-null  int64  
 1   Age                        40034 non-null  int64  
 2   Gender                     40034 non-null  object 
 3   Location                   40034 non-null  object 
 4   GameGenre                  40034 non-null  object 
 5   PlayTimeHours              40034 non-null  float64
 6   InGamePurchases            40034 non-null  int64  
 7   GameDifficulty             40034 non-null  object 
 8   SessionsPerWeek            40034 non-null  int64  
 9   AvgSessionDurationMinutes  40034 non-null  int64  
 10  PlayerLevel                40034 non-null  int64  
 11  AchievementsUnlocked       40034 non-null  int64  
 12  EngagementLevel            40034 non-null  object 
dtypes: float64(1), int64(7), object(5)
memory usag

Original Dataset : https://www.kaggle.com/datasets/rabieelkharoua/predict-online-gaming-behavior-dataset

# Data Dictionary

* PlayerID: Unique identifier for each player.
* Age: Age of the player.
* Gender: Gender of the player.
* Location: Geographic location of the player.
* GameGenre: Genre of the game the player is engaged in.
* InGamePurchases: Indicates whether the player makes in-game purchases (0 = No, 1 = Yes).
* GameDifficulty: Difficulty level of the game.
* SessionsPerWeek: Number of gaming sessions per week.
* AvgSessionDurationMinutes: Average duration of each gaming session in minutes.
* PlayerLevel: Current level of the player in the game.
* AchievementsUnlocked: Number of achievements unlocked by the player.
* EngagementLevel: Categorized engagement level reflecting player retention ('High', 'Medium', 'Low').

In [None]:
df.nunique()

Unnamed: 0,0
PlayerID,40034
Age,35
Gender,2
Location,4
GameGenre,5
PlayTimeHours,40034
InGamePurchases,2
GameDifficulty,3
SessionsPerWeek,20
AvgSessionDurationMinutes,170


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PlayerID,40034.0,29016.5,11556.964675,9000.0,19008.25,29016.5,39024.75,49033.0
Age,40034.0,31.992531,10.043227,15.0,23.0,32.0,41.0,49.0
PlayTimeHours,40034.0,12.024365,6.914638,0.000115,6.067501,12.008002,17.963831,23.999592
InGamePurchases,40034.0,0.200854,0.400644,0.0,0.0,0.0,0.0,1.0
SessionsPerWeek,40034.0,9.471774,5.763667,0.0,4.0,9.0,14.0,19.0
AvgSessionDurationMinutes,40034.0,94.792252,49.011375,10.0,52.0,95.0,137.0,179.0
PlayerLevel,40034.0,49.655568,28.588379,1.0,25.0,49.0,74.0,99.0
AchievementsUnlocked,40034.0,24.526477,14.430726,0.0,12.0,25.0,37.0,49.0


In [None]:
import pandas as pd
import numpy as np

def identify_outliers(df, multiplier=1.5):
    numeric_df = df.select_dtypes(include=['number'])

    Q1 = numeric_df.quantile(0.25)
    Q3 = numeric_df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (multiplier * IQR)
    upper_bound = Q3 + (multiplier * IQR)

    return ((numeric_df < lower_bound) | (numeric_df > upper_bound)).sum()

def count_outliers(df):
    return identify_outliers(df).sum()
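
The two helpers above are defined but not called in the cells shown. As a minimal usage sketch, the IQR rule can be checked on a small made-up frame. The `toy` data is hypothetical, and the function body repeats `identify_outliers` only so that the block runs on its own.

```python
import pandas as pd

def identify_outliers(df, multiplier=1.5):
    """Count values outside [Q1 - m*IQR, Q3 + m*IQR] for each numeric column."""
    numeric_df = df.select_dtypes(include=["number"])
    q1 = numeric_df.quantile(0.25)
    q3 = numeric_df.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return ((numeric_df < lower_bound) | (numeric_df > upper_bound)).sum()

# Hypothetical toy data: one extreme value in "a", none in "b"
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
counts = identify_outliers(toy)
print(counts)
```

For column `a` the quartiles are 2 and 4, so the bounds are -1 and 7 and only the value 100 falls outside them.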


In [None]:
df.select_dtypes(include=['number']).corr()

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
PlayerID,1.0,-0.003044,0.000923,0.002321,-0.005944,-0.001801,-0.001769,0.00319
Age,-0.003044,1.0,0.002462,-0.000186,0.008777,-0.002269,0.001353,-0.0011
PlayTimeHours,0.000923,0.002462,1.0,-0.006067,-0.003655,-0.001925,-0.005152,0.003913
InGamePurchases,0.002321,-0.000186,-0.006067,1.0,0.005132,-0.003059,0.006524,9.8e-05
SessionsPerWeek,-0.005944,0.008777,-0.003655,0.005132,1.0,-0.00062,0.003257,0.003187
AvgSessionDurationMinutes,-0.001801,-0.002269,-0.001925,-0.003059,-0.00062,1.0,0.001368,-0.002227
PlayerLevel,-0.001769,0.001353,-0.005152,0.006524,0.003257,0.001368,1.0,0.006343
AchievementsUnlocked,0.00319,-0.0011,0.003913,9.8e-05,0.003187,-0.002227,0.006343,1.0


In [None]:
from scipy.stats import chisquare

# Assuming 'df' is your DataFrame
df_c = df.select_dtypes(include=['object']).apply(lambda x: pd.factorize(x)[0] + 1)

result = pd.DataFrame([
    chisquare(df_c[col].values)[0] for col in df_c.columns
], index=df_c.columns, columns=['chi_square_statistic'])

print(result)

                 chi_square_statistic
Gender                    6864.155929
Location                 12900.819203
GameGenre                26694.030780
GameDifficulty           10325.598330
EngagementLevel          15587.980963
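
`scipy.stats.chisquare` interprets its first argument as observed frequencies, so a goodness-of-fit test on a categorical column is usually run on the category counts rather than on per-row codes. The following is a minimal sketch under that reading, using a hypothetical toy column; with no `f_exp` given, the expected frequencies are assumed equal across categories.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical toy column standing in for a categorical column such as GameGenre
toy = pd.Series(["Action"] * 30 + ["Sports"] * 30 + ["RPG"] * 40)

observed = toy.value_counts()                     # frequency of each category
statistic, p_value = chisquare(observed.values)   # expected: equal frequencies
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

A small statistic here indicates that the categories are close to evenly represented; the same call could be applied per column of the real data.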


# Dividing the Original Data

In [None]:
from sklearn.utils import shuffle
df = shuffle(df)

In [None]:
half = len(df) // 2
Seed = df[:half]       # first half: used to fit the generators
Hold_out = df[half:]   # second half: kept aside
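
The shuffle above uses no fixed seed, so each run produces a different split. A minimal sketch of a repeatable half-and-half split is shown below on a hypothetical toy frame; `random_state=42` and the names `seed_part` and `hold_out_part` are illustrative assumptions, not values from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the gaming dataset
toy = pd.DataFrame({"PlayerID": range(9000, 9010), "Age": range(20, 30)})

# random_state=42 is an arbitrary choice that makes the shuffle repeatable
shuffled = toy.sample(frac=1, random_state=42)

half = len(shuffled) // 2
seed_part = shuffled.iloc[:half].copy()       # would be used to fit the generators
hold_out_part = shuffled.iloc[half:].copy()   # would be kept aside for evaluation
print(len(seed_part), len(hold_out_part))
```

Fixing the seed would change which rows land in each half, so any numbers computed afterwards would differ from those recorded in this notebook.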

# Synthetic Data with GAN



In [None]:
pip install ctgan



In [None]:
from ctgan import CTGAN

Make sure the column data types are accepted as input by the model

In [None]:
for i in df.columns:
  if df[i].dtype=='object':
    df[i] = df[i].astype('category')
  elif df[i].dtype=='int64':
    df[i] = df[i].astype('float64')

In [None]:
# Import necessary libraries

from ctgan import CTGAN

# Step 1: Identify categorical columns
categorical_columns = Seed.select_dtypes(include=['object', 'category']).columns.tolist()

# Step 2: Initialize the CTGAN model
ctgan = CTGAN()

# Step 3: Fit the CTGAN model to the data
ctgan.fit(Seed, discrete_columns=categorical_columns)

# Step 4: Generate synthetic data
synthetic_data = ctgan.sample(len(Seed))

In [None]:
synthetic_data_GAN = pd.DataFrame(synthetic_data)

In [None]:
synthetic_data_GAN.shape

(20017, 13)

In [None]:
synthetic_data_GAN.to_csv('synthetic_data_GAN.csv')

# Synthetic Data With Transformer

In [None]:
!pip install REaLTabFormer

Collecting REaLTabFormer
  Downloading REaLTabFormer-0.1.7-py3-none-any.whl.metadata (11 kB)
Collecting datasets>=2.6.1 (from REaLTabFormer)
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets>=2.6.1->REaLTabFormer)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.6.1->REaLTabFormer)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.6.1->REaLTabFormer)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets>=2.6.1->REaLTabFormer)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets>=2.6.1->REaLTabFormer)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading REaLTabFormer-0.1.7-py3-none-any.whl 

In [None]:
from realtabformer import REaLTabFormer

AttributeError: module 'pyarrow.lib' has no attribute 'ListViewType'

In [None]:
rtf_model = REaLTabFormer(
    model_type="tabular",
    gradient_accumulation_steps=2,  # Adjusted for faster updates
    logging_steps=500  # Less frequent logging
)

In [None]:
Seed_np = Seed.to_numpy()

In [None]:
rtf_model.fit(Seed)

In [None]:
samples = rtf_model.sample(n_samples=2017)
samples_df = pd.DataFrame(samples, columns=Seed.columns)
samples_df.to_csv('synthetic_data_TF.csv', index=False)

# Evaluating Synthetic Data

Mean, Median, Mode and Standard Deviation

In [None]:
df_GAN = pd.read_csv('/content/gdrive/MyDrive/synthetic_data_GAN.csv')
df_TF = pd.read_csv('/content/gdrive/MyDrive/synthetic_data_TF_new.csv')

In [None]:
df_GAN.drop(columns=['Unnamed: 0'], inplace=True)

Blending Dataset (GAN and Transformer)

In [None]:
df_GAN.shape

(20017, 13)

In [None]:
df_GAN_shuffled = shuffle(df_GAN)
df_TF_shuffled = shuffle(df_TF)
df_blend = pd.concat([df_GAN_shuffled[0:10016], df_TF_shuffled[10016:]], axis=0)
df_blend = shuffle(df_blend)

In [None]:
round(((Seed.describe().T[['mean','std','min','max']]-df_GAN.describe().T[['mean','std','min','max']])/Seed.describe().T[['mean','std','min','max']]),3).T


invalid value encountered in subtract



Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
mean,-0.102,0.039,0.009,-0.082,-0.019,0.025,0.099,0.044
std,0.046,0.048,-0.03,-0.029,0.007,0.007,0.149,-0.013
min,0.066,0.133,7411.689,,inf,0.3,4.0,inf
max,-0.034,-0.02,-0.027,0.0,-0.053,-0.05,-0.01,-0.041


In [None]:
round(((Seed.describe().T[['mean','std','min','max']]-df_TF.describe().T[['mean','std','min','max']])/Seed.describe().T[['mean','std','min','max']]),3).T

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
mean,0.005,-0.008,0.002,0.124,0.011,-0.051,-0.02,-0.027
std,0.007,0.009,-0.006,0.049,0.01,-0.005,0.011,0.011
min,0.0,0.0,1.0,,,0.8,1.0,
max,-0.018,0.0,-0.244,0.0,0.0,-0.112,0.0,0.0


In [None]:
round(((Seed.describe().T[['mean','std','min','max']]-df_blend.describe().T[['mean','std','min','max']])/Seed.describe().T[['mean','std','min','max']]),3).T


invalid value encountered in subtract



Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
mean,-0.048,0.017,0.008,0.005,-0.001,-0.013,0.042,0.012
std,0.02,0.026,-0.016,0.002,0.01,-0.003,0.071,-0.004
min,0.066,0.133,7021.161,,inf,0.7,4.0,inf
max,-0.034,-0.02,-0.24,0.0,-0.053,-0.112,-0.01,-0.041


The synthetic dataset generated with the Transformer has the lowest relative difference from the original dataset, which means it matches the original data most closely.
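
The comparison cells above divide by the Seed statistic, so a Seed value of 0 (for example a minimum of 0) yields `inf` or `NaN` in the output. A minimal sketch of the same relative difference with zero denominators masked is given below; the helper name `relative_difference` and the toy frames are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def relative_difference(reference, synthetic, stats=("mean", "std", "min", "max")):
    """(reference - synthetic) / reference for selected describe() statistics."""
    ref = reference.describe().loc[list(stats)]
    syn = synthetic.describe().loc[list(stats)]
    # A zero reference value has no meaningful relative difference; use NaN
    denom = ref.replace(0, np.nan)
    return ((ref - syn) / denom).round(3)

# Hypothetical toy frames standing in for Seed and a synthetic dataset
reference = pd.DataFrame({"x": [0.0, 2.0, 4.0], "y": [10.0, 20.0, 30.0]})
synthetic = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [11.0, 22.0, 33.0]})
rel = relative_difference(reference, synthetic)
print(rel)
```

Values near 0 indicate close agreement. The sign shows whether the synthetic statistic is above (negative) or below (positive) the reference.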

In [None]:
(Seed.nunique()-df_GAN.nunique())

Unnamed: 0,0
PlayerID,4678
Age,-3
Gender,0
Location,0
GameGenre,0
PlayTimeHours,1
InGamePurchases,0
GameDifficulty,0
SessionsPerWeek,-2
AvgSessionDurationMinutes,-12


In [None]:
df_TF.nunique()

Unnamed: 0,0
PlayerID,15610
Age,35
Gender,2
Location,4
GameGenre,5
PlayTimeHours,18774
InGamePurchases,2
GameDifficulty,3
SessionsPerWeek,20
AvgSessionDurationMinutes,187


In [None]:
Seed.nunique()-df_blend.nunique()

Unnamed: 0,0
PlayerID,4343
Age,-3
Gender,0
Location,0
GameGenre,0
PlayTimeHours,320
InGamePurchases,0
GameDifficulty,0
SessionsPerWeek,-2
AvgSessionDurationMinutes,-21


The synthetic data generated by the GAN is close to the original dataset in terms of unique values.

In [None]:
Seed.isnull().sum()
df_GAN.isnull().sum()
df_TF.isnull().sum()
df_blend.isnull().sum()

Unnamed: 0,0
PlayerID,0
Age,0
Gender,0
Location,0
GameGenre,0
PlayTimeHours,0
InGamePurchases,0
GameDifficulty,0
SessionsPerWeek,0
AvgSessionDurationMinutes,0


None of the datasets have any null values.
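
A notebook cell displays only the value of its last expression, so the cell above shows the null counts of `df_blend` alone. One way to view all datasets side by side is sketched below with hypothetical toy frames; the dictionary `frames` and its contents are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frames standing in for Seed, df_GAN, df_TF and df_blend
frames = {
    "Seed": pd.DataFrame({"Age": [20, 30], "Gender": ["Male", "Female"]}),
    "GAN": pd.DataFrame({"Age": [21, None], "Gender": ["Male", "Male"]}),
}

# One column per dataset, one row per feature
null_counts = pd.DataFrame(
    {name: frame.isnull().sum() for name, frame in frames.items()}
)
print(null_counts)
```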

In [None]:
(Seed.describe().T[['25%','50%','75%']]-df_GAN.describe().T[['25%','50%','75%']]).T

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
25%,-3209.0,1.0,-0.036757,0.0,0.0,4.0,-3.0,2.0
50%,-2772.0,1.0,1.378445,0.0,-1.0,5.0,5.0,2.0
75%,-2612.0,3.0,-0.822477,0.0,0.0,3.0,16.0,1.0


In [None]:
(Seed.describe().T[['25%','50%','75%']]-df_TF.describe().T[['25%','50%','75%']]).T

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
25%,57.0,-1.0,-0.085427,0.0,0.0,-5.0,-1.0,0.0
50%,182.0,0.0,0.340296,0.0,-1.0,-9.0,-2.0,-2.0
75%,384.0,0.0,-0.076309,0.0,0.0,-5.0,-1.0,0.0


In [None]:
(Seed.describe().T[['25%','50%','75%']]-df_blend.describe().T[['25%','50%','75%']]).T

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
25%,-1853.0,0.0,-0.017627,0.0,0.0,0.0,-2.0,1.0
50%,-1129.0,1.0,0.836376,0.0,-1.0,-2.0,3.0,0.0
75%,-1170.0,2.0,-0.477449,0.0,0.0,-2.0,6.0,0.0


The quantiles of the synthetic data generated with the Transformer are closest to those of the original distribution.

# Categorical Variable Distribution

In [None]:
Seed_cat = Seed.select_dtypes(include=['object'])

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots


fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Game Genre Distribution", "Location Distribution",
                                    "Game Difficulty Distribution", "Engagement Level Distribution"),
                    vertical_spacing=0.1)

# Game Genre Distribution
genre_counts = Seed_cat['GameGenre'].value_counts()
fig.add_trace(go.Bar(x=genre_counts.index, y=genre_counts.values, name="Game Genre",
                     marker_color='rgba(58, 71, 80, 0.6)', marker_line_color='rgba(58, 71, 80, 1.0)',
                     marker_line_width=1.5), row=1, col=1)

# Location Distribution
location_counts = Seed_cat['Location'].value_counts()
fig.add_trace(go.Bar(x=location_counts.index, y=location_counts.values, name="Location",
                     marker_color='rgba(246, 78, 139, 0.6)', marker_line_color='rgba(246, 78, 139, 1.0)',
                     marker_line_width=1.5), row=1, col=2)

# Game Difficulty Distribution
difficulty_counts = Seed_cat['GameDifficulty'].value_counts()
fig.add_trace(go.Bar(x=difficulty_counts.index, y=difficulty_counts.values, name="Game Difficulty",
                     marker_color='rgba(6, 147, 227, 0.6)', marker_line_color='rgba(6, 147, 227, 1.0)',
                     marker_line_width=1.5), row=2, col=1)

# Engagement Level Distribution
engagement_counts = Seed_cat['EngagementLevel'].value_counts()
fig.add_trace(go.Bar(x=engagement_counts.index, y=engagement_counts.values, name="Engagement Level",
                     marker_color='rgba(153, 0, 153, 0.6)', marker_line_color='rgba(153, 0, 153, 1.0)',
                     marker_line_width=1.5), row=2, col=2)


fig.update_layout(
    title_text="Original Categorical variable Distribution",
    title_font_size=24,
    showlegend=False,
    plot_bgcolor='rgba(240, 240, 240, 0.8)',
    paper_bgcolor='rgba(240, 240, 240, 0.8)',
    height=800,
    width=1200,
    font=dict(family="Arial", size=12, color="rgb(50, 50, 50)"),
)

# Update axes
fig.update_xaxes(tickangle=45, title_text="", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')
fig.update_yaxes(title_text="Count", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')

# Add annotations
fig.add_annotation(
    text="Data insights",
    xref="paper", yref="paper",
    x=0.5, y=1.05,
    showarrow=False,
    font=dict(size=14, color="rgb(80, 80, 80)")
)

# Show the plot
fig.show()

In [None]:
df_GAN_cat = df_GAN.select_dtypes(include=['object'])

# Create subplots
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Game Genre Distribution", "Location Distribution",
                                    "Game Difficulty Distribution", "Engagement Level Distribution"),
                    vertical_spacing=0.1)

# Game Genre Distribution
genre_counts = df_GAN_cat['GameGenre'].value_counts()
fig.add_trace(go.Bar(x=genre_counts.index, y=genre_counts.values, name="Game Genre",
                     marker_color='rgba(58, 71, 80, 0.6)', marker_line_color='rgba(58, 71, 80, 1.0)',
                     marker_line_width=1.5), row=1, col=1)

# Location Distribution
location_counts = df_GAN_cat['Location'].value_counts()
fig.add_trace(go.Bar(x=location_counts.index, y=location_counts.values, name="Location",
                     marker_color='rgba(246, 78, 139, 0.6)', marker_line_color='rgba(246, 78, 139, 1.0)',
                     marker_line_width=1.5), row=1, col=2)

# Game Difficulty Distribution
difficulty_counts = df_GAN_cat['GameDifficulty'].value_counts()
fig.add_trace(go.Bar(x=difficulty_counts.index, y=difficulty_counts.values, name="Game Difficulty",
                     marker_color='rgba(6, 147, 227, 0.6)', marker_line_color='rgba(6, 147, 227, 1.0)',
                     marker_line_width=1.5), row=2, col=1)

# Engagement Level Distribution
engagement_counts = df_GAN_cat['EngagementLevel'].value_counts()
fig.add_trace(go.Bar(x=engagement_counts.index, y=engagement_counts.values, name="Engagement Level",
                     marker_color='rgba(153, 0, 153, 0.6)', marker_line_color='rgba(153, 0, 153, 1.0)',
                     marker_line_width=1.5), row=2, col=2)

# Update layout for a stunning appearance
fig.update_layout(
    title_text="Synthetic data with GAN Categorical variable Distribution ",
    title_font_size=24,
    showlegend=False,
    plot_bgcolor='rgba(240, 240, 240, 0.8)',
    paper_bgcolor='rgba(240, 240, 240, 0.8)',
    height=800,
    width=1200,
    font=dict(family="Arial", size=12, color="rgb(50, 50, 50)"),
)

# Update axes
fig.update_xaxes(tickangle=45, title_text="", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')
fig.update_yaxes(title_text="Count", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')


fig.add_annotation(
    text="Data insights",
    xref="paper", yref="paper",
    x=0.5, y=1.05,
    showarrow=False,
    font=dict(size=14, color="rgb(80, 80, 80)")
)

# Show the plot
fig.show()

In [None]:
df_TF_cat = df_TF.select_dtypes(include=['object'])

# Create subplots
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Game Genre Distribution", "Location Distribution",
                                    "Game Difficulty Distribution", "Engagement Level Distribution"),
                    vertical_spacing=0.1)

# Game Genre Distribution
genre_counts = df_TF_cat['GameGenre'].value_counts()
fig.add_trace(go.Bar(x=genre_counts.index, y=genre_counts.values, name="Game Genre",
                     marker_color='rgba(58, 71, 80, 0.6)', marker_line_color='rgba(58, 71, 80, 1.0)',
                     marker_line_width=1.5), row=1, col=1)

# Location Distribution
location_counts = df_TF_cat['Location'].value_counts()
fig.add_trace(go.Bar(x=location_counts.index, y=location_counts.values, name="Location",
                     marker_color='rgba(246, 78, 139, 0.6)', marker_line_color='rgba(246, 78, 139, 1.0)',
                     marker_line_width=1.5), row=1, col=2)

# Game Difficulty Distribution
difficulty_counts = df_TF_cat['GameDifficulty'].value_counts()
fig.add_trace(go.Bar(x=difficulty_counts.index, y=difficulty_counts.values, name="Game Difficulty",
                     marker_color='rgba(6, 147, 227, 0.6)', marker_line_color='rgba(6, 147, 227, 1.0)',
                     marker_line_width=1.5), row=2, col=1)

# Engagement Level Distribution
engagement_counts = df_TF_cat['EngagementLevel'].value_counts()
fig.add_trace(go.Bar(x=engagement_counts.index, y=engagement_counts.values, name="Engagement Level",
                     marker_color='rgba(153, 0, 153, 0.6)', marker_line_color='rgba(153, 0, 153, 1.0)',
                     marker_line_width=1.5), row=2, col=2)

# Update layout for a stunning appearance
fig.update_layout(
    title_text="Synthetic data with Transformer Categorical variable Distribution",
    title_font_size=24,
    showlegend=False,
    plot_bgcolor='rgba(240, 240, 240, 0.8)',
    paper_bgcolor='rgba(240, 240, 240, 0.8)',
    height=800,
    width=1200,
    font=dict(family="Arial", size=12, color="rgb(50, 50, 50)"),
)

# Update axes
fig.update_xaxes(tickangle=45, title_text="", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')
fig.update_yaxes(title_text="Count", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')

# Add annotations
fig.add_annotation(
    text="Data insights",
    xref="paper", yref="paper",
    x=0.5, y=1.05,
    showarrow=False,
    font=dict(size=14, color="rgb(80, 80, 80)")
)

# Show the plot
fig.show()

In [None]:
df_blend_cat = df_blend.select_dtypes(include=['object'])

# Create subplots
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Game Genre Distribution", "Location Distribution",
                                    "Game Difficulty Distribution", "Engagement Level Distribution"),
                    vertical_spacing=0.1)

# Game Genre Distribution
genre_counts = df_blend_cat['GameGenre'].value_counts()
fig.add_trace(go.Bar(x=genre_counts.index, y=genre_counts.values, name="Game Genre",
                     marker_color='rgba(58, 71, 80, 0.6)', marker_line_color='rgba(58, 71, 80, 1.0)',
                     marker_line_width=1.5), row=1, col=1)

# Location Distribution
location_counts = df_blend_cat['Location'].value_counts()
fig.add_trace(go.Bar(x=location_counts.index, y=location_counts.values, name="Location",
                     marker_color='rgba(246, 78, 139, 0.6)', marker_line_color='rgba(246, 78, 139, 1.0)',
                     marker_line_width=1.5), row=1, col=2)

# Game Difficulty Distribution
difficulty_counts = df_blend_cat['GameDifficulty'].value_counts()
fig.add_trace(go.Bar(x=difficulty_counts.index, y=difficulty_counts.values, name="Game Difficulty",
                     marker_color='rgba(6, 147, 227, 0.6)', marker_line_color='rgba(6, 147, 227, 1.0)',
                     marker_line_width=1.5), row=2, col=1)

# Engagement Level Distribution
engagement_counts = df_blend_cat['EngagementLevel'].value_counts()
fig.add_trace(go.Bar(x=engagement_counts.index, y=engagement_counts.values, name="Engagement Level",
                     marker_color='rgba(153, 0, 153, 0.6)', marker_line_color='rgba(153, 0, 153, 1.0)',
                     marker_line_width=1.5), row=2, col=2)

# Update layout for a stunning appearance
fig.update_layout(
    title_text="Synthetic data with Blending Categorical variable Distribution",
    title_font_size=24,
    showlegend=False,
    plot_bgcolor='rgba(240, 240, 240, 0.8)',
    paper_bgcolor='rgba(240, 240, 240, 0.8)',
    height=800,
    width=1200,
    font=dict(family="Arial", size=12, color="rgb(50, 50, 50)"),
)

# Update axes
fig.update_xaxes(tickangle=45, title_text="", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')
fig.update_yaxes(title_text="Count", showgrid=True, gridwidth=1, gridcolor='rgba(200, 200, 200, 0.2)')

# Add annotations
fig.add_annotation(
    text="Data insights",
    xref="paper", yref="paper",
    x=0.5, y=1.05,
    showarrow=False,
    font=dict(size=14, color="rgb(80, 80, 80)")
)

# Show the plot
fig.show()

Both df_TF and df_GAN reproduce the distributions of all the categorical columns well, except for GameGenre.
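
The bar charts give a visual comparison. A single number per column can complement them, for instance the total variation distance between category proportions. The sketch below is one possible way to compute it; the helper name `total_variation_distance` and the toy series are assumptions, not part of the original analysis.

```python
import pandas as pd

def total_variation_distance(reference, synthetic):
    """Half the sum of absolute differences between category proportions."""
    p = reference.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    # Align on the union of categories; a missing category has proportion 0
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * (p - q).abs().sum()

# Hypothetical toy columns standing in for Seed["GameDifficulty"] and a synthetic one
reference = pd.Series(["Easy", "Easy", "Medium", "Hard"])
synthetic = pd.Series(["Easy", "Medium", "Medium", "Hard"])
tvd = total_variation_distance(reference, synthetic)
print(tvd)
```

The distance is 0 when the proportions are identical and 1 when the two columns share no categories.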

# Histogram to verify distribution of the synthetic data.

In [None]:
import plotly.express as px
def create_multi_histogram(dataframes, column_name):
    fig = make_subplots(rows=2, cols=2,
                        subplot_titles=(f"{column_name} Distribution - Seed",
                                        f"{column_name} Distribution - GAN",
                                        f"{column_name} Distribution - TF",
                                        f"{column_name} Distribution - Blend"),
                        vertical_spacing=0.1)

    for idx, (name, df) in enumerate(dataframes.items()):
        row = idx // 2 + 1
        col = idx % 2 + 1

        hist = px.histogram(df, x=column_name, nbins=40)
        fig.add_trace(hist.data[0], row=row, col=col)

    fig.update_layout(
        title_text=f"{column_name} Distribution Comparison",
        title_font_size=24,
        showlegend=False,
        height=800,
        width=1200,
    )

    fig.update_xaxes(title_text=column_name, showgrid=True)
    fig.update_yaxes(title_text="Frequency", showgrid=True)

    return fig

dataframes = {
    'Seed': Seed,
    'GAN': df_GAN,
    'TF': df_TF,
    'Blend': df_blend
}

numerical_columns = Seed.select_dtypes(include=[np.number]).columns

for column in numerical_columns:
    fig = create_multi_histogram(dataframes, column)
    fig.show()

df_TF matched the distributions of the numerical columns better than the other synthetic datasets.
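
The histogram comparison can be backed by a numeric check such as the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp`, where a smaller value means the two empirical distributions are closer. The sketch below uses randomly generated stand-in arrays, which are assumptions and not the notebook's data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a reference column and two synthetic candidates
reference = rng.uniform(0, 24, size=2000)    # e.g. a column like PlayTimeHours
close_match = rng.uniform(0, 24, size=2000)  # drawn from the same distribution
poor_match = rng.normal(12, 2, size=2000)    # same centre, different shape

ks_close = ks_2samp(reference, close_match).statistic
ks_poor = ks_2samp(reference, poor_match).statistic
print(f"close match: {ks_close:.3f}, poor match: {ks_poor:.3f}")
```

Running the same call per numerical column for each synthetic dataset against Seed would give a table that can be read alongside the histograms.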

# Utility of the model.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# Datasets to compare: the original Seed split and the three synthetic datasets
dataframes = {'Seed': Seed, 'GAN': df_GAN, 'TF': df_TF, 'Blend': df_blend}

# Function to prepare data and train models
def train_and_evaluate(df, model_name):
    # Prepare the data
    X = df[['Age', 'Gender', 'Location', 'GameGenre', 'PlayTimeHours',
            'InGamePurchases', 'GameDifficulty', 'SessionsPerWeek',
            'AvgSessionDurationMinutes', 'PlayerLevel', 'AchievementsUnlocked']]
    y = df['EngagementLevel']

    # Encode categorical variables
    X = pd.get_dummies(X, columns=['Gender', 'Location', 'GameGenre','GameDifficulty'])

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


    # Train and evaluate models
    models = {
        'Random Forest': RandomForestClassifier(random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42),
        'SVM': SVC(random_state=42)
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = {
            'accuracy': accuracy,
            'report': classification_report(y_test, y_pred)
        }

        # Feature importance (for Random Forest only)
        if name == 'Random Forest':
            importances = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
            feature_importance = pd.DataFrame({
                'feature': X.columns,
                'importance': importances.importances_mean
            }).sort_values('importance', ascending=False)
            results[name]['feature_importance'] = feature_importance

    return results

In [None]:
# Train and evaluate models for each dataset
all_results = {}
for dataset_name, df in dataframes.items():
    all_results[dataset_name] = train_and_evaluate(df, dataset_name)

# Compare model performances
for dataset_name, results in all_results.items():
    print(f"\nResults for {dataset_name} dataset:")
    for model_name, model_results in results.items():
        print(f"  {model_name} Accuracy: {model_results['accuracy']:.4f}")



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



Results for Seed dataset:
  Random Forest Accuracy: 0.8926
  Logistic Regression Accuracy: 0.7752
  SVM Accuracy: 0.8851

Results for GAN dataset:
  Random Forest Accuracy: 0.5994
  Logistic Regression Accuracy: 0.5764
  SVM Accuracy: 0.5889

Results for TF dataset:
  Random Forest Accuracy: 0.9246
  Logistic Regression Accuracy: 0.7800
  SVM Accuracy: 0.9221

Results for Blend dataset:
  Random Forest Accuracy: 0.7355
  Logistic Regression Accuracy: 0.6573
  SVM Accuracy: 0.6961


In [None]:
for dataset_name in all_results.keys():
    rf_importance = all_results[dataset_name]['Random Forest']['feature_importance']

    fig = px.bar(
        rf_importance,
        x='feature',
        y='importance',
        title=f'Feature Importance (Random Forest, {dataset_name} dataset)',
        labels={'feature': 'Features', 'importance': 'Importance'}
    )

    fig.update_layout(
        xaxis={'categoryorder': 'total descending'},
        xaxis_title='Features',
        yaxis_title='Importance',
        title={'x': 0.5},
    )

    fig.show()


The utility check measures how well the synthetic data mimics the original data and retains predictive power for a downstream machine learning task. Based on these results, the Transformer is a far more suitable technique for generating synthetic tabular data.