## DSS Workshop: Synthetic Data Generation With SDV
Data privacy is an increasingly important consideration in any data science use case. To comply with regulation while still innovating and creating value, Privacy Enhancing Techniques (PETs) are worth considering. For a broad overview of PETs, we invite you to follow our sub-community in DS4A and check out our wiki and our past talks. If you are interested in PETs and would like to contribute ideas, talks, questions, or use cases, or introduce us to a cool technique you have seen somewhere, you are most welcome!

Synthetic data generation is one of the hottest PETs, so this workshop introduces it using the Synthetic Data Vault library (https://sdv.dev/). Together, we will explore an example of synthetic data generation on a simple insurance dataset. First, we will load and visualize our dataset. Then, we will test the following synthetic data generation models:
- Gaussian Copula
- CTGAN
- TVAE

### Guidelines for the workshop
- We recommend completing this workshop in Colab. You can use your favorite notebook environment if you wish, but we won't be able to help with errors caused by environment settings.
- You will have sections marked **TO DO** that contain comments `#Your code` for you to fill in. You will be given time to complete the sections and we will discuss together. The instructions will be explained before each exercise.
- You can ask questions at any time. **Please interrupt the speaker if you are unable to run the notebook**. You can also ask questions in the chat, and we will read them.

**!!! SKIP IF ALREADY EXECUTED !!!**

In [None]:
!pip install matplotlib==3.1.3
!pip install -U numpy
!pip install sdv
!pip install -U seaborn

**!!!!!!!!!!!!!!!!!!!!!!!!!!!!**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.lite import SingleTablePreset
from sdv.single_table import CTGANSynthesizer, GaussianCopulaSynthesizer, TVAESynthesizer, CopulaGANSynthesizer
from sdv.evaluation.single_table import evaluate_quality, get_column_plot

### Load and visualize the dataset

In [None]:
url = 'https://raw.githubusercontent.com/oumasab/dss-workshop/main/insurance.csv'
df = pd.read_csv(url)
df.head()

#### Display the dataset using seaborn.pairplot()
**TO DO:** Set the `hue` parameter to one of the categorical columns to see how the data depends on that variable

In [None]:
sns.pairplot(data=df, hue=#Your code)
plt.show()

### First attempt at Synthetic data generation
Let's generate a first version of synthetic data. Here, we will use the models directly on our dataset without any additional feature engineering.

#### Quick usage of SDV models
Nothing is easier than getting started with SDV models. All you have to do is to follow these steps:
1. Load your dataset as a pandas `DataFrame`
2. (Optional, but recommended for the Tabular Preset) Define your dataset's metadata (description of column types)
3. Initialize the model
4. Use the `fit()` method
5. Sample new data using the `sample()` method

In [None]:
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

#### Single Table Preset
The `SingleTablePreset` is a tabular model that comes with pre-configured settings. This is meant for users who want to get started with using synthetic data and spend less time worrying about which model to choose or how to tune its parameters.
The `FAST_ML` preset uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.

In [None]:
model_fastml = SingleTablePreset(name='FAST_ML', metadata=metadata)
model_fastml.fit(df)

In [None]:
df_fastml = model_fastml.sample(1400)

**Let's visualize the synthetic data**

**TO DO:** Visualize the synthetic data using `sns.pairplot()`

In [None]:
#Your code

#### Gaussian Copula
The `sdv.single_table.GaussianCopulaSynthesizer` model is based on copula functions.

In mathematical terms, a Gaussian copula is a distribution over the unit cube $[0,1]^d$ which is constructed from a multivariate normal distribution over $\mathbb{R}^d$ by using the probability integral transform. Intuitively, a copula is a mathematical function that lets us describe the joint distribution of multiple random variables by separating their marginal distributions from the dependencies between them.

In order to "learn" the original dataset, the `GaussianCopulaSynthesizer()` model performs the following steps:

1. Learn the format and data types of the passed data

2. Transform the non-numerical and null data using Reversible Data Transforms to obtain a fully numerical representation of the data from which we can learn the probability distributions.

3. Learn the probability distribution of each column from the table

4. Transform the values of each numerical column by converting them to their marginal distribution CDF values and then applying an inverse CDF transformation of a standard normal on them.

5. Learn the correlations of the newly generated random variables.

After this, in order to generate the synthetic dataset using the trained model, the following steps are performed:

1. Sample from a Multivariate Standard Normal distribution with the learned correlations.

2. Revert the sampled values by computing their standard normal CDF and then applying the inverse CDF of their marginal distributions.

3. Revert the RDT transformations to go back to the original data format.
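
The fit and sample steps above can be sketched on toy data with the standard library alone. This is a simplified illustration, not SDV's implementation: it assumes two made-up numerical columns, uses the empirical CDF for the marginals (SDV fits parametric distributions instead), and skips the RDT step entirely. All names (`to_normal_scores`, `empirical_quantile`, `pearson`) are hypothetical helpers.

```python
import random
from statistics import NormalDist

random.seed(0)
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Toy "real" data: two positively dependent, non-normal columns (made up).
x = [random.expovariate(1.0) for _ in range(500)]
y = [xi + random.uniform(0.0, 1.0) for xi in x]

def to_normal_scores(values):
    """Map each value to its empirical CDF level, then through the normal inverse CDF."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (len(values) + 1))
    return scores

def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

def empirical_quantile(sorted_values, u):
    """Inverse of the empirical CDF: the value at probability level u."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

# "Fit": learn the correlation of the transformed (normal-score) columns.
rho = pearson(to_normal_scores(x), to_normal_scores(y))

# "Sample": draw correlated standard normals, then undo both transforms.
sorted_x, sorted_y = sorted(x), sorted(y)
synthetic = []
for _ in range(500):
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * random.gauss(0.0, 1.0)
    synthetic.append((
        empirical_quantile(sorted_x, std_normal.cdf(z1)),
        empirical_quantile(sorted_y, std_normal.cdf(z2)),
    ))
```

The synthetic pairs keep each column's marginal shape and most of the dependence between the columns, even though the only thing learned about the joint distribution is a single correlation value.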


In [None]:
model_gaussian = GaussianCopulaSynthesizer(metadata=metadata)
model_gaussian.fit(df)

In [None]:
df_gaussian = model_gaussian.sample(1400)

In [None]:
sns.pairplot(data=df_gaussian, hue="smoker")
plt.show()

**Train a CTGAN model**

#### CTGAN
The `sdv.single_table.CTGANSynthesizer` model is based on the GAN-based deep learning data synthesizer presented at the NeurIPS 2019 conference in the paper *Modeling Tabular Data using Conditional GAN*.

A few details about CTGAN:
- GAN-based method to sample from the distribution of the input data
- This method is more adapted to deal with class imbalance in categorical variables as the Generator uses training-by-sampling
- Both the Generator and Discriminator are fully connected networks, which lets them capture correlations between any of the features
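- The CTGAN Generator never accesses the real data directly during training (it only learns from the Discriminator's feedback). This may make Differential Privacy easier to achieve, but plain CTGAN does not by itself provide a formal Differential Privacy guarantee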
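
To see why training-by-sampling helps with class imbalance, here is a minimal sketch of the idea as described in the CTGAN paper: categories are picked with probability proportional to the logarithm of their frequency rather than the frequency itself. The function name `sampling_probabilities` and the `+ 1` inside the logarithm (which keeps a category seen once from getting zero weight) are assumptions made for this illustration, not a copy of the library's code.

```python
import math
from collections import Counter

def sampling_probabilities(values, use_log_frequency):
    """Probability of picking each category during training."""
    counts = Counter(values)
    weights = {
        category: math.log(count + 1) if use_log_frequency else count
        for category, count in counts.items()
    }
    total = sum(weights.values())
    return {category: weight / total for category, weight in weights.items()}

# Made-up imbalanced column: 95% non-smokers, 5% smokers.
smoker = ["no"] * 95 + ["yes"] * 5

raw = sampling_probabilities(smoker, use_log_frequency=False)
logged = sampling_probabilities(smoker, use_log_frequency=True)

print(raw)     # the rare category is picked 5% of the time
print(logged)  # the rare category is picked much more often
```

With log-frequency weights, rows from the rare category are shown to the model far more often than their raw share of the data, so the Generator gets more chances to learn them.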

**TO DO:** Train then sample data from a CTGAN model. Visualize the result.

In [None]:
#Your code

In [None]:
#Your code

**Visualize the synthetic dataset**

In [None]:
#Your code

#### TVAE
The `sdv.single_table.TVAESynthesizer` model is based on the VAE-based deep learning data synthesizer presented at the NeurIPS 2019 conference in the paper *Modeling Tabular Data using Conditional GAN*.

A few details about TVAE:
- An adaptation of VAE for tabular data: the encoder is adapted to work with tabular data, the decoder is kept as in classic VAE
- In a nutshell, TVAE (much like VAE) is composed of an encoder that transforms input to a latent space, then a decoder that samples new data from the latent space
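- Unlike the CTGAN Generator, TVAE is trained directly on the real data (through its reconstruction loss), so the CTGAN authors expect it to be harder to make compatible with Differential Privacy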

**TO DO:** Train then sample data from a TVAE model. Visualize the result.

In [None]:
#Your code

**Visualize the generated dataset**

In [None]:
#Your code

#### Deep-dive into a few features
**Plot `charges` against `age` for the synthetic dataset and the real dataset in the same figure using `sns.scatterplot()`**

In [None]:
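# Label each row with its origin, then stack the synthetic and real rows.
# Using assign() returns new DataFrames, which avoids SettingWithCopyWarning.
synthetic_part = df_gaussian[["age", "charges"]].assign(data="synthetic")
real_part = df[["age", "charges"]].assign(data="real")
to_plot = pd.concat([synthetic_part, real_part], ignore_index=True)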
sns.scatterplot(data=to_plot, x="age", y="charges", hue="data")
plt.show()

**Visualize the distributions of given features in the synthetic data using `sdv.evaluation.single_table.get_column_plot()`**

**TO DO:** Change the parameter `column_name` to visualize different variables

In [None]:
fig = get_column_plot(
    real_data=df,
    synthetic_data=df_gaussian,
    column_name=#Your code,
    metadata=metadata
)

fig.show()

#### Evaluate the models
SDV lets you evaluate the quality of the generated data with `sdv.evaluation.single_table.evaluate_quality`, its newest evaluation framework, which also lets you visualize the quality metrics.

In [None]:
columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
real_data = df[columns]
synthetic_data = df_gaussian[columns]

**Visualize the Quality report**

The quality report relies on two main metrics:
- **Total Variation Distance:** which evaluates the similarity between two discrete distributions (used for categorical variables)
- **Inverted Kolmogorov-Smirnov D statistic:** which evaluates the similarity between two continuous distributions
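
Both metrics can be computed by hand on a tiny example. The sketch below is only meant to show what the two numbers mean. It is not the SDMetrics implementation, the helper names `tv_complement` and `ks_complement` are chosen for this illustration, and the sample values are made up. Each score is reported as one minus the distance, so 1.0 means identical distributions.

```python
from collections import Counter

def tv_complement(real, synthetic):
    """1 - total variation distance between two categorical samples."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd

def ks_complement(real, synthetic):
    """1 - the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for value in real_sorted + synth_sorted:
        cdf_real = sum(v <= value for v in real_sorted) / len(real_sorted)
        cdf_synth = sum(v <= value for v in synth_sorted) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

# Made-up samples of a categorical and a numerical column.
smoker_real = ["yes", "no", "no", "no"]
smoker_synth = ["yes", "yes", "no", "no"]
age_real = [18, 25, 40, 60]
age_synth = [20, 30, 45, 64]

tv_score = tv_complement(smoker_real, smoker_synth)  # 0.75
ks_score = ks_complement(age_real, age_synth)        # 0.75
```

Note that both scores look only at one column at a time, which is why a high score here says nothing about whether correlations between columns were preserved.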

In [None]:
report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

In [None]:
report.get_details(property_name='Column Shapes')

In [None]:
fig = report.get_visualization(property_name='Column Shapes')

**A focus on data privacy**

The `NewRowSynthesis` metric measures whether each row in the synthetic data is novel, or whether it matches an original row in the real data (numerical values count as a match when they fall within `numerical_match_tolerance`).

Score:
- (best) 1.0: The rows in the synthetic data are all new. There are no matches with the real data.
- (worst) 0.0: All the rows in the synthetic data are copies of rows in the real data.
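
The score can be pictured as the fraction of synthetic rows that have no matching real row. The sketch below is a simplified, brute-force version written for illustration only. The helper names are hypothetical, the rows are made up, and treating the tolerance as a relative difference is an assumption about how the real metric behaves.

```python
def rows_match(real_row, synthetic_row, tolerance=0.01):
    """Two rows match if every column matches (numbers within a relative tolerance)."""
    for real_value, synthetic_value in zip(real_row, synthetic_row):
        if isinstance(real_value, (int, float)):
            if abs(real_value - synthetic_value) > tolerance * abs(real_value):
                return False
        elif real_value != synthetic_value:
            return False
    return True

def new_row_score(real_rows, synthetic_rows, tolerance=0.01):
    """Fraction of synthetic rows that match no real row."""
    new_rows = sum(
        not any(rows_match(r, s, tolerance) for r in real_rows)
        for s in synthetic_rows
    )
    return new_rows / len(synthetic_rows)

# Made-up rows with columns (age, sex, bmi).
real_rows = [(19, "female", 27.9), (33, "male", 22.7)]
synthetic_rows = [
    (19, "female", 27.9),  # exact copy of a real row
    (45, "male", 30.1),    # new row
    (33, "male", 22.8),    # within 1% of a real row, so it counts as a match
    (52, "female", 25.0),  # new row
]

score = new_row_score(real_rows, synthetic_rows)  # 0.5
```

A high score only shows that rows were not copied. It is not a privacy guarantee on its own, since a row can be new while still revealing information about a real individual.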

In [None]:
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    numerical_match_tolerance=0.01,
)

### Let's improve our model!
At this stage, we need to ask ourselves: is this the best we can do? It seems like we can do better since the models that we have seen so far have failed to capture some of the correlations that we have visualized in the original dataset. 

**Preliminary question:** Which features do you propose to transform in order to improve the results? Let's hear your ideas!

#### Feature engineering


**Focus on charges vs. age**

**TO DO:** Visualize `charges` vs. `age` using `sns.scatterplot()`

In [None]:
#Your code

In [None]:
df_2 = df.copy()

**TO DO:** Define a new variable that describes the tiers of charges, let's call it `tier`. It is defined as follows:
- `tier 1` if `charges<15000`
- `tier 2` if `charges>=15000` and `charges<33000`
- `tier 3` if `charges>=33000`

**TO DO:** Visualize `charges` vs. `age` using `tier` as a legend.

In [None]:
# Let's define charges tiers
#Your code

In [None]:
#Your code

**Transform charges by subtracting the fixed charge associated with each tier**

In [None]:
# Let's define the fixed charge for each tier
fixed_charge_1 = df_2.loc[(df_2["tier"]=="tier 1") & (df_2["age"]==18), "charges"].min()
fixed_charge_2 = df_2.loc[(df_2["tier"]=="tier 2") & (df_2["age"]==18), "charges"].min()
fixed_charge_3 = df_2.loc[(df_2["tier"]=="tier 3") & (df_2["age"]==18), "charges"].min()
df_2.loc[df_2["tier"]=="tier 1", "fixed_charge"] = fixed_charge_1
df_2.loc[df_2["tier"]=="tier 2", "fixed_charge"] = fixed_charge_2
df_2.loc[df_2["tier"]=="tier 3", "fixed_charge"] = fixed_charge_3

In [None]:
# Let's subtract the fixed charge from charges for all tiers
df_2["charges"] = df_2["charges"] - df_2["fixed_charge"]
df_2.drop(columns=["fixed_charge"], inplace=True)

**Visualize the new values of `charges` vs. `age`**

In [None]:
sns.scatterplot(data=df_2, x="age", y="charges")
plt.show()

**Visualize a fitted polynomial to `charges` vs `age`**

In [None]:
sns.regplot(data=df_2, x="age", y="charges", order=2, scatter=False)
plt.show()

**TO DO:** Based on what we have seen previously, create a new feature that accounts for the order of the polynomial describing the relationship between `charges` and `age`. Name the new column `age_squared`, since a later cell drops a column with that name.

In [None]:
# Let's take into account the above observation !
#Your code

In [None]:
metadata_2 = SingleTableMetadata()
metadata_2.detect_from_dataframe(data=df_2)

#### Let's train a synthetic data generation model

**TO DO:** Train then sample data from a synthetic data generation model of your choice (`GaussianCopula`, `CTGAN` or `TVAE`) using the new transformed dataset.

In [None]:
model_gaussian_2 = GaussianCopulaSynthesizer(metadata=metadata_2)
model_gaussian_2.fit(df_2)

In [None]:
df_gaussian_2 = model_gaussian_2.sample(1400)

**Transform the real data back to its original form:** Add the corresponding fixed charge to the charges of each tier in the real data

In [None]:
df_2.loc[df_2["tier"]=="tier 1", "fixed_charge"] = fixed_charge_1
df_2.loc[df_2["tier"]=="tier 2", "fixed_charge"] = fixed_charge_2
df_2.loc[df_2["tier"]=="tier 3", "fixed_charge"] = fixed_charge_3
df_2["charges"] = df_2["charges"] + df_2["fixed_charge"]
df_2.drop(columns=["fixed_charge", "age_squared"], inplace=True)

**TO DO:** Transform the synthetic data back

In [None]:
# Let's transform the data back
#Your code

**Visualize the synthetic data**

In [None]:
sns.pairplot(data=df_gaussian_2, hue="smoker")
plt.show()

#### Let's evaluate the new generated data

In [None]:
columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
real_data = df_2[columns]
synthetic_data = df_gaussian_2[columns]

**TO DO:** Evaluate the synthetic data using `sdv.evaluation.single_table.evaluate_quality`

In [None]:
#Your code

**TO DO:** Plot `charges` vs. `age` in the synthetic dataset vs. the real dataset. Let's discuss!

In [None]:
#Your code

**TO DO:** Visualize individual feature distributions using `get_column_plot`

In [None]:
#Your code

In [None]:
report.get_details(property_name='Column Shapes')

### Synthetic data generation limitations

- **Quality metrics**: As we saw in this example, the metrics might not accurately reflect the quality of the generated data. It is very important to rely on visualization in order to get a better idea of the quality of the synthetic dataset.
- **Models:** As in every ML problem, selecting the right model is a process that needs to be taken seriously. An essential first step towards selecting a good model is to understand the correlations within the data and define features that will help model training. We saw here that without feature engineering the models perform poorly, and that feature engineering improved the results significantly.
- **Bigger is not always better**: We might easily think that CTGAN would work every single time, but we have seen in this example that a simple model like a Gaussian Copula can work even better than CTGAN if it's coupled with good feature engineering. For small datasets (like the one we tried here), using large models like CTGAN might result in overfitting and poor generalization, which hinders the capability to generate a good synthetic dataset.

- **PII columns**: PII columns need to be processed before training a synthetic data generation model. A text column that is treated as a categorical variable is modeled as a set of categories, so the real values are reproduced as-is in the synthetic dataset. There is therefore a risk of serious data leakage if PII is not treated correctly.

**Let's demonstrate the danger of not treating PII correctly**

**Let's create fake names and add them to the dataset**

In [None]:
from faker import Faker
fake = Faker()

In [None]:
names = [fake.name_male() if row["sex"]=="male" else fake.name_female() for _,row in df.iterrows()]
df["name"] = names

In [None]:
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

In [None]:
metadata.update_column(
    column_name='name',
    sdtype='name')

In [None]:
model_gaussian_pii = GaussianCopulaSynthesizer(metadata)
model_gaussian_pii.fit(df)

In [None]:
df_gaussian_pii = model_gaussian_pii.sample(1400)

**Are any names from the real dataset repeated in the synthetic dataset?**

In [None]:
len(set(df_gaussian_pii["name"]) & set(df["name"]))

**How do we fix it?**

In [None]:
from rdt.transformers.pii import AnonymizedFaker

model_gaussian_pii.update_transformers(column_name_to_transformer={
    'name': AnonymizedFaker(provider_name='person', function_name='name'),
})

In [None]:
model_gaussian_pii.fit(df)

In [None]:
df_gaussian_pii_2 = model_gaussian_pii.sample(1400)

**Does it really work?**

In [None]:
#Your code