# Data synthesis pipeline
Following the steps in this notebook will allow you to synthesize data from a .csv input file.<br>
All names under chapter <b>2. Variables</b> should be checked and changed if necessary.<br>
Processing and synthesizing the data may take a while depending on the chosen dataset.<br>
Results are stored in the \Results folder of this pipeline.

## 1. Environment
Install the required Python packages via pip (shapely, synthgauge, pyreadstat, geopandas). Warnings may be ignored. If you encounter errors, try installing the packages via Anaconda Prompt (from the Start menu).

In [None]:
pip install shapely

In [None]:
pip install synthgauge 

In [None]:
pip install pyreadstat

In [None]:
pip install geopandas --user

### 1.1 Restart Kernel or comment out .plot_functions
To continue with the next section, restart the kernel or comment out the line: "from Pipeline.plotting.plot_functions import map_plotter, gemeente_lader, distribution_comparison" (note that you won't be able to execute chapter 5.4). To restart the kernel, go to "Kernel" in the toolbar and select "Restart". After restart you can execute the following block.

In [None]:
# Import all the required modules. Do not change these settings.
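import os
import subprocess

# numpy and pandas are used directly in the differential privacy, k-anonymity and evaluation cells
import numpy as np
import pandas as pd

import Pipeline.final_score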

from Pipeline.full_data_process import full_data_process
from Pipeline.df_compare import *
from Pipeline.privacy_functions import privacy_calc, privacy_calc_id
from Pipeline.final_score import final_scoring
from Pipeline.plotting.plot_functions import map_plotter, gemeente_lader, distribution_comparison
from Pipeline.data_processing.df_comparison.df_comparer import correlation_comparison

## 2. Variables
The following section contains the variables and configuration used for the data synthesis. Change these as needed, depending on where the input data is located and where you would like to store the output (synthetic data).

In [None]:
# The full data location of the dataset that will be used as input
full_data_location = '<data location.csv>'

# Columns to drop from the dataframe (default: empty)
drop_cols = []
# The name used for the synthetic files and subfolders
name = '<name of result>'
# The location where all the output such as the training dataset and synthetic dataset will be stored.
output_location = os.path.join(os.getcwd(), 'Results\\' + name)
# Y columns are the columns in the dataset for which the utility score will be calculated (default: age, gender). 
y_columns = ['age','gender']
# ID columns are the columns in the dataset for which additional privacy scores will be calculated (default: age, gender, zip_code)
id_columns = ['age', 'gender', 'zip_code']
# Differential Privacy columns
dp_columns = ['age', 'zip_code']

# Variables used for drawing the graphs, col is the column which will be plotted for both the map and the distribution
col = 'age'
# The name of the column containing the zip_code4 data (default: zip_code)
zip_code_column = 'zip_code'


# The location of the synthpop file
synthpop_file = os.path.join(os.getcwd(), 'R_scripts', 'synthpop_script.R')
# The location of the Rscript exe file
# To find this, open RStudio, go to Tools > Global Options and copy the R version path.
# Make sure all backslashes are changed to forward slashes.
rscript_loc = '//<pathto>/RforWindows/R-4.2.0' + '/bin/Rscript.exe'

## 3. Data processing
The following section performs the necessary data (pre-)processing steps. The output folders will be created, including the training and holdout datasets.

In [None]:
# These variables do not require changing
train_csv_location = output_location + '\\' + name + '_train.csv'
holdout_csv_location = output_location + '\\' + name + '_holdout.csv'
synth_csv_folder = output_location + '\\' + name + '_synths\\'
synth_csv_location = synth_csv_folder + name + '_synthpop'
synth_csv_folder = synth_csv_folder.replace('/','\\')

In [None]:
# Create the folder of the output location (only if it does not exist yet)
if not os.path.exists(output_location):
    os.makedirs(output_location)
# Create the folder for the synthetic datasets (only if it does not exist yet)
if not os.path.exists(synth_csv_folder):
    os.makedirs(synth_csv_folder)

In [None]:
# This function makes sure the file can be processed by synthpop
# The function prints the column types: binary, categorical and continuous columns
full_data_process(file_loc=full_data_location, train_test_path=output_location, name=name, drop_cols=drop_cols)

# 4. Running Synthpop
The code below will run the R script and use synthpop to create synthetic data <br>
The script may take a while to run. <br>
<br>
Synthpop default processing method: <b>Visit order of empty columns first.</b>

In [None]:
# This script uses the provided (local) R-libraries. 
# Verify if the script accesses the correct folder containing the synthpop package.
print(synthpop_file)

In [None]:
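# This will run the Rscript and use synthpop to generate synthetic data
# stderr is merged into stdout so that R messages and errors are shown as well
synthpop = subprocess.Popen([rscript_loc, synthpop_file, train_csv_location, synth_csv_location],
              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# Print the output line by line until the R process closes its output stream
for raw_line in synthpop.stdout:
    line = raw_line.decode(errors='replace').rstrip()
    if line:
        print(line)
# Wait for the process to finish and report a failure if it did not exit cleanly
return_code = synthpop.wait()
if return_code != 0:
    print('Rscript exited with return code', return_code)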

# 6. Apply differential privacy

In [None]:
# Load dataset
synthetic_data = pd.read_csv(synth_csv_location + '.csv')

# Set the privacy parameters and select the columns to privatize (dp_columns is defined in section 2)
epsilon, sensitivity = 0.8, 1
columns = dp_columns

# Add rounded Laplace noise with scale sensitivity/epsilon to each selected column
noisy_data = synthetic_data.copy()
for column in columns:
    noisy_data[column] += np.random.laplace(0, scale=sensitivity/epsilon, size=len(synthetic_data)).round(0)

# Write the noisy dataset to a CSV file
noisy_csv_location = synth_csv_folder + name + '_noisy.csv'
noisy_data.to_csv(noisy_csv_location, index=False)
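
The cell above adds Laplace noise with scale `sensitivity / epsilon` to each selected column. The following is a minimal sketch of that step on toy data, not the pipeline's own code: the helper name `add_laplace_noise`, the toy values and the fixed seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_laplace_noise(df, columns, epsilon, sensitivity=1.0, seed=None):
    """Return a copy of df with rounded Laplace noise added to the given columns."""
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon  # a smaller epsilon gives a larger noise scale
    noisy = df.copy()
    for column in columns:
        noise = rng.laplace(loc=0.0, scale=scale, size=len(df)).round(0)
        noisy[column] = noisy[column] + noise
    return noisy

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'age': [23, 45, 67, 34],
                    'zip_code': [1011, 3511, 9711, 5611]})
noisy_toy = add_laplace_noise(toy, ['age'], epsilon=0.8, seed=0)
print(noisy_toy)
```

Lowering `epsilon` increases the noise scale, which trades accuracy for stronger privacy.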

In [None]:
# Select the column to compare
column_to_compare = 'age'

# Print basic statistics of the synthetic dataset before and after adding noise
print('Synthetic data (without noise):')
print(synthetic_data[column_to_compare].describe())
print('\nNoisy data:')
print(noisy_data[column_to_compare].describe())

# Compute the mean absolute difference between the synthetic and noisy datasets
mad = np.mean(np.abs(synthetic_data[column_to_compare] - noisy_data[column_to_compare]))

print('\nMean Absolute Difference:', mad)

# 7. Apply k-anonymity

In [None]:
def custom_round(x, base=5):
    return int(base * round(float(x)/base))

# Load dataset
synthetic_data = pd.read_csv(synth_csv_location + '.csv')
synthetic_data = synthetic_data.dropna(subset = ['zip_code'])

# Define the quasi-identifier columns used to form the groups
sensitive_columns = id_columns

# Apply k-anonymity measures (rounding zip code and age to the nearest multiple of 5)
synthetic_data['zip_code'] = synthetic_data['zip_code'].apply(lambda x: custom_round(x, base=5)).astype("float64")
synthetic_data['age'] = synthetic_data['age'].apply(lambda x: custom_round(x, base=5)).astype("float64")

# Group the data by the quasi-identifiers and count the number of rows in each group
group_counts = synthetic_data.groupby(sensitive_columns).size().reset_index(name='count')

# The k-anonymity level is the size of the smallest group
k_anonymity_level = group_counts['count'].min()

# Print the k-anonymity level
print('The dataset has a k-anonymity level of', k_anonymity_level)
print(group_counts)

# Write the k-anonymity datasets to CSV files
synthetic_data.to_csv(synth_csv_folder + name + '_kanonymity.csv', index=False)
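
As a compact illustration of the check above: a dataset is k-anonymous when every combination of quasi-identifier values occurs in at least k rows, so k is the size of the smallest group. The sketch below uses toy data; the function name `k_anonymity_level` and the values are assumptions for illustration.

```python
import pandas as pd

def k_anonymity_level(df, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy data (hypothetical values): three rows share one combination, two share another
toy = pd.DataFrame({
    'age':      [25, 25, 25, 40, 40],
    'gender':   ['f', 'f', 'f', 'm', 'm'],
    'zip_code': [1010, 1010, 1010, 3510, 3510],
})
print(k_anonymity_level(toy, ['age', 'gender', 'zip_code']))  # 2
```

Coarser rounding (a larger `base` in `custom_round`) merges more records into the same group and therefore tends to raise k, at the cost of detail.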

# 5. Evaluation
The code below will evaluate the generated synthpop data against the holdout dataset using the metrics on utility, fidelity, and privacy. These outcomes can be used to evaluate the performance of synthetic data generation methods when compared to real data.<br>

<b>Utility</b>: Utility refers to the usefulness of synthetic data for a particular task or analysis. In other words, how well does the synthetic data perform when used in place of real data? A synthetic dataset with high utility should be able to provide similar or equivalent results to those obtained using real data.<br>

<b>Fidelity</b>: Fidelity refers to the degree to which the synthetic data accurately represents the real data. A synthetic dataset with high fidelity should be able to capture the key statistical properties of the real data, such as the mean, median, standard deviation, and distribution of variables.<br>

<b>Privacy</b>: Privacy refers to the level of protection provided to individuals' personal information in the synthetic dataset. A synthetic dataset with high privacy should not be susceptible to re-identification attacks, meaning that it should not be possible to link an individual's identity to their personal information in the dataset.<br>

## 5.1 Fidelity & Utility calculations


In [None]:
end_results, ratio_results, reggre, classi = df_compare(train_csv_location,
                                                        holdout_csv_location, 
                                                        synth_csv_folder,
                                                        c=1,
                                                        y_columns=y_columns,
                                                        subset=None)

### 5.1.1 Fidelity results

Fidelity evaluations compared to the real dataset.<br><br>
    <b>dupe_numbers</b>: Number of duplicate records.<br>
    <b>sum_%mean_diff</b>: This variable represents the sum of the percentage difference between the means of two sets of data. It can be used to quantify the degree of difference between the two sets of data.<br>
    <b>sum_%median_diff</b>: This variable represents the sum of the percentage difference between the medians of two sets of data. It can be used to quantify the degree of difference between the two sets of data.<br>
    <b>sum_%std_diff</b>: This variable represents the sum of the percentage difference between the standard deviations of two sets of data. It can be used to quantify the degree of difference between the two sets of data.<br>
    <b>binary_val_count_diff</b>: This variable represents the difference in the count of binary values between two sets of data. It can be used to compare the frequency of occurrence of certain binary values between two sets of data.<br>
    <b>correlation_norm</b>: This variable represents the normalized correlation between two sets of data. It can be used to measure the strength and direction of the linear relationship between the two sets of data.<br>
    <b>real_or_snyth_acc</b>: This variable represents the accuracy of a machine learning model in classifying real versus synthetic data. It can be used to evaluate the performance of the model in distinguishing between real and synthetic data.<br>
    <b>jenson_shannon</b>: This variable represents the Jensen-Shannon divergence between two probability distributions. It can be used to measure the dissimilarity between the two distributions.<br>
    <b>total_variational_dist</b>: This variable represents the total variation distance between two probability distributions. It can be used to measure the distance between the two distributions.<br>
    <b>wasserstein_dist</b>: This variable represents the Wasserstein distance between two probability distributions. It can be used to measure the distance between the two distributions, taking into account the underlying geometry of the space in which the distributions are defined.<br>
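
To make the three distribution distances above concrete, here is a minimal NumPy sketch for a single numeric column. It is not the implementation used by `df_compare`. The binning, the base-2 logarithm and the helper names are assumptions; SciPy offers comparable functions (`scipy.spatial.distance.jensenshannon`, which returns the square root of the divergence, and `scipy.stats.wasserstein_distance`).

```python
import numpy as np

def histogram_probs(real, synth, bins=10):
    """Bin both samples on a shared grid and return two probability vectors."""
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    return p / p.sum(), q / q.sum()

def total_variation(p, q):
    """Half the summed absolute difference between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithm (between 0 and 1)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(real, synth):
    """1-D Wasserstein distance for two samples of equal size."""
    return np.mean(np.abs(np.sort(real) - np.sort(synth)))

# Toy samples (hypothetical): the "synthetic" sample is shifted by 5
rng = np.random.default_rng(0)
real = rng.normal(40, 10, size=1000)
synth = real + 5
p, q = histogram_probs(real, synth)
print(total_variation(p, q), jensen_shannon_divergence(p, q), wasserstein_1d(real, synth))
```

All three values are 0 for identical distributions and grow as the distributions move apart.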

In [None]:
end_results.T

### 5.1.2 Fidelity ratio results
Calculated by dividing each result of a synthetic dataset by the corresponding result of the holdout dataset. <br>

Each ratio (except for dupe_numbers) should be close to 1.0, which indicates that the synthetic dataset is statistically comparable to the holdout dataset.
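
The ratio calculation can be sketched as follows. The numbers are invented for illustration, and the layout (one column per dataset, one row per metric) is an assumption, not necessarily how `df_compare` stores its results.

```python
import pandas as pd

# Hypothetical metric values for the holdout dataset and one synthetic dataset
metrics = pd.DataFrame(
    {'holdout': [0.020, 0.150], 'synthpop': [0.022, 0.300]},
    index=['jenson_shannon', 'wasserstein_dist'],
)

# Divide every column by the holdout column, row by row
ratios = metrics.div(metrics['holdout'], axis=0)
print(ratios)
```

In this toy example the first ratio is close to 1, while the second shows that the synthetic dataset is twice as far from the real data as the holdout dataset on that metric.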

In [None]:
ratio_results.T

### 5.1.3 Utility regression results
Only the r2 score is shown; the other scores calculated are the mse and the max error.<br>

The r2 score (also known as the coefficient of determination) is a widely used metric for evaluating a regression-based machine learning model. It measures the proportion of the variance in the target variable that is explained by the model's predictions.<br>

An r2 score of 1 means that the model predicts the target perfectly, a score of 0 means that it does no better than always predicting the mean of the target, and a negative score means that it does worse than that. The result should be compared to the score of the holdout dataset.
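
For reference, the r2 score is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with toy values (the function name `r2_score_manual` is an assumption for illustration):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]
print(r2_score_manual(y, y))                     # 1.0  (perfect predictions)
print(r2_score_manual(y, [6.0, 6.0, 6.0, 6.0]))  # 0.0  (always predicting the mean)
print(r2_score_manual(y, [9.0, 7.0, 5.0, 3.0]))  # -3.0 (worse than the mean)
```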



In [None]:
reggre.xs('r2', level=1, drop_level=False).sort_values(by=reggre.columns[0], ascending=False)

### 5.1.4 Utility classification results
Only showing the accuracy, other scores calculated are the f1, recall and precision.

Accuracy is the percentage of correct classifications that a trained machine learning model achieves, i.e., the number of correct predictions divided by the total number of predictions across all classes. Accuracy of 0 means the classifier always predicts the wrong label, whereas accuracy of 1 means that it always predicts the correct label.<br>

Comparing the accuracy with that of the holdout dataset helps to reveal under- and overfitting; the two values should be comparable.
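
The four classification scores mentioned above can be sketched for a binary target as follows. The labels are toy values and the helper name `binary_scores` is an assumption for illustration.

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and f1 for binary labels (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy labels (hypothetical): 4 of the 6 predictions are correct
accuracy, precision, recall, f1 = binary_scores([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(accuracy, precision, recall, f1)
```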

In [None]:
classi.xs('accuracy', level=1, drop_level=False).sort_values(by=classi.columns[0], ascending=False)

## 5.2 Privacy calculations

### 5.2.1 Calculations on entire dataset
Calculate the privacy scores for the entire dataset, including all columns.

In [None]:
privacy_results, privacy_ratio = privacy_calc(train_csv_location,
                                              holdout_csv_location, 
                                              synth_csv_folder,
                                              sample_per=75, 
                                              memory=400)

### 5.2.2 Calculations on quasi-identifiers
Calculate the privacy scores based on the quasi-identifiers indicated in the configuration step (2) of this notebook.

In [None]:
privacy_results_id, privacy_ratio_id = privacy_calc_id(train_csv_location,
                                              holdout_csv_location, 
                                              synth_csv_folder,
                                              id_columns,
                                              sample_per=75, 
                                              memory=400)

### 5.2.3 Privacy score results
DCR (Distance to Closest Record) and NNDR (Nearest Neighbour Distance Ratio) are two evaluation metrics commonly used in the field of record linkage, which is the process of identifying records in different data sources that refer to the same entity. DCR is the distance from a synthetic record to its closest real record; NNDR is the ratio between the distances to its closest and second-closest real records. The values for the synthetic data should be comparable to the holdout dataset. Significantly lower scores indicate that records are close to the actual data and that the model overfits.

The first two results are based on the entire dataset. The last two results are based on only the quasi-identifiers.
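
A minimal sketch of how DCR and NNDR can be computed for numeric data, assuming Euclidean distance on already scaled features. This is not the implementation of `privacy_calc`, whose sampling (`sample_per`) and memory handling are not reproduced here; the function name and toy values are assumptions for illustration.

```python
import numpy as np

def dcr_and_nndr(real, synth):
    """Per synthetic record: distance to the closest real record (DCR) and
    the ratio of the closest to the second-closest distance (NNDR)."""
    # Pairwise Euclidean distances, shape (n_synth, n_real)
    diffs = synth[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest_two = np.sort(dists, axis=1)[:, :2]
    dcr = nearest_two[:, 0]
    # Undefined when the second-closest distance is 0 (duplicated real records)
    nndr = dcr / nearest_two[:, 1]
    return dcr, nndr

# Toy data (hypothetical): the first synthetic record is an exact copy of a real one
real = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
synth = np.array([[0.0, 0.0], [6.0, 8.0]])
dcr, nndr = dcr_and_nndr(real, synth)
print(dcr, nndr)
```

A DCR of 0 flags a synthetic record that copies a real record exactly, and an NNDR close to 0 means the record sits much closer to one real record than to any other.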

In [None]:
privacy_results

In [None]:
privacy_ratio2 = privacy_ratio.T.add_suffix('_ratio')
priv_both = pd.merge(privacy_results.T, privacy_ratio2, left_index=True, right_index=True)
priv_both.sort_values(by='DCR')

In [None]:
privacy_results_id

In [None]:
privacy_ratio2 = privacy_ratio_id.T.add_suffix('_ratio')
priv_both = pd.merge(privacy_results_id.T, privacy_ratio2, left_index=True, right_index=True)
priv_both.sort_values(by='DCR')

## 5.3 Final score
In comparing a synthetic dataset with real data, it is important to evaluate each of these variables to ensure that the synthetic data is a suitable replacement for real data in a given analysis or task. A high level of utility and fidelity suggests that the synthetic data can be used with confidence, while a high level of privacy suggests that individuals' personal information is well protected.<br>

<b>privacy</b>: Privacy score for the entire dataset<br>
<b>privacy on ids</b>: Privacy score for the quasi-identifiers<br>
<b>fidelity</b>: Statistical comparison between the datasets<br>
<b>utility</b>: Correlations between variables in the dataset<br>

<b>For the final score, the lower the score per domain, the better the performance in that domain.</b><br> As explained by Rients:
<i>For each of the three domains multiple evaluation methods have been used to assess the performance of each dataset. This results in a large number of scores, which can be difficult to interpret and draw conclusions from. To obtain a clearer understanding of the performance of each dataset, a final score has been calculated. These scores are calculated by aggregating the individual scores resulting in a clear overview of the performance of each dataset in each domain. Not every individual score contributes as much to the final aggregated score, because based on initial results certain methods such as calculating the sum percentage difference of the standard deviation returned unstable results. All aggregated scores have been calculated in a penalty-like manner, meaning that the lower the score is, the better the dataset has performed.</i>

In [None]:
end_score, priv_score, priv_score_id, fidel_score, ml_score, fin_frame = final_scoring(ratio_results, privacy_ratio, privacy_ratio_id, reggre, classi)

In [None]:
end_score.sort_index(ascending=False)

## 5.4 Graph plotting
For visual comparison of the results various graphs and plots can be used. These are computed below.

In [None]:
# Change this to one of the columns to alter the graphs.
col = col

In [None]:
gemeentes = gemeente_lader()
df_real, synth_frame = data_loader(train_csv_location, holdout_csv_location, synth_csv_folder)

In [None]:
plots = []
for frame in synth_frame:
    df = pd.read_csv(synth_frame[frame])
    plots.append(map_plotter(df_real, df, gemeentes, frame, column=col, zip_code=zip_code_column))

### 5.4.1 Geographical spread and visual
Showing averages of a column grouped by zip code/municipality. At least 25 participants need to have the same zip code to be included in the visual.

In [None]:
plots[0]

In [None]:
plots[3]

In [None]:
plots[2]

In [None]:
plots[1]

### 5.4.2 Univariate distribution plots

In [None]:
print(df_real[col].min(), df_real[col].max(), df_real[col].nunique())

In [None]:
# This variable can be changed.
# If there are too many bars in the distribution, increase the split number.
# For binary columns, change it to 0.5
split = 6

# Change this to one of the columns to alter the graphs.
col = col

In [None]:
dist = []
for frame in synth_frame:
    df = pd.read_csv(synth_frame[frame])
    dist.append(distribution_comparison(df_real, df, column=col, step_split=split, name=frame))

In [None]:
dist[0]

In [None]:
dist[2]

In [None]:
dist[1]