# Setup


In [None]:

import pandas as pd
from data_getters import load_s3_data

BUCKET_NAME = "dsp-tutorials"

## Explore and clean data

### Data preparation

Today we're using a pre-cleaned dataset, so that we can focus on data generation rather than data cleaning. For reference, the cleaning steps are described below, and the code used to clean the dataset can be found here (dap_tutorials/synth_data_demo/preprocessing_df.py):

1. The synthetic data generators will mimic every statistical relationship they can find in the data. So:
    - Drop any numeric 'ids' in the data, unless the ordering of the observations matters! Fidelity will suffer badly if the generator tries to mimic the ordering of the ids.
    - Change any 'numeric' categorical variables to strings, so that they are treated as non-ordered variables. You can instead just define the datatype, but I found this a bit buggy. It was easier to replace the numbers with letters in the data.


2. 'Missingness' in this context is an important statistical feature that we want to preserve. However, many packages won't accept NaN values:
    - For categorical variables, we can just replace NaN with a new category ('NA' for example).
    - For continuous variables, we need to impute values in a way that maintains the original statistical properties of the variable, and then create a new binary column that indicates whether the original value was missing (so that the generator learns both the correct statistical relationships and the 'missingness' pattern).

3. Some generators struggle with a large number of categories within a single variable. So take note of how many categories there are - if you have issues with training a model later on, this will be a good place to start optimising. 

4. It may go without saying, but:
    - You can't generate synthetic free text fields (e.g. comments from a survey) with tabular data generation methods like the ones we're using today. So drop any before you start.
    - Each column's dtype needs to reflect the kind of data it holds (numbers are ints or floats, categorical variables are bools or objects).

5. There is a tradeoff in how well statistical relationships are preserved (think of it like allocating a 'fidelity budget' across different variables). If there are columns that are really irrelevant to your analysis or final dataset, you could drop them before you start (although be very careful doing this, in case one is a confounding variable). Alternatively, there are methods that allow you to nominate priority fields (although we aren't using them today).
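
To make steps 1 to 3 concrete, here is a minimal sketch on a toy DataFrame. It is not the code in preprocessing_df.py: the column names (pupil_id, school_type, fsm_status, pre_test_score) are hypothetical, and median imputation is just one simple assumption about how to fill continuous values.

```python
import numpy as np
import pandas as pd

# Toy data. All column names are hypothetical, not taken from the tutorial dataset.
toy_df = pd.DataFrame(
    {
        "pupil_id": [101, 102, 103, 104],             # numeric id with no meaning
        "school_type": [1, 2, 1, 2],                  # categorical variable stored as numbers
        "fsm_status": ["yes", None, "no", "yes"],     # categorical variable with a missing value
        "pre_test_score": [12.0, np.nan, 15.0, 9.0],  # continuous variable with a missing value
    }
)

# Step 1: drop numeric ids, and replace numeric category codes with letters
clean_df = toy_df.drop(columns=["pupil_id"])
clean_df["school_type"] = clean_df["school_type"].map({1: "A", 2: "B"})

# Step 2 (categorical): give missing values their own category
clean_df["fsm_status"] = clean_df["fsm_status"].fillna("NA")

# Step 2 (continuous): record the missingness pattern first, then impute
# (median imputation is an assumption made for this sketch)
clean_df["pre_test_score_missing"] = clean_df["pre_test_score"].isna()
clean_df["pre_test_score"] = clean_df["pre_test_score"].fillna(
    clean_df["pre_test_score"].median()
)

# Step 3: note how many categories each categorical variable has
category_counts = clean_df[["school_type", "fsm_status"]].nunique()
print(category_counts)
```

The missingness flag has to be created before imputing, otherwise the information about which values were originally missing is lost.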

In [None]:
#Import the cleaned data and explore

#Load pre-cleaned dataset from S3 (we will create a synthetic version from this)

orig_df = load_s3_data(BUCKET_NAME,'synthetic_data_generation/magic_breakfast_cleaned_data.csv')


print('DataFrame columns:', orig_df.columns)
print('DataFrame shape:', orig_df.shape)

In [None]:
display(orig_df.head())

In [None]:
orig_df.dtypes

In [None]:
orig_df.isna().sum()

## Synthetic data model - Maximum Spanning Tree

* MST is a graph-based method for generating tabular synthetic data. It first measures the marginal distributions of the variables in the sample (the probability that each value occurs for a single variable, independent of the other variables).
* Then, it adds some noise to the marginal distributions (this is the step that introduces differential privacy (DP); the amount of noise is controlled by the privacy budget).
* Finally, it generates synthetic data that preserves the noisy marginal distributions.
* Pros: 
    - High fidelity method
    - Auditable (not a black box model)
    - Can allocate the privacy budget to specific variables, enabling greater preservation of some relationships, and more noise in others
* Cons: 
    - Requires discrete values (like all graph-based methods), so any continuous variables would need to be converted to discrete categories before generating synthetic data. However, the method is able to take such a high number of discrete values that data can be ‘binned’ (grouped together) into very small ranges (it was still an effective method on the UK census, with a number of continuous variables including ages and dates, with 55 million observations each). 
    - Slightly more of a manual process to set up than some other methods, as you need to define the privacy budget for each variable.
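
As a rough illustration of the 'measure a marginal, add noise, then sample' idea described above, here is a minimal NumPy sketch for a single discrete variable. It is not the MST implementation: the toy values, the noise scale, and the clip-and-normalise step are all assumptions made for this example, whereas the real method fits a graphical model to many noisy marginals at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single discrete variable, encoded as integers 0..2 (toy values)
values = np.array([0, 1, 1, 2, 2, 2, 0, 1, 2, 2])
n_categories = 3

# One-way marginal: how many times each category occurs
marginal = np.bincount(values, minlength=n_categories).astype(float)

# Add Laplace noise to each count (the scale is an illustrative assumption;
# in practice it is derived from the privacy budget)
noise_scale = 2.0
noisy_marginal = marginal + rng.laplace(loc=0.0, scale=noise_scale, size=n_categories)

# Turn the noisy counts into a probability distribution.
# Clipping and normalising is a simplification used only in this sketch.
probs = np.clip(noisy_marginal, 0.0, None)
if probs.sum() == 0:
    probs = np.ones(n_categories)  # fall back to uniform if all counts were clipped
probs = probs / probs.sum()

# Sample synthetic values from the noisy distribution
synthetic_values = rng.choice(n_categories, size=len(values), p=probs)
print(marginal, noisy_marginal, synthetic_values)
```

A larger noise scale gives more privacy but pushes the sampled distribution further from the original counts.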


In [None]:
from datetime import datetime
import json
import numpy as np
import os
import pandas as pd
from scipy import sparse

from mbi import Dataset, FactoredInference, Domain

In [None]:
#Create the 'domain' file - an input of each column and count of their unique values required to train the MST generator

def create_domain(df: pd.DataFrame, save: bool = True) -> dict:
    
    """
    Creates a domain dictionary of the format {column name: count of unique values} for each column.

    Parameters:
        df (pd.DataFrame): The input DataFrame.
        save (bool): Flag indicating whether to save the domain dictionary as a JSON file. Default is True.

    Returns:
        dict: The domain dictionary with column names as keys and the number of unique values as values.
    """

    domain_dict = {col: df[col].nunique() for col in df.columns}

    if save:
        with open("domain.json", "w") as f:
            json.dump(domain_dict, f)

    return domain_dict


def load_data_mst(df: pd.DataFrame, domain_file: str) -> tuple:
    """
    The package only takes locally saved data as inputs - so this function first dumps the data to a CSV file and then loads it.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    domain_file (str): The path to the domain file.

    Returns:
    data (Dataset): The loaded dataset.
    domain (Domain): The domain of the loaded dataset.
    """
    raw_data_file = "magic_breakfast.csv"
    # index=False stops the DataFrame index being written as an extra column
    df.to_csv(raw_data_file, index=False)
    data = Dataset.load(raw_data_file, domain_file)
    domain = data.domain
    return data, domain





In [None]:
#Create domain and load data
create_domain(orig_df)
data, domain = load_data_mst(orig_df, 'domain.json')

In [None]:
#Allocate privacy budget and train model
'''
Define the cliques - the relationships between variables that will have less noise allocated to them. 
For an RCT, the relationships we most want to preserve are between the allocation to the control and treatment groups and the post-test outcomes.'''

CLIQUES = [
    ("Treatment_Allocation", "PostTest_Outcome_1"),
    ("Treatment_Allocation", "PostTest_Outcome_2"),
]

def allocate_privacy_budget(data: Dataset, cliques: list, total_epsilon: float = 1.0) -> list:
    """
    Allocates privacy budget for measuring marginals in a dataset.

    Args:
        data (Dataset): The dataset containing the data.
        cliques (list): A list of cliques representing 2 and 3 way marginals
        (these are the relationships between variables that will have less noise allocated to them).
        total_epsilon (float): The total privacy budget to split across all measurements. Default is 1.0.

    Returns:
        list: A list of measurements, each containing the measurement matrix, noisy data, noise scale, and marginal information.
    """

    # fix the random seed so that the added noise is reproducible between runs
    np.random.seed(0)

    # split the privacy budget equally across every measurement
    # (one per column, plus one per clique)
    epsilon = total_epsilon
    epsilon_split = epsilon / (len(data.domain) + len(cliques))
    sigma = 2.0 / epsilon_split

    # measure all 1 way marginals
    measurements = []
    for col in data.domain:
        x = data.project(col).datavector()
        y = x + np.random.laplace(loc=0, scale=sigma, size=x.size)
        I = sparse.eye(x.size)
        measurements.append((I, y, sigma, (col,)))

    # measure the selected 2 and 3 way marginals
    for cl in cliques:
        x = data.project(cl).datavector()
        y = x + np.random.laplace(loc=0, scale=sigma, size=x.size)
        I = sparse.eye(x.size)
        measurements.append((I, y, sigma, cl))

    return measurements


def train_model(data: Dataset, domain: Domain, measurements: list, no_iterations: int = 2500):
    """
    Trains a model using the given dataset, domain, and measurements.

    Args:
        data (Dataset): The dataset to train the model on.
        domain (Domain): The domain of the data.
        measurements (list): The list of measurements to use for training.
        no_iterations (int): The number of training iterations. Default is 2500.

    Returns:
        model: The trained model.
    """
    engine = FactoredInference(domain, log=True, iters=no_iterations)
    total = data.df.shape[0]
    model = engine.estimate(measurements, total=total)
    return model
 

In [None]:
#Set parameters and train model (takes ~30s to run)
measurements = allocate_privacy_budget(data, CLIQUES)
model = train_model(data, domain, measurements)
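
The budget split inside allocate_privacy_budget can be worked through by hand. The sketch below repeats the same arithmetic in a small helper, here called laplace_scale (a hypothetical name used only for this example), with an assumed 10 columns and the 2 cliques defined above.

```python
def laplace_scale(total_epsilon: float, n_columns: int, n_cliques: int) -> float:
    """Repeat the arithmetic from allocate_privacy_budget for one measurement."""
    # the budget is shared equally across every column and every clique
    epsilon_split = total_epsilon / (n_columns + n_cliques)
    # a smaller share of the budget means a larger Laplace noise scale
    return 2.0 / epsilon_split


# 10 columns is an assumption for illustration, not the size of the tutorial dataset
sigma_example = laplace_scale(total_epsilon=1.0, n_columns=10, n_cliques=2)
sigma_double_budget = laplace_scale(total_epsilon=2.0, n_columns=10, n_cliques=2)

print(sigma_example)        # approximately 24
print(sigma_double_budget)  # approximately 12
```

Doubling the total budget halves the noise scale, while every extra column or clique increases it, because the same budget has to stretch across more measurements.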

In [None]:
def generate_data(model, no_rows: int, save: bool = True) -> pd.DataFrame:
    """
    Generate synthetic data using the given model.

    Parameters:
    model (Model): The model used to generate synthetic data.
    no_rows (int): The number of rows to generate.
    save (bool, optional): Currently unused. The generated data is returned but not written to disk. Defaults to True.

    Returns:
    pd.DataFrame: The generated synthetic data as a pandas DataFrame.
    """

    # generate synthetic data and return it as a DataFrame
    synthetic_data = model.synthetic_data(rows=no_rows)
    synth_df = synthetic_data.df
    return synth_df

In [None]:
#Generate synthetic data

mst_synth_data = generate_data(model, len(orig_df))

## Inspect the synthetic data

There are endless options for evaluating synthetic data. We'll use a package called 'synthgauge' to inspect the data. It was developed by the ONS, and contains well-regarded benchmarks for tabular synthetic data utility, privacy and fidelity. We chose it because it's from a reputable source in the UK (important for creating trust for EEF if they produce high fidelity data), can be applied to data from many sources and methods, and is easy to implement.

More information can be found here: https://datasciencecampus.github.io/synthgauge/

In [None]:
#Import required packages

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

import synthgauge as sg

In [None]:

mst_evaluator = sg.Evaluator(orig_df, mst_synth_data)

In [None]:
#Functions for data exploration

def histograms(evaluator):
    return evaluator.plot_histograms(figsize=(15, 30))

def cont_data_description (evaluator):
    return evaluator.describe_numeric()

def cat_data_description (evaluator):
    return evaluator.describe_categorical()

def correlation_tables(evaluator):
    return evaluator.plot_correlation(figsize=(15, 12))

In [None]:
mst_evaluator.plot_histograms(figsize=(15, 30))

Explore the data using these functions, and any other analysis you can think of.
- Where does the generator perform well?
- What happens if you change the relationships that have a privacy budget allocated to them? (CLIQUES, passed to the allocate_privacy_budget function above)
- What happens if you change the total privacy budget? (total_epsilon in the allocate_privacy_budget function above)
- What happens if you change the number of iterations? (no_iterations in the train_model function above)
- Finally, what if you produce different numbers of observations? (no_rows in the generate_data function above)


## Benchmark the MST synthetic data

Along with informal inspection of the data, there are formal metrics available to benchmark different models in terms of their privacy, utility, and fidelity. In the project, we examined a number of metrics, which are listed below. For the sake of time, only one (the TCAP score) is run in this notebook, although you can add additional metrics if you have time. See the documentation here: https://datasciencecampus.github.io/synthgauge/autoapi/synthgauge/metrics/index.html

**Fidelity**
- **PMSE**: Mean-squared difference in pairwise Pearson correlation coefficients (continuous data) - Measures how well relationships between continuous variables have been preserved, by taking the difference in correlations between variables between the synthetic and original data. *(Minimise towards 0)*
- **PMSE-ratio**: Correlation ratio mean-squared difference (continuous/categorical data) - Measures how well relationships between continuous or categorical variables have been preserved, by taking the difference in correlations between variables between the synthetic and original data. *(Minimise towards 0)*
- **MAD**: Mean absolute difference of feature densities - Measures how well the distribution of individual variables has been preserved, by taking the difference in the measure of their distribution between synthetic and original data. *(Minimise towards 0)*

**Utility**
- **Classification error**: Classification comparison (difference in precision, recall, F1 scores on a classification task) - Uses the synthetic data to train a classifier, and then tests the difference in accuracy when predicting on real data. *(Minimise towards 0)*

**Privacy** 
- **Nearest neighbours**: Minimum distance nearest neighbour - A check to ensure that outliers have not been replicated in the synthetic data *(Maximise)*
- **TCAP score**: Target Correct Attribution Probability Score - The risk that a target variable can be inferred given a key variable *(Minimise towards 0)*
- **Sample overlap**: Proportion of real data found in synthetic data - A straightforward check that no real observations are contained in the synthetic dataset (necessary, but not sufficient, to preserve privacy) *(Should always be 0)*
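
To show what two of these metrics measure, here is a minimal pandas sketch of the correlation mean-squared difference and the sample overlap. These are simplified stand-ins written for this example, not synthgauge's implementations (which may handle details such as duplicates differently), and the function names correlation_msd_sketch and sample_overlap_sketch are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_msd_sketch(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean-squared difference between the two Pearson correlation matrices."""
    real_corr = real.corr(method="pearson").to_numpy()
    synth_corr = synth[real.columns].corr(method="pearson").to_numpy()
    return float(np.mean((real_corr - synth_corr) ** 2))


def sample_overlap_sketch(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Proportion of real rows that appear exactly in the synthetic data."""
    synth_rows = set(synth[real.columns].itertuples(index=False, name=None))
    real_rows = list(real.itertuples(index=False, name=None))
    return sum(row in synth_rows for row in real_rows) / len(real_rows)


# Toy data: one synthetic set is an exact copy, the other reverses the relationship
real_toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
synth_copy = real_toy.copy()
synth_flipped = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]})

print(correlation_msd_sketch(real_toy, synth_copy), sample_overlap_sketch(real_toy, synth_copy))
print(correlation_msd_sketch(real_toy, synth_flipped), sample_overlap_sketch(real_toy, synth_flipped))
```

The exact copy scores perfectly on fidelity but fails the overlap check completely, which is why fidelity and privacy metrics need to be read together.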




In [None]:
def tcap_test(
    original_df: pd.DataFrame, synthetic_df: pd.DataFrame, key: list, target: str
) -> pd.DataFrame:
    """
    This function calculates the tcap_score metric - which is the chance that an attacker could infer the
    true values of the target variable, if they had acess to both the true and synthetic values of the 'key' variables.
    The key variables should therefore be data that is widely available; the target should be sensitive data.

    Parameters:
        original_df (pd.DataFrame): The original dataframe containing the actual data.
        synthetic_df (pd.DataFrame): The synthetic dataframe containing the generated data.

    Returns:
        pd.DataFrame: The evaluation results as a dataframe.
    """
    evaluator = sg.Evaluator(original_df, synthetic_df)
    evaluator.add_metric(
        "tcap_score",
        key=key,
        target=target,
    )
    return evaluator.evaluate(as_df=True)

In [None]:
tcap_test(orig_df, mst_synth_data, ["Treatment_Allocation"], "PostTest_Outcome_1")