# SPEML - Assignment 1
---
This exercise is intended for group work of 2 students. Only one student per group needs to submit the report - when you are signed up to a group, the submission will also be visible to the second group member.

Synthetic Data Creation and Evaluation
In this task, you will work with a synthetic data generation tool and evaluate the fidelity, utility, and privacy of the synthetic data; to keep computation times reasonable, you shall work with tabular data. The idea is that you use existing tools and mainly develop the boilerplate code around them (dataset loading, splitting, model training, calling evaluation tools, ...).

The exercise thus consists of roughly the following steps:

Pick a tabular dataset from a public repository; ideally, the dataset contains some sensitive/personal attributes (age, income, education, a diagnosis, ...). Your dataset does not need to be large; anything from 1k samples up should be fine.
(We focus on tabular data because of computational cost / runtime, interpretability of the results, and the meaningfulness of the fidelity & privacy evaluation; if you want to work with some other data modality, contact me and we can discuss whether that is feasible.)

Choose a data synthesis framework; while there are many frameworks around, we can recommend the [Synthetic Data Vault](https://github.com/sdv-dev/SDV) (SDV) as a very comprehensive tool including many different techniques (their original Gaussian Copula, but also CTGAN, and implementations of Bayesian Networks, ...). Other options are gretel.ai; synthpop, which offers decision trees (primary implementation is in R); and DataSynthesizer, which offers Bayesian Networks. You can find a curated list e.g. at https://github.com/joofio/awesome-data-synthesis

Prepare your data for evaluation; basically, you shall have a test set available for testing classifiers

Create a synthetic version of your original training data; for simplicity, just create a dataset with the same number of samples as the original dataset.

Perform a fidelity evaluation of the synthetic dataset by comparing its data characteristics to the original training data; you can use e.g. SDMetrics from the SDV project (https://github.com/sdv-dev/SDMetrics). Select the most interesting aspects from the evaluation for your report.

Perform a utility evaluation of the synthetic dataset by training one classifier model twice, once on the original training data and once on the synthetic data; then evaluate both models on the test data and compare their performance. Use a fast, shallow, but still powerful model (e.g. SVM, RF, boosting algorithms, ...).

Perform a privacy evaluation of the synthetic dataset, by using a privacy evaluation tool. You can use e.g. Anonymeter (https://github.com/statice/anonymeter), and focus on metrics that can be computed from the synthetic data directly. There are other tools around, but many of them provide attack-based evaluation (e.g. membership inference attacks), which are computationally expensive. Again, select the most interesting aspects from the evaluation for your report.

Write up all of this in a short report; in the report, describe what tools and parameter settings you used, and describe and discuss the results in terms of fidelity, utility and privacy of the synthetic data. Submit a complete package including your report, your code, and your data files (original and generated).

To avoid everyone using the same dataset, you shall register your dataset on the wiki page and make sure that each dataset is used by at most three groups.

## Packages
The following packages are used in this project

In [1]:
! pip install -r ./requirements.txt



## General data

In [2]:
# define a random state
random_state = 12014500

In [None]:
import numpy as np
np.random.seed(random_state)

## Dataset
We work with the COMPAS dataset (raw COMPAS scores, loaded from `./data/compas-scores-raw.csv`).

In [3]:
import pandas as pd

file_path = './data/compas-scores-raw.csv'
df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,Person_ID,AssessmentID,Case_ID,Agency_Text,LastName,FirstName,MiddleName,Sex_Code_Text,Ethnic_Code_Text,DateOfBirth,...,RecSupervisionLevel,RecSupervisionLevelText,Scale_ID,DisplayText,RawScore,DecileScore,ScoreText,AssessmentType,IsCompleted,IsDeleted
0,50844,57167,51950,PRETRIAL,Fisher,Kevin,,Male,Caucasian,12/05/92,...,1,Low,7,Risk of Violence,-2.08,4,Low,New,1,0
1,50844,57167,51950,PRETRIAL,Fisher,Kevin,,Male,Caucasian,12/05/92,...,1,Low,8,Risk of Recidivism,-1.06,2,Low,New,1,0
2,50844,57167,51950,PRETRIAL,Fisher,Kevin,,Male,Caucasian,12/05/92,...,1,Low,18,Risk of Failure to Appear,15.0,1,Low,New,1,0
3,50848,57174,51956,PRETRIAL,KENDALL,KEVIN,,Male,Caucasian,09/16/84,...,1,Low,7,Risk of Violence,-2.84,2,Low,New,1,0
4,50848,57174,51956,PRETRIAL,KENDALL,KEVIN,,Male,Caucasian,09/16/84,...,1,Low,8,Risk of Recidivism,-1.5,1,Low,New,1,0


In [4]:
df.dtypes

Person_ID                    int64
AssessmentID                 int64
Case_ID                      int64
Agency_Text                 object
LastName                    object
FirstName                   object
MiddleName                  object
Sex_Code_Text               object
Ethnic_Code_Text            object
DateOfBirth                 object
ScaleSet_ID                  int64
ScaleSet                    object
AssessmentReason            object
Language                    object
LegalStatus                 object
CustodyStatus               object
MaritalStatus               object
Screening_Date              object
RecSupervisionLevel          int64
RecSupervisionLevelText     object
Scale_ID                     int64
DisplayText                 object
RawScore                   float64
DecileScore                  int64
ScoreText                   object
AssessmentType              object
IsCompleted                  int64
IsDeleted                    int64
dtype: object

In [5]:
df.columns

Index(['Person_ID', 'AssessmentID', 'Case_ID', 'Agency_Text', 'LastName',
       'FirstName', 'MiddleName', 'Sex_Code_Text', 'Ethnic_Code_Text',
       'DateOfBirth', 'ScaleSet_ID', 'ScaleSet', 'AssessmentReason',
       'Language', 'LegalStatus', 'CustodyStatus', 'MaritalStatus',
       'Screening_Date', 'RecSupervisionLevel', 'RecSupervisionLevelText',
       'Scale_ID', 'DisplayText', 'RawScore', 'DecileScore', 'ScoreText',
       'AssessmentType', 'IsCompleted', 'IsDeleted'],
      dtype='object')

In [28]:
nominal_features = ['Agency_Text', 'Sex_Code_Text', 'Ethnic_Code_Text', 'AssessmentReason', 'Language', 'LegalStatus', 'CustodyStatus', 'MaritalStatus', 'RecSupervisionLevel']

In the following code snippet, we do some light preprocessing. First, we drop redundant and unnecessary attributes. For example, IDs that are unique to a person or a case are ignored, because they carry no information other than identifying that person or case. We also remove the names, because they directly reveal who a record belongs to and are not useful for training a simple model. Furthermore, we convert the dates to Unix timestamps and remove all other possible "target" attributes (such as `RawScore` and `DecileScore`), keeping `ScoreText` as the label.

In [6]:
import numpy as np
columns_to_drop = ['Person_ID', 'AssessmentID', 'Case_ID', 'LastName', 'FirstName', 'MiddleName', 'ScaleSet', 'RecSupervisionLevelText', 'DisplayText', 'RawScore', 'DecileScore', 'AssessmentType', 'IsCompleted', 'IsDeleted'] 

df = df.drop(columns_to_drop, axis=1)
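# Convert to Unix timestamps in seconds (datetime64[ns] -> int64 nanoseconds -> seconds);
# 'int64' is requested explicitly because plain `int` can map to 32 bits on some platforms
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth']).astype('int64') // 10**9
df['Screening_Date'] = pd.to_datetime(df['Screening_Date']).astype('int64') // 10**9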

df = df.dropna(subset=['ScoreText'])
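One caveat of the timestamp conversion above: the birth dates use two-digit years (e.g. `12/05/92`), and the usual `%y` convention maps `00`-`68` to 2000-2068. This would explain why the synthetic samples printed later in this notebook contain `DateOfBirth` values around 3.08e9, which correspond to dates in the 2060s. The following is a minimal sketch of a century correction using only the standard library. The helper name `dob_to_unix` and the cut-off `reference_year=2014` are assumptions (the screening timestamps visible in this notebook fall in 2013 and 2014), not part of the original pipeline.

```python
from datetime import datetime, timezone


def dob_to_unix(text, reference_year=2014):
    """Parse an MM/DD/YY date of birth and return a Unix timestamp in seconds.

    Hypothetical helper: strptime maps two-digit years 00-68 to 2000-2068,
    so any birth year after `reference_year` is moved back by one century.
    """
    parsed = datetime.strptime(text, "%m/%d/%y").replace(tzinfo=timezone.utc)
    if parsed.year > reference_year:
        # A birth date after the screening period must belong to the 1900s
        parsed = parsed.replace(year=parsed.year - 100)
    return int(parsed.timestamp())


# Born in 1992: unchanged by the correction
print(dob_to_unix("12/05/92"))
# Born in 1950: parsed as 2050 by %y, corrected to 1950 (negative timestamp)
print(dob_to_unix("03/14/50"))
```

Under these assumptions, `df['DateOfBirth'].map(dob_to_unix)` could stand in for the `pd.to_datetime(...)` conversion of that column, provided the column contains no missing values.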

In [7]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=random_state)
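The preview of the raw file shows several rows per `Person_ID` (one per scale), so a row-level split like the one above can place rows of the same person in both the training and the test set. The sketch below shows one possible group-aware alternative using only the standard library. It assumes the person identifier is kept until after the split, which this notebook does not do; the function name `group_split` is hypothetical.

```python
import random


def group_split(group_ids, test_fraction=0.2, seed=12014500):
    """Return (train_idx, test_idx) such that no group appears on both sides."""
    unique_ids = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Reserve a fraction of the groups (not of the rows) for testing
    n_test = max(1, round(test_fraction * len(unique_ids)))
    test_groups = set(unique_ids[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Toy example: three rows per person, as in the preview of the raw file
person_ids = [50844, 50844, 50844, 50848, 50848, 50848, 1, 1, 2, 2]
train_idx, test_idx = group_split(person_ids)
```

The returned positions could be passed to `df.iloc[...]` to build the two frames. scikit-learn's `GroupShuffleSplit` offers comparable functionality.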

In [8]:
import numpy as np
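# Replace the literal string 'NaN' with a real missing value
# (check the data for other placeholders such as '?').
# Reassigning avoids in-place modification of the frames returned by train_test_split.
df_train = df_train.replace('NaN', np.nan)
df_test = df_test.replace('NaN', np.nan)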

## Baseline Classifier

In [133]:
def split_data(df_train, df_test):
    X_train = df_train.drop('ScoreText', axis=1)
    X_train = pd.get_dummies(X_train, columns=nominal_features)
    y_train = df_train['ScoreText']

    X_test = df_test.drop('ScoreText', axis=1)
    X_test = pd.get_dummies(X_test, columns=nominal_features)
    y_test = df_test['ScoreText']
    
    return X_train, y_train, X_test, y_test

In [134]:
from sklearn.metrics import accuracy_score

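def train_and_evaluate(clf, df_train, df_test, remove_diff_cols=False):
    X_train, y_train, X_test, y_test = split_data(df_train, df_test)
    if remove_diff_cols:
        # Align the test columns with the train columns: drop dummy columns
        # unseen during training and add missing ones filled with 0
        X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # accuracy_score returns a fraction in [0, 1], so no percent sign is printed
    print(f"Accuracy: {accuracy:0.2f}")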


In [33]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, df_train, df_test)

Accuracy: 0.8449013157894737
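The accuracy above is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. It is a hypothetical helper that makes no assumption about the actual class distribution of `ScoreText`, which this notebook does not print.

```python
from collections import Counter


def majority_class_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy labels for illustration only
toy_train = ["Low", "Low", "High"]
toy_test = ["Low", "Medium", "Low", "High"]
baseline = majority_class_accuracy(toy_train, toy_test)
```

With the notebook's frames, the call would be `majority_class_accuracy(df_train['ScoreText'].tolist(), df_test['ScoreText'].tolist())`; a classifier trained on synthetic data that does not beat this value has learned little from the features.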


## Data synthesis

### Metadata

In [34]:
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df_train)

metadata.visualize(
    show_table_details='summarized',
    output_filepath='my_metadata.png'
)
metadata.validate_data(data=df_train)

### GaussianCopulaSynthesizer

In [35]:
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df_train)
synthetic_data = synthesizer.sample(num_rows=len(df_train))


We strongly recommend saving the metadata using 'save_to_json' for replicability in future SDV versions.



In [None]:
display(synthetic_data.head())

Unnamed: 0,Agency_Text,Sex_Code_Text,Ethnic_Code_Text,DateOfBirth,ScaleSet_ID,AssessmentReason,Language,LegalStatus,CustodyStatus,MaritalStatus,Screening_Date,RecSupervisionLevel,Scale_ID,ScoreText
0,PRETRIAL,Female,Caucasian,3087591231,22,Intake,English,Pretrial,Pretrial Defendant,Married,1367421196,1,7,Low
1,Probation,Male,Native American,512573810,22,Intake,English,Other,Probation,Single,1365116352,1,7,Low
2,DRRD,Male,African-American,574738297,17,Intake,English,Post Sentence,Probation,Single,1375433376,3,18,High
3,PRETRIAL,Male,Hispanic,507498928,22,Intake,English,Pretrial,Pretrial Defendant,Divorced,1380907287,1,8,Low
4,PRETRIAL,Male,African-American,572225354,22,Intake,English,Pretrial,Jail Inmate,Unknown,1416249008,1,8,Low
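As a quick sanity check of samples like the ones shown above, the category frequencies of a single column can be compared between real and synthetic data. The sketch below computes the total variation distance (TVD) between two categorical samples using only the standard library. The function name is hypothetical, and this is not the SDMetrics implementation, although SDMetrics reports a related score (`TVComplement`, i.e. 1 - TVD), as far as we are aware.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Return 0.5 * sum over categories of |p_real - p_synthetic|."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    # Counter returns 0 for categories missing from one of the samples
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Toy columns for illustration only
real = ["Low", "Low", "Medium", "High"]
synth = ["Low", "Low", "Low", "High"]
tvd = total_variation_distance(real, synth)
```

A value of 0 means identical category frequencies and 1 means disjoint categories. The same function could be applied to any nominal column of the training and synthetic frames after converting the columns to lists.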


#### Classifier

In [36]:
clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, synthetic_data, df_test)

Accuracy: 0.6847861842105263


### CTGANSynthesizer

In [58]:
from sdv.single_table import CTGANSynthesizer

synthesizer_ctgan = CTGANSynthesizer(metadata, epochs=200, verbose=True)
synthesizer_ctgan.fit(df_train)

synthetic_data_ctgan = synthesizer_ctgan.sample(num_rows=len(df_train))

Gen. (-1.76) | Discrim. (-0.24): 100%|██████████| 200/200 [10:11<00:00,  3.06s/it]


In [59]:
display(synthetic_data_ctgan.head())

Unnamed: 0,Agency_Text,Sex_Code_Text,Ethnic_Code_Text,DateOfBirth,ScaleSet_ID,AssessmentReason,Language,LegalStatus,CustodyStatus,MaritalStatus,Screening_Date,RecSupervisionLevel,Scale_ID,ScoreText
0,PRETRIAL,Male,African-American,3083589624,22,Intake,English,Pretrial,Probation,Married,1369844811,1,7,Low
1,PRETRIAL,Male,Native American,3120306771,22,Intake,English,Pretrial,Pretrial Defendant,Married,1362471562,1,18,Low
2,Probation,Male,Caucasian,692416916,22,Intake,English,Post Sentence,Probation,Married,1377745723,3,8,Medium
3,PRETRIAL,Male,African-American,622767518,22,Intake,English,Pretrial,Pretrial Defendant,Married,1370277698,1,7,Low
4,PRETRIAL,Male,African-American,3034402793,22,Intake,English,Pretrial,Jail Inmate,Unknown,1416264499,1,18,Low


#### Classifier

In [60]:
clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, synthetic_data_ctgan, df_test)

Accuracy: 0.7552631578947369


### CopulaGANSynthesizer

In [61]:
from sdv.single_table import CopulaGANSynthesizer

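# Note: the "_tvae" suffix is a leftover name; these variables hold the
# CopulaGAN synthesizer and its samples, not a TVAE model
synthesizer_tvae = CopulaGANSynthesizer(metadata, epochs=200, verbose=True)
synthesizer_tvae.fit(df_train)

synthetic_data_tvae = synthesizer_tvae.sample(num_rows=len(df_train))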

Gen. (-1.31) | Discrim. (-0.18): 100%|██████████| 200/200 [12:11<00:00,  3.66s/it]


In [62]:
display(synthetic_data_tvae.head())

Unnamed: 0,Agency_Text,Sex_Code_Text,Ethnic_Code_Text,DateOfBirth,ScaleSet_ID,AssessmentReason,Language,LegalStatus,CustodyStatus,MaritalStatus,Screening_Date,RecSupervisionLevel,Scale_ID,ScoreText
0,Probation,Male,Caucasian,2799653771,22,Intake,English,Post Sentence,Probation,Married,1357019144,1,7,Low
1,PRETRIAL,Female,Native American,451822190,22,Intake,English,Pretrial,Pretrial Defendant,Single,1387839313,1,18,Low
2,Probation,Female,Caucasian,827719971,22,Intake,English,Post Sentence,Probation,Single,1406086151,1,8,Low
3,PRETRIAL,Male,Hispanic,596773614,22,Intake,English,Pretrial,Jail Inmate,Divorced,1410770343,1,8,Medium
4,PRETRIAL,Male,Hispanic,2781048197,22,Intake,English,Pretrial,Jail Inmate,Unknown,1398808805,1,18,Low


#### Classifier

In [63]:
clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, synthetic_data_tvae, df_test)

Accuracy: 0.7770559210526315


### DataSynthesizer (correlated attribute mode)

In [66]:
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator
from DataSynthesizer.ModelInspector import ModelInspector
from DataSynthesizer.lib.utils import read_json_file, display_bayesian_network

import pandas as pd

In [99]:
file_path_preprocessed = './data/preprocessed_data.csv'
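# Locations of the two output files (dataset description and synthetic data).
# Note: this rebinds `synthetic_data`, which held the GaussianCopula samples, to a file path.
mode = 'correlated_attribute_mode'
description_file = f'./out/{mode}/description.json'
synthetic_data = f'./out/{mode}/sythetic_data.csv'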

In [72]:
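# Attributes with fewer distinct values than this are treated as categorical
threshold_value = 10

categorical_attributes = {col: True for col in df_train.columns if df_train[col].nunique() < threshold_value}

# No attribute is a candidate key (the ID columns were dropped during preprocessing)
candidate_keys = {}
# Differential-privacy parameter of DataSynthesizer; epsilon = 0 turns the noise off
epsilon = 1

# Maximum number of parents per node in the Bayesian network
degree_of_bayesian_network = 2

# Generate as many rows as the training set contains
num_tuples_to_generate = len(df_train)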

In [73]:
df_train.to_csv(file_path_preprocessed, index=False)

In [74]:
describer = DataDescriber(category_threshold=threshold_value)
describer.describe_dataset_in_correlated_attribute_mode(dataset_file=file_path_preprocessed, 
                                                        epsilon=epsilon, 
                                                        k=degree_of_bayesian_network,
                                                        attribute_to_is_categorical=categorical_attributes,
                                                        attribute_to_is_candidate_key=candidate_keys)
describer.save_dataset_description_to_file(description_file)

Adding ROOT ScoreText
Adding attribute RecSupervisionLevel
Adding attribute Scale_ID
Adding attribute DateOfBirth
Adding attribute MaritalStatus
Adding attribute Ethnic_Code_Text
Adding attribute LegalStatus
Adding attribute CustodyStatus
Adding attribute Agency_Text
Adding attribute Screening_Date
Adding attribute ScaleSet_ID
Adding attribute Language
Adding attribute Sex_Code_Text
Adding attribute AssessmentReason


In [75]:
display_bayesian_network(describer.bayesian_network)

Constructed Bayesian network:
    RecSupervisionLevel has parents ['ScoreText'].
    Scale_ID            has parents ['RecSupervisionLevel', 'ScoreText'].
    DateOfBirth         has parents ['Scale_ID', 'ScoreText'].
    MaritalStatus       has parents ['DateOfBirth', 'RecSupervisionLevel'].
    Ethnic_Code_Text    has parents ['MaritalStatus', 'RecSupervisionLevel'].
    LegalStatus         has parents ['Ethnic_Code_Text', 'DateOfBirth'].
    CustodyStatus       has parents ['LegalStatus', 'Scale_ID'].
    Agency_Text         has parents ['CustodyStatus', 'Ethnic_Code_Text'].
    Screening_Date      has parents ['CustodyStatus', 'Ethnic_Code_Text'].
    ScaleSet_ID         has parents ['CustodyStatus', 'Scale_ID'].
    Language            has parents ['Screening_Date', 'Scale_ID'].
    Sex_Code_Text       has parents ['ScaleSet_ID', 'CustodyStatus'].
    AssessmentReason    has parents ['CustodyStatus', 'MaritalStatus'].


In [76]:
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
generator.save_synthetic_data(synthetic_data)

In [100]:
synthetic_df = pd.read_csv(synthetic_data)
attribute_description = read_json_file(description_file)['attribute_description']

#### Classifier

In [101]:
clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, synthetic_df, df_test)

Accuracy: 0.84


### DataSynthesizer (independent attribute mode)

In [79]:
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator
from DataSynthesizer.ModelInspector import ModelInspector
from DataSynthesizer.lib.utils import read_json_file, display_bayesian_network

import pandas as pd

In [106]:
file_path_preprocessed = './data/preprocessed_data.csv'

mode = 'independent_attribute_mode'
description_file = f'./out/{mode}/description.json'
synthetic_data = f'./out/{mode}/sythetic_data.csv'

In [127]:
threshold_value = 15

categorical_attributes = { col: True for col in df_train.columns if df_train[col].value_counts().shape[0] < threshold_value }

candidate_keys = {}

num_tuples_to_generate = len(df_train)

In [128]:
df_train.to_csv(file_path_preprocessed, index=False)

In [129]:
describer = DataDescriber(category_threshold=threshold_value)
describer.describe_dataset_in_independent_attribute_mode(dataset_file=file_path_preprocessed,
                                                         attribute_to_is_categorical=categorical_attributes,
                                                         attribute_to_is_candidate_key=candidate_keys)
describer.save_dataset_description_to_file(description_file)

In [130]:
generator = DataGenerator()
generator.generate_dataset_in_independent_mode(num_tuples_to_generate, description_file)
generator.save_synthetic_data(synthetic_data)

In [131]:
synthetic_df = pd.read_csv(synthetic_data)
attribute_description = read_json_file(description_file)['attribute_description']

#### Classifier

In [135]:
clf = RandomForestClassifier(n_estimators=250, max_depth=15, random_state=random_state)
train_and_evaluate(clf, synthetic_df, df_test, remove_diff_cols=False)

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- CustodyStatus_Residential Program
- Ethnic_Code_Text_African-Am
- Ethnic_Code_Text_Native American
- Ethnic_Code_Text_Oriental
- Language_Spanish
- ...
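The `ValueError` above arises because `pd.get_dummies` is applied to the training and test frames separately: a category that occurs only in the test set (or only in the synthetic training set) produces a dummy column that the other frame lacks. One way to handle this is to reindex the test matrix to the training columns, as in the minimal sketch below. The helper name `align_one_hot` and the toy frames are illustrative assumptions.

```python
import pandas as pd


def align_one_hot(X_train, X_test):
    """Give X_test exactly the columns of X_train, in the same order.

    Dummy columns unseen during training are dropped; columns missing
    from the test set are added and filled with 0.
    """
    return X_test.reindex(columns=X_train.columns, fill_value=0)


# Toy example: 'Spanish' occurs only in the test data
train = pd.get_dummies(pd.DataFrame({"Language": ["English", "English"]}))
test = pd.get_dummies(pd.DataFrame({"Language": ["English", "Spanish"]}))
aligned = align_one_hot(train, test)
```

Dropping unseen categories means the classifier sees such rows with all dummies of that attribute set to 0. An alternative design is to fit a single encoder (for example scikit-learn's `OneHotEncoder` with `handle_unknown='ignore'`) on the training data and reuse it for the test data.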
