# Tutorial for Disjoint Generative Models
In this notebook we show the basic functionality of the DGMs codebase.

### Example 1: Getting started with DGMs

We start with a rudimentary example of DGMs on a simple dataset. We specify two models, ```synthpop``` and ```privbayes```, each responsible for one part of the dataset.

Unless otherwise specified, the dataset manager module will randomly split the dataset into equal parts for each model.
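
To make the idea of a random equal split concrete, here is a minimal standard-library sketch. The helper `split_columns_equally` is hypothetical and is not the library's dataset manager; it only shuffles the column names and deals them into parts whose sizes differ by at most one.

```python
import random

def split_columns_equally(columns, n_parts, seed=None):
    """Hypothetical helper: shuffle column names and deal them into n_parts lists."""
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    # Strided slicing gives parts whose sizes differ by at most one
    return {f"split{i}": shuffled[i::n_parts] for i in range(n_parts)}

heart_columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

splits = split_columns_equally(heart_columns, n_parts=2, seed=0)
for name, cols in splits.items():
    print(name, len(cols), cols)
```

With 14 columns and two models, each part receives 7 columns, which matches the sizes in the output of the cell below.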

In [1]:
# Imports
import pandas as pd
from disjoint_generative_model import DisjointGenerativeModels

In [None]:
# Load the training data
df_train = pd.read_csv('experiments/datasets/heart_train.csv')

# Define DGMs using the Synthpop CART model and PrivBayes BN
dgms = DisjointGenerativeModels(df_train, generative_models=['synthpop', 'privbayes'])
df_syn = dgms.fit_generate(num_samples=20)
print(dgms.used_splits)
df_syn.head()

{'split0': ['oldpeak', 'exang', 'age', 'trestbps', 'sex', 'cp', 'fbs'], 'split1': ['slope', 'target', 'thalach', 'chol', 'thal', 'ca', 'restecg']}


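|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | 1 | 0 | 140 | 302 | 0 | 1 | 163 | 1 | 3.8 | 1 | 0 | 2 | 0 |
| 1 | 58 | 0 | 1 | 100 | 220 | 0 | 1 | 88 | 0 | 0.6 | 1 | 2 | 3 | 0 |
| 2 | 41 | 0 | 2 | 110 | 224 | 0 | 1 | 114 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 3 | 35 | 1 | 2 | 118 | 219 | 0 | 1 | 131 | 0 | 0.0 | 1 | 1 | 3 | 1 |
| 4 | 62 | 0 | 0 | 117 | 198 | 0 | 2 | 133 | 1 | 1.4 | 1 | 1 | 3 | 0 |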


If we want to specify the split ourselves, we can pass the model a dictionary that maps part names to lists of column names.

```python
prepared_splits = {
    "part1": ["age", "sex", "cp", "trestbps", "chol"],
    "part2": ["fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
}

dgms = DisjointGenerativeModels(df_train, generative_models=['synthpop', 'privbayes'], prepared_splits=prepared_splits)
```
Alternatively, we can specify the split by passing a dictionary with model names as keys and the corresponding column names as values (note that with this method the same model cannot be assigned to two different partitions).

```python
gms_splits = {
    "synthpop": ["age", "sex", "cp", "trestbps", "chol"],
    "privbayes": ["fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
}

dgms = DisjointGenerativeModels(df_train, generative_models=gms_splits)
```
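
The restriction on reusing a model follows from how Python dictionaries work: keys are unique, so repeating a model name silently overwrites the earlier entry. A small illustration in plain Python (no DGMs code involved, and the column lists are shortened for the example):

```python
# Repeating a key in a dict literal keeps only the last value
gms_splits = {
    "synthpop": ["age", "sex"],
    "synthpop": ["cp", "chol"],   # overwrites the entry above
    "privbayes": ["target"],
}
print(gms_splits)  # {'synthpop': ['cp', 'chol'], 'privbayes': ['target']}
```

A list of model names can contain duplicates, so the list form combined with ```prepared_splits``` presumably does not share this limitation.
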
It is also possible to specify the number of equal-sized parts, rather than specific columns, in both of the above methods. For example, the following sends 2 parts to the synthpop model and 1 part to the PrivBayes model:

In [None]:
dgms = DisjointGenerativeModels(df_train, generative_models={'synthpop': 2, 'privbayes': 1})
df_syn = dgms.fit_generate(num_samples=5)
print(dgms.used_splits)

df_syn

...
{'split0': ['thal', 'restecg', 'thalach', 'slope', 'target', 'ca', 'age', 'chol', 'oldpeak', 'cp'], 'split1': ['sex', 'trestbps', 'exang', 'fbs']}


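|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 0 | 3 | 130 | 258 | 0 | 0 | 162 | 0 | 1.9 | 1 | 0 | 2 | 1 |
| 1 | 53 | 0 | 1 | 135 | 175 | 0 | 0 | 131 | 0 | 0.0 | 1 | 0 | 2 | 1 |
| 2 | 44 | 0 | 2 | 115 | 295 | 0 | 1 | 152 | 1 | 0.0 | 2 | 1 | 2 | 1 |
| 3 | 43 | 1 | 2 | 140 | 226 | 0 | 1 | 170 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 4 | 52 | 0 | 0 | 160 | 256 | 1 | 0 | 164 | 0 | 0.0 | 2 | 1 | 2 | 0 |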


Finally, we can also import the method used for randomly splitting the dataset and use it to split the dataset ourselves. This is helpful if we want to use the same split for multiple models, but we don't want to specify the split manually.

In [19]:
from disjoint_generative_model.utils.dataset_manager import random_split_columns

random_split = random_split_columns(df_train, {'part1': 2, 'part2': 1, 'part3': 1})
random_split

{'part1': ['sex', 'thalach', 'exang', 'chol', 'cp', 'fbs', 'age', 'slope'],
 'part2': ['thal', 'restecg', 'target'],
 'part3': ['ca', 'trestbps', 'oldpeak']}
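
The sketch below shows one way integer weights could be turned into part sizes. The function `weighted_random_split` is hypothetical, and its allocation rule (whole units per weight, leftover columns to the first part) is an assumption chosen because it reproduces the part sizes seen in the outputs above (10 and 4 columns for weights 2:1, and 8, 3 and 3 columns for weights 2:1:1). The library's actual rule may differ.

```python
import random

def weighted_random_split(columns, weights, seed=None):
    """Hypothetical sketch: split column names into parts sized by integer weights."""
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)

    unit = len(shuffled) // sum(weights.values())   # columns per unit of weight
    sizes = {name: unit * w for name, w in weights.items()}
    # Assumption: any leftover columns go to the first part
    first = next(iter(sizes))
    sizes[first] += len(shuffled) - sum(sizes.values())

    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = shuffled[start:start + size]
        start += size
    return splits

heart_columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

split = weighted_random_split(heart_columns, {"part1": 2, "part2": 1, "part3": 1}, seed=0)
print({name: len(cols) for name, cols in split.items()})
# {'part1': 8, 'part2': 3, 'part3': 3}
```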

### Example 2: Joining Strategies

The DGMs framework allows for virtually any sort of joining procedure. In this library the following joining strategies are implemented:

- ```Concatenating```: Simply concatenates the synthetic data generated by each model.
- ```RandomJoining```: Same as Concatenating, but shuffles the data before concatenating.

- ```UsingJoiningValidator```: Joins the synthetic data using a validator model. The validator can use one of three adapters: ```JoiningValidator```, ```OneClassValidator``` and ```OutlierValidator```. The first accepts binary classification model backends, the second one-class/outlier detection models, and the third outlier detection methods such as isolation forest. The validator repeatedly assigns prediction scores to query joins of the synthetic samples, subject to various control parameters. Accepted joins are removed from the pool before the next round.
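
The two simpler strategies in the list above can be pictured with plain Python. This is a conceptual sketch, not the library's code: it assumes each model produces the same number of rows over a disjoint set of columns, and that concatenating means placing the parts side by side. The function names are hypothetical, and the toy rows reuse a few values from the first output table.

```python
import random

def concatenate_parts(part_a, part_b):
    """Pair row i of part_a with row i of part_b (column-wise concatenation)."""
    return [{**a, **b} for a, b in zip(part_a, part_b)]

def random_join_parts(part_a, part_b, seed=None):
    """Shuffle the rows of one part before pairing them with the other."""
    rng = random.Random(seed)
    shuffled_b = list(part_b)
    rng.shuffle(shuffled_b)
    return concatenate_parts(part_a, shuffled_b)

# Toy stand-ins for the synthetic output of two models on disjoint columns
part_a = [{"age": 51, "sex": 1}, {"age": 58, "sex": 0}, {"age": 41, "sex": 0}]
part_b = [{"chol": 302, "target": 0}, {"chol": 220, "target": 0}, {"chol": 224, "target": 1}]

print(concatenate_parts(part_a, part_b))
print(random_join_parts(part_a, part_b, seed=1))
```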

The ```UsingJoiningValidator``` strategy has various control parameters that can be overridden by the user, but for most regular use the ```'behaviour'``` argument acts as a shorthand for selecting pre-configured option sets. The following behaviours are available:
- ```'adaptive'```: The parameters are adjusted during the joining process to get more items, and the selection threshold is inferred automatically.
- ```'standard'```: Inherits the default settings from the ```JoiningValidator``` or ```OneClassValidator``` adapter.
- ```'strict'```: No parameters are changed during the joining process. This is likely to yield too few good joins; consider adjusting the ```'join_multiplier'``` attribute of the DGMs object.
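
The round-based procedure described above can be sketched as follows. This is a conceptual illustration only: `validator_join`, `toy_score` and the toy data are hypothetical, the scoring rule stands in for a trained validator model, and the threshold stays fixed across rounds (an adaptive variant would adjust parameters between rounds).

```python
import random

def validator_join(part_a, part_b, score_fn, threshold, max_rounds=10, seed=None):
    """Conceptual sketch: propose random joins, keep those scored at or above
    the threshold, and retry with the leftover rows. Assumes equal-length parts."""
    rng = random.Random(seed)
    pool_a, pool_b = list(part_a), list(part_b)
    accepted = []
    for _ in range(max_rounds):
        if not pool_a:
            break
        rng.shuffle(pool_b)                    # propose a new random pairing
        next_a, next_b = [], []
        for a, b in zip(pool_a, pool_b):
            candidate = {**a, **b}
            if score_fn(candidate) >= threshold:
                accepted.append(candidate)     # accepted joins leave the pool
            else:
                next_a.append(a)
                next_b.append(b)
        pool_a, pool_b = next_a, next_b
    return accepted, len(pool_a)

def toy_score(row):
    # Hypothetical validator: a join is plausible when age and chol are both high or both low
    return 1.0 if (row["age"] >= 50) == (row["chol"] >= 240) else 0.0

part_a = [{"age": 35}, {"age": 62}, {"age": 44}, {"age": 58}]
part_b = [{"chol": 200}, {"chol": 260}, {"chol": 210}, {"chol": 280}]

accepted, leftover = validator_join(part_a, part_b, toy_score, threshold=0.5, seed=0)
print(len(accepted), "accepted,", leftover, "left in the pool")
```

Rows that are never accepted remain in the pool when the rounds run out, which is why a fixed-parameter run can end with fewer joins than requested.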


In [20]:
# Imports
import pandas as pd

from disjoint_generative_model import DisjointGenerativeModels
from disjoint_generative_model.utils.joining_validator import JoiningValidator, OneClassValidator
from disjoint_generative_model.utils.joining_strategies import UsingJoiningValidator

In [None]:
# Load the training data
df_train = pd.read_csv('experiments/datasets/heart_train.csv')

gms = {'synthpop': 2, 'privbayes': 1}

JS = UsingJoiningValidator()    # JoiningValidator with random forest model is used by default
dgms1 = DisjointGenerativeModels(df_train, gms, joining_strategy=JS)

df_syn1 = dgms1.fit_generate()

df_syn1

...
Threshold auto-set to: 0.9611650485436893


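|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 0 | 1 | 112 | 208 | 0 | 1 | 146 | 0 | 0.0 | 1 | 0 | 1 | 1 |
| 1 | 43 | 1 | 0 | 126 | 169 | 0 | 0 | 123 | 1 | 1.0 | 0 | 0 | 3 | 0 |
| 2 | 44 | 1 | 0 | 100 | 217 | 0 | 0 | 178 | 0 | 0.0 | 2 | 2 | 0 | 0 |
| 3 | 47 | 0 | 2 | 130 | 270 | 0 | 0 | 145 | 0 | 0.6 | 0 | 0 | 1 | 1 |
| 4 | 60 | 1 | 1 | 132 | 273 | 1 | 1 | 156 | 0 | 0.8 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 237 | 68 | 0 | 1 | 130 | 223 | 0 | 1 | 130 | 0 | 1.1 | 1 | 1 | 0 | 1 |
| 238 | 67 | 0 | 3 | 138 | 232 | 0 | 1 | 131 | 0 | 4.2 | 1 | 2 | 2 | 1 |
| 239 | 58 | 1 | 1 | 105 | 166 | 0 | 1 | 186 | 0 | 0.6 | 2 | 1 | 2 | 1 |
| 240 | 59 | 1 | 3 | 150 | 246 | 1 | 1 | 149 | 0 | 4.2 | 1 | 0 | 2 | 1 |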


In [None]:
JS = UsingJoiningValidator(OneClassValidator(), behaviour='adaptive')
dgms2 = DisjointGenerativeModels(df_train, gms, joining_strategy=JS)

df_syn2 = dgms2.fit_generate()

df_syn2.head()

...
Threshold auto-set to: 2.1409624827961125
Predicted good joins fraction: 0.10055096418732783
Predicted good joins fraction: 0.04900459418070444
Predicted good joins fraction: 0.02254428341384863
...
Predicted good joins fraction: 0.0



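|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | 1 | 2 | 140 | 253 | 0 | 1 | 145 | 0 | 0.0 | 1 | 0 | 2 | 1 |
| 1 | 53 | 0 | 0 | 144 | 230 | 0 | 0 | 140 | 0 | 3.6 | 2 | 2 | 2 | 0 |
| 2 | 53 | 0 | 1 | 138 | 223 | 0 | 1 | 155 | 0 | 1.6 | 2 | 0 | 2 | 0 |
| 3 | 49 | 0 | 0 | 124 | 259 | 0 | 1 | 138 | 0 | 0.0 | 1 | 0 | 2 | 1 |
| 4 | 56 | 0 | 0 | 115 | 256 | 0 | 0 | 167 | 1 | 0.0 | 1 | 0 | 2 | 1 |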


In [None]:
# Final example: change the validator model backend, here to the SVM classifier from sklearn
from sklearn.svm import SVC

JS = UsingJoiningValidator(
    JoiningValidator(
        classifier_model_base=SVC(kernel='linear',
                                  probability=True,
                                  class_weight='balanced'),
        verbose=False)
)
# Use a separate name so the earlier dgms1 object is not overwritten
dgms3 = DisjointGenerativeModels(df_train, gms, joining_strategy=JS)

df_syn3 = dgms3.fit_generate()

df_syn3.head()

...
Threshold auto-set to: 0.5921787709497207


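|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 0 | 1 | 102 | 269 | 0 | 0 | 192 | 1 | 0.6 | 1 | 1 | 2 | 1 |
| 1 | 63 | 1 | 3 | 130 | 193 | 0 | 1 | 134 | 1 | 3.2 | 1 | 0 | 3 | 0 |
| 2 | 45 | 1 | 1 | 110 | 234 | 0 | 1 | 99 | 1 | 4.2 | 0 | 0 | 3 | 0 |
| 3 | 48 | 1 | 1 | 148 | 197 | 0 | 0 | 178 | 1 | 0.0 | 2 | 2 | 2 | 0 |
| 4 | 49 | 0 | 0 | 101 | 209 | 0 | 1 | 154 | 0 | 0.0 | 2 | 1 | 2 | 0 |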


We can compare the three generated datasets on a selection of metrics, using the [SynthEval Library](https://github.com/schneiderkamplab/syntheval/tree/main).

In [None]:
from syntheval import SynthEval

# Metrics
metrics = {
    "h_dist"     : {},
    "corr_diff"  : {"mixed_corr": True},
    "auroc_diff" : {"model": "rf_cls"},
    "cls_acc"    : {"F1_type": "macro"},
    "eps_risk"   : {},
    "dcr"        : {},
    "mia"        : {"num_eval_iter": 5},
}

df_train = pd.read_csv('experiments/datasets/heart_train.csv')
df_test = pd.read_csv('experiments/datasets/heart_test.csv')

SE = SynthEval(df_train, df_test)
res, _ = SE.benchmark({'cls_rf': df_syn1, 
                       'cls_svm': df_syn3,
                       'occls_svm': df_syn2,}, 
                       analysis_target_var="target",
                       rank_strategy='summation', 
                       **metrics)

print("""Inferred categorical columns (unique threshold: 10):
['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']""")
res

Inferred categorical columns (unique threshold: 10):
['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']


| dataset | avg_h_dist (value) | avg_h_dist (error) | corr_mat_diff (value) | corr_mat_diff (error) | auroc (value) | auroc (error) | avg_F1_diff (value) | avg_F1_diff (error) | avg_F1_diff_hout (value) | avg_F1_diff_hout (error) | ... | median_DCR (value) | median_DCR (error) | mia_recall (value) | mia_recall (error) | mia_precision (value) | mia_precision (error) | rank | u_rank | p_rank | f_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cls_rf | 0.098259 | 0.033148 | 1.399292 |  | -0.028141 |  | -0.134679 | 0.019503 | -0.06882 | 0.012446 | ... | 1.310746 |  | 0.2625 | 0.045928 | 0.577273 | 0.050616 | 8.549027 | 4.41979 | 4.129237 | 0.0 |
| cls_svm | 0.062594 | 0.025121 | 1.302288 |  | -0.006141 |  | -0.127209 | 0.017154 | -0.04399 | 0.01017 | ... | 1.350807 |  | 0.3125 | 0.044194 | 0.483452 | 0.07303 | 8.520403 | 4.508283 | 4.01212 | 0.0 |
| occls_svm | 0.024114 | 0.006303 | 1.436192 |  | -0.044918 |  | -0.112339 | 0.017606 | -0.067719 | 0.013859 | ... | 1.054135 |  | 0.2875 | 0.025 | 0.485781 | 0.057365 | 8.390216 | 4.501258 | 3.888958 | 0.0 |


According to this benchmark, the dataset generated with the default model is slightly better on privacy metrics (p_rank 4.13), the SVM classifier model is marginally better on utility (u_rank 4.51), and the one-class SVM model underperforms slightly on privacy in comparison (p_rank 3.89).
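
The comparison can be reproduced from the rank columns alone. The snippet below copies the ```u_rank``` and ```p_rank``` values from the table above into a plain dictionary and picks the best dataset per column. It assumes that higher rank values are better, which is how the conclusion above reads them.

```python
# Utility and privacy ranks copied from the benchmark table above
ranks = {
    "cls_rf":    {"u_rank": 4.41979,  "p_rank": 4.129237},
    "cls_svm":   {"u_rank": 4.508283, "p_rank": 4.01212},
    "occls_svm": {"u_rank": 4.501258, "p_rank": 3.888958},
}

# Assumption: a higher rank value means a better result
best = {col: max(ranks, key=lambda name: ranks[name][col])
        for col in ("u_rank", "p_rank")}
print(best)  # {'u_rank': 'cls_svm', 'p_rank': 'cls_rf'}
```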