# Tutorial for Disjoint Generative Models
In this notebook we show the basic functionality of the DGMs codebase.

### Example 1: Getting started with DGMs

First, we run a rudimentary example of DGMs on a simple dataset. We specify two models, ```synthpop``` and ```privbayes```, each responsible for one part of the dataset.

Unless otherwise specified, the dataset manager module randomly splits the columns of the dataset into equal-sized parts, one for each model.
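To make the idea of a random equal split concrete, here is a minimal sketch in plain Python. The helper `random_equal_split` and the `seed` argument are hypothetical and only illustrate the concept; the dataset manager's actual implementation may differ.

```python
import random


def random_equal_split(columns, n_parts, seed=None):
    """Shuffle the column names and deal them into n_parts near-equal groups.

    Hypothetical helper for illustration, not part of the DGMs codebase.
    """
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    # Strided slicing deals the columns round-robin, so part sizes
    # differ by at most one.
    return {f"part{i + 1}": shuffled[i::n_parts] for i in range(n_parts)}


heart_columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

split = random_equal_split(heart_columns, n_parts=2, seed=42)
for name, cols in split.items():
    print(name, len(cols), cols)
```

With 14 columns and two models, each part receives 7 columns.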

In [None]:
# Imports
import pandas as pd
from disjoint_generative_model import DisjointGenerativeModels

In [None]:
# Load the training data
df_train = pd.read_csv('experiments/datasets/heart_train.csv')

# Define DGMs using the Synthpop CART model and PrivBayes BN
dgms = DisjointGenerativeModels(df_train, generative_models=['synthpop', 'privbayes'])
df_syn = dgms.fit_generate(num_samples=20)

df_syn.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,58,1,0,140,265,0,1,111,1,0.4,2,0,3,0
1,57,0,0,132,307,0,0,159,0,0.0,2,2,2,1
2,58,1,0,130,252,0,1,111,0,0.4,1,4,3,1
3,47,1,2,138,250,0,0,166,0,0.1,2,0,2,1
4,68,1,0,150,276,0,0,151,0,2.0,1,3,3,0


If we want to specify the split ourselves, we can pass a dictionary that maps each part name to its column names via the ```prepared_splits``` argument.

```python
prepared_splits = {
    "part1": ["age", "sex", "cp", "trestbps", "chol"],
    "part2": ["fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
}

dgms = DisjointGenerativeModels(df_train, generative_models=['synthpop', 'privbayes'], prepared_splits=prepared_splits)
```
Alternatively, we can specify the split by passing a dictionary with model names as keys and the corresponding column names as values.

```python
gms_splits = {
    "synthpop": ["age", "sex", "cp", "trestbps", "chol"],
    "privbayes": ["fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
}

dgms = DisjointGenerativeModels(df_train, generative_models=gms_splits)
```
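A hand-written split is easy to get wrong, for example by listing a column twice or forgetting one. The sketch below checks a split dictionary before it is passed on, assuming that every column should belong to exactly one part. The helper `check_disjoint_split` is hypothetical and not part of the DGMs codebase; whether the library performs a similar check itself is not shown here.

```python
from collections import Counter


def check_disjoint_split(columns, splits):
    """Raise ValueError unless `splits` assigns every column to exactly one part.

    Hypothetical helper for illustration only.
    """
    assigned = [col for cols in splits.values() for col in cols]
    counts = Counter(assigned)
    duplicates = sorted(col for col, n in counts.items() if n > 1)
    missing = sorted(set(columns) - set(assigned))
    unknown = sorted(set(assigned) - set(columns))
    if duplicates or missing or unknown:
        raise ValueError(
            f"duplicates={duplicates}, missing={missing}, unknown={unknown}"
        )
    return True


all_columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

example_splits = {
    "synthpop": ["age", "sex", "cp", "trestbps", "chol"],
    "privbayes": ["fbs", "restecg", "thalach", "exang", "oldpeak",
                  "slope", "ca", "thal", "target"],
}

split_is_valid = check_disjoint_split(all_columns, example_splits)
print(split_is_valid)
```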
With both of the above methods, it is also possible to specify the number of equal-sized parts rather than the specific columns. For example, the following sends 2 parts to the synthpop model and 1 part to the PrivBayes model:

In [None]:
dgms = DisjointGenerativeModels(df_train, generative_models={'synthpop': 2, 'privbayes': 1}) 
df_syn = dgms.fit_generate(num_samples=5)
 
df_syn



Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,1,0,105,259,0,0,126,1,2.2,1,3,2,0
1,43,1,2,144,248,0,0,179,0,0.0,2,0,2,0
2,68,0,2,100,247,0,1,162,0,1.8,2,0,2,1
3,63,1,2,174,249,0,2,137,0,0.6,1,1,3,0
4,49,1,0,174,264,0,0,195,0,0.0,2,0,2,1


Note that this produces a ```UserWarning```, since a perfect 2:1 split ratio is not achievable: the dataset has 14 columns, and 14 is not divisible by 3.
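The arithmetic behind the warning can be sketched as follows. The helper `ideal_part_sizes` is hypothetical: it only computes the ideal (possibly fractional) share of columns per model and warns when the ratio cannot be met exactly. How the library rounds the shares to whole columns is not assumed here.

```python
import warnings


def ideal_part_sizes(n_columns, weights):
    """Return each model's ideal share of columns for the given weights.

    Hypothetical helper; emits a UserWarning when the ratio is not
    exactly achievable with whole columns.
    """
    total = sum(weights.values())
    if n_columns % total != 0:
        warnings.warn(
            f"{n_columns} columns cannot be split exactly into {total} equal parts",
            UserWarning,
        )
    return {name: n_columns * w / total for name, w in weights.items()}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sizes = ideal_part_sizes(14, {"synthpop": 2, "privbayes": 1})

print(sizes)        # ideal shares are fractional: 28/3 and 14/3
print(len(caught))  # number of warnings recorded
```

With 12 columns the same ratio would be exact (8 and 4 columns) and no warning would be emitted.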

Finally, we can import the function used for randomly splitting the dataset and call it ourselves. This is helpful when we want to reuse the same split across multiple models without specifying the columns manually.

In [6]:
from disjoint_generative_model.utils.dataset_manager import random_split_columns

random_split = random_split_columns(df_train, {'part1': 2, 'part2': 1, 'part3': 1})
random_split



{'part1': ['cp', 'ca', 'thalach', 'slope', 'sex', 'restecg', 'chol', 'thal'],
 'part2': ['fbs', 'trestbps', 'age'],
 'part3': ['exang', 'oldpeak', 'target']}

### Example 2: Joining Strategies

The DGMs framework allows for virtually any sort of joining procedure. This library implements the following joining strategies:

Unsupervised:
- ```Concatenating```: Simply concatenates the synthetic data generated by each model.
- ```RandomJoining```: Same as Concatenating, but shuffles the data before concatenating.

Supervised:
- ```UsingJoiningValidator```: Joins the synthetic data using a validator model. The validator is a classifier trained on correctly joined input data and used to predict the probability that a candidate join of the synthetic data is correct. Accepted joins are removed from the pool of possible joins, and the process is repeated until the termination criteria are met.
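The accept-and-repeat loop described for the supervised strategy can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function `join_with_validator`, the toy `toy_score` function standing in for the validator's predicted probability, and the interpretation of `patience` as the number of consecutive rounds without any accepted join are all hypothetical.

```python
import random


def join_with_validator(left, right, score, threshold=0.5, patience=5, seed=0):
    """Pair rows from two parts, keeping only pairs the validator accepts.

    `score(l_row, r_row)` stands in for the validator's predicted probability
    that the pair is a correct join. Hypothetical sketch for illustration.
    """
    if len(left) != len(right):
        raise ValueError("both parts must contain the same number of rows")
    rng = random.Random(seed)
    left, right = list(left), list(right)
    accepted, stale_rounds = [], 0
    # Assumed termination criteria: the pool is empty, or `patience`
    # consecutive rounds accepted nothing.
    while left and stale_rounds < patience:
        rng.shuffle(right)  # propose a fresh random pairing
        rejected_left, rejected_right = [], []
        accepted_before = len(accepted)
        for l_row, r_row in zip(left, right):
            if score(l_row, r_row) >= threshold:
                accepted.append({**l_row, **r_row})
            else:
                rejected_left.append(l_row)
                rejected_right.append(r_row)
        # Only rejected rows stay in the pool for the next round.
        left, right = rejected_left, rejected_right
        stale_rounds = 0 if len(accepted) > accepted_before else stale_rounds + 1
    return accepted, left, right


# Toy data: the stand-in validator accepts a pair when age >= 50
# coincides with target == 1.
left_rows = [{"age": a} for a in (43, 47, 49, 57, 58, 63, 64, 68)]
right_rows = [{"target": t} for t in (0, 0, 0, 1, 1, 1, 1, 1)]


def toy_score(l_row, r_row):
    return 1.0 if (l_row["age"] >= 50) == (r_row["target"] == 1) else 0.0


joined, rest_left, rest_right = join_with_validator(left_rows, right_rows, toy_score)
print(len(joined), "accepted,", len(rest_left), "left in the pool")
```

Every accepted row satisfies the stand-in validator's rule, and any rows still in the pool when the loop stops would need a fallback, such as one of the unsupervised strategies.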


In [None]:
# Imports
import pandas as pd

from sklearn.svm import OneClassSVM

from disjoint_generative_model import DisjointGenerativeModels
from disjoint_generative_model.utils.joining_validator import JoiningValidator
from disjoint_generative_model.utils.joining_strategies import UsingJoiningValidator

In [None]:
# Load the training data
df_train = pd.read_csv('experiments/datasets/heart_train.csv')


gms = {'synthpop': 2, 'privbayes': 1}

OCSVM = OneClassSVM(kernel='rbf', nu=0.01)
JS = UsingJoiningValidator(JoiningValidator(OCSVM, threshold=0.5), patience=5)

dgms = DisjointGenerativeModels(df_train, gms, joining_strategy=JS)

df_syn = dgms.fit_generate()
df_syn

ValueError: The classifier model must have a predict_proba method
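The error is consistent with scikit-learn's `OneClassSVM`, which exposes `decision_function`, `score_samples` and `predict`, but no `predict_proba`. The sketch below shows a duck-typing check of the kind the message suggests. The helper `supports_predict_proba` and the two stub classes are hypothetical stand-ins; how the library performs its own check is not shown in this tutorial.

```python
def supports_predict_proba(model):
    """Return True if the model exposes a callable predict_proba method."""
    return callable(getattr(model, "predict_proba", None))


class ProbabilisticStub:
    """Stand-in for a classifier that can output class probabilities."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class ScoreOnlyStub:
    """Stand-in for a model that only offers decision scores."""

    def decision_function(self, X):
        return [0.0 for _ in X]


checks = {
    type(model).__name__: supports_predict_proba(model)
    for model in (ProbabilisticStub(), ScoreOnlyStub())
}
print(checks)
```

Running the same check on a candidate validator model before wrapping it in ```JoiningValidator``` would surface this problem earlier.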