## MST
MST is a differentially private synthesizer that relies on [Private-PGM](https://github.com/ryan112358/private-pgm) to (privately) find the likeliest distribution over the data given a set of measured marginals. Details on the method, and how it won a NIST competition, can be found in [this paper](https://arxiv.org/abs/2108.04978).

### Why "MST"
The acronym “MST” stands for “Maximum-Spanning-Tree”: the method produces differentially private synthetic data by relying on a maximum spanning tree of mutual information.

MST finds the maximum spanning tree on a graph where nodes are data attributes and edge weights correspond to the approximate mutual information between the two attributes an edge connects. We say approximate because the “maximum spanning tree” is built using the exponential mechanism, which selects edges with high mutual information in a differentially private manner, so the result is not guaranteed to be the exact maximum. The marginals are measured using the Gaussian mechanism.
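
To make the idea above concrete, here is a minimal sketch of building a spanning tree by selecting edges with the exponential mechanism. It follows the description in this section rather than the actual Private-PGM or SmartNoise code: the helper names (`mutual_information`, `exponential_mechanism`, `private_spanning_tree`) are hypothetical, the toy data is made up, and `sensitivity=1.0` is a placeholder assumption, not a calibrated bound.

```python
import math
import random
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def private_spanning_tree(columns, epsilon, sensitivity=1.0, seed=0):
    """Grow a spanning tree, choosing each edge with the exponential mechanism.

    ASSUMPTION: sensitivity=1.0 is a placeholder for illustration only.
    """
    rng = random.Random(seed)
    names = list(columns)
    scores = {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(names, 2)
    }
    component = {name: name for name in names}  # component label per node
    rounds = len(names) - 1
    tree = []
    for _ in range(rounds):
        # Only edges that join two different components keep the result a tree.
        candidates = [e for e in scores if component[e[0]] != component[e[1]]]
        edge = exponential_mechanism(
            candidates, [scores[e] for e in candidates],
            epsilon / rounds, sensitivity, rng,
        )
        tree.append(edge)
        old, new = component[edge[0]], component[edge[1]]
        for name in names:  # merge the two components
            if component[name] == old:
                component[name] = new
    return tree


toy = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to "a", so highly informative
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(private_spanning_tree(toy, epsilon=1.0))
```

With a large `epsilon` the selection concentrates on the highest-scoring edges and approaches the true maximum spanning tree; with a small `epsilon` the choice is closer to uniform over the valid edges.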

### Specifying Domain
MST is easy to use, although it does require the data owner to specify the domain of each attribute a priori.

Here, we walk through a basic example of MST, and how to **properly specify the domain** using either a JSON file or a Python dictionary.


In [None]:
import subprocess
import os

import pandas as pd

from snsynth.mst import MSTSynthesizer

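# Locate the repository root so the dataset path works from any working directory.
git_root_dir = subprocess.check_output(["git", "rev-parse", "--show-toplevel"]).decode("utf-8").strip()

csv_path = os.path.join(git_root_dir, "datasets", "PUMS.csv")

df = pd.read_csv(csv_path)
# MST does not handle the continuous income column, so drop it.
df = df.drop(["income"], axis=1)
# Shuffle the rows reproducibly.
df = df.sample(frac=1, random_state=42)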

### Creating a correct `Domain` JSON file
Here, we specify a domains dictionary that maps a name for each of our datasets to the file path of its domain file.

In [2]:
Domains = {
    "pums": "pums-domain.json"
}

Our PUMS CSV data looks like this, and it can easily be loaded into a pandas DataFrame.

```
age,sex,educ,race,married
59,1,9,1,1
31,0,1,3,0
36,1,11,1,1
54,1,11,1,1
```

For this data, the domain file `pums-domain.json` looks like this:

```json
{
    "age": 95,
    "sex": 2,
    "educ": 17,
    "race": 7,
    "married": 2
}
```
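
One possible way to produce this file is to write a Python dictionary out with the standard `json` module. This is only a sketch: the variable names are illustrative, and the values are copied from the example above rather than derived from the data.

```python
import json

# Domain entries copied from the pums-domain.json example above.
pums_domain = {"age": 95, "sex": 2, "educ": 17, "race": 7, "married": 2}

# Write the dictionary to the file name referenced in the Domains dictionary.
with open("pums-domain.json", "w") as f:
    json.dump(pums_domain, f, indent=4)

# Read it back to confirm the file round-trips.
with open("pums-domain.json") as f:
    loaded = json.load(f)

print(loaded)
```

Keeping the domain in a dictionary first makes it easy to review the entries before they are written to disk.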

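Each column in our data **has to be included in our domain file**, and for each attribute we must specify the size of its domain, *m*: the number of distinct values it can take. For the zero-based integer codes used here, that is the maximum value plus one (for example, `sex` takes the values 0 and 1, so its entry is 2).

MST will then impose a [0, *m* - 1] range on each attribute when synthesizing.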
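
Before fitting, it can help to check that every column appears in the domain and that every value fits. The sketch below uses a hypothetical helper, `check_domain`, and assumes that each domain entry counts the allowed integer codes, so valid values run from 0 to the entry minus one (consistent with `"sex": 2` for a 0/1 column). It is not part of the SmartNoise API.

```python
import csv
import io

domain = {"age": 95, "sex": 2, "educ": 17, "race": 7, "married": 2}

# The first rows of the PUMS data shown earlier.
sample_csv = """age,sex,educ,race,married
59,1,9,1,1
31,0,1,3,0
36,1,11,1,1
54,1,11,1,1
"""


def check_domain(rows, domain):
    """Return a list of problems; an empty list means every value fits the domain.

    ASSUMPTION: a domain entry m means valid codes are 0 .. m - 1.
    """
    problems = []
    for i, row in enumerate(rows):
        for col, raw in row.items():
            if col not in domain:
                problems.append(f"row {i}: column {col!r} is missing from the domain")
                continue
            value = int(raw)
            if not 0 <= value < domain[col]:
                problems.append(f"row {i}: {col}={value} is outside [0, {domain[col] - 1}]")
    return problems


rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(check_domain(rows, domain))
```

Note that running such a check on private data reveals information about that data, so the domain itself should come from public knowledge such as a codebook.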

Note that MST does not work with continuous data, only categorical and low-dimensional ordinal data. It is up to the data owner to properly (and privately) bin continuous data for use with MST, if they so desire. Here, we have simply dropped the `income` column.
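
As an illustration of binning, the sketch below maps a continuous value to an integer bin code using fixed bin edges. The edges, the `bin_income` helper, and the sample values are all hypothetical; the important assumption is that the edges are chosen from public knowledge rather than computed from the private data.

```python
from bisect import bisect_right

# Hypothetical bin edges, assumed to be chosen without looking at the private data.
INCOME_EDGES = [10_000, 25_000, 50_000, 100_000]


def bin_income(value, edges=INCOME_EDGES):
    """Map a continuous value to an integer bin code in [0, len(edges)]."""
    return bisect_right(edges, value)


incomes = [0.0, 9_999.99, 10_000.0, 42_000.0, 250_000.0]
codes = [bin_income(v) for v in incomes]

# Number of bins; this would be the "income" entry in the domain file,
# assuming domain entries count the distinct codes.
domain_size = len(INCOME_EDGES) + 1

print(codes, domain_size)
```

Fewer, wider bins keep the domain small, which generally suits MST better than many narrow bins.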

### Synthesizing with MST
Once the domain file is specified, synthesizing with MST is as easy as with any other SmartNoise synthesizer.

Specify an epsilon, a delta (if you like), and point to the domain file you created!

In [3]:
mst_synth = MSTSynthesizer(domains_dict=Domains, 
                           domain='pums',
                           epsilon=1.0,
                           delta=1e-9)

mst_synth.fit(df)

sample_size = len(df)
synth_data = mst_synth.sample(sample_size)

Domain(age: 95, sex: 2, educ: 17, race: 7, married: 2)
Index(['age', 'sex', 'educ', 'race', 'married'], dtype='object')


In [4]:
df.describe()

| | age | sex | educ | race | married |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 44.797 | 0.514 | 9.888 | 1.954 | 0.549 |
| std | 17.745385 | 0.500054 | 3.415424 | 1.155517 | 0.497842 |
| min | 18.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 31.0 | 0.0 | 9.0 | 1.0 | 0.0 |
| 50% | 42.0 | 1.0 | 11.0 | 1.0 | 1.0 |
| 75% | 55.0 | 1.0 | 13.0 | 3.0 | 1.0 |
| max | 93.0 | 1.0 | 16.0 | 6.0 | 1.0 |


In [5]:
synth_data.describe()

| | age | sex | educ | race | married |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 47.211 | 0.544 | 9.629 | 1.908 | 0.569 |
| std | 27.810104 | 0.498309 | 3.827271 | 1.075496 | 0.495464 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 23.0 | 0.0 | 9.0 | 1.0 | 0.0 |
| 50% | 47.5 | 1.0 | 11.0 | 1.0 | 1.0 |
| 75% | 71.0 | 1.0 | 13.0 | 3.0 | 1.0 |
| max | 94.0 | 1.0 | 16.0 | 5.0 | 1.0 |
