# Tutorial 10: Sequential Synthesis
In this tutorial, we explore the **Sequential Synthesis** approach using
the `syn_seq` plugin in `synthcity`. Sequential synthesis allows us to
model variables one-by-one (column-by-column), using conditional relationships
learned from the real data. The main idea is:
1. Synthesize the first variable (often with sample-without-replacement, "SWR"),
2. Then synthesize the second variable conditioned on the first,
3. And so on for each subsequent variable.
This approach can better preserve complex dependencies among columns than
simple marginal or naive methods.
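
To make the three steps above concrete, here is a minimal toy sketch of the idea in plain Python. It is not the `syn_seq` implementation: it swaps the plugin's per-column models for simple empirical frequency tables, it samples the first column with replacement, and the function names `fit_sequential` and `sample_sequential` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_sequential(rows, order):
    """For each column, count its values given all earlier columns."""
    tables = []
    for i, col in enumerate(order):
        table = defaultdict(Counter)
        for row in rows:
            # The context is the tuple of values of the columns synthesized before `col`.
            context = tuple(row[c] for c in order[:i])
            table[context][row[col]] += 1
        tables.append(table)
    return tables


def sample_sequential(tables, order, count, seed=0):
    """Draw rows column by column, each conditioned on the columns drawn so far."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(count):
        row = {}
        for i, col in enumerate(order):
            context = tuple(row[c] for c in order[:i])
            counts = tables[i][context]
            values = list(counts)
            weights = [counts[v] for v in values]
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Toy "real" data, invented for illustration only.
real_rows = [
    {"sex": "F", "smoker": "no", "risk": "low"},
    {"sex": "F", "smoker": "yes", "risk": "high"},
    {"sex": "M", "smoker": "no", "risk": "low"},
    {"sex": "M", "smoker": "no", "risk": "high"},
]
order = ["sex", "smoker", "risk"]

tables = fit_sequential(real_rows, order)
synthetic_rows = sample_sequential(tables, order, count=5)
```

Because this toy version conditions on the exact values of all earlier columns, it can only reproduce combinations that already occur in the real data. A fitted model per column, as used by sequential synthesizers in practice, generalizes beyond the observed combinations.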
We'll demonstrate this using the **diabetes** dataset, just like other tutorials,
and compare the resulting synthetic data.

In [None]:
!pip install synthcity

In [None]:
# stdlib
import warnings

warnings.filterwarnings("ignore")

# third party
from sklearn.datasets import load_diabetes

# synthcity absolute
from synthcity.plugins import Plugins

from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

eval_plugin = "syn_seq"

### Load dataset


In [None]:
# synthcity absolute
from synthcity.plugins.core.dataloader import Syn_SeqDataLoader

X, y = load_diabetes(return_X_y=True, as_frame=True)
X["target"] = y

loader = Syn_SeqDataLoader(X, target_column="target", sensitive_columns=["sex"])

loader.dataframe()

### Train the generator


In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

syn_model = Plugins().get(eval_plugin)

syn_model.fit(loader)

### Generate new samples


In [None]:
syn_model.generate(count=1000).dataframe()

In [None]:
# third party
import matplotlib.pyplot as plt

syn_model.plot(plt, loader)

plt.show()

### Benchmarks

In [None]:
# synthcity absolute
from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [
        (eval_plugin, eval_plugin, {"n_iter": 50})
    ],  # (testname, plugin, plugin_args) REPLACE {"n_iter" : 50} with {} for better performance
    loader,
    repeats=2,
    metrics={"detection": ["detection_mlp"]},  # DELETE THIS LINE FOR ALL METRICS
)

In [None]:
Benchmarks.print(score)

# User Modification

From here on, we use the 'Adult' dataset, which contains many highly skewed categorical variables.
As tutorial5_differential_privacy shows, this kind of dataset closely resembles real-world data and is very hard to synthesize.

## Load Dataset

Running the dataloader automatically shows the order of synthesis and the variable selection matrix. The variable selection matrix indicates which variables are used as predictors when each variable is synthesized.
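
As a rough illustration of what a variable selection matrix encodes, the sketch below builds one from a synthesis order in plain Python. It assumes a default in which every variable is predicted by all variables that precede it in the order; the helper names `default_predictor_matrix` and `apply_selection` are hypothetical, and the layout printed by `Syn_SeqDataLoader` may differ.

```python
def default_predictor_matrix(order):
    """Rows are the variables being synthesized, columns are candidate predictors.

    An entry of 1 means the column variable is used to predict the row variable.
    Assumed default: every earlier variable in the order is a predictor.
    """
    return {
        target: {pred: int(j < i) for j, pred in enumerate(order)}
        for i, target in enumerate(order)
    }


def apply_selection(matrix, order, selection):
    """Overwrite the rows of the targets listed in a user-provided selection."""
    for target, predictors in selection.items():
        matrix[target] = {pred: int(pred in predictors) for pred in order}
    return matrix


# Toy column names, invented for illustration only.
order = ["age", "sex", "income"]

default = default_predictor_matrix(order)
custom = apply_selection(default_predictor_matrix(order), order, {"income": ["age"]})

for target in order:
    print(target, [custom[target][pred] for pred in order])
```

The first variable in the order has an all-zero row, which matches the idea that it is synthesized without any predictors.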

In [None]:
# Load the reference data
# Note: preprocessing data with OneHotEncoder or StandardScaler is not needed or recommended. Synthcity handles feature encoding and standardization internally.
from synthcity.utils.datasets.categorical.categorical_adult import CategoricalAdultDataloader

X = CategoricalAdultDataloader().load()

X.head()

## Preprocess the data for special values and imbalanced dataset

We provide a preprocessor that prepares the dataset for better-quality sequential synthesis.
Preprocessing includes data type assignment, flagging of encoded special values, and handling of imbalanced variables.

In [None]:
# synthcity absolute
from synthcity.plugins.core.models.syn_seq.syn_seq_preprocess import SynSeqPreprocessor

prep = SynSeqPreprocessor(
    user_dtypes={
        "workclass": "category",
        "occupation": "category",
        "relationship": "category",
        "native-country": "category",
        "race": "category",
        "marital-status": "category",
        "sex": "category",
        "income>50K": "category",
    },
    user_special_values={
        "capital-gain": [0],
        "capital-loss": [0],
    },
    max_categories=15,
)

# Preprocess (date -> offset, numeric split, etc.)
X_processed = prep.preprocess(X, oversample=True)
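
The special-value handling mentioned above can be pictured as splitting one numeric column into a flag part and a numeric part, then merging them back after synthesis. The sketch below shows that idea on a plain list. It is an assumption about the general technique, not a description of what `SynSeqPreprocessor` does internally, and the helper names are hypothetical.

```python
def split_special_values(values, special):
    """Split a column into a flag column and a numeric column.

    The flag holds the special value where one occurs, otherwise None.
    The numeric column holds the ordinary values, otherwise None.
    """
    flags, numeric = [], []
    for v in values:
        if v in special:
            flags.append(v)
            numeric.append(None)
        else:
            flags.append(None)
            numeric.append(v)
    return flags, numeric


def merge_special_values(flags, numeric):
    """Inverse of split_special_values: rebuild the original column."""
    return [n if f is None else f for f, n in zip(flags, numeric)]


# Toy column where 0 is a spike that would distort a purely numeric model.
capital_gain = [0, 0, 5013, 0, 2174]

flags, numeric = split_special_values(capital_gain, special=[0])
restored = merge_special_values(flags, numeric)
```

Separating the spike at 0 lets a categorical model decide whether a row takes the special value, while a numeric model only has to fit the remaining values.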

## Define the dataloader with user custom

After preprocessing the dataset, the user can define the order of synthesis and the method applied to each variable.
Variables with many categories, such as 'native-country', are recommended to come first.

In [None]:
user_custom = {
    # Order in which the variables are synthesized.
    'syn_order' : ['native-country', 'sex', 'workclass', 'education-num', 'marital-status', 'age',
       'occupation', 'relationship', 'fnlwgt', 'race', 'capital-loss', 'hours-per-week', 'income>50K', 'capital-gain'],

    # Method to use for specific variables; 'CART' is the default.
    'method' : {"relationship": "rf",
                "race": "pmm"
                },

    # Variables to use as predictors when synthesizing each listed variable.
    'variable_selection' : {
      'capital-loss': ['age', 'sex', 'workclass', 'education-num', 'marital-status',
         'occupation', 'relationship', 'fnlwgt', 'race'],
      'hours-per-week': ['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status',
         'occupation', 'relationship', 'race', 'sex'],
      # 'native-country' comes first in syn_order, so it has no earlier variables to use as predictors.
      'income>50K': ['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'hours-per-week', 'native-country'],
      'capital-gain': ['age', 'sex', 'workclass', 'education-num', 'marital-status',
       'occupation', 'relationship', 'fnlwgt', 'race', 'capital-loss', 'hours-per-week', 'native-country', 'income>50K']
         }
}
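
Since each variable can only be conditioned on variables synthesized before it, it can help to check a custom configuration for predictors that come too late in the order. The sketch below is a hypothetical helper written for this tutorial; how `Syn_SeqDataLoader` itself treats such entries is not covered here. It assumes every name in the selection also appears in the order.

```python
def check_predictors(syn_order, variable_selection):
    """Return, per target, the predictors that do not precede it in syn_order."""
    position = {col: i for i, col in enumerate(syn_order)}
    problems = {}
    for target, predictors in variable_selection.items():
        late = [p for p in predictors if position[p] >= position[target]]
        if late:
            problems[target] = late
    return problems


# Toy configuration, invented for illustration only.
toy_order = ["country", "sex", "age", "income"]
toy_selection = {
    "age": ["country", "sex"],  # fine: both come before 'age'
    "sex": ["age", "sex"],  # 'age' comes later and 'sex' is the target itself
}

problems = check_predictors(toy_order, toy_selection)
print(problems)
```

An empty result means every listed predictor is already available when its target is synthesized.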

In [None]:
loader = Syn_SeqDataLoader(X_processed,
                           user_custom=user_custom,
                           target_column="income>50K", sensitive_columns=["sex", "race"])

loader.dataframe().head()

## Existing plugins

In [None]:
# synthcity absolute
from synthcity.plugins import Plugins

generators = Plugins()

generators.list()

In [None]:
syn_model = Plugins().get("syn_seq")

In [None]:
syn_model.fit(loader)

In [None]:
synthetic_loader = syn_model.generate(count=len(X))

## Post processing

The user can also apply rules and merge back the columns that were temporarily created during preprocessing.

In [None]:
synthetic_df = synthetic_loader.dataframe()

In [None]:
user_rules = {
  "marital-status":[
    ("age", "<=", 18),
    ("marital-status", "=", 2)
  ]
}

In [None]:
synthetic_df = prep.postprocess(synthetic_df, rules=user_rules)

In [None]:
synthetic_df.head()
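
To show how rules written as `(column, operator, value)` tuples can be evaluated, here is a small stdlib-only sketch. It treats the tuples of one rule as conditions that must all hold for a row, which is an assumption: this tutorial does not state how `prep.postprocess` combines or enforces them. The names `OPS` and `satisfies` are hypothetical.

```python
import operator

# Map the operator strings used in the rule tuples to Python functions.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def satisfies(row, conditions):
    """True if the row meets every (column, operator, value) condition."""
    return all(OPS[op](row[col], value) for col, op, value in conditions)


# Toy rows, invented for illustration only.
toy_rows = [
    {"age": 17, "marital-status": 2},
    {"age": 17, "marital-status": 1},
    {"age": 40, "marital-status": 2},
]
conditions = [("age", "<=", 18), ("marital-status", "=", 2)]

results = [satisfies(row, conditions) for row in toy_rows]
```

A check like this can be used to count how many synthetic rows meet a set of conditions before and after post-processing.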

## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement towards Machine learning and AI for medicine, you can do so in the following ways!

### Star [Synthcity](https://github.com/vanderschaarlab/synthcity) on GitHub

- The easiest way to help our community is just by starring the Repos! This helps raise awareness of the tools we're building.


### Check out other projects from vanderschaarlab
- [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)
- [AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
