# Perform Multi-Table Synthesization

In this exercise, we walk through the synthesis of a relational table structure. For that, we use a slightly trimmed-down version of the Berka dataset [[1](#refs)], which contains Czech bank transactions. It consists of 8 tables in total: one of them (`district`) serves as a reference table, and all others contain privacy-sensitive information.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/berka-original.png' width="600px"/>

There are two ways to perform multi-table synthesization with MOSTLY AI:
1. via an **ad-hoc job** by manually uploading data files (such as CSV) and defining the relationship between the tables.
2. by **connecting to a relational database** and importing the relationships from the db directly.
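
For the ad-hoc approach, the relationships between the tables have to be spelled out by hand. As a rough illustration, the Berka relations used in this tutorial can be written down as plain Python data. This is not a MOSTLY AI configuration format; the names `PRIMARY_KEYS`, `FOREIGN_KEYS` and `validate_relations` are hypothetical and exist only for this sketch. The check also assumes that each foreign key column carries the same name as the primary key it references, which holds for this dataset.

```python
# Hypothetical, hand-written description of the Berka relations.
# This is NOT a MOSTLY AI configuration format, just plain data for illustration.
PRIMARY_KEYS = {
    "account": "account_id",
    "card": "card_id",
    "client": "client_id",
    "disposition": "disposition_id",
    "district": "district_id",
    "loan": "loan_id",
    "orders": "orders_id",
    "transaction": "transaction_id",
}

# (child table, foreign key column, parent table)
FOREIGN_KEYS = [
    ("account", "district_id", "district"),
    ("client", "district_id", "district"),
    ("disposition", "account_id", "account"),
    ("disposition", "client_id", "client"),
    ("card", "disposition_id", "disposition"),
    ("transaction", "account_id", "account"),
    ("loan", "account_id", "account"),
    ("orders", "account_id", "account"),
]


def validate_relations(primary_keys, foreign_keys):
    """Return a list of problems found in the relation definitions."""
    problems = []
    for child, fk_col, parent in foreign_keys:
        if child not in primary_keys:
            problems.append(f"unknown child table: {child}")
        if parent not in primary_keys:
            problems.append(f"unknown parent table: {parent}")
        elif primary_keys[parent] != fk_col:
            # assumption: FK columns are named like the referenced primary key
            problems.append(f"{child}.{fk_col} does not match key of {parent}")
    return problems


problems = validate_relations(PRIMARY_KEYS, FOREIGN_KEYS)
print(problems or "all relations are consistent")
```

Writing the relations down once like this makes it easier to define them consistently when uploading the individual files.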

You will explore both approaches in this tutorial, with a focus on the database connector approach, as it is the most common way of working with multiple tables. To help with setting up the database infrastructure, the tutorial first provides helper scripts that create two public database instances and load the original data into one of them, and then walks through the required job configuration.

Once the multi-table data has been synthesized, you will check the synthetic data for referential integrity, as well as for the retention of specific statistical properties that span multiple tables.

## Import Data to a Database

If you don't have a DB server available, go to your preferred cloud provider (AWS, GCP, Azure, etc.) and launch an instance there first. Make sure that clients can connect externally via username / password credentials and have the required rights to create, update and delete database instances there.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/sql1.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/sql2.png' width="400px"/><br /><img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/sql3.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/sql4.png' width="400px"/>

Once in place, please update the following variables accordingly. For security, we have stored the password in an environment variable. You can of course use your own preferred method for passing the credentials securely.

In [None]:
# !pip install python-dotenv

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()

db_host = "34.122.91.200"
db_usr = "postgres"
db_pwd = os.environ["DB_PASS"]
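
If `DB_PASS` is not set, the lookup above fails with a bare `KeyError`. A small wrapper can produce a more helpful message. This is only a sketch: `require_env` is a hypothetical helper name and not part of `python-dotenv`, and the demo uses a throwaway variable instead of a real secret.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"environment variable {name} is not set; "
            "define it in your shell or in a local .env file"
        )
    return value


# demo with a throwaway variable name so that no real secret is involved
os.environ["DEMO_DB_PASS"] = "not-a-real-password"
demo_pwd = require_env("DEMO_DB_PASS")
print("password loaded:", bool(demo_pwd))
```

In the notebook, this would be called as `require_env("DB_PASS")` after `load_dotenv()`.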

Let's then create two database instances:
1. an instance that will contain the original data, and 
2. another instance that will serve as a destination for the synthetic tables.

For that, we need SQLAlchemy 2.x and the `psycopg2` PostgreSQL driver.

In [None]:
# # install required Python packages
# !pip install --pre psycopg2-binary sqlalchemy==2.0.9

In [None]:
import sqlalchemy
import psycopg2
from sqlalchemy import text
from sqlalchemy import create_engine
print(f"SQLAlchemy v{sqlalchemy.__version__}")
assert sqlalchemy.__version__.startswith('2.')

def create_db(host, user, pwd, db_name, if_exists="fail"):
    con = psycopg2.connect(f"postgresql://{user}:{pwd}@{host}:5432/postgres")
    con.autocommit = True
    cur = con.cursor()
    # pass the database name as a query parameter instead of formatting it into the SQL string
    cur.execute("SELECT 1 FROM pg_catalog.pg_database WHERE datname = %s", (db_name,))
    exists = cur.fetchone()
    if exists and if_exists == "fail":
        raise Exception(f"database {db_name} already exists")
    elif exists and if_exists == "replace":
        cur.execute("DROP DATABASE " + db_name)
    cur.execute("CREATE DATABASE " + db_name)
    con.close()

def connect_db(host, user, pwd, db_name):
    engine = create_engine(f"postgresql://{user}:{pwd}@{host}:5432/{db_name}")
    return engine
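
Both helpers interpolate the password directly into the connection URL. That works for simple passwords, but characters such as `@`, `/` or `:` change how the URL is parsed. One possible workaround, sketched here with the standard library only, is to percent-encode the credentials first. `build_pg_url` is a hypothetical helper, and the sketch assumes that the consuming library decodes percent-encoded credentials again.

```python
from urllib.parse import quote_plus, unquote_plus, urlsplit


def build_pg_url(host, user, pwd, db_name, port=5432):
    """Build a PostgreSQL URL with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(pwd)}"
        f"@{host}:{port}/{db_name}"
    )


# made-up password containing characters that are special inside a URL
url = build_pg_url("localhost", "postgres", "p@ss/w:rd", "berka_original")
print(url)

# round trip: host and port are parsed correctly, and the password decodes back
parts = urlsplit(url)
print(parts.hostname, parts.port, unquote_plus(parts.password))
```

SQLAlchemy also offers `sqlalchemy.engine.URL.create(...)`, which builds the URL from separate components and avoids manual escaping altogether.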

### Create Source and Destination Database

Set `if_exists='replace'` to drop and re-create a database that already exists. With the default `if_exists='fail'`, an error is raised instead.

In [None]:
db_name_source = 'berka_original'
create_db(db_host, db_usr, db_pwd, db_name_source, if_exists="replace")

In [None]:
db_name_destination = 'berka_synthetic'
create_db(db_host, db_usr, db_pwd, db_name_destination, if_exists="replace")

### Load Data into Source Database

In [None]:
# check whether we are in Google colab
try:
    from google.colab import files
    print("running in COLAB mode")
    repo = 'https://github.com/mostly-ai/mostly-tutorials/raw/dev/multi-table'
except ImportError:
    print("running in LOCAL mode")
    repo = '.'

In [None]:
# import data into DB
from pathlib import Path
import pandas as pd
csv_files = [
    f'{repo}/account.csv', 
    f'{repo}/card.csv', 
    f'{repo}/client.csv', 
    f'{repo}/disposition.csv', 
    f'{repo}/district.csv',
    f'{repo}/loan.csv', 
    f'{repo}/orders.csv', 
    f'{repo}/transaction.csv'
]

engine = connect_db(db_host, db_usr, db_pwd, db_name_source)

originals = {}
for fn in csv_files:
    # read data from CSV into Pandas DataFrame
    df = pd.read_csv(fn)
    # ensure all columns are NULL-able
    df = df.convert_dtypes()
    # convert date columns to datetime, and key columns to strings
    for col in df.columns:
        if col in ['date', 'issued']:
            df[col] = pd.to_datetime(df[col])
        if col.endswith('_id'):
            df[col] = df[col].astype(str)
    # get filename w/o extension
    db_table = Path(fn).stem
    # write DataFrame to DB
    df.to_sql(db_table, engine, index=False, if_exists='fail')
    print(f"created table `{db_table}` with {df.shape[0]:,} records")
    originals[db_table] = df

print('DONE')
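
To see what the type conversions in the loading loop do, it can help to run them on a tiny stand-in table first. The sketch below uses made-up values and an in-memory CSV rather than the Berka files, and it does not write to any database.

```python
import io

import pandas as pd

# tiny stand-in for one of the CSV files (values are made up for illustration)
csv_text = (
    "account_id,district_id,date,balance\n"
    "1,18,1993-01-01,100.5\n"
    "2,1,1993-02-26,\n"
)

df = pd.read_csv(io.StringIO(csv_text))
df = df.convert_dtypes()                 # switch to nullable extension dtypes
df["date"] = pd.to_datetime(df["date"])  # parse date strings into timestamps
for col in df.columns:
    if col.endswith("_id"):
        df[col] = df[col].astype(str)    # keys are stored as text

print(df.dtypes)
print(df)
```

Note that the missing `balance` stays a proper missing value after `convert_dtypes()`, so the column remains NULL-able when written to the database.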

In [None]:
with engine.connect() as conn:
    # define primary keys in the database
    conn.execute(text('ALTER TABLE account ADD PRIMARY KEY (account_id);'))
    conn.execute(text('ALTER TABLE card ADD PRIMARY KEY (card_id);'))
    conn.execute(text('ALTER TABLE client ADD PRIMARY KEY (client_id);'))
    conn.execute(text('ALTER TABLE disposition ADD PRIMARY KEY (disposition_id);'))
    conn.execute(text('ALTER TABLE district ADD PRIMARY KEY (district_id);'))
    conn.execute(text('ALTER TABLE loan ADD PRIMARY KEY (loan_id);'))
    conn.execute(text('ALTER TABLE orders ADD PRIMARY KEY (orders_id);'))
    conn.execute(text('ALTER TABLE transaction ADD PRIMARY KEY (transaction_id);'))
    print(f"created primary keys")
    # define foreign key constraints in the database
    conn.execute(text('ALTER TABLE account ADD CONSTRAINT fk_district_a FOREIGN KEY (district_id) REFERENCES district (district_id);'))
    conn.execute(text('ALTER TABLE client ADD CONSTRAINT fk_district_c FOREIGN KEY (district_id) REFERENCES district (district_id);'))
    conn.execute(text('ALTER TABLE disposition ADD CONSTRAINT fk_disp_a FOREIGN KEY (account_id) REFERENCES account (account_id);'))
    conn.execute(text('ALTER TABLE disposition ADD CONSTRAINT fk_disp_c FOREIGN KEY (client_id) REFERENCES client (client_id);'))
    conn.execute(text('ALTER TABLE card ADD CONSTRAINT fk_card FOREIGN KEY (disposition_id) REFERENCES disposition (disposition_id);'))
    conn.execute(text('ALTER TABLE transaction ADD CONSTRAINT fk_trans FOREIGN KEY (account_id) REFERENCES account (account_id);'))
    conn.execute(text('ALTER TABLE loan ADD CONSTRAINT fk_loan FOREIGN KEY (account_id) REFERENCES account (account_id);'))
    conn.execute(text('ALTER TABLE orders ADD CONSTRAINT fk_order FOREIGN KEY (account_id) REFERENCES account (account_id);'))
    print(f"created foreign keys")
    conn.commit()
print('DONE')
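
The effect of these constraints can be tried out without a database server. The following sketch uses an in-memory SQLite database from the Python standard library instead of PostgreSQL, with a reduced two-table schema and made-up keys. Unlike PostgreSQL, SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE account (account_id TEXT PRIMARY KEY)")
con.execute(
    "CREATE TABLE loan ("
    " loan_id TEXT PRIMARY KEY,"
    " account_id TEXT REFERENCES account (account_id))"
)
con.execute("INSERT INTO account VALUES ('A1')")
con.execute("INSERT INTO loan VALUES ('L1', 'A1')")  # parent exists: accepted

try:
    con.execute("INSERT INTO loan VALUES ('L2', 'A999')")  # no such account
    orphan_rejected = False
except sqlite3.IntegrityError as err:
    orphan_rejected = True
    print("rejected:", err)

# only the valid loan record was stored
loan_count = con.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
print("loans stored:", loan_count)
con.close()
```

The same principle applies to the source database: once the constraints are defined, records that point to a non-existent parent can no longer be inserted.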

## Synthesize Data via MOSTLY AI

Go to MOSTLY AI, and

1. Create two data connectors, one for the source DB `berka_original`, and one for the destination DB `berka_synthetic`

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly0a.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly0b.png' width="400px"/><br />

2. Create a data catalog using the data connector for `berka_original`

    - Select the `account` table along with all of its child tables
    - Select the `client` table together with all of its child tables
    - Configure smart select column `district_id` for the `disposition -> client` relation

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly1.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly2.png' width="400px"/><br />
<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly3.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly4.png' width="400px"/><br />
<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly5.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly6.png' width="400px"/><br />
<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/mostly7.png' width="400px"/> 

The resulting table types and relations are shown below.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/multi-table/berka-synthetic.png' width="600px"/>

3. Launch the job, and select `berka_synthetic` as a destination in "Output settings"

4. Once the job has completed, continue with executing the next cell.

In [None]:
# fetch synthetic data from destination database
engine = connect_db(db_host, db_usr, db_pwd, db_name_destination)
tables = [Path(fn).stem for fn in csv_files if 'district' not in fn]
synthetics = {}
for db_table in tables:
    with engine.begin() as conn:
        df = pd.read_sql_query(sql=text(f'select * from {db_table};'), con=conn)
    print(f"extracted table {db_table} with {df.shape[0]:,} records")
    synthetics[db_table] = df

## Explore Synthetic Data

### Show sample records for each table

In [None]:
for k in synthetics:
    print("===", k, "===")
    display(synthetics[k].sample(n=3))

### Check basic statistics

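The newly generated tables should be statistically representative of the originals. As a spot check, compare the 10%, 50% and 90% quantiles of a numeric column and of a date column between the synthetic and the original data.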

In [None]:
display(synthetics['transaction']['amount'].quantile(q=[.1, .5, .9]))
display(originals['transaction']['amount'].quantile(q=[.1, .5, .9]))

In [None]:
display(synthetics['account']['date'].quantile(q=[.1, .5, .9], interpolation='nearest'))
display(pd.to_datetime(originals['account']['date']).quantile(q=[.1, .5, .9], interpolation='nearest'))
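
Reading two separate outputs side by side gets tedious for more than a few columns. One option is to place the quantiles into a single table. This is a minimal sketch on made-up numbers; `compare_quantiles` is a hypothetical helper introduced for illustration.

```python
import pandas as pd


def compare_quantiles(original, synthetic, q=(0.1, 0.5, 0.9)):
    """Return the requested quantiles of both series side by side."""
    return pd.DataFrame({
        "original": original.quantile(list(q)),
        "synthetic": synthetic.quantile(list(q)),
    })


# made-up stand-ins for a numeric column of both datasets
orig = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
syn = pd.Series([1.0, 2.0, 3.0, 4.0, 6.0])

table = compare_quantiles(orig, syn)
table["abs_diff"] = (table["original"] - table["synthetic"]).abs()
print(table)
```

With the data from this tutorial, the call would look like `compare_quantiles(originals['transaction']['amount'], synthetics['transaction']['amount'])`.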

### Check referential integrity

The newly generated foreign keys are also present as primary keys in the connected tables.

In [None]:
assert synthetics['transaction']['account_id'].isin(synthetics['account']['account_id']).all()
assert synthetics['card']['disposition_id'].isin(synthetics['disposition']['disposition_id']).all()
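
The two assertions above cover two of the relations. The same check can be generalized to a list of relations, reporting the number of orphaned child records per foreign key instead of failing on the first one. The sketch below runs on toy tables; `count_orphans` and the relation tuples are hypothetical names used only here.

```python
import pandas as pd


def count_orphans(tables, relations):
    """Count child rows whose foreign key has no matching parent key."""
    result = {}
    for child, fk_col, parent, pk_col in relations:
        fk = tables[child][fk_col].dropna()  # NULL keys are not orphans
        missing = ~fk.isin(tables[parent][pk_col])
        result[f"{child}.{fk_col}"] = int(missing.sum())
    return result


# toy tables: the last loan points to an account that does not exist
toy = {
    "account": pd.DataFrame({"account_id": ["A1", "A2"]}),
    "loan": pd.DataFrame({
        "loan_id": ["L1", "L2", "L3"],
        "account_id": ["A1", "A2", "A9"],
    }),
}
orphans = count_orphans(toy, [("loan", "account_id", "account", "account_id")])
print(orphans)
```

Applied to `synthetics`, every count is expected to be zero for the relations between synthesized tables. Relations pointing to `district` would need that table to be loaded as well, since it is not part of `synthetics`.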

### Check context relations

The cardinality of context FK relations is perfectly retained.

In [None]:
print('Orders per Account - Synthetic')
display(synthetics['orders'].groupby('account_id').size().value_counts())
print('\nOrders per Account - Original')
display(originals['orders'].groupby('account_id').size().value_counts())

In [None]:
print('Cards per Disposition - Synthetic')
display(synthetics['card'].groupby('disposition_id').size().value_counts())
print('\nCards per Disposition - Original')
display(originals['card'].groupby('disposition_id').size().value_counts())
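
The `groupby(...).size()` counts above only include parents that have at least one child. To also account for parents without children, and to summarize the comparison in a single number, the distributions can be normalized and compared via their total variation distance. This is a sketch on toy data; both helper functions are hypothetical, and the distance is just one of several reasonable ways to compare two distributions.

```python
import pandas as pd


def children_per_parent(parent_df, child_df, key):
    """Share of parents having 0, 1, 2, ... children."""
    counts = child_df.groupby(key).size()
    # add parents without any child record as zero counts
    counts = counts.reindex(parent_df[key].unique(), fill_value=0)
    return counts.value_counts(normalize=True).sort_index()


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    idx = p.index.union(q.index)
    diff = p.reindex(idx, fill_value=0) - q.reindex(idx, fill_value=0)
    return 0.5 * diff.abs().sum()


accounts = pd.DataFrame({"account_id": ["A", "B", "C", "D"]})
orders_orig = pd.DataFrame({"account_id": ["A", "A", "B"]})  # C and D have none
orders_syn = pd.DataFrame({"account_id": ["A", "B", "C"]})   # D has none

dist_orig = children_per_parent(accounts, orders_orig, "account_id")
dist_syn = children_per_parent(accounts, orders_syn, "account_id")
tvd = total_variation_distance(dist_orig, dist_syn)
print(dist_orig, dist_syn, tvd, sep="\n")
```

A distance of 0 means that both distributions are identical, while 1 means that they do not overlap at all.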

### Check smart select relations

The cardinality of smart select FK relations is not retained, as these foreign keys get randomly assigned.

In [None]:
print('Dispositions per Client - Synthetic')
display(synthetics['disposition'].groupby('client_id').size().value_counts())
print('\nDispositions per Client - Original')
display(originals['disposition'].groupby('client_id').size().value_counts())

Some of the statistical relations between a child and its randomly assigned smart select parent can be retained if corresponding smart select columns were configured. For example, if smart select is properly configured, then the share of cases where the `client` has the same `district_id` as the `account` they own should be similar in the original and the synthetic data.

In [None]:
def matching_districts(datasets):    
    df = datasets['disposition']
    df = df.loc[df.type=='OWNER']
    df = df.merge(
        datasets['client'], 
        on='client_id',
    ).merge(
        datasets['account'], 
        on='account_id',
    )
    return (df['district_id_x']==df['district_id_y']).mean()

print(f"Share of accounts and clients with identical district_id")
print(f"synthetic: {matching_districts(synthetics):4.0%}")
print(f"original:  {matching_districts(originals):4.0%}")

## Conclusion

In this tutorial we have demonstrated how to synthesize a multi-table relational database. We have seen that structure, statistics and referential integrity are perfectly retained. We have also seen how to configure Smart Select, and its impact on retaining statistics across non-context relations. We also observed that there are limitations to what can be retained, in particular when it comes to the cardinality of smart select relations.

If you are interested in learning more about how to run ad-hoc multi-table jobs that are not synced to a relational database, check out [the video tutorial]().

## References<a class="anchor" name="refs"></a>

1. https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions