# MultiTable

## Quickstart

In [None]:
# End-to-end synthetics example

from gretel_trainer.relational import MultiTable, sqlite_conn


!curl -o "ecom_xf.db" "https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/ecom_xf.db"


connector = sqlite_conn("ecom_xf.db")
relational_data = connector.extract()

mt = MultiTable(relational_data)
mt.train()
mt.generate()

connector.save(mt.synthetic_output_tables, prefix="synthetic_")

## Detailed walkthrough

### Set up source relational data

In [None]:
# Display the schema of our demo database

from IPython.display import Image

Image("https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/ecommerce_db.png", width=600, height=600)

In [None]:
# Download the demo database

!curl -o "ecom_xf.db" "https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/ecom_xf.db"

The core Python object capturing source relational data and metadata is named `RelationalData`.
It can be created automatically using a `Connector`, or it can be created manually.


In [None]:
# Connect to SQLite database and extract relational data

from gretel_trainer.relational import sqlite_conn

ecommerce_db_path = "ecom_xf.db"

sqlite = sqlite_conn(path=ecommerce_db_path)
relational_data = sqlite.extract()
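
Before extracting, it can help to check which primary and foreign keys the database actually declares, since those are the relationships the metadata is meant to capture. The following is a minimal sketch using only Python's standard `sqlite3` module and SQLite's `PRAGMA` statements. It is not how the `Connector` works internally; the helper name `describe_keys` and the two-table in-memory database are hypothetical, chosen so the example runs without the demo file.

```python
import sqlite3


def describe_keys(conn):
    """Return {table: {"primary_key": [...], "foreign_keys": [...]}} for a SQLite database."""
    schema = {}
    table_names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in table_names:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        primary_key = [col[1] for col in sorted(columns, key=lambda c: c[5]) if col[5] > 0]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        foreign_keys = [(f"{table}.{fk[3]}", f"{fk[2]}.{fk[4]}") for fk in fks]
        schema[table] = {"primary_key": primary_key, "foreign_keys": foreign_keys}
    return schema


# Hypothetical two-table database built in memory so the sketch is self-contained.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users (id)
    );
    """
)
schema = describe_keys(demo)
print(schema)
```

To inspect the demo database instead, pass `sqlite3.connect("ecom_xf.db")` to the helper. Foreign keys are reported as `("table.column", "table.column")` pairs, the same shape used when defining relational data manually.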

In [None]:
# Alternatively, manually define relational data

from gretel_trainer.relational import RelationalData
import pandas as pd

csv_dir = "/path/to/extracted_csvs"

tables = [
    ("events", "id"),
    ("users", "id"),
    ("distribution_center", "id"),
    ("products", "id"),
    ("inventory_items", "id"),
    ("order_items", "id"),
]

foreign_keys = [
    ("events.user_id", "users.id"),
    ("order_items.user_id", "users.id"),
    ("order_items.inventory_item_id", "inventory_items.id"),
    ("inventory_items.product_id", "products.id"),
    ("inventory_items.product_distribution_center_id", "distribution_center.id"),
    ("products.distribution_center_id", "distribution_center.id"),
]

rel_data = RelationalData()

for table, primary_key in tables:
    rel_data.add_table(table, primary_key, pd.read_csv(f"{csv_dir}/{table}.csv"))

# Each pair is (foreign key column, column it references)
for fk, referenced in foreign_keys:
    rel_data.add_foreign_key(fk, referenced)
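
When keys are declared by hand, nothing in the cell above checks that the data actually honors them. The sketch below shows one way to look for orphaned foreign key values before building the relational data. It is plain Python and independent of `RelationalData`; the helper `find_orphans` and the toy rows are hypothetical.

```python
def find_orphans(tables, foreign_keys):
    """Return {foreign_key: sorted orphan values} for values missing from the referenced column.

    tables: {table_name: list of row dicts}
    foreign_keys: list of ("child_table.column", "parent_table.column") pairs
    """
    orphans = {}
    for child_ref, parent_ref in foreign_keys:
        child_table, child_col = child_ref.split(".", 1)
        parent_table, parent_col = parent_ref.split(".", 1)
        parent_values = {row[parent_col] for row in tables[parent_table]}
        # Null foreign keys are skipped; they reference nothing rather than a missing row.
        missing = {
            row[child_col]
            for row in tables[child_table]
            if row[child_col] is not None and row[child_col] not in parent_values
        }
        if missing:
            orphans[child_ref] = sorted(missing)
    return orphans


# Hypothetical toy rows; user 3 does not exist in "users".
toy_tables = {
    "users": [{"id": 1}, {"id": 2}],
    "events": [
        {"id": 10, "user_id": 1},
        {"id": 11, "user_id": 3},
        {"id": 12, "user_id": None},
    ],
}
orphans = find_orphans(toy_tables, [("events.user_id", "users.id")])
print(orphans)
```

With pandas DataFrames, the rows for each table could be obtained with `df.to_dict("records")` and passed in alongside the same `foreign_keys` list defined above.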

### Operate on the source data

The `MultiTable` class is the interface for working with relational data. It requires a `RelationalData` instance. Several other options can be configured; the defaults are shown below as comments.

In [None]:
from gretel_trainer.relational import MultiTable

multitable = MultiTable(
    relational_data,
    # project_name="multi-table",
    # working_dir="multi-table",  # matches the project name by default
    # gretel_model="amplify",
    # strategy="cross-table",
    # refresh_interval=180,
)

#### Transforms

Provide a Gretel Transforms config for each table you want to transform. To train synthetic models on the transformed output instead of the source data, add the argument `in_place=True`.

In [None]:
# Transform some tables

multitable.transform(
    configs={
        "users": "https://gretel-blueprints-pub.s3.amazonaws.com/rdb/users_policy.yaml",
        "events": "https://gretel-blueprints-pub.s3.amazonaws.com/rdb/events_policy.yaml",
    }
)

In [None]:
# Compare original to transformed

print(multitable.relational_data.get_table_data("users").head(5))
print(multitable.transform_output_tables["users"].head(5))

#### Synthetics

In [None]:
# Throughout the synthetics process, there are a few ways to inspect the overall state

multitable.train_statuses
multitable.generate_statuses
multitable.state_by_action
multitable.state_by_table

In [None]:
# Train synthetic models for all tables

multitable.train()

When training is complete, you'll find a number of artifacts in your working directory, including the CSVs on which models were trained (`train_{table}.csv`) and the standard Gretel model artifacts, including HTML and JSON reports and logs (`artifacts_{table}/`).
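
To see which of these artifacts exist for each table, a small helper along the following lines can scan the working directory. This is a sketch based only on the naming patterns described above (`train_{table}.csv` and `artifacts_{table}/`); the helper name `list_training_artifacts` is hypothetical, and the temporary directory stands in for a real working directory so the example runs without a trained model.

```python
import tempfile
from pathlib import Path


def list_training_artifacts(working_dir):
    """Map each table name to its training CSV and artifacts directory, if present."""
    working_dir = Path(working_dir)
    found = {}
    for csv_path in sorted(working_dir.glob("train_*.csv")):
        # Strip the "train_" prefix from the file stem to recover the table name.
        table = csv_path.stem[len("train_"):]
        artifacts_dir = working_dir / f"artifacts_{table}"
        found[table] = {
            "train_csv": csv_path.name,
            "artifacts_dir": artifacts_dir.name if artifacts_dir.is_dir() else None,
        }
    return found


# Simulate a working directory so the sketch runs without a trained model.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train_users.csv").write_text("id\n1\n")
    (root / "artifacts_users").mkdir()
    (root / "train_events.csv").write_text("id\n10\n")
    artifacts = list_training_artifacts(root)

print(artifacts)
```

A table whose `artifacts_dir` is `None` has a training CSV but no artifacts directory yet.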

You can also view some evaluation metrics at this point. (We'll expand upon them after generating synthetic data.)

In [None]:
multitable.evaluations

When you generate synthetic data, you can optionally change the amount of data to generate via `record_size_ratio` and preserve certain tables' source data via `preserve_tables`.
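
As a rough illustration of how these two options interact, the sketch below estimates output row counts. It assumes that `record_size_ratio` multiplies each table's source row count and that preserved tables keep their source rows unchanged; the library's actual rounding and edge-case handling may differ. The helper `expected_row_counts` and the row counts are hypothetical.

```python
def expected_row_counts(source_counts, record_size_ratio=1.0, preserve_tables=None):
    """Estimate synthetic row counts per table under the assumptions stated above."""
    preserved = set(preserve_tables or [])
    return {
        # Preserved tables keep their source size; all others are scaled by the ratio.
        table: count if table in preserved else round(count * record_size_ratio)
        for table, count in source_counts.items()
    }


# Hypothetical source row counts, not taken from the demo database.
source_counts = {"users": 100, "events": 250, "distribution_center": 10}
planned = expected_row_counts(
    source_counts,
    record_size_ratio=2.0,
    preserve_tables=["distribution_center"],
)
print(planned)
```

The corresponding call would presumably look like `multitable.generate(record_size_ratio=2.0, preserve_tables=["distribution_center"])`; the argument types shown here (a float and a list of table names) are assumptions.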

In [None]:
# Generate synthetic data

multitable.generate()

In [None]:
# Compare original to synthetic data

print(multitable.relational_data.get_table_data("users").head(5))
print(multitable.synthetic_output_tables["users"].head(5))

Now that we have synthetic output data, we can expand the table evaluations to provide another perspective on synthetic data quality.

In [None]:
multitable.expand_evaluations()
multitable.evaluations

We now have all the data we need to create a full multitable report that summarizes and explains all this information. After running the cell below you'll find `multitable_report.html` in the working directory.

In [None]:
from gretel_trainer.relational import create_report

create_report(multitable)

In [None]:
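# Display the report inline in the notebook

from IPython.display import HTML
from smart_open import open

# Note: this relies on the private `_working_dir` attribute
report_path = str(multitable._working_dir / "multitable_report.html")

with open(report_path) as report_file:
    report_html = report_file.read()

HTML(data=report_html)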

The synthetic data is automatically written to the working directory as `synth_{table}.csv`. You can optionally use a `Connector` to write the synthetic data to a database. (If you're writing back to the same database as your source, pass a `prefix: str` argument to the `save` method to avoid overwriting your source tables!)
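
To illustrate why the prefix matters, here is a minimal sketch using only the standard `sqlite3` module. It is not the `Connector` implementation; the helper `save_with_prefix` and the toy rows are hypothetical. It writes a synthetic `users` table into a database that already contains a source `users` table and confirms the source rows are untouched.

```python
import sqlite3


def save_with_prefix(conn, tables, prefix=""):
    """Write each table under prefix + name; tables maps name -> (columns, rows)."""
    for name, (columns, rows) in tables.items():
        target = f"{prefix}{name}"
        column_list = ", ".join(f'"{col}"' for col in columns)
        placeholders = ", ".join("?" for _ in columns)
        # Replace any earlier output with the same target name.
        conn.execute(f'DROP TABLE IF EXISTS "{target}"')
        conn.execute(f'CREATE TABLE "{target}" ({column_list})')
        conn.executemany(f'INSERT INTO "{target}" VALUES ({placeholders})', rows)
    conn.commit()


# A source database that already has a "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id, name)")
conn.execute("INSERT INTO users VALUES (1, 'source user')")

synthetic = {"users": (["id", "name"], [(7, "synthetic user")])}
save_with_prefix(conn, synthetic, prefix="synthetic_")

table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
source_rows = conn.execute("SELECT * FROM users").fetchall()
print(table_names, source_rows)
```

With an empty prefix, the same helper would drop and recreate `users`, which is the overwrite the warning above is about.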

In [None]:
# Write output data to a new SQLite database

from gretel_trainer.relational import sqlite_conn

synthetic_db_path = "out.db"

synthetic_db_conn = sqlite_conn(synthetic_db_path)
synthetic_db_conn.save(multitable.synthetic_output_tables)

### Postgres demo via Docker

In [None]:
# Start up a postgres container with docker

!docker run --rm -d --name multitable_pgdemo -e POSTGRES_PASSWORD=password -p 5432:5432 postgres

In [None]:
# Write synthetic tables to the Postgres db

from gretel_trainer.relational import postgres_conn

out_db = postgres_conn("postgres", "password", "localhost", 5432)
out_db.save(multitable.synthetic_output_tables)


In [None]:
# Inspect the postgres database

!docker exec multitable_pgdemo psql -U postgres -c "\dt"
!docker exec multitable_pgdemo psql -U postgres -c "select * from users limit 5;"

In [None]:
# Tear down the docker container

!docker stop multitable_pgdemo