# Synthetics and Multi-Table Relational Databases
* This notebook shows how to generate synthetic data directly from a multi-table relational database.
* The database used in the example below was first run through this [notebook](https://github.com/gretelai/public_research/blob/main/mutli-table-transforms/RDB_Transforms.ipynb), which uses transforms to de-identify PII and is discussed in this [blog](https://gretel.ai/blog/transforms-and-multi-table-relational-databases).


## Capabilities
* This notebook can be run on any database SQLAlchemy supports, such as PostgreSQL, SQLite or MySQL.
* This notebook also contains instructions on how to create synthetic data when the relational tables exist in CSV files.
* It is not necessary to first use transforms on your data before using this notebook.
* Referential integrity of primary and foreign keys remains intact.
* You specify the ratio of synthetic records to original records. For example, 2 produces synthetic data twice the size of the original data. A value less than one "subsets" the database: .5 produces synthetic data half the size of the original data.
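
As a rough illustration of the ratio arithmetic, the sketch below converts a ratio into per-table record counts. The table sizes and the `target_counts` helper are hypothetical, and rounding to the nearest whole record is an assumption; the multi-table package may compute sizes differently.

```python
# Hypothetical original table sizes (not taken from the ecommerce database).
original_counts = {"users": 1000, "order_items": 2500}


def target_counts(counts, ratio):
    """Return the number of synthetic records per table for a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Assumption: round to the nearest whole record and keep at least one row.
    return {table: max(1, round(n * ratio)) for table, n in counts.items()}


doubled = target_counts(original_counts, 2)   # twice the original size
halved = target_counts(original_counts, 0.5)  # a half-size "subset"
print(doubled)
print(halved)
```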


## Limitations
* The primary and foreign keys in your database must be IDs
* Keys cannot be composite keys
* Cross-table field correlations are not maintained.
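
Before running the notebook on your own database, it may help to check for composite keys up front. The sketch below is one way to do that for SQLite using only the standard-library `sqlite3` module; the `composite_key_tables` helper and the two-table schema are hypothetical and not part of the multi-table package.

```python
import sqlite3


def composite_key_tables(conn):
    """Return the names of tables whose primary key spans more than one column."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    flagged = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk);
        # pk is 0 for non-key columns and a 1-based position for key columns.
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        if sum(1 for column in info if column[5] > 0) > 1:
            flagged.append(table)
    return flagged


# Hypothetical schema: one single-column key and one composite key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER,
                              PRIMARY KEY (order_id, line_no));
""")
flagged = composite_key_tables(conn)
print(flagged)
```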

## How to use this notebook on your own dataset
* Change the database connection string to refer to your database
* Alternatively, change the name and location of the CSV files where your data resides
* Specify the ratio of synthetic to original records you would like produced
* When viewing your data, change the table names used to your own table names
* When viewing the synthetic performance report, change the table name used to one of your own table names
* Modify the location where you'd like your final synthetic data to be stored

## Our ecommerce database
* Execute the cell below to see a diagram of the database we'll be using in this blueprint. The lines in the diagram show connections between primary and foreign keys.

In [None]:
from IPython.display import Image
Image("https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/ecommerce_db.png", width=600, height=600)

## Getting started

In [None]:
import os

!git clone https://github.com/gretelai/multi-table.git

os.chdir('./multi-table')
!pip install -U .

In [None]:
# Specify your Gretel API key

from getpass import getpass
import pandas as pd
from gretel_client import configure_session, ClientConfig

pd.set_option('max_colwidth', None)

configure_session(ClientConfig(api_key=getpass(prompt="Enter Gretel API key"), 
                               endpoint="https://api.gretel.cloud"))

## Gather data and schema relationships directly from a database
* For demonstration purposes, we'll first grab our ecommerce SQLite database from S3
* This notebook can be run on any database SQLAlchemy supports, such as PostgreSQL or MySQL
* For example, if you have a PostgreSQL database, simply swap the `sqlite:///` connection string for a `postgresql://` one in the `create_engine` command (SQLAlchemy 1.4 and later no longer accept the older `postgres://` scheme)
* Using SQLAlchemy's reflection extension, we will crawl the schema, gather table data and produce a list of relationships by table primary key.
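
To make the reflection step more concrete, here is a minimal sketch that uses SQLAlchemy's `inspect` API on a tiny in-memory SQLite schema. The schema is a hypothetical stand-in, and this is not the implementation of `rdb.crawl_db`; it only shows how primary keys and foreign-key links can be read from a live database and grouped by the primary key they point to.

```python
from collections import defaultdict

from sqlalchemy import create_engine, inspect, text

# Tiny in-memory stand-in for a real database (hypothetical schema).
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY)"))
    conn.execute(text(
        "CREATE TABLE order_items (id INTEGER PRIMARY KEY, "
        "user_id INTEGER REFERENCES users(id))"))
    conn.execute(text(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, "
        "user_id INTEGER REFERENCES users(id))"))

inspector = inspect(engine)
primary_keys = {}
# (parent table, parent column) -> list of referencing (table, column) pairs
linked = defaultdict(list)

for table in sorted(inspector.get_table_names()):
    pk_columns = inspector.get_pk_constraint(table)["constrained_columns"]
    if len(pk_columns) == 1:  # composite keys are skipped in this sketch
        primary_keys[table] = pk_columns[0]
    for fk in inspector.get_foreign_keys(table):
        parent = (fk["referred_table"], fk["referred_columns"][0])
        linked[parent].append((table, fk["constrained_columns"][0]))

# One list per primary key: the key first, then every field that refers to it.
relationships = [[parent, *children] for parent, children in linked.items()]
print(primary_keys)
print(relationships)
```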

In [None]:
from sqlalchemy import create_engine
import multi_table.rdb_util as rdb

!wget https://gretel-blueprints-pub.s3.amazonaws.com/rdb/ecom_xf.db
    
engine = create_engine("sqlite:///ecom_xf.db")
rdb_config = rdb.crawl_db(engine)

## Alternatively, specify primary/foreign key relationships and locations of data csv files 
* This is an alternative to the above four cells that work directly with a database
* First, assign `base_path` to the directory where the CSV files are located.
* Then, add a table name/filename pair for each table to `rdb_config["table_files"]`
* Add the primary key for each table to `rdb_config["primary_keys"]`
* Under `rdb_config["relationships"]`, add one list per primary key containing that key and every foreign key that refers to it, each written as a (table, field) pair
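
Because this configuration is written by hand, a quick consistency check can catch typos before any training starts. The `check_rdb_config` helper and the example dictionary below are hypothetical and not part of the multi-table package; the sketch only assumes the three dictionary keys described above.

```python
def check_rdb_config(config):
    """Return a list of problems found in a hand-written config (empty if none)."""
    problems = []
    tables = set(config["table_files"])
    # Every table with a file should also have a primary key listed.
    for table in sorted(tables - set(config["primary_keys"])):
        problems.append(f"no primary key listed for table '{table}'")
    # Every (table, field) pair in a relationship should name a known table.
    for group in config["relationships"]:
        for table, field in group:
            if table not in tables:
                problems.append(
                    f"relationship refers to unknown table '{table}' (field '{field}')")
    return problems


# Hypothetical config with two deliberate mistakes: "events" has no primary
# key, and one relationship misspells the table name as "event".
example = {
    "table_files": {"users": "users.csv", "events": "events.csv"},
    "primary_keys": {"users": "id"},
    "relationships": [[("users", "id"), ("event", "user_id")]],
}
problems = check_rdb_config(example)
for problem in problems:
    print(problem)
```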

In [None]:
# base_path is the directory where your csv files can be found
base_path = "https://gretel-blueprints-pub.s3.amazonaws.com/rdb/"

rdb_config = {
   "table_files": {
      "users": base_path + "users_transform.csv",

      "order_items": base_path + "order_items_transform.csv",
       
      "events": base_path + "events_transform.csv",
       
      "inventory_items": base_path + "inventory_items_transform.csv",  
       
      "products": base_path + "products_transform.csv",
       
      "distribution_center": base_path + "distribution_center_transform.csv"
   },

  # List the primary keys for each table
    
   "primary_keys": {
      "users": "id",

      "order_items": "id",
       
      "events": "id",
       
      "inventory_items": "id",  
       
      "products": "id",
       
      "distribution_center": "id"
   },

  # List the (table, field) relationships between primary and foreign keys 
   "relationships": [
          [("users","id"),
           ("order_items","user_id"),
           ("events","user_id")
          ],         
       
          [("inventory_items","id"),
           ("order_items","inventory_item_id")  
          ],         

          [("products","id"),
           ("inventory_items","product_id")
          ],                

          [("distribution_center","id"),
           ("products","distribution_center_id"),
           ("inventory_items", "product_distribution_center_id")
          ]             
   ]
}

# Gather the table data using the filenames entered above

rdb_config["table_data"] = {}
for table in rdb_config["table_files"]:
    filename = rdb_config["table_files"][table]
    df = pd.read_csv(filename)
    rdb_config["table_data"][table] = df

## Enter the ratio of synthetic records to original records you would like to produce

In [None]:
# Entering 1 means the synthetic data will be the same size as the original data
# Entering 2 means the synthetic data will be twice the size of the original data
# Entering .5 means the synthetic data will be half the size of the original data

rdb_config["synth_record_size_ratio"] = 2

## Take a look at your data by joining two tables
* Note that every record in the table "order_items" matches an entry in the table "users"
* An "inner" join will take the intersection of two tables
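
The link between an inner join and referential integrity can be seen on a toy example. In the sketch below, the two frames are hypothetical and one order deliberately points at a missing user, so the inner join drops it. When every foreign key has a match, the joined row count equals the child table's row count.

```python
import pandas as pd

# Hypothetical data: order 13 refers to user 4, which does not exist.
users = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
order_items = pd.DataFrame({"id": [10, 11, 12, 13], "user_id": [1, 1, 3, 4]})

# Rows whose foreign key has no matching primary key ("orphans").
orphans = order_items[~order_items["user_id"].isin(users["id"])]

# The inner join keeps only rows with a match, so the orphan disappears.
joined = order_items.join(users.set_index("id"), on="user_id", how="inner")

print("order_items rows:", len(order_items))
print("joined rows:", len(joined))
print("orphaned user_id values:", orphans["user_id"].tolist())
```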

In [None]:
pd.set_option("display.max_columns", None)

table1 = "order_items"
table2 = "users"
table1_key = "user_id"
table2_key = "id"
df1 = rdb_config["table_data"][table1]
df2 = rdb_config["table_data"][table2]

joined_data = df1.join(df2.set_index(table2_key), how='inner', on=table1_key, lsuffix='_order_items', rsuffix='_users')
print("Number of records in order_items table is " + str(len(df1)))
print("Number of records in users table is " + str(len(df2)))
print("Number of records in joined data is " + str(len(joined_data)))

joined_data.head()

## Set up the training configs
* We'll assign each table the default training config
* We'll turn off the similarity privacy filter for the table "distribution_center" as it has only 10 training records
* Similarly, you can modify the other table configs to match the characteristics of that table (see [here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics) for example configs that can be used).
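
Giving each table its own deep copy of the config matters because the settings are nested: a shallow copy would share the inner dictionaries, so a change made for one table would leak into the others. The sketch below shows this with a hypothetical stand-in config that mirrors only the nesting used in this notebook; the value `"medium"` is a placeholder, not necessarily the real default.

```python
import copy

# Hypothetical stand-in for the default config (the real file has more settings).
default_config = {
    "models": [{"synthetics": {"privacy_filters": {"similarity": "medium"}}}]
}

# Deep copies: each table can be tuned independently.
tables = ["users", "distribution_center"]
training_configs = {table: copy.deepcopy(default_config) for table in tables}
filters = training_configs["distribution_center"]["models"][0]["synthetics"]["privacy_filters"]
filters["similarity"] = None

users_setting = training_configs["users"]["models"][0]["synthetics"]["privacy_filters"]["similarity"]
print(users_setting)  # the users table keeps its own setting

# Shallow copy: the nested objects are shared, so the template changes too.
template = copy.deepcopy(default_config)
shallow = copy.copy(template)
shallow["models"][0]["synthetics"]["privacy_filters"]["similarity"] = None
template_setting = template["models"][0]["synthetics"]["privacy_filters"]["similarity"]
print(template_setting)  # the template was modified through the shallow copy
```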

In [None]:
# Grab the default Synthetic Config file:
from smart_open import open
import yaml
import copy

with open("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml", 'r') as stream:
    default_config = yaml.safe_load(stream)
    
training_configs = {}
for table in rdb_config["table_data"]:
    training_configs[table] = copy.deepcopy(default_config)

training_configs["distribution_center"]['models'][0]['synthetics']['privacy_filters']['similarity'] = None

## Create synthetic data

In [None]:
import multi_table.synth_models as sm
from gretel_client.projects import create_or_get_unique_project

# Designate a project
project = create_or_get_unique_project(name="rdb-synthetics")

# Synthesize your tables
synthetic_tables, errors, models = sm.synthesize_tables(rdb_config, project, training_configs)

# Synthesize your primary/foreign keys
if errors == False:
    synthetic_tables = sm.synthesize_keys(synthetic_tables, rdb_config)

## Verify the size of the new synthetic tables

In [None]:
for table in synthetic_tables:
    new_len = len(synthetic_tables[table])
    orig_len = len(rdb_config["table_data"][table])
    ratio = new_len / orig_len
    print(f"Table {table}: Original record count: {orig_len} New record count: {new_len} Ratio: {ratio}")

## View the synthetic data
* We'll again join the order_items and users tables

In [None]:
pd.set_option("display.max_columns", None)

table1 = "order_items"
table2 = "users"
table1_key = "user_id"
table2_key = "id"
df1 = synthetic_tables[table1]
df2 = synthetic_tables[table2]

joined_data = df1.join(df2.set_index(table2_key), how='inner', on=table1_key, lsuffix='_order_items', rsuffix='_users')
print("Number of records in order_items table is " + str(len(df1)))
print("Number of records in users table is " + str(len(df2)))
print("Number of records in joined data is " + str(len(joined_data)))

joined_data.head()

## View the synthetic performance reports

In [None]:
# Generate report that shows the statistical performance between the training and synthetic data

from smart_open import open
from IPython.display import display, HTML

# Change table_name to any of the tables in your relational database
table_name = "users"
display(HTML(data=open(models[table_name]["model"].get_artifact_link("report")).read(), metadata=dict(isolated=True)))

## Save the synthetic data back into an SQLite database
* Here, we're saving the data into an SQLite database called ecom_synth
* To save into a PostgreSQL database, use `type="postgres"`

In [None]:
# Save the new data to ecom_synth using the schema in ecom_xf
rdb.save_to_rdb("ecom_xf", "ecom_synth", synthetic_tables, engine, type="sqlite")

## Alternatively, save the synthetic content into CSV files

In [None]:
# Change final_dir to be the location where you'd like your csv files saved
final_dir = "./"
for table in synthetic_tables:
    df = synthetic_tables[table]
    filename = final_dir + table + '_synth.csv'
    df.to_csv(filename, index=False, header=True)