# Non-Context Foreign Keys with MOSTLY AI & SDV

A column in one table, Table A, that references a column in another table, Table B, is called a foreign key. When a single table has more than one foreign key, most synthetic data generation engines single out one of them: the foreign key whose parent table also contains the other foreign keys present in your table. This foreign key is called the context foreign key, and the remaining ones are non-context foreign keys.
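To make the terminology concrete, here is a minimal sketch with two hypothetical toy tables (invented for illustration, not taken from the dataset used in this notebook). The child table carries two foreign keys to the same parent; `START_ID` is designated as the context key here only to mirror the configuration used later in this notebook.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
parents = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["A", "B", "C"]})
children = pd.DataFrame({
    "START_ID": [1, 1, 2],  # foreign key treated as the context key
    "END_ID": [2, 3, 3],    # second, non-context foreign key to the same parent
    "KIND": ["owns", "owns", "controls"],
})

# Both columns are foreign keys: every value must exist in parents["ID"].
fk_valid = {
    col: bool(children[col].isin(parents["ID"]).all())
    for col in ["START_ID", "END_ID"]
}
print(fk_valid)

# Grouping by the context key shows how many child rows hang off each parent.
rows_per_context = children.groupby("START_ID").size().to_dict()
print(rows_per_context)
```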

In this notebook we compare two synthetic data generation engines, the Synthetic Data Vault (SDV) and the Synthetic Data SDK from MOSTLY AI, to demonstrate how each handles non-context foreign keys when generating synthetic data.

## Contents

1. [Set up Environment](#set-up-environment)
    - [Install SDV](#install-sdv)
    - [Install MOSTLY AI](#install-mostly-ai)
2. [Data Preparation](#data-preparation)
    - [Download Data](#download-data)
    - [Save Data in Environment Memory](#save-data-in-environment-memory)
3. [SDV Implementation](#sdv-implementation)
    - [SDV Configuration](#sdv-configuration)
    - [SDV Model Training](#sdv-model-training)
    - [SDV Synthetic Data Generation](#sdv-synthetic-data-generation)
    - [SDV Synthetic Data Preview](#sdv-synthetic-data-preview)
    - [Save SDV Synthetic Data](#save-sdv-synthetic-data)
4. [MOSTLY AI Implementation](#mostly-ai-implementation)
    - [MOSTLY AI Configuration](#mostly-ai-configuration)
    - [MOSTLY AI Generator Training](#mostly-ai-generator-training)
    - [MOSTLY AI Synthetic Data Generation](#mostly-ai-synthetic-data-generation)
    - [MOSTLY AI Synthetic Data Preview](#mostly-ai-synthetic-data-preview)
    - [Save MOSTLY AI Synthetic Data](#save-mostly-ai-synthetic-data)
5. [MOSTLY AI Synthetic Data Quality Assurance](#mostly-ai-synthetic-data-quality-assurance)
    - [Initialize MOSTLY AI Synthetic Data QA Library](#initialize-mostly-ai-synthetic-data-qa-library)
    - [SDV Synthetic Data Quality](#sdv-synthetic-data-quality)
        - [SDV - START_ID](#sdv---start_id)
        - [SDV - END_ID](#sdv---end_id)
    - [MOSTLY AI Synthetic Data Quality](#mostly-ai-synthetic-data-quality)
        - [MOSTLY AI - START_ID](#mostly-ai---start_id)
        - [MOSTLY AI - END_ID](#mostly-ai---end_id)

## Set up Environment

### Install SDV

In [None]:
# Install The Synthetic Data Vault
%pip install sdv==1.24.0 -qqq

### Install MOSTLY AI

In [None]:
# Install the Synthetic Data SDK from MOSTLY AI
%pip install -U "mostlyai[local]" -qqq

## Data Preparation

### Download Data

In [None]:
import pandas as pd

BASE = "https://raw.githubusercontent.com/mostly-ai/public-demo-data/dev/gleif/"
URL_ORGS = BASE + "organizations.csv.gz"
URL_RELS = BASE + "relations.csv.gz"

organizations = pd.read_csv(URL_ORGS, compression="infer", low_memory=False)
relations     = pd.read_csv(URL_RELS, compression="infer", low_memory=False)

def inspect_df(df, name):
    """
    Comprehensive data inspection function to understand:
    - Dataset dimensions and structure
    - Column names and data types
    - Sample data for manual review
    """
    print(f'--- {name} ---')
    print(f'Shape: {df.shape[0]:,} rows × {df.shape[1]} columns')
    print('Columns:', df.columns.tolist())
    print('Dtypes:', df.dtypes)

data = {
    'organizations': organizations,
    'relations': relations
}


### Save Data in Environment Memory

In [None]:
import os

# Ensure the target directory exists
os.makedirs("./data/subject-data", exist_ok=True)

# Save CSV files
organizations.to_csv("./data/subject-data/organizations.csv.gz", index=False, compression="gzip")
relations.to_csv("./data/subject-data/relations.csv.gz", index=False, compression="gzip")


## SDV Implementation

### SDV Configuration

As noted by the SDV team, SDV supports multiple foreign keys: call the `add_relationship` method on the `Metadata` object once per foreign key.

In [None]:
from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data, infer_keys='primary_and_foreign')

metadata.add_relationship(
    parent_table_name='organizations',
    parent_primary_key='ID',
    child_table_name='relations',
    child_foreign_key='START_ID'
)

metadata.add_relationship(
    parent_table_name='organizations',
    parent_primary_key='ID',
    child_table_name='relations',
    child_foreign_key='END_ID'
)

metadata.validate()
metadata.validate_data(data)

### SDV Model Training

Beyond the validity of the generated synthetic data, it is also interesting to compare the time each tool needs to train a model.

We'll use Python's `time` module to compare the performance of the two tools on the full dataset.

In [None]:
import time

from sdv.multi_table import HMASynthesizer

synthesizer = HMASynthesizer(metadata)

start = time.time()
synthesizer.fit(data)
end = time.time()

print('Fitting time:', round(end-start, 2), 'seconds')
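The start/stop timing pattern above recurs for every fit and sample step in this notebook. A minimal sketch of one way to wrap it, assuming a small context manager is acceptable, is shown below. The names `timed` and `timings` are hypothetical helpers, not part of SDV or the MOSTLY AI SDK, and `time.sleep` stands in for the real training call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block and optionally store the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")

timings = {}
with timed("demo step", timings):
    time.sleep(0.05)  # stand-in for a call such as synthesizer.fit(data)
```

Collecting the durations in one dictionary makes it easier to compare both tools side by side at the end of a run.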

### SDV Synthetic Data Generation

In [None]:
start = time.time()
synthetic_data = synthesizer.sample(scale=0.10)
end = time.time()

print('Sampling time:', round(end-start, 2), 'seconds')

### SDV Synthetic Data Preview

In [None]:
synthetic_data['organizations'].head()

In [None]:
synthetic_data['relations'].head()

### Save SDV Synthetic Data

In [None]:
import os

# Create target directory
os.makedirs("./data/sdv", exist_ok=True)

# Define file paths
orgs_output_file = "./data/sdv/sdv_organizations.parquet"
rels_output_file = "./data/sdv/sdv_relations.parquet"

# Save tables
synthetic_data["organizations"].to_parquet(orgs_output_file, index=False)
synthetic_data["relations"].to_parquet(rels_output_file, index=False)


## MOSTLY AI Implementation

### MOSTLY AI Configuration

In [None]:
from mostlyai.sdk import MostlyAI
mostly = MostlyAI(local=True)

config = {
    'name': 'GLEIF Organizations & Relations Generator',
    'tables': [
        {
            'name': 'organizations',
            'data': organizations,
            'primary_key': 'ID',
            'tabular_model_configuration': {
                'enable_model_report': False
            }
        },
        {
            'name': 'relations',
            'data': relations,
            'foreign_keys': [
                {
                    'column': 'START_ID',
                    'referenced_table': 'organizations',
                    'is_context': True
                },
                {
                    'column': 'END_ID',
                    'referenced_table': 'organizations',
                    'is_context': False
                }
            ],
            'tabular_model_configuration': {
                'enable_model_report': False
            }
        }
    ]
}
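The configuration above flags `START_ID` as the context foreign key and `END_ID` as a non-context one. As a hedged sanity check, the sketch below lists which columns are flagged `is_context` for each table. It assumes that a table should carry at most one context foreign key; that is our understanding of the SDK rather than something verified in this notebook. The helper `context_keys_per_table` and the stand-in `demo_config` are hypothetical and omit the dataframes.

```python
def context_keys_per_table(config):
    """Return {table name: [foreign key columns flagged is_context]}."""
    return {
        table["name"]: [
            fk["column"]
            for fk in table.get("foreign_keys", [])
            if fk.get("is_context", False)
        ]
        for table in config["tables"]
    }

# Stand-in config with the same shape as the one above, minus the data.
demo_config = {
    "name": "demo",
    "tables": [
        {"name": "organizations", "primary_key": "ID"},
        {
            "name": "relations",
            "foreign_keys": [
                {"column": "START_ID", "referenced_table": "organizations", "is_context": True},
                {"column": "END_ID", "referenced_table": "organizations", "is_context": False},
            ],
        },
    ],
}

summary = context_keys_per_table(demo_config)
for table_name, columns in summary.items():
    # Assumption: at most one context foreign key per table.
    assert len(columns) <= 1, f"{table_name} has several context keys: {columns}"
print(summary)
```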


### MOSTLY AI Generator Training

In [None]:
# Launch MOSTLY AI generator training job
start_time = time.time()
g = mostly.train(config=config, start=True, wait=True)
end_time = time.time()

# Measure and print elapsed time for generator training
elapsed = end_time - start_time
print(f"Training completed in {elapsed:.2f} seconds ({elapsed/60:.2f} minutes).")


### MOSTLY AI Synthetic Data Generation

In [None]:
# Synthetic data generation
start_time = time.time()
sd = mostly.generate(g, size=int(0.10 * len(organizations)))
mostlyai_synthetic_data = sd.data()
end_time = time.time()

# Measure and print elapsed time for data generation
elapsed = end_time - start_time
print(f"Generation completed in {elapsed:.2f} seconds ({elapsed/60:.2f} minutes).")


### MOSTLY AI Synthetic Data Preview

In [None]:
mostlyai_synthetic_data['organizations'].head()

In [None]:
mostlyai_synthetic_data['relations'].head()

### Save MOSTLY AI Synthetic Data

In [None]:
os.makedirs("./data/mostly", exist_ok=True)

orgs_train_output_file = './data/mostly/mostlyai_organizations.parquet'
rels_train_output_file = './data/mostly/mostlyai_relations.parquet'
mostlyai_synthetic_data['organizations'].to_parquet(orgs_train_output_file, index=False)
mostlyai_synthetic_data['relations'].to_parquet(rels_train_output_file, index=False)
print(f"💾 MOSTLY AI synthetic data saved to: {orgs_train_output_file} and {rels_train_output_file}")

## MOSTLY AI Synthetic Data Quality Assurance

Because the SDV team has already demonstrated that the generated synthetic data maintains referential integrity, we'll dive deeper and explore the quality of the generated data. If you are interested in the referential integrity of the generated datasets, please refer to the [Confirming Referential Integrity](#confirming-referential-integrity) section, where we use [SDMetrics](https://docs.sdv.dev/sdmetrics) to confirm the referential integrity of all generated data.

Referential integrity is, of course, an important piece of the puzzle when generating synthetic data. But one of the key advantages of synthetic data (as compared to [homomorphic encryption](https://en.wikipedia.org/wiki/Homomorphic_encryption#:~:text=Homomorphic%20encryption%20is%20a%20form%20of%20encryption%20with%20an%20additional,extension%20of%20public%2Dkey%20cryptography.), for example) is that it not only maintains privacy protections but also resembles the subject data, both to a machine and to a human.

We'll see that while the data generated by SDV indeed maintained referential integrity, it failed to maintain observable features of the subject dataset that are essential to creating realistic synthetic data.
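For readers who want a quick look at what referential integrity means in practice, here is a minimal pandas sketch. It is not the SDMetrics check mentioned above. The helper `orphan_share` and the toy tables are hypothetical; the function simply measures the share of non-null foreign key values that have no matching parent key.

```python
import pandas as pd

def orphan_share(child, fk_column, parent, pk_column):
    """Share of non-null foreign key values with no matching parent key."""
    values = child[fk_column].dropna()
    if values.empty:
        return 0.0
    return float((~values.isin(parent[pk_column])).mean())

# Hypothetical toy tables; "z" deliberately has no parent row.
toy_orgs = pd.DataFrame({"ID": ["a", "b", "c"]})
toy_rels = pd.DataFrame({
    "START_ID": ["a", "b", "b", "c"],
    "END_ID": ["b", "c", "z", None],
})

shares = {col: orphan_share(toy_rels, col, toy_orgs, "ID") for col in ["START_ID", "END_ID"]}
print(shares)
```

A share of 0.0 for both foreign key columns would indicate that every reference resolves to an existing parent row.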


### Initialize MOSTLY AI Synthetic Data QA Library

In [None]:
from mostlyai import qa

qa.init_logging()

### SDV Synthetic Data Quality

#### SDV - START_ID

In [None]:
sdv_relations = pd.read_parquet('./data/sdv/sdv_relations.parquet')
sdv_organizations = pd.read_parquet('./data/sdv/sdv_organizations.parquet')

id_columns_to_exclude = ['ID', 'END_ID']

def remove_id_columns(df, columns_to_remove):
    """Remove specified columns if they exist in the dataframe"""
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

sdv_relations = remove_id_columns(sdv_relations, id_columns_to_exclude)
rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = sdv_relations,
    trn_tgt_data = rels_train_qa,
    syn_ctx_data = sdv_organizations,
    trn_ctx_data = organizations,
    ctx_primary_key = "ID",
    tgt_context_key = "START_ID",
    max_sample_size_embeddings=10_000,
    report_path='sdv_relations_qa_report_start_id.html'
)

print(f"SDV Relations Quality Report saved to: {report_path}")
print("\nSDV Relations Quality Metrics:")
print(metrics.model_dump_json(indent=4))
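The `remove_id_columns` helper above filters the column list before dropping. As a side note, pandas offers the same behavior through `DataFrame.drop(..., errors="ignore")`. The sketch below compares both on a hypothetical toy frame in which `END_ID` is deliberately absent.

```python
import pandas as pd

# Hypothetical toy frame; END_ID is missing on purpose.
toy = pd.DataFrame({"ID": [1, 2], "START_ID": [10, 20], "TYPE": ["x", "y"]})
columns_to_remove = ["ID", "END_ID"]

# Approach used in this notebook: keep only the names that exist, then drop.
via_helper = toy.drop(columns=[c for c in columns_to_remove if c in toy.columns])

# Built-in alternative: let pandas skip the names that do not exist.
via_pandas = toy.drop(columns=columns_to_remove, errors="ignore")

print(via_pandas.columns.tolist())
```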

#### SDV - END_ID

In [None]:
sdv_relations = pd.read_parquet('./data/sdv/sdv_relations.parquet')
sdv_organizations = pd.read_parquet('./data/sdv/sdv_organizations.parquet')

id_columns_to_exclude = ['ID', 'START_ID']

def remove_id_columns(df, columns_to_remove):
    """Remove specified columns if they exist in the dataframe"""
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

sdv_relations = remove_id_columns(sdv_relations, id_columns_to_exclude)
rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = sdv_relations,
    trn_tgt_data = rels_train_qa,
    syn_ctx_data = sdv_organizations,
    trn_ctx_data = organizations,
    ctx_primary_key = "ID",
    tgt_context_key = "END_ID",
    max_sample_size_embeddings=10_000,
    report_path='sdv_relations_qa_report_end_id.html'
)

print(f"SDV Relations Quality Report saved to: {report_path}")
print("\nSDV Relations Quality Metrics:")
print(metrics.model_dump_json(indent=4))

### MOSTLY AI Synthetic Data Quality

#### MOSTLY AI - START_ID

In [None]:
mostlyai_relations = pd.read_parquet('./data/mostly/mostlyai_relations.parquet')
mostlyai_organizations = pd.read_parquet('./data/mostly/mostlyai_organizations.parquet')

id_columns_to_exclude = ['ID', 'END_ID']

def remove_id_columns(df, columns_to_remove):
    """Remove specified columns if they exist in the dataframe"""
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

mostlyai_relations = remove_id_columns(mostlyai_relations, id_columns_to_exclude)
rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = mostlyai_relations,
    trn_tgt_data = rels_train_qa,
    syn_ctx_data = mostlyai_organizations,
    trn_ctx_data = organizations,
    ctx_primary_key = "ID",
    tgt_context_key = "START_ID",
    max_sample_size_embeddings=10_000,
    report_path='mostlyai_relations_qa_report_start_id.html'
)

print(f"MOSTLY AI START_ID Quality Report saved to: {report_path}")
print("\nMOSTLY AI START_ID Quality Metrics:")
print(metrics.model_dump_json(indent=4))


#### MOSTLY AI - END_ID

In [None]:
mostlyai_relations = pd.read_parquet('./data/mostly/mostlyai_relations.parquet')
mostlyai_organizations = pd.read_parquet('./data/mostly/mostlyai_organizations.parquet')

id_columns_to_exclude = ['ID', 'START_ID']

def remove_id_columns(df, columns_to_remove):
    """Remove specified columns if they exist in the dataframe"""
    return df.drop(columns=[col for col in columns_to_remove if col in df.columns])

mostlyai_relations = remove_id_columns(mostlyai_relations, id_columns_to_exclude)
rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)

report_path, metrics = qa.report(
    syn_tgt_data = mostlyai_relations,
    trn_tgt_data = rels_train_qa,
    syn_ctx_data = mostlyai_organizations,
    trn_ctx_data = organizations,
    ctx_primary_key = "ID",
    tgt_context_key = "END_ID",
    max_sample_size_embeddings=10_000,
    report_path='mostlyai_relations_qa_report_end_id.html'
)

print(f"MOSTLY AI END_ID Quality Report saved to: {report_path}")
print("\nMOSTLY AI END_ID Quality Metrics:")
print(metrics.model_dump_json(indent=4))

