# References:

SDV is a public, source-available Python library for generating and evaluating synthetic data. You can download and use it under the Business Source License.

License: https://github.com/sdv-dev/SDV/blob/main/LICENSE

In this notebook, I demonstrate the steps involved in generating synthetic transaction data with the sdv library in Python. After loading the original transaction data into a DataFrame, I create metadata using the SingleTableMetadata class from sdv. This metadata is then used to synthesize new data with the GaussianCopulaSynthesizer, with constraints that keep certain column combinations fixed. A diagnostic assessment and a quality evaluation compare the synthetic data with the original dataset to check its fidelity. Order IDs are then regenerated so that they are consistent across rows. Finally, the synthetic transaction data is saved to a CSV file for further analysis.

## Reading and Displaying Clean Olist Data

This cell imports the necessary library and reads the cleaned Olist data (created in the previous notebook step) from a CSV file into a pandas DataFrame. It then displays the first few rows of the DataFrame.

In [1]:
# Importing the pandas library as pd
import pandas as pd

# Reading the clean Olist data from a CSV file into a DataFrame
masterdf = pd.read_csv(
    '../data/clean_olist_data.csv'
)

# Displaying the first few rows of the DataFrame
masterdf.head()

Unnamed: 0,order_id,timestamp,user_id,customer_city,product_category,product_id,quantity,price,review_score
0,e481f51cbdc54678b7cc49136f2d6af7,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,housewares,housewares SKU 0,1.0,29.99,4.0
1,53cdb2fc8bc7dce0b6741e2150273451,1532464897,af07308b275d755c9edb36a90c618231,barreiras,perfumery,perfumery SKU 0,1.0,118.7,4.0
2,47770eb9100c2d0c44946d9cf07ec65d,1533717529,3a653a41f6f9fc3d2a113cf8398680e8,vianopolis,auto,auto SKU 0,1.0,159.9,5.0
3,949d5b44dbf5de918fe9c16f97b45f8a,1511033286,7c142cf63193a1473d2e66489a9ae977,sao goncalo do amarante,pet_shop,pet_shop SKU 0,1.0,45.0,5.0
4,ad21c59c0840e6cb83a9ceb5573f8159,1518556719,72632f0f9dd73dfee390c9b22eb56dd6,santo andre,stationery,stationery SKU 0,1.0,19.9,5.0


In [2]:
# Checking the shape of the DataFrame
masterdf.shape

(102425, 9)

In [3]:
# Displaying information about the DataFrame
masterdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102425 entries, 0 to 102424
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   order_id          102425 non-null  object 
 1   timestamp         102425 non-null  int64  
 2   user_id           102425 non-null  object 
 3   customer_city     102425 non-null  object 
 4   product_category  102425 non-null  object 
 5   product_id        102425 non-null  object 
 6   quantity          102425 non-null  float64
 7   price             102425 non-null  float64
 8   review_score      102425 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 7.0+ MB


In [4]:
# Filtering the DataFrame to select rows where the 'order_id' column matches a specific value
masterdf[masterdf['order_id']=='ca3625898fbd48669d50701aba51cd5f']

Unnamed: 0,order_id,timestamp,user_id,customer_city,product_category,product_id,quantity,price,review_score
61603,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 52,1.0,33.9,3.0
61604,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 281,2.0,159.0,3.0
61605,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 146,1.0,309.0,3.0
61606,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 282,1.0,63.7,3.0
61607,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 239,2.0,56.0,3.0
61608,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 38,1.0,109.9,3.0
61609,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 28,1.0,95.9,3.0
61610,ca3625898fbd48669d50701aba51cd5f,1534039880,c8ed31310fc440a3f8031b177f9842c3,ipua,construction_tools_construction,construction_tools_construction SKU 29,1.0,95.9,3.0
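
The filter above shows that a single order can span several rows, one per product. Below is a minimal sketch of how such multi-item orders could be summarised. It uses a small hypothetical DataFrame (`toy`) that borrows column names from `masterdf`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as masterdf
toy = pd.DataFrame({
    'order_id': ['a1', 'a1', 'a1', 'b2', 'c3', 'c3'],
    'product_id': ['auto SKU 0', 'auto SKU 1', 'auto SKU 2',
                   'audio SKU 7', 'pet_shop SKU 0', 'pet_shop SKU 4'],
    'quantity': [1.0, 2.0, 1.0, 1.0, 1.0, 3.0],
})

# Number of rows (line items) and total units per order, largest orders first
order_summary = (
    toy.groupby('order_id')
       .agg(line_items=('product_id', 'size'), units=('quantity', 'sum'))
       .sort_values('line_items', ascending=False)
)
print(order_summary)
```

In the notebook, the same aggregation could presumably be run on `masterdf` directly to see how common multi-item orders are.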


## Generating Metadata for DataFrame with sdv

This cell utilizes the sdv library to automatically detect metadata from the provided DataFrame (`masterdf`). It then validates the detected metadata and converts it into a Python dictionary for further use.

In [5]:
from sdv.metadata import SingleTableMetadata

# Creating SingleTableMetadata object
metadata = SingleTableMetadata()

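# Detecting metadata from the DataFrame. As an optional step we can save it to json for reproducibility.
metadata.detect_from_dataframe(masterdf)

# Validating the detected metadata. validate() returns None on success and
# raises an error when the metadata is invalid, so failures are caught here.
try:
    metadata.validate()
except Exception as error:
    print(f"Problems with metadata auto-detection! {error}")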

# Converting metadata to a dictionary
python_dict = metadata.to_dict()

# Displaying the dictionary
python_dict

{'METADATA_SPEC_VERSION': 'SINGLE_TABLE_V1',
 'columns': {'order_id': {'sdtype': 'unknown', 'pii': True},
  'timestamp': {'sdtype': 'numerical'},
  'user_id': {'sdtype': 'unknown', 'pii': True},
  'customer_city': {'sdtype': 'city', 'pii': True},
  'product_category': {'sdtype': 'categorical'},
  'product_id': {'sdtype': 'unknown', 'pii': True},
  'quantity': {'sdtype': 'numerical'},
  'price': {'sdtype': 'numerical'},
  'review_score': {'sdtype': 'numerical'}}}
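
The comment in the cell above mentions saving the metadata to JSON for reproducibility. Here is a minimal sketch of that optional step using only the standard library. It assumes an abbreviated copy of the dictionary shown above (`metadata_dict`) and a hypothetical file name (`olist_metadata.json`) written to a temporary directory.

```python
import json
import os
import tempfile

# Abbreviated copy of the dictionary shown above (three of the nine columns)
metadata_dict = {
    'METADATA_SPEC_VERSION': 'SINGLE_TABLE_V1',
    'columns': {
        'order_id': {'sdtype': 'unknown', 'pii': True},
        'timestamp': {'sdtype': 'numerical'},
        'product_category': {'sdtype': 'categorical'},
    },
}

# Hypothetical output location; a temporary directory keeps the sketch self-contained
metadata_path = os.path.join(tempfile.mkdtemp(), 'olist_metadata.json')

# Write the dictionary to disk as JSON
with open(metadata_path, 'w', encoding='utf-8') as f:
    json.dump(metadata_dict, f, indent=2)

# Read it back and confirm nothing was lost in the round trip
with open(metadata_path, encoding='utf-8') as f:
    restored = json.load(f)

print(restored == metadata_dict)
```

A dictionary restored this way could then be passed to `SingleTableMetadata.load_from_dict`, which this notebook uses later on `python_dict`.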

## Extracting Columns from Metadata

In this cell, we extract the names of all columns from the metadata obtained using the sdv library. Every column except 'order_id' is collected into a list of columns that will be converted to categorical before fitting.

In [6]:
# Extracting all column names from metadata
all_cols = metadata.get_column_names()

# Identifying columns to be converted to categorical by excluding 'order_id'
# (a list comprehension keeps the original column order, unlike a set difference)
all_cat_cols = [col for col in all_cols if col != 'order_id']

## Generating Synthetic Data with Gaussian Copula Synthesizer and Constraints

This cell uses the Gaussian Copula Synthesizer from the sdv library to generate synthetic data based on the provided metadata. Constraints are applied to maintain fixed combinations of certain columns, such as user and product identifiers. The resulting synthetic data is displayed to provide an overview.

In [7]:
from sdv.single_table import GaussianCopulaSynthesizer

# Updating metadata columns to categorical
metadata.update_columns(
    column_names=all_cat_cols,
    sdtype='categorical',
)

# Initializing GaussianCopulaSynthesizer with metadata and enforcing min-max values
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=True,
)

# Defining user and product constraints
user_constraint = {
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {
        'column_names': ['user_id', 'customer_city']
    }
}
product_constraint = {
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {
        'column_names': ['product_id', 'product_category']
    }
}

# Adding constraints to the synthesizer
synthesizer.add_constraints(constraints=[
    user_constraint,
    product_constraint
])

# Fitting the synthesizer to the original data
synthesizer.fit(masterdf)

# Generating synthetic data
synthetic_data = synthesizer.sample(num_rows=10000000)

# Displaying the first few rows of synthetic data
synthetic_data.head()

Sampling rows: 100%|██████████| 10000000/10000000 [04:48<00:00, 34638.20it/s]


Unnamed: 0,order_id,timestamp,user_id,customer_city,product_category,product_id,quantity,price,review_score
0,sdv-pii-m7p96,1524132612,fd15d3e876d92b03747f4c3cd48d41d8,santa cruz do rio pardo,computers_accessories,computers_accessories SKU 116,1.0,16.99,5.0
1,sdv-pii-g7lj0,1485027982,bacb93ae220562bbec5b93899bf028d3,alvorada,sports_leisure,sports_leisure SKU 261,1.0,109.0,5.0
2,sdv-pii-g2nag,1515171090,9e27d2071b688512d1313394991ccac6,botucatu,sports_leisure,sports_leisure SKU 607,1.0,220.0,5.0
3,sdv-pii-m3eiw,1513341982,d3d7dd472d49050c5f6c1e81f71545f0,tramandai,sports_leisure,sports_leisure SKU 2490,1.0,120.0,5.0
4,sdv-pii-oxl31,1521017076,ca6a81ca06a8d538dee02d3ce5c0d23e,hortolandia,auto,auto SKU 1102,1.0,284.9,5.0
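
The `FixedCombinations` constraints are meant to ensure that the synthetic data only pairs values that already appear together in the real data, for example a `product_id` with its own `product_category`. Below is a minimal sketch of how that property could be checked with pandas alone. The helper `unseen_combinations` and the toy frames are hypothetical and not part of sdv.

```python
import pandas as pd

def unseen_combinations(real, synthetic, columns):
    """Return synthetic column combinations that never occur in the real data."""
    real_combos = real[columns].drop_duplicates()
    synth_combos = synthetic[columns].drop_duplicates()
    # A left merge with indicator=True marks combinations found only in the synthetic data
    merged = synth_combos.merge(real_combos, on=columns, how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only', columns].reset_index(drop=True)

# Hypothetical toy data
real = pd.DataFrame({
    'product_id': ['auto SKU 0', 'audio SKU 7', 'auto SKU 1'],
    'product_category': ['auto', 'audio', 'auto'],
})
synthetic_ok = pd.DataFrame({
    'product_id': ['auto SKU 1', 'auto SKU 1', 'audio SKU 7'],
    'product_category': ['auto', 'auto', 'audio'],
})
synthetic_bad = pd.DataFrame({
    'product_id': ['auto SKU 0'],
    'product_category': ['audio'],  # pairing never seen in the real data
})

pair = ['product_id', 'product_category']
print(len(unseen_combinations(real, synthetic_ok, pair)))   # no unseen pairs
print(len(unseen_combinations(real, synthetic_bad, pair)))  # one unseen pair
```

In the notebook, a call such as `unseen_combinations(masterdf, synthetic_data, ['user_id', 'customer_city'])` would be expected to return an empty frame if the constraint held.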


## Loading and Validating Metadata from Python Dictionary

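In this cell, the metadata for the DataFrame is reloaded from the Python dictionary created earlier (`python_dict`) using the sdv library. This restores the auto-detected sdtypes and discards the categorical overrides applied before fitting, so the evaluations that follow treat numerical columns as numerical. The loaded metadata is then validated, and a message is printed if validation fails.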

In [8]:
from sdv.metadata import SingleTableMetadata

# Loading metadata from the Python dictionary
metadata = SingleTableMetadata.load_from_dict(python_dict)

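# Validating the loaded metadata. validate() returns None on success and
# raises an error when the metadata is invalid, so failures are caught here.
try:
    metadata.validate()
except Exception as error:
    print(f"Problems with resetting metadata detected! {error}")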

## Running Diagnostic Evaluation between Real and Synthetic Data

In this cell, a diagnostic evaluation is performed to compare the characteristics of the real data (`masterdf`) with the synthetic data (`synthetic_data`) generated using the Gaussian Copula Synthesizer. The evaluation is conducted using the provided metadata.

In [9]:
from sdv.evaluation.single_table import run_diagnostic

# Running diagnostic evaluation
diagnostic_report = run_diagnostic(
    real_data=masterdf,
    synthetic_data=synthetic_data,
    metadata=metadata
)

Generating report ...

(1/2) Evaluating Data Validity: |██████████| 9/9 [00:00<00:00, 11.16it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████| 1/1 [00:00<00:00, 433.70it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%



## Evaluating the Quality of Synthetic Data

This cell assesses the quality of synthetic data generated using the Gaussian Copula Synthesizer by comparing it with the real data (`masterdf`). The evaluation is performed using various quality metrics and the provided metadata.

In [10]:
from sdv.evaluation.single_table import evaluate_quality

# Evaluating the quality of synthetic data
quality_report = evaluate_quality(
    real_data=masterdf,
    synthetic_data=synthetic_data,
    metadata=metadata
)

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 9/9 [00:05<00:00,  1.69it/s]|
Column Shapes Score: 99.96%

(2/2) Evaluating Column Pair Trends: |██████████| 36/36 [00:27<00:00,  1.31it/s]|
Column Pair Trends Score: 97.53%

Overall Score (Average): 98.74%



In [11]:
quality_report.get_details(property_name='Column Shapes')

Unnamed: 0,Column,Metric,Score
0,timestamp,KSComplement,0.999641
1,product_category,TVComplement,0.999159
2,quantity,KSComplement,0.999773
3,price,KSComplement,0.999707
4,review_score,KSComplement,0.999783
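
The table above reports `KSComplement` for numerical columns and `TVComplement` for the categorical column. As I understand these metrics, the first is one minus the two-sample Kolmogorov-Smirnov statistic and the second is one minus the total variation distance between category frequencies. The following is a plain-Python sketch of those definitions, not the library's implementation; the function names and sample values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def tv_complement(real, synthetic):
    """1 - total variation distance between two categorical samples."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd

def ks_complement(real, synthetic):
    """1 - two-sample KS statistic (largest gap between the empirical CDFs)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

# Hypothetical samples: category shares of 50/50 versus 75/25
print(tv_complement(['auto', 'auto', 'audio', 'audio'],
                    ['auto', 'auto', 'auto', 'audio']))  # 0.75
# Identical numerical samples score 1.0, fully separated samples score 0.0
print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(ks_complement([1, 2], [3, 4]))              # 0.0
```

Under both definitions a score close to 1, like those in the table, indicates that the synthetic column's marginal distribution closely tracks the real one.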


## Creating Primary Key Order IDs in Synthetic Data

This cell replaces the placeholder order IDs in the synthetic data. It concatenates the 'timestamp' and 'user_id' columns, hashes the combined value with SHA-256, and stores the first 32 hexadecimal characters as the new 'order_id'. Rows that share a timestamp and a user therefore share an order ID. Any duplicate rows in the synthetic data are then removed to ensure data integrity.

In [12]:
import hashlib

# Combining 'timestamp' and 'user_id' columns and hashing the combined values to create 'order_id'
synthetic_data['combined_values'] = synthetic_data['timestamp'].astype(str) + '-' + synthetic_data['user_id']
synthetic_data['order_id'] = synthetic_data['combined_values'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest()[:32])

# Dropping the 'combined_values' column
synthetic_data = synthetic_data.drop(columns=['combined_values'])

# Dropping duplicate rows from the synthetic data
synthetic_data.drop_duplicates(inplace=True)
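
The hashing step above is deterministic, which is what makes the order IDs consistent: the same timestamp and user always map to the same ID. Here is a small standalone sketch of that behaviour. The helper name `make_order_id` is hypothetical, and the inputs are values that appear in the data shown earlier.

```python
import hashlib

def make_order_id(timestamp, user_id):
    """Derive a 32-character hex ID from a timestamp and a user ID."""
    combined = f'{timestamp}-{user_id}'
    return hashlib.sha256(combined.encode()).hexdigest()[:32]

first = make_order_id(1506941793, '7c396fd4830fd04220f754e42b4e5bff')
again = make_order_id(1506941793, '7c396fd4830fd04220f754e42b4e5bff')
other = make_order_id(1532464897, '7c396fd4830fd04220f754e42b4e5bff')

print(first == again)  # same inputs give the same ID
print(first == other)  # a different timestamp gives a different ID
print(len(first))      # the digest is truncated to 32 hex characters
```

Truncating the digest to 32 characters keeps the new IDs the same length as the original Olist order IDs.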

In [17]:
synthetic_data[['timestamp', 'user_id', 'customer_city']].value_counts().reset_index().loc[0]

timestamp                              1506941793
user_id          7c396fd4830fd04220f754e42b4e5bff
customer_city                           sao paulo
count                                          45
Name: 0, dtype: object

In [18]:
synthetic_data[synthetic_data["user_id"]=="7c396fd4830fd04220f754e42b4e5bff"]

Unnamed: 0,order_id,timestamp,user_id,customer_city,product_category,product_id,quantity,price,review_score
110751,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,bed_bath_table,bed_bath_table SKU 33,1.0,109.99,3.0
196825,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,watches_gifts,watches_gifts SKU 21,1.0,45.00,1.0
517865,ddd6450cc8df3c500dd41340d19e3116,1532464897,7c396fd4830fd04220f754e42b4e5bff,sao paulo,food_drink,food_drink SKU 1,1.0,13.99,5.0
521382,dcba0d61b94f35b2368988dfa8bb6f0c,1518556719,7c396fd4830fd04220f754e42b4e5bff,sao paulo,bed_bath_table,bed_bath_table SKU 70,1.0,39.00,5.0
815486,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,UNKNOWN,UNKNOWN SKU 4,1.0,689.89,5.0
...,...,...,...,...,...,...,...,...,...
9626632,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,computers_accessories,computers_accessories SKU 144,1.0,20.99,3.0
9749406,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,audio,audio SKU 7,1.0,25.00,3.0
9866735,807bc46bd8f8141127cedda2bb9e46da,1506941793,7c396fd4830fd04220f754e42b4e5bff,sao paulo,housewares,housewares SKU 5,1.0,63.90,4.0
9877666,c2f785e9c847f4e46f7b1263c557d8bf,1511033286,7c396fd4830fd04220f754e42b4e5bff,sao paulo,stationery,stationery SKU 282,1.0,56.00,5.0


In [19]:
len(synthetic_data)

10000000
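
Because several rows can share an order ID, `order_id` identifies an order rather than a single row. Here is a minimal sketch of a consistency check: every order ID should correspond to exactly one timestamp and one user. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: order 'o1' has two line items
toy = pd.DataFrame({
    'order_id': ['o1', 'o1', 'o2', 'o3'],
    'timestamp': [100, 100, 200, 300],
    'user_id': ['u1', 'u1', 'u1', 'u2'],
})

# Count distinct timestamps and users per order; more than one signals an inconsistency
per_order = toy.groupby('order_id')[['timestamp', 'user_id']].nunique()
inconsistent = per_order[(per_order > 1).any(axis=1)]

print(len(inconsistent))                   # number of inconsistent orders
print(toy['order_id'].nunique(), len(toy)) # distinct orders versus rows
```

Running the same check on `synthetic_data` would be one way to confirm that the regenerated IDs group rows as intended.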

## Saving Synthetic Transaction Data to CSV

This cell saves the synthetic transaction data stored in the DataFrame `synthetic_data` to a CSV file named 'all_transaction_data.csv' in the '../data/' directory. The index column is excluded from the CSV file.

In [20]:
synthetic_data.to_csv('../data/all_transaction_data.csv', index=False)

By leveraging sdv, I've generated a synthetic dataset that preserves key statistical properties and relationships present in the original data. However, it's essential to carefully evaluate the quality and integrity of synthetic data before use, to confirm that it reflects the underlying patterns of the real-world data. In subsequent notebooks, this dataset will serve as the foundation for constructing rank-based recommendation models. It will also be used to supply data in response to user queries in later stages.