

<div align="center">
<p align="center">


# 🚀 Synthetic Data Generator

</p>
</div>
The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.

Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.

High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.

In [None]:
# install from git
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

We demonstrate with a single-table data synthesis example.

In [2]:
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata

# 1. Load data

The dataset used in this demonstration is a risk-control dataset for predicting whether an individual will default on a loan. It contains the following features:

| Column name | Meaning |
|-----------------------|-----------------------|
| loan_id | loan ID |
| user_id | user ID |
| total_loan | Total loan amount |
| year_of_loan | Loan period |
...

This code shows the process of loading real data:

In [3]:
# In the future, this functionality will be integrated into `sdgx.processor`
import os
import requests


def download_file(url, path):
    response = requests.get(url)
    if response.status_code == 200:
        with open(path, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully to {path}")
    else:
        print(f"Failed to download file from {url}")

# Download the dataset from GitHub
# This dataset can also be downloaded through sdgx.utils
dataset_url = "https://raw.githubusercontent.com/aialgorithm/Blog/master/projects/一文梳理风控建模全流程/train_internet.csv"
file_path = 'train_internet.csv'

if not os.path.exists(file_path):
    download_file(dataset_url, file_path)

In [4]:
from pathlib import Path

path_obj = Path(file_path)

# Create a data connector and data loader for csv data
data_connector = CsvConnector(path=path_obj)
data_loader = DataLoader(data_connector)

# 2. Create Metadata from DataLoader

sdgx supports creating metadata from a pd.DataFrame or a DataLoader. It also supports building metadata from scratch, starting from blank metadata, but this is not recommended because it is more laborious.

In this example, we use `from_dataloader` to create the first Metadata.

In [5]:
loan_metadata = Metadata.from_dataloader(data_loader)

2024-03-25 17:10:54.848 | INFO     | sdgx.data_models.metadata:from_dataloader:280 - Inspecting metadata...
2024-03-25 17:10:58.441 | INFO     | sdgx.data_models.metadata:update_primary_key:482 - Primary Key updated: {'loan_id', 'user_id'}.


Let's first look at some commonly used member variables of Metadata.

The most important and most commonly used one is `column_list`, which shows the column information. It returns a list of column names in the same order as the columns of the actual table.

In [6]:
loan_metadata.column_list

['loan_id',
 'user_id',
 'total_loan',
 'year_of_loan',
 'interest',
 'monthly_payment',
 'class',
 'sub_class',
 'work_type',
 'employer_type',
 'industry',
 'work_year',
 'house_exist',
 'house_loan_status',
 'censor_status',
 'marriage',
 'offsprings',
 'issue_date',
 'use',
 'post_code',
 'region',
 'debt_loan_ratio',
 'del_in_18month',
 'scoring_low',
 'scoring_high',
 'pub_dero_bankrup',
 'early_return',
 'early_return_amount',
 'early_return_amount_3mon',
 'recircle_b',
 'recircle_u',
 'initial_list_status',
 'earlies_credit_mon',
 'title',
 'policy_code',
 'f0',
 'f1',
 'f2',
 'f3',
 'f4',
 'f5',
 'is_default']

# 3. Use Inspectors to automatically label column types

Currently, when sdgx's Metadata module creates metadata from a pd.DataFrame or a DataLoader, it loads Inspectors, automatically scans part of the data (not all of it), and labels the columns in the table according to the logic of each Inspector.

Automatic inference is supported for multiple data types. sdgx supports the following basic types:
- bool
- int
- float
- datetime
- discrete
- id

The basic data types guarantee that every column is labeled with one of these data types.
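To make the idea of sample-based labeling concrete, here is a minimal sketch in plain Python. It is not how sdgx's Inspectors are implemented: the function names `infer_basic_type` and `label_columns`, the sample size, the inference rules, and the example rows are all assumptions made for illustration.

```python
from itertools import islice


def infer_basic_type(values):
    """Guess a basic type for one column from a sample of string values (toy rules)."""
    values = [v for v in values if v != ""]  # ignore empty cells
    if not values:
        return "discrete"
    if all(v.lower() in {"true", "false"} for v in values):
        return "bool"
    try:
        for v in values:
            int(v)
        return "int"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "float"
    except ValueError:
        # Fallback: every column still receives one basic type.
        return "discrete"


def label_columns(rows, sample_size=100):
    """Label every column using only the first `sample_size` rows, not all data."""
    sample = list(islice(rows, sample_size))
    columns = list(sample[0].keys()) if sample else []
    return {col: infer_basic_type([row[col] for row in sample]) for col in columns}


# Hypothetical rows, shaped like the output of csv.DictReader.
rows = [
    {"total_loan": "12000", "interest": "11.47", "work_type": "engineer"},
    {"total_loan": "8000", "interest": "9.2", "work_type": "teacher"},
]
labels = label_columns(iter(rows))
print(labels)
```

The fallback to `discrete` mirrors the guarantee stated above: whatever the sample looks like, each column ends up with exactly one basic type in this sketch.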

sdgx also supports the following data types, and the sdgx team will continue to add data types:

- english_name
- email
- china_mainland_mobile_phone
- china_mainland_id
- china_mainland_postcode
- unified_social_credit_code
- china_mainland_address
- chinese_name

To query the columns of a given data type, use `.{data_type}_columns`. For example, access the datetime columns through `.datetime_columns` and the english_name columns through `.english_name_columns`.

Below, we access the discrete columns through `.discrete_columns`, which returns the set of column names that are considered **discrete** columns.


In [7]:
# Automatically infer discrete columns
loan_metadata.discrete_columns

{'class',
 'earlies_credit_mon',
 'employer_type',
 'industry',
 'issue_date',
 'sub_class',
 'work_type',
 'work_year'}

Similarly, we can view `int_columns`, `bool_columns` and other columns as follows:

In [8]:
# No Bool columns in current tabular data.
loan_metadata.bool_columns

set()

In [9]:
# check int columns
loan_metadata.int_columns

{'censor_status',
 'del_in_18month',
 'early_return',
 'early_return_amount',
 'early_return_amount_3mon',
 'f0',
 'f1',
 'f2',
 'f3',
 'f4',
 'f5',
 'house_exist',
 'house_loan_status',
 'initial_list_status',
 'is_default',
 'loan_id',
 'marriage',
 'offsprings',
 'policy_code',
 'post_code',
 'pub_dero_bankrup',
 'recircle_b',
 'region',
 'scoring_high',
 'scoring_low',
 'title',
 'total_loan',
 'use',
 'user_id',
 'year_of_loan'}
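Because every typed set follows the same `.{data_type}_columns` naming pattern, the lookups can also be done in a loop with `getattr`. The sketch below uses a `SimpleNamespace` as a stand-in for a Metadata object, so it runs without sdgx; the helper `columns_by_type` is hypothetical, and whether a real Metadata object returns an empty set for an unknown type is an assumption of this sketch.

```python
from types import SimpleNamespace

# Stand-in for a Metadata object; only the attribute naming pattern matters here.
# The column names are a small subset of the outputs shown in this notebook.
fake_metadata = SimpleNamespace(
    int_columns={"f0", "total_loan"},
    bool_columns=set(),
    discrete_columns={"class", "work_type"},
)


def columns_by_type(metadata, data_types):
    """Collect `<data_type>_columns` for each type; missing attributes give an empty set."""
    return {t: set(getattr(metadata, f"{t}_columns", set())) for t in data_types}


result = columns_by_type(fake_metadata, ["int", "bool", "discrete", "email"])
for data_type, cols in result.items():
    print(data_type, sorted(cols))
```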

Use `datetime_columns` to view the datetime columns. Note that the datetime type requires formats to be added before data processing: **datetime formats need to completely correspond to datetime columns**.

For specific operations, please refer to the manual interface of metadata below.

In [10]:
loan_metadata.datetime_columns

{'earlies_credit_mon', 'issue_date'}

⚠️ Inspectors work well in most cases, but they may not cover every data type that appears in tabular data, and some columns may be labeled incompletely.

Therefore, before training a model or processing the data further, we still recommend that data analysts **check** the data type labels of all columns.

# 4. Understand the inspect_level mechanism in Metadata
 
Because Metadata runs multiple Inspectors when it is created, the same column may be labeled more than once. For example, a column may be marked as both PostCode and discrete, when it is in fact a post code column.

We use `inspect_level` to solve this problem. Different Inspectors have different inspect levels, and the final label of a column is the one with the higher inspect level.

In [11]:
# `column_inspect_level` records the inspect_level values of all inspectors
# the default inspect_level is 10
loan_metadata.column_inspect_level

defaultdict(<function sdgx.data_models.metadata.Metadata.<lambda>()>,
            {'id_columns': 20,
             'int_columns': 10,
             'float_columns': 10,
             'bool_columns': 10,
             'datetime_columns': 20,
             'discrete_columns': 10,
             'china_mainland_mobile_phone_columns': 30,
             'unified_social_credit_code_columns': 30,
             'china_mainland_id_columns': 30,
             'china_mainland_postcode_columns': 20,
             'email_columns': 30})
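The resolution rule can be sketched in a few lines of plain Python. This is an illustration, not sdgx's code: `resolve_type` is a hypothetical helper, the levels and tags are copied from the outputs in this notebook (a subset), and the behavior on ties is an assumption (here `max` keeps the first candidate it sees).

```python
# Inspect levels copied from the `column_inspect_level` output (subset).
inspect_level = {
    "id_columns": 20,
    "int_columns": 10,
    "float_columns": 10,
    "datetime_columns": 20,
    "discrete_columns": 10,
}

# Tags copied from the earlier outputs: two columns carry both labels.
tags = {
    "discrete_columns": {"issue_date", "earlies_credit_mon", "work_type"},
    "datetime_columns": {"issue_date", "earlies_credit_mon"},
}


def resolve_type(column, tags, inspect_level, default_level=10):
    """Return the label with the highest inspect level among those containing `column`."""
    candidates = [key for key, cols in tags.items() if column in cols]
    if not candidates:
        return None
    best = max(candidates, key=lambda key: inspect_level.get(key, default_level))
    return best[: -len("_columns")]  # "datetime_columns" -> "datetime"


print(resolve_type("issue_date", tags, inspect_level))  # datetime
print(resolve_type("work_type", tags, inspect_level))   # discrete
```

`issue_date` is tagged both discrete (level 10) and datetime (level 20), so the datetime label wins.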

# 5. Metadata manual interface

Metadata supports the following manual interfaces, which let you fine-tune column labels one by one as needed:
- query: Query the tags of a certain column.
- get: Get all tags by key.
- set: Set tags; the value is converted to a set if it is not already a set.
- add: Add tags.
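The intended semantics of these four operations can be illustrated with a toy class. This is a sketch under assumptions, not the sdgx implementation: `ToyTags` is hypothetical, and details such as wrapping a single string in a set or returning a copy from `get` are choices made for this example only.

```python
from collections import defaultdict


class ToyTags:
    """Toy stand-in for the manual tagging interface; not the sdgx implementation."""

    def __init__(self):
        # key such as "id_columns" -> set of column names
        self._tags = defaultdict(set)

    @staticmethod
    def _as_set(value):
        # Wrap a single column name; convert other iterables to a set.
        return {value} if isinstance(value, str) else set(value)

    def set(self, key, value):
        # Replace the whole tag set for `key`.
        self._tags[key] = self._as_set(value)

    def add(self, key, value):
        # Extend the existing tag set for `key`.
        self._tags[key] |= self._as_set(value)

    def get(self, key):
        # Return a copy of all columns tagged with `key`.
        return set(self._tags.get(key, set()))

    def query(self, column):
        # Return every key whose tag set contains `column`.
        return {key for key, cols in self._tags.items() if column in cols}


toy = ToyTags()
toy.set("id_columns", ["loan_id", "user_id"])  # list is converted to a set
toy.set("id_columns", {"loan_id"})             # set replaces the previous value
toy.add("int_columns", "user_id")
toy.add("int_columns", "loan_id")

print(toy.get("id_columns"))
print(sorted(toy.query("loan_id")))
```

The key difference to keep in mind is that `set` overwrites the tag set while `add` extends it.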



In [12]:
loan_metadata.set('id_columns', {'loan_id'})

loan_metadata.id_columns

{'loan_id'}

Note that currently only the datetime type requires formats to be added. Before data processing, every datetime column needs a matching entry in the datetime formats; otherwise the column will be deleted during data preprocessing. Other data types do not need this.

In [13]:
# datetime_format has no content, which will cause an error in the subsequent process.
loan_metadata.datetime_format

defaultdict(str, {})

The four basic methods above apply only to column tags.

Because `datetime_format` is a dict, it is recommended to assign it directly.

In [14]:
datetime_format = {
    'issue_date': '%Y-%m-%d',
    'earlies_credit_mon': '%b-%Y'
}
loan_metadata.datetime_format = datetime_format

loan_metadata.datetime_format

{'issue_date': '%Y-%m-%d', 'earlies_credit_mon': '%b-%Y'}
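Since every datetime column needs a matching format, it can help to check the mapping before preprocessing. The sketch below uses only the standard library; `check_datetime_formats` is a hypothetical helper, and the sample values are made up for illustration (real values should come from the loaded table). Note that `%b` parsing depends on the active locale; English month abbreviations are assumed here.

```python
from datetime import datetime

# Column names and formats as used in this notebook.
datetime_columns = {"earlies_credit_mon", "issue_date"}
datetime_format = {
    "issue_date": "%Y-%m-%d",
    "earlies_credit_mon": "%b-%Y",
}

# Hypothetical sample values, not taken from the real dataset.
samples = {
    "issue_date": ["2016-10-01", "2013-06-01"],
    "earlies_credit_mon": ["Dec-2001", "Apr-1990"],
}


def check_datetime_formats(columns, formats, samples):
    """Return problems: columns without a format, or whose samples fail to parse."""
    problems = {}
    for col in sorted(columns):
        fmt = formats.get(col)
        if not fmt:
            problems[col] = "missing format"
            continue
        for value in samples.get(col, []):
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                problems[col] = f"{value!r} does not match {fmt!r}"
                break
    return problems


ok = check_datetime_formats(datetime_columns, datetime_format, samples)
incomplete = check_datetime_formats(datetime_columns, {"issue_date": "%Y-%m-%d"}, samples)
print(ok)
print(incomplete)
```

An empty result means every datetime column has a format and the sampled values parse with it.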

# 6. Get the exact data type of each column

We provide the `get_column_data_type` method to query the final data type of each column:

In [15]:
loan_metadata.get_column_data_type("f0")

'int'

In [16]:
loan_metadata.get_column_data_type("recircle_u")

'float'

If you need the exact data type of every column, you can combine it with the `.column_list` attribute:

In [17]:
for each_col in loan_metadata.column_list:
    print(f'{each_col}: {loan_metadata.get_column_data_type(each_col)}')

loan_id: id
user_id: int
total_loan: int
year_of_loan: int
interest: float
monthly_payment: float
class: discrete
sub_class: discrete
work_type: discrete
employer_type: discrete
industry: discrete
work_year: discrete
house_exist: int
house_loan_status: int
censor_status: int
marriage: int
offsprings: int
issue_date: datetime
use: int
post_code: int
region: int
debt_loan_ratio: float
del_in_18month: int
scoring_low: int
scoring_high: int
pub_dero_bankrup: int
early_return: int
early_return_amount: int
early_return_amount_3mon: int
recircle_b: int
recircle_u: float
initial_list_status: int
earlies_credit_mon: datetime
title: int
policy_code: int
f0: int
f1: int
f2: int
f3: int
f4: int
f5: int
is_default: int
