
<div align="center">
  <img src="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/sdg_logo.png?raw=true" width="400" >
</div>
<div align="center">





# 🚀 Synthetic data generation without raw data using an LLM




The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

In [None]:
# install dependencies
# !pip install sdgx
# OR
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Collecting git+https://github.com/hitsz-ids/synthetic-data-generator.git
  Cloning https://github.com/hitsz-ids/synthetic-data-generator.git to /tmp/pip-req-build-mn8ro352
  Running command git clone --filter=blob:none --quiet https://github.com/hitsz-ids/synthetic-data-generator.git /tmp/pip-req-build-mn8ro352
  Resolved https://github.com/hitsz-ids/synthetic-data-generator.git to commit 54ee09b185ee33d5d7027f456c1bd263c8e571d8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting faker>=10 (from sdgx==0.1.6.dev0)
  Downloading Faker-23.2.1-py3-none-any.whl (1.7 MB)
Collecting loguru (from sdgx==0.1.6.dev0)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)

We demonstrate with a single-table synthesis example.

# LLM-integrated synthetic data generation

Large language models (LLMs) have long been used to understand and generate various types of data.

LLMs can also generate tabular data, and they offer some capabilities that traditional approaches (GAN-based or statistical models) cannot.



In [None]:
# Set your OpenAI API key and base URL here:

OPEN_AI_KEY = ""
OPEN_AI_BASE = "https://api.openai.com/v1/"
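Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the same two values could instead be read from the environment. The variable names `OPENAI_API_KEY` and `OPENAI_API_BASE` and the helper `load_openai_settings` are assumptions made for this illustration; they are not part of sdgx.

```python
import os

DEFAULT_BASE = "https://api.openai.com/v1/"


def load_openai_settings(env=None):
    """Return (base_url, api_key) read from environment-style variables.

    OPENAI_API_KEY and OPENAI_API_BASE are assumed names for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    base = env.get("OPENAI_API_BASE", DEFAULT_BASE)
    return base, key


OPEN_AI_BASE, OPEN_AI_KEY = load_openai_settings()

if not OPEN_AI_KEY:
    # An empty key is allowed here, but later API calls will be rejected.
    print("OPENAI_API_KEY is not set.")
```

The returned pair is in the same order as the arguments passed to `model.set_openAI_settings(OPEN_AI_BASE, OPEN_AI_KEY)` in this notebook.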

In [None]:
# import packages

import pandas as pd
from sdgx.utils import download_demo_data
from sdgx.data_models.metadata import Metadata
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel

# read the demo data
# currently we use the well-known adult dataset as an example
data_path = download_demo_data()
df = pd.read_csv(data_path)
metadata = Metadata.from_dataframe(df)

2024-02-27 08:04:47.112 | INFO     | sdgx.utils:download_demo_data:68 - Downloading demo data from github data source to /content/dataset/adult.csv



# Synthetic data generation without raw data


Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements “Synthetic data generation without Raw Data”.

No training data is required: synthetic data can be generated from the metadata alone.

![LLM_Case_1](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/LLM_Case_1.gif?raw=true)
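To make "metadata only" concrete, here is a rough sketch of how a prompt could be assembled from nothing but column names. This is not the prompt sdgx actually sends: `build_prompt` is a hypothetical helper, and the "`column is value`" line format is borrowed from the GPT response shown at the end of this notebook.

```python
def build_prompt(columns, n_rows):
    """Build a metadata-only prompt: column names go in, no data rows.

    Hypothetical helper for illustration; sdgx builds its own messages.
    """
    column_list = ", ".join(columns)
    row_format = ", ".join(f"{name} is <value>" for name in columns)
    return (
        f"Generate {n_rows} rows of synthetic tabular data "
        f"with these columns: {column_list}.\n"
        f"Write one row per line in the form: {row_format}"
    )


# A few column names from the adult dataset used in this notebook.
columns = ["age", "workclass", "education", "income"]
prompt = build_prompt(columns, n_rows=3)
print(prompt)
```

Because the prompt carries only the schema, no record from the original table ever leaves the machine.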

In [None]:
model = SingleTableGPTModel()
model.set_openAI_settings(OPEN_AI_BASE, OPEN_AI_KEY)
model.gpt_model = "gpt-3.5-turbo"

In [None]:
model.fit(metadata)
# this may take a while
sampled_data = model.sample(30)

2024-02-27 08:05:01.632 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:228 - Fitting model with metadata...
2024-02-27 08:05:01.640 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:232 - Fitting model with metadata... Finished.
2024-02-27 08:05:01.643 | INFO     | sdgx.models.LLM.single_table.gpt:sample:385 - Sampling use GPT model ...
2024-02-27 08:05:01.647 | INFO     | sdgx.models.LLM.single_table.gpt:_sample_with_metadata:446 - Sampling with metadata.
2024-02-27 08:05:01.652 | INFO     | sdgx.models.LLM.base:_form_dataset_description:122 - No dataset_description given in current model.
2024-02-27 08:05:01.656 | INFO     | sdgx.models.LLM.base:_form_message_with_o

In [None]:
sampled_data

| | educational-num | income | occupation | relationship | workclass | fnlwgt | gender | hours-per-week | capital-loss | marital-status | capital-gain | education | age | native-country | race |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | <=50K | Sales | Not-in-family | Private | 234721 | Female | 40 | 0 | Never-married | 2174 | HS-grad | 25 | United-States | White |
| 1 | 10 | <=50K | Craft-repair | Husband | Self-emp-not-inc | 338409 | Male | 13 | 0 | Married-civ-spouse | 0 | Some-college | 38 | United-States | White |
| 2 | 13 | >50K | Exec-managerial | Wife | Private | 284582 | Female | 40 | 0 | Married-civ-spouse | 0 | Bachelors | 28 | Cuba | Black |
| 3 | 10 | <=50K | Other-service | Not-in-family | Private | 160187 | Female | 40 | 0 | Never-married | 0 | 10th | 32 | United-States | Black |
| 4 | 12 | <=50K | Adm-clerical | Own-child | Private | 209642 | Male | 40 | 0 | Never-married | 0 | HS-grad | 18 | United-States | White |
| 5 | 6 | <=50K | Handlers-cleaners | Unmarried | Private | 45781 | Female | 30 | 0 | Never-married | 0 | 11th | 29 | United-States | Black |
| 6 | 9 | <=50K | Sales | Other-relative | Self-emp-not-inc | 159449 | Male | 40 | 0 | Not-in-family | 0 | HS-grad | 34 | United-States | White |
| 7 | 10 | <=50K | Craft-repair | Unmarried | Private | 280464 | Male | 45 | 0 | Never-married | 0 | Some-college | 63 | United-States | White |
| 8 | 4 | <=50K | Other-service | Own-child | Local-gov | 141297 | Female | 35 | 0 | Never-married | 0 | 7th-8th | 24 | United-States | White |
| 9 | 13 | >50K | Prof-specialty | Husband | Private | 122272 | Male | 60 | 0 | Married-civ-spouse | 7688 | Bachelors | 59 | United-States | White |
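An LLM is not guaranteed to respect each column's categories. In the sample above, row 6 has `marital-status` set to `Not-in-family`, which looks like a `relationship` value. The sketch below shows one way such values could be flagged. `find_invalid_values` is a hypothetical helper, and the allowed set is an assumption taken from the public description of the adult dataset, not from sdgx metadata.

```python
def find_invalid_values(rows, allowed):
    """Return (row_position, column, value) for values outside the allowed set.

    rows: list of dicts, one per sampled record.
    allowed: dict mapping a column name to its set of valid categories.
    """
    problems = []
    for position, row in enumerate(rows):
        for column, valid in allowed.items():
            value = row.get(column)
            if value not in valid:
                problems.append((position, column, value))
    return problems


# Assumed category list for the adult dataset's marital-status column.
allowed = {
    "marital-status": {
        "Married-civ-spouse", "Divorced", "Never-married", "Separated",
        "Widowed", "Married-spouse-absent", "Married-AF-spouse",
    },
}

# Two columns of rows 5 and 6 from the sampled table above.
rows = [
    {"relationship": "Unmarried", "marital-status": "Never-married"},
    {"relationship": "Other-relative", "marital-status": "Not-in-family"},
]

print(find_invalid_values(rows, allowed))
```

A pandas DataFrame such as `sampled_data` can be converted to this row format with `sampled_data.to_dict("records")`.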


The raw responses returned by GPT are available through the model's `_responses` attribute.

In [None]:
model._responses

[ChatCompletion(id='chatcmpl-8wmkQ05gBA2dYa6BPAjzaKQrlgCop', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='educational-num is 12, income is <=50K, occupation is Sales, relationship is Not-in-family, workclass is Private, fnlwgt is 234721, gender is Female, hours-per-week is 40, capital-loss is 0, marital-status is Never-married, capital-gain is 2174, education is HS-grad, age is 25, native-country is United-States, race is White\neducational-num is 10, income is <=50K, occupation is Craft-repair, relationship is Husband, workclass is Self-emp-not-inc, fnlwgt is 338409, gender is Male, hours-per-week is 13, capital-loss is 0, marital-status is Married-civ-spouse, capital-gain is 0, education is Some-college, age is 38, native-country is United-States, race is White\neducational-num is 13, income is >50K, occupation is Exec-managerial, relationship is Wife, workclass is Private, fnlwgt is 284582, gender is Female, hours-per-week is 4
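The response text above lists one record per line as comma-separated "`column is value`" pairs. As an illustration of how such text maps back to table rows, here is a naive parsing sketch. `parse_response_text` is a hypothetical helper, not the parser sdgx uses, and it assumes that no value contains `", "` or `" is "`.

```python
def parse_response_text(content):
    """Parse lines of 'column is value, column is value' into a list of dicts.

    Naive sketch: assumes values never contain ', ' or ' is '.
    """
    rows = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        row = {}
        for field in line.split(", "):
            name, separator, value = field.partition(" is ")
            if separator:  # skip fragments that lack the ' is ' separator
                row[name.strip()] = value.strip()
        if row:
            rows.append(row)
    return rows


# Shortened excerpt of the first two records in the response above.
content = (
    "educational-num is 12, income is <=50K, occupation is Sales\n"
    "educational-num is 10, income is <=50K, occupation is Craft-repair"
)

for row in parse_response_text(content):
    print(row)
```

All values come back as strings; numeric columns would still need converting before analysis.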