
<div align="center">
  <img src="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/sdg_logo.png?raw=true" width="400" >
</div>
<div align="center">





# 🚀 Synthetic data generation without Raw Data using LLM




The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data is generated by machines from real data, metadata, and algorithms. It contains no sensitive information, yet it retains the essential characteristics of the original data. Because synthetic records have no direct correspondence to real records, synthetic data is exempt from privacy regulations such as GDPR and ADPPA, which eliminates the risk of privacy breaches in practical applications.

In [1]:
# install dependencies
# !pip install sdgx
# OR
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Collecting git+https://github.com/hitsz-ids/synthetic-data-generator.git
  Cloning https://github.com/hitsz-ids/synthetic-data-generator.git to /tmp/pip-req-build-3lmjup65
  Running command git clone --filter=blob:none --quiet https://github.com/hitsz-ids/synthetic-data-generator.git /tmp/pip-req-build-3lmjup65
  Resolved https://github.com/hitsz-ids/synthetic-data-generator.git to commit 54ee09b185ee33d5d7027f456c1bd263c8e571d8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting faker>=10 (from sdgx==0.1.6.dev0)
  Downloading Faker-23.2.1-py3-none-any.whl (1.7 MB)
Collecting loguru (from sdgx==0.1.6.dev0)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)

We demonstrate with a single-table synthesis example.
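Before running the example, it can help to confirm that the package was installed. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is made up for illustration, and `sdgx` is assumed to be the distribution name, as it appears in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "sdgx" is the distribution name used by the pip command above.
print(installed_version("sdgx"))
```

If this prints `None`, the install step did not complete in the current environment.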

# LLM-integrated synthetic data generation

LLMs have long been used to understand and generate various types of data.

LLMs can also generate tabular data, and they offer some capabilities that traditional approaches (GAN-based or statistical models) cannot.



In [2]:
# Set your OpenAI API key here:

OPEN_AI_KEY = ""
OPEN_AI_BASE = "https://api.openai.com/v1/"
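To avoid hard-coding a key in the notebook, the two values can be read from environment variables instead. This is a minimal sketch: the helper `read_setting` and the variable names `OPENAI_API_KEY` and `OPENAI_API_BASE` are assumptions chosen for illustration, not names that sdgx itself requires.

```python
import os


def read_setting(name: str, default: str = "") -> str:
    """Return the environment variable `name`, or `default` if it is unset or empty."""
    return os.environ.get(name) or default


# Assumed variable names; adjust them to match your own environment.
OPEN_AI_KEY = read_setting("OPENAI_API_KEY")
OPEN_AI_BASE = read_setting("OPENAI_API_BASE", "https://api.openai.com/v1/")
```

With this approach the key stays out of the saved notebook, and the base URL falls back to the same default used in the cell above.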

In [3]:
# import packages

import pandas as pd
from sdgx.utils import download_demo_data
from sdgx.data_models.metadata import Metadata
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel

# read the demo data
# currently we use the well-known adult dataset as an example
data_path = download_demo_data()
df = pd.read_csv(data_path)
metadata = Metadata.from_dataframe(df)

2024-02-27 09:35:33.841 | INFO     | sdgx.utils:download_demo_data:68 - Downloading demo data from github data source to /content/dataset/adult.csv



# Synthetic data generation without Raw Data


Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements “Synthetic data generation without Raw Data”.

No training data is required; synthetic data can be generated from the metadata alone.

![LLM_Case_1](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/LLM_Case_2.gif?raw=true)
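To make the idea of metadata-only generation concrete, the sketch below turns a hand-written column description into a text prompt without touching any data rows. It is purely illustrative: `describe_columns`, `build_prompt`, the column kinds, and the prompt wording are all hypothetical and do not reflect how `SingleTableGPTModel` actually builds its messages.

```python
def describe_columns(columns: dict) -> str:
    """Render {column name: kind} as a comma-separated description."""
    return ", ".join(f"{name} ({kind})" for name, kind in columns.items())


def build_prompt(columns: dict, n_rows: int, extra_features=()) -> str:
    """Build a hypothetical prompt asking an LLM for rows that match the metadata."""
    prompt = (
        f"Generate {n_rows} rows of tabular data with these columns: "
        f"{describe_columns(columns)}."
    )
    if extra_features:
        # Extra columns that are absent from the original table.
        prompt += " Also include these additional columns: " + ", ".join(extra_features) + "."
    return prompt


# Column kinds are written by hand here; no data rows are involved.
adult_columns = {"age": "numeric", "workclass": "categorical", "income": "categorical"}
print(build_prompt(adult_columns, 5, extra_features=["has_car"]))
```

The point of the sketch is that only column names and kinds reach the prompt, so nothing from the raw table needs to be shared with the model.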

In [4]:
model = SingleTableGPTModel()
model.set_openAI_settings(OPEN_AI_BASE, OPEN_AI_KEY)
model.gpt_model = "gpt-3.5-turbo"

In [5]:
model.fit(metadata)
# this may take a while
model.sample(30, off_table_features=['has_car'])

2024-02-27 09:35:46.877 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:228 - Fitting model with metadata...
2024-02-27 09:35:46.888 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:232 - Fitting model with metadata... Finished.
2024-02-27 09:35:46.896 | INFO     | sdgx.models.LLM.single_table.gpt:sample:385 - Sampling use GPT model ...
2024-02-27 09:35:46.899 | INFO     | sdgx.models.LLM.single_table.gpt:_sample_with_metadata:446 - Sampling with metadata.
2024-02-27 09:35:46.909 | INFO     | sdgx.models.LLM.base:_form_dataset_description:122 - No dataset_description given in current model.
2024-02-27 09:35:46.911 | INFO     | sdgx.models.LLM.base:_form_message_with_o

| | capital-loss | educational-num | workclass | fnlwgt | hours-per-week | capital-gain | gender | age | race | marital-status | education | income | native-country | occupation | relationship |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1500 | 12 | Private | 50000 | 40 | 2000 | Male | 35 | White | Married-civ-spouse | Bachelors | >50K | United-States | Sales | Husband |
| 1 | 0 | 10 | Self-emp-not-inc | 60000 | 45 | 0 | Female | 28 | Black | Never-married | HS-grad | <=50K | United-States | Tech-support | Not-in-family |
| 2 | 2000 | 8 | State-gov | 70000 | 35 | 0 | Male | 45 | Asian-Pac-Islander | Divorced | 11th | <=50K | China | Exec-managerial | Unmarried |
| 3 | 0 | 14 | Private | 80000 | 50 | 5000 | Female | 40 | White | Married-civ-spouse | Masters | >50K | Canada | Prof-specialty | Wife |
| 4 | 1000 | 9 | Self-emp-inc | 90000 | 55 | 10000 | Male | 50 | Black | Separated | Some-college | <=50K | Mexico | Craft-repair | Own-child |
| 5 | 500 | 13 | Federal-gov | 100000 | 60 | 3000 | Female | 30 | White | Never-married | Assoc-acdm | <=50K | United-States | Adm-clerical | Unmarried |
| 6 | 0 | 11 | Private | 110000 | 40 | 0 | Male | 38 | Black | Married-civ-spouse | Assoc-voc | >50K | United-States | Protective-serv | Husband |
| 7 | 1500 | 12 | Private | 120000 | 45 | 2000 | Female | 25 | Asian-Pac-Islander | Married-civ-spouse | Bachelors | >50K | Japan | Sales | Wife |
| 8 | 2000 | 8 | State-gov | 130000 | 35 | 0 | Male | 55 | White | Divorced | 11th | <=50K | United-States | Tech-support | Not-in-family |
| 9 | 0 | 10 | Private | 140000 | 50 | 5000 | Female | 42 | Black | Never-married | HS-grad | <=50K | United-States | Exec-managerial | Unmarried |


The raw responses returned by GPT are available through the `_responses` attribute.

In [7]:
model._responses

[ChatCompletion(id='chatcmpl-8woAFkiUjRull9O1T1Tsgc8mAFcWN', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='capital-loss is 1500, educational-num is 12, workclass is Private, fnlwgt is 50000, hours-per-week is 40, capital-gain is 2000, gender is Male, age is 35, race is White, marital-status is Married-civ-spouse, education is Bachelors, income is >50K, native-country is United-States, occupation is Sales, relationship is Husband\ncapital-loss is 0, educational-num is 10, workclass is Self-emp-not-inc, fnlwgt is 60000, hours-per-week is 45, capital-gain is 0, gender is Female, age is 28, race is Black, marital-status is Never-married, education is HS-grad, income is <=50K, native-country is United-States, occupation is Tech-support, relationship is Not-in-family\ncapital-loss is 2000, educational-num is 8, workclass is State-gov, fnlwgt is 70000, hours-per-week is 35, capital-gain is 0, gender is Male, age is 45, race is Asian-Pac-I
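The response content above is plain text in which each line is one record, written as comma-separated `column is value` pairs. The sketch below shows one way such text could be parsed into row dictionaries. It is an illustration under that assumed format, not necessarily how sdgx parses responses; `parse_response` is a hypothetical helper, and the sample string reuses the first three fields of the first two records shown above.

```python
def parse_response(content: str) -> list:
    """Parse lines of 'column is value, column is value' into a list of dicts."""
    rows = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        row = {}
        for field in line.split(", "):
            name, sep, value = field.partition(" is ")
            if sep:  # skip fragments that do not match the pattern
                row[name.strip()] = value.strip()
        if row:
            rows.append(row)
    return rows


sample = (
    "capital-loss is 1500, educational-num is 12, workclass is Private\n"
    "capital-loss is 0, educational-num is 10, workclass is Self-emp-not-inc"
)
rows = parse_response(sample)
print(rows[0]["workclass"])  # Private
```

All values stay as strings in this sketch; converting numeric columns such as `capital-loss` would be a separate step.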