
<div align="center">
  <img src="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/sdg_logo.png?raw=true" width="400" >
</div>
# 🚀 Synthetic data generation without Raw Data using LLM




The Synthetic Data Generator (SDG) is a specialized framework for generating high-quality structured tabular data. It includes a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data is generated by machines from real data, metadata, and algorithms. It is intended to contain no sensitive information while retaining the essential characteristics of the original data. Because synthetic records have no direct correspondence to real records, synthetic data can help applications meet the requirements of privacy regulations such as GDPR and ADPPA, and it reduces the risk of privacy breaches in practice.

In [None]:
# install dependencies
!pip install sdgx
# OR
# !pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Collecting sdgx
  Downloading sdgx-0.2.0-py3-none-any.whl (216 kB)
Collecting faker>=10 (from sdgx)
  Downloading Faker-25.8.0-py3-none-any.whl (1.8 MB)
Collecting loguru (from sdgx)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting openai>=1.10.0 (from sdgx)
  Downloading openai-1.33.0-py3-none-any.whl (325 kB)
Collecting table-evaluator (from sdgx)
  Downloading table_evaluator-1.6.1-py3-none-any.whl (22 kB)
Collecting httpx<1,>=0.23.0 (from openai>=1.10.0->sdgx)
  Downloading httpx-0.27.0-py3-no

In [None]:
import openai
from loguru import logger
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel


class SingleTableGLMModel(SingleTableGPTModel):
    """SingleTableGPTModel variant that talks to an OpenAI-compatible endpoint such as GLM-4."""

    def ask_gpt(self, question, model=None):
        """
        Sends a question to the configured chat model.

        Args:
            question (str): The question to ask.
            model (str): The model to use. Defaults to None, which falls back to `self.gpt_model`.

        Returns:
            str: The text content of the model's response.

        Raises:
            SynthesizerInitError: If the check method fails.
        """
        self.check()
        # Fall back to the model configured on the instance.
        model = model or self.gpt_model
        # Create a client that points at the configured base URL.
        client = openai.OpenAI(
            base_url=self.openai_API_url,
            api_key=self.openai_API_key,
        )
        logger.info(f"Ask GPT with temperature = {self.temperature}.")
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": question,
                },
            ],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            timeout=self.timeout,
        )
        logger.info("Ask GPT Finished.")
        # Store the raw response so it can be inspected later via `_responses`.
        self._responses.append(response)
        # Return only the text content of the first choice.
        return response.choices[0].message.content

We demonstrate with a single-table example.

# LLM-integrated synthetic data generation

Large language models (LLMs) have long been used to understand and generate many types of data.

LLMs are also capable of generating tabular data, and they offer some abilities that traditional approaches (GAN-based or statistical models) cannot provide.



In [None]:
# Set your GLM-4 API key here (replace the placeholder string):

GLM4_AI_KEY = "YOUR_KEY"
GLM4_AI_BASE = 'https://open.bigmodel.cn/api/paas/v4/'
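As an alternative to hard-coding the key in the notebook, the key could be read from an environment variable. The following is a minimal sketch; the helper `load_api_key` and the variable name `GLM4_API_KEY` are illustrative assumptions and are not part of sdgx.

```python
import os


def load_api_key(env_var: str = "GLM4_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both the helper and the default variable name are hypothetical;
    pick whatever name fits your environment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your GLM-4 API key.")
    return key
```

With the variable set, `GLM4_AI_KEY = load_api_key()` can replace the placeholder assignment, which keeps the key out of the saved notebook.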

In [None]:
# import packages

import pandas as pd
from sdgx.utils import download_demo_data
from sdgx.data_models.metadata import Metadata
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel

# read the demo data
# currently we use the well-known adult dataset as an example
data_path = download_demo_data()
df = pd.read_csv(data_path)
metadata = Metadata.from_dataframe(df)

2024-06-12 02:08:49.329 | INFO     | sdgx.utils:download_demo_data:68 - Downloading demo data from github data source to /content/dataset/adult.csv



# Synthetic data generation without Raw Data


Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements “Synthetic data generation without Raw Data”.

No training data is required: synthetic data can be generated from the metadata alone.

![LLM_Case_1](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/LLM_Case_2.gif?raw=true)

In [None]:
model = SingleTableGLMModel()
model.set_openAI_settings(GLM4_AI_BASE, GLM4_AI_KEY)
model.gpt_model = "glm-4"

In [None]:
model.fit(metadata)
# this may take a while
model.sample(30, off_table_features = ['has_car'])

2024-06-12 02:10:53.133 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:228 - Fitting model with metadata...
2024-06-12 02:10:53.135 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:232 - Fitting model with metadata... Finished.
2024-06-12 02:10:53.137 | INFO     | sdgx.models.LLM.single_table.gpt:sample:385 - Sampling use GPT model ...
2024-06-12 02:10:53.138 | INFO     | sdgx.models.LLM.single_table.gpt:_sample_with_metadata:446 - Sampling with metadata.
2024-06-12 02:10:53.140 | INFO     | sdgx.models.LLM.base:_form_dataset_description:122 - No dataset_description given in current model.
2024-06-12 02:10:53.142 | INFO     | sdgx.models.LLM.base:_form_message_with_o

Unnamed: 0,educational-num,marital-status,native-country,capital-loss,gender,hours-per-week,income,education,fnlwgt,workclass,age,race,relationship,capital-gain,occupation
0,10,Married-civ-spouse,United-States,0,Male,40,<=50K,Bachelors,205263,Private,45,White,Husband,1000,Exec-managerial
1,13,Never-married,Mexico,1500,Female,45,<=50K,Masters,189284,Self-emp-not-inc,32,Hispanic,Own-child,0,Prof-specialty
2,9,Divorced,Philippines,0,Male,60,>50K,HS-grad,267450,Federal-gov,38,Asian-Pac-Islander,Unmarried,2000,Protective-serv
3,12,Married-AF-spouse,Germany,500,Female,35,<=50K,Assoc-voc,223434,Local-gov,29,White,Wife,0,Tech-support
4,8,Separated,Canada,2000,Male,55,<=50K,Some-college,336924,State-gov,41,Black,Other-relative,1500,Sales
5,11,Widowed,India,0,Female,20,<=50K,Doctorate,278844,Self-emp-inc,56,Asian-Pac-Islander,Not-in-family,3000,Prof-specialty
6,7,Married-civ-spouse,United-States,1000,Male,30,<=50K,Bachelors,147115,Private,36,White,Husband,0,Craft-repair
7,14,Never-married,United-States,0,Female,50,>50K,Prof-school,324442,Self-emp-not-inc,28,White,Own-child,2500,Exec-managerial
8,6,Divorced,Italy,1500,Male,65,>50K,HS-grad,212323,Federal-gov,42,White,Unmarried,500,Adm-clerical
9,13,Married-AF-spouse,United-States,0,Female,38,<=50K,Assoc-voc,198234,Local-gov,31,Black,Wife,1000,Handlers-cleaners


The raw responses returned by the model can be inspected through the `_responses` attribute.

In [None]:
model._responses

[ChatCompletion(id='8741737668587097609', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Here are 30 synthetic data samples based on the provided column headers:\n\n1. educational-num is 10, marital-status is Married-civ-spouse, native-country is United-States, capital-loss is 0, gender is Male, hours-per-week is 40, income is <=50K, education is Bachelors, fnlwgt is 205263, workclass is Private, age is 45, race is White, relationship is Husband, capital-gain is 1000, occupation is Exec-managerial\n2. educational-num is 13, marital-status is Never-married, native-country is Mexico, capital-loss is 1500, gender is Female, hours-per-week is 45, income is <=50K, education is Masters, fnlwgt is 189284, workclass is Self-emp-not-inc, age is 32, race is Hispanic, relationship is Own-child, capital-gain is 0, occupation is Prof-specialty\n3. educational-num is 9, marital-status is Divorced, native-country is Philippines, capital-loss is 0,
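The response text above follows a "column is value" pattern on numbered lines. As a rough illustration of how such text can be turned back into rows, here is a standalone sketch that uses only the standard library. The function `parse_samples` is hypothetical and is not how sdgx parses responses internally; the sample text is a shortened example in the same format.

```python
import re


def parse_samples(text, columns):
    """Parse numbered lines of 'column is value' pairs into a list of dicts.

    Hypothetical helper for illustration; values are returned as strings.
    """
    # Try longer names first so a short name never shadows a longer one.
    names = "|".join(re.escape(c) for c in sorted(columns, key=len, reverse=True))
    # A value runs until the next ', <column> is ' or the end of the line.
    pair = re.compile(rf"(?<![\w-])({names}) is (.*?)(?=, (?:{names}) is |$)")
    rows = []
    for line in text.splitlines():
        numbered = re.match(r"\s*\d+\.\s*(.*)", line)
        if not numbered:
            continue  # skip the introductory sentence and blank lines
        row = dict(pair.findall(numbered.group(1).strip()))
        if row:
            rows.append(row)
    return rows


sample_text = (
    "Here are 2 synthetic data samples based on the provided column headers:\n\n"
    "1. educational-num is 10, marital-status is Married-civ-spouse, age is 45\n"
    "2. educational-num is 13, marital-status is Never-married, age is 32"
)
rows = parse_samples(sample_text, ["educational-num", "marital-status", "age"])
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to obtain a table; numeric columns would still need to be converted from strings.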