
<div align="center">
  <img src="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/sdg_logo.png?raw=true" width="400" >
</div>
<div align="center">





# 🚀 Synthetic data generation without Raw Data using LLM




The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

In [None]:
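# The wildcard import also brings in `openai`, `logger` and `SingleTableGPTModel`, which the next cell uses.
from sdgx.models.LLM.single_table.gpt import *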

In [None]:
class SingleTableGLMModel(SingleTableGPTModel):

    def ask_gpt(self, question, model=None):
        """
        Sends a question to the GPT model.

        Args:
            question (str): The question to ask.
            model (str): The GPT model to use. Defaults to None.

        Returns:
            str: The response from the GPT model.

        Raises:
            SynthesizerInitError: If the check method fails.
        """
        self.check()
        api_key = self.openai_API_key
        # Fall back to the model configured on the instance when none is passed.
        model = model or self.gpt_model
        # Point the OpenAI-compatible client at the configured endpoint.
        client = openai.OpenAI(base_url=self.openai_API_url, api_key=api_key)
        logger.info(f"Ask GPT with temperature = {self.temperature}.")
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": question,
                },
            ],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            timeout=self.timeout,
        )
        logger.info("Ask GPT Finished.")
        # store response
        self._responses.append(response)
        # return the content of the gpt response
        return response.choices[0].message.content


In [None]:
# Install dependencies (run this cell first if sdgx is not installed yet)
!pip install sdgx
# OR
# !pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git



We demonstrate with a single-table synthesis example.

# LLM-integrated synthetic data generation

Large language models (LLMs) have long been used to understand and generate various types of data.

LLMs also show some capability in tabular data generation, including abilities that traditional approaches (GAN-based or statistical models) cannot offer.



In [None]:
# Set your GLM-4 API key here (replace the placeholder string):

GLM4_AI_KEY = "YOUR_KEY"
GLM4_AI_BASE = "https://open.bigmodel.cn/api/paas/v4/"
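
A key pasted into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The variable name `GLM4_API_KEY` and the helper `load_glm4_key` are assumptions made for illustration; SDG does not define either of them.

```python
import os


def load_glm4_key(env_var: str = "GLM4_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With the variable exported in the shell that launches the notebook, `GLM4_AI_KEY = load_glm4_key()` could replace the hard-coded placeholder.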

In [None]:
# import packages

import pandas as pd
from sdgx.utils import download_demo_data
from sdgx.data_models.metadata import Metadata
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel

# Read the demo data.
# We currently use the well-known Adult dataset as an example.
data_path = download_demo_data()
df = pd.read_csv(data_path)
metadata = Metadata.from_dataframe(df)


# Synthetic data generation without Raw Data


Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements “Synthetic data generation without Raw Data”.

No training data is required: synthetic data can be generated from the metadata alone.

![LLM_Case_1](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/LLM_Case_1.gif?raw=true)
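
To make the idea concrete, the sketch below shows one way a prompt could be assembled from column names alone, without touching any rows. It is an illustration only: `build_metadata_prompt` is a hypothetical helper, and the wording is not the prompt that `SingleTableGPTModel` actually sends.

```python
def build_metadata_prompt(columns, n_samples, description=None):
    """Build a plain-text prompt from column names only (hypothetical helper)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    lines = []
    if description:
        # An optional free-text description gives the model extra context.
        lines.append(f"Dataset description: {description}")
    lines.append("The table has the following columns: " + ", ".join(columns) + ".")
    lines.append(
        f"Generate {n_samples} synthetic rows. Write each row on its own line as "
        "'<column> is <value>' pairs separated by commas."
    )
    return "\n".join(lines)


prompt = build_metadata_prompt(["age", "education", "income"], n_samples=3)
print(prompt)
```

Only the schema reaches the model in this sketch, which is what allows generation when the raw table cannot be shared.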

In [None]:
model = SingleTableGLMModel()
model.set_openAI_settings(GLM4_AI_BASE, GLM4_AI_KEY)
model.gpt_model = "glm-4"

In [None]:
model.fit(metadata)
# this may take a while
sampled_data = model.sample(30)

2024-06-10 13:52:27.718 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:228 - Fitting model with metadata...
2024-06-10 13:52:27.722 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:232 - Fitting model with metadata... Finished.
2024-06-10 13:52:27.725 | INFO     | sdgx.models.LLM.single_table.gpt:sample:385 - Sampling use GPT model ...
2024-06-10 13:52:27.728 | INFO     | sdgx.models.LLM.single_table.gpt:_sample_with_metadata:446 - Sampling with metadata.
2024-06-10 13:52:27.730 | INFO     | sdgx.models.LLM.base:_form_dataset_description:122 - No dataset_description given in current model.
2024-06-10 13:52:27.733 | INFO     | sdgx.models.LLM.base:_form_message_with_o

In [None]:
sampled_data

Unnamed: 0,fnlwgt,capital-loss,age,educational-num,occupation,relationship,native-country,gender,education,income,workclass,race,capital-gain,marital-status,hours-per-week
0,249472,0,52,13,Exec-managerial,Husband,United-States,Male,Bachelors,<=50K,Private,White,0,Married-civ-spouse,40
1,312466,1900,31,10,Craft-repair,Not-in-family,Mexico,Female,HS-grad,<=50K,Self-emp-not-inc,Hispanic,0,Never-married,45
2,222211,0,38,12,Adm-clerical,Own-child,Philippines,Male,Assoc-voc,<=50K,Local-gov,Asian-Pac-Islander,1500,Married-civ-spouse,38
3,198839,500,28,9,Other-service,Wife,Jamaica,Female,Some-college,<=50K,Private,Black,0,Married-AF-spouse,37
4,291775,0,44,14,Prof-specialty,Husband,United-States,Male,Masters,>50K,Self-emp-inc,White,2500,Married-civ-spouse,50
5,178556,1200,23,7,Machine-op-inspct,Own-child,Canada,Female,Bachelors,<=50K,Private,White,0,Never-married,30
6,275935,0,36,11,Tech-support,Not-in-family,India,Male,Assoc-acdm,<=50K,State-gov,Asian-Pac-Islander,0,Divorced,40
7,392763,3000,29,8,Sales,Wife,United-States,Female,HS-grad,<=50K,Private,Black,0,Married-civ-spouse,35
8,234733,0,41,13,Exec-managerial,Husband,United-States,Male,Prof-school,>50K,Self-emp-inc,White,5000,Married-civ-spouse,60
9,153660,0,54,15,Prof-specialty,Wife,United-States,Female,Doctorate,>50K,Local-gov,White,0,Married-civ-spouse,40
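
Because the rows come from a language model rather than a fitted statistical model, a quick structural check before using them can be worthwhile. The following is a minimal sketch using only the standard library; `find_invalid_rows` is a hypothetical helper, not part of SDG, and it checks only for missing values and non-integer entries in columns expected to hold integers.

```python
import csv
import io


def find_invalid_rows(csv_text, expected_columns, integer_columns):
    """Return (row_number, reason) pairs for rows that fail basic checks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for row_number, row in enumerate(reader, start=1):
        # A column is treated as missing when it is absent or empty.
        missing = [c for c in expected_columns if row.get(c) in (None, "")]
        if missing:
            problems.append((row_number, f"missing values for {missing}"))
            continue
        for column in integer_columns:
            try:
                int(row[column])
            except ValueError:
                problems.append((row_number, f"{column!r} is not an integer"))
                break
    return problems


sample = (
    "age,education,hours-per-week\n"
    "52,Bachelors,40\n"
    "thirty,HS-grad,45\n"
    "38,,38\n"
)
problems = find_invalid_rows(
    sample, ["age", "education", "hours-per-week"], ["age", "hours-per-week"]
)
print(problems)
```

For the `sampled_data` DataFrame above, `sampled_data.to_csv(index=False)` produces text in the shape this helper expects.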


Inspect the raw responses returned by the model through the `_responses` attribute.

In [None]:
model._responses

[ChatCompletion(id='8730210182522821228', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Here are 30 synthetic data samples based on the provided column headers:\n\n1. fnlwgt is 249472, capital-loss is 0, age is 52, educational-num is 13, occupation is Exec-managerial, relationship is Husband, native-country is United-States, gender is Male, education is Bachelors, income is <=50K, workclass is Private, race is White, capital-gain is 0, marital-status is Married-civ-spouse, hours-per-week is 40\n2. fnlwgt is 312466, capital-loss is 1900, age is 31, educational-num is 10, occupation is Craft-repair, relationship is Not-in-family, native-country is Mexico, gender is Female, education is HS-grad, income is <=50K, workclass is Self-emp-not-inc, race is Hispanic, capital-gain is 0, marital-status is Never-married, hours-per-week is 45\n3. fnlwgt is 222211, capital-loss is 0, age is 38, educational-num is 12, occupation is Adm-clerical, r
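
The response text above lists each row as numbered `column is value` pairs. As a rough sketch of how such text could be turned back into records, the snippet below parses that layout with the standard library. `parse_response_rows` is a hypothetical helper written for this illustration, not SDG's own parser, and it assumes that values contain neither `", "` nor `" is "`.

```python
import re

# Matches lines such as "1. age is 52, education is Bachelors".
ROW_PATTERN = re.compile(r"^\s*\d+\.\s+(.*)$")


def parse_response_rows(content):
    """Parse numbered 'column is value' lines into a list of dicts."""
    rows = []
    for line in content.splitlines():
        match = ROW_PATTERN.match(line)
        if not match:
            continue  # skip the preamble and blank lines
        record = {}
        for pair in match.group(1).split(", "):
            column, sep, value = pair.partition(" is ")
            if sep:  # ignore fragments that lack the ' is ' separator
                record[column.strip()] = value.strip()
        if record:
            rows.append(record)
    return rows


content = (
    "Here are 2 synthetic data samples:\n\n"
    "1. age is 52, education is Bachelors, income is <=50K\n"
    "2. age is 31, education is HS-grad, income is <=50K\n"
)
rows = parse_response_rows(content)
print(rows[0])
```

The same function could be applied to `model._responses[0].choices[0].message.content`, the access path that `ask_gpt` uses when it returns the reply. All parsed values are strings, so numeric columns would still need converting.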