
<div align="center">
  <img src="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/sdg_logo.png?raw=true" width="400" >
</div>





# 🚀 Synthetic data generation without Raw Data using LLM




The Synthetic Data Generator (SDG) is a specialized framework for generating high-quality structured tabular data. It includes a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

In [1]:
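# The wildcard import also brings `openai` and `logger` into scope,
# both of which the subclass below relies on.
from sdgx.models.LLM.single_table.gpt import *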

In [2]:
class SingleTableGLMModel(SingleTableGPTModel):
    
    def ask_gpt(self, question, model=None):
        """
        Sends a question to the GPT model.

        Args:
            question (str): The question to ask.
            model (str): The GPT model to use. Defaults to None.

        Returns:
            str: The response from the GPT model.

        Raises:
            SynthesizerInitError: If the check method fails.
        """
        self.check()
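        api_key = self.openai_API_key
        # Fall back to the model configured on the instance.
        model = model or self.gpt_model
        # Build a client against the configured base URL, so that any
        # OpenAI-compatible endpoint (here, a GLM-4 server) can be used.
        client = openai.OpenAI(base_url=self.openai_API_url,
                               api_key=api_key)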
        logger.info(f"Ask GPT with temperature = {self.temperature}.")
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": question,
                },
            ],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            timeout=self.timeout,
        )
        logger.info("Ask GPT Finished.")
        # store response
        self._responses.append(response)
        # return the content of the gpt response
        return response.choices[0].message.content


In [3]:
# install dependencies
# !pip install sdgx
# OR
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting git+https://github.com/hitsz-ids/synthetic-data-generator.git
  Cloning https://github.com/hitsz-ids/synthetic-data-generator.git to /private/var/folders/sj/9gkwxdgs4h3ck3kytmdgk4240000gn/T/pip-req-build-2ehjkri6
  Running command git clone --filter=blob:none --quiet https://github.com/hitsz-ids/synthetic-data-generator.git /private/var/folders/sj/9gkwxdgs4h3ck3kytmdgk4240000gn/T/pip-req-build-2ehjkri6
  Resolved https://github.com/hitsz-ids/synthetic-data-generator.git to commit d60da3ff1f0e84efa20d9d66c0542f8a41492156
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


We demonstrate with a single-table synthesis example.

# LLM-integrated synthetic data generation

LLMs have long been used to understand and generate various types of data.

LLMs can also generate tabular data, and they offer some abilities that traditional approaches (GAN-based models or statistical models) cannot achieve.



In [4]:
# Set your GLM-4 API key and the base URL of your OpenAI-compatible endpoint here:

GLM4_AI_KEY = "-"
GLM4_AI_BASE = 'http://192.168.8.126:8006/v1/'
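
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch, assuming environment variables named `GLM4_AI_KEY` and `GLM4_AI_BASE` (these names are this example's choice, not something sdgx reads on its own), could load both settings and fall back to the values above:

```python
import os

def load_glm4_settings(env=None):
    """Return (base_url, api_key) read from environment variables.

    The variable names GLM4_AI_KEY and GLM4_AI_BASE are assumptions made
    for this sketch; sdgx does not read them on its own.
    """
    env = os.environ if env is None else env
    # Fall back to the placeholder values used in this notebook.
    api_key = env.get("GLM4_AI_KEY", "-")
    base_url = env.get("GLM4_AI_BASE", "http://192.168.8.126:8006/v1/")
    return base_url, api_key

GLM4_AI_BASE, GLM4_AI_KEY = load_glm4_settings()
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.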

In [5]:
# import packages

import pandas as pd
from sdgx.utils import download_demo_data
from sdgx.data_models.metadata import Metadata
from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel

# read the demo data
# currently we use the well-known adult dataset as an example
data_path = download_demo_data()
df = pd.read_csv(data_path)
metadata = Metadata.from_dataframe(df)


# Synthetic data generation without Raw Data


Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements “Synthetic data generation without Raw Data”.

No training data is required: synthetic data can be generated from metadata alone.

![LLM_Case_1](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/assets/LLM_Case_1.gif?raw=true)
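
To make the idea concrete, here is a minimal sketch of how a prompt could be assembled from column names alone. The helper `build_metadata_prompt` and its wording are hypothetical and are not the prompt that sdgx actually sends; the column names are a subset of those in the adult dataset used in this notebook.

```python
def build_metadata_prompt(columns, n_rows, description=""):
    """Build a text prompt asking an LLM for rows that match the given columns.

    Hypothetical helper for illustration only; not part of sdgx.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    lines = []
    if description:
        # An optional free-text description gives the LLM extra context.
        lines.append(f"Dataset description: {description}")
    lines.append("The table has the following columns: " + ", ".join(columns) + ".")
    lines.append(f"Generate {n_rows} realistic rows in CSV format, without a header line.")
    return "\n".join(lines)

columns = ["age", "workclass", "education", "occupation", "income"]
prompt = build_metadata_prompt(columns, n_rows=5)
print(prompt)
```

No real rows appear anywhere in the prompt, which is the point of metadata-only generation: the LLM sees only the schema and, optionally, a description.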

In [6]:
model = SingleTableGLMModel()
model.set_openAI_settings(GLM4_AI_BASE, GLM4_AI_KEY)
model.gpt_model = "glm4"

In [7]:
model.fit(metadata)
# this may take a while
sampled_data = model.sample(30)

2024-06-07 16:01:56.035 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:231 - Fitting model with metadata...
2024-06-07 16:01:56.036 | INFO     | sdgx.models.LLM.single_table.gpt:_fit_with_metadata:235 - Fitting model with metadata... Finished.
2024-06-07 16:01:56.037 | INFO     | sdgx.models.LLM.single_table.gpt:sample:388 - Sampling use GPT model ...
2024-06-07 16:01:56.038 | INFO     | sdgx.models.LLM.single_table.gpt:_sample_with_metadata:449 - Sampling with metadata.
2024-06-07 16:01:56.038 | INFO     | sdgx.models.LLM.base:_form_dataset_description:122 - No dataset_description given in current model.
2024-06-07 16:01:56.039 | INFO     | sdgx.models.LLM.base:_form_message_with_o

In [8]:
sampled_data

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income


The raw responses returned by the model are stored in the `_responses` attribute. Inspecting them helps when the sampled data looks malformed or empty.

In [9]:
model._responses

[ChatCompletion(id='7397f509-071b-41b3-9bca-c9093b55f228', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='is a file format used for LaTeX documents. It is not related to data generation or tabular data. Please provide the correct information or context for generating data samples.', role='assistant', function_call=None, tool_calls=None))], created=None, model='glm4', object='chat.completion', system_fingerprint=None, usage=None)]
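
The response above has `finish_reason='length'`, which in the OpenAI chat completions format means generation stopped at the token limit, and its content is not tabular. A minimal sketch for flagging such responses follows. It assumes only the attributes visible in the output above (`choices[0].finish_reason` and `choices[0].message.content`); the helper name `summarize_responses` is hypothetical, and `SimpleNamespace` objects stand in for real `ChatCompletion` objects so the snippet runs without a server.

```python
from types import SimpleNamespace

def summarize_responses(responses, preview_chars=60):
    """Return one dict per response with its finish reason and a content preview."""
    summary = []
    for index, response in enumerate(responses):
        choice = response.choices[0]
        content = choice.message.content or ""
        summary.append({
            "index": index,
            "finish_reason": choice.finish_reason,
            # "length" means the reply was cut off at the token limit.
            "truncated": choice.finish_reason == "length",
            "preview": content[:preview_chars],
        })
    return summary

# Stand-in object mimicking the shape of the response shown above.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="length",
        message=SimpleNamespace(content="is a file format used for LaTeX documents."),
    )]
)
report = summarize_responses([fake_response])
print(report)
```

In the notebook, `summarize_responses(model._responses)` would apply the same check to the stored responses.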