# Hybrid Data Generation with GAN and LLM: A Demo


# Loading the libraries

| Platform |
|----------|
| [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/inDox/blob/master/cookbook/indoxGen/Hybrid_Data_Generation_with_GAN_and_LLM.ipynb) |
| [![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/osllmai/inDox/blob/master/cookbook/indoxGen/Hybrid_Data_Generation_with_GAN_and_LLM.ipynb) |



This notebook demonstrates how to generate synthetic data by combining a GAN (for numerical data) with an LLM (for text data). We use the `indoxGen` and `indoxGen-torch` libraries to build this hybrid pipeline.

## 1. Install Required Libraries

We start by installing the required libraries. `indoxGen` and `indoxGen-torch` are used for hybrid data generation, `python-dotenv` for loading API keys, and `openai` for working with language models.


In [None]:
# !pip install indoxGen indoxGen-torch openai loguru python-dotenv tenacity
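
After installing, it can help to confirm which versions were actually picked up before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are the ones mentioned in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as listed in this notebook's install step.
for name in ["indoxGen", "indoxGen-torch", "openai", "loguru", "python-dotenv", "tenacity"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Any package reported as `not installed` should be installed before continuing.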

## 2. Load API Keys
Next, we load API keys from an environment file (`api.env`). The file should define the two variables read below: `OPENAI_API_KEY`, which this notebook uses as the IndoxAPI key, and `NVIDIA_API_KEY`.

In [12]:
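import os
from dotenv import load_dotenv

# Load variables from api.env (load_dotenv() with no argument looks for .env instead)
load_dotenv("api.env")

# The IndoxAPI key is read from OPENAI_API_KEY in this notebook
INDOX_API_KEY = os.environ['OPENAI_API_KEY']
NVIDIA_API_KEY = os.environ['NVIDIA_API_KEY']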
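
Reading keys with `os.environ[...]` raises a bare `KeyError` when a variable is missing. A small wrapper can give a clearer message instead. This is only a sketch: `require_env` is a hypothetical helper, not part of `indoxGen` or `python-dotenv`.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is missing or empty; check api.env"
        )
    return value


# Example usage in the notebook (after load_dotenv has run):
# INDOX_API_KEY = require_env("OPENAI_API_KEY")
# NVIDIA_API_KEY = require_env("NVIDIA_API_KEY")
```

Failing early here avoids a less obvious authentication error later, when the models are first called.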


## 3. Initialize Language Models (LLMs)
We use IndoxAPI as the judge model and NVIDIA's Nemotron as the generator model. Nemotron is accessed through the `OpenAi` client pointed at NVIDIA's endpoint. Together, these models generate realistic and diverse text data based on the numerical context provided.

In [13]:
from indoxGen.llms import OpenAi, IndoxApi

indox = IndoxApi(api_key=INDOX_API_KEY)
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct", base_url="https://integrate.api.nvidia.com/v1")


Note: in the reviewed run, this cell failed with `ImportError: cannot import name 'IndoxApi' from 'indoxGen.llms'`, so the class name may differ in the installed `indoxGen` version.
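
The `ImportError` reported for this cell can be diagnosed by listing what the module actually exposes in your environment. The sketch below uses only the standard library; `public_names` and `has_name` are hypothetical helpers, and no assumption is made about which names `indoxGen.llms` really contains.

```python
import importlib


def public_names(module_name):
    """Return the sorted public attribute names of an importable module."""
    module = importlib.import_module(module_name)
    return sorted(name for name in dir(module) if not name.startswith("_"))


def has_name(module_name, attr):
    """Check whether a module exposes an attribute before importing it by name."""
    return hasattr(importlib.import_module(module_name), attr)


# Demonstrated on a standard-library module; in the notebook you would
# pass "indoxGen.llms" instead.
print(has_name("json", "loads"))
```

Running `public_names("indoxGen.llms")` in the notebook environment would show which client classes are available to import.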

## 4. Create a Sample Dataset
We will use a simple dataset containing information about individuals, such as their age, income, years of experience, job title, and remarks. This dataset will serve as the basis for generating synthetic data.

In [7]:
import pandas as pd

sample_data = [
    {'age': 25, 'income': 45.5, 'years_of_experience': 3, 'job_title': 'Junior Developer', 'remarks': 'Looking to grow my career in full-stack development.'},
    {'age': 32, 'income': 60.0, 'years_of_experience': 7, 'job_title': 'Software Engineer', 'remarks': 'Focused on backend development and database optimization.'},
    {'age': 45, 'income': 80.2, 'years_of_experience': 20, 'job_title': 'Lead Developer', 'remarks': 'Experienced in leading large-scale software projects.'},
    {'age': 28, 'income': 50.1, 'years_of_experience': 5, 'job_title': 'Data Analyst', 'remarks': 'Passionate about data-driven decision making and visualization.'},
    {'age': 38, 'income': 70.0, 'years_of_experience': 15, 'job_title': 'Senior Developer', 'remarks': 'Skilled in cloud architecture and microservices.'},
    {'age': 23, 'income': 40.0, 'years_of_experience': 2, 'job_title': 'Junior Developer', 'remarks': 'Passionate about front-end technologies and user experience.'},
    {'age': 50, 'income': 90.5, 'years_of_experience': 25, 'job_title': 'Technical Architect', 'remarks': 'Expert in designing scalable systems and enterprise solutions.'},
    {'age': 29, 'income': 55.3, 'years_of_experience': 6, 'job_title': 'Full-Stack Developer', 'remarks': 'Enjoys working on both front-end and back-end systems.'},
    {'age': 35, 'income': 65.0, 'years_of_experience': 10, 'job_title': 'DevOps Engineer', 'remarks': 'Dedicated to automating infrastructure and improving CI/CD pipelines.'},
    {'age': 27, 'income': 48.7, 'years_of_experience': 4, 'job_title': 'Web Developer', 'remarks': 'Enjoys creating responsive and dynamic web applications.'},
    {'age': 42, 'income': 75.4, 'years_of_experience': 18, 'job_title': 'Product Manager', 'remarks': 'Focused on aligning software development with business goals.'},
    {'age': 33, 'income': 63.2, 'years_of_experience': 9, 'job_title': 'Mobile App Developer', 'remarks': 'Experienced in building cross-platform mobile applications.'},
    {'age': 41, 'income': 85.0, 'years_of_experience': 19, 'job_title': 'Engineering Manager', 'remarks': 'Leads engineering teams with a focus on agile development and collaboration.'},
    {'age': 30, 'income': 58.0, 'years_of_experience': 8, 'job_title': 'Machine Learning Engineer', 'remarks': 'Specializes in building and deploying machine learning models.'},
    {'age': 22, 'income': 42.0, 'years_of_experience': 1, 'job_title': 'Intern', 'remarks': 'Learning about cloud computing and containerization.'},
    {'age': 37, 'income': 68.9, 'years_of_experience': 14, 'job_title': 'Cloud Engineer', 'remarks': 'Expert in AWS and Azure infrastructure, optimizing cloud deployments.'},
    {'age': 48, 'income': 95.0, 'years_of_experience': 22, 'job_title': 'Director of Engineering', 'remarks': 'Oversees multiple development teams and sets technical strategy.'},
    {'age': 31, 'income': 57.0, 'years_of_experience': 6, 'job_title': 'UX/UI Designer', 'remarks': 'Focused on creating intuitive and user-friendly interfaces.'},
    {'age': 34, 'income': 61.5, 'years_of_experience': 7, 'job_title': 'Database Administrator', 'remarks': 'Manages complex databases and ensures data integrity.'},
    {'age': 26, 'income': 46.8, 'years_of_experience': 3, 'job_title': 'Systems Analyst', 'remarks': 'Analyzes and improves system processes for efficiency.'},
    {'age': 40, 'income': 78.5, 'years_of_experience': 17, 'job_title': 'Security Engineer', 'remarks': 'Specialized in network security and vulnerability assessment.'},
    {'age': 50, 'income': 100.0, 'years_of_experience': 25, 'job_title': 'CTO', 'remarks': 'Responsible for the technology strategy and leadership across the company.'},
    {'age': 36, 'income': 67.4, 'years_of_experience': 12, 'job_title': 'Backend Developer', 'remarks': 'Enjoys optimizing server-side logic and APIs.'},
    {'age': 44, 'income': 82.7, 'years_of_experience': 21, 'job_title': 'Project Manager', 'remarks': 'Skilled in leading cross-functional teams to deliver on time and within budget.'},
    {'age': 28, 'income': 51.0, 'years_of_experience': 5, 'job_title': 'Scrum Master', 'remarks': 'Facilitates agile ceremonies and helps the team improve productivity.'},
    {'age': 46, 'income': 88.4, 'years_of_experience': 22, 'job_title': 'Head of IT Operations', 'remarks': 'Oversees IT infrastructure and ensures smooth day-to-day operations.'},
    {'age': 39, 'income': 71.3, 'years_of_experience': 16, 'job_title': 'QA Engineer', 'remarks': 'Passionate about ensuring software quality and test automation.'},
    {'age': 24, 'income': 47.0, 'years_of_experience': 2, 'job_title': 'Junior Data Scientist', 'remarks': 'Exploring data science and machine learning techniques.'},
    {'age': 51, 'income': 102.5, 'years_of_experience': 26, 'job_title': 'Chief Data Officer', 'remarks': 'Manages the organization\'s data strategy and governance.'},
]


data = pd.DataFrame(sample_data)

# Preview the dataset
data.head()


|   | age | income | years_of_experience | job_title | remarks |
|---|-----|--------|---------------------|-----------|---------|
| 0 | 25 | 45.5 | 3 | Junior Developer | Looking to grow my career in full-stack develo... |
| 1 | 32 | 60.0 | 7 | Software Engineer | Focused on backend development and database op... |
| 2 | 45 | 80.2 | 20 | Lead Developer | Experienced in leading large-scale software pr... |
| 3 | 28 | 50.1 | 5 | Data Analyst | Passionate about data-driven decision making a... |
| 4 | 38 | 70.0 | 15 | Senior Developer | Skilled in cloud architecture and microservices. |


## 5. Define Columns for Text and Numerical Data
We need to separate the columns into numerical data (such as age, income, and years_of_experience) and text data (such as job_title and remarks). We will also define the columns that contain integer values.

In [8]:
numerical_columns = ['age', 'income', 'years_of_experience']
text_columns = ['job_title', 'remarks']
integer_columns = ['age', 'years_of_experience']

# Extract example data
# example_data = data[numerical_columns + text_columns].to_dict(orient='records')
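
The column lists above are written by hand. For larger datasets they could be derived from the DataFrame's dtypes instead. The following is a sketch under two assumptions: every non-numeric column is treated as text, and `split_columns` is a hypothetical helper rather than an `indoxGen` function.

```python
import pandas as pd


def split_columns(df):
    """Split column names into numerical, text, and integer groups by dtype."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    # Assumption: anything that is not numeric is handled as text.
    text = [col for col in df.columns if col not in numerical]
    integer = [col for col in numerical if pd.api.types.is_integer_dtype(df[col])]
    return numerical, text, integer


# Two rows taken from the sample dataset in this notebook.
demo = pd.DataFrame([
    {'age': 25, 'income': 45.5, 'years_of_experience': 3,
     'job_title': 'Junior Developer',
     'remarks': 'Looking to grow my career in full-stack development.'},
    {'age': 32, 'income': 60.0, 'years_of_experience': 7,
     'job_title': 'Software Engineer',
     'remarks': 'Focused on backend development and database optimization.'},
])

numerical_cols, text_cols, integer_cols = split_columns(demo)
print(numerical_cols, text_cols, integer_cols)
```

For this dataset the derived lists match the hand-written ones, but explicit lists remain the safer choice when a numeric-looking column is really an identifier or a category.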


In [9]:
example_data_llm = [
    {
        'age': 25,
        'income': 45.5,
        'years_of_experience': 3,
        'job_title': 'Junior Developer',
        'remarks': 'Looking to grow my career in full-stack development.'
    },
    {
        'age': 32,
        'income': 60.0,
        'years_of_experience': 7,
        'job_title': 'Software Engineer',
        'remarks': 'Focused on backend development and database optimization.'
    },
    {
        'age': 45,
        'income': 80.2,
        'years_of_experience': 20,
        'job_title': 'Lead Developer',
        'remarks': 'Experienced in leading large-scale software projects.'
    },
    {
        'age': 28,
        'income': 50.1,
        'years_of_experience': 5,
        'job_title': 'Data Analyst',
        'remarks': 'Passionate about data-driven decision making and visualization.'
    },
    {
        'age': 38,
        'income': 70.0,
        'years_of_experience': 15,
        'job_title': 'Senior Developer',
        'remarks': 'Skilled in cloud architecture and microservices.'
    },
    {
        'age': 23,
        'income': 40.0,
        'years_of_experience': 2,
        'job_title': 'Junior Developer',
        'remarks': 'Passionate about front-end technologies and user experience.'
    },
    {
        'age': 50,
        'income': 90.5,
        'years_of_experience': 25,
        'job_title': 'Technical Architect',
        'remarks': 'Expert in designing scalable systems and enterprise solutions.'
    },
    {
        'age': 29,
        'income': 55.3,
        'years_of_experience': 6,
        'job_title': 'Full-Stack Developer',
        'remarks': 'Enjoys working on both front-end and back-end systems.'
    },
    {
        'age': 35,
        'income': 65.0,
        'years_of_experience': 10,
        'job_title': 'DevOps Engineer',
        'remarks': 'Dedicated to automating infrastructure and improving CI/CD pipelines.'
    },
    {
        'age': 27,
        'income': 48.7,
        'years_of_experience': 4,
        'job_title': 'Web Developer',
        'remarks': 'Enjoys creating responsive and dynamic web applications.'
    },
    {
        'age': 42,
        'income': 75.4,
        'years_of_experience': 18,
        'job_title': 'Product Manager',
        'remarks': 'Focused on aligning software development with business goals.'
    },
    {
        'age': 33,
        'income': 63.2,
        'years_of_experience': 9,
        'job_title': 'Mobile App Developer',
        'remarks': 'Experienced in building cross-platform mobile applications.'
    },
    {
        'age': 41,
        'income': 85.0,
        'years_of_experience': 19,
        'job_title': 'Engineering Manager',
        'remarks': 'Leads engineering teams with a focus on agile development and collaboration.'
    },
    {
        'age': 30,
        'income': 58.0,
        'years_of_experience': 8,
        'job_title': 'Machine Learning Engineer',
        'remarks': 'Specializes in building and deploying machine learning models.'
    },
    {
        'age': 22,
        'income': 42.0,
        'years_of_experience': 1,
        'job_title': 'Intern',
        'remarks': 'Learning about cloud computing and containerization.'
    },
    {
        'age': 37,
        'income': 68.9,
        'years_of_experience': 14,
        'job_title': 'Cloud Engineer',
        'remarks': 'Expert in AWS and Azure infrastructure, optimizing cloud deployments.'
    },
    {
        'age': 48,
        'income': 95.0,
        'years_of_experience': 22,
        'job_title': 'Director of Engineering',
        'remarks': 'Oversees multiple development teams and sets technical strategy.'
    },
    {
        'age': 31,
        'income': 57.0,
        'years_of_experience': 6,
        'job_title': 'UX/UI Designer',
        'remarks': 'Focused on creating intuitive and user-friendly interfaces.'
    },
    {
        'age': 34,
        'income': 61.5,
        'years_of_experience': 7,
        'job_title': 'Database Administrator',
        'remarks': 'Manages complex databases and ensures data integrity.'
    },
    {
        'age': 26,
        'income': 46.8,
        'years_of_experience': 3,
        'job_title': 'Systems Analyst',
        'remarks': 'Analyzes and improves system processes for efficiency.'
    }
]
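
Because these example records are passed to the LLM as context, it is worth checking that each one contains every expected column before use. A minimal sketch, where `find_incomplete_records` is a hypothetical helper and not part of `indoxGen`:

```python
def find_incomplete_records(records, columns):
    """Return (index, missing_columns) pairs for records lacking any expected column."""
    problems = []
    for index, record in enumerate(records):
        missing = [
            col for col in columns
            if col not in record or record[col] in (None, "")
        ]
        if missing:
            problems.append((index, missing))
    return problems


expected = ['age', 'income', 'years_of_experience', 'job_title', 'remarks']
sample = [
    {'age': 25, 'income': 45.5, 'years_of_experience': 3,
     'job_title': 'Junior Developer', 'remarks': 'Looking to grow my career.'},
    {'age': 32, 'income': 60.0, 'job_title': 'Software Engineer', 'remarks': ''},
]
print(find_incomplete_records(sample, expected))
```

In the notebook, calling `find_incomplete_records(example_data_llm, expected)` should return an empty list if the examples are complete.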


## 6. Initialize LLM Setup
We will now set up the language model (LLM) to generate synthetic text data. The nemotron model will generate new text (like job titles and remarks) based on the numerical context (like age and years of experience). We specify a diversity threshold to encourage variety in the generated text.

In [10]:
from indoxGen.hybrid_synth import initialize_llm_synth

user_instruction = (
    "Generate realistic and diverse text data based on the provided numerical context. "
    "Ensure that the generated text reflects the diversity of experiences and does not repeat previous patterns. "
    "Vary the wording, job titles, and remarks for each individual."
)

llm_setup = initialize_llm_synth(
    generator_llm=nemotron,
    judge_llm=indox,
    columns=['age', 'income', 'years_of_experience', 'job_title', 'remarks'],
    example_data=example_data_llm,
    user_instruction=user_instruction,
    diversity_threshold=0.3,
    max_diversity_failures=20,  # Tolerate fewer diversity failures
    verbose=1
)
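
The notebook does not show how `indoxGen` measures diversity, so the following only illustrates the general idea behind a diversity threshold: a new text is accepted when it is sufficiently different from everything generated so far. The similarity measure (`difflib.SequenceMatcher`) and both function names are assumptions chosen for this sketch, not the library's implementation.

```python
from difflib import SequenceMatcher


def diversity_score(candidate, previous_texts):
    """Return 1 minus the highest similarity between the candidate and earlier texts."""
    if not previous_texts:
        return 1.0
    max_similarity = max(
        SequenceMatcher(None, candidate, prev).ratio() for prev in previous_texts
    )
    return 1.0 - max_similarity


def is_diverse_enough(candidate, previous_texts, threshold=0.3):
    """Accept the candidate only if its diversity score reaches the threshold."""
    return diversity_score(candidate, previous_texts) >= threshold


history = ["Skilled in cloud architecture and microservices."]
# An exact repeat has a diversity score of 0 and is rejected.
print(is_diverse_enough("Skilled in cloud architecture and microservices.", history))
```

Under this reading, a higher threshold demands more variation between samples, and a parameter like `max_diversity_failures` would bound how many rejected candidates are tolerated before generation gives up.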


## 7. Initialize GAN Setup
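Next, we set up the GAN to generate numerical data. We define its architecture (the generator and discriminator layer sizes) and its training parameters, such as the learning rate, batch size, and number of epochs. The call also trains the GAN on the numerical columns, as the training log in the output shows.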

In [11]:
from indoxGen.hybrid_synth import initialize_gan_synth

numerical_data = pd.DataFrame(data[numerical_columns])

gan_setup = initialize_gan_synth(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=64,
    epochs=50,
    n_critic=5,
    categorical_columns=[],
    mixed_columns={},
    integer_columns=integer_columns,
    data=numerical_data
)


Epoch [1/50] - D Loss: 0.9003, G Loss: -0.1424
Epoch [2/50] - D Loss: 0.2867, G Loss: 0.4912
Epoch [3/50] - D Loss: 0.8099, G Loss: 0.6775
Epoch [4/50] - D Loss: 0.2298, G Loss: 0.3096
Epoch [5/50] - D Loss: -0.2670, G Loss: 0.4776
Epoch [6/50] - D Loss: 0.1253, G Loss: 0.4239
Epoch [7/50] - D Loss: -0.2146, G Loss: 0.5971
Epoch [8/50] - D Loss: -0.1074, G Loss: 0.4334
Epoch [9/50] - D Loss: -0.4592, G Loss: 0.8686
Epoch [10/50] - D Loss: -0.0580, G Loss: 0.6474
Epoch [11/50] - D Loss: 0.1739, G Loss: 0.6142
Epoch [12/50] - D Loss: 0.0540, G Loss: 0.4765
Epoch [13/50] - D Loss: -0.0333, G Loss: 0.3782
Epoch [14/50] - D Loss: -0.0440, G Loss: 0.5570
Epoch [15/50] - D Loss: -0.1334, G Loss: 0.4432
Epoch [16/50] - D Loss: 0.0505, G Loss: 0.2441

Early stopping triggered. Generator loss did not improve for 15 epochs.
Training stopped early at epoch 16 due to no improvement in generator loss.
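
The log above is consistent with patience-based early stopping: the lowest generator loss occurs at epoch 1, and training stops after 15 further epochs without a lower value. The sketch below reproduces that behavior on the logged generator losses. It assumes that "improvement" means a strictly lower loss and that the patience is 15. Both are inferred from the log rather than taken from the `indoxGen-torch` source, and `early_stop_epoch` is a hypothetical helper.

```python
def early_stop_epoch(losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Generator losses copied from the training log above.
g_losses = [-0.1424, 0.4912, 0.6775, 0.3096, 0.4776, 0.4239, 0.5971, 0.4334,
            0.8686, 0.6474, 0.6142, 0.4765, 0.3782, 0.5570, 0.4432, 0.2441]

print(early_stop_epoch(g_losses, patience=15))  # prints 16
```

With only 16 of the 50 requested epochs completed, the generator may be undertrained, which is worth keeping in mind when judging the numerical columns of the synthetic output.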


## 8. Combine GAN and LLM: TextTabularSynth
We now create an instance of the TextTabularSynth class, which integrates both the GAN (for numerical data) and the LLM (for text data). This allows us to generate synthetic samples that combine numerical and text data.

In [12]:
from indoxGen.hybrid_synth import TextTabularSynth

synth_pipeline = TextTabularSynth(tabular=gan_setup, text=llm_setup)


## 9. Generate Synthetic Data
We are now ready to generate synthetic data! We specify how many samples we want to generate and then preview the resulting data.

In [13]:
# Specify the number of synthetic samples to generate
num_samples = 10

# Generate the synthetic data
synthetic_data = synth_pipeline.generate(num_samples)

# Preview the synthetic data
print("\nSynthetic Data:")
synthetic_data.head()


INFO: Generated data point: {'age': '45', 'income': '85.2', 'years_of_experience': '20', 'job_title': 'Cybersecurity Specialist', 'remarks': 'Expert in network security and threat intelligence, with a focus on protecting sensitive data and maintaining system integrity.'}
INFO: Generated data point: {'age': '22', 'income': '45.0', 'years_of_experience': '1', 'job_title': 'Junior UX Designer', 'remarks': 'Excels in user-centered design and prototyping, with a strong focus on accessibility and usability. Eager to learn and grow in the field of user experience.'}
INFO: Generated data point: {'age': '30', 'income': '55.0', 'years_of_experience': '5', 'job_title': 'AI Research Scientist', 'remarks': 'Specializes in developing and implementing machine learning algorithms for natural language processing tasks. Passionate about ethical AI and ensuring fairness in AI systems.'}
INFO: Generated data point: {'age': '36', 'income': '72

|   | age | income | years_of_experience | age.1 | income.1 | years_of_experience.1 | job_title | remarks |
|---|-----|--------|---------------------|-------|----------|-----------------------|-----------|---------|
| 0 | 39 | 82.271339 | 18 | 45 | 85.2 | 20 | Cybersecurity Specialist | Expert in network security and threat intellig... |
| 1 | 43 | 83.526192 | 19 | 22 | 45.0 | 1 | Junior UX Designer | Excels in user-centered design and prototyping... |
| 2 | 31 | 65.260857 | 13 | 30 | 55.0 | 5 | AI Research Scientist | Specializes in developing and implementing mac... |
| 3 | 41 | 66.374245 | 8 | 36 | 72.1 | 11 | Blockchain Developer | Proficient in developing decentralized applica... |
| 4 | 27 | 50.170284 | 6 | 48 | 90.5 | 23 | Cybersecurity Consultant | Highly skilled in network security, threat int... |


In [14]:
synthetic_data

|   | age | income | years_of_experience | age.1 | income.1 | years_of_experience.1 | job_title | remarks |
|---|-----|--------|---------------------|-------|----------|-----------------------|-----------|---------|
| 0 | 39 | 82.271339 | 18 | 45 | 85.2 | 20 | Cybersecurity Specialist | Expert in network security and threat intellig... |
| 1 | 43 | 83.526192 | 19 | 22 | 45.0 | 1 | Junior UX Designer | Excels in user-centered design and prototyping... |
| 2 | 31 | 65.260857 | 13 | 30 | 55.0 | 5 | AI Research Scientist | Specializes in developing and implementing mac... |
| 3 | 41 | 66.374245 | 8 | 36 | 72.1 | 11 | Blockchain Developer | Proficient in developing decentralized applica... |
| 4 | 27 | 50.170284 | 6 | 48 | 90.5 | 23 | Cybersecurity Consultant | Highly skilled in network security, threat int... |
| 5 | 35 | 62.212044 | 15 | 38 | 78.3 | 15 | UX Designer and Researcher | Expert in user-centered design, usability test... |
| 6 | 35 | 64.420952 | 7 | 25 | 55.2 | 3 | AI Ethicist | Specializes in ensuring ethical considerations... |
| 7 | 44 | 80.769485 | 17 | 32 | 60.1 | 9 | Cloud Infrastructure Architect | Proficient in designing and implementing scala... |
| 8 | 32 | 52.044697 | 11 | 28 | 51.3 | 5 | Cybersecurity Specialist | Dedicated to protecting digital assets and ens... |
| 9 | 28 | 63.448936 | 6 | 37 | 78.5 | 15 | UX Research Lead | Expert in user-centered design and research me... |
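
The output contains two sets of numerical columns. The `.1` columns match the values in the LLM's logged data points (for example, ages 45, 22, and 30), so the unsuffixed columns appear to be the GAN's output. If only the GAN's numbers are wanted alongside the generated text, the suffixed columns can be dropped. This is a sketch that assumes the column names are exactly as displayed above; in memory they may differ (for instance, repeated names without a suffix), and `drop_suffixed_duplicates` is a hypothetical helper.

```python
import pandas as pd

# First two rows of the displayed output, with column names as shown there.
preview = pd.DataFrame({
    "age": [39, 43],
    "income": [82.271339, 83.526192],
    "years_of_experience": [18, 19],
    "age.1": [45, 22],
    "income.1": [85.2, 45.0],
    "years_of_experience.1": [20, 1],
    "job_title": ["Cybersecurity Specialist", "Junior UX Designer"],
    "remarks": ["Expert in network security and threat intellig...",
                "Excels in user-centered design and prototyping..."],
})


def drop_suffixed_duplicates(df, suffix=".1"):
    """Drop columns named '<base><suffix>' when a '<base>' column also exists."""
    duplicates = [
        col for col in df.columns
        if col.endswith(suffix) and col[: -len(suffix)] in df.columns
    ]
    return df.drop(columns=duplicates)


tidy = drop_suffixed_duplicates(preview)
print(list(tidy.columns))
```

Note that the text in each row was generated for the LLM's own numbers, so after dropping those columns a job title may not match the GAN's age or income in the same row.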


## Conclusion
This demo showcases how to use a hybrid approach, combining GAN for numerical data and LLM for text data, to generate diverse and realistic synthetic data. By leveraging the indoxGen libraries, this process can be automated and customized to fit your data generation needs.

## Join Us

Join us in exploring how Indox can revolutionize your document processing workflow, bringing clarity and organization to your data retrieval needs. Connect with us and become part of our growing community through the platforms below:

## Community

- [Discord](https://discord.com/invite/xGz5tQYaeq)
- [X (Twitter)](https://x.com/osllmai)
- [LinkedIn](https://www.linkedin.com/company/osllmai/)
- [YouTube](https://www.youtube.com/@osllm-rb9pr)
- [Telegram](https://t.me/osllmai)


Reviewed by: Ali Nemati, March 22, 2025

*Note: some issues have been reported.*

*Lack of demo.*
