Synthetic data is artificially generated information that mimics the characteristics of real-world data but is created through computational methods rather than collected from actual events or sources.

This notebook generates synthetic employee records using the LangChain library.

In [42]:
import pandas as pd
import os
import re
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from pydantic import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

In [43]:
# Load environment variables from .env file
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

## Synthetic Data Schema
The `EmployeeRecord` class defines our schema, outlining the structure and expectations for our synthetic data. This informs the generator about:
- Data types and relationships
- Field formats and constraints

By defining this schema, we ensure our synthetic data mirrors the structure of real-world data.

In [44]:
class EmployeeRecord(BaseModel):
    id: int
    name: str
    age: int
    qualification: str
    salary: int
    bonus: int
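
To make the role of the schema concrete, here is a minimal sketch of how Pydantic enforces the declared field types. The class is repeated so the snippet runs on its own, and the `valid` and `rejected` names are illustrative. Note that the schema declares only field names and types; value ranges such as the allowed age are supplied later through the `extra` instructions.

```python
from pydantic import BaseModel, ValidationError


class EmployeeRecord(BaseModel):
    id: int
    name: str
    age: int
    qualification: str
    salary: int
    bonus: int


# A numeric string is coerced to int by Pydantic's default (lax) validation.
valid = EmployeeRecord(
    id="101",
    name="John Doe",
    age=30,
    qualification="Bachelor's Degree",
    salary=50000,
    bonus=5000,
)

# A value that cannot be interpreted as an int is rejected.
try:
    EmployeeRecord(
        id=102,
        name="Jane Smith",
        age="twenty-eight",
        qualification="Master's Degree",
        salary=60000,
        bonus=6000,
    )
    rejected = False
except ValidationError:
    rejected = True
```

The same validation should apply to each record the generator returns, since the generator is given `EmployeeRecord` as its output schema.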

## Sample Data

In [45]:
examples = [
    {
        "example": """ID: 101, Name: John Doe, Age: 30, 
        Qualification: Bachelor's Degree, Salary: $50000, Bonus: $5000"""
    },
    {
        "example": """ID: 102, Name: Jane Smith, Age: 28, 
        Qualification: Master's Degree, Salary: $60000, Bonus: $6000"""
    },
    {
        "example": """ID: 103, Name: Bob Johnson, Age: 35, 
        Qualification: PhD, Salary: $70000, Bonus: $7000"""
    },
]

## Prompt Template 

In [46]:
PROMPT_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=PROMPT_TEMPLATE,
)

The `FewShotPromptTemplate` includes:
- prefix: Text placed before the examples.
- examples: List of dictionaries, each holding one sample record under the "example" key.
- suffix: Text placed after the examples.
- input_variables: Placeholders ("subject", "extra") that are filled in at generation time to guide the model further. For example, "subject" might be set to "payroll-creation".
- example_prompt: The prompt template used to format each entry in `examples`.
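
The way these pieces combine can be sketched with plain string formatting. This is a simplified illustration, not LangChain's implementation: the `prefix` and `suffix` strings are hypothetical stand-ins for `SYNTHETIC_FEW_SHOT_PREFIX` and `SYNTHETIC_FEW_SHOT_SUFFIX` (whose actual wording may differ), the examples are shortened, `build_few_shot_prompt` is an invented helper, and the blank-line separator is an assumption.

```python
# Hypothetical stand-ins for the library's prefix and suffix templates.
prefix = "Generate synthetic data about {subject}. Examples:"
suffix = "Now generate one new record about {subject}. Constraints: {extra}"

# Plays the role of example_prompt: formats one example dictionary.
example_template = "{example}"

# Shortened versions of the notebook's sample records.
examples = [
    {"example": "ID: 101, Name: John Doe, Age: 30"},
    {"example": "ID: 102, Name: Jane Smith, Age: 28"},
]


def build_few_shot_prompt(subject, extra):
    """Render prefix, each example, and suffix, joined by blank lines."""
    rendered_examples = [example_template.format(**ex) for ex in examples]
    parts = [
        prefix.format(subject=subject),
        *rendered_examples,
        suffix.format(subject=subject, extra=extra),
    ]
    return "\n\n".join(parts)


prompt = build_few_shot_prompt("payroll-creation", "id must be unique")
print(prompt)
```

The printed prompt starts with the filled-in prefix, lists the two examples, and ends with the suffix carrying the `extra` constraints.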

## Defining the Data Generator

In [47]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=EmployeeRecord,
    llm=ChatOpenAI(temperature=1),
    prompt=prompt_template,
)

## Generate Synthetic Data

In [48]:
synthetic_results = synthetic_data_generator.generate(
    subject="payroll-creation",
    extra="""
id: must be unique.
name: must be chosen at random. Make it something you wouldn't normally choose.
The age must be from 23 to 60.
Salary should be based on qualification and age.
Qualification: Bachelor's (40000-60000), Master's (60000-80000), PhD (80000-100000).
Bonus: 8-10% of salary for employees above 40 years old and 6-8% for employees below 40.
""",
    runs=5,
)

In [40]:
synthetic_results

[EmployeeRecord(id=104, name='Alice Johnson', age=27, qualification="Bachelor's Degree", salary=52000, bonus=3744),
 EmployeeRecord(id=105, name='Liam Santos', age=42, qualification="Bachelor's Degree", salary=57000, bonus=4560),
 EmployeeRecord(id=106, name='Jasmine Rodriguez', age=52, qualification='PhD', salary=94000, bonus=9400),
 EmployeeRecord(id=107, name='Xavier Patel', age=38, qualification="Master's Degree", salary=72000, bonus=5760),
 EmployeeRecord(id=108, name='Heather Nguyen', age=30, qualification="Bachelor's Degree", salary=55000, bonus=4400)]
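
The constraints passed in `extra` are only instructions to the model, so it can be worth checking the returned records against them. Below is a minimal sketch of such a check. The `Record` dataclass is a stand-in for `EmployeeRecord` so the snippet runs without an API key, and `SALARY_BANDS` and `check_record` are hypothetical names. The prompt does not say which bonus range applies at exactly age 40, so the sketch assumes either range (6-10%) is acceptable there.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Stand-in for EmployeeRecord with the same fields."""
    id: int
    name: str
    age: int
    qualification: str
    salary: int
    bonus: int


# Salary bands from the `extra` instructions, keyed by qualification prefix.
SALARY_BANDS = {
    "Bachelor": (40000, 60000),
    "Master": (60000, 80000),
    "PhD": (80000, 100000),
}


def check_record(rec):
    """Return a list of rule violations for one record (empty if none)."""
    problems = []
    if not 23 <= rec.age <= 60:
        problems.append("age out of range")
    band = next(
        (b for key, b in SALARY_BANDS.items() if rec.qualification.startswith(key)),
        None,
    )
    if band is None:
        problems.append("unknown qualification")
    elif not band[0] <= rec.salary <= band[1]:
        problems.append("salary outside band")
    if rec.age > 40:
        low, high = 8, 10
    elif rec.age < 40:
        low, high = 6, 8
    else:
        low, high = 6, 10  # assumption: the prompt leaves age 40 unspecified
    # Compare in integer percent units to avoid floating-point rounding.
    if not low * rec.salary <= 100 * rec.bonus <= high * rec.salary:
        problems.append("bonus outside range")
    return problems


# The five records shown in the output above.
records = [
    Record(104, "Alice Johnson", 27, "Bachelor's Degree", 52000, 3744),
    Record(105, "Liam Santos", 42, "Bachelor's Degree", 57000, 4560),
    Record(106, "Jasmine Rodriguez", 52, "PhD", 94000, 9400),
    Record(107, "Xavier Patel", 38, "Master's Degree", 72000, 5760),
    Record(108, "Heather Nguyen", 30, "Bachelor's Degree", 55000, 4400),
]

violations = {rec.id: check_record(rec) for rec in records}
ids_unique = len({rec.id for rec in records}) == len(records)
```

For the five records above, `violations` maps every id to an empty list and `ids_unique` is `True`. Because the generator runs at `temperature=1`, other runs may produce records that fail one or more of these checks.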

## Store the Data

In [41]:
data = [
    {
        "emp_id": record.id,
        "emp_name": record.name,
        "age": record.age,
        "qualification": record.qualification,
        "salary": record.salary,
        "bonus": record.bonus
    } for record in synthetic_results
]

# Create DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to CSV, then display it
df.to_csv('EmployeeRecords_openai_generated.csv')
df

Unnamed: 0,emp_id,emp_name,age,qualification,salary,bonus
0,104,Alice Johnson,27,Bachelor's Degree,52000,3744
1,105,Liam Santos,42,Bachelor's Degree,57000,4560
2,106,Jasmine Rodriguez,52,PhD,94000,9400
3,107,Xavier Patel,38,Master's Degree,72000,5760
4,108,Heather Nguyen,30,Bachelor's Degree,55000,4400
