[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/osllmai/indoxGen/blob/master/examples/generated_with_llm_judge_feedback.ipynb)

In [None]:
%pip install indoxGen
%pip install openai
%pip install python-dotenv

In [1]:
import os
from dotenv import load_dotenv

load_dotenv('api.env')  # Store your API keys in a file named 'api.env'
INDOX_API_KEY = os.environ['INDOX_API_KEY']
NVIDIA_API_KEY = os.environ['NVIDIA_API_KEY']
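
Indexing `os.environ` directly raises a bare `KeyError` when a key is missing, which is easy to misread if `api.env` was never found. Below is a minimal sketch of a friendlier check. The helper name `require_env` is hypothetical and is not part of indoxGen or python-dotenv.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, or raise one error listing every missing name.

    `require_env` is a hypothetical helper written for this example only.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both are unusable as API keys.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Check that 'api.env' exists and defines them."
        )
    return {name: env[name] for name in names}
```

After `load_dotenv('api.env')`, calling `require_env(["INDOX_API_KEY", "NVIDIA_API_KEY"])` would return both keys in a dictionary or report every missing key in one message.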

### Example Data:

This cell defines the column names and the matching example rows used to generate synthetic data. The example focuses on general demographic information.

1. **General Data**:
   - `columns`: A list of three column names: `"name"`, `"age"`, and `"occupation"`. These represent typical demographic information fields.
   - `example_data`: A list of dictionaries where each dictionary represents a person's demographic data. Two individuals are included:
     - **Alice Johnson** (35 years old, Manager).
     - **Bob Williams** (42 years old, Accountant).


In [2]:
columns = ["name", "age", "occupation"]
example_data = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"}
]
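
A quick consistency check can catch typos in the example rows before any API call is made. The sketch below assumes that each example row should carry exactly the keys listed in `columns`; the source does not state whether the library enforces this. The helper `find_mismatched_rows` is hypothetical.

```python
columns = ["name", "age", "occupation"]
example_data = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"},
]


def find_mismatched_rows(columns, rows):
    """Return (index, missing_keys, extra_keys) for each row whose keys differ from `columns`."""
    expected = set(columns)
    problems = []
    for index, row in enumerate(rows):
        keys = set(row)
        if keys != expected:
            problems.append((index, sorted(expected - keys), sorted(keys - expected)))
    return problems


problems = find_mismatched_rows(columns, example_data)
print(problems)  # [] because both rows have exactly the three expected keys
```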

### Setting Up API Clients for Synthetic Data Generation

This cell establishes connections to two API clients, which are used to interact with large language models (LLMs) for synthetic data generation tasks.

1. **Imports**:
   - `from indoxGen.llms import OpenAi, IndoxApi`: Imports the `OpenAi` and `IndoxApi` classes from the `indoxGen.llms` module, enabling access to the OpenAI API and the Indox API for generating synthetic data.

2. **Initialization of API Clients**:
   - **IndoxApi Client**:
     ```python
     indox = IndoxApi(api_key=INDOX_API_KEY)
     ```
     Initializes an `IndoxApi` client using the provided `INDOX_API_KEY` to access the OpenAI model. In this notebook the client is passed to the generator as the judge model, which scores the generated data.

   - **Nemotron (NVIDIA) Client**:
     ```python
     nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct", base_url="https://integrate.api.nvidia.com/v1")
     ```
     Initializes an `OpenAi` client for the `"nvidia/nemotron-4-340b-instruct"` model, using the `NVIDIA_API_KEY` and pointing `base_url` at NVIDIA's integration API. The Nemotron model is a high-capacity language model designed for complex data generation tasks; in this notebook it serves as the generator.

This setup allows for generating synthetic data using two different models: the lightweight GPT-4 mini for more general tasks and the NVIDIA Nemotron model for heavier, more complex data generation.


In [3]:
from indoxGen.llms import IndoxApi, OpenAi

indox = IndoxApi(api_key=INDOX_API_KEY)

nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

### Initializing the Synthetic Data Generator with Human Feedback

This cell sets up the synthetic data generation pipeline with human feedback by initializing an instance of `SyntheticDataGeneratorHF`. The generator uses two language models (LLMs) to create realistic synthetic data from the provided examples and user instructions. It also lets you inspect data points that score below a threshold and either accept them or regenerate them with feedback before they are added to the dataset.

1. **Imports**:
   - `from indoxGen.synthCore import SyntheticDataGeneratorHF`: Imports the `SyntheticDataGeneratorHF` class from the `indoxGen.synthCore` module, which is responsible for generating synthetic data with human feedback based on specified parameters.

2. **Initializing the `SyntheticDataGeneratorHF`**:
   The `SyntheticDataGeneratorHF` is initialized with the following parameters:
   - **generator_llm**: The main language model (`nemotron`) used for generating the synthetic data.
   - **judge_llm**: A secondary language model (`indox`) used for scoring the generated data to check its quality and accuracy.
   - **columns**: Specifies the structure of the synthetic data: `"name"`, `"age"`, and `"occupation"`.
   - **example_data**: Provides example entries to guide the generation process.
   - **user_instruction**: A free-text instruction that guides generation. The instruction used here asks for realistic names, ages, and occupations, a mix of common and rare procedures, varied race, and appropriate ranges for age.
   - **verbose**: Controls the verbosity of the generator, with `1` enabling detailed output during the generation process.
   - **diversity_threshold**: Threshold used when checking the diversity of the generated data (`0.8` here).
   - **feedback_min_score**: Minimum judge score for accepting generated data (`0.9` here). Data points scoring below this threshold are held in `pending_review` until you decide whether to accept or regenerate them.
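
The role of `feedback_min_score` can be illustrated with a small stand-alone sketch. It is not indoxGen's implementation: `route_by_score` is a hypothetical function, and the judge is replaced by a plain callable that returns made-up scores. The `data`/`score` record layout mirrors the columns that `pending_review` displays later in this notebook.

```python
def route_by_score(candidates, judge, min_score):
    """Split candidates into accepted rows and rows held for review.

    `judge` is any callable returning a score in [0, 1]; it stands in for the
    judge LLM. Illustrative sketch only, not the library's code.
    """
    accepted, pending = [], []
    for row in candidates:
        score = judge(row)
        if score >= min_score:
            accepted.append(row)
        else:
            # Rows below the threshold are kept with their score for later review.
            pending.append({"data": row, "score": score})
    return accepted, pending


# Hypothetical scores keyed by name, standing in for judge-LLM output.
fake_scores = {"Alice Johnson": 0.95, "Bob Williams": 0.8}
candidates = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"},
]

accepted, pending = route_by_score(
    candidates, judge=lambda row: fake_scores[row["name"]], min_score=0.9
)
print(len(accepted), len(pending))  # 1 1
```

With a threshold of `0.9`, a row scored `0.8` is not discarded; it waits in the pending list until a reviewer accepts or regenerates it.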


In [4]:
from indoxGen.synthCore import SyntheticDataGeneratorHF
generator = SyntheticDataGeneratorHF(
    generator_llm=nemotron,
    judge_llm=indox,
    columns=columns,
    example_data=example_data,
    user_instruction="Generate realistic data including name, age and occupation. Ensure a mix of common and rare procedures, varying race, and appropriate date ranges for age.",
    verbose=1,
    diversity_threshold=0.8,
    feedback_min_score=0.9,
)
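
The notebook does not say how diversity is measured. One plausible reading of `diversity_threshold=0.8` is that a candidate is rejected when it is too similar to a row that has already been kept. The sketch below works under that assumption and uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the library's actual metric and the direction of its comparison may differ. The function `is_diverse` is hypothetical.

```python
from difflib import SequenceMatcher


def is_diverse(candidate, kept_rows, threshold=0.8):
    """Return True if `candidate` is less than `threshold` similar to every kept row.

    Assumption: similarity is the character-level ratio between the rows'
    values joined into one string. This is an illustration, not indoxGen's metric.
    """
    text = " ".join(str(value) for value in candidate.values())
    for row in kept_rows:
        other = " ".join(str(value) for value in row.values())
        if SequenceMatcher(None, text, other).ratio() >= threshold:
            return False
    return True
```

Under this reading, an exact duplicate of an existing row has a similarity of 1.0 and is rejected, while a row with a different name, age, and occupation passes.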

In [5]:
# Generate data
generated_data = generator.generate_data(num_samples=10)

Generated data point: {'name': 'Rev. Hector Mendez-Villaneuva', 'age': '38', 'occupation': 'Bilingual School Guidance Counselor (Spanish-English)'}
Progress: 1/10 data points generated. Attempts: 10


In [6]:
generated_data

| | name | age | occupation |
|---|---|---|---|
| 0 | Rev. Hector Mendez-Villaneuva | 38 | Bilingual School Guidance Counselor (Spanish-E... |


In [7]:
generator.pending_review

| | data | score |
|---|---|---|
| 0 | `{'name': 'Dr. Maya Patel', 'age': '39', 'occup...` | 0.8 |
| 1 | `{'name': 'Captain Jamal Al-Hussein', 'age': '3...` | 0.8 |
| 2 | `{'name': 'Professor Yuko Takahashi', 'age': '4...` | 0.8 |
| 3 | `{'name': 'Sergeant Major sorsha O'Sullivan', '...` | 0.8 |
| 4 | `{'name': 'Mahmoud Sheikh-Collins', 'age': '37'...` | 0.8 |
| 5 | `{'name': 'Rev. Sister glitches Chang', 'age': ...` | 0.8 |
| 6 | `{'name': 'Dr. Esperanza Garc erweiterte Şehit'...` | 0.7 |
| 7 | `{'name': 'Dr. Kaya competences Ns算法 Engineer',...` | 0.7 |
| 8 | `{'name': 'Prof. Aasha de Luca', 'age': '39', '...` | 0.8 |


* **`accepted_rows`:** A list of row indices from `pending_review` to accept and add to the generated dataset. If set to `'all'`, all pending data points are accepted.
* **`regenerate_rows`:** A list of row indices from `pending_review` to regenerate using the provided feedback. If set to `'all'`, all pending data points are regenerated.
* **`regeneration_feedback`:** A string describing the changes wanted in the regenerated data points.
* **`min_score`:** The minimum score a data point must achieve to be considered for acceptance.

* **Note**: If both `accepted_rows` and `regenerate_rows` are set to `'all'`, only `accepted_rows` takes effect and every pending data point is accepted.
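
The two index lists do not have to be typed by hand; they can be derived from the scores. The sketch below uses a plain list of dictionaries as a stand-in for `pending_review`, and `split_by_score` is a hypothetical helper. Assuming `generator.pending_review` is a pandas DataFrame with a `score` column, as its printed output suggests, `generator.pending_review.to_dict("records")` would yield a list of this shape.

```python
def split_by_score(pending, cutoff):
    """Return (accept, regenerate) index lists based on each row's score."""
    accept = [i for i, row in enumerate(pending) if row["score"] >= cutoff]
    regenerate = [i for i, row in enumerate(pending) if row["score"] < cutoff]
    return accept, regenerate


# Scores copied from the pending_review output shown earlier; row contents are omitted.
scores = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.8]
pending = [{"score": s} for s in scores]

accept, regenerate = split_by_score(pending, cutoff=0.8)
print(accept)      # [0, 1, 2, 3, 4, 5, 8]
print(regenerate)  # [6, 7]
```

These are the same index lists that the next cell passes as `accepted_rows` and `regenerate_rows`.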


In [8]:
generator.user_review_and_regenerate(
    accepted_rows=[0, 1, 2, 3, 4, 5, 8],
    regenerate_rows=[6, 7],
    regeneration_feedback='change name to another name , also change occupation to another occupation',
    min_score=0.7,
)

| | name | age | occupation |
|---|---|---|---|
| 0 | Rev. Hector Mendez-Villaneuva | 38 | Bilingual School Guidance Counselor (Spanish-E... |
| 1 | Dr. Maya Patel | 39 | Neurosurgeon |
| 2 | Captain Jamal Al-Hussein | 36 | Aircraft Pilot |
| 3 | Professor Yuko Takahashi | 41 | Quantum Physicist |
| 4 | Sergeant Major sorsha O'Sullivan | 37 | Military Musician - French Horn Specialist |
| 5 | Mahmoud Sheikh-Collins | 37 | Islamic Art Historian and Curator |
| 6 | Rev. Sister glitches Chang | 36 | Catholic Nun and Computer Science Teacher |
| 7 | Prof. Aasha de Luca | 39 | Archaeologist - Ancient Italian and Indian Civ... |
| 8 | Dr. Kamilla W Zola-Mbeki | 37 | Pediatric Neurologist and South African Langua... |
| 9 | Dr. Zhora Nur angiography Tariq | 40 | Interventional Radiologist and Persian Calligr... |


In [9]:
generator.pending_review

| | index | data | score |
|---|---|---|---|
