# Turn a Bit of Data into a Bunch of Data with SDG Hub

## Environment Setup
Before running any commands, make sure your environment is set up for SDG Hub. You'll need Python 3.10 or newer, a virtual environment (recommended for dependency management), and either a local model endpoint such as Ollama or vLLM, or an API key for an OpenAI-compatible service. Once your environment is ready, installing SDG Hub and its example flows takes only a couple of pip commands. After that, you can start generating synthetic data from your terminal or a Jupyter Notebook.
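
If you are unsure which interpreter your notebook or shell is using, a quick check like the one below can confirm it before you install anything. This is only a minimal sketch: the helper name `python_is_supported` is made up for illustration, and the 3.10 minimum is taken from the requirement stated above.

```python
import sys

# Minimum version taken from the setup requirements above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is new enough.")
else:
    print("Python 3.10 or newer is required; please upgrade before installing.")
```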

## Step 1: Install dependencies
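In a Jupyter notebook cell, run the following commands to install SDG Hub along with the example flows and vLLM integration. In a terminal, drop the leading `!`. The quotes around the extras keep shells such as zsh from interpreting the square brackets.

!pip install sdg-hub
!pip install "sdg-hub[vllm,examples]"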

## Step 2: Include the necessary libraries

In [None]:
from sdg_hub.core.flow import FlowRegistry
from sdg_hub.core.blocks import BlockRegistry

**Show the available Flows**
- List out all of the available flows. Flows are prebuilt workflows for generating synthetic data. 

In [None]:
FlowRegistry.discover_flows()

**Show the available Blocks**
- Then list all of the available blocks. Blocks are the components that make up flows. Like LEGO bricks, you can rearrange them to build your own flow.

In [None]:
BlockRegistry.discover_blocks()

## Step 3: Run your first flow
Here we'll import a prebuilt Question-Answer Generation Flow for knowledge tuning.

In [None]:
from sdg_hub.core.flow import FlowRegistry, Flow
from datasets import Dataset

For our purposes here, we will run one of the pre-built workflows that generates question and answer pairs.

In [None]:
# load a pre-built flow
flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)

## Configure your model backend
This workflow requires a Large Language Model to generate content and to act as a teacher and a critic. SDG Hub doesn't download or run these models for you, so you'll need to have your chosen model endpoint set up separately before proceeding. SDG Hub can connect to any OpenAI-compatible API, whether that's a locally hosted option like Ollama or vLLM, or a cloud-hosted service such as OpenAI or Anthropic. Once your endpoint is running, you point SDG Hub to it by specifying the model name, API base URL, and API key in the configuration.
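
Because SDG Hub will not start the endpoint for you, it can help to confirm that something is actually listening before you configure the flow. The sketch below is not part of SDG Hub. It assumes your server exposes the model-listing route that OpenAI-compatible APIs conventionally provide at `<api_base>/models`, and the helper names `models_url` and `list_models` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return api_base.rstrip("/") + "/models"


def list_models(api_base: str, api_key: str = "dummy", timeout: float = 3.0):
    """Return the model IDs the endpoint reports, or None if it is unreachable."""
    request = urllib.request.Request(
        models_url(api_base),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: the reply follows the usual {"data": [{"id": ...}, ...]} shape.
    return [entry.get("id") for entry in payload.get("data", [])]
```

For example, `list_models("http://localhost:11434/v1")` should return a list of model names when a local server is running on that port, and `None` when nothing answers.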

**Option A: Ollama (free, easiest local option)**
Ollama is great for testing: install it, pull a model (e.g. `ollama pull llama3`), and SDG Hub can use it as an OpenAI-compatible endpoint for free.
To run locally on CPU/GPU via Ollama:

In [None]:
# Pass the settings as keyword arguments, as in the other options
flow.set_model_config(
    model="ollama/llama3",
    api_base="http://localhost:11434/v1",
    api_key="ollama",
)

**Option B: Local vLLM (free, GPU required)**
If you're running vLLM locally or as a remote endpoint, use the `hosted_vllm/` prefix as shown here.
(Note: the `vllm/` prefix, which ran the local vLLM SDK in-process, has been deprecated.)

In [None]:
flow.set_model_config(
    model="hosted_vllm/meta-llama/Llama-3.1-8B-Instruct",
    api_base="http://localhost:8000/v1",  # or http://<remote-ip>:8000/v1
    api_key="your_api_key_here",  # or a placeholder such as "dummy"
)

**Option C: OpenAI or Claude API (paid)**
You can use any OpenAI-compatible endpoint, local or hosted.

In [None]:
flow.set_model_config(
    model="openai/gpt-3.5-turbo",
    api_key="your_api_key_here",
)
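
For a paid API, you may prefer not to paste the key into a notebook cell. One common pattern, sketched here, reads it from an environment variable instead. The variable name `OPENAI_API_KEY` and the helper `openai_model_config` are assumptions made for illustration and are not part of SDG Hub.

```python
import os


def openai_model_config(
    model="openai/gpt-3.5-turbo",
    env_var="OPENAI_API_KEY",  # assumed variable name; use whatever you export
    environ=os.environ,
):
    """Build the keyword arguments for set_model_config from the environment."""
    api_key = environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"model": model, "api_key": api_key}


# In the notebook, with `flow` loaded as shown earlier:
# flow.set_model_config(**openai_model_config())
```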

## Getting the default model for the flow
Each pre-built flow has a list of recommended models to use. To view them, run the following code.

In [None]:
# Discover recommended models
default_model = flow.get_default_model()
recommendations = flow.get_model_recommendations()

print('default_model:')
print(default_model)

print('recommendations: ')
print(recommendations)

## Step 4: Create a sample dataset
We'll start with a simple document and a few in-context learning (ICL) examples, given as query and response pairs. To keep things simple, the dataset is defined in code. You can also load data from other sources, such as documents converted with Docling or other data storage systems.

In [None]:
# Create a sample and simple dataset
dataset = Dataset.from_dict({
    'document': ['The Great Dane is a German breed of domestic dog known for its imposing size. It is one of the world\'s tallest dog breeds, often referred to as the "Apollo of Dogs."'],
    'document_outline': ['1. Great Dane Origin; 2. Size and Height; 3. Breed Nicknames'],
    'domain': ['Canine Breeds'],
    'icl_document': ['The Labrador Retriever is a British breed of retriever gun dog that is consistently one of the most popular dog breeds in the world.'],
    'icl_query_1': ['What is the origin of the Labrador Retriever?'],
    'icl_response_1': ['The Labrador Retriever is a British breed.'],
    'icl_query_2': ['What type of dog is a Labrador?'],
    'icl_response_2': ['The Labrador is a retriever gun dog.'],
    'icl_query_3': ['How popular is the Labrador Retriever?'],
    'icl_response_3': ['It is consistently one of the most popular dog breeds in the world.']
})
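
Once you move past a single document, repeating the same ICL example by hand for every row gets tedious. The sketch below shows one way to build the same columns for several documents that share one in-context example. The helper `build_columns` and the placeholder second document are hypothetical; the column names are the ones used in the dataset above.

```python
# Shared in-context example, copied from the sample dataset above.
ICL_EXAMPLE = {
    "icl_document": "The Labrador Retriever is a British breed of retriever gun dog that is consistently one of the most popular dog breeds in the world.",
    "icl_query_1": "What is the origin of the Labrador Retriever?",
    "icl_response_1": "The Labrador Retriever is a British breed.",
    "icl_query_2": "What type of dog is a Labrador?",
    "icl_response_2": "The Labrador is a retriever gun dog.",
    "icl_query_3": "How popular is the Labrador Retriever?",
    "icl_response_3": "It is consistently one of the most popular dog breeds in the world.",
}


def build_columns(documents, outlines, domain, icl=ICL_EXAMPLE):
    """Return a column dict with one row per document and a shared ICL example."""
    if len(documents) != len(outlines):
        raise ValueError("documents and outlines must have the same length")
    rows = len(documents)
    columns = {
        "document": list(documents),
        "document_outline": list(outlines),
        "domain": [domain] * rows,
    }
    # Repeat the ICL fields so every column has the same number of rows.
    for name, value in icl.items():
        columns[name] = [value] * rows
    return columns


columns = build_columns(
    documents=[
        "The Great Dane is a German breed of domestic dog known for its imposing size.",
        "Replace this placeholder with the text of your second document.",
    ],
    outlines=[
        "1. Great Dane Origin; 2. Size and Height; 3. Breed Nicknames",
        "Replace this placeholder with the outline of your second document.",
    ],
    domain="Canine Breeds",
)

# In the notebook: dataset = Dataset.from_dict(columns)
```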

## Quick Note for Running the Code in a Jupyter Notebook
When running asynchronous code in a Jupyter Notebook, you may encounter runtime errors such as `RuntimeError: This event loop is already running`.
That's because SDG Hub executes parts of its pipelines asynchronously to handle multiple model requests efficiently. Jupyter itself already runs an event loop, so without a patch, Python would try to start a second loop and fail.
The following lines fix that by applying the `nest_asyncio` patch, which allows nested event loops in the same runtime:

In [None]:
import nest_asyncio
nest_asyncio.apply()
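
If the same code also needs to run as a plain script, where no event loop is running yet, you could apply the patch only when it is needed. This is a sketch of that idea rather than anything SDG Hub requires, and the helper name `event_loop_is_running` is made up for illustration.

```python
import asyncio


def event_loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises when no loop is running in this thread.
        return False
    return True


if event_loop_is_running():
    # Typically the case inside Jupyter: allow nested use of the loop.
    import nest_asyncio

    nest_asyncio.apply()
```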

## Step 5: Dry run (recommended first)
This runs a quick test to make sure the pipeline has no errors or configuration issues.

In [None]:
# Test with a small sample first (recommended!)
print("🧪 Running dry run...")
dry_result = flow.dry_run(dataset, sample_size=1)

If that runs without errors, run the following code to see the results.

In [None]:
print(f"✅ Dry run completed in {dry_result['execution_time_seconds']:.2f}s")
print(f"📊 Output columns: {list(dry_result['final_dataset']['columns'])}")
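
Beyond checking that the dry run finishes, you might also confirm that the output contains the columns the review code in Step 7 reads. The sketch below assumes those column names (`question`, `response`, `faithfulness_judgment`, `relevancy_score`), and `missing_columns` is a hypothetical helper, not an SDG Hub function.

```python
# Column names that the review code in Step 7 reads.
EXPECTED_COLUMNS = ("question", "response", "faithfulness_judgment", "relevancy_score")


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


# In the notebook:
# missing = missing_columns(dry_result['final_dataset']['columns'])

# Stand-in column list so the sketch runs on its own.
sample_columns = ["document", "question", "response"]
print(missing_columns(sample_columns))  # ['faithfulness_judgment', 'relevancy_score']
```

An empty list means every expected column is present.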

## Step 6: Generate synthetic data
Once the dry run completes successfully, the flow is ready for a full run. Run the following code to start it.

In [None]:
# Generate high-quality QA pairs
print("🏗️ Generating synthetic data...")
result = flow.generate(dataset)

## Step 7: Review and export your generated data
You will notice that the full run takes longer to complete than the dry run. That's because the full run generates far more data than the dry run, which only processes a small sample.

Run the following code to see how many QA (question and answer) pairs have been generated.

In [None]:
# Explore the results
print(f"\n📈 Generated {len(result)} QA pairs!")

Now we know how many pairs have been generated. Run the following code to look at the QA pairs we generated synthetically. 

**Review the Generated QA Pairs**

In [None]:
# Every column has one entry per generated pair, so any column (e.g., 'question') gives the count
num_pairs = len(result['question'])

print(f"\n--- Generated {num_pairs} QA pairs ---")

# Iterate over every pair, from index 0 up to (but not including) num_pairs
for i in range(num_pairs):
    print(f"\n--- QA Pair #{i+1} ---")
    print(f"📝 Question: {result['question'][i]}")
    print(f"💬 Answer: {result['response'][i]}")
    print(f"🎯 Faithfulness Score: {result['faithfulness_judgment'][i]}")
    print(f"📏 Relevancy Score: {result['relevancy_score'][i]}")

print("\n--- End of Report ---")

**Explore the Synthetic Data More Closely**

In [None]:
type(result)

In [None]:
df = result.to_pandas()

In [None]:
df.shape

In [None]:
df.info()   

**Show Synthetic Data**

In [None]:
df.head()

**Export Entire Dataset to CSV**

In [None]:
# index=False keeps the pandas row index out of the file
df.to_csv('entire_synthetic_dataset.csv', index=False)

**Narrow down the dataset to just Q&A pairs**

In [None]:
qa_df = result.to_pandas()[["question", "response", "verification_rating", "relevancy_score", "faithfulness_judgment"]]
qa_df

**Export the Q&A pairs to CSV**

In [None]:
# assuming qa_df is your DataFrame
qa_df.to_csv("synthetic_qa_pairs.csv", index=False)
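
If your training tooling expects JSON Lines rather than CSV, the same records can be written one JSON object per line using only the standard library. This is a sketch under that assumption; `write_jsonl` and the output file name are hypothetical.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # default=str is a fallback for values json cannot serialize directly.
            handle.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


# In the notebook:
# write_jsonl(qa_df.to_dict(orient="records"), "synthetic_qa_pairs.jsonl")
```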

Hopefully, you found this helpful and are now left with a purpose-built dataset ready to train your model. If so, check out SDG Hub, clone the repo, tweak a flow, and start teaching your model something new. Happy generating!