# 🚀 Build Synthetic Datasets with Cerebras + Synthetic Data Kit

Check out Synthetic Data Kit here: https://github.com/meta-llama/synthetic-data-kit/

**ODSC Workshop - From Research Paper to Fine-Tuning Dataset**

In this notebook, you'll:
- ✅ Parse the Llama 3 research paper
- ✅ Generate 50 Q&A pairs using Cerebras inference
- ✅ Filter for quality using LLM-as-judge
- ✅ Export to fine-tuning format

**No coding required - just run the cells!** ⚡

## 🔑 Step 1: Set Your Cerebras API Key

Enter your Cerebras API key below:

In [None]:
import os
from google.colab import userdata

# Option 1: Paste your API key here (never commit or share a notebook that contains a real key)
CEREBRAS_API_KEY = "YOUR_CEREBRAS_API_KEY"

# Option 2: Use Colab Secrets (recommended - add key as 'CEREBRAS_API_KEY' in secrets)
# Uncomment below if using secrets:
# CEREBRAS_API_KEY = userdata.get('CEREBRAS_API_KEY')

# Set environment variable
os.environ['CEREBRAS_API_KEY'] = CEREBRAS_API_KEY

print("✅ API key configured!")
print(f"🔑 Key preview: {CEREBRAS_API_KEY[:10]}...")

✅ API key configured!
🔑 Key preview: csk-kktf8h...
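
If you would rather not paste the key into a cell at all, one option is to read it from the environment and fall back to a hidden prompt. The sketch below is illustrative only: `resolve_api_key` and `mask_key` are hypothetical helper names, not part of Synthetic Data Kit or any Cerebras library.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="CEREBRAS_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, asking with hidden input if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        key = prompt_fn(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    os.environ[env_var] = key  # visible to later cells and to `!` shell commands
    return key


def mask_key(key, visible=4):
    """Keep the first `visible` characters and star out the rest for safe printing."""
    return key[:visible] + "*" * max(len(key) - visible, 0)
```

Calling `print(mask_key(resolve_api_key()))` in a cell would then confirm that a key is set without echoing it in full.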


## 📦 Step 2: Install Synthetic Data Kit

Installing the toolkit and dependencies...

In [None]:
!pip install -q synthetic-data-kit
!pip install -q datasets  # For HuggingFace format export

# Verify installation
!synthetic-data-kit --help | head -15

print("\n✅ Installation complete!")

## ⚙️ Step 3: Download Workshop Configuration

Downloading the ready-to-use config from GitHub and setting up directories...

In [None]:
# Create directory structure
!mkdir -p data/{input,parsed,generated,curated,final}

print("📥 Downloading workshop config from GitHub...")

# Download the ready-to-use config from GitHub (ODSC-Workshop branch)
!wget -q https://raw.githubusercontent.com/meta-llama/synthetic-data-kit/ODSC-Workshop/configs/config.yaml -O cerebras_config.yaml

print("✅ Config downloaded!")

# Replace the API key placeholder with your actual key
import os

with open('cerebras_config.yaml', 'r') as f:
    config_content = f.read()

# Replace the placeholder with the actual API key (raises KeyError if Step 1 was skipped)
config_content = config_content.replace('YOUR_CEREBRAS_API_KEY', os.environ['CEREBRAS_API_KEY'])

with open('cerebras_config.yaml', 'w') as f:
    f.write(config_content)

print("✅ Configuration ready with your API key!")
print("\n📁 Directory structure:")
!tree data/ || ls -R data/

print("\n📄 Config preview (first 35 lines):")
!head -35 cerebras_config.yaml

📥 Downloading workshop config from GitHub...
✅ Config downloaded!
✅ Configuration ready with your API key!

📁 Directory structure:
/bin/bash: line 1: tree: command not found
data/:
curated  final	generated  input  parsed

data/curated:

data/final:

data/generated:

data/input:

data/parsed:

📄 Config preview (first 35 lines):
# Master configuration file for Synthetic Data Kit
# Workshop-ready configuration with Cerebras defaults

# Global paths configuration
paths:
  # Input data location (directory containing files to process)
  input: "data/input"           # Directory containing PDF, HTML, DOCX, PPT, TXT files

  # Output locations (4-stage pipeline directories)
  output:
    parsed: "data/parsed"       # Stage 1: Where parsed text files are saved (ingest output)
    generated: "data/generated" # Stage 2: Where generated QA pairs are saved (create output)
    curated: "data/curated"     # Stage 3: Where curated QA pairs are saved (curate output)
    final: "data/final"         # St
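
The placeholder substitution in the cell above does nothing, and reports no error, if the downloaded config does not contain the exact string `YOUR_CEREBRAS_API_KEY`. A slightly more defensive version might look like the sketch below; `inject_api_key` is a hypothetical helper written for this notebook, not a toolkit function.

```python
def inject_api_key(config_text, api_key, placeholder="YOUR_CEREBRAS_API_KEY"):
    """Return config_text with the placeholder replaced, failing loudly on bad input."""
    if not api_key:
        raise ValueError("API key is empty; run Step 1 first")
    if placeholder not in config_text:
        raise ValueError(f"placeholder {placeholder!r} not found in config")
    return config_text.replace(placeholder, api_key)
```

It can stand in for the `.replace(...)` call: `config_content = inject_api_key(config_content, os.environ["CEREBRAS_API_KEY"])`.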

## 🔌 Step 4: Test API Connection

Verifying connection to Cerebras...

In [None]:
!synthetic-data-kit -c cerebras_config.yaml system-check

print("\n✅ If you see 'API endpoint access confirmed' above, you're ready to go!")

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Environment variable check:
API_ENDPOINT_KEY: Not found
get_llm_provider returning: api-endpoint
API_ENDPOINT_KEY environment variable: Not found
API key source: Config file
Checking API endpoint access...
INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
API endpoint access confirmed
Using custom API base: https://api.cerebras.ai/v1
Default model: llama3.3-70b
Response from model: Hello. How can I assist you today?
✅ If you see
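
The system-check log reports `API_ENDPOINT_KEY: Not found` and then falls back to the key in the config file. That suggests the toolkit also looks for an `API_ENDPOINT_KEY` environment variable. The sketch below copies the Cerebras key into that variable; it assumes the toolkit uses the variable when it is set, which this notebook does not verify. `export_endpoint_key` is a hypothetical helper name.

```python
import os


def export_endpoint_key(source_var="CEREBRAS_API_KEY",
                        target_var="API_ENDPOINT_KEY",
                        environ=None):
    """Copy the key from source_var to target_var unless target_var is already set."""
    env = os.environ if environ is None else environ
    key = env.get(source_var)
    if not key:
        raise KeyError(f"{source_var} is not set")
    env.setdefault(target_var, key)  # never overwrite an existing value
    return env[target_var]
```

After calling `export_endpoint_key()`, re-running `system-check` should show whether the log line changes from `Not found`.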

## 📥 Step 5: Download Llama 3 Paper

Downloading the research paper from arXiv...

In [None]:
!wget -q https://arxiv.org/pdf/2407.21783 -O llama3_paper.pdf

# Verify download
import os
file_size = os.path.getsize('llama3_paper.pdf') / 1024  # KB

print(f"✅ Paper downloaded successfully!")
print(f"📄 File: llama3_paper.pdf")
print(f"💾 Size: {file_size:.1f} KB")

!ls -lh llama3_paper.pdf

✅ Paper downloaded successfully!
📄 File: llama3_paper.pdf
💾 Size: 9602.7 KB
-rw-r--r-- 1 root root 9.4M Nov 26  2024 llama3_paper.pdf
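
A non-zero file size does not prove the download is a PDF, because `wget` can also save an HTML error page under the same name. As a minimal sanity check, the sketch below looks for the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper name.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

For example, `assert looks_like_pdf("llama3_paper.pdf")` stops the notebook early if the download went wrong.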


---

# 🔄 The 4-Stage Pipeline

```
PDF → INGEST → CREATE → CURATE → SAVE-AS → Training Data ✨
```
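
Each stage writes its result under a predictable name that the next stage reads. The sketch below collects the paths seen in this notebook's run in one place. The naming pattern is inferred from this run only (the `save-as` entry matches `--format ft --storage hf`) and may differ with other options or toolkit versions. `expected_outputs` is a hypothetical helper name.

```python
def expected_outputs(stem):
    """Map each pipeline stage to the output path observed in this notebook."""
    qa = f"{stem}_qa_pairs"
    return {
        "ingest": f"data/parsed/{stem}.txt",
        "create": f"data/generated/{qa}.json",
        "curate": f"data/curated/{qa}_cleaned.json",
        "save-as": f"data/final/{qa}_cleaned_ft_hf",
    }


for stage, path in expected_outputs("llama3_paper").items():
    print(f"{stage:8s} -> {path}")
```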

## 📚 Stage 1: INGEST - Parse the PDF

**What it does:** Extracts the text from the PDF and saves it as a .txt file

This usually takes under a minute (about 15 seconds in the run below)...

In [None]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  ingest llama3_paper.pdf

print("\n" + "="*60)
print("✅ INGEST complete!")
print("="*60)

# Check output
!ls -lh data/parsed/

# Preview first few lines of the extracted text
print("\n📝 Preview of extracted text:")
!head -20 data/parsed/llama3_paper.txt

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Processing llama3_paper.pdf...
✅ Text successfully extracted to data/parsed/llama3_paper.txt

✅ INGEST complete!
total 352K
-rw-r--r-- 1 root root 352K Oct 28 14:51 llama3_paper.txt

📝 Preview of extracted text:
4
2
0
2

v
o
N
3
2

]
I

A
.
s
c
[

CPU times: user 91 ms, sys: 6.18 ms, total: 97.2 ms
Wall time: 15.2 s
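
The preview above starts with a run of one-character lines. This looks like the rotated arXiv stamp from the page margin, extracted one character per line. If you want to drop it before generating Q&A pairs, a minimal sketch follows. It only removes the leading run of very short lines, so similar residue deeper in the file is left alone. `strip_margin_stamp` is a hypothetical helper name.

```python
def strip_margin_stamp(text, max_len=1):
    """Drop leading lines whose stripped length is <= max_len (blank lines included)."""
    lines = text.splitlines()
    start = 0
    while start < len(lines) and len(lines[start].strip()) <= max_len:
        start += 1
    return "\n".join(lines[start:])
```

To apply it, read `data/parsed/llama3_paper.txt`, pass the contents through `strip_margin_stamp`, and write the result back before running the CREATE stage.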


## 🤖 Stage 2: CREATE - Generate Q&A Pairs

**What it does:** Uses Cerebras + Llama 3.3-70B with custom prompts to generate intelligent Q&A pairs

This takes ~2-4 minutes for 50 pairs... ☕

In [None]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 50 \
  --verbose

print("\n" + "="*60)
print("✅ CREATE complete!")
print("="*60)

# Check output
!ls -lh data/generated/

# Count Q&A pairs
import json
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    data = json.load(f)

print(f"\n📊 Generated {len(data['qa_pairs'])} Q&A pairs")

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Generating qa content from data/parsed/llama3_paper.txt...
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-e

### 🔍 Preview Generated Q&A Pairs

In [None]:
import json

# Load and display first 3 Q&A pairs
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    data = json.load(f)

print("📝 Summary:")
print(data['summary'][:200] + "...\n")

print("\n" + "="*60)
print("📚 Sample Q&A Pairs:")
print("="*60)

for i, pair in enumerate(data['qa_pairs'][:3], 1):
    print(f"\n{i}. Question:")
    print(f"   {pair['question']}")
    print(f"\n   Answer:")
    print(f"   {pair['answer'][:150]}...")
    print("\n" + "-"*60)

📝 Summary:
Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:

The paper introduces Llama 3, a new set of foundation models for language that natively support multil...


📚 Sample Q&A Pairs:

1. Question:
   What is the size of the largest Llama 3 model in terms of parameters?

   Answer:
   405B parameters...

------------------------------------------------------------

2. Question:
   How many parameters does the flagship model have?

   Answer:
   405B...

------------------------------------------------------------

3. Question:
   What is the parameter size of the largest Llama 3 language model?

   Answer:
   405B...

------------------------------------------------------------
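
The three sample pairs above ask nearly the same question about the 405B model. One simple way to thin out such near-duplicates before curation is to compare questions by token overlap (Jaccard similarity). This is a sketch of that general technique, not something the toolkit is known to do. The 0.6 threshold and the helper names are assumptions chosen for illustration.

```python
import re


def _tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings (1.0 if both are empty)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(pairs, threshold=0.6):
    """Keep a pair only if its question is below threshold against every kept question."""
    kept = []
    for pair in pairs:
        if all(jaccard(pair["question"], k["question"]) < threshold for k in kept):
            kept.append(pair)
    return kept
```

On the three sample questions, the first and third share 9 of 14 distinct tokens (about 0.64), so the third is dropped at a 0.6 threshold. The second, worded differently, is kept, which shows the limit of a purely lexical check.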


## ✨ Stage 3: CURATE - Filter Quality

**What it does:** Uses an LLM-as-judge with a custom rating prompt to score each Q&A pair and drop those below the threshold

This takes ~2-3 minutes... 🎯

In [None]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 7.5 \
  --verbose

print("\n" + "="*60)
print("✅ CURATE complete!")
print("="*60)

# Check output
!ls -lh data/curated/

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Cleaning content from data/generated/llama3_paper_qa_pairs.json...
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint

### 📊 Quality Metrics

In [None]:
import json

# Load curated data
with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

metrics = curated.get('metrics', {})

print("="*60)
print("📊 CURATION RESULTS")
print("="*60)
print(f"\n📝 Total pairs generated:     {metrics.get('total', 0)}")
print(f"✅ Pairs kept (≥7.5 rating):  {metrics.get('filtered', 0)}")
print(f"📈 Retention rate:            {metrics.get('retention_rate', 0)*100:.1f}%")
print(f"⭐ Average quality score:     {metrics.get('avg_score', 0):.1f}/10")

print("\n" + "="*60)
print("🎯 Quality filtering complete!")
print(f"   Kept {metrics.get('filtered', 0)} high-quality pairs")
print("="*60)

📊 CURATION RESULTS

📝 Total pairs generated:     50
✅ Pairs kept (≥7.5 rating):  46
📈 Retention rate:            92.0%
⭐ Average quality score:     8.5/10

🎯 Quality filtering complete!
   Kept 46 high-quality pairs
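
The reported retention rate is simply kept pairs divided by generated pairs (46 / 50 = 92.0%). If the kept pairs carry a `rating` field, as the preview cell below assumes, you can recompute such figures yourself and also see how a stricter threshold would shrink the set, without calling the API again. The helper names are hypothetical. Because the curated file holds only pairs that already passed the curation threshold, the sweep is meaningful only for thresholds at or above it.

```python
def summarize_ratings(ratings, total_generated):
    """Return kept count, retention rate, and mean rating of the kept pairs."""
    kept = len(ratings)
    return {
        "kept": kept,
        "retention_rate": kept / total_generated if total_generated else 0.0,
        "avg_kept_rating": sum(ratings) / kept if kept else 0.0,
    }


def sweep_thresholds(ratings, thresholds):
    """Count how many ratings meet or exceed each candidate threshold."""
    return {t: sum(r >= t for r in ratings) for t in thresholds}
```

A typical call would be `sweep_thresholds([p.get("rating", 0) for p in curated["qa_pairs"]], [7.5, 8.0, 8.5, 9.0])`.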


### 👀 Preview Top-Rated Q&A Pairs

In [None]:
import json

with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

# Sort by rating (descending)
sorted_pairs = sorted(curated['qa_pairs'], key=lambda x: x.get('rating', 0), reverse=True)

print("="*60)
print("🌟 TOP 3 HIGHEST-RATED Q&A PAIRS")
print("="*60)

for i, pair in enumerate(sorted_pairs[:3], 1):
    print(f"\n{i}. Rating: ⭐ {pair.get('rating', 'N/A')}/10")
    print(f"\n   Q: {pair['question']}")
    print(f"\n   A: {pair['answer'][:200]}...")
    print("\n" + "-"*60)

🌟 TOP 3 HIGHEST-RATED Q&A PAIRS

1. Rating: ⭐ 9/10

   Q: What is the size of the largest Llama 3 model in terms of parameters?

   A: 405B parameters...

------------------------------------------------------------

2. Rating: ⭐ 9/10

   Q: What is the parameter size of the largest Llama 3 language model?

   A: 405B...

------------------------------------------------------------

3. Rating: ⭐ 9/10

   Q: What is the purpose of the language model pre-training data curation process?

   A: The purpose of the language model pre-training data curation process is to obtain high-quality tokens by applying several de-duplication methods and data cleaning mechanisms on each data source....

------------------------------------------------------------


## 💾 Stage 4: SAVE-AS - Export to Training Format

**What it does:** Converts the curated Q&A pairs into fine-tuning-ready formats

We'll create multiple formats...

In [None]:
%%time

# Format 1: HuggingFace Dataset (Arrow format - recommended!)
print("📦 Creating HuggingFace dataset...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format ft \
  --storage hf

# Format 2: OpenAI Fine-Tuning (JSON)
print("\n📦 Creating OpenAI FT format...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format ft

# Format 3: Alpaca format
print("\n📦 Creating Alpaca format...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format alpaca

print("\n" + "="*60)
print("✅ SAVE-AS complete!")
print("="*60)

# Show all formats
!ls -lh data/final/

📦 Creating HuggingFace dataset...
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Converting data/curated/llama3_paper_qa_pairs_cleaned.json to ft format with

### 🎯 Load & Inspect HuggingFace Dataset

In [None]:
from datasets import load_from_disk
import json

# Load the HuggingFace dataset
dataset = load_from_disk('data/final/llama3_paper_qa_pairs_cleaned_ft_hf')

print("="*60)
print("📊 HUGGINGFACE DATASET INFO")
print("="*60)
print(f"\n📦 Dataset size: {len(dataset)} examples")
print(f"\n🔧 Features: {dataset.features}")

print("\n" + "="*60)
print("📝 SAMPLE TRAINING EXAMPLE (OpenAI Format)")
print("="*60)

# Show first example
example = dataset[0]
print(json.dumps(example, indent=2))

print("\n" + "="*60)
print("✅ Ready to use with Transformers, Axolotl, or any training framework!")
print("="*60)

📊 HUGGINGFACE DATASET INFO

📦 Dataset size: 46 examples

🔧 Features: {'messages': List({'content': Value('string'), 'role': Value('string')})}

📝 SAMPLE TRAINING EXAMPLE (OpenAI Format)
{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "What is the size of the largest Llama 3 model in terms of parameters?",
      "role": "user"
    },
    {
      "content": "405B parameters",
      "role": "assistant"
    }
  ]
}

✅ Ready to use with Transformers, Axolotl, or any training framework!
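
To make the record layout concrete, the sketch below builds the chat-style `messages` structure shown in the sample above from a single Q&A pair, together with an Alpaca-style record. The Alpaca field names (`instruction`, `input`, `output`) follow the commonly used convention. The toolkit's own Alpaca output is not shown in this notebook, so treat that part as an assumption. Both function names are hypothetical.

```python
def to_chat_record(question, answer, system_prompt="You are a helpful assistant."):
    """Build a chat-style fine-tuning record with system, user, and assistant turns."""
    return {
        "messages": [
            {"content": system_prompt, "role": "system"},
            {"content": question, "role": "user"},
            {"content": answer, "role": "assistant"},
        ]
    }


def to_alpaca_record(question, answer):
    """Build an Alpaca-style record; `input` is empty because there is no extra context."""
    return {"instruction": question, "input": "", "output": answer}
```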


---

# 🎉 Success! Your Dataset is Ready!

## 📊 Final Summary

In [None]:
import json
from datasets import load_from_disk

# Load files
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    generated = json.load(f)

with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

dataset = load_from_disk('data/final/llama3_paper_qa_pairs_cleaned_ft_hf')

print("\n" + "="*60)
print("🎉 WORKSHOP COMPLETE - SUMMARY")
print("="*60)

print("\n📚 Source:")
print("   • Llama 3 Research Paper (arXiv:2407.21783)")

print("\n🔄 Pipeline Results:")
print(f"   1️⃣ INGEST:   ✅ PDF → Clean text (.txt)")
print(f"   2️⃣ CREATE:   ✅ Generated {len(generated['qa_pairs'])} Q&A pairs (custom prompts)")
print(f"   3️⃣ CURATE:   ✅ Kept {len(curated['qa_pairs'])} high-quality pairs (≥7.5/10)")
print(f"   4️⃣ SAVE-AS:  ✅ Exported to 3 formats")

metrics = curated.get('metrics', {})
print("\n📊 Quality Metrics:")
print(f"   • Retention rate: {metrics.get('retention_rate', 0)*100:.1f}%")
print(f"   • Average score: {metrics.get('avg_score', 0):.1f}/10")

print("\n💾 Output Formats:")
print(f"   • HuggingFace Dataset: {len(dataset)} examples (Arrow format)")
print(f"   • OpenAI Fine-Tuning: JSON format")
print(f"   • Alpaca: JSON format")

print("\n📂 Files Location:")
print("   • data/final/ (all formats)")

print("\n" + "="*60)
print("🚀 Your dataset is ready for fine-tuning!")
print("="*60)

print("\n💡 Next Steps:")
print("   • Download the dataset from data/final/")
print("   • Use with Transformers, Axolotl, or your training framework")
print("   • Fine-tune your model!")


🎉 WORKSHOP COMPLETE - SUMMARY

📚 Source:
   • Llama 3 Research Paper (arXiv:2407.21783)

🔄 Pipeline Results:
   1️⃣ INGEST:   ✅ PDF → Clean text (.txt)
   2️⃣ CREATE:   ✅ Generated 50 Q&A pairs (custom prompts)
   3️⃣ CURATE:   ✅ Kept 46 high-quality pairs (≥7.5/10)
   4️⃣ SAVE-AS:  ✅ Exported to 3 formats

📊 Quality Metrics:
   • Retention rate: 92.0%
   • Average score: 8.5/10

💾 Output Formats:
   • HuggingFace Dataset: 46 examples (Arrow format)
   • OpenAI Fine-Tuning: JSON format
   • Alpaca: JSON format

📂 Files Location:
   • data/final/ (all formats)

🚀 Your dataset is ready for fine-tuning!

💡 Next Steps:
   • Download the dataset from data/final/
   • Use with Transformers, Axolotl, or your training framework
   • Fine-tune your model!


---

# 🎮 Bonus Experiments

Try these optional experiments to explore more features!

## 🧪 Experiment 1: Try Different Quality Thresholds

In [None]:
import json

# Strict filtering (8.5+)
print("🔍 Testing threshold 8.5 (very strict)...")
!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 8.5 \
  -o data/curated/strict_8.5.json

# Lenient filtering (6.5+)
print("\n🔍 Testing threshold 6.5 (lenient)...")
!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 6.5 \
  -o data/curated/lenient_6.5.json

# Compare results
with open('data/curated/strict_8.5.json') as f:
    strict = json.load(f)
with open('data/curated/lenient_6.5.json') as f:
    lenient = json.load(f)
with open('data/curated/llama3_paper_qa_pairs_cleaned.json') as f:
    default = json.load(f)

print("\n" + "="*60)
print("📊 THRESHOLD COMPARISON")
print("="*60)
print(f"\nThreshold 8.5 (strict):   {len(strict['qa_pairs'])} pairs kept")
print(f"Threshold 7.5 (default):  {len(default['qa_pairs'])} pairs kept")
print(f"Threshold 6.5 (lenient):  {len(lenient['qa_pairs'])} pairs kept")
print("\n💡 Lower threshold = more pairs, but potentially lower quality")

🔍 Testing threshold 8.5 (very strict)...
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Processing 17 batches of QA pairs...
Cleaning content from data/generated/llama3_paper_qa_pairs.json...
INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions

## 🧪 Experiment 2: Generate More Q&A Pairs

In [None]:
%%time

print("🎯 Generating 100 Q&A pairs...\n")

!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 100 \
  -o data/generated/large_dataset.json \
  --verbose

# Count pairs
import json
with open('data/generated/large_dataset.json') as f:
    large = json.load(f)

print(f"\n✅ Generated {len(large['qa_pairs'])} Q&A pairs!")
print("\n💡 You can now curate this larger dataset with:")
print("   synthetic-data-kit -c cerebras_config.yaml curate data/generated/large_dataset.json")

🎯 Generating 100 Q&A pairs...

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Generating document summary...
Generating qa content from data/parsed/llama3_paper.txt...
INFO:synthetic_data_kit.models.llm_client:Sending request to api-en

## 🧪 Experiment 3: Different Chunking Strategies

In [None]:
import json

# Small chunks (more granular)
print("📏 Testing small chunks (2000 chars)...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 20 \
  --chunk-size 2000 \
  --chunk-overlap 100 \
  -o data/generated/small_chunks.json

# Large chunks (more context)
print("\n📏 Testing large chunks (6000 chars)...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 20 \
  --chunk-size 6000 \
  --chunk-overlap 300 \
  -o data/generated/large_chunks.json

# Compare questions
with open('data/generated/small_chunks.json') as f:
    small = json.load(f)
with open('data/generated/large_chunks.json') as f:
    large = json.load(f)

print("\n" + "="*60)
print("📊 CHUNKING COMPARISON")
print("="*60)

print("\n🔬 Small Chunks (2000 chars) - Sample Question:")
print(f"   {small['qa_pairs'][0]['question']}")

print("\n📚 Large Chunks (6000 chars) - Sample Question:")
print(f"   {large['qa_pairs'][0]['question']}")

print("\n💡 Small chunks = more specific questions")
print("💡 Large chunks = more context-aware questions")
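
To see what `--chunk-size` and `--chunk-overlap` trade off, here is a minimal sketch of fixed-size character chunking with overlap. It illustrates the general idea only. The toolkit's own chunker may split differently (for example on paragraph boundaries), and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of up to chunk_size chars, each sharing `overlap` chars with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap  # how far the window advances each time
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

Under this scheme a 10,000-character text yields 6 chunks at 2000/100 but only 2 at 6000/300, so small chunks mean more, narrower prompts per document.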

## 🧠 Experiment 4: Chain-of-Thought Enhancement

**Advanced:** Add reasoning steps to your Q&A pairs using custom CoT prompts!

In [None]:
import os

# Step 1: Create CoT config with custom enhancement prompt
cot_config = f"""llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "https://api.cerebras.ai/v1"
  api_key: "{os.environ.get('CEREBRAS_API_KEY')}"
  model: "llama3.3-70b"

generation:
  temperature: 0.2
  max_tokens: 8192

prompts:
  cot_enhancement: |
    You are enhancing Q&A conversations by adding step-by-step reasoning.

    For each assistant response, add detailed reasoning BEFORE the answer:

    Transform:
    Q: "What is Llama 3's context length?"
    A: "128K tokens"

    Into:
    Q: "What is Llama 3's context length?"
    A: "Let me break this down:
    Step 1: Looking at the architecture section...
    Step 2: The paper states...
    Therefore: Llama 3 supports 128K tokens"

    Enhance these conversations:
    {{{{conversations}}}}
"""

with open('cot_config.yaml', 'w') as f:
    f.write(cot_config)

print("✅ CoT config created with custom enhancement prompt!\n")

# Step 2: Generate simple Q&A
print("📝 Generating 10 simple Q&A pairs...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 10 \
  -o data/generated/simple_for_cot.json

# Step 3: Add reasoning
print("\n🧠 Adding Chain-of-Thought reasoning...\n")
!synthetic-data-kit -c cot_config.yaml \
  create data/generated/simple_for_cot.json \
  --type cot-enhance \
  -o data/generated/with_reasoning.json \
  --verbose

print("\n✅ Chain-of-Thought enhancement complete!")
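
One detail in the config cell above is worth checking: inside an f-string, `{{` is an escape for a single literal `{`, so the four braces around `conversations` end up as two braces in `cot_config.yaml`. The snippet below demonstrates the Python behavior only. Whether the toolkit expects `{conversations}` or `{{conversations}}` in a custom prompt is not confirmed in this notebook, so compare the written file against the placeholders in the toolkit's default config.

```python
model = "llama3.3-70b"

# What the f-string writes to the YAML file: four braces collapse to two.
written = f'model: "{model}"\n{{{{conversations}}}}'
print(written)

# If a later str.format() pass runs over that text, two braces collapse to one
# literal brace pair, and nothing is substituted for the placeholder.
after_format = "{{conversations}}".format(conversations="<chat>")
print(after_format)
```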

In [None]:
import json

# Compare before and after
with open('data/generated/simple_for_cot.json') as f:
    before = json.load(f)
with open('data/generated/with_reasoning.json') as f:
    after = json.load(f)

print("="*60)
print("🔍 CHAIN-OF-THOUGHT COMPARISON")
print("="*60)

# Get first Q&A from conversations
before_conv = before['qa_pairs'][0]
after_conv = after[0]['conversations'] if isinstance(after, list) else after['conversations'][0]

print("\n📝 BEFORE (Simple answer):")
print(f"Q: {before_conv['question']}")
print(f"A: {before_conv['answer'][:150]}...")

print("\n" + "-"*60)

print("\n🧠 AFTER (With reasoning):")
for msg in after_conv:
    if msg['role'] == 'user':
        print(f"Q: {msg['content']}")
    elif msg['role'] == 'assistant':
        print(f"A: {msg['content'][:300]}...")

print("\n" + "="*60)
print("✨ Notice the step-by-step reasoning in the enhanced version!")
print("="*60)

---

# 📥 Download Your Dataset

Download the files to your local machine:

In [None]:
# Create a zip file with all outputs
!zip -r llama3_dataset.zip data/final/

print("✅ Dataset packaged!")
print("\n📦 Download 'llama3_dataset.zip' from the Files panel (left sidebar)")
print("   Or run this cell and click the download link below:")

from google.colab import files
files.download('llama3_dataset.zip')
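
Before sharing the archive, it can help to confirm what actually went into it. The sketch below uses the standard-library `zipfile` module to run an integrity check and list each file with its uncompressed size. `summarize_zip` is a hypothetical helper name.

```python
import zipfile


def summarize_zip(path):
    """Return {filename: uncompressed size in bytes} for every file in the archive."""
    with zipfile.ZipFile(path) as zf:
        corrupt = zf.testzip()  # name of the first corrupt member, or None
        if corrupt is not None:
            raise ValueError(f"corrupt member in archive: {corrupt}")
        return {
            info.filename: info.file_size
            for info in zf.infolist()
            if not info.is_dir()
        }
```

For example, `summarize_zip("llama3_dataset.zip")` should list the files from `data/final/`.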

---

# 🎓 Workshop Complete!

## What You Accomplished:

✅ **Parsed** a research paper automatically (to .txt format)  
✅ **Generated** 50+ Q&A pairs using Cerebras with custom prompts  
✅ **Filtered** for quality using LLM-as-judge with custom rating criteria  
✅ **Exported** to multiple training formats  
✅ **Learned** advanced features (CoT, chunking, thresholds, custom prompts)  

## 🚀 Next Steps:

1. **Try your own PDFs** - Upload any research paper or document
2. **Customize prompts** - Edit the prompts in the config for your domain
3. **Adjust parameters** - Experiment with thresholds, chunk sizes, etc.
4. **Fine-tune a model** - Use your dataset with Transformers/Axolotl
5. **Scale up** - Process entire directories of documents

## 📚 Resources:

- **Toolkit:** https://github.com/meta-llama/synthetic-data-kit
- **Cerebras API:** https://cerebras.ai/
- **Documentation:** Check the toolkit README for advanced features

---

**🎉 Happy Dataset Building!**