SynthGen

SynthGen generates compact but sufficient prompt→response pairs by prompting a teacher model (served through Ollama) with style instructions. It collects, tokenizes, and filters the responses, saves them as JSONL (or Arrow), and exposes a CLI parameter that sets the target dataset size for training a LoRA on a 3B, 8B, or 14B student model.

Setup

Install the dependencies:
```
pip install -r requirements.txt
```
Ensure that requests, transformers, tokenizers, datasets, fuzzywuzzy, python-Levenshtein, and tqdm are installed.

Make sure Ollama is running. If it runs on a remote host or a non-default port, set the OLLAMA_URL environment variable or pass --teacher-url.
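
The two ways of pointing at the teacher can be pictured with a small resolver. This is only a sketch: the function name `resolve_teacher_url` and the precedence (flag first, then OLLAMA_URL, then Ollama's default local address) are assumptions, not confirmed behaviour of generate_dataset.py.

```python
import os

# Ollama listens on this address by default when run locally.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_teacher_url(cli_value=None, env=None):
    """Pick the teacher URL (hypothetical helper).

    Assumed precedence: --teacher-url, then OLLAMA_URL, then the default.
    """
    env = os.environ if env is None else env
    url = cli_value or env.get("OLLAMA_URL") or DEFAULT_OLLAMA_URL
    # Drop a trailing slash so endpoint paths can be appended safely.
    return url.rstrip("/")


if __name__ == "__main__":
    print(resolve_teacher_url())
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.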

Usage

Generate Dataset

Generate a dataset targeted to 3B models (~2M tokens):

python generate_dataset.py generate \
  --target 3b \
  --style "Always answer as a high medieval English aristocrat, formal and archaic." \
  --teacher-url "http://localhost:11434" \
  --out-file data/style_3b.jsonl
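
For orientation, one teacher call might look roughly like the sketch below. It uses Ollama's `/api/generate` endpoint with streaming disabled and sends the style text as the system prompt. How generate_dataset.py actually passes the style is not documented here, so the helper names, the `llama3` model name, and the `prompt`/`response` record fields are all assumptions.

```python
import json
import urllib.request


def build_generate_payload(prompt, style, model="llama3"):
    """Build the JSON body for Ollama's /api/generate (model name is a placeholder)."""
    return {
        "model": model,
        "prompt": prompt,
        "system": style,   # assumption: the --style text is sent as the system prompt
        "stream": False,   # ask for one JSON object instead of a token stream
    }


def ask_teacher(prompt, style, base_url="http://localhost:11434",
                model="llama3", timeout=120):
    """Send one prompt to the teacher and return a prompt/response record."""
    payload = json.dumps(build_generate_payload(prompt, style, model)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With stream=False, the generated text is in the "response" field.
    return {"prompt": prompt, "response": body["response"].strip()}
```

Calling `ask_teacher("Describe your morning.", "Always answer as a high medieval English aristocrat, formal and archaic.")` requires a running Ollama instance with the chosen model pulled.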

Generate by explicit token budget:

python generate_dataset.py generate \
  --target-tokens 2000000 \
  --batch-size 8 \
  --out-file data/style_bytokens.jsonl
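
A token budget amounts to "keep accepting pairs until the running token count reaches the target". The sketch below illustrates that loop together with a simple filter. It is not the script's actual code: whitespace splitting stands in for a real tokenizer, `difflib.SequenceMatcher` stands in for the fuzzywuzzy-based matching the dependency list suggests, and the thresholds and field names are assumptions.

```python
import json
from difflib import SequenceMatcher


def count_tokens(text):
    # Stand-in: whitespace tokens. A real run would use the student model's tokenizer.
    return len(text.split())


def is_near_duplicate(text, kept_texts, threshold=0.9):
    # Stand-in for fuzzy matching: similarity ratio against every kept response.
    return any(SequenceMatcher(None, text, k).ratio() >= threshold for k in kept_texts)


def collect_until_budget(pairs, target_tokens, min_tokens=3):
    """Keep filtered pairs until the running token count reaches target_tokens."""
    kept, kept_responses, total = [], [], 0
    for pair in pairs:
        if total >= target_tokens:
            break
        n = count_tokens(pair["prompt"]) + count_tokens(pair["response"])
        if n < min_tokens or is_near_duplicate(pair["response"], kept_responses):
            continue  # too short, or too close to something already kept
        kept.append(pair)
        kept_responses.append(pair["response"])
        total += n
    return kept, total


def write_jsonl(pairs, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

Because the check happens before each pair is added, the final count can overshoot the budget by at most one pair. The pairwise similarity check is quadratic in the number of kept responses, so a dataset of millions of tokens would need a cheaper scheme such as hashing or blocking.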

Show Statistics

Show dataset stats:

python generate_dataset.py stats --in-file data/style_3b.jsonl
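
The exact figures the stats command reports are not listed here. As a rough illustration of what such a summary could compute, the sketch below counts records and whitespace tokens in a JSONL stream. The `prompt` and `response` field names and the choice of metrics are assumptions.

```python
import json
import statistics


def dataset_stats(lines):
    """Summarize an iterable of JSONL lines (hypothetical helper)."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Whitespace tokens as a cheap proxy for real tokenizer counts.
        lengths.append(len(record["prompt"].split()) + len(record["response"].split()))
    if not lengths:
        return {"records": 0, "total_tokens": 0, "mean_tokens": 0.0, "max_tokens": 0}
    return {
        "records": len(lengths),
        "total_tokens": sum(lengths),
        "mean_tokens": statistics.mean(lengths),
        "max_tokens": max(lengths),
    }


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as f:
        print(dataset_stats(f))
```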

Convert to Arrow

Convert the JSONL dataset to Apache Arrow format:

python generate_dataset.py convert \
  --in-file data/style_3b.jsonl \
  --out-file data/style_3b.arrow

Sample a Validation Set

Create a small holdout evaluation set:

python generate_dataset.py sample \
  --in-file data/style_3b.jsonl \
  --n 200
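
Conceptually, a holdout set is a seeded random sample that is kept out of training. The sketch below shows one way to draw it. Whether the sample command removes the sampled rows from the training file, and whether it accepts a seed, is not stated here, so both behaviours are assumptions.

```python
import random


def split_holdout(records, n, seed=0):
    """Return (holdout, train): n randomly chosen records and the remainder."""
    if n > len(records):
        raise ValueError("n is larger than the dataset")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    holdout_idx = set(rng.sample(range(len(records)), n))
    holdout = [r for i, r in enumerate(records) if i in holdout_idx]
    train = [r for i, r in enumerate(records) if i not in holdout_idx]
    return holdout, train
```

Sampling indices rather than records keeps the original order in both halves and works even when records are not hashable.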
