Generate compact-but-sufficient prompt→response pairs by prompting your teacher model (via Ollama) with style instructions; the tool collects, tokenizes, and filters the responses, saves them as JSONL (or Arrow), and provides a CLI parameter to target the dataset size to 3B / 8B / 14B student models for LoRA training.
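The teacher round-trip described above can be sketched roughly as follows. This is a minimal, stdlib-only sketch, not the script's actual implementation: the script itself uses `requests`, the `model` name is a placeholder for whichever teacher you pulled, and the `prompt`/`response` record fields are assumptions; `/api/generate` with `"stream": false` is Ollama's standard non-streaming completion endpoint.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default; overridable via OLLAMA_URL / --teacher-url


def query_teacher(prompt: str, style: str, model: str = "llama3") -> str:
    """One non-streaming round-trip to Ollama's /api/generate endpoint.

    The style instruction is prepended to every prompt so the teacher
    answers in the requested voice.
    """
    payload = json.dumps({
        "model": model,
        "prompt": f"{style}\n\n{prompt}",
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def to_jsonl_record(prompt: str, response: str) -> str:
    """Serialize one prompt→response pair as a single JSONL line."""
    return json.dumps({"prompt": prompt, "response": response}, ensure_ascii=False)
```

Each accepted pair becomes one line in the output file, which is what makes streaming collection and later filtering cheap.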
- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  (Ensure `requests`, `transformers`, `tokenizers`, `datasets`, `fuzzywuzzy`, `python-Levenshtein`, and `tqdm` are installed.)

- Make sure you have Ollama running. If using a remote host or a different port, set the `OLLAMA_URL` environment variable or pass `--teacher-url`.
Generate a dataset targeted to 3B models (~2M tokens):

```shell
python generate_dataset.py generate \
  --target 3b \
  --style "Always answer as a high medieval English aristocrat, formal and archaic." \
  --teacher-url "http://localhost:11434" \
  --out-file data/style_3b.jsonl
```

Generate by explicit token budget:

```shell
python generate_dataset.py generate \
  --target-tokens 2000000 \
  --batch-size 8 \
  --out-file data/style_bytokens.jsonl
```

Show dataset stats:

```shell
python generate_dataset.py stats --in-file data/style_3b.jsonl
```

Convert the JSONL dataset to Apache Arrow format:

```shell
python generate_dataset.py convert \
  --in-file data/style_3b.jsonl \
  --out-file data/style_3b.arrow
```

Create a small holdout evaluation set:

```shell
python generate_dataset.py sample \
  --in-file data/style_3b.jsonl \
  --n 200
```