gowtham0992/picochat

Picochat

Build small language models without hiding the evidence.

Picochat is an honest SLM training factory: dataset import, tokenizer training, base pretraining, SFT, optional DPO, eval, serving, and release gates in one inspectable repo.

Product Page · Training Paths · Model Evidence · Pipeline Guide · 100M Runbook · 1B Runbook · Benchmarks · Registry · Release Gates · Honesty Checks · Deploy

License: MIT · Python 3.10+ · H100/H200 ready · Status: research factory

[Screenshot: Picochat workbench release readiness]

What is Picochat?

Picochat is a from-scratch pipeline for building, checking, and releasing small language models without hiding data leakage, memorization, weak evals, or GPU-wasting launch mistakes.

It is inspired by Andrej Karpathy's nanochat, but the product goal is different. Picochat is not trying to claim frontier behavior from a tiny run. It is trying to make the whole small-model factory inspectable:

dataset -> tokenizer -> base pretraining -> chat SFT -> optional DPO -> eval -> release gate

Picochat now exposes two training entry points:

  • Train from scratch: picochat run tiny builds a Picochat-native model from dataset pack through tokenizer, base pretraining, SFT, eval, and release gates.
  • Fine-tune an existing model: picochat train hf-sft starts from an existing Hugging Face causal LM such as SmolLM and trains on Picochat chat JSONL or multi-turn messages rows with assistant-only labels. This path is useful for hackathons and Qwen/SmolLM-style LoRA experiments.
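
To illustrate what "assistant-only labels" means for multi-turn rows, here is a minimal sketch. It is not Picochat's implementation: the `messages` row shape, the `toy_tokenize` helper, and the function name are assumptions for illustration. The only fixed convention is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying that label contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer ID per character."""
    return [ord(ch) for ch in text]


def build_assistant_only_labels(messages, tokenize=toy_tokenize):
    """Return (input_ids, labels) where only assistant tokens keep a label."""
    input_ids, labels = [], []
    for message in messages:
        ids = tokenize(message["content"])
        input_ids.extend(ids)
        if message["role"] == "assistant":
            labels.extend(ids)  # the model is trained to produce these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out of the loss
    return input_ids, labels


# Assumed row shape: a list of role/content messages.
row = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Yo"},
    ]
}
input_ids, labels = build_assistant_only_labels(row["messages"])
print(input_ids)  # [72, 105, 89, 111]
print(labels)     # [-100, -100, 89, 111]
```

A real pipeline would also insert chat-template tokens between turns and shift labels by one position for next-token prediction; both are omitted here to keep the masking idea visible.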

See the Training Paths page before choosing a GPU workflow.

Current Status

Picochat is a research-grade training harness and local workbench. The current recommended public proof path is a one-GPU h100-100m run on a bounded SmolLM-Corpus pack (fineweb-edu-dedup plus cosmopedia-v2) with explicit sanity, preflight, dry-run, release-skills SFT/eval, and evidence-bundle steps. The 1B-class h200-1b-ddp8 path remains prepared for a later 8xH100/H200 run.

Public model evidence is still pending. Picochat should not be judged by a claimed 1B result until a trained model, release card, model card, benchmark report, and contamination report are published together. The current evidence plan is tracked in the Model Evidence page.

| Artifact | Public status | Release rule |
| --- | --- | --- |
| Local tiny demo | Ready | Smoke test only; not a model-quality claim |
| 100M H100/H200 public proof | Runbook ready | Publish only with eval, samples, gate report, and honesty report |
| 1B h200-1b-ddp8 run | Prepared, not claimed | Publish only after preflight, DDP dry run, full eval, external benchmark, and release gate |
| Hugging Face model | Pending | Must include model card, release manifest, and honesty evidence |

What is ready:

  • 1B-class decoder-only GPT stack: RoPE, RMSNorm, SwiGLU, GQA, QK norm, tied embeddings, parallel residual, scaled residual init, BF16, torch.compile, gradient checkpointing, and CUDA/DDP.
  • ClimbMix import with corpus manifests, document-boundary checks, and sharded token loaders.
  • Release-oriented SFT/eval packs for identity, refusal, choice, arithmetic, and spelling.
  • Preflight checks that block unsafe or dishonest long runs before training.
  • Post-run gates that block release when SFT fit, held-out fit, visible eval, external benchmarks, prompt echo, refusal behavior, or honesty checks fail.
  • A local web dashboard with release readiness, loss curves, preflight output, Scale Up commands, paid-GPU confirmation, and DDP dry-run commands.
  • Native PyTorch serving through picochat serve, including local OpenAI-compatible /v1/completions, /v1/chat/completions, and /v1/models endpoints plus stream=true SSE response framing for smoke integrations.
  • Optional post-SFT DPO through picochat train dpo for curated preference pairs when teams have real chosen/rejected examples.
  • Dockerized local workbench and serving smoke paths for reproducible demos.
  • HF-style export with model card, release manifest, Transformers trust_remote_code adapter, optional safetensors, and optional --push-to-hub publishing.
  • Public benchmark protocol, CI, and contribution templates for external review.

What is not claimed:

  • Picochat is not a production assistant.
  • Picochat is not RAG.
  • Picochat does not claim a useful 1B model before the 1B run and gates pass.
  • Synthetic SFT is behavior-focused; it does not magically create knowledge the base model never learned.

Why Picochat Exists

Small-model projects often fail in predictable ways: eval prompts leak into SFT, validation text overlaps training data, tiny corpora are replayed hundreds of times, losses are shown without context, checkpoints corrupt on crash, and large GPU launches start before anyone has run a real preflight.

Picochat treats those as product problems, not afterthoughts.

The factory is built around four principles:

  1. Train visibly. Every stage writes artifacts that can be inspected and compared.
  2. Gate honestly. Preflight and release checks can block the run or block release.
  3. Separate practice from scoring. SFT rows are practice; eval rows are the scoreboard.
  4. Protect GPU spend. Scale-up commands include sanity checks, preflight, a short DDP dry run, and explicit paid-launch confirmation.
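
The third principle can be illustrated with a small overlap check between practice rows and scoreboard rows. This is a sketch under stated assumptions, not Picochat's contamination checker: the function names and the normalization rule (lowercase, alphanumeric runs only) are chosen for illustration.

```python
import re


def normalize(text):
    """Lowercase and keep only alphanumeric runs so near-identical prompts match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_leaked_prompts(sft_prompts, eval_prompts):
    """Return eval prompts whose normalized form also appears in the SFT set."""
    seen = {normalize(prompt) for prompt in sft_prompts}
    return [prompt for prompt in eval_prompts if normalize(prompt) in seen]


# Made-up example rows, not taken from any Picochat pack.
sft = ["What is 2 + 2?", "Spell 'cat'."]
evals = ["what is 2+2", "Spell 'dog'."]
leaked = find_leaked_prompts(sft, evals)
print(leaked)  # ['what is 2+2']
```

Exact matching after normalization only catches the simplest leaks. Paraphrases and partial overlaps need n-gram or fuzzy comparison, which this sketch does not attempt.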

Picochat's paid-run path is still deliberately conservative: the 1B release recipe uses DDP, while experimental FSDP is exposed for base-training smoke tests before it graduates into the full factory.

On GPU hosts, picochat sanity preh100 --capacity-scale h200-1b-ddp8 can also run a one-batch memory check for the exact scale before training starts.

Quick Start

Install locally:

git clone https://github.com/gowtham0992/picochat.git
cd picochat
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,hf]"

The installed command is picochat. A shorter pico alias is also provided, but picochat avoids colliding with the system pico editor when the virtual environment is not active.

Run the tiny demo:

picochat demo

Open the workbench:

picochat web --runs-dir runs --port 8765

Then visit:

http://127.0.0.1:8765

Or start the workbench with Docker:

docker compose up --build picochat-web

Serve a trained checkpoint through a local OpenAI-compatible API:

export PICOCHAT_API_KEY="replace-me"

picochat serve \
  --checkpoint runs/pico-demo/sft/checkpoint \
  --tokenizer runs/pico-demo/tokenizer.json \
  --host 127.0.0.1 \
  --port 8000 \
  --api-key-env PICOCHAT_API_KEY

Then call:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'content-type: application/json' \
  -H "authorization: Bearer $PICOCHAT_API_KEY" \
  -d '{"model":"picochat","messages":[{"role":"user","content":"What is Picochat?"}],"max_tokens":80}'

This native server is for local smoke tests and integration work. High-throughput production serving through vLLM, TGI, TensorRT-LLM, or llama.cpp remains future adapter work.
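
The same request can be built from Python with only the standard library. This sketch mirrors the curl call above; the helper name `build_chat_request` is hypothetical, and the response shape is left unparsed because it depends on the server.

```python
import json
import os
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8000", max_tokens=80):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "picochat",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    # Reads the same environment variable the server was started with.
    api_key = os.environ.get("PICOCHAT_API_KEY", "")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request("What is Picochat?")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.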

Run optional DPO after SFT when you have real preference pairs:

picochat data preference-starter \
  --input runs/pico-demo/chat_benchmark.jsonl \
  --out data/preferences.jsonl

picochat train dpo \
  --input data/preferences.jsonl \
  --tokenizer runs/pico-demo/tokenizer.json \
  --checkpoint runs/pico-demo/sft/checkpoint \
  --out-dir runs/pico-demo/dpo \
  --learning-rate 0.000005 \
  --beta 0.1

Or wire DPO into the end-to-end factory so fit/eval/release gates score the post-DPO checkpoint:

picochat run tiny \
  --out-dir runs/pico-demo \
  --dataset-pack runs/pack/dataset_pack.json \
  --dpo-input data/preferences.jsonl \
  --dpo-steps 200

Preference rows are JSONL with user or prompt, chosen, and rejected fields. DPO improves preference alignment after SFT; it does not replace base pretraining, SFT coverage, or the release gates. The preference starter uses synthetic negatives for plumbing and smoke tests; release alignment needs human or judge-reviewed preference pairs.
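
A quick way to sanity-check a preference file before training is to parse it against the field names described above. The sketch below is illustrative rather than Picochat's loader: the function name is hypothetical, and rejecting rows where `chosen` equals `rejected` is an added assumption, since such a pair carries no preference signal.

```python
import json


def parse_preference_rows(lines):
    """Parse JSONL lines into (prompt, chosen, rejected) tuples; raise on bad rows."""
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        prompt = record.get("user") or record.get("prompt")
        chosen = record.get("chosen")
        rejected = record.get("rejected")
        if not prompt or not chosen or not rejected:
            raise ValueError(f"line {number}: need user/prompt, chosen, and rejected")
        if chosen == rejected:
            # Assumption: identical answers are treated as an error.
            raise ValueError(f"line {number}: chosen and rejected are identical")
        rows.append((prompt, chosen, rejected))
    return rows


# Made-up sample rows using both accepted prompt field names.
sample = [
    '{"user": "What is Picochat?", "chosen": "A small-model training factory.", "rejected": "A frontier assistant."}',
    '{"prompt": "2 + 2?", "chosen": "4", "rejected": "5"}',
]
rows = parse_preference_rows(sample)
print(len(rows))  # 2
```

For a file on disk, pass the open file handle directly, for example `parse_preference_rows(open("data/preferences.jsonl", encoding="utf-8"))`.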

Build a model registry from completed runs:

picochat registry --runs-dir runs \
  --out reports/model_registry.md \
  --json-out reports/model_registry.json

Write a standard lm-eval-harness benchmark command for an HF export:

picochat eval lm-harness \
  --model-path exports/picochat-run \
  --tasks arc_easy,hellaswag \
  --out-dir reports/picochat-run/lm_eval \
  --device cuda:0 \
  --dry-run

8xH100/H200 Path

The 1B-class path is intentionally gated. The short version is:

setup -> sanity -> ClimbMix import -> release skills pack
  -> preflight -> 100-step DDP dry run -> full run -> SFT/eval -> release gate

Read the 1B Runbook before spending GPU money.

The current h200-1b-ddp8 scale targets about 1.12B parameters and 22.4B planned training tokens, roughly 20 tokens per parameter.
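
The tokens-per-parameter figure is simple arithmetic on the numbers stated in this README. The sketch below recomputes it for both named scales; the dictionary layout and function name are illustrative only.

```python
def tokens_per_parameter(planned_tokens, parameters):
    """Ratio of planned training tokens to model parameters."""
    return planned_tokens / parameters


# Approximate figures as stated in this README.
scales = {
    "h200-1b-ddp8": {"parameters": 1.12e9, "planned_tokens": 22.4e9},
    "h100-100m": {"parameters": 107e6, "planned_tokens": 2.16e9},
}

for name, scale in scales.items():
    ratio = tokens_per_parameter(scale["planned_tokens"], scale["parameters"])
    print(f"{name}: {ratio:.1f} tokens per parameter")
# h200-1b-ddp8: 20.0 tokens per parameter
# h100-100m: 20.2 tokens per parameter
```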

One-GPU 100M Public Proof

For the next paid run, use the 100M runbook before attempting 1B:

  • 100M public proof runbook
  • Dataset: bounded SmolLM-Corpus local pack.
  • Scale: h100-100m, about 107M parameters and 2.16B planned tokens.
  • Gate: skill_release, with identity, refusal, choice, arithmetic, and spelling eval coverage.

This path is intentionally smaller, cheaper, and easier to publish honestly than the 1B run. It is the first model-quality proof target.

Release Readiness

Picochat does not treat a completed run as a release. A run can finish training and still be blocked.

The release gate checks:

  • preflight status
  • token/parameter budget and corpus replay risk
  • SFT fit and held-out SFT fit
  • visible eval pass rate
  • per-skill thresholds for identity, refusal, choice, arithmetic, and spelling
  • external benchmark presence
  • prompt echo and refusal behavior
  • corpus/SFT/eval contamination signals
  • data honesty report issues
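
The idea that a finished run can still be blocked can be sketched as an all-checks-must-pass aggregation. This is a simplified illustration, not Picochat's gate: the `GateCheck` type, the check names, and the detail strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateCheck:
    """One release-gate check result (hypothetical structure)."""
    name: str
    passed: bool
    detail: str = ""


def evaluate_release_gate(checks):
    """Return (ready, failures); ready is True only when every check passed."""
    failures = [check for check in checks if not check.passed]
    return (not failures, failures)


# Made-up results for a run that finished training but lacks a benchmark report.
checks = [
    GateCheck("preflight", True),
    GateCheck("visible_eval_pass_rate", True, "above threshold"),
    GateCheck("external_benchmark_present", False, "no benchmark report found"),
]
ready, failures = evaluate_release_gate(checks)
print(ready)  # False
for check in failures:
    print(f"BLOCKED by {check.name}: {check.detail}")
# BLOCKED by external_benchmark_present: no benchmark report found
```

Returning the list of failures alongside the verdict keeps the gate inspectable: a blocked release always names the checks that blocked it.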

[Screenshot: release gate preview]

Workbench

The local dashboard reads real run artifacts. It does not display fake training progress.

Key stations:

  • Dataset Bay: corpus preview, import, pack generation, tuning inspection, launch preflight, and CLI command preview.
  • Tokenizer Lab: text-to-token-ID inspection.
  • Training Dash: base/SFT loss and BPB curves with lower-is-better context.
  • Eval Scoreboard: pass/fail rows, failure causes, prompt echo, and repair guidance.
  • Release Readiness: the post-run gate in both beginner and research modes.
  • Scale Up: remote setup, sanity, import, benchmark, preflight, DDP dry run, full train, bundle, and return commands.
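
For readers new to the BPB curves on the Training Dash, the sketch below shows the usual conversion, assuming BPB means bits per byte and the logged loss is mean cross-entropy in nats per token. Whether Picochat computes it exactly this way is an assumption; the function name is illustrative.

```python
import math


def bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    """Convert mean per-token loss (nats) into bits per byte of raw text."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes


# Worked example: 4 bits per token over 100 tokens covering 400 bytes of text.
bpb = bits_per_byte(mean_loss_nats=4 * math.log(2), num_tokens=100, num_bytes=400)
print(round(bpb, 6))  # 1.0
```

Because BPB normalizes by bytes rather than tokens, it stays comparable across tokenizers with different vocabulary sizes, which per-token loss does not. Lower is better for both.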

[Screenshot: Training Dash loss curves]

[Screenshot: Scale Up commands]

Documentation

To publish the product page with GitHub Pages, set the repository Pages source to the docs/ folder on the develop or main branch.

Development

Run tests:

pytest -q

Run only web/dashboard checks:

pytest tests/test_web.py -q

See CONTRIBUTING.md for PR standards and release-evidence expectations.

Optional TensorBoard logging:

python -m pip install -e ".[monitor]"
picochat run tiny \
  --out-dir runs/monitored-smoke \
  --tensorboard-log-dir runs/monitored-smoke/tensorboard
tensorboard --logdir runs/monitored-smoke/tensorboard

License

MIT. See LICENSE.
