Tinker-style API for local LoRA fine-tuning of small language models (1B–13B) on a single GPU.
Local Tinker gives you clean, high-level primitives — `forward_backward`, `optim_step`, `sample` — for fine-tuning open-weight LLMs on your own hardware. It handles model loading, LoRA management, gradient accumulation, and inference internally so you can focus on writing training loops.
Inspired by Thinking Machines' Tinker API, but everything runs locally on a single machine instead of a managed cloud cluster.
```mermaid
graph TB
    subgraph "User Code"
        Script["Training Script"]
    end
    subgraph "Local Tinker API"
        SC["ServiceClient"]
        TR["TrainingRun"]
        TC["TrainingClient"]
        SaC["SamplingClient"]
        LF["Loss Functions"]
    end
    subgraph "Backend"
        HF["HuggingFace Transformers"]
        PEFT["PEFT / LoRA"]
        BNB["bitsandbytes / QLoRA"]
        GPU["Local GPU"]
    end
    Script --> SC
    SC -->|create_training_run| TR
    TR -->|training_client| TC
    TR -->|sampling_client| SaC
    TC -->|forward_backward| LF
    TC -->|optim_step| GPU
    SaC -->|sample| GPU
    SC --> HF
    SC --> PEFT
    SC --> BNB
    HF --> GPU
    PEFT --> GPU
```
```mermaid
sequenceDiagram
    participant U as User Script
    participant SC as ServiceClient
    participant TR as TrainingRun
    participant TC as TrainingClient
    participant Sa as SamplingClient
    participant M as Model on GPU
    U->>SC: create ServiceClient
    U->>SC: create_training_run
    SC->>M: Load model + LoRA adapter
    SC-->>U: TrainingRun
    U->>TR: training_client
    TR-->>U: TrainingClient
    U->>TR: sampling_client
    TR-->>U: SamplingClient
    loop Training Steps
        U->>TC: forward_backward
        TC->>M: train, forward, loss, backward
        TC-->>U: ForwardBackwardOutput
        U->>TC: optim_step
        TC->>M: optimizer step + zero grad
        TC-->>U: OptimStepResponse
    end
    U->>Sa: sample
    Sa->>M: eval + generate
    Sa-->>U: SampleResponse
```
```mermaid
graph LR
    subgraph "Phase 1: Accumulate Gradients"
        FB1["forward_backward<br/>micro-batch 1"] --> G["Gradients<br/>accumulate"]
        FB2["forward_backward<br/>micro-batch 2"] --> G
        FB3["forward_backward<br/>micro-batch 3"] --> G
        FB4["forward_backward<br/>micro-batch 4"] --> G
    end
    subgraph "Phase 2: Apply"
        G --> OS["optim_step<br/>Adam optimizer"]
        OS --> ZG["zero_grad"]
    end
```
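The two phases in the diagram rely on a basic autograd identity: gradients summed over micro-batches equal the gradient of the full batch. A framework-free sketch of that arithmetic with a toy squared-error loss (all names and values are hypothetical, not Local Tinker code):

```python
# Toy loss L(w) = sum_i (w*x_i - y_i)^2 over a batch of 4 points,
# computed all at once vs. accumulated over two micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.5

def grad(w, xs, ys):
    # dL/dw = sum 2*x*(w*x - y)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys))

full = grad(w, xs, ys)  # single full-batch backward

# "forward_backward" per micro-batch: gradients accumulate by addition
accumulated = 0.0
for mb_xs, mb_ys in [(xs[:2], ys[:2]), (xs[2:], ys[2:])]:
    accumulated += grad(w, mb_xs, mb_ys)

print(full, accumulated)  # identical: -30.0 -30.0
```

This is why calling `forward_backward` four times before one `optim_step` behaves like a 4x larger batch.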
```mermaid
classDiagram
    class LossFunction {
        <<abstract>>
        +compute(logits, labels) Tensor
    }
    class CrossEntropyLoss {
        +mask_prompt_tokens bool
        +compute(logits, labels) Tensor
    }
    class PPOLoss {
        +clip_range float
        +kl_coeff float
    }
    class GRPOLoss {
        +clip_range float
        +kl_coeff float
    }
    class DPOLoss {
        +beta float
    }
    class CustomLoss {
        +fn Callable
    }
    LossFunction <|-- CrossEntropyLoss
    LossFunction <|-- PPOLoss
    LossFunction <|-- GRPOLoss
    LossFunction <|-- DPOLoss
    LossFunction <|-- CustomLoss
```
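Each concrete class boils down to a `compute(logits, labels) -> scalar`. As a sketch of the cross-entropy case with `-100` prompt masking — a pure-Python stand-in for intuition, not the library's actual implementation:

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over non-masked positions.

    logits: per-position lists of raw scores; labels: target ids.
    Positions labeled ignore_index (prompt tokens) are skipped.
    Illustrative sketch only.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # prompt token: excluded from the loss
        z = max(scores)  # stabilized log-sum-exp
        log_norm = z + math.log(sum(math.exp(s - z) for s in scores))
        total += log_norm - scores[label]  # -log softmax(scores)[label]
        count += 1
    return total / max(count, 1)

loss = masked_cross_entropy(
    logits=[[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    labels=[-100, 1, 0],  # first position is prompt, masked out
)
```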
```mermaid
graph LR
    subgraph "1. Rollout"
        E["Environment<br/>ProblemEnv"] -->|observation| SC["SamplingClient<br/>sample"]
        SC -->|action| E
        E -->|reward| T["Trajectory"]
    end
    subgraph "2. Advantage"
        T --> CA["compute_advantages<br/>GRPO"]
        CA --> D["Training Data"]
    end
    subgraph "3. Policy Update"
        D --> FB["TrainingClient<br/>forward_backward"]
        FB --> OS["TrainingClient<br/>optim_step"]
    end
```
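In the GRPO case, the advantage step in the middle phase normalizes rewards within each group of rollouts sampled for the same prompt. A minimal sketch of that normalization (function name and `eps` are illustrative, not the library's API):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: a_i = (r_i - mean(r)) / (std(r) + eps).

    Rollouts that beat the group average get positive advantage,
    the rest get negative. Sketch of the idea only.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 4 rollouts for one prompt: two correct (reward 1), two wrong (reward 0)
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # ~[+1, -1, +1, -1]
```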
```bash
# Clone and install
git clone https://github.com/josephgec/finetuning.git
cd finetuning/local-tinker

# Using uv (recommended)
uv venv && uv pip install -e ".[dev]"

# Or using pip
pip install -e ".[dev]"
```

Requirements:

- Python 3.10+
- PyTorch 2.1+
- A CUDA-compatible GPU (or MPS on Apple Silicon, or CPU for testing)
```python
from local_tinker import (
    ServiceClient, LoraConfig, SamplingParams, AdamParams,
    ModelInput, Datum,
)
from local_tinker.losses import CrossEntropyLoss

# 1. Create a training run
client = ServiceClient()  # auto-detects GPU
run = client.create_training_run(
    model="meta-llama/Llama-3.2-1B-Instruct",
    lora_config=LoraConfig(rank=16, alpha=32, target_modules=["q_proj", "v_proj"]),
)
tc = run.training_client()
sc = run.sampling_client()

# 2. Train (one SFT step)
text = "The capital of France is Paris."
encoded = run.tokenizer(text, return_tensors="pt")
ids = encoded.input_ids[0].tolist()
result = tc.forward_backward(
    data=[Datum(input_ids=ids, labels=ids)],
    loss_fn=CrossEntropyLoss(),
)
print(f"Loss: {result.loss:.4f}")
tc.optim_step(AdamParams(lr=1e-4))

# 3. Sample
output = sc.sample(
    prompt=ModelInput.from_text("The capital of France is", run.tokenizer),
    params=SamplingParams(max_tokens=20, temperature=0.7),
)
print(f"Generated: {output.text}")
```

`ServiceClient` is the entrypoint. It creates training runs by loading a model with LoRA adapters.
```python
client = ServiceClient(device="auto")  # "cuda", "mps", "cpu", or "auto"
run = client.create_training_run(
    model="meta-llama/Llama-3.2-3B-Instruct",
    lora_config=LoraConfig(rank=16, alpha=32),
    quantize=True,  # 4-bit QLoRA — fits larger models on smaller GPUs
)
```

`TrainingClient` wraps the model for training. It uses a two-phase design: `forward_backward` accumulates gradients, `optim_step` applies them.
```python
tc = run.training_client()

# Gradient accumulation: call forward_backward multiple times
for micro_batch in split(batch, chunks=4):
    tc.forward_backward(micro_batch, loss_fn=CrossEntropyLoss())

# Then apply all accumulated gradients at once
tc.optim_step(AdamParams(lr=2e-4))
```

| Method | Description |
|---|---|
| `forward_backward(data, loss_fn)` | Forward + backward pass. Accumulates gradients. |
| `optim_step(params)` | Applies gradients with Adam, then zeros them. |
| `get_step()` | Returns current training step count. |
| `save_weights(path)` | Saves LoRA adapter weights to disk. |
| `load_weights(path)` | Loads LoRA adapter weights from disk. |
`SamplingClient` wraps the model for inference. It shares the same model instance as `TrainingClient` — no weight syncing needed.
```python
sc = run.sampling_client()
response = sc.sample(
    prompt=ModelInput.from_text("Explain quantum computing:", tokenizer),
    params=SamplingParams(max_tokens=256, temperature=0.7, top_p=0.9),
)
print(response.text)
print(response.log_probs)  # per-token log probabilities
```

| Method | Description |
|---|---|
| `sample(prompt, params)` | Generate a single completion. |
| `batch_sample(prompts, params)` | Generate completions for multiple prompts. |
Loss functions are objects passed to `forward_backward`. They receive model logits and return a scalar loss.

```python
from local_tinker.losses import CrossEntropyLoss

loss_fn = CrossEntropyLoss(mask_prompt_tokens=True)
result = tc.forward_backward(data, loss_fn)
```

| Loss | Description |
|---|---|
| `CrossEntropyLoss` | Standard next-token prediction (SFT). Supports prompt masking via `-100` labels. |
| `DPOLoss` | Direct Preference Optimization — chosen vs. rejected pairs with reference model. |
| `PPOLoss` | PPO clipped surrogate objective with optional KL penalty. |
| `GRPOLoss` | Group Relative Policy Optimization — normalizes rewards within groups. |
| `CustomLoss` | Wraps any `callable(logits, labels) -> scalar` as a loss function. |
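For intuition on the `DPOLoss` row, the standard DPO objective can be evaluated by hand. A hedged sketch using the textbook formula, with made-up sequence log-probabilities as inputs (not necessarily the library's exact internals):

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Textbook DPO: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected).
    Inputs are sequence-level log-probabilities; values below are illustrative.
    """
    margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does,
# so the margin is positive and the loss is below log(2) ~ 0.693:
loss = dpo_loss(pol_chosen=-5.0, pol_rejected=-9.0,
                ref_chosen=-6.0, ref_rejected=-8.0, beta=0.1)
```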
All types use Pydantic v2 `BaseModel` for validation and serialization.

```python
from local_tinker import (
    LoraConfig,            # rank, alpha, target_modules, dropout
    SamplingParams,        # max_tokens, temperature, top_p, top_k, stop
    AdamParams,            # lr, betas, weight_decay, eps
    Datum,                 # input_ids, labels, attention_mask
    ModelInput,            # .from_text(str, tokenizer) / .from_ids(list[int])
    ForwardBackwardOutput, # loss, num_tokens, grad_norm
    OptimStepResponse,     # step, lr
    SampleResponse,        # tokens, text, log_probs
)
```

| Field | Type | Default | Description |
|---|---|---|---|
| `rank` | `int` | `16` | LoRA rank (r) |
| `alpha` | `float` | `32.0` | LoRA alpha scaling |
| `target_modules` | `list[str]` | `["q_proj", "v_proj"]` | Modules to attach LoRA to |
| `dropout` | `float` | `0.05` | LoRA dropout rate |
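`rank` and `alpha` interact: in the usual LoRA formulation the adapter's contribution is scaled by `alpha / rank` before being added to the frozen weight, i.e. delta-W = (alpha / r) * B @ A. A toy sketch of that arithmetic (matrices and shapes are illustrative):

```python
def lora_delta(A, B, rank, alpha):
    """Delta-W = (alpha / rank) * B @ A, the update LoRA adds on top of a
    frozen weight. A and B are nested lists; B is (out, r), A is (r, in).
    Pure-Python sketch for intuition.
    """
    scale = alpha / rank
    out_dim, in_dim = len(B), len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(in_dim)] for i in range(out_dim)]

# With rank=2, alpha=32 the delta is scaled by 16; identity factors make
# the scaling easy to see:
A = [[1.0, 0.0], [0.0, 1.0]]  # (r=2, in=2)
B = [[1.0, 0.0], [0.0, 1.0]]  # (out=2, r=2)
delta = lora_delta(A, B, rank=2, alpha=32.0)  # 16 * identity
```

With the defaults above (`rank=16`, `alpha=32.0`) the effective scale is 2.0.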
| Field | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | `int` | `256` | Maximum tokens to generate |
| `temperature` | `float` | `0.7` | Sampling temperature (0 = greedy) |
| `top_p` | `float` | `0.9` | Nucleus sampling threshold |
| `top_k` | `int` | `50` | Top-k sampling |
| `stop` | `list[str]` | `[]` | Stop sequences |
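For intuition on `top_p`: nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, then renormalizes and samples among them. A pure-Python sketch of the standard rule (not the library's sampler code):

```python
def top_p_filter(probs, top_p=0.9):
    """Return the indices kept by nucleus sampling: the smallest
    highest-probability set whose cumulative probability >= top_p.
    Sketch of the standard definition.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return sorted(kept)

# A peaked distribution: the top two tokens already exceed 0.8
kept = top_p_filter([0.6, 0.3, 0.07, 0.03], top_p=0.8)  # tokens 0 and 1
```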
| Field | Type | Default | Description |
|---|---|---|---|
| `lr` | `float` | `2e-4` | Learning rate |
| `betas` | `tuple[float, float]` | `(0.9, 0.999)` | Adam beta parameters |
| `weight_decay` | `float` | `0.0` | Weight decay |
| `eps` | `float` | `1e-8` | Adam epsilon |
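These fields plug into the textbook Adam update. A single-scalar sketch showing where each one enters (illustrative values, not the library's optimizer code):

```python
def adam_step(w, grad, m, v, t, lr=2e-4, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0):
    """One textbook Adam update for a single scalar parameter w at step t.
    Sketch for intuition about the fields in the table above.
    """
    grad = grad + weight_decay * w            # decoupled L2 term (if any)
    m = betas[0] * m + (1 - betas[0]) * grad  # first-moment EMA
    v = betas[1] * v + (1 - betas[1]) * grad * grad  # second-moment EMA
    m_hat = m / (1 - betas[0] ** t)           # bias correction
    v_hat = v / (1 - betas[1] ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# First step (t=1): bias correction makes the effective step size ~= lr
w, m, v = adam_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```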
Fit larger models on smaller GPUs by enabling 4-bit quantization:
```python
run = client.create_training_run(
    model="meta-llama/Llama-3.1-8B-Instruct",
    lora_config=LoraConfig(rank=16, alpha=32),
    quantize=True,  # 4-bit QLoRA via bitsandbytes
)
```

This uses NF4 quantization with double quantization and a bfloat16 compute dtype for stability.
| Model | Params | Min VRAM (4-bit) | Min VRAM (fp16) | Default LoRA Targets |
|---|---|---|---|---|
| Llama-3.2-1B-Instruct | 1.2B | ~3 GB | ~5 GB | q_proj, v_proj |
| Llama-3.2-3B-Instruct | 3.2B | ~4 GB | ~8 GB | q_proj, v_proj |
| Qwen2.5-3B-Instruct | 3B | ~4 GB | ~8 GB | q_proj, v_proj |
| Phi-3.5-mini-instruct | 3.8B | ~5 GB | ~9 GB | q_proj, v_proj |
| Llama-3.1-8B-Instruct | 8B | ~6 GB | ~17 GB | q_proj, v_proj, k_proj, o_proj |
| Mistral-7B-Instruct-v0.3 | 7.2B | ~6 GB | ~16 GB | q_proj, v_proj |
| Qwen2.5-7B-Instruct | 7B | ~6 GB | ~16 GB | q_proj, v_proj |
| Gemma-2-9B-it | 9.2B | ~7 GB | ~20 GB | q_proj, v_proj |
Any HuggingFace causal LM model works — the table above lists tested configurations.
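The 4-bit column can be sanity-checked with back-of-envelope arithmetic: NF4 stores roughly half a byte per parameter, plus overhead for activations, LoRA parameters, and framework buffers. A rough estimator sketch (the flat `overhead_gb` constant is a guess, not a measured value):

```python
def est_vram_4bit_gb(n_params_billions, overhead_gb=2.0):
    """Very rough 4-bit QLoRA footprint: 0.5 bytes/param for NF4 weights
    plus a flat overhead guess for activations, optimizer state, and
    framework buffers. Illustrative only; real usage varies with batch
    size and sequence length.
    """
    weight_gb = n_params_billions * 1e9 * 0.5 / (1024 ** 3)
    return weight_gb + overhead_gb

est = est_vram_4bit_gb(8)  # in the same ballpark as the ~6 GB listed for 8B
```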
```
local-tinker/
├── pyproject.toml               # Package configuration
├── CLAUDE.md                    # Claude Code instructions
├── src/local_tinker/
│   ├── __init__.py              # Public API exports
│   ├── types.py                 # Pydantic types & config objects
│   ├── config.py                # (reserved)
│   ├── model_registry.py        # Supported model catalog + VRAM estimates
│   ├── service_client.py        # ServiceClient + TrainingRun
│   ├── training_client.py       # TrainingClient (forward_backward, optim_step)
│   ├── sampling_client.py       # SamplingClient (sample, batch_sample)
│   ├── checkpoint.py            # Save/load/list LoRA checkpoints
│   ├── weights.py               # Merge LoRA → full model, export, push to Hub
│   ├── losses/
│   │   ├── base.py              # Abstract LossFunction interface
│   │   ├── cross_entropy.py     # SFT cross-entropy loss
│   │   ├── dpo.py               # Direct Preference Optimization
│   │   ├── ppo.py               # PPO clipped surrogate loss
│   │   ├── grpo.py              # Group Relative Policy Optimization
│   │   └── custom.py            # User-defined scalar loss
│   ├── utils/
│   │   ├── gpu.py               # GPU memory tracking, device selection
│   │   ├── tokenizer.py         # Tokenizer helpers, chat templates
│   │   ├── logging.py           # CSV / W&B / TensorBoard logging
│   │   └── logprobs.py          # Per-token log-probability extraction
│   └── cli/
│       └── main.py              # CLI entrypoint (run, models, info, checkpoint)
├── cookbook/
│   ├── supervised.py            # SupervisedDataset, ChatDatasetBuilder
│   ├── rl.py                    # Env, MessageEnv, ProblemEnv, rollout
│   ├── preference.py            # Comparison, PreferenceDataset
│   ├── renderers.py             # Llama3/Qwen/Mistral chat renderers
│   ├── completers.py            # TokenCompleter, MessageCompleter
│   └── recipes/
│       ├── chat_sft.py          # End-to-end chat SFT
│       ├── math_rl.py           # Math RL with verifier rewards
│       ├── code_rl.py           # Code RL with execution rewards
│       ├── dpo_preference.py    # DPO preference tuning
│       └── distillation.py      # Knowledge distillation
├── examples/
│   ├── hello_tinker.py          # Minimal end-to-end example
│   ├── sft_alpaca.py            # SFT on Alpaca dataset
│   ├── rl_gsm8k.py              # RL on GSM8K math problems
│   └── dpo_ultrafeedback.py     # DPO on preference data
└── tests/                       # 179 tests, high coverage
    ├── test_types.py
    ├── test_losses.py
    ├── test_training_client.py
    ├── test_sampling_client.py
    ├── test_service_client.py
    ├── test_checkpoint.py
    ├── test_weights.py
    ├── test_model_registry.py
    ├── test_utils.py
    ├── test_cookbook.py
    └── test_cli.py
```
```bash
# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=local_tinker --cov-report=term-missing

# Run smoke test (requires GPU + model access)
uv run python examples/hello_tinker.py
```

| Component | Library | Purpose |
|---|---|---|
| Model loading | `transformers` | Load HuggingFace models |
| LoRA adapters | `peft` | Attach, train, save LoRA adapters |
| Quantization | `bitsandbytes` | 4-bit / 8-bit QLoRA |
| Tensor ops | `torch` | Autograd, optimizer, GPU memory |
| Config | `pydantic` v2 | Typed, validated config objects |
| CLI | `typer` | CLI commands (run, models, checkpoint) |
| Packaging | `pyproject.toml` + `uv` | Distribution |
```bash
# List supported models with VRAM requirements
local-tinker models

# Show GPU info and recommended models
local-tinker info

# Run a training script
local-tinker run examples/hello_tinker.py

# Checkpoint management
local-tinker checkpoint list ./checkpoints
local-tinker checkpoint inspect ./checkpoints/step-100
local-tinker checkpoint export ./checkpoints/step-100 --format lora
```

```python
from local_tinker.checkpoint import save_checkpoint, load_checkpoint, list_checkpoints
from local_tinker.weights import merge_and_save, export_lora, push_to_hub

# Save checkpoint (LoRA weights + optimizer + metadata)
save_checkpoint(tc, "./checkpoints/step-100", metadata={"eval_loss": 0.42})

# Load checkpoint
meta = load_checkpoint(tc, "./checkpoints/step-100")

# List all checkpoints
for ckpt in list_checkpoints("./checkpoints"):
    print(f"Step {ckpt.step}: {ckpt.metadata}")

# Auto-checkpointing (on TrainingClient)
tc = run.training_client()
tc._auto_checkpoint_every = 50  # checkpoint every 50 steps
tc._checkpoint_dir = "./checkpoints"
tc._max_checkpoints = 3         # keep only the latest 3

# Merge LoRA into base model and save
merge_and_save(tc, "./merged-model")

# Export just the lightweight LoRA adapter
export_lora(tc, "./lora-adapter")

# Push merged model to HuggingFace Hub
push_to_hub(tc, "my-username/my-model", private=True)
```

The `cookbook/` module provides high-level abstractions for common tasks:
```python
# Supervised fine-tuning
from cookbook.supervised import SupervisedDataset, ChatDatasetBuilder

dataset = SupervisedDataset.from_jsonl("data.jsonl", tokenizer)
for batch in dataset.batch(4):
    tc.forward_backward(batch, CrossEntropyLoss())
    tc.optim_step(AdamParams(lr=2e-4))

# RL training
from cookbook.rl import ProblemEnv, rollout, compute_advantages

env = ProblemEnv(problems, extract_answer_fn=extract_number)
trajectories = [rollout(env, sc, tokenizer) for _ in range(8)]
data = compute_advantages(trajectories, tokenizer, method="grpo")

# DPO preference tuning
from cookbook.preference import PreferenceDataset

dataset = PreferenceDataset.from_jsonl("prefs.jsonl", tokenizer)

# Chat completion
from cookbook.completers import MessageCompleter

completer = MessageCompleter(sc, tokenizer)
response = completer.complete([{"role": "user", "content": "Hello!"}])
```

| Recipe | Script | Description |
|---|---|---|
| Chat SFT | `cookbook/recipes/chat_sft.py` | Fine-tune on chat conversations |
| Math RL | `cookbook/recipes/math_rl.py` | RL on math problems with verifier |
| Code RL | `cookbook/recipes/code_rl.py` | RL on code with execution rewards |
| DPO | `cookbook/recipes/dpo_preference.py` | DPO preference tuning |
| Distillation | `cookbook/recipes/distillation.py` | KL distillation from larger model |
- Tinker-compatible API surface — `ServiceClient` creates runs, `TrainingClient` handles training, `SamplingClient` handles generation.
- Single-GPU simplicity — no distributed training; everything runs on one device.
- LoRA-only — only LoRA/QLoRA fine-tuning, not full-parameter training.
- Two-phase training — `forward_backward` accumulates gradients, `optim_step` applies them. This enables gradient accumulation with zero extra code.
- Shared model instance — training and sampling clients share the same model, so no weight syncing is needed.
- Phase 1: Core primitives (ServiceClient, TrainingClient, SamplingClient, CrossEntropyLoss)
- Phase 2: RL loss functions (PPO, GRPO, DPO, CustomLoss) + reference model support
- Phase 3: Checkpoint management + weight export/merge/push to Hub
- Phase 4: Cookbook (datasets, RL environments, renderers, completers, 5 recipes)
- Phase 5: CLI (run, models, info, checkpoint) + GPU utilities + model registry + logging
- Phase 6: 179 tests across 11 test modules
MIT