fix(ollama): keep_alive=-1 + bump api timeout 10s→30s #234
Merged
Two reliability fixes for the gemma4:e4b setup on prod.

1. OLLAMA_KEEP_ALIVE=-1 in compose env

Default keep_alive is 5min. After idle, model unloads. Reload from cold takes ~50-60s on CPU + Ollama's pre-load memory check goes conservative (says 9.8 GiB needed but only 8.2 GiB available, even though host has 27 GiB free). Both make the API's fire-and-forget distractor call fail intermittently. -1 keeps model resident as long as the container is up, so cold-load only happens on container restart (deploy).

2. Api/appsettings.json Ollama:TimeoutSeconds 10s → 30s

Worker is already 30s. API was 10s: works for warm gemma4 (measured 2.8s for distractor inference) but leaves no margin and breaks on cold-load. Aligning to 30s removes the asymmetry.

Verified on prod (after manual `ollama run --keepalive=24h`):
- ollama ps: gemma4:e4b loaded, UNTIL=24h
- distractor inference: 2.8s for 5 single-word answers
- free -h: 13 GiB used (was 3 GiB), so the model is in RAM

Post-deploy step: `docker compose exec ollama ollama run gemma4:e4b ""` once to trigger first load. Then it stays via keep_alive=-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
User noticed prod RAM showed only 3.1 GiB used despite gemma4:e4b being 9.8 GiB, which means the model unloads after idle. Cold reload takes ~50-60s on CPU, and Ollama's pre-load memory check is conservative ("requires 9.8 GiB but only 8.2 GiB available" even though the host has 27 GiB available). Both make fire-and-forget distractor calls fail intermittently.
What
1. `OLLAMA_KEEP_ALIVE=-1` in compose env. Once the model loads, it stays resident for the lifetime of the container. Cold-load only happens on container restart (i.e. on each deploy), not after every 5-min idle window.
2. Api/appsettings.json `Ollama:TimeoutSeconds` 10s → 30s. The Worker was already at 30s; the API was at 10s. Warm gemma4 distractor inference measured 2.8s (plenty of headroom under 30s), so 10s works when the model is warm but leaves little margin for slower responses and fails on cold-load. Aligning to 30s removes the asymmetry.
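For reference, the values Ollama accepts for `keep_alive` can be summarized in a small helper. This is a minimal sketch based on Ollama's documented semantics (any negative number keeps the model loaded indefinitely, `0` unloads it right after the response, a bare number is seconds, and strings such as `10m` or `24h` are durations). `describe_keep_alive` is a hypothetical name, not part of Ollama or this repo.

```bash
# Hypothetical helper: explain what an OLLAMA_KEEP_ALIVE / keep_alive value means.
describe_keep_alive() {
  case "$1" in
    "")        echo "default (5m idle, then unload)" ;;   # unset: Ollama's 5-minute default
    -*)        echo "forever (until the server stops)" ;; # any negative value, e.g. -1 or -1m
    0)         echo "unload immediately" ;;
    *[!0-9]*)  echo "unload after $1 idle" ;;             # duration string, e.g. 24h
    *)         echo "unload after ${1}s idle" ;;          # bare number means seconds
  esac
}
```

One possible use, assuming the compose service is named `ollama` (as in the commands in this PR) and that `printenv` exists in the image: `describe_keep_alive "$(docker compose exec -T ollama printenv OLLAMA_KEEP_ALIVE)"`.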
Verified on prod
After manual `ollama run --keepalive=24h gemma4:e4b`:
```
$ docker compose exec ollama ollama ps
NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:e4b    c6eb396dbd59    10 GB    100% CPU     4096       24 hours from now
$ free -h
       total   used    free    buff/cache   available
Mem:   30Gi    13Gi    351Mi   18Gi         17Gi
```
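The `ollama ps` check above could be scripted, for example as a post-deploy smoke test. The following is only a sketch: `model_loaded` is a hypothetical helper, and it assumes the model name is the first whitespace-separated field of each row, as in the output shown here.

```bash
# Hypothetical helper: succeed if the given model appears in `ollama ps` output on stdin.
model_loaded() {
  # NR > 1 skips the header row; the NAME column is the first field.
  awk -v model="$1" 'NR > 1 && $1 == model { found = 1 } END { exit !found }'
}
```

It could be used as `docker compose exec -T ollama ollama ps | model_loaded gemma4:e4b && echo resident`, where `-T` disables the pseudo-TTY so the output pipes cleanly.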
Distractor inference timing:
```
$ time ollama run gemma4:e4b "Generate 5 single-word distractors for linearizability..."
consistency, atomicity, serialization, concurrency, visibility
real 0m2.837s
```
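The measured 2.837s can be put next to the timeout values discussed under "What" with simple arithmetic. This sketch uses a hypothetical `headroom_ms` helper and takes 50s as the lower bound of the reported ~50-60s cold reload; a negative result means the call would time out.

```bash
# Hypothetical helper: headroom in ms between a timeout (seconds) and a latency (ms).
headroom_ms() {
  timeout_s=$1
  latency_ms=$2
  echo $(( timeout_s * 1000 - latency_ms ))
}

headroom_ms 10 2837    # warm call, old API timeout: prints 7163
headroom_ms 30 2837    # warm call, new API timeout: prints 27163
headroom_ms 30 50000   # cold load (~50s), new API timeout: prints -20000
```

Under these numbers a cold load would still exceed the 30s timeout, which is why keeping the model resident matters in addition to the timeout bump.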
Post-deploy step
After this merges and auto-deploys, the prod container will be recreated with the new env. The first load then needs to be triggered once:
```bash
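ssh asus   # the remaining commands run on the prod host
cd ~/projects/onlinelib/textstack
# Empty prompt: loads the model and exits without generating anything.
# If this ollama version rejects the unitless value, use --keepalive=-1m or drop the flag.
docker compose exec ollama ollama run gemma4:e4b "" --keepalive=-1
docker compose exec ollama ollama ps   # verify UNTIL=Forever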
```
After that, the `OLLAMA_KEEP_ALIVE=-1` env on the daemon side keeps it resident across all subsequent inference calls.
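As an alternative to `ollama run`, the first load could probably be triggered over Ollama's HTTP API: a request to `/api/generate` with a model name and no prompt loads the model, and `keep_alive: -1` asks for it to stay loaded. The sketch below is not what this PR does; `preload_body`, `preload_model`, and `OLLAMA_URL` are hypothetical names, and it assumes the Ollama port (11434 by default) is reachable from where the command runs, which depends on the compose file.

```bash
# Hypothetical helper: JSON body that loads a model without a prompt and pins it in memory.
preload_body() {
  printf '{"model":"%s","keep_alive":-1}' "$1"
}

# Hypothetical helper: send the preload request.
# OLLAMA_URL is an assumption; 11434 is Ollama's default port,
# but the compose file may not publish it to the host.
preload_model() {
  curl -fsS "${OLLAMA_URL:-http://localhost:11434}/api/generate" \
    -d "$(preload_body "$1")"
}
```

Calling `preload_model gemma4:e4b` would send the request. A per-request `keep_alive` overrides the server default, so with `OLLAMA_KEEP_ALIVE=-1` already set the field is redundant here and only makes the intent explicit.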
🤖 Generated with Claude Code