A terminal-based control panel for managing LLM server stacks on Linux and Windows.
Install, configure, run, and benchmark Ollama, vLLM, LM Studio, and llama.cpp from a single TUI — no manual config files or CLI juggling required.
+-----------------------------------------------------------------------------+
| LLM Manager v0.1.1 [N]Notifs [P]Profile [?]Help |
|-----------------------------------------------------------------------------|
| [1]Dashboard [2]Servers [3]Models [4]Logs [5]Bench [6]Profiles [7]API|
+-----------------------------------------------------------------------------+
| |
| Ollama * Running vLLM - Stopped llama.cpp * Running LM Studio - |
| |
| GPU 0: RTX 4090 VRAM [############........] 12.3/24.0 GB Util 34% |
| CPU [####................] 18% RAM [########....] 18.2/64.0 GB |
+-----------------------------------------------------------------------------+
curl -fsSL https://raw.githubusercontent.com/jonathjan0397/llmmanager/master/install.sh | sh

Then run:

llmmanager

The installer will:
- Detect your distro and install Python 3.11+ if needed
- Install pipx if not present
- Install LLMManager into an isolated environment
- Check for optional GPU tools (nvidia-smi, rocm-smi, xpu-smi) and clipboard support
Requirements: Linux (primary), Windows (supported), Python 3.11+
# Via pipx (recommended — isolated, adds to PATH)
pipx install git+https://github.com/jonathjan0397/llmmanager
# Via pip (into active venv)
pip install git+https://github.com/jonathjan0397/llmmanager
# From source
git clone https://github.com/jonathjan0397/llmmanager
cd llmmanager
pip install -e .

The llama.cpp backend installs llama-cpp-python[server] into its own isolated venv
(~/.llmmanager-venvs/llamacpp/) when you click Install on the Server Management screen.
The installer auto-detects your GPU and sets the right build flags:
| Hardware | Build backend | What LLMManager does |
|---|---|---|
| NVIDIA CUDA | LLAMA_CUDA=on | Detects nvcc in PATH, sets CMAKE_ARGS automatically |
| AMD ROCm | LLAMA_HIP=on | Detects hipcc in PATH |
| Apple Silicon | LLAMA_METAL=on | Detected on macOS/ARM |
| CPU-only | (default) | No extra flags needed |
If you want to manage the build yourself, install llama-cpp-python[server] manually into
~/.llmmanager-venvs/llamacpp/ before hitting Install in the UI — LLMManager will detect it.
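
The detection described in the table above could look roughly like the sketch below. It is not LLMManager's actual code: the function name `detect_cmake_args`, the `-D` prefix on each option, and the order in which toolchains are checked are all assumptions made for illustration. Only the option names and the "look for nvcc / hipcc in PATH" rule come from the table.

```python
import platform
import shutil


def detect_cmake_args(which=shutil.which, system=None, machine=None):
    """Return a CMAKE_ARGS value for the detected hardware ('' means CPU-only).

    Hypothetical helper: `which`, `system` and `machine` are injectable so the
    logic can be exercised without the real toolchains installed.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if which("nvcc"):                                  # CUDA compiler on PATH
        return "-DLLAMA_CUDA=on"
    if which("hipcc"):                                 # ROCm compiler on PATH
        return "-DLLAMA_HIP=on"
    if system == "Darwin" and machine == "arm64":      # Apple Silicon
        return "-DLLAMA_METAL=on"
    return ""                                          # CPU-only: no extra flags


if __name__ == "__main__":
    print("CMAKE_ARGS =", repr(detect_cmake_args()))
```

If you build manually, the same value would typically be exported as the CMAKE_ARGS environment variable before running pip inside the venv.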
| Vendor | Library | Notes |
|---|---|---|
| NVIDIA | pynvml (auto-installed as nvidia-ml-py) | Requires NVIDIA drivers |
| AMD | rocm-smi CLI | Requires ROCm installation |
| Intel | xpu-smi CLI | Requires Intel oneAPI |
| CPU-only | psutil | Always available; shows system RAM |
- Install Ollama, vLLM, and llama.cpp with one click — live output streamed to the TUI
- Start / Stop / Restart servers from the UI
- Configure all server CLI flags with descriptions and type validation
- Pre-flight checks before starting (disk space, port availability, binary present)
- Auto-start toggle per server
- Port conflict detection
- Version management — pin to specific releases
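
As an illustration of the pre-flight checks listed above (binary present, port availability, disk space), here is a minimal standard-library sketch. The function names, the 5 GB default, and the message wording are hypothetical; LLMManager's real checks may differ.

```python
import shutil
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind the port ourselves, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def preflight(binary, port, path=".", min_free_gb=5.0):
    """Return a list of problems; an empty list means the server may start."""
    problems = []
    if shutil.which(binary) is None:
        problems.append(f"binary not found on PATH: {binary}")
    if not port_is_free(port):
        problems.append(f"port {port} is already in use")
    free_gb = shutil.disk_usage(path).free / 1e9       # bytes -> decimal GB
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need {min_free_gb:.1f} GB")
    return problems


if __name__ == "__main__":
    for problem in preflight("ollama", 11434):
        print("pre-flight:", problem)
```

Returning a list rather than raising on the first failure lets a UI show every blocking issue at once.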
- Browse the Ollama library (gemma4, gemma3, llama4, llama3.3, mistral-small, phi4-mini, qwq, deepseek-v3, codegemma, tinyllama, smollm2, command-r, and more) from inside the app
- Browse HuggingFace Hub for GGUF and safetensor models
- Version picker: when a model has multiple tags (e.g. 7b, 13b, 70b), a dialog lets you choose which variant to download before the transfer starts
- Download, load, unload, and delete models
- VRAM estimator — see if a model fits your hardware before downloading
- Hardware compatibility tier per model: Comfortable / Limited / Too Large
- Import local GGUF/safetensors files
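
To make the VRAM estimator and the Comfortable / Limited / Too Large tiers concrete, here is a rough back-of-the-envelope sketch. Everything numeric in it is an assumption: the bytes-per-weight table uses nominal bit widths, the 1.2 overhead factor stands in for KV cache and runtime buffers, and the 80 % cut-off for "Comfortable" is invented for the example. LLMManager's actual estimator is not documented here.

```python
# Nominal bytes per weight; real quantised formats carry a little extra for scales.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8_0": 1.0, "q4_0": 0.5}


def estimate_vram_gb(params_billion, quant="q4_0", overhead=1.2):
    """Weights-only size in GB, padded by a hypothetical overhead factor."""
    return params_billion * BYTES_PER_WEIGHT[quant] * overhead


def tier(estimate_gb, vram_gb, comfortable_ratio=0.8):
    """Classify an estimate against available VRAM (thresholds are assumptions)."""
    if estimate_gb <= vram_gb * comfortable_ratio:
        return "Comfortable"
    if estimate_gb <= vram_gb:
        return "Limited"
    return "Too Large"


if __name__ == "__main__":
    for size in (7, 13, 70):
        need = estimate_vram_gb(size, "q4_0")
        print(f"{size}B q4_0: ~{need:.1f} GB -> {tier(need, vram_gb=24.0)}")
```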
- Multi-model runs — select any number of models with checkboxes; LLMManager runs each one sequentially and then presents a unified comparison
- Benchmark categories:
  - Throughput: sustained tokens/sec
  - Latency: TTFT, p50/p95/p99 per request
  - Memory: actual vs estimated VRAM delta
  - Concurrency ramp: 1 → 2 → 4 → 8 → 16 → 32 → 64 → 128 parallel requests (stops automatically if p99 latency > 30 s or error rate > 10 %)
  - Context scaling: TPS at 1 K / 8 K / 32 K / 128 K token context lengths
  - Quality probes: standardised prompt sets covering coding, reasoning, instruction, and chat
- Benchmark profiles: Quick (~1 min/model), Standard (~5 min/model), Stress (full ramp)
- Comparison charts — bar charts, concurrency/context scaling tables, and Unicode sparklines shown side-by-side for all tested models after a run
- Report export: every completed run is saved as a human-readable plain-text file
  - Location: ~/.local/share/llmmanager/benchmarks/reports/
  - Filename: YYYYMMDD_HHMMSS_model1_model2_model3.txt
  - Contents: per-model summary, full scaling tables, and a side-by-side comparison table
- JSON result files saved to ~/.local/share/llmmanager/benchmarks/ for history and scripting
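
The concurrency ramp and its stop conditions can be sketched as follows. The levels and the two thresholds (p99 latency above 30 s, error rate above 10 %) come from the list above; the `run_level` callable, the nearest-rank percentile method, and the row layout are assumptions for illustration, not LLMManager's implementation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def concurrency_ramp(run_level, levels=(1, 2, 4, 8, 16, 32, 64, 128),
                     max_p99_s=30.0, max_error_rate=0.10):
    """Run each level in turn and stop after the first one that trips a limit.

    `run_level(n)` is a hypothetical callable that fires n parallel requests and
    returns (latencies_in_seconds_of_successes, error_count).
    """
    rows = []
    for level in levels:
        latencies, errors = run_level(level)
        total = len(latencies) + errors
        row = {
            "concurrency": level,
            "p50": percentile(latencies, 50) if latencies else None,
            "p99": percentile(latencies, 99) if latencies else None,
            "error_rate": errors / total if total else 0.0,
        }
        rows.append(row)
        too_slow = row["p99"] is not None and row["p99"] > max_p99_s
        if too_slow or row["error_rate"] > max_error_rate:
            break  # keep the failing level in the results, then stop ramping
    return rows
```

The level that trips a limit is kept in the results so a report can show where the server saturated.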
- Per-server status cards with uptime
- Quick Load widget — dropdown of currently available models with one-click load/unload and a refresh button; no manual typing required
- GPU utilisation and VRAM meters (NVIDIA / AMD / Intel / CPU-only)
- Temperature, power draw, fan speed
- CPU and RAM usage
- Multi-turn conversation with any loaded model on any running server
- Server and model dropdowns populated live from running instances
- Enter key sends the message; Shift+Enter or the Send button also work
- Streaming response rendered in the terminal as tokens arrive
- All active endpoints listed with one-click copy
- Quick inference test — send a prompt, see streaming response and latency
- Save named configuration snapshots (e.g. "Coding", "Chat", "High-throughput")
- Switch between profiles with a single keypress
- Server crash detection with optional auto-restart
- Low VRAM warnings (configurable threshold)
- Download completion alerts
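
A low-VRAM warning with a configurable threshold might reduce to a check like the one below. This sketch assumes the threshold is expressed as the percentage of VRAM still free; whether LLMManager measures free or used memory is not stated here, and the function name is hypothetical.

```python
def low_vram(free_gb, total_gb, threshold_pct=10.0):
    """Return True when free VRAM, as a percentage of total, is below the threshold."""
    if total_gb <= 0:
        return False                      # no GPU memory reported: nothing to warn about
    return free_gb / total_gb * 100 < threshold_pct


if __name__ == "__main__":
    # 12.3 GB used out of 24.0 GB leaves 11.7 GB free, well above a 10 % threshold.
    print(low_vram(free_gb=24.0 - 12.3, total_gb=24.0))
```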
| Key | Action |
|---|---|
| 1–7 | Switch screens |
| s | Start selected server |
| S | Stop selected server |
| r | Restart selected server |
| i | Install server |
| d | Download selected model |
| b | Benchmark selected model |
| Delete | Delete selected model (with confirmation) |
| / | Search / filter |
| c | Copy endpoint URL |
| n | Notifications panel |
| p | Profile switcher |
| F5 | Force refresh |
| F1 / ? | Help |
| q / F10 | Quit |
| Server | Install | Configure | Start/Stop | Models |
|---|---|---|---|---|
| Ollama | Auto | Full flags | Yes | Full (pull / load / unload / delete) |
| vLLM | Auto (venv) | Full flags | Yes | Full |
| llama.cpp | Auto (venv, GPU-aware) | Full flags | Yes | Full (GGUF path required) |
| LM Studio | Manual (GUI app) | Connection only | GUI only | List all / load / unload |
- Install / Uninstall uses sudo on Linux. LLMManager will prompt for your sudo password in-app; it is never stored.
- Pre-loading: after Start or Restart, LLMManager sends a short warm-up prompt to the default model so the first real request is fast.
- keep-alive: set keep-alive = "0" in flags to unload models from VRAM immediately after each request; useful when you share the GPU with other workloads.
- Runs in an isolated venv at ~/.local/share/llmmanager/venvs/vllm/; it will not conflict with other Python environments on your system.
- One model at a time: vLLM loads a single model at startup via the --model flag. Set your model in the Model field before clicking Start.
- Model visibility: LLMManager shows models from three sources: the running server, the --model flag in your saved config, and your local HuggingFace cache (~/.cache/huggingface/hub/). If a model appears in the list but is greyed out, it is cached but not currently loaded.
- CUDA required for GPU inference. CPU-only inference is possible but very slow.
- HuggingFace token: if a model requires authentication, set HF_TOKEN in your environment before launching LLMManager, or pass it via the --tokenizer / env flags.
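
For the "local HuggingFace cache" source mentioned above, the sketch below shows one way cached models could be discovered. It relies on the hub cache naming cached repositories as `models--<org>--<name>` directories; the function name is hypothetical, and the default path ignores environment overrides such as HF_HOME, so treat it as an illustration rather than what LLMManager does.

```python
from pathlib import Path

DEFAULT_HUB_DIR = Path.home() / ".cache" / "huggingface" / "hub"


def cached_hf_models(hub_dir=DEFAULT_HUB_DIR):
    """Return repo ids such as 'org/name' for every cached model directory."""
    hub_dir = Path(hub_dir)
    if not hub_dir.is_dir():
        return []
    repo_ids = []
    for entry in sorted(hub_dir.iterdir()):
        # Datasets and spaces use other prefixes and are skipped.
        if entry.is_dir() and entry.name.startswith("models--"):
            # Caveat: a repo name that itself contains "--" would be mis-split.
            repo_ids.append(entry.name[len("models--"):].replace("--", "/"))
    return repo_ids


if __name__ == "__main__":
    for repo_id in cached_hf_models():
        print(repo_id)
```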
- GGUF path required: set the full path to a .gguf file in the Model field before clicking Start. The server will not start without it.
- GPU layers (--n-gpu-layers): set to -1 to offload all layers, or a specific number to keep part of the model in VRAM and the rest in RAM (useful for models larger than your VRAM).
- Context size (--ctx-size): defaults to 512; increase to 4096–32768 for longer conversations. Larger contexts use more VRAM.
- Continuous batching: enable --cont-batching for better throughput when running multiple concurrent requests.
- Build flags are auto-detected at install time (CUDA / ROCm / Metal / CPU). If you upgrade your GPU drivers after installing, re-install llama.cpp from the Server Management screen to rebuild with the correct flags.
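
The flags above are stored as key/value pairs and end up as command-line arguments. A minimal sketch of that translation is shown below; the function name and the rule that a boolean true becomes a bare switch while false is omitted are assumptions, not a description of LLMManager's internals.

```python
def flags_to_argv(flags):
    """Turn a flags mapping into a list of CLI arguments, preserving order."""
    argv = []
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")                  # boolean switch, no value
        elif value is False or value is None:
            continue                                  # disabled switch: omit it
        else:
            argv.extend([f"--{name}", str(value)])    # ordinary --name value pair
    return argv


if __name__ == "__main__":
    flags = {
        "model": "/path/to/your/model.gguf",
        "n-gpu-layers": 35,
        "ctx-size": 4096,
        "cont-batching": True,
    }
    print(flags_to_argv(flags))
```

Keeping the result as a list (rather than one joined string) avoids quoting problems when a model path contains spaces.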
LM Studio is a GUI desktop application. LLMManager cannot install, start, or stop it — these controls are disabled when LM Studio is selected.
Setup:
- Download and install LM Studio from lmstudio.ai
- Open LM Studio → go to Local Server (the <-> icon in the left sidebar)
- Click Start Server
- (Optional) Set an API key under Local Server settings if you want to secure access
In LLMManager:
- Go to Servers → LM Studio
- Set the port if you changed it from the default (1234)
- Set the API key if you enabled one in LM Studio
- Click Save & Poll — LLMManager will verify the connection and show what models are loaded
- LM Studio will then appear as Running on the Dashboard and be available in Chat, Benchmarks, and the API panel
Model loading: use the model picker on the Server Management screen or the Quick Load widget on the Dashboard. LLMManager sends load/unload requests to LM Studio's local API — you do not need to use the LM Studio GUI to switch models.
Polling: LLMManager polls LM Studio every 2 seconds (same as other servers). If LM Studio is closed, the Dashboard card will switch to Stopped automatically.
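
A single poll of LM Studio could be sketched as below, assuming its local server exposes the OpenAI-compatible `GET /v1/models` endpoint on the configured port. The function name, the return shape, and the choice to treat every failure (including an HTTP error such as a rejected API key) as "Stopped" are simplifications for illustration.

```python
import json
import urllib.error
import urllib.request


def poll_models(base_url="http://127.0.0.1:1234", api_key=None, timeout=2.0):
    """Return ('Running', [model ids]) or ('Stopped', []) if the server is unreachable."""
    request = urllib.request.Request(f"{base_url}/v1/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return "Stopped", []              # refused, timed out, HTTP error, or bad JSON
    if not isinstance(payload, dict):
        return "Running", []
    return "Running", [item["id"] for item in payload.get("data", [])]


if __name__ == "__main__":
    status, models = poll_models()
    print(status, models)
```

Calling this every 2 seconds from a background task would reproduce the Dashboard behaviour described above: the card flips to Stopped as soon as a poll fails.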
Config lives at ~/.config/llmmanager/config.toml — human-editable TOML.
[servers.ollama]
server_type = "ollama"
port = 11434
auto_start = false
[servers.ollama.flags]
keep-alive = "10m"
num-parallel = 4
flash-attention = true
[servers.llamacpp]
server_type = "llamacpp"
port = 8080
auto_start = false
[servers.llamacpp.flags]
model = "/path/to/your/model.gguf"
n-gpu-layers = 35
ctx-size = 4096
cont-batching = true
[notifications]
low_vram_threshold_pct = 10.0
crash_auto_restart = false

git clone https://github.com/jonathjan0397/llmmanager
cd llmmanager
make venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
make dev # editable install
make run # launch the TUI
make check # lint + typecheck
make test # run tests
make help # show all targets

git tag v0.1.1
git push origin v0.1.1

GitHub Actions will build and publish to PyPI automatically (requires trusted publisher setup on pypi.org).
MIT