diff --git a/docs/content/docs/advanced/advanced-usage.md b/docs/content/docs/advanced/advanced-usage.md index 77e6ef63efc4..faefebba0605 100644 --- a/docs/content/docs/advanced/advanced-usage.md +++ b/docs/content/docs/advanced/advanced-usage.md @@ -6,310 +6,39 @@ weight = 21 url = '/advanced' +++ -### Advanced configuration with YAML files +### Model Configuration with YAML Files -In order to define default prompts, model parameters (such as custom default `top_p` or `top_k`), LocalAI can be configured to serve user-defined models with a set of default parameters and templates. +LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models. -In order to configure a model, you can create multiple `yaml` files in the models path or either specify a single YAML configuration file. -Consider the following `models` folder in the `example/chatbot-ui`: - -``` -base ❯ ls -liah examples/chatbot-ui/models -36487587 drwxr-xr-x 2 mudler mudler 4.0K May 3 12:27 . -36487586 drwxr-xr-x 3 mudler mudler 4.0K May 3 10:42 .. -36465214 -rw-r--r-- 1 mudler mudler 10 Apr 27 07:46 completion.tmpl -36464855 -rw-r--r-- 1 mudler mudler ?G Apr 27 00:08 luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin -36464537 -rw-r--r-- 1 mudler mudler 245 May 3 10:42 gpt-3.5-turbo.yaml -36467388 -rw-r--r-- 1 mudler mudler 180 Apr 27 07:46 chat.tmpl -``` - -In the `gpt-3.5-turbo.yaml` file it is defined the `gpt-3.5-turbo` model which is an alias to use `luna-ai-llama2` with pre-defined options. - -For instance, consider the following that declares `gpt-3.5-turbo` backed by the `luna-ai-llama2` model: +**Quick Example:** ```yaml name: gpt-3.5-turbo -# Default model parameters parameters: - # Relative to the models path model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin - # temperature temperature: 0.3 - # all the OpenAI request options here.. 
-# Default context size context_size: 512 threads: 10 -# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with. -backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv - -# Enable prompt caching -prompt_cache_path: "alpaca-cache" -prompt_cache_all: true +backend: llama-stable -# stopwords (if supported by the backend) -stopwords: -- "HUMAN:" -- "### Response:" -# define chat roles -roles: - assistant: '### Response:' - system: '### System Instruction:' - user: '### Instruction:' template: - # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files completion: completion chat: chat ``` -Specifying a `config-file` via CLI allows to declare models in a single file as a list, for instance: - -```yaml -- name: list1 - parameters: - model: testmodel - context_size: 512 - threads: 10 - stopwords: - - "HUMAN:" - - "### Response:" - roles: - user: "HUMAN:" - system: "GPT:" - template: - completion: completion - chat: chat -- name: list2 - parameters: - model: testmodel - context_size: 512 - threads: 10 - stopwords: - - "HUMAN:" - - "### Response:" - roles: - user: "HUMAN:" - system: "GPT:" - template: - completion: completion - chat: chat -``` - -See also [chatbot-ui](https://github.com/mudler/LocalAI-examples/tree/main/chatbot-ui) as an example on how to use config files. - -It is possible to specify a full URL or a short-hand URL to a YAML model configuration file and use it on start with local-ai, for example to use phi-2: - -``` -local-ai github://mudler/LocalAI/examples/configurations/phi-2.yaml@master -``` - -### Full config model file reference - -```yaml -# Main configuration of the model, template, and system features. -name: "" # Model name, used to identify the model in API calls. - -# Precision settings for the model, reducing precision can enhance performance on some hardware. 
-f16: null # Whether to use 16-bit floating-point precision. - -embeddings: true # Enable embeddings for the model. - -# Concurrency settings for the application. -threads: null # Number of threads to use for processing. - -# Roles define how different entities interact in a conversational model. -# It can be used to map roles to specific parts of the conversation. -roles: {} # Roles for entities like user, system, assistant, etc. - -# Backend to use for computation (like llama-cpp, diffusers, whisper). -backend: "" # Backend for AI computations. - -# Templates for various types of model interactions. -template: - chat: "" # Template for chat interactions. Uses golang templates with Sprig functions. - chat_message: "" # Template for individual chat messages. Uses golang templates with Sprig functions. - completion: "" # Template for generating text completions. Uses golang templates with Sprig functions. - edit: "" # Template for edit operations. Uses golang templates with Sprig functions. - function: "" # Template for function calls. Uses golang templates with Sprig functions. - use_tokenizer_template: false # Whether to use a specific tokenizer template. (vLLM) - join_chat_messages_by_character: null # Character to join chat messages, if applicable. Defaults to newline. - -# Function-related settings to control behavior of specific function calls. -function: - disable_no_action: false # Whether to disable the no-action behavior. - grammar: - parallel_calls: false # Allow to return parallel tools - disable_parallel_new_lines: false # Disable parallel processing for new lines in grammar checks. - mixed_mode: false # Allow mixed-mode grammar enforcing - no_mixed_free_string: false # Disallow free strings in mixed mode. - disable: false # Completely disable grammar enforcing functionality. - prefix: "" # Prefix to add before grammars rules. - expect_strings_after_json: false # Expect string after JSON data. 
- no_action_function_name: "" # Function name to call when no action is determined. - no_action_description_name: "" # Description name for no-action functions. - response_regex: [] # Regular expressions to match response from - argument_regex: [] # Named regular to extract function arguments from the response. - argument_regex_key_name: "key" # Name of the named regex capture to capture the key of the function arguments - argument_regex_value_name: "value" # Name of the named regex capture to capture the value of the function arguments - json_regex_match: [] # Regular expressions to match JSON data when in tool mode - replace_function_results: [] # Placeholder to replace function call results with arbitrary strings or patterns. - replace_llm_results: [] # Replace language model results with arbitrary strings or patterns. - capture_llm_results: [] # Capture language model results as text result, among JSON, in function calls. For instance, if a model returns a block for "thinking" and a block for "response", this will allow you to capture the thinking block. - function_name_key: "name" - function_arguments_key: "arguments" - -# Feature gating flags to enable experimental or optional features. -feature_flags: {} - -# System prompt to use by default. -system_prompt: "" - -# Configuration for splitting tensors across GPUs. -tensor_split: "" - -# Identifier for the main GPU used in multi-GPU setups. -main_gpu: "" - -# Small value added to the denominator in RMS normalization to prevent division by zero. -rms_norm_eps: 0 - -# Natural question generation model parameter. -ngqa: 0 - -# Path where prompt cache is stored. -prompt_cache_path: "" - -# Whether to cache all prompts. -prompt_cache_all: false - -# Whether the prompt cache is read-only. -prompt_cache_ro: false - -# Mirostat sampling settings. -mirostat_eta: null -mirostat_tau: null -mirostat: null - -# GPU-specific layers configuration. -gpu_layers: null - -# Memory mapping for efficient I/O operations. 
-mmap: null - -# Memory locking to ensure data remains in RAM. -mmlock: null - -# Mode to use minimal VRAM for GPU operations. -low_vram: null - -# Words or phrases that halts processing. -stopwords: [] - -# Strings to cut from responses to maintain context or relevance. -cutstrings: [] - -# Strings to trim from responses for cleaner outputs. -trimspace: [] -trimsuffix: [] - -# Default context size for the model's understanding of the conversation or text. -context_size: null - -# Non-uniform memory access settings, useful for systems with multiple CPUs. -numa: false - -# Configuration for LoRA -lora_adapter: "" -lora_base: "" -lora_scale: 0 - -# Disable matrix multiplication queuing in GPU operations. -no_mulmatq: false - -# Model for generating draft responses. -draft_model: "" -n_draft: 0 - -# Quantization settings for the model, impacting memory and processing speed. -quantization: "" - -# List of KV Overrides for llama.cpp (--override-kv flag) -# Format: KEY=TYPE:VALUE -# Example: `qwen3moe.expert_used_count=int:10` -# Use this to override model configuration values at runtime. -# Supported types include: int, float, string, bool. -# Multiple overrides can be specified as a list. -overrides: - - KEY=TYPE:VALUE - -# Utilization percentage of GPU memory to allocate for the model. (vLLM) -gpu_memory_utilization: 0 - -# Whether to trust and execute remote code. -trust_remote_code: false +For a complete reference of all available configuration options, see the [Model Configuration]({{%relref "docs/advanced/model-configuration" %}}) page. -# Force eager execution of TensorFlow operations if applicable. (vLLM) -enforce_eager: false +**Configuration File Locations:** -# Space allocated for swapping data in and out of memory. (vLLM) -swap_space: 0 +1. **Individual files**: Create `.yaml` files in your models directory (e.g., `models/gpt-3.5-turbo.yaml`) +2. 
**Single config file**: Use `--models-config-file` or `LOCALAI_MODELS_CONFIG_FILE` to specify a file containing multiple models +3. **Remote URLs**: Specify a URL to a YAML configuration file at startup: + ```bash + local-ai run github://mudler/LocalAI/examples/configurations/phi-2.yaml@master + ``` -# Maximum model length, possibly referring to the number of tokens or parameters. (vLLM) -max_model_len: 0 - -# Size of the tensor parallelism in distributed computing environments. (vLLM) -tensor_parallel_size: 0 - -# vision model to use for multimodal -mmproj: "" - -# Disables offloading of key/value pairs in transformer models to save memory. -no_kv_offloading: false - -# Scaling factor for the rope penalty. -rope_scaling: "" - -# Type of configuration, often related to the type of task or model architecture. -type: "" - -# YARN settings -yarn_ext_factor: 0 -yarn_attn_factor: 0 -yarn_beta_fast: 0 -yarn_beta_slow: 0 -# configuration for diffusers model -diffusers: - cuda: false # Whether to use CUDA - pipeline_type: "" # Type of pipeline to use. - scheduler_type: "" # Type of scheduler for controlling operations. - enable_parameters: "" # Parameters to enable in the diffuser. - cfg_scale: 0 # Scale for CFG in the diffuser setup. - img2img: false # Whether image-to-image transformation is supported. - clip_skip: 0 # Number of steps to skip in CLIP operations. - clip_model: "" # Model to use for CLIP operations. - clip_subfolder: "" # Subfolder for storing CLIP-related data. - control_net: "" # Control net to use - -# Step count, usually for image processing models -step: 0 - -# Configuration for gRPC communication. -grpc: - attempts: 0 # Number of retry attempts for gRPC calls. - attempts_sleep_time: 0 # Sleep time between retries. - -# Text-to-Speech (TTS) configuration. -tts: - voice: "" # Voice setting for TTS. - vall-e: - audio_path: "" # Path to audio files for Vall-E. - -# Whether to use CUDA for GPU-based operations. 
-cuda: false - -# List of files to download as part of the setup or operations. -download_files: [] -``` +See also [chatbot-ui](https://github.com/mudler/LocalAI-examples/tree/main/chatbot-ui) as an example on how to use config files. ### Prompt templates @@ -475,82 +204,11 @@ docker run --env REBUILD=true localai docker run --env-file .env localai ``` -### CLI parameters - -You can control LocalAI with command line arguments, to specify a binding address, or the number of threads. Any command line parameter can be specified via an environment variable. - -In the help text below, BASEPATH is the location that local-ai is being executed from - -#### Global Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| -h, --help | | Show context-sensitive help. | -| --log-level | info | Set the level of logs to output [error,warn,info,debug] | $LOCALAI_LOG_LEVEL | -{{< /table >}} - -#### Storage Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| --models-path | BASEPATH/models | Path containing models used for inferencing | $LOCALAI_MODELS_PATH | -| --backend-assets-path |/tmp/localai/backend_data | Path used to extract libraries that are required by some of the backends in runtime | $LOCALAI_BACKEND_ASSETS_PATH | -| --generated-content-path | /tmp/generated/content | Location for assets generated by backends (e.g. 
stablediffusion) | $LOCALAI_GENERATED_CONTENT_PATH | -| --upload-path | /tmp/localai/upload | Path to store uploads from files api | $LOCALAI_UPLOAD_PATH | -| --config-path | /tmp/localai/config | | $LOCALAI_CONFIG_PATH | -| --localai-config-dir | BASEPATH/configuration | Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json) | $LOCALAI_CONFIG_DIR | -| --localai-config-dir-poll-interval | | Typically the config path picks up changes automatically, but if your system has broken fsnotify events, set this to a time duration to poll the LocalAI Config Dir (example: 1m) | $LOCALAI_CONFIG_DIR_POLL_INTERVAL | -| --models-config-file | STRING | YAML file containing a list of model backend configs | $LOCALAI_MODELS_CONFIG_FILE | -{{< /table >}} +### CLI Parameters -#### Models Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| --galleries | STRING | JSON list of galleries | $LOCALAI_GALLERIES | -| --autoload-galleries | | | $LOCALAI_AUTOLOAD_GALLERIES | -| --remote-library | "https://raw.githubusercontent.com/mudler/LocalAI/master/embedded/model_library.yaml" | A LocalAI remote library URL | $LOCALAI_REMOTE_LIBRARY | -| --preload-models | STRING | A List of models to apply in JSON at start |$LOCALAI_PRELOAD_MODELS | -| --models | MODELS,... | A List of model configuration URLs to load | $LOCALAI_MODELS | -| --preload-models-config | STRING | A List of models to apply at startup. Path to a YAML config file | $LOCALAI_PRELOAD_MODELS_CONFIG | -{{< /table >}} +For a complete reference of all CLI parameters, environment variables, and command-line options, see the [CLI Reference]({{%relref "docs/reference/cli-reference" %}}) page. 
-#### Performance Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| --f16 | | Enable GPU acceleration | $LOCALAI_F16 | -| -t, --threads | 4 | Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested | $LOCALAI_THREADS | -| --context-size | 512 | Default context size for models | $LOCALAI_CONTEXT_SIZE | -{{< /table >}} - -#### API Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| --address | ":8080" | Bind address for the API server | $LOCALAI_ADDRESS | -| --cors | | | $LOCALAI_CORS | -| --cors-allow-origins | | | $LOCALAI_CORS_ALLOW_ORIGINS | -| --upload-limit | 15 | Default upload-limit in MB | $LOCALAI_UPLOAD_LIMIT | -| --api-keys | API-KEYS,... | List of API Keys to enable API authentication. When this is set, all the requests must be authenticated with one of these API keys | $LOCALAI_API_KEY | -| --disable-welcome | | Disable welcome pages | $LOCALAI_DISABLE_WELCOME | -| --disable-webui | false | Disables the web user interface. When set to true, the server will only expose API endpoints without serving the web interface | $LOCALAI_DISABLE_WEBUI | -| --machine-tag | | If not empty - put that string to Machine-Tag header in each response. 
Useful to track response from different machines using multiple P2P federated nodes | $LOCALAI_MACHINE_TAG | -{{< /table >}} - -#### Backend Flags -{{< table "table-responsive" >}} -| Parameter | Default | Description | Environment Variable | -|-----------|---------|-------------|----------------------| -| --parallel-requests | | Enable backends to handle multiple requests in parallel if they support it (e.g.: llama.cpp or vllm) | $LOCALAI_PARALLEL_REQUESTS | -| --single-active-backend | | Allow only one backend to be run at a time | $LOCALAI_SINGLE_ACTIVE_BACKEND | -| --preload-backend-only | | Do not launch the API services, only the preloaded models / backends are started (useful for multi-node setups) | $LOCALAI_PRELOAD_BACKEND_ONLY | -| --external-grpc-backends | EXTERNAL-GRPC-BACKENDS,... | A list of external grpc backends | $LOCALAI_EXTERNAL_GRPC_BACKENDS | -| --enable-watchdog-idle | | Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout | $LOCALAI_WATCHDOG_IDLE | -| --watchdog-idle-timeout | 15m | Threshold beyond which an idle backend should be stopped | $LOCALAI_WATCHDOG_IDLE_TIMEOUT, $WATCHDOG_IDLE_TIMEOUT | -| --enable-watchdog-busy | | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | $LOCALAI_WATCHDOG_BUSY | -| --watchdog-busy-timeout | 5m | Threshold beyond which a busy backend should be stopped | $LOCALAI_WATCHDOG_BUSY_TIMEOUT | -{{< /table >}} +You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable. ### .env files @@ -635,6 +293,10 @@ A list of the environment variable that tweaks parallelism is the following: Note that, for llama.cpp you need to set accordingly `LLAMACPP_PARALLEL` to the number of parallel processes your GPU/CPU can handle. 
For python-based backends (like vLLM) you can set `PYTHON_GRPC_MAX_WORKERS` to the number of parallel requests. +### VRAM and Memory Management + +For detailed information on managing VRAM when running multiple models, see the dedicated [VRAM and Memory Management]({{%relref "docs/advanced/vram-management" %}}) page. + ### Disable CPU flagset auto detection in llama.cpp LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends. diff --git a/docs/content/docs/advanced/model-configuration.md b/docs/content/docs/advanced/model-configuration.md new file mode 100644 index 000000000000..975c6fb1d091 --- /dev/null +++ b/docs/content/docs/advanced/model-configuration.md @@ -0,0 +1,504 @@ ++++ +disableToc = false +title = "Model Configuration" +weight = 23 +url = '/advanced/model-configuration' ++++ + +LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options. + +## Overview + +Model configuration files allow you to: +- Define default parameters (temperature, top_p, etc.) +- Configure prompt templates +- Specify backend settings +- Set up function calling +- Configure GPU and memory options +- And much more + +## Configuration File Locations + +You can create model configuration files in several ways: + +1. **Individual YAML files** in the models directory (e.g., `models/gpt-3.5-turbo.yaml`) +2. **Single config file** with multiple models using `--models-config-file` or `LOCALAI_MODELS_CONFIG_FILE` +3. 
**Remote URLs** - specify a URL to a YAML configuration file at startup + +### Example: Basic Configuration + +```yaml +name: gpt-3.5-turbo +parameters: + model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin + temperature: 0.3 + +context_size: 512 +threads: 10 +backend: llama-stable + +template: + completion: completion + chat: chat +``` + +### Example: Multiple Models in One File + +When using `--models-config-file`, you can define multiple models as a list: + +```yaml +- name: model1 + parameters: + model: model1.bin + context_size: 512 + backend: llama-stable + +- name: model2 + parameters: + model: model2.bin + context_size: 1024 + backend: llama-stable +``` + +## Core Configuration Fields + +### Basic Model Settings + +| Field | Type | Description | Example | +|-------|------|-------------|---------| +| `name` | string | Model name, used to identify the model in API calls | `gpt-3.5-turbo` | +| `backend` | string | Backend to use (e.g. `llama-cpp`, `vllm`, `diffusers`, `whisper`) | `llama-cpp` | +| `description` | string | Human-readable description of the model | `A conversational AI model` | +| `usage` | string | Usage instructions or notes | `Best for general conversation` | + +### Model File and Downloads + +| Field | Type | Description | +|-------|------|-------------| +| `parameters.model` | string | Path to the model file (relative to models directory) or URL | +| `download_files` | array | List of files to download. Each entry has `filename`, `uri`, and optional `sha256` | + +**Example:** +```yaml +parameters: + model: my-model.gguf + +download_files: + - filename: my-model.gguf + uri: https://example.com/model.gguf + sha256: abc123... +``` + +## Parameters Section + +The `parameters` section contains all OpenAI-compatible request parameters and model-specific options. + +### OpenAI-Compatible Parameters + +These settings will be used as defaults for all the API calls to the model. 
+ +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `temperature` | float | `0.9` | Sampling temperature (0.0-2.0). Higher values make output more random | +| `top_p` | float | `0.95` | Nucleus sampling: consider tokens with top_p probability mass | +| `top_k` | int | `40` | Consider only the top K most likely tokens | +| `max_tokens` | int | `0` | Maximum number of tokens to generate (0 = unlimited) | +| `frequency_penalty` | float | `0.0` | Penalty for token frequency (-2.0 to 2.0) | +| `presence_penalty` | float | `0.0` | Penalty for token presence (-2.0 to 2.0) | +| `repeat_penalty` | float | `1.1` | Penalty for repeating tokens | +| `repeat_last_n` | int | `64` | Number of previous tokens to consider for repeat penalty | +| `seed` | int | `-1` | Random seed (omit for random) | +| `echo` | bool | `false` | Echo back the prompt in the response | +| `n` | int | `1` | Number of completions to generate | +| `logprobs` | bool/int | `false` | Return log probabilities of tokens | +| `top_logprobs` | int | `0` | Number of top logprobs to return per token (0-20) | +| `logit_bias` | map | `{}` | Map of token IDs to bias values (-100 to 100) | +| `typical_p` | float | `1.0` | Typical sampling parameter | +| `tfz` | float | `1.0` | Tail free z parameter | +| `keep` | int | `0` | Number of tokens to keep from the prompt | + +### Language and Translation + +| Field | Type | Description | +|-------|------|-------------| +| `language` | string | Language code for transcription/translation | +| `translate` | bool | Whether to translate audio transcription | + +### Custom Parameters + +| Field | Type | Description | +|-------|------|-------------| +| `batch` | int | Batch size for processing | +| `ignore_eos` | bool | Ignore end-of-sequence tokens | +| `negative_prompt` | string | Negative prompt for image generation | +| `rope_freq_base` | float32 | RoPE frequency base | +| `rope_freq_scale` | float32 | RoPE frequency scale | +| 
`negative_prompt_scale` | float32 | Scale for negative prompt | +| `tokenizer` | string | Tokenizer to use (RWKV) | + +## LLM Configuration + +These settings apply to most LLM backends (llama.cpp, vLLM, etc.): + +### Performance Settings + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `threads` | int | `processor count` | Number of threads for parallel computation | +| `context_size` | int | `512` | Maximum context size (number of tokens) | +| `f16` | bool | `false` | Enable 16-bit floating point precision (GPU acceleration) | +| `gpu_layers` | int | `0` | Number of layers to offload to GPU (0 = CPU only) | + +### Memory Management + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `mmap` | bool | `true` | Use memory mapping for model loading (faster, less RAM) | +| `mmlock` | bool | `false` | Lock model in memory (prevents swapping) | +| `low_vram` | bool | `false` | Use minimal VRAM mode | +| `no_kv_offloading` | bool | `false` | Disable KV cache offloading | + +### GPU Configuration + +| Field | Type | Description | +|-------|------|-------------| +| `tensor_split` | string | Comma-separated GPU memory allocation (e.g., `"0.8,0.2"` for 80%/20%) | +| `main_gpu` | string | Main GPU identifier for multi-GPU setups | +| `cuda` | bool | Explicitly enable/disable CUDA | + +### Sampling and Generation + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `mirostat` | int | `0` | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) | +| `mirostat_tau` | float | `5.0` | Mirostat target entropy | +| `mirostat_eta` | float | `0.1` | Mirostat learning rate | + +### LoRA Configuration + +| Field | Type | Description | +|-------|------|-------------| +| `lora_adapter` | string | Path to LoRA adapter file | +| `lora_base` | string | Base model for LoRA | +| `lora_scale` | float32 | LoRA scale factor | +| `lora_adapters` | array | Multiple LoRA adapters | 
+| `lora_scales` | array | Scales for multiple LoRA adapters | + +### Advanced Options + +| Field | Type | Description | +|-------|------|-------------| +| `no_mulmatq` | bool | Disable matrix multiplication queuing | +| `draft_model` | string | Draft model for speculative decoding | +| `n_draft` | int32 | Number of draft tokens | +| `quantization` | string | Quantization format | +| `load_format` | string | Model load format | +| `numa` | bool | Enable NUMA (Non-Uniform Memory Access) | +| `rms_norm_eps` | float32 | RMS normalization epsilon | +| `ngqa` | int32 | Grouped-query attention (GQA) factor | +| `rope_scaling` | string | RoPE scaling configuration | +| `type` | string | Model type/architecture | +| `grammar` | string | Grammar file path for constrained generation | + +### YARN Configuration + +YARN (Yet Another RoPE extensioN) settings for context extension: + +| Field | Type | Description | +|-------|------|-------------| +| `yarn_ext_factor` | float32 | YARN extension factor | +| `yarn_attn_factor` | float32 | YARN attention factor | +| `yarn_beta_fast` | float32 | YARN beta fast parameter | +| `yarn_beta_slow` | float32 | YARN beta slow parameter | + +### Prompt Caching + +| Field | Type | Description | +|-------|------|-------------| +| `prompt_cache_path` | string | Path to store prompt cache (relative to models directory) | +| `prompt_cache_all` | bool | Cache all prompts automatically | +| `prompt_cache_ro` | bool | Read-only prompt cache | + +### Text Processing + +| Field | Type | Description | +|-------|------|-------------| +| `stopwords` | array | Words or phrases that stop generation | +| `cutstrings` | array | Strings to cut from responses | +| `trimspace` | array | Strings to trim whitespace from | +| `trimsuffix` | array | Suffixes to trim from responses | +| `extract_regex` | array | Regular expressions to extract content | + +### System Prompt + +| Field | Type | Description | +|-------|------|-------------| +| `system_prompt` | string 
| Default system prompt for the model | + +## vLLM-Specific Configuration + +These options apply when using the `vllm` backend: + +| Field | Type | Description | +|-------|------|-------------| +| `gpu_memory_utilization` | float32 | GPU memory utilization (0.0-1.0, default 0.9) | +| `trust_remote_code` | bool | Trust and execute remote code | +| `enforce_eager` | bool | Force eager execution mode | +| `swap_space` | int | Swap space in GB | +| `max_model_len` | int | Maximum model length | +| `tensor_parallel_size` | int | Tensor parallelism size | +| `disable_log_stats` | bool | Disable logging statistics | +| `dtype` | string | Data type (e.g., `float16`, `bfloat16`) | +| `flash_attention` | string | Flash attention configuration | +| `cache_type_k` | string | Key cache type | +| `cache_type_v` | string | Value cache type | +| `limit_mm_per_prompt` | object | Limit multimodal content per prompt: `{image: int, video: int, audio: int}` | + +## Template Configuration + +Templates use Go templates with [Sprig functions](http://masterminds.github.io/sprig/). + +| Field | Type | Description | +|-------|------|-------------| +| `template.chat` | string | Template for chat completion endpoint | +| `template.chat_message` | string | Template for individual chat messages | +| `template.completion` | string | Template for text completion | +| `template.edit` | string | Template for edit operations | +| `template.function` | string | Template for function/tool calls | +| `template.multimodal` | string | Template for multimodal interactions | +| `template.reply_prefix` | string | Prefix to add to model replies | +| `template.use_tokenizer_template` | bool | Use tokenizer's built-in template (vLLM/transformers) | +| `template.join_chat_messages_by_character` | string | Character to join chat messages (default: `\n`) | + +### Template Variables + +Templating supports [sprig](https://masterminds.github.io/sprig/) functions. 
+ +The following variables are commonly available in templates: +- `{{.Input}}` - User input +- `{{.Instruction}}` - Instruction for edit operations +- `{{.System}}` - System message +- `{{.Prompt}}` - Full prompt +- `{{.Functions}}` - Function definitions (for function calling) +- `{{.FunctionCall}}` - Function call result + +### Example Template + +```yaml +template: + chat: | + {{.System}} + {{range .Messages}} + {{if eq .Role "user"}}User: {{.Content}}{{end}} + {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}} + {{end}} + Assistant: +``` + +## Function Calling Configuration + +Configure how the model handles function/tool calls: + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `function.disable_no_action` | bool | `false` | Disable the no-action behavior | +| `function.no_action_function_name` | string | `answer` | Name of the no-action function | +| `function.no_action_description_name` | string | | Description for no-action function | +| `function.function_name_key` | string | `name` | JSON key for function name | +| `function.function_arguments_key` | string | `arguments` | JSON key for function arguments | +| `function.response_regex` | array | | Named regex patterns to extract function calls | +| `function.argument_regex` | array | | Named regex to extract function arguments | +| `function.argument_regex_key_name` | string | `key` | Named regex capture for argument key | +| `function.argument_regex_value_name` | string | `value` | Named regex capture for argument value | +| `function.json_regex_match` | array | | Regex patterns to match JSON in tool mode | +| `function.replace_function_results` | array | | Replace function call results with patterns | +| `function.replace_llm_results` | array | | Replace LLM results with patterns | +| `function.capture_llm_results` | array | | Capture LLM results as text (e.g., for "thinking" blocks) | + +### Grammar Configuration + +| Field | Type | Default | Description | 
+|-------|------|---------|-------------| +| `function.grammar.disable` | bool | `false` | Completely disable grammar enforcement | +| `function.grammar.parallel_calls` | bool | `false` | Allow parallel function calls | +| `function.grammar.mixed_mode` | bool | `false` | Allow mixed-mode grammar enforcing | +| `function.grammar.no_mixed_free_string` | bool | `false` | Disallow free strings in mixed mode | +| `function.grammar.disable_parallel_new_lines` | bool | `false` | Disable parallel processing for new lines | +| `function.grammar.prefix` | string | | Prefix to add before grammar rules | +| `function.grammar.expect_strings_after_json` | bool | `false` | Expect strings after JSON data | + +## Diffusers Configuration + +For image generation models using the `diffusers` backend: + +| Field | Type | Description | +|-------|------|-------------| +| `diffusers.cuda` | bool | Enable CUDA for diffusers | +| `diffusers.pipeline_type` | string | Pipeline type (e.g., `stable-diffusion`, `stable-diffusion-xl`) | +| `diffusers.scheduler_type` | string | Scheduler type (e.g., `euler`, `ddpm`) | +| `diffusers.enable_parameters` | string | Comma-separated parameters to enable | +| `diffusers.cfg_scale` | float32 | Classifier-free guidance scale | +| `diffusers.img2img` | bool | Enable image-to-image transformation | +| `diffusers.clip_skip` | int | Number of CLIP layers to skip | +| `diffusers.clip_model` | string | CLIP model to use | +| `diffusers.clip_subfolder` | string | CLIP model subfolder | +| `diffusers.control_net` | string | ControlNet model to use | +| `step` | int | Number of diffusion steps | + +## TTS Configuration + +For text-to-speech models: + +| Field | Type | Description | +|-------|------|-------------| +| `tts.voice` | string | Voice file path or voice ID | +| `tts.audio_path` | string | Path to audio files (for Vall-E) | + +## Roles Configuration + +Map conversation roles to specific strings: + +```yaml +roles: + user: "### Instruction:" + assistant: 
"### Response:" + system: "### System Instruction:" +``` + +## Feature Flags + +Enable or disable experimental features: + +```yaml +feature_flags: + feature_name: true + another_feature: false +``` + +## MCP Configuration + +Model Context Protocol (MCP) configuration: + +| Field | Type | Description | +|-------|------|-------------| +| `mcp.remote` | string | YAML string defining remote MCP servers | +| `mcp.stdio` | string | YAML string defining STDIO MCP servers | + +## Agent Configuration + +Agent/autonomous agent configuration: + +| Field | Type | Description | +|-------|------|-------------| +| `agent.max_attempts` | int | Maximum number of attempts | +| `agent.max_iterations` | int | Maximum number of iterations | +| `agent.enable_reasoning` | bool | Enable reasoning capabilities | +| `agent.enable_planning` | bool | Enable planning capabilities | +| `agent.enable_mcp_prompts` | bool | Enable MCP prompts | +| `agent.enable_plan_re_evaluator` | bool | Enable plan re-evaluation | + +## Pipeline Configuration + +Define pipelines for audio-to-audio processing: + +| Field | Type | Description | +|-------|------|-------------| +| `pipeline.tts` | string | TTS model name | +| `pipeline.llm` | string | LLM model name | +| `pipeline.transcription` | string | Transcription model name | +| `pipeline.vad` | string | Voice activity detection model name | + +## gRPC Configuration + +Backend gRPC communication settings: + +| Field | Type | Description | +|-------|------|-------------| +| `grpc.attempts` | int | Number of retry attempts | +| `grpc.attempts_sleep_time` | int | Sleep time between retries (seconds) | + +## Overrides + +Override model configuration values at runtime (llama.cpp): + +```yaml +overrides: + - "qwen3moe.expert_used_count=int:10" + - "some_key=string:value" +``` + +Format: `KEY=TYPE:VALUE` where TYPE is `int`, `float`, `string`, or `bool`. 
+ +## Known Use Cases + +Specify which endpoints this model supports: + +```yaml +known_usecases: + - chat + - completion + - embeddings +``` + +Available flags: `chat`, `completion`, `edit`, `embeddings`, `rerank`, `image`, `transcript`, `tts`, `sound_generation`, `tokenize`, `vad`, `video`, `detection`, `llm` (combination of CHAT, COMPLETION, EDIT). + +## Complete Example + +Here's a comprehensive example combining many options: + +```yaml +name: my-llm-model +description: A high-performance LLM model +backend: llama-stable + +parameters: + model: my-model.gguf + temperature: 0.7 + top_p: 0.9 + top_k: 40 + max_tokens: 2048 + +context_size: 4096 +threads: 8 +f16: true +gpu_layers: 35 + +system_prompt: "You are a helpful AI assistant." + +template: + chat: | + {{.System}} + {{range .Messages}} + {{if eq .Role "user"}}User: {{.Content}} + {{else if eq .Role "assistant"}}Assistant: {{.Content}} + {{end}} + {{end}} + Assistant: + +roles: + user: "User:" + assistant: "Assistant:" + system: "System:" + +stopwords: + - "\n\nUser:" + - "\n\nHuman:" + +prompt_cache_path: "cache/my-model" +prompt_cache_all: true + +function: + grammar: + parallel_calls: true + mixed_mode: false + +feature_flags: + experimental_feature: true +``` + +## Related Documentation + +- See [Advanced Usage]({{%relref "docs/advanced/advanced-usage" %}}) for other configuration options +- See [Prompt Templates]({{%relref "docs/advanced/advanced-usage#prompt-templates" %}}) for template examples +- See [CLI Reference]({{%relref "docs/reference/cli-reference" %}}) for command-line options + diff --git a/docs/content/docs/advanced/vram-management.md b/docs/content/docs/advanced/vram-management.md new file mode 100644 index 000000000000..986b80c100c4 --- /dev/null +++ b/docs/content/docs/advanced/vram-management.md @@ -0,0 +1,178 @@ ++++ +disableToc = false +title = "VRAM and Memory Management" +weight = 22 +url = '/advanced/vram-management' ++++ + +When running multiple models in LocalAI, especially on 
systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn't enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion. + +## The Problem + +By default, LocalAI keeps models loaded in memory once they're first used. This means: +- If you load a large model that uses most of your VRAM, subsequent requests for other models may fail +- Models remain in memory even when not actively being used +- There's no automatic mechanism to unload models to make room for new ones, unless done manually via the web interface + +This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues [#6068](https://github.com/mudler/LocalAI/issues/6068), [#7269](https://github.com/mudler/LocalAI/issues/7269), and [#5352](https://github.com/mudler/LocalAI/issues/5352). + +## Solution 1: Single Active Backend + +The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one. 
+ +### Configuration + +```bash +# Via command line +./local-ai --single-active-backend + +# Via environment variable +LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai +``` + +### Use cases + +- Single GPU systems with limited VRAM +- When you only need one model active at a time +- Simple deployments where model switching is acceptable + +### Example + +```bash +# Start LocalAI with single active backend +LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai + +# First request loads model A +curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}' + +# Second request automatically unloads model A and loads model B +curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}' +``` + +## Solution 2: Watchdog Mechanisms + +For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck. + +### Idle Watchdog + +The idle watchdog monitors models that haven't been used for a specified period and automatically unloads them to free VRAM. + +#### Configuration + +```bash +# Enable idle watchdog with default timeout (15 minutes) +LOCALAI_WATCHDOG_IDLE=true ./local-ai + +# Customize the idle timeout (e.g., 10 minutes) +LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m ./local-ai + +# Via command line +./local-ai --enable-watchdog-idle --watchdog-idle-timeout=10m +``` + +### Busy Watchdog + +The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends. 
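
The idle and busy rules described above can be sketched as a small decision function. This is only an illustration of the behavior as documented here, not LocalAI's implementation: the `watchdog_action` name, its arguments, and the returned labels are all hypothetical, and timeouts are given in seconds for simplicity.

```bash
# Hypothetical decision rule mirroring the idle/busy watchdog descriptions.
# Args: state (idle|busy), seconds spent in that state,
#       idle timeout in seconds, busy timeout in seconds.
watchdog_action() {
  local state="$1" elapsed="$2" idle_timeout="$3" busy_timeout="$4"

  if [ "$state" = "idle" ] && [ "$elapsed" -gt "$idle_timeout" ]; then
    echo "unload"      # idle longer than the idle timeout: free its memory
  elif [ "$state" = "busy" ] && [ "$elapsed" -gt "$busy_timeout" ]; then
    echo "terminate"   # busy longer than the busy timeout: treat as stuck
  else
    echo "keep"
  fi
}

# Using the documented defaults: 15m (900s) idle, 5m (300s) busy.
watchdog_action idle 1000 900 300   # prints: unload
watchdog_action busy 120 900 300    # prints: keep
```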
+ +#### Configuration + +```bash +# Enable busy watchdog with default timeout (5 minutes) +LOCALAI_WATCHDOG_BUSY=true ./local-ai + +# Customize the busy timeout (e.g., 10 minutes) +LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=10m ./local-ai + +# Via command line +./local-ai --enable-watchdog-busy --watchdog-busy-timeout=10m +``` + +### Combined Configuration + +You can enable both watchdogs simultaneously for comprehensive memory management: + +```bash +LOCALAI_WATCHDOG_IDLE=true \ +LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \ +LOCALAI_WATCHDOG_BUSY=true \ +LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \ +./local-ai +``` + +Or using command line flags: + +```bash +./local-ai \ + --enable-watchdog-idle --watchdog-idle-timeout=15m \ + --enable-watchdog-busy --watchdog-busy-timeout=5m +``` + +### Use cases + +- Multi-model deployments where different models may be used intermittently +- Systems where you want to keep frequently-used models loaded but free memory from unused ones +- Recovery from stuck or hung backend processes +- Production environments requiring automatic resource management + +### Example + +```bash +# Start LocalAI with both watchdogs enabled +LOCALAI_WATCHDOG_IDLE=true \ +LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m \ +LOCALAI_WATCHDOG_BUSY=true \ +LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \ +./local-ai + +# Load multiple models +curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}' +curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}' + +# After 10 minutes of inactivity, model-a will be automatically unloaded +# If a model gets stuck processing for more than 5 minutes, it will be terminated +``` + +### Timeout Format + +Timeouts can be specified using Go's duration format: +- `15m` - 15 minutes +- `1h` - 1 hour +- `30s` - 30 seconds +- `2h30m` - 2 hours and 30 minutes + +## Limitations and Considerations + +### VRAM Usage Estimation + +LocalAI cannot reliably estimate VRAM usage of new models to load across different backends 
(llama.cpp, vLLM, diffusers, etc.) because: +- Different backends report memory usage differently +- VRAM requirements vary based on model architecture, quantization, and configuration +- Some backends may not expose memory usage information before loading the model + +### Manual Management + +If automatic management doesn't meet your needs, you can manually stop models using the LocalAI management API: + +```bash +# Stop a specific model +curl -X POST http://localhost:8080/backend/shutdown \ + -H "Content-Type: application/json" \ + -d '{"model": "model-name"}' +``` + +To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once. + +### Best Practices + +1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage +2. **Start with single active backend**: For single-GPU systems, `--single-active-backend` is often the simplest solution +3. **Tune watchdog timeouts**: Adjust timeouts based on your usage patterns - shorter timeouts free memory faster but may cause more frequent reloads +4. **Consider model size**: Ensure your VRAM can accommodate at least one of your largest models +5. 
**Use quantization**: Smaller quantized models use less VRAM and allow more flexibility + +## Related Documentation + +- See [Advanced Usage]({{%relref "docs/advanced/advanced-usage" %}}) for other configuration options +- See [GPU Acceleration]({{%relref "docs/features/GPU-acceleration" %}}) for GPU setup and configuration +- See [Backend Flags]({{%relref "docs/advanced/advanced-usage#backend-flags" %}}) for all available backend configuration options + diff --git a/docs/content/docs/reference/cli-reference.md b/docs/content/docs/reference/cli-reference.md new file mode 100644 index 000000000000..60569b66b746 --- /dev/null +++ b/docs/content/docs/reference/cli-reference.md @@ -0,0 +1,181 @@ ++++ +disableToc = false +title = "CLI Reference" +weight = 25 +url = '/reference/cli-reference' ++++ + +Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables. + +> **Note:** All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See [.env files]({{%relref "docs/advanced/advanced-usage#env-files" %}}) for configuration file support. + +## Global Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `-h, --help` | | Show context-sensitive help | | +| `--log-level` | `info` | Set the level of logs to output [error,warn,info,debug,trace] | `$LOCALAI_LOG_LEVEL` | +| `--debug` | `false` | **DEPRECATED** - Use `--log-level=debug` instead. 
Enable debug logging | `$LOCALAI_DEBUG`, `$DEBUG` | +{{< /table >}} + +## Storage Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--models-path` | `BASEPATH/models` | Path containing models used for inferencing | `$LOCALAI_MODELS_PATH`, `$MODELS_PATH` | +| `--generated-content-path` | `/tmp/generated/content` | Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) | `$LOCALAI_GENERATED_CONTENT_PATH`, `$GENERATED_CONTENT_PATH` | +| `--upload-path` | `/tmp/localai/upload` | Path to store uploads from files API | `$LOCALAI_UPLOAD_PATH`, `$UPLOAD_PATH` | +| `--localai-config-dir` | `BASEPATH/configuration` | Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json) | `$LOCALAI_CONFIG_DIR` | +| `--localai-config-dir-poll-interval` | | Time duration to poll the LocalAI Config Dir if your system has broken fsnotify events (example: `1m`) | `$LOCALAI_CONFIG_DIR_POLL_INTERVAL` | +| `--models-config-file` | | YAML file containing a list of model backend configs (alias: `--config-file`) | `$LOCALAI_MODELS_CONFIG_FILE`, `$CONFIG_FILE` | +{{< /table >}} + +## Backend Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--backends-path` | `BASEPATH/backends` | Path containing backends used for inferencing | `$LOCALAI_BACKENDS_PATH`, `$BACKENDS_PATH` | +| `--backends-system-path` | `/usr/share/localai/backends` | Path containing system backends used for inferencing | `$LOCALAI_BACKENDS_SYSTEM_PATH`, `$BACKEND_SYSTEM_PATH` | +| `--external-backends` | | A list of external backends to load from gallery on boot | `$LOCALAI_EXTERNAL_BACKENDS`, `$EXTERNAL_BACKENDS` | +| `--external-grpc-backends` | | A list of external gRPC backends (format: `BACKEND_NAME:URI`) 
| `$LOCALAI_EXTERNAL_GRPC_BACKENDS`, `$EXTERNAL_GRPC_BACKENDS` | +| `--backend-galleries` | | JSON list of backend galleries | `$LOCALAI_BACKEND_GALLERIES`, `$BACKEND_GALLERIES` | +| `--autoload-backend-galleries` | `true` | Automatically load backend galleries on startup | `$LOCALAI_AUTOLOAD_BACKEND_GALLERIES`, `$AUTOLOAD_BACKEND_GALLERIES` | +| `--parallel-requests` | `false` | Enable backends to handle multiple requests in parallel if they support it (e.g.: llama.cpp or vllm) | `$LOCALAI_PARALLEL_REQUESTS`, `$PARALLEL_REQUESTS` | +| `--single-active-backend` | `false` | Allow only one backend to be run at a time | `$LOCALAI_SINGLE_ACTIVE_BACKEND`, `$SINGLE_ACTIVE_BACKEND` | +| `--preload-backend-only` | `false` | Do not launch the API services, only the preloaded models/backends are started (useful for multi-node setups) | `$LOCALAI_PRELOAD_BACKEND_ONLY`, `$PRELOAD_BACKEND_ONLY` | +| `--enable-watchdog-idle` | `false` | Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout | `$LOCALAI_WATCHDOG_IDLE`, `$WATCHDOG_IDLE` | +| `--watchdog-idle-timeout` | `15m` | Threshold beyond which an idle backend should be stopped | `$LOCALAI_WATCHDOG_IDLE_TIMEOUT`, `$WATCHDOG_IDLE_TIMEOUT` | +| `--enable-watchdog-busy` | `false` | Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout | `$LOCALAI_WATCHDOG_BUSY`, `$WATCHDOG_BUSY` | +| `--watchdog-busy-timeout` | `5m` | Threshold beyond which a busy backend should be stopped | `$LOCALAI_WATCHDOG_BUSY_TIMEOUT`, `$WATCHDOG_BUSY_TIMEOUT` | +{{< /table >}} + +For more information on VRAM management, see [VRAM and Memory Management]({{%relref "docs/advanced/vram-management" %}}). 
+ +## Models Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--galleries` | | JSON list of galleries | `$LOCALAI_GALLERIES`, `$GALLERIES` | +| `--autoload-galleries` | `true` | Automatically load galleries on startup | `$LOCALAI_AUTOLOAD_GALLERIES`, `$AUTOLOAD_GALLERIES` | +| `--preload-models` | | A list of models to apply in JSON at start | `$LOCALAI_PRELOAD_MODELS`, `$PRELOAD_MODELS` | +| `--models` | | A list of model configuration URLs to load | `$LOCALAI_MODELS`, `$MODELS` | +| `--preload-models-config` | | A list of models to apply at startup. Path to a YAML config file | `$LOCALAI_PRELOAD_MODELS_CONFIG`, `$PRELOAD_MODELS_CONFIG` | +| `--load-to-memory` | | A list of models to load into memory at startup | `$LOCALAI_LOAD_TO_MEMORY`, `$LOAD_TO_MEMORY` | +{{< /table >}} + +> **Note:** You can also pass model configuration URLs as positional arguments: `local-ai run MODEL_URL1 MODEL_URL2 ...` + +## Performance Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--f16` | `false` | Enable GPU acceleration | `$LOCALAI_F16`, `$F16` | +| `-t, --threads` | | Number of threads used for parallel computation. 
Using the number of physical cores in the system is recommended | `$LOCALAI_THREADS`, `$THREADS` | +| `--context-size` | | Default context size for models | `$LOCALAI_CONTEXT_SIZE`, `$CONTEXT_SIZE` | +{{< /table >}} + +## API Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--address` | `:8080` | Bind address for the API server | `$LOCALAI_ADDRESS`, `$ADDRESS` | +| `--cors` | `false` | Enable CORS (Cross-Origin Resource Sharing) | `$LOCALAI_CORS`, `$CORS` | +| `--cors-allow-origins` | | Comma-separated list of allowed CORS origins | `$LOCALAI_CORS_ALLOW_ORIGINS`, `$CORS_ALLOW_ORIGINS` | +| `--csrf` | `false` | Enable Fiber CSRF middleware | `$LOCALAI_CSRF` | +| `--upload-limit` | `15` | Upload size limit in MB | `$LOCALAI_UPLOAD_LIMIT`, `$UPLOAD_LIMIT` | +| `--api-keys` | | List of API keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys | `$LOCALAI_API_KEY`, `$API_KEY` | +| `--disable-webui` | `false` | Disable the web user interface. When set to true, the server will only expose API endpoints without serving the web interface | `$LOCALAI_DISABLE_WEBUI`, `$DISABLE_WEBUI` | +| `--disable-gallery-endpoint` | `false` | Disable the gallery endpoints | `$LOCALAI_DISABLE_GALLERY_ENDPOINT`, `$DISABLE_GALLERY_ENDPOINT` | +| `--disable-metrics-endpoint` | `false` | Disable the `/metrics` endpoint | `$LOCALAI_DISABLE_METRICS_ENDPOINT`, `$DISABLE_METRICS_ENDPOINT` | +| `--machine-tag` | | If not empty, add that string to the `Machine-Tag` header in each response.
Useful for tracking responses from different machines when using multiple P2P federated nodes | `$LOCALAI_MACHINE_TAG`, `$MACHINE_TAG` | +{{< /table >}} + +## Hardening Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--disable-predownload-scan` | `false` | If true, disables the best-effort security scanner before downloading any files | `$LOCALAI_DISABLE_PREDOWNLOAD_SCAN` | +| `--opaque-errors` | `false` | If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended | `$LOCALAI_OPAQUE_ERRORS` | +| `--use-subtle-key-comparison` | `false` | If true, API key validation will use constant-time comparisons rather than simple equality. This trades off performance on each request for resilience against timing attacks | `$LOCALAI_SUBTLE_KEY_COMPARISON` | +| `--disable-api-key-requirement-for-http-get` | `false` | If true, a valid API key is not required to issue GET requests to portions of the web UI. This should only be enabled in secure testing environments | `$LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET` | +| `--http-get-exempted-endpoints` | `^/$,^/browse/?$,^/talk/?$,^/p2p/?$,^/chat/?$,^/text2image/?$,^/tts/?$,^/static/.*$,^/swagger.*$` | If `--disable-api-key-requirement-for-http-get` is set to true, this is the list of endpoints to exempt.
Only adjust this in case of a security incident or as a result of a personal security posture review | `$LOCALAI_HTTP_GET_EXEMPTED_ENDPOINTS` | +{{< /table >}} + +## P2P Flags + +{{< table "table-responsive" >}} +| Parameter | Default | Description | Environment Variable | +|-----------|---------|-------------|----------------------| +| `--p2p` | `false` | Enable P2P mode | `$LOCALAI_P2P`, `$P2P` | +| `--p2p-dht-interval` | `360` | Interval for DHT refresh (used during token generation) | `$LOCALAI_P2P_DHT_INTERVAL`, `$P2P_DHT_INTERVAL` | +| `--p2p-otp-interval` | `9000` | Interval for OTP refresh (used during token generation) | `$LOCALAI_P2P_OTP_INTERVAL`, `$P2P_OTP_INTERVAL` | +| `--p2ptoken` | | Token for P2P mode (optional) | `$LOCALAI_P2P_TOKEN`, `$P2P_TOKEN`, `$TOKEN` | +| `--p2p-network-id` | | Network ID for P2P mode, can be set arbitrarily by the user for grouping a set of instances | `$LOCALAI_P2P_NETWORK_ID`, `$P2P_NETWORK_ID` | +| `--federated` | `false` | Enable federated instance | `$LOCALAI_FEDERATED`, `$FEDERATED` | +{{< /table >}} + +## Other Commands + +LocalAI supports several subcommands beyond `run`: + +- `local-ai models` - Manage LocalAI models and definitions +- `local-ai backends` - Manage LocalAI backends and definitions +- `local-ai tts` - Convert text to speech +- `local-ai sound-generation` - Generate audio files from text or audio +- `local-ai transcript` - Convert audio to text +- `local-ai worker` - Run workers to distribute workload (llama.cpp-only) +- `local-ai util` - Utility commands +- `local-ai explorer` - Run P2P explorer +- `local-ai federated` - Run LocalAI in federated mode + +Use `local-ai --help` for more information on each command. 
+ +## Examples + +### Basic Usage + +```bash +# Start LocalAI with default settings +./local-ai run + +# Start with custom model path and address +./local-ai run --models-path /path/to/models --address :9090 + +# Start with GPU acceleration +./local-ai run --f16 +``` + +### Environment Variables + +```bash +# Using environment variables +export LOCALAI_MODELS_PATH=/path/to/models +export LOCALAI_ADDRESS=:9090 +export LOCALAI_F16=true +./local-ai run +``` + +### Advanced Configuration + +```bash +# Start with multiple models, watchdog, and P2P enabled +./local-ai run \ + --models model1.yaml model2.yaml \ + --enable-watchdog-idle \ + --watchdog-idle-timeout=10m \ + --p2p \ + --federated +``` + +## Related Documentation + +- See [Advanced Usage]({{%relref "docs/advanced/advanced-usage" %}}) for configuration examples +- See [VRAM and Memory Management]({{%relref "docs/advanced/vram-management" %}}) for memory management options +