Feature Request: Flexible Model Import with Backend and Quantization Selection
Currently, the model import workflow in LocalAI is somewhat rigid, especially when importing models from Hugging Face (HF), Ollama, local files, or OCI images. Users lack fine-grained control over:
- The choice of backend (e.g., `vLLM`, `transformers`, `llama.cpp`, `diffusers` for image generation)
- The specific quantization (e.g., Q4_K_M, Q5_K_S, GGUF variants)
- Automatic backend detection and template handling
Proposal
Enhance the model import system to support a flexible, user-driven workflow that allows:
- Source Flexibility:
  - Import models directly from Hugging Face (e.g., `HuggingFace: meta-llama/Llama-3-8b-instruct`)
  - Import from Ollama (e.g., `Ollama: llama3:instruct`)
  - Load from local files (e.g., `.gguf`, `.safetensors`)
  - Pull from OCI images (e.g., `oci://my-registry.com/my-model:latest`)
- Backend and Quantization Selection:
  - Allow users to explicitly choose the backend during import
  - Provide a list of available quantizations and backends for each model
  - Enable automatic detection of suitable backends based on file type (e.g., `.gguf` → `llama.cpp`); see the sketch after this list
- Seamless Integration with Gallery:
  - The model gallery should remain lightweight, focusing only on curated "latest and greatest" models
  - The import flow should handle complex or niche models, reducing the need to maintain every model in the gallery
- Auto-Detection of Native Templates:
  - When importing a `llama.cpp`-compatible model (e.g., `.gguf`), detect and use its native chat template from the upstream `llama.cpp` project
  - Fall back to an inline template definition if not available (maintaining backward compatibility)
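To make the backend auto-detection idea concrete, here is a minimal sketch (not LocalAI's actual code) of how an import flow could narrow a model source down to candidate backends. The mapping table, function name, and source-prefix conventions are all illustrative assumptions:

```python
from pathlib import Path

# Illustrative mapping from file extension to backends that can typically load it.
EXTENSION_BACKENDS = {
    ".gguf": ["llama.cpp"],
    ".safetensors": ["transformers", "vLLM", "diffusers"],
    ".bin": ["transformers"],
}


def candidate_backends(source: str) -> list[str]:
    """Suggest plausible backends for a model source string."""
    if source.startswith("oci://"):
        # OCI images carry their own metadata; the backend would be read from there.
        return ["(inspect image metadata)"]
    if source.startswith("Ollama:"):
        # Ollama distributes GGUF weights, so llama.cpp is the natural candidate.
        return ["llama.cpp"]
    suffix = Path(source.strip()).suffix.lower()
    return EXTENSION_BACKENDS.get(suffix, ["transformers"])  # conservative fallback


if __name__ == "__main__":
    for src in ("llama-3-8b-instruct-Q4_K_M.gguf",
                "stable-diffusion.safetensors",
                "Ollama: llama3:instruct",
                "oci://my-registry.com/my-model:latest"):
        print(f"{src} -> {candidate_backends(src)}")
```

In a real implementation the result would also be cross-checked against the backends actually installed before being offered to the user.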
Benefits
- Greater flexibility for advanced users to select optimal backends and quantizations
- Reduced bloat in the model gallery—focus on quality, not quantity
- Improved user experience for importing models from diverse sources
- Better compatibility with upstream standards (e.g., `llama.cpp` chat templates; a sketch of reading such a template follows this list)
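As a rough illustration of the template auto-detection: GGUF files store the model's native Jinja chat template under the metadata key `tokenizer.chat_template`. The sketch below reads it via the llama-cpp-python bindings, assuming a recent version that exposes GGUF metadata as a dict on the `Llama` object; the file name is hypothetical:

```python
from llama_cpp import Llama

# vocab_only avoids loading the full weights just to inspect metadata.
llm = Llama(model_path="llama-3-8b-instruct-Q4_K_M.gguf", vocab_only=True)

# GGUF embeds the model's native chat template under this standard key.
template = llm.metadata.get("tokenizer.chat_template")

if template:
    print("Native chat template found; it could be applied automatically on import.")
else:
    print("No embedded template; fall back to an inline template definition.")
```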
Example Workflow
- User selects "Import from Hugging Face"
- Enters `meta-llama/Llama-3-8b-instruct`
- LocalAI lists available quantizations (Q4_K_M, Q5_K_S, etc.) and backends (vLLM, llama.cpp, transformers)
- User selects `llama.cpp` + `Q4_K_M` + auto-apply template
- LocalAI downloads the `.gguf` file, auto-applies the `llama-3` chat template from `llama.cpp`, and loads the model
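The "lists available quantizations" step could, for instance, be built on the Hugging Face Hub file listing, as in the sketch below. The repo id is the one from the example (official repos may be gated, and GGUF quantizations often live in community repos), and the filename heuristic for quantization tags is an assumption:

```python
import re

from huggingface_hub import list_repo_files


def gguf_quantizations(repo_id: str) -> dict[str, str]:
    """Map quantization tags (Q4_K_M, Q5_K_S, ...) to the .gguf file providing them."""
    quants: dict[str, str] = {}
    for name in list_repo_files(repo_id):
        if not name.endswith(".gguf"):
            continue
        # Heuristic: the quantization tag is usually embedded in the file name.
        match = re.search(r"(IQ|Q)\d[A-Z0-9_]*", name, re.IGNORECASE)
        quants[match.group(0).upper() if match else "unknown"] = name
    return quants


if __name__ == "__main__":
    for tag, filename in gguf_quantizations("meta-llama/Llama-3-8b-instruct").items():
        print(f"{tag} -> {filename}")
```

From the returned map, the import UI could let the user pick a tag and hand the corresponding filename to the downloader.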
This feature would make LocalAI a truly universal model runner, supporting any model, any backend, any quantization—with minimal friction.