Enhance Model Import Flexibility: Support Backend & Quantization Selection Across Sources (HF, Ollama, Files, OCI) #7114

@localai-bot

Description

Feature Request: Flexible Model Import with Backend and Quantization Selection

Currently, the model import workflow in LocalAI is somewhat rigid, especially when importing models from Hugging Face (HF), Ollama, local files, or OCI images. Users lack fine-grained control over:

  • The choice of backend (e.g., vLLM, transformers, llama.cpp, diffusers for image generation)
  • The specific quantization (e.g., Q4_K_M, Q5_K_S, GGUF variants)
  • Automatic backend detection and template handling

Proposal

Enhance the model import system to support a flexible, user-driven workflow that allows:

  1. Source Flexibility:

    • Import models directly from Hugging Face (e.g., HuggingFace: meta-llama/Llama-3-8b-instruct)
    • Import from Ollama (e.g., Ollama: llama3:instruct)
    • Load from local files (e.g., .gguf, .safetensors)
    • Pull from OCI images (e.g., oci://my-registry.com/my-model:latest)
  2. Backend and Quantization Selection:

    • Allow users to explicitly choose the backend during import
    • Provide a list of available quantizations and backends for each model
    • Enable automatic detection of suitable backends based on file type (e.g., .gguf → llama.cpp); see the sketch after this list
  3. Seamless Integration with Gallery:

    • The model gallery should remain lightweight, focusing only on curated "latest and greatest" models
    • The import flow should handle complex or niche models, reducing the need to maintain every model in the gallery
  4. Auto-Detection of Native Templates:

    • When importing a llama.cpp-compatible model (e.g., .gguf), detect and use its native chat template from the upstream llama.cpp project
    • Fall back to an inline template definition if one is not available (maintaining backward compatibility)
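
For illustration, here is a minimal sketch of the extension-based backend auto-detection described in points 2 and 4. The package name, backend identifiers, and the mapping itself are assumptions for this proposal, not existing LocalAI code; real detection could additionally inspect file headers (e.g. the GGUF magic bytes) or repository metadata.

```go
package importer

import (
	"path/filepath"
	"strings"
)

// DetectBackend suggests a backend from the model file name alone.
// The extension-to-backend mapping below is illustrative only.
func DetectBackend(modelFile string) string {
	switch strings.ToLower(filepath.Ext(modelFile)) {
	case ".gguf":
		return "llama.cpp"
	case ".safetensors", ".bin", ".pt":
		return "transformers" // or vLLM, depending on the user's explicit choice
	case ".onnx":
		return "onnx"
	default:
		return "" // unknown: fall back to asking the user during import
	}
}
```

An import UI could call something like this to pre-select a backend while still letting the user override it, which keeps the "explicit choice" requirement intact.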

Benefits

  • Greater flexibility for advanced users to select optimal backends and quantizations
  • Reduced bloat in the model gallery—focus on quality, not quantity
  • Improved user experience for importing models from diverse sources
  • Better compatibility with upstream standards (e.g., llama.cpp templates)

Example Workflow

  1. User selects "Import from Hugging Face"
  2. Enters meta-llama/Llama-3-8b-instruct
  3. LocalAI lists available quantizations (Q4_K_M, Q5_K_S, etc.) and backends (vLLM, llama.cpp, transformers)
  4. User selects llama.cpp + Q4_K_M + auto-apply template
  5. LocalAI downloads the .gguf file, auto-applies the llama-3 chat template from llama.cpp, and loads the model
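
As a rough sketch of step 3, the available GGUF quantizations could be enumerated from the repository's file list. This assumes the public Hugging Face Hub endpoint https://huggingface.co/api/models/<repo>, which returns the repository files under "siblings"; the package name, function name, and quantization regex are illustrative, not an existing LocalAI API.

```go
package importer

import (
	"encoding/json"
	"fmt"
	"net/http"
	"regexp"
)

type hfModelInfo struct {
	Siblings []struct {
		Rfilename string `json:"rfilename"`
	} `json:"siblings"`
}

// Matches common GGUF quantization suffixes such as model.Q4_K_M.gguf.
var quantRe = regexp.MustCompile(`(?i)\.(Q\d_K_[SML]|Q\d_\d|Q\d|F16|F32)\.gguf$`)

// ListGGUFQuantizations returns the quantization tags (Q4_K_M, Q5_K_S, ...)
// found among the .gguf files of a repo such as "meta-llama/Llama-3-8b-instruct".
func ListGGUFQuantizations(repo string) ([]string, error) {
	resp, err := http.Get("https://huggingface.co/api/models/" + repo)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("hub API returned %s", resp.Status)
	}

	var info hfModelInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return nil, err
	}

	var quants []string
	for _, s := range info.Siblings {
		if m := quantRe.FindStringSubmatch(s.Rfilename); m != nil {
			quants = append(quants, m[1])
		}
	}
	return quants, nil
}
```

The same file listing could also inform the backend choice: a repo shipping only .safetensors would offer vLLM/transformers, while a GGUF-only repo would default to llama.cpp.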

This feature would make LocalAI a truly universal model runner, supporting any model, any backend, any quantization—with minimal friction.
