Enhance Model Import Flexibility: Support Backend & Quantization Selection Across Sources (HF, Ollama, Files, OCI) #7114

@localai-bot

Description

Feature Request: Flexible Model Import with Backend and Quantization Selection

Currently, the model import workflow in LocalAI is somewhat rigid, especially when importing models from Hugging Face (HF), Ollama, local files, or OCI images. Users lack fine-grained control over:

  • The choice of backend (e.g., vLLM, transformers, llama.cpp, diffusers for image generation)
  • The specific quantization (e.g., Q4_K_M, Q5_K_S, GGUF variants)
  • Automatic backend detection and template handling

Proposal

Enhance the model import system to support a flexible, user-driven workflow that allows:

  1. Source Flexibility:

    • Import models directly from Hugging Face (e.g., HuggingFace: meta-llama/Llama-3-8b-instruct)
    • Import from Ollama (e.g., Ollama: llama3:instruct)
    • Load from local files (e.g., .gguf, .safetensors)
    • Pull from OCI images (e.g., oci://my-registry.com/my-model:latest)
  2. Backend and Quantization Selection:

    • Allow users to explicitly choose the backend during import
    • Provide a list of available quantizations and backends for each model
    • Enable automatic detection of suitable backends based on file type (e.g., .gguf → llama.cpp); see the sketch after this list
  3. Seamless Integration with Gallery:

    • The model gallery should remain lightweight, focusing only on curated "latest and greatest" models
    • The import flow should handle complex or niche models, reducing the need to maintain every model in the gallery
  4. Auto-Detection of Native Templates:

    • When importing a llama.cpp-compatible model (e.g., .gguf), detect and use its native chat template from the upstream llama.cpp project
    • Fall back to an inline template definition if one is not available (maintaining backward compatibility)
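
For illustration, here is a minimal sketch of the extension-based backend auto-detection described in points 2 and 4. The package name, backend identifiers, and the mapping itself are assumptions for this proposal, not existing LocalAI code; real detection could additionally inspect file headers (e.g. the GGUF magic bytes) or repository metadata.

```go
package importer

import (
	"path/filepath"
	"strings"
)

// DetectBackend suggests a backend from the model file name alone.
// The extension-to-backend mapping below is illustrative only.
func DetectBackend(modelFile string) string {
	switch strings.ToLower(filepath.Ext(modelFile)) {
	case ".gguf":
		return "llama.cpp"
	case ".safetensors", ".bin", ".pt":
		return "transformers" // or vLLM, depending on the user's explicit choice
	case ".onnx":
		return "onnx"
	default:
		return "" // unknown: fall back to asking the user during import
	}
}
```

An import UI could call something like this to pre-select a backend while still letting the user override it, which keeps the "explicit choice" requirement intact.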

Benefits

  • Greater flexibility for advanced users to select optimal backends and quantizations
  • Reduced bloat in the model gallery—focus on quality, not quantity
  • Improved user experience for importing models from diverse sources
  • Better compatibility with upstream standards (e.g., llama.cpp templates)

Example Workflow

  1. User selects "Import from Hugging Face"
  2. Enters meta-llama/Llama-3-8b-instruct
  3. LocalAI lists available quantizations (Q4_K_M, Q5_K_S, etc.) and backends (vLLM, llama.cpp, transformers)
  4. User selects llama.cpp + Q4_K_M + auto-apply template
  5. LocalAI downloads the .gguf file, auto-applies the llama-3 chat template from llama.cpp, and loads the model
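
As a rough sketch of step 3, the available GGUF quantizations could be enumerated from the repository's file list. This assumes the public Hugging Face Hub endpoint https://huggingface.co/api/models/<repo>, which returns the repository files under "siblings"; the package name, function name, and quantization regex are illustrative, not an existing LocalAI API.

```go
package importer

import (
	"encoding/json"
	"fmt"
	"net/http"
	"regexp"
)

type hfModelInfo struct {
	Siblings []struct {
		Rfilename string `json:"rfilename"`
	} `json:"siblings"`
}

// Matches common GGUF quantization suffixes such as model.Q4_K_M.gguf.
var quantRe = regexp.MustCompile(`(?i)\.(Q\d_K_[SML]|Q\d_\d|Q\d|F16|F32)\.gguf$`)

// ListGGUFQuantizations returns the quantization tags (Q4_K_M, Q5_K_S, ...)
// found among the .gguf files of a repo such as "meta-llama/Llama-3-8b-instruct".
func ListGGUFQuantizations(repo string) ([]string, error) {
	resp, err := http.Get("https://huggingface.co/api/models/" + repo)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("hub API returned %s", resp.Status)
	}

	var info hfModelInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return nil, err
	}

	var quants []string
	for _, s := range info.Siblings {
		if m := quantRe.FindStringSubmatch(s.Rfilename); m != nil {
			quants = append(quants, m[1])
		}
	}
	return quants, nil
}
```

The same file listing could also inform the backend choice: a repo shipping only .safetensors would offer vLLM/transformers, while a GGUF-only repo would default to llama.cpp.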

This feature would make LocalAI a truly universal model runner, supporting any model, any backend, any quantization—with minimal friction.
