
Description
Goal
- Goal: Can we have a minimalist fork of llama.cpp as llamacpp-engine?
  - cortex.cpp's desktop focus means Drogon's features are unused
  - We should contribute our vision and multimodal work upstream as a form of llama.cpp server
  - Very clear Engines abstraction (i.e. to support OpenVINO etc. in the future; see the interface sketch after this list)
- Goal: Contribute upstream to llama.cpp
  - Vision, multimodal
  - This may not be possible if the vision and audio encoders are Python-runtime based
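A rough sketch of what such an Engines abstraction could look like is below. The interface name, request struct, and method signatures are illustrative only, not an existing cortex.cpp API; the point is that each backend (llamacpp-engine, a future OpenVINO engine, etc.) would implement one common contract.

```cpp
// Hypothetical engine interface sketch: cortex.cpp would only talk to
// backends through this abstraction, keeping engines swappable.
#include <functional>
#include <string>

struct InferenceRequest {
  std::string prompt;
  int max_tokens = 512;
};

class IEngine {
 public:
  virtual ~IEngine() = default;
  // Load a model file; returns false on failure.
  virtual bool LoadModel(const std::string& model_path) = 0;
  // Stream generated tokens back through the callback.
  virtual void Infer(const InferenceRequest& req,
                     std::function<void(const std::string& token)> on_token) = 0;
  virtual void UnloadModel() = 0;
};
```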
Can we consider refactoring llamacpp-engine to use the llama.cpp server implementation, and maintaining a fork with our improvements to speech, vision, etc.? This is especially relevant if we do a C++ implementation of whisperVQ in the future.
Potential issues
Key Changes
- Use llama-server instead of Drogon, which we currently use in cortex.llamacpp
- Use a spawned llama.cpp process instead of a dylib (better stability, parallelism)
  - However, we will effectively need to build a process manager (a minimal sketch follows below)
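A minimal sketch of what spawning and supervising a llama-server child process could look like on POSIX systems is below. The binary path, model path, and port are placeholders, and while llama-server does accept flags such as -m and --port, a real process manager would also need health checks, restarts, log capture, and Windows support.

```cpp
// Minimal POSIX sketch: spawn llama-server as a child process instead of
// loading a dylib, then monitor and stop it. Paths and flags are illustrative.
#include <csignal>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

class LlamaServerProcess {
 public:
  bool Start(const std::string& binary, const std::string& model_path, int port) {
    pid_ = fork();
    if (pid_ < 0) return false;          // fork failed
    if (pid_ == 0) {                     // child: exec llama-server
      std::string port_str = std::to_string(port);
      std::vector<char*> argv = {
          const_cast<char*>(binary.c_str()),
          const_cast<char*>("-m"), const_cast<char*>(model_path.c_str()),
          const_cast<char*>("--port"), const_cast<char*>(port_str.c_str()),
          nullptr};
      execvp(argv[0], argv.data());
      _exit(127);                        // exec failed
    }
    return true;                         // parent: keep the child's pid
  }

  // Non-blocking liveness check using waitpid with WNOHANG.
  bool IsRunning() {
    if (pid_ <= 0) return false;
    int status = 0;
    return waitpid(pid_, &status, WNOHANG) == 0;
  }

  void Stop() {
    if (pid_ > 0) {
      kill(pid_, SIGTERM);               // ask the server to shut down
      waitpid(pid_, nullptr, 0);         // reap the child
      pid_ = -1;
    }
  }

 private:
  pid_t pid_ = -1;
};
```

One upside of this design is isolation: a crash in the spawned llama.cpp process would no longer take down cortex.cpp itself, at the cost of the extra supervision logic shown above.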