
python-llama-cpp-http


Python HTTP Server and LangChain LLM Client for llama.cpp.

Server has three routes:

  • call: get the whole text completion for a prompt at once: POST /api/1.0/text/completion
  • stream: get text completion chunks for a prompt via WebSocket: GET /api/1.0/text/completion
  • embeddings: get text embeddings for a prompt: POST /api/1.0/text/embeddings
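
For example, a minimal completion call might look like the sketch below. The localhost:5000 address and the prompt field in the JSON body are assumptions; see misc/example_client_call.py for the authoritative request format.

import requests

# Assumed endpoint, port, and payload shape; check misc/example_client_call.py
# in this repo for the exact schema.
resp = requests.post(
    'http://localhost:5000/api/1.0/text/completion',
    json={'prompt': 'Building a website can be done in 10 simple steps:'},
)
resp.raise_for_status()
print(resp.json())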

The LangChain LLM Client supports synchronous calls only, built on the Python packages requests and websockets.
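
A minimal sketch of consuming the stream route with the websockets package follows; the ws://localhost:5000 URL and the JSON message shape are assumptions, and misc/example_client_stream.py shows the exact protocol.

import asyncio
import json

import websockets

async def stream(prompt):
    # Assumed WebSocket URL and payload; see misc/example_client_stream.py
    # for the authoritative endpoint and message format.
    uri = 'ws://localhost:5000/api/1.0/text/completion'
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({'prompt': prompt}))
        async for message in ws:
            print(message)

asyncio.run(stream('Building a website can be done in 10 simple steps:'))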

Install

pip install llama_cpp_http

Manual install

It is assumed that the GPU driver and OpenCL/CUDA libraries are already installed.

Make sure you follow the instructions in LLAMA_CPP.md below for one of the following:

  • CPU, including Apple (recommended for beginners)
  • OpenCL for AMD/NVIDIA GPUs (CLBlast)
  • HIP/ROCm for AMD GPUs (hipBLAS)
  • CUDA for NVIDIA GPUs (cuBLAS)

The easiest way to start is with the CPU-based version of llama.cpp if you do not want to deal with GPU drivers and libraries.

Install build packages

  • Arch/Manjaro: sudo pacman -Sy base-devel python git jq
  • Debian/Ubuntu: sudo apt install build-essential python3-dev python3-venv python3-pip libffi-dev libssl-dev git jq

Clone repo

git clone https://github.com/mtasic85/python-llama-cpp-http.git
cd python-llama-cpp-http

Make sure you are inside the cloned repo directory python-llama-cpp-http.

Setup python venv

python -m venv venv
source venv/bin/activate
python -m ensurepip --upgrade
pip install -U .

Clone and compile llama.cpp

git clone https://github.com/ggerganov/llama.cpp llama.cpp
cd llama.cpp
make -j

Download Meta's Llama 2 7B Model

Download a GGUF model from https://huggingface.co/TheBloke/Llama-2-7B-GGUF to the local directory models.

We advise using the model https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q2_K.gguf, which has minimal requirements and can fit in both RAM and VRAM.
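
If you prefer to script the download, here is a minimal sketch using only the Python standard library (the resolve/main URL is Hugging Face's direct-download form of the blob/main page linked above):

import os
import urllib.request

# Download the Q2_K model into the local models directory.
os.makedirs('models', exist_ok=True)
url = ('https://huggingface.co/TheBloke/Llama-2-7B-GGUF/'
       'resolve/main/llama-2-7b.Q2_K.gguf')
urllib.request.urlretrieve(url, 'models/llama-2-7b.Q2_K.gguf')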

Run Server

python -m llama_cpp_http.server --backend cpu --models-path ./models --llama-cpp-path ./llama.cpp

Experimental:

gunicorn 'llama_cpp_http.server:get_gunicorn_app(backend="clblast", models_path="~/models", llama_cpp_path="~/llama.cpp-clblast", platforms_devices="0:0")' --reload --bind '0.0.0.0:5000' --worker-class aiohttp.GunicornWebWorker

Run Client Examples

  1. Simple text completion call /api/1.0/text/completion:
     python -B misc/example_client_call.py | jq .
  2. WebSocket stream /api/1.0/text/completion:
     python -B misc/example_client_stream.py | jq -R '. as $line | try (fromjson) catch $line'
  3. Simple text embeddings call /api/1.0/text/embeddings:
     python -B misc/example_client_langchain_embedding.py

Licensing

python-llama-cpp-http is licensed under the MIT license. Check the LICENSE for details.
