FastAPI API wrapper for quantized LLMs.
Designed to allow simpler interactions with local LLMs and to provide features such as response caching.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Create an SSH key and clone this repository:

```
git clone git@github.kcl.ac.uk:hi/openai-server.git
```

(Alternative) Clone this repository using HTTPS, supplying a username and password:

```
git clone https://github.kcl.ac.uk/hi/openai-server.git
```
- Create a virtual environment, activate it, and install packages using the provided requirements file:

```
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
```
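On Windows, the activation script lives under `Scripts` rather than `bin`:

```
.venv\Scripts\activate
```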
- Install `tox`:

```
pip install tox
```
- Create an empty SQLite database:

```
mkdir output
touch output/cache.db
```
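Optionally, if the `sqlite3` CLI is installed, the empty file can be sanity-checked (SQLite treats an empty file as a valid, empty database):

```
# Prints nothing for an empty database; an error indicates a problem.
sqlite3 output/cache.db ".tables"
```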
Models currently supported by the server:
| Name | Hugging Face (HF) repository | HF filename | Notes |
|---|---|---|---|
| Llama 3.1 | SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF | meta-llama-3.1-8b-instruct.Q4_K_M.gguf | Quantized version of the model, not distributed by Meta directly. Required for this server. |
| MedLlama 3 | johnsnowlabs/JSL-MedLlama-3-8B-v2.0 | N/A | Requires quantization via llama.cpp before use (see the sketch below). Not required for this server. |
| BioMistral 7B | skfrost19/BioMistralMerged | biomistral-merged-v0.1.gguf | Quantized version of the model, not distributed by BioMistral directly. Not required for this server. |
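For MedLlama 3, a minimal quantization sketch using llama.cpp is shown below. Script and binary names have changed across llama.cpp releases (e.g. `quantize` was renamed `llama-quantize`), and the local paths here are hypothetical, so treat this as a starting point rather than this project's prescribed workflow:

```
# Fetch llama.cpp and its conversion dependencies.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the downloaded HF checkpoint to GGUF (path is hypothetical),
# then quantize to Q4_K_M. llama-quantize must be built first (see the
# llama.cpp build instructions).
python convert_hf_to_gguf.py /path/to/JSL-MedLlama-3-8B-v2.0 --outfile medllama-3-8b.gguf
./llama-quantize medllama-3-8b.gguf medllama-3-8b.Q4_K_M.gguf Q4_K_M
```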
To download Llama 3.1 (required) and any of the other models:
- Install `huggingface-hub` (with the virtual environment activated):

```
pip install huggingface-hub
```
- Download the model using the information in the table above, noting the download location for configuration (a concrete example follows below):

```
huggingface-cli download <HF repository> <HF filename> --local-dir .
```
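For example, to download the required Llama 3.1 model into the current directory, using the repository and filename from the table above:

```
huggingface-cli download SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF meta-llama-3.1-8b-instruct.Q4_K_M.gguf --local-dir .
```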
- Set environment variables in `.env`. Current variables, also shown in `.env.template`, are:
| Variable | Details |
|---|---|
| Llama__3_1__8B_Quant_Instruct | Location of the downloaded Llama model (required) |
| MedLlama__3__8B_Quant | Location of the downloaded MedLlama model (optional) |
| Biomistral__7B_Quant | Location of the downloaded BioMistral model (optional) |
| MODEL_FOLDER | Parent folder of all models (e.g. /home/user/models, if /home/user/models/[ModelA] and /home/user/models/[ModelB] exist) (optional; required for Docker) |
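A hypothetical `.env`, assuming the required model was downloaded to /home/user/models (the filename follows the table above; adjust paths to your machine):

```
Llama__3_1__8B_Quant_Instruct=/home/user/models/meta-llama-3.1-8b-instruct.Q4_K_M.gguf
MODEL_FOLDER=/home/user/models
```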
- Determine suitable configuration options (`config/config.ini`):
| Option | Details | Default |
|---|---|---|
| CACHE > ACTIVE | Whether to store prompt answers and return them for identical future prompts, serving the answer from the database rather than the LLM directly. | False |
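A minimal sketch of `config/config.ini` with caching enabled, assuming the `CACHE > ACTIVE` notation above maps to a standard INI section and option:

```ini
[CACHE]
; Serve repeated prompts from the SQLite cache instead of the LLM.
ACTIVE = True
```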
tox is used as a test orchestrator, creating environments for linting (flake8), type checks (mypy) and, finally, unit tests (pytest). It can be run using `tox`.
A Makefile has been included that packages common commands for convenience. `make test` runs a loop that fails if any of the environments fail, providing easier-to-read output.
Install and run locally as a Python package (e.g. for integration tests) as follows:

```
pip install .
openaiserver
```
Run through Docker as follows (either locally or remotely):

```
docker compose build
docker compose up -d
```
The app can then be interacted with in the same manner as if running as a Python package.
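One way to verify the server is reachable is to query the OpenAI-style model listing endpoint. Whether this server exposes `/v1/models` is an assumption here; the port is taken from the usage example below:

```
curl http://localhost:8080/v1/models
```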
- Install the OpenAI client:

```
pip install openai
```
- Create and run a Python file containing the following code:

```python
from openai import OpenAI

# The api_key is a placeholder; any non-empty value appears to work
# for this local server.
client = OpenAI(
    base_url='http://localhost:8080/v1/',
    api_key='foo',
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model='Llama__3_1__8B_Quant_Instruct',
    max_tokens=1024,
    temperature=0.7,
)

# Print the model's reply.
print(chat_completion.choices[0].message.content)
```
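For long generations, the OpenAI client can also stream tokens as they are produced. Whether this server implements the streaming protocol is an assumption, so treat this as a sketch:

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8080/v1/', api_key='foo')

# stream=True yields chunks, each carrying an incremental text delta.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say this is a test"}],
    model='Llama__3_1__8B_Quant_Instruct',
    max_tokens=1024,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```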
As this is a Python package, most of the logic is contained within the `src` folder. General recommendations for editing are:
- Use `make prettier`, another command made available within the Makefile for convenience, to automatically format code.
- Always run tests (`make test`) before committing.
- Commits can be made as follows:

```
git add .
git commit -m "[details of changes]"
git push
```
This project is licensed under the MIT License - see the LICENSE.md file for details.