illama

illama is a lightweight, fast inference server for Llama and ExLlamav2 based large language models (LLMs).

Features

Continuous batching - Handles multiple requests simultaneously.
Open-AI compatible server - Use official OpenAI API clients
Quantization Support - Load any quantized ExLlamaV2 compatible models (GPTQ, EXL2, or SafeTensors).
GPU Focused - Distribute model across any number of local GPUs.
Uses FlashAttention 2 with Paged Attention by default

Getting Started

To get started, clone the repo.

git clone https://github.com/nickpotafiy/illama.git
cd illama

With Conda

Optionally, create a new conda environment.

conda create -n illama python=3.10
conda activate illama

Install PyTorch

Install Nvidia Cuda Toolkit and PyTorch. Ideally, both versions should match to minimize incompatibilities. PyTorch CUDA 12.1 is recommended with Nvidia CUDA Toolkit 12.1+.

Install Torch w/ Pip

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Install Torch w/ Conda

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Check Torch CUDA version with: python -c "import torch; print(torch.version.cuda)"

Install Dependencies

Next, install all the necessary dependencies.

pip install flash-attn pydantic rich tokenizers uvicorn fastapi
pip install git+https://github.com/nickpotafiy/exllamav2.git

Running the Server

To start illama server, run this command:

python server.py --model-path "<path>" --batch-size 10 --host "0.0.0.0" --port 5000 --verbose

Run python server.py --help to get a list of all available options.

Troubleshooting

If you get an error saying OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root, that typically means PyTorch was not installed correctly. You can verify PyTorch installation by activating your environment and executing python:

import torch
torch.version.cuda

If you don't get your PyTorch CUDA version, then it was not installed correctly. You may have installed PyTorch without CUDA (like a Preview build).

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
illama		illama
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
server.py		server.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

illama

Features

Getting Started

With Conda

Install PyTorch

Install Torch w/ Pip

Install Torch w/ Conda

Install Dependencies

Running the Server

Troubleshooting

About

Releases

Packages

Languages

License

nickpotafiy/illama

Folders and files

Latest commit

History

Repository files navigation

illama

Features

Getting Started

With Conda

Install PyTorch

Install Torch w/ Pip

Install Torch w/ Conda

Install Dependencies

Running the Server

Troubleshooting

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages