# llm_microlibs

`llm_microlibs` consists of several small microlibs that enable you to run parts of a bigger LLM (large language model).

The parts are run sequentially, and the time during which a node is not busy with computation can be used to load future layers into GPU memory, which allows a decent run of full, unquantized LLMs even on consumer-grade hardware.

Every part is just a standard PyTorch `nn.Module`.
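To make the overlap concrete, here is a minimal sketch of the idea, assuming a hypothetical `run_with_prefetch` helper (not part of any microlib): while the GPU computes layer i, a background thread moves layer i+1's weights onto the device, so loading hides behind computation.

```python
import threading
import torch
import torch.nn as nn

# Hypothetical sketch: overlap weight loading with computation.
# All names here are illustrative assumptions, not a microlib API.
def run_with_prefetch(layers: list[nn.Module], x: torch.Tensor, device: str) -> torch.Tensor:
    def prefetch(layer: nn.Module):
        layer.to(device)                      # load the next layer's weights onto the GPU

    next_thread = threading.Thread(target=prefetch, args=(layers[0],))
    next_thread.start()
    for i, layer in enumerate(layers):
        next_thread.join()                    # make sure the current layer is on the GPU
        if i + 1 < len(layers):
            # Start loading the next layer while the current one computes.
            next_thread = threading.Thread(target=prefetch, args=(layers[i + 1],))
            next_thread.start()
        x = layer(x.to(device))
    return x
```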
For example, if you have multiple old GPUs, each with limited memory, you can run a different part of the LLM on each of them, and by making the parts communicate, you can run the full model across multiple heterogeneous hosts. For instance, with four old gaming PCs, each with a 3090 card (~$6000), you can run 40B models in real time (5-6 tokens/s).
Most 7B models require around 15GB of memory. If you have two GPUs with 24GB each, you can load three 7B models onto them by placing one half of each model on each GPU (three halves, about 22.5GB, per GPU).
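As a rough sketch of this kind of partitioning (the `Part` class and device placement below are illustrative assumptions, not a microlib API), each part can be an ordinary `nn.Module` holding a slice of the model's blocks on its own device:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a "part" is a plain nn.Module holding a slice of
# the model's blocks, placed on its own device.
class Part(nn.Module):
    def __init__(self, layers: nn.ModuleList, device: str):
        super().__init__()
        self.layers = layers.to(device)
        self.device = device

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = hidden_states.to(self.device)
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states

# Split 8 stand-in transformer blocks across two GPUs and run them in sequence.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(8)])
first_half = Part(blocks[:4], "cuda:0")
second_half = Part(blocks[4:], "cuda:1")

x = torch.randn(1, 16, 512)
out = second_half(first_half(x))   # activations flow part by part
```

In a real multi-host setup, the activations would travel over the network between parts instead of between local devices.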
All microlibs have minimal third-party dependencies and few abstractions in the public API, which means you can combine them with any framework, such as Huggingface Transformers, xformers, etc.
## llm_sampler

Install with:

```bash
pip install llm_sampler
```

`llm_sampler` allows you to sample from any LLM given its logits.
It is a collection of various sampling techniques found online.
For now, the methods are:
- `sample_huggingface` - the one used in `transformers`
- `sample_gpt_fast` - the one used in `gpt-fast`
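To illustrate the general technique (this is not the exact `llm_sampler` API; the function below is a hypothetical sketch), sampling from logits typically combines temperature scaling, top-k filtering, and a multinomial draw:

```python
import torch

def sample_from_logits(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> torch.Tensor:
    """Hypothetical sketch of sampling a next token from raw logits."""
    logits = logits / max(temperature, 1e-5)               # temperature scaling
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[..., -1, None]] = -float("inf")  # keep only the top-k logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)         # sampled token id(s)

# Usage: logits for one position over a 32k-entry vocabulary.
logits = torch.randn(1, 32000)
next_token = sample_from_logits(logits)
```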
## llm_sepweight

Install with:

```bash
pip install llm_sepweight
```

The `llm_sepweight` microlib is designed to manage the weights of large language models (LLMs) by organizing them into directories. It allows you to store and distribute the weights of an LLM as normal files for each layer.
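As a hypothetical illustration of the per-layer-files idea (the directory layout and helper names below are assumptions, not the `llm_sepweight` API), each layer's state dict can be saved as its own file, so any subset of layers can be copied or loaded independently:

```python
import torch
import torch.nn as nn
from pathlib import Path

# Hypothetical helpers: one file per layer, so a node that runs only
# layers [start, end) never has to download the rest of the model.
def save_layers_separately(model_layers: nn.ModuleList, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, layer in enumerate(model_layers):
        torch.save(layer.state_dict(), out / f"layer_{i:03d}.pth")

def load_layer_range(model_layers: nn.ModuleList, out_dir: str, start: int, end: int) -> None:
    # Load only the layers this node is responsible for.
    for i in range(start, end):
        state = torch.load(Path(out_dir) / f"layer_{i:03d}.pth", map_location="cpu")
        model_layers[i].load_state_dict(state)
```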
## llm_falcon_model

Install with:

```bash
pip install llm_falcon_model
```

`llm_falcon_model` allows you to run a part of a Falcon model as a standalone PyTorch module. This enables you to run the model in distributed mode, using even old GPUs with limited memory.
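One way to picture the distributed mode (the transport and helper names below are illustrative assumptions, not the `llm_falcon_model` API) is that each host runs its part and ships the activations to the next host:

```python
import io
import torch

# Hypothetical sketch of handing activations between hosts, each of which
# runs a different part of the model. The transport (raw bytes) is only
# an illustration; any RPC or socket layer would do.
def activations_to_bytes(hidden_states: torch.Tensor) -> bytes:
    buffer = io.BytesIO()
    torch.save(hidden_states.cpu(), buffer)   # move off-device before shipping
    return buffer.getvalue()

def activations_from_bytes(payload: bytes, device: str) -> torch.Tensor:
    return torch.load(io.BytesIO(payload), map_location=device)

# On host A: out = part_a(x); send activations_to_bytes(out) over the network.
# On host B: x = activations_from_bytes(payload, "cuda:0"); out = part_b(x).
```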
## Roadmap

- Serialization/deserialization of the KV-cache
- Release `llm_llama2_model`
- Release `llm_qwen_model`
- Release `llm_goliath_model`
- Release `llm_yi_model`
- Release `llm_mistral_model` and future bigger models by Mistral
- Integrate `deepseek` models
- Explore hand-crafted pipeline parallelism
- Speculative decoding
- Support `gpt-fast` for Llama models
- Make a write-up of an example distributed run
... and many more!
Thank you for your interest! If you like our work, please consider leaving a star and sharing it with your friends.