# llm_microlibs

`llm_microlibs` consists of several small microlibs that enable you to run parts of a bigger LLM (large language model).

The parts are run sequentially, and the time during which a node is not busy with computation can be used to load future layers into GPU memory, which allows a decent run of full, unquantized LLMs even on consumer-grade hardware.

Every part is just a standard PyTorch `nn.Module`.
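To make the overlap concrete, here is a minimal sketch of the idea, assuming a hypothetical `run_with_prefetch` helper (not part of any microlib): while the GPU computes layer i, a background thread moves layer i+1's weights onto the device, so loading hides behind computation.

```python
import threading
import torch
import torch.nn as nn

# Hypothetical sketch: overlap weight loading with computation.
# All names here are illustrative assumptions, not a microlib API.
def run_with_prefetch(layers: list[nn.Module], x: torch.Tensor, device: str) -> torch.Tensor:
    def prefetch(layer: nn.Module):
        layer.to(device)                      # load the next layer's weights onto the GPU

    next_thread = threading.Thread(target=prefetch, args=(layers[0],))
    next_thread.start()
    for i, layer in enumerate(layers):
        next_thread.join()                    # make sure the current layer is on the GPU
        if i + 1 < len(layers):
            # Start loading the next layer while the current one computes.
            next_thread = threading.Thread(target=prefetch, args=(layers[i + 1],))
            next_thread.start()
        x = layer(x.to(device))
    return x
```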
For example, if you have multiple old GPUs, each with limited memory, you can run a different part of the LLM on each of them, and by making the parts communicate, you can run the full model across multiple heterogeneous hosts. For instance, with four old gaming PCs, each with a 3090 card (~$6000), you can run 40B models in real time (5-6 tokens/s).
Most 7B models require around 15GB of memory. If you have two GPUs with 24GB each, you can load three 7B models onto them by placing one half of each model on each GPU (three halves, about 22.5GB, per GPU).
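As a rough sketch of this kind of partitioning (the `Part` class and device placement below are illustrative assumptions, not a microlib API), each part can be an ordinary `nn.Module` holding a slice of the model's blocks on its own device:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a "part" is a plain nn.Module holding a slice of
# the model's blocks, placed on its own device.
class Part(nn.Module):
    def __init__(self, layers: nn.ModuleList, device: str):
        super().__init__()
        self.layers = layers.to(device)
        self.device = device

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = hidden_states.to(self.device)
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states

# Split 8 stand-in transformer blocks across two GPUs and run them in sequence.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(8)])
first_half = Part(blocks[:4], "cuda:0")
second_half = Part(blocks[4:], "cuda:1")

x = torch.randn(1, 16, 512)
out = second_half(first_half(x))   # activations flow part by part
```

In a real multi-host setup, the activations would travel over the network between parts instead of between local devices.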
All microlibs have minimal third-party dependencies and few abstractions in the public API, which means you can combine them with any framework, such as Huggingface Transformers, xformers, etc.
## llm_sampler

Install with:

```bash
pip install llm_sampler
```

`llm_sampler` allows you to sample from any LLM given its logits.
It is a collection of various sampling techniques found online.
For now, the methods are:
- `sample_huggingface` - the one used in `transformers`
- `sample_gpt_fast` - the one used in `gpt-fast`
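To illustrate the general technique (this is not the exact `llm_sampler` API; the function below is a hypothetical sketch), sampling from logits typically combines temperature scaling, top-k filtering, and a multinomial draw:

```python
import torch

def sample_from_logits(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> torch.Tensor:
    """Hypothetical sketch of sampling a next token from raw logits."""
    logits = logits / max(temperature, 1e-5)               # temperature scaling
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[..., -1, None]] = -float("inf")  # keep only the top-k logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)         # sampled token id(s)

# Usage: logits for one position over a 32k-entry vocabulary.
logits = torch.randn(1, 32000)
next_token = sample_from_logits(logits)
```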
## llm_sepweight

Install with:

```bash
pip install llm_sepweight
```

The `llm_sepweight` microlib is designed to manage the weights of large language models (LLMs) by organizing them into directories. It allows you to store and distribute the weights of an LLM as normal files for each layer.
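As a hypothetical illustration of the per-layer-files idea (the directory layout and helper names below are assumptions, not the `llm_sepweight` API), each layer's state dict can be saved as its own file, so any subset of layers can be copied or loaded independently:

```python
import torch
import torch.nn as nn
from pathlib import Path

# Hypothetical helpers: one file per layer, so a node that runs only
# layers [start, end) never has to download the rest of the model.
def save_layers_separately(model_layers: nn.ModuleList, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, layer in enumerate(model_layers):
        torch.save(layer.state_dict(), out / f"layer_{i:03d}.pth")

def load_layer_range(model_layers: nn.ModuleList, out_dir: str, start: int, end: int) -> None:
    # Load only the layers this node is responsible for.
    for i in range(start, end):
        state = torch.load(Path(out_dir) / f"layer_{i:03d}.pth", map_location="cpu")
        model_layers[i].load_state_dict(state)
```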
## llm_falcon_model

Install with:

```bash
pip install llm_falcon_model
```

`llm_falcon_model` allows you to run a part of a Falcon model as a standalone PyTorch module. This enables you to run the model in distributed mode, using even old GPUs with limited memory.
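One way to picture the distributed mode (the transport and helper names below are illustrative assumptions, not the `llm_falcon_model` API) is that each host runs its part and ships the activations to the next host:

```python
import io
import torch

# Hypothetical sketch of handing activations between hosts, each of which
# runs a different part of the model. The transport (raw bytes) is only
# an illustration; any RPC or socket layer would do.
def activations_to_bytes(hidden_states: torch.Tensor) -> bytes:
    buffer = io.BytesIO()
    torch.save(hidden_states.cpu(), buffer)   # move off-device before shipping
    return buffer.getvalue()

def activations_from_bytes(payload: bytes, device: str) -> torch.Tensor:
    return torch.load(io.BytesIO(payload), map_location=device)

# On host A: out = part_a(x); send activations_to_bytes(out) over the network.
# On host B: x = activations_from_bytes(payload, "cuda:0"); out = part_b(x).
```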
## Roadmap

- Serialization/deserialization of the KV-cache
- Release `llm_llama2_model`
- Release `llm_qwen_model`
- Release `llm_goliath_model`
- Release `llm_yi_model`
- Release `llm_mistral_model` and future bigger models by Mistral
- Integrate `deepseek` models
- Explore hand-crafted pipeline parallelism
- Speculative decoding
- Support `gpt-fast` for Llama models
- Make a write-up of an example distributed run
... and many more!
Thank you for your interest! If you like our work, please consider leaving a star and sharing it with your friends.