llm_microlibs

llm_microlibs consists of several small microlibs that enable you to run parts of a bigger LLM (large language model).

The parts run sequentially, and the time during which a node is not busy computing can be used to load future layers into GPU memory. This allows full, unquantized LLMs to run at decent speed even on consumer-grade hardware.

Every part is just a standard PyTorch nn.Module.
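To make the idea concrete, here is a minimal sketch (not the actual llm_microlibs API; the `run_parts` helper and the eager CPU offloading are assumptions) of running parts one after another while the next part's weights are copied to the GPU on a side stream:

```python
import torch
import torch.nn as nn

def run_parts(parts: list[nn.Module], x: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Run `parts` sequentially on `device`; `x` is assumed to already be on `device`.
    All parts are assumed to start out on the CPU."""
    copy_stream = torch.cuda.Stream()
    parts[0].to(device)  # preload the first part
    for i, part in enumerate(parts):
        if i + 1 < len(parts):
            # Start copying the next part on a side stream so the transfer
            # overlaps with the current computation (true overlap also
            # requires pinned CPU memory).
            with torch.cuda.stream(copy_stream):
                parts[i + 1].to(device, non_blocking=True)
        x = part(x)
        part.to("cpu")  # release GPU memory held by the part we just ran
        torch.cuda.current_stream().wait_stream(copy_stream)
    return x
```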

Example: multiple, possibly heterogeneous GPUs

For example, if you have multiple old GPUs with limited memory, you can run a different part of the LLM on each of them, and by making the parts communicate you can run the full model across multiple heterogeneous hosts. With four old gaming PCs, each with a 3090 card (~$6000), you can run 40B models in real time (5-6 tokens/s).
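As an illustration of how hosts could pass activations along (a sketch assuming `torch.distributed` point-to-point ops and one process per host; not necessarily how llm_microlibs communicates):

```python
import torch
import torch.distributed as dist

def pipeline_forward(part: torch.nn.Module, hidden: torch.Tensor,
                     rank: int, world_size: int) -> torch.Tensor:
    """Each rank holds one consecutive part of the model.
    `hidden` must be preallocated with the correct shape, since dist.recv
    writes into an existing tensor."""
    if rank > 0:
        dist.recv(hidden, src=rank - 1)   # receive activations from the previous host
    hidden = part(hidden)
    if rank < world_size - 1:
        dist.send(hidden, dst=rank + 1)   # pass activations to the next host
    return hidden
```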

Example: fitting 3 7b models on 2 24GB GPUs

Most 7B models require around 15 GB of memory. If you have two GPUs with 24 GB each, you can fit three 7B models on them by splitting each model into two halves and loading one half on each GPU.
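The back-of-the-envelope arithmetic (model and attribute names here are purely illustrative):

```python
# Assuming ~15 GB per 7B model, i.e. ~7.5 GB per half:
placement = {
    "cuda:0": ["model_a.half_0", "model_b.half_0", "model_c.half_0"],
    "cuda:1": ["model_a.half_1", "model_b.half_1", "model_c.half_1"],
}
# Per GPU: 3 halves * ~7.5 GB = ~22.5 GB, which fits within 24 GB.
```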

All microlibs have minimal third-party dependencies and few abstractions in the public API, which means you can combine them with frameworks like Hugging Face Transformers, xformers, etc.

List of microlibs

  1. llm_sampler
  2. llm_sepweight
  3. llm_falcon_model

LLM sampler

Install with:

```
pip install llm_sampler
```


llm_sampler allows you to sample from any LLM, given its logits.

It is a collection of various sampling techniques found online.

For now, the methods are:

  1. sample_huggingface - the one used in transformers
  2. sample_gpt_fast - the one used in gpt-fast
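To illustrate what sampling from logits involves (this is plain PyTorch, not the llm_sampler API; see the docs linked below for the real entry points), a typical temperature + top-k sampler looks like this:

```python
import torch

def sample_top_k(logits: torch.Tensor, temperature: float = 0.8, k: int = 50) -> int:
    """Sample one token id from a 1-D tensor of logits."""
    logits = logits / temperature
    top = torch.topk(logits, k)                    # keep only the k most likely tokens
    probs = torch.softmax(top.values, dim=-1)      # renormalize over the top-k
    idx = torch.multinomial(probs, num_samples=1)  # draw one sample
    return top.indices[idx].item()
```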

Read more

LLM sepweight

Install with:

```
pip install llm_sepweight
```


The llm_sepweight microlib is designed to manage the weights of large language models (LLMs) by organizing them into directories.

It allows you to store and distribute the weights of an LLM as normal files for each layer.
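A hypothetical layout (directory names here are illustrative; the exact format is documented in the package):

```
weights-dir/
├── embedding/      # input embeddings
├── layers/
│   ├── 0/          # one directory per transformer layer
│   ├── 1/
│   └── ...
└── head/           # final norm and LM head
```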

Read more

LLM Falcon model

Install with:

```
pip install llm_falcon_model
```


llm_falcon_model allows you to run part of a Falcon model as a standalone PyTorch module. This enables distributed runs, even on old GPUs with limited memory.
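Conceptually (a generic sketch, not the llm_falcon_model API), a "part" is just a contiguous slice of decoder layers wrapped as a module that maps hidden states to hidden states:

```python
import torch.nn as nn

class PartialModel(nn.Module):
    """Wraps a contiguous slice of decoder layers as a standalone module."""

    def __init__(self, layers: list[nn.Module]):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, hidden):
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden
```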

Read more

Future work (in the next week)

  • Serialization/deserialization of KV-cache
  • Release llm_llama2_model

Future work (in the next month)

  • Release llm_qwen_model
  • Release llm_goliath_model
  • Release llm_yi_model
  • Release llm_mistral_model and future bigger models by Mistral
  • Integrate deepseek models
  • Explore hand-crafted pipeline parallelism
  • Speculative decoding
  • Support gpt-fast for llama models
  • Make a write-up of an example distributed run

... and many more!

Thank you for your interest! If you like our work, please consider leaving a star and sharing it with your friends.
