pgosar/fast-mamba-inference

To build and run:

  1. Create a build directory
  2. vcpkg install
  3. python3 download_models.py
  4. ./build.sh

Command-line arguments will control inference options, for example the quantization level, debugging verbosity, and input prompt.
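As a rough sketch of what that interface could look like (the flag names below are hypothetical placeholders, not the project's actual CLI):

```cpp
// Hypothetical flag names -- illustrative only, not the project's actual CLI.
#include <string>

struct Options {
  int quant_bits = 16;             // quantization level
  int verbosity = 0;               // debugging verbosity
  std::string prompt;              // input prompt
  std::string config = "default";  // named configuration in model_config.yaml
};

Options parse_args(int argc, char** argv) {
  Options opts;
  for (int i = 1; i + 1 < argc; ++i) {
    std::string arg = argv[i];
    if (arg == "--quant") opts.quant_bits = std::stoi(argv[++i]);
    else if (arg == "--verbose") opts.verbosity = std::stoi(argv[++i]);
    else if (arg == "--prompt") opts.prompt = argv[++i];
    else if (arg == "--config") opts.config = argv[++i];
  }
  return opts;
}
```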

Model configuration will be done through model_config.yaml, for example temperature (text diversity), the amount of generated text, and batch size. There may be multiple selectable configurations, chosen through the command-line arguments.
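A hypothetical model_config.yaml along these lines (the keys and configuration names are illustrative, not the project's actual schema):

```yaml
# Hypothetical schema -- the key and configuration names are illustrative.
default:
  temperature: 0.8  # higher values increase text diversity
  max_tokens: 256   # amount of text to generate
  batch_size: 1

creative:
  temperature: 1.2
  max_tokens: 512
  batch_size: 1
```

A configuration would then be picked at the command line, e.g. via the hypothetical --config creative.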

TODO

  • Initial C++ implementation

  • Quantization

  • 1-bit weight experimentation

  • Speculative decoding

    • Draft model fine-tuning for Jamba
  • Flash memory

    • Neuron activation data
    • Hot and cold neuron prediction
  • Matrix multiplication optimization and overall optimization

Helpful references:

Models:

Jamba

Mamba Variants

Model Configuration:

https://ivibudh.medium.com/a-guide-to-controlling-llm-model-output-exploring-top-k-top-p-and-temperature-parameters-ed6a31313910
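For reference, a minimal sketch of the temperature and top-k knobs that article covers, applied to raw logits (a toy, not project code):

```cpp
// Temperature + top-k sampling over raw logits. Assumes 1 <= k <= logits.size().
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(std::vector<float> logits, float temperature, int k,
                 std::mt19937& rng) {
  // Temperature scaling: <1 sharpens the distribution, >1 flattens it.
  for (float& l : logits) l /= temperature;

  // Indices sorted by logit, descending; keep only the top k.
  std::vector<int> idx(logits.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return logits[a] > logits[b]; });

  // Softmax over the surviving k logits (max-subtracted for stability).
  std::vector<float> probs(k);
  float max_l = logits[idx[0]];
  float sum = 0.f;
  for (int i = 0; i < k; ++i) {
    probs[i] = std::exp(logits[idx[i]] - max_l);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;

  std::discrete_distribution<int> dist(probs.begin(), probs.end());
  return idx[dist(rng)];  // sampled token id
}
```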

Implementations:

Implementation of some optimization techniques:

https://github.com/MDK8888/GPTFast/tree/master

Mamba LLM:

https://github.com/redotvideo/mamba-chat

Using ReLU instead of SiLU (Mamba's default):

https://arxiv.org/abs/2310.04564
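The relevant contrast, side by side; ReLU's exact zeros are what make the activation sparsity the paper exploits possible:

```cpp
#include <cmath>

// SiLU (a.k.a. swish), Mamba's default: smooth, and almost never exactly zero.
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// ReLU: outputs exact zeros for x <= 0, so downstream work on those
// neurons can be skipped entirely at inference time.
inline float relu(float x) { return x > 0.0f ? x : 0.0f; }
```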

Flash memory:

https://arxiv.org/abs/2312.11514
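A toy sketch of the paper's core idea, assuming a flat binary file of weight rows (the file layout and the lack of eviction are simplifications): frequently active "hot" rows stay cached in DRAM, cold rows are read from flash only on demand.

```cpp
// Toy DRAM cache over flash-resident weight rows. No eviction, no error
// handling -- purely illustrative.
#include <cstdio>
#include <unordered_map>
#include <vector>

class RowCache {
 public:
  RowCache(const char* path, std::size_t row_floats)
      : file_(std::fopen(path, "rb")), row_floats_(row_floats) {}
  ~RowCache() { if (file_) std::fclose(file_); }

  // Returns weight row i, loading it from flash on a cache miss.
  const std::vector<float>& row(std::size_t i) {
    auto it = cache_.find(i);
    if (it != cache_.end()) return it->second;  // hot: already in DRAM
    std::vector<float> buf(row_floats_);
    std::fseek(file_, static_cast<long>(i * row_floats_ * sizeof(float)), SEEK_SET);
    std::size_t got = std::fread(buf.data(), sizeof(float), row_floats_, file_);
    (void)got;  // toy code: assume the read succeeds
    return cache_.emplace(i, std::move(buf)).first->second;
  }

 private:
  std::FILE* file_;
  std::size_t row_floats_;
  std::unordered_map<std::size_t, std::vector<float>> cache_;
};
```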

Speculative Streaming:

https://arxiv.org/abs/2402.11131

Speculative Decoding:

https://arxiv.org/abs/2211.17192
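A single-token toy of the paper's accept/reject rule, with fixed distributions standing in for real models (the full algorithm drafts several tokens and verifies them in one target-model pass):

```cpp
// A cheap draft model proposes a token from q; the target model's p either
// accepts it with probability min(1, p/q) or resamples from the residual
// max(0, p - q). The result is distributed exactly as if sampled from p.
#include <algorithm>
#include <random>
#include <vector>

int speculative_step(const std::vector<float>& p,  // target probs
                     const std::vector<float>& q,  // draft probs
                     std::mt19937& rng) {
  std::discrete_distribution<int> draft(q.begin(), q.end());
  int tok = draft(rng);  // draft model's proposal

  std::uniform_real_distribution<float> u(0.f, 1.f);
  if (u(rng) < std::min(1.0f, p[tok] / q[tok]))
    return tok;  // accepted

  // Rejected: resample from the normalized residual max(0, p - q).
  std::vector<float> residual(p.size());
  for (std::size_t i = 0; i < p.size(); ++i)
    residual[i] = std::max(0.0f, p[i] - q[i]);
  std::discrete_distribution<int> correction(residual.begin(), residual.end());
  return correction(rng);
}
```

Because accepted tokens are provably distributed as if sampled from the target model, the speedup costs nothing in output quality.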

1 bit model variant:

https://arxiv.org/abs/2402.17764
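A sketch of the absmean quantization that paper describes: scale each weight by the tensor's mean absolute value, then round and clamp to {-1, 0, +1}:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct TernaryTensor {
  std::vector<int8_t> w;  // values in {-1, 0, +1}
  float scale;            // gamma, folded back in at matmul time
};

TernaryTensor quantize_ternary(const std::vector<float>& weights) {
  // gamma = mean absolute weight.
  float gamma = 0.f;
  for (float v : weights) gamma += std::fabs(v);
  gamma /= static_cast<float>(weights.size());

  TernaryTensor t{std::vector<int8_t>(weights.size()), gamma};
  for (std::size_t i = 0; i < weights.size(); ++i) {
    float r = std::round(weights[i] / (gamma + 1e-8f));
    t.w[i] = static_cast<int8_t>(std::clamp(r, -1.0f, 1.0f));
  }
  return t;
}
```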

Quantization:

https://github.com/state-spaces/mamba/issues/133 (only quantize nn.Linear)

https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization

https://leimao.github.io/article/Neural-Networks-Quantization/
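A minimal sketch of the simplest variant those links cover, symmetric per-tensor int8 post-training quantization, applied to a linear layer's weights (the "only quantize nn.Linear" approach from the issue above):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedWeights {
  std::vector<int8_t> q;
  float scale;  // real_value ~= scale * q
};

QuantizedWeights quantize_int8(const std::vector<float>& w) {
  // Symmetric per-tensor scale: map the largest magnitude to 127.
  float max_abs = 0.f;
  for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
  float scale = max_abs / 127.0f;
  if (scale == 0.f) scale = 1.f;  // all-zero weights edge case

  QuantizedWeights out{std::vector<int8_t>(w.size()), scale};
  for (std::size_t i = 0; i < w.size(); ++i)
    out.q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
  return out;
}

// Dequantize one element; in practice the scale is folded into the matmul output.
inline float dequantize(const QuantizedWeights& qw, std::size_t i) {
  return qw.scale * static_cast<float>(qw.q[i]);
}
```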

Fast matrix mult:

https://coffeebeforearch.github.io/2020/06/23/mmul.html

https://justine.lol/matmul/
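The linked posts cover, among other things, cache blocking (tiling); a minimal sketch of that technique for square row-major matrices:

```cpp
// Operate on BLOCK x BLOCK tiles so the working set stays in cache instead
// of streaming whole rows and columns per output element.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 64;  // tune to cache size

// C += A * B for row-major N x N matrices.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N) {
  for (std::size_t ii = 0; ii < N; ii += BLOCK)
    for (std::size_t kk = 0; kk < N; kk += BLOCK)
      for (std::size_t jj = 0; jj < N; jj += BLOCK)
        // i-k-j inner order keeps B's rows streaming sequentially,
        // which vectorizes and prefetches well.
        for (std::size_t i = ii; i < std::min(ii + BLOCK, N); ++i)
          for (std::size_t k = kk; k < std::min(kk + BLOCK, N); ++k) {
            float a = A[i * N + k];
            for (std::size_t j = jj; j < std::min(jj + BLOCK, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}
```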
