To run:
- Create a build directory
- `vcpkg install`
- `python3 download_models.py`
- `./build.sh`
Command-line arguments control inference settings, for example the quantization level, debugging verbosity, and input prompt.
Model configuration is done through `model_config.yaml`, covering, for example, temperature (text diversity), the amount of generated text, and batch size. There may be multiple selectable configurations; the active one is chosen via a command-line argument.
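A minimal sketch of what `model_config.yaml` could look like. The key names and the named-configurations layout are illustrative assumptions, not a final schema:

```yaml
# Hypothetical layout: several named configurations, one of which is
# selected from the command line (e.g. `--config creative`).
configurations:
  default:
    temperature: 0.7      # text diversity
    max_new_tokens: 256   # amount of generated text
    batch_size: 1
  creative:
    temperature: 1.1
    max_new_tokens: 512
    batch_size: 1
```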
Roadmap:
- Initial C++ implementation
- Quantization
  - 1-bit weight experimentation
- Speculative decoding
  - Draft model fine-tuning for Jamba
- Flash memory offloading
  - Neuron activation data
  - Hot and cold neuron prediction
- Matrix multiplication and overall optimization
  - Implementation of some optimization techniques:
https://github.com/MDK8888/GPTFast/tree/master
References:
- Mamba LLM: https://github.com/redotvideo/mamba-chat
- https://arxiv.org/abs/2310.04564
- https://arxiv.org/abs/2312.11514
- https://arxiv.org/abs/2402.11131
- https://arxiv.org/abs/2211.17192
- https://arxiv.org/abs/2402.17764
- state-spaces/mamba#133 (only quantize nn.Linear)
- https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization
- https://leimao.github.io/article/Neural-Networks-Quantization/