# BitNet.cpp

- https://arxiv.org/abs/2410.16144
- https://arxiv.org/pdf/2402.17764
- https://www.microsoft.com/en-us/research/publication/bitnet-scaling-1-bit-transformers-for-large-language-models/
- https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens
- https://github.com/microsoft/BitNet

  
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).

The first release of bitnet.cpp supports inference on CPUs. It achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater gains, and reduces energy consumption by 55.4% to 70.0%. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. bitnet.cpp can also run a 100B BitNet b1.58 model on a single CPU at speeds comparable to human reading (5-7 tokens per second), which makes running LLMs on local devices considerably more practical. See the technical report for more details.
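As a rough illustration of why a 100B model can fit on a single machine, the sketch below compares the storage needed for the weights alone at FP16 and at the ternary limit. It relies on the fact that a weight with three possible values carries log2(3) ≈ 1.58 bits. The helper name `weight_storage_gb` is hypothetical, and the ternary figure is an information-theoretic lower bound; it is not the size of any actual gguf file, which also stores scales, embeddings, and metadata.

```python
import math


def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# A ternary weight takes one of three values {-1, 0, +1}, so it carries
# log2(3) bits of information, which is where the "1.58" comes from.
TERNARY_BITS = math.log2(3)

NUM_PARAMS = 100e9  # the 100B model size mentioned above

fp16_gb = weight_storage_gb(NUM_PARAMS, 16)
ternary_gb = weight_storage_gb(NUM_PARAMS, TERNARY_BITS)

print(f"bits per ternary weight: {TERNARY_BITS:.3f}")  # 1.585
print(f"FP16 weights:    {fp16_gb:.1f} GB")             # 200.0 GB
print(f"ternary weights: {ternary_gb:.1f} GB (lower bound)")  # 19.8 GB
```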

```
@misc{,
      title={1.58-Bit LLM: A New Era of Extreme Quantization},
      author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
      year={2024},
}
```




![image.png](image.png)

# Quantization Function

https://medium.com/@isaakmwangi2018/microsoft-open-sources-1-bit-llms-run-100b-parameter-models-locally-with-bitnet-b1-58-f28aa8c4e702

## Scaling Weights

![image.png](scaling.png)

## Rounding and Clipping

![image.png](clip.png)
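
In text form, the scaling and rounding steps should correspond to the absmean quantization function from the BitNet b1.58 paper (arXiv:2402.17764, linked at the top). Here $W$ is an $n \times m$ weight matrix, $\gamma$ is its mean absolute value, and $\epsilon$ is a small constant for numerical stability:

$$
\gamma = \frac{1}{nm} \sum_{i,j} \lvert W_{ij} \rvert
$$

$$
\widetilde{W} = \mathrm{RoundClip}\!\left(\frac{W}{\gamma + \epsilon},\, -1,\, 1\right),
\qquad
\mathrm{RoundClip}(x, a, b) = \max\bigl(a,\ \min(b,\ \mathrm{round}(x))\bigr)
$$

Every entry of $\widetilde{W}$ is therefore one of $-1$, $0$, or $+1$.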

## Example

![image.png](1.png)

### Scaling factor

![image.png](2.png)

### Scale Matrix

![image.png](3.png)

### Apply RoundClip Function

![image.png](4.png)
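
The scale, round, and clip steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the kernel bitnet.cpp uses: the function name `absmean_quantize`, the `eps` value of `1e-5`, and the sample matrix are assumptions chosen for the example, not values taken from the figures.

```python
def absmean_quantize(weights, eps=1e-5):
    """Quantize a 2-D list of floats to ternary values {-1, 0, +1}.

    Returns (ternary_matrix, gamma), where gamma is the mean absolute weight.
    eps is an assumed small constant that avoids division by zero.
    """
    flat = [w for row in weights for w in row]
    gamma = sum(abs(w) for w in flat) / len(flat)  # scaling factor

    def round_clip(x, lo=-1, hi=1):
        # Round to the nearest integer, then clamp into [lo, hi].
        # Note: Python's round() sends exact .5 ties to the even integer.
        return max(lo, min(hi, round(x)))

    ternary = [[round_clip(w / (gamma + eps)) for w in row] for row in weights]
    return ternary, gamma


# Hypothetical 2x3 weight matrix, for illustration only.
W = [
    [0.8, -0.5, 0.1],
    [-1.2, 0.3, 0.05],
]

ternary, gamma = absmean_quantize(W)
print(round(gamma, 4))  # 0.4917
print(ternary)          # [[1, -1, 0], [-1, 1, 0]]
```

Large-magnitude weights saturate at ±1, while weights much smaller than the mean absolute value collapse to 0.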

# Install

#### Ubuntu / Debian
```
sudo apt update
sudo apt install build-essential libssl-dev -y

# cmake (built from source)
wget https://github.com/Kitware/CMake/releases/download/v3.28.1/cmake-3.28.1.tar.gz
tar -zxvf cmake-3.28.1.tar.gz
cd cmake-3.28.1
./bootstrap
make
sudo make install
cd ..

# clang
sudo apt install clang
```

```
# Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt

# Install the Hugging Face Hub CLI
pip install -U "huggingface_hub[cli]"

# Download the model from Hugging Face, convert it to quantized gguf format, and build the project
# These models were neither trained nor released by Microsoft. We used them to demonstrate the inference capabilities of bitnet.cpp
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Or you can manually download the model and run it using a local path
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s
```

# Inference

##### Ex: 1
```
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p \
"Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. \
Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0
```
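
When the prompt gets long, shell quoting becomes error-prone, so it can help to assemble the same command from Python. The sketch below only uses the `run_inference.py` flags that appear in this README (`-m`, `-p`, `-n`, `-temp`). The helper name `build_inference_command` and its default values are hypothetical, and the shortened prompt is just for illustration.

```python
import shlex


def build_inference_command(model_path, prompt, n_predict=6, temperature=0.0):
    """Return the argv list for run_inference.py using the flags shown in this README."""
    return [
        "python", "run_inference.py",
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),
        "-temp", str(temperature),
    ]


cmd = build_inference_command(
    "models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf",
    "Mary went back to the garden. Where is Mary?\nAnswer:",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it from the BitNet checkout (needs the built project and model):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting issues entirely, since no shell parses the prompt.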
##### Ex: 2
```
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p \
"I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman.
Then I went again and bought 5 more apples and I ate 1 apple.
How many apples did I remain with? Let's think step by step\nAnswer:" -n 6 -temp 0.8 -c 4096
```