<a href="https://colab.research.google.com/github/joshuaalpuerto/ML-guide/blob/main/AirLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References
- https://medium.com/@lyo.gavin/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- https://huggingface.co/blog/lyogavin/airllm

In [1]:
!pip install -qU airllm --progress-bar off
# needed for compression
!pip install -qU bitsandbytes --progress-bar off

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


`NOTE:` Splitting the model consumes a lot of disk space. If you encounter an error, you may need to extend your disk space, clear the Hugging Face `.cache`, and rerun.
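
Since disk space is the likely failure point, it can help to check how much is free before splitting the model. The snippet below is a minimal sketch using only the standard library; `free_gib` and the `REQUIRED_GIB` threshold are hypothetical names and values chosen for illustration, not figures from AirLLM.

```python
import shutil


def free_gib(path="."):
    """Return the free disk space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


# Hypothetical threshold for illustration only; adjust it to the model you split.
REQUIRED_GIB = 30

available = free_gib(".")
if available < REQUIRED_GIB:
    print(f"Only {available:.1f} GiB free; consider clearing the Hugging Face cache.")
else:
    print(f"{available:.1f} GiB free.")
```

In Colab, pass the mounted Drive path (for example `/content/drive/MyDrive`) instead of `"."` to check the location that will hold the cache.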

In [3]:
import os
# Move the cache to our drive because we need more space
os.environ['HF_HOME'] = '/content/drive/MyDrive/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/content/drive/MyDrive/huggingface/models'

- The optimization is about making it possible to run an LLM with limited resources.
    - The trade-off is that inference is slow.
- Although it does not use quantization, you can still apply compression (via bitsandbytes) if you want to speed up inference.
    - More info [here](https://github.com/lyogavin/Anima/tree/main/air_llm#how-model-compression-here-is-different-from-quantization)
    - How compression differs from quantization:
        - Quantization normally needs to quantize both `weights` and `activations` to really speed things up, which makes it harder to maintain accuracy and avoid the impact of outliers across all kinds of inputs.
        - AirLLM compresses only the `weights`.
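
To make the weights-only idea concrete, here is a rough sketch in plain Python. It is an illustration under simplifying assumptions (per-row absmax scaling to 4-bit integer codes), not the actual AirLLM or bitsandbytes implementation; `quantize_block`, `dequantize_block`, and `matvec` are hypothetical helpers. The weights are stored in compressed form and restored to floats before use, while the activations stay in full precision throughout.

```python
def quantize_block(weights, bits=4):
    """Absmax-quantize one block of float weights to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # largest code magnitude, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Restore approximate float weights from integer codes."""
    return [c * scale for c in codes]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


weights = [[0.7, -0.35, 0.1], [-0.2, 0.05, 0.6]]
activations = [1.0, 2.0, -1.0]  # never quantized

# Compress each row of weights (this is what would be stored on disk).
packed = [quantize_block(row) for row in weights]

# Decompress at load time, then compute with full-precision activations.
restored = [dequantize_block(codes, scale) for codes, scale in packed]

exact = matvec(weights, activations)
approx = matvec(restored, activations)
print("exact: ", exact)
print("approx:", approx)
```

The two results differ only by the rounding error introduced when the weights were compressed; nothing about the inputs was approximated.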

In [None]:
from airllm import AirLLMMistral

MAX_LENGTH = 128
# Pass a Hugging Face model repo ID:
model = AirLLMMistral("mistralai/Mistral-7B-Instruct-v0.1", compression='4bit')

# or use the model's local path (AirLLMLlama2 must also be imported from airllm)...
#model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

In [None]:
MISTRAL_PROMPT_1 = """<s>[INST] You are a friendly chatbot who always responds in the style of a pirate.
How many helicopters can a human eat in one sitting?
[/INST]
"""

ZEPHYR_PROMPT_1 = """<|system|>
You are a friendly chatbot who always responds in the style of a pirate.</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
"""

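# Only the Mistral prompt is used here; ZEPHYR_PROMPT_1 is kept for Zephyr-style models.
input_text = [MISTRAL_PROMPT_1]

# Tokenize the prompt, truncating it to MAX_LENGTH tokens
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
)

# Generate on the GPU; max_new_tokens caps the reply length (20 is an arbitrary choice)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

# Decode the first (and only) generated sequence back to text
output = model.tokenizer.decode(generation_output.sequences[0])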


In [None]:
print(output)