Feature request
It seems we now have support for loading models with 4-bit quantization starting from bitsandbytes>=0.39.0.
Link: FP4 Quantization
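For reference, a minimal sketch of what this enables through the transformers integration (assuming transformers >= 4.30 and bitsandbytes >= 0.39.0; the model name is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (FP4) quantization config exposed by transformers on top of bitsandbytes
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                # placeholder model
    quantization_config=quant_config,
    device_map="auto",
)
```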
Motivation
Running really large language models on smaller GPUs.
Your contribution
The plan would be to upgrade the bitsandbytes package and provide an ENV variable that controls which quantization method is used when running the server; a rough sketch follows below.
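A rough sketch of how the server could map such a variable to a quantization config. The `QUANTIZATION` variable name and its values are assumptions for illustration, not an existing interface:

```python
import os
import torch
from transformers import BitsAndBytesConfig

def quantization_config_from_env():
    # Read the (hypothetical) QUANTIZATION env var, e.g. "4bit-fp4", "4bit-nf4", "8bit".
    mode = os.environ.get("QUANTIZATION", "none").lower()
    if mode in ("4bit-fp4", "4bit-nf4"):
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=mode.split("-")[1],   # "fp4" or "nf4"
            bnb_4bit_compute_dtype=torch.float16,
        )
    if mode == "8bit":
        return BitsAndBytesConfig(load_in_8bit=True)
    return None  # no quantization requested
```

The returned config would then be passed as `quantization_config` to `from_pretrained` when the server loads the model, as in the snippet above.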