# SparseQJL Demo
This notebook walks through the experiments we ran so that you can reproduce them. For experiments with Llama models, we recommend an `A100` runtime so that enough memory is available.

In [1]:
%%capture
!pip install -q datasets
!pip install -q transformers
!pip install triton
!pip install flash_attn

In [2]:
%%capture
!git clone https://github.com/kentjliu/sparseqjl.git

In [3]:
%cd sparseqjl

/content/sparseqjl


In [4]:
# Set your Hugging Face token here to load models. For Llama models, you must first request access from Meta.
import os
os.environ["HF_TOKEN"] = "YOUR_HUGGINGFACE_TOKEN"

## Build kernels (~5 min)

In [5]:
%%capture
!python qjl_kernel/setup.py build_ext --inplace --build-temp=./qjl_kernel/build --build-lib=./qjl_kernel

Check that the kernels were built in the correct directory. The following files should appear in `qjl_kernel`:
* `cuda_qjl_gqa_score.cpython-310-x86_64-linux-gnu.so*`
* `cuda_qjl_quant.cpython-310-x86_64-linux-gnu.so*`
* `cuda_qjl_score.cpython-310-x86_64-linux-gnu.so*`
* `quantization.cpython-310-x86_64-linux-gnu.so*`
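
The same check can be done programmatically. The following is a minimal sketch, not part of the repository: the helper name `missing_kernels` is hypothetical, and it assumes each built extension is named `<module>.<abi-tag>.so`, as in the list above.

```python
from pathlib import Path

# Module names taken from the file list above; the ABI tag
# (e.g. cpython-310-x86_64-linux-gnu) varies with the Python build.
EXPECTED_MODULES = [
    "cuda_qjl_gqa_score",
    "cuda_qjl_quant",
    "cuda_qjl_score",
    "quantization",
]

def missing_kernels(kernel_dir="qjl_kernel"):
    """Return the expected module names that have no matching .so file in kernel_dir."""
    root = Path(kernel_dir)
    if not root.is_dir():
        # Nothing was built here, so every module is missing.
        return list(EXPECTED_MODULES)
    return [name for name in EXPECTED_MODULES if not list(root.glob(f"{name}.*.so"))]

missing = missing_kernels()
print("All kernels found" if not missing else f"Missing kernels: {missing}")
```

If any module is reported missing, rerun the build cell before continuing.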

In [8]:
%ls qjl_kernel

[0m[01;34mbuild[0m/                                               matmul.py
[01;34mcsrc[0m/                                                new_pack.py
[01;32mcuda_qjl_gqa_score.cpython-310-x86_64-linux-gnu.so[0m*  qjl_kernel.py
[01;32mcuda_qjl_quant.cpython-310-x86_64-linux-gnu.so[0m*      [01;32mquantization.cpython-310-x86_64-linux-gnu.so[0m*
[01;32mcuda_qjl_score.cpython-310-x86_64-linux-gnu.so[0m*      setup.py


## Test SparseQJL on Llama
Params:

* `model_name`: String denoting HuggingFace Llama model path to test. Default: `meta-llama/Llama-2-7b-hf`
* `qjl`: Boolean flag denoting whether or not to apply QJL.
* `sparsity`: Float between 0 and 1 denoting the uniform sparsity ratio applied with SparseGPT (e.g. `0.5` for 50% sparsity). Default: `0.0`
* `wbits`: Int denoting the bit-width for weight quantization. We suggest using a value of `4`. Default: `16` (No quant)
* `dtype`: String denoting standard datatype of model. Options are `float16` and `float32`. Default: `float16`.
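
To make the flag types and defaults concrete, here is a minimal `argparse` sketch that mirrors the parameter list above. It is an illustration only: the actual parser in `llama_sparseqjl.py` may be organized differently, and the helper names `build_parser` and `parse` are hypothetical.

```python
import argparse

def build_parser():
    # Illustrative parser mirroring the documented flags; not the repo's own code.
    parser = argparse.ArgumentParser(description="SparseQJL-style options (sketch)")
    parser.add_argument("--model_name", type=str, default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--qjl", action="store_true", help="apply QJL when set")
    parser.add_argument("--sparsity", type=float, default=0.0)
    parser.add_argument("--wbits", type=int, default=16, help="16 means no quantization")
    parser.add_argument("--dtype", choices=["float16", "float32"], default="float16")
    return parser

def parse(argv):
    """Parse argv and reject sparsity values outside [0, 1]."""
    args = build_parser().parse_args(argv)
    if not 0.0 <= args.sparsity <= 1.0:
        raise ValueError(f"sparsity must be between 0 and 1, got {args.sparsity}")
    return args

args = parse(["--qjl", "--sparsity", "0.5", "--wbits", "4"])
print(args)
```

Because `--qjl` is a boolean flag, it takes no value: include it to enable QJL and omit it to disable it.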

Note:
* `meta-llama/Llama-2-7b-hf`: takes about 20 minutes to run with pruning
* `meta-llama/Llama-2-13b-hf`: takes about 35 minutes to run with pruning

In [45]:
!python llama_sparseqjl.py --model_name "meta-llama/Llama-2-7b-hf" \
    --qjl \
    --sparsity 0.5 \
    --wbits 4 \
    --dtype "float16"

2024-12-21 02:53:41.418481: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-21 02:53:41.436308: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-21 02:53:41.457902: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-21 02:53:41.464421: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-21 02:53:41.479677: I tensorflow/core/platform/cpu_feature_guar

## Test SparseQJL on OPT
Params:

* `model_name`: String denoting HuggingFace OPT model path to test. Default: `facebook/opt-125m`
* `qjl`: Boolean flag denoting whether or not to apply QJL.
* `sparsity`: Float between 0 and 1 denoting the uniform sparsity ratio applied with SparseGPT (e.g. `0.5` for 50% sparsity). Default: `0.0`
* `wbits`: Int denoting the bit-width for weight quantization. We suggest using a value of `4`. Default: `16` (No quant)
* `dtype`: String denoting standard datatype of model. Options are `float16` and `float32`. Default: `float16`.

Note:

* Our implementation of SparseQJL on OPT is not yet fully refined, which is why the perplexity scores are extremely high. We will continue working on the issue and update this repo.
* When prompted to run custom code, answer `y` in the CLI.
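
To compare several settings, the documented flags can be assembled into commands programmatically. The sketch below only builds and prints the command lines; the helper `build_command` and the sparsity values swept are assumptions for illustration, not part of the repository.

```python
import shlex

def build_command(script, model_name, sparsity, wbits, qjl=True, dtype="float16"):
    """Assemble an argument list using only the flags documented above."""
    cmd = ["python", script, "--model_name", model_name]
    if qjl:
        cmd.append("--qjl")  # boolean flag: present means QJL is applied
    cmd += ["--sparsity", str(sparsity), "--wbits", str(wbits), "--dtype", dtype]
    return cmd

# Hypothetical sweep over sparsity levels; adjust the values as needed.
for sparsity in (0.0, 0.25, 0.5):
    cmd = build_command("opt_sparseqjl.py", "facebook/opt-350m", sparsity, 4)
    print(shlex.join(cmd))
```

Each printed line can be pasted into a cell with a leading `!`, or the list form can be passed to `subprocess.run`.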

In [43]:
!python opt_sparseqjl.py --model_name "facebook/opt-350m" \
    --qjl \
    --sparsity 0.5 \
    --wbits 4 \
    --dtype "float16"

2024-12-21 02:45:47.624566: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-21 02:45:47.641813: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-21 02:45:47.662526: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-21 02:45:47.668804: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-21 02:45:47.683915: I tensorflow/core/platform/cpu_feature_guar