# **CodeGen: End to End Demo**

In this guide, we walk through how to use Neural Magic's stack to sparsify a text generation model and run LLM inference with it, using [`Salesforce/codegen-350M-mono`](https://huggingface.co/Salesforce/codegen-350M-mono) as the example.

There are a few steps:
- Installation
- Export to ONNX
- Apply One-Shot Pruning and Quantization
- Evaluate Accuracy
- Inject KV Cache
- Run Inference with DeepSparse

## **Installation**

Install Sparsify and DeepSparse:

In [None]:
%pip install sparsify-nightly==1.6.0.20230817
%pip install "deepsparse-nightly[transformers]==1.6.0.20230817" --upgrade
%pip install torch --index-url https://download.pytorch.org/whl/cpu
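
Before moving on, it can help to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are the ones used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip commands in this guide
for name in ("sparsify-nightly", "deepsparse-nightly", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line prints `not installed`, rerun the corresponding install cell before continuing.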

Authenticate via the CLI token:

In [None]:
!sparsify.login YOUR_CLI_TOKEN

## **ONNX Export**



Start by downloading and exporting the model to ONNX.

In [4]:
!git clone https://huggingface.co/Salesforce/codegen-350M-mono

Cloning into 'codegen-350M-mono'...
remote: Enumerating objects: 37, done.
remote: Total 37 (delta 0), reused 0 (delta 0), pack-reused 37
Unpacking objects: 100% (37/37), 1.09 MiB | 6.33 MiB/s, done.


In [None]:
!sparseml.transformers.export_onnx \
    --model_path ./codegen-350M-mono \
    --task text-generation \
    --sequence_length 256

In [7]:
%mv ./deployment ./dense-fp32
%ls ./dense-fp32

config.json  model.onnx               tokenizer_config.json  vocab.json
merges.txt   special_tokens_map.json  tokenizer.json


## **Apply One-Shot**

We will next optimize the model by applying pruning and quantization, using Sparsify, Neural Magic's model optimization toolkit.

For compressing LLMs, we will use a post-training algorithm called `FastOBCQ`, which we can apply using the `sparsify.run one-shot` pathway.

### **Format Dataset**

`FastOBCQ` uses calibration data during the pruning and quantization process. In this case, we will use the [`codeparrot/apps`](https://huggingface.co/datasets/codeparrot/apps) dataset as the calibration data.

The Sparsify One-Shot pathway requires preprocessed data to be passed as a folder holding `.npz` files. Using about 1000 samples is generally enough.

Run the following to pre-process the dataset:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import json

SAMPLES = 1000
SEQUENCE_LENGTH = 256

model_path = "./codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

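dataset = load_dataset("codeparrot/apps")


def preprocess_function(examples):
    # Concatenate every reference solution for the problem into one string
    text = ""
    for solution in json.loads(examples["solutions"]):
        text += solution
        text += "\n"

    # Truncate to SEQUENCE_LENGTH tokens; shorter samples are filtered out below
    return tokenizer(text, truncation=True, max_length=SEQUENCE_LENGTH, return_tensors="pt")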

# Draw 100 extra samples as headroom, since the filter below drops
# sequences shorter than SEQUENCE_LENGTH
ds_sampled = dataset["train"].shuffle(seed=42).select(range(SAMPLES + 100)).with_format("torch")
tokenized_dataset = ds_sampled.map(
    preprocess_function,
    batched=False,
).filter(lambda example: example["input_ids"].shape[1] == SEQUENCE_LENGTH).select(range(SAMPLES))

print(tokenized_dataset["input_ids"].shape)

Run the following to format and save the data as NPZ files.

In [14]:
!mkdir data

In [None]:
import numpy as np
import torch
from torch import Tensor

# numpy exporter helper
class NumpyExportWrapper(torch.nn.Module):
    def __init__(self, model):
        super(NumpyExportWrapper, self).__init__()
        self.model = model
        self.model.eval()  # Set model to evaluation mode
        self.numpy_data = []

    def forward(self, *args, **kwargs):
        with torch.no_grad():
            inputs = {}
            batch_size = 0

            for index, arg in enumerate(args):
                if isinstance(arg, Tensor):
                    inputs[f"input_{index}"] = arg
                    batch_size = arg.shape[0]

            for key, val in kwargs.items():
                if isinstance(val, Tensor):
                    inputs[key] = val
                    batch_size = val.shape[0]

            start_index = len(self.numpy_data)
            for _ in range(batch_size):
                self.numpy_data.append({})

            for input_key in inputs:
                for idx, sample in enumerate(inputs[input_key]):
                    self.numpy_data[start_index + idx][input_key] = sample

            # uncomment if you want to inspect the results, but slows us down
            # return self.model(*args, **kwargs)

    def save(self, path: str = "data"):
        for index, item in enumerate(self.numpy_data):
            npz_file_path = f'{path}/input{str(index).zfill(4)}.npz'
            np.savez(npz_file_path, **item)

        print(f'Saved {len(self.numpy_data)} npz files to {path}')


# wrap model with numpy exporter
model = NumpyExportWrapper(model)

# format as numpy
for data in tokenized_dataset:
    input_ids = data["input_ids"]
    attention_mask = data["attention_mask"]
    model(input_ids=input_ids, attention_mask=attention_mask)

# save to ./data
model.save()
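
As a quick sanity check on the exported files, the sketch below lists the array names stored in each `.npz` file. It relies on the fact that an `.npz` file is a zip archive with one `.npy` member per array, so the standard library is enough. The helper names are hypothetical, and the expected input names (`input_ids`, `attention_mask`) are assumed from the keyword arguments passed to the wrapper above.

```python
import zipfile
from pathlib import Path

# Assumed from the keyword arguments used when calling the wrapped model
EXPECTED_INPUTS = {"input_ids", "attention_mask"}


def npz_array_names(npz_path):
    """Return the array names stored in one .npz file (a zip of .npy members)."""
    with zipfile.ZipFile(npz_path) as archive:
        return {
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        }


def find_bad_samples(data_dir="data", expected=EXPECTED_INPUTS):
    """List .npz files whose array names differ from the expected inputs."""
    return [
        path.name
        for path in sorted(Path(data_dir).glob("*.npz"))
        if npz_array_names(path) != set(expected)
    ]
```

Calling `find_bad_samples("data")` should return an empty list when every file holds exactly the two expected arrays. Checking shapes would require loading the arrays with `np.load`.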

### **Run FastOBCQ**

With the data set up, we are ready to apply `FastOBCQ` using Sparsify One-Shot.

First, we will create a Recipe to run `FastOBCQ`. Recipes specify the algorithm to apply as well as the hyperparameters to use during the pruning and quantization process. For CodeGen-350M-Mono, there is a [premade recipe](https://sparsezoo.neuralmagic.com/models/codegen_mono-350m-bigpython_bigquery_thepile-pruned50_quantized?hardware=deepsparse-c6i.12xlarge&comparison=codegen_mono-350m-bigpython_bigquery_thepile-base&tab=3) available in the SparseZoo:

```yaml
# recipe.yaml
!FastOBCQModifier
  target_sparsity: 0.5
  block_size: 128
  layers_per_gpu: 8
  supported_ops: ['MatMul']
  omit_edge_layers: True
  mse:
      norm: 2.4
      grid: 100
      max_shrink: 0.8
  scheme:
      input_activations:
          num_bits: 8
          symmetric: False
      weights:
          num_bits: 8
          symmetric: True
  scheme_overrides:
      Gemm:
          input_activations:
              num_bits: 8
              symmetric: True
  ignore: ['ReduceMean', 'Tanh', 'Softmax', 'Equal', 'Pow', 'Add', 'Sub', 'Div', 'Neg', 'Softmax', 'ConstantOfShape', 'Constant', 'Sqrt', 'Mul', 'Gather']
  quantize_non_obq_weights: False
  
```
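
To make the `num_bits` and `symmetric` settings in the `scheme` block more concrete, here is a minimal sketch of plain uniform 8-bit quantization. It is a generic illustration, not the procedure `FastOBCQ` itself runs (the `mse` settings in the recipe suggest the real range selection is more involved), and all function names are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Signed quantization with the zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return quantized, scale, 0


def quantize_asymmetric(values, num_bits=8):
    """Unsigned quantization with a zero point shifted to cover [min, max]."""
    qmin, qmax = 0, 2 ** num_bits - 1  # 0..255 for 8 bits
    low = min(min(values), 0.0)   # the range must include 0.0
    high = max(max(values), 0.0)
    scale = (high - low) / (qmax - qmin) if high > low else 1.0
    zero_point = round(qmin - low / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floating-point values."""
    return [(q - zero_point) * scale for q in quantized]


values = [-1.0, -0.5, 0.0, 0.25, 2.0]
for name, fn in (("symmetric", quantize_symmetric), ("asymmetric", quantize_asymmetric)):
    q, scale, zero_point = fn(values)
    print(name, q, f"scale={scale:.5f}", f"zero_point={zero_point}")
```

The symmetric form keeps zero at integer 0, which suits weights that are roughly centered on zero. The asymmetric form spends the full integer range on the observed minimum and maximum, which suits activations with a skewed range.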

Save the recipe to a YAML file called `recipe.yaml`. Apply the recipe with the following:

In [None]:
!sparsify.run one-shot \
    --use-case text-generation \
    --model ./dense-fp32/model.onnx \
    --data ./data \
    --recipe ./recipe.yaml

This will take a couple of hours to run. The resulting ONNX model will be saved in a directory called `./deployment`.

Let's copy over the tokenizer and configuration files from `./dense-fp32`:

In [None]:
%cp -r dense-fp32 50sparse-int8
%mv deployment/model.onnx 50sparse-int8/model.onnx
%rm -rf deployment

%ls 50sparse-int8
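
Since later steps expect a complete deployment directory, a small check like the following can catch a missing file early. This is a sketch with a hypothetical helper name; the expected file names are taken from the `%ls ./dense-fp32` listing earlier in this guide.

```python
from pathlib import Path

# File names from the `%ls ./dense-fp32` output shown earlier
EXPECTED_FILES = (
    "config.json",
    "model.onnx",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "tokenizer.json",
)


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in the directory."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

For example, `missing_files("50sparse-int8")` should return an empty list once the model and tokenizer files are in place.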

## **Evaluate Accuracy**

We can evaluate the accuracy of the model using the `deepsparse.transformers.eval_downstream` CLI, which allows us to compute perplexity.

Run the following to evaluate the dense-fp32 model:

In [None]:
!deepsparse.transformers.eval_downstream ./dense-fp32 --dataset openai_humaneval

Run the following to evaluate the 50sparse-int8 model:

In [None]:
!deepsparse.transformers.eval_downstream ./50sparse-int8 --dataset openai_humaneval
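
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so lower is better. The sketch below shows the standard definition on a list of per-token log-probabilities; it is an illustration only and may differ in detail from how `eval_downstream` aggregates its results.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each observed token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)


# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among 4 options
uniform_guess = [math.log(0.25)] * 10
print(perplexity(uniform_guess))
```

When comparing the two runs above, a small increase in perplexity for the `50sparse-int8` model relative to `dense-fp32` indicates that little accuracy was lost during pruning and quantization.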

## **Inject KV Cache**

With validation complete, we can now inject the KV-caching mechanism into the ONNX graph to enable performant inference with DeepSparse.

Create a directory to hold the KV-cache version of the model by copying the existing files:

In [29]:
!cp -r 50sparse-int8 50sparse-int8-kvcache

Then inject the KV cache into the ONNX graph and save the result to the new directory with the following script:

In [None]:
import os

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "50sparse-int8/model.onnx"
output_file = "50sparse-int8-kvcache/model.onnx"

model = onnx.load(input_file, load_external_data=False)
model = KeyValueCacheInjector(os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
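
As background on why a KV cache speeds up generation, the toy sketch below compares single-head attention with and without cached keys and values. It is a conceptual illustration in pure Python, not what `KeyValueCacheInjector` does to the ONNX graph; the class and function names are hypothetical, and identity projections (key = value = query = token vector) are assumed to keep it short.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over a list of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(dim)
    ]


class ToyKVCache:
    """Keeps keys and values from earlier steps so each step adds one entry."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token_vec):
        # Only the new token's key and value are computed and appended
        self.keys.append(list(token_vec))
        self.values.append(list(token_vec))
        return attend(token_vec, self.keys, self.values)


def last_token_without_cache(token_vecs):
    """Rebuild keys and values for the whole prefix on every call."""
    keys = [list(vec) for vec in token_vecs]
    values = [list(vec) for vec in token_vecs]
    return attend(token_vecs[-1], keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = ToyKVCache()
for position, token in enumerate(tokens):
    cached = cache.step(token)
    recomputed = last_token_without_cache(tokens[: position + 1])
    print(position, cached == recomputed)
```

Both paths produce the same output for the newest token, but the cached path does work proportional to one new token per step instead of reprocessing the whole prefix.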

## **Run Inference With DeepSparse**

Now, we can run inference with DeepSparse using the following:

In [None]:
from deepsparse import Pipeline

pipeline = Pipeline.create(
    task="text-generation",
    model_path="./50sparse-int8-kvcache",
    max_generated_tokens=128,
)

prompt = "def fib(n):"
output = pipeline(sequences=prompt)
print(f"{prompt}{output.sequences[0]}")
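
To get a rough sense of generation latency, a generic timing helper like the one below can wrap the pipeline call. This is a minimal sketch with a hypothetical helper name; it works with any callable and is not part of the DeepSparse API.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times and return (last result, fastest time in seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return result, min(timings)


# Example with the pipeline and prompt defined above:
# output, seconds = time_call(pipeline, sequences=prompt)
# print(f"fastest of 3 runs: {seconds:.2f}s")
```

Taking the fastest of several runs reduces the effect of one-time warm-up costs on the measurement.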