<a href="https://colab.research.google.com/github/piesauce/llm-playbooks/blob/main/notebooks/Inference_Text_Summarization_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook is a first pass at a pipeline that runs an LLM against a summarization test / eval set and collects inference metrics, e.g. how much memory is used and how latency scales.

The goal is _not to measure statistical performance_, since principled evaluation for text generation is difficult and will probably be left to a later project, but to understand resource and latency tradeoffs.

Because this notebook is a first pass that will later be converted into a script, we stick to one of the following models, choosing between them based on how much memory inference requires:

- `meta-llama/Llama-2-7b-chat-hf`
- `meta-llama/Llama-2-13b-chat-hf`
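
As a rough guide for choosing between the two, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: the nominal parameter counts (7e9 and 13e9) and the helper name `weight_memory_gb` are assumptions made for illustration, and the estimate ignores quantization metadata, the KV cache, activations, and the CUDA context.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts (assumed round numbers, not the exact checkpoint sizes)
nominal_params = {"Llama-2-7b": 7e9, "Llama-2-13b": 13e9}

estimates = {
    (name, bits): weight_memory_gb(n_params, bits)
    for name, n_params in nominal_params.items()
    for bits in (16, 4)  # 16-bit (fp16 / bf16) vs. 4-bit quantized weights
}

for (name, bits), gb in estimates.items():
    print(f"{name} @ {bits}-bit: ~{gb:.1f} GB for weights")
```

Actual usage during inference will be higher than these figures, which is why measured GPU memory is one of the metrics this notebook aims to collect.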

## Setup

### Installation

In [1]:
!git clone https://github.com/piesauce/llm-playbooks
!cd llm-playbooks && git pull && pip install -q -e .

fatal: destination path 'llm-playbooks' already exists and is not an empty directory.
Already up to date.
  Preparing metadata (setup.py) ... done


In [3]:
!pip install -U bitsandbytes --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/transformers.git --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/peft.git --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/accelerate.git --root-user-action=ignore
!pip install tokenizers==0.13.3


### Validate setup / login

In [2]:
from llm_playbooks.utils.colab import print_wrap

text = """In my project, I have a bunch of strings that are.
Read in from a file.
Most of them, when printed in the command console, exceed 80. Characters in length and wrap around, looking ugly."""

print_wrap(text)

In my project, I have a bunch of strings that are.
Read in from a file.
Most of them, when printed in the command console, exceed 80. Characters in length and
ꜛwrap around, looking ugly.


In [4]:
# Llama 2 models require permissions to load; be sure to log in
!git config --global credential.helper store

from huggingface_hub import notebook_login
notebook_login()


# Load model

In [5]:
# from shutil import rmtree
# from transformers import file_utils

# # delete cache
# rmtree(file_utils.default_cache_path, ignore_errors=True)

# # clear model from memory
# try: del model
# except NameError: pass

In [6]:
from torch import cuda, bfloat16
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
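    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,  # store the weights in 4-bit precision
        bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
        bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
    ),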
    device_map="auto"
)

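
Once the model is loaded, one possible way to check GPU memory is to read PyTorch's allocator counters. The sketch below is a minimal example under stated assumptions: the helper name `gpu_memory_gb` is hypothetical, and it returns `None` when PyTorch or a CUDA device is unavailable so that it can run anywhere. The counters cover only memory managed by PyTorch's caching allocator, so tools such as `nvidia-smi` will typically report a larger number.

```python
def gpu_memory_gb(device: int = 0):
    """Return PyTorch allocator statistics in GB, or None without a usable GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    gb = 1e9  # report in GB (1 GB = 1e9 bytes)
    return {
        # Memory currently occupied by live tensors
        "allocated_gb": torch.cuda.memory_allocated(device) / gb,
        # Memory held by the caching allocator (allocated + cached blocks)
        "reserved_gb": torch.cuda.memory_reserved(device) / gb,
        # High-water mark of allocated memory since the last reset
        "peak_allocated_gb": torch.cuda.max_memory_allocated(device) / gb,
    }


snapshot = gpu_memory_gb()
print(snapshot)
```

To isolate the memory used by a single generation call, `torch.cuda.reset_peak_memory_stats()` can be called just before `model.generate(...)` and the peak read afterwards.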

# Run inference

**TODOs**:

1. Get GPU memory utilization
2. Run across a subset of the full test set
3. Aggregate metrics across queries and create plots
4. Convert to script
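
TODOs 2 and 3 could be sketched roughly as follows: loop over a list of prompts, record per-query metrics, then aggregate them. This is an illustrative outline rather than the eventual script. The names `run_queries`, `summarize`, and `fake_generate` are hypothetical, the generation step is injected as a callable so the loop can be exercised without a model, and `time.perf_counter` stands in for GPU-side timing (with a real model, the GPU should be synchronized before the clock is read).

```python
import statistics
import time


def run_queries(prompts, generate_fn):
    """Run generate_fn on each prompt and collect per-query metrics.

    generate_fn is a hypothetical callable that takes a prompt string and
    returns (n_input_tokens, n_output_tokens), where the output count
    includes the prompt tokens.
    """
    records = []
    for prompt in prompts:
        t0 = time.perf_counter()
        n_in, n_out = generate_fn(prompt)
        elapsed = time.perf_counter() - t0
        n_generated = n_out - n_in
        records.append(
            dict(
                n_input_tokens=n_in,
                n_generated_tokens=n_generated,
                elapsed_time=elapsed,
                # Guard against a zero-length interval on very fast stubs
                throughput=n_generated / elapsed if elapsed > 0 else float("nan"),
            )
        )
    return records


def summarize(records, key):
    """Aggregate one metric across all queries."""
    values = [r[key] for r in records]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


def fake_generate(prompt):
    """Stand-in for the model: pretends to generate 5 tokens per prompt."""
    n_in = len(prompt.split())
    return n_in, n_in + 5


records = run_queries(["a b c", "d e f g"], fake_generate)
summary = summarize(records, "n_generated_tokens")
print(summary)
```

With the real model, `generate_fn` would wrap tokenization and `model.generate`, and the resulting `records` list could be loaded into a dataframe for plotting.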


**References**:

* Test set data: https://huggingface.co/datasets/emozilla/booksum-summary-analysis_llama-8192/viewer/default/test
* Llama2 prompt: https://huggingface.co/blog/llama2#how-to-prompt-llama-2
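
The single-turn prompt layout described in the Llama 2 reference above can be captured in a small helper. This is a sketch: the function name `build_llama2_prompt` and its parameters are hypothetical, and `add_bos` defaults to `False` on the assumption that the tokenizer inserts the `<s>` token itself when encoding.

```python
def build_llama2_prompt(user_message, system_prompt=None, add_bos=False):
    """Format a single-turn Llama 2 chat prompt.

    Layout: [INST] <<SYS>>\\n{system}\\n<</SYS>>\\n\\n{user} [/INST]
    """
    if system_prompt:
        body = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    else:
        body = user_message
    prompt = f"[INST] {body} [/INST]"
    # Only prepend the BOS marker when the tokenizer will not add it
    return "<s>" + prompt if add_bos else prompt


example = build_llama2_prompt(
    "Provide a concise academic abstract for the following passage: ...",
    system_prompt="You are a helpful, respectful and honest assistant.",
)
print(example)
```

Keeping the template in one function makes it easier to reuse the same format for every passage in the test set.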



In [23]:
from llm_playbooks.utils.colab import print_wrap
from torch import cuda
import pprint as pp


# Prompt
passage = """ It is necessary to consider another point in examining the character of these principalities: that is, whether a prince has such power that, in case of need, he can support himself with his own resources, or whether he has always need of the assistance of others. And to make this quite clear I say that I consider those who are able to support themselves by their own resources who can, either by abundance of men or money, raise a sufficient army to join battle against any one who comes to attack them; and I consider those always to have need of others who cannot show themselves against the enemy in the field, but are forced to defend themselves by sheltering behind walls. The first case has been discussed, but we will speak of it again should it recur. In the second case one can say nothing except to encourage such princes to provision and fortify their towns, and not on any account to defend the country. And whoever shall fortify his town well, and shall have managed the other concerns of his subjects in the way stated above, and to be often repeated, will never be attacked without great caution, for men are always adverse to enterprises where difficulties can be seen, and it will be seen not to be an easy thing to attack one who has his town well fortified, and is not hated by his people. The cities of Germany are absolutely free, they own but little country around them, and they yield obedience to the emperor when it suits them, nor do they fear this or any other power they may have near them, because they are fortified in such a way that every one thinks the taking of them by assault would be tedious and difficult, seeing they have proper ditches and walls, they have sufficient artillery, and they always keep in public depots enough for one year's eating, drinking, and firing. 
And beyond this, to keep the people quiet and without loss to the state, they always have the means of giving work to the community in those labours that are the life and strength of the city, and on the pursuit of which the people are supported; they also hold military exercises in repute, and moreover have many ordinances to uphold them. Therefore, a prince who has a strong city, and had not made himself odious, will not be attacked, or if any one should attack he will only be driven off with disgrace; again, because that the affairs of this world are so changeable, it is almost impossible to keep an army a whole year in the field without being interfered with. And whoever should reply: If the people have property outside the city, and see it burnt, they will not remain patient, and the long siege and self-interest will make them forget their prince; to this I answer that a powerful and courageous prince will overcome all such difficulties by giving at one time hope to his subjects that the evil will not be for long, at another time fear of the cruelty of the enemy, then preserving himself adroitly from those subjects who seem to him to be too bold. Further, the enemy would naturally on his arrival at once burn and ruin the country at the time when the spirits of the people are still hot and ready for the defence; and, therefore, so much the less ought the prince to hesitate; because after a time, when spirits have cooled, the damage is already done, the ills are incurred, and there is no longer any remedy; and therefore they are so much the more ready to unite with their prince, he appearing to be under obligations to them now that their houses have been burnt and their possessions ruined in his defence. For it is the nature of men to be bound by the benefits they confer as much as by those they receive. 
Therefore, if everything is well considered, it will not be difficult for a wise prince to keep the minds of his citizens steadfast from first to last, when he does not fail to support and defend them. """

# The tokenizer prepends the BOS token (<s>) itself, so it is left out of the prompt string
text = f"""[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Provide a concise academic abstract for the following passage:

"{passage}" [/INST]
"""


# Clear inputs / output tokens from memory
try: del inputs, output_tokens
except NameError: pass


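# Run model, timing tokenization + generation with CUDA events
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_tokens = model.generate(**inputs, max_new_tokens=1024)
end.record()

# Wait for the GPU to finish so that both events have been recorded
torch.cuda.synchronize()
elapsed_time = start.elapsed_time(end) / 1000  # elapsed_time() is in milliseconds; convert to seconds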

# Collect results (generate() returns the prompt tokens followed by the new tokens)
n_input_tokens = len(inputs["input_ids"][0])
n_output_tokens = len(output_tokens[0])
n_generated_tokens = n_output_tokens - n_input_tokens
throughput = n_generated_tokens / elapsed_time  # generated tokens per second

metrics = dict(
    n_input_tokens=n_input_tokens,
    n_output_tokens=n_output_tokens,
    n_generated_tokens=n_generated_tokens,
    elapsed_time=elapsed_time,
    throughput=throughput,
)

output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print_wrap(output_text)

print()
pp.pprint(metrics, sort_dicts=False)

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as
ꜛpossible, while being safe.  Your answers should not include any harmful, unethical,
ꜛracist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses
ꜛare socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead
ꜛof answering something not correct. If you don't know the answer to a question, please
ꜛdon't share false information.
<</SYS>>

Provide a concise academic abstract for the following passage:

" It is necessary to consider another point in examining the character of these
ꜛprincipalities: that is, whether a prince has such power that, in case of need, he can
ꜛsupport himself with his own resources, or whether he has always need of the assistance
ꜛof others. And to make this quite clear I say that I consider those who are able to
ꜛsupport themselves by their own resources who can, eit