<a href="https://colab.research.google.com/github/piesauce/llm-playbooks/blob/main/notebooks/Loading_Models_with_BnB_in_Colab_Free_Tier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Models

## Resource limitations in free tier
The selection of models assumes that we have access to the following resources (which can vary!):
* 78.2GB disk space (!)
* 15.0GB GPU RAM (!)
* 12.7GB system RAM

System RAM is not the most important factor here since we are loading our models onto the GPU, but disk space and GPU RAM are crucial. Here is how they scale w.r.t. the number of model parameters:

* Assuming model weights are stored in 16-bit float or [bfloat](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) format, each 1B parameters will require 2GB of disk space.

* Assuming models are loaded into 4-bit format and no sharding is involved, each 1B parameters will require 500MB of GPU RAM.
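The two rules of thumb above can be turned into a quick back-of-the-envelope check. This is only a sketch: the function name `estimate_footprint_gb` is made up for illustration, and the limits are the free tier figures listed above, which can vary between sessions.

```python
def estimate_footprint_gb(n_params_billion: float) -> tuple[float, float]:
    """Return (disk_gb, gpu_gb) for a model with the given parameter count in billions."""
    disk_gb = n_params_billion * 2.0  # 16-bit weights on disk: 2 bytes per parameter
    gpu_gb = n_params_billion * 0.5   # 4-bit weights on GPU: 0.5 bytes per parameter
    return disk_gb, gpu_gb


# Free tier limits quoted above (assumed; they can vary)
DISK_LIMIT_GB = 78.2
GPU_LIMIT_GB = 15.0

for size in (7, 13, 70):
    disk_gb, gpu_gb = estimate_footprint_gb(size)
    fits = disk_gb <= DISK_LIMIT_GB and gpu_gb <= GPU_LIMIT_GB
    print(f"{size}B: {disk_gb:.1f}GB disk, {gpu_gb:.1f}GB GPU RAM, fits={fits}")
```

The estimate covers the weights only, so a model that "fits" here still needs headroom for everything else that lives in GPU RAM at inference time.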

Therefore the largest Llama 2 model that free tier Colab can support without sharding is 13B (26GB disk space, 6.5GB GPU RAM). This estimate does not yet account for the GPU RAM needed to cache tokens and KV values, or for other overhead involved in running the model.

Since the 70B model will require 140GB of disk space and 35GB of GPU RAM, we will not be able to load it here.
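To get a feel for the KV cache overhead mentioned above, here is a rough sketch. It assumes standard multi-head attention with a 16-bit cache, and the layer count (40) and hidden size (5120) are assumptions taken from the published Llama 2 13B architecture rather than read from the loaded model; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    n_layers: int, hidden_size: int, n_tokens: int, bytes_per_value: int = 2
) -> int:
    """Approximate KV cache size for a single sequence."""
    # Each token stores one key vector and one value vector per layer
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value


# Assumed Llama 2 13B shape: 40 layers, hidden size 5120
cache_gb = kv_cache_bytes(n_layers=40, hidden_size=5120, n_tokens=1000) / 1e9
print(f"KV cache for 1000 tokens: {cache_gb:.2f}GB")
```

Under these assumptions the cache grows linearly with sequence length, so long generations eat into the GPU RAM left over after the 4-bit weights are loaded.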

## Llama 2 permissions and OPT-350M
We also include OPT-350M as a sanity check, since running Llama 2 requires requesting access from Meta ([request here](https://huggingface.co/meta-llama/Llama-2-7b)), which can take a few hours or more to be granted.

In [None]:
MODEL_IDS = {
    # strictly a sanity check: this model is too small and not instruction tuned
    "opt-350m":               "facebook/opt-350m",

    # need to request permissions to load Llama 2 models
    "llama2-7b":              "meta-llama/Llama-2-7b-hf",
    "llama2-7b-chat":         "meta-llama/Llama-2-7b-chat-hf",
    "llama2-13b":             "meta-llama/Llama-2-13b-hf",
    "llama2-13b-chat":        "meta-llama/Llama-2-13b-chat-hf",

    # codellama: similar to Llama 2, but no need for Meta permissions + split into 3 categories
    "codellama-7b":           "codellama/CodeLlama-7b-hf",          # completion + infilling
    "codellama-7b-python":    "codellama/CodeLlama-7b-Python-hf",   # completion + better Python
    "codellama-7b-instruct":  "codellama/CodeLlama-7b-Instruct-hf", # completion + infilling + instructions
    "codellama-13b":          "codellama/CodeLlama-13b-hf",
    "codellama-13b-python":   "codellama/CodeLlama-13b-Python-hf",
    "codellama-13b-instruct": "codellama/CodeLlama-13b-Instruct-hf",
}

# Boring setup stuff

* Colab and Jupyter notebooks in general don't format output text very well; we use `textwrap` to manually print out text in a readable format.
* We need to update some HuggingFace libraries and install `bitsandbytes` manually.

In [None]:
# TODO: put this util into the repo
import textwrap

def print_wrap(text: str, width: int = 100, subsequent_indent: str = "ꜛ"):
    """
    Wrap and print text to some max width.

    Useful for notebooks which do not wrap output text, making it very hard to read
    in both Colab and GitHub.
    """
    print(
        "\n".join(
            textwrap.fill(line, subsequent_indent=subsequent_indent, width=width)
            for line in text.split("\n")
        )
    )


text = """In my project, I have a bunch of strings that are.
Read in from a file.
Most of them, when printed in the command console, exceed 80. Characters in length and wrap around, looking ugly."""

print_wrap(text)

In my project, I have a bunch of strings that are.
Read in from a file.
Most of them, when printed in the command console, exceed 80. Characters in length and wrap around,
ꜛlooking ugly.


In [None]:
!pip install -U bitsandbytes --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/transformers.git --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/peft.git --root-user-action=ignore
!pip install -U git+https://github.com/huggingface/accelerate.git --root-user-action=ignore
!pip install -U tokenizers

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-t7hloi_u
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-t7hloi_u
  Resolved https://github.com/huggingface/transformers.git to commit d8e13b3e04da9e61c6f16df43815656f59688abd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-actib07p
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-actib07p
  Resolved https://github.com/huggingface/peft.git to commit 0c9354bda98eb7f5348699e23ab752e8dca1e60e
  Installing build dependencies ... done
  Getting requirements
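
Since several libraries are installed from their development branches, it can help to confirm which versions actually ended up in the runtime. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, not part of any of these packages.

```python
from importlib import metadata
from typing import Optional


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["bitsandbytes", "transformers", "peft", "accelerate", "tokenizers"]))
```

A `None` entry means the corresponding `pip install` did not succeed and should be rerun before loading any model.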

# Loading and running model

Now we will load our selected model and see its outputs in a question answering setting.

Due to resource constraints, we cannot load and run all models in a single runtime instance. Before loading a different model, we need to clear the current one from GPU memory and free up disk space as needed.
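
Before downloading another model, it may be worth checking whether enough disk space is left. The sketch below uses the rule of thumb of 2GB of disk per 1B parameters for 16-bit weights; the helper names and the default path are assumptions for illustration, and the path should point at the filesystem that holds the Hugging Face cache.

```python
import shutil


def free_disk_gb(path: str = ".") -> float:
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def fits_on_disk(n_params_billion: float, path: str = ".", gb_per_billion: float = 2.0) -> bool:
    """Check whether 16-bit weights for a model of this size would fit on disk."""
    return free_disk_gb(path) >= n_params_billion * gb_per_billion


print(f"Free disk space: {free_disk_gb():.1f}GB")
print(f"13B model fits: {fits_on_disk(13)}")
```

If the check fails, clearing the download cache of previously loaded models frees the space again.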

We will also be loading our models in 4-bit in order to minimize memory usage, although it should also be possible to load the 7B model in 8-bit format using [`BitsAndBytesConfig(load_in_8bit=True, ...)`](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization#transformers.BitsAndBytesConfig).
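
For reference, an 8-bit variant might look like the sketch below. It is untested here and rests on the assumption that 8-bit weights need roughly 1GB of GPU RAM per 1B parameters, so a 7B model would take about 7GB before any runtime overhead. The variable names are illustrative.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization instead of the 4-bit setup used in the rest of this notebook
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",  # same ID as MODEL_IDS["codellama-7b"]
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```

Running this downloads the full 16-bit checkpoint first, so the disk space requirement is the same as for the 4-bit case.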

In [None]:
# Llama 2 models require permissions to load; be sure to log in
!git config --global credential.helper store

from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Clear disk space as needed
from transformers import file_utils
from shutil import rmtree

rmtree(file_utils.default_cache_path, ignore_errors=True)

In [None]:
import gc

from torch import cuda, bfloat16
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, BitsAndBytesConfig
import torch
import transformers


# CONFIG FOR MODEL
model_id = MODEL_IDS["codellama-13b"]

# 4-bit NF4 quantization with double quantization; computation runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)


# Load model
try:
    # If an existing model is loaded, drop our reference to it
    del model
except NameError:
    pass

# `del` only removes the name; collect garbage and release cached GPU memory too
gc.collect()
torch.cuda.empty_cache()

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Downloading (…)okenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/589 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/31.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Now comes the fun part: let us see what the model outputs for a simple Q&A problem instance. We will try to get the model to give us an answer to the question:

> How do you load a BERT model for sentiment classification using Hugging Face transformers?

To do this we need to set up a simple QA prompt. For pedagogical purposes we will split the prompt into its constituent components:
* `instruction_text`: This helps the model frame the task and the constraints around the task. Considering how common `Question: <...>, Answer: <...>` text is, this is probably not strictly necessary.
* `input_text`: Here is where we ask the specific question that we want the answer to. Notice the `Question:` prefix.
* `output_prefix`: By providing the `Answer:` prefix, we help to condition the model so that it will generate tokens to answer the question. Without this it is possible that the model will continue generating text for the question, which is not what we want.

Feel free to change the model and prompt and see how this affects the generated output!

In [None]:
def qa_prompt(question: str) -> str:
    instruction_text = (
        "You are a helpful, respectful and honest assistant. "
        "Always answer as helpfully as possible, while being safe. "
        "Your answers should not include any harmful, unethical, racist, sexist, toxic, "
        "dangerous, or illegal content. Please ensure that your responses are socially unbiased "
        "and positive in nature."
        "\n\n"
        "If a question does not make any sense, or is not factually coherent, explain why instead "
        "of answering something not correct. If you don't know the answer to a question, please "
        "don't share false information."
    )
    input_text = (
        "Question:\n"
        + question
    )
    output_prefix = "Answer:\n"
    return "\n\n".join([instruction_text, input_text, output_prefix])

text = qa_prompt(
    "How do you load a BERT model for sentiment classification using Hugging Face transfomers?"
)

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=1000)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print_wrap(output_text)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while
ꜛbeing safe. Your answers should not include any harmful, unethical, racist, sexist, toxic,
ꜛdangerous, or illegal content. Please ensure that your responses are socially unbiased and positive
ꜛin nature.

If a question does not make any sense, or is not factually coherent, explain why instead of
ꜛanswering something not correct. If you don't know the answer to a question, please don't share
ꜛfalse information.

Question:
How do you load a BERT model for sentiment classification using Hugging Face transfomers?

Answer:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=8)
```

This will load the BERT model and tokenizer for sentiment classification.

You can then use the model to classify text as follows:

```pyt

# Coding

In [None]:
text = """Suppose there is a Python class with the following signature:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NumericalFunction:
    params: dict[str, np.ndarray]  # a dictionary of named parameters

    @staticmethod
    def compute_function(x: np.ndarray, params: dict[str, np.ndarray]) -> np.float64:
        \"\"\"Computes the value of the NumericalFunction at x given its params.\"\"\"
        ...

    def __call__(self, x: np.ndarray) -> np.float64
        \"\"\"
        Computes the value of the function instance at x, e.g.
            function = NumericalFunction(params)
            y = function(x)
        \"\"\"
        ...

    def gradient(self, x: np.ndarray) -> dict[str, np.ndarray]:
        \"\"\"
        Computes the gradient of the function instance at x with respect to its params.
        The output is a dictionary with the gradient matrix for each parameter in params.
        \"\"\"
        ...
```

**Task:**
Write a unit test to determine whether the `gradient` method is correct for arbitrary x.
Do this by comparing a first-order Taylor expansion of the function and checking approximations
of function values for small perturbations of the input x. You can do this by combining the
`__call__` and `gradient` methods.
Only test the `gradient` method, do not test other methods.

**Answer:**
"""

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=1000)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print_wrap(output_text)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Suppose there is a Python class with the following signature:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NumericalFunction:
    params: dict[str, np.ndarray]  # a dictionary of named parameters

    @staticmethod
    def compute_function(x: np.ndarray, params: dict[str, np.ndarray]) -> np.float64:
        """Computes the value of the NumericalFunction at x given its params."""
        ...

    def __call__(self, x: np.ndarray) -> np.float64
        """
        Computes the value of the function instance at x, e.g.
            function = NumericalFunction(params)
            y = function(x)
        """
        ...

    def gradient(self, x: np.ndarray) -> dict[str, np.ndarray]:
        """
        Computes the gradient of the function instance at x with respect to its params.
        The output is a dictionary with the gradient matrix for each parameter in params.
        """
        ...
```

**Task:**
Write a unit test to determine whether the `gr