# Models

Note: intended to be run in [Google Colab](https://colab.research.google.com/) using a T4 runtime.

This notebook explores the lower-level API of Transformers: the model classes that wrap the PyTorch code for the transformers themselves.

## Dependencies

In [None]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate

In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

## HuggingFace Token

**IMPORTANT** The token requires read and write permissions.

Add `HF_TOKEN` to the Colab secrets, paste in the token value, and toggle on access for this notebook.

In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Models and Prompts

In [None]:
# instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1" # Requires a lot of GPU memory

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

# Model Analysis

## Llama 3.1 from Meta

**IMPORTANT** Meta does require you to sign their [terms of service](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B).

### Quantization Config

In [None]:
# Quantization config - loads the model with lower-precision weights so it uses less memory
# load_in_4bit: reduce the weights to 4 bits; switch to 8 bit (load_in_8bit) to compare accuracy
# bnb_4bit_use_double_quant: also quantizes the quantization constants produced by the
# first pass, which saves a little more memory and doesn't impact accuracy too much
# bnb_4bit_compute_dtype: data type used when doing the calculations, which can make
# some improvements to performance
# bnb_4bit_quant_type: determines how to treat/interpret/compress the numbers when reduced
# down to 4 bits, e.g. 4-bit NormalFloat (nf4)

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
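To build some intuition for what "reducing to 4 bits" means, here is a minimal sketch of absmax quantization in plain Python. It is a deliberate simplification and not the NF4 scheme: NF4 uses 16 unevenly spaced levels, whereas this sketch uses evenly spaced integers. The names `quantize_absmax` and `dequantize` are hypothetical helpers, not part of bitsandbytes.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1              # 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    # One scale per block of weights; this constant is the kind of value
    # that double quantization compresses a second time.
    scale = absmax / max_level if absmax else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.32, 0.12, 0.0]                # made-up example values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

The restored values differ from the originals by at most half a quantization step, which is the accuracy traded for the smaller memory footprint.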

### Tokenizer

In [None]:
# Tokenizer
# pad_token: the token used to fill out a prompt when it needs to be lengthened before being
# fed to the neural network, usually the same as the end-of-sentence (eos) token
# apply_chat_template: takes the list of message dictionaries and converts them to tokens

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
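Padding only matters once several prompts of different lengths are batched together. The sketch below shows the idea in plain Python with made-up token ids. `pad_batch` is a hypothetical helper, and left padding is an assumption (a common choice when generating with decoder-only models). A real tokenizer handles this internally when asked to pad.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build matching attention masks."""
    target = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        gap = target - len(seq)
        if side == "left":
            padded.append([pad_id] * gap + seq)
            masks.append([0] * gap + [1] * len(seq))
        else:
            padded.append(seq + [pad_id] * gap)
            masks.append([1] * len(seq) + [0] * gap)
    return padded, masks


# Made-up token ids; 0 stands in for the pad (eos) token id.
batch = [[11, 12, 13], [21]]
input_ids, attention_mask = pad_batch(batch, pad_id=0)
print(input_ids)
print(attention_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the padded positions can be ignored by the model.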

### The Model

In [None]:
# The model
# AutoModelForCausalLM: general class for creating any causal LLM; from_pretrained downloads
# the model to disk and loads it into memory (as code that we can run, PyTorch layers of a
# neural network)
# Causal LLM: the same as an autoregressive LLM used for GenAI, an LLM that takes
# some set of tokens in the past and predicts future tokens
# device_map="auto": use a GPU if one is available

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)

In [None]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")
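A back-of-the-envelope estimate helps sanity-check the number printed above. The sketch assumes roughly 8 billion parameters (taken from the "8B" in the model name) and counts weight storage only. The measured footprint will differ, since not every layer is necessarily quantized and the quantization constants add some overhead. `estimate_weight_memory_mb` is a hypothetical helper.

```python
def estimate_weight_memory_mb(num_params, bits_per_param):
    """Weight storage only: parameters * bits / 8 bytes, reported in MB (1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {estimate_weight_memory_mb(NUM_PARAMS, bits):,.0f} MB")
```

Comparing the 4-bit row with the 16-bit and 32-bit rows shows why quantization is what makes a model of this size practical on a single small GPU.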

In [None]:
model

PyTorch layers of the LLAMA model:

    LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
            (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          )
        )
        (norm): LlamaRMSNorm((4096,), eps=1e-05)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
    )

The model is made up of:
- an embedding layer at the start, which is how the tokens become embedded within the neural network
- a series of modules that are the layers within the neural network
    - attention layers that are at the heart of what makes a transformer ("Attention Is All You Need")
    - a multilayer perceptron (MLP) with an activation function, e.g. SiLU() the [Sigmoid Linear Unit (SiLU) aka swish function](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
- a norm layer (here RMSNorm, a variant of the operation described in the paper [Layer Normalization](https://arxiv.org/abs/1607.06450))
- a linear layer at the end
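Two of the building blocks named in the printout, `SiLU` and `LlamaRMSNorm`, are small enough to sketch in plain Python. This is an illustration of the underlying formulas only, operating on Python lists rather than tensors. The function names are hypothetical, and `eps=1e-5` is taken from the printout.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def rms_norm(values, weight=None, eps=1e-5):
    """Divide a vector by its root mean square, then scale by a learned weight."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    weight = weight or [1.0] * len(values)       # identity weight if none is given
    return [v * inv_rms * w for v, w in zip(values, weight)]


print(silu(1.0))
print(rms_norm([3.0, 4.0]))
```

After normalization the mean of the squared values is very close to 1, which keeps activations at a stable scale from layer to layer.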

__Always look at the dimensionality: the input size of the embedding layer is the vocabulary size, e.g. Embedding(128256, 4096), and it should match the output size of the final linear layer, e.g. Linear(in_features=4096, out_features=128256, bias=False)__
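The shapes in the printout are enough to count the parameters by hand. The sketch below is plain arithmetic over those shapes. It assumes that none of the linear layers has a bias (as the printout states), that each RMSNorm holds one weight vector of size 4096, and that the rotary embedding has no learned weights.

```python
# Shapes copied from the printed model
VOCAB, HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 128256, 4096, 1024, 14336, 32

embedding = VOCAB * HIDDEN
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                      # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

final_norm = HIDDEN
lm_head = HIDDEN * VOCAB
total = embedding + NUM_LAYERS * per_layer + final_norm + lm_head
print(f"{total:,} parameters")
```

The total comes out at just over 8 billion, which is where the "8B" in the model name comes from.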

In [None]:
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))
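Conceptually, `generate` runs an autoregressive loop: predict the next token from everything so far, append it, and repeat. The sketch below shows that loop in plain Python with a toy stand-in for the model. Greedy selection (always taking the highest score) is an assumption made for simplicity, since a real call may sample instead depending on the generation settings. `greedy_generate` and `toy_scores` are hypothetical names.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens, eos_id):
    """Repeatedly append the highest-scoring next token until eos or the token budget."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)          # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_scores(ids, vocab_size=5):
    """Stand-in for a model: always favours (last token + 1) modulo the vocabulary size."""
    favoured = (ids[-1] + 1) % vocab_size
    return [1.0 if i == favoured else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_scores, prompt_ids=[1], max_new_tokens=80, eos_id=4)
print(generated)
```

The loop stops either when the end-of-sequence token appears or when `max_new_tokens` is reached, which is why a short budget such as 80 can cut a response off mid-sentence.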

### Clear Cache

In [None]:
# Clean up memory

del inputs, outputs, model
torch.cuda.empty_cache()

## Reusable Function to Analyze Models

Uses the HuggingFace utility `TextStreamer` to stream back results.

The argument `add_generation_prompt=True` is passed when applying the chat template. This ensures that Phi generates a response to the question, instead of just predicting how the user prompt continues. Try setting it to False to see what happens. You can read more about this argument in the [chat templating documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts).

In [None]:
# Wrapping everything in a function - and adding Streaming and generation prompts

def generate(model_name, messages):
  # Tokenize the chat messages, ending with the prompt that opens the assistant's turn
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
  # Stream tokens to the output as they are generated
  streamer = TextStreamer(tokenizer)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
  outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
  # Release GPU memory before the next model is loaded
  del tokenizer, streamer, model, inputs, outputs
  torch.cuda.empty_cache()
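To see what `add_generation_prompt=True` changes, here is a toy chat template in plain Python. The role markers are illustrative only, since every model ships its own template, and `render_chat` is a hypothetical function rather than anything from Transformers.

```python
def render_chat(messages, add_generation_prompt=False):
    """Toy chat template with illustrative role markers (real templates differ per model)."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        text += "<|assistant|>\n"    # opens the assistant turn so the model answers
    return text


demo = [{"role": "user", "content": "Tell a joke"}]
print(render_chat(demo))
print(render_chat(demo, add_generation_prompt=True))
```

Without the trailing assistant marker, the text simply ends after the user's turn, so a model is free to continue it in any way, including by extending the user's message.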

## Phi 3 from Microsoft

In [None]:
generate(PHI3, messages)

## Gemma from Google

**IMPORTANT** Google does require you to accept their [terms of service](https://huggingface.co/google/gemma-2-2b-it).

In [None]:
# Gemma's chat template does not accept a system role, so only the user message is sent
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]
generate(GEMMA2, messages)
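The cell above drops the system message, on the assumption that the Gemma chat template does not accept a system role. An alternative, sketched below, is to fold the system text into the first user message so the instruction is not lost. `merge_system_into_user` is a hypothetical helper, and whether the model follows the folded-in instruction as well as a true system prompt is not something this sketch can tell you.

```python
def merge_system_into_user(messages):
    """Return a new message list with system content folded into the first user message."""
    system_text = "\n".join(m["content"] for m in messages if m["role"] == "system")
    merged, folded = [], False
    for m in messages:
        if m["role"] == "system":
            continue                              # system turns are removed
        if m["role"] == "user" and system_text and not folded:
            m = {"role": "user", "content": f"{system_text}\n\n{m['content']}"}
            folded = True
        merged.append(m)
    return merged


original = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]
print(merge_system_into_user(original))
```

The input list is left unmodified, so the same `messages` can still be passed unchanged to models that do support a system role.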