# Models

This notebook looks at the lower-level API of Transformers: the model classes that wrap the PyTorch code for the transformers themselves.

This notebook can run on a low-cost or free T4 runtime.

In [1]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate python-dotenv

In [8]:
import os
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [9]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

CUDA available: True
CUDA version: 12.6
GPU device name: NVIDIA RTX A2000 Laptop GPU


In [10]:
# Check GPU memory allocation before loading the model
print(torch.cuda.memory_allocated()/1e9, "GB allocated")
print(torch.cuda.memory_reserved()/1e9, "GB reserved")

0.0 GB allocated
0.0 GB reserved


In [11]:
# Log in to Hugging Face (load_dotenv() reads HF_TOKEN from a local .env file, if one exists)
load_dotenv()
hf_token = os.getenv('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Example: Estimating Memory Usage

## Pythia model in FP32

- **How many parameters?** 400 million
  - \( 400 \times 10^6 \) parameters
  - 32 bits = 4 bytes per parameter
  - (8 bits → 1 byte)

- **How much memory?**
  - \( 400 \times 10^6 \) parameters × 4 bytes
  - = \( 1{,}600 \times 10^6 \) bytes
  - = 1,600 megabytes
  - = 1.6 gigabytes
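
The same arithmetic can be wrapped in a small helper. This is a minimal sketch: the function name `estimate_weight_memory_gb` is my own, and it counts the weights only (no activations, optimizer state, or framework overhead), using decimal gigabytes as in the calculation above.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int = 32) -> float:
    """Estimate the memory needed to hold model weights, in decimal gigabytes."""
    bytes_per_param = bits_per_param / 8  # 8 bits per byte
    total_bytes = num_params * bytes_per_param
    return total_bytes / 1e9  # 1 GB = 10^9 bytes, matching the calculation above


# Pythia-sized example from the text: 400 million parameters in FP32
print(f"{estimate_weight_memory_gb(400e6, 32):.1f} GB")
```

Changing `bits_per_param` to 16, 8, or 4 gives a rough idea of how much smaller the weights become at lower precision.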

# Quantized models are easier to distribute

## 70B parameter model (e.g. Llama 2 70B)
- **280 GB storage** in FP32 (32-bit precision)  
- Reduce to **40GB** if stored in 4-bit precision  
- **7x reduction** (280GB / 40GB) ~ 7  

## Llama 2 7B
- **28 GB storage** in FP32 (32-bit precision)  
- Reduce to **~4GB** if stored in 4-bit precision, in **"GGUF"** format.  
- **Can run on your computer!**  
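
Pure 4-bit weights for 70B parameters would take 35 GB, so the 40 GB figure implies some overhead. One plausible accounting, sketched below, is block-wise quantization, where each block of weights also stores a scale constant. The block size of 64 and the 32-bit scale are assumptions borrowed from the QLoRA setup, not values stated in this notebook, and `blockwise_quantized_size_gb` is a hypothetical helper name.

```python
def blockwise_quantized_size_gb(num_params, weight_bits=4, block_size=64, scale_bits=32):
    """Approximate storage for block-wise quantized weights, in decimal gigabytes."""
    # Each block of `block_size` weights shares one scale constant, which adds
    # scale_bits / block_size bits per weight on top of weight_bits.
    effective_bits = weight_bits + scale_bits / block_size
    return num_params * effective_bits / 8 / 1e9


for name, n in [("70B", 70e9), ("7B", 7e9)]:
    fp32_gb = n * 32 / 8 / 1e9
    q_gb = blockwise_quantized_size_gb(n)
    print(f"{name}: FP32 {fp32_gb:.0f} GB -> 4-bit {q_gb:.1f} GB ({fp32_gb / q_gb:.1f}x smaller)")
```

Under these assumptions the effective cost is 4.5 bits per weight, which lands close to the 40 GB and ~4 GB figures quoted above.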


In [12]:
# Instruct models for text generation

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
# QWEN2 = "Qwen/Qwen2-7B-Instruct"  # exercise for you -> too large
QWEN2 = "Qwen/Qwen2.5-1.5B-Instruct"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # If this doesn't fit in your GPU memory, try others from the hub
DEEPSEEKR1 = "deepseek-ai/DeepSeek-R1"

In [13]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

# Accessing Llama 3.1 from Meta

To use the fantastic Llama 3.1, Meta requires you to sign their terms of service.

Visit their model instructions page on Hugging Face:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

At the top of the page are instructions on how to agree to their terms. If possible, you should use the same email as your Hugging Face account.

In my experience approval comes in a couple of minutes. Once you've been approved for any 3.1 model, it applies to the whole family of models.

In [14]:
# Without quantization, the model's weights are loaded at full precision by default (FP32, 32-bit floating point), which takes a lot of memory
#model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map={"":0})


In [15]:
# #to know memory allocation before loading the model to memory
# print(torch.cuda.memory_allocated()/1e9, "GB allocated")
# print(torch.cuda.memory_reserved()/1e9, "GB reserved")

### bitsandbytes

- bitsandbytes is the easiest option for quantizing a model to 8-bit and 4-bit.

- 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model’s performance.

- 4-bit quantization compresses a model even further, and it is commonly used with QLoRA to finetune quantized LLMs.
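
To make the outlier idea concrete, here is a toy sketch of a mixed-precision dot product in plain Python. It is not how bitsandbytes implements its kernels: the function names are hypothetical, the "fp16" path is ordinary Python floats, and the threshold of 6.0 is borrowed from the library's default `llm_int8_threshold`. Inputs whose magnitude reaches the threshold stay in floating point, the rest go through absmax int8 quantization, and the two partial results are added.

```python
def absmax_quantize(values):
    """Quantize floats to the int8 range [-127, 127] using a single absmax scale."""
    absmax = max((abs(v) for v in values), default=0.0) or 1.0
    scale = 127.0 / absmax
    return [round(v * scale) for v in values], scale


def mixed_precision_dot(x, w, threshold=6.0):
    """Dot product that keeps outlier inputs in float and quantizes the rest to int8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i, v in enumerate(x) if abs(v) < threshold]

    # Outlier dimensions are multiplied in floating point, untouched.
    outlier_part = sum(x[i] * w[i] for i in outliers)

    # Remaining dimensions are multiplied as integers, then rescaled back to float.
    xq, x_scale = absmax_quantize([x[i] for i in regular])
    wq, w_scale = absmax_quantize([w[i] for i in regular])
    regular_part = sum(a * b for a, b in zip(xq, wq)) / (x_scale * w_scale)

    return outlier_part + regular_part


x = [0.5, -1.2, 8.0, 0.3]   # 8.0 is the outlier
w = [0.1, 0.4, -0.2, 0.9]
exact = sum(a * b for a, b in zip(x, w))
print("exact:", exact, "mixed:", mixed_precision_dot(x, w))
```

Keeping the outlier out of the int8 path matters because a single large value would otherwise stretch the absmax scale and crush the small values toward zero.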

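- BF16 configuration (the intent here is to downcast from FP32 to BF16). Note that `bnb_4bit_compute_dtype` only sets the compute dtype used by 4-bit layers; on its own, without `load_in_4bit=True`, it is not expected to quantize the weights. Passing `torch_dtype=torch.bfloat16` to `from_pretrained` is the usual way to load weights in BF16.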

In [16]:
quant_config_bf16 = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16)


- Quantization to 8 bits (here we are downcasting from FP32 to 8-bit)

In [17]:
quant_config_8bits = BitsAndBytesConfig(load_in_8bit=True)

- Quantization to 4 bits (here we are downcasting from FP32 to 4-bit)

In [18]:
quant_config_4bits = BitsAndBytesConfig(load_in_4bit=True)

In [19]:
# Load the model with quantization to verify how much memory it takes (uncomment one line at a time)
# NOTE: device_map="auto" did not work in this setup, so everything is placed on GPU 0

#model_16bits= AutoModelForCausalLM.from_pretrained(LLAMA,device_map={"":0},quantization_config=quant_config_bf16)

#model_8bit = AutoModelForCausalLM.from_pretrained(LLAMA, device_map={"":0},quantization_config=quant_config_8bits)

#model_4bit = AutoModelForCausalLM.from_pretrained(LLAMA, device_map={"":0},quantization_config=quant_config_4bits)

In [20]:
#to know memory allocation after loading the model with quantization
print(torch.cuda.memory_allocated()/1e9, "GB allocated")
print(torch.cuda.memory_reserved()/1e9, "GB reserved")

0.0 GB allocated
0.0 GB reserved


In [22]:
#to check the model memory occupation footprint
#memory = model_16bits.get_memory_footprint() / 1e6
#print(f"Memory footprint: {memory:,.1f} MB")

If the next cell gives you a 403 permissions error, then please check:
1. Are you logged in to HuggingFace? Try running `login()` to check your key works
2. Did you set up your API key with full read and write permissions?
3. If you visit the Llama3.1 page at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, does it show that you have access to the model near the top?


In [14]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

In [19]:
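# Generate the output
# Note: this assumes model_16bits was loaded by uncommenting the corresponding line in the quantization cell above

# generate() returns token IDs only
outputs = model_16bits.generate(inputs, max_new_tokens=80)

# Decode the token IDs back into text so we can read the result
print(tokenizer.decode(outputs[0],skip_special_tokens=True))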

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistantuser

Tell a light-hearted joke for a room of Data Scientistsassistant

Why did the linear regression model break up with the decision tree?

Because it couldn't handle the variable relationship and the tree's tendency to over-split!


In [22]:
# Create an attention mask that is 0 for pad tokens and 1 for real tokens
attention_mask = (inputs != tokenizer.pad_token_id).long().to("cuda")
 
outputs = model_16bits.generate(
    inputs,
    attention_mask=attention_mask,  # This mask tells the model which tokens to attend to
    pad_token_id=tokenizer.pad_token_id,  # Explicitly set pad token
    max_new_tokens=80,
)
print(tokenizer.decode(outputs[0],skip_special_tokens=True))

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistantuser

Tell a light-hearted joke for a room of Data Scientistsassistant

Why did the linear regression go to therapy?

Because it was struggling to model its emotions!


In [26]:
# NOTE Clean up: restart kernel


## A couple of quick notes on the next block of code:

I'm using a HuggingFace utility called TextStreamer so that results stream back.
To stream results, we simply replace:  
`outputs = model.generate(inputs, max_new_tokens=80)`  
With:  
`streamer = TextStreamer(tokenizer)`  
`outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)`

Also I've added the argument `add_generation_prompt=True` to my call to create the Chat template. This ensures that Phi generates a response to the question, instead of just predicting how the user prompt continues. Try experimenting with setting this to False to see what happens. You can read about this argument here:

https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts

Thank you to student Piotr B for raising the issue!
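
To see what the generation prompt changes, here is a toy renderer in plain Python. It is only a sketch: real chat templates are Jinja templates shipped with each tokenizer and applied by `apply_chat_template`, and the Phi-3-like markers used here (`<|user|>`, `<|end|>`, `<|assistant|>`) are a simplified assumption. `render_chat` is a hypothetical name.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render chat messages into a single prompt string (simplified, Phi-3-like format)."""
    prompt = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user turn.
        prompt += "<|assistant|>\n"
    return prompt


messages = [{"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}]
print(repr(render_chat(messages)))
print(repr(render_chat(messages, add_generation_prompt=True)))
```

Without the trailing assistant marker, the most likely continuation of the text is not necessarily an answer, which is why the flag matters for instruct models.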

In [7]:
#Kernel restarted 

In [30]:
# Wrapping everything in a function - and adding streaming and generation prompts

def generate(model, messages, quant_config):
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    streamer = TextStreamer(tokenizer)
    # device_map={"":0} already places the model on GPU 0; the extra .to("cuda") fails with 8-bit quantization (see the revised version below)
    model = AutoModelForCausalLM.from_pretrained(model, device_map={"":0}, quantization_config=quant_config).to("cuda")
    outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)

    # Free GPU memory so the next model can be loaded
    del tokenizer, streamer, model, inputs, outputs
    torch.cuda.empty_cache()

In [28]:
generate(PHI3, messages,quant_config_bf16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<|system|> You are a helpful assistant<|end|><|user|> Tell a light-hearted joke for a room of Data Scientists<|end|><|assistant|> Sure, here's a light-hearted joke for a room of Data Scientists:

Why did the data scientist break up with the algorithm?

Because it was always trying to predict their every move and couldn't handle the uncertainty of their love life!<|end|>

Filtered output: You are a helpful assistant Tell a light-hearted joke for a room of Data Scientists Sure, here's a light-hearted joke for a room of Data Scientists:

Why did the data scientist break up with the algorithm?

Because it was always trying to predict their every move and couldn't handle the uncertainty of their love life!


## Accessing Gemma from Google

A student let me know (thank you, Alex K!) that Google also now requires you to accept their terms in HuggingFace before you use Gemma.

Please visit their model page at this link and confirm you're OK with their terms, so that you're granted access.

https://huggingface.co/google/gemma-2-2b-it

In [33]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]
generate(GEMMA2, messages, quant_config_4bits)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<bos><start_of_turn>user
Tell a light-hearted joke for a room of Data Scientists<end_of_turn>
<start_of_turn>model




Why did the data scientist bring a ladder to work? 

Because they heard the data was going to be *high*! 🪜📈 

Let me know if you'd like another one! 😄 
<end_of_turn>


In [34]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

In [41]:
# Wrapping everything in a function - and adding streaming and generation prompts

def generate(model, messages, quant_config):
    # torch.cuda.empty_cache()  # Free up GPU memory
    # torch.cuda.ipc_collect()
    tokenizer = AutoTokenizer.from_pretrained(model)
    # tokenizer.pad_token = tokenizer.eos_token

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    streamer = TextStreamer(tokenizer)

    # No .to("cuda") here, so that this also works with 8-bit quantization
    model = AutoModelForCausalLM.from_pretrained(model, device_map={"":0}, quantization_config=quant_config)
    # With the other quantization configs, appending .to("cuda") also works:
    # model = AutoModelForCausalLM.from_pretrained(model, device_map={"":0}, quantization_config=quant_config).to("cuda")

    outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)

    # Free GPU memory so the next model can be loaded
    del tokenizer, streamer, model, inputs, outputs
    torch.cuda.empty_cache()

In [36]:
generate(QWEN2, messages, quant_config_bf16)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant
Sure! Here's one:

Why did the data scientist break up with the linear regression?

Because he found an outlier in the dataset and couldn't get past it!<|im_end|>


In [40]:
generate(QWEN2, messages, quant_config_8bits)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant
Sure! Here's a light-hearted joke that might be enjoyed by a room full of data scientists:

Why don't scientists trust atoms?

Because they make up everything!

This joke plays on the word "atoms," which is often used to mean "everything" in scientific contexts. It's a playful way to poke fun at how we sometimes use words without fully understanding their meanings or implications.<|im_end|>
