## Remember: even this is a big model for CPU-based machines.
### Install required modules
Use an existing package manager (Conda, uv, or pip) to install the required modules.
This model was run on a CPU-based server with 64 GB of RAM; during inference the CPU stayed at 100% for more than 5 minutes.

In [2]:
import warnings
warnings.filterwarnings("ignore")
import os
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import time

### Check the Torch version and whether Torch is GPU-enabled
CUDA libraries are developed by NVIDIA, and PyTorch exposes them through a Python API. The `+cpu` suffix in the version string below indicates a CPU-only build.

In [3]:
print(f"Torch Version: {torch.__version__}")
print(f"GPU enabled with Pytorch:  {torch.cuda.is_available()}")

Torch Version: 2.6.0+cpu
GPU enabled with Pytorch:  False


### Hugging Face API
1. Create a Hugging Face account if you do not already have one.
2. Create an API token.
3. Configure the token in a `.env` file under the key `HUGGING_FACE_TOKEN`.

In [4]:
load_dotenv()
token = os.getenv("HUGGING_FACE_TOKEN")
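`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file may only surface later as an authentication error. A minimal sketch of failing early instead; `require_env` is a hypothetical helper, not part of `python-dotenv` or `transformers`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Usage (after load_dotenv() has run):
# token = require_env("HUGGING_FACE_TOKEN")
```

Whether to fail hard here is a design choice: public models load without a token, so a warning may be enough in that case.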

Function: Load Model
1. Takes a model name.
2. Loads that model from the Hugging Face model hub into memory.

Note:
1. `from_pretrained` places the model on the CPU unless a device is requested explicitly (for example with `device_map` or `.to("cuda")`); on this machine only a CPU is available.
2. By default, the weights are loaded as FP32.
3. On GPUs, loading can fail if the model exceeds the available GPU memory.


In [6]:
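def load_model(model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"):
    """Load a causal LM and its tokenizer from the Hugging Face hub."""
    # Pass the token to both calls so gated or private repos also work.
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_name, token=token)
    return model, tokenizer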

Load Model in Memory

In [7]:
model, tokenizer = load_model("unsloth/DeepSeek-R1-Distill-Llama-8B")
print("Model loaded")

Loading checkpoint shards: 100%|██████████| 4/4 [00:15<00:00,  3.99s/it]


Model loaded


Let's understand the details of the model:
1. Number of parameters or weights
2. Datatype of weights.
3. CPU / GPU based compute
4. Model Layers

In [11]:
print(f"Number of model parameters: {model.num_parameters()}")
print(f"Approximate (Original) model size: {round(model.get_memory_footprint()/1024/1024/1024)} GB")

Number of model parameters: 8030261248
Approximate (Original) model size: 30 GB


Important things to note here
1. Layers (number of layers)
2. Data Type (Float32)
3. Device: CPU

In [13]:
print(f"Number of Model Layers: {model.config.num_hidden_layers}")
print(f"Model Embedding Size: {model.config.hidden_size}")
print(f"Model Device: {model.device}")
for name, param in model.named_parameters():
    print(name, param.dtype, param.device)

Number of Model Layers: 32
Model Embedding Size: 4096
Model Device: cpu
model.embed_tokens.weight torch.float32 cpu
model.layers.0.self_attn.q_proj.weight torch.float32 cpu
model.layers.0.self_attn.k_proj.weight torch.float32 cpu
model.layers.0.self_attn.v_proj.weight torch.float32 cpu
model.layers.0.self_attn.o_proj.weight torch.float32 cpu
model.layers.0.mlp.gate_proj.weight torch.float32 cpu
model.layers.0.mlp.up_proj.weight torch.float32 cpu
model.layers.0.mlp.down_proj.weight torch.float32 cpu
model.layers.0.input_layernorm.weight torch.float32 cpu
model.layers.0.post_attention_layernorm.weight torch.float32 cpu
model.layers.1.self_attn.q_proj.weight torch.float32 cpu
model.layers.1.self_attn.k_proj.weight torch.float32 cpu
model.layers.1.self_attn.v_proj.weight torch.float32 cpu
model.layers.1.self_attn.o_proj.weight torch.float32 cpu
model.layers.1.mlp.gate_proj.weight torch.float32 cpu
model.layers.1.mlp.up_proj.weight torch.float32 cpu
model.layers.1.mlp.down_proj.weight torch
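Printing every parameter produces hundreds of lines for a 32-layer model. As an alternative, the same information can be condensed into counts per (dtype, device) pair. This is a sketch: `summarize_parameters` is a hypothetical helper, and the stand-in objects below only mimic the `dtype` and `device` attributes of real parameters so the example runs without loading a model.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_parameters(named_parameters) -> Counter:
    """Count parameter tensors per (dtype, device) pair."""
    return Counter((str(p.dtype), str(p.device)) for _, p in named_parameters)


# Stand-in objects so the sketch runs without loading a model; with the real
# model, call summarize_parameters(model.named_parameters()) instead.
fake_params = [
    ("model.embed_tokens.weight", SimpleNamespace(dtype="torch.float32", device="cpu")),
    ("model.layers.0.self_attn.q_proj.weight", SimpleNamespace(dtype="torch.float32", device="cpu")),
    ("model.layers.0.mlp.up_proj.weight", SimpleNamespace(dtype="torch.float32", device="cpu")),
]
for (dtype, device), count in summarize_parameters(fake_params).items():
    print(f"{count} tensors: {dtype} on {device}")
```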

`generate_model_response` generates a response from the model. This function is used multiple times in the experiment.
It takes the model and tokenizer as parameters, along with inference parameters such as the prompt (context), temperature, and top-k (the number of highest-probability tokens considered at each sampling step).
1. The prompt is converted into tokens (token IDs).
2. The inference time is measured.
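The timing step can be sketched in isolation as a small context manager. `timed` is a hypothetical helper name, and the summation is just a placeholder for the call being measured:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


# Usage: wrap only the call you want to measure.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring intervals, whereas `time.time()` can jump if the system clock is adjusted.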


In [14]:
def generate_model_response(
        prompt: str,
        tokenizer: AutoTokenizer,
        model: AutoModelForCausalLM,
        max_length: int = 3500,
        temperature: float = 0.1,
        top_k: int = 50) -> None:
    """Generate a response for `prompt`, print it, and print the time taken."""
    # Tokenize the prompt and move the tensors to the device the model lives on.
    encoded = tokenizer(prompt, return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in encoded.items()}
    start_time = time.time()
    with torch.no_grad():
        # `inputs` already carries the attention mask, so it is not passed again.
        output = model.generate(
            **inputs,
            max_length=max_length,  # counts prompt tokens plus generated tokens
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    end_time = time.time()
    final_output = tokenizer.decode(output[0], skip_special_tokens=True)
    print(final_output)
    print(f"Time taken: {end_time - start_time:.1f} s")
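To see what `temperature` and `top_k` do to the next-token distribution, here is a conceptual sketch in plain Python. It is not the Transformers implementation; the function name and the logits are made up for illustration.

```python
import math


def top_k_temperature_probs(logits: list[float], temperature: float, top_k: int) -> list[float]:
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Indices of the k largest logits; every other token gets probability 0.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in keep}
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled.values())
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    return [weights[i] / total if i in weights else 0.0 for i in range(len(logits))]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = top_k_temperature_probs(logits, temperature=0.1, top_k=3)
flat = top_k_temperature_probs(logits, temperature=1.0, top_k=3)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

With a low temperature such as the default 0.1, almost all probability mass lands on the highest-scoring token, so sampling behaves nearly greedily even though `do_sample=True`.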


Here the model is downcast from Float32 to BFloat16 (Brain Floating Point, a 16-bit format from Google Brain). Refer to the links below for a better understanding:
1. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
2. https://huggingface.co/docs/transformers/main/en/model_memory_anatomy#anatomy-of-models-memory
3. https://huggingface.co/docs/transformers/main/en/quantization/overview
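BFloat16 keeps the sign bit and all 8 exponent bits of Float32 but only the top 7 mantissa bits, so the range is preserved while precision drops. A minimal sketch of that idea using only the standard library; it truncates the low 16 bits for simplicity, whereas libraries generally round, so the last digit can differ from what PyTorch produces:

```python
import struct


def to_bfloat16(value: float) -> float:
    """Round-trip `value` through a truncated bfloat16 representation."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    # Keep sign (1 bit), exponent (8 bits), and the top 7 mantissa bits.
    bits16 = bits32 >> 16
    # Pad with zeros to view the result as a float32 again.
    (result,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return result


for x in (1.0, 3.14159, 0.001, 1e30):
    print(x, "->", to_bfloat16(x))
```

Large values such as `1e30` survive the conversion with a small relative error, which is the practical advantage of BFloat16 over IEEE Float16 for model weights.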


In [15]:
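# nn.Module.to() converts the parameters in place and returns the same object,
# so `model` and `model_bfloat16` both refer to the BFloat16 model from here on.
model_bfloat16 = model.to(dtype=torch.bfloat16)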
# tokenizer_fp16 = tokenizer
print("Model loaded in BFloat16")

Model loaded in BFloat16


The model is now approximately 15 GB, compared with the previous 30 GB, while the number of parameters is unchanged: each weight takes 2 bytes instead of 4, which halves the memory footprint.
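The two sizes follow directly from the parameter count reported earlier multiplied by the bytes per weight. A small sketch of that arithmetic; it counts weights only, so it is an approximation of what `get_memory_footprint()` reports:

```python
GIB = 1024 ** 3
NUM_PARAMS = 8_030_261_248  # from model.num_parameters() above
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gib(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_size_gib(NUM_PARAMS, dtype):.1f} GiB")
```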

In [16]:
print(f"Number of Model Layers: {model.config.num_hidden_layers}")
print(f"Model Embedding Size: {model.config.hidden_size}")
print(f"Model Device: {model.device}")

Number of Model Layers: 32
Model Embedding Size: 4096
Model Device: cpu


Run inference with the model loaded in memory.

In [15]:
generate_model_response("What is the meaning of life?", tokenizer=tokenizer, model=model_bfloat16)

{'input_ids': tensor([[128000,   3923,    374,    279,   7438,    315,   2324,     30]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([1, 1, 1, 1, 1, 1, 1, 1])
What is the meaning of life? This is a question that has been pondered by countless individuals throughout history. Different cultures and philosophies have offered various answers, but none have been universally accepted. So, perhaps the meaning of life is something that each person has to discover for themselves. Let me explore some of the common perspectives on this age-old question.

First, let's consider the philosophical perspective. In philosophy, thinkers like Socrates, Aristotle, and existentialists such as Sartre have all had their own takes on the meaning of life. Socrates believed that the unexamined life is not worth living, suggesting that self-examination and reflection are essential. Aristotle, on the other hand, proposed that the meaning of life is found in the pursuit of knowledge and virtue, a