## Running on GPU-based machines
### Install required modules
Use an existing package manager (conda, uv, or pip) to install the required modules.
When this model was run on a CPU-based server with 64 GB of RAM, inference kept the CPU at 100% for more than 5 minutes.

In [1]:
import os
import time
import warnings

import accelerate  # not called directly; transformers needs it for quantized loading
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

warnings.filterwarnings("ignore")


### Check the Torch version and whether Torch is GPU-enabled
The CUDA libraries are developed by NVIDIA; PyTorch provides a Python abstraction over them.

In [2]:
import torch
print(torch.__version__)
print(torch.version.cuda)


2.6.0+cu124
12.4
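
The version strings show which CUDA build PyTorch was compiled against, but not whether a GPU is usable at runtime. A minimal sketch of that check follows. The helper name `describe_torch_device` is hypothetical, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
import importlib.util


def describe_torch_device() -> dict:
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    info = {"torch_installed": False, "cuda_available": False, "device_name": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["torch_installed"] = True
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # Name of the first visible GPU
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_device())
```

If `cuda_available` is `False` on a machine with a GPU, the installed PyTorch build is likely CPU-only or the driver is not visible to it.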


### Hugging Face API
1. Create a Hugging Face account if you do not already have one.
2. Create an API token.
3. Configure the token in a `.env` file.

In [3]:
load_dotenv()
token = os.getenv("HUGGING_FACE_TOKEN")
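
`os.getenv` returns `None` without complaint when the variable is missing, so a misconfigured `.env` file can go unnoticed. The sketch below fails loudly instead. The helper name `require_env` is hypothetical; it assumes the `.env` file contains a line of the form `HUGGING_FACE_TOKEN=<your token>`.

```python
import os


def require_env(var_name: str = "HUGGING_FACE_TOKEN") -> str:
    """Return the value of an environment variable, failing loudly if it is unset."""
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; add it to the .env file or the environment."
        )
    return value
```

After `load_dotenv()`, calling `token = require_env()` would replace the plain `os.getenv` lookup. For gated or private models, the token can then be passed to `from_pretrained` through its `token` argument.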

Function: Load Model
1. Takes a model name.
2. Loads the model and its tokenizer from the Hugging Face model hub into memory.

Note:
1. The model is loaded onto the GPU or the CPU, depending on the available compute resources.
2. By default, PyTorch stores weights as FP32.
3. On GPUs, loading can fail when the model exceeds the available GPU memory.
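
The notes above can be turned into a quick back-of-the-envelope check before loading anything. This is a rough sketch: the function names and the 24 GiB budget are assumptions for illustration, and the estimate covers weights only, not activations, the KV cache, or quantization metadata.

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def estimate_weight_gib(n_params: int, dtype: str) -> float:
    """Estimate weight memory in GiB, ignoring activations and the KV cache."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


def fits(n_params: int, dtype: str, budget_gib: float) -> bool:
    """Return True if the estimated weights fit within the given memory budget."""
    return estimate_weight_gib(n_params, dtype) <= budget_gib


n_params = 8_030_261_248  # parameter count printed later in this notebook
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimate_weight_gib(n_params, dtype):.1f} GiB")
print(fits(n_params, "fp32", budget_gib=24.0))  # 24 GiB is an assumed GPU size
```

For the 8B model used here, the FP32 estimate is close to 30 GiB, which is why quantization matters on a single consumer GPU.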


In [5]:
def load_model(
        model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
):
    # 4-bit quantization via bitsandbytes. torch_dtype is not a
    # BitsAndBytesConfig option, so it is not passed here.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        # device_map=,
        quantization_config=quantization_config,  #! Quantization
    )
    return model, tokenizer

In [6]:
model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B"
model, tokenizer = load_model(model_name=model_name)

Loading checkpoint shards: 100%|██████████| 4/4 [01:38<00:00, 24.59s/it]


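Review the number of parameters and the approximate size of the model: about 7 GB, against roughly 30 GB for the base model in FP32 (about 8 billion parameters at 4 bytes each). The 7 GB figure assumes 1 byte per parameter, so it is only a rough estimate; 4-bit weights take about half a byte each, and `model.get_memory_footprint()` reports the measured size.

In [8]:
# Rough size estimate that assumes 1 byte per parameter.
print(f"Total Model Parameter: {model.num_parameters()} and approximate size of model {round(model.num_parameters()*1/1024/1024/1024)} GBs")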

Total Model Parameter: 8030261248 and approximate size of model 7 GBs


In [9]:
for name, param in model.named_parameters():
    print(f"{name} is loaded with {param.dtype} and device type {param.device}")

model.embed_tokens.weight is loaded with torch.float16 and device type cuda:0
model.layers.0.self_attn.q_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.self_attn.k_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.self_attn.v_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.self_attn.o_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.mlp.gate_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.mlp.up_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.mlp.down_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.0.input_layernorm.weight is loaded with torch.float16 and device type cuda:0
model.layers.0.post_attention_layernorm.weight is loaded with torch.float16 and device type cuda:0
model.layers.1.self_attn.q_proj.weight is loaded with torch.uint8 and device type cuda:0
model.layers.1.self_attn.k_proj.
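
The per-parameter listing is long, so a summary by dtype and device is easier to scan. In the listing, the `torch.uint8` entries are the quantized linear layers (bitsandbytes packs two 4-bit values into each byte), while the embedding and layer-norm weights stay in `torch.float16`. A minimal sketch follows; `summarize_parameters` is a hypothetical helper, and the stand-in objects only mimic the `dtype` and `device` attributes of real parameters.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_parameters(named_parameters) -> Counter:
    """Count parameter tensors by (dtype, device) pair."""
    return Counter((str(p.dtype), str(p.device)) for _, p in named_parameters)


# Stand-in objects that mimic torch parameters (illustrative only).
fake_params = [
    ("model.embed_tokens.weight",
     SimpleNamespace(dtype="torch.float16", device="cuda:0")),
    ("model.layers.0.self_attn.q_proj.weight",
     SimpleNamespace(dtype="torch.uint8", device="cuda:0")),
    ("model.layers.0.self_attn.k_proj.weight",
     SimpleNamespace(dtype="torch.uint8", device="cuda:0")),
]
summary = summarize_parameters(fake_params)
for (dtype, device), count in summary.items():
    print(f"{count} tensors with {dtype} on {device}")
```

On the loaded model, the same helper would be called as `summarize_parameters(model.named_parameters())`.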

In [10]:
import time

In [11]:
def generate_model_response(
        prompt: str,
        tokenizer: AutoTokenizer,
        model: AutoModelForCausalLM,
        max_length: int = 3500,
        temperature: float = 0.1,
        top_k: int = 50) -> str:
    # Tokenize the prompt; the result holds input_ids and attention_mask.
    encoded = tokenizer(prompt, return_tensors="pt", padding=True)
    # Move the input tensors to the device that holds the model weights.
    inputs = {k: v.to(model.device) for k, v in encoded.items()}
    print(inputs)
    print(encoded["attention_mask"][0])
    start_time = time.time()
    with torch.no_grad():
        # max_length counts prompt tokens plus generated tokens.
        output = model.generate(
            **inputs,  # passes input_ids and attention_mask
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    final_output = tokenizer.decode(output[0], skip_special_tokens=True)
    print(final_output)
    end_time = time.time()
    print(f"Time taken: {end_time-start_time}")
    return final_output
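
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, which is why the decoded text begins with the prompt. The sketch below shows one way to separate the two and to turn the elapsed time into a throughput figure. Both helper names are hypothetical, and the trailing token IDs in the example are made-up placeholders, not real model output.

```python
def split_generated(output_ids, prompt_length):
    """Split a generated sequence into prompt tokens and newly generated tokens."""
    return output_ids[:prompt_length], output_ids[prompt_length:]


def tokens_per_second(n_new_tokens, elapsed_seconds):
    """Compute generation throughput from a token count and a wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_new_tokens / elapsed_seconds


# Prompt IDs as printed by this notebook, followed by placeholder IDs.
prompt_ids = [128000, 3923, 374, 279, 7438, 315, 2324, 30]
output_ids = prompt_ids + [101, 102, 103, 104]  # placeholders, not real output
prompt_part, new_part = split_generated(output_ids, len(prompt_ids))
print(len(new_part), tokens_per_second(len(new_part), elapsed_seconds=2.0))
```

With real tensors, the equivalent slice would be `output[0][inputs["input_ids"].shape[1]:]`, and decoding only that slice returns the answer without the echoed prompt.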

In [12]:
generate_model_response("What is the meaning of life?", tokenizer=tokenizer, model=model)

{'input_ids': tensor([[128000,   3923,    374,    279,   7438,    315,   2324,     30]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
tensor([1, 1, 1, 1, 1, 1, 1, 1])
What is the meaning of life? That's a big question. I think it's different for everyone. For me, I guess it's about finding purpose and being happy. But I'm not entirely sure. Maybe it's about helping others or making a difference in the world. Or maybe it's just about enjoying the journey. I'm still figuring it out.
Okay, so I'm trying to figure out the meaning of life. I know different people have different answers, so I shouldn't be too strict about it. For me, I think it's about finding my purpose and being happy. But I'm not entirely sure if that's the right approach. Maybe it's more about helping others or making a difference in the world. Or maybe it's just about enjoying the journey. I'm still figuring it out.

I remember hearing that some people find their purpos