<a href="https://colab.research.google.com/github/kinjaljoshi/deepseek-quen2.5-quantization/blob/master/deepseek_fp16_gptq_awq_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook evaluates the [**DeepSeek-R1-Distill-Qwen-14B**](https://huggingface.co/Qwen/Qwen2.5-14B) model across different precision levels **(FP16, 8-bit, 4-bit, GPTQ, AWQ, and offloaded)** by testing its performance on simple and complex tasks.
* Simple tasks include factual questions, math problems, and science explanations.
* Complex tasks require multi-step reasoning, such as classification, phishing detection, and fraud analysis.

The model is benchmarked using default generation settings, a **sampling strategy (temperature=0.6, top-p=0.9, one response per prompt)**, and **beam search (beam size 2)**, matching `benchmark_config` and `beam_sizes` in the code below.
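
As a rough illustration of what the sampling settings control, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy logit vector. The function name `nucleus_filter` and the example logits are made up for this illustration, and the defaults mirror `benchmark_config`; this is a simplified stand-in, not the `transformers` implementation.

```python
import math

def nucleus_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for idx in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical scores for a 4-token vocabulary
candidates = nucleus_filter(toy_logits)
print(candidates)
```

Sampling then draws the next token from the surviving candidates only, which is why a lower `top_p` or lower temperature gives less varied output.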

Inference time is measured for each configuration, and results are stored in a structured DataFrame for comparison. The final analysis provides insights into the trade-offs between speed, accuracy, and computational efficiency across various quantization methods and decoding strategies.










In [None]:
!pip install torch transformers accelerate auto-gptq

In [None]:
!pip install -U bitsandbytes

In [None]:
!pip install -U autoawq

In [1]:
import torch
import psutil
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AwqConfig
import time
from auto_gptq import AutoGPTQForCausalLM
from accelerate import dispatch_model



In [2]:

def get_memory_usage():
    """Displays CPU and GPU memory usage."""
    print(f"\n {'+' *5}Memory Usage{'+' *5}")
    print(f"CPU Memory: {psutil.virtual_memory().percent}% used")
    if torch.cuda.is_available():
        print(f"GPU Memory: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB reserved")

def load_model(model_name, precision, offload=False):
    """Load DeepSeek-R1-Distill-Qwen-14B model in given precision."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    if precision == "full":
        # Loading in FP32 requires more than 40 GB of memory, so split layers 70% GPU / 30% CPU
        model = AutoModelForCausalLM.from_pretrained(model_name)

        # Get total number of layers
        num_layers = len(model.model.layers)
        gpu_layers = int(0.7 * num_layers)  # Assign 70% layers to GPU
        cpu_layers = num_layers - gpu_layers  # Remaining 30% to CPU

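        # Create device map (dispatch_model requires every parameter to be covered)
        device_map = {f"model.layers.{i}": "cuda" if i < gpu_layers else "cpu"
                      for i in range(num_layers)}
        device_map["model.embed_tokens"] = "cuda"  # Input embeddings on GPU
        device_map["model.norm"] = "cuda"  # Final layer norm on GPU
        device_map["lm_head"] = "cuda"  # Keep final output head on GPU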

        # Dispatch model across CPU and GPU
        model = dispatch_model(model, device_map=device_map)
        print(f"Assigned {gpu_layers} layers to GPU and {cpu_layers} layers to CPU.")

    elif precision == "FP16":
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
    elif precision == "8bit":
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,
        )
        #.to('cuda') is not needed as model is automatically moved to device when using bitsandbytes
        model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config)
    elif precision == "4bit":
        bnb_config = BitsAndBytesConfig(
                    load_in_4bit=True,  # Enable 4-bit quantization
                    bnb_4bit_compute_dtype=torch.float16,  # Match input dtype for faster inference
                    bnb_4bit_use_double_quant=True  # Further compression optimization
                    )

        model = AutoModelForCausalLM.from_pretrained(
            model_name, quantization_config=bnb_config,device_map="auto")
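    elif precision == "gptq":
        # NOTE: from_pretrained loads the original (unquantized) weights. AutoGPTQ only
        # quantizes when model.quantize(...) is run on calibration examples, or when a
        # pre-quantized checkpoint is loaded with from_quantized.
        model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config={"bits": 4}).to("cuda")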
    elif precision == "awq":
        # Note: transformers expects model_name to point at an AWQ-quantized checkpoint here
        awq_config = AwqConfig(
            bits=4,
            fuse_max_seq_len=512,
            do_fuse=True,
        )

        model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=awq_config).to("cuda")

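    if offload:
        # Reloads the model in default precision, replacing whatever was loaded above
        model = AutoModelForCausalLM.from_pretrained(model_name)

        # Get total number of layers
        num_layers = len(model.model.layers)
        gpu_layers = int(0.5 * num_layers)  # Assign 50% of layers to GPU
        cpu_layers = num_layers - gpu_layers  # Remaining 50% to CPU

        # Create device map (dispatch_model requires every parameter to be covered)
        device_map = {f"model.layers.{i}": "cuda" if i < gpu_layers else "cpu"
                      for i in range(num_layers)}
        device_map["model.embed_tokens"] = "cuda"  # Input embeddings on GPU
        device_map["model.norm"] = "cuda"  # Final layer norm on GPU
        device_map["lm_head"] = "cuda"  # Keep final output head on GPU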

        # Dispatch model across CPU and GPU
        model = dispatch_model(model, device_map=device_map)
        print(f"Assigned {gpu_layers} layers to GPU and {cpu_layers} layers to CPU.")

    model.eval()
    get_memory_usage()
    return model, tokenizer
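
For a back-of-the-envelope check on the figures reported by `get_memory_usage()`, weight storage can be estimated as parameter count times bytes per parameter. The sketch below assumes roughly 14.8 billion parameters (an approximate figure, not taken from this notebook) and ignores activations, the KV cache, and quantization metadata, so real usage will be higher. The names `ASSUMED_PARAMS` and `estimate_weight_gb` are illustrative.

```python
# Assumed parameter count (approximate); adjust for the model actually loaded.
ASSUMED_PARAMS = 14.8e9

# Nominal storage cost per weight for each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "8bit": 1.0, "4bit": 0.5}

def estimate_weight_gb(num_params, precision):
    """Estimate weight storage in GB (1e9 bytes, the unit used by get_memory_usage)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

estimates = {p: estimate_weight_gb(ASSUMED_PARAMS, p) for p in BYTES_PER_PARAM}
for precision, gb in estimates.items():
    print(f"{precision}: ~{gb:.1f} GB of weights")
```

Under this assumption the FP16 estimate lands close to the 29.69 GB reserved in the FP16 run, while the 8-bit and 4-bit runs reserve somewhat more than the nominal estimate.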

In [3]:
simple_tasks = [
    "What is the capital of France, just provide answer, explanation is not needed?",
    "Who wrote 'Lord of the Rings', provide name of author, explanation is not needed?",
    "What is the boiling point of water in degrees centigrade, provide answer, explanation is not needed?",
    "Solve for x: 5x + 10 = 30, solve for x and explanation is not needed",
    "Explain the process of photosynthesis in one sentence."
]

complex_tasks = [
    "You are an AI assistant. First, classify the user's request as either 'math', 'science', or 'general'. "
    "Then, if it's 'math', solve the problem. If it's 'science', explain the concept. If it's 'general', answer normally. "
    "User request: 'What is the integral of x^2 + 3x + 5?'",

    "A user provides an email and asks whether it's a phishing attempt. First, classify the email as 'safe' or 'suspicious'. "
    "Then, if 'suspicious', list the warning signs. If 'safe', explain why it's trustworthy. Email: 'Dear user, your bank account has been compromised. Click this link to secure it now.'",

    "A company wants to classify customer feedback as 'positive', 'neutral', or 'negative'. Once classified, provide a response appropriate for the sentiment. "
    "Customer feedback: 'The delivery was delayed by 5 days and no updates were given. I'm very disappointed.'",

    "Given a dataset with thousands of financial transactions, design a multi-step approach to detect fraud. "
    "Explain how you would preprocess the data, extract features, and implement an anomaly detection system.",

    "You are a python expert, provide python only in response no explanation is needed.Write a Python function to calculate the factorial of a given number.",

    "You are a python expert, provide python only in response no explanation is needed.Generate a Python script that reads a CSV file and calculates the average value of a specific column.",

    "You are a python expert, provide python only in response no explanation is needed.Write a Python class named `BankAccount` with methods for deposit, withdrawal, and balance check.",

    "You are a python expert, provide python only in response no explanation is needed.Create a Python program that fetches real-time weather data using an API and prints the temperature and humidity.",

    """You are a SQL expert, provide SQL Query only in response no explanation is needed.
     Question - Given the following table, write an SQL query to find all employees who have been with the company for more than 5 years:
     Table: Employees
     Columns: employee_id (INT), name (VARCHAR), hire_date (DATE), department (VARCHAR)""",

    """Given the following two tables, write an SQL query to find the total sales per customer:
     Table: Customers
     Columns: customer_id (INT), name (VARCHAR), email (VARCHAR)
     Table: Orders
     Columns: order_id (INT), customer_id (INT), order_date (DATE), total_amount (DECIMAL)""",

    """You are a SQL expert, provide SQL Query only in response no explanation is needed.
      Question - Given the following three tables, write an SQL query to find the top 3 selling products in the last month:
     Table: Products
     Columns: product_id (INT), name (VARCHAR), price (DECIMAL)
     Table: Orders
     Columns: order_id (INT), customer_id (INT), order_date (DATE)
     Table: Order_Items
     Columns: order_item_id (INT), order_id (INT), product_id (INT), quantity (INT)"""
]

In [4]:
benchmark_config = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
    "num_return_sequences": 1,
    "max_new_tokens": 512
}

beam_sizes = [2]

# Sequential run
def run_task(model, tokenizer, prompt, config=None):
    """Runs inference and measures time taken."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    start = time.time()
    if config:
        output = model.generate(**inputs, **config, eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.eos_token_id)
    else:
        output = model.generate(**inputs, max_new_tokens=512,eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.eos_token_id)
    end = time.time()
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response, end - start


# Batch run
def run_batch_task(model, tokenizer, prompts, config=None):
    """Runs batched inference on all prompts and measures total time taken."""

    # Define system prompt
    system_message = "System: You are an AI assistant. Please provide clear, concise, and accurate responses.\nUser: "
    updated_prompts = [system_message + prompt for prompt in prompts]

    inputs = tokenizer(updated_prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")
    start = time.time()
    if config:
        output = model.generate(**inputs, **config)
    else:
        output = model.generate(**inputs, max_new_tokens=512)
    end = time.time()
    responses = [tokenizer.decode(out, skip_special_tokens=True) for out in output]
    return responses, end - start
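
To make the `num_beams` setting concrete, here is a toy beam search over a hand-written next-token table. The table `NEXT_TOKEN_PROBS` and the function `beam_search` are hypothetical and exist only for this illustration; `model.generate` does similar bookkeeping over the model's real vocabulary, with extra details such as length penalties that are omitted here.

```python
import math

# Hypothetical next-token distribution: previous token -> {next token: probability}.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=3):
    """Return the highest-scoring (tokens, log_prob) pair found with num_beams beams."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With one beam the search commits to the locally best first token; with two beams it keeps a second candidate alive and ends up with a more probable full sequence. The cost is that every step scores `num_beams` continuations instead of one.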

## FP16

In [5]:
precision = "FP16"
offload =  False
model, tokenizer = load_model("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", precision if precision != "offloaded" else "FP16", offload)



def test_16bit_model_batch(model, tokenizer):
    results = []
    print(f"\n {'+' *5} Testing {precision} Precision {'+' *5}")

    # Run simple tasks in batch (Default)
    simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks)
    print(f"Batch Simple Task Time (Default): {simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, simple_responses):
        results.append((precision, "Batch Simple", "Default", prompt, simple_time, "-", "-", response))

    # Run complex tasks in batch (Default)
    complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks)
    print(f"Batch Complex Task Time (Default): {complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, complex_responses):
        results.append((precision, "Batch Complex", "Default", prompt, complex_time, "-", "-", response))

    # Run simple tasks with benchmark configuration
    benchmark_simple_responses, benchmark_simple_time = run_batch_task(model, tokenizer, simple_tasks, config=benchmark_config)
    print(f"Batch Simple Task Time (Benchmark): {benchmark_simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, benchmark_simple_responses):
        results.append((precision, "Batch Simple", "Benchmark", prompt, "-", benchmark_simple_time, "-", response))

    # Run complex tasks with benchmark configuration
    benchmark_complex_responses, benchmark_complex_time = run_batch_task(model, tokenizer, complex_tasks, config=benchmark_config)
    print(f"Batch Complex Task Time (Benchmark): {benchmark_complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, benchmark_complex_responses):
        results.append((precision, "Batch Complex", "Benchmark", prompt, "-", benchmark_complex_time, "-", response))

    # Run simple tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks, config=beam_config)
        print(f"Batch Simple Task Time (Beam {beam_size}): {simple_time:.2f}s")
        for prompt, response in zip(simple_tasks, simple_responses):
            results.append((precision, "Batch Simple", f"Beam {beam_size}", prompt, "-", "-", simple_time, response))

    # Run complex tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks, config=beam_config)
        print(f"Batch Complex Task Time (Beam {beam_size}): {complex_time:.2f}s")
        for prompt, response in zip(complex_tasks, complex_responses):
            results.append((precision, "Batch Complex", f"Beam {beam_size}", prompt, "-", "-", complex_time, response))

    # Convert results to DataFrame for analysis
    df = pd.DataFrame(results, columns=["Precision", "Batch Type", "Method", "Task", "Default Time", "Benchmark Time", "Beam Search Time", "Response"])
    #print(df)
    return df




# df_results16_bits = test_16bit_model(model, tokenizer)
df_results_batch = test_16bit_model_batch(model, tokenizer)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

 +++++Memory Usage+++++
CPU Memory: 4.9% used
GPU Memory: 29.69 GB reserved

 +++++ Testing FP16 Precision +++++
Batch Simple Task Time (Default): 15.40s
Batch Complex Task Time (Default): 33.75s
Batch Simple Task Time (Benchmark): 10.90s
Batch Complex Task Time (Benchmark): 33.69s
Batch Simple Task Time (Beam 2): 25.78s
Batch Complex Task Time (Beam 2): 43.45s

In [6]:
df_results_batch.to_csv("test_result_16bits.csv", index=False)
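
The per-precision test functions in this notebook share the same structure, so one option is a single helper that takes the decoding configurations as data and records one timing column instead of three. The sketch below is one possible refactor, not the notebook's code: `run_benchmark`, its `run_fn` parameter, and the stub `fake_run` are hypothetical names. In the notebook, `run_fn` could be something like `lambda prompts, config: run_batch_task(model, tokenizer, prompts, config)`.

```python
def run_benchmark(precision, task_sets, configs, run_fn):
    """Run every (task set, decoding config) pair and return long-format result rows.

    run_fn(prompts, config) must return (responses, elapsed_seconds).
    """
    rows = []
    for batch_type, prompts in task_sets.items():
        for method, config in configs.items():
            responses, elapsed = run_fn(prompts, config)
            for prompt, response in zip(prompts, responses):
                rows.append({
                    "Precision": precision,
                    "Batch Type": batch_type,
                    "Method": method,
                    "Task": prompt,
                    "Time (s)": elapsed,
                    "Response": response,
                })
    return rows

# Stub runner so the sketch executes without a model (for illustration only).
def fake_run(prompts, config):
    return [f"echo: {p}" for p in prompts], 0.0

configs = {
    "Default": None,
    "Benchmark": {"do_sample": True, "temperature": 0.6, "top_p": 0.9, "max_new_tokens": 512},
    "Beam 2": {"num_beams": 2, "max_new_tokens": 512},
}
task_sets = {"Batch Simple": ["q1", "q2"], "Batch Complex": ["q3"]}
rows = run_benchmark("FP16", task_sets, configs, fake_run)
print(len(rows), rows[0])
```

Passing `rows` to `pd.DataFrame` would give one row per prompt and method, without the `"-"` placeholders used in the three separate timing columns.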

## 8-bit

In [5]:
precision = "8bit"
offload =  False
model, tokenizer = load_model("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", precision, offload)



def test_8bits_model_batch(model, tokenizer):
    results = []
    print(f"\n {'+' *5} Testing {precision} Precision {'+' *5}")

    # Run simple tasks in batch (Default)
    simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks)
    print(f"Batch Simple Task Time (Default): {simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, simple_responses):
        results.append((precision, "Batch Simple", "Default", prompt, simple_time, "-", "-", response))

    # Run complex tasks in batch (Default)
    complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks)
    print(f"Batch Complex Task Time (Default): {complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, complex_responses):
        results.append((precision, "Batch Complex", "Default", prompt, complex_time, "-", "-", response))

    # Run simple tasks with benchmark configuration
    benchmark_simple_responses, benchmark_simple_time = run_batch_task(model, tokenizer, simple_tasks, config=benchmark_config)
    print(f"Batch Simple Task Time (Benchmark): {benchmark_simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, benchmark_simple_responses):
        results.append((precision, "Batch Simple", "Benchmark", prompt, "-", benchmark_simple_time, "-", response))

    # Run complex tasks with benchmark configuration
    benchmark_complex_responses, benchmark_complex_time = run_batch_task(model, tokenizer, complex_tasks, config=benchmark_config)
    print(f"Batch Complex Task Time (Benchmark): {benchmark_complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, benchmark_complex_responses):
        results.append((precision, "Batch Complex", "Benchmark", prompt, "-", benchmark_complex_time, "-", response))

    # Run simple tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks, config=beam_config)
        print(f"Batch Simple Task Time (Beam {beam_size}): {simple_time:.2f}s")
        for prompt, response in zip(simple_tasks, simple_responses):
            results.append((precision, "Batch Simple", f"Beam {beam_size}", prompt, "-", "-", simple_time, response))

    # Run complex tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks, config=beam_config)
        print(f"Batch Complex Task Time (Beam {beam_size}): {complex_time:.2f}s")
        for prompt, response in zip(complex_tasks, complex_responses):
            results.append((precision, "Batch Complex", f"Beam {beam_size}", prompt, "-", "-", complex_time, response))

    # Convert results to DataFrame for analysis
    df = pd.DataFrame(results, columns=["Precision", "Batch Type", "Method", "Task", "Default Time", "Benchmark Time", "Beam Search Time", "Response"])
    #print(df)
    return df




df_results_batch = test_8bits_model_batch(model, tokenizer)

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

 +++++Memory Usage+++++
CPU Memory: 4.9% used
GPU Memory: 16.54 GB reserved

 +++++ Testing 8bit Precision +++++
Batch Simple Task Time (Default): 42.71s
Batch Complex Task Time (Default): 131.79s
Batch Simple Task Time (Benchmark): 103.14s
Batch Complex Task Time (Benchmark): 130.00s
Batch Simple Task Time (Beam 2): 102.34s
Batch Complex Task Time (Beam 2): 151.97s

In [6]:
df_results_batch.to_csv("test_result_8bits.csv", index=False)
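
For intuition about what 8-bit weight storage involves, the sketch below quantizes a small vector with absmax scaling and measures the round-trip error. It is a simplified illustration that assumes a single scale for the whole vector; it is not the bitsandbytes kernel, which among other things handles outlier features separately. The function names and example weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one absmax scale for the whole vector."""
    # Fall back to a scale of 1.0 if every value is zero, to avoid dividing by zero.
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.98, -1.27, 0.004]  # made-up example weights
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max round-trip error: {max_error:.4f}")
```

Each weight now needs one byte plus a shared scale, at the price of a rounding error of at most half a quantization step. Small values such as `0.004` can collapse to zero.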

## GPTQ

In [11]:
precision = "gptq"
offload =  False
model, tokenizer = load_model("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", precision, offload)



def test_gptq_model_batch(model, tokenizer):
    results = []
    print(f"\n {'+' *5} Testing {precision} Precision {'+' *5}")

    # Run simple tasks in batch (Default)
    simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks)
    print(f"Batch Simple Task Time (Default): {simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, simple_responses):
        results.append((precision, "Batch Simple", "Default", prompt, simple_time, "-", "-", response))

    # Run complex tasks in batch (Default)
    complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks)
    print(f"Batch Complex Task Time (Default): {complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, complex_responses):
        results.append((precision, "Batch Complex", "Default", prompt, complex_time, "-", "-", response))

    # Run simple tasks with benchmark configuration
    benchmark_simple_responses, benchmark_simple_time = run_batch_task(model, tokenizer, simple_tasks, config=benchmark_config)
    print(f"Batch Simple Task Time (Benchmark): {benchmark_simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, benchmark_simple_responses):
        results.append((precision, "Batch Simple", "Benchmark", prompt, "-", benchmark_simple_time, "-", response))

    # Run complex tasks with benchmark configuration
    benchmark_complex_responses, benchmark_complex_time = run_batch_task(model, tokenizer, complex_tasks, config=benchmark_config)
    print(f"Batch Complex Task Time (Benchmark): {benchmark_complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, benchmark_complex_responses):
        results.append((precision, "Batch Complex", "Benchmark", prompt, "-", benchmark_complex_time, "-", response))

    # Run simple tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks, config=beam_config)
        print(f"Batch Simple Task Time (Beam {beam_size}): {simple_time:.2f}s")
        for prompt, response in zip(simple_tasks, simple_responses):
            results.append((precision, "Batch Simple", f"Beam {beam_size}", prompt, "-", "-", simple_time, response))

    # Run complex tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks, config=beam_config)
        print(f"Batch Complex Task Time (Beam {beam_size}): {complex_time:.2f}s")
        for prompt, response in zip(complex_tasks, complex_responses):
            results.append((precision, "Batch Complex", f"Beam {beam_size}", prompt, "-", "-", complex_time, response))

    # Convert results to DataFrame for analysis
    df = pd.DataFrame(results, columns=["Precision", "Batch Type", "Method", "Task", "Default Time", "Benchmark Time", "Beam Search Time", "Response"])
    #print(df)
    return df




df_results_batch = test_gptq_model_batch(model, tokenizer)



Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

 +++++Memory Usage+++++
CPU Memory: 5.3% used
GPU Memory: 29.69 GB reserved

 +++++ Testing gptq Precision +++++
Batch Simple Task Time (Default): 9.92s
Batch Complex Task Time (Default): 29.63s
Batch Simple Task Time (Benchmark): 28.91s
Batch Complex Task Time (Benchmark): 29.37s
Batch Simple Task Time (Beam 2): 29.22s
Batch Complex Task Time (Beam 2): 40.99s

In [6]:
df_results_batch.to_csv("test_result_gptq.csv", index=False)
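
The memory reading for this run (29.69 GB reserved) matches the FP16 run, which suggests the weights were loaded without being quantized: in AutoGPTQ, `from_pretrained` loads the original checkpoint, and quantization happens in a separate calibration step. The sketch below follows the workflow described in the AutoGPTQ README. The function name `quantize_with_gptq`, the calibration texts, `group_size=128`, and the output directory are placeholder assumptions, and calibrating a 14B model may need more memory and time than a Colab session offers.

```python
def quantize_with_gptq(model_name, calibration_texts, save_dir):
    """Quantize a causal LM to 4-bit GPTQ, save it, and reload the quantized weights."""
    # Imported here so the function can be defined before the packages are installed.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

    # Loads the unquantized checkpoint; nothing is quantized yet.
    model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

    # Calibration examples: tokenized texts with input_ids and attention_mask.
    examples = [tokenizer(text) for text in calibration_texts]
    model.quantize(examples)
    model.save_quantized(save_dir)

    # Reload the packed 4-bit weights for inference.
    quantized = AutoGPTQForCausalLM.from_quantized(save_dir, device="cuda:0")
    return quantized, tokenizer

# Example call (not executed here; paths and texts are placeholders):
# model, tokenizer = quantize_with_gptq(
#     "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
#     ["Some representative calibration text."],
#     "deepseek-r1-distill-qwen-14b-gptq-4bit",
# )
```

If a pre-quantized GPTQ checkpoint is available, calling `from_quantized` on it directly skips the calibration step.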

## 4-bit

**UserWarning**: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default).
This will lead to slow inference or training speed.


This warning means that the model receives inputs in `torch.float16` (which is more efficient for GPU computation), while bitsandbytes (used for 4-bit quantization) performs its computations in `torch.float32` by default. This dtype mismatch can slow down inference and training because:
1. The model has to convert data between float16 and float32, which adds computational overhead.
2. float16 is faster on GPUs (especially NVIDIA Tensor Cores), whereas float32 takes more memory and is slower.

Setting `bnb_4bit_compute_dtype=torch.float16` in `BitsAndBytesConfig`, as `load_model` does for the 4-bit case, avoids the mismatch.


In [5]:
precision = "4bit"
offload =  False
model, tokenizer = load_model("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", precision, offload)



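def test_gptq_model_batch(model, tokenizer):
    """Benchmarks the bitsandbytes 4-bit model (the function name is carried over from the GPTQ cell)."""
    results = []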
    print(f"\n {'+' *5} Testing {precision} Precision {'+' *5}")

    # Run simple tasks in batch (Default)
    simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks)
    print(f"Batch Simple Task Time (Default): {simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, simple_responses):
        results.append((precision, "Batch Simple", "Default", prompt, simple_time, "-", "-", response))

    # Run complex tasks in batch (Default)
    complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks)
    print(f"Batch Complex Task Time (Default): {complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, complex_responses):
        results.append((precision, "Batch Complex", "Default", prompt, complex_time, "-", "-", response))

    # Run simple tasks with benchmark configuration
    benchmark_simple_responses, benchmark_simple_time = run_batch_task(model, tokenizer, simple_tasks, config=benchmark_config)
    print(f"Batch Simple Task Time (Benchmark): {benchmark_simple_time:.2f}s")
    for prompt, response in zip(simple_tasks, benchmark_simple_responses):
        results.append((precision, "Batch Simple", "Benchmark", prompt, "-", benchmark_simple_time, "-", response))

    # Run complex tasks with benchmark configuration
    benchmark_complex_responses, benchmark_complex_time = run_batch_task(model, tokenizer, complex_tasks, config=benchmark_config)
    print(f"Batch Complex Task Time (Benchmark): {benchmark_complex_time:.2f}s")
    for prompt, response in zip(complex_tasks, benchmark_complex_responses):
        results.append((precision, "Batch Complex", "Benchmark", prompt, "-", benchmark_complex_time, "-", response))

    # Run simple tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        simple_responses, simple_time = run_batch_task(model, tokenizer, simple_tasks, config=beam_config)
        print(f"Batch Simple Task Time (Beam {beam_size}): {simple_time:.2f}s")
        for prompt, response in zip(simple_tasks, simple_responses):
            results.append((precision, "Batch Simple", f"Beam {beam_size}", prompt, "-", "-", simple_time, response))

    # Run complex tasks with beam search
    for beam_size in beam_sizes:
        beam_config = {"num_beams": beam_size, "max_new_tokens": 512}
        complex_responses, complex_time = run_batch_task(model, tokenizer, complex_tasks, config=beam_config)
        print(f"Batch Complex Task Time (Beam {beam_size}): {complex_time:.2f}s")
        for prompt, response in zip(complex_tasks, complex_responses):
            results.append((precision, "Batch Complex", f"Beam {beam_size}", prompt, "-", "-", complex_time, response))

    # Convert results to DataFrame for analysis
    df = pd.DataFrame(results, columns=["Precision", "Batch Type", "Method", "Task", "Default Time", "Benchmark Time", "Beam Search Time", "Response"])
    #print(df)
    return df




df_results_batch = test_gptq_model_batch(model, tokenizer)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.

 +++++Memory Usage+++++
CPU Memory: 4.6% used
GPU Memory: 10.03 GB reserved

 +++++ Testing 4bit Precision +++++
Batch Simple Task Time (Default): 51.32s
Batch Complex Task Time (Default): 61.48s
Batch Simple Task Time (Benchmark): 37.69s
Batch Complex Task Time (Benchmark): 62.54s
Batch Simple Task Time (Beam 2): 59.74s
Batch Complex Task Time (Beam 2): 69.87s

In [6]:
df_results_batch.to_csv("test_result_4bits.csv", index=False)
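
To compare the runs side by side, the timings and memory readings printed in the cell outputs above can be collected into one table. The sketch below hard-codes those numbers and computes each configuration's slowdown relative to FP16; the dictionary layout and the helper name `slowdown_vs_fp16` are illustrative choices.

```python
# Wall-clock seconds copied from the cell outputs, in this column order.
COLUMNS = ["simple/default", "complex/default", "simple/benchmark",
           "complex/benchmark", "simple/beam2", "complex/beam2"]
TIMINGS = {
    "FP16": [15.40, 33.75, 10.90, 33.69, 25.78, 43.45],
    "8bit": [42.71, 131.79, 103.14, 130.00, 102.34, 151.97],
    "gptq": [9.92, 29.63, 28.91, 29.37, 29.22, 40.99],
    "4bit": [51.32, 61.48, 37.69, 62.54, 59.74, 69.87],
}
GPU_GB_RESERVED = {"FP16": 29.69, "8bit": 16.54, "gptq": 29.69, "4bit": 10.03}

def slowdown_vs_fp16(precision):
    """Ratio of each timing to the FP16 timing for the same task set and method."""
    return [t / base for t, base in zip(TIMINGS[precision], TIMINGS["FP16"])]

for precision in TIMINGS:
    ratios = ", ".join(f"{c}={r:.2f}x" for c, r in zip(COLUMNS, slowdown_vs_fp16(precision)))
    print(f"{precision:>5} | {GPU_GB_RESERVED[precision]:.2f} GB | {ratios}")
```

In these runs the bitsandbytes 8-bit and 4-bit configurations reserve less GPU memory than FP16 but take longer per batch. Each figure comes from a single run, and batch time also depends on how many tokens each response happens to generate, so the ratios are indicative rather than precise.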