# What I learned so far!!

## Why We Need Quantization

Quantization is essential for several reasons:

1. **Memory Reduction**: It allows large language models to run on devices with limited hardware resources, such as single GPUs or even CPUs.
2. **Efficiency**: Quantization enables more efficient deployment and broader adoption of large language models across various hardware configurations.
3. **Speed**: Some quantization techniques can improve inference speed while maintaining reasonable accuracy.
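
The memory argument above comes down to simple arithmetic: the weight footprint scales linearly with the number of bits stored per parameter. A minimal sketch follows, assuming a hypothetical 1-billion-parameter model and counting weights only (activations, KV cache and framework overhead are ignored); the function name is made up for illustration.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 1-billion-parameter model at common precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(1_000_000_000, bits):.2f} GB")
```

Going from float32 to 4-bit cuts the weight storage by a factor of eight, which is what lets a model that needs a server GPU at full precision fit on a much smaller device.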

## Different Quantization Techniques and Their Use Cases

### Post-Training Quantization (PTQ)

- **Technique**: Applied after the model is fully trained, converting high-precision models (e.g., float32) to lower-precision (e.g., int8) without retraining.
- **Pros**: Easy to implement, no retraining required, faster inference, great for edge devices.
- **Cons**: May lead to significant accuracy degradation, especially for large language models.
- **Use Case**: When quick results and inference speed are crucial, and some accuracy loss is acceptable.
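
To make the float-to-int8 conversion concrete, here is a minimal sketch of symmetric (absmax) quantization on a handful of made-up weights. It is a simplified illustration in plain Python, not what any particular library does internally; real PTQ tools usually work per channel or per group and often use calibration data.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of a list of floats to int8 codes."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error of each weight is bounded by half the scale, so a single large outlier (which inflates the scale) hurts the precision of every other weight in the same group.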

### Quantization-Aware Training (QAT)

- **Technique**: Introduces quantization operations during the training process, allowing the model to adapt to lower-precision operations.
- **Pros**: Better accuracy retention compared to PTQ, especially for complex models like LLaMA.
- **Cons**: Requires additional computational resources and longer training times.
- **Use Case**: When fine-tuning large language models like LLaMA, where maintaining high accuracy is critical.
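
A common way to introduce quantization operations during training is "fake quantization" with a straight-through estimator (STE): the forward pass uses the rounded weight, while the gradient is applied to a full-precision copy as if rounding were the identity. The toy below trains a single scalar weight this way. The grid (step 0.25, 4-bit range), learning rate and target are assumptions chosen for illustration, and real QAT frameworks differ in many details.

```python
def fake_quantize(w, scale=0.25, qmin=-8, qmax=7):
    """Round w to the nearest point on a 4-bit grid, returned as a float."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_step(w, x, y, lr=0.1):
    """One SGD step on squared error using the straight-through estimator."""
    w_q = fake_quantize(w)   # forward pass sees the quantized weight
    error = w_q * x - y
    grad = 2 * error * x     # d(loss)/d(w_q); the STE treats d(w_q)/d(w) as 1
    return w - lr * grad     # update the latent full-precision weight


w = 0.0
for _ in range(50):
    w = train_step(w, x=1.0, y=0.8)

print("latent weight:", w)
print("quantized weight used at inference:", fake_quantize(w))
```

The target 0.8 is not on the grid, so the quantized weight ends up on one of its two grid neighbours (0.75 or 1.0) while the latent weight hovers near the rounding boundary between them.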

### QLoRA (Quantized Low-Rank Adaptation)

- **Technique**: Quantizes base model weights to 4-bit precision while fine-tuning a small set of additional weights (adapters).
- **Use Case**: Efficient fine-tuning of large language models (up to 65B parameters) on limited hardware, such as a single GPU.
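
The "small set of additional weights" is what makes this cheap. For one weight matrix of shape `d_out x d_in`, a LoRA adapter of rank `r` trains two thin matrices, `B` (`d_out x r`) and `A` (`r x d_in`), instead of the full matrix. The sketch below just counts parameters; the layer width 4096 and rank 16 are hypothetical values picked for illustration.

```python
def full_params(d_out, d_in):
    """Parameters updated when fine-tuning the whole matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the adapter: B is (d_out, rank), A is (rank, d_in)."""
    return rank * (d_out + d_in)


d = 4096   # hypothetical layer width
r = 16     # hypothetical adapter rank
full = full_params(d, d)
adapter = lora_params(d, d, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters ({100 * adapter / full:.2f}% of full)")
```

In QLoRA the frozen base matrix is additionally stored in 4-bit form, so both the stored weights and the trainable state shrink.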

### PRILoRA (Pruned and Rank-Increasing Low-Rank Adaptation)

- **Technique**: Builds on LoRA by linearly increasing the rank across model layers and performing ongoing importance-based pruning of weights.
- **Use Case**: Improving efficiency in fine-tuning while maintaining or improving performance compared to full fine-tuning methods.
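
The "linearly increasing rank" part can be pictured as a per-layer schedule. The sketch below interpolates from a starting rank at the first layer to a final rank at the last layer. The endpoints 4 and 12 and the layer count are made-up values, the rounding rule is an assumption, and the importance-based pruning step is not shown.

```python
def linear_rank_schedule(num_layers, r_start, r_end):
    """Return one LoRA rank per layer, growing linearly from r_start to r_end."""
    if num_layers == 1:
        return [r_start]
    step = (r_end - r_start) / (num_layers - 1)
    return [round(r_start + i * step) for i in range(num_layers)]


# Hypothetical 5-layer model with ranks growing from 4 to 12.
print(linear_rank_schedule(5, 4, 12))
```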

### GPTQ (Generative Pre-trained Transformer Quantization)

- **Technique**: Uses layer-wise quantization to convert weights to lower precision (e.g., 4-bit integers) while minimizing output error.
- **Use Case**: Reducing model size to run large language models on a single GPU without significant performance loss.

### GGML/GGUF

- **Technique**: GGML is a C tensor library (the name combines the initials of its author, Georgi Gerganov, with "ML") that underpins llama.cpp and supports several quantized weight formats for CPU inference. GGUF is the successor file format, designed to address GGML's limitations and to support non-Llama models.
- **Use Case**: Enabling large language models to run on CPU devices, increasing accessibility and deployment options.

### AWQ (Activation-Aware Weight Quantization)

- **Technique**: Considers the activations of the model during inference, tailoring the precision of weights to the input characteristics.
- **Use Case**: Minimizing accuracy loss during quantization by preserving the most significant weights based on activation patterns.
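
One way to picture the activation-aware idea: input channels that see large activations get their weights scaled up before rounding, and the activations are scaled down by the same factor so the layer output is mathematically unchanged. The sketch below shows only that rescaling step on made-up numbers. The rule `s = magnitude ** alpha` with `alpha = 0.5` is an assumption for illustration; AWQ itself searches for the exponent using calibration data, and the quantization step is omitted.

```python
def awq_scales(activation_magnitudes, alpha=0.5):
    """Per-input-channel scale: larger typical activation gives a larger scale."""
    return [m ** alpha for m in activation_magnitudes]


def apply_scales(weight_row, activations, scales):
    """Scale weights up and activations down by the same per-channel factor."""
    scaled_w = [w * s for w, s in zip(weight_row, scales)]
    scaled_x = [x / s for x, s in zip(activations, scales)]
    return scaled_w, scaled_x


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weight_row = [0.02, -0.4, 0.3]    # made-up weights for one output neuron
activations = [16.0, 1.0, 0.25]   # made-up activations; channel 0 is the salient one
scales = awq_scales([abs(x) for x in activations])
scaled_w, scaled_x = apply_scales(weight_row, activations, scales)

print("scales:", scales)
print("original output:", dot(weight_row, activations))
print("rescaled output:", dot(scaled_w, scaled_x))
```

The small weight on the salient channel (0.02) becomes 0.08 after scaling, so it sits further from zero on the quantization grid and loses less relative precision when rounded.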

## Stacking Multiple Quantizations

Yes, it is possible to stack multiple quantizations:

- **Compounding Quantization**: This involves quantizing the already quantized parameters recursively.
- **HQQ (Half-Quadratic Quantization)**: This technique can perform compounding quantization by also quantizing its own quantization parameters (scales and zero-points).
- **Trade-off**: While stacking quantizations can further reduce model size, it may lead to additional accuracy loss.
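
One concrete form of stacking is the double quantization described in the QLoRA paper: block-wise quantization stores one scaling constant per block of weights, and those constants are then quantized a second time. The arithmetic below follows the block sizes quoted in that paper (64 weights per block, 8-bit second-level constants in groups of 256); the function names are made up for illustration.

```python
def scale_overhead_bits(block_size, scale_bits):
    """Extra bits per weight spent on one scaling constant per block."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size, scale_bits,
                               second_block_size, second_scale_bits):
    """Overhead when the per-block constants are themselves quantized."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level


single = scale_overhead_bits(block_size=64, scale_bits=32)
double = double_quant_overhead_bits(block_size=64, scale_bits=8,
                                    second_block_size=256, second_scale_bits=32)
print(f"32-bit constants: {single:.3f} extra bits per weight")
print(f"double quantized: {double:.3f} extra bits per weight")
```

The saving is small per weight (roughly 0.37 bits) but adds up over billions of parameters, at the cost of a second rounding step on the constants.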

## Fine-tuning LLMs

When fine-tuning large language models like LLaMA, QAT is generally the better option because:

1. These models are highly sensitive to numerical precision.
2. PTQ might introduce unacceptable accuracy drops, especially in NLP tasks.
3. The investment in QAT during fine-tuning can pay off in accuracy retained after quantization.

### Learnings about the techniques

Several approaches and tools are available for fine-tuning LLMs:

1. **LoRA with PEFT**: Recommended for efficient fine-tuning of base models.
2. **QLoRA**: Combines quantization with LoRA for memory-efficient fine-tuning.
3. **Mistral.rs**: Allows experimenting with different quantization levels and techniques like HQQ and GGUF.
4. **Flash Attention**: Speeds up attention and lowers the memory used by attention activations, but does not shrink the model weights.
5. **Llama Recipes**: Provides guidelines and tools for fine-tuning Llama models.

When fine-tuning, consider:

- Balancing quantization levels with accuracy requirements.
- Using techniques like Flash Attention for speed improvements.
- Exploring personalization options with your own data.
- Utilizing libraries and frameworks designed for efficient LLM fine-tuning.

Remember that the optimal approach may vary depending on your specific use case, hardware constraints, and performance requirements. For LLaMA and similar large language models, QAT is often the preferred choice due to its ability to maintain high accuracy while still benefiting from quantization.

## Techniques I explored!!
- LoRA (implemented)
- AWQ (implemented)
- GPTQ (had issues; could not get a presentable output)
- Flash Attention (had issues; could not get a presentable output)
- Unsloth (implemented)

## Models I explored!!
- Llama 3.2-1B (implemented)
- Llama 3.2-11B-Vision (had issues; could not get a presentable output)

## Issue with Llama 3.2-11B-Vision
- I tried multiple techniques to get this model running but always got this error:
![errorfor11bmodel](errorfor11bmodel.jfif)

- I tried increasing the compute, but the issue remained.
- I understand we need to write some code to manage CUDA memory here, but I don't quite know how to do it. I tried some approaches, but they failed.

## Things I wanted to explore
- Compounding Quantization
- HQQ (Half-Quadratic Quantization)
- Flash Attention

In [1]:
from huggingface_hub import notebook_login, HfApi
# Hugging Face API authentication
notebook_login()
username = "Mubinmodi007"
api = HfApi(token="your_huggingface_token")  # Replace with your actual token


In [2]:

!pip install torch==2.3.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

!pip install autoawq autoawq-kernels


In [3]:
# Install necessary packages
!pip install autoawq
!pip install nvidia-ml-py3
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM # Import AutoModelForCausalLM
from huggingface_hub import notebook_login, HfApi
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
# Load the original model
original_model_path = "meta-llama/Llama-3.2-1B"
original_model = AutoModelForCausalLM.from_pretrained(original_model_path)
original_tokenizer = AutoTokenizer.from_pretrained(original_model_path)

# Load the quantized model
quantized_model_path = "Mubinmodi007/Llama-3.2-1B-AWQ-4bit"
quantized_model = AutoModelForCausalLM.from_pretrained(quantized_model_path)
quantized_tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)


You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [4]:
!pip install datasets
from datasets import load_dataset

boolq = load_dataset("google/boolq")
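# These two datasets ship a loading script, so they need trust_remote_code=True
winogrande = load_dataset("allenai/winogrande", "winogrande_xl", trust_remote_code=True)
piqa = load_dataset("ybisk/piqa", trust_remote_code=True)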
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")




In [5]:
def evaluate_winogrande(model, tokenizer, dataset):
    correct = 0
    total = 0
    for example in dataset:
        input_text = f"{example['sentence']}\nQuestion: Which option best completes the sentence?\nOptions:\n1. {example['option1']}\n2. {example['option2']}\nAnswer:"
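        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device
        outputs = model.generate(**inputs, max_new_tokens=5)
        # outputs[0] holds prompt + continuation, so decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        predicted_answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        # The gold label is "1" or "2"; compare it with the first generated character
        correct += (predicted_answer[:1] == example['answer'])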
        total += 1
    return correct / total

In [6]:
def evaluate_piqa(model, tokenizer, dataset):
    correct = 0
    total = 0
    device = model.device  # Get the device of the model
    for example in dataset:
        input_text = f"Goal: {example['goal']}\nWhich solution is better?\n1. {example['sol1']}\n2. {example['sol2']}\nAnswer:"
        inputs = tokenizer(input_text, return_tensors="pt").to(device) # Move inputs to the same device as the model
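        outputs = model.generate(**inputs, max_new_tokens=5)
        # outputs[0] holds prompt + continuation, so decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        predicted_answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        correct_answer = "1" if example['label'] == 0 else "2"
        # Compare the first generated character with the gold option number
        correct += (predicted_answer[:1] == correct_answer)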
        total += 1
    return correct / total

In [7]:
def evaluate_boolq(model, tokenizer, dataset):
    correct = 0
    total = 0
    for example in dataset:
        input_text = f"Question: {example['question']}\nPassage: {example['passage']}\nAnswer:"
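        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device
        outputs = model.generate(**inputs, max_new_tokens=5)
        # outputs[0] holds prompt + continuation, so decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        predicted_answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip().lower()
        # The gold label is a bool; accept either true/false or yes/no wording
        gold = ("true", "yes") if example['answer'] else ("false", "no")
        correct += predicted_answer.startswith(gold)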
        total += 1
    return correct / total

# Create similar functions for other datasets

In [8]:
def evaluate_arc(model, tokenizer, dataset):
    correct = 0
    total = 0
    for example in dataset:
        input_text = f"Question: {example['question']}\nOptions:\n"
        for i, choice in enumerate(example['choices']['text']):
            input_text += f"{i+1}. {choice}\n"
        input_text += "Answer:"
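        inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device
        outputs = model.generate(**inputs, max_new_tokens=5)
        # outputs[0] holds prompt + continuation, so decode only the new tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        predicted_answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        correct_answer = str(example['choices']['label'].index(example['answerKey']) + 1)
        # Compare the first generated character with the gold option number
        correct += (predicted_answer[:1] == correct_answer)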
        total += 1
    return correct / total

## We can see the model gets some answers correct and gives some very generic answers

In [17]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Create a pipeline, specifying the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

pipe = pipeline("text-generation", model=quantized_model, tokenizer=quantized_tokenizer, device=device)

# Load datasets
datasets = {
    "PIQA": load_dataset("piqa", split="validation[:10]"),
    "BoolQ": load_dataset("boolq", split="validation[:10]"),
    "WinoGrande": load_dataset("winogrande", "winogrande_xl", split="validation[:10]"),
    "ARC-Challenge": load_dataset("ai2_arc", "ARC-Challenge", split="test[:10]")
}

# Function to format messages for each dataset
def format_messages(dataset_name, example):
    if dataset_name == "PIQA":
        return f"Answer the following question about physical intuition.\nGoal: {example['goal']}\nWhich solution is better?\n1. {example['sol1']}\n2. {example['sol2']}\nAnswer:"
    elif dataset_name == "BoolQ":
        return f"Answer the following yes/no question based on the given passage.\nPassage: {example['passage']}\nQuestion: {example['question']}\nAnswer (Yes/No):"
    elif dataset_name == "WinoGrande":
        return f"Complete the sentence by choosing the correct option.\nSentence: {example['sentence']}\nOptions:\n1. {example['option1']}\n2. {example['option2']}\nAnswer (1 or 2):"
    elif dataset_name == "ARC-Challenge":
        options = "\n".join([f"{i+1}. {choice}" for i, choice in enumerate(example['choices']['text'])])
        return f"Answer the following multiple-choice science question.\nQuestion: {example['question']}\nOptions:\n{options}\nAnswer (1, 2, 3, or 4):"
    else:
        print(f"Unknown dataset: {dataset_name}")
        return ""

# Generate prompts for each dataset
all_prompts = []
for dataset_name, dataset in datasets.items():
    print(f"Processing {dataset_name} dataset")
    for i, example in enumerate(dataset):
        prompt = format_messages(dataset_name, example)
        if prompt:
            all_prompts.append((dataset_name, prompt))
        else:
            print(f"No prompt generated for {dataset_name}, example {i}")

# Print the first prompt from each dataset as an example
prompts_by_dataset = {}
for dataset_name, prompt in all_prompts:
    if dataset_name not in prompts_by_dataset:
        prompts_by_dataset[dataset_name] = prompt
        print(f"\n{dataset_name} Example:")
        print(prompt)
        print("-" * 50)

if not all_prompts:
    print("No prompts were generated. Check the dataset loading and message formatting.")
else:
    print(f"Total prompts generated: {len(all_prompts)}")

# Generate responses for each prompt
responses = []
for dataset_name, prompt in all_prompts:
    try:
        response = pipe(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
        responses.append((dataset_name, prompt, response[0]['generated_text']))
    except Exception as e:
        print(f"Error generating response for {dataset_name}: {str(e)}")

# Print a few responses as examples
print("\nExample Responses:")
for dataset_name, prompt, response in responses[:5]:
    print(f"\n{dataset_name}:")
    print("Prompt:", prompt)
    print("Response:", response)
    print("-" * 50)

Using device: cuda


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing PIQA dataset
Processing BoolQ dataset
Processing WinoGrande dataset
Processing ARC-Challenge dataset

PIQA Example:
Answer the following question about physical intuition.
Goal: How do I ready a guinea pig cage for it's new occupants?
Which solution is better?
1. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.
2. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.
Answer:
--------------------------------------------------

BoolQ Example:
Answer the following yes/no question based on the given passage.
Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process co



Example Responses:

PIQA:
Prompt: Answer the following question about physical intuition.
Goal: How do I ready a guinea pig cage for it's new occupants?
Which solution is better?
1. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.
2. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottle and a food dish.
Answer:
Response: Answer the following question about physical intuition.
Goal: How do I ready a guinea pig cage for it's new occupants?
Which solution is better?
1. Provide the guinea pig with a cage full of a few inches of bedding made of ripped paper strips, you will also need to supply it with a water bottle and a food dish.
2. Provide the guinea pig with a cage full of a few inches of bedding made of ripped jeans material, you will also need to supply it with a water bottl

# Here I test the accuracy of the quantized model (AWQ) against the base Llama model and compare the difference on 100 test cases

In [20]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
from tqdm import tqdm
import random

# Load the original model
original_model_path = "meta-llama/Llama-3.2-1B"
original_model = AutoModelForCausalLM.from_pretrained(original_model_path)
original_tokenizer = AutoTokenizer.from_pretrained(original_model_path)

# Load the quantized model
quantized_model_path = "Mubinmodi007/Llama-3.2-1B-AWQ-4bit"
quantized_model = AutoModelForCausalLM.from_pretrained(quantized_model_path)
quantized_tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

# Function to create pipelines for both models
def create_pipeline(model, tokenizer, device):
    return pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

# Function to format prompts for each dataset
def format_prompt(dataset_name, example):
    if dataset_name == "PIQA":
        return f"Goal: {example['goal']}\nWhich solution is better?\n1. {example['sol1']}\n2. {example['sol2']}\nAnswer:"
    elif dataset_name == "BoolQ":
        return f"Passage: {example['passage']}\nQuestion: {example['question']}\nAnswer (Yes/No):"
    elif dataset_name == "WinoGrande":
        return f"Sentence: {example['sentence']}\nOptions:\n1. {example['option1']}\n2. {example['option2']}\nAnswer (1 or 2):"
    elif dataset_name == "ARC-Challenge":
        options = "\n".join([f"{i+1}. {choice}" for i, choice in enumerate(example['choices']['text'])])
        return f"Question: {example['question']}\nOptions:\n{options}\nAnswer (1, 2, 3, or 4):"

# Function to evaluate model on a dataset
def evaluate_model(pipe, dataset, dataset_name, num_samples=100):
    correct = 0
    total = 0
    sampled_data = random.sample(list(dataset), min(num_samples, len(dataset)))

    for example in tqdm(sampled_data, desc=f"Evaluating {dataset_name}"):
        prompt = format_prompt(dataset_name, example)
        response = pipe(prompt, max_new_tokens=5, do_sample=False)[0]['generated_text']

        if dataset_name == "PIQA":
            predicted = response.split("Answer:")[-1].strip()
            correct_answer = str(example['label'] + 1)
        elif dataset_name == "BoolQ":
            predicted = response.split("Answer (Yes/No):")[-1].strip().lower()
            correct_answer = "yes" if example['answer'] else "no"
        elif dataset_name == "WinoGrande":
            predicted = response.split("Answer (1 or 2):")[-1].strip()
            correct_answer = example['answer']
        elif dataset_name == "ARC-Challenge":
            predicted = response.split("Answer (1, 2, 3, or 4):")[-1].strip()
            correct_answer = str(ord(example['answerKey']) - ord('A') + 1)

        if predicted == correct_answer:
            correct += 1
        total += 1

    return correct / total if total > 0 else 0

# Load datasets
datasets = {
    "PIQA": load_dataset("piqa", split="validation"),
    "BoolQ": load_dataset("boolq", split="validation"),
    "WinoGrande": load_dataset("winogrande", "winogrande_xl", split="validation"),
    "ARC-Challenge": load_dataset("ai2_arc", "ARC-Challenge", split="test")
}

# Create pipelines for both models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
original_pipe = create_pipeline(original_model, original_tokenizer, device)
quantized_pipe = create_pipeline(quantized_model, quantized_tokenizer, device)

# Evaluate both models on each dataset
results = {}
for dataset_name, dataset in datasets.items():
    print(f"\nEvaluating {dataset_name}")
    original_accuracy = evaluate_model(original_pipe, dataset, dataset_name)
    quantized_accuracy = evaluate_model(quantized_pipe, dataset, dataset_name)

    accuracy_difference = quantized_accuracy - original_accuracy

    results[dataset_name] = {
        "Original Accuracy": original_accuracy,
        "Quantized Accuracy": quantized_accuracy,
        "Accuracy Difference": accuracy_difference
    }

# Print results
print("\nResults:")
for dataset_name, scores in results.items():
    print(f"\n{dataset_name}:")
    print(f"Original Model Accuracy: {scores['Original Accuracy']:.4f}")
    print(f"Quantized Model Accuracy: {scores['Quantized Accuracy']:.4f}")
    print(f"Accuracy Difference: {scores['Accuracy Difference']:.4f}")

# Calculate average accuracy difference
avg_accuracy_difference = sum(scores['Accuracy Difference'] for scores in results.values()) / len(results)
print(f"\nAverage Accuracy Difference: {avg_accuracy_difference:.4f}")

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Using device: cuda

Evaluating PIQA

Evaluating BoolQ

Evaluating WinoGrande

Evaluating ARC-Challenge


Results:

PIQA:
Original Model Accuracy: 0.0000
Quantized Model Accuracy: 0.0000
Accuracy Difference: 0.0000

BoolQ:
Original Model Accuracy: 0.0000
Quantized Model Accuracy: 0.0000
Accuracy Difference: 0.0000

WinoGrande:
Original Model Accuracy: 0.0100
Quantized Model Accuracy: 0.0500
Accuracy Difference: 0.0400

ARC-Challenge:
Original Model Accuracy: 0.0000
Quantized Model Accuracy: 0.0200
Accuracy Difference: 0.0200

Average Accuracy Difference: 0.0150
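
Accuracies this close to zero on two-way and four-way choice tasks are well below random guessing, which suggests the scoring is at fault rather than the models. A likely cause is that the text-generation pipeline returns the prompt plus up to five new tokens, and the evaluation requires the whole stripped continuation to equal the gold label exactly, so an output such as `1` followed by a line break and more text never matches. A second issue is that `random.sample` is called separately for each model, so the two models are scored on different subsets. The sketch below shows one possible way to handle both. The helper names are made up, and taking the first matching token as the answer is an assumption that can still misread some continuations.

```python
import random
import re


def extract_choice(generated_text, prompt, valid_choices):
    """Return the first valid choice found in the continuation, or None."""
    # Drop the echoed prompt if the pipeline returned prompt + continuation.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    for token in re.findall(r"[A-Za-z0-9]+", continuation.lower()):
        if token in valid_choices:
            return token
    return None


def fixed_sample(items, num_samples, seed=0):
    """Draw the same subset on every call so both models see identical examples."""
    rng = random.Random(seed)
    return rng.sample(list(items), min(num_samples, len(items)))


# Made-up strings imitating what the pipeline returns for a PIQA-style prompt.
prompt = "Goal: x\nWhich solution is better?\n1. a\n2. b\nAnswer:"
generated = prompt + " 1\n2. Provide"
print(extract_choice(generated, prompt, {"1", "2"}))
print(fixed_sample(range(100), 5) == fixed_sample(range(100), 5))
```

With helpers like these, `predicted` would come from `extract_choice(...)` and both models would be evaluated on the list returned by a single `fixed_sample(...)` call.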





# Here I test the accuracy of the quantized model (AWQ) against the base Llama model and compare the difference on all test cases

In [21]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
from tqdm import tqdm

# Load the original model
original_model_path = "meta-llama/Llama-3.2-1B"
original_model = AutoModelForCausalLM.from_pretrained(original_model_path)
original_tokenizer = AutoTokenizer.from_pretrained(original_model_path)

# Load the quantized model
quantized_model_path = "Mubinmodi007/Llama-3.2-1B-AWQ-4bit"
quantized_model = AutoModelForCausalLM.from_pretrained(quantized_model_path)
quantized_tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

# Function to create pipelines for both models
def create_pipeline(model, tokenizer, device):
    return pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

# Function to format prompts for each dataset
def format_prompt(dataset_name, example):
    if dataset_name == "PIQA":
        return f"Goal: {example['goal']}\nWhich solution is better?\n1. {example['sol1']}\n2. {example['sol2']}\nAnswer:"
    elif dataset_name == "BoolQ":
        return f"Passage: {example['passage']}\nQuestion: {example['question']}\nAnswer (Yes/No):"
    elif dataset_name == "WinoGrande":
        return f"Sentence: {example['sentence']}\nOptions:\n1. {example['option1']}\n2. {example['option2']}\nAnswer (1 or 2):"
    elif dataset_name == "ARC-Challenge":
        options = "\n".join([f"{i+1}. {choice}" for i, choice in enumerate(example['choices']['text'])])
        return f"Question: {example['question']}\nOptions:\n{options}\nAnswer (1, 2, 3, or 4):"

# Function to evaluate model on a dataset
def evaluate_model(pipe, dataset, dataset_name):
    correct = 0
    total = 0

    for example in tqdm(dataset, desc=f"Evaluating {dataset_name}"):
        prompt = format_prompt(dataset_name, example)
        response = pipe(prompt, max_new_tokens=5, do_sample=False)[0]['generated_text']

        if dataset_name == "PIQA":
            predicted = response.split("Answer:")[-1].strip()
            correct_answer = str(example['label'] + 1)
        elif dataset_name == "BoolQ":
            predicted = response.split("Answer (Yes/No):")[-1].strip().lower()
            correct_answer = "yes" if example['answer'] else "no"
        elif dataset_name == "WinoGrande":
            predicted = response.split("Answer (1 or 2):")[-1].strip()
            correct_answer = example['answer']
        elif dataset_name == "ARC-Challenge":
            predicted = response.split("Answer (1, 2, 3, or 4):")[-1].strip()
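            # answerKey is usually a letter (A-D) but is already a digit for some ARC items
            key = example['answerKey']
            correct_answer = key if key.isdigit() else str(ord(key) - ord('A') + 1)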

        if predicted == correct_answer:
            correct += 1
        total += 1

    return correct / total if total > 0 else 0

# Load full datasets
datasets = {
    "PIQA": load_dataset("piqa", split="validation"),
    "BoolQ": load_dataset("boolq", split="validation"),
    "WinoGrande": load_dataset("winogrande", "winogrande_xl", split="validation"),
    "ARC-Challenge": load_dataset("ai2_arc", "ARC-Challenge", split="test")
}

# Create pipelines for both models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
original_pipe = create_pipeline(original_model, original_tokenizer, device)
quantized_pipe = create_pipeline(quantized_model, quantized_tokenizer, device)

# Evaluate both models on each dataset
results = {}
for dataset_name, dataset in datasets.items():
    print(f"\nEvaluating {dataset_name}")
    original_accuracy = evaluate_model(original_pipe, dataset, dataset_name)
    quantized_accuracy = evaluate_model(quantized_pipe, dataset, dataset_name)

    accuracy_difference = quantized_accuracy - original_accuracy

    results[dataset_name] = {
        "Original Accuracy": original_accuracy,
        "Quantized Accuracy": quantized_accuracy,
        "Accuracy Difference": accuracy_difference
    }

# Print results
print("\nResults:")
for dataset_name, scores in results.items():
    print(f"\n{dataset_name}:")
    print(f"Original Model Accuracy: {scores['Original Accuracy']:.4f}")
    print(f"Quantized Model Accuracy: {scores['Quantized Accuracy']:.4f}")
    print(f"Accuracy Difference: {scores['Accuracy Difference']:.4f}")

# Calculate average accuracy difference
avg_accuracy_difference = sum(scores['Accuracy Difference'] for scores in results.values()) / len(results)
print(f"\nAverage Accuracy Difference: {avg_accuracy_difference:.4f}")

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Using device: cuda

Evaluating PIQA


Evaluating PIQA:   0%|          | 0/1838 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating BoolQ


Streaming output truncated to the last 5000 lines.
Evaluating BoolQ:  47%|████▋     | 1542/3270 [02:42<03:05,  9.30it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating WinoGrande


Evaluating WinoGrande:   0%|          | 0/1267 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating ARC-Challenge


Evaluating ARC-Challenge:   0%|          | 0/1172 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Results:

PIQA:
Original Model Accuracy: 0.0044
Quantized Model Accuracy: 0.0060
Accuracy Difference: 0.0016

BoolQ:
Original Model Accuracy: 0.0000
Quantized Model Accuracy: 0.0000
Accuracy Difference: 0.0000

WinoGrande:
Original Model Accuracy: 0.0087
Quantized Model Accuracy: 0.0813
Accuracy Difference: 0.0726

ARC-Challenge:
Original Model Accuracy: 0.0009
Quantized Model Accuracy: 0.0119
Accuracy Difference: 0.0111

Average Accuracy Difference: 0.0213
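
By default the `text-generation` pipeline returns the prompt followed by the new tokens, and with `max_new_tokens=5` the text after the answer marker is often longer than a bare `1` or `yes`. The exact comparison `predicted == correct_answer` in `evaluate_model` then counts such a response as wrong, which may explain part of the near-zero scores. A minimal sketch of a more forgiving parser follows; `extract_answer` and its regular expression are assumptions for illustration, not the code that produced the results above.

```python
import re
from typing import Optional, Sequence

def extract_answer(generated_text: str, marker: str, valid_answers: Sequence[str]) -> Optional[str]:
    """Return the first alphanumeric token after `marker` if it is a valid answer, else None."""
    # Keep only the continuation after the last occurrence of the answer marker.
    continuation = generated_text.split(marker)[-1].strip().lower()
    # First alphanumeric token, e.g. "1" from "1. Use a knife" or "yes" from "Yes, because".
    match = re.search(r"[a-z0-9]+", continuation)
    if match is None:
        return None
    token = match.group(0)
    return token if token in valid_answers else None

# Illustrative strings, not real model output
print(extract_answer("Goal: open a jar\nAnswer: 2\nGoal: how", "Answer:", ["1", "2"]))
print(extract_answer("Question: is it?\nAnswer (Yes/No): Yes, it is", "Answer (Yes/No):", ["yes", "no"]))
print(extract_answer("Goal: open a jar\nAnswer: The first", "Answer:", ["1", "2"]))
```

Returning `None` for unparseable responses also makes it possible to report how often the model failed to answer in the expected format, separately from how often it answered incorrectly.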





# Here I test the accuracy of the LoRA fine-tuned Llama model on 100 test cases per dataset

In [25]:
!pip install peft
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel, PeftConfig
import torch
from tqdm import tqdm

# Load the base model and tokenizer
base_model_path = "meta-llama/Llama-3.2-1B"
base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Load and apply LoRA weights
lora_model_path = "Mubinmodi007/Llama-3.2-1B-finetuned"
config = PeftConfig.from_pretrained(lora_model_path)
model = PeftModel.from_pretrained(base_model, lora_model_path)

# Function to create pipeline
def create_pipeline(model, tokenizer, device):
    return pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

# Function to format prompts for each dataset
def format_prompt(dataset_name, example):
    if dataset_name == "PIQA":
        return f"Goal: {example['goal']}\nWhich solution is better?\n1. {example['sol1']}\n2. {example['sol2']}\nAnswer:"
    elif dataset_name == "BoolQ":
        return f"Passage: {example['passage']}\nQuestion: {example['question']}\nAnswer (Yes/No):"
    elif dataset_name == "WinoGrande":
        return f"Sentence: {example['sentence']}\nOptions:\n1. {example['option1']}\n2. {example['option2']}\nAnswer (1 or 2):"
    elif dataset_name == "ARC-Challenge":
        options = "\n".join([f"{i+1}. {choice}" for i, choice in enumerate(example['choices']['text'])])
        return f"Question: {example['question']}\nOptions:\n{options}\nAnswer (1, 2, 3, or 4):"

# Function to evaluate model on a dataset
def evaluate_model(pipe, dataset, dataset_name):
    correct = 0
    total = 0

    for example in tqdm(dataset, desc=f"Evaluating {dataset_name}"):
        prompt = format_prompt(dataset_name, example)
        response = pipe(prompt, max_new_tokens=5, do_sample=False)[0]['generated_text']

        if dataset_name == "PIQA":
            predicted = response.split("Answer:")[-1].strip()
            correct_answer = str(example['label'] + 1)
        elif dataset_name == "BoolQ":
            predicted = response.split("Answer (Yes/No):")[-1].strip().lower()
            correct_answer = "yes" if example['answer'] else "no"
        elif dataset_name == "WinoGrande":
            predicted = response.split("Answer (1 or 2):")[-1].strip()
            correct_answer = example['answer']
        elif dataset_name == "ARC-Challenge":
            predicted = response.split("Answer (1, 2, 3, or 4):")[-1].strip()
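            # answerKey is usually a letter (A-D) but is already a digit for some ARC items
            key = example['answerKey']
            correct_answer = key if key.isdigit() else str(ord(key) - ord('A') + 1)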

        if predicted == correct_answer:
            correct += 1
        total += 1

    return correct / total if total > 0 else 0

# Load datasets (only 100 rows)
datasets = {
    "PIQA": load_dataset("piqa", split="validation[:100]"),
    "BoolQ": load_dataset("boolq", split="validation[:100]"),
    "WinoGrande": load_dataset("winogrande", "winogrande_xl", split="validation[:100]"),
    "ARC-Challenge": load_dataset("ai2_arc", "ARC-Challenge", split="test[:100]")
}

# Create pipeline for the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
pipe = create_pipeline(model, tokenizer, device)

# Evaluate the model on each dataset
results = {}
for dataset_name, dataset in datasets.items():
    print(f"\nEvaluating {dataset_name}")
    accuracy = evaluate_model(pipe, dataset, dataset_name)
    results[dataset_name] = {"Accuracy": accuracy}

# Print results
print("\nResults:")
for dataset_name, scores in results.items():
    print(f"\n{dataset_name}:")
    print(f"Accuracy: {scores['Accuracy']:.4f}")

# Calculate average accuracy
avg_accuracy = sum(scores['Accuracy'] for scores in results.values()) / len(results)
print(f"\nAverage Accuracy: {avg_accuracy:.4f}")

Collecting peft
  Downloading peft-0.13.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.13.0-py3-none-any.whl (322 kB)
Installing collected packages: peft
Successfully installed peft-0.13.0


adapter_config.json:   0%|          | 0.00/728 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Using device: cuda


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausal


Evaluating PIQA


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Evaluating PIQA:   1%|          | 1/100 [00:00<00:16,  5.83it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating BoolQ


Evaluating BoolQ:   0%|          | 0/100 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating WinoGrande


Evaluating WinoGrande:   0%|          | 0/100 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluating ARC-Challenge


Evaluating ARC-Challenge:   0%|          | 0/100 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Results:

PIQA:
Accuracy: 0.0000

BoolQ:
Accuracy: 0.0000

WinoGrande:
Accuracy: 0.0000

ARC-Challenge:
Accuracy: 0.0000

Average Accuracy: 0.0000





## Memory Optimization of AWQ
![awq](awq_stats.jfif)


I was able to quantize Llama 3.2-1B using AWQ and fine-tune Llama 3.2-1B using LoRA, but there seem to be issues with the evaluation: accuracy for both models is very low. The mistake I made is here: I tried to tokenize the dataset, but ran into issues and was not able to get a presentable output, so I used chat completion to generate responses instead, and the responses are hit and miss.
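
To see why the responses are hit and miss, it can help to look at what the models actually generate before scoring them. The sketch below tallies the most common raw predictions next to the expected labels. `tally_predictions` is a hypothetical helper, and the sample pairs are made up for illustration rather than taken from a real run.

```python
from collections import Counter

def tally_predictions(pairs, top=3):
    """Count the most common (predicted, expected) pairs from an evaluation run."""
    counts = Counter((pred.strip(), exp) for pred, exp in pairs)
    return counts.most_common(top)

# Made-up examples of what a 5-token continuation might look like
sample_pairs = [
    ("1\nGoal: how to", "1"),
    ("1\nGoal: how to", "2"),
    ("2", "2"),
    ("1\nGoal: how to", "1"),
]
for (pred, exp), count in tally_predictions(sample_pairs):
    print(f"{count}x predicted={pred!r} expected={exp!r}")
```

If most predictions begin with the right digit but carry trailing text, the scoring is the problem rather than the model.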