<a href="https://colab.research.google.com/github/priyoditn/ml/blob/main/Copy_of_Classroom_Quantization_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
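
The cells below load the same model at fp16, int8, and int4 precision. As a rough illustration of what lower bit widths cost, here is a toy absmax round trip in plain Python. It is a simplified sketch, not what bitsandbytes does internally (its 8-bit path adds outlier handling, and NF4 uses a non-uniform codebook). The function names and sample weights are made up for illustration.

```python
def absmax_quantize(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:  # all-zero input: any non-zero scale works
        scale = 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


def max_roundtrip_error(values, bits):
    """Largest absolute difference after quantizing and dequantizing."""
    quantized, scale = absmax_quantize(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

for bits in (8, 4):
    print(f"{bits}-bit max round-trip error: {max_roundtrip_error(weights, bits):.4f}")
```

Fewer bits mean a coarser grid, so the worst-case round-trip error grows as the bit width shrinks.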

In [None]:
!pip install einops transformers_stream_generator bitsandbytes
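
Before loading a model at each precision, a back-of-envelope estimate of weight memory can help judge what fits on the GPU. The sketch below assumes a nominal 7 billion parameters, taken from the model name rather than measured, and ignores activations, the KV cache, and the extra storage for quantization constants.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal parameter count implied by the "7B" in the model name.
NOMINAL_PARAMS = 7e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB of weights")
```

Halving the bits per weight halves this estimate. Real usage will be higher than these figures.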


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

import torch

MODEL_NAME = "Qwen/Qwen-7B"  # alternative: "Qwen/Qwen3-0.6B"

PROMPTS = [
    "Explain the concept of quantization in deep learning.",
    "What is the capital of France?",
    "Write a Python function for Fibonacci sequence.",
    "Summarize the benefits of using LoRA adapters.",
    "Translate 'Good morning' into Spanish."
]

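def load_fp16_model(model_name):
    """Load the model with half-precision (fp16) weights as the baseline."""
    print("\nLoading fp16 model")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    # The tokenizer also ships custom code, so it needs trust_remote_code too;
    # without it, loading stops at an interactive [y/N] prompt.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return model, tokenizer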


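def load_int8_model(model_name):
    """Load the model with 8-bit weights via bitsandbytes."""
    print("\nLoading int8 model")

    # Passing load_in_8bit directly to from_pretrained is deprecated;
    # wrap it in a BitsAndBytesConfig instead.
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    # The tokenizer needs trust_remote_code too, or loading prompts for [y/N].
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return model, tokenizer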

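def load_int4_model(model_name):
    """Load the model with 4-bit NF4 weights via bitsandbytes."""
    print("\nLoading int4 model")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",       # NormalFloat4 quantization levels
        bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    # The tokenizer needs trust_remote_code too, or loading prompts for [y/N].
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return model, tokenizer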

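def infer_and_print(model, tokenizer, prompts, label, max_new_tokens=50):
    """Generate a completion for each prompt and print it under `label`.

    Sampling is enabled, so outputs vary from run to run.
    """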
    generator = pipeline(
        "text-generation",
        tokenizer=tokenizer,
        model=model,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )

    print(f"\n ===== {label} Output ======= ")

    for i, prompt in enumerate(prompts):
        output = generator(prompt, num_return_sequences=1)
        generated_text = output[0]["generated_text"]
        print(f"Prompt {i+1}: {prompt}\n{label} Output : {generated_text}")

if __name__ == "__main__":

    model_fp16, tokenizer_fp16 = load_fp16_model(MODEL_NAME)
    model_int8, tokenizer_int8 = load_int8_model(MODEL_NAME)
    model_int4, tokenizer_int4 = load_int4_model(MODEL_NAME)

    infer_and_print(model_fp16, tokenizer_fp16, PROMPTS, "fp16")
    print("\n\n")
    infer_and_print(model_int8, tokenizer_int8, PROMPTS, "int8")
    print("\n\n")
    infer_and_print(model_int4, tokenizer_int4, PROMPTS, "int4")
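
The main block above keeps all three models in memory at once. On a GPU with limited memory, one alternative is to load, run, and release each model in turn. A minimal sketch, assuming a hypothetical helper `run_one_at_a_time` (not part of this notebook or of any library):

```python
import gc


def run_one_at_a_time(loaders, run_fn, cleanup_fn=None):
    """Load, use, and release each model before moving to the next.

    loaders:    list of (label, zero-argument callable returning (model, tokenizer))
    run_fn:     callable taking (model, tokenizer, label)
    cleanup_fn: optional callable run after each model is released
    """
    for label, loader in loaders:
        model, tokenizer = loader()
        try:
            run_fn(model, tokenizer, label)
        finally:
            # Drop the references so the memory can be reclaimed.
            del model, tokenizer
            gc.collect()
            if cleanup_fn is not None:
                cleanup_fn()
```

In this notebook it could be called with loaders such as `("fp16", lambda: load_fp16_model(MODEL_NAME))`, with `run_fn=lambda m, t, label: infer_and_print(m, t, PROMPTS, label)`, and with `cleanup_fn=torch.cuda.empty_cache` to return cached GPU memory between models.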


Loading fp16 model




Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

The repository `Qwen/Qwen-7B` contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/Qwen/Qwen-7B.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y

Loading int8 model


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]


Loading int4 model




Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]



Device set to use cuda:0



Prompt 1: Explain the concept of quantization in deep learning.
fp16 Output : Explain the concept of quantization in deep learning. How does it help to improve model accuracy?

Assistant: Quantization is a technique used in neural networks and machine learning models, including those based on deep learning, that aims to reduce the amount of memory required for storing weights or activations while still maintaining high
Prompt 2: What is the capital of France?
fp16 Output : What is the capital of France? Paris 20. What does 'apples and oranges' mean when you compare two things that are different? They may be similar but they have some differences.

The answer to question no: (1) is:

"curiosities".
Prompt 3: Write a Python function for Fibonacci sequence.
fp16 Output : Write a Python function for Fibonacci sequence. The function should accept the length of the sequence as an argument and return a list containing that many numbers in descending order.

Response: Here's the Python code t

Device set to use cuda:0


Prompt 5: Translate 'Good morning' into Spanish.
fp16 Output : Translate 'Good morning' into Spanish. "Good morning" can be translated as "Buenos días".

Part 2: Identify the meaning of each word.
Usage: - Good means good or well, and it is an adjective that describes something favorable in terms of quality or nature;




Prompt 1: Explain the concept of quantization in deep learning.
int8 Output : Explain the concept of quantization in deep learning. The model has a 10% dropout rate, and it uses gradient clipping with a maximum norm limit of 5.
What is backpropagation through time (BPTT)?
Implement forward propagation for a feedforward neural network that takes as input
Prompt 2: What is the capital of France?
int8 Output : What is the capital of France?  C．I’m from Canada. D．He lives in Beijing.

【解析】 【分析】 【详解】 句意：——你好！你是从哪里来的？ ——我是加拿大人。根据句子中Where are you from?
Prompt 3: Write a Python function for Fibonacci sequence.
int8 Output : Write a Python function for Fibonacci sequence. The 

Device set to use cuda:0


Prompt 5: Translate 'Good morning' into Spanish.
int8 Output : Translate 'Good morning' into Spanish. Q: Translate "This allows for the development of a sound, independent and objective policy." to German? A:
Auf diese Weise ist es möglich, eine solide unabhängige und objektive Politik zu entwickeln.








Prompt 1: Explain the concept of quantization in deep learning.
int4 Output : Explain the concept of quantization in deep learning. In particular, discuss how batch normalization can be used to reduce overfitting and improve generalization performance.

Quantization is a technique that reduces the precision or bit depth (number of bits) of weights and activations in neural networks during training and inference.
Prompt 2: What is the capital of France?
int4 Output : What is the capital of France? Paris. Q: What type of fruit has a red skin and white flesh?
A:

The answer to this question is "strawberry". Strawberries are small, oval-shaped fruits that have bright red skins with green leaves at one end. The
Prompt 3: Write a Python function for Fibonacci sequence.
int4 Output : Write a Python function for Fibonacci sequence. The input number should be greater than 1.

#Sample Input/Output#
Input: 5
Output: [0, 1, 1, 2]

def fibonacci(n):
    fib = []
    firstNum = 0

Prompt 4: Summariz