<a href="https://colab.research.google.com/github/harnalashok/LLMs/blob/main/4bit_quantization_huggingface_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 14th June, 2024
# Objective: The technique used here makes the model
#            execute much faster.

# Ref: RADAR: Mistral 7B Tutorial:
#      https://www.datacamp.com/tutorial/mistral-7b-tutorial

# Using Hugging Face models after 4-bit quantization
4-bit quantization needs fewer resources and runs faster, at the cost of a little accuracy.<br>
The technique works only if:
>a) a **GPU** is available;    
>b) **and** your code is configured to use the GPU.    
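
Both conditions can be checked before anything else is installed. The following is a minimal sketch, assuming PyTorch is the framework in use; the helper name `cuda_status` is hypothetical and not part of the original notebook.

```python
import importlib.util


def cuda_status() -> str:
    """Describe whether PyTorch can see a CUDA GPU."""
    # Condition (b): the code can only reach the GPU through an installed torch.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    # Condition (a): a CUDA-capable GPU must be visible to torch.
    if not torch.cuda.is_available():
        return "no CUDA GPU visible to torch"
    return f"GPU found: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

On Colab, a GPU runtime is selected under Runtime > Change runtime type.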


Please see [this article in Medium](https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-transformers-and-bitsandbytes-integration-b4c9983e8996)
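
To give a feel for the resource saving, the arithmetic below estimates the size of the weights alone at 16 bits and at 4 bits per weight. It is a rough sketch: the parameter counts are rounded assumptions, and the estimate ignores quantization constants and any layers that stay in higher precision.

```python
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate parameter counts (assumptions, rounded).
models = {"TinyLlama 1.1B": 1.1e9, "Llama 3 8B": 8.0e9}

for name, n_params in models.items():
    bf16 = weight_size_gb(n_params, 16)
    int4 = weight_size_gb(n_params, 4)
    print(f"{name}: 16-bit ~ {bf16:.2f} GB, 4-bit ~ {int4:.2f} GB")
```

Going from 16 bits to 4 bits cuts the weight storage to roughly a quarter, which is what allows an 8B model to fit on a single modest GPU.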

In [None]:
# 0.1 We need this for 4bit quantization and GPU usage
!pip install accelerate --quiet

In [None]:
# 0.2 Install latest bitsandbytes quantization software:

! pip install -i https://pypi.org/simple/ bitsandbytes --quiet

In [None]:
# 0.3 Usual libraries:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

See [the Transformers quantization documentation](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig) for the `BitsAndBytesConfig()` API.

In [None]:
# 1.0 Create a simple config object to load the model in 4-bit:

bnb_config = BitsAndBytesConfig(
                                load_in_4bit=True,
                               )
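
`load_in_4bit=True` on its own relies on the library defaults. `BitsAndBytesConfig` also exposes the 4-bit quantization type, double quantization, and the compute dtype. The sketch below shows one commonly used combination; the values are illustrative assumptions, not the settings used in this notebook, and the names `NF4_KWARGS` and `make_nf4_config` are hypothetical.

```python
# Illustrative 4-bit settings (assumed values, not the ones used above).
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 instead of the default "fp4"
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}


def make_nf4_config():
    """Build a BitsAndBytesConfig; imports are local so the dict is usable alone."""
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
        **NF4_KWARGS,
    )
```

The object returned by `make_nf4_config()` could be passed as `quantization_config` to `from_pretrained()` in place of `bnb_config`.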


In [None]:
# 1.1 Use the above config object:

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

# 1.1.1 Get tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1.1.2 Download and load model
model = AutoModelForCausalLM.from_pretrained(
                                              model_name,
                                              #load_in_4bit=True,  # Can be used instead of the
                                                                   #  quantization_config argument on the next line
                                              quantization_config=bnb_config,
                                              torch_dtype=torch.bfloat16,    # Recommended dtype for the layers that are not quantized.
                                              device_map="auto",
                                              trust_remote_code=True,
                                            )

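
Once the model is loaded, its actual memory use can be compared with the size of the downloaded weights. This is a small sketch; `footprint_gb` is a hypothetical helper that wraps `get_memory_footprint()`, a method that Transformers models provide and that returns a size in bytes.

```python
def footprint_gb(model) -> float:
    """Return the model's in-memory size in gigabytes (10**9 bytes)."""
    # get_memory_footprint() comes from transformers' PreTrainedModel
    # and reports the size of parameters and buffers in bytes.
    return model.get_memory_footprint() / 1e9
```

In the notebook, `print(f"{footprint_gb(model):.2f} GB")` would report the footprint of the quantized model.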

In [None]:
# 2.0 Instantiate pipeline:

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.bfloat16
                )

In [None]:
# 2.1 Some messages:

messages = [
            {
                "role": "system",
                "content": "You are my personal chef experienced in Indian spicy food",
            },
            {"role": "user",
                 "content": "What should i eat for breakfast today?"
            },
]

In [None]:
# 2.2 Get prompt:

prompt = pipe.tokenizer.apply_chat_template(messages,
                                            tokenize=False,  # Return one formatted string
                                                             #  instead of a list of token ids.
                                                             # Default is True
                                            add_generation_prompt=True  # Appends the tokens that mark the start of
                                                                        # the assistant's reply, so that the model
                                                                        # answers instead of continuing the user's turn.
                                                                        # See: https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts
                                            )
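
What `apply_chat_template()` produces can be illustrated with a hand-written imitation. The real template is stored with the tokenizer and applied by the library; the function below is only a sketch that mimics the layout visible in the generated text printed in step 2.3.1, and `render_chat` is a hypothetical name.

```python
def render_chat(messages, add_generation_prompt=True, eos="</s>"):
    """Hand-written imitation of a Zephyr-style chat template (illustrative only)."""
    parts = [f"<|{m['role']}|>\n{m['content']}{eos}\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply next.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are a helpful cook."},
    {"role": "user", "content": "Suggest a breakfast."},
]
print(render_chat(demo))
```

With `add_generation_prompt=False` the string ends after the user's turn, and the model is free to continue that turn rather than answer it.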

In [None]:
%%time

# 2.3 Apply pipe to task:

outputs = pipe(prompt,
               max_new_tokens=256,
               do_sample=True,
               temperature=0.7, # Default 1.0. Lower values make the output less random
               top_k=50,        # Sample only from the 50 most likely tokens;
                                #  a higher value (100) will give more diverse answers
               top_p=0.95       # Nucleus sampling; a higher value leads to more diverse text
               )    # About 18 seconds




CPU times: user 15.3 s, sys: 50.9 ms, total: 15.3 s
Wall time: 17.9 s
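
The three sampling parameters act as successive filters on the next-token distribution. The sketch below is a simplified, pure-Python illustration of that idea on a toy list of logits; it is not the Transformers implementation, and `filter_probs` is a hypothetical name.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_index: probability} for tokens that survive the filters."""
    # 1. Temperature: dividing logits by T < 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # 2. Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    ranked = ranked[:top_k]
    total = sum(weights[i] for i in ranked)

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize so that the surviving probabilities sum to 1.
    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


print(filter_probs([2.0, 1.0, 0.1, -1.0], top_k=3))
```

The next token is then drawn at random from the surviving entries, which is why a lower `temperature`, `top_k`, or `top_p` gives less varied text.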


In [None]:
# 2.3.1
print(outputs[0]["generated_text"])

<|system|>
You are my personal chef experienced in Indian spicy food</s>
<|user|>
What should i eat for breakfast today?</s>
<|assistant|>
I am not capable of eating for myself. However, here are some breakfast ideas for a vegetarian:

1. Mixed greens with smoked salmon, avocado, and cucumber on a whole wheat English muffin
2. Hard-boiled eggs, black beans, and tortilla strips on whole wheat toast
3. Quinoa bowl with mixed veggies, chickpeas, and avocado
4. Greek yogurt with mixed berries, sliced almonds, and granola
5. Oatmeal with mixed berries, sliced banana, and honey
6. Spinach and feta omelette with mixed berries, sliced banana, and almonds
7. Chia seed pudding with mixed berries, sliced banana, and chia seeds
8. Fresh fruit salad with mixed berries, sliced banana, and almonds
9. Smoothie bowl with mixed berries, banana, and honey
10. Greek yogurt parfait with mixed berries, sliced
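
The printed text contains the prompt as well as the reply. The text-generation pipeline accepts `return_full_text=False` to return only the new text; alternatively, the reply can be cut out of the full string. The helper below is a minimal sketch that assumes the `<|assistant|>` marker seen in the output above; `assistant_reply` is a hypothetical name.

```python
def assistant_reply(generated_text: str, marker: str = "<|assistant|>") -> str:
    """Return the text after the last assistant marker, stripped of whitespace."""
    # rpartition splits on the last occurrence of the marker; if the marker
    # is absent, the whole string is returned (apart from stripping).
    _, _, reply = generated_text.rpartition(marker)
    return reply.strip()
```

In the notebook, `print(assistant_reply(outputs[0]["generated_text"]))` would show only the model's answer.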


In [None]:
# 3.0 We need to download Llama 3.
#     Llama 3 is a gated model with usage conditions,
#     to which I have agreed. Hugging Face, therefore,
#     allows me to download it.

# 3.0.1  Create a text box and enter your access token here:

from getpass import getpass
hf_key = getpass("Hugging Face Key: ")

Hugging Face Key: ··········


In [None]:
# 3.0.2 Log in and
#       save the token to ~/.cache/huggingface/token

!huggingface-cli login --token $hf_key

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
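
The login can also be done from Python, which avoids passing the token through a shell command. The following is a sketch under stated assumptions: `resolve_token` and `login_with_token` are hypothetical helper names, while `HF_TOKEN` is the environment variable that `huggingface_hub` reads and `login(token=...)` is its Python login function.

```python
import os


def resolve_token(env=None):
    """Return the token from the HF_TOKEN environment variable, or None."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or None


def login_with_token(token: str) -> None:
    """Log in through huggingface_hub's Python API."""
    from huggingface_hub import login  # local import: needs huggingface_hub

    login(token=token)
```

A typical use would be `login_with_token(resolve_token() or hf_key)`, falling back to the key typed into the text box.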


In [None]:
# 3.1 Load Llama 3 (8B) with the same 4-bit config:

model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
                                              model_name,
                                              #load_in_4bit=True,
                                              quantization_config=bnb_config,
                                              #torch_dtype=torch.bfloat16,    # This is recommended. Uncomment this
                                              device_map="auto",
                                              trust_remote_code=True,
                                            )


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


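
With `device_map="auto"`, the placement of each module is decided at load time, and modules that do not fit on the GPU may end up on the CPU or on disk. The placement is recorded in the model's `hf_device_map` attribute. The sketch below summarizes such a map; `modules_per_device` is a hypothetical helper and `example_map` is made-up data with the same shape.

```python
from collections import Counter


def modules_per_device(device_map: dict) -> Counter:
    """Count how many modules were assigned to each device."""
    return Counter(str(device) for device in device_map.values())


# Hypothetical map, shaped like the `hf_device_map` attribute.
example_map = {"model.embed_tokens": 0, "model.layers.0": 0, "lm_head": "cpu"}
print(modules_per_device(example_map))
```

In the notebook, `modules_per_device(model.hf_device_map)` would show whether everything landed on the GPU.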

In [None]:
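# 3.1.1 Rebuild the pipeline for Llama 3; otherwise `pipe`
#       still wraps TinyLlama from step 2.0.
#       The base checkpoint may not ship a chat template; the Instruct
#       variant (meta-llama/Meta-Llama-3-8B-Instruct) does.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = pipe.tokenizer.apply_chat_template(messages,
                                            tokenize=False,
                                            add_generation_prompt=True
                                            )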

In [None]:
%%time

# 3.1.2

outputs = pipe(prompt,
               max_new_tokens=256,
               do_sample=True,
               temperature=0.7, # Default 1.0. Lower values make the output less random
               top_k=50,        # Sample only from the 50 most likely tokens;
                                #  a higher value (100) will give more diverse answers
               top_p=0.95       # Nucleus sampling; a higher value leads to more diverse text
               )    # About 18 seconds in the recorded run


CPU times: user 16.5 s, sys: 2.39 ms, total: 16.5 s
Wall time: 18.2 s
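
Wall time is easier to compare across runs when it is expressed as throughput. The arithmetic below is a sketch that assumes the run produced the full `max_new_tokens=256` tokens; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(new_tokens: int, wall_seconds: float) -> float:
    """Average generation speed over one call."""
    return new_tokens / wall_seconds


# 256 new tokens (assumed) in the recorded 18.2 s of wall time.
print(f"{tokens_per_second(256, 18.2):.1f} tokens/s")
```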


In [None]:
# 3.1.3
print(outputs[0]["generated_text"])

<|system|>
You are my personal chef experienced in Indian spicy food</s>
<|user|>
What should i eat for breakfast today?</s>
<|assistant|>
To prepare a delicious and flavorful breakfast, here are some ideas:

1. Tofu scramble with veggies: Cook tofu in a pan with veggies like spinach, mushrooms, bell peppers, and onions. Season with salt, pepper, and spices like paprika, garlic powder, and cumin. Serve with whole grain toast or a side salad for a healthy and filling meal.

2. Greek yogurt parfait: Blend Greek yogurt with fresh berries, honey, and a drizzle of olive oil. Top with granola and fresh fruit for a protein-packed breakfast.

3. Breakfast burrito: Spread a tortilla with black beans, avocado, and cheese. Cover with scrambled tofu and refried beans. Roll up and enjoy with a side of salsa or guacamole.

4. Quinoa and black bean breakfast bowl: Cook quinoa with black beans, diced tomatoes, and chopped veggies like bell peppers and onions. Top with s


In [None]:
######### I am done ##############