<a href="https://colab.research.google.com/github/nym1tthm/github-slideshow/blob/main/Copy_of_LLM_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune and quantize LLM in Google Colab using Q-LoRA



In [None]:
!pip install  accelerate peft bitsandbytes transformers trl



In [None]:
!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
#pip install llama-cpp-python



In [9]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## XLSX TO CSV

In [10]:
import pandas as pd

# Read the XLSX file
xlsx_file = pd.read_excel('/content/emotion.xlsx')

# Convert XLSX to CSV
xlsx_file.to_csv('output.csv', index=False)

In [11]:
# @title prepare data

input_prompt = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""
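
The template has two slots: the first takes the user input and the second takes the response. A minimal sketch of how the slots are filled for training and for inference, and how the answer can be cut back out of generated text, is shown below. The helper names and the `"</s>"` end-of-sequence string are assumptions for illustration; the notebook itself reads the real value from `tokenizer.eos_token`.

```python
# Same template as in the cell above, repeated so this sketch runs on its own.
input_prompt = """Below is a Human Input, write appropriate Response based on the input.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumed placeholder; the notebook uses tokenizer.eos_token


def build_training_text(question: str, answer: str) -> str:
    """Fill both slots and append EOS so the model learns where to stop."""
    return input_prompt.format(question, answer) + EOS_TOKEN


def build_inference_prompt(question: str) -> str:
    """Leave the response slot empty so the model writes it."""
    return input_prompt.format(question, "")


def extract_response(generated_text: str) -> str:
    """Keep only the text after the first '### Response:' marker."""
    response = generated_text.split("### Response:")[1]
    return response.split("###")[0].replace(EOS_TOKEN, "").strip()


train_text = build_training_text("How are you?", "I am fine.")
print(train_text)
print(extract_response(train_text))
```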


## Detailed Explanation of Fine-Tuning Parameters:

This script defines various parameters for fine-tuning a pre-trained model using Low-Rank Adaptation (LoRA) and 4-bit quantization. Here's a breakdown of each group of parameters and its role in fine-tuning:

**Model and Dataset:**

* `model_name`: This specifies the pre-trained model you want to use for fine-tuning. Here, it's set to "TinyLlama/TinyLlama-1.1B-Chat-v1.0" from the Hugging Face hub.
* `new_model`: This defines the name you'll give to the fine-tuned model after training (here, "tiny-llama-fine-tuned").

**LoRA Parameters:**

* `lora_r`: This defines the rank of the LoRA update matrices. It controls the size of the additional parameters introduced for adaptation with LoRA (set to 64 here).
* `lora_alpha`: This parameter controls the scaling applied to the LoRA update. The update is multiplied by `lora_alpha / lora_r`, which is 16 / 64 = 0.25 here.
* `lora_dropout`: This sets the dropout probability for the LoRA layers, helping to prevent overfitting.
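
To get a feel for what `lora_r = 64` means in practice, the arithmetic below estimates how many trainable parameters LoRA adds. It is a rough sketch: the model dimensions are the ones printed by the GGUF conversion log later in this notebook (embedding length 2048, feed forward length 5632, 32 heads, 4 key-value heads, 22 blocks), and it assumes each adapted linear layer gains `r * (in_features + out_features)` parameters with no bias terms.

```python
# Dimensions reported for the model in this notebook's conversion log.
hidden = 2048          # embedding length
intermediate = 5632    # feed forward length
n_heads = 32
n_kv_heads = 4
n_layers = 22          # block count

lora_r = 64
lora_alpha = 16

head_dim = hidden // n_heads        # 64
kv_dim = n_kv_heads * head_dim      # 256, because key/value heads are shared

# (in_features, out_features) of every adapted linear layer in one block.
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A with shape (r, d_in) and B with shape (d_out, r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(i, o, lora_r) for i, o in target_shapes.values())
total = per_layer * n_layers
scaling = lora_alpha / lora_r

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters come to roughly 50 million parameters, a small fraction of the 1.1B-parameter base model.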

**BitsAndBytes Parameters (Quantization):**

* `use_4bit`: This loads the base model with 4-bit weights, which cuts the memory needed to hold the model to roughly a quarter of what FP16 requires.
* `bnb_4bit_compute_dtype`: This sets the data type used for computation (here, "float16"). The 4-bit weights are dequantized to this type for matrix multiplications.
* `bnb_4bit_quant_type`: This specifies the 4-bit data type used for quantization (here, "nf4"; the alternative is "fp4").
* `use_nested_quant`: This enables nested quantization (double quantization), which also quantizes the quantization constants. It might further reduce memory usage but could impact accuracy (disabled here).

**TrainingArguments Parameters:**

* `output_dir`: This defines the directory where the model's predictions and checkpoints are saved during training ("./results" here).
* `num_train_epochs`: This sets the number of training epochs (iterations over the entire dataset). Here, it's set to 20.
* `fp16`, `bf16`: These enable mixed-precision training using 16-bit floating-point (fp16) or bfloat16 data types, potentially accelerating training on compatible hardware (set to False here).
* `per_device_train_batch_size`: This defines the number of training examples processed per GPU during each training step (set to 1 here). Similarly, `per_device_eval_batch_size` defines the batch size for evaluation.
* `gradient_accumulation_steps`: This accumulates gradients for multiple training steps before updating the model weights, potentially improving memory efficiency (set to 1 here).
* `gradient_checkpointing`: Enables gradient checkpointing, which saves memory by storing only a subset of activations during the forward pass and recomputing the rest during backpropagation, at the cost of extra compute (set to True here).
* `max_grad_norm`: This sets the maximum gradient norm for gradient clipping, preventing exploding gradients (set to 0.3 here).
* `learning_rate`: This defines the initial learning rate for the optimizer (AdamW here, set to 2e-4).
* `weight_decay`: This applies weight decay (L2 regularization) to all layers except bias and LayerNorm weights, helping to prevent overfitting (set to 0.001 here).
* `optim`: This specifies the optimizer used for training. Here, it's set to "paged_adamw_32bit".
* `lr_scheduler_type`: This defines the learning rate schedule. Here, "cosine" is used, which gradually reduces the learning rate over training.
* `max_steps`: This sets the total number of training steps (overrides `num_train_epochs`). Here, it's set to -1, meaning all epochs will be used.
* `warmup_ratio`: This defines the portion of training steps for a linear warmup of the learning rate (set to 0.03 here).
* `group_by_length`: This groups sequences of similar lengths into batches, improving memory efficiency and training speed (enabled here).
* `save_steps`: This sets the number of training steps between saving model checkpoints (set to 0 here, meaning no intermediate saves).
* `logging_steps`: This defines the number of training steps between logging training information (set to 25 here).
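
The interaction between `learning_rate`, `warmup_ratio`, and the "cosine" schedule can be sketched in a few lines. This is a simplified illustration, not the trainer's own code: the function name is hypothetical, and the run length assumes 199 training examples (the count shown in this notebook's `Map` output), 20 epochs, batch size 1, and no gradient accumulation.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to base_lr, then cosine decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed run length: 199 examples x 20 epochs, one example per step.
total_steps = 199 * 20

for step in (0, 60, 120, 2000, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With these numbers the warmup covers the first 120 steps, after which the learning rate falls smoothly from 2e-4 toward zero.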

**SFT Parameters:**

* `max_seq_length`: This sets the maximum sequence length for training and inference (can be left as None).
* `packing`: This enables packing multiple short examples into a single input sequence to improve efficiency (disabled here).
* `device_map`: This defines where the model's layers are placed. Here, `{"": 0}` loads the entire model onto GPU 0.

These parameters allow you to fine-tune the pre-trained model for a

In [12]:
# The model that you want to train from the Hugging Face hub
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#model_name = "/content/final_weights_new"

# The instruction dataset to use
#dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "tiny-llama-fine-tuned"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 20

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4 #0.0002 2x10-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0} # "auto"

# Fine-tuning
Parameter-efficient fine-tuning (PEFT) is a technique used to adapt large pre-trained language models (LLMs) to new tasks while significantly reducing the number of parameters that need to be trained. Here's a breakdown of the key points:

**Challenge of Fine-Tuning LLMs:**

* LLMs are massive, with billions of parameters.
* Fine-tuning them on new tasks often requires training all these parameters, which can be:
    * Computationally expensive (takes a long time and requires powerful hardware).
    * Prone to overfitting (the model memorizes the training data instead of learning generalizable patterns).

**PEFT Approach:**

PEFT addresses these challenges by focusing on training only a small subset of the model's parameters while keeping the rest frozen. This allows for:

* **Faster Training:** Fewer parameters to train means faster training times.
* **Reduced Memory Usage:** Gradients and optimizer state are only needed for the small set of trainable parameters, so training requires much less memory.
* **Improved Generalizability:** By not retraining everything, PEFT can help prevent overfitting and improve the model's ability to adapt to unseen data.

**How PEFT Works:**

There are several approaches to PEFT. This notebook uses one of them:

* **Low-Rank Adaptation (LoRA):** A small set of additional low-rank matrices is attached to selected layers of the pre-trained model and acts as "adapters". Only these matrices are trained while the original weights stay frozen, so the model adapts to the new task without changing its core parameters.
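
The LoRA idea can be sketched with plain matrices. The example below uses toy sizes (not this notebook's settings) and hypothetical variable names: a frozen weight `W` gets a low-rank update `B @ A` scaled by `alpha / r`, and the adapter path gives the same output as a single merged weight, which is the identity that merging the adapters after training relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8   # toy sizes, chosen for illustration
scaling = alpha / r

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = rng.normal(size=(d_out, r)) * 0.01  # trainable low-rank factor
# (In practice B usually starts at zero; it is non-zero here to mimic a trained adapter.)

x = rng.normal(size=(3, d_in))          # a batch of 3 input vectors

# During training: frozen path plus the scaled adapter path.
y_adapter = x @ W.T + scaling * (x @ A.T) @ B.T

# After training: fold the adapter into the weight once, W' = W + scaling * B A.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print("same output:", np.allclose(y_adapter, y_merged))
print("full weight parameters:", W.size)
print("LoRA parameters:", A.size + B.size)
```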

**Benefits of PEFT:**

* Enables fine-tuning LLMs on resource-constrained hardware (e.g., a single Colab GPU).
* Reduces training costs associated with large models.
* Can potentially improve the generalizability of the fine-tuned model.

**Overall, PEFT is a valuable technique for making LLMs more accessible and adaptable to a wider range of tasks while keeping computational efficiency in mind.**

Here's a breakdown of why 4-bit quantization is used and what happens to the vectors:

**Why Use 4-Bit Quantization?**

The code uses 4-bit quantization for two main reasons:

1. **Reduced Model Size and Memory Usage:** Storing each weight in 4 bits instead of 16 (FP16) or 32 (FP32) cuts the memory needed for the base model to roughly a quarter or an eighth. This is what lets the model fit in limited GPU memory during fine-tuning, and it also matters when deploying on devices with limited memory, such as mobile phones or embedded systems.

2. **Potentially Faster Inference:** While not guaranteed, using lower precision formats like 4-bit can sometimes lead to faster inference speeds on hardware that supports such operations efficiently. This can be beneficial for real-time applications where quick response times are important.

**Is it Quantization-Aware Fine-Tuning?**

Not in the usual sense. Quantization-aware training simulates quantization while updating the full set of weights. QLoRA, which is what this notebook uses, works differently:

* **`BitsAndBytesConfig`:** This configuration controls how the base model is quantized when it is loaded. The 4-bit base weights stay frozen during training.
* **Target Modules for LoRA:** Only the small LoRA matrices attached to the listed modules are trained, and they are kept in higher precision. Gradients flow through the dequantized base weights into the adapters, so the adapters learn on top of the quantized model.

**What Happens to the Vectors During Quantization?**

During 4-bit quantization, the model's weights are converted from floating point to 4-bit codes. With bitsandbytes only the weights are quantized; activations stay in the compute dtype. The conversion involves:

1. **Scaling:** The weights are split into small blocks, and each block is scaled by its largest absolute value so that it fits the range representable with 4 bits. The scale is stored alongside the block.
2. **Rounding or Quantization:** Each scaled value is mapped to the nearest of the 16 levels available with 4 bits. With "nf4" these levels are spaced to match normally distributed weights rather than evenly.
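
The two steps can be illustrated with a small block-wise quantizer. This is a simplified sketch, not the bitsandbytes implementation: it uses evenly spaced signed integer levels where NF4 uses a non-uniform table, and the block size of 64 and the function names are assumptions for illustration.

```python
import numpy as np


def quantize_blockwise(weights, block_size=64, bits=4):
    """Absmax quantization: one float scale per block, signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4 bits
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                      # guard against all-zero blocks
    codes = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return codes, scales


def dequantize_blockwise(codes, scales, shape):
    """Map the integer codes back to approximate float weights."""
    return (codes.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(8, 64)).astype(np.float32)

codes, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, scales, w.shape)

print("codes range:", codes.min(), "to", codes.max())
print("max abs error:", float(np.abs(w - w_hat).max()))
print("half of largest scale:", float(scales.max() / 2))
```

The reconstruction error of each weight is at most half of its block's scale, which is why small blocks with their own scales keep the error low.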

**Impact on Accuracy:**

Quantization, especially aggressive quantization like 4-bit, can introduce some loss of accuracy compared to the original FP32 model. However, the goal is to find a balance between reduced model size/inference speed and acceptable accuracy for the specific task.

**Additional Notes:**

* `bnb_4bit_quant_type` selects the 4-bit data type: "nf4" (NormalFloat, used here) or "fp4".
* `compute_dtype` (float16 here) is the data type the 4-bit weights are dequantized to for matrix multiplications during training and inference.

4-bit quantization aims to reduce model size and potentially speed up inference while considering the trade-off with accuracy.

In [13]:
# Load dataset (you can process it here)
#dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
import torch


compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """Turn each question/answer pair into one training string."""
    questions = examples["Questions"]
    answers = examples["Answers"]
    texts = []
    for question, answer in zip(questions, answers):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = input_prompt.format(question, answer) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# For a dataset with "instruction"/"output" columns, read those two columns
# in formatting_prompts_func instead of "Questions"/"Answers".

dataset = load_dataset('csv', data_files='output.csv', split="train")
#dataset = load_dataset("nmdr/Mini-Physics-Instruct-1k", split = "train")
dataset = dataset.map(formatting_prompts_func, batched=True)
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
      target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    # Pass the flag defined above; it was set but never handed to the trainer
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,

)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/199 [00:00<?, ? examples/s]


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



| Step | Training Loss |
|------|---------------|
| 25   | 2.1196 |
| 50   | 1.3551 |
| 75   | 1.1344 |
| 100  | 0.9273 |
| 125  | 1.0311 |
| 150  | 0.9449 |
| 175  | 0.9664 |
| 200  | 0.9544 |
| 225  | 0.8174 |
| 250  | 0.7262 |




In [None]:
!pip install transformers accelerate bitsandbytes

Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.31.0


In [None]:
!pip install torch


## Self-Attention with Query, Key, Value

Self-attention is a powerful mechanism in transformers that allows the model to focus on relevant parts of the input sequence when processing information. It works with three key components: query, key, and value.

**Analogy:** Imagine you're at a party and want to find someone specific (the answer). You (the model) ask everyone at the party a question (the query) to identify potential matches. This question could be "Are you interested in X?". Everyone responds with a short description of themselves (the key). You then compare these descriptions to what you're looking for (compare query and key). Finally, you talk to the people whose descriptions seem most relevant (high comparison score) and get more information from them (the value).

**Formally:**

* **Query (Q):** A vector representing the current focus of attention. It's like your question at the party.
* **Key (K):** A vector representing each element in the input sequence. It's like the short description of each person at the party.
* **Value (V):** A vector containing the actual information associated with each element in the sequence. It's like the detailed information you get from the relevant people.

The model calculates a score for each element in the sequence based on how well its "key" matches the "query." Higher scores indicate a better match. Finally, the model uses these scores to weight the "values" from each element, creating a new representation that focuses on the most relevant parts of the sequence.

**Example:**

Consider the sentence "The cat sat on the mat."

* **Query:** The query vector could represent the word we're currently focusing on, say "sat."
* **Key:** Each word in the sentence would have a key vector. For example, the key vector for "cat" might capture its semantic meaning (e.g., furry animal).
* **Value:** The value vector for each word is a learned projection of its embedding and carries the information that gets passed along when the word is attended to.

The model would compare the query vector for "sat" with the key vectors of all words. The key vector for "cat" might receive a higher score than others because "cat" is the subject doing the sitting. After the scores are normalized with a softmax, the model uses them to weight the value vectors, so "cat" has more influence on the new representation of "sat."
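
The query, key, and value computation can be written out for the example sentence. This is a minimal single-head sketch with random, untrained projection matrices and toy dimensions, so the printed weights carry no linguistic meaning. It also leaves out the causal mask and multi-head splitting that a Llama-style model uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how well each query matches each key
    weights = softmax(scores, axis=-1)        # one row of weights per query token
    return weights @ V, weights               # weighted sum of the values


rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_head = 8, 4                        # toy sizes for illustration

X = rng.normal(size=(len(tokens), d_model))   # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)

print("output shape:", output.shape)
print("attention weights for 'sat':", weights[2].round(3))
```

Each row of `weights` sums to 1, so the new representation of a token is a weighted average of the value vectors of all tokens.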

## Gate, Up-proj, Down-proj, and O

These terms refer to specific parts within a transformer block that process information:

**o_proj (Output projection):** This linear layer is part of the self-attention mechanism. It takes the concatenated outputs of all attention heads (the score-weighted values) and projects them back to the embedding dimension, combining the information gathered by the different heads into a single representation. In this notebook `o_proj` is listed in `target_modules`, so it receives LoRA adapters just like the other projections.


* **Gate:** This is a linear layer within the MLP (multi-layer perceptron) sub-block of a transformer. It takes the hidden state (current representation of the sequence) and projects it to a higher dimension, then applies a non-linear activation function (SiLU in LLaMA-style models). The result acts as a gate on the output of `up_proj`.

 `gate_proj`, `up_proj`, and `down_proj` are all part of a transformer block, specifically within the **MLP sub-block**. They perform linear projections on the hidden state, which represents the current understanding of the sequence at that point in processing.

Here's a breakdown of their roles and what they project to:

* **gate_proj:** This linear layer projects the hidden state to the **higher intermediate dimension** (from 2048 to 5632 in TinyLlama). A SiLU activation is applied to its output, which then controls how much of each `up_proj` feature passes through.

* **up_proj:** This layer works in parallel with `gate_proj`, not after it. It projects the same hidden state to the **same intermediate dimension**, and its output is multiplied element-wise by the activated gate.

* **down_proj:** Finally, this layer projects the gated intermediate representation back to the **original embedding dimension**.

**Overall Flow:**

1. The hidden state, representing the current understanding of the sequence, is fed into both `gate_proj` and `up_proj`.
2. `gate_proj` projects it to the intermediate dimension, and an activation function (SiLU) is applied to the result.
3. `up_proj` projects the same hidden state to the intermediate dimension, and the two results are multiplied element-wise.
4. Finally, `down_proj` projects the product back to the original embedding dimension.
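
The data flow through the three projections in a LLaMA-style MLP can be sketched as `down_proj(silu(gate_proj(x)) * up_proj(x))`. The example below uses random weights and toy sizes (TinyLlama uses 2048 and 5632); the function names are chosen for illustration.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def llama_mlp(x, W_gate, W_up, W_down):
    """Gated MLP: the activated gate scales the up projection element-wise."""
    gate = silu(x @ W_gate.T)      # hidden -> intermediate, then activation
    up = x @ W_up.T                # hidden -> intermediate, in parallel
    return (gate * up) @ W_down.T  # intermediate -> hidden


rng = np.random.default_rng(0)
hidden, intermediate = 8, 22       # toy sizes for illustration

W_gate = rng.normal(size=(intermediate, hidden))
W_up = rng.normal(size=(intermediate, hidden))
W_down = rng.normal(size=(hidden, intermediate))

x = rng.normal(size=(3, hidden))   # hidden states for 3 tokens
y = llama_mlp(x, W_gate, W_up, W_down)

print(x.shape, "->", y.shape)
```

Both `gate_proj` and `up_proj` read the same input, and only `down_proj` changes the dimension back, so the output can be added directly to the residual stream.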

**Where it's Used:**

This output is then added to the residual connection (the hidden state that entered the MLP sub-block). In LLaMA-style models the normalization step (RMSNorm) is applied before each sub-block rather than after it. The sum becomes the new hidden state for the next transformer block in the sequence, allowing the model to build a deeper understanding as it processes the entire sequence.

**In Summary:**

* `gate_proj`, `up_proj`, and `down_proj` are within the **MLP sub-block** of a transformer block.
* They project the hidden state to explore complex relationships in the sequence data.
* `gate_proj` and `up_proj` both project to the same, higher intermediate dimension, and their outputs are multiplied element-wise after the gate is activated.
* `down_proj` projects back to the original dimension for a compressed but informative representation.
* This final representation is used to update the hidden state for the next transformer block.

In [14]:
##Inference
inputs = tokenizer(
[
    input_prompt.format(
        "What specific topics does Nadi cover?", # input
       "",    # leave blank as response generated by AI

    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
generated_text = tokenizer.batch_decode(outputs)[0]
first_response = generated_text.split('### Response:')[1].strip()
output = first_response.split('###')[0].strip()
print("the response is: ",output)

the response is:  Nadi covers a range of topics including anxiety, depression, and self-help. They provide a safe space for individuals to share their experiences and receive support. The platform emphasizes personal growth and self-awareness. Nadi aims to provide a supportive community where individuals can heal and grow


In [15]:
# Reload model in FP16 and merge it with the LoRA weights: W = W + delta_W
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()  # W = W + delta_W

# Reload tokenizer to save it
#tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
#tokenizer.pad_token = tokenizer.eos_token
#tokenizer.padding_side = "right"

In [16]:
output_dir = "final_weights_new"
model.save_pretrained(output_dir)

In [17]:

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(output_dir)

('final_weights_new/tokenizer_config.json',
 'final_weights_new/special_tokens_map.json',
 'final_weights_new/tokenizer.model',
 'final_weights_new/added_tokens.json',
 'final_weights_new/tokenizer.json')

# Hugging Face inference of saved model

In [18]:
# Run text generation pipeline with our new model
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/content/final_weights_new")
model = AutoModelForCausalLM.from_pretrained("/content/final_weights_new")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=64)
prompt=input_prompt.format(
        "who is Nandakishor", # input
        "", # leave blank as response generated by AI

    )
result = pipe(prompt)
generated_text  = result[0]['generated_text']

first_response = generated_text.split('### Response:')[1].strip()
first_response = first_response.split("\n")[0]

print(first_response)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Nandakishor is a Hindu deity associated with the Himalayas and the goddess Kali. He is often dep


# Quantization
## Key Concepts:
**GGUF**: The binary model file format used by llama.cpp and the successor to the older GGML format (the name is often expanded as "GPT-Generated Unified Format"). It stores the weights, tokenizer, and metadata of models such as Llama in a single file and supports many quantized weight types.

**llama.cpp**: A C/C++ library and set of command-line tools for running GGUF models, including the conversion and quantization tools used below.

**LoRA (Low-Rank Adaptation)**: A parameter-efficient fine-tuning technique that trains small low-rank adapter matrices instead of the full weights.

**Quantization**: Converting floating-point model weights to lower-precision integers for reduced model size and faster inference.

## Quantization Methods:
1. **Format Breakdown:**
Names such as `Q4_K_M` follow the pattern Q#_K_[S/M/L]:

* #: Number of bits used per weight (e.g., Q4 = 4 bits).
* K: Marks the "k-quant" family, which quantizes weights in blocks grouped into super-blocks, each with its own scale. It does not involve low-rank matrix factorization.
* [S/M/L]: Which mix of precisions is used across the model's tensors:
    * S: Small (smallest file, lowest precision).
    * M: Medium (balance between size and precision).
    * L: Large (largest file, highest precision).

2. **Conversion Step:**
Imagine model weights residing in an apartment complex (the Hugging Face checkpoint in FP16 format).
Conversion acts like a move into a new building: `convert-hf-to-gguf.py` rewrites the weights, tokenizer, and metadata into a single GGUF file that the llama.cpp tools can read.
No actual quantization happens here (the output is still FP16); it's all about getting ready for the big transformation.

3. **Quantization Step:**
Now the actual transformation begins, using `llama-quantize`.

General Process:

* Scaling: Like measuring wall sizes before applying paint, a scaling factor is computed for each small block of weights from the weights themselves. No calibration dataset is needed for this basic workflow.
* Quantization: The scaled weights are mapped to integer values within a limited range, like assigning each shade to the nearest color in a small palette.
* Mixed precision (K methods): Not every tensor gets the same treatment. In the `_S`/`_M`/`_L` mixes, some tensors that matter more for accuracy are kept at a higher bit width.
* No fine-tuning: The model is not trained again after this step. The quantized file is used for inference as-is.

Merged LoRA Weights:
Imagine LoRA adapters as extensions added to the apartment complex. They hold task-specific knowledge.
In this notebook the extensions were already merged into the main building with `merge_and_unload()` before conversion, so there are no separate adapter tensors at this point. The task-specific knowledge lives inside the ordinary weight matrices and is quantized together with everything else, which keeps the whole model consistent.

Choosing the Right Method:
It's like picking the perfect renovation plan:

* Desired Size Reduction: How much do you want to shrink the apartment complex (model)?
* Accuracy Trade-off: How much "color change" can you tolerate?
* Hardware Compatibility: Will your neighbors (hardware) appreciate the new layout and materials?
* Verification: Do you have the tools and time to check that the quantized model still answers well?

**Example: Q4_K_M Explained:**
This is like a moderate renovation:

* Most walls get painted with a 16-color palette (4-bit quantization).
* A few important rooms keep a richer palette (selected tensors are stored at a higher bit width).
* The balance between space saving and accuracy is carefully considered (medium level of compression).

Additional Note:
Q8_0 is like a light renovation: every weight is still quantized, but to 8 bits with one scale per block. The file is roughly twice the size of a 4-bit one, and the result stays very close to FP16.
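
The size difference between formats can be estimated from how many bytes each block of weights occupies. The arithmetic below is a rough sketch under stated assumptions: the block layouts are as I understand the ggml formats (Q8_0 stores 32 8-bit values plus one FP16 scale; Q4_K stores a super-block of 256 4-bit values plus 12 bytes of packed scales and minimums and two FP16 factors), and the parameter count of 1.1 billion is approximate. A real `Q4_K_M` file is somewhat larger than the pure Q4_K estimate because some tensors are stored at a higher bit width.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Average storage cost per weight, including the per-block scales."""
    return 8 * block_bytes / weights_per_block


# Assumed block layouts (see the lead-in above).
q8_0 = bits_per_weight(32 * 1 + 2, 32)          # 32 int8 values + one fp16 scale
q4_k = bits_per_weight(256 // 2 + 12 + 4, 256)  # 4-bit values + scales/mins + 2 fp16
f16 = 16.0

n_params = 1.1e9  # approximate parameter count of the model

for name, bpw in [("F16", f16), ("Q8_0", q8_0), ("Q4_K", q4_k)]:
    size_gb = n_params * bpw / 8 / 1e9
    print(f"{name:5s} {bpw:5.2f} bits/weight  ~{size_gb:.2f} GB")
```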


In [25]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!make

Cloning into 'llama.cpp'...
remote: Enumerating objects: 27561, done.
remote: Counting objects: 100% (9174/9174), done.
remote: Compressing objects: 100% (653/653), done.
remote: Total 27561 (delta 8831), reused 8617 (delta 8511), pack-reused 18387
Receiving objects: 100% (27561/27561), 48.46 MiB | 11.63 MiB/s, done.
Resolving deltas: 100% (19725/19725), done.
/content/llama.cpp
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-

In [26]:
%cd /content/llama.cpp
!python3 convert-hf-to-gguf.py /content/final_weights_new --outtype f16

/content/llama.cpp
INFO:hf-to-gguf:Loading model: final_weights_new
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 2048
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 5632
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 4
INFO:hf-to-gguf:gguf: rope theta = 10000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>
' + message['content
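The hyperparameters printed in the conversion log are enough to sanity-check the size of the F16 file. The sketch below counts the weights of a Llama-style decoder from those values. It assumes a vocabulary of 32000 tokens (not shown in the log excerpt), no bias terms, and an output head that is not tied to the token embedding. The block count of 22 is the value reported as `llama.block_count` when the file is loaded. The function name is hypothetical.

```python
def llama_param_count(n_layers, d_model, d_ff, n_heads, n_kv_heads, vocab_size):
    """Count weights in a Llama-style decoder (no biases, untied output head)."""
    head_dim = d_model // n_heads
    kv_dim = head_dim * n_kv_heads
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim  # q, o + k, v
    feed_forward = 3 * d_model * d_ff                          # gate, up, down
    norms = 2 * d_model                                        # two RMSNorms per block
    per_layer = attention + feed_forward + norms
    embeddings = 2 * vocab_size * d_model                      # token embedding + output head
    return n_layers * per_layer + embeddings + d_model         # + final RMSNorm


params = llama_param_count(
    n_layers=22,       # llama.block_count
    d_model=2048,      # embedding length
    d_ff=5632,         # feed forward length
    n_heads=32,        # head count
    n_kv_heads=4,      # key-value head count
    vocab_size=32000,  # assumption: not shown in the log excerpt
)
f16_gib = params * 2 / 1024 ** 3  # F16 stores each weight in 2 bytes
print(f"{params:,} parameters, about {f16_gib:.2f} GiB at 2 bytes each")
```

Under these assumptions the count comes to about 1.1 billion parameters, in line with the model name, so the F16 GGUF should be a little over 2 GiB before quantization.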

In [27]:
!./llama-quantize /content/final_weights_new/ggml-model-f16.gguf /content/final_weights_new/ggml-model-q4_k_m.gguf q4_k_m


main: build = 3187 (2075a66a)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/final_weights_new/ggml-model-f16.gguf' to '/content/final_weights_new/ggml-model-q4_k_m.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 26 key-value pairs and 201 tensors from /content/final_weights_new/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = final_weights_new
llama_model_loader: - kv   2:                          llama.block_count u32              = 22
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_mod
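After quantization it can be useful to confirm how much disk space was actually saved. The following is a small sketch that compares the two GGUF files by size. The helper name `compression_report` is hypothetical, and the paths are the ones used in the cells above. The report is skipped when the files are not present.

```python
import os


def compression_report(original_path, quantized_path):
    """Return (original MiB, quantized MiB, size ratio) for two files on disk."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    mib = 1024 ** 2
    return original / mib, quantized / mib, original / quantized


# Paths taken from the cells above.
f16_path = "/content/final_weights_new/ggml-model-f16.gguf"
q4_path = "/content/final_weights_new/ggml-model-q4_k_m.gguf"

if os.path.exists(f16_path) and os.path.exists(q4_path):
    f16_mib, q4_mib, ratio = compression_report(f16_path, q4_path)
    print(f"F16: {f16_mib:.1f} MiB, Q4_K_M: {q4_mib:.1f} MiB, {ratio:.2f}x smaller")
```

The ratio will typically be lower than the naive 16/4 = 4x, because the quantized formats also store per-block scales and Q4_K_M keeps some tensors at higher precision.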

In [28]:

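from llama_cpp import Llama

# Load the quantized GGUF model. n_gpu_layers=30 is more than the model's 22 blocks,
# so every layer is offloaded to the GPU when the CUDA build is available.
llm = Llama(
    model_path="/content/final_weights_new/ggml-model-q4_k_m.gguf",
    n_gpu_layers=30,
)

# Fill the training prompt template; the response slot stays empty for the model to complete.
prompt = input_prompt.format("What is anxiety?", "")

output = llm(prompt, max_tokens=200)
generated_text = output["choices"][0]["text"]

# Keep only the first answer in case the model starts a new "### Input:" turn.
first_response = generated_text.split("### Input:")[0].strip()

print(first_response)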

llama_model_loader: loaded meta data with 26 key-value pairs and 201 tensors from /content/final_weights_new/ggml-model-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = final_weights_new
llama_model_loader: - kv   2:                          llama.block_count u32              = 22
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.atten

Anxiety is a feeling of fear, dread, and uneasiness. It might cause you to sweat, feel restless and tense, and have a rapid heartbeat. It can be a normal reaction to stress. For example, you might feel anxious when faced with a difficult problem at work, before taking a test, or before making an important decision. It can help you to cope. The anxiety may give you a boost of energy or help you focus. The anxiety may also give you a sense of control and stability. However, if it becomes a problem over time, you might notice that it interferes with daily life in various ways. This can include problems sleeping, eating, or talking to people. It may also affect your work or social lives. The anxiety may be a normal reaction to stress. It may help you to feel better in difficult situations. The thing is, everyone feels anxious from time to time. It's
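The answer above stops mid-sentence because generation is capped at 200 tokens. One way to tidy this up is sketched below with a hypothetical helper, `clean_response`, which cuts the text at the first prompt marker and then drops any trailing partial sentence. The marker strings are assumed to match the prompt template defined earlier.

```python
def clean_response(text, stop_markers=("### Input:", "### Response:")):
    """Cut text at the first stop marker, then drop a trailing partial sentence."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    text = text[:cut].strip()
    # Keep everything up to the last sentence-ending punctuation, if there is any.
    last_end = max(text.rfind(ch) for ch in ".!?")
    return text[:last_end + 1] if last_end != -1 else text


sample = "Everyone feels anxious from time to time. It's"
print(clean_response(sample))
```

llama-cpp-python also accepts a `stop` argument on the completion call, for example `llm(prompt, max_tokens=200, stop=["### Input:"])`, which ends generation at the marker instead of trimming afterwards.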
