### **Testing Quantized Qwen2 Model and Computing Similarity Scores**

#### **1. Loading Qwen2 Model with Quantization**
- Uses `Alibaba-NLP/gte-Qwen2-7B-instruct` for embedding generation.
- Configures `BitsAndBytesConfig` for 4-bit quantization (`nf4`) with `float16` as the compute dtype.
- Maps the model to available devices automatically via `device_map="auto"`.
- Emits a warning that special tokens have been added to the vocabulary.
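
To give a feel for what 4-bit quantization does to weights, here is a minimal sketch of uniform absmax quantization in plain Python. It is deliberately not NF4: NF4 uses a non-uniform codebook and per-block scales inside `bitsandbytes`, whereas this illustration uses evenly spaced integer codes and a single scale. The function names and the sample weights are hypothetical.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] using one shared scale.

    Illustration only: uniform absmax quantization, not the NF4 codebook.
    """
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]  # hypothetical weight block
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

print(codes)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for the smaller footprint.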

#### **2. Downloading and Loading Model Checkpoints**
- Downloads the model checkpoint in 7 shards (about 45 minutes in this run).
- Loads the 7 shards in roughly 12 seconds.

#### **3. Printing Model Memory Usage**
- Implements `print_model_memory_usage()`, which sums the byte size of every parameter and buffer.
- Reports a quantized model size of **5940.785 MB**.
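
The accounting behind that number is element count times bytes per element, summed and divided by `1024**2`. The sketch below reproduces the arithmetic without PyTorch. The tensor shapes are hypothetical, and the assumption that 4-bit weights are stored packed two per byte is mine, not something the notebook states.

```python
def size_in_mb(tensors):
    """Sum count * bytes_per_element over (count, bytes) pairs, in MB (1024**2 bytes)."""
    total_bytes = sum(count * bytes_per_element for count, bytes_per_element in tensors)
    return total_bytes / 1024**2


# Hypothetical tensors: (number of stored elements, bytes per element)
example = [
    (4096 * 4096 // 2, 1),  # assumed: 4-bit weights packed two per uint8 byte
    (4096, 2),              # a float16 vector, e.g. a norm weight
]
print(f"Example size: {size_in_mb(example):.3f} MB")

# Converting the reported figure from MB to GiB
reported_mb = 5940.785
print(f"Reported size: {reported_mb / 1024:.2f} GiB")
```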

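#### **4. Selecting the Device and Checking Model Placement**
- Selects `cuda` when available, otherwise `cpu`, as the target device for tokenized inputs.
- Prints the device the model's parameters are on; the model itself is not moved, because `device_map="auto"` already placed it at load time.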

#### **5. Defining Last Token Pooling**
- Implements `last_token_pool()` to extract meaningful vector representations from hidden states.
- Handles both left-padded and right-padded sequences.
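
The index logic is easier to see on a single attention-mask row. This is a simplified, framework-free sketch: unlike `last_token_pool()`, which decides left versus right padding once for the whole batch, the hypothetical helper below decides per row.

```python
def last_token_index(mask_row):
    """Return the position of the last real token in one attention-mask row."""
    if mask_row[-1] == 1:
        # Left-padded (or unpadded): the final slot already holds a real token.
        return len(mask_row) - 1
    # Right-padded: real tokens fill the first sum(mask_row) slots.
    return sum(mask_row) - 1


left_padded = [0, 0, 1, 1, 1]
right_padded = [1, 1, 1, 0, 0]

print(last_token_index(left_padded))
print(last_token_index(right_padded))
```

The hidden state at that position is what gets used as the sequence embedding.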

#### **6. Formatting Query Instructions**
- Defines `get_detailed_instruct()` to structure query inputs in a standardized format.
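
For reference, this is what the formatted prompt looks like for one of the notebook's queries. The function body matches the one defined in the code cell further down; only queries receive this prefix, while the documents are embedded as plain text.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # One line for the task instruction, one line for the query itself
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a web search query, retrieve relevant passages that answer the query"
prompt = get_detailed_instruct(task, "summit define")
print(prompt)
```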

#### **7. Preparing Queries and Documents**
- Defines a retrieval task: `"Given a web search query, retrieve relevant passages that answer the query"`
- Creates two example queries:
  1. `"how much protein should a female eat"`
  2. `"summit define"`
- Provides corresponding documents for similarity comparison.

#### **8. Defining Text Embedding Function**
- Implements `embed_texts()`:
  - Tokenizes and processes input text.
  - Ensures inputs are moved to the correct device.
  - Extracts last-token representations and normalizes embeddings.

#### **9. Embedding Queries and Documents**
- Generates embeddings separately for queries and documents using Qwen2.
- Confirms that the tokenized inputs are correctly assigned to the appropriate device.

#### **10. Computing Similarity Scores**
- Computes cosine similarity scores between queries and documents as the dot product of the L2-normalized embeddings.
- Multiplies scores by **100** for readability.
- Example results, taken from the printed 2×2 score matrix:
  - `"how much protein should a female eat"` vs. the protein-requirement passage: **71.69**
  - `"summit define"` vs. the summit-definition passage: **82.69**
  - The two mismatched query-document pairs score about **5.2** and **5.7**.
- Displays computed similarity scores.
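
Because the embeddings are L2-normalized, a plain dot product already equals the cosine similarity. A minimal sketch with two hypothetical 2-dimensional vectors, using only the standard library:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy embeddings, not real model outputs
query = l2_normalize([3.0, 4.0])
doc = l2_normalize([4.0, 3.0])

# For unit vectors, dot product == cosine similarity; scale by 100 as in the notebook
score = dot(query, doc) * 100
print(round(score, 2))
```

Here the score comes out at roughly 96, since the two toy vectors point in nearly the same direction.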

#### **11. Checking Embedding Shape**
- Prints the shape of `document_embeddings` (`torch.Size([2, 3584])`).


In [1]:
import torch
import torch.nn.functional as F
import os
from torch import Tensor
from transformers import AutoTokenizer, AutoModel, BitsAndBytesConfig
from dataProcessor import process_metadata, pew_metadata_path, statista_metadata_path
import tqdm as notebook_tqdm



In [2]:
# Load the processed metadata and keep the first 100 rows as a sample
sample_df = process_metadata(pew_metadata_path, statista_metadata_path).head(100)

In [3]:
# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

In [5]:
# Load the tokenizer and quantized model
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'Alibaba-NLP/gte-Qwen2-7B-instruct', trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto"
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading shards: 100%|██████████| 7/7 [44:44<00:00, 383.44s/it]
Loading checkpoint shards: 100%|██████████| 7/7 [00:12<00:00,  1.73s/it]


In [6]:
# Print model memory usage
def print_model_memory_usage(model):
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    size_all_mb = (param_size + buffer_size) / 1024**2
    print(f'Model size: {size_all_mb:.3f} MB')

print_model_memory_usage(model)

Model size: 5940.785 MB


In [7]:
# Select the device that tokenized inputs will be moved to
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Report where the model was placed (device_map="auto" already handled placement)
print(f"Model is on device: {next(model.parameters()).device}")

# Prepare queries and documents
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

max_length = 1250

# Function to embed texts
def embed_texts(texts):
    input_token = tokenizer(texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
    input_token = {k: v.to(device) for k, v in input_token.items()}
    print(f"Tokenized inputs are on device: {next(iter(input_token.values())).device}")
    # Inference only: skip gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**input_token)
    embeddings = last_token_pool(outputs.last_hidden_state, input_token['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Embed queries and documents separately
query_embeddings = embed_texts(queries)
document_embeddings = embed_texts(documents)

# Compute similarity scores between queries and documents
scores = (query_embeddings @ document_embeddings.T) * 100
print(f"Similarity scores are on device: {scores.device}")
print(scores.tolist())

Using device: cuda
Model is on device: cuda:0
Tokenized inputs are on device: cuda:0
Tokenized inputs are on device: cuda:0
Similarity scores are on device: cuda:0
[[71.6875, 5.203125], [5.69921875, 82.6875]]
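
Rows of the printed matrix correspond to queries and columns to documents. As a small sketch of how the matrix could be turned into a retrieval decision, the snippet below picks the highest-scoring document per query; the variable names are illustrative, and the values are copied from the output above.

```python
# Score matrix copied from the cell output: rows = queries, columns = documents
scores = [[71.6875, 5.203125], [5.69921875, 82.6875]]

# Index of the highest-scoring document for each query
best = [max(range(len(row)), key=row.__getitem__) for row in scores]

for query_idx, doc_idx in enumerate(best):
    print(f"query {query_idx} -> document {doc_idx} (score {scores[query_idx][doc_idx]:.2f})")
```

Each query selects its own passage, which is the expected outcome for this two-pair example.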


In [8]:
document_embeddings.shape

torch.Size([2, 3584])
