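To use **BGE-M3** for **dense retrieval** (the model also supports sparse and multi-vector retrieval) and extract the **normalized hidden state of the special token \[CLS\]** as the dense embedding, follow these steps. BGE-M3 is built on XLM-RoBERTa, so the token in the `[CLS]` position is actually `<s>` (ID 0); the `[CLS]` name is kept below for familiarity.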

### **1. Install Dependencies**
Ensure you have `transformers` and `torch` installed:

```bash
pip install transformers torch
```

### **2. Load the BGE-M3 Model**
Use Hugging Face's `AutoModel` and `AutoTokenizer`:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```

### **3. Tokenize Input**
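Tokenize the text. The tokenizer adds the special tokens automatically, so the `[CLS]`-position token (`<s>` for this model) is placed first without any extra work: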

```python
text = "This is an example sentence."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
```

### **4. Get Hidden States**
Pass the tokenized input through the model:

```python
with torch.no_grad():
    outputs = model(**inputs)

# Extract last hidden state
hidden_states = outputs.last_hidden_state  # Shape: (batch_size, seq_length, hidden_dim)
```

### **5. Extract `[CLS]` Token Embedding**
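BGE-M3 is based on XLM-RoBERTa rather than BERT, but the convention is the same: the `[CLS]`-position token (`<s>`) is always the first token of the sequence, so it sits at index 0: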

```python
cls_embedding = hidden_states[:, 0, :]  # Shape: (batch_size, hidden_dim)
```
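
To see what this slice does without loading the model, here is a small sketch on a dummy tensor. The sizes `(2, 5, 8)` are arbitrary illustrative values, not BGE-M3's real dimensions, and the tensor contents are made up.

```python
import torch

# Dummy stand-in for outputs.last_hidden_state with shape
# (batch_size=2, seq_length=5, hidden_dim=8). Illustrative sizes only;
# BGE-M3's hidden_dim is 1024.
hidden_states = torch.arange(2 * 5 * 8, dtype=torch.float32).reshape(2, 5, 8)

# Select position 0 of every sequence in the batch.
first_token = hidden_states[:, 0, :]

print(first_token.shape)  # torch.Size([2, 8])
```

Because the slice takes index 0 along the sequence dimension, it works the same way for a batch of padded inputs, assuming the tokenizer pads on the right (the usual default).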

### **6. Normalize the Embedding**
Normalize the extracted `[CLS]` embedding using L2 normalization:

```python
cls_embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)
```
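
The practical reason for normalizing is that the dot product of two unit-length vectors equals their cosine similarity. The sketch below checks this on random vectors; the 8-dimensional inputs are placeholders, not real BGE-M3 embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder vectors standing in for two unnormalized embeddings.
a = torch.randn(1, 8)
b = torch.randn(1, 8)

# L2-normalize along the hidden dimension, as in the step above.
a_n = F.normalize(a, p=2, dim=1)
b_n = F.normalize(b, p=2, dim=1)

# After normalization, a plain dot product matches cosine similarity.
dot = (a_n * b_n).sum(dim=1)
cos = F.cosine_similarity(a, b, dim=1)

print(a_n.norm(dim=1))                      # length of the normalized vector
print(torch.allclose(dot, cos, atol=1e-6))  # whether dot product and cosine agree
```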

### **7. Convert to Numpy (Optional)**
If needed, convert the embedding to a NumPy array:

```python
cls_embedding_np = cls_embedding.cpu().numpy()
```
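
Once the embeddings are NumPy arrays, ranking documents against a query is a matrix product. This is a minimal sketch with hand-written, already normalized 4-dimensional vectors; `query` and `docs` are hypothetical stand-ins for real embeddings.

```python
import numpy as np

# Hypothetical, already L2-normalized embeddings (4-dim for readability).
query = np.array([[1.0, 0.0, 0.0, 0.0]], dtype=np.float32)
docs = np.array(
    [
        [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
        [0.6, 0.8, 0.0, 0.0],  # partially aligned
        [1.0, 0.0, 0.0, 0.0],  # identical direction
    ],
    dtype=np.float32,
)

# Dot products of unit vectors are cosine similarities.
scores = (query @ docs.T)[0]

# Document indices sorted from most to least similar.
ranking = np.argsort(-scores)

print(scores)
print(ranking)
```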

### **Summary of Key Points**
- Tokenize the text; the tokenizer adds the `[CLS]`-position token (`<s>`) automatically.
- Extract the **first token's hidden state** from `outputs.last_hidden_state`.
- **L2 normalize** the `[CLS]` embedding so that dot products between embeddings equal cosine similarities.
- Use the result as the **dense embedding** for retrieval tasks.

This normalized `[CLS]` embedding is what BGE-M3 uses for efficient dense retrieval. 🚀
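
The steps above can be folded into one helper that handles a batch of texts. This is a sketch, not an official API: the name `encode_dense` and the `max_length=512` default are assumptions made for illustration, and the function expects a tokenizer and model loaded as in step 2.

```python
import torch
import torch.nn.functional as F


def encode_dense(texts, tokenizer, model, max_length=512):
    """Return L2-normalized first-token embeddings for a list of texts.

    `encode_dense` and the `max_length` default are illustrative choices,
    not part of any library.
    """
    # Tokenize the whole batch; padding aligns sequences of different lengths.
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )

    # Forward pass without gradient tracking.
    with torch.no_grad():
        outputs = model(**inputs)

    # First token of each sequence, then unit-length normalization.
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return F.normalize(cls_embedding, p=2, dim=1)
```

With the `tokenizer` and `model` from step 2, calling `encode_dense(["I like an apple", "Apples are tasty"], tokenizer, model)` would return a tensor of shape `(2, 1024)`.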

In [1]:
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "D:/LLMs/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model

XLMRobertaModel(
  (embeddings): XLMRobertaEmbeddings(
    (word_embeddings): Embedding(250002, 1024, padding_idx=1)
    (position_embeddings): Embedding(8194, 1024, padding_idx=1)
    (token_type_embeddings): Embedding(1, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): XLMRobertaEncoder(
    (layer): ModuleList(
      (0-23): 24 x XLMRobertaLayer(
        (attention): XLMRobertaAttention(
          (self): XLMRobertaSdpaSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): XLMRobertaSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-05, elem

In [4]:
text = "I like an apple"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
inputs

{'input_ids': tensor([[     0,     87,   1884,    142, 108787,      2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [19]:
with torch.no_grad():
    outputs = model(**inputs)
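# Note: pooler_output passes the first token through an extra dense + tanh layer;
# it is not the embedding extracted from last_hidden_state below.
print(len(outputs["pooler_output"][0]))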
outputs

1024


BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1833,  0.6057, -1.1345,  ...,  0.2107, -0.8259,  0.3222],
         [-0.1086,  0.0092, -0.5508,  ...,  0.2070, -0.5216,  0.2273],
         [ 0.5871, -0.0714, -0.7237,  ...,  0.4825, -0.4838,  0.5969],
         [ 0.2916,  0.0894, -0.3398,  ...,  0.5961, -1.0075,  0.2712],
         [ 0.5557, -0.2793, -0.2761,  ...,  0.2340, -0.3755, -0.2263],
         [-0.2901,  0.4516, -0.3684,  ...,  0.8776, -0.8294,  0.6854]]]), pooler_output=tensor([[-0.8096,  0.4748,  0.0235,  ..., -0.2627,  0.1331, -0.1672]]), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

In [20]:
# Extract last hidden state
hidden_states = outputs.last_hidden_state  # Shape: (batch_size, seq_length, hidden_dim)
hidden_states, len(hidden_states[0]), len(hidden_states[0][0])

(tensor([[[-0.1833,  0.6057, -1.1345,  ...,  0.2107, -0.8259,  0.3222],
          [-0.1086,  0.0092, -0.5508,  ...,  0.2070, -0.5216,  0.2273],
          [ 0.5871, -0.0714, -0.7237,  ...,  0.4825, -0.4838,  0.5969],
          [ 0.2916,  0.0894, -0.3398,  ...,  0.5961, -1.0075,  0.2712],
          [ 0.5557, -0.2793, -0.2761,  ...,  0.2340, -0.3755, -0.2263],
          [-0.2901,  0.4516, -0.3684,  ...,  0.8776, -0.8294,  0.6854]]]),
 6,
 1024)

In [6]:
cls_embedding = hidden_states[:, 0, :]  # Shape: (batch_size, hidden_dim)
cls_embedding

tensor([[-0.1833,  0.6057, -1.1345,  ...,  0.2107, -0.8259,  0.3222]])

In [7]:
cls_embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)
cls_embedding

tensor([[-0.0069,  0.0229, -0.0430,  ...,  0.0080, -0.0313,  0.0122]])

In [8]:
cls_embedding_np = cls_embedding.cpu().numpy()
cls_embedding_np

array([[-0.00694152,  0.02293514, -0.04296022, ...,  0.00797774,
        -0.03127418,  0.0122012 ]], shape=(1, 1024), dtype=float32)