<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/embeddings/How_to_generate_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EMBEDDINGS
Embeddings in the context of large language models (LLMs) are dense vector representations of text. They are a way to convert words, phrases, or even entire documents into numerical form so that they can be processed by machine learning models.

### What Are Embeddings?
- Vector Representation: Embeddings transform text into vectors (arrays of numbers). Each dimension of the vector captures some aspect of the text's meaning.

- Dimensionality Reduction: Instead of representing words as sparse vectors (like one-hot encoding), embeddings use dense vectors with much lower dimensionality, making computations more efficient.

- Semantic Meaning: Words with similar meanings are mapped to vectors that are close to each other in the embedding space. For example, the words "king" and "queen" would have similar embeddings.
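The "close to each other" idea above is usually measured with cosine similarity. The following is a minimal sketch in plain Python; the three 4-dimensional vectors are invented for illustration and do not come from any real model, and `toy_cosine` is a hypothetical helper name.

```python
import math

def toy_cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only.
toy_embeddings = {
    "king":   [0.90, 0.80, 0.10, 0.20],
    "queen":  [0.85, 0.75, 0.20, 0.30],
    "banana": [0.10, 0.05, 0.90, 0.70],
}

sim_king_queen = toy_cosine(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_banana = toy_cosine(toy_embeddings["king"], toy_embeddings["banana"])
print(f"king vs queen:  {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

With these toy numbers the king/queen pair scores higher than king/banana, which is the behaviour a trained embedding model is expected to show for related words.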

### How Are They Used in LLMs?
- Input Representation: When text is fed into an LLM, it is first converted into embeddings. These embeddings serve as the input to the model.

- Contextual Understanding: LLMs like BERT, GPT, and Mistral generate embeddings that capture the context of words within a sentence. This means the same word can have different embeddings depending on its context.

- Downstream Tasks: Embeddings are used for various NLP tasks such as text classification, sentiment analysis, and information retrieval. They can also be used in recommendation systems.
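The input-representation step described above can be pictured as a table lookup: each token id selects one row of an embedding matrix. This is a toy sketch only; the vocabulary, the whitespace tokenizer, and the 3-dimensional table are all hypothetical and far smaller than anything a real LLM uses.

```python
# Hypothetical 5-token vocabulary and a 3-dimensional embedding table.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4}
embedding_table = [
    [0.0, 0.0, 0.0],    # <unk>
    [0.1, 0.3, -0.2],   # the
    [0.7, -0.1, 0.4],   # cat
    [-0.3, 0.5, 0.2],   # sat
    [0.6, 0.0, 0.5],    # mat
]

def embed(text):
    """Map whitespace-separated tokens to ids, then look up one vector per id."""
    token_ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return [embedding_table[i] for i in token_ids]

vectors = embed("The cat sat")
print(len(vectors), "tokens,", len(vectors[0]), "dimensions each")
```

In a real model these input vectors are then transformed layer by layer, which is where the context-dependent representations come from.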


The choice between using the last hidden layer or the second-to-last hidden layer for embeddings can depend on your specific task and model. Here are some considerations:

- Last Hidden Layer: This layer contains the most contextually rich and semantically meaningful representations of the input text. It's often used because it captures the final output of the model's processing.

- Second-to-Last Hidden Layer: Some studies suggest that the second-to-last layer can provide better embeddings for certain tasks. This is because the last layer might be too specialized for the model's training objective (e.g., masked language modeling), while the second-to-last layer retains more general features.
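To make the layer choice concrete, here is a small sketch of selecting a layer and mean-pooling its token vectors into a single embedding. The `hidden_states` tuple is hand-written toy data (2 layers, 3 tokens, 2 dimensions), laid out on the assumption that index 0 is the embedding output followed by one entry per layer.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# Hypothetical hidden states: embedding output, then one entry per layer.
hidden_states = (
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # embedding output
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # layer 1 (second-to-last here)
    [[2.0, 2.0], [4.0, 4.0], [6.0, 9.0]],  # layer 2 (last)
)

last_layer_embedding = mean_pool(hidden_states[-1])
second_to_last_embedding = mean_pool(hidden_states[-2])
print("last layer:          ", last_layer_embedding)
print("second-to-last layer:", second_to_last_embedding)
```

Switching between the two options is a one-index change, so it is cheap to evaluate both on a downstream task.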

# MTEB
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks.

- https://huggingface.co/blog/mteb
- https://huggingface.co/spaces/mteb/leaderboard


# Large Embeddings vs Small Embeddings
The choice between using large or small embeddings depends on your specific needs and constraints. Here are some key points to consider:

#### Large Embeddings
Pros:

- Richer Representations: Larger embeddings capture more nuanced and detailed information, which can improve performance on complex tasks.
- Better Contextual Understanding: They often provide a deeper understanding of the text, which is beneficial for tasks requiring high accuracy.

Cons:

- Higher Computational Cost: They require more memory and processing power, which can be a limitation for real-time applications or when working with large datasets.
- Increased Cost: Larger embeddings can be more expensive to use, especially if you're processing a high volume of text.

#### Small Embeddings
Pros:

- Faster Processing: Smaller embeddings are quicker to compute, making them suitable for applications where speed is critical.
- Cost-Effective: They are generally cheaper to use, which is advantageous for budget-sensitive projects.

Cons:

- Limited Performance: They might not capture as much detail, which can affect performance on more complex tasks.
- Less Rich Representations: Smaller embeddings may struggle with nuanced queries and might not perform as well in tasks requiring deep semantic understanding.

#### Practical Considerations
- Task Complexity: For simple tasks, small embeddings might suffice. For more complex tasks, large embeddings could provide better results.
- Resource Availability: Consider your computational resources and budget. If you have limited resources, small embeddings might be more practical.
- Experimentation: It's often beneficial to experiment with both sizes to see which one performs better for your specific use case.
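The resource trade-off can be estimated with simple arithmetic before running anything. The sketch below computes raw storage for a vector index at the three embedding sizes used in this notebook (768, 3072, 4096); the corpus size of one million documents and the float32 storage format are assumptions chosen for illustration, and `index_size_gb` is a hypothetical helper.

```python
def index_size_gb(n_vectors, dim, bytes_per_value=4):
    """Raw size in GB of n_vectors embeddings of length dim (float32 by default)."""
    return n_vectors * dim * bytes_per_value / 1e9

n_docs = 1_000_000  # hypothetical corpus size
for dim in (768, 3072, 4096):
    print(f"dim={dim:>4}: {index_size_gb(n_docs, dim):.3f} GB")
```

The size grows linearly with the embedding length, so a 4096-dimensional index needs a bit more than five times the storage of a 768-dimensional one, before any index overhead.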

In [1]:
!pip install -U bitsandbytes Faker -q


In [2]:
from transformers import AutoTokenizer, AutoModel
import torch
import json
import gc
import datetime
from faker import Faker
fake = Faker()
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
torch.cuda.empty_cache()
gc.collect()

30

# Mistral

https://huggingface.co/mistralai/Mistral-7B-v0.1

Embedding length: 4096

Mistral 7B: this model has a context window of up to 32k tokens, with a default of 8,000 tokens.

In [5]:
model_id = "mistralai/Mistral-7B-v0.1"

In [52]:
from transformers import BitsAndBytesConfig
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Pass the 8-bit setting through BitsAndBytesConfig; the bare load_in_8bit argument is deprecated
model = AutoModel.from_pretrained(model_id, device_map="auto", quantization_config=BitsAndBytesConfig(load_in_8bit=True))

In [14]:
model

MistralModel(
  (embed_tokens): Embedding(32000, 4096)
  (layers): ModuleList(
    (0-31): 32 x MistralDecoderLayer(
      (self_attn): MistralSdpaAttention(
        (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
        (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
        (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
        (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
        (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
    )
  )
  (norm): MistralRMS

In [17]:
test1 = fake.text(3000)
test2 = fake.text(3000)



In [16]:
test1

'Toward run or sure itself. Local cultural account her concern this onto require.\nQuality hard thing anything treatment cold probably probably. Identify hope heavy find finally increase current president.\nGroup poor fire season policy debate much. Agree thank various relate big. Close determine customer owner involve. Consider information recent see door single industry movement.\nRecord point nothing area do. Item world machine phone contain.\nReal resource field tax dream minute front. Imagine necessary ball break report close pick.\nWind policy evening likely field. Arrive key audience. Crime hot let foreign.\nYoung voice nothing ready sister media. Morning enter indicate billion peace bad rule body.\nHerself state admit organization. Teacher two believe majority moment nearly. War rest page.\nRemain piece push entire single.\nPage place teacher fast really those. Easy nation enough church them well. Young type receive story.\nOfficer tend summer into.\nCould road position game. V

In [9]:
test2

'Commercial color issue call investment. Six serious key rock.\nFather kitchen figure heart sing quite. Five sell system. Citizen several piece turn recent exactly.\nThat whether choose drive travel. Little herself professional seem. My think light role need.\nSimple tree argue seem bill. Response culture practice pull foreign beat bag agency.\nSeat thank magazine.\nPaper buy capital down lay any. About record including one.\nRoom particularly sport threat least. Dream main site American cause campaign development.\nCreate blood large able. Main scientist leader power.\nKnow trouble city real. Foot great must fish close certainly.\nYes information use. State thank know. Top civil speak although.\nTrial paper because conference fight guy night important.\nSomebody less cause stop growth training you. Minute news feel statement against nice. Sport oil fund trouble two reveal fast.\nBest outside everybody simply more. Skill trade owner low. Finally away mean never capital establish deal p

In [18]:
inputs = tokenizer(test1, return_tensors="pt", truncation=True, max_length=512)

In [19]:
len(inputs["input_ids"][0])

512

In [53]:
embeddings_mistral = {}

In [54]:
time_start = datetime.datetime.now()
for i in range(100):
  # Tokenize a random text, truncating it to at most 512 tokens
  inputs = tokenizer(fake.text(3000), return_tensors="pt", truncation=True, max_length=512)
  inputs = inputs.to("cuda")
  with torch.no_grad():
      outputs = model(**inputs, output_hidden_states=True)
      hidden_states = outputs.hidden_states
      # Take the second-to-last layer and mean-pool over the token dimension: one vector per text
      second_to_last_layer = hidden_states[-2]
      embeddings_mistral[i] = second_to_last_layer.mean(dim=1).cpu().numpy()
time_end = datetime.datetime.now()
print(f"Time taken: {time_end - time_start}")

Time taken: 0:00:13.623325


In [56]:
second_to_last_layer

tensor([[[ 0.1366, -0.3472,  1.5781,  ...,  0.4478,  0.1343,  0.9580],
         [-1.4463, -0.7510, -0.0922,  ..., -0.3245, -0.1188, -0.3416],
         [-0.3679, -0.7139,  0.6338,  ..., -0.2213,  0.2974,  0.7490],
         ...,
         [-1.1514, -0.6274,  1.5889,  ...,  0.1184,  0.1324, -0.4644],
         [-1.3125, -0.6240,  1.9277,  ...,  0.2695, -0.2954,  0.0134],
         [-1.4609, -0.5996,  0.7910,  ...,  0.3481, -0.0315, -0.2944]]],
       device='cuda:0', dtype=torch.float16)

In [57]:
hidden_states[-1]

tensor([[[ 6.6406e-02, -3.6206e-01,  5.1842e-03,  ...,  3.7598e-01,
           3.8965e-01,  2.5928e-01],
         [-3.9648e+00, -1.4854e+00,  1.9590e+00,  ...,  5.7764e-01,
          -2.1699e+00, -8.3740e-01],
         [-1.1143e+00, -1.2598e+00,  4.2891e+00,  ..., -2.5513e-01,
          -5.6055e-01,  1.4141e+00],
         ...,
         [-1.1104e+00, -1.8301e+00,  5.8438e+00,  ..., -4.9268e-01,
          -1.3145e+00,  4.8706e-01],
         [-1.2676e+00, -1.5986e+00,  5.1484e+00,  ..., -9.3231e-03,
          -2.5293e+00,  4.6533e-01],
         [-1.7988e+00, -1.7637e+00,  4.7500e+00,  ..., -4.3286e-01,
          -1.6582e+00,  6.2402e-01]]], device='cuda:0', dtype=torch.float16)

In [58]:
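# One entry for the embedding output plus one per decoder layer; 29 matches a 28-layer model (Llama 3.2 3B), not the 32-layer Mistral, so this cell was likely run out of order
len(hidden_states)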

29

In [22]:
len(embeddings_mistral[0][0])

4096

In [25]:
documents= [test1,test2]
embeddings_mistral = {}
for i in range(len(documents)):
  inputs = tokenizer(documents[i], return_tensors="pt", truncation=True, max_length=512)
  inputs = inputs.to("cuda")
  with torch.no_grad():
      outputs = model(**inputs, output_hidden_states=True)
      hidden_states = outputs.hidden_states
      second_to_last_layer = hidden_states[-2]
      embeddings_mistral[i] = second_to_last_layer.mean(dim=1).cpu().numpy()

In [26]:
# Calculate similarity
similarity = cosine_similarity(embeddings_mistral[0], embeddings_mistral[1])[0][0]
print(f"Similarity between texts: {similarity:.4f}")

Similarity between texts: 0.9600


In [27]:
del model
del embeddings_mistral
del tokenizer
torch.cuda.empty_cache()
gc.collect()

38

In [28]:
torch.cuda.empty_cache()
gc.collect()

20

# Llama 3.2 3B
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

Embedding length: 3072

Llama 3.2 3B: this model supports a context window of up to 128,000 tokens, with a default setting of 8,192 tokens.

In [29]:
from transformers import BitsAndBytesConfig
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Pass the 8-bit setting through BitsAndBytesConfig; the bare load_in_8bit argument is deprecated
model = AutoModel.from_pretrained(model_id, device_map="auto", quantization_config=BitsAndBytesConfig(load_in_8bit=True))

In [30]:
model

LlamaModel(
  (embed_tokens): Embedding(128256, 3072)
  (layers): ModuleList(
    (0-27): 28 x LlamaDecoderLayer(
      (self_attn): LlamaSdpaAttention(
        (q_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
        (k_proj): Linear8bitLt(in_features=3072, out_features=1024, bias=False)
        (v_proj): Linear8bitLt(in_features=3072, out_features=1024, bias=False)
        (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (mlp): LlamaMLP(
        (gate_proj): Linear8bitLt(in_features=3072, out_features=8192, bias=False)
        (up_proj): Linear8bitLt(in_features=3072, out_features=8192, bias=False)
        (down_proj): Linear8bitLt(in_features=8192, out_features=3072, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
    )
  )
  (norm): LlamaRMSNorm((3072,), eps=

In [32]:
embeddings = {}
time_start = datetime.datetime.now()
for i in range(100):
  inputs = tokenizer(fake.text(3000), return_tensors="pt", truncation=True, max_length=512)
  inputs = inputs.to("cuda")
  with torch.no_grad():
      outputs = model(**inputs, output_hidden_states=True)
      hidden_states = outputs.hidden_states
      second_to_last_layer = hidden_states[-2]
      embeddings[i] = second_to_last_layer.mean(dim=1).cpu().numpy()
time_end = datetime.datetime.now()
print(f"Time taken: {time_end - time_start}")

Time taken: 0:00:13.419268


In [33]:
len(embeddings[0][0])

3072

In [35]:
documents= [test1,test2]
embeddings_llama = {}
for i in range(len(documents)):
  inputs = tokenizer(documents[i], return_tensors="pt", truncation=True, max_length=512)
  inputs = inputs.to("cuda")
  with torch.no_grad():
      outputs = model(**inputs, output_hidden_states=True)
      hidden_states = outputs.hidden_states
      second_to_last_layer = hidden_states[-2]
      embeddings_llama[i] = second_to_last_layer.mean(dim=1).cpu().numpy()

In [36]:
# Calculate similarity
similarity = cosine_similarity(embeddings_llama[0], embeddings_llama[1])[0][0]
print(f"Similarity between texts: {similarity:.4f}")

Similarity between texts: 0.9596


In [38]:
del model
del embeddings_llama
del tokenizer
torch.cuda.empty_cache()
gc.collect()

66

In [39]:
torch.cuda.empty_cache()
gc.collect()

0

# NOMIC-embed-text-v1

https://huggingface.co/nomic-ai/nomic-embed-text-v1

Embedding length: 768 (max)

Context length: 8192 tokens (text encoder)

In [40]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)


In [41]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
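The printout above shows three stages: a transformer, mean pooling over tokens, and a `Normalize()` step. The sketch below imitates the last two stages on hand-written toy token vectors (not real model output) to show why normalization is convenient: once vectors have unit length, a plain dot product equals the cosine similarity. The helper names are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy token vectors for two short "texts" (2 tokens, 2 dimensions each).
tokens_a = [[1.0, 2.0], [3.0, 2.0]]
tokens_b = [[0.0, 4.0], [0.0, 2.0]]

emb_a = l2_normalize(mean_pool(tokens_a))
emb_b = l2_normalize(mean_pool(tokens_b))

# For unit-length vectors the dot product is the cosine similarity.
similarity = dot(emb_a, emb_b)
print(f"similarity: {similarity:.4f}")
```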

In [42]:
# GPU
embeddings = {}
time_start = datetime.datetime.now()
for i in range(100):
    embeddings[i] = model.encode(f'search_document: {fake.text(3000)}')
time_end = datetime.datetime.now()
print(f"Time taken: {time_end - time_start}")

Time taken: 0:00:02.429736
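The `search_document: ` string prepended in the cell above is a task prefix. As far as the model card describes it, nomic-embed-text-v1 expects one of a small set of prefixes on every input; the list below reflects that understanding and should be checked against the model card. The helper `add_task_prefix` is a hypothetical convenience, not part of any library.

```python
# Task prefixes as understood from the nomic-embed-text-v1 model card (verify before relying on them).
NOMIC_TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")

def add_task_prefix(text, task="search_document"):
    """Return text with the task prefix in the 'task: text' form used in this notebook."""
    if task not in NOMIC_TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return f"{task}: {text}"

doc_input = add_task_prefix("Embeddings are dense vector representations of text.")
query_input = add_task_prefix("what is an embedding?", task="search_query")
print(doc_input)
print(query_input)
```

For retrieval, the usual pattern is to embed stored texts with the document prefix and user questions with the query prefix, then compare the two sets of vectors.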


In [43]:
embeddings = {}
documents= [test1,test2]
time_start = datetime.datetime.now()
for i in range(len(documents)):
    embeddings[i] = model.encode(f'search_document: {documents[i]}')
time_end = datetime.datetime.now()
print(f"Time taken: {time_end - time_start}")

Time taken: 0:00:00.042624


In [44]:
embeddings[0].reshape(1,-1).shape

(1, 768)

In [45]:
# Calculate similarity
similarity = cosine_similarity(embeddings[0].reshape(1,-1), embeddings[1].reshape(1,-1))[0][0]
print(f"Similarity between texts: {similarity:.4f}")

Similarity between texts: 0.8025


In [46]:
del model
del embeddings
#del tokenizer
torch.cuda.empty_cache()
gc.collect()

2030

In [59]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, device="cpu")



In [60]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [48]:
# CPU
embeddings = {}
time_start = datetime.datetime.now()
for i in range(100):
    embeddings[i] = model.encode(f'search_document: {fake.text(3000)}')
time_end = datetime.datetime.now()
print(f"Time taken: {time_end - time_start}")

Time taken: 0:00:29.074736
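The two nomic-embed-text-v1 timings reported in this notebook (0:00:02.429736 on GPU and 0:00:29.074736 on CPU, each for 100 texts) can be turned into throughput figures. This is plain arithmetic on the printed values; the texts were random, so the numbers are indicative rather than a benchmark, and `docs_per_second` is a hypothetical helper.

```python
from datetime import timedelta

def docs_per_second(n_docs, elapsed):
    """Throughput given a document count and an elapsed timedelta."""
    return n_docs / elapsed.total_seconds()

# Elapsed times copied from the two "Time taken" outputs in this notebook.
gpu_elapsed = timedelta(seconds=2, microseconds=429736)
cpu_elapsed = timedelta(seconds=29, microseconds=74736)

gpu_rate = docs_per_second(100, gpu_elapsed)
cpu_rate = docs_per_second(100, cpu_elapsed)
speedup = cpu_elapsed.total_seconds() / gpu_elapsed.total_seconds()

print(f"GPU: {gpu_rate:.1f} docs/s")
print(f"CPU: {cpu_rate:.1f} docs/s")
print(f"GPU speedup: {speedup:.1f}x")
```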


In [49]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

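def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask over the hidden dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over real tokens and divide by their count (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)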

sentences = [f'search_query: {test1}', f'search_query: {test2}']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)




tensor([[-0.0228,  0.0059,  0.0037,  ..., -0.0198, -0.0165, -0.0464],
        [ 0.0139,  0.0150, -0.0060,  ...,  0.0118, -0.0281, -0.0461]])


In [50]:
embeddings[0].shape

torch.Size([768])

In [51]:
# Calculate similarity
similarity = cosine_similarity(embeddings[0].reshape(1,-1), embeddings[1].reshape(1,-1))[0][0]
print(f"Similarity between texts: {similarity:.4f}")

Similarity between texts: 0.7730
