
In [None]:
%%capture
!pip install -U sentence-transformers


In [None]:
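from sentence_transformers import SentenceTransformer

# Two example sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the pretrained model and encode each sentence into a fixed-size vector
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)  # NumPy array with one row per sentence
print(embeddings)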


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[[ 6.76569194e-02  6.34959713e-02  4.87131476e-02  7.93049783e-02
   3.74480933e-02  2.65277131e-03  3.93749513e-02 -7.09846430e-03
   5.93614057e-02  3.15369777e-02  6.00981042e-02 -5.29052615e-02
   4.06067781e-02 -2.59308387e-02  2.98428647e-02  1.12688600e-03
   7.35148042e-02 -5.03817983e-02 -1.22386612e-01  2.37028562e-02
   2.97265332e-02  4.24768738e-02  2.56337579e-02  1.99516211e-03
  -5.69190793e-02 -2.71598026e-02 -3.29035036e-02  6.60248846e-02
   1.19007230e-01 -4.58791181e-02 -7.26214051e-02 -3.25840116e-02
   5.23413308e-02  4.50553037e-02  8.25298298e-03  3.67024504e-02
  -1.39415674e-02  6.53918684e-02 -2.64271945e-02  2.06370896e-04
  -1.36643145e-02 -3.62810642e-02 -1.95043720e-02 -2.89738011e-02
   3.94270122e-02 -8.84091184e-02  2.62427772e-03  1.36713730e-02
   4.83062901e-02 -3.11566498e-02 -1.17329173e-01 -5.11690713e-02
  -8.85288343e-02 -2.18963325e-02  1.42986095e-02  4.44167666e-02
  -1.34815648e-02  7.43392259e-02  2.66382862e-02 -1.98763069e-02
   1.79191

In [None]:
embeddings.dtype, embeddings.size, embeddings.ctypes, embeddings.shape, embeddings.ndim

(dtype('float32'),
 768,
 <numpy._core._internal._ctypes at 0x7edcdf10ac60>,
 (2, 384),
 2)
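
The numbers in the output above are related: `size` is the product of the entries in `shape`, so 2 sentences times 384 dimensions gives 768 elements. A minimal standard-library sketch of that arithmetic follows. The `itemsize` of 4 bytes is the width of one `float32` value, and the variable names are illustrative only.

```python
import math

shape = (2, 384)          # (number of sentences, embedding dimension), as printed above
itemsize = 4              # bytes per float32 element

size = math.prod(shape)   # total number of elements in the array
nbytes = size * itemsize  # size of the raw data buffer in bytes
print(size, nbytes)       # 768 3072
```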

In [None]:
# embeddings is a NumPy ndarray; show its built-in documentation
help(embeddings)

Help on ndarray object:

class ndarray(builtins.object)
 |  ndarray(shape, dtype=float, buffer=None, offset=0,
 |          strides=None, order=None)
 |
 |  An array object represents a multidimensional, homogeneous array
 |  of fixed-size items.  An associated data-type object describes the
 |  format of each element in the array (its byte-order, how many bytes it
 |  occupies in memory, whether it is an integer, a floating point number,
 |  or something else, etc.)
 |
 |  Arrays should be constructed using `array`, `zeros` or `empty` (refer
 |  to the See Also section below).  The parameters given here refer to
 |  a low-level method (`ndarray(...)`) for instantiating an array.
 |
 |  For more information, refer to the `numpy` module and examine the
 |  methods and attributes of an array.
 |
 |  Parameters
 |  ----------
 |  (for the __new__ method; see Notes below)
 |
 |  shape : tuple of ints
 |      Shape of created array.
 |  dtype : data-type, optional
 |      Any object that can

## EMBEDDING FROM SCRATCH

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

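# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    # Expand the mask from (batch, seq_len) to (batch, seq_len, hidden) to match the embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the unmasked tokens, then divide by their count (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)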


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
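
To make the pooling and normalization steps concrete, here is a dependency-free sketch of the same arithmetic on a toy input. It is not the library's implementation. The helper names `masked_mean_pool` and `l2_normalize` and the toy values are made up for illustration, and it handles a single sentence as plain Python lists instead of a batched tensor.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions whose mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Mirror the clamp in mean_pooling: avoid dividing by zero if every token is padding
    denom = max(count, 1e-9)
    return [s / denom for s in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy "sentence": two real tokens followed by one padding token
tokens = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)  # [3.0, 4.0]
print(unit)    # [0.6, 0.8]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the attention mask is passed to the pooling step. After normalization the vector has length 1.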
