# Sentence embeddings

In this section, we explore more advanced embedding models that are based on tokens smaller than words (subword tokens) and that capture more semantic information than word-level models. These models are also able to compute an embedding for an entire piece of text (such as a sentence).
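
To make "tokens smaller than words" concrete, here is a minimal sketch of greedy longest-match splitting. It is a simplification of what subword tokenizers do, not the tokenizer used by the model in this section. The vocabulary `VOCAB` and the function `split_into_subwords` are hypothetical names introduced only for this illustration.

```python
# Toy vocabulary (hypothetical). Real tokenizers learn a much larger one from data.
VOCAB = {"nano", "por", "ous", "material", "s"}


def split_into_subwords(word, vocab):
    """Split `word` into vocabulary pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece matches here: mark the whole word as unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


print(split_into_subwords("nanoporous", VOCAB))  # ['nano', 'por', 'ous']
print(split_into_subwords("materials", VOCAB))   # ['material', 's']
```

A rare word such as "nanoporous" is thus represented by a few frequent pieces instead of a single unknown token.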

## Load required libraries

In [2]:
from fastembed import TextEmbedding
import numpy as np



## Load the model

This model is a reduced version of the `nomic-embed-text-v1.5` model.

Loading the model can take a few minutes, so be patient.

In [3]:
model = TextEmbedding("nomic-ai/nomic-embed-text-v1.5-Q")

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 57143.11it/s]


## Explore sentence embeddings

We first compute the embedding vector for a single sentence:

In [10]:
# The .embed() method returns a generator; we convert it to a list
# and take the first element to get the vector.
vector = list(model.embed("Hello world!"))[0]

Display the embedding vector and its shape.

In [13]:
print(vector)
print(f"\nVector length: {len(vector)}")

[ 6.64209053e-02  2.07114875e-01 -4.30757713e+00 -6.29243627e-02
 -6.50362074e-01  1.85833192e+00 -1.15600415e-01 -2.41471291e-01
 -5.29089510e-01 -1.38232350e+00  5.17098248e-01  1.09080660e+00
  1.71259552e-01  1.84874988e+00  3.15962940e-01 -1.50176513e+00
  1.03669465e-01 -9.31438446e-01 -1.43772817e+00  2.07467705e-01
 -6.94462776e-01 -1.73679674e+00 -3.64706069e-01  1.00948822e+00
  3.23370862e+00  2.78402209e-01 -9.39982057e-01  1.18713355e+00
  3.19130898e-01  3.52867752e-01 -3.20809036e-01  2.04197094e-01
  2.10528836e-01  1.77232012e-01  1.29584503e+00 -2.50220336e-02
  4.06956315e-01  5.47878563e-01  4.30239365e-02 -1.61858708e-01
  7.63812661e-01  9.46502239e-02  4.22711492e-01 -2.59554148e-01
  1.44364047e+00 -3.86580139e-01 -1.21548288e-01 -5.82908690e-02
  1.47526121e+00 -1.12174594e+00 -4.86950636e-01 -1.83095853e-03
  1.85308129e-01  1.21643257e+00  1.38719308e+00  8.42705309e-01
  7.67087936e-01 -1.31899536e+00  2.83678949e-01  1.08697498e+00
  6.19799674e-01  1.24792

With this model, embeddings are 768-dimensional vectors.

We can also compute embeddings for a list of sentences. The model returns one embedding per sentence, which we stack into a NumPy array.

In [16]:
sentences = [
    "The weather is lovely today.",
    "It's very sunny outside!",
    "Are you watching the soccer game on TV?",
    "I love playing football.",
    "Are you studying nanoporous materials?",
]
embeddings = np.array(list(model.embed(sentences)))
print(embeddings)
print(f"Shape of embedding vectors: {embeddings.shape}")

[[ 0.41381216 -0.15882386 -4.4720335  ... -0.3836937  -0.7774241
   0.19564037]
 [-0.00912805  0.14920007 -4.606499   ... -0.70395666 -0.32203782
  -0.3235845 ]
 [ 1.3647739   1.65053    -3.583437   ... -0.5653511  -1.1001977
   0.6852042 ]
 [ 0.35631263  1.2773244  -3.928822   ... -0.21479225 -0.92511356
   0.48946184]
 [ 0.9096239   2.103455   -2.8272188  ... -0.46910593 -1.1943344
  -0.09262528]]
Shape of embedding vectors: (5, 768)


In this example, we have 5 vectors of 768 dimensions.

## Compare sentences

We can compare two sentences by computing the cosine similarity between their embeddings.
We first normalize the embeddings to unit length, so that the cosine similarity of two vectors reduces to their dot product.

Normalization and cosine similarity can be computed with either the `scikit-learn` library or the `numpy` library.

### Method 1 with scikit-learn

In [29]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Normalize each embedding (row) to unit L2 norm
embeddings_normalized = normalize(embeddings, axis=1)
# Compute the pairwise cosine similarity between all embeddings
# (cosine_similarity also normalizes its inputs internally)
similarity = cosine_similarity(embeddings_normalized, embeddings_normalized)
print(similarity)

[[1.         0.74607956 0.4681632  0.48687425 0.36103436]
 [0.74607956 1.         0.5475793  0.48899338 0.356462  ]
 [0.4681632  0.5475793  0.9999999  0.59073675 0.43514606]
 [0.48687425 0.48899338 0.59073675 1.         0.35732406]
 [0.36103436 0.356462   0.43514606 0.35732406 1.        ]]


### Method 2 with NumPy

Cosine similarity between vectors X and Y is the normalized dot product of X and Y:
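
In symbols, for two vectors $X$ and $Y$ with $d$ components each (the index $i$ and the dimension $d$ are notation introduced here for illustration):

$$
\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}
           = \frac{\sum_{i=1}^{d} X_i Y_i}{\sqrt{\sum_{i=1}^{d} X_i^2} \, \sqrt{\sum_{i=1}^{d} Y_i^2}}
$$

When both vectors have already been scaled to unit length, the denominator equals 1 and the cosine similarity is simply the dot product $X \cdot Y$.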

In [30]:
# Normalize embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings_normalized = embeddings / norms
# Compute cosine similarity matrix
similarity_matrix = np.dot(embeddings_normalized, embeddings_normalized.T)
print(similarity_matrix)

[[0.99999994 0.7460795  0.4681632  0.48687416 0.3610344 ]
 [0.7460795  0.9999999  0.5475792  0.48899332 0.35646197]
 [0.4681632  0.5475792  1.         0.59073675 0.43514612]
 [0.48687416 0.48899332 0.59073675 0.99999994 0.35732403]
 [0.3610344  0.35646197 0.43514612 0.35732403 1.        ]]
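
The two methods agree up to small differences in the last digits, and some diagonal entries print as `0.9999999` rather than `1`. These differences are consistent with floating-point rounding. The following sketch illustrates the effect on synthetic data. It assumes random `float32` vectors of the same shape as the embeddings above. The names `toy_embeddings`, `sim_normalized` and `sim_direct` are introduced only for this example.

```python
import numpy as np

# Synthetic stand-in for the embeddings (assumption: 5 vectors, 768 dims, float32).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 768)).astype(np.float32)

# Route A: normalize each row first, then take dot products.
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
sim_normalized = unit @ unit.T

# Route B: apply the cosine formula directly to the raw vectors.
norms = np.linalg.norm(toy_embeddings, axis=1)
sim_direct = (toy_embeddings @ toy_embeddings.T) / np.outer(norms, norms)

# Both routes give the same matrix within a small tolerance.
print(np.allclose(sim_normalized, sim_direct, atol=1e-4))
# Self-similarity is 1 within the same tolerance, not necessarily exactly 1.
print(np.allclose(np.diag(sim_normalized), 1.0, atol=1e-4))
```

For this reason, it is safer to compare similarity values with a tolerance (for example with `np.allclose`) than with strict equality.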


### Find most similar sentences

In [33]:
# Set the diagonal to -inf to exclude self-similarity
np.fill_diagonal(similarity_matrix, -np.inf)
# Find the most similar sentence for each sentence
for i, sentence in enumerate(sentences):
    row = similarity_matrix[i]
    most_sim_idx = np.argmax(row)
    print(f"Target sentence      : {sentence}")
    print(f"Most similar sentence: {sentences[most_sim_idx]}")
    print(f"Cosine similarity    : {row[most_sim_idx]:.3f}")
    print("-" * 60)

Target sentence      : The weather is lovely today.
Most similar sentence: It's very sunny outside!
Cosine similarity    : 0.746
------------------------------------------------------------
Target sentence      : It's very sunny outside!
Most similar sentence: The weather is lovely today.
Cosine similarity    : 0.746
------------------------------------------------------------
Target sentence      : Are you watching the soccer game on TV?
Most similar sentence: I love playing football.
Cosine similarity    : 0.591
------------------------------------------------------------
Target sentence      : I love playing football.
Most similar sentence: Are you watching the soccer game on TV?
Cosine similarity    : 0.591
------------------------------------------------------------
Target sentence      : Are you studying nanoporous materials?
Most similar sentence: Are you watching the soccer game on TV?
Cosine similarity    : 0.435
------------------------------------------------------------
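
The loop above keeps only the single best match and relies on `np.fill_diagonal`, which modifies the similarity matrix in place. As a possible variation, the sketch below ranks the `k` most similar sentences while working on a copy of the row. The helper `top_k_similar` is a hypothetical name, and `toy_similarity` contains rounded values copied from the similarity matrix printed earlier.

```python
import numpy as np


def top_k_similar(similarity, index, k=2):
    """Return (indices, scores) of the k rows most similar to row `index`."""
    row = similarity[index].astype(float)  # astype returns a copy of the row
    row[index] = -np.inf                   # exclude the sentence itself
    order = np.argsort(row)[::-1][:k]      # sort by score, highest first
    return order, row[order]


# Rounded values copied from the similarity matrix computed above.
toy_similarity = np.array([
    [1.000, 0.746, 0.468, 0.487, 0.361],
    [0.746, 1.000, 0.548, 0.489, 0.356],
    [0.468, 0.548, 1.000, 0.591, 0.435],
    [0.487, 0.489, 0.591, 1.000, 0.357],
    [0.361, 0.356, 0.435, 0.357, 1.000],
])

# Two closest sentences to sentence 2 ("Are you watching the soccer game on TV?")
indices, scores = top_k_similar(toy_similarity, index=2, k=2)
print(indices, scores)
```

With these values, sentence 2 is closest to sentence 3 ("I love playing football.") and then to sentence 1 ("It's very sunny outside!"), and the input matrix keeps its diagonal of ones.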
