# Sentence embeddings

In this section, we explore more advanced embedding models based on the transformer neural network architecture. These models work on tokens, units that are often smaller than words, and use the surrounding context to capture more semantic information than word-level embeddings. They can also compute an embedding for an entire piece of text, such as a sentence.

## Load required libraries

In [1]:
from fastembed import TextEmbedding
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

## Load the model

This model is a reduced version (quantized, as the `-Q` suffix suggests) of the model [`nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5). The original is a 137-million-parameter model trained on a large corpus of text to generate sentence embeddings. The generated embeddings are 768-dimensional vectors that capture the semantic meaning of the input text.

Loading the model can take a few minutes, so be patient.

In [2]:
model = TextEmbedding("nomic-ai/nomic-embed-text-v1.5-Q")


## Explore sentence embeddings

We will first compute the embedding vector for a single sentence:

In [3]:
# The .embed() method returns a generator that we convert to a list
# and then take the first element to get the vector.
vector = list(model.embed("Hello world!"))[0]  

Display the embedding vector and its size.

In [4]:
print("Ten first elements of the vector:")
print(vector[:10])
print(f"\nVector size: {len(vector)}")

Ten first elements of the vector:
[ 0.06642091  0.20711488 -4.307577   -0.06292436 -0.6503621   1.8583319
 -0.11560041 -0.24147129 -0.5290895  -1.3823235 ]

Vector size: 768


With this model, embeddings are 768-dimensional vectors.

We can also compute the embeddings for a list of sentences. The model will return a list of embeddings, one for each sentence.

In [5]:
sentences = [
    "The weather is lovely today.",
    "It's very sunny outside!",
    "Are you watching the soccer game on TV?",
    "I love playing football.",
    "Are you studying nanoporous materials?",
]
embeddings = np.array(list(model.embed(sentences)))
print("\nTen first elements of the embedding vectors for each sentence:")
print(embeddings[:, :10])
print(f"Shape of embedding vectors: {embeddings.shape}")


Ten first elements of the embedding vectors for each sentence:
[[ 0.41381216 -0.15882386 -4.4720335  -0.3561031   0.43502694  1.8090299
   0.8556498  -0.48111734  0.32931688 -1.5628498 ]
 [-0.00912805  0.14920007 -4.606499   -0.16274986 -0.8524226   1.7413867
   0.4728594   0.55459136  0.4093597  -1.1776258 ]
 [ 1.3647739   1.65053    -3.583437    0.36573407  0.54171497  0.06346153
  -0.5936647  -0.33975348 -0.28960556 -0.58449554]
 [ 0.35631263  1.2773244  -3.928822    0.5598846   0.8928453   0.3641485
  -0.6450936  -0.48812354  0.05629476 -1.0712796 ]
 [ 0.9096239   2.103455   -2.8272188  -0.3196854   0.72942066 -0.20781904
   0.5402453   0.34872225  0.22590417  0.16909127]]
Shape of embedding vectors: (5, 768)


In this example, we have 5 vectors of 768 dimensions.
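
To make the generator-to-array step used above more concrete, here is a minimal sketch. It does not call the real model: `fake_embed` is a hypothetical stand-in that, like `.embed()` as described in the cell comments, yields one vector per input text. The vector size of 4 is an arbitrary choice for readability.

```python
import numpy as np


def fake_embed(texts, dim=4):
    """Hypothetical stand-in for model.embed(): yields one vector per text."""
    for i, _ in enumerate(texts):
        # Each "embedding" is a constant vector, only to keep the output readable.
        yield np.full(dim, float(i), dtype=np.float32)


gen = fake_embed(["a", "b", "c"])

# Consuming the generator into a list, then stacking, gives a 2-D array
# with one row per text and one column per dimension.
stacked = np.array(list(gen))
print(stacked.shape)

# A generator can only be consumed once: a second pass yields nothing.
leftover = list(gen)
print(leftover)
```

Because a generator is exhausted after one pass, it is usually safer to store the result (as `embeddings` does above) than to iterate over the generator a second time.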

## Compare sentences

We can compare two sentences by computing the cosine similarity between their embedding vectors.
We first normalize the embeddings to unit length, so that the cosine similarity reduces to a simple dot product.

Normalization and cosine similarity can be computed with the `scikit-learn` library or with the `numpy` library.
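
As a small worked illustration of why normalization is useful, the sketch below computes the cosine similarity of two toy 2-dimensional vectors in two ways: directly from the definition, and as a plain dot product after scaling each vector to unit length. The vectors `x` and `y` are made-up values, not model outputs.

```python
import numpy as np

# Toy vectors chosen so that the norms are easy to check by hand (both equal 5).
x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

# Definition: dot product divided by the product of the norms.
cos_direct = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Alternative: scale each vector to unit length, then take a plain dot product.
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
cos_normalized = np.dot(x_unit, y_unit)

# Both values are approximately 24 / 25 = 0.96.
print(cos_direct, cos_normalized)
```

With many vectors, normalizing once up front means every pairwise comparison afterwards is only a dot product.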

### Method 1 with scikit-learn

In [6]:
# Normalize embeddings
embeddings_normalized = normalize(embeddings, axis=1)
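# Compute the cosine similarity between embeddings.
# Note: cosine_similarity() also normalizes its inputs internally,
# so the explicit normalization above is optional with this method.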
similarity_matrix = cosine_similarity(embeddings_normalized, embeddings_normalized)
print(similarity_matrix)

[[1.         0.74607956 0.4681632  0.48687425 0.36103436]
 [0.74607956 1.         0.5475793  0.48899338 0.356462  ]
 [0.4681632  0.5475793  0.9999999  0.59073675 0.43514606]
 [0.48687425 0.48899338 0.59073675 1.         0.35732406]
 [0.36103436 0.356462   0.43514606 0.35732406 1.        ]]


### Method 2 with NumPy

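Cosine similarity between vectors X and Y is the dot product of X and Y divided by the product of their norms. Once the vectors are normalized to unit length, it is simply their dot product: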

In [7]:
# Normalize embeddings
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings_normalized = embeddings / norms
# Compute cosine similarity matrix
similarity_matrix = np.dot(embeddings_normalized, embeddings_normalized.T)
print(similarity_matrix)

[[0.99999994 0.7460795  0.4681632  0.48687416 0.3610344 ]
 [0.7460795  0.9999999  0.5475792  0.48899332 0.35646197]
 [0.4681632  0.5475792  1.         0.59073675 0.43514612]
 [0.48687416 0.48899332 0.59073675 0.99999994 0.35732403]
 [0.3610344  0.35646197 0.43514612 0.35732403 1.        ]]
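
The diagonal values such as `0.99999994` in the matrices above are most likely a rounding effect of 32-bit floating point arithmetic and not a real difference between a sentence and itself. The following sketch illustrates this on random stand-in vectors (not real embeddings), under the assumption that the embeddings are stored as `float32`. It also shows two common ways to deal with it: comparing with a tolerance and clipping.

```python
import numpy as np

# Random stand-in vectors with the same shape and dtype as the embeddings above.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 768)).astype(np.float32)

# Normalize each row to unit length and compute all pairwise dot products.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Self-similarities are 1 only up to float32 rounding error.
diag = np.diag(sim)
print(diag)

# Compare with a tolerance instead of testing for exact equality.
print(np.allclose(diag, 1.0, atol=1e-5))

# Clip to [-1, 1] if a later step (e.g. arccos) requires valid cosine values.
sim_clipped = np.clip(sim, -1.0, 1.0)
```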


### Find most similar sentences

In [8]:
# Set the diagonal to -inf to exclude self-similarity.
np.fill_diagonal(similarity_matrix, -np.inf)
# Find the most similar sentence for each sentence.
for i, sentence in enumerate(sentences):
    row = similarity_matrix[i]
    most_sim_idx = np.argmax(row)
    print(f"Target sentence      : {sentence}")
    print(f"Most similar sentence: {sentences[most_sim_idx]}")
    print(f"Cosine similarity    : {row[most_sim_idx]:.3f}")
    print("-" * 70)

Target sentence      : The weather is lovely today.
Most similar sentence: It's very sunny outside!
Cosine similarity    : 0.746
----------------------------------------------------------------------
Target sentence      : It's very sunny outside!
Most similar sentence: The weather is lovely today.
Cosine similarity    : 0.746
----------------------------------------------------------------------
Target sentence      : Are you watching the soccer game on TV?
Most similar sentence: I love playing football.
Cosine similarity    : 0.591
----------------------------------------------------------------------
Target sentence      : I love playing football.
Most similar sentence: Are you watching the soccer game on TV?
Cosine similarity    : 0.591
----------------------------------------------------------------------
Target sentence      : Are you studying nanoporous materials?
Most similar sentence: Are you watching the soccer game on TV?
Cosine similarity    : 0.435
------------------------
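
The loop above reports only the single closest sentence. A possible extension, sketched below, ranks the `k` closest sentences with `np.argsort`. The helper `top_k_neighbors` is a hypothetical name introduced for this illustration, and the similarity values are copied (rounded to three decimals) from the matrix printed earlier so that the example runs without loading the model.

```python
import numpy as np

sentences = [
    "The weather is lovely today.",
    "It's very sunny outside!",
    "Are you watching the soccer game on TV?",
    "I love playing football.",
    "Are you studying nanoporous materials?",
]

# Similarity values rounded from the matrix computed earlier in this section.
similarity = np.array([
    [1.000, 0.746, 0.468, 0.487, 0.361],
    [0.746, 1.000, 0.548, 0.489, 0.356],
    [0.468, 0.548, 1.000, 0.591, 0.435],
    [0.487, 0.489, 0.591, 1.000, 0.357],
    [0.361, 0.356, 0.435, 0.357, 1.000],
])


def top_k_neighbors(sim_matrix, k=2):
    """Return, for each row, the indices of the k most similar other rows."""
    sim = sim_matrix.astype(float).copy()   # work on a copy, keep the input intact
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    order = np.argsort(-sim, axis=1)        # sort each row by decreasing similarity
    return order[:, :k]


top2 = top_k_neighbors(similarity, k=2)
for i, sentence in enumerate(sentences):
    print(f"Target sentence: {sentence}")
    for rank, j in enumerate(top2[i], start=1):
        print(f"  {rank}. {sentences[j]} ({similarity[i, j]:.3f})")
```

Unlike the in-place `np.fill_diagonal` call used above, this helper works on a copy, so the original similarity matrix keeps its diagonal and can be reused.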