# Embeddings and Dense Vector Search: A Quick Primer (HuggingFace Version)

> **Note:** This notebook uses HuggingFace's `all-MiniLM-L6-v2` model instead of OpenAI. It runs completely locally with no API keys required!

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

#### Why Do We Even Need Embeddings?

To fully understand what embeddings are, we first need to understand why we have them!

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

They need numeric inputs.

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

- Convert non-numeric data into numeric data
- Capture potential semantic relationships between individual pieces of data
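To see why the second requirement matters, here is a minimal sketch of the simplest possible numeric encoding, one-hot vectors. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration. The encoding satisfies the first bullet but not the second: every pair of distinct words ends up looking equally unrelated.

```python
# Toy vocabulary, chosen only for illustration
vocabulary = ["puppy", "dog", "car"]

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    return [1 if word == entry else 0 for entry in vocabulary]

def dot(vec_1, vec_2):
    """Sum of element-wise products; 0 means the vectors share nothing."""
    return sum(a * b for a, b in zip(vec_1, vec_2))

for word in vocabulary:
    print(word, one_hot(word))

# Distinct one-hot vectors never overlap, so "puppy" looks exactly as
# unrelated to "dog" as it does to "car".
print(dot(one_hot("puppy"), one_hot("dog")))  # 0
print(dot(one_hot("puppy"), one_hot("car")))  # 0
```

Dense embeddings address this by letting related words share values across many dimensions.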

#### How Do Embeddings Capture Semantic Relationships?

In a simplified sense, embeddings map a word or phrase into n-dimensional space with a dense continuous vector, where each dimension in the vector represents some "latent feature" of the data.

This is best represented in a classic example:

![image](https://i.imgur.com/K5eQtmH.png)

As can be seen in the extremely simplified example: The X_1 axis represents age, and the X_2 axis represents hair.

The relationship of "puppy -> dog" reflects the same relationship as "baby -> adult", but dogs are (typically) hairier than humans. However, adults typically have more hair than babies - so they are shifted slightly closer to dogs on the X_2 axis!

Now, this is a simplified and contrived example - but it is *essentially* the mechanism by which embeddings capture semantic information.

In reality, the dimensions don't literally represent hard concepts like "age" or "hair", but it's useful as a way to think about how the semantic relationships are captured.
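The idea behind the figure can be sketched with a few made-up numbers. The coordinates below are invented to mirror the age/hair picture (they do not come from any model), and `offset` is a hypothetical helper. The point is only that "puppy -> dog" and "baby -> adult" come out as the same displacement.

```python
# Hypothetical 2-D "embeddings": (age, hairiness). The numbers are invented
# to mirror the figure, not produced by any model.
toy_embeddings = {
    "baby":  (0.1, 0.1),
    "adult": (0.9, 0.3),
    "puppy": (0.1, 0.7),
    "dog":   (0.9, 0.9),
}

def offset(start, end):
    """Vector pointing from one word's position to another's (rounded to 2 places)."""
    (x1, y1), (x2, y2) = toy_embeddings[start], toy_embeddings[end]
    return (round(x2 - x1, 2), round(y2 - y1, 2))

print(offset("puppy", "dog"))   # (0.8, 0.2)
print(offset("baby", "adult"))  # (0.8, 0.2)
```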

Alright, with some background behind us - let's examine how these might help us choose relevant context.

Let's begin with a simple example - simply looking at how close two embedding vectors are for a pair of similar phrases.

When we use the term "close" in this notebook - we're referring to a similarity measure called "cosine similarity".

We discussed above that if two embeddings are close, they are semantically similar. Cosine similarity gives us a quick way to measure how similar two vectors are!

Cosine similarity ranges from -1 to 1: 1 means the vectors point in the same direction (extremely close in meaning), 0 means they are perpendicular (unrelated), and -1 means they point in opposite directions.
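For reference, this is the standard definition for two vectors $a$ and $b$ separated by an angle $\theta$ (the symbols are ours, chosen for illustration):

$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

Because the value depends only on the angle, scaling either vector by a positive number leaves it unchanged.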

Let's implement it with NumPy below.

In [1]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))
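As a quick sanity check, here is a plain-Python sketch that spells out the arithmetic behind the NumPy one-liner and tries it on three easy cases. The name `cosine_similarity_plain` is ours, not part of the notebook, and the vectors are chosen by hand so the results are easy to verify.

```python
import math

def cosine_similarity_plain(vec_1, vec_2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    length_1 = math.sqrt(sum(a * a for a in vec_1))
    length_2 = math.sqrt(sum(b * b for b in vec_2))
    return dot / (length_1 * length_2)

print(cosine_similarity_plain([3, 4], [6, 8]))    # 1.0  (same direction)
print(cosine_similarity_plain([1, 0], [0, 1]))    # 0.0  (perpendicular)
print(cosine_similarity_plain([3, 4], [-3, -4]))  # -1.0 (opposite direction)
```

Note that `[3, 4]` and `[6, 8]` score 1.0 even though their lengths differ: only direction matters.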

## Installing Dependencies

First, we need to install the HuggingFace sentence-transformers library. This will download and manage the embedding models for us.

> **Note:** This will also install PyTorch if you don't have it already. The first time you run the embedding model, it will download ~90MB.

In [2]:
!pip install sentence-transformers -q

Now let's import and initialize our HuggingFace embedding model. We'll use the `all-MiniLM-L6-v2` model, a small general-purpose sentence-embedding model that works well for semantic search.

> **Quick Info About `all-MiniLM-L6-v2`**:
> - It has a max sequence length of **256** word pieces (longer inputs are truncated)
> - It returns vectors with dimension **384** (compared to OpenAI's 1536)
> - It runs completely locally - no API key required!
> - It's free and open source

In [None]:
import os
import certifi

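# Optional: only needed behind a TLS-inspecting corporate proxy (e.g. Zscaler),
# where the model download fails with CERTIFICATE_VERIFY_FAILED.
# Builds a combined cert bundle: certifi's defaults plus the proxy's certificate.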
zscaler_cert = "/path/to/your/zscaler.crt"  # Update this path to your .crt file
combined_cert = "/tmp/combined_certs.pem"

with open(combined_cert, "w") as outfile:
    with open(certifi.where(), "r") as certifi_file:
        outfile.write(certifi_file.read())
    with open(zscaler_cert, "r") as zscaler_file:
        outfile.write(zscaler_file.read())

os.environ['REQUESTS_CA_BUNDLE'] = combined_cert
os.environ['SSL_CERT_FILE'] = combined_cert
os.environ['CURL_CA_BUNDLE'] = combined_cert

In [1]:
from sentence_transformers import SentenceTransformer

# Initialize the model - this will download it on first run
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

print(f"Model loaded: {embedding_model}")
print(f"Max sequence length: {embedding_model.max_seq_length}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")


Let's define our two sentences:

In [None]:
puppy_sentence = "i love puppies"
dog_sentence = "i love dogs"

Now we can convert those into embedding vectors using HuggingFace!

In [None]:
puppy_vector = embedding_model.encode(puppy_sentence)
dog_vector = embedding_model.encode(dog_sentence)

print(f"Puppy vector shape: {puppy_vector.shape}")
print(f"Dog vector shape: {dog_vector.shape}")
print(f"\nFirst 10 dimensions of puppy vector: {puppy_vector[:10]}")

Now we can determine how closely they are related using our similarity measure!

In [None]:
similarity = cosine_similarity(puppy_vector, dog_vector)
print(f"Cosine similarity between 'i love puppies' and 'i love dogs': {similarity}")

Remember, with cosine similarity, a value close to 1 means they're very close!

Let's see what happens if we use a different set of sentences.

In [None]:
puppy_sentence = "I love puppies!"
cat_sentence = "I dislike cats!"

puppy_vector = embedding_model.encode(puppy_sentence)
cat_vector = embedding_model.encode(cat_sentence)

similarity = cosine_similarity(puppy_vector, cat_vector)
print(f"Cosine similarity between 'I love puppies!' and 'I dislike cats!': {similarity}")

You should see a lower score than before - these vectors are further apart, as expected, since both the sentiment and the animal differ!

### Embedding Vector Calculations

One of the ways that embedding vectors can be leveraged, and a fun "proof" that they work the way we expect, is through "vector calculations".

That is to say: If we take the vector for "King", and subtract the vector for "man", and add the vector for "woman" - we should have a vector that is similar to "Queen".

Let's try this out below!

In [None]:
# encode() already returns a NumPy array, so no extra conversion is needed
king_vector = embedding_model.encode("King")
man_vector = embedding_model.encode("man")
woman_vector = embedding_model.encode("woman")

vector_calculation_result = king_vector - man_vector + woman_vector

queen_vector = embedding_model.encode("Queen")

similarity = cosine_similarity(vector_calculation_result, queen_vector)
print(f"Cosine similarity between (King - man + woman) and Queen: {similarity}")

You should find that the resulting vector is reasonably close to the "Queen" vector, though not a perfect match!

> NOTE: The gap from a perfect score of 1 is explained by the vectors not *literally* encoding information along axes as simple as "man" or "woman".
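To make the arithmetic itself concrete, here is a toy sketch with hand-made 2-D vectors. The axes (royalty, gender) and all values are invented for illustration, so the calculation lands exactly on "queen" here, whereas a real model only gets approximately close.

```python
import math

# Hypothetical vectors: (royalty, gender), with gender as -1 = male, +1 = female.
# Real embeddings have no such labeled axes.
toy_vectors = {
    "king":  (1, -1),
    "queen": (1, 1),
    "man":   (0, -1),
    "woman": (0, 1),
}

def cosine(vec_1, vec_2):
    """Cosine similarity for two 2-D vectors."""
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    return dot / (math.hypot(*vec_1) * math.hypot(*vec_2))

# king - man + woman, computed element by element
result = tuple(
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
)

# Pick the word whose vector is most similar to the result
closest = max(toy_vectors, key=lambda word: cosine(result, toy_vectors[word]))
print(result, closest)  # (1, 1) queen
```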

### Batch Processing

One of the great features of sentence-transformers is efficient batch processing. Let's see how we can embed multiple sentences at once:

In [None]:
sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Dogs are loyal companions",
    "The weather is sunny today",
    "It's a beautiful day outside"
]

# Encode all sentences at once
embeddings = embedding_model.encode(sentences)

print(f"Encoded {len(sentences)} sentences")
print(f"Embeddings shape: {embeddings.shape}")
print("\nSimilarities:")
print(f"Cat/mat vs Feline/rug: {cosine_similarity(embeddings[0], embeddings[1]):.4f}")
print(f"Cat/mat vs Dogs: {cosine_similarity(embeddings[0], embeddings[2]):.4f}")
print(f"Weather sentences: {cosine_similarity(embeddings[3], embeddings[4]):.4f}")
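Once a batch of sentences is embedded, dense vector search comes down to ranking the stored vectors by their similarity to a query vector. The sketch below shows that ranking step on small stand-in vectors so it runs without the model. The `top_k` helper and the sample values are hypothetical, not part of sentence-transformers.

```python
import math

def cosine(vec_1, vec_2):
    """Cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    length_1 = math.sqrt(sum(a * a for a in vec_1))
    length_2 = math.sqrt(sum(b * b for b in vec_2))
    return dot / (length_1 * length_2)

def top_k(query_vector, corpus_vectors, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scored = [(i, cosine(query_vector, vec)) for i, vec in enumerate(corpus_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Stand-in 3-D vectors; in the notebook these would be the rows of `embeddings`
corpus = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.1), (0.0, 1.0, 0.2), (0.1, 0.0, 1.0)]
query = (1.0, 0.0, 0.0)

for index, score in top_k(query, corpus):
    print(index, round(score, 3))
```

With real embeddings, the returned indices would map back into the `sentences` list. For large collections a vectorized NumPy version or a dedicated vector index would be a better fit than this loop.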

### Performance Comparison

Let's see how fast the local HuggingFace model is for embedding text:

In [None]:
import time

# Warm-up call so one-time setup cost doesn't skew the first measurement
embedding_model.encode("warm-up")

# Test single embedding
text = "This is a test sentence for measuring embedding speed."

# perf_counter() is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
for _ in range(100):
    embedding = embedding_model.encode(text)
single_ms = (time.perf_counter() - start) / 100 * 1000

print(f"Average time per embedding (single): {single_ms:.2f}ms")

# Test batch embedding
texts = [f"This is test sentence number {i}" for i in range(100)]

start = time.perf_counter()
embeddings = embedding_model.encode(texts)
batch_total_ms = (time.perf_counter() - start) * 1000
batch_ms = batch_total_ms / len(texts)

print(f"Time for batch of 100: {batch_total_ms:.2f}ms")
print(f"Average time per embedding (batch): {batch_ms:.2f}ms")
print(f"\nSpeedup from batching: {single_ms / batch_ms:.1f}x")

### Advantages of Using HuggingFace Models

1. **No API Key Required**: No need to manage API keys or worry about rate limits
2. **Free**: Completely free to use, no per-token costs
3. **Privacy**: Your data never leaves your machine
4. **Offline**: Works without internet (after initial model download)
5. **Speed**: Can be very fast, especially with GPU acceleration
6. **Smaller Vectors**: 384 dimensions vs OpenAI's 1536 means less storage and faster comparisons
7. **Open Source**: Full transparency in model architecture and training
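The storage claim in point 6 is easy to put numbers on. This back-of-the-envelope sketch assumes each value is stored as a 32-bit float (4 bytes) and counts only the raw vectors, ignoring any index or metadata overhead. `storage_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def storage_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the vectors alone, ignoring any index or metadata overhead."""
    return num_vectors * dimensions * bytes_per_value

one_million = 1_000_000
for dimensions in (384, 1536):
    gigabytes = storage_bytes(one_million, dimensions) / 1e9
    print(f"{dimensions} dims: {gigabytes:.2f} GB per million vectors")
# 384 dims: 1.54 GB per million vectors
# 1536 dims: 6.14 GB per million vectors
```

Under these assumptions the 384-dimensional vectors need a quarter of the space.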

### Considerations

1. **Initial Download**: ~90MB model download on first use
2. **Hardware Dependent**: Performance depends on your CPU/GPU
3. **Dimension Differences**: Embeddings from different models live in different vector spaces, so they cannot be mixed or compared (even if the dimensions happen to match)
4. **Max Length**: 256 word pieces (vs OpenAI's 8191 tokens)
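One common way to live with the max-length limit is to split long documents into overlapping chunks and embed each chunk separately. The sketch below is a rough illustration under stated assumptions: `chunk_words` is a hypothetical helper, it counts whitespace-separated words rather than word pieces (one word can become several word pieces), and the 150-word budget is an arbitrary conservative choice. Counting with the model's own tokenizer would be more exact.

```python
def chunk_words(text, max_words=150, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 400-word document, used only to demonstrate the windowing
long_text = " ".join(f"word{i}" for i in range(400))
chunks = chunk_words(long_text)
print(len(chunks), [len(chunk.split()) for chunk in chunks])  # 3 [150, 150, 140]
```

Each chunk can then be passed to `embedding_model.encode` like any other sentence.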

### Using GPU Acceleration (Optional)

If you have a CUDA-capable GPU, you can significantly speed up embeddings:

In [None]:
import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    
    # Move model to GPU
    embedding_model = embedding_model.to('cuda')
    print("Model moved to GPU!")
else:
    print("Running on CPU")

### Conclusion

As you can see - embeddings can help us convert text into a machine-understandable format, which we can leverage for a number of purposes.

HuggingFace's sentence-transformers library provides an excellent, free, and privacy-preserving alternative to proprietary embedding APIs. The `all-MiniLM-L6-v2` model offers:

- Strong performance for semantic search and similarity tasks
- Smaller embedding dimensions (384 vs 1536) for efficient storage
- Local processing without any external API dependencies
- Fast inference, especially with batch processing or GPU acceleration

This makes it an excellent choice for learning about embeddings and building RAG applications!