# Understanding Text Embeddings: From Words to Vectors

This notebook explores text embeddings, a fundamental concept in natural language processing. We'll investigate how computers represent semantic meaning by converting words and sentences into numerical vectors.

**What Are Embeddings?**
- Dense numerical representations of data in a continuous vector space
- Inputs with similar meanings are positioned close together
- Relative positions capture semantic relationships
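
To make "close together" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The vectors and the `euclidean_distance` helper are invented for illustration; they are not produced by any model, and real embeddings have hundreds of learned dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "cat":    np.array([0.90, 0.80, 0.10]),
    "kitten": np.array([0.85, 0.90, 0.15]),
    "car":    np.array([0.10, 0.20, 0.95]),
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

d_cat_kitten = euclidean_distance(toy_embeddings["cat"], toy_embeddings["kitten"])
d_cat_car = euclidean_distance(toy_embeddings["cat"], toy_embeddings["car"])

# The related pair sits much closer together than the unrelated pair.
print(f"cat <-> kitten: {d_cat_kitten:.3f}")
print(f"cat <-> car:    {d_cat_car:.3f}")
```

In this toy setup the related words are separated by a much smaller distance than the unrelated ones, which is the property a trained embedding model aims to produce.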

## 1. Setup and Installation

In [5]:
# Install the required packages
!uv pip install -U sentence-transformers

Using Python 3.12.9 environment at: /workspaces/fundamentals-of-ai-engineering-principles-and-practical-applications-6026542/.venv
Resolved 44 packages in 226ms
Prepared 1 package in 6ms
         If the cache and target directories are on different filesystems, hardlinking may not be supported.
Installed 1 package in 136ms
 + sentence-transformers==4.0.2


In [6]:
# Import libraries
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Set up matplotlib
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:  # Raised when the named style is not available
    try:
        plt.style.use('seaborn-whitegrid')  # Fallback for older Matplotlib versions
    except OSError:
        pass  # Keep the default style if neither is available

plt.rcParams['figure.figsize'] = (10, 7)
np.random.seed(42)  # For reproducibility

## 2. Loading an Embedding Model

We'll use the `all-MiniLM-L6-v2` model, which creates 384-dimensional embeddings and is optimized for semantic similarity tasks.

In [7]:
# Load the pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Model: all-MiniLM-L6-v2")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")


## 3. Creating and Examining Embeddings

Let's create embeddings for some example sentences grouped by topic.

In [8]:
# Example sentences grouped by topic
sentences = [
    # AI/ML related sentences
    "I love machine learning and artificial intelligence.",
    "AI and ML are fascinating fields of study.",
    
    # Weather related sentences
    "The weather is beautiful today.",
    "It's a sunny day with clear skies.",
    
    # Python related sentences
    "Python is my favorite programming language.",
    "I enjoy coding in Python for data analysis."
]

# Topic labels for visualization
topics = ['AI/ML', 'AI/ML', 'Weather', 'Weather', 'Python', 'Python']

# Display our sentences with their topics
for i, (sentence, topic) in enumerate(zip(sentences, topics)):
    print(f"Sentence {i+1} ({topic}): {sentence}")

Sentence 1 (AI/ML): I love machine learning and artificial intelligence.
Sentence 2 (AI/ML): AI and ML are fascinating fields of study.
Sentence 3 (Weather): The weather is beautiful today.
Sentence 4 (Weather): It's a sunny day with clear skies.
Sentence 5 (Python): Python is my favorite programming language.
Sentence 6 (Python): I enjoy coding in Python for data analysis.


In [9]:
# Create embeddings for our sentences
embeddings = model.encode(sentences)

# Display embedding information
print(f"Shape of each embedding: {embeddings[0].shape}")
print(f"Number of embeddings: {len(embeddings)}")

# Show a snippet of the first embedding
print(f"\nFirst 10 dimensions of first embedding: {embeddings[0][:10]}")
print(f"Min: {embeddings[0].min():.4f}, Max: {embeddings[0].max():.4f}, Mean: {embeddings[0].mean():.4f}")


## 4. Measuring Similarity with Cosine Similarity

**Cosine Similarity Explained:**
- Measures the cosine of the angle between two vectors
- Ranges from -1 (opposite directions) through 0 (orthogonal, unrelated) to 1 (same direction)
- Higher values indicate greater semantic similarity
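
The definition above corresponds to the formula cos(θ) = (a · b) / (‖a‖ ‖b‖). The following is a minimal NumPy sketch of that formula; the `cosine_sim` helper and the test vectors are illustrative names chosen here, not part of scikit-learn, whose `cosine_similarity` function is what the notebook uses.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])

same_direction = cosine_sim(v, 2 * v)             # scaled copy: close to 1
opposite = cosine_sim(v, -v)                      # negated copy: close to -1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])   # perpendicular: close to 0

print(same_direction, opposite, orthogonal)
```

Note that scaling a vector does not change the result: cosine similarity depends only on direction, not on length.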

In [10]:
# Calculate cosine similarity between all pairs of embeddings
similarity_matrix = cosine_similarity(embeddings)

# Display the similarity matrix
print("Cosine Similarity Matrix:")
np.set_printoptions(precision=4, suppress=True)
print(similarity_matrix)


In [None]:
# Create labels for our heatmap
labels = [f"S{i+1}: {topic}" for i, topic in enumerate(topics)]

# Create a heatmap of the similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, annot=True, cmap='viridis', xticklabels=labels, yticklabels=labels)
plt.title('Cosine Similarity Heatmap')
plt.tight_layout()
plt.show()

print("Heatmap Interpretation:")
print("- The diagonal (1.0 values) shows each sentence's similarity with itself")
print("- Brighter blocks show high similarity between sentences on the same topic")
print("- Darker areas show lower similarity between sentences on different topics")

## 5. Visualizing Embeddings in 2D Space

We'll use PCA to reduce our 384-dimensional embeddings to 2D for visualization.
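
For intuition about what the reduction involves, here is a rough sketch of PCA written directly with NumPy: center the data, take a singular value decomposition, and project onto the top two directions. It runs on random synthetic data standing in for embeddings, and it is a simplified illustration rather than scikit-learn's actual implementation (the signs of the components, for instance, may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 6 points in 384 dimensions.
X = rng.normal(size=(6, 384))

# 1. Center each dimension at zero.
X_centered = X - X.mean(axis=0)

# 2. SVD: the rows of Vt are the principal directions, ordered by importance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project the centered data onto the first two directions.
X_2d = X_centered @ Vt[:2].T

# 4. Share of the total variance captured by each direction.
explained_variance_ratio = (S ** 2) / np.sum(S ** 2)

print(X_2d.shape)
print(explained_variance_ratio[:2])
```

With only a handful of points, two components can capture a large share of the variance, so a tidy 2D picture should be read as a rough guide rather than a faithful map of the 384-dimensional space.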

In [None]:
# Reduce embeddings to 2 dimensions using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Set up colors for topics
topic_colors = {'AI/ML': 'red', 'Weather': 'blue', 'Python': 'green'}
colors = [topic_colors[topic] for topic in topics]

# Plot the 2D embeddings
plt.figure(figsize=(12, 8))
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=colors[i], s=100, alpha=0.7, edgecolors='black')
    plt.annotate(f"S{i+1}", 
                xy=(x, y), 
                xytext=(5, 5), 
                textcoords='offset points',
                fontsize=12,
                weight='bold')

# Add a legend
for topic, color in topic_colors.items():
    plt.scatter([], [], c=color, label=topic, s=100, alpha=0.7)
plt.legend(loc='upper right')

# Add title and labels
plt.title('2D PCA Projection of Sentence Embeddings', fontsize=15)
plt.xlabel(f'Principal Component 1 (Explained Variance: {pca.explained_variance_ratio_[0]:.2%})', fontsize=12)
plt.ylabel(f'Principal Component 2 (Explained Variance: {pca.explained_variance_ratio_[1]:.2%})', fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Sentences on the same topic should appear closer together in the 2D space.")
print(f"The two principal components capture {sum(pca.explained_variance_ratio_):.2%} of the total variance.")

## 6. Testing with New Sentences

Let's see how our model handles new sentences related to our original topics.

In [None]:
# Define new sentences
new_sentences = [
    "Deep learning has revolutionized computer vision.",  # AI/ML related
    "The forecast predicts rain for tomorrow.",           # Weather related
    "NumPy and Pandas are essential Python libraries."    # Python related
]

# Create embeddings for the new sentences
new_embeddings = model.encode(new_sentences)

# Calculate similarity between new and original sentences
similarity_to_original = cosine_similarity(new_embeddings, embeddings)

# Find the most similar original sentence for each new sentence
for i, new_sent in enumerate(new_sentences):
    # Get index of most similar original sentence
    most_similar_idx = np.argmax(similarity_to_original[i])
    print(f"\nNew: \"{new_sent}\"")
    print(f"Most similar to: \"{sentences[most_similar_idx]}\"")
    print(f"Similarity score: {similarity_to_original[i][most_similar_idx]:.4f}")
    print(f"Topic: {topics[most_similar_idx]}")
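
The lookup above relies on the shape of the cross-similarity matrix: one row per new sentence, one column per original sentence, with `argmax` picking the best column in each row. A small sketch with hypothetical 2-D vectors (invented here, not model output) makes that layout explicit:

```python
import numpy as np

# Hypothetical 2-D vectors standing in for real embeddings.
originals = np.array([
    [1.0, 0.0],    # index 0
    [0.0, 1.0],    # index 1
    [-1.0, 0.0],   # index 2
])
queries = np.array([
    [0.9, 0.1],    # points mostly along originals[0]
    [0.2, 0.8],    # points mostly along originals[1]
])

def normalize_rows(m):
    """Scale every row to unit length so dot products equal cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Row i holds the similarities of query i to every original vector.
sims = normalize_rows(queries) @ normalize_rows(originals).T

# Best-matching original for each query, taken row by row.
best = sims.argmax(axis=1)

print(sims.shape)  # (2, 3)
print(best)        # [0 1]
```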

## 7. Visualizing Original and New Sentences Together

In [None]:
# Combine original and new embeddings
all_embeddings = np.vstack([embeddings, new_embeddings])
all_topics = topics + ['AI/ML', 'Weather', 'Python']

# Project to 2D using PCA
pca = PCA(n_components=2)
all_embeddings_2d = pca.fit_transform(all_embeddings)

# Create visualization
plt.figure(figsize=(12, 8))

# Plot original sentences
for i in range(len(sentences)):
    x, y = all_embeddings_2d[i]
    plt.scatter(x, y, c=topic_colors[all_topics[i]], s=100, alpha=0.7, edgecolors='black')
    plt.annotate(f"S{i+1}", xy=(x, y), xytext=(5, 5), textcoords='offset points', fontsize=10)

# Plot new sentences with star markers
for i in range(len(sentences), len(sentences) + len(new_sentences)):
    x, y = all_embeddings_2d[i]
    plt.scatter(x, y, c=topic_colors[all_topics[i]], s=150, alpha=0.9, marker='*', edgecolors='black')
    plt.annotate(f"N{i-len(sentences)+1}", xy=(x, y), xytext=(5, 5), textcoords='offset points', fontsize=10, weight='bold')

# Add a legend
for topic, color in topic_colors.items():
    plt.scatter([], [], c=color, label=topic, s=100, alpha=0.7)
plt.scatter([], [], c='gray', marker='o', s=100, label='Original', alpha=0.7)
plt.scatter([], [], c='gray', marker='*', s=150, label='New', alpha=0.9)
plt.legend(loc='upper right')

plt.title('PCA Projection of Original and New Sentences', fontsize=15)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice how new sentences (stars) appear close to their semantically related original sentences.")

## 8. Real-World Applications of Embeddings

Embeddings power many modern AI applications including:

1. **Semantic Search**: Finding documents based on meaning rather than just keywords
2. **Document Clustering**: Automatically grouping similar documents
3. **Recommendation Systems**: Suggesting similar items based on semantic content
4. **Question Answering**: Finding relevant information to answer queries
5. **Retrieval Augmented Generation (RAG)**: Combining LLMs with knowledge bases using embeddings
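
As a rough illustration of the semantic search idea, the sketch below ranks a few documents by cosine similarity to a query and returns the top matches. The `top_k` helper and all vectors are hypothetical; in practice the vectors would come from an embedding model such as the one used in this notebook, and large collections would typically use a dedicated vector index.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                            # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]      # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical document vectors, invented for illustration.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([1.0, 0.2, 0.0])

results = top_k(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"doc {idx}: {score:.3f}")
```

The same ranking step underlies retrieval in RAG pipelines: the top-scoring passages are passed to the language model as context.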

## Conclusion

In this notebook, we've explored the fundamentals of text embeddings:
- How embeddings represent text as numerical vectors capturing semantic meaning
- Using cosine similarity to measure semantic relationships
- Visualizing high-dimensional embeddings in 2D space
- Testing embedding models with new sentences

Embeddings serve as the bridge between human language and machine understanding, forming the foundation of many modern NLP systems.