<a href="https://colab.research.google.com/github/paviayyala/AIML-Lab/blob/main/Embeddings_VectorSrch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install sentence-transformers
#!pip install -q sentence-transformers

# Import libraries
from sentence_transformers import SentenceTransformer
import numpy as np

# Sample unstructured data (customer support tickets)
tickets = [
    "My laptop won't turn on after the latest update.",
    "The screen is blank and the device is unresponsive.",
    "I need help with a software crash on my computer.",
    "My laptop won't switched on after the latest update.",
    "My phone battery drains quickly after updating."
]

# Load SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert text to embeddings
embeddings = model.encode(tickets, convert_to_numpy=True)

# Display embeddings
for i, ticket in enumerate(tickets):
    print(f"\nTicket {i+1}: {ticket}")
    print(f"Embedding (first 5 dimensions): {embeddings[i][:5]}")
    print(f"Embedding shape: {embeddings[i].shape}")

# Optional: Example of full embedding for first ticket (truncated)
print(f"\nFull embedding for first ticket (truncated): {embeddings[0][:10]}...")


Ticket 1: My laptop won't turn on after the latest update.
Embedding (first 5 dimensions): [-0.04204226 -0.01871645  0.01252084 -0.01471921  0.06840078]
Embedding shape: (384,)

Ticket 2: The screen is blank and the device is unresponsive.
Embedding (first 5 dimensions): [ 0.05260247 -0.01585972  0.01697287 -0.0415807   0.03419019]
Embedding shape: (384,)

Ticket 3: I need help with a software crash on my computer.
Embedding (first 5 dimensions): [ 0.00935809  0.01494381  0.00097969 -0.07353229  0.00652451]
Embedding shape: (384,)

Ticket 4: My laptop won't switched on after the latest update.
Embedding (first 5 dimensions): [-0.01262457 -0.06487854  0.00390242  0.00583799  0.07987364]
Embedding shape: (384,)

Ticket 5: My phone battery drains quickly after updating.
Embedding (first 5 dimensions): [-0.01475925  0.07787177  0.00535659  0.01208504  0.02393223]
Embedding shape: (384,)

Full embedding for first ticket (truncated): [-0.04204226 -0.01871645  0.01252084 -0.01471921  0.06840

In [3]:
# Install sentence-transformers
#!pip install -q sentence-transformers

# Import libraries
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Sample unstructured data (customer support tickets)
tickets = [
    "My laptop won't turn on after the latest update.",  # Ticket 1
    "The screen is blank and the device is unresponsive.",  # Ticket 2
    "I need help with a software crash on my computer.",  # Ticket 3
    "My laptop won't switched on after the latest update."  # Ticket 4
]

# Load SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert text to embeddings
embeddings = model.encode(tickets, convert_to_numpy=True)

# Compute pairwise cosine similarities
cosine_similarities = util.cos_sim(embeddings, embeddings)

# Display embeddings and similarities
for i, ticket in enumerate(tickets):
    print(f"\nTicket {i+1}: {ticket}")
    print(f"Embedding (first 5 dimensions): {embeddings[i][:5]}")
    print(f"Embedding shape: {embeddings[i].shape}")

# Display cosine similarity matrix
print("\nCosine Similarity Matrix:")
for i in range(len(tickets)):
    for j in range(len(tickets)):
        print(f"Similarity between Ticket {i+1} and Ticket {j+1}: {cosine_similarities[i][j]:.4f}")


Ticket 1: My laptop won't turn on after the latest update.
Embedding (first 5 dimensions): [-0.04204226 -0.01871645  0.01252084 -0.01471921  0.06840078]
Embedding shape: (384,)

Ticket 2: The screen is blank and the device is unresponsive.
Embedding (first 5 dimensions): [ 0.05260247 -0.01585972  0.01697287 -0.0415807   0.03419019]
Embedding shape: (384,)

Ticket 3: I need help with a software crash on my computer.
Embedding (first 5 dimensions): [ 0.00935809  0.01494381  0.00097969 -0.07353229  0.00652451]
Embedding shape: (384,)

Ticket 4: My laptop won't switched on after the latest update.
Embedding (first 5 dimensions): [-0.01262457 -0.06487854  0.00390242  0.00583799  0.07987364]
Embedding shape: (384,)

Cosine Similarity Matrix:
Similarity between Ticket 1 and Ticket 1: 1.0000
Similarity between Ticket 1 and Ticket 2: 0.3046
Similarity between Ticket 1 and Ticket 3: 0.2036
Similarity between Ticket 1 and Ticket 4: 0.8498
Similarity between Ticket 2 and Ticket 1: 0.3046
Similari

You’re correct that Ticket 1 ("My laptop won't turn on after the latest update.") and Ticket 4 ("My laptop won't switched on after the latest update.") are semantically very similar, yet their embeddings look different in the first five dimensions shown. Below I explain why this happens, what it means for Enterprise Vector Search, and how to verify the similarity using the full embeddings, with a Colab-compatible code example that computes the similarity between the two tickets’ embeddings.
Why Embeddings Appear Different

High-Dimensional Embeddings: The all-MiniLM-L6-v2 model from the sentence-transformers (SBERT) library generates 384-dimensional embeddings, but the output shows only the first five dimensions. Those five values cannot capture the semantic content of the whole vector. The similarity between two sentences is determined by comparing all 384 dimensions, typically with a metric like cosine similarity.
Semantic Encoding: SBERT encodes the meaning of the entire sentence into the full 384-dimensional vector. Small differences in wording (e.g., “turn on” vs. “switched on”) can change individual dimensions, but the overall vector direction, which is what cosine similarity compares, stays close for semantically similar sentences.
Model Sensitivity: Ticket 1 and Ticket 4 differ only in “turn on” vs. “switched on”. SBERT does register that difference in phrasing, so the two embeddings are not identical, but for such similar sentences the full embeddings should still lie close together in vector space.
Truncated View: The first five dimensions are only 5 of 384 values and may not reflect the overall similarity. Over the full vectors, the cosine similarity matrix above gives 0.8498 for Ticket 1 and Ticket 4, compared with 0.3046 for Ticket 1 and Ticket 2 and 0.2036 for Ticket 1 and Ticket 3.
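
The point about individual dimensions versus overall direction can be illustrated without loading the model. The sketch below uses synthetic 384-dimensional vectors (random data standing in for embeddings, not real SBERT output): a base vector, a slightly perturbed copy, and an unrelated vector. The noise scale of 0.3 and the random seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for embeddings (assumption: random data, not SBERT output)
rng = np.random.default_rng(0)
base = rng.normal(size=384)
perturbed = base + 0.3 * rng.normal(size=384)  # same direction plus small noise
unrelated = rng.normal(size=384)               # independent random direction

# Individual coordinates of the perturbed copy differ from the base...
print("base[:5]     :", np.round(base[:5], 3))
print("perturbed[:5]:", np.round(perturbed[:5], 3))

# ...but the similarity over all 384 dimensions stays high.
sim_close = cosine_similarity(base, perturbed)
sim_far = cosine_similarity(base, unrelated)
print(f"base vs perturbed: {sim_close:.4f}")
print(f"base vs unrelated: {sim_far:.4f}")
```

Comparing a handful of coordinates by eye says little here; the score computed over the whole vector is what separates the perturbed copy from the unrelated one.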

Confirming Similarity
To verify that Ticket 1 and Ticket 4 are semantically similar, we can compute the cosine similarity between their full embeddings. Cosine similarity is the cosine of the angle between two vectors, so a value close to 1 means the vectors point in nearly the same direction and indicates high similarity. In Enterprise Vector Search, a vector database (e.g., Milvus, Pinecone) typically identifies similar items by ranking stored vectors with a metric like this one.
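
A minimal sketch of that computation, written with NumPy only so it runs without downloading a model. The function names `cosine_matrix` and `most_similar` are illustrative and not part of sentence-transformers, and the 3-dimensional `toy` vectors are made-up stand-ins for the 384-dimensional embeddings. The brute-force ranking only shows the idea behind similarity search; production vector databases generally rely on approximate indexes rather than scanning every vector.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale each row to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def most_similar(sim, index, top_k=1):
    """(row, score) pairs most similar to row `index`, excluding the row itself."""
    scores = sim[index].copy()
    scores[index] = -np.inf  # never return the query as its own neighbor
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors (assumption: made-up values, not model output)
toy = np.array([
    [1.0, 0.2, 0.0],  # item 0
    [0.0, 1.0, 0.3],  # item 1
    [0.1, 0.0, 1.0],  # item 2
    [0.9, 0.3, 0.1],  # item 3, points almost the same way as item 0
])

sim = cosine_matrix(toy)
print(f"item 0 vs item 3: {sim[0, 3]:.4f}")
best = most_similar(sim, index=0, top_k=1)
print("closest to item 0:", best)
```

In the notebook, `cosine_matrix(embeddings)` should closely match `util.cos_sim(embeddings, embeddings)` from the cell above, including the 0.8498 reported for Ticket 1 and Ticket 4, up to floating-point rounding.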