## Text Embedding

Readings:

https://python.langchain.com/docs/how_to/embed_text/

In [57]:
# Install the package if it is not already installed
# !pip install -qU langchain-google-genai

gemini-embedding-001 is a sentence embedding model designed for semantic tasks such as search, clustering, and classification.
Given a string, it returns a single vector that represents the meaning of the entire input text.

In contrast, when embedding happens inside an LLM, the string is broken down into tokens and a contextual embedding is generated for each token.
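To make the difference in output shape concrete, here is a toy sketch with made-up numbers. It uses mean pooling, one common way to collapse per-token vectors into a single vector. This is an assumption chosen for illustration only and is not a description of how gemini-embedding-001 works internally; `token_vectors` and `sentence_vector` are hypothetical names.

```python
import numpy as np

# Made-up token vectors for a 3-token input, 4 dimensions each (illustrative only).
token_vectors = np.array([
    [0.2, 0.1, 0.0, 0.5],
    [0.4, 0.3, 0.2, 0.1],
    [0.0, 0.2, 0.4, 0.3],
])

# Token-level view: one vector per token.
print(token_vectors.shape)

# Sentence-level view: mean pooling (an assumed technique) averages the
# token vectors into one vector for the whole input.
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector.shape)
```

The token-level array has one row per token, while the pooled result is a single vector with the same number of dimensions.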

In [58]:
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings_model = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

#### Embed Documents

In [59]:
document_embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "විශ්වය අසීමිත විශාලත්වයකි",
        "Plate tectonics drive continental drift.",
        "光合作用是植物的能量來源.",
        "asdlkjqweoiu ### @@ 123456 !!! zzzz"
    ]
)
len(document_embeddings), len(document_embeddings[0])

(6, 3072)

In [60]:
document_embeddings[0][:5] # Check first 5 elements of first row

[-0.020293528214097023,
 0.00774721521884203,
 0.0071714180521667,
 -0.0875835195183754,
 -0.015736732631921768]

#### Embed a Query

In [61]:
query_embeddings = embeddings_model.embed_query("Hi there?")
len(query_embeddings)

3072

In [62]:
query_embeddings[:5]  # Check first 5 elements

[-0.02153073623776436,
 0.011656977236270905,
 0.008365157060325146,
 -0.08781148493289948,
 -0.014667927287518978]

#### Measure Similarity

- Cosine Similarity: Measures the cosine of the angle between two vectors, ignoring their magnitudes.
- Euclidean Distance: Measures the straight-line distance between two points.
- Dot Product: Sums the element-wise products of two vectors; it equals the cosine similarity scaled by the magnitudes of both vectors.
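The three measures can be compared side by side on a pair of small, hand-picked vectors. This is a minimal sketch: the vectors `a` and `b` are made up for illustration and are not model outputs. It also checks how the measures relate once the vectors are scaled to unit length.

```python
import numpy as np

# Two hand-picked 2-D vectors (illustrative only, not real embeddings).
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = np.dot(a, b)                                    # sum of element-wise products
euclid = np.linalg.norm(a - b)                        # straight-line distance
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine of the angle

print("Dot product:", dot)
print("Euclidean distance:", euclid)
print("Cosine similarity:", cos)

# After scaling both vectors to unit length, the dot product equals the
# cosine similarity, and the squared distance equals 2 - 2 * cos.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = np.dot(a_unit, b_unit)
dist_unit_sq = np.linalg.norm(a_unit - b_unit) ** 2
print(np.isclose(dot_unit, cos), np.isclose(dist_unit_sq, 2 - 2 * cos))
```

For unit-length vectors, ranking by any of the three measures gives the same order, which is why cosine similarity alone is often enough.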

In [63]:
import numpy as np

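def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two vectors."""
    dot_product = np.dot(vec1, vec2)
    # Dividing by both norms removes the effect of vector magnitude.
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)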

In [64]:
# Compare the query embedding against each document embedding in order
for document_embedding in document_embeddings:
    similarity = cosine_similarity(query_embeddings, document_embedding)
    print("Cosine Similarity:", similarity)

Cosine Similarity: 0.9634565297958197
Cosine Similarity: 0.9410521678397361
Cosine Similarity: 0.7341226126548602
Cosine Similarity: 0.7520968708894665
Cosine Similarity: 0.7097672991439838
Cosine Similarity: 0.7678982878401283
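The scores are easier to read when paired with their texts and sorted. The sketch below copies the six scores printed above into an array rather than calling the API again; `texts`, `scores`, and `order` are names introduced here for illustration.

```python
import numpy as np

# The six input texts, in the order they were embedded.
texts = [
    "Hi there!",
    "Oh, hello!",
    "විශ්වය අසීමිත විශාලත්වයකි",
    "Plate tectonics drive continental drift.",
    "光合作用是植物的能量來源.",
    "asdlkjqweoiu ### @@ 123456 !!! zzzz",
]

# Cosine similarities to the query "Hi there?", copied from the output above.
scores = np.array([
    0.9634565297958197,
    0.9410521678397361,
    0.7341226126548602,
    0.7520968708894665,
    0.7097672991439838,
    0.7678982878401283,
])

# argsort is ascending, so reverse it to rank from most to least similar.
order = np.argsort(scores)[::-1]
for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {scores[idx]:.4f}  {texts[idx]}")
```

The two greetings rank first, as expected. In this run, the unrelated strings, including the gibberish one, still score above 0.7, which suggests that the relative ranking is more informative than the absolute value of a score.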
