#### Embedding Techniques
##### Converting Texts into Vectors

The demo how to use OpenAI embeddings to convert text into vector representations and store them in a vector database for similarity search.

---

### Steps Covered:

1. **Load Environment Variables**:
   - Use the `dotenv` library to load API keys and other configurations from a `.env` file.

2. **Set Up OpenAI Embeddings**:
   - Use the `OpenAIEmbeddings` class to generate embeddings for text.

3. **Load Text Documents**:
   - Load text files (e.g., `speech.txt`) using `TextLoader`.

4. **Clean and Preprocess Text**:
   - Remove special characters, normalize whitespace, and split text into manageable chunks.

5. **Store Embeddings in a Vector Database**:
   - Use `Chroma` to store embeddings in a persistent vector database.

6. **Perform Similarity Search**:
   - Query the vector database to find the most similar text documents to a given query.

---

### Example Code Snippets:

#### Load Environment Variables
```python
import os
from dotenv import load_dotenv
load_dotenv()  # Load all the environment variables from the .env file

In [None]:
import os
from dotenv import load_dotenv
load_dotenv() #Load all the environment variables from the .env file

True

In [3]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

In [6]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x0000020DD2975CD0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x0000020DD2A75400>, model='text-embedding-3-small', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [13]:
text = "This is a demo on OPENAI embeddings."
# Get the embedding for the text
# embed_query is the function to get the embedding for a text, it returns a list of floats, those are vctors that represent the text
embeddings = embeddings.embed_query(text)
# we can also add dimesnions to the embedding, to get all the word vectors
# query_result = embeddings.embed_query(text, dimensions=1024)
embeddings

[0.012693448923528194,
 0.011115566827356815,
 0.034912850707769394,
 -0.017937416210770607,
 -0.00047182501293718815,
 -0.027334321290254593,
 0.003214422380551696,
 0.027897432446479797,
 -0.0004384636413305998,
 0.004114812705665827,
 0.051196128129959106,
 -0.04716050252318382,
 -0.01925133913755417,
 -0.0443449504673481,
 0.015462075360119343,
 -0.013819670304656029,
 -0.010370617732405663,
 -0.008282416500151157,
 -0.017526814714074135,
 0.016635222360491753,
 0.01509840041399002,
 -0.03650832921266556,
 0.011620019562542439,
 0.05147768557071686,
 0.010223974473774433,
 -0.03789264336228371,
 0.0008094713557511568,
 0.03791610524058342,
 0.010358886793255806,
 -0.015426880680024624,
 0.00095978076569736,
 -0.026841599494218826,
 0.020764699205756187,
 -0.0177379809319973,
 0.005551917478442192,
 0.04870905727148056,
 -0.010710830800235271,
 0.0140894940122962,
 -0.044485729187726974,
 0.01722179539501667,
 0.00661068269982934,
 -0.0029094042256474495,
 -0.005581246223300695,
 0.

In [15]:
len(embeddings)

1536

#### Lets load a txt file sppech.txt and transform the loaded text document to vectors and store it in vector store DB

In [10]:
from langchain_community.document_loaders import TextLoader
text_loader = TextLoader("speech.txt")
text_document = text_loader.load()
text_document

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be

In [None]:
# Get the embedding for the text document and store it in VectorStore DB
from langchain_community.vectorstores import Chroma

In [None]:

from langchain_openai import OpenAIEmbeddings

# create the vector store for the text document, these are the vectors that represent the text document and stored in the chromadb
db = Chroma.from_documents(documents=text_document, embedding=embeddings, persist_directory="chroma_db")

db

<langchain_community.vectorstores.chroma.Chroma at 0x20df4080860>

In [None]:
# Query the vector store for the most similar text document to the query text
query_text = "What is the main topic of the speech?"

# similarity_search is the function to get the most similar text document to the query text, it returns a list of Document objects, 
# each object has a page_content and metadata attributes

# the k parameter is the number of most similar text documents to return, in this case we are getting only one document, 
# by default it returns 4 documents and we can change it to any number we want, recommended is 4 or 5 documents
# the similarity_search function uses the cosine similarity to find the most similar text document to the query text
query_result = db.similarity_search(query_text, k=4)
query_result[0].page_content

'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people

In [30]:
from langchain_community.document_loaders import TextLoader
text_loader = TextLoader("about.txt")
documents = text_loader.load()
documents

[Document(metadata={'source': 'about.txt'}, page_content='**A Love Story from Darbhanga: Ravi & Sweeti – A Journey Made in Heaven**\n\nIn the heart of Bihar, nestled within the cultural town of Darbhanga, destiny brought together two beautiful souls—Ravi and Sweeti. Their story wasn’t just one of ordinary romance; it was woven with threads of deep connection, shared dreams, and a love that felt timeless.\n\nTheir journey began in the quiet charm of their hometown, where glances turned into conversations, and conversations blossomed into a bond that neither time nor distance could shake. On **28th June**, they sealed this bond with an engagement, and by **7th December 2020**, Ravi and Sweeti were tied together in the sacred knot of marriage.\n\nFrom the very beginning, it was clear—they were made for each other.\n\n---\n\n### Chapter 1: A Love That Travels\n\nTo celebrate their **first anniversary in 2021**, Ravi whisked Sweeti away to the misty hills of **Sikkim and Darjeeling**. The c

In [32]:
import re
cleaned_documents = []
for doc in documents:
    # Remove markdown symbols, punctuation, and extra whitespace
    cleaned_text = re.sub(r"[^\w\s]", "", doc.page_content)  # Removes special characters
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()  # Normalizes whitespace
    cleaned_documents.append(cleaned_text)
cleaned_documents

['A Love Story from Darbhanga Ravi Sweeti A Journey Made in Heaven In the heart of Bihar nestled within the cultural town of Darbhanga destiny brought together two beautiful soulsRavi and Sweeti Their story wasnt just one of ordinary romance it was woven with threads of deep connection shared dreams and a love that felt timeless Their journey began in the quiet charm of their hometown where glances turned into conversations and conversations blossomed into a bond that neither time nor distance could shake On 28th June they sealed this bond with an engagement and by 7th December 2020 Ravi and Sweeti were tied together in the sacred knot of marriage From the very beginning it was clearthey were made for each other Chapter 1 A Love That Travels To celebrate their first anniversary in 2021 Ravi whisked Sweeti away to the misty hills of Sikkim and Darjeeling The cool air the tea gardens and the charm of the toy train enchanted Sweeti She stood on a quiet ridge in Darjeeling watching the clo

In [47]:
#split the text document into chunks of 1000 characters, with an overlap of 200 characters, this is to avoid the text document to be too long for the embedding model
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = text_splitter.create_documents(cleaned_documents)
split_docs

[Document(metadata={}, page_content='A Love Story from Darbhanga Ravi Sweeti A Journey Made in Heaven In the heart of Bihar nestled within the cultural town of Darbhanga destiny brought together two beautiful soulsRavi and Sweeti Their story wasnt just one of ordinary romance it was woven with threads of deep connection shared dreams and a love that felt timeless Their journey began in the quiet charm of their hometown where glances turned into conversations and conversations blossomed into a bond that neither time nor distance'),
 Document(metadata={}, page_content='turned into conversations and conversations blossomed into a bond that neither time nor distance could shake On 28th June they sealed this bond with an engagement and by 7th December 2020 Ravi and Sweeti were tied together in the sacred knot of marriage From the very beginning it was clearthey were made for each other Chapter 1 A Love That Travels To celebrate their first anniversary in 2021 Ravi whisked Sweeti away to the

In [48]:

from langchain_openai import OpenAIEmbeddings

# create the vector store for the text document, these are the vectors that represent the text document and stored in the chromadb
db = Chroma.from_documents(documents=text_document, embedding=embeddings, persist_directory="chroma_db")

db

<langchain_community.vectorstores.chroma.Chroma at 0x20df948eea0>

In [50]:
query_text = "tell about Ravi and Sweeti"
query_result = db.similarity_search(query_text, k=1)
print(query_result[0].page_content)

**A Love Story from Darbhanga: Ravi & Sweeti – A Journey Made in Heaven**

In the heart of Bihar, nestled within the cultural town of Darbhanga, destiny brought together two beautiful souls—Ravi and Sweeti. Their story wasn’t just one of ordinary romance; it was woven with threads of deep connection, shared dreams, and a love that felt timeless.

Their journey began in the quiet charm of their hometown, where glances turned into conversations, and conversations blossomed into a bond that neither time nor distance could shake. On **28th June**, they sealed this bond with an engagement, and by **7th December 2020**, Ravi and Sweeti were tied together in the sacred knot of marriage.

From the very beginning, it was clear—they were made for each other.

---

### Chapter 1: A Love That Travels
