### First, create documents using a text splitter and a document loader

In [1]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=10,
)
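To make `chunk_size` and `chunk_overlap` concrete, here is a simplified, pure-Python sketch of the split-then-merge idea: split on a separator, pack pieces into chunks of up to `chunk_size` characters, and carry trailing pieces of at most `chunk_overlap` characters into the next chunk. The function `merge_pieces` and the sample text are hypothetical illustrations, not LangChain's actual implementation, which handles more edge cases.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, sep="\n\n"):
    """Greedily pack pieces into chunks, repeating short trailing pieces as overlap."""
    chunks = []
    current = []  # pieces that make up the chunk being built
    for piece in pieces:
        candidate = sep.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The chunk is full: emit it before adding the new piece.
            chunks.append(sep.join(current))
            # Drop leading pieces until what remains fits in the overlap budget.
            while current and len(sep.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


# Hypothetical sample text, split on blank lines.
sample = "Title A\n\nBody of section A.\n\nTitle B\n\nBody of section B."
demo_chunks = merge_pieces(sample.split("\n\n"), chunk_size=40, chunk_overlap=10)
for chunk in demo_chunks:
    print(repr(chunk))
```

With an overlap of 10, only a short trailing piece (here the heading `Title B`) is repeated at the start of the next chunk. This is likely why the short heading `Tokenizer` appears both at the end of the first chunk and at the start of the second in the outputs further down.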

In [2]:

with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

In [3]:
file

'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language processing, parsing\n\nTokenizer\n\nDef

Split the text and load the chunks into `Document` objects.

In [4]:
chunks = text_splitter.create_documents([file])
chunks

[Document(metadata={}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language pro

In [5]:
for i, chunk in enumerate(chunks):
    chunk.metadata = {"doc_id": i}

chunks


[Document(metadata={'doc_id': 0}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural l

Other text splitters:
1. RecursiveCharacterTextSplitter
2. spaCy
3. NLTK

More: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/
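As a rough illustration of the idea behind recursive splitting, the sketch below tries the coarsest separator first and only falls back to finer separators for pieces that are still too long. The function `recursive_split` and its default separators are assumptions made for this example; the real LangChain splitter also merges small pieces back together and supports overlap, which this sketch omits.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


# Hypothetical sample: a heading, then two lines in one paragraph.
sample = "Embedding\n\nDefinition: converts text into a vector.\nExample: apple"
split_pieces = recursive_split(sample, chunk_size=45)
for piece in split_pieces:
    print(repr(piece))
```

The paragraph break is used first, and the newline is only used for the second paragraph because it is longer than `chunk_size`.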



# Embeddings

Embeddings convert complex data (text, images, or audio) into dense, high-dimensional vectors that capture the meaning of the information. As a result, items with similar meanings or features sit close together in the vector space. For instance, in a simple 3-dimensional space, consider the following embeddings:

boy = [1, 1, 1]
man = [1.1, 1, 1.4]

A related concept such as "king" shares some of the semantic characteristics of "boy" and "man," so its embedding might land nearby, for example:

king = [1, 1.5, 1.3]

This example illustrates how related concepts can be represented by vectors that are close together in space. Such embeddings are key to machine learning tasks like clustering, recommendation, and similarity search (e.g., using k-nearest neighbors (KNN) or approximate nearest neighbor (ANN) search), where the "distance" between vectors indicates how similar or related different pieces of data are.
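The notion of "distance" can be checked directly on the toy vectors above. The following minimal sketch computes Euclidean distances in plain Python; the `car` vector is a hypothetical unrelated concept added only for contrast and does not come from any real embedding model.

```python
import math

boy = [1, 1, 1]
man = [1.1, 1, 1.4]
king = [1, 1.5, 1.3]
car = [-1, 0.2, -0.8]  # hypothetical unrelated concept, for contrast only


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


for name, vec in [("man", man), ("king", king), ("car", car)]:
    print(f"boy -> {name}: {euclidean_distance(boy, vec):.2f}")
```

The related concepts ("man", "king") end up much closer to "boy" than the unrelated one, which is the property that similarity search relies on.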

## Watch these videos for context

RAG - https://www.youtube.com/watch?v=T-D1OfcDW1M

Vector Embeddings - https://www.youtube.com/watch?v=NEreO2zlXDk

RAG vs Fine Tuning - https://www.youtube.com/watch?v=00Q0G84kq3M

In [6]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


In [7]:
doc_1_embedded = embeddings.embed_query(chunks[0].page_content)

In [8]:
len(doc_1_embedded)


1536

# Cosine Similarity

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [10]:
query = "What is the vector store?"

query_embedding = embeddings.embed_query(query)

query_embedding


[-0.007016266696155071,
 0.045252200216054916,
 -0.05566782131791115,
 -0.0256311297416687,
 -0.0034350473433732986,
 0.0010062088258564472,
 0.0014149810886010528,
 0.007628150749951601,
 -0.043647706508636475,
 0.023877063766121864,
 0.01856047287583351,
 0.015514652244746685,
 -0.0077437288127839565,
 -0.06189544126391411,
 0.02114398218691349,
 -0.019580280408263206,
 0.007478578947484493,
 0.009830932132899761,
 -0.03260660544037819,
 -0.0077845207415521145,
 0.00978334154933691,
 0.010347633622586727,
 0.0191043708473444,
 -0.01182975247502327,
 -0.03274257853627205,
 -0.017418289557099342,
 -0.01648006960749626,
 0.041472118347883224,
 0.03978604078292847,
 -0.04288624972105026,
 -0.003282076446339488,
 -0.049005087465047836,
 0.029370419681072235,
 0.006193622946739197,
 0.02221817895770073,
 0.03521730750799179,
 0.010830341838300228,
 -0.035978764295578,
 0.02201421745121479,
 0.023278776556253433,
 0.003973844926804304,
 0.0036475069355219603,
 -0.001755766337737441,
 0.0115

In [11]:
documents = chunks

documents

[Document(metadata={'doc_id': 0}, page_content='Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural l

In [12]:
documents[0].page_content

'Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language processing, parsing\n\nTokenizer'

In [13]:
docs_page_content = [doc.page_content for doc in documents]
docs_page_content

['Semantic Search\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to understand and process the text.\nExample: Represent the word “apple” as a vector such as [0.65, -0.23, 0.17].\nRelated keywords: natural language processing, vectorization, deep learning\n\nToken\n\nDefinition: A token is a breakup of text into smaller units. These can typically be words, sentences, or phrases.\nExample: Split the sentence “I am going to school” into “I am”, “to school”, and “going”.\nAssociated keywords: tokenization, natural language processing, parsing\n\nTokenizer',
 'T

In [14]:
document_embeddings = embeddings.embed_documents(docs_page_content)

document_embeddings


[[-0.00721786031499505,
  0.023917170241475105,
  -0.041876401752233505,
  -0.027269845828413963,
  0.016133412718772888,
  -0.02361820638179779,
  -0.00110176473390311,
  0.01315444428473711,
  0.004903553985059261,
  0.03707161545753479,
  0.047065574675798416,
  0.00915579218417406,
  -0.027099009603261948,
  -0.05146462470293045,
  0.01467062160372734,
  0.019624892622232437,
  -0.004369688685983419,
  0.06406384706497192,
  0.03369758650660515,
  0.023575497791171074,
  0.03922843188047409,
  -0.00047380555770359933,
  -0.013143766671419144,
  0.05099482461810112,
  0.020425690338015556,
  -0.005472120363265276,
  0.023789042606949806,
  0.052703194320201874,
  0.02793183922767639,
  -0.026885462924838066,
  -0.008872843347489834,
  -0.022892149165272713,
  -0.002922913059592247,
  -0.019667601212859154,
  0.0247500017285347,
  0.02240099385380745,
  -0.036238785833120346,
  -0.02389581687748432,
  0.009604238905012608,
  0.018397001549601555,
  -0.009566868655383587,
  0.00438570

In [15]:
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]
similarity_scores

array([0.54053468, 0.55332491, 0.3039747 , 0.18454049, 0.23445207,
       0.3345916 , 0.37496678, 0.23042831, 0.1927184 , 0.18611709,
       0.15562117, 0.10683318, 0.10943522, 0.14540852])
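Each score above is the cosine of the angle between the query vector and one document vector: the dot product divided by the product of the two vector lengths. A minimal pure-Python sketch of that formula follows; the helper name `cosine_sim` and the 2-dimensional inputs are chosen for illustration only.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_sim([1, 0], [1, 0]))   # same direction
print(cosine_sim([1, 0], [0, 1]))   # orthogonal
print(cosine_sim([1, 0], [-1, 0]))  # opposite direction
```

Because the result depends only on direction, not on vector length, a higher score means the query and the document point in a more similar direction in embedding space.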

In [16]:
most_similar_index = np.argmax(similarity_scores)
most_similar_index


1

In [17]:
most_similar_document = documents[most_similar_index]

most_similar_document.page_content

'Tokenizer\n\nDefinition: A tokenizer is a tool that splits text data into tokens. It is used to preprocess data in natural language processing.\nExample: Split the sentence “I love programming.” into [“I”, “love”, “programming”, “.”].\nAssociated keywords: tokenization, natural language processing, parsing\n\nVectorStore\n\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\nExample: Vectors of word embeddings can be stored in a database for quick access.\nRelated keywords: embedding, database, vectorization, vectorization\n\nSQL\n\nDefinition: SQL(Structured Query Language) is a programming language for managing data in a database. You can query, modify, insert, delete, and more data.\nExample: SELECT * FROM users WHERE age > 18; looks up information about users who are 18 years old or older.\nAssociated keywords: database, query, data management, data management\n\nCSV'
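`np.argmax` returns only the single best match. Retrieval pipelines often keep the top few candidates instead. The sketch below ranks the similarity scores printed earlier and keeps the best three; the helper `top_k_indices` is a hypothetical name introduced for this example.

```python
# Similarity scores copied from the output of the cosine similarity cell above.
similarity_scores = [
    0.54053468, 0.55332491, 0.3039747, 0.18454049, 0.23445207,
    0.3345916, 0.37496678, 0.23042831, 0.1927184, 0.18611709,
    0.15562117, 0.10683318, 0.10943522, 0.14540852,
]


def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


top_3 = top_k_indices(similarity_scores, 3)
print(top_3)  # [1, 0, 6]
```

The first two chunks score almost identically (about 0.55 and 0.54), so returning several candidates avoids discarding a near tie.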

## Hugging Face Transformers
LangChain Hugging Face documentation - https://python.langchain.com/docs/integrations/providers/huggingface/

Model we are using - https://huggingface.co/sentence-transformers/all-mpnet-base-v2

All Hugging Face Models - https://huggingface.co/models



In [18]:
from langchain_huggingface import HuggingFaceEmbeddings

# Sentence-transformers model that produces 768-dimensional embeddings
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cpu"}  # run inference on the CPU
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)





In [20]:
documents = chunks
doc_embeddings = hf.embed_documents([doc.page_content for doc in documents])

In [21]:
len(doc_embeddings[0])

768
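The retrieval steps from the cosine similarity section apply unchanged to these embeddings, since both embedding objects expose `embed_query` and `embed_documents`. The sketch below writes the steps as one function that accepts any such object. To stay runnable without downloading a model, it is exercised with `ToyEmbedder`, a hypothetical keyword-counting stand-in; in the notebook, `hf` or `embeddings` would be passed instead.

```python
import math


class ToyEmbedder:
    """Hypothetical stand-in that counts a few keywords instead of running a model."""

    vocab = ["vector", "store", "token", "sql"]

    def _embed(self, text):
        words = text.lower().split()
        return [float(words.count(word)) for word in self.vocab]

    def embed_query(self, text):
        return self._embed(text)

    def embed_documents(self, texts):
        return [self._embed(text) for text in texts]


def cosine_sim(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(embedder, query, texts):
    """Embed the query and texts, then return (best index, all scores)."""
    query_vec = embedder.embed_query(query)
    doc_vecs = embedder.embed_documents(texts)
    scores = [cosine_sim(query_vec, vec) for vec in doc_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


texts = [
    "a token is a unit of text",
    "a vector store keeps each vector",
    "sql queries a database",
]
best, scores = most_similar(ToyEmbedder(), "what is the vector store", texts)
print(best, scores)
```

Scores from different embedding models are not directly comparable, and the query must be embedded with the same model as the documents, because the vectors have different dimensions (1536 versus 768 here).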