# **Demo: LangChain Loader, Splitter, Embeddings, and VectorStore**

# **Description:**
In this activity, you will use LangChain’s loaders, splitters, embeddings, and vector stores to load documents, split them into chunks, embed the chunks, and search them by similarity.
The two files used in the tutorial are practical examples of the kind of real-world data encountered in natural language processing tasks. They are:

*   The **state_of_union.txt** file, which contains transcripts of the United States’ State of the Union Addresses, represents a large text document that can be loaded and processed.

*   The **michael_resume.pdf** file, an open-source resume, represents a common type of document that one might analyze for tasks such as resume screening or information extraction.




# **Steps to Perform:**


1.   Import the Necessary Modules
2.   Load Text Data from a File Using TextLoader
3.   Load PDFs from the Internet Using PyPDFLoader
4.   Split the Documents Using RecursiveCharacterTextSplitter
5.   Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding
6.   Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding
7.   Create a FAISS Instance
8.   Perform a Similarity Search on the FAISS Instance
9.   Persist the FAISS Instance
10.  Load the Persisted FAISS Instance




# **Step 1: Import the Necessary Modules**







In [None]:
# !pip install langchain faiss-cpu sentence-transformers openai tiktoken
# !pip install pysqlite3
# !pip install pysqlite3-binary
# !pip install pypdf

In [4]:
from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.vectorstores import FAISS
import faiss
import pysqlite3
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

# **Step 2: Load Text Data from a File Using TextLoader**



*   Print the first 100 characters from the loaded text.



In [5]:
text_loader = TextLoader("../../Datasets/state_of_union.txt")
text_document = text_loader.load()
print(text_document[0].page_content[:100])  # text_document is a list of Documents, so slice the text of the first one

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th
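
A loader returns a list of document objects rather than a plain string, so slicing the list and slicing the text behave differently. The sketch below illustrates the difference with a hypothetical `Doc` dataclass that only mimics the `page_content` and `metadata` fields seen in the output above; it is not LangChain's own class.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# A text loader yields a list; a single plain-text file gives one element.
docs = [
    Doc(
        "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.",
        {"source": "state_of_union.txt"},
    )
]

# Slicing the list selects documents, not characters.
first_docs = docs[:100]

# Slicing page_content selects characters from the text itself.
first_chars = docs[0].page_content[:13]

print(len(first_docs))  # number of documents in the slice
print(first_chars)      # leading characters of the text
```

Here `docs[:100]` still holds a single document, while `docs[0].page_content[:13]` is the string `Madam Speaker`.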

# **Step 3: Load PDFs from the Internet Using PyPDFLoader**






In [6]:
from langchain.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("../../Datasets/michael_resume.pdf")
pdf_pages = pdf_loader.load_and_split()
print(pdf_pages[0])  # Prints the first Document (text and metadata) returned for the PDF


page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address:  Strome House     Date of Birth: 10.5.51 \n   North Strome     Place of Birth: Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work): 01520 722901     Website: www.mmscott.co.uk \nTelephone (home):  01520 722588    E-mail:  MSStrome@aol.com \n \nAwarded OBE in Queen’s Birthday Honours, June 2005, “for services to biodiversity conservation in \nScotland”. \nAwarded Planta Europa ‘Silver Lead’ Award in September 2007, “for excellent work in European wild plant \nconservation”. \n \nEducation \nPrimary education: George Heriots School, Edinburgh (1956-1962) \nSecondary education:  Madras College, St Andrews (1962-69). \nFurther education: University of Aberdeen (1969-1974): \n    Bachelor of Science (Honours; upper second) in Botany, 1973 \n    Diploma of Education, 1974 \n   Aberdeen College of Education (1973 - 1974): \n    Certificate o

# **Step 4: Split the Documents Using RecursiveCharacterTextSplitter**


*   Split the PDF pages into smaller chunks and print the number of chunks.



In [7]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(pdf_pages)
print(len(split_texts))  # Prints the number of chunks the PDF has been split into


15
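
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window chunker. It is a deliberate simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks, newlines, and spaces before falling back to raw characters, which this sketch ignores. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so content that straddles a boundary still appears whole in at least one chunk.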


# **Step 5: Embed the Documents Using HuggingFaceEmbeddings and Print the Length of the Embedding**






In [10]:
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the dimensionality of the embedding vector (768 for all-mpnet-base-v2)

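
The length printed in this step is the dimensionality of the embedding vector, which is fixed by the model and does not depend on how long the input text is. The toy example below illustrates that property with a hashed bag-of-words vector; `toy_embed` is a hypothetical helper written for illustration and has nothing to do with how the Hugging Face model computes its embeddings.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Map text to a fixed-length, L2-normalised vector (illustration only)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Hash each word to a stable bucket index in [0, dim).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


short_vec = toy_embed("botany")
long_vec = toy_embed("a much longer passage about botany and conservation in Scotland")

# Both vectors have the same length even though the inputs differ in size.
print(len(short_vec), len(long_vec))
```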

# **Step 6: Embed the Documents Using OpenAIEmbeddings and Print the Length of the Embedding**




In [None]:
openai_embed = OpenAIEmbeddings()
openai_embed_result = openai_embed.embed_documents([text])
print(len(openai_embed_result[0]))  # Prints the dimensionality of the embedding vector
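
`OpenAIEmbeddings` calls the OpenAI API, so this cell needs an API key; by default the key is read from the `OPENAI_API_KEY` environment variable. A small pre-flight check along the following lines can make a missing key easier to diagnose. The helper name `has_openai_key` is hypothetical.

```python
import os


def has_openai_key(env=None) -> bool:
    """Return True if a non-empty OPENAI_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if has_openai_key():
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is missing; set it before creating OpenAIEmbeddings.")
```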


# **Step 7: Create a FAISS Instance**

*   Create a FAISS instance using the split texts and the OpenAIEmbeddings.

In [None]:
# FAISS (Facebook AI Similarity Search) is a similarity-search library from Meta.
# Reference:
# https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

# Create a FAISS instance from the split documents and the OpenAI embeddings
faiss_index = FAISS.from_documents(split_texts, openai_embed)



# **Step 8: Perform a Similarity Search on the FAISS Instance**


*   Print the top two most similar documents.

In [None]:
# Perform a similarity search and print the top two most similar documents
search_result = faiss_index.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(search_result)  # Prints the top 2 most similar documents to the query


[(Document(page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]', metadata={'source': 'michael_resume.pdf', 'page': 3}), 0.45546138), (Document(page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \nTelepho
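
Each result above is a `(document, score)` pair. With LangChain's default FAISS configuration the score is, to the best of our understanding, an L2 distance, so a lower score means a closer match. The sketch below shows the idea with a brute-force search over hand-made two-dimensional vectors; the function names and the toy data are hypothetical, and a real FAISS index is far more efficient than this loop.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(index, query_vec, k=2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = ((text, squared_l2(vec, query_vec)) for text, vec in index)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy index: each entry pairs a chunk label with a made-up embedding.
index = [
    ("computer skills", [1.0, 0.0]),
    ("education", [0.0, 1.0]),
    ("other interests", [0.6, 0.8]),
]

results = search_with_score(index, [0.9, 0.1], k=2)
for text, distance in results:
    print(f"{text}: {distance:.2f}")
```

The chunk whose vector lies nearest to the query comes first, and its distance is the smallest of the returned scores.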

# **Step 9: Persist the FAISS Instance**


*   Save the FAISS instance to a folder in the current working directory.

In [None]:
# Save the FAISS index to a local folder named "faiss_index"
faiss_index.save_local("faiss_index")


# **Step 10: Load the Persisted FAISS Instance**




In [None]:
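# Load the persisted FAISS index from the "faiss_index" folder, using the same embeddings that built it
# Note: newer LangChain releases also require allow_dangerous_deserialization=True here
faiss_index_loaded = FAISS.load_local("faiss_index", openai_embed)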

# Perform a similarity search with the loaded FAISS index
vector_search_result = faiss_index_loaded.similarity_search_with_score("What is the candidate's skill sets?", k=2)
print(vector_search_result)  # Prints the top 2 most similar documents to the query from the loaded FAISS instance


[(Document(page_content='spring 2005, I went fully digital, and all photographs can be supplied in electronic format. \n \nComputer knowledge  \nI am reasonably fluent in basic PC computer skills, using Windows XP, Word, WordPro, Excel, PowerPoint, \nAdobe Photoshop Elements, e-mail, internet etc.  I have full computer and broadband facilities at home. \n \nOther interests  \nBotanising (especially mountain flowers), travel, walking, Scottish islands, gardening, photography, \ncomputers, rugby supporter, cinema, good wine, Runrig concerts (!). \n \n[updated, 26.03.08]', metadata={'source': 'michael_resume.pdf', 'page': 3}), 0.45546138), (Document(page_content='CURRICULUM VITAE :  \nM ichael M . Scott OBE, B.Sc., Dip.Ed  \n \nHome address :  Strome House     Date of Birth : 10.5.51 \n   North Strome     Place of Birth : Edinburgh \n   Lochcarron     Married to Sue Scott; 2 stepchildren \n   Ross-shire, IV54 8YJ \nTelephone (work) : 01520 722901     Website : www.mmscott.co.uk  \nTelepho
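
The loaded index returns the same documents and scores as the original because persisting stores the vectors and their documents on disk, and loading reads them back unchanged. The embedding model itself is not stored, which is why the same embeddings object is passed again when loading. The round trip can be sketched with a toy index saved as JSON in a temporary folder; the file name `index.json` and both helper functions are hypothetical and do not reflect the files FAISS actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_index(index, folder):
    """Write the toy index to <folder>/index.json, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "index.json").write_text(json.dumps(index), encoding="utf-8")


def load_index(folder):
    """Read the toy index back from <folder>/index.json."""
    return json.loads((Path(folder) / "index.json").read_text(encoding="utf-8"))


index = [
    {"text": "computer skills", "vector": [1.0, 0.0]},
    {"text": "other interests", "vector": [0.6, 0.8]},
]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "toy_index"
    save_index(index, folder)
    restored = load_index(folder)

# The restored index matches the original, so searches give the same results.
print(restored == index)
```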

# **Conclusion**

This activity provided a step-by-step guide on how to use LangChain’s loaders, splitters, embeddings, and vector stores. You now know how to load documents, split them into manageable chunks, embed them into a numerical space, and store these embeddings for efficient similarity searches.