# Add these to Requirements
+ faiss-cpu 
+ sentence-transformers 
+ langchain-huggingface


1) Ingestion: You read the file.

2) Splitting: You cut it into pieces.

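3) Embedding: The HuggingFaceEmbeddings model read your text and turned each chunk into a vector (a list of numbers).

4) Vector DB: FAISS indexed these vectors in a high-dimensional space (384 dimensions for all-MiniLM-L6-v2) so that the nearest ones can be found quickly.

5) Retrieval: When you asked about the "purpose," it didn't look for the word "purpose." It embedded your question and returned the chunks whose vectors were closest to it, so a paragraph about motivation or reasoning can match even if it never uses that word.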

`Load -> Split -> Embed -> Store -> Search`
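To make the "concept, not keyword" idea concrete, here is a minimal sketch of how closeness between vectors can be measured. The three-number vectors are hand-written toy values (an assumption for illustration, not real model output), and `cosine_similarity` is a helper defined here, not a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors (assumed values, NOT produced by a real embedding model).
toy_vectors = {
    "purpose":    [0.9, 0.8, 0.1],
    "motivation": [0.8, 0.9, 0.2],
    "banana":     [0.1, 0.0, 0.9],
}

query = toy_vectors["purpose"]
for word in ("motivation", "banana"):
    score = cosine_similarity(query, toy_vectors[word])
    print(word, round(score, 3))
```

In this toy setup "motivation" scores much higher than "banana" even though neither shares a letter pattern with "purpose"; a real embedding model plays the role of assigning such vectors automatically.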

In [39]:
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


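# Gemini (optional; not used in the cells below)
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# load_dotenv() above already copies GOOGLE_API_KEY from .env into os.environ.
# Assigning os.environ[...] = None raises a TypeError, so just warn if the key is missing.
if not os.getenv("GOOGLE_API_KEY"):
    print("Warning: GOOGLE_API_KEY is not set; Gemini embeddings will not work.")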


In [17]:
# --- STEP 1: Load & Split (Review) ---
# Load speech.txt, then split it into chunks of 100 characters with an overlap of 20

#Load
loader = TextLoader("./files/speech.txt")
docs = loader.load()
docs

#Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 20
)

chunks = splitter.split_documents(docs)
print(chunks[0].page_content)
print(chunks[1].page_content)
len(chunks)



The world must be made safe for democracy. Its peace must be planted upon the tested foundations of
foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no


49
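The repeated "foundations of" in the two chunks above is the overlap at work. As a rough illustration of what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is a stand-in written for this note: `RecursiveCharacterTextSplitter` actually splits on separators such as paragraphs, newlines, and spaces, which is why its chunks end on word boundaries, whereas this sketch cuts at fixed character positions. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "The world must be made safe for democracy."
pieces = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for piece in pieces:
    print(repr(piece))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 5 characters of one piece are the first 5 of the next. The overlap keeps a sentence that straddles a boundary partly visible in both chunks.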

In [16]:
# --- STEP 2: The Embedding Model (The Translator) ---
# We use a free, powerful model from HuggingFace

embedding = HuggingFaceEmbeddings(model="all-MiniLM-L6-v2")



In [18]:
# --- STEP 3: The Vector Store (The Database) ---
# This line does the heavy lifting:
# 1. Takes all chunks.
# 2. Converts them to numbers using the embedding model.
# 3. Stores them in a local FAISS index.

vector_db = FAISS.from_documents(chunks,embedding)
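Conceptually, a vector store keeps each chunk next to its vector and answers a query by finding the stored vectors closest to the query vector. The sketch below does this by brute force with Euclidean (L2) distance. `TinyFlatIndex` is a hypothetical class written for this note, and the two-number vectors are made-up values. To my understanding, LangChain's FAISS wrapper defaults to a flat L2 index that performs the same comparison far more efficiently, but treat that as an assumption to verify against your installed version.

```python
import math

class TinyFlatIndex:
    """Hypothetical brute-force store: compares the query against every stored vector."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        # Keep the text and its vector at the same position in both lists.
        self.texts.append(text)
        self.vectors.append(vector)

    def search(self, query_vector, k=2):
        # Euclidean distance to every stored vector; smaller means more similar.
        scored = [
            (math.dist(query_vector, vector), text)
            for text, vector in zip(self.texts, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0])
        return [(text, distance) for distance, text in scored[:k]]

index = TinyFlatIndex()
index.add("chunk about liberty", [0.9, 0.1])  # made-up vectors for illustration
index.add("chunk about war", [0.1, 0.9])
index.add("chunk about peace", [0.6, 0.4])

results = index.search([0.8, 0.2], k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

If you want the distances from the real store, the LangChain FAISS class also offers `similarity_search_with_score`, which returns each document together with its score.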



In [24]:
# --- STEP 4: The Search (The Test) ---
# Now we ask a question. The DB finds the most mathematically similar chunk.

query = "What is this speech about?"

# Search for the top 2 most relevant chunks

results = vector_db.similarity_search(query,k=2)

print(results[0].page_content)

It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus


In [25]:
# Let's experiment with chunk_size and chunk_overlap to find a good setting

# The loaded documents are still in docs
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 40,
)
chunks = splitter.split_documents(docs)

# Embedding
embeddings = HuggingFaceEmbeddings(model='all-MiniLM-L6-v2')

# Vectordb
vector_db = FAISS.from_documents(chunks,embeddings)

#Test and search
query = "What is the motive of Speech"
result = vector_db.similarity_search(query,k=2)

print(result[0].page_content)

Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our
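One way to compare settings more systematically is to loop over several `(chunk_size, chunk_overlap)` pairs and check something measurable for each. The sketch below checks whether a sentence you care about survives intact inside at least one chunk. It uses fixed character windows as a simplified stand-in for the real splitter, and `window_chunks`, the sample text, and the settings list are all assumptions chosen for illustration. In the notebook itself you would rebuild the chunks, the FAISS index, and the search for each pair.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows (a simplified stand-in for the real splitter)."""
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)  # window starts must fall before this position
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

text = (
    "The world must be made safe for democracy. "
    "Its peace must be planted upon the tested foundations of political liberty. "
    "We have no selfish ends to serve."
)
target = "We have no selfish ends to serve."

settings = [(20, 5), (50, 10), (200, 40)]  # (chunk_size, chunk_overlap) pairs to try
results = {}
for chunk_size, chunk_overlap in settings:
    chunks = window_chunks(text, chunk_size, chunk_overlap)
    # True if some single chunk contains the whole target sentence.
    kept_whole = any(target in chunk for chunk in chunks)
    results[(chunk_size, chunk_overlap)] = kept_whole
    print(f"size={chunk_size:>3} overlap={chunk_overlap:>2} "
          f"chunks={len(chunks):>2} target intact={kept_whole}")
```

Chunks shorter than the sentence can never hold it whole, which is one reason very small chunks tend to return fragments like the outputs above. Very large chunks keep sentences intact but make each result less focused, so the useful range usually lies in between.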
