Building a RAG System with LangChain and FAISS

In [1]:
import os
from dotenv import load_dotenv
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Langchain core imports
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.runnables import (
    RunnablePassthrough,
)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage

# Langchain specific imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

In [2]:
# Load environment variable
#load_dotenv()

Data Ingestion and Processing

In [3]:
sample_documents = [
    Document(
        page_content="""
        Artificial Intelligence (AI) refers to the simulation of human intelligence in machines
        that are programmed to think, reason, and make decisions. These systems can perform tasks
        such as problem-solving, perception, and language understanding. AI is often divided into
        two categories: narrow AI, which is specialized for specific tasks like facial recognition
        or recommendation systems, and general AI, which aims to replicate human-level intelligence
        across a wide range of activities. Modern AI applications include autonomous vehicles,
        medical diagnostics, and personalized digital assistants.
        """,
        metadata={"source": "AI Introduction", "page": 1, "topic": "AI"}
    ),
    Document(
        page_content="""
        Machine Learning (ML) is a core subset of AI that enables systems to improve performance
        by learning from data rather than relying on explicit programming. ML algorithms identify
        patterns and relationships within datasets to make predictions or decisions. The three
        primary types of ML are: supervised learning, where models are trained on labeled data;
        unsupervised learning, which uncovers hidden structures in unlabeled data; and reinforcement
        learning, where agents learn optimal actions through trial and error in dynamic environments.
        ML powers applications such as spam filtering, fraud detection, and recommendation engines.
        """,
        metadata={"source": "ML Basics", "page": 1, "topic": "ML"}
    ),
    Document(
        page_content="""
        Deep Learning is an advanced branch of machine learning built on artificial neural networks
        with multiple layers. These deep architectures allow systems to automatically extract
        increasingly abstract features from raw input data, such as pixels or audio signals.
        Deep learning has transformed fields like computer vision, enabling image classification
        and object detection; natural language processing, supporting translation and conversational
        AI; and speech recognition, powering virtual assistants. Breakthroughs in deep learning
        are largely driven by large datasets, powerful GPUs, and innovations in network design
        such as convolutional and recurrent neural networks.
        """,
        metadata={"source": "Deep Learning", "page": 1, "topic": "DL"}
    ),
    Document(
        page_content="""
        Natural Language Processing (NLP) is a specialized area of AI focused on enabling computers
        to understand, interpret, and generate human language. It integrates computational linguistics
        with machine learning and deep learning techniques to process text and speech. NLP applications
        are widespread: chatbots provide customer support, translation systems break down language
        barriers, sentiment analysis helps businesses gauge public opinion, and summarization tools
        condense large volumes of text. Recent advances in transformer-based models, such as BERT
        and GPT, have significantly improved NLP performance across diverse tasks.
        """,
        metadata={"source": "NLP Overview", "page": 1, "topic": "NLP"}
    )
]

print(sample_documents)

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='\n        Artificial Intelligence (AI) refers to the simulation of human intelligence in machines\n        that are programmed to think, reason, and make decisions. These systems can perform tasks\n        such as problem-solving, perception, and language understanding. AI is often divided into\n        two categories: narrow AI, which is specialized for specific tasks like facial recognition\n        or recommendation systems, and general AI, which aims to replicate human-level intelligence\n        across a wide range of activities. Modern AI applications include autonomous vehicles,\n        medical diagnostics, and personalized digital assistants.\n        '), Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='\n        Machine Learning (ML) is a core subset of AI that enables systems to improve performance\n        by learning from data rather than relying on e

In [4]:
# Perform text splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, # Adjusted chunk size to 500 for demonstration
    chunk_overlap=50, # Adjusted chunk overlap to 50 for demonstration
    length_function=len, # Using len function for length calculation
    separators=[' '] # Split only on spaces (i.e. between words) so no word is cut in half
)

# Split the sample documents into smaller chunks
chunks = text_splitter.split_documents(sample_documents)
print(f"Number of chunks created: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):  # Print the first 3 chunks for brevity
    print(f"Chunk {i+1} content: {chunk.page_content[:200]}...")  # Print first 200 characters of each chunk

Number of chunks created: 8
Chunk 1 content: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines
        that are programmed to think, reason, and make decisions. These systems can perform tasks
        such a...
Chunk 2 content: to replicate human-level intelligence
        across a wide range of activities. Modern AI applications include autonomous vehicles,
        medical diagnostics, and personalized digital assistants....
Chunk 3 content: Machine Learning (ML) is a core subset of AI that enables systems to improve performance
        by learning from data rather than relying on explicit programming. ML algorithms identify
        patte...


In [5]:
print(f'Created {len(chunks)} text chunks from sample documents.')
print(f'Content: {chunks[0].page_content}')
print(f'Metadata: {chunks[0].metadata}')

Created 8 text chunks from sample documents.
Content: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines
        that are programmed to think, reason, and make decisions. These systems can perform tasks
        such as problem-solving, perception, and language understanding. AI is often divided into
        two categories: narrow AI, which is specialized for specific tasks like facial recognition
        or recommendation systems, and general AI, which aims to replicate human-level intelligence
Metadata: {'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}
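
The effect of chunk_size and chunk_overlap can be seen on a tiny string. The sketch below is a simplified fixed-width sliding window, not the algorithm RecursiveCharacterTextSplitter uses (which splits on separators first); the helper name sliding_window_chunks and the demo string are hypothetical and only illustrate how consecutive chunks share their boundary characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the window already reached the end
            break
    return pieces

demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
for piece in demo_chunks:
    print(piece)
```

Each chunk starts with the last three characters of the previous one, which is the same idea that makes "gauge public opinion, and summarization" appear in two neighbouring chunks of the NLP document.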


In [6]:
# Load OPENAI_API_KEY from the .env file (load_dotenv was imported in the first cell)
load_dotenv()
openai_key = os.getenv('OPENAI_API_KEY')

In [7]:
# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(
    model='text-embedding-3-small',
    dimensions=1536,
    openai_api_key=openai_key
)

# Create an embedding for a single text
sample_text = 'What is Artificial Intelligence?'
sample_embedding_vector = embeddings.embed_query(sample_text)  # list of 1536 floats
sample_embedding_vector

[-0.005486255045980215,
 -0.013723005540668964,
 -0.04157176613807678,
 0.011208266951143742,
 0.02654227614402771,
 -0.03563855588436127,
 -0.025167029350996017,
 0.012652277015149593,
 -0.01823185198009014,
 0.014322220347821712,
 0.006581541616469622,
 -0.011394907720386982,
 -0.041139546781778336,
 -0.04978395998477936,
 0.003212185110896826,
 -0.0043516759760677814,
 0.003560908604413271,
 -0.007264253683388233,
 0.05308455228805542,
 -0.02872302569448948,
 0.02974463813006878,
 0.03241654857993126,
 -0.010825162753462791,
 -0.018290791660547256,
 0.02039295621216297,
 -0.0515914261341095,
 0.03288806229829788,
 -0.0029052102472633123,
 -0.042436208575963974,
 0.014970551244914532,
 0.009774080477654934,
 -0.020049143582582474,
 0.03137528896331787,
 -0.02436152659356594,
 0.015589412301778793,
 0.03062872588634491,
 -0.011385084129869938,
 0.0013494616141542792,
 0.0012377226958051324,
 0.02870338037610054,
 -0.006095293443650007,
 0.03447942063212395,
 0.013919468969106674,
 0.0

In [8]:
# Embed a batch of texts in a single call
texts = ['Machine learning', 'Deep learning', 'Web Development', 'Natural Language Processing', 'AI']
batch_embeddings_vectors = embeddings.embed_documents(texts)
batch_embeddings_vectors

[[-0.009305418469011784,
  -0.002394041046500206,
  -0.0008696239674463868,
  -0.02188332937657833,
  0.04221281409263611,
  0.010941664688289165,
  0.007651513908058405,
  0.03962307050824165,
  -0.01957610435783863,
  0.02253076620399952,
  -0.005779835861176252,
  -0.05033519119024277,
  -0.03187738358974457,
  -0.02021176926791668,
  0.013148832134902477,
  0.01846957765519619,
  -0.008145919069647789,
  -0.03507924824953079,
  0.02075326070189476,
  0.006162411533296108,
  0.016421325504779816,
  0.019823307171463966,
  0.03693915531039238,
  -0.023896267637610435,
  0.03799859434366226,
  -0.01524417009204626,
  0.03446712717413902,
  0.0312652625143528,
  0.008293064311146736,
  -0.028699062764644623,
  -0.02834591642022133,
  -0.018834495916962624,
  -0.04223635792732239,
  -0.0047939675860106945,
  0.03171258419752121,
  -0.013207689858973026,
  -0.008734497241675854,
  0.035526569932699203,
  -0.0486871711909771,
  0.016915731132030487,
  -0.024626104161143303,
  -0.027027502

In [9]:
print(batch_embeddings_vectors[0])

[-0.009305418469011784, -0.002394041046500206, -0.0008696239674463868, -0.02188332937657833, 0.04221281409263611, 0.010941664688289165, 0.007651513908058405, 0.03962307050824165, -0.01957610435783863, 0.02253076620399952, -0.005779835861176252, -0.05033519119024277, -0.03187738358974457, -0.02021176926791668, 0.013148832134902477, 0.01846957765519619, -0.008145919069647789, -0.03507924824953079, 0.02075326070189476, 0.006162411533296108, 0.016421325504779816, 0.019823307171463966, 0.03693915531039238, -0.023896267637610435, 0.03799859434366226, -0.01524417009204626, 0.03446712717413902, 0.0312652625143528, 0.008293064311146736, -0.028699062764644623, -0.02834591642022133, -0.018834495916962624, -0.04223635792732239, -0.0047939675860106945, 0.03171258419752121, -0.013207689858973026, -0.008734497241675854, 0.035526569932699203, -0.0486871711909771, 0.016915731132030487, -0.024626104161143303, -0.02702750265598297, 0.016409555450081825, 0.071759432554245, -0.015703260898590088, -0.024955

In [10]:
print(batch_embeddings_vectors[2])

[-0.042894601821899414, -0.02673066034913063, 0.02223195694386959, 0.010919814929366112, -0.0016853786073625088, -0.03028777241706848, 0.01891024224460125, 0.007310390938073397, -0.016163941472768784, 0.03379257768392563, 0.04137759655714035, -0.016373183578252792, -0.03342640399932861, -0.03198786452412605, 0.028273819014430046, 0.025383664295077324, -0.0016935521271079779, -0.015366205945611, 0.0031991133000701666, 0.02032262459397316, 0.03094165399670601, -0.03904977813363075, -0.030889343470335007, 0.002666200278326869, 0.028770769014954567, 0.04161299392580986, -0.023735884577035904, 0.06810825318098068, 0.011266371235251427, -0.025841381400823593, 0.009062792174518108, -0.03954672813415527, 0.023971281945705414, 0.021447300910949707, 0.018491758033633232, 0.012214499525725842, 0.037846639752388, 0.034864939749240875, 0.047131750732660294, 0.026809126138687134, -0.01888408698141575, 2.0229446818120778e-05, 0.00260244682431221, 0.0032710402738302946, 0.02313431352376938, 0.06852673

In [11]:
# Compare embeddings using cosine similarity

def compare_embeddings(vec1, vec2):
    """Compute cosine similarity between two embedding vectors."""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    cosine_similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    return cosine_similarity

# 'Machine learning' (index 0) vs 'Deep learning' (index 1)
similarity_ml_dl = compare_embeddings(batch_embeddings_vectors[0], batch_embeddings_vectors[1])
similarity_ml_dl

np.float64(0.7048622891303212)
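
To compare every pair in a batch at once, the same cosine formula can be vectorized. The following is a minimal sketch on hand-made 2-D vectors rather than real embeddings; the helper name cosine_similarity_matrix and the toy data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    mat = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine.
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return unit @ unit.T

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

Entry [i, j] is the cosine similarity between vectors i and j, so the diagonal is always 1. Passing batch_embeddings_vectors instead of the toy data should give a 5 x 5 matrix for the five texts above.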

In [12]:
def compare_embeddings(text1: str, text2: str):
    """Embed two texts and compute the cosine similarity of their embeddings.

    Note: this redefines the vector-based compare_embeddings from the previous cell.
    """
    emb1 = np.array(embeddings.embed_query(text1))
    emb2 = np.array(embeddings.embed_query(text2))

    # Calculate the similarity score
    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return similarity

In [13]:
# Test semantic similarity
print("\nSemantic Similarity Examples:")
print(f"'AI' vs 'Artificial Intelligence': {compare_embeddings('AI', 'Artificial Intelligence'):.3f}")


Semantic Similarity Examples:
'AI' vs 'Artificial Intelligence': 0.563


In [14]:
# Test semantic similarity
print("\nSemantic Similarity Examples:")
print(f"'Machine learning' vs 'Deep learning': {compare_embeddings('Machine learning', 'Deep learning'):.3f}")


Semantic Similarity Examples:
'Machine learning' vs 'Deep learning': 0.705


In [15]:
# Test semantic similarity
print("\nSemantic Similarity Examples:")
print(f"'Machine learning' vs 'Natural Language Processing': {compare_embeddings('Machine learning', 'Natural Language Processing'):.3f}")


Semantic Similarity Examples:
'Machine learning' vs 'Natural Language Processing': 0.419


In [16]:
# Test semantic similarity
print("\nSemantic Similarity Examples:")
print(f"'Machine learning' vs 'Web development': {compare_embeddings('Machine learning', 'Web development'):.3f}")


Semantic Similarity Examples:
'Machine learning' vs 'Web development': 0.299


Create FAISS Vector Store

In [17]:
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
print(f'Vector store created with {vectorstore.index.ntotal} vectors')

Vector store created with 8 vectors
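
Conceptually, an exact (flat) index compares the query vector against every stored vector and keeps the closest ones. The sketch below shows that idea in plain NumPy under the assumption that closeness is measured by squared L2 distance; it is not FAISS's implementation, and brute_force_search and the toy vectors are hypothetical names and data.

```python
import numpy as np

def brute_force_search(index_vectors, query_vector, k=3):
    """Return (row index, squared L2 distance) for the k stored vectors nearest the query."""
    index = np.asarray(index_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Squared Euclidean distance from the query to every stored vector.
    distances = np.sum((index - query) ** 2, axis=1)
    order = np.argsort(distances)[:k]  # smallest distance first
    return [(int(i), float(distances[i])) for i in order]

toy_index = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
hits = brute_force_search(toy_index, [0.9, 0.1], k=2)
print(hits)
```

With only 8 vectors an exhaustive scan like this is cheap; approximate index types matter once the collection grows large.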


In [18]:
vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x22ddd671a90>

In [19]:
# Save vector store 

vectorstore.save_local('faiss_index')

In [20]:
# Load vector store

loaded_vectorstore = FAISS.load_local(
    'faiss_index',
    embeddings,
    allow_dangerous_deserialization=True
)
print(f'Loaded vector store contains {loaded_vectorstore.index.ntotal} vectors')

Loaded vector store contains 8 vectors


In [21]:
# Perform similarity search

query = 'What is NLP?'

result = vectorstore.similarity_search(query, k=3)
print(result)

[Document(id='7c013963-50ac-49a8-94f5-d3aa29254cd7', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='Natural Language Processing (NLP) is a specialized area of AI focused on enabling computers\n        to understand, interpret, and generate human language. It integrates computational linguistics\n        with machine learning and deep learning techniques to process text and speech. NLP applications\n        are widespread: chatbots provide customer support, translation systems break down language\n        barriers, sentiment analysis helps businesses gauge public opinion, and summarization'), Document(id='ae6fee1d-891e-48ae-83c5-9c31c8c8d929', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='gauge public opinion, and summarization tools\n        condense large volumes of text. Recent advances in transformer-based models, such as BERT\n        and GPT, have significantly improved NLP performance across diverse tasks.'), Document

In [22]:
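# Similarity search with scores
# Note: with the default FAISS index the score is an L2 distance, so lower means more similar

result_with_score = vectorstore.similarity_search_with_score(query, k=3)

print('\n\nSimilarity search with scores: ')

for doc, score in result_with_score:
    print(f'\nScore: {score:.3f}')
    # Double quotes outside so the nested single quotes also work before Python 3.12
    print(f"Source: {doc.metadata['source']}")
    print(f'Content preview: {doc.page_content}')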



Similarity search with scores: 

Score: 0.521
Source: NLP Overview
Content preview: Natural Language Processing (NLP) is a specialized area of AI focused on enabling computers
        to understand, interpret, and generate human language. It integrates computational linguistics
        with machine learning and deep learning techniques to process text and speech. NLP applications
        are widespread: chatbots provide customer support, translation systems break down language
        barriers, sentiment analysis helps businesses gauge public opinion, and summarization

Score: 1.070
Source: NLP Overview
Content preview: gauge public opinion, and summarization tools
        condense large volumes of text. Recent advances in transformer-based models, such as BERT
        and GPT, have significantly improved NLP performance across diverse tasks.

Score: 1.179
Source: Deep Learning
Content preview: Deep Learning is an advanced branch of machine learning built on artificial neural network
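
The scores above grow as the match gets worse, which is how a distance behaves rather than a similarity. Assuming the embeddings are unit length and the score is a squared L2 distance (both are assumptions here, not something the output confirms), the two are linked by distance = 2 - 2 * cosine. The sketch below checks that identity on random unit vectors; squared_l2_to_cosine is a hypothetical helper.

```python
import numpy as np

def squared_l2_to_cosine(squared_distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to a cosine similarity."""
    return 1.0 - squared_distance / 2.0

# Verify the identity on two random unit vectors (toy data, not real embeddings).
rng = np.random.default_rng(0)
u = rng.normal(size=16)
v = rng.normal(size=16)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

squared_distance = float(np.sum((u - v) ** 2))
cosine = float(u @ v)
print(np.isclose(squared_l2_to_cosine(squared_distance), cosine))
```

Under those assumptions, the top score of 0.521 would correspond to a cosine similarity of roughly 0.74.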

In [25]:
# Search with metadata filtering

filter_dict = {'topic': 'NLP'}

filtered_results = vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)

print(filtered_results)
print(len(filtered_results))

[Document(id='7c013963-50ac-49a8-94f5-d3aa29254cd7', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='Natural Language Processing (NLP) is a specialized area of AI focused on enabling computers\n        to understand, interpret, and generate human language. It integrates computational linguistics\n        with machine learning and deep learning techniques to process text and speech. NLP applications\n        are widespread: chatbots provide customer support, translation systems break down language\n        barriers, sentiment analysis helps businesses gauge public opinion, and summarization'), Document(id='ae6fee1d-891e-48ae-83c5-9c31c8c8d929', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='gauge public opinion, and summarization tools\n        condense large volumes of text. Recent advances in transformer-based models, such as BERT\n        and GPT, have significantly improved NLP performance across diverse tasks.')]
2
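
Only two results come back even though k=3, because only two chunks carry topic 'NLP'. One simple way to picture metadata filtering is as a post-filter over the ranked hits: keep the hits whose metadata matches every key in the filter, then truncate to k. The sketch below illustrates that idea with plain dictionaries; it is an assumption about the general approach rather than a copy of the library's code, and post_filter and the toy hits are hypothetical.

```python
def post_filter(ranked_hits, metadata_filter, k):
    """Keep hits whose metadata matches every filter key, preserving rank order."""
    kept = [
        hit for hit in ranked_hits
        if all(hit["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return kept[:k]

# Toy hits, already sorted from best to worst match.
ranked_hits = [
    {"id": "nlp-1", "metadata": {"topic": "NLP"}},
    {"id": "nlp-2", "metadata": {"topic": "NLP"}},
    {"id": "dl-1", "metadata": {"topic": "DL"}},
    {"id": "ml-1", "metadata": {"topic": "ML"}},
]
filtered = post_filter(ranked_hits, {"topic": "NLP"}, k=3)
print([hit["id"] for hit in filtered])
```

A consequence of post-filtering is that fewer than k results can be returned when few candidates match, as in the cell above.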


Build RAG Chain with LCEL