# RAG Implementation Using LangChain

In this notebook we use LangChain and a store of academic papers to implement retrieval-augmented generation (RAG) with an OpenAI LLM.


### Packages

Run **pip install "unstructured[pdf]"** or **pip install "unstructured[md]"** to install the PDF and Markdown dependencies, respectively.

In [28]:
# pip install "unstructured[pdf]"
# pip install "unstructured[md]"

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation import load_evaluator

import openai 
from dotenv import load_dotenv
import os
import shutil
import chromadb
import nltk
nltk.download('punkt')

### Loading Data

Below we use **DirectoryLoader**, which loads every document in the specified directory that matches the glob pattern, even if the directory contains only one file.

In [30]:
data_path = "data/academic_papers_GNAR"

def load_documents():
    loader = DirectoryLoader(data_path, glob = "*.pdf")
    documents = loader.load()
    return documents

documents = load_documents()
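
To see what a pattern like `"*.pdf"` selects, here is a minimal sketch using only the standard library. It assumes that, when recursion is not enabled, the loader matches the pattern against the top level of the directory only, in the same way `pathlib.Path.glob` does. The file names are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with hypothetical files (names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "paper.pdf").touch()
    (root / "notes.md").touch()
    (root / "archive").mkdir()
    (root / "archive" / "old_paper.pdf").touch()

    # "*.pdf" matches PDFs that sit directly inside the directory
    top_level = sorted(p.name for p in root.glob("*.pdf"))
    # "**/*.pdf" also descends into subdirectories
    recursive = sorted(p.name for p in root.glob("**/*.pdf"))

print(top_level)
print(recursive)
```

If the papers were organised into subfolders, a recursive pattern would probably be needed to pick them all up.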

### Splitting Text

We now use **RecursiveCharacterTextSplitter** to split the content of our document into chunks.

**RecursiveCharacterTextSplitter** splits on the first separator in its list; any piece that is still larger than the specified chunk size is split again on the next separator, and so on.


A common hierarchy is: Chapters/Sections → Paragraphs → Sentences → Words
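
The idea of falling back to finer separators can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores chunk overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left to try: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too large: retry with the next, finer separator
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph is short.\n\nSecond paragraph is quite a bit longer than the limit."
chunks = recursive_split(text, ["\n\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits within the limit and is kept whole, while the second is too long and is split again on spaces.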


In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=['\n', '.', ' ', ''],
    chunk_size=500,
    chunk_overlap=100,
)

# Split the document using text_splitter
chunks = text_splitter.split_documents(documents)

# Printing number of original documents and number of chunks

print(f'Number of original document: {len(documents)} | Number of chunks; {len(chunks)}')

# Printing an example of chunk content and metadata

print("--- Example ---")
print(f'Content; {chunks[2].page_content}')
print(f'Metadata; {chunks[2].metadata}')

Number of original document: 1 | Number of chunks; 228
--- Example ---
Content; . The GNAR model relates values of a time series for a given variable and time to earlier values of the same variable and of neighboring variables, with inclusion controlled by the network structure. The GNAR package is designed to ﬁt this new model, while working with standard ‘ts’ objects and the igraph package for ease of use.
Metadata; {'source': 'data/academic_papers_GNAR/GNAR.pdf'}
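
The effect of `chunk_overlap` can be illustrated with a plain sliding window. This is only a sketch under simplifying assumptions: the real splitter aligns chunks to separators, whereas the hypothetical `sliding_chunks` below cuts at fixed character positions.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)
```

Each window repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk more often than it would without overlap.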


### Creating a Chroma Database

We use **OpenAIEmbeddings()** to embed our chunks.

**ChromaDB** stores the resulting vector embeddings.

Note that we specify a persistent directory, so creating the database writes it to local folders that can be used to reload the data later on.

This is useful because we might want to host the database in the cloud, and having it on disk makes it easier to deploy down the line.

In [10]:
# Load environment variables. Assumes that project contains .env file with API keys
load_dotenv()
#---- Set OpenAI API key 
# Change environment variable name from "OPENAI_API_KEY" to the name given in 
# your .env file.
openai_api_key = os.environ['OPENAI_API_KEY']  # reused by later cells
openai.api_key = openai_api_key

In [12]:
chroma_path = 'chroma_GNAR' 

def save_to_chroma(chunks):
    # Delete old database
    if os.path.exists(chroma_path):
        shutil.rmtree(chroma_path)
    
    # Creating a new DB from the documents
    
    db = Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory = chroma_path)

    # Database should save automatically but we can force save using persist

    db.persist()

    print(f'Saved {len(chunks)} chunks to {chroma_path}')
    return db


db = save_to_chroma(chunks)

Saved 228 chunks to chroma_GNAR


In [13]:
# Example of vector embedding in use
embedding_model =  OpenAIEmbeddings(openai_api_key = openai_api_key)
vector_embedding = embedding_model.embed_query("harry")
print(f'Length of vector embedding is {len(vector_embedding)}')
print("---")
print(f'The embedding itself {vector_embedding}')


Length of vector embedding is 1536
---
The embedding itself [-0.0259014293551445, -0.021373020485043526, 0.001033585867844522, -0.03539906442165375, -0.020291011780500412, ...]

The vector embeddings themselves are not interesting, but the distances between them are. We can look at the distance between a pair of embeddings using **load_evaluator**; a smaller score means the two texts are closer in meaning.

In [14]:
# Note the discrepancy in distance

distance_evaluator = load_evaluator("pairwise_embedding_distance")
print(distance_evaluator.evaluate_string_pairs(prediction = "cat", prediction_b = "dog"))
print(distance_evaluator.evaluate_string_pairs(prediction = "cat", prediction_b = "Nicaragua"))

{'score': 0.13717730032114772}
{'score': 0.2645261232263695}
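
To make the notion of distance concrete, here is a minimal sketch of cosine distance in plain Python. It assumes the evaluator's metric is cosine distance (which we believe is its default); the three-dimensional vectors are made-up toy values, not real embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (illustrative values only)
cat = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
country = [0.1, 0.2, 0.9]

d_close = cosine_distance(cat, dog)
d_far = cosine_distance(cat, country)
print(f"cat vs dog: {d_close:.3f} | cat vs country: {d_far:.3f}")
```

As in the output above, the pair with related meanings ends up with the smaller distance.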


The code below loads the database from the saved path.

In [15]:
# Recalling the db
db = Chroma(persist_directory= chroma_path, embedding_function= embedding_model)



### Searching the DB

When searching the DB, we specify how many of the chunks most similar to the query text should be retrieved (k); these chunks are what later augment the original query.

The search returns a list of tuples, each containing (document, relevance score).

In [22]:
# Now we can search the DB

query_text = "Who are the authors of the GNAR paper?"

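def search_db(query_text):
    
    results = db.similarity_search_with_relevance_scores(query_text, k = 3)
    
    # Return early if nothing was found or the best match is too weak
    if len(results) == 0 or results[0][1] < 0.7:
        print("Unable to find matches")
        return []
    return results

results = search_db(query_text)
# Each result is a (document, relevance score) tuple; print the top document
if results:
    print(results[0][0])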

page_content='JSS

Journal of Statistical Software

November 2020, Volume 96, Issue 5.

doi: 10.18637/jss.v096.i05

Generalized Network Autoregressive Processes and the GNAR Package

Marina Knight University of York

Kathryn Leeming University of Warwick

Guy Nason Imperial College London

Matthew Nunes University of Bath

Abstract' metadata={'source': 'data/academic_papers_GNAR/GNAR.pdf'}
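
The relevance check can also be applied to every result rather than only the best one. The sketch below is a variant of the check in `search_db`, not the notebook's method: `filter_by_relevance` is a hypothetical helper, the result tuples are invented, and it assumes relevance scores lie between 0 and 1 with higher meaning more relevant.

```python
def filter_by_relevance(results, threshold=0.7):
    """Keep only the (document, score) pairs whose score meets the threshold."""
    return [(doc, score) for doc, score in results if score >= threshold]

# Hypothetical search results: plain strings stand in for Document objects
toy_results = [
    ("title page chunk", 0.82),
    ("abstract chunk", 0.74),
    ("appendix chunk", 0.55),
]

kept = filter_by_relevance(toy_results)
print(kept)
```

Filtering per result keeps weakly related chunks out of the context even when the top match is strong.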


### Prompt Template

The **ChatPromptTemplate** is used here to structure the input for the language model in a consistent format, ensuring that both the context and the question are presented clearly.

This improves the model's ability to generate accurate and contextually relevant responses.

In [23]:
# Prompt Template
prompt_template =ChatPromptTemplate.from_template("""
Use the following piece of context to answer the question at the end.
If you don't know the answer, say that you don't know
Context: {context}
Question: {question}
""")
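
To see what the model receives once the placeholders are filled, here is a rough stand-in using Python's built-in `str.format`. It only mimics the substitution step; the context and question values are illustrative, and the real template produces chat messages rather than a bare string.

```python
template = """
Use the following piece of context to answer the question at the end.
If you don't know the answer, say that you don't know
Context: {context}
Question: {question}
"""

# Illustrative values; in the notebook the context comes from the retriever
filled = template.format(
    context="Generalized Network Autoregressive Processes and the GNAR Package",
    question="What is the title of the paper?",
)
print(filled)
```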

### Defining the LLM

We use the "gpt-4o-mini" model via **ChatOpenAI**.

In [24]:
# Defining an LLM 

llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ['OPENAI_API_KEY'], temperature=0)


### Retriever

This code converts the vector store into a retriever that finds the two most similar documents (k=2) based on the query, and then creates a retrieval chain to pass the retrieved context and question through a language model (LLM) for generating an answer.

It ensures the system retrieves relevant context before prompting the LLM, improving accuracy and relevance in answering the question.

In [25]:
# Convert the vector store into a retriever
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})


# Create the LCEL retrieval chain
chain = (
   {"context": retriever, "question": RunnablePassthrough()}
   | prompt_template
   | llm
   | StrOutputParser()
)


# Invoke the chain
print(chain.invoke("Who are the authors of the GNAR paper?"))

The authors of the GNAR paper are Marina Knight, Kathryn Leeming, Guy Nason, and Matthew Nunes.
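
The data flow through the chain can be sketched with ordinary functions. Everything below is a hypothetical stand-in (no LangChain or API calls): the dictionary step sends the same question to the retriever and passes it through unchanged, then the prompt, model and parser run in sequence.

```python
def fake_retriever(question):
    # Stand-in for the vector store retriever: returns the k=2 "most similar" chunks
    return ["chunk one text", "chunk two text"]

def format_prompt(inputs):
    # Join the retrieved chunks into one context string and add the question
    context = "\n\n".join(inputs["context"])
    return f"Context: {context}\nQuestion: {inputs['question']}"

def fake_llm(prompt):
    # Stand-in for the chat model: returns a message-like dict
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}

def parse_output(message):
    # Mirrors the role of the string output parser: keep only the text
    return message["content"]

def run_chain(question):
    inputs = {"context": fake_retriever(question), "question": question}
    return parse_output(fake_llm(format_prompt(inputs)))

question = "Who are the authors of the GNAR paper?"
answer = run_chain(question)
print(answer)
```

Joining the chunks into a single string before prompting, as `format_prompt` does here, is one way to control exactly how the retrieved text appears in the prompt.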
