# RAG On JFK Speeches: Part 2


__[1. Introduction to RAG ](#first-bullet)__

__[2. Retrieving Documents With Vector (Semantic) Search](#second-bullet)__

__[3. Building A RAG Pipeline](#third-bullet)__

__[4. A CI/CD Pipeline For RAG](#fourth-bullet)__

__[5. Deploying And Monitoring A RAG Application](#fifth-bullet)__

__[6. Next Steps](#sixth-bullet)__



## 1. Introduction to RAG <a class="anchor" id="first-bullet"></a>
------------------------------

In this post, I will continue from where my [last post](http://michael-harmon.com/blog/ragjfk1.html) left off. 

In my last post I discussed how to ingest President Kennedy's speeches into a [Pinecone](https://www.pinecone.io/) vector database and perform semantic search using both Pinecone's API and [LangChain](https://www.langchain.com/). I used Pinecone as the vector database since it's cloud based, fully managed and, of course, has a free tier. In this post I will expand upon this work and build out a [Retrieval Augmented Generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) pipeline using LangChain to answer questions on President Kennedy's speeches. Finally, I'll deploy this as a [Streamlit](https://streamlit.io/) app for users to try out! As part of this last step I'll build out a [continuous integration/continuous deployment (CI/CD)](https://en.wikipedia.org/wiki/CI/CD) pipeline. This last component is tricky, as RAG systems, like any application that uses a [Large Language Model (LLM)](https://en.wikipedia.org/wiki/Large_language_model), are notorious for being difficult to test in a robust and reproducible fashion.

You may ask what's the point of RAG pipelines; don't LLMs know all the answers? The answer is that most LLMs take a long time to train and are often trained on data that is already out of date by the time people begin to use them. To incorporate more recent data into our LLM we could use fine-tuning, but this can still be time consuming and costly. The other option is to use RAG, which takes your original question, embeds it as a vector and "retrieves" documents from a vector database. These documents are the ones that are most semantically related to the question. The original question and the retrieved documents are passed into a prompt which is fed into the LLM. The prompt will contain your question and use the documents as "context" to generate an answer. The entire process is depicted below,


<figure>
    <img src="images/rag-pipeline.png" alt>
    <figcaption>Source: https://python.langchain.com/docs/tutorials/rag/</figcaption>
</figure>
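
To make the flow in the figure concrete, here is a minimal sketch in plain Python. Everything in it is a stand-in rather than LangChain or Pinecone code: `retrieve` ranks documents by simple word overlap instead of vector embeddings, and `fake_llm` just echoes back the context it was handed instead of calling a model. The three "speeches" are made-up one-liners.

```python
# Toy sketch of the retrieve -> prompt -> generate flow.
# All names here are hypothetical stand-ins, not LangChain or Pinecone APIs.

SPEECHES = [
    "Ich bin ein Berliner: remarks in West Berlin on freedom and the wall.",
    "We choose to go to the moon in this decade.",
    "Ask not what your country can do for you.",
]

PROMPT = "Question: {input}\nContext: {context}\nAnswer:"


def tokenize(text: str) -> set[str]:
    """Lowercase and split into a set of words, stripping basic punctuation."""
    return {word.strip(".,:?!").lower() for word in text.split()}


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Stand-in retriever: rank documents by words shared with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents, key=lambda d: len(q_words & tokenize(d)), reverse=True
    )
    return ranked[:k]


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM: returns the context portion of the prompt."""
    return prompt.split("Context: ")[1].split("\nAnswer:")[0]


def rag(question: str) -> str:
    """Retrieve context, fill the prompt, and pass it to the 'model'."""
    context = "\n\n".join(retrieve(question, SPEECHES))
    prompt = PROMPT.format(input=question, context=context)
    return fake_llm(prompt)


answer = rag("How did Kennedy feel about the Berlin wall?")
print(answer)
```

The real pipeline built in this post has the same three stages; only the retriever (Pinecone) and the generator (an OpenAI chat model) are swapped in for the toy functions.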


I'll note that building a RAG pipeline was actually much easier than I originally thought, which is a testament to the power and simplicity of the LangChain framework!

Let's get started! 

I'll start out with all the necessary imports:

In [35]:
# LangChain
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# Pinecone VectorDB
from pinecone import Pinecone
from pinecone import ServerlessSpec

import os

# API Keys
from dotenv import load_dotenv
load_dotenv()


True

## 2. Retrieving Documents With Vector (Semantic) Search <a class="anchor" id="second-bullet"></a>

First we'll go over retrieval with semantic search again. This is important since I'll discuss a more useful way to interact with the vector database, as a so-called "retriever", which will allow it to be used as part of a RAG pipeline.

The first thing I need to do is connect to the Pinecone database and make sure the index of vectors exists:

In [4]:
index_name = "prez-speeches"

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
pc.list_indexes()

{'indexes': [{'deletion_protection': 'disabled',
              'dimension': 1536,
              'host': 'prez-speeches-2307pwa.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'prez-speeches',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}}]}

Now that we have confirmed the index exists and is ready for querying, we can create the initial connection to the vector database using the LangChain [PineconeVectorStore](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html) class. Note that we have to pass the name of the index as well as the embeddings. It's important that we use the same embeddings here that we used to embed the documents in the associated index.

In [None]:
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

vectordb = PineconeVectorStore(
                    pinecone_api_key=os.getenv("PINECONE_API_KEY"),
                    embedding=embedding,
                    index_name=index_name
)
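
One cheap sanity check on the "same embeddings" requirement is to compare the index dimension with the embedding model's output size. The sketch below works on a plain dictionary shaped like the `list_indexes()` output above; `check_index_dimension` and the `EMBEDDING_DIMENSIONS` table are hypothetical helpers of mine, not part of Pinecone or LangChain. The sizes listed are the default output dimensions of those OpenAI models.

```python
# Hypothetical helper: verify that an index's dimension matches the embedding model.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_index_dimension(index_info: dict, model: str) -> bool:
    """Return True if the index dimension equals the model's default output size."""
    expected = EMBEDDING_DIMENSIONS[model]
    return index_info["dimension"] == expected


# Relevant fields copied from the index description shown earlier.
index_info = {"name": "prez-speeches", "dimension": 1536, "metric": "cosine"}

ok_ada = check_index_dimension(index_info, "text-embedding-ada-002")
ok_large = check_index_dimension(index_info, "text-embedding-3-large")
print(ok_ada, ok_large)
```

A matching dimension is necessary but not sufficient: two different models can share the same size, so the model name used at ingestion time still has to be tracked separately.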

Now we can perform vector similarity search using the [similarity search](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/similarity/) function in LangChain. Under the hood this creates a vector embedding of our query/question and finds the closest documents using the cosine similarity score between the query embedding vector and the document embedding vectors. The closest documents are found with a "nearest neighbors" search. This process is depicted in the image below,


<figure>
    <img src="images/vector-search.jpg" alt>
    <figcaption>Source: https://www.elastic.co/what-is/vector-search</figcaption>
</figure>
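
For intuition, the scoring step can be written out by hand. This is a brute-force sketch with made-up three-dimensional vectors, not what Pinecone runs internally; a managed vector database will typically use an approximate index rather than scanning every vector. The function names `cosine_similarity` and `top_k` are my own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query (brute force)."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; the index in this post uses 1536 dimensions.
doc_vecs = {
    "berlin-speech": [0.9, 0.1, 0.0],
    "moon-speech": [0.0, 0.2, 0.9],
    "inaugural": [0.5, 0.5, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```

Because cosine similarity only depends on the angle between vectors, documents are ranked by direction in embedding space rather than by vector length.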

The one thing to note is that I use the async similarity search for funsies and return the top 5 documents.

In [None]:
query = "How did President Kennedy feel about the Berlin Wall?"

results = await vectordb.asimilarity_search(query=query, k=5)

In [39]:
for document in results:
    print("Document ID:", document.id)

Document ID: 64fc63a1-79fd-4b40-bf8c-09f0617b9f0f
Document ID: 0fa5431f-a374-429e-a622-a1ed1c2b0a21
Document ID: 121366d4-9f46-4f52-8e56-2523bf1c9c8f
Document ID: 2da0bf3a-9adc-4dd0-a697-117bc3f0d8b9
Document ID: 4df626ad-0034-45cb-8144-88a21576785d
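
A note on the `await` used above: it works at the top level because a notebook already runs inside an event loop. In a plain Python script the call has to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that pattern, plus running several searches concurrently, which is where the async version pays off. `FakeVectorStore` is a hypothetical offline stand-in that only mimics the `asimilarity_search(query=..., k=...)` call shape; it does no real searching.

```python
import asyncio


class FakeVectorStore:
    """Hypothetical stand-in for the vector store, so the sketch runs offline."""

    def __init__(self, texts: list[str]):
        self.texts = texts

    async def asimilarity_search(self, query: str, k: int = 4) -> list[str]:
        # A real store would embed the query and call the database here.
        await asyncio.sleep(0)
        return self.texts[:k]


async def search_many(store: FakeVectorStore, queries: list[str], k: int) -> list[list[str]]:
    """Run several searches concurrently and collect the results in order."""
    tasks = [store.asimilarity_search(query=q, k=k) for q in queries]
    return await asyncio.gather(*tasks)


store = FakeVectorStore(["chunk-a", "chunk-b", "chunk-c"])
queries = ["Berlin Wall?", "Space program?"]

# In a plain script there is no running event loop, so start one explicitly.
all_results = asyncio.run(search_many(store, queries, k=2))
print(all_results)
```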


Now that we understand how to use the vector database to perform "retrieval" using similarity search, let's create a chain that will allow us to query the database and generate a response. This will form the basis of a so-called "RAG Pipeline."

## 3. Building A RAG Pipeline <a class="anchor" id="third-bullet"></a>
--------------------------------

Now we can use the vector database as a [retriever](https://python.langchain.com/docs/integrations/retrievers/). A retriever is a special LangChain [Runnable](https://python.langchain.com/api_reference/core/runnables.html) object that takes in a string (the query) and returns a list of [Documents](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html). This is depicted below,

<figure>
    <img src="images/retriever.png" alt>
    <figcaption>Source: https://python.langchain.com/docs/concepts/retrievers/</figcaption>
</figure>


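We can see this in action. Note that the retriever returns four documents by default; passing `search_kwargs={"k": 5}` to `as_retriever` would match the earlier search: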

In [41]:
retriever = vectordb.as_retriever()
print(type(retriever))

<class 'langchain_core.vectorstores.base.VectorStoreRetriever'>


In [44]:
documents = retriever.invoke(input=query)

for document in documents:
    print("Document ID:", document.id)

Document ID: 64fc63a1-79fd-4b40-bf8c-09f0617b9f0f
Document ID: 0fa5431f-a374-429e-a622-a1ed1c2b0a21
Document ID: 121366d4-9f46-4f52-8e56-2523bf1c9c8f
Document ID: 2da0bf3a-9adc-4dd0-a697-117bc3f0d8b9


Next let's talk about our prompt for RAG. I used the classic [rlm/rag-prompt](https://smith.langchain.com/hub/rlm/rag-prompt) from [LangSmith](https://www.langchain.com/langsmith). I couldn't use the original one as-is since the function [create_retrieval_chain](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval.create_retrieval_chain.html) expects the human input to be a variable named `input`, while the original prompt names it `question`. The whole prompt is,

In [45]:
from langchain.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {input} 
Context: {context} 
Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["input", "context"],
)

I can now give an example of how to use the prompt with the documents retrieved from Pinecone and the question from the user.

In [47]:
print(
    prompt.invoke({
        "input": query,
        "context": [document.id for document in documents]
    }).text
)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: How did President Kennedy feel about the Berlin Wall? 
Context: ['64fc63a1-79fd-4b40-bf8c-09f0617b9f0f', '0fa5431f-a374-429e-a622-a1ed1c2b0a21', '121366d4-9f46-4f52-8e56-2523bf1c9c8f', '2da0bf3a-9adc-4dd0-a697-117bc3f0d8b9'] 
Answer:



Note I only used the document ids as context, since printing the documents would be a lot of text for the screen; however, we would pass the actual documents to the LLM. We'll come back to this later.

Now we'll move on to create our LLM chat model, as this will be needed to write the response from the context and query into `Answer` above.

In [26]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

The LLM will be used as the generative part of the RAG pipeline in a function called [create_stuff_documents_chain](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html). We'll call this the `generate_chain`:

In [27]:
generate_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)

We can see what makes up this composite runnable and the components of the chain:

In [258]:
print(generate_chain)

bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {input} \nContext: {context} \nAnswer:\n")
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x168f11890>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x168a2e010>, root_client=<openai.OpenAI object at 0x169946590>, root_async_client=<openai.AsyncOpenAI object at 0x168f13310>, model_name='gpt-4o-mini', temperature=0.0, model_kwargs={}, openai_api_key=SecretStr('**********'))
| StrOutputParser() kwa
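
The `format_docs` step in the output above is where the "stuffing" happens: the retrieved documents are flattened into one string that fills the `{context}` slot of the prompt. The sketch below imitates that idea in plain Python. It is my approximation, not LangChain's implementation: `Doc` is a hypothetical stand-in for `Document`, the template is shortened, and I am assuming the page contents are joined with a blank line between them.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


# Shortened version of the prompt template used in this post.
TEMPLATE = "Question: {input} \nContext: {context} \nAnswer:\n"


def format_docs(docs: list[Doc], separator: str = "\n\n") -> str:
    """Join the text of each document into a single context string."""
    return separator.join(doc.page_content for doc in docs)


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Fill the template with the question and the stuffed context."""
    return TEMPLATE.format(input=question, context=format_docs(docs))


docs = [Doc("First retrieved chunk."), Doc("Second retrieved chunk.")]
stuffed_prompt = build_prompt("How did Kennedy feel about the Berlin Wall?", docs)
print(stuffed_prompt)
```

Since every retrieved document ends up in a single prompt, the number and size of the documents have to fit within the model's context window.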

Now we can call it using the `invoke` function and see the answer. We can see that the chain takes in the prompt, passes it to the LLM and then to the string output parser, so we expect to obtain a string as the return type.

In [48]:
answer = generate_chain.invoke(
    {
        "context": documents,
        "input": query,
    }
)

In [30]:
print(answer)

President Kennedy viewed the Berlin Wall as a significant symbol of the failures of the Communist system and an offense against humanity, as it separated families and divided people. He expressed pride in the resilience of West Berlin and emphasized the importance of freedom and the right to make choices. Kennedy's speeches reflected a commitment to supporting the people of Berlin and a broader struggle for freedom worldwide.


Now we can put this all together as a RAG chain by passing the Pinecone Vector database retriever and the generative chain. The retriever will take in the input question and perform similarity search and return the documents. These documents along with the input question will be passed to the `generate_chain` to return the output. The full RAG chain is below:

In [31]:
rag_chain = create_retrieval_chain(
                    retriever=retriever, 
                    combine_docs_chain=generate_chain)

Now we can see the prompts that make up the chain:

In [32]:
rag_chain.get_prompts()

[PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'),
 PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {input} \nContext: {context} \nAnswer:\n")]

Now we can test this out,

In [None]:
response = rag_chain.invoke({"input": query})

In [34]:
response

{'input': 'How did President Kennedy feel about the Berlin Wall?',
  Document(id='0fa5431f-a374-429e-a622-a1ed1c2b0a21', metadata={'filename': 'berlin-w-germany-rudolph-wilde-platz-19630626', 'seq_num': 1.0, 'source': 'gs://prezkennedyspeches/berlin-w-germany-rudolph-wilde-platz-19630626.json', 'title': 'Remarks of President John F. Kennedy at the Rudolph Wilde Platz, Berlin, June 26, 1963', 'url': 'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/berlin-w-germany-rudolph-wilde-platz-19630626'}, page_content='Listen to speech. \xa0\xa0 View related documents. \nPresident John F. Kennedy\nWest Berlin\nJune 26, 1963\n[This version is published in the Public Papers of the Presidents: John F. Kennedy, 1963. Both the text and the audio versions omit the words of the German translator. The audio file was edited by the White House Signal Agency (WHSA) shortly after the speech was recorded. The WHSA was charged with recording only the words of the President. The Ken

The response will be a dictionary that contains the input question and the answer generated by the model. It also includes the context, which is all the documents that were the most semantically related to the question and passed to the LLM to use.

We can see the data associated with the context reference documents:

In [None]:
references = [(doc.metadata["title"],
               doc.page_content, doc.metadata["url"]) 
               for doc in response['context']]

references

[('Radio and Television Report to the American People on the Berlin Crisis, July 25, 1961',
  'https://www.jfklibrary.org//archives/other-resources/john-f-kennedy-speeches/berlin-crisis-19610725'),
 ('Remarks of President John F. Kennedy at the Rudolph Wilde Platz, Berlin, June 26, 1963',
  'Listen to speech. \xa0\xa0 View related documents. \nPresident John F. Kennedy\nWest Berlin\nJune 26, 1963\n[This version is published in the Public Papers of the Presidents: John F. Kennedy, 1963. Both the text and the audio versions omit the words of the German translator. The audio file was edited by the White House Signal Agency (WHSA) shortly after the speech was recorded. The WHSA was charged with recording only the words of the President. The Kennedy Library has an audiotape of a network broadcast of the full speech, with the translator\'s words, and a journalist\'s commentary. Because of copyright restrictions, it is only available for listening at the Library.]\nI am proud to come to this 
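
When showing sources to a user, the full `page_content` is usually too long, and several retrieved documents may come from the same speech if the speeches were split into chunks before embedding (I am assuming that here, based on the `seq_num` metadata field). A sketch of collapsing the context into unique (title, url) pairs is below. `Doc` and `unique_sources` are hypothetical names, and the metadata values are placeholders rather than real results.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(docs: list[Doc]) -> list[tuple[str, str]]:
    """Return (title, url) pairs in retrieval order, dropping repeats."""
    seen = set()
    sources = []
    for doc in docs:
        key = (doc.metadata["title"], doc.metadata["url"])
        if key not in seen:
            seen.add(key)
            sources.append(key)
    return sources


# Placeholder metadata; the real values would come from response["context"].
context = [
    Doc("chunk 1", {"title": "Speech A", "url": "https://example.com/a"}),
    Doc("chunk 2", {"title": "Speech B", "url": "https://example.com/b"}),
    Doc("chunk 3", {"title": "Speech A", "url": "https://example.com/a"}),
]

sources = unique_sources(context)
print(sources)
```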

## 4. A CI/CD Pipeline For RAG <a class="anchor" id="fourth-bullet"></a>
-------------------

## 5. Deploying A RAG Application <a class="anchor" id="fifth-bullet"></a>
-------------------

## 6. Conclusions  <a class="anchor" id="sixth-bullet"></a>
-------------