# Retrieval-Augmented Generation for Presidential Speeches using Groq API and LangChain

Retrieval-Augmented Generation (RAG) is a widely used technique for retrieving pertinent information from an external data source and providing it to a Large Language Model (LLM). It helps address two of the biggest limitations of LLMs: knowledge cutoffs, in which information after a certain date or from a specific source is not available to the LLM, and hallucination, in which the LLM makes up an answer to a question it doesn't have the information for. With RAG, we can give the LLM relevant information to answer the question at hand.

In this notebook we will be using [Groq API](https://console.groq.com), [LangChain](https://www.langchain.com/) and [Pinecone](https://www.pinecone.io/) to perform RAG on [presidential speech transcripts](https://millercenter.org/the-presidency/presidential-speeches) from the Miller Center at the University of Virginia. In doing so, we will split each speech into chunks, create vector embeddings for the chunks, store them in a vector database, retrieve the speech excerpts most relevant to the user prompt and include them in the context for the LLM.

### Setup

In [1]:
import pandas as pd
import numpy as np
from groq import Groq
import os
import pinecone

from langchain_community.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_pinecone import PineconeVectorStore
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

from IPython.display import display, HTML

A Groq API key is required for this demo; you can generate one for free [here](https://console.groq.com/keys). We will be using Pinecone as our vector database, which also requires an API key (the free Starter plan allows one index, which is enough for a small project). We will also show how the same steps work with [Chroma DB](https://www.trychroma.com/), a free, open source alternative that stores vector embeddings in memory. The LLM for this demo is the Llama3 8b model.

In [2]:
groq_api_key = os.getenv('GROQ_API_KEY')
pinecone_api_key = os.getenv('PINECONE_API_KEY')

client = Groq(api_key = groq_api_key)
model = "llama3-8b-8192"
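
Because `os.getenv` returns `None` when a variable is missing, a forgotten key only surfaces later as an authentication error. The sketch below is one possible guard, not part of the original notebook; the helper name `require_env` is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the keys could be loaded as `groq_api_key = require_env('GROQ_API_KEY')`, so a missing key stops the notebook at the setup cell instead of at the first API call.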

### RAG Basics with One Document

The presidential speeches we'll be using are stored in this [.csv file](https://github.com/groq/groq-api-cookbook/blob/main/tutorials/presidential-speeches-rag/presidential_speeches.csv). Each row of the .csv contains fields for the date, president, party, speech title, speech summary and speech transcript, and includes every recorded presidential speech through the Trump presidency:

In [2]:
presidential_speeches_df = pd.read_csv('presidential_speeches.csv')
presidential_speeches_df.head()

| Unnamed: 0 | Date | President | Party | Speech Title | Summary | Transcript | URL |
|---|---|---|---|---|---|---|---|
| 0 | 1789-04-30 | George Washington | Unaffiliated | First Inaugural Address | Washington calls on Congress to avoid local an... | Fellow Citizens of the Senate and the House of... | https://millercenter.org/the-presidency/presid... |
| 1 | 1789-10-03 | George Washington | Unaffiliated | Thanksgiving Proclamation | At the request of Congress, Washington establi... | Whereas it is the duty of all Nations to ackno... | https://millercenter.org/the-presidency/presid... |
| 2 | 1790-01-08 | George Washington | Unaffiliated | First Annual Message to Congress | In a wide ranging speech, President Washington... | Fellow Citizens of the Senate and House of Rep... | https://millercenter.org/the-presidency/presid... |
| 3 | 1790-12-08 | George Washington | Unaffiliated | Second Annual Message to Congress | Washington focuses on commerce in his second a... | Fellow citizens of the Senate and House of Rep... | https://millercenter.org/the-presidency/presid... |
| 4 | 1790-12-29 | George Washington | Unaffiliated | Talk to the Chiefs and Counselors of the Senec... | The President reassures the Seneca Nation that... | I the President of the United States, by my ow... | https://millercenter.org/the-presidency/presid... |


To get a better idea of the steps involved in building a RAG system, let's focus on a single speech to start. In honor of his [upcoming Netflix series](https://www.netflix.com/tudum/articles/death-by-lightning-tv-series-adaptation) and his distinction of being the only president to [contribute an original proof of the Pythagorean Theorem](https://maa.org/press/periodicals/convergence/mathematical-treasure-james-a-garfields-proof-of-the-pythagorean-theorem), we'll use James Garfield's Inaugural Address:

In [3]:
garfield_inaugural = presidential_speeches_df.iloc[309].Transcript
#display(HTML(garfield_inaugural)) 

A challenge with prompting LLMs can be running into the limits of their context window. While this speech is not extremely long and would actually fit in Llama3's context window, it is good practice to avoid using more of the context window than needed, so when using RAG we split up the text and provide only the relevant parts of it to the LLM. To do so, we first need to `tokenize` the transcript. We'll use the `sentence-transformers/all-MiniLM-L6-v2` tokenizer with the transformers `AutoTokenizer` class for this, which shows the number of tokens the model counts in Garfield's Inaugural Address. The tokenizer also warns that the full speech exceeds the model's maximum sequence length of 512 tokens, which is another reason to split it into chunks:

In [6]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# create the length function
def token_len(text):
    tokens = tokenizer.encode(
        text
    )
    return len(tokens)

token_len(garfield_inaugural)

Token indices sequence length is longer than the specified maximum sequence length for this model (3420 > 512). Running this sequence through the model will result in indexing errors


3420

Next, we'll split the text into chunks using LangChain's `TokenTextSplitter` class. In this example we will set the maximum tokens in a chunk to be 450, with a 20 token overlap to reduce the chances that a sentence or concept will be split into different chunks.

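Note that LangChain's `TokenTextSplitter` uses OpenAI's `tiktoken` tokenizer, which counts tokens a bit differently than our MiniLM tokenizer. For this speech, chunks of 450 `tiktoken` tokens come out at roughly 450 to 470 tokens when measured with `token_len`, which stays under 500.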

In [7]:
text_splitter = TokenTextSplitter(
    chunk_size=450, # Maximum tokens per chunk, as counted by tiktoken
    chunk_overlap=20 # Tokens shared between consecutive chunks (reduces the chance of cutting a sentence or idea in half)
)

chunks = text_splitter.split_text(garfield_inaugural)

for chunk in chunks:
    print(token_len(chunk))

453
455
467
457
457
455
461
368
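
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of sliding-window chunking over a list of token ids. It illustrates the general technique only and is not LangChain's actual implementation; the function name `split_with_overlap` and the toy token list are assumptions for illustration.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most `chunk_size` tokens,
    where each window repeats the last `chunk_overlap` tokens of the previous one.

    Hypothetical helper for illustration, not a LangChain API.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 1000 fake token ids instead of a real tokenized speech
toy_tokens = list(range(1000))
toy_chunks = split_with_overlap(toy_tokens, chunk_size=450, chunk_overlap=20)
print([len(c) for c in toy_chunks])  # [450, 450, 140]
```

Each window starts 430 tokens after the previous one, so the final 20 tokens of one chunk reappear at the start of the next.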


Next, we will embed each chunk into a semantic vector space using the all-MiniLM-L6-v2 model, through LangChain's implementation of Sentence Transformers from [HuggingFace](https://huggingface.co/sentence-transformers). Note that each embedding has a length of 384.

In [10]:
chunk_embeddings = []
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
for chunk in chunks:
    chunk_embeddings.append(embedding_function.embed_query(chunk))

print(len(chunk_embeddings[0]), chunk_embeddings[0][:20]) # Shows the embedding length and the first 20 of its 384 values

384 [-0.041311442852020264, 0.04761345684528351, 0.007975001819431782, -0.030207891017198563, 0.04763732850551605, 0.03253324702382088, 0.012350181117653847, -0.044836871325969696, -0.008013647049665451, 0.015704018995165825, -0.0009443548624403775, 0.11632765829563141, -0.007115611340850592, -0.03356580808758736, -0.043237943202257156, 0.06872360408306122, -0.04552490636706352, -0.07017458975315094, -0.10271692276000977, 0.11116139590740204]


Finally, we will embed our prompt and use cosine similarity to find the chunk most relevant to the question we'd like answered:

In [9]:
user_question = "What were James Garfield's views on civil service reform?"

In [10]:
prompt_embeddings = embedding_function.embed_query(user_question) 
similarities = cosine_similarity([prompt_embeddings], chunk_embeddings)[0] 
closest_similarity_index = np.argmax(similarities) 
most_relevant_chunk = chunks[closest_similarity_index]
display(HTML(most_relevant_chunk))
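
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores magnitude. The sketch below computes it in plain Python on toy two-dimensional vectors (real embeddings here have 384 dimensions); the helper names and toy values are illustrative assumptions, while the notebook itself uses scikit-learn's `cosine_similarity` and `np.argmax`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_index(query, candidates):
    """Index of the candidate vector with the highest cosine similarity to `query`."""
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors standing in for the prompt embedding and the chunk embeddings
toy_query = [1.0, 0.0]
toy_chunk_vectors = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
best = most_similar_index(toy_query, toy_chunk_vectors)
print(best)  # 1: the second vector points in nearly the same direction as the query
```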

Now, we can feed the most relevant speech excerpt into our chat completion model so that the LLM can use it to answer our question:

In [14]:
# A chat completion function that will use the most relevant excerpt(s) from presidential speeches to answer the user's question
def presidential_speech_chat_completion(client, model, user_question, relevant_excerpts):
    chat_completion = client.chat.completions.create(
        messages = [
            {
                "role": "system",
                "content": "You are a presidential historian. Given the user's question and relevant excerpts from presidential speeches, answer the question by including direct quotes from presidential speeches. When using a quote, cite the speech that it was from (ignoring the chunk)."
            },
            {
                "role": "user",
                "content": "User Question: " + user_question + "\n\nRelevant Speech Excerpt(s):\n\n" + relevant_excerpts,
            }
        ],
        model = model
    )
    
    response = chat_completion.choices[0].message.content
    return response


presidential_speech_chat_completion(client, model, user_question, most_relevant_chunk)

'James Garfield, in his inaugural address on March 4, 1881, briefly touched on the subject of civil service reform. He expressed his belief that the civil service could not be placed on a satisfactory basis until it was regulated by law. He also mentioned his intention to ask Congress to fix the tenure of minor offices and prescribe the grounds for removal during the terms for which incumbents had been appointed. He stated that this would be done to protect those with appointing power, incumbents, and to ensure honest and faithful service from executive officers. Garfield believed that offices were created for the service of the Government, not for the benefit of incumbents or their supporters.\n\nSource: Inaugural Address, March 4, 1881.'

### Using a Vector DB to store and retrieve embeddings for all speeches

Now, let's repeat the same process for every speech in our .csv using the same text splitter as above. Note that we will be converting our text to a `Document` object so that it integrates with the vector database, and also prepending the president, date and title to the speech transcript to provide more context to the LLM:

In [15]:
documents = []
for index, row in presidential_speeches_df[presidential_speeches_df['Transcript'].notnull()].iterrows():
    chunks = text_splitter.split_text(row.Transcript)
    total_chunks = len(chunks)
    for chunk_num in range(1,total_chunks+1):
        header = f"Date: {row['Date']}\nPresident: {row['President']}\nSpeech Title: {row['Speech Title']} (chunk {chunk_num} of {total_chunks})\n\n"
        chunk = chunks[chunk_num-1]
        documents.append(Document(page_content=header + chunk, metadata={"source": "local"}))

print(len(documents))

10698
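
The header-building step can also be written as a small function that numbers chunks with `enumerate(..., start=1)` instead of index arithmetic. The sketch below uses plain dictionaries as stand-ins for LangChain `Document` objects so that it runs without any dependencies. The function name `make_chunk_records` and the extra metadata fields are assumptions for illustration; the notebook itself stores only `{"source": "local"}`.

```python
def make_chunk_records(date, president, title, chunks):
    """Prepend a descriptive header to each chunk of one speech.

    Hypothetical helper: returns plain dicts shaped like the
    page_content/metadata pair passed to `Document` in the cell above.
    """
    total_chunks = len(chunks)
    records = []
    for chunk_num, chunk in enumerate(chunks, start=1):
        header = (
            f"Date: {date}\nPresident: {president}\n"
            f"Speech Title: {title} (chunk {chunk_num} of {total_chunks})\n\n"
        )
        records.append({
            "page_content": header + chunk,
            # Assumed metadata fields; useful if results need filtering or citing later
            "metadata": {"date": date, "president": president,
                         "title": title, "chunk": chunk_num},
        })
    return records


# Toy chunks in place of real speech text
toy_records = make_chunk_records(
    "1881-03-04", "James Garfield", "Inaugural Address", ["first part", "second part"]
)
print(toy_records[1]["page_content"])
```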


We will be using a Pinecone index called `presidential-speeches` for this demo. As mentioned above, you can sign up for Pinecone's Starter plan for free and have access to a single index, which is ideal for a small personal project. You can also use Chroma DB as an open source alternative. Note that either vector database will use the same embedding function we've defined above:

In [16]:
pinecone_index_name = "presidential-speeches"
docsearch = PineconeVectorStore.from_documents(documents, embedding_function, index_name=pinecone_index_name)

### Use Chroma for open source option
#docsearch = Chroma.from_documents(documents, embedding_function)


Fortunately, all of the manual work we did above to embed text and use cosine similarity to find the most relevant chunk is done under the hood when using a vector database. Now, we can ask our question again, over the entire corpus of presidential speeches.

In [17]:
user_question = "What were James Garfield's views on civil service reform?"

In [23]:
relevent_docs = docsearch.similarity_search(user_question)

# print results
#display(HTML(relevent_docs[0].page_content))
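
Conceptually, a similarity search embeds the query and returns the stored texts whose vectors score highest against it. The sketch below is a brute-force, in-memory version of that idea in plain Python. The class `TinyVectorStore` is hypothetical and is not how Pinecone or Chroma are implemented; production vector databases typically rely on approximate nearest-neighbor indexes rather than scanning every vector.

```python
import heapq
import math


class TinyVectorStore:
    """Hypothetical brute-force vector store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (unit_vector, text) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot use a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, text):
        # Store unit vectors so that a dot product equals cosine similarity
        self._items.append((self._normalize(vector), text))

    def similarity_search(self, query_vector, k=3):
        """Return up to k (score, text) pairs, highest cosine similarity first."""
        q = self._normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._items
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy vectors and labels in place of real embeddings and speech chunks
store = TinyVectorStore()
store.add([1.0, 0.0], "chunk A")
store.add([0.0, 1.0], "chunk B")
store.add([1.0, 1.0], "chunk C")
top_two = store.similarity_search([1.0, 0.1], k=2)
print([text for _, text in top_two])  # ['chunk A', 'chunk C']
```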

We will pass the three most relevant excerpts to the LLM in the user message, alongside the question. Note that even with nearly 1,000 speeches chunked and stored in our vector database, the similarity search still found the same excerpt as when we only parsed Garfield's Inaugural Address:

In [24]:
relevant_excerpts = '\n\n------------------------------------------------------\n\n'.join([doc.page_content for doc in relevent_docs[:3]])
display(HTML(relevant_excerpts.replace("\n", "<br>")))
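
Taking a fixed number of excerpts works here because every chunk has a bounded size. When chunk sizes vary, one option is to add excerpts in ranked order until a token budget is reached. The sketch below is a minimal, assumption-laden version: `join_within_budget`, its separator, and the whitespace-based token counter are all hypothetical, and a real tokenizer-based counter such as `token_len` from earlier could be passed in instead.

```python
def count_words(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def join_within_budget(excerpts, max_tokens, count_tokens=count_words,
                       separator="\n\n---\n\n"):
    """Join excerpts in ranked order, stopping before the budget is exceeded.

    Hypothetical helper for illustration; only excerpt tokens are counted,
    not the separator.
    """
    selected = []
    used = 0
    for excerpt in excerpts:
        cost = count_tokens(excerpt)
        if used + cost > max_tokens:
            break  # keep ranking order: stop at the first excerpt that does not fit
        selected.append(excerpt)
        used += cost
    return separator.join(selected)


# Toy excerpts of 3, 2 and 4 words with a budget of 5 words
toy_context = join_within_budget(["a b c", "d e", "f g h i"], max_tokens=5)
print(repr(toy_context))  # 'a b c\n\n---\n\nd e'
```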

In [21]:
presidential_speech_chat_completion(client, model, user_question, relevant_excerpts)

'James Garfield, in his Inaugural Address delivered on March 4, 1881, expressed his views on civil service reform. He believed that the civil service could not be placed on a satisfactory basis until it was regulated by law. He proposed to ask Congress to fix the tenure of the minor offices of the several Executive Departments and prescribe the grounds upon which removals shall be made during the terms for which incumbents have been appointed. He stated, "For the good of the service itself, for the protection of those who are intrusted with the appointing power against the waste of time and obstruction to the public business caused by the inordinate pressure for place, and for the protection of incumbents against intrigue and wrong, I shall at the proper time ask Congress to fix the tenure of the minor offices of the several Executive Departments and prescribe the grounds upon which removals shall be made during the terms for which incumbents have been appointed."\n\nHe also mentioned 

### Conclusion

In this notebook we've shown how to implement a RAG system using Groq API, LangChain and Pinecone by embedding, storing and searching over nearly 1,000 speeches from US presidents. By embedding speech transcripts into a vector database and retrieving excerpts with semantic search, we have demonstrated how to mitigate two of the most significant challenges faced by LLMs: knowledge cutoffs and hallucination.

You can interact with this RAG application here: https://presidential-speeches-rag.streamlit.app/