# 🔍 Implementing Retrieval-Augmented Generation (RAG) with Groq API
Welcome to this step-by-step Colab notebook, where we implement Retrieval-Augmented Generation (RAG) using the Groq API and an LLM.

In this project, we explore how RAG can boost the performance and factual accuracy of Large Language Models by grounding their responses in a custom dataset—in this case, US presidential speeches.

We demonstrate this with a real-world example: answering a historical question about President James Garfield’s views on civil service reform.

# 🚀 What You'll Learn
What LLMs, Groq, and RAG are, and why they matter.

How to use a Groq-hosted Llama 3 model (`llama3-8b-8192`) via the API.

How to split and embed text using LangChain and Hugging Face tools.

How to find the most relevant context using cosine similarity.

How to query an LLM with grounded excerpts to avoid hallucinations.

# 📦 Tools & Libraries Used
groq (LLM API)

langchain, sentence-transformers, tiktoken

huggingface_hub, transformers

pandas, numpy, sklearn

# 🧠 Problem Statement

By default, LLMs may hallucinate or generate non-verifiable content. This notebook shows how RAG helps fix this by:

Asking an LLM a historical question without external grounding (high chance of hallucination).

Then feeding it retrieved, relevant excerpts from presidential speeches using semantic search.

Comparing the accuracy and specificity of the response with and without RAG.
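
The comparison described above comes down to sending the same question with two different prompts. The sketch below only assembles the `messages` list that a chat completion call would receive, so it runs without an API key. The helper name `build_messages` and the baseline system prompt are assumptions made for illustration; they are not part of the Groq SDK or of this notebook's later cells.

```python
def build_messages(user_question, relevant_excerpts=None):
    """Assemble chat messages with or without retrieved context (hypothetical helper)."""
    if relevant_excerpts is None:
        # Baseline: the model must answer from what it memorized during training.
        system = "You are a presidential historian. Answer the user's question."
        user = "User Question: " + user_question
    else:
        # RAG: the model is asked to ground its answer in the supplied excerpts.
        system = (
            "You are a presidential historian. Answer the user's question "
            "using direct quotes from the provided speech excerpts."
        )
        user = (
            "User Question: " + user_question
            + "\n\nRelevant Speech Excerpt(s):\n\n" + relevant_excerpts
        )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


question = "What were James Garfield's views on civil service reform?"
baseline = build_messages(question)
grounded = build_messages(
    question,
    "The civil service can never be placed on a satisfactory basis "
    "until it is regulated by law.",
)
```

Sending `baseline` and `grounded` to the same model and reading the two answers side by side is one simple way to see what the retrieved excerpt contributes.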

In [1]:
!pip install groq
!pip install langchain langchain-community
!pip install huggingface_hub
!pip install tiktoken
!pip install sentence-transformers


# Import Required Dependencies

In [None]:
import pandas as pd
import numpy as np
from groq import Groq
import os


from langchain_community.vectorstores import Chroma
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
# from langchain_pinecone import PineconeVectorStore
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity


from IPython.display import display, HTML

# Set Up Your Groq API Key

In [None]:
# Read the key from the environment; never hardcode or commit a real key in a notebook
groq_api_key = os.environ["GROQ_API_KEY"]
client = Groq(api_key = groq_api_key)
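
An API key pasted into a cell is saved with the notebook and leaks as soon as the notebook is shared. As a minimal sketch, the key can be read from an environment variable, with a hidden prompt as a fallback. The helper `load_api_key` is hypothetical, and the variable name `GROQ_API_KEY` is an assumed convention; adjust it to match however your environment stores secrets.

```python
import os
from getpass import getpass


def load_api_key(env_var="GROQ_API_KEY"):
    """Return the key from the environment, prompting only if it is unset (hypothetical helper)."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so it is not echoed into the cell output.
        key = getpass(f"Enter {env_var}: ")
    return key
```

With this in place, the client could be created as `client = Groq(api_key=load_api_key())`.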

# Load the Dataset



In [None]:
presidential_speeches_df = pd.read_csv('/content/drive/MyDrive/presidential_speeches.csv')
presidential_speeches_df.head()

Unnamed: 0,Date,President,Party,Speech Title,Summary,Transcript,URL
0,1789-04-30,George Washington,Unaffiliated,First Inaugural Address,Washington calls on Congress to avoid local an...,Fellow Citizens of the Senate and the House of...,https://millercenter.org/the-presidency/presid...
1,1789-10-03,George Washington,Unaffiliated,Thanksgiving Proclamation,"At the request of Congress, Washington establi...",Whereas it is the duty of all Nations to ackno...,https://millercenter.org/the-presidency/presid...
2,1790-01-08,George Washington,Unaffiliated,First Annual Message to Congress,"In a wide ranging speech, President Washington...",Fellow Citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...
3,1790-12-08,George Washington,Unaffiliated,Second Annual Message to Congress,Washington focuses on commerce in his second a...,Fellow citizens of the Senate and House of Rep...,https://millercenter.org/the-presidency/presid...
4,1790-12-29,George Washington,Unaffiliated,Talk to the Chiefs and Counselors of the Senec...,The President reassures the Seneca Nation that...,"I the President of the United States, by my ow...",https://millercenter.org/the-presidency/presid...
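
The notebook later picks Garfield's inaugural address by row position (`iloc[309]`), which silently returns a different speech if the CSV is re-sorted or updated. A hedged alternative is to filter on the `President` and `Speech Title` columns shown above. The sketch uses a toy DataFrame with placeholder transcripts, and the helper `find_transcript` is hypothetical; how the real dataset spells each president's name and speech title is an assumption you should check.

```python
import pandas as pd

# Toy stand-in for presidential_speeches_df, reusing three of its column names.
toy_df = pd.DataFrame(
    {
        "President": ["George Washington", "James Garfield"],
        "Speech Title": ["First Inaugural Address", "Inaugural Address"],
        "Transcript": ["placeholder transcript A", "placeholder transcript B"],
    }
)


def find_transcript(df, president_substring, title_substring):
    """Return the first transcript whose president and title match (hypothetical helper)."""
    mask = (
        df["President"].str.contains(president_substring, case=False, na=False)
        & df["Speech Title"].str.contains(title_substring, case=False, na=False)
    )
    matches = df.loc[mask, "Transcript"]
    if matches.empty:
        raise LookupError("no speech matched the given president and title")
    return matches.iloc[0]


transcript = find_transcript(toy_df, "Garfield", "Inaugural")
```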


In [None]:
# Authenticate through the notebook_login() prompt; never paste a Hugging Face token into the notebook itself

# Set Up Hugging Face Token



In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
import transformers
import torch

# Tokenize a Single Speech from the Dataset

In [None]:
garfield_inaugural = presidential_speeches_df.iloc[309].Transcript
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model_id = "meta-llama/Meta-Llama-3-8B"
# Use a general purpose tokenizer for token length calculation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# create the length function
def token_len(text):
    tokens = tokenizer.encode(
        text
    )
    return len(tokens)

token_len(garfield_inaugural)

# Split the Text into Chunks

text_splitter = TokenTextSplitter(
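    chunk_size=450,  # target size, kept under a 500-token maximum; measured with the splitter's own tokenizer, so token_len() reports slightly different counts
    chunk_overlap=20  # tokens shared by consecutive chunks, to reduce the chance of cutting connected text (e.g. mid-sentence)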
)

chunks = text_splitter.split_text(garfield_inaugural)

for chunk in chunks:
    print(token_len(chunk))

Token indices sequence length is longer than the specified maximum sequence length for this model (3420 > 512). Running this sequence through the model will result in indexing errors


453
455
467
457
457
455
461
368
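
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not how `TokenTextSplitter` is implemented: the function `split_with_overlap` is hypothetical and works on any pre-tokenized list, and the example uses made-up word tokens rather than a real tokenizer. The counts printed above exceed 450 most likely because `token_len` counts with the `bert-base-uncased` tokenizer while the splitter counts with its own encoding.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# 1000 made-up tokens standing in for a tokenized speech.
words = [f"w{i}" for i in range(1000)]
word_chunks = split_with_overlap(words, chunk_size=450, chunk_overlap=20)
print([len(c) for c in word_chunks])  # prints [450, 450, 140]
```

The last 20 tokens of each chunk reappear as the first 20 tokens of the next one, which is what gives a sentence that straddles a boundary a chance to survive intact in at least one chunk.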


# Embed Each Chunk into Semantic Vector Space

In [None]:
# Initialize the embedding function
# Use a model specifically designed for sentence embeddings

user_question = "What were James Garfield's views on civil service reform?" # Or any other relevant question

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed every chunk once so the vectors can be reused for later questions
chunk_embeddings = [embedding_function.embed_query(chunk) for chunk in chunks]

prompt_embeddings = embedding_function.embed_query(user_question)
similarities = cosine_similarity([prompt_embeddings], chunk_embeddings)[0]
closest_similarity_index = np.argmax(similarities)
most_relevant_chunk = chunks[closest_similarity_index]
display(HTML(most_relevant_chunk))
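
Cosine similarity scores two vectors by the angle between them: the dot product divided by the product of their lengths, so 1.0 means the same direction and 0.0 means unrelated. The sketch below recomputes it with plain NumPy and extends the `argmax` step to return the top `k` chunks instead of only one. The helper names and the two-dimensional toy vectors are assumptions for illustration (real embeddings have hundreds of dimensions), and the code assumes no vector is all zeros.

```python
import numpy as np


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    dots = docs @ query
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    return dots / norms


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:k]]


toy_chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query_vec = [2.0, 0.0]
scores = cosine_scores(toy_query_vec, toy_chunk_vecs)
best = top_k_indices(scores, k=2)
print(best)  # prints [0, 2]
```

Note that the query `[2.0, 0.0]` still scores 1.0 against `[1.0, 0.0]`: cosine similarity ignores vector length and compares direction only.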

# Feed the Most Relevant Chunk to the LLM to Answer the User's Question



In [None]:
# A chat completion function that uses the most relevant excerpt(s) from presidential speeches to answer the user's question
def presidential_speech_chat_completion(client, model, user_question, relevant_excerpts):
    chat_completion = client.chat.completions.create(
        messages = [
            {
                "role": "system",
                "content": "You are a presidential historian. Given the user's question and relevant excerpts from presidential speeches, answer the question by including direct quotes from presidential speeches. When using a quote, cite the speech that it was from (ignoring the chunk)."
            },
            {
                "role": "user",
                "content": "User Question: " + user_question + "\n\nRelevant Speech Excerpt(s):\n\n" + relevant_excerpts,
            }
        ],
        model = model
    )


    response = chat_completion.choices[0].message.content
    return response

# Use a valid Groq chat completion model
# Refer to Groq documentation for available models
model = "llama3-8b-8192"
presidential_speech_chat_completion(client, model, user_question, most_relevant_chunk)

'James Garfield expressed his views on civil service reform in his inaugural address on March 4, 1881. He believed that the civil service should be regulated by law to ensure fairness and protection for those in the service. Garfield stated:\n\n"The civil service can never be placed on a satisfactory basis until it is regulated by law. For the good of the service itself, for the protection of those who are intrusted with the appointing power against the waste of time and obstruction to the public business caused by the inordinate pressure for place, and for the protection of incumbents against intrigue and wrong, I shall at the proper time ask Congress to fix the tenure of the minor offices of the several Executive Departments and prescribe the grounds upon which removals shall be made during the terms for which incumbents have been appointed."\n\nGarfield\'s views on civil service reform were shaped by his concerns about the inefficiencies and corruption that had plagued the system du
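
`presidential_speech_chat_completion` takes its excerpts as a single string, and the call above passes only the single best chunk. If you retrieve several chunks, they need to be joined into one string first. The sketch below is one way to do that; `format_excerpts`, the `Excerpt N:` labels, and the `---` separator are all assumptions chosen for readability, not something the Groq API requires.

```python
def format_excerpts(chunks, indices, separator="\n\n---\n\n"):
    """Join the selected chunks into one labelled string (hypothetical helper)."""
    parts = []
    for rank, idx in enumerate(indices, start=1):
        # Label each excerpt by retrieval rank so the model can tell them apart.
        parts.append(f"Excerpt {rank}:\n{chunks[idx].strip()}")
    return separator.join(parts)


toy_chunks = ["first chunk text ", "second chunk text", "third chunk text"]
relevant_excerpts = format_excerpts(toy_chunks, [2, 0])
```

The resulting string can be passed as the `relevant_excerpts` argument in place of `most_relevant_chunk`. Keep the combined length within the model's context window.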

# Using Gradio - Response from RAG


In [None]:
!pip install gradio


In [None]:
import gradio as gr
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

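# The functions below assume these names were defined in earlier cells:
# - embedding_function: object with method embed_query(text) -> list of floats
# - chunk_embeddings: list of embedding vectors, one per chunk (num_chunks x embedding_dim)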
# - chunks: list of text excerpts corresponding to chunk_embeddings
# - client: your Groq client instance
# - model: model name string, e.g. "llama3-8b-8192"
# - presidential_speech_chat_completion: function defined to call the chat API

def get_most_relevant_excerpt(user_input):
    """
    Embed the user question, compute cosine similarities against precomputed chunk_embeddings,
    and return the most relevant text excerpt.
    """
    # Embed the question
    question_emb = embedding_function.embed_query(user_input)
    # Compute similarities
    similarities = cosine_similarity([question_emb], chunk_embeddings)[0]
    # Find the index of the highest similarity
    idx = int(np.argmax(similarities))
    return chunks[idx]

def query_presidential_speech_agent(user_input):
    """
    Gradio callback: finds the most relevant speech excerpt and returns the AI response.
    """
    try:
        excerpt = get_most_relevant_excerpt(user_input)
        return presidential_speech_chat_completion(client, model, user_input, excerpt)
    except Exception as e:
        return f"Error: {str(e)}"

# Gradio UI
gr.Interface(
    fn=query_presidential_speech_agent,
    # gr.Textbox is used directly; the older gr.inputs.Textbox / gr.outputs.Textbox forms are not used here
    inputs=gr.Textbox(label="Ask the Presidential Speech Agent"),
    outputs=gr.Textbox(label="Agent Response"),
    title="Presidential Speech Q&A",
    description="Ask a question and get answers with quotes from U.S. presidential speeches."
).launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://0c205e6ad7be871ccb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


