# 📰 **Chroma Vector Store and Retrieval-Augmented Generation (RAG)** for Landmarks, Municipalities, and News

### **Ironhack Data Science and Machine Learning Bootcamp**  
📅 **Date:** February 12, 2025  
📁 **Notebook:** `rag_chroma_integration_travel_chatbot.ipynb`  
👩‍💻 **Authors:** Ginosca Alejandro Dávila & Natanael Santiago Morales  

---

## **📌 Project Overview**
This notebook is part of **The Hitchhiker’s Guide to Puerto Rico**, an interactive **travel planning chatbot** designed to suggest **landmarks**, **municipalities**, and relevant **news articles** based on users' interests. In this notebook, we combine two key components:

1. **Chroma Vector Store**: We will store structured information from three datasets—**landmarks**, **municipalities**, and **news articles**—into a **Chroma vector store** for efficient querying and retrieval.
   
2. **Retrieval-Augmented Generation (RAG)**: We will integrate the Chroma vector store with a **generative language model** (this notebook uses OpenAI's `gpt-3.5-turbo`). For each user query, the system retrieves the most relevant documents and passes them to the model as context for the chatbot's response.

By combining both approaches in this notebook, we aim to create an efficient and context-aware **travel planning chatbot**.

---

## **🛠️ What We Are Doing in This Notebook**
✔ **Step 1:** Load the merged data from `news_landmarks_municipalities_merged.pkl`.  
✔ **Step 2:** Convert the merged data into **LangChain `Document` objects**, the format the Chroma wrapper indexes.  
✔ **Step 3:** Store the documents in **Chroma's vector store**.  
✔ **Step 4:** Perform **retrieval-based querying** from the vector store using user input.  
✔ **Step 5:** Use **Retrieval-Augmented Generation (RAG)** to generate accurate and context-aware responses based on the retrieved documents.  

---

## **🛠️ How the Merged Data Will Be Used**
The merged dataset contains:
- **Landmarks**:
  - **File Name**
  - **Landmark Name**
  - **Coordinates** (latitude and longitude)
  - **Municipality**
  - **Wikipedia URL**
  - **Brief Description**
  
- **Municipalities**:
  - **File Name**
  - **Municipality Name**
  - **Coordinates** (latitude and longitude)
  - **Wikipedia URL**
  - **Brief Description**

- **News Articles**:
  - **File Name**
  - **Publication Date**
  - **Locations Mentioned**
  - **Article Text** (content of the news article)

### **How This Data Supports the Chatbot:**
- ✅ **Context-Aware Responses**: The chatbot can suggest landmarks, municipalities, and relevant news articles based on user queries.
- ✅ **Personalized Recommendations**: The chatbot combines user preferences with relevant context from landmarks and news.
- ✅ **Efficient Querying**: Chroma's vector store enables quick retrieval of relevant documents, and the RAG system ensures responses are generated based on that context.

---

## **📂 Dataset Description**
- **Source**: Raw text files from Wikipedia (landmarks and municipalities) and El Mundo news articles.
- **Format**: Merged into a **.pkl file** containing **Documents** with metadata and content.
- **Location**:  
  📁 `My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/merged-data/news_landmarks_municipalities_merged.pkl`

---

## **🛠️ Chroma Usage**
- Chroma is used to **index and store** the documents from the merged dataset. This allows fast retrieval of relevant documents for the chatbot based on user queries, combining information from:
  - **Landmarks** (landmark name, coordinates, municipality, descriptions, Wikipedia URL)
  - **Municipalities** (municipality name, coordinates, descriptions, Wikipedia URL)
  - **News Articles** (file name, publication date, locations mentioned, article text)

The **Chroma vector store** enables the chatbot to provide **contextual recommendations** efficiently.

---

## **📈 RAG Usage**
- **Retrieval-Augmented Generation (RAG)** will be used to **augment** the chatbot's ability to generate **contextual and accurate** responses based on the retrieved documents.
- We will retrieve relevant documents from the Chroma vector store based on user input and use them as context for generating answers using a **generative language model**.

---

🔹 **Let’s store the data in Chroma and prepare it for chatbot use with RAG! 🚀**


## 🔗 **Mounting Google Drive**

In this step, we will mount Google Drive to access the necessary dataset files stored in our drive. This will allow us to load the `news_landmarks_municipalities_merged.pkl` file, which contains the merged data.

Let's mount the drive so we can access the files stored in it.


In [1]:
from google.colab import drive

# 🔹 Mount Google Drive
drive.mount('/content/drive')


Mounted at /content/drive


## 📂 **Loading the Merged Dataset**

Now that we have mounted Google Drive, we will load the `news_landmarks_municipalities_merged.pkl` file from the specified path into the notebook. This file contains the combined information from landmarks, municipalities, and news articles.

Let’s load the data and inspect its contents.
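The inspection cell below slices the list at fixed index positions, which only works while the dataset order stays the same. As a hedged alternative, the documents can be grouped by the `source` metadata field that appears in the printed output. The `Doc` class here is a hypothetical stand-in so the sketch runs without LangChain; with the real data, `merged_data` would be passed instead of `sample`.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_source(docs):
    """Count documents per value of the 'source' metadata key."""
    return Counter(d.metadata.get("source", "unknown") for d in docs)


def first_n_of_source(docs, source, n=5):
    """Return the first n documents whose 'source' metadata matches."""
    return [d for d in docs if d.metadata.get("source") == source][:n]


# Toy data for illustration only; not taken from the real dataset.
sample = [
    Doc("article text", {"source": "news"}),
    Doc("landmark text", {"source": "landmarks"}),
    Doc("another article", {"source": "news"}),
    Doc("document without a source field", {}),
]

counts = count_by_source(sample)
print(counts)
print(first_n_of_source(sample, "news", n=1))
```

Grouping by metadata keeps the inspection correct even if the merge order changes or documents are added later.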


In [2]:
# 🔹 Path to the merged dataset in Google Drive
file_path = '/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/data/merged-data/news_landmarks_municipalities_merged.pkl'

# 🔹 Load the merged data
import pickle

with open(file_path, 'rb') as file:
    merged_data = pickle.load(file)

# 🔹 Display the total number of documents
print(f"Loaded {len(merged_data)} documents from the dataset.")

# 🔹 Show the first 5 documents that are from the elmundo_chunked_es_page1_40years dataset
print("\nFirst 5 documents:")
print(merged_data[:5])

# 🔹 Show the first 5 documents of the landmarks dataset (index 1668 to 1742)
print("\nFirst 5 documents of the Landmarks dataset:")
print(merged_data[1668:1673])


# 🔹 Show the last 5 documents of the municipalities dataset (index 2242 to 2320)
print("\nLast 5 documents of the Municipalities dataset:")
print(merged_data[-5:])


Loaded 2320 documents from the dataset.

First 5 documents:
[Document(metadata={'filename': '19220527_1.txt', 'date': 'May 27, 1922', 'locations': 'the United States, Puerto Rico, Caguas, puerto rico, Arecibo, San Juan', 'source': 'news'}, page_content="In the office of the Free Federation, we found Senator and socialist leader Santiago Iglesias, with whom we discussed various economic and political issues. He praised the Worker Indemnity Commission, referring to it as one of the laws with the most humanitarian spirit. Iglesias mentioned that critics claimed 86% of the Commission's income goes to salaries, while only 14% is for worker indemnities; he deemed this assertion exaggerated and called for a clear report on the Commission's finances. He also expressed concern that shipping companies are not contributing to the Commission, arguing that since Puerto Rico is not incorporated into the United States, local laws should apply. Iglesias urged lawyers to address this matter in court so

## 📑 **Converting Merged Data into Chroma's Document Format**

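In this step, we will convert the merged dataset into a list of LangChain **Document objects**, the format that the LangChain Chroma wrapper accepts. Each **Document object** holds the **metadata** (fields such as filename, date, and source) and the **page content** (descriptions or article text) for **landmarks**, **municipalities**, and **news articles**. Because the pickle already contains `Document` objects, this step mainly rebuilds them with copied metadata so that later changes do not alter the loaded data.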

This step is crucial to ensure that Chroma can handle the data and index it for efficient querying and retrieval.
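Vector stores such as Chroma generally expect metadata values to be simple scalars (strings, numbers, booleans). The sketch below shows one possible way to clean a metadata dictionary before indexing. The helper name `sanitize_metadata` and the choice to join lists into a comma-separated string are assumptions for illustration, not part of the original notebook.

```python
def sanitize_metadata(metadata):
    """Return a copy of metadata that contains only scalar values.

    - None values are dropped.
    - Lists, tuples, and sets are joined into one comma-separated string.
    - Any other non-scalar value is converted with str().
    """
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif isinstance(value, (list, tuple, set)):
            clean[key] = ", ".join(str(v) for v in value)
        else:
            clean[key] = str(value)
    return clean


example = sanitize_metadata(
    {
        "filename": "old_san_juan.txt",
        "latitude": 18.466,
        "locations": ["San Juan", "Caguas"],
        "url": None,
    }
)
print(example)
```

In the conversion loop, `doc.metadata.copy()` could be swapped for `sanitize_metadata(doc.metadata)` if any field turns out to hold a list or a missing value.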

---

🔹 **Let’s convert the merged data into Chroma's Document format! 🚀**


In [3]:
from langchain.schema import Document

# 🔹 Convert the merged data into Document format
documents = []

for doc in merged_data:
    metadata = doc.metadata.copy()  # Create a copy of metadata
    page_content = doc.page_content  # Content of the document

    # Build a fresh LangChain Document (the type the Chroma wrapper accepts)
    document = Document(
        metadata=metadata,
        page_content=page_content
    )

    # Append to the documents list
    documents.append(document)

# 🔹 Check the first document to confirm the structure
print(documents[0].metadata)
print(documents[0].page_content[:500])  # Show the first 500 characters of the page content


{'filename': '19220527_1.txt', 'date': 'May 27, 1922', 'locations': 'the United States, Puerto Rico, Caguas, puerto rico, Arecibo, San Juan', 'source': 'news'}
In the office of the Free Federation, we found Senator and socialist leader Santiago Iglesias, with whom we discussed various economic and political issues. He praised the Worker Indemnity Commission, referring to it as one of the laws with the most humanitarian spirit. Iglesias mentioned that critics claimed 86% of the Commission's income goes to salaries, while only 14% is for worker indemnities; he deemed this assertion exaggerated and called for a clear report on the Commission's finances. H


## 🔹 **Installing Required Packages**

In this step, we install the packages needed for Chroma and the Hugging Face embedding model: `langchain-community`, which provides the LangChain wrappers for the Chroma vector store and for Hugging Face embeddings, and `chromadb`, the vector store implementation itself.

By installing these packages, we ensure that the environment is ready for creating and storing documents in Chroma's vector store for efficient querying and retrieval.

---

🔹 **Let's install the required packages for Chroma and HuggingFace! 🚀**


In [3]:
# 🔹 Install required packages for Chroma and HuggingFace embeddings with minimal output
!pip install -U langchain-community -q &>/dev/null
!pip install chromadb -q &>/dev/null


## 🔹 **Storing Documents in Chroma's Vector Store**

In this step, we will store the converted documents in **Chroma's vector database**. This will allow us to index the documents for **efficient retrieval** during chatbot interactions. By using **Chroma**, we will ensure that the chatbot can quickly access relevant information based on user queries.
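Re-running the indexing cell against the same persist directory can add the same documents a second time. One hedged way to guard against this is to derive a deterministic ID from each document and skip exact repeats. The helper names below are hypothetical. To our understanding, LangChain's `Chroma.from_documents` accepts an `ids` argument that such IDs could be passed to, but this should be checked against the installed version.

```python
import hashlib


def stable_id(page_content, metadata):
    """Build a deterministic ID from the filename and the text of a document."""
    key = metadata.get("filename", "") + "\n" + page_content
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def unique_with_ids(pairs):
    """Drop exact duplicates from (page_content, metadata) pairs.

    Returns (ids, unique_pairs) in the original order.
    """
    seen = set()
    ids = []
    unique = []
    for content, metadata in pairs:
        doc_id = stable_id(content, metadata)
        if doc_id in seen:
            continue
        seen.add(doc_id)
        ids.append(doc_id)
        unique.append((content, metadata))
    return ids, unique


# Toy pairs for illustration; the second entry repeats the first.
pairs = [
    ("Old San Juan is a historic district", {"filename": "old_san_juan.txt"}),
    ("Old San Juan is a historic district", {"filename": "old_san_juan.txt"}),
    ("Placeholder text", {"filename": "hypothetical_file.txt"}),
]
ids, unique_pairs = unique_with_ids(pairs)
print(len(pairs), "->", len(unique_pairs))
```

Because the IDs depend only on content and filename, the same document always maps to the same ID across notebook runs.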

---

🔹 **Let’s store the documents in Chroma’s vector database! 🚀**


In [5]:
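# Import from langchain_community (installed above); the langchain.* paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings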

# 🔹 Path to store Chroma vector store in Google Drive
chroma_db_path = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/chroma_db"

# 🔹 Initialize the HuggingFace Embeddings model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 🔹 Create the Chroma vector store and store the documents
db = Chroma.from_documents(
    documents,
    embedding_model,
    persist_directory=chroma_db_path  # Save in Google Drive
)

# 🔹 Confirm that the database has been successfully created
print("Chroma database has been created and documents are stored in Google Drive.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Chroma database has been created and documents are stored in Google Drive.


## 🔗 **Loading Chroma Vector Store**

In this step, we will load the **Chroma vector store** that was created and saved in the previous step. This allows us to use the stored documents for retrieval in the **Retrieval-Augmented Generation (RAG)** system.

By loading the vector store, we can query the documents for relevant information based on user queries, enabling us to provide contextual and accurate responses for the chatbot.

If the Chroma vector store was already saved in the previous run, we can skip the creation step and load the database directly from Google Drive.

---

🔹 **Let’s load the Chroma vector store for RAG! 🚀**


In [4]:
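# Import from langchain_community (installed above); the langchain.* paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings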

# 🔹 Path to the persisted Chroma vector store in Google Drive
chroma_db_path = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/chroma_db"

# 🔹 Initialize the HuggingFace Embeddings model (same as before)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 🔹 Check if the Chroma vector store exists
import os
if os.path.exists(chroma_db_path):
    # 🔹 Load the Chroma vector store
    db = Chroma(persist_directory=chroma_db_path, embedding_function=embedding_model)
    print("Chroma vector store loaded successfully!")
else:
    print("Chroma vector store not found. Please make sure it was saved correctly.")



Chroma vector store loaded successfully!


## 🔹 **Testing Document Retrieval from Chroma's Vector Store**

In this step, we will test the document retrieval process from Chroma's vector store. This will ensure that the chatbot can efficiently fetch relevant documents based on user queries.

We will query the vector store with a sample question and display the top 5 most similar documents returned by Chroma. The retrieval process is based on the semantic similarity between the query and the stored documents.
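To make "semantic similarity" concrete, the sketch below ranks a few toy vectors against a query vector using cosine similarity and keeps the top `k`. This is an illustration only: the real embeddings come from the `all-MiniLM-L6-v2` model and have many more dimensions, and the store's configured distance metric may differ from cosine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings", invented for this example.
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_docs, k=2)
print(ranking)
```

The query points in the same direction as the first toy vector, so that vector ranks first and its near neighbour ranks second.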

---

🔹 **Let’s test the retrieval process! 🚀**


In [5]:
# 🔹 Querying Chroma's vector store with a sample user question
user_query = "What are some landmarks in San Juan?"

# 🔹 Retrieve the top 5 most similar documents from the vector store
retrieved_docs = db.similarity_search(user_query, k=5)

# 🔹 Display the retrieved documents
for doc in retrieved_docs:
    print(f"Document Metadata: {doc.metadata}")
    print(f"Content: {doc.page_content[:300]}...")  # Display the first 300 characters of content
    print("-" * 80)  # Separator between documents


Document Metadata: {'filename': 'old_san_juan.txt', 'landmark': 'Old San Juan', 'latitude': 18.46638888888889, 'longitude': -66.11027777777777, 'municipality': 'San Juan', 'source': 'landmarks', 'url': 'https://en.wikipedia.org/wiki/Old_San_Juan'}
Content: Old San Juan is ahistoric districtlocated at the "northwest triangle"[2]of theislet of San Juanin San Juan. Its area roughly correlates to the Ballajá, Catedral, Marina, Mercado, San Cristóbal, and San Franciscosub-barrios of barrio San Juan Antiguoin the municipality of San Juan, Puerto Rico. \n...
--------------------------------------------------------------------------------
Document Metadata: {'filename': 'old_san_juan.txt', 'landmark': 'Old San Juan', 'latitude': 18.46638888888889, 'longitude': -66.11027777777777, 'municipality': 'San Juan', 'source': 'landmarks', 'url': 'https://en.wikipedia.org/wiki/Old_San_Juan'}
Content: Old San Juan is ahistoric districtlocated at the "northwest triangle"[2]of theislet of San Juanin San Ju

## 🔍 **Testing Document Retrieval Output**

In this step, we queried the **Chroma vector store** to retrieve the top 5 documents most relevant to the user’s question: **"What are some landmarks in San Juan?"**

The system successfully fetched a set of documents based on the **semantic similarity** between the query and the stored documents. The retrieved documents contain relevant information about **landmarks** and **municipalities** in San Juan.

### **Key Observations:**
- The **document metadata** includes key information such as the **filename**, **landmark** or **municipality name**, **location coordinates**, **source**, and a **URL** to the relevant Wikipedia page.
- The **content** of the documents provides descriptions of each landmark or municipality, including historical and geographical details.
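- Some of the retrieved documents are exact repeats of the **Old San Juan** entry. Identical metadata and content suggest duplicate entries in the vector store (for example, from indexing the same documents more than once into the same persist directory) rather than extra relevance, and the repeats crowd other landmarks out of the top 5.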

This retrieval process demonstrates how the system is able to pull **contextually relevant documents** from the Chroma vector store to answer specific user queries, thus enhancing the chatbot's ability to provide meaningful and accurate responses.
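Since the output above contains the same Old San Juan entry more than once, a small post-processing step can drop repeated hits at query time. The sketch below is one possible approach, assuming each hit is available as a `(metadata, page_content)` pair. The function name, the 200-character comparison window, and the second filename are hypothetical.

```python
def dedupe_results(results, key_fields=("filename",), content_chars=200):
    """Drop repeated hits, keeping the first (highest-ranked) occurrence.

    `results` is a ranked list of (metadata, page_content) pairs. Two hits
    count as duplicates when the chosen metadata fields and the first
    `content_chars` characters of their text are identical.
    """
    seen = set()
    unique = []
    for metadata, content in results:
        key = tuple(metadata.get(f) for f in key_fields) + (content[:content_chars],)
        if key in seen:
            continue
        seen.add(key)
        unique.append((metadata, content))
    return unique


# Toy hits for illustration; the first two are identical.
hits = [
    ({"filename": "old_san_juan.txt"}, "Old San Juan is a historic district"),
    ({"filename": "old_san_juan.txt"}, "Old San Juan is a historic district"),
    ({"filename": "another_landmark.txt"}, "Placeholder text for a second landmark"),
]
unique_hits = dedupe_results(hits)
print(len(hits), "->", len(unique_hits))
```

With real results, the pairs could be built as `[(d.metadata, d.page_content) for d in retrieved_docs]`. Requesting more than 5 hits before deduplicating leaves room for distinct documents to fill the final list.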

---

🔹 **Next Steps**: With this document retrieval process working, we can move on to integrating the **Retrieval-Augmented Generation (RAG)** system to generate answers based on the retrieved documents.


## 🔑 **Setting Up OpenAI API Key**

In this step, we will load the **OpenAI API key** from the `openai_key.txt` file that we previously saved in the project folder. This API key will be used to authenticate and interact with the OpenAI GPT model for the **Retrieval-Augmented Generation (RAG)** system.

We will read the key from the file and set it as an environment variable so that it can be accessed by the `ChatOpenAI` model in subsequent steps.
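A slightly more defensive variant is sketched below: it prefers a key that is already set in the environment, falls back to the text file, and fails with a clear message if neither is available. The function name `load_api_key` and the demo variable name `DEMO_TRAVEL_BOT_KEY` are hypothetical, and the demo writes a fake key to a temporary file so nothing real is touched.

```python
import os
import tempfile
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", key_file=None):
    """Return an API key from the environment, or from key_file as a fallback."""
    key = os.environ.get(env_var, "").strip()
    if not key and key_file is not None:
        path = Path(key_file)
        if path.is_file():
            key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or in {key_file!r}.")
    return key


# Demo with a throwaway file and a fake key (not a real credential).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_key.txt"
    demo_file.write_text("  sk-demo-not-a-real-key\n", encoding="utf-8")
    demo_key = load_api_key(env_var="DEMO_TRAVEL_BOT_KEY", key_file=demo_file)

print(len(demo_key))
```

Stripping whitespace matters because a trailing newline in the key file would otherwise become part of the key.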

---

🔹 **Let's load and set the OpenAI API key! 🚀**


In [7]:
# 🔹 Load the API key from the file and set it as an environment variable
api_key_path = "/content/drive/My Drive/Colab Notebooks/Ironhack/Week 9/project-dsml-interactive-travel-planner/openai_key.txt"

with open(api_key_path, "r") as file:
    openai_api_key = file.read().strip()

import os
os.environ["OPENAI_API_KEY"] = openai_api_key


## 🔹 **Retrieval-Augmented Generation (RAG) Integration**

In this step, we will integrate the **Retrieval-Augmented Generation (RAG)** system with the Chroma vector store. The goal is to enable the chatbot to retrieve relevant documents from Chroma based on user queries and then generate an answer using a language model.

RAG combines **retrieval** (fetching relevant documents) and **generation** (using a language model to synthesize an answer) to provide accurate and context-aware responses. This will enhance the chatbot's ability to answer questions based on the stored documents, such as landmarks, municipalities, and news articles.
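The cell below uses the `"stuff"` chain type, which places all retrieved documents into a single prompt. The sketch here illustrates that idea in plain Python. The prompt wording and the helper name `build_stuff_prompt` are assumptions for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, docs, max_chars_per_doc=1000):
    """Concatenate retrieved documents into one prompt for the language model.

    `docs` is a ranked list of (metadata, page_content) pairs.
    """
    blocks = []
    for number, (metadata, content) in enumerate(docs, start=1):
        # Prefer a human-readable label; fall back to the filename.
        label = metadata.get("landmark") or metadata.get("filename", "unknown")
        blocks.append(f"[{number}] {label}\n{content[:max_chars_per_doc]}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = [
    (
        {"landmark": "Old San Juan", "filename": "old_san_juan.txt"},
        "Old San Juan is a historic district.",
    )
]
prompt = build_stuff_prompt("What are some landmarks in San Juan?", docs)
print(prompt)
```

Because every retrieved document goes into one prompt, the number of documents (`k`) and their length directly affect how much of the model's context window is used.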

---

🔹 **Let’s integrate RAG for contextual responses! 🚀**


In [11]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 🔹 Initialize the LLM (Large Language Model) for RAG (can use OpenAI GPT model)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, openai_api_key=os.getenv("OPENAI_API_KEY"))

# 🔹 Create the RetrievalQA chain using Chroma vector store and the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # 'stuff' places all retrieved documents into a single prompt for the LLM
    retriever=db.as_retriever(search_type="similarity", search_kwargs={"k": 5})  # Retrieve top 5 documents
)

# 🔹 Perform the query with RAG integration
user_query = "What are some landmarks in San Juan?"
response = qa_chain.run(user_query)

# 🔹 Display the answer generated by the RAG system
print(f"Answer: {response}")


Answer: Some landmarks in San Juan, Puerto Rico, include:

1. El Morro Fortress (Castillo San Felipe del Morro)
2. San Juan Cathedral (Catedral de San Juan Bautista)
3. La Fortaleza (Governor's Mansion)
4. Paseo de la Princesa
5. San Cristóbal Fortress (Castillo San Cristóbal)
6. Casa Blanca
7. Plaza de Armas
8. La Rogativa Statue
9. Bacardi Distillery
10. Condado Vanderbilt Hotel

These are just a few of the many landmarks that you can explore in San Juan.


## 🔍 **Retrieval-Augmented Generation (RAG) Output**

In this step, we used the **Retrieval-Augmented Generation (RAG)** system to query the **Chroma vector store** for relevant documents based on the user's query: **"What are some landmarks in San Juan?"**

The chain retrieved the top 5 most similar documents from the vector store and passed them to the language model, which generated a list of **landmarks in San Juan**. Not every item is necessarily grounded in the retrieved text: the retrieval test above returned mostly Old San Juan entries, and the list includes the Bacardi Distillery, which is located in Cataño across San Juan Bay rather than in San Juan, so part of the answer may come from the model's own prior knowledge.

### **Key Takeaways from the Output:**
- The RAG system combined the power of **retrieval** (fetching relevant documents) and **generation** (using OpenAI’s GPT model) to produce a coherent, detailed answer.
- The system retrieved documents from various sources, including **landmarks** and **municipalities**, that matched the user's query about landmarks in San Juan.
- The answer lists **well-known landmarks**, but the notebook does not verify that each one comes from the retrieved documents; inspecting the source documents returned with the answer would confirm this.

This process demonstrates how the RAG system can integrate real-time retrieval with GPT-based generation to enhance user interactions and provide personalized, context-aware answers.
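One rough way to check whether an answer is grounded in the retrieved context is to test which answer items literally appear in the retrieved text. The sketch below does this with simple substring matching, so it is a coarse check rather than a proper evaluation. The helper name and the toy context sentence are assumptions for illustration. To our understanding, `RetrievalQA.from_chain_type` also accepts `return_source_documents=True`, which would supply the real context to check against.

```python
def split_grounded(items, context_texts):
    """Split answer items into those found in the context and those not found.

    Matching is a case-insensitive substring test.
    """
    context = " ".join(context_texts).lower()
    grounded = []
    ungrounded = []
    for item in items:
        if item.lower() in context:
            grounded.append(item)
        else:
            ungrounded.append(item)
    return grounded, ungrounded


# Toy context written for this example; not a real retrieved document.
toy_context = ["Old San Juan includes Plaza de Armas and La Fortaleza."]
answer_items = ["Plaza de Armas", "La Fortaleza", "Bacardi Distillery"]

grounded, ungrounded = split_grounded(answer_items, toy_context)
print("Grounded:", grounded)
print("Not found in context:", ungrounded)
```

Items that land in the second list are candidates for manual review, since they may come from the model's prior knowledge instead of the vector store.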

---

🔹 **Next Steps**: Continue testing the system with different queries to evaluate the accuracy and relevance of the generated answers.


## 🔹 **Testing the RAG System with a New Query**

In this step, we have wrapped the query process into a function called `query_rag_system`. This function allows us to easily test the Retrieval-Augmented Generation (RAG) system with different queries.

We will test this function with a new query, **"Tell me more about the history of Old San Juan"**. This will allow us to verify that the RAG system can correctly retrieve relevant documents from the Chroma vector store and generate a meaningful, context-aware response using the OpenAI model.

---

🔹 **Let’s test the RAG system with a historical query! 🚀**


In [13]:
def query_rag_system(query):
    response = qa_chain.run(query)
    return response

# Test the function with a new query
print(query_rag_system("Tell me more about the history of Old San Juan"))


Old San Juan, a historic district located in the capital city of San Juan, Puerto Rico, has a rich history dating back to the Spanish colonial period. It was founded by Spanish colonists in 1521 and was originally known as "Ciudad de Puerto Rico." The district's architecture reflects its colonial past, with colorful buildings, cobblestone streets, and historic forts like Castillo San Felipe del Morro and Castillo San Cristóbal. Old San Juan is known for its well-preserved 16th-century Spanish colonial buildings, making it a popular tourist destination filled with history and culture.


## 🔹 **RAG System Response to the Query**

The output from the `query_rag_system` function provides a detailed answer to the query **"Tell me more about the history of Old San Juan."** The response combines information from the relevant documents in the Chroma vector store and generates a context-aware answer based on the historical details of Old San Juan.

The response highlights key aspects of Old San Juan, such as its founding in 1521, its colonial architecture, and its significance as a historical district in Puerto Rico. This confirms that the RAG system is working as expected, efficiently combining document retrieval and generation to provide an accurate, coherent answer.

---

🔹 **RAG successfully generated the historical response for Old San Juan! 🚀**


## 🔍 **Testing the RAG System with More Queries**

In this step, we will continue testing the **Retrieval-Augmented Generation (RAG)** system with additional queries. These tests will help verify that the chatbot can respond to a variety of topics, such as historical information, tourist attractions, and other queries related to Puerto Rico.

We will test the system using different questions about **landmarks**, **municipalities**, and **historical events** to evaluate how well it generates relevant, context-aware answers from the Chroma vector store.

---

🔹 **Let’s test the RAG system with different queries! 🚀**


In [14]:
# 🔹 Test the RAG system with a query about landmarks in Puerto Rico
user_query_1 = "What are the famous landmarks in Puerto Rico?"
response_1 = query_rag_system(user_query_1)
print(f"Answer to Query 1: {response_1}")

# 🔹 Test the RAG system with a query about historical events in San Juan
user_query_2 = "Can you tell me about important historical events in San Juan?"
response_2 = query_rag_system(user_query_2)
print(f"Answer to Query 2: {response_2}")

# 🔹 Test the RAG system with a query about the municipality of San Juan
user_query_3 = "What are the key features of the municipality of San Juan?"
response_3 = query_rag_system(user_query_3)
print(f"Answer to Query 3: {response_3}")

# 🔹 Test the RAG system with a query about a specific landmark (e.g., Castillo San Felipe del Morro)
user_query_4 = "Tell me more about Castillo San Felipe del Morro."
response_4 = query_rag_system(user_query_4)
print(f"Answer to Query 4: {response_4}")


Answer to Query 1: Some of the famous landmarks in Puerto Rico include the Letras de Ponce monument in Ponce, Puerto Rico, and the Archivo General de Puerto Rico in San Juan.
Answer to Query 2: One significant historical event in San Juan is the founding of the city by Spanish colonists in 1521. Another important event is the 30th anniversary of the Jones Act, which provided civil governance for Puerto Rico, approaching, marking a significant moment in the island's political history.
Answer to Query 3: The key features of the municipality of San Juan include being the capital city and the most populous municipality in the Commonwealth of Puerto Rico. It is an unincorporated territory of the United States, with a population of 342,259 as of the 2020 census. San Juan was founded by Spanish colonists in 1521. Additionally, the historic district of Old San Juan is located within the municipality and is known for its cultural and architectural significance.
Answer to Query 4: Castillo San F

## 🔍 **Evaluation of RAG System's Performance on Multiple Queries**

In this step, we will evaluate the performance of the **Retrieval-Augmented Generation (RAG)** system based on the queries tested. The responses generated for different queries provide insights into the system’s ability to retrieve accurate and contextually relevant information from the Chroma vector store.

### **Key Observations from the Outputs:**
1. **Landmarks in Puerto Rico**: The system named two landmarks, the Letras de Ponce monument and the Archivo General de Puerto Rico. Both are valid, but the answer is narrow for such a broad question, likely because it is limited to whatever the top 5 retrieved documents cover.
2. **Historical Events in San Juan**: The query regarding historical events in San Juan yielded relevant details about the founding of the city and political milestones like the Jones Act.
3. **Municipality of San Juan**: The chatbot effectively highlighted essential characteristics of San Juan, such as its status as the capital and its cultural significance, demonstrating an understanding of both historical and geographical context.
4. **Castillo San Felipe del Morro**: The system provided a thorough historical description of the Castillo San Felipe del Morro, illustrating the richness of the information stored in the Chroma vector store.

These results demonstrate the effectiveness of combining **retrieval** and **generation** for answering user queries, with the ability to pull in specific historical and cultural information in a coherent and informative way.

---

🔹 **Next Steps**: Continue testing the system with additional, more specific queries and assess whether the RAG system maintains its performance across various topics.


## 🔍 **Expanding Query Testing & Performance Optimization**

In this step, we will test the **Retrieval-Augmented Generation (RAG)** system with new, varied queries to further evaluate its capabilities. These tests will assess how well the system can handle different types of questions and whether the responses remain contextually relevant across a broad range of topics.

Additionally, we'll explore the possibility of optimizing the system by:
- Adjusting query processing parameters
- Fine-tuning response generation to improve coherence and relevance
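As one hedged reading of "adjusting query processing parameters", the retriever settings used earlier (`search_type` and `k`) can be varied systematically and the answers compared. The sketch below only generates the candidate configurations. The specific values are arbitrary examples, and `"mmr"` (maximal marginal relevance) is a search type we understand LangChain retrievers to support, which should be confirmed for the installed version.

```python
from itertools import product


def retriever_configs(k_values=(3, 5, 8), search_types=("similarity", "mmr")):
    """Build keyword-argument dictionaries for a grid of retriever settings."""
    return [
        {"search_type": search_type, "search_kwargs": {"k": k}}
        for search_type, k in product(search_types, k_values)
    ]


configs = retriever_configs()
for config in configs:
    print(config)
```

Each dictionary could be unpacked as `db.as_retriever(**config)` and the same test queries run against every variant to see which settings give the most useful context.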

### **Next Testing Areas:**
1. **Cultural and Festival Queries**: Test the system’s response to questions about Puerto Rican festivals, traditional music, and notable cultural events.
2. **Historical Figures**: Ask about famous Puerto Rican figures, such as political leaders, artists, and historical heroes.
3. **Geographical Queries**: Query the system about Puerto Rico's towns and natural landmarks beyond the typical tourist spots.

---

🔹 **Let's continue testing and optimizing the RAG system! 🚀**


In [15]:
# 🔹 Test the RAG system with a query about Puerto Rican festivals
user_query_1 = "What are some popular festivals in Puerto Rico?"
response_1 = query_rag_system(user_query_1)
print(f"Answer to Query 1: {response_1}")

# 🔹 Test the RAG system with a query about famous Puerto Rican artists
user_query_2 = "Can you tell me about famous Puerto Rican artists?"
response_2 = query_rag_system(user_query_2)
print(f"Answer to Query 2: {response_2}")

# 🔹 Test the RAG system with a query about a lesser-known Puerto Rican town
user_query_3 = "What can you tell me about the town of Adjuntas?"
response_3 = query_rag_system(user_query_3)
print(f"Answer to Query 3: {response_3}")

# 🔹 Test the RAG system with a query about natural landmarks in Puerto Rico
user_query_4 = "What are the natural landmarks in Puerto Rico?"
response_4 = query_rag_system(user_query_4)
print(f"Answer to Query 4: {response_4}")


Answer to Query 1: Some popular festivals in Puerto Rico include the Aibonito Festival of Flowers, which is a 10-day festival celebrated every June in Aibonito, and the San Sebastián Street Festival, which takes place in San Sebastián in January. These festivals are known for showcasing the culture, music, food, and traditions of Puerto Rico.
Answer to Query 2: One famous Puerto Rican artist is Francisco Manuel Oller y Cestero, who was a distinguished painter and the only Latin American painter to have played a role in the development of Impressionism. He is known for transforming painting in the Caribbean.
Answer to Query 3: Adjuntas is a small mountainside town and municipality in Puerto Rico, located in the central midwestern portion of the island on the Cordillera Central. It is north of Yauco, Guayanilla, and Peñuelas; southeast of Utuado; east of Lares and Yauco; and northwest of Ponce. Adjuntas is spread over 16 barrios and Adjuntas Pueblo, which is the administrative center of 

## 🔍 **Evaluation of RAG System on Expanded Query Set**

In this step, we tested the **Retrieval-Augmented Generation (RAG)** system with a set of varied queries that covered a broad range of topics, including **festivals**, **Puerto Rican artists**, **lesser-known towns**, and **natural landmarks**. These tests allowed us to further assess how well the system retrieves and generates contextually relevant responses.

### **Key Observations from the Outputs:**
1. **Puerto Rican Festivals**: The system returned the Aibonito Festival of Flowers and the San Sebastián Street Festival, providing a useful overview of cultural events. One detail is wrong, however: the San Sebastián Street Festival is held on Calle San Sebastián in Old San Juan, not in the municipality of San Sebastián.
2. **Famous Puerto Rican Artists**: The system named a single artist, Francisco Manuel Oller y Cestero, and described his role in Impressionism, which shows the chatbot can handle queries about historical figures, although coverage is thin.
3. **Town of Adjuntas**: The system was able to retrieve and generate detailed information about the small town of Adjuntas, reflecting the system's capability to answer queries about less touristy locations.
4. **Natural Landmarks in Puerto Rico**: The chatbot provided information about natural landmarks, including the Laguna Tortuguero Natural Reserve, demonstrating its ability to handle geography-related queries.

### **Takeaways:**
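- The system returned relevant, on-topic answers across a diverse set of queries, demonstrating the power of combining **retrieval** and **generation**. These checks were manual, so individual facts in the generated answers should still be verified against the source documents.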
- The chatbot's performance is promising, showing that it can offer insightful and context-aware answers across different topics related to Puerto Rico.
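
The observations above come from reading the answers by hand. As a rough complement, the sketch below scores each answer by how many expected keywords it contains. It is a minimal illustration, not part of the original notebook: `keyword_coverage`, `evaluate_queries`, the example queries, and the keyword lists are all hypothetical, and keyword matching is only a coarse proxy for factual accuracy.

```python
from typing import Callable, Dict, List


def keyword_coverage(answer: str, expected_keywords: List[str]) -> float:
    """Return the fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in answer_lower)
    return hits / len(expected_keywords)


def evaluate_queries(ask: Callable[[str], str], cases: Dict[str, List[str]]) -> Dict[str, float]:
    """Run each query through `ask` and score the answer against its keywords."""
    return {query: keyword_coverage(ask(query), keywords) for query, keywords in cases.items()}


# Hypothetical test cases: the keywords are illustrative, not a vetted ground truth.
example_cases = {
    "What are some popular festivals in Puerto Rico?": ["Aibonito", "San Sebastián"],
    "Tell me about the town of Adjuntas.": ["Adjuntas", "Cordillera Central"],
}

# In the notebook, `ask` would be the existing query_rag_system function:
# scores = evaluate_queries(query_rag_system, example_cases)
```

A low score flags an answer for manual review; a high score does not by itself confirm that the answer is correct.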

---

🔹 **Next Steps**: Continue optimizing the system by refining the parameters and testing on new topics to ensure its robustness for a wide range of user queries.


## **🔍 Evaluation and Reflection on RAG System Performance**

### **Evaluation of Query Handling**:
The **Retrieval-Augmented Generation (RAG)** system demonstrated its ability to retrieve and generate contextually relevant answers to a variety of queries. We tested the system across different types of questions related to **landmarks**, **historical events**, **cultural practices**, and **geographical features**. In each case, the system performed well, retrieving relevant documents from the **Chroma vector store** and using them to generate coherent, informative responses.

### **Key Findings**:
1. **Accuracy of Responses**:
   - The system provided **factually accurate** and **contextually relevant** answers. For example, when asked about landmarks in San Juan, it correctly listed well-known locations like El Morro and San Juan Cathedral.
2. **Context-Awareness**:
   - The RAG system showed a strong ability to maintain **contextual relevance**, whether answering questions about cultural festivals, historical figures, or specific locations in Puerto Rico.
3. **Handling of Diverse Queries**:
   - The system was able to handle a **wide range of queries**, from historical facts to geographical details, showcasing its versatility.

### **System Limitations**:
1. **Edge Cases**:
   - Some less common or very specific queries may require further testing to ensure the system’s robustness across all possible inputs. For example, obscure historical events or lesser-known local figures might pose a challenge.
2. **Response Creativity**:
   - While the system excels at factual retrieval, certain types of creative or opinion-based queries might require further fine-tuning to improve **response variability**.

---

## **🔄 Reflection on the Process**

### **Strengths of the RAG System**:
- **Combination of Retrieval and Generation**: The integration of document retrieval and response generation allowed the system to provide **rich, context-aware answers**.
- **Efficient Use of Chroma Vector Store**: Storing and retrieving documents via the Chroma vector store was effective for fast document retrieval based on semantic similarity.
- **OpenAI’s GPT Model**: Leveraging the GPT-3.5-turbo model contributed to high-quality text generation, providing coherent and relevant answers to user queries.

### **Areas for Improvement**:
- **Query Processing**: Further optimization of query processing could improve the system’s speed and efficiency, for example by experimenting with different `k` values (the number of documents retrieved per query).
- **Fine-tuning for Specific Domains**: The system can be fine-tuned to handle specific domains more effectively, such as cultural or historical queries, by further training on domain-specific data.
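
One way to approach the `k` experiment is to record, for each candidate value, how many documents come back and how much context they add to the prompt. The sketch below is a minimal illustration under stated assumptions: `sweep_k` is a hypothetical helper, and the commented wiring assumes the store is a LangChain Chroma object held in a variable named `vectorstore`, which may differ from the notebook's actual setup.

```python
from typing import Callable, Dict, Iterable, List


def sweep_k(
    search: Callable[[str, int], List[str]],
    query: str,
    k_values: Iterable[int],
) -> Dict[int, Dict[str, int]]:
    """For each k, record how many documents came back and how long the combined context is."""
    report = {}
    for k in k_values:
        texts = search(query, k)
        report[k] = {
            "num_docs": len(texts),
            "context_chars": sum(len(text) for text in texts),
        }
    return report


# Assumed wiring for a LangChain Chroma store named `vectorstore` (hypothetical name):
# search = lambda q, k: [doc.page_content for doc in vectorstore.similarity_search(q, k=k)]
# print(sweep_k(search, "natural landmarks in Puerto Rico", [2, 4, 8]))
```

Larger `k` values give the model more material to draw on but lengthen the prompt, so the report is meant to be read alongside the answer quality observed at each setting.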

---

## **📈 Next Steps**:
1. **Extending the Dataset**: To further improve the system’s performance, consider adding more datasets to the Chroma vector store, such as additional historical documents and local news articles.
2. **Performance Evaluation**: Test the system with **real users** or edge cases to gauge how well it handles complex, ambiguous, or contradictory queries.
3. **UI Development**: Integrating this system into a user interface (e.g., a web or mobile app) would make it easier for users to interact with and is a natural next step.
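
For the edge-case testing mentioned in step 2, a small runner can log how the system reacts to unusual inputs without stopping at the first failure. This is a minimal sketch rather than the authors' method: `run_edge_cases` is a hypothetical helper, and the sample queries in the comment are illustrative only.

```python
from typing import Callable, Dict, Iterable, List


def run_edge_cases(ask: Callable[[str], str], queries: Iterable[str]) -> List[Dict[str, str]]:
    """Send each query to `ask` and record whether it answered, returned nothing, or raised."""
    results = []
    for query in queries:
        try:
            answer = ask(query)
            status = "ok" if answer.strip() else "empty"
        except Exception as exc:  # keep going so one failure does not hide the rest
            answer = ""
            status = f"error: {type(exc).__name__}"
        results.append({"query": query, "status": status, "answer": answer})
    return results


# In the notebook, `ask` would be the existing query_rag_system function, e.g.:
# report = run_edge_cases(query_rag_system, ["", "Is Ponce north or south of itself?"])
# for row in report:
#     print(row["status"], "|", row["query"])
```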
