The paths in this notebook are relative to the root of the repository, but the notebook was moved afterwards. Update the paths before running the code.

In [3]:
# importing the required libraries. Some libraries run on colab while others run locally

import os
import zipfile
from tqdm import tqdm
from openai import OpenAI
import pickle as pkl

try:
    from dotenv import load_dotenv
    load_dotenv()
    HF_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    
except ImportError:  # python-dotenv is not installed, so assume we are running on Colab
    from google.colab import drive
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Set to True to run the pipelines. Since we stored the results in a file, 
# we can set this to False to avoid running the pipelines again
RUN_PIPELINES = False   

In [2]:
# from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

In [3]:
# same as above. If on colab, we need to mount the drive. If locally, we can set the path to the local directory

# ========== Mount Google Drive ==========
try:
    drive.mount('/content/drive')

# ========== Path Configuration ==========
# Update these paths according to your Google Drive structure
    DRIVE_BASE = '/content/drive/MyDrive/'
    EXTRACT_DIR = '/tmp/extracted'  # Using tmp for faster I/O

except Exception:  # not on Colab ('drive' is undefined) or the mount failed: fall back to local paths
    DRIVE_BASE = './outputs/'
    EXTRACT_DIR = './tmp/extracted'  # local scratch directory

ZIP_PATH = "./data/elmundo_chunked_es_page1_40years.zip"
OUTPUT_DIR = os.path.join(DRIVE_BASE, 'cleaned_articles1')

# Create directories
os.makedirs(EXTRACT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)


In [4]:
# ========== File Extraction ==========
def extract_files(zip_path, extract_dir):
    """
    Extracts files from a zip archive to a directory.
    
    Args:
    zip_path (str): Path to the zip archive.
    extract_dir (str): Directory to extract the files to.
    """
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Extract nested structure
        for file in zip_ref.namelist():
            if file.endswith('.txt'):
                zip_ref.extract(file, extract_dir)
    print("*" * 50)
    print(f"Extracted files to: {extract_dir}")
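
A quick way to check the extraction logic without the real archive is to run it on a throwaway zip. The sketch below is only an illustration: `extract_txt_members` is a hypothetical variant of `extract_files` that also returns the extracted names, and the file names inside the demo archive are made up.

```python
import os
import tempfile
import zipfile

def extract_txt_members(zip_path, extract_dir):
    """Extract only the .txt members of a zip archive and return their names."""
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            if name.endswith(".txt"):
                zip_ref.extract(name, extract_dir)  # keeps the nested folder structure
                extracted.append(name)
    return extracted

# Build a small throwaway archive (hypothetical file names) and extract it
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("articles/19470315_1.txt", "sample article")
    zf.writestr("articles/cover.png", "not text")

names = extract_txt_members(demo_zip, os.path.join(work_dir, "out"))
print(names)
```

Only the `.txt` member is extracted; the image is skipped, which mirrors the filter in `extract_files`.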

In [5]:
from openai import OpenAI
client = OpenAI(
    api_key=OPENAI_API_KEY
)

def correct_with_openai(text, filename, just_text = True, max_completion_tokens = 2048, temperature = 1, top_p = 1, frequency_penalty=0, presence_penalty=0,**kwargs):
  """
  Corrects text using OpenAI's GPT-4o-mini model.
  
  Args:
  text (str): The text to correct.
  filename (str): The name of the file.
  just_text (bool): Whether to return just the corrected text.
  max_completion_tokens (int): The maximum number of tokens to generate.
  temperature (float): The temperature for sampling.
  top_p (float): The nucleus sampling probability.
  frequency_penalty (float): The frequency penalty.
  presence_penalty (float): The presence penalty.
  kwargs: Additional keyword arguments.
  
  Returns:
  str: The corrected text.
  """
  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": f"Eres un experto en documentos históricos de Puerto Rico. El texto en español son noticias del siglo XX y contiene muchos errores a causa del OCR. Descifra el contenido y tradúcelo al inglés:\n1. Preserva nombres propios (ej: Mayagüez, Caguas)\n2. Ignora el \"header\" (ej:\n```EL MUNDO\nPRONOSTICOS DEL TIEMPO PARA LA ISLA, HOY: Mayormente nublado, con aguaceros dispersos temprano en la mafiana. EN SAN JUAN. AYER: Temperatura máxima. 80; mínima, 77. Presión barométrica al nivel del mar, a las 4:80 de la tarde. 38.88 pulgadas de mercurio. No hay indicios de disturbio tropical.\n40 páginas 5/\nDIARIO DE LA MARANA\nAÑO XXVIII\nEntered aa second clsss matter, Post Office, San Juan, P. R.)```\n3. Ignora los anuncios\n4. Solo mantén contenido relacionado a Puerto Rico (especialmente sobre ciudades, locaciones o eventos históricos)\n5. Traduce el texto a inglés. Solo mantén los datos mas importantes\n6.  Lista las ciudades o locaciones de Puerto Rico mencionadas\n7. Escribe solo en texto (no uses **negrillas** ni *itálicas* ni nada en markdown)\n8. return it as a JSON object with two fields:\n    - 'metadata': un diccionario con la siguiente informacion: 'filename' (nombre del articulo), 'date' (fecha del articulo), 'locations' (lista de las ciudades o locaciones de Puerto Rico mencionadas).\n    - 'text': the corrected and summarized text in English.\n8. No digas nada mas ni preguntes más. El nombre del articulo es {filename}. Usa el siguiente texto: {text}"
          }
        ]
      }
    ],
    response_format={
      "type": "json_object"
    },
    temperature=temperature,
    max_completion_tokens=max_completion_tokens,
    top_p=top_p,
    frequency_penalty=frequency_penalty,
    presence_penalty=presence_penalty,
    **kwargs
  )
  if just_text:
    return response.choices[0].message.content

  return response
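
The prompt asks the model for a JSON object with a 'metadata' dictionary and a 'text' field, but nothing guarantees that the reply has that shape. A minimal sketch of a validation step is shown below. `parse_model_reply` and `REQUIRED_FIELDS` are hypothetical names that are not part of the notebook, and the sample reply is invented for illustration.

```python
import json

REQUIRED_FIELDS = ("metadata", "text")  # the two fields requested in the prompt

def parse_model_reply(raw_reply, filename):
    """Parse the model's JSON reply and check the fields the prompt asks for."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"{filename}: reply is not valid JSON ({err})") from err
    if not isinstance(data, dict):
        raise ValueError(f"{filename}: reply is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"{filename}: reply is missing {missing}")
    if not isinstance(data["metadata"], dict):
        raise ValueError(f"{filename}: 'metadata' is not a dictionary")
    # Make sure the filename is always present, even if the model omitted it
    data["metadata"].setdefault("filename", filename)
    return data

# Invented sample reply, shaped like the prompt requests
reply = '{"metadata": {"date": "March 15, 1947", "locations": ["San Juan"]}, "text": "Sample summary."}'
parsed = parse_model_reply(reply, "19470315_1.txt")
print(parsed["metadata"]["filename"])
```

Raising a `ValueError` fits a caller that wraps each file in `try`/`except` and moves on to the next file when one reply is malformed.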

In [6]:
from datetime import datetime
import pickle as pkl

def save_progress(data, filename="all_docs.pkl"):
    """ Save the current state of data to Google Drive. 
    
    Args:
    data (dict): The data to save.
    filename (str): The filename to save the data to.
    
    Returns:
    None
    """
    save_path = os.path.join(OUTPUT_DIR, filename)

    with open(save_path, 'wb') as f:
        pkl.dump(data, f)

    print(f"Progress saved at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} to {save_path}")

In [7]:
PROGRESS_FILE = os.path.join(OUTPUT_DIR, "processed_files.log")

def get_processed_files():
    """
    Returns a set of processed files.
    
    Returns:
    set: The set of processed files
    """
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, 'r') as f:
            return set(f.read().splitlines())
    return set()

def update_progress(filename):
    """
    Updates the progress file with the processed filename.
    
    Args:
    filename (str): The filename to add to the progress file.
    
    Returns:
    None
    """
    with open(PROGRESS_FILE, 'a') as f:
        f.write(f"{filename}\n")
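
The progress log is what makes the pipeline resumable: every finished file is appended to a text file, and the next run skips the names found there. The sketch below replays that idea in a temporary directory. `read_processed` and `mark_processed` are hypothetical stand-ins that take the log path as an argument instead of using the global `PROGRESS_FILE`.

```python
import os
import tempfile

def read_processed(log_path):
    """Return the set of file names already recorded in the log."""
    if os.path.exists(log_path):
        with open(log_path, "r", encoding="utf-8") as f:
            return set(f.read().splitlines())
    return set()

def mark_processed(log_path, filename):
    """Append one finished file name to the log."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"{filename}\n")

log_path = os.path.join(tempfile.mkdtemp(), "processed_files.log")
for name in ["a.txt", "b.txt"]:
    mark_processed(log_path, name)

# On the next run, only files missing from the log still need processing
done = read_processed(log_path)
todo = [name for name in ["a.txt", "b.txt", "c.txt"] if name not in done]
print(todo)  # ['c.txt']
```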

In [8]:
# ========== Processing Pipeline ==========
import json
import pickle as pkl
from langchain.docstore.document import Document
import time

# Save progress every 15 minutes
interval_minutes = 15

def process_files():
    """
    Processes the text files in the ZIP archive.
    
    Returns:
    list: A list of Document objects.
    """
    extract_files(ZIP_PATH, EXTRACT_DIR)  # extract files from zip

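    # Reload documents saved at the last checkpoint so that resuming a run does not overwrite them
    saved_docs_path = os.path.join(OUTPUT_DIR, "all_docs.pkl")
    if os.path.exists(saved_docs_path):
        with open(saved_docs_path, 'rb') as f:
            all_docs = pkl.load(f)
    else:
        all_docs = []  # for storing all the documents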

    # Track when the last save occurred
    last_save_time = time.time()
    processed = get_processed_files()

    # Get all text files from nested directory
    base_dir = os.path.join(EXTRACT_DIR, "elmundo_chunked_es_page1_40years")
    txt_files = [f for f in os.listdir(base_dir) if f.endswith('.txt')]

    for filename in tqdm(txt_files, desc="Processing files"):

        if filename in processed:
            # Skip already processed files
            continue

        input_path = os.path.join(base_dir, filename)
        output_path = os.path.join(OUTPUT_DIR, f"cleaned_{filename}")

        with open(input_path, 'r', encoding='utf-8', errors='ignore') as f: # open current text file
            raw_text = f.read()

        try:
            # gets gpt-4o-mini JSON object with 'metadata' and 'text' fields:
            json_object = json.loads(correct_with_openai(raw_text, filename))  # OpenAI version

            cleaned_text = json_object['text']  # get the text from the gpt-4o-mini model

            with open(output_path, 'w', encoding='utf-8') as f: # save text on google drive
                f.write(cleaned_text)

            print(f"Processed: {filename} -> Saved to Drive")

            doc = Document(                           # convert text to a langchain text object (for use on Chroma later)
                page_content=json_object['text'],
                metadata=json_object['metadata']
            )
            all_docs.append(doc)                      # append docs to list

            # Update the processed log
            update_progress(filename)

        except Exception as e:
            print(f"Error processing {filename}: {str(e)}")
            continue

        current_time = time.time()
        if (current_time - last_save_time) >= (interval_minutes * 60):
            save_progress(all_docs)
            last_save_time = current_time  # Update the last save time

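    # Save all_docs as pkl file (to the output directory and to a local ./saves copy)
    with open(os.path.join(OUTPUT_DIR, "all_docs.pkl"), 'wb') as f:
        pkl.dump(all_docs, f)

    os.makedirs("./saves", exist_ok=True)  # make sure the local save directory exists
    with open("./saves/all_docs.pkl", 'wb') as f:
        pkl.dump(all_docs, f)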

    return all_docs
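
The loop above checkpoints on elapsed time rather than on a file count. The same logic can be isolated and tested with a fake clock, as in the sketch below. `CheckpointTimer` is a hypothetical helper, not something the notebook defines, and the 15-minute interval simply mirrors `interval_minutes`.

```python
import time

class CheckpointTimer:
    """Report when at least `interval_seconds` have passed since the last checkpoint."""

    def __init__(self, interval_seconds, clock=time.time):
        self.interval = interval_seconds
        self.clock = clock            # injectable so the timer can be tested without waiting
        self.last_save = clock()

    def due(self):
        now = self.clock()
        if now - self.last_save >= self.interval:
            self.last_save = now      # restart the interval from this checkpoint
            return True
        return False

# Drive the timer with a fake clock instead of real time
fake_now = [0.0]
timer = CheckpointTimer(15 * 60, clock=lambda: fake_now[0])

fake_now[0] = 600.0
after_10_min = timer.due()
fake_now[0] = 900.0
after_15_min = timer.due()
fake_now[0] = 1000.0
just_after_save = timer.due()
print(after_10_min, after_15_min, just_after_save)  # False True False
```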

In [9]:
# Run the pipeline only if RUN_PIPELINES was set to True in the first cell (it is False by default)
if RUN_PIPELINES:
    all_docs = process_files()


In [10]:
def translate_list (lista, just_text = True, max_completion_tokens = 2048, temperature = 1, top_p = 1, frequency_penalty=0, presence_penalty=0,**kwargs):
  """
  Translates a list of items to English using OpenAI's GPT-4o-mini model.
  
  Args:
  lista (list): The list to translate.
  just_text (bool): Whether to return just the translated text.
  
  Returns:
  str: The translated text in a JSON-ish style.
  """
  largo = len(lista)

  if isinstance(lista, list):
    lista = str(lista)

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": f"""
                Eres un experto en documentos históricos, localidades, y personas ilustres de Puerto Rico. La lista que se te pasará es una lista de lugares conocidos y 
                personas ilustres de Puerto Rico. Tu deber es traducir la lista al inglés y corregir cualquier error que encuentres. 
                1. Preserva nombres propios (ej: Mayagüez, Caguas, Julia de Burgos)\n
                2. Escribe solo en texto (no uses **negrillas** ni *itálicas* ni nada en markdown)\n
                3. Es posible que la lista ya contenga elementos en inglés. En ese caso, no los traduzcas, pero inclúyelos en la respuesta final.\n
                4. Retorna un objeto JSON con {largo} pares key-value:\n
                    - key: el texto de la lista. Podría estar en inglés o español. Depende de como te lo dieron en la lista\n
                    - value: el texto traducido al inglés. Si el elemento ya estaba en inglés, no hace falta traducirlo, pero igualmente incluyes el texto aquí\n
                5. El ejemplo de como se vería la respuesta JSON aceptable para la lista de ejemplo ["playa de Camuy", "Julia de Burgos", "Parque acuático Las Cascadas", "Centro Ceremonial Indígena de Caguana", "Aguada Transmission Center", "Domes Beach"]:\n
                    ```
                    {{
                        "playa de Camuy": "Camuy Beach",
                        "Julia de Burgos": "Julia de Burgos",
                        "Parque acuático Las Cascadas": "Las Cascadas Water Park",
                        "Centro Ceremonial Indígena de Caguana": "Caguana Indigenous Ceremonial Center",
                        "Aguada Transmission Center": "Aguada Transmission Center",
                        "Domes Beach": "Domes Beach"
                    }}
                    ```
                  En general, el JSON debería verse {{key_1: value_1, key_2: value_2, key_3: value_3, ..., key_{largo}: value_{largo}}}\n
                6. La lista de arriba solamente es un ejemplo para que te guíes.
                7. Absolutamente todos los elementos en la lista que el usuario te dé tienen que aparecer en el JSON final con su traducción correspondiente.
                8. No digas nada más ni preguntes más.
                9. La lista del usuario que vas a usar para el JSON es la siguiente:\n 
                    {lista}
                """
          }
        ]
      }
    ],
    response_format={
      "type": "json_object"
    },
    temperature=temperature,
    max_completion_tokens=max_completion_tokens,
    top_p=top_p,
    frequency_penalty=frequency_penalty,
    presence_penalty=presence_penalty,
    **kwargs
  )
  if just_text:
    return response.choices[0].message.content

  return response
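
The prompt insists that every element of the input list must appear as a key in the returned JSON, so it may be worth verifying that before using the result. The sketch below shows one way to do it. `check_translations` is a hypothetical helper, and the reply string is invented rather than an actual model response.

```python
import json

def check_translations(items, raw_reply):
    """Return (translations, missing) for a JSON reply that maps each item to its English form."""
    translations = json.loads(raw_reply)
    missing = [item for item in items if item not in translations]
    return translations, missing

items = ["playa de Camuy", "Julia de Burgos", "Domes Beach"]
# Invented reply in which the model dropped one element
reply = '{"playa de Camuy": "Camuy Beach", "Julia de Burgos": "Julia de Burgos"}'

translations, missing = check_translations(items, reply)
print(missing)  # ['Domes Beach']
```

Items reported as missing could be sent again in a smaller batch, or kept untranslated as their own value.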

In [11]:
def update_metadata_with_landmarks(all_docs, landmarks_dict):
    """
    Updates the metadata of the documents with the landmarks found in the text.
    
    Args:
    all_docs (list): A list of Document objects.
    landmarks_dict (dict): A dictionary of landmarks in English and Spanish.
    
    Returns:
    list: A list of Document objects with updated metadata.
    """
    for doc in all_docs:
        text = doc.page_content.lower()  # Convert to lowercase for easier matching
        for landmark_es, landmark_en in landmarks_dict.items():
            if landmark_es.lower() in text or landmark_en.lower() in text:
                if 'locations' not in doc.metadata:
                    doc.metadata['locations'] = []
                if landmark_en not in doc.metadata['locations']:
                    doc.metadata['locations'].append(landmark_en)  # Add the English landmark
                if landmark_es not in doc.metadata['locations']:
                    doc.metadata['locations'].append(landmark_es)  # Add the Spanish landmark

    return all_docs
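
The matching above is a plain substring test, so a short landmark name can also match inside a longer word. A possible refinement, sketched below under that assumption, is to require whole-word matches with a regular expression. `find_landmarks` is a hypothetical helper and the sample text and dictionary are made up.

```python
import re

def find_landmarks(text, landmarks_dict):
    """Return the English names of landmarks mentioned in `text` as whole words."""
    found = []
    for name_es, name_en in landmarks_dict.items():
        for name in (name_es, name_en):
            # (?<!\w) and (?!\w) reject matches embedded in a longer word
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                if name_en not in found:
                    found.append(name_en)
                break
    return found

landmarks = {"playa de Camuy": "Camuy Beach", "Lares": "Lares"}
text = "Storm damage was reported near Camuy Beach; the Solares family was unaffected."

print("lares" in text.lower())           # True: the substring test matches inside "Solares"
print(find_landmarks(text, landmarks))   # ['Camuy Beach']
```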


In [12]:
import spacy

# Load spaCy's NER model
nlp = spacy.load("en_core_web_sm")

# Example function to process the documents and add NER results to metadata
def enrich_metadata_with_ner(all_docs):
    """
    Enriches the metadata of the documents with named entities recognized by spaCy.
    
    Args:
    all_docs (list): A list of Document objects.
    
    Returns:
    list: A list of Document objects with updated metadata.
    """
    for doc in all_docs:
        text = doc.page_content
        spacy_doc = nlp(text)  # Process text through spaCy NER engine

        # Collect the detected locations (GPE and LOC entities)
        ner_locations = {ent.text for ent in spacy_doc.ents if ent.label_ in ['GPE', 'LOC']}
        
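        # Combine with existing locations in metadata (accept a list or an already-joined string)
        existing = doc.metadata.get('locations', [])
        if isinstance(existing, str):
            existing = [loc.strip() for loc in existing.split(',') if loc.strip()]
        updated_locations = sorted(set(existing).union(ner_locations))

        # Store the locations as one comma-separated string, since Chroma expects scalar metadata values
        doc.metadata['locations'] = ', '.join(updated_locations)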
    
    return all_docs
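
Because the set union is case-sensitive, the same place can end up in the metadata twice with different capitalization (for example "Puerto Rico" and "puerto rico"). A minimal sketch of a case-insensitive merge is shown below. `merge_locations` is a hypothetical helper, not part of the notebook.

```python
def merge_locations(existing, detected):
    """Merge location names, keeping the first spelling seen for each name (case-insensitive)."""
    merged = {}
    for name in list(existing) + list(detected):
        cleaned = name.strip()
        key = cleaned.casefold()
        if key and key not in merged:
            merged[key] = cleaned
    return list(merged.values())

existing = ["Puerto Rico", "San Juan"]
detected = ["puerto rico", "Culebra", "San Juan"]

locations = merge_locations(existing, detected)
print(", ".join(locations))  # Puerto Rico, San Juan, Culebra
```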

In [13]:
def get_landmarks_dict():
    """
    Returns a dictionary of landmarks in English and Spanish.
    
    Returns:
    dict: A dictionary of landmarks in English and Spanish.
    """

    path_zip = "./data/landmarks.zip"
    extract = './tmp/extracted_landmarks'  # Using tmp for faster I/O
    # Create directories
    os.makedirs(extract, exist_ok=True)

    # Extract the landmarks.zip file
    extract_files(path_zip, extract)


    #landmarks list of file names, removing .txt and changing `_`, and `-` to spaces
    base_dir = os.path.join(extract, "landmarks")
    landmarks = [f.replace('.txt', '').replace('_', ' ').replace('-', ' ') for f in os.listdir(base_dir) if f.endswith('.txt')]

    # Translate the landmarks list with OpenAI (splitting into two to avoid exceeding the token limit)
    translations1 = translate_list(landmarks[:len(landmarks)//2], just_text=False, max_completion_tokens=7000)
    translations2 = translate_list(landmarks[len(landmarks)//2:], just_text=False, max_completion_tokens=7000)

    # Combine the translations. translations contain 2 ChatCompletion objects
    translations = [translations1, translations2]

    #save translations to pkl
    with open("./saves/landmark translations.pkl", 'wb') as f:
        pkl.dump(translations, f)

    # change the translations to json
    translation1_json = json.loads(translations[0].choices[0].message.content)
    translation2_json = json.loads(translations[1].choices[0].message.content)

    # make translations_json by adding translation1_json and translation2_json
    translations_json = {**translation1_json, **translation2_json}

    # save translations_json to json file
    with open("./saves/landmarks.json", 'w') as f:
        json.dump(translations_json, f)

    return translations_json

In [14]:
# delete the extracted files
! rm -rf ./tmp/extracted_landmarks

In [15]:
# Run the pipeline only if RUN_PIPELINES was set to True in the first cell (it is False by default)
if RUN_PIPELINES:
    translations = get_landmarks_dict()

    # update metadata with landmarks and with the locations detected by spaCy
    news_docs = update_metadata_with_landmarks(all_docs, translations)
    news_docs = enrich_metadata_with_ner(news_docs)

    # add a source tag to the metadata with value 'news'
    for doc in news_docs:
        doc.metadata['source'] = 'news'

    # save the updated news documents to pkl
    with open("./saves/all_docs_updated.pkl", 'wb') as f:
        pkl.dump(news_docs, f)

else:
    with open("./saves/all_docs_updated.pkl", 'rb') as f:
        news_docs = pkl.load(f)      # load the updated news documents from pkl file

Convert the other articles to Documents and store them in lists for later use in ChromaDB

In [16]:
# open csv files with pandas
import pandas as pd
from langchain.docstore.document import Document

if RUN_PIPELINES:
    landmarks = pd.read_csv("./structured-information-from-datasets/landmark_data_combined.csv")
    landmarks.head()

    # convert each row of the landmarks csv into a Document and collect them in a list
    landmarks_docs = []
    for i, row in landmarks.iterrows():
        doc = Document(
            page_content=row['Brief Description'],
            metadata={
                'filename': row['File Name'],
                'landmark': row['Landmark Name'],
                'latitude': row['Latitude'],
                'longitude': row['Longitude'],
                'municipality': row['Municipality'],
                'url': row['Wikipedia URL'],
                'source': 'landmarks'
            }
        )
        landmarks_docs.append(doc)

In [17]:
#open municipality csv file
if RUN_PIPELINES:
    municipalities = pd.read_csv("./structured-information-from-datasets/municipality_data_combined.csv")
    municipalities.head()

    # convert each row of the municipalities csv into a Document and collect them in a list
    municipalities_docs = []
    for i, row in municipalities.iterrows():
        doc = Document(
            page_content=row['Brief Description'],
            metadata={
                'filename': row['File Name'],
                'municipality': row['Municipality Name'],
                'latitude': row['Latitude'],
                'longitude': row['Longitude'],
                'url': row['Wikipedia URL'],
                'source': 'municipalities'
            }
        )
        municipalities_docs.append(doc)

Merge all lists of documents into one list and save them

In [4]:
# merge the news_docs, landmarks_docs, and municipalities_docs in a single list
if RUN_PIPELINES:
    all_docs = news_docs + landmarks_docs + municipalities_docs

    # save the merged all_docs to pkl
    with open("./saves/news_landmarks_municipalities_merged.pkl", 'wb') as f:
        pkl.dump(all_docs, f)

else:
    with open("./saves/news_landmarks_municipalities_merged.pkl", 'rb') as f:
        all_docs = pkl.load(f)      # load the updated all_docs from pkl file

ChromaDB

In [8]:
## ========== Chroma ==========
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

create_db = False

# The same embedding model is needed both to build and to load the database
sentence_transformer_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
print("Initialized SentenceTransformer embeddings.")

if create_db:
    # Load all documents into Chroma
    db = Chroma.from_documents(all_docs, sentence_transformer_embeddings, persist_directory="./chroma_db")
    print('All documents loaded and embedded.(huggingface)')

else:
    print("\nLoading database...")
    db = Chroma(persist_directory='./chroma_db', embedding_function=sentence_transformer_embeddings)
    print("Huggingface database loaded.")


Initialized SentenceTransformer embeddings.

Loading database...
Huggingface database loaded.


In [36]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

def rank_appropriate_locations(user_prompt):
    """
    This function ranks appropriate locations for the user based on their preferences in the prompt.
    It uses Chroma for document retrieval and RAG for ranking locations based on context-aware responses.

    Parameters:
    - user_prompt (str): The user's question or preferences (e.g., "I love sunny beaches").

    Returns:
    - str: Newline-separated "location: explanation" lines sorted by relevance score, or an error message.
    """

    # Step 1: Analyze the user prompt for keywords (e.g., "beach," "history," "sunny")
    preferences = user_prompt.lower()
    keyword_list = ['beach', 'sunny', 'history', 'museum', 'nature', 'mountain', 'culture']  # Example keyword list

    # Step 2: Query Chroma to retrieve locations that match user preferences
    relevant_locations = []
    relevance_scores = {}  # Store relevance score for each location

    for keyword in keyword_list:
        if keyword in preferences:
            query = f"{keyword} locations"
            try:
                retrieved_docs = db.similarity_search(query, k=5)  # Adjust 'k' for the number of retrieved documents
                if retrieved_docs:
                    for doc in retrieved_docs:
                        loc_name = doc.metadata.get('landmark')
                        page_content = doc.page_content.lower()
                        # Assign a relevance score based on the presence of the keyword in the content
                        score = page_content.count(keyword)  # Count how many times the keyword appears in the content
                        if loc_name not in relevance_scores:
                            relevance_scores[loc_name] = score
                        else:
                            relevance_scores[loc_name] += score
                    relevant_locations.extend(retrieved_docs)
            except Exception as e:
                return f"Error: Failed to retrieve documents from Chroma. Please try again later. Error details: {str(e)}"

    # Edge case: If no relevant documents are retrieved
    if not relevant_locations:
        return "Sorry, no relevant locations were found based on your preferences. Please try a different query."

    # Step 3: Rank the retrieved locations using RAG
    ranked_locations = []
    try:
        # Initialize the LLM (Large Language Model) for RAG
        # Use the key loaded in the first cell; os.getenv would return None on Colab
        llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, openai_api_key=OPENAI_API_KEY)

        # Create the RetrievalQA chain using Chroma vector store and the LLM
        qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",  # Using 'stuff' to combine document contents into one response
            retriever=db.as_retriever(search_type="similarity", search_kwargs={"k": 5})  # Retrieve top 5 documents
        )

        # Step 4: For each retrieved location, generate a response based on user preferences
        asked = set()
        for location in relevant_locations:
            loc_name = location.metadata.get('landmark')
            # Skip documents without a landmark name (e.g. news articles) and locations already asked about
            if loc_name is None or loc_name in asked:
                continue
            asked.add(loc_name)
            # Pass the location to RAG with the user prompt
            response = qa_chain.run(f"How does {loc_name} match the user's preference: {user_prompt}")
            ranked_locations.append((loc_name, response, relevance_scores.get(loc_name, 0)))  # Store location with response and score

        # Step 5: Sort locations based on relevance score (higher score is better)
        ranked_locations.sort(key=lambda x: x[2], reverse=True)  # Sort by relevance score

    except Exception as e:
        return f"Error: Failed to rank locations using RAG. Please try again later. Error details: {str(e)}"

    # Step 6: Return the ranked list of location suggestions
    ranked_list = "\n".join([f"{loc[0]}: {loc[1]}" for loc in ranked_locations])  # Format the output
    return ranked_list

# Example usage: Rank locations based on a user prompt
user_prompt = "I love sunny beaches and historical places."
ranked_locations = rank_appropriate_locations(user_prompt)
print(ranked_locations)




Esperanza Beach: Esperanza Beach might not fully match your preference for historical places, as it is mainly known for being a popular beach destination with hotels, restaurants, and shops. However, it does offer a sunny beach experience with its proximity to La Esperanza and the beautiful coastline. If you are specifically looking for historical places, you may want to explore other areas in Puerto Rico that are known for their historical significance.
La Pocita de las Golondrinas Beach: La Pocita de las Golondrinas Beach in Isabela is a sunny beach that is safe for families with children due to its shallow waters, but it does not have historical places nearby.
Jobos Beach: Jobos Beach may not be the best match for the user's preference of loving sunny beaches and historical places. While Jobos Beach is a sunny beach located in Puerto Rico, it does not have significant historical landmarks or sites nearby. If historical places are an important factor for the user, they may want to co

In [20]:
# Define extract_interests function
def extract_interests(user_prompt):
    """
    A simple function to extract potential interests from the user's prompt.
    This could be expanded to use NLP techniques to identify more complex patterns.
    """
    interests = []

    # Example keywords associated with different types of preferences
    keywords = {
        'sunny': ['sunny', 'beach', 'warm', 'hot', 'tropical'],
        'history': ['history', 'museum', 'historic', 'culture', 'ancient'],
        'nature': ['nature', 'outdoor', 'park', 'mountain', 'trail'],
        'rain': ['rainy', 'wet', 'storm']
    }

    # Check for keywords in the user prompt
    for category, words in keywords.items():
        if any(word in user_prompt.lower() for word in words):
            interests.append(category)

    return interests
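
`extract_interests` returns only category labels such as 'sunny'. If a later step compares those labels with document text, a beach description that never uses the word "sunny" will not match. One possible variant, sketched below, keeps the keywords that actually matched so they can be compared with the content. `extract_interest_keywords` and `score_content` are hypothetical names, and the sample description is invented.

```python
KEYWORDS = {
    'sunny': ['sunny', 'beach', 'warm', 'hot', 'tropical'],
    'history': ['history', 'museum', 'historic', 'culture', 'ancient'],
    'nature': ['nature', 'outdoor', 'park', 'mountain', 'trail'],
    'rain': ['rainy', 'wet', 'storm'],
}

def extract_interest_keywords(user_prompt, keywords=KEYWORDS):
    """Map each matched category to the keywords found in the prompt."""
    prompt = user_prompt.lower()
    matches = {}
    for category, words in keywords.items():
        hits = [word for word in words if word in prompt]
        if hits:
            matches[category] = hits
    return matches

def score_content(content, matches):
    """Count the categories with at least one matched keyword in the content."""
    text = content.lower()
    return sum(1 for hits in matches.values() if any(word in text for word in hits))

matches = extract_interest_keywords("I love sunny beaches and warm weather")
print(matches)  # {'sunny': ['sunny', 'beach', 'warm']}
print(score_content("Flamenco Beach is a public beach on Culebra.", matches))  # 1
```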

In [21]:
def rank_appropriate_locations(user_prompt, db):
    """
    This function ranks locations based on the user's interests and the Chroma vector store.

    Parameters:
    - user_prompt (str): User's query or preferences (e.g., "I love sunny beaches").
    - db (Chroma): Chroma vector store to query data from landmarks, municipalities, and news articles.

    Returns:
    - ranked_locations (list): List of locations ranked by relevance to the user's interests.
    """

    # Step 1: Preprocess the user prompt to extract interests
    interests = extract_interests(user_prompt)
    print(f"Extracted Interests: {interests}")  # Debugging line to check the extracted interests

    if not interests:
        print("No interests extracted from the user prompt.")  # If no interests are extracted

    # Step 2: Search Chroma vector store for relevant locations
    search_results = db.similarity_search(user_prompt, k=5)  # Adjust 'k' to retrieve more locations
    print(f"Search Results: {search_results}")  # Debugging line to check Chroma search results

    if not search_results:
        print("No results found in Chroma for the given prompt.")  # If no results were found

    # Step 3: Remove duplicate documents from the search results based on metadata
    unique_results = []
    seen = set()
    for doc in search_results:
        metadata = doc.metadata
        if metadata['filename'] not in seen:
            seen.add(metadata['filename'])
            unique_results.append(doc)

    print(f"Unique Search Results: {unique_results}")  # Debugging line to check unique search results

    # Step 4: Rank locations based on the interests
    ranked_locations = []
    for doc in unique_results:
        metadata = doc.metadata
        print(f"Document Metadata: {metadata}")  # Debugging line to check document metadata

        score = 0
        # Check if any of the interests match the document's page content (not just metadata)
        for interest in interests:
            # Check if interest matches the content of the document
            if interest.lower() in doc.page_content.lower():
                score += 1

        # If the location has relevant scores, add it to the ranked list
        if score > 0:
            ranked_locations.append({
                'location': metadata.get('landmark', metadata.get('municipality', 'Unknown')),
                'score': score,
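                # Include the page content as 'description' so the display code below can print it
                'metadata': {**metadata, 'description': doc.page_content}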
            })

    # Step 5: Sort locations by score (higher score means more relevant)
    ranked_locations = sorted(ranked_locations, key=lambda x: x['score'], reverse=True)
    print(f"Ranked Locations: {ranked_locations}")  # Debugging line to check the final output

    return ranked_locations

# Example usage: Get ranked location suggestions based on user preferences
user_prompt = "I love sunny beaches and warm weather"
# Assuming db is your Chroma vector store object
# Example: db = Chroma(persist_directory='path_to_your_chroma_db', embedding_function=embedding_function)
suggestions = rank_appropriate_locations(user_prompt, db)

# Display the suggestions
if suggestions:
    for suggestion in suggestions:
        print(f"Location: {suggestion['location']}")
        print(f"Score: {suggestion['score']}")
        print(f"Description: {suggestion['metadata'].get('description', 'No description available')}")
        print(f"Wikipedia URL: {suggestion['metadata'].get('url', 'No URL available')}")
        print("-" * 80)
else:
    print("No locations were ranked.")


Extracted Interests: ['sunny']
Document Metadata: {'filename': 'flamenco_beach.txt', 'landmark': 'Flamenco Beach', 'latitude': 18.331667, 'longitude': -65.318056, 'municipality': 'Culebra', 'source': 'landmarks', 'url': 'https://en.wikipedia.org/wiki/Flamenco_Beach'}
Document Metadata: {'filename': 'jobos_beach.txt', 'landmark': 'Jobos Beach', 'latitude': 18.514215, 'longitude': -67.075744, 'municipality': 'Isabela', 'source': 'landmarks', 'url': 'https://en.wikipedia.org/wiki/Jobos_Beach'}
Document Metadata: {'date': 'March 15, 1947', 'filename': '19470315_1.txt', 'locations': 'World War II, Puerto Rico, Culebra, puerto rico, San Juan, San Germán, world war ii', 'source': 'news'}
Document Metadata: {'filename': 'blue_beach_(vieques).txt', 'landmark': 'Blue Beach (Vieques)', 'latitude': 18.11305555555556, 'longitude': -65.3875, 'municipality': 'Vieques', 'source': 'landmarks', 'url': 'https://en.wikipedia.org/wiki/Blue_Beach_(Vieques)'}
Document Metadata: {'filename': 'playa_espinar.tx

In [38]:
#import markdown
from IPython.display import display, Markdown

# Example query
user_question = "Beach in San Juan"
docs = db.similarity_search(user_question, k=5)

# Print results
for doc in docs[0:5]:
    # print(doc.page_content, '\n')
    display(Markdown(doc.page_content))
    print(doc.metadata['source'])
    print("*"*80, '\n')


The San Juan Marriott Resort & Stellaris Casino is a hotel and casino located on the beach in Condado, San Juan, Puerto Rico. It is operated by Marriott International. \n

landmarks
******************************************************************************** 



Esperanza Beach is a popular beach on the southern coast of Vieques in La Esperanza, Puerto Real. In comparison to other beaches in the island which are located far away from populated areas, this beach is located close to La Esperanza and it hosts a number of hotels, restaurants, food kiosks and stores. [1][2] Esperanza Pier, located on the western part of the beach, is considered a landmark of Vieques. [3] It is a very popular weekend destination for locals and visitors alike. The beach is located between Sun Bay Beach and Black Sand Beach and both can be reached by foot from La Esperanza. [4][5][6]\n

landmarks
******************************************************************************** 



Mar Bella Beach, colloquially known as Puerto Nuevo Beach, is a beach in the municipality of Vega Baja in the north coast of Puerto Rico. The beach is often referred to as Puerto Nuevo Beach because it is located in the Puerto Nuevo barrio of Vega Baja; it is also referred to as the Balneario de Vega Baja or Balneario de Puerto Nuevo. [1] The beach is located approximately 45 minutes west of San Juan, making it popular with both locals and visitors. [2]\n

landmarks
******************************************************************************** 



Crash Boat Beach or Crashboat Beach on the northwestern coast of Puerto Rico is situated in the municipality of Aguadilla. \n

landmarks
******************************************************************************** 



Cayo Luis Peña, formerly South West Key, [1] is a small uninhabited island off the west coast of Culebra, an island municipality of Puerto Rico. The island is a nature reserve which forms part of the Culebra National Wildlife Refuge. Visitors are allowed on the island for nature walks, snorkeling, and swimming; however, visitors are not allowed to stay on the island overnight. The island is only accessible via private water taxis. This limited access results in relatively few visitors and the island and surrounding reefs are able to stay more pristine as a result. [2] The small number of visitors also makes the island more private for those willing to make the journey. [3] Luis Peña Beach is located on the north side of the island. The island is named after its second owner. \n

landmarks
******************************************************************************** 

