# Building the Database
This step involves:
1. Creating a structured database from the processed data (landmarks and municipalities) with relevant fields such as name, description, location, and category.
2. Indexing the data in a vector search backend to enable efficient retrieval. This notebook uses FAISS; Elasticsearch and Pinecone are alternatives.
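
As a rough illustration of step 1, the sketch below maps one raw record onto the fields listed above. It assumes the processed JSON records carry `title` and `content` keys (the names the notebook renames later); the `to_record` helper, the optional `location` key, and the sample values are hypothetical.

```python
def to_record(raw, category):
    """Map a raw processed record to the database schema (hypothetical helper)."""
    return {
        "name": raw.get("title", ""),
        "description": raw.get("content", ""),
        "location": raw.get("location"),  # assumed optional; None when absent
        "category": category,
    }

# Made-up sample record, only to show the resulting shape
sample = {"title": "Example Landmark", "content": "A short description."}
record = to_record(sample, "landmark")
print(record)
```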

## Output
After running the script:

1. The FAISS index file will be saved at './data/landmarks_municipalities_faiss.index'.
2. The metadata CSV will be saved at './data/landmarks_municipalities_metadata.csv'.
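
A quick way to confirm that both files were written is to check for them on disk. The following is a minimal sketch using only the standard library; `missing_outputs` is a hypothetical helper name, and the default directory assumes the `./data/` path used in this notebook.

```python
import os

def missing_outputs(dir_path="./data/"):
    """Return the expected output files that are not present in dir_path."""
    expected = [
        "landmarks_municipalities_faiss.index",
        "landmarks_municipalities_metadata.csv",
    ]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(dir_path, name))
    ]

missing = missing_outputs()
print("All outputs present." if not missing else f"Missing: {missing}")
```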

In [4]:
# !pip install pandas sentence-transformers faiss-cpu tf-keras

In [27]:
!pip show tensorflow

Name: tensorflow
Version: 2.18.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/anaconda3/lib/python3.12/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, requests, setuptools, six, tensorboard, termcolor, typing-extensions, wrapt
Required-by: tensorflow-macos, tf_keras


In [31]:
# !pip uninstall tensorflow keras -y

Found existing installation: tensorflow 2.18.0
Uninstalling tensorflow-2.18.0:
  Successfully uninstalled tensorflow-2.18.0
Found existing installation: keras 3.8.0
Uninstalling keras-3.8.0:
  Successfully uninstalled keras-3.8.0


In [33]:
!pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.18.0-cp312-cp312-macosx_12_0_arm64.whl.metadata (4.0 kB)
Collecting keras>=3.5.0 (from tensorflow)
  Using cached keras-3.8.0-py3-none-any.whl.metadata (5.8 kB)
Using cached tensorflow-2.18.0-cp312-cp312-macosx_12_0_arm64.whl (239.6 MB)
Using cached keras-3.8.0-py3-none-any.whl (1.3 MB)
Installing collected packages: keras, tensorflow
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-macos 2.16.2 requires tensorflow==2.16.2; platform_system == "Darwin" and platform_machine == "arm64", but you have tensorflow 2.18.0 which is incompatible.
Successfully installed keras-3.8.0 tensorflow-2.18.0


In [19]:
import pandas as pd
# import keras
# import tf_keras as keras
from sentence_transformers import SentenceTransformer
import faiss
import os
import json

RuntimeError: Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.
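
The error message points at a Keras 3 installation without the `tf-keras` compatibility package. The sketch below is one possible way to confirm that from Python before retrying the import; it only reads installed package versions through the standard library. The helper names `installed_version` and `keras_conflict` are hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def keras_conflict():
    """True when Keras 3+ is installed without the tf-keras compatibility package."""
    keras_version = installed_version("keras")
    if keras_version is None:
        return False
    major_text = keras_version.split(".")[0]
    major = int(major_text) if major_text.isdigit() else 0
    return major >= 3 and installed_version("tf-keras") is None

# Print the versions relevant to the import error above
for pkg in ("tensorflow", "keras", "tf-keras", "transformers", "sentence-transformers"):
    print(f"{pkg}: {installed_version(pkg)}")
print("Keras 3 without tf-keras:", keras_conflict())
```

If the last line prints `True`, installing `tf-keras` as the error message suggests is the likely fix.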

In [None]:
dir_path = './data/' # Data Path

In [None]:
# Step 1: Load processed data
def load_processed_data(json_path):
    with open(json_path, 'r', encoding='utf-8') as file:
        return json.load(file)

# Step 2: Build a structured database
def build_structured_database(landmarks_data, municipalities_data):
    # Create a DataFrame for landmarks
    landmarks_df = pd.DataFrame(landmarks_data)
    landmarks_df["category"] = "landmark"  # Add a category for landmarks

    # Create a DataFrame for municipalities
    municipalities_df = pd.DataFrame(municipalities_data)
    municipalities_df["category"] = "municipality"  # Add a category for municipalities

    # Combine both datasets into a single database
    combined_df = pd.concat([landmarks_df, municipalities_df], ignore_index=True)
    combined_df = combined_df.rename(columns={"title": "name", "content": "description"})

    return combined_df

# Step 3: Index data using FAISS
def index_data_with_faiss(dataframe, embedding_model_name="all-MiniLM-L6-v2"):
    # Load a sentence embedding model
    model = SentenceTransformer(embedding_model_name)

    # Generate embeddings for the descriptions (missing values become empty strings)
    descriptions = dataframe["description"].fillna("").astype(str).tolist()
    embeddings = model.encode(descriptions, convert_to_numpy=True, show_progress_bar=True)

    # Define a FAISS index
    dimension = embeddings.shape[1]  # Embedding size
    index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) search by L2 distance
    index.add(embeddings)

    return index, embeddings, dataframe

# Step 4: Save the FAISS index
def save_faiss_index(index, file_path):
    faiss.write_index(index, file_path)

# Step 5: Save metadata for later use
def save_metadata(dataframe, file_path):
    dataframe.to_csv(file_path, index=False, encoding='utf-8')

# Paths to processed data
landmarks_json_path = f"{dir_path}processed_landmarks.json"
municipalities_json_path = f"{dir_path}processed_municipalities.json"

# Paths for outputs
faiss_index_path = f"{dir_path}landmarks_municipalities_faiss.index"
metadata_path = f"{dir_path}landmarks_municipalities_metadata.csv"

# Main build process
# Load processed data
landmarks_data = load_processed_data(landmarks_json_path)
municipalities_data = load_processed_data(municipalities_json_path)

# Build a structured database
database = build_structured_database(landmarks_data, municipalities_data)

# Index data with FAISS
index, embeddings, dataframe_with_metadata = index_data_with_faiss(database)

# Save the FAISS index and metadata
save_faiss_index(index, faiss_index_path)
save_metadata(dataframe_with_metadata, metadata_path)

print("Database and FAISS index built successfully!")
print(f"- FAISS Index Path: {faiss_index_path}")
print(f"- Metadata Path: {metadata_path}")
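
For intuition about what a flat L2 index computes at search time, the sketch below reproduces an exact nearest-neighbor search in NumPy on toy vectors. It is an illustration rather than the FAISS implementation: `brute_force_l2` and the vectors are made up, and it returns squared L2 distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np

def brute_force_l2(corpus, queries, top_k):
    """Return (squared distances, indices) of the top_k nearest corpus rows per query."""
    # Pairwise squared L2 distances, shape (n_queries, n_corpus)
    diffs = queries[:, None, :] - corpus[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row in ascending order and keep the top_k closest
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy 2-dimensional "embeddings"
corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = brute_force_l2(corpus, queries, top_k=2)
print(indices)    # nearest corpus rows first
print(distances)  # matching squared L2 distances
```

A flat index scans every stored vector per query, which is exact but scales linearly with the number of rows.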

## Python implementation for querying the FAISS index created in the previous step

This script takes a text query, embeds it with the same model used to build the index, and retrieves the most relevant landmarks or municipalities from the FAISS index.
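
Each hit comes back with a distance, and with `IndexFlatL2` that value is a squared L2 distance, so smaller means closer. If the embeddings are unit-normalized (an assumption worth checking for the model in use, for example with `numpy.linalg.norm`), a squared distance `d` corresponds to a cosine similarity of `1 - d / 2`. The helper below is a hypothetical convenience for reading the scores, not part of the notebook.

```python
def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    # For unit vectors a and b: ||a - b||^2 = 2 - 2 * (a . b)
    return 1.0 - squared_distance / 2.0

# Identical, orthogonal-ish, and opposite-ish reference points
for d in (0.0, 1.0, 2.0):
    print(d, squared_l2_to_cosine(d))
```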

In [49]:
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer

In [54]:
# Function to load FAISS index
def load_faiss_index(index_path):
    return faiss.read_index(index_path)

# Function to load metadata
def load_metadata(metadata_path):
    return pd.read_csv(metadata_path)

# Function to query the FAISS index
def query_faiss_index(index, metadata, query, model, top_k=5):
    # Generate embedding for the query
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Search the FAISS index
    distances, indices = index.search(query_embedding, top_k)

    # Retrieve metadata for the top results
    results = []
    for i, idx in enumerate(indices[0]):
        # FAISS pads with -1 when fewer than top_k vectors are available
        if 0 <= idx < len(metadata):
            result = {
                "rank": i + 1,
                "name": metadata.iloc[idx]["name"],
                "category": metadata.iloc[idx]["category"],
                "description": metadata.iloc[idx]["description"],
                "distance": float(distances[0][i])  # Squared L2 distance
            }
            results.append(result)

    return results

# Paths to FAISS index and metadata
faiss_index_path = f"{dir_path}landmarks_municipalities_faiss.index"
metadata_path = f"{dir_path}landmarks_municipalities_metadata.csv"

# Main query process
# Load FAISS index and metadata
index = load_faiss_index(faiss_index_path)
metadata = load_metadata(metadata_path)

# Load the sentence embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# User query
user_query = input("Enter your query (e.g., 'historic sites in San Juan'): ")

# Query the FAISS index
top_k_results = query_faiss_index(index, metadata, user_query, embedding_model, top_k=5)

# Display results
print("\nTop Results:")
for result in top_k_results:
    print(f"Rank {result['rank']}:")
    print(f"  Name: {result['name']}")
    print(f"  Category: {result['category']}")
    print(f"  Description: {str(result['description'])[:200]}...")  # str() guards against missing (NaN) descriptions
    print(f"  Distance: {result['distance']:.4f}")
    print("\n")

Enter your query (e.g., 'historic sites in San Juan'):  Arecibo



Top Results:
Rank 1:
  Name: Arecibo, Puerto Rico - Wikipedia
  Category: municipality
  Description: Arecibo (/ˌærəˈsiːboʊ/; Spanish pronunciation: [aɾeˈsiβo]) is a city and municipality on the northern coast of Puerto Rico, on the sho...
  Distance: 0.7333


Rank 2:
  Name: Arecibo Telescope - Wikipedia
  Category: landmark
  Description: The Arecibo Telescope was a 305 m (1,000 ft) spherical reflector radio telescope built into a natural sinkhole at the Arecibo Observatory located near Arecibo, Puerto Rico. A cable-mount steerable rec...
  Distance: 0.9911


Rank 3:
  Name: Arecibo Observatory - Wikipedia
  Category: landmark
  Description: The Arecibo Observatory, also known as the National Astronomy and Ionosphere Center (NAIC) and formerly known as the Arecibo Ionosphere Observatory, is an observatory in Barrio Esperanza, Arecibo, ...
  Distance: 1.0893


Rank 4:
  Name: Arecibo barrio-pueblo - Wikipedia
  Cate