# Topic Query Functionality - Demo
This notebook showcases the **topic query functionality** of the place-based semantic search system, enabling users to investigate how specific themes are distributed across urban places through geosocial media signals.

The approach allows free-text input, which is semantically encoded using a transformer-based language model. This query embedding is then matched against a large set of precomputed embeddings derived from geotagged social media posts. To retrieve the most relevant content, the system uses a **FAISS** flat inner-product index (`IndexFlatIP`), which performs an exhaustive (exact) nearest-neighbor search; the inner products are then rescaled to cosine similarities.
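
As a rough illustration of the matching step, the sketch below uses only NumPy and small made-up vectors (not real embeddings) to show how an inner-product score relates to cosine similarity: dividing each dot product by the product of the two vector norms gives the cosine. The function name `cosine_from_inner_product` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def cosine_from_inner_product(query_vec, post_matrix):
    """Return cosine similarities between one query vector and each row of post_matrix."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    post_matrix = np.asarray(post_matrix, dtype=np.float32)
    # Raw inner products, one per stored post
    dots = post_matrix @ query_vec
    # Rescale by the vector norms to obtain cosine similarity
    norms = np.linalg.norm(post_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy data: three hypothetical 4-dimensional "post embeddings"
posts = np.array([
    [1.0, 0.0, 0.0, 0.0],   # same direction as the query
    [0.0, 2.0, 0.0, 0.0],   # orthogonal to the query
    [3.0, 3.0, 0.0, 0.0],   # 45 degrees from the query
])
query = np.array([2.0, 0.0, 0.0, 0.0])

scores = cosine_from_inner_product(query, posts)
ranking = np.argsort(-scores)  # indices of posts, most similar first
print(scores, ranking)
```

Because the rescaling removes the effect of vector length, posts are ranked purely by direction in embedding space.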

The results are then aggregated spatially into H3 hexagonal cells, revealing where in the urban environment posts most closely align with the meaning of the query. Each cell is colored by how strongly the topic is over- or underrepresented, and **word clouds** highlight the most frequent terms found in its matching posts.

This functionality supports spatial exploration of concepts such as leisure, nature, spirituality, or any other custom user-defined theme.

---

**Notebook Outline**

**1. Parse Data**

Load post embeddings and associated metadata for spatially-aware retrieval.

**2. Topic Query**

Process a free-text query into a semantic embedding and perform similarity search against geotagged posts.

**3. Visualization**

Generate word clouds for each spatial cluster or region containing thematically relevant posts.

**4. Tests**

To demonstrate the system's capabilities, we run three example queries that reflect distinct place-related themes:

- **Query 1:** *recreation and outdoor activities*  
- **Query 2:** *observing plants and animals in green spaces*  
- **Query 3:** *peaceful sacred places for meditation, prayer, or reflection*

Each example retrieves and visualizes the locations where posts most closely match the query.

---

**Data Availability**

The original geosocial media data used in this study, comprising geotagged posts from Instagram, Flickr, and X (formerly Twitter), cannot be publicly shared. All data utilized in this study was collected through official APIs or authorized services that, at the time of collection, explicitly prohibited the redistribution of user-generated content in accordance with their respective terms of service (e.g., the Twitter Developer Agreement and Instagram's API Terms of Use). While these specific agreements are no longer publicly available due to platform ownership changes and service restructuring, their restrictions were in effect and adhered to during the data acquisition period.


**Dependencies**

In [None]:
import pandas as pd
import numpy as np
import os
import re

import torch
from collections import Counter
import sqlite3

from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic

import h3
import faiss

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import folium
import matplotlib.colors as mcolors
from folium.plugins import MeasureControl
from folium import IFrame, Element
from folium.plugins import FloatImage
from folium import Tooltip
import base64
from io import BytesIO
from PIL import Image
from jenkspy import JenksNaturalBreaks
from matplotlib.colors import ListedColormap, to_hex

## 1. Parse Data


**Note**: Due to the data-sharing restrictions described above, certain parts of this notebook reference CSV input files that are not included in the repository. Consequently, code cells that load or process these files cannot be executed as-is. 

In [None]:
import os
# Path to the project folder that contains 01_Input and 02_Output
folder_path = '.'
INPUT = f'{folder_path}/01_Input'
OUTPUT = f'{folder_path}/02_Output'

file_name_class = 'Dresden_GSMDataset_classifiesMidGeoTextEmb_withoutOutliers.csv'

db_path_posts = f'{OUTPUT}/Posts_Embeddings.db'
df = pd.read_csv(f'{OUTPUT}/{file_name_class}')

## 2. Topic Query (Semantic Search)

**Helper Functions**

In [None]:
def plot_similarity_distribution(similarity_scores, query, z_factor):
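    """
    Plots a histogram of cosine similarity scores and marks the mean (μ)
    and the threshold T = μ + z_factor * σ.

    Args:
    - similarity_scores (list): List of cosine similarity scores.
    - query (str): Query text, shown as the plot title.
    - z_factor (float): Z-score factor used to draw the threshold line.
    """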
    plt.figure(figsize=(6,3))
    plt.hist(similarity_scores, bins=50, color='blue', alpha=0.7)
    plt.axvline(np.mean(similarity_scores), color='red', linestyle='dashed', linewidth=2, label="μ")
    plt.axvline(np.mean(similarity_scores) + z_factor * np.std(similarity_scores), color='green', linestyle='dashed', linewidth=2, label=f"T = μ + {z_factor}σ")
    plt.xlabel("Cosine Similarity Score")
    plt.ylabel("Frequency")
    plt.title(f"Query: {query}")
    plt.legend()
    plt.show()


def compute_z_score_threshold(cosine_similarities, z_factor=1.5):
    """
    Computes a dynamic threshold using the Z-score method.

    Args:
        cosine_similarities (list): List of cosine similarity scores.
        z_factor (float): Z-score factor (e.g., 1.5 means mean + 1.5 * std).

    Returns:
        float: The calculated similarity threshold.
    """
    mean_sim = np.mean(cosine_similarities)
    std_sim = np.std(cosine_similarities)
    return mean_sim + (z_factor * std_sim)
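
To make the effect of `z_factor` concrete, here is a small worked sketch with made-up similarity scores (hypothetical values, not taken from the dataset). It recomputes the same mean-plus-standard-deviations rule in a standalone helper, here called `z_score_threshold`, and counts how many scores survive for several factors.

```python
import numpy as np

def z_score_threshold(scores, z_factor=1.5):
    """Mean plus z_factor population standard deviations."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean() + z_factor * scores.std()

# Hypothetical similarity scores: many weak matches and a few strong ones
scores = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.9])

for z in (1.0, 2.0, 3.0):
    threshold = z_score_threshold(scores, z)
    kept = int((scores >= threshold).sum())
    print(f"z={z}: threshold={threshold:.3f}, posts kept={kept}")
```

A larger factor raises the threshold and keeps fewer posts, which is why the example queries further down use different `z_factor` values.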

In [None]:
def semantic_search_faiss(input_text, db_path, df, model, z_factor=1.5, chunk_size=10000):
    """
    Performs FAISS-based semantic search using precomputed embeddings from a SQLite database.
    It applies cosine similarity scaling, calculates a dynamic threshold based on Z-score, 
    and generates a histogram of similarity scores.

    Args:
        input_text (str): The query text for semantic search.
        db_path (str): Path to the SQLite database containing embeddings.
        df (pd.DataFrame): DataFrame containing metadata to merge with search results.
        model (SentenceTransformer): A pre-loaded SentenceTransformer model.
        z_factor (float): The Z-score factor for threshold calculation.
        chunk_size (int, optional): Number of rows to process at a time when reading from the database.

    Returns:
        pd.DataFrame: A DataFrame of matched entries with their cosine similarity scores.
    """

    # Compute query embedding
    query_embedding = model.encode(input_text, convert_to_numpy=True)

    # Zero-pad the query to the 768 dimensions of the stored embeddings
    # if the model returns a shorter vector
    if len(query_embedding) < 768:
        padding = np.zeros(768 - len(query_embedding), dtype=np.float32)
        query_embedding = np.concatenate((query_embedding, padding))

    # Compute the norm (magnitude) of the query embedding
    query_norm = np.linalg.norm(query_embedding)
    query_embedding = query_embedding.reshape(1, -1)

    # Initialize FAISS index (dot product similarity)
    dimension = 768
    index = faiss.IndexFlatIP(dimension)

    # Load embeddings from SQLite and add to FAISS index
    conn = sqlite3.connect(db_path)
    query = "SELECT post_guid, embedding FROM post_embeddings"
    embeddings, post_guids = [], []

    for chunk in pd.read_sql_query(query, conn, chunksize=chunk_size):
        chunk['embedding'] = chunk['embedding'].apply(lambda x: np.frombuffer(x, dtype=np.float32))
        embeddings.extend(chunk['embedding'].tolist())
        post_guids.extend(chunk['post_guid'].tolist())

    conn.close()

    embeddings = np.array(embeddings, dtype=np.float32)

    # Compute norms (magnitudes) of stored embeddings
    embedding_norms = np.linalg.norm(embeddings, axis=1)

    # Add embeddings to FAISS index
    index.add(embeddings)

    # Perform FAISS search (dot product similarity)
    distances, indices = index.search(query_embedding, len(post_guids))

    # Extract raw scores (dot products)
    raw_scores = distances[0]

    if len(raw_scores) == 0:
        print("⚠️ No results found.")
        return pd.DataFrame()

    # Compute cosine similarity: dot product / (query norm * embedding norm)
    cosine_similarities = raw_scores / (query_norm * embedding_norms[indices[0]])

    # Compute dynamic Z-score threshold
    min_similarity = compute_z_score_threshold(cosine_similarities, z_factor)

    # Plot histogram of similarity scores
    plot_similarity_distribution(cosine_similarities, input_text, z_factor)

    print(f"🔹 Dynamic Cosine Similarity Threshold (Z-score {z_factor}): {min_similarity:.4f}")

    # Filter results based on the computed similarity threshold
    results = [
        (post_guids[indices[0][i]], cosine_similarities[i])
        for i in range(len(cosine_similarities)) if cosine_similarities[i] >= min_similarity
    ]

    results_df = pd.DataFrame(results, columns=['post_guid', 'score'])
    print(f"✅ {len(results_df)} posts were identified as relevant to the query.")

    # Merge with metadata and return results
    results = pd.merge(results_df, df, on='post_guid', how='inner')
    return results
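
Since the embeddings database cannot be shared, the following sketch builds a tiny in-memory stand-in with synthetic vectors. It assumes that each embedding is stored as the raw bytes of a `float32` array, which is what the `np.frombuffer(..., dtype=np.float32)` call above implies; the table and column names mirror the query in the function, while the dimension and the data are made up.

```python
import sqlite3
import numpy as np

# In-memory stand-in for the embeddings database (synthetic data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post_embeddings (post_guid TEXT PRIMARY KEY, embedding BLOB)")

rng = np.random.default_rng(seed=0)
dimension = 8  # small for readability; the notebook uses 768
synthetic = {f"post_{i}": rng.random(dimension, dtype=np.float32) for i in range(3)}

# Store each vector as raw float32 bytes
conn.executemany(
    "INSERT INTO post_embeddings VALUES (?, ?)",
    [(guid, vec.tobytes()) for guid, vec in synthetic.items()],
)

# Read the rows back and decode the bytes with the same dtype
rows = conn.execute(
    "SELECT post_guid, embedding FROM post_embeddings ORDER BY post_guid"
).fetchall()
conn.close()

post_guids = [guid for guid, _ in rows]
embeddings = np.vstack([np.frombuffer(blob, dtype=np.float32) for _, blob in rows])
print(post_guids, embeddings.shape)
```

The dtype used for decoding has to match the one used for writing; reading `float32` bytes as `float64` would silently produce vectors of half the length with meaningless values.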


## 3. Visualization

**Helper Functions Visualization**

In [None]:
def h3_to_boundary(h3_index):
    """
    Convert an H3 index to its boundary in (lat, lon) format.

    Parameters:
    - h3_index (str): H3 index.

    Returns:
    - boundary (list): List of (lat, lon) tuples.
    """
    boundary = h3.h3_to_geo_boundary(h3_index)
    return [(lat, lon) for lat, lon in boundary]


def generate_wordclouds_for_results(results_df, text_column):
    """
    Generates word clouds for all H3 hex bins in results_df.

    Parameters:
    - results_df (pd.DataFrame): The DataFrame with search results containing 'h3_index_9' and 'score'.
    - text_column (str): The column in results_df containing the text lists.
    
    Returns:
    - dict: A dictionary where keys are H3 indices, and values are word cloud images (PIL.Image).
    """
    wordclouds_dict = {}

    for h3_index in results_df['h3_index_9'].unique():
        # Extract relevant posts for the H3 index
        relevant_texts = results_df[results_df['h3_index_9'] == h3_index][text_column].astype(str)
        concatenated_text = ' '.join(relevant_texts)

        # Count word frequencies using Counter
        word_counts = Counter(concatenated_text.split())

        # Generate the word cloud using the word frequencies
        wordcloud = WordCloud(
            width=600,
            height=300,
            background_color='white',
            contour_width=0.5,
            contour_color='black',
            relative_scaling='auto',
            color_func=lambda *args, **kwargs: (37,52,148),
            normalize_plurals=True,
            repeat=False,
            min_word_length=3
        ).generate_from_frequencies(word_counts)

        # Save the word cloud to the dictionary
        wordclouds_dict[h3_index] = wordcloud.to_image()

    return wordclouds_dict
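
The counting step that feeds each word cloud can be sketched with the standard library alone. The helper name `token_frequencies` and the sample strings are hypothetical; the sketch also filters short tokens explicitly before counting, an assumption made here so that the length filter does not depend on how the word cloud library treats precomputed frequencies.

```python
from collections import Counter

def token_frequencies(texts, min_word_length=3):
    """Count whitespace-separated tokens across texts, skipping very short ones."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in str(text).split() if len(tok) >= min_word_length)
    return counts

# Hypothetical token strings for the posts of one H3 cell
cell_texts = ["park walk sun", "walk dog park", "park picnic"]
frequencies = token_frequencies(cell_texts)
print(frequencies.most_common(2))
```

A dictionary like `frequencies` is the kind of input that `generate_from_frequencies` expects.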


In [None]:
def visualize_h3_grid(df, results_df, text_column, wordclouds=None):
    
    """
    Visualizes geographic H3 grid cells on a map, colored based on the chi-normalized 
    representation of topic relevance in each cell. This function uses Folium to create 
    an interactive map with polygons representing H3 grid cells and optional word cloud popups.

    Key features:
    - Computes the chi-normalized value for topic relevance across H3 grid cells.
    - Uses Jenks Natural Breaks for data classification to define color bins.
    - Displays tooltips with detailed statistics (e.g., total posts, topic posts, chi value).
    - Optionally includes word clouds as popups for individual H3 grid cells.
    - Adds a dynamic title and legend for better interpretability.

    Args:
        df (pd.DataFrame): DataFrame containing overall post data with H3 indices.
        results_df (pd.DataFrame): DataFrame containing filtered topic-related posts.
        text_column (str): Column name for textual data (currently unused inside this function).
        wordclouds (dict, optional): A dictionary of word clouds keyed by H3 indices.

    Returns:
        folium.Map: An interactive Folium map visualizing the H3 grid cells.
    """
    
    # Calculate the center of the map
    center_lat = results_df['latitude'].mean()
    center_lon = results_df['longitude'].mean()

    # Initialize the Folium map
    folium_map = folium.Map(location=[center_lat, center_lon], zoom_start=11, tiles="Cartodb positron", control_scale = True)

    # Calculate the total posts per H3 cell
    total_posts_per_cell = df.groupby('h3_index_9').size()
    topic_posts_per_cell = results_df.groupby('h3_index_9').size()
    
    # Signed chi normalization: scale the observed topic counts to the total post
    # volume, then compare them with the expected (total) counts per cell.
    # Cells without any topic posts become NaN through index alignment and are dropped.
    sum_exp = len(df)
    sum_obs = topic_posts_per_cell.sum()
    expected = total_posts_per_cell
    with np.errstate(divide='ignore', invalid='ignore'):  # Handle divisions gracefully
        chi_values = ((topic_posts_per_cell * (sum_exp / sum_obs)) - expected) / np.sqrt(expected)
    chi_values = chi_values.replace([np.inf, -np.inf], np.nan)  # Replace infinities with NaN
    normalized_values = chi_values.dropna()  # Exclude invalid calculations

    # Custom discrete colormap for chi normalization
    custom_colors = ['#edf8b1', '#c7e9b4', '#7fcdbb', '#225ea8', '#0c2c84']
    colormap = ListedColormap(custom_colors)

    # Compute Jenks Natural Breaks
    jenks = JenksNaturalBreaks(n_classes=5)
    jenks.fit(normalized_values)
    jenks_breaks = jenks.breaks_

    # Iterate over unique H3 indices and visualize each cell
    for h3_index, value in normalized_values.items():
        boundary = h3.h3_to_geo_boundary(h3_index)
        # Determine color based on Jenks classification
        bin_index = np.digitize(value, jenks_breaks, right=True) - 1
        bin_index = min(max(bin_index, 0), len(jenks_breaks) - 2)  # Ensure valid index
        color = to_hex(colormap(bin_index / (len(jenks_breaks) - 1)))

        # Retrieve counts for the tooltip
        total_posts = total_posts_per_cell.get(h3_index, 0)
        topic_posts = topic_posts_per_cell.get(h3_index, 0)

        # Tooltip text with normalized value and counts
        tooltip_text = (
            f"H3: {h3_index}<br>"
            f"Total Posts: {total_posts}<br>"
            f"Topic Posts: {topic_posts}<br>"
            f"Chi: {value:.2f}"
        )

        # Add a word cloud popup if applicable
        if wordclouds and h3_index in wordclouds:
            img = wordclouds[h3_index]
            buf = BytesIO()
            img.save(buf, format="PNG")
            buf.seek(0)
            img_data = base64.b64encode(buf.getvalue()).decode()
            popup_html = f'<img src="data:image/png;base64,{img_data}" style="width:300px;height:150px;">'
            popup = folium.Popup(popup_html, max_width=300)
        else:
            popup = None

        # Create a Folium polygon for the H3 cell
        polygon = folium.Polygon(
            locations=boundary,
            color='gray',
            fill_color=color,
            fill_opacity=0.6,
            weight=0.5,
            tooltip=folium.Tooltip(tooltip_text),
            )

        if popup:
            polygon.add_child(popup)

        folium_map.add_child(polygon)
                
    # Add the title and legend once, after all cells have been drawn
    # (adding them inside the loop would duplicate them for every cell).
    # Note: `query` is read from the notebook's global scope.
    title_html = f'''
    <div style="
        position: fixed;
        top: 10px; left: 50px;
        z-index:9999; font-size:14px; font-weight: bold;
        background-color: rgba(255, 255, 255, 0.7); padding: 5px; border: 1px solid grey;">
        Topic-Place Representation for Query: <i>{query}</i>
    </div>
'''
    folium_map.get_root().html.add_child(folium.Element(title_html))

    # Add legend for the chi values
    legend_html = '''
    <div style="position: fixed;
                bottom: 50px; left: 10px; width: 210px; height: 155px;
                background-color: white; opacity: 0.8; z-index:9999; font-size:12px;
                border:1px solid grey; padding: 10px;">
        <b>Topic-Place Representation</b><br>
        <small>normalized by signed chi expectation</small><br><br>
        <i style="background:#edf8b1; width:20px; height:10px; display:inline-block;"></i> Strongly Underrepresented<br>
        <i style="background:#c7e9b4; width:20px; height:10px; display:inline-block;"></i> Underrepresented<br>
        <i style="background:#7fcdbb; width:20px; height:10px; display:inline-block;"></i> Moderately Represented<br>
        <i style="background:#225ea8; width:20px; height:10px; display:inline-block;"></i> Overrepresented<br>
        <i style="background:#0c2c84; width:20px; height:10px; display:inline-block;"></i> Strongly Overrepresented
    </div>
'''

    folium_map.get_root().html.add_child(folium.Element(legend_html))

    return folium_map
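
The signed chi value used for coloring can be checked by hand on a few hypothetical cells. The sketch below is a simplified NumPy version under two assumptions: every cell has at least one topic post, and the total post count equals the sum of the per-cell totals. It also replaces the Jenks Natural Breaks classification with hand-picked break points, purely to show how `np.digitize` maps values to class indices.

```python
import numpy as np

def signed_chi(topic_counts, total_counts):
    """Signed chi value per cell: (scaled observed - expected) / sqrt(expected)."""
    observed = np.asarray(topic_counts, dtype=float)
    expected = np.asarray(total_counts, dtype=float)
    # Scale topic counts so that they sum to the total number of posts
    scale = expected.sum() / observed.sum()
    return (observed * scale - expected) / np.sqrt(expected)

# Hypothetical counts for four cells
total_posts = np.array([100, 400, 25, 475])   # all posts per cell
topic_posts = np.array([30, 40, 5, 25])       # posts matching the query per cell

chi = signed_chi(topic_posts, total_posts)

# Assign each value to a class, given illustrative (not Jenks) break points
breaks = np.array([chi.min(), -5.0, 0.0, 5.0, chi.max()])
classes = np.clip(np.digitize(chi, breaks, right=True) - 1, 0, len(breaks) - 2)
print(np.round(chi, 2), classes)
```

Positive values mark cells where the topic appears more often than the overall post volume would suggest, and negative values mark cells where it appears less often.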

## 4. Tests

**Define sentence_transformers model for the query**

In [None]:
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

**Semantic Query 1: recreation and outdoor activities**

In [None]:
query = 'recreation and outdoor activities'
results_df_0 = semantic_search_faiss(query, db_path_posts, df, model, z_factor=2)

In [None]:
%%time
wordclouds_0 = generate_wordclouds_for_results(results_df_0, 'tokens')
map_outdoor = visualize_h3_grid(df, results_df_0, 'tokens', wordclouds=wordclouds_0)
map_outdoor

In [None]:
map_outdoor.save("./03_Visualizations/Recreation&OutdoorActivities.html")

**Semantic Query 2: observing plants and animals in green spaces**

In [None]:
query = 'observing plants and animals in green spaces'
results_df_0 = semantic_search_faiss(query, db_path_posts, df, model, z_factor=3)

In [None]:
%%time
wordclouds_0 = generate_wordclouds_for_results(results_df_0, 'tokens')
map_flora = visualize_h3_grid(df, results_df_0, 'tokens', wordclouds=wordclouds_0)
map_flora

In [None]:
map_flora.save("./03_Visualizations/Plants&Animals_Map_Results.html")

**Semantic Query 3: peaceful sacred places for meditation, prayer, or reflection**

In [None]:
query = 'peaceful sacred places for meditation, prayer, or reflection'
results_df_0 = semantic_search_faiss(query, db_path_posts, df, model, z_factor=2.5)

In [None]:
%%time
wordclouds_0 = generate_wordclouds_for_results(results_df_0, 'tokens')
map_spir = visualize_h3_grid(df, results_df_0, 'tokens', wordclouds=wordclouds_0)
map_spir

In [None]:
map_spir.save("./03_Visualizations/Spirituality_Map_Results.html")

In [None]:
!jupyter nbconvert --to html "TopicQueryDemo_PlaceBasedSemanticSearch.ipynb" --output "TopicQueryDemo_PlaceBasedSemanticSearch.html"