🔧 **Installing Dependencies**

Before fetching Point of Interest (POI) data from Overture Maps, install the overturemaps Python package. The package provides access to spatial datasets including POIs, transportation, and administrative boundaries.

In [None]:
!pip install overturemaps

📥 **Clone Project Repository [ Google Colab Only ]**

This step clones the GitHub repository containing all necessary files, including:

*   Data files (category_tree.json, category_keywords.py)
*   Helper modules for scraping, classification, and evaluation
*   Jupyter notebooks for experimentation and testing

**Note:** If you're working locally (e.g., in VSCode or JupyterLab), this step is not necessary. Just open the project directly.


In [15]:
!git clone https://github.com/project-terraforma/Automating-POI-Categorization-AGCG
%cd Automating-POI-Categorization-AGCG

Cloning into 'Automating-POI-Categorization-AGCG'...
remote: Enumerating objects: 130, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 130 (delta 65), reused 58 (delta 22), pack-reused 0 (from 0)
Receiving objects: 100% (130/130), 165.63 KiB | 4.73 MiB/s, done.
Resolving deltas: 100% (65/65), done.
/content/Automating-POI-Categorization-AGCG/Automating-POI-Categorization-AGCG/Automating-POI-Categorization-AGCG


📁 **Add Project Folders to Python Path**

This step ensures that Python can locate and import custom modules (e.g., web_scraper, category_utils) from subfolders in the project.

By default, Python looks for modules in the current directory and in the installed-package locations, but not in a project's subdirectories. Since our custom code is organized into subdirectories (src, data, Testing), we add these to the module search path using sys.path.append().

In [4]:
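import sys, os
# ".." assumes the notebook runs from a subfolder such as notebooks/;
# use "." instead if the working directory is already the project root.
project_root = os.path.abspath("..")
# Make the modules in these subfolders importable
sys.path.append(os.path.join(project_root, "src"))
sys.path.append(os.path.join(project_root, "data"))
sys.path.append(os.path.join(project_root, "Testing"))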
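
The same idea can be wrapped in a small helper that skips folders that do not exist and avoids adding the same path twice when the cell is re-run. This is a minimal sketch, not part of the project: the name `add_project_paths` is hypothetical, and the default folder names are simply the ones used above.

```python
import sys
from pathlib import Path

def add_project_paths(project_root, subfolders=("src", "data", "Testing")):
    """Append existing project subfolders to sys.path, skipping duplicates.

    Hypothetical helper for illustration; returns the paths it actually added.
    """
    added = []
    root = Path(project_root).resolve()
    for name in subfolders:
        folder = root / name
        path_str = str(folder)
        # Only add real directories, and only once per session
        if folder.is_dir() and path_str not in sys.path:
            sys.path.append(path_str)
            added.append(path_str)
    return added
```

Calling `add_project_paths("..")` from a notebooks folder, or `add_project_paths(".")` from the project root, would then replace the three explicit `sys.path.append()` calls.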

📦 **Importing Required Libraries and Custom Modules**

This block sets up all necessary dependencies for data processing, classification, and evaluation:



In [16]:
# Third-party and standard-library imports
import pandas as pd              # Data manipulation
import numpy as np               # Numerical computations
import overturemaps              # POI dataset loader from Overture Maps
import json                      # JSON parsing

# Custom modules from the project
import web_scraper as webScraper         # For meta/about page scraping
import category_utils as util            # Tree navigation and validation functions
import testing_utils as test             # Accuracy and evaluation helpers

# Static keyword data
from category_keywords import category_keywords  # Predefined keyword dictionary for boosting classification

📂 **Load Category Tree Structure**

This block loads the hierarchical category structure from a local JSON file:


*   **category_tree.json** defines the nested category taxonomy used for classification (e.g., education > school > preschool).

*   The relative path data/category_tree.json assumes the notebook's working directory is the project root. sys.path does not affect open(), so adjust the path if you run from another folder (e.g., /notebooks).

*   This structure is critical for both rule-based keyword scoring and SBERT-based classification.

In [6]:
with open("data/category_tree.json", "r", encoding="utf-8") as f:
    category_tree = json.load(f)
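
To get a feel for a hierarchy such as education > school > preschool, it can help to list every path in the tree. The sketch below assumes the tree is a nested dictionary whose keys are category names and whose values are dictionaries of children; the real category_tree.json may be shaped differently, and `iter_category_paths` and `sample_tree` are hypothetical names used only for illustration.

```python
def iter_category_paths(tree, prefix=()):
    """Yield the path from the root to every node of a nested-dict tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        # Recurse only when the node has a dict of children
        if isinstance(children, dict):
            yield from iter_category_paths(children, path)

# Hypothetical miniature tree, not the project's actual data
sample_tree = {"education": {"school": {"preschool": {}}, "library": {}}}
paths = [" > ".join(p) for p in iter_category_paths(sample_tree)]
print(paths)
```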

🌐 **Define Bounding Box and Fetch POI Data**

This block defines a bounding box for a geographic region and fetches Points of Interest (POIs) from the Overture Maps dataset using the **fetch_overture_poi_data()** utility function:


*   bbox_sf defines a rectangular region covering San Jose, CA, given as (min_lon, min_lat, max_lon, max_lat).

*   The first argument, "place", selects the Overture theme to retrieve, which is the one that contains POIs.

*   The result is stored in a DataFrame containing raw POI data (names, websites, categories, etc.).

**Tip:** You can easily generate bounding box coordinates using [boundingbox.klokantech.com](https://boundingbox.klokantech.com/)

Select your desired region on the map and copy the coordinates for use in your query.
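
Because swapping latitude and longitude is an easy mistake when copying coordinates, a quick sanity check before fetching can save a failed or empty query. The following is a minimal sketch under stated assumptions: `validate_bbox` and `parse_csv_bbox` are hypothetical helpers (not part of the project or of overturemaps), and the copied text is assumed to be comma-separated in (min_lon, min_lat, max_lon, max_lat) order.

```python
def validate_bbox(bbox):
    """Check a (min_lon, min_lat, max_lon, max_lat) tuple and return it as floats."""
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in bbox)
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    return (min_lon, min_lat, max_lon, max_lat)

def parse_csv_bbox(text):
    """Parse 'min_lon,min_lat,max_lon,max_lat' text into a validated tuple."""
    return validate_bbox(tuple(float(part) for part in text.split(",")))

# The San Jose box used in this notebook
bbox = parse_csv_bbox("-122.004946,37.28487,-121.834263,37.416253")
```

A box with latitude and longitude swapped, such as `(37.28487, -122.004946, 37.416253, -121.834263)`, raises a `ValueError` here because -122 is not a valid latitude.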



In [7]:
bbox_sf = (-122.004946,37.28487,-121.834263,37.416253)  # San Jose bounding box
df_sf_pois = test.fetch_overture_poi_data("place", bbox_sf)


Bounding box: (-122.004946, 37.28487, -121.834263, 37.416253)
Theme: 'place'
Fetching data from Overture Maps...
Successfully loaded 26875 POIs.


In [8]:
# Keep only POIs that list a website, with the columns needed for scraping
df_websites = df_sf_pois.loc[df_sf_pois['websites'].notna(), ['names', 'categories', 'websites']]

🕸️ **Scrape Website Descriptions**

This block extracts website content (meta description or About page) for each POI entry and organizes the results in a DataFrame:



*   **scrape_website_batch(...)**: Calls the scraping function to gather metadata and About text for the first 10 POIs.
*   The function attempts to get:

  *   <meta> title and description
  *   About page content (if available)
  *   Fallback: homepage if no About page exists


*   Results include the POI name, categories, URL, extracted text, status, and the source of the text (meta and/or about).
*   The results are then converted into a DataFrame for analysis and classification.
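
The meta-extraction step described above can be sketched with the standard library alone. This is an illustration of the general technique, not the project's web_scraper module: the names `MetaExtractor` and `extract_meta_text` are hypothetical, and the sample HTML is made up. It reads the page title and the `<meta name="description">` content from an HTML string and joins them with a newline.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>
        if self._in_title:
            self.title += data

def extract_meta_text(html):
    """Return the title and description joined by a newline, or '' if neither exists."""
    parser = MetaExtractor()
    parser.feed(html)
    parts = [parser.title.strip(), parser.description]
    return "\n".join(p for p in parts if p)

# Made-up page used only to demonstrate the helper
sample_html = (
    "<html><head><title>Example Cafe</title>"
    '<meta name="description" content="Coffee and pastries."></head>'
    "<body></body></html>"
)
print(extract_meta_text(sample_html))
```

A page with neither tag yields an empty string, which a caller could treat as a signal to fall back to the About page or homepage text.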

In [9]:
results = test.scrape_website_batch(df_websites, webScraper, 10)
df_results = pd.DataFrame(results)
df_results

Processed 10 websites

Scraping completed.


Unnamed: 0,name,category,url,text,status,source
0,"{'primary': 'MC GARAGE DOORS', 'common': None,...","{'primary': 'online_shop', 'alternate': ['rest...",http://www.mcgaragedoors.com/,MC GARAGE DOORSPeninsula (650) 815-8252 ~ Sou...,success,[about]
1,{'primary': 'Saratoga Woods Community Associat...,"{'primary': 'community_services_non_profits', ...",http://www.saratogawoods.net/,MenuLog inHomeAboutHistoryBoardEmploymentPhoto...,success,[about]
2,"{'primary': 'Base Builders, Inc.', 'common': N...","{'primary': 'building_contractor', 'alternate'...",,,no_valid_text,[]
3,"{'primary': 'R&W Hometree', 'common': None, 'r...","{'primary': 'arts_and_entertainment', 'alterna...",http://gallery.me.com/hightidetech/100122/Home...,,no_valid_text,[]
4,"{'primary': 'NYT Shared Service Center Inc', '...","{'primary': 'information_technology_company', ...",,,no_valid_text,[]
5,"{'primary': 'FRC Team 2813: Gear Heads', 'comm...","{'primary': 'b2b_science_and_technology', 'alt...",https://team2813.com/,Skip to contentHomeAboutAbout UsStudent Leader...,success,[about]
6,"{'primary': 'Prospect High School', 'common': ...","{'primary': 'high_school', 'alternate': ['educ...",http://prospect.cuhsd.org/apps/email/index.jsp...,Contact | Prospect High School\nThe Campbell U...,success,[meta]
7,"{'primary': 'Football Field', 'common': None, ...","{'primary': 'school_sports_team', 'alternate':...",https://linktr.ee/prospectfootball,,no_valid_text,[]
8,"{'primary': 'Flair Apartments', 'common': None...",{'primary': 'landmark_and_historical_building'...,http://www.theflairapts.com,,no_valid_text,[]
9,{'primary': 'Saratoga Periodontics & Implants'...,"{'primary': 'dentist', 'alternate': ['periodon...",http://www.sccl.org/saratoga,Saratoga Library | Santa Clara County Library ...,success,"[meta, about]"


🤖 **Load SentenceTransformer Model**

This block initializes the sentence embedding model used for category classification:

*   Uses the all-MiniLM-L6-v2 model from sentence-transformers.
*   This model converts POI descriptions (text) into dense vector embeddings.
*   These embeddings are compared against embeddings of the category tree nodes to determine the best-fitting category.
*   Custom classifier logic using SBERT is stored in sbert_classifier.py.




In [11]:
import sbert_classifier as clf

model = clf.SentenceTransformer('all-MiniLM-L6-v2')

🌲 **Generate Tree Embeddings for Classification**

*   This step encodes all nodes in the category tree (category_keywords) using the sentence transformer model (model).
*   Each category label is transformed into a vector embedding.
*   These embeddings are used later to compare with description vectors for classification.
*   This helps enable semantic similarity comparisons at every tree level.
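
Comparing a description vector with node vectors at every tree level can be pictured as a greedy descent: pick the most similar child, then repeat among that child's children. The sketch below is an assumption about how such a comparison might work, not the actual logic in sbert_classifier.py. The names `cosine_similarity` and `classify_by_layer` are hypothetical, and the three-dimensional vectors are toy values standing in for what `model.encode(...)` would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_by_layer(text_vec, tree, node_vecs):
    """Greedily descend a nested-dict tree, choosing the most similar child at each level."""
    path, children = [], tree
    while children:
        best = max(children, key=lambda name: cosine_similarity(text_vec, node_vecs[name]))
        path.append(best)
        children = children[best]
    return path

# Toy tree and toy embeddings, for illustration only
toy_tree = {"education": {"school": {}, "library": {}}, "retail": {"online_shop": {}}}
toy_vecs = {
    "education": [1.0, 0.0, 0.0],
    "retail": [0.0, 1.0, 0.0],
    "school": [0.9, 0.0, 0.4],
    "library": [0.9, 0.0, -0.4],
    "online_shop": [0.0, 1.0, 0.1],
}
predicted = classify_by_layer([0.8, 0.1, 0.5], toy_tree, toy_vecs)
print(predicted)
```

A greedy descent is cheap, but an early wrong turn cannot be undone, which is one reason a classifier might also score several candidates per level.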


In [12]:
embeddings = clf.embed_tree_nodes_by_layer(category_keywords, model)

✅ **Evaluate Prediction Accuracy**

- **Purpose:** Computes how well our classifier performs on real scraped POI descriptions.

- Compares predicted category paths against ground truth categories (primary & alternates).

- Loops through all entries with status "success" and evaluates top-level match.

- Returns an enriched DataFrame with "predicted", "matches" (True/False), and summary stats.

- **verbose=True:** Prints detailed evaluation progress and match statistics.
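
The matching idea can be illustrated on plain dictionaries. This is a sketch of one plausible rule, assumed for illustration and not taken from testing_utils: a prediction counts as a match when any label on the predicted path equals the ground-truth primary category or one of its alternates, and only rows with status "success" are scored. The names `is_match` and `accuracy` and the sample rows are hypothetical.

```python
def is_match(predicted_path, categories):
    """True if any predicted label equals the primary or an alternate category."""
    truth = {categories.get("primary")} | set(categories.get("alternate") or [])
    truth.discard(None)
    return any(label in truth for label in predicted_path)

def accuracy(rows):
    """Share of successfully scraped rows whose prediction matches the ground truth."""
    scored = [r for r in rows if r["status"] == "success"]
    if not scored:
        return 0.0
    hits = sum(is_match(r["predicted"], r["category"]) for r in scored)
    return hits / len(scored)

# Made-up rows that mimic the shape of the scraped results
rows = [
    {"status": "success", "predicted": ["education", "school"],
     "category": {"primary": "high_school", "alternate": ["education"]}},
    {"status": "success", "predicted": ["retail", "online_shop"],
     "category": {"primary": "dentist", "alternate": None}},
    {"status": "no_valid_text", "predicted": [],
     "category": {"primary": "building_contractor", "alternate": []}},
]
print(accuracy(rows))
```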

In [None]:
evaluated = test.evaluate_prediction_accuracy(
    results_df=pd.DataFrame(results),
    model=model,
    tree=category_keywords,
    embeddings=embeddings,
    clf_module=clf,
    util_module=util,
    test_module=test,
    verbose=True
)

In [None]:
df_eval = pd.DataFrame(evaluated)
display(df_eval.head())