# City Explorer: Multi-Source Attraction & Activity Discovery with Routing

A DABN 23 project by Alessandro Hoefer and Samuel Goldbuch

## Introduction
With web scraping, the challenge is rarely a lack of data: online data is abundant and easy to access. The real difficulty lies in navigating that volume of information and organizing it effectively.

Tourism has grown steadily over recent years, with record numbers of people traveling abroad to explore new cities and countries. Yet for both groups and individuals, deciding what to do during a visit can be frustrating. Information on attractions, activities, optimal visiting times, and local tips floods the internet, but bringing order to this chaos remains challenging. These considerations motivated our project.


## Project Overview

The goal of our script is to create a structured dataset, dynamically generated on demand, containing for any requested city: 10 must-see attractions, 10 potential activities, and real-time busyness data for those attractions.

This could serve as the foundation for a consumer product that dynamically provides users with suggestions on what to do in a city. It offers a structured method to review available attractions and, if users are already at their destination, allows them to explore in real time which places to visit or avoid based on current crowding. Conceptually, the project divides into two complementary sections:
- **Permanent storage of data**: Data retrieved through the APIs is stored for long-term use (see the sections on the Google Maps and TripAdvisor APIs). This primarily refers to the SQL databases that we create.
- **Live data retrieval**: Data is accessed in real time for immediate use by potential users. This primarily refers to Selenium-based scraping for live crowdedness information.

The project itself is divided into three main components that together achieve the overall vision:

- *Google Maps APIs* for retrieval of static data on attractions in a given city
- *TripAdvisor APIs* for retrieval of static data on activities in a given city
- *Selenium library* for retrieval of real-time crowdedness data for each attraction in our database


## Google Maps API Implementation
In the project, city names are entered through an interactive UI. An input field is rendered with the `display` function from the IPython library, and an interactive button triggers the search. Together, these two elements pass the city name to the Google Maps API integration.

The Google Maps API fetches data on the city's attractions, including:
- Name
- Address
- Rating
- Review count
- Attraction category
- Website
- Phone
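
As a rough illustration of how a raw API result could be reduced to these fields, the sketch below maps one result dictionary to a flat record. It assumes the field names of the legacy Places API (Text Search plus Place Details); the function name `to_attraction_record` and the sample payload are hypothetical and not taken from the project code.

```python
from typing import Any, Dict, Optional


def to_attraction_record(place: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Flatten one Places-style result into the fields listed above.

    Missing keys become None, so incomplete results do not raise.
    """
    types = place.get("types") or []
    return {
        "name": place.get("name"),
        "address": place.get("formatted_address"),
        "rating": place.get("rating"),
        "review_count": place.get("user_ratings_total"),
        "category": types[0] if types else None,  # assumption: first type as category
        "website": place.get("website"),
        "phone": place.get("formatted_phone_number"),
    }


# Hypothetical payload, shaped like a legacy Places API result.
sample_place = {
    "name": "Example Museum",
    "formatted_address": "Example Street 1, Stockholm",
    "rating": 4.5,
    "types": ["museum", "tourist_attraction"],
}
record = to_attraction_record(sample_place)
print(record["name"], record["category"], record["phone"])
```

Using `.get()` throughout keeps the mapping tolerant of places that lack a website, phone number, or review count.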

This data is then stored in two ways:
1. Data is first stored in an SQL database for long-term persistence
2. Data is cached in the running Python instance for short-term usage


## TripAdvisor API Implementation
As in the Google Maps implementation, the city is entered interactively and then used to build a query for the TripAdvisor API.

The TripAdvisor API fetches data on the city's potential activities (things to do), including:
- Name
- Address
- Rating
- Review count
- Activity category
- Website

Mirroring the Google Maps implementation, data is stored in dual fashion: in a separate SQL database for long-term use and cached in the Python instance for short-term access.


## Run-through

In [7]:
# 0) Dependency check (optional)
# This notebook does NOT auto-install by default (cleaner + more reproducible).
AUTO_INSTALL = False

required = [
    ("requests", "requests"),
    ("pandas", "pandas"),
    ("ipywidgets", "ipywidgets"),
]

missing = []
for import_name, pip_name in required:
    try:
        __import__(import_name)
    except ImportError:
        missing.append(pip_name)

if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install command:")
    print("  pip install " + " ".join(missing))
    if AUTO_INSTALL:
        import sys, subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
        print("Installed. Re-run this cell if needed.")
else:
    print("All required packages are installed.")


All required packages are installed.


In [8]:
# 1) Make sure we can import from /src (works when running from notebooks/ folder)
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("Project root:", PROJECT_ROOT)


Project root: c:\Users\Samuel\Desktop\Git Repo\dabn23-project1\dabn23


In [9]:
# 2) Load configuration (API keys + DB path)
# config.py fails fast with a helpful error message if something is missing.

from src.config import GOOGLE_API_KEY, TA_API_KEY, DB_PATH

print("Google API key loaded (length):", len(GOOGLE_API_KEY))
print("TripAdvisor API key loaded (length):", len(TA_API_KEY))
print("DB_PATH:", DB_PATH)


Google API key loaded (length): 39
TripAdvisor API key loaded (length): 32
DB_PATH: G:\My Drive\dabn23_SharedDatabase\dabn23_places_cache.sqlite
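
The contents of `src/config.py` are not shown in this notebook. A fail-fast loader of the kind described in the cell comment might look like the sketch below, which assumes the keys are supplied as environment variables; `require_setting` and `ConfigError` are hypothetical names.

```python
import os
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised when a required setting is missing or empty."""


def require_setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the setting's value, or raise with an actionable message."""
    value = env.get(name, "").strip()
    if not value:
        raise ConfigError(
            f"Missing setting {name!r}. Define it as an environment variable "
            "and restart the kernel."
        )
    return value


# Demo with an explicit mapping instead of the real environment.
demo_env = {"GOOGLE_API_KEY": "dummy-google-key"}
print("Google key length:", len(require_setting("GOOGLE_API_KEY", demo_env)))
try:
    require_setting("TA_API_KEY", demo_env)
except ConfigError as exc:
    print("Config error:", exc)
```

Raising at import time surfaces a missing key before any API call is attempted, which is easier to diagnose than a failed request later in the notebook.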


In [10]:
# 3) Initialize the shared SQLite database (creates the file if it doesn't exist)

from pathlib import Path
from src.db import connect, migrate_if_needed, create_tables

# Ensure parent folder exists (SQLite can create the file, but not the folder)
Path(DB_PATH).parent.mkdir(parents=True, exist_ok=True)

conn = connect(DB_PATH)
migrate_if_needed(conn)   # handles legacy schemas (e.g., place_ids_json -> item_ids_json)
create_tables(conn)

tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print("✅ DB ready. Tables:", [t[0] for t in tables])


✅ DB ready. Tables: ['city_top10', 'item_summary']
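
For orientation, the two tables could be declared roughly as below. This is a sketch, not the project's `create_tables`: only the column names that appear in the queries later in this notebook (`city_key`, `source`, `item_type`, `item_ids_json`, `item_id`, `name`) are taken from the source, while the types, keys, and example values in the comments are assumptions.

```python
import sqlite3


def create_tables_sketch(conn: sqlite3.Connection) -> None:
    """Create the two tables if they do not exist yet (assumed layout)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS city_top10 (
            city_key      TEXT NOT NULL,  -- normalized city name, e.g. 'stockholm'
            source        TEXT NOT NULL,  -- e.g. 'google'
            item_type     TEXT NOT NULL,  -- e.g. 'attraction'
            item_ids_json TEXT NOT NULL,  -- JSON list of item ids in ranked order
            PRIMARY KEY (city_key, source, item_type)
        );
        CREATE TABLE IF NOT EXISTS item_summary (
            source  TEXT NOT NULL,
            item_id TEXT NOT NULL,
            name    TEXT,
            PRIMARY KEY (source, item_id)
        );
        """
    )


demo_conn = sqlite3.connect(":memory:")
create_tables_sketch(demo_conn)
create_tables_sketch(demo_conn)  # safe to repeat thanks to IF NOT EXISTS
demo_tables = sorted(
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    )
)
print(demo_tables)
```

Storing the ranked ids as one JSON list per city keeps the ranking in a single row, while the per-item details live in `item_summary` and can be shared across cities.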


In [12]:
# 4) Define the city search (imports + category filters)

from src.pipelines import top10_city

ALLOW = ["Tours", "Food & Drink", "Outdoor Activities", "Boat Tours & Water Sports", "Nightlife", "Shopping"]
DENY  = ["Sights & Landmarks", "Museums"]

def city_search(city: str):
    return top10_city(conn, city, allow_groups=ALLOW, deny_groups=DENY)
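
The filtering inside `top10_city` is not shown in this notebook. One plausible reading of `allow_groups` and `deny_groups` is sketched below: an item is dropped if it belongs to any denied group, and otherwise kept only if it belongs to at least one allowed group. The precedence of deny over allow, the `groups` key, and the sample items are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def filter_by_groups(
    items: Iterable[Dict[str, Any]],
    allow_groups: Iterable[str],
    deny_groups: Iterable[str],
) -> List[Dict[str, Any]]:
    """Keep items in an allowed group unless they are also in a denied group."""
    allow = {g.lower() for g in allow_groups}
    deny = {g.lower() for g in deny_groups}
    kept = []
    for item in items:
        groups = {g.lower() for g in item.get("groups", [])}
        if groups & deny:
            continue  # assumption: a denied group always wins
        if allow and not (groups & allow):
            continue  # not in any allowed group
        kept.append(item)
    return kept


# Hypothetical items; group names reuse the ALLOW/DENY values from the cell above.
allow_demo = ["Tours", "Food & Drink", "Outdoor Activities",
              "Boat Tours & Water Sports", "Nightlife", "Shopping"]
deny_demo = ["Sights & Landmarks", "Museums"]
demo_items = [
    {"name": "Harbour boat tour", "groups": ["Boat Tours & Water Sports", "Tours"]},
    {"name": "Old town walk with museum entry", "groups": ["Tours", "Museums"]},
    {"name": "Cathedral visit", "groups": ["Sights & Landmarks"]},
    {"name": "Cooking class", "groups": ["Food & Drink"]},
]
kept_names = [item["name"] for item in filter_by_groups(demo_items, allow_demo, deny_demo)]
print(kept_names)
```

Denying landmark and museum groups on the activity side would keep activities from duplicating what the attraction list already covers.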

## 5) Interactive search UI (ipywidgets)

Use the controls to choose:
- city
- data source (Google or TripAdvisor)
- type (attraction/activity)

Then click **Search Top 10**.


In [13]:
from src.ui import build_city_widget

LAST_SEARCHED_CITY = None
LAST_SEARCH_RESULTS = None

build_city_widget(city_search)


{'last_city': None, 'last_results': None}
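
A helper such as `build_city_widget` could be put together roughly as follows. This is only a sketch under assumptions: the actual `src.ui` module is not shown in this notebook, and the names `make_search_handler` and `build_city_widget_sketch` are hypothetical. The shape of the returned state dictionary mirrors the output above, and `ipywidgets` is imported inside the builder so that the click handler can be exercised without a notebook front end.

```python
from typing import Any, Callable, Dict


def make_search_handler(
    search_fn: Callable[[str], Any],
    read_city: Callable[[], str],
    state: Dict[str, Any],
) -> Callable[..., None]:
    """Build a button callback that runs search_fn on the current city text."""

    def on_click(_button: Any = None) -> None:
        city = read_city().strip()
        if not city:
            return  # ignore clicks on an empty input field
        state["last_city"] = city
        state["last_results"] = search_fn(city)

    return on_click


def build_city_widget_sketch(search_fn: Callable[[str], Any]) -> Dict[str, Any]:
    """Display a text box plus button and return a dict tracking the last search."""
    import ipywidgets as widgets  # lazy import: only needed inside a notebook
    from IPython.display import display

    state: Dict[str, Any] = {"last_city": None, "last_results": None}
    city_box = widgets.Text(value="Paris", description="City:")
    button = widgets.Button(description="Search Top 10")
    button.on_click(make_search_handler(search_fn, lambda: city_box.value, state))
    display(widgets.HBox([city_box, button]))
    return state


# Exercise the handler without any widgets (the search function is a stand-in).
demo_state: Dict[str, Any] = {"last_city": None, "last_results": None}
handler = make_search_handler(lambda c: [f"{c} result"], lambda: "  Rome ", demo_state)
handler()
print(demo_state)
```

In a design like this, the search state lives in the returned dictionary, so notebook-level names such as `LAST_SEARCHED_CITY` would stay `None` unless values are copied out of it explicitly.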

In [14]:
print(LAST_SEARCHED_CITY)

None


# Selenium

## Selenium for Real-Time Crowdedness Data
To complement our permanent data collection through APIs, we use the Selenium package to add a real-time dimension to the data retrieval process.

Specifically, we query the SQL database to retrieve all available attractions for a given city. Then, we use that information to sequentially navigate Google Maps and fetch real-time busyness data for each attraction. The data is returned in a tabular format for easy interpretation.

From a product perspective, this is one of the most compelling features: potential users can check in real time which places to visit and which to avoid based on current crowding levels, helping them find the best spot at any given moment. However, this functionality depends entirely on the previously created SQL database containing all attraction information for a given city. There is a symbiotic relationship here, where long-term data storage enables live data retrieval and short-term usage of the product.


## Selenium implementation


In [30]:
import json
import sqlite3
import datetime

def get_attraction_names(city: str, conn: sqlite3.Connection):
    """
    Look up stored attraction names for a city from city_top10 / item_summary.
    Returns a list of name strings, or None if city not found.
    """
    citykey = city.strip().lower()

    cur = conn.cursor()

    # Read the stored item_ids_json for this city
    row = cur.execute(
        "SELECT item_ids_json FROM city_top10 "
        "WHERE city_key = ? AND source = ? AND item_type = ?",
        (citykey, "google", "attraction")
    ).fetchone()

    if not row:
        # City not in DB
        return None

    place_ids = json.loads(row[0])
    if not place_ids:
        return []

    # Fetch names in the same ranked order
    placeholders = ",".join("?" * len(place_ids))
    name_rows = cur.execute(
        f"SELECT item_id, name FROM item_summary "
        f"WHERE source = ? AND item_id IN ({placeholders})",
        ["google", *place_ids]
    ).fetchall()

    name_map = {pid: name for pid, name in name_rows}

    # Preserve original ranking order
    return [name_map[pid] for pid in place_ids if pid in name_map]

In [31]:
import os
import sqlite3
import pathlib

# Path to TripAdvisor cache in the same folder as this notebook's working dir
TA_DB_PATH = str(pathlib.Path.cwd() / "dabn23_tripadvisor_cache.sqlite")

print("TripAdvisor DB path:", TA_DB_PATH)
print("Exists?", os.path.exists(TA_DB_PATH))

taconn = sqlite3.connect(TA_DB_PATH)
#taconn.execute("PRAGMA journal_mode=WAL;")  # safe even if wal/shm files present

TripAdvisor DB path: c:\Users\Samuel\Desktop\Git Repo\dabn23-project1\dabn23\notebooks\dabn23_tripadvisor_cache.sqlite
Exists? True


In [32]:
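# Imports used by the helpers in this cell, so it does not depend on a later cell
import re
import time
import datetime

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def _parse_busy_bar(aria: str):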
    """
    Parses one peak-hours bar aria-label.
    Handles English ("77% busy at 2 pm") and Swedish/Nordic ("77 aktivitet kl. 1400.").
    Returns (hour_24, pct) or None.
    """
    # Swedish/Nordic: "37 aktivitet kl. 1300."
    m = re.search(r"^(\d+)\D+?kl\.\s*(\d{2})\d{2}", aria.strip())
    if m:
        return int(m.group(2)), int(m.group(1))

    # English: "77% busy at 2 pm"
    m = re.search(r"(\d+)%.*?(\d{1,2})\s*(am|pm)", aria, re.IGNORECASE)
    if m:
        pct, h, mer = int(m.group(1)), int(m.group(2)), m.group(3).lower()
        hour_24 = (h % 12) + (12 if mer == "pm" else 0)
        return hour_24, pct

    return None


def get_current_busyness(driver, attraction_name: str):
    """
    Searches Google Maps for attraction_name and scrapes the full-day busyness.
    Returns list[int|None] with 24 entries (index = hour 0–23),
    or None if the place has no peak-hours section at all.
    """
    print(f"\n  Searching: {attraction_name}")

    # 1. Navigate and type in search bar
    driver.get("https://www.google.com/maps")
    time.sleep(2)
    dismiss_google_consent(driver)

    search_bar = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.NAME, "q"))
    )
    search_bar.clear()
    search_bar.send_keys(attraction_name)
    driver.find_element(By.CSS_SELECTOR, "button.mL3xi").click()
    time.sleep(3)

    # 2. Disambiguation list → click first result if present
    try:
        WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "a.hfpxzc"))
        ).click()
        time.sleep(3)
        print("    Clicked top result from list.")
    except TimeoutException:
        print("    Landed directly on place page.")

    # 3. Find peak-hours section
    try:
        peak_section = WebDriverWait(driver, 6).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.UmE4Qe"))
        )
    except TimeoutException:
        print("    No peak hours data available.")
        return None

    # 4. Parse all hourly bars into a 24-slot list
    hourly_data = [None] * 24
    bars = peak_section.find_elements(By.CSS_SELECTOR, "div.dpoVLd")
    print(f"    Found {len(bars)} hourly bars.")

    for bar in bars:
        aria = bar.get_attribute("aria-label") or ""

        # Live "Currently X% busy" → slot into current hour
        live = re.search(r"(?:Currently|Nuvarande).*?(\d+)%", aria, re.IGNORECASE)
        if live:
            hourly_data[datetime.datetime.now().hour] = int(live.group(1))
            print(f"    ✓ Live now: {int(live.group(1))}%")
            continue

        parsed = _parse_busy_bar(aria)
        if parsed:
            hour_24, pct = parsed
            if 0 <= hour_24 <= 23:
                hourly_data[hour_24] = pct

    filled = sum(1 for x in hourly_data if x is not None)
    print(f"    ✓ Stored {filled}/24 hours.")
    return hourly_data


def dismiss_google_consent(driver):
    """Dismiss the GDPR consent banner on Google Maps (EU only)."""
    try:
        accept_btn = WebDriverWait(driver, 8).until(
            EC.element_to_be_clickable((
                By.XPATH,
                '//button[.//span[contains(text(),"Accept all") '
                'or contains(text(),"Reject all")]]'
            ))
        )
        accept_btn.click()
        time.sleep(1)
        print("  Google consent dismissed.")
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print("  No consent popup found.")
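
The label parsing can be checked without launching a browser. The sketch below repeats the two regular expressions from `_parse_busy_bar` in a standalone function (named `parse_busy_label` here, a hypothetical name chosen to avoid shadowing the original) and runs them on sample labels. The sample strings follow the formats quoted in the docstring and were not captured from a live page.

```python
import re
from typing import Optional, Tuple


def parse_busy_label(aria: str) -> Optional[Tuple[int, int]]:
    """Return (hour_24, pct) for one bar label, or None if it does not match."""
    text = aria.strip()

    # Swedish-style label, e.g. "37 aktivitet kl. 1300."
    m = re.search(r"^(\d+)\D+?kl\.\s*(\d{2})\d{2}", text)
    if m:
        return int(m.group(2)), int(m.group(1))

    # English label, e.g. "77% busy at 2 pm"
    m = re.search(r"(\d+)%.*?(\d{1,2})\s*(am|pm)", text, re.IGNORECASE)
    if m:
        pct, hour, meridiem = int(m.group(1)), int(m.group(2)), m.group(3).lower()
        # 12 am maps to hour 0 and 12 pm to hour 12
        return (hour % 12) + (12 if meridiem == "pm" else 0), pct

    return None


sample_labels = [
    "77% busy at 2 pm",
    "0% busy at 12 am",
    "50% busy at 12 pm",
    "37 aktivitet kl. 1300.",
    "no data",
]
for label in sample_labels:
    print(repr(label), "->", parse_busy_label(label))
```

The midnight and noon cases are worth testing explicitly, since the 12-hour to 24-hour conversion is where off-by-twelve errors usually hide.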

In [36]:
import datetime
def scrape_peak_hours(city: str, conn: sqlite3.Connection):
    """
    Scrapes Google Maps peak hours for all Google top-10 attractions of a city.
    Saves results into the global `busyness_data` dict.
    Supports multiple cities — each call adds/updates one city entry.
    Prints the current-hour snapshot when done.
    """
    scraped_at = datetime.datetime.now().strftime("%H:%M")
    city_key   = city.strip()

    print(f"Looking up '{city_key}' in  DB...")
    names = get_attraction_names(city_key, conn)

    if names is None:
        print(f"  ✗ '{city_key}' not found in DB. Run the scraper first.")
        return
    if not names:
        print(f"  ✗ No attractions stored for '{city_key}'.")
        return

    print(f"  Found {len(names)} attractions: {', '.join(names[:3])}...")

    attractions = {}
    try:
        for name in names:
            hourly = get_current_busyness(driver, name)
            attractions[name] = hourly   # list[int|None] or None
            time.sleep(2)
    finally:
        # The driver stays open here; the calling cell is responsible for closing it.
        print("\nScraping loop finished.")

    # Save into global dict (safe to call again for a different city)
    busyness_data[city_key] = {
        "scraped_at":  scraped_at,
        "attractions": attractions,
    }

    # Print current-hour snapshot
    now_hour = datetime.datetime.now().hour
    print(f"\n{'='*54}")
    print(f"  CURRENT BUSYNESS — {city_key}  (scraped at {scraped_at})")
    print(f"{'='*54}")
    for name, hourly in attractions.items():
        if hourly is None:
            val = "N/A (no GM data)"
        elif hourly[now_hour] is None:
            val = "N/A (no data this hour)"
        else:
            val = f"{hourly[now_hour]}%"
        print(f"  {name:<44} {val}")

In [38]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
from selenium.common.exceptions import NoSuchElementException, TimeoutException

options = Options()
options.add_argument("--lang=en-US")
options.add_argument("--headless=new")
options.add_experimental_option("prefs", {
    "intl.accept_languages": "en-US,en",
    "profile.default_content_setting_values.geolocation": 2,
})
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.execute_cdp_cmd("Emulation.setGeolocationOverride", {
    "latitude": 40.7128, "longitude": -74.0060, "accuracy": 100
})
print("Driver started.")

# Global storage for peak-hours data (multi-city)
busyness_data = {}
scrape_peak_hours("Stockholm", conn)

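driver.quit()  # quit() ends the whole WebDriver session; close() only closes the current window
print("Finished scraping.")
# To scrape additional cities, call scrape_peak_hours() again before driver.quit().
# Note: re-running this whole cell resets busyness_data to {}, so earlier cities are lost.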

Driver started.
Looking up 'Stockholm' in  DB...
  Found 10 attractions: Vasa Museum, The Royal Palace, Skansen...

  Searching: Vasa Museum
  Google consent dismissed.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Stored 18/24 hours.

  Searching: The Royal Palace
  No consent popup found.
    Clicked top result from list.
    No peak hours data available.

  Searching: Skansen
  No consent popup found.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Stored 18/24 hours.

  Searching: King's Garden
  No consent popup found.
    Clicked top result from list.
    Found 168 hourly bars.
    ✓ Stored 24/24 hours.

  Searching: ABBA The Museum
  No consent popup found.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Stored 18/24 hours.

  Searching: Fotografiska Museum Stockholm
  No consent popup found.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Live now: 100%
    ✓ Stored 18/24 hours.

  Searching: Swed

In [39]:
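from typing import Optional

import pandas as pd
from IPython.display import display  # explicit import; Jupyter also provides display() implicitly

def print_busyness_summary(city: Optional[str] = None):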
    """Print a DataFrame of current-hour busyness. Omit city= for all cities."""
    cities = [city] if city else list(busyness_data.keys())
    now_hour = datetime.datetime.now().hour

    if not cities:
        print("busyness_data is empty — run scrape_peak_hours() first.")
        return

    for c in cities:
        if c not in busyness_data:
            print(f"No data for '{c}'.")
            continue
        entry = busyness_data[c]
        rows = []
        for name, hourly in entry["attractions"].items():
            pct = None if (hourly is None) else hourly[now_hour]
            rows.append({
                "Attraction":           name,
                f"Busy at {now_hour:02d}:00": f"{pct}%" if pct is not None else "N/A"
            })
        print(f"\n=== {c}  (scraped at {entry['scraped_at']}) ===")
        display(pd.DataFrame(rows))

print_busyness_summary("Stockholm")


=== Stockholm  (scraped at 20:36) ===


| | Attraction | Busy at 20:00 |
|---|---|---|
| 0 | Vasa Museum | 0% |
| 1 | The Royal Palace | N/A |
| 2 | Skansen | 0% |
| 3 | King's Garden | 60% |
| 4 | ABBA The Museum | 0% |
| 5 | Fotografiska Museum Stockholm | 53% |
| 6 | Swedish History Museum | 0% |
| 7 | Stockholm City Hall | 0% |
| 8 | Skyview | N/A |
| 9 | Storkyrkan | 0% |
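
Building on the `busyness_data` structure, a small helper could rank attractions from least to most busy for a given hour, which is the question a visitor would actually ask. The sketch below is hypothetical: `rank_by_busyness` is not part of the notebook, and the sample entry uses placeholder names and values rather than scraped data.

```python
from typing import Any, Dict, List, Tuple


def rank_by_busyness(city_entry: Dict[str, Any], hour: int) -> List[Tuple[str, int]]:
    """Return (name, pct) pairs sorted from least to most busy at the given hour.

    Attractions without data for that hour are left out.
    """
    ranked = []
    for name, hourly in city_entry["attractions"].items():
        if hourly is None or hourly[hour] is None:
            continue
        ranked.append((name, hourly[hour]))
    return sorted(ranked, key=lambda pair: pair[1])


# Placeholder entry with the same shape as busyness_data[city].
demo_entry = {
    "scraped_at": "20:36",
    "attractions": {
        "Place A": [None] * 20 + [60] + [None] * 3,  # data only for hour 20
        "Place B": None,                             # no peak-hours section
        "Place C": [0] * 24,
    },
}
ranking = rank_by_busyness(demo_entry, 20)
print(ranking)
```

A value of 0% may simply mean that a venue is closed at that hour, so a real product would probably want to combine this ranking with opening hours.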


## Results and Limitations
Overall, we are satisfied with the project's outcomes.
After implementing the Google and TripAdvisor APIs, we populated a working SQL database. Inserting a limited number of cities was sufficient for demonstration purposes, and because the code is interactive, the database can be extended on demand with data for any desired city.
We are also pleased with our Selenium implementation for retrieving crowdedness information. Reading attraction names directly from the SQL database keeps the process lean and decouples exploration from live lookups: users can first explore attractions and activities, then decide later whether to run the Selenium script for real-time data. Fetching this data can be slow, since we deliberately throttled the process with fixed pauses to reduce the risk of being blocked.

This leads us to the project's current limitations.
- First, rapid fetching of attraction crowdedness would be essential in a working product. Given the pedagogical nature of this project, we did not feel it was necessary to optimize speed or implement techniques to parallelize the process. However, we recognize this flaw, which would be critical to address for a deliverable product.
- Second, crowdedness can only be checked for attractions in the current implementation. Since we use Google Maps to retrieve this information, we are limited to Google Maps attractions. Unfortunately, we cannot check activity crowdedness for TripAdvisor-sourced activities, as TripAdvisor does not supply users with such data.
- Additionally, the current implementation requires a VPN connected to a country whose official language is English in order to run the scraping script reliably. The Google Maps interface language still changes based on geolocation, and adjusting the WebDriver language settings and headers alone has not been enough to override this behavior. In future iterations, more robust workarounds could be introduced, but given the course context and prototype nature of the project, we consider the present solution acceptable.
- Finally, we must acknowledge the static nature of data inserted into the SQL database. For a more robust product, implementing a recurring update function would be fundamental. With the present structure, manually erasing all stored data and using the interactive UI to fetch it again would be necessary to 'update' the available information. This is costly, both computationally and time-wise. A better approach would be to implement regular checks to determine whether data has changed enough to warrant the expense of updating.
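
The recurring update described in the last point could start from a simple staleness check. The sketch below assumes that each stored record carries a `fetched_at` timestamp in ISO 8601 format, which the current schema is not shown to include; `is_stale` and the 30-day threshold are hypothetical choices.

```python
import datetime as dt
from typing import Optional


def is_stale(
    fetched_at_iso: Optional[str],
    max_age_days: int = 30,
    now: Optional[dt.datetime] = None,
) -> bool:
    """Return True if a record was never fetched or is older than max_age_days."""
    if fetched_at_iso is None:
        return True
    now = now or dt.datetime.now()
    fetched_at = dt.datetime.fromisoformat(fetched_at_iso)
    return (now - fetched_at) > dt.timedelta(days=max_age_days)


# Fixed reference time so the demo is reproducible.
reference = dt.datetime(2024, 3, 1, 12, 0)
print(is_stale("2024-01-01T12:00:00", now=reference))  # fetched 60 days earlier
print(is_stale("2024-02-20T12:00:00", now=reference))  # fetched 10 days earlier
print(is_stale(None, now=reference))                   # never fetched
```

With a check like this, only cities whose records have aged out would trigger new API calls, rather than erasing and refetching the whole database.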


## Conclusion
As a proof of concept for a product that could facilitate travel planning for tourists worldwide, we are satisfied with the result. Combining long-term data storage with short-term data retrieval through APIs and the Selenium package proved effective, offering flexibility, ease of access, and surprisingly decent timeliness.

Of course, we acknowledge the limitations described above. However, scaling the project with more computing power, parallelizing the scraping across multiple (virtual) machines, and making targeted use of libraries such as Requests could improve timeliness and bring such a product closer to a deliverable reality.

Furthermore, we cannot ignore the pedagogical benefits of this project. Working with APIs and a robust package like Selenium on mainstream platforms such as Google Maps and TripAdvisor gave us valuable insights into data retrieval, web scraping, and the ideation of a working product. These are skills that are difficult to develop in an educational setting. We believe this experience will be highly beneficial for our professional growth.
