# City Explorer: Multi-Source Attraction & Activity Discovery with Routing

A DABN 23 project by Alessandro Hoefer and Samuel Goldbuch

## Introduction
When resorting to scraping techniques, the challenge is rarely the absence of data. Data online is abundant and easily accessible. However, the main difficulty lies in navigating such vast amounts of information and organizing it effectively.

Tourism has grown steadily over recent years, with record numbers of people traveling abroad to explore new cities and countries. Yet for both groups and individuals, deciding what to do during a visit can be frustrating. Data on attractions, activities, optimal visiting times, and local tips floods the internet, but bringing order to this chaos remains challenging. These considerations motivated our project.


## Project Overview

The goal of our script is to create a structured dataset, dynamically generated on demand, containing for any requested city: 10 must-see attractions, up to 10 potential activities, and real-time busyness data for those attractions.

This could serve as the foundation for a consumer product that dynamically provides users with suggestions on what to do in a city. It offers a structured method to review available attractions and, if users are already at their destination, allows them to explore in real time which places to visit or avoid based on current crowding. Conceptually, the project divides into two complementary sections:
- **Permanent storage of data**: Data is stored for long-term use through API access (see sections on TripAdvisor and Google Maps APIs). This primarily refers to the SQL databases that we create.
- **Live data retrieval**: Data is accessed in real time for immediate use by potential users. This primarily refers to Selenium-based scraping for live crowdedness information.

The project itself is divided into three main components that together achieve the overall vision:

- *Google Maps APIs* for retrieval of static data on attractions in a given city
- *TripAdvisor APIs* for retrieval of static data on activities in a given city
- *Selenium library* for retrieval of real-time crowdedness data for each attraction in our database


## Google Maps API Implementation
In the project, cities are dynamically input into the Python code using an interactive UI. An input field is displayed through the IPython library using the display function, and an interactive button triggers the search. These two elements work in conjunction to pass city names to the integrated Google Maps API.

The Google Maps API fetches data on the city's attractions, including:
- Name
- Address
- Rating
- Review count
- Attraction category
- Website
- Phone

This data is then stored in two ways:
1. Data is first stored in an SQL database for long-term persistence
2. Data is cached in the running Python instance for short-term usage


## TripAdvisor API Implementation
The city entered in the interactive UI is also forwarded to the TripAdvisor API, where it is used to build the search query.

The TripAdvisor API fetches data on the city's potential activities (things to do), including:
- Name
- Address
- Rating
- Review count
- Activity category
- Website

Mirroring the Google Maps implementation, data is stored in dual fashion: in a SQL database for long-term use and cached in the Python instance for short-term access.


## Notebook Walkthrough

### 0) Dependency Check
This code block verifies that the required libraries (requests, pandas, ipywidgets, selenium) are installed in the notebook environment. Missing packages are flagged and accompanied by a pip install command; when `AUTO_INSTALL` is set to `True`, they are installed automatically. This ensures the system is properly configured before execution.

In [1]:
AUTO_INSTALL = False

required = [
    ("requests", "requests"),
    ("pandas", "pandas"),
    ("ipywidgets", "ipywidgets"), 
    ("selenium", "selenium"),
]

missing = []
for import_name, pip_name in required:
    try:
        __import__(import_name)
    except ImportError:
        missing.append(pip_name)

if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install command:")
    print("  pip install " + " ".join(missing))
    if AUTO_INSTALL:
        import sys, subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
        print("Installed. Re-run this cell if needed.")
else:
    print("All required packages are installed.")


All required packages are installed.


### 1) Project Path Configuration

This code block configures the notebook’s system path to ensure modules from the project’s `/src` directory can be imported correctly when executed from the `notebooks` folder.

In [2]:
import sys
from pathlib import Path

PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print("Project root:", PROJECT_ROOT)


Project root: c:\Users\megiv\Desktop\Git Clone\dabn23-project1\dabn23


### 2) Configuration Loading

This code block imports API credentials and the database path from the centralized `config.py` file. Execution requires users to provide their own Google Maps and TripAdvisor API keys, as well as a valid connection to the SQL database stored on Google Drive, all configured as system variables. Successful loading confirms that external services and database access are properly configured before pipeline execution.

In [3]:
# 2) Load configuration (API keys + DB path)
# config.py fails fast with a helpful error message if something is missing.

from src.config import GOOGLE_API_KEY, TA_API_KEY, DB_PATH

print("Google API key loaded (length):", len(GOOGLE_API_KEY))
print("TripAdvisor API key loaded (length):", len(TA_API_KEY))
print("DB_PATH:", DB_PATH)


Google API key loaded (length): 39
TripAdvisor API key loaded (length): 32
DB_PATH: G:\My Drive\dabn23_SharedDatabase\dabn23_cache.sqlite
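
The fail-fast behavior mentioned in the cell comment could look roughly like the sketch below. It assumes the settings are read from environment variables. The variable names, the `ConfigError` class, and the `require_env` helper are hypothetical and are not taken from the project's `config.py`.

```python
import os


class ConfigError(RuntimeError):
    """Raised when a required setting is missing."""


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise ConfigError(
            f"Environment variable {name} is not set. "
            "Set it before starting the notebook."
        )
    return value


# Placeholder value so the example runs; a real key must never be hard-coded.
os.environ["EXAMPLE_GOOGLE_API_KEY"] = "dummy-key"
os.environ.pop("EXAMPLE_UNSET_VARIABLE", None)

GOOGLE_KEY = require_env("EXAMPLE_GOOGLE_API_KEY")
print("Key loaded (length):", len(GOOGLE_KEY))

try:
    require_env("EXAMPLE_UNSET_VARIABLE")
except ConfigError as exc:
    print("Config error:", exc)
```

Printing only the key length, as the notebook does, confirms that a value was loaded without exposing the secret in the cell output.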


### 3) Database Initialization

This code block initialises the shared SQLite database by establishing a connection and creating all required tables. If the database file or its parent directory does not yet exist, it is created automatically. Successful execution confirms that the storage layer is fully configured and ready for data input.

In [4]:
from pathlib import Path
from src.db import connect, migrate_if_needed, create_tables

# Ensure parent folder exists (SQLite can create the file, but not the folder)
Path(DB_PATH).parent.mkdir(parents=True, exist_ok=True)

conn = connect(DB_PATH)
migrate_if_needed(conn)   # handles legacy schemas (e.g., place_ids_json -> item_ids_json)
create_tables(conn)

tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
print("✅ DB ready. Tables:", [t[0] for t in tables])


✅ DB ready. Tables: ['city_top10', 'item_summary']
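
For readers without access to `src/db.py`, the following sketch shows one way the migration and table-creation steps might be written. The function names, table names, and the `place_ids_json` to `item_ids_json` rename come from the cell above. The remaining columns and the function bodies are assumptions made for illustration.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return the column names of a table (empty list if the table is absent)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def migrate_if_needed(conn: sqlite3.Connection) -> None:
    """Rename the legacy place_ids_json column if it is still present."""
    if "place_ids_json" in column_names(conn, "city_top10"):
        # RENAME COLUMN needs SQLite 3.25 or newer.
        conn.execute(
            "ALTER TABLE city_top10 RENAME COLUMN place_ids_json TO item_ids_json"
        )
        conn.commit()


def create_tables(conn: sqlite3.Connection) -> None:
    """Create both tables; IF NOT EXISTS makes repeated calls harmless."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS city_top10 (
            city TEXT PRIMARY KEY,
            item_ids_json TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS item_summary (
            item_id TEXT PRIMARY KEY,
            name TEXT NOT NULL
        );
        """
    )


# Start from a legacy layout to exercise the migration path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_top10 (city TEXT PRIMARY KEY, place_ids_json TEXT NOT NULL)"
)
migrate_if_needed(conn)
create_tables(conn)
create_tables(conn)  # the second call changes nothing

tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables, column_names(conn, "city_top10"))
```

Running the migration before `create_tables`, as the notebook does, means an existing legacy table is upgraded in place rather than shadowed by a new one.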


### 4) Pipeline Definition

This code block imports the core city-level pipeline function and defines a filtered search wrapper for notebook execution. Activity categories are selectively included or excluded through predefined allow and deny lists. The resulting function enables streamlined retrieval of the project's curated Top-10 outputs for any queried city. The allow and deny lists exist so that TripAdvisor contributes additional information about a city's activities, rather than duplicating the attractions already gathered from Google Maps.

In [5]:
# Step 4 definitions with imports

from src.pipelines import top10_city

ALLOW = ["Tours", "Food & Drink", "Outdoor Activities", "Boat Tours & Water Sports", "Nightlife", "Shopping"]
DENY  = ["Sights & Landmarks", "Museums"]

def city_search(city: str):
    return top10_city(conn, city, allow_groups=ALLOW, deny_groups=DENY)
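
How `top10_city` applies the two lists internally is not shown in the notebook. A plausible reading, sketched below, is that an activity is kept when it belongs to at least one allowed group and to no denied group. The `filter_by_groups` helper, the `groups` key, and the sample items are hypothetical.

```python
from typing import Iterable


def filter_by_groups(items: Iterable[dict], allow_groups: list[str],
                     deny_groups: list[str]) -> list[dict]:
    """Keep items with at least one allowed group and no denied group."""
    allow = {g.lower() for g in allow_groups}
    deny = {g.lower() for g in deny_groups}
    kept = []
    for item in items:
        groups = {g.lower() for g in item.get("groups", [])}
        if groups & deny:
            continue  # a single denied group excludes the item
        if groups & allow:
            kept.append(item)
    return kept


ALLOW = ["Tours", "Food & Drink", "Outdoor Activities",
         "Boat Tours & Water Sports", "Nightlife", "Shopping"]
DENY = ["Sights & Landmarks", "Museums"]

# Made-up items for illustration only.
candidates = [
    {"name": "Evening food walk", "groups": ["Tours", "Food & Drink"]},
    {"name": "Guided museum visit", "groups": ["Tours", "Museums"]},
    {"name": "Old town square", "groups": ["Sights & Landmarks"]},
]
selected = filter_by_groups(candidates, ALLOW, DENY)
print([item["name"] for item in selected])
```

Letting the deny list take precedence is a design choice in this sketch: it keeps museum tours out of the activity list even though "Tours" is allowed.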

### 5) Interactive City Search UI

This code block imports and initialises the interactive city search widget within the notebook environment. It connects the previously defined pipeline function to a user-facing input interface, enabling dynamic execution of the full data retrieval process for any entered city. Global variables store the most recent search and its results so that data already stored in the database can be reused, which saves API tokens and improves performance. Fewer than 10 activities might appear for some cities because of TripAdvisor's API limits.

In [6]:
from src.ui import build_city_widget

LAST_SEARCHED_CITY = None
LAST_SEARCH_RESULTS = None

build_city_widget(city_search)
print(LAST_SEARCHED_CITY)

VBox(children=(HBox(children=(Text(value='Paris', description='City:', layout=Layout(width='420px'), placehold…

None


## Selenium for Real-Time Crowdedness Data
To complement our permanent data collection through APIs, we use the Selenium package to add a real-time dimension to the data retrieval process.

Specifically, we query the SQL database to retrieve all available attractions for a given city. Then, we use that information to sequentially navigate Google Maps and fetch real-time busyness data for each attraction. The data is returned in a tabular format for easy interpretation.

From a product perspective, this is one of the most compelling features: potential users can check in real time which places to visit and which to avoid based on current crowding levels, helping them find the best spot at any given moment. However, this functionality depends entirely on the previously created SQL database containing all attraction information for a given city. There is a symbiotic relationship here, where long-term data storage enables live data retrieval and short-term usage of the product.


## Selenium Implementation


The Selenium implementation builds on the pre-filled SQL database. It centers on the parent `scrape_peak_hours()` function, which orchestrates the following:

- `get_attraction_names(city, conn)`: Queries the connected database for all attractions matching the input city.
- `get_current_busyness(driver, name, city)`: The core Selenium WebDriver routine, which locates the busyness bars on an attraction's Google Maps page and extracts the live value along with the hourly profile. It relies on two helpers:
  - `_parse_busy_bar(aria)`: Parses the busyness percentage from a bar's aria-label.
  - `dismiss_google_consent(driver)`: Dismisses the Google Maps cookie consent popup.

This modular design means a complete scraping run can be launched with a single call, `scrape_peak_hours(city, conn, driver, busyness_data)`, once the Selenium WebDriver has been initialized. The function then processes every matching attraction in turn.
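
To make the parsing step concrete, here is a minimal sketch of how a percentage might be pulled out of an aria-label. The label formats used in the demo ("45% busy at 6 PM." and "Currently 23% busy, usually 40% busy.") are assumptions about the English-language Google Maps interface, and `parse_busy_bar` is an illustrative stand-in, not the project's `_parse_busy_bar`.

```python
import re
from typing import Optional

# Matches the first "NN%" in the label, e.g. "45% busy at 6 PM."
_PERCENT = re.compile(r"(\d{1,3})\s*%")


def parse_busy_bar(aria: Optional[str]) -> Optional[int]:
    """Return the first percentage found in an aria-label, or None."""
    if not aria:
        return None
    match = _PERCENT.search(aria)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject values outside the valid percentage range.
    return value if 0 <= value <= 100 else None


hourly = parse_busy_bar("45% busy at 6 PM.")
live = parse_busy_bar("Currently 23% busy, usually 40% busy.")
missing = parse_busy_bar("Closed")
print(hourly, live, missing)
```

Returning `None` instead of raising keeps a single unparseable bar from aborting the scrape of the remaining attractions.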

In [7]:
from src.selenium_driver import make_driver
from src.selenium_peak_hours import scrape_peak_hours

busyness_data = {}  # global storage for results

After initializing the Selenium WebDriver, a complete scraping pipeline can be launched with a single function call. The script sequentially processes all stored attractions for the selected city and stores the results in structured format.

In [9]:
import src.ui as ui

driver = make_driver(headless=True)
print("Driver started.")

try:
    scrape_peak_hours(ui.LAST_SEARCHED_CITY, conn, driver, busyness_data)
finally:
    # quit() ends the whole WebDriver session; close() only closes the current window.
    driver.quit()
print("Finished scraping.")

Driver started.
Looking up 'Paris' in  DB...
  Found 10 attractions: Eiffel Tower, Louvre Museum, Arc de Triomphe...

  Searching: Eiffel Tower
  Google consent dismissed.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Live now: 23%
    ✓ Stored 18/24 hours.

  Searching: Louvre Museum
  No consent popup found.
    Landed directly on place page.
    Found 108 hourly bars.
    ✓ Stored 18/24 hours.

  Searching: Arc de Triomphe
  No consent popup found.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Stored 18/24 hours.

  Searching: Champ de Mars
  No consent popup found.
    Clicked top result from list.
    Found 168 hourly bars.
    ✓ Live now: 26%
    ✓ Stored 24/24 hours.

  Searching: Jardin du Luxembourg
  No consent popup found.
    Landed directly on place page.
    No peak hours data available.

  Searching: Tuileries Garden
  No consent popup found.
    Landed directly on place page.
    Found 126 hourly bars.
    ✓ Stored 18/24 hours

This section converts the scraped peak-hours data into a structured DataFrame for presentation. The `print_busyness_summary()` function extracts the current-hour crowdedness values from the stored results and formats them into a tabular overview.

This step bridges live Selenium retrieval with interpretable output, enabling users to immediately assess which attractions are currently more or less crowded.

In [10]:
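import datetime

import pandas as pd
from IPython.display import display  # explicit import; Jupyter otherwise injects display() implicitly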

def print_busyness_summary(city: str = None):
    """Print a DataFrame of current-hour busyness. Omit city= for all cities."""
    cities = [city] if city else list(busyness_data.keys())
    now_hour = datetime.datetime.now().hour

    if not cities:
        print("busyness_data is empty — run scrape_peak_hours() first.")
        return

    for c in cities:
        if c not in busyness_data:
            print(f"No data for '{c}'.")
            continue
        entry = busyness_data[c]
        rows = []
        for name, hourly in entry["attractions"].items():
            pct = None if (hourly is None) else hourly[now_hour]
            rows.append({
                "Attraction": name,
                f"Busy at {now_hour:02d}:00": f"{pct}%" if pct is not None else "N/A"
            })
        print(f"\n=== {c}  (scraped at {entry['scraped_at']}) ===")
        display(pd.DataFrame(rows))

# Use the city tracked by the UI module, matching the scrape call above.
print_busyness_summary(ui.LAST_SEARCHED_CITY)


=== Paris  (scraped at 22:49) ===


| | Attraction | Busy at 23:00 |
|---|---|---|
| 0 | Eiffel Tower | 0% |
| 1 | Louvre Museum | 0% |
| 2 | Arc de Triomphe | 0% |
| 3 | Champ de Mars | 45% |
| 4 | Jardin du Luxembourg | |
| 5 | Tuileries Garden | 0% |
| 6 | Notre-Dame Cathedral of Paris | 0% |
| 7 | Palais Garnier | |
| 8 | Sainte-Chapelle | 0% |
| 9 | Place des Vosges | 17% |
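
One way a consumer-facing layer might use these values is to rank attractions from quietest to busiest, pushing entries without data to the end. The sketch below uses four of the readings shown above. The `rank_by_busyness` helper is hypothetical and not part of the project.

```python
from typing import Optional


def rank_by_busyness(
    current: dict[str, Optional[int]],
) -> list[tuple[str, Optional[int]]]:
    """Sort attractions from quietest to busiest; entries without data go last."""
    return sorted(
        current.items(),
        key=lambda pair: (pair[1] is None, pair[1] if pair[1] is not None else 0),
    )


current_hour = {
    "Eiffel Tower": 0,
    "Champ de Mars": 45,
    "Jardin du Luxembourg": None,
    "Place des Vosges": 17,
}
ranking = rank_by_busyness(current_hour)
for name, pct in ranking:
    label = "N/A" if pct is None else f"{pct}%"
    print(f"{name}: {label}")
```

A reading of 0% late in the evening may simply mean the site is closed, so a real product would probably combine this ranking with opening hours.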


## Results and Limitations
From an overall project results perspective, we are very satisfied with the final outcomes.
Following the implementation of the Google and TripAdvisor APIs, we successfully populated a solid SQL database. We felt that inserting a limited number of cities was sufficient for demonstration purposes, but the code is written in an interactive way such that, on demand, we can integrate our database with additional information for any desired city.
We are also pleased with our implementation of Selenium to retrieve crowdedness information. Fetching data directly from the SQL database makes the process leaner and allows users to access this information at a later time, which is key to the system's functionality. Users can first explore attractions and activities, then decide later whether to run the Selenium script for real-time data. Fetching this data can be time-intensive, since we deliberately slowed the process to avoid potential scraping blocks, taking a conservative approach.

This leads us to the project's current limitations.
- First, rapid fetching of attraction crowdedness would be essential in a working product. Given the pedagogical nature of this project, we did not feel it was necessary to optimize speed or implement techniques to parallelize the process. However, we recognize this flaw, which would be critical to address for a deliverable product.
- Second, crowdedness can only be checked for attractions in the current implementation. Since we use Google Maps to retrieve this information, we are limited to Google Maps attractions. Unfortunately, we cannot check activity crowdedness for TripAdvisor-sourced activities, as TripAdvisor does not supply users with such data.
- Additionally, the current implementation requires a VPN connected to a country whose official language is English in order to run the scraping script reliably. The Google Maps interface language still changes based on geolocation, and simply adjusting the WebDriver headers has not been enough to override this behavior. In future iterations, more robust workarounds could be introduced, but given the course context and prototype nature of the project, we consider the present solution acceptable.
- Finally, we must acknowledge the static nature of data inserted into the SQL database. For a more robust product, implementing a recurring update function would be fundamental. With the present structure, manually erasing all stored data and using the interactive UI to fetch it again would be necessary to 'update' the available information. This is costly, both computationally and time-wise. A better approach would be to implement regular checks to determine whether data has changed enough to warrant the expense of updating.
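
The regular check described in the last point could start from a simple age test on each cached row. The sketch below assumes every row carries a `fetched_at` timestamp, which the current schema may not have, and uses an arbitrary 30-day threshold. `is_stale` and `MAX_AGE` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=30)  # arbitrary illustrative threshold


def is_stale(fetched_at: Optional[datetime], now: Optional[datetime] = None,
             max_age: timedelta = MAX_AGE) -> bool:
    """True when a cached row has no timestamp or is older than max_age."""
    if fetched_at is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - fetched_at > max_age


# Fixed reference time so the demo is reproducible.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
fresh = datetime(2025, 1, 20, tzinfo=timezone.utc)
old = datetime(2024, 11, 1, tzinfo=timezone.utc)
print(is_stale(fresh, now), is_stale(old, now), is_stale(None, now))
```

With such a check in place, only cities whose rows are stale would trigger new API calls, instead of erasing and refetching the whole database.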


## Conclusion
As a proof of concept for a product that could facilitate travel planning for tourists worldwide, we consider ourselves satisfied. The combined use of long-term data storage with short-term data retrieval through APIs and the Selenium package proved to be a powerful synergy, allowing flexibility, ease of access, and surprisingly decent timeliness. 

Of course, we acknowledge the aforementioned limitations. However, scaling this project with more computing power, parallelization of processes across multiple (virtual) machines, and strategic implementation of powerful libraries such as the Requests library could improve timeliness and make such a product a deliverable reality.
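
As a small illustration of the parallelization idea, the sketch below runs several lookups concurrently with a thread pool. The `fetch_busyness` function is a stand-in that sleeps instead of driving a browser, and its return values are dummies. In a real version each worker would need its own WebDriver instance, since sharing one browser session across threads is unsafe.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def fetch_busyness(name: str) -> Optional[int]:
    """Stand-in for one Selenium lookup; sleeps to mimic page-load latency."""
    time.sleep(0.2)
    return len(name) % 100  # dummy value, not real busyness data


def scrape_in_parallel(names: list[str],
                       max_workers: int = 4) -> dict[str, Optional[int]]:
    """Run the lookups concurrently and return results keyed by attraction name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs names with their results.
        return dict(zip(names, pool.map(fetch_busyness, names)))


names = ["Eiffel Tower", "Louvre Museum", "Arc de Triomphe", "Champ de Mars"]
start = time.perf_counter()
results = scrape_in_parallel(names)
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Higher concurrency also raises the risk of the scraping blocks that the deliberately slow sequential approach was meant to avoid, so the worker count would need careful tuning.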

Furthermore, we cannot ignore the pedagogical benefits of this project. Working with APIs and a robust package like Selenium on mainstream platforms such as Google Maps and TripAdvisor gave us valuable insights into data retrieval, web scraping, and the ideation of a working product. These are skills that are difficult to develop in an educational setting. We believe this experience will be highly beneficial for our professional growth.


### AI Usage Statement

The conceptual design, system architecture, and implementation of this project were developed independently by the project team. All analytical content and report texts were written by the authors.

Artificial Intelligence tools (ChatGPT, Gemini, Perplexity) were used solely to polish written text and to assist with coding support during implementation. The project README file constitutes the only component fully generated by AI for documentation purposes.