# Step 1: Place Discovery

**Purpose:** Discover Google Maps place IDs for business locations across Moroccan cities

**What you'll learn:**
- How to use the DiscoveryEngine
- Multi-strategy search (with map center, without center, local search)
- Business-type-specific search queries
- Deduplication and canonical place_id resolution
- Output formats and data structure

**For Junior Developers:**
- Each cell is self-contained and can be run independently
- Clear outputs show what's happening at each step
- Test sections let you experiment safely with small data
- Works with any business type (banks, hotels, restaurants, etc.)

## Setup and Imports

In [None]:
# Add parent directory to path
import sys
from pathlib import Path

project_root = Path().resolve().parent
sys.path.insert(0, str(project_root / "src"))

print(f"Project root: {project_root}")
print(f"Python path updated")

In [None]:
# Import required libraries
import pandas as pd
import json
from datetime import datetime

# Import our modules
from review_analyzer.discover import DiscoveryEngine
from review_analyzer import config

print("All imports successful!")
print(f"\nAvailable map centers: {len(config.DEFAULT_MAP_CENTERS)} cities")
print(f"Sample cities: {list(config.DEFAULT_MAP_CENTERS.keys())[:5]}")

## Data Architecture Overview

The pipeline uses an organized folder structure:

```
data/
  00_config/        # Static configurations
    cities/         # City aliases, coordinates, regions.geojson
    templates/      # Business templates
  0_raw/            # Immutable source data
    discovery/      # Discovered places (timestamped folders)
  0_interim/        # Recomputable cache
    discovery/      # Discovery cache (place_id_cache.json)
  0_processed/      # Final outputs
    discovery/      # Processed agency lists
  0_analysis/       # Reports, figures, dashboards
  99_archive/       # Deprecated data

logs/               # Pipeline execution logs
```

**This notebook saves data to:**
- `data/0_processed/discovery/` - Final discovered places
- `data/0_interim/discovery/cache/` - Place ID resolution cache
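
The folder layout above can be captured in one small helper so that cells do not repeat path literals. This is a minimal sketch, assuming the folder names exactly as shown in the tree; `STAGES` and `stage_dir` are hypothetical names and are not part of `review_analyzer`.

```python
from pathlib import Path
import tempfile

# Folder names copied from the tree above; the helper itself is hypothetical.
STAGES = {
    "raw": "0_raw",
    "interim": "0_interim",
    "processed": "0_processed",
}

def stage_dir(project_root: Path, stage: str, step: str = "discovery") -> Path:
    """Return data/<stage folder>/<step>/ under project_root, creating it if needed."""
    path = project_root / "data" / STAGES[stage] / step
    path.mkdir(parents=True, exist_ok=True)
    return path

# Demo in a throwaway directory so nothing in the real project is touched.
demo_root = Path(tempfile.mkdtemp())
processed = stage_dir(demo_root, "processed")
cache = stage_dir(demo_root, "interim") / "cache"
print(processed.relative_to(demo_root))
print(cache.relative_to(demo_root))
```

In the notebook, the same helper would be called with the real `project_root` instead of the temporary directory.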


## Test 1: Initialize Discovery Engine

**What this does:** Creates a DiscoveryEngine instance that will handle API calls to Google Maps

**Expected output:** Confirmation that engine is ready

In [None]:
# Initialize the discovery engine
engine = DiscoveryEngine(debug=True)

print(" DiscoveryEngine initialized successfully!")
print(f" Debug mode: {engine.debug}")
print(f" Client ready: {engine.client is not None}")

In [None]:
# Compatibility: add city aliases to DEFAULT_MAP_CENTERS for accented/localized names
alias_map = {
    "Fès": "Fes",
    "Tanger": "Tangier",
    "Kénitra": "Kenitra",
}
for alias, canonical in alias_map.items():
    # Only alias cities whose canonical name has a configured map center
    if canonical in config.DEFAULT_MAP_CENTERS:
        config.DEFAULT_MAP_CENTERS[alias] = config.DEFAULT_MAP_CENTERS[canonical]
print("City aliases added (Fès -> Fes, Tanger -> Tangier, Kénitra -> Kenitra)")


## Test 2: Single Business, Single City (Simplest Case)

**What this does:** Discovers locations for ONE business in ONE city

**Use case:** Perfect for testing or debugging

**Business type parameter:** Helps refine search queries (e.g., "agence" for banks, "hotel" for hotels)

**Expected output:** CSV file with place_ids, addresses, and metadata
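
To make the role of the business type concrete, here is a rough sketch of how a type-specific keyword could be folded into a search string. The real query construction inside `DiscoveryEngine` is not shown in this notebook, so `TYPE_KEYWORDS` and `build_query` are hypothetical; only the "agence" and "hotel" keywords come from the description above.

```python
# Hypothetical mapping from business type to an extra search keyword.
# Only "agence" (banks) and "hotel" (hotels) come from the text above.
TYPE_KEYWORDS = {
    "bank": "agence",
    "hotel": "hotel",
}

def build_query(business: str, city: str, business_type: str) -> str:
    """Compose a search string such as 'CFG Bank agence Casablanca'."""
    keyword = TYPE_KEYWORDS.get(business_type, "")
    parts = [business, keyword, city]
    # Skip the keyword when the business type has no entry in the mapping
    return " ".join(p for p in parts if p)

query_bank = build_query("CFG Bank", "Casablanca", "bank")
query_other = build_query("KFC", "Rabat", "restaurant")
print(query_bank)
print(query_other)
```

An unknown business type simply falls back to a plain "business city" query, which keeps the sketch usable for types that have no keyword yet.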

In [None]:
# Test with single business and city (BANK EXAMPLE)
test_businesses = ["CFG Bank"]
test_business_type = "bank"
test_cities = ["Casablanca"]

# Output path (use new data architecture)
output_path = project_root / "data" / "0_processed" / "discovery" / f"test_single_discovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"Searching for: {test_businesses[0]} ({test_business_type}) in {test_cities[0]}")
print(f"Output will be saved to: {output_path.name}\n")

# Run discovery
df = engine.discover_branches(
 businesses=test_businesses,
 cities=test_cities,
 business_type=test_business_type,
 map_centers={"Casablanca": config.DEFAULT_MAP_CENTERS["Casablanca"]},
 brand_filter="CFG Bank",
 output_path=output_path
)

print("\n" + "="*60)
print("DISCOVERY COMPLETE")
print("="*60)
print(f" Total locations found: {len(df)}")
print(f" Output file: {output_path.name}")


In [None]:
# Load and preview results
df = pd.read_csv(output_path)

print(f"\nRESULTS PREVIEW")
print(f"="*60)
print(f"Total places found: {len(df)}")
print(f"Columns: {list(df.columns)}\n")

# Show first few results
print("First places:")
display(df.head())

# Show summary statistics
print(f"\nSUMMARY")
print(f" Unique place_ids: {df['place_id'].nunique()}")
if '_city' in df.columns:
 print(f" Cities: {df['_city'].unique().tolist()}")
if '_business' in df.columns:
 print(f" Businesses: {df['_business'].unique().tolist()}")
if 'business_type' in df.columns:
 print(f" Business types: {df['business_type'].unique().tolist()}")

## Test 3: Multiple Businesses, Single City

**What this does:** Discovers locations for MULTIPLE businesses in one city

**Use case:** Competitive analysis in a specific market (e.g., comparing banks in Rabat)

**Expected output:** Combined results for all businesses with comparison

In [None]:
# Test with multiple businesses, one city (BANK EXAMPLE)
test_businesses = [
 "Attijariwafa Bank",
 "BMCE Bank",
 "CIH Bank"
]
test_business_type = "bank"
test_cities = ["Rabat"]

output_path = project_root / "data" / "0_processed" / "discovery" / f"test_multi_business_discovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"Searching for {len(test_businesses)} {test_business_type}s in {test_cities[0]}:")
for business in test_businesses:
 print(f" - {business}")
print(f"\nOutput: {output_path.name}\n")

# Run discovery
df = engine.discover_branches(
 businesses=test_businesses,
 cities=test_cities,
 business_type=test_business_type,
 map_centers={"Rabat": config.DEFAULT_MAP_CENTERS["Rabat"]},
 brand_filter=None, # No filtering to get all results
 output_path=output_path
)

print("\n" + "="*60)
print(" DISCOVERY COMPLETE")
print("="*60)
print(f" Total locations: {len(df)}")
if '_business' in df.columns:
    print("\nBreakdown by business:")
    for business in test_businesses:
        count = len(df[df['_business'] == business])
        print(f"  {business}: {count} locations")


In [None]:
# Load and analyze results by business
df = pd.read_csv(output_path)

print(f"\n RESULTS BY BUSINESS")
print(f"="*60)

if '_business' in df.columns:
    business_counts = df.groupby('_business').size().sort_values(ascending=False)
    print("\nLocations per business:")
    for business, count in business_counts.items():
        print(f"  {business}: {count} locations")

    # Show a sample from each business (only columns present in this file)
    sample_cols = [c for c in ['_place_id', 'place_id', 'name', 'address'] if c in df.columns]
    print("\nSample from each business:")
    for business in business_counts.index:
        print(f"\n{business}:")
        display(df.loc[df['_business'] == business, sample_cols].head())
else:
    print("\nAll results:")
    display(df.head(10))

## Test 4: Different Business Types

**What this does:** Shows how to discover different types of businesses (hotels, restaurants, etc.)

**Use case:** Demonstrates the flexibility of the new business-type-aware system

**Expected output:** Examples with hotels and restaurants

In [None]:
# EXAMPLE 1: Hotels in Marrakech
print(" DISCOVERING HOTELS")
print("="*60)

test_businesses = ["Hilton", "Marriott"]
test_business_type = "hotel"
test_cities = ["Marrakech"]

output_path = project_root / "data" / "0_processed" / "discovery" / f"test_hotels_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f" Searching for hotels in {test_cities[0]}:")
for business in test_businesses:
 print(f" - {business}")

# Run discovery
df_hotels = engine.discover_branches(
 businesses=test_businesses,
 cities=test_cities,
 business_type=test_business_type,
 map_centers={"Marrakech": config.DEFAULT_MAP_CENTERS["Marrakech"]},
 output_path=output_path
)

print(f"\n Found {len(df_hotels)} hotel locations")
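# Per-business breakdown (supports the new 'business' and legacy '_business' columns)
business_col = 'business' if 'business' in df_hotels.columns else '_business'
if business_col in df_hotels.columns:
    for business in test_businesses:
        count = len(df_hotels[df_hotels[business_col] == business])
        print(f"  {business}: {count} locations")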

# Show sample
print("\nSample results:")
display(df_hotels[['business', 'name', 'address', 'lat', 'lng', 'place_id','rating', 'reviews_count']].head())


In [None]:
display(df_hotels.head())

In [None]:
# EXAMPLE 2: Restaurants in Casablanca
print("DISCOVERING RESTAURANTS")
print("="*60)

test_businesses = ["McDonald's", "KFC"]
test_business_type = "restaurant"
test_cities = ["Casablanca"]

output_path = project_root / "data" / "0_processed" / "discovery" / f"test_restaurants_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"Searching for restaurants in {test_cities[0]}:")
for business in test_businesses:
 print(f" - {business}")

# Run discovery
df_restaurants = engine.discover_branches(
 businesses=test_businesses,
 cities=test_cities,
 business_type=test_business_type,
 map_centers={"Casablanca": config.DEFAULT_MAP_CENTERS["Casablanca"]},
 output_path=output_path
)

print(f"\nFound {len(df_restaurants)} restaurant locations")
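# Per-business breakdown (supports the new 'business' and legacy '_business' columns)
business_col = 'business' if 'business' in df_restaurants.columns else '_business'
if business_col in df_restaurants.columns:
    for business in test_businesses:
        count = len(df_restaurants[df_restaurants[business_col] == business])
        print(f"  {business}: {count} locations")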

# Show sample
print("\nSample results:")
display(df_restaurants[['business', 'name', 'address', 'lat', 'lng', 'place_id','rating', 'reviews_count']].head())


## Test 5: Single Business, Multiple Cities

**What this does:** Discovers locations for one business across MULTIPLE cities

**Use case:** Mapping a business's geographic coverage

**Expected output:** Location distribution across cities with geographic analysis
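
The geographic analysis boils down to counting rows per city and turning the counts into shares. A small sketch with toy data (the city list is invented for illustration and stands in for the `_city` column of a real result):

```python
from collections import Counter

# Toy data standing in for the '_city' column of a discovery result.
cities = ["Casablanca"] * 6 + ["Rabat"] * 3 + ["Fès"] * 1

counts = Counter(cities)
total = sum(counts.values())

shares = {}
for city, count in counts.most_common():
    percentage = count / total * 100
    shares[city] = percentage
    bar = "#" * int(percentage / 5)  # one '#' per 5 percentage points
    print(f"{city:12} {count:3} {bar} {percentage:.1f}%")
```

The text bar is only a quick visual check; the bar charts later in the notebook serve the same purpose for larger runs.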

In [None]:
# Test with single business, multiple cities (BANK EXAMPLE)
test_businesses = ["Banque Populaire"]
test_business_type = "bank"
test_cities = [
 "Casablanca",
 "Rabat",
 "Marrakech",
 "Fès"
]

# Get map centers for selected cities
map_centers = {city: config.DEFAULT_MAP_CENTERS[city] for city in test_cities}

output_path = project_root / "data" / "0_processed" / "discovery" / f"test_multi_city_discovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"Searching for {test_businesses[0]} in {len(test_cities)} cities:")
for city in test_cities:
 print(f" - {city}")
print(f"\nOutput: {output_path.name}\n")

# Run discovery
df = engine.discover_branches(
 businesses=test_businesses,
 cities=test_cities,
 business_type=test_business_type,
 map_centers=map_centers,
 brand_filter="Banque Populaire",
 output_path=output_path
)

print("\n" + "="*60)
print("DISCOVERY COMPLETE")
print("="*60)
print(f" Total locations: {len(df)}")


In [None]:
# Analyze geographic distribution
df = pd.read_csv(output_path)

print(f"\nGEOGRAPHIC DISTRIBUTION")
print(f"="*60)

if '_city' in df.columns:
    city_counts = df.groupby('_city').size().sort_values(ascending=False)
    print("\nLocations per city:")
    for city, count in city_counts.items():
        percentage = (count / len(df)) * 100
        bar = '#' * int(percentage / 5)  # one '#' per 5 percentage points
        print(f"  {city:15} {count:3} locations {bar} {percentage:.1f}%")

    # Show a sample from each city (only columns present in this file)
    sample_cols = [c for c in ['_place_id', 'place_id', 'name', 'address'] if c in df.columns]
    print("\nSample locations from each city:")
    for city in city_counts.index:
        print(f"\n{city}:")
        display(df.loc[df['_city'] == city, sample_cols].head())
else:
    print("\nAll results:")
    display(df.head(10))

## Test 6: Full Production Run (Multiple Businesses × Multiple Cities)

**What this does:** Production-scale discovery across all major businesses and cities

**Use case:** Complete market mapping (can be used for any business type)

**Warning:** This makes many API calls and may take several minutes!
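
Before launching a full run, it helps to see how quickly the workload grows. The sketch below enumerates the (business, city) pairs for a shortened list; it assumes at least one API request per pair, so the pair count is a lower bound on the number of calls rather than an exact figure.

```python
from itertools import product

businesses = ["Attijariwafa Bank", "BMCE Bank", "CIH Bank"]
cities = ["Casablanca", "Rabat"]

# Every (business, city) pair is searched. Assuming at least one request
# per pair, len(pairs) is a lower bound on the number of API calls.
pairs = list(product(businesses, cities))
print(f"{len(pairs)} combinations")
for business, city in pairs[:3]:
    print(f"  {business} in {city}")
```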

In [None]:
# Production configuration (BANK EXAMPLE)
production_businesses = [
 "Attijariwafa Bank",
 "BMCE Bank",
 "CIH Bank",
 "Banque Populaire",
 "Société Générale Maroc"
]

production_business_type = "bank"

production_cities = [
 "Casablanca",
 "Rabat",
 "Marrakech",
 "Fès",
 "Tanger",
 "Agadir"
]

# Get map centers
map_centers = {city: config.DEFAULT_MAP_CENTERS[city] for city in production_cities if city in config.DEFAULT_MAP_CENTERS}

output_path = project_root / "data" / "0_processed" / "discovery" / f"production_discovery_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"PRODUCTION RUN")
print(f"="*60)
print(f" Businesses: {len(production_businesses)}")
print(f" Cities: {len(production_cities)}")
print(f" Total combinations: {len(production_businesses) * len(production_cities)}")
print(f"\n This will take several minutes...\n")
print(f"Output: {output_path.name}\n")

# Uncomment to run
# df = engine.discover_branches(
#     businesses=production_businesses,
#     cities=production_cities,
#     business_type=production_business_type,
#     map_centers=map_centers,
#     brand_filter=None,
#     output_path=output_path
# )


## Test 6: Inspect and Validate Results

**What this does:** Quality checks on discovered data

**Checks:**
- Duplicate place_ids
- Missing data
- Place_id format validation
- Data completeness
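
The checks listed above can be tried on a handful of records before running them on a real file. This is a sketch only: the regular expression assumes that canonical IDs start with `ChIJ` followed by URL-safe characters, while the actual rule lives in `config.validate_place_id`; `quality_report` is a hypothetical helper and the rows are invented.

```python
import re

# Assumption: canonical place IDs start with "ChIJ" followed by URL-safe
# characters. The real rule lives in config.validate_place_id.
PLACE_ID_RE = re.compile(r"^ChIJ[A-Za-z0-9_-]+$")

def quality_report(rows, id_key="place_id"):
    """Count duplicate, missing, and badly formatted IDs in a list of dicts."""
    ids = [row.get(id_key) for row in rows]
    present = [i for i in ids if i]
    return {
        "total": len(rows),
        "missing": len(ids) - len(present),
        "duplicates": len(present) - len(set(present)),
        "invalid": sum(1 for i in set(present) if not PLACE_ID_RE.match(i)),
    }

rows = [
    {"place_id": "ChIJexample_one", "name": "Branch A"},
    {"place_id": "ChIJexample_one", "name": "Branch A (duplicate)"},
    {"place_id": "0x123:0x456", "name": "Unresolved ID"},
    {"place_id": None, "name": "No ID"},
]
report = quality_report(rows)
print(report)
```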

In [None]:
# Load most recent discovery file (check new data architecture first)
output_dir = project_root / "data" / "0_processed" / "discovery"
if not output_dir.exists():
    output_dir = project_root / "data" / "output"  # Fallback to legacy

# Newest first by modification time; file names start with different prefixes,
# so sorting by name would not put the latest run first
discovery_files = sorted(
    output_dir.glob("*discovery*.csv"),
    key=lambda p: p.stat().st_mtime,
    reverse=True,
)

if discovery_files:
    latest_file = discovery_files[0]
    print(f"Loading: {latest_file.name}\n")

    df = pd.read_csv(latest_file)

    print("DATA QUALITY CHECKS")
    print("=" * 60)

    # Choose best ID column
    id_col = (
        'canonical_place_id'
        if ('canonical_place_id' in df.columns and df['canonical_place_id'].notna().any())
        else 'place_id' if 'place_id' in df.columns else None
    )

    # 1. Check for duplicates
    print("\n1. Duplicate Check:")
    if id_col:
        duplicates = df[id_col].duplicated().sum()
        if duplicates > 0:
            print(f"  Found {duplicates} duplicate place_ids")
            print(f"  Duplicated IDs: {df[df[id_col].duplicated()][id_col].tolist()}")
        else:
            print("  No duplicates found")
    else:
        print("  No ID column available")

    # 2. Check for missing data
    print("\n2. Missing Data Check:")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print("  Columns with missing values:")
        for col, count in missing[missing > 0].items():
            print(f"    {col}: {count} missing ({count / len(df) * 100:.1f}%)")
    else:
        print("  No missing data")

    # 3. Validate place_id format
    print("\n3. Place ID Format Check:")
    if id_col:
        non_null_ids = df[id_col].dropna()
        invalid_ids = non_null_ids[~non_null_ids.astype(str).apply(config.validate_place_id)]
        if len(invalid_ids) > 0:
            print(f"  Found {len(invalid_ids)} invalid place_ids")
            print(f"  Invalid IDs: {invalid_ids.astype(str).tolist()[:5]}")
        else:
            print("  All place_ids valid (ChIJ format)")
    else:
        print("  Skipped: no ID column found")

    # 4. Data completeness
    print("\n4. Data Completeness:")
    print(f"  Total records: {len(df)}")
    if id_col:
        print(f"  Unique place_ids: {df[id_col].nunique()}")

    # Support both old and new column names
    business_col = 'business' if 'business' in df.columns else '_business' if '_business' in df.columns else None
    city_col = 'city' if 'city' in df.columns else '_city' if '_city' in df.columns else None

    if business_col:
        businesses = df[business_col].dropna().unique()
        print(f"  Businesses: {len(businesses)} ({', '.join(map(str, businesses[:5]))})")
    if city_col:
        cities = df[city_col].dropna().unique()
        print(f"  Cities: {len(cities)} ({', '.join(map(str, cities[:5]))})")

    # 5. Show sample
    print("\n5. Sample Records:")
    display(df.head())

else:
    print("No discovery files found. Run a test first!")


## Test 7: Visualize Discovery Results

**What this does:** Create visualizations of discovered branches

**Outputs:**
- Bar chart: Branches per city
- Bar chart: Branches per business
- Heatmap: Business coverage by city
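
The heatmap is drawn from a business-by-city count matrix, which is the kind of frequency table `pd.crosstab` produces. To show what that matrix contains, here is a standard-library sketch on invented (business, city) pairs:

```python
from collections import Counter

# Toy (business, city) pairs standing in for two columns of the results.
records = [
    ("CIH Bank", "Rabat"), ("CIH Bank", "Rabat"), ("CIH Bank", "Casablanca"),
    ("BMCE Bank", "Casablanca"), ("BMCE Bank", "Casablanca"),
]

pair_counts = Counter(records)
businesses = sorted({b for b, _ in records})
cities = sorted({c for _, c in records})

# Rows are businesses, columns are cities; missing pairs count as 0.
matrix = [[pair_counts[(b, c)] for c in cities] for b in businesses]

print(f"{'':12}" + "".join(f"{c:>12}" for c in cities))
for b, row in zip(businesses, matrix):
    print(f"{b:12}" + "".join(f"{n:>12}" for n in row))
```

A zero in the matrix means no branch was discovered for that pair, which can reflect either a real coverage gap or a search that returned nothing.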

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load latest discovery file
if discovery_files:
    df = pd.read_csv(discovery_files[0])

    print("VISUALIZATIONS")
    print("=" * 60)
    print(f"\nData: {discovery_files[0].name}")
    print(f"Total branches: {len(df)}\n")

    # Support both old and new column names
    city_col = next((c for c in ('city', '_city') if c in df.columns), None)
    business_col = next((c for c in ('business', '_business', '_bank') if c in df.columns), None)

    # 1. Branches per city
    if city_col:
        fig, ax = plt.subplots(figsize=(10, 6))
        city_counts = df[city_col].value_counts()
        city_counts.plot(kind='bar', ax=ax, color='steelblue')
        ax.set_title('Branches per City', fontsize=14, fontweight='bold')
        ax.set_xlabel('City', fontsize=12)
        ax.set_ylabel('Number of Branches', fontsize=12)
        ax.grid(axis='y', alpha=0.3)
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()

    # 2. Branches per business
    if business_col:
        fig, ax = plt.subplots(figsize=(10, 6))
        business_counts = df[business_col].value_counts()
        business_counts.plot(kind='barh', ax=ax, color='coral')
        ax.set_title('Branches per Business', fontsize=14, fontweight='bold')
        ax.set_xlabel('Number of Branches', fontsize=12)
        ax.set_ylabel('Business', fontsize=12)
        ax.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()

    # 3. Heatmap: Business × City coverage
    if business_col and city_col:
        coverage = pd.crosstab(df[business_col], df[city_col])
        fig, ax = plt.subplots(figsize=(12, 6))
        sns.heatmap(coverage, annot=True, fmt='d', cmap='YlOrRd', ax=ax)
        ax.set_title('Business Coverage by City (Number of Branches)', fontsize=14, fontweight='bold')
        ax.set_xlabel('City', fontsize=12)
        ax.set_ylabel('Business', fontsize=12)
        plt.tight_layout()
        plt.show()

        print("\nCoverage Matrix:")
        display(coverage)
else:
    print("No discovery files found. Run a test first!")


## Export for Next Step

**What this does:** Prepares discovery results for Step 2 (Review Collection)

**Output:** Clean CSV file ready for the next notebook
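
When a result file carries several ID columns, the export needs one ID per record. A possible fallback order is canonical ID first, then the raw place ID, then the data ID as a last resort. In this sketch the `ChIJ` prefix test is an assumed stand-in for `config.validate_place_id`, and the rows are invented.

```python
def choose_id(row, is_valid=lambda pid: pid.startswith("ChIJ")):
    """Pick canonical_place_id, then place_id, then data_id as a last resort."""
    for key in ("canonical_place_id", "place_id"):
        value = row.get(key)
        if isinstance(value, str) and is_valid(value):
            return value
    data_id = row.get("data_id")
    if data_id is not None and str(data_id).strip():
        return str(data_id)
    return None

rows = [
    {"canonical_place_id": "ChIJcanonical", "place_id": "ChIJraw"},
    {"canonical_place_id": None, "place_id": "ChIJraw"},
    {"place_id": "not-valid", "data_id": "0xabc:0xdef"},
    {"place_id": None, "data_id": "  "},
]
chosen = [choose_id(r) for r in rows]
print(chosen)
```

Records for which no usable ID is found come back as `None` and can be dropped before saving.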

In [None]:
# Export for next step (legacy column names)
if discovery_files:
    # Pick the newest file by modification time, regardless of list order
    latest_file = max(discovery_files, key=lambda p: p.stat().st_mtime)
    df = pd.read_csv(latest_file)

    # Clean and prepare: keep whichever expected columns exist
    required_columns = ['_place_id', '_bank', '_city', 'title', 'address']
    available_columns = [col for col in required_columns if col in df.columns]

    df_clean = df[available_columns]
    if '_place_id' in df_clean.columns:
        df_clean = df_clean.drop_duplicates(subset=['_place_id'])

    # Save for next step
    next_step_file = project_root / "data" / "output" / "agencies_for_collection.csv"
    next_step_file.parent.mkdir(parents=True, exist_ok=True)
    df_clean.to_csv(next_step_file, index=False)

    print("EXPORT COMPLETE")
    print("=" * 60)
    print(f"  File: {next_step_file.name}")
    print(f"  Records: {len(df_clean)}")
    print(f"  Columns: {list(df_clean.columns)}")
    print("\nReady for Step 2: Review Collection!")
    print("  Open: collect_reviews.ipynb")
else:
    print("No discovery files found. Run a test first!")

In [None]:
# UPDATED: Load and preview results (aligns with new discover.py schema)

df = pd.read_csv(output_path)

print("\nRESULTS PREVIEW (updated)")
print("="*60)
print(f"Total places found: {len(df)}")
print(f"Columns: {list(df.columns)}\n")

# Show first few results with key columns when available
print("First places:")
key_cols = [c for c in [
 'business', 'name', 'address', 'city', 'place_id',
 'canonical_place_id', 'data_id', 'rating', 'reviews_count',
 '_engine', 'resolve_status'
] if c in df.columns]

display(df[key_cols].head() if key_cols else df.head())

# Summary (prefer canonical ids when present)
id_col = (
 'canonical_place_id'
 if ('canonical_place_id' in df.columns and df['canonical_place_id'].notna().any())
 else 'place_id' if 'place_id' in df.columns else None
)

print("\nSUMMARY (updated)")
if id_col:
 print(f" Unique IDs ({id_col}): {df[id_col].nunique()}")
else:
 print(" Unique IDs: N/A")

if 'city' in df.columns:
 print(f" Cities: {df['city'].unique().tolist()}")
if 'business' in df.columns:
 print(f" Businesses: {df['business'].unique().tolist()}")
if 'business_type' in df.columns:
 print(f" Business types: {df['business_type'].unique().tolist()}")


In [None]:
# UPDATED: Analyze results by business (aligns with new columns)

df = pd.read_csv(output_path)

print("\nRESULTS BY BUSINESS (updated)")
print("="*60)

group_col = 'business' if 'business' in df.columns else None
id_col = (
 'canonical_place_id'
 if ('canonical_place_id' in df.columns and df['canonical_place_id'].notna().any())
 else 'place_id' if 'place_id' in df.columns else None
)

if group_col:
    business_counts = df.groupby(group_col).size().sort_values(ascending=False)
    print("\nLocations per business:")
    for business, count in business_counts.items():
        print(f"  {business}: {count} locations")

    # Show only the columns that exist in this file
    cols = [c for c in [id_col, 'name', 'address'] if c and c in df.columns]
    print("\nSample from each business:")
    for business_name in business_counts.index:
        sample = df[df[group_col] == business_name].head()
        print(f"\n{business_name}:")
        display(sample[cols] if cols else sample)
else:
    print("\nAll results:")
    display(df.head(10))


In [None]:
# UPDATED: Inspect and validate results (new schema + tolerant to unresolved IDs)

# Load most recent discovery file (new data architecture first, then legacy)
output_dir = project_root / "data" / "0_processed" / "discovery"
if not output_dir.exists():
    output_dir = project_root / "data" / "output"

# Oldest first by modification time, so the latest run is the last entry
discovery_files = sorted(output_dir.glob("*discovery*.csv"), key=lambda p: p.stat().st_mtime)

if discovery_files:
    latest_file = discovery_files[-1]
    print(f"Loading: {latest_file.name}\n")

    df = pd.read_csv(latest_file)

    print("DATA QUALITY CHECKS (updated)")
    print("=" * 60)

    # Choose best id column available
    id_col = (
        'canonical_place_id'
        if ('canonical_place_id' in df.columns and df['canonical_place_id'].notna().any())
        else 'place_id' if 'place_id' in df.columns else None
    )

    # 1. Check for duplicates
    print("\n1. Duplicate Check:")
    if id_col:
        duplicates = df[id_col].duplicated().sum()
        print(f"  Duplicates in {id_col}: {duplicates}")
    else:
        print("  No ID column available for duplicate check")

    # 2. Missing Data Check
    print("\n2. Missing Data Check:")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print("  Columns with missing values:")
        for col, count in missing[missing > 0].items():
            print(f"    {col}: {count} missing ({count / len(df) * 100:.1f}%)")
    else:
        print("  No missing data")

    # 3. Place ID Format Check (canonical format only)
    print("\n3. Place ID Format Check:")
    if id_col:
        non_null_ids = df[id_col].dropna()
        invalid_ids = non_null_ids[~non_null_ids.astype(str).apply(config.validate_place_id)]
        if len(invalid_ids) > 0:
            print(f"  Found {len(invalid_ids)} invalid IDs in {id_col}")
            print(f"  Sample invalid: {invalid_ids.astype(str).tolist()[:5]}")
        else:
            print(f"  All non-null IDs in {id_col} match canonical format (ChIJ...)")
    else:
        print("  Skipped: no ID column found")

    # 4. Data completeness
    print("\n4. Data Completeness:")
    print(f"  Total records: {len(df)}")
    if id_col:
        print(f"  Unique IDs: {df[id_col].nunique()}")
    if 'business' in df.columns:
        names = ', '.join(map(str, df['business'].dropna().unique()[:10]))
        print(f"  Businesses: {df['business'].nunique()} ({names}...)")
    if 'city' in df.columns:
        names = ', '.join(map(str, df['city'].dropna().unique()[:10]))
        print(f"  Cities: {df['city'].nunique()} ({names}...)")

    # 5. Show sample
    print("\n5. Sample Records:")
    sample_cols = [c for c in [id_col, 'business', 'city', 'name', 'address'] if c and c in df.columns]
    display(df[sample_cols].head() if sample_cols else df.head())
else:
    print("No discovery files found. Run a test first!")


In [None]:
# UPDATED: Visualize discovery results (city/business columns)

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load latest discovery file (newest by modification time, regardless of list order)
if discovery_files:
    latest_file = max(discovery_files, key=lambda p: p.stat().st_mtime)
    df = pd.read_csv(latest_file)

    print("VISUALIZATIONS (updated)")
    print("=" * 60)
    print(f"\nData: {latest_file.name}")
    print(f"Total branches: {len(df)}\n")

    # 1. Branches per city
    if 'city' in df.columns and not df['city'].isna().all():
        fig, ax = plt.subplots(figsize=(10, 6))
        city_counts = df['city'].value_counts()
        city_counts.plot(kind='bar', ax=ax, color='steelblue')
        ax.set_title('Branches per City', fontsize=14, fontweight='bold')
        ax.set_xlabel('City', fontsize=12)
        ax.set_ylabel('Number of Branches', fontsize=12)
        ax.grid(axis='y', alpha=0.3)
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()

    # 2. Branches per business
    if 'business' in df.columns and not df['business'].isna().all():
        fig, ax = plt.subplots(figsize=(10, 6))
        business_counts = df['business'].value_counts()
        business_counts.plot(kind='barh', ax=ax, color='coral')
        ax.set_title('Branches per Business', fontsize=14, fontweight='bold')
        ax.set_xlabel('Number of Branches', fontsize=12)
        ax.set_ylabel('Business', fontsize=12)
        ax.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()

    # 3. Heatmap: Business × City coverage
    if 'business' in df.columns and 'city' in df.columns:
        coverage = pd.crosstab(df['business'], df['city'])
        fig, ax = plt.subplots(figsize=(12, 6))
        sns.heatmap(coverage, annot=True, fmt='d', cmap='YlOrRd', ax=ax)
        ax.set_title('Business Coverage by City (Number of Branches)', fontsize=14, fontweight='bold')
        ax.set_xlabel('City', fontsize=12)
        ax.set_ylabel('Business', fontsize=12)
        plt.tight_layout()
        plt.show()

        print("\nCoverage Matrix:")
        display(coverage)
else:
    print("No discovery files found. Run a test first!")


In [None]:
# Export for next step (transform or collection)
if discovery_files:
    # Use the newest file by modification time, regardless of list order
    latest_file = max(discovery_files, key=lambda p: p.stat().st_mtime)
    df = pd.read_csv(latest_file)

    # Choose best id per row: canonical_place_id > place_id > data_id (as string)
    def choose_id(row):
        cid = row.get('canonical_place_id')
        if isinstance(cid, str) and config.validate_place_id(cid):
            return cid
        pid = row.get('place_id')
        if isinstance(pid, str) and config.validate_place_id(pid):
            return pid
        did = row.get('data_id')
        if pd.notna(did) and str(did).strip() != "":
            return str(did)
        return None

    ids = df.apply(choose_id, axis=1)

    # Support both old and new column names
    business_col = 'business' if 'business' in df.columns else '_business' if '_business' in df.columns else None
    city_col = 'city' if 'city' in df.columns else '_city' if '_city' in df.columns else None

    df_out = pd.DataFrame({
        '_place_id': ids,
        '_business': df[business_col] if business_col else '',
        '_city': df[city_col] if city_col else '',
        'title': df.get('name'),
        'address': df.get('address')
    })

    # Drop rows without a usable ID, then remove duplicate IDs
    df_out = df_out.dropna(subset=['_place_id']).drop_duplicates(subset=['_place_id'])

    # Save to new data architecture
    next_step_file = project_root / 'data' / '0_processed' / 'discovery' / 'agencies_discovered.csv'
    next_step_file.parent.mkdir(parents=True, exist_ok=True)
    df_out.to_csv(next_step_file, index=False)

    # Also save to legacy path for backward compatibility
    legacy_file = project_root / 'data' / 'output' / 'agencies_for_collection.csv'
    legacy_file.parent.mkdir(parents=True, exist_ok=True)
    df_out.to_csv(legacy_file, index=False)

    print("EXPORT COMPLETE")
    print("=" * 60)
    print(f"  Primary: {next_step_file.relative_to(project_root)}")
    print(f"  Legacy: {legacy_file.relative_to(project_root)}")
    print(f"  Records: {len(df_out)}")
    print(f"  Columns: {list(df_out.columns)}")
    print("\nData saved to: data/0_processed/discovery/")
    print("\nNext steps:")
    print("  1. Collect reviews for discovered places")
    print("     Open: collect_reviews.ipynb")
    print("  2. Or run full pipeline:")
    print("     python -m review_analyzer.main pipeline --businesses 'Bank' --cities 'City'")
else:
    print("No discovery files found. Run a test first!")


## Summary for New Developers

**What you learned:**

1. **DiscoveryEngine** - The main class for finding business locations on Google Maps
2. **Multi-strategy search** - Uses 3 approaches: map center, no center, local search
3. **Canonical place IDs** - Automatic resolution to ChIJ format for reliable review collection
4. **Deduplication** - Automatically removes duplicate place_ids
5. **Business-type awareness** - Tailored queries for banks, hotels, restaurants, etc.
6. **Brand filtering** - Optional filtering by business name
7. **Data validation** - How to check data quality
8. **New data architecture** - Organized folder structure for better data management
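
As a recap of how multi-strategy search and deduplication fit together, the sketch below runs several search functions in order and keeps the first record seen for each place ID. The engine's internals are not shown in this notebook, so the three strategy functions and `merge_strategies` are hypothetical stand-ins that return invented results.

```python
def merge_strategies(strategies, query):
    """Run each strategy in order and keep the first record seen per place_id."""
    merged = {}
    for name, search in strategies:
        for place in search(query):
            # setdefault keeps the earliest record, so earlier strategies win
            merged.setdefault(place["place_id"], {**place, "_strategy": name})
    return list(merged.values())

# Hypothetical stand-ins for the three search modes named above.
def with_center(query):
    return [{"place_id": "ChIJa", "name": "Branch A"},
            {"place_id": "ChIJb", "name": "Branch B"}]

def without_center(query):
    return [{"place_id": "ChIJb", "name": "Branch B"},
            {"place_id": "ChIJc", "name": "Branch C"}]

def local_search(query):
    return [{"place_id": "ChIJa", "name": "Branch A"}]

strategies = [
    ("with_center", with_center),
    ("without_center", without_center),
    ("local", local_search),
]
places = merge_strategies(strategies, "CIH Bank Rabat")
print([(p["place_id"], p["_strategy"]) for p in places])
```

Recording which strategy first found each place makes it easier to judge later whether a strategy contributes results the others miss.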

**Key takeaways:**
- Start with small tests (1 business, 1 city)
- Scale up gradually
- Always validate results
- Visualize data to understand coverage
- Data is saved to `data/0_processed/discovery/` for next pipeline stage
- Cache is maintained in `data/0_interim/discovery/cache/` for efficiency

**Data Flow (New Architecture):**
```
discover  ->  collect     ->  transform  ->  classify
   ↓             ↓               ↓              ↓
0_raw/       0_interim/      0_interim/     0_processed/
discovery    collection      transform      classification
```

**Next steps:**
1. You've discovered place_ids (Step 1: discover_placeids.ipynb - this notebook!)
2. Collect reviews for discovered places (Step 2: collect_reviews.ipynb)
3. Transform reviews - normalize fields, add regions (CLI: `python -m review_analyzer.main transform`)
4. Classify reviews to extract topics and sentiments (notebook: classify_reviews.ipynb)

**Pipeline Command (Run all steps):**
```bash
python -m review_analyzer.main pipeline \
 --businesses "Attijariwafa Bank" \
 --cities "Casablanca" \
 --business-type "bank"
```

**Troubleshooting:**
- No results? Check business name spelling matches Google Maps
- API errors? Check .env file has SERPAPI_API_KEY set
- Duplicates? The engine should handle them automatically
- Missing canonical IDs? Engine resolves them via place details API
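
For the API error case, a quick check of the environment can rule out a missing key. This sketch assumes the values from `.env` have already been loaded into the process environment; `api_key_status` is a hypothetical helper, and it reports only the key length so the secret never appears in notebook output.

```python
import os

def api_key_status(env=os.environ, name="SERPAPI_API_KEY"):
    """Return a short diagnostic string without printing the key itself."""
    value = env.get(name, "").strip()
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"

print(api_key_status())
# Passing a dict makes the helper easy to try without touching the real environment.
print(api_key_status({"SERPAPI_API_KEY": "abc123"}))
```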

Happy discovering! 
