# OSRM Project — Combined Workflow

This notebook combines the functionality from the `osrm_project` Python scripts into a single, linear notebook you can run start-to-finish.

Sections:
0. Setup / imports (including a one-off facility reclassification fix)
1. Basic facility checks
2. (Optional) Build OSRM table-based edges (requires OSRM server)
3. Pivot edges into matrices
4. Labeling and duration upper-bound
5. Analysis & visualizations
6. Smoke test example
7. Full pipeline (end-to-end run)

Notes: the notebook will *not* attempt to install packages automatically. If missing packages are reported, run `pip install -r requirements.txt` from the `osrm_project` folder. If you don't have an OSRM instance available, set `run_osrm = False` in the build step and the notebook will skip remote requests and continue where possible.
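
One way to see which packages are missing before running the cells below is to probe for them without importing them. This is a minimal sketch, assuming the list mirrors the imports in the setup cell; the `missing_packages` helper is hypothetical and not part of `osrm_project`.

```python
import importlib.util

# Packages imported by the setup cell; adjust to match requirements.txt.
REQUIRED = ["pandas", "numpy", "requests", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Run: pip install -r requirements.txt")
else:
    print("All required packages are importable.")
```

Nothing is installed by this check; it only reports what `pip install -r requirements.txt` would need to provide.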

In [76]:
# Setup: imports and helpers
import os
import sys
import json
import time
from typing import List, Tuple
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Change to the workspace root (where facilities_with_warehouses.csv is located)
os.chdir('/Users/elee/Documents/GitHub/thesiscode2026')

# simple utility: chunk iterator
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield i, lst[i:i+n]

print("Python version:", sys.version.splitlines()[0])
print("Working directory:", os.getcwd())
print("pandas version:", pd.__version__)
print("Files in current directory:", os.listdir('.')[:15])  # List first 15 files


Python version: 3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ]
Working directory: /Users/elee/Documents/GitHub/thesiscode2026
pandas version: 2.3.3
Files in current directory: ['nearest_facility.ipynb', 'facilities_with_warehouses.csv', 'priority_antimicrobials_cleaned.numbers', '.DS_Store', 'botswana_population_age_breakdown.csv', 'requirements.txt', 'priority_antimicrobials_estimates.numbers', 'botswanacensusmicrodata.csv', 'duration_matrix.csv', 'facility_id_lookup.csv', 'botswana_geocode', 'README.md', 'repo_trash', '.gitignore', 'osrm_project']


In [92]:
# Fix misclassified facilities: Clinics with "Health Post" in name should be Health Posts
import pandas as pd
import os
import shutil
from datetime import datetime

fac_path = 'facilities_with_warehouses.csv'
if not os.path.exists(fac_path):
    print(f'Warning: {fac_path} not found; skipping Health Post reclassification.')
else:
    fac = pd.read_csv(fac_path)
    
    # Find facility name column
    name_col = None
    for col in fac.columns:
        if 'facility' in col.lower() and 'name' in col.lower():
            name_col = col
            break
    
    if name_col is None:
        print('Warning: Could not find facility name column. Columns:', list(fac.columns))
    elif 'Service Delivery Type' not in fac.columns:
        print('Warning: Service Delivery Type column not found.')
    else:
        # Count before
        before_clinic = (fac['Service Delivery Type'] == 'Clinic').sum()
        before_hp = (fac['Service Delivery Type'] == 'Health Post').sum()
        
        # Find clinics with "Health Post" in name and reassign
        mask = (fac['Service Delivery Type'] == 'Clinic') & (fac[name_col].str.contains('Health Post', case=False, na=False))
        reassigned_count = mask.sum()
        
        if reassigned_count > 0:
            fac.loc[mask, 'Service Delivery Type'] = 'Health Post'
            
            # Create backup before writing
            backup_path = fac_path + f'.pre_healthpost_fix.{datetime.now().strftime("%Y%m%dT%H%M%S")}.bak'
            shutil.copy(fac_path, backup_path)
            print(f'Created backup: {backup_path}')
            
            # Write corrected file
            fac.to_csv(fac_path, index=False)
            
            # Report changes
            after_clinic = (fac['Service Delivery Type'] == 'Clinic').sum()
            after_hp = (fac['Service Delivery Type'] == 'Health Post').sum()
            print(f'Reassigned {reassigned_count} clinics with "Health Post" in name to Health Post')
            print(f'  Clinics: {before_clinic} → {after_clinic}')
            print(f'  Health Posts: {before_hp} → {after_hp}')
            print(f'Wrote corrected facilities to {fac_path}')
        else:
            print('No clinics with "Health Post" in name found; no changes needed.')


Created backup: facilities_with_warehouses.csv.pre_healthpost_fix.20251208T135250.bak
Reassigned 18 clinics with "Health Post" in name to Health Post
  Clinics: 194 → 176
  Health Posts: 310 → 328
Wrote corrected facilities to facilities_with_warehouses.csv


In [77]:
# 1) Basic facility checks (adapted from check_facilities.py)
FAC_CSV = 'facilities_with_warehouses.csv'
if not os.path.exists(FAC_CSV):
    print(f'File not found: {FAC_CSV} — please place it in the workspace root.')
else:
    fac = pd.read_csv(FAC_CSV)
    print('Columns:', list(fac.columns))
    print('Rows:', len(fac))
    if 'Latitude' in fac.columns and 'Longitude' in fac.columns:
        print('Nulls in Latitude/Longitude:', fac['Latitude'].isna().sum(), fac['Longitude'].isna().sum())
    else:
        print('No Latitude/Longitude columns found; please check your CSV.')
    if 'Is_Warehouse' in fac.columns:
        print('Warehouses:', (fac['Is_Warehouse'] == True).sum())
        print('Facilities:', (fac['Is_Warehouse'] == False).sum())
    display(fac.head(5))


Columns: ['Old Facility Code', 'New Facility Code', 'Facility Name', 'District', 'DHMT', 'Latitude', 'Longitude', 'Telephone', 'Service Delivery Type', 'Facility Owner', 'Facility Status', 'Is_Warehouse']
Rows: 636
Nulls in Latitude/Longitude: 0 0
Warehouses: 22
Facilities: 614


| | Old Facility Code | New Facility Code | Facility Name | District | DHMT | Latitude | Longitude | Telephone | Service Delivery Type | Facility Owner | Facility Status | Is_Warehouse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/3/10 | 972228-1 | Airstrip Clinic | Mahalapye | Mahalapye | -23.115351 | 26.824311 | +267 4710009/74166629 | Clinic | GOVERNMENT | OPERATIONAL | False |
| 1 | 16-4-02 | 713823-3 | Area L Health Post | Greater Francistown | Greater Francistown | -21.160365 | 27.517292 | +267 2470046 | Health Post | GOVERNMENT | OPERATIONAL | False |
| 2 | 16-4-03 | 611993-7 | Area S Health Post | Greater Francistown | Greater Francistown | -21.156493 | 27.497482 | +267 2470047 | Health Post | GOVERNMENT | OPERATIONAL | False |
| 3 | 9/3/01 | 153397-5 | Artesia Clinic | Kgatleng | Kgatleng | -24.013836 | 26.320154 | +267 5729710 | Clinic with Maternity | GOVERNMENT | OPERATIONAL | False |
| 4 | 7879798 | 419322-3 | BAC Clinic | Greater Gaborone | Greater Gaborone | -24.68 | 25.92 | +267 3953062 | Clinic | GOVERNMENT | OPERATIONAL | False |


## 2) Optional: Build edges with OSRM Table API

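This step issues OSRM `/table` requests to compute pairwise durations and distances. The `build_edges_all_pairs()` function below queries every facility against every other facility (warehouses included), chunking both sources and destinations.
Set `run_osrm=True` and ensure `osrm_url` points to your OSRM instance (e.g., `http://localhost:5001`). With `run_osrm=False` the function skips all requests and returns an empty DataFrame; if OSRM is unreachable, each failed chunk is logged and skipped.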
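
A table request addresses coordinates by their position in the URL path: coordinates are written as `longitude,latitude`, and the `sources` and `destinations` parameters are index lists into that path. The sketch below only builds the URL and does not contact a server. The `table_request` helper is hypothetical, and the port is the example value mentioned above.

```python
from urllib.parse import urlencode

def table_request(coords, n_src, osrm_url="http://localhost:5001"):
    """Build the URL for an OSRM table request.

    coords: list of (longitude, latitude) tuples, sources first, then destinations.
    n_src:  how many leading coordinates are sources.
    """
    path = ";".join(f"{lon},{lat}" for lon, lat in coords)
    params = {
        "sources": ";".join(str(i) for i in range(n_src)),
        "destinations": ";".join(str(i) for i in range(n_src, len(coords))),
        "annotations": "duration,distance",
    }
    # safe=";," keeps the index and annotation separators unescaped
    return f"{osrm_url}/table/v1/driving/{path}?{urlencode(params, safe=';,')}"

# Two coordinates from the facility preview; which one is the source is arbitrary here.
url = table_request([(26.824311, -23.115351), (27.517292, -21.160365)], n_src=1)
print(url)
```

Swapping latitude and longitude is an easy mistake with this API, which is why the helper takes `(longitude, latitude)` tuples explicitly.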

In [78]:
# Docker helper: prepare and run a local OSRM container using the local PBF file
# This helper is conservative: it checks for `docker`, validates the PBF exists in the notebook folder,
# runs the extract/partition/customize steps, starts `osrm-routed` in detached mode and polls a tiny table request
# NOTE: the helper mounts the current notebook folder into /data in the container and will create .osrm files there.
import os, shutil, subprocess, time, requests

def start_osrm_docker(pbf='botswana-latest.osm.pbf', mount_dir='.', image='osrm/osrm-backend:latest', profile='/opt/car.lua', host_port=5001, use_mld=True, timeout=120):
    """Start OSRM in Docker using the provided PBF located under `mount_dir` on the host."""
    # check docker availability
    if shutil.which('docker') is None:
        print('Docker is not available on PATH. Install Docker Desktop and ensure `docker` is runnable.')
        return None
    pbf_path = os.path.join(mount_dir, pbf)
    if not os.path.exists(pbf_path):
        print(f'PBF not found at {pbf_path}. Place the file in the notebook folder or set mount_dir appropriately.')
        return None

    abs_mount = os.path.abspath(mount_dir)
    # osrm-extract names its output after the input with the `.osm.pbf` (or `.pbf`) suffix removed,
    # e.g. botswana-latest.osm.pbf -> botswana-latest.osrm, so derive that path for the later steps
    base = pbf
    for ext in ('.osm.pbf', '.pbf'):
        if base.endswith(ext):
            base = base[:-len(ext)]
            break
    osrm_file = f'/data/{base}.osrm'
    # build command sequence: extract -> partition/customize (MLD) or extract -> contract (CH) -> run routed
    steps = []
    steps.append(['docker','run','--rm','-v', f'{abs_mount}:/data', image, 'osrm-extract','-p', profile, f'/data/{pbf}'])
    if use_mld:
        steps.append(['docker','run','--rm','-v', f'{abs_mount}:/data', image, 'osrm-partition', osrm_file])
        steps.append(['docker','run','--rm','-v', f'{abs_mount}:/data', image, 'osrm-customize', osrm_file])
        run_cmd = ['docker','run','-d','-p', f'{host_port}:5000','-v', f'{abs_mount}:/data', image, 'osrm-routed','--algorithm','mld', osrm_file]
    else:
        steps.append(['docker','run','--rm','-v', f'{abs_mount}:/data', image, 'osrm-contract', osrm_file])
        run_cmd = ['docker','run','-d','-p', f'{host_port}:5000','-v', f'{abs_mount}:/data', image, 'osrm-routed', osrm_file]

    # run the preparatory steps
    try:
        for c in steps:
            print('Running:', ' '.join(c))
            subprocess.run(c, check=True)
    except subprocess.CalledProcessError as e:
        print('A preparatory docker step failed:', e)
        return None

    # start osrm-routed in detached mode
    print('Starting osrm-routed (detached)...')
    try:
        cid = subprocess.check_output(run_cmd).decode().strip()
        print('Started container id:', cid)
    except Exception as e:
        print('Failed to start osrm-routed container:', e)
        return None

    # basic health check: small table query using the container host_port
    url = f'http://localhost:{host_port}/table/v1/driving/0,0;0,0'
    params = {'sources':'0','destinations':'1','annotations':'duration'}
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(url, params=params, timeout=5)
            if r.ok and r.json().get('code') == 'Ok':
                print('OSRM appears up and responding on port', host_port)
                return cid
        except Exception:
            pass
        time.sleep(2)
    print('OSRM did not respond within timeout. Check container logs with `docker logs <container_id>`.')
    return cid

# Example: call start_osrm_docker() to run the pipeline (this will create .osrm files in the notebook folder).
# cid = start_osrm_docker()

In [79]:
def build_edges_all_pairs(csv_path='facilities_with_warehouses.csv', osrm_url='http://localhost:5001', chunk=30, limit=0, run_osrm=True):
    """Build edges between ALL facilities (all-pairs distance matrix).
    Uses chunking for both sources and destinations to avoid URL length limits.
    Returns a DataFrame with columns: source_id, source_name, dest_id, dest_name, distance_m, duration_s
    """
    if not os.path.exists(csv_path):
        raise FileNotFoundError(csv_path)
    
    df = pd.read_csv(csv_path).dropna(subset=['Latitude','Longitude']).copy()
    
    if limit and limit > 0:
        df = df.head(limit).copy()
    
    def make_id(s: pd.DataFrame) -> pd.Series:
        a = s.get('New Facility Code') if 'New Facility Code' in s.columns else None
        b = s.get('Old Facility Code') if 'Old Facility Code' in s.columns else None
        idx_series = s.reset_index().index.astype(str)
        if a is None and b is None:
            return idx_series
        if a is not None and b is not None:
            out = a.combine_first(b)
        elif a is not None:
            out = a.copy()
        else:
            out = b.copy()
        out = out.where(out.notna(), idx_series)
        return out.astype(str)
    
    df['node_id'] = make_id(df)
    df['label'] = df.get('Facility Name', df.get('facility_name', pd.Series(['']*len(df)))).astype(str)
    df['coord'] = df.apply(lambda r: f"{r['Longitude']},{r['Latitude']}", axis=1)
    
    n_total = len(df)
    print(f"Total facilities: {n_total}")
    print(f"Using chunk size: {chunk}")
    
    base_url = f'{osrm_url}/table/v1/driving/'
    edges = []
    
    if not run_osrm:
        print('run_osrm is False — skipping remote OSRM requests. You can set run_osrm=True to contact OSRM.')
        return pd.DataFrame(columns=['source_id','source_name','dest_id','dest_name','distance_m','duration_s'])
    
    # Chunk both sources and destinations to avoid URL length limits
    total_chunks = 0
    for src_offset, src_chunk in chunks(df, chunk):
        src_coords = src_chunk['coord'].tolist()
        n_src = len(src_coords)
        
        for dst_offset, dst_chunk in chunks(df, chunk):
            total_chunks += 1
            dst_coords = dst_chunk['coord'].tolist()
            coords = ';'.join(src_coords + dst_coords)
            
            sources = ';'.join(map(str, range(n_src)))
            destinations = ';'.join(map(str, range(n_src, n_src + len(dst_coords))))
            params = {'sources': sources, 'destinations': destinations, 'annotations': 'duration,distance'}
            
            url = base_url + coords
            try:
                r = requests.get(url, params=params, timeout=300)
                r.raise_for_status()
                data = r.json()
            except Exception as e:
                print(f'[chunk {total_chunks}] request failed at src offset {src_offset}, dst offset {dst_offset}: {e}')
                continue
            
            if data.get('code') != 'Ok':
                print(f'[chunk {total_chunks}] OSRM error: {data.get("message", data)}')
                continue
            
            dists = data.get('distances')
            durs  = data.get('durations')
            if dists is None or durs is None:
                print(f'[chunk {total_chunks}] missing distances/durations')
                continue
            
            for si, (s_id, s_name) in enumerate(zip(src_chunk['node_id'], src_chunk['label'])):
                for di, (d_id, d_name) in enumerate(zip(dst_chunk['node_id'], dst_chunk['label'])):
                    dist = dists[si][di]
                    dur  = durs[si][di]
                    if dist is None or dur is None:
                        continue
                    edges.append((s_id, s_name, d_id, d_name, dist, dur))
            
            if total_chunks % 10 == 0:
                print(f'[{total_chunks} chunks] processed {min(src_offset + n_src, n_total)} × {min(dst_offset + len(dst_chunk), n_total)} facility pairs')
    
    edges_df = pd.DataFrame(edges, columns=['source_id','source_name','dest_id','dest_name','distance_m','duration_s'])
    print(f'Total chunks: {total_chunks}, total edges: {len(edges_df)}')
    return edges_df

# To test: uncomment below and run this cell
# edges_df = build_edges_all_pairs(run_osrm=True)
# print('edges_df shape:', edges_df.shape)


## 3) Pivot edges into distance/duration matrices (adapted from `pivot_matrices.py`)
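
As a small illustration of what the pivot does, the sketch below reshapes a long-form edge list into a wide matrix. The four toy edges are hypothetical; only the column names and the row/column orientation follow the pivot cell.

```python
import pandas as pd

# Hypothetical toy edges: two facilities, all four ordered pairs.
edges = pd.DataFrame(
    [
        ("A", "A", 0.0, 0.0),
        ("A", "B", 1200.0, 95.0),
        ("B", "A", 1350.0, 110.0),
        ("B", "B", 0.0, 0.0),
    ],
    columns=["source_id", "dest_id", "distance_m", "duration_s"],
)

# Rows are destinations and columns are sources, matching the pivot cell.
dist = edges.pivot(index="dest_id", columns="source_id", values="distance_m")
print(dist)

# Distance of the edge from source A to destination B: row "B", column "A".
print(dist.loc["B", "A"])
```

Because rows are destinations, a lookup reads as `matrix.loc[destination, source]`, which is worth keeping in mind when the matrix is not symmetric.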

In [80]:
def pivot_edges_to_matrices(edges_df, out_prefix=''):
    """Take a long-form edges DataFrame and write / return wide distance and duration matrices.
    Expects columns: source_id, source_name, dest_id, dest_name, distance_m, duration_s
    For all-pairs input this yields a square matrix (rows = destinations, columns = sources); it is not necessarily symmetric.
    Also writes ID-indexed versions and a lookup table.
    """
    if edges_df is None or len(edges_df) == 0:
        print('No edges to pivot — returning empty matrices')
        return None, None

    # drop duplicates keeping smallest distance per pair
    df = edges_df.sort_values('distance_m').drop_duplicates(subset=['source_id','dest_id'], keep='first').copy()

    # Pivot by IDs first
    dist_by_id = df.pivot(index='dest_id', columns='source_id', values='distance_m')
    dur_by_id  = df.pivot(index='dest_id', columns='source_id', values='duration_s')

    # Create ID-to-name mapping
    src_names = df[['source_id','source_name']].drop_duplicates().set_index('source_id')['source_name']
    dst_names = df[['dest_id','dest_name']].drop_duplicates().set_index('dest_id')['dest_name']
    
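    # Merge name mappings (prefer source names but use dest names as fallback).
    # De-duplicate on the ID index rather than on the name values, so that
    # distinct facilities sharing a name each keep their own lookup entry.
    id_to_name = pd.concat([src_names, dst_names])
    id_to_name = id_to_name[~id_to_name.index.duplicated(keep='first')]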

    # Create named versions: replace index and columns with facility names
    dist_named = dist_by_id.copy()
    dist_named.index = dist_named.index.map(id_to_name)
    dist_named.columns = dist_named.columns.map(id_to_name)
    
    dur_named = dur_by_id.copy()
    dur_named.index = dur_named.index.map(id_to_name)
    dur_named.columns = dur_named.columns.map(id_to_name)

    # Write outputs to disk if out_prefix is provided
    if out_prefix is not None and out_prefix != '':
        try:
            os.makedirs(out_prefix, exist_ok=True)
        except Exception:
            pass
        
        # Write ID-indexed versions
        dist_by_id.to_csv(os.path.join(out_prefix, 'distance_matrix.csv'))
        dur_by_id.to_csv(os.path.join(out_prefix, 'duration_matrix.csv'))
        
        # Write named versions
        dist_named.to_csv(os.path.join(out_prefix, 'distance_matrix_named.csv'))
        dur_named.to_csv(os.path.join(out_prefix, 'duration_matrix_named.csv'))
        
        # Write lookup table
        id_to_name.to_csv(os.path.join(out_prefix, 'facility_id_lookup.csv'), header=['source_name'])
        
        print('Wrote pivoted matrices to disk with prefix', out_prefix)
    else:
        # Write to current directory if no prefix
        dist_by_id.to_csv('distance_matrix.csv')
        dur_by_id.to_csv('duration_matrix.csv')
        dist_named.to_csv('distance_matrix_named.csv')
        dur_named.to_csv('duration_matrix_named.csv')
        id_to_name.to_csv('facility_id_lookup.csv', header=['source_name'])
        print('Wrote pivoted matrices to disk (current directory)')

    # Return the named versions (with facility names as index/columns)
    return dist_named, dur_named

# No-op when edges_df is empty (run_osrm=False) or has not been built yet in this session
edges_df = globals().get('edges_df')
dist_mat, dur_mat = pivot_edges_to_matrices(edges_df)
print('dist_mat shape:', None if dist_mat is None else dist_mat.shape)


No edges to pivot — returning empty matrices
dist_mat shape: None


## 4) Label matrices and apply duration upper-bound (adapted from `label_and_upperbound.py`)
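
The labeling step does two independent things: it appends the service type to each facility name, and it multiplies every duration by 1.2. A minimal sketch of both on single values, using hypothetical inputs:

```python
UPPER_BOUND_FACTOR = 1.2  # same multiplier the labeling cell applies to durations

def label(name, category_map):
    """Append the service type in parentheses, or '--' when it is unknown."""
    return f"{name} ({category_map.get(name, '--')})"

# Hypothetical inputs for illustration only.
category_map = {"Airstrip Clinic": "Clinic"}
duration_s = 600.0

print(label("Airstrip Clinic", category_map))
print(label("Unknown Depot", category_map))
print(duration_s * UPPER_BOUND_FACTOR)
```

A name missing from the facilities CSV is not an error here; it simply gets the `--` placeholder, so unexpected `--` labels are a hint that names in the matrix and the CSV have drifted apart.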

In [81]:
def label_and_upperbound(fac_csv='facilities_with_warehouses.csv', dist_df=None, dur_df=None, out_prefix=''):
    """If provided with named (index/columns=facility names) dataframes, label them with service type and apply 1.2× upper bound to duration.
    If dataframes are None, the function will attempt to load `distance_matrix_named.csv` and `duration_matrix_named.csv` from disk.
    """
    fac = None
    if os.path.exists(fac_csv):
        fac = pd.read_csv(fac_csv)
        category_map = fac.set_index('Facility Name')['Service Delivery Type'].to_dict()
    else:
        category_map = {}

    if dist_df is None or dur_df is None:
        # try to load named matrices if present
        if os.path.exists('distance_matrix_named.csv') and os.path.exists('duration_matrix_named.csv'):
            dist_df = pd.read_csv('distance_matrix_named.csv', index_col=0)
            dur_df = pd.read_csv('duration_matrix_named.csv', index_col=0)
        else:
            print('No named matrices provided or on disk — skipping labeling step')
            return None, None

    dur_upper = dur_df * 1.2
    dist_df = dist_df.copy()  # avoid relabeling the caller's DataFrame in place

    def apply_labels(names):
        return [f"{name} ({category_map.get(name, '--')})" for name in names]

    dur_upper.index = apply_labels(dur_upper.index)
    dur_upper.columns = apply_labels(dur_upper.columns)
    dist_df.index = apply_labels(dist_df.index)
    dist_df.columns = apply_labels(dist_df.columns)

    if out_prefix:
        dur_upper.to_csv(out_prefix + 'duration_matrix_upperbound_labeled.csv')
        dist_df.to_csv(out_prefix + 'distance_matrix_labeled.csv')
        print('Wrote labeled matrices with prefix', out_prefix)

    return dist_df, dur_upper

# Try to label if pivot produced matrices; otherwise this is a no-op
labeled_dist, labeled_dur = label_and_upperbound(dist_df=dist_mat, dur_df=dur_mat)
print('labeled shapes:', None if labeled_dist is None else labeled_dist.shape, None if labeled_dur is None else labeled_dur.shape)

labeled shapes: (614, 22) (614, 22)


## 5) Analysis & visualizations (adapted from `matrixanalysis.py`)
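
The two quantities the analysis plots can be illustrated on a tiny matrix: asymmetry is the absolute difference between the A-to-B and B-to-A values, and implied speed is distance divided by duration. The numbers below are hypothetical and serve only as a sketch of the arithmetic.

```python
import numpy as np

# Hypothetical 3x3 distance matrix in metres; [i, j] and [j, i] may differ
# because the route in one direction need not mirror the other.
dist = np.array([
    [0.0, 1200.0, 5000.0],
    [1350.0, 0.0, 4100.0],
    [5000.0, 4000.0, 0.0],
])

asym = np.abs(dist - dist.T)
# Each unordered pair appears twice in asym, so keep the upper triangle only.
pairs = asym[np.triu_indices_from(asym, k=1)]
print(pairs)

# Implied speed for one edge: metres per second converted to km/h.
speed_kmh = 1200.0 / 95.0 * 3.6
print(round(speed_kmh, 1))
```

The transpose-based comparison only makes sense when rows and columns list the same facilities in the same order, which is why it is restricted to square matrices.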

In [88]:
def deduplicate_labels(labels):
    from collections import Counter
    counts = Counter()
    new_labels = []
    for lbl in labels:
        counts[lbl] += 1
        if counts[lbl] > 1:
            new_labels.append(f"{lbl} ({counts[lbl]})")
        else:
            new_labels.append(lbl)
    return new_labels

def analyze(labeled_dist_csv='distance_matrix_labeled.csv', labeled_dur_csv='duration_matrix_upperbound_labeled.csv', fac_csv='facilities_with_warehouses.csv', out_dir=''):
    if not os.path.exists(labeled_dist_csv) or not os.path.exists(labeled_dur_csv):
        print('Labeled matrices not found on disk — skipping analysis (provide labeled CSVs).')
        return

    dist = pd.read_csv(labeled_dist_csv, index_col=0)
    dur = pd.read_csv(labeled_dur_csv, index_col=0)
    fac = pd.read_csv(fac_csv) if os.path.exists(fac_csv) else pd.DataFrame()

    # ensure unique labels
    if dist.index.duplicated().any() or dist.columns.duplicated().any():
        print('Warning: Duplicate facility names found — disambiguating.')
        dist.index = deduplicate_labels(dist.index)
        dist.columns = deduplicate_labels(dist.columns)
        dur.index = dist.index
        dur.columns = dist.columns

    dist_np = dist.apply(pd.to_numeric, errors='coerce').to_numpy(dtype=float)
    dur_np = dur.apply(pd.to_numeric, errors='coerce').to_numpy(dtype=float)

    # Asymmetry analysis: only compute if matrix is square; for rectangular matrices, skip
    if dist_np.shape[0] == dist_np.shape[1]:
        dist_asym_np = np.abs(dist_np - dist_np.T)
        dur_asym_np = np.abs(dur_np - dur_np.T)
        dist_asym_np = np.nan_to_num(dist_asym_np, nan=0.0)
        dur_asym_np = np.nan_to_num(dur_asym_np, nan=0.0)
        
        summary = {
            'mean_dist_asym_m': np.mean(dist_asym_np),
            'max_dist_asym_m': np.max(dist_asym_np),
            'mean_dur_asym_s': np.mean(dur_asym_np),
            'max_dur_asym_s': np.max(dur_asym_np),
            'n_facilities': dist.shape[0],
        }
    else:
        # Rectangular matrix: skip asymmetry, just report basic stats
        print(f'Note: Matrix is rectangular ({dist_np.shape[0]} × {dist_np.shape[1]}), skipping asymmetry analysis.')
        valid = (dist_np > 0) & (dur_np > 0)
        summary = {
            'mean_dist_m': np.nanmean(dist_np[valid]) if np.any(valid) else np.nan,
            'max_dist_m': np.nanmax(dist_np[valid]) if np.any(valid) else np.nan,
            'mean_dur_s': np.nanmean(dur_np[valid]) if np.any(valid) else np.nan,
            'max_dur_s': np.nanmax(dur_np[valid]) if np.any(valid) else np.nan,
            # pivot_edges_to_matrices puts destinations on rows and sources on columns
            'n_sources': dist.shape[1],
            'n_destinations': dist.shape[0],
        }
    
    # Save summary CSV to the output directory
    summary_path = os.path.join(out_dir, 'matrix_summary.csv') if out_dir else 'matrix_summary.csv'
    pd.DataFrame([summary]).to_csv(summary_path, index=False)
    print(f'Wrote {summary_path}')

    # facility type counts visualization if possible
    if not fac.empty and 'Service Delivery Type' in fac.columns:
        type_counts = fac['Service Delivery Type'].value_counts().sort_values(ascending=False)
        print('\n' + '='*60)
        print('FACILITY COUNTS BY SERVICE DELIVERY TYPE')
        print('='*60)
        for service_type, count in type_counts.items():
            print(f"  {service_type:40s} : {count:4d}")
        print('='*60 + '\n')
        
        # Save counts to CSV
        counts_df = pd.DataFrame({'Service Delivery Type': type_counts.index, 'Count': type_counts.values})
        counts_path = os.path.join(out_dir, 'facility_type_counts.csv') if out_dir else 'facility_type_counts.csv'
        counts_df.to_csv(counts_path, index=False)
        print(f'Wrote {counts_path}')
        
        # Create visualization
        plt.figure(figsize=(8,4))
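        # assign hue explicitly; passing palette without hue is deprecated in seaborn
        sns.barplot(y=type_counts.index, x=type_counts.values, hue=type_counts.index, palette='viridis', legend=False)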
        plt.title('Number of Facilities by Service Delivery Type')
        plt.xlabel('Count')
        plt.ylabel('Facility Type')
        plt.tight_layout()
        png_path = os.path.join(out_dir, 'facility_type_counts.png') if out_dir else 'facility_type_counts.png'
        plt.savefig(png_path)
        plt.close()
        print(f'Wrote {png_path}')

    # asymmetry distribution (only for square matrices)
    if dist_np.shape[0] == dist_np.shape[1]:
        flat_asym = dist_asym_np[np.triu_indices_from(dist_asym_np, k=1)]
        plt.figure(figsize=(7,4))
        sns.histplot(flat_asym/1000, bins=50, color='coral', kde=True)
        plt.xlim(0,15)
        plt.xlabel('Asymmetry (km)')
        plt.tight_layout()
        png_path = os.path.join(out_dir, 'asymmetry_distribution.png') if out_dir else 'asymmetry_distribution.png'
        plt.savefig(png_path)
        plt.close()
        print(f'Wrote {png_path}')

    # implied speeds
    valid = (dist_np > 0) & (dur_np > 0)
    speed_mps = (dist_np[valid] / dur_np[valid]).flatten()
    plt.figure(figsize=(7,4))
    sns.histplot(speed_mps*3.6, bins=40, color='seagreen')
    plt.xlabel('Implied Travel Speed (km/h)')
    plt.tight_layout()
    png_path = os.path.join(out_dir, 'speed_distribution.png') if out_dir else 'speed_distribution.png'
    plt.savefig(png_path)
    plt.close()
    print(f'Wrote {png_path}')
    print('Done analysis')

# Not running analyze() automatically — call analyze() when you have labeled matrices on disk.


## 6) Smoke test example (the small OSRM table check)
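
The table response carries one row per source and one column per destination, in the order given by the `sources` and `destinations` index lists. The sketch below flattens such a response into edge tuples without contacting a server. The response dictionary, the placeholder names and the `to_edges` helper are all hypothetical.

```python
# Hypothetical response for 2 sources x 3 destinations; the numbers are made up.
response = {
    "code": "Ok",
    "durations": [[60.0, 125.5, None], [300.2, 45.0, 80.1]],
    "distances": [[900.0, 2100.3, None], [5200.0, 640.0, 1100.9]],
}
sources = ["W1", "W2"]             # placeholder source names
destinations = ["F1", "F2", "F3"]  # placeholder destination names

def to_edges(resp, src_names, dst_names):
    """Flatten table matrices into (source, dest, distance_m, duration_s) tuples."""
    if resp.get("code") != "Ok":
        raise ValueError(f"OSRM error: {resp.get('message', resp)}")
    edges = []
    for si, s in enumerate(src_names):
        for di, d in enumerate(dst_names):
            dist = resp["distances"][si][di]
            dur = resp["durations"][si][di]
            if dist is None or dur is None:
                continue  # a null entry means no route was found for this pair
            edges.append((s, d, dist, dur))
    return edges

edges = to_edges(response, sources, destinations)
print(len(edges))
```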

In [89]:
def smoke_test_table(csv='facilities_with_warehouses.csv', osrm_url='http://localhost:5001'):
    # replicates matrix_smoke_test.py behavior for a tiny 2x5 example
    if not os.path.exists(csv):
        print('Facilities CSV not found — cannot run smoke test.')
        return
    df = pd.read_csv(csv).dropna(subset=['Latitude','Longitude']).copy()
    src = df[df['Is_Warehouse'] == True].copy()
    dst = df[df['Is_Warehouse'] == False].copy()
    src_small = src.head(2).reset_index(drop=True)
    dst_small = dst.head(5).reset_index(drop=True)
    if len(src_small) == 0 or len(dst_small) == 0:
        print('Not enough sources/destinations for smoke test')
        return
    def fmt_coord(row):
        return f"{row['Longitude']},{row['Latitude']}"
    coords = [fmt_coord(r) for _, r in pd.concat([src_small, dst_small]).iterrows()]
    n_src = len(src_small)
    url = f'{osrm_url}/table/v1/driving/' + ';'.join(coords)
    params = {'sources': ';'.join(map(str, range(n_src))), 'destinations': ';'.join(map(str, range(n_src, n_src + len(dst_small)))), 'annotations': 'duration,distance'}
    try:
        r = requests.get(url, params=params, timeout=60)
        r.raise_for_status()
        data = r.json()
    except Exception as e:
        print('OSRM request failed (is OSRM running?):', e)
        return
    if data.get('code') != 'Ok':
        print('OSRM returned error:', data)
        return
    durs = data.get('durations')
    dists = data.get('distances')
    if dists is not None:
        dist = pd.DataFrame(dists, index=src_small['Facility Name'].tolist(), columns=dst_small['Facility Name'].tolist())
        display(dist.round(1))
    if durs is not None:
        dur = pd.DataFrame(durs, index=src_small['Facility Name'].tolist(), columns=dst_small['Facility Name'].tolist())
        display(dur.round(1))

# To run the smoke test, call smoke_test_table()

## Next steps & usage

- To compute full matrices using your local OSRM server: uncomment the `build_edges_all_pairs(run_osrm=True)` call at the end of the build cell and run that cell.
- After `edges_df` is produced, run the pivot cell to create `distance_matrix.csv` and `duration_matrix.csv`.
- Run the labeling cell to produce labeled matrices and apply the 1.2× duration upper-bound.
- Call `analyze()` (from the Analysis cell) once labeled matrices are present to produce summary CSV and PNGs.

## 7) Full pipeline: Run all steps end-to-end

Execute this cell to regenerate all matrices and CSVs from the OSRM server. This will take several minutes depending on facility count.
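
Run time scales with the number of table requests, which grows with the square of the facility count because sources and destinations are both chunked. A rough sketch of that count, using the 636 facilities and `chunk=30` from this notebook (the `n_requests` helper is hypothetical):

```python
import math

def n_requests(n_facilities, chunk):
    """Number of table requests when sources and destinations are both chunked."""
    per_axis = math.ceil(n_facilities / chunk)
    return per_axis * per_axis

print(n_requests(636, 30))
```

A larger `chunk` reduces the number of requests but lengthens each URL, so the chunk size trades request count against URL length.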

In [93]:
print("=" * 70)
print("FULL PIPELINE: REGENERATE ALL MATRICES AND CSVS")
print("=" * 70)
print()

# Step 1: Load facilities
print("[1/5] Loading facilities...")
fac = pd.read_csv('facilities_with_warehouses.csv')
print(f"  Loaded {len(fac)} facilities")

# Step 2: Build edges from OSRM (all-pairs: every facility to every facility)
print("[2/5] Building edges from OSRM (this may take many minutes)...")
print("  Note: Chunking both sources and destinations to stay within URL limits")
edges_df = build_edges_all_pairs(run_osrm=True, osrm_url='http://localhost:5001', chunk=30, limit=0)
print(f"  Generated {len(edges_df)} edges")

# Step 3: Pivot to matrices
print("[3/5] Pivoting edges into distance/duration matrices...")
dist_mat, dur_mat = pivot_edges_to_matrices(edges_df, out_prefix='osrm_project/')
if dist_mat is not None:
    print(f"  Distance matrix shape: {dist_mat.shape}")
    print(f"  Duration matrix shape: {dur_mat.shape}")
else:
    print("  WARNING: No matrices produced")

# Step 4: Label matrices
print("[4/5] Labeling matrices and applying 1.2× duration upper-bound...")
labeled_dist, labeled_dur = label_and_upperbound(fac_csv='facilities_with_warehouses.csv', dist_df=dist_mat, dur_df=dur_mat, out_prefix='osrm_project/')
if labeled_dist is not None:
    print(f"  Labeled distance matrix shape: {labeled_dist.shape}")
    print(f"  Labeled duration (upper-bound) matrix shape: {labeled_dur.shape}")
else:
    print("  WARNING: No labeled matrices produced")

# Step 5: Analysis
print("[5/5] Running analysis (creating summary CSV and visualizations)...")
analyze(
    labeled_dist_csv='osrm_project/distance_matrix_labeled.csv',
    labeled_dur_csv='osrm_project/duration_matrix_upperbound_labeled.csv',
    fac_csv='facilities_with_warehouses.csv',
    out_dir='osrm_project'
)

print()
print("=" * 70)
print("PIPELINE COMPLETE")
print("=" * 70)
print()
print("Generated files in osrm_project/:")
for fname in ['distance_matrix.csv', 'duration_matrix.csv', 'distance_matrix_named.csv', 'duration_matrix_named.csv', 'facility_id_lookup.csv', 'distance_matrix_labeled.csv', 'duration_matrix_upperbound_labeled.csv', 'matrix_summary.csv', 'facility_type_counts.csv']:
    path = os.path.join('osrm_project', fname)
    if os.path.exists(path):
        size_mb = os.path.getsize(path) / (1024**2)
        print(f"  ✓ {fname:50s} ({size_mb:.1f} MB)")
    else:
        print(f"  ✗ {fname:50s} (NOT FOUND)")


FULL PIPELINE: REGENERATE ALL MATRICES AND CSVS

[1/5] Loading facilities...
  Loaded 636 facilities
[2/5] Building edges from OSRM (this may take many minutes)...
  Note: Chunking both sources and destinations to stay within URL limits
Total facilities: 636
Using chunk size: 30
[10 chunks] processed 30 × 300 facility pairs
[20 chunks] processed 30 × 600 facility pairs
[30 chunks] processed 60 × 240 facility pairs
[40 chunks] processed 60 × 540 facility pairs
[50 chunks] processed 90 × 180 facility pairs
[60 chunks] processed 90 × 480 facility pairs
[70 chunks] processed 120 × 120 facility pairs
[80 chunks] processed 120 × 420 facility pairs
...


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(y=type_counts.index, x=type_counts.values, palette='viridis')


Wrote osrm_project/asymmetry_distribution.png
Wrote osrm_project/speed_distribution.png
Done analysis

PIPELINE COMPLETE

Generated files in osrm_project/:
  ✓ distance_matrix.csv                                (3.5 MB)
  ✓ duration_matrix.csv                                (3.0 MB)
  ✓ distance_matrix_named.csv                          (3.5 MB)
  ✓ duration_matrix_named.csv                          (3.0 MB)
  ✓ facility_id_lookup.csv                             (0.0 MB)
  ✓ distance_matrix_labeled.csv                        (3.5 MB)
  ✓ duration_matrix_upperbound_labeled.csv             (4.5 MB)
  ✓ matrix_summary.csv                                 (0.0 MB)
  ✓ facility_type_counts.csv                           (0.0 MB)
