# Data Collection Pipeline V3 - Cost Pool Approach

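This approach decouples the expensive optimizer runs from training-sample generation: each cost set is optimized once, and the stored results are reused for every sampled preference.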

## How It Works:

### Phase 1: Build Cost Pool (~6.7 hours with the settings below)
1. Generate `POOL_SIZE` random cost sets (200 in the configuration below)
2. Optimize each once, waiting `WAIT_SECONDS` (120 s) per run
3. Store all (costs, features) pairs

### Phase 2: Match Preferences (near-instant when the pool has a good match)
1. Generate random preferences
2. Score every solution in the pool against the preferences
3. Pick the best match (lowest score)
4. Save (preference, best_costs, best_features)
5. Repeat to create 400+ training samples

## Benefits:
- ✅ **400+ training samples from 200 optimizations**
- ✅ **Much faster**: ~6.7 hours (vs 20 hours for 200 samples)
- ✅ **Better diversity**: Every preference is scored against every cost set in the pool
- ✅ **Reusable**: Can generate more samples anytime
- ✅ **Fallback**: Generates custom costs for poor matches
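
The Phase 2 loop can be sketched on a toy pool. This is a minimal illustration, not the collector itself: the pool entries and their feature values are made up, and `toy_score`, `best_match`, and `make_samples` are hypothetical helpers that reuse the weights of `score_solution` defined later in this notebook.

```python
import random

# Made-up pool entries: each pairs a cost set with the features of its optimized plan.
toy_pool = [
    {"pool_id": 1, "costs": {"costPerKm": 0.05},
     "features": {"avg_parking_difficulty": 4.0, "total_travel_hours": 30.0,
                  "total_distance_km": 900.0, "unplanned_jobs": 0}},
    {"pool_id": 2, "costs": {"costPerKm": 0.20},
     "features": {"avg_parking_difficulty": 6.0, "total_travel_hours": 26.0,
                  "total_distance_km": 1100.0, "unplanned_jobs": 0}},
    {"pool_id": 3, "costs": {"costPerKm": 0.40},
     "features": {"avg_parking_difficulty": 1.0, "total_travel_hours": 34.0,
                  "total_distance_km": 950.0, "unplanned_jobs": 0}},
]

def toy_score(preferences, features):
    """Weighted sum with the same weights as score_solution; lower is better."""
    return (preferences["parking_importance"] * features["avg_parking_difficulty"]
            + preferences["time_importance"] * features["total_travel_hours"] * 10
            + preferences["distance_importance"] * features["total_distance_km"] * 0.1
            + features["unplanned_jobs"] * 1000)

def best_match(preferences, pool):
    """Return the pool entry with the lowest score for these preferences."""
    return min(pool, key=lambda entry: toy_score(preferences, entry["features"]))

def make_samples(pool, n_samples, seed=0):
    """Sample random preferences and pair each with its best pool entry."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    keys = ("parking_importance", "time_importance", "distance_importance")
    samples = []
    for _ in range(n_samples):
        prefs = {key: rng.uniform(0.0, 1.0) for key in keys}
        match = best_match(prefs, pool)
        samples.append({"preferences": prefs, "costs": match["costs"],
                        "features": match["features"], "pool_id": match["pool_id"]})
    return samples
```

No optimizer call happens inside `make_samples`, which is why the number of samples can exceed the number of optimizations.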

In [29]:
import json
import time
import requests
import numpy as np
import pandas as pd
from datetime import datetime
from pathlib import Path
import copy

print("✓ Imports successful")

✓ Imports successful


## Configuration

**⚠️ CHANGE THESE VALUES:**

In [32]:
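# API Configuration
BASE_URL = "https://optimizer-0.staging.zenderatms.com"  # URL shown in the printed output below
USERNAME = "your-username"  # ← Your username
PASSWORD = "your-password"  # ← Your password
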
# Model file (loads as 106 jobs, 26 vehicles; see the collector output below)
MODEL_PATH = "../../test_model_mathias_fixed.json"  # ← Your model file

# Phase 1: Cost pool generation
POOL_SIZE = 200  # Number of random cost sets to generate
COST_POOL_FILE = "cost_pool.json"
POOL_CHECKPOINT_EVERY = 10  # Save every 10 cost sets

# Phase 2: Preference matching
N_PREFERENCE_SAMPLES = 400  # Number of training samples to generate
MATCH_QUALITY_THRESHOLD = 500  # If best score > this, generate custom costs
OUTPUT_FILE = "training_data_v3_pool.json"

# Optimization settings
WAIT_SECONDS = 120  # Wait time for each optimization

# Cost parameter ranges
COST_RANGES = {
    'costPerTravelHour': (0.5, 5.0),
    'costPerKm': (0.014, 0.49),
    'parking_multiplier': (0.2, 5.0)
}

print(f"Configuration:")
print(f"  Base URL: {BASE_URL}")
print(f"  Model: {MODEL_PATH}")
print(f"\nPhase 1 - Cost Pool:")
print(f"  Pool size: {POOL_SIZE}")
print(f"  Estimated time: {(POOL_SIZE * WAIT_SECONDS) / 3600:.1f} hours")
print(f"  Output: {COST_POOL_FILE}")
print(f"\nPhase 2 - Preference Matching:")
print(f"  Training samples: {N_PREFERENCE_SAMPLES}")
print(f"  Match threshold: {MATCH_QUALITY_THRESHOLD}")
print(f"  Output: {OUTPUT_FILE}")
print(f"\nTotal estimated time: ~{(POOL_SIZE * WAIT_SECONDS) / 3600:.1f} hours")

Configuration:
  Base URL: https://optimizer-0.staging.zenderatms.com
  Model: ../../test_model_mathias_fixed.json

Phase 1 - Cost Pool:
  Pool size: 200
  Estimated time: 6.7 hours
  Output: cost_pool.json

Phase 2 - Preference Matching:
  Training samples: 400
  Match threshold: 500
  Output: training_data_v3_pool.json

Total estimated time: ~6.7 hours
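
The printed total covers Phase 1 only. Each Phase 2 fallback runs 5 additional optimizations (see `generate_custom_costs_for_preference`), so the real total depends on how many samples exceed `MATCH_QUALITY_THRESHOLD`. A rough sketch of that arithmetic follows; `estimate_hours` and its parameter names are hypothetical and not part of the notebook, and the count of 40 fallbacks is an arbitrary example.

```python
def estimate_hours(pool_size, wait_seconds, n_fallbacks=0, runs_per_fallback=5):
    """Wall-clock estimate in hours, counting only the fixed optimization waits."""
    phase1_seconds = pool_size * wait_seconds
    fallback_seconds = n_fallbacks * runs_per_fallback * wait_seconds
    return (phase1_seconds + fallback_seconds) / 3600

# Phase 1 only, with the configured POOL_SIZE=200 and WAIT_SECONDS=120
print(f"{estimate_hours(200, 120):.1f} hours")

# Same pool plus 40 fallbacks (an arbitrary example count)
print(f"{estimate_hours(200, 120, n_fallbacks=40):.1f} hours")
```

With these numbers, 40 fallbacks cost as much waiting time as the whole of Phase 1, so the fallback rate is worth watching.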


In [38]:
# Run this to see what file the collector is actually using
import json

with open(MODEL_PATH, 'r') as f:
    model = json.load(f)

# Check dates
from datetime import datetime
dates = set()
for job in model['data']['jobs']:
    for stop in job['stops']:
        for tw in stop['multiTWs']:
            dt = datetime.fromisoformat(tw['openTime'])
            dates.add(dt.date())

print(f"Dates in loaded model: {sorted(dates)}")
print(f"Today: {datetime.now().date()}")



Dates in loaded model: [datetime.date(2025, 10, 28), datetime.date(2025, 10, 29), datetime.date(2025, 10, 30)]
Today: 2025-10-27
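
The same check can be wrapped in a helper that flags time-window dates that have already passed. Whether the optimizer rejects jobs with past time windows is an assumption here, not something the output above confirms. `collect_dates` and `stale_dates` are hypothetical names, and the model layout (`data.jobs[].stops[].multiTWs[].openTime`) is taken from the cell above.

```python
from datetime import date, datetime

def collect_dates(model):
    """Gather the calendar dates of every time-window opening in the model."""
    dates = set()
    for job in model["data"]["jobs"]:
        for stop in job.get("stops", []):
            for tw in stop.get("multiTWs", []):
                dates.add(datetime.fromisoformat(tw["openTime"]).date())
    return dates

def stale_dates(model, today):
    """Return the sorted time-window dates that fall before `today`."""
    return sorted(d for d in collect_dates(model) if d < today)

# Tiny made-up model with the same nesting as the real file
toy_model = {"data": {"jobs": [
    {"stops": [{"multiTWs": [{"openTime": "2025-10-28T08:00:00"}]}]},
    {"stops": [{"multiTWs": [{"openTime": "2025-10-30T09:30:00"}]}]},
]}}
print(stale_dates(toy_model, date(2025, 10, 29)))
```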


## Cost Pool Data Collector Class

In [39]:
class CostPoolDataCollector:
    """Two-phase data collector using cost pool approach"""
    
    def __init__(self, base_url, username, password, model_path, cost_ranges):
        """Initialize data collector"""
        self.base_url = base_url.rstrip('/')
        self.username = username
        self.password = password
        self.cost_ranges = cost_ranges
        
        # Initialize session
        self.session = requests.Session()
        self.session.auth = (username, password)
        self.session.headers.update({
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        })
        
        # Load base model
        with open(model_path, 'r') as f:
            self.base_model = json.load(f)
        
        print(f"✓ Base model loaded")
        print(f"  Jobs: {len(self.base_model['data']['jobs'])}")
        print(f"  Vehicles: {len(self.base_model['data']['vehicles'])}")
        
        # Data storage
        self.cost_pool = []  # List of {costs, features, pool_id}
        self.training_data = []  # Final training samples
    
    def sample_random_costs(self):
        """Sample random cost parameters from valid ranges"""
        costs = {}
        for param, (min_val, max_val) in self.cost_ranges.items():
            costs[param] = np.random.uniform(min_val, max_val)
        
        # Round for cleaner values
        costs['costPerTravelHour'] = round(costs['costPerTravelHour'], 2)
        costs['costPerKm'] = round(costs['costPerKm'], 4)
        costs['parking_multiplier'] = round(costs['parking_multiplier'], 2)
        
        return costs
    
    def sample_preferences(self):
        """Sample random preference values"""
        preferences = {
            'parking_importance': np.random.uniform(0.0, 1.0),
            'time_importance': np.random.uniform(0.0, 1.0),
            'distance_importance': np.random.uniform(0.0, 1.0)
        }
        return preferences
    
    def apply_costs_to_model(self, costs):
        """Apply cost parameters to model"""
        model = copy.deepcopy(self.base_model)
        
        # Apply vehicle costs
        for vehicle in model['data']['vehicles']:
            if 'definition' in vehicle:
                vehicle['definition']['costPerTravelHour'] = costs['costPerTravelHour']
                vehicle['definition']['costPerKm'] = costs['costPerKm']
        
        # Apply parking multiplier ONLY to delivery stops
        for job in model['data']['jobs']:
            for stop in job.get('stops', []):
                # Only multiply parking cost if:
                # 1. It's a delivery stop
                # 2. Parking cost is already > 0
                if (stop.get('type') == 'SHIPMENT_DELIVERY' and 
                    'parking' in stop and 
                    'cost' in stop['parking'] and 
                    stop['parking']['cost'] > 0):
                    
                    original_cost = stop['parking']['cost']
                    stop['parking']['cost'] = original_cost * costs['parking_multiplier']
        
        return model
    
    def send_to_odl(self, model, model_id):
        """Send model to ODL API"""
        url = f"{self.base_url}/models/{model_id}"
        
        try:
            response = self.session.put(url, json=model, timeout=30)
            response.raise_for_status()
            return True
        except Exception as e:
            print(f"  ✗ Upload failed: {e}")
            return False
    
    def get_plan(self, model_id):
        """Retrieve plan from ODL API"""
        url = f"{self.base_url}/models/{model_id}/optimiserstate/plan"
        
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except Exception as e:
            print(f"  ✗ Get plan failed: {e}")
            return None
        
    def extract_features(self, plan):
        """Extract route features from plan"""
        if not plan or 'vehiclePlans' not in plan:
            return None
        
        features = {
            'total_distance_km': 0.0,
            'total_travel_hours': 0.0,
            'total_cost': 0.0,
            'num_stops': 0,
            'num_vehicles_used': 0,
            'unplanned_jobs': len(plan.get('unplannedJobs', [])),
            'avg_parking_difficulty': 0.0
        }
        
        total_parking_cost = 0.0
        parking_stops = 0
        
        for vehicle_plan in plan['vehiclePlans']:
            stops = vehicle_plan.get('plannedStops', [])
            
            if len(stops) > 0:
                features['num_vehicles_used'] += 1
                features['num_stops'] += len(stops)
            
            # Get distance and time from timeStatistics (NOT from individual stops!)
            time_stats = vehicle_plan.get('timeStatistics', {})
            features['total_distance_km'] += time_stats.get('travelMetres', 0) / 1000.0  # Convert to km
            features['total_travel_hours'] += time_stats.get('travelSeconds', 0) / 3600.0  # Convert to hours
            features['total_cost'] += time_stats.get('cost', 0)
            
            # Still get parking from individual stops
            for stop in stops:
                parking = stop.get('parking', {})
                if 'cost' in parking:
                    total_parking_cost += parking['cost']
                    parking_stops += 1
        
        # Calculate averages
        if parking_stops > 0:
            features['avg_parking_difficulty'] = total_parking_cost / parking_stops
        
        # Round for cleaner values
        features['total_distance_km'] = round(features['total_distance_km'], 2)
        features['total_travel_hours'] = round(features['total_travel_hours'], 2)
        features['total_cost'] = round(features['total_cost'], 2)
        features['avg_parking_difficulty'] = round(features['avg_parking_difficulty'], 2)
        
        return features
    
    def score_solution(self, preferences, features):
        """
        Score how well a solution matches preferences.
        Lower score = better match
        """
        if features is None:
            return float('inf')
        
        score = 0.0
        
        # Parking importance
        score += preferences['parking_importance'] * features['avg_parking_difficulty']
        
        # Time importance (scaled up for balance)
        score += preferences['time_importance'] * features['total_travel_hours'] * 10
        
        # Distance importance (scaled down for balance)
        score += preferences['distance_importance'] * features['total_distance_km'] * 0.1
        
        # Penalty for unplanned jobs
        score += features['unplanned_jobs'] * 1000
        
        return score
    
    # ============================================================
    # PHASE 1: BUILD COST POOL
    # ============================================================
    
    def build_cost_pool(self, pool_size, wait_seconds, checkpoint_every, pool_file):
        """
        Phase 1: Generate pool of cost sets with their optimized solutions
        """
        print(f"\n{'='*60}")
        print(f"PHASE 1: BUILDING COST POOL")
        print(f"{'='*60}")
        print(f"Target pool size: {pool_size}")
        print(f"Estimated time: {(pool_size * wait_seconds) / 3600:.1f} hours")
        print(f"Output: {pool_file}")
        print(f"{'='*60}\n")
        
        start_time = time.time()
        
        try:
            for i in range(1, pool_size + 1):
                print(f"\nCost Set {i}/{pool_size}:")
                
                try:
                    # Generate random costs
                    costs = self.sample_random_costs()
                    print(f"  Costs: hourly={costs['costPerTravelHour']:.2f}, "
                          f"km={costs['costPerKm']:.4f}, parking_mult={costs['parking_multiplier']:.2f}")
                    
                    # Apply to model and send to API
                    model = self.apply_costs_to_model(costs)
                    model_id = f"cost_pool_{i}"
                    
                    if not self.send_to_odl(model, model_id):
                        print(f"  ✗ Failed to send model, skipping...")
                        continue
                    
                    print(f"  ✓ Model uploaded, waiting {wait_seconds}s for optimization...")
                    time.sleep(wait_seconds)
                    
                    # Get results
                    plan = self.get_plan(model_id)
                    features = self.extract_features(plan)
                    
                    if features is None:
                        print(f"  ✗ Failed to extract features, skipping...")
                        continue
                    
                    print(f"  Features: dist={features['total_distance_km']:.1f}km, "
                          f"time={features['total_travel_hours']:.1f}h, "
                          f"parking={features['avg_parking_difficulty']:.1f}")
                    
                    # Add to pool
                    self.cost_pool.append({
                        'pool_id': i,
                        'costs': costs,
                        'features': features,
                        'timestamp': datetime.now().isoformat()
                    })
                    
                    print(f"  ✓ Added to pool (total: {len(self.cost_pool)})")
                    
                    # Checkpoint
                    if i % checkpoint_every == 0:
                        self.save_cost_pool(pool_file)
                        elapsed = time.time() - start_time
                        avg_time = elapsed / i  # average per attempted cost set, matching the (pool_size - i) attempts left
                        remaining = (pool_size - i) * avg_time
                        
                        print(f"\n  {'='*56}")
                        print(f"  CHECKPOINT: {len(self.cost_pool)}/{pool_size} cost sets in pool")
                        print(f"  Elapsed: {elapsed/3600:.1f} hours")
                        print(f"  Remaining: {remaining/3600:.1f} hours")
                        print(f"  {'='*56}\n")
                
                except Exception as e:
                    print(f"  ✗ Error: {e}")
                    continue
        
        except KeyboardInterrupt:
            print(f"\n\nInterrupted by user!")
            print(f"Cost pool has {len(self.cost_pool)} entries so far")
        
        # Final save
        self.save_cost_pool(pool_file)
        
        elapsed = time.time() - start_time
        print(f"\n{'='*60}")
        print(f"PHASE 1 COMPLETE")
        print(f"{'='*60}")
        print(f"Cost pool size: {len(self.cost_pool)}")
        print(f"Total time: {elapsed/3600:.1f} hours")
        if len(self.cost_pool) > 0:
            print(f"Average time per cost set: {elapsed/len(self.cost_pool):.1f} seconds")
        print(f"Data saved to: {pool_file}")
        print(f"{'='*60}\n")
    
    def save_cost_pool(self, pool_file):
        """Save cost pool to file"""
        with open(pool_file, 'w') as f:
            json.dump(self.cost_pool, f, indent=2)
        print(f"  💾 Saved {len(self.cost_pool)} cost sets to {pool_file}")
    
    def load_cost_pool(self, pool_file):
        """Load cost pool from file"""
        try:
            with open(pool_file, 'r') as f:
                self.cost_pool = json.load(f)
            print(f"✓ Loaded cost pool with {len(self.cost_pool)} entries from {pool_file}")
            return True
        except FileNotFoundError:
            print(f"✗ Cost pool file not found: {pool_file}")
            return False
    
    # ============================================================
    # PHASE 2: MATCH PREFERENCES TO POOL
    # ============================================================
    
    def find_best_match_in_pool(self, preferences):
        """
        Find the best matching cost set from the pool for given preferences
        """
        if len(self.cost_pool) == 0:
            return None
        
        best_match = None
        best_score = float('inf')
        
        for entry in self.cost_pool:
            score = self.score_solution(preferences, entry['features'])
            if score < best_score:
                best_score = score
                best_match = entry
        
        return {
            'costs': best_match['costs'],
            'features': best_match['features'],
            'score': best_score,
            'pool_id': best_match['pool_id']
        }
    
    def generate_custom_costs_for_preference(self, preferences, wait_seconds):
        """
        Fallback: Generate and optimize 5 custom costs for a preference with poor matches
        """
        print(f"    Generating 5 custom costs (poor match in pool)...")
        
        candidates = []
        
        for i in range(5):
            costs = self.sample_random_costs()
            model = self.apply_costs_to_model(costs)
            model_id = f"custom_costs_{datetime.now().timestamp()}_{i}"
            
            if not self.send_to_odl(model, model_id):
                continue
            
            time.sleep(wait_seconds)
            plan = self.get_plan(model_id)
            features = self.extract_features(plan)
            
            if features is None:
                continue
            
            score = self.score_solution(preferences, features)
            candidates.append({
                'costs': costs,
                'features': features,
                'score': score
            })
        
        if len(candidates) == 0:
            return None
        
        return min(candidates, key=lambda x: x['score'])
    
    def generate_training_samples(self, n_samples, match_threshold, wait_seconds, output_file):
        """
        Phase 2: Generate training samples by matching preferences to cost pool
        """
        if len(self.cost_pool) == 0:
            print("✗ Cost pool is empty! Run Phase 1 first.")
            return
        
        print(f"\n{'='*60}")
        print(f"PHASE 2: GENERATING TRAINING SAMPLES")
        print(f"{'='*60}")
        print(f"Cost pool size: {len(self.cost_pool)}")
        print(f"Target samples: {n_samples}")
        print(f"Match threshold: {match_threshold}")
        print(f"Output: {output_file}")
        print(f"{'='*60}\n")
        
        custom_costs_generated = 0
        
        for i in range(1, n_samples + 1):
            print(f"\nSample {i}/{n_samples}:")
            
            # Sample preferences
            preferences = self.sample_preferences()
            print(f"  Preferences: parking={preferences['parking_importance']:.2f}, "
                  f"time={preferences['time_importance']:.2f}, "
                  f"distance={preferences['distance_importance']:.2f}")
            
            # Find best match in pool
            match = self.find_best_match_in_pool(preferences)
            
            if match is None:
                print(f"  ✗ No match found, skipping...")
                continue
            
            print(f"  Best match: pool_id={match['pool_id']}, score={match['score']:.2f}")
            
            # Check match quality
            if match['score'] > match_threshold:
                print(f"  ⚠️  Score too high (>{match_threshold}), generating custom costs...")
                custom_match = self.generate_custom_costs_for_preference(preferences, wait_seconds)
                
                if custom_match is not None and custom_match['score'] < match['score']:
                    match = custom_match
                    match['pool_id'] = 'custom'
                    custom_costs_generated += 1
                    print(f"    ✓ Custom costs better: score={match['score']:.2f}")
            
            # Create training sample
            sample = {
                'sample_num': i,
                'preferences': preferences,
                'costs': match['costs'],
                'features': match['features'],
                'score': match['score'],
                'pool_id': match['pool_id'],
                'timestamp': datetime.now().isoformat()
            }
            
            self.training_data.append(sample)
            print(f"  ✓ Sample added (total: {len(self.training_data)})")
            
            # Save periodically
            if i % 50 == 0:
                self.save_training_data(output_file)
                print(f"\n  💾 Checkpoint: {len(self.training_data)} samples saved\n")
        
        # Final save
        self.save_training_data(output_file)
        
        print(f"\n{'='*60}")
        print(f"PHASE 2 COMPLETE")
        print(f"{'='*60}")
        print(f"Training samples generated: {len(self.training_data)}")
        print(f"Custom costs generated: {custom_costs_generated}")
        print(f"Pool reuse rate: {(1 - custom_costs_generated/n_samples)*100:.1f}%")
        print(f"Data saved to: {output_file}")
        print(f"{'='*60}\n")
    
    def save_training_data(self, output_file):
        """Save training data to file"""
        with open(output_file, 'w') as f:
            json.dump(self.training_data, f, indent=2)
        print(f"💾 Saved {len(self.training_data)} training samples to {output_file}")
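
`build_cost_pool` always counts from `i = 1`, so rerunning it after an interruption reuses pool IDs that are already stored. One way to resume from a checkpoint is to work out which IDs are still missing. This is a sketch under the assumption that the pool has the list-of-dicts layout written by `save_cost_pool`; `remaining_pool_ids` is a hypothetical helper, not a method of the class.

```python
def remaining_pool_ids(cost_pool, pool_size):
    """Return the pool IDs in 1..pool_size that have no stored entry yet."""
    done = {entry["pool_id"] for entry in cost_pool}
    return [i for i in range(1, pool_size + 1) if i not in done]

# Example: entries 1 and 3 were saved before the run was interrupted
checkpoint = [{"pool_id": 1}, {"pool_id": 3}]
print(remaining_pool_ids(checkpoint, 5))
```

After `load_cost_pool`, a resumed Phase 1 loop could iterate over this list instead of `range(1, pool_size + 1)`. IDs that were skipped because of a failed upload are retried as well.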

## Initialize Collector

In [40]:
# Validate configuration
if USERNAME == "your-username" or PASSWORD == "your-password":
    print("❌ ERROR: Please set your USERNAME and PASSWORD in the configuration cell above!")
else:
    # Initialize collector
    collector = CostPoolDataCollector(
        base_url=BASE_URL,
        username=USERNAME,
        password=PASSWORD,
        model_path=MODEL_PATH,
        cost_ranges=COST_RANGES
    )
    print("\n✓ Cost pool collector initialized and ready!")

✓ Base model loaded
  Jobs: 106
  Vehicles: 26

✓ Cost pool collector initialized and ready!


In [41]:
# TEST
# Run this diagnostic cell to find the real problem
import json
from datetime import datetime
import time

# Generate test costs
test_costs = collector.sample_random_costs()
print(f"Test costs: {test_costs}\n")

# Apply to model
test_model = collector.apply_costs_to_model(test_costs)

# Send it
model_id = "diagnostic_test"
print(f"Sending to API...")
success = collector.send_to_odl(test_model, model_id)

if not success:
    print("✗ Upload failed!")
else:
    print(f"✓ Upload successful")
    print(f"\nWaiting 90 seconds for optimization...")
    time.sleep(90)
    
    print(f"Getting plan...")
    plan = collector.get_plan(model_id)
    
    if not plan:
        print("✗ No plan returned - API issue?")
    else:
        print(f"✓ Got plan back!\n")
        
        # Key diagnostics
        unplanned = plan.get('unplannedJobs', [])
        vehicle_plans = plan.get('vehiclePlans', [])
        
        print(f"Summary:")
        print(f"  Total jobs in model: {len(test_model['data']['jobs'])}")
        print(f"  Unplanned jobs: {len(unplanned)}")
        print(f"  Vehicle plans: {len(vehicle_plans)}")
        
        # Check vehicles
        total_stops = 0
        for vp in vehicle_plans:
            stops = vp.get('plannedStops', [])
            total_stops += len(stops)
            if stops:
                print(f"  Vehicle {vp.get('vehicleId')}: {len(stops)} stops")
        
        print(f"  Total planned stops: {total_stops}")
        
        if len(unplanned) > 0:
            print(f"\n⚠️ PROBLEM: {len(unplanned)} jobs are unplanned!")
            print(f"\nUnplanned job reasons:")
            for uj in unplanned[:5]:  # Show first 5
                print(f"  Job {uj.get('jobId')}: {uj.get('reason', 'No reason given')}")
        
        if total_stops == 0:
            print(f"\n✗ MAJOR PROBLEM: No stops planned at all!")
            print(f"All jobs rejected by optimizer")
        
        # Extract features
        features = collector.extract_features(plan)
        print(f"\nExtracted features:")
        print(f"  {features}")
        
        if features and features['total_distance_km'] == 0:
            print(f"\n✗ This is why you're getting zeros!")

Test costs: {'costPerTravelHour': 1.68, 'costPerKm': 0.3085, 'parking_multiplier': 4.91}

Sending to API...
✓ Upload successful

Waiting 90 seconds for optimization...
Getting plan...
✓ Got plan back!

Summary:
  Total jobs in model: 106
  Unplanned jobs: 20
  Vehicle plans: 26
  Vehicle 758524: 18 stops
  Vehicle 758525: 26 stops
  Vehicle 758526: 4 stops
  Vehicle 759409: 20 stops
  Vehicle 759410: 20 stops
  Vehicle 759411: 18 stops
  Vehicle 760317: 30 stops
  Vehicle 760318: 36 stops
  Total planned stops: 172

⚠️ PROBLEM: 20 jobs are unplanned!

Unplanned job reasons:
  Job None: No reason given
  Job None: No reason given
  Job None: No reason given
  Job None: No reason given
  Job None: No reason given

Extracted features:
  {'total_distance_km': 1007.28, 'total_travel_hours': 30.36, 'total_cost': 606487.9, 'num_stops': 172, 'num_vehicles_used': 8, 'unplanned_jobs': 20, 'avg_parking_difficulty': 0.0}
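
Plugging these features into the `score_solution` formula shows how the unplanned-job penalty interacts with `MATCH_QUALITY_THRESHOLD = 500`. The sketch below copies the weights from `score_solution`; the standalone `score` function exists only for illustration.

```python
# Features printed by the diagnostic run above (only the fields the score uses)
features = {
    "total_distance_km": 1007.28,
    "total_travel_hours": 30.36,
    "unplanned_jobs": 20,
    "avg_parking_difficulty": 0.0,
}

def score(preferences, features):
    """Same weighted sum as score_solution; lower is better."""
    return (preferences["parking_importance"] * features["avg_parking_difficulty"]
            + preferences["time_importance"] * features["total_travel_hours"] * 10
            + preferences["distance_importance"] * features["total_distance_km"] * 0.1
            + features["unplanned_jobs"] * 1000)

# Lowest possible score: all preference weights at 0, leaving only the penalty term
zero_prefs = {"parking_importance": 0.0, "time_importance": 0.0, "distance_importance": 0.0}
# Highest possible score: all preference weights at 1
full_prefs = {"parking_importance": 1.0, "time_importance": 1.0, "distance_importance": 1.0}

MATCH_QUALITY_THRESHOLD = 500
print(score(zero_prefs, features) > MATCH_QUALITY_THRESHOLD)
print(score(full_prefs, features) > MATCH_QUALITY_THRESHOLD)
```

With 20 unplanned jobs the penalty term alone is 20 × 1000 = 20000, so any preference scored against an entry like this one exceeds the threshold. If every pool entry has a similar number of unplanned jobs, each Phase 2 sample would trigger the fallback. In addition, `avg_parking_difficulty` is 0.0 here, so `parking_importance` has no effect on the score.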


In [42]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    if not collector.send_to_odl(model, model_id):
        continue  # skip the wait if the upload failed
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 1010.88, 'total_travel_hours': 30.42, 'total_cost': 439800.71, 'num_stops': 172, 'num_vehicles_used': 8, 'unplanned_jobs': 20, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 1005.55, 'total_travel_hours': 30.36, 'total_cost': 610266.12, 'num_stops': 172, 'num_vehicles_used': 8, 'unplanned_jobs': 20, 'avg_parking_difficulty': 0.0}
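
A quick way to quantify how much the plan changed between the two extreme cost sets is the relative difference per feature. This is a minimal sketch using the two feature dicts printed above; `relative_spread` is a hypothetical helper.

```python
# Features from Test 1 (lowest costs) and Test 2 (highest costs) above
low_cost_run = {"total_distance_km": 1010.88, "total_travel_hours": 30.42, "total_cost": 439800.71}
high_cost_run = {"total_distance_km": 1005.55, "total_travel_hours": 30.36, "total_cost": 610266.12}

def relative_spread(a, b):
    """Absolute difference relative to the larger magnitude, per shared key."""
    spread = {}
    for key in a.keys() & b.keys():
        denom = max(abs(a[key]), abs(b[key]))
        spread[key] = 0.0 if denom == 0 else abs(a[key] - b[key]) / denom
    return spread

for key, value in sorted(relative_spread(low_cost_run, high_cost_run).items()):
    print(f"{key}: {value:.1%}")
```

Distance and travel time differ by well under 1% between the two runs, while `total_cost` differs by roughly 28%. The cost difference may largely reflect the changed cost coefficients rather than a different route. If the same holds across the whole pool, the stored features may be too similar for different preferences to select meaningfully different entries.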


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    if not collector.send_to_odl(model, model_id):
        continue  # skip the wait if the upload failed
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


In [None]:
# Test with very different costs and longer wait
import time

test_cases = [
    {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2},  # Minimize everything
    {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0},   # Maximize everything
]

for i, costs in enumerate(test_cases):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {costs}")
    
    model = collector.apply_costs_to_model(costs)
    model_id = f"extreme_test_{i}_{datetime.now().timestamp()}"  # Very unique ID
    
    collector.send_to_odl(model, model_id)
    print(f"Waiting 120 seconds...")
    time.sleep(120)
    
    plan = collector.get_plan(model_id)
    features = collector.extract_features(plan)
    
    print(f"Features: {features}")


Test 1: {'costPerTravelHour': 0.5, 'costPerKm': 0.01, 'parking_multiplier': 0.2}
Waiting 120 seconds...
Features: {'total_distance_km': 146.8, 'total_travel_hours': 4.3, 'total_cost': 807.68, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}

Test 2: {'costPerTravelHour': 5.0, 'costPerKm': 0.5, 'parking_multiplier': 5.0}
Waiting 120 seconds...
Features: {'total_distance_km': 145.94, 'total_travel_hours': 4.32, 'total_cost': 20141.64, 'num_stops': 18, 'num_vehicles_used': 3, 'unplanned_jobs': 16, 'avg_parking_difficulty': 0.0}


## Phase 1: Build Cost Pool

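With 300 cost sets and a 90-second wait per optimization, this takes about 7.5 hours of waiting (300 × 90 s). The estimate scales as `POOL_SIZE × WAIT_SECONDS`, and the interrupted run below averaged 133.1 seconds per cost set, so a full run may take longer than the estimate.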

**Run this cell once** - the pool will be saved and can be reused!
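
As a rough illustration of what Phase 1 does before each optimization, the sketch below draws random cost sets from the configured ranges and estimates the run time. It assumes uniform sampling, which the notebook does not confirm; `sample_cost_set` and `estimate_hours` are hypothetical helper names, not methods of `collector`.

```python
import random

# Ranges copied from the notebook's COST_RANGES configuration.
COST_RANGES = {
    "costPerTravelHour": (0.5, 5.0),
    "costPerKm": (0.014, 0.49),
    "parking_multiplier": (0.2, 5.0),
}


def sample_cost_set(rng, ranges=COST_RANGES):
    """Draw one cost set, uniformly within each range (assumed distribution)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def estimate_hours(pool_size, seconds_per_set):
    """Run time in hours if every cost set takes seconds_per_set."""
    return pool_size * seconds_per_set / 3600


rng = random.Random(42)  # fixed seed so the pool can be reproduced
cost_sets = [sample_cost_set(rng) for _ in range(300)]

print(cost_sets[0])
print(f"{estimate_hours(300, 90):.1f} hours at 90 s per cost set")
print(f"{estimate_hours(300, 133.1):.1f} hours at 133.1 s per cost set")
```

Passing a measured per-set time instead of the nominal wait gives a more realistic estimate for a long run.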

In [27]:
# Build cost pool (run once)
collector.build_cost_pool(
    pool_size=POOL_SIZE,
    wait_seconds=WAIT_SECONDS,
    checkpoint_every=POOL_CHECKPOINT_EVERY,
    pool_file=COST_POOL_FILE
)


PHASE 1: BUILDING COST POOL
Target pool size: 300
Estimated time: 7.5 hours
Output: cost_pool.json


Cost Set 1/300:
  Costs: hourly=2.24, km=0.3089, parking_mult=0.39
  ✓ Model uploaded, waiting 90s for optimization...
  Features: dist=145.9km, time=4.3h, parking=0.0
  ✓ Added to pool (total: 1)

Cost Set 2/300:
  Costs: hourly=2.29, km=0.4857, parking_mult=2.61
  ✓ Model uploaded, waiting 90s for optimization...
  Features: dist=145.9km, time=4.3h, parking=0.0
  ✓ Added to pool (total: 2)

Cost Set 3/300:
  Costs: hourly=3.03, km=0.2785, parking_mult=4.97
  ✓ Model uploaded, waiting 90s for optimization...


Interrupted by user!
Cost pool has 2 entries so far
  💾 Saved 2 cost sets to cost_pool.json

PHASE 1 COMPLETE
Cost pool size: 2
Total time: 0.1 hours
Average time per cost set: 133.1 seconds
Data saved to: cost_pool.json



## Load Existing Cost Pool (Optional)

If you already have a cost pool file, load it here instead of rebuilding.

In [None]:
# Load existing pool (if you already built it)
# collector.load_cost_pool(COST_POOL_FILE)
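
To decide whether Phase 1 can be skipped, it can help to check the pool file before calling `collector.load_cost_pool`. The sketch below is a standalone check, not the collector's own loader. It assumes the file is a JSON list of entries with `pool_id`, `costs`, and `features` keys (the keys the analysis cell reads from `collector.cost_pool`); `load_pool_if_present` is a hypothetical helper, and the `pool_id` value in the demo is made up.

```python
import json
import tempfile
from pathlib import Path


def load_pool_if_present(pool_file):
    """Return the pool entries stored in pool_file, or [] if the file is missing."""
    path = Path(pool_file)
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        pool = json.load(f)
    # Assumed layout: a top-level JSON list of pool entries.
    if not isinstance(pool, list):
        raise ValueError(f"{pool_file} does not contain a list of pool entries")
    return pool


# Demo on a throwaway file; the costs and features mirror the first cost set
# logged in the Phase 1 output.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "cost_pool.json"
    missing = load_pool_if_present(demo_file)  # file not written yet
    entry = {
        "pool_id": 0,
        "costs": {"costPerTravelHour": 2.24, "costPerKm": 0.3089, "parking_multiplier": 0.39},
        "features": {"total_distance_km": 145.9, "total_travel_hours": 4.3, "avg_parking_difficulty": 0.0},
    }
    demo_file.write_text(json.dumps([entry]), encoding="utf-8")
    loaded = load_pool_if_present(demo_file)

print(len(missing), len(loaded))
```

If the helper returns a non-empty list for `COST_POOL_FILE`, the pool already exists and the load cell above can be uncommented instead of rebuilding.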

## Analyze Cost Pool

Quick analysis of the generated cost pool.

In [None]:
# Analyze cost pool
if len(collector.cost_pool) > 0:
    df_pool = pd.DataFrame([
        {
            'pool_id': entry['pool_id'],
            **{f'cost_{k}': v for k, v in entry['costs'].items()},
            **{f'feat_{k}': v for k, v in entry['features'].items()}
        }
        for entry in collector.cost_pool
    ])
    
    print(f"Cost Pool Analysis:")
    print(f"  Pool size: {len(collector.cost_pool)}")
    print(f"\nCost parameter ranges:")
    print(df_pool[['cost_costPerTravelHour', 'cost_costPerKm', 'cost_parking_multiplier']].describe())
    print(f"\nSolution feature ranges:")
    print(df_pool[['feat_total_distance_km', 'feat_total_travel_hours', 'feat_avg_parking_difficulty']].describe())
else:
    print("Cost pool is empty. Run Phase 1 first!")

## Phase 2: Generate Training Samples

This matches preferences to the cost pool and generates training samples.

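**Matching is very fast** (near-instant for 400 samples) because it only scores existing pool entries instead of running new optimizations. The exception is a poor match: when the best score exceeds `MATCH_QUALITY_THRESHOLD`, custom costs are generated and optimized, which adds one `WAIT_SECONDS` wait per such sample.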
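
The matching step can be sketched as follows. The scoring formula here is an assumption (an importance-weighted sum of features, lower is better); the collector's real scoring function is not shown in this notebook. `score_solution` and `match_preferences` are hypothetical names, and the two pool entries use made-up feature values for illustration only.

```python
def score_solution(preferences, features):
    """Hypothetical score: importance-weighted sum of features (lower is better)."""
    return (
        preferences["distance_importance"] * features["total_distance_km"]
        + preferences["time_importance"] * features["total_travel_hours"]
        + preferences["parking_importance"] * features["avg_parking_difficulty"]
    )


def match_preferences(preferences, pool, threshold):
    """Return (best_entry, best_score, needs_custom) for one preference set."""
    best_entry = min(pool, key=lambda e: score_solution(preferences, e["features"]))
    best_score = score_solution(preferences, best_entry["features"])
    # A score above the threshold marks a poor match that needs custom costs.
    return best_entry, best_score, best_score > threshold


# Illustrative pool with invented feature values.
pool = [
    {"pool_id": 0, "features": {"total_distance_km": 150.0, "total_travel_hours": 4.0, "avg_parking_difficulty": 2.0}},
    {"pool_id": 1, "features": {"total_distance_km": 160.0, "total_travel_hours": 4.5, "avg_parking_difficulty": 0.5}},
]
preferences = {"parking_importance": 10.0, "time_importance": 1.0, "distance_importance": 0.1}

best, score, needs_custom = match_preferences(preferences, pool, threshold=500)
print(best["pool_id"], round(score, 2), needs_custom)
```

Because each preference set is scored against every pool entry, the cost per sample is one pass over the pool, which is why no optimization wait is involved unless `needs_custom` is true.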

In [None]:
# Generate training samples by matching to pool
collector.generate_training_samples(
    n_samples=N_PREFERENCE_SAMPLES,
    match_threshold=MATCH_QUALITY_THRESHOLD,
    wait_seconds=WAIT_SECONDS,  # Only used if custom costs needed
    output_file=OUTPUT_FILE
)

## Analyze Training Data

In [None]:
# Load and analyze training data
with open(OUTPUT_FILE, 'r') as f:
    data = json.load(f)

print(f"Training Dataset Summary:")
print(f"  Total samples: {len(data)}")

# Convert to DataFrame
df_records = []
for sample in data:
    record = {
        'sample_num': sample['sample_num'],
        **{f'pref_{k}': v for k, v in sample['preferences'].items()},
        **{f'cost_{k}': v for k, v in sample['costs'].items()},
        **{f'feat_{k}': v for k, v in sample['features'].items()},
        'score': sample['score'],
        'pool_id': sample['pool_id']
    }
    df_records.append(record)

df = pd.DataFrame(df_records)

print(f"\nDataFrame shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

print(f"\nPreferences statistics:")
print(df[['pref_parking_importance', 'pref_time_importance', 'pref_distance_importance']].describe())

print(f"\nCosts statistics:")
print(df[['cost_costPerTravelHour', 'cost_costPerKm', 'cost_parking_multiplier']].describe())

print(f"\nFeatures statistics:")
print(df[['feat_total_distance_km', 'feat_total_travel_hours', 'feat_avg_parking_difficulty']].describe())

print(f"\nScore statistics:")
print(df['score'].describe())

# Count custom vs pool matches
custom_count = (df['pool_id'] == 'custom').sum()
pool_count = (df['pool_id'] != 'custom').sum()
print(f"\nMatch sources:")
print(f"  From pool: {pool_count} ({pool_count/len(df)*100:.1f}%)")
print(f"  Custom generated: {custom_count} ({custom_count/len(df)*100:.1f}%)")

## Visualize Cost Pool Coverage

Check if the cost pool covers diverse solutions.

In [None]:
import matplotlib.pyplot as plt

if len(collector.cost_pool) > 0:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Extract data
    distances = [e['features']['total_distance_km'] for e in collector.cost_pool]
    times = [e['features']['total_travel_hours'] for e in collector.cost_pool]
    parking = [e['features']['avg_parking_difficulty'] for e in collector.cost_pool]
    
    # Plot distributions
    axes[0].hist(distances, bins=30, edgecolor='black')
    axes[0].set_xlabel('Total Distance (km)')
    axes[0].set_ylabel('Count')
    axes[0].set_title('Distance Distribution in Pool')
    
    axes[1].hist(times, bins=30, edgecolor='black')
    axes[1].set_xlabel('Total Time (hours)')
    axes[1].set_ylabel('Count')
    axes[1].set_title('Time Distribution in Pool')
    
    axes[2].hist(parking, bins=30, edgecolor='black')
    axes[2].set_xlabel('Avg Parking Difficulty')
    axes[2].set_ylabel('Count')
    axes[2].set_title('Parking Distribution in Pool')
    
    plt.tight_layout()
    plt.show()
    
    print("Good coverage means each histogram spans a range of values instead of piling up in a single bin.")
else:
    print("No cost pool data to visualize")
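
A numeric check can complement the histograms. The sketch below counts how many distinct solutions the pool contains after rounding each feature; if the count is far below the pool size, many cost sets led to the same plan. `count_distinct_solutions` is a hypothetical helper, and the rounding precision is an assumption. The two demo entries reuse the values logged for the first two cost sets in the Phase 1 output.

```python
FEATURE_KEYS = ("total_distance_km", "total_travel_hours", "avg_parking_difficulty")


def count_distinct_solutions(pool, keys=FEATURE_KEYS, decimals=1):
    """Count unique feature vectors in the pool after rounding each feature."""
    seen = {
        tuple(round(entry["features"][k], decimals) for k in keys)
        for entry in pool
    }
    return len(seen)


# Features as logged for cost sets 1 and 2 in the Phase 1 output.
demo_pool = [
    {"costs": {"costPerTravelHour": 2.24, "costPerKm": 0.3089, "parking_multiplier": 0.39},
     "features": {"total_distance_km": 145.9, "total_travel_hours": 4.3, "avg_parking_difficulty": 0.0}},
    {"costs": {"costPerTravelHour": 2.29, "costPerKm": 0.4857, "parking_multiplier": 2.61},
     "features": {"total_distance_km": 145.9, "total_travel_hours": 4.3, "avg_parking_difficulty": 0.0}},
]

distinct = count_distinct_solutions(demo_pool)
print(f"{distinct} distinct solution(s) from {len(demo_pool)} cost sets")
```

On a real pool, the same call would be `count_distinct_solutions(collector.cost_pool)`, assuming the entries keep the `features` layout used in the cells above.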

## Generate More Samples (Optional)

Want even more training data? Just run Phase 2 again!

The cost pool can be reused to generate as many samples as you want.

In [None]:
# Generate another batch of samples
# collector.training_data = []  # Reset if you want fresh samples
# collector.generate_training_samples(
#     n_samples=200,
#     match_threshold=MATCH_QUALITY_THRESHOLD,
#     wait_seconds=WAIT_SECONDS,
#     output_file='training_data_v3_pool_batch2.json'
# )

## Next Steps

Congratulations! You now have:

1. ✅ **Cost Pool**: 300 diverse cost/solution pairs
2. ✅ **Training Data**: 400+ preference → cost samples
3. ✅ **Efficiency**: ~7.5 hours total (vs 20+ hours for old approach)
4. ✅ **Reusability**: Can generate more samples anytime

Next:
- 🔜 Train regression model: preferences → costs
- 🔜 Validate model performance
- 🔜 Deploy for real-time use