# Data Preprocessing Demo

This notebook demonstrates the data preprocessing pipeline that transforms raw location data into clean, augmented data ready for optimization algorithms.

## Overview

The preprocessing follows a three-phase approach:

1. **Phase 1: Initial Data Cleaning and Structuring**
   - Filter tourist attractions only
   - Select core columns (name, latitude, longitude)
   - Handle missing values
   - Validate that coordinates fall within Sri Lanka
   - Remove duplicates

2. **Phase 2: Data Augmentation**
   - Create category column based on name patterns
   - Generate interest scores (0-100 scale)
   - Generate visit durations (hours)

3. **Phase 3: Final Preparation for Optimization**
   - Calculate distance matrix (Haversine formula)
   - Convert to travel time matrix (40 km/h avg speed)
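
As a rough illustration of Phase 3, the sketch below computes a pairwise Haversine distance matrix and converts it to travel time at the 40 km/h average speed mentioned above. The function name `haversine_matrix`, the Earth-radius constant, and the two sample coordinates are assumptions made for this example; the project's own implementation may differ.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)
AVG_SPEED_KMH = 40.0      # average speed stated in the overview


def haversine_matrix(lat_deg, lon_deg):
    """Return pairwise great-circle distances in km (hypothetical helper)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcast to get every pairwise difference in latitude and longitude
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point overshoots outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Approximate coordinates for Colombo and Kandy, for illustration only
lats = [6.9271, 7.2906]
lons = [79.8612, 80.6337]

distance_km = haversine_matrix(lats, lons)
travel_time_h = distance_km / AVG_SPEED_KMH  # hours = km / (km/h)
```

Because Haversine distances are straight-line (great-circle) distances, travel times derived this way are likely to underestimate real road travel.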

In [None]:
import sys

# Make the project's helper modules in ../scripts importable
sys.path.insert(0, '../scripts')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from data_utils import load_attractions_data, prepare_data_for_optimization

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

## Load Preprocessed Data

The preprocessed data is already available in `data/processed/`. Let's load it and explore.

In [None]:
# Load preprocessed attractions
attractions = load_attractions_data('../data/processed/attractions.csv')

print(f"Loaded {len(attractions)} tourist attractions")
print(f"\nColumns: {list(attractions.columns)}")
print(f"\nFirst 5 rows:")
attractions.head()

## Dataset Statistics

In [None]:
print("Dataset Summary")
print("=" * 80)
print(f"Total attractions: {len(attractions)}")
print(f"\nCategory Distribution:")
print(attractions['category'].value_counts())
print(f"\nInterest Score Statistics:")
print(attractions['interest_score'].describe())
print(f"\nVisit Duration Statistics (hours):")
print(attractions['visit_duration'].describe())

## Visualize Category Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Category distribution
category_counts = attractions['category'].value_counts()
axes[0].barh(range(len(category_counts)), category_counts.values)
axes[0].set_yticks(range(len(category_counts)))
axes[0].set_yticklabels(category_counts.index)
axes[0].set_xlabel('Count')
axes[0].set_title('Distribution of Attraction Categories')
axes[0].grid(True, alpha=0.3)

# Interest score distribution
axes[1].hist(attractions['interest_score'], bins=20, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Interest Score')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Interest Scores')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Geographic Distribution

In [None]:
# Scatter-plot attractions by longitude and latitude
fig, ax = plt.subplots(figsize=(12, 10))

# Color by category
categories = attractions['category'].unique()
colors = plt.cm.tab10(np.linspace(0, 1, len(categories)))
category_colors = dict(zip(categories, colors))

for category in categories:
    subset = attractions[attractions['category'] == category]
    ax.scatter(
        subset['longitude'],
        subset['latitude'],
        c=[category_colors[category]],
        s=50,
        alpha=0.6,
        label=category,
        edgecolors='black',
        linewidths=0.5
    )

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Geographic Distribution of Tourist Attractions in Sri Lanka')
ax.legend(loc='upper right', fontsize=8)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Prepare Data for Optimization

Use the `prepare_data_for_optimization` function to bundle the attractions, distance matrix, interest scores, and visit durations into a single dictionary that the algorithms can consume.
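
Before handing the packaged data to an optimizer, it can help to confirm that all arrays agree on the number of attractions. The sketch below is one possible check: `check_prepared_data` is a hypothetical helper, not part of `data_utils`, and it assumes that `scores` and `visit_durations` are one-dimensional arrays of length `n_attractions`.

```python
import numpy as np


def check_prepared_data(prepared):
    """Raise ValueError if the packaged arrays disagree on size (hypothetical helper)."""
    n = prepared["n_attractions"]
    # Assumed shapes: square distance matrix, 1-D scores and durations
    expected = {
        "distance_matrix": (n, n),
        "scores": (n,),
        "visit_durations": (n,),
    }
    for key, shape in expected.items():
        actual = np.asarray(prepared[key]).shape
        if actual != shape:
            raise ValueError(f"{key}: expected shape {shape}, got {actual}")
    return True


# Stand-in dictionary with three made-up attractions, for illustration only
example = {
    "n_attractions": 3,
    "distance_matrix": np.zeros((3, 3)),
    "scores": np.array([80.0, 65.0, 90.0]),
    "visit_durations": np.array([2.0, 1.5, 3.0]),
}
check_prepared_data(example)
```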

In [None]:
# Prepare data for optimization
prepared_data = prepare_data_for_optimization(attractions)

if prepared_data:
    print("Data prepared successfully for optimization!")
    print(f"\nPrepared data contains:")
    print(f"  - attractions: DataFrame with {prepared_data['n_attractions']} locations")
    print(f"  - distance_matrix: {prepared_data['distance_matrix'].shape} numpy array")
    print(f"  - scores: {prepared_data['scores'].shape} numpy array")
    print(f"  - visit_durations: {prepared_data['visit_durations'].shape} numpy array")
    print(f"\nDistance Matrix Statistics:")
    distances = prepared_data['distance_matrix'][prepared_data['distance_matrix'] > 0]
    print(f"  - Min distance: {distances.min():.2f} km")
    print(f"  - Max distance: {distances.max():.2f} km")
    print(f"  - Mean distance: {distances.mean():.2f} km")

## Load Pre-calculated Matrices

The distance and travel time matrices are pre-calculated and saved as `.npy` files, so they do not have to be recomputed on every run.
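
For reference, a minimal sketch of how such matrices could be written and read back with `np.save` and `np.load`, followed by a few sanity checks. The 3x3 distance matrix is made up, and the files go to a temporary directory rather than `data/processed/`; the file names simply mirror the ones used in this notebook.

```python
import os
import tempfile

import numpy as np

AVG_SPEED_KMH = 40.0  # average speed from the overview

# Made-up symmetric distance matrix in km, for illustration only
distance_km = np.array([
    [0.0, 12.0, 30.0],
    [12.0, 0.0, 20.0],
    [30.0, 20.0, 0.0],
])
travel_time_h = distance_km / AVG_SPEED_KMH

# Round-trip both matrices through .npy files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    dist_path = os.path.join(tmp_dir, "distance_matrix.npy")
    time_path = os.path.join(tmp_dir, "travel_time_matrix.npy")
    np.save(dist_path, distance_km)
    np.save(time_path, travel_time_h)
    loaded_dist = np.load(dist_path)
    loaded_time = np.load(time_path)

# Sanity checks that are cheap to run after loading
is_symmetric = np.allclose(loaded_dist, loaded_dist.T)
has_zero_diagonal = np.allclose(np.diag(loaded_dist), 0.0)
speed_consistent = np.allclose(loaded_time * AVG_SPEED_KMH, loaded_dist)
```

The same three checks can be applied to the real matrices once they are loaded; a failed check would suggest that the saved files are stale or were generated with different settings.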

In [None]:
# Load pre-calculated matrices
distance_matrix = np.load('../data/processed/distance_matrix.npy')
travel_time_matrix = np.load('../data/processed/travel_time_matrix.npy')

print(f"Distance matrix shape: {distance_matrix.shape}")
print(f"Travel time matrix shape: {travel_time_matrix.shape}")

print(f"\nTravel Time Statistics (hours):")
travel_times = travel_time_matrix[travel_time_matrix > 0]
print(f"  - Min: {travel_times.min():.2f}")
print(f"  - Max: {travel_times.max():.2f}")
print(f"  - Mean: {travel_times.mean():.2f}")
print(f"  - Median: {np.median(travel_times):.2f}")

## Example: Find Closest Attractions

Let's find the 5 closest attractions to a specific location.
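
A nearest-neighbour lookup of this kind can also be wrapped in a small reusable helper. The sketch below uses a hypothetical function `k_nearest` and a made-up distance matrix; it removes the query index explicitly instead of assuming it sorts first, which matters when two attractions share the same coordinates and therefore have a distance of zero.

```python
import numpy as np


def k_nearest(distance_matrix, idx, k=5):
    """Return indices of the k attractions closest to idx, excluding idx itself."""
    row = np.asarray(distance_matrix[idx], dtype=float)
    order = np.argsort(row, kind="stable")  # ascending by distance
    order = order[order != idx]             # drop the query attraction explicitly
    return order[:k]


# Made-up distances: attractions 0 and 1 share a location (distance 0)
example = np.array([
    [0.0, 0.0, 5.0, 2.0],
    [0.0, 0.0, 5.0, 2.0],
    [5.0, 5.0, 0.0, 4.0],
    [2.0, 2.0, 4.0, 0.0],
])
nearest_to_1 = k_nearest(example, idx=1, k=2)
```

In this example, skipping only the first entry of the sorted order would have dropped attraction 0 and kept attraction 1 itself, because both sit at distance zero.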

In [None]:
# Use the first attraction in the dataset as the reference point
idx = 0
attraction = attractions.iloc[idx]

print(f"Finding attractions closest to: {attraction['name']}")
print(f"Category: {attraction['category']}")
print(f"Interest Score: {attraction['interest_score']}")
print(f"\nClosest 5 attractions:")

# Get distances from this attraction to all others
distances_from_idx = distance_matrix[idx]

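# Sort by distance and keep the 5 nearest, explicitly excluding the attraction itself
# (it is not guaranteed to sort first if another attraction is at distance 0)
order = np.argsort(distances_from_idx)
sorted_indices = order[order != idx][:5]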

for i, close_idx in enumerate(sorted_indices, 1):
    close_attraction = attractions.iloc[close_idx]
    distance = distances_from_idx[close_idx]
    travel_time = travel_time_matrix[idx, close_idx]
    print(f"{i}. {close_attraction['name']}")
    print(f"   Distance: {distance:.2f} km, Travel time: {travel_time:.2f} hours")
    print(f"   Category: {close_attraction['category']}, Score: {close_attraction['interest_score']}")
    print()

## Next Steps

The data is now ready for optimization! You can:

1. Use the Genetic Algorithm (see `02_Genetic_Algorithm_Implementation.ipynb`)
2. Use the MIP Model (see `03_MIP_Model_Benchmark.ipynb`)
3. Compare results and visualize tours (see `04_Results_and_Visualization.ipynb`)

## Re-running Preprocessing

If you need to regenerate the processed data, run the preprocessing script from the repository root:

```bash
python scripts/preprocess_data.py
```

This will:
- Clean and filter the raw data
- Generate synthetic features (categories, scores, durations)
- Calculate distance and travel time matrices
- Save all outputs to `data/processed/`
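
To give a feel for the cleaning step, here is a minimal sketch of selecting the core columns, dropping missing values, validating coordinates, and removing duplicates. It is not the code in `scripts/preprocess_data.py`: the function `clean_locations`, the bounding-box values, and the sample rows are all assumptions made for illustration, and the tourist-attraction filter is omitted because the raw column it relies on is not shown in this notebook.

```python
import pandas as pd

# Approximate bounding box for Sri Lanka (assumed values, not taken from the script)
LAT_MIN, LAT_MAX = 5.8, 9.9
LON_MIN, LON_MAX = 79.5, 82.0


def clean_locations(raw):
    """Keep core columns, drop incomplete, out-of-bounds, and duplicate rows."""
    core = raw[["name", "latitude", "longitude"]].dropna()
    in_bounds = (core["latitude"].between(LAT_MIN, LAT_MAX)
                 & core["longitude"].between(LON_MIN, LON_MAX))
    return core[in_bounds].drop_duplicates().reset_index(drop=True)


# Made-up raw rows: a duplicate, a missing latitude, a point outside
# Sri Lanka, and a missing name
raw = pd.DataFrame({
    "name": ["Sigiriya", "Sigiriya", "Unnamed", "Far away", None],
    "latitude": [7.957, 7.957, None, 48.85, 6.0],
    "longitude": [80.760, 80.760, 80.0, 2.35, 80.0],
})
cleaned = clean_locations(raw)
```

Only one of the five sample rows survives here: the duplicate is collapsed, and the rows with missing or out-of-bounds values are dropped.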