# TripX - Feature Engineering & Preprocessing

**Day 4: Transform EDA insights into ML-ready features**

This notebook demonstrates the feature engineering pipeline that transforms raw destination data into features optimized for our recommendation system.

## Objectives
1. Implement cost categorization based on EDA insights
2. Create duration compatibility scoring
3. Build seasonal matching logic
4. Engineer quality indicators
5. Prepare features for similarity-based recommendations

In [None]:
# Import libraries and our preprocessing module
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../src')

from prep import TripXPreprocessor, load_and_preprocess_data

print("Libraries and preprocessing module loaded!")

## 1. Load and Preprocess Data

In [None]:
# Load raw data and apply preprocessing pipeline
processed_df, preprocessor = load_and_preprocess_data('../data/raw/dest.csv')

# Columns present in the raw CSV; anything else was added by the pipeline
original_cols = ['destination', 'country', 'region', 'avg_cost_per_day', 'min_days', 'max_days',
                 'trip_type', 'season_best', 'popularity_score', 'safety_score', 'climate', 'activities']
new_cols = [col for col in processed_df.columns if col not in original_cols]

print(f"\nDataset shape: {processed_df.shape}")
print(f"New columns added: {len(new_cols)}")
print("\nNew feature columns:")
for col in new_cols:
    print(f"  - {col}")

## 2. Cost Categorization Analysis

In [None]:
# Analyze cost categories
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Cost category distribution
cost_counts = processed_df['cost_category'].value_counts()
axes[0].pie(cost_counts.values, labels=cost_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Cost Category Distribution')

# Cost category by trip type
cost_type_crosstab = pd.crosstab(processed_df['trip_type'], processed_df['cost_category'])
cost_type_crosstab.plot(kind='bar', stacked=True, ax=axes[1])
axes[1].set_title('Cost Categories by Trip Type')
axes[1].set_xlabel('Trip Type')
axes[1].set_ylabel('Number of Destinations')
axes[1].legend(title='Cost Category')
axes[1].tick_params(axis='x', labelrotation=45)

plt.tight_layout()
plt.show()

print("=== COST CATEGORIZATION RESULTS ===")
cost_summary = processed_df.groupby('cost_category')['avg_cost_per_day'].agg(['count', 'min', 'max', 'mean']).round(0)
print(cost_summary)

## 3. Quality Score Engineering

In [None]:
# Analyze engineered quality scores
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Quality score distribution
axes[0].hist(processed_df['quality_score'], bins=10, alpha=0.7, color='lightgreen', edgecolor='black')
axes[0].set_title('Quality Score Distribution')
axes[0].set_xlabel('Quality Score')
axes[0].set_ylabel('Number of Destinations')
axes[0].axvline(processed_df['quality_score'].mean(), color='red', linestyle='--', 
                label=f'Mean: {processed_df["quality_score"].mean():.2f}')
axes[0].legend()

# Quality vs original scores
scatter = axes[1].scatter(processed_df['popularity_score'], processed_df['quality_score'],
                          c=processed_df['safety_score'], s=60, alpha=0.7, cmap='viridis')
axes[1].set_xlabel('Popularity Score')
axes[1].set_ylabel('Quality Score (Engineered)')
axes[1].set_title('Quality Score vs Popularity (Color = Safety)')
fig.colorbar(scatter, ax=axes[1], label='Safety Score')

plt.tight_layout()
plt.show()

print("=== TOP 5 DESTINATIONS BY QUALITY SCORE ===")
top_quality = processed_df.nlargest(5, 'quality_score')[['destination', 'popularity_score', 'safety_score', 'quality_score']]
print(top_quality.round(2))

## 4. Duration Compatibility Testing

In [None]:
# Test duration compatibility function
test_cases = [
    (5, 3, 7, "Perfect fit"),
    (2, 3, 7, "1 day short"),
    (8, 3, 7, "1 day over"),
    (1, 3, 7, "2 days short"),
    (10, 3, 7, "3 days over")
]

print("=== DURATION COMPATIBILITY TESTING ===")
for user_days, min_days, max_days, description in test_cases:
    score = preprocessor.calculate_duration_compatibility(user_days, min_days, max_days)
    print(f"{description}: User wants {user_days} days, destination optimal {min_days}-{max_days} days → Score: {score}")

# Apply to sample destinations
print("\n=== DURATION COMPATIBILITY FOR 5-DAY TRIP ===")
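# .copy() gives an independent frame, so adding a column below does not trigger SettingWithCopyWarning
sample_destinations = processed_df[['destination', 'min_days', 'max_days', 'trip_type']].head(8).copy()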
sample_destinations['duration_compatibility'] = sample_destinations.apply(
    lambda row: preprocessor.calculate_duration_compatibility(5, row['min_days'], row['max_days']), axis=1
)
print(sample_destinations.sort_values('duration_compatibility', ascending=False))

## 5. Trip Type Encoding Analysis

In [None]:
# Analyze trip type encoding
trip_type_cols = [col for col in processed_df.columns if col.startswith('type_')]

print("=== TRIP TYPE ONE-HOT ENCODING ===")
print("Sample of encoded trip types:")
sample_encoding = processed_df[['destination', 'trip_type'] + trip_type_cols].head(8)
print(sample_encoding)

# Verify encoding correctness: each type_* column should be 1 for its own trip type and 0 elsewhere
print("\n=== ENCODING VERIFICATION ===")
for trip_type in processed_df['trip_type'].unique():
    encoded_col = f'type_{trip_type}'
    if encoded_col not in processed_df.columns:
        print(f"{trip_type}: column '{encoded_col}' not found")
        continue
    expected = (processed_df['trip_type'] == trip_type).astype(int)
    correct_encoding = (processed_df[encoded_col].astype(int) == expected).all()
    print(f"{trip_type}: Correctly encoded = {correct_encoding}")

## 6. Feature Normalization Analysis

In [None]:
# Compare original vs normalized features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Cost: original vs normalized
axes[0,0].hist(processed_df['avg_cost_per_day'], bins=10, alpha=0.7, label='Original', color='lightblue')
axes[0,0].set_title('Original Cost Distribution')
axes[0,0].set_xlabel('Cost per Day ($)')

axes[0,1].hist(processed_df['avg_cost_per_day_norm'], bins=10, alpha=0.7, label='Normalized', color='lightcoral')
axes[0,1].set_title('Normalized Cost Distribution')
axes[0,1].set_xlabel('Normalized Cost (0-1)')

# Quality: original vs normalized
axes[1,0].hist(processed_df['quality_score'], bins=10, alpha=0.7, label='Original', color='lightgreen')
axes[1,0].set_title('Original Quality Score Distribution')
axes[1,0].set_xlabel('Quality Score')

axes[1,1].hist(processed_df['quality_score_norm'], bins=10, alpha=0.7, label='Normalized', color='gold')
axes[1,1].set_title('Normalized Quality Score Distribution')
axes[1,1].set_xlabel('Normalized Quality (0-1)')

plt.tight_layout()
plt.show()

print("=== NORMALIZATION STATISTICS ===")
norm_cols = [col for col in processed_df.columns if col.endswith('_norm')]
for col in norm_cols[:4]:  # Show first 4 normalized features
    original_col = col[:-len('_norm')]  # strip only the trailing suffix
    print(f"\n{original_col}:")
    print(f"  Original range: {processed_df[original_col].min():.1f} - {processed_df[original_col].max():.1f}")
    print(f"  Normalized range: {processed_df[col].min():.3f} - {processed_df[col].max():.3f}")

## 7. User Profile Feature Engineering

In [None]:
# Test user profile creation
print("=== USER PROFILE FEATURE ENGINEERING ===")

# Test different user profiles
test_users = [
    {"name": "Budget Backpacker", "budget": 45, "duration": 10, "trip_type": "culture", "season": "spring"},
    {"name": "Luxury Traveler", "budget": 250, "duration": 5, "trip_type": "luxury", "season": "winter"},
    {"name": "Beach Lover", "budget": 80, "duration": 7, "trip_type": "beach", "season": "summer"},
    {"name": "City Explorer", "budget": 120, "duration": 4, "trip_type": "urban", "season": "fall"}
]

for user in test_users:
    profile = preprocessor.create_user_profile_features(
        budget=user["budget"], 
        duration=user["duration"], 
        trip_type=user["trip_type"], 
        season=user["season"]
    )
    
    print(f"\n{user['name']}:")
    print(f"  Budget: ${profile['budget']} ({profile['cost_category']} category)")
    print(f"  Duration: {profile['duration']} days")
    print(f"  Trip Type: {profile['preferred_trip_type']}")
    print(f"  Season: {profile['preferred_season']}")
    
    # Show trip type encoding
    trip_encoding = {k: v for k, v in profile.items() if k.startswith('type_') and v == 1}
    print(f"  Encoded preference: {list(trip_encoding.keys())}")

## 8. Feature Engineering Summary

### Features Created for ML Recommendation System

**1. Cost Features**
- `cost_category`: Budget tiers (budget/mid/premium/luxury) for filtering
- `avg_cost_per_day_norm`: Normalized cost for similarity calculations
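
As a rough illustration of the tiering idea, the sketch below maps a daily cost to one of the four tiers. The dollar boundaries are hypothetical placeholders, and `categorize_cost` is an illustrative helper rather than a method of `TripXPreprocessor`; the actual cut-offs come from the EDA and live in the `prep` module.

```python
from bisect import bisect_right

# Hypothetical tier boundaries in dollars per day (assumption, not the
# project's real thresholds).
TIER_UPPER_BOUNDS = [50, 100, 200]
TIER_LABELS = ["budget", "mid", "premium", "luxury"]


def categorize_cost(avg_cost_per_day: float) -> str:
    """Return the tier whose half-open interval [lower, upper) contains the cost."""
    return TIER_LABELS[bisect_right(TIER_UPPER_BOUNDS, avg_cost_per_day)]


for cost in (45, 80, 120, 250):
    print(cost, categorize_cost(cost))
```

Keeping the boundaries in a single sorted list means a tier can be added or moved without touching the lookup logic.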

**2. Quality Features**
- `quality_score`: Weighted combination of popularity (60%) + safety (40%)
- `quality_score_norm`: Normalized quality for ranking
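
The 60/40 weighting can be written out in a few lines. This is a minimal sketch that assumes both input scores share the same scale; the function name and the example values are illustrative only.

```python
POPULARITY_WEIGHT = 0.6
SAFETY_WEIGHT = 0.4


def quality_score(popularity_score: float, safety_score: float) -> float:
    """Weighted blend of popularity (60%) and safety (40%).

    Assumes both scores are on the same scale; otherwise they would need
    to be rescaled before being combined.
    """
    return POPULARITY_WEIGHT * popularity_score + SAFETY_WEIGHT * safety_score


# Example values are made up for illustration
print(round(quality_score(8.0, 9.0), 2))
```

Because the weights sum to 1, the blended score stays within the range of the two inputs.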

**3. Duration Features**
- `duration_range`: Flexibility in trip length (max - min days)
- `duration_flexibility`: Relative flexibility (range / max_days)
- Duration compatibility function for user matching
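
The two derived columns follow directly from the definitions above. The compatibility score is sketched under an assumption: a full score inside the destination's range and a linear penalty per day outside it. The penalty of 0.25 per day is a hypothetical value, so the numbers will not necessarily match what `preprocessor.calculate_duration_compatibility` returns.

```python
def duration_features(min_days: int, max_days: int) -> dict:
    """Compute duration_range and duration_flexibility as defined above."""
    duration_range = max_days - min_days
    return {
        "duration_range": duration_range,
        "duration_flexibility": duration_range / max_days,
    }


def duration_compatibility(user_days: int, min_days: int, max_days: int,
                           penalty_per_day: float = 0.25) -> float:
    """Score 1.0 inside [min_days, max_days], decaying linearly outside.

    penalty_per_day is a hypothetical parameter chosen for illustration.
    """
    if min_days <= user_days <= max_days:
        return 1.0
    gap = min_days - user_days if user_days < min_days else user_days - max_days
    return max(0.0, 1.0 - penalty_per_day * gap)


print(duration_features(3, 7))
for days in (5, 2, 8, 1, 10):
    print(days, duration_compatibility(days, 3, 7))
```

A linear decay keeps near misses in play (one day short still scores well) while letting large mismatches fall to zero.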

**4. Preference Features**
- `type_*`: One-hot encoded trip types for preference matching
- Season matching function for temporal preferences
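
A plain-Python sketch of both ideas follows. The helper names are illustrative, and the format assumed for `season_best` (a comma-separated string of seasons, or the word "all") is a guess, since the raw column format is not shown in this notebook.

```python
def one_hot_trip_type(trip_type: str, known_types: list) -> dict:
    """Build type_* indicator features: 1 for the matching type, 0 otherwise."""
    return {f"type_{t}": int(t == trip_type) for t in known_types}


def season_match(preferred_season: str, season_best: str) -> float:
    """Return 1.0 if the preferred season is among the destination's best seasons.

    Assumes season_best is a comma-separated string such as "spring, summer"
    or the literal "all" (hypothetical format).
    """
    best = {s.strip().lower() for s in season_best.split(",")}
    if "all" in best:
        return 1.0
    return 1.0 if preferred_season.strip().lower() in best else 0.0


print(one_hot_trip_type("beach", ["beach", "culture", "urban"]))
print(season_match("summer", "spring, summer"))
```

Encoding the user's preferred trip type with the same `type_*` keys as the destinations lets the two vectors be compared directly.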

**5. Normalized Features**
- All numerical features normalized to 0-1 scale for fair similarity calculations
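
One common way to reach a 0-1 scale is min-max scaling, sketched below in plain Python. Whether the preprocessing module uses exactly this formula is an assumption; the function name and the sample values are illustrative.

```python
def min_max_normalize(values: list) -> list:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 to avoid
        # dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up daily costs for illustration
print(min_max_normalize([50, 100, 150]))
```

Without rescaling, a feature measured in dollars would dominate a distance or similarity measure over features that already sit between 0 and 1.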

### ML Strategy Implementation Ready

- **Filtering Logic**: Cost categories and duration compatibility
- **Scoring Logic**: Trip type matching and seasonal preferences
- **Ranking Logic**: Quality scores and normalized features
- **Explainability**: Each feature has clear business meaning

### Next Steps - Day 5
Build the recommendation algorithm using these engineered features:
1. Multi-stage filtering (budget, duration)
2. Similarity scoring (preferences, features)
3. Quality-based ranking
4. Explainable recommendations with reasoning
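
To make the four steps concrete, here is a toy sketch of how they might fit together over a list of destination dictionaries. Everything in it is an assumption for illustration: the `recommend` function, the hard budget cut-off (rather than cost categories), the score weights of 1.0 and 0.5, and the made-up rows. The Day 5 implementation may look quite different.

```python
def recommend(destinations, budget, duration, trip_type, season, top_n=3):
    """Toy three-stage recommender: filter, score, rank."""
    # Stage 1: hard filters on budget and trip length
    candidates = [
        d for d in destinations
        if d["avg_cost_per_day"] <= budget
        and d["min_days"] <= duration <= d["max_days"]
    ]

    # Stage 2: preference score, recording a reason for every point awarded
    scored = []
    for d in candidates:
        score, reasons = 0.0, ["within budget", "duration fits"]
        if d["trip_type"] == trip_type:
            score += 1.0
            reasons.append(f"matches trip type '{trip_type}'")
        if d["season_best"] == season:
            score += 0.5
            reasons.append(f"best in {season}")
        scored.append({"destination": d["destination"], "score": score,
                       "quality_score": d["quality_score"], "reasons": reasons})

    # Stage 3: rank by preference score, breaking ties with quality
    scored.sort(key=lambda r: (r["score"], r["quality_score"]), reverse=True)
    return scored[:top_n]


# Made-up rows for illustration only; they are not taken from dest.csv
toy_destinations = [
    {"destination": "A", "avg_cost_per_day": 60, "min_days": 3, "max_days": 7,
     "trip_type": "beach", "season_best": "summer", "quality_score": 8.0},
    {"destination": "B", "avg_cost_per_day": 70, "min_days": 4, "max_days": 10,
     "trip_type": "beach", "season_best": "winter", "quality_score": 9.0},
    {"destination": "C", "avg_cost_per_day": 300, "min_days": 3, "max_days": 7,
     "trip_type": "beach", "season_best": "summer", "quality_score": 9.5},
]

results = recommend(toy_destinations, budget=80, duration=7,
                    trip_type="beach", season="summer")
for r in results:
    print(r["destination"], r["score"], "; ".join(r["reasons"]))
```

Carrying the list of reasons alongside each score is what makes the final recommendations explainable: every ranked destination can state why it was kept and why it scored as it did.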