# TripX - Exploratory Data Analysis

**Day 3: Deep Dive into Feature Relationships**

This notebook explores patterns and relationships in our travel destinations dataset to inform our ML recommendation strategy.

## Objectives
1. Analyze cost patterns across regions and trip types
2. Understand trip duration preferences by destination type
3. Explore seasonal patterns and preferences
4. Investigate safety vs popularity relationships
5. Identify key insights for recommendation logic

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

# Load data
df = pd.read_csv('../data/raw/dest.csv')
print(f"Dataset loaded: {df.shape[0]} destinations, {df.shape[1]} features")

## 1. Cost Analysis - Key for Budget Matching

In [None]:
# Cost distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Overall cost distribution
axes[0].hist(df['avg_cost_per_day'], bins=8, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Cost Distribution Across All Destinations')
axes[0].set_xlabel('Average Cost per Day ($)')
axes[0].set_ylabel('Number of Destinations')
axes[0].axvline(df['avg_cost_per_day'].mean(), color='red', linestyle='--', 
                label=f'Mean: ${df["avg_cost_per_day"].mean():.0f}')
axes[0].legend()

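# Cost by region
df.boxplot(column='avg_cost_per_day', by='region', ax=axes[1])
fig.suptitle('')  # clear the "Boxplot grouped by region" title that pandas adds automatically
axes[1].set_title('Cost Distribution by Region')
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Average Cost per Day ($)')
plt.setp(axes[1].xaxis.get_majorticklabels(), rotation=45)  # rotate labels on this axes explicitly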

plt.tight_layout()
plt.show()

# Cost statistics by region
print("=== COST ANALYSIS BY REGION ===")
cost_by_region = df.groupby('region')['avg_cost_per_day'].agg(['mean', 'min', 'max', 'count']).round(0)
print(cost_by_region)

In [None]:
# Cost by trip type - Critical for recommendation logic
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='trip_type', y='avg_cost_per_day')
plt.title('Cost Distribution by Trip Type')
plt.xlabel('Trip Type')
plt.ylabel('Average Cost per Day ($)')
plt.xticks(rotation=45)
plt.show()

print("=== COST ANALYSIS BY TRIP TYPE ===")
cost_by_type = df.groupby('trip_type')['avg_cost_per_day'].agg(['mean', 'min', 'max', 'count']).round(0)
print(cost_by_type)

## 2. Trip Duration Patterns

In [None]:
# Duration analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Min days by trip type
df.boxplot(column='min_days', by='trip_type', ax=axes[0,0])
axes[0,0].set_title('Minimum Days by Trip Type')
axes[0,0].set_xlabel('Trip Type')
axes[0,0].set_ylabel('Minimum Days')
plt.setp(axes[0,0].xaxis.get_majorticklabels(), rotation=45)

# Max days by trip type
df.boxplot(column='max_days', by='trip_type', ax=axes[0,1])
axes[0,1].set_title('Maximum Days by Trip Type')
axes[0,1].set_xlabel('Trip Type')
axes[0,1].set_ylabel('Maximum Days')
plt.setp(axes[0,1].xaxis.get_majorticklabels(), rotation=45)

# pandas adds a figure-level "Boxplot grouped by trip_type" title; clear it
fig.suptitle('')

# Duration range (max - min)
df['duration_range'] = df['max_days'] - df['min_days']
sns.barplot(data=df, x='trip_type', y='duration_range', ax=axes[1,0])
axes[1,0].set_title('Duration Flexibility by Trip Type')
axes[1,0].set_xlabel('Trip Type')
axes[1,0].set_ylabel('Duration Range (days)')
plt.setp(axes[1,0].xaxis.get_majorticklabels(), rotation=45)

# Cost vs duration relationship
axes[1,1].scatter(df['avg_cost_per_day'], df['min_days'], alpha=0.7, s=60)
axes[1,1].set_title('Cost vs Minimum Duration')
axes[1,1].set_xlabel('Average Cost per Day ($)')
axes[1,1].set_ylabel('Minimum Days')

plt.tight_layout()
plt.show()

print("=== DURATION ANALYSIS ===")
duration_stats = df.groupby('trip_type')[['min_days', 'max_days', 'duration_range']].mean().round(1)
print(duration_stats)

## 3. Seasonal Patterns - Important for Timing Recommendations

In [None]:
# Seasonal analysis
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Season distribution
season_counts = df['season_best'].value_counts()
axes[0].pie(season_counts.values, labels=season_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Best Season Distribution')

# Season by trip type
season_type = pd.crosstab(df['trip_type'], df['season_best'])
season_type.plot(kind='bar', stacked=True, ax=axes[1])
axes[1].set_title('Season Preferences by Trip Type')
axes[1].set_xlabel('Trip Type')
axes[1].set_ylabel('Number of Destinations')
axes[1].legend(title='Best Season')
plt.setp(axes[1].xaxis.get_majorticklabels(), rotation=45)

plt.tight_layout()
plt.show()

print("=== SEASONAL PATTERNS ===")
print("Season by Trip Type:")
print(season_type)

## 4. Quality Indicators - Popularity vs Safety

In [None]:
# Quality analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Popularity vs Safety scatter
scatter = axes[0,0].scatter(df['safety_score'], df['popularity_score'], 
                           c=df['avg_cost_per_day'], s=80, alpha=0.7, cmap='viridis')
axes[0,0].set_xlabel('Safety Score')
axes[0,0].set_ylabel('Popularity Score')
axes[0,0].set_title('Popularity vs Safety (Color = Cost)')
plt.colorbar(scatter, ax=axes[0,0], label='Cost per Day ($)')

# Label standout destinations (popularity or safety above 9.0)
standouts = df[(df['popularity_score'] > 9.0) | (df['safety_score'] > 9.0)]
for _, row in standouts.iterrows():
    axes[0,0].annotate(row['destination'], (row['safety_score'], row['popularity_score']),
                       xytext=(5, 5), textcoords='offset points', fontsize=8)

# Popularity by trip type
sns.boxplot(data=df, x='trip_type', y='popularity_score', ax=axes[0,1])
axes[0,1].set_title('Popularity by Trip Type')
axes[0,1].set_xlabel('Trip Type')
plt.setp(axes[0,1].xaxis.get_majorticklabels(), rotation=45)

# Safety by region
sns.boxplot(data=df, x='region', y='safety_score', ax=axes[1,0])
axes[1,0].set_title('Safety by Region')
axes[1,0].set_xlabel('Region')
plt.setp(axes[1,0].xaxis.get_majorticklabels(), rotation=45)

# Cost vs Quality (combined score)
df['quality_score'] = (df['popularity_score'] + df['safety_score']) / 2
axes[1,1].scatter(df['avg_cost_per_day'], df['quality_score'], alpha=0.7, s=60)
axes[1,1].set_xlabel('Average Cost per Day ($)')
axes[1,1].set_ylabel('Combined Quality Score')
axes[1,1].set_title('Cost vs Quality Relationship')

plt.tight_layout()
plt.show()

# Correlation analysis
print("=== CORRELATION ANALYSIS ===")
corr_cols = ['avg_cost_per_day', 'popularity_score', 'safety_score', 'min_days', 'max_days']
correlation_matrix = df[corr_cols].corr().round(3)
print(correlation_matrix)

## 5. Geographic and Climate Patterns

In [None]:
# Geographic analysis
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Region distribution
region_counts = df['region'].value_counts()
axes[0].bar(region_counts.index, region_counts.values, color='lightcoral')
axes[0].set_title('Destinations by Region')
axes[0].set_xlabel('Region')
axes[0].set_ylabel('Number of Destinations')
plt.setp(axes[0].xaxis.get_majorticklabels(), rotation=45)

# Climate distribution
climate_counts = df['climate'].value_counts()
axes[1].pie(climate_counts.values, labels=climate_counts.index, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Climate Distribution')

plt.tight_layout()
plt.show()

# Climate vs trip type analysis
print("=== CLIMATE vs TRIP TYPE ===")
climate_type = pd.crosstab(df['climate'], df['trip_type'])
print(climate_type)

## 6. Key EDA Insights for ML Strategy

### Cost Patterns (Budget Matching)
- **Europe**: Most expensive region (avg $108/day), with a wide range of $80-200/day
- **Asia**: Most budget-friendly region (avg $70/day), ranging from $40 to $100/day
- **Luxury trips**: Highest cost ($250/day avg), followed by urban ($145/day avg)
- **Culture/Beach trips**: More affordable ($70-80/day avg)

### Duration Preferences
- **Beach destinations**: Longest stays (7-14 days typical)
- **Urban/Culture**: Shorter stays (3-7 days typical)
- **Luxury**: Shorter but flexible (3-6 days)
- **Nature**: Most flexible duration ranges

### Seasonal Insights
- **Spring**: Most common best season (35% of destinations)
- **Summer**: Second most common (25%)
- **Beach destinations**: Best in `dry_season` or summer
- **Culture destinations**: Best in spring or summer

### Quality Indicators
- **Safety-Popularity correlation**: Moderate positive (r ≈ 0.3), so safer destinations tend to be somewhat more popular
- **High performers**: Paris, Tokyo, New York (9+ popularity)
- **Safest**: Maldives, Iceland, Reykjavik (9+ safety)
- **Cost-Quality**: Weak correlation - good value exists across price ranges
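
The correlation figures above can be spot-checked one pair at a time. The following is a minimal sketch, assuming the same column names as the notebook; the five rows are made-up illustration data, not values from `dest.csv`, and `pairwise_corr` is a hypothetical helper. It also computes a rank-based (Spearman) value, which may be more robust on a small dataset with outliers.

```python
import pandas as pd

# Made-up illustration rows; NOT values from dest.csv
toy = pd.DataFrame({
    "destination": ["A", "B", "C", "D", "E"],
    "safety_score": [6.0, 7.0, 8.0, 9.0, 9.5],
    "popularity_score": [7.5, 6.5, 8.5, 8.0, 9.0],
})


def pairwise_corr(frame, col_a, col_b):
    """Pearson correlation between two columns (hypothetical helper)."""
    return frame[col_a].corr(frame[col_b])


# Pearson correlation on the raw scores
r_pearson = pairwise_corr(toy, "safety_score", "popularity_score")

# Spearman correlation is the Pearson correlation of the ranks
ranked = toy[["safety_score", "popularity_score"]].rank()
r_spearman = pairwise_corr(ranked, "safety_score", "popularity_score")

print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {r_spearman:.2f}")
```

If the two values diverge noticeably on the real data, a few extreme destinations are probably driving the Pearson figure.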

### ML Recommendation Logic Implications

1. **Budget Filtering**: Use cost ranges by trip type for realistic matching
2. **Duration Matching**: Consider trip type when validating duration preferences
3. **Seasonal Scoring**: Weight destinations higher when user season matches best season
4. **Quality Ranking**: Use the combined `quality_score` (mean of popularity and safety, computed in Section 4) for final ranking
5. **Geographic Diversity**: Ensure recommendations span different regions when possible
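
The first four points above could be combined roughly as follows. This is a sketch under stated assumptions rather than the planned Day 5 implementation: `recommend`, its parameters, and the additive `season_bonus` weight are all hypothetical, and the rows are made-up illustration data. Geographic diversity is left out to keep the example short.

```python
import pandas as pd

# Made-up illustration rows; NOT values from dest.csv
toy = pd.DataFrame({
    "destination": ["A", "B", "C", "D"],
    "avg_cost_per_day": [60, 75, 140, 90],
    "season_best": ["summer", "spring", "summer", "spring"],
    "popularity_score": [8.0, 9.0, 7.0, 8.5],
    "safety_score": [9.0, 8.0, 8.0, 7.5],
})


def recommend(frame, max_cost_per_day, season, top_n=3, season_bonus=1.0):
    """Filter by budget, then rank by quality plus a season bonus (hypothetical)."""
    # Stage 1: hard budget filter
    candidates = frame[frame["avg_cost_per_day"] <= max_cost_per_day].copy()

    # Stage 2: combined quality score, same formula as in Section 4
    candidates["quality_score"] = (
        candidates["popularity_score"] + candidates["safety_score"]
    ) / 2

    # Stage 3: add a bonus when the user's season matches the best season
    candidates["season_match"] = (candidates["season_best"] == season).astype(int)
    candidates["score"] = (
        candidates["quality_score"] + season_bonus * candidates["season_match"]
    )

    # Highest score first
    return candidates.sort_values("score", ascending=False).head(top_n)


result = recommend(toy, max_cost_per_day=100, season="summer")
print(result[["destination", "quality_score", "season_match", "score"]])
```

Keeping the intermediate columns (`quality_score`, `season_match`) in the output makes each ranking easy to explain, which matches the explainability goal for Day 5.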

### Feature Engineering Priorities
1. **Cost categories**: Budget (<$60), Mid ($60-120), Premium ($120-200), Luxury (>$200)
2. **Duration flexibility**: Calculate overlap between user preference and destination range
3. **Season matching**: Binary feature for season compatibility
4. **Quality tiers**: Combine popularity and safety into quality categories
5. **Trip type encoding**: One-hot encoding for preference matching
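
Items 1, 2, 3, and 5 above can be sketched as below. The function name `build_features`, its parameters, and the toy rows are hypothetical. Two assumptions are made where the list leaves room for interpretation: a cost of exactly $120 is assigned to Premium, and duration overlap is counted in inclusive days.

```python
import numpy as np
import pandas as pd

# Made-up illustration rows; NOT values from dest.csv
toy = pd.DataFrame({
    "destination": ["A", "B", "C", "D"],
    "trip_type": ["beach", "culture", "luxury", "urban"],
    "avg_cost_per_day": [45, 60, 250, 120],
    "min_days": [7, 3, 3, 2],
    "max_days": [14, 7, 6, 5],
    "season_best": ["summer", "spring", "winter", "spring"],
})


def build_features(frame, user_min_days, user_max_days, user_season):
    """Derive the engineered features listed above (hypothetical helper)."""
    out = frame.copy()
    cost = out["avg_cost_per_day"]

    # 1. Cost categories; assumption: exactly $120 counts as Premium
    out["cost_category"] = np.select(
        [cost < 60, cost < 120, cost <= 200],
        ["Budget", "Mid", "Premium"],
        default="Luxury",
    )

    # 2. Duration overlap in inclusive days between the user's range
    #    and the destination's [min_days, max_days] range
    start = out["min_days"].clip(lower=user_min_days)
    end = out["max_days"].clip(upper=user_max_days)
    out["duration_overlap"] = (end - start + 1).clip(lower=0)

    # 3. Binary season match
    out["season_match"] = (out["season_best"] == user_season).astype(int)

    # 5. One-hot encode trip type
    return pd.get_dummies(out, columns=["trip_type"], prefix="type", dtype=int)


features = build_features(toy, user_min_days=5, user_max_days=8, user_season="spring")
print(features.drop(columns=["min_days", "max_days"]))
```

An overlap of 0 means the user's trip length cannot be satisfied by that destination, so it can double as a hard filter before scoring.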

## 7. Next Steps

**Day 4 - Feature Engineering**: Based on EDA insights
- Create cost categories and duration compatibility functions
- Implement season matching logic
- Build quality scoring system
- Encode categorical features for ML

**Day 5 - ML Logic**: Recommendation algorithm
- Multi-stage filtering and scoring system
- Explainable ranking with clear reasoning
- Handle edge cases and user preference variations