# TripX - Travel Recommendation System
## 01. Data Overview & Initial Analysis

**Goal**: Load and understand our destination dataset for building ML-based travel recommendations.

**Key Questions**:
- What destinations do we have?
- What features can we use for recommendations?
- Are there any data quality issues?
- What patterns do we see initially?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries loaded successfully!")

## 1. Load Dataset

In [None]:
# Load the destinations dataset
df = pd.read_csv('../data/raw/dest.csv')

print(f"Dataset loaded: {df.shape[0]} destinations, {df.shape[1]} features")
print("\nFirst few rows:")
df.head()

## 2. Basic Dataset Information

In [None]:
# Dataset info
print("=== DATASET INFO ===")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\n=== MISSING VALUES ===")
print(df.isnull().sum())

print("\n=== DATA TYPES ===")
print(df.dtypes)

## 3. Feature Analysis

In [None]:
# Numerical features summary
numerical_cols = ['avg_cost_per_day', 'min_days', 'max_days', 'popularity_score', 'safety_score']

print("=== NUMERICAL FEATURES SUMMARY ===")
df[numerical_cols].describe()

In [None]:
# Categorical features analysis
categorical_cols = ['country', 'region', 'trip_type', 'season_best', 'climate']

print("=== CATEGORICAL FEATURES ANALYSIS ===")
for col in categorical_cols:
    print(f"\n{col.upper()}:")
    print(df[col].value_counts())

## 4. Key Insights for ML Model

### Features Available for Recommendation:
1. **Budget-related**: `avg_cost_per_day`
2. **Duration-related**: `min_days`, `max_days`
3. **Preference-related**: `trip_type`, `season_best`, `climate`
4. **Quality metrics**: `popularity_score`, `safety_score`
5. **Location**: `region`, `country`
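
Before relying on these groups, it can help to confirm that every listed column is actually present in the loaded data. The following is a minimal sketch: the names `FEATURE_GROUPS` and `missing_features` are illustrative and not part of the original notebook, and the column names are taken from the list above.

```python
import pandas as pd

# Feature groups as listed above; the dict name is illustrative only.
FEATURE_GROUPS = {
    "budget": ["avg_cost_per_day"],
    "duration": ["min_days", "max_days"],
    "preference": ["trip_type", "season_best", "climate"],
    "quality": ["popularity_score", "safety_score"],
    "location": ["region", "country"],
}


def missing_features(frame: pd.DataFrame) -> dict:
    """Return, per group, the expected columns that are absent from `frame`."""
    missing = {}
    for group, cols in FEATURE_GROUPS.items():
        absent = [c for c in cols if c not in frame.columns]
        if absent:
            missing[group] = absent
    return missing
```

Calling `missing_features(df)` on the loaded dataset should return an empty dict if all of the columns above exist.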

### Initial Observations:

In [None]:
# Key insights
print("=== KEY INSIGHTS ===")
print(f"• Cost range: ${df['avg_cost_per_day'].min()} - ${df['avg_cost_per_day'].max()} per day")
print(f"• Trip duration: {df['min_days'].min()}-{df['max_days'].max()} days")
print(f"• Most common trip type: {df['trip_type'].mode()[0]}")
print(f"• Most common region: {df['region'].mode()[0]}")
print(f"• Average popularity score: {df['popularity_score'].mean():.1f}/10")
print(f"• Average safety score: {df['safety_score'].mean():.1f}/10")

print("\n=== DATA QUALITY ===")
print(f"• No missing values: {df.isnull().sum().sum() == 0}")
print(f"• All destinations unique: {df['destination'].nunique() == len(df)}")
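
Missing values and duplicate names are not the only possible quality issues. The sketch below counts rows that break a few basic consistency rules. The rules themselves are assumptions rather than documented constraints of the dataset; in particular, the 0 to 10 score range is inferred from the `/10` in the printouts above, and `quality_report` is a hypothetical helper name.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Count rows violating a few assumed consistency rules."""
    scores = frame[["popularity_score", "safety_score"]]
    # Assumed scale: both scores lie in [0, 10]. NaN compares as False,
    # so rows with a missing score are counted as out of range.
    in_range = scores.ge(0) & scores.le(10)
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "min_days_gt_max_days": int((frame["min_days"] > frame["max_days"]).sum()),
        "non_positive_cost": int((frame["avg_cost_per_day"] <= 0).sum()),
        "score_out_of_range": int((~in_range.all(axis=1)).sum()),
    }
```

Running `quality_report(df)` returns a dict of counts; any non-zero entry points to rows worth inspecting before modelling.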

## 5. Next Steps for ML Development

Based on this initial analysis, the plan for our ML recommendation system is:

1. **Use these features for matching**:
   - Budget compatibility (`avg_cost_per_day`)
   - Duration fit (`min_days`, `max_days`)
   - Trip type preference (`trip_type`)
   - Seasonal preference (`season_best`)
   - Quality scores (`popularity_score`, `safety_score`)

2. **Recommendation approach**:
   - Score-based ranking system
   - Weight different features based on user priorities
   - Ensure explainability for each recommendation

3. **Next notebook (02_eda.ipynb)**:
   - Deeper feature relationships
   - Distribution analysis
   - Feature correlation study
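
The score-based ranking idea in step 2 can be sketched as a weighted sum of per-feature match scores. This is only one possible design under stated assumptions, not the final TripX implementation: the weights in `DEFAULT_WEIGHTS`, the function name `score_destinations`, the linear budget penalty, and the 0 to 10 score scale are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical weights; the notebook does not specify any values.
DEFAULT_WEIGHTS = {
    "budget": 0.3,
    "duration": 0.2,
    "trip_type": 0.2,
    "season": 0.1,
    "quality": 0.2,
}


def score_destinations(frame, budget_per_day, days, trip_type, season, weights=None):
    """Rank destinations by a weighted sum of match scores, each in [0, 1].

    `budget_per_day` must be positive.
    """
    w = pd.Series(weights or DEFAULT_WEIGHTS)
    parts = pd.DataFrame(index=frame.index)

    # Budget: 1 when within budget, falling linearly to 0 at twice the budget.
    over = (frame["avg_cost_per_day"] - budget_per_day).clip(lower=0)
    parts["budget"] = (1 - over / budget_per_day).clip(lower=0)

    # Duration: 1 when the requested length lies inside [min_days, max_days].
    fits = frame["min_days"].le(days) & frame["max_days"].ge(days)
    parts["duration"] = fits.astype(float)

    # Exact-match preferences.
    parts["trip_type"] = frame["trip_type"].eq(trip_type).astype(float)
    parts["season"] = frame["season_best"].eq(season).astype(float)

    # Quality: mean of the two scores, assuming a 0-10 scale.
    parts["quality"] = frame[["popularity_score", "safety_score"]].mean(axis=1) / 10

    # Multiply each column by its weight (aligned on column name).
    weighted = parts * w
    out = frame[["destination"]].join(weighted.add_prefix("contrib_"))
    out["score"] = weighted.sum(axis=1)
    return out.sort_values("score", ascending=False)
```

The `contrib_*` columns keep each feature's weighted contribution next to the total, which is one simple way to support the explainability goal: a recommendation can be justified by pointing at the features that contributed most to its score.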