# TripX - Data Overview

A quick look at our travel destinations dataset: its shape, column types, missing values, and basic statistics.

## Goal
Build a recommendation system that matches users with destinations based on their budget, trip length, and travel preferences.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so display() also works outside a notebook

pd.set_option('display.max_columns', None)

print("Ready to go!")

## 1. Data Loading

In [None]:
# Load the destinations dataset
df = pd.read_csv('../data/raw/dest.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

## 2. Basic Data Inspection

In [None]:
# First look at the data
print("=== FIRST 5 ROWS ===")
display(df.head())

print("\n=== LAST 5 ROWS ===")
display(df.tail())

In [None]:
# Data types and info
print("=== DATA TYPES & INFO ===")
df.info()

print("\n=== MISSING VALUES ===")
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0] if missing_data.sum() > 0 else "No missing values found!")

## 3. Statistical Summary

In [None]:
# Numerical features summary
print("=== NUMERICAL FEATURES SUMMARY ===")
numerical_cols = ['avg_cost_per_day', 'min_days', 'max_days', 'popularity_score', 'safety_score']
display(df[numerical_cols].describe())

In [None]:
# Categorical features summary
print("=== CATEGORICAL FEATURES SUMMARY ===")
categorical_cols = ['country', 'region', 'trip_type', 'season_best', 'climate']

for col in categorical_cols:
    print(f"\n{col.upper()}:")
    print(df[col].value_counts())

## 4. Key Insights for ML Strategy

### Data Quality Assessment
- **Dataset Size**: 20 destinations across 6 continents
- **Completeness**: No missing values detected
- **Feature Types**: Mix of numerical (cost, scores) and categorical (region, type) features

### Feature Analysis for Recommendation Logic

**Primary Matching Features:**
1. **Budget Compatibility**: `avg_cost_per_day` (range: $35-$300 per day)
2. **Trip Duration**: `min_days` and `max_days` for duration matching
3. **Trip Type Preference**: `trip_type` (culture, beach, urban, luxury, nature)
4. **Seasonal Preference**: `season_best` for timing optimization

**Quality Indicators:**
1. **Popularity Score**: 7.8-9.2 range (higher = more popular)
2. **Safety Score**: 6.5-9.5 range (higher = safer)

**Additional Context:**
- **Geographic Diversity**: Good coverage across regions
- **Price Range**: Wide spectrum from budget ($35) to luxury ($300)
- **Trip Types**: Balanced distribution across different travel preferences
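
The ranges quoted in this section are written by hand, so they can drift if `dest.csv` changes. A minimal sketch of how they could be recomputed from the data instead. The `toy` frame and the `summarize_ranges` helper are illustrative assumptions, not part of the project; in the notebook the call would take `df` and `numerical_cols`.

```python
import pandas as pd

# Toy stand-in for df; values are chosen to mirror the ranges quoted above,
# they are not read from dest.csv.
toy = pd.DataFrame({
    "avg_cost_per_day": [35, 120, 300],
    "popularity_score": [7.8, 8.5, 9.2],
    "safety_score": [6.5, 8.0, 9.5],
})

def summarize_ranges(frame, cols):
    """Return a min/max table for the given numeric columns (hypothetical helper)."""
    return frame[cols].agg(["min", "max"])

ranges = summarize_ranges(
    toy, ["avg_cost_per_day", "popularity_score", "safety_score"]
)
print(ranges)
```

Printing the table next to the written notes makes it easy to spot when the text no longer matches the data.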

### ML Recommendation Strategy
Our recommendation system will:
1. **Filter** destinations by budget and duration constraints
2. **Score** based on trip type and seasonal preferences
3. **Rank** using popularity and safety as quality indicators
4. **Explain** each recommendation with clear reasoning
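
The four steps above can be sketched in plain pandas. This is a rough illustration under stated assumptions, not the final design: the `destination` column, the sample rows, the `recommend` function, and the equal weighting of popularity and safety are all hypothetical choices made for the example.

```python
import pandas as pd

# Hypothetical sample rows; the "destination" column and all values are
# illustrative and not taken from dest.csv.
sample = pd.DataFrame({
    "destination": ["Dest A", "Dest B", "Dest C", "Dest D"],
    "avg_cost_per_day": [40, 120, 280, 90],
    "min_days": [3, 5, 4, 2],
    "max_days": [10, 14, 7, 6],
    "trip_type": ["culture", "beach", "luxury", "beach"],
    "season_best": ["spring", "summer", "winter", "summer"],
    "popularity_score": [8.1, 9.0, 8.8, 7.9],
    "safety_score": [7.5, 9.2, 9.0, 8.0],
})

def recommend(frame, daily_budget, trip_days, trip_type, season, top_n=3):
    """Filter, score, rank, and explain destinations (illustrative sketch)."""
    # 1. Filter: keep destinations within budget whose duration window fits.
    mask = (
        (frame["avg_cost_per_day"] <= daily_budget)
        & (frame["min_days"] <= trip_days)
        & (frame["max_days"] >= trip_days)
    )
    out = frame[mask].copy()

    # 2. Score: one point each for a trip type match and a season match.
    out["preference_score"] = (
        (out["trip_type"] == trip_type).astype(int)
        + (out["season_best"] == season).astype(int)
    )

    # 3. Rank: break ties with the mean of popularity and safety
    #    (equal weights are an assumption, not a tuned choice).
    out["quality_score"] = (out["popularity_score"] + out["safety_score"]) / 2
    out = out.sort_values(
        ["preference_score", "quality_score"], ascending=False
    ).head(top_n)

    # 4. Explain: attach a short human-readable reason to each row.
    out["reason"] = [
        f"${cost}/day fits the budget; "
        f"trip type match: {tt == trip_type}; season match: {sb == season}"
        for cost, tt, sb in zip(
            out["avg_cost_per_day"], out["trip_type"], out["season_best"]
        )
    ]
    return out

result = recommend(sample, daily_budget=150, trip_days=5,
                   trip_type="beach", season="summer")
print(result[["destination", "preference_score", "quality_score", "reason"]])
```

Filtering before scoring keeps hard constraints (budget, duration) separate from soft preferences, so a destination the user cannot afford never appears just because it scores well.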


## 5. Next Steps

**Day 3 - EDA**: Deep dive into feature relationships and patterns
- Cost vs region analysis
- Trip type distribution
- Seasonal patterns
- Safety vs popularity correlation
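
As a starting point for the Day 3 items, each one maps onto a short pandas call. The sketch below runs on a made-up `toy` frame (values are illustrative only); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration; not taken from dest.csv.
toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Europe"],
    "trip_type": ["culture", "beach", "urban", "culture"],
    "season_best": ["spring", "summer", "summer", "spring"],
    "avg_cost_per_day": [40, 60, 120, 180],
    "popularity_score": [8.0, 8.6, 9.0, 8.4],
    "safety_score": [7.0, 8.0, 9.0, 8.5],
})

# Cost vs region: mean daily cost per region.
cost_by_region = toy.groupby("region")["avg_cost_per_day"].mean()

# Trip type distribution: number of destinations per type.
trip_type_counts = toy["trip_type"].value_counts()

# Seasonal patterns: trip types cross-tabulated against best season.
season_by_type = pd.crosstab(toy["trip_type"], toy["season_best"])

# Safety vs popularity: Pearson correlation between the two scores.
safety_popularity_corr = toy["safety_score"].corr(toy["popularity_score"])

print(cost_by_region)
print(trip_type_counts)
print(season_by_type)
print(f"Safety vs popularity correlation: {safety_popularity_corr:.2f}")
```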

**Day 4 - Feature Engineering**: Prepare features for ML
- Encode categorical variables
- Normalize numerical features
- Create derived features if needed
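
A minimal sketch of what the Day 4 steps could look like, assuming one-hot encoding for the categorical columns and min-max scaling for the numerical ones. Both are common defaults chosen for this example, not decisions made in this notebook. The `prepare_features` helper, the `duration_range` feature, and the toy rows are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; not taken from dest.csv.
toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia"],
    "trip_type": ["beach", "culture", "nature"],
    "avg_cost_per_day": [35, 300, 100],
    "min_days": [2, 3, 5],
    "max_days": [7, 10, 14],
    "safety_score": [6.5, 9.5, 8.0],
})

def prepare_features(frame, categorical_cols, numerical_cols):
    """Derive, scale, and encode features (illustrative sketch)."""
    out = frame.copy()

    # Derived feature: how flexible the recommended trip length is.
    out["duration_range"] = out["max_days"] - out["min_days"]

    # Min-max scale numerical columns to [0, 1]; a zero span is replaced
    # by 1 so constant columns do not cause a division by zero.
    num = list(numerical_cols) + ["duration_range"]
    span = (out[num].max() - out[num].min()).replace(0, 1)
    out[num] = (out[num] - out[num].min()) / span

    # One-hot encode categorical columns as 0/1 integer indicators.
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)

features = prepare_features(
    toy,
    categorical_cols=["region", "trip_type"],
    numerical_cols=["avg_cost_per_day", "safety_score"],
)
print(features)
```

With only a handful of categories per column, one-hot encoding keeps the features easy to interpret, which helps when each recommendation needs an explanation.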