# Feature Engineering Analysis - Hotel Booking Cancellation Prediction
## Academic Research Framework - NIB 7072 Coursework

**Research Objective:** Engineer and evaluate features that improve the accuracy of hotel booking cancellation prediction.

**Academic Context:** This notebook implements and analyzes the feature engineering strategies defined in features.md, focusing on domain-driven feature creation and categorical encoding optimization.

**Key Areas:**
- Advanced categorical encoding strategies (mean target encoding)
- Temporal and seasonal feature engineering
- Guest behavior and composition features
- Financial and revenue optimization features
- Risk assessment and booking pattern features

## 🛠️ Environment Setup

In [None]:
# Core data manipulation
import pandas as pd
import numpy as np
from datetime import datetime, timedelta, date
import calendar

# Statistical analysis and feature selection
from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import mutual_info_classif, SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import KFold

# Advanced feature engineering
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import PolynomialFeatures

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Feature engineering environment setup completed")

## 📊 Load Preprocessed Data

In [None]:
# Load preprocessed data from previous stage
preprocessed_path = "../data/processed/hotel_bookings_preprocessed.csv"

try:
    df_processed = pd.read_csv(preprocessed_path)
    print(f"✅ Preprocessed dataset loaded: {df_processed.shape}")
    
    # Data overview
    print("\n📊 PREPROCESSED DATA OVERVIEW:")
    print(f"Rows: {df_processed.shape[0]:,}")
    print(f"Columns: {df_processed.shape[1]}")
    print(f"Cancellation rate: {df_processed['is_canceled'].mean():.3f}")
    
except FileNotFoundError:
    print("⚠️ Preprocessed dataset not found.")
    print("Creating sample preprocessed data for demonstration...")
    
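    # Create synthetic stand-in data; every column is an independent random draw,
    # so cancellation rates computed from it carry no real signal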
    np.random.seed(42)
    n_samples = 5000
    
    df_processed = pd.DataFrame({
        'is_canceled': np.random.choice([0, 1], n_samples, p=[0.67, 0.33]),
        'lead_time': np.random.randint(0, 400, n_samples),
        'stays_in_weekend_nights': np.random.randint(0, 5, n_samples),
        'stays_in_week_nights': np.random.randint(0, 20, n_samples),
        'adults': np.random.choice([1, 2, 3, 4], n_samples, p=[0.3, 0.5, 0.15, 0.05]),
        'children': np.random.choice([0, 1, 2], n_samples, p=[0.7, 0.25, 0.05]),
        'babies': np.random.choice([0, 1], n_samples, p=[0.9, 0.1]),
        'adr': np.random.uniform(20, 400, n_samples),
        'hotel': np.random.choice(['Resort Hotel', 'City Hotel'], n_samples),
        'meal': np.random.choice(['BB', 'HB', 'FB', 'SC'], n_samples),
        'market_segment': np.random.choice(['Online TA', 'Offline TA/TO', 'Groups', 'Direct', 'Corporate'], n_samples),
        'arrival_date_month': np.random.choice(range(1, 13), n_samples)
    })
    print(f"Sample preprocessed dataset created: {df_processed.shape}")

## 🎯 Advanced Categorical Encoding Analysis

This section identifies the categorical columns and sets up a comparison of categorical encoding strategies, including mean target encoding.

In [None]:
# Mean target encoding implementation
print("🎯 MEAN TARGET ENCODING ANALYSIS:")

# Identify categorical columns
categorical_columns = df_processed.select_dtypes(include=['object']).columns.tolist()
if 'is_canceled' in categorical_columns:
    categorical_columns.remove('is_canceled')

print(f"Categorical columns identified: {categorical_columns}")

# Implement mean target encoding functions from features.md
# This will be expanded with the comprehensive encoding implementation

print("Mean target encoding functions to be implemented...")
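
The encoding functions from features.md are not reproduced in this notebook, so the following is only a minimal sketch of one common way to do mean target encoding without leaking the target: each row is encoded from category means computed on the other folds, and those means are shrunk toward the fold-level cancellation rate. The helper name `oof_mean_target_encode`, the `smoothing` weight, and the demo data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_target_encode(df, column, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold, smoothed mean target encoding of `column`.

    Hypothetical helper: the name and the default `n_splits` and `smoothing`
    values are illustrative assumptions, not taken from features.md.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for fit_idx, apply_idx in kfold.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()  # overall rate in the fitting folds
        stats = fit_part.groupby(column)[target].agg(["mean", "count"])
        # Shrink each category mean toward the prior; rare categories shrink most
        smoothed = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        # Categories unseen in the fitting folds fall back to the prior
        values = df.iloc[apply_idx][column].map(smoothed).fillna(prior)
        encoded.iloc[apply_idx] = values.to_numpy()

    return encoded


# Synthetic demo data (assumed values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "market_segment": rng.choice(["Online TA", "Direct", "Groups"], size=300),
    "is_canceled": rng.integers(0, 2, size=300),
})
demo["market_segment_te"] = oof_mean_target_encode(demo, "market_segment", "is_canceled")
print(demo.groupby("market_segment")["market_segment_te"].mean())
```

In the notebook itself, the same helper could be applied to each entry of `categorical_columns` on `df_processed`. For a held-out test set, the encoding would instead be fitted once on the full training data.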

## ⏰ Temporal Feature Engineering

This section derives temporal and seasonal features from the arrival date.

In [None]:
# Temporal feature engineering
print("⏰ TEMPORAL FEATURE ENGINEERING:")

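# Copy up front so later cells can rely on df_enhanced even if the month column is missing
df_enhanced = df_processed.copy()

# Create basic temporal features
if 'arrival_date_month' in df_enhanced.columns:
    # Meteorological seasons (Northern Hemisphere), keyed by month number
    season_map = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                  3: 'Spring', 4: 'Spring', 5: 'Spring',
                  6: 'Summer', 7: 'Summer', 8: 'Summer',
                  9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}

    # Convert English month names (e.g. 'July') to numbers if preprocessing left them as strings
    month = df_enhanced['arrival_date_month']
    if not pd.api.types.is_numeric_dtype(month):
        month = month.map({name: i for i, name in enumerate(calendar.month_name) if name})
    df_enhanced['season'] = month.map(season_map)

    print("✅ Season feature created")
    print("Season distribution:")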
    print(df_enhanced['season'].value_counts())
    
    # Seasonal cancellation analysis
    seasonal_cancel = df_enhanced.groupby('season')['is_canceled'].agg(['mean', 'count'])
    print("\nSeasonal cancellation rates:")
    print(seasonal_cancel)

# More temporal features will be added based on features.md
print("\nAdditional temporal features to be implemented...")
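
One candidate for the additional temporal features is a cyclical encoding of the arrival month, which places December next to January instead of eleven steps apart. The sketch below assumes the month is stored as a number from 1 to 12. The helper name `add_cyclical_month` and the output column names are illustrative and may differ from what features.md specifies.

```python
import numpy as np
import pandas as pd


def add_cyclical_month(df, month_col="arrival_date_month"):
    """Add sine/cosine encodings of a 1-12 month column (hypothetical helper)."""
    out = df.copy()
    # Map month 1..12 onto a full circle, with January at angle 0
    angle = 2 * np.pi * (out[month_col] - 1) / 12
    out["arrival_month_sin"] = np.sin(angle)
    out["arrival_month_cos"] = np.cos(angle)
    return out


# Small demo frame (assumed values, for illustration only)
demo = pd.DataFrame({"arrival_date_month": [1, 4, 7, 12]})
demo = add_cyclical_month(demo)
print(demo)
```

Both columns are needed together: the sine or cosine alone maps two different months to the same value.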

## 👥 Guest Behavior Features

This section engineers features that describe guest composition and behavior patterns.

In [None]:
# Guest behavior feature engineering
print("👥 GUEST BEHAVIOR FEATURE ENGINEERING:")

# Basic guest composition features
if all(col in df_processed.columns for col in ['adults', 'children', 'babies']):
    df_enhanced['total_guests'] = df_processed['adults'] + df_processed['children'] + df_processed['babies']
    df_enhanced['is_family'] = ((df_processed['children'] > 0) | (df_processed['babies'] > 0)).astype(int)
    # The +1 keeps the denominator non-zero for adult-only bookings
    df_enhanced['adults_to_children_ratio'] = df_processed['adults'] / (df_processed['children'] + df_processed['babies'] + 1)
    
    print("✅ Guest composition features created")
    print(f"Family booking rate: {df_enhanced['is_family'].mean():.3f}")
    
    # Family vs non-family cancellation analysis
    family_analysis = df_enhanced.groupby('is_family')['is_canceled'].agg(['mean', 'count'])
    print("\nFamily booking cancellation analysis:")
    print(family_analysis)

# Additional guest behavior features will be added
print("\nAdditional guest behavior features to be implemented...")
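
Stay length is one plausible addition here, since the dataset already carries `stays_in_weekend_nights` and `stays_in_week_nights`. The sketch below is an assumption about what such features could look like, not the set defined in features.md. The helper name `add_stay_features`, the output column names, and the 7-night threshold are all hypothetical.

```python
import pandas as pd


def add_stay_features(df, long_stay_nights=7):
    """Add stay-length features (hypothetical helper; the threshold is an assumption)."""
    out = df.copy()
    out["total_nights"] = out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    # Mask zero-night bookings so the division yields NaN, then report a share of 0
    nights = out["total_nights"].where(out["total_nights"] > 0)
    out["weekend_night_share"] = (out["stays_in_weekend_nights"] / nights).fillna(0.0)
    out["is_long_stay"] = (out["total_nights"] >= long_stay_nights).astype(int)
    return out


# Small demo frame (assumed values, for illustration only)
demo = pd.DataFrame({
    "stays_in_weekend_nights": [0, 2, 2],
    "stays_in_week_nights": [0, 2, 5],
})
demo = add_stay_features(demo)
print(demo)
```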

## 📈 Next Steps

This notebook will be expanded to include:

- **Financial Features:** Revenue and pricing optimization features
- **Risk Assessment Features:** Booking pattern and cancellation risk features
- **Interaction Features:** Cross-feature combinations and polynomial features
- **Feature Selection:** Statistical and model-based feature importance
- **Feature Validation:** Performance impact analysis
- **Export Enhanced Dataset:** Final feature-engineered dataset
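
For the statistical side of feature selection, one option is to rank numeric features by their estimated mutual information with the target, using scikit-learn's `mutual_info_classif`. The sketch below assumes the numeric columns contain no missing values. The helper name `rank_features_by_mutual_info` is hypothetical, and the demo target is synthetic: it depends on `lead_time` alone so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def rank_features_by_mutual_info(df, target, seed=42):
    """Rank numeric features by estimated mutual information with `target`.

    Hypothetical helper; assumes the numeric columns contain no missing values.
    """
    features = df.drop(columns=[target]).select_dtypes(include="number")
    scores = mutual_info_classif(features, df[target], random_state=seed)
    return pd.Series(scores, index=features.columns).sort_values(ascending=False)


# Synthetic demo data (assumed values, for illustration only)
rng = np.random.default_rng(0)
n_rows = 1000
lead_time = rng.integers(0, 400, size=n_rows)
demo = pd.DataFrame({
    "lead_time": lead_time,
    "adr": rng.uniform(20, 400, size=n_rows),
    # Target depends on lead_time only, so lead_time should rank first
    "is_canceled": (lead_time > 200).astype(int),
})
ranking = rank_features_by_mutual_info(demo, "is_canceled")
print(ranking)
```

Mutual information captures non-linear dependence but scores each feature on its own, so it would complement rather than replace the model-based importance mentioned above.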

*The remaining sections will be implemented following the instructions in features.md.*