# Hotel Ranking System Demo

This notebook demonstrates the hotel ranking system for day-access amenities. It shows how to:

1. Generate synthetic data
2. Explore the data
3. Train a ranking model
4. Evaluate the model
5. Use the model for ranking venues
6. Visualize venue locations and seasonal availability

## Setup

First, let's import the necessary modules and set up the environment.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

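# Set plot style via seaborn; the 'seaborn-whitegrid' matplotlib style name
# has been deprecated since matplotlib 3.6 and fails on newer releases
sns.set_style('whitegrid')
sns.set_palette('viridis')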

# Set random seed for reproducibility
np.random.seed(42)

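
The setup cell depends on several third-party packages. As a minimal sketch, the check below reports which of them cannot be imported in the current environment. The helper name `missing_packages` and the `required` list are illustrative and not part of the project.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Packages imported by the setup cell (illustrative list)
required = ["numpy", "pandas", "matplotlib", "seaborn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it into the kernel's environment before running the remaining cells.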

## 1. Generate Synthetic Data

We'll generate a small sample dataset for demonstration purposes.

In [None]:
# Check if sample data already exists
sample_dir = "data/sample"
if not os.path.exists(sample_dir):
    print("Generating sample data...")
    from src.data_generation.main import generate_sample_data
    data = generate_sample_data()
else:
    print("Sample data already exists. Loading...")
    # Load sample data
    venues_df = pd.read_csv(os.path.join(sample_dir, "venues.csv"))
    users_df = pd.read_csv(os.path.join(sample_dir, "users.csv"))
    users_processed_df = pd.read_csv(os.path.join(sample_dir, "users_processed.csv"))
    seasonal_df = pd.read_csv(os.path.join(sample_dir, "seasonal.csv"))
    weather_df = pd.read_csv(os.path.join(sample_dir, "weather.csv"))
    interactions_df = pd.read_csv(os.path.join(sample_dir, "interactions.csv"))
    train_df = pd.read_csv(os.path.join(sample_dir, "interactions_train.csv"))
    test_df = pd.read_csv(os.path.join(sample_dir, "interactions_test.csv"))
    
    # Convert date columns to datetime
    weather_df["date"] = pd.to_datetime(weather_df["date"])
    
    data = {
        "venues": venues_df,
        "users": users_df,
        "users_processed": users_processed_df,
        "seasonal": seasonal_df,
        "weather": weather_df,
        "interactions": interactions_df,
        "interactions_train": train_df,
        "interactions_test": test_df
    }

## 2. Explore the Data

Let's explore the generated data to understand its structure and characteristics.

### 2.1 Venue Data

First, let's look at the venue data.

In [None]:
# Display venue data
print(f"Number of venues: {len(data['venues'])}")
data['venues'].head()

In [None]:
# Venue type distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=data['venues'], x='venue_type')
plt.title('Venue Type Distribution')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Amenity availability
amenities = ['pool', 'beach_access', 'spa', 'gym', 'hot_tub', 'food_service', 'bar']
amenity_availability = {
    amenity: data['venues'][amenity].mean()
    for amenity in amenities
    if amenity in data['venues'].columns
}

plt.figure(figsize=(12, 6))
sns.barplot(x=list(amenity_availability.keys()), y=list(amenity_availability.values()))
plt.title('Amenity Availability')
plt.ylabel('Fraction of Venues')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Price distribution by venue type
plt.figure(figsize=(12, 6))
sns.boxplot(data=data['venues'], x='venue_type', y='day_pass_price')
plt.title('Day Pass Price by Venue Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 2.2 Seasonal Data

Let's examine how amenity availability and pricing change by season.

In [None]:
# Display seasonal data
print(f"Number of seasonal records: {len(data['seasonal'])}")
data['seasonal'].head()

In [None]:
# Pool availability by season
if 'pool_available' in data['seasonal'].columns:
    pool_availability = data['seasonal'].groupby('season')['pool_available'].mean()
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x=pool_availability.index, y=pool_availability.values)
    plt.title('Pool Availability by Season')
    plt.ylabel('Fraction of Venues')
    plt.ylim(0, 1)
    plt.tight_layout()
    plt.show()

In [None]:
# Seasonal price variation
seasonal_price = data['seasonal'].groupby(['season', 'venue_id'])['seasonal_price'].mean().reset_index()
base_price = data['venues'][['venue_id', 'day_pass_price']]
price_comparison = seasonal_price.merge(base_price, on='venue_id')
price_comparison['price_ratio'] = price_comparison['seasonal_price'] / price_comparison['day_pass_price']

plt.figure(figsize=(10, 6))
sns.boxplot(data=price_comparison, x='season', y='price_ratio')
plt.axhline(y=1, color='r', linestyle='--')
plt.title('Seasonal Price Ratio (Seasonal Price / Base Price)')
plt.ylabel('Price Ratio')
plt.tight_layout()
plt.show()
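
To make the price ratio concrete, here is a small plain-Python sketch of the same calculation. The venue IDs and prices are made-up toy values, not taken from the sample data, and the helper `mean_price_ratio_by_season` is hypothetical.

```python
from statistics import mean

# Hypothetical toy values, not taken from the sample dataset
base_price = {"v1": 50.0, "v2": 80.0}
seasonal_rows = [
    ("summer", "v1", 60.0),
    ("summer", "v2", 100.0),
    ("winter", "v1", 40.0),
    ("winter", "v2", 80.0),
]


def mean_price_ratio_by_season(rows, base):
    """Average seasonal_price / base price over the venues in each season."""
    ratios = {}
    for season, venue_id, price in rows:
        ratios.setdefault(season, []).append(price / base[venue_id])
    return {season: mean(values) for season, values in ratios.items()}


ratios = mean_price_ratio_by_season(seasonal_rows, base_price)
for season, ratio in ratios.items():
    print(f"{season}: {ratio:.3f}")
```

A ratio above 1 means the season is priced above the base day-pass price, which is what the dashed reference line in the plot marks.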

### 2.3 User Interaction Data

Now let's look at the user interaction data.

In [None]:
# Display interaction data
print(f"Number of interactions: {len(data['interactions'])}")
data['interactions'].head()

In [None]:
# Overall click-through rate
ctr = data['interactions']['clicked'].mean()
print(f"Overall Click-Through Rate: {ctr:.2%}")

In [None]:
# CTR by position
ctr_by_position = data['interactions'].groupby('position')['clicked'].mean()

plt.figure(figsize=(12, 6))
sns.barplot(x=ctr_by_position.index, y=ctr_by_position.values)
plt.title('Click-Through Rate by Position')
plt.xlabel('Position')
plt.ylabel('Click-Through Rate')
plt.tight_layout()
plt.show()
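
For readers who want to see what the group-by computes, here is a minimal plain-Python sketch of CTR by position. The rows are toy `(position, clicked)` pairs and the helper `ctr_by_position` is illustrative, not part of the project.

```python
from collections import defaultdict


def ctr_by_position(rows):
    """rows: iterable of (position, clicked) pairs; returns {position: CTR}."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in rows:
        shown[position] += 1
        clicks[position] += int(clicked)
    # CTR = clicks / impressions, reported in ascending position order
    return {pos: clicks[pos] / shown[pos] for pos in sorted(shown)}


# Toy impressions: position 1 shown twice, position 2 four times, position 3 twice
toy = [(1, 1), (1, 0), (2, 1), (2, 0), (2, 0), (2, 0), (3, 0), (3, 0)]
result = ctr_by_position(toy)
print(result)
```

Each value is simply the mean of the 0/1 `clicked` flag within a position, which is what `groupby('position')['clicked'].mean()` returns.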

In [None]:
# CTR by season
ctr_by_season = data['interactions'].groupby('season')['clicked'].mean()

plt.figure(figsize=(10, 6))
sns.barplot(x=ctr_by_season.index, y=ctr_by_season.values)
plt.title('Click-Through Rate by Season')
plt.ylabel('Click-Through Rate')
plt.tight_layout()
plt.show()

In [None]:
# CTR by distance
data['interactions']['distance_bucket'] = pd.cut(
    data['interactions']['distance_km'],
    bins=[0, 1, 2, 5, 10, 20, 50],
    labels=['0-1km', '1-2km', '2-5km', '5-10km', '10-20km', '20-50km']
)

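# observed=False keeps empty buckets in the result and avoids the pandas
# FutureWarning raised when grouping by a categorical column
ctr_by_distance = data['interactions'].groupby('distance_bucket', observed=False)['clicked'].mean()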

plt.figure(figsize=(12, 6))
sns.barplot(x=ctr_by_distance.index, y=ctr_by_distance.values)
plt.title('Click-Through Rate by Distance')
plt.ylabel('Click-Through Rate')
plt.tight_layout()
plt.show()
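
One detail worth knowing: `pd.cut` builds right-closed intervals by default, so a distance of exactly 1 km lands in `0-1km`, and a distance of exactly 0 falls outside the first bucket unless `include_lowest=True` is passed. The plain-Python sketch below mimics that default behaviour with the same edges and labels. The helper `distance_bucket` is illustrative only.

```python
from bisect import bisect_left

EDGES = [0, 1, 2, 5, 10, 20, 50]
LABELS = ['0-1km', '1-2km', '2-5km', '5-10km', '10-20km', '20-50km']


def distance_bucket(distance_km):
    """Return the right-closed bucket label, or None outside (0, 50]."""
    # bisect_left finds the first edge >= distance, so a value equal to an
    # edge is assigned to the bucket that ends at that edge
    idx = bisect_left(EDGES, distance_km) - 1
    if 0 <= idx < len(LABELS):
        return LABELS[idx]
    return None


for d in [0, 0.5, 1, 1.5, 50, 60]:
    print(d, distance_bucket(d))
```

Rows that fall outside every bucket are dropped from the grouped CTR, so it is worth checking how many interactions have a distance of 0 or more than 50 km.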

## 3. Train a Ranking Model

Now let's train a ranking model using the generated data.

In [None]:
from src.modeling.feature_engineering import prepare_features
from src.modeling.model import train_ranking_model, save_model

# Prepare features for training
print("Preparing features...")
train_features_df = prepare_features(
    data['interactions_train'],
    data['venues'],
    data['users_processed'],
    data['seasonal'],
    data['weather']
)

# Train model
print("Training model...")
model, feature_cols = train_ranking_model(train_features_df)

# Save model
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)
save_model(model, feature_cols, model_dir)

## 4. Evaluate the Model

Let's evaluate the trained model on the test data.

In [None]:
from src.modeling.evaluation import evaluate_model, generate_evaluation_report

# Evaluate model
print("Evaluating model...")
metrics, results_df = evaluate_model(
    model,
    data['interactions_test'],
    data['venues'],
    data['users_processed'],
    data['seasonal'],
    data['weather'],
    feature_cols
)

# Print metrics
print("\nEvaluation Metrics:")
print(f"AUC: {metrics['auc']:.4f}")
print(f"Average Precision: {metrics['average_precision']:.4f}")
print(f"NDCG@5: {metrics['ndcg@5']:.4f}")
print(f"NDCG@10: {metrics['ndcg@10']:.4f}")
print(f"CTR@1: {metrics['ctr@1']:.4f}")
print(f"CTR@5: {metrics['ctr@5']:.4f}")
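
The implementation of `evaluate_model` is not shown in this notebook, so the following is only a sketch of how NDCG@k is commonly computed for a single session with binary click labels. The function names and the choice to return 0 for sessions without clicks are assumptions, not the project's actual code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(scores, clicks, k):
    """NDCG@k for one session: rank items by score, compare with the ideal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_clicks = [clicks[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(clicks, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # assumption: sessions without clicks score 0; conventions vary
    return dcg_at_k(ranked_clicks, k) / ideal_dcg


# Toy session: the only clicked venue gets the lowest score, so it is ranked third
print(ndcg_at_k(scores=[0.9, 0.2, 0.5], clicks=[0, 1, 0], k=3))
```

A session-level metric like this is usually averaged over all test sessions to give a single NDCG@k figure.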

In [None]:
# Plot precision-recall curve
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(results_df['clicked'], results_df['predicted_score'])
ap = average_precision_score(results_df['clicked'], results_df['predicted_score'])

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, lw=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {ap:.4f})')
plt.grid(True)
plt.show()
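
To show what the AP value in the plot title summarises, here is a hand-rolled sketch of average precision. It assumes there are no tied scores, and the function name `average_precision` is illustrative; the notebook itself uses scikit-learn's `average_precision_score`.

```python
def average_precision(clicks, scores):
    """Mean of the precision values measured at the rank of each clicked item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if clicks[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


# Toy example: clicked items sit at ranks 1 and 3 after sorting by score
ap_toy = average_precision(clicks=[1, 0, 1, 0], scores=[0.9, 0.8, 0.7, 0.1])
print(f"{ap_toy:.4f}")
```

Here the precision is 1/1 at the first click and 2/3 at the second, so the average is about 0.83.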

In [None]:
# Plot score distribution by clicked status
plt.figure(figsize=(10, 6))
sns.histplot(
    data=results_df,
    x='predicted_score',
    hue='clicked',
    bins=30,
    alpha=0.6
)
plt.title('Distribution of Predicted Scores by Clicked Status')
plt.xlabel('Predicted Score')
plt.ylabel('Count')
plt.show()

In [None]:
# Feature importance
importance = model.get_score(importance_type='gain')
importance = sorted(importance.items(), key=lambda x: x[1], reverse=True)

# Plot top 20 features
top_features = importance[:20]
feature_names = [f[0] for f in top_features]
feature_scores = [f[1] for f in top_features]

plt.figure(figsize=(12, 8))
sns.barplot(x=feature_scores, y=feature_names)
plt.title('Top 20 Features by Importance')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

## 5. Use the Model for Ranking Venues

Now let's use the trained model to rank venues for a specific user and context.

In [None]:
from src.modeling.model import predict_rankings

# Select a random session from the test data
session_id = data['interactions_test']['session_id'].sample(1).iloc[0]
session_data = data['interactions_test'][data['interactions_test']['session_id'] == session_id]

print(f"Selected session ID: {session_id}")
print(f"Number of venues in session: {len(session_data)}")

# Get user ID and context information
user_id = session_data['user_id'].iloc[0]
season = session_data['season'].iloc[0]
time_slot = session_data['time_slot'].iloc[0]

print(f"User ID: {user_id}")
print(f"Season: {season}")
print(f"Time slot: {time_slot}")

In [None]:
# Prepare features for the session
session_features = prepare_features(
    session_data,
    data['venues'],
    data['users_processed'],
    data['seasonal'],
    data['weather']
)

# Predict rankings
ranked_venues = predict_rankings(model, session_features, feature_cols)

# Sort by predicted score
ranked_venues = ranked_venues.sort_values('predicted_score', ascending=False)

# Display ranked venues
display_cols = ['venue_id', 'clicked', 'predicted_score', 'predicted_rank',
               'venue_type', 'star_rating', 'seasonal_price', 'vibe',
               'distance_km', 'time_slot']

ranked_venues[display_cols]
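
The body of `predict_rankings` is not shown here. As a rough sketch, and assuming `predicted_rank` is a 1-based rank in descending order of `predicted_score`, the scoring-to-ranking step could look like this in plain Python. The helper `rank_by_score` and the toy scores are hypothetical.

```python
def rank_by_score(venue_scores):
    """venue_scores: {venue_id: score}; returns (venue_id, score, rank) tuples."""
    ordered = sorted(venue_scores.items(), key=lambda item: item[1], reverse=True)
    # Rank 1 is the venue with the highest predicted score
    return [(venue_id, score, rank)
            for rank, (venue_id, score) in enumerate(ordered, start=1)]


# Hypothetical scores for three venues in one session
ranking = rank_by_score({"a": 0.2, "b": 0.9, "c": 0.5})
for venue_id, score, rank in ranking:
    print(rank, venue_id, score)
```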

In [None]:
# Compare model ranking with original position
comparison = ranked_venues[['venue_id', 'position', 'predicted_rank', 'clicked', 'predicted_score']]
comparison = comparison.sort_values('position')

plt.figure(figsize=(12, 6))
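# Cast to bool first so the colour mapping also works when 'clicked'
# was loaded from CSV as 0/1 integers
point_colors = comparison['clicked'].astype(bool).map({True: 'green', False: 'red'})
plt.scatter(comparison['position'], comparison['predicted_rank'],
            c=point_colors, s=100, alpha=0.7)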

plt.xlabel('Original Position')
plt.ylabel('Predicted Rank')
plt.title('Original Position vs. Predicted Rank')
plt.grid(True)

# Add diagonal line for reference
max_val = max(comparison['position'].max(), comparison['predicted_rank'].max())
plt.plot([0, max_val], [0, max_val], 'k--', alpha=0.5)

# Add legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10, label='Clicked'),
    Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Not Clicked')
]
plt.legend(handles=legend_elements)

plt.tight_layout()
plt.show()

## 6. Visualize Venue Locations

Let's visualize the venue locations in the home city of the user selected in Section 5 (this cell reuses `user_id` from that section).

In [None]:
from src.utils.visualization import plot_venue_locations

# Get user's city
user_city = data['users'][data['users']['user_id'] == user_id]['home_city'].iloc[0]
print(f"User's city: {user_city}")

# Plot venue locations
plot_venue_locations(data['venues'], data['users'], city=user_city)
plt.title(f'Venue Locations in {user_city}')
plt.show()

## 7. Seasonal Availability Analysis

Let's analyze how amenity availability changes by season.

In [None]:
from src.utils.visualization import plot_seasonal_availability

# Plot seasonal availability for different amenities
amenities = ['pool', 'beach_access', 'spa', 'gym', 'hot_tub']

fig, axes = plt.subplots(len(amenities), 1, figsize=(12, 4*len(amenities)))

for i, amenity in enumerate(amenities):
    try:
        plot_seasonal_availability(data['seasonal'], amenity=amenity, ax=axes[i])
    except (KeyError, ValueError):
        axes[i].text(0.5, 0.5, f"No data available for {amenity}", 
                    horizontalalignment='center', verticalalignment='center')
        axes[i].set_title(f'Seasonal Availability of {amenity.capitalize()}')

plt.tight_layout()
plt.show()

## Conclusion

In this notebook, we've demonstrated the hotel ranking system for day-access amenities. We've shown how to:

1. Generate synthetic data for venues, users, and interactions
2. Explore the data to understand its characteristics
3. Train a ranking model using XGBoost
4. Evaluate the model's performance
5. Use the model to rank venues for a specific user and context
6. Visualize venue locations and seasonal availability

The model takes into account various factors including:
- User preferences and history
- Venue attributes and amenities
- Seasonal availability and pricing
- Weather conditions
- Location proximity
- Time slot availability

This approach provides a personalized ranking of venues based on the specific context and user preferences.