# Flight Ranking Competition - Baseline Model
## Predict Business Traveler Flight Choices

- **Goal**: Rank the flights in each search session to predict which one a business traveler will choose
- **Metric**: HitRate@3 (share of scored sessions where the chosen flight is among the top 3 predictions)
- **Your Advantage**: Airline industry expertise!

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import gc
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 10)

## 1. Load Data Using PyArrow (More Stable)

In [None]:
# Load with pyarrow directly - more stable for large files
print("Loading train data...")
train_table = pq.read_table('data/train.parquet')
print(f"Train shape: {train_table.num_rows:,} rows × {train_table.num_columns} columns")

# Convert to pandas in one step (switch to batch-wise conversion if memory is tight)
train = train_table.to_pandas()
del train_table
gc.collect()

print("Data loaded successfully!")

## 2. Quick Data Overview

In [None]:
# Basic info
print(f"Total search sessions: {train['ranker_id'].nunique():,}")
print(f"Selected flights: {train['selected'].sum():,}")

# Group sizes (important for scoring)
group_sizes = train.groupby('ranker_id').size()
print(f"\nGroups with >10 options (scored): {(group_sizes > 10).sum():,}")
print(f"Groups with ≤10 options (ignored): {(group_sizes <= 10).sum():,}")
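
Because only sessions with more than 10 options are scored, a local HitRate@3 check should apply the same filter. The sketch below is one way to compute it, assuming a frame with `ranker_id`, `selected` (1 for the chosen flight) and a predicted `rank` column where 1 is best. The function name `hit_rate_at_3`, the `rank` column and the `min_group_size` parameter are illustrative assumptions, not part of the competition kit.

```python
import pandas as pd


def hit_rate_at_3(df: pd.DataFrame, min_group_size: int = 11) -> float:
    """Share of scored sessions whose chosen flight has predicted rank <= 3.

    Assumes columns ranker_id, selected (0/1) and rank (1 = best),
    with exactly one selected flight per session.
    """
    sizes = df.groupby("ranker_id")["selected"].transform("size")
    scored = df[sizes >= min_group_size]          # keep sessions with >10 options
    chosen = scored[scored["selected"] == 1]      # one row per scored session
    if chosen.empty:
        return float("nan")
    return float((chosen["rank"] <= 3).mean())


# Toy data: two sessions of 11 options each
toy = pd.DataFrame({
    "ranker_id": ["a"] * 11 + ["b"] * 11,
    "selected": [1] + [0] * 10 + [0] * 10 + [1],  # a: first option, b: last option
    "rank": list(range(1, 12)) * 2,               # predicted rank = listing order
})
print(hit_rate_at_3(toy))
```

In the toy data the chosen flight is ranked 1st in session `a` and 11th in session `b`, so one of the two scored sessions counts as a hit.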

## 3. Business Traveler Insights

In [None]:
# Analyze selected vs not selected
selected = train[train['selected'] == 1]
not_selected = train[train['selected'] == 0]

print("Price Analysis:")
print(f"Selected flights avg: ${selected['totalPrice'].mean():.2f}")
print(f"Not selected avg: ${not_selected['totalPrice'].mean():.2f}")

# Corporate policy
if 'pricingInfo_isAccessTP' in train.columns:
    print(f"\nPolicy compliance rate: {selected['pricingInfo_isAccessTP'].mean():.1%}")

## 4. Simple Feature Engineering

In [None]:
# Work with a sample for speed
sample_sessions = train['ranker_id'].unique()[:5000]
train_sample = train[train['ranker_id'].isin(sample_sessions)].copy()
print(f"Working with {len(train_sample):,} rows")

# Create rank features
train_sample['price_rank'] = train_sample.groupby('ranker_id')['totalPrice'].rank()
train_sample['price_pct'] = train_sample.groupby('ranker_id')['totalPrice'].rank(pct=True)

# Duration if available
duration_cols = [c for c in train.columns if 'duration' in c]
if duration_cols:
    col = duration_cols[0]
    train_sample['duration_rank'] = train_sample.groupby('ranker_id')[col].rank()

print("Features created!")
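
The rank features above are session-relative, and the same `groupby('ranker_id')` pattern extends to other price signals. A minimal sketch, assuming only the `ranker_id` and `totalPrice` columns used above; the helper `add_relative_price_features` and the new column names `is_cheapest` and `price_over_min` are hypothetical.

```python
import pandas as pd


def add_relative_price_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add price features measured against the cheapest option in each session."""
    out = df.copy()
    min_price = out.groupby("ranker_id")["totalPrice"].transform("min")
    # Flag every option that ties for the lowest price in its session
    out["is_cheapest"] = (out["totalPrice"] == min_price).astype(int)
    # Ratio to the session minimum (inf or NaN if the minimum price is 0)
    out["price_over_min"] = out["totalPrice"] / min_price
    return out


toy = pd.DataFrame({
    "ranker_id": ["a", "a", "a", "b", "b"],
    "totalPrice": [100.0, 150.0, 200.0, 80.0, 80.0],
})
feat = add_relative_price_features(toy)
print(feat)
```

A ratio to the session minimum keeps the size of the price gap, which a plain rank discards.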

## 5. Train Baseline Model

In [None]:
# Select features
feature_cols = ['price_rank', 'price_pct']
if 'duration_rank' in train_sample.columns:
    feature_cols.append('duration_rank')

X = train_sample[feature_cols].fillna(-1)
y = train_sample['selected']

# Train model
model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
model.fit(X, y)

print(f"Model trained with {len(feature_cols)} features")
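
The baseline is fit on every sampled row, so there is no held-out estimate of its quality. One option is to split by session so that all options of a search stay on the same side. The sketch below is an assumption-laden illustration: `split_by_session`, `valid_frac` and `seed` are hypothetical names, and the toy frame stands in for `train_sample`.

```python
import numpy as np
import pandas as pd


def split_by_session(df: pd.DataFrame, valid_frac: float = 0.2, seed: int = 42):
    """Split rows into train/validation so no ranker_id appears in both parts."""
    rng = np.random.default_rng(seed)
    sessions = rng.permutation(df["ranker_id"].drop_duplicates().to_numpy())
    n_valid = max(1, int(len(sessions) * valid_frac))
    valid_ids = set(sessions[:n_valid])
    is_valid = df["ranker_id"].isin(valid_ids)
    return df[~is_valid], df[is_valid]


# Toy data: 10 sessions with 3 options each
toy = pd.DataFrame({
    "ranker_id": np.repeat([f"s{i}" for i in range(10)], 3),
    "totalPrice": np.arange(30, dtype=float),
})
train_part, valid_part = split_by_session(toy, valid_frac=0.2)
print(train_part["ranker_id"].nunique(), valid_part["ranker_id"].nunique())
```

A row-level random split would scatter one session across both parts and leak information about which option was chosen.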

## 6. Create Submission

In [None]:
# Load test data
print("Loading test data...")
test = pd.read_parquet('data/test.parquet')
print(f"Test shape: {test.shape}")

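# Create the same features that were used in training
test['price_rank'] = test.groupby('ranker_id')['totalPrice'].rank()
test['price_pct'] = test.groupby('ranker_id')['totalPrice'].rank(pct=True)
if 'duration_rank' in feature_cols:
    # Reuse the first duration column, exactly as in the training cell
    test['duration_rank'] = test.groupby('ranker_id')[duration_cols[0]].rank()

# Predict with exactly the columns the model was fitted on
X_test = test[feature_cols].fillna(-1)
test_pred = model.predict(X_test)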

# Create submission
submission = test[['Id', 'ranker_id']].copy()
submission['pred'] = test_pred
submission['selected'] = submission.groupby('ranker_id')['pred'].rank(method='first', ascending=False).astype(int)

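# Save (create the output folder first so the write does not fail if it is missing)
import os
os.makedirs('submissions', exist_ok=True)
submission[['Id', 'ranker_id', 'selected']].to_parquet('submissions/baseline.parquet', index=False)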
print("Submission saved!")
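
The `selected` column in the submission holds ranks, so within each session it should be a permutation of 1..n. A quick check before uploading can catch ties or gaps. The sketch below is a hypothetical helper (`ranks_are_valid` is not part of the competition kit) and assumes integer ranks in a `selected` column grouped by `ranker_id`.

```python
import pandas as pd


def ranks_are_valid(sub: pd.DataFrame) -> bool:
    """True if, in every session, the integer ranks are exactly 1..n with no repeats."""
    grouped = sub.groupby("ranker_id")["selected"]
    sizes = grouped.transform("size")
    return bool(
        (grouped.transform("min") == 1).all()
        and (grouped.transform("max") == sizes).all()
        and (grouped.transform("nunique") == sizes).all()
    )


good = pd.DataFrame({
    "ranker_id": ["a", "a", "a", "b", "b"],
    "selected": [2, 1, 3, 1, 2],
})
bad = pd.DataFrame({
    "ranker_id": ["a", "a", "a", "b", "b"],
    "selected": [1, 1, 3, 1, 2],  # duplicate rank in session a
})
print(ranks_are_valid(good), ranks_are_valid(bad))
```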

## Next Steps

With your airline expertise, focus on:
1. **Time features**: Departure hour, day of week
2. **Corporate features**: Policy compliance, company patterns
3. **Route analysis**: Business vs leisure routes
4. **Airline loyalty**: Frequent flyer preferences
5. **Advanced models**: LightGBM with a ranking objective (e.g. `LGBMRanker`) instead of a regressor
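
As a starting point for the time features in item 1, the sketch below derives departure hour and day of week from a timestamp column. The column name `legs0_departureAt` is an assumption about the dataset schema, and `add_time_features`, `dep_hour` and `dep_dow` are hypothetical names; adjust them to the columns actually present in `train`.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, col: str = "legs0_departureAt") -> pd.DataFrame:
    """Add departure hour (0-23) and day of week (Monday = 0) from a timestamp column."""
    out = df.copy()
    # Unparseable values become NaT, which yields NaN in the derived columns
    ts = pd.to_datetime(out[col], errors="coerce")
    out["dep_hour"] = ts.dt.hour
    out["dep_dow"] = ts.dt.dayofweek
    return out


toy = pd.DataFrame({
    "legs0_departureAt": ["2024-05-20T07:30:00", "2024-05-25T18:05:00"],
})
timed = add_time_features(toy)
print(timed[["dep_hour", "dep_dow"]])
```

The toy rows are a Monday morning and a Saturday evening departure, the kind of contrast that may separate business from leisure itineraries.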