# Feature Engineering v1 (Booking-Focused)

### FE v1 — Step 1 (Notebook setup)

In [1]:
import pandas as pd
import numpy as np

# Load shared sample
df = pd.read_csv("../data_sample/train_100k.csv")

# Booking-only dataset
df_b = df[df["is_booking"] == 1].copy()

df_b.shape

(8058, 24)

In [2]:
# Notebook setup: allow imports from repo root when running inside /notebooks
import sys
sys.path.insert(0, "..")

from src.fe_v1 import make_features

### FE v1 — Step 2 (import feature function)

make_features() is a single function, imported from src.fe_v1 in the cell above, that:

- selects and derives the agreed v1 features

- returns two objects:

1. X (the feature matrix)

2. y (the target)
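The body of make_features() is not shown in this notebook. As a rough illustration only, a function with the same interface might look like the sketch below. Everything in it is an assumption: the name make_features_sketch, the input columns srch_ci, srch_co and orig_destination_distance, the 2-night cut-off for stay_type, and the distance thresholds are hypothetical and are not taken from src/fe_v1.py. It also derives only a handful of the 17 features.

```python
import numpy as np
import pandas as pd


def make_features_sketch(df: pd.DataFrame):
    """Hypothetical stand-in for make_features(); NOT the project's actual code."""
    # Assumed raw columns: srch_ci / srch_co (check-in / check-out dates)
    ci = pd.to_datetime(df["srch_ci"])
    co = pd.to_datetime(df["srch_co"])
    # Assumed raw column: orig_destination_distance (may contain NaN)
    dist = df["orig_destination_distance"]

    X = pd.DataFrame(index=df.index)
    X["is_mobile"] = df["is_mobile"]
    X["checkin_month"] = ci.dt.month
    X["length_of_stay"] = (co - ci).dt.days
    # Assumed cut-off: stays of up to 2 nights count as "short"
    X["stay_type"] = np.where(X["length_of_stay"] <= 2, "short", "long")
    X["distance_missing"] = dist.isna()
    # Assumed thresholds; missing distances get their own bucket
    X["distance_bucket"] = np.select(
        [dist.isna(), dist < 100, dist < 1000],
        ["missing", "near", "mid"],
        default="far",
    )

    y = df["hotel_cluster"]
    return X, y


# Toy rows, invented for illustration
toy = pd.DataFrame({
    "srch_ci": ["2014-07-01", "2014-12-24", "2014-03-10"],
    "srch_co": ["2014-07-03", "2014-12-31", "2014-03-11"],
    "orig_destination_distance": [50.0, np.nan, 2500.0],
    "is_mobile": [0, 1, 0],
    "hotel_cluster": [5, 91, 42],
})
X_toy, y_toy = make_features_sketch(toy)
print(X_toy)
```

Returning X and y together from one function keeps the feature logic in a single importable place, so every notebook that calls it gets the same columns.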

In [3]:
X, y = make_features(df_b)
X.shape, y.shape

((8058, 17), (8058,))

### FE v1 — Step 3 (apply the feature function + sanity checks)

In [4]:
X, y = make_features(df_b)

print("X shape:", X.shape)
print("y shape:", y.shape)

# quick checks
print("\nMissing values in X (top 10):")
print(X.isna().sum().sort_values(ascending=False).head(10))

print("\nDtypes in X:")
print(X.dtypes)


X shape: (8058, 17)
y shape: (8058,)

Missing values in X (top 10):
site_name           0
checkin_month       0
distance_missing    0
channel             0
is_package          0
is_mobile           0
stay_type           0
length_of_stay      0
srch_rm_cnt         0
posa_continent      0
dtype: int64

Dtypes in X:
site_name                    int64
posa_continent               int64
user_location_country        int64
user_location_region         int64
srch_destination_id          int64
srch_destination_type_id     int64
srch_adults_cnt              int64
srch_children_cnt            int64
srch_rm_cnt                  int64
checkin_month                int64
length_of_stay               int64
stay_type                   object
is_mobile                    int64
is_package                   int64
channel                      int64
distance_missing              bool
distance_bucket             object
dtype: object


**Code explained:**

X, y = make_features(df_b)

- builds the feature matrix X from the booking-only rows (df_b)

- builds the target vector y, which is the hotel_cluster column

The print statements check:

- shapes (X and y must have the same number of rows)

- missing values (many model implementations reject NaN inputs)

- data types (so we know what needs encoding)
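The first two checks can also be written so that they fail loudly instead of relying on someone reading the printed output. The helper below is a minimal sketch; check_features is a hypothetical name and is not part of src.fe_v1, and the toy data is invented.

```python
import pandas as pd


def check_features(X: pd.DataFrame, y: pd.Series) -> None:
    """Raise ValueError if X and y disagree in length or X has missing values."""
    if len(X) != len(y):
        raise ValueError(f"row mismatch: X has {len(X)} rows, y has {len(y)}")
    missing = X.isna().sum()
    missing = missing[missing > 0]  # keep only columns that have NaN
    if not missing.empty:
        raise ValueError(f"missing values found: {missing.to_dict()}")


# Toy data, invented for illustration
X_toy = pd.DataFrame({"is_mobile": [0, 1, 0], "stay_type": ["short", "long", "short"]})
y_toy = pd.Series([5, 91, 42], name="hotel_cluster")

check_features(X_toy, y_toy)  # passes silently when both checks hold
```

Calling a check like this right after make_features() would stop the notebook at the first inconsistency.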

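Key takeaways:

- Most features are int64; many of these are ID codes (for example site_name, srch_destination_id), so they are categorical in meaning even though they are stored as numbers

- Two features are string-valued categoricals (object dtype): stay_type, distance_bucket

- One feature is boolean: distance_missing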
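The same grouping can be derived from the dtypes rather than read off the printout. The sketch below uses a small invented frame with the same kinds of columns; X_toy and the variable names are illustrative only.

```python
import pandas as pd

# Toy frame mirroring the three dtype groups seen in X (values invented)
X_toy = pd.DataFrame({
    "site_name": [2, 11, 2],
    "stay_type": ["short", "long", "short"],
    "distance_missing": [False, True, False],
    "distance_bucket": ["near", "missing", "far"],
})

# Numeric columns (bool is not counted as "number" by select_dtypes)
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
# Boolean flags
bool_cols = X_toy.select_dtypes(include="bool").columns.tolist()
# Everything else: the string-valued columns that need encoding
string_cols = X_toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print("numeric:", numeric_cols)
print("boolean:", bool_cols)
print("string: ", string_cols)
```

Note that a dtype-based split cannot tell an ID code from a true quantity: both are int64, so that distinction still has to be made by hand.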

### FE v1 — Step 4 (prepare model-ready features)

In [5]:
categorical_cols = ["stay_type", "distance_bucket"]

X_enc = pd.get_dummies(X, columns=categorical_cols, drop_first=False)
X_enc.shape

(8058, 22)
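The column count is consistent with one-hot encoding: the 2 string columns are replaced by 7 indicator columns in total (17 - 2 + 7 = 22), while the integer and boolean columns pass through unchanged. The toy example below shows the same mechanics on invented data. The dtype=int argument, the cast of distance_missing to 0/1, and the reindex step for new data are optional additions for illustration, not part of the cell above.

```python
import pandas as pd

categorical_cols = ["stay_type", "distance_bucket"]

# Toy frame, invented for illustration
X_toy = pd.DataFrame({
    "site_name": [2, 11, 2],
    "stay_type": ["short", "long", "short"],
    "distance_missing": [False, True, False],
    "distance_bucket": ["near", "missing", "far"],
})

# One indicator column per category; untouched columns pass through
X_toy_enc = pd.get_dummies(X_toy, columns=categorical_cols, drop_first=False, dtype=int)
# Optional: store the boolean flag as 0/1 so every column is numeric
X_toy_enc["distance_missing"] = X_toy_enc["distance_missing"].astype(int)
print(X_toy_enc.shape)  # 2 untouched columns + 2 + 3 indicator columns

# New data may contain only some categories, so its dummies can differ.
X_new = pd.DataFrame({
    "site_name": [2],
    "stay_type": ["short"],
    "distance_missing": [False],
    "distance_bucket": ["near"],
})
X_new_enc = pd.get_dummies(X_new, columns=categorical_cols, dtype=int)
X_new_enc["distance_missing"] = X_new_enc["distance_missing"].astype(int)
# Align to the training columns; categories absent from X_new become 0
X_new_enc = X_new_enc.reindex(columns=X_toy_enc.columns, fill_value=0)
print(X_new_enc)
```

With drop_first=False every category keeps its own column, which is easy to read but leaves the indicator columns for each feature perfectly collinear; whether that matters depends on the model used later.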