## Notebook 02: Feature Engineering

**Objective:**  
Transform the cleaned flight dataset into a stable, leakage-free feature set suitable for predictive modeling.  
Features are selected based on domain relevance, availability at booking time, and modeling cost–benefit trade-offs.

**Input:**  
- Cleaned dataset from Notebook 01

**Output:**  
- Final feature matrix (X)  
- Target variable (y)  
- Saved feature dataset for modeling notebooks


In [1]:
import pandas as pd
import numpy as np

## Load cleaned data and convert data types

In [2]:
df = pd.read_csv('../Data/Cleaned/cleaned_flight_data.csv')

date_cols = ["departure_date", "booking_date"]

for col in date_cols:
    df[col] = pd.to_datetime(df[col])

## Target variable transformation (Price)
Price is log-transformed to reduce right skew and stabilize variance.  
Model predictions are evaluated in log space and converted back for business interpretation.


In [3]:
df['target_price'] = np.log1p(df["price"])
df['target_price']

0       4.904460
1       5.079041
2       5.583158
3       4.763967
4       5.412583
          ...   
1804    6.501230
1805    5.205544
1806    6.458793
1807    6.311227
1808    5.555282
Name: target_price, Length: 1809, dtype: float64

## Booking and Temporal Features

### Days To Departure

Days-to-departure was represented both as a continuous variable and as booking-window bins.

They are binned by time periods i.e 1week, 2 week, 1 month, 3 months, 6 months. Price does not scale linearly with time and hence it makes more sense to bin the days before departure into bins of different time frames.

Binning reflects known airline pricing regimes and improves interpretability for linear models
and downstream decision logic.


In [4]:
df["days_to_departure"] = (
    pd.to_datetime(df["departure_date"]) -
    pd.to_datetime(df["booking_date"])
).dt.days

df["days_to_departure"] = df["days_to_departure"].clip(0, 365)

df["days_to_departure_bin"] = pd.cut(
    df["days_to_departure"],
    bins=[0, 7, 14, 30, 90, 180, 365],
    labels=False
)

## Calendar-Based Feature

### Deparutre Day and Weekend

Since weekend vs weekdays give a better indicator than individual days, only weekend indicator is used per model. Using binary encoding (0 for weekday and 1 for weekend)

In [5]:
df['departure_day'] = df['departure_date'].dt.day_name()
df["departure_month"] = df["departure_date"].dt.month
df["is_weekend"] = df["departure_day"].isin([5, 6]).astype(int)

### Month and Season

Month and season both represent annual demand cycles at different granularities.
To avoid redundancy, only one representation is used per model. Since season is a categorical feature, one-hot encoding is used.


In [6]:
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['departure_date'].dt.month.apply(get_season)
df["season"] = df["season"].astype("category")


Season was encoded using one-hot encoding to represent annual demand cycles without
imposing artificial ordinal relationships between seasons. Fall was used as the
reference category.


In [7]:

df = pd.get_dummies(
    df,
    columns=["season"],
    drop_first=True
)

df.head()

Unnamed: 0,booking_date,departure_date,origin,destination,airline,price,stops,target_price,days_to_departure,days_to_departure_bin,departure_day,departure_month,is_weekend,season_Spring,season_Summer,season_Winter
0,2023-09-03,2024-01-05,LHR,CDG,Ryanair,133.89,0,4.90446,124,4,Friday,1,0,False,False,True
1,2023-08-03,2024-01-14,BOM,DEL,Indigo,159.62,0,5.079041,164,4,Sunday,1,0,False,False,True
2,2023-11-19,2024-01-10,LHR,CDG,Ryanair,264.91,0,5.583158,52,3,Wednesday,1,0,False,False,True
3,2023-09-11,2024-02-03,BOM,DEL,Indigo,116.21,0,4.763967,145,4,Saturday,2,0,False,False,True
4,2023-10-26,2024-01-06,LHR,CDG,Ryanair,223.21,0,5.412583,72,3,Saturday,1,0,False,False,True


In [8]:
country_holidays = pd.read_csv("../Data/Raw/PublicHolidaysfor2024.csv")
country_holidays['date'] = pd.to_datetime(country_holidays['date'])
df['is_holiday'] = df['departure_date'].isin(country_holidays['date']).astype(int)

In [9]:
df.head()

Unnamed: 0,booking_date,departure_date,origin,destination,airline,price,stops,target_price,days_to_departure,days_to_departure_bin,departure_day,departure_month,is_weekend,season_Spring,season_Summer,season_Winter,is_holiday
0,2023-09-03,2024-01-05,LHR,CDG,Ryanair,133.89,0,4.90446,124,4,Friday,1,0,False,False,True,0
1,2023-08-03,2024-01-14,BOM,DEL,Indigo,159.62,0,5.079041,164,4,Sunday,1,0,False,False,True,0
2,2023-11-19,2024-01-10,LHR,CDG,Ryanair,264.91,0,5.583158,52,3,Wednesday,1,0,False,False,True,0
3,2023-09-11,2024-02-03,BOM,DEL,Indigo,116.21,0,4.763967,145,4,Saturday,2,0,False,False,True,0
4,2023-10-26,2024-01-06,LHR,CDG,Ryanair,223.21,0,5.412583,72,3,Saturday,1,0,False,False,True,0


## Catergorical Feature Encoding

### Route

Route represents the origin–destination city pair and is a nominal categorical feature.
In this dataset, only four unique routes are present, resulting in low cardinality.

Given the limited number of routes, route was encoded directly using one-hot encoding
without additional grouping. One route category was dropped to act as the reference
category, enabling the model to learn relative price differences across routes.


In [10]:
df['route'] = df['origin'] +'-'+ df['destination']
df = pd.get_dummies(
    df,
    columns=["route"],
    drop_first=True
)
df.dtypes

booking_date             datetime64[ns]
departure_date           datetime64[ns]
origin                           object
destination                      object
airline                          object
price                           float64
stops                             int64
target_price                    float64
days_to_departure                 int64
days_to_departure_bin             int64
departure_day                    object
departure_month                   int32
is_weekend                        int64
season_Spring                      bool
season_Summer                      bool
season_Winter                      bool
is_holiday                        int64
route_JFK-LAX                      bool
route_LHR-CDG                      bool
route_SYD-MEL                      bool
dtype: object

### Airline

Airline is a nominal categorical feature with low cardinality and no inherent ordering.
Different airlines apply distinct pricing strategies, service levels, and cost structures,
which can independently influence ticket prices.

To capture these effects without imposing artificial ordinal relationships, airline was
encoded using one-hot encoding. One category was dropped to serve as the reference level
and to avoid multicollinearity in linear models.


In [11]:
df["airline"] = df["airline"].str.upper().str.strip()

df = pd.get_dummies(
    df,
    columns=["airline"],
    drop_first=True
)
df.head()

Unnamed: 0,booking_date,departure_date,origin,destination,price,stops,target_price,days_to_departure,days_to_departure_bin,departure_day,...,season_Summer,season_Winter,is_holiday,route_JFK-LAX,route_LHR-CDG,route_SYD-MEL,airline_INDIGO,airline_QANTAS,airline_RYANAIR,airline_UNITED
0,2023-09-03,2024-01-05,LHR,CDG,133.89,0,4.90446,124,4,Friday,...,False,True,0,False,True,False,False,False,True,False
1,2023-08-03,2024-01-14,BOM,DEL,159.62,0,5.079041,164,4,Sunday,...,False,True,0,False,False,False,True,False,False,False
2,2023-11-19,2024-01-10,LHR,CDG,264.91,0,5.583158,52,3,Wednesday,...,False,True,0,False,True,False,False,False,True,False
3,2023-09-11,2024-02-03,BOM,DEL,116.21,0,4.763967,145,4,Saturday,...,False,True,0,False,False,False,True,False,False,False
4,2023-10-26,2024-01-06,LHR,CDG,223.21,0,5.412583,72,3,Saturday,...,False,True,0,False,True,False,False,False,True,False


## Feature Cleanup

In [12]:
delete_col = ['booking_date', 'departure_date', 'departure_day', 'origin', 'destination' , 'days_to_departure']
df = df.drop(columns = delete_col)

dummy_cols = df.select_dtypes(include="bool").columns
df[dummy_cols] = df[dummy_cols].astype(int)

df.head()

Unnamed: 0,price,stops,target_price,days_to_departure_bin,departure_month,is_weekend,season_Spring,season_Summer,season_Winter,is_holiday,route_JFK-LAX,route_LHR-CDG,route_SYD-MEL,airline_INDIGO,airline_QANTAS,airline_RYANAIR,airline_UNITED
0,133.89,0,4.90446,4,1,0,0,0,1,0,0,1,0,0,0,1,0
1,159.62,0,5.079041,4,1,0,0,0,1,0,0,0,0,1,0,0,0
2,264.91,0,5.583158,3,1,0,0,0,1,0,0,1,0,0,0,1,0
3,116.21,0,4.763967,4,2,0,0,0,1,0,0,0,0,1,0,0,0
4,223.21,0,5.412583,3,1,0,0,0,1,0,0,1,0,0,0,1,0


In [13]:
df.dtypes

price                    float64
stops                      int64
target_price             float64
days_to_departure_bin      int64
departure_month            int32
is_weekend                 int64
season_Spring              int64
season_Summer              int64
season_Winter              int64
is_holiday                 int64
route_JFK-LAX              int64
route_LHR-CDG              int64
route_SYD-MEL              int64
airline_INDIGO             int64
airline_QANTAS             int64
airline_RYANAIR            int64
airline_UNITED             int64
dtype: object

## Save Features to CSV

In [14]:
df.to_csv('../Data/Cleaned/flight_features.csv', index = False)