# Notebook 2: Null Regressor Model

**Model 1**: Benchmark (null) model that always predicts the mean of the target

**Reason for model**: to establish a baseline result that future models must improve upon

**Metric**: RMSE

**Reason for metric**: RMSE penalises large errors more heavily than small ones, which is the priority here.
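
To make the choice of metric concrete, here is a minimal sketch comparing RMSE with MAE on two toy error patterns. The helper names `rmse` and `mae` and the toy arrays are illustrative assumptions, not part of this project's code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: squares each error before averaging."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: averages the absolute errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy data (illustrative only): the same total absolute error, spread differently
y_true = np.zeros(4)
even_errors = np.array([10.0, 10.0, 10.0, 10.0])
one_large_error = np.array([0.0, 0.0, 0.0, 40.0])

print(mae(y_true, even_errors), rmse(y_true, even_errors))          # 10.0 10.0
print(mae(y_true, one_large_error), rmse(y_true, one_large_error))  # 10.0 20.0
```

Both patterns have the same MAE, but RMSE doubles when the error is concentrated in a single large miss.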

In [1]:
ROOT_PATH_FROM_NOTEBOOK = ".."
DATA_PATH = "data"
PROCESSED_DATA_PATH = "processed"
DATASET_NAME = "all_data_transformed.parquet"

df_path = f"{ROOT_PATH_FROM_NOTEBOOK}/{DATA_PATH}/{PROCESSED_DATA_PATH}/{DATASET_NAME}"

In [2]:
import sys
import os

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

In [3]:
# Set Pandas option to show all columns in prints
pd.set_option('display.max_columns', None)

In [4]:
# Get the current working directory
current_dir = os.getcwd()

# Add the src directory to sys.path to use custom functions
sys.path.append(os.path.abspath(os.path.join(current_dir, '..', 'src')))

In [5]:
from models.null import SklearnNullRegressor

In [6]:
df = pd.read_parquet(df_path)

In [7]:
df.head()

Unnamed: 0,flightDayOfWeekSin,flightDayOfWeekCos,flightMonthSin,flightMonthCos,flightHourSin,flightHourCos,flightMinuteSin,flightMinuteCos,timeDeltaDays,travelDurationDay,totalTravelDistance,totalFare,isBasicEconomy,isRefundable,isNonStop,numLegs,business,coach,first,premium coach
0,-0.433884,-0.900969,0.5,-0.866025,0.0,1.0,-0.2079117,0.978148,34,0.40625,1931.0,129.58,-1,-1,-1,2,-1,1,-1,-1
1,-0.433884,-0.900969,0.5,-0.866025,1.0,6.123234000000001e-17,0.9510565,-0.309017,34,0.181944,1947.0,228.59,-1,-1,1,1,-1,1,-1,-1
2,0.433884,-0.900969,0.5,-0.866025,-0.9659258,-0.258819,-0.9510565,-0.309017,33,0.182639,1947.0,228.59,-1,-1,1,1,-1,1,-1,-1
3,0.433884,-0.900969,0.5,-0.866025,1.224647e-16,-1.0,5.665539e-16,-1.0,33,0.253472,1947.0,301.58,-1,-1,-1,2,-1,1,-1,-1
4,0.433884,-0.900969,0.5,-0.866025,-0.258819,-0.9659258,0.0,1.0,33,0.255556,1947.0,309.58,-1,-1,-1,2,-1,1,-1,-1


## Random sampling

The dataset is fairly large, so I will draw a random sample of 50,000 observations, without replacement, on which to test a variety of machine learning models.

For all subsequent experiments, this same sample will be used, so I will perform a two-sample Kolmogorov-Smirnov test on each column to test the null hypothesis $H_0$ that the sample and the full dataset come from the same probability distribution. I will keep resampling until every column achieves a p-value of at least 0.5. This is far above the usual significance level $\alpha=0.05$, so $H_0$ is not rejected for any column: the test gives no evidence that the sample is distributed differently from the full dataset.
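
The custom `ks_test` helper is not shown in this notebook. As a rough illustration of the idea only, a per-column two-sample KS test could be sketched with `scipy.stats.ks_2samp` as below. The function name `ks_test_sketch` and the synthetic data are assumptions, and the real helper may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_test_sketch(sample: pd.DataFrame, population: pd.DataFrame) -> dict:
    """Hypothetical stand-in for ks_test: {column: p-value} per column."""
    return {
        col: float(ks_2samp(sample[col], population[col]).pvalue)
        for col in population.columns
    }


# Synthetic data for illustration only (not the flight dataset)
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "a": rng.normal(size=5000),
    "b": rng.uniform(size=5000),
})
sample = population.sample(n=500, replace=False, random_state=1)

p_values = ks_test_sketch(sample=sample, population=population)
print(p_values)
```

A high p-value here only means the test found no evidence of a difference. For binary or other discrete columns the KS p-value is conservative, since the test is designed for continuous distributions.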

In [8]:
from models.stats import ks_test

In [9]:
random_state = 0

df_sample = None

p_values_dict = {col: 0.0 for col in df.columns}

In [10]:
while not all(p >= 0.5 for p in p_values_dict.values()):
    random_state += 1
    df_sample = df.sample(
        n=50000, replace=False, random_state=random_state)
    p_values_dict = ks_test(sample=df_sample, population=df)
    print(f"Random state: {random_state}, {p_values_dict}")

Random state: 1, {'flightDayOfWeekSin': np.float64(0.6568596667360618), 'flightDayOfWeekCos': np.float64(0.5368644651571088), 'flightMonthSin': np.float64(0.9999993451378185), 'flightMonthCos': np.float64(0.9999982100100006), 'flightHourSin': np.float64(0.6583487941790305), 'flightHourCos': np.float64(0.8180809319276218), 'flightMinuteSin': np.float64(0.897973812594328), 'flightMinuteCos': np.float64(0.48996142053407854), 'timeDeltaDays': np.float64(0.8781072351889596), 'travelDurationDay': np.float64(0.24576270607136508), 'totalTravelDistance': np.float64(0.8865257034605957), 'totalFare': np.float64(0.360449514286697), 'isBasicEconomy': np.float64(0.9837906466664834), 'isRefundable': np.float64(1.0), 'isNonStop': np.float64(0.9999999999999999), 'numLegs': np.float64(0.9999999999999999), 'business': np.float64(1.0), 'coach': np.float64(1.0), 'first': np.float64(1.0), 'premium coach': np.float64(1.0)}
Random state: 2, {'flightDayOfWeekSin': np.float64(0.9912414447993826), 'flightDayOfWe

In [11]:
# This is a fairly good sample: no column's KS test rejects H0 (all p-values >= 0.5)
print(p_values_dict)

{'flightDayOfWeekSin': np.float64(0.9793205533599805), 'flightDayOfWeekCos': np.float64(0.7977704716464755), 'flightMonthSin': np.float64(0.9964357541446297), 'flightMonthCos': np.float64(0.9999956675333632), 'flightHourSin': np.float64(0.623355116992715), 'flightHourCos': np.float64(0.9579251473990231), 'flightMinuteSin': np.float64(0.5509482712847211), 'flightMinuteCos': np.float64(0.7273591285213872), 'timeDeltaDays': np.float64(0.8960621536929619), 'travelDurationDay': np.float64(0.9088387737274017), 'totalTravelDistance': np.float64(0.795064542667175), 'totalFare': np.float64(0.9390834343145299), 'isBasicEconomy': np.float64(1.0), 'isRefundable': np.float64(1.0), 'isNonStop': np.float64(1.0), 'numLegs': np.float64(0.9999999999948779), 'business': np.float64(1.0), 'coach': np.float64(1.0), 'first': np.float64(1.0), 'premium coach': np.float64(1.0)}


In [12]:
df_sample.to_parquet(f"{ROOT_PATH_FROM_NOTEBOOK}/{DATA_PATH}/{PROCESSED_DATA_PATH}/data_sample.parquet", index=False)

## Pop the target column

In [13]:
y = df_sample.pop('totalFare')
X = df_sample

## Null Regressor with Cross Validation

To get a fair comparison with all other models, which will employ cross-validation for model selection, I will also cross-validate the baseline model.

Each model will go through 5-fold cross-validation using the sample dataset.
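
The source of `SklearnNullRegressor` is not shown here. As a sketch of the same idea under that caveat, scikit-learn's built-in `DummyRegressor(strategy="mean")` also predicts the training mean for every row and can be passed to `cross_validate` in the same way. The synthetic `X_demo` and `y_demo` below are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Synthetic data for illustration only (not the flight dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(loc=300.0, scale=50.0, size=200)

# A mean-only baseline: ignores X and predicts the mean of the training target
baseline = DummyRegressor(strategy="mean")

scores = cross_validate(
    baseline, X_demo, y_demo, cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# The scorer returns negated RMSE, so flip the sign to report RMSE
train_rmse = -float(scores["train_score"].mean())
val_rmse = -float(scores["test_score"].mean())
print(f"RMSE Train: {train_rmse}")
print(f"RMSE Val: {val_rmse}")
```

Because the features are ignored, train and validation RMSE should be close to each other, as long as the folds have similar target distributions.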

In [16]:
# Install my custom package from Assignment 2
# Source code here: https://github.com/nicnl31/assignment2_pkg_11919925
! pip install -i https://test.pypi.org/simple/ assignment2_pkg_11919925

Looking in indexes: https://test.pypi.org/simple/
Collecting assignment2_pkg_11919925
  Using cached https://test-files.pythonhosted.org/packages/56/79/e68a19a7ce3f46c686492a7aa9d9f538346ec109eca03c04e0f79fb51e1b/assignment2_pkg_11919925-1.0.10-py3-none-any.whl.metadata (1.5 kB)
Using cached https://test-files.pythonhosted.org/packages/56/79/e68a19a7ce3f46c686492a7aa9d9f538346ec109eca03c04e0f79fb51e1b/assignment2_pkg_11919925-1.0.10-py3-none-any.whl (5.7 kB)
Installing collected packages: assignment2_pkg_11919925
Successfully installed assignment2_pkg_11919925-1.0.10


In [14]:
from assignment2_pkg_11919925.metrics.regression import print_regressor_scores

In [15]:
# Instantiate the null regressor
null_reg = SklearnNullRegressor()

In [16]:
# Cross validate 5 folds
cv_results = cross_validate(null_reg, X, y, cv=5,
                            scoring="neg_root_mean_squared_error", 
                            return_train_score=True)

In [17]:
# Print regression scores
train_score = -float(cv_results["train_score"].mean())
val_score = -float(cv_results["test_score"].mean())
print(f"RMSE Train: {train_score}")
print(f"RMSE Val: {val_score}")

RMSE Train: 208.41857342335229
RMSE Val: 208.42276109810624


**Observations**: This is the benchmark to beat for subsequent models.
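
A useful sanity check on a mean-only benchmark: its training RMSE equals the population standard deviation of the target, so the benchmark value is essentially a measure of how spread out the target is. The sketch below shows this on synthetic data; `y_demo` is an illustrative assumption, not the fare column.

```python
import numpy as np

# Synthetic, right-skewed target for illustration only
rng = np.random.default_rng(42)
y_demo = rng.gamma(shape=2.0, scale=150.0, size=1000)

# Predict the mean for every observation
mean_prediction = np.full_like(y_demo, y_demo.mean())

rmse_of_mean = float(np.sqrt(np.mean((y_demo - mean_prediction) ** 2)))
std_of_target = float(np.std(y_demo))  # ddof=0, i.e. population standard deviation

print(rmse_of_mean, std_of_target)  # the two values agree up to floating-point error
```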

Next I will check the linearity between covariates and the target, to see if a linear model suffices.

## Linearity check between covariates and the target

In [18]:
# Check correlation with the target variable
for col in X.columns:
    print(f"Correlation between {col} and {y.name}: {X[col].corr(y)}")

Correlation between flightDayOfWeekSin and totalFare: -0.12272635012020762
Correlation between flightDayOfWeekCos and totalFare: 0.01701503187273621
Correlation between flightMonthSin and totalFare: -0.0640203782082558
Correlation between flightMonthCos and totalFare: -0.04954129888303736
Correlation between flightHourSin and totalFare: 0.01274291148020228
Correlation between flightHourCos and totalFare: -0.02144172877213112
Correlation between flightMinuteSin and totalFare: -0.011751724229015608
Correlation between flightMinuteCos and totalFare: 0.005478573356582721
Correlation between timeDeltaDays and totalFare: 0.06563269007972668
Correlation between travelDurationDay and totalFare: 0.4139419531542999
Correlation between totalTravelDistance and totalFare: 0.5628821303880182
Correlation between isBasicEconomy and totalFare: -0.24733559297020452
Correlation between isRefundable and totalFare: nan
Correlation between isNonStop and totalFare: -0.28710524454620506
Correlation between nu


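Most covariates are only weakly linearly correlated with the target: of the correlations shown, the strongest are totalTravelDistance (about 0.56) and travelDurationDay (about 0.41), and the rest are below 0.3 in absolute value. A linear model is therefore unlikely to suffice, so a nonlinear model is the next step.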
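
Two caveats about reading these numbers, sketched on toy data below. Pearson correlation measures only linear association, so a value near zero can coexist with a strong nonlinear relationship. It is also undefined (nan) for a column with zero variance, which is one possible explanation for the isRefundable result above. The series used here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
x = pd.Series(np.linspace(-1.0, 1.0, 201))
y_quadratic = x ** 2                        # fully determined by x, but not linearly
constant = pd.Series(np.full(201, -1.0))    # a column that never changes

# Suppress the divide-by-zero-variance warning raised for the constant column
with np.errstate(all="ignore"):
    r_nonlinear = x.corr(y_quadratic)
    r_constant = constant.corr(y_quadratic)

print(f"Pearson r for y = x^2: {r_nonlinear:.6f}")     # approximately zero
print(f"Pearson r for constant column: {r_constant}")  # nan
```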

**Next model**: Support Vector Regression model.