# Introduction

This notebook contains a naive baseline and a learned baseline. The learned baseline includes a brief section on explainability: which features have the strongest predictive capacity.

A naive baseline uses no input features. A baseline of this kind is needed in any supervised ML project as a reference point: a model that cannot beat it has not learned anything from the features.

A learned baseline uses the input features to build a model. It must beat the naive baseline to demonstrate any capacity to learn. I have seen instances in my career where machine-learned models produced predictions but were unable to learn from the input features.

I like to consider supervised ML a model for continuous improvement. A baseline model capable of learning from input features could potentially be improved with different techniques (feature engineering, different learning algorithms). Once a baseline is eclipsed by an improved model, the improved model becomes the new baseline. As real-world problems get more complex, the number of techniques that could be applied grows almost without limit. Approaching these problems this way helps you recognize when you have reached the point of diminishing returns and the known techniques for improving model performance have essentially been exhausted.

This dataset is a relatively simple and contrived example, but it is meant to show the process: a naive baseline, a simple learned baseline, and then additional, possibly more complex, models meant to beat the learned baseline.

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


In [2]:
housing = fetch_california_housing()

In [3]:
print(f"Features are a np.array of shape {housing.data.shape}")

Features are a np.array of shape (20640, 8)


In [4]:
print(f"Target variable is a np.array of shape {housing.target.shape}")

Target variable is a np.array of shape (20640,)


In [5]:
# use a random state for reproducible results
x_train, x_test, y_train, y_test = train_test_split(housing.data,
                                                    housing.target,
                                                    test_size=0.1,
                                                    random_state=66)

In [6]:
print('Train:')
print(x_train.shape)
print(y_train.shape)

print('Test:')
print(x_test.shape)
print(y_test.shape)

Train:
(18576, 8)
(18576,)
Test:
(2064, 8)
(2064,)


# Naive baseline

When solving a problem with machine learning, I like to approach it as a process of continual improvement or experimentation. The first step is to compute a naive baseline that predicts using only the target value, with no consideration of the input features. This gives a reference score: a learned model must beat it to show that it has actually learned from the input features.


In [7]:
naive_baseline_array = np.full(y_test.shape[0], np.mean(y_train))

In [8]:
naive_baseline_r2 = r2_score(y_test, naive_baseline_array)

In [9]:
print(f"Naive baseline without considering input features: {naive_baseline_r2:.3f}")

Naive baseline without considering input features: -0.001


The naive baseline uses the mean of the target value from the training set. This single value is the "predicted" value for every test record when scoring.
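The same naive baseline can also be expressed with scikit-learn's `DummyRegressor`, which ignores the features and predicts the training mean. The following is a minimal sketch on synthetic data; the `*_demo` arrays are made up for illustration and are not the housing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption: illustration only, not the housing set)
rng = np.random.default_rng(0)
x_train_demo = rng.normal(size=(100, 3))
y_train_demo = rng.normal(loc=2.0, size=100)
x_test_demo = rng.normal(size=(20, 3))
y_test_demo = rng.normal(loc=2.0, size=20)

# strategy="mean" predicts the mean of the training target for every record
dummy = DummyRegressor(strategy="mean").fit(x_train_demo, y_train_demo)
dummy_pred = dummy.predict(x_test_demo)

# Same predictions as the np.full(...) construction used above
manual_pred = np.full(y_test_demo.shape[0], np.mean(y_train_demo))
same_predictions = np.allclose(dummy_pred, manual_pred)

dummy_r2 = r2_score(y_test_demo, dummy_pred)
print(f"Predictions match manual baseline: {same_predictions}")
print(f"DummyRegressor R^2 on held-out data: {dummy_r2:.3f}")
```

Using an estimator object for the baseline means it can be passed to the same tools as a learned model, such as `cross_val_score`.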

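As expected, the score is approximately 0 (here -0.001). It is not exactly 0 because the predicted mean comes from the training set, which differs slightly from the mean of the test set. This is consistent with how the R^2 metric works: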

From sklearn docs:  
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a score of 0.0.
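To see why a constant prediction scores at 0 or slightly below, R^2 can be computed by hand as 1 - SS_res / SS_tot. This is a small sketch with made-up numbers (not taken from the housing data); `r2_manual` is a hypothetical helper written for this illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets for illustration
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])  # mean 2.5
y_test_demo = np.array([2.0, 3.0, 4.0])        # mean 3.0

# Predicting the test set's own mean: SS_res equals SS_tot, so R^2 is 0
r2_own_mean = r2_manual(y_test_demo, np.full(3, y_test_demo.mean()))

# Predicting the training mean instead: SS_res exceeds SS_tot, so R^2 is negative
r2_train_mean = r2_manual(y_test_demo, np.full(3, y_train_demo.mean()))

print(r2_own_mean)
print(r2_train_mean)
```

The closer the training mean is to the test mean, the closer the second score gets to 0, which is why a large random split typically yields a value only slightly below 0.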

# Learned baseline

Use a simple linear regression model. The purpose is not to build the best possible model, but rather a simple baseline to compare later models against.

In [10]:
# Perform cross-validation
scores = cross_val_score(linear_model.LinearRegression(), x_train, y_train, cv=5)  # cv is the number of folds

In [11]:
print(f'Cross Validation Scores (training data): {scores}')

Cross Validation Scores (training data): [0.60399736 0.59329824 0.59604499 0.5981567  0.60854189]


In [12]:
print(f'Average of all CV scores (training data): {scores.mean()}')

Average of all CV scores (training data): 0.6000078375660196


In [13]:
# Now fit the model on all train data
model = linear_model.LinearRegression().fit(x_train, y_train)

In [14]:
# Predict and score the test set to report our baseline performance.
y_pred = model.predict(x_test)
test_r2 = r2_score(y_test, y_pred)

In [15]:
print(f"Baseline regression using linear regression on test data : {test_r2:.3f}")

Baseline regression using linear regression on test data : 0.628


## Notes

* Statistical models such as those in sklearn often expect, and benefit from, standardized or scaled data. Standardization rescales each feature to mean 0 and standard deviation 1 (it does not make the feature normally distributed), while min-max scaling places data points within fixed boundaries without changing the shape of the distribution. One would want to explore these options to determine the best approach for pre-processing the data set.

* Latitude and Longitude are included in this fit. It is worth experimenting with removing them, since their relationship to the target is unlikely to be linear and a linear model cannot capture a non-linear relationship.

* We will likely want to experiment with other models and algorithms to beat the baseline built in this notebook.
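The difference between standardization and min-max scaling mentioned in the first note can be checked directly. This is a minimal sketch on a made-up matrix (`x_demo` is not the housing data), using `StandardScaler` and `MinMaxScaler` from `sklearn.preprocessing`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix with columns on very different scales
x_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 2000.0],
                   [4.0, 6000.0]])

# Standardization: each column ends up with mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(x_demo)

# Min-max scaling: each column is mapped onto the range [0, 1]
scaled = MinMaxScaler().fit_transform(x_demo)

print("standardized mean:", standardized.mean(axis=0))
print("standardized std: ", standardized.std(axis=0))
print("scaled min:       ", scaled.min(axis=0))
print("scaled max:       ", scaled.max(axis=0))
```

In a real experiment the scaler would be fit on the training split only and then applied to the test split, so that no information from the test data leaks into pre-processing.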


# Basic evaluation

I would like to evaluate this model at a basic level. Start by looking at the feature coefficients to see which features the model weights most and least heavily.

In [16]:
model.coef_

array([ 4.33790154e-01,  9.25129024e-03, -1.03444726e-01,  6.23072795e-01,
       -5.76182061e-06, -5.16710201e-03, -4.22392973e-01, -4.35627014e-01])

In [17]:
housing.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [18]:
# get the coefficients from the fitted model
coefficients = model.coef_
# and the intercept
intercept = model.intercept_

In [19]:
features_and_coefficients = dict(zip(housing.feature_names, coefficients))

In [20]:
# sort the features by the absolute value of their coefficients
dict(sorted(features_and_coefficients.items(), key=lambda item: abs(item[1]), reverse=True))


{'AveBedrms': 0.6230727951255777,
 'Longitude': -0.43562701416538097,
 'MedInc': 0.433790154007385,
 'Latitude': -0.4223929731129496,
 'AveRooms': -0.10344472613882673,
 'HouseAge': 0.009251290237474749,
 'AveOccup': -0.0051671020124161105,
 'Population': -5.7618206134466165e-06}

## Notes 

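The model coefficients show the direction (positive or negative) and size of each feature's effect on the target variable per unit change in that feature. Because the features here are unscaled and measured in different units (Population is a raw count, for example, while AveBedrms is a ratio), the raw magnitudes are not directly comparable across features and should be read as a rough guide only.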

AveBedrms is interesting: it doesn't quite follow the plot found in the exploration notebook. This coefficient shows the effect of AveBedrms while holding all other variables constant. It does not necessarily imply causation.

Longitude and Latitude also report a strong effect on the target variable (both coefficients are negative). This may or may not reflect a real relationship: their effect on the target might be linear, or it might not. A strong linear fit on this dataset could be coincidental and should not be expected to hold in general.
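Coefficient magnitudes depend on the units of each feature, so a ranking by raw coefficients can differ from a ranking after standardization. The sketch below illustrates this on synthetic data; the feature names `small_scale` and `large_scale` and the step names in the pipeline are hypothetical, and none of the numbers come from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: illustration only)
rng = np.random.default_rng(42)
n = 500
small_scale = rng.normal(0.0, 1.0, size=n)      # ratio-like feature
large_scale = rng.normal(0.0, 1000.0, size=n)   # count-like feature
x_demo = np.column_stack([small_scale, large_scale])

# Per standard deviation, large_scale moves y by about 10 and small_scale by about 1
y_demo = 1.0 * small_scale + 0.01 * large_scale + rng.normal(0.0, 0.1, size=n)

# Fit on raw features: coefficients are per-unit effects
raw_model = LinearRegression().fit(x_demo, y_demo)

# Fit on standardized features: coefficients are per-standard-deviation effects
scaled_model = Pipeline([
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
]).fit(x_demo, y_demo)

raw_coef = raw_model.coef_
std_coef = scaled_model.named_steps["regress"].coef_

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
print("R^2 raw:   ", raw_model.score(x_demo, y_demo))
print("R^2 scaled:", scaled_model.score(x_demo, y_demo))
```

For ordinary least squares the fit quality is unchanged by standardization; only the interpretation of the coefficients changes, which is what makes the standardized version more suitable for ranking features.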