# Baseline Improvement

This notebook seeks to improve upon the learned baseline from 02RegressionBaseline.

We will explore options such as:

* Removing features that are either non-linear or don't aid in prediction
* Different scaling techniques
* Other learning algorithms

RESULT

The random forest model beats our learned baseline by a substantial margin.  This model then becomes the new baseline to measure against.  
These notebooks are meant to demonstrate how an ML project is an iterative process of continuous improvement.  The goal here is not an exhaustive analysis of the best possible model for this dataset.

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import ElasticNet
from sklearn import linear_model
from sklearn import svm
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn import preprocessing


In [2]:
housing = fetch_california_housing()

In [3]:
# use a random state for reproducible results
x_train, x_test, y_train, y_test = train_test_split(housing.data,
                                                    housing.target,
                                                    test_size=0.1,
                                                    random_state=66)

# Remove non-linear features

We suspect that the latitude and longitude features relate non-linearly to the target and don't help a linear model learn.  Let's remove them and see whether the fit improves or whether the model finds more value in the other features.

In [4]:
# keep the first six columns; latitude and longitude are the last two
x_train = x_train[:,0:6]
x_test = x_test[:,0:6]
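
Slicing by position relies on latitude and longitude being the last two columns. A minimal sketch of selecting columns by name instead, assuming the feature order that `fetch_california_housing` reports in `housing.feature_names`. The helper `drop_features` and the stand-in array `X_demo` are hypothetical and not part of the original notebook:

```python
import numpy as np

# Feature order as reported by housing.feature_names (hard-coded here as an
# assumption so the example runs without downloading the dataset).
FEATURE_NAMES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]


def drop_features(X, feature_names, to_drop):
    """Return X without the named columns, plus the names that remain."""
    keep = [i for i, name in enumerate(feature_names) if name not in to_drop]
    return X[:, keep], [feature_names[i] for i in keep]


# Tiny stand-in array with one column per feature.
X_demo = np.arange(16, dtype=float).reshape(2, 8)
X_reduced, kept_names = drop_features(X_demo, FEATURE_NAMES,
                                      {"Latitude", "Longitude"})
print(kept_names)
```

Selecting by name keeps the intent visible in the code and would still work if the column order changed.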

In [5]:
scores = cross_val_score(linear_model.LinearRegression(), x_train, y_train, cv=5)

In [6]:
print(f'Cross Validation Scores (training data): {scores}')
print(f'Average of all CV scores (training data): {scores.mean()}')

Cross Validation Scores (training data): [0.52686209 0.53295152 0.53429725 0.53464017 0.5338919 ]
Average of all CV scores (training data): 0.5325285837990277


In [7]:
# Now fit the model on all train data
model = linear_model.LinearRegression().fit(x_train, y_train)

In [8]:
# Predict and score the test set to report our baseline performance.
y_pred = model.predict(x_test)
test_r2 = r2_score(y_test, y_pred)

In [9]:
print(f"Baseline regression using linear regression on test data : {test_r2:.3f}")

Baseline regression using linear regression on test data : 0.563


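We see that latitude and longitude do, in this case, help our model learn.  Without them, the cross-validation average is about 0.53 and the test R² is 0.563, which is worse than the learned baseline.  
This goes against our intuition, but we'll continue to include them as they improve the result.  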

In [10]:
# put latitude and longitude back
# use a random state for reproducible results
x_train, x_test, y_train, y_test = train_test_split(housing.data,
                                                    housing.target,
                                                    test_size=0.1,
                                                    random_state=66)

# Other models

It makes sense to explore both linear and non-linear models.

## Linear models

The candidates here are taken from scikit-learn's "choosing the right estimator" documentation.


### Ridge Regression

In [11]:
scores = cross_val_score(linear_model.Ridge(alpha=0.5), x_train, y_train, cv=5)

In [12]:
print(f'Cross Validation Scores (training data): {scores}')
print(f'Average of all CV scores (training data): {scores.mean()}')

Cross Validation Scores (training data): [0.60400196 0.59333461 0.59603562 0.598151   0.60854108]
Average of all CV scores (training data): 0.6000128537068031


It performs almost identically to plain linear regression, so it is probably not necessary to investigate further.
We could perform a grid search over the alpha parameter, but some simple experimentation suggests that is not likely to help.  
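
For reference, a grid search over `alpha` could look roughly like the sketch below. It is not the experiment run in this notebook: the grid values are illustrative assumptions, and synthetic data from `make_regression` stands in for `x_train` and `y_train` so the example runs offline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (x_train, y_train); an assumption for this sketch.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=66)

# Hypothetical grid; the values are illustrative, not tuned.
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 10.0, 100.0]}

# 5-fold cross-validation over every alpha, scored with R^2.
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Best mean CV R^2: {search.best_score_:.3f}")
```

On the real data, `search.fit(x_train, y_train)` would replace the synthetic arrays.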

## Non-linear models

Although this dataset seems to lend itself to linear models, we should take some time to explore both linear and non-linear models.

The sklearn documentation suggests Support Vector Regression (SVR), so let's quickly evaluate it.

We will also evaluate a Random Forest, which is often a good starting point in a non-linear setting.
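
One way to keep such a comparison tidy is to loop over the candidate estimators with the same cross-validation setup. The sketch below is an assumption about how that could be organised, not the code used in this notebook. It runs on synthetic data from `make_regression`, and the dictionary `candidates` and its keys are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for (x_train, y_train); an assumption for this sketch.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0,
                                 random_state=66)

# Hypothetical candidate list mirroring the models discussed here.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.5),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=66),
}

# Mean 5-fold R^2 for each candidate.
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    mean_scores[name] = scores.mean()

# Report from best to worst.
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1],
                          reverse=True):
    print(f"{name:>14}: {score:.3f}")
```

The ranking on synthetic data says nothing about the housing data; the point is only the evaluation pattern.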

### Support Vector Machine (SVM)

In [13]:
scores = cross_val_score(svm.SVR(), x_train, y_train, cv=5)

In [14]:
print(f'Cross Validation Scores (training data): {scores}')
print(f'Average of all CV scores (training data): {scores.mean()}')

Cross Validation Scores (training data): [-0.03746847 -0.02152099 -0.03421931 -0.02271538 -0.01690274]
Average of all CV scores (training data): -0.026565379387379462


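With a negative R², this model does worse than a naive baseline that always predicts the mean, so it is not likely to be of benefit as configured.  It is doubtful that a grid search around the default parameters alone would change that, although SVR is sensitive to feature scale and was evaluated here on unscaled inputs.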
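
SVR with the default RBF kernel is distance-based, so features on very different scales can dominate the kernel. A minimal sketch of comparing SVR with and without standardization follows. It assumes synthetic data with one deliberately inflated column, and whether scaling helps on the housing data is not something this notebook measured.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (x_train, y_train); an assumption for this sketch.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=66)
# Inflate one column to mimic features on very different scales.
X_demo[:, 0] *= 1000.0

# Standardize inside the pipeline so each CV fold is scaled on its own
# training split only.
scaled_svr = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

unscaled_scores = cross_val_score(SVR(), X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=5)

print(f"Unscaled SVR mean CV R^2: {unscaled_scores.mean():.3f}")
print(f"Scaled SVR mean CV R^2:   {scaled_scores.mean():.3f}")
```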

### Random Forest

In [15]:
scores = cross_val_score(RandomForestRegressor(), x_train, y_train, cv=5)

In [16]:
print(f'Cross Validation Scores (training data): {scores}')
print(f'Average of all CV scores (training data): {scores.mean()}')

Cross Validation Scores (training data): [0.79647755 0.81136297 0.80457527 0.7966518  0.80779239]
Average of all CV scores (training data): 0.8033719965204252


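The random forest regressor beats our previous baseline, with a cross-validation average of about 0.80 versus about 0.60 for the linear models.  

Now fit on all of the training data, then predict and score the test set to report the new baseline performance.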

In [17]:
model = RandomForestRegressor().fit(x_train, y_train)

In [18]:
# Predict and score the test set to report our baseline performance.
y_pred = model.predict(x_test)
test_r2 = r2_score(y_test, y_pred)

In [19]:
print(f"Random Forest regression on test data : {test_r2:.3f}")

Random Forest regression on test data : 0.836


# Scaling techniques

We know from Exploration that some of the input features are not clean: they contain outliers or have been capped at a certain value.  Experimenting with different types of scaling in sklearn lets us explore whether this approach improves upon our learned baseline.

Note that a Random Forest model is not sensitive to feature scaling, so these techniques won't be applied to the current "winning" model.  We'll apply them to the standard linear model from earlier, as linear models often perform better on standardized data.

In [20]:
# Create a pipeline with scaling and the model
pipeline = Pipeline([('scaler', preprocessing.RobustScaler()), ('linear_regression', linear_model.LinearRegression())])

In [21]:
# Perform cross-validation
scores = cross_val_score(pipeline, x_train, y_train, cv=5)  # cv is the number of folds

In [22]:
print(f'Average of all CV scores: {scores.mean()}')

Average of all CV scores: 0.6000078375660192


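Performance stays constant with scaled data, indicating that the model did not benefit from scaling.  This is expected for ordinary least squares with an intercept: rescaling each feature changes the coefficients but not the predictions, so the score is unchanged.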
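
The unchanged score can be checked directly by comparing predictions from a plain linear regression with those from the scaled pipeline. The sketch below assumes synthetic data from `make_regression` in place of the housing data. The pipeline mirrors the one above, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for (x_train, y_train); an assumption for this sketch.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=66)

# Plain ordinary least squares.
plain = LinearRegression().fit(X_demo, y_demo)

# The same model behind a RobustScaler.
scaled = Pipeline([("scaler", RobustScaler()),
                   ("linear_regression", LinearRegression())])
scaled.fit(X_demo, y_demo)

# Per-feature rescaling is absorbed by the coefficients and intercept,
# so both models should produce the same predictions.
same_predictions = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))
print(f"Predictions match: {same_predictions}")
```

Scaling is more likely to matter for regularized or distance-based models, where the penalty or the kernel depends on feature magnitude.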