# Problem 3 -- Boston Data Set

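In this problem, we build regression models on the Boston housing data set to predict two of its columns: NOX (nitric oxides concentration) and MEDV (median value of owner-occupied homes).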

## 1. Data loading
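The first line of the CSV file is the header, which is discarded when loading and replaced by the column names passed in names. Note that header=1 combined with names also drops the first data row, so the frame below starts at the second record; header=0 would keep every row.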

In [1]:
import pandas as pd

dftrain = pd.read_csv(
  'boston_data.csv',
  header=1,
  names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'])

Now the data looks like this.

In [2]:
print(dftrain.head(5))

      CRIM   ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO  \
0  0.02731  0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8   
1  0.02729  0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8   
2  0.03237  0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7   
3  0.06905  0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7   
4  0.02985  0.0   2.18     0  0.458  6.430  58.7  6.0622    3  222.0     18.7   

        B  LSTAT  MEDV  
0  396.90   9.14  21.6  
1  392.83   4.03  34.7  
2  394.63   2.94  33.4  
3  396.90   5.33  36.2  
4  394.12   5.21  28.7  
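
The interaction between `header` and `names` can be checked on a small in-memory CSV. The sketch below uses a made-up three-row file (the column names `a`, `b`, `x`, `y` are hypothetical, not taken from boston_data.csv) and only illustrates how many rows survive each setting.

```python
import io
import pandas as pd

# Hypothetical CSV with one header line and three data rows.
csv_text = "a,b\n1,2\n3,4\n5,6\n"
names = ["x", "y"]

# header=0: line 0 is the header and is replaced by `names`; all 3 data rows are kept.
df_header0 = pd.read_csv(io.StringIO(csv_text), header=0, names=names)

# header=1: line 1 is treated as the header, so lines 0 and 1 are both consumed.
df_header1 = pd.read_csv(io.StringIO(csv_text), header=1, names=names)

print(len(df_header0), len(df_header1))
```

If every record should be kept, `header=0` is the setting to use together with `names`.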


## 2. Data preprocessing

Since we predict two different columns (NOX and MEDV), the shared preprocessing consists only of shuffling. Popping the label column and splitting into training and test sets are done separately for each target.

In [3]:
from sklearn.utils import shuffle
dftrain = shuffle(dftrain, random_state=8)
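
As a small illustration of what `shuffle` does to a DataFrame, the sketch below uses a toy frame (not the Boston data). It shows that a fixed `random_state` makes the permutation reproducible and that the original index labels travel with their rows, which is why the printed samples further down carry labels such as 272 and 391.

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy frame for illustration only; row i holds the value (i + 1) * 10.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

shuffled_a = shuffle(df, random_state=8)
shuffled_b = shuffle(df, random_state=8)

# Same seed, same permutation; index labels stay attached to their rows.
print(shuffled_a.index.tolist())
print(shuffled_a.equals(shuffled_b))
```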

## 3. Model training: regression against NOX

First we pop and split the data.

In [4]:
from sklearn.model_selection import train_test_split

dflabel = dftrain.pop('NOX')

x_train, x_test, y_train, y_test = train_test_split(dftrain, dflabel, test_size=0.2, random_state=42)

print('Training features are:\n', x_train.head(3))

print('Training labels are:\n', y_train.head(3))

Training features are:
          CRIM    ZN  INDUS  CHAS     RM   AGE     DIS  RAD    TAX  PTRATIO  \
272   0.22188  20.0   6.96     1  7.691  51.8  4.3665    3  223.0     18.6   
391  11.57790   0.0  18.10     0  5.036  97.0  1.7700   24  666.0     20.2   
308   0.34940   0.0   9.90     0  5.972  76.7  3.1025    4  304.0     18.4   

          B  LSTAT  MEDV  
272  390.77   6.58  35.2  
391  396.90  25.68   9.7  
308  396.24   9.97  20.3  
Training labels are:
 272    0.464
391    0.700
308    0.544
Name: NOX, dtype: float64
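
To make the effect of `test_size=0.2` concrete, here is a minimal sketch on a toy array of 10 samples (an assumption for illustration, not the Boston data). It shows the resulting 80/20 split and that features and labels stay aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: sample i has features [2*i, 2*i + 1] and label i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
# The first feature of each training row is still twice its label.
print((X_train[:, 0] == 2 * y_train).all())
```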


### 3.1. Linear regression

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

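# normalize=True was removed in scikit-learn 1.2; it does not change
# ordinary least squares predictions, so it is dropped here.
linear_reg = LinearRegression()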
linear_reg.fit(x_train, y_train)

print('Linear reg mean squared error for train: %.4f' % mean_squared_error(linear_reg.predict(x_train), y_train))
print('Linear reg mean squared error for test: %.4f' % mean_squared_error(linear_reg.predict(x_test), y_test))

Linear reg mean squared error for train: 0.0030
Linear reg mean squared error for test: 0.0026


Next, we run 5-fold cross-validation on the full data set.

In [6]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(linear_reg, dftrain, dflabel, cv=5, scoring='neg_mean_squared_error')

print('Scores for negative mean squared error: ', scores)
print("Mean negative MSE: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Scores for negative mean squared error:  [-0.00320939 -0.00341716 -0.00270497 -0.00364481 -0.00300694]
Mean negative MSE: -0.0032 (+/- 0.0006)
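
The scores are negative because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch of converting the scores back to MSE and RMSE, using synthetic data from `make_regression` as a stand-in for the Boston features (an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only to have something to score.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")

mse = -neg_mse        # flip the sign to get the usual MSE per fold
rmse = np.sqrt(mse)   # RMSE is in the same units as the target

print("MSE per fold:", mse)
print("Mean RMSE: %.4f" % rmse.mean())
```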


### 3.2. Lasso (L1 reg)

In [7]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.01)
lasso_reg.fit(x_train, y_train)

print('Lasso mean squared error for training: %.4f' % mean_squared_error(lasso_reg.predict(x_train), y_train))
print('Lasso mean squared error for evaluation: %.4f' % mean_squared_error(lasso_reg.predict(x_test), y_test))

Lasso mean squared error for training: 0.0032
Lasso mean squared error for evaluation: 0.0027


### 3.3. Ridge (L2 reg)

In [8]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.01, solver="cholesky")
ridge_reg.fit(x_train, y_train)

print('Ridge mean squared error for training: %.4f' % mean_squared_error(ridge_reg.predict(x_train), y_train))
print('Ridge mean squared error for evaluation: %.4f' % mean_squared_error(ridge_reg.predict(x_test), y_test))

Ridge mean squared error for training: 0.0030
Ridge mean squared error for evaluation: 0.0026


### 3.4. Elastic net (L1 and L2 reg)

In [9]:
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.1)
elastic_net.fit(x_train, y_train)

print('Elastic-net mean squared error for training: %.4f' % mean_squared_error(elastic_net.predict(x_train), y_train))
print('Elastic-net mean squared error for evaluation: %.4f' % mean_squared_error(elastic_net.predict(x_test), y_test))

Elastic-net mean squared error for training: 0.0030
Elastic-net mean squared error for evaluation: 0.0026
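
The four linear models above share the same fit-and-score pattern, so they can also be compared in one loop. The following is a sketch on synthetic data (`make_regression` is a stand-in, and the `models` and `results` names are made up for this example); the hyperparameters mirror the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels used above.
X, y = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "ridge": Ridge(alpha=0.01, solver="cholesky"),
    "elastic_net": ElasticNet(alpha=0.01, l1_ratio=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # mean_squared_error takes (y_true, y_pred); store (train MSE, test MSE).
    results[name] = (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )
    print("%-12s train MSE: %.4f  test MSE: %.4f" % (name, *results[name]))
```

Unregularized least squares always has the lowest training MSE of the four; the regularized variants trade a little training error for smaller coefficients.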


### 3.5. Gradient boosted tree regressor

In [10]:
from sklearn import ensemble

# The 'ls' loss was renamed to 'squared_error' in scikit-learn 1.0.
params = {'n_estimators': 500, 'max_depth': 5, 'min_samples_split': 4, 'learning_rate': 0.01, 'loss': 'squared_error'}
gbtree_reg = ensemble.GradientBoostingRegressor(**params)
gbtree_reg.fit(x_train, y_train)

print('Gradient boosted tree reg mean squared error for train: %.4f' % mean_squared_error(gbtree_reg.predict(x_train), y_train))
print('Gradient boosted tree reg mean squared error for test: %.4f' % mean_squared_error(gbtree_reg.predict(x_test), y_test))


Gradient boosted tree reg mean squared error for train: 0.0001
Gradient boosted tree reg mean squared error for test: 0.0008


Cross-validation confirms that the error is consistent across folds.

In [11]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import cross_val_score

scores = cross_val_score(gbtree_reg, dftrain, dflabel, cv=5, scoring='neg_mean_squared_error')

print('Scores for negative mean squared error: ', scores)
print("Mean negative MSE: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Scores for negative mean squared error:  [-0.00122487 -0.00132428 -0.00093174 -0.00131197 -0.00095223]
Mean negative MSE: -0.0011 (+/- 0.0003)


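The gradient boosted tree model gives the best NOX predictions: its test MSE is 0.0008 against roughly 0.0026 to 0.0027 for the linear models, and its mean cross-validated MSE is 0.0011 against 0.0032 for linear regression.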

## 4. Model training: regression against MEDV
The NOX column was popped from dftrain above, so we reload and reshuffle the data before popping MEDV and splitting.

In [12]:
dftrain = pd.read_csv(
  'boston_data.csv',
  header=1,
  names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'])
dftrain = shuffle(dftrain, random_state=8)
dflabel = dftrain.pop('MEDV')

x_train, x_test, y_train, y_test = train_test_split(dftrain, dflabel, test_size=0.2, random_state=42)

print('Training features are:\n', x_train.head(3))
print('Training labels are:\n', y_train.head(3))

Training features are:
          CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
272   0.22188  20.0   6.96     1  0.464  7.691  51.8  4.3665    3  223.0   
391  11.57790   0.0  18.10     0  0.700  5.036  97.0  1.7700   24  666.0   
308   0.34940   0.0   9.90     0  0.544  5.972  76.7  3.1025    4  304.0   

     PTRATIO       B  LSTAT  
272     18.6  390.77   6.58  
391     20.2  396.90  25.68  
308     18.4  396.24   9.97  
Training labels are:
 272    35.2
391     9.7
308    20.3
Name: MEDV, dtype: float64


### 4.1. Linear regression

In [13]:
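# normalize=True was removed in scikit-learn 1.2; it does not change
# ordinary least squares predictions, so it is dropped here.
linear_reg = LinearRegression()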
linear_reg.fit(x_train, y_train)

print('Linear reg mean squared error for train: %.4f' % mean_squared_error(linear_reg.predict(x_train), y_train))
print('Linear reg mean squared error for test: %.4f' % mean_squared_error(linear_reg.predict(x_test), y_test))

Linear reg mean squared error for train: 22.2100
Linear reg mean squared error for test: 21.2507


### 4.2. Lasso (L1 reg)

In [14]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.01)
lasso_reg.fit(x_train, y_train)

print('Lasso mean squared error for training: %.4f' % mean_squared_error(lasso_reg.predict(x_train), y_train))
print('Lasso mean squared error for evaluation: %.4f' % mean_squared_error(lasso_reg.predict(x_test), y_test))

Lasso mean squared error for training: 22.2456
Lasso mean squared error for evaluation: 21.5944


### 4.3. Ridge (L2 reg)

In [15]:
ridge_reg = Ridge(alpha=0.01, solver="cholesky")
ridge_reg.fit(x_train, y_train)

print('Ridge mean squared error for training: %.4f' % mean_squared_error(ridge_reg.predict(x_train), y_train))
print('Ridge mean squared error for evaluation: %.4f' % mean_squared_error(ridge_reg.predict(x_test), y_test))

Ridge mean squared error for training: 22.2100
Ridge mean squared error for evaluation: 21.2597


### 4.4. Elastic net (L1 and L2 reg)

In [16]:
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.1)
elastic_net.fit(x_train, y_train)

print('Elastic-net mean squared error for training: %.4f' % mean_squared_error(elastic_net.predict(x_train), y_train))
print('Elastic-net mean squared error for evaluation: %.4f' % mean_squared_error(elastic_net.predict(x_test), y_test))

Elastic-net mean squared error for training: 22.6320
Elastic-net mean squared error for evaluation: 22.4864


### 4.5. Gradient boosted tree regressor

In [17]:
from sklearn import ensemble

# The 'ls' loss was renamed to 'squared_error' in scikit-learn 1.0.
params = {'n_estimators': 500, 'max_depth': 5, 'min_samples_split': 4, 'learning_rate': 0.01, 'loss': 'squared_error'}
gbtree_reg = ensemble.GradientBoostingRegressor(**params)
gbtree_reg.fit(x_train, y_train)

print('Gradient boosted tree reg mean squared error for train: %.4f' % mean_squared_error(gbtree_reg.predict(x_train), y_train))
print('Gradient boosted tree reg mean squared error for test: %.4f' % mean_squared_error(gbtree_reg.predict(x_test), y_test))

Gradient boosted tree reg mean squared error for train: 0.7233
Gradient boosted tree reg mean squared error for test: 8.9893
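
The gap between training and test error suggests the boosted model fits the training split much more closely than unseen data. One way to inspect this is `staged_predict`, which yields the predictions after each boosting stage. The sketch below uses synthetic data and smaller settings than the cell above (both are assumptions for illustration), and it picks the stage on the held-out split only to show the mechanism; in practice a separate validation split would be the safer choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MEDV regression problem.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=5,
                                min_samples_split=4, learning_rate=0.05,
                                random_state=0)
gbr.fit(X_tr, y_tr)

# Held-out MSE after 1, 2, ..., n_estimators trees.
test_mse = [mean_squared_error(y_te, pred) for pred in gbr.staged_predict(X_te)]
best_stage = int(np.argmin(test_mse)) + 1

print("Best number of trees on the held-out split:", best_stage)
print("MSE at best stage: %.4f, at final stage: %.4f"
      % (min(test_mse), test_mse[-1]))
```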


Cross-validation gives a similar average error, although the scores vary noticeably between folds.

In [18]:
scores = cross_val_score(gbtree_reg, dftrain, dflabel, cv=5, scoring='neg_mean_squared_error')

print('Scores for negative mean squared error: ', scores)
print("Mean negative MSE: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Scores for negative mean squared error:  [ -6.10569118 -17.92030947  -8.30690375  -8.02371162  -9.43159699]
Mean negative MSE: -9.9576 (+/- 8.2454)


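As with NOX, the gradient boosted tree model gives the best results for MEDV: its test MSE is 8.99 against 21.25 to 22.49 for the linear models. Its much lower training MSE (0.72) indicates that the boosted model overfits the training split, so the cross-validated MSE (9.96 on average) is the more reliable estimate.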
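
Because MSE is in squared target units, taking the square root makes the MEDV errors easier to read: MEDV is recorded in thousands of dollars, so the RMSE is a typical error in thousands of dollars. A short sketch using the test MSE values printed in the outputs above (the dictionary names are made up for this example):

```python
import math

# Test-split MSE values copied from the printed outputs above.
test_mse = {"linear": 21.2507, "gradient_boosting": 8.9893}

# RMSE = sqrt(MSE), in the same units as MEDV.
test_rmse = {name: math.sqrt(value) for name, value in test_mse.items()}

for name, value in test_rmse.items():
    print("%s test RMSE: %.2f" % (name, value))
```

This gives an RMSE of about 4.61 for linear regression and about 3.00 for the gradient boosted trees.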