# *Demo*: Multivariable Regression on Boston Housing Data

Code that should already have come before...

Let's read in the data and see what it looks like...

In [1]:
import pandas as pd
import numpy as np

names =['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                 header=None, sep=r'\s+', names=names, na_values='?')  # whitespace-delimited file

df.head(6)  # print the first six samples

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222.0,18.7,394.12,5.21,28.7


## Forming the Feature Vectors
We want to collect our features into feature vectors (stacked into a feature matrix). Here we compare the pandas and NumPy data types and see why `df['feature'].values` is needed to get a NumPy array back.

In [2]:
features = df.columns.tolist()
features.remove('PRICE')

x = df[features[0]]
xn = df[features[0]].values

print(features)
print(x[:3]) # pandas datatype
print(xn[:3]) # numpy array

['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
0    0.00632
1    0.02731
2    0.02729
Name: CRIM, dtype: float64
[0.00632 0.02731 0.02729]
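
To make the difference concrete, here is a minimal sketch with two made-up Series (`a` and `b` are illustrative and not part of the housing data). pandas aligns on index labels before doing arithmetic, while NumPy works position by position, which is one reason to pull out `.values` before building a feature matrix.

```python
import numpy as np
import pandas as pd

# Two illustrative Series with the same labels stored in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

# pandas matches entries by index label before adding.
label_sum = a + b

# NumPy ignores labels and adds element by element in storage order.
positional_sum = a.values + b.values

print(type(a), type(a.values))
print(label_sum)
print(positional_sum)
```

`Series.to_numpy()` returns the same array and is the accessor the pandas documentation recommends over `.values`.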


Treat all the features as a vector, $\mathbf{x}$, and stack the samples in an $N$ by $D$ matrix, $X$, where $N$ is the number of samples and $D$ is the number of features.

In [3]:
X = df[features].values

print(X.shape)

(506, 13)


## Normalizing the Data
Normalize each feature by $z = \frac{x - \bar{x}}{\sigma_x}$, where $\bar{x}$ and $\sigma_x$ are that feature's mean and standard deviation. Putting every feature on the same scale lets us compare the learned parameters to see which features are most important for determining the output.

It's good practice to check that what you're doing makes sense by looking at the shapes of your arrays. Through *broadcasting*, NumPy will sometimes let an operation on mismatched shapes run without error when you would expect it to fail.

In [4]:
Xbar = np.mean(X,axis=0,keepdims=True)
print(Xbar.shape)
Xstd = np.std(X,axis=0,keepdims=True)
print(Xstd.shape)

X = (X-Xbar) / Xstd

with np.printoptions(precision=2,suppress=True):
    print(X[:5,:5])
    
(N,D) = X.shape
print(X.shape)

(1, 13)
(1, 13)
[[-0.42  0.28 -1.29 -0.27 -0.14]
 [-0.42 -0.49 -0.59 -0.27 -0.74]
 [-0.42 -0.49 -0.59 -0.27 -0.74]
 [-0.42 -0.49 -1.31 -0.27 -0.84]
 [-0.41 -0.49 -1.31 -0.27 -0.84]]
(506, 13)
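
The shape check matters because broadcasting can also go wrong silently. The sketch below uses small random arrays (the sizes and names are assumptions chosen for illustration). It first repeats the normalization with `keepdims=True`, then shows a common pitfall: subtracting an `(N, 1)` column from an `(N,)` vector yields an `(N, N)` matrix instead of raising an error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(6, 3))  # 6 samples, 3 features

# keepdims=True gives shape (1, 3), which broadcasts across the 6 rows.
Xbar = np.mean(X, axis=0, keepdims=True)
Xstd = np.std(X, axis=0, keepdims=True)
Z = (X - Xbar) / Xstd

# Each column of Z should now have mean 0 and standard deviation 1.
print(Z.shape)
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))

# Pitfall: a (6,) vector minus a (6, 1) column broadcasts to (6, 6).
y_flat = rng.normal(size=6)
yhat_col = rng.normal(size=(6, 1))
residual_wrong = y_flat - yhat_col
residual_right = y_flat.reshape(-1, 1) - yhat_col
print(residual_wrong.shape, residual_right.shape)
```

This is why the target is reshaped with `reshape(-1,1)` later in the demo before it is compared with the predictions.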


In [5]:
bias_term = np.ones((X.shape[0],1))
X = np.hstack([bias_term, X])
print(X.shape)

(506, 14)
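
The column of ones lets the model learn an intercept as its first weight. As a rough sketch on synthetic data (the sizes, true weights, and noise level are assumptions), the example below shows that when the other columns are centered, the least-squares intercept equals the mean of the target.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3

# Synthetic features, standardized the same way as in the demo.
X = rng.normal(size=(N, D))
X = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)

# Synthetic target with a known offset of 4.0 plus a little noise.
y = 4.0 + X @ np.array([[1.5], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(N, 1))

# Prepend the bias column and solve the least-squares problem.
A = np.hstack([np.ones((N, 1)), X])
w = np.linalg.lstsq(A, y, rcond=None)[0]

# With centered features, the first weight matches the mean of y.
print(w[0, 0], y.mean())
```

By the same argument, the first entry of `w` in the LS solution for the housing data should equal the mean of `PRICE`.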


## LS Solution
Using NumPy and the scikit-learn module.

In [6]:
y = df['PRICE'].values.reshape(-1,1)

w = np.linalg.lstsq(X,y, rcond=-1)[0]

with np.printoptions(precision=3, suppress=True):
    print(w)

yhat = np.matmul(X,w)

RMSE = np.sqrt( np.mean( (y-yhat)**2 ) )
print("RMSE = %.3f" % RMSE)

[[22.533]
 [-0.928]
 [ 1.082]
 [ 0.141]
 [ 0.682]
 [-2.057]
 [ 2.674]
 [ 0.019]
 [-3.104]
 [ 2.662]
 [-2.077]
 [-2.061]
 [ 0.849]
 [-3.744]]
RMSE = 4.679
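
For intuition about what `np.linalg.lstsq` returns, the sketch below compares it with the normal-equations solution $w = (X^T X)^{-1} X^T y$ on random data (the matrix `A` and its sizes are assumptions for illustration). The two agree when the columns are linearly independent, and the residual is orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 40, 4

# Random design matrix with a bias column, and a random target.
A = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
y = rng.normal(size=(N, 1))

# Least-squares solution from NumPy.
w_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]

# Same solution from the normal equations (A^T A) w = A^T y.
w_normal = np.linalg.solve(A.T @ A, A.T @ y)
print(np.allclose(w_lstsq, w_normal))

# At the optimum the residual is orthogonal to the columns of A.
yhat = A @ w_lstsq
print(np.allclose(A.T @ (y - yhat), 0.0))
```

`lstsq` is generally the safer choice in practice because it does not form $X^T X$ explicitly and still works when the matrix is rank deficient.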


In [7]:
from sklearn.linear_model import LinearRegression

y = df['PRICE'].values.reshape(-1,1)

regr = LinearRegression(fit_intercept=False)
regr.fit(X,y)
yhat = regr.predict(X)

w = regr.coef_

RMSE = np.sqrt( np.mean( (y-yhat)**2 ) )
print("RMSE = %.3f" % RMSE)

RMSE = 4.679
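
The cell above passes `fit_intercept=False` because `X` already contains a column of ones. As a hedged sketch on synthetic data (the array names and sizes are assumptions), the alternative is to leave the ones column out and let scikit-learn fit the intercept itself; both routes should give the same parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
N, D = 60, 3
F = rng.normal(size=(N, D))  # features without a bias column
y = 2.0 + F @ np.array([[1.0], [-1.0], [0.5]]) + 0.1 * rng.normal(size=(N, 1))

# Route 1: add the ones column by hand and disable the built-in intercept.
A = np.hstack([np.ones((N, 1)), F])
manual = LinearRegression(fit_intercept=False).fit(A, y)

# Route 2: let scikit-learn estimate the intercept separately.
auto = LinearRegression(fit_intercept=True).fit(F, y)

# Stack intercept and coefficients so both routes can be compared.
w_manual = manual.coef_.ravel()
w_auto = np.concatenate([np.ravel(auto.intercept_), auto.coef_.ravel()])
print(np.allclose(w_manual, w_auto))
```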


Print the first ten values of the ground truth (left column) and the model predictions (right column) to get a feel for our LS solution.

In [8]:
Y = np.hstack([y, yhat])
with np.printoptions(precision=2):
    print(Y[:10,:])

[[24.   30.  ]
 [21.6  25.03]
 [34.7  30.57]
 [33.4  28.61]
 [36.2  27.94]
 [28.7  25.26]
 [22.9  23.  ]
 [27.1  19.54]
 [16.5  11.52]
 [18.9  18.92]]


Print the parameters to see which carry the most weight. Because the features were normalized, their magnitudes can be compared directly. It seems that the proportion of lower-income people living in the neighborhood (`LSTAT`, the last parameter) is the most significant indicator of housing prices.

In [9]:
with np.printoptions(precision=2,suppress=True):
    print(w.reshape(-1,1))

[[22.53]
 [-0.93]
 [ 1.08]
 [ 0.14]
 [ 0.68]
 [-2.06]
 [ 2.67]
 [ 0.02]
 [-3.1 ]
 [ 2.66]
 [-2.08]
 [-2.06]
 [ 0.85]
 [-3.74]]
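
Reading the ranking off the column by eye is error prone, so here is a small sketch that sorts the features by the absolute value of their weights. The numbers are copied from the three-decimal output of the NumPy solution earlier in the demo; the variable names `weights`, `order`, and `ranked` are illustrative.

```python
import numpy as np

features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
            'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Weights copied from the printed LS solution (bias term first).
w = np.array([22.533, -0.928, 1.082, 0.141, 0.682, -2.057, 2.674,
              0.019, -3.104, 2.662, -2.077, -2.061, 0.849, -3.744])

weights = w[1:]                       # drop the bias term
order = np.argsort(-np.abs(weights))  # indices by decreasing magnitude
ranked = [features[i] for i in order]

# Show the four features with the largest weights.
for i in order[:4]:
    print(f"{features[i]:8s} {weights[i]:+.3f}")
```

On these values the top of the ranking is `LSTAT`, `DIS`, `RM`, `RAD`, and `AGE` comes last.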


## Feature Selection
Based on the magnitudes of the learned parameters `w`, can you select the two most important features? What is the MSE if you use only those features in your solution?

In [10]:
feats = (np.abs(w)>3).flatten()[1:]
print(feats)

features = np.array(features)
print(features[feats])

X = df[features[feats]].values
X = np.hstack([bias_term, X])

[False False False False False False False  True False False False False
  True]
['DIS' 'LSTAT']


In [11]:
regr = LinearRegression(fit_intercept=False)
regr.fit(X,y)
yhat = regr.predict(X)

w = regr.coef_

MSE = np.sum( (yhat - y)**2 ) / N   # mean squared error
RMSE = np.sqrt(MSE)                 # same units as PRICE
print("MSE = %.3f" % MSE)
print("RMSE = %.3f" % RMSE)
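
To show what to expect from this kind of feature selection, the sketch below repeats the experiment on synthetic data (the true weights, noise level, and the helper `fit_ls` are assumptions for illustration, not results from the housing data). It fits all features, keeps the two with the largest weights, and refits. The training RMSE of the reduced model can never be lower than that of the full model.

```python
import numpy as np

def fit_ls(A, y):
    """Least-squares fit of y on the columns of A; returns (weights, training RMSE)."""
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    rmse = np.sqrt(np.mean((y - A @ w) ** 2))
    return w, rmse

rng = np.random.default_rng(4)
N, D = 200, 5
F = rng.normal(size=(N, D))

# Two features (columns 0 and 2) dominate the synthetic target.
true_w = np.array([[3.0], [0.2], [-2.5], [0.1], [0.3]])
y = 1.0 + F @ true_w + 0.5 * rng.normal(size=(N, 1))

ones = np.ones((N, 1))
w_full, rmse_full = fit_ls(np.hstack([ones, F]), y)

# Keep the two features with the largest absolute weight (bias excluded).
top2 = np.argsort(-np.abs(w_full[1:, 0]))[:2]
w_top2, rmse_top2 = fit_ls(np.hstack([ones, F[:, top2]]), y)

print("selected columns:", sorted(top2.tolist()))
print("RMSE (all features) = %.3f" % rmse_full)
print("RMSE (two features) = %.3f" % rmse_top2)
```

How much the error grows depends on how much the dropped features contribute, so the gap on the housing data has to be measured rather than assumed.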