# Linear Regression on Boston Housing Dataset

In [1]:
import numpy as np
import pandas as pd

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

## Import the dataset into a Pandas DataFrame

In [2]:
boston_data = pd.read_csv("./datasets/Boston_Housing.csv", delimiter=",", names=column_names)
boston_data.head(10)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
1,0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
2,0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
6,0.02985,0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
8,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
9,0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93,16.5
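
Row 0 of the output above is the file's own lowercase header. Because `names=` is passed without `header=0`, pandas treats that line as data and reads every column as text. Below is a minimal sketch of one way around this, assuming the file really does start with a header line. The two data rows are copied from the output above, and `csv_text` is only a stand-in for the real file.

```python
import io

import pandas as pd

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Stand-in for Boston_Housing.csv: a header line plus two rows from the output above
csv_text = (
    "crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv\n"
    "0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24\n"
    "0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6\n"
)

# names= alone: the header line becomes row 0 and the columns hold text
with_header_row = pd.read_csv(io.StringIO(csv_text), names=column_names)

# header=0 tells pandas to replace the existing header line with names=
clean = pd.read_csv(io.StringIO(csv_text), header=0, names=column_names)

print(with_header_row.shape, clean.shape)
print(clean.dtypes["MEDV"])
```

With `header=0` there would be no header row to skip later on, so any `iloc[1:]` slice written for the original load would have to be dropped as well.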


## Preprocess the data and trim NaN values

In [9]:
boston_data.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

As we can see, no column contains missing (NaN) values, so there is nothing to trim.
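
Note that `isna()` only flags missing values. Text that cannot be a number, such as a stray header row, passes unnoticed. Here is a minimal sketch of a stricter check on a made-up two-column frame (`toy` is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Toy frame whose first row is a leftover header, stored as text
toy = pd.DataFrame({"RM": ["rm", "6.575", "6.421"],
                    "MEDV": ["medv", "24", "21.6"]})

# isna() sees three non-missing strings per column
missing_before = toy.isna().sum()

# Convert to numbers; anything unparseable becomes NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")
missing_after = numeric.isna().sum()

# Now the bad row can be trimmed
cleaned = numeric.dropna()

print(missing_before.sum(), missing_after.sum(), len(cleaned))
```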

## Split model data with 70% for training

In [10]:
from sklearn.model_selection import train_test_split
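#Row 0 holds the CSV's own header (see the head() output above), so skip it
#The columns were read as text, so cast features and target to float
X = np.array(boston_data.iloc[1:, 0:13], dtype=float)
y = np.array(boston_data["MEDV"][1:], dtype=float)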

#Testing data is 0.3 of dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 5)
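
To see what `test_size = 0.3` does to the array shapes, here is a small sketch on a made-up 10-row array (the `toy_X` and `toy_y` data are assumptions, not part of the Boston set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 2 features each, plus 10 targets
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=5)

# 7 rows go to training and 3 to testing
print(a_train.shape, a_test.shape)

# Reusing the same random_state reproduces the same split
again = train_test_split(toy_X, toy_y, test_size=0.3, random_state=5)
same_split = np.array_equal(a_test, again[1])
print(same_split)
```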


> ## A quick detour into the Pandas *iloc[]* indexer

In [11]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]

df = pd.DataFrame(mydict)
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


`.iloc[]` selects primarily by integer position (from 0 to length-1 of the axis), but it may also be used with a boolean array.\
Here we look at some examples:

> ### Row Indexing

In [12]:
df.iloc[0], df.iloc[1], df.iloc[2]

(a    1
 b    2
 c    3
 d    4
 Name: 0, dtype: int64,
 a    100
 b    200
 c    300
 d    400
 Name: 1, dtype: int64,
 a    1000
 b    2000
 c    3000
 d    4000
 Name: 2, dtype: int64)

In [13]:
df.iloc[[0]], df.iloc[[1]], df.iloc[[0, 1]]

(   a  b  c  d
 0  1  2  3  4,
      a    b    c    d
 1  100  200  300  400,
      a    b    c    d
 0    1    2    3    4
 1  100  200  300  400)

In [14]:
df.iloc[1:3]

Unnamed: 0,a,b,c,d
1,100,200,300,400
2,1000,2000,3000,4000


In [15]:
df.iloc[[False, True, True]]

Unnamed: 0,a,b,c,d
1,100,200,300,400
2,1000,2000,3000,4000


In [16]:
df.iloc[[True, False, True]]

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000
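
Because `.iloc[]` counts positions rather than reading index labels, the two can drift apart once rows are sliced away. The sketch below reuses the same toy frame. The comparison with `.loc[]` is an addition for illustration only.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3, 'd': 4},
                   {'a': 100, 'b': 200, 'c': 300, 'd': 400},
                   {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}])

# Drop the first row; the remaining rows keep their labels 1 and 2
tail = df.iloc[1:]

# Position 0 of the sliced frame is the row labelled 1
by_position = tail.iloc[0]
by_label = tail.loc[1]

print(list(tail.index))
print(by_position['a'], by_label['a'])
```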


> ### Row-column Indexing

In [17]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000


In [18]:
df.iloc[2,1], df.iloc[1, 2]

(2000, 300)

In [19]:
df.iloc[[0,2], [1,3]]

Unnamed: 0,b,d
0,2,4
2,2000,4000


In [20]:
df.iloc[1:2,1:4]

Unnamed: 0,b,c,d
1,200,300,400


According to the pandas docs, when a boolean array is passed to *iloc[]*, its length must match the length of the axis it indexes (rows or columns).\
For example, `df.iloc[:, [True, True, True, False]]` raises an error unless the DataFrame has exactly 4 columns.

In [21]:
df.iloc[[True, False, True], :]

Unnamed: 0,a,b,c,d
0,1,2,3,4
2,1000,2000,3000,4000
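
The cell above masks rows. The same length rule applies to columns, as this minimal sketch on the same toy frame shows (the variable names `first_three` and `mask_error` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3, 'd': 4},
                   {'a': 100, 'b': 200, 'c': 300, 'd': 400},
                   {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}])

# Four booleans for four columns: keeps a, b and c
first_three = df.iloc[:, [True, True, True, False]]

# Three booleans for four columns: the lengths disagree, so pandas raises
try:
    df.iloc[:, [True, True, True]]
    mask_error = None
except IndexError as exc:
    mask_error = exc

print(list(first_three.columns))
print(type(mask_error).__name__)
```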


## Moving on: Using linear regression on training data

In [22]:
from sklearn.linear_model import LinearRegression

#Load our first model
model = LinearRegression()

#Train model on training data
model.fit(x_train, y_train)

#Predict the testing data for later evaluation of model's performance
model_predict = model.predict(x_test)
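
To make the fit/predict steps concrete, here is a small sketch on made-up data that follows y = 2x + 1 exactly (the toy arrays are assumptions). On such data the fitted slope and intercept should recover 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One feature, four samples, targets generated from y = 2x + 1
toy_x = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = 2.0 * toy_x.ravel() + 1.0

toy_model = LinearRegression().fit(toy_x, toy_y)

# Learned parameters of the line
slope = toy_model.coef_[0]
intercept = toy_model.intercept_

# Prediction for an unseen input x = 4
prediction = toy_model.predict(np.array([[4.0]]))[0]

print(slope, intercept, prediction)
```

On the Boston model, the same attributes (`model.coef_`, `model.intercept_`) hold one weight per feature column plus the bias term.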

## Evaluating the Model

In [24]:
from sklearn.metrics import mean_squared_error

#Root-mean-squared error (RMSE) for model, in the same units as MEDV
model_rmse = np.sqrt(mean_squared_error(y_test, model_predict))
print(f"\nRMSE for Linear Regression = {model_rmse}")


RMSE for Linear Regression = 5.540490745781327
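
The value printed above is the square root of the mean squared error (RMSE), so it is expressed in the same units as MEDV. Here is a minimal sketch of the calculation on made-up numbers (the toy arrays are assumptions), done by hand and checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

toy_true = np.array([3.0, 5.0, 2.0])
toy_pred = np.array([2.0, 5.0, 4.0])

# Residuals are 1, 0 and -2; their squares average to 5/3
rmse_by_hand = np.sqrt(np.mean((toy_true - toy_pred) ** 2))

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(toy_true, toy_pred))

print(rmse_by_hand, rmse_sklearn)
```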
