# Boston housing prices, regression analysis

We investigate the relationship between the median value of owner-occupied homes (MEDV) and the other features available in the Boston housing dataset.

### Import libs and data

In [116]:
import numpy as np
import pandas as pd

In [117]:
from sklearn import datasets  # dataset loaders from scikit-learn
data = datasets.load_boston()  # load the Boston dataset (load_boston was removed in scikit-learn 1.2, so this needs an older version)
print(data.DESCR)  # print the data dictionary

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [118]:
# define the data/predictors as the pre-set feature names  
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

In [119]:
target

Unnamed: 0,MEDV
0,24.0
1,21.6
2,34.7
3,33.4
4,36.2
...,...
501,22.4
502,20.6
503,23.9
504,22.0


## statsmodels

First, let's look at the relationship between MEDV and RM (average number of rooms per dwelling) using ordinary least squares (OLS). OLS fits the regression line that minimizes the sum of squared vertical distances (residuals) between the observed values and the line.
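As a rough illustration of what "least squares" means, the sketch below fits a line with NumPy alone and computes the sum of squared residuals that OLS minimizes. The toy data are hypothetical (not the Boston dataset), and `np.linalg.lstsq` stands in for the statsmodels call used in the cells that follow.

```python
import numpy as np

# Hypothetical toy data (not the Boston dataset): y is exactly 2*x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones so the fitted line can have an intercept.
A = np.column_stack([np.ones_like(x), x])

# lstsq returns the coefficients that minimize the squared norm of (A @ beta - y).
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

# Sum of squared residuals: the quantity OLS minimizes.
residuals = y - A @ beta
ssr = float(np.sum(residuals ** 2))
print(intercept, slope, ssr)
```

Because the toy data lie exactly on a line, the residuals are essentially zero; on real data such as MEDV against RM they will not be.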

In [138]:
## Without a constant
from patsy import dmatrices
import statsmodels.api as sm

X = df["RM"] ## independent variables
y = target["MEDV"] ## dependent variable

In [139]:
# enter your code here

# create the model, use the OLS and fit functions
m = sm.OLS(y,X)
res = m.fit()

# make the predictions with the model
predictions = res.predict(X)

# Print out the statistics
print(res.summary())

                                 OLS Regression Results                                
Dep. Variable:                   MEDV   R-squared (uncentered):                   0.901
Model:                            OLS   Adj. R-squared (uncentered):              0.901
Method:                 Least Squares   F-statistic:                              4615.
Date:                Thu, 04 Mar 2021   Prob (F-statistic):                   3.74e-256
Time:                        18:59:09   Log-Likelihood:                         -1747.1
No. Observations:                 506   AIC:                                      3496.
Df Residuals:                     505   BIC:                                      3500.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [140]:
X[:3]

0    6.575
1    6.421
2    7.185
Name: RM, dtype: float64
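The 0.901 R-squared reported for the no-constant fit is labelled "uncentered": without an intercept, the residual sum of squares is compared with the sum of squared y values rather than with the variation of y around its mean. The sketch below, on hypothetical synthetic data and using NumPy only, suggests why that number can look high even when the predictor carries no information.

```python
import numpy as np

# Hypothetical synthetic data: y has a mean far from zero and is unrelated to x.
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 8.0, size=200)
y = 20.0 + rng.normal(0.0, 3.0, size=200)

# Least-squares slope for a line forced through the origin (no constant).
slope = np.sum(x * y) / np.sum(x * x)
ssr = np.sum((y - slope * x) ** 2)

# Uncentered R-squared: residuals compared with the raw sum of squares of y.
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)

# Centered R-squared: residuals compared with the variation of y around its mean.
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The two values are not comparable, so the 0.901 here should not be read as a better fit than an R-squared from a model that includes a constant.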

---
#### Add a constant to the model

In [141]:
X = df["RM"] ## independent variables
y = target["MEDV"] ## dependent variable

In [142]:
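# add a constant (intercept column) to X
# patsy's dmatrices builds the design matrix from the formula and adds an
# Intercept column automatically; y and X are looked up in the calling scope
# because df has no columns with those names

y, X = dmatrices('y ~ X', data=df, return_type='dataframe')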


Your X should look like this (patsy labels the two columns Intercept and X rather than const and RM):
   const 	RM
0 	1.0 	6.575
1 	1.0 	6.421
2 	1.0 	7.185
3 	1.0 	6.998
4 	1.0 	7.147
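Adding a constant amounts to placing a column of ones next to the predictor, so that the fitted coefficient on that column acts as the intercept. A minimal NumPy sketch using the five RM values shown above; in statsmodels itself, `sm.add_constant` is the usual helper for this step.

```python
import numpy as np

# First five RM values from the table above.
rm = np.array([6.575, 6.421, 7.185, 6.998, 7.147])

# Prepend a column of ones; its coefficient plays the role of the intercept.
X_const = np.column_stack([np.ones(len(rm)), rm])
print(X_const)
```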

In [143]:
# enter your code here
# create the model, use the OLS and fit functions
m_constant = sm.OLS(y,X)
res_constant = m_constant.fit()

# make the predictions with the model
predictions = res_constant.predict(X)

# Print out the statistics
print(res_constant.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Thu, 04 Mar 2021   Prob (F-statistic):           2.49e-74
Time:                        18:59:20   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -34.6706      2.650    -13.084      0.0

In [144]:
X[:5]

Unnamed: 0,Intercept,X
0,1.0,6.575
1,1.0,6.421
2,1.0,7.185
3,1.0,6.998
4,1.0,7.147


#### Multiple Linear Regression

In [145]:
X = df[["RM", "LSTAT"]]
y = target["MEDV"]

In [146]:
# enter your code here
# create the model, use the OLS and fit functions
y, X = dmatrices('y ~ X', data=df, return_type='dataframe')
m_multi = sm.OLS(y,X)
res_multi = m_multi.fit()

# make the predictions with the model
predictions = res_multi.predict(X)

# Print out the statistics
print(res_multi.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.637
Method:                 Least Squares   F-statistic:                     444.3
Date:                Thu, 04 Mar 2021   Prob (F-statistic):          7.01e-112
Time:                        18:59:31   Log-Likelihood:                -1582.8
No. Observations:                 506   AIC:                             3172.
Df Residuals:                     503   BIC:                             3184.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.3583      3.173     -0.428      0.6
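Both summaries report an adjusted R-squared next to the plain one. The sketch below applies the standard adjustment, which penalizes each extra predictor, to the rounded figures printed above. The function name `adjusted_r2` is hypothetical, and because the inputs are rounded to three decimals, the results can differ from the printed values in the last digit.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded values taken from the two summaries above (506 observations each).
adj_single = adjusted_r2(0.484, 506, 1)  # MEDV ~ RM
adj_multi = adjusted_r2(0.639, 506, 2)   # MEDV ~ RM + LSTAT
print(adj_single, adj_multi)
```

Adding LSTAT raises both the plain and the adjusted R-squared, which suggests the second predictor explains variation beyond what the penalty for an extra term would cancel out.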

---
## sklearn

In [148]:
import sklearn
from sklearn import linear_model 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [149]:
X = df
y = target['MEDV']

In [150]:
X[:5]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [151]:
# .size counts every cell (506 rows x 13 columns = 6578); use .shape for (rows, columns)
X.size, y.size

(6578, 506)

In [152]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [153]:
predictions = lm.predict(X)
print(predictions[0:5])

[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]


In [154]:
len(predictions)

506

In [155]:
## print the r2, coefficients, and intercept values

# coefficients
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficients'])
print(cdf)

# intercept
print('Intercept:', lm.intercept_)

# r2
print('R2:', sklearn.metrics.r2_score(y, predictions))

         Coefficients
CRIM        -0.108011
ZN           0.046420
INDUS        0.020559
CHAS         2.686734
NOX        -17.766611
RM           3.809865
AGE          0.000692
DIS         -1.475567
RAD          0.306049
TAX         -0.012335
PTRATIO     -0.952747
B            0.009312
LSTAT       -0.524758
Intercept: 36.459488385089855
R2: 0.7406426641094095
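The cell above imports `mean_squared_error` but reports only R2. As a rough sketch of what the two metrics compute, the NumPy functions below follow the usual definitions (mean of squared errors, and one minus the ratio of residual to total sum of squares). The function names `mse` and `r2` are hypothetical, and the inputs are just the first five targets and predictions printed earlier, so the numbers will not match the full-data R2 of 0.7406.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """One minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# First five MEDV values and the first five predictions shown above.
y_first5 = [24.0, 21.6, 34.7, 33.4, 36.2]
pred_first5 = [30.00384338, 25.02556238, 30.56759672, 28.60703649, 27.94352423]

print(mse(y_first5, pred_first5), r2(y_first5, pred_first5))
```

Note that these scores, like the R2 above, are computed on the same rows the model was fitted on, so they describe in-sample fit rather than performance on unseen data.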
