# BMIS 342 S2G1 (Nate, Reese, Terrell, Matthew, Cole)

# Boston Housing Group Assignment

## Question 1

Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?
* The data should be partitioned so that the model is evaluated on records it did not see during fitting. Error measured only on the data used to fit the model is optimistic and can hide overfitting.
* Training Set: Used to fit the model, i.e. to estimate the regression coefficients that map the predictors to the outcome.
* Validation Set: Held out from fitting and used to evaluate predictive performance by computing error measures (e.g. RMSE, MAE) on unseen records.
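
To make the roles of the two sets concrete, here is a minimal sketch using only the Python standard library. The helper names `partition` and `rmse`, the toy outcome values, and the "predict the training mean" model are all assumptions made for illustration; the notebook itself uses `train_test_split` from scikit-learn, which will not produce the same shuffle.

```python
import math
import random


def partition(n_rows, valid_fraction=0.4, seed=1):
    """Shuffle row indices and split them into training and validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_valid = round(n_rows * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy outcome values, invented purely for illustration (not the housing data)
y = [21.0, 34.5, 18.2, 25.1, 30.0, 15.8, 22.4, 27.9, 19.6, 24.3]

train_idx, valid_idx = partition(len(y))

# Stand-in "model": predict the training-set mean for every record
train_mean = sum(y[i] for i in train_idx) / len(train_idx)

# The model is fit on the training rows only, then scored on both sets
train_rmse = rmse([y[i] for i in train_idx], [train_mean] * len(train_idx))
valid_rmse = rmse([y[i] for i in valid_idx], [train_mean] * len(valid_idx))
print(f"training RMSE:   {train_rmse:.3f}")
print(f"validation RMSE: {valid_rmse:.3f}")
```

The validation RMSE is the more honest estimate of how the model would perform on new tracts, because none of those rows influenced the fitted value.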

## Question 2
Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

In [1]:
# Importing required packages 
import sys
!pip install dmba

%matplotlib inline
from pathlib import Path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge
import statsmodels.formula.api as sm
import matplotlib.pylab as plt

from dmba import regressionSummary, exhaustive_search
from dmba import backward_elimination, forward_selection, stepwise_selection
from dmba import adjusted_r2_score, AIC_score, BIC_score

no display found. Using non-interactive Agg backend


In [2]:
# Loading the Boston Housing data into a DataFrame named housing
housing = pd.read_csv('http://barney.gonzaga.edu/~chuang/data/dmba/BostonHousing.csv')
housing

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV,CAT. MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,4.98,24.0,0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,9.14,21.6,0
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,4.03,34.7,1
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,2.94,33.4,1
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,5.33,36.2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,9.67,22.4,0
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,9.08,20.6,0
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,5.64,23.9,0
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,6.48,22.0,0


In [3]:
# Printing the columns of the DataFrame
housing.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT', 'MEDV', 'CAT. MEDV'],
      dtype='object')

In [4]:
# Printing the shape of the DataFrame
housing.shape

(506, 14)

In [5]:
# Previewing the first rows; iloc[0:1000] keeps at most the first 1000 rows, which here is all 506
print(housing.head())
housing = housing.iloc[0:1000]

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   

   LSTAT  MEDV  CAT. MEDV  
0   4.98  24.0          0  
1   9.14  21.6          0  
2   4.03  34.7          1  
3   2.94  33.4          1  
4   5.33  36.2          1  


In [13]:
predictors = ['CRIM', 'CHAS', 'RM']
outcome = 'MEDV'

In [14]:
# Selecting the predictors; get_dummies leaves numeric columns unchanged (CHAS is already coded 0/1)
x = pd.get_dummies(housing[predictors], drop_first=True)
x

Unnamed: 0,CRIM,CHAS,RM
0,0.00632,0,6.575
1,0.02731,0,6.421
2,0.02729,0,7.185
3,0.03237,0,6.998
4,0.06905,0,7.147
...,...,...,...
501,0.06263,0,6.593
502,0.04527,0,6.120
503,0.06076,0,6.976
504,0.10959,0,6.794


In [15]:
# Printing the first few rows of x
x.head()

Unnamed: 0,CRIM,CHAS,RM
0,0.00632,0,6.575
1,0.02731,0,6.421
2,0.02729,0,7.185
3,0.03237,0,6.998
4,0.06905,0,7.147


In [16]:
# Partitioning the data: 60% training, 40% validation
y = housing[outcome]
train_x, valid_x, train_y, valid_y = train_test_split(x, 
                                                      y, 
                                                      test_size=0.4, 
                                                      random_state=1)

In [17]:
# Instantiating a linear regression model and fitting it to the training set
housing_lm = LinearRegression()
housing_lm.fit(train_x, train_y)

LinearRegression()

In [18]:
# Printing coefficients
print('intercept ', housing_lm.intercept_)
print(pd.DataFrame({'Predictor': x.columns, 'coefficient': housing_lm.coef_}))

# Printing performance measures on the training set
regressionSummary(train_y, housing_lm.predict(train_x))

intercept  -29.19346743060684
  Predictor  coefficient
0      CRIM    -0.240062
1      CHAS     3.266817
2        RM     8.325175

Regression statistics

                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 5.9666
            Mean Absolute Error (MAE) : 3.9668
          Mean Percentage Error (MPE) : -7.2747
Mean Absolute Percentage Error (MAPE) : 22.5927
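
Reading the intercept and coefficients off the output above and rounding to four decimals, the fitted equation asked for in Question 2 can be written as:

$$
\widehat{\text{MEDV}} = -29.1935 - 0.2401\,\text{CRIM} + 3.2668\,\text{CHAS} + 8.3252\,\text{RM}
$$

Here CHAS is 1 if the tract bounds the Charles River and 0 otherwise.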


## Question 3
Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

In [19]:
# Intercept + ((CRIM) * crime rate) + ((CHAS) * 0) + ((RM) * avg # of rooms per house)
p = -29.19346743060684 + (-0.240062 * 0.1) + (3.266817 * 0) + (8.325175 * 6) 
p

20.733576369393155
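
The same arithmetic can be wrapped in a small function so that other tracts can be scored without retyping the numbers. This is only a sketch: `predict_medv`, `INTERCEPT`, and `COEFFICIENTS` are hypothetical names, and the values are copied from the printed model output rather than read from the fitted model.

```python
# Intercept and coefficients copied from the fitted model's printed output
INTERCEPT = -29.19346743060684
COEFFICIENTS = {"CRIM": -0.240062, "CHAS": 3.266817, "RM": 8.325175}


def predict_medv(crim, chas, rm):
    """Predicted MEDV (in $1000s) for one tract from the fitted equation."""
    values = {"CRIM": crim, "CHAS": chas, "RM": rm}
    return INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


# Tract from Question 3: crime rate 0.1, not on the Charles River, 6 rooms on average
prediction = predict_medv(crim=0.1, chas=0, rm=6)
print(round(prediction, 2))  # 20.73
```

Inside the notebook, `housing_lm.predict(pd.DataFrame({'CRIM': [0.1], 'CHAS': [0], 'RM': [6]}))` should give nearly the same number; any small difference would come from the coefficients being rounded to six decimals in the printout. MEDV is recorded in thousands of dollars, so the predicted median house price is about $20,700.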