## HW 2

You will repeat the main procedures from Lecture 2 in simplified steps using a different data set (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).

1. Download the data using urllib and read the file into a pandas DataFrame. The URL for the data is "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"

2. Print the statistics of the continuous variables. Note that "medv" is our target variable.

3. Examine the variables and list the top five variables that correlate the most (either positively or negatively) with "medv". What are the correlation values?

4. Create a pipeline with a simple median imputer and a standard scaler. How many elements are missing for each variable?

5. Set the random seed to 0. Split the data into a training set (80%) and a test set (20%) using scikit-learn (no stratified sampling necessary).

6. Fit a linear regression model to the training data. Report the training MAE and the test MAE.

NOTE: Add comments for each step, such as your observations on the results. To make grading easy, please leave all cells open and keep the results.

## 1. Downloading and Reading Data

In [3]:
# Mount Drive

from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [4]:
# Import Packages

import pandas as pd
import sys
import sklearn
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import tarfile
import urllib.request

In [5]:
# Download Data

SITE_URL = "https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html"


# New URL for BOSTON_URL
BOSTON_URL = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
CURRENT_DIREC = os.getcwd()
BOSTON_PATH = os.path.join(CURRENT_DIREC, "boston")

def fetch_boston_data(boston_url = BOSTON_URL, boston_path = BOSTON_PATH):
  if not os.path.isdir(boston_path):
    os.makedirs(boston_path)
  csv_path = os.path.join(boston_path, "boston.csv")
  urllib.request.urlretrieve(boston_url, csv_path)
  return pd.read_csv(csv_path)

boston = fetch_boston_data()
boston.head()





Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
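A possible variant of the download step, shown only as a sketch: it skips the download when the file is already on disk. The helper name `fetch_csv_cached` and the demo file names are hypothetical, and the demo uses a local `file://` URL so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

import pandas as pd


def fetch_csv_cached(url, dest_dir, filename):
    """Download url into dest_dir/filename unless it already exists, then read it."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder already exists
    csv_path = os.path.join(dest_dir, filename)
    if not os.path.isfile(csv_path):  # only download on the first call
        urllib.request.urlretrieve(url, csv_path)
    return pd.read_csv(csv_path)


# Offline demo: a tiny local CSV stands in for the remote file (made-up values)
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("crim,medv\n0.1,24.0\n0.2,21.6\n")
    demo = fetch_csv_cached(source.as_uri(), os.path.join(tmp, "cache"), "demo.csv")

print(demo.shape)
```

With the real data, the call would presumably be `fetch_csv_cached(BOSTON_URL, BOSTON_PATH, "boston.csv")`.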


## 2. Print Info for Continuous Variables

In [6]:
# Drop Variables Treated as Categorical
boston_cont = boston.drop(["chas","rad","tax"], axis=1)

# Print Statistics
boston_cont.describe()

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,ptratio,b,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.554695,6.284634,68.574901,3.795043,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.115878,0.702617,28.148861,2.10571,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.385,3.561,2.9,1.1296,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.449,5.8855,45.025,2.100175,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.538,6.2085,77.5,3.20745,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.624,6.6235,94.075,5.188425,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,0.871,8.78,100.0,12.1265,22.0,396.9,37.97,50.0
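One way to support the choice of which columns to treat as categorical is to count distinct values per column. The sketch below uses a made-up DataFrame, and the cutoff of 3 distinct values is an arbitrary assumption for illustration, not a rule from the lecture.

```python
import pandas as pd

# Made-up data: two low-cardinality columns and one continuous column
demo = pd.DataFrame({
    "flag": [0, 1, 0, 1, 0, 1],
    "level": [1, 2, 3, 1, 2, 3],
    "value": [0.1, 0.5, 0.9, 1.3, 1.7, 2.1],
})

unique_counts = demo.nunique()  # number of distinct values per column
max_levels = 3                  # hypothetical cutoff, chosen for this demo only
low_cardinality = unique_counts[unique_counts <= max_levels].index.tolist()

print(unique_counts)
print(low_cardinality)
```

Columns flagged this way are only candidates; whether a variable is really categorical still depends on what it measures.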


## 3. Looking for Correlations

In [7]:
# Create Correlation Matrix
corr_matrix = boston_cont.corr()

# Select correlations with 'medv' and sort by signed value
# (compare absolute values when ranking the strongest correlations)
print(corr_matrix['medv'].sort_values())

lstat     -0.737663
ptratio   -0.507787
indus     -0.483725
nox       -0.427321
crim      -0.388305
age       -0.376955
dis        0.249929
b          0.333461
zn         0.360445
rm         0.695360
medv       1.000000
Name: medv, dtype: float64


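The five variables that correlate most strongly with "medv" (by absolute value):
  - lstat (corr = -0.737663)
  - rm (corr = 0.695360)
  - ptratio (corr = -0.507787)
  - indus (corr = -0.483725)
  - tax (corr = -0.468536; not shown in the output above because "tax" was dropped from boston_cont)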
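Ranking by absolute value can also be done in code rather than by eye. The following is a minimal sketch on synthetic data; the column names are hypothetical and the values are not the Boston values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with one strong negative and two positive correlations
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "target": x,
    "strong_neg": -x + rng.normal(scale=0.1, size=200),
    "moderate_pos": x + rng.normal(scale=1.0, size=200),
    "weak_pos": x + rng.normal(scale=5.0, size=200),
})

# Correlation of every column with the target, excluding the target itself
corr_with_target = demo.corr()["target"].drop("target")

# Rank by magnitude, but keep the signed values for reporting
order = corr_with_target.abs().sort_values(ascending=False).index
top = corr_with_target.loc[order]

print(top)
```

Applied to the full Boston dataframe, `top.head(5)` would give the five strongest correlations together with their signs.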

## 4. Simple Pipeline

In [8]:
# Import Packages
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

boston_cont.describe()

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,ptratio,b,lstat,medv
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.554695,6.284634,68.574901,3.795043,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.115878,0.702617,28.148861,2.10571,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.385,3.561,2.9,1.1296,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.449,5.8855,45.025,2.100175,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.538,6.2085,77.5,3.20745,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.624,6.6235,94.075,5.188425,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,0.871,8.78,100.0,12.1265,22.0,396.9,37.97,50.0


There are 506 observations, and every continuous attribute has a count of 506, so no attribute has missing entries (0 missing elements per variable).
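The missing-value count can also be read off directly instead of being inferred from `describe()`. A minimal sketch on a made-up DataFrame that does contain gaps:

```python
import numpy as np
import pandas as pd

# Made-up data with some missing entries
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [np.nan, np.nan, 1.0, 2.0],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = demo.isna().sum()

print(missing_per_column)
```

For the notebook's data, the equivalent call would be `boston.isna().sum()`, which also covers the columns that were dropped from `boston_cont`.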

In [9]:
# Simple Imputer
imputer = SimpleImputer(strategy="median")
imputer.fit(boston_cont)
boston_cont_imputer = imputer.transform(boston_cont)

# transform() returns a NumPy array, so convert it back to a DataFrame
boston_cont_imputer_df = pd.DataFrame(boston_cont_imputer, columns = boston_cont.columns, index = boston.index)
boston_cont_imputer_df.head()

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0.538,6.575,65.2,4.09,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.469,6.421,78.9,4.9671,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.469,7.185,61.1,4.9671,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.458,6.998,45.8,6.0622,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.458,7.147,54.2,6.0622,18.7,396.9,5.33,36.2
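Since the Boston data has no missing values, the imputer leaves it unchanged, which is why the table above matches the original rows. To see what the median strategy does when values are missing, here is a small sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: column "a" has one missing value, column "b" has none
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(demo), columns=demo.columns, index=demo.index
)

print(imputer.statistics_)  # per-column medians learned during fit
print(filled)               # the missing cell in "a" is replaced by the median of "a"
```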


In [10]:
# Pipeline Creation

# pipeline for continuous attributes
boston_cont_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

boston_cont_tr = boston_cont_pipeline.fit_transform(boston_cont)

# ColumnTransformer wrapping the continuous pipeline
# (the categorical attributes were dropped earlier, so they are not included)
boston_pipeline = ColumnTransformer([
    ("num", boston_cont_pipeline, list(boston_cont))
])

# run data through pipeline
# note: 'medv' is a column of boston_cont, so the target is standardized as well
boston_prepared = pd.DataFrame(boston_pipeline.fit_transform(boston_cont), columns = boston_cont.columns, index = boston_cont.index)
boston_prepared

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,ptratio,b,lstat,medv
0,-0.419782,0.284830,-1.287909,-0.144217,0.413672,-0.120013,0.140214,-1.459000,0.441052,-1.075562,0.159686
1,-0.417339,-0.487722,-0.593381,-0.740262,0.194274,0.367166,0.557160,-0.303094,0.441052,-0.492439,-0.101524
2,-0.417342,-0.487722,-0.593381,-0.740262,1.282714,-0.265812,0.557160,-0.303094,0.396427,-1.208727,1.324247
3,-0.416750,-0.487722,-1.306878,-0.835284,1.016303,-0.809889,1.077737,0.113032,0.416163,-1.361517,1.182758
4,-0.412482,-0.487722,-1.306878,-0.835284,1.228577,-0.511180,1.077737,0.113032,0.441052,-1.026501,1.487503
...,...,...,...,...,...,...,...,...,...,...,...
501,-0.413229,-0.487722,0.115738,0.158124,0.439316,0.018673,-0.625796,1.176466,0.387217,-0.418147,-0.014454
502,-0.415249,-0.487722,0.115738,0.158124,-0.234548,0.288933,-0.716639,1.176466,0.441052,-0.500850,-0.210362
503,-0.413447,-0.487722,0.115738,0.158124,0.984960,0.797449,-0.773684,1.176466,0.441052,-0.983048,0.148802
504,-0.407764,-0.487722,0.115738,0.158124,0.725672,0.736996,-0.668437,1.176466,0.403225,-0.865302,-0.057989
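The pipeline above is fitted on all rows and also standardizes the target. A common alternative, sketched here on synthetic data with hypothetical column names, is to split first, fit the pipeline on the training features only, and leave the target in its original units. This is one possible arrangement, not necessarily the procedure from Lecture 2.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features and a target
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(100, 3)), columns=["f1", "f2", "target"]
)

# Separate features and target, then split before any fitting
X = demo.drop(columns="target")
y = demo["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

X_train_prepared = prep.fit_transform(X_train)  # statistics come from training rows only
X_test_prepared = prep.transform(X_test)        # test rows reuse the training statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

With this arrangement the test rows do not influence the medians, means, or standard deviations, and any error metric on `y` is reported in the target's own units.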


## 5. Model Creation

In [11]:
# Import Packages, Set Seed to 0
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
seed = 0


In [12]:
# Split Data into Training (80%) and Test (20%) Sets

boston_train, boston_test = train_test_split(boston_prepared, test_size = 0.20, random_state = seed)
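To check what `random_state` does, the sketch below splits 506 row indices twice with the same seed and compares the results. The index array is only a stand-in for the real dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 506 row indices stand in for the 506 observations
rows = np.arange(506)

train_a, test_a = train_test_split(rows, test_size=0.20, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.20, random_state=0)

print(len(train_a), len(test_a))        # sizes of the two subsets
print(np.array_equal(test_a, test_b))   # same seed, same split
```

Because 20% of 506 is not a whole number, the test set size is rounded up and the training set receives the remaining rows.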



## 6. Linear Regression

In [13]:
## DEPENDS ON THE CELLS ABOVE - RUN THE CELLS IN SECTION 5 FIRST
# Import Packages
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


# Assign Linear Regression Model
lin_reg = LinearRegression()

# Separate the (standardized) 'medv' target from the features.
# New variable names are used so that this cell can be re-run safely.
medv_train = boston_train["medv"]
X_train = boston_train.drop("medv", axis=1)
medv_test = boston_test["medv"]
X_test = boston_test.drop("medv", axis=1)


# Fit Linear Regression
lin_reg.fit(X_train, medv_train)

# Predict on Train and Test Sets
boston_predictions_train = lin_reg.predict(X_train)
boston_predictions_test = lin_reg.predict(X_test)

# Calculate Mean Absolute Error (in standardized 'medv' units)
train_mae = mean_absolute_error(medv_train, boston_predictions_train)
test_mae = mean_absolute_error(medv_test, boston_predictions_test)

print(float(train_mae), float(test_mae))


0.3455131135949065 0.43146755428301925


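Train MAE: 0.3455131135949065,
Test MAE: 0.43146755428301925

Both values are in standardized units, because "medv" was scaled together with the features in the pipeline. Multiplying by the standard deviation of "medv" (about 9.2) gives roughly 3.2 (train) and 4.0 (test) in the original units of "medv". The test error is higher than the training error, as expected for data the model has not seen.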
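An MAE computed on a standardized target can be converted back to the original units by multiplying by the scaler's standard deviation, since standardization divides every difference by that value. A minimal sketch with made-up numbers (not the notebook's predictions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up true values and predictions in original units
y_true = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)
y_pred = np.array([12.0, 18.0, 33.0, 36.0]).reshape(-1, 1)

# Standardize both with a scaler fitted on the true values
scaler = StandardScaler().fit(y_true)
mae_scaled = np.mean(np.abs(scaler.transform(y_true) - scaler.transform(y_pred)))

# Undo the scaling: the mean cancels in the difference, only the std remains
mae_original = mae_scaled * scaler.scale_[0]

print(mae_scaled, mae_original)
```

Here `mae_original` matches the MAE computed directly on the unscaled values, which is the mean of the absolute errors 2, 2, 3, and 4.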