## HW 2. 

You will repeat the main procedures from Lecture 2 in simplified form, using a different data set (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).

1. Download the data using __urllib__ and read the file into a pandas DataFrame. The URL for the data is "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"

2. Print the statistics of __continuous__ variables. Note that __"medv"__ is our target variable.

3. Examine the variables and __list the top five variables that correlate most strongly (either positively or negatively) with "medv"__. What are the __correlation values__?

4. Create a pipeline of __simple median imputer and standard scaler__. How many __elements__ are missing for each variable?

5. Set the random seed to 0. Split the data into a training set (80%) and a test set (20%) using scikit-learn (no stratified sampling necessary).

6. Fit a linear regression model to the training data. Report Training MAE and Test MAE.

NOTE: Add comments for each step, such as your observations on the results. To make grading easy, please __leave all cells open and keep the results__.

In [1]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## 1. Getting Data

In [2]:
# Importing all the necessary packages

import pandas as pd
import sys
import sklearn
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import tarfile
import urllib.request

In [18]:
#%cd /content/drive/MyDrive/MachineLearning/HW2

/content/drive/MyDrive/MachineLearning/HW2


In [43]:
# Downloading Data

BOSTON_URL = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
CURR_DIRECTORY = os.getcwd()
BOSTON_PATH = os.path.join(CURR_DIRECTORY, "boston")

# Fetch the CSV from the URL, save it under `path`, and return its contents as a pandas DataFrame
def get_data(url, path, csv_name):
  os.makedirs(path, exist_ok=True)  # create the target directory if it does not exist yet
  csv_path = os.path.join(path, csv_name)
  urllib.request.urlretrieve(url, csv_path)
  return pd.read_csv(csv_path)

boston_data = get_data(BOSTON_URL, BOSTON_PATH, "boston.csv")
boston_data.head()


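| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |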


## 2. Printing Summary Statistics for Continuous Variables

In [19]:
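boston_data.describe() # Inspect all columns to decide which ones to treat as categorical

# Drop the variables treated as categorical in this notebook: "chas" is a binary indicator and "rad" is an index.
# "tax" (a property-tax rate) is dropped here as well.
boston_Continuous_Vars = boston_data.drop(["chas","rad","tax"], axis=1)

boston_Continuous_Vars.describe()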

| | crim | zn | indus | nox | rm | age | dis | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.385 | 3.561 | 2.9 | 1.1296 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.449 | 5.8855 | 45.025 | 2.100175 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.538 | 6.2085 | 77.5 | 3.20745 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.624 | 6.6235 | 94.075 | 5.188425 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 0.871 | 8.78 | 100.0 | 12.1265 | 22.0 | 396.9 | 37.97 | 50.0 |
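`describe()` alone does not say which columns are categorical. One possible heuristic, sketched below on a small hypothetical frame (the values and the `MAX_LEVELS` threshold are illustrative assumptions, not part of the assignment), is to count the distinct values in each column and treat low-cardinality columns as categorical.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the data set (illustrative values only).
df = pd.DataFrame({
    "crim": [0.006, 0.027, 0.027, 0.032, 0.069, 0.030],
    "chas": [0, 0, 1, 0, 1, 0],
    "rad": [1, 2, 2, 3, 3, 3],
    "medv": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
})

# Number of distinct values per column; few distinct values hints at a categorical variable.
unique_counts = df.nunique()

MAX_LEVELS = 3  # assumed threshold, chosen only for this toy example
categorical_like = unique_counts[unique_counts <= MAX_LEVELS].index.tolist()

# Keep the remaining columns and summarize them.
continuous = df.drop(columns=categorical_like)
print(categorical_like)
print(continuous.describe())
```

A threshold like this is only a starting point; the variable descriptions on the data set page remain the better guide for borderline columns.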


## 3. Top 5 Correlations with "medv"

In [20]:
corr_matrix = boston_Continuous_Vars.corr()  # Create a correlation matrix between the variables in boston_Continuous_Vars


In [21]:
corr_matrix["medv"].sort_values() # The correlations for "medv"

lstat     -0.737663
ptratio   -0.507787
indus     -0.483725
nox       -0.427321
crim      -0.388305
age       -0.376955
dis        0.249929
b          0.333461
zn         0.360445
rm         0.695360
medv       1.000000
Name: medv, dtype: float64

The top 5 variables with the largest absolute correlations with "medv" are:

1. lstat (-0.737663)
2. rm (0.695360)
3. ptratio (-0.507787)
4. indus (-0.483725)
5. nox (-0.427321)
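The ranking above was read off the sorted output by hand. As a minimal sketch, the same ranking can be produced programmatically by sorting on absolute value. The series below is typed in from the printed output; with the real data it would be `corr_matrix["medv"]`.

```python
import pandas as pd

# Correlations with "medv", copied from the printed output.
medv_corr = pd.Series({
    "lstat": -0.737663, "ptratio": -0.507787, "indus": -0.483725,
    "nox": -0.427321, "crim": -0.388305, "age": -0.376955,
    "dis": 0.249929, "b": 0.333461, "zn": 0.360445,
    "rm": 0.695360, "medv": 1.000000,
})

# Drop the self-correlation, rank by absolute value, and keep the five strongest.
top5_names = medv_corr.drop("medv").abs().sort_values(ascending=False).head(5).index

# Look the names up in the original series so the signs are preserved.
top5 = medv_corr[top5_names]
print(top5)
```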


## 4. Pipeline Creation
- Create a pipeline of __simple median imputer and standard scaler__.
- How many __elements__ are missing for each variable?

### a. Missing elements

In [8]:
# Retrieve any incomplete rows
sample_incomplete_rows = boston_data[boston_data.isnull().any(axis=1)].head()
print(sample_incomplete_rows)

# The result is an empty DataFrame, so no row contains a missing value: zero elements are missing for every variable.

Empty DataFrame
Columns: [crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv]
Index: []
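The question asks for a count per variable, which `isnull().sum()` reports directly. The sketch below uses a small hypothetical frame with two deliberately missing entries so that the counts are non-zero; applied to `boston_data`, the same call would be expected to show 0 for every column, given the empty result above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the real data).
toy = pd.DataFrame({
    "rm": [6.575, np.nan, 7.185, 6.998],
    "lstat": [4.98, 9.14, np.nan, 2.94],
    "medv": [24.0, 21.6, 34.7, 33.4],
})

# isnull() marks each missing cell; sum() counts them column by column.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", total_missing)
```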


### b. Pipeline

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [23]:
imputer = SimpleImputer(strategy="median") # Simple median imputer 
imputer.fit(boston_Continuous_Vars)
b_imputer = imputer.transform(boston_Continuous_Vars) 

In [24]:
# Since b_imputer is a np array, we must create a pandas dataframe from it.
boston_Imputer_df = pd.DataFrame(b_imputer, columns = boston_Continuous_Vars.columns, index = boston_data.index)

In [30]:
# Pipeline for continuous variables
boston_Continuous_Vars_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

# Fit-transform the pipeline
boston_Continuous_Vars_tr = boston_Continuous_Vars_pipe.fit_transform(boston_Continuous_Vars)

b_prepared_data = pd.DataFrame(boston_Continuous_Vars_tr, columns = boston_Continuous_Vars.columns, index = boston_Continuous_Vars.index)

b_prepared_data

| | crim | zn | indus | nox | rm | age | dis | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.419782 | 0.284830 | -1.287909 | -0.144217 | 0.413672 | -0.120013 | 0.140214 | -1.459000 | 0.441052 | -1.075562 | 0.159686 |
| 1 | -0.417339 | -0.487722 | -0.593381 | -0.740262 | 0.194274 | 0.367166 | 0.557160 | -0.303094 | 0.441052 | -0.492439 | -0.101524 |
| 2 | -0.417342 | -0.487722 | -0.593381 | -0.740262 | 1.282714 | -0.265812 | 0.557160 | -0.303094 | 0.396427 | -1.208727 | 1.324247 |
| 3 | -0.416750 | -0.487722 | -1.306878 | -0.835284 | 1.016303 | -0.809889 | 1.077737 | 0.113032 | 0.416163 | -1.361517 | 1.182758 |
| 4 | -0.412482 | -0.487722 | -1.306878 | -0.835284 | 1.228577 | -0.511180 | 1.077737 | 0.113032 | 0.441052 | -1.026501 | 1.487503 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | -0.413229 | -0.487722 | 0.115738 | 0.158124 | 0.439316 | 0.018673 | -0.625796 | 1.176466 | 0.387217 | -0.418147 | -0.014454 |
| 502 | -0.415249 | -0.487722 | 0.115738 | 0.158124 | -0.234548 | 0.288933 | -0.716639 | 1.176466 | 0.441052 | -0.500850 | -0.210362 |
| 503 | -0.413447 | -0.487722 | 0.115738 | 0.158124 | 0.984960 | 0.797449 | -0.773684 | 1.176466 | 0.441052 | -0.983048 | 0.148802 |
| 504 | -0.407764 | -0.487722 | 0.115738 | 0.158124 | 0.725672 | 0.736996 | -0.668437 | 1.176466 | 0.403225 | -0.865302 | -0.057989 |


## 5. Split Dataset for Training 
- Set the random seed to 0. 
- Split the training (80%) and the test set (20%) using scikit-learn (no stratified sampling necessary.)


In [31]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(b_prepared_data, test_size=0.2, random_state=0)
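In this notebook the pipeline is fitted on the full data set, including "medv", before the split. A common alternative, sketched below on synthetic data, is to split first and fit the pipeline on the training features only, so that the medians, means, and standard deviations are learned without looking at the test rows and the target stays in its original units. The data frame, its coefficients, and the variable names (`X_train_prepared`, `prep`, and so on) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); the column names only mimic the data set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rm": rng.normal(6.3, 0.7, 100),
    "lstat": rng.normal(12.7, 7.1, 100),
})
df["medv"] = 5.0 * df["rm"] - 0.5 * df["lstat"] + rng.normal(0, 1, 100)

# Separate the features from the target, then split 80/20 with a fixed seed.
X = df.drop(columns="medv")
y = df["medv"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Learn the imputation and scaling statistics from the training rows only...
X_train_prepared = prep.fit_transform(X_train)
# ...and reuse those same statistics on the test rows.
X_test_prepared = prep.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

With this ordering the test set plays no part in preprocessing, and any error computed against `y_test` is directly in the units of the target.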

## 6. Linear Regression
- Fit a linear regression model to the training data.
- Report Training MAE and Test MAE.

In [36]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

# To train, we need the medv values as y, the target we want to predict.
medv_train = train_set[["medv"]]
train_no_medv = train_set.drop("medv", axis=1)

medv_test = test_set[["medv"]]
test_no_medv = test_set.drop("medv", axis=1)

# Linear Regression with training data
lin_reg.fit(train_no_medv, medv_train)

LinearRegression()

In [40]:
# Predictions
predictions_train = lin_reg.predict(train_no_medv)
predictions_test = lin_reg.predict(test_no_medv)

In [41]:
# Mean Absolute Error (computed on NumPy arrays so that each result is a single scalar)
train_MAE = np.mean(np.abs(medv_train.to_numpy() - predictions_train))
test_MAE = np.mean(np.abs(medv_test.to_numpy() - predictions_test))

In [42]:
print("Training MAE: ", float(train_MAE))
print("Test MAE: ", float(test_MAE))

Training MAE:  0.3455131135949065
Test MAE:  0.43146755428301925


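Training MAE:  0.3455131135949065 <br>
Test MAE:  0.43146755428301925

The test error is somewhat higher than the training error, as expected for data the model has not seen. Because "medv" was standardized together with the other variables in the pipeline, both errors are expressed in standard deviations of "medv" rather than in its original units. With a standard deviation of about 9.2 (see the summary statistics), they correspond to roughly 3.2 (training) and 4.0 (test) in the original "medv" units.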
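The MAE values above are computed on a standardized target. As a hedged sketch on synthetic data (the arrays, coefficients, and names such as `y_scaler` are assumptions, not taken from the notebook), the block below shows that scikit-learn's `mean_absolute_error` matches the manual NumPy formula, and that multiplying a standardized MAE by the scaler's `scale_` recovers the MAE in the original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 22.5 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=200)

# Standardize the target, mirroring what happens when it passes through a StandardScaler.
y_scaler = StandardScaler()
y_std = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X, y_std)
pred_std = model.predict(X)

# MAE in standardized units, via scikit-learn and via the manual formula.
mae_std = mean_absolute_error(y_std, pred_std)
mae_manual = np.mean(np.abs(y_std - pred_std))

# Scaling is linear, so multiplying by the target's standard deviation
# converts the error back to the original units.
mae_original = mae_std * y_scaler.scale_[0]

# Cross-check: undo the scaling on the predictions and recompute the MAE.
pred_original = y_scaler.inverse_transform(pred_std.reshape(-1, 1)).ravel()
mae_check = mean_absolute_error(y, pred_original)

print(mae_std, mae_original)
```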