# Linear Regression with Boston Housing Dataset


Description of the dataset: 

The Boston dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.
In total, there are 506 instances (rows), each with 14 attributes (columns). The target variable is `MEDV`, the *median value of owner-occupied homes in $1000s*; the other 13 are features such as per capita crime rate by town, proportion of non-retail business acres per town, and average number of rooms per dwelling.

## 1. Loading the Data

In [None]:
from sklearn.datasets import load_boston
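# load the Boston dataset (a Bunch with data, target, feature_names and DESCR)
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release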
boston_df = load_boston()
# print(boston_df.DESCR)



## 2. Import cuDF and build a dataframe

In [None]:
import cudf

# build dataframe from data key
boston_gdf = cudf.DataFrame(list(boston_df.data))

# set column names to feature_names
boston_gdf.columns = boston_df.feature_names

# add MEDV column from target ---> median value in $1000  (PRICE)
boston_gdf['MEDV'] = boston_df.target

# let's see what we're working with
boston_gdf.head()

## 3. Train Test Split 

In [None]:
# note: newer cuML releases expose this as cuml.model_selection.train_test_split
from cuml.preprocessing.model_selection import train_test_split

In [None]:
# set X to all feature columns (everything except the target variable)
X = boston_gdf.drop('MEDV', axis=1)
# set Y to the target: median price (MEDV)
Y = boston_gdf['MEDV']

In [None]:
# train/test split (70:30)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.7)
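
To make the 70:30 split concrete, here is a minimal pure-Python sketch of the idea: shuffle the row indices, then cut them at 70%. The helper `split_rows` and the fixed seed are hypothetical and only for illustration; cuML's `train_test_split` may shuffle and round differently.

```python
import random

def split_rows(rows, train_size=0.7, seed=0):
    """Shuffle row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_train = int(len(rows) * train_size)  # number of rows kept for training
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

rows = list(range(506))  # stand-in for the 506 rows of the dataset
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # prints: 354 152
```

Every row ends up in exactly one of the two parts, so the model is later evaluated only on rows it did not see during training.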

## 4. Fit Linear Regression Model using cuML

In [None]:
from cuml import LinearRegression
from sklearn.metrics import mean_squared_error
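
As a rough illustration of what a least-squares fit computes, the sketch below fits a line to a single feature using the closed-form formulas for slope and intercept. The helper `fit_line` and the toy numbers are made up for this example and are not taken from the Boston data; cuML's `LinearRegression` handles all 13 features at once and may use a different numerical method internally.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# made-up toy data: rooms per dwelling vs. price in $1000s
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 22.0, 27.0, 32.0]  # constructed as 5 * rooms - 8
slope, intercept = fit_line(rooms, price)
print(slope, intercept)  # prints: 5.0 -8.0
```

Because the toy prices lie exactly on a line, the fit recovers the slope and intercept used to construct them.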

In [None]:
# call Linear Regression model
mlr = LinearRegression()

# train the model for multiple regression
mlr.fit(X_train, Y_train)

# make predictions for test X values
Y_pred = mlr.predict(X_test)

# calculate the mean squared error on the test set
mmse = mean_squared_error(Y_test, Y_pred)
print(mmse)
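
For intuition about the printed number, here is a small sketch of how mean squared error is computed by hand. The helper `mse` and the four price pairs are invented for illustration and are not model output. Since `MEDV` is in $1000s, the MSE is in squared $1000s, and its square root (RMSE) is back in $1000s.

```python
import math

def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# invented example prices in $1000s
actual = [24.0, 21.5, 34.0, 18.0]
predicted = [26.0, 20.5, 30.0, 18.0]

error = mse(actual, predicted)  # (4 + 1 + 16 + 0) / 4
rmse = math.sqrt(error)         # same units as the prices
print(error)  # prints: 5.25
```

Here the RMSE is roughly 2.29, meaning predictions on these invented rows are off by about $2,290 in the root-mean-square sense.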

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# scatter actual and predicted results
plt.scatter(Y_test, Y_pred)

# label graph
# raw strings keep the LaTeX backslash in \hat from being read as an escape sequence
plt.xlabel(r"Actual prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Actual vs. predicted prices: $Y_i$ vs $\hat{Y}_i$")

plt.show()