# Fundamentals of Data Science - Week 6 #

The first section of this notebook covers the following practical aspects of data science:
+ Creating a linear regression model
+ Predicting on unseen data and calculating the error between the predicted and original scores
+ Creating a simple linear regression (a single variable and a target) on the Diabetes dataset
+ Fitting a linear model to the data and plotting it
+ Creating a multivariate linear regression to predict house prices in Boston
+ Plotting the correlation between variables and the predicted vs. original prices, and calculating mean squared errors


In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score


<h3> Single Variable Linear regression </h3>

In [None]:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Construct a data frame that contains the selected feature and the target.
# ravel() flattens the (n, 1) feature array so each cell holds a scalar.
pd.DataFrame({'feature': diabetes_X.ravel(), 'Target': diabetes.target})
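
scikit-learn estimators expect the feature matrix to be two-dimensional, which is why the cell above indexes with `np.newaxis`. A minimal sketch on a small made-up array (a stand-in, not the diabetes data) shows the effect on the shape:

```python
import numpy as np

# Hypothetical 4x3 array standing in for diabetes.data
data = np.arange(12).reshape(4, 3)

# A plain integer index drops the column axis and returns a 1-D array
flat = data[:, 2]

# Adding np.newaxis keeps a second axis: one column, one row per sample
column = data[:, np.newaxis, 2]

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)
```

`data[:, [2]]` and `data[:, 2:3]` produce the same `(4, 1)` shape and may read more clearly.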

In [None]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xlabel("Features")
plt.ylabel("Target Values")
plt.title("Plot of original target (black dots) and the linearly fit model (blue line)")

plt.xticks(())
plt.yticks(())


plt.show()
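
The two metrics printed in the cell above can also be computed by hand, which makes their definitions explicit. The following is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are illustrative, not taken from the diabetes data):

```python
import numpy as np

# Made-up target values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE: %.4f" % mse)
print("R^2: %.4f" % r2)
```

On the same inputs, `mean_squared_error` and `r2_score` from `sklearn.metrics` should return the same values.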

<h3> Multivariate Regression: Predicting house prices in Boston  </h3>

In [None]:
# Import the Boston data set and store it in a variable called boston.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.

from sklearn.datasets import load_boston
boston = load_boston()

# The object boston is dictionary-like (a Bunch), so you can explore its keys and the shape of the 'data' entry
print(boston.keys())
print(boston.data.shape)

Before starting the analysis it is always good to delve into the data. First, we look at the feature names of the Boston data set. We can also read the description of the data set to learn more about it. The data set contains 506 instances (rows) and 13 attributes or parameters (columns). The goal of this exercise is to predict housing prices in the Boston region using the given features.


In [None]:
print(boston.feature_names)
print(boston.DESCR)

<h4> Convert <i> boston.data </i> into a pandas data frame. </h4>

In [None]:
bos = pd.DataFrame(boston.data)
bos.head()

In [None]:
# As we can see, the column names are just numbers, so we replace them with the feature names.

bos.columns = boston.feature_names
bos.head()

In [None]:
# boston.target contains the housing prices. We add one more column, 'PRICE', to the data frame for the target.
bos['PRICE'] = boston.target
bos.head()

We are now going to fit a linear regression model and predict the Boston housing prices. We will use the least squares method to estimate the coefficients.

Y = Boston housing price (also called the “target” in scikit-learn)

and

X = all the other features (or independent variables)

First, import linear regression from the scikit-learn module. Then drop the price column, since we want only the features as our X values, and store a linear regression object in a variable called <i>lm</i>.

In [None]:
from sklearn.linear_model import LinearRegression
X = bos.drop('PRICE', axis = 1)

#This creates a LinearRegression object
lm = LinearRegression()
lm
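
To make the least squares method mentioned above concrete, here is a minimal sketch that estimates coefficients with NumPy directly. It uses synthetic, noise-free data rather than the Boston data, and the names `true_coef` and `true_intercept` are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features and a noise-free linear target (illustrative only)
X = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least squares: find beta minimising ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

print("Intercept:", intercept)
print("Coefficients:", coef)
```

Because the target here has no noise, the estimates should match the true values up to floating-point error. On real data they only approximate the underlying relationship.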

<h4> Fitting a Linear Model </h4> We will use all 13 features to fit a linear regression model. Two other parameters that can be passed to the linear regression object are <i>fit_intercept</i> and <i>normalize</i> (note that <i>normalize</i> was deprecated in scikit-learn 1.0 and removed in 1.2).

In [None]:
lm.fit(X, bos.PRICE)
#print the intercept and number of coefficients.
print('Estimated intercept coefficient:', lm.intercept_)
print('Number of coefficients:', len(lm.coef_))
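
The effect of <i>fit_intercept</i> is easiest to see on a tiny example. The sketch below uses made-up data following y = 3x + 5 (not the Boston data) and fits the model with and without an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data: y = 3x + 5
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0

# Default behaviour: the intercept is estimated from the data
with_intercept = LinearRegression(fit_intercept=True).fit(x, y)

# With fit_intercept=False the line is forced through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)

print(with_intercept.intercept_, with_intercept.coef_)
print(no_intercept.intercept_, no_intercept.coef_)
```

Forcing the line through the origin changes the slope as well, so `fit_intercept=False` is usually only appropriate when the data is known to pass through zero or has already been centred.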

In [None]:
#Construct a data frame that contains features and estimated coefficients.
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns = ['features', 'estimatedCoefficients'])

The data frame shows a relatively large positive coefficient for RM, which suggests a strong relationship between RM and price. Let's draw a scatter plot of the true housing prices against the true RM values.

In [None]:
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average number of rooms per dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")
plt.show()
# The plot shows a positive correlation between RM and housing prices.
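
A scatter plot shows the direction of a relationship, and the Pearson correlation coefficient puts a number on it. A minimal sketch, assuming made-up room counts and prices rather than the Boston columns:

```python
import numpy as np

# Hypothetical values standing in for bos.RM and bos.PRICE
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
price = np.array([12.0, 18.0, 21.0, 30.0, 41.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(rooms, price)[0, 1]

print("Pearson correlation: %.3f" % r)
```

With the data frame from this notebook, `bos.corr()` would give the full matrix of pairwise correlations in one call.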

<h4> Predicting Prices </h4> To calculate the predicted prices ($\hat{Y}_i$) we use <i>lm.predict</i>. We then print the first 5 housing prices predicted by our model and draw a scatter plot to compare the true prices with the predicted prices.

In [None]:
lm.predict(X)[0:5]

In [None]:
plt.scatter(bos.PRICE, lm.predict(X))
plt.xlabel("Prices: $Y_i$")
# Raw strings keep the backslash in \hat from being treated as an escape sequence
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted Prices: $Y_i$ vs $\hat{Y}_i$")
plt.show()

In [None]:
# The prediction error appears to grow as the housing prices increase.
# Let's calculate the mean squared error.
mseFull = np.mean((bos.PRICE - lm.predict(X))** 2)
print(mseFull)

If we fit a linear regression on only <b>one feature</b>, however, the error will be much higher. Let's take the feature ‘PTRATIO’ and calculate the mean squared error.

In [None]:
lm = LinearRegression()
lm.fit(X[['PTRATIO']], bos.PRICE)

msePTRATIO = np.mean((bos.PRICE - lm.predict(X[['PTRATIO']]))** 2)
print(msePTRATIO)

The <b>mean squared error</b> has increased, which shows that this single feature alone is not a good predictor of housing prices.
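
The same comparison can be reproduced on synthetic data to show why this happens: with ordinary least squares, a model fitted on a subset of the features cannot have a lower training error than the model fitted on all of them. The sketch below is illustrative only, and the helper `fit_mse` and the generated data are assumptions, not part of the Boston analysis:

```python
import numpy as np

def fit_mse(X, y):
    """Fit ordinary least squares with an intercept and return the training MSE."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ beta) ** 2)

rng = np.random.default_rng(42)

# Synthetic target that depends on all three features, plus noise
X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + noise

mse_full = fit_mse(X, y)            # all features
mse_single = fit_mse(X[:, [2]], y)  # only the third feature

print("Full model MSE:     %.3f" % mse_full)
print("Single feature MSE: %.3f" % mse_single)
```

Both numbers are training errors. A lower training error does not by itself guarantee better predictions on unseen data, so a held-out test set is needed to judge that.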

<b> To-Do 1: Make a train-test split and calculate the mean squared error for training data and test data. </b>

<b> To-Do 2: Plot the residuals for training and test datasets </b>
