## Supervised Learning - Linear Regression

In [None]:
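import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn release
from sklearn.datasets import load_boston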

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the dataset of house prices in Boston

In [None]:
boston = load_boston()
print(boston.DESCR)

#### Import the data into Pandas dataframe

In [None]:
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['target'] = boston.target
print(df.shape)

In [None]:
df

Plot the correlation between different descriptors in the data

In [None]:
sns.set(style='whitegrid', context='notebook')
sns.heatmap(df.corr())

### Build model for linear regression

First, extract the feature matrix and label vector from the dataframe. This first model uses a single feature, the average number of rooms per dwelling (`RM`). Split the dataset into training and testing subsets, using 80% of the data for training and 20% for testing.

In [None]:
X = df.iloc[:, :13].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df['target']
rooms = df['RM']

The `train_test_split` function in the `sklearn.model_selection` module splits arrays or matrices into random train and test subsets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(rooms, y, train_size=0.8, random_state=1)
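
The 80/20 split can be checked on a toy array before applying it to the real data. The sketch below uses hypothetical arrays (`X_toy`, `y_toy`) that are not part of the notebook; it only illustrates how `train_size=0.8` divides the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, one feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 2.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=1
)

print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows
# Every sample lands in exactly one of the two subsets
print(sorted(np.concatenate([X_tr, X_te]).ravel().tolist()))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.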

Construct a linear regression model with default parameters

In [None]:
model1 = LinearRegression()
model1

Fit the model to the training dataset

In [None]:
model1.fit(X_train.values.reshape(-1,1),y_train)
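
The `reshape(-1,1)` call is needed because scikit-learn estimators expect a 2D feature array of shape `(n_samples, n_features)`, while a single pandas column is one-dimensional. A minimal sketch with a hypothetical series `s` (not the notebook's data) shows the difference, along with `to_frame()` as one possible alternative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical single-feature data lying exactly on y = 2x + 1
s = pd.Series([1.0, 2.0, 3.0, 4.0], name="feature")
target = 2.0 * s + 1.0

print(s.values.shape)                 # one-dimensional: (4,)
print(s.values.reshape(-1, 1).shape)  # four rows, one column: (4, 1)

# to_frame() gives the same 2D layout without an explicit reshape
toy_model = LinearRegression().fit(s.to_frame(), target)
print(toy_model.coef_, toy_model.intercept_)
```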

Predict the labels for the test dataset using the trained model

In [None]:
model1_predict = model1.predict(X_test.values.reshape(-1,1))

Plot the train, test, and predicted values

In [None]:
plt.scatter(X_train,y_train,color='grey')
plt.scatter(X_test,y_test,color='black')
plt.scatter(X_test,model1_predict,color='red')

Print the coefficient and the intercept of the regression line, predict the price for a dwelling with 8 rooms, and calculate the RMSE between the test labels and the corresponding predictions.

In [None]:
print(model1.coef_, model1.intercept_)
# predict() expects a 2D array of shape (n_samples, n_features)
print(model1.predict(np.array([[8]])))
RMSE_linear_rooms = np.sqrt(metrics.mean_squared_error(y_test, model1_predict))
print(RMSE_linear_rooms)
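
RMSE is the square root of the mean squared difference between true and predicted values. As a rough check of the calculation, the sketch below computes it by hand on hypothetical arrays (`y_true`, `y_pred` are made-up numbers, not results from this notebook) and compares it with the scikit-learn route.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE by hand: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn (y_true first, y_pred second)
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

RMSE is expressed in the same units as the target, which makes it easier to interpret than the raw mean squared error.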

### Build model using all the features

Arrange the dataframe as feature matrix and label vectors. Split the dataset into training and testing subsets using 80% of the data for training and 20% for testing.

In [None]:
X = df.iloc[:, :13].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)

Build a linear regression model with default parameters and fit it to the training data

In [None]:
model_multi = LinearRegression()
model_multi.fit(X_train,y_train)

Predict the labels for the test dataset

In [None]:
model_multi_predict = model_multi.predict(X_test)

In [None]:
RMSE_linear_multi = np.sqrt(metrics.mean_squared_error(y_test, model_multi_predict))
print(RMSE_linear_multi)

Predict the price for a hypothetical sample with `RM` = 8 and all other features set to zero

In [None]:
model_multi.predict(np.array([0,0,0,0,0,8,0,0,0,0,0,0,0]).reshape(1,-1))

In [None]:
print (model_multi.intercept_)
print (model_multi.coef_)

In [None]:
print(boston.feature_names)
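
Reading the raw `coef_` array next to a separate list of feature names is error-prone. One way to pair them is a labelled pandas Series. The sketch below assumes synthetic data and hypothetical feature names (`a`, `b`, `c`), not the Boston features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with hypothetical feature names
rng = np.random.default_rng(0)
names = ["a", "b", "c"]
X_syn = rng.normal(size=(200, 3))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 0.0 * X_syn[:, 2] + 5.0

m = LinearRegression().fit(X_syn, y_syn)

# Label each coefficient with its feature name and sort by value
coef_table = pd.Series(m.coef_, index=names).sort_values()
print(coef_table)
print("intercept:", m.intercept_)
```

Each coefficient is the change in the predicted target per unit increase of that feature with the others held fixed, so magnitudes depend on the feature's units and are not directly comparable across features.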

**Advantages of linear regression:**
  * It is simple to explain
  * Model training and prediction are fast
  * Features don't need scaling
  * Can perform well with a small number of observations
  
**Disadvantages of linear regression:**
  * Presumes a linear relationship between the features and the response
  * Performance is (generally) not competitive with the best supervised learning methods due to high bias
  * Sensitive to irrelevant features
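
The claim that features don't need scaling can be illustrated for ordinary least squares with an intercept: standardizing the features changes the coefficients but not the predictions. The sketch below is a minimal check on synthetic data (all names and values are assumptions for illustration); it does not carry over to regularized models such as Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_raw = np.column_stack([rng.normal(size=100), 1000.0 * rng.normal(size=100)])
y_raw = 2.0 * X_raw[:, 0] + 0.003 * X_raw[:, 1] + rng.normal(scale=0.1, size=100)

X_scaled = StandardScaler().fit_transform(X_raw)

# Fit once on the raw features and once on the standardized features
pred_raw = LinearRegression().fit(X_raw, y_raw).predict(X_raw)
pred_scaled = LinearRegression().fit(X_scaled, y_raw).predict(X_scaled)

# Predictions agree up to floating-point error
print(np.allclose(pred_raw, pred_scaled))
```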

### Build a polynomial model 

In [None]:
#from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

In [None]:
X = df.iloc[:, :13].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df['target']
rooms = df['RM']
X_train, X_test, y_train, y_test = train_test_split(rooms, y, train_size=0.8, random_state=1)

**Pipeline** can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. 
Pipeline serves two purposes here:

**Convenience:** You only have to call *fit* and *predict* once on your data to fit a whole sequence of estimators.

**Joint parameter selection:** You can grid search over parameters of all estimators in the pipeline at once.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
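
Joint parameter selection can be sketched with `GridSearchCV`, which searches over the polynomial degree inside a pipeline. The example uses synthetic quadratic data (an assumption for illustration, not the Boston data), and the candidate degrees 1 to 4 are an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with noise (hypothetical)
rng = np.random.default_rng(0)
x_syn = rng.uniform(-3, 3, size=200)
y_quad = 0.5 * x_syn**2 - x_syn + 2.0 + rng.normal(scale=0.5, size=200)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())

# make_pipeline names each step after its lowercased class name,
# so the degree is addressed as "polynomialfeatures__degree"
search = GridSearchCV(
    pipe,
    param_grid={"polynomialfeatures__degree": [1, 2, 3, 4]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(x_syn.reshape(-1, 1), y_quad)

best_degree = search.best_params_["polynomialfeatures__degree"]
# The score is negated RMSE, so flip the sign to report RMSE
print(best_degree, -search.best_score_)
```

Cross-validating the degree this way is one reasonable approach to choosing it, since a higher degree always fits the training data at least as well but may generalize worse.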

In [None]:
# degree=2 is a starting point; what should be the value for degree?
model_poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

In [None]:
model_poly.fit(X_train.values.reshape(-1,1),y_train)

In [None]:
model_poly_predict = model_poly.predict(X_test.values.reshape(-1,1))

In [None]:
plt.scatter(X_train,y_train,color='grey')
plt.scatter(X_test,y_test,color='black')
plt.scatter(X_test,model_poly_predict,color='red')

In [None]:
# predict() expects a 2D array of shape (n_samples, n_features)
model_poly.predict(np.array([[8]]))

In [None]:
RMSE_poly_rooms = np.sqrt(metrics.mean_squared_error(y_test, model_poly_predict))

In [None]:
print(RMSE_linear_rooms)
print(RMSE_linear_multi)
print(RMSE_poly_rooms)

### Exercise
Build a polynomial model using all the features 

In [None]:
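X = df.iloc[:, :13].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df['target']
# Split the full feature matrix X, since the exercise uses all the features
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)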

In [1]:
## Solution