# Practice Exercise: Linear Regression

## Case Study: Boston Housing Price Prediction

### Problem Statement

The goal is to predict the housing prices of a town or suburb from the features of the locality. Along the way, we need to identify the most important features in the dataset, apply data preprocessing techniques, and build a linear regression model that predicts the prices.

### Data Information

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information is given below.

Attribute Information (in order):
- CRIM:     per capita crime rate by town
- ZN:       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS:    proportion of non-retail business acres per town
- CHAS:     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX:      nitric oxides concentration (parts per 10 million)
- RM:       average number of rooms per dwelling
- AGE:      proportion of owner-occupied units built prior to 1940
- DIS:      weighted distances to five Boston employment centres
- RAD:      index of accessibility to radial highways
- TAX:      full-value property-tax rate per 10,000 dollars
- PTRATIO:  pupil-teacher ratio by town
- LSTAT:    % lower status of the population
- MEDV:     median value of owner-occupied homes in 1000 dollars

### Concepts to cover
- <a href= "#link1">1. EDA</a>
- <a href= "#link2">2. Splitting the data</a>
- <a href= "#link3">3. Modelling</a>
- <a href= "#link4">4. Bonus: Statsmodels-based implementation</a>


**Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### <a id = "link1">Load the dataset</a>

In [None]:
df = pd.read_csv("Boston.csv")
df.head()

**Check the shape of the dataset**

In [None]:
df.shape

**Get the data type and non-null count of each column**

In [None]:
df.info()

**Get the summary statistics of the numerical columns of the dataset**

In [None]:
df.describe()

### Univariate and Bivariate Analysis
To do: identify insights, if any, from the distributions.

In [None]:
# Plot every column to look at its distribution
for i in df.columns:
    plt.figure(figsize = (10,10))
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[i], kde=True)
    plt.show()

In [None]:
# Bivariate scatterplots of price (MEDV) against each feature
for i in df.columns.drop('MEDV'):
    plt.figure(figsize = (10,10))
    sns.scatterplot(x = df[i], y = df['MEDV'])
    plt.show()

**Get the Correlation Heatmap**

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='rainbow')
plt.show()
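
The heatmap shows every pairwise correlation. For the stated goal of identifying the most important features, it can help to rank the features by the absolute value of their correlation with the target. The following is a minimal sketch on a small synthetic frame: the `toy` data, its coefficients, and the `NOISE` column are made up for illustration and are not drawn from the Boston data. On the real data, the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (illustrative assumption only)
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6, 1, n)        # plays the role of a strongly related feature
lstat = rng.normal(12, 5, n)    # plays the role of a negatively related feature
noise = rng.normal(0, 1, n)     # unrelated to the target by construction
medv = 5 * rm - 0.8 * lstat + rng.normal(0, 1, n)

toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "NOISE": noise, "MEDV": medv})

# Correlation of every feature with the target, target itself excluded
corr_with_target = toy.corr()["MEDV"].drop("MEDV")

# Rank by strength of linear association, ignoring the sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

This ranking only reflects linear, one-feature-at-a-time association, so it is a first screen rather than a final measure of feature importance.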

### <a id = "link2">Split the dataset</a>
Let's separate the dependent variable from the independent variables, then split both into train and test sets in a 70:30 ratio.

In [None]:
Y = df['MEDV']
# pass the column names as a list rather than a set
X = df.drop(columns = ['MEDV'])

In [None]:
#splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

### <a id = "link3">Using the Linear Model from the scikit-learn library</a>

**Fit the model to the training set**

In [None]:
# initialise the model and fit it on the training data
regression_model = LinearRegression(fit_intercept=True)
regression_model.fit(X_train, y_train)

**Get the score on training set**

In [None]:
# get the R-squared score on the training data

print('The coefficient of determination R^2 of the prediction on Train set', regression_model.score(X_train, y_train))

In [None]:
# write your own R-squared function and check it on the training data

def r_squared(model, X, y):
    # use the arguments passed in, not the global training variables
    y_mean = y.mean()
    SST = ((y - y_mean)**2).sum()             # total sum of squares
    SSE = ((y - model.predict(X))**2).sum()   # sum of squared errors (residuals)
    r_square = 1 - SSE/SST
    return SSE, SST, r_square
    
SSE, SST, r_square = r_squared(regression_model, X_train, y_train)
print("SSE: ", SSE)
print("SST: ", SST)
print("r_square: ", r_square)
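
For reference, the quantities computed in the function above correspond to the usual definition of the coefficient of determination, where $y_i$ are the observed values, $\hat{y}_i$ the model predictions, $\bar{y}$ the mean of the observed values, and $n$ the number of observations:

$$
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
$$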

**Get the score on test set**

In [None]:
print('The coefficient of determination R^2 of the prediction on Test set',regression_model.score(X_test, y_test))

**Get the RMSE on test set**

In [None]:
print("The Root Mean Square Error (RMSE) of the model for the testing set is",np.sqrt(mean_squared_error(y_test,regression_model.predict(X_test))))
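
As a sanity check on what the line above computes, RMSE is the square root of the mean squared residual. Here is a small hand-checkable sketch with made-up numbers (the two arrays are illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical observed and predicted prices, in thousands of dollars
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 35.4])

residuals = y_true - y_pred          # approximately [-1, 1, 2, -2]
mse = np.mean(residuals ** 2)        # (1 + 1 + 4 + 4) / 4 = 2.5
rmse = np.sqrt(mse)                  # square root brings the error back to the units of y

print("MSE:", mse)
print("RMSE:", rmse)
```

Because MEDV is measured in thousands of dollars, the RMSE is expressed in the same units.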

**Get model Coefficients**

In [None]:
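# collect the fitted coefficients and the intercept in one table
coeff_data = pd.DataFrame()
coeff_data['Coefs'] = regression_model.coef_
coeff_data['Feature'] = X_train.columns
# DataFrame.append was removed in pandas 2.0, so add the intercept row with pd.concat
intercept_row = pd.DataFrame({'Coefs': [regression_model.intercept_], 'Feature': ["Intercept"]})
coeff_data = pd.concat([coeff_data, intercept_row], ignore_index = True)
coeff_data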

In [None]:
# Let us write the equation of the fit
Equation = "Price ="
print(Equation, end='\t')
n_rows = len(coeff_data)  # one row per feature, plus the intercept in the last row
for i in range(n_rows):
    if i != n_rows - 1:
        print("(",coeff_data.iloc[i].Coefs,")", "*", coeff_data.iloc[i].Feature, "+", end = '  ')
    else:
        print(coeff_data.iloc[i].Coefs)

## <a id = "link4">Bonus: Using Statsmodels OLS</a>

In [None]:
# This adds the constant term beta0 to the Linear Regression.
X_con=sm.add_constant(X)
X_trainc, X_testc, y_trainc, y_testc = train_test_split(X_con, Y, test_size=0.30 , random_state=1)
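
To see why the constant column matters, here is a minimal NumPy sketch that fits least squares with and without a column of ones. It is not statsmodels itself, and the toy data is made up. `sm.OLS` does not add an intercept on its own, which is why `sm.add_constant` is applied first.

```python
import numpy as np

# Toy data lying exactly on the line y = 2 + 3x (illustrative assumption)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a leading column of ones, the role the constant term plays
X_with_const = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)
print("with constant    -> intercept, slope:", beta)

# Without the column of ones, the fitted line is forced through the origin
beta_no_const, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print("without constant -> slope only:", beta_no_const)
```

With the constant column, the fit recovers both the intercept and the slope. Without it, the slope absorbs the missing intercept and is biased.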

**Make the linear model using OLS**

In [None]:
model = sm.OLS(y_trainc,X_trainc).fit()
model.summary()

**Get the value of coefficient of determination**

In [None]:
print('The percentage of variation in the dependent variable that is explained by the independent variables is','\n',
      model.rsquared*100,'%')

**Get the Predictions on test set**

In [None]:
ypred = model.predict(X_testc)

**Calculate MSE for training set**

In [None]:
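# model.mse_model is the explained sum of squares divided by the model degrees of freedom,
# not the prediction error; the training MSE is the mean of the squared residuals
mse = np.mean(model.resid**2)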
mse

**Get the RMSE on training set**

In [None]:
print("The Root Mean Square Error (RMSE) of the model for Training set is",np.sqrt(mse))

**Get the RMSE on test set**

In [None]:
print("The Root Mean Square Error (RMSE) of the model for the testing set is",np.sqrt(mean_squared_error(y_testc,ypred)))