# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is a staple example in regression courses: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
- Display the intercept and coefficient(s)
- Find the R-squared and Adjusted R-squared
- Compare the R-squared and the Adjusted R-squared
- Compare the R-squared of this regression and the simple linear regression where only 'size' was used
- Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
- Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
- Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

## Load the data

In [2]:
# Assumes the CSV sits in the same folder as the notebook; adjust the path if it lives elsewhere
file_path = 'real_estate_price_size_year.csv'
data = pd.read_csv(file_path)

In [3]:
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [4]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [7]:
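# Column names: the target and the two features
yname,x1name,x2name = 'price','size','year'
y = data[yname]              # dependent variable
x1 = data[[x1name,x2name]]   # features for the multiple regression: size and year
x2 = data[[x1name]]          # feature for the simple regression: size only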

### Regression

In [8]:
mulreg = LinearRegression() # multiple regression
reg = LinearRegression() # simple regression
mulreg.fit(x1,y)
reg.fit(x2,y)

LinearRegression()

### Find the intercept

In [12]:
# intercepts
print('Simple Regression',str(reg.intercept_))
print('Multiple Regression',str(mulreg.intercept_))

Simple Regression 101912.60180122912
Multiple Regression -5772267.017463282


### Find the coefficients

In [11]:
# Coefficients
print('Simple Regression',str(reg.coef_))
print('Multiple Regression',str(mulreg.coef_))

Simple Regression [223.17874259]
Multiple Regression [ 227.70085401 2916.78532684]
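
Putting the intercept and the two coefficients together, the fitted multiple regression can be written out explicitly. The values below are the printed ones rounded to two decimals, and the order of the coefficients is assumed to follow the column order of `x1` (size first, then year):

```latex
\widehat{\text{price}} = -5772267.02 + 227.70 \cdot \text{size} + 2916.79 \cdot \text{year}
```

Read this way, each additional square foot is associated with roughly 228 more in predicted price, holding the year fixed, and each later year with roughly 2917 more, holding the size fixed.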


### Calculate the R-squared

In [17]:
# R-Squared
print('Simple Regression',str(reg.score(x2,y)))
print('Multiple Regression',str(mulreg.score(x1,y)))

Simple Regression 0.7447391865847587
Multiple Regression 0.7764803683276791


### Calculate the Adjusted R-squared

In [20]:
simp_adj_R2 = 1 - (1-reg.score(x2,y))*(x2.shape[0]-1)/(x2.shape[0]-x2.shape[1]-1)
mul_adj_R2 = 1 - (1-mulreg.score(x1,y))*(x1.shape[0]-1)/(x1.shape[0]-x1.shape[1]-1)

In [21]:
print('Simple Regression',str(simp_adj_R2))
print('Multiple Regression',str(mul_adj_R2))

Simple Regression 0.7421344844070521
Multiple Regression 0.7718717161282498
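
The two expressions above apply the same formula, Adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch that wraps it in a reusable function follows. The name `adjusted_r2` is an assumption for illustration, and the inputs are copied from the printed outputs rather than recomputed from the data.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values and sample size copied from the outputs above
simple_adj = adjusted_r2(0.7447391865847587, n=100, p=1)
multiple_adj = adjusted_r2(0.7764803683276791, n=100, p=2)
print(round(simple_adj, 6), round(multiple_adj, 6))
```

With a function like this, the notebook could call `adjusted_r2(reg.score(x2, y), *x2.shape)` instead of repeating the arithmetic for each model.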


### Compare the R-squared and the Adjusted R-squared

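For both models the Adjusted R-squared is slightly lower than the R-squared: 0.742 vs. 0.745 for the simple regression and 0.772 vs. 0.776 for the multiple regression. The gap is small, which suggests that, with 100 observations, the penalty for the number of variables is minor.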

In [31]:
compare = pd.DataFrame({'R2':[reg.score(x2,y),mulreg.score(x1,y)],'Adjusted R2':[simp_adj_R2,mul_adj_R2]})
compare.rename(index={0:'Simple',1:'Multiple'})

| | R2 | Adjusted R2 |
|---|---|---|
| Simple | 0.744739 | 0.742134 |
| Multiple | 0.77648 | 0.771872 |


### Compare the Adjusted R-squared with the R-squared of the simple linear regression

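The Adjusted R-squared of the multiple regression (0.772) is higher than the R-squared of the simple regression that uses only 'size' (0.745). Adding 'year' therefore improves the explanatory power of the model, but only modestly (by roughly 0.03), even after the penalty for the extra variable.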

In [32]:
compare = pd.DataFrame({'R2':[reg.score(x2,y),mulreg.score(x1,y)],'Adjusted R2':[simp_adj_R2,mul_adj_R2]})
compare.rename(index={0:'Simple',1:'Multiple'})

| | R2 | Adjusted R2 |
|---|---|---|
| Simple | 0.744739 | 0.742134 |
| Multiple | 0.77648 | 0.771872 |


### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [39]:
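# Pass a DataFrame with the training column names; a bare list triggers a feature-names warning in recent scikit-learn versions
mulreg.predict(pd.DataFrame([[750,2009]],columns=x1.columns))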

array([258330.34465995])
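
As a sanity check, the same prediction can be reproduced by hand from the intercept and coefficients printed earlier. This is only a sketch: the numbers are copied from the outputs (so the coefficients carry the rounding of the printout), and `predict_price` is a hypothetical helper name.

```python
# Values copied from the intercept and coefficient outputs above
intercept = -5772267.017463282
coef_size, coef_year = 227.70085401, 2916.78532684

def predict_price(size: float, year: int) -> float:
    """Linear prediction: intercept + coef_size * size + coef_year * year."""
    return intercept + coef_size * size + coef_year * year

manual = predict_price(750, 2009)
print(round(manual, 2))
```

The result should agree with the array returned by `mulreg.predict`, up to the rounding of the printed coefficients.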

### Calculate the univariate p-values of the variables

In [40]:
from sklearn.feature_selection import f_regression

In [41]:
p_values = f_regression(x1,y)[1].round(3)
p_values

array([0.   , 0.357])

In [42]:
f_statistic = f_regression(x1,y)[0].round(3)
f_statistic

array([285.921,   0.855])
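
The univariate F-statistic for 'size' can be cross-checked against the simple regression. The sketch below assumes that `f_regression`, with its default settings, derives F from the squared correlation between one feature and the target using n - 2 degrees of freedom, and that this squared correlation equals the R-squared of the one-feature regression. The helper name `univariate_f` is hypothetical.

```python
def univariate_f(r2: float, n: int) -> float:
    """F-statistic of a one-feature regression with n observations."""
    return r2 / (1 - r2) * (n - 2)

# R-squared of the simple regression (size only), copied from the output above
f_size = univariate_f(0.7447391865847587, n=100)
print(round(f_size, 3))
```

Under those assumptions the value lines up with the first entry of `f_statistic`, which ties the univariate test for 'size' directly to the fit of the simple regression.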

### Create a summary table with your findings

In [44]:
mulreg_summary = pd.DataFrame(data = x1.columns.values,columns = ['Features'])
mulreg_summary

| | Features |
|---|---|
| 0 | size |
| 1 | year |


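The summary below adds the coefficients, univariate p-values and F-statistics to the feature list. The p-value of 'size' rounds to 0.000, so 'size' is a significant predictor of price. The univariate p-value of 'year' is 0.357, so 'year' on its own is not significant at the usual 5% level.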

In [45]:
mulreg_summary['Coefficients'] = mulreg.coef_
mulreg_summary['p-values'] = p_values.round(3)
mulreg_summary['F-Statistic'] = f_statistic.round(3)
mulreg_summary

| | Features | Coefficients | p-values | F-Statistic |
|---|---|---|---|---|
| 0 | size | 227.700854 | 0.0 | 285.921 |
| 1 | year | 2916.785327 | 0.357 | 0.855 |
