# Multiple Linear Regression with sklearn - Exercise

You are given a real estate dataset. 

Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [47]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [48]:
dataset = pd.read_csv('real_estate_price_size_year.csv')
dataset.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [49]:
dataset.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 100.0 | 292289.47016 | 77051.727525 | 154282.128 | 234280.148 | 280590.716 | 335723.696 | 500681.128 |
| size | 100.0 | 853.0242 | 297.941951 | 479.75 | 643.33 | 696.405 | 1029.3225 | 1842.51 |
| year | 100.0 | 2012.6 | 4.729021 | 2006.0 | 2009.0 | 2015.0 | 2018.0 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

I want to predict the price of a house based on its size and year.

In [50]:
y = dataset['price']
x = dataset[['size', 'year']]

### Regression

In [51]:
model = LinearRegression()
model.fit(x, y)

### Find the intercept

In [52]:
model.intercept_

-5772267.017463277

### Find the coefficients

In [53]:
model.coef_

array([ 227.70085401, 2916.78532684])
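
Combined with the intercept from the previous cell, these coefficients give the fitted equation below. The values are rounded to two decimals, and the first coefficient belongs to 'size' and the second to 'year', following the column order of `x`:

$
\widehat{\text{price}} = -5772267.02 + 227.70 \cdot \text{size} + 2916.79 \cdot \text{year}
$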

### Calculate the R-squared

In [54]:
r2 = model.score(x, y)
r2

0.7764803683276794

### Calculate the Adjusted R-squared

Adjusted R-squared formula:

$
R^2_{adj.} = 1 - (1 - R^2) \cdot \frac{n-1}{n-p-1}
$

Where:
- n = number of observations (100)
- p = number of predictors (2)
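
As a worked check, plugging in the R-squared obtained earlier (rounded to 0.7765) with n = 100 and p = 2 gives:

$
R^2_{adj.} = 1 - (1 - 0.7765) \cdot \frac{100 - 1}{100 - 2 - 1} = 1 - 0.2235 \cdot \frac{99}{97} \approx 0.7719
$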


In [55]:
n = x.shape[0]
p = x.shape[1]

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
adjusted_r2

0.7718717161282501

In [56]:
adjusted_r2 - r2

-0.004608652199429297

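The Adjusted R-squared is only about 0.005 lower than the R-squared, so the penalty for using two predictors is small and the model's explanatory power stays high. This difference alone does not tell us which of the predictors contributes less.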

### Compare the R-squared and the Adjusted R-squared

R-Squared = 0.7764803683276794
Adj. R-Squared = 0.7718717161282501

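The difference between these values is small (about 0.005), so the model is barely penalized for the number of predictors. This is consistent with the variables being useful, but it does not by itself show that each variable has high explanatory power.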

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

In [57]:
# The R-squared of the simple linear regression has not been computed yet, so we fit that model first

x_slr = dataset['size']

In [58]:
x_slr_matrix = x_slr.values.reshape(-1, 1)
x_slr_matrix.shape

(100, 1)

In [59]:
model_slr = LinearRegression()
model_slr.fit(x_slr_matrix, y)

In [60]:
r2_slr = model_slr.score(x_slr_matrix, y)
r2_slr

0.7447391865847587

In [61]:
adjusted_r2 - r2_slr

0.027132529543491435

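The Adjusted R-squared of the multiple linear regression (0.772) is about 0.027 higher than the R-squared of the simple linear regression (0.745). Because the Adjusted R-squared penalizes extra predictors, this increase suggests that adding 'year' improves the model, although only slightly. A plain R-squared never decreases when a variable is added, so a higher R-squared on its own would not show that the extra variable is useful.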
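
To compare like with like, the same adjustment can also be applied to the simple regression. The sketch below is a minimal illustration: the helper name `adjusted_r_squared` is hypothetical, and the R-squared values are copied from the outputs above rather than recomputed from the data.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared values copied from the notebook outputs above
r2_mlr = 0.7764803683276794   # size + year
r2_slr = 0.7447391865847587   # size only
n_obs = 100

adj_mlr = adjusted_r_squared(r2_mlr, n_obs, n_predictors=2)
adj_slr = adjusted_r_squared(r2_slr, n_obs, n_predictors=1)

print(f"Adjusted R-squared (size + year): {adj_mlr:.4f}")
print(f"Adjusted R-squared (size only):   {adj_slr:.4f}")
print(f"Difference:                       {adj_mlr - adj_slr:.4f}")
```

With these inputs the adjusted value for the simple regression comes out at roughly 0.742, so the gap to the multiple regression is about 0.030 once both models are penalized for their number of predictors.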

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [62]:
new_data = pd.DataFrame(data=[[750, 2009]], columns=['size', 'year'])
new_data

| | size | year |
|---|---|---|
| 0 | 750 | 2009 |


In [63]:
model.predict(new_data)

array([258330.34465995])
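
The prediction can be reproduced by hand from the intercept and coefficients printed earlier, which makes explicit what `model.predict` does for a linear model. This is a minimal sketch: the function name `predict_price` is hypothetical, and the numbers are copied from the `model.intercept_` and `model.coef_` outputs, so the result matches only up to the rounding of those printed values.

```python
# Values copied from the model.intercept_ and model.coef_ outputs above
intercept = -5772267.017463277
coef_size, coef_year = 227.70085401, 2916.78532684

def predict_price(size, year):
    """Apply the fitted linear equation: intercept + b1 * size + b2 * year."""
    return intercept + coef_size * size + coef_year * year

manual_prediction = predict_price(750, 2009)
print(round(manual_prediction, 2))
```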

In [64]:
combined = new_data.join(pd.DataFrame({'price': model.predict(new_data)}))
combined

| | size | year | price |
|---|---|---|---|
| 0 | 750 | 2009 | 258330.34466 |


### Calculate the univariate p-values of the variables

In [65]:
from sklearn.feature_selection import f_regression

In [66]:
f_regression(x, y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [67]:
p_values = f_regression(x, y)[1]
p_values.round(6)

array([0.      , 0.357341])

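These are univariate p-values: `f_regression` tests each feature against the price on its own. The p-value for 'size' is effectively 0, so 'size' is clearly significant. The p-value for 'year' is about 0.357, well above the usual 0.05 threshold, so 'year' on its own does not appear to be a significant predictor of price, even though adding it raised the Adjusted R-squared slightly.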

### Create a summary table with your findings

In [68]:
summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
summary

| | Features |
|---|---|
| 0 | size |
| 1 | year |


In [69]:
summary['Coefficients'] = model.coef_
summary['p-values'] = p_values.round(6)
summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 227.700854 | 0.0 |
| 1 | year | 2916.785327 | 0.357341 |


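Adding 'year' raises the Adjusted R-squared of the multiple regression only slightly above the R-squared of the simple regression (by about 0.027), and the univariate p-value of 'year' (0.357) is well above 0.05. Taken together, these results suggest that 'year' contributes little on its own, while 'size' is clearly the dominant predictor of price.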