# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset. 

Real estate is a staple example in regression courses because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. 

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [26]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

## Load the data

In [3]:
pathfile = 'real_estate_price_size_year.csv'
df = pd.read_csv(pathfile)

In [4]:
df.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Create the regression

### Declare the dependent and the independent variables

In [7]:
y = df['price']
x1 = df[['size','year']]

### Regression

In [8]:
results = LinearRegression()
results.fit(x1,y)

LinearRegression()

### Find the intercept

In [9]:
results.intercept_

-5772267.017463278

### Find the coefficients

In [10]:
results.coef_

array([ 227.70085401, 2916.78532684])

### Calculate the R-squared

In [11]:
results.score(x1,y)

0.7764803683276796

### Calculate the Adjusted R-squared

In [15]:
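def score_adjusted(x, y):
    # Fit the model and compute the ordinary R-squared
    r = LinearRegression()
    r.fit(x, y)
    r2 = r.score(x, y)
    n = x.shape[0]  # number of observations
    p = x.shape[1]  # number of predictors

    # The adjusted R-squared penalizes each additional predictor
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

In [16]:
score_adjusted(x1, y)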

0.7718717161282503
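
The adjustment can also be checked by hand from the reported R-squared. The sketch below assumes the standard formula `1 - (1 - R2) * (n - 1) / (n - p - 1)`, with n = 100 observations and p = 2 predictors taken from the outputs above. The helper name `adjusted_r2` is hypothetical and not part of the original notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values reported above: R-squared of the two-variable model, 100 rows, 2 features
r2_multi = 0.7764803683276796
adj_multi = adjusted_r2(r2_multi, n=100, p=2)
print(round(adj_multi, 4))
```

The result should agree with the value returned by the notebook's own function, since both apply the same formula to the same R-squared.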

### Compare the R-squared and the Adjusted R-squared

Answer... The R-squared (0.776) and the Adjusted R-squared (0.772) are very close. Therefore, having two independent variables is not penalizing the model much.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

In [21]:
xs = df['size'].values.reshape(-1,1)

resultss1 = LinearRegression()
resultss1.fit(xs,y)
resultss1.score(xs,y)

0.7447391865847587

In [22]:
xs2 = df['year'].values.reshape(-1,1)
resultss2 = LinearRegression()
resultss2.fit(xs2,y)
resultss2.score(xs2,y)

0.008651618660186267

Answer... The multivariable model (R-squared 0.776, Adjusted R-squared 0.772) outperforms both single-variable models, whose R-squared values are 0.745 for size alone and 0.009 for year alone. In addition, the year variable on its own explains almost none of the variation in price.
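
Because the models have different numbers of predictors, one way to compare them on equal footing is to adjust each R-squared for its own predictor count. This is a minimal sketch that uses only the scores printed above; n = 100 is taken from `df.describe()`, and the dictionary and variable names are illustrative.

```python
n = 100  # number of observations, from df.describe()

# (R-squared, number of predictors) as reported in the cells above
models = {
    "size + year": (0.7764803683276796, 2),
    "size only": (0.7447391865847587, 1),
    "year only": (0.008651618660186267, 1),
}

# Apply the adjusted R-squared formula to each model
adjusted = {
    name: 1 - (1 - r2) * (n - 1) / (n - p - 1)
    for name, (r2, p) in models.items()
}

for name, value in adjusted.items():
    print(f"{name}: {value:.4f}")

# Model with the highest adjusted R-squared
best = max(adjusted, key=adjusted.get)
```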

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [25]:
results.predict([[750,2009]])

array([258330.34465995])
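
The prediction is the intercept plus each coefficient times its feature value. As a sanity check, the sketch below reproduces it from the intercept and the rounded coefficients printed earlier, so it should agree with `results.predict` only up to rounding. The variable names are illustrative.

```python
# Intercept and coefficients as printed by the fitted model above
intercept = -5772267.017463278
coef_size, coef_year = 227.70085401, 2916.78532684

# Apartment of 750 sq.ft. from 2009
size, year = 750, 2009
manual_prediction = intercept + coef_size * size + coef_year * year
print(round(manual_prediction, 2))
```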

### Calculate the univariate p-values of the variables

In [30]:
p_values = f_regression(x1,y)[1].round(3)

In [32]:
p_values

array([0.   , 0.357])
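
For a single feature, the univariate F-statistic can be recovered from that feature's simple-regression R-squared. The sketch below assumes `f_regression` with its default centering, where `F = r2 / (1 - r2) * (n - 2)` and `r2` is the squared correlation between the feature and the target (equal to the R-squared of a one-variable regression). It uses the R-squared values of the 'size' and 'year' models reported earlier; the helper `univariate_f` is hypothetical, and converting F to a p-value requires the F distribution, which is not shown here.

```python
def univariate_f(r2, n):
    """F-statistic of a one-feature regression with n observations."""
    return r2 / (1 - r2) * (n - 2)

n = 100  # number of observations, from df.describe()
f_size = univariate_f(0.7447391865847587, n)
f_year = univariate_f(0.008651618660186267, n)
print(f"F(size) = {f_size:.2f}, F(year) = {f_year:.2f}")
```

A large F for size and an F below 1 for year are in line with the p-values above: near zero for size and 0.357 for year.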

### Create a summary table with your findings

In [39]:
df_summary = pd.DataFrame(data=x1.columns.values, columns=['Features'])
df_summary['Coefficients'] = results.coef_
df_summary['p_value'] = p_values
df_summary

| | Features | Coefficients | p_value |
|---|---|---|---|
| 0 | size | 227.700854 | 0.0 |
| 1 | year | 2916.785327 | 0.357 |


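Answer... Size is highly significant (p-value of 0.000), whereas year is not significant on its own (univariate p-value of 0.357). By this criterion, year should be removed from the model. Note that these are univariate p-values, so each variable is tested separately, without accounting for the other.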