# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is an example that nearly every regression course covers, because it is easy to understand and there is almost always a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

sns.set()

## Load the data

In [9]:
data = pd.read_csv('real_estate_price_size_year.csv')

In [15]:
data.head()

Unnamed: 0,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009


## Create the regression

### Declare the dependent and the independent variables

In [7]:
y=data['price']
x=data[['size','year']]

### Scale the inputs

In [16]:
scaler=StandardScaler()
scaler.fit(x)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [17]:
x_scaled=scaler.transform(x)
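
As a rough illustration of what the scaler does, the sketch below standardizes only the five rows shown in the data preview and checks the result against the manual formula (value minus column mean, divided by column standard deviation). The names `x_demo`, `scaler_demo` and `x_manual` are illustrative and not part of the exercise, and the numbers differ from the full-data result because only five rows are used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five (size, year) rows from the data preview; illustrative subset only
x_demo = np.array([
    [643.09, 2015],
    [656.22, 2009],
    [487.29, 2018],
    [1504.75, 2015],
    [1275.46, 2009],
])

scaler_demo = StandardScaler()
x_demo_scaled = scaler_demo.fit_transform(x_demo)

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0, which is NumPy's default)
x_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print(np.allclose(x_demo_scaled, x_manual))
print(scaler_demo.mean_, scaler_demo.scale_)
```

After fitting, the learned column means and standard deviations are stored in `mean_` and `scale_`, which is why the same scaler object can later transform new observations consistently.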

Select all rows of the first column of x_scaled (the standardized 'size'):

In [40]:
x_scaled[:,0]

array([-0.70816415, -0.66387316, -1.23371919,  2.19844528,  1.42498884,
       -0.937209  , -0.95171405, -0.78328682, -0.57603328, -0.53467702,
        0.69939906,  3.33780001, -0.53467702,  0.52699137,  1.51100715,
        1.77668568, -0.54810263, -0.77276222, -0.58004747,  0.58943055,
       -0.78365788, -1.02322731,  1.19557293, -1.12884431, -1.10378093,
        0.84424715, -0.95171405,  1.62279723, -0.58004747,  2.17014356,
        0.5306345 , -0.58004747, -0.8606021 , -1.10378093,  0.015233  ,
       -0.77603429, -0.10057126, -0.95387294, -0.56517136, -0.5219598 ,
        0.56983186, -0.57603328, -0.10057126,  1.62279723,  0.69939906,
       -0.5219598 , -0.7415595 , -0.5219598 , -0.7415595 , -0.79600403,
       -0.69328805,  0.56983186,  0.56983186, -0.42214483, -0.69328805,
        2.21224194,  0.6039356 ,  1.45329055, -0.08495304, -0.95751607,
       -0.08387359, -0.52125142,  1.18939985,  0.56983186, -0.56517136,
       -0.08748299,  0.52699137, -1.02285625, -0.56517136,  2.17

### Regression

In [24]:
reg= LinearRegression()
reg.fit(x_scaled,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Find the intercept

In [25]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [26]:
reg.coef_

array([67501.57614152, 13724.39708231])
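
Because the inputs are standardized, each coefficient is the expected change in price for a one-standard-deviation increase in that feature, and the intercept equals the mean of the target. The sketch below, on synthetic data, shows one way to convert such coefficients back to original units. All data values and the names `scaler_demo`, `reg_scaled` and `reg_raw` are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical relationship, not the real estate file)
rng = np.random.default_rng(0)
n = 100
size = rng.uniform(400, 1600, n)
year = rng.choice([2006, 2009, 2012, 2015, 2018], n)
X = np.column_stack([size, year])
price = 200.0 * size + 3000.0 * year + rng.normal(0, 20000, n)

# Fit on standardized inputs, as in the exercise
scaler_demo = StandardScaler().fit(X)
reg_scaled = LinearRegression().fit(scaler_demo.transform(X), price)

# Convert back to original units:
#   coefficient per unit = standardized coefficient / feature std
#   intercept            = standardized intercept - sum(coef_per_unit * feature mean)
coef_original = reg_scaled.coef_ / scaler_demo.scale_
intercept_original = reg_scaled.intercept_ - np.sum(coef_original * scaler_demo.mean_)

# Reference: the same regression fitted directly on the raw inputs
reg_raw = LinearRegression().fit(X, price)

print(np.allclose(coef_original, reg_raw.coef_))
print(np.isclose(reg_scaled.intercept_, price.mean()))
```

Standardization changes the scale of the coefficients but not the fitted model, so the R-squared and the predictions are the same either way.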

### Calculate the R-squared

In [27]:
reg.score(x_scaled,y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [28]:
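def adjrs2(x,y):
    # Adjusted R-squared of the already fitted model reg
    r2= reg.score(x,y)
    n=x.shape[0]  # number of observations
    p=x.shape[1]  # number of predictors
    adj_r2= 1-(1-r2)*((n-1)/(n-p-1))
    return adj_r2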

In [29]:
adjrs2(x_scaled,y)

0.77187171612825
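
The same formula can be written as a function of the R-squared, the number of observations n and the number of predictors p, so that it does not depend on a particular fitted model. In the sketch below, the function name `adjusted_r2` is illustrative, and n = 100 is an assumption: the sample size is not stated in the text, and this value was chosen only because it is consistent with the two figures reported above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_multiple = 0.7764803683276793  # R-squared reported above
n_assumed = 100                   # assumption: sample size is not stated in the text

adj_multiple = adjusted_r2(r2_multiple, n_assumed, p=2)
print(adj_multiple)
```

The penalty factor (n-1)/(n-p-1) is close to 1 when n is large relative to p, which is why the adjusted value stays close to the plain R-squared here.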

### Calculate the simple linear regression (using only 'size')

In [43]:
regg=LinearRegression()
x_Simple_LR=x_scaled[:,0].reshape(-1,1)
regg.fit(x_Simple_LR,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [44]:
regg.score(x_Simple_LR,y)

0.7447391865847587

In [45]:
adjrs2(x_scaled,y)

0.77187171612825

### Compare the R-squared and the Adjusted R-squared

The R-squared (0.776) and the Adjusted R-squared (0.772) differ only slightly, because the adjustment penalizes the model very little for the one additional feature.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Adding the year of purchase improves the model only modestly: the Adjusted R-squared of the multiple regression (0.772) is just slightly higher than the R-squared of the simple linear regression on 'size' alone (0.745).

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [55]:
new_data= pd.DataFrame(data=[[750,2009]],columns=['size','year'])
new_data

Unnamed: 0,size,year
0,750,2009


In [56]:
new_data_scaled=scaler.transform(new_data)

In [57]:
new_data_scaled

array([[-0.34752816, -0.76509206]])

In [63]:
reg.predict(new_data_scaled)

array([258330.34465995])
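
New observations must be transformed with the scaler that was fitted on the training inputs, never with a newly fitted one. One way to make this hard to get wrong is to bundle the scaler and the regression in a pipeline. The sketch below uses synthetic data. The values and the names `train`, `target`, `model`, `scaler_demo` and `reg_demo` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (hypothetical relationship, not the real estate file)
rng = np.random.default_rng(1)
n = 100
train = pd.DataFrame({
    'size': rng.uniform(400, 1600, n),
    'year': rng.choice([2006, 2009, 2012, 2015, 2018], n),
})
target = 200.0 * train['size'] + 3000.0 * train['year'] + rng.normal(0, 20000, n)

# The pipeline fits the scaler on the training inputs and reuses it in predict
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(train, target)

new_apartment = pd.DataFrame([[750, 2009]], columns=['size', 'year'])
pipeline_prediction = model.predict(new_apartment)

# Equivalent manual route, mirroring the separate steps used in the exercise
scaler_demo = StandardScaler().fit(train)
reg_demo = LinearRegression().fit(scaler_demo.transform(train), target)
manual_prediction = reg_demo.predict(scaler_demo.transform(new_apartment))

print(np.allclose(pipeline_prediction, manual_prediction))
```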

In [81]:
# predict expects a 2D array of shape (n_samples, n_features)
regg.predict([[-0.34752816]])

array([269296.65852163])

### Calculate the univariate p-values of the variables

In [48]:
from sklearn.feature_selection import f_regression

In [50]:
f_regression(x_scaled,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [61]:
p_values=f_regression(x_scaled,y)[1]
p_values=p_values.round(3)
p_values

array([0.   , 0.357])
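
These p-values are univariate: f_regression tests each feature against the target on its own, without accounting for the other feature. With its default centering, the F-statistic is derived from the Pearson correlation r as F = r^2 / (1 - r^2) * (n - 2), so the p-values should agree with those of a Pearson correlation test. The sketch below checks this on synthetic data; all values and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical relationship, not the real estate file)
rng = np.random.default_rng(2)
n = 100
size = rng.uniform(400, 1600, n)
year = rng.choice([2006, 2009, 2012, 2015, 2018], n).astype(float)
X = np.column_stack([size, year])
price = 200.0 * size + 3000.0 * year + rng.normal(0, 20000, n)

f_stats, p_vals = f_regression(X, price)

# Recompute each feature's statistics from its Pearson correlation with the target
r_vals, pearson_p = [], []
for j in range(X.shape[1]):
    r, p = pearsonr(X[:, j], price)
    r_vals.append(r)
    pearson_p.append(p)
r_vals = np.array(r_vals)
pearson_p = np.array(pearson_p)

f_manual = r_vals**2 / (1 - r_vals**2) * (n - 2)

print(np.allclose(f_stats, f_manual))
print(np.allclose(p_vals, pearson_p))
```

A multivariate alternative would test each coefficient inside the full regression, for example with the summary table of a statsmodels OLS fit.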

### Create a summary table with your findings

In [70]:
summary= pd.DataFrame(data= x.columns.values, columns=['Features'] )
summary['Coefficients']= reg.coef_
summary['p-values']= p_values

In [71]:
summary

Unnamed: 0,Features,Coefficients,p-values
0,size,67501.576142,0.0
1,year,13724.397082,0.357


The univariate p-value of 'year' (0.357) is far above the usual 0.05 threshold, so 'year' does not appear to add significant explanatory power and is a candidate for removal from the model.