# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is an example that almost every regression course covers, because it is easy to understand and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [61]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression

## Load the data

In [62]:
data = pd.read_csv('real_estate_price_size_year.csv')

## Create the regression

### Declare the dependent and the independent variables

In [63]:
x = data[['size', 'year']]
y = data['price']

### Scale the inputs

In [107]:
from sklearn import preprocessing

# Standardize every column (zero mean, unit variance) and keep the original column names
data_std = pd.DataFrame(preprocessing.scale(data), columns=data.columns)

# Standardized inputs; y_std is kept for reference, but the regression below is fitted on the unscaled y
x_std = data_std[['size','year']]
y_std = data_std['price']
data_std.head()

|   | price | size | year |
|---|---|---|---|
| 0 | -0.756211 | -0.708164 | 0.510061 |
| 1 | -0.830986 | -0.663873 | -0.765092 |
| 2 | -0.139086 | -1.233719 | 1.147638 |
| 3 | 1.421319 | 2.198445 | 0.510061 |
| 4 | 2.170269 | 1.424989 | -0.765092 |
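To make the transformation concrete, here is a minimal sketch of what standardization does to each column. The array `toy` holds made-up values standing in for 'size' and 'year' (an assumption for illustration, not the exercise data); the check compares a manual z-score with `preprocessing.scale`.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values standing in for the 'size' and 'year' columns (not the exercise data)
toy = np.array([[650.0, 2006.0],
                [800.0, 2009.0],
                [1200.0, 2015.0],
                [950.0, 2018.0]])

# z = (x - mean) / std, computed column by column with the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# preprocessing.scale applies the same transformation
scaled = preprocessing.scale(toy)

print(np.allclose(manual, scaled))   # True
print(scaled.mean(axis=0))           # approximately 0 for each column
print(scaled.std(axis=0))            # approximately 1 for each column
```

After scaling, both columns are on the same unitless scale, which is what makes the two regression coefficients directly comparable.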


### Regression

In [77]:
Reg = LinearRegression()
Reg.fit(x_std,y)

LinearRegression()

### Find the intercept

In [78]:
Reg.intercept_

292289.4701599997

### Find the coefficients

In [79]:
Reg.coef_

array([67501.57614152, 13724.39708231])

### Calculate the R-squared

In [82]:
Reg.score(x_std,y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [83]:
def adjust_r2(x,y):
    r2 = Reg.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    return 1-(1-r2)*(n-1)/(n-p-1)
adjust_r2(x_std,y)

0.77187171612825

### Compare the R-squared and the Adjusted R-squared

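The R-squared (0.776) is only slightly larger than the Adjusted R-squared (0.772). The Adjusted R-squared is never larger than the R-squared, so the ordering by itself says nothing about how well the variables were chosen; the small gap means the model was penalized very little for using 2 independent variables.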

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

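Compare the Adjusted R-squared of this model (0.772) with the R-squared of the simple linear regression that uses only 'size'. If the Adjusted R-squared is lower, 'year' does not contribute enough to justify including it; if it is higher, adding 'year' improves the model. The simple regression is not fitted in this notebook, so its R-squared has to be taken from the previous exercise.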
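One way to carry out this comparison is to fit both models side by side. The sketch below uses synthetic data with a fixed seed (hypothetical values, not the exercise file) and a hypothetical helper `adjusted_r2`; with the real data, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the exercise data (hypothetical values, fixed seed)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    'size': rng.uniform(400, 1800, n),
    'year': rng.choice([2006, 2009, 2012, 2015, 2018], n),
})
toy['price'] = 200 * toy['size'] + 3000 * (toy['year'] - 2006) + rng.normal(0, 30000, n)

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_toy = toy['price']

# Simple regression on 'size' only, and multiple regression on 'size' and 'year'
simple = LinearRegression().fit(toy[['size']], y_toy)
multiple = LinearRegression().fit(toy[['size', 'year']], y_toy)

r2_simple = simple.score(toy[['size']], y_toy)
r2_multiple = multiple.score(toy[['size', 'year']], y_toy)

print('simple R2:           ', round(r2_simple, 3))
print('multiple R2:         ', round(r2_multiple, 3))
print('multiple adjusted R2:', round(adjusted_r2(r2_multiple, n, 2), 3))
```

The plain R-squared can only go up when a variable is added, which is why the multiple model is judged by its Adjusted R-squared instead.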

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [88]:
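# Scale the new observation with the mean and std of the training inputs.
# Calling preprocessing.scale on [750, 2009] alone would standardize the two numbers against each other and return [-1, 1].
scaler = preprocessing.StandardScaler().fit(x)
new_data = pd.DataFrame([[750, 2009]], columns=x.columns)
new_data_std = pd.DataFrame(scaler.transform(new_data), columns=x.columns)
Reg.predict(new_data_std)

The value recorded when this cell scaled `[750, 2009]` by itself was `array([238512.29110079])`. That input standardizes to `[-1, 1]`, so the value equals the intercept minus the 'size' coefficient plus the 'year' coefficient (292289.47 - 67501.58 + 13724.40) and is not the predicted price of this apartment.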
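A new observation has to be scaled with the mean and standard deviation of the training inputs, not with its own. The sketch below illustrates the difference on made-up training values (`x_train` is an assumption for illustration, not the exercise data), using `StandardScaler` so the fitted statistics can be reused.

```python
import numpy as np
from sklearn import preprocessing

# Scaling a lone observation uses its own mean and std,
# so any two distinct values become [-1, 1]
print(preprocessing.scale([750.0, 2009.0]))

# Made-up training inputs with columns (size, year); not the exercise data
x_train = np.array([[650.0, 2006.0],
                    [800.0, 2009.0],
                    [1200.0, 2015.0],
                    [950.0, 2018.0]])

# Fit the scaler once on the training inputs, then reuse it for new observations
scaler = preprocessing.StandardScaler().fit(x_train)
new_std = scaler.transform([[750.0, 2009.0]])
print(new_std)
```

Keeping the fitted `scaler` object around means every later prediction goes through exactly the same transformation the model was trained on.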

### Calculate the univariate p-values of the variables

In [90]:
from sklearn.feature_selection import f_regression

In [95]:
p_value = f_regression(x_std,y)[1]
p_value.round(2)

array([0.  , 0.36])
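`f_regression` fits a separate simple regression of the target on each feature and returns two arrays, the F-statistics and the p-values. The sketch below uses synthetic data (hypothetical, fixed seed) to show the return values and to check that standardizing the features does not change the univariate p-values.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical): the target depends on the first feature only
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
target = 3.0 * X[:, 0] + rng.normal(size=100)

# f_regression returns two arrays: F-statistics and p-values, one entry per feature
f_stats, p_values = f_regression(X, target)
print(p_values.round(3))

# Standardizing the features leaves the univariate p-values unchanged
_, p_values_std = f_regression(preprocessing.scale(X), target)
print(np.allclose(p_values, p_values_std))   # True
```

Because each feature is tested on its own, these p-values ignore any interaction between 'size' and 'year'; they are a screening tool rather than a replacement for the p-values of the full multiple regression.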

### Create a summary table with your findings

In [112]:
# One row per feature, with its coefficient (on the standardized scale) and univariate p-value
summary = pd.DataFrame(data = x.columns.values, columns = ['Features'])
summary['Coef'] = Reg.coef_
summary['P_value'] = p_value.round(2)
summary

|   | Features | Coef | P_value |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.36 |


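The univariate p-value of 'size' rounds to 0.00, so 'size' is a significant predictor of 'price'. The univariate p-value of 'year' is 0.36, so on its own 'year' is not significant at the usual 0.05 level. Because the inputs are standardized, the coefficients can be compared directly: the coefficient of 'size' (67501.58) is roughly five times that of 'year' (13724.40).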