# Multiple Linear Regression with sklearn - Exercise Solution

You are given a real estate dataset.
Your task is to create a multiple linear regression using this data.

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
data=pd.read_csv('real_estate_price_size_year.csv')

In [4]:
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [5]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


In [7]:
x=data[['size','year']]
y=data['price']

### Regression

In [9]:
reg=LinearRegression()
reg.fit(x,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Find the intercept

In [12]:
reg.intercept_

-5772267.017463278

### Find the coefficients

In [13]:
reg.coef_

array([ 227.70085401, 2916.78532684])

### Calculate the R-squared

In [21]:
r2=reg.score(x,y)
r2

0.7764803683276796

### Calculate the Adjusted R-squared

In [17]:
n=x.shape[0]
p=x.shape[1]

In [19]:
adj_r2=1-((1-reg.score(x,y))*(n-1)/(n-p-1))
adj_r2

0.7718717161282503
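
The calculation above follows the usual adjusted R-squared formula, $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ is the number of predictors. As a minimal sketch, the same arithmetic can be wrapped in a small helper. The function name `adjusted_r2` is hypothetical (it is not part of sklearn), and the inputs are the values reported in this notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared from reg.score(x, y); n = 100 rows and p = 2 features in this dataset
adj = adjusted_r2(0.7764803683276796, 100, 2)
print(round(adj, 4))  # 0.7719
```

Because the helper takes plain numbers, it can be reused for any fitted model by passing `reg.score(x, y)`, `x.shape[0]` and `x.shape[1]`.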

### Compare the R-squared and the Adjusted R-squared

In [22]:
r2<adj_r2

False

The adjusted R-squared (0.7719) is slightly lower than the R-squared (0.7765), because it penalizes the model for the number of predictors used.

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [23]:
reg.predict([[750,2009]])

array([258330.34465995])
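
The prediction above is just the fitted linear equation evaluated at the given inputs: price = intercept + coefficient of size × size + coefficient of year × year. The sketch below reproduces it by hand, assuming the intercept and coefficients are copied from the printed outputs above (so the result can differ from `reg.predict` in the last few digits because of rounding). The name `predict_price` is hypothetical.

```python
# Intercept and coefficients copied from the outputs of reg.intercept_ and reg.coef_
intercept = -5772267.017463278
coef_size, coef_year = 227.70085401, 2916.78532684

def predict_price(size, year):
    """Evaluate the fitted regression equation for one apartment."""
    return intercept + coef_size * size + coef_year * year

manual = predict_price(750, 2009)
print(round(manual, 2))  # should agree with reg.predict([[750, 2009]]) to within rounding
```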

### Calculate the univariate p-values of the variables

In [24]:
from sklearn.feature_selection import f_regression

In [25]:
f_regression(x,y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [26]:
p_values=f_regression(x,y)[1]

In [27]:
p_values

array([8.12763222e-31, 3.57340758e-01])
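
Note that `f_regression` tests each feature on its own: it is based on the correlation between a single feature and the target, not on the full multiple regression. The sketch below shows the F-statistic behind such a univariate test on a tiny made-up dataset, assuming the default centered setting. The helper `univariate_f` and the toy lists are hypothetical and written in plain Python for illustration; they are not sklearn's implementation.

```python
import math

def univariate_f(feature, target):
    """F-statistic for a simple regression of target on one feature."""
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(target) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(feature, target))
    sxx = sum((a - mean_x) ** 2 for a in feature)
    syy = sum((b - mean_y) ** 2 for b in target)
    r = sxy / math.sqrt(sxx * syy)          # Pearson correlation
    return r ** 2 / (1 - r ** 2) * (n - 2)  # n - 2 degrees of freedom

# Toy data, invented for this example only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
f_stat = univariate_f(toy_x, toy_y)
print(round(f_stat, 4))
```

A larger F-statistic corresponds to a smaller p-value, which is why 'size' (F ≈ 285.9) has a p-value near zero while 'year' (F ≈ 0.86) does not.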

### Create a summary table with your findings

In [35]:
reg_summary=pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['coefficients']=reg.coef_
reg_summary['p_values']=p_values.round(4)
reg_summary

| | Features | coefficients | p_values |
|---|---|---|---|
| 0 | size | 227.700854 | 0.0 |
| 1 | year | 2916.785327 | 0.3573 |


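Based on the univariate p-values, 'size' is statistically significant (p ≈ 0.000), while 'year' is not (p = 0.3573, well above the usual 0.05 threshold). Therefore, 'year' appears to be redundant in this model.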