## Machine Learning With Python: Linear Regression With Multiple Variables


### Sample problem: predicting home prices in Monroe, New Jersey (USA)


##### The table below lists home prices in Monroe Twp, NJ. Here price depends on area (square feet), number of bedrooms, and age of the home (in years). Given these prices, we have to predict the prices of new homes based on area, bedrooms, and age.

![](data/homeprices.jpg)

###### Given these home prices, find the price of a home that has:

**3000 sq ft area, 3 bedrooms, 40 years old**

**2500 sq ft area, 4 bedrooms, 5 years old**


##### We will use regression with multiple variables here. Price can be calculated using the following equation:



![](data/equation.jpg)

##### Here area, bedrooms, and age are called independent variables or **features**, whereas price is the dependent variable.
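In case the equation image does not render, the same relationship can be written out as follows. The symbols $m_1$, $m_2$, $m_3$ (coefficients) and $b$ (intercept) follow the notation used in the manual check at the end of this notebook:

$$
\text{price} = m_1 \cdot \text{area} + m_2 \cdot \text{bedrooms} + m_3 \cdot \text{age} + b
$$

Fitting the model means choosing $m_1$, $m_2$, $m_3$ and $b$ so that the predicted prices are as close as possible, in the least-squares sense, to the known prices.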



In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model

In [2]:
df = pd.read_csv('data/homeprices.csv')
df

| | area | bedrooms | age | price |
|---|---|---|---|---|
| 0 | 2600 | 3.0 | 20 | 550000 |
| 1 | 3000 | 4.0 | 15 | 565000 |
| 2 | 3200 | NaN | 18 | 610000 |
| 3 | 3600 | 3.0 | 30 | 595000 |
| 4 | 4000 | 5.0 | 8 | 760000 |
| 5 | 4100 | 6.0 | 8 | 810000 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   area      6 non-null      int64  
 1   bedrooms  5 non-null      float64
 2   age       6 non-null      int64  
 3   price     6 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 320.0 bytes


In [4]:
df.describe()

| | area | bedrooms | age | price |
|---|---|---|---|---|
| count | 6.0 | 5.0 | 6.0 | 6.0 |
| mean | 3416.666667 | 4.2 | 16.5 | 648333.333333 |
| std | 587.934237 | 1.30384 | 8.288546 | 109117.673484 |
| min | 2600.0 | 3.0 | 8.0 | 550000.0 |
| 25% | 3050.0 | 3.0 | 9.75 | 572500.0 |
| 50% | 3400.0 | 4.0 | 16.5 | 602500.0 |
| 75% | 3900.0 | 5.0 | 19.5 | 722500.0 |
| max | 4100.0 | 6.0 | 30.0 | 810000.0 |


In [5]:
df.columns

Index(['area', 'bedrooms', 'age', 'price'], dtype='object')

##### Data preprocessing: fill NA values with the median value of the column

In [6]:
df.bedrooms.median()

4.0

In [9]:
df['age']

0    20
1    15
2    18
3    30
4     8
5     8
Name: age, dtype: int64

In [7]:
df.bedrooms = df.bedrooms.fillna(df.bedrooms.median())
df

| | area | bedrooms | age | price |
|---|---|---|---|---|
| 0 | 2600 | 3.0 | 20 | 550000 |
| 1 | 3000 | 4.0 | 15 | 565000 |
| 2 | 3200 | 4.0 | 18 | 610000 |
| 3 | 3600 | 3.0 | 30 | 595000 |
| 4 | 4000 | 5.0 | 8 | 760000 |
| 5 | 4100 | 6.0 | 8 | 810000 |
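For readers who do not have `data/homeprices.csv` at hand, the median fill can be reproduced from the six rows shown in this notebook. This is a minimal sketch; the names `homes` and `bedroom_median` are illustrative and do not appear in the original cells.

```python
import numpy as np
import pandas as pd

# The six rows from the notebook; row 2 has no bedroom count.
homes = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3, 4, np.nan, 3, 5, 6],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})

# median() skips NaN by default, so it is computed from the five known values.
bedroom_median = homes["bedrooms"].median()

# fillna returns a new Series, so the result is assigned back to the column.
homes["bedrooms"] = homes["bedrooms"].fillna(bedroom_median)

print(bedroom_median)                   # 4.0
print(homes["bedrooms"].isna().sum())   # 0
```

The median is used here as in the notebook; with a single missing value in six rows, the mean (4.2) would give a non-integer bedroom count.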


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   area      6 non-null      int64  
 1   bedrooms  6 non-null      float64
 2   age       6 non-null      int64  
 3   price     6 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 320.0 bytes


In [12]:
X = df.drop('price',axis='columns')
X

| | area | bedrooms | age |
|---|---|---|---|
| 0 | 2600 | 3.0 | 20 |
| 1 | 3000 | 4.0 | 15 |
| 2 | 3200 | 4.0 | 18 |
| 3 | 3600 | 3.0 | 30 |
| 4 | 4000 | 5.0 | 8 |
| 5 | 4100 | 6.0 | 8 |


In [13]:
y = df.price
y

0    550000
1    565000
2    610000
3    595000
4    760000
5    810000
Name: price, dtype: int64

In [14]:
# plotly is a separate package; install it first (for example, `pip install plotly`)
import plotly.express as px
import plotly.graph_objects as go

In [19]:
# Plot price against area and bedrooms, using column names from df
fig = px.scatter_3d(df, x='area', y='bedrooms', z='price')
fig.show()

In [13]:
reg = linear_model.LinearRegression()
# Fit on the feature matrix X (area, bedrooms, age) and target y (price)
reg.fit(X, y)

In [27]:
reg.coef_

array([  112.06244194, 23388.88007794, -3231.71790863])

In [28]:
reg.intercept_

221323.0018654043
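The coefficient array is easier to read when each value is paired with its feature. The sketch below refits the model on the six rows shown earlier (with the missing bedroom count already filled with 4.0) and prints one coefficient per feature; `homes`, `features` and `model` are illustrative names, not part of the original notebook. The printed values should agree, up to rounding, with `reg.coef_` and `reg.intercept_` above.

```python
import pandas as pd
from sklearn import linear_model

# Six rows from the notebook, after the median fill on bedrooms.
homes = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3.0, 4.0, 4.0, 3.0, 5.0, 6.0],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})

features = ["area", "bedrooms", "age"]
model = linear_model.LinearRegression()
model.fit(homes[features], homes["price"])

# coef_ is ordered like the feature columns passed to fit().
for name, coef in zip(features, model.coef_):
    print(f"{name}: {coef:,.2f}")
print(f"intercept: {model.intercept_:,.2f}")
```

Read this way, the fitted model adds roughly 112 to the price per extra square foot and roughly 23,389 per extra bedroom, and subtracts roughly 3,232 per year of age, each with the other two features held fixed. With only six rows, these estimates should be treated as illustrative.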

In [14]:
# Find the price of a home with 3000 sq ft area, 3 bedrooms, 40 years old
reg.predict([[3000, 3, 40]])



array([498408.25158031])
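The problem statement also asks about a second home (2500 sq ft, 4 bedrooms, 5 years old). A minimal sketch that predicts both homes in one call is shown below. It refits the model on the six rows from this notebook so that it runs on its own; `homes`, `features`, `model`, `new_homes` and `predicted` are illustrative names. Passing a DataFrame with the same column names as the training data also avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame is given a plain list.

```python
import pandas as pd
from sklearn import linear_model

# Six rows from the notebook, after the median fill on bedrooms.
homes = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3.0, 4.0, 4.0, 3.0, 5.0, 6.0],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})

features = ["area", "bedrooms", "age"]
model = linear_model.LinearRegression().fit(homes[features], homes["price"])

# The two homes from the problem statement, one row each.
new_homes = pd.DataFrame([[3000, 3, 40], [2500, 4, 5]], columns=features)

# predict() returns one price per row, in the same order as new_homes.
predicted = model.predict(new_homes)
for row, price in zip(new_homes.itertuples(index=False), predicted):
    print(f"{row.area} sq ft, {row.bedrooms} bedrooms, {row.age} years: {price:,.2f}")
```

Using the coefficients printed above, the second home works out to about 578,876.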

In [36]:
# Y = m1 * X1 + m2 * X2 + m3 * X3 + b (m1, m2, m3 are the coefficients and b is the intercept)

112.06244194*3000 + 23388.88007794*3 + -3231.71790863*40 + 221323.00186540384


498408.25157402386
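The same check can be written as a dot product, which scales to any number of features. This sketch copies the coefficient and intercept values printed earlier in the notebook; the variable names are illustrative.

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above.
coef = np.array([112.06244194, 23388.88007794, -3231.71790863])
intercept = 221323.0018654043

# Feature order must match the coefficients: area, bedrooms, age.
home = np.array([3000, 3, 40])

# m1*X1 + m2*X2 + m3*X3 + b, expressed as a dot product plus the intercept.
price = np.dot(coef, home) + intercept
print(round(price, 2))  # about 498408.25
```

The result differs from `reg.predict` only in the last few decimal places, because the coefficients typed in here are the rounded values shown in the printed output rather than the full-precision values stored in the model.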