# Linear regression with multiple variables

### Defining model hypothesis

Let's assume our model has 5 features. Together with the intercept $\theta_0$ that gives 6 parameters, and the hypothesis function looks like this:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \theta_5 x_5$$

Any model that has <b>n</b> features (and therefore n + 1 parameters) can be defined as:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n$$
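
The general form above is an intercept plus a dot product. Here is a minimal sketch of that idea, assuming NumPy; the function name `hypothesis` and the values of `theta` and `x` are made up for illustration and do not come from the dataset used later:

```python
import numpy as np

def hypothesis(theta, x):
    """Return theta_0 + theta_1*x_1 + ... + theta_n*x_n for a single sample."""
    # theta holds n + 1 values (intercept first); x holds the n feature values.
    return theta[0] + np.dot(theta[1:], x)

# Illustrative values only.
theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
x = np.array([4.0, 5.0])            # x_1, x_2
prediction = hypothesis(theta, x)   # 1 + 2*4 + 3*5
```

With these values the prediction works out to 24.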

<br>

### Cost function for linear regression with multiple variables

The cost function for linear regression with multiple variables can be written compactly using matrix and vector notation:

$$MSE = J(\theta) = \frac{1}{2m}(X\theta - \vec{y})^T (X\theta - \vec{y})$$

where:

$\frac{1}{2m}$ scales the sum of squared errors to half of the MSE, where $m$ is the number of samples, so $J(\theta)$ is strictly half of the MSE. We take half for convenience: the factor cancels the 2 that appears when differentiating the squared error.

$X$ is the feature matrix, where each row is one sample and each column corresponds to a different variable (with a leading column of ones so that $\theta_0$ acts as the intercept). $\theta$ is the parameter vector, so $X\theta$ is the vector of predicted values (a matrix multiplied by a vector results in a vector).

$\vec{y}$ is a vector of actual values.

$(X\theta - \vec{y})$ = $
\begin{bmatrix}
e_1 \\
e_2 \\
\vdots \\
e_m \\
\end{bmatrix}
$ is the <b>vector of errors</b>, also called the vector of residuals. It is the difference between the vector of predicted values and the vector of actual values.

$(X\theta - \vec{y})^T$ is the <b>transpose</b> of the vector of errors. Transposing a column vector gives a row vector, and transposing a row vector gives a column vector.

$(X\theta - \vec{y})^T(X\theta - \vec{y})$ is the <b>dot product</b> of the vector of errors with itself: the transposed (row) vector multiplied by the original (column) vector.

Calculating the dot product:

$(X\theta - \vec{y})^T(X\theta - \vec{y})$ = $\begin{bmatrix} e_1 & e_2 & \cdots & e_m \end{bmatrix}$ × $
\begin{bmatrix}
e_1 \\
e_2 \\
\vdots \\
e_m \\
\end{bmatrix}
$ = $e_1 \cdot e_1 + e_2 \cdot e_2 + \dots + e_m \cdot e_m$ = $e_1^2 + e_2^2 + \dots + e_m^2$
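
The cost function can be evaluated in a few lines of NumPy. This is a minimal sketch: the function name `cost` and the toy data are made up for illustration, and `X` is assumed to carry a leading column of ones for the intercept:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1/(2m) * e^T e, where e = X theta - y is the vector of errors."""
    m = len(y)
    errors = X @ theta - y              # vector of errors, shape (m,)
    return (errors @ errors) / (2 * m)  # dot product of the errors with themselves

# Toy data: 3 samples, a column of ones plus one feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

perfect = cost(X, y, np.array([0.0, 2.0]))  # predictions equal y, so the cost is 0
off = cost(X, y, np.array([0.0, 1.0]))      # errors are -1, -2, -3
```

For the second parameter vector the squared errors sum to 1 + 4 + 9 = 14, so the cost is 14 / (2 · 3) ≈ 2.33.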

<br>

### Gradient of the mean squared error (MSE)

$$\nabla J(\theta) = \frac{1}{m}X^T(X\theta - \vec{y})$$

$X^T$ - transpose of the feature matrix X.

$X\theta$ - vector of predicted values.

$(X\theta - \vec{y})$ - vector of errors.

$X^T(X\theta - \vec{y})$ - product of the transposed feature matrix and the vector of errors. Each entry is the dot product of one feature column with the errors.

$\frac{1}{m}$ - used to calculate average across all samples.
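
The gradient follows from the cost function defined earlier. As a short derivation sketch, write $\vec{e} = X\theta - \vec{y}$ (a shorthand introduced here) and apply the chain rule:

$$J(\theta) = \frac{1}{2m}\vec{e}^T\vec{e}, \qquad \frac{\partial \vec{e}}{\partial \theta} = X$$

$$\nabla J(\theta) = \frac{1}{2m} \cdot 2X^T\vec{e} = \frac{1}{m}X^T(X\theta - \vec{y})$$

The 2 produced by differentiating the squared error cancels the $\frac{1}{2}$ in the cost function, which is why only $\frac{1}{m}$ remains.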

<br>

The gradient of the cost function is used to iteratively update the parameter vector $\theta$.

The parameter vector is updated using this rule, where $\alpha$ is the learning rate (step size):

$$\theta := \theta - \alpha \nabla J(\theta)$$
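
The gradient and the update rule combine into a simple gradient descent loop. The following is a minimal sketch in NumPy, not the method used by sklearn's `LinearRegression`; the function names, the learning rate `alpha=0.1`, the iteration count and the toy data are all assumptions chosen for illustration:

```python
import numpy as np

def gradient(X, y, theta):
    """Gradient of J(theta): (1/m) * X^T (X theta - y)."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Repeat theta := theta - alpha * grad J(theta), starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * gradient(X, y, theta)
    return theta

# Toy data generated by y = 1 + 2 * x_1; the first column of ones is the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y)  # approaches [1, 2]
```

On real data the features usually need scaling first, otherwise a single learning rate may converge very slowly or diverge.
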

### Implementing Linear Regression with multiple variables using sklearn

Import libraries:

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Load the data into a DataFrame:

In [34]:
df = pd.read_csv("House_Rent_Dataset.csv")

df.head(10)

| | Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner |
| 1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
| 2 | 2022-05-16 | 2 | 17000 | 1000 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
| 3 | 2022-07-04 | 2 | 10000 | 800 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner |
| 4 | 2022-05-09 | 2 | 7500 | 850 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner |
| 5 | 2022-04-29 | 2 | 7000 | 600 | Ground out of 1 | Super Area | Thakurpukur | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner |
| 6 | 2022-06-21 | 2 | 10000 | 700 | Ground out of 4 | Super Area | Malancha | Kolkata | Unfurnished | Bachelors | 2 | Contact Agent |
| 7 | 2022-06-21 | 1 | 5000 | 250 | 1 out of 2 | Super Area | Malancha | Kolkata | Unfurnished | Bachelors | 1 | Contact Agent |
| 8 | 2022-06-07 | 2 | 26000 | 800 | 1 out of 2 | Carpet Area | Palm Avenue Kolkata, Ballygunge | Kolkata | Unfurnished | Bachelors | 2 | Contact Agent |
| 9 | 2022-06-20 | 2 | 10000 | 1000 | 1 out of 3 | Carpet Area | Natunhat | Kolkata | Semi-Furnished | Bachelors/Family | 2 | Contact Owner |


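Split the data into the target and the features:

In [35]:
# Target: the rent we want to predict.
y = df['Rent']
# Features: every remaining column except the posting date.
X = df.drop(['Rent', 'Posted On'], axis=1)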

Check which columns remain as features:

In [37]:
X.columns

Index(['BHK', 'Size', 'Floor', 'Area Type', 'Area Locality', 'City',
       'Furnishing Status', 'Tenant Preferred', 'Bathroom',
       'Point of Contact'],
      dtype='object')