## Machine Learning Fundamentals: Multiple Linear Regression
Author: Jinwoo Ahn

In one of the earlier tutorials, we discussed <b>univariate linear regression</b>, where a single variable is used to predict an outcome. In this tutorial, we are going to build on that knowledge with <b>multiple linear regression</b>, where several variables are used together to predict the desired outcome.

### Libraries & Files

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

In [9]:
housing = pd.read_csv('housing.csv')
housing = housing[['price', 'area', 'bedrooms']]

### What is Linear Regression?

Refer back to the <a href="https://github.com/jin-woo-ahn/ml_fundamentals/blob/main/supervised/1.%20univariate_linear_regression.ipynb">univariate linear regression tutorial</a> for a review of what linear regression is.

### What Does This Look Like?

<a href="https://github.com/jin-woo-ahn/ml_fundamentals/blob/main/supervised/1.%20univariate_linear_regression.ipynb">Previously</a>, we showed how we can use linear regression when there is a <b>positive linear relationship</b> between two variables (e.g. the Price and Area from the house price example).

From that, we arrived at an equation of the form <b>y = ax + b</b>. Now that we have more than one variable, our equation becomes <b>y = ax_1 + bx_2 + c</b>.

Our x_1 will be the <b>area</b> and x_2 will be the <b>number of bedrooms</b>.
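
To make the equation concrete, here is a minimal sketch that evaluates y = ax_1 + bx_2 + c for a single house. The coefficient values and the house itself are made up for illustration; they are not fitted to the housing data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted values)
a = 300.0        # change in price per unit of area
b = 50_000.0     # change in price per additional bedroom
c = 1_000_000.0  # intercept

# One hypothetical house: x_1 = area, x_2 = number of bedrooms
x = np.array([2000.0, 3.0])

# y = a*x_1 + b*x_2 + c, written as a dot product plus the intercept
weights = np.array([a, b])
y = float(weights @ x + c)
print(y)  # 1750000.0
```

Writing the weighted sum as a dot product is what lets the same expression scale to any number of variables.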

### How Can We Train Our Model From This?

The process is the same as what we did in the very first tutorial, but for the sake of completeness, let us walk through all the steps again.

In [25]:
# First 450 rows are used for training
x_train = np.array(housing[['area', 'bedrooms']][0:450])
y_train = np.array(housing['price'][0:450])

# Remaining rows are held out for testing
x_test = np.array(housing[['area', 'bedrooms']][450:])
y_test = np.array(housing['price'][450:])
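
The split above is positional: it takes rows in the order they appear in the file. If the rows happened to be ordered (for example by price), the training and test sets would cover different ranges. A possible alternative, sketched here on toy arrays rather than the housing data, is to shuffle the row indices before splitting. The array contents, the seed, and the `_s` variable names are assumptions made for this illustration.

```python
import numpy as np

# Toy stand-ins for the feature matrix and target (not the housing data)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Shuffle the row indices with a fixed seed so the split is reproducible
rng = np.random.default_rng(42)
order = rng.permutation(len(X))

n_train = 8  # hypothetical training size for this toy example
train_idx, test_idx = order[:n_train], order[n_train:]

x_train_s, y_train_s = X[train_idx], y[train_idx]
x_test_s, y_test_s = X[test_idx], y[test_idx]
print(x_train_s.shape, x_test_s.shape)  # (8, 2) (2, 2)
```

Features and targets are indexed with the same shuffled indices, so each row stays paired with its own target value.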

In [26]:
lg_model = LinearRegression()

In [27]:
lg_model.fit(x_train, y_train)

In [28]:
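slope = lg_model.coef_  # one coefficient per feature: [area, bedrooms]
intercept = lg_model.intercept_
# Only the area coefficient (slope[0]) and the intercept are printed here
print((slope[0], intercept))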

(362.24680142328276, 1237164.5493464484)
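
With two input variables, `coef_` holds two values, one per feature column, in the same order as the columns passed to `fit`. The sketch below illustrates this on synthetic data generated from a known equation (the data and the coefficients 2, 3 and 5 are assumptions for the example, not the housing data), and checks that `predict` matches the equation computed by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2*x_1 + 3*x_2 + 5 (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

model = LinearRegression().fit(X, y)

# coef_ has one entry per feature; intercept_ is a single number
print(model.coef_.shape)  # (2,)

# The model's predictions are the coefficients applied to each row, plus the intercept
manual = X @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X)))  # True
```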


### Performance?

As expected, our performance on the training set increases with the addition of a new variable because the model is <b>more complex</b> now. However, the test score is still negative, which means the model predicts the test set worse than simply predicting the mean price would. There are still a lot of variables that we have not taken into consideration yet.

In [30]:
train_score = lg_model.score(x_train, y_train)
train_score

0.3118586803622959

In [31]:
test_score = lg_model.score(x_test, y_test)
test_score

-20.340743800303322
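
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on a few made-up numbers to show how the value can be 1, 0, or negative. The helper name `r2_score_manual` and the sample values are hypothetical.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                        # exact predictions
mean_only = r2_score_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
far_off = r2_score_manual(y_true, y_true + 10.0)                 # every prediction off by 10

print(perfect, mean_only, far_off)  # 1.0 0.0 -79.0
```

A negative score has no lower bound: the further the predictions are from the true values relative to the spread of those values, the more negative it becomes.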

### What's Next?

Congratulations on finishing the <b>Multiple Linear Regression</b> tutorial. Next, we will discuss <b>Feature Scaling</b>!