<br>
# 3.6.3 Multiple Linear Regression
<br>

### Form of multiple linear regression

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

##### What are the features?
- **age:** proportion of owner-occupied units built prior to 1940 (percent)
- **lstat:** percent of the population with lower socioeconomic status

##### What is the response?
- **medv:** median value of owner-occupied homes (in $1000s)

<br>
$y = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{lstat}$

<br>

In [None]:
# import the libraries used in this section

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# read CSV file and save the results
data = pd.read_csv('data/Boston.csv')

# create a Python list of feature names
feature_cols = ['age', 'lstat']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# print the first 5 rows
X.head()

In [None]:
# check the type and shape of X
print(type(X))
print(X.shape)

In [None]:
# select a Series from the DataFrame
y = data['medv']

# equivalent attribute access; works only when the column name is a valid
# Python identifier and does not clash with a DataFrame attribute or method
y = data.medv

# print the first 5 values
y.head()

In [None]:
# check the type and shape of y
print(type(y))
print(y.shape)

## Splitting X and y into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
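
The default 75/25 split and the `test_size` parameter can be checked without the Boston file. The sketch below is illustrative only: it assumes a synthetic stand-in with 506 rows, and the names `X_demo`, `y_demo`, and the random values are hypothetical, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for the real data (row count and values are assumptions)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'age': rng.uniform(0, 100, 506),
                       'lstat': rng.uniform(1, 40, 506)})
y_demo = pd.Series(rng.uniform(5, 50, 506), name='medv')

# default behaviour: 25% of the rows are held out for testing;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# an explicit 80/20 split, set through test_size
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr80.shape, X_te20.shape)
```

Because the rows are shuffled before splitting, fixing `random_state` is what keeps the training and testing sets identical from run to run.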

## Multiple Linear Regression in scikit-learn

In [None]:
# LinearRegression was already imported at the top of this section

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
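
To see what "learning the coefficients" means, the sketch below fits `LinearRegression` on synthetic data and compares the result with an ordinary least squares solve from NumPy. The data, the "true" coefficients, and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration; they are not taken from the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data generated from known coefficients (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 30.0 + 0.05 * X_demo[:, 0] - 0.9 * X_demo[:, 1] + rng.normal(0, 1, 200)

# fit with scikit-learn
model = LinearRegression().fit(X_demo, y_demo)

# fit again by least squares on a design matrix whose first column is all ones,
# so the first solved value plays the role of the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

Both approaches should agree up to floating-point error: `beta[0]` corresponds to `intercept_` and `beta[1:]` to `coef_`.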

### Interpreting model coefficients

In [None]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

In [None]:
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

$y = 33.246 + 0.024 \times \text{age} - 0.970 \times \text{lstat}$

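- Holding `age` fixed, a one-unit increase in `lstat` is associated with a 0.970 decrease in `medv`, i.e. about 970 dollars, since `medv` is measured in thousands of dollars.
- This is a statement of **association**, not **causation**.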
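
As a quick worked example, the fitted equation can be evaluated by hand. The coefficients below are the rounded values from the equation above; the input values (`age = 50`, `lstat = 10`) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# rounded coefficients copied from the fitted equation
intercept = 33.246
coef_age = 0.024
coef_lstat = -0.970

# hypothetical inputs, for illustration only
age, lstat = 50.0, 10.0

# prediction: 33.246 + 0.024 * 50 - 0.970 * 10
medv_hat = intercept + coef_age * age + coef_lstat * lstat
print(round(medv_hat, 3))  # 24.746

# change in the prediction when lstat rises by one unit and age is held fixed
medv_hat_plus = intercept + coef_age * age + coef_lstat * (lstat + 1)
delta = medv_hat_plus - medv_hat
print(round(delta, 3))  # -0.97
```

The difference between the two predictions equals the `lstat` coefficient, which is exactly what "the coefficient for a feature" means in a linear model.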


### Making predictions

In [None]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

To compare the predictions with the actual values, we need an **evaluation metric**.

### Computing $R^2$

In [None]:
print(linreg.score(X_test, y_test))
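
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below recomputes it by hand on synthetic data and compares it with `score` and `r2_score`. The data and the names `X_fit`, `X_hold`, and `model` are assumptions for illustration, not the Boston split used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic data (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 30.0 + 0.05 * X_demo[:, 0] - 0.9 * X_demo[:, 1] + rng.normal(0, 5, 200)

# fit on the first 150 rows, evaluate on the remaining 50
X_fit, X_hold = X_demo[:150], X_demo[150:]
y_fit, y_hold = y_demo[:150], y_demo[150:]

model = LinearRegression().fit(X_fit, y_fit)
y_hat = model.predict(X_hold)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_hold - y_hat) ** 2)
ss_tot = np.sum((y_hold - y_hold.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_hold, y_hold), r2_score(y_hold, y_hat))
```

All three printed values should match. Note that the total sum of squares uses the mean of the held-out responses, so $R^2$ on a test set can be negative when the model predicts worse than that mean.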

### Computing the RMSE 

In [None]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))
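
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the response (thousands of dollars for `medv`). The sketch below checks the formula on a tiny made-up example; the arrays `y_true_demo` and `y_pred_demo` are hypothetical values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up actual and predicted values, for illustration only
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# by hand: squared errors are 0.25, 0.25, 0.0, 1.0, so the mean is 0.375
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# same quantity via scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a single large miss (the error of 1.0 here) contributes more to RMSE than several small ones.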