# Supplementary Materials Part 4: Linear Regression

We will tackle the second ML model taught in IT1244: Linear Regression. It is a basic tool for regression problems, where the goal is to predict a continuous value, unlike classification and clustering problems.

In [29]:
# Loading dataset 
import pandas as pd 
import numpy as np

housing = pd.read_csv("data/housing_cleaned.csv", index_col = 'Unnamed: 0')

X = housing.drop("median_house_value", axis = 1) 
y = housing["median_house_value"]

## Part 1: Initialisation of models

By now, you should have learnt how to manipulate data with Pandas (from the other supplementary materials) or with NumPy. 

As a recap, the workflow of models usually goes like this:
1. Find the model in sklearn - it usually lives in its own submodule
2. Do a train-test split on the data (80/20? 70/30? Up to you)
3. Fit the model on the training data
4. Predict the results using the test data
5. Compare the predictions against the y values of the test data

How do we apply this for Linear Regression?

In [43]:
# Linear Regression is taken from sklearn.linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# We will then split the data into training and test sets (70/30 here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Save the model into a variable 
linreg = LinearRegression()

# Fit the model on the training data
linreg.fit(X_train, y_train)

# Predict results using test data
predictions = linreg.predict(X_test)
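
To see what `fit` actually learns, here is a minimal sketch on synthetic data rather than the housing dataset. The names `X_demo`, `y_demo` and `demo_model` are made up for illustration, and the data is noise-free by assumption so that the learnt values are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data (an assumption for illustration): y = 2*x0 + 3*x1 + 5
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 2 * X_demo[:, 0] + 3 * X_demo[:, 1] + 5

# Same 70/30 split as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

# After fitting, the learnt weights and bias are stored on the model
print("Coefficients:", demo_model.coef_)
print("Intercept:", demo_model.intercept_)
```

Because the data has no noise, the coefficients and intercept should come out very close to the 2, 3 and 5 used to generate it. On real data such as the housing set, they will only be the best linear fit.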

## Part 2: Metrics

As mentioned above, we compare the predictions against the y values of test data. How do we know if a regression model is good? This is where metrics come in.

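There are three main metrics:
1. Mean Squared Error (MSE) - the average of the squared differences between the predicted and actual values
2. Mean Absolute Error (MAE) - the average of the absolute differences between the predicted and actual values
3. Root Mean Squared Error (RMSE) - the square root of the MSE, which puts the error back in the same units as the target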

Depending on what you want to measure from your regression model, you will have to pick the metrics carefully.
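
To make the three metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The values in `y_true` and `y_pred` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical toy values, not taken from the housing dataset
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [1, 0, -2, -1]

mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4 = 1.5
mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4 = 1.0
rmse = np.sqrt(mse)               # square root of 1.5

print(f"MSE: {mse}, MAE: {mae}, RMSE: {rmse:.4f}")
```

The sklearn functions used in this notebook should give the same numbers. Writing them out once just shows there is nothing mysterious going on.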

In [39]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# A helper function to make the metrics easier to display:
def print_regression_metrics(y_pred, y_test):

    # Remember that the target column was scaled down with np.log1p,
    # so these metrics are on the log scale. To report errors in the
    # original units, apply the inverse function, np.expm1, to both arrays first.
    # sklearn metrics expect the true values first, then the predictions
    mse_score = mean_squared_error(y_test, y_pred)
    mae_score = mean_absolute_error(y_test, y_pred)
    rmse_score = mse_score ** 0.5 

    print(f"MSE:{mse_score} \nMAE: {mae_score} \nRMSE:{rmse_score}")
    return 

# You know what to do if you want it to return the regression metrics! 
print_regression_metrics(predictions, y_test)

MSE:0.11040202545242539 
MAE: 0.25171619872349976 
RMSE:0.3322680024504698


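Notice that the MSE here is smaller than the RMSE. That happens whenever the MSE is below 1: squaring an error smaller than 1 shrinks it, so taking the square root gives a larger number. Keep this in mind when comparing the two values, and remember that they are measured on the log scale of the target, not in the original units. (Think why?)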
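
The function above mentions that the target was scaled with `np.log1p` and that `np.expm1` is its inverse. Here is a minimal sketch of how errors could be reported in the original units. The house values and the size of the prediction errors are hypothetical, not taken from the dataset.

```python
import numpy as np

# Hypothetical house values in original units
y_true = np.array([100_000.0, 250_000.0, 400_000.0])

# Targets as the model would see them after log1p scaling
y_true_log = np.log1p(y_true)

# Pretend predictions on the log scale: each one is off by 0.1
y_pred_log = y_true_log + np.array([0.1, -0.1, 0.1])

# MAE on the log scale looks small
mae_log = np.mean(np.abs(y_true_log - y_pred_log))

# Undo the scaling with expm1, then measure the error in original units
y_pred = np.expm1(y_pred_log)
mae_original = np.mean(np.abs(y_true - y_pred))

print(f"MAE (log scale): {mae_log:.3f}")
print(f"MAE (original units): {mae_original:.0f}")
```

An error of 0.1 on the log scale corresponds to roughly a 10% error in the original value, so a small-looking metric can still mean tens of thousands in house price.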

### What is the difference between MSE and MAE?
MSE means Mean Squared Error, while MAE means Mean Absolute Error.

MSE is the value calculated from the __summation of the squared differences__ between the predicted values and the actual values divided by the number of data points.

MAE is the value calculated from the __summation of the absolute differences__ between the predicted values and the actual values divided by the number of data points.

### Formulas:
MSE is calculated as: 

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Meanwhile, MAE is calculated as:

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

(If you can't see what the formulas are, refer to the GitHub!)

### When do I use MSE and MAE?
MSE:
- Heavily penalises large errors, making it sensitive to outliers
- Weights larger errors more than smaller ones, because each error is squared

MAE:
- Does not heavily penalise large errors, making it more robust to outliers
- Weights every error in proportion to its size, with no extra penalty for large ones

Think about why MSE and MAE have these properties! 
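
A small numerical experiment may help here. This is a minimal sketch with made-up numbers: the helper functions `mse` and `mae` and all the arrays are hypothetical, written only to compare how the two metrics react to a single outlier.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Every prediction is off by exactly 1
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])

# Same predictions, but the last one is now off by 10
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = 22.0

print("Clean   -> MSE:", mse(y_true, y_pred_clean), "MAE:", mae(y_true, y_pred_clean))
print("Outlier -> MSE:", mse(y_true, y_pred_outlier), "MAE:", mae(y_true, y_pred_outlier))
```

With one outlier, the MSE goes from 1 to 20.8 while the MAE only goes from 1 to 2.8, because the error of 10 contributes 100 to the squared sum but only 10 to the absolute sum.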


## Part 3: Further reading

This lesson only goes through the simplest regression model, Linear Regression. If you're interested in other kinds of regression, do know that state-of-the-art methods can handle regression too! But if you're looking for something a little more traditional, do read up on:

- Lasso Regression
- Ridge Regression
- Elastic Net Regression 

You may or may not use these in your projects, but you will have to explain how they differ from ordinary linear regression - you have been warned.
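
As a starting point for that explanation, here is a minimal sketch comparing the coefficients the four models learn on synthetic data. The data, the names `X_demo` and `y_demo`, and the `alpha` and `l1_ratio` values are all assumptions picked for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

# Synthetic data: only the first two of five features actually matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                           # L2 penalty
    "Lasso": Lasso(alpha=0.5),                           # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

coefs = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    coefs[name] = model.coef_
    print(f"{name:>10}: {np.round(model.coef_, 3)}")
```

Roughly speaking, Ridge shrinks the coefficients towards zero, Lasso can set some of them to exactly zero (here, the three irrelevant features), and Elastic Net sits in between. Plain Linear Regression applies no penalty at all.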

See you next lesson :) 