<img src="./img/vi_logo.png" style="float: left; margin: 10px; height: 45px">

# Vertical Institute Data Science 101
# Lesson 5: Linear Regression and Feature Scaling

---


### Learning Objectives

#### Part 1: Linear Regression
**After this lesson, you will be able to:**
- Define the terms: modeling, prediction
- Understand the best fit line of a set of data
- Find the best fit line by hand

#### Part 2: Feature Scaling
**After this lesson, you will be able to:**
- Understand the importance of scaling data
- Understand different ways to normalize data
- Use Scikit-learn preprocessing to scale data in various ways

***Modeling and Predictions***

Consider the following scenario:

A bank manager is troubled that more and more customers are leaving the bank's credit card services. The manager would like to predict which customers are likely to churn, so that the bank can proactively offer them better service and change their minds.

With models, we can make __predictions__

- Models are relationships between quantities
- Linear Regression is a method for determining the coefficients of linear relationships

## Supervised Learning Workflow

- Training data during model training
    - Labels provided
- Testing data during prediction 
    - Labels not provided, to be predicted
    
    
- Training and testing data go through the same preprocessing flow to reach the same shape before being fed to the model
    - Outlier removal
    - Standardizing
    - Filling in missing values
    - Data Transformations 
        - Encoding
        - Pivoting
        - GroupBy
        - Join
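
As a minimal sketch of "the same preprocessing flow", the cell below fills missing values in a toy training set and a toy testing set. The data, the column values and the choice of the median as fill value are assumptions made for illustration; the key idea is that the fill value is learned from the training data and then reused unchanged on the testing data.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
train = pd.DataFrame({"bmi": [22.0, None, 30.0, 28.0]})
test = pd.DataFrame({"bmi": [None, 25.0]})

# Learn the fill value from the training data only
fill_value = train["bmi"].median()

# Apply exactly the same step to both sets
train_clean = train.fillna({"bmi": fill_value})
test_clean = test.fillna({"bmi": fill_value})

print(fill_value)
print(test_clean)
```

Both frames end up with the same columns and no missing values, so a model fitted on `train_clean` can predict on `test_clean`.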

### Focus on how columns are broken down into Training Data and Labels 

<img src="img/flow.png" style="height:400px">
Credits: https://www.researchgate.net/figure/A-flowchart-of-a-supervised-machine-learning-model_fig1_314202159

## Linear Regression

- Interactive explanation: https://setosa.io/ev/ordinary-least-squares-regression/

<img src="img/linreg.png" style="margin: 20px; height: 400px">

***Linear regression*** is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable.  
The regression has five key assumptions:

1. Linear relationship
2. No or little multicollinearity
3. No auto-correlation
4. Homoscedasticity
5. Residuals are Normally Distributed

- These assumptions are less important if you only care about prediction (MSE) and not inference (coefficients of predictors)
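
To make "finding the best fit line by hand" concrete, here is a small sketch for a single predictor, using the usual least-squares formulas: slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and intercept = mean_y - slope * mean_x. The function name `fit_line` and the toy numbers are assumptions for illustration, not part of the lesson data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy points that lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the toy points lie exactly on a line, the recovered slope and intercept match that line; with noisy data the same formulas give the line that minimizes the sum of squared errors.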

<img src="img/eqn.png" style="margin: 20px; height: 30px">

Check for a linear relationship

<img src="img/check_line.png" style="margin: 20px; height: 300px">

Let's investigate the housing dataset with linear regression. We'll use two different packages and you can see examples for linear regression of each:
- **statsmodels** (more for inference) -- http://statsmodels.sourceforge.net/devel/examples/#regression
- **scikit-learn** (more for prediction) -- http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

## Intro to Scikit-Learn

Scikit-learn is a machine learning package for Python that includes a huge array of models, including linear regression. Scikit-learn also includes a number of sample data sets; older versions included the Boston housing data set, which was removed in version 1.2. (We could also load the data with pandas as in the last lesson.)

<img src="img/ml_map.png">

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

### Insurance dataset

We will be studying how variables like age, gender and smoker can affect the insurance charges paid by customers. Subsequently, we will be building several models to predict the insurance charges using the same variables.

In [None]:
df = pd.read_csv('insurance.csv')
df.head()

In [None]:
# Put the target in its own series because sklearn needs predictors and targets as separate inputs to .fit()
y = df['charges'] 
y.head()

## Sklearn model fitting workflow
- **Import** libraries and functions
- **Instantiate** the model from its imported class
- **Fit** this instance of the model using X (features) and y (labels)
- **Predict** on X (usually different from the X used to train) to get predicted y
- **Evaluate** by comparing true y vs predicted y

## Skeleton structure for Linear Regression

```
from sklearn.linear_model import LinearRegression  # import
from sklearn.metrics import mean_squared_error  # import
lm = LinearRegression()  # Instantiate a linear regression model
lm.fit(X, y)  # Fit the model on predictors (2D object) and target (1D object)
predictions = lm.predict(X_test)  # Use fitted model to predict on the NEW unseen data
mean_squared_error(y_test, predictions)  # Compare true values with predictions using a performance metric
```
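
The skeleton refers to `X_test` and `y_test` without showing where they come from. The sketch below runs the whole workflow end to end on synthetic data; the data, the 80/20 split and the use of `train_test_split` are assumptions for illustration and are not taken from the lesson's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3 * x + 2 plus a little noise (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))  # 2D: 100 rows, 1 feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)  # 1D target

# Hold out 20% of the rows to play the role of new, unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lm = LinearRegression()               # instantiate
lm.fit(X_train, y_train)              # fit on the training rows only
predictions = lm.predict(X_test)      # predict on the held-out rows
mse = mean_squared_error(y_test, predictions)  # evaluate: true first, predicted second

print("coef:", lm.coef_, "intercept:", lm.intercept_)
print("test MSE:", mse)
```

The fitted coefficient and intercept should land close to the 3 and 2 used to generate the data, which is a quick sanity check that the workflow is wired up correctly.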

### `bmi` as feature 

- The X variable is usually used to mean the predictors 
    - They have to be all numbers, so object dtypes like categories or text must be transformed first
    - This object has to be 2D because Sklearn is designed to fit on 2D objects
    - Almost all Sklearn models cannot handle missing data, so make sure there is no None or np.nan in X or y
- The y variable is usually used to mean the target

- You should give them more meaningful names when it's not obvious what they are 
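
The 2D requirement is easiest to see from the shapes. In this sketch, `df_toy` and its values are hypothetical stand-ins for the lesson's dataframe.

```python
import pandas as pd

# Hypothetical toy frame, made up for illustration only
df_toy = pd.DataFrame({"bmi": [22.5, 31.0, 27.4],
                       "charges": [1000.0, 2000.0, 1500.0]})

as_series = df_toy["bmi"]    # single brackets: a 1D Series
as_frame = df_toy[["bmi"]]   # double brackets: a 2D DataFrame with one column

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)

# A 1D array can also be turned into a single 2D column with reshape
as_column = as_series.to_numpy().reshape(-1, 1)
print(as_column.shape)  # (3, 1)
```

The double-bracket form is usually the most convenient because it keeps the column name.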

In [None]:
# Select X
X = df[["bmi"]]   # Why not df['bmi']? Because sklearn needs a 2D object for every X its models fit on
X.head()

In [None]:
from matplotlib import pyplot as plt

# Let's plot the relationship between X and y
# We see a good visual indicator for a linear relationship
plt.scatter(X, y)
plt.xlabel('bmi')
plt.ylabel('Charges')

### Implementing the fitting process

In [None]:
#Let's try fitting the data into a linear model

# Step 1 - Import the library
from sklearn.linear_model import LinearRegression

# Step 2 - create an instance of model


# Step 3 - fit the variables into the model
# Notice that the fit method modifies the model in place, so the result need not be assigned to a new variable


# Step 4 - use the model to predict your dependent variables


### Getting the attributes from the fitted model + How to read documentation

- Attributes ending with _ are created by fitting; they are not available before the model is fit on training data
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
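
A quick way to see this is to check for `coef_` before and after calling `fit`. The names `lm_toy`, `X_toy` and `y_toy` and the toy numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([3.0, 5.0, 7.0])

lm_toy = LinearRegression()
has_coef_before = hasattr(lm_toy, "coef_")  # not fitted yet

lm_toy.fit(X_toy, y_toy)
has_coef_after = hasattr(lm_toy, "coef_")   # created by fit

print(has_coef_before, has_coef_after)
print(lm_toy.coef_, lm_toy.intercept_)
```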

In [None]:
# what the predicted y-value is when all predictors are 0
print(lm.intercept_)

# how much y increases per unit increase in the corresponding predictor keeping all other predictors constant
print(lm.coef_)

### Visualizing results

In [None]:
#plot scatter plot here
plt.scatter(X,y)
plt.xlabel('bmi')
plt.ylabel('Charges')

# only need to prepare (x,y) coords of 2 endpoints to plot the predicted line
x = np.array([X.min(),X.max()])
y_pred = x * lm.coef_ + lm.intercept_

plt.plot(x,y_pred,c='r')

### Evaluation of the model

- Sklearn provides many metrics for each type of modelling task (regression/classification), so you don't have to code your own: https://scikit-learn.org/stable/modules/classes.html#regression-metrics
- Some metrics, such as MSE, are symmetric
    - If you swap the 2 input parameters they give the same result, e.g. A+B = B+A 
    - Others (e.g. R-squared) are not symmetric, so it is good practice to always provide values to the metric functions in the order true, predicted
    - `mean_squared_error(y_true,y_pred)` instead of `mean_squared_error(y_pred,y_true)`

#### Mean Squared Error (MSE)

The MSE first takes the difference between the actual value of the target and our prediction (Actual - Prediction).

Then it squares each error to eliminate negative values.

Finally, it reports the average of all the squared errors.


 $$MSE = \frac{1}{n}\displaystyle\sum_{i=1}^{n}{(Actual_i -Predicted_i)^2}$$

actual | prediction | error | squared_error
---|---|---|---
15|10|5|25
10|12|-2|4

```
sum_squared_error = (15-10)**2 + (10-12)**2
                  = 5**2 + (-2)**2
                  = 25 + 4
                  = 29
mean_squared_error = sum_squared_error/n
                   = 29/2
                   = 14.5
```
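
The same arithmetic can be checked in a few lines of plain Python. This is only a sketch of the formula above using the two rows of the table, not a replacement for the Sklearn metric.

```python
actual = [15, 10]
predicted = [10, 12]

# Square each (actual - predicted) error, then average the squares
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [25, 4]
print(mse)             # 14.5
```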

In [None]:
# Sklearn.metrics provides us a function to calculate MSE easily
from _____.______ import ___________


print(_______)

##### Disadvantages of MSE
- Sensitive to scale
    - If the y-values were multiplied by 100, the MSE would be multiplied by 100² = 10,000, making it hard to compare models across different problems with different y
    - If comparing across different y is not necessary (as when staying on 1 problem with 1 y), MSE works fine for comparing models on the same scale
- Penalizes errors non-linearly because of the square
    - You may want to use Mean Absolute Error instead, which doesn't square: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
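
Both points can be checked numerically. The sketch below reuses the two-row example from the MSE table; the helper names `mse` and `mae` are hypothetical and written by hand for illustration. Multiplying y by 100 multiplies the MSE by 100 squared, while the Mean Absolute Error only grows by a factor of 100.

```python
def mse(actual, predicted):
    """Mean of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [15, 10]
predicted = [10, 12]

# The same values expressed in a unit that is 100 times smaller
actual_x100 = [a * 100 for a in actual]
predicted_x100 = [p * 100 for p in predicted]

print(mse(actual, predicted), mse(actual_x100, predicted_x100))  # 14.5 145000.0
print(mae(actual, predicted), mae(actual_x100, predicted_x100))  # 3.5 350.0
```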

### `Age` as feature 

In [None]:
# Select X and y
X = df[["age"]]
y = df['charges']

# We have written a function to conveniently train the model, make predictions, plot the graphs, 
# compute the metrics in one shot! That's the power of a user defined function
def train_predict(X,y):
    lm = LinearRegression()
    lm.fit(X, y)
    predictions = lm.predict(X)

    # Plot graph if only 1 independent variable
    if X.shape[1] == 1:
        plt.scatter(X,y)

        # Plot our linear regression model    
        x = np.array([X.min(), X.max()])
        y_pred = x * lm.coef_ + lm.intercept_
        plt.plot(x,y_pred,c='r')    
    
    # Compute the metrics
    mse = mean_squared_error(y, predictions)
    
    print("MSE:", mse)   
    print("Coefficients: ",lm.coef_)
    print("Intercept: ",lm.intercept_)
    
    return lm.coef_,lm.intercept_

train_predict(X,y)

### Exercise 

- Let's try to use both **bmi** and **age** as predictors
- Notice how model.coef_ has 2 coefficients in array now
    - Compare the 2 coefficients with the coefficients when they were the sole predictors, what do you observe?
- Hint: For practice of the sklearn model fitting workflow, try to do it without using the `train_predict` function first

In [None]:
# Fill in the blanks below

# import the model 
from sklearn.linear_model import ___

# prepare the data
X = df[['bmi','age']]
y = df['charges']

# instantiate the model
lm = ____

# fit the model
__________(X,y)

# predict on the same X (usually we predict on new X, but that's for later)
__ = lm.___(_)

# evaluate your results using regression-based metrics
mse = mean_squared_error(__,__)

print(mse)

In [None]:
#already defined function
X = df[['bmi','age']]
y = df['charges']

train_predict(X,y)

### More detailed exploration (why 3 lines when age as predictor?)

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
# hue colors the points by the values of another column, so we can see how that column relates to the plotted data
# hue: the variable that defines the subsets of the data
sns.scatterplot(x='age',y='charges',data=df, hue='sex')
plt.show()
sns.scatterplot(x='age',y='charges',data=df, hue='smoker')
plt.show()
sns.scatterplot(x='age',y='charges',data=df, hue='region')
plt.show()

# children is a numeric dtype, but because of its low cardinality (number of unique values), we can still use it as hue sensibly
sns.scatterplot(x='age',y='charges',data=df, hue='children') 

### Interpretations from tests of various columns being hue
- Smoker column looks most interesting
- It seems that Smokers have consistently higher charges than non-smokers no matter the age
- There is a mix of smokers and non-smokers in the 10000 to 30000 range; further investigation using more variables is required
- It may be that the data required to separate them is not even in the dataset 
    - Form a hypothesis of what new metrics should be collected
    - Consider the cost of collection at scale; if scalable, try to collect the new metric for a small group of people first to test its effectiveness

<a id='compare'></a>
## Comparing the models

A perfect fit would yield a straight line when we plot the predicted values versus the true values. We'll quantify the goodness of fit soon.

### Build a model with just the numeric columns

In [None]:
y = df['charges'] # using charges as y
X = df[['age', 'bmi', 'children']]
X.head()

In [None]:
y.head()

In [None]:
from sklearn.metrics import mean_squared_error

train_predict(X,y)  # note how the MSE drops from 133 million to 128 million with the addition of the children variable

## Pairplot 

In [None]:
# reload data again just in case
df = pd.read_csv('insurance.csv')

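# See the pairwise correlations of the numeric columns
# (select them first: newer pandas versions raise an error if df.corr() meets text columns)
df.select_dtypes(include='number').corr()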

In [None]:
#plot pairplot
sns.pairplot(df,diag_kind='kde')


Why `df.corr()` is not enough 

- Pearson's correlation cannot capture non-linearities (curved relationships)
- It cannot show groupwise regressions (there are 3 lines in age vs charges)
- It cannot show that max charges decrease as children increase
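
The first point can be demonstrated with a perfectly curved relationship. In the sketch below, y is fully determined by x (y = x squared), yet Pearson's correlation is zero. The `pearson` helper and the toy numbers are assumptions written for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # a perfect, but curved, relationship

r = pearson(xs, ys)
print(r)
```

A scatter plot of these points would show a clear U shape, which is exactly the kind of pattern a pairplot reveals and a correlation table hides.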

Pairplot interpretations

- Age looks positively correlated with charges
- Children looks negatively correlated with the max charge value

## How to add smoker (categorical data) as a feature? 

In [None]:
df.head(2)

### OrdinalEncoding 

In [None]:
# Only run series.map() once!
# Values in the column that are not found in the keys of the mapping dictionary will be converted to NaN
# Restart from pd.read_csv if you make this mistake
df['smoker'] = df['smoker'].map({'yes':1,
                                 'no':0}) 

df.head() # always check data after a transformation
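
The warning about running `map()` twice is easy to reproduce on a small Series. The toy values here are made up for illustration.

```python
import pandas as pd

mapping = {"yes": 1, "no": 0}
smoker = pd.Series(["yes", "no", "no"])  # hypothetical toy column

once = smoker.map(mapping)   # 'yes'/'no' are keys of the mapping, so they convert cleanly
twice = once.map(mapping)    # 1 and 0 are not keys of the mapping, so everything becomes NaN

print(once.tolist())         # [1, 0, 0]
print(twice.isna().all())    # True
```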

In [None]:
#to take a look at smoker vs age
sns.pairplot(df)

In [None]:
features = ["age", "smoker"]
X = df[features]
y = df['charges']

train_predict(X,y)

- Massive drop in MSE from 133 million with age as only predictor to 40 million by adding smoker
- R-squared jumped from 0.09 to 0.72
- Actual vs predicted reduced from 3 lines to 2 lines
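
R-squared is quoted above, although `train_predict` only prints the MSE. One common definition is 1 - SS_res / SS_tot, where SS_res is the sum of squared errors of the model and SS_tot is the sum of squared differences between the actual values and their mean. The sketch below computes it by hand on made-up numbers; Sklearn also provides `r2_score` in `sklearn.metrics` for the same purpose.

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, computed by hand."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, made up for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(actual, predicted)
print(r2)
```

Unlike MSE, R-squared does not depend on the units of y, which makes it easier to compare across problems.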

### OneHotEncoding 
- If using model for inference (using coefficient sizes to determine predictor importance), beware of dummy variable trap
- https://towardsdatascience.com/beware-of-the-dummy-variable-trap-in-pandas-727e8e6b8bde

#### What is pd.get_dummies?

Let's run the following and take a look

In [None]:
# Load data again just in case
df = pd.read_csv('insurance.csv')
# BEFORE using dummy
df.head()

In [None]:
# AFTER using dummy
dummies_df = pd.________(df, columns=['______'])
dummies_df.head(2)  # note pandas generated new column names for you by appending categories 

In [None]:
# for this problem you will get the same MSE (although different coefficients) no matter which of the 3 combinations you use
# things will be different when one-hot-encoding columns with 3 or more categories
# features = ["age","smoker_no"]
# features = ["age","smoker_yes"]
features = ["age","smoker_no","smoker_yes"]

X = dummies_df[features]
y = dummies_df['charges']

train_predict(X,y) 
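
The dummy variable trap mentioned earlier can be seen on a tiny frame: the two one-hot columns always add up to 1, so one of them carries no extra information. The frame `toy` and its values are hypothetical; `drop_first=True` is one way to keep a single column per two-category feature.

```python
import pandas as pd

# Hypothetical toy frame, made up for illustration only
toy = pd.DataFrame({"age": [19, 45, 33], "smoker": ["yes", "no", "no"]})

# One column per category (dtype=int gives 0/1 instead of True/False)
full = pd.get_dummies(toy, columns=["smoker"], dtype=int)
print(full.columns.tolist())  # ['age', 'smoker_no', 'smoker_yes']

# The two dummy columns always sum to 1, so they are perfectly collinear
row_sums = full["smoker_no"] + full["smoker_yes"]
print(row_sums.tolist())  # [1, 1, 1]

# Dropping the first category keeps the information without the redundancy
reduced = pd.get_dummies(toy, columns=["smoker"], drop_first=True, dtype=int)
print(reduced.columns.tolist())  # ['age', 'smoker_yes']
```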

## Part 2: Feature Scaling

***Why do we need to scale data?***
- To handle disparities in units
- Many distance-based machine learning models (K-nearest Neighbors, K-means) and regularized regressions require scaling
- Can speed up iterative model training
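
To see why units matter for distance-based models, consider three hypothetical points with one small-scale feature and one large-scale feature (the numbers are made up and only loosely inspired by the NOX and TAX columns used later). The Euclidean distance is dominated almost entirely by the large-scale feature.

```python
import math

# (small_scale_feature, large_scale_feature), hypothetical values for illustration
a = (0.4, 300.0)
b = (0.8, 300.0)  # small-scale feature doubled, large-scale feature unchanged
c = (0.4, 310.0)  # small-scale feature unchanged, large-scale feature up by about 3%

dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)

print(dist_ab)  # about 0.4
print(dist_ac)  # 10.0
```

Doubling the small feature moves the point far less than a 3% change in the large one, so an unscaled distance-based model would effectively ignore the small-scale feature.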

***How do we scale our data?***

### Feature Scaling: Boston dataset

In [None]:
# imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# import linear_model and mean_squared_error
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

In [None]:
# Load the Boston Housing dataset
df = pd.read_csv("boston_data.csv")
df.head()

### Scaling our data

Let's see what effect scaling our data has on some of the features by picking two features
that have a large difference in scale.

In [None]:
xs = df["NOX"]
ys = df["TAX"]

# plot a scatter plot
plt.____(xs, ys)
plt.xlabel("NOX")
plt.ylabel("TAX")
plt.show()

- What are the min and max of TAX?
- What are the min and max of NOX?

<a id='standardization'></a>
## Standardization
- Also called Z-score normalization
- Rescales to have a mean of zero and a variance of 1

<img src="img/standardization.png" style="margin: 20px; height: 200px">

Let's apply standardization, transforming our data to have mean zero $(\mu = 0)$ and variance 1 $(\sigma^2 = 1)$ by the formula:

$$ x' = \frac{x - \mu}{\sigma}$$
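
The formula can be applied by hand in plain Python. This sketch uses made-up values and the population standard deviation (`statistics.pstdev`), which is also what Sklearn's `StandardScaler` uses.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values for illustration

mu = statistics.fmean(values)      # 5.0
sigma = statistics.pstdev(values)  # 2.0 (population standard deviation)

# x' = (x - mu) / sigma
standardized = [(v - mu) / sigma for v in values]

print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(statistics.fmean(standardized), statistics.pstdev(standardized))  # 0.0 1.0
```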

####  Sklearn StandardScaler

- StandardScaler is one of the many transformers: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

### The fit-transform workflow

1. initialize transformer  `ss = StandardScaler()`
2. fit transformer     `ss.fit(data)`
3. apply transformer on same/new data. `ss.transform(data)`

___
If fitting and transforming on the same data, `fit_transform()` is a shortcut for `fit()` followed by `transform()`
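
Here is a minimal sketch of the three steps on a toy array, including the "new data" case. The arrays `train` and `new` are hypothetical; the point is that `transform` reuses the mean and scale learned in `fit` rather than recomputing them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two columns on very different scales
train = np.array([[0.4, 200.0],
                  [0.5, 300.0],
                  [0.6, 400.0]])
new = np.array([[0.5, 500.0]])

ss = StandardScaler()                # 1. initialize
ss.fit(train)                        # 2. learn mean and scale from the training data
train_scaled = ss.transform(train)   # 3. apply to the same data...
new_scaled = ss.transform(new)       #    ...or to new data, using the learned values

# fit_transform gives the same result as fit followed by transform
same = np.allclose(train_scaled, StandardScaler().fit_transform(train))

print(ss.mean_)    # learned column means
print(ss.scale_)   # learned column standard deviations
print(new_scaled)
print(same)
```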

In [None]:
# import package
from sklearn.________ import __________


# initialize transformer 
ss = ____________()

# fit the scaler object
ss.fit(df[['NOX','TAX']])

# transform the original variables
df[['NOX','TAX']] = ss.transform(df[['NOX','TAX']]) # remember to assign back to update

# You can also do this in one shot
# df[['NOX','TAX']] = ss.fit_transform(df[['NOX','TAX']]) # fit and transform at the same time

# plot the scatter plot
plt.scatter(df['NOX'], df['TAX'], color='r')
plt.xlabel("NOX standardized")
plt.ylabel("TAX standardized")
plt.show()

### What is happening under the hood when we do ss.fit?

In [None]:
ss.__dict__

In [None]:
ss.mean_, ss.scale_  # these show what standard scaler learned from fitting on the data

<a id='minmax'></a>
### Min-Max Scaling (Another kind of scaling! There are many more varieties)
- Simple rescaling of data to fit a defined interval
- We use the formula:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
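
Applied by hand to a few made-up values, the formula maps the minimum to 0 and the maximum to 1. This is only a sketch of the arithmetic; Sklearn's `MinMaxScaler` in `sklearn.preprocessing` offers the same rescaling through the fit-transform workflow.

```python
values = [10.0, 20.0, 30.0, 50.0]  # toy values for illustration

x_min = min(values)
x_max = max(values)

# x' = (x - x_min) / (x_max - x_min)
scaled = [(v - x_min) / (x_max - x_min) for v in values]

print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```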

## Lesson Summary


Let's review what we learned today. We:

- Applied linear regression to the insurance dataset
- Used different tools such as statsmodels and scikit-learn to perform statistical algorithms
- Understood the importance of feature scaling for certain predictive models
- Applied standardization and min-max scaling to the Boston dataset and observed the effects on a regression model

# Readings 

Linear Regression Assumptions:
- Assumptions of linear regression: https://towardsdatascience.com/linear-regression-assumptions-why-is-it-important-af28438a44a1 

Sklearn:

- Common pitfalls interpreting coefficients: https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html
- Comparing Supervised learning algorithms: https://www.dataschool.io/comparing-supervised-learning-algorithms/
- Sklearn preprocessing guide: https://scikit-learn.org/stable/modules/preprocessing.html
- Free Sklearn Course with 50 short videos: https://courses.dataschool.io/scikit-learn-tips

Statistics:
- Easy channel to get first exposure to any stats concept: https://www.youtube.com/watch?v=PaFPbb66DxQ&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU&ab_channel=StatQuestwithJoshStarmer
- Understanding metrics with Venn Diagrams: https://www.andrewheiss.com/blog/2021/08/21/r2-euler/
- 2 sets of notations for sum of squares: https://365datascience.com/tutorials/statistics-tutorials/sum-squares/

ML Crash Course:
- https://developers.google.com/machine-learning/crash-course
