# Machine Learning Foundation

## Section 2, Part e: Regularization LAB


## Learning objectives

By the end of this lesson, you will be able to:

*   Implement data standardization
*   Implement variants of regularized regression
*   Combine data standardization with the train-test split procedure
*   Implement regularization to prevent overfitting in regression problems


In [1]:
import piplite
await piplite.install(['tqdm', 'seaborn', 'pandas', 'numpy'])

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

np.set_printoptions(precision=3, suppress=True)



In the following cell we load the data and define some useful plotting functions.


In [3]:
np.random.seed(72018)



def to_2d(array):
    return array.reshape(array.shape[0], -1)


    
def plot_exponential_data():
    data = np.exp(np.random.normal(size=1000))
    plt.hist(data)
    plt.show()
    return data
    
def plot_square_normal_data():
    data = np.square(np.random.normal(loc=5, size=1000))
    plt.hist(data)
    plt.show()
    return data

### Loading in Boston Data


In [4]:
import pyodide_http as ph
ph.patch_all()
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML240EN-SkillsNetwork/labs/data/boston_housing_clean.pickle"
boston = pd.read_pickle(url)
boston_data = boston['dataframe']

In [5]:
boston_data.head()

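|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |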


In [6]:
boston_data.shape 

(506, 14)

## Data standardization


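**Standardizing** data refers to rescaling each variable so that it has mean 0 and standard deviation 1, the same mean and standard deviation as a **standard** normal distribution. Standardization only shifts and rescales a variable; it does not change the shape of its distribution.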

The [`StandardScaler`](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01#sklearn.preprocessing.StandardScaler) object in scikit-learn can do this.
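As a minimal sketch of the arithmetic involved (using a small made-up array rather than the Boston data), each column has its own mean subtracted and is then divided by its own standard deviation. `np.std` defaults to the population standard deviation (`ddof=0`), which is the same convention `StandardScaler` uses.

```python
import numpy as np

# Toy data (hypothetical values): two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [10.0, 700.0]])

# z = (x - column mean) / column standard deviation
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_std.mean(axis=0))  # approximately 0 for every column
print(toy_std.std(axis=0))   # approximately 1 for every column
```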


**Generate X and y**:


In [7]:
y_col = "MEDV"

X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]

**Import, fit, and transform using `StandardScaler`**


In [8]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X_ss = s.fit_transform(X)
X_ss

array([[-0.418,  0.285, -1.288, ..., -1.459,  0.441, -1.076],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.441, -0.492],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.396, -1.209],
       ...,
       [-0.411, -0.488,  0.116, ...,  1.176,  0.441, -0.983],
       [-0.406, -0.488,  0.116, ...,  1.176,  0.403, -0.865],
       [-0.413, -0.488,  0.116, ...,  1.176,  0.441, -0.669]])

### Exercise:

Confirm standard scaling


In [9]:
#Hint:

a = np.array([[1, 2, 3], 
              [4, 5, 6]]) 
print(a) # 2 rows, 3 columns

[[1 2 3]
 [4 5 6]]


In [10]:
a.mean(axis=0) # mean of each *column* (one value per column)

array([2.5, 3.5, 4.5])

In [11]:
a.mean(axis=1) # mean of each *row* (one value per row)

array([2., 5.])

In [12]:
X2 = np.array(X)
X2

array([[  0.006,  18.   ,   2.31 , ...,  15.3  , 396.9  ,   4.98 ],
       [  0.027,   0.   ,   7.07 , ...,  17.8  , 396.9  ,   9.14 ],
       [  0.027,   0.   ,   7.07 , ...,  17.8  , 392.83 ,   4.03 ],
       ...,
       [  0.061,   0.   ,  11.93 , ...,  21.   , 396.9  ,   5.64 ],
       [  0.11 ,   0.   ,  11.93 , ...,  21.   , 393.45 ,   6.48 ],
       [  0.047,   0.   ,  11.93 , ...,  21.   , 396.9  ,   7.88 ]])

In [13]:
man_transform = (X2-X2.mean(axis=0))/X2.std(axis=0)  # manual transformation of standard scaler
# man_transform = np.nan_to_num(man_transform, nan=0.0)
# equivalent: np.where(np.isnan(man_transform), 0.0, man_transform)
man_transform

array([[-0.418,  0.285, -1.288, ..., -1.459,  0.441, -1.076],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.441, -0.492],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.396, -1.209],
       ...,
       [-0.411, -0.488,  0.116, ...,  1.176,  0.441, -0.983],
       [-0.406, -0.488,  0.116, ...,  1.176,  0.403, -0.865],
       [-0.413, -0.488,  0.116, ...,  1.176,  0.441, -0.669]])

In [14]:
np.allclose(man_transform, X_ss) # True means the two arrays match element-wise (within tolerance)

True

In [15]:
X_ss

array([[-0.418,  0.285, -1.288, ..., -1.459,  0.441, -1.076],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.441, -0.492],
       [-0.415, -0.488, -0.593, ..., -0.303,  0.396, -1.209],
       ...,
       [-0.411, -0.488,  0.116, ...,  1.176,  0.441, -0.983],
       [-0.406, -0.488,  0.116, ...,  1.176,  0.403, -0.865],
       [-0.413, -0.488,  0.116, ...,  1.176,  0.441, -0.669]])

### Coefficients with and without scaling


In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
lr = LinearRegression()

In [18]:
lr.fit(X, y)
print(lr.coef_) # coefficients range from about -17.8 (NOX) to 3.8 (RM)

[ -0.107   0.046   0.021   2.689 -17.796   3.805   0.001  -1.476   0.306
  -0.012  -0.953   0.009  -0.525]


#### Discussion (together):

The coefficients are on widely different scales. Is this "bad"?
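One way to approach this question: the size of a coefficient depends on the units of its feature. The sketch below uses synthetic data (not the Boston data) and `np.linalg.lstsq` to show that multiplying a feature by 1000 divides its coefficient by 1000 while leaving the predictions unchanged. The helper name `fit_ols` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on a single feature x plus a little noise.
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

def fit_ols(feature, target):
    """Least-squares fit with an intercept; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(feature), feature])
    params, *_ = np.linalg.lstsq(design, target, rcond=None)
    return params

b0, b1 = fit_ols(x, y)            # feature in its original units
c0, c1 = fit_ols(1000.0 * x, y)   # same feature, rescaled by 1000

print(b1, c1)  # the second slope is 1000 times smaller

pred_original = b0 + b1 * x
pred_rescaled = c0 + c1 * (1000.0 * x)
print(np.allclose(pred_original, pred_rescaled))
```

So widely different coefficient scales are not a problem for the predictions of plain linear regression, but they make the raw coefficients hard to compare with one another.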


In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
s = StandardScaler()
X_ss = s.fit_transform(X)

In [21]:
lr2 = LinearRegression()
lr2.fit(X_ss, y)
print(lr2.coef_) # coefficients now "on the same scale"

[-0.92   1.081  0.143  0.682 -2.06   2.671  0.021 -3.104  2.659 -2.076
 -2.062  0.857 -3.749]


### Exercise:

Based on these results, what is the most "impactful" feature (this is intended to be slightly ambiguous)? "In what direction" does it affect "y"?

**Hint:** Recall from last week that we can "zip up" the names of the features of a DataFrame `df` with a model `model` fitted on that DataFrame using:

```python
dict(zip(df.columns.values, model.coef_))
```


In [22]:
zip(X.columns, lr2.coef_)

<zip at 0xa488710>

In [23]:
### BEGIN SOLUTION
pd.DataFrame(zip(X.columns, lr2.coef_)).sort_values(by=1)
### END SOLUTION

|    | 0 | 1 |
|---|---|---|
| 12 | LSTAT | -3.74868 |
| 7 | DIS | -3.104448 |
| 9 | TAX | -2.075898 |
| 10 | PTRATIO | -2.062156 |
| 4 | NOX | -2.060092 |
| 0 | CRIM | -0.920411 |
| 6 | AGE | 0.021121 |
| 2 | INDUS | 0.142967 |
| 3 | CHAS | 0.682203 |
| 11 | B | 0.85664 |


Looking only at the magnitude of the standardized coefficients, LSTAT, DIS, RM, and RAD are the most impactful features. LSTAT and DIS have negative coefficients (higher values predict a lower `MEDV`), while RM and RAD have positive ones. scikit-learn does not report the statistical significance of each coefficient, which would help make this claim stronger or weaker.
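To make the ranking explicit, the sketch below sorts the standardized coefficients by absolute value. The feature names and the rounded values are copied from the `boston_data.head()` and `lr2.coef_` outputs shown earlier in this lab; the ranking logic itself is plain Python.

```python
# Feature names (DataFrame column order) and the rounded lr2.coef_ values
# printed above.
features = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
coefs = [-0.92, 1.081, 0.143, 0.682, -2.06, 2.671, 0.021,
         -3.104, 2.659, -2.076, -2.062, 0.857, -3.749]

# Rank by magnitude, ignoring sign.
ranked = sorted(zip(features, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked[:4]:
    direction = "increases" if coef > 0 else "decreases"
    print(f"{name}: {coef:+.3f} (a higher value {direction} predicted MEDV)")
```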


### Lasso with and without scaling


We discussed Lasso in lecture.

Let's review together:

1.  What is different about Lasso vs. regular Linear Regression?
2.  Is standardization more or less important with Lasso vs. Linear Regression? Why?


In [24]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

#### Create polynomial features


[`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [25]:
pf = PolynomialFeatures(degree=2, include_bias=False,)
X_pf = pf.fit_transform(X)

**Note:** We use `include_bias=False` since `Lasso` fits an intercept by default (`fit_intercept=True`), so a constant column would be redundant.
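As a quick check on how many columns to expect: for `n` input features, degree-2 `PolynomialFeatures` without a bias column produces the `n` original features, `n` squares, and one product for each pair of distinct features. The short calculation below (plain Python, no scikit-learn call) gives the count for the 13 Boston features.

```python
from math import comb

n_features = 13  # columns in X

n_linear = n_features                 # x_i
n_squares = n_features                # x_i ** 2
n_interactions = comb(n_features, 2)  # x_i * x_j with i < j

n_poly = n_linear + n_squares + n_interactions
print(n_poly)  # 104
```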


In [26]:
X_pf_ss = s.fit_transform(X_pf)

### Lasso


[`Lasso` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [27]:
l = Lasso()
l.fit(X, y)
l.coef_

array([-0.063,  0.049, -0.   ,  0.   , -0.   ,  0.947,  0.021, -0.669,
        0.264, -0.015, -0.723,  0.008, -0.761])

In [28]:
las = Lasso()
las.fit(X_pf_ss, y)
las.coef_ 

array([-0.   ,  0.   , -0.   ,  0.   , -0.   ,  0.   , -0.   , -0.   ,
       -0.   , -0.   , -0.991,  0.   , -0.   , -0.   ,  0.   , -0.   ,
        0.068, -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   , -0.   ,  0.   ,
       -0.   , -0.   , -0.   , -0.05 , -0.   , -0.   , -0.   , -0.   ,
       -0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,
        0.   ,  0.   ,  0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.   , -0.   ,  3.3  , -0.   , -0.   , -0.   ,
       -0.   , -0.   ,  0.42 , -3.498, -0.   , -0.   , -0.   , -0.   ,
       -0.   ,  0.   , -0.   , -0.   , -0.   , -0.146, -0.   , -0.   ,
       -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,
       -0.   , -0.   , -0.   ,  0.   , -0.   ,  0.   , -0.   , -0.   ])

In [29]:
print("Lasso without scaling:", l.coef_.sum())
print("Lasso with scaling:", las.coef_.sum())

Lasso without scaling: -0.9429244307679312
Lasso with scaling: -0.8967378020435977


### Exercise

Compare

*   Sum of magnitudes of the coefficients
*   Number of coefficients that are zero

for Lasso with alpha 0.1 vs. 1.

Before doing the exercise, answer the following questions in one sentence each:

*   Which do you expect to have greater magnitude?
*   Which do you expect to have more zeros?


In [30]:
### BEGIN SOLUTION
las01 = Lasso(alpha = 0.1)
las01.fit(X_pf_ss, y)
print(las01.coef_)
print('sum of coefficients:', abs(las01.coef_).sum() )
print('number of coefficients not equal to 0:', str((las01.coef_!=0).sum()) +"/"+ str(len(las01.coef_)))

[-0.     0.     0.     0.    -0.     0.     0.    -0.     0.876  0.
 -0.     0.    -0.    -0.     0.336 -0.     0.907 -0.    -0.488 -0.
 -0.    -0.    -0.    -0.    -0.323 -0.284  0.326 -0.323  0.15   0.
  0.     0.     0.     0.     0.     0.     0.    -0.     0.229 -0.
  0.    -0.     0.    -0.131  0.     0.     0.     0.    -0.     0.
 -0.019  0.     0.     0.    -0.    -0.     0.     0.    -0.    -0.
 -1.542 -0.    -0.909  0.    -0.    -0.     0.    -0.     6.015 -0.
 -0.     0.    -0.    -1.678  0.    -5.121  0.    -0.     0.     0.
 -0.     0.    -0.     0.     0.    -1.022 -0.    -0.     0.125  0.
  0.     0.     0.93  -0.111  0.     0.     0.    -1.627 -0.     0.
 -0.    -0.    -0.     2.699]
sum of coefficients: 26.172415115426773
number of coefficients not equal to 0: 23/104


In [31]:
las1 = Lasso(alpha = 1)
las1.fit(X_pf_ss, y)
print(las1.coef_)
print('sum of coefficients:', abs(las1.coef_).sum() )
print('number of coefficients not equal to 0:', str((las1.coef_!=0).sum()) +"/"+ str(len(las1.coef_)))

[-0.     0.    -0.     0.    -0.     0.    -0.    -0.    -0.    -0.
 -0.991  0.    -0.    -0.     0.    -0.     0.068 -0.    -0.    -0.
 -0.    -0.    -0.    -0.    -0.    -0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.    -0.     0.
 -0.    -0.    -0.    -0.05  -0.    -0.    -0.    -0.    -0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.     0.    -0.
 -0.    -0.    -0.    -0.    -0.    -0.     0.    -0.     3.3   -0.
 -0.    -0.    -0.    -0.     0.42  -3.498 -0.    -0.    -0.    -0.
 -0.     0.    -0.    -0.    -0.    -0.146 -0.    -0.    -0.    -0.
 -0.    -0.    -0.    -0.    -0.    -0.    -0.    -0.    -0.     0.
 -0.     0.    -0.    -0.   ]
sum of coefficients: 8.472405227760156
number of coefficients not equal to 0: 7/104


With more regularization (higher alpha), the penalty on large weights is greater, so the coefficients are pushed toward zero. A higher alpha therefore gives a smaller sum of magnitudes (8.47 vs. 26.17 here) and more coefficients set exactly to 0 (97 vs. 81 of 104 here).
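The effect of `alpha` can be seen in a simplified setting. Assuming features that are exactly uncorrelated and scaled so that $X^T X / n = I$ (an idealization that does not hold for the polynomial features above), the Lasso solution for scikit-learn's objective reduces to soft-thresholding the least-squares coefficients by `alpha`. The coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(coefs, alpha):
    """Move each coefficient toward zero by alpha, stopping at zero."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

# Hypothetical least-squares coefficients.
ols_coefs = np.array([3.0, -1.5, 0.8, -0.3, 0.05])

for alpha in (0.1, 1.0):
    lasso_coefs = soft_threshold(ols_coefs, alpha)
    print(f"alpha={alpha}: coefs={lasso_coefs}, "
          f"sum of magnitudes={np.abs(lasso_coefs).sum():.2f}, "
          f"zeros={(lasso_coefs == 0).sum()}")
```

Any coefficient whose magnitude is below `alpha` becomes exactly zero, which is why a larger `alpha` produces more zeros.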


### Exercise: $R^2$


Calculate the $R^2$ of each model without train/test split.

Recall that we import $R^2$ using:

```python
from sklearn.metrics import r2_score
```


In [32]:
### BEGIN SOLUTION
from sklearn.metrics import r2_score
r2_score(y, las.predict(X_pf_ss))
### END SOLUTION

0.7207000461229027

#### Discuss:

Will regularization ever increase model performance if we evaluate on the same dataset that we trained on?
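One way to explore this question numerically is sketched below with synthetic data and the closed-form Ridge solution (used here instead of Lasso only because it fits in one line of NumPy). The data, the `alphas` grid, and the helper name `train_r2` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data so that no intercept is needed.
X_demo = rng.normal(size=(100, 5))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
y_demo -= y_demo.mean()

def train_r2(alpha):
    """R^2 on the training data for Ridge with penalty strength alpha."""
    n_cols = X_demo.shape[1]
    w = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(n_cols),
                        X_demo.T @ y_demo)
    residual = y_demo - X_demo @ w
    return 1.0 - (residual @ residual) / (y_demo @ y_demo)

alphas = [0.0, 1.0, 10.0, 100.0]
scores = [train_r2(a) for a in alphas]
for a, r2 in zip(alphas, scores):
    print(f"alpha={a:>5}: training R^2 = {r2:.4f}")
```

Unregularized least squares already minimizes the training error, so adding a penalty cannot lower it; any benefit of regularization shows up on held-out data.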


## With train/test split


#### Discuss

Are there any issues with what we've done so far?

**Hint:** Think about the way we have done feature scaling.

Discuss in groups of two or three.


In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X.shape, X_pf.shape

((506, 13), (506, 104))

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, 
                                                    random_state=72018)

In [38]:
# Transformation ---> Polynomial Features >> Standard Scaling
X_train_s = s.fit_transform(X_train)
print(X_train_s.shape[1])
las.fit(X_train_s, y_train)           # alpha = 1.0 (default)
X_test_s = s.transform(X_test)
y_pred = las.predict(X_test_s)
r2_score(y_test, y_pred)

104


0.6780325981174933

In [39]:
X_train_s = s.fit_transform(X_train)
las01.fit(X_train_s, y_train)          # alpha = 0.1
X_test_s = s.transform(X_test)
y_pred = las01.predict(X_test_s)
r2_score(y_test, y_pred)

0.7999261342846065

### Exercise

#### Part 1:

Do the same thing with Lasso of:

*   `alpha` of 0.001
*   Increase `max_iter` to 100000 to ensure convergence.

Calculate the $R^2$ of the model.

Feel free to copy-paste code from above, but write a one sentence comment above each line of code explaining why you're doing what you're doing.

#### Part 2:

Do the same procedure as before, but with Linear Regression.

Calculate the $R^2$ of this model.

#### Part 3:

Compare the sums of the absolute values of the coefficients for both models, as well as the number of coefficients that are zero. Based on these measures, which model is a "simpler" description of the relationship between the features and the target?


**PART 1** --> Lasso Regression with alpha = 0.001 and max_iter = 100000

In [40]:
# Decreasing regularization and ensuring convergence
las001 = Lasso(alpha = 0.001, max_iter=100000)

**Decreasing regularization (`alpha=0.001`)**
- `alpha` is the regularization strength in Lasso regression.
- A smaller value (0.001) means **less regularization**, meaning Lasso will shrink fewer coefficients to zero.
- If alpha is too large, it can force **too many coefficients** to become exactly zero, leading to **underfitting**.

**Ensuring convergence (`max_iter=100000`)**
- `max_iter` is the maximum number of iterations for the optimization algorithm.
- A higher value (100000) ensures that the algorithm has enough iterations to converge.
- If max_iter is too low, the optimization may stop before finding the best coefficients, leading to poor performance or a ConvergenceWarning.
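To illustrate why an iteration cap matters, here is a minimal coordinate-descent sketch for the Lasso objective $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\lVert w\rVert_1$. scikit-learn's `Lasso` also uses coordinate descent, but this is a simplified illustration and not its actual implementation; the function name `lasso_cd`, the stopping rule, and the synthetic data are assumptions.

```python
import numpy as np

def lasso_cd(X, y, alpha, max_iter, tol=1e-8):
    """Cyclic coordinate descent; returns (coefs, passes_used, converged)."""
    n_samples, n_cols = X.shape
    w = np.zeros(n_cols)
    col_scale = (X ** 2).sum(axis=0) / n_samples
    for n_pass in range(1, max_iter + 1):
        w_old = w.copy()
        for j in range(n_cols):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            # Soft-threshold rho by alpha, then rescale.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
        # Stop once a full pass barely changes any coefficient.
        if np.max(np.abs(w - w_old)) < tol:
            return w, n_pass, True
    return w, max_iter, False

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Five correlated features: a shared component plus independent noise.
X_demo = base + 0.5 * rng.normal(size=(200, 5))
y_demo = (X_demo @ np.array([1.0, -2.0, 0.0, 0.5, 0.0])
          + rng.normal(scale=0.1, size=200))

for cap in (1, 10000):
    coefs, passes, converged = lasso_cd(X_demo, y_demo, alpha=0.001, max_iter=cap)
    print(f"max_iter={cap}: passes used={passes}, converged={converged}")
```

With a cap of one pass the loop stops before the coefficients have settled, which is the situation a `ConvergenceWarning` reports.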

In [41]:
# Transforming training set to get standardized units
X_train_s = s.fit_transform(X_train)

In [42]:
# Fitting model to training set
las001.fit(X_train_s, y_train)

In [43]:
# Transforming test set using the parameters defined from training set
X_test_s = s.transform(X_test)

In [44]:
# Finding prediction on test set
y_pred = las001.predict(X_test_s)

In [45]:
# Calculating r2 score
print("r2 score for alpha = 0.001:", r2_score(y_test, y_pred))

r2 score for alpha = 0.001: 0.8847893236874534


**PART 2** --> Using Linear Regression

In [46]:
# Using vanilla Linear Regression
lr = LinearRegression()

# Fitting model to training set
lr.fit(X_train_s, y_train)

# predicting on test set
y_pred_lr = lr.predict(X_test_s)

# Calculating r2 score
print("r2 score for Linear Regression:", r2_score(y_test,y_pred_lr))

r2 score for Linear Regression: 0.8689110469231067


**PART 3** --> Comparison between 2 models

In [48]:
# Part 3
print('Magnitude of Lasso coefficients:', abs(las001.coef_).sum())
print('Number of coefficients not equal to 0 for Lasso:', (las001.coef_!=0).sum())
print('Remark: Lasso sets some coefficients exactly to zero (too many zeros can lead to underfitting).')
print()
print('Magnitude of Linear Regression coefficients:', abs(lr.coef_).sum())
print('Number of coefficients not equal to 0 for Linear Regression:', (lr.coef_!=0).sum())
print('Remark: Linear Regression keeps every coefficient nonzero (no regularization).')

Magnitude of Lasso coefficients: 435.5723229043781
Number of coefficients not equal to 0 for Lasso: 90
Remark: Lasso sets some coefficients exactly to zero (too many zeros can lead to underfitting).

Magnitude of Linear Regression coefficients: 1183.8918138675313
Number of coefficients not equal to 0 for Linear Regression: 104
Remark: Linear Regression keeps every coefficient nonzero (no regularization).


**EXPLANATION**  
1. Lasso Regression (α = 0.001)  
✔ Lower magnitude of coefficients (435.57) --> Simpler model with lower variance  
✔ 90 non-zero coefficients (14 of 104 set to zero) --> Feature selection effect, reducing complexity  
✔ Coefficients shrunk to zero --> Could lead to underfitting if too many features are removed  

2. Linear Regression  
✔ Higher magnitude of coefficients (1183.89) --> More complex model with higher variance  
✔ 104 non-zero coefficients --> Retains all features  
✔ No regularization --> Can lead to overfitting if features are correlated or noisy


**Remark**: 
- Features whose coefficients equal zero can be removed from the dataset, which reduces model complexity; **too many features can lead to overfitting**.
- In contrast, if too many coefficients are forced to zero, the model loses information, which can lead to **underfitting**.
- Better interpretability: a model with fewer nonzero coefficients is easier to interpret.

In [54]:
X_train.shape

(354, 104)

In [107]:
# Sanity check: refit Lasso using only the features it selected

# construct features
features = np.arange(X_train.shape[1])

# select nonzero features
sel_features = features[las001.coef_ != 0]

# select features from X_train_s
X_train_sel = X_train_s[:, sel_features]

# select features from X_test_s
X_test_sel = X_test_s[:, sel_features]

h_las = Lasso(alpha = 0.001, max_iter=100000)
h_las.fit(X_train_sel, y_train)
y_pred = h_las.predict(X_test_sel)
print("R2 scores:", r2_score(y_test, y_pred))

# compare summary statistics of the true and predicted values
print(f"max(y_test) = {round(y_test.max(),3)}, max(y_pred) = {round(y_pred.max(),3)}")
print(f"avg(y_test) = {round(y_test.mean(),3)}, avg(y_pred) = {round(y_pred.mean(),3)}")
print(f"min(y_test) = {round(y_test.min(),3)}, min(y_pred) = {round(y_pred.min(),3)}")

# compare the first five true and predicted values
y_test[:5], y_pred[:5]

R2 scores: 0.8847643917978281
max(y_test) = 50.0, max(y_pred) = 49.886
avg(y_test) = 22.576, avg(y_pred) = 22.667
min(y_test) = 7.0, min(y_pred) = 4.617


(502    20.6
 127    16.2
 390    15.1
 303    33.1
 277    33.1
 Name: MEDV, dtype: float64,
 array([18.213, 15.911, 17.006, 32.41 , 31.521]))

## L1 vs. L2 Regularization


As mentioned in the deck: `Lasso` and `Ridge` regression have the same syntax in scikit-learn.

Now we're going to compare the results from Ridge vs. Lasso regression:


[`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [49]:
from sklearn.linear_model import Ridge

### Exercise

Following the Ridge documentation from above:

1.  Define a Ridge object `r` with the same `alpha` as `las001`.
2.  Fit that object on the scaled training data (`X_train_s` and `y_train`) and print out the resulting coefficients.


In [None]:
### BEGIN SOLUTION
# Ridge with the same alpha as las001
r = Ridge(alpha = 0.001)
# Fit the scaler and the model on the training set only
X_train_s = s.fit_transform(X_train)
r.fit(X_train_s, y_train)
# Scale the test set with the training statistics, then predict
X_test_s = s.transform(X_test)
y_pred_r = r.predict(X_test_s)

# Inspect the fitted coefficients
r.coef_
### END SOLUTION

In [None]:
las001 # same alpha as Ridge above

In [None]:
las001.coef_

In [None]:
print(np.sum(np.abs(r.coef_)))
print(np.sum(np.abs(las001.coef_)))

print(np.sum(r.coef_ != 0))
print(np.sum(las001.coef_ != 0))

**Conclusion:** Ridge does not make any coefficients 0. In addition, on this particular dataset, Lasso provides stronger overall regularization than Ridge for this value of `alpha` (not necessarily true in general).
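The first point, that Ridge shrinks coefficients without setting them exactly to zero while Lasso can zero them out, can be checked on a small synthetic problem. The data and the `alpha` values below are illustrative assumptions, unrelated to the Boston results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 features influence the target.
X_demo = rng.normal(size=(200, 10))
true_coefs = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coefs + rng.normal(size=200)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("Ridge zero coefficients:", np.sum(ridge_demo.coef_ == 0))
print("Lasso zero coefficients:", np.sum(lasso_demo.coef_ == 0))
```

Note that scikit-learn scales the two objectives differently (`Lasso` divides the squared error by `2 * n_samples`; `Ridge` does not), so equal `alpha` values do not imply equal regularization strength.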


In [None]:
y_pred = r.predict(X_pf_ss)
print(r2_score(y, y_pred))

y_pred = las001.predict(X_pf_ss)
print(r2_score(y, y_pred))

**Conclusion**: Ignoring issues of overfitting, Ridge does slightly better than Lasso when `alpha` is set to 0.001 for each (not necessarily true in general).


## Example: Does it matter when you scale?


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
s = StandardScaler()
lr_s = LinearRegression()
X_train_s = s.fit_transform(X_train)
lr_s.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_s = lr_s.predict(X_test_s)
r2_score(y_test, y_pred_s)  # score the predictions of lr_s, not the earlier y_pred

**Conclusion:** For Linear Regression, it doesn't matter whether you scale before or after the train/test split, in terms of the raw predictions. However, it matters for other algorithms, including regularized models such as Lasso and Ridge. Plus, as we'll see later, we can make scaling part of a `Pipeline`.
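As a preview, a minimal `Pipeline` sketch is shown below using scikit-learn's `make_pipeline` on synthetic data (the data and the choice of `alpha=0.1` are illustrative assumptions). Calling `fit` on the pipeline fits the scaler on the training data only, and `predict` or `score` reuses those training statistics on new data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data with features on very different scales.
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.3, 0.0, 0.001]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# Scaling and the model are fitted together, on the training split only.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X_tr, y_tr)

print("test R^2:", pipe.score(X_te, y_te))
```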


***

### Machine Learning Foundation (C) 2020 IBM Corporation
