# Machine Learning Foundation

## Section 2, Part e: Regularization LAB


## Learning objectives

By the end of this lesson, you will be able to:

*   Implement data standardization
*   Implement variants of regularized regression
*   Combine data standardization with the train-test split procedure
*   Implement regularization to prevent overfitting in regression problems


In [None]:
import piplite
await piplite.install(['tqdm', 'seaborn', 'pandas', 'numpy'])

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

np.set_printoptions(precision=3, suppress=True)

In the following cell we load the data and define some useful plotting functions.


In [None]:
np.random.seed(72018)



def to_2d(array):
    return array.reshape(array.shape[0], -1)


    
def plot_exponential_data():
    data = np.exp(np.random.normal(size=1000))
    plt.hist(data)
    plt.show()
    return data
    
def plot_square_normal_data():
    data = np.square(np.random.normal(loc=5, size=1000))
    plt.hist(data)
    plt.show()
    return data

### Loading in Boston Data


In [None]:
from pyodide.http import pyfetch
 
async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML240EN-SkillsNetwork/labs/data/boston_housing_clean.pickle"
 
#you will need to download the dataset; if you are running locally, please comment out the following 
await download(path, "boston_housing_clean.pickle")
 
 
# Import pandas library
import pandas as pd

In [None]:
with open('boston_housing_clean.pickle', 'rb') as to_read:
    boston = pd.read_pickle(to_read)
boston_data = boston['dataframe']
boston_description = boston['description']

# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
boston_data.head()

## Data standardization


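**Standardizing** data refers to shifting and rescaling each variable so that it has mean 0 and standard deviation 1, the same location and scale as a **standard** normal distribution. Standardizing does not change the shape of a variable's distribution: a skewed variable stays skewed.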

The [`StandardScaler`](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01#sklearn.preprocessing.StandardScaler) object in SciKit Learn can do this.
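
For reference, the transformation applied to each column can be written as the usual z-score. The symbols below are notation introduced here for illustration rather than taken from the lab: $x_{ij}$ is the value of feature $j$ in row $i$, $n$ is the number of rows, and $\mu_j$ and $\sigma_j$ are that column's mean and population standard deviation.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$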


**Generate X and y**:


In [None]:
y_col = "MEDV"

X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]

**Import, fit, and transform using `StandardScaler`**


In [None]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X_ss = s.fit_transform(X)

### Exercise:

Confirm standard scaling


In [None]:
#Hint:

a = np.array([[1, 2, 3], 
              [4, 5, 6]]) 
print(a) # 2 rows, 3 columns

In [None]:
a.mean(axis=0) # mean of each column (averages down the rows)

In [None]:
a.mean(axis=1) # mean of each row (averages across the columns)

In [None]:
### BEGIN SOLUTION
X2 = np.array(X)
man_transform = (X2-X2.mean(axis=0))/X2.std(axis=0)
np.allclose(man_transform, X_ss)
### END SOLUTION
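
One detail that can trip up a manual check like the one above: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default, whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below uses a small made-up DataFrame (`toy` is a hypothetical name, not part of the lab) to show the difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 30.0, 50.0]})

scaled = StandardScaler().fit_transform(toy)

# NumPy's default std (ddof=0) matches StandardScaler
values = toy.to_numpy()
manual_population = (values - values.mean(axis=0)) / values.std(axis=0)

# pandas' default std (ddof=1) gives slightly different values
manual_sample = ((toy - toy.mean()) / toy.std()).to_numpy()

print(np.allclose(scaled, manual_population))  # True
print(np.allclose(scaled, manual_sample))      # False
```

Converting to a NumPy array first, as the solution does with `np.array(X)`, avoids this mismatch.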

### Coefficients with and without scaling


In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

y_col = "MEDV"

X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]

In [None]:
lr.fit(X, y)
print(lr.coef_) # min = -18

#### Discussion (together):

The coefficients are on widely different scales. Is this "bad"?


In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
s = StandardScaler()
X_ss = s.fit_transform(X)

In [None]:
lr2 = LinearRegression()
lr2.fit(X_ss, y)
print(lr2.coef_) # coefficients now "on the same scale"

### Exercise:

Based on these results, what is the most "impactful" feature (this is intended to be slightly ambiguous)? "In what direction" does it affect "y"?

**Hint:** Recall from last week that we can "zip up" the names of the features of a DataFrame `df` with a model `model` fitted on that DataFrame using:

```python
dict(zip(df.columns.values, model.coef_))
```


In [None]:
### BEGIN SOLUTION
pd.DataFrame(zip(X.columns, lr2.coef_)).sort_values(by=1)
### END SOLUTION

Judging only by the magnitude of the standardized coefficients, LSTAT, DIS, RM, and RAD are the most impactful features. scikit-learn does not report the statistical significance of each coefficient, which would help strengthen or weaken this claim.
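
To make the point about scale concrete, here is a minimal sketch on synthetic data. The feature names `a`, `b`, `c` and the helper `rank_by_magnitude` are hypothetical and exist only for this illustration. It ranks coefficients by absolute value before and after standardization, and the two rankings disagree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (std of about 1, 10, 100)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)) * [1, 10, 100],
                     columns=["a", "b", "c"])
# Per standard deviation, "b" moves y the most: 0.5 * 10 = 5
y_toy = (3 * X_toy["a"] - 0.5 * X_toy["b"] + 0.001 * X_toy["c"]
         + rng.normal(scale=0.1, size=200))

raw = LinearRegression().fit(X_toy, y_toy)
std = LinearRegression().fit(StandardScaler().fit_transform(X_toy), y_toy)

def rank_by_magnitude(coefs, names):
    """Return coefficients ordered by absolute value, largest first."""
    series = pd.Series(coefs, index=names)
    return series.reindex(series.abs().sort_values(ascending=False).index)

raw_rank = rank_by_magnitude(raw.coef_, X_toy.columns)
std_rank = rank_by_magnitude(std.coef_, X_toy.columns)
print(raw_rank)  # "a" comes first on the raw scale
print(std_rank)  # "b" comes first after standardization
```

The sign of each standardized coefficient still gives the direction of the effect, so the same ranking answers both halves of the exercise.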

### **(Optional) Regularization details - Part 1**

This notebook reviews the solutions to the **Regularization LAB**, focusing on **data standardization** and on implementing **regularized regression (Lasso and Ridge)** to prevent overfitting.

---

### **1️⃣ Data Standardization**
- We work with the **Boston dataset**, loaded from an auxiliary file.
- **Standardization:** The variables are transformed so that they have **mean 0 and standard deviation 1**.
- `StandardScaler` from `sklearn` is applied, and its result is checked against a manual transformation in NumPy.
- **Verification with `np.allclose()`**: This confirms that both transformations (manual and automatic) produce equivalent values.

---

### **2️⃣ Linear Regression with Scaled and Unscaled Data**
- A **linear regression model** is trained on both versions of the data (scaled and unscaled).
- **Differences in coefficients:**
  - **Unscaled:** The coefficients have **very different magnitudes**, which makes it hard to interpret the importance of the variables.
  - **Scaled:** The coefficients are on the **same scale**, which makes it easier to compare their impact on the output variable (`MEDV`).
- The **most influential features** are identified:
  - **LSTAT** (lower socioeconomic status) has the lowest coefficient (negative impact).
  - **RM** (number of rooms) has the highest coefficient (positive impact).

---

### **3️⃣ Interpretation and Application**
- **Linear regression on scaled data** makes it possible to assess **the relative importance of the features**.
- It shows how standardization **improves the interpretation and comparison of coefficients**.

📌 **Next steps:** **Lasso regression** will be implemented, comparing its performance on scaled and unscaled data. 🚀

### Lasso with and without scaling


We discussed Lasso in lecture.

Let's review together:

1.  What is different about Lasso vs. regular Linear Regression?
2.  Is standardization more or less important with Lasso vs. Linear Regression? Why?


In [None]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

#### Create polynomial features


[`PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [None]:
pf = PolynomialFeatures(degree=2, include_bias=False,)
X_pf = pf.fit_transform(X)

**Note:** We use `include_bias=False` since `Lasso` includes a bias by default.
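
As a quick illustration of what `PolynomialFeatures(degree=2, include_bias=False)` produces, the sketch below expands a single made-up row and then counts the output columns for 13 inputs. The number 13 is an assumption about how many feature columns `X` has. The names `pf_demo`, `row`, and `wide` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

pf_demo = PolynomialFeatures(degree=2, include_bias=False)

# One row with two features, x0 = 2 and x1 = 3
row = np.array([[2.0, 3.0]])
expanded = pf_demo.fit_transform(row)
print(expanded)  # columns: x0, x1, x0^2, x0*x1, x1^2

# With 13 input features: 13 linear + 13 squares + 78 pairwise products
wide = np.zeros((1, 13))
n_out = PolynomialFeatures(degree=2, include_bias=False).fit_transform(wide).shape[1]
print(n_out)  # 104
```

With `include_bias=True` there would be one extra constant column of ones, which would be redundant with the intercept that the model already fits.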


In [None]:
X_pf_ss = s.fit_transform(X_pf)

### Lasso


[`Lasso` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [None]:
las = Lasso()
las.fit(X_pf_ss, y)
las.coef_ 

### Exercise

Compare

*   Sum of magnitudes of the coefficients
*   Number of coefficients that are zero

for Lasso with alpha 0.1 vs. 1.

Before doing the exercise, answer the following questions in one sentence each:

*   Which do you expect to have greater magnitude?
*   Which do you expect to have more zeros?


In [None]:
### BEGIN SOLUTION
las01 = Lasso(alpha = 0.1)
las01.fit(X_pf_ss, y)
print('sum of absolute coefficients:', abs(las01.coef_).sum())
print('number of coefficients not equal to 0:', (las01.coef_ != 0).sum())

In [None]:
las1 = Lasso(alpha = 1)
las1.fit(X_pf_ss, y)
print('sum of absolute coefficients:', abs(las1.coef_).sum())
print('number of coefficients not equal to 0:', (las1.coef_ != 0).sum())
### END SOLUTION

With more regularization (higher `alpha`), the penalty on large weights is greater, so the coefficients are pushed toward zero. A higher `alpha` therefore gives a smaller total coefficient magnitude and more coefficients set exactly to 0.
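
The same trend can be sketched on synthetic data, independent of the Boston dataset. Everything below (`X_toy`, `y_toy`, the choice of 20 features with only 3 informative ones, and the `alpha` grid) is a hypothetical setup chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 20 features, only the first 3 influence y
X_toy = rng.normal(size=(300, 20))
y_toy = 4 * X_toy[:, 0] - 2 * X_toy[:, 1] + X_toy[:, 2] + rng.normal(size=300)
X_toy_s = StandardScaler().fit_transform(X_toy)

results = {}
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha, max_iter=100000).fit(X_toy_s, y_toy)
    # Store (sum of absolute coefficients, number of nonzero coefficients)
    results[alpha] = (np.abs(model.coef_).sum(), int((model.coef_ != 0).sum()))
    print(f"alpha={alpha}: sum |coef| = {results[alpha][0]:.3f}, "
          f"nonzero = {results[alpha][1]}")
```

On data like this, the sum of absolute coefficients should shrink as `alpha` grows, and the uninformative features are the first to be set to zero.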


### Exercise: $R^2$


Calculate the $R^2$ of each model without train/test split.

Recall that we import $R^2$ using:

```python
from sklearn.metrics import r2_score
```


In [None]:
### BEGIN SOLUTION
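from sklearn.metrics import r2_score
# R^2 on the full (training) data for each Lasso model fitted above
for name, model in [('alpha = 0.1', las01), ('alpha = 1', las1)]:
    print(name, r2_score(y, model.predict(X_pf_ss)))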
### END SOLUTION

#### Discuss:

Will regularization ever increase model performance if we evaluate on the same dataset that we trained on?


## With train/test split


#### Discuss

Are there any issues with what we've done so far?

**Hint:** Think about the way we have done feature scaling.

Discuss in groups of two or three.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
X_train_s = s.fit_transform(X_train)
las.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred = las.predict(X_test_s)
r2_score(y_test, y_pred)

In [None]:
X_train_s = s.fit_transform(X_train)
las01.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred = las01.predict(X_test_s)
r2_score(y_test, y_pred)

### **(Optional) Regularization details - Part 2**

This notebook introduces **Lasso (L1) regression**, explaining how it differs from linear regression and how it regularizes the model. It discusses the importance of **standardization** and the influence of the **α (alpha)** parameter, and evaluates model performance on training and test data.

---

### **1️⃣ Difference between Lasso and Linear Regression**
- **Lasso adds a penalty to the cost function**, based on the **sum of the absolute values** of the coefficients.
- **Standardization is key** for Lasso: without it, coefficients can be penalized unevenly because the features are on different scales.
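
The uneven penalization can be seen in a minimal sketch with two equally informative synthetic features, one of which is recorded in much larger units. The data, the factor of 1000, and the names `X_raw`, `coef_raw`, and `coef_scaled` are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two equally informative features; the second is recorded in units
# 1000 times larger, so its raw values are 1000 times smaller
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y_toy = x1 + x2 + rng.normal(scale=0.1, size=500)
X_raw = np.column_stack([x1, x2 / 1000])

coef_raw = Lasso(alpha=0.1).fit(X_raw, y_toy).coef_
coef_scaled = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X_raw), y_toy).coef_

print("raw features:   ", coef_raw)     # second coefficient is dropped
print("scaled features:", coef_scaled)  # both features are kept
```

On the raw data, the second feature would need a coefficient near 1000 to contribute, which the L1 penalty makes too expensive, so Lasso discards it even though it is just as informative as the first.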

---

### **2️⃣ Implementing Lasso and Comparing α (Regularization)**
- The **default Lasso (α = 1.0)** is used, and many coefficients are **shrunk to zero**, eliminating irrelevant features.
- This is compared with a **smaller α (0.1)**:
  - **Higher α** → **More regularization**, more coefficients at **zero**, and lower model complexity.
  - **Lower α** → **Less regularization**, more active coefficients, and higher complexity.
- The **sum of the absolute values of the coefficients** and the **number of nonzero coefficients** are evaluated, confirming that a higher α reduces both the magnitude and the number of active coefficients.

---

### **3️⃣ Evaluation with R² and Data Splitting**
- **R² (coefficient of determination)** is measured to evaluate model performance.
- **More regularization (high α) lowers R²** on the training set, since it limits the model's ability to fit the data too closely.
- **train_test_split** is introduced to evaluate generalization to new data:
  - **High α (1.0)** → **low** test R² (0.33), indicating that the model is too simple (**underfitting**).
  - **Low α (0.1)** → **higher** test R², a better fit to the holdout set.

---

### **4️⃣ Conclusion**
- **The balance between bias and variance is key**: too much regularization (high α) can make the model **underfit**, while too little regularization can cause **overfitting**.
- **Next step:** **Different values of α** will be explored to tune the model and find the right balance. 🚀

### Exercise

#### Part 1:

Repeat the procedure above using Lasso with:

*   `alpha` set to 0.001
*   `max_iter` increased to 100000 to ensure convergence

Calculate the $R^2$ of the model.

Feel free to copy-paste code from above, but write a one-sentence comment above each line of code explaining why you're doing what you're doing.

#### Part 2:

Do the same procedure as before, but with Linear Regression.

Calculate the $R^2$ of this model.

#### Part 3:

Compare the sums of the absolute values of the coefficients for both models, as well as the number of coefficients that are zero. Based on these measures, which model is a "simpler" description of the relationship between the features and the target?


In [None]:
### BEGIN SOLUTION

# Part 1

# Decreasing regularization and ensuring convergence
las001 = Lasso(alpha = 0.001, max_iter=100000)

# Transforming training set to get standardized units
X_train_s = s.fit_transform(X_train)

# Fitting model to training set
las001.fit(X_train_s, y_train)

# Transforming test set using the parameters defined from training set
X_test_s = s.transform(X_test)

# Finding prediction on test set
y_pred = las001.predict(X_test_s)

# Calculating r2 score
print("r2 score for alpha = 0.001:", r2_score(y_test, y_pred))


# Part 2

# Using vanilla Linear Regression
lr = LinearRegression()

# Fitting model to training set
lr.fit(X_train_s, y_train)

# predicting on test set
y_pred_lr = lr.predict(X_test_s)

# Calculating r2 score
print("r2 score for Linear Regression:", r2_score(y_test,y_pred_lr))


# Part 3
print('Magnitude of Lasso coefficients:', abs(las001.coef_).sum())
print('Number of coefficients not equal to 0 for Lasso:', (las001.coef_!=0).sum())

print('Magnitude of Linear Regression coefficients:', abs(lr.coef_).sum())
print('Number of coefficients not equal to 0 for Linear Regression:', (lr.coef_!=0).sum())
### END SOLUTION

## L1 vs. L2 Regularization


As mentioned in the deck, `Lasso` and `Ridge` regression have the same syntax in scikit-learn.

Now we're going to compare the results from Ridge vs. Lasso regression:


[`Ridge`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01)


In [None]:
from sklearn.linear_model import Ridge

### Exercise

Following the Ridge documentation from above:

1.  Define a Ridge object `r` with the same `alpha` as `las001`.
2.  Fit that object on the standardized training data (`X_train_s` and `y_train`) and print out the resulting coefficients.


In [None]:
### BEGIN SOLUTION
# Ridge with the same alpha as las001
r = Ridge(alpha = 0.001)
# Standardize using statistics from the training set only
X_train_s = s.fit_transform(X_train)
r.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_r = r.predict(X_test_s)

# Display the fitted coefficients
r.coef_
### END SOLUTION

In [None]:
las001 # same alpha as Ridge above

In [None]:
las001.coef_

In [None]:
print(np.sum(np.abs(r.coef_)))
print(np.sum(np.abs(las001.coef_)))

print(np.sum(r.coef_ != 0))
print(np.sum(las001.coef_ != 0))

**Conclusion:** Ridge does not make any coefficients 0. In addition, on this particular dataset, Lasso provides stronger overall regularization than Ridge for this value of `alpha` (not necessarily true in general).
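
One reason the same `alpha` can act so differently in the two models is that their objectives are scaled differently. As documented for scikit-learn's `Lasso` and `Ridge`, the two estimators minimize the following, where the notation is introduced here for illustration: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, and $w$ the coefficient vector.

$$
\text{Lasso:}\quad \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1
\qquad\qquad
\text{Ridge:}\quad \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2
$$

The squared-error term is divided by $2n$ in Lasso but not in Ridge, so for a given `alpha` the Lasso penalty weighs more heavily relative to the data-fit term. This is likely one contributor to the stronger shrinkage observed here, alongside the different shapes of the L1 and L2 penalties.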


In [None]:
y_pred = r.predict(X_pf_ss)
print(r2_score(y, y_pred))

y_pred = las001.predict(X_pf_ss)
print(r2_score(y, y_pred))

**Conclusion**: Ignoring issues of overfitting, Ridge does slightly better than Lasso when `alpha` is set to 0.001 for each (not necessarily true in general).


# Example: Does it matter when you scale?


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
s = StandardScaler()
lr_s = LinearRegression()
X_train_s = s.fit_transform(X_train)
lr_s.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_s = lr_s.predict(X_test_s)
r2_score(y_test, y_pred_s)

**Conclusion:** For Linear Regression, the raw predictions are the same whether you scale before or after the train/test split. However, the order matters for other algorithms, including regularized models. Plus, as we'll see later, we can make scaling part of a `Pipeline`.
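
As a preview of the `Pipeline` idea, here is a minimal sketch using `make_pipeline` on synthetic data. The data and the names `pipe`, `X_tr`, `X_te`, and so on are hypothetical stand-ins for the lab's variables, and `alpha=0.1` is an arbitrary choice. The pipeline reproduces the manual fit-on-train, transform-on-test steps used throughout this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data standing in for the lab's features and target
X_toy = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
y_toy = X_toy[:, 0] + 0.1 * X_toy[:, 1] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics whenever it transforms new data in predict/score
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Same steps written out by hand, as in the cells above
scaler = StandardScaler()
manual = Lasso(alpha=0.1).fit(scaler.fit_transform(X_tr), y_tr)
manual_pred = manual.predict(scaler.transform(X_te))

print(np.allclose(pipe_pred, manual_pred))  # True
print(pipe.score(X_te, y_te))               # test-set R^2
```

Bundling the scaler with the model this way makes it harder to accidentally fit the scaler on test data.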



***

### Machine Learning Foundation (C) 2020 IBM Corporation


### **(Optional) Regularization details - Part 3**

This section of the **Regularization LAB** compares different values of **α (alpha) in Lasso**, the difference between **Lasso and Ridge**, and the importance of **data scaling** in regularized models. The models are also evaluated using the **R² score** on training and test sets.

---

### **1️⃣ Comparing Lasso with Different Values of α**
- **Lasso with α = 0.001** is tested, and the number of iterations is increased to ensure convergence.
- **train_test_split** is applied, making sure the data transformation is fitted only on the training set and then used to transform the test set.
- **Results:**
  - **Lasso R² = 0.868**, better than **Linear Regression R² = 0.855**.
  - **Lasso reduces the magnitude of the coefficients and eliminates more coefficients** than linear regression.
  - Lasso had **89 active coefficients**, while linear regression had **104 active coefficients**.

---

### **2️⃣ Comparing Lasso and Ridge**
- **Ridge with α = 0.001** is tested, using the same data transformation.
- **Key differences:**
  - **Ridge does not eliminate coefficients**; it only shrinks them.
  - **Lasso has more coefficients at zero**, which simplifies the model.
  - **Coefficient magnitude:** Ridge has larger coefficients than Lasso.
- **Ridge's R² was lower than Lasso's**, indicating that **Lasso generalized better in this case**.

📌 **Conclusion:** **Lasso provided stronger regularization**, eliminating irrelevant coefficients and improving generalization.

---

### **3️⃣ Importance of Data Scaling**
- We test whether scaling before or after splitting the data affects performance.
- **Results:**
  - For **plain linear regression**, the R² score **does not change**, since there is no regularization.
  - For **Lasso or Ridge, scaling is crucial**, since it affects how the coefficients are penalized.
- **Conclusion:** Always apply `fit_transform` **only on the training set**, and then use `transform` on the test set.

---

### **4️⃣ Conclusion of the Lab and Course 2**
- **Lasso was more effective than Ridge in this case**, eliminating coefficients and improving generalization.
- **Scaling is key for regularized models**, ensuring that the coefficients are penalized uniformly.
- **Next step:** **Course 3**, where these techniques will be applied to different models.

📌 **Regularization is fundamental for improving model generalization and preventing overfitting.** 🚀