# Regularization

## Feature Scaling
* **Some Machine Learning models** that rely on distance metrics (e.g. KNN) **require scaling** to perform well.
* Feature scaling improves the convergence of steepest descent algorithms, which do not possess the property of scale invariance.
* If features are on different scales, certain weights may update faster than others since the feature values play a role in weight updates.
* **This is a critical benefit of feature scaling for gradient descent.**
* There are some ML algorithms where scaling will not have an effect (e.g. CART-based methods).
* Scaling the features so that their respective ranges are **uniform is important in comparing measurements** that have different units.
* It allows us to directly compare model coefficients to each other.
* **Must always scale new unseen data before feeding to model.**
* Affects direct interpretability of feature coefficients, since they are no longer expressed in the original units.
* **Feature scaling benefits:**
  
> Can lead to great increases in performance.

> Absolutely necessary for some models.

> Virtually no "real" downside to scaling features.

#### Two main ways to scale features:
  
**1. Standardization:** Rescales data to have **a mean of 0 and standard deviation of 1**. It is also called **Z-score normalization**.

**2. Normalization:** Rescales all data values to be **between 0 and 1**. It is also called **min-max scaling**.
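The two methods above can be sketched with the standard library alone. This is a minimal illustration, not a library implementation: the function names and the sample data are made up for the example, and the population standard deviation (`pstdev`) is an assumption chosen to match the "standard deviation of 1" definition.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Min-max: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up sample data

z_scores = standardize(heights_cm)          # mean 0, standard deviation 1
unit_range = min_max_normalize(heights_cm)  # values between 0 and 1
```

Both helpers divide by zero if every value is identical, so a constant feature would need separate handling.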

* There are many more methods of scaling features, and **Scikit-Learn** provides easy-to-use classes that **fit** and **transform** feature data for scaling.
* A **.fit()** call simply calculates the necessary statistics (Xmin, Xmax, mean, standard deviation). 
* A **.transform()** call actually scales data and returns the new scaled version of data.
* We **only fit to training data**. The statistics used for scaling should be calculated from the training data only.
* Using the full data set would cause **data leakage**.

#### Feature scaling process:
  
> Perform train-test split

> Fit to training feature data

> Transform training feature data

> Transform test feature data
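The four steps above might look like this with Scikit-Learn's `StandardScaler`. It is a sketch on synthetic data: the feature matrix, the 70/30 split, and the random seeds are assumptions made only for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 100.0], scale=[1.0, 25.0], size=(200, 2))
y = rng.normal(size=200)

# 1. Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Fit to training feature data only (computes per-feature mean and std)
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Transform training feature data
X_train_scaled = scaler.transform(X_train)

# 4. Transform test feature data using the training statistics
X_test_scaled = scaler.transform(X_test)
```

Because the scaler never sees the test rows during `.fit()`, the scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected and is what avoids data leakage.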

#### Do we need to scale the label (y)?
* In general, it is neither necessary nor advised.
* Normalizing the output distribution alters the definition of the target.
* Can negatively impact stochastic gradient descent.
* So, we only scale the features (X).


## Cross Validation

## Regularization
Regularization seeks **to solve a few common model issues** by:
- Minimizing model complexity
- Penalizing the loss function
- Reducing model overfitting (add more bias to reduce model variance)
  
In general, we can think of regularization as a way to reduce model overfitting and variance.

**Three main types of Regularization:**

**1.** L1 Regularization: **LASSO Regression**

**2.** L2 Regularization: **Ridge Regression**

**3.** Combining L1 and L2: **Elastic Net**

These regularization methods have a **cost**:
* They introduce an additional hyperparameter that needs to be tuned.
* This hyperparameter is a multiplier on the penalty that decides the "strength" of the penalty.
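One common way to write a penalized loss is shown below. The notation is an assumption for illustration: $\beta$ denotes the model coefficients, $\lambda \ge 0$ is the penalty multiplier, $P(\beta)$ is the penalty term (which differs between the three methods), and squared error stands in for the unpenalized loss.

$$
J(\beta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \, P(\beta)
$$

With $\lambda = 0$ the penalty disappears and the original loss is recovered; larger values of $\lambda$ penalize large coefficients more strongly.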

## 1. LASSO Regression (L1 Regularization)
* L1 regularization adds a penalty equal to the sum of the **absolute values** of the coefficients.
* It limits the size of the coefficients.
* It can yield sparse models where **some coefficients can become zero**.
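The sparsity effect can be seen in a small sketch with Scikit-Learn's `Lasso`. The synthetic data (five features, of which only the first two influence the target) and the value `alpha=0.1` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): only the first two of five features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is the penalty strength; 0.1 is an arbitrary choice for the demo.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Coefficients of the three irrelevant features are driven to zero,
# while the two informative ones are shrunk but kept.
print(lasso.coef_)
```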

## 2. Ridge Regression (L2 Regularization)
* L2 regularization adds a penalty equal to the sum of the **squares** of the coefficients.
* **All coefficients are shrunk** toward zero (by the same factor only in the special case of orthonormal features).
* Does not necessarily eliminate coefficients.
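A quick way to observe this behavior is to compare `Ridge` with an unpenalized `LinearRegression` on the same data. The synthetic data and `alpha=10.0` are assumptions chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumed): only the first two of five features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha is the penalty strength

# Overall size of the coefficient vector under each model.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

# Count how many ridge coefficients are exactly zero.
n_zero = int(np.sum(ridge.coef_ == 0.0))

print(ols_norm, ridge_norm, n_zero)
```

The ridge coefficient vector is smaller than the unpenalized one, yet none of its entries is exactly zero, in contrast to the sparse models LASSO can produce.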

## 3. Elastic Net
* Elastic Net **combines L1 and L2 with the addition of an alpha parameter** deciding the ratio between them.
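A minimal sketch with Scikit-Learn's `ElasticNet` follows. Note a naming difference: in Scikit-Learn the ratio between L1 and L2 is set by `l1_ratio`, while `alpha` is the overall penalty strength. The synthetic data and both hyperparameter values are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (assumed): only the first two of five features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1.0 would be pure L1 (LASSO); values near 0 approach pure L2.
# 0.5 weights the two penalties equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)

print(enet.coef_)
```

In practice both `alpha` and `l1_ratio` would be tuned, for example with cross-validation, rather than fixed by hand as they are here.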