# Regularization

Regularization addresses a few common model issues by:
* Minimizing model complexity
* Penalizing the loss function
* Reducing model overfitting (adding bias to reduce model variance)

In general, we can think of regularization as a way to reduce model overfitting and variance.
* Requires accepting some additional bias
* Requires a search for the optimal penalty hyperparameter

Three main types of regularization:

* L1 Regularization: Lasso Regression
* L2 Regularization: Ridge Regression
* Combining L1 and L2: Elastic Net Regression

**L1 Regularization** adds a penalty proportional to the sum of the absolute values of the coefficients.
* Limits the size of the coefficients.
* Can yield sparse models where some coefficients can become zero.

![image.png](attachment:image.png)

**L2 Regularization** adds a penalty proportional to the sum of the squared coefficients.

* Shrinks all coefficients toward zero (by exactly the same factor only in special cases, such as orthonormal features).
* Doesn't necessarily eliminate coefficients.

![image.png](attachment:image.png)
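The difference between the L1 and L2 bullets above can be seen on a small synthetic problem. The following is a minimal sketch, assuming scikit-learn's `Lasso` and `Ridge` estimators; the data, the true coefficients, and the `alpha` values are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features,
# but only the first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# In scikit-learn, alpha is the penalty strength; these values are arbitrary.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))

# Lasso can set coefficients exactly to zero; Ridge only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Exact zeros (Lasso):", n_zero_lasso)
print("Exact zeros (Ridge):", n_zero_ridge)
```

On this data, Lasso drops the uninformative features entirely, while Ridge keeps every coefficient at a small nonzero value.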

**Elastic net** combines L1 and L2 with the addition of an alpha parameter deciding the ratio between them:

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
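For reference, the three penalized objectives are commonly written as below. This is a standard textbook form, not a transcription of the figures above, and scaling constants vary between texts and libraries. The symbols are assumptions of this sketch: $\beta_j$ are the $p$ model coefficients, $\lambda \ge 0$ is the penalty strength, and $\alpha \in [0, 1]$ is the L1/L2 mixing ratio.

$$
\begin{aligned}
\text{L1 (Lasso):} \quad & \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{L2 (Ridge):} \quad & \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic Net:} \quad & \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
$$

With $\alpha = 1$ the Elastic Net penalty reduces to Lasso, and with $\alpha = 0$ it reduces to Ridge. Note that scikit-learn's `ElasticNet` names these differently: the overall strength is `alpha` and the mixing ratio is `l1_ratio`.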

# Feature Scaling

Feature scaling provides many benefits to our machine learning process!

Some machine learning models that rely on distance metrics (e.g. KNN) require scaling to perform well.
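To see why distance-based models are sensitive to scale, here is a small sketch in plain Python. The three samples and the feature ranges are hypothetical, chosen only to show how a large-valued feature can dominate the Euclidean distance.

```python
import math

# Hypothetical samples: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)  # very different age, similar income
c = (26, 58_000)  # similar age, different income

# Unscaled distances: income dominates because its values are much larger.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
print("raw:   ", round(raw_ab, 1), round(raw_ac, 1))

# Assumed feature ranges, used to min-max scale each feature to [0, 1].
AGE_RANGE = (18, 90)
INCOME_RANGE = (20_000, 120_000)

def min_max(point):
    """Scale (age, income) to [0, 1] using the assumed ranges above."""
    age, income = point
    return (
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
        (income - INCOME_RANGE[0]) / (INCOME_RANGE[1] - INCOME_RANGE[0]),
    )

scaled_ab = math.dist(min_max(a), min_max(b))
scaled_ac = math.dist(min_max(a), min_max(c))
print("scaled:", round(scaled_ab, 3), round(scaled_ac, 3))
```

Before scaling, `b` looks like the nearest neighbour of `a` even though their ages differ by 35 years. After scaling, `c` is the nearer one.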

* Feature scaling improves the convergence of steepest descent algorithms, which do not possess the property of scale invariance.

* If features are on different scales, certain weights may update faster than others, since the feature values $x_j$ play a role in the weight updates.

* This is the critical benefit of feature scaling: it helps gradient descent perform optimally.
* For some ML algorithms, scaling has no effect (e.g., CART-based methods).
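The convergence claim above can be checked with a rough experiment. This is a minimal sketch, assuming plain batch gradient descent on a mean-squared-error loss with NumPy; the helper `gd_steps`, the synthetic data, and the tolerance are all hypothetical choices, not part of any library.

```python
import numpy as np

def gd_steps(X, y, tol=1e-6, max_iter=100_000):
    """Run batch gradient descent on MSE; return steps until the gradient norm < tol."""
    n = len(y)
    # Use the largest safe constant step size: 1 / (largest Hessian eigenvalue).
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(X.shape[1])
    for step in range(max_iter):
        grad = X.T @ (X @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_iter

# Two features on different scales (standard deviations of 1 and 10).
rng = np.random.default_rng(0)
n = 500
X_raw = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 10.0, n)])
y = 2.0 * X_raw[:, 0] + 0.3 * X_raw[:, 1] + rng.normal(0.0, 0.1, n)

# Center both versions so that the only difference is the feature scale.
X_centered = X_raw - X_raw.mean(axis=0)
X_scaled = X_centered / X_raw.std(axis=0)
y_centered = y - y.mean()

steps_raw = gd_steps(X_centered, y_centered)
steps_scaled = gd_steps(X_scaled, y_centered)
print("steps without scaling:", steps_raw)
print("steps with scaling:   ", steps_scaled)
```

Even with the best constant step size for each case, the unscaled problem needs far more steps, because the step size is limited by the large-scale feature while the small-scale feature's weight crawls toward its optimum.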

* Scaling the features so that their respective ranges are uniform is important when comparing measurements that have different units.
* It allows us to directly compare model coefficients to each other.


**Feature scaling caveats:**

* Must always scale new unseen data before feeding it to the model
* Affects direct interpretability of feature coefficients
    * Easier to compare coefficients to one another, harder to relate them back to the original unscaled features.

**Feature scaling benefits:**

* Can lead to great increases in model performance
* Absolutely necessary for some models
* Virtually no "real" downside to scaling features.

**Two main ways to scale features:**

* Standardization
    * Rescales data to have a mean of 0 and a standard deviation of 1.
![image.png](attachment:image.png)
    
    
* Normalization
    * Rescales all data values to be between 0-1.
![image-2.png](attachment:image-2.png)
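Both rescalings can be computed by hand. The sketch below assumes the usual formulas, `(x - mean) / std` for standardization and `(x - min) / (max - min)` for normalization, and uses a made-up list of values.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

# Standardization: subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation
standardized = [(x - mean) / std for x in values]
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
x_min, x_max = min(values), max(values)
normalized = [(x - x_min) / (x_max - x_min) for x in values]
print(normalized)
```

The population standard deviation is used here because that is also what scikit-learn's `StandardScaler` computes.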

**There are many methods of scaling features, and Scikit-Learn provides easy-to-use classes that "fit" and "transform" feature data for scaling.**

* A `.fit()` method call simply calculates the necessary statistics (Xmin, Xmax, mean, std).
* A `.transform()` call actually scales the data and returns the new, scaled version of the data.
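As a small illustration of the two calls, the sketch below uses scikit-learn's `StandardScaler` and `MinMaxScaler` on a toy array (the data is invented). After `fit`, the learned statistics are available as attributes on the scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])  # toy data

std_scaler = StandardScaler()
std_scaler.fit(X)                # only computes statistics, X is unchanged
print(std_scaler.mean_)          # per-feature mean
print(std_scaler.scale_)         # per-feature standard deviation
X_std = std_scaler.transform(X)  # applies (x - mean) / std

mm_scaler = MinMaxScaler()
mm_scaler.fit(X)
print(mm_scaler.data_min_, mm_scaler.data_max_)  # per-feature min and max
X_mm = mm_scaler.transform(X)    # applies (x - min) / (max - min)
print(X_mm)
```

`fit_transform` combines both steps in one call, which is convenient for the training data.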


**Very important consideration for fit and transform:**

* We only fit to training data
* Calculating statistical information should only come from training data.
* Don't want to assume prior knowledge of the test set.

**Feature scaling process:**

* Perform train test split
* Fit to training feature data
* Transform training feature data
* Transform test feature data
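The four steps above can be sketched as follows, assuming scikit-learn's `train_test_split` and `StandardScaler`. The random data, the 30% test size, and the `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix and target.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.normal(size=100)

# 1. Perform the train test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Fit to the training feature data only.
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Transform the training feature data.
X_train_scaled = scaler.transform(X_train)

# 4. Transform the test feature data with the training statistics.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly
```

The test set means are not exactly zero because the statistics came from the training set. That is expected and is what keeps test information out of the fitting step.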

# Cross Validation

Cross-validation is a resampling technique for estimating how well a model will perform on unseen data. It is one of the most effective methods for validating a model.

**In the cross-validation process, we implement the following steps:**

* We separate part of the sample dataset as a validation set.

* We train the model using the rest of the dataset (the training set).

* With the validation set, we measure the performance of the model.

* We repeat this process k times, using different validation (test) data each time.

* We record the performance value from each run and then take their average.

* If the algorithm gives a satisfactory result on the test set, we can use this algorithm. Otherwise, we try cross-validation with another algorithm.
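The loop described above can be written out explicitly. This is a minimal sketch, assuming scikit-learn's `KFold` splitter and a `Ridge` model as a stand-in estimator; the synthetic data and `k = 5` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, then measure performance on the held-out fold.
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 for this fold

# Average the k performance values.
mean_score = float(np.mean(scores))
print("fold scores:", np.round(scores, 4))
print("mean score: ", round(mean_score, 4))
```

In practice, `cross_val_score` from `sklearn.model_selection` runs this same loop in a single call.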

**Cross validation process:**

* Remove a hold out test set
![image.png](attachment:image.png)
* Perform a classic train test split on the remaining data
![image-2.png](attachment:image-2.png)
* Train and tune on this data
![image-3.png](attachment:image-3.png)
* Or use K-Fold cross-validation on the remaining data
![image-4.png](attachment:image-4.png)
* Train and tune on this data to adjust hyperparameters
![image-5.png](attachment:image-5.png)
* After training and tuning, perform the final evaluation on the hold-out test set
![image-6.png](attachment:image-6.png)


![image-7.png](attachment:image-7.png)
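The hold-out plus K-Fold workflow above might look like this in code. This is a sketch under assumptions: `Ridge` is a stand-in model, the candidate `alpha` values are arbitrary, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1. Remove a hold-out test set; it is not touched during tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Tune the hyperparameter with 5-fold cross-validation on the remaining data.
candidate_alphas = [0.01, 1.0, 100.0]
cv_means = {}
for alpha in candidate_alphas:
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_rest, y_rest, cv=5)
    cv_means[alpha] = fold_scores.mean()
best_alpha = max(cv_means, key=cv_means.get)

# 3. Refit on all remaining data and evaluate once on the hold-out test set.
final_model = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)
print("best alpha:", best_alpha)
print("hold-out R^2:", round(test_score, 4))
```

The hold-out score is computed only once, after tuning is finished, so it remains an honest estimate of performance on unseen data.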


NOTE: Cross-validation is performed to verify whether the one-time score obtained from the test set is consistent.

# Grid Search

A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be estimated from data. The value of the hyperparameter has to be set before the learning process begins.

Grid search is a tuning technique that lets you choose the best hyperparameters for your problem from a set of candidate values that you provide.

Grid search is used to find the hyperparameters of a model that result in the most ‘accurate’ predictions.

**There are two types of grid search in the Scikit-Learn library:**

* Exhaustive Grid Search
* Randomized Parameter Optimization

In this course we will focus on Exhaustive Grid Search. However, when there are too many hyperparameter combinations to run with the time and resources you have, you can use Randomized Parameter Optimization instead.
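As a rough comparison of the two approaches, the sketch below runs scikit-learn's `GridSearchCV` and `RandomizedSearchCV` over the same candidate values. The `ElasticNet` model, the parameter values, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Candidate hyperparameter values: 4 x 3 = 12 combinations.
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.1, 0.5, 0.9],
}

# Exhaustive search: every combination is cross-validated.
grid = GridSearchCV(
    ElasticNet(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
grid.fit(X, y)

# Randomized search: only n_iter sampled combinations are cross-validated.
rand = RandomizedSearchCV(
    ElasticNet(), param_grid, n_iter=5, cv=5,
    scoring="neg_mean_squared_error", random_state=0,
)
rand.fit(X, y)

print("grid tried:  ", len(grid.cv_results_["params"]), "best:", grid.best_params_)
print("random tried:", len(rand.cv_results_["params"]), "best:", rand.best_params_)
```

The randomized search does less work but may miss the best combination. With `refit=True` (the default), both objects can be used directly for prediction with the best parameters found.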