# Preprocessing

## Feature Scaling
**Scaling**: the process of changing the *magnitude* (scale) of numerical features so they are comparable across dimensions.

For *distance-based* or *gradient-based* models (e.g., linear regression, logistic regression, neural networks, k-nearest neighbors), unscaled features can:
- Distort distance calculations
- Slow or destabilize optimization
- Cause uneven regularization penalties
- Reduce model performance


### Importance of Feature Scaling
- Linear and logistic regression models compute predictions as weighted linear combinations of features. When features have different scales (say $[0, 100]$ and $[0, 10]$), the larger-magnitude feature can dominate the sum: for comparable weights, $100w_1 \gg 10w_2$. The model can adjust its weights to compensate, but the difference in scale still worsens the numerical conditioning of the optimization problem.
- Gradient-based optimization is sensitive to feature scale. Uneven feature magnitudes elongate the contours of the loss surface, leading to inefficient gradient updates and slower convergence. Scaling makes the curvature of the loss function more isotropic.
- Distance-based models rely directly on distance metrics such as Euclidean distance. If one feature has a larger scale than others, it disproportionately influences distance calculations, reducing the contribution of smaller-scale features.
- Regularization penalizes coefficient magnitudes rather than feature magnitudes. Without scaling, features with larger ranges need smaller coefficients to have the same effect, while smaller-scale features need larger coefficients. The penalty therefore falls unevenly, and most heavily on the coefficients of small-scale features.
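
The distance distortion described above can be illustrated with a minimal sketch. The data (income and age) and the helper names `standardize` and `nearest_to_first` are hypothetical, chosen only for illustration. It compares which point is nearest to a query point before and after standardizing each feature to zero mean and unit standard deviation.

```python
import math
import statistics

# Hypothetical data: (income in dollars, age in years).
points = [
    (50_000, 25),  # a: the query point
    (50_000, 65),  # b: same income as a, very different age
    (51_000, 25),  # c: same age as a, slightly different income
    (20_000, 40),
    (80_000, 50),
]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_to_first(rows):
    """Return the index (1 for b, 2 for c) of the point closest to rows[0]."""
    return min((1, 2), key=lambda i: math.dist(rows[0], rows[i]))

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(standardize(points))
print(raw_nearest, scaled_nearest)  # 1 2
```

On the raw data, the income column dominates the Euclidean distance, so `b` (40 years older) looks closer to `a` than `c` does. After standardization, both features contribute on a comparable scale and `c` becomes the nearest point.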

Generally speaking, many learning algorithms benefit from standardization of data. Tree-based models, by contrast, are unaffected by unscaled features because they do not compute products, distances, or gradients. Instead, they split on threshold conditions. Scaling preserves the ordering of values, so if a feature is scaled from $[0, 500]$ to $[0, 1]$, a split at $F_1 < 20$ is equivalent to a split at $F_1 < 0.04$.
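
The split equivalence can be checked with a small sketch. The feature values are made up for illustration, and min-max scaling is assumed as the scaling method. The same samples fall on the left side of the split before and after scaling.

```python
# Hypothetical feature values in the range [0, 500].
values = [3, 12, 19, 21, 250, 500]
lo, hi = 0, 500

# Min-max scale the feature and the split threshold in the same way.
scaled = [(v - lo) / (hi - lo) for v in values]
threshold_raw = 20
threshold_scaled = (threshold_raw - lo) / (hi - lo)  # 0.04

# Which samples go to the left branch of the split?
left_raw = [v < threshold_raw for v in values]
left_scaled = [s < threshold_scaled for s in scaled]

print(left_raw == left_scaled)  # True: both splits partition the samples identically
```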

### Transformers
**Preprocessing Transformers**: a transformer is any object that learns parameters from data (`fit`) and then modifies data (`transform`). There are three types of preprocessing transformers for scaling data:
- **Scalers**: operate feature-wise and adjust the mean, variance, range, and spread of data. They preserve the order, linear relationships, and overall shape of distributions. Most scalers perform *linear transformations*.
- **Transformers**: operate feature-wise and apply *non-linear transformations*, meaning they modify the shape of the distribution. They are used when data is highly skewed or when variance depends on magnitude.
- **Normalizers**: operate sample-wise and rescale each data point to have a unit norm.
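
As a rough illustration of the three types, the sketch below applies one representative of each from `sklearn.preprocessing`. The choice of `StandardScaler`, `PowerTransformer`, and `Normalizer` is an assumption made for this example, as is the toy data. Other classes in each family follow the same `fit`/`transform` pattern.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, PowerTransformer, StandardScaler

# Hypothetical data: the first column is heavily right-skewed.
X = np.array([
    [1.0, 200.0],
    [2.0, 300.0],
    [3.0, 400.0],
    [4.0, 500.0],
    [100.0, 600.0],
])

# Scaler: learns each column's mean and standard deviation in fit,
# then applies a linear rescaling in transform.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Transformer: learns a per-column power transform (Yeo-Johnson by default),
# which changes the shape of the distribution.
X_power = PowerTransformer().fit_transform(X)

# Normalizer: rescales each row (sample) to unit L2 norm.
X_unit = Normalizer().fit_transform(X)

print(X_scaled.mean(axis=0))           # per-column means, approximately 0
print(np.linalg.norm(X_unit, axis=1))  # per-row norms, approximately 1
```

Note the axis difference: the scaler and the power transform work column by column, while the normalizer works row by row.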

## Feature Encoding
**Encoding**: the process of converting categorical features into numerical representations.

### Ordinal Encoding
Ordinal encoding is for categories with a meaningful order. `OrdinalEncoder` maps each category to an integer code that preserves this order. Use ordinal encoding only for truly ordinal data, because a model will read a higher code as a higher rank.

| Size   | Encoded |
| ------ | ------- |
| Small  | 0       |
| Medium | 1       |
| Large  | 2       |
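
The table above can be reproduced with a short sketch using scikit-learn's `OrdinalEncoder`. The sample values in `sizes` are hypothetical. The order is passed explicitly through the `categories` parameter, because the default behavior sorts string categories alphabetically.

```python
from sklearn.preprocessing import OrdinalEncoder

# Pass the order explicitly; by default categories are sorted alphabetically
# ("Large" < "Medium" < "Small"), which would not match the intended ranking.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])

sizes = [["Medium"], ["Small"], ["Large"], ["Small"]]  # one column, four samples
encoded = encoder.fit_transform(sizes)

print(encoded.ravel().tolist())  # [1.0, 0.0, 2.0, 0.0]
```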

### Nominal Encoding
Nominal encoding is for categories that do not have a meaningful order. `OneHotEncoder` creates a binary column for each category. This effectively produces a "new feature" for each category.

| Color | Red | Blue | Green |
| ----- | --- | ---- | ----- |
| Red   | 1   | 0    | 0     |
| Blue  | 0   | 1    | 0     |
| Green | 0   | 0    | 1     |
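
A minimal sketch of the same table with scikit-learn's `OneHotEncoder` follows. The column order is fixed through `categories` so that it matches the table (Red, Blue, Green). The encoder returns a sparse matrix by default, so the result is converted with `.toarray()` for display.

```python
from sklearn.preprocessing import OneHotEncoder

# Fix the column order to match the table: Red, Blue, Green.
encoder = OneHotEncoder(categories=[["Red", "Blue", "Green"]])

colors = [["Red"], ["Blue"], ["Green"]]  # one column, three samples
one_hot = encoder.fit_transform(colors).toarray()  # sparse matrix -> dense array

print(one_hot.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```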