# Feature Engineering

The variables in a dataset can have a variety of problems. Feature engineering is the process of transforming the data before sending it to an ML algorithm, with tasks such as filling in missing values, encoding categorical values, or extracting information from dates.

# Problems Found in Data

### Missing Data

Missing data is the absence of values for certain observations within a variable. Most machine learning implementations cannot handle missing values directly, so missing data has to be dealt with before training.

There are a variety of reasons why data can be missing:
- Data is lost
- Data is not stored properly
- The value is undefined or can't exist (division where denominator is 0)
- Survey data where user didn't answer a question

When we have missing data, we need to either remove the affected observations or replace the missing values with an estimate or default value (imputation).
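The two options can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the `ages` column is made up, and imputing with the mean is just one possible choice of replacement value.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
ages = [25, None, 31, 40, None, 28]

# Option 1: remove the observations that have a missing value.
ages_dropped = [a for a in ages if a is not None]

# Option 2: impute, here with the mean of the observed values.
fill_value = mean(ages_dropped)
ages_imputed = [fill_value if a is None else a for a in ages]

print(ages_dropped)   # [25, 31, 40, 28]
print(ages_imputed)   # [25, 31, 31, 40, 31, 28]
```

Dropping keeps only real values but shrinks the dataset, while imputing keeps every observation at the cost of introducing estimated values.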

### Labels in Categorical Variables

When the values of a categorical variable are strings rather than numbers, we have to transform them so that we can use them in our models.

The unique set of values a categorical variable can take on are called labels or categories. The cardinality describes the number of labels a variable has.

There are three main problems you find with categorical variables:
- High cardinality - the variable has a large number of labels
- Rare labels - some categories appear in very few observations
- String data type - the categories aren't numeric

Tree-based algorithms tend to overfit when a variable has high cardinality.

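Rare labels cause problems because they appear in so few observations that they may end up only in the training set or only in the test set. A label seen only in the training set can lead to overfitting, and a label seen only in the test set is one the model has never learned how to handle.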
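Cardinality and rare labels are both easy to inspect by counting labels. The sketch below uses a made-up `colors` column and a hypothetical 15% frequency threshold; grouping infrequent labels under a single `"Rare"` label is one common way to handle them, not the only one.

```python
from collections import Counter

# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "blue",
          "red", "teal", "red", "blue", "red"]

counts = Counter(colors)
cardinality = len(counts)  # number of distinct labels

# Treat labels below an assumed frequency threshold as rare.
min_share = 0.15
n = len(colors)
rare_labels = {label for label, c in counts.items() if c / n < min_share}

# Replace every rare label with one shared "Rare" category.
grouped = ["Rare" if value in rare_labels else value for value in colors]

print(cardinality)          # 4
print(sorted(rare_labels))  # ['green', 'teal']
```

After grouping, the variable has three labels instead of four, and each remaining label appears often enough to show up in both the training and test sets.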

Most categorical data, such as sex, year, or color, is captured as strings or as numbers (e.g. year) that don't have a specific numerical meaning. These variables need to be *encoded*.
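As an illustration of what encoding means, the sketch below applies one-hot encoding, which is one of several possible encoding schemes and is chosen here only as an example. The `colors` column is made up, and each label becomes its own 0/1 indicator column.

```python
# Hypothetical categorical column stored as strings.
colors = ["red", "green", "blue", "green"]

# Fix a column order so that every row is encoded consistently.
categories = sorted(set(colors))

# One indicator per category: 1 if the row has that label, else 0.
one_hot = [[1 if value == cat else 0 for cat in categories]
           for value in colors]

print(categories)  # ['blue', 'green', 'red']
print(one_hot)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

One-hot encoding adds one column per label, which is why high cardinality and rare labels are usually dealt with before encoding.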

### Data Distribution

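Linear models make assumptions about how the data is distributed. Linear regression, for example, assumes that the residuals are normally distributed, and linear models generally tend to perform better when the numerical variables are roughly Gaussian. If the numerical variables in the data are skewed or otherwise far from normal, we may have to apply a transformation.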

Other models, like Support Vector Machines (SVMs) and Neural Networks, do not make any assumptions about the distribution of the variables. However, a better spread of values over a larger range tends to improve the predictive performance of these algorithms.
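A logarithm is one commonly used transformation for right-skewed variables. The sketch below is illustrative only: the `incomes` values are made up, and `skewness` is a small helper written for this example (the third standardized moment), not a library function.

```python
import math
from statistics import mean

def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a right tail."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Hypothetical right-skewed variable: most values small, a few very large.
incomes = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p(x) = log(1 + x), which is also defined for x = 0.
log_incomes = [math.log1p(v) for v in incomes]

print(round(skewness(incomes), 2))
print(round(skewness(log_incomes), 2))
```

The transformed variable is still right-skewed but noticeably less so than the original, because the logarithm compresses the large values much more than the small ones.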


References:
- [Feature Engineering-How to Transform Data to Better Fit The Gaussian Distribution-Data Science](https://www.youtube.com/watch?v=U_wKdCBC-w0&ab_channel=KrishNaik)

### Outliers

Outliers are unusual or unexpected values: observations that are extremely high or extremely low compared to the other values of the same variable.

Outliers may affect certain machine learning models. For example, in a linear regression, outliers can easily change the slope of the regression line, especially in smaller datasets. The AdaBoost algorithm is also extremely sensitive to outliers.
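One simple way to flag candidate outliers is the interquartile range (IQR) rule of thumb, which marks values more than 1.5 × IQR outside the first or third quartile. This is an assumed detection rule chosen for illustration, and the `values` list is made up.

```python
from statistics import quantiles

# Hypothetical variable with one suspiciously large value.
values = [10, 12, 11, 13, 12, 11, 14, 95]

# quantiles(..., n=4) returns the three quartile cut points.
q1, _median, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Conventional 1.5 * IQR fences around the middle half of the data.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]
```

Whether a flagged value should be removed, capped, or kept depends on whether it is a data error or a genuine but rare observation.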



### Feature Magnitude

Many machine learning algorithms, both supervised and unsupervised, are sensitive to the scale of the variables.

Sensitive to feature scale:
- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines
- K-Nearest Neighbors (KNN)
- K-Means Clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)

Insensitive to feature scale (tree-based models):
- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees

Many of the training algorithms for linear models and neural networks can converge faster when the variables have a similar scale. So, we can use ***feature scaling*** to make all the variables have a similar scale.

Distance-based algorithms such as KNN and K-Means Clustering are also sensitive to scale, because the variables with the largest magnitudes dominate the distance calculation.
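The effect of scale on distances can be seen with a small sketch. It uses standardization (subtract the mean, divide by the standard deviation), which is one common scaling method chosen here as an example; the `ages` and `incomes` columns and the `standardize` helper are made up for illustration.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical variables on very different scales.
ages = [25, 35, 45, 55]
incomes = [30_000, 50_000, 70_000, 90_000]

# Euclidean distance between the first two people, before scaling:
# the income difference swamps the age difference.
raw_distance = math.dist((ages[0], incomes[0]), (ages[1], incomes[1]))

ages_scaled = standardize(ages)
incomes_scaled = standardize(incomes)

# After scaling, both variables contribute comparably to the distance.
scaled_distance = math.dist((ages_scaled[0], incomes_scaled[0]),
                            (ages_scaled[1], incomes_scaled[1]))

print(round(raw_distance, 1))
print(round(scaled_distance, 3))
```

Before scaling, the distance is almost exactly the income difference of 20,000, so age has practically no influence on which points count as neighbors.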