# Feature Selection

Feature selection is the process of identifying the most predictive variables in a dataset and building our model using only those.

We start with all of the variables in our dataset, then eliminate the features that add little or nothing to the model's predictive performance.


# Why do we select features?
- Simple models are easier to interpret - The people who act on the results need to understand the inputs
- Shorter training times
- Enhanced generalization by reducing overfitting
- Easier for devs to implement and deploy the model into production 
- Reduced risk of data errors during model use
- Less data redundancy - redundant variables add inputs without adding information

# Why do we want fewer features?
- Less data to provide as input - Making the calling system collect and send more input variables results in more work. 
- Smaller JSON payloads sent to the model - More input variables result in larger payloads and possibly slower performance.
- Less error handling code - need error handling per variable / input, e.g. how do we handle previously unseen data?
- Less information to log
- Less feature engineering code

# Variable Redundancy

Redundant variables carry no information, or no information beyond what other variables already provide. We want to eliminate them from our model so that we don't need to provide them as inputs.

Types of redundant variables:
- Constant Variables - variables that are constant for all observations
- Quasi-constant variables - almost all observations (e.g. >99%) have the same value
- Duplication - Two variables with different names but identical values
- Correlation - Two variables that are highly correlated provide the same information about the target we want to predict
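
The four redundancy types above can be checked with a few lines of pandas. The following is a minimal sketch, not a definitive implementation: the helper name `find_redundant_columns`, the column names, and the 0.9 correlation cutoff are all assumptions made for illustration, and the 99% quasi-constant cutoff is taken from the list above.

```python
import numpy as np
import pandas as pd


def find_redundant_columns(df, quasi_threshold=0.99, corr_threshold=0.9):
    """Return a dict mapping each redundancy type to the columns it flags."""
    flagged = {"constant": [], "quasi_constant": [], "duplicate": [], "correlated": []}

    for col in df.columns:
        # Share of rows taken by the most frequent value in this column.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if df[col].nunique(dropna=False) <= 1:
            flagged["constant"].append(col)
        elif top_share > quasi_threshold:
            flagged["quasi_constant"].append(col)

    # Duplicates: columns whose values exactly match an earlier column.
    flagged["duplicate"] = df.columns[df.T.duplicated().to_numpy()].tolist()

    # Correlation: only check numeric columns that were not flagged above.
    already = set(flagged["constant"] + flagged["quasi_constant"] + flagged["duplicate"])
    corr = df.drop(columns=list(already)).select_dtypes(include="number").corr().abs()
    for i, col in enumerate(corr.columns):
        # Compare against earlier columns that are still being kept.
        kept = [c for c in corr.columns[:i] if c not in flagged["correlated"]]
        if kept and (corr.loc[col, kept] > corr_threshold).any():
            flagged["correlated"].append(col)
    return flagged


# Hypothetical toy data with one example of each redundancy type.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 80, size=n)
df = pd.DataFrame({
    "age": age,
    "age_copy": age,                                   # duplicate of "age"
    "age_months": age * 12,                            # perfectly correlated with "age"
    "country": np.ones(n, dtype=int),                  # constant
    "is_vip": np.r_[1, np.zeros(n - 1, dtype=int)],    # 99.5% of rows are 0
    "income": rng.normal(50, 15, size=n),              # independent of the rest
})

result = find_redundant_columns(df)
print(result)
```

Transposing the frame to find duplicates is simple but memory-hungry on wide datasets, so this approach is best suited to a first pass on moderately sized data.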

# Feature Selection Methods

Algorithms for selecting features:
- Embedded Methods
- Wrapper Methods
- Filter Methods

### Filter Methods
- Statistical tests like ANOVA and chi-squared
- Independent of the algorithm we ultimately build
- Rank or filter features based on their own statistical characteristics (e.g. variance, relationship with the target)

Pros:
- Quick feature removal - If you have lots of features, can eliminate a big chunk easily
- Model agnostic
- Fast computation

Cons:
- Don't capture redundancy - Looking at one feature at a time 
- Don't capture interaction - Looking at one feature at a time 
- Often lower model performance than the other methods
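
As a rough illustration of a filter method, the sketch below scores each feature on its own with the ANOVA F-test and keeps the highest-scoring ones, using scikit-learn's `SelectKBest` and `f_classif`. The synthetic dataset and the choice of `k=4` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only some of which carry signal.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=2, random_state=0
)

# Score every feature independently with the ANOVA F-test and keep the top 4.
# No model is trained here, which is why the method is model agnostic.
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)

for idx, (score, keep) in enumerate(zip(selector.scores_, selector.get_support())):
    print(f"feature {idx}: F={score:8.2f} {'kept' if keep else 'dropped'}")

X_reduced = selector.transform(X)  # keeps only the 4 selected columns
```

Because each feature is scored in isolation, two redundant copies of the same signal can both land in the top `k`. For non-negative features such as counts, `chi2` from the same module can be passed as `score_func` instead.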

### Wrapper Methods
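- Take the ML algorithm into account
- Evaluate subsets of features, not one feature at a time
- Usually rely on "greedy" search strategies (e.g. forward selection, backward elimination), which add or remove one feature per step, because evaluating all possible feature combinations is rarely feasible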

Pros:
- Considers feature interactions
- Usually the best performance of the three approaches
- Best feature subset for a given algorithm

Cons:
- Not model agnostic
- Computation is expensive - have to build an ML model for each feature subset evaluated
- Often impractical - the number of possible subsets grows exponentially with the number of features
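
One way to sketch a wrapper method is forward selection with scikit-learn's `SequentialFeatureSelector`, which adds one feature per step based on the cross-validated score of a chosen estimator. The dataset, the logistic regression estimator, and the target of 3 features are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Forward selection: start with no features, then repeatedly add the single
# feature that most improves the cross-validated score of the estimator.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)
print("selected feature indices:", selected)

# For comparison: an exhaustive search would evaluate every non-empty subset.
n_subsets = 2 ** X.shape[1] - 1
print("subsets an exhaustive search would evaluate:", n_subsets)
```

Even with only 8 features, an exhaustive search would have 255 subsets to evaluate, while this forward search scores 8 + 7 + 6 = 21 candidate subsets (each with 5-fold cross-validation). The selected subset is tied to the estimator passed in, so a different model may lead to a different choice.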

### Embedded Methods
- Feature selection during training of ML algorithm
- Example: lasso regression
- Example: feature importance we derive when fitting tree-based algorithms

Pros:
- Consider feature interaction
- Good model performance
- Usually better performance than Filter Methods
- Faster than Wrapper Methods

Cons:
- Not model agnostic - e.g. features selected in Random Forest not necessarily best to use in a linear model
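
The two embedded examples mentioned above can be sketched side by side: lasso regression drives some coefficients to exactly zero during training, and a random forest exposes feature importances after fitting. The synthetic data, `alpha=1.0`, and the forest settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # lasso penalties are scale sensitive

# Lasso: the L1 penalty shrinks some coefficients to exactly zero,
# so the features with non-zero coefficients are the selected ones.
lasso = Lasso(alpha=1.0).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_)
print("lasso keeps features:", lasso_selected)

# Random forest: importances come for free once the trees are fitted.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
forest_ranking = np.argsort(forest.feature_importances_)[::-1]
print("forest ranking (most to least important):", forest_ranking)
```

Comparing `lasso_selected` with the top of `forest_ranking` on real data is a quick way to see whether the two models agree on which features matter.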