## Side notes 
_(code snippets, summaries, resources, etc.)_
- 

# Feature Selection

## Summary of topics covered

![summary of feature selection](feature_selection_images/summary_feature_selection.png)

## Why Feature Selection?

1. Knowledge Discovery
    - Interpretability and insight
2. Curse of Dimensionality
    - The number of samples required grows exponentially with the number of features $n$, on the order of $2^{n}$
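
A rough illustration of that growth, assuming every feature is discretized into a fixed number of values (2 values per feature gives the $2^{n}$ above). The function name `cells_to_cover` is made up for this sketch; it only counts how many distinct feature combinations exist, which is a lower bound on the samples needed to see each combination once.

```python
def cells_to_cover(n_features, values_per_feature=2):
    """Number of distinct feature-value combinations (hypothetical helper).

    Assumes each feature takes `values_per_feature` discrete values, so
    seeing every combination at least once needs this many samples.
    """
    return values_per_feature ** n_features


for n in (1, 5, 10, 20, 30):
    print(f"{n:>2} binary features -> {cells_to_cover(n):,} combinations")
```

Adding a single binary feature doubles the count, which is why dropping even a few features can shrink the data requirement considerably.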

## Feature Selection Algorithms

__Quiz: How hard is the problem?__

![feature selection algorithms quiz](feature_selection_images/fs_algorithms_quiz.png)

- Use combinatorics to determine the answer
    - Without further knowledge we must try every subset of the $n$ features, and the number of subsets is exponential in $n$
    - If we know nothing about the target subset size $m$, there are $2^{n}$ subsets to consider
    - If we require $m$ to be half of $n$ or less, for example, then there are ${n \choose m}$ subsets of size $m$ (which is still exponential in $n$ when $m$ grows in proportion to $n$)
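
The counts above can be checked on a small example. This is a minimal sketch using only the standard library; the feature names and the helper `all_subsets` are placeholders chosen for illustration.

```python
from itertools import combinations
from math import comb


def all_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


features = ["a", "b", "c", "d"]  # placeholder feature names
n = len(features)
subsets = list(all_subsets(features))

# Unknown subset size m: every one of the 2^n subsets is a candidate.
print(len(subsets), 2 ** n)

# Fixed subset size m: n-choose-m candidates.
m = n // 2
print(len(list(combinations(features, m))), comb(n, m))

# The n-choose-m counts over all sizes add up to 2^n.
print(sum(comb(n, k) for k in range(n + 1)))
```

Brute-force enumeration like this is only feasible for very small $n$, which is what motivates the filtering and wrapping approaches.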

### Types of feature selection algorithms

1. Filtering
    - Search criterion is independent of the learning algorithm
2. Wrapping
    - Search criterion is dependent on the learning algorithm
    
![types of feature selection algorithms](feature_selection_images/types_of_fs_algorithms.png)

### Performance of each feature selection algorithm

![feature selection algorithm tradeoff](feature_selection_images/fs_algorithms_tradeoff.png)

__Filtering:__
- The price of this type of algorithm's speed is that features are considered in isolation
    - i.e. it does not take into account relationships between features that might make a particular feature more relevant
- Possible search criteria include:
    - information gain as seen in DTs
    - variance, entropy, the gini index
    - "useful" features (e.g. as determined by a neural net assigning weights to different features)
    - independence / non-redundancy of features
    - other statistical measures of relevance
- Filtering could take into account labels for the samples
    - e.g. entropy would not consider labels, whereas information gain would
- A Decision Tree's information gain can be used as the criterion with which to filter features, i.e. finding the features that provide the most information about the class label
    - A DT learner, by construction, provides a subset of features (the features that it decided to split on, which are the ones most important for predicting the right labels).
    - The union of these features could then be passed to another type of learner.
    - In that case, we would use the inductive bias of a DT to choose the features that are most important for predicting the label, and the inductive bias of another learner to do the actual learning.
    - E.g. KNN's problem with dimensionality could be offset by a DT's strength in that area
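
A minimal sketch of filtering by information gain, written in plain Python for discrete features. The function names, the toy dataset, and the choice of keeping the top `k` features are all assumptions made for illustration, not something taken from the lecture. The toy labels are the XOR of `x1` and `x2`, which also shows the cost of scoring features in isolation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature_column, labels):
    """Reduction in label entropy after splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_column):
        subset = [y for x, y in zip(feature_column, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def filter_by_information_gain(columns, labels, k):
    """Score each feature on its own and keep the k highest scoring ones."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (made up): the label is x1 XOR x2; "noisy" matches the label on 3 of 4 rows.
columns = {
    "x1": [0, 0, 1, 1],
    "x2": [0, 1, 0, 1],
    "noisy": [0, 1, 1, 1],
}
labels = [0, 1, 1, 0]

selected, scores = filter_by_information_gain(columns, labels, k=1)
print(selected)  # ['noisy']
print(scores)
print(entropy(columns["x1"]))  # entropy looks only at the feature, never at the labels
```

Taken one at a time, `x1` and `x2` each have zero information gain, so the filter prefers `noisy` even though `x1` and `x2` together determine the label exactly. A wrapper that evaluates features jointly would not have this blind spot, at the price of running the learner many times.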

__Wrapping:__
- Use the specific learner to assess the relevance of features, with a search strategy that avoids trying all possible combinations of features (exponential time cost)
- Possible strategies for choosing which features to run through the learner:
    - Hill climbing (others look a lot like this one)
    - Randomized optimization
    - Forward sequential selection (polynomial performance; a kind of hill climbing)
        1. Choose the best single feature by running each feature individually through the learner
        2. Determine the feature that performs best in combination with the previously selected feature(s)
        3. Repeat step 2; once adding a feature no longer decreases the error significantly, stop.
    - Backward elimination (another kind of hill climbing)
        - Similar to forward selection, except that it starts with all features and eliminates them one by one until the next elimination would significantly decrease the learner's performance.
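
The two greedy searches above can be sketched as follows. Everything here is illustrative: `error_of` stands in for "train the learner on this feature subset and measure its validation error", `toy_error` is a made-up error table rather than a real learner, and the thresholds `min_improvement` and `max_degradation` are assumed ways of expressing "significantly".

```python
def forward_selection(features, error_of, min_improvement=0.01):
    """Greedily add the feature that lowers the error most; stop when gains are small."""
    selected = []
    remaining = list(features)
    best_error = error_of(selected)
    while remaining:
        candidate_errors = {f: error_of(selected + [f]) for f in remaining}
        best_feature = min(candidate_errors, key=candidate_errors.get)
        if best_error - candidate_errors[best_feature] < min_improvement:
            break  # adding another feature no longer helps significantly
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = candidate_errors[best_feature]
    return selected, best_error


def backward_elimination(features, error_of, max_degradation=0.01):
    """Greedily drop the feature whose removal hurts least; stop when it hurts too much."""
    selected = list(features)
    best_error = error_of(selected)
    while len(selected) > 1:
        candidate_errors = {
            f: error_of([g for g in selected if g != f]) for f in selected
        }
        to_drop = min(candidate_errors, key=candidate_errors.get)
        if candidate_errors[to_drop] - best_error > max_degradation:
            break  # the next elimination would significantly hurt performance
        selected.remove(to_drop)
        best_error = candidate_errors[to_drop]
    return selected, best_error


def toy_error(subset):
    """Hypothetical stand-in for a learner's validation error on a feature subset."""
    gains = {"a": 0.30, "b": 0.15, "c": 0.001, "d": 0.0}
    return 0.5 - sum(gains[f] for f in subset)


features = ["a", "b", "c", "d"]
forward_result, forward_error = forward_selection(features, toy_error)
backward_result, backward_error = backward_elimination(features, toy_error)
print(forward_result)   # ['a', 'b']
print(backward_result)  # ['a', 'b']
```

With $n$ features, forward selection runs the learner at most $n + (n-1) + \dots + 1 = n(n+1)/2$ times, which is where the polynomial cost comes from. In this toy table the features contribute independently, so both searches agree; with interacting features they can end up at different subsets, as is typical for hill climbing.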

### Quiz: Minimum features required

![Minimum features required quiz](feature_selection_images/min_features_quiz.png)

## Relevance vs Usefulness

- Usefulness measures effect of a variable on a particular learning algorithm
    - This is ultimately what we care about
- Relevance is about information
    - A variable could be "strongly" or "weakly" relevant (definitions below)
    - A feature is irrelevant if it does not add any information to the classifier
    - Relevance can be seen as a special case of usefulness: it measures the usefulness of a variable with respect to the Bayes Optimal Classifier (see the note on this below).

For more explanation on the slides below, see [these](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345407/modules/542278935775460/lessons/5415378701/concepts/6010086150923#) [videos](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345407/modules/542278935775460/lessons/5415378701/concepts/6010086160923)

![relevance](feature_selection_images/relevance.png)

![relevance vs usefulness](feature_selection_images/relevance_vs_usefulness.png)


Note about the Bayes Optimal Classifier (B.O.C.):
- It is a theoretical concept: the classifier that would result if we were able to test every possible classifier (an infinite number).
- It is truly a measure of the information in the variables
- Any other algorithm has an inductive bias
