#  Machine Learning - Feature Selection
Notes and examples from Udacity's Introduction to Machine Learning course.

## Feature Selection

[Wiki](https://en.wikipedia.org/wiki/Feature_selection) page on feature selection in machine learning 

[SciKit Learn](http://scikit-learn.org/stable/modules/feature_selection.html) page on feature selection

In machine learning, the choice of features is critical: which ones we decide to include or exclude strongly affects how well a model performs.  Feature selection can involve both creating and eliminating variables:

1. Creation of features using human intuition can allow for better insight and more powerful analysis.  However, caution must be exercised: done incorrectly, a new feature can introduce large errors or completely incorrect data into the problem.  Be wary of 100% accuracy as a possible warning sign of poor feature creation.
2. Elimination of features can also be done.  This may be necessary because the feature is noisy, it is causing overfitting, it is highly correlated with a feature that is already present, or there is simply so much data that training/testing is slow.
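One of the elimination criteria above, a feature that is highly correlated with another, can be checked numerically.  The following is a minimal sketch on synthetic data, not part of the course material: the three features, the `threshold` of 0.95, and the rule of dropping the second feature of each pair are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples = 200

# Three hypothetical features: f1 is a noisy rescaling of f0, f2 is independent.
f0 = rng.normal(size=n_samples)
f1 = 2.0 * f0 + rng.normal(scale=0.05, size=n_samples)
f2 = rng.normal(size=n_samples)
X = np.column_stack([f0, f1, f2])

# Pairwise Pearson correlation between the columns of X.
corr = np.corrcoef(X, rowvar=False)

threshold = 0.95  # assumed cutoff for "highly correlated"
redundant_pairs = [
    (i, j)
    for i in range(X.shape[1])
    for j in range(i + 1, X.shape[1])
    if abs(corr[i, j]) > threshold
]
print(redundant_pairs)

# Keep the first feature of each redundant pair and drop the second.
to_drop = sorted({j for _, j in redundant_pairs})
X_reduced = np.delete(X, to_drop, axis=1)
print(X_reduced.shape)
```

Which member of a correlated pair to keep is a judgment call; domain knowledge about which feature is cleaner or cheaper to collect is usually a better guide than column order.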

Regardless, it is important to know that:


$$ Features \neq Information $$

Features attempt to capture information, but how much information they actually provide depends on their quality.  Think quality over quantity!

A good mantra to live by is to use only as many features as needed.  The question then becomes, how do we best select the features to use on our models?

### Univariate Feature Selection
SciKit Learn provides many ways to automatically select features; many of them fall under the category of [univariate feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection).  As the name implies, this method scores each feature on its own, using a univariate statistical test of its relationship with the target, and keeps the highest-scoring features.  There are two very popular methods of univariate feature selection in SciKit Learn:

1. [SelectPercentile](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile): Selects the user-specified percentage of the highest-scoring features
2. [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest): Removes all but the 'K' highest-scoring features
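As a minimal sketch of how the two selectors are used, the example below applies both to the iris dataset that ships with scikit-learn.  The dataset, the `f_classif` scoring function, and the values `k=2` and `percentile=50` are assumptions picked for illustration and do not come from the course.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# 150 samples, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score.
k_best = SelectKBest(score_func=f_classif, k=2)
X_k = k_best.fit_transform(X, y)

# Keep the top 50% of features, ranked by the same score.
percentile = SelectPercentile(score_func=f_classif, percentile=50)
X_p = percentile.fit_transform(X, y)

# get_support(indices=True) reports which original columns survived.
print(X_k.shape, k_best.get_support(indices=True))
print(X_p.shape, percentile.get_support(indices=True))
```

With four features, `k=2` and `percentile=50` select the same columns; the two classes differ only in whether the cutoff is given as a count or as a fraction.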

### Bias and Variance
The number of features and the amount of information included in your model also raise the dilemma of bias versus variance.  High bias occurs when the algorithm oversimplifies the model due to a lack of information; it generally leads to a low $R^2$ value and a high error on the training set, and is usually caused by too few features.  High variance, on the other hand, occurs when the model pays too much attention to the features of the training data, leading to a much higher error on the test set than on the training set.  A high-variance model usually does not generalize well because it is too specific to the data it was trained on.

The goal when creating a model is to find the 'sweet spot' between bias and variance by optimizing the number of features used.  This can be accomplished through regularization.
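The trade-off can be made concrete with a small experiment.  The sketch below is an assumption-laden illustration on synthetic data (the data generator, the sample sizes, and the 60 pure-noise features are all invented here): the target depends on three features, and an ordinary linear regression is fit with one feature, with the three relevant ones, and with all features including the noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_train, n_test, n_noise = 80, 1000, 60


def make_data(n):
    """Return n samples: 3 informative features followed by n_noise noise features."""
    X = rng.normal(size=(n, 3 + n_noise))
    y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(size=n)
    return X, y


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Map "number of features used" -> (training R^2, test R^2).
r2 = {}
for n_features in (1, 3, 3 + n_noise):
    model = LinearRegression().fit(X_train[:, :n_features], y_train)
    r2[n_features] = (
        model.score(X_train[:, :n_features], y_train),
        model.score(X_test[:, :n_features], y_test),
    )
    print(n_features, r2[n_features])
```

The one-feature model should show a low $R^2$ even on the training set (high bias), while the model with every feature should fit the training set very well but score much worse on the test set (high variance).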

### Regularization
Regularization is a method that automatically penalizes extra features as they are added, and can thus set the coefficients of non-critical variables to zero.  In regression, one of the most popular methods of regularization is the [LASSO](http://scikit-learn.org/stable/modules/linear_model.html#lasso) method.  Each feature is assigned a coefficient; when the features are on comparable scales, the larger the magnitude of the coefficient, the more the feature contributes to the model.  LASSO penalizes the sum of the absolute values of the coefficients (equivalently, it constrains that sum to stay below a set parameter), which drives some coefficients to exactly zero and effectively removes those features from the model.  More information can be found on the Wikipedia page [here](https://en.wikipedia.org/wiki/Lasso_(statistics)).
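In scikit-learn, `Lasso` minimizes the objective

$$ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 $$

where $n$ is the number of samples, $w$ holds the coefficients, and $\alpha$ controls the strength of the penalty.  The sketch below shows the resulting feature elimination on synthetic data; the data, the true coefficients of 3 and -2, and `alpha=0.5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 200, 10

X = rng.normal(size=(n_samples, n_features))
# Only the first two features influence the target; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# A larger alpha means a stronger penalty and more coefficients set to zero.
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

print(np.round(lasso.coef_, 2))

# Indices of the features whose coefficients survived the penalty.
kept = np.flatnonzero(lasso.coef_)
print(kept)
```

The surviving coefficients are also shrunk toward zero relative to the true values, which is the price LASSO pays for the sparsity.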


## Overfitting a DecisionTree Example

In this example, a new feature is created from the Enron data and used to train a simple decision tree.  The example shows how using a small training set with a lot of features can easily overfit a model.  It also illustrates how a single feature (in this case the newly created one) can dominate a model.

#### Load in the data

In [1]:
#the usuals
import pickle
import numpy as np
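try:
    # scikit-learn >= 0.18: train_test_split lives in model_selection
    from sklearn import model_selection as cross_validation
except ImportError:
    # older scikit-learn releases
    from sklearn import cross_validation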
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn import tree

In [None]:
np.random.seed(42)

words_file = "../text_learning/your_word_data.pkl" 
authors_file = "../text_learning/your_email_authors.pkl"
# open the pickles in binary mode and close the files once loaded
with open(words_file, "rb") as f:
    word_data = pickle.load(f)
with open(authors_file, "rb") as f:
    authors = pickle.load(f)

#### Split the data into training and test then vectorize

In [3]:
#split the data into training and test
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
    word_data, authors, test_size=0.1, random_state=42)

#create tfidf vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()


### a classic way to overfit is to use a small training
### set and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

#### Create, fit, and predict with a simple decision tree

In [4]:
clfTree = tree.DecisionTreeClassifier(min_samples_split=8)
clfTree.fit(features_train, labels_train)
pred = clfTree.predict(features_test)

print('The accuracy of the Decision Tree classifier with a min_samples_split of 8 is: ',
      accuracy_score(labels_test, pred))

('The accuracy of the Decision Tree classifier with a min_samples_split of 8 is: ', 0.94766780432309439)


Here we can see that a very simple decision tree, trained on only 150 samples, achieves a much higher accuracy score than expected, a telltale sign of overfitting.

Let's look at the feature importances and print those that are above 0.2:

In [5]:
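featImp = clfTree.feature_importances_
# print every importance above 0.2 together with its column index
for index, importance in enumerate(featImp):
    if importance > 0.2:
        print("{} {}".format(importance, index))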

0.764705882353 33614


Only one feature has an importance above 0.2, meaning that it largely controls the solution.  Let's find out what word that is.

In [6]:
vectorizer.get_feature_names()[33614]  # on scikit-learn >= 1.0, use get_feature_names_out()

u'sshacklensf'
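The same lookup, from importance to column index to word, can be sketched end to end on a toy corpus.  Everything in this example is made up for illustration and is not the Enron data: the eight short "emails", the author labels, and the token `sigalpha`, which stands in for a signature-like word that appears only in one author's messages.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical corpus: author 0 always signs with "sigalpha", author 1 never does.
emails = [
    "meeting schedule budget sigalpha",
    "budget report forecast sigalpha",
    "lunch schedule report sigalpha",
    "forecast meeting lunch sigalpha",
    "meeting schedule budget",
    "budget report forecast",
    "lunch schedule report",
    "forecast meeting lunch",
]
authors = [0, 0, 0, 0, 1, 1, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails).toarray()

clf = DecisionTreeClassifier(random_state=0).fit(X, authors)

# Invert the fitted vocabulary (word -> column) to get column -> word.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Column indices of the three most important features, most important first.
top = np.argsort(clf.feature_importances_)[::-1][:3]
for idx in top:
    print(index_to_word[int(idx)], clf.feature_importances_[idx])
```

Inverting `vocabulary_` avoids depending on `get_feature_names()` or `get_feature_names_out()`, whose availability differs between scikit-learn versions.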