# Feature Selection
Goal: find the minimal set of features that still captures the trends in the data.
- Select best features
- Add new features

**Process**
- Use human intuition
    - POIs send emails to each other at a higher rate
- Code up new feature
    - Int number of messages to this person from POI
- Visualise
    - Does the new feature give discriminating power between POIs?
- Repeat
    - Can we do better? E.g. scale the feature by the total number of messages to or from that person.

Observe
- Outliers
- Mixture of labelled points: Are there regions in your visualisation that contain only one category of label? (E.g. if <20% of a person's emails are sent to POIs -> all non-POIs.)
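
The "scale the feature" idea from the Repeat step can be sketched as below. This is a minimal illustration written for Python 3, not the course's own solution: the function name `compute_fraction` and the message counts are hypothetical.

```python
def compute_fraction(poi_messages, all_messages):
    """Return the fraction of a person's messages that involve a POI.

    Returns 0.0 when either count is missing or the total is zero, so
    people with no email data do not raise a division error.
    """
    if poi_messages is None or not all_messages:
        return 0.0
    return float(poi_messages) / all_messages


# Hypothetical counts for two people: (messages from a POI, all received).
people = {
    "person_a": (40, 200),
    "person_b": (40, 4000),
}

for name, (from_poi, received) in sorted(people.items()):
    print(name, compute_fraction(from_poi, received))
```

Both people received the same raw number of POI messages, but the scaled feature separates them: one gets a fifth of their mail from POIs, the other a hundredth.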

In [None]:
#!/usr/bin/python

###
### in poiFlagEmail() below, write code that returns a boolean
### indicating if a given email is from a POI
###

import sys
import reader
import poi_emails

def getToFromStrings(f):
    '''
    The imported reader.py file contains functions that we've created to help
    parse e-mails from the corpus. .getAddresses() reads in the opening lines
    of an e-mail to find the To: From: and CC: strings, while the
    .parseAddresses() line takes each string and extracts the e-mail addresses
    as a list.
    '''
    f.seek(0)
    to_string, from_string, cc_string   = reader.getAddresses(f)
    to_emails   = reader.parseAddresses( to_string )
    from_emails = reader.parseAddresses( from_string )
    cc_emails   = reader.parseAddresses( cc_string )

    return to_emails, from_emails, cc_emails


### POI flag an email

def poiFlagEmail(f):
    """ given an email file f,
        return a trio of booleans for whether that email is
        to, from, or cc'ing a poi """

    to_emails, from_emails, cc_emails = getToFromStrings(f)

    ### poi_emails.poiEmails() returns a list of all POIs' email addresses.
    poi_email_list = poi_emails.poiEmails()

    to_poi = False
    from_poi = False
    cc_poi   = False

    ### to_poi and cc_poi are related flags, which record whether
    ### the email under inspection is addressed to a POI, or if a POI is in cc
    ### you don't have to change this code at all

    ### there can be many "to" emails, but only one "from", so the
    ### "to" processing needs to be a little more complicated
    if to_emails:
        ctr = 0
        while not to_poi and ctr < len(to_emails):
            if to_emails[ctr] in poi_email_list:
                to_poi = True
            ctr += 1
    if cc_emails:
        ctr = 0
        while not cc_poi and ctr < len(cc_emails):
            if cc_emails[ctr] in poi_email_list:
                cc_poi = True
            ctr += 1


    #################################
    ######## your code below ########
    ### set from_poi to True if #####
    ### the email is from a POI #####
    #################################

    if from_emails:
        ctr = 0
        while not from_poi and ctr < len(from_emails):
            if from_emails[ctr] in poi_email_list:
                from_poi = True
            ctr += 1
    
    

    #################################
    return to_poi, from_poi, cc_poi

## Beware of bugs - be skeptical of classifiers with near 100% accuracy

When Katie was working on the Enron POI identifier, she engineered a feature that identified when a given person was on the same email as a POI. So for example, if Ken Lay and Katie Malone are both recipients of the same email message, then Katie Malone should have her "shared receipt" feature incremented. If she shares lots of emails with POIs, maybe she's a POI herself.

Here's the problem: there was a subtle bug, that Ken Lay's "shared receipt" counter would also be incremented when this happens. And of course, then Ken Lay always shares receipt with a POI, because he is a POI. So the "shared receipt" feature became extremely powerful in finding POIs, because it effectively was encoding the label for each person as a feature.

We found this first by being suspicious of a classifier that was always returning 100% accuracy. Then we removed features one at a time, and found that this feature was driving all the performance. Then, digging back through the feature code, we found the bug outlined above. We changed the code so that a person's "shared receipt" feature was only incremented if there was a different POI who received the email, reran the code, and tried again. The accuracy dropped to a more reasonable level.
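
A minimal sketch of the corrected counting rule described above, assuming each email is represented simply as a list of recipient addresses. The function name, the addresses, and the data layout are hypothetical; the original feature code is not shown in these notes.

```python
def count_shared_receipt(emails, poi_addresses):
    """Count, per recipient, the emails also received by a *different* POI."""
    counts = {}
    for recipients in emails:
        pois_on_email = set(recipients) & poi_addresses
        for person in set(recipients):
            # Exclude the person themselves so a POI is not counted as
            # sharing receipt with themselves (the bug described above).
            other_pois = pois_on_email - {person}
            counts.setdefault(person, 0)
            if other_pois:
                counts[person] += 1
    return counts


# Hypothetical addresses: each inner list holds one email's recipients.
poi_addresses = {"ken.lay", "jeff.skilling"}
emails = [
    ["ken.lay", "katie.malone"],
    ["ken.lay"],
    ["ken.lay", "jeff.skilling"],
]
print(count_shared_receipt(emails, poi_addresses))
```

With the buggy rule, `ken.lay` would score 3 (every email he received); with the fix he scores 1, because only the third email includes a different POI.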

## Getting rid of features
Reasons
- It's noisy
- It causes overfitting
- It is highly correlated with a feature that's already present
- Additional features slow down the training/testing process
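
As one illustration of the "highly correlated" reason, the sketch below drops any feature whose Pearson correlation with an already-kept feature exceeds a threshold. The column names, the values, and the 0.95 threshold are assumptions chosen for the example, and the code targets Python 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns; "bonus_usd" is just "bonus_kusd" times 1000.
columns = {
    "salary": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bonus_kusd": [2.0, 1.0, 4.0, 3.0, 6.0],
    "bonus_usd": [2000.0, 1000.0, 4000.0, 3000.0, 6000.0],
}
print(drop_correlated(columns))
```

The rescaled duplicate is dropped because it adds no information beyond the column it copies. Which member of a correlated pair survives depends on column order here, which is a simplification.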

## Features != Information.
Features attempt to access information but are not information themselves. We want the information: quality over quantity.



In [None]:
#!/usr/bin/python

import pickle
import cPickle
import numpy

from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif



def preprocess(words_file = "../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "r")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "r")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    # Some feature selection happens here with (1) stop_words='english' and
    # (2) max_df: ignore terms that have a document frequency
    # strictly higher than the given threshold.
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    # Keep the top 10% of features, ranked by the ANOVA F-score (f_classif)
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test


High dimensionality data -> many features

## Bias-Variance Dilemma and Number of Features

**High bias**: Pays little attention to data and is oversimplified
- e.g. few features used
- Low r^2, large SSE

**High variance**: Pays too much attention to the data and doesn't generalise well. Overfits.
- e.g. carefully minimised SSE
- Much higher error on test set than on training set

Tradeoff between goodness of fit and the simplicity of fit.
Want few features, low SSE, high r^2.

## Regularisation: Balancing error with no. of features
- Method for automatically penalising extra features in your model
- Picture an inverted-U plot of model quality against no. of features: quality peaks at an intermediate number of features.

E.g. in regressions

### Lasso Regression
Minimise SSE + $\lambda|\beta|$,

where $\lambda$ is a penalty parameter and
$|\beta|$ is the sum of the absolute values of the regression coefficients $\beta$ (so it grows with the number of features used).

So gain of feature in minimising SSE has to outweigh the penalty of using that extra feature.

The regression itself is

$$y = \sum_i m_i x_i + b$$

so the coefficients $\beta$ are the slopes $m_i$.

**Process:** Conceptually, Lasso regression can be thought of as trying features one at a time. If a feature doesn't decrease SSE enough to outweigh its penalty, it isn't added, i.e. its coefficient is set to zero.

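Precisely, the **optimisation objective for Lasso** is:

$$\frac{1}{2 n_{\text{samples}}} \|y - Xw\|^2_2 + \alpha \|w\|_1$$

Here $w$ plays the role of $\beta$ and $\alpha$ the role of $\lambda$, with the SSE term scaled by $1/(2 n_{\text{samples}})$.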
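
To see how the penalty sets a coefficient to zero, here is a minimal sketch for the special case of a single feature and no intercept, where the objective above has a closed-form minimiser (soft-thresholding). The function names and the data are hypothetical, and the code is written for Python 3.

```python
def soft_threshold(value, alpha):
    """Shrink value towards zero by alpha, clipping at zero."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_one_feature(x, y, alpha):
    """Minimise (1 / (2 * n)) * sum((y - w * x)^2) + alpha * |w| over w."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n   # (1/n) * x . y
    scale = sum(xi * xi for xi in x) / n             # (1/n) * x . x
    return soft_threshold(rho, alpha) / scale


def objective(x, y, w, alpha):
    """Value of the Lasso objective for a single coefficient w."""
    n = len(x)
    sse = sum((yi - w * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (2 * n) + alpha * abs(w)


# Hypothetical data: y is roughly 2 * x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]

for alpha in (0.0, 1.0, 10.0):
    w = lasso_one_feature(x, y, alpha)
    print(alpha, w, objective(x, y, w, alpha))
```

With no penalty the coefficient is the ordinary least-squares slope. A moderate penalty shrinks it, and a large enough penalty drives it to exactly zero, which is how Lasso "drops" a feature whose SSE gain does not outweigh its cost.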


# Feature Selection: Charles & Michael

## Why?
- Knowledge Discovery, Interpretability and Insight (Human)
    - Which features matter
- Curse of Dimensionality (Machine)
    - The amount of data you need grows exponentially in the number of features you have

### How hard is the problem
of choosing m features out of n features? (We might not know what m is, only that $m \leq n$.)
- There are n choose m subsets for a known m, or $2^n$ subsets if m is unknown.
- NP-hard.
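
The counts above are easy to check numerically. A short sketch, assuming Python 3.8+ for `math.comb` and a hypothetical n of 20:

```python
import math

n = 20  # hypothetical number of candidate features

# Number of subsets of exactly m features, for a few values of m.
for m in (1, 5, 10):
    print(m, math.comb(n, m))

# If m is unknown, every subset is a candidate: sum over m of C(n, m) = 2^n.
total_subsets = sum(math.comb(n, m) for m in range(n + 1))
print(total_subsets, 2 ** n)
```

Even at 20 features there are over a million subsets to evaluate, which is why exhaustive search is ruled out in favour of the heuristics below.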

## Algorithmic approaches: Filtering and Wrapping

### Filtering:
**Process**:
- Have input features
- Run the features through a search algorithm which maximises some score
    - Criteria are built into the search with no reference to the learner
- Pass the selected features to some learning algorithm, which will use them for classification/regression.

**Adv**:
- Faster: Don't need to worry about paying the cost of what the learner is going to do.
- Flow forward

**Disadv**:
- No feedback. Ignores the learner.
- (Speed ->) Tends to look at features in isolation

**Examples of criteria**:
- Information Gain (depends on labels)
    - E.g. put a decision tree inside the search box. The top features that come out of the decision tree then go into another learner, e.g. KNN. (KNN suffers from the curse of dimensionality because it doesn't know which features are important.)
    - Another version: Neural net and pruning features that have low weight.
> Nice
- Entropy, Gini index (version of entropy), some form of variance (doesn't depend on the labels)
- Linear Independence 

**Analogies within Supervised Learning**: Decision Trees (**Information Gain**).
- Note you can look at labels for filtering in supervised learning.
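
A minimal sketch of filtering with information gain as the criterion, for discrete features. The helper names and the toy data are hypothetical. The point is that the ranking uses the labels but never consults the downstream learner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def filter_top_k(features, labels, k):
    """Rank features by information gain and keep the k best names."""
    ranked = sorted(features,
                    key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: "f_copy" equals the label, "f_noise" is unrelated.
labels = [0, 0, 1, 1]
features = {
    "f_noise": [0, 1, 0, 1],
    "f_copy": [0, 0, 1, 1],
}
print(filter_top_k(features, labels, k=1))
```

The selected feature names would then be handed to whatever learner comes next (KNN, for instance). Each feature is scored in isolation, which is exactly the weakness listed under the disadvantages.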

### Wrapping:
**Process**:
- Take features
- Searches over features
- Learning alg reports how well it does
    - Criteria built in learner
- Use that score to search for better set of features

**Adv**:
- Allows for feedback
- Takes into account model bias and the learner

**Disadv**:
- Much slower.

**Examples of criteria**:
- Kinds of local search or hill climbing (deterministic gradient search)
- Randomised optimisation, e.g. MIMIC or genetic algorithms
> Don't know what this is.
- Forward sequential selection (polynomial) ~ hill climbing where the neighbourhood relation is adding one more feature.
    - Start with your n candidate features.
    - Look at each feature in isolation.
    - Pass the first to the learner, then the second, then the third...
    - Keep whichever feature scores best.
    - Then add each of the remaining features individually to the kept set and pick the best combination.
    - Repeat until the improvement is no longer significant.
- Backward elimination 
    - Hill climbing (Reverse of forward search)
- (NOT exhaustive search, because that's exponential)
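
Forward sequential selection as described above can be sketched as a greedy loop around any scoring function. Here `score` stands in for "train the learner on this subset and report how well it does". The toy scorer, the feature names, and the `min_gain` stopping threshold are assumptions for illustration.

```python
def forward_selection(feature_names, score, min_gain=1e-9):
    """Greedy forward search: add the single best feature until gains stall.

    `score` is any callable mapping a list of feature names to a number
    (higher is better), e.g. a learner's validation accuracy.
    """
    selected = []
    best_score = score(selected)
    remaining = list(feature_names)
    while remaining:
        # Ask the learner to score each one-feature extension of the set.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score - best_score <= min_gain:
            break  # improvement is not significant enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical stand-in for "train the learner and report its score":
# features a and b each help, c adds nothing.
def toy_score(subset):
    value = {"a": 0.5, "b": 0.3, "c": 0.0}
    return sum(value[f] for f in subset)


print(forward_selection(["a", "b", "c"], toy_score))
```

Each round scores at most n subsets and there are at most n rounds, so the learner is called on the order of n^2 times rather than 2^n. Backward elimination is the same loop run in reverse, starting from all features and removing one at a time.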


Domain knowledge comes into choice of criteria.

## Relevance and Usefulness
- What if a feature doesn't provide any information?

### Relevance
**Relevance ~ Information**
- A feature $x_i$ is **strongly relevant** if removing it degrades the **Bayes Optimal Classifier** (BOC) on the remaining features.
    - The BOC is the weighted average of all the hypotheses: the best that you could do on average.
- $x_i$ is **weakly relevant** if
    - it is not strongly relevant, and
    - there exists a subset of features S such that adding $x_i$ to S improves the BOC.
    - E.g. for AND(a,b) with an extra feature e = not a, neither a nor e is strongly relevant, but both are weakly relevant.
- Otherwise, $x_i$ is irrelevant.

BOC is the gold standard.
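
The AND(a,b) example with e = not a can be checked by brute force. In this sketch the best achievable accuracy on the truth table stands in for the BOC, assuming all four input rows are equally likely. The helper name `best_accuracy` is hypothetical, and the code is written for Python 3.

```python
from collections import Counter
from itertools import product


def best_accuracy(rows, subset):
    """Best achievable accuracy using only the features in `subset`.

    With a uniform distribution over the rows, the optimal rule predicts
    the majority label for each distinct combination of visible features.
    """
    groups = {}
    for features, label in rows:
        key = tuple(features[name] for name in subset)
        groups.setdefault(key, Counter())[label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


# Truth table for y = a AND b, plus a redundant feature e = not a.
rows = []
for a, b in product([0, 1], repeat=2):
    rows.append(({"a": a, "b": b, "e": 1 - a}, a & b))

print(best_accuracy(rows, ["a", "b", "e"]))  # all features
print(best_accuracy(rows, ["b", "e"]))       # a removed
print(best_accuracy(rows, ["a", "e"]))       # b removed
print(best_accuracy(rows, ["b"]))            # subset S = {b}
print(best_accuracy(rows, ["a", "b"]))       # S plus a
```

Removing a costs nothing because e carries the same information, so a is not strongly relevant. Removing b does hurt, so b is strongly relevant. Adding a to S = {b} improves the result, which makes a weakly relevant.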

### Usefulness
Usefulness measures the **effect (of minimising error) on a particular predictor**.
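- E.g. a constant feature c = 1 added to an AND(a,b) dataset carries no information (it is irrelevant), yet it is useful to an origin-constrained perceptron because it acts as a bias term.
- Relevance is usefulness with respect to the BOC.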
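
The perceptron example can be made concrete with a small sketch: a perceptron whose decision boundary must pass through the origin cannot represent AND(a,b), but it can once a constant feature c = 1 is appended. The function names and the epoch limit are assumptions, and the code is written for Python 3.

```python
def predict(weights, x):
    """Origin-constrained perceptron: fire only if w . x is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def train_perceptron(data, epochs=100):
    """Classic perceptron updates with no separate bias term."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(weights, x)
            if error:
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights


def accuracy(weights, data):
    """Fraction of examples the perceptron classifies correctly."""
    return sum(predict(weights, x) == y for x, y in data) / len(data)


# AND(a, b) without and with a constant feature c = 1.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_data_with_c = [((a, b, 1), y) for (a, b), y in and_data]

acc_without_c = accuracy(train_perceptron(and_data), and_data)
acc_with_c = accuracy(train_perceptron(and_data_with_c), and_data_with_c)
print(acc_without_c, acc_with_c)
```

The constant column has the same value for every example, so it tells you nothing about the label. It still reduces error for this particular learner, which is the distinction between usefulness and relevance.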

## Summary

- Feature Selection Definition
- Filtering (Faster, but ignores bias) vs Wrapping (Slow but useful)
- Relevance (Info) vs usefulness (Reduce error for a particular model)
    - Strong and weak relevance
