**Lesson 11: Feature Selection**

This is the notebook for Lesson 11: Feature Selection. I start by opening the starter code and running it from the notebook. 

In [113]:
# %load find_signature.py
#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already 
### largely processed. These files should
### have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl" 
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the 
### test set (the remainder go into training)
### feature matrices changed to dense representations 
### for compatibility with classifier 
### functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test \
    = cross_validation.train_test_split(word_data, authors,\
    test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()


### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]



### your code goes here
# I should get a decision tree up and training on 
# the training set





In [114]:
features_train[0]

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [115]:
features_train[:,2]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [116]:
len(features_train)

150

In [117]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [118]:
pred = clf.predict(features_test)

In [119]:
len(pred)

1758

In [120]:
len(labels_train)

150

In [121]:
from sklearn.metrics import accuracy_score

In [122]:
accuracy_score(pred, labels_test)

0.81683731513083047

In [123]:
accuracy_score(labels_test, pred)

0.81683731513083047
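Both argument orders give the same number above because accuracy is just the fraction of positions where the two label lists agree. A minimal sketch with made-up labels (not the email data) shows the same symmetry. The documented order is `accuracy_score(y_true, y_pred)`, which matters for metrics such as precision and recall that are not symmetric.

```python
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only (not the Enron data).
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Accuracy counts matching positions, so swapping the arguments
# does not change the result.
acc_forward = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score(y_pred, y_true)
print("forward: %s, swapped: %s" % (acc_forward, acc_swapped))
```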

Find the most important features using the *feature_importances_* attribute of the fitted classifier.

In [124]:
importance = clf.feature_importances_

In [125]:
importance_high = []
for score in importance: 
    if score > 0.2:
        importance_high.append(score)
        

In [126]:
importance_high

[0.36363636363636365]

In [127]:
len(importance_high)

1

In [128]:
importance

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [129]:
type(importance)

numpy.ndarray

In [130]:
import numpy as np
np.count_nonzero(importance)

13

In [131]:
np.where(importance.nonzero())

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]))

I have no idea what that means. 
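A likely reading of that output: `importance.nonzero()` already returns a tuple holding the indices of the nonzero entries, and wrapping it in `np.where` treats that tuple as a 1 x 13 array and reports where *it* is nonzero. So the result is (row 0, columns 0 to 12), which are positions inside the index list, not feature numbers. The sketch below reproduces this on a small made-up array; `np.flatnonzero` is one direct way to get the feature indices.

```python
import numpy as np

# Small made-up importance vector, for illustration only.
importance = np.array([0.0, 0.3, 0.0, 0.0, 0.7, 0.0])

# nonzero() returns a tuple with one index array per dimension.
idx = importance.nonzero()

# np.where on that tuple converts it to a 1 x 2 array [[1, 4]] and
# returns the (row, column) positions of its nonzero entries.
wrapped = np.where(idx)

# The feature indices themselves, without the extra wrapping.
direct = np.flatnonzero(importance)

print(idx)
print(wrapped)
print(direct)
```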

In [132]:
itemindex = np.where(importance>=0.2)

In [134]:
print itemindex

(array([21323]),)


Use TfIdf to get the most important word. Tfidf in this context is the vectorizer. 

So to figure out which word is causing the problem, I have to go back to the feature index I found above and look up the word associated with it. The vectorizer's get_feature_names() method returns the words in feature-index order, so it tells us which word is driving the discrimination.

We assign the list of feature_names to the x variable and then use the index number of the feature that we found to be determining the outcome of the model to extract its name from the list. 

In [135]:
x = vectorizer.get_feature_names()

In [136]:
x[21323]

u'houectect'
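The whole lookup can be sketched end to end on a toy corpus. Everything here is made up for illustration: the four documents, the labels, and the token `sigtoken`, which plays the role of a signature word that appears only in one author's emails. The sketch inverts `vectorizer.vocabulary_` rather than calling `get_feature_names()`, because newer scikit-learn releases expose `get_feature_names_out()` instead and the vocabulary mapping works in either case.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy corpus: "sigtoken" appears only in the class-0 documents.
docs = ["budget meeting sigtoken",
        "lunch plans sigtoken",
        "budget report",
        "lunch order"]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs).toarray()

clf = DecisionTreeClassifier(random_state=42)
clf.fit(features, labels)

# vocabulary_ maps word -> column index; invert it to map index -> word.
index_to_word = dict((i, w) for w, i in vectorizer.vocabulary_.items())

# Same outlier rule as above: importance greater than 0.2.
important_words = [index_to_word[int(i)]
                   for i in np.where(clf.feature_importances_ > 0.2)[0]]
print(important_words)
```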

The lesson section on feature selection in the TfIdf vectorizer presents some code that narrows down the data using SelectPercentile and f_classif from sklearn.feature_selection. They are used in the code thusly:
```
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
features_test_transformed = selector.transform(features_test_transformed).toarray()
```
So with percentile=10 the selector keeps the 10% of features with the highest f_classif scores and gets rid of the other 90%, which do the least to distinguish between the two groups of emails.
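A self-contained sketch of the same selector on synthetic data, assuming random features in which only columns 3 and 7 (an arbitrary choice) are shifted by the label. With 20 features and percentile=10, two columns survive.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic data, for illustration only: 40 samples, 20 features.
rng = np.random.RandomState(42)
X = rng.rand(40, 20)
y = np.array([0] * 20 + [1] * 20)

# Make two columns informative by shifting them for class 1.
X[:, 3] += 5 * y
X[:, 7] += 5 * y

# Keep the top 10% of features ranked by the ANOVA F-score.
selector = SelectPercentile(f_classif, percentile=10)
X_reduced = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(X_reduced.shape)
print(kept)
```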

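We can also use TfidfVectorizer() itself to get rid of a lot of words from the data set up front. The instructor does that in the code when she creates the vectorizer that produces the transformed features. In the cell at the top of this notebook, for example, stop_words='english' drops common English words and max_df=0.5 ignores words that appear in more than half of the documents.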
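A small sketch of that up-front filtering, using four made-up documents. The word "enron" appears in three of the four, so max_df=0.5 drops it, and "the" is removed by the English stop-word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
docs = ["enron nymex calendar",
        "enron houston office",
        "enron swap agreement",
        "the swap calendar"]

# max_df=0.5 ignores terms found in more than half of the documents;
# stop_words='english' removes common English words.
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
vectorizer.fit(docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```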



**Remove, Repeat**

This word seems like an outlier in a certain sense, so let’s remove it and refit. 

Go back to text_learning/vectorize_text.py, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc. 

[Ok, I think this is back in the last lesson. I'll get the code and copy and paste it here and run it again from here.]

Rerun vectorize_text.py, and once that finishes, rerun find_signature.py. Any other outliers pop up? What word is it? Seem like a signature-type word? (Define an outlier as a feature with importance >0.2, as before).
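The removal step is the same str.replace() loop used for the names. A minimal sketch, where `scrub` is a hypothetical helper name and the sample strings are made up. One caveat: str.replace() matches substrings, not whole words, so a drop word can also be cut out of the middle of a longer word.

```python
def scrub(text, drop_words):
    """Remove every occurrence of each drop word from text.

    Hypothetical helper for illustration; it mirrors the
    str.replace() loop used in vectorize_text.py.
    """
    for word in drop_words:
        text = text.replace(word, "")
    return text


# The removed words leave their surrounding spaces behind.
cleaned = scrub("thank sara shackleton enron", ["sara", "shackleton"])
print(repr(cleaned))

# Substring matching also strips "chris" out of "christmas".
clipped = scrub("christmas parti with chris", ["chris"])
print(repr(clipped))
```

Each new outlier word found by find_signature.py would be appended to the drop list before rerunning the two scripts.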

In [87]:
%cd "../text_learning/"

/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning


In [88]:
%ls

Tutorial_Working_w_Text_Data.ipynb  scikit-learn/                       text_learning_3.ipynb
Tutorial_sklearn_Vanderpass.ipynb   sklearn_text_tutorial_setup.ipynb   vectorize_text.py
Untitled.ipynb                      test_email.txt                      your_email_authors.pkl
Untitled1.ipynb                     text_learning.ipynb                 your_word_data.pkl
from_chris.txt                      text_learning_1.ipynb
from_sara.txt                       text_learning_2.ipynb


In [None]:
# %load vectorize_text.py
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../tools/" )
from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""


from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print path
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]

            ### append the text to word_data

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris


            email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )





### in Part 4, do TfIdf vectorization here




Ok, that just gets me the original script before I worked on it. The thing I need is the script after the alterations. 

In [112]:
# I have code that runs and seems to work but the output is not accepted by the grader for the signature scrubbing quiz in Lesson 10. Here is my code: 

# %load vectorize_text.py
#!/usr/bin/python

%reload_ext autoreload

import os
import pickle
import re
import sys

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

sys.path.append( "../tools/" )
from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""


from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")

# trying to find out why no data gets to the parseOutText program
# print from_sara
# This returned a proper file object

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
#         if temp_counter == 1:
#             print path
#             print type(path)
#         else: 
#             pass

        if temp_counter >= 0:
            path = os.path.join('..', path[:-1])

            email = open(path, "r")
#             if temp_counter == 1:
#                 print type(email)
#             else: 
#                 pass
            
            
            ### use parseOutText to extract the text from the opened email
            #email = parseOutText(email)
            

            text = parseOutText(email)
            if temp_counter == 1:
                print "\ntext: " + text + "\n"

            #print text
            ### use str.replace() to remove any instances of the words
            
            drop_words = ["sara", "shackleton", "chris", "germani","sshacklensf", "cgermannsf"]
            for word in drop_words:
                text = text.replace(word, "")
            
            ### append the text to word_data
            
            word_data.append(text)

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris

            if name == "sara":
                from_data.append(0)
            else: 
                from_data.append(1)
            
            email.close()

print "emails processed"
print word_data[152]
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )





### in Part 4, do TfIdf vectorization here




text: sbaile2 nonprivilegedpst susan pleas send the forego list to richard thank sara shackleton enron wholesal servic 1400 smith street eb3801a houston tx 77002 ph 713 8535620 fax 713 6463490 

emails processed
tjonesnsf stephani and sam need nymex calendar 
