**Lesson 10: text_learning**

sklearn's text learning tools are built on the bag of words model. The implementation is called CountVectorizer and lives in the feature_extraction.text module.

The bag of words model lets us analyze documents of different lengths along a common set of dimensions, namely the words that occur in any of the documents. Each document becomes a vector of word counts.

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
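
To make the bag of words idea concrete, here is a minimal pure-Python sketch that does not use sklearn. It is written in Python 3 syntax, unlike the Python 2 cells in this notebook. The helper name bag_of_words and the two sample documents are made up for illustration; CountVectorizer does the same kind of counting but tokenizes differently, so its vocabulary will not match exactly.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of strings.

    Hypothetical helper for illustration only; not part of sklearn.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word in any document, sorted
    # so that each word gets a stable column index.
    terms = sorted({word for tokens in tokenized for word in tokens})
    vocabulary = {word: index for index, word in enumerate(terms)}
    # Each document becomes a vector of counts, one entry per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in terms])
    return vocabulary, vectors


# Made-up documents of different lengths.
docs = ["a great day", "a great great long email about a day"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary.get("great"))  # column index of the word "great"
print(vectors)                  # both vectors have the same length
```

Both documents end up as vectors of the same length even though one is much longer than the other, which is what makes them comparable.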

**Stopwords in NLTK**

*Get a list of stopwords from the Natural Language Toolkit*

In [45]:
import nltk

In [46]:
from nltk.corpus import stopwords

In [7]:
# nltk.download() 
# only needs to be done once. 

In [47]:
sw = stopwords.words("english")

In [48]:
sw[4]

u'we'

In [49]:
len(sw)

127

Once a CountVectorizer (called vectorizer here) has been fit to some documents, this is the code you would use to get the feature index of a word:
```
print vectorizer.vocabulary_.get("great")
```

This is the Snowball stemmer. It works like an sklearn estimator: you first create a stemmer object, which exists independently of any data, and then apply it to the words you want to stem.

In [50]:
from nltk.stem.snowball import SnowballStemmer

In [51]:
stemmer = SnowballStemmer("english")

In [52]:
stemmer.stem("responsivity")

u'respons'

**Order of Operations in Text Processing**

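Always do your stemming before you do your bag of words text processing. After all, that is the whole point of doing the stemming: variants of a word collapse into a single stem, and therefore a single dimension, before they are counted.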
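
A small sketch of why the order matters, in Python 3 syntax. The function crude_stem is a made-up stand-in that strips a handful of suffixes; it is not the Snowball algorithm and is only here so the example runs without NLTK. The word list is also invented.

```python
def crude_stem(word):
    """Toy stemmer (assumption for illustration): strip a few suffixes."""
    for suffix in ("iveness", "ivity", "ive", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


words = ["response", "responsive", "responsivity", "responsiveness"]

# Without stemming, every variant is its own bag-of-words dimension.
raw_vocabulary = sorted(set(words))
# With stemming first, the variants share a single dimension.
stemmed_vocabulary = sorted(set(crude_stem(w) for w in words))

print(len(raw_vocabulary), raw_vocabulary)
print(len(stemmed_vocabulary), stemmed_vocabulary)
```

If the vocabulary were built first, the four variants would already occupy four separate columns and stemming afterwards could no longer merge them.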

**Weighting Terms by Frequency**

The idea is to weight each word by how often it occurs in a document, scaled down by how common it is across the corpus.

*Tf* stands for term frequency: how often a term occurs in a given document.

*Idf* stands for inverse document frequency: a weight that shrinks as the term appears in more documents of the corpus. You want to give a higher weight to the words that are rare in the corpus, because they are what make your document stand out. When words that are uncommon in the corpus as a whole are frequent in a document, those words are especially salient for determining the unique content of the document. Multiplying the two weights gives TfIdf.
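
The weighting described above can be sketched in a few lines of pure Python (Python 3 syntax). This uses the plain textbook form, raw counts for tf and log(N / df) for idf; sklearn's TfidfVectorizer applies a smoothed idf and normalizes each vector by default, so its numbers will differ. The function name tf_idf and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    tf is the raw count in the document; idf is log(N / df),
    where df is the number of documents containing the word.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Made-up corpus: "the" occurs everywhere, "pipeline" only in the last document.
docs = ["the deal closed", "the gas deal", "the pipeline pipeline schedule"]
weights = tf_idf(docs)
print(weights[2])
```

A word that appears in every document gets a weight of zero, while a word that is frequent in one document and absent from the others gets the largest weight.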

**Text Learning Mini-Project**

*parseOutText()*

Takes an opened email as an argument and returns a string containing all the stemmed words in the email. It takes out any metadata that might be at the top so you have just the text of the email. 

*Warming up with parseOutText*

The text of the email should print to the screen. To get it we run the script "parse_out_email_text.py", which lives in the tools directory.

In [53]:
%pwd

u'/Users/michaelreinhard/nano/machineLearning/ud120-projects'

In [6]:
%cat /Users/michaelreinhard/nano/machineLearning/ud120-projects/tools/parse_out_email_text.py

In [59]:
%cd text_learning

/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning


In [66]:
%run /Users/michaelreinhard/nano/machineLearning/ud120-projects/tools/parse_out_email_text.py



hi everyon  if you can read this messag your proper use parseouttext  plea proceed to the next part of the project



In [67]:
%pwd

u'/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning'

In [68]:
%ls

Tutorial_Working_w_Text_Data.ipynb  test_email.txt
Tutorial_sklearn_Vanderpass.ipynb   text_learning.ipynb
from_chris.txt                      text_learning_1.ipynb
from_sara.txt                       text_learning_2.ipynb
scikit-learn/                       text_learning_3.ipynb
sklearn_text_tutorial_setup.ipynb   vectorize_text.py


*Deploy Stemming*

Now we comment out the line "words = text_string" in the function parseOutText() (in the script parse_out_email_text.py) and replace it with code that stems each word.

First I have to figure out how to loop through a string and stem the words in it.

In [69]:
string = "The main thing to keep in mind is don't panic " 

The cell below produces a string of stemmed words. It first separates the string into a list of words with the .split(" ") method. Then, in the body of the for loop, it appends each stemmed word to a list called stemmed_words. Finally, it converts the list of stemmed words back into a single string, separated by spaces, by calling the .join() method on a space.

In [70]:
stemmed_words = []
for i in string.split(" "): 
    stemmed_words.append(stemmer.stem(i))
stemmed_words
string = " ".join(stemmed_words)
string

u"the main thing to keep in mind is don't panic "
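
The trailing space in the result above comes from .split(" "), which keeps an empty string after the final space. A short sketch of the difference in Python 3 syntax; str.lower stands in for stemmer.stem here (an assumption, so the example runs without NLTK):

```python
text = "The main thing to keep in mind is don't panic "

# split(" ") keeps an empty string after the trailing space ...
pieces_explicit = text.split(" ")
# ... while split() with no argument drops leading, trailing and repeated whitespace.
pieces_default = text.split()

# str.lower is only a placeholder for the real stemming step.
joined_explicit = " ".join(word.lower() for word in pieces_explicit)
joined_default = " ".join(word.lower() for word in pieces_default)

print(repr(joined_explicit))  # still ends with a space
print(repr(joined_default))   # no trailing space
```

Calling .split() without an argument is usually the safer choice when the text may contain runs of spaces or newlines.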

In [5]:
%run ../tools/parse_out_email_text.py

In [15]:
#!/usr/bin/python

from nltk.stem.snowball import SnowballStemmer
import string

def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
        """


    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()

    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        ### remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        ### project part 2: the line below is commented out and
        ### replaced by the stemming code that follows
        # words = text_string

        ### split the text string into individual words, stem each word,
        ### and join the stemmed words (make sure there's a single
        ### space between each stemmed word)
        stemmer = SnowballStemmer("english", ignore_stopwords=True)
        stemmed_words = []
        for word in text_string.split():
            stemmed_words.append(stemmer.stem(word))
        words = " ".join(stemmed_words)

    return words

    

def main():
    ff = open("../text_learning/test_email.txt", "r")
    text = parseOutText(ff)
    print text



if __name__ == '__main__':
    main()

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project


In [73]:
%run ../tools/parse_out_email_text.py



hi everyon  if you can read this messag your proper use parseouttext  plea proceed to the next part of the project



In [74]:
%cd ../text_learning/

/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning


In [75]:
%cat test_email.txt

To: Katie_and_Sebastians_Excellent_Students@udacity.com
From: katie@udacity.com
X-FileName:

Hi Everyone!  If you can read this message, you're properly using parseOutText.  Please proceed to the next part of the project!


**"Clean Away any 'Signature Words'"**

The script *vectorize_text.py* should iterate through all the emails from Chris and Sara. Each email should be fed to parseOutText() to return the stemmed string.

Once you have the stemmed string representing each email you do two things.

First, remove the signature words. The four signature words are "sara", "shackleton", "chris" and "germani". I presume these are the first and last names of the two parties sending the emails ("germani" being the stemmed form of "germany", which would match the germany-c folder in the from_chris.txt paths).

Second, append the updated text string to 'word_data' and, at the same time, append a 0 (for Sara) or a 1 (for Chris) to from_data, to show who wrote the email.

The goal is to end up with two lists. The first should hold the stemmed text of each email, one string per email. The other should be a list of 0s and 1s for whether the author of the email is Sara (0) or Chris (1).

Inside the code is a 'temp_counter' to keep the code from running over all the emails. Once the code is working we should disable temp_counter and run the code over the whole data set. 
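
The two steps can be sketched on their own, without the Enron files, as follows (Python 3 syntax). The function name clean_and_label, its input format and the two sample strings are made up for illustration; in the real script the stemmed text would come from parseOutText().

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]


def clean_and_label(stemmed_emails):
    """Take (author_name, stemmed_text) pairs (a hypothetical input format)
    and return the two parallel lists word_data and from_data."""
    word_data = []
    from_data = []
    for name, text in stemmed_emails:
        # Step 1: remove every signature word with str.replace().
        for signature in SIGNATURE_WORDS:
            text = text.replace(signature, "")
        # Step 2: store the text and the matching author label.
        word_data.append(text)
        from_data.append(0 if name == "sara" else 1)
    return word_data, from_data


# Invented stand-ins for the output of parseOutText().
sample = [
    ("sara", "pleas send the draft thank sara shackleton"),
    ("chris", "the gas deal is book chris germani"),
]
word_data, from_data = clean_and_label(sample)
print(word_data)
print(from_data)
```

Note that str.replace() removes the signature words wherever they occur, including inside longer words, and leaves the surrounding spaces behind.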

In [None]:
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../tools/" )
from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""




from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")


from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0
print temp_counter

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print path
            email = open(path, "r")
            j = parseOutText(email)
            print j
            print temp_counter

            ### use parseOutText to extract the text from the opened email

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]

            ### append the text to word_data

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris


            email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )





### in Part 4, do TfIdf vectorization here

In [None]:
ls

In [24]:
path = os.path.join('..', path[:-1])
print path

../../maildir/bailey-s/deleted_items/101


In [25]:
cd ../../maildir/bailey-s/deleted_items/101

[Errno 2] No such file or directory: '../../maildir/bailey-s/deleted_items/101'
/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning


In [26]:
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../tools/" )
from parse_out_email_text import parseOutText


from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")


There are three ways to read the contents of a file: .read(), .readline() and .readlines().

.read() returns the entire contents of the file at once as a single string. If you call .read() two times on the same file object it will return '' the second time, because the file position is already at the end.

The .readline() method reads the file one line at a time; each call returns the next line.

.readlines() returns all the lines at once, as a list.

In [27]:
from_sara  = open("from_sara.txt", "r")
from_sara.read()

'maildir/bailey-s/deleted_items/101.\nmaildir/bailey-s/deleted_items/106.\nmaildir/bailey-s/deleted_items/132.\nmaildir/bailey-s/deleted_items/185.\nmaildir/bailey-s/deleted_items/186.\nmaildir/bailey-s/deleted_items/187.\nmaildir/bailey-s/deleted_items/193.\nmaildir/bailey-s/deleted_items/195.\nmaildir/bailey-s/deleted_items/214.\nmaildir/bailey-s/deleted_items/215.\nmaildir/bailey-s/deleted_items/233.\nmaildir/bailey-s/deleted_items/242.\nmaildir/bailey-s/deleted_items/243.\nmaildir/bailey-s/deleted_items/244.\nmaildir/bailey-s/deleted_items/246.\nmaildir/bailey-s/deleted_items/247.\nmaildir/bailey-s/deleted_items/254.\nmaildir/bailey-s/deleted_items/259.\nmaildir/bailey-s/deleted_items/260.\nmaildir/bailey-s/deleted_items/261.\nmaildir/bailey-s/deleted_items/263.\nmaildir/bailey-s/deleted_items/278.\nmaildir/bailey-s/deleted_items/290.\nmaildir/bailey-s/deleted_items/296.\nmaildir/bailey-s/deleted_items/302.\nmaildir/bailey-s/deleted_items/306.\nmaildir/bailey-s/deleted_items/307.\n

In [28]:
from_sara.read()

''
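
The empty string comes from the file position sitting at the end of the file after the first read. This is the reason parseOutText() starts with f.seek(0). A minimal sketch in Python 3 syntax, using io.StringIO as a stand-in for an opened file and two made-up paths:

```python
import io

# io.StringIO behaves like an opened text file, so no file on disk is needed.
handle = io.StringIO("maildir/a/1.\nmaildir/a/2.\n")

first = handle.read()   # the whole contents
second = handle.read()  # '' because the position is now at the end
handle.seek(0)          # rewind to the beginning
third = handle.read()   # the whole contents again

print(repr(first))
print(repr(second))
print(first == third)
```

Reopening the file, as the cells here do, has the same effect as seek(0) but leaves the old file object open.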

In [29]:
from_sara  = open("from_sara.txt", "r")
from_sara.readline()

'maildir/bailey-s/deleted_items/101.\n'

In [30]:
from_sara.readline()

'maildir/bailey-s/deleted_items/106.\n'

See how the number is going up: each call to .readline() returns the next line of the file.

In [31]:
from_sara  = open("from_sara.txt", "r")
myList = from_sara.readlines()

In [32]:
myList[0]

'maildir/bailey-s/deleted_items/101.\n'

In [33]:
myList[1:3]

['maildir/bailey-s/deleted_items/106.\n',
 'maildir/bailey-s/deleted_items/132.\n']

Then you can loop through the list of lines.

In [34]:
for x in myList[:4]: 
    print x

maildir/bailey-s/deleted_items/101.

maildir/bailey-s/deleted_items/106.

maildir/bailey-s/deleted_items/132.

maildir/bailey-s/deleted_items/185.



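We can also iterate directly over the file object, which yields one line at a time. Each line already ends in a newline and print adds another, hence the blank lines in the output: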

In [35]:
from_sara  = open("from_sara.txt", "r")
for line in from_sara:
    print line

maildir/bailey-s/deleted_items/101.

maildir/bailey-s/deleted_items/106.

maildir/bailey-s/deleted_items/132.

maildir/bailey-s/deleted_items/185.

maildir/bailey-s/deleted_items/186.

maildir/bailey-s/deleted_items/187.

maildir/bailey-s/deleted_items/193.

maildir/bailey-s/deleted_items/195.

maildir/bailey-s/deleted_items/214.

maildir/bailey-s/deleted_items/215.

maildir/bailey-s/deleted_items/233.

maildir/bailey-s/deleted_items/242.

maildir/bailey-s/deleted_items/243.

maildir/bailey-s/deleted_items/244.

maildir/bailey-s/deleted_items/246.

maildir/bailey-s/deleted_items/247.

maildir/bailey-s/deleted_items/254.

maildir/bailey-s/deleted_items/259.

maildir/bailey-s/deleted_items/260.

maildir/bailey-s/deleted_items/261.

maildir/bailey-s/deleted_items/263.

maildir/bailey-s/deleted_items/278.

maildir/bailey-s/deleted_items/290.

maildir/bailey-s/deleted_items/296.

maildir/bailey-s/deleted_items/302.

maildir/bailey-s/deleted_items/306.

maildir/bailey-s/deleted_items/307.

m

In [36]:
os.path.abspath(from_sara.name)

'/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning/from_sara.txt'

In [37]:
temp_counter = 0
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    temp_counter += 1
    if temp_counter < 20000:
        print name
        print from_person.read()
        #print "I am hitting the loop for the %s time", % temp_counter
    

sara

chris
maildir/donohoe-t/inbox/253.
maildir/germany-c/_sent_mail/1.
maildir/germany-c/_sent_mail/10.
maildir/germany-c/_sent_mail/100.
maildir/germany-c/_sent_mail/1000.
maildir/germany-c/_sent_mail/1001.
maildir/germany-c/_sent_mail/1002.
maildir/germany-c/_sent_mail/1003.
maildir/germany-c/_sent_mail/1004.
maildir/germany-c/_sent_mail/1005.
maildir/germany-c/_sent_mail/1006.
maildir/germany-c/_sent_mail/1007.
maildir/germany-c/_sent_mail/1008.
maildir/germany-c/_sent_mail/1009.
maildir/germany-c/_sent_mail/101.
maildir/germany-c/_sent_mail/1010.
maildir/germany-c/_sent_mail/1011.
maildir/germany-c/_sent_mail/1012.
maildir/germany-c/_sent_mail/1013.
maildir/germany-c/_sent_mail/1014.
maildir/germany-c/_sent_mail/1015.
maildir/germany-c/_sent_mail/1016.
maildir/germany-c/_sent_mail/1017.
maildir/germany-c/_sent_mail/1018.
maildir/germany-c/_sent_mail/1019.
maildir/germany-c/_sent_mail/102.
maildir/germany-c/_sent_mail/1020.
maildir/germany-c/_sent_mail/1021.
maildir/germany-c/_sen

Nothing prints under "sara" because the from_sara file object was already read to the end by the for loop in the earlier cell, so .read() returns ''. The from_chris file had not been read yet, so its full contents print.

Since I don't see the text of the emails, the suggestion is to re-run the [set up code for project 1](https://discussions.udacity.com/t/cant-read-in-the-data-of-clean-away-signature-words/6627), which is the script startup.py in the tools directory.

In [38]:
cd ../tools/

/Users/michaelreinhard/nano/machineLearning/ud120-projects/tools


In [39]:
%pwd

u'/Users/michaelreinhard/nano/machineLearning/ud120-projects/tools'

In [40]:
%ls

email_authors.pkl          parse_test_my.py
email_preprocess.py        python2_lesson06_keys.pkl
email_preprocess.pyc       python2_lesson13_keys.pkl
feature_format.py          python2_lesson14_keys.pkl
feature_format.pyc         startup.py
parse_out_email_text.py    word_data.pkl
parse_out_email_text.pyc


In [41]:
%run startup.py


checking for nltk
checking for numpy
checking for scipy
checking for sklearn

downloading the Enron dataset (this may take a while)
to check on progress, you can cd up one level, then execute <ls -lthr>
Enron dataset should be last item on the list, along with its current size
download will complete at about 423 MB
download complete!

unzipping Enron dataset (this may take a while)
you're ready to go!
