**Lesson 10: text_learning**

sklearn's tools for text learning are built on the bag of words model. The sklearn implementation is called CountVectorizer and lives in the feature_extraction.text module.

The bag of words model allows us to analyze documents of different sizes along a common set of dimensions, specifically the words that occur in any of the documents.
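To make the idea concrete, here is a minimal pure-Python sketch of a bag of words. It is not how sklearn implements CountVectorizer; it only mimics the default behaviour of lowercasing and keeping tokens of two or more word characters. The documents and the names `tokenize` and `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

# Same token rule as CountVectorizer's default: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    """Lowercase the document and return its tokens."""
    return TOKEN_PATTERN.findall(document.lower())

def bag_of_words(documents):
    """Return (vocabulary, rows): vocabulary maps word -> column index,
    rows holds one list of counts per document."""
    token_lists = [tokenize(doc) for doc in documents]
    words = sorted(set(word for tokens in token_lists for word in tokens))
    vocabulary = dict((word, index) for index, word in enumerate(words))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[word] for word in words])
    return vocabulary, rows

# Made-up documents of different lengths.
docs = ["Love, love, LOVE!", "I hate to love it", "Hate"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows)
```

Every document ends up as a row of the same width, one column per word in the shared vocabulary, no matter how long the document is.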

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer takes a collection of documents rather than a single string, so to get our strings into it we first put them into a list. 

In [63]:
string1 = "My name is Michael. (Love, Love, Love) (Hate, hate hate) Here are some words that we might find in an email"
string2 = "My name is Michael. The new Star Wars sucks, but it doesn't suck as bad as the others."
string3 = "My name is Michael. When in the course of human events, it becomes necessary..."

email_list = [string1, string2, string3]

Now we turn the list into a bag of words: every document is a row and every possible word, i.e., any word that occurs at least once in the corpus, is a column. 

In [64]:
print email_list

['My name is Michael. (Love, Love, Love) (Hate, hate hate) Here are some words that we might find in an email', "My name is Michael. The new Star Wars sucks, but it doesn't suck as bad as the others.", 'My name is Michael. When in the course of human events, it becomes necessary...']


In [65]:
vectorizer = CountVectorizer()

In [66]:
bag_of_words = vectorizer.fit(email_list)

In [67]:
bag_of_words

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [68]:
bag_of_words = vectorizer.transform(email_list)

In [69]:
bag_of_words[2,15]

1

In [70]:
print bag_of_words

  (0, 0)	1
  (0, 1)	1
  (0, 8)	1
  (0, 10)	1
  (0, 11)	3
  (0, 12)	1
  (0, 14)	1
  (0, 15)	1
  (0, 17)	3
  (0, 18)	1
  (0, 19)	1
  (0, 20)	1
  (0, 21)	1
  (0, 26)	1
  (0, 30)	1
  (0, 33)	1
  (0, 35)	1
  (1, 2)	2
  (1, 3)	1
  (1, 5)	1
  (1, 7)	1
  (1, 15)	1
  (1, 16)	1
  (1, 18)	1
  (1, 20)	1
  (1, 21)	1
  (1, 23)	1
  (1, 25)	1
  (1, 27)	1
  (1, 28)	1
  (1, 29)	1
  (1, 31)	2
  (1, 32)	1
  (2, 4)	1
  (2, 6)	1
  (2, 9)	1
  (2, 13)	1
  (2, 14)	1
  (2, 15)	1
  (2, 16)	1
  (2, 18)	1
  (2, 20)	1
  (2, 21)	1
  (2, 22)	1
  (2, 24)	1
  (2, 31)	1
  (2, 34)	1


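So each tuple is a (document index, word index) pair, and the number to its right is how many times that word occurs in that document. 

So, the row that has the values (0, 16) 1 [this is in the Udacity Video] means that the word with index 16 in the vocabulary occurred once in document 0. The 16 is a column number, not a position within the document. 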

There are still some things I don't understand, though. I have included the same sentence, "My name is Michael", three times in the data set but there are no words with the frequency 3 in the data set. Also, all three strings begin with the same four words but in the data set the documents do not begin the same way. 

Ahh, the frequency number is the number of times it occurred in that document. If I want to see a 3 I have to repeat the same word three times in the same document. I'll just confirm that by changing string1 above. 

Ok, really cool. The words 'love' and 'hate' occur three times. I show the two 3s in the vectorized document and I show that the vectorizer is case insensitive. Also, we see that the number applied to the word has nothing to do with where it occurs in the document.

We can find out what number a particular word has by calling get on the vectorizer's vocabulary_ attribute, which is a dictionary mapping each word to its column index. 

In [71]:
vectorizer.vocabulary_.get("love")

17

Get the word from the number? 
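One way to go from the number back to the word, sketched here rather than taken from the course, is to invert the vocabulary_ dictionary. The name `index_to_word` is introduced for this example; the three strings are the ones defined earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

email_list = [
    "My name is Michael. (Love, Love, Love) (Hate, hate hate) Here are some words that we might find in an email",
    "My name is Michael. The new Star Wars sucks, but it doesn't suck as bad as the others.",
    "My name is Michael. When in the course of human events, it becomes necessary...",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(email_list)

# vocabulary_ maps word -> column index; swap keys and values to go the other way.
index_to_word = dict((index, word) for word, index in vectorizer.vocabulary_.items())

print(index_to_word[17])   # the word behind column 17
print(counts[0, 17])       # how often that word occurs in document 0
```

Because the columns are assigned from the sorted vocabulary, the index of a word says nothing about where the word appears in any document.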

**Stopwords in NLTK**

*Get a list of stopwords from the Natural Language Toolkit*

In [72]:
import nltk

In [73]:
from nltk.corpus import stopwords

In [None]:
nltk.download() # don't have to run this every time. 

In [74]:
sw = stopwords.words("english")

In [75]:
sw[0]

u'i'

In [76]:
len(sw)

127

So now we know how many stopwords there are in NLTK's English stopword list. 
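Once you have a stopword list, removing stopwords is just a membership test per word. The sketch below uses a tiny hand-picked set as a stand-in for the NLTK list, so it runs without the corpus download; `SAMPLE_STOPWORDS` and `remove_stopwords` are names made up for this example.

```python
# A tiny hand-picked stand-in for stopwords.words("english"); the real list is much longer.
SAMPLE_STOPWORDS = set(["i", "my", "is", "the", "in", "of", "it", "when"])

def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase, split on whitespace, and drop any word found in stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]

kept = remove_stopwords("My name is Michael")
print(kept)
```

With NLTK available, passing `set(sw)` as the second argument should apply the full list instead.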

Not all unique strings are really different words, though; many are just variations on the same root. That is why we need the stemmer. 

This is the snowball stemmer. It works like an sklearn estimator: you first create the stemmer object, which exists independently of the data that it is later applied to. 

In [1]:
from nltk.stem.snowball import SnowballStemmer

In [2]:
stemmer = SnowballStemmer("english")

In [3]:
stemmer.stem("responsivity")

u'respons'

**Order of Operations in Text Processing**

Always do your stemming before you do your bag of words text processing. After all, the whole point of stemming is to save time and count the right things. You don't want to count irrelevant variations on words as entirely different things. 
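A small sketch of why the order matters. The `toy_stem` function below is a crude suffix stripper invented for this example, not the Snowball algorithm; it is only there to show how the counts change when variations are merged before counting.

```python
from collections import Counter

def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

words = "suck sucks sucked sucking".split()

# Counting first treats every variation as its own feature.
counts_without_stemming = Counter(words)

# Stemming first collapses the variations into a single feature.
counts_with_stemming = Counter(toy_stem(word) for word in words)

print(len(counts_without_stemming))
print(counts_with_stemming)
```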

**Weighting Terms by Frequency**

The idea is to weight words inversely by how common they are across the corpus. 

*Tf* is for term frequency

*Idf* is for inverse document frequency: a weight that gets larger the fewer documents in the corpus contain the term. You want to give a higher weight to the rare words in the corpus. They are what make your document stand out. When words that are uncommon in the corpus as a whole are frequent in a document, those words are especially salient for determining the unique content of the document. So the whole process is TfIdf. 

Tf is just the standard weighting of how often a word occurs in a document. Idf gives more weight to words that do NOT occur often in the corpus as a whole. So TfIdf gives the highest weights to words that are rare in the corpus but occur a lot in the document. 
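A worked example of the weighting, as a sketch. It uses the plain textbook formula, tf times log(N / document frequency), on made-up documents. sklearn's TfidfTransformer uses a smoothed idf and normalizes each row by default, so its numbers will differ, but the ordering intuition is the same.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one dict per document mapping word -> tf * idf,
    with tf = raw count in the document and idf = log(N / document frequency)."""
    token_lists = [doc.lower().split() for doc in documents]
    n_docs = len(token_lists)
    document_frequency = Counter()
    for tokens in token_lists:
        document_frequency.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in token_lists:
        term_frequency = Counter(tokens)
        weights.append(dict(
            (word, count * math.log(float(n_docs) / document_frequency[word]))
            for word, count in term_frequency.items()
        ))
    return weights

docs = [
    "my name is michael love love love",
    "my name is michael star wars",
    "my name is michael human events",
]
weights = tf_idf(docs)
print(weights[0]["love"])     # rare in the corpus, frequent in this document
print(weights[0]["michael"])  # occurs in every document
```

A word that appears in every document gets an idf of log(1) = 0 under this formula, which is exactly the "common words carry no information" intuition.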

**Mini-Project**

Now we are going to start doing some of the pre-processing ourselves.

In the previous projects we received the data in the TfIdf form. Now we have to get them to that stage ourselves. 

We get two text files, one listing emails from Sara, one listing emails from Chris. We have to access the parseOutText() function that takes an opened email as an argument and returns a string containing all the stemmed words in the email. 

The first task in the project is to run parseOutText() on an email. The function is defined in the script *parse_out_email_text.py*. The script calls the function on a sample email. 

The instructions are: "We currently have this script set up so that it will print the text of the email to the screen, what is the text that you get when you run parseOutText()?"

In [10]:
%run "../tools/parse_out_email_text.py"

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project 


**Deploying Stemming**

In parseOutText(), comment out the following line: 

words = text_string 

Augment parseOutText() so that the string it returns has all the words stemmed using a SnowballStemmer (use the nltk package, some examples that I found helpful can be found here: http://www.nltk.org/howto/stem.html ). Rerun parse_out_email_text.py, which will use your updated parseOutText() function--what’s your output now?

Hint: you'll need to break the string down into individual words, stem each word, then recombine all the words into one string.

*My Answer*

So I am going to modify *parse_out_email_text.py* in my notebook and try to run it from there. First I will load the script's contents into my notebook.

So I run this in the cell below: 
```
%load "../tools/parse_out_email_text.py"
```

The magic command then appears commented out at the top of the cell, with the text of the script filled in below it. 

In [1]:
#!/usr/bin/python

from nltk.stem.snowball import SnowballStemmer
import string

def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
        Hint: you'll need to break the string down into 
        individual words, stem each word, then recombine all 
        the words into one string.
        
        """


    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()

    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        ### remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        ### project part 2: comment out the line below
        #words = text_string

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)
        
        # get the stemmer
        stemmer = SnowballStemmer("english", ignore_stopwords=True)
        text_string = text_string.split()
        for word in text_string:
            words += stemmer.stem(word) + " "

        




    return words

    

def main():
    ff = open("../text_learning/test_email.txt", "r")
    text = parseOutText(ff)
    print text



if __name__ == '__main__':
    main()


hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project 


In [2]:
%run "../tools/parse_out_email_text.py"

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project 


So now it works both ways. That is good. 
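One caveat: the punctuation-stripping call in the script, `translate(string.maketrans("", ""), string.punctuation)`, only works on Python 2 strings. Below is a sketch of the same strip, split, stem and rejoin steps written so that it should also run on Python 3. The function name `stem_text` and the sample sentence are made up for this example, and the stemmer is created without ignore_stopwords so that no corpus download is needed.

```python
import string
from nltk.stem.snowball import SnowballStemmer

def stem_text(text):
    """Strip punctuation, stem each word, and rejoin with single spaces."""
    # Drop punctuation character by character; works on Python 2 and 3 strings.
    cleaned = "".join(ch for ch in text if ch not in string.punctuation)
    stemmer = SnowballStemmer("english")
    return " ".join(stemmer.stem(word) for word in cleaned.split())

# Made-up sample text, not the contents of test_email.txt.
print(stem_text("Hi everyone! Please read this message."))
```

Joining with `" ".join(...)` also avoids the trailing space that repeated `+= " "` leaves at the end of the string.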

**Vectorize Text**

Now we are going to look at the vectorizing of text. We have the text stemmed. Now we want to turn those words into numbers that sklearn can work with. 

The file is *vectorize_text.py*. It imports parseOutText(), and I alter the code to loop through the emails and parse their text. We feed it one email at a time and it returns a stemmed text string.

Then we go through the strings and take out the proper names. 

Then we append the updated text strings to the word_data list. And we append the author label, 0 for Sara and 1 for Chris, to the from_data list. 
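A minimal sketch of the name-scrubbing step on a made-up string. The helper name `scrub_names` is introduced here for illustration; the list of names is the one used in the project.

```python
def scrub_names(text, names):
    """Remove every occurrence of each name from text."""
    for name in names:
        # name is the loop variable; a quoted literal here would be removed instead.
        text = text.replace(name, "")
    return text

drop_names = ["sara", "shackleton", "chris", "germani"]

# Made-up stemmed text standing in for one parsed email.
sample = "thank sara shackleton enron wholesal servic"
scrubbed = scrub_names(sample, drop_names)
print(scrubbed)
```

Note that str.replace leaves the surrounding spaces behind, so the scrubbed text contains runs of spaces where the names used to be. That is harmless for the vectorizer, which splits on token boundaries anyway.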

In [3]:
%pwd

u'/Users/michaelreinhard/nano/machineLearning/ud120-projects/text_learning'

In [5]:
# I have code that runs and seems to work but the output is not accepted by the grader for the signature scrubbing quiz in Lesson 10. Here is my code: 

# %load vectorize_text.py
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../tools/" )
from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""


from_sara  = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")

# trying to find out why no data gets to the parseOutText program
# print from_sara
# This returned a proper file object

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            #print path
            email = open(path, "r")
            
            
            ### use parseOutText to extract the text from the opened email
            #email = parseOutText(email)
            

            text = parseOutText(email)
            #print text
            ### use str.replace() to remove any instances of the words
            
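            drop_names = ["sara", "shackleton", "chris", "germani"]
            for drop_name in drop_names:
                # pass the loop variable itself: text.replace("i", "") strips every
                # letter "i" (as in the output recorded below) and leaves the names in place
                text = text.replace(drop_name, "")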
            
            ### append the text to word_data
            
            word_data.append(text)

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris

            if name == "sara":
                from_data.append(0)
            else: 
                from_data.append(1)
            
            email.close()

print "emails processed"
print word_data[152]
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )





### in Part 4, do TfIdf vectorization here

emails processed
tjonesnsf stephan and sam need nymex calendar 


**TfIdf It**

Transform the word_data into a tf-idf matrix using the sklearn TfIdf transformation. 

Remove english stopwords.

You can access the mapping between words and feature numbers using get_feature_names(), which returns a list of all the words in the vocabulary. 

How many different words are there?

*Me*

So, I am going to start by asking whether I can access the word_data object we have already created, and figuring out what it is.

In [7]:
word_data[0]

u'sbale2 nonprvlegedpst susan pleas send the forego lst to rchard thank sara shackleton enron wholesal servc 1400 smth street eb3801a houston tx 77002 ph 713 8535620 fax 713 6463490 '

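So it looks like a list of strings that contain the texts of emails. The leading tokens are probably what is left of the X-FileName header line, since parseOutText() keeps everything after "X-FileName:", so the 2 at the end of the name is likely part of a file name rather than an '@' sign that got mangled in the conversion to unicode. Note also that every letter "i" is missing from the text ("lst", "rchard", "servc") while "sara shackleton" is still there, which means the name-removal loop that produced this output stripped the character "i" instead of the names in drop_names. 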

I will confirm that it is a list. 

In [8]:
type(word_data)

list

So now we should import the CountVectorizer and the TfidfTransformer.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [18]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(word_data)

In [22]:
X

<199x3215 sparse matrix of type '<type 'numpy.int64'>'
	with 18304 stored elements in Compressed Sparse Row format>

In [23]:
vectorizer.get_feature_names()
# numbers are treated as features too: the default token_pattern (\w\w+) also matches digits

[u'00',
 u'003680684doc',
 u'0039',
 u'01',
 u'0107',
 u'0114',
 u'0121',
 u'012159',
 u'0125',
 u'0126',
 u'0130',
 u'0131',
 u'0141',
 u'0143',
 u'0148',
 u'0158',
 u'02',
 u'020',
 u'0202112',
 u'0205',
 u'02052000',
 u'0207',
 u'02072001',
 u'0212',
 u'02152001',
 u'02162001',
 u'0219',
 u'02192001',
 u'02202001',
 u'0224',
 u'0227',
 u'022701pdf',
 u'023048',
 u'0240',
 u'0241',
 u'0246',
 u'0247',
 u'0250',
 u'0253',
 u'0254',
 u'0255',
 u'0256',
 u'0293',
 u'0304',
 u'03052001',
 u'0308',
 u'0313',
 u'03132000',
 u'031401',
 u'03142001',
 u'03182001',
 u'0320',
 u'03212001',
 u'0322',
 u'0325',
 u'034836',
 u'04042001',
 u'04052001',
 u'0410',
 u'0411',
 u'041201',
 u'04122001',
 u'04132000',
 u'0415',
 u'04172001',
 u'04252000',
 u'0426',
 u'0428',
 u'0429',
 u'0432',
 u'0433',
 u'043632',
 u'0438',
 u'0441',
 u'0442',
 u'0444',
 u'0445',
 u'0450',
 u'0457',
 u'0458',
 u'05',
 u'0501',
 u'05012000',
 u'05022000',
 u'0504',
 u'05042000',
 u'05052000',
 u'0508',
 u'05142001',
 ...

In [27]:
transformer = TfidfTransformer()
transformer.fit_transform(X)

<199x3215 sparse matrix of type '<type 'numpy.float64'>'
	with 18304 stored elements in Compressed Sparse Row format>
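The task also asks for English stopwords to be removed, which the vectorizer above, created with default arguments, does not do. The sketch below runs on a made-up stand-in for word_data and shows two things: passing stop_words="english" to the vectorizer, and that TfidfVectorizer gives the same matrix in one step as CountVectorizer followed by TfidfTransformer. The number of different words is then the size of the fitted vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up stand-in for word_data.
sample_word_data = [
    "pleas send the forego list to richard",
    "thank you for the nymex calendar",
    "the calendar need an updat",
]

# Two steps: count the words, then re-weight the counts.
counts = CountVectorizer(stop_words="english").fit_transform(sample_word_data)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both.
tfidf = TfidfVectorizer(stop_words="english")
one_step = tfidf.fit_transform(sample_word_data)

print(np.allclose(two_step.toarray(), one_step.toarray()))
print(len(tfidf.vocabulary_))      # number of different words kept
print("the" in tfidf.vocabulary_)  # stopwords are gone
```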