## Sidenotes (definitions, code snippets, resources, etc.)
__definition:__ deprecated vs. obsolete
- _deprecated_: a status applied to software features to indicate that they should be avoided, typically because they have been superseded. 
    - Features are deprecated rather than removed in order to provide backward compatibility, giving programmers who have used the feature time to bring their code into compliance with the new standard.
- _obsolete_: Obsolete features are those deemed redundant and are highly likely to be discontinued in the next few releases.

### Useful Python 2 Code Snippets
- `raise SystemExit(0)` stops a Python script with exit code 0 (like `break`, but usable outside a loop)
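
A minimal sketch of the early-exit pattern. The `main()` wrapper and the `input_ok` flag are hypothetical names used only for illustration:

```python
def main(input_ok):
    """Stop the script early when there is nothing to do."""
    if not input_ok:
        # Exit code 0 signals a normal, successful termination.
        raise SystemExit(0)
    return "processed"


if __name__ == "__main__":
    print(main(input_ok=True))
```

Because `SystemExit` is an ordinary exception, any `finally` blocks still run on the way out.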


### Techniques
`temp_counter`, as used in the original `vectorize_text.py`:
- limits processing to a subset of the dataset
- useful during development, so modifications can be tested more quickly
- remove it once the code works, to run through the whole dataset
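
The idea can be sketched as follows. This is not the code from `vectorize_text.py`: the function name `process_emails`, the `limit` parameter, and the placeholder per-item work are all assumptions made for illustration.

```python
def process_emails(paths, limit=200):
    """Process at most `limit` items; pass limit=None for the full run."""
    processed = []
    temp_counter = 0
    for path in paths:
        if limit is not None and temp_counter >= limit:
            break  # development shortcut: stop after the first `limit` items
        temp_counter += 1
        processed.append(path.upper())  # stand-in for the real per-email work
    return processed


sample = ["email_%d.txt" % i for i in range(1000)]
subset = process_emails(sample, limit=200)   # quick development run
full = process_emails(sample, limit=None)    # whole dataset
```

Making the limit a parameter means the cutoff can be disabled without editing the loop body.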

### IPython functions
#### Exploring accessible variables:
- `dir()` lists the names in the current scope
- `globals()` gives a dictionary of global variables
- `locals()` gives a dictionary of local variables
- `whos` lists currently defined variables (or `who` for less detail)
- `sys.modules` is a dictionary variable that maps all names of all currently loaded modules to module objects.
    - The contents of this dictionary are used to determine whether import loads a fresh copy of a module (reimporting a module does not rerun the code, just references namespace).
    - See [Python Essential Reference 4th Edition by David M. Beazley (2009)](https://www.evernote.com/shard/s37/nl/1033921335/e9944533-f7e9-4614-b5aa-41dbe8aa9900/) for more on modules and packages (chapter 8)
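
A small sketch of the module cache in action, written for Python 3 (where `reload` lives in `importlib`; in Python 2 it is a built-in). The standard-library module `colorsys` is an arbitrary choice for the demonstration:

```python
import importlib
import sys

import colorsys                      # first import runs the module and caches it
cached = sys.modules["colorsys"]     # the cache maps module name -> module object

import colorsys as colorsys_again    # re-import: no code is re-run, cache is reused
same_object = colorsys_again is cached

# Explicitly re-execute the module's code; the same module object is updated.
reloaded = importlib.reload(colorsys)
```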

Applying built-in function `vars`([_object_]):
- Potentially very useful, needs some filtering/formatting
- `vars()` without an argument behaves like `locals()`
    - only useful for reads, updates to returned dictionary are ignored
    - also takes an optional argument to find out which vars are defined within an object itself
- To get the names:
    ```
    for name in vars().keys():
        print(name)
    ```
- To get the values:
    ```
    for value in vars().values():
        print(value)
    ```
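
One possible way to do the filtering mentioned above is to hide underscore-prefixed names. The helper `public_names` and the `Email` class are hypothetical, introduced only for this sketch:

```python
def public_names(namespace):
    """Return the sorted names in a namespace dict that don't start with '_'."""
    return sorted(name for name in namespace if not name.startswith("_"))


class Email(object):
    def __init__(self):
        self.sender = "sara"
        self._raw = "raw text"


# vars(obj) returns the object's own attribute dictionary (obj.__dict__).
email_fields = public_names(vars(Email()))

# vars() with no argument is the current namespace; filtering hides the
# underscore-prefixed entries that Python and IPython add.
visible = public_names(vars())
```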


### String Functions (deprecated in Python 2)
Used in `../tools/parse_out_email_text.py` for quizzes below
[Python 2 documentation](https://docs.python.org/2/library/string.html#string-functions)
- [Extra: Python 3 documentation](https://docs.python.org/3.1/library/string.html)
- It is always true that `string.join(string.split(s, sep), sep)` equals _s_.
- `string.translate`(s, table[, deletechars])
    - Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.
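
The sketch below shows the same two ideas with the string *methods* that replaced these functions, written for Python 3. The sample sentence is made up; in Python 2 the deletion step would be spelled `string.translate(s, None, string.punctuation)`.

```python
import string

s = "Hi, Everyone! If you can read this..."
sep = " "

# Splitting on an explicit separator and re-joining with it round-trips.
round_trip = sep.join(s.split(sep)) == s

# Python 3: a table built with str.maketrans("", "", chars) deletes chars.
delete_punct = str.maketrans("", "", string.punctuation)
cleaned = s.translate(delete_punct)
```

Here `cleaned` keeps the words and spaces but loses the comma, exclamation mark, and periods.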

## Dimensions when Learning From Text
- Length of text is variable, so input dimensions will vary too
- Solution: use dictionary with word : frequency (bag of words)

### Bag of Words
Properties
- Does not consider word order data (hence "bag")
- Count value means longer phrases give different input vectors
    - "Biasing to encode the length of the text"
- Cannot handle complex phrases unless extra bags are created for them
    - e.g. classic example of complex phrase is "chicago bulls" vs "chicago" + "bulls"
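
These properties can be seen with a hand-rolled bag of words. This is only an illustrative sketch built on `collections.Counter`; the helper name `bag_of_words` and the sample phrases are assumptions:

```python
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word to the number of times it occurs."""
    return Counter(text.lower().split())


a = bag_of_words("nice day")
b = bag_of_words("day nice")           # same words, different order
c = bag_of_words("nice day nice day")  # same words, repeated
phrase = bag_of_words("chicago bulls")

order_ignored = (a == b)       # word order is lost
length_matters = (a != c)      # counts grow with the length of the text
phrase_is_split = (phrase == Counter({"chicago": 1, "bulls": 1}))
```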

### Sklearn - Bag of Words
#### [`CountVectorizer()`][alg]
`vectorizer.fit_transform()`
- returns a sparse matrix of counts; printing it lists `(document_index, feature_index)  freq` entries
- `.fit()` learns the vocabulary, assigning each distinct word a feature index
- `.transform()` counts the occurrences of each vocabulary word in each document
- `vectorizer.vocabulary_.get("word")` gets feature number of word
    from `vocabulary_` attribute
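
A short sketch of these calls, assuming scikit-learn is installed. The two sample documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["hello world hello", "goodbye world"]

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(docs)   # sparse matrix, shape (n_docs, n_words)

vocab = vectorizer.vocabulary_         # dict: word -> feature (column) number
hello_col = vocab.get("hello")         # feature number of "hello"
missing = vocab.get("absent")          # None for words not in the vocabulary

counts = bag.toarray()                 # dense copy, convenient for small demos
hello_in_first_doc = counts[0][hello_col]
```

Converting to a dense array is only sensible for tiny examples; real corpora are usually kept sparse.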

[User Guide: 4.2.3. Text feature extraction][user guide]

Extra [Tutorial: Working With Text Data][tutorial]

[alg]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
[user guide]: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
[tutorial]: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

### `nltk` corpus/vocabulary library
[NLTK 3.0 Documentation](http://www.nltk.org/)
- NOTE: Need to download the corpora first with `nltk.download()`

##### Stopwords
- _Definition_: low information, high frequency
- Examples of low-info words from quiz: the, will, hi, names (since emails only from two people)

In [None]:
# Quiz: how many English stopwords?
from nltk.corpus import stopwords
sw = stopwords.words('english')
# sw is a list of lowercase stopword strings
print(len(sw))

##### Stemming
- Not all unique words have different meanings
- Stemmers are functions that take words with similar meanings and group them into a single feature (one dimension)
    - e.g. response, responsive, respond... -> respon
- computational linguists make these functions

[`nltk.stem.snowball.`**`SnowballStemmer()`**][doc] module documentation
- An example stemmer in `nltk`

[doc]: http://www.nltk.org/api/nltk.stem.html?highlight=snowballstemmer#nltk.stem.snowball.SnowballStemmer

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
print(stemmer.stem('responsiveness'))  # = respons
print(stemmer.stem('unresponsive'))  # = unrespons
# this difference in output is a potential limitation
#      might not be what we want

### Order of Operations in Text Processing
1. Stemming
2. Bag-of-words representation

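Stemming comes first because it changes which words end up in the vocabulary, and therefore the bag-of-words output. It is also more practical: a stemmer operates on word strings, whereas the bag-of-words output is a matrix of counts rather than text.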

### Weighting Words (features)
#### Tf vs. Idf representations
- _Term frequency (Tf)_: like the bag-of-words representation, weighting words by their frequency within a document
- _Inverse document frequency (Idf)_: weighting words by how rare they are across the documents of the corpus as a whole
    - weights rare words higher than more common words; rare words better distinguish differences between documents
    - covered in upcoming mini-project
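
A worked sketch of the idf idea, using the textbook formula log(N / document frequency). The three toy documents and the helper name `idf` are assumptions; scikit-learn applies a smoothed variant by default, so its numbers differ.

```python
import math

docs = [
    ["the", "meeting", "is", "today"],
    ["the", "contract", "is", "signed"],
    ["the", "nymex", "calendar"],
]


def idf(word, documents):
    """Textbook idf: log(N / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / float(containing))


idf_the = idf("the", docs)      # in every document: log(3/3) = 0.0
idf_is = idf("is", docs)        # in two documents:  log(3/2)
idf_nymex = idf("nymex", docs)  # in one document:   log(3/1)
```

The rarer the word, the larger its weight, which is why "nymex" says more about a document than "the".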
    
[`sklearn.feature_extraction.text.`**`TfidfVectorizer`**][tfidf doc] module documentation
- Combines all the options of CountVectorizer and TfidfTransformer in a single model

[4.2.3. Text feature extraction][userguide] user guide linked in documentation

[tfidf doc]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[userguide]: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

### Mini-Project! on Preprocessing & Text Learning
In the beginning of this class, you identified emails by their authors using a number of supervised classification algorithms. In those projects, we handled the preprocessing for you, transforming the input emails into a TfIdf so they could be fed into the algorithms. Now you will construct your own version of that preprocessing step, so that you are going directly from raw data to processed features.

#### Quiz: Warm up
- You will be given two text files: one contains the locations of all the emails from Sara, the other has emails from Chris. You will also have access to the parseOutText() function, which accepts an opened email as an argument and returns a string containing all the (stemmed) words in the email.
- You’ll start with a warmup exercise to get acquainted with parseOutText(). Go to the tools directory and run parse_out_email_text.py, which contains parseOutText() and a test email to run this function over.
- __parseOutText()__ takes the opened email and returns only the text part, stripping away any metadata that may occur at the beginning of the email, so what's left is the text of the message. We currently have this script set up so that it will print the text of the email to the screen. What is the text that you get when you run parseOutText()?
- _A hint when submitting_: the words in the string that you get have TWO spaces between them; make sure your answer does too!

In [None]:
# atom /Users/mdlynch37/Documents/udacity/intro_to_ml/ud120-projects/tools/parse_out_email_text.py
%run ../tools/parse_out_email_text.py

`Out [ ]: Hi Everyone  If you can read this message youre properly using parseOutText  Please proceed to the next part of the project`

#### Quiz: Deploying Stemming
- In parseOutText(), comment out the following line: `words = text_string`
- Augment parseOutText() so that the string it returns has all the words stemmed using a SnowballStemmer.
- Use the nltk package; some examples that I found helpful can be found in the [nltk howto guide](http://www.nltk.org/howto/stem.html). Rerun parse_out_email_text.py, which will use your updated parseOutText() function--what’s your output now?
    - Hint: you'll need to break the string down into individual words, stem each word, then recombine all the words into one string.
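
The hint's split, stem, and recombine steps might look like the sketch below, assuming nltk is installed. The helper `stem_text` is a hypothetical name and is not the course's `parseOutText()`:

```python
from nltk.stem.snowball import SnowballStemmer


def stem_text(text, stemmer=None):
    """Stem every whitespace-separated word and join with single spaces."""
    stemmer = stemmer or SnowballStemmer("english")
    return " ".join(stemmer.stem(word) for word in text.split())


stemmed = stem_text("responsiveness unresponsive")
```

Calling `split()` without arguments also collapses runs of spaces, so the double spaces left by punctuation removal disappear in the result.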

In [2]:
!python2 ../tools/parse_out_email_text.py

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project


`Out [ ]: hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project`

#### Quiz: Clean Away "Signature Words"
- In vectorize_text.py, you will iterate through all the emails from Chris and from Sara. 
- For each email, feed the opened email to parseOutText() and return the stemmed text string.
- Then do two things:
    1. remove signature words (“sara”, “shackleton”, “chris”, “germani”--bonus points if you can figure out why it's "germani" and not "germany")
    2. append the updated text string to word_data -- if the email is from Sara, append 0 (zero) to from_data, or append a 1 if Chris wrote the email.
- Once this step is complete, you should have two lists: one contains the stemmed text of each email, and the second should contain the labels that encode (via a 0 or 1) who the author of that email is.
- Running over all the emails can take a little while (5 minutes or more), so we've added a temp_counter to cut things off after the first 200 emails. Of course, once everything is working, you'd want to run over the full dataset.
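
A minimal sketch of the two steps, under stated assumptions: the helper `strip_signature_words` and the `samples` list are hypothetical stand-ins, and simple substring replacement is just one way to remove the words.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]


def strip_signature_words(text, words=SIGNATURE_WORDS):
    """Remove every occurrence of each signature word from the text."""
    for word in words:
        text = text.replace(word, "")
    return text


word_data, from_data = [], []

# Hypothetical already-stemmed emails paired with their author.
samples = [
    ("sara", "thank for the updat sara shackleton"),
    ("chris", "pleas call chris germani"),
]
for author, stemmed in samples:
    word_data.append(strip_signature_words(stemmed))
    from_data.append(0 if author == "sara" else 1)  # 0 = Sara, 1 = Chris
```

Note that `str.replace` works on substrings, so it would also cut "chris" out of a longer word that happens to contain it.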

In the box below, put the string that you get for word_data[152].

In [19]:
import vectorize_text as vt
print vt.word_data[152]

tjonesnsf stephani and sam need nymex calendar


`Out [ ]:  tjonesnsf stephani and sam need nymex calendar`

#### Quiz: TfIdf It
- Make sure to first remove English stopwords and disable temp_counter
- Transform the word_data into a tf-idf matrix using the sklearn TfIdf transformation.
    - Use tf-idf Vectorizer class to transform the word data.
- You can access the mapping between words and feature numbers using `get_feature_names()`, which returns a list of all the words in the vocabulary. How many different words are there?
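
A sketch of the transformation on a toy `word_data`, assuming scikit-learn is installed. The three sample strings are made up, and the vocabulary size is read from the `vocabulary_` attribute here rather than from `get_feature_names()`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the stemmed emails collected in word_data.
word_data = ["the quick brown fox", "the lazy dog", "quick quick dog"]

# stop_words="english" drops scikit-learn's built-in English stopwords.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(word_data)   # sparse (n_docs, n_words) matrix

vocab = vectorizer.vocabulary_                # dict: word -> feature number
n_words = len(vocab)                          # number of distinct words kept
```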

In [95]:
# Reload the module if it has previously been imported, so its code runs again
import sys
if 'vectorize_text' in sys.modules:
    vt = reload(sys.modules['vectorize_text'])
else:
    import vectorize_text as vt
feature_names = vt.vectorizer.get_feature_names()
len(feature_names)

emails processed
word data vectorized with method 1


38757

`Out [ ]: 38757`

#### Quiz: Accessing TfIdf Features
- What is word number 34597 in your TfIdf?
    - (Just to be clear--if the question were "what is word number 100," we would be looking for the word corresponding to vocab_list[100]. Zero-indexed arrays are so confusing to talk about sometimes.)

In [62]:
feature_names[34597]

u'stephaniethank'

`Out [ ]: u'stephaniethank'`