**\*Remember to activate the *far_nlp* environment before starting jupyter notebook from command line and running this notebook:**

**OS X, Linux:** `$ source activate far_nlp`

**Windows:** `$ activate far_nlp`

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Verify the right environment is enabled by checking the python path
import sys
print(sys.executable)

# How Can We Represent Text for Use in Machine Learning?

The classic way to represent text for machine learning is with a *term frequency matrix* also sometimes referred to as a *Bag of Words representation*. In its simplest form this means counting how many times a word appears in a document, and we do this count for each document in our collection of documents (aka our corpus). 

In the example below our corpus is *eagle_powers* and each sentence is considered a separate document. 
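
To make the counting idea concrete before using any library, here is a minimal hand-rolled sketch of a term frequency matrix. The `tokenize` helper and its lowercase, letters-only rule are assumptions made for illustration; they are not how `SimpleVectorizer` is implemented.

```python
from collections import Counter
import re

corpus = [
    "He knows where to find the eagle eggs.",
    "His eagle eggs possess magical powers.",
    "Summon your eagle powers.",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens (a simplification).
    return re.findall(r"[a-z]+", text.lower())

# One Counter of word counts per document.
doc_counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted(set().union(*doc_counts))

# One row per document, one column per vocabulary word.
tf_matrix = [[counts[word] for word in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in tf_matrix:
    print(row)
```

Each row is one document and each column is one word, which is the same layout the cells below display.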

In [None]:
eagle_powers = ["He knows where to find the eagle eggs.",
        "His eagle eggs possess magical powers.",
        "Summon your eagle powers."
        ]

In [None]:
# This is a custom script to show how vectorization works easily
from utilities import SimpleVectorizer

In [None]:
SimpleVectorizer(eagle_powers)

### Try it Out
Make your own test corpus by creating a new variable that contains a list of sentences. How does the vectorizer handle your corpus? Does it behave as expected?

In [None]:
# Make a new list variable that contains a few sentences.
# Pass your documents into the SimpleVectorizer function.



#### Hint:

In [None]:
my_corpus = ["This is a sentence.", "This is another sentence."]
SimpleVectorizer(my_corpus)

## How does the SimpleVectorizer fall short?

Modify your previous list and try to explore what some of the limitations of the SimpleVectorizer might be. For example, how does it handle capitalization? If the goal of vectorizing is capturing the meaning (or similarity of like documents), how does it fall short? How does it perform well? 

### Try it Out:

In [None]:
# Try to find shortcomings of the SimpleVectorizer by modifying your document corpus.



#### Hint:

In [None]:
my_corpus = ["He ran yesterday.", "Run over there!", "Running a meeting is hard.", "The river runs east."]
SimpleVectorizer(my_corpus)

## Discussion

What did we find?

# Under the Hood of the SimpleVectorizer

Before we move on and see options for creating a term frequency matrix, let's look under the hood of our SimpleVectorizer function and see how it works.

In [None]:
# Run this cell to see the components of the SimpleVectorizer function
??SimpleVectorizer

The SimpleVectorizer function uses a class called CountVectorizer from scikit-learn to build the term frequency matrix. Let's see how to use CountVectorizer on its own.

To use CountVectorizer we'll have to:
1. Import it from scikit-learn
2. Instantiate it
3. And call ```fit_transform``` on a set of documents.

See the scikit-learn docs on [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Instantiate CountVectorizer, this makes an instance of CV
# with whatever options you specify.
CV = CountVectorizer(lowercase=False, min_df=1)

In [None]:
# Fit the CountVectorizer instance on our document corpus
# The result is a sparse matrix
results = CV.fit_transform(eagle_powers)
results

In [None]:
# Here is what the sparse representation looks like
print(results)

In [None]:
# To see what this looks like as a regular matrix we need to use .todense()
results.todense()
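
The sparse printout above lists only the non-zero cells as coordinates with their counts, while the dense form fills in every zero. A rough sketch of that difference, using a plain dictionary and made-up counts (this is an illustration, not how SciPy stores the matrix internally):

```python
# A small dense term frequency matrix: 3 documents x 5 terms (made-up counts).
dense = [
    [1, 0, 0, 2, 0],
    [0, 0, 1, 0, 0],
    [0, 3, 0, 0, 0],
]

# Sparse form: store only the non-zero cells as (row, column) -> count.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

for (r, c), value in sparse.items():
    print((r, c), value)

def to_dense(cells, n_rows, n_cols):
    # Rebuild the full matrix, filling every unstored cell with zero.
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for (r, c), value in cells.items():
        matrix[r][c] = value
    return matrix

restored = to_dense(sparse, 3, 5)
```

With a real corpus most cells are zero, so storing only the non-zero ones saves a lot of memory.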

In [None]:
# Import the DisplayTFMatrix function to view the matrix as we have been.
# For this function, pass in the CountVectorizer instance and the documents
# that will be vectorized.
from utilities import DisplayTFMatrix

In [None]:
DisplayTFMatrix(CV, eagle_powers)

### Try it Out:
Repeat the steps above using your own document corpus to:
1. **Instantiate CountVectorizer:** Make a new instance of CountVectorizer. You can set it equal to *CV* or anything else that makes sense to you.
2. **Transform your Corpus:** Call the ```fit_transform``` method on the new CountVectorizer instance and pass in your corpus. Set this to a variable to capture the resulting sparse matrix.
3. **View Sparse Representation of the Matrix:** View the sparse version of your term frequency matrix using ```print()```.
4. **View Dense Representation of the Matrix:** View the dense representation of your term frequency matrix by calling `.todense()` on the variable containing your term frequency matrix.
5. **View the DataFrame:** View the data frame of your term frequency matrix using ```DisplayTFMatrix(CVInstance, corpus)```. Make sure to pass in the CountVectorizer instance and your corpus. 

In [None]:
# Your code here




# Improving on Term Frequency Matrices

Two of the main issues with Term Frequency Matrices are:
- **Size:** They grow with both the number of documents in the corpus (rows) and the number of distinct terms (columns), so they get large very fast. This can lead to high computing costs, and it can take a long time to vectorize all of the documents in a corpus.
- **Noise:** Related to the first problem, these representations of documents also contain a lot of noise. There is a lot of extra information that may not be helpful for our machine learning task. For example, is the word *the* indicative of document content? Is it useful to have a count for this word?



## Stop Words

Stop words are words that we want to omit from being counted. This could be because we don't think they convey any important meaning or we may want to omit them for other reasons.

We can set the stop words by passing in `stop_words="english"` as a parameter in CountVectorizer. We can also pass in a list of our own custom words. 
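
Conceptually, stop word removal is just a filter applied to the tokens before counting. A minimal sketch, assuming a tiny hand-picked stop list and a simple `tokenize` helper (both are illustrative; scikit-learn's built-in list is much longer):

```python
import re

# A tiny hand-picked stop list for illustration only.
stop_words = {"he", "his", "the", "to", "where", "your"}

def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens (a simplification).
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(text, stop_words):
    # Keep only the tokens that are not in the stop list.
    return [token for token in tokenize(text) if token not in stop_words]

kept = remove_stop_words("He knows where to find the eagle eggs.", stop_words)
print(kept)
```

Only the surviving tokens would get a column in the term frequency matrix.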

In [None]:
# See what happens when we use the list of stopwords provided 
# with scikit-learn.

CV = CountVectorizer(stop_words="english", lowercase=False)
DisplayTFMatrix(CV, eagle_powers)

In [None]:
# Check out the default list of stop words from scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS

### Try it Out:
Make a new list variable that will contain a list of custom stopwords. Use the default stopwords from scikit-learn as the starting point for the new list. Then append your own stopwords to the new stopwords list. Use ```DisplayTFMatrix()``` to see the effects of using the new stopwords list on your custom corpus.

In [None]:
# Your code here




#### Hint:

In [None]:
# Set the default stopwords to a new list variable. You need
# to change it to a list because the original is a frozenset and will
# not allow you to add other words
custom_stop_words = list(ENGLISH_STOP_WORDS)

# My custom Words
new_stop_words = ["eagle","magical","powers"]

# Add my words to the default stop words
custom_stop_words.extend(new_stop_words)

# Make a TF Matrix using my custom stop words.
CV = CountVectorizer(stop_words=custom_stop_words, lowercase=False)
DisplayTFMatrix(CV, eagle_powers)

## Minimum and Maximum Limits

In [None]:
# To see how minimum and maximum limits work, it will help to
# have a larger corpus with even more eagle powers.
more_eagle_powers = ["He knows where to find the eagle eggs.",
                     "Eagle eggs?",
                     "His eagle eggs possess magical powers.",
                     "For you to become empowered by the eagle you must climb that cliff",
                     "Find the egg and drink it.",
                     "Summon your eagle powers.",
                     "Eagle powers, come to me! Please!"
                    ]

<img style="float: left;" src="img/eagle_powers.gif">

In [None]:
# We can also get rid of the most and least frequent
# words by setting minimum and maximum limits.
# This can also reduce the matrix size and reduce noise.

# CV = CountVectorizer(lowercase=False)
CV = CountVectorizer(min_df=2, max_df=6, lowercase=False)
DisplayTFMatrix(CV, more_eagle_powers)
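
In CountVectorizer, integer values for `min_df` and `max_df` are counts of documents, and float values are proportions of the corpus. The sketch below shows the idea by hand on a shortened corpus; the `tokenize` and `filter_vocabulary` helpers are hypothetical names used only for this illustration.

```python
from collections import Counter
import re

docs = [
    "He knows where to find the eagle eggs.",
    "Eagle eggs?",
    "His eagle eggs possess magical powers.",
    "Find the egg and drink it.",
    "Summon your eagle powers.",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens (a simplification).
    return re.findall(r"[a-z]+", text.lower())

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

def filter_vocabulary(doc_freq, min_df, max_df):
    # Keep words whose document frequency falls inside [min_df, max_df].
    return sorted(w for w, df in doc_freq.items() if min_df <= df <= max_df)

vocabulary = filter_vocabulary(doc_freq, min_df=2, max_df=3)
print(vocabulary)
```

Here *eagle* is dropped for appearing in too many documents, and words that appear in only one document are dropped for being too rare.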

### Try it Out:

Use the example above to do the following:

- Add a few more sentences to your document corpus. Make sure to have some words that occur in many documents in the corpus so you can see the effects of setting a maximum document frequency (`max_df`).
- Vectorize your corpus with CountVectorizer using upper and lower limits to see how it affects the final matrix.
- Also, notice that we've been seeing a lowercase option for CountVectorizer. Set `lowercase=True` to have all the letters converted to lowercase, and notice how this affects the outcome in the final term frequency matrix.

In [None]:
# Your code here




## NGrams

Instead of tokenizing a single word we can tokenize several words at a time to get a better sense of meaning and nuance in the text. However, we also need to keep in mind this can greatly increase the size of our matrices.
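
An n-gram is a sliding window of `n` consecutive tokens. The short sketch below builds them by hand; the `ngrams` helper is a hypothetical function written for illustration, with its two bounds playing the same role as the two numbers in `ngram_range`.

```python
def ngrams(tokens, n_min, n_max):
    # Slide a window of each size from n_min to n_max over the token list.
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["summon", "your", "eagle", "powers"]
unigrams = ngrams(tokens, 1, 1)
bi_and_trigrams = ngrams(tokens, 2, 3)
print(unigrams)
print(bi_and_trigrams)
```

Every distinct n-gram becomes its own column, which is why the matrix widens so quickly.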

In [None]:
# This is what our TF matrix looks like without using ngrams.
# Or in other words using an ngram of 1, or a unigram
# Note the size: 3 rows, 14 columns
CV = CountVectorizer()
DisplayTFMatrix(CV, eagle_powers)

In [None]:
# Now let's see what it looks like using an ngram range of 2-3.
# This means we'll have only 2 and 3 word strings.
# Notice the new size of the dataFrame

CV = CountVectorizer(ngram_range=(2,3))
DisplayTFMatrix(CV, eagle_powers)

### Try it Out:

Make a Term Frequency matrix using `ngram_range`. Experiment with different ranges to see how it affects the final matrix.

In [None]:
# Your Code Here




#### Hint:

In [None]:
CV = CountVectorizer(ngram_range=(1,3))
DisplayTFMatrix(CV, eagle_powers)

## Truncated SVD

As we've seen with ngrams, it is easy for a vectorized corpus to become very wide and have a large number of features/columns. Truncated SVD allows us to reduce the dimensionality of our data (the number of features/columns) while maintaining a large degree of the variance from the original data. Truncated SVD works efficiently with data in sparse form, so it works well with the output from CountVectorizer.

scikit-learn reference: [TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)

More info on SVD and how it works:
- [https://www.quora.com/What-is-an-intuitive-explanation-of-singular-value-decomposition-SVD/answer/Jason-Liu-21?srid=nGs9](https://www.quora.com/What-is-an-intuitive-explanation-of-singular-value-decomposition-SVD/answer/Jason-Liu-21?srid=nGs9)
- [https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca)
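
To get a feel for what "truncating" means, here is a rough NumPy sketch on a small made-up matrix. It uses a plain full SVD and keeps the top `k` components; scikit-learn's `TruncatedSVD` uses its own solvers, and the `retained` figure below is the share of squared singular values kept, which is not the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

# A small made-up term frequency matrix: 4 documents x 5 terms.
X = np.array([
    [1, 1, 0, 0, 2],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 2, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

# Full SVD: X = U @ diag(S) @ Vt, with singular values sorted largest first.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate: keep only the k strongest components.
k = 2
X_reduced = U[:, :k] * S[:k]          # 4 documents x 2 columns

# Projecting X onto the top-k right singular vectors gives the same result.
X_projected = X @ Vt[:k].T

# Share of the squared singular values kept by the first k components.
retained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(retained), 3))
```

The reduced matrix keeps one row per document but only `k` columns, each a weighted mix of the original term columns.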

In [None]:
# Let's generate a wide term frequency matrix for comparison
CV = CountVectorizer(ngram_range=(1,3))

# The original matrix has 94 columns for this corpus
DisplayTFMatrix(CV, more_eagle_powers)

In [None]:
# Now Let's use TSVD

# Vectorize the documents with CV
CV = CountVectorizer(ngram_range=(1,3))
cv_results = CV.fit_transform(more_eagle_powers)

In [None]:
# Like other methods we need to import TSVD before use
from sklearn.decomposition import TruncatedSVD

In [None]:
# Then instantiate TSVD and set parameters
TSVD = TruncatedSVD(n_components=2)

In [None]:
# Fit TSVD on our term matrix. Note that the result is a dense array, not a sparse matrix
tsvd_result = TSVD.fit_transform(cv_results)
tsvd_result

In [None]:
# Display the TSVD results
pd.DataFrame(tsvd_result, index=more_eagle_powers)

In [None]:
# We can see how much of the variance from the original data
# is retained by using explained_variance_ratio_
# With only two columns of data we're capturing 43% of the variance

TSVD.explained_variance_ratio_

### Try it Out:

1. Make a wide term frequency matrix (use something like `ngram_range=(1,3)` or more) and set the results to a variable.
2. Use Truncated SVD to reduce the columns of this data set to 2 columns.
3. Display the results of the new matrix.
4. Use `.explained_variance_ratio_` to see how much of the variance of the original data that each column captures.
5. Adjust the `n_components` setting in Truncated SVD so that you are capturing 70%-80% of the variation of the original data. How many columns does this end up being? 

In [None]:
# Your code here




#### Hint:

In [None]:
# Vectorize the corpus
CV = CountVectorizer(ngram_range=(1,5))
cv_results = CV.fit_transform(more_eagle_powers)

# Use TSVD on the Term Frequency Matrix
TSVD = TruncatedSVD(n_components=2)
tsvd_results = TSVD.fit_transform(cv_results)

# Display the resulting reduced-feature Matrix 
display(pd.DataFrame(tsvd_results, index=more_eagle_powers))
# Display how much variance was captured in each column
display(TSVD.explained_variance_ratio_)


# Adjust n_components to capture 70% of the variance
TSVD = TruncatedSVD(n_components=4)
tsvd_results = TSVD.fit_transform(cv_results)

# Display matrix and variance
display(pd.DataFrame(tsvd_results, index=more_eagle_powers))
display(TSVD.explained_variance_ratio_)

## TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a way of weighting words that gives the highest value to words which occur frequently in a specific document but occur less frequently in the overall corpus.

1. Highest when a term occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. Lower when a term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

3. Lowest when a term occurs in virtually all documents.

*\*Above taken from the second resource linked below.*

Resources for learning more about TF-IDF:
- Good introduction to the idea of TF-IDF: [http://www.tfidf.com/](http://www.tfidf.com/)
- [Introduction to Information Retrieval: TF-idf Stanford](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html)
- Details on scikit-learn's specific implementation of TF-IDF: [Tf–idf term weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
- Documentation for using TF-IDF in scikit-learn: [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

This is the basic concept of the math behind TF-IDF. However, there are several different implementations and options that can be added to TF-IDF so you may see slightly different formulas out in the wild.

$\text{tfidf}_t = \frac{\text{number of times a word appears in a document}}{\text{total number of words in a document}} \cdot \log\left( \frac{\text{total number of documents in corpus}}{\text{number of documents in corpus where a word appears at least once}} \right)$

**A quick TFIDF example (taken from [tfidf.com](http://www.tfidf.com/))**

Our goal is to calculate the TFIDF value for the word *cat* in the given document below:
- Our document has a total of 100 words.
- The word *cat* occurs 3 times in the document.
- Our corpus has 10,000,000 documents total.
- The word cat occurs at least once in 1,000 documents in our corpus.

In [None]:
from math import log

normalized_term_frequency = 3/100
inverse_document_frequency = log(10000000/1000, 10)
tfidf_value = normalized_term_frequency*inverse_document_frequency

print("The normalized term frequency is: ", normalized_term_frequency)
print("The inverse document frequency is: ", inverse_document_frequency)
print("The TF-IDF value for the word cat in our document is: ", tfidf_value)

In [None]:
# Similar to CountVectorizer we need to import TFIDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Instantiate TFIDF
TFIDF = TfidfVectorizer()

# Display TFIDF
DisplayTFMatrix(TFIDF, eagle_powers)

Look over the matrix above to get a feel for what TFIDF is doing.

- What words have the highest value, and the lowest?
- For words that occur in all documents, why are values different or the same across documents?

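*\*Some may notice that the values are not the same as if we calculated them by hand using the formula above. This is due to scikit-learn's specific way of implementing TF-IDF: by default, `TfidfVectorizer` uses raw term counts rather than length-normalized frequencies, a smoothed IDF based on the natural log, `ln((1 + n) / (1 + df)) + 1`, and then scales each document's row to unit Euclidean (L2) length.*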
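
For anyone who wants to see where the numbers could come from, the sketch below recomputes TF-IDF by hand following the default formula described in the scikit-learn documentation: raw counts, smoothed natural-log IDF, then L2 normalization of each row. The `tokenize` helper only approximates the default tokenizer, so treat this as an illustration rather than a guaranteed match for the library's output.

```python
from collections import Counter
from math import log, sqrt
import re

docs = [
    "He knows where to find the eagle eggs.",
    "His eagle eggs possess magical powers.",
    "Summon your eagle powers.",
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(docs)
counts = [Counter(tokenize(doc)) for doc in docs]

# Document frequency of each term.
doc_freq = Counter()
for c in counts:
    doc_freq.update(c.keys())

def smoothed_idf(term):
    # Smoothed IDF with natural log, plus 1 so no term is zeroed out entirely.
    return log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(term_counts):
    # Raw count times IDF, then scale the row to unit Euclidean length.
    weights = {t: tf * smoothed_idf(t) for t, tf in term_counts.items()}
    norm = sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(c) for c in counts]
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

Because *eagle* appears in every document it gets the smallest IDF, yet its final value still differs from row to row because each row is normalized separately.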

### Try it Out:

Use `TfidfVectorizer()` to vectorize your corpus and then use `DisplayTFMatrix()` to display it. Examine the results and answer the same questions as above:
- What words have the highest value, and the lowest?
- For words that occur in all documents, why are values different or the same across documents?

In [None]:
# Your Code Here




#### Hint:

In [None]:
TFIDF = TfidfVectorizer()
DisplayTFMatrix(TFIDF, eagle_powers)

## Lemmatization

Lemmatization is getting the base form of a word, which we will then use to make our term frequency matrix. In this way, different variations of a word can be mapped to the root word. This can further reduce the size and (potentially) the noise in our Term Frequency Matrix. We'll use an NLP library called spaCy to tokenize our text and get the lemmas. We'll pass the results from spaCy to CountVectorizer to generate our term frequency matrix.
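
To see why lemmas shrink the matrix, here is a toy sketch that maps word variants to a base form with a hand-written lookup table. The `TOY_LEMMAS` table and `toy_lemmatize` function are made up for illustration; a real lemmatizer such as the one in spaCy relies on linguistic data and rules rather than a tiny dictionary.

```python
from collections import Counter
import re

# A toy lookup table, for illustration only.
TOY_LEMMAS = {"ran": "run", "runs": "run", "running": "run", "eggs": "egg", "powers": "power"}

def tokenize(text):
    # Lowercase the text and keep runs of letters as tokens (a simplification).
    return re.findall(r"[a-z]+", text.lower())

def toy_lemmatize(text):
    # Replace each token with its base form when the table has one.
    return [TOY_LEMMAS.get(token, token) for token in tokenize(text)]

docs = ["He ran yesterday.", "Run over there!", "The river runs east."]

surface_counts = Counter(t for d in docs for t in tokenize(d))
lemma_counts = Counter(t for d in docs for t in toy_lemmatize(d))

print(len(surface_counts), "columns before,", len(lemma_counts), "columns after")
print(lemma_counts["run"])
```

The three surface forms *ran*, *run*, and *runs* collapse into one column, so documents that share the concept now share a feature.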

spaCy documentation: [https://spacy.io/docs/usage/](https://spacy.io/docs/usage/)

spaCy can do a LOT more than what we are using it for here.

In [None]:
# This is how we load spaCy
import spacy              # First import the package
nlp = spacy.load('en')    # Then load the language model
                          # (spaCy 3+ dropped the 'en' shortcut; use 'en_core_web_sm' there)

In [None]:
# Let's see how spaCy works and how the lemmas compare to 
# the original text

for doc in eagle_powers:
    spacy_doc = nlp(doc)
    for tok in spacy_doc:
        print("Original:", tok.orth_, " | Lemma:", tok.lemma_)

In [None]:
#  Now let's bring together tokenization with vectorization.

# Setup an empty list that will hold our tokenized (lemma) documents
eagle_power_lemmas = []

# Iterate Through each document
for doc in more_eagle_powers: 
    
    # Process the document with spaCy
    spacy_doc = nlp(doc)
    
    # Grab the lemma for each token
    lemmas = [token.lemma_ for token in spacy_doc] 
    
    # Make the list of lemmas into a string
    lemmas = " ".join(lemmas)
    
    # Add the lemma string to the eagle_power_lemmas list
    eagle_power_lemmas.append(lemmas) 

In [None]:
# Let's see what our strings of lemmas look like
eagle_power_lemmas

In [None]:
CV = CountVectorizer()
DisplayTFMatrix(CV, eagle_power_lemmas)

### Try it Out:

Drawing on what we've seen above, build a function named *spacy_tokenizer* that does the following:
- Take in a single string (document) as an argument (not a corpus or collection of docs).
- Process the string and create a spaCy document.
- Iterate over the tokens in the spaCy document and return the lemma for each token as a list.
- Test out the function with a single string (document).

In [None]:
# Your Code Here




#### Hint:

In [None]:
def spacy_tokenizer(doc_as_string):
    spacy_doc = nlp(doc_as_string)
    tokens = [tok.lemma_ for tok in spacy_doc]
    return tokens

In [None]:
spacy_tokenizer("Test some string here.")

### Try it Out 2:

Now make a new instance of CountVectorizer and add the following parameter: `tokenizer=spacy_tokenizer`. Feel free to add in any of the other parameters that we've learned.

Use `DisplayTFMatrix` to display the results.

In [None]:
# Your code here




#### Hint:

In [None]:
CV = CountVectorizer(stop_words='english', tokenizer=spacy_tokenizer)
DisplayTFMatrix(CV, more_eagle_powers)


# End of Part 1