# Module 6 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer in anywhere else other than where it says `YOUR CODE HERE`. Anything you write anywhere else will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. Go to menubar, select _Kernel_, and restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and CheckPoint_)

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
from gensim import corpora, models
from nose.tools import assert_equal, assert_almost_equal, assert_is_instance

import random
import nltk
from nltk.corpus import reuters
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

In [2]:
text = """When analyzing large text corpora, trends can appear. These trends can be repeated use of common
phrases or terms that are indicative of common underlying themes or topics. For example, books on programming
might refer to themes such as human computer interaction, optimization and performance, or identifying
and removing error conditions. Finding these common topics can be important for a number of reasons.
On the one hand, when they are completely unknown, they can be used to provide new insight into text documents.
On the other hand, when they may be partially or even completely unknown, computationally identified topics can
provide deeper or more concise insight into the relationship between documents.
The process of identifying these common topics is known as topic modeling, which is generally a form
of unsupervised learning. As a specific example, consider the twenty newsgroup data that we have analyzed
in scikit learn. While there are twenty different newsgroups, it turns out they can be grouped into six related
categories: computers, sports, science, politics, religion, and miscellaneous. While we now these topics ahead of
time (from the newsgroup titles), we can apply topic modeling to these data to identify the common words or phrases
that define these common topics.
In the rest of this notebook, we explore the concept of topic modeling. First we will use the scikit learn
library to perform topic modeling. We will introduce and use non-negative matrix factorization and Latent Dirichlet
allocation. We apply topic modeling to a text classification problem, and also explore the terms that make up
identified topics. Finally, we introduce the gensim library, which provides additional techniques for topic modeling.
"""

# Problem 1: Counting the number of tokens in a document

Write a function called $\texttt{token_counter}$ that tokenizes a document using an nltk.tokenize method and then returns the number of tokens in the document.

In [3]:
def token_counter(tokenizer,text):
    """

    Inputs
    ----------

    tokenizer: the nltk.tokenize method used to tokenize the document
    text: the document to tokenize
    
    Returns
    -------

    num_tokens: the number of tokens found in the text by the tokenizer
    """

    ###BEGIN SOLUTION###
    tokens = tokenizer.tokenize(text)
    num_tokens = len(tokens)
    ###END SOLUTION###
    
    return num_tokens, tokens

In [4]:
token_count,tokens = token_counter(WhitespaceTokenizer(),text)
assert_equal(tokens[0],'When')
assert_equal(272,token_count)
token_count,tokens = token_counter(WordPunctTokenizer(),text)
assert_equal(312,token_count)

# Problem 2: Finding the top collocated words

Write a function called $\texttt{top_collocated}$ which, given a tokenized text, returns the top collocated bi-grams in a tokenized text using PMI.

In [5]:
def top_collocated(tokenized_text, num_collocations):
    """

    Inputs
    ----------

    tokenized_text: the tokenized text
    num_collocations: integer, the number of top collocated words to return
    
    Returns
    -------

    top_collocations: the top collocated words
    """

    ###BEGIN SOLUTION###
    finder = BigramCollocationFinder.from_words(tokenized_text)
    top_collocations = finder.nbest(BigramAssocMeasures().pmi, num_collocations)
    ###END SOLUTION###
    
    return top_collocations

In [6]:
top = top_collocated(tokens,10)

top_c = [('(', 'from'),
 ('-', 'negative'),
 (':', 'computers'),
 ('Dirichlet', 'allocation'),
 ('Latent', 'Dirichlet'),
 ('The', 'process'),
 ('When', 'analyzing'),
 ('additional', 'techniques'),
 ('analyzed', 'in'),
 ('analyzing', 'large')]

assert_equal(top_c,top)

In [7]:
# text taken from "Introduction to NLP: Topic Modeling" notebook

text = """When analyzing large text corpora, trends can appear. These trends can be repeated use of common phrases or terms that are indicative of common underlying themes or topics. For example, books on programming might refer to themes such as human computer interaction, optimization and performance, or identifying and removing error conditions. Finding these common topics can be important for a number of reasons. On the one hand, when they are completely unknown, they can be used to provide new insight into text documents. On the other hand, when they may be partially or even completely unknown, computationally identified topics can provide deeper or more concise insight into the relationship between documents.
The process of identifying these common topics is known as topic modeling, which is generally a form of unsupervised learning. As a specific example, consider the twenty newsgroup data that we have analyzed in scikit learn. While there are twenty different newsgroups, it turns out they can be grouped into six related categories: computers, sports, science, politics, religion, and miscellaneous. While we now these topics ahead of time (from the newsgroup titles), we can apply topic modeling to these data to identify the common words or phrases that define these common topics.
In the rest of this notebook, we explore the concept of topic modeling. First we will use the scikit learn library to perform topic modeling. We will introduce and use non-negative matrix factorization and Latent Dirichlet allocation. We apply topic modeling to a text classification problem, and also explore the terms that make up identified topics. Finally, we introduce the gensim library, which provides additional techniques for topic modeling.
"""

print(text)

When analyzing large text corpora, trends can appear. These trends can be repeated use of common phrases or terms that are indicative of common underlying themes or topics. For example, books on programming might refer to themes such as human computer interaction, optimization and performance, or identifying and removing error conditions. Finding these common topics can be important for a number of reasons. On the one hand, when they are completely unknown, they can be used to provide new insight into text documents. On the other hand, when they may be partially or even completely unknown, computationally identified topics can provide deeper or more concise insight into the relationship between documents.
The process of identifying these common topics is known as topic modeling, which is generally a form of unsupervised learning. As a specific example, consider the twenty newsgroup data that we have analyzed in scikit learn. While there are twenty different newsgroups, it turns out the

# Problem 3: Preprocessing Text

For this problem take the text from above and replace all instances of '\n' with nothing. Next convert the the text above to a list using a period as the delimiter.

If done correctly the first 5 items in data will look like this:
```
['When analyzing large text corpora, trends can appear',
 ' These trends can be repeated use of common phrases or terms that are indicative of common underlying themes or topics',
 ' For example, books on programming might refer to themes such as human computer interaction, optimization and performance, or identifying and removing error conditions',
 ' Finding these common topics can be important for a number of reasons',
 ' On the one hand, when they are completely unknown, they can be used to provide new insight into text documents',
 ' On the other hand, when they may be partially or even completely unknown, computationally identified topics can provide deeper or more concise insight into the relationship between documents']
```

In [8]:
###BEGIN SOLUTION
data = text.replace("\n","").split(".")
###END SOLUTION

In [9]:
print(data)

['When analyzing large text corpora, trends can appear', ' These trends can be repeated use of common phrases or terms that are indicative of common underlying themes or topics', ' For example, books on programming might refer to themes such as human computer interaction, optimization and performance, or identifying and removing error conditions', ' Finding these common topics can be important for a number of reasons', ' On the one hand, when they are completely unknown, they can be used to provide new insight into text documents', ' On the other hand, when they may be partially or even completely unknown, computationally identified topics can provide deeper or more concise insight into the relationship between documents', 'The process of identifying these common topics is known as topic modeling, which is generally a form of unsupervised learning', ' As a specific example, consider the twenty newsgroup data that we have analyzed in scikit learn', ' While there are twenty different new

In [10]:
assert_equal(data, 
['When analyzing large text corpora, trends can appear', ' These trends can be repeated use of common phrases or terms that are indicative of common underlying themes or topics', ' For example, books on programming might refer to themes such as human computer interaction, optimization and performance, or identifying and removing error conditions', ' Finding these common topics can be important for a number of reasons', ' On the one hand, when they are completely unknown, they can be used to provide new insight into text documents', ' On the other hand, when they may be partially or even completely unknown, computationally identified topics can provide deeper or more concise insight into the relationship between documents', 'The process of identifying these common topics is known as topic modeling, which is generally a form of unsupervised learning', ' As a specific example, consider the twenty newsgroup data that we have analyzed in scikit learn', ' While there are twenty different newsgroups, it turns out they can be grouped into six related categories: computers, sports, science, politics, religion, and miscellaneous', ' While we now these topics ahead of time (from the newsgroup titles), we can apply topic modeling to these data to identify the common words or phrases that define these common topics', 'In the rest of this notebook, we explore the concept of topic modeling', ' First we will use the scikit learn library to perform topic modeling', ' We will introduce and use non-negative matrix factorization and Latent Dirichlet allocation', ' We apply topic modeling to a text classification problem, and also explore the terms that make up identified topics', ' Finally, we introduce the gensim library, which provides additional techniques for topic modeling', '']
            )

# Problem 4: Creating a vectore space model with Gensim
For this problem create a set of stop words by reading in data from 'english.txt' in the same directory and storing it in a set.

Next parse text from the variable data and remove stop words (make all words in the sentence lower case). *Label this as txts (we will use it in the next problem).*

Then remove words appearing more than once. Now grab each word that eppears more than once (these are our tokens).

Now create a dictionary mapping for our text corupus and convert the collection of words in our corpus to a bag of words.

Next calculate the inverse document counts for all terms using the gensim's implementation of Tf-idf using on the bag of words and transform the bag of words into the tfidf space.

Lastly create an LDA model for our corpus using the  LdaModel implementation in gensim. The random_state should be 0, id2word should the dictionary mapping of ids to words, the number of iterations shoulde be 10000, and the corpus should be the bag of words in the tfidf space. *Name this model: lda_model



In [11]:
np.random.seed(0)

In [12]:
###BEGIN SOLUTION
stop_words = set(pd.read_csv('english.txt').values.flatten().tolist()) # stop words

# removing stop words...
txts = [[word for word in sentance.lower().split() if word not in stop_words]
        for sentance in data]

# removing words with frequency > 1
frequency = Counter([word for txt in txts for word in txt])
tokens = [[token for token in txt if frequency[token] > 1]
          for txt in txts]

# dictionary mapping and creating bag of words
dict_gensim = corpora.Dictionary(tokens)
crps = [dict_gensim.doc2bow(txt) for txt in txts]


tfidf = models.TfidfModel(crps)
crps_tfidf = tfidf[crps]

lda_model = models.LdaModel(corpus=crps_tfidf, id2word=dict_gensim, random_state=0, iterations=10000)
###END SOLUTION

In [13]:
# Let's take a look at the topics our model selected.
lda_model.print_topics(5)


[(6,
  '0.033*"data" + 0.033*"learn" + 0.033*"identified" + 0.033*"topic" + 0.033*"twenty" + 0.033*"newsgroup" + 0.033*"insight" + 0.033*"scikit" + 0.033*"will" + 0.033*"modeling"'),
 (85,
  '0.033*"data" + 0.033*"learn" + 0.033*"identified" + 0.033*"topic" + 0.033*"twenty" + 0.033*"newsgroup" + 0.033*"insight" + 0.033*"scikit" + 0.033*"will" + 0.033*"modeling"'),
 (15,
  '0.033*"data" + 0.033*"learn" + 0.033*"identified" + 0.033*"topic" + 0.033*"twenty" + 0.033*"newsgroup" + 0.033*"insight" + 0.033*"scikit" + 0.033*"will" + 0.033*"modeling"'),
 (82,
  '0.033*"data" + 0.033*"learn" + 0.033*"identified" + 0.033*"topic" + 0.033*"twenty" + 0.033*"newsgroup" + 0.033*"insight" + 0.033*"scikit" + 0.033*"will" + 0.033*"modeling"'),
 (51,
  '0.033*"data" + 0.033*"learn" + 0.033*"identified" + 0.033*"topic" + 0.033*"twenty" + 0.033*"newsgroup" + 0.033*"insight" + 0.033*"scikit" + 0.033*"will" + 0.033*"modeling"')]

In [14]:
assert_equal(lda_model.iterations, 10000)
assert_equal(lda_model.decay, .5)

# Problem 5: Using Word2Vec

For this problem create a Word2Vec model using gensim's implementation. 
Pass in the sentences without stop words from the previous problem. Set the maximum distance between the current and predicited word within a sentence to be 5. Ignore all words with a total frequency less than 1. Assign the argument that controls random number generator to be 0, and set the number of iterations over the corpus to be 100. Name this the Word2Vec model to be *model*. 


In [15]:
np.random.seed(0)

In [16]:
###BEGIN SOLUTION
model = models.Word2Vec(txts, window=2, min_count=1, seed=1, iter=100)
###END SOLUTION

In [17]:
assert_equal(model.alpha, .025)
assert_equal(model.batch_words, 10000)
assert_equal(model.train_count, 1)

In [18]:
ans1 = model.similarity('scikit', 'learn')
ans3 = model.similarity('corpora,', 'text')
ans4 = model.similarity('text', 'text')
print("Similarity between %s and %s is %s"%('scikit', 'learn', ans1) )
print("Similarity between %s and %s is %s"%('corpora', 'text', ans3) )
print("Similarity between %s and %s is %s"%('text', 'text', ans4) )



Similarity between scikit and learn is 0.913426512011
Similarity between corpora and text is 0.864464308736
Similarity between text and text is 1.0


# Problem 6: Computing path similarity

Write a function called $\texttt{get_path_similarity}$ that takes in two words and calculates their path similarity using the wordnet corpus. Note that the wordnet corpus has been imported above as $\texttt{wn}$. Recall also that words passed to wordnet have an ending indicated their part of speech. In this case, we will use words that are marked with $\texttt{.n.01}$ See the lesson notebook.

In [19]:
def get_path_similarity(x,y):
    """

    Inputs
    ----------

    x: the first word
    y: the second word
    
    Returns
    -------

    similarity: the path similarity between the two words
    """

    ###BEGIN SOLUTION###
    first_word = x+".n.01"
    second_word = y+".n.01"
    w1 = wn.synset(first_word)
    w2 = wn.synset(second_word)
    similarity = wn.path_similarity(w1, w2)
    ###END SOLUTION###
    
    return similarity

In [None]:
assert_almost_equal(get_path_similarity('dog','boy'),0.14285714285714285)
assert_almost_equal(get_path_similarity('drive','boy'),0.08333333333333333)
assert_almost_equal(get_path_similarity('man','boy'),0.3333333333333333)