A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Topic Modeling.

In this problem, we use the [gensim](https://radimrehurek.com/gensim/) library to create a topic model.

In [None]:
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

from nose.tools import assert_equal, assert_is_instance, assert_true

Suppose we are given some sample documents as follows:

In [None]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

To generate a topic model, first we need to perform some basic text processing. In natural language processing, the following steps are commonly used:

- Tokenizing: breaking a text into its elements.
- Stopping: removing stop words, i.e. common words such as "a" or "the" that carry little meaning.

## Tokenize

- Write a function named `tokenize` that takes **one** document (a string, e.g. `doc_a` or `doc_b`, *not* `doc_set`) and returns a list of tokens.
- The function also takes a second argument, `stop_words`, a list of strings.
- All tokens in the returned list should be lowercase.
- For example, when we run
```python
>>> doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
>>> stop_words = "for a of the and to in on an but at".split()
>>> print(tokenize(doc_b, stop_words))
```
we should get
```
['my', 'mother', 'spends', 'lot', 'time', 'driving', 'my', 'brother', 'around', 'baseball', 'practice.']
```

In [None]:
def tokenize(doc, stop_words):
    """
    Tokenizes a string, removing 'stop_words'.
    
    Parameters
    ---------
    doc: A string.
    stop_words: A list of strings.
    
    Returns
    -------
    A list of tokens.
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
stop_words = "for a of the and to in on an but at".split()
tokens_a = tokenize(doc_a, stop_words)
print(tokens_a)

In [None]:
def test_doc_tokens(doc, tokens):
    assert_is_instance(tokens, list)
    assert_true(all(isinstance(t, str) for t in tokens))
    assert_true(all(t in doc.lower() for t in tokens))
    assert_true(all(" " not in t for t in tokens))
    assert_true(all(t not in stop_words for t in tokens))
    
for doc in doc_set:
    tokens = tokenize(doc, stop_words)
    test_doc_tokens(doc, tokens)

Note that our `tokenize` function tokenizes only *one* document, but we want to tokenize *all* documents in `doc_set`, because the `corpora.Dictionary()` function accepts a list of lists, one list for each of our documents. (See the [Introduction to Topic Modeling notebook](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week10/notebooks/intro2nlp-tm.ipynb).)

In [None]:
texts = [tokenize(d, stop_words) for d in doc_set]
print(texts)

(Note that we have taken a slightly different approach from the one used in the [Introduction to Topic Modeling notebook](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week10/notebooks/intro2nlp-tm.ipynb), which used nested list comprehensions. Another difference is that the earlier notebook kept only tokens that appear more than once, whereas here we use all tokens, even those that appear only once.)

Now that we have a list of token lists, one for each document, we are ready to use `corpora.Dictionary`, the first step toward constructing a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). `Dictionary()` goes through each text and assigns a unique integer ID to each unique token. At the same time, it collects corpus statistics, such as how many documents each token appears in. The result is a mapping (i.e., a dictionary) from each token to its integer ID, available as `dictionary.token2id`.
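
To make the ID assignment concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation: the helper `build_token2id` and the toy data are hypothetical names introduced only for illustration, and the IDs that gensim assigns may come out in a different order.

```python
def build_token2id(texts):
    """Assign a unique integer ID to each unique token (illustrative only)."""
    token2id = {}
    for text in texts:
        for token in text:
            # A token receives the next free ID the first time it is seen.
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Hypothetical toy input: two already-tokenized documents.
toy_texts = [["good", "brocolli", "good"], ["mother", "brocolli"]]
toy_token2id = build_token2id(toy_texts)
print(toy_token2id)  # {'good': 0, 'brocolli': 1, 'mother': 2}
```

Repeated tokens keep the ID they were given on first sight, so the mapping has one entry per unique token regardless of how often it occurs.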

In [None]:
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

In [None]:
print(dictionary.token2id["brocolli"])

The `doc2bow()` method uses the dictionary to convert each tokenized document into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation. The result is a corpus: a list of lists, one per document, where each inner list is a list of tuples.

In [None]:
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

For example, `corpus[0]` represents our first document, `doc_a`.

In [None]:
print(corpus[0])

The tuples are of the form (term ID, term frequency), so if
```python
>>> print(dictionary.token2id["brocolli"])
```
says the ID of `brocolli` is 0 (this ID may differ between runs of the notebook), then the tuple `(0, 2)` indicates that `brocolli` appeared twice in `doc_a`.
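
The (term ID, term frequency) format can be reproduced with a short pure-Python sketch. This only illustrates the bag-of-words idea and is not how gensim implements `doc2bow()`; the helper `to_bow`, the toy mapping, and the toy tokens are hypothetical.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (term ID, term frequency) pairs for one tokenized document."""
    # Count occurrences per ID, skipping tokens that are not in the mapping.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical toy mapping and document, chosen to mirror the example above.
toy_token2id = {"brocolli": 0, "good": 1, "is": 2}
toy_tokens = ["brocolli", "is", "good", "good", "brocolli"]
toy_bow = to_bow(toy_tokens, toy_token2id)
print(toy_bow)  # [(0, 2), (1, 2), (2, 1)]
```

Here the tuple `(0, 2)` again says that the token with ID 0 occurs twice in the document. Word order is discarded, which is what makes this a "bag" of words.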

With the document-term matrix (`corpus`), we can construct a topic model. In the following code cell, we use latent Dirichlet allocation (LDA). To learn more about LDA, see for example [Topic Modeling and Digital Humanities](http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/).

In [None]:
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)
print(lda_model.print_topics(num_topics=2, num_words=3))

When I ran this,
```python
>>> lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)
>>> print(lda_model.print_topics(num_topics=2, num_words=3))
```
I got
```
[(0, '0.078*good + 0.055*brocolli + 0.055*is'), (1, '0.073*my + 0.041*brother + 0.040*mother')]
```

(Because LDA starts from a random initialization, the output will be slightly different every time you run the notebook.)

We have two topics, separated by a comma. Each topic lists the three words that are most likely to appear in that topic, each preceded by its weight. Usually, topic modeling requires a large set of documents, but our model looks reasonable even with our small document set: "good" and "brocolli" make sense together in the first topic, and "brother" and "mother" in the second topic also seem reasonable.
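
If you want to work with the words and weights directly rather than read them off the string, one option is to split each topic string apart. The sketch below is an illustration only: `parse_topic` is a hypothetical helper, the sample strings are copied from the output shown above, and the quote stripping assumes that some gensim versions wrap each word in double quotes.

```python
def parse_topic(topic_string):
    """Split a 'weight*word + weight*word' string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        # Strip surrounding double quotes in case the word is quoted.
        pairs.append((word.strip('"'), float(weight)))
    return pairs


# Topic strings copied from the sample output above.
sample_topics = [
    (0, "0.078*good + 0.055*brocolli + 0.055*is"),
    (1, "0.073*my + 0.041*brother + 0.040*mother"),
]
parsed = {topic_id: parse_topic(s) for topic_id, s in sample_topics}
for topic_id, pairs in parsed.items():
    print(topic_id, pairs)
```

Each topic becomes a list of (word, weight) pairs in the same order as the original string, so the first pair is the most heavily weighted word of that topic.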