## Additional Exercises for 04.03: Topic Modeling

Example adopted from [Aneesha Bakharia](https://medium.com/@aneesha/topic-modeling-with-scikit-learn-e80d33668730).

## Question 0
We're going to use a dataset within scikit-learn do topic modeling.
1. Load in `fetch_20newsgroups` from `sklearn.datasets`.
2. Randomize the dataset.
3. Retrieve the content from the Newsgroups dataset.

*Note:* If this is the first time you're importing `fetech_20newsgroups` from `sklean.datasets`, it will take a minute to download the dataset from a server.

In [4]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [5]:
documents

["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
 "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap

## Question 1
1. Using the parameters below, vectorize `documents` using `CountVectorizer`, similar to what we did in lecture.
2. With the vectorized `documents`, call `fit_transform` on the `documents`.
3. Extract the feature names from the vectorized `documents`.

In [9]:
max_df = 0.95
min_df = 2
n_features = 1000

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

## Question 2
1. Fit the `lda` model below with the features you extracted above.


In [21]:
n_topics = 20
from sklearn.decomposition import NMF, LatentDirichletAllocation
lda = LatentDirichletAllocation(n_topics=n_topics, 
                                max_iter=5,
                                learning_method='online', 
                                learning_offset=50.,
                                random_state=0)

In [22]:
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=20, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

## Question 3
1. Using the `display_topics` function below, display the derived topics from the LDA model above.

In [23]:
def display_topics(model, feature_names, no_top_words = 10):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [24]:
display_topics(lda, tf_feature_names)

Topic 0:
does think people just don believe point time case say
Topic 1:
gun law control guns police rate crime state laws firearms
Topic 2:
said didn did know went just came got people time
Topic 3:
good ve years car year just like really ago got
Topic 4:
use entry section code rules stuff build include int define
Topic 5:
windows window db using display program color screen widget motif
Topic 6:
israel war jews israeli men people military state women land
Topic 7:
space 000 new national research university nasa center health 1993
Topic 8:
god jesus christian bible church christians faith christ life religion
Topic 9:
key chip encryption keys clipper use security public technology bit
Topic 10:
edu file com available ftp files information version list pub
Topic 11:
memory use video bus monitor board ground pc ram need
Topic 12:
armenian turkish armenians w7 cx turkey greek turks armenia hz
Topic 13:
ax max b8f g9v a86 pl 145 1d9 0t 34u
Topic 14:
game team games play season hockey leag

## Question 4
1. Play around with the parameters above and see what kind of results you'll get.