# [LEGALST-190] Lab 4/10: Topic Models

This lab will cover latent dirichlet allocation and topic models using `gensim` and `scikit-learn`.

*Estimated Time: 35 Minutes *

### Table of Contents
[The Data](#section data)<br>
1 - [Using Gensim to Implement a LDA Model](#section 1)<br>
2 - [Using scikit-learn](#section 2)<br>
3 - [Finding topics from UN Debates](#section 3)<br>

**Dependencies:**

In [46]:
import string

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

!pip install gensim
from gensim import corpora, models

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import numpy as np
import pandas as pd

from helper import *

[33mYou are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


----
## The Data<a id='section data'></a>

For this lab, we'll use sci-kit learn's `20 newsgroups` dataset, which is a list of approximately 18,000 newsgroup posts. Because of its size, we'll only be working with about 750 posts. At the end of this lab, we'll also work with a selected portion of the UN Data. 

----

## Section 1: Using Gensim to Implement a LDA Model<a id='section 1'></a>

### What Is Latent Dirichlet Allocation?
Latent dirichlet allocation is a way of discovering topics in a set of documents, generating topics based on word frequency. LDA is a probabilistic bag-of-words model that makes an assumption that documents are produced from a variety of topics that produce words with certain probilities. Then it backtracks, finding a set of certain topics that would have created the documents.

----

### Using `gensim`

We'll use the LDA algorithm from `gensim`, a python library for topic modelling.

Let's get working with the data. The `20 newsgroups` data is under the name `20newgroups_data.csv` in the data folder. 

**Question 1.1:** Retrieve the posts from the DataFrame and assign the list to a variable named `documents`.

In [47]:
data = pd.read_csv('data/20newsgroups_data.csv')
data.head()

Unnamed: 0,posts
0,Well i'm not sure about the story nad it did s...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re..."
2,Although I realize that principle is not one o...
3,Notwithstanding all the legitimate fuss about ...
4,"Well, I will have to change the scoring on my ..."


In [48]:
documents = data['posts']
type(documents)

pandas.core.series.Series

Awesome! We now have data we can work with. Before we start anything, we must clean the text.

Just to review, we want to process our text by:<br>
1) Tokenizing our document<br>
2) Removing stop words (remove meaningless words)<br>
3) Stemming or merging words that have equivalent meanings<br>

<a id='gensim'></a>**Question 1.2:** Tokenize and stem the text in `documents` and filter your tokens by using both `stop` and `more_stops` and punctuation.

In [49]:
### I hate iterating across objects; it is totally opaque
###   guess I need practice

stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"]
tokenized = []    #initialize list for cleaned up documents
for doc in documents:
    tokens = [word.lower() for sent in nltk.sent_tokenize(doc) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x not in punctuation and x not in more_stops]
    stopped_tokens = [x for x in filtered_tokens if not x in stop]
    stemmed_tokens = [stemmer.stem(i) for i in stopped_tokens]
    tokenized.append(stemmed_tokens)
tokenized[0:5]    

[['well',
  'sure',
  'stori',
  'nad',
  'seem',
  'bias',
  'disagre',
  'statement',
  'u.s.',
  'media',
  'ruin',
  'israel',
  'reput',
  'redicul',
  'u.s.',
  'media',
  'pro-isra',
  'media',
  'world',
  'live',
  'europ',
  'realiz',
  'incid',
  'one',
  'describ',
  'letter',
  'occur',
  'u.s.',
  'media',
  'whole',
  'seem',
  'tri',
  'ignor',
  'u.s.',
  'subsid',
  'israel',
  'exist',
  'european',
  'least',
  'degre',
  'think',
  'might',
  'reason',
  'report',
  'clear',
  'atroc',
  'shame',
  'austria',
  'daili',
  'report',
  'inhuman',
  'act',
  'commit',
  'isra',
  'soldier',
  'bless',
  'receiv',
  'govern',
  'make',
  'holocaust',
  'guilt',
  'go',
  'away',
  'look',
  'jew',
  'treat',
  'race',
  'got',
  'power',
  'unfortun'],
 ['yeah',
  'expect',
  'peopl',
  'read',
  'faq',
  'etc',
  'actual',
  'accept',
  'hard',
  'atheism',
  'need',
  'littl',
  'leap',
  'faith',
  'jimmi',
  'logic',
  'run',
  'steam',
  'jim',
  'sorri',
  'ca',


Now that we have our tokenized documents, we have to convert it to a *document-term matrix* which can be done by instantiating a `gensim` dictionary object. Our first step is to turn our tokenized documents into a "dictionary" that maps a word to its integer ID, like a bag-of-words model. <a id='Q1.3'></a>

**Question 1.3:** Implement a gensim dictionary from the `corpora` package and assign it to a variable named `dictionary`. You can look [at the documentation](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary) for the corpora package if necessary.

In [50]:
dictionary = corpora.Dictionary(tokenized)
type(dictionary)

gensim.corpora.dictionary.Dictionary

This is the last step before we implement the model! We must convert our documents to bag-of-words format using our dictionary. Every document is represented as a list of tuples of the word's integer ID and its frequency. This list of 400 documents represents our document-term matrix.

**Question 1.4:** Using `dictionary` from the previous question, convert to your tokenzied documents into a bag-of-words format and store it to a variable named `corpus`. We want to use `doc2bow()` method ***for every document*** in our tokenized text. The documentation is linked [here](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow).

You should end up with a list of tuples for each document and calling `corpus[i]` for some integer i should return something of the form <br>`[(16, 2), (58, 1), (59, 1),...`

In [51]:
corpus = []
corpus = [dictionary.doc2bow(txt) for txt in tokenized]
corpus[7]

[(0, 1),
 (11, 1),
 (20, 2),
 (34, 1),
 (44, 1),
 (49, 1),
 (68, 1),
 (83, 1),
 (85, 1),
 (92, 2),
 (99, 1),
 (112, 1),
 (114, 2),
 (115, 2),
 (119, 1),
 (121, 1),
 (196, 1),
 (222, 1),
 (233, 1),
 (298, 3),
 (313, 1),
 (609, 1),
 (610, 1),
 (611, 1),
 (612, 1),
 (613, 1),
 (614, 1),
 (615, 1),
 (616, 3),
 (617, 1),
 (618, 1),
 (619, 1),
 (620, 1),
 (621, 1),
 (622, 2),
 (623, 1),
 (624, 1),
 (625, 1),
 (626, 1),
 (627, 1),
 (628, 1),
 (629, 1),
 (630, 1),
 (631, 1),
 (632, 1),
 (633, 1),
 (634, 1),
 (635, 1),
 (636, 1),
 (637, 1),
 (638, 1),
 (639, 1),
 (640, 1),
 (641, 1),
 (642, 1),
 (643, 1),
 (644, 1),
 (645, 1),
 (646, 1),
 (647, 1),
 (648, 1),
 (649, 1),
 (650, 1)]

Now that we have a document-term matrix, we’re ready to generate an LDA model!

The cell below is an example of an implementation of a `gensim` LDA model. Run the cell and take a look at what it displays.

In [52]:
ldamodel = models.LdaModel(corpus, 
                           id2word=dictionary,
                           num_topics=6,
                           chunksize=270, 
                           update_every=20,
                           passes=3)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel)

Topic 0
0.008*"one" + 0.006*"would" + 0.006*"peopl" + 0.004*"like" + 0.004*"think" + 0.004*"time" + 0.004*"year" + 0.004*"know" + 0.004*"get" + 0.004*"go"

Topic 1
0.006*"use" + 0.004*"2" + 0.003*"get" + 0.003*"one" + 0.003*"would" + 0.003*"go" + 0.003*"gm" + 0.003*"1" + 0.003*"peopl" + 0.003*"3"

Topic 2
0.006*"would" + 0.005*"use" + 0.004*"peopl" + 0.004*"one" + 0.003*"get" + 0.003*"key" + 0.003*"like" + 0.002*"church" + 0.002*"think" + 0.002*"2"

Topic 3
0.008*"would" + 0.007*"use" + 0.006*"one" + 0.005*"like" + 0.004*"get" + 0.003*"2" + 0.003*"1" + 0.003*"system" + 0.003*"know" + 0.003*"pleas"

Topic 4
0.012*"1" + 0.009*"2" + 0.005*"use" + 0.004*"3" + 0.003*"one" + 0.003*"also" + 0.003*"would" + 0.003*"copi" + 0.003*"get" + 0.003*"key"

Topic 5
0.004*"peopl" + 0.003*"use" + 0.003*"time" + 0.003*"like" + 0.003*"know" + 0.003*"one" + 0.003*"would" + 0.003*"hiv" + 0.003*"file" + 0.003*"even"



----

Our model returned some topics! We have a jumble of words and numbers, but remember that LDA is a probabilistic model. For clarity, the `show_topics` function utilizes the `LdaModel` method `.show_topics()` which gives us the words that contribute the most to `num_topics` random topics (random because we aren't defining the topics). The numbers in front the words represent the probability of that word appearing in the topic. If we look at **topic 2**, we can infer from the proportions in front of the word that the topic is more about 1 and 2 rather than "people", which has a lower value than the first two. 

Although we aren't explicitly defining the topics, we are telling the computer how many topics to look for. LDA treats each document as a mix of words and a mix of topics. It chooses words that contribute to a topic and finds certain topics that describe a document.

Unfortunately, we are working with a pretty small set of ducments and our topics would be more defined with a larger corpus, but we can still work with our data to get defined topics. Let's start defining more optimal parameters for our model!

----

There are quite a few parameters for generating the `LdaModel` that affect the quality of the topics returned. We'll go over some of the helpful and important parameters to use when implementing a LDA model in `gensim`.

| Required Parameters        |Value                          | Default | What it does  |
| :-------------------------:|:-----------------------------:| --|:-------------:|
|                corpus      | corpus (doc-term matrix) | None | This specifies your LDA model parameters. |
| id2word     | `gensim` dictionary | None | The doc-term matrix to word <br>integer ID mapping. |
| num_topics<br> | integer | 100 |Specifies the number of underlying topics <br> in your documents. Usually, the fewer <br> documents you have, the smaller number <br> you assign [(this is a hot topic!)](https://www.quora.com/Latent-Dirichlet-Allocation-LDA-What-is-the-best-way-to-determine-k-number-of-topics-in-topic-modeling).<br>|


| Optional (but helpful) Parameters   |Default | What it does  |
|:------------------------------------|------- |:-------------------------------:|
| passes | 1 | How many times you want to iterate through the corpus.<br> The more passes, the more accurate your model will be, <br> although  it can take longer time if you have a large dataset. |
|chunksize | 2000 | The size of the batch documents you want to run through.<br> e.g. chunksize = 10, we run 10 documents at a time.| 
|update_every |1 | Update the model after every `n` number of chunks. |



**Question 1.5:** After reviewing the parameters, go back to the previous model. What is problematic about it and its results? Which parameters do you think you should change first to get more explicit results?

### The major problem is that there are similar words in each of the topics identified. If we increased the number of passes, my guess is that the model would converge on words that differ among topics. I'm not sure if changing the batch size would change the result, although updating the model more frequently would seem to update the probabilities and therefore lead to clearer categories.

<a id='Q1.6a'></a> **Question 1.6a:** Improve the previous LDA model by adjusting the parameters  [(documentation)](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel). Try to get as clear topics as possible! Remember to call `show_topics` on your model to print out the words for your topics.

In [53]:
ldamodel2 = models.LdaModel(corpus, 
                           id2word=dictionary,
                           num_topics=6,
                           chunksize=270, 
                           update_every=20,
                           passes=50)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel2)

Topic 0
0.007*"god" + 0.007*"one" + 0.006*"would" + 0.006*"use" + 0.006*"like" + 0.006*"know" + 0.004*"good" + 0.004*"peopl" + 0.004*"need" + 0.004*"think"

Topic 1
0.005*"space" + 0.005*"orbit" + 0.004*"use" + 0.004*"launch" + 0.004*"year" + 0.003*"one" + 0.003*"would" + 0.003*"build" + 0.003*"mission" + 0.003*"key"

Topic 2
0.008*"peopl" + 0.007*"would" + 0.006*"go" + 0.006*"one" + 0.004*"use" + 0.004*"know" + 0.004*"think" + 0.004*"govern" + 0.004*"time" + 0.003*"say"

Topic 3
0.010*"1" + 0.009*"2" + 0.006*"would" + 0.006*"use" + 0.006*"get" + 0.006*"one" + 0.005*"like" + 0.005*"3" + 0.004*"think" + 0.004*"car"

Topic 4
0.004*"new" + 0.003*"back" + 0.003*"would" + 0.003*"list" + 0.003*"book" + 0.003*"like" + 0.002*"problem" + 0.002*"window" + 0.002*"get" + 0.002*"want"

Topic 5
0.006*"file" + 0.006*"imag" + 0.005*"send" + 0.004*"mail" + 0.004*"format" + 0.004*"comput" + 0.004*"graphic" + 0.004*"also" + 0.004*"use" + 0.004*"includ"



**Question 1.6b:** What are some topics that you can infer from your optimized model?<p>
<p>
<b>It looks like I was right about the number of interations improving the differentiation, and it looks like Topic 0 is software, Topic 1 is "sending image files", Topic 2 is "human/machine interaction", Topic 3 is "copies/numbers", Topic 4 is maybe inference about people, and Topic 5 is "HIV/AIDS"

In [54]:
ldamodel3 = models.LdaModel(corpus, 
                           id2word=dictionary,
                           num_topics=3,
                           chunksize=400, 
                           update_every=20,
                           passes=100)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel3)

Topic 0
0.006*"use" + 0.004*"file" + 0.004*"send" + 0.004*"imag" + 0.004*"graphic" + 0.003*"system" + 0.003*"also" + 0.003*"mail" + 0.003*"program" + 0.003*"format"

Topic 1
0.006*"would" + 0.006*"use" + 0.005*"get" + 0.005*"one" + 0.004*"like" + 0.004*"year" + 0.003*"key" + 0.003*"peopl" + 0.003*"make" + 0.003*"car"

Topic 2
0.007*"one" + 0.006*"would" + 0.005*"peopl" + 0.005*"1" + 0.004*"2" + 0.004*"like" + 0.004*"time" + 0.004*"think" + 0.004*"know" + 0.004*"go"



**Question 1.6c:** Did you notice any patterns while changing values of certain parameters (how did num_topics change the quality of results)? What worked in giving you reasonable, clear topics and what didn't? 

### It does not seem like changing the number of topics (so that I would get fewer) helped all that much; more iterations helped in conjunction with the 6 topics above. I am not sure what changing the chunk size or the model update frequency would do, but I expect that updating the model more frequently might allow it to converge faster? I'd have to read up on it.

----

## Section 2: Using `scikit-learn`<a id='section 2'></a>

Along with `gensim`,  we can also use `scikit-learn` to implement a LDA model. Using the `scikit-learn` algorithm is less clear, since a lot of the work is done by the computer. But, by going through the `gensim` algorithm, we now have an idea how LDA works, and using the `scikit-learn` algorithm will be a little more clear. 

You may be wondering, *why did we do all of that work in the first section when we can just use `scikit-learn` to implement a LDA model?* 

The motivation to use `gensim` is that you have much more control over how you can implement your model. This is a really big benefit if you want to find topics without having too much overlap, repetition, or inconsistency. For example, with `gensim`, you can tokenize and stem your documents in any way you want. But with `scikit-learn`, you don't have as much control over how you manipulate your data. Another example is the ability to filter your `gensim` dictionary instance, which you will explore later in the last section. If you prefer a more explicit method of implementing a topic model and want command over your data, then `gensim` is a great option.

Anyway, let's get started with using `scikit-learn`!

----

**Question 2.1:** In order to implement a LDA model using `scikit-learn`, we must extract features to a matrix using either the count vectorizer or the tf-idf vectorizer. Which one do we use and why?

### so use count vectorizer because tf-idf vectorizer standardizes the data and we don't want that?


----

If you answered a count vectorizer to the previous question, you're right! Since LDA is a probabilistic model, we only need the raw term counts.

<a id='sklearn'></a>

**Question 2.2:** Instantiate a count vectorizer with the parameters `max_df=.95`, `min_df=2`, and `stop_words='english'`.

In [55]:
cv = CountVectorizer(max_df=.95, min_df=2, stop_words='english')

With the vectorizer, we can transform our data into a document-term matrix, as well as use `.get_feature_names()` to get the word to integer ID mapping like we did in [question 1.3](#Q1.3).

**Question 2.3:** Use your vectorizer to transform the same dataset from the first section of this lab to a document-term matrix (concept is similar to Q1.4) and get the feature names.

In [56]:
dtm = cv.fit_transform(documents)
features = cv.get_feature_names()

We're almost done! The last step is to implement the model, so we can get our topics. Similar to `gensim`, there are parameters that should be adjusted to fit your documents.

| Parameters <br> [(documentation)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)|Default | What it does  |
| :-------------------------:|:-----------------------------:|:-------------:|
|          n_components      | 10 | Equivalent to `num_topics` in the `gensim` model. This <br> specifies the number of latent topics in your documents. |
| max_iter | 10 | Equivalent to `passes` in the `gensim` model. |
| batch_size | 128 | Equivalent to `chunksize` |

**Question 2.4:** Implement the LDA model using `LatentDirichletAllocation`. Don't forget to fit your document-term matrix from the previous question!

**Note:** Set the `learning_method` parameter to `'online'`, which is its default (for the latest version of scikit-learn) but will throw a deprecation warning if not specified.

In [57]:
LDA = LatentDirichletAllocation(n_topics=5, max_iter=25, batch_size=128, learning_method='online')
# got an error message for using keyword 'n_components' but default is 10
LDA_model = LDA.fit(dtm)
LDA_model

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=25, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

A function `topic_words` is defined for you (from `helper.py`) and it takes in three arguments:

     1) model: your LDA model
     2) feature_names: the feature names from the vectorizer
     3) num_top_words: number of words you want displayed

It prints out the topic number and the words that fall under that topic, although it does not display weight of words like `gensim`.

**Question 2.5:** Specify the number of words you want displayed and call `topic_and_words` on the LDA model from the previous question. 

<sub>**Note:** If your topics are repetitive or aren't very coherent, try tweaking the parameters in the previous question.</sub>

In [58]:
topic_words(LDA_model, features, 10)

Topic 0:
10 data software new 15 aids space windows 00 information
Topic 1:
israel game israeli year blues lebanese just years said play
Topic 2:
space years billion new like armenians just people didn dollars
Topic 3:
edu graphics mail pub send com file key use chip
Topic 4:
people don just like think know good god time does


**Question 2.6:** Did your model yield clear, interpretable results? How does it compare to the LDA model you created in [section one](#Q1.6a)?

### It gave somewhat interpretable results; they seem less about the same topic than the LDA model in section one of this lab. I can't say that the results are great, though. I can say that topic zero is about computers and maybe topic 4 is about something like hockey (some sport), but topics 1 and 3 are hard to characterize. Is it that sensitive to parameters?

----
## Section 3: Finding topics from UN General Debates<a id='section 3'></a>

We have two ways of implementing a LDA model, let's try both on the UN General Debates dataset. We can now get an idea of what was discussed at a specific session through topic modelling!

**Question 3.1**: Load `un-general-debates-2015` from the data folder and extract the data from the 'text' column. This file contains the data from the 70th session. 

In [59]:
un = pd.read_csv('data/un-general-debates-2015.csv')
un.head()

Unnamed: 0,session,year,country,text
0,70,2015,CAF,"The Head of State of the Transition, Her Excel..."
1,70,2015,TON,I congratulate Mr. Mogens Lykketoft on his ass...
2,70,2015,AGO,"At the outset, on behalf of the President of A..."
3,70,2015,LAO,"At the outset, I would like to extend my since..."
4,70,2015,BRN,"First of all, I would like to congratulate His..."


In [75]:
un_documents = un['text']
type(un_documents)

pandas.core.series.Series

**Question 3.2:** First implement a LDA model using `gensim`. Follow similar steps from the [first section](#gensim), and adjust your parameters accordingly! Use the `show_topics` function to display your topics.

**Tip:** Use the `filter_extremes(no_below=<int>)` method [(documentation)](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes) on your `gensim` dictionary, which helps filter through tokens based on frequency (in this case it'll keep any tokens contained in at least the specified integer number of documents). Feel free to use other parameters in `filter_extremes` to optimize your topics! 

As mentioned at the beginning of [section 2](#section 2), you really do have a lot of control over your model and I encourage you to utilize these tools to refine your topic model.

In [76]:
stop = stopwords.words('english')
punctuation = string.punctuation
stemmer = SnowballStemmer("english")
more_stops = ['--', '``', "''", "s'", "\'s", "n\'t", "...", "\'m", "-*-", "-|"]
un_tokenized = []    #initialize list for cleaned up documents
for doc in un_documents:
    tokens = [word.lower() for sent in nltk.sent_tokenize(doc) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x not in punctuation and x not in more_stops]
    stopped_tokens = [x for x in filtered_tokens if not x in stop]
    stemmed_tokens = [stemmer.stem(i) for i in stopped_tokens]
    un_tokenized.append(stemmed_tokens)
un_tokenized[0:5]    

[['head',
  'state',
  'transit',
  'excel',
  'ms.',
  'catherin',
  'samba',
  'panza',
  'spoken',
  'general',
  'assembl',
  'person',
  'order',
  'thank',
  'unit',
  'nation',
  'extrem',
  'valuabl',
  'support',
  'process',
  'transit',
  'central',
  'african',
  'republ',
  'ala',
  'resurg',
  'violenc',
  'sinc',
  '25',
  'septemb',
  'capit',
  'bangui',
  'meant',
  'return',
  'home',
  'earlier',
  'intend',
  'therefor',
  'ask',
  'give',
  'speech',
  'follow',
  'honour',
  'great',
  'pleasur',
  'share',
  'general',
  'assembl',
  'vision',
  'countri',
  'major',
  'issu',
  'face',
  'world',
  'report',
  'develop',
  'situat',
  'central',
  'african',
  'republ',
  'serious',
  'situat',
  'obtain',
  'today',
  'countri',
  'mean',
  'must',
  'spare',
  'address',
  'intern',
  'issu',
  'order',
  'call',
  'attent',
  'world',
  'leader',
  'new',
  'tragedi',
  'affect',
  'peopl',
  'central',
  'african',
  'republ',
  'like',
  'first',
  'sincer

In [77]:
jiten = corpora.Dictionary(un_tokenized)
jiten.filter_extremes(no_below=5)
type(jiten)

gensim.corpora.dictionary.Dictionary

In [78]:
corpus = []
corpus = [jiten.doc2bow(txt) for txt in tokenized]
corpus[7]

[(16, 1),
 (18, 2),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (40, 1),
 (46, 1),
 (50, 1),
 (56, 1),
 (60, 1),
 (72, 1),
 (76, 1),
 (83, 2),
 (84, 1),
 (86, 1),
 (98, 1),
 (113, 1),
 (129, 1),
 (139, 1),
 (142, 3),
 (144, 1),
 (152, 1),
 (168, 2),
 (177, 3),
 (179, 1),
 (181, 1),
 (182, 1),
 (189, 1),
 (198, 1),
 (205, 1),
 (207, 1),
 (208, 2),
 (252, 1),
 (265, 1),
 (267, 1),
 (272, 1),
 (273, 1),
 (286, 1),
 (287, 1),
 (305, 1),
 (312, 1),
 (314, 6),
 (328, 2),
 (330, 2),
 (336, 1),
 (354, 1),
 (359, 1),
 (368, 1),
 (373, 1),
 (392, 1),
 (394, 1),
 (401, 1),
 (412, 1),
 (420, 2),
 (425, 1),
 (426, 1),
 (436, 2),
 (437, 4),
 (447, 3),
 (451, 1),
 (457, 1),
 (469, 4),
 (470, 1),
 (472, 1),
 (492, 1),
 (494, 1),
 (498, 1),
 (507, 2),
 (514, 2),
 (520, 1),
 (535, 2),
 (562, 1),
 (570, 1),
 (572, 3),
 (576, 1),
 (589, 2),
 (592, 1),
 (607, 1),
 (609, 1),
 (611, 1),
 (621, 1),
 (627, 1),
 (632, 1),
 (639, 1),
 (642, 1),
 (643, 2),
 (652, 1),
 (667, 1),
 (670, 1),
 (690, 1),
 (696, 2),
 (705

In [82]:
ldamodel11 = models.LdaModel(corpus, 
                           id2word=jiten,
                           num_topics=5,
                           chunksize=100, 
                           update_every=10,
                           passes=20)
#We have defined a helper function show_topics that takes in the model as an argument.
show_topics(ldamodel11)

Topic 0
0.007*"palestinian" + 0.006*"terrorist" + 0.005*"syrian" + 0.005*"syria" + 0.005*"weapon" + 0.004*"nuclear" + 0.004*"israel" + 0.004*"yemen" + 0.004*"territori" + 0.004*"african"

Topic 1
0.006*"iran" + 0.006*"europ" + 0.006*"want" + 0.005*"say" + 0.005*"syria" + 0.004*"european" + 0.004*"islam" + 0.004*"said" + 0.004*"differ" + 0.004*"could"

Topic 2
0.010*"island" + 0.007*"small" + 0.004*"ocean" + 0.004*"impact" + 0.003*"vulner" + 0.003*"partnership" + 0.003*"pacif" + 0.003*"financ" + 0.003*"per" + 0.003*"sid"

Topic 3
0.005*"peacekeep" + 0.004*"crime" + 0.004*"european" + 0.004*"syria" + 0.004*"nuclear" + 0.004*"oper" + 0.004*"crise" + 0.004*"europ" + 0.003*"ukrain" + 0.003*"iraq"

Topic 4
0.008*"african" + 0.006*"democrat" + 0.005*"educ" + 0.005*"korea" + 0.004*"america" + 0.004*"central" + 0.004*"mali" + 0.004*"per" + 0.003*"cent" + 0.003*"health"



### this does not seem to do much better in terms of interpretability--the first topic has something to do with terrorism, rogue states, etc., but it is very hard to imagine what the other topics are

**Question 3.3:** Now, implement a model using `scikit-learn`. Again, follow similar steps from the [second section](#sklearn) and adjust parameters accordingly. You can display your topics using `topic_words`.

In [83]:
un_cv = CountVectorizer(max_df=.95, min_df=2, stop_words='english')

In [87]:
un_dtm = un_cv.fit_transform(un_documents)
un_features = un_cv.get_feature_names()

In [90]:
un_LDA = LatentDirichletAllocation(n_topics=3, max_iter=50, batch_size=99, learning_method='online')
# got an error message for using keyword 'n_components' but default is 10
un_LDA_model = un_LDA.fit(un_dtm)
un_LDA_model

LatentDirichletAllocation(batch_size=99, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=50, mean_change_tol=0.001,
             n_jobs=1, n_topics=3, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [91]:
topic_words(un_LDA_model, un_features, 15)

Topic 0:
countries states human country global rights new sustainable years climate general change support agenda council
Topic 1:
solomon societies uganda islands external sdgs caledonia melanesian transformation factors convergence prescription relying partly shelf
Topic 2:
israel iran argentina nuclear deal country president know cent just islamic state israeli years case


### OK, a bit better in terms of discovering topics, but I am falling out of love with LDA.
I can see that there are a couple of climate change topics and something like a nuclear/middle east topic. This does not seem to reflect the actual topics very well.

----
**Question 3.4:** Which algorithm yielded more well-defined topics? You can also skim the resolutions passed [here](http://research.un.org/en/docs/ga/quick/regular/70) for reference. What do you think are some factors that need to be considered about the data when choosing an algorithm and adjusting its parameters?

<b>The gensim algorithm yielded better defined topics, or at least better separated and more readily understood topics.</b> I suppose that the structure of each document needs to be considered; UN speeches each tend to draw from the same set of topics in a plenary session, so that might make identification harder? 

**Question 3.5:** What are some differences that you noticed between the `gensim` and `scikit-learn` algorithms? What are some of their drawbacks? Do you prefer one over the other and if so, why?

<b>The scikit-learn algorithm is easier to implement since you don't need to go through as many steps, but it gave strange topics; maybe with some parameter tweaking they would have been better. The gensim algorithm gave a result that was easier to understand.</b> I'm not sure I would put that much stock in LDA for classifying something like UN speeches, at this point.

----
Awesome! Now you know how to implement a topic model two ways using `gensim` and `scikit-learn`. Even though `scikit-learn` is more straightforward and requires less work to implement, the control you have over `gensim` is very valuable and can result in more distinct topics.

Ultimately, the choice is yours and I hope having both options helps you generate great topic models!

----

## Bibliography

 - Chen, Edwin, Introduction to latent dirichlet allocation. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
 - Use of `20newsgroups` data set adapted from Topic Modelling tutorial by Aneesha Bakharia
 https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
 - Resolutions from UN 70th session: http://research.un.org/en/docs/ga/quick/regular/70
 - Text cleaning code adapted from notebook by Alex Estes https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb

----
Notebook developed by: Jason Jiang

Data Science Modules: http://data.berkeley.edu/education/modules