# Topic Modeling with Gensim

In [None]:
import pandas as pd

# import packages for text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re

from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

We start by setting up our corpus. The toy corpus has 6 documents: three about enterprise systems and three restaurant reviews.

In [None]:
texts = ["Human machine interface enterprise resource planning system quality processing management",
         "management processing quality enterprise resource planning systems is user management",
         "human engineering testing of enterprise resource planning system processing quality management",
         "food desert poor staff good service cheap price bad location restaurant",
         "good service poor food restaurant staff bad service price desert good location",
         "restaurant poor service bad food desert staff bad service high price good location"
         ]

In [None]:
# replace every run of non-letter characters (digits, punctuation) with a space
documents = [re.sub("[^a-zA-Z]+", " ", text) for text in texts]
# lowercase and tokenize on whitespace
texts = [text.lower().split() for text in documents]
# lemmatize (default noun POS): systems --> system; friends --> friend
lmtzr = WordNetLemmatizer()
texts = [[lmtzr.lemmatize(word) for word in text] for text in texts]
# remove common English stop words
stoplist = set(stopwords.words('english'))
texts = [[word for word in text if word not in stoplist] for text in texts]
# remove words shorter than 3 characters
texts = [[word for word in tokens if len(word) >= 3] for tokens in texts]
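
The same cleaning steps are applied again later to unlabeled documents, so it can help to wrap them in one function. The block below is a minimal sketch, not part of the original notebook: the name `preprocess` and its parameters `lemmatize`, `stoplist` and `min_len` are hypothetical, and the defaults (no lemmatization, empty stop list) are chosen only so the example runs without NLTK data.

```python
import re

def preprocess(raw_texts, lemmatize=lambda w: w, stoplist=frozenset(), min_len=3):
    """Clean and tokenize raw strings; returns one list of tokens per document."""
    processed = []
    for text in raw_texts:
        # replace every run of non-letter characters with a single space
        letters_only = re.sub("[^a-zA-Z]+", " ", text)
        # lowercase, split on whitespace, then lemmatize each token
        tokens = [lemmatize(word) for word in letters_only.lower().split()]
        # drop stop words and tokens shorter than min_len characters
        tokens = [w for w in tokens if w not in stoplist and len(w) >= min_len]
        processed.append(tokens)
    return processed

# hypothetical sample sentence, with a tiny hand-written stop list
sample = preprocess(["ERP systems, version 2: is it user-friendly?"],
                    stoplist={"is", "it"})
print(sample)
```

To reproduce the cell above, one would presumably pass `lemmatize=WordNetLemmatizer().lemmatize` and `stoplist=set(stopwords.words('english'))`.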

In [None]:
# map each unique token to an integer id, then convert each document
# to a bag of words: a list of (token_id, count) pairs
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
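
To make the bag-of-words format concrete, here is a pure-Python sketch of the idea behind `Dictionary` and `doc2bow`. It is an illustration only, not Gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and the ids Gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# two tiny hypothetical documents
docs = [["good", "service", "bad", "service"], ["bad", "food"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'good': 0, 'service': 1, 'bad': 2, 'food': 3}
print(bows)      # [[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1)]]
```

Word order is discarded: each document keeps only which tokens occur and how often, which is all LDA needs.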

We now fit the LDA model on the corpus. We set the number of topics to 2 and expect one topic to be about restaurants and the other about enterprise systems.

# Optimal k value

# LDA Model Building

# Prints the topics.

This exercise has shown you how to perform topic modeling with Gensim. The results show two hidden topics in the data: one about **restaurants** and the other about **enterprise** systems.

# Assigns the topics to the documents in corpus

# Appendix 1

In [None]:
import pyLDAvis.gensim  # in pyLDAvis >= 3.3 this module is named pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(model, corpus, dictionary)

# Appendix 2 

We want to show off the `get_term_topics` and `get_document_topics` functionalities, and a good way to do so is to look up words that clearly belong to one of the two topics.

The words `service` and `system` are good candidates here. In the toy corpus presented, there are 6 documents, 3 enterprise related and 3 restaurant related: `service` occurs only in the restaurant documents and `system` only in the enterprise documents.

### get_term_topics

The function `get_term_topics` returns the probability of a particular word belonging to each topic, as a list of `(topic_id, probability)` pairs.
A few examples:

In [None]:
model.get_term_topics('service')

This makes sense: the probability of `service` belonging to `topic_0` (restaurant) is much higher.

In [None]:
model.get_term_topics('system')

This also works out well: the word `system` is more likely to be in `topic_1`, which has to do with enterprise systems.

### get_document_topics (Predictive Analytics)

`get_document_topics` is an already existing gensim functionality which uses the `inference` function to get the sufficient statistics and figure out the topic distribution of the document.

The addition to this is the ability to see the topic distribution for each word in the document.
Let us test this with two short documents, one in the restaurant context and one in the enterprise context.

When `per_word_topics` is set to `True`, the `get_document_topics` method returns, along with the standard document topic proportion, each word id followed by a list of topic ids sorted from most to least likely.

In [None]:
bow_resturant = ['bad','food','location']
bow_enterprise = ['quality','system','resource']

In [None]:
bow = model.id2word.doc2bow(bow_resturant) # convert to bag of words format first
print(bow)

In [None]:
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

word_topics

Now what does that output mean? It means that all three words are more likely to be in `topic_0` than `topic_1`.

In [None]:
doc_topics

In [None]:
phi_values

The document `bow_resturant = ['bad','food','location']` most likely belongs to `topic_0` (restaurant).
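
The raw output uses integer ids, which can be hard to read. The sketch below shows one way to turn it into something more readable. All values here are hypothetical and only mimic the shape of what `get_document_topics(bow, per_word_topics=True)` returns; the word ids and proportions are not taken from an actual run.

```python
# Hypothetical values, shaped like the output of
# model.get_document_topics(bow, per_word_topics=True)
id2word = {0: "bad", 4: "food", 6: "location"}          # word id -> word
doc_topics = [(0, 0.86), (1, 0.14)]                     # (topic_id, proportion)
word_topics = [(0, [0, 1]), (4, [0, 1]), (6, [0, 1])]   # (word_id, topic ids, most likely first)

# most likely topic for each word: first entry of its sorted topic list
top_topic_per_word = {id2word[word_id]: topics[0]
                      for word_id, topics in word_topics if topics}

# dominant topic of the whole document: the pair with the highest proportion
dominant_topic, proportion = max(doc_topics, key=lambda pair: pair[1])

print(top_topic_per_word)
print(dominant_topic, proportion)
```

With a trained model, `model.id2word` would play the role of the `id2word` dict used here.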

Now that we know exactly what `get_document_topics` does, let us do the same with our second document, `bow_enterprise`.

In [None]:
bow = model.id2word.doc2bow(bow_enterprise) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics

In [None]:
phi_values

In [None]:
doc_topics

The new (or unlabeled) document "quality system resource" is classified as `topic_1` (enterprise).

### Predicting the topic distribution of a new (or unlabeled) document: Another Example

In [None]:
unlabeled = ["poor service, looks bad restaurant management, menu is overpriced",
            "management of enterprise resource planning systems"]

In [None]:
# replace every run of non-letter characters (digits, punctuation) with a space
unlabeled = [re.sub("[^a-zA-Z]+", " ", text) for text in unlabeled]
# lowercase and tokenize on whitespace
unlabeled = [text.lower().split() for text in unlabeled]
# lemmatize (default noun POS): systems --> system; friends --> friend
lmtzr = WordNetLemmatizer()
unlabeled = [[lmtzr.lemmatize(word) for word in text] for text in unlabeled]
# remove common English stop words
stoplist = set(stopwords.words('english'))
unlabeled = [[word for word in text if word not in stoplist] for text in unlabeled]
# remove words shorter than 3 characters
unlabeled = [[word for word in tokens if len(word) >= 3] for tokens in unlabeled]

In [None]:
unlabeled

In [None]:
for i in unlabeled:
    print(model.id2word.doc2bow(i))

In [None]:
for i in unlabeled:
    bow = model.id2word.doc2bow(i)
    doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
    print(doc_topics)

The first document is classified as `topic_0` (restaurant).

The second document is classified as `topic_1` (enterprise).