# Topic Modeling

### Author: [Marco Tavora](http://www.marcotavora.me/)


## Table of contents

- [Introduction](#Introduction)

- [Libraries](#Libraries)

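- [The Problem Domain](#The-Problem-Domain)

- [Importing the documents](#Importing-the-documents)

- [List of texts](#List-of-texts)

- [Cleaning the text](#Cleaning-the-text)

- [Document-term matrix](#Document-term-matrix)

- [Tokens into dictionary](#Tokens-into-dictionary)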

In [11]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

<a id = 'Introduction'></a>
## Introduction

[[go back to the top]](#Table-of-contents)

In this notebook, I will use Python and its libraries for **topic modeling**. In topic modeling, statistical models are used to identify topics or categories in a document or a set of documents. I will use one specific method called **Latent Dirichlet Allocation (LDA)**. This algorithm can be summarized as follows:
- First, we choose a fixed number $T$ of topics, without any prior knowledge of what the topics are
- We then randomly assign each word to a topic
- For each document $d$, word $w$ and topic $t$, we calculate the probability $P(t\,|\,w,d)$ that the word $w$ of document $d$ belongs to topic $t$
- We then reassign each word $w$ to a topic based on $P(t\,|\,w,d)$ and repeat the process until the assignment of words to topics stabilizes
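
The steps above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration under stated assumptions: the function name `toy_lda`, the smoothing parameters `alpha` and `beta`, the fixed number of iterations and the tiny corpus are all hypothetical, and this is not how `gensim` fits the model later in this notebook (its `LdaModel` uses online variational Bayes).

```python
import random

def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler: returns one topic label per word occurrence."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 2: randomly assign every word occurrence to a topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables derived from the current assignments.
    doc_topic = [[0] * num_topics for _ in docs]      # words of doc d in topic t
    topic_word = [dict() for _ in range(num_topics)]  # occurrences of word w in topic t
    topic_total = [0] * num_topics                    # all words in topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this occurrence out of the counts before re-sampling it.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 3: unnormalized P(t | w, d) for every topic t.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 4: draw a new topic in proportion to those weights.
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] = topic_word[t_new].get(w, 0) + 1
                topic_total[t_new] += 1

    return assignments

# Hypothetical three-document corpus, already tokenized.
docs = [["holmes", "watson", "holmes"], ["window", "night", "window"], ["holmes", "night"]]
labels = toy_lda(docs, num_topics=2)
print(labels)
```

The sketch stops after a fixed number of sweeps instead of testing for convergence, which keeps it short but is a simplification.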

<a id = 'Libraries'></a>
## Libraries  

[[go back to the top]](#Table-of-contents)

This notebook uses the following packages:

- `spacy`
- `nltk`
- `random`
- `gensim`
- `pickle`
- `pandas`
- `sklearn`
- `stop_words`
- `pyLDAvis`

<a id ='The-Problem-Domain'></a>
## The Problem Domain

[[go back to the top]](#Table-of-contents)

In this project I apply LDA to a collection of [Sherlock Holmes](https://en.wikipedia.org/wiki/Sherlock_Holmes) short stories obtained from Project Gutenberg with the [gutenbergr](https://github.com/ropenscilabs/gutenbergr) package.

<a id ='Importing-the-documents'></a>
## Importing the documents
[[go back to the top]](#Table-of-contents)

In [40]:
df = pd.read_csv('sherlock.csv',index_col=0)
df.dropna(inplace=True)
df.head()

|   | gutenberg_id | text | story |
| --- | --- | --- | --- |
| 1 | 1661 | ADVENTURE I. A SCANDAL IN BOHEMIA | ADVENTURE I. A SCANDAL IN BOHEMIA |
| 3 | 1661 | I. | ADVENTURE I. A SCANDAL IN BOHEMIA |
| 5 | 1661 | To Sherlock Holmes she is always THE woman. I ... | ADVENTURE I. A SCANDAL IN BOHEMIA |
| 6 | 1661 | him mention her under any other name. In his e... | ADVENTURE I. A SCANDAL IN BOHEMIA |
| 7 | 1661 | and predominates the whole of her sex. It was ... | ADVENTURE I. A SCANDAL IN BOHEMIA |


Notice that in the **text** column, the first row of each story is the title of the story. We will remove those rows, along with the `gutenberg_id` and `story` columns:

In [32]:
# df['story'].unique().tolist()
# len(df['story'].unique().tolist())
# df[df['text']==df['story']].shape

In [41]:
# Drop the rows whose text is just the story title
df = df[df['text'] != df['story']]
# Keep only the text column
df = df.drop(['gutenberg_id', 'story'], axis=1)
df.reset_index(inplace=True, drop=True)
df.head()

|   | text |
| --- | --- |
| 0 | I. |
| 1 | To Sherlock Holmes she is always THE woman. I ... |
| 2 | him mention her under any other name. In his e... |
| 3 | and predominates the whole of her sex. It was ... |
| 4 | any emotion akin to love for Irene Adler. All ... |


<a id ='List-of-texts'></a>
## List of texts

[[go back to the top]](#Table-of-contents)

From `df` I will build a list `doc_set` containing our texts:

In [63]:
doc_set = df['text'].tolist()
for doc in doc_set[0:7]:
    print(doc)

I.
To Sherlock Holmes she is always THE woman. I have seldom heard
him mention her under any other name. In his eyes she eclipses
and predominates the whole of her sex. It was not that he felt
any emotion akin to love for Irene Adler. All emotions, and that
one particularly, were abhorrent to his cold, precise but
admirably balanced mind. He was, I take it, the most perfect


<a id ='Cleaning-the-text'></a>
## Cleaning the text

[[go back to the top]](#Table-of-contents)

Before applying natural language processing tools to our problem, I will provide a quick review of some basic procedures using Python. We first import `nltk` and the necessary classes for lemmatization and stemming:

In [64]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

We then create objects of the classes `PorterStemmer` and `WordNetLemmatizer`:

In [65]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

To apply lemmatization and/or stemming to a string, we must first tokenize it. The tokenizer below matches runs of word characters (letters, digits and underscores), so it splits the text at spaces and punctuation.

In [66]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
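
As a quick illustration of what the pattern `\w+` matches, here is a minimal sketch that uses the standard-library `re` module instead of `nltk`. The variable name `sample` is only illustrative, and I am assuming that for this simple pattern `re.findall` returns the same tokens as `RegexpTokenizer`.

```python
import re

sample = "To Sherlock Holmes she is always THE woman."

# Lower-case first, then collect every run of word characters.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# ['to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman']

# An apostrophe is not a word character, so contractions are split in two.
print(re.findall(r"\w+", "don't"))
# ['don', 't']
```

The second call is worth keeping in mind: contractions such as "don't" will not match the contracted entries of a stop-word list once they have been split this way.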

### Build a list of lists of tokens 

In [67]:
tokenized_docs = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc.lower())
    tokenized_docs.append(tokens)

print(tokenized_docs[0:3])

[['i'], ['to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman', 'i', 'have', 'seldom', 'heard'], ['him', 'mention', 'her', 'under', 'any', 'other', 'name', 'in', 'his', 'eyes', 'she', 'eclipses']]


### Apply lemmatization

In [68]:
lemmatized_tokens = []
for lst in tokenized_docs:
    tokens_lemma = [lemmatizer.lemmatize(i) for i in lst]
    lemmatized_tokens.append(tokens_lemma)
    
print(lemmatized_tokens[0:3])

[['i'], ['to', 'sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman', 'i', 'have', 'seldom', 'heard'], ['him', 'mention', 'her', 'under', 'any', 'other', 'name', 'in', 'his', 'eye', 'she', 'eclipse']]


### Dropping stopwords and words with at most $n$ letters

In [69]:
from stop_words import get_stop_words
en_stop_words = get_stop_words('en')

In [76]:
print(en_stop_words)

['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 't

In [85]:
n = 4
# Keep tokens that are not stop words and have more than n letters
tokens = []
for lst in lemmatized_tokens:
    tokens.append([i for i in lst if i not in en_stop_words and len(i) > n])

# dropping empty sublists
tokens = [token for token in tokens if len(token) > 0]
tokens[0:5]

[['sherlock', 'holmes', 'always', 'woman', 'seldom', 'heard'],
 ['mention', 'eclipse'],
 ['predominates', 'whole'],
 ['emotion', 'irene', 'adler', 'emotion'],
 ['particularly', 'abhorrent', 'precise']]
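
To make the effect of the length threshold concrete, here is a small sketch of the same filter on one tokenized line. The stop-word set below is a hypothetical, much shorter stand-in for the list returned by `get_stop_words('en')`, and `min_len` plays the role of `n`.

```python
# Hypothetical mini stop-word list, for illustration only.
stop_words = {"to", "she", "is", "the", "i", "have"}
min_len = 4  # same role as n in the cell above

doc = ["to", "sherlock", "holmes", "she", "is", "always",
       "the", "woman", "i", "have", "seldom", "heard"]

# Keep a token only if it is not a stop word and has more than min_len letters.
kept = [w for w in doc if w not in stop_words and len(w) > min_len]
print(kept)
# ['sherlock', 'holmes', 'always', 'woman', 'seldom', 'heard']

# A four-letter word is dropped even when it is not a stop word.
print([w for w in ["name", "mention"] if w not in stop_words and len(w) > min_len])
# ['mention']
```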

<a id ='Document-term-matrix'></a>
## Document-term matrix

[[go back to the top]](#Table-of-contents)

To fit an LDA model, we first need to know how often each term occurs in each document.

A **document-term matrix** records exactly that. It has one row for each of the $n$ documents in the corpus and one column for each of the $m$ words in the vocabulary. The cell $(i, j)$ holds the number of times word $j$ occurs in document $i$.

|               | word_1 | word_2 | ... | word_m |
| ------------- |:------:| -----:| -----:| -----:|
| doc_1         | 1      | 3      | ... | 2      |
| doc_2         | 2      | 3      | ... | 3      |
| ...           | ...    | 2      | ... | 1      |
| doc_n         | 1      | 1      | ... | 1      |
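
A dense document-term matrix like the one above can be built with the standard library alone. The sketch below is only an illustration: the three short documents are made up, and the columns are ordered alphabetically by word, which is my own choice rather than anything LDA requires.

```python
from collections import Counter

# Hypothetical corpus of three tokenized documents.
docs = [["holmes", "window", "holmes"],
        ["night", "window"],
        ["holmes", "night", "night"]]

# Vocabulary: one column per distinct word, in alphabetical order.
vocab = sorted({w for doc in docs for w in doc})
print(vocab)
# ['holmes', 'night', 'window']

# One row per document; each cell counts how often the word occurs in it.
dtm = [[Counter(doc)[w] for w in vocab] for doc in docs]
for row in dtm:
    print(row)
# [2, 0, 1]
# [0, 1, 1]
# [1, 2, 0]
```

Most cells of such a matrix are zero for a real corpus, which is why libraries usually store it in a sparse form.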

LDA converts this matrix into two lower-dimensional matrices. The entries shown below are schematic; in a fitted model they are probabilities. One is a document-topic matrix:

|               | topic_1 | topic_2 | ... | topic_T |
| ------------- |:------:| -----:| -----:| -----:|
| doc_1         | 0       | 1       | ... | 1       |
| doc_2         | 0       | 1       | ... | 1       |
| ...           | ...     | ...     | ... | 1       |
| doc_n         | 1       | 0       | ... | 0       |

and the other is a topic-word matrix:

|               | word_1 | word_2 | ... | word_m |
| ------------- |:------:| -----:| -----:| -----:|
| topic_1       | 1      | 0      | ... | 1      |
| topic_2       | 1      | 0      | ... | 1      |
| ...           | ...    | ...    | ... | 1      |
| topic_T       | 1      | 1      | ... | 1      |




<a id ='Tokens-into-dictionary'></a>
## Tokens into dictionary

[[go back to the top]](#Table-of-contents)

In [86]:
from gensim import corpora, models

dictionary = corpora.Dictionary(tokens)

## Tokenize documents into document-term matrix

[[go back to the top]](#Table-of-contents)

In [87]:
corpus = [dictionary.doc2bow(text) for text in tokens]

import pickle

# Use a context manager so the file is closed after writing
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

In [88]:
corpus[0]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
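
Each pair in the output above is `(token id, count)` for one document. The sketch below mimics that sparse representation with plain dictionaries. The helper names `build_ids` and `to_bow` are hypothetical, and the way ids are assigned here (order of first appearance) is an assumption that may differ from what `gensim` does internally.

```python
def build_ids(docs):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for w in doc:
            token2id.setdefault(w, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs, skipping tokens without an id."""
    counts = {}
    for w in doc:
        if w in token2id:
            counts[token2id[w]] = counts.get(token2id[w], 0) + 1
    return sorted(counts.items())

# First two token lists from the cleaned corpus shown earlier.
docs = [["sherlock", "holmes", "always", "woman", "seldom", "heard"],
        ["mention", "eclipse"]]
token2id = build_ids(docs)

print(to_bow(docs[0], token2id))
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]

# 'emotion' has no id, so it is ignored; 'holmes' (id 1) occurs twice.
print(to_bow(["emotion", "holmes", "holmes"], token2id))
# [(1, 2)]
```

Only non-zero counts are stored, so each document costs memory in proportion to its number of distinct words rather than to the size of the vocabulary.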

## LDA model

In [89]:
import gensim
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, id2word = dictionary, passes=20)
ldamodel.save('model.gensim')

In [98]:
for el in ldamodel.print_topics(num_topics=6, num_words=5):
    print(el,'\n')

(0, '0.015*"asked" + 0.014*"first" + 0.014*"found" + 0.014*"quite" + 0.013*"little"') 

(1, '0.024*"night" + 0.015*"woman" + 0.013*"seemed" + 0.011*"though" + 0.011*"without"') 

(2, '0.030*"shall" + 0.017*"never" + 0.013*"cried" + 0.012*"perhaps" + 0.012*"place"') 

(3, '0.066*"holmes" + 0.017*"window" + 0.016*"nothing" + 0.013*"matter" + 0.013*"sherlock"') 

(4, '0.018*"think" + 0.012*"right" + 0.010*"hardly" + 0.008*"black" + 0.008*"gentleman"') 

(5, '0.020*"house" + 0.017*"might" + 0.017*"round" + 0.016*"little" + 0.010*"turned"') 



In [99]:
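# Reload the saved dictionary, corpus and model from disk
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')

with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)

lda = gensim.models.ldamodel.LdaModel.load('model.gensim')

# In pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)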


In [97]:
pyLDAvis.display(lda_display)