# Topic Modeling

### Author: [Marco Tavora](http://www.marcotavora.me/)

**For the best viewing experience use [nbviewer]().**

## Table of contents

- [Introduction](#Introduction)

- [Required libraries](#Required-libraries)

## Introduction

[[go back to the top]](#Table-of-contents)

In this notebook, I use Python and its libraries for **topic modeling**, in which statistical models are used to identify topics or categories in a document or a set of documents. I will use one specific method called **Latent Dirichlet Allocation (LDA)**. The algorithm can be summarized as follows:
- First we fix the number of topics $T$, without any prior knowledge of what the topics actually are
- We then randomly assign each word to a topic
- For each document $d$, word $w$ and topic $t$, we calculate the probability $P(t\,|\,w,d)$ that the word $w$ of document $d$ belongs to topic $t$
- We then reassign each word $w$ to a topic based on $P(t\,|\,w,d)$ and repeat the process until the assignment of words to topics stabilizes
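
The steps above can be sketched as a toy collapsed Gibbs sampler. This is only an illustration under stated assumptions: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count and the toy documents are all made up here, and `gensim`'s `LdaModel` (used later in this notebook) relies on a different inference scheme (online variational Bayes).

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Step 2: randomly assign every word occurrence to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    # Steps 3-4: repeatedly resample each assignment from P(t | w, d).
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1
                # Unnormalized P(t | w, d) for every topic t.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment in the counts.
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1
    return assignments, doc_topic, topic_word

# Hypothetical toy corpus (already tokenized).
toy_docs = [
    ["photo", "upload", "photo", "picture"],
    ["credit", "card", "payment", "credit"],
    ["photo", "picture", "upload"],
    ["payment", "card", "credit"],
]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: topic counts {counts}")
```

Each row of `doc_topic` counts how many words of that document are currently assigned to each topic, so the rows always sum to the document lengths.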

## Required libraries

[[go back to the top]](#Table-of-contents)

This notebook uses the following packages:

- `spacy`
- `nltk`
- `random`
- `gensim`
- `pickle`
- `pandas`
- `sklearn`
- `stop_words`
- `pyLDAvis`

In [1]:
import pandas as pd
#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all" # see the value of multiple statements at once.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## The problem domain

[[go back to the top]](#Table-of-contents)

In this project I apply LDA to labels on research papers. The dataset is a subset of [this](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv) data set.

## Using `spaCy`

[[go back to the top]](#Table-of-contents)

In this project I will use the `spaCy` library (see this [link](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb)).

`spaCy` is:

> An industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

In [2]:
import spacy
spacy.load('en_core_web_sm')
from spacy.lang.en import English
parser = English()

## Importing the documents

[[go back to the top]](#Table-of-contents)

In [3]:
df = pd.read_csv('ampreviews_scraped_forum_text.csv')
df.drop("Unnamed: 0", axis = 1, inplace = True) 
#df.columns = ['titles']
#df.shape
#df.head()
df


In [14]:
df.iloc[:10].values

array([["\nHow do you guys pay for it? Wanted to look up a couple spots but don't want my CC on file. I know they do gift card methods but I then gotta drag my ass out to the store and get a physical gift card for $30 which only gets me 15 days.\n\nedit: I took a closer look and saw that NYers can't even do the gift card method cause they don't accept it.\n\xa0\n"],
       ['\nYeah it’s annoying. You’d have to but s gift card in nj and submit a photo of the receipt with payment. Or start s big coin account and pay by bit coin.  They also claim to allow you to upload photos of places where they don’t have a photo which gets you a free month but when i tried that, i found there was no way to actually upload a photo other than the “members photos” and i haven’t received any credit from uploading that way.  Not worth the trouble.\n\xa0\n'],
       ['\nUse your credit card.  After the charge is processed, tell the bank you lost the credit card, and they will issue you a new one.\n\xa0\n'],


## List of documents

[[go back to the top]](#Table-of-contents)

From `df` I will build a list `doc_set` containing the row entries:

In [16]:
doc_set = df.values.T.tolist()[0]
print(doc_set[0:10])

["\nHow do you guys pay for it? Wanted to look up a couple spots but don't want my CC on file. I know they do gift card methods but I then gotta drag my ass out to the store and get a physical gift card for $30 which only gets me 15 days.\n\nedit: I took a closer look and saw that NYers can't even do the gift card method cause they don't accept it.\n\xa0\n", '\nYeah it’s annoying. You’d have to but s gift card in nj and submit a photo of the receipt with payment. Or start s big coin account and pay by bit coin.  They also claim to allow you to upload photos of places where they don’t have a photo which gets you a free month but when i tried that, i found there was no way to actually upload a photo other than the “members photos” and i haven’t received any credit from uploading that way.  Not worth the trouble.\n\xa0\n', '\nUse your credit card.  After the charge is processed, tell the bank you lost the credit card, and they will issue you a new one.\n\xa0\n', "\nAlso tried the upload p

## Cleaning the text

[[go back to the top]](#Table-of-contents)

Before applying natural language processing tools to our problem, I will provide a quick review of some basic procedures using Python. We first import `nltk` and the necessary classes for lemmatization and stemming:

In [17]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

We then create objects of the classes `PorterStemmer` and `WordNetLemmatizer`:

In [18]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

To apply lemmatization and/or stemming to a given string, we must first tokenize it. The tokenizer below uses the regular expression `\w+`, which matches runs of word characters and splits on any non-word character, such as a space or an apostrophe.

In [19]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
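
To see what the pattern `\w+` does without installing `nltk`, here is a minimal sketch using only the standard-library `re` module. The helper name `tokenize` and the sample sentence are assumptions for illustration; the behavior should mirror the `RegexpTokenizer` above, but that equivalence is not checked here.

```python
import re

def tokenize(text):
    """Return the lowercase runs of word characters found in text."""
    return re.findall(r"\w+", text.lower())

tokens_example = tokenize("Don't want my CC on file.")
print(tokens_example)
# ['don', 't', 'want', 'my', 'cc', 'on', 'file']
```

Note that the apostrophe is a non-word character, so `don't` is split into `don` and `t`.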

### Build a list of lists of tokens 

In [20]:
tokenined_docs = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc.lower())
    tokenined_docs.append(tokens)
    
print(tokenined_docs[0:3])

[['how', 'do', 'you', 'guys', 'pay', 'for', 'it', 'wanted', 'to', 'look', 'up', 'a', 'couple', 'spots', 'but', 'don', 't', 'want', 'my', 'cc', 'on', 'file', 'i', 'know', 'they', 'do', 'gift', 'card', 'methods', 'but', 'i', 'then', 'gotta', 'drag', 'my', 'ass', 'out', 'to', 'the', 'store', 'and', 'get', 'a', 'physical', 'gift', 'card', 'for', '30', 'which', 'only', 'gets', 'me', '15', 'days', 'edit', 'i', 'took', 'a', 'closer', 'look', 'and', 'saw', 'that', 'nyers', 'can', 't', 'even', 'do', 'the', 'gift', 'card', 'method', 'cause', 'they', 'don', 't', 'accept', 'it'], ['yeah', 'it', 's', 'annoying', 'you', 'd', 'have', 'to', 'but', 's', 'gift', 'card', 'in', 'nj', 'and', 'submit', 'a', 'photo', 'of', 'the', 'receipt', 'with', 'payment', 'or', 'start', 's', 'big', 'coin', 'account', 'and', 'pay', 'by', 'bit', 'coin', 'they', 'also', 'claim', 'to', 'allow', 'you', 'to', 'upload', 'photos', 'of', 'places', 'where', 'they', 'don', 't', 'have', 'a', 'photo', 'which', 'gets', 'you', 'a', 'fr

### Apply lemmatizing

In [21]:
lemmatized_tokens = []
for lst in tokenined_docs:
    tokens_lemma = [lemmatizer.lemmatize(i) for i in lst]
    lemmatized_tokens.append(tokens_lemma)
    
print(lemmatized_tokens[0:3])

[['how', 'do', 'you', 'guy', 'pay', 'for', 'it', 'wanted', 'to', 'look', 'up', 'a', 'couple', 'spot', 'but', 'don', 't', 'want', 'my', 'cc', 'on', 'file', 'i', 'know', 'they', 'do', 'gift', 'card', 'method', 'but', 'i', 'then', 'gotta', 'drag', 'my', 'as', 'out', 'to', 'the', 'store', 'and', 'get', 'a', 'physical', 'gift', 'card', 'for', '30', 'which', 'only', 'get', 'me', '15', 'day', 'edit', 'i', 'took', 'a', 'closer', 'look', 'and', 'saw', 'that', 'nyers', 'can', 't', 'even', 'do', 'the', 'gift', 'card', 'method', 'cause', 'they', 'don', 't', 'accept', 'it'], ['yeah', 'it', 's', 'annoying', 'you', 'd', 'have', 'to', 'but', 's', 'gift', 'card', 'in', 'nj', 'and', 'submit', 'a', 'photo', 'of', 'the', 'receipt', 'with', 'payment', 'or', 'start', 's', 'big', 'coin', 'account', 'and', 'pay', 'by', 'bit', 'coin', 'they', 'also', 'claim', 'to', 'allow', 'you', 'to', 'upload', 'photo', 'of', 'place', 'where', 'they', 'don', 't', 'have', 'a', 'photo', 'which', 'get', 'you', 'a', 'free', 'mon

### Dropping stopwords and words with $n$ or fewer letters

In [22]:
from stop_words import get_stop_words
en_stop_words = get_stop_words('en')

In [23]:
n = 4  # keep only words with more than n letters
tokens = []
for lst in lemmatized_tokens:
    tokens.append([i for i in lst if i not in en_stop_words and len(i) > n])

print(tokens[0:3])

[['wanted', 'couple', 'method', 'gotta', 'store', 'physical', 'closer', 'nyers', 'method', 'cause', 'accept'], ['annoying', 'submit', 'photo', 'receipt', 'payment', 'start', 'account', 'claim', 'allow', 'upload', 'photo', 'place', 'photo', 'month', 'tried', 'found', 'actually', 'upload', 'photo', 'member', 'photo', 'haven', 'received', 'credit', 'uploading', 'worth', 'trouble'], ['credit', 'charge', 'processed', 'credit', 'issue']]
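
The filter combines two independent conditions, which the following small sketch makes explicit. The helper `filter_tokens`, the tiny stop-word set and the token list are hypothetical and only stand in for `en_stop_words` and the real documents.

```python
def filter_tokens(tokens, stop_words, min_len):
    """Keep tokens that are not stop words and have more than min_len characters."""
    return [tok for tok in tokens if tok not in stop_words and len(tok) > min_len]

# Hypothetical stop words and tokens, for illustration only.
toy_stop_words = {"about", "there", "which"}
toy_tokens = ["there", "photo", "which", "card", "upload", "about", "credit"]

kept = filter_tokens(toy_tokens, toy_stop_words, 4)
print(kept)
# ['photo', 'upload', 'credit']
```

Here `card` is dropped because of its length and `there` because it is a stop word, which is also why short words such as `gift` and `card` are missing from the output above.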


## Document-term matrix

[[go back to the top]](#Table-of-contents)

To fit an LDA model, we first need to know how often each term occurs in each document.

A **document-term matrix** records exactly that. For a corpus of $n$ documents and a vocabulary of $m$ words, it has $n$ rows and $m$ columns, and cell $ij$ holds the number of times word $j$ occurs in document $i$.

|       | word_1 | word_2 | ... | word_m |
| ----- |:------:| ------:| ---:| ------:|
| doc_1 | 1      | 3      | ... | 2      |
| doc_2 | 2      | 3      | ... | 3      |
| ...   | ...    | 2      | ... | 1      |
| doc_n | 1      | 1      | ... | 1      |
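
As a rough sketch of how such a matrix can be built by hand, the snippet below counts words with `collections.Counter`. The toy documents and the names `vocabulary` and `dtm` are assumptions for illustration; the notebook itself uses `gensim` for this step.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ["photo", "upload", "photo"],
    ["credit", "payment", "credit", "credit"],
    ["photo", "credit"],
]

# Columns of the matrix: the sorted vocabulary.
vocabulary = sorted({word for doc in toy_docs for word in doc})

# One row per document, one count per vocabulary word.
dtm = []
for doc in toy_docs:
    counts = Counter(doc)
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
# ['credit', 'payment', 'photo', 'upload']
# [0, 0, 2, 1]
# [3, 1, 0, 0]
# [1, 0, 1, 0]
```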

LDA approximately factorizes this matrix into two lower-dimensional matrices, a document-topic matrix ($n \times T$) and a topic-word matrix ($T \times m$), namely:

|       | topic_1 | topic_2 | ... | topic_T |
| ----- |:-------:| -------:| ---:| -------:|
| doc_1 | 0       | 1       | ... | 1       |
| doc_2 | 0       | 1       | ... | 1       |
| ...   | ...     | ...     | ... | 1       |
| doc_n | 1       | 0       | ... | 0       |

and

|         | word_1 | word_2 | ... | word_m |
| ------- |:------:| ------:| ---:| ------:|
| topic_1 | 1      | 0      | ... | 1      |
| topic_2 | 1      | 0      | ... | 1      |
| ...     | ...    | ...    | ... | 1      |
| topic_T | 1      | 1      | ... | 1      |
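
To make the shapes concrete, here is a small sketch with made-up numbers. In a fitted LDA model the entries of both matrices are probabilities rather than 0/1 indicators; the values below, and the helper `matmul`, are purely hypothetical. Multiplying the two matrices gives, for each document, a probability for every word in the vocabulary.

```python
# Hypothetical distributions: 2 documents, 2 topics, 3 words.
doc_topic = [
    [0.9, 0.1],  # doc_1 is mostly about topic_1
    [0.2, 0.8],  # doc_2 is mostly about topic_2
]
topic_word = [
    [0.7, 0.2, 0.1],  # word probabilities under topic_1
    [0.1, 0.3, 0.6],  # word probabilities under topic_2
]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# (n x T) times (T x m) gives an (n x m) matrix of word probabilities per document.
doc_word = matmul(doc_topic, topic_word)
for row in doc_word:
    print([round(p, 2) for p in row])
```

Each row of `doc_word` sums to 1, since it mixes the topic-level word distributions with the document's topic weights.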




## Tokens into dictionary

[[go back to the top]](#Table-of-contents)

In [24]:
from gensim import corpora, models

dictionary = corpora.Dictionary(tokens)

## Tokenize documents into document-term matrix

[[go back to the top]](#Table-of-contents)

In [25]:
corpus = [dictionary.doc2bow(text) for text in tokens]

import pickle
# Use a context manager so the file is closed after writing
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

In [26]:
corpus[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 2),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1)]
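
Each pair in the output above is `(token_id, count)`. The following sketch imitates that bag-of-words representation in plain Python. The helpers `build_token_ids` and `to_bow` are hypothetical, and the ids are assigned in order of first appearance, which is an assumption and need not match the ids that `gensim` assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Map each distinct token to an integer id (order of first appearance)."""
    token_to_id = {}
    for doc in docs:
        for tok in doc:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token_to_id[tok] for tok in doc if tok in token_to_id)
    return sorted(counts.items())

# Hypothetical tokenized documents.
toy_docs = [
    ["wanted", "method", "store", "method"],
    ["photo", "upload", "photo", "photo"],
]
token_to_id = build_token_ids(toy_docs)
bow = [to_bow(doc, token_to_id) for doc in toy_docs]
print(bow)
# [[(0, 1), (1, 2), (2, 1)], [(3, 3), (4, 1)]]
```

Only words that actually occur in a document appear in its list, which keeps the representation compact compared with a full document-term matrix.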

## LDA model

In [27]:
import gensim
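ldamodel_3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)
ldamodel_3.save('model3.gensim')
# Note: despite its name, this second model is also trained with num_topics=3;
# LDA training is randomized, so the two runs can produce different topics
ldamodel_4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=20)
ldamodel_4.save('model4.gensim')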

In [28]:
for el in ldamodel_3.print_topics(num_topics=3, num_words=3):
    print(el,'\n')

(0, '0.058*"photo" + 0.044*"payment" + 0.030*"maybe"') 

(1, '0.163*"photo" + 0.047*"thing" + 0.047*"never"') 

(2, '0.048*"access" + 0.048*"uploaded" + 0.048*"picture"') 



In [29]:
for el in ldamodel_4.print_topics(num_topics=3, num_words=3):
    print(el,'\n')

(0, '0.140*"photo" + 0.043*"credit" + 0.043*"rubmaps"') 

(1, '0.063*"photo" + 0.038*"tried" + 0.038*"review"') 

(2, '0.068*"maybe" + 0.046*"problem" + 0.046*"method"') 



In [30]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')

In [31]:
# Use a context manager so the file is closed after reading
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)

In [32]:
lda = gensim.models.ldamodel.LdaModel.load('model3.gensim')

In [34]:
import pyLDAvis.gensim_models




In [36]:
lda_display = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary, sort_topics=False)



In [37]:
pyLDAvis.display(lda_display)

