# Topic Modeling

Topic modeling is another popular text analysis technique. Its goal is to discover the topics present in your corpus. Each document in the corpus is treated as a mixture of one or more topics.

In this notebook, we walk through the steps of Latent Dirichlet Allocation (LDA), one of many topic modeling techniques and one designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
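
To make these two inputs concrete, here is a minimal sketch that builds a tiny document-term matrix by hand. The toy documents, the whitespace tokenization, and the names `docs`, `vocab`, `dtm`, and `num_topics` are assumptions for illustration only; the notebook itself loads a prebuilt matrix from a pickle.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook's real corpus is loaded from a pickle.
docs = {
    "CHATS": "able to access the report",
    "EMAILS": "access accepted and access granted",
}

# Count terms per document using naive whitespace tokenization.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all terms seen in any document.
vocab = sorted(set().union(*counts.values()))

# Input (1): document-term matrix, one row per document, one column per term.
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

# Input (2): the number of topics, chosen by the analyst rather than learned.
num_topics = 2

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row holds one document's term counts, which is the same shape as the `data` DataFrame shown later in this notebook.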

In [1]:
import pickle
import pandas as pd

# Set up LDA logging to a file (comment out these two lines to disable it)
import logging
logging.basicConfig(filename='lda_model.log', format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse # gensim's Sparse2Corpus expects a scipy sparse matrix

In [2]:
# Let's read in our document-term matrix
with open('Pickles/term_matrix.pickle', 'rb') as d:
    data = pickle.load(d)

In [3]:
data

Unnamed: 0,able,abn,accelerated,accelerating,accenture,acceptance,accepted,access,accessing,accessories,...,我會等待確認,是东亚地区统一的一党主权国家,澳门特别行政区,照总面积计算,現在,裤脚,許多人需要同意,還沒有,重庆,鞋子全部打湿完了
CHATS,1,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,0,1,1,1,0
EMAILS,3,2,1,1,5,9,1,11,1,2,...,0,0,0,0,0,0,0,0,0,0
SMS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


In [4]:
# gensim's Sparse2Corpus treats columns as documents by default, so we need a term-document matrix (the transpose of document-term)
tdm = data.transpose()
tdm.head()

Unnamed: 0,CHATS,EMAILS,SMS
able,1,3,0
abn,0,2,0
accelerated,0,1,0
accelerating,0,1,0
accenture,0,5,0


In [5]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
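
To show what the resulting corpus represents, the sketch below rebuilds the same bag-of-words structure in plain Python from the five rows printed by `tdm.head()`. It is an illustration of the format, not gensim's implementation, and the names `terms`, `columns`, `id2word`, and `bow_corpus` are hypothetical.

```python
# Term-document counts copied from the tdm.head() output above:
# one row per term, one column per document.
terms = ["able", "abn", "accelerated", "accelerating", "accenture"]
columns = {
    "CHATS": [1, 0, 0, 0, 0],
    "EMAILS": [3, 2, 1, 1, 5],
    "SMS": [0, 0, 0, 0, 0],
}

# Map each integer term id (the row position) back to its term.
id2word = dict(enumerate(terms))

# Each document becomes a list of (term_id, count) pairs; zero counts are dropped.
bow_corpus = {
    name: [(term_id, count) for term_id, count in enumerate(col) if count > 0]
    for name, col in columns.items()
}

for name, bow in bow_corpus.items():
    print(name, [(id2word[term_id], count) for term_id, count in bow])
```

In this representation the terms are reduced to integer ids, so a mapping such as `id2word` is needed to turn ids back into readable words when inspecting topics.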