# Topic Modeling with Latent Dirichlet Allocation in Spark

Latent Dirichlet Allocation (LDA) is widely used for topic modeling. It is particularly well suited to text because it handles sparse feature vectors well. For background, see the Wikipedia article at https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, among other sources.

## Library Imports

In [1]:
import sys.process._
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable
import scala.io.Source

- The OS commands in this notebook are run through `sys.process`. This one lists the saved model and vocabulary directories that are loaded later.

In [84]:
"ls -lrth LDAModels"!

total 8.0K
drwxr-xr-x 2 root root 4.0K Nov 14 21:36 trainedModel-1-vocab
drwxr-xr-x 4 root root 4.0K Nov 14 21:38 trainedModel-1




0
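The trailing `0` in the output above is the exit code returned by `!`. As a minimal sketch of the two `sys.process` operators used in this notebook (the `ShellDemo` object and its method names are hypothetical, introduced only for illustration): `!` streams the command's output to the console and returns its exit code, while `!!` returns the output as a `String` and throws if the command fails.

```scala
import scala.sys.process._

// Hypothetical helper, for illustration only
object ShellDemo {
  // `!` prints the command's output and returns its exit code (0 means success)
  def exitCode(cmd: String): Int = cmd.!

  // `!!` captures standard output as a String; a non-zero exit code raises an exception
  def output(cmd: String): String = cmd.!!
}
```

This is why the directory listing uses `!` (we only want to see the output), whereas the category listing later uses `!!` so that the result can be split into an array.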

## Visualize with Brunel

To use Brunel, its jar must first be downloaded from the Brunel website. The `%AddJar` magic does this:

In [3]:
%AddJar -magic https://brunelvis.org/jar/spark-kernel-brunel-all-2.5.jar -f

Starting download from https://brunelvis.org/jar/spark-kernel-brunel-all-2.5.jar
Finished download of spark-kernel-brunel-all-2.5.jar


## Display Known Categories 

- The newsgroup dataset is organized into topic directories. There are 20 directories, each with about 1000 news items.

In [4]:
val categories = ("ls 20_newsgroups"!!).split("\n")
categories.foreach(println)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


categories = Array(alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc)




[alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc]

## Load Data

- We use a pre-computed LDA model (TODO: add a notebook for calculating the model?)
- The model is stored in the standard Spark LDA model format
    - The vocabulary is stored separately in its own format and is loaded only for display purposes

In [5]:
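//val filename = "LDAModel/f-1.0-sd-0-nTpc-10-nIts-100-81eada91"
val filename = "LDAModels/trainedModel-1"
val filenameVocab = filename + "-vocab"

// Load the trained model saved in Spark's LDA model format
val ldaModelLoaded = DistributedLDAModel.load(sc, filename)

// Split each vocabulary line on the comma, then strip the first character of the
// word part and the last character of the index part before parsing
val vocabLoaded: Map[String, Int] = sc.textFile(filenameVocab).
                 map(x => x.split(",")).
                 map(x => (x(0).slice(1, x(0).size), x(1).slice(0, x(1).size - 1).toInt)).
                 collect.toMap

// Inverse lookup (term index -> word), used when displaying topics
val vocabArrayLoaded: Map[Int, String] = vocabLoaded.map(x => x._2 -> x._1)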

filename = LDAModels/trainedModel-1
filenameVocab = LDAModels/trainedModel-1-vocab
ldaModelLoaded = org.apache.spark.mllib.clustering.DistributedLDAModel@4330c304
vocabLoaded = Map(incident -> 1589, serious -> 400, mario -> 2361, inflammatory -> 5664, boggs -> 1182, embedded -> 3245, speaker -> 2586, orioles -> 5010, terrible -> 6458, lion -> 4706, rate -> 2394, inevitable -> 5623, metabolism -> 1216, lights -> 5734, submitted -> 5065, snow -> 1641, purchasing -> 3174, ecac -> 2349, michael -> 577, stacked -> 4879, looks -> 342, shap -> 3754, postal -> 2768, california -> 1468, mammography -> 1457, anticancer -> 2404, illustrator -> 1773, basenotes -> 4496, accident -> 1616, glean -> 5992, muscles ->...
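The slicing in the cell above suggests that each vocabulary line looks like `(word,index)`, although the file itself is not shown, so that format is an assumption. Under that assumption, a pattern-based parser is a little easier to read and skips malformed lines instead of failing. The `VocabFormat` object and its methods are hypothetical names used only for this sketch.

```scala
// Hypothetical helper; assumes each vocabulary line has the form "(word,index)"
object VocabFormat {
  // A Regex used as an extractor must match the whole line
  private val Entry = """\((.+),(\d+)\)""".r

  // Returns Some((word, index)) for a well-formed line, None otherwise
  def parseLine(line: String): Option[(String, Int)] = line.trim match {
    case Entry(word, index) => Some(word -> index.toInt)
    case _                  => None
  }

  // Builds the index -> word lookup used to print topics
  def invert(vocab: Map[String, Int]): Map[Int, String] = vocab.map(_.swap)
}
```

With Spark, this could be applied as `sc.textFile(filenameVocab).flatMap(line => VocabFormat.parseLine(line)).collect.toMap`, again assuming the line format above.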




## Displaying Topics and the Top 75 Words for Each Topic

The model loaded here uses only 10 topics, which is fewer than the 20 directories but more than the 7 top-level groups those directories fall under (alt, comp, misc, rec, sci, soc, talk). We expect an imperfect characterization, but some topics appear to be sensible. As a spot check that we have loaded the intended model, topic 0 should have "anonymous" at the top of its list with a weight of about 0.0126.

In [6]:
var idx = 0
// Top 75 terms per topic, as (term indices, term weights) pairs
val topicIndices = ldaModelLoaded.describeTopics(maxTermsPerTopic = 75)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC:" + idx)
  // Translate each term index back to its word via the inverse vocabulary
  terms.zip(termWeights).foreach { case (term, weight) =>
    println(s"${vocabArrayLoaded(term.toInt)}\t$weight")
  }
  idx += 1
  println()
}

TOPIC:0
anonymous	0.012643275631369468
internet	0.01199892653846487
system	0.011489389159339945
email	0.010461753442982532
information	0.010033859874241645
many	0.009046697848641316
most	0.008985058416534403
these	0.008776317737127445
privacy	0.00870315316139373
mail	0.008168184471456088
address	0.007318294146922494
anonymity	0.006893699748457801
their	0.006457573336484874
access	0.005735232098296366
usenet	0.0056810127282647055
computer	0.0054274783521739735
sites	0.005346644944619072
network	0.005201538796310405
user	0.004953510922416698
message	0.004673200538625899
files	0.004580725684189369
associated	0.0045489408314838406
identity	0.0044313548110878585
where	0.004270628694089294
currently	0.004267651379416461
over	0.004165226488264872
find	0.0041167933944265675
rights	0.004083307820384657
file	0.0040832693812494865
been	0.0040562934952105565
users	0.004045788261142567
server	0.003992542724717023
generally	0.0038020822540080707
posting	0.003798796389884891
local	0.00377375105594615

between	0.0034762865615605497
might	0.0034353173667108646
also	0.0033552574531741743
maybe	0.0033345429434129045
fire	0.0033309384635907898
then	0.0033264640872788256
been	0.003317611966948764
never	0.0032800148989558624
creation	0.003249793289359336
cause	0.003159088070438072
difference	0.0031082175135053757
illinois	0.003073464353824994
different	0.003069379116904226
like	0.0029460762527453227
them	0.0028900747803706555
opinions	0.0028540359490622945
change	0.002816408274442823
service	0.0028058309413480156
reported	0.0028044699497185464
feel	0.002800205231245364
find	0.002751238819624096
love	0.0027423342564316874
world	0.0026841901034129474
great	0.002642193903249871
explain	0.0026412729493847696
light	0.0026228900694397943
kind	0.002561562494224994
could	0.002516585239458934
sense	0.0024952862163511555
dark	0.002474484282074642
start	0.002470386486694099
religion	0.0024292201171011063
wants	0.002389158446201289
provide	0.0022717620640987993
david	0.0022368919519904855
agnostic	0.0

able	0.002860652233720948
different	0.002820581342528041
whole	0.0028182354374335357
church	0.002811341500724958
christian	0.002800031827554073
same	0.0027761447016901447
anyone	0.0027671992002842013
them	0.0027586370827573105
sounds	0.002747888305444781
help	0.0027369022953699514
could	0.002729253418983698
jesus	0.0026631243274963584
real	0.002658383337142649
book	0.002621088752585224
great	0.002609870854755423
already	0.0026076142609919887
shown	0.002576468084208852
drawing	0.0025760997881228934
mentioned	0.002571609813952075
compromise	0.0025462677478937487
dave	0.0025356697925542717
those	0.002534348741429736
nothing	0.002521460571859881
least	0.0024842682482825638
believe	0.002435993202176831
later	0.0024203844433827444
oilers	0.0024094308316120043
pocklington	0.0024094308316120043
quicktime	0.002389834885615463
question	0.0023529350320039715
before	0.002332027362975994
possible	0.0023133558813076393
wrong	0.0022831923055528307
values	0.0022710001619628737
next	0.00225497911526594

idx = 10
topicIndices = Array((Array(25, 29, 14, 33, 7, 17, 11, 8, 31, 47, 71, 87, 0, 42, 111, 36, 120, 101, 115, 147, 117, 142, 172, 53, 151, 28, 64, 154, 59, 6, 159, 199, 219, 171, 73, 16, 18, 5, 27, 238, 114, 67, 190, 287, 276, 15, 106, 270, 74, 12, 255, 167, 13, 211, 156, 364, 19, 191, 304, 98, 24, 41, 108, 325, 186, 284, 294, 32, 39, 496, 351, 105, 160, 10, 216),Array(0.012643275631369468, 0.01199892653846487, 0.011489389159339945, 0.010461753442982532, 0.010033859874241645, 0.009046697848641316, 0.008985058416534403, 0.008776317737127445, 0.00870315316139373, 0.008168184471456088, 0.007318294146922494, 0.006893699748457801, 0.006457573336484874, 0.005735232098296366, 0.0056810127282647055, 0.0054274783521739735, 0.005346644944619072, 0.0052...


[([I@63f73ddb,[D@6c250628), ([I@488fb47b,[D@4d2f741e), ([I@64d9784f,[D@f8ba291), ([I@3a12e377,[D@cabffaf), ([I@231ff575,[D@71d244ad), ([I@5cf618bb,[D@f73c20d), ([I@531b3acb,[D@300feb29), ([I@5cd00e53,[D@4da2cc6e), ([I@47e56a73,[D@87c8705), ([I@5402b2cf,[D@f98d59d)]
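Entries such as `[I@63f73ddb` and `[D@6c250628` in the last output line are the JVM's default string form for `Array[Int]` and `Array[Double]` (a type tag plus an identity hash), not the topic data. A small sketch of rendering one `(terms, weights)` pair readably follows; `ArrayRendering.render` is a hypothetical helper, not part of Spark.

```scala
// Hypothetical helper for printing one topic's (term indices, weights) pair
object ArrayRendering {
  def render(topic: (Array[Int], Array[Double])): String = {
    val (terms, weights) = topic
    // mkString formats the elements themselves instead of the array reference
    terms.zip(weights).map { case (t, w) => s"$t:$w" }.mkString("[", ", ", "]")
  }
}
```

For example, `topicIndices.map(ArrayRendering.render).foreach(println)` would print one line of `index:weight` pairs per topic.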

## Using Brunel to Visualize Each Topic

Rather than read through long lists of terms, we can select one topic and use Brunel to generate a word cloud. This is often easier to scan, although the full list above remains useful. Here we visualize topic 0, whose highest-weighted words in the listing above include "anonymous", "internet", "privacy" and "anonymity".
There are other ways to confirm that a topic assignment is good; the aim here is to demonstrate how an early effort at characterizing the corpus topics might be carried out.

In [77]:
val topicToVisualize = 0
// Pair each word of the selected topic with its weight, as a DataFrame for Brunel
val topicDF = sc.parallelize(
  topicIndices(topicToVisualize)._1.map(x => vocabArrayLoaded(x)).
    zip(topicIndices(topicToVisualize)._2)
).toDF("word", "weight")

topicToVisualize = 0
topicDF = [word: string, weight: double]


[word: string, weight: double]

In [78]:
%%brunel data('topicDF') cloud x(word) color(word) size(weight) label(word)

                'brunel' : 'https://brunelvis.org/js...


These words appear to be related to the manually labeled topic of `sci.crypt`.

It is not unusual to see imperfect clustering. After all, the categories were assigned by hand, whereas here an algorithm assigns the labels. We see similar behavior in even the simplest K-Means model (TODO: link to KMeans notebook).
The great benefit of unsupervised learning is its ability to give structure to vast quantities of data, revealing patterns that are otherwise difficult for a human to spot. Since we view this as an aid to understanding, we don't need a perfect correspondence between the annotations and the discovered topics. Looking at topic 1, we also see words that seem to be related to `sci.crypt`.

In [82]:
val topicToVisualize = 1
// Pair each word of the selected topic with its weight, as a DataFrame for Brunel
val topicDF = sc.parallelize(
  topicIndices(topicToVisualize)._1.map(x => vocabArrayLoaded(x)).
    zip(topicIndices(topicToVisualize)._2)
).toDF("word", "weight")

topicToVisualize = 1
topicDF = [word: string, weight: double]


[word: string, weight: double]

In [83]:
%%brunel data('topicDF') cloud x(word) color(word) size(weight) label(word)

                'brunel' : 'https://brunelvis.org/js...


## Loading the First Article in the sci.crypt Directory

We now analyze the first document in this directory. Which topics will our model predict are most prevalent in it?

In [12]:
val test_input = Seq(Source.fromFile("20_newsgroups/sci.crypt/14147").getLines.reduce(_+_))

println("trying to assign a topic to the following text")

println("===")
println(test_input)
println("===")

trying to assign a topic to the following text
===
List(Xref: cantaloupe.srv.cs.cmu.edu alt.security.ripem:136 sci.crypt:14147 comp.security.misc:2868 alt.security:9389 comp.mail.misc:11904 comp.answers:213 news.answers:6521Newsgroups: alt.security.ripem,sci.crypt,comp.security.misc,alt.security,comp.mail.misc,comp.answers,news.answersPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!wupost!spool.mu.edu!caen!sol.ctr.columbia.edu!usenet.ucs.indiana.edu!silver.ucs.indiana.edu!mvanheynFrom: Marc VanHeyningen <mvanheyn@cs.indiana.edu>Subject: RIPEM Frequently Asked QuestionsContent-Type: text/x-usenet-FAQ; version=1.0; title="RIPEM FAQ"Message-ID: <C3Juww.Dru@usenet.ucs.indiana.edu>Followup-To: alt.security.ripemOriginator: mvanheyn@silver.ucs.indiana.eduSender: news@usenet.ucs.indiana.edu (USENET News System)Supersedes: <1993Jan25.113427.28926@news.cs.indiana.edu>Nntp-Posting-Host: silver.ucs.indiana.eduO

test_input = List(Xref: cantaloupe.srv.cs.cmu.edu alt.security.ripem:136 sci.crypt:14147 comp.security.misc:2868 alt.security:9389 comp.mail.misc:11904 comp.answers:213 news.answers:6521Newsgroups: alt.security.ripem,sci.crypt,comp.security.misc,alt.security,comp.mail.misc,comp.answers,news.answersPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!wupost!spool.mu.edu!caen!sol.ctr.columbia.edu!usenet.ucs.indiana.edu!silver.ucs.indiana.edu!mvanheynFrom: Marc VanHeyningen <mvanheyn@cs.indiana.edu>Subject: RIPEM Frequently Asked QuestionsContent-Type: text/x-usenet-FAQ; version=1.0; title="RIPEM FAQ"Message-ID: <C3Juww.Dru@usenet.ucs.indiana.edu>Followup-To: alt.security.ripemOriginator: mvanhey...


List(Xref: cantaloupe.srv.cs.cmu.edu alt.security.ripem:136 sci.crypt:14147 comp.security.misc:2868 alt.security:9389 comp.mail.misc:11904 comp.answers:213 news.answers:6521Newsgroups: alt.security.ripem,sci.crypt,comp.security.misc,alt.security,comp.mail.misc,comp.answers,news.answersPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!wupost!spool.mu.edu!caen!sol.ctr.columbia.edu!usenet.ucs.indiana.edu!silver.ucs.indiana.edu!mvanheynFrom: Marc VanHeyningen <mvanheyn@cs.indiana.edu>Subject: RIPEM Frequently Asked QuestionsContent-Type: text/x-usenet-FAQ; version=1.0; title="RIPEM FAQ"Message-ID: <C3Juww.Dru@usenet.ucs.indiana.edu>Followup-To: alt.security.ripemOriginator: mvanheyn@silver.ucs.indiana.eduSender: news@usenet.ucs.indiana.edu (USENET News System)Supersedes: <1993Jan25.113427.28926@news.cs.indiana.edu>Nntp-Posting-Host: silver.ucs.indiana.eduOrganization: Computer Science, Indiana UniversityDa
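Note that in the output above the header lines run together (for example `news.answers:6521Newsgroups:`). This is because `reduce(_+_)` concatenates lines with no separator, so the last word of each line fuses with the first word of the next and neither can match the vocabulary. The sketch below contrasts the two ways of joining lines; `LineJoining` is a hypothetical name, and using `mkString(" ")` is a suggested alternative rather than what the cell above does.

```scala
// Hypothetical helper contrasting two ways of joining the lines of a file
object LineJoining {
  // What the cell above does: no separator, so words at line boundaries fuse
  def fused(lines: Seq[String]): String = lines.reduce(_ + _)

  // Alternative: keep a space between lines so a whitespace split recovers every word.
  // Unlike reduce, mkString also works on an empty sequence.
  def spaced(lines: Seq[String]): String = lines.mkString(" ")
}
```

Applied to the file, this would read `Source.fromFile("20_newsgroups/sci.crypt/14147").getLines.mkString(" ")`. Re-running the notebook with this change would alter the token counts, so the topic fractions shown below would likely shift somewhat.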

## Generating a Topic Distribution from the text document

- TODO:  Add the equations for topic distributions?

Here we estimate the fraction of each topic present in this document's words. The most strongly represented topic can be considered the main subject of the article.
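As a reminder of what a "topic distribution" means, the standard LDA generative model can be written as follows. The symbols ($\theta_d$, $\phi_k$, $\alpha$, $\beta$, $\gamma_d$) follow common LDA notation rather than anything defined in this notebook, and $K = 10$ for the model loaded here.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic assigned to the } n\text{-th word} \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the } n\text{-th word itself}
\end{aligned}
$$

The quantity reported for a document is an estimate of $\theta_d$. Under variational inference, the posterior over $\theta_d$ is approximated by a Dirichlet with parameters $\gamma_d$, and a common point estimate is the normalized vector

$$
\hat{\theta}_{d,k} = \frac{\gamma_{d,k}}{\sum_{j=1}^{K} \gamma_{d,j}},
$$

so the $K$ fractions are non-negative and sum to 1. We believe this is what `topicDistributions` returns, but that has not been checked against the Spark source here.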

In [73]:
// Build a bag-of-words count vector for each input document, keeping only in-vocabulary terms
val test_document = sc.parallelize(test_input.map(doc => doc.split("\\s")).zipWithIndex.map { case (tokens, id) =>
  val counts = new mutable.HashMap[Int, Double]()
  tokens.foreach { term =>
    if (vocabLoaded.contains(term)) {
      val idx = vocabLoaded(term)
      counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
    }
  }
  (id.toLong, Vectors.sparse(vocabLoaded.size, counts.toSeq))
})

// Convert to a local model, which can infer topic distributions for new documents
val localLDAModelLoaded: LocalLDAModel = ldaModelLoaded.toLocal
val topicDistributions = localLDAModelLoaded.topicDistributions(test_document)
println("first topic distribution:" + topicDistributions.first._2.toArray.mkString(", "))

first topic distribution:0.3678651688357385, 0.16033980860948344, 0.030735811366463403, 0.05206532087578698, 0.05230728554300706, 0.06665422319327728, 0.04200850832933759, 0.06093942907232035, 0.06837865937763134, 0.09870578479695413


test_document = ParallelCollectionRDD[92] at parallelize at <console>:50
localLDAModelLoaded = org.apache.spark.mllib.clustering.LocalLDAModel@6e48e008
topicDistributions = MapPartitionsRDD[93] at map at LDAModel.scala:356


MapPartitionsRDD[93] at map at LDAModel.scala:356

In [72]:
val localLDAModelLoaded: LocalLDAModel = ldaModelLoaded.toLocal
val topicDistributions = localLDAModelLoaded.topicDistributions(test_document)
println("first topic distribution:" + topicDistributions.first._2.toArray.mkString(", "))

first topic distribution:0.36786523297221063, 0.16033828435894748, 0.030735711602361673, 0.05206433589847484, 0.0523069847727492, 0.06665434863907264, 0.04200754078034929, 0.06094140495115028, 0.06837956418138692, 0.09870659184329698


localLDAModelLoaded = org.apache.spark.mllib.clustering.LocalLDAModel@3f1b4c04
topicDistributions = MapPartitionsRDD[91] at map at LDAModel.scala:356


MapPartitionsRDD[91] at map at LDAModel.scala:356
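The tokenization used above splits only on whitespace and looks tokens up verbatim, so `Privacy,` or `EMAIL` would not match vocabulary entries such as `privacy` or `email`. The sketch below normalizes tokens before counting. It assumes the vocabulary holds lowercase alphabetic words, which matches the entries printed earlier but is not confirmed by the notebook; `Tokenizing` and its methods are hypothetical names.

```scala
// Hypothetical helper; assumes vocabulary words are lowercase and alphabetic
object Tokenizing {
  // Lowercase the text and split on any run of non-letter characters
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Count occurrences of each in-vocabulary term, keyed by term index
  def termCounts(tokens: Seq[String], vocab: Map[String, Int]): Map[Int, Double] =
    tokens.flatMap(t => vocab.get(t))
      .groupBy(identity)
      .map { case (idx, hits) => idx -> hits.size.toDouble }
}
```

The resulting map could be passed to `Vectors.sparse(vocabLoaded.size, counts.toSeq)` in place of the mutable `HashMap` built above. Whether this improves the match depends on how the vocabulary was built when the model was trained.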

## Displaying Topic Distributions

- The table and bar plot show that this article is most strongly associated with topic 0 (a fraction of about 0.37), followed by topic 1 (about 0.16). This matches our reading of the word clouds for topics 0 and 1, both of which appeared to be related to `sci.crypt`.

In [75]:
val topicDistributionsDF = sc.parallelize(topicDistributions.map(x=>x._2).first.toArray.zipWithIndex).
                         toDF("TopicFraction","TopicNumber")
topicDistributionsDF.show

+--------------------+-----------+
|       TopicFraction|TopicNumber|
+--------------------+-----------+
|  0.3678649762344198|          0|
| 0.16033806996173225|          1|
|0.030735693886702244|          2|
| 0.05206535944793738|          3|
| 0.05230728788829701|          4|
| 0.06665456846835575|          5|
|   0.042005077688621|          6|
|0.060942125531264255|          7|
| 0.06838005368148471|          8|
| 0.09870678721118552|          9|
+--------------------+-----------+



topicDistributionsDF = [TopicFraction: double, TopicNumber: int]


[TopicFraction: double, TopicNumber: int]

In [85]:
%%brunel data('topicDistributionsDF') bar x(TopicNumber) y(TopicFraction) transpose

                'brunel' : 'https://brunelvis.org/js...
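The dominant topic can also be picked out programmatically instead of being read off the table or plot. The sketch below ranks topics by fraction; `DominantTopic.rank` is a hypothetical helper, and the sample values in the usage note are the fractions from the table above rounded to four decimals.

```scala
// Hypothetical helper for ranking topics by their estimated fraction
object DominantTopic {
  // Returns (topicNumber, fraction) pairs, largest fraction first
  def rank(fractions: Array[Double]): Seq[(Int, Double)] =
    fractions.zipWithIndex
      .map(_.swap)
      .sortBy { case (_, fraction) => -fraction }
      .toSeq
}
```

For instance, `DominantTopic.rank(Array(0.3679, 0.1603, 0.0307, 0.0521, 0.0523, 0.0667, 0.0420, 0.0609, 0.0684, 0.0987)).head` yields `(0, 0.3679)`, and in the notebook the input could be `topicDistributions.first._2.toArray`.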
