# Topic Modeling with Latent Dirichlet Allocation in Spark

Latent Dirichlet Allocation (LDA) is widely used for topic modeling and is particularly well suited to processing text, since it handles sparse feature vectors well. It is described on Wikipedia (https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and elsewhere.

## Library Imports

In [3]:
import sys.process._
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable
import scala.io.Source

- OS commands, run through `sys.process._`, are used to locate the data. The command below lists the saved model directories.

In [4]:
"ls -lrth LDAModels"!

total 0
drwxr-xr-x  20 nilmeier@us.ibm.com  staff   640B Sep  9 08:48 trainedModel-1-vocab
drwxr-xr-x   4 nilmeier@us.ibm.com  staff   128B Sep  9 08:48 trainedModel-1




0

## Visualize with Brunel

Brunel must be downloaded before it can be used. The `%AddJar` magic does this:

In [2]:
%AddJar -magic http://brunelvis.org/jar/spark-kernel-brunel-all-2.5.jar -f


Starting download from http://brunelvis.org/jar/spark-kernel-brunel-all-2.5.jar
Finished download of spark-kernel-brunel-all-2.5.jar


## Display Known Categories 

- The newsgroup dataset is organized into topic directories. There are 20 directories, each with about 1000 news items.

In [5]:
val categories = ("ls 20_newsgroups"!!).split("\n")
categories.foreach(println)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


categories = Array(alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc)





## Load Data

- We use a pre-calculated LDA model, as computed in (TODO: add a notebook for calculating the model).
- The model is stored in the standard Spark LDA model format.
    - The vocabulary is stored separately in its own text format and is loaded for display purposes.
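The parsing code in the next cell strips the first character of the word field and the last character of the index field, which suggests that each vocabulary line looks like `(word,index)`. A minimal sketch under that assumption, in plain Scala without Spark; `parseVocabLine` and `sampleLines` are hypothetical names, and the three sample entries are taken from the vocabulary map printed in this notebook:

```scala
// Sample lines in the assumed "(word,index)" layout.
val sampleLines = Seq("(serious,859)", "(michael,387)", "(looks,336)")

// Strip the surrounding parentheses, then split on the last comma
// so that a word containing a comma would still parse.
def parseVocabLine(line: String): (String, Int) = {
  val body = line.stripPrefix("(").stripSuffix(")")
  val cut = body.lastIndexOf(',')
  (body.substring(0, cut), body.substring(cut + 1).toInt)
}

val vocab: Map[String, Int] = sampleLines.map(parseVocabLine).toMap
// Inverse lookup, index -> word, for displaying topic terms.
val indexToWord: Map[Int, String] = vocab.map(_.swap)

println(indexToWord(859))
```

Splitting on the last comma is a design choice for robustness; the notebook's own code splits on every comma, which is equivalent as long as no word contains one.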

In [6]:
//val filename = "LDAModel/f-1.0-sd-0-nTpc-10-nIts-100-81eada91"
val filename = "LDAModels/trainedModel-1"

val filenameVocab = filename + "-vocab"

val ldaModelLoaded = DistributedLDAModel.load(sc,filename)

 
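// Each vocabulary line holds a word and its index: drop the leading and trailing characters and parse the pair.
val vocabLoaded: Map[String, Int] = sc.textFile(filenameVocab).
                 map(x=>x.split(",")).map(x=>(x(0).slice(1,x(0).size),
                 x(1).slice(0,x(1).size-1).toInt)).collect.toMap

// Inverse lookup (index -> word), used to print topic terms.
val vocabArrayLoaded: Map[Int, String] = vocabLoaded.map(x => x._2 -> x._1)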

filename = LDAModels/trainedModel-1
filenameVocab = LDAModels/trainedModel-1-vocab
ldaModelLoaded = org.apache.spark.mllib.clustering.DistributedLDAModel@3cdb8377
vocabLoaded = Map(kaiserstrasse -> 3213, serious -> 859, denon -> 3229, sinister -> 4597, precious -> 2567, sectors -> 4016, teresa -> 2933, ignition -> 5309, orioles -> 2123, terrible -> 5420, rate -> 3672, inevitable -> 3968, snow -> 4248, ecac -> 1801, michael -> 387, buckle -> 2515, analogous -> 4541, looks -> 336, gory -> 5163, california -> 535, newbie -> 1591, scare -> 2644, precludes -> 4307, wiser -> 3509, aclu -> 5344, accident -> 3978, muscles -> 3741, latent -> 3892, contracts -> 5375, practitioner -> 1209, ideas -> 3711, static...



## Displaying Topics and top 75 words for each topic

The model loaded here uses only 10 topics, which is fewer than the 20 directories but more than the 7 top-level groups those directories fall under (sci, comp, etc.). We expect an imperfect characterization, but some topics appear to be sensible. As a spot check that the loaded model is the expected one, look at the top term of topic 0 and its weight: in the output below it is "their", with a weight of about 0.0102.
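The count of 7 top-level groups can be checked directly from the directory names. A small sketch in plain Scala, with the 20 names copied from the listing above; `topLevel` is a name introduced here for illustration:

```scala
// The 20 newsgroup directory names, as listed earlier in this notebook.
val categories = Seq(
  "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
  "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
  "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
  "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med", "sci.space",
  "soc.religion.christian", "talk.politics.guns", "talk.politics.mideast",
  "talk.politics.misc", "talk.religion.misc")

// Keep the part of each name before the first dot and de-duplicate.
val topLevel = categories.map(_.takeWhile(_ != '.')).distinct.sorted

println(topLevel.mkString(", "))
println(s"${categories.size} directories, ${topLevel.size} top-level groups")
```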

In [7]:
var idx = 0
// Each entry pairs a topic's top term indices with the matching term weights.
val topicIndices = ldaModelLoaded.describeTopics(maxTermsPerTopic = 75)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC:" + idx)
  terms.zip(termWeights).foreach { case (term, weight) =>
    // Map the term index back to its word for display.
    println(s"${vocabArrayLoaded(term.toInt)}\t$weight")
  }
  idx += 1
  println()
}

TOPIC:0
their	0.010206415529524955
evidence	0.009680700054145041
those	0.009106578911717032
moral	0.008882379387183812
morality	0.008731527735880816
enviroleague	0.007855725667519659
school	0.007682124744273288
such	0.007139100836241298
people	0.006650679385118319
custom	0.0066047663743736855
other	0.005715676934328962
saying	0.005687362003627658
systems	0.005430594282749532
earth	0.00535073027473017
objective	0.005279513946269445
know	0.005228942544644583
hell	0.004655950865771618
heard	0.0046429765206000865
youth	0.004484101337024856
bible	0.0043767020740309885
just	0.004282166539650333
even	0.004096449450376287
where	0.004083749354512398
great	0.004066702107892993
answer	0.003943473852441394
conference	0.003906293391943268
anecdotal	0.00388392797398689
letter	0.0038825906700004167
said	0.0038728704099930383
these	0.003838132040713528
first	0.00378605943593762
organizations	0.0037550257606988008
service	0.003685182463541458
also	0.0036807129392890767
been	0.0036784867638874844
based

[... output truncated ...]

remove	0.0026739067875053654
could	0.0026716837542740336
read	0.0026567222772773914
used	0.002591094880553239
here	0.002542762043242895
server	0.002531986154134188
buying	0.002480427420652517
research	0.002478634910105323
video	0.002437762266063563

TOPIC:9
think	0.010690184801225536
know	0.010115354138117674
need	0.010013690320844346
could	0.009632945378593109
computer	0.008969894579273404
never	0.008488418058257579
help	0.007367649540854439
anyone	0.00728745616139563
much	0.007104525447873499
original	0.007065134294003463
while	0.006717389994440381
most	0.006059293369152935
card	0.005669021645559508
just	0.005561323838975921
where	0.0054820889127344015
clutch	0.005136278869478737
problem	0.004996981514085081
read	0.004588387242194891
says	0.004563495561484577
else	0.004544817106587224
another	0.004533424669957317
like	0.004421018458548945
before	0.004387800339230712
clipper	0.004283263232951924
time	0.004260988739465885
available	0.004155621630430944
difference	0.0040985272317422975


idx = 10
topicIndices = Array((Array(0, 65, 31, 135, 89, 154, 155, 25, 6, 210, 5, 126, 83, 228, 172, 2, 323, 156, 383, 140, 4, 14, 17, 80, 262, 409, 474, 240, 53, 16, 24, 349, 334, 23, 8, 147, 111, 113, 15, 30, 575, 79, 68, 231, 549, 41, 322, 453, 432, 67, 601, 382, 19, 764, 464, 82, 43, 99, 723, 138, 221, 178, 259, 10, 226, 108, 541, 305, 18, 218, 28, 523, 27, 553, 217),Array(0.010206415529524955, 0.009680700054145041, 0.009106578911717032, 0.008882379387183812, 0.008731527735880816, 0.007855725667519659, 0.007682124744273288, 0.007139100836241298, 0.006650679385118319, 0.0066047663743736855, 0.005715676934328962, 0.005687362003627658, 0.005430594282749532, 0.00535073027473017, 0.005279513946269445, 0.005228942544644583, 0.004655950865771618, 0....
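As the echoed value shows, `topicIndices` is an array with one `(term indices, weights)` pair per topic. The sketch below shows one way to walk such a structure with `zipWithIndex` instead of a mutable counter. It is plain Scala, and `topics` and `words` are small hypothetical stand-ins for `topicIndices` and `vocabArrayLoaded`, with made-up weights:

```scala
// Hypothetical stand-ins for the describeTopics output and the index-to-word map.
val topics: Array[(Array[Int], Array[Double])] = Array(
  (Array(0, 2), Array(0.6, 0.4)),
  (Array(1, 2), Array(0.7, 0.3))
)
val words: Map[Int, String] = Map(0 -> "moral", 1 -> "clipper", 2 -> "know")

// zipWithIndex supplies the topic number, so no mutable counter is needed.
val lines: Seq[String] = topics.zipWithIndex.toSeq.flatMap {
  case ((terms, weights), topicId) =>
    s"TOPIC:$topicId" +: terms.zip(weights).map { case (term, weight) =>
      s"${words(term)}\t$weight"
    }.toSeq
}

lines.foreach(println)
```

Building the lines first and printing afterwards also makes the formatting easy to test.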



## Using Brunel to Visualize Each Topic

Rather than sort through a long list of terms for every topic, we select one topic here and use Brunel to generate a word cloud. This can be easier to read, although the full list given above is also useful. Here, we visualize topic 0 (set by `topicToVisualize`), whose top terms in the list above include "moral", "morality", and "bible", which we might expect from the religion-related groups such as alt.atheism.
There are other ways to confirm that this is a good topic assignment, but we are demonstrating how an early effort at characterizing the corpus topics might be carried out.
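Because the same word and weight pairing is needed for every topic we want to look at, it can be wrapped in a small function. The following is a sketch in plain Scala; `topicWords` is a hypothetical helper, and `sampleTopics` and `sampleWords` are made-up stand-ins for `topicIndices` and `vocabArrayLoaded`:

```scala
// Hypothetical helper: pair each term of one topic with its weight, largest weight first.
def topicWords(
    topics: Array[(Array[Int], Array[Double])],
    indexToWord: Map[Int, String],
    topicId: Int): Seq[(String, Double)] = {
  val (terms, weights) = topics(topicId)
  terms.toSeq.map(indexToWord).zip(weights.toSeq).sortBy(-_._2)
}

// Made-up data standing in for the loaded model's topics and vocabulary.
val sampleTopics = Array(
  (Array(0, 2), Array(0.6, 0.4)),
  (Array(1, 2), Array(0.3, 0.7)))
val sampleWords = Map(0 -> "moral", 1 -> "clipper", 2 -> "know")

val cloudRows = topicWords(sampleTopics, sampleWords, 1)
println(cloudRows)
```

In the notebook, the result of such a helper could be passed to `sc.parallelize(...).toDF("word", "weight")` in the same way as the cells in this section do.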

In [8]:
val topicToVisualize = 0
// Map the topic's term indices back to words and pair each word with its weight.
val topicDF = sc.parallelize(
    topicIndices(topicToVisualize)._1.map(x => vocabArrayLoaded(x)).
    zip(topicIndices(topicToVisualize)._2)).toDF("word", "weight")

topicToVisualize = 0
topicDF = [word: string, weight: double]



In [9]:
%%brunel data('topicDF') cloud x(word) color(word) size(weight) label(word)

[Figure: Brunel word cloud of the selected topic's words, sized by weight]


In the output shown here, these words appear to be related to the religion-oriented groups (for example `alt.atheism`) rather than to a single manually labeled directory.

It is not unusual to see imperfect clustering. After all, the categories were assigned by hand, whereas we are using an algorithm to assign labels. We see behavior like this in even the simplest K-Means model (TODO: Link to KMeans Notebook).
The great benefit of unsupervised learning is its ability to give structure to vast quantities of data, revealing patterns that are otherwise difficult for a human to spot. Since we view this as an aid to understanding, we don't need a perfect correspondence between the annotations and the discovered topics. Looking at topic 1, we see words that seem to be related to the `sci.crypt` topic.

In [10]:
val topicToVisualize = 1
// Map the topic's term indices back to words and pair each word with its weight.
val topicDF = sc.parallelize(
    topicIndices(topicToVisualize)._1.map(x => vocabArrayLoaded(x)).
    zip(topicIndices(topicToVisualize)._2)).toDF("word", "weight")

topicToVisualize = 1
topicDF = [word: string, weight: double]



In [11]:
%%brunel data('topicDF') cloud x(word) color(word) size(weight) label(word)

[Figure: Brunel word cloud of the selected topic's words, sized by weight]


## Loading the First Article in the sci.crypt Directory

The first document in this directory will now be analyzed. Which topics will our model predict are most prevalent in this document?

In [16]:
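// Note: reduce(_+_) joins the lines with no separator, so the last word of each line fuses with the first word of the next.
val test_input = Seq(Source.fromFile("20_newsgroups/sci.crypt/14147").getLines.reduce(_+_))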

println("trying to assign a topic to the following text")

println("===")
println(test_input)
println("===")

trying to assign a topic to the following text
===
List(Xref: cantaloupe.srv.cs.cmu.edu alt.security.ripem:136 sci.crypt:14147 comp.security.misc:2868 alt.security:9389 comp.mail.misc:11904 comp.answers:213 news.answers:6521Newsgroups: alt.security.ripem,sci.crypt,comp.security.misc,alt.security,comp.mail.misc,comp.answers,news.answersPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!wupost!spool.mu.edu!caen!sol.ctr.columbia.edu!usenet.ucs.indiana.edu!silver.ucs.indiana.edu!mvanheynFrom: Marc VanHeyningen <mvanheyn@cs.indiana.edu>Subject: RIPEM Frequently Asked QuestionsContent-Type: text/x-usenet-FAQ; version=1.0; title="RIPEM FAQ"Message-ID: <C3Juww.Dru@usenet.ucs.indiana.edu>Followup-To: alt.security.ripemOriginator: mvanheyn@silver.ucs.indiana.eduSender: news@usenet.ucs.indiana.edu (USENET News System)Supersedes: <1993Jan25.113427.28926@news.cs.indiana.edu>Nntp-Posting-Host: silver.ucs.indiana.eduO

test_input = List(Xref: cantaloupe.srv.cs.cmu.edu alt.security.ripem:136 sci.crypt:14147 comp.security.misc:2868 alt.security:9389 comp.mail.misc:11904 comp.answers:213 news.answers:6521Newsgroups: alt.security.ripem,sci.crypt,comp.security.misc,alt.security,comp.mail.misc,comp.answers,news.answersPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!wupost!spool.mu.edu!caen!sol.ctr.columbia.edu!usenet.ucs.indiana.edu!silver.ucs.indiana.edu!mvanheynFrom: Marc VanHeyningen <mvanheyn@cs.indiana.edu>Subject: RIPEM Frequently Asked QuestionsContent-Type: text/x-usenet-FAQ; version=1.0; title="RIPEM FAQ"Message-ID: <C3Juww.Dru@usenet.ucs.indiana.edu>Followup-To: alt.security.ripemOriginator: mvanhey...
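The text echoed above contains fused tokens such as `news.answers:6521Newsgroups:`, because the file's lines are concatenated without a separator. The sketch below contrasts that with joining on a space. It is plain Scala, and the two sample lines are illustrative, loosely modeled on the header shown above:

```scala
// Two illustrative lines, loosely modeled on the article header.
val lines = Seq("news.answers:6521", "Newsgroups: sci.crypt")

// Concatenating with no separator fuses the words at the line boundary.
val fused = lines.reduce(_ + _)
// Joining with a space keeps them as separate tokens.
val spaced = lines.mkString(" ")

println(fused.split("\\s").toSeq)
println(spaced.split("\\s").toSeq)
```

Whether this matters depends on how many in-vocabulary words sit at line boundaries. Switching to `mkString(" ")` would change the token counts, so the topic fractions reported later in this notebook would not be reproduced exactly.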



## Generating a Topic Distribution from the text document

- TODO:  Add the equations for topic distributions?
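As a sketch of the equations involved, here is the standard LDA formulation. The symbols are the usual textbook ones rather than names from this notebook: document $d$ has a topic mixture $\theta_d$ over $K$ topics, each word position $n$ has a topic assignment $z_{dn}$, and topic $k$ has a word distribution $\beta_k$.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{dn} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad
w_{dn} \mid z_{dn} \sim \mathrm{Categorical}(\beta_{z_{dn}})

% With variational inference, the posterior over \theta_d is approximated by
% \mathrm{Dirichlet}(\gamma_d), and the reported topic fractions are its mean:
\hat{\theta}_{dk} = \frac{\gamma_{dk}}{\sum_{j=1}^{K} \gamma_{dj}}
```

The second expression assumes a variational treatment, which is our understanding of how Spark's `LocalLDAModel.topicDistributions` works; treat that as an assumption rather than a statement about the exact implementation.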

Here we generate the estimate of the fraction of each topic present in this set of words.  The topic that is most strongly represented here would be considered the main subject of the article.
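Before the model can estimate these fractions, the text has to be turned into a vector of word counts indexed by vocabulary position. The following is a minimal sketch of that step in plain Scala, without Spark; the three-word `vocab` and the sample `text` are hypothetical:

```scala
// Hypothetical vocabulary; the real one is loaded from the "-vocab" file.
val vocab: Map[String, Int] = Map("clipper" -> 0, "know" -> 1, "computer" -> 2)
val text = "I know the clipper chip and I know the computer"

// Count occurrences of each in-vocabulary token, keyed by vocabulary index.
// Tokens missing from the vocabulary are dropped.
val counts: Map[Int, Double] = text.split("\\s+").toSeq
  .flatMap(token => vocab.get(token))
  .groupBy(identity)
  .map { case (idx, hits) => idx -> hits.size.toDouble }

// Sorted (index, count) pairs are the entries of a sparse count vector.
val sparseEntries: Seq[(Int, Double)] = counts.toSeq.sortBy(_._1)
println(sparseEntries)
```

In the notebook, entries like these are passed to `Vectors.sparse` together with the vocabulary size.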

In [17]:
// Build a bag-of-words count vector for each input document.
val test_document = sc.parallelize(test_input.map(doc => doc.split("\\s")).zipWithIndex.map { case (tokens, id) =>
    val counts = new mutable.HashMap[Int, Double]()
    tokens.foreach { term =>
        // Only terms present in the model vocabulary contribute to the counts.
        if (vocabLoaded.contains(term)) {
            val idx = vocabLoaded(term)
            counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
        }
    }
    (id.toLong, Vectors.sparse(vocabLoaded.size, counts.toSeq))
})

// Convert to a LocalLDAModel, which can infer topic distributions for new documents.
val localLDAModelLoaded: LocalLDAModel = ldaModelLoaded.toLocal
val topicDistributions = localLDAModelLoaded.topicDistributions(test_document)
println("first topic distribution:" + topicDistributions.first._2.toArray.mkString(", "))

first topic distribution:0.07967191081138034, 0.1246539383633934, 0.1486657313167322, 0.08355097889247869, 0.09453706701068618, 0.03970100667718787, 0.12126878774172307, 0.0348534882765313, 0.14227528770721515, 0.13082180320267184


test_document = ParallelCollectionRDD[48] at parallelize at <console>:45
localLDAModelLoaded = org.apache.spark.mllib.clustering.LocalLDAModel@2be17e22
topicDistributions = MapPartitionsRDD[52] at map at LDAModel.scala:355




## Displaying Topic Distributions

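- The table and bar plot show a fairly flat distribution for this article: topic 2 has the largest fraction (about 0.149), followed closely by topics 8 (about 0.142) and 9 (about 0.131), so no single topic dominates. Comparing the leading topics against their word lists indicates how well the prediction matches the article's `sci.crypt` label. The fractions differ in the later decimal places each time they are evaluated.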
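Reading the largest fractions off a table by eye is error-prone, so a short ranking step can help. This sketch is plain Scala, with the fractions copied from the table output in this section; `ranked` is a name introduced here:

```scala
// Topic fractions copied from the table shown for this document.
val fractions = Array(
  0.07967163143697772, 0.1246541909082848, 0.14866580106143887,
  0.08355079941391448, 0.09453719591363383, 0.03969966075757955,
  0.1212698363588011, 0.03485353580933095, 0.14227558375934002,
  0.13082176458069883)

// Rank topics by their share of the document, largest first.
val ranked: Seq[(Int, Double)] =
  fractions.zipWithIndex.map { case (share, topic) => (topic, share) }
    .sortBy(-_._2).toSeq

ranked.take(3).foreach { case (topic, share) => println(f"topic $topic: $share%.3f") }
// The fractions form a probability distribution, so they should sum to about 1.
println(f"sum: ${fractions.sum}%.3f")
```

For these values the three leading topics are 2, 8, and 9, and the gap between them is small.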

In [19]:
val topicDistributionsDF = sc.parallelize(topicDistributions.map(x=>x._2).first.toArray.zipWithIndex).
                         toDF("TopicFraction","TopicNumber")
topicDistributionsDF.show

+-------------------+-----------+
|      TopicFraction|TopicNumber|
+-------------------+-----------+
|0.07967163143697772|          0|
| 0.1246541909082848|          1|
|0.14866580106143887|          2|
|0.08355079941391448|          3|
|0.09453719591363383|          4|
|0.03969966075757955|          5|
| 0.1212698363588011|          6|
|0.03485353580933095|          7|
|0.14227558375934002|          8|
|0.13082176458069883|          9|
+-------------------+-----------+



topicDistributionsDF = [TopicFraction: double, TopicNumber: int]



In [20]:
%%brunel data('topicDistributionsDF') bar x(TopicNumber) y(TopicFraction) transpose

[Figure: Brunel bar chart of TopicFraction by TopicNumber]
