# Topic Modeling with Latent Dirichlet Allocation in Spark

Latent Dirichlet Allocation (LDA) is a popular method for clustering, especially for text-based datasets. It performs particularly well on sparse feature sets, such as word-count vectors. This alleviates the need for something like word2vec, which produces a dense, lower-dimensional feature vector.

Since this analysis can take 10 minutes or more on your laptop (the three timed steps in the run recorded here add up to roughly 19 minutes), we'll train the model in one notebook and load the results in a second notebook.

In [1]:
import scala.collection.mutable
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

## Downloading Data

We will download the 20 Newsgroups dataset if you haven't already done so. This dataset contains roughly 20,000 newsgroup posts (19,997 in this archive), organized into one directory per newsgroup.

In [3]:
import sys.process._
val listing = "ls"!! ;

if (listing.contains("20_newsgroups") == false ) {
    println("downloading dataset")
    "wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz"!;
    "tar -xvf 20news-19997.tar.gz"!;
    "rm 20news-19997*"!;
}
println("directories in 20 newsgroup dataset")


directories in 20 newsgroup dataset


listing = 


"20news-19997.tar.gz
SimpleL...


20news-19997.tar.gz
20_newsgroups
ActivationFunctions.html
ActivationFunctions.ipynb
cars2.csv
cars.csv
derby.log
ex3ExampleMultivariateGD.m
ex3x.dat
ex3y.dat
gradientDescent.m
Heart Model with Hyperparameter Search.html
Heart Model with Hyperparameter Search.ipynb
LDAModels
LDAModelsOld
LDAModelTraining.html
LDAModelTraining.ipynb
LDAPredictWithPretrainedModel.html
LDAPredictWithPretrainedModel.ipynb
LinearRegressionSingleFeature.html
LinearRegressionSingleFeature.ipynb
MNIST for Beginners.html
MNIST for Beginners.ipynb
model
multinomialSampler.html
multinomialSampler.ipynb
notebook.tex
output_15_1.png
output_17_1.png
output_39_0.png
output_40_2.png
patientdataV6.csv
README.md
recentDraft.pdf
SimpleKMeans.ipynb
SimpleLinearRegression.html
SimpleLinearRegression.ipynb
SimpleLinearRegression.ipynb.invalid
spark-warehouse
test.py
Untitled.ipynb
Word2VecWithKMeans.html
Word2VecWithKMeans.ipynb


The directories (categories) are listed in the output below.

In [4]:
"ls 20_newsgroups/"!

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc




0

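Let's load the posts into a corpus. Note that the cell below does not keep every document: it takes a 1% random sample, with a fixed seed so that the sample is repeatable.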

In [5]:
val corpus: RDD[String] = sc.wholeTextFiles("20_newsgroups/*/*").map(_._2).
                           sample(fraction=0.01,withReplacement=false,seed=0L)

corpus = PartitionwiseSampledRDD[3] at sample at <console>:35


PartitionwiseSampledRDD[3] at sample at <console>:35
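
As a rough illustration of what sampling without replacement does, here is a sketch in plain Scala collections rather than Spark: each element is kept independently with probability `fraction`. The names `SampleSketch` and `bernoulliSample` are made up for this note, and Spark's own sampler works per partition with its own random number generator, so it will not pick the same documents. The point is only that the sample size is approximately, not exactly, 1% of the corpus, and that a fixed seed makes the choice repeatable.

```scala
import scala.util.Random

object SampleSketch {
  // Keep each item independently with probability `fraction`.
  // Hypothetical helper for illustration; not part of the Spark API.
  def bernoulliSample[A](items: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed) // fixed seed => same items chosen on every run
    items.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Calling `SampleSketch.bernoulliSample(docs, 0.01, 0L)` twice on the same `docs` returns the same subset, which is the property the notebook relies on when it sets `seed=0L`.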

## Tokenization and Word Count

From the corpus, we tokenize the documents: each document is lower-cased and split on whitespace, and we keep only tokens that are longer than three characters and consist entirely of letters. To count words, every token is then paired with a count of 1, and the counts of repeated tokens are summed across the whole corpus to get a total word count. Because this is done so frequently, word count is often called the "hello world" of big data and text mining. This can take about 5 min.

In [7]:
// Split each document into a sequence of terms (words)
val tokenized: RDD[Seq[String]] =
  corpus.map(_.toLowerCase.split("\\s")).map(_.filter(_.length > 3).filter(_.forall(java.lang.Character.isLetter)))

// Do the word count: total occurrences of each term across the whole corpus, most frequent first
val t0TermCount = System.nanoTime
val termCounts: Array[(String, Long)] =
  tokenized.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)
val dtTermCount = Math.round((System.nanoTime-t0TermCount)/1e6)/1e3
println("Time to Compute Word Count: " + dtTermCount + "s")

Time to Compute Word Count: 260.565s


tokenized = MapPartitionsRDD[9] at map at <console>:42
t0TermCount = 2285994266528809
termCounts = Array((that,661), (this,361), (with,357), (have,305), (from,217), (what,187), (will,186), (they,185), (about,169), (would,152), (your,135), (some,133), (article,126), (there,124), (more,120), (when,110), (other,108), (people,104), (just,103), (which,102), (their,90), (like,89), (were,89), (know,86), (university,84), (only,84), (been,81), (information,80), (these,78), (than,74), (should,74), (most,73), (because,71), (does,69), (system,68), (also,68), (even,64), (many,64), (make,63), (could,63), (them,58), (into,57), (good,55), (think,52), (very,52), (anonymous,52), (same,50), (government,50), (over,49), (internet,49), (canc...
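
The tokenization and counting steps above can be tried on a toy corpus without a cluster. The sketch below applies the same filtering rules with plain Scala collections; `TokenizeSketch` and its methods are names invented for this note, and `groupBy` stands in for Spark's `reduceByKey`.

```scala
object TokenizeSketch {
  // Same rules as the Spark cell: lower-case, split on whitespace,
  // keep tokens longer than 3 characters that contain only letters.
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\s").toSeq
      .filter(_.length > 3)
      .filter(_.forall(_.isLetter))

  // Corpus-wide counts, most frequent first (ties broken alphabetically).
  def termCounts(docs: Seq[String]): Seq[(String, Long)] =
    docs.flatMap(tokenize)
      .groupBy(identity)
      .map { case (term, occurrences) => term -> occurrences.size.toLong }
      .toSeq
      .sortBy { case (term, count) => (-count, term) }
}
```

Note that a token with attached punctuation, such as `counts,`, fails the letters-only test and is dropped entirely rather than cleaned up.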




## Removing Stop Words

We sorted the word count by frequency so that we can easily see that innocuous words tend to occur most often. For instance, the word "that" occurs 661 times, and it clearly does not help us understand which topic a document belongs to. We call these words "stop words" and filter them from the dataset. For this example, we remove the 20 most frequently occurring words. The cell also computes the number of terms in the top 10% of the vocabulary (653 here), but that value is not used.
We then generate term count vectors for the documents, which are a more compact representation of the counts and are what the Spark API requires for training.

In [8]:
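// Number of most-frequent terms that are actually dropped as stop words
val numStopwords = 20

// Alternative cutoff: the top 10% of distinct terms.
// Note the capital W: this value is computed for reference only and is NOT used below.
val fraction = 0.1
val numStopWords = Math.round(fraction * termCounts.size)

// (JN:  This only drops first 20 words...what about a different approach?)
// termCounts is sorted by descending count, so takeRight keeps everything after the first numStopwords terms
val vocabArray: Array[String] =
  termCounts.takeRight(termCounts.size - numStopwords).map(_._1)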

//   vocab: Map term -> term index
val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap

// Convert documents into term count vectors
val t0Documents = System.nanoTime
val documents: RDD[(Long, Vector)] =
  tokenized.zipWithIndex.map { case (tokens, id) =>
    val counts = new mutable.HashMap[Int, Double]()
    tokens.foreach { term =>
      if (vocab.contains(term)) {
        val idx = vocab(term)
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
    }
    (id, Vectors.sparse(vocab.size, counts.toSeq))
  }

val dtDocuments = Math.round((System.nanoTime - t0Documents)/1e6)/1e3
println("Time to Compute Term Count: " + dtDocuments + "s")

Time to Compute Term Count: 225.152s


numStopwords = 20
fraction = 0.1
numStopWords = 653
vocabArray = Array(their, like, were, know, university, only, been, information, these, than, should, most, because, does, system, also, even, many, make, could, them, into, good, think, very, anonymous, same, government, over, internet, cancer, privacy, still, email, then, those, computer, time, need, windows, much, used, access, using, anyone, believe, first, mail, said, world, since, between, after, where, encryption, right, really, research, want, file, being, problem, while, under, find, each, aids, such, never, part, available, address, breast, local, data, better, take, back, might, through, someone, fact, going, different, sure, number, news, anonymity, something, things, must, april, years, me...
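
To make the two cutoffs and the term count vectors concrete, here is a small sketch in plain Scala. All names (`VocabSketch`, `dropTopN`, `dropTopFraction`, `toSparseCounts`) are hypothetical, and the sparse vector is modelled as sorted `(index, count)` pairs rather than a Spark `Vector`.

```scala
object VocabSketch {
  // termCounts must already be sorted by descending frequency.
  def dropTopN(termCounts: Seq[(String, Long)], n: Int): Seq[String] =
    termCounts.drop(n).map(_._1)

  // Fraction-based cutoff, e.g. 0.1 drops the top 10% of distinct terms.
  def dropTopFraction(termCounts: Seq[(String, Long)], fraction: Double): Seq[String] =
    dropTopN(termCounts, math.round(fraction * termCounts.size).toInt)

  // Sparse term counts for one document: only vocabulary terms are counted,
  // and the result is sorted by term index.
  def toSparseCounts(tokens: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    tokens.flatMap(token => vocab.get(token))
      .groupBy(identity)
      .map { case (idx, hits) => idx -> hits.size.toDouble }
      .toSeq
      .sortBy(_._1)
}
```

Tokens that were removed as stop words simply have no entry in `vocab`, so they never contribute to a document's vector.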




## Training the LDA Model

Training took about 11 minutes in the run shown below. You should only need to do this once in your life, but if you do it twice you should get the same result, because the random seed is set to 0.

In [1]:
// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(100)
lda.setSeed(0) // important for reproducibility

val t0Train = System.nanoTime
val ldaModel = lda.run(documents)
val dtTrain = Math.round((System.nanoTime - t0Train)/1e6)/1e3
println("Time to Train: " + dtTrain + "s")

Time to Train: 659.922s


numTopics = 10
lda = org.apache.spark.mllib.clustering.LDA@6da22e08
t0Train = 2287729162310245
ldaModel = org.apache.spark.mllib.clustering.DistributedLDAModel@5173a565
dtTrain = 659.922


659.922

We can now print out the top 30 words associated with each of our 10 topics. This will be handy for comparison to the topics in our second notebook. For example, the first word of Topic 0 should be "anonymous" with a score of 0.0126.

In [10]:
// Print topics, showing the 30 top-weighted terms for each topic.
var idx = 0
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 30)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC: " + idx)
  terms.zip(termWeights).foreach { case (term, weight) =>
    println(s"${vocabArray(term.toInt)}\t$weight")
  }
  println()
  idx += 1
}

TOPIC: 0
anonymous	0.012643275631369468
internet	0.01199892653846487
system	0.011489389159339945
email	0.010461753442982532
information	0.010033859874241645
many	0.009046697848641316
most	0.008985058416534403
these	0.008776317737127445
privacy	0.00870315316139373
mail	0.008168184471456088
address	0.007318294146922494
anonymity	0.006893699748457801
their	0.006457573336484874
access	0.005735232098296366
usenet	0.0056810127282647055
computer	0.0054274783521739735
sites	0.005346644944619072
network	0.005201538796310405
user	0.004953510922416698
message	0.004673200538625899
files	0.004580725684189369
associated	0.0045489408314838406
identity	0.0044313548110878585
where	0.004270628694089294
currently	0.004267651379416461
over	0.004165226488264872
find	0.0041167933944265675
rights	0.004083307820384657
file	0.0040832693812494865
been	0.0040562934952105565

TOPIC: 1
encryption	0.01481548780905835
government	0.013693226139575654
same	0.00917146448637624
need	0.008240628107969715
technology	0.008

idx = 10
topicIndices = Array((Array(25, 29, 14, 33, 7, 17, 11, 8, 31, 47, 71, 87, 0, 42, 111, 36, 120, 101, 115, 147, 117, 142, 172, 53, 151, 28, 64, 154, 59, 6),Array(0.012643275631369468, 0.01199892653846487, 0.011489389159339945, 0.010461753442982532, 0.010033859874241645, 0.009046697848641316, 0.008985058416534403, 0.008776317737127445, 0.00870315316139373, 0.008168184471456088, 0.007318294146922494, 0.006893699748457801, 0.006457573336484874, 0.005735232098296366, 0.0056810127282647055, 0.0054274783521739735, 0.005346644944619072, 0.005201538796310405, 0.004953510922416698, 0.004673200538625899, 0.004580725684189369, 0.0045489408314838406, 0.0044313548110878585, 0.004270628694089294, 0.004267651379416461, 0.004165226488264872, 0.00411679339...


[([I@40a32aec,[D@3f930ab2), ([I@384762db,[D@f9c23ca), ([I@56b19c9f,[D@71d0da61), ([I@72023f3a,[D@1c6ed0c3), ([I@4661e2b5,[D@31953bca), ([I@528d392d,[D@380f1377), ([I@71322a89,[D@6961040e), ([I@521e1c5e,[D@12ca29c1), ([I@724c2f73,[D@136c0440), ([I@3132069f,[D@72f5a559)]
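
The printing loop above keeps a mutable counter for the topic number. One possible alternative, sketched here in plain Scala, uses `zipWithIndex` instead. It assumes the topics are shaped like the `topicIndices` output shown above, that is, one `(termIndices, termWeights)` pair per topic; `TopicPrintSketch` and `formatTopics` are made-up names.

```scala
object TopicPrintSketch {
  // Turn (termIndices, termWeights) pairs into printable blocks,
  // looking each term index up in the vocabulary array.
  def formatTopics(topics: Array[(Array[Int], Array[Double])],
                   vocabArray: Array[String]): Seq[String] =
    topics.zipWithIndex.toSeq.map { case ((terms, weights), topicId) =>
      val lines = terms.zip(weights).map { case (term, weight) =>
        s"${vocabArray(term)}\t$weight"
      }
      (s"TOPIC: $topicId" +: lines.toSeq).mkString("\n")
    }
}
```

With the notebook's values this would be called as `TopicPrintSketch.formatTopics(topicIndices, vocabArray).foreach(println)`.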

Whew, that took a long time! Let's write this to a directory and move on to the next notebook! Spark has a standard format for saving and loading the model, but it doesn't include the vocabulary, and we want to look at the words in the next notebook. First, let's save the vocabulary as (term, index) pairs.

In [34]:
val filename = "trainedModel-1"
sc.parallelize(vocab.toSeq).saveAsTextFile("LDAModels/" + filename + "-vocab")


filename = trainedModel-1




trainedModel-1
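
`saveAsTextFile` writes each element using its `toString`, so every `(term, index)` pair of the vocabulary ends up as a line such as `(their,0)`. A minimal sketch of turning such lines back into a vocabulary array follows. `VocabIo`, `parseVocabLine` and `toVocabArray` are hypothetical helpers, and the sketch assumes the lines have already been read, for instance with `sc.textFile` on the `-vocab` directory followed by `collect()`.

```scala
object VocabIo {
  // Matches the Tuple2.toString form "(term,index)".
  private val Line = """\((.*),(\d+)\)""".r

  // Returns None for lines that do not look like a saved vocabulary entry.
  def parseVocabLine(line: String): Option[(String, Int)] = line match {
    case Line(term, idx) => Some(term -> idx.toInt)
    case _               => None
  }

  // Rebuild the index-ordered array, so that position i holds the term with index i.
  def toVocabArray(pairs: Seq[(String, Int)]): Array[String] =
    pairs.sortBy(_._2).map(_._1).toArray
}
```

The sort by index matters because the saved lines are not guaranteed to come back in index order.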

Finally, we'll save the model for loading in the next notebook.

In [37]:
ldaModel.save(sc, "LDAModels/" + filename)

