ScaDaMaLe Course
[site](https://lamastex.github.io/scalable-data-science/sds/3/x/) and
[book](https://lamastex.github.io/ScaDaMaLe/index.html)

Topic Modeling of Movie Dialogs with Latent Dirichlet Allocation
================================================================

**Let us cluster the conversations from different movies!**

This notebook will provide a brief algorithm summary, links for further
reading, and an example of how to use LDA for Topic Modeling.

**not tested in Spark 2.2+ yet (see 034 notebook for syntactic issues,
if any)**

Algorithm Summary
-----------------

-   **Task**: Identify topics from a collection of text documents
-   **Input**: Vectors of word counts
-   **Optimizers**:
    -   EMLDAOptimizer using [Expectation
        Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    -   OnlineLDAOptimizer using Iterative Mini-Batch Sampling for
        [Online Variational
        Bayes](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)

Links
-----

-   Spark API docs
    -   Scala:
        [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
    -   Python:
        [LDA](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA)
-   [MLlib Programming
    Guide](http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
-   [ML Feature Extractors &
    Transformers](http://spark.apache.org/docs/latest/ml-features.html)
-   [Wikipedia: Latent Dirichlet
    Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Readings for LDA
----------------

-   A high-level introduction to the topic from Communications of the
    ACM
    -   <http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf>
-   A very good high-level humanities introduction to the topic
    (recommended by Chris Thomson in English Department at UC, Ilam):
    -   <http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/>

Also read the methodological and more formal papers cited in the above
links if you want to know more.

Let's get a bird's eye view of LDA from
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf next.

-   See pictures (hopefully you read the paper last night!)
-   Algorithm of the generative model (this is unsupervised clustering)
-   For a careful introduction to the topic see Section 27.3 and 27.4
    (pages 950-970) of Murphy's *Machine Learning: A Probabilistic
    Perspective, MIT Press, 2012*.
-   We will be quite application focussed or applied here!

In [None]:
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
      """<iframe 
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
   }
displayHTML(frameIt("http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/",900))

In [None]:
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))

In [None]:
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Model",600))

In [None]:
displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Mathematical_definition",910))

  

Probabilistic Topic Modeling Example
------------------------------------

This is an outline of our Topic Modeling workflow. Feel free to jump to
any subtopic to find out more.

-   Step 0. Dataset Review
-   Step 1. Downloading and Loading Data into DBFS
    -   (Step 1. only needs to be done once per shard - see details at
        the end of the notebook for Step 1.)
-   Step 2. Loading the Data and Data Cleaning
-   Step 3. Text Tokenization
-   Step 4. Remove Stopwords
-   Step 5. Vector of Token Counts
-   Step 6. Create LDA model with Online Variational Bayes
-   Step 7. Review Topics
-   Step 8. Model Tuning - Refilter Stopwords
-   Step 9. Create LDA model with Expectation Maximization
-   Step 10. Visualize Results

Step 0. Dataset Review
----------------------

In this example, we will use the [Cornell Movie Dialogs
Corpus](https://people.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html).

Here is the `README.txt`:

------------------------------------------------------------------------

------------------------------------------------------------------------

Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding
coordination of linguistic style in dialogs" Cristian
Danescu-Niculescu-Mizil and Lillian Lee Proceedings of the Workshop on
Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email
to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to
our list of people using this data. Thanks!

Contents of this README:

        A) Brief description
        B) Files description
        C) Details on the collection procedure
        D) Contact

A\) Brief description:

This corpus contains a metadata-rich collection of fictional
conversations extracted from raw movie scripts:

-   220,579 conversational exchanges between 10,292 pairs of movie
    characters
-   involves 9,035 characters from 617 movies
-   in total 304,713 utterances
-   movie metadata included:
    -   genres
    -   release year
    -   IMDB rating
    -   number of IMDB votes
    -   IMDB rating
-   character metadata included:
    -   gender (for 3,774 characters)
    -   position on movie credits (3,321 characters)

B\) Files description:

In all files the field separator is " +++$+++ "

-   movie\_titles\_metadata.txt
    -   contains information about each movie title
    -   fields:
        -   movieID,
        -   movie title,
        -   movie year,
        -   IMDB rating,
        -   no. IMDB votes,
        -   genres in the format \['genre1','genre2',...,'genreN'\]

-   movie\_characters\_metadata.txt
    -   contains information about each movie character
    -   fields:
        -   characterID
        -   character name
        -   movieID
        -   movie title
        -   gender ("?" for unlabeled cases)
        -   position in credits ("?" for unlabeled cases)

-   movie\_lines.txt
    -   contains the actual text of each utterance
    -   fields:
        -   lineID
        -   characterID (who uttered this phrase)
        -   movieID
        -   character name
        -   text of the utterance

-   movie\_conversations.txt
    -   the structure of the conversations
    -   fields
        -   characterID of the first character involved in the
            conversation
        -   characterID of the second character involved in the
            conversation
        -   movieID of the movie in which the conversation occurred
        -   list of the utterances that make the conversation, in
            chronological order: \['lineID1','lineID2',...,'lineIDN'\]
            has to be matched with movie\_lines.txt to reconstruct the
            actual content

-   raw\_script\_urls.txt
    -   the urls from which the raw sources were retrieved

C\) Details on the collection procedure:

We started from raw publicly available movie scripts (sources
acknowledged in raw\_script\_urls.txt). In order to collect the metadata
necessary for this study and to distinguish between two script versions
of the same movie, we automatically matched each script with an entry in
movie database provided by IMDB (The Internet Movie Database; data
interfaces available at http://www.imdb.com/interfaces). Some amount of
manual correction was also involved. When more than one movie with the
same title was found in IMDB, the match was made with the most popular
title (the one that received most IMDB votes)

After discarding all movies that could not be matched or that had less
than 5 IMDB votes, we were left with 617 unique titles with metadata
including genre, release year, IMDB rating and no. of IMDB votes and
cast distribution. We then identified the pairs of characters that
interact and separated their conversations automatically using simple
data processing heuristics. After discarding all pairs that exchanged
less than 5 conversational exchanges there were 10,292 left, exchanging
220,579 conversational exchanges (304,713 utterances). After
automatically matching the names of the 9,035 involved characters to the
list of cast distribution, we used the gender of each interpreting actor
to infer the fictional gender of a subset of 3,321 movie characters (we
raised the number of gendered 3,774 characters through manual
annotation). Similarly, we collected the end credit position of a subset
of 3,321 characters as a proxy for their status.

D\) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian
Danescu-Niculescu-Mizil)

------------------------------------------------------------------------

------------------------------------------------------------------------

Step 2. Loading the Data and Data Cleaning
------------------------------------------

We have already used the wget command to download the file, and put it
in our distributed file system (this process takes about 1 minute). To
repeat these steps or to download data from another source follow the
steps at the bottom of this worksheet on **Step 1. Downloading and
Loading Data into DBFS**.

Let's make sure these files are in dbfs now:

In [None]:
// this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")) 

  

Conversations Data
------------------

In [None]:
sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").top(5).foreach(println)

In [None]:
// Load the text file and pair each line with its index (the index serves as the conversation id later)
val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").zipWithIndex()

  

Review the first 5 lines to get a sense for the data format.

In [None]:
conversationsRaw.take(5).foreach(println) // the first five (line, index) pairs in the RDD; top(5) would instead return the five largest

In [None]:
conversationsRaw.count // there are over 83,000 conversations in total

In [None]:
import scala.util.{Failure, Success}

val regexConversation = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

case class conversationLine(a: String, b: String, c: String, d: String)

val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt")
 .zipWithIndex()
  .map(x => 
          {
            val id:Long = x._2
            val line = x._1
            val pLine = regexConversation.findFirstMatchIn(line)
                               .map(m => conversationLine(m.group(1), m.group(3), m.group(5), m.group(7))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              (id,pLine)
           }
  )

In [None]:
conversationsRaw.filter(x => x._2.isSuccess).count()

In [None]:
conversationsRaw.filter(x => x._2.isFailure).count()
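
To see what the regular expression above does to a single record, here is a minimal plain-Scala sketch that runs without Spark. The names `ConversationRegexSketch`, `ParsedConversation` and `parse` are hypothetical helpers for illustration only, and the sample line in the usage note is merely an example in the format described by the README.

```scala
import scala.util.{Failure, Success, Try}

object ConversationRegexSketch {
  // Same pattern as in the cell above: group 2 captures the separator "+++$+++"
  // and the back-reference \2 requires that same separator between later fields.
  val regexConversation =
    """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

  // Hypothetical record type used only in this sketch.
  final case class ParsedConversation(char1: String, char2: String, movieID: String, lineIDs: String)

  // Success for a matching line, Failure carrying the offending line otherwise.
  def parse(line: String): Try[ParsedConversation] =
    regexConversation.findFirstMatchIn(line) match {
      case Some(m) => Success(ParsedConversation(m.group(1), m.group(3), m.group(5), m.group(7)))
      case None    => Failure(new Exception(s"Non matching input: $line"))
    }
}
```

For instance, `ConversationRegexSketch.parse("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']")` should give a `Success`, while a line without the separator gives a `Failure`. Wrapping each result in `Try` is what lets us count successes and failures separately instead of crashing on the first bad line.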

  

The conversation number and line numbers of each conversation are in one
line in `conversationsRaw`.

In [None]:
conversationsRaw.filter(x => x._2.isSuccess).take(5).foreach(println)

  

Let's create `conversations` that have just the conversation id and
line-number with order information.

In [None]:
val conversations 
    = conversationsRaw
      .filter(x => x._2.isSuccess)
      .flatMap { 
        case (id,Success(l))  
                  => { val conv = l.d.replace("[","").replace("]","").replace("'","").replace(" ","")
                       val convLinesIndexed = conv.split(",").zipWithIndex
                       convLinesIndexed.map( cLI => (id, cLI._2, cLI._1))
                      }
       }.toDF("conversationID","intraConversationID","lineID")

In [None]:
conversations.show(15)
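
The `flatMap` above turns one conversation record into one row per utterance. A minimal plain-Scala sketch of that step, assuming the list of line IDs arrives as a string such as `['L194', 'L195']` (the object and method names are hypothetical and not part of the notebook):

```scala
object ExplodeLineIdsSketch {
  // Strip brackets, quotes and spaces, split on commas, and attach to every
  // lineID the conversation id and its position within the conversation.
  def explode(conversationID: Long, rawList: String): Seq[(Long, Int, String)] = {
    val cleaned = rawList.replace("[", "").replace("]", "").replace("'", "").replace(" ", "")
    cleaned.split(",").toSeq.zipWithIndex.map { case (lineID, position) =>
      (conversationID, position, lineID)
    }
  }
}
```

The position plays the role of `intraConversationID`, so the original order of utterances can be recovered after later joins.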

  

Movie Titles
------------

In [None]:
val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
moviesMetaDataRaw.top(5).foreach(println)

In [None]:
moviesMetaDataRaw.count() // number of movies

In [None]:
import scala.util.{Failure, Success}

/*  - contains information about each movie title
  - fields:
          - movieID,
          - movie title,
          - movie year,
          - IMDB rating,
          - no. IMDB votes,
          - genres in the format ['genre1','genre2',...,'genreN']
          */
val regexMovieMetaData = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)""".r

case class lineInMovieMetaData(movieID: String, movieTitle: String, movieYear: String, IMDBRating: String, NumIMDBVotes: String, genres: String)

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
  .map(line => 
          {
            val pLine = regexMovieMetaData.findFirstMatchIn(line)
                               .map(m => lineInMovieMetaData(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9), m.group(11))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )

In [None]:
moviesMetaDataRaw.count

In [None]:
moviesMetaDataRaw.filter(x => x.isSuccess).count()

In [None]:
moviesMetaDataRaw.filter(x => x.isSuccess).take(10).foreach(println)

In [None]:
//moviesMetaDataRaw.filter(x => x.isFailure).take(10).foreach(println) // to regex refine for casting

In [None]:
val moviesMetaData 
    = moviesMetaDataRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF().select("movieID","movieTitle","movieYear")

In [None]:
moviesMetaData.show(10,false)

  

Lines Data
----------

In [None]:
val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")

In [None]:
linesRaw.count() // number of lines making up the conversations

  

Review the first 5 lines to get a sense for the data format.

In [None]:
linesRaw.take(5).foreach(println) // take(5) returns the first five lines; top(5) would return the five largest

  

To see 5 random lines in `movie_lines.txt`, evaluate the following cell.

In [None]:
linesRaw.takeSample(false, 5).foreach(println)

In [None]:
import scala.util.{Failure, Success}

/*  fields in movie_lines.txt are:
          - lineID
          - characterID (who uttered this phrase)
          - movieID
          - character name
          - text of the utterance
          */
val regexLine = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)""".r

case class lineInMovie(lineID: String, characterID: String, movieID: String, characterName: String, text: String)

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
  .map(line => 
          {
            val pLine = regexLine.findFirstMatchIn(line)
                               .map(m => lineInMovie(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )

In [None]:
linesRaw.filter(x => x.isSuccess).count()

In [None]:
linesRaw.filter(x => x.isFailure).count()

In [None]:
linesRaw.filter(x => x.isSuccess).take(5).foreach(println)

  

Let's make a DataFrame out of the successfully parsed lines.

In [None]:
val lines 
    = linesRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF()
      .join(moviesMetaData, "movieID") // and join it to get movie meta data

In [None]:
lines.show(5)

  

Dialogs with Lines
------------------

Let's join the two DataFrames on `lineID` next.

In [None]:
val convLines = conversations.join(lines, "lineID").sort($"conversationID", $"intraConversationID")

In [None]:
convLines.count

In [None]:
conversations.count

In [None]:
display(convLines)

  

Let's amalgamate the texts uttered in the same conversation together.

By doing this we lose all the information in the order of utterance.

But this is fine as we are going to do LDA with just the *first-order
information of words uttered in each conversation* by anyone involved in
the dialogue.

In [None]:
import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}

val corpusDF = convLines.groupBy($"conversationID",$"movieID")
  .agg(concat_ws(" :-()-: ",collect_list($"text")).alias("corpus"))
  .join(moviesMetaData, "movieID") // and join it to get movie meta data
  .select($"conversationID".as("id"),$"corpus",$"movieTitle",$"movieYear")
  .cache()
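
As a rough illustration of the amalgamation, the following plain-Scala sketch groups utterances by conversation and joins their texts with the same separator string. The `Utterance` type and the function name are hypothetical. Unlike the cell above, this sketch sorts by position before joining, which is a choice made here only to keep the output deterministic.

```scala
object AmalgamateSketch {
  // Hypothetical row type standing in for one row of convLines.
  final case class Utterance(conversationID: Long, intraConversationID: Int, text: String)

  // One document per conversation: all texts joined by the separator.
  def corpusPerConversation(us: Seq[Utterance], sep: String = " :-()-: "): Map[Long, String] =
    us.groupBy(_.conversationID).map { case (id, group) =>
      (id, group.sortBy(_.intraConversationID).map(_.text).mkString(sep))
    }
}
```

Since the separator consists only of non-word characters, a tokenizer that splits on non-word characters will discard it later.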

In [None]:
corpusDF.count()

In [None]:
corpusDF.take(5).foreach(println)

In [None]:
display(corpusDF)

  

Feature extraction and transformation APIs
------------------------------------------

We will use the convenient [Feature extraction and transformation
APIs](http://spark.apache.org/docs/latest/ml-features.html).

Step 3. Text Tokenization
-------------------------

We will use the RegexTokenizer to split each document into tokens. We
can setMinTokenLength() here to indicate a minimum token length, and
filter away all tokens that fall below the minimum. See:

-   <http://spark.apache.org/docs/latest/ml-features.html#tokenizer>.

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+") // split on runs of non-word characters and underscores (not only white space)
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("corpus") // name of the input column
.setOutputCol("tokens") // name of the output column

// Tokenize document
val tokenized_df = tokenizer.transform(corpusDF)

In [None]:
display(tokenized_df.sample(false,0.001,1234L)) 

In [None]:
display(tokenized_df.sample(false,0.001,123L).select("tokens"))
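
To get a feel for what this tokenizer setup does to a single document, here is a minimal plain-Scala sketch. It assumes the tokenizer lowercases its input and treats the pattern as a delimiter, which we believe are the `RegexTokenizer` defaults. `TokenizeSketch` is a hypothetical name and the sample sentence in the usage note is made up.

```scala
object TokenizeSketch {
  // Lowercase, split on runs of non-word characters or underscores,
  // then drop tokens shorter than minTokenLength.
  def tokenize(text: String, minTokenLength: Int = 4): Seq[String] =
    text.toLowerCase
      .split("[\\W_]+")
      .filter(_.length >= minTokenLength)
      .toSeq
}
```

For example, `TokenizeSketch.tokenize("Not the hacking and gagging. :-()-: Okay_then, how 'bout we try?")` keeps only `hacking`, `gagging`, `okay`, `then` and `bout`: the separator ` :-()-: ` disappears and every token with fewer than 4 characters is dropped.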

  

Step 4. Remove Stopwords
------------------------

We can easily remove stopwords using the StopWordsRemover(). See:

-   <http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover>.

If a list of stopwords is not provided, the StopWordsRemover() will use
[this list of
stopwords](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words),
also shown below, by default.

`a,about,above,across,after,afterwards,again,against,all,almost,alone,along,already,also,although,always,am,among,amongst,amoungst,amount,an,and,another,any,anyhow,anyone,anything,anyway,anywhere, are,around,as,at,back,be,became,because,become,becomes,becoming,been,before,beforehand,behind,being,below,beside,besides,between,beyond,bill,both,bottom,but,by,call,can,cannot,cant,co,computer,con,could, couldnt,cry,de,describe,detail,do,done,down,due,during,each,eg,eight,either,eleven,else,elsewhere,empty,enough,etc,even,ever,every,everyone,everything,everywhere,except,few,fifteen,fify,fill,find,fire,first, five,for,former,formerly,forty,found,four,from,front,full,further,get,give,go,had,has,hasnt,have,he,hence,her,here,hereafter,hereby,herein,hereupon,hers,herself,him,himself,his,how,however,hundred,i,ie,if, in,inc,indeed,interest,into,is,it,its,itself,keep,last,latter,latterly,least,less,ltd,made,many,may,me,meanwhile,might,mill,mine,more,moreover,most,mostly,move,much,must,my,myself,name,namely,neither,never, nevertheless,next,nine,no,nobody,none,noone,nor,not,nothing,now,nowhere,of,off,often,on,once,one,only,onto,or,other,others,otherwise,our,ours,ourselves,out,over,own,part,per,perhaps,please,put,rather,re,same, see,seem,seemed,seeming,seems,serious,several,she,should,show,side,since,sincere,six,sixty,so,some,somehow,someone,something,sometime,sometimes,somewhere,still,such,system,take,ten,than,that,the,their,them, themselves,then,thence,there,thereafter,thereby,therefore,therein,thereupon,these,they,thick,thin,third,this,those,though,three,through,throughout,thru,thus,to,together,too,top,toward,towards,twelve,twenty,two, un,under,until,up,upon,us,very,via,was,we,well,were,what,whatever,when,whence,whenever,where,whereafter,whereas,whereby,wherein,whereupon,wherever,whether,which,while,whither,who,whoever,whole,whom,whose,why,will, with,within,without,would,yet,you,your,yours,yourself,yourselves`

You can use `getStopWords()` to see the list of stopwords that will be
used.

In this example, we will specify a list of stopwords for the
StopWordsRemover() to use. We do this so that we can add on to the list
later on.

In [None]:
display(dbutils.fs.ls("dbfs:/tmp/stopwords")) // check if the file already exists from earlier wget and dbfs-load

  

If the file `dbfs:/tmp/stopwords` already exists then skip the next two
cells, otherwise download and load it into DBFS by uncommenting and
evaluating the next two cells.

In [None]:
wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords # uncomment '//' at the beginning and repeat only if needed again

In [None]:
cp file:/tmp/stopwords dbfs:/tmp/stopwords 

In [None]:
// List of stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()

In [None]:
stopwords.length // find the number of stopwords in the scala Array[String]

  

Finally, we can just remove the stopwords using the `StopWordsRemover`
as follows:

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set params for StopWordsRemover
val remover = new StopWordsRemover()
.setStopWords(stopwords) // This parameter is optional
.setInputCol("tokens")
.setOutputCol("filtered")

// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)
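
Conceptually, stopword removal is just a filter over each token array. A minimal plain-Scala sketch, assuming case-insensitive matching (which we believe is the default behaviour of `StopWordsRemover`), with a hypothetical helper name:

```scala
object StopwordSketch {
  // Keep only the tokens that are not in the stopword set, ignoring case.
  def removeStopwords(tokens: Seq[String], stopwords: Set[String]): Seq[String] = {
    val lowered = stopwords.map(_.toLowerCase)
    tokens.filterNot(t => lowered.contains(t.toLowerCase))
  }
}
```

Using a `Set` makes each lookup cheap, which matters when the same list is applied to every token of every conversation.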

  

Step 5. Vector of Token Counts
------------------------------

LDA takes in a vector of token counts as input. We can use the
`CountVectorizer()` to easily convert our text documents into vectors of
token counts.

The `CountVectorizer` produces sparse vectors, which print as
`(VocabSize, Array(Indexed Tokens), Array(Token Frequency))`.

Two handy parameters to note:

-   `setMinDF`: Specifies the minimum number of different documents a
    term must appear in to be included in the vocabulary.
-   `setMinTF`: Specifies the minimum number of times a term has to
    appear in a document to be counted in that document's vector. It
    is applied when transforming and does not affect the vocabulary.

See:

-   <http://spark.apache.org/docs/latest/ml-features.html#countvectorizer>.

In [None]:
import org.apache.spark.ml.feature.CountVectorizer

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(10000) 
.setMinDF(5) // the minimum number of different documents a term must appear in to be included in the vocabulary.
.fit(filtered_df)

In [None]:
// Create vector of token counts
val countVectors = vectorizer.transform(filtered_df).select("id", "features")

In [None]:
// see the first countVectors
countVectors.take(1)
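
The sparse format `(VocabSize, Array(Indexed Tokens), Array(Token Frequency))` and the effect of `minDF` can be illustrated with a small plain-Scala sketch. This is not Spark's implementation: the names are hypothetical, and breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic.

```scala
object CountVectorSketch {
  // Vocabulary: terms found in at least minDF documents, most frequent first.
  def buildVocabulary(docs: Seq[Seq[String]], minDF: Int, vocabSize: Int): Vector[String] = {
    // Document frequency: number of documents containing the term.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }
    // Corpus-wide term frequency, used to rank the surviving terms.
    val termFreq: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (t, occ) => (t, occ.size) }
    termFreq.toVector
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) } // ties broken alphabetically (assumption)
      .take(vocabSize)
      .map(_._1)
  }

  // Sparse count vector: (vocabulary size, sorted term indices, counts).
  def vectorize(doc: Seq[String], vocab: Vector[String]): (Int, Vector[Int], Vector[Double]) = {
    val index = vocab.zipWithIndex.toMap
    val counts = doc.flatMap(t => index.get(t)) // out-of-vocabulary tokens are dropped
      .groupBy(identity).map { case (i, occ) => (i, occ.size.toDouble) }
    val sorted = counts.toVector.sortBy(_._1)
    (vocab.size, sorted.map(_._1), sorted.map(_._2))
  }
}
```

With three toy documents `Seq("love","love","movie")`, `Seq("love","war")` and `Seq("war","movie","peace")` and `minDF = 2`, the term `peace` is left out of the vocabulary because it occurs in only one document.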

  

To use the LDA algorithm in the MLlib library, we have to convert the
DataFrame back into an RDD.

In [None]:
// Convert DF to RDD - ideally we should use ml for everything and not a mix of ml and mllib ; DAN
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}

val lda_countVector = countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd


In [None]:
// format: Array(id, (VocabSize, Array(indexedTokens), Array(Token Frequency)))
lda_countVector.take(1)

  

Let's get an overview of LDA in Spark's MLLIB
---------------------------------------------

See:

-   <http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>.

Step 6. Create LDA model with Online Variational Bayes
------------------------------------------------------

We will now set the parameters for LDA. We will use the
OnlineLDAOptimizer() here, which implements Online Variational Bayes.

Choosing the number of topics for your LDA model requires a bit of
domain knowledge. As we do not know the number of "topics", we will set
numTopics to be 20.

In [None]:
val numTopics = 20

  

We will set the parameters needed to build our LDA model. We can also
setMiniBatchFraction for the OnlineLDAOptimizer, which sets the fraction
of corpus sampled and used at each iteration. In this example, we will
set this to 0.8.

In [None]:
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Set LDA params
val lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
.setK(numTopics)
.setMaxIterations(3)
.setDocConcentration(-1) // use default values
.setTopicConcentration(-1) // use default values

  

Create the LDA model with Online Variational Bayes.

In [None]:
val ldaModel = lda.run(lda_countVector)

  

Watch **Online Learning for Latent Dirichlet Allocation** in NIPS2010 by
Matt Hoffman (right click and open in new tab)

[![Matt Hoffman's NIPS 2010 Talk Online
LDA](http://videolectures.net/nips2010_hoffman_oll/thumb.jpg)](http://videolectures.net/nips2010_hoffman_oll/)

Also see the paper on *Online Variational Bayes* by Matt Hoffman, linked
from the above URL, for more details:
<http://videolectures.net/site/normal_dl/tag=83534/nips2010_1291.pdf>

Note that using the OnlineLDAOptimizer returns us a
[LocalLDAModel](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LocalLDAModel),
which stores the inferred topics of your corpus.

Step 7. Review Topics
---------------------

We can now review the results of our LDA model. We will print out all 20
topics with their corresponding term probabilities.

Note that you will get slightly different results every time you run an
LDA model since LDA includes some randomization.

Let us review results of LDA model with Online Variational Bayes, step
by step.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)

In [None]:
val vocabList = vectorizer.vocabulary

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

  

Feel free to take things apart to understand!

In [None]:
topicIndices(0)

In [None]:
topicIndices(0)._1

In [None]:
topicIndices(0)._1(0)

In [None]:
vocabList(topicIndices(0)._1(0))
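
The lookup from term indices to words can also be tried on toy data without a fitted model. The sketch below assumes the same shape that `describeTopics` returns, one `(term indices, term weights)` pair per topic. The object name and the toy values are hypothetical.

```scala
object TopicLookupSketch {
  // Replace each term index by its vocabulary word and pair it with its weight.
  def topicsAsWords(
      topicIndices: Array[(Array[Int], Array[Double])],
      vocab: Array[String]): Array[Array[(String, Double)]] =
    topicIndices.map { case (terms, weights) =>
      terms.map(i => vocab(i)).zip(weights)
    }
}
```

For a single toy topic `(Array(2, 0), Array(0.6, 0.4))` and vocabulary `Array("love", "war", "money")`, the result pairs `money` with 0.6 and `love` with 0.4.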

  

Review Results of LDA model with Online Variational Bayes - Doing all
four steps earlier at once.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

Going through the results, you may notice that some of the topic words
returned are actually stopwords that are specific to our dataset (e.g.
"writes", "article", ...). Let's try improving our model.

Step 8. Model Tuning - Refilter Stopwords
-----------------------------------------

We will try to improve the results of our model by identifying some
stopwords that are specific to our dataset. We will filter these
stopwords out and rerun our LDA model to see if we get better results.

In [None]:
val add_stopwords = Array("whatever") // add  more stop-words like the name of your company!

In [None]:
// Combine newly identified stopwords with our existing list of stopwords
val new_stopwords = stopwords.union(add_stopwords)

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set Params for StopWordsRemover with new_stopwords
val remover = new StopWordsRemover()
.setStopWords(new_stopwords)
.setInputCol("tokens")
.setOutputCol("filtered")

// Create new df with new list of stopwords removed
val new_filtered_df = remover.transform(tokenized_df)

In [None]:
// Set Params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(10000)
.setMinDF(5)
.fit(new_filtered_df)

// Create new df of countVectors
val new_countVectors = vectorizer.transform(new_filtered_df).select("id", "features")

In [None]:
// Convert DF to RDD
val new_lda_countVector = new_countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd

  

We will also increase MaxIterations to 10 to see if we get better
results.

In [None]:
// Set LDA parameters
val new_lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
.setK(numTopics)
.setMaxIterations(10)
.setDocConcentration(-1) // use default values
.setTopicConcentration(-1) // use default values

  

#### How to find what the default values are?

Dive into the source!!!

1.  Let's find the default value for `docConcentration` now.
2.  Go to the Apache Spark package root:
    <https://spark.apache.org/docs/latest/api/scala/#package>
3.  Search for 'ml' in the search box on the top left (ml is for the ml
    library).
4.  Then find `LDA` by scrolling down on the left to mllib's
    `clustering` methods and click on `LDA`.
5.  Then click on the source code link, which should take you here:
    -   <https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala>
    -   Now, simply go to the right function and see the following
        comment block:

    ```
    /**
     * Concentration parameter (commonly named "alpha") for the prior placed on documents'
     * distributions over topics ("theta").
     *
     * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
     * (more regularization).
     *
     * If not set by the user, then docConcentration is set automatically. If set to
     * singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
     * Otherwise, the [[docConcentration]] vector must be length k.
     * (default = automatic)
     *
     * Optimizer-specific parameter settings:
     *  - EM
     *     - Currently only supports symmetric distributions, so all values in the vector should be
     *       the same.
     *     - Values should be > 1.0
     *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
     *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
     *  - Online
     *     - Values should be >= 0
     *     - default = uniformly (1.0 / k), following the implementation from
     *       [[https://github.com/Blei-Lab/onlineldavb]].
     * @group param
     */
    ```
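
The two default formulas quoted in that comment block can be evaluated directly. The following is only a small arithmetic sketch of those formulas with hypothetical helper names. It does not call Spark.

```scala
object DocConcentrationDefaultsSketch {
  // EM optimizer: uniformly (50 / k) + 1, as stated in the quoted comment block.
  def emDefault(k: Int): Double = 50.0 / k + 1.0

  // Online optimizer: uniformly 1.0 / k, as stated in the quoted comment block.
  def onlineDefault(k: Int): Double = 1.0 / k
}
```

With `numTopics = 20` as in this notebook, the EM default works out to 50/20 + 1 = 3.5 and the online default to 1/20 = 0.05.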

**HOMEWORK:** Try to find the default value for `TopicConcentration`.

In [None]:
// Create LDA model with stopwords refiltered
val new_ldaModel = new_lda.run(new_lda_countVector)

In [None]:
val topicIndices = new_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

Step 9. Create LDA model with Expectation Maximization
------------------------------------------------------

Let's try creating an LDA model with Expectation Maximization on the
data that has been refiltered for additional stopwords. We will also
increase MaxIterations here to 100 to see if that improves results. See:

-   <http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>.

In [None]:
import org.apache.spark.mllib.clustering.EMLDAOptimizer

// Set LDA parameters
val em_lda = new LDA()
  .setOptimizer(new EMLDAOptimizer())
  .setK(numTopics)
  .setMaxIterations(100)
  .setDocConcentration(-1)   // -1 means: use the default value
  .setTopicConcentration(-1) // -1 means: use the default value

In [None]:
val em_ldaModel = em_lda.run(new_lda_countVector) // takes a long time (about 22 minutes)

In [None]:
import org.apache.spark.mllib.clustering.DistributedLDAModel

// The EM optimizer returns a DistributedLDAModel, so this cast is safe here
val em_DldaModel = em_ldaModel.asInstanceOf[DistributedLDAModel]

In [None]:
val top10ConversationsPerTopic = em_DldaModel.topDocumentsPerTopic(10)

In [None]:
top10ConversationsPerTopic.length // number of topics

In [None]:
//em_DldaModel.topicDistributions.take(10).foreach(println)

  

Note that the EMLDAOptimizer produces a DistributedLDAModel, which
stores not only the inferred topics but also the full training corpus
and topic distributions for each document in the training corpus.
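
To see how the top documents per topic relate to the per-document topic distributions, here is a small Spark-free sketch on toy data. The function `topDocsPerTopic` and the toy weights are hypothetical; the sketch only imitates the shape of the result (one pair of document IDs and weights per topic) and is not how Spark computes it internally.

```scala
// Toy per-document topic distributions: (documentId, weights over 3 topics).
// The values are made up for illustration.
val toyDocTopics: Seq[(Long, Array[Double])] = Seq(
  (0L, Array(0.70, 0.20, 0.10)),
  (1L, Array(0.10, 0.80, 0.10)),
  (2L, Array(0.50, 0.10, 0.40)),
  (3L, Array(0.15, 0.25, 0.60))
)

// For each topic, keep the n documents with the largest weight on that topic.
def topDocsPerTopic(
    dists: Seq[(Long, Array[Double])],
    n: Int): Array[(Array[Long], Array[Double])] = {
  require(dists.nonEmpty, "need at least one document")
  val k = dists.head._2.length // number of topics
  Array.tabulate(k) { topic =>
    val top = dists
      .map { case (id, weights) => (id, weights(topic)) }
      .sortBy { case (_, weight) => -weight } // descending by weight
      .take(n)
    (top.map(_._1).toArray, top.map(_._2).toArray)
  }
}

val toyTop2 = topDocsPerTopic(toyDocTopics, 2)
```

Here `toyTop2(t)._1` holds the IDs of the two documents that weigh most heavily on topic `t`, which is the same way `top10ConversationsPerTopic(2)._1` is used further down.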

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)

In [None]:
val vocabList = vectorizer.vocabulary

In [None]:
vocabList.size

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

In [None]:
vocabList(47) // 47 is the index of the term 'university' or the first term in topics - this may change due to randomness in algorithm

  

The next cell repeats the steps above all at once and prints the top
terms of each topic.

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}
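
The index-to-word mapping used above can be tried out without running LDA. The sketch below uses toy stand-ins, all hypothetical, for `vectorizer.vocabulary` and for the `describeTopics` output, which is an array of (term indices, term weights) pairs.

```scala
// Toy vocabulary: position i holds the word for term index i (made-up words).
val toyVocab: Array[String] = Array("love", "money", "ship", "doctor")

// Toy describeTopics-style output: one (termIndices, termWeights) pair per topic.
val toyTopicIndices: Array[(Array[Int], Array[Double])] = Array(
  (Array(2, 0), Array(0.30, 0.10)),
  (Array(1, 3), Array(0.25, 0.20))
)

// Replace each term index by its word and pair it with its weight.
val toyTopics: Array[Array[(String, Double)]] =
  toyTopicIndices.map { case (terms, termWeights) =>
    terms.map(toyVocab(_)).zip(termWeights)
  }

toyTopics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
}
```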

In [None]:
top10ConversationsPerTopic(2)

In [None]:
top10ConversationsPerTopic(2)._1

In [None]:
val scenesForTopic2 = sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id")

In [None]:
display(scenesForTopic2.join(corpusDF,"id"))

In [None]:
sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id").join(corpusDF,"id").show(10,false)

In [None]:
sc.parallelize(top10ConversationsPerTopic(5)._1).toDF("id").join(corpusDF,"id").show(10,false)

In [None]:
corpusDF.show(5)

  

We've managed to get some good results here. For example, we can easily
infer that Topic 2 is about space, Topic 3 is about Israel, etc.

We still get some ambiguous results, such as Topic 0.

To improve our results further, we could try some of the methods
below:

-   Refilter data for additional data-specific stopwords
-   Use Stemming or Lemmatization to preprocess data
-   Experiment with a smaller number of topics, since some of the
    topics found are pretty similar
-   Increase the model's MaxIterations
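
As a minimal sketch of the first suggestion, refiltering for data-specific stopwords amounts to dropping extra tokens before counting words. The stopword set and the helper `refilterTokens` below are hypothetical and use plain Scala collections rather than the notebook's DataFrame pipeline.

```scala
// Hypothetical data-specific stopwords; in practice, pick them by inspecting
// the top terms of topics that look vague.
val extraStopwords: Set[String] = Set("yeah", "okay", "gonna")

// Drop every token whose lower-cased form is in the stopword set.
def refilterTokens(tokens: Seq[String], stopwords: Set[String]): Seq[String] =
  tokens.filterNot(token => stopwords.contains(token.toLowerCase))

val cleanedTokens = refilterTokens(Seq("Yeah", "the", "ship", "gonna", "sink"), extraStopwords)
```

The same idea applies to the other suggestions: each one changes the input or the settings of the model, after which the model has to be fitted again.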

Visualize Results
-----------------

We will try visualizing the results obtained from the EM LDA model with
a d3 bubble chart.

In [None]:
// Zip topic terms with topic IDs
val termArray = topics.zipWithIndex

In [None]:
// Transform data into the form (term, probability, topicId)
val termRDD = sc.parallelize(termArray)
val termRDD2 = termRDD.flatMap { case (termsWithWeights, topicId) =>
  // Attach the topic ID to every (term, weight) pair of that topic
  termsWithWeights.map { case (term, weight) => (term, weight, topicId) }
}

In [None]:
// Create DF with proper column names
val termDF = termRDD2.toDF("term", "probability", "topicId")

In [None]:
display(termDF)

  

We will convert the DataFrame into a JSON format, which will be passed
into d3.

In [None]:
// Create JSON data
val rawJson = termDF.toJSON.collect().mkString(",\n")
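
To make the shape of `rawJson` concrete, here is a Spark-free sketch that builds the same kind of string by hand from toy rows. The case class `TermRow`, the helper `toJsonLine` and the toy values are hypothetical, and the exact number formatting produced by `toJSON` may differ. Terms are assumed to contain no characters that need JSON escaping.

```scala
// Hypothetical row type matching the columns of termDF.
case class TermRow(term: String, probability: Double, topicId: Int)

// Toy rows (made-up values).
val toyRows = Seq(TermRow("love", 0.05, 0), TermRow("money", 0.03, 1))

// One JSON object per row; no escaping is done, so terms must be plain words.
def toJsonLine(r: TermRow): String =
  s"""{"term":"${r.term}","probability":${r.probability},"topicId":${r.topicId}}"""

// Joined the same way as rawJson: objects separated by a comma and a newline,
// ready to be placed inside a JSON array.
val toyJson = toyRows.map(toJsonLine).mkString(",\n")
println(toyJson)
```

Because the objects are only joined by commas, the string is not valid JSON on its own; it becomes valid once it is placed between `[` and `]`, which is how the D3 cell uses it.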

  

We are now ready to use D3 on the rawJson data.

In [None]:
displayHTML(s"""
<!DOCTYPE html>
<meta charset="utf-8">
<style>

circle {
  fill: rgb(31, 119, 180);
  fill-opacity: 0.5;
  stroke: rgb(31, 119, 180);
  stroke-width: 1px;
}

.leaf circle {
  fill: #ff7f0e;
  fill-opacity: 1;
}

text {
  font: 14px sans-serif;
}

</style>
<body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
<script>

var json = {
 "name": "data",
 "children": [
  {
     "name": "topics",
     "children": [
      ${rawJson}
     ]
    }
   ]
};

var r = 1500,
    format = d3.format(",d"),
    fill = d3.scale.category20c();

var bubble = d3.layout.pack()
    .sort(null)
    .size([r, r])
    .padding(1.5);

var vis = d3.select("body").append("svg")
    .attr("width", r)
    .attr("height", r)
    .attr("class", "bubble");

  
var node = vis.selectAll("g.node")
    .data(bubble.nodes(classes(json))
    .filter(function(d) { return !d.children; }))
    .enter().append("g")
    .attr("class", "node")
    .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

var color = d3.scale.category20();
  
  node.append("title")
      .text(function(d) { return d.className + ": " + format(d.value); });

  node.append("circle")
      .attr("r", function(d) { return d.r; })
      .style("fill", function(d) {return color(d.topicName);});

var text = node.append("text")
    .attr("text-anchor", "middle")
    .attr("dy", ".3em")
    .text(function(d) { return d.className.substring(0, d.r / 3)});
  
  text.append("tspan")
      .attr("dy", "1.2em")
      .attr("x", 0)
      .text(function(d) {return Math.ceil(d.value * 10000) /10000; });

// Returns a flattened hierarchy containing all leaf nodes under the root.
function classes(root) {
  var classes = [];

  function recurse(term, node) {
    if (node.children) node.children.forEach(function(child) { recurse(node.term, child); });
    else classes.push({topicName: node.topicId, className: node.term, value: node.probability});
  }

  recurse(null, root);
  return {children: classes};
}
</script>
""")

  

Step 1. Downloading and Loading Data into DBFS
----------------------------------------------

Here are the steps taken for downloading and saving data to the
distributed file system. Uncomment them for repeating this process on
your databricks cluster or for downloading a new source of data.

Unfortunately, the original data at:

-   <http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip>

cannot easily be loaded into dbfs as it is. So the data has been
downloaded, its directory renamed without white spaces, superfluous
OS-specific files removed, `dos2unix`'d, `tar -zcvf`'d and uploaded to
the following URL for an easily dbfs-loadable download:

-   <http://lamastex.org/datasets/public/nlp/cornell_movie_dialogs_corpus.tgz>

In [None]:
wget http://lamastex.org/datasets/public/nlp/cornell_movie_dialogs_corpus.tgz

  

Untar the file.

In [None]:
tar zxvf cornell_movie_dialogs_corpus.tgz

  

Let us list the files and load them all into dbfs, after using
`dbutils.fs.mkdirs(...)` to create the directory
`dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/`.

In [None]:
pwd && ls -al cornell_movie_dialogs_corpus

In [None]:
dbutils.fs.rm("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/",true)

In [None]:
dbutils.fs.mkdirs("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")

In [None]:

// Copy each file from the driver's local disk into dbfs
val srcDir = "file:///databricks/driver/cornell_movie_dialogs_corpus"
val dstDir = "dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus"
Seq(
  "movie_characters_metadata.txt",
  "movie_conversations.txt",
  "movie_lines.txt",
  "movie_titles_metadata.txt",
  "raw_script_urls.txt",
  "README.txt"
).foreach(f => dbutils.fs.cp(s"$srcDir/$f", s"$dstDir/$f"))


In [None]:
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/"))