[ScaDaMaLe, Scalable Data Science and Distributed Machine Learning](https://lamastex.github.io/scalable-data-science/sds/3/x/)
==============================================================================================================================

Topic Modeling of Movie Dialogs with Latent Dirichlet Allocation
================================================================

### Let us cluster the conversations from different movies!

This notebook will provide a brief algorithm summary, links for further
reading, and an example of how to use LDA for Topic Modeling.

**not tested in Spark 2.2+ yet (see 034 notebook for syntactic issues,
if any)**

Algorithm Summary
-----------------

-   **Task**: Identify topics from a collection of text documents
-   **Input**: Vectors of word counts
-   **Optimizers**:
    -   EMLDAOptimizer using [Expectation
        Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    -   OnlineLDAOptimizer using Iterative Mini-Batch Sampling for
        [Online Variational
        Bayes](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
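
As a rough illustration of what "vectors of word counts" means, the sketch below turns two tiny tokenized documents into count vectors over a shared vocabulary. It uses plain Scala collections only. The names `docs`, `vocabulary`, and `countVector` and the toy data are assumptions made for this illustration and are not part of the Spark API.

```scala
// Two toy documents, already tokenized (hypothetical data for illustration).
val docs: Seq[Seq[String]] = Seq(
  Seq("money", "jail", "money"),
  Seq("jail", "letter")
)

// Vocabulary: every distinct token, sorted so that the column order is deterministic.
val vocabulary: Vector[String] = docs.flatten.distinct.sorted.toVector

// Map one document to a dense vector of counts aligned with `vocabulary`.
def countVector(tokens: Seq[String]): Vector[Int] = {
  val counts = tokens.groupBy(identity).map { case (t, ts) => (t, ts.size) }
  vocabulary.map(t => counts.getOrElse(t, 0))
}

val vectors: Seq[Vector[Int]] = docs.map(countVector)
vectors.foreach(println)
```

Each position in a vector corresponds to one vocabulary word, so word order inside a document is discarded and only the counts remain.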

Links
-----

-   Spark API docs
    -   Scala:
        [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
    -   Python:
        [LDA](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA)
-   [MLlib Programming
    Guide](http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
-   [ML Feature Extractors &
    Transformers](http://spark.apache.org/docs/latest/ml-features.html)
-   [Wikipedia: Latent Dirichlet
    Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Readings for LDA
----------------

-   A high-level introduction to the topic from Communications of the
    ACM
    -   <http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf>
-   A very good high-level humanities introduction to the topic
    (recommended by Chris Thomson in English Department at UC, Ilam):
    -   <http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/>

Also read the methodological and more formal papers cited in the above
links if you want to know more.

Let's get a bird's eye view of LDA from
http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf next.

-   See pictures (hopefully you read the paper last night!)
-   Algorithm of the generative model (this is unsupervised clustering)
-   For a careful introduction to the topic see Section 27.3 and 27.4
    (pages 950-970) of Murphy's *Machine Learning: A Probabilistic
    Perspective, MIT Press, 2012*.
-   We will be quite application focussed or applied here!
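
For reference, the generative model mentioned above can be sketched as follows. The symbols are the usual ones from the LDA literature and are assumptions of this summary rather than notation defined in this notebook: $K$ is the number of topics, $D$ the number of documents, $N_d$ the number of words in document $d$, and $\alpha$, $\eta$ are Dirichlet hyperparameters.

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta_{1:K} &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{aligned}

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})
```

Only the words $w_{d,n}$ are observed. The topics $\beta_k$, the per-document topic proportions $\theta_d$, and the per-word topic assignments $z_{d,n}$ are latent and have to be inferred.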

  

  

  

Probabilistic Topic Modeling Example
------------------------------------

This is an outline of our Topic Modeling workflow. Feel free to jump to
any subtopic to find out more.

-   Step 0. Dataset Review
-   Step 1. Downloading and Loading Data into DBFS
    -   (Step 1. only needs to be done once per shard - see details at
        the end of the notebook for Step 1.)
-   Step 2. Loading the Data and Data Cleaning
-   Step 3. Text Tokenization
-   Step 4. Remove Stopwords
-   Step 5. Vector of Token Counts
-   Step 6. Create LDA model with Online Variational Bayes
-   Step 7. Review Topics
-   Step 8. Model Tuning - Refilter Stopwords
-   Step 9. Create LDA model with Expectation Maximization
-   Step 10. Visualize Results

Step 0. Dataset Review
----------------------

In this example, we will use the [Cornell Movie Dialogs
Corpus](https://people.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html).

Here is the `README.txt`:

------------------------------------------------------------------------

------------------------------------------------------------------------

Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding
coordination of linguistic style in dialogs" Cristian
Danescu-Niculescu-Mizil and Lillian Lee Proceedings of the Workshop on
Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email
to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to
our list of people using this data. Thanks!

Contents of this README:

        A) Brief description
        B) Files description
        C) Details on the collection procedure
        D) Contact

A\) Brief description:

This corpus contains a metadata-rich collection of fictional
conversations extracted from raw movie scripts:

-   220,579 conversational exchanges between 10,292 pairs of movie
    characters
-   involves 9,035 characters from 617 movies
-   in total 304,713 utterances
-   movie metadata included:
    -   genres
    -   release year
    -   IMDB rating
    -   number of IMDB votes
-   character metadata included:
    -   gender (for 3,774 characters)
    -   position on movie credits (3,321 characters)

B\) Files description:

In all files the field separator is " +++$+++ "

-   movie\_titles\_metadata.txt
    -   contains information about each movie title
    -   fields:
        -   movieID
        -   movie title
        -   movie year
        -   IMDB rating
        -   no. IMDB votes
        -   genres in the format \['genre1','genre2',...,'genreN'\]

-   movie\_characters\_metadata.txt
    -   contains information about each movie character
    -   fields:
        -   characterID
        -   character name
        -   movieID
        -   movie title
        -   gender ("?" for unlabeled cases)
        -   position in credits ("?" for unlabeled cases)

-   movie\_lines.txt
    -   contains the actual text of each utterance
    -   fields:
        -   lineID
        -   characterID (who uttered this phrase)
        -   movieID
        -   character name
        -   text of the utterance

-   movie\_conversations.txt
    -   the structure of the conversations
    -   fields:
        -   characterID of the first character involved in the
            conversation
        -   characterID of the second character involved in the
            conversation
        -   movieID of the movie in which the conversation occurred
        -   list of the utterances that make the conversation, in
            chronological order: \['lineID1','lineID2',...,'lineIDN'\]
            has to be matched with movie\_lines.txt to reconstruct the
            actual content

-   raw\_script\_urls.txt
    -   the urls from which the raw sources were retrieved

C\) Details on the collection procedure:

We started from raw publicly available movie scripts (sources
acknowledged in raw\_script\_urls.txt). In order to collect the metadata
necessary for this study and to distinguish between two script versions
of the same movie, we automatically matched each script with an entry in
movie database provided by IMDB (The Internet Movie Database; data
interfaces available at http://www.imdb.com/interfaces). Some amount of
manual correction was also involved. When more than one movie with the
same title was found in IMBD, the match was made with the most popular
title (the one that received most IMDB votes)

After discarding all movies that could not be matched or that had less
than 5 IMDB votes, we were left with 617 unique titles with metadata
including genre, release year, IMDB rating and no. of IMDB votes and
cast distribution. We then identified the pairs of characters that
interact and separated their conversations automatically using simple
data processing heuristics. After discarding all pairs that exchanged
less than 5 conversational exchanges there were 10,292 left, exchanging
220,579 conversational exchanges (304,713 utterances). After
automatically matching the names of the 9,035 involved characters to the
list of cast distribution, we used the gender of each interpreting actor
to infer the fictional gender of a subset of 3,321 movie characters (we
raised the number of gendered 3,774 characters through manual
annotation). Similarly, we collected the end credit position of a subset
of 3,321 characters as a proxy for their status.

D\) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian
Danescu-Niculescu-Mizil)

------------------------------------------------------------------------

------------------------------------------------------------------------
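
The README states that every file uses the field separator `" +++$+++ "`. A minimal sketch of splitting one line on that literal separator is shown here, assuming a sample line in the format of `movie_conversations.txt`. The names `separator`, `sampleLine`, and `fields` are illustrative only. The notebook itself parses the files with regular expressions instead.

```scala
import java.util.regex.Pattern

// The literal separator from the README; Pattern.quote escapes '+' and '$'
// so that String.split does not treat them as regex metacharacters.
val separator: String = Pattern.quote(" +++$+++ ")

// One line in the format of movie_conversations.txt.
val sampleLine = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

// limit = -1 keeps trailing empty fields instead of silently dropping them.
val fields: Array[String] = sampleLine.split(separator, -1)
fields.foreach(println)
```

A plain split is shorter than a regular expression, but it does not validate the shape of each field, which is one reason to prefer the regex-based parsing used in the cells of Step 2.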

Step 2. Loading the Data and Data Cleaning
------------------------------------------

We have already used the wget command to download the file, and put it
in our distributed file system (this process takes about 1 minute). To
repeat these steps or to download data from another source follow the
steps at the bottom of this worksheet on **Step 1. Downloading and
Loading Data into DBFS**.

Let's make sure these files are in dbfs now:

In [None]:
// this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")) 

  

[TABLE]

  

Conversations Data
------------------

In [None]:
sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").top(5).foreach(println)

  

>     u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L227588', 'L227589', 'L227590', 'L227591', 'L227592', 'L227593', 'L227594', 'L227595', 'L227596']
>     u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228159', 'L228160']
>     u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228157', 'L228158']
>     u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228130', 'L228131']
>     u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228127', 'L228128', 'L228129']

In [None]:
// Load the text file and pair each line with its zero-based line index
val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt").zipWithIndex()

  

>     conversationsRaw: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[3709] at zipWithIndex at command-753740454082219:2

  

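Review five lines to get a sense of the data format. Note that `top(5)`
returns the five largest elements under the default ordering, not the
first five lines of the file.

In [None]:
conversationsRaw.top(5).foreach(println) // the five largest (line, index) pairs under the default ordering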

  

>     (u999 +++$+++ u1006 +++$+++ m65 +++$+++ ['L227588', 'L227589', 'L227590', 'L227591', 'L227592', 'L227593', 'L227594', 'L227595', 'L227596'],8954)
>     (u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228159', 'L228160'],8952)
>     (u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228157', 'L228158'],8951)
>     (u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228130', 'L228131'],8950)
>     (u998 +++$+++ u1005 +++$+++ m65 +++$+++ ['L228127', 'L228128', 'L228129'],8949)
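
The output above starts at `u999` because `top` orders the elements before returning them. The difference between taking the first elements and taking the largest ones can be sketched with a local collection. The sample lines are hypothetical and the sorted `Seq` is only an analogy for the RDD methods `take` and `top`.

```scala
// A few lines in file order (hypothetical sample in the corpus format).
val sample: Seq[String] = Seq(
  "u0 +++$+++ u2 +++$+++ m0",
  "u998 +++$+++ u1005 +++$+++ m65",
  "u999 +++$+++ u1006 +++$+++ m65",
  "u1 +++$+++ u3 +++$+++ m0"
)

// Analogue of RDD.take(2): the first two elements in their existing order.
val firstTwo: Seq[String] = sample.take(2)

// Analogue of RDD.top(2): the two largest elements, in descending order.
val largestTwo: Seq[String] = sample.sorted(Ordering[String].reverse).take(2)

firstTwo.foreach(println)
largestTwo.foreach(println)
```

Strings are compared character by character, so `u999` sorts after `u998`, and both sort after `u0` and `u1`.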

In [None]:
conversationsRaw.count // there are over 83,000 conversations in total

  

>     res1: Long = 83097

In [None]:
import scala.util.{Failure, Success}

val regexConversation = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

case class conversationLine(a: String, b: String, c: String, d: String)

val conversationsRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt")
 .zipWithIndex()
  .map(x => 
          {
            val id:Long = x._2
            val line = x._1
            val pLine = regexConversation.findFirstMatchIn(line)
                               .map(m => conversationLine(m.group(1), m.group(3), m.group(5), m.group(7))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              (id,pLine)
           }
  )

  

>     import scala.util.{Failure, Success}
>     regexConversation: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)
>     defined class conversationLine
>     conversationsRaw: org.apache.spark.rdd.RDD[(Long, Product with Serializable with scala.util.Try[conversationLine])] = MapPartitionsRDD[3713] at map at command-753740454082223:9

In [None]:
conversationsRaw.filter(x => x._2.isSuccess).count()

  

>     res2: Long = 83097

In [None]:
conversationsRaw.filter(x => x._2.isFailure).count()

  

>     res3: Long = 0

  

The conversation number and line numbers of each conversation are in one
line in `conversationsRaw`.

In [None]:
conversationsRaw.filter(x => x._2.isSuccess).take(5).foreach(println)

  

>     (0,Success(conversationLine(u0,u2,m0,['L194', 'L195', 'L196', 'L197'])))
>     (1,Success(conversationLine(u0,u2,m0,['L198', 'L199'])))
>     (2,Success(conversationLine(u0,u2,m0,['L200', 'L201', 'L202', 'L203'])))
>     (3,Success(conversationLine(u0,u2,m0,['L204', 'L205', 'L206'])))
>     (4,Success(conversationLine(u0,u2,m0,['L207', 'L208'])))
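
To see what the regular expression captures, here is a minimal local sketch that applies the same pattern to single strings, without Spark. The case class `ParsedConversation` and the function `parseConversation` are hypothetical names introduced for this illustration. Declaring the return type as `Try[...]` is an optional variation that avoids the inferred `Product with Serializable with Try[...]` type seen in the output above.

```scala
import scala.util.{Failure, Success, Try}

// Same pattern as in the notebook cell; \2 is a back-reference to the separator group.
val conversationPattern =
  """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(\[.*\]\s*$)""".r

// Hypothetical record type with descriptive field names.
case class ParsedConversation(firstCharacterID: String, secondCharacterID: String,
                              movieID: String, lineIDs: String)

// Groups 2, 4 and 6 hold the separator, so the data sits in groups 1, 3, 5 and 7.
def parseConversation(line: String): Try[ParsedConversation] =
  conversationPattern.findFirstMatchIn(line) match {
    case Some(m) => Success(ParsedConversation(m.group(1), m.group(3), m.group(5), m.group(7)))
    case None    => Failure(new Exception(s"Non matching input: $line"))
  }

val ok  = parseConversation("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']")
val bad = parseConversation("this line has no separators")
println(ok)
println(bad.isFailure)
```

Wrapping each result in `Success` or `Failure` keeps malformed lines available for inspection instead of dropping them silently.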

  

Let's create `conversations` that has just the conversation id and the
line id, together with the position of each line within its conversation.

In [None]:
val conversations 
    = conversationsRaw
      .filter(x => x._2.isSuccess)
      .flatMap { 
        case (id,Success(l))  
                  => { val conv = l.d.replace("[","").replace("]","").replace("'","").replace(" ","")
                       val convLinesIndexed = conv.split(",").zipWithIndex
                       convLinesIndexed.map( cLI => (id, cLI._2, cLI._1))
                      }
       }.toDF("conversationID","intraConversationID","lineID")

  

>     notebook:4: warning: match may not be exhaustive.
>     It would fail on the following input: (_, Failure(_))
>           .flatMap {
>                    ^
>     conversations: org.apache.spark.sql.DataFrame = [conversationID: bigint, intraConversationID: int ... 1 more field]

In [None]:
conversations.show(15)

  

>     +--------------+-------------------+------+
>     |conversationID|intraConversationID|lineID|
>     +--------------+-------------------+------+
>     |             0|                  0|  L194|
>     |             0|                  1|  L195|
>     |             0|                  2|  L196|
>     |             0|                  3|  L197|
>     |             1|                  0|  L198|
>     |             1|                  1|  L199|
>     |             2|                  0|  L200|
>     |             2|                  1|  L201|
>     |             2|                  2|  L202|
>     |             2|                  3|  L203|
>     |             3|                  0|  L204|
>     |             3|                  1|  L205|
>     |             3|                  2|  L206|
>     |             4|                  0|  L207|
>     |             4|                  1|  L208|
>     +--------------+-------------------+------+
>     only showing top 15 rows
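
The step from one bracketed list of line IDs to one row per line can be sketched locally as follows. This is a plain Scala approximation of the `flatMap` above, with a hypothetical conversation index, and it uses a single character class instead of the chained `replace` calls.

```scala
// The list field as it appears in movie_conversations.txt.
val lineIDsField = "['L194', 'L195', 'L196', 'L197']"

// Drop brackets, quotes, and spaces, then split on commas.
val lineIDs: Array[String] = lineIDsField.replaceAll("[\\[\\]' ]", "").split(",")

// Pair every line ID with its position inside the conversation.
val conversationID = 0L // hypothetical conversation index
val rows: Seq[(Long, Int, String)] =
  lineIDs.zipWithIndex.map { case (lineID, position) => (conversationID, position, lineID) }.toSeq
rows.foreach(println)
```

The position produced by `zipWithIndex` plays the role of `intraConversationID`, which preserves the chronological order given in the file.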

  

Movie Titles
------------

In [None]:
val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
moviesMetaDataRaw.top(5).foreach(println)

  

>     m99 +++$+++ indiana jones and the temple of doom +++$+++ 1984 +++$+++ 7.50 +++$+++ 112054 +++$+++ ['action', 'adventure']
>     m98 +++$+++ indiana jones and the last crusade +++$+++ 1989 +++$+++ 8.30 +++$+++ 174947 +++$+++ ['action', 'adventure', 'thriller', 'action', 'adventure', 'fantasy']
>     m97 +++$+++ independence day +++$+++ 1996 +++$+++ 6.60 +++$+++ 151698 +++$+++ ['action', 'adventure', 'sci-fi', 'thriller']
>     m96 +++$+++ invaders from mars +++$+++ 1953 +++$+++ 6.40 +++$+++ 2115 +++$+++ ['horror', 'sci-fi']
>     m95 +++$+++ i am legend +++$+++ 2007 +++$+++ 7.10 +++$+++ 156084 +++$+++ ['drama', 'sci-fi', 'thriller']
>     moviesMetaDataRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt MapPartitionsRDD[3722] at textFile at command-753740454082232:1

In [None]:
moviesMetaDataRaw.count() // number of movies

  

>     res8: Long = 617

In [None]:
import scala.util.{Failure, Success}

/*  - contains information about each movie title
  - fields:
          - movieID,
          - movie title,
          - movie year,
          - IMDB rating,
          - no. IMDB votes,
          - genres in the format ['genre1','genre2',...,'genreN']
          */
val regexMovieMetaData = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)""".r

case class lineInMovieMetaData(movieID: String, movieTitle: String, movieYear: String, IMDBRating: String, NumIMDBVotes: String, genres: String)

val moviesMetaDataRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
  .map(line => 
          {
            val pLine = regexMovieMetaData.findFirstMatchIn(line)
                               .map(m => lineInMovieMetaData(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9), m.group(11))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )

  

>     import scala.util.{Failure, Success}
>     regexMovieMetaData: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(.+)\s+(\2)\s+(\[.*\]\s*$)
>     defined class lineInMovieMetaData
>     moviesMetaDataRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovieMetaData]] = MapPartitionsRDD[3725] at map at command-753740454082234:17

In [None]:
moviesMetaDataRaw.count

  

>     res9: Long = 617

In [None]:
moviesMetaDataRaw.filter(x => x.isSuccess).count()

  

>     res10: Long = 617

In [None]:
moviesMetaDataRaw.filter(x => x.isSuccess).take(10).foreach(println)

  

>     Success(lineInMovieMetaData(m0,10 things i hate about you,1999,6.90,62847,['comedy', 'romance']))
>     Success(lineInMovieMetaData(m1,1492: conquest of paradise,1992,6.20,10421,['adventure', 'biography', 'drama', 'history']))
>     Success(lineInMovieMetaData(m2,15 minutes,2001,6.10,25854,['action', 'crime', 'drama', 'thriller']))
>     Success(lineInMovieMetaData(m3,2001: a space odyssey,1968,8.40,163227,['adventure', 'mystery', 'sci-fi']))
>     Success(lineInMovieMetaData(m4,48 hrs.,1982,6.90,22289,['action', 'comedy', 'crime', 'drama', 'thriller']))
>     Success(lineInMovieMetaData(m5,the fifth element,1997,7.50,133756,['action', 'adventure', 'romance', 'sci-fi', 'thriller']))
>     Success(lineInMovieMetaData(m6,8mm,1999,6.30,48212,['crime', 'mystery', 'thriller']))
>     Success(lineInMovieMetaData(m7,a nightmare on elm street 4: the dream master,1988,5.20,13590,['fantasy', 'horror', 'thriller']))
>     Success(lineInMovieMetaData(m8,a nightmare on elm street: the dream child,1989,4.70,11092,['fantasy', 'horror', 'thriller']))
>     Success(lineInMovieMetaData(m9,the atomic submarine,1959,4.90,513,['sci-fi', 'thriller']))

In [None]:
//moviesMetaDataRaw.filter(x => x.isFailure).take(10).foreach(println) // to regex refine for casting

In [None]:
val moviesMetaData 
    = moviesMetaDataRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF().select("movieID","movieTitle","movieYear")

  

>     notebook:4: warning: match may not be exhaustive.
>     It would fail on the following input: Failure(_)
>           .map { case Success(l) => l }
>                ^
>     moviesMetaData: org.apache.spark.sql.DataFrame = [movieID: string, movieTitle: string ... 1 more field]

In [None]:
moviesMetaData.show(10,false)

  

>     +-------+---------------------------------------------+---------+
>     |movieID|movieTitle                                   |movieYear|
>     +-------+---------------------------------------------+---------+
>     |m0     |10 things i hate about you                   |1999     |
>     |m1     |1492: conquest of paradise                   |1992     |
>     |m2     |15 minutes                                   |2001     |
>     |m3     |2001: a space odyssey                        |1968     |
>     |m4     |48 hrs.                                      |1982     |
>     |m5     |the fifth element                            |1997     |
>     |m6     |8mm                                          |1999     |
>     |m7     |a nightmare on elm street 4: the dream master|1988     |
>     |m8     |a nightmare on elm street: the dream child   |1989     |
>     |m9     |the atomic submarine                         |1959     |
>     +-------+---------------------------------------------+---------+
>     only showing top 10 rows

  

Lines Data
----------

In [None]:
val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")

  

>     linesRaw: org.apache.spark.rdd.RDD[String] = dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt MapPartitionsRDD[3733] at textFile at command-753740454082242:1

In [None]:
linesRaw.count() // number of lines making up the conversations

  

>     res15: Long = 304713

  

Review five lines to get a sense of the data format (`top(5)` returns the five largest lines under string ordering, not the first five lines of the file).

In [None]:
linesRaw.top(5).foreach(println)

  

>     L99999 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ You didn't know about it before that?
>     L99998 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ To show you this.  It's a letter from that lawyer, Koehler.  He wrote it to me the day after I saw him.  He's the one who told me I could get the money if Miss Lawson went to jail.
>     L99997 +++$+++ u4166 +++$+++ m278 +++$+++ DULANEY +++$+++ Why'd you come here?
>     L99996 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ I'm gonna go to jail.  I know they're gonna make it look like I did it. They gotta put it on someone.
>     L99995 +++$+++ u4168 +++$+++ m278 +++$+++ JOANNE +++$+++ What do you think I've got?  A gun? Maybe I'm gonna kill you too.  Maybe I'll blow your head off right now.

  

To see 5 random lines from `movie_lines.txt`, evaluate the following cell.

In [None]:
linesRaw.takeSample(false, 5).foreach(println)

  

>     L216035 +++$+++ u5302 +++$+++ m351 +++$+++ RAMBO +++$+++ Colonel.
>     L597568 +++$+++ u8300 +++$+++ m564 +++$+++ LOMBARD +++$+++ I don't.
>     L513032 +++$+++ u7667 +++$+++ m518 +++$+++ LINDA +++$+++ He's no more an Indian than I am though. Anyhow, Doyle's gonna try and tease you and be mean to you to show off to his friends. Just like he does to Frank and me sometimes. You just ignore it. Or stay out here away from 'em if he'll let you. He's an okay guy till he gets drunk but tonight he'll get drunk. I guarantee it.
>     L35914 +++$+++ u313 +++$+++ m19 +++$+++ JESSE +++$+++ Yesss, but I was thinking, I could come by, and then take Zee out. Some place near. With other folk. Near. Here.  But out.
>     L426481 +++$+++ u2391 +++$+++ m153 +++$+++ COOLEY +++$+++ - and share one of your graves.

In [None]:
import scala.util.{Failure, Success}

/*  fields in movie_lines.txt are:
          - lineID
          - characterID (who uttered this phrase)
          - movieID
          - character name
          - text of the utterance
          */
val regexLine = """\s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)""".r

case class lineInMovie(lineID: String, characterID: String, movieID: String, characterName: String, text: String)

val linesRaw = sc.textFile("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
  .map(line => 
          {
            val pLine = regexLine.findFirstMatchIn(line)
                               .map(m => lineInMovie(m.group(1), m.group(3), m.group(5), m.group(7), m.group(9))) 
                                  match {
                                    case Some(l) => Success(l)
                                    case None => Failure(new Exception(s"Non matching input: $line"))
                                  }
              pLine
           }
  )

  

>     import scala.util.{Failure, Success}
>     regexLine: scala.util.matching.Regex = \s*(\w+)\s+(\+{3}\$\+{3})\s*(\w+)\s+(\2)\s*(\w+)\s+(\2)\s*(.+)\s+(\2)\s*(.*$)
>     defined class lineInMovie
>     linesRaw: org.apache.spark.rdd.RDD[Product with Serializable with scala.util.Try[lineInMovie]] = MapPartitionsRDD[3737] at map at command-753740454082248:15

In [None]:
linesRaw.filter(x => x.isSuccess).count()

  

>     res18: Long = 304713

In [None]:
linesRaw.filter(x => x.isFailure).count()

  

>     res19: Long = 0

In [None]:
linesRaw.filter(x => x.isSuccess).take(5).foreach(println)

  

>     Success(lineInMovie(L1045,u0,m0,BIANCA,They do not!))
>     Success(lineInMovie(L1044,u2,m0,CAMERON,They do to!))
>     Success(lineInMovie(L985,u0,m0,BIANCA,I hope so.))
>     Success(lineInMovie(L984,u2,m0,CAMERON,She okay?))
>     Success(lineInMovie(L925,u0,m0,BIANCA,Let's go.))

  

Let's make a DataFrame out of the successfully parsed lines.

In [None]:
val lines 
    = linesRaw
      .filter(x => x.isSuccess)
      .map { case Success(l) => l }
      .toDF()
      .join(moviesMetaData, "movieID") // and join it to get movie meta data

  

>     notebook:4: warning: match may not be exhaustive.
>     It would fail on the following input: Failure(_)
>           .map { case Success(l) => l }
>                ^
>     lines: org.apache.spark.sql.DataFrame = [movieID: string, lineID: string ... 5 more fields]

In [None]:
lines.show(5)

  

>     +-------+-------+-----------+-------------+--------------------+-------------+---------+
>     |movieID| lineID|characterID|characterName|                text|   movieTitle|movieYear|
>     +-------+-------+-----------+-------------+--------------------+-------------+---------+
>     |   m203|L593445|      u3102|        HAGEN|You owe the Don a...|the godfather|     1972|
>     |   m203|L593444|      u3094|     BONASERA|Yes, I understand...|the godfather|     1972|
>     |   m203|L593443|      u3102|        HAGEN|This is Tom Hagen...|the godfather|     1972|
>     |   m203|L593425|      u3102|        HAGEN|               Yes. |the godfather|     1972|
>     |   m203|L593424|      u3094|     BONASERA|The Don himself i...|the godfather|     1972|
>     +-------+-------+-----------+-------------+--------------------+-------------+---------+
>     only showing top 5 rows

  

Dialogs with Lines
------------------

Let's join the two DataFrames on `lineID` next.

In [None]:
val convLines = conversations.join(lines, "lineID").sort($"conversationID", $"intraConversationID")

  

>     convLines: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [lineID: string, conversationID: bigint ... 7 more fields]

In [None]:
convLines.count

  

>     res24: Long = 304713

In [None]:
conversations.count

  

>     res25: Long = 304713

In [None]:
display(convLines)

  

[TABLE]

Truncated to 30 rows

  

Let's amalgamate the texts uttered in the same conversation.

By doing this we lose all information about the order of the utterances.

But this is fine, as we are going to do LDA with just the *first-order
information of words uttered in each conversation* by anyone involved in
the dialogue.

In [None]:
import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}

val corpusDF = convLines.groupBy($"conversationID",$"movieID")
  .agg(concat_ws(" :-()-: ",collect_list($"text")).alias("corpus"))
  .join(moviesMetaData, "movieID") // and join it to get movie meta data
  .select($"conversationID".as("id"),$"corpus",$"movieTitle",$"movieYear")
  .cache()

  

>     import org.apache.spark.sql.functions.{collect_list, udf, lit, concat_ws}
>     corpusDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, corpus: string ... 2 more fields]

In [None]:
corpusDF.count()

  

>     res28: Long = 83097

In [None]:
corpusDF.take(5).foreach(println)

  

>     [28762,Your wife and children...you're happy with them? :-()-: Yes. :-()-: Good.,the godfather,1972]
>     [28815,Michael? :-()-: I'm thinking about it. :-()-: Michael... :-()-: No, I would not like you better if you were Ingrid Bergman.,the godfather,1972]
>     [28842,What is it? :-()-: Is it all right if I go to the bathroom?,the godfather,1972]
>     [28766,Things went badly in Palermo? :-()-: The younger men have no respect. Things are changing; I don't know what will happen.  Michael, because of the wedding, people now know your name. :-()-: Is that why there are more men on the walls? :-()-: Even so, I don't think it is safe here anymore.  I've made plans to move you to a villa near Siracuse. You must go right away. :-()-: What is it? :-()-: Bad news from America.  Your brother, Santino.  He has been killed.,the godfather,1972]
>     [28835,We can't wait.  No matter what Sollozzo say about a deal, he's figuring out how to kill Pop.  You have to get Sollozzo now. :-()-: The kid's right.,the godfather,1972]
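
The grouping step can be sketched with local collections as follows. The rows are a hypothetical sample based on the output above, and the explicit sort by position is an addition for this illustration that is not present in the notebook cell.

```scala
// (conversationID, intraConversationID, text) rows, deliberately out of order.
// Hypothetical sample based on the output shown above.
val utterances: Seq[(Long, Int, String)] = Seq(
  (28762L, 1, "Yes."),
  (28842L, 0, "What is it?"),
  (28762L, 0, "Your wife and children...you're happy with them?"),
  (28762L, 2, "Good."),
  (28842L, 1, "Is it all right if I go to the bathroom?")
)

// Group by conversation, restore the utterance order, and join with the marker.
val corpus: Map[Long, String] =
  utterances.groupBy(_._1).map { case (id, rows) =>
    id -> rows.sortBy(_._2).map(_._3).mkString(" :-()-: ")
  }
corpus.toSeq.sortBy(_._1).foreach(println)
```

Since the topic model only uses word counts per conversation, the order inside each joined string affects readability but not the features derived from it.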

In [None]:
display(corpusDF)

  

[TABLE]

Truncated to 30 rows

  

Feature extraction and transformation APIs
------------------------------------------

We will use the convenient [Feature extraction and transformation
APIs](http://spark.apache.org/docs/latest/ml-features.html).

Step 3. Text Tokenization
-------------------------

We will use the RegexTokenizer to split each document into tokens. We
can setMinTokenLength() here to indicate a minimum token length, and
filter away all tokens that fall below the minimum. See:

-   <http://spark.apache.org/docs/latest/ml-features.html#tokenizer>.

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+") // split on runs of non-word characters and underscores
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("corpus") // name of the input column
.setOutputCol("tokens") // name of the output column

// Tokenize document
val tokenized_df = tokenizer.transform(corpusDF)

  

>     import org.apache.spark.ml.feature.RegexTokenizer
>     tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_5380a11bc0d5
>     tokenized_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 3 more fields]
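
To make the effect of these settings concrete, the following sketch mimics them on a single document with plain Scala string operations. It is an approximation for illustration, not the `RegexTokenizer` implementation, and the names `document`, `minTokenLength`, and `tokens` are assumptions.

```scala
// Mimics the tokenizer settings above on one document, without Spark.
val document = "What is it? :-()-: Is it all right if I go to the bathroom?"

val minTokenLength = 4
val tokens: Seq[String] =
  document.toLowerCase      // RegexTokenizer lowercases by default
    .split("[\\W_]+")       // split on runs of non-word characters and underscores
    .filter(_.length >= minTokenLength)
    .toSeq
println(tokens)
```

The conversation marker ` :-()-: ` consists only of non-word characters, so it acts as a delimiter and never appears as a token. The minimum length of 4 also removes many short function words before any stopword list is applied.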

In [None]:
display(tokenized_df.sample(false,0.001,1234L)) 

In [None]:
display(tokenized_df.sample(false,0.001,123L).select("tokens"))

  

Step 4. Remove Stopwords
------------------------

We can easily remove stopwords using the StopWordsRemover(). See:

-   <http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover>.

If a list of stopwords is not provided, the StopWordsRemover() will use
[this list of
stopwords](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words),
also shown below, by default.

`a,about,above,across,after,afterwards,again,against,all,almost,alone,along,already,also,although,always,am,among,amongst,amoungst,amount,an,and,another,any,anyhow,anyone,anything,anyway,anywhere, are,around,as,at,back,be,became,because,become,becomes,becoming,been,before,beforehand,behind,being,below,beside,besides,between,beyond,bill,both,bottom,but,by,call,can,cannot,cant,co,computer,con,could, couldnt,cry,de,describe,detail,do,done,down,due,during,each,eg,eight,either,eleven,else,elsewhere,empty,enough,etc,even,ever,every,everyone,everything,everywhere,except,few,fifteen,fify,fill,find,fire,first, five,for,former,formerly,forty,found,four,from,front,full,further,get,give,go,had,has,hasnt,have,he,hence,her,here,hereafter,hereby,herein,hereupon,hers,herself,him,himself,his,how,however,hundred,i,ie,if, in,inc,indeed,interest,into,is,it,its,itself,keep,last,latter,latterly,least,less,ltd,made,many,may,me,meanwhile,might,mill,mine,more,moreover,most,mostly,move,much,must,my,myself,name,namely,neither,never, nevertheless,next,nine,no,nobody,none,noone,nor,not,nothing,now,nowhere,of,off,often,on,once,one,only,onto,or,other,others,otherwise,our,ours,ourselves,out,over,own,part,per,perhaps,please,put,rather,re,same, see,seem,seemed,seeming,seems,serious,several,she,should,show,side,since,sincere,six,sixty,so,some,somehow,someone,something,sometime,sometimes,somewhere,still,such,system,take,ten,than,that,the,their,them, themselves,then,thence,there,thereafter,thereby,therefore,therein,thereupon,these,they,thick,thin,third,this,those,though,three,through,throughout,thru,thus,to,together,too,top,toward,towards,twelve,twenty,two, un,under,until,up,upon,us,very,via,was,we,well,were,what,whatever,when,whence,whenever,where,whereafter,whereas,whereby,wherein,whereupon,wherever,whether,which,while,whither,who,whoever,whole,whom,whose,why,will, with,within,without,would,yet,you,your,yours,yourself,yourselves`

You can use `getStopWords()` to see the list of stopwords that will be
used.

In this example, we will specify a list of stopwords for the
StopWordsRemover() to use. We do this so that we can add on to the list
later on.

In [None]:
display(dbutils.fs.ls("dbfs:/tmp/stopwords")) // check if the file already exists from earlier wget and dbfs-load

  

[TABLE]

  

If the file `dbfs:/tmp/stopwords` already exists then skip the next two
cells, otherwise download and load it into DBFS by uncommenting and
evaluating the next two cells.

In [None]:
%sh
wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords # remove any leading '//' and run again only if needed

  

>     --2019-05-31 08:23:58--  http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
>     Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
>     Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: 2237 (2.2K)
>     Saving to: ‘/tmp/stopwords’
>
>          0K ..                                                    100%  320M=0s
>
>     2019-05-31 08:23:59 (320 MB/s) - ‘/tmp/stopwords’ saved [2237/2237]

In [None]:
%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords

  

>     res41: Boolean = true

In [None]:
// List of stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()

  

>     stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, 
was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves)

In [None]:
stopwords.length // find the number of stopwords in the scala Array[String]

  

>     res35: Int = 319

  

Finally, we can just remove the stopwords using the `StopWordsRemover`
as follows:

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set params for StopWordsRemover
val remover = new StopWordsRemover()
  .setStopWords(stopwords) // This parameter is optional
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)

  

>     import org.apache.spark.ml.feature.StopWordsRemover
>     remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_294e3228eba8
>     filtered_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 4 more fields]

  
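The filtering step above can also be pictured without Spark. The following is a minimal plain-Scala sketch, not the `StopWordsRemover` implementation: `StopWordSketch`, `removeStopWords`, and the two extra stopwords are hypothetical names and values chosen for illustration. It assumes case-insensitive matching, which is the default behaviour of `StopWordsRemover` (`caseSensitive = false`), and shows how a base list can be extended before filtering.

```scala
object StopWordSketch {
  // Hypothetical helper: drop every token that appears in the stopword list.
  // Matching ignores case, mirroring the default caseSensitive = false.
  def removeStopWords(tokens: Seq[String], stopWords: Seq[String]): Seq[String] = {
    val lowered = stopWords.map(_.toLowerCase).toSet
    tokens.filterNot(token => lowered.contains(token.toLowerCase))
  }

  // A tiny stand-in for the 319-word list loaded above (illustration only).
  val baseStopWords: Seq[String] = Seq("a", "the", "you", "what", "are")

  // Extending the list is plain concatenation; "yeah" and "okay" are made-up additions.
  val extendedStopWords: Seq[String] = baseStopWords ++ Seq("yeah", "okay")
}
```

For example, `StopWordSketch.removeStopWords(Seq("Yeah", "what", "are", "you", "doing"), StopWordSketch.extendedStopWords)` keeps only `doing`.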

Step 5. Vector of Token Counts
------------------------------

LDA takes in a vector of token counts as input. We can use the
`CountVectorizer()` to easily convert our text documents into vectors of
token counts.

For each document, the fitted `CountVectorizer` model produces a sparse
vector of the form
`(VocabSize, Array(Indexed Tokens), Array(Token Frequency))`.

Two handy parameters to note:

-   `setMinDF`: Specifies the minimum number of different documents a
    term must appear in to be included in the vocabulary.
-   `setMinTF`: Specifies the minimum number of times a term has to
    appear in a document to be counted in that document's vector. It is
    applied when transforming documents and does not change the
    vocabulary.

See:

-   <http://spark.apache.org/docs/latest/ml-features.html#countvectorizer>.

In [None]:
import org.apache.spark.ml.feature.CountVectorizer

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5) // the minimum number of different documents a term must appear in to be included in the vocabulary
  .fit(filtered_df)

  

>     import org.apache.spark.ml.feature.CountVectorizer
>     vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_48267a85f1b9

In [None]:
// Create vector of token counts
val countVectors = vectorizer.transform(filtered_df).select("id", "features")

  

>     countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

In [None]:
// see the first countVectors
countVectors.take(1)

  

>     res38: Array[org.apache.spark.sql.Row] = Array([28762,(10000,[7,112,179,308],[1.0,1.0,1.0,1.0])])

  
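To make the `(VocabSize, indices, counts)` output above concrete, here is a rough plain-Scala sketch of the two stages, fitting a vocabulary and then encoding a document. It is not Spark's implementation: `CountVectorSketch`, `buildVocabulary`, and `countVector` are hypothetical names, the thresholds are treated as absolute integer counts only, and ties in corpus frequency are broken alphabetically purely to keep the sketch deterministic.

```scala
object CountVectorSketch {
  // Fitting: keep terms found in at least minDF documents, order them by
  // total corpus frequency (highest first), and cap the list at vocabSize.
  def buildVocabulary(docs: Seq[Seq[String]], minDF: Int, vocabSize: Int): Vector[String] = {
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    val termFreq = docs.flatten.groupBy(identity).map { case (t, occ) => t -> occ.size }
    termFreq.toVector
      .filter { case (t, _) => docFreq(t) >= minDF }
      .sortBy { case (t, n) => (-n, t) } // alphabetical tie-break is an assumption of this sketch
      .take(vocabSize)
      .map(_._1)
  }

  // Transforming: encode one document as (vocabulary size, term indices, counts),
  // ignoring out-of-vocabulary terms and terms seen fewer than minTF times.
  def countVector(doc: Seq[String], vocab: Vector[String], minTF: Int): (Int, Seq[Int], Seq[Double]) = {
    val index = vocab.zipWithIndex.toMap
    val counts = doc.filter(index.contains).groupBy(identity).toSeq
      .map { case (t, occ) => (index(t), occ.size.toDouble) }
      .filter { case (_, n) => n >= minTF }
      .sortBy(_._1)
    (vocab.size, counts.map(_._1), counts.map(_._2))
  }
}
```

With the toy corpus `Seq(Seq("know", "know", "love"), Seq("know", "money", "ship"), Seq("love", "money", "money"))`, `minDF = 2` drops `ship` and gives the vocabulary `Vector("know", "money", "love")`. Encoding the first document with `minTF = 1` then yields `(3, Seq(0, 2), Seq(2.0, 1.0))`, and raising `minTF` to 2 leaves only `(3, Seq(0), Seq(2.0))`.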

To use the LDA algorithm in the MLlib library, we have to convert the
DataFrame back into an RDD.

In [None]:
// Convert DF to RDD - ideally we should use ml for everything and not mix ml and mllib ; DAN
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}

val lda_countVector = countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd


  

>     import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
>     import org.apache.spark.ml.linalg.{Vector=>MLVector}
>     import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.sql.{Row, SparkSession}
>     lda_countVector: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[3912] at rdd at command-753740454082286:11

In [None]:
// format: Array(id, (VocabSize, Array(indexedTokens), Array(Token Frequency)))
lda_countVector.take(1)

  

>     res42: Array[(Long, org.apache.spark.mllib.linalg.Vector)] = Array((28762,(10000,[7,112,179,308],[1.0,1.0,1.0,1.0])))

  

Let's get an overview of LDA in Spark's MLlib
---------------------------------------------

See:

-   <http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>.

Create LDA model with Online Variational Bayes
----------------------------------------------

We will now set the parameters for LDA. We will use the
OnlineLDAOptimizer() here, which implements Online Variational Bayes.

Choosing the number of topics for your LDA model requires a bit of
domain knowledge. As we do not know the number of "topics", we will set
numTopics to be 20.

In [None]:
val numTopics = 20

  

>     numTopics: Int = 20

  

We will set the parameters needed to build our LDA model. We can also
call `setMiniBatchFraction` on the `OnlineLDAOptimizer`, which sets the
fraction of the corpus sampled and used at each iteration. In this
example, we will set this to 0.8.

In [None]:
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Set LDA params
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
  .setK(numTopics)
  .setMaxIterations(3)
  .setDocConcentration(-1) // use default values
  .setTopicConcentration(-1) // use default values

  

>     import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
>     lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@3c173c8

  
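The effect of the mini-batch fraction can be sketched in plain Scala. This is a simplified illustration under stated assumptions, not the `OnlineLDAOptimizer` code: `MiniBatchSketch`, `sampleMiniBatch`, and `expectedPasses` are hypothetical helpers, and each document is kept independently with probability `fraction`, which may differ from how Spark samples internally. The second helper reflects the guidance in the Spark API documentation that `maxIterations * miniBatchFraction` should be at least 1 so that the whole corpus is used.

```scala
import scala.util.Random

object MiniBatchSketch {
  // Simplified Bernoulli sampling: keep each document id with probability `fraction`.
  def sampleMiniBatch(docIds: Seq[Long], fraction: Double, rng: Random): Seq[Long] =
    docIds.filter(_ => rng.nextDouble() < fraction)

  // Expected number of passes over the corpus after `maxIterations` mini-batches.
  def expectedPasses(maxIterations: Int, fraction: Double): Double =
    maxIterations * fraction
}
```

With the settings used in this notebook, `MiniBatchSketch.expectedPasses(3, 0.8)` is about 2.4, so each document is expected to be sampled a little over twice across the three iterations.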

Create the LDA model with Online Variational Bayes.

In [None]:
val ldaModel = lda.run(lda_countVector)

  

>     ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@5bf00930

  

Watch **Online Learning for Latent Dirichlet Allocation** at NIPS 2010 by
Matt Hoffman (right click and open in new tab)

[![Matt Hoffman's NIPS 2010 Talk Online
LDA](http://videolectures.net/nips2010_hoffman_oll/thumb.jpg)](http://videolectures.net/nips2010_hoffman_oll/)

Also see the paper on *Online variational Bayes* by Matt, linked from
the above URL, for more details:
<http://videolectures.net/site/normal_dl/tag=83534/nips2010_1291.pdf>

Note that running LDA with the `OnlineLDAOptimizer` returns a
[LocalLDAModel](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LocalLDAModel),
which stores the inferred topics of your corpus.

Review Topics
-------------

We can now review the results of our LDA model. We will print out all 20
topics with their corresponding term probabilities.

Note that you will get slightly different results every time you run an
LDA model since LDA includes some randomization.

Let us review results of LDA model with Online Variational Bayes, step
by step.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)

  

>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4)), (Array(1, 6, 2, 0, 4),Array(0.0014443699497685366, 0.0012377629724506722, 0.0011714257476524842, 0.0010861657304183027, 8.604460434628813E-4)), (Array(1, 2, 8, 0, 3),Array(0.0014926060533533697, 0.0013429026076916017, 0.0013067364965238173, 0.0011607492289313303, 0.0011400804862230437)), (Array(5, 6, 4, 1, 7),Array(0.006717314446949222, 0.006002662754297925, 0.004488111770001314, 0.004408679383982238, 0.0042465917238892655)), (Array(0, 19, 3, 8, 6),Array(0.0050059173813691085, 0.0029731088780905225, 0.0022359962463711185, 0.002193246256785973, 0.0019111384839030116)), (Array(3, 0, 10, 1, 15),Array(0.003714410612506209, 0.0017122806517390608, 0.0017073041827440282, 0.0015712232707115927, 0.0012303967042097022)), (Array(0, 1, 6, 10, 2),Array(0.00467483294478972, 0.0038641828467113268, 0.003328578440542597, 0.002867941043688811, 0.002532629878316373)), (Array(0, 2, 9, 1, 13),Array(0.00960017865043255, 0.009308573745541343, 0.005704969701604644, 0.004085042285865179, 0.004031048471919761)), (Array(0, 4, 5, 77, 16),Array(0.004550808496981245, 0.004122146617438838, 0.0019092043643137734, 0.0018255598181846045, 0.001761167250972209)), (Array(6, 2, 5, 1, 0),Array(0.0016782125889211463, 0.0012427279906039904, 0.0012197157251243875, 0.0010635502545983016, 9.50137528050953E-4)), (Array(2, 1, 3, 0, 6),Array(0.003126597598330109, 0.0027451035751362273, 0.00228759303132256, 0.0017239166326848171, 0.0017047784964894794)), (Array(2, 1, 27, 4, 3),Array(0.004734133576359814, 0.004201386287998202, 0.0036983083453854372, 0.0025414887712607768, 0.002091795015523375)), (Array(0, 5, 1, 12, 2),Array(0.0035340054254694784, 0.002387182752907053, 0.0019263993964325303, 0.001843992584617911, 0.0018065489773133325)), (Array(2, 1, 5, 14, 0),Array(0.0016017017354850733, 0.0014834097260266685, 
0.0014300356385979168, 0.001294952229819751, 0.0012788947989035501)), (Array(7, 1, 10, 6, 2),Array(0.002043769246809558, 0.0013757478946969802, 0.0013208455540129331, 0.0012662647575091633, 0.0011549537488969965)), (Array(0, 1, 2, 3, 4),Array(0.022087503347588935, 0.01571524947937798, 0.012895996754133662, 0.01026452087962411, 0.009873743305368164)), (Array(0, 1, 3, 4, 9),Array(0.002204551343207476, 0.0016283414468010306, 0.0014214537687803855, 0.0012768751041210551, 0.0011525954268574248)), (Array(46, 1, 2, 16, 5),Array(0.0022031979750387655, 0.0020637622110226085, 0.0019281346187348387, 0.0015712770524161123, 0.0014183600893726285)), (Array(0, 2, 3, 5, 8),Array(0.0035729889283848504, 0.0024215014894025766, 0.0018740761967851508, 0.001838630576321126, 0.0016262171049684524)), (Array(3, 10, 30, 9, 4),Array(0.0018098267577494882, 0.0015864305565599366, 0.0015861983258874525, 0.001331260635860306, 0.0012793651558771885)))

In [None]:
val vocabList = vectorizer.vocabulary

  

>     vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, stop, hear, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, cause, problem, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, read, word, doctor, everybody, makes, supposed, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, hand, drink, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, change, police, walk, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, inside, reason, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, waiting, christ, answer, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, 
thirty, fact, million, hands, died, power, stupid, started, shouldn, months, city, boys, dinner, sense, running, hour, shoot, drive, fight, speak, george, ship, living, figure, dear, street, ahead, lady, seven, free, feeling, scared, frank, able, children, outside, moment, safe, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, lives, promise, till, sound, weren, save, poor, cool, asking, shall, plan, bitch, king, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, tired, food, company, lucky, pull, wonderful, touch, state, looked, thinks, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, explain, hotel, worse, kidding, past, outta, general, unless, felt, drop, throw, interested, hang, certainly, absolutely, earth, loved, wonder, dark, accident, seeing, simple, turned, doin, clock, date, sweet, meeting, clean, sign, feet, handle, army, music, giving, report, cops, fucked, charlie, information, smart, yesterday, fall, fault, class, bank, month, blow, major, caught, swear, paul, road, talked, choice, boss, plane, david, paid, wear, american, worried, clothes, paper, goodbye, lord, ones, strange, terrible, mistake, given, hurry, blue, finish, murder, kept, apartment, sell, middle, nothin, hasn, careful, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, future, figured, rose, agent, kinda, michael, building, mama, early, private, trip, watching, busy, record, certain, jimmy, broke, longer, sake, store, stick, finally, boat, born, sitting, evening, bucks, chief, history, ought, lying, kiss, honor, lunch, darling, favor, fool, uncle, respect, rich, liked, killing, land, peter, tough, 
interesting, brain, problems, nick, welcome, completely, dick, honest, wake, radio, cash, dude, dance, james, bout, floor, weird, court, calls, jail, window, involved, drunk, johnny, officer, needed, asshole, books, spend, situation, relax, pain, service, dangerous, grand, security, letter, stopped, realize, table, offer, bastard, message, instead, killer, jake, nervous, deep, pass, somethin, evil, english, bought, short, ring, step, picked, likes, voice, eddie, machine, lived, upset, forgot, carry, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, doubt, pleasure, lawyer, suit, station, gotten, bother, prove, return, pictures, slow, bunch, strong, wearing, driving, list, join, christmas, tape, attack, church, appreciate, force, hungry, standing, college, dying, present, charge, prison, missing, truck, public, board, calm, staying, gold, ball, hardly, hadn, lead, missed, island, government, horse, cover, reach, french, joke, star, fish, mike, moved, america, surprise, soul, seconds, club, self, movies, putting, dress, cost, listening, lots, price, saved, smell, mark, peace, gives, crime, dreams, entire, single, usually, department, beer, holy, west, wall, stuck, nose, protect, ways, teach, awful, forever, type, grow, train, detective, billy, rock, planet, walking, beginning, dumb, papers, folks, park, attention, hide, card, birthday, reading, test, share, master, lieutenant, starting, field, partner, twice, enjoy, dollar, blame, film, mess, bomb, round, girlfriend, south, loves, plenty, using, gentlemen, especially, records, evidence, experience, silly, admit, normal, fired, talkin, lock, louis, fighting, mission, notice, memory, promised, crap, wedding, orders, ground, guns, glass, marriage, idiot, heaven, impossible, knock, green, wondering, spent, animal, hole, neck, drugs, press, nuts, names, broken, position, asleep, jerry, visit, boyfriend, acting, plans, feels, tells, paris, smoke, wind, sheriff, cross, holding, 
gimme, mention, walked, judge, code, double, brothers, writing, pardon, keeps, fellow, fell, closed, angry, lovely, cute, surprised, percent, charles, correct, agree, bathroom, address, andy, ridiculous, summer, tommy, rules, note, account, group, sleeping, learned, sing, pulled, colonel, proud, laugh, river, area, upstairs, jump, built, difficult, breakfast, bobby, bridge, dirty, betty, amazing, locked, north, definitely, alex, feelings, plus, worst, accept, kick, file, wild, seriously, grace, stories, steal, gettin, nature, advice, relationship, contact, waste, places, spot, beach, stole, apart, favorite, knowing, level, song, faith, risk, loose, patient, foot, eating, played, action, witness, washington, turns, build, obviously, begin, split, games, command, crew, decide, nurse, keeping, tight, bird, form, runs, copy, arrest, complete, scene, consider, jeffrey, insane, taste, teeth, shoes, monster, devil, henry, career, sooner, innocent, hall, showed, gift, weekend, heavy, study, greatest, comin, danger, keys, raise, destroy, track, carl, california, concerned, bruce, program, blind, suddenly, hanging, apologize, seventy, chicken, medical, forward, drinking, sweetheart, willing, guard, legs, admiral, shop, professor, suspect, tree, camp, data, ticket, goodnight, possibly, dunno, burn, paying, television, trick, murdered, losing, senator, credit, extra, dropped, sold, warm, meaning, stone, starts, hiding, lately, cheap, marty, taught, science, lookin, simply, majesty, harold, corner, jeff, queen, following, duty, training, seat, heads, cars, discuss, bear, noticed, enemy, helped, screw, richard, flight)

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

  

>     topics: Array[Array[(String, Double)]] = Array(Array((just,0.0014102155338741765), (like,0.0012758924372910556), (think,0.0011214448310395873), (home,9.238914780871355E-4), (know,9.047647243869576E-4)), Array((just,0.0014443699497685366), (going,0.0012377629724506722), (like,0.0011714257476524842), (know,0.0010861657304183027), (think,8.604460434628813E-4)), Array((just,0.0014926060533533697), (like,0.0013429026076916017), (yeah,0.0013067364965238173), (know,0.0011607492289313303), (want,0.0011400804862230437)), Array((right,0.006717314446949222), (going,0.006002662754297925), (think,0.004488111770001314), (just,0.004408679383982238), (good,0.0042465917238892655)), Array((know,0.0050059173813691085), (sure,0.0029731088780905225), (want,0.0022359962463711185), (yeah,0.002193246256785973), (going,0.0019111384839030116)), Array((want,0.003714410612506209), (know,0.0017122806517390608), (come,0.0017073041827440282), (just,0.0015712232707115927), (make,0.0012303967042097022)), Array((know,0.00467483294478972), (just,0.0038641828467113268), (going,0.003328578440542597), (come,0.002867941043688811), (like,0.002532629878316373)), Array((know,0.00960017865043255), (like,0.009308573745541343), (tell,0.005704969701604644), (just,0.004085042285865179), (didn,0.004031048471919761)), Array((know,0.004550808496981245), (think,0.004122146617438838), (right,0.0019092043643137734), (fucking,0.0018255598181846045), (okay,0.001761167250972209)), Array((going,0.0016782125889211463), (like,0.0012427279906039904), (right,0.0012197157251243875), (just,0.0010635502545983016), (know,9.50137528050953E-4)), Array((like,0.003126597598330109), (just,0.0027451035751362273), (want,0.00228759303132256), (know,0.0017239166326848171), (going,0.0017047784964894794)), Array((like,0.004734133576359814), (just,0.004201386287998202), (love,0.0036983083453854372), (think,0.0025414887712607768), (want,0.002091795015523375)), Array((know,0.0035340054254694784), (right,0.002387182752907053), 
(just,0.0019263993964325303), (look,0.001843992584617911), (like,0.0018065489773133325)), Array((like,0.0016017017354850733), (just,0.0014834097260266685), (right,0.0014300356385979168), (mean,0.001294952229819751), (know,0.0012788947989035501)), Array((good,0.002043769246809558), (just,0.0013757478946969802), (come,0.0013208455540129331), (going,0.0012662647575091633), (like,0.0011549537488969965)), Array((know,0.022087503347588935), (just,0.01571524947937798), (like,0.012895996754133662), (want,0.01026452087962411), (think,0.009873743305368164)), Array((know,0.002204551343207476), (just,0.0016283414468010306), (want,0.0014214537687803855), (think,0.0012768751041210551), (tell,0.0011525954268574248)), Array((hell,0.0022031979750387655), (just,0.0020637622110226085), (like,0.0019281346187348387), (okay,0.0015712770524161123), (right,0.0014183600893726285)), Array((know,0.0035729889283848504), (like,0.0024215014894025766), (want,0.0018740761967851508), (right,0.001838630576321126), (yeah,0.0016262171049684524)), Array((want,0.0018098267577494882), (come,0.0015864305565599366), (doing,0.0015861983258874525), (tell,0.001331260635860306), (think,0.0012793651558771885)))

  

Feel free to take things apart to understand!

In [None]:
topicIndices(0)

  

>     res43: (Array[Int], Array[Double]) = (Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4))

In [None]:
topicIndices(0)._1

  

>     res44: Array[Int] = Array(1, 2, 4, 49, 0)

In [None]:
topicIndices(0)._1(0)

  

>     res45: Int = 1

In [None]:
vocabList(topicIndices(0)._1(0))

  

>     res46: String = just

  
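The pieces taken apart above can be imitated on a toy example. The sketch below is plain Scala and only an illustration of the shape of the data, not the internals of `describeTopics`: `DescribeTopicSketch`, `topTerms`, and `label` are hypothetical names, and the weights are made up. It picks the `k` heaviest term indices from one topic's weight vector, in decreasing order of weight, and then looks the indices up in a vocabulary.

```scala
object DescribeTopicSketch {
  // Return the indices and weights of the k largest entries, heaviest first.
  def topTerms(weights: Array[Double], k: Int): (Array[Int], Array[Double]) = {
    val top = weights.zipWithIndex.sortBy { case (w, _) => -w }.take(k)
    (top.map(_._2), top.map(_._1))
  }

  // Look up each term index in the vocabulary and pair the term with its weight.
  def label(topic: (Array[Int], Array[Double]), vocab: Array[String]): Array[(String, Double)] =
    topic._1.map(vocab(_)).zip(topic._2)
}
```

Using the first five vocabulary entries `Array("know", "just", "like", "want", "think")` and the made-up weights `Array(0.2, 0.4, 0.3, 0.05, 0.05)`, `topTerms(weights, 3)` selects indices 1, 2, 0, and `label` turns them into `just`, `like`, `know` with weights 0.4, 0.3, 0.2.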

Review the results of the LDA model with Online Variational Bayes, doing
all four of the earlier steps at once.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     just	0.0014102155338741765
>     like	0.0012758924372910556
>     think	0.0011214448310395873
>     home	9.238914780871355E-4
>     know	9.047647243869576E-4
>     ==========
>     TOPIC 1
>     just	0.0014443699497685366
>     going	0.0012377629724506722
>     like	0.0011714257476524842
>     know	0.0010861657304183027
>     think	8.604460434628813E-4
>     ==========
>     TOPIC 2
>     just	0.0014926060533533697
>     like	0.0013429026076916017
>     yeah	0.0013067364965238173
>     know	0.0011607492289313303
>     want	0.0011400804862230437
>     ==========
>     TOPIC 3
>     right	0.006717314446949222
>     going	0.006002662754297925
>     think	0.004488111770001314
>     just	0.004408679383982238
>     good	0.0042465917238892655
>     ==========
>     TOPIC 4
>     know	0.0050059173813691085
>     sure	0.0029731088780905225
>     want	0.0022359962463711185
>     yeah	0.002193246256785973
>     going	0.0019111384839030116
>     ==========
>     TOPIC 5
>     want	0.003714410612506209
>     know	0.0017122806517390608
>     come	0.0017073041827440282
>     just	0.0015712232707115927
>     make	0.0012303967042097022
>     ==========
>     TOPIC 6
>     know	0.00467483294478972
>     just	0.0038641828467113268
>     going	0.003328578440542597
>     come	0.002867941043688811
>     like	0.002532629878316373
>     ==========
>     TOPIC 7
>     know	0.00960017865043255
>     like	0.009308573745541343
>     tell	0.005704969701604644
>     just	0.004085042285865179
>     didn	0.004031048471919761
>     ==========
>     TOPIC 8
>     know	0.004550808496981245
>     think	0.004122146617438838
>     right	0.0019092043643137734
>     fucking	0.0018255598181846045
>     okay	0.001761167250972209
>     ==========
>     TOPIC 9
>     going	0.0016782125889211463
>     like	0.0012427279906039904
>     right	0.0012197157251243875
>     just	0.0010635502545983016
>     know	9.50137528050953E-4
>     ==========
>     TOPIC 10
>     like	0.003126597598330109
>     just	0.0027451035751362273
>     want	0.00228759303132256
>     know	0.0017239166326848171
>     going	0.0017047784964894794
>     ==========
>     TOPIC 11
>     like	0.004734133576359814
>     just	0.004201386287998202
>     love	0.0036983083453854372
>     think	0.0025414887712607768
>     want	0.002091795015523375
>     ==========
>     TOPIC 12
>     know	0.0035340054254694784
>     right	0.002387182752907053
>     just	0.0019263993964325303
>     look	0.001843992584617911
>     like	0.0018065489773133325
>     ==========
>     TOPIC 13
>     like	0.0016017017354850733
>     just	0.0014834097260266685
>     right	0.0014300356385979168
>     mean	0.001294952229819751
>     know	0.0012788947989035501
>     ==========
>     TOPIC 14
>     good	0.002043769246809558
>     just	0.0013757478946969802
>     come	0.0013208455540129331
>     going	0.0012662647575091633
>     like	0.0011549537488969965
>     ==========
>     TOPIC 15
>     know	0.022087503347588935
>     just	0.01571524947937798
>     like	0.012895996754133662
>     want	0.01026452087962411
>     think	0.009873743305368164
>     ==========
>     TOPIC 16
>     know	0.002204551343207476
>     just	0.0016283414468010306
>     want	0.0014214537687803855
>     think	0.0012768751041210551
>     tell	0.0011525954268574248
>     ==========
>     TOPIC 17
>     hell	0.0022031979750387655
>     just	0.0020637622110226085
>     like	0.0019281346187348387
>     okay	0.0015712770524161123
>     right	0.0014183600893726285
>     ==========
>     TOPIC 18
>     know	0.0035729889283848504
>     like	0.0024215014894025766
>     want	0.0018740761967851508
>     right	0.001838630576321126
>     yeah	0.0016262171049684524
>     ==========
>     TOPIC 19
>     want	0.0018098267577494882
>     come	0.0015864305565599366
>     doing	0.0015861983258874525
>     tell	0.001331260635860306
>     think	0.0012793651558771885
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 4, 49, 0),Array(0.0014102155338741765, 0.0012758924372910556, 0.0011214448310395873, 9.238914780871355E-4, 9.047647243869576E-4)), (Array(1, 6, 2, 0, 4),Array(0.0014443699497685366, 0.0012377629724506722, 0.0011714257476524842, 0.0010861657304183027, 8.604460434628813E-4)), (Array(1, 2, 8, 0, 3),Array(0.0014926060533533697, 0.0013429026076916017, 0.0013067364965238173, 0.0011607492289313303, 0.0011400804862230437)), (Array(5, 6, 4, 1, 7),Array(0.006717314446949222, 0.006002662754297925, 0.004488111770001314, 0.004408679383982238, 0.0042465917238892655)), (Array(0, 19, 3, 8, 6),Array(0.0050059173813691085, 0.0029731088780905225, 0.0022359962463711185, 0.002193246256785973, 0.0019111384839030116)), (Array(3, 0, 10, 1, 15),Array(0.003714410612506209, 0.0017122806517390608, 0.0017073041827440282, 0.0015712232707115927, 0.0012303967042097022)), (Array(0, 1, 6, 10, 2),Array(0.00467483294478972, 0.0038641828467113268, 0.003328578440542597, 0.002867941043688811, 0.002532629878316373)), (Array(0, 2, 9, 1, 13),Array(0.00960017865043255, 0.009308573745541343, 0.005704969701604644, 0.004085042285865179, 0.004031048471919761)), (Array(0, 4, 5, 77, 16),Array(0.004550808496981245, 0.004122146617438838, 0.0019092043643137734, 0.0018255598181846045, 0.001761167250972209)), (Array(6, 2, 5, 1, 0),Array(0.0016782125889211463, 0.0012427279906039904, 0.0012197157251243875, 0.0010635502545983016, 9.50137528050953E-4)), (Array(2, 1, 3, 0, 6),Array(0.003126597598330109, 0.0027451035751362273, 0.00228759303132256, 0.0017239166326848171, 0.0017047784964894794)), (Array(2, 1, 27, 4, 3),Array(0.004734133576359814, 0.004201386287998202, 0.0036983083453854372, 0.0025414887712607768, 0.002091795015523375)), (Array(0, 5, 1, 12, 2),Array(0.0035340054254694784, 0.002387182752907053, 0.0019263993964325303, 0.001843992584617911, 0.0018065489773133325)), (Array(2, 1, 5, 14, 0),Array(0.0016017017354850733, 0.0014834097260266685, 
0.0014300356385979168, 0.001294952229819751, 0.0012788947989035501)), (Array(7, 1, 10, 6, 2),Array(0.002043769246809558, 0.0013757478946969802, 0.0013208455540129331, 0.0012662647575091633, 0.0011549537488969965)), (Array(0, 1, 2, 3, 4),Array(0.022087503347588935, 0.01571524947937798, 0.012895996754133662, 0.01026452087962411, 0.009873743305368164)), (Array(0, 1, 3, 4, 9),Array(0.002204551343207476, 0.0016283414468010306, 0.0014214537687803855, 0.0012768751041210551, 0.0011525954268574248)), (Array(46, 1, 2, 16, 5),Array(0.0022031979750387655, 0.0020637622110226085, 0.0019281346187348387, 0.0015712770524161123, 0.0014183600893726285)), (Array(0, 2, 3, 5, 8),Array(0.0035729889283848504, 0.0024215014894025766, 0.0018740761967851508, 0.001838630576321126, 0.0016262171049684524)), (Array(3, 10, 30, 9, 4),Array(0.0018098267577494882, 0.0015864305565599366, 0.0015861983258874525, 0.001331260635860306, 0.0012793651558771885)))
>     vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, stop, hear, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, cause, problem, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, read, word, doctor, everybody, makes, supposed, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, hand, drink, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, change, police, walk, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, inside, reason, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, waiting, christ, answer, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, 
thirty, fact, million, hands, died, power, stupid, started, shouldn, months, city, boys, dinner, sense, running, hour, shoot, drive, fight, speak, george, ship, living, figure, dear, street, ahead, lady, seven, free, feeling, scared, frank, able, children, outside, moment, safe, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, lives, promise, till, sound, weren, save, poor, cool, asking, shall, plan, bitch, king, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, tired, food, company, lucky, pull, wonderful, touch, state, looked, thinks, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, explain, hotel, worse, kidding, past, outta, general, unless, felt, drop, throw, interested, hang, certainly, absolutely, earth, loved, wonder, dark, accident, seeing, simple, turned, doin, clock, date, sweet, meeting, clean, sign, feet, handle, army, music, giving, report, cops, fucked, charlie, information, smart, yesterday, fall, fault, class, bank, month, blow, major, caught, swear, paul, road, talked, choice, boss, plane, david, paid, wear, american, worried, clothes, paper, goodbye, lord, ones, strange, terrible, mistake, given, hurry, blue, finish, murder, kept, apartment, sell, middle, nothin, hasn, careful, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, future, figured, rose, agent, kinda, michael, building, mama, early, private, trip, watching, busy, record, certain, jimmy, broke, longer, sake, store, stick, finally, boat, born, sitting, evening, bucks, chief, history, ought, lying, kiss, honor, lunch, darling, favor, fool, uncle, respect, rich, liked, killing, land, peter, tough, 
interesting, brain, problems, nick, welcome, completely, dick, honest, wake, radio, cash, dude, dance, james, bout, floor, weird, court, calls, jail, window, involved, drunk, johnny, officer, needed, asshole, books, spend, situation, relax, pain, service, dangerous, grand, security, letter, stopped, realize, table, offer, bastard, message, instead, killer, jake, nervous, deep, pass, somethin, evil, english, bought, short, ring, step, picked, likes, voice, eddie, machine, lived, upset, forgot, carry, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, doubt, pleasure, lawyer, suit, station, gotten, bother, prove, return, pictures, slow, bunch, strong, wearing, driving, list, join, christmas, tape, attack, church, appreciate, force, hungry, standing, college, dying, present, charge, prison, missing, truck, public, board, calm, staying, gold, ball, hardly, hadn, lead, missed, island, government, horse, cover, reach, french, joke, star, fish, mike, moved, america, surprise, soul, seconds, club, self, movies, putting, dress, cost, listening, lots, price, saved, smell, mark, peace, gives, crime, dreams, entire, single, usually, department, beer, holy, west, wall, stuck, nose, protect, ways, teach, awful, forever, type, grow, train, detective, billy, rock, planet, walking, beginning, dumb, papers, folks, park, attention, hide, card, birthday, reading, test, share, master, lieutenant, starting, field, partner, twice, enjoy, dollar, blame, film, mess, bomb, round, girlfriend, south, loves, plenty, using, gentlemen, especially, records, evidence, experience, silly, admit, normal, fired, talkin, lock, louis, fighting, mission, notice, memory, promised, crap, wedding, orders, ground, guns, glass, marriage, idiot, heaven, impossible, knock, green, wondering, spent, animal, hole, neck, drugs, press, nuts, names, broken, position, asleep, jerry, visit, boyfriend, acting, plans, feels, tells, paris, smoke, wind, sheriff, cross, holding, 
gimme, mention, walked, judge, code, double, brothers, writing, pardon, keeps, fellow, fell, closed, angry, lovely, cute, surprised, percent, charles, correct, agree, bathroom, address, andy, ridiculous, summer, tommy, rules, note, account, group, sleeping, learned, sing, pulled, colonel, proud, laugh, river, area, upstairs, jump, built, difficult, breakfast, bobby, bridge, dirty, betty, amazing, locked, north, definitely, alex, feelings, plus, worst, accept, kick, file, wild, seriously, grace, stories, steal, gettin, nature, advice, relationship, contact, waste, places, spot, beach, stole, apart, favorite, knowing, level, song, faith, risk, loose, patient, foot, eating, played, action, witness, washington, turns, build, obviously, begin, split, games, command, crew, decide, nurse, keeping, tight, bird, form, runs, copy, arrest, complete, scene, consider, jeffrey, insane, taste, teeth, shoes, monster, devil, henry, career, sooner, innocent, hall, showed, gift, weekend, heavy, study, greatest, comin, danger, keys, raise, destroy, track, carl, california, concerned, bruce, program, blind, suddenly, hanging, apologize, seventy, chicken, medical, forward, drinking, sweetheart, willing, guard, legs, admiral, shop, professor, suspect, tree, camp, data, ticket, goodnight, possibly, dunno, burn, paying, television, trick, murdered, losing, senator, credit, extra, dropped, sold, warm, meaning, stone, starts, hiding, lately, cheap, marty, taught, science, lookin, simply, majesty, harold, corner, jeff, queen, following, duty, training, seat, heads, cars, discuss, bear, noticed, enemy, helped, screw, richard, flight)
>     topics: Array[Array[(String, Double)]] = Array(Array((just,0.0014102155338741765), (like,0.0012758924372910556), (think,0.0011214448310395873), (home,9.238914780871355E-4), (know,9.047647243869576E-4)), Array((just,0.0014443699497685366), (going,0.0012377629724506722), (like,0.0011714257476524842), (know,0.0010861657304183027), (think,8.604460434628813E-4)), Array((just,0.0014926060533533697), (like,0.0013429026076916017), (yeah,0.0013067364965238173), (know,0.0011607492289313303), (want,0.0011400804862230437)), Array((right,0.006717314446949222), (going,0.006002662754297925), (think,0.004488111770001314), (just,0.004408679383982238), (good,0.0042465917238892655)), Array((know,0.0050059173813691085), (sure,0.0029731088780905225), (want,0.0022359962463711185), (yeah,0.002193246256785973), (going,0.0019111384839030116)), Array((want,0.003714410612506209), (know,0.0017122806517390608), (come,0.0017073041827440282), (just,0.0015712232707115927), (make,0.0012303967042097022)), Array((know,0.00467483294478972), (just,0.0038641828467113268), (going,0.003328578440542597), (come,0.002867941043688811), (like,0.002532629878316373)), Array((know,0.00960017865043255), (like,0.009308573745541343), (tell,0.005704969701604644), (just,0.004085042285865179), (didn,0.004031048471919761)), Array((know,0.004550808496981245), (think,0.004122146617438838), (right,0.0019092043643137734), (fucking,0.0018255598181846045), (okay,0.001761167250972209)), Array((going,0.0016782125889211463), (like,0.0012427279906039904), (right,0.0012197157251243875), (just,0.0010635502545983016), (know,9.50137528050953E-4)), Array((like,0.003126597598330109), (just,0.0027451035751362273), (want,0.00228759303132256), (know,0.0017239166326848171), (going,0.0017047784964894794)), Array((like,0.004734133576359814), (just,0.004201386287998202), (love,0.0036983083453854372), (think,0.0025414887712607768), (want,0.002091795015523375)), Array((know,0.0035340054254694784), (right,0.002387182752907053), 
(just,0.0019263993964325303), (look,0.001843992584617911), (like,0.0018065489773133325)), Array((like,0.0016017017354850733), (just,0.0014834097260266685), (right,0.0014300356385979168), (mean,0.001294952229819751), (know,0.0012788947989035501)), Array((good,0.002043769246809558), (just,0.0013757478946969802), (come,0.0013208455540129331), (going,0.0012662647575091633), (like,0.0011549537488969965)), Array((know,0.022087503347588935), (just,0.01571524947937798), (like,0.012895996754133662), (want,0.01026452087962411), (think,0.009873743305368164)), Array((know,0.002204551343207476), (just,0.0016283414468010306), (want,0.0014214537687803855), (think,0.0012768751041210551), (tell,0.0011525954268574248)), Array((hell,0.0022031979750387655), (just,0.0020637622110226085), (like,0.0019281346187348387), (okay,0.0015712770524161123), (right,0.0014183600893726285)), Array((know,0.0035729889283848504), (like,0.0024215014894025766), (want,0.0018740761967851508), (right,0.001838630576321126), (yeah,0.0016262171049684524)), Array((want,0.0018098267577494882), (come,0.0015864305565599366), (doing,0.0015861983258874525), (tell,0.001331260635860306), (think,0.0012793651558771885)))

  

Going through the results, you may notice that some of the topic words
returned are actually stopwords that are specific to our dataset (e.g.
"know", "just", "like", which show up in almost every topic). Let's try
improving our model.

Step 8. Model Tuning - Refilter Stopwords
-----------------------------------------

We will try to improve the results of our model by identifying some
stopwords that are specific to our dataset. We will filter these
stopwords out and rerun our LDA model to see if we get better results.
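
One rough way to spot such dataset-specific stopwords is to count in how
many topics each top term appears: a term that appears in every topic does
not help tell topics apart. The sketch below is plain Scala (no Spark
needed) and uses the top-5 terms of the first three topics printed above.
The names `sampleTopics`, `topicCounts` and `candidates` are made up for
illustration.

```scala
// Top-5 terms of the first three topics, copied from the output above (weights omitted).
val sampleTopics: Seq[Seq[String]] = Seq(
  Seq("just", "like", "think", "home", "know"),
  Seq("just", "going", "like", "know", "think"),
  Seq("just", "like", "yeah", "know", "want")
)

// For each term, count the number of topics in which it appears.
val topicCounts: Map[String, Int] =
  sampleTopics.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => (term, occ.size) }

// Terms present in every sampled topic are candidate stopwords.
val candidates: Seq[String] =
  topicCounts.collect { case (term, n) if n == sampleTopics.size => term }.toSeq.sorted

println(candidates.mkString(", "))
```

In the notebook, the same counting could be applied to the full `topics`
array, and the resulting terms considered as additional stopwords.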

In [None]:
val add_stopwords = Array("whatever") // add more stop-words, like the name of your company!

  

>     add_stopwords: Array[String] = Array(whatever)

In [None]:
// Combine newly identified stopwords with our existing list of stopwords
val new_stopwords = stopwords.union(add_stopwords)

  

>     new_stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, 
was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves, whatever)
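
Note that "whatever" shows up twice in the output above: it was already in
the original list, and `union` on arrays simply concatenates without
removing duplicates. The duplicate is harmless for stopword filtering, but
if a clean list is preferred, `distinct` removes it. A minimal plain-Scala
sketch, using a three-word stand-in for the full list (the names
`baseStopwords`, `extraStopwords`, `combined` and `deduplicated` are
hypothetical):

```scala
// Small stand-in for the full stopword list; "whatever" is already present.
val baseStopwords: Array[String] = Array("what", "whatever", "when")
val extraStopwords: Array[String] = Array("whatever")

// Concatenation keeps both copies of "whatever", like union in the cell above.
val combined: Array[String] = baseStopwords ++ extraStopwords

// distinct keeps the first occurrence of each word and drops later repeats.
val deduplicated: Array[String] = combined.distinct

println(combined.mkString(", "))
println(deduplicated.mkString(", "))
```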

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set Params for StopWordsRemover with new_stopwords
val remover = new StopWordsRemover()
  .setStopWords(new_stopwords)
  .setInputCol("tokens")
  .setOutputCol("filtered")

// Create new df with new list of stopwords removed
val new_filtered_df = remover.transform(tokenized_df)

  

>     import org.apache.spark.ml.feature.StopWordsRemover
>     remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_3d7dc1a9b2ef
>     new_filtered_df: org.apache.spark.sql.DataFrame = [id: bigint, corpus: string ... 4 more fields]
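
Conceptually, `StopWordsRemover` drops every token that matches an entry
of the stopword list; by default the match ignores case (its
`caseSensitive` parameter defaults to `false`). The following plain-Scala
sketch imitates that behaviour on a made-up token list. It is only an
illustration, not Spark's implementation, and `stopSet`, `tokens` and
`filtered` are hypothetical names.

```scala
// A tiny stopword set and a made-up tokenized line of dialog.
val stopSet: Set[String] = Set("what", "whatever", "you")
val tokens: Seq[String] = Seq("Whatever", "you", "want", "captain")

// Drop tokens whose lower-cased form is in the stopword set.
val filtered: Seq[String] = tokens.filterNot(t => stopSet.contains(t.toLowerCase))

println(filtered.mkString(", "))
```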

In [None]:
// Set Params for CountVectorizer
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(10000)
  .setMinDF(5)
  .fit(new_filtered_df)

// Create new df of countVectors
val new_countVectors = vectorizer.transform(new_filtered_df).select("id", "features")

  

>     vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_2fcb7a8b0dc8
>     new_countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]
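
Roughly speaking, `CountVectorizer` keeps only terms that occur in at
least `minDF` documents, ranks them by total count across the corpus,
keeps the top `vocabSize`, and then turns each document into a vector of
term counts. The plain-Scala sketch below mimics that on three toy
documents. The data, the tiny `minDF = 2` and the alphabetical tie-break
are assumptions for illustration; this is not Spark's implementation.

```scala
// Three toy "documents", already tokenized and stopword-filtered.
val docs: Seq[Seq[String]] = Seq(
  Seq("know", "just", "know"),
  Seq("know", "like"),
  Seq("just", "captain")
)
val minDF = 2       // a term must appear in at least this many documents
val vocabSize = 10  // upper bound on the vocabulary size

// Document frequency: number of documents containing each term.
val docFreq: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }

// Term frequency: total number of occurrences across the corpus.
val termFreq: Map[String, Int] =
  docs.flatten.groupBy(identity).map { case (t, occ) => (t, occ.size) }

// Vocabulary: terms passing minDF, most frequent first (ties broken alphabetically here).
val vocab: Array[String] = termFreq.toSeq
  .filter { case (t, _) => docFreq(t) >= minDF }
  .sortBy { case (t, n) => (-n, t) }
  .take(vocabSize)
  .map(_._1)
  .toArray

// One count vector per document, indexed by position in the vocabulary.
val index: Map[String, Int] = vocab.zipWithIndex.toMap
val countVectors: Seq[Array[Int]] = docs.map { doc =>
  val v = Array.fill(vocab.length)(0)
  doc.foreach(t => index.get(t).foreach(i => v(i) += 1))
  v
}

countVectors.foreach(v => println(v.mkString("[", ", ", "]")))
```

Terms that fail the `minDF` filter ("like" and "captain" here) are simply
ignored when the vectors are built.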

In [None]:
// Convert DF to RDD
val new_lda_countVector = new_countVectors.map { case Row(id: Long, countVector: MLVector) => (id, Vectors.fromML(countVector)) }.rdd

  

>     new_lda_countVector: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[3955] at rdd at command-753740454082314:2

  

We will also increase MaxIterations to 10 to see if we get better
results.

In [None]:
// Set LDA parameters
val new_lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
  .setK(numTopics)
  .setMaxIterations(10)
  .setDocConcentration(-1) // use default values
  .setTopicConcentration(-1) // use default values

  

>     new_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@5fca2e4f
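
A quick sanity check on the optimizer settings above: the Spark
documentation for `setMiniBatchFraction` suggests choosing it together
with `maxIterations` so that `maxIterations * miniBatchFraction >= 1`,
otherwise part of the corpus may never be sampled. The plain-Scala
arithmetic below applies that rule of thumb to the values used in this
notebook; `expectedPasses` and `usesWholeCorpus` are made-up names.

```scala
// Values taken from the LDA cell above.
val maxIterations = 10
val miniBatchFraction = 0.8

// Each iteration samples roughly this fraction of the corpus, so this is
// the expected number of passes over the data.
val expectedPasses: Double = maxIterations * miniBatchFraction

// Rule of thumb: the product should be at least 1.
val usesWholeCorpus: Boolean = expectedPasses >= 1.0

println(s"expected passes: $expectedPasses, whole corpus covered: $usesWholeCorpus")
```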

  

#### How to find what the default values are?

Dive into the source!!!

1.  Let's find the default value for `docConcentration` now.
2.  Go to the Apache Spark package root:
    <https://spark.apache.org/docs/latest/api/scala/#package>

-   Search for 'ml' in the search box at the top left (ml is for the ml
    library).
-   Then find `LDA` by scrolling down on the left to mllib's
    `clustering` package and clicking on `LDA`.
-   Then click on the source code link, which should take you here:
    -   <https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala>
    -   Now, simply go to the right function and see the following
        comment block:

    ```
    /**
     * Concentration parameter (commonly named "alpha") for the prior placed on documents'
     * distributions over topics ("theta").
     *
     * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
     * (more regularization).
     *
     * If not set by the user, then docConcentration is set automatically. If set to
     * singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
     * Otherwise, the [[docConcentration]] vector must be length k.
     * (default = automatic)
     *
     * Optimizer-specific parameter settings:
     *  - EM
     *     - Currently only supports symmetric distributions, so all values in the vector should be
     *       the same.
     *     - Values should be > 1.0
     *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
     *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
     *  - Online
     *     - Values should be >= 0
     *     - default = uniformly (1.0 / k), following the implementation from
     *       [[https://github.com/Blei-Lab/onlineldavb]].
     * @group param
     */
    ```
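
To make the two formulas in the comment block concrete, here they are
evaluated for the 20 topics used in this notebook. This is plain
arithmetic based on the v1.6.1 comment quoted above; other Spark versions
may use different defaults, and the names `emDefaultAlpha` and
`onlineDefaultAlpha` are made up for illustration.

```scala
// Number of topics used in this notebook (see the "20 topics:" output).
val k = 20

// EM optimizer: uniformly (50 / k) + 1
val emDefaultAlpha: Double = 50.0 / k + 1

// Online optimizer: uniformly (1.0 / k)
val onlineDefaultAlpha: Double = 1.0 / k

println(s"EM default docConcentration:     $emDefaultAlpha")
println(s"Online default docConcentration: $onlineDefaultAlpha")
```

The EM default is larger than 1 and the online default is much smaller
than 1, so the two optimizers start from quite different priors on the
per-document topic mixtures.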

**HOMEWORK:** Try to find the default value for `topicConcentration`.

In [None]:
// Create LDA model with stopwords refiltered
val new_ldaModel = new_lda.run(new_lda_countVector)

  

>     new_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@3f1301a7

In [None]:
val topicIndices = new_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     right	0.002368539995607174
>     love	0.0019026093436816463
>     just	0.001739005396343051
>     okay	0.001493567868602809
>     know	0.0011919944841106388
>     ==========
>     TOPIC 1
>     like	0.012255569993473736
>     just	0.007532527227834193
>     come	0.007114873840600518
>     know	0.006960825682483897
>     think	0.006460380113586568
>     ==========
>     TOPIC 2
>     know	0.017593342399778864
>     yeah	0.01729763439457538
>     gonna	0.014297985209693677
>     just	0.009395640487800467
>     tell	0.007112117826339655
>     ==========
>     TOPIC 3
>     just	0.002310885836348927
>     know	0.0020049203493508585
>     better	0.001839601963450054
>     like	0.0016545385663972387
>     right	0.001505081787498549
>     ==========
>     TOPIC 4
>     know	0.012396058201765845
>     didn	0.004786910731106122
>     like	0.004783067030382327
>     right	0.003733205551673614
>     just	0.0028592628116592403
>     ==========
>     TOPIC 5
>     just	0.0028236500929191208
>     know	0.0026011344347436015
>     going	0.0015951009390631876
>     didn	0.001385667983895007
>     wait	0.001275555813151892
>     ==========
>     TOPIC 6
>     going	0.00275337137203844
>     right	0.001685679960504387
>     just	0.0015380845174617235
>     know	0.0014818062892167352
>     captain	0.0013896743515293423
>     ==========
>     TOPIC 7
>     going	0.011956735401221285
>     just	0.006541063462593452
>     know	0.005428932374204778
>     think	0.004308569608730405
>     believe	0.003696595226603709
>     ==========
>     TOPIC 8
>     think	0.0019959039820595533
>     sorry	0.00198077299794292
>     know	0.0016723315231586236
>     shit	0.0015606901977245095
>     right	0.0013015271817698212
>     ==========
>     TOPIC 9
>     know	0.003615862714921936
>     said	0.001961114693915351
>     sorry	0.0018595382287745752
>     like	0.0017819242854891695
>     think	0.0016468683030306027
>     ==========
>     TOPIC 10
>     time	0.008784671423019166
>     want	0.00282365356227211
>     sure	0.0024833597381016476
>     know	0.0019777615447230884
>     right	0.0016576304456760946
>     ==========
>     TOPIC 11
>     just	0.0021068918389201634
>     like	0.0020497480766035994
>     know	0.002022347553873645
>     want	0.0019500819941038825
>     said	0.001503771370040063
>     ==========
>     TOPIC 12
>     look	0.00433587608823225
>     think	0.0025833796049907604
>     know	0.002007970741805987
>     going	0.0016840410422251017
>     just	0.0010661551551733228
>     ==========
>     TOPIC 13
>     know	0.0020279945673448915
>     come	0.0019980250335794405
>     think	0.0012733121858788797
>     going	0.001192108885417234
>     okay	0.001186180285931844
>     ==========
>     TOPIC 14
>     like	0.004262090436242644
>     right	0.0021537790725358777
>     just	0.0013683197398457016
>     know	0.0010911699327713488
>     look	0.0010869000557749361
>     ==========
>     TOPIC 15
>     come	0.004769396664496132
>     know	0.0026229974920448534
>     like	0.0021612642420959253
>     just	0.0013228057897488347
>     right	0.001171812635848879
>     ==========
>     TOPIC 16
>     know	0.025323543461007635
>     just	0.018361261941348715
>     like	0.01574431601713426
>     want	0.014855701536091734
>     think	0.011957607420818889
>     ==========
>     TOPIC 17
>     like	0.004346004035796333
>     know	0.0022903208899377127
>     just	0.002008680613491114
>     little	0.0019547134832950414
>     maybe	0.0017287784612649724
>     ==========
>     TOPIC 18
>     know	0.003217184151682409
>     think	0.003063734585623867
>     just	0.0018328245079520728
>     want	0.0017709019452594528
>     like	0.0016903614729120188
>     ==========
>     TOPIC 19
>     hello	0.008911727886543675
>     stop	0.0025143616929346174
>     just	0.0023958078165974795
>     like	0.00184251815055585
>     come	0.0018199130672157007
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(5, 27, 1, 16, 0),Array(0.002368539995607174, 0.0019026093436816463, 0.001739005396343051, 0.001493567868602809, 0.0011919944841106388)), (Array(2, 1, 10, 0, 4),Array(0.012255569993473736, 0.007532527227834193, 0.007114873840600518, 0.006960825682483897, 0.006460380113586568)), (Array(0, 8, 20, 1, 9),Array(0.017593342399778864, 0.01729763439457538, 0.014297985209693677, 0.009395640487800467, 0.007112117826339655)), (Array(1, 0, 36, 2, 5),Array(0.002310885836348927, 0.0020049203493508585, 0.001839601963450054, 0.0016545385663972387, 0.001505081787498549)), (Array(0, 13, 2, 5, 1),Array(0.012396058201765845, 0.004786910731106122, 0.004783067030382327, 0.003733205551673614, 0.0028592628116592403)), (Array(1, 0, 6, 13, 57),Array(0.0028236500929191208, 0.0026011344347436015, 0.0015951009390631876, 0.001385667983895007, 0.001275555813151892)), (Array(6, 5, 1, 0, 233),Array(0.00275337137203844, 0.001685679960504387, 0.0015380845174617235, 0.0014818062892167352, 0.0013896743515293423)), (Array(6, 1, 0, 4, 40),Array(0.011956735401221285, 0.006541063462593452, 0.005428932374204778, 0.004308569608730405, 0.003696595226603709)), (Array(4, 26, 0, 42, 5),Array(0.0019959039820595533, 0.00198077299794292, 0.0016723315231586236, 0.0015606901977245095, 0.0013015271817698212)), (Array(0, 23, 26, 2, 4),Array(0.003615862714921936, 0.001961114693915351, 0.0018595382287745752, 0.0017819242854891695, 0.0016468683030306027)), (Array(11, 3, 19, 0, 5),Array(0.008784671423019166, 0.00282365356227211, 0.0024833597381016476, 0.0019777615447230884, 0.0016576304456760946)), (Array(1, 2, 0, 3, 23),Array(0.0021068918389201634, 0.0020497480766035994, 0.002022347553873645, 0.0019500819941038825, 0.001503771370040063)), (Array(12, 4, 0, 6, 1),Array(0.00433587608823225, 0.0025833796049907604, 0.002007970741805987, 0.0016840410422251017, 0.0010661551551733228)), (Array(0, 10, 4, 6, 16),Array(0.0020279945673448915, 0.0019980250335794405, 
0.0012733121858788797, 0.001192108885417234, 0.001186180285931844)), (Array(2, 5, 1, 0, 12),Array(0.004262090436242644, 0.0021537790725358777, 0.0013683197398457016, 0.0010911699327713488, 0.0010869000557749361)), (Array(10, 0, 2, 1, 5),Array(0.004769396664496132, 0.0026229974920448534, 0.0021612642420959253, 0.0013228057897488347, 0.001171812635848879)), (Array(0, 1, 2, 3, 4),Array(0.025323543461007635, 0.018361261941348715, 0.01574431601713426, 0.014855701536091734, 0.011957607420818889)), (Array(2, 0, 1, 18, 24),Array(0.004346004035796333, 0.0022903208899377127, 0.002008680613491114, 0.0019547134832950414, 0.0017287784612649724)), (Array(0, 4, 1, 3, 2),Array(0.003217184151682409, 0.003063734585623867, 0.0018328245079520728, 0.0017709019452594528, 0.0016903614729120188)), (Array(121, 68, 1, 2, 10),Array(0.008911727886543675, 0.0025143616929346174, 0.0023958078165974795, 0.00184251815055585, 0.0018199130672157007)))
>     vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, stop, hear, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, school, today, meet, gone, crazy, wants, damn, forget, cause, problem, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, makes, supposed, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, happy, open, lost, john, hurt, town, ready, alright, late, actually, gave, married, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, number, tried, means, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, waiting, christ, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, 
thirty, million, fact, hands, died, power, started, stupid, shouldn, months, boys, city, sense, dinner, running, hour, shoot, drive, fight, speak, george, living, ship, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, outside, safe, moment, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sister, sounds, lives, till, promise, sound, weren, save, poor, cool, asking, shall, plan, king, bitch, daughter, beat, weeks, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, food, tired, company, lucky, pull, wonderful, touch, looked, state, thinks, picture, words, leaving, control, clear, known, special, buddy, luck, follow, order, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, coffee, takes, ride, parents, miles, works, secret, hotel, explain, worse, kidding, past, outta, general, unless, felt, drop, throw, hang, interested, certainly, absolutely, earth, loved, wonder, dark, accident, seeing, doin, turned, simple, clock, date, sweet, meeting, clean, sign, feet, handle, army, music, giving, report, cops, fucked, charlie, information, yesterday, smart, fall, fault, class, bank, month, blow, swear, caught, major, paul, road, talked, choice, boss, plane, david, paid, wear, american, worried, clothes, ones, lord, goodbye, paper, terrible, strange, mistake, given, kept, finish, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, fair, imagine, difference, quiet, happening, near, quit, personal, marry, figured, rose, future, building, kinda, agent, early, mama, michael, watching, trip, private, busy, record, certain, jimmy, broke, longer, sake, store, finally, boat, stick, born, sitting, evening, bucks, history, chief, lying, ought, honor, kiss, darling, lunch, uncle, fool, favor, respect, rich, land, liked, killing, peter, tough, brain, 
interesting, completely, welcome, nick, problems, wake, radio, dick, honest, cash, dance, dude, james, bout, floor, weird, court, jail, calls, window, involved, drunk, johnny, officer, needed, asshole, spend, situation, books, relax, pain, grand, dangerous, service, letter, stopped, security, realize, offer, table, message, bastard, killer, instead, jake, deep, nervous, pass, somethin, evil, english, bought, short, step, ring, picked, likes, machine, voice, eddie, upset, carry, forgot, lived, afternoon, fear, finished, quick, count, forgive, wrote, named, decided, totally, space, team, lawyer, pleasure, doubt, suit, station, gotten, bother, return, prove, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, force, church, attack, appreciate, college, standing, hungry, present, dying, charge, prison, missing, truck, board, public, staying, calm, gold, ball, hardly, hadn, lead, missed, island, government, cover, horse, reach, joke, french, fish, star, america, moved, soul, surprise, mike, putting, seconds, club, self, movies, dress, cost, lots, price, listening, saved, smell, mark, peace, dreams, crime, gives, entire, department, usually, single, holy, west, beer, nose, wall, stuck, protect, ways, teach, train, grow, awful, type, forever, rock, detective, billy, dumb, papers, walking, beginning, planet, folks, park, attention, card, hide, birthday, master, share, lieutenant, starting, test, reading, field, partner, twice, enjoy, film, bomb, mess, blame, dollar, loves, girlfriend, south, round, records, especially, using, plenty, gentlemen, evidence, silly, admit, experience, fired, normal, talkin, lock, mission, memory, louis, fighting, notice, crap, wedding, promised, ground, idiot, orders, marriage, guns, glass, impossible, heaven, knock, spent, neck, wondering, green, animal, hole, press, drugs, nuts, position, broken, names, asleep, jerry, acting, feels, visit, plans, boyfriend, smoke, paris, wind, tells, gimme, holding, cross, sheriff, 
walked, mention, judge, code, writing, double, brothers, keeps, pardon, fellow, fell, closed, lovely, angry, cute, percent, surprised, charles, agree, bathroom, correct, address, ridiculous, summer, andy, rules, tommy, group, account, note, learned, colonel, pulled, sing, laugh, proud, sleeping, area, built, jump, upstairs, difficult, river, bobby, dirty, breakfast, bridge, betty, locked, amazing, north, alex, definitely, plus, feelings, accept, kick, worst, grace, gettin, wild, stories, steal, seriously, file, relationship, advice, nature, places, waste, contact, spot, apart, knowing, stole, beach, favorite, loose, level, song, faith, risk, played, eating, foot, patient, witness, turns, washington, action, build, obviously, begin, split, crew, command, games, decide, tight, nurse, keeping, bird, form, runs, copy, scene, jeffrey, arrest, complete, taste, consider, insane, teeth, shoes, henry, career, sooner, monster, devil, hall, innocent, showed, study, gift, weekend, heavy, keys, greatest, comin, destroy, danger, track, raise, suddenly, hanging, bruce, carl, california, apologize, concerned, blind, program, medical, chicken, sweetheart, drinking, forward, seventy, willing, shop, guard, legs, suspect, professor, admiral, data, ticket, camp, tree, goodnight, paying, burn, losing, possibly, dunno, television, senator, trick, murdered, dropped, extra, credit, starts, warm, stone, sold, hiding, meaning, taught, marty, cheap, lately, simply, science, lookin, following, harold, queen, majesty, jeff, corner, cars, heads, training, seat, duty, noticed, helped, bear, enemy, discuss, responsible, trial, dave)
>     topics: Array[Array[(String, Double)]] = Array(Array((right,0.002368539995607174), (love,0.0019026093436816463), (just,0.001739005396343051), (okay,0.001493567868602809), (know,0.0011919944841106388)), Array((like,0.012255569993473736), (just,0.007532527227834193), (come,0.007114873840600518), (know,0.006960825682483897), (think,0.006460380113586568)), Array((know,0.017593342399778864), (yeah,0.01729763439457538), (gonna,0.014297985209693677), (just,0.009395640487800467), (tell,0.007112117826339655)), Array((just,0.002310885836348927), (know,0.0020049203493508585), (better,0.001839601963450054), (like,0.0016545385663972387), (right,0.001505081787498549)), Array((know,0.012396058201765845), (didn,0.004786910731106122), (like,0.004783067030382327), (right,0.003733205551673614), (just,0.0028592628116592403)), Array((just,0.0028236500929191208), (know,0.0026011344347436015), (going,0.0015951009390631876), (didn,0.001385667983895007), (wait,0.001275555813151892)), Array((going,0.00275337137203844), (right,0.001685679960504387), (just,0.0015380845174617235), (know,0.0014818062892167352), (captain,0.0013896743515293423)), Array((going,0.011956735401221285), (just,0.006541063462593452), (know,0.005428932374204778), (think,0.004308569608730405), (believe,0.003696595226603709)), Array((think,0.0019959039820595533), (sorry,0.00198077299794292), (know,0.0016723315231586236), (shit,0.0015606901977245095), (right,0.0013015271817698212)), Array((know,0.003615862714921936), (said,0.001961114693915351), (sorry,0.0018595382287745752), (like,0.0017819242854891695), (think,0.0016468683030306027)), Array((time,0.008784671423019166), (want,0.00282365356227211), (sure,0.0024833597381016476), (know,0.0019777615447230884), (right,0.0016576304456760946)), Array((just,0.0021068918389201634), (like,0.0020497480766035994), (know,0.002022347553873645), (want,0.0019500819941038825), (said,0.001503771370040063)), Array((look,0.00433587608823225), (think,0.0025833796049907604), 
(know,0.002007970741805987), (going,0.0016840410422251017), (just,0.0010661551551733228)), Array((know,0.0020279945673448915), (come,0.0019980250335794405), (think,0.0012733121858788797), (going,0.001192108885417234), (okay,0.001186180285931844)), Array((like,0.004262090436242644), (right,0.0021537790725358777), (just,0.0013683197398457016), (know,0.0010911699327713488), (look,0.0010869000557749361)), Array((come,0.004769396664496132), (know,0.0026229974920448534), (like,0.0021612642420959253), (just,0.0013228057897488347), (right,0.001171812635848879)), Array((know,0.025323543461007635), (just,0.018361261941348715), (like,0.01574431601713426), (want,0.014855701536091734), (think,0.011957607420818889)), Array((like,0.004346004035796333), (know,0.0022903208899377127), (just,0.002008680613491114), (little,0.0019547134832950414), (maybe,0.0017287784612649724)), Array((know,0.003217184151682409), (think,0.003063734585623867), (just,0.0018328245079520728), (want,0.0017709019452594528), (like,0.0016903614729120188)), Array((hello,0.008911727886543675), (stop,0.0025143616929346174), (just,0.0023958078165974795), (like,0.00184251815055585), (come,0.0018199130672157007)))

  

Step 9. Create LDA model with Expectation Maximization
------------------------------------------------------

Let's create an LDA model with Expectation Maximization on the data
that has been refiltered for additional stopwords. We also increase the
maximum number of iterations to 100 to see whether that improves the
results. See:

-   <http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>.

In [None]:
import org.apache.spark.mllib.clustering.EMLDAOptimizer

// Set LDA parameters
val em_lda = new LDA()
  .setOptimizer(new EMLDAOptimizer())
  .setK(numTopics)
  .setMaxIterations(100)
  .setDocConcentration(-1) // -1 selects the optimizer's default value
  .setTopicConcentration(-1) // -1 selects the optimizer's default value

  

>     import org.apache.spark.mllib.clustering.EMLDAOptimizer
>     em_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@7c84d0ae
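
The `-1` passed to `setDocConcentration` and `setTopicConcentration` above asks MLlib to choose the priors automatically. As far as the MLlib documentation describes it, the EM optimizer then uses a symmetric document concentration of `(50 / k) + 1` and a topic concentration of `0.1 + 1`. The sketch below only reproduces that arithmetic in plain Scala for this notebook's `k = 20`; the helper names are hypothetical and are not part of the Spark API.

```scala
// Hypothetical helpers (not Spark API): the documented EM defaults, as arithmetic.
def emDefaultDocConcentration(k: Int): Double = 50.0 / k + 1.0
val emDefaultTopicConcentration: Double = 0.1 + 1.0

// This notebook's run reports 20 topics.
val kTopics = 20
println(f"docConcentration default for k=$kTopics: ${emDefaultDocConcentration(kTopics)}%.2f")
println(f"topicConcentration default: $emDefaultTopicConcentration%.2f")
```

If the defaults give overly similar topics, these are the two values one would typically vary explicitly instead of passing `-1`.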

In [None]:
val em_ldaModel = em_lda.run(new_lda_countVector) // takes a long time (about 22 minutes)

  

>     em_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.DistributedLDAModel@188f58bf

In [None]:
import org.apache.spark.mllib.clustering.DistributedLDAModel
val em_DldaModel = em_ldaModel.asInstanceOf[DistributedLDAModel]

  

>     import org.apache.spark.mllib.clustering.DistributedLDAModel
>     em_DldaModel: org.apache.spark.mllib.clustering.DistributedLDAModel = org.apache.spark.mllib.clustering.DistributedLDAModel@188f58bf

In [None]:
val top10ConversationsPerTopic = em_DldaModel.topDocumentsPerTopic(10)

  

>     top10ConversationsPerTopic: Array[(Array[Long], Array[Double])] = Array((Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.03185722402229515, 0.031196200176884056, 0.020282154018599348, 0.01099645315549, 0.01099645315549)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.035359403020952286, 0.03471449892991739, 0.022506359024306477, 0.0112575667750105, 0.0112575667750105)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02454380082221751, 0.02390214852968437, 0.01563567947724491, 0.01037864738296468, 0.01037864738296468)), (Array(69318, 15221, 15149, 23167, 59606, 51632, 51639, 64470, 67338, 66968),Array(0.9999514001066685, 0.999945172626603, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946, 0.9999406008121946)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05060524129300216, 0.05027652950865874, 0.032180017393406625, 0.011630224445618545, 0.011630224445618545)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.043670158470644385, 0.04313069897321676, 0.027782187731151566, 0.01176138814023006, 0.01176138814023006)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04718260291563998, 0.04718260291563998, 0.04718260291563998, 
0.04718260291563998, 0.04718260291563998, 0.04718260291563998, 0.046744086701796375, 0.030009654606021424, 0.011639894189919175, 0.011639894189919175)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04675497162050312, 0.04630740194891603, 0.02973828460805704, 0.011613267488918038, 0.011613267488918038)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02922347731698308, 0.02856137645845995, 0.01860910369309368, 0.010781174428638705, 0.010781174428638705)), (Array(39677, 39693, 39680, 39674, 39682, 39679, 39681, 39676, 41932, 41967),Array(0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05098977390451263, 0.05065478074966324, 0.032424728159182487, 0.011804021891259129, 0.011804021891259129)), (Array(39677, 39693, 39680, 39674, 39682, 39679, 39681, 39676, 41932, 41967),Array(0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.05458945924257291, 0.054382561204063574, 0.03470709039767808, 0.011833044488991173, 0.011833044488991173)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.02757805103727522, 0.026914567318365282, 0.017564142155031864, 0.010769649833867787, 0.010769649833867787)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.04651386476447619, 0.046074356990568124, 0.029584640688217628, 0.011466200232752964, 0.011466200232752964)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 
39676, 41932, 41967),Array(0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.05592908452117546, 0.055741511067049866, 0.03555776746162745, 0.01211237486139149, 0.01211237486139149)), (Array(39681, 39674, 39677, 39693, 39680, 39682, 39679, 39676, 41932, 41967),Array(0.06048967215526035, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.060441959490269634, 0.03841642394224334, 0.011870824949431247, 0.011870824949431247)), (Array(39681, 39679, 39677, 39693, 39680, 39674, 39682, 39676, 41932, 41967),Array(0.06567035036792095, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.06540012688944775, 0.041559067084071775, 0.012094068526188358, 0.012094068526188358)), (Array(39681, 39674, 39677, 39693, 39680, 39682, 39679, 39676, 41932, 41967),Array(0.07103855727399273, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.07041257887551527, 0.04473147169952923, 0.011812341727214322, 0.011812341727214322)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04379464422302749, 0.04327692350658171, 0.027860179926002853, 0.01152922335209424, 0.01152922335209424)), (Array(39677, 39693, 39680, 39679, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04365837465882906, 0.04316378247947061, 0.02777238708460723, 0.011237658307104376, 0.011237658307104376)), (Array(39679, 39677, 39693, 39680, 39674, 39682, 39681, 39676, 41932, 41967),Array(0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.04674235176753242, 0.0463101731327656, 0.02972951504690979, 
0.011452462524389802, 0.011452462524389802)))

In [None]:
top10ConversationsPerTopic.length // number of topics

  

>     res52: Int = 20
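
The `topDocumentsPerTopic` result has one entry per topic, and each entry is a pair of parallel arrays: document IDs and their weights for that topic. The sketch below shows one way to zip those arrays into readable `(docId, weight)` pairs. It runs on a small sample copied from the output above (the first three documents of the first and fourth entries) rather than on the model itself, and the names `sample` and `paired` are illustrative only.

```scala
// Sample copied from the output above, truncated to three documents per entry.
val sample: Array[(Array[Long], Array[Double])] = Array(
  (Array(39677L, 39693L, 39680L),
   Array(0.03185722402229515, 0.03185722402229515, 0.03185722402229515)),
  (Array(69318L, 15221L, 15149L),
   Array(0.9999514001066685, 0.999945172626603, 0.9999406008121946))
)

// Pair every document ID with its weight.
val paired: Array[Array[(Long, Double)]] =
  sample.map { case (docIds, weights) => docIds.zip(weights) }

paired.zipWithIndex.foreach { case (docs, i) =>
  println(s"sample entry $i")
  docs.foreach { case (docId, weight) => println(s"  doc $docId\t$weight") }
}
```

The same `map`/`zip` pattern should apply to the full `top10ConversationsPerTopic` array, since it has the same type.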

In [None]:
//em_DldaModel.topicDistributions.take(10).foreach(println)

  

  

Note that the EMLDAOptimizer produces a DistributedLDAModel, which
stores not only the inferred topics but also the full training corpus
and topic distributions for each document in the training corpus.
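
To make "topic distributions for each document" concrete: each training document gets a vector of topic weights, which the commented-out cell above would print via `topicDistributions`. The sketch below uses made-up numbers with three topics (not values from this notebook) to show how one might pick the dominant topic of each document in plain Scala.

```scala
// Made-up (docId, topic weights) pairs for illustration only; three topics.
val toyTopicDistributions: Seq[(Long, Array[Double])] = Seq(
  (0L, Array(0.7, 0.2, 0.1)),
  (1L, Array(0.1, 0.1, 0.8))
)

// The dominant topic of a document is the index of its largest weight.
val dominantTopic: Seq[(Long, Int)] =
  toyTopicDistributions.map { case (docId, dist) => (docId, dist.indexOf(dist.max)) }

dominantTopic.foreach { case (docId, t) => println(s"doc $docId -> topic $t") }
```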

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)

  

>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(6435, 9153, 2611, 9555, 9235),Array(1.0844350865928232E-5, 1.4037356622456141E-6, 1.0198257636937534E-6, 1.010016392533973E-6, 9.877489659219E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.2201894817101623E-5, 1.4560010186049552E-6, 1.0547580487281058E-6, 1.0446104695648421E-6, 1.0214202904824573E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(8.080320102037276E-6, 1.2828806625265042E-6, 9.387148884503143E-7, 9.296944883594565E-7, 9.095512260026888E-7)), (Array(0, 1, 2, 3, 4),Array(0.4097048012129488, 0.2966641691130405, 0.28104437242573427, 0.2068481221090779, 0.20178462784115517)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.7791420865642426E-5, 1.5285401934315644E-6, 1.1022151610359566E-6, 1.0916092052333647E-6, 1.0671154286074535E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5488156532652564E-5, 1.5613578155095174E-6, 1.1250530213722066E-6, 1.1142275765190935E-6, 1.0891766415036671E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.66201985348282E-5, 1.5337088341752489E-6, 1.1062252459821718E-6, 1.0955808549686414E-6, 1.0710096202234095E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.6434785283305463E-5, 1.527062738898831E-6, 1.1015632294086975E-6, 1.0909636379478556E-6, 1.066504587082138E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(9.82890555944203E-6, 1.360381381982805E-6, 9.90695338703216E-7, 9.811686105969582E-7, 9.596620143926599E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.8098274080649888E-5, 1.5662560052135424E-6, 1.127571968783498E-6, 1.1167221871321394E-6, 1.0915664277968502E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.9443267750173392E-5, 1.5746049595955017E-6, 1.1333735056120856E-6, 1.1224679386855895E-6, 1.0971718558358495E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(9.292996004992278E-6, 1.3619125930485615E-6, 9.924672219451632E-7, 9.82924355173023E-7, 9.614096002911668E-7)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.619330796598465E-5, 
1.4932367796221485E-6, 1.0785114269963956E-6, 1.06813378362302E-6, 1.0442595139466752E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.0195445781462442E-5, 1.6338598744234947E-6, 1.1726861776132844E-6, 1.1614034519421386E-6, 1.1350541791534873E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.1543159186970775E-5, 1.5791785506830092E-6, 1.1358076217376717E-6, 1.1248786573437884E-6, 1.099486352793341E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.3565018803229148E-5, 1.6252544688003071E-6, 1.16608206417593E-6, 1.1548627950846766E-6, 1.1286452359926982E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(2.498926755354901E-5, 1.5618937315237142E-6, 1.1234358831022108E-6, 1.1126257210374892E-6, 1.0875181953216021E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5342892698391062E-5, 1.5117065677915513E-6, 1.0917779017440848E-6, 1.0812727863583168E-6, 1.057095929328646E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.5018034022313325E-5, 1.4466343222454145E-6, 1.04735014561389E-6, 1.0372732437703543E-6, 1.0142213705569144E-6)), (Array(6435, 9153, 2611, 9555, 9235),Array(1.627670929533595E-5, 1.4917359134556584E-6, 1.0776961757775105E-6, 1.067326467379095E-6, 1.043483836251971E-6)))

In [None]:
val vocabList = vectorizer.vocabulary

  

>     vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, hear, stop, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, problem, cause, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, supposed, makes, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, christ, waiting, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, 
thirty, fact, million, died, hands, power, stupid, started, shouldn, months, boys, city, sense, dinner, running, hour, shoot, fight, drive, speak, george, ship, living, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, safe, moment, outside, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, promise, lives, till, sound, weren, save, poor, cool, shall, asking, plan, king, bitch, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, company, food, tired, lucky, pull, wonderful, touch, looked, thinks, state, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, hotel, explain, kidding, worse, past, outta, general, felt, drop, unless, throw, interested, hang, certainly, absolutely, earth, loved, dark, wonder, accident, seeing, turned, clock, simple, doin, date, sweet, meeting, clean, sign, feet, handle, music, report, giving, army, fucked, cops, charlie, smart, yesterday, information, fall, fault, bank, class, month, blow, swear, caught, major, paul, road, talked, choice, plane, boss, david, paid, wear, american, worried, lord, paper, goodbye, clothes, ones, terrible, strange, given, mistake, finish, kept, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, figured, future, rose, building, mama, michael, early, agent, kinda, watching, private, trip, record, certain, busy, jimmy, broke, sake, longer, store, boat, stick, finally, born, evening, sitting, bucks, ought, chief, lying, history, kiss, honor, darling, lunch, favor, fool, uncle, respect, rich, land, liked, killing, peter, tough, brain, 
interesting, completely, problems, welcome, nick, wake, honest, radio, dick, cash, dance, dude, james, bout, floor, weird, court, calls, jail, drunk, window, involved, johnny, officer, needed, asshole, situation, spend, books, relax, pain, service, grand, dangerous, letter, security, stopped, offer, realize, table, bastard, message, instead, killer, jake, deep, nervous, somethin, pass, evil, english, bought, short, step, ring, picked, likes, machine, eddie, voice, upset, forgot, carry, lived, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, pleasure, doubt, lawyer, station, gotten, suit, bother, prove, return, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, attack, appreciate, force, church, college, hungry, standing, present, dying, prison, missing, charge, board, truck, public, calm, gold, staying, ball, hardly, hadn, missed, lead, island, government, horse, cover, french, reach, joke, fish, star, mike, surprise, america, moved, soul, dress, seconds, club, self, putting, movies, lots, cost, listening, price, saved, smell, mark, peace, dreams, entire, crime, gives, usually, single, department, holy, beer, west, protect, stuck, wall, nose, ways, teach, forever, grow, train, type, awful, rock, detective, billy, walking, dumb, papers, beginning, planet, folks, park, attention, birthday, hide, card, master, share, reading, test, starting, lieutenant, field, partner, enjoy, twice, film, dollar, bomb, mess, blame, south, loves, girlfriend, round, records, using, plenty, especially, gentlemen, evidence, silly, experience, admit, fired, normal, talkin, mission, louis, memory, fighting, lock, notice, crap, wedding, promised, marriage, ground, guns, glass, idiot, orders, impossible, heaven, knock, hole, neck, animal, spent, green, wondering, nuts, press, drugs, broken, position, names, asleep, jerry, visit, boyfriend, acting, feels, plans, paris, smoke, tells, wind, cross, holding, sheriff, gimme, 
walked, mention, writing, double, brothers, code, judge, pardon, keeps, fellow, fell, closed, lovely, angry, cute, charles, surprised, percent, correct, bathroom, agree, address, andy, ridiculous, summer, tommy, rules, group, account, note, pulled, sleeping, sing, learned, proud, laugh, colonel, upstairs, river, difficult, built, jump, area, dirty, betty, bridge, breakfast, bobby, locked, amazing, north, feelings, alex, plus, definitely, worst, accept, kick, seriously, grace, steal, wild, stories, file, gettin, relationship, advice, nature, contact, spot, places, waste, knowing, beach, stole, apart, favorite, faith, level, loose, risk, song, eating, foot, played, patient, washington, turns, witness, action, build, obviously, begin, split, crew, command, games, tight, decide, nurse, keeping, runs, form, bird, copy, insane, complete, arrest, consider, taste, scene, jeffrey, teeth, shoes, career, henry, sooner, devil, monster, showed, weekend, gift, innocent, study, heavy, hall, comin, danger, greatest, track, keys, raise, destroy, concerned, program, carl, blind, apologize, suddenly, hanging, bruce, california, chicken, seventy, forward, drinking, sweetheart, medical, suspect, admiral, guard, shop, professor, legs, willing, camp, data, ticket, tree, goodnight, television, losing, senator, murdered, burn, dunno, paying, possibly, trick, dropped, credit, extra, starts, warm, hiding, meaning, sold, stone, taught, marty, lately, cheap, lookin, science, simply, jeff, corner, harold, following, majesty, queen, duty, cars, training, heads, seat, discuss, bear, enemy, helped, noticed, common, screw, dave)

In [None]:
vocabList.size

  

>     res32: Int = 10000

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

  

>     topics: Array[Array[(String, Double)]] = Array(Array((just,0.030515134931284552), (like,0.02463563559747823), (want,0.022529385381465025), (damn,0.02094828832824297), (going,0.0203407289886203)), Array((yeah,0.10787301090151602), (look,0.0756831002291994), (know,0.04815746564274915), (wait,0.03897182014529944), (night,0.0341458394828345)), Array((gonna,0.08118584492034046), (money,0.051736711600637544), (shit,0.04620430294274594), (fuck,0.0399843125556081), (kill,0.03672740843080258)), Array((people,0.020091372023286612), (know,0.018613400462887356), (work,0.016775643603287843), (does,0.015522555458447744), (think,0.012161168331925723)), Array((know,0.031956573561538214), (just,0.030674598809934856), (want,0.027663491240851962), (tell,0.025727217382788027), (right,0.02300853167338119)), Array((love,0.05932570200934131), (father,0.030080735900045442), (life,0.01769248067468245), (true,0.016281752071881345), (young,0.014927950883812253)), Array((remember,0.03998401809663685), (went,0.01737965538107633), (lost,0.016916065536574213), (called,0.016443441316683228), (story,0.014849882671062261)), Array((house,0.028911209424810257), (miss,0.025669944694943093), (right,0.02091105252727788), (family,0.017862939987512365), (important,0.013959164390834044)), Array((saying,0.022939827090645636), (know,0.021335083902970984), (idea,0.017628999871937747), (business,0.017302568063786224), (police,0.012284217866942303)), Array((know,0.051876601466269136), (like,0.03828159069993671), (maybe,0.03754385940676905), (just,0.031938551661426284), (want,0.02876693222824349)), Array((years,0.032537676027398765), (going,0.030596831997667568), (case,0.02049555392502822), (doctor,0.018671171294737107), (working,0.017672067172167016)), Array((stuff,0.02236582778896705), (school,0.020057798194969816), (john,0.017134198006217606), (week,0.017075852415410653), (thousand,0.017013413435021035)), Array((little,0.08663446368316245), (girl,0.035120377589734936), (like,0.02992080326340266), 
(woman,0.0240813719635157), (baby,0.022471517953608963)), Array((know,0.0283115823590395), (leave,0.02744935904744228), (time,0.02050833156294194), (want,0.020124145131863225), (just,0.019466336438890477)), Array((didn,0.08220031921979461), (like,0.05062323326717784), (real,0.03087838046777391), (guess,0.02452989702353384), (says,0.022815035397008333)), Array((minutes,0.018541518543996716), (time,0.014737962244588431), (captain,0.012594614743931537), (thirty,0.01193707771669708), (ship,0.011260576815409516)), Array((okay,0.08153575328080886), (just,0.050004142902999975), (right,0.03438984898476042), (know,0.02821327795933634), (home,0.023397063860326372)), Array((country,0.011270500385627474), (power,0.010428408353623762), (president,0.009392162067926028), (fight,0.00799742811584178), (possible,0.007597974486019279)), Array((know,0.09541058020800194), (think,0.0698707939786508), (really,0.06881812755565207), (mean,0.02909700228968688), (just,0.028699687473471538)), Array((dead,0.03833642117149438), (like,0.017873711992106994), (hand,0.015280854355409379), (white,0.013718491413582671), (blood,0.012699265888344448)))

In [None]:
vocabList(47) // look up the term at vocabulary index 47 ('doesn' in this run); the index-to-term mapping depends on the fitted vocabulary

  

>     res33: String = doesn
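
The lookup used here is plain array indexing: `describeTopics` returns term indices, and `vocabList` turns each index back into a word. The sketch below replays that step without Spark, using the first ten vocabulary entries and the indices and weights of the fifth topic as printed in this notebook's outputs; the `toy*` names are illustrative only.

```scala
// First ten entries of the vocabulary shown above.
val toyVocab: Array[String] =
  Array("know", "just", "like", "want", "think", "right", "going", "good", "yeah", "tell")

// Term indices and weights of one topic, copied from the notebook output.
val toyTermIndices: Array[Int] = Array(0, 1, 3, 9, 5)
val toyTermWeights: Array[Double] = Array(
  0.031956573561538214, 0.030674598809934856, 0.027663491240851962,
  0.025727217382788027, 0.02300853167338119)

// Replace each index by its word and keep the weight alongside it.
val namedTopic: Array[(String, Double)] =
  toyTermIndices.map(toyVocab(_)).zip(toyTermWeights)

namedTopic.foreach { case (term, weight) => println(s"$term\t$weight") }
```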

  

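The next cell combines the previous steps: it describes the topics,
maps the term indices to vocabulary words, and prints each topic with
its term weights.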

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     just	0.030515134931284552
>     like	0.02463563559747823
>     want	0.022529385381465025
>     damn	0.02094828832824297
>     going	0.0203407289886203
>     ==========
>     TOPIC 1
>     yeah	0.10787301090151602
>     look	0.0756831002291994
>     know	0.04815746564274915
>     wait	0.03897182014529944
>     night	0.0341458394828345
>     ==========
>     TOPIC 2
>     gonna	0.08118584492034046
>     money	0.051736711600637544
>     shit	0.04620430294274594
>     fuck	0.0399843125556081
>     kill	0.03672740843080258
>     ==========
>     TOPIC 3
>     people	0.020091372023286612
>     know	0.018613400462887356
>     work	0.016775643603287843
>     does	0.015522555458447744
>     think	0.012161168331925723
>     ==========
>     TOPIC 4
>     know	0.031956573561538214
>     just	0.030674598809934856
>     want	0.027663491240851962
>     tell	0.025727217382788027
>     right	0.02300853167338119
>     ==========
>     TOPIC 5
>     love	0.05932570200934131
>     father	0.030080735900045442
>     life	0.01769248067468245
>     true	0.016281752071881345
>     young	0.014927950883812253
>     ==========
>     TOPIC 6
>     remember	0.03998401809663685
>     went	0.01737965538107633
>     lost	0.016916065536574213
>     called	0.016443441316683228
>     story	0.014849882671062261
>     ==========
>     TOPIC 7
>     house	0.028911209424810257
>     miss	0.025669944694943093
>     right	0.02091105252727788
>     family	0.017862939987512365
>     important	0.013959164390834044
>     ==========
>     TOPIC 8
>     saying	0.022939827090645636
>     know	0.021335083902970984
>     idea	0.017628999871937747
>     business	0.017302568063786224
>     police	0.012284217866942303
>     ==========
>     TOPIC 9
>     know	0.051876601466269136
>     like	0.03828159069993671
>     maybe	0.03754385940676905
>     just	0.031938551661426284
>     want	0.02876693222824349
>     ==========
>     TOPIC 10
>     years	0.032537676027398765
>     going	0.030596831997667568
>     case	0.02049555392502822
>     doctor	0.018671171294737107
>     working	0.017672067172167016
>     ==========
>     TOPIC 11
>     stuff	0.02236582778896705
>     school	0.020057798194969816
>     john	0.017134198006217606
>     week	0.017075852415410653
>     thousand	0.017013413435021035
>     ==========
>     TOPIC 12
>     little	0.08663446368316245
>     girl	0.035120377589734936
>     like	0.02992080326340266
>     woman	0.0240813719635157
>     baby	0.022471517953608963
>     ==========
>     TOPIC 13
>     know	0.0283115823590395
>     leave	0.02744935904744228
>     time	0.02050833156294194
>     want	0.020124145131863225
>     just	0.019466336438890477
>     ==========
>     TOPIC 14
>     didn	0.08220031921979461
>     like	0.05062323326717784
>     real	0.03087838046777391
>     guess	0.02452989702353384
>     says	0.022815035397008333
>     ==========
>     TOPIC 15
>     minutes	0.018541518543996716
>     time	0.014737962244588431
>     captain	0.012594614743931537
>     thirty	0.01193707771669708
>     ship	0.011260576815409516
>     ==========
>     TOPIC 16
>     okay	0.08153575328080886
>     just	0.050004142902999975
>     right	0.03438984898476042
>     know	0.02821327795933634
>     home	0.023397063860326372
>     ==========
>     TOPIC 17
>     country	0.011270500385627474
>     power	0.010428408353623762
>     president	0.009392162067926028
>     fight	0.00799742811584178
>     possible	0.007597974486019279
>     ==========
>     TOPIC 18
>     know	0.09541058020800194
>     think	0.0698707939786508
>     really	0.06881812755565207
>     mean	0.02909700228968688
>     just	0.028699687473471538
>     ==========
>     TOPIC 19
>     dead	0.03833642117149438
>     like	0.017873711992106994
>     hand	0.015280854355409379
>     white	0.013718491413582671
>     blood	0.012699265888344448
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(1, 2, 3, 135, 6),Array(0.030515134931284552, 0.02463563559747823, 0.022529385381465025, 0.02094828832824297, 0.0203407289886203)), (Array(8, 12, 0, 57, 32),Array(0.10787301090151602, 0.0756831002291994, 0.04815746564274915, 0.03897182014529944, 0.0341458394828345)), (Array(20, 35, 42, 51, 58),Array(0.08118584492034046, 0.051736711600637544, 0.04620430294274594, 0.0399843125556081, 0.03672740843080258)), (Array(22, 0, 34, 43, 4),Array(0.020091372023286612, 0.018613400462887356, 0.016775643603287843, 0.015522555458447744, 0.012161168331925723)), (Array(0, 1, 3, 9, 5),Array(0.031956573561538214, 0.030674598809934856, 0.027663491240851962, 0.025727217382788027, 0.02300853167338119)), (Array(27, 74, 31, 168, 202),Array(0.05932570200934131, 0.030080735900045442, 0.01769248067468245, 0.016281752071881345, 0.014927950883812253)), (Array(53, 92, 180, 113, 166),Array(0.03998401809663685, 0.01737965538107633, 0.016916065536574213, 0.016443441316683228, 0.014849882671062261)), (Array(78, 110, 5, 171, 232),Array(0.028911209424810257, 0.025669944694943093, 0.02091105252727788, 0.017862939987512365, 0.013959164390834044)), (Array(119, 0, 107, 106, 219),Array(0.022939827090645636, 0.021335083902970984, 0.017628999871937747, 0.017302568063786224, 0.012284217866942303)), (Array(0, 2, 24, 1, 3),Array(0.051876601466269136, 0.03828159069993671, 0.03754385940676905, 0.031938551661426284, 0.02876693222824349)), (Array(41, 6, 140, 162, 177),Array(0.032537676027398765, 0.030596831997667568, 0.02049555392502822, 0.018671171294737107, 0.017672067172167016)), (Array(118, 130, 181, 174, 170),Array(0.02236582778896705, 0.020057798194969816, 0.017134198006217606, 0.017075852415410653, 0.017013413435021035)), (Array(18, 62, 2, 114, 122),Array(0.08663446368316245, 0.035120377589734936, 0.02992080326340266, 0.0240813719635157, 0.022471517953608963)), (Array(0, 64, 11, 3, 1),Array(0.0283115823590395, 0.02744935904744228, 
0.02050833156294194, 0.020124145131863225, 0.019466336438890477)), (Array(13, 2, 67, 59, 111),Array(0.08220031921979461, 0.05062323326717784, 0.03087838046777391, 0.02452989702353384, 0.022815035397008333)), (Array(158, 11, 233, 274, 295),Array(0.018541518543996716, 0.014737962244588431, 0.012594614743931537, 0.01193707771669708, 0.011260576815409516)), (Array(16, 1, 5, 0, 49),Array(0.08153575328080886, 0.050004142902999975, 0.03438984898476042, 0.02821327795933634, 0.023397063860326372)), (Array(257, 279, 313, 291, 351),Array(0.011270500385627474, 0.010428408353623762, 0.009392162067926028, 0.00799742811584178, 0.007597974486019279)), (Array(0, 4, 17, 14, 1),Array(0.09541058020800194, 0.0698707939786508, 0.06881812755565207, 0.02909700228968688, 0.028699687473471538)), (Array(54, 2, 198, 248, 266),Array(0.03833642117149438, 0.017873711992106994, 0.015280854355409379, 0.013718491413582671, 0.012699265888344448)))
>     vocabList: Array[String] = Array(know, just, like, want, think, right, going, good, yeah, tell, come, time, look, didn, mean, make, okay, really, little, sure, gonna, thing, people, said, maybe, need, sorry, love, talk, thought, doing, life, night, things, work, money, better, told, long, help, believe, years, shit, does, away, place, hell, doesn, great, home, feel, fuck, kind, remember, dead, course, wouldn, wait, kill, guess, understand, thank, girl, wrong, leave, listen, talking, real, hear, stop, nice, happened, fine, wanted, father, gotta, mind, fucking, house, wasn, getting, world, stay, mother, left, came, care, thanks, knew, room, trying, guys, went, looking, coming, heard, friend, haven, seen, best, tonight, live, used, matter, killed, pretty, business, idea, couldn, head, miss, says, wife, called, woman, morning, tomorrow, start, stuff, saying, play, hello, baby, hard, probably, minute, days, took, somebody, today, school, meet, gone, crazy, wants, damn, forget, problem, cause, deal, case, friends, point, hope, jesus, afraid, looks, knows, year, worry, exactly, aren, half, thinking, shut, hold, wanna, face, minutes, bring, word, read, doctor, everybody, supposed, makes, story, turn, true, watch, thousand, family, brother, kids, week, happen, fuckin, working, open, happy, lost, john, hurt, town, ready, alright, late, actually, married, gave, beautiful, soon, jack, times, sleep, door, having, drink, hand, easy, gets, chance, young, trouble, different, anybody, shot, rest, hate, death, second, later, asked, phone, wish, check, quite, walk, change, police, couple, question, close, taking, heart, hours, making, comes, anymore, truth, trust, dollars, important, captain, telling, funny, person, honey, goes, eyes, reason, inside, stand, break, means, number, tried, high, white, water, suppose, body, sick, game, excuse, party, women, country, answer, christ, waiting, office, send, pick, alive, sort, blood, black, daddy, line, husband, goddamn, book, fifty, 
thirty, fact, million, died, hands, power, stupid, started, shouldn, months, boys, city, sense, dinner, running, hour, shoot, fight, drive, speak, george, ship, living, figure, dear, street, ahead, lady, seven, scared, free, feeling, frank, able, children, safe, moment, outside, news, president, brought, write, happens, sent, bullshit, lose, light, glad, child, girls, sounds, sister, promise, lives, till, sound, weren, save, poor, cool, shall, asking, plan, king, bitch, daughter, weeks, beat, york, cold, worth, taken, harry, needs, piece, movie, fast, possible, small, goin, straight, human, hair, company, food, tired, lucky, pull, wonderful, touch, looked, thinks, state, picture, leaving, words, control, clear, known, special, buddy, luck, order, follow, expect, mary, catch, mouth, worked, mister, learn, playing, perfect, dream, calling, questions, hospital, takes, ride, coffee, miles, parents, works, secret, hotel, explain, kidding, worse, past, outta, general, felt, drop, unless, throw, interested, hang, certainly, absolutely, earth, loved, dark, wonder, accident, seeing, turned, clock, simple, doin, date, sweet, meeting, clean, sign, feet, handle, music, report, giving, army, fucked, cops, charlie, smart, yesterday, information, fall, fault, bank, class, month, blow, swear, caught, major, paul, road, talked, choice, plane, boss, david, paid, wear, american, worried, lord, paper, goodbye, clothes, ones, terrible, strange, given, mistake, finish, kept, blue, murder, hurry, apartment, sell, middle, nothin, careful, hasn, meant, walter, moving, changed, imagine, fair, difference, quiet, happening, near, quit, personal, marry, figured, future, rose, building, mama, michael, early, agent, kinda, watching, private, trip, record, certain, busy, jimmy, broke, sake, longer, store, boat, stick, finally, born, evening, sitting, bucks, ought, chief, lying, history, kiss, honor, darling, lunch, favor, fool, uncle, respect, rich, land, liked, killing, peter, tough, brain, 
interesting, completely, problems, welcome, nick, wake, honest, radio, dick, cash, dance, dude, james, bout, floor, weird, court, calls, jail, drunk, window, involved, johnny, officer, needed, asshole, situation, spend, books, relax, pain, service, grand, dangerous, letter, security, stopped, offer, realize, table, bastard, message, instead, killer, jake, deep, nervous, somethin, pass, evil, english, bought, short, step, ring, picked, likes, machine, eddie, voice, upset, forgot, carry, lived, afternoon, fear, quick, finished, count, forgive, wrote, named, decided, totally, space, team, pleasure, doubt, lawyer, station, gotten, suit, bother, prove, return, slow, pictures, bunch, strong, list, wearing, driving, join, tape, christmas, attack, appreciate, force, church, college, hungry, standing, present, dying, prison, missing, charge, board, truck, public, calm, gold, staying, ball, hardly, hadn, missed, lead, island, government, horse, cover, french, reach, joke, fish, star, mike, surprise, america, moved, soul, dress, seconds, club, self, putting, movies, lots, cost, listening, price, saved, smell, mark, peace, dreams, entire, crime, gives, usually, single, department, holy, beer, west, protect, stuck, wall, nose, ways, teach, forever, grow, train, type, awful, rock, detective, billy, walking, dumb, papers, beginning, planet, folks, park, attention, birthday, hide, card, master, share, reading, test, starting, lieutenant, field, partner, enjoy, twice, film, dollar, bomb, mess, blame, south, loves, girlfriend, round, records, using, plenty, especially, gentlemen, evidence, silly, experience, admit, fired, normal, talkin, mission, louis, memory, fighting, lock, notice, crap, wedding, promised, marriage, ground, guns, glass, idiot, orders, impossible, heaven, knock, hole, neck, animal, spent, green, wondering, nuts, press, drugs, broken, position, names, asleep, jerry, visit, boyfriend, acting, feels, plans, paris, smoke, tells, wind, cross, holding, sheriff, gimme, 
walked, mention, writing, double, brothers, code, judge, pardon, keeps, fellow, fell, closed, lovely, angry, cute, charles, surprised, percent, correct, bathroom, agree, address, andy, ridiculous, summer, tommy, rules, group, account, note, pulled, sleeping, sing, learned, proud, laugh, colonel, upstairs, river, difficult, built, jump, area, dirty, betty, bridge, breakfast, bobby, locked, amazing, north, feelings, alex, plus, definitely, worst, accept, kick, seriously, grace, steal, wild, stories, file, gettin, relationship, advice, nature, contact, spot, places, waste, knowing, beach, stole, apart, favorite, faith, level, loose, risk, song, eating, foot, played, patient, washington, turns, witness, action, build, obviously, begin, split, crew, command, games, tight, decide, nurse, keeping, runs, form, bird, copy, insane, complete, arrest, consider, taste, scene, jeffrey, teeth, shoes, career, henry, sooner, devil, monster, showed, weekend, gift, innocent, study, heavy, hall, comin, danger, greatest, track, keys, raise, destroy, concerned, program, carl, blind, apologize, suddenly, hanging, bruce, california, chicken, seventy, forward, drinking, sweetheart, medical, suspect, admiral, guard, shop, professor, legs, willing, camp, data, ticket, tree, goodnight, television, losing, senator, murdered, burn, dunno, paying, possibly, trick, dropped, credit, extra, starts, warm, hiding, meaning, sold, stone, taught, marty, lately, cheap, lookin, science, simply, jeff, corner, harold, following, majesty, queen, duty, cars, training, heads, seat, discuss, bear, enemy, helped, noticed, common, screw, dave)
>     topics: Array[Array[(String, Double)]] = Array(Array((just,0.030515134931284552), (like,0.02463563559747823), (want,0.022529385381465025), (damn,0.02094828832824297), (going,0.0203407289886203)), Array((yeah,0.10787301090151602), (look,0.0756831002291994), (know,0.04815746564274915), (wait,0.03897182014529944), (night,0.0341458394828345)), Array((gonna,0.08118584492034046), (money,0.051736711600637544), (shit,0.04620430294274594), (fuck,0.0399843125556081), (kill,0.03672740843080258)), Array((people,0.020091372023286612), (know,0.018613400462887356), (work,0.016775643603287843), (does,0.015522555458447744), (think,0.012161168331925723)), Array((know,0.031956573561538214), (just,0.030674598809934856), (want,0.027663491240851962), (tell,0.025727217382788027), (right,0.02300853167338119)), Array((love,0.05932570200934131), (father,0.030080735900045442), (life,0.01769248067468245), (true,0.016281752071881345), (young,0.014927950883812253)), Array((remember,0.03998401809663685), (went,0.01737965538107633), (lost,0.016916065536574213), (called,0.016443441316683228), (story,0.014849882671062261)), Array((house,0.028911209424810257), (miss,0.025669944694943093), (right,0.02091105252727788), (family,0.017862939987512365), (important,0.013959164390834044)), Array((saying,0.022939827090645636), (know,0.021335083902970984), (idea,0.017628999871937747), (business,0.017302568063786224), (police,0.012284217866942303)), Array((know,0.051876601466269136), (like,0.03828159069993671), (maybe,0.03754385940676905), (just,0.031938551661426284), (want,0.02876693222824349)), Array((years,0.032537676027398765), (going,0.030596831997667568), (case,0.02049555392502822), (doctor,0.018671171294737107), (working,0.017672067172167016)), Array((stuff,0.02236582778896705), (school,0.020057798194969816), (john,0.017134198006217606), (week,0.017075852415410653), (thousand,0.017013413435021035)), Array((little,0.08663446368316245), (girl,0.035120377589734936), (like,0.02992080326340266), 
(woman,0.0240813719635157), (baby,0.022471517953608963)), Array((know,0.0283115823590395), (leave,0.02744935904744228), (time,0.02050833156294194), (want,0.020124145131863225), (just,0.019466336438890477)), Array((didn,0.08220031921979461), (like,0.05062323326717784), (real,0.03087838046777391), (guess,0.02452989702353384), (says,0.022815035397008333)), Array((minutes,0.018541518543996716), (time,0.014737962244588431), (captain,0.012594614743931537), (thirty,0.01193707771669708), (ship,0.011260576815409516)), Array((okay,0.08153575328080886), (just,0.050004142902999975), (right,0.03438984898476042), (know,0.02821327795933634), (home,0.023397063860326372)), Array((country,0.011270500385627474), (power,0.010428408353623762), (president,0.009392162067926028), (fight,0.00799742811584178), (possible,0.007597974486019279)), Array((know,0.09541058020800194), (think,0.0698707939786508), (really,0.06881812755565207), (mean,0.02909700228968688), (just,0.028699687473471538)), Array((dead,0.03833642117149438), (like,0.017873711992106994), (hand,0.015280854355409379), (white,0.013718491413582671), (blood,0.012699265888344448)))

In [None]:
top10ConversationsPerTopic(2)

  

>     res54: (Array[Long], Array[Double]) = (Array(22243, 39967, 18136, 18149, 59043, 61513, 34087, 75874, 66270, 68876),Array(0.9986758340945384, 0.99866200816902, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165, 0.9982983538060165))

In [None]:
top10ConversationsPerTopic(2)._1

  

>     res55: Array[Long] = Array(22243, 39967, 18136, 18149, 59043, 61513, 34087, 75874, 66270, 68876)
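
The result above is a pair of parallel arrays: `_1` holds the
conversation ids and `_2` the weight of the topic in each conversation.
As a minimal sketch in plain Scala (no Spark needed, with only the first
three values copied from the output above), the two arrays can be zipped
so that each id sits next to its weight. The names `topDocs`, `docIds`,
`topicWeights` and `ranked` are illustrative and do not appear elsewhere
in this notebook.

```scala
// Values copied from the output above (first three entries only).
val topDocs: (Array[Long], Array[Double]) =
  (Array(22243L, 39967L, 18136L),
   Array(0.9986758340945384, 0.99866200816902, 0.9982983538060165))

// _1 holds the conversation ids, _2 the weight of the topic in each conversation.
val (docIds, topicWeights) = topDocs

// Pair each id with its weight so the ranking is easier to read.
val ranked: Array[(Long, Double)] = docIds.zip(topicWeights)

ranked.foreach { case (id, w) => println(f"conversation $id%6d  weight $w%.4f") }
```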

In [None]:
val scenesForTopic2 = sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id")

  

>     scenesForTopic2: org.apache.spark.sql.DataFrame = [id: bigint]

In [None]:
display(scenesForTopic2.join(corpusDF,"id"))

  

[TABLE]

In [None]:
sc.parallelize(top10ConversationsPerTopic(2)._1).toDF("id").join(corpusDF,"id").show(10,false)

  

>     +-----+----------------------------------------------------------------------------------+----------------------+---------+
>     |id   |corpus                                                                            |movieTitle            |movieYear|
>     +-----+----------------------------------------------------------------------------------+----------------------+---------+
>     |22243|Fuck him. :-()-: Don't. :-()-: Fuck her too.                                      |panic room            |2002     |
>     |59043|Are you ok? :-()-: Fuck no.                                                       |magnolia              |1999     |
>     |66270|Hey now... what the fuck... ? :-()-: Again.                                       |red white black & blue|2006     |
>     |75874|What about Moliere? :-()-: Fuck off.                                              |the beach             |2000/I   |
>     |68876|What the fuck is that? :-()-: A switchblade.                                      |seven                 |1979     |
>     |34087|Fuck me!  Yes! :-()-: Uh...                                                       |american pie          |1999     |
>     |61513|What the fuck is that?! :-()-: Screamer.                                          |arcade                |1993     |
>     |18136|What the fuck was that about? :-()-: She was jonesing for me.                     |made                  |2001     |
>     |18149|C'mon... :-()-: Fuck...                                                           |made                  |2001     |
>     |39967|Shit, shit, shit... :-()-: You're almost there, you can do it -- can do -- can do.|broadcast news        |1987     |
>     +-----+----------------------------------------------------------------------------------+----------------------+---------+

In [None]:
sc.parallelize(top10ConversationsPerTopic(5)._1).toDF("id").join(corpusDF,"id").show(10,false)

  

>     +-----+---------------------------------------------------------+-----------------+---------+
>     |id   |corpus                                                   |movieTitle       |movieYear|
>     +-----+---------------------------------------------------------+-----------------+---------+
>     |68250|I love you man :-()-: I love you too.                    |say anything...  |1989     |
>     |31256|I love you. :-()-: I love you.                           |total recall     |1990     |
>     |868  |I love you. :-()-: I love you.                           |8mm              |1999     |
>     |17285|Do me. :-()-: I love you. :-()-: I love you.             |little nicky     |2000     |
>     |56529|Why do you love me? :-()-: Why do you love me?           |jerry maguire    |1996     |
>     |67529|I love you, too. :-()-: I love you.  I love you.         |runaway bride    |1999     |
>     |82132|Why did you say that? :-()-: Say what? :-()-: I love you.|willow           |1988     |
>     |50163|I love you, Bud. :-()-: I love you more.                 |frequency        |2000     |
>     |39173|I love you. :-()-: I love you too, Dad.                  |body of evidence |1993     |
>     |57385|Yes? :-()-: I love you...                                |kramer vs. kramer|1979     |
>     +-----+---------------------------------------------------------+-----------------+---------+

In [None]:
corpusDF.show(5)

  

>     +-----+--------------------+------------+---------+
>     |   id|              corpus|  movieTitle|movieYear|
>     +-----+--------------------+------------+---------+
>     |17668|This would be fun...|lost horizon|     1937|
>     |17598|Cave, eh? Where? ...|lost horizon|     1937|
>     |17663|Something grand a...|lost horizon|     1937|
>     |17593|You see? You get ...|lost horizon|     1937|
>     |17658|Let me up! Let me...|lost horizon|     1937|
>     +-----+--------------------+------------+---------+
>     only showing top 5 rows

  

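We've managed to get some interpretable results here. For example,
Topic 2 (gonna, money, shit, fuck, kill) collects profanity-heavy
exchanges and Topic 5 (love, father, life) collects declarations of
love, as the top conversations listed above for each topic confirm.

We still get some ambiguous results like Topic 0, whose top terms
(just, like, want, damn, going) are generic conversational filler.

To improve our results further, we could employ some of the methods
below:

-   Refilter data for additional data-specific stopwords
-   Use stemming or lemmatization to preprocess data
-   Experiment with a smaller number of topics, since several of these
    topics are very similar (Topics 4, 9 and 13 all have know, just and
    want among their top five terms)
-   Increase the model's `maxIterations`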
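
The first suggestion in the list can be sketched in plain Scala. This is
only an illustration under stated assumptions: the `extraStopwords` set
is hypothetical (picked by eye from terms that recur across many topics
above), the `tokenize` helper is not part of this notebook, and the
minimum token length of 4 is an assumed setting.

```scala
// Hypothetical extra stopwords: terms that recur across many of the topics above.
val extraStopwords: Set[String] =
  Set("know", "just", "like", "want", "think", "right", "going")

// Lowercase, split on non-letters, then drop short tokens and the extra stopwords.
// minLength = 4 is an assumption, not a value taken from this notebook.
def tokenize(text: String, stop: Set[String], minLength: Int = 4): Seq[String] =
  text.toLowerCase
    .split("[^a-z]+")
    .toSeq
    .filter(tok => tok.length >= minLength && !stop.contains(tok))

val sample = "I just know you want to leave the house before the police arrive."
val tokens = tokenize(sample, extraStopwords)
println(tokens.mkString(", "))
```

In a Spark pipeline the same set would more likely be appended to the
stopword list of the stopword-removal stage before refitting the model.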

Visualize Results
-----------------

We will try visualizing the results obtained from the EM LDA model with
a d3 bubble chart.

In [None]:
// Zip topic terms with topic IDs
val termArray = topics.zipWithIndex

  

>     termArray: Array[(Array[(String, Double)], Int)] = Array((Array((just,0.030515134931284552), (like,0.02463563559747823), (want,0.022529385381465025), (damn,0.02094828832824297), (going,0.0203407289886203)),0), (Array((yeah,0.10787301090151602), (look,0.0756831002291994), (know,0.04815746564274915), (wait,0.03897182014529944), (night,0.0341458394828345)),1), (Array((gonna,0.08118584492034046), (money,0.051736711600637544), (shit,0.04620430294274594), (fuck,0.0399843125556081), (kill,0.03672740843080258)),2), (Array((people,0.020091372023286612), (know,0.018613400462887356), (work,0.016775643603287843), (does,0.015522555458447744), (think,0.012161168331925723)),3), (Array((know,0.031956573561538214), (just,0.030674598809934856), (want,0.027663491240851962), (tell,0.025727217382788027), (right,0.02300853167338119)),4), (Array((love,0.05932570200934131), (father,0.030080735900045442), (life,0.01769248067468245), (true,0.016281752071881345), (young,0.014927950883812253)),5), (Array((remember,0.03998401809663685), (went,0.01737965538107633), (lost,0.016916065536574213), (called,0.016443441316683228), (story,0.014849882671062261)),6), (Array((house,0.028911209424810257), (miss,0.025669944694943093), (right,0.02091105252727788), (family,0.017862939987512365), (important,0.013959164390834044)),7), (Array((saying,0.022939827090645636), (know,0.021335083902970984), (idea,0.017628999871937747), (business,0.017302568063786224), (police,0.012284217866942303)),8), (Array((know,0.051876601466269136), (like,0.03828159069993671), (maybe,0.03754385940676905), (just,0.031938551661426284), (want,0.02876693222824349)),9), (Array((years,0.032537676027398765), (going,0.030596831997667568), (case,0.02049555392502822), (doctor,0.018671171294737107), (working,0.017672067172167016)),10), (Array((stuff,0.02236582778896705), (school,0.020057798194969816), (john,0.017134198006217606), (week,0.017075852415410653), (thousand,0.017013413435021035)),11), (Array((little,0.08663446368316245), 
(girl,0.035120377589734936), (like,0.02992080326340266), (woman,0.0240813719635157), (baby,0.022471517953608963)),12), (Array((know,0.0283115823590395), (leave,0.02744935904744228), (time,0.02050833156294194), (want,0.020124145131863225), (just,0.019466336438890477)),13), (Array((didn,0.08220031921979461), (like,0.05062323326717784), (real,0.03087838046777391), (guess,0.02452989702353384), (says,0.022815035397008333)),14), (Array((minutes,0.018541518543996716), (time,0.014737962244588431), (captain,0.012594614743931537), (thirty,0.01193707771669708), (ship,0.011260576815409516)),15), (Array((okay,0.08153575328080886), (just,0.050004142902999975), (right,0.03438984898476042), (know,0.02821327795933634), (home,0.023397063860326372)),16), (Array((country,0.011270500385627474), (power,0.010428408353623762), (president,0.009392162067926028), (fight,0.00799742811584178), (possible,0.007597974486019279)),17), (Array((know,0.09541058020800194), (think,0.0698707939786508), (really,0.06881812755565207), (mean,0.02909700228968688), (just,0.028699687473471538)),18), (Array((dead,0.03833642117149438), (like,0.017873711992106994), (hand,0.015280854355409379), (white,0.013718491413582671), (blood,0.012699265888344448)),19))

In [None]:
// Transform data into the form (term, probability, topicId)
val termRDD = sc.parallelize(termArray)
val termRDD2 = termRDD.flatMap { case (terms, topicId) =>
  // Attach the topic id to every (term, probability) pair of that topic
  terms.map { case (term, probability) => (term, probability, topicId) }
}

  

>     termRDD: org.apache.spark.rdd.RDD[(Array[(String, Double)], Int)] = ParallelCollectionRDD[3066] at parallelize at <console>:109
>     termRDD2: org.apache.spark.rdd.RDD[(String, Double, Int)] = MapPartitionsRDD[3067] at flatMap at <console>:110

In [None]:
// Create DF with proper column names
val termDF = termRDD2.toDF("term", "probability", "topicId")

  

>     termDF: org.apache.spark.sql.DataFrame = [term: string, probability: double, topicId: int]

In [None]:
display(termDF)

  

[TABLE]

Truncated to 30 rows

  

We will convert the DataFrame into a JSON format, which will be passed
into d3.

In [None]:
// Create JSON data
val rawJson = termDF.toJSON.collect().mkString(",\n")

  

>     rawJson: String = 
>     {"term":"just","probability":0.030515134931284552,"topicId":0},
>     {"term":"like","probability":0.02463563559747823,"topicId":0},
>     {"term":"want","probability":0.022529385381465025,"topicId":0},
>     {"term":"damn","probability":0.02094828832824297,"topicId":0},
>     {"term":"going","probability":0.0203407289886203,"topicId":0},
>     {"term":"yeah","probability":0.10787301090151602,"topicId":1},
>     {"term":"look","probability":0.0756831002291994,"topicId":1},
>     {"term":"know","probability":0.04815746564274915,"topicId":1},
>     {"term":"wait","probability":0.03897182014529944,"topicId":1},
>     {"term":"night","probability":0.0341458394828345,"topicId":1},
>     {"term":"gonna","probability":0.08118584492034046,"topicId":2},
>     {"term":"money","probability":0.051736711600637544,"topicId":2},
>     {"term":"shit","probability":0.04620430294274594,"topicId":2},
>     {"term":"fuck","probability":0.0399843125556081,"topicId":2},
>     {"term":"kill","probability":0.03672740843080258,"topicId":2},
>     {"term":"people","probability":0.020091372023286612,"topicId":3},
>     {"term":"know","probability":0.018613400462887356,"topicId":3},
>     {"term":"work","probability":0.016775643603287843,"topicId":3},
>     {"term":"does","probability":0.015522555458447744,"topicId":3},
>     {"term":"think","probability":0.012161168331925723,"topicId":3},
>     {"term":"know","probability":0.031956573561538214,"topicId":4},
>     {"term":"just","probability":0.030674598809934856,"topicId":4},
>     {"term":"want","probability":0.027663491240851962,"topicId":4},
>     {"term":"tell","probability":0.025727217382788027,"topicId":4},
>     {"term":"right","probability":0.02300853167338119,"topicId":4},
>     {"term":"love","probability":0.05932570200934131,"topicId":5},
>     {"term":"father","probability":0.030080735900045442,"topicId":5},
>     {"term":"life","probability":0.01769248067468245,"topicId":5},
>     {"term":"true","probability":0.016281752071881345,"topicId":5},
>     {"term":"young","probability":0.014927950883812253,"topicId":5},
>     {"term":"remember","probability":0.03998401809663685,"topicId":6},
>     {"term":"went","probability":0.01737965538107633,"topicId":6},
>     {"term":"lost","probability":0.016916065536574213,"topicId":6},
>     {"term":"called","probability":0.016443441316683228,"topicId":6},
>     {"term":"story","probability":0.014849882671062261,"topicId":6},
>     {"term":"house","probability":0.028911209424810257,"topicId":7},
>     {"term":"miss","probability":0.025669944694943093,"topicId":7},
>     {"term":"right","probability":0.02091105252727788,"topicId":7},
>     {"term":"family","probability":0.017862939987512365,"topicId":7},
>     {"term":"important","probability":0.013959164390834044,"topicId":7},
>     {"term":"saying","probability":0.022939827090645636,"topicId":8},
>     {"term":"know","probability":0.021335083902970984,"topicId":8},
>     {"term":"idea","probability":0.017628999871937747,"topicId":8},
>     {"term":"business","probability":0.017302568063786224,"topicId":8},
>     {"term":"police","probability":0.012284217866942303,"topicId":8},
>     {"term":"know","probability":0.051876601466269136,"topicId":9},
>     {"term":"like","probability":0.03828159069993671,"topicId":9},
>     {"term":"maybe","probability":0.03754385940676905,"topicId":9},
>     {"term":"just","probability":0.031938551661426284,"topicId":9},
>     {"term":"want","probability":0.02876693222824349,"topicId":9},
>     {"term":"years","probability":0.032537676027398765,"topicId":10},
>     {"term":"going","probability":0.030596831997667568,"topicId":10},
>     {"term":"case","probability":0.02049555392502822,"topicId":10},
>     {"term":"doctor","probability":0.018671171294737107,"topicId":10},
>     {"term":"working","probability":0.017672067172167016,"topicId":10},
>     {"term":"stuff","probability":0.02236582778896705,"topicId":11},
>     {"term":"school","probability":0.020057798194969816,"topicId":11},
>     {"term":"john","probability":0.017134198006217606,"topicId":11},
>     {"term":"week","probability":0.017075852415410653,"topicId":11},
>     {"term":"thousand","probability":0.017013413435021035,"topicId":11},
>     {"term":"little","probability":0.08663446368316245,"topicId":12},
>     {"term":"girl","probability":0.035120377589734936,"topicId":12},
>     {"term":"like","probability":0.02992080326340266,"topicId":12},
>     {"term":"woman","probability":0.0240813719635157,"topicId":12},
>     {"term":"baby","probability":0.022471517953608963,"topicId":12},
>     {"term":"know","probability":0.0283115823590395,"topicId":13},
>     {"term":"leave","probability":0.02744935904744228,"topicId":13},
>     {"term":"time","probability":0.02050833156294194,"topicId":13},
>     {"term":"want","probability":0.020124145131863225,"topicId":13},
>     {"term":"just","probability":0.019466336438890477,"topicId":13},
>     {"term":"didn","probability":0.08220031921979461,"topicId":14},
>     {"term":"like","probability":0.05062323326717784,"topicId":14},
>     {"term":"real","probability":0.03087838046777391,"topicId":14},
>     {"term":"guess","probability":0.02452989702353384,"topicId":14},
>     {"term":"says","probability":0.022815035397008333,"topicId":14},
>     {"term":"minutes","probability":0.018541518543996716,"topicId":15},
>     {"term":"time","probability":0.014737962244588431,"topicId":15},
>     {"term":"captain","probability":0.012594614743931537,"topicId":15},
>     {"term":"thirty","probability":0.01193707771669708,"topicId":15},
>     {"term":"ship","probability":0.011260576815409516,"topicId":15},
>     {"term":"okay","probability":0.08153575328080886,"topicId":16},
>     {"term":"just","probability":0.050004142902999975,"topicId":16},
>     {"term":"right","probability":0.03438984898476042,"topicId":16},
>     {"term":"know","probability":0.02821327795933634,"topicId":16},
>     {"term":"home","probability":0.023397063860326372,"topicId":16},
>     {"term":"country","probability":0.011270500385627474,"topicId":17},
>     {"term":"power","probability":0.010428408353623762,"topicId":17},
>     {"term":"president","probability":0.009392162067926028,"topicId":17},
>     {"term":"fight","probability":0.00799742811584178,"topicId":17},
>     {"term":"possible","probability":0.007597974486019279,"topicId":17},
>     {"term":"know","probability":0.09541058020800194,"topicId":18},
>     {"term":"think","probability":0.0698707939786508,"topicId":18},
>     {"term":"really","probability":0.06881812755565207,"topicId":18},
>     {"term":"mean","probability":0.02909700228968688,"topicId":18},
>     {"term":"just","probability":0.028699687473471538,"topicId":18},
>     {"term":"dead","probability":0.03833642117149438,"topicId":19},
>     {"term":"like","probability":0.017873711992106994,"topicId":19},
>     {"term":"hand","probability":0.015280854355409379,"topicId":19},
>     {"term":"white","probability":0.013718491413582671,"topicId":19},
>     {"term":"blood","probability":0.012699265888344448,"topicId":19}

  

We are now ready to use D3 on the rawJson data.
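
d3's hierarchical layouts, such as the pack layout commonly used for
bubble charts, take a single root object with nested `children`. As a
minimal sketch in plain Scala, the comma-separated rows in `rawJson`
could be wrapped as follows. The wrapper names `"data"` and `"topics"`
are placeholders chosen for illustration, and `rawJsonSample` holds just
two rows copied from the output above.

```scala
// Two rows in the same shape as rawJson above (values copied from the output).
val rawJsonSample: String =
  "{\"term\":\"just\",\"probability\":0.030515134931284552,\"topicId\":0},\n" +
  "{\"term\":\"yeah\",\"probability\":0.10787301090151602,\"topicId\":1}"

// Hypothetical wrapper: one root node whose single child holds all term rows.
val d3Input: String =
  "{\"name\": \"data\", \"children\": [{\"name\": \"topics\", \"children\": [\n" +
  rawJsonSample +
  "\n]}]}"

println(d3Input)
```

The d3 script would then read `term` as the bubble label, `probability`
as its size and `topicId` as its colour group.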

  

Step 1. Downloading and Loading Data into DBFS
----------------------------------------------

Here are the steps taken to download the data and save it to the
distributed file system. Uncomment them to repeat this process on your
Databricks cluster or to download a new source of data.

Unfortunately, the original data at:

-   <http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip>

cannot easily be manipulated and loaded into DBFS as is. So the data
has been downloaded, its directory renamed without white spaces,
superfluous OS-specific files removed, `dos2unix`'d, `tar -zcvf`'d and
uploaded to the following URL for an easily DBFS-loadable download:

-   <http://lamastex.org/datasets/public/nlp/cornell_movie_dialogs_corpus.tgz>

In [None]:
wget http://lamastex.org/datasets/public/nlp/cornell_movie_dialogs_corpus.tgz

  

>     --2019-05-17 15:34:09--  http://lamastex.org/datasets/public/nlp/cornell_movie_dialogs_corpus.tgz
>     Resolving lamastex.org (lamastex.org)... 166.62.28.100
>     Connecting to lamastex.org (lamastex.org)|166.62.28.100|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: 9914415 (9.5M) [application/x-tar]
>     Saving to: ‘cornell_movie_dialogs_corpus.tgz’
>
>          0K .......... .......... .......... .......... ..........  0%  149K 65s
>         50K .......... .......... .......... .......... ..........  1%  300K 48s
>        100K .......... .......... .......... .......... ..........  1% 5.69M 33s
>        150K .......... .......... .......... .......... ..........  2%  312K 32s
>        200K .......... .......... .......... .......... ..........  2% 22.5M 25s
>        250K .......... .......... .......... .......... ..........  3% 8.00M 21s
>        300K .......... .......... .......... .......... ..........  3% 52.2M 18s
>        350K .......... .......... .......... .......... ..........  4%  314K 20s
>        400K .......... .......... .......... .......... ..........  4% 37.3M 17s
>        450K .......... .......... .......... .......... ..........  5% 41.2M 15s
>        500K .......... .......... .......... .......... ..........  5% 9.28M 14s
>        550K .......... .......... .......... .......... ..........  6% 54.1M 13s
>        600K .......... .......... .......... .......... ..........  6% 26.1M 12s
>        650K .......... .......... .......... .......... ..........  7% 49.2M 11s
>        700K .......... .......... .......... .......... ..........  7%  322K 12s
>        750K .......... .......... .......... .......... ..........  8% 33.0M 11s
>        800K .......... .......... .......... .......... ..........  8% 39.0M 10s
>        850K .......... .......... .......... .......... ..........  9% 38.2M 10s
>        900K .......... .......... .......... .......... ..........  9% 51.5M 9s
>        950K .......... .......... .......... .......... .......... 10% 12.6M 9s
>       1000K .......... .......... .......... .......... .......... 10% 31.1M 8s
>       1050K .......... .......... .......... .......... .......... 11% 41.1M 8s
>       1100K .......... .......... .......... .......... .......... 11% 46.0M 8s
>       1150K .......... .......... .......... .......... .......... 12% 56.3M 7s
>       1200K .......... .......... .......... .......... .......... 12% 45.3M 7s
>       1250K .......... .......... .......... .......... .......... 13% 57.1M 7s
>       1300K .......... .......... .......... .......... .......... 13% 44.8M 6s
>       1350K .......... .......... .......... .......... .......... 14%  329K 7s
>       1400K .......... .......... .......... .......... .......... 14% 39.1M 7s
>       1450K .......... .......... .......... .......... .......... 15% 54.5M 6s
>       1500K .......... .......... .......... .......... .......... 16% 44.4M 6s
>       1550K .......... .......... .......... .......... .......... 16% 24.1M 6s
>       1600K .......... .......... .......... .......... .......... 17% 33.9M 6s
>       1650K .......... .......... .......... .......... .......... 17% 67.3M 6s
>
>     2019-05-17 15:34:11 (4.61 MB/s) - ‘cornell_movie_dialogs_corpus.tgz’ saved [9914415/9914415]

  

Untar the file.

In [None]:
tar zxvf cornell_movie_dialogs_corpus.tgz

  

>     cornell_movie_dialogs_corpus/
>     cornell_movie_dialogs_corpus/movie_lines.txt
>     cornell_movie_dialogs_corpus/movie_characters_metadata.txt
>     cornell_movie_dialogs_corpus/README.txt
>     cornell_movie_dialogs_corpus/raw_script_urls.txt
>     cornell_movie_dialogs_corpus/movie_titles_metadata.txt
>     cornell_movie_dialogs_corpus/movie_conversations.txt
>     cornell_movie_dialogs_corpus/chameleons.pdf
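
A quick sanity check on the downloaded archive can catch a truncated download before anything is extracted. The following is only a minimal sketch: `verify_archive` is a hypothetical helper name, not part of the notebook, and it assumes a POSIX shell with `tar` and `wc` available on the driver. It compares the file size with an expected byte count and then lists the archive members without unpacking them.

```shell
# Hypothetical helper (not part of the original notebook):
# check an archive's size in bytes, then list its members without extracting.
verify_archive() {
  archive="$1"
  expected_bytes="$2"
  # wc -c prints the size in bytes; tr strips the padding some platforms add.
  actual_bytes=$(wc -c < "$archive" | tr -d ' ')
  if [ "$actual_bytes" -ne "$expected_bytes" ]; then
    echo "size mismatch: expected $expected_bytes, got $actual_bytes" >&2
    return 1
  fi
  # t = list members, z = read through gzip, f = use the named archive file.
  tar tzf "$archive"
}
```

For the file in this notebook the call would be `verify_archive cornell_movie_dialogs_corpus.tgz 9914415`, where the byte count is the one reported by `wget` when the download finished.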

  

Let us list the extracted files and then copy the text files into DBFS,
after calling `dbutils.fs.mkdirs(...)` to create the directory
`dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/`.

In [None]:
pwd && ls -al cornell_movie_dialogs_corpus

  

>     /databricks/driver
>     total 41552
>     drwxr-xr-x 2 ubuntu ubuntu     4096 Jan 11  2017 .
>     drwxr-xr-x 1 root   root       4096 May 17 15:34 ..
>     -rw-r--r-- 1 ubuntu ubuntu   290691 May  9  2011 chameleons.pdf
>     -rw-r--r-- 1 ubuntu ubuntu   705695 Jan 11  2017 movie_characters_metadata.txt
>     -rw-r--r-- 1 ubuntu ubuntu  6760930 Jan 11  2017 movie_conversations.txt
>     -rw-r--r-- 1 ubuntu ubuntu 34641919 Jan 11  2017 movie_lines.txt
>     -rw-r--r-- 1 ubuntu ubuntu    67289 Jan 11  2017 movie_titles_metadata.txt
>     -rw-r--r-- 1 ubuntu ubuntu    56177 Jan 11  2017 raw_script_urls.txt
>     -rw-r--r-- 1 ubuntu ubuntu     4181 Jan 11  2017 README.txt

In [None]:
dbutils.fs.rm("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/",true)

  

>     res4: Boolean = true

In [None]:
dbutils.fs.mkdirs("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/")

  

>     res5: Boolean = true

In [None]:

dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/movie_characters_metadata.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_characters_metadata.txt")
dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/movie_conversations.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_conversations.txt")
dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/movie_lines.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_lines.txt")
dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/movie_titles_metadata.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/movie_titles_metadata.txt")
dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/raw_script_urls.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/raw_script_urls.txt")
dbutils.fs.cp("file:///databricks/driver/cornell_movie_dialogs_corpus/README.txt","dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/README.txt")


  

>     res6: Boolean = true
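
The six `dbutils.fs.cp` calls above copy every `.txt` file of the corpus and leave out `chameleons.pdf`. As an alternative, the same selection can be sketched as a shell loop. This is a swapped-in technique, not what the notebook does: `copy_txt_files` is a hypothetical helper name, and using it to write into DBFS assumes that the cluster exposes DBFS on the driver under the `/dbfs` mount point, which this notebook does not show.

```shell
# Hypothetical helper (not part of the original notebook):
# copy every .txt file from src_dir into dst_dir, creating dst_dir if needed.
copy_txt_files() {
  src_dir="$1"
  dst_dir="$2"
  mkdir -p "$dst_dir"
  for f in "$src_dir"/*.txt; do
    # When nothing matches, the pattern stays unexpanded; skip it.
    [ -e "$f" ] || continue
    cp "$f" "$dst_dir/"
  done
}
```

Under the `/dbfs` assumption, the call would be `copy_txt_files /databricks/driver/cornell_movie_dialogs_corpus /dbfs/datasets/sds/nlp/cornell_movie_dialogs_corpus`. The explicit `dbutils.fs.cp` calls remain the safer choice when that mount is unavailable.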

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/sds/nlp/cornell_movie_dialogs_corpus/"))

  

[TABLE]