[ScaDaMaLe, Scalable Data Science and Distributed Machine Learning](https://lamastex.github.io/scalable-data-science/sds/3/x/)
==============================================================================================================================

Topic Modeling with Latent Dirichlet Allocation
===============================================

This is an augmentation of a notebook from the Databricks Guide.  
This notebook provides a brief algorithm summary, links for further
reading, and an example of how to use LDA for Topic Modeling.

Algorithm Summary
-----------------

-   **Task**: Identify topics from a collection of text documents
-   **Input**: Vectors of word counts
-   **Optimizers**:
    -   EMLDAOptimizer using [Expectation
        Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    -   OnlineLDAOptimizer using Iterative Mini-Batch Sampling for
        [Online Variational
        Bayes](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
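To make the **Input** line concrete, here is a minimal sketch in plain Scala (no Spark) of what "vectors of word counts" means. The two toy documents and the names `docs`, `vocab`, `index` and `counts` are made up for illustration; the notebook itself builds these vectors with Spark in a later step.

```scala
// Two toy tokenized documents (hypothetical data, for illustration only)
val docs: Seq[Seq[String]] = Seq(
  Seq("space", "orbit", "space"),
  Seq("hockey", "game", "orbit")
)

// Vocabulary: every distinct token, in a fixed (sorted) order
val vocab: Seq[String] = docs.flatten.distinct.sorted
val index: Map[String, Int] = vocab.zipWithIndex.toMap

// One count vector per document; position i holds the count of vocab(i)
val counts: Seq[Vector[Int]] = docs.map { tokens =>
  val v = Array.fill(vocab.size)(0)
  tokens.foreach(t => v(index(t)) += 1)
  v.toVector
}
```

Each document becomes a vector as long as the vocabulary, so word order is discarded and only the counts remain.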

In [None]:
val lang = "en"
val freeze_date = "20201120"

  

>     lang: String = en
>     freeze_date: String = 20201120

In [None]:
import scala.sys.process._
"echo hej"!!


  

>     warning: there was one feature warning; for details, enable `:setting -feature' or `:replay -feature'
>     import scala.sys.process._
>     res0: String =
>     "hej
>     "

In [None]:
s"echo ${lang},lang"!!

  

>     warning: there was one feature warning; for details, enable `:setting -feature' or `:replay -feature'
>     res1: String =
>     "en,lang
>     "

In [None]:
%sh
echo "hej"
echo "${lang}",lang

  

>     hej
>     ,lang
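The cells above differ in where `${lang}` is expanded. In the Scala cells, the `s"..."` interpolator substitutes the Scala value before the command runs. The last cell printed `,lang`, presumably because it ran in a shell where no variable named `lang` is set. The feature warning in the Scala cells comes from the postfix form `"..."!!`. A small sketch that uses the explicit method call `.!!` instead, assuming an `echo` executable is available on the machine:

```scala
import scala.sys.process._

val lang = "en"

// Scala substitutes ${lang} first, then runs the command `echo en,lang`.
// Calling .!! as a method avoids the postfix-operator feature warning.
val out: String = s"echo ${lang},lang".!!
```

The returned string keeps the trailing newline written by `echo`, which is why the outputs above close their quote on the next line.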

  

Links
-----

-   Spark API docs
    -   Scala:
        [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
    -   Python:
        [LDA](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA)
-   [MLlib Programming
    Guide](http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
-   [ML Feature Extractors &
    Transformers](http://spark.apache.org/docs/latest/ml-features.html)
-   [Wikipedia: Latent Dirichlet
    Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Readings for LDA
----------------

-   A high-level introduction to the topic from Communications of the
    ACM
    -   <http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf>
-   A very good high-level humanities introduction to the topic
    (recommended by Chris Thomson in English Department at UC, Ilam):
    -   <http://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/>

Also read the methodological and more formal papers cited in the above
links if you want to know more.

Let's get a bird's eye view of LDA from
<http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf> next.

-   See pictures (hopefully you read the paper last night!)
-   Algorithm of the generative model (this is unsupervised clustering)
-   For a careful introduction to the topic see Sections 27.3 and 27.4
    (pages 950-970) of Murphy's *Machine Learning: A Probabilistic
    Perspective, MIT Press, 2012*.
-   We will be quite application focussed or applied here!
-   To understand the Expectation Maximization algorithm, read
    *Section 8.5 The EM Algorithm* in *The Elements of Statistical
    Learning* by Hastie, Tibshirani and Friedman (2001, Springer Series
    in Statistics). A free 21MB PDF of the book is available from
    <https://web.stanford.edu/~hastie/Papers/ESLII.pdf> or from its
    backup at
    <http://lamastex.org/research_events/Readings/StatLearn/ESLII.pdf>.
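As a compact reminder of the generative model mentioned in the list above, the standard LDA generative process can be written as follows. The symbols are assumptions chosen to follow common usage: $K$ topics, Dirichlet hyperparameters $\eta$ and $\alpha$, $D$ documents, and $N_d$ words in document $d$. Check them against the notation of the paper you are reading.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K
  && \text{(word distribution of topic } k\text{)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic proportions of document } d\text{)}\\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic assignment of word } n\text{)}\\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
  &&&& \text{(observed word)}
\end{align*}
```

Only the words $w_{d,n}$ are observed. Fitting LDA means inferring the hidden $\beta$, $\theta$ and $z$ from them, which is what the two optimizers in the Algorithm Summary approximate.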

  

  

  

Probabilistic Topic Modeling Example
------------------------------------

This is an outline of our Topic Modeling workflow. Feel free to jump to
any subtopic to find out more.

-   Step 0. Dataset Review
-   Step 1. Downloading and Loading Data into DBFS
    -   (Step 1. only needs to be done once per shard - see details at
        the end of the notebook for Step 1.)
-   Step 2. Loading the Data and Data Cleaning
-   Step 3. Text Tokenization
-   Step 4. Remove Stopwords
-   Step 5. Vector of Token Counts
-   Step 6. Create LDA model with Online Variational Bayes
-   Step 7. Review Topics
-   Step 8. Model Tuning - Refilter Stopwords
-   Step 9. Create LDA model with Expectation Maximization
-   Step 10. Visualize Results

Step 0. Dataset Review
----------------------

In this example, we will use the mini [20 Newsgroups
dataset](http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html),
which is a random subset of the original 20 Newsgroups dataset. Each
newsgroup is stored in a subdirectory, with each article stored as a
separate file.

------------------------------------------------------------------------

------------------------------------------------------------------------

The following is the markdown file `20newsgroups.data.md` of the
original details on the dataset, obtained as follows:

```
%sh
$ wget -k http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html
--2016-04-07 10:31:51--  http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html
Resolving kdd.ics.uci.edu (kdd.ics.uci.edu)... 128.195.1.95
Connecting to kdd.ics.uci.edu (kdd.ics.uci.edu)|128.195.1.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4371 (4.3K) [text/html]
Saving to: ‘20newsgroups.data.html’

100%[======================================>] 4,371       --.-K/s   in 0s

2016-04-07 10:31:51 (195 MB/s) - ‘20newsgroups.data.html’ saved [4371/4371]

Converting 20newsgroups.data.html... nothing to do.
Converted 1 files in 0 seconds.

$ pandoc -f html -t markdown 20newsgroups.data.html > 20newsgroups.data.md
```

### 20 Newsgroups

#### Data Type

text

#### Abstract

This data set consists of 20000 messages taken from 20 newsgroups.

#### Sources

##### Original Owner and Donor

    Tom Mitchell
    School of Computer Science
    Carnegie Mellon University
    tom.mitchell@cmu.edu

**Date Donated:** September 9, 1999

#### Data Characteristics

One thousand Usenet articles were taken from each of the following 20
newsgroups.

        alt.atheism
        comp.graphics
        comp.os.ms-windows.misc
        comp.sys.ibm.pc.hardware
        comp.sys.mac.hardware
        comp.windows.x
        misc.forsale
        rec.autos
        rec.motorcycles
        rec.sport.baseball
        rec.sport.hockey
        sci.crypt
        sci.electronics
        sci.med
        sci.space
        soc.religion.christian
        talk.politics.guns
        talk.politics.mideast
        talk.politics.misc
        talk.religion.misc

Approximately 4% of the articles are crossposted. The articles are
typical postings and thus have headers including subject lines,
signature files, and quoted portions of other articles.

#### Data Format

Each newsgroup is stored in a subdirectory, with each article stored as
a separate file.

#### Past Usage

T. Mitchell. Machine Learning, McGraw Hill, 1997.

T. Joachims (1996). [A probabilistic analysis of the Rocchio algorithm
with TFIDF for text
categorization](http://reports-archive.adm.cs.cmu.edu/anon/1996/CMU-CS-96-118.ps),
Computer Science Technical Report CMU-CS-96-118. Carnegie Mellon
University.

#### Acknowledgements, Copyright Information, and Availability

You may use this material free of charge for any educational purpose,
provided attribution is given in any lectures or publications that make
use of this material.

#### References and Further Information

Naive Bayes code for text classification is available from:
<http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html>

------------------------------------------------------------------------

[The UCI KDD Archive](http://kdd.ics.uci.edu/), [Information and
Computer Science](http://www.ics.uci.edu/), [University of California,
Irvine](http://www.uci.edu/), Irvine, CA 92697-3425

Last modified: September 9, 1999

------------------------------------------------------------------------

------------------------------------------------------------------------

**NOTE:** The mini dataset consists of 100 articles from each of the
following 20 Usenet newsgroups (2000 articles in total):

    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc

Some of the newsgroups seem pretty similar at first glance, such as
*comp.sys.ibm.pc.hardware* and *comp.sys.mac.hardware*, which may affect
our results.

**NOTE:** A simpler and slicker version of the analysis is available in
this notebook:

-   <https://docs.cloud.databricks.com/docs/latest/sample_applications/07%20Sample%20ML/MLPipeline%20Newsgroup%20Dataset.html>

    But, let's do it the hard way here so that we can do it on other
    arbitrary datasets.

Step 2. Loading the Data and Data Cleaning
------------------------------------------

We have already used the wget command to download the file and put it
in our distributed file system (this process takes about 10 minutes). To
repeat these steps, or to download data from another source, follow the
steps at the bottom of this worksheet on **Step 1. Downloading and
Loading Data into DBFS**.

Let's make sure these files are in dbfs now:

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/mini_newsgroups")) // this is where the data resides in dbfs (see below to download it first, if you go to a new shard!)

  

[TABLE]

  

Now let us read in the data using `wholeTextFiles()`.

Recall that the `wholeTextFiles()` command will read in the entire
directory of text files, and return a key-value pair of (filePath,
fileContent).

As we do not need the file paths in this example, we will apply a map
function to extract the file contents, and then convert everything to
lowercase.

In [None]:
// Load text file, leave out file paths, convert all strings to lowercase
val corpus = sc.wholeTextFiles("/datasets/mini_newsgroups/*").map(_._2).map(_.toLowerCase()).cache() // let's cache

  

>     corpus: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[48468] at map at command-2972105651606653:2
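To see what the two `map` calls do without a cluster, here is a local sketch using a plain Scala `Seq` in place of the RDD. The paths and contents below are hypothetical stand-ins for the (filePath, fileContent) pairs that `wholeTextFiles()` returns.

```scala
// Hypothetical (filePath, fileContent) pairs, standing in for wholeTextFiles() output
val files: Seq[(String, String)] = Seq(
  ("dir/a.txt", "Subject: Orbit\n\nThe Shuttle LAUNCHED."),
  ("dir/b.txt", "Subject: Cars\n\nNice ENGINE.")
)

// Keep only the second element of each pair (the content), then lowercase it
val contents: Seq[String] = files.map(_._2).map(_.toLowerCase)
```

The same two transformations apply element by element on the RDD; the only difference is that Spark evaluates them lazily, when an action such as `count` is called.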

In [None]:
corpus.count // there are 2000 documents in total - this action will take about 2 minutes

  

>     res4: Long = 2000

  

Review the first 5 documents to get a sense for the data format.

In [None]:
corpus.take(5)

  

>     res5: Array[String] =
>     Array("xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
>     newsgroups: alt.atheism,soc.motss,rec.scouting
>     path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!watson.ibm.com!strom
>     from: strom@watson.ibm.com (rob strom)
>     subject: re: [soc.motss, et al.] "princeton axes matching funds for boy scouts"
>     sender: @watson.ibm.com
>     message-id: <1993apr05.180116.43346@watson.ibm.com>
>     date: mon, 05 apr 93 18:01:16 gmt
>     distribution: usa
>     references: <c47efs.3q47@austin.ibm.com> <1993mar22.033150.17345@cbnewsl.cb.att.com> <n4hy.93apr5120934@harder.ccr-p.ida.org>
>     organization: ibm research
>     lines: 15
>
>     in article <n4hy.93apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (bob mcgwier) writes:
>
>     |> [1] however, i hate economic terrorism and political correctness
>     |> worse than i hate this policy.
>
>
>     |> [2] a more effective approach is to stop donating
>     |> to any organizating that directly or indirectly supports gay rights issues
>     |> until they end the boycott on funding of scouts.
>
>     can somebody reconcile the apparent contradiction between [1] and [2]?
>
>     --
>     rob strom, strom@watson.ibm.com, (914) 784-7641
>     ibm research, 30 saw mill river road, p.o. box 704, yorktown heights, ny  10598
>     ", "path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
>     from: keith@cco.caltech.edu (keith allan schneider)
>     newsgroups: alt.atheism
>     subject: re: >>>>>>pompous ass
>     message-id: <1pi9btinnqa5@gap.caltech.edu>
>     date: 2 apr 93 20:57:33 gmt
>     references: <1ou4koinne67@gap.caltech.edu> <1p72bkinnjt7@gap.caltech.edu> <93089.050046mvs104@psuvm.psu.edu> <1pa6ntinns5d@gap.caltech.edu> <1993mar30.210423.1302@bmerh85.bnr.ca> <1pcnqjinnpon@gap.caltech.edu> <kmr4.1344.733611641@po.cwru.edu>
>     organization: california institute of technology, pasadena
>     lines: 9
>     nntp-posting-host: punisher.caltech.edu
>
>     kmr4@po.cwru.edu (keith m. ryan) writes:
>
>     >>then why do people keep asking the same questions over and over?
>     >because you rarely ever answer them.
>
>     nope, i've answered each question posed, and most were answered multiple
>     times.
>
>     keith
>     ", "path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
>     from: keith@cco.caltech.edu (keith allan schneider)
>     newsgroups: alt.atheism
>     subject: re: >>>>>>pompous ass
>     message-id: <1pi9jkinnqe2@gap.caltech.edu>
>     date: 2 apr 93 21:01:40 gmt
>     references: <1ou4koinne67@gap.caltech.edu> <1p72bkinnjt7@gap.caltech.edu> <93089.050046mvs104@psuvm.psu.edu> <1pa6ntinns5d@gap.caltech.edu> <1993mar30.205919.26390@blaze.cs.jhu.edu> <1pcnp3innpom@gap.caltech.edu> <1pdjip$jsi@fido.asd.sgi.com>
>     organization: california institute of technology, pasadena
>     lines: 14
>     nntp-posting-host: punisher.caltech.edu
>
>     livesey@solntze.wpd.sgi.com (jon livesey) writes:
>
>     >>>how long does it [the motto] have to stay around before it becomes the
>     >>>default?  ...  where's the cutoff point?
>     >>i don't know where the exact cutoff is, but it is at least after a few
>     >>years, and surely after 40 years.
>     >why does the notion of default not take into account changes
>     >in population makeup?
>
>     specifically, which changes are you talking about?  are you arguing
>     that the motto is interpreted as offensive by a larger portion of the
>     population now than 40 years ago?
>
>     keith
>     ", "path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!sdd.hp.com!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
>     from: keith@cco.caltech.edu (keith allan schneider)
>     newsgroups: alt.atheism
>     subject: re: <political atheists?
>     date: 2 apr 1993 21:22:59 gmt
>     organization: california institute of technology, pasadena
>     lines: 44
>     message-id: <1piarjinnqsa@gap.caltech.edu>
>     references: <1p9bseinni6o@gap.caltech.edu> <1pamva$b6j@fido.asd.sgi.com> <1pcq4pinnqp1@gap.caltech.edu> <11702@vice.ico.tek.com>
>     nntp-posting-host: punisher.caltech.edu
>
>     bobbe@vice.ico.tek.com (robert beauchaine) writes:
>
>     >>but, you don't know that capital punishment is wrong, so it isn't the same
>     >>as shooting.  a better analogy would be that you continue to drive your car,
>     >>realizing that sooner or later, someone is going to be killed in an automobile
>     >>accident.  you *know* people get killed as a result of driving, yet you
>     >>continue to do it anyway.
>     >uh uh.  you do not know that you will be the one to do the
>     >killing.  i'm not sure i'd drive a car if i had sufficient evidence to
>     >conclude that i would necessarily kill someone during my lifetime.
>
>     yes, and everyone thinks as you do.  no one thinks that he is going to cause
>     or be involved in a fatal accident, but the likelihood is surprisingly high.
>     just because you are the man on the firing squad whose gun is shooting
>     blanks does not mean that you are less guilty.
>
>     >i don't know about jon, but i say *all* taking of human life is
>     >murder.  and i say murder is wrong in all but one situation:  when
>     >it is the only action that will prevent another murder, either of
>     >myself or another.
>
>     you mean that killing is wrong in all but one situtation?  and, you should
>     note that that situation will never occur.  there are always other options
>     thank killing.  why don't you just say that all killing is wrong.  this
>     is basically what you are saying.
>
>     >i'm getting a bit tired of your probabilistic arguments.
>
>     are you attempting to be condescending?
>
>     >that the system usually works pretty well is small consolation to
>     >the poor innocent bastard getting the lethal injection.  is your
>     >personal value of human life based solely on a statistical approach?
>     >you sound like an unswerving adherent to the needs of the many
>     >outweighing the needs of the few, so fuck the few.
>
>     but, most people have found the risk to be acceptable.  you are probably
>     much more likely to die in a plane crash, or even using an electric
>     blender, than you are to be executed as an innocent.  i personally think
>     that the risk is acceptable, but in an ideal moral system, no such risk
>     is acceptable.  "acceptable" is the fudge factor necessary in such an
>     approximation to the ideal.
>
>     keith
>     ", "path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!husc-news.harvard.edu!kuhub.cc.ukans.edu!wupost!howland.reston.ans.net!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!ursa!pooh!halat
>     newsgroups: alt.atheism
>     subject: re: there must be a creator! (maybe)
>     message-id: <30066@ursa.bear.com>
>     from: halat@pooh.bears (jim halat)
>     date: 1 apr 93 21:24:35 gmt
>     reply-to: halat@pooh.bears (jim halat)
>     sender: news@bear.com
>     references: <16ba1e927.drporter@suvm.syr.edu>
>     lines: 24
>
>     in article <16ba1e927.drporter@suvm.syr.edu>, drporter@suvm.syr.edu (brad porter) writes:
>     >
>     >   science is wonderful at answering most of our questions.  i'm not the type
>     >to question scientific findings very often, but...  personally, i find the
>     >theory of evolution to be unfathomable.  could humans, a highly evolved,
>     >complex organism that thinks, learns, and develops truly be an organism
>     >that resulted from random genetic mutations and natural selection?
>
>     [...stuff deleted...]
>
>     computers are an excellent example...of evolution without "a" creator.
>     we did not "create" computers.  we did not create the sand that goes
>     into the silicon that goes into the integrated circuits that go into
>     processor board.  we took these things and put them together in an
>     interesting way. just like plants "create" oxygen using light through
>     photosynthesis.  it's a much bigger leap to talk about something that
>     created "everything" from nothing.  i find it unfathomable to resort
>     to believing in a creator when a much simpler alternative exists: we
>     simply are incapable of understanding our beginnings -- if there even
>     were beginnings at all.  and that's ok with me.  the present keeps me
>     perfectly busy.
>
>     -jim halat
>
>     ")

  

To review a random document in the corpus, evaluate the following cell.

In [None]:
corpus.takeSample(false, 1)

  

>     res6: Array[String] =
>     Array("path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!usc!cs.utexas.edu!qt.cs.utexas.edu!news.brown.edu!noc.near.net!bigboote.wpi.edu!bigwpi.wpi.edu!kedz
>     from: kedz@bigwpi.wpi.edu (john kedziora)
>     newsgroups: misc.forsale
>     subject: motorcycle wanted.
>     date: 22 feb 1993 14:22:51 gmt
>     organization: worcester polytechnic institute
>     lines: 11
>     expires: 5/1/93
>     message-id: <1manjr$ja0@bigboote.wpi.edu>
>     nntp-posting-host: bigwpi.wpi.edu
>
>     sender:
>     followup-to:kedz@wpi.wpi.edu
>     distribution: ne
>     organization: worcester polytechnic institute
>     keywords:
>
>     i am looking for an inexpensive motorcycle, nothing fancy, have to be able to do all maintinence my self. looking in the <$400 range.
>
>     if you can help me out, great!, please reply by e-mail.
>
>
>     ")

  

Note that each document begins with a header containing some metadata
that we don't need; we are only interested in the body of the
document. We can do a bit of simple data cleaning here by removing the
metadata of each document, which reduces the noise in our dataset. This
is an important step, as the accuracy of our models depends greatly on
the quality of the data used.

In [None]:
// Split document by double newlines, drop the first block, combine again as a string and cache
val corpus_body = corpus.map(_.split("\\n\\n")).map(_.drop(1)).map(_.mkString(" ")).cache()

  

>     corpus_body: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[48472] at map at command-2972105651606660:2
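The header-removal step can be tried on a single string with plain Scala. The sample document below is made up; the split, drop and join are the same calls used in the cell above.

```scala
// A made-up article: header block, blank line, then two body paragraphs
val doc = "from: someone@example.org\nsubject: test\n\nfirst paragraph\n\nsecond paragraph"

// Split on blank lines, drop the first block (the header), re-join with spaces
val blocks: Array[String] = doc.split("\\n\\n")
val body: String = blocks.drop(1).mkString(" ")
```

This relies on the header being exactly the first block. An article whose header itself contains a blank line, like the misc.forsale sample shown earlier, keeps the remainder of its header in the body.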

In [None]:
corpus_body.count() // there should still be the same count, but now without meta-data block

  

>     res7: Long = 2000

  

Let's review the first 5 documents with metadata removed.

In [None]:
corpus_body.take(5)

  

>     res8: Array[String] =
>     Array("in article <n4hy.93apr5120934@harder.ccr-p.ida.org>, n4hy@harder.ccr-p.ida.org (bob mcgwier) writes: |> [1] however, i hate economic terrorism and political correctness
>     |> worse than i hate this policy.
>     |> [2] a more effective approach is to stop donating
>     |> to any organizating that directly or indirectly supports gay rights issues
>     |> until they end the boycott on funding of scouts.   can somebody reconcile the apparent contradiction between [1] and [2]? --
>     rob strom, strom@watson.ibm.com, (914) 784-7641
>     ibm research, 30 saw mill river road, p.o. box 704, yorktown heights, ny  10598
>     ", "kmr4@po.cwru.edu (keith m. ryan) writes: >>then why do people keep asking the same questions over and over?
>     >because you rarely ever answer them. nope, i've answered each question posed, and most were answered multiple
>     times. keith
>     ", "livesey@solntze.wpd.sgi.com (jon livesey) writes: >>>how long does it [the motto] have to stay around before it becomes the
>     >>>default?  ...  where's the cutoff point?
>     >>i don't know where the exact cutoff is, but it is at least after a few
>     >>years, and surely after 40 years.
>     >why does the notion of default not take into account changes
>     >in population makeup?      specifically, which changes are you talking about?  are you arguing
>     that the motto is interpreted as offensive by a larger portion of the
>     population now than 40 years ago? keith
>     ", "bobbe@vice.ico.tek.com (robert beauchaine) writes: >>but, you don't know that capital punishment is wrong, so it isn't the same
>     >>as shooting.  a better analogy would be that you continue to drive your car,
>     >>realizing that sooner or later, someone is going to be killed in an automobile
>     >>accident.  you *know* people get killed as a result of driving, yet you
>     >>continue to do it anyway.
>     >uh uh.  you do not know that you will be the one to do the
>     >killing.  i'm not sure i'd drive a car if i had sufficient evidence to
>     >conclude that i would necessarily kill someone during my lifetime. yes, and everyone thinks as you do.  no one thinks that he is going to cause
>     or be involved in a fatal accident, but the likelihood is surprisingly high.
>     just because you are the man on the firing squad whose gun is shooting
>     blanks does not mean that you are less guilty. >i don't know about jon, but i say *all* taking of human life is
>     >murder.  and i say murder is wrong in all but one situation:  when
>     >it is the only action that will prevent another murder, either of
>     >myself or another. you mean that killing is wrong in all but one situtation?  and, you should
>     note that that situation will never occur.  there are always other options
>     thank killing.  why don't you just say that all killing is wrong.  this
>     is basically what you are saying. >i'm getting a bit tired of your probabilistic arguments. are you attempting to be condescending? >that the system usually works pretty well is small consolation to
>     >the poor innocent bastard getting the lethal injection.  is your
>     >personal value of human life based solely on a statistical approach?
>     >you sound like an unswerving adherent to the needs of the many
>     >outweighing the needs of the few, so fuck the few. but, most people have found the risk to be acceptable.  you are probably
>     much more likely to die in a plane crash, or even using an electric
>     blender, than you are to be executed as an innocent.  i personally think
>     that the risk is acceptable, but in an ideal moral system, no such risk
>     is acceptable.  "acceptable" is the fudge factor necessary in such an
>     approximation to the ideal. keith
>     ", in article <16ba1e927.drporter@suvm.syr.edu>, drporter@suvm.syr.edu (brad porter) writes:
>     >
>     >   science is wonderful at answering most of our questions.  i'm not the type
>     >to question scientific findings very often, but...  personally, i find the
>     >theory of evolution to be unfathomable.  could humans, a highly evolved,
>     >complex organism that thinks, learns, and develops truly be an organism
>     >that resulted from random genetic mutations and natural selection? [...stuff deleted...] computers are an excellent example...of evolution without "a" creator.
>     we did not "create" computers.  we did not create the sand that goes
>     into the silicon that goes into the integrated circuits that go into
>     processor board.  we took these things and put them together in an
>     interesting way. just like plants "create" oxygen using light through
>     photosynthesis.  it's a much bigger leap to talk about something that
>     created "everything" from nothing.  i find it unfathomable to resort
>     to believing in a creator when a much simpler alternative exists: we
>     simply are incapable of understanding our beginnings -- if there even
>     were beginnings at all.  and that's ok with me.  the present keeps me
>     perfectly busy. -jim halat)

  

Feature extraction and transformation APIs
------------------------------------------

See <http://spark.apache.org/docs/latest/ml-features.html>

To use the convenient [Feature extraction and transformation
APIs](http://spark.apache.org/docs/latest/ml-features.html), we will
convert our RDD into a DataFrame.

We will also create an ID for every document using `zipWithIndex`.

-   for syntax and details search for `zipWithIndex` in
    <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html>

In [None]:
// Convert RDD to DF with ID for every document 
val corpus_df = corpus_body.zipWithIndex.toDF("corpus", "id")

  

>     corpus_df: org.apache.spark.sql.DataFrame = [corpus: string, id: bigint]
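Scala collections have a `zipWithIndex` of their own, which gives a quick local picture of the pairing. The strings below are placeholders.

```scala
// Placeholder document bodies
val bodies: Seq[String] = Seq("first body", "second body", "third body")

// Pair each element with its position, starting from 0
val withIds: Seq[(String, Int)] = bodies.zipWithIndex
```

On a local collection the index is an `Int`. On an RDD, `zipWithIndex` produces a `Long`, which is why the `id` column above has type `bigint`.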

In [None]:
//display(corpus_df) // uncomment to see corpus 
// this was commented out after a member of the newsgroup requested to remain anonymous on 20160525

  

  

Step 3. Text Tokenization
-------------------------

We will use the `RegexTokenizer` to split each document into tokens. We
can call `setMinTokenLength()` here to set a minimum token length, so
that all tokens shorter than the minimum are filtered away.

See <http://spark.apache.org/docs/latest/ml-features.html#tokenizer>.

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer

// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+") // split on runs of non-word characters or underscores - this also breaks up emails and similar patterns
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("corpus") // name of the input column
.setOutputCol("tokens") // name of the output column

// Tokenize document
val tokenized_df = tokenizer.transform(corpus_df)

  

>     import org.apache.spark.ml.feature.RegexTokenizer
>     tokenizer: org.apache.spark.ml.feature.RegexTokenizer = RegexTokenizer: uid=regexTok_5c944a5182e5, minTokenLength=4, gaps=true, pattern=[\W_]+, toLowercase=true
>     tokenized_df: org.apache.spark.sql.DataFrame = [corpus: string, id: bigint ... 1 more field]
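A rough local approximation of these tokenizer settings, using only `String` methods, can help when experimenting with the pattern. This is a sketch of the observable behaviour (lowercase, split on the pattern, drop short tokens), not the `RegexTokenizer` implementation, and the sample sentence is adapted from the corpus output above.

```scala
// Sample text adapted from one of the documents shown earlier
val text = "kmr4@po.cwru.edu (keith m. ryan) writes: nope, i've answered each question"

// Lowercase, split on runs of non-word characters or underscores,
// then keep only tokens with at least 4 characters
val minTokenLength = 4
val tokens: Seq[String] =
  text.toLowerCase.split("[\\W_]+").filter(_.length >= minTokenLength).toSeq
```

Note how the email address is broken into pieces and only the pieces of length 4 or more (`kmr4`, `cwru`) survive, while `po`, `edu`, `m`, `i` and `ve` are dropped.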

In [None]:
//display(tokenized_df) // uncomment to see tokenized_df 
// this was commented out after a member of the newsgroup requested to remain anonymous on 20160525

In [None]:
display(tokenized_df.select("tokens"))

  

Step 4. Remove Stopwords
------------------------

We can easily remove stopwords using the StopWordsRemover().

See
<http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover>.

If a list of stopwords is not provided, the StopWordsRemover() will use
[this list of
stopwords](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words),
also shown below, by default.

`a,about,above,across,after,afterwards,again,against,all,almost,alone,along,already,also,although,always,am,among,amongst,amoungst,amount,an,and,another,any,anyhow,anyone,anything,anyway,anywhere, are,around,as,at,back,be,became,because,become,becomes,becoming,been,before,beforehand,behind,being,below,beside,besides,between,beyond,bill,both,bottom,but,by,call,can,cannot,cant,co,computer,con,could, couldnt,cry,de,describe,detail,do,done,down,due,during,each,eg,eight,either,eleven,else,elsewhere,empty,enough,etc,even,ever,every,everyone,everything,everywhere,except,few,fifteen,fify,fill,find,fire,first, five,for,former,formerly,forty,found,four,from,front,full,further,get,give,go,had,has,hasnt,have,he,hence,her,here,hereafter,hereby,herein,hereupon,hers,herself,him,himself,his,how,however,hundred,i,ie,if, in,inc,indeed,interest,into,is,it,its,itself,keep,last,latter,latterly,least,less,ltd,made,many,may,me,meanwhile,might,mill,mine,more,moreover,most,mostly,move,much,must,my,myself,name,namely,neither,never, nevertheless,next,nine,no,nobody,none,noone,nor,not,nothing,now,nowhere,of,off,often,on,once,one,only,onto,or,other,others,otherwise,our,ours,ourselves,out,over,own,part,per,perhaps,please,put,rather,re,same, see,seem,seemed,seeming,seems,serious,several,she,should,show,side,since,sincere,six,sixty,so,some,somehow,someone,something,sometime,sometimes,somewhere,still,such,system,take,ten,than,that,the,their,them, themselves,then,thence,there,thereafter,thereby,therefore,therein,thereupon,these,they,thick,thin,third,this,those,though,three,through,throughout,thru,thus,to,together,too,top,toward,towards,twelve,twenty,two, un,under,until,up,upon,us,very,via,was,we,well,were,what,whatever,when,whence,whenever,where,whereafter,whereas,whereby,wherein,whereupon,wherever,whether,which,while,whither,who,whoever,whole,whom,whose,why,will, with,within,without,would,yet,you,your,yours,yourself,yourselves`

You can use `getStopWords()` to see the list of stopwords that will be
used.

In this example, we will specify a list of stopwords for the
StopWordsRemover() to use. We do this so that we can add on to the list
later on.
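Conceptually, stopword removal is a filter over each document's token list. The sketch below uses plain Scala with a tiny, made-up stopword set; it lowercases each token before the lookup so that matching ignores case, which to our understanding is also the default behaviour of `StopWordsRemover`.

```scala
// A tiny, made-up stopword set (the real list is much longer)
val stopwordSet: Set[String] = Set("that", "this", "with", "have")

// Tokens from one hypothetical document
val docTokens: Seq[String] = Seq("this", "shuttle", "will", "orbit", "With", "have", "fuel")

// Drop every token that appears in the stopword set, ignoring case
val filtered: Seq[String] = docTokens.filterNot(t => stopwordSet.contains(t.toLowerCase))
```

Keeping the stopwords in an explicit collection like this is what makes it easy to append corpus-specific words later, as planned for the model tuning step.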

In [None]:
display(dbutils.fs.ls("dbfs:/tmp/stopwords")) // check if the file already exists from earlier wget and dbfs-load

  

[TABLE]

  

If the file `dbfs:/tmp/stopwords` already exists then skip the next two
cells, otherwise download and load it into DBFS by uncommenting and
evaluating the next two cells.

In [None]:
//%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords # uncomment '//' at the beginning and repeat only if needed again

  

>     --2020-11-18 17:10:07--  http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
>     Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
>     Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: 2237 (2.2K)
>     Saving to: ‘/tmp/stopwords’
>
>          0K ..                                                    100%  213M=0s
>
>     2020-11-18 17:10:07 (213 MB/s) - ‘/tmp/stopwords’ saved [2237/2237]

In [None]:
//%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords 

  

>     res14: Boolean = true

In [None]:
// List of stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()

  

>     stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, 
was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves)

In [None]:
stopwords.length // find the number of stopwords in the scala Array[String]

  

>     res15: Int = 319

  

Finally, we can just remove the stopwords using the `StopWordsRemover`
as follows:

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set params for StopWordsRemover
val remover = new StopWordsRemover()
.setStopWords(stopwords) // This parameter is optional
.setInputCol("tokens")
.setOutputCol("filtered")

// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)

  

>     import org.apache.spark.ml.feature.StopWordsRemover
>     remover: org.apache.spark.ml.feature.StopWordsRemover = StopWordsRemover: uid=stopWords_e3f48811ca4c, numStopWords=319, locale=en_US, caseSensitive=false
>     filtered_df: org.apache.spark.sql.DataFrame = [corpus: string, id: bigint ... 2 more fields]

  

Step 5. Vector of Token Counts
------------------------------

LDA takes in a vector of token counts as input. We can use the
`CountVectorizer()` to easily convert our text documents into vectors of
token counts.

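For each document, the `CountVectorizer` will return a sparse vector of
the form `(VocabSize, Array(Indexed Tokens), Array(Token Frequency))`:
the vocabulary size, the indices of the tokens present in the document,
and how often each of them occurs.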

Two handy parameters to note:

-   `setMinDF`: Specifies the minimum number of different documents a
    term must appear in to be included in the vocabulary.
-   `setMinTF`: Specifies the minimum number of times a term has to
    appear in a single document to be counted in that document's
    vector. It filters per-document counts and does not change the
    vocabulary.

See
<http://spark.apache.org/docs/latest/ml-features.html#countvectorizer>.
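
To make the difference between the two thresholds concrete, here is a plain-Scala sketch on a toy corpus. It is not Spark's implementation: `docs`, `buildVocab` and `countTerms` are hypothetical names. It assumes that the document-frequency threshold acts when the vocabulary is built and the term-frequency threshold acts per document when counts are produced.

```scala
// Plain-Scala sketch (no Spark) of the two thresholds on a toy tokenized corpus.
val docs: Seq[Seq[String]] = Seq(
  Seq("spark", "lda", "spark", "topic"),
  Seq("spark", "topic", "model"),
  Seq("lda", "topic", "topic")
)

// Document frequency: in how many different documents does each term occur?
val docFreq: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) => (term, hits.size) }

// minDF acts when the vocabulary is built: keep terms seen in at least minDF documents.
// Sorting is only for a deterministic result in this sketch.
def buildVocab(minDF: Int): Seq[String] =
  docFreq.filter { case (_, df) => df >= minDF }.keys.toSeq.sorted

// minTF acts per document: drop terms occurring fewer than minTF times in that document.
def countTerms(doc: Seq[String], vocab: Seq[String], minTF: Int): Map[String, Int] = {
  val vocabSet = vocab.toSet
  doc.filter(t => vocabSet.contains(t))
    .groupBy(identity)
    .map { case (term, hits) => (term, hits.size) }
    .filter { case (_, tf) => tf >= minTF }
}

val vocabMinDF2: Seq[String] = buildVocab(minDF = 2)
val firstDocCounts: Map[String, Int] = countTerms(docs.head, vocabMinDF2, minTF = 2)
```

Here `model` is excluded from the vocabulary because it occurs in only one document, while `lda` and `topic` stay in the vocabulary but are dropped from the first document's counts because each occurs there only once.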

In [None]:
import org.apache.spark.ml.feature.CountVectorizer

// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(10000) 
.setMinDF(5) // the minimum number of different documents a term must appear in to be included in the vocabulary.
.fit(filtered_df)

  

>     import org.apache.spark.ml.feature.CountVectorizer
>     vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = CountVectorizerModel: uid=cntVec_fa543e9ef09a, vocabularySize=6139

In [None]:
// Create vector of token counts
val countVectors = vectorizer.transform(filtered_df).select("id", "features")

  

>     countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

In [None]:
// see the first countVectors
countVectors.take(2)

  

>     res18: Array[org.apache.spark.sql.Row] = Array([0,(6139,[0,1,147,231,315,496,497,527,569,604,776,835,848,858,942,1144,1687,1980,2051,2455,2756,3060,3465,3660,4506,5434,5599],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])], [1,(6139,[0,2,43,135,188,239,712,786,936,963,1376,2144,2375,2792,5980,6078],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0])])
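
Each row above holds a sparse vector printed as `(size, [indices], [values])`. The following plain-Scala sketch shows how such a triple can be read by expanding it into a dense array. The helper `toDense` and the toy numbers are hypothetical and not taken from the notebook output.

```scala
// Plain-Scala sketch (no Spark): expand a (size, indices, values) triple into a dense array.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "each index needs exactly one value")
  val dense = Array.fill(size)(0.0)
  // Every listed index receives its count; all other positions stay 0.0.
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// A toy vocabulary of 6 terms in which terms 0, 2 and 5 occur once, twice and once.
val denseExample: Array[Double] = toDense(6, Array(0, 2, 5), Array(1.0, 2.0, 1.0))
```

With a vocabulary of several thousand terms and only a few dozen distinct tokens per document, the sparse form avoids storing thousands of zeros per row.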

  

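To use the LDA algorithm in the RDD-based MLlib library, we have to
convert the DataFrame into an RDD of `(id, vector)` pairs. We first map
each row to such a pair; the conversion to an RDD of MLlib vectors
happens just before the model is trained.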

In [None]:
// Map each Row to an (id, countVector) pair (still a Dataset; it is converted to an RDD before training)
import org.apache.spark.ml.linalg.Vector

val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }

  

>     import org.apache.spark.ml.linalg.Vector
>     lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.ml.linalg.Vector)] = [_1: bigint, _2: vector]

In [None]:
// format: Array(id, (VocabSize, Array(indexedTokens), Array(Token Frequency)))
lda_countVector.take(1)

  

>     res20: Array[(Long, org.apache.spark.ml.linalg.Vector)] = Array((0,(6139,[0,1,147,231,315,496,497,527,569,604,776,835,848,858,942,1144,1687,1980,2051,2455,2756,3060,3465,3660,4506,5434,5599],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])))

  

Let's get an overview of LDA in Spark's MLlib
---------------------------------------------

See
<http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>

Create LDA model with Online Variational Bayes
----------------------------------------------

We will now set the parameters for LDA. We will use the
OnlineLDAOptimizer() here, which implements Online Variational Bayes.

Choosing the number of topics for your LDA model requires a bit of
domain knowledge. As we know that there are 20 unique newsgroups in our
dataset, we will set numTopics to be 20.

In [None]:
val numTopics = 20

  

>     numTopics: Int = 20

  

We will set the parameters needed to build our LDA model. We can also
call `setMiniBatchFraction` on the `OnlineLDAOptimizer`, which sets the
fraction of the corpus sampled and used at each iteration. In this
example, we will set this to 0.8.

In [None]:
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Set LDA params
val lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
.setK(numTopics)
.setMaxIterations(3)
.setDocConcentration(-1) // use default values
.setTopicConcentration(-1) // use default values

  

>     import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
>     lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@4c66b969
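
The cell above combines a mini-batch fraction of 0.8 with 3 iterations. The Spark API documentation for `setMiniBatchFraction` suggests choosing it together with the iteration count so that `maxIterations * miniBatchFraction >= 1`. The plain-Scala sketch below just checks that arithmetic. `corpusSize` is a hypothetical document count, not a figure from this notebook.

```scala
// Plain-Scala sketch (no Spark): relate miniBatchFraction, maxIterations and corpus size.
val corpusSize = 10000L        // hypothetical number of documents
val miniBatchFraction = 0.8    // value used in the notebook
val maxIterations = 3          // value used in the notebook

// Expected number of documents sampled in each iteration.
val expectedDocsPerIteration: Double = corpusSize * miniBatchFraction

// Rule of thumb: the fractions summed over all iterations should reach at least 1,
// so that the total number of sampled documents is at least the corpus size.
val coversCorpus: Boolean = maxIterations * miniBatchFraction >= 1.0
```

With these values the check passes (3 × 0.8 = 2.4). Because each mini-batch is sampled at random, this is a guideline rather than a guarantee that every document is visited.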

  

Create the LDA model with Online Variational Bayes.

In [None]:
// convert ML vectors into MLlib vectors
val lda_countVector_mllib = lda_countVector.map { case (id, vector) => (id, org.apache.spark.mllib.linalg.Vectors.fromML(vector)) }.rdd

val ldaModel = lda.run(lda_countVector_mllib)

  

>     lda_countVector_mllib: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[48498] at rdd at command-2972105651606700:2
>     ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@12089f2f

  

Watch **Online Learning for Latent Dirichlet Allocation** in NIPS2010 by
Matt Hoffman (right click and open in new tab)

[![Matt Hoffman's NIPS 2010 Talk Online LDA](http://videolectures.net/nips2010_hoffman_oll/thumb.jpg)](http://videolectures.net/nips2010_hoffman_oll/)

Also see Matt Hoffman's paper on *Online Variational Bayes* for more
details (linked from the above URL):
<http://videolectures.net/site/normal_dl/tag=83534/nips2010_1291.pdf>

Note that training with the `OnlineLDAOptimizer` returns a
[LocalLDAModel](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LocalLDAModel),
which stores the inferred topics of your corpus.

Review Topics
-------------

We can now review the results of our LDA model. We will print out all 20
topics with their corresponding term probabilities.

Note that you will get slightly different results every time you run an
LDA model since LDA includes some randomization.

Let us review the results of the LDA model with Online Variational
Bayes, step by step.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)

  

>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(71, 3, 9, 2430, 0),Array(5.456369274341092E-4, 5.15810019838564E-4, 4.43450588115824E-4, 4.075030037014698E-4, 3.830882972891452E-4)), (Array(0, 1, 10, 1756, 5),Array(7.243679860522955E-4, 5.205551337073676E-4, 4.2131462803109094E-4, 3.875262232249066E-4, 3.840686898109464E-4)), (Array(181, 5, 1, 0, 39),Array(8.601464700591097E-4, 6.752902103880592E-4, 6.500562200737869E-4, 6.038070585805384E-4, 5.893362076984842E-4)), (Array(0, 1, 3, 6, 2),Array(0.005005798229088234, 0.0039983913839258195, 0.0030948923490935853, 0.003071089142700929, 0.0028386551849145965)), (Array(122, 232, 257, 451, 495),Array(5.305023692660539E-4, 5.279226283858802E-4, 5.137467563830091E-4, 4.7981072504714553E-4, 4.564660855609454E-4)), (Array(2, 3, 9, 390, 22),Array(4.6467793114745464E-4, 3.70607533768043E-4, 3.546696742137416E-4, 3.350392679578855E-4, 3.306658765614441E-4)), (Array(4, 1, 0, 5, 12),Array(0.003038256698342116, 0.0025078324323337996, 0.0024034144983831785, 0.0022747193799927522, 0.0021991877710786406)), (Array(0, 1, 3, 102, 4),Array(9.341678651682648E-4, 8.96517078104752E-4, 7.040412369499342E-4, 6.565548064091027E-4, 5.871756987318273E-4)), (Array(34, 16, 5, 2, 7),Array(0.0017314262265145416, 0.0010029471108932626, 9.453341953532564E-4, 8.741294235913222E-4, 8.579646992984083E-4)), (Array(2, 5, 127, 17, 35),Array(7.401320305255691E-4, 5.25437562171766E-4, 5.128243843308119E-4, 4.117174665497465E-4, 4.075891961744765E-4)), (Array(0, 1, 2504, 34, 4),Array(4.7388749728623464E-4, 4.063993264389524E-4, 4.0457908424546885E-4, 3.736117381344537E-4, 3.5686334299962655E-4)), (Array(125, 257, 2, 6, 5),Array(6.608611311012047E-4, 5.941215600203021E-4, 5.39101994589478E-4, 4.8079125727937115E-4, 4.7875489474363353E-4)), (Array(19, 56, 45, 139, 225),Array(4.482490888369106E-4, 4.014929284068151E-4, 3.080215302055523E-4, 2.9193082749794666E-4, 2.749459651495319E-4)), (Array(122, 5, 3, 0, 4),Array(0.0011153026386181404, 
0.0011150472940365211, 9.95033276109429E-4, 9.583586599519659E-4, 8.923096480763279E-4)), (Array(326, 2, 611, 65, 41),Array(0.0015679031630728422, 0.0013082699154999442, 0.0013079468731776323, 0.001163621321570609, 0.0010655665993843193)), (Array(0, 1, 3, 2, 8),Array(0.002220009353761187, 0.0018989888620129102, 0.0010983058647170312, 0.001064361342260216, 0.001031614455222169)), (Array(126, 72, 1, 6, 0),Array(3.754264893405148E-4, 3.6839238147878475E-4, 3.6529313261814616E-4, 3.5514886560695136E-4, 3.4389161587012576E-4)), (Array(1469, 26, 2101, 1497, 423),Array(5.16094156620075E-4, 4.2699461994707256E-4, 4.2366509186811575E-4, 4.1923415520213754E-4, 4.100231439028559E-4)), (Array(19, 0, 205, 33, 1),Array(0.0013520486449812353, 6.701093744767828E-4, 6.028781386866265E-4, 5.900464637271989E-4, 5.82110328705979E-4)), (Array(437, 39, 463, 552, 21),Array(6.153757083907716E-4, 5.663741043989644E-4, 5.329968840969642E-4, 4.92706149452575E-4, 4.7513914457872553E-4)))

In [None]:
val vocabList = vectorizer.vocabulary

  

>     vocabList: Array[String] = Array(writes, article, people, just, know, like, think, does, time, good, make, used, windows, want, work, right, problem, need, really, image, said, data, going, information, better, believe, using, software, years, year, mail, sure, point, thanks, drive, program, available, space, file, power, help, government, things, question, doesn, number, case, world, look, read, line, version, come, thing, long, different, jpeg, best, fact, university, real, probably, didn, course, true, state, files, high, possible, actually, 1993, list, game, little, news, group, david, send, tell, wrong, graphics, based, support, able, place, free, called, subject, post, john, reason, color, second, great, card, having, public, email, info, following, start, hard, science, says, example, means, code, evidence, person, maybe, note, general, president, heard, quite, problems, mean, source, systems, life, price, order, window, standard, access, jesus, claim, paul, getting, looking, control, trying, disk, seen, simply, times, book, team, play, chip, local, encryption, idea, truth, opinions, issue, given, research, church, images, wrote, display, large, makes, remember, thought, national, doing, format, away, nasa, change, human, home, saying, small, mark, interested, current, internet, today, area, word, original, agree, left, memory, machine, works, microsoft, instead, working, hardware, kind, request, higher, sort, programs, questions, money, entry, later, israel, mike, pretty, hand, guess, include, netcom, address, technology, matter, cause, uiuc, type, video, speed, wire, days, server, usually, view, april, open, package, earth, stuff, unless, christian, told, important, similar, house, major, size, faith, known, provide, phone, body, michael, rights, ground, health, american, apple, feel, including, center, answer, bible, user, cost, text, lines, understand, check, anybody, security, mind, care, copy, wouldn, live, started, certainly, network, women, 
level, mouse, running, message, study, clinton, making, position, company, came, board, screen, groups, talking, single, common, white, test, wiring, christians, monitor, likely, black, special, quality, light, effect, nice, medical, members, certain, hope, sources, uucp, posted, canada, fine, hear, cars, write, clear, difference, police, love, history, couple, build, launch, press, situation, books, jewish, specific, sense, words, particular, anti, stop, posting, unix, talk, model, religion, discussion, school, contact, private, frank, turkish, keys, built, cable, taking, simple, legal, sound, consider, features, service, short, date, night, reference, argument, tools, comes, children, application, comments, device, scsi, clipper, applications, jews, doubt, tried, force, process, theory, objective, games, usenet, self, experience, steve, early, expect, needed, uses, tape, manager, interesting, station, killed, easy, value, death, exactly, turn, correct, response, needs, ones, according, amiga, drug, considered, language, reading, james, states, wanted, shuttle, goes, koresh, term, insurance, personal, strong, past, form, opinion, taken, result, future, sorry, mentioned, rules, especially, religious, hell, country, design, happy, went, society, plus, drivers, written, guns, various, author, haven, asked, results, analysis, gets, latest, longer, parts, advance, aren, previous, cases, york, laws, main, section, accept, input, looks, week, christ, weapons, required, mode, washington, community, robert, numbers, disease, head, fast, option, series, circuit, offer, macintosh, driver, office, israeli, range, exist, venus, andrew, period, clock, players, runs, values, department, moral, allow, organization, toronto, involved, knows, picture, colors, brian, sell, half, months, choice, dave, armenians, takes, currently, suggest, wasn, hockey, object, took, includes, individual, cards, federal, candida, policy, directly, total, title, protect, follow, americans, equipment, 
assume, close, food, purpose, recently, statement, present, devices, happened, deal, users, media, provides, happen, scientific, christianity, require, reasons, shall, dead, lost, action, speak, road, couldn, goal, bike, save, george, wants, city, details, california, mission, voice, useful, baseball, lead, obviously, completely, condition, complete, court, uunet, easily, terms, batf, engineering, league, responsible, administration, ways, international, compatible, sent, clearly, rest, algorithm, water, disclaimer, output, appreciated, freedom, digital, kill, issues, business, pass, hours, figure, error, fans, newsgroup, coming, operating, average, project, deleted, context, processing, companies, story, trade, appropriate, events, leave, port, berkeley, carry, season, face, basis, final, requires, building, heart, performance, difficult, addition, convert, political, page, lower, environment, player, king, points, armenian, volume, actual, resolution, field, willing, knowledge, apply, related, stanford, suppose, site, sale, advice, commercial, sounds, worth, orbit, lots, claims, limited, defense, entries, basic, supposed, designed, explain, directory, anonymous, handle, inside, ability, included, signal, young, turkey, family, reply, enforcement, radio, necessary, programming, wonder, suspect, wait, changes, neutral, forget, services, shot, greek, month, create, installed, printer, paper, friend, thinking, understanding, homosexuality, natural, morality, russian, finally, land, formats, names, machines, report, peter, setting, population, hold, break, comment, homosexual, normal, interface, eric, miles, product, rutgers, logic, reasonable, arab, communications, comp, percent, escrow, avoid, room, east, supply, types, lives, colorado, secure, million, developed, peace, cancer, multiple, allowed, library, cubs, expensive, agencies, cheap, recent, gary, soon, event, gives, soviet, looked, mention, supported, technical, street, caused, physics, happens, suggestions, 
doctor, release, obvious, choose, development, print, generally, outside, treatment, entire, bitnet, radar, chance, mass, table, friends, return, archive, install, folks, morning, member, electrical, illegal, diet, ideas, exists, muslim, jack, meaning, united, wish, smith, trouble, weeks, areas, social, concept, requests, straight, child, learn, supports, behavior, stand, engine, bring, thank, worked, unit, reality, remove, asking, appear, provided, pick, studies, possibly, practice, answers, drives, attempt, motif, west, modem, henry, trust, bits, existence, changed, decided, near, middle, belief, compound, continue, errors, false, extra, guys, arguments, proper, congress, particularly, class, yeah, safe, facts, loss, contains, thread, function, manual, attack, fonts, aware, privacy, andy, pages, operations, appears, worse, heat, command, drugs, wide, stupid, nature, constitution, institute, frame, armenia, wall, distribution, approach, hands, speaking, unfortunately, conference, independent, edge, division, shouldn, knew, effective, serial, added, compression, safety, crime, shows, indiana, bought, 1990, turks, modern, civil, ethernet, solution, 1992, abortion, cramer, blood, blue, letter, plastic, spend, allows, hello, utility, rate, appreciate, regular, writing, floppy, wondering, virginia, germany, simms, gave, operation, record, internal, faster, arms, giving, views, switch, tool, decision, playing, step, atheism, additional, method, described, base, concerned, stated, surface, kids, played, articles, scott, actions, font, capability, places, products, attitude, costs, patients, prevent, controller, fair, rule, buying, late, quote, highly, military, considering, keith, resources, cover, levels, connected, north, hate, countries, excellent, poor, market, necessarily, wires, created, shell, western, america, valid, turned, apparently, brought, functions, account, received, creation, watch, majority, cwru, driving, released, authority, committee, chips, quick, 
forward, student, protection, calls, richard, boston, complex, visual, absolutely, sold, arizona, produce, notice, plan, moon, minutes, lord, arabs, properly, fairly, boxes, murder, keyboard, greatly, killing, vote, panel, rangers, options, shareware)

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

  

>     topics: Array[Array[(String, Double)]] = Array(Array((list,5.456369274341092E-4), (just,5.15810019838564E-4), (good,4.43450588115824E-4), (pope,4.075030037014698E-4), (writes,3.830882972891452E-4)), Array((writes,7.243679860522955E-4), (article,5.205551337073676E-4), (make,4.2131462803109094E-4), (mormons,3.875262232249066E-4), (like,3.840686898109464E-4)), Array((working,8.601464700591097E-4), (like,6.752902103880592E-4), (article,6.500562200737869E-4), (writes,6.038070585805384E-4), (power,5.893362076984842E-4)), Array((writes,0.005005798229088234), (article,0.0039983913839258195), (just,0.0030948923490935853), (think,0.003071089142700929), (people,0.0028386551849145965)), Array((window,5.305023692660539E-4), (ground,5.279226283858802E-4), (women,5.137467563830091E-4), (option,4.7981072504714553E-4), (candida,4.564660855609454E-4)), Array((people,4.6467793114745464E-4), (just,3.70607533768043E-4), (good,3.546696742137416E-4), (shuttle,3.350392679578855E-4), (going,3.306658765614441E-4)), Array((know,0.003038256698342116), (article,0.0025078324323337996), (writes,0.0024034144983831785), (like,0.0022747193799927522), (windows,0.0021991877710786406)), Array((writes,9.341678651682648E-4), (article,8.96517078104752E-4), (just,7.040412369499342E-4), (science,6.565548064091027E-4), (know,5.871756987318273E-4)), Array((drive,0.0017314262265145416), (problem,0.0010029471108932626), (like,9.453341953532564E-4), (people,8.741294235913222E-4), (does,8.579646992984083E-4)), Array((people,7.401320305255691E-4), (like,5.25437562171766E-4), (paul,5.128243843308119E-4), (need,4.117174665497465E-4), (program,4.075891961744765E-4)), Array((writes,4.7388749728623464E-4), (article,4.063993264389524E-4), (henrik,4.0457908424546885E-4), (drive,3.736117381344537E-4), (know,3.5686334299962655E-4)), Array((jesus,6.608611311012047E-4), (women,5.941215600203021E-4), (people,5.39101994589478E-4), (think,4.8079125727937115E-4), (like,4.7875489474363353E-4)), 
Array((image,4.482490888369106E-4), (jpeg,4.014929284068151E-4), (number,3.080215302055523E-4), (chip,2.9193082749794666E-4), (faith,2.749459651495319E-4)), Array((window,0.0011153026386181404), (like,0.0011150472940365211), (just,9.95033276109429E-4), (writes,9.583586599519659E-4), (know,8.923096480763279E-4)), Array((turkish,0.0015679031630728422), (people,0.0013082699154999442), (armenian,0.0013079468731776323), (state,0.001163621321570609), (government,0.0010655665993843193)), Array((writes,0.002220009353761187), (article,0.0018989888620129102), (just,0.0010983058647170312), (people,0.001064361342260216), (time,0.001031614455222169)), Array((claim,3.754264893405148E-4), (game,3.6839238147878475E-4), (article,3.6529313261814616E-4), (think,3.5514886560695136E-4), (writes,3.4389161587012576E-4)), Array((sequence,5.16094156620075E-4), (using,4.2699461994707256E-4), (protein,4.2366509186811575E-4), (biology,4.1923415520213754E-4), (analysis,4.100231439028559E-4)), Array((image,0.0013520486449812353), (writes,6.701093744767828E-4), (video,6.028781386866265E-4), (thanks,5.900464637271989E-4), (article,5.82110328705979E-4)), Array((input,6.153757083907716E-4), (power,5.663741043989644E-4), (period,5.329968840969642E-4), (league,4.92706149452575E-4), (data,4.7513914457872553E-4)))

  

Feel free to take things apart to understand!

In [None]:
topicIndices(0)

  

>     res21: (Array[Int], Array[Double]) = (Array(71, 3, 9, 2430, 0),Array(5.456369274341092E-4, 5.15810019838564E-4, 4.43450588115824E-4, 4.075030037014698E-4, 3.830882972891452E-4))

In [None]:
topicIndices(0)._1

  

>     res22: Array[Int] = Array(71, 3, 9, 2430, 0)

In [None]:
topicIndices(0)._1(0)

  

>     res23: Int = 71

In [None]:
vocabList(topicIndices(0)._1(0))

  

>     res24: String = list
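
The same index-to-term lookup can be tried on toy data without a trained model. In this plain-Scala sketch, `toyVocab`, `toyTopics` and `topicSummary` are hypothetical names with made-up values. Only the shape, one `(termIndices, termWeights)` pair per topic, mirrors `topicIndices`.

```scala
// Plain-Scala sketch (no Spark): index-to-term lookup on toy data.
val toyVocab: Array[String] = Array("writes", "article", "image", "jpeg")

// Same shape as topicIndices: one (termIndices, termWeights) pair per topic.
val toyTopics: Array[(Array[Int], Array[Double])] = Array(
  (Array(2, 3), Array(0.4, 0.3)),
  (Array(0, 1), Array(0.5, 0.2))
)

// Replace each term index by its vocabulary entry and render one line per topic.
def topicSummary(topic: (Array[Int], Array[Double]), vocab: Array[String]): String = {
  val (termIndices, termWeights) = topic
  termIndices.map(i => vocab(i)).zip(termWeights)
    .map { case (term, weight) => s"$term:$weight" }
    .mkString(", ")
}

val summaries: Array[String] = toyTopics.map(t => topicSummary(t, toyVocab))
```

A one-line summary per topic like this can be easier to scan than the nested arrays when comparing many topics side by side.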

  

Review the results of the LDA model with Online Variational Bayes,
doing all four earlier steps at once.

In [None]:
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     list	5.456369274341092E-4
>     just	5.15810019838564E-4
>     good	4.43450588115824E-4
>     pope	4.075030037014698E-4
>     writes	3.830882972891452E-4
>     ==========
>     TOPIC 1
>     writes	7.243679860522955E-4
>     article	5.205551337073676E-4
>     make	4.2131462803109094E-4
>     mormons	3.875262232249066E-4
>     like	3.840686898109464E-4
>     ==========
>     TOPIC 2
>     working	8.601464700591097E-4
>     like	6.752902103880592E-4
>     article	6.500562200737869E-4
>     writes	6.038070585805384E-4
>     power	5.893362076984842E-4
>     ==========
>     TOPIC 3
>     writes	0.005005798229088234
>     article	0.0039983913839258195
>     just	0.0030948923490935853
>     think	0.003071089142700929
>     people	0.0028386551849145965
>     ==========
>     TOPIC 4
>     window	5.305023692660539E-4
>     ground	5.279226283858802E-4
>     women	5.137467563830091E-4
>     option	4.7981072504714553E-4
>     candida	4.564660855609454E-4
>     ==========
>     TOPIC 5
>     people	4.6467793114745464E-4
>     just	3.70607533768043E-4
>     good	3.546696742137416E-4
>     shuttle	3.350392679578855E-4
>     going	3.306658765614441E-4
>     ==========
>     TOPIC 6
>     know	0.003038256698342116
>     article	0.0025078324323337996
>     writes	0.0024034144983831785
>     like	0.0022747193799927522
>     windows	0.0021991877710786406
>     ==========
>     TOPIC 7
>     writes	9.341678651682648E-4
>     article	8.96517078104752E-4
>     just	7.040412369499342E-4
>     science	6.565548064091027E-4
>     know	5.871756987318273E-4
>     ==========
>     TOPIC 8
>     drive	0.0017314262265145416
>     problem	0.0010029471108932626
>     like	9.453341953532564E-4
>     people	8.741294235913222E-4
>     does	8.579646992984083E-4
>     ==========
>     TOPIC 9
>     people	7.401320305255691E-4
>     like	5.25437562171766E-4
>     paul	5.128243843308119E-4
>     need	4.117174665497465E-4
>     program	4.075891961744765E-4
>     ==========
>     TOPIC 10
>     writes	4.7388749728623464E-4
>     article	4.063993264389524E-4
>     henrik	4.0457908424546885E-4
>     drive	3.736117381344537E-4
>     know	3.5686334299962655E-4
>     ==========
>     TOPIC 11
>     jesus	6.608611311012047E-4
>     women	5.941215600203021E-4
>     people	5.39101994589478E-4
>     think	4.8079125727937115E-4
>     like	4.7875489474363353E-4
>     ==========
>     TOPIC 12
>     image	4.482490888369106E-4
>     jpeg	4.014929284068151E-4
>     number	3.080215302055523E-4
>     chip	2.9193082749794666E-4
>     faith	2.749459651495319E-4
>     ==========
>     TOPIC 13
>     window	0.0011153026386181404
>     like	0.0011150472940365211
>     just	9.95033276109429E-4
>     writes	9.583586599519659E-4
>     know	8.923096480763279E-4
>     ==========
>     TOPIC 14
>     turkish	0.0015679031630728422
>     people	0.0013082699154999442
>     armenian	0.0013079468731776323
>     state	0.001163621321570609
>     government	0.0010655665993843193
>     ==========
>     TOPIC 15
>     writes	0.002220009353761187
>     article	0.0018989888620129102
>     just	0.0010983058647170312
>     people	0.001064361342260216
>     time	0.001031614455222169
>     ==========
>     TOPIC 16
>     claim	3.754264893405148E-4
>     game	3.6839238147878475E-4
>     article	3.6529313261814616E-4
>     think	3.5514886560695136E-4
>     writes	3.4389161587012576E-4
>     ==========
>     TOPIC 17
>     sequence	5.16094156620075E-4
>     using	4.2699461994707256E-4
>     protein	4.2366509186811575E-4
>     biology	4.1923415520213754E-4
>     analysis	4.100231439028559E-4
>     ==========
>     TOPIC 18
>     image	0.0013520486449812353
>     writes	6.701093744767828E-4
>     video	6.028781386866265E-4
>     thanks	5.900464637271989E-4
>     article	5.82110328705979E-4
>     ==========
>     TOPIC 19
>     input	6.153757083907716E-4
>     power	5.663741043989644E-4
>     period	5.329968840969642E-4
>     league	4.92706149452575E-4
>     data	4.7513914457872553E-4
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(71, 3, 9, 2430, 0),Array(5.456369274341092E-4, 5.15810019838564E-4, 4.43450588115824E-4, 4.075030037014698E-4, 3.830882972891452E-4)), (Array(0, 1, 10, 1756, 5),Array(7.243679860522955E-4, 5.205551337073676E-4, 4.2131462803109094E-4, 3.875262232249066E-4, 3.840686898109464E-4)), (Array(181, 5, 1, 0, 39),Array(8.601464700591097E-4, 6.752902103880592E-4, 6.500562200737869E-4, 6.038070585805384E-4, 5.893362076984842E-4)), (Array(0, 1, 3, 6, 2),Array(0.005005798229088234, 0.0039983913839258195, 0.0030948923490935853, 0.003071089142700929, 0.0028386551849145965)), (Array(122, 232, 257, 451, 495),Array(5.305023692660539E-4, 5.279226283858802E-4, 5.137467563830091E-4, 4.7981072504714553E-4, 4.564660855609454E-4)), (Array(2, 3, 9, 390, 22),Array(4.6467793114745464E-4, 3.70607533768043E-4, 3.546696742137416E-4, 3.350392679578855E-4, 3.306658765614441E-4)), (Array(4, 1, 0, 5, 12),Array(0.003038256698342116, 0.0025078324323337996, 0.0024034144983831785, 0.0022747193799927522, 0.0021991877710786406)), (Array(0, 1, 3, 102, 4),Array(9.341678651682648E-4, 8.96517078104752E-4, 7.040412369499342E-4, 6.565548064091027E-4, 5.871756987318273E-4)), (Array(34, 16, 5, 2, 7),Array(0.0017314262265145416, 0.0010029471108932626, 9.453341953532564E-4, 8.741294235913222E-4, 8.579646992984083E-4)), (Array(2, 5, 127, 17, 35),Array(7.401320305255691E-4, 5.25437562171766E-4, 5.128243843308119E-4, 4.117174665497465E-4, 4.075891961744765E-4)), (Array(0, 1, 2504, 34, 4),Array(4.7388749728623464E-4, 4.063993264389524E-4, 4.0457908424546885E-4, 3.736117381344537E-4, 3.5686334299962655E-4)), (Array(125, 257, 2, 6, 5),Array(6.608611311012047E-4, 5.941215600203021E-4, 5.39101994589478E-4, 4.8079125727937115E-4, 4.7875489474363353E-4)), (Array(19, 56, 45, 139, 225),Array(4.482490888369106E-4, 4.014929284068151E-4, 3.080215302055523E-4, 2.9193082749794666E-4, 2.749459651495319E-4)), (Array(122, 5, 3, 0, 4),Array(0.0011153026386181404, 
0.0011150472940365211, 9.95033276109429E-4, 9.583586599519659E-4, 8.923096480763279E-4)), (Array(326, 2, 611, 65, 41),Array(0.0015679031630728422, 0.0013082699154999442, 0.0013079468731776323, 0.001163621321570609, 0.0010655665993843193)), (Array(0, 1, 3, 2, 8),Array(0.002220009353761187, 0.0018989888620129102, 0.0010983058647170312, 0.001064361342260216, 0.001031614455222169)), (Array(126, 72, 1, 6, 0),Array(3.754264893405148E-4, 3.6839238147878475E-4, 3.6529313261814616E-4, 3.5514886560695136E-4, 3.4389161587012576E-4)), (Array(1469, 26, 2101, 1497, 423),Array(5.16094156620075E-4, 4.2699461994707256E-4, 4.2366509186811575E-4, 4.1923415520213754E-4, 4.100231439028559E-4)), (Array(19, 0, 205, 33, 1),Array(0.0013520486449812353, 6.701093744767828E-4, 6.028781386866265E-4, 5.900464637271989E-4, 5.82110328705979E-4)), (Array(437, 39, 463, 552, 21),Array(6.153757083907716E-4, 5.663741043989644E-4, 5.329968840969642E-4, 4.92706149452575E-4, 4.7513914457872553E-4)))
>     vocabList: Array[String] = Array(writes, article, people, just, know, like, think, does, time, good, make, used, windows, want, work, right, problem, need, really, image, said, data, going, information, better, believe, using, software, years, year, mail, sure, point, thanks, drive, program, available, space, file, power, help, government, things, question, doesn, number, case, world, look, read, line, version, come, thing, long, different, jpeg, best, fact, university, real, probably, didn, course, true, state, files, high, possible, actually, 1993, list, game, little, news, group, david, send, tell, wrong, graphics, based, support, able, place, free, called, subject, post, john, reason, color, second, great, card, having, public, email, info, following, start, hard, science, says, example, means, code, evidence, person, maybe, note, general, president, heard, quite, problems, mean, source, systems, life, price, order, window, standard, access, jesus, claim, paul, getting, looking, control, trying, disk, seen, simply, times, book, team, play, chip, local, encryption, idea, truth, opinions, issue, given, research, church, images, wrote, display, large, makes, remember, thought, national, doing, format, away, nasa, change, human, home, saying, small, mark, interested, current, internet, today, area, word, original, agree, left, memory, machine, works, microsoft, instead, working, hardware, kind, request, higher, sort, programs, questions, money, entry, later, israel, mike, pretty, hand, guess, include, netcom, address, technology, matter, cause, uiuc, type, video, speed, wire, days, server, usually, view, april, open, package, earth, stuff, unless, christian, told, important, similar, house, major, size, faith, known, provide, phone, body, michael, rights, ground, health, american, apple, feel, including, center, answer, bible, user, cost, text, lines, understand, check, anybody, security, mind, care, copy, wouldn, live, started, certainly, network, women, 
level, mouse, running, message, study, clinton, making, position, company, came, board, screen, groups, talking, single, common, white, test, wiring, christians, monitor, likely, black, special, quality, light, effect, nice, medical, members, certain, hope, sources, uucp, posted, canada, fine, hear, cars, write, clear, difference, police, love, history, couple, build, launch, press, situation, books, jewish, specific, sense, words, particular, anti, stop, posting, unix, talk, model, religion, discussion, school, contact, private, frank, turkish, keys, built, cable, taking, simple, legal, sound, consider, features, service, short, date, night, reference, argument, tools, comes, children, application, comments, device, scsi, clipper, applications, jews, doubt, tried, force, process, theory, objective, games, usenet, self, experience, steve, early, expect, needed, uses, tape, manager, interesting, station, killed, easy, value, death, exactly, turn, correct, response, needs, ones, according, amiga, drug, considered, language, reading, james, states, wanted, shuttle, goes, koresh, term, insurance, personal, strong, past, form, opinion, taken, result, future, sorry, mentioned, rules, especially, religious, hell, country, design, happy, went, society, plus, drivers, written, guns, various, author, haven, asked, results, analysis, gets, latest, longer, parts, advance, aren, previous, cases, york, laws, main, section, accept, input, looks, week, christ, weapons, required, mode, washington, community, robert, numbers, disease, head, fast, option, series, circuit, offer, macintosh, driver, office, israeli, range, exist, venus, andrew, period, clock, players, runs, values, department, moral, allow, organization, toronto, involved, knows, picture, colors, brian, sell, half, months, choice, dave, armenians, takes, currently, suggest, wasn, hockey, object, took, includes, individual, cards, federal, candida, policy, directly, total, title, protect, follow, americans, equipment, 
assume, close, food, purpose, recently, statement, present, devices, happened, deal, users, media, provides, happen, scientific, christianity, require, reasons, shall, dead, lost, action, speak, road, couldn, goal, bike, save, george, wants, city, details, california, mission, voice, useful, baseball, lead, obviously, completely, condition, complete, court, uunet, easily, terms, batf, engineering, league, responsible, administration, ways, international, compatible, sent, clearly, rest, algorithm, water, disclaimer, output, appreciated, freedom, digital, kill, issues, business, pass, hours, figure, error, fans, newsgroup, coming, operating, average, project, deleted, context, processing, companies, story, trade, appropriate, events, leave, port, berkeley, carry, season, face, basis, final, requires, building, heart, performance, difficult, addition, convert, political, page, lower, environment, player, king, points, armenian, volume, actual, resolution, field, willing, knowledge, apply, related, stanford, suppose, site, sale, advice, commercial, sounds, worth, orbit, lots, claims, limited, defense, entries, basic, supposed, designed, explain, directory, anonymous, handle, inside, ability, included, signal, young, turkey, family, reply, enforcement, radio, necessary, programming, wonder, suspect, wait, changes, neutral, forget, services, shot, greek, month, create, installed, printer, paper, friend, thinking, understanding, homosexuality, natural, morality, russian, finally, land, formats, names, machines, report, peter, setting, population, hold, break, comment, homosexual, normal, interface, eric, miles, product, rutgers, logic, reasonable, arab, communications, comp, percent, escrow, avoid, room, east, supply, types, lives, colorado, secure, million, developed, peace, cancer, multiple, allowed, library, cubs, expensive, agencies, cheap, recent, gary, soon, event, gives, soviet, looked, mention, supported, technical, street, caused, physics, happens, suggestions, 
doctor, release, obvious, choose, development, print, generally, outside, treatment, entire, bitnet, radar, chance, mass, table, friends, return, archive, install, folks, morning, member, electrical, illegal, diet, ideas, exists, muslim, jack, meaning, united, wish, smith, trouble, weeks, areas, social, concept, requests, straight, child, learn, supports, behavior, stand, engine, bring, thank, worked, unit, reality, remove, asking, appear, provided, pick, studies, possibly, practice, answers, drives, attempt, motif, west, modem, henry, trust, bits, existence, changed, decided, near, middle, belief, compound, continue, errors, false, extra, guys, arguments, proper, congress, particularly, class, yeah, safe, facts, loss, contains, thread, function, manual, attack, fonts, aware, privacy, andy, pages, operations, appears, worse, heat, command, drugs, wide, stupid, nature, constitution, institute, frame, armenia, wall, distribution, approach, hands, speaking, unfortunately, conference, independent, edge, division, shouldn, knew, effective, serial, added, compression, safety, crime, shows, indiana, bought, 1990, turks, modern, civil, ethernet, solution, 1992, abortion, cramer, blood, blue, letter, plastic, spend, allows, hello, utility, rate, appreciate, regular, writing, floppy, wondering, virginia, germany, simms, gave, operation, record, internal, faster, arms, giving, views, switch, tool, decision, playing, step, atheism, additional, method, described, base, concerned, stated, surface, kids, played, articles, scott, actions, font, capability, places, products, attitude, costs, patients, prevent, controller, fair, rule, buying, late, quote, highly, military, considering, keith, resources, cover, levels, connected, north, hate, countries, excellent, poor, market, necessarily, wires, created, shell, western, america, valid, turned, apparently, brought, functions, account, received, creation, watch, majority, cwru, driving, released, authority, committee, chips, quick, 
forward, student, protection, calls, richard, boston, complex, visual, absolutely, sold, arizona, produce, notice, plan, moon, minutes, lord, arabs, properly, fairly, boxes, murder, keyboard, greatly, killing, vote, panel, rangers, options, shareware)
>     topics: Array[Array[(String, Double)]] = Array(Array((list,5.456369274341092E-4), (just,5.15810019838564E-4), (good,4.43450588115824E-4), (pope,4.075030037014698E-4), (writes,3.830882972891452E-4)), Array((writes,7.243679860522955E-4), (article,5.205551337073676E-4), (make,4.2131462803109094E-4), (mormons,3.875262232249066E-4), (like,3.840686898109464E-4)), Array((working,8.601464700591097E-4), (like,6.752902103880592E-4), (article,6.500562200737869E-4), (writes,6.038070585805384E-4), (power,5.893362076984842E-4)), Array((writes,0.005005798229088234), (article,0.0039983913839258195), (just,0.0030948923490935853), (think,0.003071089142700929), (people,0.0028386551849145965)), Array((window,5.305023692660539E-4), (ground,5.279226283858802E-4), (women,5.137467563830091E-4), (option,4.7981072504714553E-4), (candida,4.564660855609454E-4)), Array((people,4.6467793114745464E-4), (just,3.70607533768043E-4), (good,3.546696742137416E-4), (shuttle,3.350392679578855E-4), (going,3.306658765614441E-4)), Array((know,0.003038256698342116), (article,0.0025078324323337996), (writes,0.0024034144983831785), (like,0.0022747193799927522), (windows,0.0021991877710786406)), Array((writes,9.341678651682648E-4), (article,8.96517078104752E-4), (just,7.040412369499342E-4), (science,6.565548064091027E-4), (know,5.871756987318273E-4)), Array((drive,0.0017314262265145416), (problem,0.0010029471108932626), (like,9.453341953532564E-4), (people,8.741294235913222E-4), (does,8.579646992984083E-4)), Array((people,7.401320305255691E-4), (like,5.25437562171766E-4), (paul,5.128243843308119E-4), (need,4.117174665497465E-4), (program,4.075891961744765E-4)), Array((writes,4.7388749728623464E-4), (article,4.063993264389524E-4), (henrik,4.0457908424546885E-4), (drive,3.736117381344537E-4), (know,3.5686334299962655E-4)), Array((jesus,6.608611311012047E-4), (women,5.941215600203021E-4), (people,5.39101994589478E-4), (think,4.8079125727937115E-4), (like,4.7875489474363353E-4)), 
Array((image,4.482490888369106E-4), (jpeg,4.014929284068151E-4), (number,3.080215302055523E-4), (chip,2.9193082749794666E-4), (faith,2.749459651495319E-4)), Array((window,0.0011153026386181404), (like,0.0011150472940365211), (just,9.95033276109429E-4), (writes,9.583586599519659E-4), (know,8.923096480763279E-4)), Array((turkish,0.0015679031630728422), (people,0.0013082699154999442), (armenian,0.0013079468731776323), (state,0.001163621321570609), (government,0.0010655665993843193)), Array((writes,0.002220009353761187), (article,0.0018989888620129102), (just,0.0010983058647170312), (people,0.001064361342260216), (time,0.001031614455222169)), Array((claim,3.754264893405148E-4), (game,3.6839238147878475E-4), (article,3.6529313261814616E-4), (think,3.5514886560695136E-4), (writes,3.4389161587012576E-4)), Array((sequence,5.16094156620075E-4), (using,4.2699461994707256E-4), (protein,4.2366509186811575E-4), (biology,4.1923415520213754E-4), (analysis,4.100231439028559E-4)), Array((image,0.0013520486449812353), (writes,6.701093744767828E-4), (video,6.028781386866265E-4), (thanks,5.900464637271989E-4), (article,5.82110328705979E-4)), Array((input,6.153757083907716E-4), (power,5.663741043989644E-4), (period,5.329968840969642E-4), (league,4.92706149452575E-4), (data,4.7513914457872553E-4)))

  

Going through the results, you may notice that some of the topic words
returned are actually stopwords that are specific to our dataset (e.g.,
"writes" and "article"). Let's try improving our model.

Step 8. Model Tuning - Refilter Stopwords
-----------------------------------------

We will try to improve the results of our model by identifying some
stopwords that are specific to our dataset. We will filter these
stopwords out and rerun our LDA model to see if we get better results.
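
One way to spot such dataset-specific stopwords is to look for terms that show up among the top terms of many topics at once. The following is a minimal plain-Scala sketch of that idea, not part of the original notebook: `termsSharedAcrossTopics` and `sampleTopics` are hypothetical names, and the sample data is the first three topics from the output above with weights rounded.

```scala
// Hypothetical helper: count in how many topics each term appears among the
// top terms, and keep the terms that appear in at least `minTopics` topics.
def termsSharedAcrossTopics(
    topics: Seq[Seq[(String, Double)]],
    minTopics: Int
): Seq[(String, Int)] =
  topics
    .flatMap(_.map(_._1).distinct)            // one vote per term per topic
    .groupBy(identity)
    .map { case (term, hits) => (term, hits.size) }
    .filter { case (_, n) => n >= minTopics }
    .toSeq
    .sortBy { case (term, n) => (-n, term) }  // most widespread terms first

// First three topics from the output above (weights rounded for brevity).
val sampleTopics: Seq[Seq[(String, Double)]] = Seq(
  Seq(("list", 5.46e-4), ("just", 5.16e-4), ("good", 4.43e-4), ("pope", 4.08e-4), ("writes", 3.83e-4)),
  Seq(("writes", 7.24e-4), ("article", 5.21e-4), ("make", 4.21e-4), ("mormons", 3.88e-4), ("like", 3.84e-4)),
  Seq(("working", 8.60e-4), ("like", 6.75e-4), ("article", 6.50e-4), ("writes", 6.04e-4), ("power", 5.89e-4))
)

val candidates = termsSharedAcrossTopics(sampleTopics, minTopics = 2)
candidates.foreach { case (term, n) => println(s"$term appears in $n topics") }
```

Terms that surface this way are only candidates; whether a word is truly uninformative for the corpus remains a judgment call.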

In [None]:
val add_stopwords = Array("article", "writes", "entry", "date", "udel", "said", "tell", "think", "know", "just", "newsgroup", "line", "like", "does", "going", "make", "thanks")

  

>     add_stopwords: Array[String] = Array(article, writes, entry, date, udel, said, tell, think, know, just, newsgroup, line, like, does, going, make, thanks)

In [None]:
// Combine the newly identified stopwords with our existing list of stopwords
val new_stopwords = stopwords.union(add_stopwords)

  

>     new_stopwords: Array[String] = Array(a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amoungst, amount, an, and, another, any, anyhow, anyone, anything, anyway, anywhere, are, around, as, at, back, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, below, beside, besides, between, beyond, bill, both, bottom, but, by, call, can, cannot, cant, co, computer, con, could, couldnt, cry, de, describe, detail, do, done, down, due, during, each, eg, eight, either, eleven, else, elsewhere, empty, enough, etc, even, ever, every, everyone, everything, everywhere, except, few, fifteen, fify, fill, find, fire, first, five, for, former, formerly, forty, found, four, from, front, full, further, get, give, go, had, has, hasnt, have, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, him, himself, his, how, however, hundred, i, ie, if, in, inc, indeed, interest, into, is, it, its, itself, keep, last, latter, latterly, least, less, ltd, made, many, may, me, meanwhile, might, mill, mine, more, moreover, most, mostly, move, much, must, my, myself, name, namely, neither, never, nevertheless, next, nine, no, nobody, none, noone, nor, not, nothing, now, nowhere, of, off, often, on, once, one, only, onto, or, other, others, otherwise, our, ours, ourselves, out, over, own, part, per, perhaps, please, put, rather, re, same, see, seem, seemed, seeming, seems, serious, several, she, should, show, side, since, sincere, six, sixty, so, some, somehow, someone, something, sometime, sometimes, somewhere, still, such, system, take, ten, than, that, the, their, them, themselves, then, thence, there, thereafter, thereby, therefore, therein, thereupon, these, they, thick, thin, third, this, those, though, three, through, throughout, thru, thus, to, together, too, top, toward, towards, twelve, twenty, two, un, under, until, up, upon, us, very, via, 
was, we, well, were, what, whatever, when, whence, whenever, where, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, which, while, whither, who, whoever, whole, whom, whose, why, will, with, within, without, would, yet, you, your, yours, yourself, yourselves, article, writes, entry, date, udel, said, tell, think, know, just, newsgroup, line, like, does, going, make, thanks)

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

// Set Params for StopWordsRemover with new_stopwords
val remover = new StopWordsRemover()
.setStopWords(new_stopwords)
.setInputCol("tokens")
.setOutputCol("filtered")

// Create new df with new list of stopwords removed
val new_filtered_df = remover.transform(tokenized_df)

  

>     import org.apache.spark.ml.feature.StopWordsRemover
>     remover: org.apache.spark.ml.feature.StopWordsRemover = StopWordsRemover: uid=stopWords_3aeb67808752, numStopWords=336, locale=en_US, caseSensitive=false
>     new_filtered_df: org.apache.spark.sql.DataFrame = [corpus: string, id: bigint ... 2 more fields]
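
Conceptually, the remover drops every token that appears in the stopword list, and with `caseSensitive=false` (as reported in the output above) the comparison ignores case. The following plain-Scala sketch illustrates that behaviour on a toy token list; `removeStopWords` is a hypothetical helper and this is not Spark's actual implementation.

```scala
// Hypothetical stand-in for what StopWordsRemover does to one row of tokens,
// assuming case-insensitive matching.
def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] = {
  val lowered = stopWords.map(_.toLowerCase)
  tokens.filterNot(token => lowered.contains(token.toLowerCase))
}

// A few of the newly added stopwords from this notebook.
val extraStopWords = Set("article", "writes", "thanks")

val sampleTokens = Seq("Article", "writes", "jpeg", "image", "thanks")
val kept = removeStopWords(sampleTokens, extraStopWords)
println(kept.mkString(", "))
```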

In [None]:
// Set Params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(10000)
.setMinDF(5)
.fit(new_filtered_df)

// Create new df of countVectors
val new_countVectors = vectorizer.transform(new_filtered_df).select("id", "features")

  

>     vectorizer: org.apache.spark.ml.feature.CountVectorizerModel = CountVectorizerModel: uid=cntVec_b45dc41d1681, vocabularySize=6122
>     new_countVectors: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]
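
Note that the fitted vocabulary has 6122 terms even though `setVocabSize(10000)` was requested: `setMinDF(5)` first discards terms that occur in fewer than 5 documents, and the size limit only caps what remains. The sketch below mimics this on toy data in plain Scala. `buildVocab` and `countVector` are hypothetical helpers, and the alphabetical tie-breaking is an assumption made here for determinism, not something taken from Spark.

```scala
// Keep terms that occur in at least `minDF` documents, order them by total
// corpus frequency (ties broken alphabetically, an assumption of this sketch),
// and cap the vocabulary at `vocabSize` terms.
def buildVocab(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Array[String] = {
  val docFreq  = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
  val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }
  termFreq.toSeq
    .filter { case (term, _) => docFreq(term) >= minDF }
    .sortBy { case (term, count) => (-count, term) }
    .take(vocabSize)
    .map(_._1)
    .toArray
}

// Turn one document into a vector of word counts over the vocabulary;
// tokens outside the vocabulary are ignored.
def countVector(doc: Seq[String], vocab: Array[String]): Array[Int] = {
  val index  = vocab.zipWithIndex.toMap
  val counts = new Array[Int](vocab.length)
  doc.foreach(token => index.get(token).foreach(i => counts(i) += 1))
  counts
}

val toyDocs = Seq(
  Seq("space", "launch", "space"),
  Seq("space", "image"),
  Seq("image", "jpeg")
)
val toyVocab = buildVocab(toyDocs, vocabSize = 10, minDF = 2)
println(toyVocab.mkString(", "))
println(countVector(toyDocs.head, toyVocab).mkString(", "))
```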

In [None]:
// Convert the DataFrame to a Dataset of (id, countVector) pairs (the conversion to an RDD happens in a later cell)
val new_lda_countVector = new_countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }

  

>     new_lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.ml.linalg.Vector)] = [_1: bigint, _2: vector]

  

We will also increase `MaxIterations` to 10 (up from 3) to see if we get
better results.

In [None]:
// Set LDA parameters

val new_lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.8))
.setK(numTopics)
.setMaxIterations(10) // more than 3 this time
.setDocConcentration(-1) // use default values
.setTopicConcentration(-1) // use default values

  

>     new_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@18950443

  

#### How to find what the default values are?

Dive into the source!!!

1.  Let's find the default value for `docConcentration` now.

-   Go to the Apache Spark package root:
    <https://spark.apache.org/docs/latest/api/scala/#package>
-   Search for 'ml' in the search box on the top left (ml is for the ml
    library)
-   Then find `LDA` by scrolling down on the left to mllib's
    `clustering` methods and click on `LDA`
-   Then click on the source code link which should take you here:
    -   <https://github.com/apache/spark/blob/v2.2.0/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala>
    -   Now, simply go to the right function and see the following
        comment block:

        ```
        /**
         * Concentration parameter (commonly named "alpha") for the prior placed on documents'
         * distributions over topics ("theta").
         *
         * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
         * (more regularization).
         *
         * If not set by the user, then docConcentration is set automatically. If set to
         * singleton vector [alpha], then alpha is replicated to a vector of length k in fitting.
         * Otherwise, the [[docConcentration]] vector must be length k.
         * (default = automatic)
         *
         * Optimizer-specific parameter settings:
         *  - EM
         *     - Currently only supports symmetric distributions, so all values in the vector should be
         *       the same.
         *     - Values should be > 1.0
         *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
         *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
         *  - Online
         *     - Values should be >= 0
         *     - default = uniformly (1.0 / k), following the implementation from
         *       [[https://github.com/Blei-Lab/onlineldavb]].
         * @group param
         */
        ```
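
To make the defaults in this comment block concrete, here is a small worked example in plain Scala for k = 20, the number of topics reported in this notebook's output. `defaultDocConcentration` is a hypothetical helper written for illustration, not a Spark function; it only restates the two formulas from the comment and assumes the optimizer in use follows them.

```scala
// Hypothetical helper restating the documented docConcentration defaults:
//   EM:     uniformly (50 / k) + 1
//   Online: uniformly (1.0 / k)
def defaultDocConcentration(optimizer: String, k: Int): Double = optimizer match {
  case "em"     => 50.0 / k + 1.0
  case "online" => 1.0 / k
  case other    => throw new IllegalArgumentException(s"unknown optimizer: $other")
}

val k = 20
val emAlpha     = defaultDocConcentration("em", k)      // 50 / 20 + 1
val onlineAlpha = defaultDocConcentration("online", k)  // 1 / 20

println(s"EM default alpha for k=$k: $emAlpha")
println(s"Online default alpha for k=$k: $onlineAlpha")
```

Since the LDA above is configured with `OnlineLDAOptimizer`, the second value is the one that would apply under this rule.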

**HOMEWORK:** Try to find the default value for `TopicConcentration`.

In [None]:
// convert ML vectors into MLlib vectors
val new_lda_countVector_mllib = new_lda_countVector.map { case (id, vector) => (id, org.apache.spark.mllib.linalg.Vectors.fromML(vector)) }.rdd

// Create LDA model with stopwords refiltered
val new_ldaModel = new_lda.run(new_lda_countVector_mllib)

  

>     new_lda_countVector_mllib: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[48524] at rdd at command-2972105651606725:2
>     new_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@cd1d6c3

In [None]:
val topicIndices = new_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     launch	0.0018991566653208747
>     venus	0.001608386554434624
>     soviet	8.72030638031398E-4
>     solar	8.609912854871232E-4
>     mars	8.046138120216586E-4
>     ==========
>     TOPIC 1
>     jesus	0.003564630672903134
>     paul	0.0010620611458734977
>     convenient	0.0010414512545162426
>     right	9.866449920449033E-4
>     faith	9.079692351253585E-4
>     ==========
>     TOPIC 2
>     people	0.0050744369599922175
>     time	0.0037679615136750988
>     good	0.0031518754597519515
>     used	0.0027790132828341705
>     want	0.0024095636936064604
>     ==========
>     TOPIC 3
>     space	0.0025022149166141873
>     power	0.0021782291410841416
>     launch	0.002111359946279174
>     shuttle	0.0020664642391445445
>     period	0.001611839391548988
>     ==========
>     TOPIC 4
>     request	0.0030611954606818894
>     requests	0.002845295081467825
>     send	0.0026961050785434156
>     listserv	8.968007642854657E-4
>     cars	8.003219366975154E-4
>     ==========
>     TOPIC 5
>     brake	6.365055215150269E-4
>     countersteering	4.7457636447507777E-4
>     church	3.997679165602668E-4
>     hurt	3.7624357644437374E-4
>     rangers	3.744317441121122E-4
>     ==========
>     TOPIC 6
>     armenian	0.0015092438860068022
>     armenians	9.118981008231858E-4
>     turkish	8.544625239099978E-4
>     people	8.387734420893837E-4
>     turks	7.400878605124472E-4
>     ==========
>     TOPIC 7
>     turkish	0.0033059004471591
>     people	0.002374389357465559
>     greek	0.001809920247645245
>     church	0.0014281860275933676
>     armenia	0.0012885153255372337
>     ==========
>     TOPIC 8
>     abortion	0.001769966795077937
>     insurance	0.0013886749552449332
>     berkeley	0.00109862520230469
>     april	0.001070736647379058
>     coverage	0.001007587169446664
>     ==========
>     TOPIC 9
>     people	6.71913688617854E-4
>     good	5.748251067986094E-4
>     window	5.351877510799513E-4
>     data	4.433831463966694E-4
>     graphics	4.2010935872913803E-4
>     ==========
>     TOPIC 10
>     people	5.692974858386746E-4
>     time	4.952303669444124E-4
>     right	4.545198526399578E-4
>     good	4.246599192403034E-4
>     request	3.9848498349590565E-4
>     ==========
>     TOPIC 11
>     baltimore	7.226100521831666E-4
>     rochester	6.922714750017755E-4
>     temperature	5.677709636448905E-4
>     working	4.72591303655453E-4
>     earth	4.7180279250405257E-4
>     ==========
>     TOPIC 12
>     cubs	0.0041598232323520855
>     suck	0.0034204423759851914
>     picture	0.0017696983549121998
>     objective	0.0015561892650109575
>     league	0.0011650519970741256
>     ==========
>     TOPIC 13
>     game	0.0017800541092511226
>     leafs	0.0016212787051204198
>     team	9.552723638976857E-4
>     wings	7.922154109650735E-4
>     selanne	6.522434137918226E-4
>     ==========
>     TOPIC 14
>     tools	6.174979652294881E-4
>     image	5.663646577785772E-4
>     scientific	3.95636174664227E-4
>     need	3.823243139060751E-4
>     different	3.8199073796157034E-4
>     ==========
>     TOPIC 15
>     wright	6.506336430145857E-4
>     male	5.67160309491281E-4
>     homosexuality	5.14998244308789E-4
>     term	4.0408689282191954E-4
>     church	3.739609270301795E-4
>     ==========
>     TOPIC 16
>     people	3.8974622513913065E-4
>     software	3.7178799420497876E-4
>     shuttle	3.5598683993245635E-4
>     period	3.349739004585728E-4
>     pope	2.972937912831444E-4
>     ==========
>     TOPIC 17
>     jpeg	0.005620162829655862
>     image	0.0017648558675315495
>     file	0.001476689702386814
>     format	0.0012636258258417812
>     color	0.0011599696941478064
>     ==========
>     TOPIC 18
>     science	0.0015655807370458632
>     truth	0.001068082653407235
>     frank	0.001001479025503385
>     dwyer	6.578977928831484E-4
>     origins	6.098647694507199E-4
>     ==========
>     TOPIC 19
>     cramer	0.0019155350929879933
>     homosexual	0.001263740938716715
>     optilink	0.0012167800168242017
>     people	9.826220401716289E-4
>     clayton	8.847414531624391E-4
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(287, 443, 717, 1030, 1101),Array(0.0018991566653208747, 0.001608386554434624, 8.72030638031398E-4, 8.609912854871232E-4, 8.046138120216586E-4)), (Array(114, 113, 1199, 7, 213),Array(0.003564630672903134, 0.0010620611458734977, 0.0010414512545162426, 9.866449920449033E-4, 9.079692351253585E-4)), (Array(0, 1, 2, 3, 5),Array(0.0050744369599922175, 0.0037679615136750988, 0.0031518754597519515, 0.0027790132828341705, 0.0024095636936064604)), (Array(26, 28, 287, 369, 441),Array(0.0025022149166141873, 0.0021782291410841416, 0.002111359946279174, 0.0020664642391445445, 0.001611839391548988)), (Array(171, 757, 65, 3174, 281),Array(0.0030611954606818894, 0.002845295081467825, 0.0026961050785434156, 8.968007642854657E-4, 8.003219366975154E-4)), (Array(2159, 4376, 132, 1331, 992),Array(6.365055215150269E-4, 4.7457636447507777E-4, 3.997679165602668E-4, 3.7624357644437374E-4, 3.744317441121122E-4)), (Array(585, 478, 312, 0, 848),Array(0.0015092438860068022, 9.118981008231858E-4, 8.544625239099978E-4, 8.387734420893837E-4, 7.400878605124472E-4)), (Array(312, 0, 651, 132, 816),Array(0.0033059004471591, 0.002374389357465559, 0.001809920247645245, 0.0014281860275933676, 0.0012885153255372337)), (Array(877, 378, 577, 198, 1274),Array(0.001769966795077937, 0.0013886749552449332, 0.00109862520230469, 0.001070736647379058, 0.001007587169446664)), (Array(0, 2, 110, 12, 68),Array(6.71913688617854E-4, 5.748251067986094E-4, 5.351877510799513E-4, 4.433831463966694E-4, 4.2010935872913803E-4)), (Array(0, 1, 7, 2, 171),Array(5.692974858386746E-4, 4.952303669444124E-4, 4.545198526399578E-4, 4.246599192403034E-4, 3.9848498349590565E-4)), (Array(2449, 1657, 1341, 170, 199),Array(7.226100521831666E-4, 6.922714750017755E-4, 5.677709636448905E-4, 4.72591303655453E-4, 4.7180279250405257E-4)), (Array(698, 1057, 464, 339, 526),Array(0.0041598232323520855, 0.0034204423759851914, 0.0017696983549121998, 0.0015561892650109575, 
0.0011650519970741256)), (Array(60, 1169, 124, 1153, 2365),Array(0.0017800541092511226, 0.0016212787051204198, 9.552723638976857E-4, 7.922154109650735E-4, 6.522434137918226E-4)), (Array(326, 11, 500, 9, 42),Array(6.174979652294881E-4, 5.663646577785772E-4, 3.95636174664227E-4, 3.823243139060751E-4, 3.8199073796157034E-4)), (Array(1061, 1060, 643, 380, 132),Array(6.506336430145857E-4, 5.67160309491281E-4, 5.14998244308789E-4, 4.0408689282191954E-4, 3.739609270301795E-4)), (Array(0, 17, 369, 441, 2457),Array(3.8974622513913065E-4, 3.7178799420497876E-4, 3.5598683993245635E-4, 3.349739004585728E-4, 2.972937912831444E-4)), (Array(45, 11, 27, 145, 78),Array(0.005620162829655862, 0.0017648558675315495, 0.001476689702386814, 0.0012636258258417812, 0.0011599696941478064)), (Array(89, 130, 314, 1376, 2429),Array(0.0015655807370458632, 0.001068082653407235, 0.001001479025503385, 6.578977928831484E-4, 6.098647694507199E-4)), (Array(859, 667, 1344, 0, 1509),Array(0.0019155350929879933, 0.001263740938716715, 0.0012167800168242017, 9.826220401716289E-4, 8.847414531624391E-4)))
>     vocabList: Array[String] = Array(people, time, good, used, windows, want, work, right, problem, need, really, image, data, information, better, believe, using, software, years, year, mail, sure, point, drive, program, available, space, file, power, help, government, things, question, doesn, number, case, world, look, read, version, come, thing, different, long, best, jpeg, fact, university, probably, real, didn, course, state, true, files, high, possible, actually, 1993, list, game, little, news, group, david, send, wrong, based, graphics, support, able, place, called, free, john, subject, post, reason, color, great, second, card, public, having, email, info, following, start, hard, science, example, says, means, code, evidence, person, note, maybe, president, heard, general, mean, problems, quite, source, systems, life, price, standard, order, window, access, claim, paul, jesus, getting, looking, trying, control, disk, seen, simply, times, book, team, local, chip, play, encryption, idea, truth, given, church, issue, research, opinions, wrote, images, large, display, makes, remember, thought, doing, national, format, away, nasa, human, home, change, small, saying, interested, current, mark, area, internet, today, word, original, agree, left, memory, works, microsoft, machine, instead, hardware, kind, working, request, higher, sort, programs, questions, money, later, israel, mike, guess, hand, pretty, include, netcom, address, cause, matter, technology, uiuc, speed, wire, video, type, days, server, view, usually, april, earth, package, open, told, christian, stuff, unless, similar, important, size, major, house, provide, known, faith, ground, rights, michael, phone, body, center, including, health, american, apple, feel, cost, text, user, lines, bible, answer, care, copy, wouldn, understand, check, anybody, security, mind, live, certainly, started, running, message, mouse, level, network, women, study, clinton, making, position, company, came, groups, board, 
screen, white, common, talking, single, special, quality, black, wiring, test, likely, christians, monitor, nice, effect, light, members, medical, posted, uucp, hope, sources, certain, clear, difference, cars, write, canada, fine, hear, press, launch, build, police, love, history, couple, situation, books, particular, words, jewish, specific, sense, model, religion, anti, stop, posting, unix, talk, private, discussion, school, contact, cable, turkish, keys, frank, built, consider, service, sound, features, legal, taking, simple, comes, reference, argument, tools, children, short, night, jews, applications, clipper, device, application, comments, scsi, process, theory, objective, force, doubt, tried, self, experience, games, early, usenet, expect, steve, needed, tape, uses, interesting, killed, station, exactly, easy, death, value, turn, manager, needs, correct, according, amiga, ones, response, wanted, shuttle, language, states, drug, james, considered, reading, strong, koresh, insurance, personal, term, goes, result, future, taken, past, form, opinion, especially, religious, sorry, mentioned, rules, hell, written, various, author, guns, drivers, went, country, design, plus, happy, society, longer, gets, latest, results, analysis, haven, asked, main, section, laws, previous, cases, york, parts, aren, advance, weapons, christ, mode, required, input, looks, week, accept, community, washington, option, series, circuit, disease, robert, fast, numbers, head, exist, andrew, period, range, venus, israeli, macintosh, driver, office, offer, moral, allow, organization, involved, toronto, clock, players, department, runs, values, months, choice, half, colors, knows, picture, sell, brian, object, took, cards, includes, federal, hockey, individual, wasn, currently, suggest, dave, armenians, takes, protect, follow, americans, directly, candida, title, policy, total, devices, happened, statement, present, purpose, assume, close, recently, equipment, food, require, reasons, 
scientific, christianity, happen, users, media, provides, deal, wants, city, george, goal, couldn, bike, save, shall, dead, lost, action, speak, road, uunet, terms, batf, court, condition, easily, league, complete, engineering, obviously, details, completely, baseball, california, voice, mission, useful, lead, disclaimer, output, water, algorithm, clearly, administration, ways, compatible, international, sent, rest, responsible, pass, hours, digital, business, appreciated, issues, freedom, kill, project, deleted, companies, coming, operating, average, processing, context, story, figure, error, fans, season, face, port, carry, events, appropriate, leave, berkeley, trade, lower, player, king, page, convert, environment, armenian, political, points, basis, final, requires, heart, addition, performance, building, difficult, site, sale, suppose, related, stanford, resolution, field, willing, volume, actual, apply, knowledge, designed, explain, anonymous, supposed, directory, claims, worth, orbit, lots, basic, defense, advice, commercial, sounds, entries, limited, changes, wonder, suspect, radio, turkey, neutral, forget, wait, necessary, programming, reply, enforcement, inside, family, ability, handle, young, included, signal, homosexuality, natural, morality, finally, land, russian, paper, month, greek, friend, installed, create, thinking, printer, shot, services, understanding, population, hold, break, interface, comment, normal, eric, homosexual, setting, formats, names, peter, machines, report, east, supply, comp, percent, avoid, product, lives, colorado, communications, room, escrow, types, secure, arab, logic, miles, reasonable, rutgers, multiple, gary, soon, agencies, developed, recent, cubs, library, peace, expensive, cheap, cancer, million, allowed, physics, suggestions, doctor, caused, supported, technical, happens, event, looked, obvious, gives, soviet, street, mention, release, outside, table, print, mass, return, radar, archive, chance, install, treatment, 
bitnet, generally, development, friends, folks, choose, entire, weeks, united, social, wish, smith, trouble, child, straight, learn, supports, behavior, ideas, morning, muslim, member, diet, electrical, illegal, exists, requests, jack, areas, concept, meaning, reality, drives, appear, provided, studies, motif, attempt, possibly, west, answers, asking, pick, practice, engine, worked, stand, bring, thank, unit, remove, near, compound, errors, false, belief, continue, middle, changed, decided, modem, bits, existence, henry, trust, congress, extra, safe, facts, loss, yeah, contains, guys, particularly, arguments, proper, class, manual, frame, command, drugs, stupid, wide, nature, institute, armenia, constitution, thread, pages, function, andy, attack, fonts, privacy, aware, operations, heat, worse, appears, distribution, knew, effective, edge, division, shouldn, wall, approach, speaking, independent, unfortunately, hands, conference, crime, indiana, modern, ethernet, solution, turks, civil, bought, 1992, 1990, compression, safety, serial, added, shows, letter, cramer, faster, simms, operation, arms, internal, germany, gave, record, wondering, virginia, floppy, appreciate, blue, plastic, regular, writing, allows, abortion, utility, hello, rate, blood, spend, views, articles, actions, font, additional, method, described, concerned, scott, played, stated, kids, atheism, surface, base, step, decision, switch, tool, playing, giving, attitude, quote, keith, cover, levels, considering, highly, resources, north, military, connected, buying, places, capability, products, costs, patients, controller, fair, late, prevent, rule, western, poor, brought, functions, received, account, creation, watch, cwru, majority, forward, student, released, driving, authority, committee, protection, richard, boston, quick, calls, chips, valid, hate, shell, excellent, countries, market, necessarily, created, wires, america, apparently, turned, complex, fairly, minutes, murder, boxes, lord, 
keyboard, properly, plan, moon, arabs, arizona, visual, absolutely, notice, produce, sold, panel, dangerous, killing, begin, property, damage, electronics, living, failed, acts, tests, nation, intelligence, islam, vote, rangers, effort, options, greatly, holy, review, shareware, larry)
>     topics: Array[Array[(String, Double)]] = Array(Array((launch,0.0018991566653208747), (venus,0.001608386554434624), (soviet,8.72030638031398E-4), (solar,8.609912854871232E-4), (mars,8.046138120216586E-4)), Array((jesus,0.003564630672903134), (paul,0.0010620611458734977), (convenient,0.0010414512545162426), (right,9.866449920449033E-4), (faith,9.079692351253585E-4)), Array((people,0.0050744369599922175), (time,0.0037679615136750988), (good,0.0031518754597519515), (used,0.0027790132828341705), (want,0.0024095636936064604)), Array((space,0.0025022149166141873), (power,0.0021782291410841416), (launch,0.002111359946279174), (shuttle,0.0020664642391445445), (period,0.001611839391548988)), Array((request,0.0030611954606818894), (requests,0.002845295081467825), (send,0.0026961050785434156), (listserv,8.968007642854657E-4), (cars,8.003219366975154E-4)), Array((brake,6.365055215150269E-4), (countersteering,4.7457636447507777E-4), (church,3.997679165602668E-4), (hurt,3.7624357644437374E-4), (rangers,3.744317441121122E-4)), Array((armenian,0.0015092438860068022), (armenians,9.118981008231858E-4), (turkish,8.544625239099978E-4), (people,8.387734420893837E-4), (turks,7.400878605124472E-4)), Array((turkish,0.0033059004471591), (people,0.002374389357465559), (greek,0.001809920247645245), (church,0.0014281860275933676), (armenia,0.0012885153255372337)), Array((abortion,0.001769966795077937), (insurance,0.0013886749552449332), (berkeley,0.00109862520230469), (april,0.001070736647379058), (coverage,0.001007587169446664)), Array((people,6.71913688617854E-4), (good,5.748251067986094E-4), (window,5.351877510799513E-4), (data,4.433831463966694E-4), (graphics,4.2010935872913803E-4)), Array((people,5.692974858386746E-4), (time,4.952303669444124E-4), (right,4.545198526399578E-4), (good,4.246599192403034E-4), (request,3.9848498349590565E-4)), Array((baltimore,7.226100521831666E-4), (rochester,6.922714750017755E-4), (temperature,5.677709636448905E-4), (working,4.72591303655453E-4), 
(earth,4.7180279250405257E-4)), Array((cubs,0.0041598232323520855), (suck,0.0034204423759851914), (picture,0.0017696983549121998), (objective,0.0015561892650109575), (league,0.0011650519970741256)), Array((game,0.0017800541092511226), (leafs,0.0016212787051204198), (team,9.552723638976857E-4), (wings,7.922154109650735E-4), (selanne,6.522434137918226E-4)), Array((tools,6.174979652294881E-4), (image,5.663646577785772E-4), (scientific,3.95636174664227E-4), (need,3.823243139060751E-4), (different,3.8199073796157034E-4)), Array((wright,6.506336430145857E-4), (male,5.67160309491281E-4), (homosexuality,5.14998244308789E-4), (term,4.0408689282191954E-4), (church,3.739609270301795E-4)), Array((people,3.8974622513913065E-4), (software,3.7178799420497876E-4), (shuttle,3.5598683993245635E-4), (period,3.349739004585728E-4), (pope,2.972937912831444E-4)), Array((jpeg,0.005620162829655862), (image,0.0017648558675315495), (file,0.001476689702386814), (format,0.0012636258258417812), (color,0.0011599696941478064)), Array((science,0.0015655807370458632), (truth,0.001068082653407235), (frank,0.001001479025503385), (dwyer,6.578977928831484E-4), (origins,6.098647694507199E-4)), Array((cramer,0.0019155350929879933), (homosexual,0.001263740938716715), (optilink,0.0012167800168242017), (people,9.826220401716289E-4), (clayton,8.847414531624391E-4)))

  

We managed to get better results here: some topics are now easy to
interpret. In the run recorded below, for example, topic 3 is about
space and topic 7 is about religion. Topic numbers and weights can
differ between runs because of randomness in the algorithm, so they need
not match the output above.

>     ==========
>     TOPIC 3
>     station	0.0022184815200582244
>     launch	0.0020621309179376145
>     shuttle	0.0019305627762549198
>     space	0.0017600147075534092
>     redesign	0.0014972130065346592
>     ==========
>     TOPIC 7
>     people	0.0038165245379908675
>     church	0.0036902650900400543
>     jesus	0.0029942866750178893
>     paul	0.0026144777524277044
>     bible	0.0020476251853453016
>     ==========

Step 9. Create LDA model with Expectation Maximization
------------------------------------------------------

Let's try creating an LDA model with Expectation Maximization on the
data that has been refiltered for additional stopwords. We will also
increase the maximum number of iterations (`setMaxIterations`) to 100 to
see if that improves the results.

See
<http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda>

In [None]:
import org.apache.spark.mllib.clustering.EMLDAOptimizer

// Set LDA parameters
val em_lda = new LDA()
  .setOptimizer(new EMLDAOptimizer())
  .setK(numTopics)
  .setMaxIterations(100)
  .setDocConcentration(-1) // use default values
  .setTopicConcentration(-1) // use default values

  

>     import org.apache.spark.mllib.clustering.EMLDAOptimizer
>     em_lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@2e986867

In [None]:
val em_ldaModel = em_lda.run(new_lda_countVector_mllib)

  

>     em_ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.DistributedLDAModel@5bbace31

  

Note that the EMLDAOptimizer produces a DistributedLDAModel, which
stores not only the inferred topics but also the full training corpus
and topic distributions for each document in the training corpus.
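
Because `run` returns the general `LDAModel` type, the extra per-document
information is only reachable after narrowing to `DistributedLDAModel`.
The following is a minimal sketch of one way to do that, assuming it is
run in the same Spark session as the cells above; `describeModelKind` is
a hypothetical helper name introduced here for illustration and is not
part of the original notebook.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDAModel}

// Hypothetical helper (not from the notebook): summarise what a fitted model holds,
// depending on its concrete type.
def describeModelKind(model: LDAModel): String = model match {
  case dist: DistributedLDAModel =>
    // topicDistributions is an RDD of (document id, topic mixture vector),
    // with one entry per document in the training corpus.
    val nDocs = dist.topicDistributions.count()
    s"DistributedLDAModel with k=${dist.k} and topic mixtures for $nDocs training documents"
  case other =>
    // Any other LDAModel keeps only the inferred topics.
    s"${other.getClass.getSimpleName} with k=${other.k}"
}
```

Calling `describeModelKind(em_ldaModel)` after the model has been fitted
should take the first branch. If only the topics are needed afterwards,
`toLocal` on the distributed model gives a lighter `LocalLDAModel`.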

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)

  

>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(0, 312, 28, 478, 52),Array(0.013669950279615473, 0.012589446317671895, 0.011562911802003646, 0.009630031607012843, 0.008970764201152035)), (Array(98, 484, 0, 29, 2),Array(0.012611727242199007, 0.007831400710143183, 0.007236708874482028, 0.0071875150392249135, 0.0068614461851317764)), (Array(74, 38, 64, 136, 123),Array(0.017314708303090353, 0.016649998278242404, 0.01612424392875142, 0.013492756373023043, 0.01284347086675008)), (Array(63, 1, 36, 89, 62),Array(0.019077778219962822, 0.013309556388583813, 0.013284634297848614, 0.012895629630722405, 0.012622753621515886)), (Array(45, 27, 11, 54, 78),Array(0.02807219086862892, 0.02279193035831542, 0.0194505007349796, 0.012942440820968409, 0.012826486709331534)), (Array(26, 147, 287, 199, 369),Array(0.023842803515972327, 0.012758041129028296, 0.011217280608105142, 0.009997058638158966, 0.009032118770824646)), (Array(0, 184, 31, 5, 2),Array(0.016865341823208774, 0.016825261055182127, 0.015440418317600587, 0.012106113182750974, 0.010304699850692597)), (Array(53, 130, 66, 32, 314),Array(0.014220526283559909, 0.012631168518367666, 0.012555951644270336, 0.011574292577776707, 0.01048868971613375)), (Array(114, 132, 0, 113, 203),Array(0.014040414898900483, 0.01270680357421398, 0.010902857549805028, 0.010717922636260969, 0.01001816589354621)), (Array(191, 28, 3, 264, 214),Array(0.014037103668261866, 0.013555721025806456, 0.012516554650882044, 0.011959985021820363, 0.011450644497320205)), (Array(155, 190, 512, 8, 41),Array(0.012383107605954015, 0.011829510900575697, 0.009481024754599455, 0.008178117398401121, 0.008144782572592813)), (Array(178, 0, 444, 297, 35),Array(0.014925183607418603, 0.009581182347438974, 0.008770851989791053, 0.0076584570498958796, 0.007314463669539004)), (Array(4, 23, 81, 17, 110),Array(0.03859307988847965, 0.027419168870300246, 0.018201005649992354, 0.015065794528308522, 0.014345107593535791)), (Array(176, 378, 7, 0, 
48),Array(0.014225381856759272, 0.010050331220816595, 0.00977063406649008, 0.008795489327973019, 0.008646891569460264)), (Array(65, 59, 20, 116, 171),Array(0.02195591908828555, 0.019379171839426126, 0.01819501694851847, 0.01682359979037597, 0.01600464183747604)), (Array(0, 30, 241, 289, 377),Array(0.020025915060240766, 0.011914779817356594, 0.011652192323642917, 0.010666176855822633, 0.010366058169772603)), (Array(60, 19, 127, 124, 345),Array(0.023643821554770753, 0.020615209712539002, 0.015861220524255214, 0.014993577810213812, 0.00977967167197561)), (Array(128, 126, 30, 188, 332),Array(0.016182899875128307, 0.01439691996001174, 0.011463190284006705, 0.01000883239093308, 0.009430305456545711)), (Array(47, 170, 84, 217, 9),Array(0.019902657675382598, 0.010374362298625423, 0.009816122482765829, 0.009588736191172297, 0.00893649722419374)), (Array(12, 11, 24, 25, 85),Array(0.018444866352691897, 0.017480078221570557, 0.010045391658608608, 0.009516965697547774, 0.00789517504344056)))

In [None]:
val vocabList = vectorizer.vocabulary

  

>     vocabList: Array[String] = Array(people, time, good, used, windows, want, work, right, problem, need, really, image, data, information, better, believe, using, software, years, year, mail, sure, point, drive, program, available, space, file, power, help, government, things, question, doesn, number, case, world, look, read, version, come, thing, different, long, best, jpeg, fact, university, probably, real, didn, course, state, true, files, high, possible, actually, 1993, list, game, little, news, group, david, send, wrong, based, graphics, support, able, place, called, free, john, subject, post, reason, color, great, second, card, public, having, email, info, following, start, hard, science, example, says, means, code, evidence, person, note, maybe, president, heard, general, mean, problems, quite, source, systems, life, price, standard, order, window, access, claim, paul, jesus, getting, looking, trying, control, disk, seen, simply, times, book, team, local, chip, play, encryption, idea, truth, given, church, issue, research, opinions, wrote, images, large, display, makes, remember, thought, doing, national, format, away, nasa, human, home, change, small, saying, interested, current, mark, area, internet, today, word, original, agree, left, memory, works, microsoft, machine, instead, hardware, kind, working, request, higher, sort, programs, questions, money, later, israel, mike, guess, hand, pretty, include, netcom, address, cause, matter, technology, uiuc, speed, wire, video, type, days, server, view, usually, april, earth, package, open, told, christian, stuff, unless, similar, important, size, major, house, provide, known, faith, ground, rights, michael, phone, body, center, including, health, american, apple, feel, cost, text, user, lines, bible, answer, care, copy, wouldn, understand, check, anybody, security, mind, live, certainly, started, running, message, mouse, level, network, women, study, clinton, making, position, company, came, groups, board, 
screen, white, common, talking, single, special, quality, black, wiring, test, likely, christians, monitor, nice, effect, light, members, medical, posted, uucp, hope, sources, certain, clear, difference, cars, write, canada, fine, hear, press, launch, build, police, love, history, couple, situation, books, particular, words, jewish, specific, sense, model, religion, anti, stop, posting, unix, talk, private, discussion, school, contact, cable, turkish, keys, frank, built, consider, service, sound, features, legal, taking, simple, comes, reference, argument, tools, children, short, night, jews, applications, clipper, device, application, comments, scsi, process, theory, objective, force, doubt, tried, self, experience, games, early, usenet, expect, steve, needed, tape, uses, interesting, killed, station, exactly, easy, death, value, turn, manager, needs, correct, according, amiga, ones, response, wanted, shuttle, language, states, drug, james, considered, reading, strong, koresh, insurance, personal, term, goes, result, future, taken, past, form, opinion, especially, religious, sorry, mentioned, rules, hell, written, various, author, guns, drivers, went, country, design, plus, happy, society, longer, gets, latest, results, analysis, haven, asked, main, section, laws, previous, cases, york, parts, aren, advance, weapons, christ, mode, required, input, looks, week, accept, community, washington, option, series, circuit, disease, robert, fast, numbers, head, exist, andrew, period, range, venus, israeli, macintosh, driver, office, offer, moral, allow, organization, involved, toronto, clock, players, department, runs, values, months, choice, half, colors, knows, picture, sell, brian, object, took, cards, includes, federal, hockey, individual, wasn, currently, suggest, dave, armenians, takes, protect, follow, americans, directly, candida, title, policy, total, devices, happened, statement, present, purpose, assume, close, recently, equipment, food, require, reasons, 
scientific, christianity, happen, users, media, provides, deal, wants, city, george, goal, couldn, bike, save, shall, dead, lost, action, speak, road, uunet, terms, batf, court, condition, easily, league, complete, engineering, obviously, details, completely, baseball, california, voice, mission, useful, lead, disclaimer, output, water, algorithm, clearly, administration, ways, compatible, international, sent, rest, responsible, pass, hours, digital, business, appreciated, issues, freedom, kill, project, deleted, companies, coming, operating, average, processing, context, story, figure, error, fans, season, face, port, carry, events, appropriate, leave, berkeley, trade, lower, player, king, page, convert, environment, armenian, political, points, basis, final, requires, heart, addition, performance, building, difficult, site, sale, suppose, related, stanford, resolution, field, willing, volume, actual, apply, knowledge, designed, explain, anonymous, supposed, directory, claims, worth, orbit, lots, basic, defense, advice, commercial, sounds, entries, limited, changes, wonder, suspect, radio, turkey, neutral, forget, wait, necessary, programming, reply, enforcement, inside, family, ability, handle, young, included, signal, homosexuality, natural, morality, finally, land, russian, paper, month, greek, friend, installed, create, thinking, printer, shot, services, understanding, population, hold, break, interface, comment, normal, eric, homosexual, setting, formats, names, peter, machines, report, east, supply, comp, percent, avoid, product, lives, colorado, communications, room, escrow, types, secure, arab, logic, miles, reasonable, rutgers, multiple, gary, soon, agencies, developed, recent, cubs, library, peace, expensive, cheap, cancer, million, allowed, physics, suggestions, doctor, caused, supported, technical, happens, event, looked, obvious, gives, soviet, street, mention, release, outside, table, print, mass, return, radar, archive, chance, install, treatment, 
bitnet, generally, development, friends, folks, choose, entire, weeks, united, social, wish, smith, trouble, child, straight, learn, supports, behavior, ideas, morning, muslim, member, diet, electrical, illegal, exists, requests, jack, areas, concept, meaning, reality, drives, appear, provided, studies, motif, attempt, possibly, west, answers, asking, pick, practice, engine, worked, stand, bring, thank, unit, remove, near, compound, errors, false, belief, continue, middle, changed, decided, modem, bits, existence, henry, trust, congress, extra, safe, facts, loss, yeah, contains, guys, particularly, arguments, proper, class, manual, frame, command, drugs, stupid, wide, nature, institute, armenia, constitution, thread, pages, function, andy, attack, fonts, privacy, aware, operations, heat, worse, appears, distribution, knew, effective, edge, division, shouldn, wall, approach, speaking, independent, unfortunately, hands, conference, crime, indiana, modern, ethernet, solution, turks, civil, bought, 1992, 1990, compression, safety, serial, added, shows, letter, cramer, faster, simms, operation, arms, internal, germany, gave, record, wondering, virginia, floppy, appreciate, blue, plastic, regular, writing, allows, abortion, utility, hello, rate, blood, spend, views, articles, actions, font, additional, method, described, concerned, scott, played, stated, kids, atheism, surface, base, step, decision, switch, tool, playing, giving, attitude, quote, keith, cover, levels, considering, highly, resources, north, military, connected, buying, places, capability, products, costs, patients, controller, fair, late, prevent, rule, western, poor, brought, functions, received, account, creation, watch, cwru, majority, forward, student, released, driving, authority, committee, protection, richard, boston, quick, calls, chips, valid, hate, shell, excellent, countries, market, necessarily, created, wires, america, apparently, turned, complex, fairly, minutes, murder, boxes, lord, 
keyboard, properly, plan, moon, arabs, arizona, visual, absolutely, notice, produce, sold, panel, dangerous, killing, begin, property, damage, electronics, living, failed, acts, tests, nation, intelligence, islam, vote, rangers, effort, options, greatly, holy, review, shareware, larry)

In [None]:
vocabList.size

  

>     res30: Int = 6122

In [None]:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}

  

>     topics: Array[Array[(String, Double)]] = Array(Array((people,0.013669950279615473), (turkish,0.012589446317671895), (power,0.011562911802003646), (armenians,0.009630031607012843), (state,0.008970764201152035)), Array((president,0.012611727242199007), (candida,0.007831400710143183), (people,0.007236708874482028), (help,0.0071875150392249135), (good,0.0068614461851317764)), Array((john,0.017314708303090353), (read,0.016649998278242404), (david,0.01612424392875142), (wrote,0.013492756373023043), (book,0.01284347086675008)), Array((group,0.019077778219962822), (time,0.013309556388583813), (world,0.013284634297848614), (science,0.012895629630722405), (news,0.012622753621515886)), Array((jpeg,0.02807219086862892), (file,0.02279193035831542), (image,0.0194505007349796), (files,0.012942440820968409), (color,0.012826486709331534)), Array((space,0.023842803515972327), (nasa,0.012758041129028296), (launch,0.011217280608105142), (earth,0.009997058638158966), (shuttle,0.009032118770824646)), Array((people,0.016865341823208774), (netcom,0.016825261055182127), (things,0.015440418317600587), (want,0.012106113182750974), (good,0.010304699850692597)), Array((true,0.014220526283559909), (truth,0.012631168518367666), (wrong,0.012555951644270336), (question,0.011574292577776707), (frank,0.01048868971613375)), Array((jesus,0.014040414898900483), (church,0.01270680357421398), (people,0.010902857549805028), (paul,0.010717922636260969), (christian,0.01001816589354621)), Array((wire,0.014037103668261866), (power,0.013555721025806456), (used,0.012516554650882044), (wiring,0.011959985021820363), (ground,0.011450644497320205)), Array((mark,0.012383107605954015), (speed,0.011829510900575697), (bike,0.009481024754599455), (problem,0.008178117398401121), (thing,0.008144782572592813)), Array((israel,0.014925183607418603), (people,0.009581182347438974), (israeli,0.008770851989791053), (jewish,0.0076584570498958796), (case,0.007314463669539004)), Array((windows,0.03859307988847965), 
(drive,0.027419168870300246), (card,0.018201005649992354), (software,0.015065794528308522), (window,0.014345107593535791)), Array((money,0.014225381856759272), (insurance,0.010050331220816595), (right,0.00977063406649008), (people,0.008795489327973019), (probably,0.008646891569460264)), Array((send,0.02195591908828555), (list,0.019379171839426126), (mail,0.01819501694851847), (looking,0.01682359979037597), (request,0.01600464183747604)), Array((people,0.020025915060240766), (government,0.011914779817356594), (started,0.011652192323642917), (police,0.010666176855822633), (koresh,0.010366058169772603)), Array((game,0.023643821554770753), (year,0.020615209712539002), (play,0.015861220524255214), (team,0.014993577810213812), (games,0.00977967167197561)), Array((encryption,0.016182899875128307), (chip,0.01439691996001174), (government,0.011463190284006705), (technology,0.01000883239093308), (clipper,0.009430305456545711)), Array((university,0.019902657675382598), (working,0.010374362298625423), (email,0.009816122482765829), (phone,0.009588736191172297), (need,0.00893649722419374)), Array((data,0.018444866352691897), (image,0.017480078221570557), (program,0.010045391658608608), (available,0.009516965697547774), (info,0.00789517504344056)))

In [None]:
vocabList(47) // 47 is the index of the term 'university', the top term of topic 18 in this run - this may change due to randomness in the algorithm

  

>     res31: String = university
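
To make the index-to-term lookup concrete, here is a small
self-contained sketch that applies the same `map`/`zip` pattern to toy
data. The names `toyVocab`, `toyTopicIndices` and `toTerms` and all the
numbers are made up for illustration; they are not taken from the
fitted model.

```scala
// Toy stand-ins (assumptions for illustration, not values from the notebook's model).
val toyVocab: Array[String] = Array("space", "nasa", "launch", "jesus", "church")

// Same shape as the result of describeTopics: one (term indices, term weights) pair per topic.
val toyTopicIndices: Array[(Array[Int], Array[Double])] = Array(
  (Array(0, 2, 1), Array(0.5, 0.3, 0.2)),
  (Array(3, 4), Array(0.6, 0.4))
)

// Replace each term index by its vocabulary entry and pair it with its weight.
def toTerms(indices: Array[(Array[Int], Array[Double])],
            vocab: Array[String]): Array[Array[(String, Double)]] =
  indices.map { case (terms, termWeights) =>
    terms.map(vocab(_)).zip(termWeights)
  }

val toyTopics = toTerms(toyTopicIndices, toyVocab)

// Print one line per topic, listing each term with its weight.
toyTopics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i: " + topic.map { case (t, w) => s"$t ($w)" }.mkString(", "))
}
```

The lookup only makes sense if the vocabulary array is the one that was
used to build the count vectors, since the term indices are positions in
that array.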

  

The following cell combines the previous steps and prints the top terms
of each topic.

In [None]:
val topicIndices = em_ldaModel.describeTopics(maxTermsPerTopic = 5)
val vocabList = vectorizer.vocabulary
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.map(vocabList(_)).zip(termWeights)
}
println(s"$numTopics topics:")
topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

  

>     20 topics:
>     TOPIC 0
>     people	0.013669950279615473
>     turkish	0.012589446317671895
>     power	0.011562911802003646
>     armenians	0.009630031607012843
>     state	0.008970764201152035
>     ==========
>     TOPIC 1
>     president	0.012611727242199007
>     candida	0.007831400710143183
>     people	0.007236708874482028
>     help	0.0071875150392249135
>     good	0.0068614461851317764
>     ==========
>     TOPIC 2
>     john	0.017314708303090353
>     read	0.016649998278242404
>     david	0.01612424392875142
>     wrote	0.013492756373023043
>     book	0.01284347086675008
>     ==========
>     TOPIC 3
>     group	0.019077778219962822
>     time	0.013309556388583813
>     world	0.013284634297848614
>     science	0.012895629630722405
>     news	0.012622753621515886
>     ==========
>     TOPIC 4
>     jpeg	0.02807219086862892
>     file	0.02279193035831542
>     image	0.0194505007349796
>     files	0.012942440820968409
>     color	0.012826486709331534
>     ==========
>     TOPIC 5
>     space	0.023842803515972327
>     nasa	0.012758041129028296
>     launch	0.011217280608105142
>     earth	0.009997058638158966
>     shuttle	0.009032118770824646
>     ==========
>     TOPIC 6
>     people	0.016865341823208774
>     netcom	0.016825261055182127
>     things	0.015440418317600587
>     want	0.012106113182750974
>     good	0.010304699850692597
>     ==========
>     TOPIC 7
>     true	0.014220526283559909
>     truth	0.012631168518367666
>     wrong	0.012555951644270336
>     question	0.011574292577776707
>     frank	0.01048868971613375
>     ==========
>     TOPIC 8
>     jesus	0.014040414898900483
>     church	0.01270680357421398
>     people	0.010902857549805028
>     paul	0.010717922636260969
>     christian	0.01001816589354621
>     ==========
>     TOPIC 9
>     wire	0.014037103668261866
>     power	0.013555721025806456
>     used	0.012516554650882044
>     wiring	0.011959985021820363
>     ground	0.011450644497320205
>     ==========
>     TOPIC 10
>     mark	0.012383107605954015
>     speed	0.011829510900575697
>     bike	0.009481024754599455
>     problem	0.008178117398401121
>     thing	0.008144782572592813
>     ==========
>     TOPIC 11
>     israel	0.014925183607418603
>     people	0.009581182347438974
>     israeli	0.008770851989791053
>     jewish	0.0076584570498958796
>     case	0.007314463669539004
>     ==========
>     TOPIC 12
>     windows	0.03859307988847965
>     drive	0.027419168870300246
>     card	0.018201005649992354
>     software	0.015065794528308522
>     window	0.014345107593535791
>     ==========
>     TOPIC 13
>     money	0.014225381856759272
>     insurance	0.010050331220816595
>     right	0.00977063406649008
>     people	0.008795489327973019
>     probably	0.008646891569460264
>     ==========
>     TOPIC 14
>     send	0.02195591908828555
>     list	0.019379171839426126
>     mail	0.01819501694851847
>     looking	0.01682359979037597
>     request	0.01600464183747604
>     ==========
>     TOPIC 15
>     people	0.020025915060240766
>     government	0.011914779817356594
>     started	0.011652192323642917
>     police	0.010666176855822633
>     koresh	0.010366058169772603
>     ==========
>     TOPIC 16
>     game	0.023643821554770753
>     year	0.020615209712539002
>     play	0.015861220524255214
>     team	0.014993577810213812
>     games	0.00977967167197561
>     ==========
>     TOPIC 17
>     encryption	0.016182899875128307
>     chip	0.01439691996001174
>     government	0.011463190284006705
>     technology	0.01000883239093308
>     clipper	0.009430305456545711
>     ==========
>     TOPIC 18
>     university	0.019902657675382598
>     working	0.010374362298625423
>     email	0.009816122482765829
>     phone	0.009588736191172297
>     need	0.00893649722419374
>     ==========
>     TOPIC 19
>     data	0.018444866352691897
>     image	0.017480078221570557
>     program	0.010045391658608608
>     available	0.009516965697547774
>     info	0.00789517504344056
>     ==========
>     topicIndices: Array[(Array[Int], Array[Double])] = Array((Array(0, 312, 28, 478, 52),Array(0.013669950279615473, 0.012589446317671895, 0.011562911802003646, 0.009630031607012843, 0.008970764201152035)), (Array(98, 484, 0, 29, 2),Array(0.012611727242199007, 0.007831400710143183, 0.007236708874482028, 0.0071875150392249135, 0.0068614461851317764)), (Array(74, 38, 64, 136, 123),Array(0.017314708303090353, 0.016649998278242404, 0.01612424392875142, 0.013492756373023043, 0.01284347086675008)), (Array(63, 1, 36, 89, 62),Array(0.019077778219962822, 0.013309556388583813, 0.013284634297848614, 0.012895629630722405, 0.012622753621515886)), (Array(45, 27, 11, 54, 78),Array(0.02807219086862892, 0.02279193035831542, 0.0194505007349796, 0.012942440820968409, 0.012826486709331534)), (Array(26, 147, 287, 199, 369),Array(0.023842803515972327, 0.012758041129028296, 0.011217280608105142, 0.009997058638158966, 0.009032118770824646)), (Array(0, 184, 31, 5, 2),Array(0.016865341823208774, 0.016825261055182127, 0.015440418317600587, 0.012106113182750974, 0.010304699850692597)), (Array(53, 130, 66, 32, 314),Array(0.014220526283559909, 0.012631168518367666, 0.012555951644270336, 0.011574292577776707, 0.01048868971613375)), (Array(114, 132, 0, 113, 203),Array(0.014040414898900483, 0.01270680357421398, 0.010902857549805028, 0.010717922636260969, 0.01001816589354621)), (Array(191, 28, 3, 264, 214),Array(0.014037103668261866, 0.013555721025806456, 0.012516554650882044, 0.011959985021820363, 0.011450644497320205)), (Array(155, 190, 512, 8, 41),Array(0.012383107605954015, 0.011829510900575697, 0.009481024754599455, 0.008178117398401121, 0.008144782572592813)), (Array(178, 0, 444, 297, 35),Array(0.014925183607418603, 0.009581182347438974, 0.008770851989791053, 0.0076584570498958796, 0.007314463669539004)), (Array(4, 23, 81, 17, 110),Array(0.03859307988847965, 0.027419168870300246, 0.018201005649992354, 0.015065794528308522, 0.014345107593535791)), (Array(176, 378, 7, 0, 
48),Array(0.014225381856759272, 0.010050331220816595, 0.00977063406649008, 0.008795489327973019, 0.008646891569460264)), (Array(65, 59, 20, 116, 171),Array(0.02195591908828555, 0.019379171839426126, 0.01819501694851847, 0.01682359979037597, 0.01600464183747604)), (Array(0, 30, 241, 289, 377),Array(0.020025915060240766, 0.011914779817356594, 0.011652192323642917, 0.010666176855822633, 0.010366058169772603)), (Array(60, 19, 127, 124, 345),Array(0.023643821554770753, 0.020615209712539002, 0.015861220524255214, 0.014993577810213812, 0.00977967167197561)), (Array(128, 126, 30, 188, 332),Array(0.016182899875128307, 0.01439691996001174, 0.011463190284006705, 0.01000883239093308, 0.009430305456545711)), (Array(47, 170, 84, 217, 9),Array(0.019902657675382598, 0.010374362298625423, 0.009816122482765829, 0.009588736191172297, 0.00893649722419374)), (Array(12, 11, 24, 25, 85),Array(0.018444866352691897, 0.017480078221570557, 0.010045391658608608, 0.009516965697547774, 0.00789517504344056)))
>     vocabList: Array[String] = Array(people, time, good, used, windows, want, work, right, problem, need, really, image, data, information, better, believe, using, software, years, year, mail, sure, point, drive, program, available, space, file, power, help, government, things, question, doesn, number, case, world, look, read, version, come, thing, different, long, best, jpeg, fact, university, probably, real, didn, course, state, true, files, high, possible, actually, 1993, list, game, little, news, group, david, send, wrong, based, graphics, support, able, place, called, free, john, subject, post, reason, color, great, second, card, public, having, email, info, following, start, hard, science, example, says, means, code, evidence, person, note, maybe, president, heard, general, mean, problems, quite, source, systems, life, price, standard, order, window, access, claim, paul, jesus, getting, looking, trying, control, disk, seen, simply, times, book, team, local, chip, play, encryption, idea, truth, given, church, issue, research, opinions, wrote, images, large, display, makes, remember, thought, doing, national, format, away, nasa, human, home, change, small, saying, interested, current, mark, area, internet, today, word, original, agree, left, memory, works, microsoft, machine, instead, hardware, kind, working, request, higher, sort, programs, questions, money, later, israel, mike, guess, hand, pretty, include, netcom, address, cause, matter, technology, uiuc, speed, wire, video, type, days, server, view, usually, april, earth, package, open, told, christian, stuff, unless, similar, important, size, major, house, provide, known, faith, ground, rights, michael, phone, body, center, including, health, american, apple, feel, cost, text, user, lines, bible, answer, care, copy, wouldn, understand, check, anybody, security, mind, live, certainly, started, running, message, mouse, level, network, women, study, clinton, making, position, company, came, groups, board, 
screen, white, common, talking, single, special, quality, black, wiring, test, likely, christians, monitor, nice, effect, light, members, medical, posted, uucp, hope, sources, certain, clear, difference, cars, write, canada, fine, hear, press, launch, build, police, love, history, couple, situation, books, particular, words, jewish, specific, sense, model, religion, anti, stop, posting, unix, talk, private, discussion, school, contact, cable, turkish, keys, frank, built, consider, service, sound, features, legal, taking, simple, comes, reference, argument, tools, children, short, night, jews, applications, clipper, device, application, comments, scsi, process, theory, objective, force, doubt, tried, self, experience, games, early, usenet, expect, steve, needed, tape, uses, interesting, killed, station, exactly, easy, death, value, turn, manager, needs, correct, according, amiga, ones, response, wanted, shuttle, language, states, drug, james, considered, reading, strong, koresh, insurance, personal, term, goes, result, future, taken, past, form, opinion, especially, religious, sorry, mentioned, rules, hell, written, various, author, guns, drivers, went, country, design, plus, happy, society, longer, gets, latest, results, analysis, haven, asked, main, section, laws, previous, cases, york, parts, aren, advance, weapons, christ, mode, required, input, looks, week, accept, community, washington, option, series, circuit, disease, robert, fast, numbers, head, exist, andrew, period, range, venus, israeli, macintosh, driver, office, offer, moral, allow, organization, involved, toronto, clock, players, department, runs, values, months, choice, half, colors, knows, picture, sell, brian, object, took, cards, includes, federal, hockey, individual, wasn, currently, suggest, dave, armenians, takes, protect, follow, americans, directly, candida, title, policy, total, devices, happened, statement, present, purpose, assume, close, recently, equipment, food, require, reasons, 
scientific, christianity, happen, users, media, provides, deal, wants, city, george, goal, couldn, bike, save, shall, dead, lost, action, speak, road, uunet, terms, batf, court, condition, easily, league, complete, engineering, obviously, details, completely, baseball, california, voice, mission, useful, lead, disclaimer, output, water, algorithm, clearly, administration, ways, compatible, international, sent, rest, responsible, pass, hours, digital, business, appreciated, issues, freedom, kill, project, deleted, companies, coming, operating, average, processing, context, story, figure, error, fans, season, face, port, carry, events, appropriate, leave, berkeley, trade, lower, player, king, page, convert, environment, armenian, political, points, basis, final, requires, heart, addition, performance, building, difficult, site, sale, suppose, related, stanford, resolution, field, willing, volume, actual, apply, knowledge, designed, explain, anonymous, supposed, directory, claims, worth, orbit, lots, basic, defense, advice, commercial, sounds, entries, limited, changes, wonder, suspect, radio, turkey, neutral, forget, wait, necessary, programming, reply, enforcement, inside, family, ability, handle, young, included, signal, homosexuality, natural, morality, finally, land, russian, paper, month, greek, friend, installed, create, thinking, printer, shot, services, understanding, population, hold, break, interface, comment, normal, eric, homosexual, setting, formats, names, peter, machines, report, east, supply, comp, percent, avoid, product, lives, colorado, communications, room, escrow, types, secure, arab, logic, miles, reasonable, rutgers, multiple, gary, soon, agencies, developed, recent, cubs, library, peace, expensive, cheap, cancer, million, allowed, physics, suggestions, doctor, caused, supported, technical, happens, event, looked, obvious, gives, soviet, street, mention, release, outside, table, print, mass, return, radar, archive, chance, install, treatment, 
bitnet, generally, development, friends, folks, choose, entire, weeks, united, social, wish, smith, trouble, child, straight, learn, supports, behavior, ideas, morning, muslim, member, diet, electrical, illegal, exists, requests, jack, areas, concept, meaning, reality, drives, appear, provided, studies, motif, attempt, possibly, west, answers, asking, pick, practice, engine, worked, stand, bring, thank, unit, remove, near, compound, errors, false, belief, continue, middle, changed, decided, modem, bits, existence, henry, trust, congress, extra, safe, facts, loss, yeah, contains, guys, particularly, arguments, proper, class, manual, frame, command, drugs, stupid, wide, nature, institute, armenia, constitution, thread, pages, function, andy, attack, fonts, privacy, aware, operations, heat, worse, appears, distribution, knew, effective, edge, division, shouldn, wall, approach, speaking, independent, unfortunately, hands, conference, crime, indiana, modern, ethernet, solution, turks, civil, bought, 1992, 1990, compression, safety, serial, added, shows, letter, cramer, faster, simms, operation, arms, internal, germany, gave, record, wondering, virginia, floppy, appreciate, blue, plastic, regular, writing, allows, abortion, utility, hello, rate, blood, spend, views, articles, actions, font, additional, method, described, concerned, scott, played, stated, kids, atheism, surface, base, step, decision, switch, tool, playing, giving, attitude, quote, keith, cover, levels, considering, highly, resources, north, military, connected, buying, places, capability, products, costs, patients, controller, fair, late, prevent, rule, western, poor, brought, functions, received, account, creation, watch, cwru, majority, forward, student, released, driving, authority, committee, protection, richard, boston, quick, calls, chips, valid, hate, shell, excellent, countries, market, necessarily, created, wires, america, apparently, turned, complex, fairly, minutes, murder, boxes, lord, 
keyboard, properly, plan, moon, arabs, arizona, visual, absolutely, notice, produce, sold, panel, dangerous, killing, begin, property, damage, electronics, living, failed, acts, tests, nation, intelligence, islam, vote, rangers, effort, options, greatly, holy, review, shareware, larry)
>     topics: Array[Array[(String, Double)]] = Array(Array((people,0.013669950279615473), (turkish,0.012589446317671895), (power,0.011562911802003646), (armenians,0.009630031607012843), (state,0.008970764201152035)), Array((president,0.012611727242199007), (candida,0.007831400710143183), (people,0.007236708874482028), (help,0.0071875150392249135), (good,0.0068614461851317764)), Array((john,0.017314708303090353), (read,0.016649998278242404), (david,0.01612424392875142), (wrote,0.013492756373023043), (book,0.01284347086675008)), Array((group,0.019077778219962822), (time,0.013309556388583813), (world,0.013284634297848614), (science,0.012895629630722405), (news,0.012622753621515886)), Array((jpeg,0.02807219086862892), (file,0.02279193035831542), (image,0.0194505007349796), (files,0.012942440820968409), (color,0.012826486709331534)), Array((space,0.023842803515972327), (nasa,0.012758041129028296), (launch,0.011217280608105142), (earth,0.009997058638158966), (shuttle,0.009032118770824646)), Array((people,0.016865341823208774), (netcom,0.016825261055182127), (things,0.015440418317600587), (want,0.012106113182750974), (good,0.010304699850692597)), Array((true,0.014220526283559909), (truth,0.012631168518367666), (wrong,0.012555951644270336), (question,0.011574292577776707), (frank,0.01048868971613375)), Array((jesus,0.014040414898900483), (church,0.01270680357421398), (people,0.010902857549805028), (paul,0.010717922636260969), (christian,0.01001816589354621)), Array((wire,0.014037103668261866), (power,0.013555721025806456), (used,0.012516554650882044), (wiring,0.011959985021820363), (ground,0.011450644497320205)), Array((mark,0.012383107605954015), (speed,0.011829510900575697), (bike,0.009481024754599455), (problem,0.008178117398401121), (thing,0.008144782572592813)), Array((israel,0.014925183607418603), (people,0.009581182347438974), (israeli,0.008770851989791053), (jewish,0.0076584570498958796), (case,0.007314463669539004)), Array((windows,0.03859307988847965), 
(drive,0.027419168870300246), (card,0.018201005649992354), (software,0.015065794528308522), (window,0.014345107593535791)), Array((money,0.014225381856759272), (insurance,0.010050331220816595), (right,0.00977063406649008), (people,0.008795489327973019), (probably,0.008646891569460264)), Array((send,0.02195591908828555), (list,0.019379171839426126), (mail,0.01819501694851847), (looking,0.01682359979037597), (request,0.01600464183747604)), Array((people,0.020025915060240766), (government,0.011914779817356594), (started,0.011652192323642917), (police,0.010666176855822633), (koresh,0.010366058169772603)), Array((game,0.023643821554770753), (year,0.020615209712539002), (play,0.015861220524255214), (team,0.014993577810213812), (games,0.00977967167197561)), Array((encryption,0.016182899875128307), (chip,0.01439691996001174), (government,0.011463190284006705), (technology,0.01000883239093308), (clipper,0.009430305456545711)), Array((university,0.019902657675382598), (working,0.010374362298625423), (email,0.009816122482765829), (phone,0.009588736191172297), (need,0.00893649722419374)), Array((data,0.018444866352691897), (image,0.017480078221570557), (program,0.010045391658608608), (available,0.009516965697547774), (info,0.00789517504344056)))

  

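The model recovers several coherent topics. For example, Topic 12
(`windows`, `drive`, `card`, `software`, `window`) is clearly about
computers, and Topic 8 (`jesus`, `church`, `paul`, `christian`) is about
Christianity.

Some topics remain ambiguous: Topic 6 (`people`, `netcom`, `things`,
`want`, `good`), for instance, is dominated by generic words.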
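
One rough way to spot ambiguous topics is to look for terms that rank
among the top terms of more than one topic. The sketch below does this
on a hand-copied excerpt of the `topics` output above (topics 0, 6, 8
and 12 only). The names `topicsExcerpt`, `topicsByTerm` and
`sharedTerms` are illustrative and not part of the notebook.

```scala
// Excerpt of the `topics` output above, keyed by topic ID (4 of the 20 topics).
val topicsExcerpt: Map[Int, Seq[(String, Double)]] = Map(
  0 -> Seq(("people", 0.013669950279615473), ("turkish", 0.012589446317671895),
           ("power", 0.011562911802003646), ("armenians", 0.009630031607012843),
           ("state", 0.008970764201152035)),
  6 -> Seq(("people", 0.016865341823208774), ("netcom", 0.016825261055182127),
           ("things", 0.015440418317600587), ("want", 0.012106113182750974),
           ("good", 0.010304699850692597)),
  8 -> Seq(("jesus", 0.014040414898900483), ("church", 0.01270680357421398),
           ("people", 0.010902857549805028), ("paul", 0.010717922636260969),
           ("christian", 0.01001816589354621)),
  12 -> Seq(("windows", 0.03859307988847965), ("drive", 0.027419168870300246),
            ("card", 0.018201005649992354), ("software", 0.015065794528308522),
            ("window", 0.014345107593535791))
)

// For every term, collect the IDs of the topics whose top terms include it.
val topicsByTerm: Map[String, Seq[Int]] =
  topicsExcerpt.toSeq
    .flatMap { case (topicId, terms) => terms.map { case (term, _) => (term, topicId) } }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).sorted }

// Terms listed under more than one topic do little to tell those topics apart.
val sharedTerms: Map[String, Seq[Int]] =
  topicsByTerm.filter { case (_, ids) => ids.size > 1 }

sharedTerms.foreach { case (term, ids) =>
  println(s"$term appears in topics ${ids.mkString(", ")}")
}
```

On this excerpt the only shared term is `people`, which appears in
topics 0, 6 and 8. That makes it a natural candidate for a
data-specific stopword.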

To improve the results further, we could try some of the following:

-   Refilter the data for additional data-specific stopwords
-   Use stemming or lemmatization to preprocess the data
-   Experiment with a smaller number of topics, since some of the
    topics in the 20 Newsgroups data are quite similar
-   Increase the model's `MaxIterations` setting
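
The first two ideas can be sketched on plain Scala collections. This is
only an illustration under stated assumptions: the stopword list is
hand-picked from generic top terms in the output above, and `crudeStem`
is a deliberately naive plural stripper standing in for a real stemmer
or lemmatizer. All names here are hypothetical.

```scala
// Hand-picked, data-specific stopwords: generic words that show up among the
// top terms of several topics above. This list is an assumption, not a tuned one.
val extraStopwords: Set[String] = Set("people", "good", "things", "want", "probably")

// Naive suffix stripping that merges simple plurals such as "games" and "game".
// It is only a stand-in for a real stemmer or lemmatizer.
def crudeStem(token: String): String =
  if (token.length > 4 && token.endsWith("s") && !token.endsWith("ss")) token.dropRight(1)
  else token

// Drop the extra stopwords first, then stem whatever is left.
def refilter(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(extraStopwords.contains).map(crudeStem)

val sample = Seq("people", "games", "game", "files", "file", "windows", "window", "good")
println(refilter(sample))
```

After refiltering, `games`/`game`, `files`/`file` and `windows`/`window`
collapse into single vocabulary entries. The same rule would also turn
`jesus` into `jesu`, which is a good reason to prefer a proper stemmer
over this shortcut in a real pipeline.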

Visualize Results
-----------------

We will try visualizing the results obtained from the EM LDA model with
a d3 bubble chart.

In [None]:
// Zip topic terms with topic IDs
val termArray = topics.zipWithIndex

  

>     termArray: Array[(Array[(String, Double)], Int)] = Array((Array((people,0.013669950279615473), (turkish,0.012589446317671895), (power,0.011562911802003646), (armenians,0.009630031607012843), (state,0.008970764201152035)),0), (Array((president,0.012611727242199007), (candida,0.007831400710143183), (people,0.007236708874482028), (help,0.0071875150392249135), (good,0.0068614461851317764)),1), (Array((john,0.017314708303090353), (read,0.016649998278242404), (david,0.01612424392875142), (wrote,0.013492756373023043), (book,0.01284347086675008)),2), (Array((group,0.019077778219962822), (time,0.013309556388583813), (world,0.013284634297848614), (science,0.012895629630722405), (news,0.012622753621515886)),3), (Array((jpeg,0.02807219086862892), (file,0.02279193035831542), (image,0.0194505007349796), (files,0.012942440820968409), (color,0.012826486709331534)),4), (Array((space,0.023842803515972327), (nasa,0.012758041129028296), (launch,0.011217280608105142), (earth,0.009997058638158966), (shuttle,0.009032118770824646)),5), (Array((people,0.016865341823208774), (netcom,0.016825261055182127), (things,0.015440418317600587), (want,0.012106113182750974), (good,0.010304699850692597)),6), (Array((true,0.014220526283559909), (truth,0.012631168518367666), (wrong,0.012555951644270336), (question,0.011574292577776707), (frank,0.01048868971613375)),7), (Array((jesus,0.014040414898900483), (church,0.01270680357421398), (people,0.010902857549805028), (paul,0.010717922636260969), (christian,0.01001816589354621)),8), (Array((wire,0.014037103668261866), (power,0.013555721025806456), (used,0.012516554650882044), (wiring,0.011959985021820363), (ground,0.011450644497320205)),9), (Array((mark,0.012383107605954015), (speed,0.011829510900575697), (bike,0.009481024754599455), (problem,0.008178117398401121), (thing,0.008144782572592813)),10), (Array((israel,0.014925183607418603), (people,0.009581182347438974), (israeli,0.008770851989791053), (jewish,0.0076584570498958796), 
(case,0.007314463669539004)),11), (Array((windows,0.03859307988847965), (drive,0.027419168870300246), (card,0.018201005649992354), (software,0.015065794528308522), (window,0.014345107593535791)),12), (Array((money,0.014225381856759272), (insurance,0.010050331220816595), (right,0.00977063406649008), (people,0.008795489327973019), (probably,0.008646891569460264)),13), (Array((send,0.02195591908828555), (list,0.019379171839426126), (mail,0.01819501694851847), (looking,0.01682359979037597), (request,0.01600464183747604)),14), (Array((people,0.020025915060240766), (government,0.011914779817356594), (started,0.011652192323642917), (police,0.010666176855822633), (koresh,0.010366058169772603)),15), (Array((game,0.023643821554770753), (year,0.020615209712539002), (play,0.015861220524255214), (team,0.014993577810213812), (games,0.00977967167197561)),16), (Array((encryption,0.016182899875128307), (chip,0.01439691996001174), (government,0.011463190284006705), (technology,0.01000883239093308), (clipper,0.009430305456545711)),17), (Array((university,0.019902657675382598), (working,0.010374362298625423), (email,0.009816122482765829), (phone,0.009588736191172297), (need,0.00893649722419374)),18), (Array((data,0.018444866352691897), (image,0.017480078221570557), (program,0.010045391658608608), (available,0.009516965697547774), (info,0.00789517504344056)),19))

In [None]:
// Transform data into the form (term, probability, topicId)
val termRDD = sc.parallelize(termArray)
val termRDD2 = termRDD.flatMap { case (terms, topicId) =>
  // Attach the topic ID to each (term, probability) pair
  terms.map { case (term, probability) => (term, probability, topicId) }
}

  

>     termRDD: org.apache.spark.rdd.RDD[(Array[(String, Double)], Int)] = ParallelCollectionRDD[50987] at parallelize at command-2972105651606744:2
>     termRDD2: org.apache.spark.rdd.RDD[(String, Double, Int)] = MapPartitionsRDD[50988] at flatMap at command-2972105651606744:3
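
The reshaping itself does not depend on Spark. As a minimal sketch, the
same flattening can be tried on a local array; `localTermArray` and
`localTerms` are illustrative names, and the data is the first two
topics of `termArray` trimmed to two terms each.

```scala
// Two topics from `termArray` above, trimmed to two terms each for brevity.
val localTermArray: Array[(Array[(String, Double)], Int)] = Array(
  (Array(("people", 0.013669950279615473), ("turkish", 0.012589446317671895)), 0),
  (Array(("president", 0.012611727242199007), ("candida", 0.007831400710143183)), 1)
)

// Attach the topic ID to every (term, probability) pair and flatten the nesting.
val localTerms: Array[(String, Double, Int)] =
  localTermArray.flatMap { case (terms, topicId) =>
    terms.map { case (term, probability) => (term, probability, topicId) }
  }

localTerms.foreach(println)
```

Each nested `(terms, topicId)` entry becomes one flat
`(term, probability, topicId)` row per term, which is the shape the
DataFrame in the next cell expects.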

In [None]:
// Create DF with proper column names
val termDF = termRDD2.toDF("term", "probability", "topicId")

  

>     termDF: org.apache.spark.sql.DataFrame = [term: string, probability: double ... 1 more field]

In [None]:
display(termDF)

  

[TABLE]

Truncated to 30 rows

  

Next we convert the DataFrame to JSON, one object per row, so that it
can be passed to d3.

In [None]:
// Create JSON data
val rawJson = termDF.toJSON.collect().mkString(",\n")

  

>     rawJson: String =
>     {"term":"people","probability":0.013669950279615473,"topicId":0},
>     {"term":"turkish","probability":0.012589446317671895,"topicId":0},
>     {"term":"power","probability":0.011562911802003646,"topicId":0},
>     {"term":"armenians","probability":0.009630031607012843,"topicId":0},
>     {"term":"state","probability":0.008970764201152035,"topicId":0},
>     {"term":"president","probability":0.012611727242199007,"topicId":1},
>     {"term":"candida","probability":0.007831400710143183,"topicId":1},
>     {"term":"people","probability":0.007236708874482028,"topicId":1},
>     {"term":"help","probability":0.0071875150392249135,"topicId":1},
>     {"term":"good","probability":0.0068614461851317764,"topicId":1},
>     {"term":"john","probability":0.017314708303090353,"topicId":2},
>     {"term":"read","probability":0.016649998278242404,"topicId":2},
>     {"term":"david","probability":0.01612424392875142,"topicId":2},
>     {"term":"wrote","probability":0.013492756373023043,"topicId":2},
>     {"term":"book","probability":0.01284347086675008,"topicId":2},
>     {"term":"group","probability":0.019077778219962822,"topicId":3},
>     {"term":"time","probability":0.013309556388583813,"topicId":3},
>     {"term":"world","probability":0.013284634297848614,"topicId":3},
>     {"term":"science","probability":0.012895629630722405,"topicId":3},
>     {"term":"news","probability":0.012622753621515886,"topicId":3},
>     {"term":"jpeg","probability":0.02807219086862892,"topicId":4},
>     {"term":"file","probability":0.02279193035831542,"topicId":4},
>     {"term":"image","probability":0.0194505007349796,"topicId":4},
>     {"term":"files","probability":0.012942440820968409,"topicId":4},
>     {"term":"color","probability":0.012826486709331534,"topicId":4},
>     {"term":"space","probability":0.023842803515972327,"topicId":5},
>     {"term":"nasa","probability":0.012758041129028296,"topicId":5},
>     {"term":"launch","probability":0.011217280608105142,"topicId":5},
>     {"term":"earth","probability":0.009997058638158966,"topicId":5},
>     {"term":"shuttle","probability":0.009032118770824646,"topicId":5},
>     {"term":"people","probability":0.016865341823208774,"topicId":6},
>     {"term":"netcom","probability":0.016825261055182127,"topicId":6},
>     {"term":"things","probability":0.015440418317600587,"topicId":6},
>     {"term":"want","probability":0.012106113182750974,"topicId":6},
>     {"term":"good","probability":0.010304699850692597,"topicId":6},
>     {"term":"true","probability":0.014220526283559909,"topicId":7},
>     {"term":"truth","probability":0.012631168518367666,"topicId":7},
>     {"term":"wrong","probability":0.012555951644270336,"topicId":7},
>     {"term":"question","probability":0.011574292577776707,"topicId":7},
>     {"term":"frank","probability":0.01048868971613375,"topicId":7},
>     {"term":"jesus","probability":0.014040414898900483,"topicId":8},
>     {"term":"church","probability":0.01270680357421398,"topicId":8},
>     {"term":"people","probability":0.010902857549805028,"topicId":8},
>     {"term":"paul","probability":0.010717922636260969,"topicId":8},
>     {"term":"christian","probability":0.01001816589354621,"topicId":8},
>     {"term":"wire","probability":0.014037103668261866,"topicId":9},
>     {"term":"power","probability":0.013555721025806456,"topicId":9},
>     {"term":"used","probability":0.012516554650882044,"topicId":9},
>     {"term":"wiring","probability":0.011959985021820363,"topicId":9},
>     {"term":"ground","probability":0.011450644497320205,"topicId":9},
>     {"term":"mark","probability":0.012383107605954015,"topicId":10},
>     {"term":"speed","probability":0.011829510900575697,"topicId":10},
>     {"term":"bike","probability":0.009481024754599455,"topicId":10},
>     {"term":"problem","probability":0.008178117398401121,"topicId":10},
>     {"term":"thing","probability":0.008144782572592813,"topicId":10},
>     {"term":"israel","probability":0.014925183607418603,"topicId":11},
>     {"term":"people","probability":0.009581182347438974,"topicId":11},
>     {"term":"israeli","probability":0.008770851989791053,"topicId":11},
>     {"term":"jewish","probability":0.0076584570498958796,"topicId":11},
>     {"term":"case","probability":0.007314463669539004,"topicId":11},
>     {"term":"windows","probability":0.03859307988847965,"topicId":12},
>     {"term":"drive","probability":0.027419168870300246,"topicId":12},
>     {"term":"card","probability":0.018201005649992354,"topicId":12},
>     {"term":"software","probability":0.015065794528308522,"topicId":12},
>     {"term":"window","probability":0.014345107593535791,"topicId":12},
>     {"term":"money","probability":0.014225381856759272,"topicId":13},
>     {"term":"insurance","probability":0.010050331220816595,"topicId":13},
>     {"term":"right","probability":0.00977063406649008,"topicId":13},
>     {"term":"people","probability":0.008795489327973019,"topicId":13},
>     {"term":"probably","probability":0.008646891569460264,"topicId":13},
>     {"term":"send","probability":0.02195591908828555,"topicId":14},
>     {"term":"list","probability":0.019379171839426126,"topicId":14},
>     {"term":"mail","probability":0.01819501694851847,"topicId":14},
>     {"term":"looking","probability":0.01682359979037597,"topicId":14},
>     {"term":"request","probability":0.01600464183747604,"topicId":14},
>     {"term":"people","probability":0.020025915060240766,"topicId":15},
>     {"term":"government","probability":0.011914779817356594,"topicId":15},
>     {"term":"started","probability":0.011652192323642917,"topicId":15},
>     {"term":"police","probability":0.010666176855822633,"topicId":15},
>     {"term":"koresh","probability":0.010366058169772603,"topicId":15},
>     {"term":"game","probability":0.023643821554770753,"topicId":16},
>     {"term":"year","probability":0.020615209712539002,"topicId":16},
>     {"term":"play","probability":0.015861220524255214,"topicId":16},
>     {"term":"team","probability":0.014993577810213812,"topicId":16},
>     {"term":"games","probability":0.00977967167197561,"topicId":16},
>     {"term":"encryption","probability":0.016182899875128307,"topicId":17},
>     {"term":"chip","probability":0.01439691996001174,"topicId":17},
>     {"term":"government","probability":0.011463190284006705,"topicId":17},
>     {"term":"technology","probability":0.01000883239093308,"topicId":17},
>     {"term":"clipper","probability":0.009430305456545711,"topicId":17},
>     {"term":"university","probability":0.019902657675382598,"topicId":18},
>     {"term":"working","probability":0.010374362298625423,"topicId":18},
>     {"term":"email","probability":0.009816122482765829,"topicId":18},
>     {"term":"phone","probability":0.009588736191172297,"topicId":18},
>     {"term":"need","probability":0.00893649722419374,"topicId":18},
>     {"term":"data","probability":0.018444866352691897,"topicId":19},
>     {"term":"image","probability":0.017480078221570557,"topicId":19},
>     {"term":"program","probability":0.010045391658608608,"topicId":19},
>     {"term":"available","probability":0.009516965697547774,"topicId":19},
>     {"term":"info","probability":0.00789517504344056,"topicId":19}
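
To make the shape of `rawJson` concrete, the same kind of string can be
built by hand for a couple of rows. This is a sketch, not what Spark
does internally: `toJsonRow` is a hypothetical helper that performs no
JSON escaping, which is only acceptable here because the vocabulary
consists of plain alphanumeric terms.

```scala
// Hypothetical helper: renders one (term, probability, topicId) row as a JSON object.
// It does no escaping, so it is only safe for plain alphanumeric terms.
def toJsonRow(term: String, probability: Double, topicId: Int): String =
  s"""{"term":"$term","probability":$probability,"topicId":$topicId}"""

// The first two rows of the table above.
val sampleRows: Seq[(String, Double, Int)] = Seq(
  ("people", 0.013669950279615473, 0),
  ("turkish", 0.012589446317671895, 0)
)

// Join the objects with ",\n", mirroring `collect().mkString(",\n")`.
val handJson: String =
  sampleRows.map { case (term, p, id) => toJsonRow(term, p, id) }.mkString(",\n")

println(handJson)
```

Note that the result is a comma-separated run of objects rather than a
JSON array. The enclosing brackets come from the `"children": [ ... ]`
template in the d3 cell that follows.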

  

We are now ready to use d3 on the `rawJson` data.

In [None]:
displayHTML(s"""
<!DOCTYPE html>
<meta charset="utf-8">
<style>

circle {
  fill: rgb(31, 119, 180);
  fill-opacity: 0.5;
  stroke: rgb(31, 119, 180);
  stroke-width: 1px;
}

.leaf circle {
  fill: #ff7f0e;
  fill-opacity: 1;
}

text {
  font: 14px sans-serif;
}

</style>
<body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
<script>

var json = {
 "name": "data",
 "children": [
  {
     "name": "topics",
     "children": [
      ${rawJson}
     ]
    }
   ]
};

var r = 1000,
    // Probabilities are fractions, so the tooltip shows them to 4 decimals
    format = d3.format(".4f");

var bubble = d3.layout.pack()
    .sort(null)
    .size([r, r])
    .padding(1.5);

var vis = d3.select("body").append("svg")
    .attr("width", r)
    .attr("height", r)
    .attr("class", "bubble");

  
var node = vis.selectAll("g.node")
    .data(bubble.nodes(classes(json))
    .filter(function(d) { return !d.children; }))
    .enter().append("g")
    .attr("class", "node")
    .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

// One color per topic ID
var color = d3.scale.category20();
  
  node.append("title")
      .text(function(d) { return d.className + ": " + format(d.value); });

  node.append("circle")
      .attr("r", function(d) { return d.r; })
      .style("fill", function(d) {return color(d.topicName);});

var text = node.append("text")
    .attr("text-anchor", "middle")
    .attr("dy", ".3em")
    .text(function(d) { return d.className.substring(0, d.r / 3)});
  
  text.append("tspan")
      .attr("dy", "1.2em")
      .attr("x", 0)
      .text(function(d) {return Math.ceil(d.value * 10000) /10000; });

// Returns a flattened hierarchy containing all leaf nodes under the root.
function classes(root) {
  var classes = [];

  function recurse(term, node) {
    if (node.children) node.children.forEach(function(child) { recurse(node.term, child); });
    else classes.push({topicName: node.topicId, className: node.term, value: node.probability});
  }

  recurse(null, root);
  return {children: classes};
}
</script>
""")

  

### You try!

**NOW or Later as HOMEWORK**

1.  Try the same process on the State of the Union Addresses dataset
    from Week 1. As a first step, locate that data: go to week 1 and
    check whether each SoU can be treated as a document for topic
    modeling, and whether SoUs cluster in time within the same topic.

2.  Try to improve the results on this newsgroup dataset by extending
    the pipeline with stemming, lemmatization, etc. (this could be the
    basis of a project). You can also parse the input to bring in the
    newsgroup IDs from the directories (consider exploiting the file
    names returned by the `wholeTextFiles` method). This lets you
    explore how well your unsupervised algorithm is doing relative to
    the known newsgroup of each document (note that you generally won't
    have the luxury of knowing the topic labels for typical datasets in
    the unsupervised topic modeling domain).

3.  Try to parse the data into something closer to the clean dataset
    available in `/databricks-datasets/news20.binary/*` and walk through
    the following notebook (*but in Scala!*):

    -   <https://docs.cloud.databricks.com/docs/latest/sample_applications/07%20Sample%20ML/MLPipeline%20Newsgroup%20Dataset.html>
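
As a starting point for exercise 2, here is a minimal sketch of pulling
the newsgroup name out of a file path and cross-tabulating it against a
topic assignment. It assumes the `.../mini_newsgroups/<newsgroup>/<id>`
layout of the archive extracted later in this notebook. The `dbfs:/tmp`
prefix and the topic numbers in `assignments` are made up for
illustration, and `newsgroupOf` is a hypothetical helper.

```scala
// Extracts the newsgroup name from a file path, assuming the layout
// .../mini_newsgroups/<newsgroup>/<articleId>.
def newsgroupOf(path: String): Option[String] = {
  val parts = path.split('/').filter(_.nonEmpty)
  // The newsgroup is the second-to-last path component.
  if (parts.length >= 2) Some(parts(parts.length - 2)) else None
}

// Hypothetical (path, assignedTopic) pairs. The file names follow the archive
// listing; the path prefix and the topic assignments are invented.
val assignments: Seq[(String, Int)] = Seq(
  ("dbfs:/tmp/mini_newsgroups/alt.atheism/51127", 8),
  ("dbfs:/tmp/mini_newsgroups/alt.atheism/51310", 7),
  ("dbfs:/tmp/mini_newsgroups/comp.graphics/38464", 4),
  ("dbfs:/tmp/mini_newsgroups/comp.graphics/38965", 4)
)

// Count how often each known newsgroup lands in each topic.
val counts: Map[(String, Int), Int] =
  assignments
    .flatMap { case (path, topic) => newsgroupOf(path).map(group => (group, topic)) }
    .groupBy(identity)
    .map { case (key, hits) => key -> hits.size }

counts.foreach { case ((group, topic), n) => println(s"$group -> topic $topic: $n") }
```

With real assignments, a newsgroup whose documents concentrate in one
topic suggests that the topic has captured it well, while a newsgroup
spread thinly over many topics suggests the opposite.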

In [None]:
ls /databricks-datasets/news20.binary/data-001

  

[TABLE]

  

Step 1. Downloading and Loading Data into DBFS
----------------------------------------------

**You don't have to do the download in Databricks if the cell above
lists contents in `/databricks-datasets/news20.binary/data-001`.**

Here are the steps taken to download the data and save it to the
distributed file system. Rerun them to repeat this process on your
Databricks cluster or to download a new source of data.
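
The "is the data already there?" check can also be done
programmatically. The sketch below is one possible way, assuming the
directory is reachable through the driver's local file system. The
`/dbfs/...` path used here is such an assumption and may differ on your
cluster; `hasContents` is a hypothetical helper.

```scala
import java.io.File

// Returns true when `dir` exists and contains at least one entry.
def hasContents(dir: String): Boolean = {
  val entries = new File(dir).list() // null if `dir` is missing or not a directory
  entries != null && entries.nonEmpty
}

// Assumption: DBFS is visible on the driver under the local /dbfs mount.
val datasetDir = "/dbfs/databricks-datasets/news20.binary/data-001"

if (hasContents(datasetDir)) println(s"$datasetDir is already populated; skip the download")
else println(s"$datasetDir is empty or missing; run the download steps")
```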

In [None]:
wget http://kdd.ics.uci.edu/databases/20newsgroups/mini_newsgroups.tar.gz -O /tmp/newsgroups.tar.gz

  

>     --2020-11-18 16:58:10--  http://kdd.ics.uci.edu/databases/20newsgroups/mini_newsgroups.tar.gz
>     Resolving kdd.ics.uci.edu (kdd.ics.uci.edu)... 128.195.1.86
>     Connecting to kdd.ics.uci.edu (kdd.ics.uci.edu)|128.195.1.86|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: 1860687 (1.8M) [application/x-gzip]
>     Saving to: ‘/tmp/newsgroups.tar.gz’
>
>     2020-11-18 16:58:11 (8.63 MB/s) - ‘/tmp/newsgroups.tar.gz’ saved [1860687/1860687]

  

Untar the file into the /tmp/ folder.

In [None]:
tar xvfz /tmp/newsgroups.tar.gz -C /tmp/

  

>     mini_newsgroups/alt.atheism/
>     mini_newsgroups/alt.atheism/51127
>     mini_newsgroups/alt.atheism/51310
>     mini_newsgroups/alt.atheism/53539
>     mini_newsgroups/alt.atheism/53336
>     mini_newsgroups/alt.atheism/53212
>     mini_newsgroups/alt.atheism/51199
>     mini_newsgroups/alt.atheism/54144
>     mini_newsgroups/alt.atheism/54170
>     mini_newsgroups/alt.atheism/51126
>     mini_newsgroups/alt.atheism/51313
>     mini_newsgroups/alt.atheism/51166
>     mini_newsgroups/alt.atheism/53760
>     mini_newsgroups/alt.atheism/53211
>     mini_newsgroups/alt.atheism/54251
>     mini_newsgroups/alt.atheism/53188
>     mini_newsgroups/alt.atheism/54237
>     mini_newsgroups/alt.atheism/51227
>     mini_newsgroups/alt.atheism/51146
>     mini_newsgroups/alt.atheism/53542
>     mini_newsgroups/alt.atheism/53291
>     mini_newsgroups/alt.atheism/53150
>     mini_newsgroups/alt.atheism/53427
>     mini_newsgroups/alt.atheism/53061
>     mini_newsgroups/alt.atheism/53564
>     mini_newsgroups/alt.atheism/53574
>     mini_newsgroups/alt.atheism/53351
>     mini_newsgroups/alt.atheism/53334
>     mini_newsgroups/alt.atheism/53610
>     mini_newsgroups/alt.atheism/51195
>     mini_newsgroups/alt.atheism/53753
>     mini_newsgroups/alt.atheism/53410
>     mini_newsgroups/alt.atheism/53303
>     mini_newsgroups/alt.atheism/53565
>     mini_newsgroups/alt.atheism/51170
>     mini_newsgroups/alt.atheism/51305
>     mini_newsgroups/alt.atheism/54137
>     mini_newsgroups/alt.atheism/53312
>     mini_newsgroups/alt.atheism/53575
>     mini_newsgroups/alt.atheism/53458
>     mini_newsgroups/alt.atheism/53249
>     mini_newsgroups/alt.atheism/53299
>     mini_newsgroups/alt.atheism/53393
>     mini_newsgroups/alt.atheism/54485
>     mini_newsgroups/alt.atheism/54254
>     mini_newsgroups/alt.atheism/54171
>     mini_newsgroups/alt.atheism/51281
>     mini_newsgroups/alt.atheism/53607
>     mini_newsgroups/alt.atheism/53606
>     mini_newsgroups/alt.atheism/53190
>     mini_newsgroups/alt.atheism/51223
>     mini_newsgroups/alt.atheism/51251
>     mini_newsgroups/alt.atheism/53525
>     mini_newsgroups/alt.atheism/53154
>     mini_newsgroups/alt.atheism/53126
>     mini_newsgroups/alt.atheism/53670
>     mini_newsgroups/alt.atheism/54250
>     mini_newsgroups/alt.atheism/53590
>     mini_newsgroups/alt.atheism/53512
>     mini_newsgroups/alt.atheism/53518
>     mini_newsgroups/alt.atheism/53284
>     mini_newsgroups/alt.atheism/54244
>     mini_newsgroups/alt.atheism/54215
>     mini_newsgroups/alt.atheism/54234
>     mini_newsgroups/alt.atheism/51121
>     mini_newsgroups/alt.atheism/53222
>     mini_newsgroups/alt.atheism/53433
>     mini_newsgroups/alt.atheism/53538
>     mini_newsgroups/alt.atheism/51203
>     mini_newsgroups/alt.atheism/53399
>     mini_newsgroups/alt.atheism/54222
>     mini_newsgroups/alt.atheism/51314
>     mini_newsgroups/alt.atheism/53358
>     mini_newsgroups/alt.atheism/53408
>     mini_newsgroups/alt.atheism/53599
>     mini_newsgroups/alt.atheism/51139
>     mini_newsgroups/alt.atheism/53369
>     mini_newsgroups/alt.atheism/53474
>     mini_newsgroups/alt.atheism/53623
>     mini_newsgroups/alt.atheism/51186
>     mini_newsgroups/alt.atheism/53653
>     mini_newsgroups/alt.atheism/53490
>     mini_newsgroups/alt.atheism/51191
>     mini_newsgroups/alt.atheism/53235
>     mini_newsgroups/alt.atheism/53633
>     mini_newsgroups/alt.atheism/54160
>     mini_newsgroups/alt.atheism/53420
>     mini_newsgroups/alt.atheism/51174
>     mini_newsgroups/alt.atheism/53558
>     mini_newsgroups/alt.atheism/51222
>     mini_newsgroups/alt.atheism/53123
>     mini_newsgroups/alt.atheism/54140
>     mini_newsgroups/alt.atheism/53659
>     mini_newsgroups/alt.atheism/53759
>     mini_newsgroups/alt.atheism/53603
>     mini_newsgroups/alt.atheism/53459
>     mini_newsgroups/alt.atheism/53062
>     mini_newsgroups/alt.atheism/51143
>     mini_newsgroups/alt.atheism/51131
>     mini_newsgroups/alt.atheism/54201
>     mini_newsgroups/alt.atheism/53509
>     mini_newsgroups/comp.graphics/
>     mini_newsgroups/comp.graphics/38464
>     mini_newsgroups/comp.graphics/38965
>     mini_newsgroups/comp.graphics/39659
>     mini_newsgroups/comp.graphics/38936
>     mini_newsgroups/comp.graphics/39008
>     mini_newsgroups/comp.graphics/39620
>     mini_newsgroups/comp.graphics/38980
>     mini_newsgroups/comp.graphics/39664
>     mini_newsgroups/comp.graphics/37916
>     mini_newsgroups/comp.graphics/38788
>     mini_newsgroups/comp.graphics/38867
>     mini_newsgroups/comp.graphics/39013
>     mini_newsgroups/comp.graphics/38755
>     mini_newsgroups/comp.graphics/38907
>     mini_newsgroups/comp.graphics/38853
>     mini_newsgroups/comp.graphics/38606
>     mini_newsgroups/comp.graphics/38998
>     mini_newsgroups/comp.graphics/39000
>     mini_newsgroups/comp.graphics/38571
>     mini_newsgroups/comp.graphics/38491
>     mini_newsgroups/comp.graphics/38421
>     mini_newsgroups/comp.graphics/38489
>     mini_newsgroups/comp.graphics/39027
>     mini_newsgroups/comp.graphics/38573
>     mini_newsgroups/comp.graphics/38693
>     mini_newsgroups/comp.graphics/37936
>     mini_newsgroups/comp.graphics/38470
>     mini_newsgroups/comp.graphics/38439
>     mini_newsgroups/comp.graphics/38636
>     mini_newsgroups/comp.graphics/38355
>     mini_newsgroups/comp.graphics/39675
>     mini_newsgroups/comp.graphics/39022
>     mini_newsgroups/comp.graphics/39017
>     mini_newsgroups/comp.graphics/38983
>     mini_newsgroups/comp.graphics/38839
>     mini_newsgroups/comp.graphics/38921
>     mini_newsgroups/comp.graphics/38925
>     mini_newsgroups/comp.graphics/38753
>     mini_newsgroups/comp.graphics/38880
>     mini_newsgroups/comp.graphics/39621
>     mini_newsgroups/comp.graphics/38264
>     mini_newsgroups/comp.graphics/38674
>     mini_newsgroups/comp.graphics/38843
>     mini_newsgroups/comp.graphics/39663
>     mini_newsgroups/comp.graphics/38244
>     mini_newsgroups/comp.graphics/38700
>     mini_newsgroups/comp.graphics/38459
>     mini_newsgroups/comp.graphics/38904
>     mini_newsgroups/comp.graphics/37930
>     mini_newsgroups/comp.graphics/38379
>     mini_newsgroups/comp.graphics/38670
>     mini_newsgroups/comp.graphics/38750
>     mini_newsgroups/comp.graphics/38942
>     mini_newsgroups/comp.graphics/38375
>     mini_newsgroups/comp.graphics/39049
>     mini_newsgroups/comp.graphics/37921
>     mini_newsgroups/comp.graphics/38380
>     mini_newsgroups/comp.graphics/38577
>     mini_newsgroups/comp.graphics/38758
>     mini_newsgroups/comp.graphics/39078
>     mini_newsgroups/comp.graphics/38409
>     mini_newsgroups/comp.graphics/38709
>     mini_newsgroups/comp.graphics/38968
>     mini_newsgroups/comp.graphics/38562
>     mini_newsgroups/comp.graphics/38370
>     mini_newsgroups/comp.graphics/38683
>     mini_newsgroups/comp.graphics/39048
>     mini_newsgroups/comp.graphics/38251
>     mini_newsgroups/comp.graphics/38220
>     mini_newsgroups/comp.graphics/38761
>     mini_newsgroups/comp.graphics/38224
>     mini_newsgroups/comp.graphics/38473
>     mini_newsgroups/comp.graphics/38386
>     mini_newsgroups/comp.graphics/39615
>     mini_newsgroups/comp.graphics/38266
>     mini_newsgroups/comp.graphics/38466
>     mini_newsgroups/comp.graphics/38622
>     mini_newsgroups/comp.graphics/38628
>     mini_newsgroups/comp.graphics/38603
>     mini_newsgroups/comp.graphics/39668
>     mini_newsgroups/comp.graphics/39072
>     mini_newsgroups/comp.graphics/37947
>     mini_newsgroups/comp.graphics/38613
>     mini_newsgroups/comp.graphics/38884
>     mini_newsgroups/comp.graphics/38369
>     mini_newsgroups/comp.graphics/38271
>     mini_newsgroups/comp.graphics/38402
>     mini_newsgroups/comp.graphics/38929
>     mini_newsgroups/comp.graphics/37944
>     mini_newsgroups/comp.graphics/38845
>     mini_newsgroups/comp.graphics/38846
>     mini_newsgroups/comp.graphics/38625
>     mini_newsgroups/comp.graphics/37942
>     mini_newsgroups/comp.graphics/38835
>     mini_newsgroups/comp.graphics/38893
>     mini_newsgroups/comp.graphics/38856
>     mini_newsgroups/comp.graphics/38454
>     mini_newsgroups/comp.graphics/38699
>     mini_newsgroups/comp.graphics/38704
>     mini_newsgroups/comp.graphics/38518
>     mini_newsgroups/comp.os.ms-windows.misc/
>     mini_newsgroups/comp.os.ms-windows.misc/9704
>     mini_newsgroups/comp.os.ms-windows.misc/10942
>     mini_newsgroups/comp.os.ms-windows.misc/9667
>     mini_newsgroups/comp.os.ms-windows.misc/9883
>     mini_newsgroups/comp.os.ms-windows.misc/10167
>     mini_newsgroups/comp.os.ms-windows.misc/9994
>     mini_newsgroups/comp.os.ms-windows.misc/9639
>     mini_newsgroups/comp.os.ms-windows.misc/9908
>     mini_newsgroups/comp.os.ms-windows.misc/10031
>     mini_newsgroups/comp.os.ms-windows.misc/9975
>     mini_newsgroups/comp.os.ms-windows.misc/10141
>     mini_newsgroups/comp.os.ms-windows.misc/10139
>     mini_newsgroups/comp.os.ms-windows.misc/9645
>     mini_newsgroups/comp.os.ms-windows.misc/10087
>     mini_newsgroups/comp.os.ms-windows.misc/9141
>     mini_newsgroups/comp.os.ms-windows.misc/9571
>     mini_newsgroups/comp.os.ms-windows.misc/9539
>     mini_newsgroups/comp.os.ms-windows.misc/9622
>     mini_newsgroups/comp.os.ms-windows.misc/10047
>     mini_newsgroups/comp.os.ms-windows.misc/9519
>     mini_newsgroups/comp.os.ms-windows.misc/10094
>     mini_newsgroups/comp.os.ms-windows.misc/9881
>     mini_newsgroups/comp.os.ms-windows.misc/10093
>     mini_newsgroups/comp.os.ms-windows.misc/10806
>     mini_newsgroups/comp.os.ms-windows.misc/9151
>     mini_newsgroups/comp.os.ms-windows.misc/10107
>     mini_newsgroups/comp.os.ms-windows.misc/9718
>     mini_newsgroups/comp.os.ms-windows.misc/9499
>     mini_newsgroups/comp.os.ms-windows.misc/10742
>     mini_newsgroups/comp.os.ms-windows.misc/10015
>     mini_newsgroups/comp.os.ms-windows.misc/10076
>     mini_newsgroups/comp.os.ms-windows.misc/9485
>     mini_newsgroups/comp.os.ms-windows.misc/10005
>     mini_newsgroups/comp.os.ms-windows.misc/9725
>     mini_newsgroups/comp.os.ms-windows.misc/9939
>     mini_newsgroups/comp.os.ms-windows.misc/9799
>     mini_newsgroups/comp.os.ms-windows.misc/10023
>     mini_newsgroups/comp.os.ms-windows.misc/10790
>     mini_newsgroups/comp.os.ms-windows.misc/10857
>     mini_newsgroups/comp.os.ms-windows.misc/9456
>     mini_newsgroups/comp.os.ms-windows.misc/9776
>     mini_newsgroups/comp.os.ms-windows.misc/10114
>     mini_newsgroups/comp.os.ms-windows.misc/9496
>     mini_newsgroups/comp.os.ms-windows.misc/10128
>     mini_newsgroups/comp.os.ms-windows.misc/9859
>     mini_newsgroups/comp.os.ms-windows.misc/9586
>     mini_newsgroups/comp.os.ms-windows.misc/10692
>     mini_newsgroups/comp.os.ms-windows.misc/10142
>     mini_newsgroups/comp.os.ms-windows.misc/9803
>     mini_newsgroups/comp.os.ms-windows.misc/9911
>     mini_newsgroups/comp.os.ms-windows.misc/9726
>     mini_newsgroups/comp.os.ms-windows.misc/9567
>     mini_newsgroups/comp.os.ms-windows.misc/9512
>     mini_newsgroups/comp.os.ms-windows.misc/10160
>     mini_newsgroups/comp.os.ms-windows.misc/9486
>     mini_newsgroups/comp.os.ms-windows.misc/9697
>     mini_newsgroups/comp.os.ms-windows.misc/9995
>     mini_newsgroups/comp.os.ms-windows.misc/9744
>     mini_newsgroups/comp.os.ms-windows.misc/9737
>     mini_newsgroups/comp.os.ms-windows.misc/9942
>     mini_newsgroups/comp.os.ms-windows.misc/10125
>     mini_newsgroups/comp.os.ms-windows.misc/10157
>     mini_newsgroups/comp.os.ms-windows.misc/9970
>     mini_newsgroups/comp.os.ms-windows.misc/9790
>     mini_newsgroups/comp.os.ms-windows.misc/10850
>     mini_newsgroups/comp.os.ms-windows.misc/9679
>     mini_newsgroups/comp.os.ms-windows.misc/10835
>     mini_newsgroups/comp.os.ms-windows.misc/9924
>     mini_newsgroups/comp.os.ms-windows.misc/10843
>     mini_newsgroups/comp.os.ms-windows.misc/10830
>     mini_newsgroups/comp.os.ms-windows.misc/10791
>     mini_newsgroups/comp.os.ms-windows.misc/9538
>     mini_newsgroups/comp.os.ms-windows.misc/10188
>     mini_newsgroups/comp.os.ms-windows.misc/10848
>     mini_newsgroups/comp.os.ms-windows.misc/10814
>     mini_newsgroups/comp.os.ms-windows.misc/9758
>     mini_newsgroups/comp.os.ms-windows.misc/9750
>     mini_newsgroups/comp.os.ms-windows.misc/9706
>     mini_newsgroups/comp.os.ms-windows.misc/10849
>     mini_newsgroups/comp.os.ms-windows.misc/9902
>     mini_newsgroups/comp.os.ms-windows.misc/10041
>     mini_newsgroups/comp.os.ms-windows.misc/9479
>     mini_newsgroups/comp.os.ms-windows.misc/10090
>     mini_newsgroups/comp.os.ms-windows.misc/10016
>     mini_newsgroups/comp.os.ms-windows.misc/10158
>     mini_newsgroups/comp.os.ms-windows.misc/10115
>     mini_newsgroups/comp.os.ms-windows.misc/9997
>     mini_newsgroups/comp.os.ms-windows.misc/9657
>     mini_newsgroups/comp.os.ms-windows.misc/10812
>     mini_newsgroups/comp.os.ms-windows.misc/10781
>     mini_newsgroups/comp.os.ms-windows.misc/10838
>     mini_newsgroups/comp.os.ms-windows.misc/10003
>     mini_newsgroups/comp.os.ms-windows.misc/10008
>     mini_newsgroups/comp.os.ms-windows.misc/9804
>     mini_newsgroups/comp.os.ms-windows.misc/9814
>     mini_newsgroups/comp.os.ms-windows.misc/9933
>     mini_newsgroups/comp.os.ms-windows.misc/9943
>     mini_newsgroups/comp.os.ms-windows.misc/9509
>     mini_newsgroups/comp.os.ms-windows.misc/9600
>     mini_newsgroups/comp.os.ms-windows.misc/9779
>     mini_newsgroups/comp.sys.ibm.pc.hardware/
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60369
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60393
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60543
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60842
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60389
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60232
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61094
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61076
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60481
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60691
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60425
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60475
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60735
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60732
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61019
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60304
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60882
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60992
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61046
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61120
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61044
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60859
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60838
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60137
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60235
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58983
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61009
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60912
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58831
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61090
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60134
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61168
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60652
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61164
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60377
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60684
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60150
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58829
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60722
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60439
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60694
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60440
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60828
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61026
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61173
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60548
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60685
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60699
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60411
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61158
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60199
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60376
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60656
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60928
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60271
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60837
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61022
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60551
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61175
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60278
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60474
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61098
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58922
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60998
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60409
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60663
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60724
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61154
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60159
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58994
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60769
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60982
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60988
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60453
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60221
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61130
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61153
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60151
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60514
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60749
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60394
>     mini_newsgroups/comp.sys.ibm.pc.hardware/58966
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60961
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60934
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60945
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60457
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60509
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60191
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60404
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61003
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60695
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60841
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61060
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60766
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60273
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60698
>     mini_newsgroups/comp.sys.ibm.pc.hardware/61039
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60154
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60526
>     mini_newsgroups/comp.sys.ibm.pc.hardware/60156
>     mini_newsgroups/comp.sys.mac.hardware/
>     mini_newsgroups/comp.sys.mac.hardware/52155
>     mini_newsgroups/comp.sys.mac.hardware/52123
>     mini_newsgroups/comp.sys.mac.hardware/51752
>     mini_newsgroups/comp.sys.mac.hardware/51565
>     mini_newsgroups/comp.sys.mac.hardware/50473
>     mini_newsgroups/comp.sys.mac.hardware/51494
>     mini_newsgroups/comp.sys.mac.hardware/51867
>     mini_newsgroups/comp.sys.mac.hardware/50457
>     mini_newsgroups/comp.sys.mac.hardware/50419
>     mini_newsgroups/comp.sys.mac.hardware/50439
>     mini_newsgroups/comp.sys.mac.hardware/52342
>     mini_newsgroups/comp.sys.mac.hardware/52102
>     mini_newsgroups/comp.sys.mac.hardware/50546
>     mini_newsgroups/comp.sys.mac.hardware/52270
>     mini_newsgroups/comp.sys.mac.hardware/51928
>     mini_newsgroups/comp.sys.mac.hardware/52059
>     mini_newsgroups/comp.sys.mac.hardware/52113
>     mini_newsgroups/comp.sys.mac.hardware/51855
>     mini_newsgroups/comp.sys.mac.hardware/52156
>     mini_newsgroups/comp.sys.mac.hardware/52312
>     mini_newsgroups/comp.sys.mac.hardware/52231
>     mini_newsgroups/comp.sys.mac.hardware/51519
>     mini_newsgroups/comp.sys.mac.hardware/52246
>     mini_newsgroups/comp.sys.mac.hardware/51948
>     mini_newsgroups/comp.sys.mac.hardware/52335
>     mini_newsgroups/comp.sys.mac.hardware/51813
>     mini_newsgroups/comp.sys.mac.hardware/51751
>     mini_newsgroups/comp.sys.mac.hardware/51809
>     mini_newsgroups/comp.sys.mac.hardware/51963
>     mini_newsgroups/comp.sys.mac.hardware/51711
>     mini_newsgroups/comp.sys.mac.hardware/51763
>     mini_newsgroups/comp.sys.mac.hardware/51950
>     mini_newsgroups/comp.sys.mac.hardware/51846
>     mini_newsgroups/comp.sys.mac.hardware/52284
>     mini_newsgroups/comp.sys.mac.hardware/52094
>     mini_newsgroups/comp.sys.mac.hardware/52403
>     mini_newsgroups/comp.sys.mac.hardware/52269
>     mini_newsgroups/comp.sys.mac.hardware/50465
>     mini_newsgroups/comp.sys.mac.hardware/51707
>     mini_newsgroups/comp.sys.mac.hardware/51786
>     mini_newsgroups/comp.sys.mac.hardware/51539
>     mini_newsgroups/comp.sys.mac.hardware/51703
>     mini_newsgroups/comp.sys.mac.hardware/51962
>     mini_newsgroups/comp.sys.mac.hardware/52175
>     mini_newsgroups/comp.sys.mac.hardware/52296
>     mini_newsgroups/comp.sys.mac.hardware/51522
>     mini_newsgroups/comp.sys.mac.hardware/51805
>     mini_newsgroups/comp.sys.mac.hardware/52090
>     mini_newsgroups/comp.sys.mac.hardware/52190
>     mini_newsgroups/comp.sys.mac.hardware/51678
>     mini_newsgroups/comp.sys.mac.hardware/52069
>     mini_newsgroups/comp.sys.mac.hardware/51661
>     mini_newsgroups/comp.sys.mac.hardware/52276
>     mini_newsgroups/comp.sys.mac.hardware/51510
>     mini_newsgroups/comp.sys.mac.hardware/50533
>     mini_newsgroups/comp.sys.mac.hardware/52238
>     mini_newsgroups/comp.sys.mac.hardware/52065
>     mini_newsgroups/comp.sys.mac.hardware/52264
>     mini_newsgroups/comp.sys.mac.hardware/51613
>     mini_newsgroups/comp.sys.mac.hardware/52300
>     mini_newsgroups/comp.sys.mac.hardware/51996
>     mini_newsgroups/comp.sys.mac.hardware/51501
>     mini_newsgroups/comp.sys.mac.hardware/52079
>     mini_newsgroups/comp.sys.mac.hardware/50551
>     mini_newsgroups/comp.sys.mac.hardware/51799
>     mini_newsgroups/comp.sys.mac.hardware/52214
>     mini_newsgroups/comp.sys.mac.hardware/51750
>     mini_newsgroups/comp.sys.mac.hardware/51626
>     mini_newsgroups/comp.sys.mac.hardware/52223
>     mini_newsgroups/comp.sys.mac.hardware/51652
>     mini_newsgroups/comp.sys.mac.hardware/51832
>     mini_newsgroups/comp.sys.mac.hardware/52037
>     mini_newsgroups/comp.sys.mac.hardware/52163
>     mini_newsgroups/comp.sys.mac.hardware/51790
>     mini_newsgroups/comp.sys.mac.hardware/51782
>     mini_newsgroups/comp.sys.mac.hardware/52149
>     mini_newsgroups/comp.sys.mac.hardware/52071
>     mini_newsgroups/comp.sys.mac.hardware/52010
>     mini_newsgroups/comp.sys.mac.hardware/51808
>     mini_newsgroups/comp.sys.mac.hardware/52404
>     mini_newsgroups/comp.sys.mac.hardware/51595
>     mini_newsgroups/comp.sys.mac.hardware/51943
>     mini_newsgroups/comp.sys.mac.hardware/52234
>     mini_newsgroups/comp.sys.mac.hardware/50440
>     mini_newsgroups/comp.sys.mac.hardware/51770
>     mini_newsgroups/comp.sys.mac.hardware/51503
>     mini_newsgroups/comp.sys.mac.hardware/52081
>     mini_newsgroups/comp.sys.mac.hardware/51847
>     mini_newsgroups/comp.sys.mac.hardware/52045
>     mini_newsgroups/comp.sys.mac.hardware/52248
>     mini_newsgroups/comp.sys.mac.hardware/51929
>     mini_newsgroups/comp.sys.mac.hardware/52050
>     mini_newsgroups/comp.sys.mac.hardware/50518
>     mini_newsgroups/comp.sys.mac.hardware/51587
>     mini_newsgroups/comp.sys.mac.hardware/51675
>     mini_newsgroups/comp.sys.mac.hardware/51514
>     mini_newsgroups/comp.sys.mac.hardware/51509
>     mini_newsgroups/comp.sys.mac.hardware/51720
>     mini_newsgroups/comp.sys.mac.hardware/51908
>     mini_newsgroups/comp.sys.mac.hardware/52039
>     mini_newsgroups/comp.windows.x/
>     mini_newsgroups/comp.windows.x/67063
>     mini_newsgroups/comp.windows.x/66893
>     mini_newsgroups/comp.windows.x/67172
>     mini_newsgroups/comp.windows.x/67386
>     mini_newsgroups/comp.windows.x/66918
>     mini_newsgroups/comp.windows.x/67973
>     mini_newsgroups/comp.windows.x/67016
>     mini_newsgroups/comp.windows.x/66456
>     mini_newsgroups/comp.windows.x/67995
>     mini_newsgroups/comp.windows.x/68311
>     mini_newsgroups/comp.windows.x/67981
>     mini_newsgroups/comp.windows.x/67260
>     mini_newsgroups/comp.windows.x/67061
>     mini_newsgroups/comp.windows.x/68232
>     mini_newsgroups/comp.windows.x/67346
>     mini_newsgroups/comp.windows.x/67220
>     mini_newsgroups/comp.windows.x/68239
>     mini_newsgroups/comp.windows.x/66941
>     mini_newsgroups/comp.windows.x/66437
>     mini_newsgroups/comp.windows.x/67178
>     mini_newsgroups/comp.windows.x/67030
>     mini_newsgroups/comp.windows.x/66889
>     mini_newsgroups/comp.windows.x/67282
>     mini_newsgroups/comp.windows.x/68137
>     mini_newsgroups/comp.windows.x/66931
>     mini_newsgroups/comp.windows.x/67306
>     mini_newsgroups/comp.windows.x/67467
>     mini_newsgroups/comp.windows.x/67402
>     mini_newsgroups/comp.windows.x/68012
>     mini_newsgroups/comp.windows.x/68019
>     mini_newsgroups/comp.windows.x/67212
>     mini_newsgroups/comp.windows.x/66986
>     mini_newsgroups/comp.windows.x/67164
>     mini_newsgroups/comp.windows.x/67269
>     mini_newsgroups/comp.windows.x/68047
>     mini_newsgroups/comp.windows.x/67417
>     mini_newsgroups/comp.windows.x/66993
>     mini_newsgroups/comp.windows.x/66911
>     mini_newsgroups/comp.windows.x/67383
>     mini_newsgroups/comp.windows.x/66400
>     mini_newsgroups/comp.windows.x/67078
>     mini_newsgroups/comp.windows.x/67305
>     mini_newsgroups/comp.windows.x/66445
>     mini_newsgroups/comp.windows.x/66944
>     mini_newsgroups/comp.windows.x/67270
>     mini_newsgroups/comp.windows.x/68243
>     mini_newsgroups/comp.windows.x/67540
>     mini_newsgroups/comp.windows.x/66427
>     mini_newsgroups/comp.windows.x/67193
>     mini_newsgroups/comp.windows.x/67171
>     mini_newsgroups/comp.windows.x/67284
>     mini_newsgroups/comp.windows.x/67514
>     mini_newsgroups/comp.windows.x/66981
>     mini_newsgroups/comp.windows.x/67116
>     mini_newsgroups/comp.windows.x/67572
>     mini_newsgroups/comp.windows.x/67449
>     mini_newsgroups/comp.windows.x/67343
>     mini_newsgroups/comp.windows.x/66421
>     mini_newsgroups/comp.windows.x/66420
>     mini_newsgroups/comp.windows.x/67055
>     mini_newsgroups/comp.windows.x/67070
>     mini_newsgroups/comp.windows.x/67380
>     mini_newsgroups/comp.windows.x/67378
>     mini_newsgroups/comp.windows.x/67319
>     mini_newsgroups/comp.windows.x/66905
>     mini_newsgroups/comp.windows.x/66950
>     mini_newsgroups/comp.windows.x/66955
>     mini_newsgroups/comp.windows.x/68002
>     mini_newsgroups/comp.windows.x/67379
>     mini_newsgroups/comp.windows.x/67983
>     mini_newsgroups/comp.windows.x/66465
>     mini_newsgroups/comp.windows.x/67448
>     mini_newsgroups/comp.windows.x/66467
>     mini_newsgroups/comp.windows.x/67516
>     mini_newsgroups/comp.windows.x/67185
>     mini_newsgroups/comp.windows.x/68185
>     mini_newsgroups/comp.windows.x/67486
>     mini_newsgroups/comp.windows.x/66413
>     mini_newsgroups/comp.windows.x/66980
>     mini_newsgroups/comp.windows.x/66964
>     mini_newsgroups/comp.windows.x/67170
>     mini_newsgroups/comp.windows.x/67491
>     mini_newsgroups/comp.windows.x/68110
>     mini_newsgroups/comp.windows.x/66438
>     mini_newsgroups/comp.windows.x/67542
>     mini_newsgroups/comp.windows.x/67320
>     mini_newsgroups/comp.windows.x/67137
>     mini_newsgroups/comp.windows.x/67052
>     mini_newsgroups/comp.windows.x/68237
>     mini_newsgroups/comp.windows.x/67081
>     mini_newsgroups/comp.windows.x/68174
>     mini_newsgroups/comp.windows.x/67036
>     mini_newsgroups/comp.windows.x/64830
>     mini_newsgroups/comp.windows.x/66943
>     mini_newsgroups/comp.windows.x/67140
>     mini_newsgroups/comp.windows.x/66978
>     mini_newsgroups/comp.windows.x/66453
>     mini_newsgroups/comp.windows.x/68228
>     mini_newsgroups/comp.windows.x/67297
>     mini_newsgroups/comp.windows.x/67435
>     mini_newsgroups/misc.forsale/
>     mini_newsgroups/misc.forsale/74801
>     mini_newsgroups/misc.forsale/75941
>     mini_newsgroups/misc.forsale/76499
>     mini_newsgroups/misc.forsale/76460
>     mini_newsgroups/misc.forsale/76937
>     mini_newsgroups/misc.forsale/76299
>     mini_newsgroups/misc.forsale/70337
>     mini_newsgroups/misc.forsale/76927
>     mini_newsgroups/misc.forsale/76287
>     mini_newsgroups/misc.forsale/76062
>     mini_newsgroups/misc.forsale/76483
>     mini_newsgroups/misc.forsale/74745
>
>     *** WARNING: skipped 28325 bytes of output ***
>
>     mini_newsgroups/sci.med/59212
>     mini_newsgroups/sci.med/59161
>     mini_newsgroups/sci.med/58951
>     mini_newsgroups/sci.med/58852
>     mini_newsgroups/sci.med/59218
>     mini_newsgroups/sci.med/59197
>     mini_newsgroups/sci.med/59001
>     mini_newsgroups/sci.med/59368
>     mini_newsgroups/sci.med/59111
>     mini_newsgroups/sci.space/
>     mini_newsgroups/sci.space/60821
>     mini_newsgroups/sci.space/61455
>     mini_newsgroups/sci.space/61087
>     mini_newsgroups/sci.space/61027
>     mini_newsgroups/sci.space/61277
>     mini_newsgroups/sci.space/60191
>     mini_newsgroups/sci.space/61401
>     mini_newsgroups/sci.space/61145
>     mini_newsgroups/sci.space/61335
>     mini_newsgroups/sci.space/60960
>     mini_newsgroups/sci.space/61440
>     mini_newsgroups/sci.space/61230
>     mini_newsgroups/sci.space/61038
>     mini_newsgroups/sci.space/61276
>     mini_newsgroups/sci.space/60937
>     mini_newsgroups/sci.space/60843
>     mini_newsgroups/sci.space/61189
>     mini_newsgroups/sci.space/62408
>     mini_newsgroups/sci.space/62480
>     mini_newsgroups/sci.space/60925
>     mini_newsgroups/sci.space/60976
>     mini_newsgroups/sci.space/61256
>     mini_newsgroups/sci.space/60962
>     mini_newsgroups/sci.space/61171
>     mini_newsgroups/sci.space/61293
>     mini_newsgroups/sci.space/61546
>     mini_newsgroups/sci.space/61352
>     mini_newsgroups/sci.space/61009
>     mini_newsgroups/sci.space/62477
>     mini_newsgroups/sci.space/61371
>     mini_newsgroups/sci.space/62398
>     mini_newsgroups/sci.space/61363
>     mini_newsgroups/sci.space/60243
>     mini_newsgroups/sci.space/60942
>     mini_newsgroups/sci.space/61461
>     mini_newsgroups/sci.space/60993
>     mini_newsgroups/sci.space/60946
>     mini_newsgroups/sci.space/61236
>     mini_newsgroups/sci.space/60237
>     mini_newsgroups/sci.space/60840
>     mini_newsgroups/sci.space/61484
>     mini_newsgroups/sci.space/60929
>     mini_newsgroups/sci.space/61316
>     mini_newsgroups/sci.space/60995
>     mini_newsgroups/sci.space/61505
>     mini_newsgroups/sci.space/61154
>     mini_newsgroups/sci.space/61271
>     mini_newsgroups/sci.space/62319
>     mini_newsgroups/sci.space/60950
>     mini_newsgroups/sci.space/62428
>     mini_newsgroups/sci.space/61344
>     mini_newsgroups/sci.space/60822
>     mini_newsgroups/sci.space/60229
>     mini_newsgroups/sci.space/61017
>     mini_newsgroups/sci.space/61353
>     mini_newsgroups/sci.space/61215
>     mini_newsgroups/sci.space/61459
>     mini_newsgroups/sci.space/60834
>     mini_newsgroups/sci.space/61324
>     mini_newsgroups/sci.space/61165
>     mini_newsgroups/sci.space/61404
>     mini_newsgroups/sci.space/61558
>     mini_newsgroups/sci.space/61160
>     mini_newsgroups/sci.space/61118
>     mini_newsgroups/sci.space/60827
>     mini_newsgroups/sci.space/60222
>     mini_newsgroups/sci.space/61136
>     mini_newsgroups/sci.space/60171
>     mini_newsgroups/sci.space/61180
>     mini_newsgroups/sci.space/61532
>     mini_newsgroups/sci.space/61224
>     mini_newsgroups/sci.space/61272
>     mini_newsgroups/sci.space/60913
>     mini_newsgroups/sci.space/60944
>     mini_newsgroups/sci.space/61253
>     mini_newsgroups/sci.space/60941
>     mini_newsgroups/sci.space/59848
>     mini_newsgroups/sci.space/61046
>     mini_newsgroups/sci.space/61362
>     mini_newsgroups/sci.space/61187
>     mini_newsgroups/sci.space/61205
>     mini_newsgroups/sci.space/60181
>     mini_newsgroups/sci.space/61262
>     mini_newsgroups/sci.space/61208
>     mini_newsgroups/sci.space/61265
>     mini_newsgroups/sci.space/60154
>     mini_newsgroups/sci.space/61106
>     mini_newsgroups/sci.space/61192
>     mini_newsgroups/sci.space/60972
>     mini_newsgroups/sci.space/60794
>     mini_newsgroups/sci.space/60804
>     mini_newsgroups/sci.space/61057
>     mini_newsgroups/sci.space/61318
>     mini_newsgroups/sci.space/59904
>     mini_newsgroups/sci.space/61191
>     mini_newsgroups/sci.space/61450
>     mini_newsgroups/sci.space/61051
>     mini_newsgroups/sci.space/61534
>     mini_newsgroups/sci.space/61066
>     mini_newsgroups/sci.space/61431
>     mini_newsgroups/soc.religion.christian/
>     mini_newsgroups/soc.religion.christian/20736
>     mini_newsgroups/soc.religion.christian/20801
>     mini_newsgroups/soc.religion.christian/20674
>     mini_newsgroups/soc.religion.christian/20896
>     mini_newsgroups/soc.religion.christian/21319
>     mini_newsgroups/soc.religion.christian/21672
>     mini_newsgroups/soc.religion.christian/20812
>     mini_newsgroups/soc.religion.christian/21451
>     mini_newsgroups/soc.religion.christian/20947
>     mini_newsgroups/soc.religion.christian/20850
>     mini_newsgroups/soc.religion.christian/20744
>     mini_newsgroups/soc.religion.christian/20774
>     mini_newsgroups/soc.religion.christian/21696
>     mini_newsgroups/soc.religion.christian/21419
>     mini_newsgroups/soc.religion.christian/20899
>     mini_newsgroups/soc.religion.christian/20710
>     mini_newsgroups/soc.religion.christian/21339
>     mini_newsgroups/soc.religion.christian/20629
>     mini_newsgroups/soc.religion.christian/21373
>     mini_newsgroups/soc.religion.christian/20743
>     mini_newsgroups/soc.religion.christian/20738
>     mini_newsgroups/soc.religion.christian/20571
>     mini_newsgroups/soc.religion.christian/20742
>     mini_newsgroups/soc.religion.christian/21329
>     mini_newsgroups/soc.religion.christian/21481
>     mini_newsgroups/soc.religion.christian/20634
>     mini_newsgroups/soc.religion.christian/21334
>     mini_newsgroups/soc.religion.christian/20866
>     mini_newsgroups/soc.religion.christian/20886
>     mini_newsgroups/soc.religion.christian/21578
>     mini_newsgroups/soc.religion.christian/20724
>     mini_newsgroups/soc.religion.christian/21453
>     mini_newsgroups/soc.religion.christian/20511
>     mini_newsgroups/soc.religion.christian/20800
>     mini_newsgroups/soc.religion.christian/20491
>     mini_newsgroups/soc.religion.christian/21784
>     mini_newsgroups/soc.religion.christian/20657
>     mini_newsgroups/soc.religion.christian/20976
>     mini_newsgroups/soc.religion.christian/21799
>     mini_newsgroups/soc.religion.christian/21407
>     mini_newsgroups/soc.religion.christian/21658
>     mini_newsgroups/soc.religion.christian/21777
>     mini_newsgroups/soc.religion.christian/21754
>     mini_newsgroups/soc.religion.christian/21559
>     mini_newsgroups/soc.religion.christian/20799
>     mini_newsgroups/soc.religion.christian/21800
>     mini_newsgroups/soc.religion.christian/20621
>     mini_newsgroups/soc.religion.christian/21648
>     mini_newsgroups/soc.religion.christian/20914
>     mini_newsgroups/soc.religion.christian/21396
>     mini_newsgroups/soc.religion.christian/20540
>     mini_newsgroups/soc.religion.christian/21558
>     mini_newsgroups/soc.religion.christian/21621
>     mini_newsgroups/soc.religion.christian/20965
>     mini_newsgroups/soc.religion.christian/21788
>     mini_newsgroups/soc.religion.christian/21505
>     mini_newsgroups/soc.religion.christian/20936
>     mini_newsgroups/soc.religion.christian/21580
>     mini_newsgroups/soc.religion.christian/21585
>     mini_newsgroups/soc.religion.christian/21699
>     mini_newsgroups/soc.religion.christian/21531
>     mini_newsgroups/soc.religion.christian/20689
>     mini_newsgroups/soc.religion.christian/21382
>     mini_newsgroups/soc.religion.christian/21773
>     mini_newsgroups/soc.religion.christian/20952
>     mini_newsgroups/soc.religion.christian/21493
>     mini_newsgroups/soc.religion.christian/20900
>     mini_newsgroups/soc.religion.christian/21418
>     mini_newsgroups/soc.religion.christian/20867
>     mini_newsgroups/soc.religion.christian/21761
>     mini_newsgroups/soc.religion.christian/20779
>     mini_newsgroups/soc.religion.christian/20503
>     mini_newsgroups/soc.religion.christian/21522
>     mini_newsgroups/soc.religion.christian/20767
>     mini_newsgroups/soc.religion.christian/21663
>     mini_newsgroups/soc.religion.christian/21709
>     mini_newsgroups/soc.religion.christian/21535
>     mini_newsgroups/soc.religion.christian/21702
>     mini_newsgroups/soc.religion.christian/21597
>     mini_newsgroups/soc.religion.christian/20719
>     mini_newsgroups/soc.religion.christian/21529
>     mini_newsgroups/soc.religion.christian/20603
>     mini_newsgroups/soc.religion.christian/20664
>     mini_newsgroups/soc.religion.christian/20960
>     mini_newsgroups/soc.religion.christian/20898
>     mini_newsgroups/soc.religion.christian/20798
>     mini_newsgroups/soc.religion.christian/20637
>     mini_newsgroups/soc.religion.christian/20602
>     mini_newsgroups/soc.religion.christian/20554
>     mini_newsgroups/soc.religion.christian/21618
>     mini_newsgroups/soc.religion.christian/21698
>     mini_newsgroups/soc.religion.christian/21544
>     mini_newsgroups/soc.religion.christian/21708
>     mini_newsgroups/soc.religion.christian/21524
>     mini_newsgroups/soc.religion.christian/21342
>     mini_newsgroups/soc.religion.christian/20890
>     mini_newsgroups/soc.religion.christian/20811
>     mini_newsgroups/soc.religion.christian/20626
>     mini_newsgroups/soc.religion.christian/20506
>     mini_newsgroups/soc.religion.christian/21804
>     mini_newsgroups/talk.politics.guns/
>     mini_newsgroups/talk.politics.guns/54196
>     mini_newsgroups/talk.politics.guns/54303
>     mini_newsgroups/talk.politics.guns/54117
>     mini_newsgroups/talk.politics.guns/54402
>     mini_newsgroups/talk.politics.guns/54843
>     mini_newsgroups/talk.politics.guns/54200
>     mini_newsgroups/talk.politics.guns/54630
>     mini_newsgroups/talk.politics.guns/54616
>     mini_newsgroups/talk.politics.guns/54592
>     mini_newsgroups/talk.politics.guns/54697
>     mini_newsgroups/talk.politics.guns/53302
>     mini_newsgroups/talk.politics.guns/54323
>     mini_newsgroups/talk.politics.guns/54169
>     mini_newsgroups/talk.politics.guns/54877
>     mini_newsgroups/talk.politics.guns/55115
>     mini_newsgroups/talk.politics.guns/54230
>     mini_newsgroups/talk.politics.guns/54138
>     mini_newsgroups/talk.politics.guns/54637
>     mini_newsgroups/talk.politics.guns/53373
>     mini_newsgroups/talk.politics.guns/54312
>     mini_newsgroups/talk.politics.guns/55073
>     mini_newsgroups/talk.politics.guns/54417
>     mini_newsgroups/talk.politics.guns/55468
>     mini_newsgroups/talk.politics.guns/54875
>     mini_newsgroups/talk.politics.guns/54590
>     mini_newsgroups/talk.politics.guns/54861
>     mini_newsgroups/talk.politics.guns/54297
>     mini_newsgroups/talk.politics.guns/55484
>     mini_newsgroups/talk.politics.guns/55063
>     mini_newsgroups/talk.politics.guns/54302
>     mini_newsgroups/talk.politics.guns/55123
>     mini_newsgroups/talk.politics.guns/55264
>     mini_newsgroups/talk.politics.guns/54956
>     mini_newsgroups/talk.politics.guns/55470
>     mini_newsgroups/talk.politics.guns/53328
>     mini_newsgroups/talk.politics.guns/54660
>     mini_newsgroups/talk.politics.guns/53304
>     mini_newsgroups/talk.politics.guns/54570
>     mini_newsgroups/talk.politics.guns/53329
>     mini_newsgroups/talk.politics.guns/54715
>     mini_newsgroups/talk.politics.guns/54429
>     mini_newsgroups/talk.politics.guns/54248
>     mini_newsgroups/talk.politics.guns/54243
>     mini_newsgroups/talk.politics.guns/54452
>     mini_newsgroups/talk.politics.guns/53358
>     mini_newsgroups/talk.politics.guns/53348
>     mini_newsgroups/talk.politics.guns/54479
>     mini_newsgroups/talk.politics.guns/54416
>     mini_newsgroups/talk.politics.guns/54279
>     mini_newsgroups/talk.politics.guns/54659
>     mini_newsgroups/talk.politics.guns/54728
>     mini_newsgroups/talk.politics.guns/54447
>     mini_newsgroups/talk.politics.guns/55080
>     mini_newsgroups/talk.politics.guns/54675
>     mini_newsgroups/talk.politics.guns/54239
>     mini_newsgroups/talk.politics.guns/54518
>     mini_newsgroups/talk.politics.guns/54342
>     mini_newsgroups/talk.politics.guns/54591
>     mini_newsgroups/talk.politics.guns/53325
>     mini_newsgroups/talk.politics.guns/54726
>     mini_newsgroups/talk.politics.guns/54357
>     mini_newsgroups/talk.politics.guns/54395
>     mini_newsgroups/talk.politics.guns/54276
>     mini_newsgroups/talk.politics.guns/55106
>     mini_newsgroups/talk.politics.guns/55231
>     mini_newsgroups/talk.politics.guns/54714
>     mini_newsgroups/talk.politics.guns/54611
>     mini_newsgroups/talk.politics.guns/54450
>     mini_newsgroups/talk.politics.guns/54634
>     mini_newsgroups/talk.politics.guns/55068
>     mini_newsgroups/talk.politics.guns/54164
>     mini_newsgroups/talk.politics.guns/54560
>     mini_newsgroups/talk.politics.guns/54446
>     mini_newsgroups/talk.politics.guns/53369
>     mini_newsgroups/talk.politics.guns/55116
>     mini_newsgroups/talk.politics.guns/54538
>     mini_newsgroups/talk.politics.guns/54469
>     mini_newsgroups/talk.politics.guns/54633
>     mini_newsgroups/talk.politics.guns/54860
>     mini_newsgroups/talk.politics.guns/55036
>     mini_newsgroups/talk.politics.guns/55278
>     mini_newsgroups/talk.politics.guns/54535
>     mini_newsgroups/talk.politics.guns/54211
>     mini_newsgroups/talk.politics.guns/55060
>     mini_newsgroups/talk.politics.guns/54404
>     mini_newsgroups/talk.politics.guns/54698
>     mini_newsgroups/talk.politics.guns/54322
>     mini_newsgroups/talk.politics.guns/54748
>     mini_newsgroups/talk.politics.guns/55260
>     mini_newsgroups/talk.politics.guns/55489
>     mini_newsgroups/talk.politics.guns/54238
>     mini_newsgroups/talk.politics.guns/54152
>     mini_newsgroups/talk.politics.guns/54154
>     mini_newsgroups/talk.politics.guns/55239
>     mini_newsgroups/talk.politics.guns/55249
>     mini_newsgroups/talk.politics.guns/54433
>     mini_newsgroups/talk.politics.guns/54449
>     mini_newsgroups/talk.politics.guns/54626
>     mini_newsgroups/talk.politics.guns/54586
>     mini_newsgroups/talk.politics.guns/54624
>     mini_newsgroups/talk.politics.mideast/
>     mini_newsgroups/talk.politics.mideast/76504
>     mini_newsgroups/talk.politics.mideast/76424
>     mini_newsgroups/talk.politics.mideast/76020
>     mini_newsgroups/talk.politics.mideast/77815
>     mini_newsgroups/talk.politics.mideast/77387
>     mini_newsgroups/talk.politics.mideast/75972
>     mini_newsgroups/talk.politics.mideast/76401
>     mini_newsgroups/talk.politics.mideast/76222
>     mini_newsgroups/talk.politics.mideast/76403
>     mini_newsgroups/talk.politics.mideast/76285
>     mini_newsgroups/talk.politics.mideast/75388
>     mini_newsgroups/talk.politics.mideast/76154
>     mini_newsgroups/talk.politics.mideast/76120
>     mini_newsgroups/talk.politics.mideast/76113
>     mini_newsgroups/talk.politics.mideast/76322
>     mini_newsgroups/talk.politics.mideast/76014
>     mini_newsgroups/talk.politics.mideast/76516
>     mini_newsgroups/talk.politics.mideast/75895
>     mini_newsgroups/talk.politics.mideast/75913
>     mini_newsgroups/talk.politics.mideast/76500
>     mini_newsgroups/talk.politics.mideast/77232
>     mini_newsgroups/talk.politics.mideast/75982
>     mini_newsgroups/talk.politics.mideast/76542
>     mini_newsgroups/talk.politics.mideast/77288
>     mini_newsgroups/talk.politics.mideast/77275
>     mini_newsgroups/talk.politics.mideast/76435
>     mini_newsgroups/talk.politics.mideast/75942
>     mini_newsgroups/talk.politics.mideast/75976
>     mini_newsgroups/talk.politics.mideast/76284
>     mini_newsgroups/talk.politics.mideast/77235
>     mini_newsgroups/talk.politics.mideast/77332
>     mini_newsgroups/talk.politics.mideast/75916
>     mini_newsgroups/talk.politics.mideast/76389
>     mini_newsgroups/talk.politics.mideast/75966
>     mini_newsgroups/talk.politics.mideast/75910
>     mini_newsgroups/talk.politics.mideast/76320
>     mini_newsgroups/talk.politics.mideast/76369
>     mini_newsgroups/talk.politics.mideast/76495
>     mini_newsgroups/talk.politics.mideast/77272
>     mini_newsgroups/talk.politics.mideast/75918
>     mini_newsgroups/talk.politics.mideast/75920
>     mini_newsgroups/talk.politics.mideast/77383
>     mini_newsgroups/talk.politics.mideast/76456
>     mini_newsgroups/talk.politics.mideast/75952
>     mini_newsgroups/talk.politics.mideast/76213
>     mini_newsgroups/talk.politics.mideast/76062
>     mini_newsgroups/talk.politics.mideast/76205
>     mini_newsgroups/talk.politics.mideast/75917
>     mini_newsgroups/talk.politics.mideast/75979
>     mini_newsgroups/talk.politics.mideast/76242
>     mini_newsgroups/talk.politics.mideast/76548
>     mini_newsgroups/talk.politics.mideast/75956
>     mini_newsgroups/talk.politics.mideast/76557
>     mini_newsgroups/talk.politics.mideast/76277
>     mini_newsgroups/talk.politics.mideast/76121
>     mini_newsgroups/talk.politics.mideast/76416
>     mini_newsgroups/talk.politics.mideast/75396
>     mini_newsgroups/talk.politics.mideast/76073
>     mini_newsgroups/talk.politics.mideast/77250
>     mini_newsgroups/talk.politics.mideast/76160
>     mini_newsgroups/talk.politics.mideast/75929
>     mini_newsgroups/talk.politics.mideast/77218
>     mini_newsgroups/talk.politics.mideast/76517
>     mini_newsgroups/talk.politics.mideast/76501
>     mini_newsgroups/talk.politics.mideast/75938
>     mini_newsgroups/talk.politics.mideast/76444
>     mini_newsgroups/talk.politics.mideast/76068
>     mini_newsgroups/talk.politics.mideast/76398
>     mini_newsgroups/talk.politics.mideast/76271
>     mini_newsgroups/talk.politics.mideast/76047
>     mini_newsgroups/talk.politics.mideast/76458
>     mini_newsgroups/talk.politics.mideast/76372
>     mini_newsgroups/talk.politics.mideast/75943
>     mini_newsgroups/talk.politics.mideast/76166
>     mini_newsgroups/talk.politics.mideast/77813
>     mini_newsgroups/talk.politics.mideast/76410
>     mini_newsgroups/talk.politics.mideast/76374
>     mini_newsgroups/talk.politics.mideast/75369
>     mini_newsgroups/talk.politics.mideast/76508
>     mini_newsgroups/talk.politics.mideast/77212
>     mini_newsgroups/talk.politics.mideast/76249
>     mini_newsgroups/talk.politics.mideast/76233
>     mini_newsgroups/talk.politics.mideast/77305
>     mini_newsgroups/talk.politics.mideast/76082
>     mini_newsgroups/talk.politics.mideast/77177
>     mini_newsgroups/talk.politics.mideast/75875
>     mini_newsgroups/talk.politics.mideast/77203
>     mini_newsgroups/talk.politics.mideast/75901
>     mini_newsgroups/talk.politics.mideast/76080
>     mini_newsgroups/talk.politics.mideast/77392
>     mini_newsgroups/talk.politics.mideast/76498
>     mini_newsgroups/talk.politics.mideast/76105
>     mini_newsgroups/talk.politics.mideast/76179
>     mini_newsgroups/talk.politics.mideast/76197
>     mini_newsgroups/talk.politics.mideast/77322
>     mini_newsgroups/talk.politics.mideast/76544
>     mini_newsgroups/talk.politics.mideast/75394
>     mini_newsgroups/talk.politics.mideast/75963
>     mini_newsgroups/talk.politics.mideast/76152
>     mini_newsgroups/talk.politics.mideast/76395
>     mini_newsgroups/talk.politics.misc/
>     mini_newsgroups/talk.politics.misc/178813
>     mini_newsgroups/talk.politics.misc/176916
>     mini_newsgroups/talk.politics.misc/178862
>     mini_newsgroups/talk.politics.misc/178661
>     mini_newsgroups/talk.politics.misc/178433
>     mini_newsgroups/talk.politics.misc/178327
>     mini_newsgroups/talk.politics.misc/176884
>     mini_newsgroups/talk.politics.misc/178865
>     mini_newsgroups/talk.politics.misc/178656
>     mini_newsgroups/talk.politics.misc/178631
>     mini_newsgroups/talk.politics.misc/176878
>     mini_newsgroups/talk.politics.misc/178775
>     mini_newsgroups/talk.politics.misc/176986
>     mini_newsgroups/talk.politics.misc/178564
>     mini_newsgroups/talk.politics.misc/176869
>     mini_newsgroups/talk.politics.misc/178390
>     mini_newsgroups/talk.politics.misc/176984
>     mini_newsgroups/talk.politics.misc/178745
>     mini_newsgroups/talk.politics.misc/178994
>     mini_newsgroups/talk.politics.misc/176895
>     mini_newsgroups/talk.politics.misc/178789
>     mini_newsgroups/talk.politics.misc/178678
>     mini_newsgroups/talk.politics.misc/178887
>     mini_newsgroups/talk.politics.misc/178945
>     mini_newsgroups/talk.politics.misc/178997
>     mini_newsgroups/talk.politics.misc/178788
>     mini_newsgroups/talk.politics.misc/178532
>     mini_newsgroups/talk.politics.misc/176926
>     mini_newsgroups/talk.politics.misc/178718
>     mini_newsgroups/talk.politics.misc/178824
>     mini_newsgroups/talk.politics.misc/178907
>     mini_newsgroups/talk.politics.misc/178569
>     mini_newsgroups/talk.politics.misc/178960
>     mini_newsgroups/talk.politics.misc/178765
>     mini_newsgroups/talk.politics.misc/178870
>     mini_newsgroups/talk.politics.misc/178489
>     mini_newsgroups/talk.politics.misc/178738
>     mini_newsgroups/talk.politics.misc/176904
>     mini_newsgroups/talk.politics.misc/178566
>     mini_newsgroups/talk.politics.misc/176930
>     mini_newsgroups/talk.politics.misc/176983
>     mini_newsgroups/talk.politics.misc/178682
>     mini_newsgroups/talk.politics.misc/178301
>     mini_newsgroups/talk.politics.misc/176956
>     mini_newsgroups/talk.politics.misc/179066
>     mini_newsgroups/talk.politics.misc/178368
>     mini_newsgroups/talk.politics.misc/178382
>     mini_newsgroups/talk.politics.misc/178906
>     mini_newsgroups/talk.politics.misc/179070
>     mini_newsgroups/talk.politics.misc/178481
>     mini_newsgroups/talk.politics.misc/178851
>     mini_newsgroups/talk.politics.misc/178799
>     mini_newsgroups/talk.politics.misc/178924
>     mini_newsgroups/talk.politics.misc/178721
>     mini_newsgroups/talk.politics.misc/178451
>     mini_newsgroups/talk.politics.misc/178792
>     mini_newsgroups/talk.politics.misc/176881
>     mini_newsgroups/talk.politics.misc/178654
>     mini_newsgroups/talk.politics.misc/176951
>     mini_newsgroups/talk.politics.misc/178801
>     mini_newsgroups/talk.politics.misc/179018
>     mini_newsgroups/talk.politics.misc/176886
>     mini_newsgroups/talk.politics.misc/178360
>     mini_newsgroups/talk.politics.misc/178699
>     mini_newsgroups/talk.politics.misc/176988
>     mini_newsgroups/talk.politics.misc/179097
>     mini_newsgroups/talk.politics.misc/179067
>     mini_newsgroups/talk.politics.misc/178318
>     mini_newsgroups/talk.politics.misc/178447
>     mini_newsgroups/talk.politics.misc/178455
>     mini_newsgroups/talk.politics.misc/177008
>     mini_newsgroups/talk.politics.misc/178527
>     mini_newsgroups/talk.politics.misc/178337
>     mini_newsgroups/talk.politics.misc/178556
>     mini_newsgroups/talk.politics.misc/178522
>     mini_newsgroups/talk.politics.misc/178517
>     mini_newsgroups/talk.politics.misc/178793
>     mini_newsgroups/talk.politics.misc/178579
>     mini_newsgroups/talk.politics.misc/178837
>     mini_newsgroups/talk.politics.misc/178731
>     mini_newsgroups/talk.politics.misc/178668
>     mini_newsgroups/talk.politics.misc/178939
>     mini_newsgroups/talk.politics.misc/178560
>     mini_newsgroups/talk.politics.misc/178571
>     mini_newsgroups/talk.politics.misc/178487
>     mini_newsgroups/talk.politics.misc/178965
>     mini_newsgroups/talk.politics.misc/178998
>     mini_newsgroups/talk.politics.misc/176982
>     mini_newsgroups/talk.politics.misc/178606
>     mini_newsgroups/talk.politics.misc/178341
>     mini_newsgroups/talk.politics.misc/178622
>     mini_newsgroups/talk.politics.misc/178309
>     mini_newsgroups/talk.politics.misc/178769
>     mini_newsgroups/talk.politics.misc/178927
>     mini_newsgroups/talk.politics.misc/178751
>     mini_newsgroups/talk.politics.misc/178349
>     mini_newsgroups/talk.politics.misc/179095
>     mini_newsgroups/talk.politics.misc/178993
>     mini_newsgroups/talk.politics.misc/178361
>     mini_newsgroups/talk.politics.misc/178610
>     mini_newsgroups/talk.religion.misc/
>     mini_newsgroups/talk.religion.misc/83682
>     mini_newsgroups/talk.religion.misc/83790
>     mini_newsgroups/talk.religion.misc/83751
>     mini_newsgroups/talk.religion.misc/83618
>     mini_newsgroups/talk.religion.misc/83535
>     mini_newsgroups/talk.religion.misc/84068
>     mini_newsgroups/talk.religion.misc/83732
>     mini_newsgroups/talk.religion.misc/83798
>     mini_newsgroups/talk.religion.misc/83564
>     mini_newsgroups/talk.religion.misc/84351
>     mini_newsgroups/talk.religion.misc/84212
>     mini_newsgroups/talk.religion.misc/83601
>     mini_newsgroups/talk.religion.misc/83775
>     mini_newsgroups/talk.religion.misc/83919
>     mini_newsgroups/talk.religion.misc/84355
>     mini_newsgroups/talk.religion.misc/82799
>     mini_newsgroups/talk.religion.misc/83518
>     mini_newsgroups/talk.religion.misc/84053
>     mini_newsgroups/talk.religion.misc/83866
>     mini_newsgroups/talk.religion.misc/83599
>     mini_newsgroups/talk.religion.misc/83788
>     mini_newsgroups/talk.religion.misc/84350
>     mini_newsgroups/talk.religion.misc/83641
>     mini_newsgroups/talk.religion.misc/84316
>     mini_newsgroups/talk.religion.misc/83771
>     mini_newsgroups/talk.religion.misc/83717
>     mini_newsgroups/talk.religion.misc/83645
>     mini_newsgroups/talk.religion.misc/83742
>     mini_newsgroups/talk.religion.misc/83979
>     mini_newsgroups/talk.religion.misc/84356
>     mini_newsgroups/talk.religion.misc/83567
>     mini_newsgroups/talk.religion.misc/84060
>     mini_newsgroups/talk.religion.misc/83843
>     mini_newsgroups/talk.religion.misc/84413
>     mini_newsgroups/talk.religion.misc/83474
>     mini_newsgroups/talk.religion.misc/84510
>     mini_newsgroups/talk.religion.misc/83748
>     mini_newsgroups/talk.religion.misc/83829
>     mini_newsgroups/talk.religion.misc/83548
>     mini_newsgroups/talk.religion.misc/84400
>     mini_newsgroups/talk.religion.misc/83827
>     mini_newsgroups/talk.religion.misc/84302
>     mini_newsgroups/talk.religion.misc/82796
>     mini_newsgroups/talk.religion.misc/84178
>     mini_newsgroups/talk.religion.misc/84398
>     mini_newsgroups/talk.religion.misc/84567
>     mini_newsgroups/talk.religion.misc/84324
>     mini_newsgroups/talk.religion.misc/82812
>     mini_newsgroups/talk.religion.misc/83817
>     mini_newsgroups/talk.religion.misc/83558
>     mini_newsgroups/talk.religion.misc/83683
>     mini_newsgroups/talk.religion.misc/84103
>     mini_newsgroups/talk.religion.misc/83791
>     mini_newsgroups/talk.religion.misc/84115
>     mini_newsgroups/talk.religion.misc/83797
>     mini_newsgroups/talk.religion.misc/83673
>     mini_newsgroups/talk.religion.misc/84255
>     mini_newsgroups/talk.religion.misc/84164
>     mini_newsgroups/talk.religion.misc/83977
>     mini_newsgroups/talk.religion.misc/84359
>     mini_newsgroups/talk.religion.misc/83773
>     mini_newsgroups/talk.religion.misc/84244
>     mini_newsgroups/talk.religion.misc/84194
>     mini_newsgroups/talk.religion.misc/84290
>     mini_newsgroups/talk.religion.misc/84317
>     mini_newsgroups/talk.religion.misc/84251
>     mini_newsgroups/talk.religion.misc/83898
>     mini_newsgroups/talk.religion.misc/83983
>     mini_newsgroups/talk.religion.misc/83807
>     mini_newsgroups/talk.religion.misc/84152
>     mini_newsgroups/talk.religion.misc/83480
>     mini_newsgroups/talk.religion.misc/82804
>     mini_newsgroups/talk.religion.misc/82758
>     mini_newsgroups/talk.religion.misc/83785
>     mini_newsgroups/talk.religion.misc/84309
>     mini_newsgroups/talk.religion.misc/84195
>     mini_newsgroups/talk.religion.misc/83678
>     mini_newsgroups/talk.religion.misc/84293
>     mini_newsgroups/talk.religion.misc/84345
>     mini_newsgroups/talk.religion.misc/83900
>     mini_newsgroups/talk.religion.misc/84120
>     mini_newsgroups/talk.religion.misc/83730
>     mini_newsgroups/talk.religion.misc/83575
>     mini_newsgroups/talk.religion.misc/83719
>     mini_newsgroups/talk.religion.misc/83574
>     mini_newsgroups/talk.religion.misc/84436
>     mini_newsgroups/talk.religion.misc/83861
>     mini_newsgroups/talk.religion.misc/83563
>     mini_newsgroups/talk.religion.misc/84278
>     mini_newsgroups/talk.religion.misc/83780
>     mini_newsgroups/talk.religion.misc/83862
>     mini_newsgroups/talk.religion.misc/83435
>     mini_newsgroups/talk.religion.misc/83763
>     mini_newsgroups/talk.religion.misc/83796
>     mini_newsgroups/talk.religion.misc/84083
>     mini_newsgroups/talk.religion.misc/83437
>     mini_newsgroups/talk.religion.misc/83750
>     mini_newsgroups/talk.religion.misc/83571
>     mini_newsgroups/talk.religion.misc/83981
>     mini_newsgroups/talk.religion.misc/82764

  

The cell below takes about 10 minutes to run.

NOTE: It is slow partly because each file is small, so we run into the
'small files problem': distributed file systems need metadata for each
file. If the file names are not needed, it may be better to write the
contents of all the files as one large stream into dbfs. We leave this
as it is to show what happens when we upload a dataset of lots of little
files into dbfs.

In [None]:
cp -r file:/tmp/mini_newsgroups dbfs:/datasets/mini_newsgroups

  

>     res1: Boolean = true
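
The alternative mentioned in the note above (one large file instead of many small ones) could be sketched roughly as follows. This is only an illustration under assumptions, not what this notebook does: the helper name `combineSmallFiles`, the tab-separated one-document-per-line layout, and the ISO-8859-1 decoding are all choices made for this sketch. It uses plain JVM file APIs on the driver's local disk, and the destination file should live outside the source directory.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper (not part of the original notebook).
// Writes every regular file under `srcDir` into the single file `dest`,
// one document per line: "<relative path>\t<contents on one line>".
// Returns the number of files combined.
def combineSmallFiles(srcDir: Path, dest: Path): Int = {
  val stream = Files.walk(srcDir)
  try {
    // Collect the file list first, sorted so the output order is deterministic.
    val files = stream.iterator().asScala
      .filter(Files.isRegularFile(_))
      .toList
      .sortBy(_.toString)
    val writer = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)
    try {
      files.foreach { f =>
        // Assumption: ISO-8859-1 decoding, which never fails on arbitrary bytes.
        val text = new String(Files.readAllBytes(f), StandardCharsets.ISO_8859_1)
        // Collapse newlines and tabs so each document stays on one line.
        val oneLine = text.replaceAll("[\\r\\n\\t]+", " ")
        writer.write(srcDir.relativize(f).toString + "\t" + oneLine)
        writer.newLine()
      }
    } finally writer.close()
    files.size
  } finally stream.close()
}
```

With such a helper, one might call `combineSmallFiles(Paths.get("/tmp/mini_newsgroups"), Paths.get("/tmp/mini_newsgroups_combined.tsv"))` and then copy the single resulting file to dbfs with one `cp`; the combined file name here is hypothetical. The trade-off is that the per-file directory layout is lost, which is why the relative path is kept as the first column.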

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/mini_newsgroups"))

  

[TABLE]