Intro

Yahoo LDA is my favourite LDA implementation at the moment, but I've found that some helper scripts make it much easier to understand the topics it generates. Here they are...

Dependencies

It's a mix of Python and C++. (I wrote it all in Python but got bored waiting for large files to be processed.) The C++ stuff requires some building...

sudo apt-get install libboost-all-dev
cmake .
make # builds straight into bin

Input helpers

lda_prep

The input format for YahooLDA is one document per line: primary_id secondary_id token token token ...

$ cat documents
a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50 Your Money : Investors seek security in ...
33350e22199526324bc12d1c7700f944 2012-01-01T08:32:51 Skilled nail tech position | Please contact Kat ...
...

To do a simple normalisation (downcase, then remove all tokens with length < 3 and all tokens without at least one alphanumeric character) use lda_prep

$ cat documents | bin/lda_prep > documents.normalised
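For reference, the normalisation rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the actual lda_prep source; the function name normalise_line is hypothetical, and it assumes the first two whitespace-separated fields are the ids and are passed through unchanged.

```python
def normalise_line(line):
    """Lowercase tokens and drop short or non-alphanumeric ones.

    Assumption: the first two whitespace-separated fields are the primary
    and secondary ids; they are kept exactly as they are.
    """
    fields = line.split()
    if len(fields) < 2:
        return None  # not a valid "primary_id secondary_id tokens..." line
    ids, tokens = fields[:2], fields[2:]
    kept = [
        t.lower()
        for t in tokens
        # keep tokens of length >= 3 that contain at least one alphanumeric
        if len(t) >= 3 and any(c.isalnum() for c in t)
    ]
    return " ".join(ids + kept)


sample = ("a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50 "
          "Your Money : Investors seek security in ...")
print(normalise_line(sample))
# a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50 your money investors seek security
```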

chop_most_least_freq

Sometimes you'll get better results by chopping the most/least frequent terms. A simple version of this can be done with chop_most_least_freq

$ bin/chop_most_least_freq --input documents.normalised --lower 0.001 --upper 0.3 > documents.chopped

This removes all tokens that appear in fewer than 0.1% of documents or in more than 30% of documents. (For a Zipfian distribution of tokens I've found this to give reasonable results.) TODO: rewrite upper/lower based on absolute freq of tokens rather than number of documents, to make it distribution agnostic.

Additionally you might have specially marked up tokens that you want to retain regardless of their frequency. If so run with --keep. Eg to retain all tokens starting with foo you can run

$ bin/chop_most_least_freq --input documents.normalised --lower 0.001 --upper 0.3 --keep foo > documents.chopped

( This app requires two passes over the data so doesn't accept STDIN )
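The two passes can be pictured as follows. This is a minimal Python sketch under stated assumptions, not the C++ implementation: the function name and signature are hypothetical, documents are given as lists of tokens with the ids already stripped, and --keep is treated as a plain prefix match.

```python
from collections import Counter


def chop_most_least_freq(docs, lower, upper, keep=None):
    """Drop tokens whose document frequency falls outside [lower, upper].

    docs  : list of token lists, one per document (hypothetical input shape)
    lower : minimum fraction of documents a token must appear in
    upper : maximum fraction of documents a token may appear in
    keep  : optional prefix; tokens starting with it are always retained
    """
    n_docs = len(docs)

    # Pass 1: count how many documents each token appears in.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    def retained(token):
        if keep is not None and token.startswith(keep):
            return True
        frac = doc_freq[token] / n_docs
        return lower <= frac <= upper

    # Pass 2: rewrite each document with only the retained tokens.
    return [[t for t in tokens if retained(t)] for tokens in docs]
```

The first pass has to see every document before any token can be judged, which is why a streaming input like STDIN does not fit.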

Running YahooLDA

(For completeness) You run YahooLDA using formatter and learntopics

$ formatter < documents.chopped
$ learntopics --topics 100 --iter=100

Output helpers

learntopics produces a few useful output files

lda.docToTop.txt

Document to Topic mapping, one line per document.

<primary_id> <secondary_id> (<topic_id>, <weight>) (<topic_id>, <weight>) ...

a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50    (86,0.25) (45,0.220779) (13,0.207792) (98,0.100649) ...
33350e22199526324bc12d1c7700f944 2012-01-01T08:32:51    (39,0.705882) (36,0.117647) (48,0.0882353) ...
...

lda.topToWor.txt

Topic to Word mapping, one line per topic.

Topic <N>: (<token>, <weight>) (<token>, <weight>) ...

Topic 0: (video,0.151826) (show,0.121268) (live,0.106091) (watch,0.090558)  ...
Topic 1: (game,0.124723) (season,0.085363) (coach,0.0830232) ...
...
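If you want to work with this file from your own scripts, a line can be parsed with a couple of regular expressions. The sketch below is an illustration only; parse_topic_line is a hypothetical helper, and it assumes tokens contain no commas or parentheses.

```python
import re

HEADER = re.compile(r"^Topic (\d+):")
# (token,weight) pairs; assumes tokens contain no commas or parentheses
PAIR = re.compile(r"\(([^,()]+),\s*([^)]+)\)")


def parse_topic_line(line):
    """Return (topic_id, [(token, weight), ...]) for one lda.topToWor.txt line."""
    header = HEADER.match(line)
    if header is None:
        raise ValueError("not a topic line: %r" % line)
    pairs = [(tok, float(w)) for tok, w in PAIR.findall(line)]
    return int(header.group(1)), pairs


topic_id, words = parse_topic_line("Topic 1: (game,0.124723) (season,0.085363)")
print(topic_id, words[0])
```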

lda.worToTop.txt

Word to Topic mapping, one line per document.

<primary_id> <secondary_id> (<token>, <topic_id>) (<token>, <topic_id>) 

a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50 (money,13) (investors,45) (seek,98) ...
33350e22199526324bc12d1c7700f944 2012-01-01T08:32:51 (skilled,39) (nail,36) (tech,39) ...
...

Helpers

topic prob mass per topic

To examine the sum of topic probabilities use mass_per_topic.py. For each topic it outputs the total mass assigned to that topic across all documents. Eg the following shows that topics 58 and 17 have the most mass whereas 42 and 92 are hardly represented at all.

$ cat lda.docToTop.txt | bin/mass_per_topic.py | sort -k2 -nr
58	1795.96650244
17	1673.9486237
...
42	71.94400991
92	67.45150241

On a large corpus you can sample to get a representative result (eg using awk to take every 1000th document)

$ cat lda.docToTop.txt | awk 'NR%1000==0' | bin/mass_per_topic.py | sort -k2 -nr
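The summation itself is simple. As a rough sketch of the idea (not the actual mass_per_topic.py source; the function below is hypothetical), each (topic_id, weight) pair on each line of lda.docToTop.txt adds its weight to that topic's running total:

```python
import re
from collections import defaultdict

# (topic_id,weight) pairs as they appear in lda.docToTop.txt
PAIR = re.compile(r"\((\d+),\s*([^)]+)\)")


def mass_per_topic(lines):
    """Sum the weight assigned to each topic across all document lines."""
    mass = defaultdict(float)
    for line in lines:
        for topic_id, weight in PAIR.findall(line):
            mass[int(topic_id)] += float(weight)
    return dict(mass)


lines = [
    "a4c6da2cd18db81a210cd94773344234 2012-01-01T08:32:50    (86,0.25) (45,0.220779)",
    "33350e22199526324bc12d1c7700f944 2012-01-01T08:32:51    (39,0.705882) (36,0.117647)",
]
for topic, total in sorted(mass_per_topic(lines).items(), key=lambda kv: -kv[1]):
    print("%d\t%s" % (topic, total))
```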

topic freqs

Dump a list of topic frequencies per document (like a more complete version of lda.docToTop.txt)

$ head -n1 lda.worToTop.txt 
3       3       (a,413) (b,198) (c,60) (d,198) (e,290) (f,198) (g,198) (h,290)

$ head -n1 lda.worToTop.txt | worToTop_to_topics.py 
3   3 60   1.000000000000000
3   3 198  4.000000000000000
3   3 290  2.000000000000000
3   3 413  1.000000000000000

Optionally with l2 normalisation

$ head -n1 lda.worToTop.txt | worToTop_to_topics.py --normalise True
3   3 60   0.213200716355610
3   3 198  0.852802865422442
3   3 290  0.426401432711221
3   3 413  0.213200716355610
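The numbers above can be checked by hand: the example line assigns topic 198 four times, 290 twice, and 60 and 413 once each, so the L2 norm is sqrt(1 + 16 + 4 + 1) = sqrt(22) and, for instance, 4 / sqrt(22) ≈ 0.8528. A minimal sketch of that calculation (an illustration only, not the worToTop_to_topics.py source; topic_freqs is a hypothetical name, and tokens are assumed to contain no commas or parentheses):

```python
import math
import re
from collections import Counter

# (token,topic_id) pairs as they appear in lda.worToTop.txt
PAIR = re.compile(r"\(([^,()]+),\s*(\d+)\)")


def topic_freqs(line, normalise=False):
    """Count topic assignments on one line, optionally L2-normalised."""
    counts = Counter(int(topic) for _token, topic in PAIR.findall(line))
    weights = {topic: float(n) for topic, n in counts.items()}
    if normalise:
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {topic: w / norm for topic, w in weights.items()}
    return dict(sorted(weights.items()))


line = "3       3       (a,413) (b,198) (c,60) (d,198) (e,290) (f,198) (g,198) (h,290)"
for topic, w in topic_freqs(line, normalise=True).items():
    print("3   3 %-4d %.15f" % (topic, w))
```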

And with cumsum chop (ie cut back to a particular magnitude)

$ head -n1 lda.worToTop.txt | worToTop_to_topics.py --normalise True --chop 0.9
3   3 198  0.852802865422442
3   3 290  0.426401432711221
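One reading of the chop that is consistent with the example output, and this is an assumption rather than something taken from the script's source, is: sort topics by weight, largest first, and keep adding topics until the magnitude (L2 norm) of the kept weights reaches the threshold. In the example, topic 198 alone has magnitude 0.853 < 0.9, and adding topic 290 brings it to about 0.953, so the remaining topics are dropped. The hypothetical helper below sketches that interpretation:

```python
import math


def chop(weights, threshold):
    """Keep the largest weights until their L2 magnitude reaches threshold.

    Assumed interpretation of --chop; weights is a {topic_id: weight} dict,
    typically already L2-normalised.
    """
    kept = {}
    sum_sq = 0.0
    for topic, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if math.sqrt(sum_sq) >= threshold:
            break  # kept weights already reach the requested magnitude
        kept[topic] = w
        sum_sq += w * w
    return kept


weights = {
    60: 0.213200716355610,
    198: 0.852802865422442,
    290: 0.426401432711221,
    413: 0.213200716355610,
}
print(sorted(chop(weights, 0.9)))
```

A plain cumulative sum of the weights, stopping once it reaches the threshold, would give the same two topics on this example, so the sample output alone does not distinguish the two readings.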
