# MALLET by Example

Normally MALLET is run on the Unix command line. I am running MALLET in an IPython Notebook so I can *share and explain my work.*

## The Data

To begin, we need to have some text for MALLET to chew on. I've taken the easy way out and already prepared a set of documents for modeling. These are in the `data/` folder in this repository.

There are 191 text files in that directory, and each one contains the text content of a blog post selected by the editors of [DHNOW](http://digitalhumanitiesnow.org/). Unfortunately, I've not included the metadata about which text document belongs to which post. Oh well. ¯\\_(ツ)_/¯

What I can say is that I used the [Diffbot](http://diffbot.com) article extraction API to pull out just the text content from the blog posts (so no menu, about-me boilerplate, or comments). It is important to know from whence the data comes!!!!

In [54]:
%%bash
ls -lh data/ # This command lists the text files in the data directory

total 4648
-rw-r--r--@ 1 mcburton  staff    16K Aug 12  2012 1.txt
-rw-r--r--@ 1 mcburton  staff   7.5K Aug 12  2012 10.txt
-rw-r--r--@ 1 mcburton  staff   7.8K Aug 12  2012 100.txt
-rw-r--r--@ 1 mcburton  staff   9.2K Aug 12  2012 101.txt
-rw-r--r--@ 1 mcburton  staff    34K Aug 12  2012 102.txt
-rw-r--r--@ 1 mcburton  staff   5.5K Aug 12  2012 103.txt
-rw-r--r--@ 1 mcburton  staff    15K Aug 12  2012 104.txt
-rw-r--r--@ 1 mcburton  staff    33K Aug 12  2012 105.txt
-rw-r--r--@ 1 mcburton  staff    36K Aug 12  2012 106.txt
-rw-r--r--@ 1 mcburton  staff    34K Aug 12  2012 107.txt
-rw-r--r--@ 1 mcburton  staff   179B Aug 12  2012 108.txt
-rw-r--r--@ 1 mcburton  staff     0B Aug 12  2012 109.txt
-rw-r--r--@ 1 mcburton  staff   173K Aug 12  2012 11.txt
-rw-r--r--@ 1 mcburton  staff   5.4K Aug 12  2012 110.txt
-rw-r--r--@ 1 mcburton  staff   4.7K Aug 12  2012 111.txt
-rw-r--r--@ 1 mcburton  staff    18K Aug 12  2012 112.txt
-rw-r--r--@ 1 mcburton  staff   2.6K Aug 12  2012 113.txt
-rw-r--

Let's take a look at what is inside one of these blog posts.

In [55]:
%%bash
cat data/55.txt

Barbara Taranto
 Digital Program Director
 New York Public Library
The New York Public Library recently launched its first foray into crowd sourcing metadata by exposing 40,000 image pages of turn of the century restaurant and cruise ship menus: “What’s On the Menu?” The goal of the project was to widely distribute the transcription of the menu items into a structured and reusable form. The site was exceedingly popular in its first few months.
Recent activity has flattened somewhat, raising issues regarding the public’s appetite for these projects. More importantly, the menus project raised hard questions about the quality of the crowd sourced content, the longevity of the data, and the disposition of the data (e.g. What is it? Is it good enough for our purposes? Should we keep it? If yes, where does it belong?).
This presentation will discuss these issues and propose some alternative views on metadata, user-generated content, and the intersection of the two.
http://menus.nypl.org/

## Downloading MALLET

[MALLET](http://mallet.cs.umass.edu/index.php) isn't crazy complicated infrastructure; it is just a Java application, originally written by Andrew McCallum, with the topic modeling tools largely the work of [David Mimno](https://twitter.com/dmimno). So you have to have Java installed, but otherwise there are no technical dependencies.

However, you need to understand the command line in order to use it (and you need to understand what it does). MALLET is not a "plug and play" tool, it has no graphical user interface, and, while being pretty well documented, it doesn't hold your hand.

The commands below automatically download and unzip version 2.0.7 of MALLET. You can also [download](http://mallet.cs.umass.edu/download.php) MALLET separately and unzip it yourself. The rest of this tutorial expects the MALLET application to be in the `mallet-2.0.7/` directory. 

In [None]:
%%bash
curl http://mallet.cs.umass.edu/dist/mallet-2.0.7.zip > mallet.zip # download the zip archive
unzip mallet.zip #unzip the archive into the current directory

Let's test to make sure MALLET is working correctly.

In [None]:
%%bash
mallet-2.0.7/bin/mallet # run the mallet executable

We just ran the MALLET program, but we didn't give it any commands so it complained with `Unrecognized command` and then spit out the list of commands it expects. That means it is working! Now we can actually do something!

## Running MALLET

Now that we have installed MALLET and we have some data, we can start doing some topic modeling. 😊

Topic modeling with MALLET involves two steps: 
1. importing your data
2. training the model

Importing the data involves re-shaping the narrative text into a bag-of-words. It also performs some cleaning such as removing stopwords. 
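
To make "bag-of-words" concrete, here is a minimal Python sketch of the idea. It is not MALLET's importer: the tokenizing rule and the tiny `STOPWORDS` set are my own assumptions for illustration, and MALLET's real tokenizer and stopword list differ.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list for illustration only;
# MALLET ships its own, much longer English list.
STOPWORDS = {"the", "of", "a", "and", "to", "in", "is", "it", "its", "by", "was", "into"}

def bag_of_words(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

sample = "The goal of the project was to distribute the transcription of the menu items."
bag = bag_of_words(sample)
print(bag.most_common(3))
```

Word order and sentence structure are thrown away; all that is left for the model is which words appear in a document and how often.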

Training the model is the "topic modeling" part of the exercise that performs the algorithmic magic to tell you "what it all means" (not really tho...). 

### Importing data

There is one final bit of data fitness that needs to happen for us to be able to topic model these blog posts. We need to "import" the text files using the MALLET application and create a .mallet file. Here we will import all the text files in the `data/` directory using the `import-dir` command. 

In [56]:
%%bash
mallet-2.0.7/bin/mallet import-dir --input data \
--output documents.mallet \
--keep-sequence \
--remove-stopwords

Labels = 
   data


In [57]:
%%bash
ls -lh

total 33032
-rw-r--r--@   1 mcburton  staff   292K Jul  1 09:46 Data work.png
-rw-r--r--@   1 mcburton  staff   158K Jul  1 09:47 Data-science-workflow.png
-rw-r--r--@   1 mcburton  staff   288K Jul  1 09:47 Real workflow.png
drwxr-xr-x@ 194 mcburton  staff   6.4K Jun 30 15:35 data
-rw-r--r--@   1 mcburton  staff   1.1M Apr  4  2014 dh-blog-map-small.jpeg
-rw-r--r--@   1 mcburton  staff   815K Jul  1 11:17 documents.mallet
-rw-r--r--@   1 mcburton  staff    96K Jul  1 09:48 estimating.png
-rw-r--r--@   1 mcburton  staff    16K Jul  1 09:59 intro-to-topic-modeling.ipynb
-rw-r--r--@   1 mcburton  staff   232K Jul  1 09:57 intro-to-topic-modeling.slides.html
-rw-r--r--@   1 mcburton  staff   1.4K Jun 30 17:07 keywords.txt
drwx------@  15 mcburton  staff   510B Jul  1 11:12 mallet-2.0.7
-rw-r--r--@   1 mcburton  staff    13M Jun 30 16:49 mallet.zip
-rw-r--r--    1 mcburton  staff    22K Jul  1 11:17 running-MALLET.ipynb
-rw-r--r--@   1 mcburton  staff   171K May  9  2013 word-distribution2

This created a new file, `documents.mallet`, in the current directory. This is the file that we will feed back to the MALLET application to "train" a topic model.

### Training the model

Now that we have the data imported and *in shape* we can finally do the topic modeling part of topic modeling. The command below trains a model on the text with 100 topics. You can tweak that number by changing the `--num-topics` parameter. Additionally, we are writing the top keywords for each topic to a file (rather than just spitting them out on the command line) with the `--output-topic-keys` parameter.

In [60]:

!mallet-2.0.7/bin/mallet train-topics \
--input documents.mallet \
--num-topics 100 \
--output-topic-keys keywords.txt

Data loaded.
Coded LDA: 100 topics, 7 topic bits, 1111111 topic mask
max tokens: 16205
total tokens: 150526
<10> LL/token: -9.684
<20> LL/token: -9.39098
<30> LL/token: -9.27823
<40> LL/token: -9.21717

0	0.5	report link version umiacs conference april interaction march druin interfaces vol shneiderman technology interactive november october september journal digital 
1	0.5	twitter play online blog teaching ancient archaeology found department professor received interested culture written social roman director associate real 
2	0.5	http exhibit university google news org visitors research items works algorithms mla underwood illicit systems output journal site searches 
3	0.5	studies disciplines title logic recent practices disciplinary head literacy based implicit build st complex essay discipline institutions slow science 
4	0.5	work project scholarly criticism good critical simply don talk discourse publishing kinds ways kind makes haven explicit shape move 
5	0.5	society culture ba
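
If you want to compare several values of `--num-topics`, one option is to build the command in Python and loop over the topic counts. This is only a sketch: the helper name `train_topics_command` and the `keywords-<n>.txt` output naming are my own inventions, and it uses only the flags from the command above.

```python
def train_topics_command(num_topics, input_file="documents.mallet",
                         mallet_bin="mallet-2.0.7/bin/mallet"):
    """Build the argument list for one `train-topics` run."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        # Hypothetical naming scheme so each run writes its own keywords file.
        "--output-topic-keys", f"keywords-{num_topics}.txt",
    ]

for k in (10, 50, 100):
    # Print the command; to actually run it, pass the list to
    # subprocess.run(..., check=True) from the standard library.
    print(" ".join(train_topics_command(k)))
```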

Congrats, you just modeled some topics. Let's take a look at the `keywords.txt` file and see if we can get any insights about the digital humanities.

In [61]:
cat keywords.txt

0	0.5	version information human computer april interaction design conference march november study vol january ed july journal september evaluation press 
1	0.5	university department found culture media professor form studies state learning focus director social program knowledge written associate american teaching 
2	0.5	literature exhibit electronic works jan http creative mla visitors report net lori kathi station archive practice organization scholars impact 
3	0.5	dh disciplines term field disciplinary don mla academia debates discipline suggest methodologies colleagues fields traditional evaluating introduction matter literary 
4	0.5	work scholarship scholarly review projects criticism peer critical publishing community discourse traditional terms design produced critique good simply methodology 
5	0.5	bay salt francisco san company public land california ponds americans gold london people leslie citizens local story marshes miners 
6	0.5	big words agenda london bailey archi
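
Each line of the keywords file has three parts: a topic number, a weight (as far as I understand, the topic's Dirichlet parameter), and the top words for that topic. Assuming the columns are tab-separated, as they appear to be above, a small Python sketch for reading them might look like this. The function name `parse_topic_keys` is mine, and the sample lines are shortened copies of the output above.

```python
def parse_topic_keys(text):
    """Parse lines of the form: id<TAB>weight<TAB>space-separated words."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics

# Two shortened lines copied from the output above.
sample = (
    "0\t0.5\tversion information human computer april interaction design \n"
    "3\t0.5\tdh disciplines term field disciplinary don mla academia \n"
)
topics = parse_topic_keys(sample)
for topic_id, (weight, words) in topics.items():
    print(topic_id, weight, " ".join(words[:5]))

# With the real file: parse_topic_keys(open("keywords.txt").read())
```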

There are other kinds of output you can get from MALLET. Besides the keywords, the other data most people use in their analysis are the document-topic mixtures. These data indicate the expression of each topic within a document. You can use this information to "cluster" documents based upon the proportion of a topic in the set of documents. For an example, take a look at [a particular topic in the dfr-browser](http://agoldst.github.io/dfr-browser/demo/#/topic/4) or in [Mining the Dispatch](http://dsl.richmond.edu/dispatch/Topics/view/15).
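
MALLET can write these mixtures to a file with the `--output-doc-topics` option of `train-topics`. The layout of that file differs between MALLET versions, so I won't parse it here. Instead, here is a rough Python sketch of the simplest kind of "clustering": grouping documents by their dominant topic. The proportions below are made up for illustration and are not real MALLET output.

```python
from collections import defaultdict

# Hypothetical document-topic proportions (each row sums to 1);
# these numbers are invented, not real MALLET output.
doc_topics = {
    "1.txt": [0.70, 0.20, 0.10],
    "10.txt": [0.05, 0.15, 0.80],
    "55.txt": [0.60, 0.30, 0.10],
}

def group_by_dominant_topic(doc_topics):
    """Map each topic index to the documents in which it has the largest proportion."""
    groups = defaultdict(list)
    for name, proportions in doc_topics.items():
        dominant = max(range(len(proportions)), key=proportions.__getitem__)
        groups[dominant].append(name)
    return dict(groups)

print(group_by_dominant_topic(doc_topics))
```

Picking only the single largest topic throws away most of the mixture, so treat this as a first look rather than a real clustering method.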