# Topic Modeling Using MALLET

For this exercise, we will perform text analysis on the data contained in the "homework" database. This notebook will walk you through topic modeling NIH abstracts using [MALLET](http://mallet.cs.umass.edu/topics.php).

Before beginning, follow the instructions in the Topic Modeling Word document to learn how to download MALLET and how to find the path to your MALLET batch file.

## Table of Contents

- [Initialization](#Initialization)

    - [Imports](#Imports)
    - [The `terminal` function](#The-terminal-function)

- [Getting Data](#Getting-Data)

    - [Exercise 1](#Exercise-1)

- [Generating Topics](#Generating-Topics)

    - [Exercise 2](#Exercise-2)
    - [Exercise 3](#Exercise-3)
    - [Exercise 4 - Identifying Topics from Word Clusters](#Exercise-4---Identifying-Topics-from-Word-Clusters)

- [Inferring Topics - Extra Credit](#Inferring-Topics---Extra-Credit)

    - [Exercise 5 - Extra Credit](#Exercise-5---Extra-Credit)

- [Resources for Topic Modeling](#Resources-for-Topic-Modeling)

## Initialization

* Back to the [Table of Contents](#Table-of-Contents)

Before we begin, we'll need to run the following code cells.  The first cell will import the Python libraries we'll be using. The second will define a function, `terminal()`, that we'll use to run commands.

### Imports

* Back to the [Table of Contents](#Table-of-Contents)

Please run the following code cells before proceeding.  The first takes care of importing packages we'll use and pre-loading some data that the Natural Language Toolkit (NLTK) needs for processing:

In [None]:
# Importing the modules we will use in this workbook
from subprocess import Popen, PIPE
import os
import pymysql
import string
import nltk
import re
from nltk.corpus import stopwords

# download some nltk resources
nltk.download( "punkt" )
nltk.download( "stopwords" )

### The `terminal` function

* Back to the [Table of Contents](#Table-of-Contents)

Python can send commands to the operating system's command shell, allowing you to run operating system level commands without having to leave an iPython notebook.  For your convenience, a terminal() function is defined below. The terminal() function accepts a terminal command, runs that command, then outputs the results to a text file.

The terminal() function’s signature is:

    terminal( commandTokenList, outputFile = "temp.txt" )

WHERE:

- **_commandTokenList_** - is the full operating system command line command you want to run broken into a list of string tokens on each space within the command (spaces not included in the list).  Examples:

    - command: `mkdir temp`
    
        - commandTokenList: [ 'mkdir', 'temp' ]
        
    - command: `ls -al | grep temp`
    
        - commandTokenList: [ 'ls', '-al', '|', 'grep', 'temp' ]
        
    - if there is a backslash ( "\" ) in the middle of a command to show that it continues on the next line, you do not need to include that backslash as a token in the list.
    - there are multiple ways to generate this list of tokens:
    
        1. Make a static list (as seen above):
        
                command_token_list = [ 'mkdir', 'temp' ]

        2. Place your command in a string variable, then use the string's `split()` method to break the string into a list on spaces:
        
                my_command = "ls -al | grep temp"
                command_token_list = my_command.split()
                
        3. Build up list of tokens one by one:
        
                command_token_list = []
                command_token_list.append( "ls" )
                command_token_list.append( "-al" )
                command_token_list.append( "|" )
                command_token_list.append( "grep" )
                command_token_list.append( "temp" )

- **_outputFile_** - is an optional parameter that tells the `terminal()` function where you want to store output from the command.  If you do not specify a file in the call to `terminal()`, it will write to a file named "`temp.txt`" in the current directory.
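One caveat with option 2 above: `split()` breaks on every space, including spaces inside a quoted argument such as a directory name. The following is a minimal sketch of the difference, using Python's standard `shlex` module. The command string and the directory name `my data/temp` are hypothetical, and `shlex.split()` follows POSIX shell quoting rules, so Windows paths that contain backslashes would need different handling.

```python
import shlex

# Hypothetical command whose --input value contains a space.
my_command = 'mallet import-dir --input "my data/temp" --output data.mallet'

# Plain split() cuts the quoted directory name into two tokens.
naive_tokens = my_command.split()

# shlex.split() honors the quotes and keeps the directory name whole.
quoted_tokens = shlex.split(my_command)

print(naive_tokens)
print(quoted_tokens)
```

If none of your paths contain spaces, plain `split()` and a hand-written list give the same tokens.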

**Note:** The output of this function attempts to capture whether an error occurred based on where the command sent its output (either the standard or error output streams).  Terminal commands deal with output in many different ways, however, including sometimes sending error output to the standard output stream, and sometimes sending non-error logging to the error stream.  If there is output, the `terminal()` function will tell you to look at the output file, then will add tags at the end of the message to tell you where the output came from:

- "`[err]`" = error stream
- "`[standard]`" = standard stream. 

In [None]:
# and, defining our terminal() function, writes the output to temp.txt,
#    unless you change the filename.
def terminal( commandTokenList, outputFile = "temp.txt" ):
    
    # return reference
    status_OUT = ""

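    # NOTE: with shell=True, passing a list of tokens works as intended on
    #    Windows; on POSIX systems only the first token is run as the command.
    # The "with" block makes sure the output file is closed when we are done.
    with open(outputFile, 'w') as fwrite:
        pipe = Popen( commandTokenList, stdout = PIPE, stderr = PIPE, shell=True )
        text, err = pipe.communicate()
        text = text.decode('ascii', 'ignore')
        err = err.decode('ascii', 'ignore')
        if len(text) > 2:
            fwrite.write(text)
        elif len(err) > 2:
            fwrite.write(err)
        else:
            print("No output returned")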
    
    # status
    status_OUT = "terminal() call complete"

    # got any output at all?
    if ( ( ( err is not None ) and ( err != "" ) ) or ( ( text is not None ) and ( text != "" ) ) ):

        # yes.  Add a note to look at output file.
        status_OUT += " - see " + outputFile + " for more details"
    
        # error output?
        if ( ( err is not None ) and ( err != "" ) ):
    
            # append a flag
            status_OUT += " [err]"
    
        #-- END check for error messages. --#
    
        # standard output?
        if ( ( text is not None ) and ( text != "" ) ):
            
            status_OUT += " [standard]"
            
        #-- END check for standard output. --#
    
    #-- END check to see if any output at all. --#
    
    return status_OUT
    
#-- END function terminal() --#
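The note above about output streams can be seen in a small, self-contained sketch. It is not part of the exercise: it uses `subprocess.run()` rather than `terminal()`, and it runs a throwaway Python one-liner (an assumption chosen so the example works on any operating system) that writes to both streams while still exiting successfully.

```python
import subprocess
import sys

# A throwaway child program that writes one line to each output stream.
child_code = "import sys; print('to standard'); print('to error', file=sys.stderr)"

# Passing a token list without shell=True runs the program directly.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

print("standard stream:", result.stdout.strip())
print("error stream:", result.stderr.strip())
print("exit code:", result.returncode)
```

The exit code is 0 even though the error stream is not empty, which is why an "`[err]`" tag on its own does not prove that a command failed.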

## Getting Data

* Back to the [Table of Contents](#Table-of-Contents)

We will be using a sample of abstracts from NIH grants stored in the 'TextAnalysis' table in the 'homework' database to explore automated text analysis.

This table was created by downloading data from NIH ExPORTER (exporter.nih.gov). 

For text analysis, we'll be automatically deriving a list of topics based on these abstracts using MALLET, a Java based text analysis tool that makes topic modeling very easy.  MALLET is primarily a command line tool and requires a specific format for its data.  We'll use our `terminal()` function to run it, and we'll be creating appropriately formatted text files for each abstract as part of this exercise.  You can read more about importing data into MALLET [on the MALLET web site's "Importing Data" page.](http://mallet.cs.umass.edu/import.php)

Let us first create a temporary directory in our current working directory using the `terminal()` function.

The following `terminal()` call will invoke the "`mkdir temp`" command to make a temporary directory named "`temp`" in your current working directory. The `terminal()` function will store its output in the default file - "`temp.txt`".

In [None]:
command_token_list = ['mkdir', 'temp']

terminal( command_token_list )

Now let's get ready to retrieve the abstracts and the IDs of the grants with which they are associated. Each abstract will be stored in a file with the grant ID as the filename.

First, we'll need to create a database connection to the "`homework`" database:

In [None]:
# Create MySQL connection
user = "<username>"
password = "<password>"
database = "homework"

# invoke the connect() function, passing parameters in variables.
db = pymysql.connect( user = user, password = password, database = database )

# output basic database connection info.
print( db )

# create a database cursor.
cursor = db.cursor( pymysql.cursors.DictCursor )

Next, we'll set up a few functions that we'll use in this exercise.  The first is `writeFile()`, a function that takes in a filename and text, and creates a new file populated with the text.  Run the cell below to define `writeFile()`.

The writeFile() function’s signature is:

    writeFile( filename, data )

WHERE:

- **_filename_** - is the name of the file you want to output, optionally combined with the path where you want the file to be output (if you are not writing to the current directory).
- **_data_** - is the data you want to be stored in the file.

It doesn't return anything.  If there are problems writing the file, Python will throw exceptions.  

In [None]:
def writeFile(filename, data):

    with open(filename, "w") as f:
        
        f.write(str(data))
        
    #-- END with (automatically closes file) --#

We also wrote a function to do some initial cleaning of the abstract text - `cleanAbstract()`.

The `cleanAbstract()` function’s signature is:

    cleanAbstract( text )

WHERE:

- **_text_** - is the text of an abstract you want to clean.

This function:

- accepts an abstract's text
- removes words that would be very common in NIH abstracts, because we don't want them to bias the results
- removes stopwords (MALLET can also do that)
- removes punctuation
- returns the resulting cleaned string

Run the cell below to define `cleanAbstract()`.

In [None]:
def cleanAbstract(text):
    
    # common words to remove
    commonWords = ['study', 'project', 'experiment', 'abstract', 'description', 'studies', \
                  'abstracts', 'projects', 'experiments', 'descriptions']

    # replace line breaks, tabs, and form feeds with a space so that
    #    the words on either side are not fused together.
    text = re.sub('[\n\t\r\f]+', ' ', text)
    
    # convert to all lower case.
    text = text.lower()
    
    # break text up into tokens (words)
    tokens = nltk.word_tokenize(text)
    
    # retrieve list of stop words.
    stop = stopwords.words('english')

    # remove stop words from list of tokens.
    tokens = [t for t in tokens if t not in stop]

    # remove punctuation from tokens
    exclude = set(string.punctuation)
    tokenNew=[]
    for s in tokens:
        snew = ''.join(ch for ch in s if ch not in exclude)
        if snew!="":
            tokenNew.append(snew)

    # remove common words
    tokenNew = [t for t in tokenNew if t not in commonWords]

    # tie tokens back together.
    abstract  = ' '.join(t for t in tokenNew)

    return abstract

#-- END function cleanAbstract() --#
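To see what these cleaning steps do to a piece of text, here is a simplified sketch that follows the same order of operations without NLTK. The function name `clean_text_sketch`, the two small word sets, and the sample sentence are all hypothetical stand-ins; the real `cleanAbstract()` uses NLTK's tokenizer and its full English stop word list, so its output can differ.

```python
import re
import string

# Hypothetical, tiny stand-ins for NLTK's stop word list and the common-word list.
SAMPLE_STOP_WORDS = {"the", "of", "this", "is", "to", "and", "in"}
SAMPLE_COMMON_WORDS = {"study", "project"}

def clean_text_sketch(text):
    # Collapse all whitespace to single spaces and lower-case the text.
    text = re.sub(r"\s+", " ", text).lower()

    # Split on spaces (a simplification of nltk.word_tokenize).
    tokens = text.split()

    # Drop stop words.
    tokens = [t for t in tokens if t not in SAMPLE_STOP_WORDS]

    # Strip punctuation characters from each token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]

    # Drop empty tokens and common words, then rejoin.
    tokens = [t for t in tokens if t and t not in SAMPLE_COMMON_WORDS]
    return " ".join(tokens)

cleaned = clean_text_sketch("The goal of this study is to\nmap protein-folding pathways, in vivo.")
print(cleaned)
```

Note that stripping punctuation joins hyphenated words, so "protein-folding" becomes "proteinfolding".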

### Exercise 1

* Back to the [Table of Contents](#Table-of-Contents)

Retrieve the abstracts one by one from the database and write them to text files in the temp directory. For your convenience, the writeFile() function has already been created. You just need to call it with the path to the location you want the file to be stored and the contents of the abstract.

You'll be reading the grant application abstracts from the table "`TextAnalysis`" in the database "`homework`", to which we connected above.  The table `TextAnalysis` has two columns: "`APPLICATION_ID`", the ID of the application with which a given abstract was associated; and "`ABSTRACT_TEXT`", the full text of that application's abstract.

Create a query that SELECTs all of the records in TextAnalysis, and then for each row in the database, retrieve the application ID (column "`APPLICATION_ID`") and abstract (column "`ABSTRACT_TEXT`") from the row, clean the abstract using the "`cleanAbstract()`" function, then write the abstract out as the contents of a text file using the "`writeFile()`" function.

When writing files, write each to the `temp` directory we created inside the current directory (path is "./temp").  Set the name of the files to the application ID from their grant (stored in the "`APPLICATION_ID`" column in the `TextAnalysis` database table), followed by ".txt".

So, the combined path and name for a given file (passed to "`writeFile()`" in the "`filename`" parameter) should be:

    ./temp/<application_id>.txt
    
Again, remember to clean the text using `cleanAbstract()` before you write it out to a file.
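If you want to try the row-by-row pattern without touching the course database, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL. This is an assumption made purely for illustration: the in-memory table, its three sample rows, and the temporary output directory are hypothetical, and the exercise itself should use the `pymysql` cursor created earlier.

```python
import os
import sqlite3
import tempfile

# In-memory stand-in for the TextAnalysis table (hypothetical sample rows).
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # lets us read columns by name
connection.execute(
    "CREATE TABLE TextAnalysis (APPLICATION_ID INTEGER, ABSTRACT_TEXT TEXT)"
)
connection.executemany(
    "INSERT INTO TextAnalysis VALUES (?, ?)",
    [(101, "first sample abstract"), (102, None), (103, "second sample abstract")],
)

query = (
    "SELECT APPLICATION_ID, ABSTRACT_TEXT FROM TextAnalysis "
    "WHERE ABSTRACT_TEXT IS NOT NULL ORDER BY APPLICATION_ID"
)

with tempfile.TemporaryDirectory() as output_dir:
    # One file per row, named <application_id>.txt.
    for row in connection.execute(query):
        filename = os.path.join(output_dir, str(row["APPLICATION_ID"]) + ".txt")
        with open(filename, "w") as f:
            f.write(row["ABSTRACT_TEXT"])

    written_files = sorted(os.listdir(output_dir))

connection.close()
print(written_files)
```

The row with a NULL abstract is skipped by the query, so no file is written for it.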

In [None]:
# First create the query that you need to get the abstracts, excluding null abstracts
query = 'SELECT * FROM homework.TextAnalysis WHERE ABSTRACT_TEXT IS NOT NULL ORDER BY APPLICATION_ID LIMIT 1000;'

#Execute the query
cursor.execute(query)

### BEGIN SOLUTION
#Fetch the results one by one and write them to a file
row = cursor.fetchone()
while (row is not None):
    ID = row['APPLICATION_ID']
    abstract = row['ABSTRACT_TEXT']
    abstract = cleanAbstract(abstract)
    filename = './temp/' + str(ID) + ".txt"
    writeFile(filename, abstract)
    row = cursor.fetchone()
### END SOLUTION

# clean up
cursor.close()
db.close()

In [None]:
# Test to see if file was successfully written
f = open('./temp/2887634.txt', 'r')

## Generating Topics

* Back to the [Table of Contents](#Table-of-Contents)

We have now created a number of .txt files in the temp directory, each of which contains a single abstract.  We will be using the set of these abstracts together as a corpus of data for machine learning.

Our next task is to transform these individual files into a single file in MALLET format. To achieve this, we will use MALLET's import command. The import command can read in an entire directory, turn it into a MALLET file, and also strip out common English stopwords. Our command will look something like this:

    /bin/mallet/bin/mallet import-dir --input path/to/temp/directory --output data.mallet --keep-sequence --remove-stopwords

Let's break down this command into each of the separate tokens it contains (where tokens are words separated by spaces):

- **`/bin/mallet/bin/mallet`** ==> is the path to the MALLET program
- **`import-dir`** ==> the first argument to the program mallet specifies what command the program is being asked to do.  The `import-dir` command tells MALLET to import an entire directory of files into a MALLET data file.
- **`--input`** ==> "--" are used in MALLET to signify parameter names, usually followed by a space and a parameter value.  `--input` is a parameter used to tell MALLET the directory in which the corpus of data is located.
- **`path/to/temp/directory`** ==> Path to the directory that contains the corpus of data (the value for the parameter `--input`).
- **`--output`** ==> tells MALLET where to store the output
- **`data.mallet`** ==> name of file we'll store the MALLET data in (the value for the parameter `--output`).
- **`--keep-sequence`** ==> parameter that tells MALLET to store each document as a sequence of words in their original order, rather than as a vector of word counts.  MALLET's topic modeling needs the data in this form.  This is an example of a parameter that doesn't have an associated value.
- **`--remove-stopwords`** ==> parameter that tells MALLET to remove common English stopwords like "a", "an", and "the".  Another parameter with no subsequent value.
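The breakdown above maps directly onto a Python list of tokens. As a rough sketch, the snippet below assembles such a list from its parts, keeping parameters that take a value separate from those that do not. The variable names are hypothetical, and the MALLET path is the placeholder from the example command, not necessarily the path on your machine.

```python
# Placeholder path from the example command above; yours will differ.
mallet_program = "/bin/mallet/bin/mallet"
mallet_command = "import-dir"

# Parameters that are followed by a value.
valued_parameters = {
    "--input": "path/to/temp/directory",
    "--output": "data.mallet",
}

# Parameters that stand alone, with no value after them.
flag_parameters = ["--keep-sequence", "--remove-stopwords"]

# Program first, then the command, then each parameter and its value.
command_token_list = [mallet_program, mallet_command]
for name, value in valued_parameters.items():
    command_token_list.extend([name, value])
command_token_list.extend(flag_parameters)

print(command_token_list)
```

Each token ends up as its own list item, with no spaces inside any item.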

If you want help with the options available for a given mallet command, you can ask mallet for those options.  To do this, at the command line, run mallet with the command whose options you want to see, followed by "`--help`".  So, for example, to see the options for the "`import-dir`" command, you'd run:

    /bin/mallet/bin/mallet import-dir --help
    
This will output a list of options, what each should contain and what the default is should that option not be specified.  Example output:

    jmorgan@ip-172-31-36-239:~$ /bin/mallet/bin/mallet import-dir --help
    A tool for creating instance lists of FeatureVectors or FeatureSequences from text documents.

    --help TRUE|FALSE
      Print this command line option usage information.  Give argument of TRUE for longer documentation
      Default is false
    --prefix-code 'JAVA CODE'
      Java code you want run before any other interpreted code.  Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
      Default is null
    --config FILE
      Read command option values from a file
      Default is null
    --input DIR...
      The directories containing text files to be classified, one directory per class
      Default is (null)
    --output FILE
      Write the instance list to this file; Using - indicates stdout.
      Default is text.vectors
    ...
    
If an option is not present in the list, it isn’t supported for the specified command.

### Setting the Mallet Path Variable 

If you followed the instructions in the Word document, then you have successfully installed MALLET on your computer and identified the batch file for your MALLET program. Let's save the location of the MALLET batch file to a variable for easy access.

We can test the batch file by calling the help option on a MALLET command via our `terminal()` function.

In [None]:
# Write the path to your mallet batch file to the PATH_TO_MALLET variable below
PATH_TO_MALLET = "C:\\mallet\\bin\\mallet"


# Test our PATH_TO_MALLET variable with the Mallet command import-dir and the --help option 
# If successful, this command will produce the help documentation for the command import-dir
args = None

# Create list of command words in "args".
### BEGIN SOLUTION
args  = [PATH_TO_MALLET, 'import-dir', '--help']
### END SOLUTION

# run terminal() on args and print out results.
print( terminal( args ) )



### Exercise 2

* Back to the [Table of Contents](#Table-of-Contents)

Now use the `terminal()` function to run the MALLET `import-dir` command on your "`./temp`" directory.  Remember, the `terminal()` function accepts a list of arguments, with the command to be run the first item in the list, and then subsequent details of the command after, with each space-delimited part of the command an item in this list.  Given the above breakdown of the `import-dir` command, break that command into a list of arguments and invoke the command using `terminal()`, reading from your "`./temp`" directory and outputting the resulting MALLET data file to "`data.mallet`".

- For help with available options for the mallet "`import-dir`" command, run the following at the command line:

        /bin/mallet/bin/mallet import-dir --help

In [None]:
# store argument list in args[]
args = None

# Create list of command words in "args".
### BEGIN SOLUTION
args  = [PATH_TO_MALLET, 'import-dir', '--input',  './/temp//' , '--output', \
         'data.mallet', '--keep-sequence', '--remove-stopwords']
### END SOLUTION

# run terminal() on args and print out results.
print( terminal( args ) )

In [None]:
# Test to see if file data.mallet was successfully written
f = open('./data.mallet', 'r')

If you go to your working directory now, you should find a file named "`data.mallet`".  This is the MALLET data file that we will use as input when we ask MALLET to generate topics based on a corpus of text.

We will use the `train-topics` command in MALLET to generate our very own topic models.

In the following example, we execute this command using its default settings:

In [None]:
args = [PATH_TO_MALLET, 'train-topics', '--input', 'data.mallet']
print( terminal( args, "mallet-train-default.txt" ) )

This command opens `data.mallet` and runs MALLET's topic modeling algorithm on its corpus of documents using the default settings.  The algorithm passes over the corpus repeatedly, printing progress as it goes and refining the model on each pass, so that the trained model associates topics with the words used in the texts.

The output of this command is captured by the `terminal()` command and written to a file whose name is in the second argument to the `terminal()` function (if you don't specify a name, it writes to `temp.txt`).  In the example above, we write the output to the file "`mallet-train-default.txt`".

You can look at this output to get an idea of how MALLET works.  By default, MALLET prints out the keywords that make up the top 10 topics it detects, on every 50th iteration. A good high-level way to judge whether the algorithm has converged is to look at this output. Each time it outputs topics, for each of the ten topics, MALLET outputs the topic ID, then a list of the keywords it associates with that topic.  If the keywords for each topic don't change much between iterations, the model has converged.
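One informal way to put a number on "don't change much" is to measure how many keywords a topic shares between two printouts. The sketch below computes that overlap for a single topic. The function name `keyword_overlap` and both keyword lists are hypothetical, and this is only a rough eyeballing aid, not a convergence test that MALLET itself performs.

```python
def keyword_overlap(earlier_keywords, later_keywords):
    """Share of keywords common to both lists (1.0 means identical sets)."""
    earlier = set(earlier_keywords)
    later = set(later_keywords)
    combined = earlier | later
    if not combined:
        return 1.0
    return len(earlier & later) / len(combined)

# Hypothetical keywords for the same topic at two different printouts.
keywords_at_earlier_printout = "health care cancer data risk".split()
keywords_at_later_printout = "health care cancer patients risk".split()

overlap = keyword_overlap(keywords_at_earlier_printout, keywords_at_later_printout)
print(round(overlap, 3))
```

An overlap that stays close to 1.0 across several printouts suggests the topic's keywords have settled.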

You can read more about the different options that can be used to fine tune the results [on the MALLET web site's "Topic Modeling" page.](http://mallet.cs.umass.edu/topics.php)

### Exercise 3

* Back to the [Table of Contents](#Table-of-Contents)

In the above example, we ran the base topic modeling algorithm but we didn't formally save the output anywhere.  If you look at the documentation pointed to above, it gives you different options to store the output.  Using this documentation as a guide, modify the args for the MALLET "`train-topics`" command so that it outputs:

- topic keys, stored in file "`topicKeys.txt`"
- topic composition of documents, stored in file "`docTopics.txt`"
- a serialized MALLET topic trainer object, stored in file "`model.mallet`"

Also add the option to:

- enable hyperparameter optimization
- increase the number of sampling iterations to 20,000
- increase the number of topics to 20

For help with available options for the mallet "`train-topics`" command, run the following at the command line:

    /bin/mallet/bin/mallet train-topics --help

**_NOTE: the topic modeling in this code cell could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), it should still be running.  Give it some time._**

In [None]:
# Modify the MALLET command to output topic keys, topic composition of documents, and a serialized MALLET topic trainer object.
# Add the option to enable hyperparameter optimization, increase the number of sampling iterations to 20,000,
# and increase the number of topics to 20.

# store argument list in args[]
args = None

# Create list of command words in "args".
### BEGIN SOLUTION
args = [ PATH_TO_MALLET, 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10', \
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20', \
       '--num-iterations', '20000']
### END SOLUTION

# run terminal() on args and print out results.
print( terminal( args, "mallet-train.txt" ) )

In [None]:
# Test to see if file docTopics.txt was successfully written
f = open('./docTopics.txt', 'r')

Now let's look at some results. In addition to the output log file "`mallet-train.txt`", your execution of MALLET should have resulted in two output files:

- `topicKeys.txt` - a list of the topics detected in the abstracts, along with their weights and the words associated with each.
- `docTopics.txt` - for each file in the corpus, lists each of the detected topics and a relevance score that indicates how likely it is that a given abstract relates to that topic.

In `topicKeys.txt`, each topic gets a tab-delimited line in the file.  In a given topic's line, the first number is the numeric identifier of the topic (0, 1, 2, etc.), the second number gives an indication of the weight of that topic, and then after a tab, the line is completed with a list of the keywords associated with that topic.  An example (topic 14, weight 0.05863):

    14	0.05863	health care cancer data risk patients individuals outcomes genetic disease women testing aging older unreadable lung effects participants intervention community 
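Because the line is tab-delimited, it is easy to pull apart in Python. Here is a minimal sketch that parses a shortened copy of the example line; the variable names are hypothetical, and the keyword list is cut to the first six words to keep the sample short.

```python
# Shortened copy of the example line: topic ID, weight, then keywords.
sample_line = "14\t0.05863\thealth care cancer data risk patients \n"

# Split into at most three fields on the first two tabs.
topic_id_text, weight_text, keyword_text = sample_line.rstrip("\n").split("\t", 2)

topic_id = int(topic_id_text)
topic_weight = float(weight_text)
topic_keywords = keyword_text.split()  # keywords are separated by spaces

print(topic_id, topic_weight, topic_keywords)
```

Looping over every line of `topicKeys.txt` in the same way would give you one (ID, weight, keywords) triple per topic.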

In `docTopics.txt`, each abstract has one line in the file.  The line starts with a document index and the abstract's file path in the original directory, then lists the topics associated with that abstract and a relevance score for each topic, in order of decreasing relevance.  The relevance score runs from 0 to 1, where 0 is not at all relevant and 1 is perfectly related.  Example:

    1	file:/home/jmorgan/nbgrader/courses/2015-fall-big_data/source/05.%20Text%20Analysis/./temp/6287560.txt	14	0.7923786992036733	16	0.10258021829417482	11	0.0414781353290313	12	0.04106461349323023	17	0.020602274154395795	1	3.1281525697325206E-4	7	2.5520223298362594E-4	15	2.0605374283754973E-4	2	1.4476870589053855E-4	18	1.3480683006926762E-4	6	1.216816985277047E-4	10	1.0585995154023939E-4	9	1.0155273496603757E-4	5	9.406085778593105E-5	3	8.487431366553085E-5	19	7.761498905199808E-5	0	7.675511604129449E-5	13	7.071872110668556E-5	4	5.972207592082584E-5	8	4.957229813413215E-5

In this example, the abstract with ID 6287560 (found in the file path) is most highly related to topic 14 (our example above) with 0.792... relevance score.  In aggregate, this output could help you to find connections between documents based on these detected topics that you might not have otherwise noticed.
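A line in this layout can be parsed in much the same way. The sketch below assumes the layout shown in the example (document index, file path, then alternating topic ID and score); other MALLET versions may write this file differently, so check your own `docTopics.txt` before relying on it. The sample line keeps only the first three topic pairs, and its file path is a shortened, hypothetical stand-in for the full path.

```python
import os

# Shortened sample line: document index, file path, then (topic, score) pairs.
sample_line = (
    "1\tfile:/path/to/temp/6287560.txt"
    "\t14\t0.7923786992036733"
    "\t16\t0.10258021829417482"
    "\t11\t0.0414781353290313"
)

fields = sample_line.rstrip("\n").split("\t")
document_index = int(fields[0])

# The application ID is the file name without its ".txt" extension.
application_id = os.path.splitext(os.path.basename(fields[1]))[0]

# The remaining fields alternate between topic ID and relevance score.
pair_fields = fields[2:]
topic_scores = [
    (int(pair_fields[i]), float(pair_fields[i + 1]))
    for i in range(0, len(pair_fields) - 1, 2)
]

# Pick the topic with the highest score.
top_topic, top_score = max(topic_scores, key=lambda pair: pair[1])
print(application_id, top_topic, top_score)
```

Collecting the top topic for every abstract this way is one route to grouping documents that share a dominant topic.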

### Exercise 4 - Identifying Topics from Word Clusters

* Back to the [Table of Contents](#Table-of-Contents)

The topics our topic model generated based on the grant abstracts are clusters of words that it thinks are related.  Before we try to apply this model to an abstract not included in training, however, we should first look more closely at these clusters of words.

Just because a set of words are consistently found together across a corpus of training documents doesn't mean that the topic or category represented by those words is meaningful or useful.  In order for a topic or set of topics to be useful, one must understand what underlying concept each topic represents.

There are many formal ways to take the traits of something and figure out which underlying category a set of those traits represents (see, for example, [Grounded Theory](https://en.wikipedia.org/wiki/Grounded_theory)).

Even informally, however, it is a good exercise to look at the sets of traits created by an algorithm like topic modeling and see if you can make sense of the topics it finds (and that is usually a step in any formal process, as well - "face validity" - do the topics make sense?).

Based on the topics detected and output to `topicKeys.txt`, in the space below, list the topic ID, topic name, and brief description of any of the topics whose words suggest a substantive underlying concept or trait.  If none stand out, look for at least one or two that might if you removed a few words, or explain why you think none are meaningful.

## Inferring Topics - Extra Credit

* Back to the [Table of Contents](#Table-of-Contents)

You can use your newly trained model to infer topics for unseen documents. Since we got the first 1000 abstracts to train the model, let us use the model to infer topics on the 1001st abstract. We've already retrieved the text for it and placed it in the code cell that follows.  Run the code cell below to assign it to the variable "`unseenAbstract`", then create a file out of the abstract in a new work directory, named "`./infer`".

In [None]:
unseenAbstract = 'The overall goal of these experiments is to determine the effects of PD-linked mutations on \
the properties of the gene products and to use this information to discover methods to test possible explanations \
for pathogenicity.  We expect that this work will generate new therapeutic strategies for the treatment \
of Parkinson\'s disease (PD).  Our emphasis will be on protein fibrillogenesis, since fibrillar cytoplasmic \
aggregates, or Lewy bodies, are diagnostic for PD and a major fibrillar component of Lewy bodies is also the \
product of a gene linked to early-onset PD.  Three mutations, in two different genes, encoding alpha-synuclein \
(alphaS) and ubiquitin C-hydrolase (UCH), have been linked to early-onset PD.  We have shown that the two alphaS \
mutations effect the oligomerization properties of the protein; both favor oligomerization.  It is a central goal \
of the proposed research to understand the structural basis for oligomerization and fibrillization and the \
relationship between this process and disease (the latter will require a collaboration between this project \
and project 3).  We are also very interested in the ubiquitin-dependent degradation of alphaS, especially \
since UCH may be involved in that pathway.  Finally, the possibility that mutant UCH may also be a fibrillogenic \
protein is under investigation.  Protein (UCH and alphaS) fibrillization will be a target for medium-throughput \
screening assays to be run in the Center core facility (Core B). A gene linked to juvenile-onset parkinsonism, \
parkin, will be the subject of future biochemical and biophysical investigations.  This protein contains an \
N-terminal ubiquitin homology domain, which suggests its involvement (like UCH) in the degradative process.  \
We intend to characterize wild-type and mutant forms of Parkin.  The fact that this disease is inherited in \
an autosomal recessive manner suggests that gain of function due to toxic oligomers may not be involved. '

#Creating a new inference directory
terminal(['mkdir', 'infer'])
writeFile('./infer/' + "6302892.txt", unseenAbstract)

Documentation for topic inference can be found in [the Topic Modeling page at the MALLET site.](http://mallet.cs.umass.edu/topics.php)

### Exercise 5 - Extra Credit

* Back to the [Table of Contents](#Table-of-Contents)

Use the MALLET documentation to set up a call to `mallet` to infer topics for the file we just created in the `./infer` folder.

#### Step 1: store topic model in a file so it can be re-used

First, we provide a mallet command that will re-run our topic model with an additional parameter to output a topic inference model specification to a file named "`model.mallet`".  Use the cell below to re-run the model training with the `--inferencer-filename` option set to output a re-usable inferencer to the file "`model.mallet`".

**_NOTE: Each of the topic modeling steps could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), the code in this cell should still be running on the server.  Give it some time._**

In [None]:
# We will first need to rerun our model with the --inferencer-filename option
args = [PATH_TO_MALLET, 'train-topics', '--input', 'data.mallet', '--optimize-interval', '10', \
        '--output-topic-keys', 'topicKeys.txt', '--output-doc-topics', 'docTopics.txt', '--num-topics', '20', \
       '--num-iterations', '20000', '--inferencer-filename', 'model.mallet']
print( terminal( args, "mallet-train-inferencer.txt" ) )

#### Step 2: Make a new mallet data file for the one abstract

Next, you'll create and run a mallet command that executes the `import-dir` command to create a new MALLET data file named "`one.mallet`", based on the contents of your `./infer` directory rather than your `./temp` directory, that will contain the abstract whose topics you want to infer.  Use the option `--use-pipe-from data.mallet` to specify our original data file as a training file for this corpus.  Make the `terminal()` function call for this command output to the file "`mallet-infer-data.txt`".

As mentioned in the documentation, make sure that the new data is compatible with your training data. Use the option "`--use-pipe-from [MALLET TRAINING FILE]`" (without the square brackets around your training file) in the MALLET command `import-dir` to specify a training file.

- For help with available options for the mallet "`import-dir`" command, run the following at the command line:

        /bin/mallet/bin/mallet import-dir --help

**_NOTE: Each of the topic modeling steps could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), the code in this cell should still be running on the server.  Give it some time._**

In [None]:
# Use import-dir to pull our one file into a mallet data file.

### BEGIN SOLUTION
args = [PATH_TO_MALLET, 'import-dir', '--input', './infer', '--output', 'one.mallet', '--use-pipe-from', \
        'data.mallet']
print( terminal( args, "mallet-infer-data.txt" ) )
### END SOLUTION

#### Step 3: Infer Topics in Abstract

Finally, you'll create and run a mallet command that executes the `infer-topics` command, running for 10,000 iterations, using `one.mallet` as the input, the "`model.mallet`" model inferencer we created above as the inferencer, and that outputs the topics for the one abstract to a file named "`inf-one.txt`".  Make the `terminal()` function call for this command output to the file "`mallet-infer-topics.txt`".

- For help with available options for the mallet "`infer-topics`" command, run the following at the command line:

        /bin/mallet/bin/mallet infer-topics --help

**_NOTE: Each of the topic modeling steps could take a long time to complete - as long as there is an asterisk in the square brackets to its left ("In [*]"), the code in this cell should still be running on the server.  Give it some time._**

In [None]:
# Use the infer-topics command to detect topics in the one abstract by running
#    the abstract through the "model.mallet" inferencer.

### BEGIN SOLUTION
args = [PATH_TO_MALLET, 'infer-topics', '--input', 'one.mallet', '--inferencer', 'model.mallet', \
  '--output-doc-topics', 'inf-one.txt', '--num-iterations', '10000']
print( terminal( args, "mallet-infer-topics.txt" ) )

# Eventually, since all of the outputs of mallet are tab-delimited, could
#    probably read and parse some of these files, verify the output inside.
### END SOLUTION

In [None]:
# Test to see if file inf-one.txt was successfully written
f = open('./inf-one.txt', 'r')

## Resources for Topic Modeling

* Back to the [Table of Contents](#Table-of-Contents)

Below you will find some tutorials and resources for topic modeling.
- [General Introduction to Topic Modeling](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf)
- [Topic Modeling for Humanists](http://www.scottbot.net/HIAL/?p=19113)
- [Interpretation of Topic Models](http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf)