# Topic Explorer Notebook Tutorial

The InPhO Topic Explorer features a powerful interactive coding environment that enables direct manipulation of the corpus and models, in contrast to the web visualization.

When you run the `vsm notebook` command, several things happen:
1.  The Topic Explorer creates a new folder called `notebooks` and places several files inside it:
     -  **corpus.py** contains Python code which imports the Corpus object and a function that will load models trained on your corpus. It gathers this information from your `CORPUS.ini` file, where `CORPUS` is the name of the folder you prepared. More information on this special file is below.
     -  **Topic Explorer Tutorial.ipynb** is this file, which provides the skeleton documentation for interacting with the models.
     -  Other **ipynb** files, containing analyses that can be re-run on your own corpora.
2.  The Topic Explorer launches a [Jupyter Notebook](http://jupyter.org/). This allows you to program in Python on your local computer using the browser, rather than a terminal or other program. 

## Introducing Jupyter
If you look in the address bar of your browser, it should start with something like `localhost:8888/notebooks/Topic%20Explorer%20Tutorial.ipynb`. If you open a new tab or browser window and type in `localhost:8888` you will open the same list of files and be able to edit multiple files at once. Note that editing the same file in multiple tabs may cause you to accidentally overwrite your data. Jupyter does not check if the file is already open.

Also, note that if the first portion of the URL (`localhost:8888`) is different, you will need to use the address shown on your computer to open a new tab. Most likely only the "port number" (`8888`) will differ, which may indicate that you have a second instance of `vsm notebook` or another Jupyter Notebook server running.


### Running Cells
The Jupyter notebook operates through individual *cells* that run Python code. To run a cell, select it by clicking on it, then either press the "run cell" (play) button or choose "Cell > Run Cells" from the Jupyter menu.

Try running the cell below:

In [None]:
print("Hello world!")

Immediately below the cell, `Hello world!` should be printed, and to the left of the cell it should say `In [1]:`.

#### A note on kernels and brackets
If the number `[1]` is different, nothing is wrong, so long as there is a number printed.

The number in brackets (`[1]`) counts how many cells you have run in this notebook session. A notebook session is tied to a *kernel*, which runs the Python code. To reset the numbers and start from `[1]` again, go to the menu and select "Kernel > Restart" to run the cells yourself step-by-step, or "Kernel > Restart & Run All" to re-run every cell in order.

If the number appears as `In [*]:`, the cell is currently running. When it changes to a number (such as `[2]`), the cell has finished running.

If you feel this is taking an absurdly long time to load (in excess of a few seconds), please press the stop button and notify the package developers. There might be a bug in the modeling software.

### Errors and Debugging
Each cell automatically displays the value of the last expression in the cell, even without a call to `print`. Run the cell below to see an example:

In [None]:
10

Let's use this to print `Hello world!` once again:

In [None]:
Hello world!

**D'oh!** This should have raised a `SyntaxError: invalid syntax`.

Note the message `File "<ipython-input-5-59ca0efa9f56>", line 1` (the portion after `ipython-input` may be different). This tells you which line of the program raised the error. If you have errors in more advanced code, the line number will be very helpful in diagnosing the problem.

For now, change the cell above to `"Hello world!"` and run again to get the proper output.
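
The same kind of error can also be caught and inspected in code, which can help when debugging longer programs. The following is a minimal sketch using only Python's built-in `compile()` function; the file name `<cell>` is an arbitrary placeholder chosen for this example, not something Jupyter produces.

```python
# The invalid cell contents from above, stored as a string.
source = "Hello world!"

try:
    # compile() parses the source without running it.
    compile(source, "<cell>", "exec")
    error_line = None
except SyntaxError as err:
    # lineno records which line of the source failed to parse.
    error_line = err.lineno
    print("SyntaxError on line {}".format(error_line))
```

Wrapping the string in quotes inside the source (so that the cell reads `"Hello world!"`) lets it compile without error.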

## Importing `corpus.py`
Now that you know how to run a cell, we can begin interacting with the topic models. First we will import your corpus objects. Select and run the cell below:

In [None]:
from corpus import *

You will now have access to several variables, the most important of which are:
 -  `c` -- The `vsm.Corpus` object
 -  `lda_v` -- A dictionary containing each of the `vsm.LdaViewer` instances. You can access a particular model with `lda_v[k]`, substituting a particular number for `k`, like `lda_v[20]` for the 20-topic model. If the model for that number of topics has not been trained, you will get an error.
 -  `topic_range` -- A list of the numbers of topics for which models have been trained (e.g., `[20, 40, 60, 80]`)
 -  `context_type` -- A string containing the particular context type modeled (e.g., `"sentence"`, `"document"`, `"article"`)
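
Because `lda_v` only contains the models listed in `topic_range`, it can be useful to check before looking one up. The sketch below uses stand-in values so that it runs on its own: the contents of `topic_range` and `lda_v` are placeholders, and `get_viewer` is a hypothetical helper, not part of the Topic Explorer.

```python
# Stand-ins for the variables that `from corpus import *` provides.
# The strings below are placeholders for real vsm.LdaViewer instances.
topic_range = [20, 40, 60, 80]
lda_v = {k: "viewer for the {}-topic model".format(k) for k in topic_range}

def get_viewer(k):
    """Return the viewer for k topics, or None if no such model was trained."""
    if k not in topic_range:
        print("No {}-topic model; choose one of {}".format(k, topic_range))
        return None
    return lda_v[k]

v20 = get_viewer(20)   # a trained model
v30 = get_viewer(30)   # not trained, so a message is printed instead
```

In your own notebook you would drop the stand-in definitions and rely on the variables imported from `corpus.py`.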

### Introducing the `vsm` module

The InPhO Topic Explorer consists of two modules:
1. The `topicexplorer` module contains code for the visualization and user interfaces.
2. The `vsm` module contains code for modeling different corpora.

In order to make use of the term frequency (TF), term frequency-inverse document frequency (TfIdf), and latent semantic analysis (LSA) models, we must import the main vsm module:

In [None]:
from vsm import *

## Interacting with the Corpus: Term Frequencies

The earlier `from corpus import *` command loaded your `Corpus` object into the `c` variable. You can see the list of all words that are in your corpus by typing `c.words` into a code cell:

In [None]:
c.words

Note that it only shows the first few and last few unique words in the corpus, alphabetically sorted. 

What if we want to get a list of how often each word occurs? For that, we can use the `vsm.model.TF` model to build a frequency distribution over the terms in the corpus:

In [None]:
# train the model and create a TfViewer object
tf = TF(c, context_type)
tf.train()
tf_v = TfViewer(c, tf)

# print the most frequent terms in the corpus
# remember that IPython automatically displays the last expression of a cell
tf_v.coll_freqs()

After running the cell above, you should see a table with the 20 most frequently used words.
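
To make concrete what such a frequency table contains, here is a small standalone illustration using Python's standard `collections.Counter` on a made-up corpus. It is not how `vsm` computes frequencies internally; the documents and words are invented for the example.

```python
from collections import Counter

# A toy corpus: each inner list is one tokenized document.
documents = [
    ["the", "mind", "and", "the", "body"],
    ["the", "logic", "of", "the", "mind"],
]

# Count every token across all documents.
frequencies = Counter(word for doc in documents for word in doc)

# List terms from most to least frequent, as a frequency table would.
for word, count in frequencies.most_common(3):
    print("{:<8}{}".format(word, count))
```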

## Interacting with Topic Models

The InPhO Topic Explorer doesn't just work with term frequencies though - it creates LDA topic models. Through the notebook interface these models can be powerfully manipulated to produce new analyses.

First, let's select a primary model to investigate, and load it into the variable `v`:

In [None]:
# print the number of topics in the first model
print(topic_range[0])
# remember that list indexes start with 0 not 1!

# replace 'topic_range[0]' with a specific number, if you like
k = topic_range[0]

# load the topic model
v = lda_v[k]

The above code loads the first topic model into a viewer object. We have used `topic_range[0]` instead of simply stating a number so that this same demo notebook will work with any model settings you've prepared. This portability enables us to write analyses that can be replicated across any corpus, and is one of the real strengths of using the `from corpus import *` model of coding your notebooks. If others are using the Topic Explorer to generate their objects, they can run the exact same analysis on different corpora, so long as the variable names are consistent.

### `v.topics()`
First, let's print a list of topics:

In [None]:
v.topics()

To change the number of words printed per topic, use the `print_len` argument:

In [None]:
v.topics(print_len=20)
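
Conceptually, each topic is a probability distribution over the vocabulary, and a topic listing shows the highest-probability words for each one. The sketch below illustrates that idea with invented numbers; `top_words`, `vocabulary`, and `topic_word` are hypothetical names for this example, and the code is not the `vsm` implementation of `v.topics()`.

```python
# A made-up vocabulary and two made-up topic-word distributions.
vocabulary = ["mind", "body", "logic", "proof", "ethics"]
topic_word = [
    [0.40, 0.35, 0.05, 0.04, 0.16],  # topic 0
    [0.04, 0.03, 0.45, 0.40, 0.08],  # topic 1
]

def top_words(distribution, words, print_len=10):
    """Return the print_len most probable words for one topic."""
    ranked = sorted(zip(distribution, words), reverse=True)
    return [word for _, word in ranked[:print_len]]

# Print the three most probable words for each topic.
for topic_id, distribution in enumerate(topic_word):
    words = top_words(distribution, vocabulary, print_len=3)
    print("{}: {}".format(topic_id, ", ".join(words)))
```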

### Viewing Document-topic probabilities
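The above code shows the topic-word distributions and allows us to estimate the quality of our topics. To see how those topics are distributed across individual documents, we can examine the document-topic probabilities.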

#### `v.labels`
The property `v.labels` (without parentheses) returns a list of all documents in a corpus, and is useful for processing each document generically, without having to look up the identifiers on the file system.

Below, we print the first 3 document labels:

In [None]:
for label in v.labels[:3]:
    print(label)

#### `v.doc_topics(doc_or_docs)`
Each document-topic distribution can be examined with `v.doc_topics()`, which takes as its argument either a single label or a list of labels. Below we view the distribution for the first 3 documents.

In [None]:
v.doc_topics(v.labels[:3])

#### `v.aggregate_doc_topics(doc_or_docs, normed_sum=False)`
While `v.doc_topics(doc_or_docs)` shows the distribution for each document, `v.aggregate_doc_topics()` shows the average distribution of a collection of documents. The `normed_sum` argument tells the program whether to weight each document by its length (`normed_sum=True`) or to consider them all equally (`normed_sum=False`).

In [None]:
v.aggregate_doc_topics(v.labels[:3], normed_sum=True)
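
The difference between the two weighting schemes described above can be seen with a little arithmetic. The following standalone sketch uses invented distributions and document lengths; the `average` helper is hypothetical and only illustrates equal versus length-based weighting, not the internals of `v.aggregate_doc_topics()`.

```python
# Hypothetical topic distributions for two documents over two topics.
doc_topics = [[0.8, 0.2], [0.2, 0.8]]
# Hypothetical document lengths, in tokens.
doc_lengths = [100, 300]

def average(distributions, weights):
    """Weighted average of several topic distributions."""
    total = float(sum(weights))
    n_topics = len(distributions[0])
    return [
        sum(w * dist[t] for dist, w in zip(distributions, weights)) / total
        for t in range(n_topics)
    ]

# Every document counts the same.
equal = average(doc_topics, [1, 1])
# Longer documents pull the average toward their own distribution.
by_length = average(doc_topics, doc_lengths)

print("equal weights:  {}".format(equal))
print("length weights: {}".format(by_length))
```

With equal weights both topics come out even, while weighting by length shifts the average toward the longer second document.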

### Comparing documents with `v.dist()`

Topic models give us a way to measure the similarity between two documents. To do this, we use `v.dist()`:

In [None]:
v.dist(v.labels[0], v.labels[1])

#### Alternative distance measures
By default, the Topic Explorer uses the Jensen-Shannon Distance to calculate the distance between documents. The Jensen-Shannon Distance (JSD) is a symmetric measure based on information theory that characterizes the difference between two probability distributions.

However, several alternate methods are built into the `vsm.spatial` module. These include the Kullback-Leibler Divergence, which is an asymmetric component of the JSD and is used in [Murdock et al. (in review)](http://arxiv.org/abs/1509.07175) to characterize the cognitive surprise of a new text, given previous texts.

Rather than using the JSD and assuming symmetric divergence between items, we assume that the second document is encountered after the first, effectively measuring text-to-text divergence.

In [None]:
# first import KL divergence:
from vsm.spatial import KL_div

# calculate KL divergence from the first document to the second
print("First to second: {}".format(v.dist(v.labels[0], v.labels[1], dist_fn=KL_div)))

# calculate KL divergence from the second document to the first, highlighting asymmetry:
print("Second to first: {}".format(v.dist(v.labels[1], v.labels[0], dist_fn=KL_div)))
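
To see why one measure is symmetric and the other is not, it can help to compute both by hand on two small distributions. The sketch below is written from the textbook definitions, assuming base-2 logarithms and taking the Jensen-Shannon distance to be the square root of the Jensen-Shannon divergence; the functions in `vsm.spatial` may differ in these details. The two distributions are invented.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), in bits.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    js_div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(js_div, 0.0))

# Two made-up document-topic distributions.
first = [0.7, 0.2, 0.1]
second = [0.1, 0.3, 0.6]

# The Jensen-Shannon distance is the same in both directions.
print("JSD first/second: {:.4f}".format(js_distance(first, second)))
print("JSD second/first: {:.4f}".format(js_distance(second, first)))

# The KL divergence depends on the order of its arguments.
print("KL first to second: {:.4f}".format(kl_divergence(first, second)))
print("KL second to first: {:.4f}".format(kl_divergence(second, first)))
```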

# Using Python's Help System

There are many other functions in the InPhO Topic Explorer and the associated `vsm` library. These are extensively documented within the code. 

One little-known feature of Python is its capacity for introspection: by using the `help()` function, you can find out all the methods and properties of an object. For example, if you wanted to know what methods could be called on your corpus object, you could run:

In [None]:
help(c)

You can also get help on particular methods. For example, there are many arguments to `v.topics()` beyond `print_len`. These can be seen by calling `help(v.topics)` without parentheses after `v.topics`:

In [None]:
help(v.topics)

Calling `help(v.topics())` *with* parentheses will return help for the object returned by `v.topics()`, which is a `DataTable`:

In [None]:
help(v.topics())

It is important to emphasize that this functionality can be used with any Python library, including the standard library. For example, you could look up the documentation for the `log` function in the `math` library by using:

In [None]:
import math
help(math.log)

For emphasis, calling `help(math.log(3))` will return the documentation for `float`. Why? First `math.log(3)` will be evaluated, then `help()` will be called on that result: `help(1.0986122886681098)`.

In [None]:
help(math.log(3))
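
One way to check what `help()` will actually receive is to look at the type of the expression first. This short sketch uses only the standard library:

```python
import math

# help(math.log(3)) first evaluates math.log(3) ...
result = math.log(3)

# ... and the result is an ordinary float, so help() documents `float`.
print(type(result).__name__)

# The function itself is callable; the number it returns is not.
print("{} {}".format(callable(math.log), callable(result)))
```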

## `help()` and `?`

Alternatively, you can place a `?` before any function to receive help in a separate "frame". This allows you to view help while scrolling up and down the notebook.

In [None]:
?math.log

# Additional Examples

This notebook gives some basic building blocks for using the Topic Explorer. Additional examples can be found on GitHub in the [inpho/vsm-demo-notebooks repository](http://github.com/inpho/vsm-demo-notebooks).

# Contact Information
If you have additional questions regarding the InPhO Topic Explorer or have comments on this tutorial, please e-mail [tutorial@hypershelf.org](mailto:tutorial@hypershelf.org).
