## Importing `corpus.py`
Now that you know how to run a cell, we can begin interacting with the topic models. First we will import your corpus objects. Select and run the cell below:

In [1]:
from corpus import *

Running from notebook, using serial load function.
[20, 40, 60, 80, 100]
/home/hongliang/inpho/handian/compare/handian2and2mac/models/handian2-freq5-freq2-N23231312-LDA-K{0}-document-1000.npz


You will now have access to several variables, the most important of which are:
 -  `c` -- The `vsm.Corpus` object
 -  `lda_v` -- A dictionary containing each of the `vsm.LdaViewer` instances. You can access a particular model with `lda_v[k]`, substituting a particular number for `k`, such as `lda_v[20]` for the 20-topic model. If no model has been trained for that number of topics, the lookup will raise an error.
 -  `topic_range` -- A list of the topic counts for which models have been trained (e.g., `[20, 40, 60, 80]`)
 -  `context_type` -- A string containing the particular context type modeled (e.g., `"sentence"`, `"document"`, `"article"`)
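
Because `lda_v` is a dictionary keyed by the number of topics, it can be useful to check `topic_range` before looking up a model. The following is a minimal sketch using stand-in values: in the real notebook the dictionary holds viewer objects rather than strings, and `get_viewer` is a hypothetical helper, not part of the Topic Explorer.

```python
# Stand-ins for the variables that `from corpus import *` provides.
# In the real notebook the dictionary values are viewer objects, not strings.
topic_range = [20, 40, 60, 80]
lda_v = {k: "viewer for the {0}-topic model".format(k) for k in topic_range}

def get_viewer(k):
    """Hypothetical helper: return the viewer for k topics, or say what is available."""
    if k not in topic_range:
        raise ValueError("No {0}-topic model; trained models: {1}".format(k, topic_range))
    return lda_v[k]

print(get_viewer(20))
```

Checking against `topic_range` first gives a message that lists the available models, which is easier to act on than a bare lookup failure.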

### Introducing the `vsm` module

The InPhO Topic Explorer consists of two modules:
1. The `topicexplorer` module contains code for the visualization and user interfaces.
2. The `vsm` module contains code for modeling different corpora.

To use the term frequency (TF), term frequency-inverse document frequency (TfIdf), and latent semantic analysis (LSA) models, we must import the main `vsm` module:

In [2]:
from vsm import *

## Interacting with the Corpus: Term Frequencies

The earlier `from corpus import *` command loaded your `Corpus` object into the `c` variable. Typing `c.words` into a code cell displays the unique words in your corpus; wrapping it in `len()` counts them:

In [3]:
len(c.words)

15293

Note that evaluating `c.words` on its own shows only the first few and last few unique words in the corpus, alphabetically sorted.

What if we want to know how often each word occurs? For that, we can use the `vsm.model.TF` class to build a frequency distribution over the terms in the corpus:

In [4]:
# train the model and create a TfViewer object
tf = TF(c, context_type)
tf.train()
tf_v = TfViewer(c, tf)

# display the most frequent terms in the corpus
# remember that IPython automatically displays the value of the last expression in a cell
tf_v.coll_freqs()

Collection Frequencies

| Word | Counts | Word | Counts |
|---|---|---|---|
| 年 | 501196 | 南 | 273696 |
| 王 | 410631 | 将 | 272349 |
| 州 | 383224 | 行 | 272075 |
| 天 | 367499 | 国 | 257069 |
| 月 | 330868 | 山 | 256549 |
| 日 | 316454 | 官 | 245958 |
| 书 | 309290 | 道 | 245572 |
| 军 | 295896 | 百 | 242402 |
| 时 | 294532 | 臣 | 241921 |
| 文 | 277990 | 东 | 235142 |


After running the cell above, you should see a table with the 20 most frequently used words.
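
To make the idea of a collection frequency concrete, here is a minimal sketch that uses only the standard library rather than `vsm`. The two toy documents are made up for illustration, and `collections.Counter` stands in for what the TF model computes over the real corpus.

```python
from collections import Counter

# Toy corpus: each inner list is one tokenized document (made-up data).
documents = [
    ["年", "王", "州", "年"],
    ["王", "年", "天"],
]

# Collection frequency: occurrences of each word across all documents.
coll_freqs = Counter(word for doc in documents for word in doc)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = coll_freqs.most_common(2)
print(top_two)
```

Here 年 occurs three times across the two documents and 王 twice, so those two words head the list.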

## Interacting with Topic Models

The InPhO Topic Explorer doesn't just work with term frequencies, though: it also creates LDA topic models. Through the notebook interface, these models can be manipulated directly to produce new analyses.

First, let's select a primary model to investigate, and load it into the variable `v`:

In [5]:
# print the number of topics in the first model
print topic_range[0]
# remember that list indexes start with 0 not 1!

# replace 'topic_range[0]' with a specific number, if you like
k = topic_range[0]

# load the topic model
v = lda_v[k]

20
Loading LDA data from /home/hongliang/inpho/handian/compare/handian2and2mac/models/handian2-freq5-freq2-N23231312-LDA-K20-document-1000.npz


The above code loads the first topic model into a viewer object. We used `topic_range[0]` instead of a literal number so that this same demo notebook will work with whatever model settings you have prepared. This portability lets us write analyses that can be replicated across any corpus, and it is one of the real strengths of the `from corpus import *` approach to writing notebooks. If others use the Topic Explorer to generate their objects, they can run the exact same analysis on different corpora, as long as the variable names are consistent.

### `v.topics()`
First, let's print a list of topics:

In [6]:
v.topics()

Topics Sorted by Index

| Topic | Words |
|---|---|
| Topic 0 | 官, 司, 部, 郎, 书, 史, 州, 士, 侍, 御 |
| Topic 1 | 卷, 书, 文, 本, 诗, 传, 集, 记, 经, 纪 |
| Topic 2 | 年, 王, 帝, 太, 书, 宗, 元, 时, 文, 国 |
| Topic 3 | 年, 江, 尔, 督, 总, 兵, 部, 抚, 南, 命 |
| Topic 4 | 月, 日, 年, 星, 辰, 度, 壬, 辛, 庚, 甲 |
| Topic 5 | 气, 服, 热, 水, 病, 黄, 治, 血, 寒, 阳 |
| Topic 6 | 官, 本, 钱, 年, 臣, 日, 百, 路, 州, 司 |
| Topic 7 | 师, 法, 佛, 经, 道, 僧, 生, 时, 王, 门 |
| Topic 8 | 王, 侯, 国, 齐, 年, 义, 正, 文, 郑, 礼 |
| Topic 9 | 德, 阙, 天, 圣, 臣, 心, 道, 将, 神, 明 |


To change the number of words printed per topic, use the `print_len` argument:

In [7]:
v.topics(print_len=20)

Topics Sorted by Index

| Topic | Words |
|---|---|
| Topic 0 | 官, 司, 部, 郎, 书, 史, 州, 士, 侍, 御, 军, 府, 正, 品, 都, 学, 监, 尚, 制, 置 |
| Topic 1 | 卷, 书, 文, 本, 诗, 传, 集, 记, 经, 纪, 志, 古, 学, 篇, 汉, 录, 百, 史, 类, 名 |
| Topic 2 | 年, 王, 帝, 太, 书, 宗, 元, 时, 文, 国, 州, 臣, 天, 德, 明, 朝, 皇, 日, 安, 官 |
| Topic 3 | 年, 江, 尔, 督, 总, 兵, 部, 抚, 南, 命, 山, 军, 阿, 海, 官, 克, 巡, 民, 西, 贼 |
| Topic 4 | 月, 日, 年, 星, 辰, 度, 壬, 辛, 庚, 甲, 丙, 戊, 癸, 分, 午, 天, 巳, 卯, 寅, 亥 |
| Topic 5 | 气, 服, 热, 水, 病, 黄, 治, 血, 寒, 阳, 生, 汤, 药, 阴, 脉, 痛, 味, 白, 食, 虚 |
| Topic 6 | 官, 本, 钱, 年, 臣, 日, 百, 路, 州, 司, 令, 诏, 法, 月, 民, 奏, 行, 已, 万, 千 |
| Topic 7 | 师, 法, 佛, 经, 道, 僧, 生, 时, 王, 门, 天, 山, 日, 寺, 名, 真, 身, 罗, 行, 禅 |
| Topic 8 | 王, 侯, 国, 齐, 年, 义, 正, 文, 郑, 礼, 父, 晋, 传, 周, 伯, 书, 楚, 服, 师, 母 |
| Topic 9 | 德, 阙, 天, 圣, 臣, 心, 道, 将, 神, 明, 命, 功, 实, 表, 风, 灵, 文, 载, 成, 礼 |
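
The effect of `print_len` can be pictured as truncating a ranked word list. The sketch below assumes that each topic's words are ranked by probability; the probabilities are made up, and `top_words` is a hypothetical helper, not the `vsm` implementation.

```python
# Made-up word probabilities for a single topic (not taken from the model above).
topic_word_probs = {"官": 0.30, "司": 0.25, "部": 0.20, "郎": 0.15, "书": 0.10}

def top_words(word_probs, print_len=10):
    """Return the print_len most probable words, most probable first."""
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return ranked[:print_len]

print(", ".join(top_words(topic_word_probs, print_len=3)))
```

A larger `print_len` simply keeps more of the ranked list, which is why the 20-word rows above begin with the same 10 words as the shorter rows.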


### Viewing Document-topic probabilities
The topic listings above show the topic-word distributions, which let us judge the quality of our topics. The viewer can also report how topics are distributed over individual documents.

#### `v.labels`
The property `v.labels` (without parentheses) returns a list of the labels of all documents in a corpus. It is useful for processing each document generically, without having to look up the identifiers on the file system.

Below, we print the first 3 document labels:

In [8]:
for label in v.labels[:3]:
    print label

handian2/『史部』/职官/唐六典/卷十五·光禄寺.txt
handian2/『史部』/职官/唐六典/卷二十八·太子左右卫及诸率府.txt
handian2/『史部』/职官/唐六典/卷三·尚书户部.txt


#### `v.doc_topics(doc_or_docs)`
Each document-topic distribution can be examined with `v.doc_topics()`, which takes as its argument either a single label or a list of labels. Below we view the distribution for the first 3 documents.

In [9]:
v.doc_topics(v.labels[:3])

Distributions over Topics

| Doc: handian2/『史部』/职官/唐六典/卷十五·光禄寺.txt | | Doc: handian2/『史部』/职官/唐六典/卷二十八·太子左右卫及诸率府.txt | | Doc: handian2/『史部』/职官/唐六典/卷三·尚书户部.txt | |
|---|---|---|---|---|---|
| Topic | Prob | Topic | Prob | Topic | Prob |
| 18 | 0.45781 | 0 | 0.64476 | 17 | 0.44986 |
| 0 | 0.30366 | 15 | 0.12606 | 6 | 0.20621 |
| 12 | 0.1686 | 2 | 0.08854 | 0 | 0.11983 |
| 5 | 0.04239 | 18 | 0.07904 | 12 | 0.04566 |
| 4 | 0.0163 | 14 | 0.06153 | 11 | 0.04494 |
| 8 | 0.00652 | 7 | 1e-05 | 13 | 0.03793 |
| 2 | 0.00466 | 1 | 1e-05 | 15 | 0.03262 |
| 17 | 0.0 | 3 | 1e-05 | 18 | 0.02875 |
| 3 | 0.0 | 4 | 1e-05 | 5 | 0.02283 |
| 16 | 0.0 | 5 | 1e-05 | 4 | 0.00652 |


#### `v.aggregate_doc_topics(doc_or_docs, normed_sum=False)`
While `v.doc_topics(doc_or_docs)` shows the distribution for each document, `v.aggregate_doc_topics()` shows the average distribution of a collection of documents. The `normed_sum` argument tells the program whether to weight each document by its length (`normed_sum=True`) or to consider them all equally (`normed_sum=False`).

In [10]:
v.aggregate_doc_topics(v.labels[:3], normed_sum=True)

Aggregate Distribution over Topics

| Topic | Prob |
|---|---|
| 0 | 0.35608 |
| 18 | 0.18853 |
| 17 | 0.14996 |
| 12 | 0.07142 |
| 6 | 0.06874 |
| 15 | 0.05289 |
| 2 | 0.0322 |
| 5 | 0.02174 |
| 14 | 0.02051 |
| 11 | 0.01498 |
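
The difference between weighting documents by length and treating them equally can be illustrated with a small pure-Python sketch. This is one reading of the distinction described above, not the `vsm` implementation; the `aggregate` function, the distributions, and the document lengths are all made up for illustration.

```python
def aggregate(distributions, lengths, weight_by_length):
    """Average several topic distributions, optionally weighting by document length."""
    weights = lengths if weight_by_length else [1] * len(distributions)
    total = float(sum(weights))
    n_topics = len(distributions[0])
    return [
        sum(w * dist[t] for w, dist in zip(weights, distributions)) / total
        for t in range(n_topics)
    ]

# Two made-up documents over three topics; the second is three times longer.
dists = [[0.8, 0.2, 0.0], [0.0, 0.4, 0.6]]
lengths = [100, 300]

equal = aggregate(dists, lengths, weight_by_length=False)
weighted = aggregate(dists, lengths, weight_by_length=True)
print(equal)
print(weighted)
```

With equal weights the average is roughly `[0.4, 0.3, 0.3]`. Weighting by length pulls it toward the longer document, giving roughly `[0.2, 0.35, 0.45]`.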


### Comparing documents with `v.dist()`

Topic models give us a way to measure the similarity of two documents. To do this, we use `v.dist()`:

In [11]:
v.dist(v.labels[0], v.labels[1])

0.62897624160128662

#### Alternative distance measures
By default, the Topic Explorer uses the Jensen-Shannon Distance to calculate the distance between documents. The Jensen-Shannon Distance (JSD) is a symmetric measure based on information theory that characterizes the difference between two probability distributions.

However, several alternative measures are built into the `vsm.spatial` module. These include the Kullback-Leibler (KL) divergence, which is an asymmetric component of the JSD and is used in [Murdock et al. (in review)](http://arxiv.org/abs/1509.07175) to characterize the cognitive surprise of a new text, given previous texts.

Below, rather than using the JSD and assuming symmetric divergence between items, we use the KL divergence and assume that the second document is encountered after the first, effectively measuring text-to-text divergence.

In [12]:
# first import KL divergence:
from vsm.spatial import KL_div

# calculate KL divergence from the first document to the second
print "First to second", v.dist(v.labels[0], v.labels[1], dist_fn=KL_div)

# calculate KL divergence from the second document to the first, highlighting asymmetry:
print "Second to first", v.dist(v.labels[1], v.labels[0], dist_fn=KL_div)

First to second 4.15705067559
Second to first 3.57432545105
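
To show why one measure is symmetric and the other is not, here is a minimal pure-Python sketch of both, using the textbook definitions. It does not reproduce the `vsm.spatial` code: the base-2 logarithm, the square root in `js_distance`, and the two toy distributions are assumptions made for illustration, and the sketch makes no claim about which argument order `v.dist()` passes to `KL_div`.

```python
from __future__ import print_function
from math import log, sqrt

def kl_divergence(p, q):
    """KL divergence of q from p, in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]  # midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return sqrt(max(divergence, 0.0))  # guard against tiny negative rounding error

# Made-up topic distributions for two documents.
first = [0.9, 0.05, 0.05]
second = [0.4, 0.3, 0.3]

print("KL(first || second):", kl_divergence(first, second))
print("KL(second || first):", kl_divergence(second, first))
print("JS distance, both orders:", js_distance(first, second), js_distance(second, first))
```

Swapping the arguments changes the KL divergence but leaves the Jensen-Shannon distance unchanged, because the latter compares both distributions against their shared midpoint.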


# Using Python's Help System

There are many other functions in the InPhO Topic Explorer and the associated `vsm` library. These are extensively documented within the code. 

One little-known feature of Python is its capacity for introspection: with the built-in `help()` function, you can list all the methods and properties of an object. For example, to find out what methods can be called on your corpus object, you could run:

In [13]:
help(c)

Help on Corpus in module vsm.corpus.base object:

class Corpus(BaseCorpus)
 |  The goal of the Corpus class is to provide an efficient representation    of a textual corpus.
 |  
 |  A Corpus object contains an integer representation of the text and
 |  maps to permit conversion between integer and string
 |  representations of a given word.
 |  
 |  As a BaseCorpus object, it includes a dictionary of tokenizations
 |  of the corpus and a method for viewing (without copying) these
 |  tokenizations. This dictionary also stores metadata (e.g.,
 |  document names) associated with the available tokenizations.
 |  
 |  :param corpus: A string array representing the corpus as a sequence of
 |      atomic words.
 |  :type corpus: array-like
 |  
 |  :param context_data: Each element in `context_data` is an array containing 
 |      the indices marking the token boundaries. An element in `context_data` is
 |      intended for use as a value for the `indices_or_sections`
 |      parameter in `

You can also get help on particular methods. For example, there are many arguments to `v.topics()` beyond `print_len`. These can be seen by calling `help(v.topics)` without parentheses after `v.topics`:

In [14]:
help(v.topics)

Help on method topics in module vsm.viewer.ldacgsviewer:

topics(self, topic_indices=None, sort=None, print_len=10, as_strings=True, compact_view=True, topic_labels=None) method of vsm.viewer.ldacgsviewer.LdaCgsViewer instance
    Returns a list of topics estimated by the model. 
    Each topic is represented by a list of words and the corresponding 
    probabilities.
    
    :param topic_indices: List of indices of topics to be
        displayed. Default is all topics.
    :type topic_indices: list of integers
    
    :param sort: Topic sort function.
    :type sort: string, values are "entropy", "oscillation", "index", "jsd",
        "user" (default if topic_indices set), "index" (default)
    
    :param print_len: Number of words shown for each topic. Default is 10.
    :type print_len: int, optional
    
    :param as_string: If `True`, each topic displays words rather than its
        integer representation. Default is `True`.
    :type as_string: boolean, optional
    
    :p

Calling `help(v.topics())` *with* parentheses will return help for the object returned by `v.topics()`, which is a `DataTable`:

In [15]:
help(v.topics())

Help on DataTable in module vsm.viewer.labeleddata object:

class DataTable(__builtin__.list)
 |  A subclass of list whose purpose is to store labels and
 |  formatting information for a list of 1-dimensional structured
 |  arrays. It also provides pretty-printing routines.
 |  
 |  Globally, the table has a default display length for the columns
 |  and a table header.
 |  
 |  A column can have a column-specific header.
 |  
 |  A subcolumn wraps the data found under a given field name. Each
 |  subcolumn has a label and a display width.
 |  
 |  :param l: List of 1-dimensional structured arrays.
 |  :type l: list
 |  
 |  :param table_header: The title of the object. Default is `None`.
 |  :type table_header: string, optional
 |  
 |  :param compact_view: If `True` the DataTable is displayed with its
 |      tokens only without the probabilities. Default is `True`
 |  :type compact_view: boolean, optional
 |  
 |  :attributes:
 |      * **table_header** (string)
 |          The titl

It is important to emphasize that this functionality can be used with any Python library, including the standard library. For example, you could look up the documentation for the `log` function in the `math` library by using:

In [16]:
import math
help(math.log)

Help on built-in function log in module math:

log(...)
    log(x[, base])
    
    Return the logarithm of x to the given base.
    If the base not specified, returns the natural logarithm (base e) of x.
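
To survey everything a module offers rather than a single function, the built-in `dir()` function complements `help()`. The sketch below uses only the standard library; the variable names are illustrative.

```python
import math

# dir() lists the names defined on any object, including modules.
# Filtering out underscore-prefixed names leaves the public functions and constants.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names)

# callable() separates functions such as math.log from constants such as math.pi.
functions = [name for name in public_names if callable(getattr(math, name))]
constants = [name for name in public_names if not callable(getattr(math, name))]
```

Any name that appears in `functions` can then be passed to `help()`, for example `help(math.sqrt)`.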




# Additional Examples

This notebook gives some basic building blocks for using the Topic Explorer. Additional examples can be found on GitHub in the [inpho/vsm-demo-notebooks repository](http://github.com/inpho/vsm-demo-notebooks).

# Contact Information
If you have additional questions regarding the InPhO Topic Explorer or have comments on this tutorial, please e-mail [tutorial@hypershelf.org](mailto:tutorial@hypershelf.org).
