Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Topic Modeling with Logical Constraints on Words

This is a re-implementation of my work "Topic Models with Logical Constraints on Words" (ROBUS 2011), which allows us to use any logical expressions of soft constraints on words, such as Must-Link (ML) and Cannot-Link (CL) defined below when analyzing documents by topic modeling.

  • ML(A, B): A and B must appear in the same topic.
  • CL(A, B): A and B cannot apper in the same topic.

For example, we can use this soft constraint, (ML("kung-fu", "jackie") | ML("kung-fu", "bruce")) & CL("bruce", "jackie"), to distinguish Jackie Chan and Bruce Lee when analyzing movie reviews. The proposed model involves a latent Dirichlet allocation with Dirichlet forest priors (LDA-DF) trained via collapsed Gibbs sampling. Note that this implementation is a simplified version of the original LDA-DF, where this one directly encodes the maximal independent sets on a CL-graph into trees (see the implementation notes), whereas the original one encodes the cliques on each connected component in its complement graph into subtrees. See the paper or slides for details.


  • GCC (4.8.5)
  • Python (3.6.6)
  • PLY (3.10)
  • CxxTest (4.3)


PLY (Python Lex-Yacc) is used to parse logical expressions.

$ pip install ply


The following command runs an interactive demonstration when analyzing a toy dataset ([[A,A,B,B], [A,A,C,C], [A,A,B,B], [A,A,C,C]]) with two topics.

$ python -I

Select a constraint:
6) ( ML("A", "B") | ML("A", "C") ) & CL("B", "C")

Index> 6
- Word probabilities in each topic
A       0.499500
C       0.499500
B       0.000999
A       0.499500
B       0.499500
C       0.000999


Example script to understand how to use this software.

usage: [-h] [-i N] [-e N] [-s N] [-v] [-I]

Demonstration of topic modeling with logical constraints

python -I
python -i 6 -e 100 -s 0 -v

prepared constraint links:
1) ML("A","B")
2) CL("B", "C")
3) IL("B", "A")
4) ML("A", "B") | ML("A", "C")
5) IL("B", "A") & IL("C", "A")
6) ( ML("A", "B") | ML("A", "C") ) & CL("B", "C")
7) IL("B", "A") & IL("C", "A") & CL("B", "C")
8) ML("A", "B") & ML("A", "C")
9) ML("A", "B") & CL("B", "C")

optional arguments:
  -h, --help         show this help message and exit
  -i N, --index N    index of prepared constraint links
  -e N, --eta N      eta parameter for strength of links
  -s N, --seed N     eta parameter for strength of links
  -v, --verbose      verbose mode
  -I, --interactive  interactive mode


Script to make a training dataset (.dat/lex) from a raw input file including space-separated word sequences. The .dat file includes data sequences of word-ids, and the .lex file is its dictionary.

usage: [-h] [-i FILE] [-o PREF] [-m N]

Script to make a training dataset (.dat/lex) from a raw text file (.txt)

python utils/ -i data/test.txt -o out/test -m 1

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        input file (.txt) for raw texts
  -o PREF, --out_base PREF
                        output prefix of path to save .dat/lex files
  -m N, --min_freq N    minimum frequency for .lex file

We can make a dataset as follows.

$ cat data/test.txt
$ python utils/ -i data/test.txt -o out/test -m 1
* Info:
- raw_file: data/test.txt
- lex_file: out/test.lex
- dat_file: out/test.dat
- min_freq: 1
* Making lex file
* Making dat file
* Done
$ cat out/test.dat out/test.lex
0:2 1:2
0:2 2:2
0:2 1:2
0:2 2:2
$ cat out/test.lex


Compiler to convert a given constraint (logical expressions of MLs and CLs) into the corresponding DNF of primitives.

usage: [-h] [-l FILE] [-c FILE] [-d FILE] [-i] [--debug] [-v]

Compiler from constraint linkes to DNF of primitives

python utils/ -c data/test.cst -d data/test.dnf -l data/test.lex -v
python utils/ -I

optional arguments:
  -h, --help           show this help message and exit
  -l FILE, --lex FILE  lexicon file (.lex) for .dat file
  -c FILE, --cst FILE  file (.cst) to load constraint links
  -d FILE, --dnf FILE  file (.dnf) to save compiled DNF
  -I, --interactive    interactive mode for debugging used without any other
  --debug              debug mode for ply parser
  -v, --verbose        verbose mode

We can create .dnf file from a text or file including a constraint as follows.

$ python utils/ -v -c '(ML("A","B")|ML("A","C"))&CL("B","C")' -l./data/test.lex -d out/test.dnf
compiled DNF: Ep(0,1)&Np(2) | Ep(0,2)&Np(1) | Np(0)&Np(1) | Np(0)&Np(2)
dtree-0: Np(0)&Np(1)
dtree-1: Ep(0,1)&Np(2)
dtree-2: Ep(0,2)&Np(1)
dtree-3: Np(0)&Np(2)
$ cat out/test.dnf 

Or, we can interactively check the functionality as follows.

$ python utils/ -I
Input links> (ML("A","B")|ML("A","C"))&CL("B","C")
dnf primitives: Ep(A,B)&Np(C) | Ep(A,C)&Np(B) | Np(A)&Np(B) | Np(A)&Np(C)
dnf file:


Main program to infer model parameters from a dataset and a compiled DNF.

usage: ldadf [OPTION..] DATA

LDA with logical constraints on words

./src/ldadf -n2 -m100 -o out/test -v data/test.dat
./src/ldadf -n2 -m100 -o out/test -v -d data/test.dnf -e10 data/test.dat

optional arguments
  -o    output path (prefix for .phi/.theta/.dti/.smp)
  -n    number of topics
  -a    hyperparameter alpha of document-topic distribution theta
  -b    hyperparameter beta of topic-word distribution phi
  -m    maximum number of training steps
  -l    number of inner loops
  -u    number of burn-in steps
  -c    stop training if perplexity does not change
  -s    seed of random function
  -v    verbose mode
  -d    file (.dnf) including compiled dnf from constraint linkes
  -e    strength parameter eta of constraint links
  -h    print this message

We can run this program as follows.

$ cd src; make release; cd ..
$ ./src/ldadf -n2 -m100 -o out/test -v -d data/test.dnf -e10 data/test.dat
* Parameters
- data file: data/test.dat
- out base: out/test
- num topics: 2
* Initialization
- loading data/test.dat
# docs: 4
# words: 3
# terms: 16
- loading data/test.dnf
- tree: Np(0,1)
- tree: Ep(0,1)^Np(2)
# dtrees: 2
* Preprocessing
* Inference
- step 0: pp = 2.71123
wrote to out/test.step_0.*
- step 90: pp = 2.50231
wrote to out/test.step_90.*
wrote to out/*
* Finish


Viewer to check the learned parameters

$ python utils/
usage: [-h] [-p PREF] [-l FILE] [-d FILE] [-n N] [-t N] [-m N]
                 [-o FILE] [-v]

Viewer of learned parameters

python utils/ out/
python utils/ -p out/ -l data/test.lex -d data/test.dnf -n 1

optional arguments:
  -h, --help            show this help message and exit
  -p PREF, --param PREF
                        prefix of path to learned parameters (.phi, .dti,
  -l FILE, --lex FILE   file (.lex) for lexicons in .dat file
  -d FILE, --dnf FILE   file (.dnf) for compliled dnf of primitives
  -n N, --num_words N   maximum number of words to be displayed
  -t N, --num_topics N  maximum number of topics to be displayed
  -m N, --num_docs N    maximum number of docs to be displayed
  -o FILE, --out FILE   file (.fot) to save
  -v, --verbose         verbose mode

We can see the topic-word probabilities, the assignment of Dirichlet trees on topics, and so on, as follows.

$ python utils/ -p out/ -l data/test.lex -d data/test.dnf -v
- Word probabilities in each topic
A       0.663391
B       0.335790
C       0.000819
C       0.995146
A       0.002427
B       0.002427

- Assignment of dtrees on topics
dtrees: Ep(A,B)&Np(C) | Np(A)&Np(B)
topic-0: Ep(A,B)&Np(C)
topic-1: Np(A)&Np(B)

- Topic probabilities in each document
topic-0 0.870141
topic-1 0.129859
topic-0 0.635590
topic-1 0.364410
topic-0 0.870141
topic-1 0.129859
topic-0 0.635590
topic-1 0.364410

- Word-topic counts (c_{wz}) of the last sample
         A      B       C
topic-0: 8      4       0
topic-1: 0      0       4

Data Format


This file includes data sequences in a dataset. One line represents one document expressed as a space-separated sequence of colon-separated (word-id, freq) pairs. The following example is the above-mentioned toy dataset ([[A,A,B,B], [A,A,C,C], [A,A,B,B], [A,A,C,C]]).

0:2 1:2 
0:2 2:2 
0:2 1:2 
0:2 2:2 


This file includes the dictionary of .dat file. One line represents one word, which corresponds to the line number as word-id.



This file includes a compiled constraint (DNF). One line represents one conjunction that consists of semicolon-separated Eps and Np. Eps are separated with colons, and words are separated with commas. The following example expresses the DNF, Np(A,B) | Ep(A,B) & Np(C), compiled from a constraint, ML(A,B) & CL(B,C).



    title = "Topic Models with Logical Constraints on Words",
    author = "Kobayashi, Hayato and Wakaki, Hiromi and Yamasaki, Tomohiro and Suzuki, Masaru",
    booktitle = "Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing",
    year = "2011",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "33--40",


This software is released under the MIT License. See LICENSE.