# Calculate posterior predictions
$$
\newcommand{\given}{\vert}
\newcommand{\text}{\mathrm{text}}
\newcommand{\xtext}{w_1, w_2 \ldots w_n}
$$    

The following calculates the posterior predictive distribution over words conditioned on a text.

Given a text $\text$, the posterior predictive distribution is, informally speaking, the distribution over words that are consistent with the discourse topics of $\text$. It is calculated as follows:
$$
\begin{align}
\mathrm{P}(w \given \phi, \text, a, m) &= \int \mathrm{P}(w \given \phi, \pi) \mathrm{P}(\pi \given \text, a, m) d\pi,\\
&= \int \big[ \sum_{\{x\}} \mathrm{P}(w \given \phi, x)\mathrm{P}(x \given \pi) \big] \mathrm{P}(\pi \given \text, a, m) d\pi
\end{align}
$$
where $\mathrm{P}(\pi \given \text, a, m)$ is the posterior distribution over the topic distribution $\pi$ of text $\text$, $\phi$ is the set of $K$ component topics, $x$ ranges over the latent topic assignments of the word $w$, and $a$ and $m$ are the hyperparameters of the Dirichlet prior over the per-document mixing distribution.
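
As a rough illustration of the calculation above: for a fixed $\pi$, the inner sum over topics is just $\sum_k \pi_k \phi_{kw}$, and the integral over $\pi$ can be approximated by averaging that quantity over samples of $\pi$ drawn from its posterior. The following is a minimal sketch with made-up numbers, not the routine used by `topicmodels`. The names `phi`, `pi_samples` and `posterior_predictive` are hypothetical, and the Dirichlet draw merely stands in for real posterior samples.

```python
import numpy

def posterior_predictive(phi, pi_samples):
    """Monte Carlo estimate of P(w | phi, text, a, m).

    phi        : (K, V) array; row k is topic k's distribution over V words.
    pi_samples : (S, K) array; row s is one sampled topic distribution.
    """
    # Inner sum over topics x: P(w | phi, pi) = sum_k pi[k] * phi[k, w].
    per_sample = pi_samples.dot(phi)        # shape (S, V)
    # Outer integral over pi, approximated by averaging over the samples.
    return per_sample.mean(axis=0)          # shape (V,)

# Toy example: K = 2 topics over a V = 3 word vocabulary (made-up values).
phi = numpy.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])

# Stand-in for posterior samples of pi; a real sampler would supply these.
rng = numpy.random.RandomState(0)
pi_samples = rng.dirichlet([5.0, 1.0], size=1000)

prediction = posterior_predictive(phi, pi_samples)
print(prediction)
```

Because each row of `phi` and each row of `pi_samples` sums to one, the resulting vector is itself a probability distribution over the vocabulary.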
 

In [1]:
import os
import errno
import utils
import numpy

from utils import topicmodels

from itertools import cycle

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

import datetime

## Download files for the topic model

Get the data and some MCMC state samples for an HDPMM topic model. For each file, we provide its sha256 hash to check its integrity. If the files have already been downloaded, `curl` will not download them again but will just check their integrity. Downloaded files that are bz2 compressed are uncompressed unless uncompressed versions already exist in the `cache_directory`.
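
The integrity check and the conditional decompression described above can be sketched with the standard library alone. This is only an illustration of the idea, not the implementation inside `topicmodels.curl`; the helper names `sha256_of` and `check_and_uncompress` are hypothetical, and the sketch assumes the published hash refers to the file exactly as downloaded (that is, the compressed file).

```python
import bz2
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in chunks so large corpus files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def check_and_uncompress(path, expected_sha256):
    """Verify `path` against its hash, then uncompress it if it ends in .bz2.

    Returns the path of the usable (uncompressed) file.
    """
    if sha256_of(path) != expected_sha256:
        raise IOError('sha256 mismatch for %s' % path)
    if not path.endswith('.bz2'):
        return path
    target = path[:-len('.bz2')]
    # Skip the work if an uncompressed copy is already in the cache.
    if not os.path.exists(target):
        with bz2.BZ2File(path, 'rb') as src, open(target, 'wb') as dst:
            dst.write(src.read())
    return target
```

Under these assumptions, a call such as `check_and_uncompress(os.path.join(cache_directory, name), sha256)` for each `(name, sha256)` pair in `filenames` would raise on a corrupted download and otherwise return the path of the file to load.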

In [2]:
url_root = 'http://www.lawsofthought.org/shared'

cache_directory = '_cache'

filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'corpus' : [('bnc_78723408_250_500_49328.npz.bz2', 
                 'b9d828f7697871e01a263b8f3978911c70ff45cab9af4c86fbb43c3baef969d9')],
    'mcmc_samples' : [('hdptm_061216085831_7090_state_12946.npz.bz2', 
                       '9ba9850ff51fd60b679fd2af85cbaa4b3d69a2f31f4a0705475c0fffe3374330')]
}

topicmodels.curl(url_root, 
                 filenames['experiment_cfg'] + filenames['corpus'] + filenames['mcmc_samples'], 
                 cache=cache_directory,
                 verbose=False)

Now load up the corpus and one of the state samples.

In [3]:
corpus_data = topicmodels.load(filenames['corpus'][0][0], 
                               cache=cache_directory,
                               verbose=False)

state = topicmodels.load(filenames['mcmc_samples'][0][0],
                         cache=cache_directory,
                         verbose=False)

In [4]:
texts = topicmodels.get_experiment_texts('Brismo.cfg', cache=cache_directory)

In [5]:
model = topicmodels.PosteriorPredictive(corpus_data, state, verbose=True)

In [6]:
from ipyparallel import Client

clients = Client()

clients.block = True

clients[:].push(dict(
    model=model,
    texts=texts)
);

view = clients.load_balanced_view()

In [7]:
random = numpy.random.RandomState(1001)
# Pair each text name with its own seed; list() keeps this a sequence under Python 3.
arguments = list(zip(texts.keys(), 
                     random.randint(1001, 10001, size=len(texts))))

In [8]:
func = lambda argument : (argument[0],
                          model.posterior_prediction(texts[argument[0]], 
                                                     seed=argument[1], 
                                                     burn_in_iterations=100000, 
                                                     iterations=50000, 
                                                     max_attempts_to_converge=15))

_posterior_predictions = view.map(func, arguments)

posterior_predictions = {}
for x, y in _posterior_predictions:
    posterior_predictions[x] = y[0]

In [9]:
failed_to_converge = []
for x,y in _posterior_predictions:
    if numpy.round(y[1], 2) > 1.01:
        failed_to_converge.append(x)

arguments_retry = []
for name in cycle(failed_to_converge):
    arguments_retry.append((name, random.randint(101, 1001)))
    if len(arguments_retry) >= len(clients):
        break

func = lambda argument : (argument[0],
                          model.posterior_prediction(texts[argument[0]], 
                                                     seed=argument[1], 
                                                     burn_in_iterations=100000, 
                                                     iterations=50000, 
                                                     max_attempts_to_converge=2))

posterior_predictions_retry = view.map(func, arguments_retry)
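
The retry scheme in the cell above cycles through the texts whose chains failed to converge and pairs each with a fresh seed until there is one job per engine. A small standalone sketch of that pairing is shown here; the function name `build_retry_arguments` and the text names are made up for illustration.

```python
from itertools import cycle

import numpy

def build_retry_arguments(failed, n_engines, random):
    """Pair failed text names with fresh seeds, one job per engine."""
    arguments = []
    for name in cycle(failed):          # yields nothing if `failed` is empty
        arguments.append((name, random.randint(101, 1001)))
        if len(arguments) >= n_engines:
            break
    return arguments

random = numpy.random.RandomState(1001)
retry = build_retry_arguments(['text_3', 'text_7'], 5, random)
print([name for name, seed in retry])
# prints ['text_3', 'text_7', 'text_3', 'text_7', 'text_3']
```

Note that when no text failed to converge, the list of retry jobs is empty, so any code that consumes the retry results should be prepared for that case.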

In [10]:
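# Only look for a replacement chain if some text failed to converge;
# without this guard the `else` branch raises even when nothing failed.
if failed_to_converge:
    for x, y in posterior_predictions_retry:
        # y[1] is compared against the same 1.01 threshold used above.
        if y[1] < 1.01:
            posterior_predictions[x] = y[0]
            break
    else:
        raise Exception('No chain converged.')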

In [11]:
for text_name in sorted(posterior_predictions.keys(), key=lambda args: int(args.split('_')[1])):
    print(text_name)
    print('-'*len(text_name))
    print(texts[text_name])
    print('='*10)
    print(topicmodels.topic2str(posterior_predictions[text_name], corpus_data['vocabulary']))
    print('='*50)
    print('')

text_0
------
‘I don't know what I did without it’ is the sentiment. There is a difference
between fully automatic washing machines — which change the nature of the task
altogether — and ‘twin tub’ machines where the hot wet washing has to be
lifted manually into a separate drying compartment. The women who had this type
of machine complained about the considerable amount of work still required of
the housewife, and the mess on the kitchen floor to be cleared up afterwards. In
a similar way the launderette does not remove the physical drudgery of washing.
The housewife has to get the washing there in the first place, she has to unload
it, sort it, sit and watch it wash and dry (or dash out to shop in the interim)
and then pack it all up again. This, when there is a baby in the pram and a two-
or three-year-old to attend to, is no mean feat.
housework,water,housewife,machine,washing,women,bath,clean,machines,time,job,home,clothes,day,cleaning,wash,domestic,housewives,bathroom,wife,house

In [12]:
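# Note: '%s' (seconds since the epoch) is a platform-specific strftime extension.
timestamp = datetime.datetime.now().strftime('%Y.%m.%d.%s')

# Write into the same cache directory that holds the downloaded files.
output_path = os.path.join(cache_directory, 'posterior_predictions.%s.pkl' % timestamp)
with open(output_path, 'wb') as f:
    pickle.dump(posterior_predictions, f, protocol=2)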