# Posterior predictive distributions for text stimuli in the experiment


$$
\newcommand{\Prob}[1]{\mathrm{P}( #1 )}
\newcommand{\given}{\vert}
\newcommand{\text}{\mathrm{text}}
\newcommand{\xtext}{w_1, w_2 \ldots w_n}
$$    

The following calculates the posterior predictive distribution over words, conditioned on a text.

Given a text $\text$, the posterior predictive distribution is, informally speaking, the distribution over words that are consistent with the discourse topics of $\text$. It is calculated as follows:
$$
\begin{align}
\mathrm{P}(w \given \phi, \text, a, m) &= \int \mathrm{P}(w \given \phi, \pi) \mathrm{P}(\pi \given \text, \phi, a, m) d\pi,\\
&= \int \big[ \sum_{\{x\}} \mathrm{P}(w \given \phi, x)\mathrm{P}(x \given \pi) \big] \mathrm{P}(\pi \given \text, \phi, a, m) d\pi
\end{align}
$$
where $\mathrm{P}(\pi \given \text, \phi, a, m)$ is the posterior distribution over the topic distribution $\pi$ of text $\text$, and $\phi$, $a$, $m$ are the parameters of the topic model ($\phi$ is the set of $K$ component topics, and $a$, $m$ are the hyper-parameters of the Dirichlet prior over the per-document mixing distribution).

Note that the posterior $\Prob{\pi \given \text, \phi, a, m}$ has to be inferred by sampling, for which we use a Gibbs sampler. Specifically, we iteratively sample as follows:
$$
\begin{aligned}
x_{1:n} &\sim \Prob{x_{1:n} \given w_{1:n}, \phi, \pi},\\
\pi &\sim \Prob{\pi \given x_{1:n}, a, m}.
\end{aligned}
$$
The probability distributions in both of these steps can be calculated exactly.  
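
As a rough illustration of the two-step scheme above, here is a minimal NumPy sketch. It assumes that the Dirichlet prior over $\pi$ has parameter vector $a \cdot m$ (concentration $a$, base measure $m$), which the text does not state explicitly, and that every word has non-zero probability under at least one topic. The function name `gibbs_posterior_predictive` and its arguments are hypothetical; this is not the implementation behind `topicmodels.PosteriorPredictive`. The posterior predictive is estimated by averaging $\sum_k \pi_k \phi_{kw}$ over the retained samples of $\pi$.

```python
import numpy as np


def gibbs_posterior_predictive(words, phi, a, m, iterations=2000, burn_in=500, seed=0):
    """Estimate P(w | phi, text, a, m) for a single text.

    words : sequence of vocabulary indices w_1 ... w_n
    phi   : (K, V) array; row k is topic k's distribution over words
    a, m  : concentration (scalar) and base measure (length-K vector summing to 1);
            the Dirichlet(a * m) parameterisation is an assumption of this sketch
    """
    rng = np.random.RandomState(seed)
    words = np.asarray(words)
    K, V = phi.shape
    pi = rng.dirichlet(a * m)            # initialise pi from the prior
    predictive = np.zeros(V)
    for iteration in range(burn_in + iterations):
        # Step 1: x_i ~ P(x_i | w_i, phi, pi), proportional to phi[k, w_i] * pi[k].
        weights = phi[:, words].T * pi                      # shape (n, K)
        weights /= weights.sum(axis=1, keepdims=True)
        cumulative = weights.cumsum(axis=1)
        u = rng.uniform(size=(len(words), 1))
        # Inverse-CDF draw; the minimum guards against rounding in the last bin.
        x = np.minimum((u > cumulative).sum(axis=1), K - 1)
        # Step 2: pi ~ P(pi | x, a, m) = Dirichlet(a * m + topic counts).
        counts = np.bincount(x, minlength=K)
        pi = rng.dirichlet(a * m + counts)
        if iteration >= burn_in:
            predictive += pi.dot(phi)                       # sum_k pi_k * phi_k
    return predictive / iterations
```

Both conditional distributions are available in closed form here, which is why each step is a direct draw rather than an accept/reject move.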

## Preface


This notebook depends on existing data sets and ultimately produces a new one; see below for details. The Gibbs sampler described above is relatively expensive and must be run for each of the 50 text stimuli. For this, we use an ipyparallel cluster on a 16-core machine. Each sampler is seeded, and the seeds are generated from a master seed, so the results can be reproduced. For convenience, the sampler results are cached, and the notebook can be set to skip the sampling and read directly from the cache. When this is done, the notebook completes all cells in less than a minute. Otherwise, it takes between one and a few days (the exact running time is still to be confirmed).
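
The seeding scheme can be illustrated in a few lines. This sketch follows the idea of drawing one integer seed per text from a `numpy.random.RandomState` initialised with the master seed; the helper name `derive_seeds` and the text names are hypothetical.

```python
import numpy


def derive_seeds(master_seed, names, low=1001, high=10001):
    """Draw one integer seed per name from a generator seeded with master_seed."""
    random = numpy.random.RandomState(master_seed)
    seeds = random.randint(low, high, size=len(names))
    return dict(zip(names, (int(s) for s in seeds)))


names = ['text_0', 'text_1', 'text_2']   # hypothetical text names
seeds = derive_seeds(1001, names)

# The same master seed always yields the same per-text seeds.
assert seeds == derive_seeds(1001, names)
```

Because seeds are assigned by position, passing the names in a fixed order (for example sorted) keeps the assignment independent of dictionary ordering.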

In [1]:
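# standard library imports
import os
import errno
import datetime
from itertools import cycle

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# third-party imports
import numpy
import pandas

# local imports (note: this binds the name `utils` to `utils.utils`, not to the top-level package)
from utils import topicmodels, utils, datautils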

## Check required files

We need data and some MCMC state samples for an HDPMM topic model. Later, we will need the corpus vocabulary and the results of the recall experiment. We will also check whether we have the cached result of the MCMC sampler.

In [2]:
cache_directory = '../cache'
cache_fullpath = lambda path: os.path.join(cache_directory, path)

mcmc_sampler_seed = 1001
cached_sampler_result = 'posterior_predictions.%d.pkl' % mcmc_sampler_seed

filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'corpus' : [('bnc_78723408_250_500_49328.npz.bz2', 
                 'b9d828f7697871e01a263b8f3978911c70ff45cab9af4c86fbb43c3baef969d9')],
    'mcmc_samples' : [('hdptm_061216085831_7090_state_12946.npz.bz2', 
                       '9ba9850ff51fd60b679fd2af85cbaa4b3d69a2f31f4a0705475c0fffe3374330')],
    'bnc_vocab' : [('bnc_vocab_49328.txt',
                    '55737507ea9a2c18d26b81c0a446c074c6b8c72dedfa782c763161593e6e3b97')],
    'recall_results' : [('brisbane_06b643a_recall_results.pkl',
                         'a94d812373123b9a8b1eac848276e8ffc6a563ebca71ff2bf5adc97c825cbc14')],
    'cached_sampler_result' : [(cached_sampler_result,
                            '5b739f338e087e2130bf72cfd70c0702b617f77841cdb6d5b5f70988aead718d')]
    
}

utils.verify_cache_files(filenames['experiment_cfg'] +\
                         filenames['corpus'] +\
                         filenames['mcmc_samples'] +\
                         filenames['bnc_vocab'] +\
                         filenames['recall_results'] +\
                         filenames['cached_sampler_result'],
                         cache=cache_directory,
                         verbose=False)
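
The hex strings paired with each file name are 64 characters long, which is consistent with SHA-256 digests. The implementation of `utils.verify_cache_files` is not shown here, so the sketch below is only an assumption about the kind of check it performs, written with the standard `hashlib` module. The helper names `sha256_checksum` and `verify_files` are hypothetical.

```python
import hashlib
import os


def sha256_checksum(path, chunk_size=2 ** 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(files, cache):
    """Return the names in `files` (name, expected digest pairs) that are missing or altered."""
    failures = []
    for name, expected in files:
        path = os.path.join(cache, name)
        if not os.path.exists(path) or sha256_checksum(path) != expected:
            failures.append(name)
    return failures
```

An empty list returned by `verify_files` would mean that every listed file is present in the cache directory and matches its recorded digest.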

Now load up the corpus and one of the state samples.

In [3]:
corpus_data = utils.loadnpz(filenames['corpus'][0][0], 
                               cache=cache_directory,
                               verbose=False)

state = utils.loadnpz(filenames['mcmc_samples'][0][0],
                         cache=cache_directory,
                         verbose=False)

Create a topic model using the above data set and parameter state. This is for the purpose of doing the posterior predictive inference.

In [4]:
model = topicmodels.PosteriorPredictive(corpus_data, state, verbose=True)

Get all the texts used as stimuli in the experiment.

In [5]:
texts = topicmodels.get_experiment_texts('Brismo.cfg', cache=cache_directory)

## Gibbs sampling for the posterior predictions

If the sampling is to be run again, i.e. the cached result is not being used, make sure that an ipyparallel cluster is running, e.g. by executing `ipcluster start -n 16`.

In [6]:
use_cached_result = True

if not use_cached_result:
    
    from ipyparallel import Client

    clients = Client()

    clients.block = True

    clients[:].push(dict(
        model=model,
        texts=texts)
    );

    view = clients.load_balanced_view()

    random = numpy.random.RandomState(mcmc_sampler_seed)
    arguments = zip(texts.keys(), 
                    random.randint(1001, 10001, size=len(texts.keys())))

    func = lambda argument : (argument[0],
                              model.posterior_prediction(texts[argument[0]], 
                                                         seed=argument[1], 
                                                         burn_in_iterations=100000, 
                                                         iterations=50000, 
                                                         max_attempts_to_converge=15))

    _posterior_predictions = view.map(func, arguments)

    posterior_predictions = {}
    for x, y in _posterior_predictions:
        posterior_predictions[x] = y[0]

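    # Collect the texts whose sampler did not converge, judged by the second
    # element of each result (which appears to be a convergence statistic).
    failed_to_converge = []
    for x,y in _posterior_predictions:
        if numpy.round(y[1], 2) > 1.01:
            failed_to_converge.append(x)

    # Retry only if something failed. Without this guard, the check below
    # would raise even when every chain had converged.
    if failed_to_converge:

        # Spread the failed texts over the engines, each with a fresh seed.
        arguments_retry = []
        for name in cycle(failed_to_converge):
            arguments_retry.append((name, random.randint(101, 1001)))
            if len(arguments_retry) >= len(clients):
                break

        func = lambda argument : (argument[0],
                                  model.posterior_prediction(texts[argument[0]], 
                                                             seed=argument[1], 
                                                             burn_in_iterations=100000, 
                                                             iterations=50000, 
                                                             max_attempts_to_converge=2))

        posterior_predictions_retry = view.map(func, arguments_retry)

        # Keep the first converged retry for each failed text.
        still_failed = set(failed_to_converge)
        for x,y in posterior_predictions_retry:
            if x in still_failed and y[1] < 1.01:
                posterior_predictions[x] = y[0]
                still_failed.remove(x)

        if still_failed:
            raise Exception('No chain converged for: %s' % ', '.join(sorted(still_failed)))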
        
    timestamp = datetime.datetime.now().strftime('%Y.%m.%d.%s')

    with open(cache_fullpath(cached_sampler_result), 'wb') as f:
        pickle.dump(posterior_predictions, f, protocol=2)
        
else:
    
    with open(cache_fullpath(cached_sampler_result), 'rb') as f:
        posterior_predictions = pickle.load(f)

## View posterior predictions

In [7]:
for text_name in sorted(posterior_predictions.keys(), key=lambda args: int(args.split('_')[1])):
    print(text_name)
    print('-'*len(text_name))
    print(texts[text_name])
    print('='*10)
    print(topicmodels.topic2str(posterior_predictions[text_name], corpus_data['vocabulary'], K=100))
    print('='*50)
    print('')

text_0
------
‘I don't know what I did without it’ is the sentiment. There is a difference
between fully automatic washing machines — which change the nature of the task
altogether — and ‘twin tub’ machines where the hot wet washing has to be
lifted manually into a separate drying compartment. The women who had this type
of machine complained about the considerable amount of work still required of
the housewife, and the mess on the kitchen floor to be cleared up afterwards. In
a similar way the launderette does not remove the physical drudgery of washing.
The housewife has to get the washing there in the first place, she has to unload
it, sort it, sit and watch it wash and dry (or dash out to shop in the interim)
and then pack it all up again. This, when there is a baby in the pram and a two-
or three-year-old to attend to, is no mean feat.
housework,water,housewife,machine,washing,women,bath,clean,machines,time,job,home,clothes,day,cleaning,wash,domestic,housewives,bathroom,wife,house
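
The comma-separated list at the end of each entry appears to give the most probable words under that text's posterior predictive distribution. The implementation of `topicmodels.topic2str` is not shown, so the following is only a hedged sketch of how such a list could be produced with `numpy.argsort`; the helper `top_words` and the toy probabilities are hypothetical.

```python
import numpy


def top_words(probabilities, vocabulary, K=10):
    """Return the K most probable words, most probable first, as one string."""
    probabilities = numpy.asarray(probabilities)
    top = numpy.argsort(probabilities)[::-1][:K]
    return ','.join(vocabulary[i] for i in top)


# Toy distribution over a four-word vocabulary (illustrative values only).
print(top_words([0.1, 0.6, 0.05, 0.25], ['clean', 'washing', 'floor', 'machine'], K=3))
# prints: washing,machine,clean
```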

## Create csv file with predictions for recalled words 

Here, we'll create a special file for use with the logistic regression modelling of the recall memory results. This file is a $J \times (V^\prime + 1)$ matrix, written as a csv file, where $J$ is the number of texts and $V^\prime$ is the number of distinct recalled words, across all participants, that are also in the training corpus vocabulary. The matrix rows must sum to 1.0. However, because the $V^\prime$ words are a subset of the $V$ words in the corpus vocabulary, their posterior probabilities do not sum to 1.0. Word $V^\prime + 1$ therefore essentially signifies any remaining word, and the probability mass assigned to it is equal to 1.0 minus the total mass assigned to the $V^\prime$ words. We label this word the `ALTERNATIVE_WORD`.
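
A small numerical example may make the construction concrete. The probabilities below are made up for illustration; only the final step, appending one minus the row sums as an extra column, corresponds to what the notebook does.

```python
import numpy

# Toy posterior predictive probabilities for J = 2 texts over V' = 3 recalled words.
recalled = numpy.array([[0.10, 0.05, 0.02],
                        [0.01, 0.20, 0.04]])

# Mass left over for every word that is not among the recalled words.
alternative = 1 - recalled.sum(axis=1)

# Append it as column V' + 1, so that each row sums to 1.0.
matrix = numpy.c_[recalled, alternative]

assert matrix.shape == (2, 4)
assert numpy.allclose(matrix.sum(axis=1), 1.0)
```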

In [8]:
with open(cache_fullpath('bnc_vocab_49328.txt')) as f:
    vocabulary = f.read().split()
vocab = datautils.Vocab(vocabulary)

# Recalled words (across all participants) that are also in the corpus vocabulary.
recall_results = pandas.read_pickle(cache_fullpath('brisbane_06b643a_recall_results.pkl'))
recalled_words = sorted(set(recall_results['word'].values).intersection(vocabulary))

In [9]:
predictive_probabilities = []

text_names = sorted(posterior_predictions.keys(), key=lambda arg: int(arg.split('_')[1]))

# One row per text: the predictive probability of each recalled word.
for text_name in text_names:
    row = []
    for word in recalled_words:
        row.append(posterior_predictions[text_name][vocab.word2index[word]])
    predictive_probabilities.append(row)

predictive_probabilities = numpy.array(predictive_probabilities)

# Append the ALTERNATIVE_WORD column so that every row sums to 1.0.
predictive_probabilities = numpy.c_[predictive_probabilities, 1-predictive_probabilities.sum(1)]

# The header has no entry for the leading text-name column.
header = ','.join(recalled_words + ['ALTERNATIVE_WORD'])

# Each row starts with the text name, followed by the V' + 1 probabilities.
M = [header]
for i,row in enumerate(predictive_probabilities):
    M.append(text_names[i] + ',' + ','.join(map(str, row)))
M = '\n'.join(M)

posterior_predictions_of_recalled_words = 'posterior_predictions_of_recalled_words.csv'

with open(cache_fullpath(posterior_predictions_of_recalled_words), 'w') as csv_file:
    csv_file.write(M)

# Verify the integrity of the exported csv file.
assert utils.checksum(cache_fullpath(posterior_predictions_of_recalled_words))\
    == 'e4235327a826ac176e1529d22c2a3011049aa6a35266fe9aa727fed22d32f345'