# Predicted Word Associates of Texts, based on Word Association Norms

Here, we calculate the predicted word associates of the texts used in the text memory experiment. These predictions can be used to test a word association account of text memory.

## A probabilistic model of Word Association Data

In a typical word association task, participants are given a word and asked to produce words that are related to it. For example, for the word "house", the participant might give the words "door", "window", "garden", etc. For a given stimulus word $w_j$, and combining responses across all participants, we can list the response words as $r_{j1}, r_{j2} \ldots r_{jn_j}$. A simple probabilistic generative model of these responses is as follows:
$$
r_{ji} \sim \mathrm{dcat}(\pi_j), \quad \text{for $i \in \{1, 2 \ldots n_j\}$},
$$
where $\pi_j$ is a categorical probability distribution over the $V$ word types in the language. A hierarchical or multilevel model would model each $\pi_j$ itself as being drawn from a Dirichlet distribution, i.e. 
$$
\pi_j \sim \mathrm{ddirich}(\alpha)
$$
where $\alpha$ are the (hyper)-parameters of the Dirichlet distribution, which can also be written as $\alpha = a\cdot m$, where $a > 0 $ is the scalar concentration parameter and the categorical distribution $m$ is the location parameter.
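
To make the generative story concrete, the following minimal sketch simulates responses to a single stimulus word. The vocabulary size, the values of `a` and `m`, the number of responses and the seed are all illustrative assumptions and are not taken from the analysis below.

```python
import numpy

# Illustrative values only: a toy vocabulary of V = 5 word types.
V = 5
a = 2.0                                        # scalar concentration parameter
m = numpy.array([0.4, 0.3, 0.15, 0.1, 0.05])   # location parameter (sums to 1)

rng = numpy.random.RandomState(101)

# pi_j ~ Dirichlet(a * m): the response distribution for one stimulus word.
pi_j = rng.dirichlet(a * m)

# r_ji ~ Categorical(pi_j), for i = 1 ... n_j.
n_j = 50
responses = rng.choice(V, size=n_j, p=pi_j)

# Count how often each word type was produced as a response.
counts = numpy.bincount(responses, minlength=V)
```

Larger values of `a` pull each `pi_j` closer to `m`, while smaller values let the response distributions of different stimulus words differ more from one another.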

The posterior distribution over $\pi_j$ is
$$
\mathrm{P}(\pi_j \vert \mathcal{D}) = \int \mathrm{P}(\pi_j \vert \mathcal{D}, a, m)\mathrm{P}(a, m \vert \mathcal{D})\, da\, dm.
$$
Here, $\mathcal{D}$ signifies the observed data, and
$$
\mathrm{P}(\pi_j \vert \mathcal{D}, a, m) = \mathrm{Dirichlet}(\pi_j ; R_j + a \cdot m),
$$
where $R_j = R_{j1}, R_{j2} \ldots R_{jk} \ldots R_{jV}$ and $R_{jk}$ is the frequency with which word $w_k$ is produced as an associate of word $w_j$.
We can approximate $\mathrm{P}(\pi_j \vert \mathcal{D})$ as follows:
$$
\begin{align}
\mathrm{P}(\pi_j \vert \mathcal{D}) &= \int \mathrm{P}(\pi_j \vert \mathcal{D}, a, m)\mathrm{P}(a, m \vert \mathcal{D})\, da\, dm,\\
&\approx \mathrm{P}(\pi_j \vert \mathcal{D}, \hat{a}, \hat{m})
\end{align}
$$
where $\hat{a}$ and $\hat{m}$ are the posterior means of $a$ and $m$ under $\mathrm{P}(a, m \vert \mathcal{D})$.


With this model, the probability of responding with word $w_k$ given word $w_j$ as a stimulus word is simply
$$
\mathrm{P}(w_k \vert w_j) = \int \mathrm{P}(w_k \vert w_j, \pi_j) \mathrm{P}(\pi_j \vert  \mathcal{D})\, d\pi_j,
$$
which, with $a$ and $m$ set to their posterior means $\hat{a}$ and $\hat{m}$, can be calculated in closed form as 
$$
\mathrm{P}(w_k \vert w_j) = \frac{R_{jk} + a\cdot m_k}{a + \sum_{k'=1}^V R_{jk'}}.
$$
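
As a small worked example of this closed form, the sketch below computes the predictive probabilities for one stimulus word. The function name `predictive_probabilities` and the toy counts are hypothetical and chosen only for illustration.

```python
import numpy

def predictive_probabilities(R_j, a, m):
    """Return P(w_k | w_j) for every k, given the count vector R_j."""
    R_j = numpy.asarray(R_j, dtype=float)
    m = numpy.asarray(m, dtype=float)
    # (R_jk + a * m_k) / (a + sum over k' of R_jk')
    return (R_j + a * m) / (a + R_j.sum())

# Toy example: four word types, four observed responses.
R_j = [3, 1, 0, 0]
a = 2.0
m = [0.25, 0.25, 0.25, 0.25]

p = predictive_probabilities(R_j, a, m)
```

Here `p` works out to `[3.5/6, 1.5/6, 0.5/6, 0.5/6]`. The two word types that were never produced as responses still receive nonzero probability, which is the smoothing effect of the Dirichlet prior.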

Given a text 
$$ 
\textrm{text} \triangleq w_{1}, w_{2} \ldots w_{n}, 
$$
the predicted probability that word $w_k$ is associated with the text is 
$$
\mathrm{P}(w_k \vert \textrm{text}) = \frac{1}{n} \sum_{j = 1}^{n} \mathrm{P}(w_k \vert w_{j}).
$$
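
The average over the words of a text can be sketched as follows. The matrix `P` and the index list `text` are toy assumptions: row $j$ of `P` stands for $\mathrm{P}(\cdot \vert w_j)$, and the text is given as a list of stimulus word indices.

```python
import numpy

# Toy matrix: row j holds P(. | w_j), so every row sums to 1.
P = numpy.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])

# A "text" as a sequence of stimulus word indices; repeated words count twice.
text = [0, 0, 2]

# P(w_k | text) = (1/n) * sum over j of P(w_k | w_j)
p_text = P[text].mean(axis=0)
```

Because each row of `P` is a probability distribution, `p_text` is one too. A word that occurs several times in the text contributes once per occurrence.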

## Preparation

Import modules, define classes and functions, and download and prepare the data files needed for this analysis.

In [1]:
%matplotlib inline

from __future__ import division

from matplotlib import pyplot
import matplotlib.pylab as pylab

from collections import defaultdict

import os

import configobj
import numpy
import pandas
import cPickle as pickle

from utils import utils, topicmodels

from gustav.samplers import fortransamplers

from numpy.random import randint, dirichlet, rand
from numpy import ones, zeros, unique, empty

import string

In [2]:
class MakeVocab(object):
    
    '''
    A class to make a vocabulary.
    '''
    
    @classmethod
    def new(cls, word_list):
        
        return cls(word_list)
    
    def __init__(self, word_list):
        
        self.V = len(word_list)
    
        self.word2index = {w:i for i,w in enumerate(word_list)}
        self.index2word = {i:w for i,w in enumerate(word_list)}

        
def text_to_words(text, vocabulary):
    '''Extract words from a text'''
    return [word 
            for word in utils.tokenize(text) 
            if word in vocabulary.word2index]


class WordAssociations(object):
    
    '''
    A class to store the word association data in a dict of dict format, e.g.
    
    associations['stimulus']['associate'] = count
    
    where `count` gives the number of times that `associate` was listed
    as an associate of `stimulus`.
    
    '''
    

    def __init__(self, word_associations_data):
        
        self.word_associations_data = word_associations_data
            
        self.build_associations()

    def build_associations(self):
        
        self.associations = defaultdict(lambda : defaultdict(int))
        
        for row in self.word_associations_data:
            
            subject, stimulus, assoc1, assoc2, assoc3 = row.split(';')

            for associate in (assoc1, assoc2, assoc3):
            
                self.associations[stimulus][associate] += 1


def update(args):
    '''Gibbs sampler update function.
    To be sent to parallel workers.'''
    polya_sampler = _get_sampler(args)
    return polya_sampler.update(args['iterations'])

def sample(args):
    '''Gibbs sampler sampling function. 
    To be sent to parallel workers.'''
    polya_sampler = _get_sampler(args)
    return polya_sampler.sample(args['iterations'], thin=args['thin'])

def _get_sampler(args):
    
    R = topicmodels.zeros((args['J'], args['V']))
    R[args['rows'], args['cols']] = args['values']
    
    # Use the supplied initial values, if any; otherwise pass an empty dict.
    inits = args.get('inits', {})
    
    return topicmodels.DirichletMultinomialCompound(R, inits=inits)


def gelman_diag(psi):
    
    '''The Gelman-Rubin convergence diagnostic.'''
    
    m, n = psi.shape
    
    B = n * numpy.var(psi.mean(1), ddof=1)
    
    W = psi.var(1, ddof=1).mean() 

    V = (n-1)/n * W + B/n    

    return numpy.sqrt(V/W)

def get_argslist(args, nchains=3, results=None):
    
    '''Set up arg list to be sent to parallel workers.'''
    
    if results is None:
        return [args]*nchains
    else:
        # Give each chain its own copy of args. [args]*len(results) would
        # alias one dict, so every chain would receive the last chain's inits.
        return [dict(args, inits=result) for result in results]

In [3]:
url_root = 'http://www.lawsofthought.org/shared'

cache_directory = '../_cache'
cache_fullpath = lambda path: os.path.join(cache_directory, path)

filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'corpus' : [('bnc_78723408_250_500_49328.npz.bz2', 
                 'b9d828f7697871e01a263b8f3978911c70ff45cab9af4c86fbb43c3baef969d9')]
}

utils.curl(url_root,
           filenames['experiment_cfg'] + filenames['corpus'], 
           cache=cache_directory,
           verbose=False)

stimuli = configobj.ConfigObj(cache_fullpath('Brismo.cfg'))['text_memoranda']

corpus_data = utils.loadnpz('bnc_78723408_250_500_49328.npz.bz2', 
                            cache=cache_directory,
                            verbose=False)

bnc_vocabulary = MakeVocab.new(word_list=corpus_data['vocabulary'])

texts = {}
for key,value in stimuli.items():
    texts[key] = text_to_words(value['text'], vocabulary=bnc_vocabulary)  

The following assumes that the file `associations_en_05_01_2015.csv.bz2`, whose sha256 checksum is `06a527e5c9647f37a4a2ee0744a309f57f259e203238b87e0f466b74f7a6e63e`, is available in the `_cache` directory. 

This is a compressed csv file of word association norms collected at https://www.smallworldofwords.org/en and generously shared by Simon De Deyne (https://simondedeyne.me/). 

Unfortunately, I am not presently at liberty to share these data, so please contact either Simon De Deyne or Gert Storms to obtain them. 

In [4]:
word_associations_data = utils.loadcsv('associations_en_05_01_2015.csv.bz2', 
                                       cache=cache_directory)

We'll restrict the stimulus words to those that are in our BNC corpus vocabulary. The associate vocabulary is the union of the associate words that are in the BNC vocabulary and the recalled words that are in the BNC vocabulary. 

In [5]:
word_associations = WordAssociations(word_associations_data)

# Get the stimulus vocabulary 
stimulus_vocabulary_word_list = sorted([word.lower() 
                                        for word in word_associations.associations.keys() 
                                        if word in bnc_vocabulary.word2index])

stimulus_vocabulary = MakeVocab.new(stimulus_vocabulary_word_list)

# Get the association vocabulary
association_vocabulary = []
for stimulus in word_associations.associations:
    association_vocabulary.extend(
        [word.lower() for word in word_associations.associations[stimulus].keys() 
         if word.lower() in bnc_vocabulary.word2index]
    )

Df = {}
Df['recall'] = pandas.read_pickle(cache_fullpath('brisbane_06b643a_recall_results.pkl'))

recalled_words = sorted(set(map(string.lower, Df['recall']['word'].values)).intersection(corpus_data['vocabulary']))
  
association_vocabulary_word_list = sorted(set(association_vocabulary + recalled_words))

associate_vocabulary = MakeVocab.new(association_vocabulary_word_list)

The following creates a sparse (row, column, value) representation of the count matrix $R$. Stimulus–associate pairs with a word outside the two vocabularies are skipped.

In [6]:
rows = []
cols = []
values = []
for stimulus_word in word_associations.associations:
    for associate_word in word_associations.associations[stimulus_word]:
        
        try:
            
            j = stimulus_vocabulary.word2index[stimulus_word]
            k = associate_vocabulary.word2index[associate_word]

            value = word_associations.associations[stimulus_word][associate_word]

            rows.append(j)
            cols.append(k)
            values.append(value)
        
        except KeyError:
        
            pass
        
args = dict(rows = rows,
            cols = cols,
            values = values,
            J = stimulus_vocabulary.V,
            V = associate_vocabulary.V)     
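
As an illustration of how these triplets are later expanded into a dense matrix, here is a minimal sketch using plain `numpy.zeros`. The toy triplets and dimensions are assumptions, not values from the data.

```python
import numpy

# Toy triplets: R[0, 1] = 5, R[0, 3] = 2, R[2, 0] = 7.
rows = [0, 0, 2]
cols = [1, 3, 0]
values = [5, 2, 7]
J, V = 3, 4   # number of stimulus words, number of associate words

R = numpy.zeros((J, V))

# Integer-array indexing writes values[i] into R[rows[i], cols[i]].
R[rows, cols] = values
```

Note that this kind of assignment does not accumulate: if the same `(row, col)` pair occurred twice, only the last value would be kept. That is not a problem here, because the triplets are built from the keys of a dict of dicts and each pair therefore appears once.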

### Start Cluster 


In [7]:
# Start a cluster on the command line with "ipcluster start -n NCHAINS" 
# where NCHAINS is at least as large as the number of chains you are using in the sampler

from ipyparallel import Client

clients = Client()

clients.block = True

clients[:].push(dict(_get_sampler=_get_sampler));
    
with clients[:].sync_imports():
    from utils import topicmodels

view = clients.load_balanced_view()

importing topicmodels from utils on engine(s)


Update the Gibbs samplers. Convergence seems fast, so 1000 iterations should be more than enough for burn-in. Of course, we will check this below too.

In [8]:
args['iterations'] = 1000

results = view.map(update, get_argslist(args, nchains=3))

In [9]:
results

[{'b': 57.43008610933294,
  'c': 9082.968124977902,
  'psi': array([  3.89411886e-06,   1.39080836e-05,   2.57947939e-05, ...,
           1.60232471e-06,   5.99696632e-07,   9.28083264e-06])},
 {'b': 57.437841789043254,
  'c': 9218.723392788199,
  'psi': array([  5.25172384e-06,   8.20211914e-06,   1.36864105e-05, ...,
           3.46497564e-06,   2.26439891e-06,   8.70785626e-06])},
 {'b': 57.48626355278941,
  'c': 9257.17816973585,
  'psi': array([  5.61182204e-06,   8.02518121e-06,   2.46929404e-05, ...,
           2.41954770e-06,   6.81463237e-07,   8.98309333e-06])}]

Now, draw samples of `b`, `c` and `psi` from each chain.

In [10]:
args['iterations'] = 100
args['thin'] = 10

samples = view.map(sample, get_argslist(args, results=results))

Check diagnostics for `b` and `c` using Gelman-Rubin.

In [11]:
gelman_diag(numpy.array([samples[k]['c'] for k in (0, 1, 2)]))

0.99543353533485535

In [12]:
gelman_diag(numpy.array([samples[k]['b'] for k in (0, 1, 2)]))

0.99523360612496081
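
To show how this diagnostic behaves, the sketch below applies the same formula to synthetic chains. The chains, their lengths and the seed are assumptions for illustration only: one set of chains shares a common distribution, while the other has chains shifted to different locations.

```python
import numpy

def gelman_diag(psi):
    """Gelman-Rubin diagnostic for an (m chains) x (n draws) array."""
    m, n = psi.shape
    B = n * numpy.var(psi.mean(1), ddof=1)   # between-chain variance
    W = psi.var(1, ddof=1).mean()            # mean within-chain variance
    V = (n - 1.0) / n * W + B / n            # pooled variance estimate
    return numpy.sqrt(V / W)

rng = numpy.random.RandomState(42)

# Three chains that all sample from the same distribution.
mixed = rng.normal(0.0, 1.0, size=(3, 500))

# The same draws, but with each chain shifted to a different location.
stuck = mixed + numpy.array([[0.0], [3.0], [6.0]])

r_mixed = gelman_diag(mixed)
r_stuck = gelman_diag(stuck)
```

Values close to 1, as for `r_mixed` and for the `b` and `c` chains above, suggest that the chains agree with one another. Values well above 1, as for `r_stuck`, indicate that the chains have not converged to a common distribution.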

For `psi`, we will simply check whether the mean vectors of the chains are more or less identical, i.e. highly intercorrelated.

In [13]:
psi = numpy.array([numpy.array(samples[k]['psi']).mean(0) for k in (0, 1, 2)])
numpy.corrcoef(psi)

array([[ 1.        ,  0.99995385,  0.99995504],
       [ 0.99995385,  1.        ,  0.99995483],
       [ 0.99995504,  0.99995483,  1.        ]])

Now, average across the samples to obtain estimates of `b` and `psi`, which correspond to $a$ and $m$, respectively, in the description at the top of this page.

In [14]:
psi = psi.mean(0)

b = numpy.array([samples[k]['b'] for k in (0, 1, 2)]).mean()

bpsi = b * psi 

R = topicmodels.zeros((args['J'], args['V']))
R[args['rows'], args['cols']] = args['values']

smoothed_association_matrix = ((R + bpsi).T/(R + bpsi).sum(1)).T

Some more sanity checking. Sample some words, and make sure their associates according to the model match the original data.

In [15]:
def check_some_words(K, N, seed):
    random = numpy.random.RandomState(seed)
    
    for k in random.permutation(smoothed_association_matrix.shape[0])[:K]:
        
        smoothed_associates = ','.join([associate_vocabulary.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(smoothed_association_matrix[k]))[:N]])
        
        original_associates = ','.join([associate_vocabulary.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(R[k]))[:N]])
    
    
        print(stimulus_vocabulary.index2word[k].capitalize())
        print('Smoothed: ' + smoothed_associates)
        print('-'*10)
        print('Original: ' + original_associates)
        print('='*100)

check_some_words(25, 100, 1001)

Respond
Smoothed: answer,reply,question,talk,call,letter,response,speak,listen,communicate,react,return,email,retort,paper,test,quick,phone,write,pond,quickly,emergency,interview,rebuttal,reciprocate,school,time,sex,children,cat,words,smile,fight,anger,read,choice,flight,pen,voice,wave,correct,text,forward,key,respect,talking,communication,homework,argue,button,conversation,agree,professional,shout,waiting,message,hear,reaction,telephone,speaking,send,needy,command,awake,concern,acknowledge,automatic,request,comment,meetings,ambulance,controlled,feedback,invitation,attend,query,critique,immediately,interpret,coma,unresponsive,comeback,concur,validate,generate,resign,bicker,remit,adroit,paraphrase,resurrect,rejoinder,concisely,food,money,water,love,red,fun,bad
----------
Original: answer,reply,question,talk,call,letter,response,react,speak,listen,communicate,return,email,retort,interview,pond,rebuttal,paper,reciprocate,phone,quick,emergency,quickly,write,test,waiting,agree,read,resign,s

### Make predictions

The following computes 
$$
\mathrm{P}(w_k \vert \textrm{text}) = \frac{1}{n} \sum_{j = 1}^{n} \mathrm{P}(w_k \vert w_{j})
$$
for each text, where the average is taken over the words of the text that are in the stimulus vocabulary.

In [16]:
predicted_associates = {}
for text_name in stimuli:
    
    w = numpy.zeros_like(smoothed_association_matrix[0])
    
    for word in texts[text_name]:

        try:
            j = stimulus_vocabulary.word2index[word]
            w += smoothed_association_matrix[j]
        except KeyError:
            # Skip text words that are not in the stimulus vocabulary.
            pass
    
    predicted_associates[text_name] = w/w.sum()
 

Let's have a look at these predictions. 

In [17]:
for text_name in stimuli:
    print(stimuli[text_name]['text'])
    print('-'*26)
    print(','.join([associate_vocabulary.index2word[i] 
                    for i in numpy.flipud(numpy.argsort(predicted_associates[text_name]))[:100]]))
    print('='*50)

‘I don't know what I did without it’ is the sentiment. There is a difference
between fully automatic washing machines — which change the nature of the task
altogether — and ‘twin tub’ machines where the hot wet washing has to be
lifted manually into a separate drying compartment. The women who had this type
of machine complained about the considerable amount of work still required of
the housewife, and the mess on the kitchen floor to be cleared up afterwards. In
a similar way the launderette does not remove the physical drudgery of washing.
The housewife has to get the washing there in the first place, she has to unload
it, sort it, sit and watch it wash and dry (or dash out to shop in the interim)
and then pack it all up again. This, when there is a baby in the pram and a two-
or three-year-old to attend to, is no mean feat.
--------------------------
clothes,clean,water,machine,car,laundry,dishes,soap,money,bath,baby,job,food,cleaning,alike,chair,mother,cold,time,woman,store,run,wet

Get the predictions for each recalled word and for all words in the recognition tests.

In [18]:
stimuli_words = []

for text_name in stimuli:
    
    _, n = text_name.split('_')
    n = int(n)+1
    inwords = stimuli[text_name]['inwords'].split(',')
    outwords = stimuli[text_name]['outwords'].split(',')
    for word in inwords+outwords+recalled_words:
        try:
            p = predicted_associates[text_name][associate_vocabulary.word2index[word]]
            stimuli_words.append((str(n) + '-' + word, p))
        except KeyError:
            print('Unknown word in text %s: "%s"' % (text_name,word))

associations_predictions = dict(stimuli_words)

with open(cache_fullpath('word_associates_from_association_norms_mdc.pkl'), 'wb') as f:
    pickle.dump(associations_predictions, f, protocol=2)

Unknown word in text text_22: "dhow"
