# Predicted Word Associates of Texts, based on Word Association Norms

$$
\newcommand{\data}{\mathcal{D}}
\newcommand{\Prob}[1]{\mathrm{P}( #1 )}
\newcommand{\given}{\vert}
$$

Here, we will calculate the predicted word associates of the texts that are used in the text memory experiment. These predictions can be used to test a word association account of text memory.

Word association norms can, at least in most cases, be defined by a matrix $A$, such that 
$$
A_{ij} \triangleq \text{frequency that word $w_i$ is stated as associated with word $w_j$}.
$$

Therefore, the conditional (empirical) probability of word $w_i$ given $w_j$ is 
$$
\mathrm{P}(w_i \vert w_j) = \frac{A_{ij}}{\sum_{i=1}^V A_{ij}},
$$
where $V$ is the total number of words in our vocabulary of response words.

Given a text, $\textrm{text}_{t}$, defined as
$
\textrm{text}_{t} \triangleq w_{t 1}, w_{t 2}, \ldots, w_{t n_{t}}, 
$
the predicted probability that word $w_k$ is associated with $\textrm{text}_{t}$ is 
$$
\mathrm{P}(w_k \vert \textrm{text}_{t}) = \frac{1}{n_{t}} \sum_{i = 1}^{n_t} \mathrm{P}(w_k \vert w_{t i}).
$$
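
As a small worked illustration of the two formulas above, the following sketch uses invented words and counts (none of them come from the norms data) and plain Python lists. The helper names `conditional` and `text_prediction` are hypothetical and are not used elsewhere in this notebook.

```python
# Toy counts: A[i][j] = number of times associate i was given for stimulus j.
# All words and counts are invented for illustration only.
associate_words = ['water', 'clean', 'soap']
stimulus_words = ['wash', 'bath']
A = [
    [6, 3],   # water
    [3, 0],   # clean
    [1, 1],   # soap
]


def conditional(A, j):
    """P(w_i | w_j) for every associate i: normalise column j of A."""
    column = [row[j] for row in A]
    total = float(sum(column))
    return [count / total for count in column]


def text_prediction(A, stimulus_indices):
    """Average P(w_k | w_ti) over the n_t words of a text."""
    n = len(stimulus_indices)
    p = [0.0] * len(A)
    for j in stimulus_indices:
        for k, p_kj in enumerate(conditional(A, j)):
            p[k] += p_kj / n
    return p


# A three-word "text": wash, wash, bath
p_text = text_prediction(A, [0, 0, 1])
```

Because each conditional distribution sums to one, their average over the words of the text is again a probability distribution over associates.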

## Preface

This notebook requires existing datasets and writes a new dataset; see below. All cells should execute in about 10-20 seconds. 

## Set up

In the following steps, we set up the data that we need.

In [1]:
from __future__ import division

# Standard library imports
import os
from collections import defaultdict
import cPickle as pickle
import string

# Third party imports
import configobj
import numpy
import pandas
from numpy.random import randint, dirichlet, rand
from numpy import ones, zeros, unique, empty
from scipy.sparse import coo_matrix
from gustav.samplers import fortransamplers

# Local imports
from utils import utils, topicmodels

Create some helper classes and functions.

In [2]:
class MakeVocab(object):
    
    '''
    A class to make a vocabulary.
    '''
    
    @classmethod
    def new(cls, word_list):
        
        return cls(word_list)
    
    def __init__(self, word_list):
        
        self.word_list = word_list
        
        self.V = len(word_list)
    
        self.word2index = {w:i for i,w in enumerate(word_list)}
        self.index2word = {i:w for i,w in enumerate(word_list)}

        
def text_to_words(text, vocabulary):
    '''Extract words from a text'''
    return [word 
            for word in utils.tokenize(text) 
            if word in vocabulary.word2index]


class WordAssociations(object):
    
    '''
    A class to store the word association data in a dict of dict format, e.g.
    
    associations['stimulus']['associate'] = count
    
    where `count` gives the number of times that `associate` was listed
    as an associate of `stimulus`.
    
    '''
    

    def __init__(self, word_associations_data):
        
        self.word_associations_data = word_associations_data
            
        self.build_associations()

    def build_associations(self):
        
        '''
        Build `self.associations`, a dictionary whose keys are the stimulus
        words and whose values are dictionaries whose keys are the associate
        words and whose values are the frequencies of their association 
        with the stimulus word.
        
        Thus,
        
        self.associations['foo']['bar'] 
        
        gives the number of times the word 'bar' was said to be 
        associated with the word 'foo'.

        Note: We will convert all stimulus words to lowercase. This 
        will in effect fold stimulus words that are upper or lower case 
        variants of one another. 
        
        This is not really an issue, however, because this  
        will only affect one word, Sunday/sunday, as can be verified below with
        self.check_stimulus_words_case()
        
        We will also fold all associate words to lower case. Thus, an 
        associate 'foo' is treated as identical to 'Foo'.
        
        '''
        
        self.associations = defaultdict(lambda : defaultdict(int))
        
        for row in self.word_associations_data:
            
            subject, stimulus, assoc1, assoc2, assoc3 = row.split(';')
            
            for associate in (assoc1, assoc2, assoc3):
                
                # We will make all stimulus words lowercase
                # Effectively folding stimulus words
                self.associations[stimulus.lower()][associate.lower()] += 1
                
                
    def check_stimulus_words_case(self):
        
        '''
        Return the list of stimulus words that are upper/lower case
        variants of one another. For example, if one stimulus word is
        'Foo' and the other is 'foo', this will be returned.

        '''
        
        stimuli = defaultdict(dict)
        
        for row in self.word_associations_data:
            
            _, stimulus, _, _, _ = row.split(';')
            
            stimuli[stimulus.lower()][stimulus] = None    
                
        
        return filter(lambda items: len(items[1]) > 1, stimuli.items())
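
To make the dict-of-dict format and the case folding concrete, here is a minimal sketch on two invented rows. It assumes the `subject;stimulus;assoc1;assoc2;assoc3` layout that `build_associations` splits on; the rows themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the 'subject;stimulus;assoc1;assoc2;assoc3' layout
rows = [
    's1;Sunday;church;rest;week',
    's2;sunday;Church;monday;rest',
]

counts = defaultdict(lambda: defaultdict(int))
variants = defaultdict(set)

for row in rows:
    _, stimulus, a1, a2, a3 = row.strip().split(';')
    # Remember every spelling that folds to the same lowercase stimulus
    variants[stimulus.lower()].add(stimulus)
    for associate in (a1, a2, a3):
        counts[stimulus.lower()][associate.lower()] += 1

# Stimulus words that occur in more than one upper/lower case variant
folded = {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}
```

Here `counts['sunday']['church']` is 2, because both the stimulus and the associate are folded to lower case before counting, and `folded` lists the two spellings of the one affected stimulus.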

Download and prepare all the data files needed for this analysis.

In [3]:
cache_directory = '../cache'
cache_fullpath = lambda path: os.path.join(cache_directory, path)

filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'corpus' : [('bnc_78723408_250_500_49328.npz.bz2', 
                 'b9d828f7697871e01a263b8f3978911c70ff45cab9af4c86fbb43c3baef969d9')]
}

utils.verify_cache_files(filenames['experiment_cfg'] + filenames['corpus'],
                         cache=cache_directory,
                         verbose=False)

stimuli = configobj.ConfigObj(cache_fullpath('Brismo.cfg'))['text_memoranda']

corpus_data = utils.loadnpz('bnc_78723408_250_500_49328.npz.bz2', 
                            cache=cache_directory,
                            verbose=False)

bnc_vocabulary = MakeVocab.new(word_list=corpus_data['vocabulary'])

texts = {}
for key,value in stimuli.items():
    texts[key] = text_to_words(value['text'], vocabulary=bnc_vocabulary)  

The following assumes that the file `associations_en_05_01_2015.csv.bz2`, whose sha256 checksum is `06a527e5c9647f37a4a2ee0744a309f57f259e203238b87e0f466b74f7a6e63e`, is available in the cache directory (`../cache`, as set in `cache_directory` above). 

This is a compressed csv file of word association norms collected at https://www.smallworldofwords.org/en and generously shared by Simon De Deyne (https://simondedeyne.me/). 

Unfortunately, I am not at liberty to share this data at present, so please contact either Simon De Deyne or Gert Storms in order to obtain it. 

In [4]:
assert os.path.exists(cache_fullpath('associations_en_05_01_2015.csv.bz2'))
assert utils.checksum(cache_fullpath('associations_en_05_01_2015.csv.bz2'))\
    == '06a527e5c9647f37a4a2ee0744a309f57f259e203238b87e0f466b74f7a6e63e'
    
word_associations_data = utils.loadcsv('associations_en_05_01_2015.csv.bz2', 
                                       cache=cache_directory)
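
The implementation of `utils.checksum` is not shown in this notebook. As a rough sketch of what such a check might look like, assuming it is simply the hex SHA-256 digest of the file contents, the standard library's `hashlib` can be used as follows. The function name `sha256_checksum` is hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_checksum(path, blocksize=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in blocks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            digest.update(block)
    return digest.hexdigest()


# Demonstrate on a throwaway file rather than on the real data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'abc')
demo_digest = sha256_checksum(tmp.name)
os.remove(tmp.name)
```

Reading in blocks keeps memory use constant, which matters for large compressed corpora.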

## Create association matrix $A$, etc

We'll use all stimulus words but we'll restrict ourselves to associate words that are in the BNC corpus. This avoids dealing with the mass of highly infrequent responses. We will form the union of this set and the set of recalled words that are in the BNC corpus. This will give us all the association norms data that we'll need for the recognition and recall analyses.

In [5]:
word_associations = WordAssociations(word_associations_data)

# Get the stimulus vocabulary 
stimulus_vocabulary = MakeVocab.new(
    sorted(
        set(
            word_associations.associations.keys()
        )
    )
)

# Get the association vocabulary
association_vocabulary = []
for stimulus_word in stimulus_vocabulary.word_list:
    association_vocabulary.extend(
        filter(lambda word: word in bnc_vocabulary.word2index, 
               word_associations.associations[stimulus_word])
    )

Df = {}
Df['recall'] = pandas.read_pickle(cache_fullpath('brisbane_06b643a_recall_results.pkl'))

recalled_words = sorted(
     set(
         filter(lambda word: word in bnc_vocabulary.word2index,
                map(string.lower, Df['recall']['word'].values)
               )
     )
)
  
association_vocabulary_word_list = sorted(set(association_vocabulary + recalled_words))

associate_vocabulary = MakeVocab.new(association_vocabulary_word_list)

Here, we'll make some functions that will help the creation of the $A$ matrix.

The following creates a sparse representation of the $A$ count matrix.

In [6]:
def get_association_matrix(word_associations, stimulus_vocabulary, associate_vocabulary):

    rows = []
    cols = []
    values = []
    for stimulus_word in word_associations.associations:
        for associate_word in word_associations.associations[stimulus_word]:

            try:

                j = stimulus_vocabulary.word2index[stimulus_word]
                k = associate_vocabulary.word2index[associate_word]

                value = word_associations.associations[stimulus_word][associate_word]

                rows.append(j)
                cols.append(k)
                values.append(value)

            except KeyError:

                pass

    # Make a sparse array, but return the dense array (note the .A at the end)
    return coo_matrix((values, (rows, cols)), 
                   shape=(stimulus_vocabulary.V, associate_vocabulary.V)).A

A = get_association_matrix(word_associations, stimulus_vocabulary, associate_vocabulary)

The following will create a probability vector of length equal to the number of word associates. The probability of associate $w_i$ is
$$
f_i  \propto \epsilon + \sum_{j} A_{ij},
$$
which is the total frequency with which $w_i$ is given as an associate, summed over all stimulus words $w_j$, plus $\epsilon$. (In the code, `A` stores stimulus words along rows and associates along columns, so this is the column sum `A.sum(0)`.) The value of $\epsilon$ is set by default to $1.0$; it is essentially a smoothing constant that prevents zero probabilities, which would otherwise arise when $\sum_{j} A_{ij} = 0$.

In [7]:
def get_unigram_probability(A, eps=1.0):
    _f = A.sum(0) + eps
    return (_f/_f.sum()).flatten()

f = get_unigram_probability(A)
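
To see the effect of $\epsilon$ on a small case, the following sketch repeats the same calculation on an invented count matrix, using plain Python lists to keep it dependency-free. The name `unigram_probability` and the toy counts are hypothetical; rows are stimulus words and columns are associates, as in the code above.

```python
def unigram_probability(A, eps=1.0):
    """Column sums of A plus eps, normalised to sum to one."""
    n_cols = len(A[0])
    f = [eps + sum(row[k] for row in A) for k in range(n_cols)]
    total = float(sum(f))
    return [value / total for value in f]


# Rows are stimulus words, columns are associates (toy counts)
A_toy = [
    [4, 0, 0],
    [2, 2, 0],
]
f_toy = unigram_probability(A_toy)
```

The third associate is never given in response to any stimulus, yet it still receives probability $1/11$ rather than zero.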

Now, we will calculate 
$$
\mathrm{P}(w_i \vert w_j) \propto 
\begin{cases}
\frac{A_{ij}}{\sum_{i=1}^V A_{ij}},&\quad\text{if $A_{ij}>0$},\\
f_i, &\quad\text{if $A_{ij}=0$},
\end{cases}
$$

The effect of this is as follows: whenever $A_{ij}$ is 0, we replace the predicted probability $\frac{A_{ij}}{\sum_{i=1}^V A_{ij}}$, which is necessarily 0, by the unigram probability $f_i$, our estimate of $\Prob{w_i}$. In other words, this gives a *back-off* model for the zero counts. The main purpose and benefit of this is to ensure that every stimulus-associate word pair with an observed count of zero is given a reasonable nonzero estimate. Also, as with the unigram probabilities above, values of zero would cause trouble in analyses where we need to calculate logarithms.

In [8]:
smoothed_association_matrix = numpy.zeros_like(A, dtype=float)
for j, Aj in enumerate(A):
    
    I = Aj == 0
    
    _p = Aj/Aj.sum()
    
    _p[I] = f[I]
    
    smoothed_association_matrix[j] = _p/_p.sum()
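
The back-off step for a single stimulus word can be sketched on toy numbers as follows. The function name `backoff_row`, the counts, and the unigram probabilities are all invented for illustration; this is a plain-Python sketch of the same idea, not the code used above.

```python
def backoff_row(counts, f):
    """Smoothed P(. | stimulus) for one row of counts.

    Nonzero counts keep their relative frequency, zero counts back off to
    the unigram probability, and the result is renormalised to sum to one.
    """
    total = float(sum(counts))
    p = [c / total if c > 0 else f_k for c, f_k in zip(counts, f)]
    norm = sum(p)
    return [value / norm for value in p]


counts = [3, 1, 0, 0]            # toy counts for one stimulus word
f = [0.4, 0.3, 0.2, 0.1]         # toy unigram probabilities
p = backoff_row(counts, f)
```

After renormalisation the observed associates keep their 3:1 ratio, and the two unobserved associates keep the 2:1 ratio of their unigram probabilities.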

Some sanity checking. Sample some words, and make sure their predicted associates according to the model match the original data, except for the low probability predictions, which will have backed off to the marginal probabilities in the smoothed model.

In [9]:
def check_some_words(K, N, seed):
    random = numpy.random.RandomState(seed)
    
    for k in random.permutation(smoothed_association_matrix.shape[0])[:K]:
        
        smoothed_associates = ','.join([associate_vocabulary.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(smoothed_association_matrix[k]))[:N]])
        
        original_associates = ','.join([associate_vocabulary.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(A[k]))[:N]])
    
    
        print(stimulus_vocabulary.index2word[k].capitalize())
        print('Smoothed: ' + smoothed_associates)
        print('-'*10)
        print('Original: ' + original_associates)
        print('='*100)

check_some_words(25, 100, 1001)

Hinge
Smoothed: door,metal,swing,squeak,bend,close,bracket,unhinged,depend,joint,screw,creaky,creak,gate,unhinge,knob,rusty,cupboard,crux,squeaky,iron,pin,connection,brass,flap,oil,swinging,pivot,money,food,water,dangerous,lid,brain,wood,angle,balance,laptop,decision,dangle,sharp,rock,drawer,table,screws,elbow,rotate,frame,rust,peep,latch,flexible,machine,flex,book,sanity,choice,mechanics,change,nail,leverage,fold,brink,suspense,attachment,window,squeaking,bronze,jacket,failure,handle,lock,cabinet,join,hardware,screen,wait,bilge,key,holds,connected,dependent,stuck,bolt,dependency,car,music,green,love,red,sex,happy,white,dog,cold,school,time,bad,sad,fun
----------
Original: door,metal,swing,squeak,close,bend,bracket,unhinged,depend,joint,screw,knob,gate,creaky,creak,rusty,unhinge,connection,swinging,brass,cupboard,squeaky,crux,pivot,oil,iron,flap,pin,suspense,dependent,laptop,dependency,sharp,key,lock,failure,stuck,connected,decision,peep,balance,rust,bilge,dangle,wait,brain,table,flex,

## Make predictions of associates for each text

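The following computes
$$
\mathrm{P}(w_k \vert \textrm{text}) = \frac{1}{n} \sum_{j = 1}^{n} \mathrm{P}(w_k \vert w_{j})
$$
for each text. Words in the text that are not stimulus words in the association norms are skipped, so $n$ here is effectively the number of words in the text that have association data.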

In [10]:
predicted_associates = {}
for text_name in stimuli:
    
    w = numpy.zeros_like(smoothed_association_matrix[0])
    
    for word in texts[text_name]:

        try:
            j = stimulus_vocabulary.word2index[word]
            w += smoothed_association_matrix[j]
        except KeyError:
            # Skip words that are not stimulus words in the association norms
            pass
    
    predicted_associates[text_name] = w/w.sum()
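
The handling of words without association data can be illustrated with a small sketch. The vocabulary, the matrix, and the function name `predict_associates` are hypothetical, and the sketch assumes that at least one word of the text is a known stimulus word (otherwise the normalising sum would be zero).

```python
# Toy smoothed matrix: one row of P(associate | stimulus) per stimulus word
stimulus_index = {'wash': 0, 'bath': 1}
smoothed = [
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]


def predict_associates(words, stimulus_index, smoothed):
    """Average the rows of `smoothed` over the words that are known stimuli."""
    total = [0.0] * len(smoothed[0])
    for word in words:
        j = stimulus_index.get(word)
        if j is None:
            continue  # not a stimulus word: contributes nothing
        total = [t + p for t, p in zip(total, smoothed[j])]
    norm = sum(total)
    return [t / norm for t in total]


prediction = predict_associates(['wash', 'the', 'bath', 'wash'],
                                stimulus_index, smoothed)
```

Since every row sums to one, dividing by the sum of the accumulated vector is the same as dividing by the number of words that were found, here three of the four.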
 

Let's have a look at these predictions. 

In [11]:
for text_name in stimuli:
    print(stimuli[text_name]['text'])
    print('-'*26)
    print(','.join([associate_vocabulary.index2word[i] 
                    for i in numpy.flipud(numpy.argsort(predicted_associates[text_name]))[:100]]))
    print('='*50)

‘I don't know what I did without it’ is the sentiment. There is a difference
between fully automatic washing machines — which change the nature of the task
altogether — and ‘twin tub’ machines where the hot wet washing has to be
lifted manually into a separate drying compartment. The women who had this type
of machine complained about the considerable amount of work still required of
the housewife, and the mess on the kitchen floor to be cleared up afterwards. In
a similar way the launderette does not remove the physical drudgery of washing.
The housewife has to get the washing there in the first place, she has to unload
it, sort it, sit and watch it wash and dry (or dash out to shop in the interim)
and then pack it all up again. This, when there is a baby in the pram and a two-
or three-year-old to attend to, is no mean feat.
--------------------------
clothes,water,clean,machine,car,money,laundry,dishes,baby,soap,food,bath,job,cold,time,alike,chair,mother,woman,cleaning,run,dog,store

## Write predictions of recall & recognition words to file 

Get the predictions for each recalled word and for all words in the recognition tests.

In [12]:
stimuli_words = []

for text_name in stimuli:
    
    _, n = text_name.split('_')
    n = int(n)+1
    inwords = stimuli[text_name]['inwords'].split(',')
    outwords = stimuli[text_name]['outwords'].split(',')
    for word in inwords+outwords+recalled_words:
        try:
            p = predicted_associates[text_name][associate_vocabulary.word2index[word]]
            stimuli_words.append((str(n) + '-' + word, p))
        except KeyError:
            print('Unknown word in text %s: "%s"' % (text_name,word))

associations_predictions = dict(stimuli_words)

with open(cache_fullpath('word_associates_from_association_norms.pkl'), 'wb') as f:
    pickle.dump(associations_predictions, f, protocol=2)

Unknown word in text text_22: "dhow"


## Create csv file with predictions for recalled words 

Here, we'll create a special file for use with the multinomial logistic regression modelling of the recall memory results. This file is a $J \times (V^\prime + 1)$ matrix, written as a csv file, where $J$ is the number of texts and $V^\prime$ is the number of recalled words, across all participants, that are also in the training corpus vocabulary. The extra column, `ALTERNATIVE_WORD`, holds the remaining probability mass, i.e. one minus the sum of the other $V^\prime$ columns in that row.

See `posterior-predictive-distributions` notebook for more details. There, an identical procedure is followed. 

In [13]:
predictive_probabilities = []
text_names = sorted(predicted_associates.keys(), key=lambda arg: int(arg.split('_')[1]))
for text_name in text_names:
    f = []
    for word in recalled_words:
        f.append(predicted_associates[text_name][associate_vocabulary.word2index[word]])
    predictive_probabilities.append(f)

predictive_probabilities = numpy.array(predictive_probabilities)

predictive_probabilities = numpy.c_[predictive_probabilities, 1-predictive_probabilities.sum(1)]

header = ','.join(recalled_words + ['ALTERNATIVE_WORD'])

M = [header]
for i,f in enumerate(predictive_probabilities):
    M.append(text_names[i] + ',' + ','.join(map(str, f)))
M = '\n'.join(M)
    
association_predictions_of_recalled_words = 'association_predictions_of_recalled_words.csv'

with open(cache_fullpath(association_predictions_of_recalled_words), 'w') as f:
    f.write(M)

# Verify the integrity of the exported csv file.
assert utils.checksum(cache_fullpath(association_predictions_of_recalled_words))\
    == '34b7f4dd9bea8bac9248699ae3313e096b12572f2eae8fb763dcef3448b25c6f'
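
For reference, the layout of the exported file can be sketched on invented numbers as follows. The text names, words, and probabilities are hypothetical; only the structure mirrors the cell above, including the fact that the header carries no label for the leading text-name column.

```python
recalled = ['clothes', 'water']               # hypothetical recalled words
rows = {'text_0': [0.10, 0.05],               # hypothetical probabilities
        'text_1': [0.02, 0.20]}

lines = [','.join(recalled + ['ALTERNATIVE_WORD'])]
for name in sorted(rows, key=lambda arg: int(arg.split('_')[1])):
    probs = rows[name]
    remainder = 1.0 - sum(probs)   # mass of all words that were not recalled
    lines.append(name + ',' + ','.join(str(p) for p in probs + [remainder]))

csv_text = '\n'.join(lines)
```

Each data row therefore sums to one across the recalled words plus the `ALTERNATIVE_WORD` column.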