# Predicted Word Associates of Texts, based on Word Cooccurrence Statistics

Here, we will calculate the predicted word associates of the texts that are used in our text memory experiment. These predictions can be used to test a word association account of text memory, where word associations are defined in terms of cooccurrence statistics in the language.

$$
\newcommand{\data}{\mathcal{D}}
\newcommand{\Prob}[1]{\mathrm{P}( #1 )}
\newcommand{\given}{\vert}
$$

For this, we need to calculate 
$$
\Prob{w_k \given w_l} \triangleq \frac{ \Prob{w_k,w_l} }{\Prob{w_l}}, 
$$
which is the empirical probability of observing word $w_k$ in some linguistic context, e.g., a short text, given that we have observed $w_l$ there. This is calculated as follows. First, we calculate $\Prob{w_k, w_l \given \textrm{text}=j}$, the joint (empirical) probability of finding word $w_k$ and word $w_l$ in a given text $j$. If text $j$ has $n_j$ words, then 
there are $n_j (n_j - 1)$ ordered word pairs in total in text $j$. Let $n_{jk}$ and $n_{jl}$ be the number of occurrences of words $w_k$ and $w_l$, respectively, in text $j$. When $w_k \neq w_l$, there are $n_{jk} n_{jl}$ ordered pairs whose first word is $w_k$ and whose second word is $w_l$. When $w_k = w_l$, a token cannot be paired with itself, so there are $n_{jk} (n_{jl}-1)$ such pairs. 
From this, 
$$
\mathrm{P}(w_k,w_l \given \textrm{text}=j) = \frac{n_{jk} (n_{jl} - \mathbb{I}(k = l))}{n_j (n_j-1)},
$$
where $\mathbb{I}(k = l)$ takes the value of 1 whenever $w_k = w_l$, and 0 otherwise. Consequently, 
averaging over all $J$ texts, we have
$$
\mathrm{P}(w_k,w_l) =\frac{1}{J} \sum_{j=1}^J \mathrm{P}(w_k,w_l \given \textrm{text}=j),
$$
and so from this, we may easily calculate $\Prob{w_k \given w_l} = \tfrac{ \Prob{w_k,w_l} }{\Prob{w_l}}$, where the marginal is $\Prob{w_l} = \sum_{k} \Prob{w_k,w_l}$.
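As a quick check of the per-text formula, the sketch below compares the closed-form expression with a brute-force enumeration of ordered word pairs. The function names and the toy text are illustrative assumptions and are not part of the notebook's pipeline.

```python
from collections import Counter
from itertools import permutations

def joint_prob_in_text(words, w_k, w_l):
    """Closed-form P(w_k, w_l | text) from the counts n_jk, n_jl and n_j."""
    counts = Counter(words)
    n_j = len(words)
    n_jk, n_jl = counts[w_k], counts[w_l]
    # (w_k == w_l) plays the role of the indicator I(k = l)
    return n_jk * (n_jl - (w_k == w_l)) / (n_j * (n_j - 1))

def joint_prob_by_enumeration(words, w_k, w_l):
    """Same quantity, by listing every ordered pair of distinct positions."""
    pairs = list(permutations(range(len(words)), 2))
    hits = sum(1 for a, b in pairs if words[a] == w_k and words[b] == w_l)
    return hits / len(pairs)

# Hypothetical toy text with 6 tokens, so 6 * 5 = 30 ordered pairs
toy_text = "the cat sat on the mat".split()
p_formula = joint_prob_in_text(toy_text, "the", "cat")
p_enum = joint_prob_by_enumeration(toy_text, "the", "cat")
p_same = joint_prob_in_text(toy_text, "the", "the")
```

Here both the formula and the enumeration give 2/30 for the pair ("the", "cat"), and the $k = l$ case ("the", "the") also gives 2/30, since each of the two tokens can only be paired with the other.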

Having calculated $\Prob{w_k \given w_l}$, given a text 
$$ 
\textrm{text}_{j^\prime} \triangleq w_{j^\prime 1}, w_{j^\prime 2} \ldots w_{j^\prime n_{j^\prime}}, 
$$
the predicted probability that word $w_k$ is associated with $\textrm{text}_{j^\prime}$ is 
$$
\mathrm{P}(w_k \vert \textrm{text}_{j^\prime}) = \frac{1}{n_{j^\prime}} \sum_{i = 1}^{n_{j^\prime}} \mathrm{P}(w_k \vert w_{j^\prime i}).
$$
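The averaging step can be sketched on a tiny example. The three-word vocabulary and the conditional probability matrix below are made up for illustration, and `predict_associates` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np

# Hypothetical 3-word vocabulary and a made-up conditional matrix:
# cond[l, k] stands for P(w_k | w_l), so each row sums to 1.
vocab_toy = ["tea", "cup", "rain"]
cond = np.array([
    [0.2, 0.7, 0.1],   # P(. | tea)
    [0.6, 0.3, 0.1],   # P(. | cup)
    [0.1, 0.1, 0.8],   # P(. | rain)
])

def predict_associates(text_words, cond, vocab):
    """Average the rows of `cond` over every token in the text."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = [cond[index[w]] for w in text_words]
    return np.mean(rows, axis=0)

# "tea" occurs twice, so its row is weighted twice in the average
p_text = predict_associates(["tea", "tea", "rain"], cond, vocab_toy)
```

Because the sum runs over tokens rather than word types, repeated words contribute in proportion to their frequency in the text, and the result is itself a probability distribution over the vocabulary.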

## Preface

This notebook requires some existing datasets and creates a new one (see below). Some of the sparse matrix calculations are relatively intensive, but all cells should execute in about 2-3 minutes. 

In [1]:
from __future__ import division

# Standard library imports
import os
import pickle

# Third party imports
import pandas
import configobj
import numpy
from scipy.sparse import coo_matrix, csc_matrix, lil_matrix

# Home spun third party imports
from gustav import models
from gustav.utils import SparseCountMatrix

# Local imports
from utils import utils, datautils

## Check required files

Check the integrity of required data files.

In [2]:
cache_directory = '../cache'
cache_fullpath = lambda path: os.path.join(cache_directory, path)

filenames = {
    'experiment_cfg' : [('Brismo.cfg',
                         '909d9f8de483c4547f26fb4c34b91e12908ab5c144e065dc0fe6c1504b1f22c9')],
    'text-corpus' : [('bnc_texts_78639361_183975_251_499.txt.bz2',
                      '0db086c97c7d3a26c131c8c51c03e45368dc95d8c8202fb6fdbed23d159afe02')],
    'vocabulary' : [('bnc_vocab_49324.txt',
                     'ecf66c77121cf67e416580cf5cc0853bd1813dcfd946298723134e547324cb6b')],
    'recall_results' : [('brisbane_06b643a_recall_results.pkl',
                         'a94d812373123b9a8b1eac848276e8ffc6a563ebca71ff2bf5adc97c825cbc14')],
}

utils.verify_cache_files(filenames['experiment_cfg'] +\
                         filenames['text-corpus'] +\
                         filenames['vocabulary'] +\
                         filenames['recall_results'], 
           cache=cache_directory,
           verbose=False)

# bunzip the text-corpus
utils.bunzip(filenames['text-corpus'][0][0], cache=cache_directory)
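The helper `utils.verify_cache_files` is local to this project and its implementation is not shown here. The 64-character hexadecimal digests above are consistent with SHA-256, so a check of this kind might look roughly like the following sketch. The name `sha256_checksum` is hypothetical, and the use of SHA-256 is an assumption.

```python
import hashlib
import os
import tempfile

def sha256_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demonstrate on a small temporary file rather than the real cache files
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
digest = sha256_checksum(tmp.name)
os.remove(tmp.name)
```

Reading in chunks keeps memory use flat, which matters for a corpus file of this size.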

## Process the data:

* Get the text stimuli that were used in the experiment
* Get the corpus vocabulary as a Vocab object instance
* Convert each text stimulus into a bag of words
* Read in the experiment recall results
* Find the recalled words that are also in the corpus vocabulary

In [3]:
memoranda = configobj.ConfigObj(cache_fullpath(filenames['experiment_cfg'][0][0]))['text_memoranda']

text_filename = cache_fullpath(os.path.splitext(filenames['text-corpus'][0][0])[0])

vocabulary_filename = cache_fullpath(filenames['vocabulary'][0][0])

vocabulary = open(vocabulary_filename).read().split()
vocab = datautils.Vocab(vocabulary)

text_to_words = lambda text: [word for word in datautils.tokenize(text) if word in vocab.word2index]

texts_as_words = {}
for text_name in memoranda:
    texts_as_words[text_name] = text_to_words(memoranda[text_name]['text'])

Df = {}
Df['recall'] = pandas.read_pickle(cache_fullpath(filenames['recall_results'][0][0]))

recalled_words = sorted(set(Df['recall']['word'].values).intersection(vocabulary))

## Make sparse matrix of document-word counts

The following calculates a $J \times V$ sparse matrix $R$, where 
$$
R_{jk} \triangleq \text{frequency of term $k$ in text $j$}.
$$
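To make the definition of $R$ concrete, here is a minimal dense sketch on two toy texts. The notebook itself builds a sparse matrix with `SparseCountMatrix`; the helper `count_matrix` below and its whitespace tokenization are assumptions made only for illustration.

```python
import numpy as np

def count_matrix(texts, vocab):
    """Return a dense J x V array with R[j, k] = count of vocab[k] in texts[j]."""
    index = {w: k for k, w in enumerate(vocab)}
    R = np.zeros((len(texts), len(vocab)), dtype=int)
    for j, text in enumerate(texts):
        for word in text.split():
            if word in index:          # out-of-vocabulary tokens are dropped
                R[j, index[word]] += 1
    return R

toy_vocab = ["cat", "mat", "sat", "the"]
toy_texts = ["the cat sat on the mat", "the mat sat"]
R_toy = count_matrix(toy_texts, toy_vocab)
```

Each row sum of `R_toy` gives $n_j$ counted over in-vocabulary tokens only, so the word "on" does not contribute to the first text's length.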

In [4]:
#sparse_count_matrix_filename = 'bnc_78723408_250_500_vocab49328_sparse_matrix_count'

sparse_count_matrix = SparseCountMatrix.new(text_filename = text_filename,
                                            vocabulary_filename = vocabulary_filename)

# Create a coo sparse matrix
#sparse_count_matrix = numpy.load(cache_fullpath(sparse_count_matrix_filename) + '.npz')

R = coo_matrix((sparse_count_matrix['values'], 
                (sparse_count_matrix['rows'], sparse_count_matrix['cols'])), 
               shape=(sparse_count_matrix['J'], sparse_count_matrix['V'])).tolil()

We will compute a $J \times V$ matrix $S$ such that 
$$
S_{jk} = \frac{n_{jk}}{n_j (n_j-1)}.
$$


For $w_k \neq w_l$, we can calculate $\Prob{w_k, w_l}$ as follows:
$$
\begin{aligned}
\mathrm{P}(w_k,w_l) &= \frac{1}{J} \sum_{j=1}^J \frac{n_{jk} n_{jl}}{n_j (n_j-1)},\\
                    &= \frac{1}{J} \sum_{j=1}^J S_{jk} R_{jl}.
\end{aligned}
$$

However, whenever $w_k = w_l$, 
$$
\begin{aligned}
\mathrm{P}(w_k,w_l) &= \frac{1}{J} \sum_{j=1}^J \frac{n_{jk} (n_{jl} - 1)}{n_j (n_j-1)},\\
                    &= \frac{1}{J} \left( \sum_{j=1}^J S_{jk} R_{jl} - \sum_{j=1}^J S_{jk} \right)
\end{aligned}
$$
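The two cases can be combined into a single matrix product followed by a correction on the diagonal. The following dense NumPy sketch illustrates this on a toy count matrix; `joint_prob_dense` is a hypothetical stand-in written for this example and is not the sparse implementation used in the notebook.

```python
import numpy as np

def joint_prob_dense(R):
    """Dense version of the S^T R computation with the diagonal correction."""
    R = np.asarray(R, dtype=float)
    J = R.shape[0]
    n = R.sum(axis=1)                      # n_j, tokens per text
    S = R / (n * (n - 1))[:, None]         # S_jk = n_jk / (n_j (n_j - 1))
    C = S.T @ R                            # sum_j S_jk R_jl for every (k, l)
    C[np.diag_indices_from(C)] -= S.sum(axis=0)   # subtract sum_j S_jk when k == l
    return C / J

# Toy counts for 2 texts over a 4-word vocabulary (made up for illustration)
R_toy = np.array([[1, 1, 1, 2],
                  [0, 1, 1, 1]])
P_toy = joint_prob_dense(R_toy)
```

Two properties are useful as sanity checks: the resulting matrix is symmetric, and its entries sum to 1, because the pair probabilities within each text sum to 1 before averaging over texts.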

In [5]:
def get_joint_prob(R):
    
    J, V = R.shape 

    Rj = R.sum(1).A.flatten()

    z = Rj * (Rj - 1)

    Z = lil_matrix((J,J))

    Z.setdiag(1/z)

    S = (R.T * Z ).T 

    C = S.T * R 

    C.setdiag(C.diagonal() - S.sum(0).A.flatten())
    
    return C/J 

def get_conditional_prob(P):

    Pz = P.sum(1).A.flatten()
    
    Z = lil_matrix(P.shape)
    
    Z.setdiag(1/Pz)
    
    return (P.T * Z).T

In [6]:
P = get_joint_prob(R)

In [7]:
C = get_conditional_prob(P)

In [8]:
f = P.sum(0).A.flatten()

Here, we add an optional "smoothing" feature: wherever the observed cooccurrence count of $w_k$ and $w_l$ is 0, we replace the predicted probability $\Prob{w_k\given w_l}$, which would necessarily be 0.0, with the unigram probability $\Prob{w_k}$, and then renormalize. In other words, this gives a *back-off* model for the zero counts. The main purpose of this is to ensure that all zero-count word pairs are given a reasonable nonzero estimate, since values of zero would cause trouble in later analyses where we need to calculate logarithms.
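The effect of the back-off on a single row of conditional probabilities can be sketched as follows. The numbers are made up and `backoff_smooth` is a hypothetical helper that mirrors the idea, not the notebook's own function.

```python
import numpy as np

def backoff_smooth(cond_row, unigram):
    """Replace zero entries of a conditional row by unigram values, then renormalize."""
    p = np.array(cond_row, dtype=float)
    zero = p == 0
    p[zero] = np.asarray(unigram, dtype=float)[zero]
    return p / p.sum()

cond_row = np.array([0.5, 0.5, 0.0, 0.0])     # made-up P(. | w_l) with two zeros
unigram = np.array([0.4, 0.3, 0.2, 0.1])      # made-up unigram probabilities
smoothed = backoff_smooth(cond_row, unigram)
```

Note that the final renormalization also shrinks the entries that were already nonzero, so the smoothed row is not identical to the original on the observed pairs, although their ranking relative to one another is unchanged.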

In [9]:
def get_conditional_probability(C, f, word, smooth=True):
    
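    '''
    Return the probability of each vocabulary word cooccurring with `word`.
    `C` is a matrix whose rows are (possibly unnormalized) cooccurrence
    probabilities, and `f` is the vector of unigram probabilities that is
    used for back-off when `smooth` is True.
    '''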
    
    j = vocab.word2index[word]
    
    Rj = C[j].A.flatten()
    
    p = Rj/Rj.sum()
        
    if smooth:
        I = Rj == 0
        p[I] = f[I] # Where observed cooccurrence is 0, use unigram frequency instead
    
    return p/p.sum()

Do some sanity checks, by checking the cooccurrences of randomly chosen words.

In [10]:
def check_some_words(K, N, seed):
    random = numpy.random.RandomState(seed)
    
    for k in random.permutation(P.shape[0])[:K]:
        
        word_k = vocab.index2word[k]
        
        smoothed_associates = ','.join([vocab.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(get_conditional_probability(C, f, word_k)))[:N]])
        
        original_associates = ','.join([vocab.index2word[i] 
                                        for i in numpy.flipud(numpy.argsort(get_conditional_probability(C, f, word_k, smooth=False)))[:N]])
    
    
        print(vocab.index2word[k].capitalize())
        print('Smoothed: ' + smoothed_associates)
        print('-'*10)
        print('Original: ' + original_associates)
        print('='*100)

check_some_words(25, 100, 1001)

Futurity
Smoothed: nature,empty,possessed,tomorrow,set,baffled,futurity,book,novelist,reading,mind,letter,reader,town,decision,replies,suicide,peter,life,ad,bound,real,world,trust,imagine,tale,outcomes,newspaper,road,children,human,future,reality,hands,management,verb,chapter,fiction,doubt,revelation,free,contemplating,intellectual,standpoint,words,lying,means,aiming,double,apocalyptic,heart,clause,including,aim,nurse,bishop,inside,attach,exchange,persuade,impossibility,spare,bent,brilliant,eternity,imagined,quintets,attacks,sets,agonized,laws,poisoned,boundless,jest,crazed,negative,lighted,urges,intention,rudderless,lamp,saviour,response,wood,metaphysical,lots,wondering,circles,single,guaranteed,cross,neutrally,retains,wrote,torpor,collective,lonely,chip,obscure,bafflement
----------
Original: nature,empty,possessed,tomorrow,set,baffled,futurity,book,novelist,reading,mind,letter,reader,town,decision,replies,suicide,peter,life,ad,bound,real,world,imagine,outcomes,tale,newspaper,road,tr

Calculate the predicted probability that word $w_k$ is associated with $\textrm{text}_{j^\prime}$:
$$
\mathrm{P}(w_k \vert \textrm{text}_{j^\prime}) = \frac{1}{n_{j^\prime}} \sum_{i = 1}^{n_{j^\prime}} \mathrm{P}(w_k \vert w_{j^\prime i}).
$$

In [11]:
predictions = {}
for text_name in memoranda:
    
    prime_words = texts_as_words[text_name]
    
    _predictions = numpy.zeros(len(vocabulary))
    for prime_word in prime_words:
        _predictions += get_conditional_probability(P, f, prime_word)
    
    _predictions /= len(prime_words)
    
    predictions[text_name] = dict(zip(vocabulary, _predictions))

View the predictions.

In [12]:
def print_topic(text_name, K=25):   
    return ','.join([w for w,_ in sorted(predictions[text_name].items(), 
                                         key=lambda value: value[1], 
                                         reverse=True)][:K])

In [13]:
for text_name in memoranda:
    print(memoranda[text_name]['text'])
    print('*'*10)
    print(print_topic(text_name, K=100))
    print('='*25)

‘I don't know what I did without it’ is the sentiment. There is a difference
between fully automatic washing machines — which change the nature of the task
altogether — and ‘twin tub’ machines where the hot wet washing has to be
lifted manually into a separate drying compartment. The women who had this type
of machine complained about the considerable amount of work still required of
the housewife, and the mess on the kitchen floor to be cleared up afterwards. In
a similar way the launderette does not remove the physical drudgery of washing.
The housewife has to get the washing there in the first place, she has to unload
it, sort it, sit and watch it wash and dry (or dash out to shop in the interim)
and then pack it all up again. This, when there is a baby in the pram and a two-
or three-year-old to attend to, is no mean feat.
**********
time,people,day,women,home,life,house,water,children,found,left,world,looked,mother,hand,set,told,head,system,eyes,night,family,door,days,feel,woman,g

Calculate the predictions for the experimental stimuli in the recognition test and also for the recalled words.

In [14]:
cooccurrence_predictions = {}
for text_name in memoranda:
    
    _, n = text_name.split('_')
    n = int(n)+1
    
    inwords = memoranda[text_name]['inwords'].split(',')
    outwords = memoranda[text_name]['outwords'].split(',')
    
    for word in inwords+outwords+recalled_words:
        
        try:

            prob = predictions[text_name][word]
            cooccurrence_predictions[str(n) + '-' + word] = prob
            
        except KeyError:
            
            print('Unknown word in text %s: "%s"' % (text_name,word))

cooccurrence_statistics_pkl = cache_fullpath('word_associates_from_cooccurrence_statistics.pkl')

# Use a distinct handle name so the unigram vector `f` is not overwritten
with open(cooccurrence_statistics_pkl, 'wb') as pkl_file:
    pickle.dump(cooccurrence_predictions, pkl_file, protocol=2)

assert utils.checksum(cooccurrence_statistics_pkl)\
    == 'efbcb9bae13142296ed164335313f69e380ec49811e885ce3bc4351a10cd2889'

Unknown word in text text_22: "dhow"


## Create csv file with predictions for recalled words 

Here, we'll create a special file for use with the multinomial logistic regression modelling of the recall memory results. This file is a $J \times (V^\prime + 1)$ matrix, written as a csv file, where $J$ is the number of texts and $V^\prime$ is the number of recalled words, across all participants, that are also in the training corpus vocabulary. The extra column holds the remaining probability mass, i.e., the probability of any word outside this set. 

See `posterior-predictive-distributions` notebook for more details. There, an identical procedure is followed. 
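The layout of this file can be sketched on a toy example. The word list, probabilities, and text names below are made up for illustration; only the `ALTERNATIVE_WORD` column label is taken from the notebook.

```python
import numpy as np

# Made-up predictions for 2 texts over 3 recalled words. Rows sum to less
# than 1 because the rest of the vocabulary carries the remaining mass.
recalled_toy = ["cup", "rain", "tea"]
probs = np.array([[0.10, 0.05, 0.20],
                  [0.02, 0.30, 0.08]])

# Append the residual probability as a final column
full = np.c_[probs, 1 - probs.sum(axis=1)]

header = ",".join(recalled_toy + ["ALTERNATIVE_WORD"])
rows = ["text_%d,%s" % (i, ",".join(map(str, row))) for i, row in enumerate(full)]
csv_text = "\n".join([header] + rows)
```

With the residual column appended, every row is a proper probability distribution over $V^\prime + 1$ outcomes. As in the cell below, each data row starts with the text name, so the header has one field fewer than the data rows; some csv readers, such as R's `read.csv`, treat the first column as row names in that case.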

In [15]:
predictive_probabilities = []
text_names = sorted(predictions.keys(), key=lambda arg: int(arg.split('_')[1]))
for text_name in text_names:
    f = []
    for word in recalled_words:
        f.append(predictions[text_name][word])
    predictive_probabilities.append(f)

predictive_probabilities = numpy.array(predictive_probabilities)

predictive_probabilities = numpy.c_[predictive_probabilities, 1-predictive_probabilities.sum(1)]

header = ','.join(recalled_words + ['ALTERNATIVE_WORD'])

M = [header]
for i,f in enumerate(predictive_probabilities):
    M.append(text_names[i] + ',' + ','.join(map(str, f)))
M = '\n'.join(M)
    
cooccurrences_predictions_of_recalled_words = 'cooccurrences_predictions_of_recalled_words.csv'

with open(cache_fullpath(cooccurrences_predictions_of_recalled_words), 'w') as f:
    f.write(M)

# Verify the integrity of the exported csv file.
assert utils.checksum(cache_fullpath(cooccurrences_predictions_of_recalled_words))\
     == 'ae0b2e17fb6b800aa304a6990e1a6cfb5f6bd815ad4749828955dc96651e907c'