# Stimuli Selection for Experiment *Brisbane*

In [19]:
import utils

## Downloading the British National Corpus

The British National Corpus (BNC) is available from http://ota.ox.ac.uk/desc/2554 (with the reference guide available at http://www.natcorp.ox.ac.uk/docs/URG/index.html).

Under the licence agreement, I am not at liberty to redistribute my copy of the BNC here. However, to ensure reproducibility of what I describe below, the details of the zip archive of the BNC that I obtained from the University of Oxford Text Archive (OTA), named `2554.zip`, are as follows:

* Size: 539M
* md5sum: 394a702072f2f3f62f467f8f911420e7
* sha1sum: d0bbc6e29745bcddea42d2c57f5fa5e485002524 

You can run the following Linux commands to check the file size and checksum hashes of the BNC zip archive.

In [20]:
%%bash 
BNC_ZIP_FILE=2554.zip
du -h $BNC_ZIP_FILE
md5sum $BNC_ZIP_FILE
sha1sum $BNC_ZIP_FILE

539M	2554.zip
394a702072f2f3f62f467f8f911420e7  2554.zip
d0bbc6e29745bcddea42d2c57f5fa5e485002524  2554.zip


On the basis of the above information, you can verify whether you have a bit-for-bit identical copy of the archive. Even if it is not bit-for-bit identical, the differences may still be of no practical consequence. In the file `2554.filelist.txt`, I provide a list of the contents of the archive. In the file `2554.checksum.txt`, I provide the md5sum checksum of all the files. This will allow you to check if and where any differences exist between the files I am using and the BNC files you obtain from the OTA.
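If `md5sum` and `sha1sum` are not available on your system, the same check can be done from Python. The following is a minimal sketch using only the standard library; the function name `file_digests` is my own for illustration and is not part of `utils`.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex, sha1_hex) for the file at `path`."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(path, 'rb') as f:
        # Read in chunks so that a large archive never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
            sha1.update(chunk)
            size += len(chunk)
    return size, md5.hexdigest(), sha1.hexdigest()


# Expected values are the ones listed above for 2554.zip.
EXPECTED_MD5 = '394a702072f2f3f62f467f8f911420e7'
EXPECTED_SHA1 = 'd0bbc6e29745bcddea42d2c57f5fa5e485002524'

if Path('2554.zip').exists():
    size, md5_hex, sha1_hex = file_digests('2554.zip')
    print('size in bytes:', size)
    print('md5 matches:  ', md5_hex == EXPECTED_MD5)
    print('sha1 matches: ', sha1_hex == EXPECTED_SHA1)
```

Note that this reports the size in bytes, whereas `du -h` reports a rounded disk usage figure, so only the two hashes are compared directly.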

## Read in the BNC xml file names

For what follows below, I'm assuming the BNC corpus zip archive has been unzipped to `./bnc` and so the corpus xml files are below the `./bnc/2554/download/Texts/` directory. The following command will get the list of the xml files in the corpus.

In [21]:
corpus_filenames = utils.Corpus.get_corpus_filenames('./bnc/2554/download/Texts/')

If the above command ran without problem, the number of `.xml` corpus files you have is {{ len(corpus_filenames) }}.
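For readers without access to `utils`, collecting the file names amounts to a recursive directory walk. The sketch below is an assumption about what such a helper might do, not the actual implementation of `utils.Corpus.get_corpus_filenames`; the name `list_xml_files` is hypothetical.

```python
import os


def list_xml_files(root):
    """Recursively collect the paths of all .xml files below `root`, sorted."""
    xml_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.xml'):
                xml_files.append(os.path.join(dirpath, name))
    # Sorting makes the order independent of the file system.
    return sorted(xml_files)


# os.walk yields nothing for a missing directory, so this prints 0
# if the archive has not been unzipped to ./bnc yet.
filenames = list_xml_files('./bnc/2554/download/Texts/')
print(len(filenames))
```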

## Load up all BNC paragraphs

The following commands will parse all the xml files in the corpus and read in the entire corpus as a list of paragraphs (using the BNC's own definition of what constitutes a paragraph). Each element of this list gives the following:

* The xml filename which contains the paragraph
* The top level division (div1) in the xml file that contains the paragraph
* The order of the paragraph in the div1 division (i.e. is it the first, second, third, etc, paragraph)
* How many paragraphs are in that div1 division
* The paragraph text itself (i.e., raw text, no xml)
* The paragraph text as a list of individual words (using BNC's own definition of what constitutes a word)
* The word length of the paragraph
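
To make the fields listed above concrete, here is a minimal sketch that builds such records from a toy XML fragment. The fragment only imitates the BNC markup as I understand it (`<div level="1">`, `<p>`, `<w>`, `<c>`), so check the element and attribute names against the reference guide. Apart from `corpus_filename`, the dictionary keys and the function name are hypothetical and may differ from what `utils` produces.

```python
import xml.etree.ElementTree as ET

# Toy fragment imitating BNC-style markup (an assumption, not real corpus data).
SAMPLE = """
<bncDoc>
  <wtext type="FICTION">
    <div level="1">
      <p>
        <s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="fox" pos="SUBST">fox </w><w c5="VVD" hw="run" pos="VERB">ran</w><c c5="PUN">.</c></s>
      </p>
      <p>
        <s n="2"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VVD" hw="hide" pos="VERB">hid</w><c c5="PUN">.</c></s>
      </p>
    </div>
  </wtext>
</bncDoc>
"""


def paragraphs_from_xml(xml_string, corpus_filename='toy.xml'):
    """Return one record per <p> element inside each top level division."""
    root = ET.fromstring(xml_string.strip())
    top_level_divs = [d for d in root.iter('div') if d.get('level') == '1']
    records = []
    for div_index, div in enumerate(top_level_divs):
        paras = list(div.iter('p'))
        for order, p in enumerate(paras, start=1):
            # Words are the contents of the <w> elements, without padding.
            words = [w.text.strip() for w in p.iter('w') if w.text]
            # Raw text: all text nodes joined, with whitespace normalised.
            text = ' '.join(''.join(p.itertext()).split())
            records.append({
                'corpus_filename': corpus_filename,
                'div1_index': div_index,
                'paragraph_order': order,
                'paragraphs_in_div1': len(paras),
                'text': text,
                'words': words,
                'word_count': len(words),
            })
    return records


for record in paragraphs_from_xml(SAMPLE):
    print(record['paragraph_order'], record['word_count'], record['text'])
```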

Note that this xml parsing and the reading of data from disk is a slow process, and so I run it in parallel. I did so on a 16 core machine, and it still took many hours. Once the parsing and reading in is done, a pickle file is created. Once this file exists, if `use_cached_data` is set to `True`, the pickle file is read in instead of running the parser again.

In [22]:
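use_cached_data = False

if not use_cached_data:

    from ipyparallel import Client

    # Connect to a running ipyparallel cluster and make calls blocking.
    clients = Client()
    clients.block = True
    view = clients.load_balanced_view()

    # Parse the corpus xml files in parallel across the cluster engines.
    paragraphs = utils.get_all_paragraphs_parallel(view,
                                                   corpus_filenames)

    # Sort the paragraphs by the name of their source file.
    paragraphs = sorted(paragraphs,
                        key=lambda args: args['corpus_filename'])

    # Cache the parsed paragraphs for later runs.
    utils.dump(paragraphs, filename='paragraphs.pkl')

else:

    # Read the cached paragraphs instead of running the parser again.
    paragraphs = utils.load('paragraphs.pkl')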

## Get corpus vocabulary 

For our purposes, the vocabulary of the BNC is not the set of all word types as defined by the BNC itself, i.e., those strings marked up with the `<w` ... `>` tag. That set contains around 500K word types and includes many strings that we would not normally recognize as everyday words.

To only use everyday words, we get the intersection of all the words in the BNC with those listed in the `2of4brif.txt` dictionary of around 60K English words. This vocabulary list is taken from the 12Dicts package of the SCOWL (And Friends) database of English words used for creating word lists for spell checkers: http://wordlist.aspell.net/12dicts/. These files are in the public domain.

We then remove all stopwords from this list. The stop word lists we used were `FoxStoplist.txt` and `SmartStoplist.txt`, which were taken from https://github.com/aneesha/RAKE.git (commit 22474be2ba9a88d78ea2f2efd8d1f8115af869e1) and are used here according to the MIT license for the repository.

The three files `2of4brif.txt`, `FoxStoplist.txt`, `SmartStoplist.txt` are distributed with this notebook. You can check their file integrity with `md5sum` and this should give you:

* 2of4brif.txt: 57fc602974b1ea0e8bb40f3b191ff100
* FoxStoplist.txt: ffc5e787b820d4f4d552f03dc073423b
* SmartStoplist.txt: 1430878775662c041a9ef7f48e491c8e

In [23]:
vocabulary = utils.get_corpus_vocabulary(paragraphs)
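
The logic described above is a set intersection followed by a set difference. The sketch below shows it on toy word lists; the function name `build_vocabulary` is hypothetical, and lowercasing the words before comparison is my assumption rather than something `utils.get_corpus_vocabulary` is known to do.

```python
def build_vocabulary(corpus_words, dictionary_words, stopwords):
    """Words that occur in both the corpus and the dictionary, minus stopwords."""
    # Lowercasing before comparison is an assumption made for this sketch.
    corpus = {w.lower() for w in corpus_words}
    dictionary = {w.lower() for w in dictionary_words}
    stop = {w.lower() for w in stopwords}
    return (corpus & dictionary) - stop


# Toy stand-ins for the BNC words, 2of4brif.txt and the two stop lists.
corpus_words = ['The', 'fox', 'foxes', 'xyzzy', 'ran', 'and', 'knitting']
dictionary_words = ['the', 'fox', 'ran', 'and', 'knitting', 'wool']
stopwords = ['the', 'and']

print(sorted(build_vocabulary(corpus_words, dictionary_words, stopwords)))
# → ['fox', 'knitting', 'ran']
```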

## Get inverse document frequency

For the purposes of extracting keywords from texts, we'll use a simple tfidf, i.e. term-frequency by inverse document frequency, definition of keywords. In other words, the keywords of a text are those words in our vocabulary, defined above, with the highest tfidf value. More precisely, for word $i$, its tfidf in a text is 
$$
f_i \times \log(N/n_i) 
$$
where $f_i$ is the frequency of occurrence of word $i$ in the text, $N$ is the total number of documents in the corpus, and $n_i$ is the total number of documents in the corpus where word $i$ occurs at least once. The latter term, i.e. $\log(N/n_i)$ is the inverse document frequency. 

For present purposes, we defined a "document" as any paragraph with over 100 words. 

In [24]:
idf = utils.get_inverse_document_frequency(paragraphs)
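
As a worked illustration of the formula above, the following sketch computes $\log(N/n_i)$ over a toy set of documents and ranks the words of a text by $f_i \times \log(N/n_i)$. The function names and the `min_length` parameter are hypothetical; the real `utils.get_inverse_document_frequency` may differ in its details.

```python
import math
from collections import Counter


def inverse_document_frequency(paragraphs, min_length=100):
    """Map each word to log(N / n_i), counting only paragraphs with more
    than `min_length` words as documents."""
    documents = [p for p in paragraphs if len(p) > min_length]
    n_documents = len(documents)
    document_counts = Counter()
    for document in documents:
        # set() so that each word is counted at most once per document.
        document_counts.update(set(document))
    return {w: math.log(n_documents / n) for w, n in document_counts.items()}


def top_keywords(text_words, idf, vocabulary, k=20):
    """Return the k vocabulary words in the text with the highest tfidf."""
    term_frequency = Counter(w for w in text_words
                             if w in vocabulary and w in idf)
    scores = {w: f * idf[w] for w, f in term_frequency.items()}
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus; min_length=0 so that these short lists count as documents.
toy_documents = [['fox', 'ran', 'fox'], ['dog', 'ran'], ['fox', 'dog', 'cat']]
toy_idf = inverse_document_frequency(toy_documents, min_length=0)
toy_vocabulary = {'fox', 'ran', 'dog', 'cat'}

print(top_keywords(['fox', 'fox', 'cat', 'ran'], toy_idf, toy_vocabulary, k=2))
# → ['cat', 'fox']
```

Here "cat" outranks "fox" even though "fox" occurs twice in the text, because "cat" occurs in only one of the three documents: $1 \times \log(3/1) \approx 1.10$ against $2 \times \log(3/2) \approx 0.81$.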

## Word association norms data

One of the issues we will be addressing in this study is whether and to what extent memories of spoken and written language are predictable from word associations of the words in the text. As such, it is necessary to use a data set of word associations for the analysis. Also, we need to make sure that the texts and word lists that we use in the memory experiment have sufficient numbers of example words in this data set.

The largest English language data set currently in use was collected at http://www.smallworldofwords.com/en/, and here I am using an early release of this data provided by Simon De Deyne.

The data is in a csv format and its md5sum checksum is:

* associations_en_05_01_2015.csv: 40df7669ab0751e2753a44540cd7c8a1 

In [25]:
%%bash 
md5sum associations_en_05_01_2015.csv

40df7669ab0751e2753a44540cd7c8a1  associations_en_05_01_2015.csv


In [26]:
word_association = utils.WordAssociations('associations_en_05_01_2015.csv')

This data set has {{len(word_association.stimulus_words)}} unique stimulus words, with {{len(word_association.association_words)}} unique word associates. 

## Randomly sample paragraphs

We want to randomly sample 50 paragraphs from the BNC. 

These paragraphs should all be 150 +/- 10 words in length. They should have a specified density of words from our `2of4brif.txt` vocabulary list. This is to keep out very strange and unusual texts that might be filled with, e.g., jargon terms. Likewise, they should have a specified density of words that are stimulus words in the word association norms data set.

We start, therefore, by simply selecting the paragraphs that meet these criteria.

In [27]:
paragraph_indices = utils.filter_paragraphs(paragraphs,
                                            word_association,
                                            minimum_length=140,
                                            maximum_length=160,
                                            density=(0.9,0.75))

This gives us a set of {{len(paragraph_indices)}} paragraphs to choose from.
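
The filtering step can be sketched as follows. I am assuming here that the two numbers in `density=(0.9,0.75)` are, in order, the minimum proportion of a paragraph's words found in the dictionary and the minimum proportion found among the stimulus words; that reading, and the names `density` and `filter_indices`, are assumptions for illustration rather than the actual behaviour of `utils.filter_paragraphs`.

```python
def density(words, reference):
    """Proportion of `words` that occur in the set `reference`."""
    if not words:
        return 0.0
    return sum(w in reference for w in words) / len(words)


def filter_indices(paragraph_words, dictionary_words, stimulus_words,
                   minimum_length=140, maximum_length=160,
                   thresholds=(0.9, 0.75)):
    """Return the indices of paragraphs that meet the length and density criteria."""
    # Assumed order: (dictionary density, stimulus word density).
    dictionary_minimum, stimulus_minimum = thresholds
    keep = []
    for i, words in enumerate(paragraph_words):
        if not minimum_length <= len(words) <= maximum_length:
            continue
        if (density(words, dictionary_words) >= dictionary_minimum
                and density(words, stimulus_words) >= stimulus_minimum):
            keep.append(i)
    return keep


# Toy data with a much shorter length window than the real 140 to 160 words.
dictionary_words = {'the', 'fox', 'ran', 'home', 'wool'}
stimulus_words = {'fox', 'ran', 'home', 'wool'}
toy_paragraphs = [
    ['the', 'fox', 'ran', 'home'],     # meets all criteria
    ['the', 'fox'],                    # too short
    ['xyzzy', 'qwert', 'fox', 'ran'],  # too few dictionary words
    ['the', 'the', 'the', 'fox'],      # too few stimulus words
]

print(filter_indices(toy_paragraphs, dictionary_words, stimulus_words,
                     minimum_length=3, maximum_length=5))
# → [0]
```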

Now, we will randomly sample 100 of these paragraphs. For each one, we will also create a list of its top twenty keywords. This list provides the words for the word list based memory tests.

For recognition memory tests, we will need a set of words that are in the text or word list, and a set of words that are not in the text or word list. For the "in" or "target" list of words, this can simply be 10 of the 20 keywords. For the "out" or "lure" list, we choose keywords from the paragraphs that are adjacent to, i.e. before and after, the selected paragraph. We will only choose relatively short neighbour paragraphs, i.e. between 50 and 150 words each.

The lure words must obviously not be in the paragraph. Nor do we want repetitions of morphologically related words: for example, we want only one of "fox", "foxes", "foxing", etc. Moreover, we don't even want any of these words to have morphological variants in the text. Finally, we want all the keywords in the text, and so all the words in the word list, to be in the set of stimulus words from the word association data set, and we want all the target words and all the lure words to be in the list of word associates from the word association data set.

In [28]:
sampled_paragraphs\
    = utils.ParagraphSampler.select_paragraphs(paragraphs,
                                               paragraph_indices,
                                               idf,
                                               vocabulary,
                                               word_association,
                                               number_of_paragraphs=100,
                                               list_lengths=(20,10),
                                               neighbour_paragraph_length=(50,150),
                                               association_word_density_threshold=1.0,
                                               random_seed=12345)

In [29]:
utils.verify_sample_paragraphs(sampled_paragraphs, word_association)
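
To illustrate the target and lure constraints described above, here is a deliberately simplified sketch. The suffix stripping in `crude_stem` is a rough stand-in for proper morphological matching and is certainly not what `utils.ParagraphSampler` does; all function names here are hypothetical, and the word association constraints are left out for brevity.

```python
def crude_stem(word):
    """Very rough stand-in for morphological normalisation (an assumption)."""
    for suffix in ('ing', 'es', 's', 'ed'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def drop_morphological_repeats(words):
    """Keep only the first word of each group that shares a crude stem."""
    seen, kept = set(), []
    for word in words:
        stem = crude_stem(word)
        if stem not in seen:
            seen.add(stem)
            kept.append(word)
    return kept


def choose_targets_and_lures(text_keywords, neighbour_keywords, text_words,
                             n_targets=10, n_lures=10):
    """Return (wordlist, targets, lures) from keywords ranked best first."""
    wordlist = drop_morphological_repeats(text_keywords)
    targets = wordlist[:n_targets]
    # Lures may not occur in the text, not even as a morphological variant.
    text_stems = {crude_stem(w) for w in text_words}
    candidates = [w for w in drop_morphological_repeats(neighbour_keywords)
                  if crude_stem(w) not in text_stems]
    return wordlist, targets, candidates[:n_lures]


text_words = ['the', 'fox', 'ran', 'to', 'the', 'wool', 'shop']
text_keywords = ['fox', 'foxes', 'wool', 'shop']
neighbour_keywords = ['foxing', 'cards', 'card', 'leaflet']

wordlist, targets, lures = choose_targets_and_lures(
    text_keywords, neighbour_keywords, text_words, n_targets=2, n_lures=2)
print(wordlist)  # → ['fox', 'wool', 'shop']
print(targets)   # → ['fox', 'wool']
print(lures)     # → ['cards', 'leaflet']
```

In this toy case "foxes" is dropped from the word list as a repeat of "fox", "foxing" is rejected as a lure because a variant of it occurs in the text, and "card" is dropped as a repeat of "cards".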

Now, we select half the paragraphs to be used for word list based memory tests and half to be used for text based memory tests.

In [30]:
utils.randomly_sample_as_texts_and_wordlists(sampled_paragraphs, random_seed=321)
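
A seeded half-and-half assignment of this kind can be sketched with the standard library as follows. The function name and the `'text'`/`'wordlist'` labels are hypothetical, and this sketch will not reproduce the particular assignment made by `utils.randomly_sample_as_texts_and_wordlists`, even with the same seed.

```python
import random


def split_into_texts_and_wordlists(items, random_seed=321):
    """Label half of `items` as 'text' and the rest as 'wordlist', at random."""
    # A private generator keeps the split reproducible without
    # touching the global random state.
    rng = random.Random(random_seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    as_text = set(indices[:len(indices) // 2])
    return ['text' if i in as_text else 'wordlist'
            for i in range(len(items))]


labels = split_into_texts_and_wordlists(list(range(100)))
print(labels.count('text'), labels.count('wordlist'))
# → 50 50
```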

Write the sampled paragraphs to file. This will write the files as text and also as pickle, so don't provide an extension to the filename; it will be added automatically.

In [31]:
utils.write_paragraphs_to_file(sampled_paragraphs, filename='sample-paragraphs')

Let's look at them. 

In [32]:
print(utils.write_paragraphs_to_str(sampled_paragraphs))

Wordlist: yarn, pattern, stitch, clumsy, shiny, knit, super, knitting, width, wool, wise,
beautiful
-----
Targets: yarn, pattern, stitch, clumsy, shiny, knit, super, knitting, width, wool
-----
Lures: cards, ready, cast, supply, list, dizzy, design, tuck, leaflet, rib
Wordlist: meditation, relaxation, consciousness, dreary, bald, puzzled, healing, beat,
legs, technique, sitting, hospital, reading, writing, understand, run, book,
difficult, left, day
-----
Targets: meditation, relaxation, consciousness, dreary, bald, puzzled, healing, beat,
legs, technique
-----
Lures: lozenges, mind, khaki, grind, carriages, awoke, revert, rotting, relics,
exceedingly
This is why pre-retirement planning is sometimes known as ‘mid-life planning’—
an appropriate term in view of the fact that present-day retirement can span
almost as many years as a working career. Planning, or at least mental
preparation, is needed for this transition, whether we intend to have an active
retirement, or just ‘sit and do n