[Note: Token counts may be slightly off due to filtering modifications.]
- You should create a `/raw_data` directory to store the raw Dundee files (these must be acquired independently, as they are proprietary)
- You should also create a directory that will be pointed to in `filenames.py` as the `google_lm_dir`. The `.pbtxt` and `ckpt` files from tensorflow/models should be placed here.
- `process_experiment.py` is the central file for processing. The file requires a (directory of) file(s) with raw corpus data. In the case of the Dundee corpus, these files are formatted as:

  ```
  Are 1 1 1 1 1 ...
  tourists 1 1 1 2 2 ...
  enticed 1 1 1 3 3 ...
  ```

  The important element is that each line contains one token. Tokens should include punctuation, rather than having punctuation on a separate line.
- `create_text_token_files.create()` prints two files from the above directory of files:
  (a) `test_corpus.tkn` - A corpus file where each sentence is on a separate line. Punctuation is also surrounded by whitespace, e.g. "exists . "
  (b) `test_token_line.tkn` - A corpus file with each token on a separate line. Again, punctuation constitutes a discrete token.
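  The actual logic lives in `create_text_token_files`; the sketch below only illustrates the two output formats. It assumes the token is the first whitespace-delimited column of each raw Dundee line and that sentence-final punctuation is split off naively, which may differ from the repo's real tokenization.

  ```python
  import glob
  import os

  def create(raw_dir, corpus_path="test_corpus.tkn", token_path="test_token_line.tkn"):
      """Sketch: write a sentence-per-line corpus file and a token-per-line file."""
      enders = {".", "!", "?"}
      with open(corpus_path, "w") as corpus_f, open(token_path, "w") as token_f:
          sentence = []
          for fname in sorted(glob.glob(os.path.join(raw_dir, "*"))):
              with open(fname) as raw:
                  for line in raw:
                      if not line.split():
                          continue
                      word = line.split()[0]  # token is the first column of the raw file
                      # split trailing punctuation into its own token
                      if len(word) > 1 and word[-1] in enders:
                          tokens = [word[:-1], word[-1]]
                      else:
                          tokens = [word]
                      for tok in tokens:
                          token_f.write(tok + "\n")
                          sentence.append(tok)
                      if tokens[-1] in enders:
                          corpus_f.write(" ".join(sentence) + " \n")
                          sentence = []
  ```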
- The corpus file returned above is fed into `rnn_corpus.create()`.
- A `GoogleLanguageModel` is created. This relies on code and files from Tal Linzen's rnn_agreement codebase. The directory where these files are stored should be hardcoded into `filenames.py`. Also make sure that all dependencies listed in this repo are added to your current environment.
- The RNN-LSTM is processed as follows:
  (a) `glm.process_init()` builds a new RNN, with `<S>` as the first word.
  (b) The transition probability of each new word occurring as the next state is calculated, and the LSTM state is updated.
  (c) When the sentence terminal `</s>` is encountered, a new LSTM is initiated with `<s>`.
  (d) `rnn_corpus.py` writes the output of the calculated probabilities to `rnn_probability_file`. Each line consists of:
      - Token file number
      - Token index
      - Token
      - Token probability
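  A minimal sketch of that control flow is below. Only `glm.process_init()` is named in this repo; the `process_token` method, the `file_id` argument, and the use of natural logs in the output are illustrative assumptions.

  ```python
  import math

  def score_corpus(glm, token_line_file, rnn_probability_file, file_id="01"):
      """Sketch of steps (a)-(d): reset the LSTM at each sentence start, score each
      token given the current state, and write one token per output line."""
      with open(rnn_probability_file, "w") as out:
          state = glm.process_init()                         # (a) new RNN seeded with <S>
          for idx, line in enumerate(open(token_line_file)):
              token = line.strip()
              prob, state = glm.process_token(token, state)  # (b) P(token | state); update the state
              out.write("{}\t{}\t{}\t{}\n".format(file_id, idx, token, math.log(prob)))
              if token == "</s>":                            # (c) sentence boundary
                  state = glm.process_init()
  ```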
- Optionally, the `rnn_probability_file` can be fed into `perplexity.py`, which outputs the perplexity of the entire corpus given the probabilities of each token.
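  Conceptually this reduces to exponentiating the negative mean token log-probability. The sketch below assumes natural-log probabilities in the last tab-separated column; log10 values (as kenlm produces) would first need to be multiplied by `log(10)`.

  ```python
  import math

  def corpus_perplexity(prob_file):
      """Perplexity = exp(-(1/N) * sum of log P(token))."""
      logprobs = [float(line.rstrip("\n").split("\t")[-1]) for line in open(prob_file)]
      return math.exp(-sum(logprobs) / len(logprobs))
  ```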
- The count-based n-gram models are calculated using `kenlm`. In order to use `kenlm`'s output with the steps below, we use the following commands (Thank you to Marten van Schijndel for these):

  i) Create a binary file of the language model. This is in the format of a corpus arpa file.

  ```
  lmplz -o 5 -S 60G -T /tmp < 1-billion/lm_1b-raw-text.txt | build/bin/build_binary /dev/stdin lm1b-5gram.binary
  ```

  - adding `-T` helps speed up processing
  - The `raw-text.txt` file is a large text file from tensorflow with all of the raw text, with punctuation whitespace delimited.
  - The last argument is the outputted binary file.
  ii) Use the language model created in (i) to measure the (Dundee) corpus

  ```
  query lm1b-5gram.binary < dundee_corpus.tkn | python process_kenlm.py > output.txt
  ```

  - `dundee_corpus.tkn` is the file output from the model generation above, with one token per line, including punctuation as a separate token.
  - The processing file `process_kenlm.py` is adapted from modelblocks. It takes the regular output of `kenlm` and puts it into a format appropriate for an R dataframe or similar format.
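  As a rough illustration of that reshaping (the real script comes from modelblocks and its output columns may differ), assuming `query`'s usual per-word fields of the form `word=<vocab_id> <order> <log10prob>`:

  ```python
  import sys

  # Sketch only: emit one "word<TAB>log10prob" row per scored token.
  for line in sys.stdin:
      for field in line.rstrip("\n").split("\t"):
          parts = field.split()
          if len(parts) == 3 and "=" in parts[0]:   # skips "Total:" and summary fields
              word = parts[0].rsplit("=", 1)[0]
              print("{}\t{}".format(word, parts[2]))
  ```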
- The above steps should be run for the LSTM model, as well as a 5-, 4-, 3-, 2- and 1-gram language model. The results should all be put into a csv of the following format:

  ```
  FILE,WNUM,WORD,rnn_prob,kenlm5gram_prob,kenlm4gram_prob,kenlm3gram_prob,kenlm2gram_prob
  01,1,Are,-7.557139726,-3.3412342,-3.3412342,-3.3412342,-3.3412373
  01,2,tourists,-6.339384999,-4.647959,-4.647959,-4.582983,-4.7727933
  01,3,enticed,-5.600467538,-6.262744,-6.2964334,-6.231951,-6.517497
  ...
  01,10,?,-0.075328262,-2.691623,-2.58747,-2.7365475,-2.306135
  01,NA,</s>,-0.035893436,-0.16058159,-0.19427118,-0.055592526,-0.1109385
  ```
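  How the per-model outputs get combined is not spelled out here; one possible assembly step is sketched below. The per-model file names, and the assumption that each file is headerless with FILE, WNUM, WORD plus a single probability column, are hypothetical.

  ```python
  from functools import reduce

  import pandas as pd

  model_files = {
      "rnn_prob": "rnn_probabilities.csv",
      "kenlm5gram_prob": "kenlm5gram.csv",
      "kenlm2gram_prob": "kenlm2gram.csv",
  }
  frames = [pd.read_csv(path, names=["FILE", "WNUM", "WORD", col], dtype={"FILE": str})
            for col, path in model_files.items()]
  # merge all models on the shared token identifiers
  combined = reduce(lambda a, b: a.merge(b, on=["FILE", "WNUM", "WORD"]), frames)
  combined.to_csv("token_surprisal.csv", index=False)
  ```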
- All alignment and subsequent exclusion is done in `create_gaze_data.R`. This creates proper alignment between lm_1b and dundee, to make the tokenization scheme alike. It then aligns all dundee data (for all 10 subjects) to the lm_1b and kenlm probabilities. The alignment function requires the following files, from preprocessing (above):
  - `token_surprisal.csv` - Illustrated in Step 0 above, from preprocessing. The full Dundee corpus should have 60,916 tokens.
  - `lm1b-1gram.tsv` - This is essentially an occurrence count. The full token list should have 2,425,339 tokens. This should be in a standard arpa format with headers:

    ```
    kenlm1gram.total.prob WORD
    -8.92661 <unk>
    0 <s>
    -1.4210622 </s>
    -2.1811512 The
    -3.0964477 U.S.
    ```

  - `token.line` - Should be generated, during preprocessing, from a csv formatted as:

    ```
    WORD,FILE,SCREEN,LINE.NUM,WORD.IN.LINE.NUM,word_id
    "Are",01,1,1,1,01_1
    "tourists",01,1,1,2,01_2
    "enticed",01,1,1,3,01_3
    ...
    "existence?",01,1,1,10,01_10
    ```

  - `masterTX.csv` - a concatenated file of all tx files used in `process_experiments.py` above. This was used for Demberg & Keller (2008).
- Interpolated models are added. The interpolation blend was taken from the optimal blend (lowest perplexity). The function to run this can also be found separately in `evaluation_code/interpolate.R`.
- The sentence terminal token `</s>` is removed. This is accomplished by filtering on the `NA` line numbers. The sentence terminal was not given a line number when processing lm_1b because it is not in the Dundee corpus, and would throw off alignment. Once these are filtered out, `token.surprisal` should have 58,528 tokens.
- Since all probabilities thus far are in log space, we can sum the logprobs of the constituent tokens of a word to get the total logprob for a word. The 58k tokens should be reduced to 51,501 whole words in `word.surprisal.all`.
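  The aggregation itself happens in `create_gaze_data.R`; the point is just that log P(word) = Σ log P(token_i). A Python equivalent of the idea, assuming the token table has already been given a Dundee `word_id` during alignment:

  ```python
  import pandas as pd

  tokens = pd.read_csv("token_surprisal.csv")  # one row per lm_1b token, plus a word_id column
  word_surprisal_all = (tokens.groupby("word_id", as_index=False)[["rnn_prob", "kenlm5gram_prob"]]
                              .sum())          # summed token logprobs = whole-word logprobs
  ```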
- Unigram probabilities are added with a left join. If a word is not in the unigram probabilities, i.e. it is OOV, it will not find a match and will be filtered out. This includes words with punctuation.
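  A toy illustration of that join (the real one is in `create_gaze_data.R`), using values from the `lm1b-1gram.tsv` excerpt above:

  ```python
  import pandas as pd

  words = pd.DataFrame({"WORD": ["The", "U.S.", "notinunigrams"]})
  unigrams = pd.DataFrame({"WORD": ["The", "U.S."],
                           "kenlm1gram.total.prob": [-2.1811512, -3.0964477]})
  joined = words.merge(unigrams, on="WORD", how="left")
  in_vocab = joined.dropna(subset=["kenlm1gram.total.prob"])  # the OOV row is dropped
  ```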
- For every word, all of the previous word's information is added.
- If the current word's previous word was non-alphabetic, then the current word is removed. This should result in 44,164 remaining words. If the current word is followed by a word with leading punctuation, this is also filtered out.
- Booleans for line edges are added to the token list.
- Filter out words with punctuation, leaving `dat.no.punct` with 372,371 tokens.
- Filter out words in Dundee that are not in lm_1b, leaving 319,333 tokens. Then filter out words for which the previous word was not in the lm_1b corpus. There should now be 37,420 unique tokens.
- Filter out line beginnings and endings.
The align function above will create two data structures:
- `word.surprisal` - A unique row for each token
- `dat.no.punct` - A unique row for each instance of each token, for all subjects
Unless otherwise noted, the modeling files can all be run independently. There is not necessarily any order to run them in. The files are split up by what type of model they use, either GAMs or linear models.
- `gams.R` - a large file that creates many different types of GAMs that are used below
- `ACL_GAMS.R` - unpublished GAM code for uni-, bi-, and trigrams, both fitted and balanced
- `fitted_df_gams.R` - fitted df nonlinear GAMs and log likelihood tests
- `linear_fitted_df_gams.R` - fitted df linear GAMs
- `fixed_df_gams.R` - fixed df GAMs and log likelihood tests
- `gam_linearity.R` - unpublished tests for measuring how close a GAM is to a linear trend line, as a way to measure the GAM's linearity
- `gam_summary.R` - A scratchpad file for extracting summary data
- `gamm4_gams.R` - Experiment to see if we can use the `gamm4` package instead of `mgcv`. Not very thoroughly developed.
- `linear-models.R` - create a number of mixed effect models, with different random effects
- `anovas.R` - unpublished ANOVA code for comparing interpolated models
- `bigram-trigram-random-lin-models.R` - playing around with the `stargazer` package for displaying different summaries of linear models with random effects
- `likelihood-ratio-tests.R` - uses the `drop1` function to compare different LMs
- `lm-random-effects.R` - linear models with random effects for words, and removing random correlations
- `lm-iterative-emnlp.R` - chi-sq tests for EMNLP paper
- `unpubbed_lmer_models.R` - similar models to those published in CMCL and for EMNLP
- `gam_plots.R` - (This does require precompiled GAM data structures, such as those from `fitted_df_gams.R`.) This created the GAM plots for the CMCL paper, using base R.
- `ll-lm.R` - plots for CMCL paper with delta log likelihood for different language models
- `perplexity-plot.R` - unused perplexity plot
- `probability-densities.R` - unused file playing around with `dplyr` and `melt`
The use of `create_logprob_corpus_vectors` is optional and is not necessary for this project.
- Interpolation experiments can be run in parallel. Since they require a lot of processing power, it is recommended to run them remotely on something like AWS.
- NB: For Northwestern users, you must be logged into the Northwestern VPN to access the lab's AMIs.
- The interpolation experiments require the use of the large binary language model files created above.
  - For NU, these are stored on an AWS instance in the `kenlm/` directory
- Experiments are run from the shell scripts located in `create_probabilities_from_raw_data/`.
  - The script `run-all-experiments.sh` will launch a `screen` instance for each of the 20 Dundee files
  - Each instance then calls `run-experiment.sh`, which loads the anaconda interpreter, and then runs `process_experiment.py`
    - `process_experiment.py` is the same file used above; also, the corpora have already been created, so those lines should be commented out
  - The important line is `create_logprob_corpus_vectors.create(tokenized_line_file, logprob_file)`
- As mentioned above, these are created in the `create()` method of `create_logprob_corpus_vectors.py`.
- Input
  - `tokenized_line_file` is a comma-delimited file with the following features
    - [0] - file_id
    - [1] - token_index
    - [2] - token
  - `logprob_file` is a file where the output will be stored
- Output - a csv file
  - [0] - interpolation weight, e.g. [0.00 - 1.00]
  - [1] - from input
  - [2] - the logsumexp of the multiplicatively interpolated probability vector (see below)
  - [3] - the token probability from the multiplicatively interpolated vector
- Processing
  - multiplicative interpolation vector
    - calculated from interpolating the optimal vector (sometimes called "lstm vector") and the ngram vector, at the interpolation weight, as:

      ```
      np.multiply(optimal_vec, np.array([vec_w])) + np.multiply(ngram_vec, np.array([(1 - vec_w)]))
      ```
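  A sketch of how output fields [2] and [3] plausibly fit together is below; treating [3] as the mixed value renormalized by the logsumexp is an assumption, and the actual computation lives in `create_logprob_corpus_vectors.py`.

  ```python
  import numpy as np
  from scipy.special import logsumexp

  def interpolate_token(optimal_vec, ngram_vec, token_idx, vec_w):
      """optimal_vec and ngram_vec are log-probability vectors over the vocabulary;
      "multiplicative" interpolation is a weighted sum in log space (a product of
      powers in probability space)."""
      mixed = np.multiply(optimal_vec, np.array([vec_w])) + np.multiply(ngram_vec, np.array([1 - vec_w]))
      norm = logsumexp(mixed)                 # output field [2]
      return norm, mixed[token_idx] - norm    # output field [3]
  ```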
- `multi_interpolate.R`
  - What kind of input
  - What it does
  - What it outputs
- `evaluation_code/CIs-effective-interp.R`
- `evaluation_code/interpolate.R`
- `evaluation_code/bnc.R` - attempting the same filtering as Smith & Levy (2013) for British versus American spelling
- `evaluation_code/perplexity.R` - for calculating perplexity of various LMs. This is referenced in `process_experiments.py`.
- Within the `create()` function, the language models must be set to the ones you would like for interpolation
  - For the published papers, we use the optimal LSTM, 5-gram, and 2-gram models
- The following should all be set for the experiment:

  ```
  begin_weight = 0.01
  end_weight = 1.00
  increment_weight = 0.01
  ```
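  These presumably define the sweep over interpolation weights; a guess at how they are consumed (the actual loop is in `create_logprob_corpus_vectors.py`):

  ```python
  import numpy as np

  begin_weight, end_weight, increment_weight = 0.01, 1.00, 0.01
  for vec_w in np.arange(begin_weight, end_weight + increment_weight / 2, increment_weight):
      pass  # run the interpolation at this weight and append its rows to the output csv
  ```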