This project trains several language models and evaluates them on two test corpora.
- Verify that `preprocessing.py`, `modelling.py`, `questions.py`, and the three corpora (`brown-train.txt`, `brown-test.txt`, `learner-test.txt`) are all in your current directory.
- Run `python3 questions.py` in your terminal from that directory.
- The program will output four files:
  - `output.txt` contains the answers to the questions below.
  - `brown-train-PP.txt` contains the preprocessed Brown training corpus.
  - `brown-test-PP.txt` contains the preprocessed Brown test corpus.
  - `learner-test-PP.txt` contains the preprocessed Learner test corpus.
- Open `output.txt` to obtain the answers to the questions listed below.
Each corpus is a collection of texts, one sentence per line. `brown-train.txt` contains 26,000 sentences from the Brown corpus; this corpus was used to train the language models. The test corpora (`brown-test.txt` and `learner-test.txt`) were used to evaluate the trained language models. `brown-test.txt` is a collection of sentences from the Brown corpus, distinct from the training data, and `learner-test.txt` contains essays written by non-native writers of English that are part of the FCE Corpus.
Prior to training, the following pre-processing steps were completed:
- Padding each sentence in the training and test corpora with start and end symbols (`<s>` and `</s>`, respectively).
- Lowercasing all words in the training and test corpora. Note that the data has already been tokenized (i.e. the punctuation has been split off words).
- Replacing all words that occur only once in the training data with the token `<unk>`. Every word in the test data not seen in training was treated as `<unk>`.
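A minimal sketch of these pre-processing steps is shown below, assuming each corpus is read as a list of sentence strings. The function and variable names are illustrative and are not taken from `preprocessing.py`.

```python
from collections import Counter

def preprocess(train_sents, test_sents):
    """Sketch of the pre-processing steps: pad, lowercase, and map rare/unseen words to <unk>."""
    # Pad each (already tokenized) sentence with start/end symbols and lowercase its words.
    def pad(line):
        return ["<s>"] + [w.lower() for w in line.split()] + ["</s>"]

    train = [pad(line) for line in train_sents]
    test = [pad(line) for line in test_sents]

    # Words that occur only once in the training data are replaced with <unk>.
    counts = Counter(w for sent in train for w in sent)
    train = [[w if counts[w] > 1 else "<unk>" for w in sent] for sent in train]

    # Any test word not present in the (now <unk>-mapped) training vocabulary becomes <unk>.
    vocab = {w for sent in train for w in sent}
    test = [[w if w in vocab else "<unk>" for w in sent] for sent in test]
    return train, test, vocab
```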
`brown-train.txt` was used to train the following language models:
- A unigram maximum likelihood model.
- A bigram maximum likelihood model.
- A bigram model with Add-One smoothing.
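The sketch below illustrates how these three estimators could be computed from the preprocessed corpus produced above (a list of token lists). All names are illustrative rather than taken from `modelling.py`, and conventions such as whether `<s>` is included in the unigram counts may differ from the actual implementation.

```python
from collections import Counter

def train_counts(train):
    """Collect unigram and bigram counts from a preprocessed corpus (list of token lists)."""
    unigrams = Counter(w for sent in train for w in sent)
    bigrams = Counter((sent[i], sent[i + 1]) for sent in train for i in range(len(sent) - 1))
    total = sum(unigrams.values())   # number of word tokens (incl. <s>, </s>, <unk>)
    vocab_size = len(unigrams)       # number of word types
    return unigrams, bigrams, total, vocab_size

def p_unigram(w, unigrams, total):
    # Unigram MLE: P(w) = c(w) / N
    return unigrams[w] / total

def p_bigram_mle(prev, w, unigrams, bigrams):
    # Bigram MLE: P(w | prev) = c(prev, w) / c(prev); zero for unseen bigrams
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_bigram_addone(prev, w, unigrams, bigrams, vocab_size):
    # Add-One smoothing: P(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
```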
The following questions were answered:
- How many word types (unique words) are there in the training corpus? Please include the padding symbols and the unknown token.
- How many word tokens are there in the training corpus?
- What percentage of word tokens and word types in each of the test corpora did not occur in training (before you mapped the unknown words to `<unk>` in the training and test data)?
- What percentage of bigrams (bigram types and bigram tokens) in each of the test corpora did not occur in training (treat `<unk>` as a token that has been observed)?
- Compute the log probabilities of the following sentences under the three models (ignore capitalization and pad each sentence as described above). Please list all of the parameters required to compute the probabilities and show the complete calculation. Which of the parameters have zero values under each model? Use log base 2 in your calculations. Map words not observed in the training corpus to the token `<unk>`. (See the sketches after this list.)
  - He was laughed off the screen .
  - There was no compulsion behind them .
  - I look forward to hearing your reply .
- Compute the perplexity of each of the sentences above under each of the models.
- Compute the perplexity of each entire test corpus (`brown-test.txt` and `learner-test.txt` separately) under each of the models.
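As one illustration of how the out-of-vocabulary percentages above could be computed, the sketch below compares a test corpus against the training vocabulary before any `<unk>` mapping. The names are illustrative and not taken from `questions.py`.

```python
from collections import Counter

def oov_percentages(train_tokens, test_tokens):
    """Percentage of test word tokens and word types never seen in training,
    computed on the lowercased, padded data, before <unk> mapping."""
    train_vocab = set(train_tokens)
    test_counts = Counter(test_tokens)

    unseen_tokens = sum(c for w, c in test_counts.items() if w not in train_vocab)
    unseen_types = sum(1 for w in test_counts if w not in train_vocab)

    pct_tokens = 100.0 * unseen_tokens / sum(test_counts.values())
    pct_types = 100.0 * unseen_types / len(test_counts)
    return pct_tokens, pct_types
```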
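The log probability and perplexity questions are closely related: the log probability of a sentence is the sum of the log (base 2) probabilities of its n-grams, and perplexity is 2 raised to the negative average log probability per predicted token. The sketch below shows this for the Add-One bigram model from the earlier sketch; the helper names and the choice to count `len(sent) - 1` predicted tokens per sentence (i.e. excluding the initial `<s>`) are assumptions, not necessarily the conventions used in the actual code.

```python
import math

def sentence_logprob_addone(sent, unigrams, bigrams, vocab_size):
    """Log (base 2) probability of a padded, lowercased, <unk>-mapped sentence
    under the Add-One smoothed bigram model."""
    logprob = 0.0
    for prev, w in zip(sent, sent[1:]):
        p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log2(p)
    return logprob

def perplexity_addone(sents, unigrams, bigrams, vocab_size):
    """Corpus perplexity: 2 ** (-total log2 probability / number of predicted tokens)."""
    total_logprob = 0.0
    total_tokens = 0
    for sent in sents:
        total_logprob += sentence_logprob_addone(sent, unigrams, bigrams, vocab_size)
        total_tokens += len(sent) - 1   # number of bigram predictions per sentence
    return 2 ** (-total_logprob / total_tokens)
```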