This project trains several language models and evaluates them on two test corpora.
- Verify that `preprocessing.py`, `modelling.py`, `questions.py`, and the three corpora (`brown-train.txt`, `brown-test.txt`, `learner-test.txt`) are all in your current directory.
- Run `python3 questions.py` in your terminal from that directory.
- The program will output four files:
  - `output.txt` contains the answers to the questions below.
  - `brown-train-PP.txt` contains the preprocessed Brown training corpus.
  - `brown-test-PP.txt` contains the preprocessed Brown test corpus.
  - `learner-test-PP.txt` contains the preprocessed Learner test corpus.
- Open `output.txt` to obtain the answers to the questions listed below.
Each corpus is a collection of texts, one sentence per line. `brown-train.txt` contains 26,000 sentences from the Brown corpus; this corpus was used to train the language models. The test corpora (`brown-test.txt` and `learner-test.txt`) were used to evaluate the trained language models. `brown-test.txt` is a collection of sentences from the Brown corpus, distinct from the training data, and `learner-test.txt` contains essays written by non-native writers of English that are part of the FCE Corpus.
Prior to training, the following pre-processing steps were completed:
- Padding each sentence in the training and test corpora with start and end symbols (`<s>` and `</s>`, respectively).
- Lowercasing all words in the training and test corpora. Note that the data has already been tokenized (i.e. the punctuation has been split off words).
- Replacing all words that occur only once in the training data with the token `<unk>`. Every word in the test data not seen in training was treated as `<unk>`.
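A minimal sketch of these pre-processing steps is shown below, assuming each corpus is read as a list of sentence strings. The function and variable names are illustrative and are not taken from `preprocessing.py`.

```python
from collections import Counter

def preprocess(train_sents, test_sents):
    """Sketch of the pre-processing steps: pad, lowercase, and map rare/unseen words to <unk>."""
    # Pad each (already tokenized) sentence with start/end symbols and lowercase its words.
    def pad(line):
        return ["<s>"] + [w.lower() for w in line.split()] + ["</s>"]

    train = [pad(line) for line in train_sents]
    test = [pad(line) for line in test_sents]

    # Words that occur only once in the training data are replaced with <unk>.
    counts = Counter(w for sent in train for w in sent)
    train = [[w if counts[w] > 1 else "<unk>" for w in sent] for sent in train]

    # Any test word not present in the (now <unk>-mapped) training vocabulary becomes <unk>.
    vocab = {w for sent in train for w in sent}
    test = [[w if w in vocab else "<unk>" for w in sent] for sent in test]
    return train, test, vocab
```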
`brown-train.txt` was used to train the following language models:
- A unigram maximum likelihood model.
- A bigram maximum likelihood model.
- A bigram model with Add-One smoothing.
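The sketch below illustrates how these three estimators could be computed from the preprocessed corpus produced above (a list of token lists). All names are illustrative rather than taken from `modelling.py`, and conventions such as whether `<s>` is included in the unigram counts may differ from the actual implementation.

```python
from collections import Counter

def train_counts(train):
    """Collect unigram and bigram counts from a preprocessed corpus (list of token lists)."""
    unigrams = Counter(w for sent in train for w in sent)
    bigrams = Counter((sent[i], sent[i + 1]) for sent in train for i in range(len(sent) - 1))
    total = sum(unigrams.values())   # number of word tokens (incl. <s>, </s>, <unk>)
    vocab_size = len(unigrams)       # number of word types
    return unigrams, bigrams, total, vocab_size

def p_unigram(w, unigrams, total):
    # Unigram MLE: P(w) = c(w) / N
    return unigrams[w] / total

def p_bigram_mle(prev, w, unigrams, bigrams):
    # Bigram MLE: P(w | prev) = c(prev, w) / c(prev); zero for unseen bigrams
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_bigram_addone(prev, w, unigrams, bigrams, vocab_size):
    # Add-One smoothing: P(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
```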
The following questions were answered:
- How many word types (unique words) are there in the training corpus? Please include the padding symbols and the unknown token.
- How many word tokens are there in the training corpus?
- What percentage of word tokens and word types in each of the test corpora did not occur in training (before you mapped the unknown words to `<unk>` in the training and test data)?
- What percentage of bigrams (bigram types and bigram tokens) in each of the test corpora did not occur in training (treat `<unk>` as a token that has been observed)?
- Compute the log probabilities of the following sentences under the three models (ignore capitalization and pad each sentence as described above). Please list all of the parameters required to compute the probabilities and show the complete calculation. Which of the parameters have zero values under each model? Use log base 2 in your calculations. Map words not observed in the training corpus to the token `<unk>`. (See the sketches after this list.)
  - He was laughed off the screen .
  - There was no compulsion behind them .
  - I look forward to hearing your reply .
- Compute the perplexity of each of the sentences above under each of the models.
- Compute the perplexity of each entire test corpus (`brown-test.txt` and `learner-test.txt` separately) under each of the models.
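As one illustration of how the out-of-vocabulary percentages above could be computed, the sketch below compares a test corpus against the training vocabulary before any `<unk>` mapping. The names are illustrative and not taken from `questions.py`.

```python
from collections import Counter

def oov_percentages(train_tokens, test_tokens):
    """Percentage of test word tokens and word types never seen in training,
    computed on the lowercased, padded data, before <unk> mapping."""
    train_vocab = set(train_tokens)
    test_counts = Counter(test_tokens)

    unseen_tokens = sum(c for w, c in test_counts.items() if w not in train_vocab)
    unseen_types = sum(1 for w in test_counts if w not in train_vocab)

    pct_tokens = 100.0 * unseen_tokens / sum(test_counts.values())
    pct_types = 100.0 * unseen_types / len(test_counts)
    return pct_tokens, pct_types
```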
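The log probability and perplexity questions are closely related: the log probability of a sentence is the sum of the log (base 2) probabilities of its n-grams, and perplexity is 2 raised to the negative average log probability per predicted token. The sketch below shows this for the Add-One bigram model from the earlier sketch; the helper names and the choice to count `len(sent) - 1` predicted tokens per sentence (i.e. excluding the initial `<s>`) are assumptions, not necessarily the conventions used in the actual code.

```python
import math

def sentence_logprob_addone(sent, unigrams, bigrams, vocab_size):
    """Log (base 2) probability of a padded, lowercased, <unk>-mapped sentence
    under the Add-One smoothed bigram model."""
    logprob = 0.0
    for prev, w in zip(sent, sent[1:]):
        p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log2(p)
    return logprob

def perplexity_addone(sents, unigrams, bigrams, vocab_size):
    """Corpus perplexity: 2 ** (-total log2 probability / number of predicted tokens)."""
    total_logprob = 0.0
    total_tokens = 0
    for sent in sents:
        total_logprob += sentence_logprob_addone(sent, unigrams, bigrams, vocab_size)
        total_tokens += len(sent) - 1   # number of bigram predictions per sentence
    return 2 ** (-total_logprob / total_tokens)
```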