This program calculates the forward and backward conditional probabilities of Norwegian Bokmål words in the sentences of a text file. The calculation is based on the formulae in Onnis et al. (2022) and uses data from the n-gram database of the National Library of Norway.
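To illustrate the two quantities, here is a minimal sketch. It assumes the common bigram definitions: forward probability as count(w1 w2) / count(w1), and backward probability as count(w1 w2) / count(w2). The exact formulae in Onnis et al. (2022) may differ in detail. The function names and the toy sentences are hypothetical and are not taken from ngram.py or the n-gram database.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and bigrams in a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Pair every token with its successor inside the same sentence.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def forward_probability(prev, word, unigrams, bigrams):
    """Assumed definition: P(word | prev) = count(prev word) / count(prev)."""
    if unigrams[prev] == 0:
        return None  # the context word was never seen
    return bigrams[(prev, word)] / unigrams[prev]


def backward_probability(prev, word, unigrams, bigrams):
    """Assumed definition: P(prev | word) = count(prev word) / count(word)."""
    if unigrams[word] == 0:
        return None  # the target word was never seen
    return bigrams[(prev, word)] / unigrams[word]


# Toy corpus, made up for illustration only.
toy = [
    ["jeg", "liker", "kaffe"],
    ["jeg", "liker", "te"],
    ["du", "liker", "kaffe"],
]
unigrams, bigrams = count_ngrams(toy)

fwd = forward_probability("liker", "kaffe", unigrams, bigrams)   # 2 / 3
bwd = backward_probability("liker", "kaffe", unigrams, bigrams)  # 2 / 2
print(f"forward={fwd:.3f} backward={bwd:.3f}")
```

In this toy corpus "liker" is followed by "kaffe" in two of its three occurrences, while "kaffe" is always preceded by "liker", so the backward probability is higher than the forward one.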
- Download the Norwegian Bokmål n-gram database from the National Library of Norway to the project directory.
- Decompress the archive, e.g. by opening a command line and issuing the command:
tar xf ngram_nob.tar.gz
After a few minutes, you should have a bokm folder in the project directory.
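If the tar command is not available (for example on some Windows setups), the archive can probably also be unpacked with Python's standard tarfile module. This is a sketch, not part of the project: the helper name extract_archive is hypothetical, and it assumes the download is an ordinary gzip-compressed tar file.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract a .tar.gz archive into target_dir.

    Returns the sorted top-level names found in the archive, so you can
    check which folder was created. Only use this on archives you trust.
    """
    target_dir = Path(target_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Collect the first path component of every member.
        top_level = sorted({name.split("/")[0] for name in archive.getnames()})
        archive.extractall(target_dir)
    return top_level
```

Calling extract_archive("ngram_nob.tar.gz") from the project directory should have the same effect as the tar command above.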
Run the nGramIzer on any number of text files with the command:
py -3 ngram.py input_file [input_file_2]...
Output files are generated with a _result.csv suffix.
Note: building the dictionaries takes some time before the actual analysis runs.
The generated CSV will contain the following columns:
- Sentence Number
- Word Number
- Word
- Forward Probability
- Backward Probability
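For further analysis, a result file can be loaded with Python's csv module. The sketch below makes several assumptions that are not confirmed by this README: that the file has a header row with exactly the column names listed above, that it is comma-separated, and that a missing probability is written as an empty cell. The load_results helper and the sample rows are made up for illustration.

```python
import csv
import io


def _to_float(cell):
    """Convert a CSV cell to float; treat an empty cell as missing (assumption)."""
    return float(cell) if cell.strip() else None


def load_results(handle):
    """Read result rows from an open text file into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "sentence": int(row["Sentence Number"]),
            "word_number": int(row["Word Number"]),
            "word": row["Word"],
            "forward": _to_float(row["Forward Probability"]),
            "backward": _to_float(row["Backward Probability"]),
        })
    return rows


# Made-up sample data in the assumed layout, not real program output.
sample = io.StringIO(
    "Sentence Number,Word Number,Word,Forward Probability,Backward Probability\n"
    "1,1,jeg,,0.25\n"
    "1,2,liker,0.5,0.125\n"
)
rows = load_results(sample)
for r in rows:
    print(r["sentence"], r["word_number"], r["word"], r["forward"], r["backward"])
```

For a real file, open it with open(path, newline="", encoding="utf-8") and pass the handle to load_results; adjust the delimiter or column names if the actual output differs.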