Modeling truncation in Brazilian Portuguese
Pham, Mike and Jackson L. Lee. 2018/Forthcoming. Mincing words: Balancing recovery and deletion in word truncation. Glossa.
This repo contains code, datasets, and results associated with this publication.
main.py: Python script for running the different truncation models over the data files in the
data/: directory for datasets
pt_br_full.txt: Brazilian Portuguese lexicon (8.7 MB)
gold_standard.txt: gold standard nouns with attested truncation, all annotated with predictions from models such as the binary foot models
plots_for_words/: directory for output plots for individual words for log(left-complete counts) and log(right-complete counts)
results/: directory for output files
- error details in CSV
- error distribution boxplot
- L- and R-complete counts for individual words in LaTeX and PDF
- R code for making individual plots for test words
readme.md: this readme file
NumPy, Pandas, and Seaborn are required to run
For reproducibility, the exact versions we use are pinned down in
requirements.txt, and you can install these dependencies by
pip install -r requirements.txt.
We are using Python 3.6.3.
Optional -- If the following commands are recognized in your path environment:
main.py) is run and the plots for individual words are saved in
main.py) is compiled and the resultant PDF is saved in
Rscript comes from R, whereas
xelatex comes from a LaTeX distribution
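To check ahead of time whether both optional commands are visible on your PATH, a small Python sketch like the one below may help. It is not part of this repository: the function name `find_optional_tools` is made up for illustration, and it relies only on `shutil.which` from the standard library.

```python
import shutil

# Optional external commands mentioned in this readme.
OPTIONAL_COMMANDS = ("Rscript", "xelatex")


def find_optional_tools(commands=OPTIONAL_COMMANDS):
    """Map each command name to its resolved path, or None if not on PATH.

    Hypothetical helper for illustration; main.py may detect these
    commands differently.
    """
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        status = path if path is not None else "not found on PATH"
        print(f"{name}: {status}")
```

If either command is reported as not found, the corresponding optional step (running R or compiling LaTeX) is presumably unavailable on that machine.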
Download this repository to your local drive by one of these two methods:
Download and unzip https://github.com/jacksonllee/BP-truncation/archive/master.zip
Clone this repository:
$ git clone https://github.com/jacksonllee/BP-truncation.git
$ cd BP-truncation
main.py can take optional arguments.
-h brings up the help page with details of these arguments:
$ python main.py -h
usage: main.py [-h] [-f] [-l] [-r] [-d] [-x LEXICON] [-g GOLDSTANDARD]

Modeling truncation in Brazilian Portuguese, by Mike Pham and Jackson Lee

optional arguments:
  -h, --help            show this help message and exit
  -f, --freqtoken       Use token frequencies in lexicon (default: False)
  -l, --latex           Compile the output LaTeX file (default: False)
  -r, --run_r_script    Run R script (default: False)
  -d, --digraphsfixed   Change orthographic digraphs into monographs (default:
                        False)
  -x LEXICON, --lexicon LEXICON
                        Lexicon file (default: data/pt_br_full.txt)
  -g GOLDSTANDARD, --goldstandard GOLDSTANDARD
                        Gold standard file (default: data/gold_standard.txt)
The sample output files included in this repository are generated by this command:
$ python main.py -lr
This command has most of the default settings as described in the help page
shown above, except that R code is run and LaTeX compilation is triggered.
If you don't want to run R and LaTeX, simply run
python main.py with no other arguments.
If, for instance, you'd like to make use of word token frequency information
in the models that involve right-completes and left-completes,
you should run
python main.py -flr (still running R and LaTeX).
All output files for this command bear the suffix "-tokenfreq".
To activate orthographic digraph replacements, run
python main.py -dlr.
All output files are suffixed by "-nodigraphs".
The -x and -g options may be used to override the default lexicon
and gold standard files, respectively.
See the sections below on their file format.
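For readers who want to see how flags like these fit together, here is a minimal `argparse` sketch that mirrors the help page shown above. It is a reconstruction from the help text only, not the actual code in main.py; the function name `build_parser` is an assumption, and the attribute names simply follow argparse's default naming from the long options.

```python
import argparse


def build_parser():
    """Build a parser with the flags listed in the help page (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Modeling truncation in Brazilian Portuguese, "
                    "by Mike Pham and Jackson Lee",
        # Appends "(default: ...)" to each help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Boolean switches: False unless the flag is given.
    parser.add_argument("-f", "--freqtoken", action="store_true",
                        help="Use token frequencies in lexicon")
    parser.add_argument("-l", "--latex", action="store_true",
                        help="Compile the output LaTeX file")
    parser.add_argument("-r", "--run_r_script", action="store_true",
                        help="Run R script")
    parser.add_argument("-d", "--digraphsfixed", action="store_true",
                        help="Change orthographic digraphs into monographs")
    # Options that take a file path.
    parser.add_argument("-x", "--lexicon", default="data/pt_br_full.txt",
                        help="Lexicon file")
    parser.add_argument("-g", "--goldstandard",
                        default="data/gold_standard.txt",
                        help="Gold standard file")
    return parser


if __name__ == "__main__":
    # Short boolean flags can be combined, as in "python main.py -lr".
    args = build_parser().parse_args(["-lr"])
    print(args.latex, args.run_r_script, args.freqtoken)
```

Because the switches are plain booleans, combinations such as -flr or -dlr need no special handling in the parser itself.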
The lexicon file
data/pt_br_full.txt is a plain text file
where each line begins with a word, and then a space, and finally
the frequency count of that word.
Here are the first ten lines of
the lexicon file:
que 12021478
não 9712854
o 9578625
de 8089861
a 7188507
é 6843557
você 6211533
e 5863939
eu 5741437
um 4589127
This lexicon file is from here (released with an MIT license), which in turn derived the lexicon and frequency counts from movie subtitles. The data is therefore highly representative of the spoken language.
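If you are preparing your own lexicon file, the format described above can be read with a few lines of Python. This is a minimal sketch and not the loader used by main.py; the function names `parse_lexicon_lines` and `read_lexicon` are hypothetical, and UTF-8 encoding is an assumption.

```python
def parse_lexicon_lines(lines):
    """Return a dict mapping each word to its integer frequency count.

    Each non-empty line is expected to be "<word> <count>".
    """
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: word on the left, count on the right.
        word, count = line.rsplit(None, 1)
        lexicon[word] = int(count)
    return lexicon


def read_lexicon(path):
    """Read a lexicon file from disk (UTF-8 is assumed here)."""
    with open(path, encoding="utf-8") as f:
        return parse_lexicon_lines(f)


if __name__ == "__main__":
    sample = ["que 12021478", "não 9712854", "o 9578625"]
    print(parse_lexicon_lines(sample))
```

A file that parses cleanly this way should be in the right shape for the lexicon option listed in the help page.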
Gold standard file
The gold standard file is a plain text file
where each line has one original (untruncated) word, and then a
space/tab, and finally the truncated form (TF) of that original word;
the TF is just for reference and is not used by
main.py in any way.
The original word is annotated by the following symbols:
|: where the true truncation point is for forming the truncated stem (TS). For instance, the TS for adrenalina (see below) is adren.
$: where the truncation point is as predicted by the binLR model.
#: where the truncation point is as predicted by the binRL model.
Here are the first ten lines of the default gold standard file in this repository:
adr$en|#alina adrena
an$alf|#abeto analfa
bat$#er|ista batera
bel$e|z#a belê
berm|$ud#a bermas
bij$#u|teria biju
bis|$av#ó bisa
bob|$eir#a bobis
bot$equ|#im boteco
burg|$#ês burga
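To make the annotation scheme concrete, the sketch below decodes one gold standard line into the plain word, the true truncated stem, and the stems predicted by the binLR and binRL models. It is an illustration only, not the parsing code in main.py; the function names are hypothetical, and it assumes each of the three symbols appears exactly once in the annotated word.

```python
ANNOTATION_SYMBOLS = "|$#"


def strip_symbols(text):
    """Remove all annotation symbols from a string."""
    for symbol in ANNOTATION_SYMBOLS:
        text = text.replace(symbol, "")
    return text


def parse_gold_line(line):
    """Decode a line such as 'adr$en|#alina adrena' (a sketch).

    Assumes '|', '$', and '#' each occur exactly once in the annotated word.
    """
    annotated, truncated_form = line.split()  # separated by space or tab

    def stem_before(symbol):
        # Everything to the left of the symbol, minus other annotation symbols.
        return strip_symbols(annotated[:annotated.index(symbol)])

    return {
        "word": strip_symbols(annotated),
        "truncated_form": truncated_form,  # reference only
        "true_stem": stem_before("|"),
        "binLR_stem": stem_before("$"),
        "binRL_stem": stem_before("#"),
    }


if __name__ == "__main__":
    print(parse_gold_line("adr$en|#alina adrena"))
```

For adrenalina, the true stem is adren, the binLR stem is adr, and the binRL stem coincides with the true stem because # immediately follows | in the annotation.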