The following six steps describe how to feed aligned CLDF data into loanpy, how to mine sound correspondences, and how to evaluate and visualise their predictive power.
Phonotactic inventories are necessary to predict phonotactic repairs during loanword adaptation.
cldfbench ronataswestoldturkic.mineEAHinvs invs.json
.. automodule:: ronataswestoldturkiccommands.mineEAHinvs
   :members:
Since any existing phoneme can be adapted when entering a language through a loanword, we have to create a heuristic adaptation prediction for as many IPA characters as possible, in this case 6491.
cldfbench ronataswestoldturkic.makeheur EAH heur.json
.. automodule:: ronataswestoldturkiccommands.makeheur
   :members:
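To make the idea of a heuristic adaptation prediction more concrete, here is a minimal sketch of one possible approach: every sound is mapped to the phonemes of a recipient inventory, ranked by feature distance. The three binary features, the toy inventory, and the function names ``hamming`` and ``heuristic`` are assumptions made for illustration. This is not the code behind ``makeheur``, and the real command covers far more IPA characters.

```python
def hamming(a, b):
    """Number of feature values in which two phonemes differ."""
    return sum(x != y for x, y in zip(a, b))


def heuristic(features, inventory):
    """Map every known sound to the inventory phonemes, closest first."""
    return {
        sound: sorted(inventory, key=lambda p: hamming(vec, features[p]))
        for sound, vec in features.items()
    }


# Toy binary features: (voiced, nasal, labial). Illustrative only.
features = {
    "p": (0, 0, 1), "b": (1, 0, 1), "m": (1, 1, 1),
    "t": (0, 0, 0), "d": (1, 0, 0), "n": (1, 1, 0),
}
inventory = ["p", "t", "m"]  # hypothetical recipient-language inventory

heur = heuristic(features, inventory)
print(heur["b"])  # candidates for a sound missing from the inventory
print(heur["n"])
```

Because Python's sort is stable, candidates at the same distance keep their order in the inventory list; a real implementation would need an explicit tie-breaking rule.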
The output will serve as fuel for predicting loanword adaptations and historical reconstructions later on.
cldfbench ronataswestoldturkic.minesc H EAH
cldfbench ronataswestoldturkic.minesc WOT EAH heur.json
.. automodule:: ronataswestoldturkiccommands.minesc
   :members:
The sound-correspondence file is stored as machine-readable JSON. To create a human-readable TSV file, run:
cldfbench ronataswestoldturkic.vizsc H EAH
cldfbench ronataswestoldturkic.vizsc WOT EAH
.. automodule:: ronataswestoldturkiccommands.vizsc
   :members:
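The conversion from JSON to a tab-separated table can be illustrated with a short sketch. The dictionary layout used here (each source sound mapped to a list of target sounds), the column names, and the function name ``sc_to_tsv`` are assumptions for illustration. They are not the actual structure written by ``minesc``, nor the implementation of ``vizsc``.

```python
import csv
import io
import json


def sc_to_tsv(sc_json: str) -> str:
    """Flatten a {source: [targets]} mapping into tab-separated rows."""
    # Hypothetical layout: each source sound maps to a list of target sounds.
    sound_corr = json.loads(sc_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target"])
    for source, targets in sound_corr.items():
        for target in targets:
            writer.writerow([source, target])
    return buffer.getvalue()


# Toy input, not taken from the real sound-correspondence file.
toy = json.dumps({"a": ["a", "o"], "k": ["h"]})
table = sc_to_tsv(toy)
print(table)
```

One row per source-target pair keeps the table easy to sort and filter in a spreadsheet, which is the main reason to prefer TSV for manual inspection.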
In this section, we check the predictive power of the mined sound correspondences with loanpy's ``eval_all`` function:
cldfbench ronataswestoldturkic.evalsc H EAH "[10, 100, 500, 700, 1000, 5000, 7000]"
cldfbench ronataswestoldturkic.evalsc WOT EAH "[10, 100, 500, 700, 1000, 5000, 7000]" True True heur.json
.. automodule:: ronataswestoldturkiccommands.evalsc
   :members:
To gauge the performance of the model, we can plot an ROC curve, calculate its optimum cut-off value and its area under the curve (AUC), a common metric to evaluate predictive models:
cldfbench ronataswestoldturkic.plot_eval H EAH
cldfbench ronataswestoldturkic.plot_eval WOT EAH
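As a rough illustration of what these metrics measure, the sketch below computes ROC points, the AUC, and an optimum cut-off from toy numbers. It assumes that the relative number of false positives is the guess-list limit divided by the largest limit, and it picks the optimum by Youden's J (true-positive rate minus false-positive rate). Both choices, like the function names, are assumptions; ``plot_eval`` may use different criteria.

```python
def roc_points(guess_limits, hits, total_words):
    """Return (fpr, tpr) pairs, one per guess-list limit."""
    max_guesses = max(guess_limits)
    return [(g / max_guesses, h / total_words)
            for g, h in zip(guess_limits, hits)]


def auc(points):
    """Trapezoidal area under the curve, anchored at (0, 0)."""
    pts = [(0.0, 0.0)] + sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def optimum(points):
    """Point maximising Youden's J = TPR - FPR (one common criterion)."""
    return max(points, key=lambda p: p[1] - p[0])


# Toy numbers, not results from the dataset.
limits = [10, 100, 1000]  # guesses allowed per word
hits = [2, 6, 8]          # words predicted correctly within each limit
points = roc_points(limits, hits, total_words=10)

print(auc(points))
print(optimum(points))
```

In this toy case the optimum lies at the middle limit: allowing ten times more guesses beyond it adds only a few correct predictions.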
The results:
Predicting reconstructions from modern Hungarian words:
The ROC curve shows how the relative number of true positives (y-axis) increases as the relative number of false positives (x-axis) increases. The optimal cut-off point is at 700 false positives per word, which yields 284 correct reconstructions out of 406 (i.e. 70%). The AUC is just above 0.7, which is considered acceptable. Note that the relative number of false positives and the AUC stay the same irrespective of whether false positives are counted on a per-word basis (7000) or as an aggregate sum (7000 * 813 = 5,691,000). The absolute number of possible true positives (406) was reached after pre-filtering the 512 cognate sets of the raw data in Part 2 step 1.
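The invariance noted above follows from simple arithmetic: scaling numerator and denominator by the same factor leaves the ratio unchanged. The short check below uses only the figures quoted in the text; the variable names are illustrative.

```python
from fractions import Fraction

max_per_word = 7000  # largest guess-list length evaluated
cutoff = 700         # optimal cut-off reported above
multiplier = 813     # factor used for the aggregate sum in the text

# Relative false positives, counted per word and as an aggregate sum.
per_word = Fraction(cutoff, max_per_word)
aggregate = Fraction(cutoff * multiplier, max_per_word * multiplier)

aggregate_total = max_per_word * multiplier
hit_rate = round(100 * 284 / 406)  # correct reconstructions in percent

print(per_word == aggregate)
print(aggregate_total)
print(hit_rate)
```

Since every point on the x-axis is rescaled by the same factor, the shape of the curve, and therefore the AUC, is identical under both ways of counting.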
The performance of this model can be improved by removing irregular sound correspondences. By inspecting the file ``loanpy/H2EAHsc.tsv`` we can see that many words contain sound correspondences that occur only once in the entire etymological dictionary. Counting those cognate sets shows that 106 out of 406 (26%) of all etymologies contain at least one irregular sound correspondence, i.e. one that occurs in only a single etymology. (Note that the pre-filtering did not skew this ratio because it picked all cognate sets with an Early Ancient Hungarian and a Hungarian counterpart.) If we remove those 106 cognate sets with irregular sound correspondences from our training and test data, 300 cognate sets remain and we get the following result:
This model performs significantly better than the previous one. At its optimum of 100 guesses per word it reconstructs 279 out of 300 forms (93%) correctly. The AUC is above 0.9, which is considered outstanding.
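The filtering step described above can be sketched as follows. The data layout (cognate-set ID mapped to a list of source-target pairs) and the function name ``drop_irregular`` are assumptions for illustration. The toy data are invented and are not taken from ``loanpy/H2EAHsc.tsv``.

```python
from collections import Counter


def drop_irregular(cognate_sets):
    """Keep cognate sets whose correspondences all occur in more than one set."""
    # Count in how many cognate sets each correspondence occurs.
    counts = Counter(
        sc for corrs in cognate_sets.values() for sc in set(corrs)
    )
    return {
        cog_id: corrs
        for cog_id, corrs in cognate_sets.items()
        if all(counts[sc] > 1 for sc in corrs)
    }


# Toy data: cognate-set ID -> list of (source, target) correspondences.
toy = {
    1: [("k", "k"), ("a", "a")],
    2: [("k", "k"), ("a", "a")],
    3: [("k", "h"), ("a", "a")],  # ("k", "h") occurs only in this set
}
kept = drop_irregular(toy)
print(sorted(kept))
```

The sketch makes a single pass: correspondences are counted once, before any set is removed. Removing sets can in principle turn further correspondences into singletons, so whether to repeat the filter is a separate design decision.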
Predicting loanword adaptations from West Old Turkic words:
Out of 512 etymologies, 384 contain loanword adaptations from West Old Turkic into Early Ancient Hungarian. This pre-filtering was carried out in Part 2 step 1. At its optimum of 100 guesses per word, the model predicted 346 out of 384 words correctly (90%). The AUC is above 0.9, which is considered outstanding.
What happened under the hood:
.. automodule:: ronataswestoldturkiccommands.plot_eval
   :members: