A finite state morphological analyzer for the Slovene language, written by Matt Gardner.
This analyzer is a fast, lightweight and easily extensible morphological disambiguation tool written with FOMA, an open source reimplementation of Xerox's finite state tool, xfst. The analyzer disambiguates words into the JOS morphosyntactic specifications (MSDs). As an example, one might input a word and get back a set of analyses:
$ foma foma: load slovene.bin foma: up apply up> jezikoslovje jezikoslovje+N+Common+Neut+Sing+Acc jezikoslovje+N+Common+Neut+Sing+Nom
or one could input an analysis and get an inflection:
foma: down apply down> jezikoslovje+N+Common+Neut+Plural+Gen jezikoslovij
This kind of analyzer is useful for a range of downstream natural language processing tasks, like part of speech tagging and parsing, as well as for an aid to learners of the language. Its main benefit over a lookup table is its extensibility; in order to add a word to a lookup table, all of its inflections must be manually, individually added. With a finite state analyzer, only the lemma needs to be added, and the morphology built into the analyzer will handle the inflection. Finite state analyzers can also give reasonable guesses for unseen words, where lookup tables would fail to give any output, though this analyzer still has only very limited support for unseen words.
The analyzer was built and tested using the Sloleks lexicon of Slovene language. This code contains a script that processes Sloleks and outputs a set of FOMA lexicon files for each part of speech, including whatever information is necessary for a word's inflection (e.g., whether or not a final "e" is fleeting, as in "pomemben" -> "pomembna" vs. "rumen" -> "rumena"). Each part of speech then has rules for generating the regular and irregular morphology of the language.
There are also scripts to test the output of analyzer against the MSDs contained in Sloleks. There are approximately 930,000 unique forms in Sloleks (except for those marked as spelling errors, which we have so far ignored), with an average of 3 possible MSD analyses for each form. The test scripts can test each part of speech individually, or all of them together, to see if the set of analyses for each word matches what is found in Sloleks. The system currently has 100% recall on that test, and 98.8% precision. That means that for every word in Sloleks, this morphological analyzer produces a set of MSDs that is a superset of analyses given in Sloleks, and for 98.8%, the sets match exactly. The remaining precision errors are mostly due to overprediction of number or gender, where a proper noun only has a singular form, or a possessive adjective only has a feminine form, or other similar issues.
100% recall was acheived by running a separate script that found remaining errors and automatically generated lexicon override files for the forms that were not correctly processed. This was done after the rules were such that about 96.5% of all forms were correctly processed - in lieu of encoding all of the long tail of irregular morphology by hand, this was done automatically.
To get some idea of the coverage of this analyzer, a popular Slovene news site, rtvslo.si, was scraped for textual content. In the text of the most popular articles, on average only between 2 and 5 percent of the tokens in the article are not contained in the 930,000 forms in Sloleks. The vast majority of these tokens are sequences of digits (e.g., "2009") or proper names. The analyzer was thus improved to give guesses for unseen words---the number guesser works well, though the proper name guesser currently vastly over-generates analyses. But taking those two classes of tokens out of the unseen pool leaves less than 1% of the tokens unrecognized by this analyzer, and most of those are due to processing errors (e.g., taking the period off of "oz.", which would otherwise be recognized as an abbreviation, but is left as "oz", which is unrecognized, or hyphenated words like "22-letna"). There is a script in this repository that will perform this analysis on the current top articles on rtvslo.si.
The resulting analyzer is easily extensible in a number of ways. If a newer, improved version of Sloleks is released, the scripts here will easily process it and update the generated lexicon files, assuming the format is the same. If a person wants to tackle some piece of the long tail of irregular morphology that is currently left to automated overrides, it is easy to modify a few rules in the files pertaining to that part of speech and run tests to see how much the analyzer is improved. And in implementing the number and proper name guessers, some initial work was also given to guessing common feminine nouns, which would cover the rare cases when a non-proper unseen word is encountered in real text. This is still very preliminary, but future work could expand the unseen word guessing capabilities of the analyzer and add tests that compare the guesses against the analyses given in Sloleks. Because of the design of the system, this should not be a lot of work for regular words; an addition of four total lines of code in two files produced a guesser for feminine nouns ending in "ost" that performs quite well.
The FOMA and other code contained in this repository are available to the
public under the terms of the GNU General Public License version 3, distributed
here in the
LICENSE file. The Sloleks
lexicon is available under a Creative
Commons license (CC-BY, which
data/ directory, run
python create_lexica.py and
python create_tests.py. These commands will generate a number of files in the
tests/ directories. These files are checked in to the
repository, so if you modify the Sloleks processing scripts, you can see the
difference it made by running
Modifying lexical rules
For detailed information on
.foma files, look at FOMA's Getting
Started guide. Here I
will just explain the layout of the system. Each part of speech has
.lexc files in the
lexica/ directory. Using verbs as an
example, these are:
verbs.lexc, a file generated automatically by the
verbs_rules.lexc, a file written by hand that contains the basic morphological rules to inflect verbs,
verbs_overrides.lexc, a file written by hand that contains irregular morphology for a small number of highly irregular words, and
verbs_auto_overrides.lexc, a file generated automatically by running the
create_auto_overrides.pyscript (described later).
There is also a single file containing regular expression rules that perform
stem changes and other complex morphological processes. This file is
scripts/ directory contains four useful scripts, each of which is
described below. All of these scripts should be run from the base directory,
make_bin.py: This will combine all of the
.fomafiles, run them through FOMA, and produce a file called
slovene.bin, which is suitable for use with
flookup, or for loading into
foma. For more information on the use of those tools, see FOMA's user guide (linked above).
test.py: This script first calls
make_bin.py, then uses the resulting binary and the tests in the
tests/directory to judge the accuracy of the analyzer. For usage information run
python scripts/test.py --help. The results are shown in the
results/directory. The main results file is
results/[test]_stats.tsv, which is suitable to be viewed with
find_errors.py: This will look through the error file for a particular test and select only those whose MSD matches some input. This is useful, for example, if you want to see errors on only a particular kind of pronoun, even though you can only run tests on pronouns as a whole.
create_auto_overrides.py: This file looks at the error file for a particular test, as well as the test file (originally generated from Sloleks), and produces an
*_auto_overrides.lexc. If you are working to improve the analyzer, you should run tests without the auto overrides (with the
--no-auto-overridesoption), then re-run this script to reproduce the (now hopefully smaller)
In this directory there is a script for scraping a few news articles from
rtvslo.si and running some tests to see what unseen words there are. This
script is currently written to be run from the
scraping/ directory, not the