The foma FST compiler
The foma compiler is essentially a tool for converting regular expressions to finite automata and transducers. It supports a variety of operations (many more than are found in search-regex formalisms such as the Python
re module). The interface also includes tools for performing various tests on automata and transducers, passing words through transducers (doing translations), and importing and exporting transducers in various formats.
The foma program runs with a read-eval-print loop (REPL), like IPython/Jupyter. That means that each command given is executed, its output is printed, and a new prompt is displayed. You can run scripts of commands either by launching foma with the -l flag (foma -l filename) or by typing source filename inside foma.
When you start foma, it should respond with a title page, something like:
Foma, version 0.9.18alpha
...
Type "help" to list all commands available.
Type "help <topic>" or help "<operator>" for further help.

foma[0]:
If you type help at the prompt, you'll be given a long, terse list of help topics. This tutorial is intended to ease the learning curve with foma.
Compiling regular expressions
There are two commands, regex and define (def for short), that invoke the regular expression compiler. Saying
regex REGULAR-EXPRESSION ;
simply compiles the regular expression into an automaton/transducer and gives you the result. For example
regex a|b|c|d ;
would compile the regular expression in question and return the following:
371 bytes. 2 states, 4 arcs, 4 paths.
foma[1]:
The result says that the (minimal) finite-state machine that corresponds to the regex has 2 states and 4 arcs, and that there are 4 valid paths through it. This result gets stored on an internal stack, which is what the number inside the prompt foma[1]: refers to: you now have one finite-state machine stored on the internal stack.
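To make the "2 states, 4 arcs, 4 paths" summary concrete, here is a minimal Python sketch of that machine. The arc list is written by hand as an assumption about what the minimal automaton looks like; it is not produced by foma, and the names ARCS, START, FINALS, and paths are made up for this illustration.

```python
# Hand-built model (an assumption, not foma output) of the minimal
# automaton for a|b|c|d: state 0 is the start, state 1 is final.
ARCS = [(0, "a", 1), (0, "b", 1), (0, "c", 1), (0, "d", 1)]
START, FINALS = 0, {1}

def paths(state=START, prefix=""):
    """Yield every accepted word; terminates only for acyclic machines."""
    if state in FINALS:
        yield prefix
    for src, symbol, dst in ARCS:
        if src == state:
            yield from paths(dst, prefix + symbol)

# Collect every state mentioned anywhere in the machine.
states = {START} | FINALS | {src for src, _, _ in ARCS} | {dst for _, _, dst in ARCS}
print(f"{len(states)} states, {len(ARCS)} arcs, {len(list(paths()))} paths")
```

Each path corresponds to one word of the language, which is why the path count equals the number of alternatives in the union.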
The define command (def for short) has the following schema:
def NAME REGEX ;
For example, to define a regular language that contains the words cat and dog and give that language the name Animals, we could issue:
def Animals c a t | d o g ;
Note the spaces between the symbols. They are necessary, and are explained below. After a def, you can use the defined label in other regular expressions, for example:
regex Animals+ ;
would now re-use your earlier definition of
Animals and this regular expression is equivalent to saying:
regex [c a t | d o g]+ ;
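As a rough analogy in Python, a defined language can be thought of as a named set of words, and the + operator as repeated concatenation of its members. The sketch below is only an illustration: the real Animals+ language is infinite, so the hypothetical helper plus cuts it off at max_repeats.

```python
from itertools import product

ANIMALS = {"cat", "dog"}  # mirrors: def Animals c a t | d o g ;

def plus(language, max_repeats):
    """Words built from 1..max_repeats concatenated members of `language`.

    The cut-off is an assumption of this sketch; foma's + has no limit.
    """
    words = set()
    for n in range(1, max_repeats + 1):
        for combo in product(sorted(language), repeat=n):
            words.add("".join(combo))
    return words

print(sorted(plus(ANIMALS, 2)))
```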
To get an overview of the regular expressions you have defined (so-called language constants), you can say print defined in the foma interface.
If you have the GraphViz tool installed, you should be able to view the result of the last issued regex by typing
view. (Note that this only works on UNIXes such as Linux or OSX, not Windows). It is often very helpful to view the result of your regular expressions graphically and reason about them in two modalities.
For a simpler view of any transducer, you can type
net, which gives an ASCII listing of the transitions and basic information about the last issued regex.
You can also get information about the alphabet of a transducer with the sigma command.
Basic FST testing
If you issue the command
words you'll get a list of all the words accepted by an automaton. For example:
foma: regex c a t | d o g ;
455 bytes. 6 states, 6 arcs, 2 paths.
foma: words
cat
dog
With transducers, the
pairs command is often useful. It prints all the input/output pairs from a transducer.
foma: regex c:d a:o t:g;
407 bytes. 4 states, 3 arcs, 1 path.
foma: pairs
cat dog
upper-words and lower-words
These commands print out only the input-side or output-side sets from a transducer. Since automata are treated like transducers that repeat their input unchanged, the two commands print the same words for an automaton.
foma: regex c:d a:o t:g;
407 bytes. 4 states, 3 arcs, 1 path.
foma: upper-words
cat
foma: lower-words
dog
The down command enters a special prompt (which you can exit with CTRL-D) and lets you give a transducer an input, yielding the translation output:
foma: regex c:d a:o t:g;
407 bytes. 4 states, 3 arcs, 1 path.
foma: down
apply down> cat
dog
apply down> dog
???
If the transducer doesn't accept the input, then
??? is printed.
There is an equivalent up command, which works the same way, but in the inverse direction.
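One way to picture what down and up do is to treat a transducer as a set of input/output pairs and look words up on one side or the other. The Python sketch below assumes exactly that simplification; PAIRS, apply_down, apply_up, and show are hypothetical names, and a real transducer can of course encode infinitely many pairs.

```python
# Hypothetical stand-in for the transducer compiled from: regex c:d a:o t:g;
PAIRS = {("cat", "dog")}

def apply_down(word):
    """Return every output-side word paired with `word` on the input side."""
    return sorted(lower for upper, lower in PAIRS if upper == word)

def apply_up(word):
    """Return every input-side word paired with `word` on the output side."""
    return sorted(upper for upper, lower in PAIRS if lower == word)

def show(results):
    """Format results the way the apply prompt does, with ??? for no match."""
    return "\n".join(results) if results else "???"

print(show(apply_down("cat")))
print(show(apply_down("dog")))
```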
The regular expression formalism in foma differs somewhat from standard formalisms. It inherits a regular expression syntax called Xerox regular expressions. The main differences and things to look out for can be summarized as follows:
- All regular expressions need to end in a semicolon ;.
- Square brackets [ and ] are used for grouping. That means that a standard regular expression (a|b)* should be written [a|b]*. There are no "character classes", so something like [a-d] must be explicitly listed as a union in foma: [a|b|c|d].
- Parentheses, in turn, denote "optionality". So an expression such as c a t (s) would denote the two words cat and cats, since whatever is inside the parentheses is "optional". This comes from linguistic usage of parentheses.
- Multi-character symbols are used: these are strings that are themselves single symbols (see below).
- The wildcard symbol is ?. For example, the set of words that end in a can be described as ?* a.
- Apart from union |, other logical operators are allowed, such as intersection & and complement ~. Since regular languages really denote sets of strings, we can perform set operations on them. For example, the set of words that begin and end with a could be described as a ?* & ?* a.
- Special symbols are escaped either by placing them in quotes, e.g. "_" to talk about an underscore symbol, or by prepending a percent symbol %, for example %_.
- Transducer operations are included: the cross-product :, composition .o., and different string rewriting operators such as -> are commonly used.
- 0 represents the epsilon (ε) symbol. This is mostly used in creating transducers. For example, a transducer that accepts a as input and outputs nothing would be denoted by the regex
regex a:0 ;
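The set operations in the list above can be imitated with ordinary Python sets, as long as the universe of words is made finite. The sketch below assumes a two-letter alphabet and a length cut-off (ALPHABET and MAX_LEN are arbitrary choices for this illustration), whereas foma's ? and * range over an unbounded set of strings.

```python
from itertools import product

ALPHABET = "ab"   # assumption: a tiny alphabet so the universe stays small
MAX_LEN = 3       # assumption: length cut-off, since ?* is infinite in foma

# Every word over ALPHABET of length 0..MAX_LEN, including the empty word.
UNIVERSE = {"".join(p) for n in range(MAX_LEN + 1) for p in product(ALPHABET, repeat=n)}

starts_with_a = {w for w in UNIVERSE if w.startswith("a")}   # a ?*
ends_with_a = {w for w in UNIVERSE if w.endswith("a")}       # ?* a

both = starts_with_a & ends_with_a        # a ?* & ?* a
not_ending_in_a = UNIVERSE - ends_with_a  # ~[?* a], within the cut-off

print(sorted(both))
```

Note that the single word a belongs to both sets, so it is also in the intersection.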
Foma allows entire strings to be single atomic symbols. This means that the two regular expressions below will give a different result:
regex cat ;
regex c a t ;
Here, the first one contains the single-symbol word cat, whereas the second one contains a word made out of three symbols: c, a, and t. Compile the two regular expressions, and then issue the command view to see the difference when you view each as an automaton/transducer.
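The difference between the two expressions can be mimicked in Python by storing a word as a tuple of symbols rather than as a string. This is only an analogy with made-up names; it shows that two words can look identical when printed while consisting of different symbol sequences.

```python
single_symbol_word = ("cat",)          # regex cat ;    one arc labelled "cat"
three_symbol_word = ("c", "a", "t")    # regex c a t ;  three arcs

def surface(symbols):
    """Concatenate the symbols the way a printed word would appear."""
    return "".join(symbols)

print(surface(single_symbol_word), len(single_symbol_word))
print(surface(three_symbol_word), len(three_symbol_word))
```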
Multi-character symbols are almost never used except for grammatical tags such as
[Pl]. The logic is that such tags really aren't strings, but denote atomic ideas and so should be treated as single symbols. Morphological FSTs also become smaller if atomic symbols are used for tags.
Create a regular expression denoting the set that contains the words cat, cats, dog, dogs, horse, and horses (i.e. six words in total). Try to not repeat the s-part in your regular expression. Use the view command to examine the resulting automaton/transducer.
Commonly used notation:
|?*|zero or more of anything|
|\a|any symbol except a|
|\A|any symbol except one in the language A|
|.#.|word edge (in rewrite rules)|
|a*|zero or more a|
Transducers: the cross-product
The cross-product operator
: is the simplest way to create a transducer from a regular expression. For example:
foma: regex a:b;
228 bytes. 2 states, 1 arc, 1 path.
compiles a transducer that accepts
a as its input, and outputs
b. (View this transducer graphically).
Note that the : binds tighter than any other operator. This means that if you want to pair up strings, you'll need to indicate grouping with square brackets. For example, to define a transducer that maps cat to dog, you should do
foma: regex [c a t]:[d o g];
and not
regex c a t:d o g ;
since in this case you get a transducer that maps catog to cadog.
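The effect of the tight binding of : can be traced by writing each expression as a sequence of upper:lower symbol pairs. The Python sketch below does this by hand; the helper names are made up, and the symbol-by-symbol alignment used for the grouped version is an assumption chosen for illustration.

```python
def identity(symbols):
    """A bare symbol stands for itself on both the upper and lower side."""
    return [(s, s) for s in symbols]

# regex c a t:d o g ;   only t:d is a pair, the rest are plain symbols
ungrouped = identity("ca") + [("t", "d")] + identity("og")

# regex [c a t]:[d o g] ;   whole strings paired (alignment assumed here)
grouped = list(zip("cat", "dog"))

def sides(path):
    """Read off the upper (input) and lower (output) word of a path."""
    upper = "".join(u for u, _ in path)
    lower = "".join(l for _, l in path)
    return upper, lower

print(sides(ungrouped))
print(sides(grouped))
```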
Another example: here's a transducer that deletes the first and last letters of every input word:
foma: regex ?:0 ?* ?:0;
338 bytes. 3 states, 3 arcs, Cyclic.
foma: down
apply down> cats
at
Another example: here's a transducer that maps cat[Sg] to cat (foreshadowing some kind of crude morphological analysis). Note the escaping of the tag by quoting it as "[Sg]", which makes it a single symbol.
foma: regex c a t "[Sg]":0 ;
374 bytes. 5 states, 4 arcs, 1 path.
foma: up
apply up> cat
cat[Sg]
Note also that in the example, we ask (by applying up) what the input-side equivalent of the output-side cat is.
Use the cross-product operation to create a transducer that maps
- c a t [Sg] to c a t
- c a t [Pl] to c a t s
- d o g [Sg] to d o g
- d o g [Pl] to d o g s
Note that spaces are given here deliberately only to show that [Sg] and [Pl] should be multi-character symbols, whereas the rest should be single symbols.
Rewrite rule basics
Rewrite rules are often used when modeling morphophonology. These have the following basic format.
LHS -> RHS || LC _ RC ;
Here, LHS is the left-hand side, RHS is the right-hand side, LC is the left context, and RC is the right context.
A regular expression such as:
regex a -> b || c _ d ;
would create a transducer that converts all instances of a to b, but only if that a occurs between c and d. LC and RC are optional, so you could say:
regex a -> 0 || _ b ;
creating a transducer that deletes a-symbols when they occur before a b.
You can also drop the context altogether, and say:
regex x -> y ;
which creates a transducer that maps all x to y, always, passing everything else through untouched.
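For readers who know Python's re module, the three rules above behave much like substitutions with lookbehind and lookahead assertions. The sketch below is an approximation under that assumption, with made-up function names. It only covers these single-symbol cases; foma compiles rules into transducers that can also be applied in the inverse direction, which re.sub cannot do.

```python
import re

def a_to_b_between_c_and_d(word):
    """Approximates: regex a -> b || c _ d ;"""
    return re.sub(r"(?<=c)a(?=d)", "b", word)

def delete_a_before_b(word):
    """Approximates: regex a -> 0 || _ b ;"""
    return re.sub(r"a(?=b)", "", word)

def x_to_y(word):
    """Approximates: regex x -> y ;"""
    return word.replace("x", "y")

print(a_to_b_between_c_and_d("cad"), delete_a_before_b("aab"), x_to_y("xax"))
```

The assertions matter because contexts are checked but not consumed: only the LHS symbol itself is replaced.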
Word boundaries in rewrite rules
The special symbol
.#. is used to refer to edges of words. For example, the following transducer deletes all
x-symbols, but only at the beginning of a word. Everything else is passed through as-is.
regex x -> 0 || .#. _ ;
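In Python's re module, the closest counterpart to a left context of .#. is the ^ anchor. The following sketch approximates the rule above under that assumption; the function name is hypothetical.

```python
import re

def delete_initial_x(word):
    """Approximates: regex x -> 0 || .#. _ ;"""
    # Without re.MULTILINE, ^ matches only at the very start of the string.
    return re.sub(r"^x", "", word)

for word in ["xax", "axx", "xxa"]:
    print(word, "->", delete_initial_x(word))
```

Only the first symbol is ever affected: in xxa the second x is preceded by another x in the input, not by the word edge.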
Epenthesis (insertion) rules
There is a special type of rule, called an epenthesis rule, that is used whenever you want to insert something (from nothing) in a particular position. This is denoted by [..] as the LHS. For example:
regex [..] -> x || y _ z ;
would result in a transducer that inserts x between y and z, as seen below:
foma: down
apply down> yzyz
yxzyxz
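An insertion rule of this kind corresponds roughly to a Python substitution whose pattern matches an empty position. The sketch below assumes that analogy and uses a hypothetical function name; it reproduces the yzyz example for the down direction only.

```python
import re

def insert_x_between_y_and_z(word):
    """Approximates: regex [..] -> x || y _ z ;"""
    # The pattern consumes nothing: it matches the gap after y and before z.
    return re.sub(r"(?<=y)(?=z)", "x", word)

print(insert_x_between_y_and_z("yzyz"))
```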
If you have two (or more) transducers, they can be joined by composition. For example:
foma: regex a -> b .o. b -> c ;
416 bytes. 1 state, 4 arcs, Cyclic.
This creates the composition of the two transducers above, mapping a directly to c. Question: what does c map to in the inverse direction (do up and then enter c)?
Normally, individual transducers are given names with
def, and then composed. For example:
foma: def RuleA a -> b || c _ d;
defined RuleA: 634 bytes. 4 states, 16 arcs, Cyclic.
foma: def RuleB b -> x || c _ ;
defined RuleB: 480 bytes. 2 states, 8 arcs, Cyclic.
foma: regex RuleA .o. RuleB;
708 bytes. 4 states, 19 arcs, Cyclic.
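In the down direction, composition behaves like feeding the output of one rule into the next. The Python sketch below approximates RuleA and RuleB with re.sub and chains them; compose and the rule functions are hypothetical names, and unlike foma this does not build a single combined machine.

```python
import re

def rule_a(word):
    """Approximates: def RuleA a -> b || c _ d;"""
    return re.sub(r"(?<=c)a(?=d)", "b", word)

def rule_b(word):
    """Approximates: def RuleB b -> x || c _ ;"""
    return re.sub(r"(?<=c)b", "x", word)

def compose(*rules):
    """Feed the output of each rule into the next, left to right."""
    def run(word):
        for rule in rules:
            word = rule(word)
        return word
    return run

a_then_b = compose(rule_a, rule_b)   # RuleA .o. RuleB
b_then_a = compose(rule_b, rule_a)   # RuleB .o. RuleA

print(a_then_b("cad"), b_then_a("cad"))
```

The two orders give different results for cad, which is why the order of composition matters: RuleB can only apply to a b that RuleA has already produced.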
First create a transducer Lexicon that maps
- c a t [Sg] to c a t
- c a t [Pl] to c a t + s
- b u s [Sg] to b u s
- b u s [Pl] to b u s + s
Now that you have this, create a rewrite rule that translates all instances of + to e whenever it's surrounded by s on both sides. Define this rule with def under the name EInsert. Now compose Lexicon and EInsert by issuing
regex Lexicon .o. EInsert ;
You should get a resulting transducer that maps b u s [Pl] to b u s e s and c a t [Pl] to c a t + s. Now, you can create a third transducer Cleanup that always deletes all +-symbols and compose that in last.
regex Lexicon .o. EInsert .o. Cleanup ;
You should now have your first complete morphological toy grammar. Test it with up and a word such as buses.
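To check your understanding of what the finished grammar should do, the same three-step pipeline can be sketched in Python. This is an analogy under stated assumptions, not the foma solution: LEXICON is a plain dictionary standing in for the Lexicon transducer, the rules are approximated with string substitutions, and the up direction is found by brute force over the four entries.

```python
import re

# Hypothetical stand-in for the Lexicon transducer (tagged form to stem).
LEXICON = {
    "cat[Sg]": "cat",
    "cat[Pl]": "cat+s",
    "bus[Sg]": "bus",
    "bus[Pl]": "bus+s",
}

def e_insert(word):
    """Turn + into e when it stands between two s symbols."""
    return re.sub(r"(?<=s)\+(?=s)", "e", word)

def cleanup(word):
    """Delete every remaining + symbol."""
    return word.replace("+", "")

def generate(analysis):
    """Down direction: Lexicon, then EInsert, then Cleanup."""
    return cleanup(e_insert(LEXICON[analysis]))

def analyze(surface):
    """Up direction, found by trying every entry of the tiny lexicon."""
    return sorted(a for a in LEXICON if generate(a) == surface)

print(generate("bus[Pl]"), generate("cat[Pl]"), analyze("buses"))
```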
Common mistakes:
- Not separating symbols by spaces; writing
regex cat|dog; instead of
regex c a t | d o g;.
- Forgetting the semicolon at the end of a regular expression.
- Not keeping track of binding: see help precedence in foma to see which operators bind in what order.