# Regular Grammars with Thrax

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

__Requirements__

- [Thrax](http://www.openfst.org/twiki/bin/view/GRM/Thrax)
    - run `conda install -c conda-forge thrax`

## Takeaways

- Basic usage of Thrax for writing regular grammars that generate or accept input
- This is not complete coverage of the tool

## Grammars as Regular Expressions

> The *OpenGrm Thrax Grammar Compiler* is a set of tools for compiling grammars expressed as __regular expressions__ and context-dependent rewrite rules into weighted finite-state transducers using the OpenFst format.

(Pynini offers similar functionality as a Python library.)

The tool is similar to a popular data generation and augmentation tool [Chatito](https://github.com/rodrigopivi/Chatito) (and [Chatette](https://github.com/SimGus/Chatette)). Specifically, it implements a Domain Specific Language (DSL) that allows you to define templates to __generate__ and __test__ sentences of interest.

The advantage of the weighted finite-state transducer (WFST) formalism over an ad hoc DSL (and thus of Thrax over the other tools) is that WFSTs are widely used in speech and language applications. The hypothesis space for tasks like automatic speech recognition (ASR) and optical character recognition can be represented as a compact, efficiently searchable cascade of WFSTs. Moreover, manually created grammatical resources such as pronunciation lexicons and phonological rules are also naturally represented as finite-state transducers. Consequently, compiled Thrax grammars can be combined with trained statistical language models.

A grammar describes how to form strings from a language's lexicon that are valid according to the language's syntax. Compiled Thrax grammars thus can be used to generate an arbitrary number of strings that comply with the grammar definition.

### Provided Tools

- __thraxmakedep__ creates a `Makefile` with the grammar dependencies
- __thraxcompiler__ compiles a grammar into a `far` archive
- __thraxrandom-generator__ generates a specified number of sentences from a grammar
- __thraxrewrite-tester__ can be used to test a grammar

## Thrax Grammar Definition

### Grammars as Regular Languages

Thrax allows the definition of "templates" for the generation of data. The example below demonstrates a regular expression that generates search queries in the movie domain.

In [2]:
%%bash

fname=movies

cat > $fname.grm <<EOF

search = ("I am " ("looking" | "searching") " for ") | ("show me " | "list ") "some "?;
export query = Optimize[search "movies"];

EOF

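Note the use of `?` to make the token `some` optional. Because concatenation binds tighter than `|`, the optional `"some "` attaches only to the `show me`/`list` alternative, which is consistent with the generated output further down.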
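
To make the scope of the optional `"some "` concrete, the rule can be mirrored as a POSIX extended regular expression. This is only an illustrative sketch that uses `grep -E` in place of Thrax; it assumes that `"some "?` belongs to the `show me`/`list` branch only, as the sampled output in this notebook suggests. The helper name `accepts_query` is hypothetical.

```shell
# Hypothetical helper: mirrors the `query` rule as an ERE (this is not Thrax).
#   Branch 1: "I am (looking|searching) for "   (no optional "some ")
#   Branch 2: "(show me |list )" followed by an optional "some "
accepts_query() {
    printf '%s\n' "$1" |
        grep -Eq '^(I am (looking|searching) for |(show me |list )(some )?)movies$'
}

for q in "list some movies" "I am looking for movies" "I am looking for some movies"; do
    if accepts_query "$q"; then
        echo "accepted: $q"
    else
        echo "rejected: $q"
    fi
done
```

Under that assumption the first two strings are accepted and the third is rejected, since `some` is not available after `I am looking for`.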

#### Compiling Grammar

In [3]:
%%bash

thraxmakedep movies.grm
cat Makefile
make

movies.far: movies.grm 
	thraxcompiler --input_grammar=$< --output_far=$@

clean:
	rm -f 
thraxcompiler --input_grammar=movies.grm --output_far=movies.far
Evaluating rule: search
Evaluating rule: query


In [4]:
%%bash

farinfo movies.far

far type                                          sttable
arc type                                          standard
fst type                                          vector
# of FSTs                                         1
total # of states                                 43
total # of arcs                                   46
total # of final states                           1


#### Generating Queries

`thraxrandom-generator` generates sentences accepted by the grammar; the number of sentences to generate is set via `--noutput`.

In [5]:
%%bash
thraxrandom-generator --far=movies.far --rule=query --noutput=10 | sort | uniq

****************************************
I am looking for movies
I am searching for movies
list movies
list some movies
show me movies
show me some movies
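
Since the generator draws random samples, a small `--noutput` is not guaranteed to cover every string of the language. For a grammar this small, the full language can also be listed by hand. The loops below are a plain shell sketch, not a Thrax feature; they assume the optional `some ` applies only to the `show me`/`list` alternative, and `enumerate_queries` is a hypothetical name.

```shell
# Enumerate the strings described by the `query` rule with plain shell loops.
# This illustrates the language; it is not output produced by Thrax.
enumerate_queries() {
    # Branch 1: "I am (looking|searching) for movies"
    for verb in looking searching; do
        echo "I am $verb for movies"
    done
    # Branch 2: "(show me|list) [some] movies"
    for lead in "show me" "list"; do
        for opt in "" "some "; do
            echo "$lead ${opt}movies"
        done
    done
}

enumerate_queries | sort
```

The sketch lists six strings, the same set as the de-duplicated sample above.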


#### Testing Queries

The grammar can be tested with `thraxrewrite-tester`, as shown below.
If the grammar accepts the input string, the tool prints the rewritten string as output; otherwise it reports that the rewrite failed.

In [6]:
%%bash

echo "show me some movies" | thraxrewrite-tester --far=movies.far --rules=query
echo "show me a movie" | thraxrewrite-tester --far=movies.far --rules=query

Input string: Output string: show me some movies
Input string: Input string: Rewrite failed.
Input string: 

### Regular Patterns

Regular expressions can also be used to generate or accept entities of interest, such as release years.

In [7]:
%%bash

fname=year

cat > $fname.grm <<EOF

digit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9");
year = ("19" digit{2})|("20"((("0"|"1") digit)|("2" ("0"|"1"))));
export release_year = Optimize[year];

EOF

In [8]:
%%bash
thraxmakedep year.grm
make
thraxrandom-generator --far=year.far --rule=release_year --noutput=10 | sort | uniq

thraxcompiler --input_grammar=year.grm --output_far=year.far
Evaluating rule: digit
Evaluating rule: year
Evaluating rule: release_year
****************************************
1911
1945
1955
1965
1969
1971
1980
2007
2021
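
The `year` rule covers 1900 to 1999 and 2000 to 2021. As a cross-check of that range, the same pattern can be written as a POSIX extended regular expression. This is a sketch in which `grep -E` stands in for the compiled grammar; `is_release_year` is a hypothetical helper.

```shell
# Hypothetical helper: ERE equivalent of the `year` rule (this is not Thrax).
#   "19" digit{2}                          -> 19[0-9]{2}
#   "20" (("0"|"1") digit | "2" ("0"|"1")) -> 20([01][0-9]|2[01])
is_release_year() {
    printf '%s\n' "$1" | grep -Eq '^(19[0-9]{2}|20([01][0-9]|2[01]))$'
}

for y in 1899 1900 1999 2019 2021 2022; do
    if is_release_year "$y"; then
        echo "$y accepted"
    else
        echo "$y rejected"
    fi
done
```

The boundary values 1899 and 2022 fall outside the pattern, while 1900 and 2021 are the first and last accepted years.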


### Reading External Files

Thrax can read a list of elements from an external file and use them to generate or accept sentences.

Let's extend our movies grammar with actors.

In [9]:
%%bash

fname=actors

cat > $fname.txt <<EOF

brad pitt
clint eastwood
zoe saldana
scarlett johansson

EOF

In [7]:
%%bash

fname=movies2

cat > $fname.grm <<EOF

actors = StringFile['actors.txt'];
search = ("I am " ("looking" | "searching") " for ") | ("show me " | "list ") "some "? ;
search_movie = search "movies with " actors;
export query = Optimize[search_movie];

EOF

In [8]:
%%bash
thraxmakedep movies2.grm
make
thraxrandom-generator --far=movies2.far --rule=query --noutput=10 | sort | uniq

thraxcompiler --input_grammar=movies2.grm --output_far=movies2.far
Evaluating rule: actors
Evaluating rule: search
Evaluating rule: search_movie
Evaluating rule: query
****************************************
list movies with brad pitt
list movies with scarlett johansson
list movies with zoe saldana
list some movies with clint eastwood
show me movies with brad pitt
show me movies with scarlett johansson
show me movies with zoe saldana
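
`StringFile` turns each line of the file into one alternative, so `actors` behaves like a union of the listed names. The sketch below imitates this with a shell loop to show how many strings the `query` rule describes. It is not Thrax: the temporary file stands in for `actors.txt`, blank lines are skipped here by choice (how Thrax treats them is not shown in this notebook), and `enumerate_actor_queries` is a hypothetical name.

```shell
# Imitate StringFile with a shell loop: each non-empty line is one alternative.
enumerate_actor_queries() {
    for prefix in "I am looking for " "I am searching for " \
                  "show me " "show me some " "list " "list some "; do
        while IFS= read -r actor; do
            if [ -n "$actor" ]; then
                echo "${prefix}movies with $actor"
            fi
        done < "$1"
    done
}

# Stand-in for actors.txt with the four names used above.
actors_file=$(mktemp)
printf '%s\n' "brad pitt" "clint eastwood" "zoe saldana" "scarlett johansson" > "$actors_file"

# 6 prefixes x 4 actors = 24 lines
enumerate_actor_queries "$actors_file" | wc -l
rm -f "$actors_file"
```

With 24 possible strings, ten random samples can cover only part of the language, which matches the partial listing above.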


### Exercise

Write a grammar that generates/accepts searches for movies by

- actor (e.g. "starring Zoe Saldana")
- release year ("released in 2020")
- director (e.g. "directed by Steven Spielberg")
- producer (e.g. "produced by Disney")

In [18]:
%%bash

fname=exercise1
cat > $fname.grm <<EOF

digit = ("0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9");

actors = StringFile['actors.txt'];
years = ("19" digit{2})|("20"((("0"|"1") digit)|("2" ("0"|"1"))));
directors = StringFile['directors.txt'];
producers = StringFile['producers.txt'];
search_movie =  (("I am " ("looking" | "searching") " for ") | ("show me " | "list ") "some ")? "movies " ("starring " | "with ") actors ", " ("released in " | "from ")? years ", directed by " directors " and produced by " producers;
export query = Optimize[search_movie];

EOF

In [19]:
%%bash
thraxmakedep exercise1.grm
make
thraxrandom-generator --far=exercise1.far --rule=query --noutput=10 | sort | uniq

thraxcompiler --input_grammar=exercise1.grm --output_far=exercise1.far
Evaluating rule: digit
Evaluating rule: actors
Evaluating rule: years
Evaluating rule: directors
Evaluating rule: producers
Evaluating rule: search_movie
Evaluating rule: query
****************************************
I am searching for movies with Zoe Saldana, from 1994, directed by Steven Spielberg and produced by Disney
list some movies starring Clint Eastwood, from 2005, directed by Steven Spielberg and produced by 20th Century Fox
list some movies starring Zoe Saldana, released in 2007, directed by Cristopher Nolan and produced by 20th Century Fox
movies starring Scarlett Johansson, from 1907, directed by Quentin Tarantino and produced by 20th Century Fox
movies starring Zoe Saldana, from 2000, directed by Steven Spielberg and produced by 20th Century Fox
movies with Clint Eastwood, 1942, directed by Cristopher Nolan and produced by 20th Century Fox
movies with Clint Eastwood, 2002, directed by Quentin Tarantin