How to change linguistic resources
Get the tools ready (this needs to be done only once)
Build the library and tools
- git clone Bling-Fire-Git-Path
- cd BlingFire
- mkdir Release
- cd Release
- cmake -DCMAKE_BUILD_TYPE=Release ..
- make
This will take a few minutes.
Alternatively, you can use Visual Studio Code with the CMake, CMake Tools and C/C++ extensions installed. Select the Release build mode; the output files will then be placed in the build folder.
Make sure the tools are in the path
Now you need to either install the tools into a location already on the PATH, or add the BlingFire directory containing the tools to the PATH. For the latter, run this command from the BlingFire directory:
- . ./scripts/set_env
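If you are curious what sourcing such a script achieves, here is a minimal sketch of the PATH mechanism itself. Note that the /tmp/blingfiretools directory and the stub fa_nfa2dfa script below are made up for this illustration; the real script exports your actual BlingFire build directory.

```shell
# Illustration only: the directory and the stub tool are invented for
# this example, they are not part of BlingFire.
mkdir -p /tmp/blingfiretools
printf '#!/bin/sh\necho "stub"\n' > /tmp/blingfiretools/fa_nfa2dfa
chmod +x /tmp/blingfiretools/fa_nfa2dfa

# Prepending the directory to PATH lets the shell resolve the tool by name
export PATH=/tmp/blingfiretools:$PATH
command -v fa_nfa2dfa    # prints /tmp/blingfiretools/fa_nfa2dfa
```

The real `scripts/set_env` must be sourced (`. ./scripts/set_env`) rather than executed, because an `export PATH=...` in a child process would not affect your current shell.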
Let's make sure the tools are actually reachable via the PATH. All tools respond to --help, so typing, for example, fa_nfa2dfa --help should print something like:
    Usage: fa_nfa2dfa [OPTION] [< input.txt] [> output.txt]

    This program converts non-deterministic finite-state machine into
    deterministic one.

      --in=input-file - reads input from the input-file,
        if omited stdin is used
      --out=output-file - writes output to the output-file,
        if omited stdout is used
      --out2=output-file - writes output to the output-file,
        if omited stdout is used
      --pos-nfa=input-file - reads reversed position NFA from input-file,
        needed for --fsm=pos-rs-nfa to store only ambiguous positions,
        if omited stores all positions
      --fsm=rs-nfa - makes convertion from Rabin-Scott NFA (is used by default)
      --fsm=pos-rs-nfa - makes convertion from Rabin-Scott position NFA,
        builds Moore Multi Dfa
      --fsm=mealy-nfa - makes convertion from Mealy NFA into a cascade of two
        Mealy Dfa (general case) or a single Mealy DFA (trivial case)
      --spec-any=N - treats input weight N as a special any symbol, if specified
        produces Dfa with the same symbol on arcs, which must be
        interpreted as any other
      --bi-machine - uses bi-machine for Mealy NFA determinization
      --no-output - does not do any output
      --verbose - prints out debug information, if supported
Edit linguistic sources and compile them into automata
Let's change the working directory to the root of the linguistic sources:
- cd ldbsrc
Note: we will add separate documentation on the different formats of the linguistic resources; for the moment we will modify only the tokenization logic, like this:
Now let's recompile the wbd directory (the word-boundary disambiguation, i.e. word-breaking or tokenization, logic is defined in this directory). We simply need to type:
make -f Makefile.gnu lang=wbd all
You should see something like this on the screen:
    fa_build_conf \
     --in=wbd/ldb.conf.small \
     --out=wbd/tmp/ldb.mmap.small.txt
    fa_fsm2fsm_pack --type=mmap \
     --in=wbd/tmp/ldb.mmap.small.txt \
     --out=wbd/tmp/ldb.conf.small.dump \
     --auto-test
    fa_build_lex --dict-root=. --full-unicode --in=wbd/wbd.lex.utf8 \
     --tagset=wbd/wbd.tagset.txt --out-fsa=wbd/tmp/wbd.rules.fsa.txt \
     --out-fsa-iwmap=wbd/tmp/wbd.rules.fsa.iwmap.txt \
     --out-map=wbd/tmp/wbd.rules.map.txt
    fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=wbd/tmp/wbd.rules.fsa.txt --iw-map=wbd/tmp/wbd.rules.fsa.iwmap.txt --out=wbd/tmp/wbd.fsa.small.dump
    fa_fsm2fsm_pack --alg=triv --type=mmap --in=wbd/tmp/wbd.rules.map.txt --out=wbd/tmp/wbd.mmap.small.dump --auto-test
    fa_merge_dumps --out=ldb/wbd.bin wbd/tmp/ldb.conf.small.dump wbd/tmp/wbd.fsa.small.dump wbd/tmp/wbd.mmap.small.dump
This means that make is doing its job and remaking all the dependent targets.
If you see an "ERROR: XYZ" message on the screen, find the one that appeared first and try to work out which tool it came from, what the input to that tool was, and what its command-line parameters were. Double-check with --help that these parameters make sense. Let us know if you are stuck; we'll be happy to help.
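Since the first error is the one that matters, it can help to capture the build output to a file and grep for it. Below is a small sketch using a simulated log; in practice you would produce build.log with something like `make -f Makefile.gnu lang=wbd all 2>&1 | tee build.log` (the error text shown is invented for the example).

```shell
# Simulated build log; a real one would come from:
#   make -f Makefile.gnu lang=wbd all 2>&1 | tee build.log
printf 'fa_build_conf --in=wbd/ldb.conf.small\nERROR: cannot open input file\nERROR: merge failed\n' > build.log

# -n prints the line number, -m 1 stops after the first match
grep -n -m 1 'ERROR:' build.log    # prints 2:ERROR: cannot open input file
```

Knowing the line number lets you scroll back to the exact command that failed, since make echoes each tool invocation before running it.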
How to verify the compiled file is working correctly
- For the tokenizer you can use the fa_lex tool. See fa_lex --help for more details:
printf "Hi There! This is a simple test." | fa_lex --ldb=ldb/wbd.bin --tagset=wbd/wbd.tagset.txt
The output should be something like:
Hi/WORD There/WORD !/WORD This/WORD is/WORD a/WORD simple/WORD test/WORD ./WORD
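Each token is printed as text/TAG. If you only need the raw token text, the tag suffix can be stripped with standard utilities; a small sketch follows (the sample string is hard-coded here, but in practice you would pipe the fa_lex output into the same filter):

```shell
# Split the space-separated token/TAG pairs onto separate lines,
# then cut off everything after the first '/'
printf 'Hi/WORD There/WORD !/WORD\n' | tr ' ' '\n' | cut -d/ -f1
```

This prints one bare token per line (Hi, There, !). Note the naive `cut -d/` would also split tokens that themselves contain a slash, so treat this as a quick inspection trick rather than a robust parser.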
- For single-token transformations you can use the test_ldb tool. See test_ldb --help for more details
- See tools.txt for details
What is the structure of linguistic sources
The Linguistic Data Base (LDB) files are simply containers that combine address-independent memory dumps of different structures, such as maps, multi-maps, finite-state automata, and arrays.
To avoid usage mistakes that are difficult to track down (for example, a dictionary collected case-sensitively but looked up case-insensitively), the runtime options are also compiled into one of those maps (the configuration map) and become part of the final LDB file. The compiled configuration map defines which functions the LDB has resources for and which parameters each function should use at runtime.
    ldbsrc               -- main LDB root
      Name_1             -- name of the project #1
        ldb.conf.small   -- runtime configuration parameters for the project #1, required file
        options.small    -- LDB compilation options for the project #1, required file
        [other resources]
      Name_2             -- name of the project #2
      ...
      ldb                -- a root for all the compiled binary files
        name1.bin        -- compiled binary for the project #1
        name2.bin        -- compiled binary for the project #2
        ...
      Makefile.gnu       -- make file for compilation
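As a toy sketch of the container idea only: the final .bin file bundles several independently compiled dumps into one file. The snippet below uses plain concatenation of two invented stand-in files; the actual on-disk layout produced by fa_merge_dumps is its own format, not shown here.

```shell
# Toy illustration (NOT the real fa_merge_dumps format): an LDB file is
# one container holding several address-independent binary dumps.
printf 'CONFDUMP' > conf.dump     # 8 bytes, stand-in for a config-map dump
printf 'FSADUMP'  > fsa.dump      # 7 bytes, stand-in for an automaton dump
cat conf.dump fsa.dump > toy.bin  # combine into a single container file
wc -c < toy.bin                   # total size: 15 bytes
```

Because each dump is address-independent, the runtime can map the container into memory once and hand each component out by offset without any pointer fix-ups.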