
Switch FST to DoubleArrayTrie #76

Merged: 6 commits merged on Oct 6, 2020

Conversation

@johtani (Member) commented Sep 28, 2020

Switch the FST library to yada (a Double-Array Trie implementation) in PrefixDict.
Rust >= 1.46.0 is required to build.

This PR breaks compatibility with the data structure from the previous version.

Still a work in progress. Remaining tasks:

  • Change .fst file to .da
  • Add some tests
  • Update CHANGES.md
  • Check benches
    • Debug an error on long text bench
  • Check/test lindera-server
  • Check/test lindera-tantivy
    • Need tokenizer to be Clone-able.
  • Change other builders, e.g. neologd, unidic
    • After the new version is released
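
For context, here is a minimal sketch of the kind of double-array build and prefix lookup that PrefixDict now relies on. It follows the yada crate's documented API; the keyset and values are illustrative, not lindera's actual dictionary entries.

```rust
use yada::builder::DoubleArrayBuilder;
use yada::DoubleArray;

fn main() {
    // Keyset must be sorted by key bytes; the u32 values are illustrative IDs.
    let keyset: Vec<(&[u8], u32)> = vec![
        ("す".as_bytes(), 0),
        ("すもも".as_bytes(), 1),
        ("もも".as_bytes(), 2),
    ];

    // Serialize the keyset into a double-array trie: a plain byte buffer,
    // which is what the new .da dictionary file would hold.
    let da_bytes = DoubleArrayBuilder::build(&keyset).expect("build failed");
    let da = DoubleArray::new(da_bytes);

    // Common-prefix search yields (value, matched byte length) for every
    // entry that is a prefix of the query, which is the core operation a
    // prefix dictionary needs during tokenization.
    for (value, length) in da.common_prefix_search("すもも".as_bytes()) {
        println!("value = {}, prefix byte length = {}", value, length);
    }
}
```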

@johtani added the enhancement (New feature or request) label on Sep 28, 2020
@johtani (Member, Author) commented Sep 28, 2020

This branch already passes the existing `make test`. However, I got an error running `cargo bench` in the lindera package.

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 sthread 'main' panicked at 'byte index 5 is not a char boundary; it is inside '相' (bytes 3..6) of `は相違《そうい》なくっても暑いには極ってる。`', /Users/johtani/.rustup/toolchains/1.46.0-x86_64-apple/rust/src/libcore/str/mod.rs:2018:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
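
For reference, the panic is Rust's standard guard against slicing a `&str` at a byte offset that falls inside a multi-byte UTF-8 character. A minimal reproduction of the same class of error (not the actual lindera code path):

```rust
fn main() {
    let s = "は相違"; // each of these characters occupies 3 bytes in UTF-8
    assert!(s.is_char_boundary(3));  // boundary between 'は' and '相'
    assert!(!s.is_char_boundary(5)); // byte 5 falls inside '相' (bytes 3..6)

    // This slice would panic with "byte index 5 is not a char boundary":
    // let _broken = &s[..5];

    // Slicing on a char boundary is fine:
    let ok = &s[..3];
    assert_eq!(ok, "は");
}
```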

@johtani (Member, Author) commented Sep 28, 2020

The result of `cargo bench`. The tokenize methods are faster :)


test result: ok. 0 passed; 0 failed; 18 ignored; 0 measured; 0 filtered out

     Running /Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/target/release/deps/bench-4e692558eb08e35b
Gnuplot not found, using plotters backend
Benchmarking bench-constructor: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 82.8s or reduce sample count to 10.
bench-constructor       time:   [15.986 ms 16.083 ms 16.214 ms]                               
                        change: [+2.2083% +5.1511% +7.4330%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking bench-constructor-custom-dict: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.7s or reduce sample count to 10.
bench-constructor-custom-dict                                                                             
                        time:   [16.079 ms 16.139 ms 16.214 ms]
                        change: [-1.1785% +0.5738% +1.8693%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

bench-tokenize-wiki     time:   [76.202 us 76.963 us 77.743 us]                                
                        change: [-15.749% -14.453% -13.197%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

bench-tokenize-custom-dict                                                                             
                        time:   [17.819 us 18.329 us 18.855 us]
                        change: [-26.619% -24.613% -22.628%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 sthread 'main' panicked at 'byte index 5 is not a char boundary; it is inside '相' (bytes 3..6) of `は相違《そうい》なくっても暑いには極ってる。`', /Users/johtani/.rustup/toolchains/1.46.0-x86_64-apple/rust/src/libcore/str/mod.rs:2018:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@johtani (Member, Author) commented Sep 30, 2020

After upgrading to yada 0.3.1, there are no errors!

@johtani (Member, Author) commented Sep 30, 2020

This is the result of `cargo bench` after upgrading to yada 0.3.1.

test result: ok. 0 passed; 0 failed; 18 ignored; 0 measured; 0 filtered out

     Running /Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/target/release/deps/bench-f9b2f9f58925b6fa
Gnuplot not found, using plotters backend
Benchmarking bench-constructor: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.5s or reduce sample count to 10.
bench-constructor       time:   [16.719 ms 16.746 ms 16.775 ms]                               
                        change: [+11.266% +11.846% +12.404%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking bench-constructor-custom-dict: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 85.9s or reduce sample count to 10.
bench-constructor-custom-dict                                                                             
                        time:   [16.903 ms 16.924 ms 16.948 ms]
                        change: [+8.7492% +9.0774% +9.4225%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

bench-tokenize-wiki     time:   [75.686 us 76.070 us 76.546 us]                                
                        change: [-13.333% -11.474% -9.2561%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe

bench-tokenize-custom-dict                                                                             
                        time:   [16.689 us 16.756 us 16.828 us]
                        change: [-29.111% -28.071% -27.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 35.3s or reduce sample count to 10.
Long text/bench-tokenize-long-text                                                                           
                        time:   [157.82 ms 158.99 ms 160.43 ms]
                        change: [-28.585% -25.637% -22.439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe

@johtani (Member, Author) commented Sep 30, 2020

And the dict.fst data sizes:

yada: 5425152 bytes (dict.fst, Sep 30 16:07)
fst:  2147765 bytes (dict.fst, Sep 25 23:42)

And remove the FST naming from this repository.
@johtani (Member, Author) commented Sep 30, 2020

I got an error building lindera-tantivy. The Tokenizer has to be Clone-able.
I made a feature request on the yada repository.

Related: takuyaa/yada#8 ("Need Tokenizer as a Cloneable")
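
For illustration, the shape of the fix on the lindera side could look like the sketch below; the type and field names are hypothetical, not lindera's actual ones. Keeping the serialized double-array behind a shared pointer makes the tokenizer cheap to clone, which is what lindera-tantivy requires.

```rust
use std::sync::Arc;

// Hypothetical sketch: the dictionary bytes are shared between clones.
#[derive(Clone)]
struct Tokenizer {
    dict_bytes: Arc<Vec<u8>>, // serialized double-array, shared by all clones
}

impl Tokenizer {
    fn prefix_dict(&self) -> yada::DoubleArray<&[u8]> {
        // DoubleArray is a thin view over the byte buffer, so it can be
        // rebuilt cheaply from the shared data whenever a lookup is needed.
        yada::DoubleArray::new(self.dict_bytes.as_slice())
    }
}
```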
@johtani (Member, Author) commented Sep 30, 2020

Fixed; the Tokenizer is now Clone-able.

@johtani marked this pull request as ready for review on September 30, 2020 16:14
@johtani requested a review from mosuka on September 30, 2020 16:14
@johtani (Member, Author) commented Oct 6, 2020

I think it's ready for review. I'll check the other builders after the new version is released.

@mosuka (Member) left a comment


LGTM

@mosuka (Member) commented Oct 6, 2020

Thanks for the good work!
