
Switch FST to DoubleArrayTrie #76

Merged: 6 commits merged on Oct 6, 2020

Conversation

@johtani (Member) commented Sep 28, 2020

Switch the FST library to yada (a Double-Array Trie implementation) in PrefixDict.
Rust >= 1.46.0 is required to build.

This PR breaks compatibility with the data structure from the previous version.

Still a work in progress. Remaining tasks:

  • Change .fst file to .da
  • Add some tests
  • Update CHANGES.md
  • Check benches
    • Debug an error on long text bench
  • Check/test lindera-server
  • Check/test lindera-tantivy
    • Need tokenizer to be Clone-able.
  • Change other builders, e.g. neologd, unidic
    • After the new version is released
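
For context, here is a minimal sketch of the kind of double-array build and prefix lookup that PrefixDict now relies on. It follows the yada crate's documented API; the keyset and values are illustrative, not lindera's actual dictionary entries.

```rust
use yada::builder::DoubleArrayBuilder;
use yada::DoubleArray;

fn main() {
    // Keyset must be sorted by key bytes; the u32 values are illustrative IDs.
    let keyset: Vec<(&[u8], u32)> = vec![
        ("す".as_bytes(), 0),
        ("すもも".as_bytes(), 1),
        ("もも".as_bytes(), 2),
    ];

    // Serialize the keyset into a double-array trie: a plain byte buffer,
    // which is what the new .da dictionary file would hold.
    let da_bytes = DoubleArrayBuilder::build(&keyset).expect("build failed");
    let da = DoubleArray::new(da_bytes);

    // Common-prefix search yields (value, matched byte length) for every
    // entry that is a prefix of the query, which is the core operation a
    // prefix dictionary needs during tokenization.
    for (value, length) in da.common_prefix_search("すもも".as_bytes()) {
        println!("value = {}, prefix byte length = {}", value, length);
    }
}
```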

@johtani added the enhancement (New feature or request) label on Sep 28, 2020
@johtani (Member, Author) commented Sep 28, 2020

This branch already passes the existing `make test`. However, I got an error running `cargo bench` in the lindera package.

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 sthread 'main' panicked at 'byte index 5 is not a char boundary; it is inside '相' (bytes 3..6) of `は相違《そうい》なくっても暑いには極ってる。`', /Users/johtani/.rustup/toolchains/1.46.0-x86_64-apple/rust/src/libcore/str/mod.rs:2018:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
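
For reference, the panic is Rust's standard guard against slicing a `&str` at a byte offset that falls inside a multi-byte UTF-8 character. A minimal reproduction of the same class of error (not the actual lindera code path):

```rust
fn main() {
    let s = "は相違"; // each of these characters occupies 3 bytes in UTF-8
    assert!(s.is_char_boundary(3));  // boundary between 'は' and '相'
    assert!(!s.is_char_boundary(5)); // byte 5 falls inside '相' (bytes 3..6)

    // This slice would panic with "byte index 5 is not a char boundary":
    // let _broken = &s[..5];

    // Slicing on a char boundary is fine:
    let ok = &s[..3];
    assert_eq!(ok, "は");
}
```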

@johtani (Member, Author) commented Sep 28, 2020

The result of `cargo bench`. The tokenize methods are faster :)


test result: ok. 0 passed; 0 failed; 18 ignored; 0 measured; 0 filtered out

     Running /Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/target/release/deps/bench-4e692558eb08e35b
Gnuplot not found, using plotters backend
Benchmarking bench-constructor: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 82.8s or reduce sample count to 10.
bench-constructor       time:   [15.986 ms 16.083 ms 16.214 ms]                               
                        change: [+2.2083% +5.1511% +7.4330%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking bench-constructor-custom-dict: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.7s or reduce sample count to 10.
bench-constructor-custom-dict                                                                             
                        time:   [16.079 ms 16.139 ms 16.214 ms]
                        change: [-1.1785% +0.5738% +1.8693%] (p = 0.54 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

bench-tokenize-wiki     time:   [76.202 us 76.963 us 77.743 us]                                
                        change: [-15.749% -14.453% -13.197%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

bench-tokenize-custom-dict                                                                             
                        time:   [17.819 us 18.329 us 18.855 us]
                        change: [-26.619% -24.613% -22.628%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  9 (9.00%) high mild
  5 (5.00%) high severe

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 sthread 'main' panicked at 'byte index 5 is not a char boundary; it is inside '相' (bytes 3..6) of `は相違《そうい》なくっても暑いには極ってる。`', /Users/johtani/.rustup/toolchains/1.46.0-x86_64-apple/rust/src/libcore/str/mod.rs:2018:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@johtani (Member, Author) commented Sep 30, 2020

After upgrading to yada 0.3.1, there are no errors!

@johtani (Member, Author) commented Sep 30, 2020

This is the result of `cargo bench` after upgrading to yada 0.3.1.

test result: ok. 0 passed; 0 failed; 18 ignored; 0 measured; 0 filtered out

     Running /Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/target/release/deps/bench-f9b2f9f58925b6fa
Gnuplot not found, using plotters backend
Benchmarking bench-constructor: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 86.5s or reduce sample count to 10.
bench-constructor       time:   [16.719 ms 16.746 ms 16.775 ms]                               
                        change: [+11.266% +11.846% +12.404%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking bench-constructor-custom-dict: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 85.9s or reduce sample count to 10.
bench-constructor-custom-dict                                                                             
                        time:   [16.903 ms 16.924 ms 16.948 ms]
                        change: [+8.7492% +9.0774% +9.4225%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

bench-tokenize-wiki     time:   [75.686 us 76.070 us 76.546 us]                                
                        change: [-13.333% -11.474% -9.2561%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe

bench-tokenize-custom-dict                                                                             
                        time:   [16.689 us 16.756 us 16.828 us]
                        change: [-29.111% -28.071% -27.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

Benchmarking Long text/bench-tokenize-long-text: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 35.3s or reduce sample count to 10.
Long text/bench-tokenize-long-text                                                                           
                        time:   [157.82 ms 158.99 ms 160.43 ms]
                        change: [-28.585% -25.637% -22.439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe

@johtani (Member, Author) commented Sep 30, 2020

And the dict.fst data sizes:

yada: 5425152 bytes (dict.fst, Sep 30 16:07)
fst:  2147765 bytes (dict.fst, Sep 25 23:42)

And remove the FST naming from this repository.
@johtani (Member, Author) commented Sep 30, 2020

I got an error building lindera-tantivy. The Tokenizer has to be Clone-able.
I made a feature request on the yada repository.

Related: takuyaa/yada#8 ("Need Tokenizer as a Cloneable")
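
For illustration, the shape of the fix on the lindera side could look like the sketch below; the type and field names are hypothetical, not lindera's actual ones. Keeping the serialized double-array behind a shared pointer makes the tokenizer cheap to clone, which is what lindera-tantivy requires.

```rust
use std::sync::Arc;

// Hypothetical sketch: the dictionary bytes are shared between clones.
#[derive(Clone)]
struct Tokenizer {
    dict_bytes: Arc<Vec<u8>>, // serialized double-array, shared by all clones
}

impl Tokenizer {
    fn prefix_dict(&self) -> yada::DoubleArray<&[u8]> {
        // DoubleArray is a thin view over the byte buffer, so it can be
        // rebuilt cheaply from the shared data whenever a lookup is needed.
        yada::DoubleArray::new(self.dict_bytes.as_slice())
    }
}
```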
@johtani (Member, Author) commented Sep 30, 2020

Fixed; the Tokenizer is now Clone-able.

@johtani marked this pull request as ready for review on September 30, 2020 16:14
@johtani requested a review from mosuka on September 30, 2020 16:14
@johtani (Member, Author) commented Oct 6, 2020

I think it's ready for review. I'll check the other builders after the new version is released.

@mosuka (Member) left a comment


LGTM

@mosuka (Member) commented Oct 6, 2020

Thanks for the good work!
