Improve khmer segmenter performance by using fst segmenter #251

Merged

xshadowlegendx
Contributor

@xshadowlegendx xshadowlegendx commented Nov 23, 2023

Pull Request

Related issue

Fixes #250

What does this PR do?

  • Add a Khmer word FST, converted from khmerdict.txt using the fst crate
  • Switch the Khmer segmenter from icu_segmenter to the FST segmenter
  • Remove the ICU dependencies
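
For readers unfamiliar with dictionary-based segmentation, here is a minimal sketch of the greedy longest-match idea that an FST segmenter relies on. It is not the code from this PR: a `HashSet` stands in for the FST built from khmerdict.txt, the dictionary contents are made up, and the names `longest_prefix_len` and `segment` are hypothetical. A real FST would walk the input byte by byte and stop as soon as no transition matches, instead of hashing every prefix.

```rust
use std::collections::HashSet;

/// Byte length of the longest dictionary word that is a prefix of `text`.
/// Assumption: `dict` is a plain HashSet standing in for the word FST.
fn longest_prefix_len(dict: &HashSet<&str>, text: &str) -> Option<usize> {
    let mut best = None;
    // Only consider prefixes that end on a UTF-8 character boundary.
    for (start, ch) in text.char_indices() {
        let end = start + ch.len_utf8();
        if dict.contains(&text[..end]) {
            best = Some(end);
        }
    }
    best
}

/// Greedy longest-match segmentation. When no dictionary word matches,
/// a single character is emitted so the loop always makes progress.
fn segment<'a>(dict: &HashSet<&str>, mut text: &'a str) -> Vec<&'a str> {
    let mut out = Vec::new();
    while let Some(first) = text.chars().next() {
        let len = longest_prefix_len(dict, text).unwrap_or(first.len_utf8());
        let (word, rest) = text.split_at(len);
        out.push(word);
        text = rest;
    }
    out
}
```

With a hypothetical dictionary containing "a", "ab", "abc" and "cd", calling `segment(&dict, "abxcd")` splits the input into "ab", "x" and "cd": the longest known prefix wins, and the unknown "x" falls back to a single character.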

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@xshadowlegendx xshadowlegendx marked this pull request as ready for review November 23, 2023 08:32
@ManyTheFish
Member

Hello @xshadowlegendx,
thank you for your PR,
did you run the benchmarks? And did you see any improvements? 😄

@xshadowlegendx
Contributor Author

Hello @ManyTheFish, yes, I ran the benchmarks against the main branch and saw some improvements. I also ran a basic test with data for 489 movies from production to make sure the segmenter still works.

# main branch
# segment/132/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	24.357 µs 	24.558 µs 	24.747 µs
R² 	0.8692733 	0.8755603 	0.8699834
Mean 	24.109 µs 	24.399 µs 	24.722 µs
Std. Dev. 	1.1266 µs 	1.5696 µs 	2.1047 µs
Median 	24.149 µs 	24.471 µs 	24.663 µs
MAD 	945.18 ns 	1.2040 µs 	1.5451 µs

# segment/327/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	50.207 µs 	50.619 µs 	51.049 µs
R² 	0.8184340 	0.8248126 	0.8178764
Mean 	50.045 µs 	50.632 µs 	51.240 µs
Std. Dev. 	2.5476 µs 	3.0627 µs 	3.5343 µs
Median 	49.838 µs 	50.600 µs 	51.182 µs
MAD 	2.1212 µs 	2.7998 µs 	3.6153 µs

# tokenize/132/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	30.104 µs 	30.220 µs 	30.352 µs
R² 	0.9621360 	0.9641448 	0.9615239
Mean 	30.448 µs 	30.557 µs 	30.676 µs
Std. Dev. 	389.53 ns 	580.93 ns 	767.59 ns
Median 	30.572 µs 	30.624 µs 	30.675 µs
MAD 	134.49 ns 	183.81 ns 	266.37 ns

# tokenize/327/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	60.602 µs 	60.851 µs 	61.141 µs
R² 	0.9654886 	0.9673231 	0.9648228
Mean 	60.819 µs 	61.024 µs 	61.268 µs
Std. Dev. 	624.79 ns 	1.1528 µs 	1.6001 µs
Median 	60.648 µs 	60.700 µs 	60.792 µs
MAD 	224.98 ns 	332.19 ns 	427.22 ns

# after enhancement
# segment/132/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	16.332 µs 	16.381 µs 	16.424 µs
R² 	0.9860116 	0.9873707 	0.9863219
Mean 	16.406 µs 	16.459 µs 	16.525 µs
Std. Dev. 	126.77 ns 	309.52 ns 	455.06 ns
Median 	16.396 µs 	16.402 µs 	16.416 µs
MAD 	39.580 ns 	56.285 ns 	71.022 ns

# segment/327/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	28.371 µs 	28.459 µs 	28.576 µs
R² 	0.9794999 	0.9807700 	0.9785282
Mean 	28.427 µs 	28.489 µs 	28.562 µs
Std. Dev. 	200.02 ns 	348.37 ns 	466.55 ns
Median 	28.359 µs 	28.400 µs 	28.439 µs
MAD 	102.44 ns 	137.48 ns 	173.04 ns

# tokenize/132/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	29.387 µs 	29.453 µs 	29.542 µs
R² 	0.9731752 	0.9738369 	0.9726103
Mean 	29.440 µs 	29.564 µs 	29.724 µs
Std. Dev. 	280.87 ns 	733.64 ns 	1.1028 µs
Median 	29.353 µs 	29.368 µs 	29.398 µs
MAD 	75.512 ns 	98.902 ns 	134.29 ns

# tokenize/327/Khmer/Khm
 	Lower bound 	Estimate 	Upper bound
Slope 	52.207 µs 	52.294 µs 	52.401 µs
R² 	0.9893887 	0.9897693 	0.9892005
Mean 	52.279 µs 	52.386 µs 	52.513 µs
Std. Dev. 	332.50 ns 	605.68 ns 	832.29 ns
Median 	52.183 µs 	52.247 µs 	52.268 µs
MAD 	170.13 ns 	228.47 ns 	294.66 ns
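
To make the comparison easier to read, the slope estimates above can be turned into speedup factors. The snippet below is only a small sketch of that arithmetic; the helper names `speedup` and `slope_estimates` are hypothetical, and the numbers are copied from the "Estimate" column of the Slope rows in the two runs.

```rust
/// How many times faster the new timing is than the old one.
fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// (benchmark name, main branch slope in µs, slope after the change in µs),
/// taken from the Slope "Estimate" values reported above.
fn slope_estimates() -> [(&'static str, f64, f64); 4] {
    [
        ("segment/132/Khmer/Khm", 24.558, 16.381),
        ("segment/327/Khmer/Khm", 50.619, 28.459),
        ("tokenize/132/Khmer/Khm", 30.220, 29.453),
        ("tokenize/327/Khmer/Khm", 60.851, 52.294),
    ]
}

/// One formatted line per benchmark, e.g. for printing in a report.
fn report() -> Vec<String> {
    slope_estimates()
        .iter()
        .map(|(name, before, after)| {
            format!("{}: {:.2}x faster", name, speedup(*before, *after))
        })
        .collect()
}
```

Printing each line of `report()` gives a per-benchmark factor; the segment benchmarks improve noticeably more than the tokenize benchmarks, which also include work outside segmentation.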

Member

@ManyTheFish ManyTheFish left a comment

Nice, thank you for your contribution!

bors merge

Contributor

meili-bors bot commented Nov 27, 2023

Build succeeded:

@meili-bors meili-bors bot merged commit b31e01d into meilisearch:main Nov 27, 2023
4 checks passed
Development

Successfully merging this pull request may close these issues.

remove unnecessary iteration in khmer segmenter
2 participants