
🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms #68

Closed
wants to merge 11 commits into from

Conversation

@nipunsadvilkar (Owner) commented Jun 21, 2020

Segmentation Tools, Libraries and Algorithms:

  • Stanza
  • syntok
  • NLTK
  • spaCy
  • blingfire
| Tool      | Accuracy | Speed (ms) |
|-----------|----------|------------|
| blingfire | 75.00%   | 49.91      |
| pySBD     | 97.92%   | 2449.18    |
| syntok    | 68.75%   | 783.73     |
| spaCy     | 52.08%   | 473.96     |
| stanza    | 72.92%   | 120803.37  |
| NLTK      | 56.25%   | 342.98     |
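A minimal sketch of how accuracy and speed figures like those above could be measured for any segmenter (this is not the PR's actual benchmark script). The `naive_segmenter` below is a hypothetical stand-in, a regex splitter; in practice you would pass pySBD, blingfire, etc. Accuracy here is exact-match over whole test cases.

```python
import re
import time

def naive_segmenter(text):
    """Hypothetical stand-in: split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def benchmark(segmenter, cases):
    """cases: list of (text, expected_sentences) pairs.

    Returns (accuracy, elapsed_ms): fraction of cases segmented exactly
    as expected, and total wall-clock time in milliseconds.
    """
    start = time.perf_counter()
    correct = sum(1 for text, expected in cases if segmenter(text) == expected)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return correct / len(cases), elapsed_ms

cases = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    # Abbreviations are the classic failure mode for naive splitters:
    ("Dr. Smith arrived. He sat down.", ["Dr. Smith arrived.", "He sat down."]),
]
accuracy, ms = benchmark(naive_segmenter, cases)
```

The naive splitter gets the first case right but wrongly splits after "Dr." in the second, so it scores 50% here; a rule-based tool like pySBD is designed to handle exactly such abbreviation cases.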

…nto npn-benchmark

* 'npn-multiple-lang' of github.com:nipunsadvilkar/pySBD: (22 commits)
  🔧  Update setup.py
  🎨  cleanup
  ✨ 💫  Support Kazakh language
  ♻️  Refactor Processor to get language specific Processor
  ✨ 💫  Support Deutsch language
  ✨ 💫  Support Japanese language
  ✨ 💫  Support Italian language
  ✨ 💫  Support Greek language
  ✨ 💫  Support Burmese language
  ✨ 💫  Support French language
  ✨ 💫  Support Danish language
  ✨ 💫  Support Dutch language
  ✨ 💫  Support Persian language
  ✅ 🔧  Fix PySBDFactory & limit char_span to English lang only
  ✨ 💫  Support Polish language
  ✨ 💫  Support Russian language
  ✨ 💫  Support Urdu language
  🚑  Add Bulgarian language files
  ✨ 💫  Support Bulgarian language
  ✨ 💫  Support Armenian language
  ...
@nipunsadvilkar nipunsadvilkar self-assigned this Jun 21, 2020
@nipunsadvilkar nipunsadvilkar changed the title Benchmark across Segmentation Tools, Libraries and Algorithms 🏎 ⚡️ 💯 Benchmark across Segmentation Tools, Libraries and Algorithms Jun 21, 2020

@DeNeutoy DeNeutoy left a comment


These benchmarks are good for comparing libraries, but to create a "words per second" benchmark it might be good to run these on a larger quantity of text, e.g. over 10k sentences.

One minor comment on the correctness of the stanza benchmark, but otherwise LGTM
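The "words per second" metric suggested above could be computed with a small helper like this (hypothetical, not part of the PR's code): time one segmentation pass over a large text and divide the whitespace-token count by the elapsed wall-clock time.

```python
import time

def words_per_second(segmenter, text):
    """Throughput of a segmenter in words/second over a single pass."""
    start = time.perf_counter()
    segmenter(text)
    # Guard against a zero reading on very small inputs / coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(text.split()) / elapsed

# Usage with a trivial stand-in segmenter:
wps = words_per_second(lambda t: t.split("."), "One. Two. Three. " * 1000)
```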

benchmarks/stanza_benchmark.py: review comment (outdated, resolved)
@nipunsadvilkar (Owner, Author)

@DeNeutoy Can you suggest any dataset to benchmark against?

@DeNeutoy

@nipunsadvilkar Perhaps a book from Project Gutenberg? They have full plaintext books, e.g:
http://www.gutenberg.org/files/1661/1661-0.txt

This would also allow us to analyse failure cases of the various methods.
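Before benchmarking on a Project Gutenberg file such as the one linked above, the license header and footer should be stripped so they don't skew the results. Gutenberg plaintext files delimit the body with `*** START OF ...` and `*** END OF ...` marker lines; the helper below is a hypothetical sketch (not from this PR) that keeps only the text between them.

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the body text between the Gutenberg START/END markers."""
    lines = raw.splitlines()
    start = next(i for i, l in enumerate(lines) if l.startswith("*** START OF"))
    end = next(i for i, l in enumerate(lines) if l.startswith("*** END OF"))
    return "\n".join(lines[start + 1:end]).strip()

# Synthetic example in the Gutenberg plaintext layout:
sample = (
    "header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ***\n"
    "footer"
)
body = strip_gutenberg_boilerplate(sample)
```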

@DeNeutoy

Here is another algorithm to benchmark: https://github.com/microsoft/BlingFire#python-api-description

Blingfire is very fast, but I don't know how good its SBD (sentence boundary detection) module is.

@nipunsadvilkar (Owner, Author)

@DeNeutoy: Benchmarked blingfire. Quite amazed by its speed & accuracy 💯

Base automatically changed from npn-multiple-lang to master July 12, 2020 17:48
@nipunsadvilkar nipunsadvilkar changed the title 🏎 ⚡️ 💯 Benchmark across Segmentation Tools, Libraries and Algorithms 🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms Jul 12, 2020
@nipunsadvilkar (Owner, Author)

Going with @DeNeutoy's approach: #69
