🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms #68
Conversation
…nto npn-benchmark

* 'npn-multiple-lang' of github.com:nipunsadvilkar/pySBD: (22 commits)
  - 🔧 Update setup.py
  - 🎨 cleanup
  - ✨ 💫 Support Kazakh language
  - ♻️ Refactor Processor to get language specific Processor
  - ✨ 💫 Support Deutsch language
  - ✨ 💫 Support Japanese language
  - ✨ 💫 Support Italian language
  - ✨ 💫 Support Greek language
  - ✨ 💫 Support Burmese language
  - ✨ 💫 Support French language
  - ✨ 💫 Support Danish language
  - ✨ 💫 Support Dutch language
  - ✨ 💫 Support Persian language
  - ✅ 🔧 Fix PySBDFactory & limit char_span to English lang only
  - ✨ 💫 Support Polish language
  - ✨ 💫 Support Russian language
  - ✨ 💫 Support Urdu language
  - 🚑 Add Bulgarian language files
  - ✨ 💫 Support Bulgarian language
  - ✨ 💫 Support Armenian language
  - ...
These benchmarks are good for comparing libraries, but for a "words per second" benchmark it would be better to run them on a larger quantity of text, e.g. over 10k sentences.
One minor comment on the correctness of the stanza benchmark, but otherwise LGTM
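The "words per second" suggestion above can be sketched roughly as follows. This is not code from the PR: `naive_split` is a hypothetical regex stand-in for whichever segmenter is under test (pySBD, Stanza, BlingFire, etc.), and the corpus is synthetic filler sized to roughly 10k sentences.

```python
import re
import time

def naive_split(text):
    # Hypothetical stand-in segmenter: a real benchmark run would call
    # pysbd.Segmenter(...).segment, Stanza, BlingFire, etc. here instead.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def words_per_second(segment, text):
    # Time one pass of the segmenter and normalise by the input word count.
    start = time.perf_counter()
    sentences = segment(text)
    elapsed = time.perf_counter() - start
    return len(text.split()) / elapsed, len(sentences)

# Synthetic corpus of ~10k sentences so timings are reasonably stable.
corpus = "This is a sentence. Is it though? Yes! " * 3500

wps, n_sents = words_per_second(naive_split, corpus)
print(f"{n_sents} sentences, {wps:,.0f} words/sec")
```

Swapping `naive_split` for each library's segment function would give directly comparable words-per-second numbers on the same text.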
@DeNeutoy Can you suggest any dataset to benchmark against?
@nipunsadvilkar Perhaps a book from Project Gutenberg? They have full plaintext books, e.g.: This would also allow us to analyse failure cases of the various methods.
Here is another algorithm to benchmark: https://github.com/microsoft/BlingFire#python-api-description BlingFire is very fast, but I don't know how good its SBD module is.
@DeNeutoy: Benchmarked
Segmentation Tools, Libraries and Algorithms: