
🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms #68

Closed
wants to merge 11 commits into from

Conversation

@nipunsadvilkar (Owner) commented Jun 21, 2020

Segmentation Tools, Libraries and Algorithms:

  • Stanza
  • syntok
  • NLTK
  • spaCy
  • blingfire
| Tool      | Accuracy | Speed (ms) |
|-----------|----------|------------|
| blingfire | 75.00%   | 49.91      |
| pySBD     | 97.92%   | 2449.18    |
| syntok    | 68.75%   | 783.73     |
| spaCy     | 52.08%   | 473.96     |
| stanza    | 72.92%   | 120803.37  |
| NLTK      | 56.25%   | 342.98     |
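A minimal sketch of how accuracy and speed figures like those above could be measured for any segmenter (this is not the PR's actual benchmark script). The `naive_segmenter` below is a hypothetical stand-in, a regex splitter; in practice you would pass pySBD, blingfire, etc. Accuracy here is exact-match over whole test cases.

```python
import re
import time

def naive_segmenter(text):
    """Hypothetical stand-in: split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def benchmark(segmenter, cases):
    """cases: list of (text, expected_sentences) pairs.

    Returns (accuracy, elapsed_ms): fraction of cases segmented exactly
    as expected, and total wall-clock time in milliseconds.
    """
    start = time.perf_counter()
    correct = sum(1 for text, expected in cases if segmenter(text) == expected)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return correct / len(cases), elapsed_ms

cases = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    # Abbreviations are the classic failure mode for naive splitters:
    ("Dr. Smith arrived. He sat down.", ["Dr. Smith arrived.", "He sat down."]),
]
accuracy, ms = benchmark(naive_segmenter, cases)
```

The naive splitter gets the first case right but wrongly splits after "Dr." in the second, so it scores 50% here; a rule-based tool like pySBD is designed to handle exactly such abbreviation cases.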

…nto npn-benchmark

* 'npn-multiple-lang' of github.com:nipunsadvilkar/pySBD: (22 commits)
  🔧  Update setup.py
  🎨  cleanup
  ✨ 💫  Support Kazakh language
  ♻️  Refactor Processor to get language specific Processor
  ✨ 💫  Support Deutsch language
  ✨ 💫  Support Japanese language
  ✨ 💫  Support Italian language
  ✨ 💫  Support Greek language
  ✨ 💫  Support Burmese language
  ✨ 💫  Support French language
  ✨ 💫  Support Danish language
  ✨ 💫  Support Dutch language
  ✨ 💫  Support Persian language
  ✅ 🔧  Fix PySBDFactory & limit char_span to English lang only
  ✨ 💫  Support Polish language
  ✨ 💫  Support Russian language
  ✨ 💫  Support Urdu language
  🚑  Add Bulgarian language files
  ✨ 💫  Support Bulgarian language
  ✨ 💫  Support Armenian language
  ...
@nipunsadvilkar nipunsadvilkar self-assigned this Jun 21, 2020
@nipunsadvilkar nipunsadvilkar changed the title Benchmark across Segmentation Tools, Libraries and Algorithms 🏎 ⚡️ 💯 Benchmark across Segmentation Tools, Libraries and Algorithms Jun 21, 2020

@DeNeutoy DeNeutoy left a comment


These benchmarks are good for comparing libraries, but to create a "words per second" benchmark it might be good to run these on a larger quantity of text, e.g. over 10k sentences.

One minor comment on the correctness of the stanza benchmark, but otherwise LGTM
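The "words per second" metric suggested above could be computed with a small helper like this (hypothetical, not part of the PR's code): time one segmentation pass over a large text and divide the whitespace-token count by the elapsed wall-clock time.

```python
import time

def words_per_second(segmenter, text):
    """Throughput of a segmenter in words/second over a single pass."""
    start = time.perf_counter()
    segmenter(text)
    # Guard against a zero reading on very small inputs / coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(text.split()) / elapsed

# Usage with a trivial stand-in segmenter:
wps = words_per_second(lambda t: t.split("."), "One. Two. Three. " * 1000)
```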

benchmarks/stanza_benchmark.py: review comment (outdated, resolved)
@nipunsadvilkar (Owner, Author)

@DeNeutoy Can you suggest any dataset to benchmark against?

@DeNeutoy

@nipunsadvilkar Perhaps a book from Project Gutenberg? They have full plaintext books, e.g:
http://www.gutenberg.org/files/1661/1661-0.txt

This would also allow us to analyse failure cases of the various methods.
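Before benchmarking on a Project Gutenberg file such as the one linked above, the license header and footer should be stripped so they don't skew the results. Gutenberg plaintext files delimit the body with `*** START OF ...` and `*** END OF ...` marker lines; the helper below is a hypothetical sketch (not from this PR) that keeps only the text between them.

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the body text between the Gutenberg START/END markers."""
    lines = raw.splitlines()
    start = next(i for i, l in enumerate(lines) if l.startswith("*** START OF"))
    end = next(i for i, l in enumerate(lines) if l.startswith("*** END OF"))
    return "\n".join(lines[start + 1:end]).strip()

# Synthetic example in the Gutenberg plaintext layout:
sample = (
    "header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ***\n"
    "footer"
)
body = strip_gutenberg_boilerplate(sample)
```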

@DeNeutoy

Here is another algorithm to benchmark: https://github.com/microsoft/BlingFire#python-api-description

Blingfire is very fast, but I don't know how good its SBD (sentence boundary detection) module is.

@nipunsadvilkar (Owner, Author)

@DeNeutoy: Benchmarked blingfire. Quite amazed by its speed & accuracy 💯

Base automatically changed from npn-multiple-lang to master July 12, 2020 17:48
@nipunsadvilkar nipunsadvilkar changed the title 🏎 ⚡️ 💯 Benchmark across Segmentation Tools, Libraries and Algorithms 🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms Jul 12, 2020
@nipunsadvilkar (Owner, Author)

Going with @DeNeutoy's approach: #69
