Performance improvements #70

nipunsadvilkar · 2020-07-13T16:44:18Z

As mentioned in #41, abbreviation_replacer.py takes too long; it needs to be refactored for better performance.

Speed Benchmark on bigger text file

Tool	Speed
blingfire_tokenize	55.63 ms
nltk_tokenize	198.17 ms
pysbd_tokenize	12846.23 ms
spacy_tokenize	741.54 ms
spacy_dep_tokenize	17642.21 ms
stanza_tokenize	35623.08 ms
syntok_tokenize	1455.21 ms

Text file used: http://www.gutenberg.org/files/1661/1661-0.txt

wget http://www.gutenberg.org/files/1661/1661-0.txt -P benchmarks/
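For anyone who wants to reproduce timings like the ones in the table, here is a minimal sketch of a timing harness. It is not the contents of benchmarks/benchmark.py; `time_segmenter` and `naive_split` are hypothetical names, and `naive_split` is only a stand-in segmenter so that the snippet runs without any third-party package.

```python
import re
import time
from typing import Callable, List


def time_segmenter(segment: Callable[[str], List[str]], text: str, repeats: int = 3) -> float:
    """Return the best wall-clock time in milliseconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        segment(text)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def naive_split(text: str) -> List[str]:
    # Stand-in segmenter (hypothetical): split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


if __name__ == "__main__":
    # Synthetic input; swap in the contents of benchmarks/1661-0.txt to mirror the table.
    sample = "Hello world. This is a test! Is it fast? " * 1000
    print("naive_split\t{:.2f} ms".format(time_segmenter(naive_split, sample)))
```

To time pySBD itself, the `segment` method of a `pysbd.Segmenter(language="en", clean=False)` instance could be passed in place of `naive_split`, assuming the package is installed.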

codecov-commenter · 2020-07-13T16:45:15Z

Codecov Report

Merging #70 into master will decrease coverage by 0.17%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   98.30%   98.13%   -0.18%     
==========================================
  Files          37       37              
  Lines        1063     1071       +8     
==========================================
+ Hits         1045     1051       +6     
- Misses         18       20       +2

Flag	Coverage Δ
#unittests	`98.13% <100.00%> (-0.18%)`	⬇️

Impacted Files	Coverage Δ
pysbd/utils.py	`72.41% <ø> (-0.92%)`	⬇️
pysbd/abbreviation_replacer.py	`100.00% <100.00%> (ø)`
pysbd/lang/bulgarian.py	`100.00% <100.00%> (ø)`
pysbd/lang/common/standard.py	`100.00% <100.00%> (ø)`
pysbd/lang/deutsch.py	`100.00% <100.00%> (ø)`
pysbd/lang/italian.py	`100.00% <100.00%> (ø)`
pysbd/lang/russian.py	`100.00% <100.00%> (ø)`
pysbd/languages.py	`96.87% <100.00%> (ø)`
pysbd/segmenter.py	`100.00% <100.00%> (ø)`
pysbd/lang/arabic.py	`90.47% <0.00%> (-9.53%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e6c596f...97ff5c2.

benchmarks/benchmark.py

DeNeutoy · 2020-07-13T17:33:58Z

pysbd/abbreviation_replacer.py

+        abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
+        abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)
+        abbrev_matches = re.findall(abbregex, original)
+        try:


Why is this try/except now needed?

The earlier logic, ported from pragmatic-segmenter, ran a for-loop over every abbreviation, which was too slow, so we refactored this function to improve performance on bigger text files. The regex approach broke for ru, it, and bg, hence the addition of ABBREVIATIONS2 to handle those language-specific abbreviation cases.
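To illustrate the difference between the two strategies, here is a minimal sketch under stated assumptions: the abbreviation list is a tiny hypothetical one, and `find_with_loop` / `find_with_alternation` are illustrative names rather than pySBD functions. The loop scans the text once per abbreviation, while the alternation scans it once in total.

```python
import re

# Hypothetical, tiny abbreviation list; the real lists live in the language modules.
ABBREVIATIONS = ["dr", "mr", "e.g", "u.s"]


def find_with_loop(text):
    """One regex scan per abbreviation: cost grows with the length of the list."""
    found = []
    for abbr in ABBREVIATIONS:
        pattern = r"(?:^|\s)({})\b".format(re.escape(abbr))
        found.extend(m.group(1) for m in re.finditer(pattern, text, flags=re.IGNORECASE))
    return found


def find_with_alternation(text):
    """A single scan with all abbreviations joined by '|'."""
    abbrs = "|".join(re.escape(a.strip()) for a in ABBREVIATIONS)
    pattern = r"(?:^|\s)({})\b".format(abbrs)
    return [m.group(1) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


text = "Dr. Smith met Mr. Jones in the U.S. yesterday"
print(find_with_loop(text))
print(find_with_alternation(text))
```

One caveat with the alternation: alternatives are tried left to right, so if one abbreviation is a prefix of another, the shorter one can win. Sorting the list longest-first before joining is one way to guard against that.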

DeNeutoy · 2020-07-13T17:35:02Z

pysbd/abbreviation_replacer.py

-                )
-        return self.text
+        abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
+        abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)


I'm not sure if this is the right way to do the re.compile, because it will be compiled each time you call this function. Maybe it needs to be in self.lang.Abbreviation.ABBREVIATIONS?

Moved the ABBREVIATIONS regex to __init__. Still need to check how compiling all the individual regexes up front with re.compile impacts performance.
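A minimal sketch of what compiling once in `__init__` can look like, reusing the pattern from the diff above. `AbbreviationMatcher` is a hypothetical class for illustration, not pySBD's `AbbreviationReplacer`.

```python
import re


class AbbreviationMatcher:
    """Hypothetical helper: builds and compiles the abbreviation regex once."""

    def __init__(self, abbreviations):
        abbrs = "|".join(re.escape(abr.strip()) for abr in abbreviations)
        # Built and compiled once per instance instead of on every call.
        self.abbregex = re.compile(
            r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE
        )

    def find(self, text):
        # With a single capturing group, findall returns the captured strings.
        return self.abbregex.findall(text)


matcher = AbbreviationMatcher(["dr", "mr "])
print(matcher.find("Dr. Who and mr. X"))
```

Python's `re` module keeps an internal cache of recently compiled patterns, so part of the saving here likely comes from not re-escaping and re-joining the whole abbreviation list on every call.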

Fixes #41

nipunsadvilkar · 2020-08-04T14:39:42Z

Used #71 approach

nipunsadvilkar added 5 commits July 12, 2020 18:08

⚡️ 🐛 Refactor AbbreviationReplacer

8e423cf

replace forloop with regex pattern

🚑 Quick Abbreviations for - ru, it, bg lang

4485360

⚡️ Performance improvements

d4027db

Merge branch 'master' of github.com:nipunsadvilkar/pySBD into npn-per…

ac16ed6

…formance * 'master' of github.com:nipunsadvilkar/pySBD: rename use format, delete other scripts

⚡️ speed benchmark on bigger file

8168046

nipunsadvilkar added enhancement ⚡️ performance speed, performance improvements labels Jul 13, 2020

DeNeutoy suggested changes Jul 13, 2020

nipunsadvilkar added 4 commits July 16, 2020 19:44

⚡️ 🐛 refactor abbreviation replacer & Add speed benchmark

d4e4abf

Fixes #41

💚 💡 Add comment for new abbreviation logic & cleanup

35315d9

⬇️ rm spacy dependency

b0e9504

🚨 ignore rules init regex compile

97ff5c2

nipunsadvilkar closed this Aug 4, 2020

This pull request was closed.
