Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #70

Closed
wants to merge 9 commits into from
Closed

Performance improvements #70

wants to merge 9 commits into from

Conversation

nipunsadvilkar
Copy link
Owner

@nipunsadvilkar nipunsadvilkar commented Jul 13, 2020

As mentioned in #41 , abbreviation_replacer.py takes too long and needs to be refactored and needs a performance improvement.

Speed Benchmark on bigger text file

Tool Speed
blingfire_tokenize 55.63 ms
nltk_tokenize 198.17 ms
pysbd_tokenize 12846.23 ms
spacy_tokenize 741.54 ms
spacy_dep_tokenize 17642.21 ms
stanza_tokenize 35623.08 ms
syntok_tokenize 1455.21 ms

Text file used: http://www.gutenberg.org/files/1661/1661-0.txt

wget http://www.gutenberg.org/files/1661/1661-0.txt -P benchmarks/

@codecov-commenter
Copy link

codecov-commenter commented Jul 13, 2020

Codecov Report

Merging #70 into master will decrease coverage by 0.17%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   98.30%   98.13%   -0.18%     
==========================================
  Files          37       37              
  Lines        1063     1071       +8     
==========================================
+ Hits         1045     1051       +6     
- Misses         18       20       +2     
Flag Coverage Δ
#unittests 98.13% <100.00%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pysbd/utils.py 72.41% <ø> (-0.92%) ⬇️
pysbd/abbreviation_replacer.py 100.00% <100.00%> (ø)
pysbd/lang/bulgarian.py 100.00% <100.00%> (ø)
pysbd/lang/common/standard.py 100.00% <100.00%> (ø)
pysbd/lang/deutsch.py 100.00% <100.00%> (ø)
pysbd/lang/italian.py 100.00% <100.00%> (ø)
pysbd/lang/russian.py 100.00% <100.00%> (ø)
pysbd/languages.py 96.87% <100.00%> (ø)
pysbd/segmenter.py 100.00% <100.00%> (ø)
pysbd/lang/arabic.py 90.47% <0.00%> (-9.53%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6c596f...97ff5c2. Read the comment docs.

@nipunsadvilkar nipunsadvilkar added enhancement ⚡️ performance speed, performance improvements labels Jul 13, 2020
benchmarks/benchmark.py Outdated Show resolved Hide resolved
benchmarks/benchmark.py Outdated Show resolved Hide resolved
benchmarks/benchmark.py Outdated Show resolved Hide resolved
abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)
abbrev_matches = re.findall(abbregex, original)
try:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this try except now needed?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier for-loop on each abbreviation logic from pragmatic-segmenter was too time-taking hence we did refactoring of this function to improving performance on bigger text files. Languages - ru, it, bg - was breaking due to the regex approach. Hence, added ABBREVIATIONS2 to handle those
languages specific abbreviation cases.

)
return self.text
abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the right way to do the re.compile, because it will be compiled each time you call this function. Maybe it needs to be in self.lang.Abbreviation.ABBREVIATIONS?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved ABBREVIATIONS regex to __init__. Need to check how moving all individual regex to re.compile first impact the performance.

@nipunsadvilkar
Copy link
Owner Author

Used #71 approach

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement ⚡️ performance speed, performance improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants