Performance improvements #70
replace forloop with regex pattern
…formance * 'master' of github.com:nipunsadvilkar/pySBD: rename use format, delete other scripts
Codecov Report
@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   98.30%   98.13%   -0.18%
==========================================
  Files          37       37
  Lines        1063     1071       +8
==========================================
+ Hits         1045     1051       +6
- Misses         18       20       +2
abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)
abbrev_matches = re.findall(abbregex, original)
try:
Why is this try/except needed now?
The earlier logic, ported from pragmatic-segmenter, ran a for-loop over every abbreviation and was too slow, so we refactored this function to improve performance on larger text files. Three languages (ru, it, bg) broke under the regex approach, so I added ABBREVIATIONS2 to handle those language-specific abbreviation cases.
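To illustrate the change described above, here is a minimal sketch contrasting a per-abbreviation loop with a single combined pattern. The abbreviation list and both function names are hypothetical stand-ins, not pySBD's actual data or API; only the pattern shape is taken from the diff in this PR.

```python
import re

# Hypothetical sample; the real list lives in self.lang.Abbreviation.ABBREVIATIONS.
ABBREVIATIONS = ["dr", "mr", "e.g", "etc"]


def find_abbreviations_loop(text):
    """One regex search per abbreviation (the per-item approach)."""
    found = []
    for abbr in ABBREVIATIONS:
        pattern = r"(?:^|\s|\r|\n)({})\b".format(re.escape(abbr.strip()))
        found.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return found


def find_abbreviations_combined(text):
    """A single alternation pattern, shaped like the one in the diff."""
    abbrs = "|".join(re.escape(abbr.strip()) for abbr in ABBREVIATIONS)
    abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)
    return abbregex.findall(text)


sample = "Dr. Smith met Mr. Jones, etc."
loop_result = find_abbreviations_loop(sample)
combined_result = find_abbreviations_combined(sample)
```

The combined version scans the text once instead of once per abbreviation. One behavioral difference to keep in mind: an alternation yields at most one match per position, so when one abbreviation is a prefix of another the two functions can return different results.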
pysbd/abbreviation_replacer.py
)
return self.text
abbrs = "|".join([re.escape(abr.strip()) for abr in self.lang.Abbreviation.ABBREVIATIONS])
abbregex = re.compile(r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE)
I'm not sure this is the right place for the re.compile call, because the pattern will be compiled each time this function is called. Maybe it should live in self.lang.Abbreviation.ABBREVIATIONS?
Moved the ABBREVIATIONS regex to __init__. I still need to check how precompiling all the individual regexes with re.compile affects performance.
Used the approach from #71.
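A rough sketch of what building the pattern once in __init__ can look like follows. The class name `AbbreviationMatcher` and its `find` method are hypothetical and are not pySBD's actual classes; the sketch only shows the pattern being built at construction time and reused across calls.

```python
import re


class AbbreviationMatcher:
    """Hypothetical illustration; not pySBD's AbbreviationReplacer."""

    def __init__(self, abbreviations):
        abbrs = "|".join(re.escape(abbr.strip()) for abbr in abbreviations)
        # Built and compiled once per instance rather than on every call.
        self.abbregex = re.compile(
            r"(?:^|\s|\r|\n)({})\b".format(abbrs), flags=re.IGNORECASE
        )

    def find(self, text):
        # Reuses the precompiled pattern.
        return self.abbregex.findall(text)


matcher = AbbreviationMatcher(["dr", "mr", "etc"])
first = matcher.find("Dr. Smith arrived.")
second = matcher.find("Ask Mr. Jones.")
```

With this layout, the join and escape work over the abbreviation list is also done only once per instance, not on each call.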
As mentioned in #41, abbreviation_replacer.py takes too long and needs to be refactored to improve performance.
Speed Benchmark on bigger text file
Text file used: http://www.gutenberg.org/files/1661/1661-0.txt