-
Notifications
You must be signed in to change notification settings - Fork 10
Conversation
* Backports inspirehep/refextract#15. * Adds new often cited report numbers, found with invenio-scripts/unrecognized_report_numbers.py. * Removes broken pattern escaping. * Relaxes patterns to allow dropping leading 0. * Replaces all ' ' patterns by 's' to allow additional spaces. * Cosmetic improvement: aligns second column. Signed-off-by: Micha Moskovic <michamos@gmail.com>
@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern): | |||
('yy', r'\d\d'), | |||
('s', r'\s*'), | |||
(r'/', r'\/')] | |||
# first, escape certain characters that could be sensitive to a regexp: | |||
pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain why this escaping isn't necessary? I'm lacking context here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed that in an other comment.
@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern): | |||
('yy', r'\d\d'), | |||
('s', r'\s*'), | |||
(r'/', r'\/')] | |||
# first, escape certain characters that could be sensitive to a regexp: | |||
pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't really comment on the KB changes, they appear ok/plausible. but beyond the removal of the non-ascii escaping regex I also wonder if this was checked against overly broad matches. It's not apparent from the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean? this patch was tested against the testsuite of inspirehep/refextract so there should be no major regressions, but no further testing was done. What do you suggest?
@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern): | |||
""" | |||
simple_replacements = [ | |||
('9', r'\d'), | |||
('9+', r'\d+'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact doesn't look like there is a '9+' left in the kb -- is this why the pattern is removed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole pattern substitution logic was broken:
- first the pattern was escaped, but
+
was not the whitelist, so this would lead to9+
->9\+
- the replacements were tried sequentially, so the
9
rule was applied before the9+
rule anyway
By not escaping patterns, we only need substitutions for the "words", not the quantifiers.
@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern): | |||
""" | |||
simple_replacements = [ | |||
('9', r'\d'), | |||
('9+', r'\d+'), | |||
('w+', r'\w+'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per comment above, I don't actually see a '+' in the kb, so is this line superfluous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is currently unused, but I didn't want to remove it as it might be useful in the future. The explicit +
is present in order to be able to write a literal w
in the pattern.
invenio-scripts/unrecognized_report_numbers.py.