Skip to content
This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

docextract: report numbers update #395

Merged
merged 1 commit into from
Apr 20, 2017

Conversation

michamos
Copy link

  • Backports kbs: report numbers update refextract#15.
  • Adds new often cited report numbers, found with
    invenio-scripts/unrecognized_report_numbers.py.
  • Removes broken pattern escaping.
  • Relaxes patterns to allow dropping leading 0.
  • Replaces all ' ' patterns by 's' to allow additional spaces.
  • Cosmetic improvement: aligns second column.

* Backports inspirehep/refextract#15.
* Adds new often cited report numbers, found with
invenio-scripts/unrecognized_report_numbers.py.
* Removes broken pattern escaping.
* Relaxes patterns to allow dropping leading 0.
* Replaces all ' ' patterns by 's' to allow additional spaces.
* Cosmetic improvement: aligns second column.

Signed-off-by: Micha Moskovic <michamos@gmail.com>
@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern):
('yy', r'\d\d'),
('s', r'\s*'),
(r'/', r'\/')]
# first, escape certain characters that could be sensitive to a regexp:
pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why this escaping isn't necessary? I'm lacking context here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed that in an other comment.

@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern):
('yy', r'\d\d'),
('s', r'\s*'),
(r'/', r'\/')]
# first, escape certain characters that could be sensitive to a regexp:
pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't really comment on the KB changes, they appear ok/plausible. but beyond the removal of the non-ascii escaping regex I also wonder if this was checked against overly broad matches. It's not apparent from the code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean? this patch was tested against the testsuite of inspirehep/refextract so there should be no major regressions, but no further testing was done. What do you suggest?

@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern):
"""
simple_replacements = [
('9', r'\d'),
('9+', r'\d+'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact doesn't look like there is a '9+' left in the kb -- is this why the pattern is removed here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole pattern substitution logic was broken:

  • first the pattern was escaped, but + was not the whitelist, so this would lead to 9+ -> 9\+
  • the replacements were tried sequentially, so the 9 rule was applied before the 9+ rule anyway

By not escaping patterns, we only need substitutions for the "words", not the quantifiers.

@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern):
"""
simple_replacements = [
('9', r'\d'),
('9+', r'\d+'),
('w+', r'\w+'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per comment above, I don't actually see a '+' in the kb, so is this line superfluous?

Copy link
Author

@michamos michamos Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is currently unused, but I didn't want to remove it as it might be useful in the future. The explicit + is present in order to be able to write a literal w in the pattern.

@tsgit tsgit merged commit 5369ecb into inspirehep:prod Apr 20, 2017
@tsgit tsgit removed the in progress label Apr 20, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants