docextract: report numbers update #395

michamos · 2017-04-11T14:24:05Z

Backports kbs: report numbers update refextract#15.
Adds new often cited report numbers, found with
invenio-scripts/unrecognized_report_numbers.py.
Removes broken pattern escaping.
Relaxes patterns to allow dropping leading 0.
Replaces all ' ' patterns by 's' to allow additional spaces.
Cosmetic improvement: aligns second column.

Signed-off-by: Micha Moskovic <michamos@gmail.com>

tsgit · 2017-04-20T04:44:33Z

modules/docextract/lib/refextract_kbs.py

@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern):
                            ('yy',   r'\d\d'),
                            ('s',    r'\s*'),
                            (r'/',   r'\/')]
-    # first, escape certain characters that could be sensitive to a regexp:
-    pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern)


can you explain why this escaping isn't necessary? I'm lacking context here

I addressed that in another comment.

tsgit · 2017-04-20T04:50:20Z

modules/docextract/lib/refextract_kbs.py

@@ -203,8 +201,6 @@ def institute_num_pattern_to_regex(pattern):
                            ('yy',   r'\d\d'),
                            ('s',    r'\s*'),
                            (r'/',   r'\/')]
-    # first, escape certain characters that could be sensitive to a regexp:
-    pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern)


Can't really comment on the KB changes; they appear OK/plausible. But beyond the removal of the non-ascii escaping regex, I also wonder whether this was checked against overly broad matches. It's not apparent from the code.

What do you mean? This patch was tested against the test suite of inspirehep/refextract, so there should be no major regressions, but no further testing was done. What do you suggest?

tsgit · 2017-04-20T04:54:14Z

modules/docextract/lib/refextract_kbs.py

@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern):
    """
    simple_replacements = [
                            ('9',    r'\d'),
-                            ('9+',   r'\d+'),


In fact, it doesn't look like there is a '9+' left in the KB. Is this why the pattern is removed here?

The whole pattern substitution logic was broken:

* first, the pattern was escaped, but + was not in the whitelist, so this would lead to 9+ -> 9\+
* the replacements were tried sequentially, so the 9 rule was applied before the 9+ rule anyway

By not escaping patterns, we only need substitutions for the "words", not the quantifiers.
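
To make the two failure modes concrete, here is a minimal sketch of the old and new conversion order. It is an illustration under assumptions, not the code in refextract_kbs.py: `ESCAPE_RE` is a hypothetical stand-in for `re_report_num_chars_to_escape` (whose definition is not shown in this thread), the rule lists contain only a subset of the entries visible in the diff, and the rules are assumed to be applied in order with plain `str.replace`.

```python
import re

# Hypothetical stand-in for re_report_num_chars_to_escape; the real
# definition is not shown in this thread. Assumed to escape '+'.
ESCAPE_RE = re.compile(r'([+*?()\[\]{}|.^$])')

# Subset of the rules visible in the diff, in their original order.
OLD_RULES = [('9', r'\d'), ('9+', r'\d+'), ('w+', r'\w+')]
NEW_RULES = [('9', r'\d'), ('w+', r'\w+')]


def apply_rules(pattern, rules):
    """Apply each (token, regex) rule in order (assumed: plain str.replace)."""
    for token, regex in rules:
        pattern = pattern.replace(token, regex)
    return pattern


def old_convert(pattern):
    """Escape first, then substitute, as the removed lines did."""
    escaped = ESCAPE_RE.sub(r'\\\g<1>', pattern)
    return apply_rules(escaped, OLD_RULES)


def new_convert(pattern):
    """Substitute only; quantifiers pass through as ordinary regex syntax."""
    return apply_rules(pattern, NEW_RULES)


print(old_convert('9+'))  # \d\+  ('+' was escaped into a literal plus sign)
print(new_convert('9+'))  # \d+
```

Even without the escaping step, the '9' rule consumes the digit before the '9+' rule is tried, so in this sketch `apply_rules('9+', OLD_RULES)` and `apply_rules('9+', NEW_RULES)` return the same string and the '9+' entry never fires.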

tsgit · 2017-04-20T04:55:05Z

modules/docextract/lib/refextract_kbs.py

@@ -194,7 +193,6 @@ def institute_num_pattern_to_regex(pattern):
    """
    simple_replacements = [
                            ('9',    r'\d'),
-                            ('9+',   r'\d+'),
                            ('w+',   r'\w+'),


per comment above, I don't actually see a '+' in the kb, so is this line superfluous?

It is currently unused, but I didn't want to remove it as it might be useful in the future. The explicit + is present in order to be able to write a literal w in the pattern.
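
The point about the explicit + can be sketched as follows. The pattern strings and the bare 'w' rule are hypothetical and chosen only for illustration; they do not come from the KB.

```python
import re


def substitute(pattern, token, regex):
    """Replace every occurrence of token in pattern with regex."""
    return pattern.replace(token, regex)


# With the 'w+' token, a lone 'w' in the pattern stays a literal character.
keep_literal = substitute('Tw-w+', 'w+', r'\w+')

# With a hypothetical bare 'w' token, every 'w' would become a wildcard.
lose_literal = substitute('Tw-w', 'w', r'\w+')

print(keep_literal)  # Tw-\w+
print(lose_literal)  # T\w+-\w+

# The first regex still requires the literal 'w' after 'T'.
print(bool(re.fullmatch(keep_literal, 'Tx-abc')))  # False
print(bool(re.fullmatch(lose_literal, 'Tx-abc')))  # True
```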

michamos requested a review from tsgit April 11, 2017 14:24

david-caro added the in progress label Apr 11, 2017

michamos mentioned this pull request Apr 11, 2017

backport improvements to refextract #375

Closed

2 tasks

tsgit suggested changes Apr 20, 2017

tsgit merged commit 5369ecb into inspirehep:prod Apr 20, 2017

tsgit removed the in progress label Apr 20, 2017
