arXiv: only unescape safe LaTeX macros #299
9e3a638 to b176715
I see. Interesting approach. One problem is with escaped whitespace, typically used to separate non-braced macros from the following text. A common pattern is
The original handles that better.
Both handle 'foo {\AA} foo' OK. That's a special case of handling the whitespace-escaping slash
and
It's easy to address by adding " " to LATEX_ALLOWED_MACROS.
hepcrawl/parsers/arxiv.py
def latex_to_unicode(cls, latex_string):
    try:
        return cls._l2t.latex_to_text(latex_string)
    except Exception:
Not sure what would trigger an exception here; I didn't manage to trigger one.
It might be worth logging it somehow, should it happen.
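One way the logging suggestion could look, as a minimal sketch. The `convert` argument is a stand-in for `cls._l2t.latex_to_text` so that the example does not depend on pylatexenc, and falling back to the unconverted string is an assumption, since the body of the `except` branch is not shown in the diff.

```python
import logging

logger = logging.getLogger(__name__)


def latex_to_unicode(latex_string, convert):
    """Convert LaTeX to text, logging any failure instead of hiding it.

    `convert` is a hypothetical stand-in for `cls._l2t.latex_to_text`.
    """
    try:
        return convert(latex_string)
    except Exception:
        # logger.exception records the traceback together with the input,
        # so an unexpected failure can be traced to the offending string.
        logger.exception("LaTeX conversion failed for %r", latex_string)
        # Assumption: keep the original string rather than dropping it.
        return latex_string
```

With this shape, a title that cannot be converted is kept verbatim and leaves a log entry that can be inspected later.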
Some more adverse side effects:
This turns valid macros into invalid ones and removes meaningful whitespace. Also, my suggested whitespace change is bad for macros which are left unchanged.
The space after the macros is somewhat important. Change of scope.
list of titles obtained via
and
Thanks @tsgit for your input. I'm scratching my head about https://arxiv.org/abs/1306.5943, which has Unicode characters in the visible title on arXiv and in some of the metadata tags, but contains LaTeX macros for Greek letters in other metadata tags and (most importantly for us) in the OAI-PMH
b176715 to aba69ab
This looks quite good. It still eats multiple whitespace, though: the examples below all have 2 spaces after the macro, and are left with none.
This is the most common problem: reducing 2 spaces after a macro to none.
It removes curlies that are meaningful
but it leaves other curlies.
This is quite rare in titles and could be manually cleaned up.
It has some unintended side effects and (wrong) parentheses.
That's a somewhat common pattern specific to \sqrt and could be manually cleaned up.
Based on the number of titles affected, I think the main remaining issue is the whitespace after a macro. It certainly affects readability.
One more problem: "comments" are stripped, which has unintended consequences.
There are a lot of those.
the
The whitespace issue is trickier. It seems that pylatexenc either leaves all whitespace alone or gobbles up all whitespace, since it is insignificant in a math context.
aba69ab to ed27e3d
Thanks again for the comments @tsgit, I've just pushed a new commit.
Remaining difficult issues:
I don't see an easy way to solve these two issues. I think the current conversion has some quirks but is reasonable enough, so unless you discover new issues I would think it's ready to be merged.
Hi Micha, thanks for special casing
For the whitespace issue, let's take a step back and look at actual data. I should look at mixed math titles to determine the total number of possibly affected records. It may be such a small fraction that it is not worth arguing about.
You are incorrect about accented characters and spaces. A space is not necessary for accented characters, and trailing space is not consumed by LaTeX. This also applies to your current implementation.
The setting for whitespace affects things like Polish l
I agree that this would be a bad thing for
However for the
Among titles without any math delimiters I find 142 titles with a macro with 2 trailing spaces, and inspection shows that this is overwhelmingly intentional spacing.
For pseudo math, retaining the space after a macro avoids unintended contractions, preserves readability, and does not alter the meaning of formulas. One can argue about the aesthetics of things like
Inspecting titles with a mix of math and other things is a bit more involved.
So for the title field, a different whitespace option could be used than for the author field. I have not looked at the impact on the
I also wonder how spacing affects search.
Thanks
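A count like the one above could be obtained along these lines. This is only a sketch: the function name, the choice of math delimiters (`$`, `\(`, `\[`) and the sample titles are assumptions for illustration, not the query that was actually run.

```python
import re

# Assumed notion of "math delimiters": $...$, \(...\) and \[...\].
MATH_DELIMITER = re.compile(r"\$|\\\(|\\\[")
# A control word (backslash plus letters) followed by two spaces.
MACRO_TWO_SPACES = re.compile(r"\\[A-Za-z]+ {2}")


def titles_with_double_space_after_macro(titles):
    """Return titles with no math delimiters that contain a macro
    followed by two spaces."""
    return [
        title
        for title in titles
        if not MATH_DELIMITER.search(title) and MACRO_TWO_SPACES.search(title)
    ]


# Made-up sample titles, for illustration only.
sample = [
    r"Search for \Lambda  baryons",   # macro + 2 spaces, no math: kept
    r"The $\alpha  \beta$ model",     # has math delimiters: skipped
    r"\pi \pi scattering",            # single space only: skipped
]
matches = titles_with_double_space_after_macro(sample)
```

Splitting the selection into "no math delimiters" and "macro plus two spaces" keeps the pseudo-math titles separate from the mixed-math ones, which need a closer look anyway.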
By my count, the total number of titles with
A common pattern in mixed math titles is
Anyhow, most of these look like they need some manual adjustment.
ed27e3d to 9ee29c2
@tsgit you're right about escape sequences inside words. Those should be rare in practice, so I've changed the setting to respect the spacing in the source (and collapsing two spaces to one, to avoid introducing extra whitespace for untranslated macros such as
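The spacing behaviour described here can be illustrated with a small standard-library sketch. It is not the pylatexenc-based code in the PR: the `SYMBOLS` table and `unescape_keep_spacing` are hypothetical names, and the regular expression only recognises simple control words.

```python
import re

# Hypothetical mini table of translatable macros, for illustration only.
SYMBOLS = {"alpha": "α", "beta": "β", "pi": "π"}

# A control word followed by any run of spaces.
_MACRO = re.compile(r"\\([A-Za-z]+)( *)")


def unescape_keep_spacing(text):
    """Translate known macros and keep at most one of the spaces after them."""

    def replace(match):
        name, spaces = match.group(1), match.group(2)
        # Respect the source spacing, but collapse a run of spaces to one.
        spacing = " " if spaces else ""
        if name in SYMBOLS:
            return SYMBOLS[name] + spacing
        # Unknown macros are kept as they are, with the same spacing rule.
        return "\\" + name + spacing

    return _MACRO.sub(replace, text)
```

Under these assumptions, a macro followed by two spaces keeps exactly one space whether or not it is translated, and a macro directly followed by other text gains no extra space.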
9f1f573 to 0d83cc3
* Instead of unescaping everything and potentially losing meaningful info, now only macros which can be translated losslessly are handled (`latex-base` and `advanced-symbols` groups of `pylatexenc`). Macros not in these groups are preserved, as are braces when needed. The only issue is potentially wrong handling of whitespace, but that can't easily be fixed: we respect the spacing used in the source, but that might introduce additional whitespace if a macro is used in the middle of a word. That should be rare though.
* ref: inspirehep/inspirehep#1754
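The allow-list idea can be sketched with the standard library alone. This is not how the PR implements it (that relies on the macro groups of `pylatexenc`); `SAFE_MACROS` and `unescape_safe` are hypothetical names, and the table holds just a few entries for illustration.

```python
import re

# Hypothetical allow-list: macros assumed to translate without loss.
SAFE_MACROS = {"alpha": "α", "pi": "π", "AA": "Å"}


def unescape_safe(text):
    """Translate only allow-listed macros; leave everything else untouched."""

    def translate(match):
        # Fall back to the matched source text for unknown macros.
        return SAFE_MACROS.get(match.group(1), match.group(0))

    # A lone braced macro such as {\AA} loses its braces only if translated.
    text = re.sub(r"\{\\([A-Za-z]+)\}", translate, text)
    # Remaining bare macros: translate if safe, otherwise keep verbatim.
    return re.sub(r"\\([A-Za-z]+)", translate, text)
```

In this sketch `foo {\AA} foo` becomes `foo Å foo`, while `\mathcal{N}` and `{\bf x}` are returned unchanged because neither macro is in the table.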
0d83cc3 to 9d856a1
LGTM