arXiv: only unescape safe LaTeX macros #299

michamos · 2021-03-19T16:01:59Z

Instead of unescaping everything and potentially losing meaningful info,
now only macros which can be translated losslessly are handled
(latex-base and advanced-symbols groups of pylatexenc). Macros
not in these groups are preserved, as are braces when needed. The only
issue is potentially wrong handling of whitespace, but that can't
easily be fixed: we respect the spacing used in the source, but that
might introduce additional whitespace if a macro is used in the middle
of a word. That should be rare though.
ref: Decode LaTeX in arXiv parser inspirehep#1754

tsgit · 2021-03-19T19:17:00Z

I see. Interesting approach. One problem is with escaped whitespace, typically to separate non-braced macros from following text. A common pattern is

_l2t.latex_to_text('foo \\AA\\ foo')
'foo Å\\ foo'

the original handles that better

ol2t.latex_to_text('foo \\AA\\ foo')
'foo Å foo'

both handle 'foo {\AA} foo' ok.

That's a special case of handling whitespace escaping slash

_l2t.latex_to_text('foo \ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \ bar')
'foo  bar'

and

_l2t.latex_to_text('foo \\ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \\ bar')
'foo  bar'

It's easy to address by adding " " to LATEX_ALLOWED_MACROS
since it's part of base_macros. Then I get

_l2t.latex_to_text('foo \\AA\\ bar \\ wibble')
'foo Å bar  wibble'

_l2t.latex_to_text('foo \\AA\ bar \ wibble')
'foo Å bar  wibble'

tsgit · 2021-03-19T19:51:33Z

hepcrawl/parsers/arxiv.py

+    def latex_to_unicode(cls, latex_string):
+        try:
+            return cls._l2t.latex_to_text(latex_string)
+        except Exception:


not sure what would trigger an exception here. I didn't manage to trigger one.
might be interesting to log it somehow, should it happen
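One way to log it, as a minimal sketch. The body of the `except` branch is not shown in the diff above, so the fallback of returning the input unchanged is an assumption, and the `convert` parameter is a hypothetical stand-in for `cls._l2t.latex_to_text` so the snippet runs on its own.

```python
import logging

LOGGER = logging.getLogger(__name__)


def latex_to_unicode(latex_string, convert):
    """Convert LaTeX to text, logging any failure instead of hiding it.

    `convert` stands in for `cls._l2t.latex_to_text`; passing it in keeps
    this sketch independent of the real parser class.
    """
    try:
        return convert(latex_string)
    except Exception:
        # Assumption: on failure we fall back to the unconverted string.
        # Logger.exception records the message plus the full traceback.
        LOGGER.exception("LaTeX conversion failed for %r", latex_string)
        return latex_string
```

With this in place, a title that trips the converter would show up in the crawler logs together with the offending string, which should make it possible to find out what actually triggers the exception.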

tsgit · 2021-03-19T21:44:14Z

some more adverse side effects:
enclosing curlies are removed, this alters the meaning and scope of things

How to Correctly Stitch Together {\it Kepler} Data of a Blazhko Star
How to Correctly Stitch Together \it Kepler Data of a Blazhko Star

First detections of the [NII] 122 {\mu}m line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe
First detections of the [NII] 122 \mum line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe

AVAST Survey 0.4-1.0 {\mu}m Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt
AVAST Survey 0.4-1.0 \mum Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt

Central exclusive J/{\psi} and {\chi}c production at LHCb
Central exclusive J/\psi and \chic production at LHCb

this turns valid macros into invalid ones and removes meaningful whitespace

and, my suggested whitespace change is bad for macros which are left unchanged

Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL
Measurement of the \Sigma beam asymmetry for the \omega photo-production off the proton and the neutron at GRAAL

Profiles of Lyman\alpha\ Emission Lines
Profiles of Lyman\alpha Emission Lines

The nature of [S III]{\lambda}{\lambda}9096, 9532 emitters at z = 1.34 and 1.23
The nature of [S III]\lambda\lambda9096, 9532 emitters at z = 1.34 and 1.23

that space after the macros is somewhat important

change of scope

The estimate of emission region locations of {\it Fermi} FSRQs
The estimate of emission region locations of \it Fermi FSRQs

{\it Ab initio} perturbation calculations of realistic effective interactions in the Hartree--Fock basis
\it Ab initio perturbation calculations of realistic effective interactions in the Hartree--Fock basis

On the origin of the Type~{\sc ii} spicules - dynamic 3D MHD simulations
On the origin of the Type~\sc ii spicules - dynamic 3D MHD simulations

list of titles obtained via

x = perform_request_search(p="245__a:/\\\\/ -245__a:/\$/ ")
len(x)
 2333

and

# titles: the title strings of the records found by the search above
same, changed = set(), set()
for t in titles:
    tn = _l2t.latex_to_text(t)
    if tn == t:
        same.add(t)
    else:
        changed.add((t, tn))
len(changed), len(same)
 (493, 1836)

for t, tn in changed:
    print(f"{t}\n{tn}\n")

michamos · 2021-03-22T08:56:29Z

Thanks @tsgit for your input. I'm scratching my head about https://arxiv.org/abs/1306.5943 which has unicode characters in the visible title on arXiv and some of the metadata tags, but contains LaTeX macros for Greek letters in other metadata tags and (most importantly for us) in the OAI-PMH arXiv export format. AFAICS, none of this behavior is documented, and trying to reverse-engineer this is a waste of time. I'll contact Martin from arXiv to try to get more information.

tsgit · 2021-03-25T19:21:09Z

this looks quite good. it still eats multiple whitespace, though. the examples below all have 2 spaces after the macro, and are left with none

Measurement of the production of  \Xi \iota  pairs in jets at ... 
Measurement of the production of  Ξιpairs in jets at ...

J \psi  Production Z Hadronic Decays
J ψProduction Z Hadronic Decays

Corrections to the  \tau  polarisation
Corrections to the  τpolarisation

Results on a search for Higgs bosons in the h \nu\bar\nu  channel at  sqrt(s) =189 GeV using iterative discriminant analysis
Results on a search for Higgs bosons in the h νν̅channel at  sqrt(s) =189 GeV using iterative discriminant analysis

this is the most common problem. reducing 2 spaces after a macro to none

it removes curlies that are meaningful

\overrightarrow{p}
\overrightarrowp

\lowercase{e}
\lowercasee

but it leaves other curlies

\bar{QCD}-
{̅Q̅C̅D̅}̅-

this is quite rare in titles, could be manually cleaned up

it has some unintended side effects and (wrong) parentheses

at  \sqrt(s) =192-202 GeV
at  √(()s) =192-202 GeV

that's a somewhat common pattern specific to \sqrt and could be manually cleaned up
I fixed 40 records with the pattern \sqrt(s) to \sqrt{s}

tsgit · 2021-03-25T19:25:31Z

Based on number of titles affected I think the main remaining issue is the whitespace after macro issue. It certainly affects readability.
Apart from the \sqrt(s) pattern the other issues I flagged are quite infrequent.

tsgit · 2021-03-25T19:51:15Z

one more problem: "comments" are stripped, that has unintended consequences

Data from Figure 12, 0-20%, shoulder from: Dihadron azimuthal correlations in Au$+$Au collisions at $\sqrt{s_{NN}}=$ 200 GeV
Data from Figure 12, 0-20

Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95% CLs from: Higgs boson production cross-section measurements and their EFT interpretation in the $4\ell $ decay channel at $\sqrt{s}=$13 TeV with the ATLAS detector
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95

Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60% from: Strange hadron production in Au+Au collisions at $\sqrt{s_{NN}}=$7.7 , 11.5, 19.6, 27, and 39 GeV
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60

A 4% measurement of $H_0$ using the cumulative distribution of strong-lensing time delays in doubly-imaged quasars
A 4

there are a lot of those
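The truncation is what one would expect if everything from an unescaped % to the end of the line is treated as a TeX comment. The snippet below is a toy stand-in that reproduces the effect with a regular expression; it is not pylatexenc's implementation, and `strip_tex_comments` is a name made up for illustration.

```python
import re

# An unescaped % starts a TeX comment that runs to the end of the line.
# The lookbehind keeps an escaped \% intact.
_COMMENT_RE = re.compile(r"(?<!\\)%.*")


def strip_tex_comments(text):
    """Toy comment stripper: drop everything from an unescaped % onwards."""
    return _COMMENT_RE.sub("", text)


print(strip_tex_comments("A 4% measurement of $H_0$ using the cumulative distribution"))
print(strip_tex_comments(r"A 4\% measurement of $H_0$"))
```

The first call prints only `A 4`, matching the truncated title above, while the escaped variant is left untouched.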

tsgit · 2021-03-26T02:48:38Z

the comments issue is of course easily addressed by option

keep_comments=True

the whitespace issue is trickier. it seems like either pylatexenc leaves all whitespace alone or it gobbles up all whitespace as it is insignificant in math context.
I don't see an option to consume exactly one whitespace after a macro without {} or trailing \
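To make concrete what "consume exactly one whitespace" would mean, here is a rough sketch using a regular expression and a tiny macro table. It does not use pylatexenc at all; `SAFE_MACROS` and `replace_macros_eating_one_space` are hypothetical names, and the table is only a stand-in for the real set of safe macros.

```python
import re

# Toy table standing in for the "safe" macro set; not pylatexenc's real table.
SAFE_MACROS = {"tau": "τ", "psi": "ψ", "Xi": "Ξ", "iota": "ι"}

# A macro name followed by at most one space.
_MACRO_RE = re.compile(r"\\([A-Za-z]+)( ?)")


def replace_macros_eating_one_space(text):
    """Replace known macros and swallow at most one space after each."""
    def _sub(match):
        name = match.group(1)
        if name not in SAFE_MACROS:
            # Unknown macros are left alone, including the space after them.
            return match.group(0)
        # The optional space in group 2 is dropped together with the macro.
        return SAFE_MACROS[name]
    return _MACRO_RE.sub(_sub, text)


print(replace_macros_eating_one_space("Corrections to the  \\tau  polarisation"))
```

With two spaces after the macro one space survives, so the words no longer run together. With a single space none survives, which is the contraction wanted for cases like Stanis\l aw.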

michamos · 2021-03-26T16:37:57Z

Thanks again for the comments @tsgit, I've just pushed a new commit.

keep_comments=True is enabled to keep things starting with a %;
I've overridden sqrt handling to have different formatting based on whether the next char is a (.
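The idea behind the sqrt override can be illustrated roughly as follows. This is not the code from the commit: it is a regex-based sketch outside pylatexenc, `convert_sqrt` is a made-up name, and the `√(...)` output format is an assumption.

```python
import re


def convert_sqrt(text):
    """Format \\sqrt differently depending on what follows it."""
    # Braced argument: \sqrt{s} becomes √(s), parentheses are added.
    text = re.sub(r"\\sqrt\{([^{}]*)\}", r"√(\1)", text)
    # Next char is already "(": only swap the macro for the symbol,
    # so no second pair of parentheses is introduced.
    text = re.sub(r"\\sqrt(?=\()", "√", text)
    return text


print(convert_sqrt("at  \\sqrt(s) =192-202 GeV"))
print(convert_sqrt("at $\\sqrt{s}=$13 TeV"))
```

Both spellings end up as `√(s)`, which avoids the doubled parentheses reported earlier in this thread.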

Remaining difficult issues:

disappearing spacing: as you certainly know, whitespace in LaTeX is not significant, so 1 vs 2 spaces after a macro are treated in the same way. Macros eat the whitespace after them, so the current uses where space gets gobbled are not strictly LaTeX but TeX-isms (proper LaTeX would be \foo\ bar or \foo{} bar instead of \foo bar). We could set strict_latex_spaces="based-on-source" as documented in https://pylatexenc.readthedocs.io/en/latest/latex2text/#latex-to-text-converter-class, but the opposite issue would arise, in that K\"a hler would be translated to Kä hler. I'm not sure that's much better.
braces: the problem here is that macros in LaTeX can take a variable number of arguments delimited by curly braces. For unknown macros, by definition we don't know how many arguments they take, so they are treated by the library as taking zero arguments, so in \foo{bar} {bar} is treated as a group following the macro, not its argument. Furthermore, I decided to preserve braces only for groups containing more than one char (after conversion) to avoid things like K{\"a}hler -> K{ä}hler. Again, there's nothing we can do here except teaching the latexwalker parser about frequently occurring macros it doesn't know (see example in https://pylatexenc.readthedocs.io/en/latest/latex2text/#custom-latex-conversion-rules-a-simple-template).
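The brace rule described above (keep braces only when the converted group has more than one character) can be sketched like this. It is a simplified illustration with a toy lookup table and a regex that only handles non-nested groups; `KNOWN` and `convert_groups` are hypothetical names, not part of pylatexenc or of this PR.

```python
import re

# Toy conversion table; the real tables are much larger.
KNOWN = {"\\mu": "μ", "\\psi": "ψ", "\\chi": "χ"}

# Matches innermost {...} groups only (no nested braces).
_GROUP_RE = re.compile(r"\{([^{}]*)\}")


def convert_groups(text):
    """Drop braces around groups that convert to one char, keep the rest."""
    def _sub(match):
        content = match.group(1)
        converted = KNOWN.get(content, content)
        if len(converted) <= 1:
            return converted
        return "{" + converted + "}"
    return _GROUP_RE.sub(_sub, text)


print(convert_groups("122 {\\mu}m line"))
print(convert_groups("{\\it Kepler} Data"))
```

Under this rule `{\mu}m` collapses to `μm`, while `{\it Kepler}` keeps its braces because the unknown macro is left as is and the group stays longer than one character.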

I don't see an easy way to solve these two issues. I think the current conversion has some quirks but is reasonable enough, so unless you discover new issues I would think it's ready to be merged.

tsgit · 2021-03-26T21:45:48Z

Hi Micha,

thanks for special casing \\sqrt(s)

For the whitespace issue let's take a step back and look at actual data.

I should look at mixed math titles, to determine the total number of possibly affected records. It may be such a small fraction that it is not worth arguing about.

You are incorrect about accented characters and spaces. Space is not necessary for accented characters and trailing space is not consumed by latex. This also applies to your current implementation.

print(l2t('K\\"ahler or K\\"a hler Stanis\\l aw and Stanis{\\l}aw'))
Kähler or Kä hler Stanisław and Stanisław

The setting for whitespace affects things like Polish l \l in a name.

I agree that this would be a bad thing for author fields.

However for the title field we should compare the frequency of (one letter?) macros (outside of math environments) where trailing space should possibly be consumed
to the frequency of symbol type macros followed by a regular word and
(other) macros with two trailing spaces (which admittedly is a TeX-ism with some layout hinting - and it's not consistently used).

Among titles without any math delimiters I find 142 titles with a macro with 2 trailing spaces -- and inspection shows that this is overwhelmingly intentional spacing
I find zero titles with a name with a one letter macro. I do find two 1-letter instances
... the \\Z boson ...
... and D\\O Experiments
and here space is good. The contraction is hard to read

print(l2t('and D\\O Experiments'))
and DØExperiments

For pseudo-math, retaining the space after a macro avoids unintended contractions, preserves readability, and does not alter the meaning of formulas. One can argue about the aesthetics of things like μνμ vs. μ ν μ.

Inspecting titles with a mix of math and other things is a bit more involved.

So for the title field, a different whitespace option could be used than for the author field.

I have not looked at impact on the abstract field at all.

I also wonder how spacing affects search.
If the title contains \\lambda couplings and that is converted to λcouplings
Is a title search for λ couplings going to find that ?

Thanks
T.

tsgit · 2021-03-27T02:04:11Z

by my count, the total number of titles with \[A-z] outside of math is 1465
so it's a small fraction of records overall

A common pattern in mixed math titles is \x \to \y. In math mode there is some judicious whitespace on both sides of the arrow. In your current solution there is not

print(l2t('\\tau^- \\to K^*'))
τ^- →K^*

print(l2t('\\tau \\to \\mu \\nu_\\mu '))
τ→μν_μ

anyhow, most of these look like they need some manual adjustment.

michamos · 2021-03-29T13:03:55Z

@tsgit you're right about escape sequences inside words. Those should be rare in practice, so I've changed the setting to respect spacing in source (and collapsing two spaces to one, to avoid introducing extra whitespace for untranslated macros such as {\\sc ii}).


tsgit

LGTM

michamos requested a review from tsgit March 19, 2021 16:01

michamos changed the title ~~arXiv: only escape arXiv escape sequences~~ arXiv: only unescape arXiv escape sequences Mar 19, 2021

michamos force-pushed the arxiv-latex-escape branch 3 times, most recently from 9e3a638 to b176715 Compare March 19, 2021 16:16

tsgit reviewed Mar 19, 2021


michamos force-pushed the arxiv-latex-escape branch from b176715 to aba69ab Compare March 23, 2021 17:59

michamos changed the title ~~arXiv: only unescape arXiv escape sequences~~ arXiv: only unescape safe LaTeX macros Mar 23, 2021

michamos force-pushed the arxiv-latex-escape branch from aba69ab to ed27e3d Compare March 26, 2021 16:18

michamos force-pushed the arxiv-latex-escape branch from ed27e3d to 9ee29c2 Compare March 29, 2021 12:59

michamos force-pushed the arxiv-latex-escape branch 2 times, most recently from 9f1f573 to 0d83cc3 Compare March 29, 2021 14:15

michamos force-pushed the arxiv-latex-escape branch from 0d83cc3 to 9d856a1 Compare March 29, 2021 14:25

tsgit approved these changes Mar 29, 2021


michamos merged commit f863d80 into inspirehep:master Mar 30, 2021

michamos deleted the arxiv-latex-escape branch March 30, 2021 07:30

michamos mentioned this pull request May 10, 2021

LaTeX in titles is wrongly escaped adsabs/bumblebee#2135

Open
