arXiv: only unescape safe LaTeX macros #299

Merged
merged 1 commit into from
Mar 30, 2021

Conversation

michamos

@michamos michamos commented Mar 19, 2021

  • Instead of unescaping everything and potentially losing meaningful info,
    now only macros which can be translated losslessly are handled
    (latex-base and advanced-symbols groups of pylatexenc). Macros
    not in these groups are preserved, as are braces when needed. The only
    issue is potentially wrong handling of whitespace, but that can't
    easily be fixed: we respect the spacing used in the source, but that
    might introduce additional whitespace if a macro is used in the middle
    of a word. That should be rare though.
  • ref: Decode LaTeX in arXiv parser inspirehep#1754
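
The allowlist idea described above can be sketched without pylatexenc. This is a toy illustration, not the code in this PR: `SAFE_MACROS`, `MACRO` and `unescape_safe_macros` are hypothetical names, the five-entry table merely stands in for the `latex-base` and `advanced-symbols` groups, and braces and escaped whitespace are not handled at all.

```python
import re

# Hypothetical, tiny allowlist; the PR itself relies on pylatexenc's macro groups.
SAFE_MACROS = {"alpha": "α", "mu": "μ", "tau": "τ", "psi": "ψ", "AA": "Å"}

# A backslash followed by one or more ASCII letters.
MACRO = re.compile(r"\\([A-Za-z]+)")


def unescape_safe_macros(text):
    """Replace allowlisted macros with Unicode and leave every other macro as-is."""

    def replace(match):
        # Unknown macros fall back to the original matched text, so nothing is lost.
        return SAFE_MACROS.get(match.group(1), match.group(0))

    return MACRO.sub(replace, text)
```

Spacing is taken from the source as-is, so `\tau polarisation` keeps its space while `{\it Kepler}` is passed through unchanged.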

@michamos michamos requested a review from tsgit March 19, 2021 16:01
@michamos michamos changed the title arXiv: only escape arXiv escape sequences arXiv: only unescape arXiv escape sequences Mar 19, 2021
@michamos michamos force-pushed the arxiv-latex-escape branch 3 times, most recently from 9e3a638 to b176715 Compare March 19, 2021 16:16
@tsgit

tsgit commented Mar 19, 2021

I see. Interesting approach. One problem is with escaped whitespace, typically used to separate non-braced macros from the following text. A common pattern is

_l2t.latex_to_text('foo \\AA\\ foo')
'foo Å\\ foo'

the original handles that better

ol2t.latex_to_text('foo \\AA\\ foo')
'foo Å foo'

both handle 'foo {\AA} foo' ok.

That's a special case of handling a backslash that escapes whitespace

_l2t.latex_to_text('foo \ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \ bar')
'foo  bar'

and

_l2t.latex_to_text('foo \\ bar')
'foo \\ bar'

ol2t.latex_to_text('foo \\ bar')
'foo  bar'

It's easy to address by adding " " to LATEX_ALLOWED_MACROS
since it's part of base_macros. Then I get

_l2t.latex_to_text('foo \\AA\\ bar \\ wibble')
'foo Å bar  wibble'

_l2t.latex_to_text('foo \\AA\ bar \ wibble')
'foo Å bar  wibble'

def latex_to_unicode(cls, latex_string):
    try:
        return cls._l2t.latex_to_text(latex_string)
    except Exception:

not sure what would trigger an exception here. I didn't manage to trigger one.
might be interesting to log it somehow, should it happen
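
One way to act on the logging suggestion is sketched below. It is an assumption-laden illustration, not the PR's code: `convert` is a stand-in for the real converter, and returning the input unchanged on failure is a guess at a sensible fallback, since the body of the `except` branch is not shown here.

```python
import logging

logger = logging.getLogger(__name__)


def latex_to_unicode(latex_string, convert):
    """Convert LaTeX to text, logging any failure instead of hiding it.

    `convert` is a hypothetical stand-in for the real converter, e.g. a
    bound `latex_to_text` method.
    """
    try:
        return convert(latex_string)
    except Exception:
        # logger.exception records the traceback together with the offending input.
        logger.exception("LaTeX conversion failed for %r", latex_string)
        # Assumed fallback: hand back the untouched source string.
        return latex_string
```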

@tsgit

tsgit commented Mar 19, 2021

some more adverse side effects:
enclosing curlies are removed, this alters the meaning and scope of things

How to Correctly Stitch Together {\it Kepler} Data of a Blazhko Star
How to Correctly Stitch Together \it Kepler Data of a Blazhko Star

First detections of the [NII] 122 {\mu}m line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe
First detections of the [NII] 122 \mum line at high redshift: Demonstrating the utility of the line for studying galaxies in the early universe

AVAST Survey 0.4-1.0 {\mu}m Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt
AVAST Survey 0.4-1.0 \mum Spectroscopy of Igneous Asteroids in the Inner and Middle Main Belt

Central exclusive J/{\psi} and {\chi}c production at LHCb
Central exclusive J/\psi and \chic production at LHCb

this turns valid macros into invalid ones and removes meaningful whitespace

and, my suggested whitespace change is bad for macros which are left unchanged

Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL
Measurement of the \Sigma beam asymmetry for the \omega photo-production off the proton and the neutron at GRAAL

Profiles of Lyman\alpha\ Emission Lines
Profiles of Lyman\alpha Emission Lines

The nature of [S III]{\lambda}{\lambda}9096, 9532 emitters at z = 1.34 and 1.23
The nature of [S III]\lambda\lambda9096, 9532 emitters at z = 1.34 and 1.23

that space after the macros is somewhat important

change of scope

The estimate of emission region locations of {\it Fermi} FSRQs
The estimate of emission region locations of \it Fermi FSRQs

{\it Ab initio} perturbation calculations of realistic effective interactions in the Hartree--Fock basis
\it Ab initio perturbation calculations of realistic effective interactions in the Hartree--Fock basis

On the origin of the Type~{\sc ii} spicules - dynamic 3D MHD simulations
On the origin of the Type~\sc ii spicules - dynamic 3D MHD simulations

list of titles obtained via

x = perform_request_search(p="245__a:/\\\\/ -245__a:/\$/ ")
len(x)
 2333

and

same, changed = set(), set()  # unchanged titles / (original, converted) pairs
for t in titles:
    tn = _l2t.latex_to_text(t)
    if tn == t:
        same.add(t)
    else:
        changed.add((t, tn))
len(changed), len(same)
 (493, 1836)

for t, tn in changed:
    print(f"{t}\n{tn}\n")

@michamos

Thanks @tsgit for your input. I'm scratching my head about https://arxiv.org/abs/1306.5943 which has unicode characters in the visible title on arXiv and some of the metadata tags, but contains LaTeX macros for Greek letters in other metadata tags and (most importantly for us) in the OAI-PMH arXiv export format. AFAICS, none of this behavior is documented, and trying to reverse-engineer this is a waste of time. I'll contact Martin from arXiv to try to get more information.

@michamos michamos changed the title arXiv: only unescape arXiv escape sequences arXiv: only unescape safe LaTeX macros Mar 23, 2021
@tsgit

tsgit commented Mar 25, 2021

this looks quite good. it still eats multiple whitespace, though. the examples below all have 2 spaces after the macro, and are left with none

Measurement of the production of  \Xi \iota  pairs in jets at ... 
Measurement of the production of  Ξιpairs in jets at ...
J \psi  Production Z Hadronic Decays
J ψProduction Z Hadronic Decays
Corrections to the  \tau  polarisation
Corrections to the  τpolarisation
Results on a search for Higgs bosons in the h \nu\bar\nu  channel at  sqrt(s) =189 GeV using iterative discriminant analysis
Results on a search for Higgs bosons in the h νν̅channel at  sqrt(s) =189 GeV using iterative discriminant analysis

this is the most common problem. reducing 2 spaces after a macro to none

it removes curlies that are meaningful

\overrightarrow{p}
\overrightarrowp
\lowercase{e}
\lowercasee

but it leaves other curlies

\bar{QCD}-
{̅Q̅C̅D̅}̅-

this is quite rare in titles, could be manually cleaned up

it has some unintended side effects on (wrong) parentheses

at  \sqrt(s) =192-202 GeV
at  √(()s) =192-202 GeV

that's a somewhat common pattern specific to \sqrt and could be manually cleaned up
I fixed 40 records with the pattern \sqrt(s) to \sqrt{s}
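
A cleanup like the one mentioned above could be scripted roughly as follows. This is only a sketch: the records were fixed by hand as far as this thread says, `SQRT_PAREN` and `brace_sqrt_argument` are hypothetical names, and nested parentheses are deliberately not handled.

```python
import re

# \sqrt followed directly by a parenthesised argument without nested parentheses.
SQRT_PAREN = re.compile(r"\\sqrt\(([^()]*)\)")


def brace_sqrt_argument(title):
    """Rewrite \\sqrt(x) as \\sqrt{x} so that x is parsed as the macro's argument."""
    # A function is used as replacement to avoid backslash-escape handling in templates.
    return SQRT_PAREN.sub(lambda m: r"\sqrt{" + m.group(1) + "}", title)
```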

@tsgit

tsgit commented Mar 25, 2021

Based on number of titles affected I think the main remaining issue is the whitespace after macro issue. It certainly affects readability.
Apart from the \sqrt(s) pattern the other issues I flagged are quite infrequent.

@tsgit

tsgit commented Mar 25, 2021

one more problem: "comments" are stripped, that has unintended consequences

Data from Figure 12, 0-20%, shoulder from: Dihadron azimuthal correlations in Au$+$Au collisions at $\sqrt{s_{NN}}=$ 200 GeV
Data from Figure 12, 0-20
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95% CLs from: Higgs boson production cross-section measurements and their EFT interpretation in the $4\ell $ decay channel at $\sqrt{s}=$13 TeV with the ATLAS detector
Data from Figure 18b - $c_{H\tilde{B}}$vs.$c_{H\tilde{G}}$ Obs. 95
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60% from: Strange hadron production in Au+Au collisions at $\sqrt{s_{NN}}=$7.7 , 11.5, 19.6, 27, and 39 GeV
Data from Xi- pT spectrum, Au+Au 7.7 GeV, 40-60
A 4% measurement of $H_0$ using the cumulative distribution of strong-lensing time delays in doubly-imaged quasars
A 4

there are a lot of those
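
The truncation above follows from LaTeX treating an unescaped `%` as the start of a comment that runs to the end of the line. A small stdlib sketch that reproduces the effect and could be used to find affected titles (the names are hypothetical, and the `\\%` corner case of an escaped backslash before the percent sign is ignored):

```python
import re

# A percent sign that is not preceded by a backslash.
UNESCAPED_PERCENT = re.compile(r"(?<!\\)%")


def has_latex_comment(title):
    """Return True if the title contains a % that LaTeX would read as a comment."""
    return bool(UNESCAPED_PERCENT.search(title))


def strip_latex_comment(title):
    """Mimic comment stripping: drop everything from the first unescaped % onwards."""
    return UNESCAPED_PERCENT.split(title, maxsplit=1)[0]
```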

@tsgit

tsgit commented Mar 26, 2021

the comments issue is of course easily addressed by the option

keep_comments=True

the whitespace issue is trickier. it seems like either pylatexenc leaves all whitespace alone or it gobbles up all whitespace as it is insignificant in math context.
I don't see an option to consume exactly one whitespace after a macro without {} or trailing \

@michamos

Thanks again for the comments @tsgit, I've just pushed a new commit.

  • keep_comments=True is enabled to keep things starting with a %;
  • I've overridden sqrt handling to have different formatting based on whether the next char is a (.

Remaining difficult issues:

  • disappearing spacing: as you certainly know, whitespace in LaTeX is not significant, so 1 vs 2 spaces after a macro are treated in the same way. Macros eat the whitespace after them, so the current uses where space gets gobbled are not strictly LaTeX but TeX-isms (proper LaTeX would be \foo\ bar or \foo{} bar instead of \foo bar). We could set strict_latex_spaces="based-on-source" as documented in https://pylatexenc.readthedocs.io/en/latest/latex2text/#latex-to-text-converter-class, but the opposite issue would arise, in that K\"a hler would be translated to Kä hler. I'm not sure that's much better.
  • braces: the problem here is that macros in LaTeX can take a variable number of arguments delimited by curly braces. For unknown macros, by definition we don't know how many arguments they take, so the library treats them as taking zero arguments: in \foo{bar}, {bar} is treated as a group following the macro, not as its argument. Furthermore, I decided to preserve braces only for groups containing more than one char (after conversion) to avoid things like K{\"a}hler -> K{ä}hler. Again, there's nothing we can do here except teach the latexwalker parser about frequently occurring macros it doesn't know (see example in https://pylatexenc.readthedocs.io/en/latest/latex2text/#custom-latex-conversion-rules-a-simple-template).
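
The brace rule from the second bullet can be stated as a few lines of Python. This is a sketch of the rule only, with hypothetical names (`render_group`, `min_len`); it does not show how the PR wires the rule into pylatexenc.

```python
def render_group(converted_contents, min_len=2):
    """Re-emit a braced group after its contents have been converted.

    Braces are kept only when the converted contents are at least `min_len`
    characters long, so single converted characters lose their braces.
    """
    if len(converted_contents) >= min_len:
        return "{" + converted_contents + "}"
    return converted_contents
```

With this rule the group in `K{\"a}hler` collapses to a bare `ä`, while a group such as `{\it Kepler}` keeps its braces and therefore its scope.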

I don't see an easy way to solve these two issues. I think the current conversion has some quirks but is reasonable enough, so unless you discover new issues I would think it's ready to be merged.

@tsgit

tsgit commented Mar 26, 2021

Hi Micha,

thanks for special casing \\sqrt(s)

For the whitespace issue let's take a step back and look at actual data.

I should look at mixed math titles, to determine the total number of possibly affected records. It may be such a small fraction that it is not worth arguing about.

You are incorrect about accented characters and spaces. Space is not necessary for accented characters and trailing space is not consumed by latex. This also applies to your current implementation.

print(l2t('K\\"ahler or K\\"a hler Stanis\\l aw and Stanis{\\l}aw'))
Kähler or Kä hler Stanisław and Stanisław

The setting for whitespace affects things like Polish l \l in a name.

I agree that this would be a bad thing for author fields.

However for the title field we should compare the frequency of (one letter?) macros (outside of math environments) where trailing space should possibly be consumed
to the frequency of symbol type macros followed by a regular word and
(other) macros with two trailing spaces (which admittedly is a TeX-ism with some layout hinting - and it's not consistently used).

Among titles without any math delimiters I find 142 titles with a macro with 2 trailing spaces -- and inspection shows that this is overwhelmingly intentional spacing
I find zero titles with a name with a one letter macro. I do find two 1-letter instances
... the \\Z boson ...
... and D\\O Experiments
and here space is good. The contraction is hard to read

print(l2t('and D\\O Experiments'))
and DØExperiments
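
A count like the one above (macros with two trailing spaces, in titles without math delimiters) could be obtained along these lines. The exact query used is not shown in this thread, so the regex and the names below are assumptions.

```python
import re

# A macro name made of letters, followed by exactly two or more spaces.
MACRO_TWO_SPACES = re.compile(r"\\[A-Za-z]+ {2}")


def has_macro_with_two_trailing_spaces(title):
    """True for titles without $ that contain a macro followed by two spaces."""
    return "$" not in title and bool(MACRO_TWO_SPACES.search(title))


def count_affected(titles):
    """Count the titles that match the pattern above."""
    return sum(1 for title in titles if has_macro_with_two_trailing_spaces(title))
```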

For pseudo-math, retaining the space after a macro avoids unintended contractions, preserves readability, and does not alter the meaning of formulas. One can argue about the aesthetics of things like μνμ vs. μ ν μ.

Inspecting titles with a mix of math and other things is a bit more involved.

So for the title field, a different whitespace option could be used than for the author field.

I have not looked at impact on the abstract field at all.

I also wonder how spacing affects search.
If the title contains \\lambda couplings and that is converted to λcouplings
Is a title search for λ couplings going to find that?

Thanks
T.
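
The search concern raised above can be made concrete with a deliberately naive model of a search index that lowercases and splits on whitespace. Real search backends use different analyzers, so this sketch (with hypothetical function names) only illustrates the worry and does not settle how INSPIRE search behaves.

```python
def whitespace_tokens(text):
    """Lowercase and split on whitespace: a deliberately naive analyzer."""
    return text.lower().split()


def matches_all_terms(query, title):
    """True if every query token appears as a whole token in the title."""
    title_tokens = set(whitespace_tokens(title))
    return all(term in title_tokens for term in whitespace_tokens(query))
```

Under this model a query for `λ couplings` matches a title containing `λ couplings` but not one where the conversion produced `λcouplings`.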

@tsgit

tsgit commented Mar 27, 2021

by my count, the total number of titles with \[A-z] outside of math is 1465
so it's a small fraction of records overall

A common pattern in mixed math titles is \x \to \y. In math mode there is some judicious whitespace on both sides of the arrow. In your current solution there is not

print(l2t('\\tau^- \\to K^*'))
τ^- →K^*
print(l2t('\\tau \\to \\mu \\nu_\\mu '))
τ→μν_μ

anyhow, most of these look like they need some manual adjustment.

@michamos

@tsgit you're right about escape sequences inside words. Those should be rare in practice, so I've changed the setting to respect the spacing in the source (and to collapse two spaces into one, to avoid introducing extra whitespace for untranslated macros such as {\\sc ii}).
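
The space-collapsing step can be pictured as a simple post-processing pass. This is a sketch under the assumption that it runs on the whole converted string and treats any run of two or more spaces alike; the actual commit may implement it differently, and the names are hypothetical.

```python
import re

# Any run of two or more consecutive spaces.
MULTIPLE_SPACES = re.compile(r" {2,}")


def collapse_spaces(text):
    """Replace every run of two or more spaces with a single space."""
    return MULTIPLE_SPACES.sub(" ", text)
```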

@michamos michamos force-pushed the arxiv-latex-escape branch 2 times, most recently from 9f1f573 to 0d83cc3 Compare March 29, 2021 14:15
* Instead of unescaping everything and potentially losing meaningful info,
  now only macros which can be translated losslessly are handled
  (`latex-base` and `advanced-symbols` groups of `pylatexenc`). Macros
  not in these groups are preserved, as are braces when needed. The only
  issue is potentially wrong handling of whitespace, but that can't
  easily be fixed: we respect the spacing used in the source, but that
  might introduce additional whitespace if a macro is used in the middle
  of a word. That should be rare though.
* ref: inspirehep/inspirehep#1754

@tsgit tsgit left a comment

LGTM

@michamos michamos merged commit f863d80 into inspirehep:master Mar 30, 2021
@michamos michamos deleted the arxiv-latex-escape branch March 30, 2021 07:30