
LaTeX reader: Improve \noindent and \textgreek parsing #1783

Closed
adunning opened this issue Dec 4, 2014 · 8 comments

Comments

@adunning
Contributor

adunning commented Dec 4, 2014

In a LaTeX document (credit to one of my students), I have noticed that Pandoc is randomly dropping text from \it and \textgreek commands. Try, for instance:

pandoc -f latex -t markdown << EOT

\medskip
\noindent {\it Hypothesize} is composed of the noun {\it hypothesis} and the verb forming the suffix {\it -ize} (OED s.v. 'hypothesize, v.'). The suffix could be traced from late Latin {\it -iz\={a}re, -\={i}z\={a}re}, to Greek \textgreek{-ίζειν} (formative of verbs) (OED s.v. '-ize, suffix').

EOT

Expected output:

*Hypothesize* is composed of the noun *hypothesis* and the verb forming the
suffix *-ize* (OED s.v. ’hypothesize, v.’). The suffix could be traced from late 
Latin *-izāre, -īzāre*, to Greek ίζειν (formative of verbs) (OED s.v. ’-ize, suffix’).

Actual output:

is composed of the noun <span>*hypothesis*</span> and the verb forming the
suffix <span>*-ize*</span> (OED s.v. ’hypothesize, v.’). The suffix
could be traced from late Latin <span>*-izāre, -īzāre*</span>, to Greek
(formative of verbs) (OED s.v. ’-ize, suffix’).
@jgm
Owner

jgm commented Dec 4, 2014

I assure you, it's not random! Pandoc simply doesn't recognize the \textgreek command (which is not standard latex), so it gets parsed as raw latex. It would appear in latex/PDF output but not markdown.

I could add support for the \textgreek command and have pandoc pass through the contents literally or in a special span.

I didn't see problems with \it in your sample. (Btw, it's best practice in LaTeX to use \emph{blah} instead of the old {\it blah}. If you do that you won't get the span tags.)

@jgm
Owner

jgm commented Dec 4, 2014

Possible workaround: add to the end of your latex preamble,

\renewcommand{\textgreek}[1]{#1}
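
For readers unfamiliar with LaTeX macros, the effect of this definition can be sketched in Python. This is only an illustration of what an identity macro does to the source text, not how pandoc expands macros internally; the function name `expand_identity_macro` is hypothetical, and nested braces inside the argument are deliberately not handled.

```python
import re

def expand_identity_macro(src, name="textgreek"):
    """Replace each call of a one-argument macro with its bare argument.

    Mimics a definition whose body is just #1. Assumes the argument
    contains no nested braces.
    """
    pattern = re.compile(r"\\" + re.escape(name) + r"\{([^{}]*)\}")
    # A function replacement avoids backslash-escape processing of the argument.
    return pattern.sub(lambda m: m.group(1), src)

sample = r"to Greek \textgreek{-ίζειν} (formative of verbs)"
print(expand_identity_macro(sample))
# → to Greek -ίζειν (formative of verbs)
```

Once the macro call is reduced to plain text like this, there is no unrecognized command left for the reader to treat as raw LaTeX.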

@adunning
Contributor Author

adunning commented Dec 4, 2014

Thanks for the response! I'm certainly aware that these aren't normal, but it's always interesting to see what happens with documents from the wild.

Adding that to the preamble takes care of the Greek. I said it was random because there are actually a few places where the Greek comes through without that, which I don't understand.

Interesting that you cannot reproduce the italics problem. I'm running Pandoc 1.13.1; perhaps it has been fixed in the development version?

@adunning
Contributor Author

adunning commented Dec 4, 2014

Sorry, I think I misunderstood 'I didn't see', and my example could have been clearer; I did not realize the <span> tags were intentional, but the real problem is that the first word of the paragraph is being dropped. To be more specific, this returns only a blank line:

$ pandoc -f latex -t markdown << EOT
\noindent {\it Hypothesize}
EOT

@jgm
Owner

jgm commented Dec 4, 2014

Ah. I see. The problem is that pandoc doesn't recognize \noindent either, and so treats it as a latex command and guesses that {\it Hypothesize} is its argument -- so the whole thing gets treated as raw tex. Pandoc should definitely support \noindent better.
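
The failure mode described here can be reproduced with a toy reader. The following Python sketch is not pandoc's actual (Haskell) LaTeX reader; `KNOWN`, `read_group`, and `to_text` are hypothetical names, and the logic is a deliberately simplified assumption: an unrecognized command swallows any brace groups that follow it as presumed arguments, the result counts as raw TeX, and raw TeX is omitted when writing a non-TeX format.

```python
import re

# Commands this toy reader understands (hypothetical; a real reader knows many more).
KNOWN = {"it"}

def read_group(src, i):
    """Return the index just past the brace group that starts at src[i] == '{'."""
    depth = 0
    for j in range(i, len(src)):
        if src[j] == "{":
            depth += 1
        elif src[j] == "}":
            depth -= 1
            if depth == 0:
                return j + 1
    raise ValueError("unbalanced braces")

def to_text(src, known=KNOWN):
    """Convert a tiny LaTeX subset to plain text, dropping raw TeX."""
    out, i = [], 0
    while i < len(src):
        m = re.match(r"\\([A-Za-z]+)\s*", src[i:])
        if m:
            i += m.end()
            if m.group(1) in known:
                continue  # known switch such as \it: keep the text that follows
            # Unknown command: guess that following brace groups are its
            # arguments and discard them together with the command as raw TeX.
            while i < len(src) and src[i] == "{":
                i = read_group(src, i)
        elif src[i] in "{}":
            i += 1  # bare group delimiters carry no text of their own
        else:
            out.append(src[i])
            i += 1
    return "".join(out).strip()

sample = r"\noindent {\it Hypothesize} is composed"
print(to_text(sample))                              # → is composed
print(to_text(sample, known={"it", "noindent"}))    # → Hypothesize is composed
```

With `\noindent` unknown, the group `{\it Hypothesize}` is consumed as its argument and vanishes; once the command is recognized as taking no argument, the first word survives.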

adunning changed the title from "LaTeX reader: Text from \it and \textgreek commands randomly dropped" to "LaTeX reader: Improve \noindent and \textgreek parsing" on Dec 4, 2014
@adunning
Contributor Author

adunning commented Dec 4, 2014

That makes more sense; I've changed the title to reflect this. (And, personally, I don't think there's much reason to put Greek text in a span.)

@jgm
Owner

jgm commented Dec 4, 2014

Pandoc just treats bare {...} as a span, since that's more or less what it is.

@jgm
Owner

jgm commented Dec 15, 2014

Note: we need to treat {...} as a span because of pandoc-citeproc, which needs to know when things have been protected from capitalization transformations.
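
A rough illustration of why the grouping has to be kept: if a later step changes the case of a title, it needs to know which pieces were wrapped in braces. The Python sketch below is an assumption-laden toy, not pandoc-citeproc's implementation; `parse_title` and `lowercase_unprotected` are hypothetical names, plain lowercasing stands in for whatever capitalization transformation a citation style applies, and nested braces are not handled.

```python
import re

def parse_title(title):
    """Split a title into (text, protected) pieces; brace groups are protected."""
    pieces = []
    for m in re.finditer(r"\{([^{}]*)\}|([^{}]+)", title):
        if m.group(1) is not None:
            pieces.append((m.group(1), True))   # text that was inside {...}
        else:
            pieces.append((m.group(2), False))  # ordinary text
    return pieces

def lowercase_unprotected(pieces):
    """Lowercase ordinary text and leave protected pieces untouched."""
    return "".join(text if protected else text.lower()
                   for text, protected in pieces)

title = "The Suffix {-ize} in the {OED}"
print(lowercase_unprotected(parse_title(title)))
# → the suffix -ize in the OED

# If the reader had thrown the braces away, the protection would be lost:
print("The Suffix -ize in the OED".lower())
# → the suffix -ize in the oed
```

Representing `{...}` as a span plays the role of the `protected` flag here: it carries the grouping through to the stage that needs it.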

jgm closed this as completed in 9bf76fa on Dec 15, 2014