Recognition of LTR (Left to Right) Word(s) in an *RTL* document #5558

Open
nima87 opened this issue Jun 6, 2019 · 7 comments

Comments

@nima87

nima87 commented Jun 6, 2019

#5545
I think I explained my request (which is not really a problem with Pandoc) clearly, but I will repeat it here. If you find any part ambiguous, please ask me to elaborate.
fmpandoc.docx
GitHub doesn't support tex files, so I couldn't upload the Pandoc-converted tex file. Suppose you convert this file to tex with Pandoc by:

pandoc -s fmpandoc.docx --wrap=none -t latex -o fmpandoc.tex

For simplicity, you can safely remove the generated tex preamble and add these lines before \begin{document}:

\documentclass{article}
\usepackage{xepersian}% with xepersian the Bidi package is loaded.
\settextfont{Tahoma}% I suppose you have Tahoma unicode which contains Persian glyphs.

When this file is compiled with xelatex, the English group is rendered in reverse word order; that is, (The Wild Flower Key. Frederick Warne & Co. p. 310.) is rendered as (.310 p. Co. & Warne Frederick Key. Flower Wild The). If you put it inside an \lr{...} command, the English sentence is rendered in the correct order.
What I had in mind was not marking both LTR and RTL words, only the LTR words. Is it possible to have Pandoc put LTR words inside an \lr{} command?
I appreciate your efforts in creating and developing Pandoc; it is really great. Thank you.
Best.

  • Footnotes need to be handled differently. If the footnote is purely Latin, the command is \LTRfootnote{}; if it contains both RTL and LTR text, it is \RTLfootnote{}.
  • I have no use for it myself, but for the sake of a thorough explanation: suppose you have an English article with some Persian words or paragraphs in it. You only need to load the bidi package; you don't need xepersian. You have to define a Persian font family and put your RTL words in an \RL{...} command, or your paragraph in \begin{RTL}...\end{RTL} (case-sensitive), including your Persian font command in both. If you are interested, I can upload an MWE.
@jgm
Owner

jgm commented Jun 6, 2019

It's unfortunate in this case that Word marks rtl but not ltr.

Here's a step toward a solution. Convert the Word document to markdown (even with the released version, since the rtl spans are useless here), and explicitly add spans marking the English passages, like so:

نعنا[^1] ([The Wild Flower Key. Frederick Warne & Co. p. 310.]{lang=en}) گیاهی است
علفی با ریشه هوایی و ساقه‌های مستقیم و چهارگوش و زیرزمینی. ساقه و
برگ‌های خوش‌بوی آن خوراکی و دارویی است و گاهی گل‌های رنگین دارد.

Now convert the resulting markdown file to a PDF, using

pandoc fmpandoc.md -o fmpandoc.pdf --pdf-engine=xelatex -Mlang=fa -Mmainfont=Tahoma

You should see proper alignment for the English and Persian phrases. (This approach uses pandoc's default polyglossia.) Maybe not quite what you want, and requires manual intervention via an intermediary markdown file, but a start.

What we need is a way to automatically mark up the English bits as english (or alternatively as dir=ltr).

This could probably be done using a lua filter, but it's a bit complex since you have to put the span over multiple consecutive elements.
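As a rough illustration of the automatic markup step, here is a minimal Python sketch that pre-processes the intermediate markdown text instead of using a Lua filter. It is not a pandoc feature: `classify` and `mark_english` are hypothetical names, the RTL character ranges are an approximation, and it assumes the English runs contain no `]` characters or other markdown syntax that would break a bracketed span.

```python
import re

# Approximate ranges for Hebrew/Arabic-script letters (an assumption, not exhaustive).
RTL = re.compile(r"[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]")

def classify(token):
    """Label one space-delimited token as 'rtl', 'ltr', 'weak' or 'neutral'."""
    if RTL.search(token):
        return "rtl"
    if re.search(r"[A-Za-z]", token):
        return "ltr"
    if re.search(r"[0-9]", token):
        return "weak"      # digits only join a run that a Latin word started
    return "neutral"       # bare punctuation such as "&"

def mark_english(text, attr="lang=en"):
    """Wrap each maximal run of Latin tokens in a bracketed span."""
    tokens = text.split(" ")
    kinds = [classify(t) for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        if kinds[i] != "ltr":
            out.append(tokens[i])
            i += 1
            continue
        # Extend the run up to the next RTL token, ending on a Latin or digit token.
        j = last = i
        while j + 1 < len(tokens) and kinds[j + 1] != "rtl":
            j += 1
            if kinds[j] in ("ltr", "weak"):
                last = j
        run = " ".join(tokens[i:last + 1])
        # Keep enclosing parentheses outside the span.
        lead = len(run) - len(run.lstrip("("))
        trail = len(run) - len(run.rstrip(")"))
        core = run[lead:len(run) - trail]
        out.append(run[:lead] + "[" + core + "]{" + attr + "}" + run[len(run) - trail:])
        i = last + 1
    return " ".join(out)
```

Because the grouping happens on whole tokens, one span covers a multi-word English phrase, which is the part that is awkward to do element by element. Passing `attr="dir=ltr"` would produce the alternative markup instead.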

@nima87
Author

nima87 commented Jun 7, 2019

@jgm Thank you. So you mean I should go into the markdown file and wrap each English word or group of words in ([...]{lang=en}). That is actually what I have to do in tex: put them in \lr{...}. My current solution with Emacs is semi-automatic. On my tex file I use Emacs query-replace-regexp:

M-x query-replace-regexp \(\ *\)\([[:ascii:]]*[a-zA-Z]\) RET \1\\lr{\2}

It incrementally finds and replaces the instances of English words. In some cases the boundaries of the English group are not recognized correctly, so I have to stop query-replace-regexp, put the phrase in the argument of the \lr{} command manually, and then press M-C-% at that point [the key binding for query-replace-regexp] to continue the replacement until the next improper instance. Keep in mind that my document is 600 A4 pages, and every single page has several instances of English words. An automatic solution in Pandoc would have saved a great deal of time, but even without this feature Pandoc is great for me. Just imagine what my task would have consisted of without Pandoc: reading line by line to find italic, bold, and underlined words, both English and Persian, and typing \textit{...}, \textbf{...}, \underline{...}, let alone figures, tables, and much more. So Pandoc really helps.
Have a nice time.

@mb21
Collaborator

mb21 commented Jun 7, 2019

If I understand correctly, the underlying problem is that Word doesn't have a representation for rtl documents (as latex, html and pandoc-markdown do). In Word, all documents are ltr documents, and some (or all) parts of the text are marked up as rtl.

But if you know it's actually an rtl document, you could use a lua filter to transform pandoc's internal document AST into what the LaTeX writer expects, as @jgm mentioned. With the current pandoc nightly build, the filter would:

  1. wrap every piece of text that's not already in a span in a new span with dir=ltr
  2. remove every span with dir=rtl
  3. set the document's metadata to dir: rtl

It's not the most straightforward filter, but it shouldn't be impossible either.
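The three steps can be sketched on a toy model of the inline list. This is Python over hypothetical `Str` and `Span` classes, not pandoc's Lua or Haskell API, and it only handles one flat list of inlines; a real filter would also have to walk nested blocks and decide what to do with the spaces between spans.

```python
from dataclasses import dataclass

@dataclass
class Str:
    text: str

@dataclass
class Span:
    attrs: dict
    content: list

def flip_direction(inlines, meta):
    """Apply the three steps to one flat list of inlines (toy model)."""
    out, pending = [], []

    def flush():
        # Step 1: wrap the collected unmarked elements in one dir=ltr span.
        if pending:
            out.append(Span({"dir": "ltr"}, list(pending)))
            pending.clear()

    for el in inlines:
        if isinstance(el, Span):
            flush()
            if el.attrs.get("dir") == "rtl":
                out.extend(el.content)   # Step 2: drop the rtl span, keep its content.
            else:
                out.append(el)           # Other spans are left untouched.
        else:
            pending.append(el)
    flush()
    # Step 3: record rtl as the document direction in the metadata.
    return out, dict(meta, dir="rtl")
```

Collecting unmarked elements in `pending` before wrapping them is what puts a single span over several consecutive elements instead of one span per word.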

@nima87
Author

nima87 commented Jun 7, 2019

@mb21 I'm not a programmer, but as far as I know neither markdown nor latex has annotations for rtl. In latex, if you load a package called bidi, you can mark text as rtl and have it printed that way in the output.
I think what you suggest is practical, but there are complications such as punctuation. I would need a way to recognize a block of text as a single object. Punctuation marks such as periods, colons, and semicolons always attach to the preceding character and are separated from the next character by a space. Maybe this could be done by supplying a set of ASCII character codes, including the punctuation character codes.
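One way to avoid hand-listing character codes is to lean on Unicode bidirectional classes, which Python exposes through `unicodedata.bidirectional`. The sketch below is only an illustration of the "punctuation sticks to the preceding character" idea; `strong_direction` and `direction_runs` are hypothetical names, and the rule is much simpler than the full Unicode bidirectional algorithm.

```python
import unicodedata
from itertools import groupby

def strong_direction(ch):
    """Return 'ltr' or 'rtl' for strongly directional characters, else None."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "ltr"
    if cls in ("R", "AL"):
        return "rtl"
    return None  # digits, punctuation, spaces, and so on

def direction_runs(text, default="rtl"):
    """Split text into (direction, substring) runs.

    Characters without a strong direction inherit the direction of the
    preceding character; leading ones get the document default.
    """
    labelled, current = [], default
    for ch in text:
        current = strong_direction(ch) or current
        labelled.append((current, ch))
    return [
        (direction, "".join(ch for _, ch in group))
        for direction, group in groupby(labelled, key=lambda pair: pair[0])
    ]
```

With this rule a period or digit after an English word stays in the English run. An opening parenthesis before an English phrase would still attach to the preceding Persian text, so brackets need separate handling.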

@jgm
Owner

jgm commented Jun 8, 2019

@jkr could I get your thoughts on the feature I added in ad9770f ?
I see you earlier added some rtl support for the docx writer.
I want to make sure this is compatible and makes sense.

@jkr
Collaborator

jkr commented Jun 10, 2019

The code itself looks good. I do remember, though, that there were some subtleties that kept me from implementing it (or stalled out my motivation):

#3147

It seems likely that your implementation will take care of the majority. What we want to handle (taking English and Arabic as example languages) is a document

  1. produced by an English-locale Word, all in English (we already do this).
  2. produced by an English-locale Word, with a quote in Arabic (your changes would do this).
  3. produced by an English-locale Word, all in Arabic (I think your changes would do this).
  4. produced by an Arabic-locale Word, all in Arabic (?)
  5. produced by an Arabic-locale Word, with a quote in English (?)
  6. produced by an Arabic-locale Word, all in English (?)

The locale is mainly important here, because of how the default bidi and rtl settings would pop up.

Offhand, I'm not sure if your changes would cover these bases. Unfortunately it's a bit of a hectic week, so I might not get to look at it more closely for a few days. But this sounds like a job for TDD anyway. My brother-in-law is a Hebrew philologist working in the UK, and I think he works on both English and Israeli computers, so I might be able to get the above collection of docs from him, though, if that would help.

@jgm
Owner

jgm commented Jun 10, 2019

Yes, it would be helpful to have some real-world test documents.
What we'd ideally like is to detect the "default" setting of the ltr attribute for the document. Then we could set Just LTR instead of Nothing for the unmarked bits in a document whose default is rtl. But I didn't see anything obvious in the document linked above that says "the default for this is rtl."
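The defaulting rule described here can be written down as a small function. This is a Python sketch with `None` standing in for `Nothing` and a string for `Just LTR`; `run_direction` is a hypothetical name, and how the document default would actually be detected from the docx is left open, as in the comment above.

```python
from typing import Optional

def run_direction(marked_rtl: bool, doc_default: str) -> Optional[str]:
    """Direction to record for one run of text.

    marked_rtl:  whether Word marked the run as rtl.
    doc_default: "ltr" or "rtl"; assumed to be known from somewhere else.
    """
    if marked_rtl:
        return "rtl"
    # Unmarked text is only worth annotating when it differs from the default.
    return "ltr" if doc_default == "rtl" else None
```

In an ltr-default document unmarked runs stay unannotated, while in an rtl-default document the same runs come out explicitly as ltr.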
