(feature) :lang special attribute syntax #895

Closed
ousia opened this issue Jun 28, 2013 · 39 comments
Comments

@ousia
Contributor

ousia commented Jun 28, 2013

Hi John,

this comes from #675.

This is about enabling language notation for document parts.

The notation could be (in the spirit of .class and #identifier):

:lang :es :en :de :grc

I think a new element RawSpan would also be needed to add language notation to some passages (not all of them, but some).

Many thanks for your excellent work,

Pablo

@jgm
Owner

jgm commented Jun 29, 2013

This seems to duplicate #162 in part. I'll close that one and keep the link here.

@ousia
Contributor Author

ousia commented Oct 18, 2014

Well, this issue has been open for almost 15 months.

Although the proposal should be more complete, I want to focus on this minimal part: would it be possible for extended markdown to have the special attribute :lang?

I think that hardcoding the language tag in HTML should be avoided. And with a special attribute, the user has less to type.

Otherwise, a simple example such as:

_Legibility_ is hyphenated differently than _Lesbarkeit_.

currently has to be written in extended markdown as:

_Legibility_ is hyphenated differently than <span lang="de">_Lesbarkeit_</span>.

instead of the simpler proposal:

_Legibility_ is hyphenated differently than _Lesbarkeit_{:de}.

The proposal has the following benefits:

  • The user avoids double tagging (which can feel like a punishment for passages in foreign languages).

  • It would be easier to convert to formats other than XML. (Right now, hyphenation is ignored when converting to textile, LaTeX, ConTeXt, RTF, OpenDocument, and (I guess) DOCX.)

    The issue with TeX implementations is that they have hyphenation enabled by default. Other formats that allow hyphenation disable it by default.

  • Even when converting to HTML, it would be easier to map the language to xml:lang when required.
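To make the proposed shorthand concrete, here is a minimal sketch in Python of a hypothetical preprocessor that expands the `{:xx}` suffix into the `<span lang="xx">` form that already works. The regular expression and the function name are assumptions made for illustration; nothing like this exists in pandoc, and it only handles a single emphasised run directly followed by the attribute.

```python
import re

# Hypothetical pattern: an emphasised run (_..._ or *...*) immediately
# followed by {:xx}, where xx is a short language tag such as de or en-GB.
LANG_ATTR = re.compile(
    r"(?P<body>(?P<mark>[_*])[^_*\n]+(?P=mark))"
    r"\{:(?P<lang>[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)\}"
)


def expand_lang_attr(markdown: str) -> str:
    """Rewrite `_text_{:de}` as `<span lang="de">_text_</span>`."""
    return LANG_ATTR.sub(r'<span lang="\g<lang>">\g<body></span>', markdown)


print(expand_lang_attr(
    "_Legibility_ is hyphenated differently than _Lesbarkeit_{:de}."
))
```

Emphasis without a trailing `{:xx}` is left untouched, so only the tagged passage is wrapped.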

@jgm, what do you think about this?

@lemzwerg

Using <span lang="XX">...</span> is butt ugly. I would really like to not have HTML syntax.

@mpickering
Collaborator

I don't think there is high enough demand for language specific syntax but this is also something which would be easier with syntax for a generic span.

@ousia
Contributor Author

ousia commented Dec 7, 2014

Many thanks for your comment, @mpickering.

I think pandoc first needs to be able to set the document language from a single value of the YAML lang variable (#1614).

@njbart

njbart commented Oct 5, 2015

I don't think there is high enough demand for language specific syntax …

I'm not sure either, but whatever the outcome of the discussion on syntax is, I feel that now that we have the mapping from lang to polyglossia-lang, babel-lang, etc., we should also implement the conversion from <span lang="XX">…</span> and <div lang="XX">…</div> to the appropriate LaTeX commands or environments:

For babel and polyglossia, e.g.:

| markdown | babel | polyglossia |
| --- | --- | --- |
| <span lang="es">…</span> | \foreignlanguage{spanish}{…} | \textspanish{…} |
| <div lang="es">…</div> | \begin{otherlanguage}{spanish}… \end{otherlanguage} | \begin{spanish}… \end{spanish} |

If/when a consensus on a language specific syntax emerges, support for it could be added easily later.
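As a rough illustration of the conversion suggested above, the following Python sketch turns a language tag plus some text into the babel or polyglossia markup from the table. The `LANGS` table, the `LangNames` type and both function names are hypothetical; the `en-GB` row is an assumption based on babel's `british` language and polyglossia's `variant=british` option.

```python
from typing import NamedTuple, Optional


class LangNames(NamedTuple):
    babel: str
    polyglossia: str
    variant: Optional[str] = None


# Illustrative subset only; a real mapping would cover many more tags.
LANGS = {
    "es": LangNames("spanish", "spanish"),
    "en-GB": LangNames("british", "english", "british"),
}


def span_to_latex(lang: str, body: str, package: str) -> str:
    """Inline text: \\foreignlanguage (babel) or \\textLANG (polyglossia)."""
    names = LANGS[lang]
    if package == "babel":
        return "\\foreignlanguage{%s}{%s}" % (names.babel, body)
    opts = "[variant=%s]" % names.variant if names.variant else ""
    return "\\text%s%s{%s}" % (names.polyglossia, opts, body)


def div_to_latex(lang: str, body: str, package: str) -> str:
    """Block text: otherlanguage (babel) or a LANG environment (polyglossia)."""
    names = LANGS[lang]
    if package == "babel":
        return "\\begin{otherlanguage}{%s}\n%s\n\\end{otherlanguage}" % (
            names.babel, body)
    opts = "[variant=%s]" % names.variant if names.variant else ""
    return "\\begin{%s}%s\n%s\n\\end{%s}" % (
        names.polyglossia, opts, body, names.polyglossia)


print(span_to_latex("es", "hola", "babel"))
print(div_to_latex("es", "hola", "polyglossia"))
```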

@mb21
Collaborator

mb21 commented Oct 6, 2015

@nickbart1980 Are the mappings from <span lang="XX"> to \textYYY{…} the same as from lang to babel-lang? What about XeTeX, LuaTeX and ConTeXt?

@njbart

njbart commented Oct 6, 2015

Are the mappings from <span lang="XX"> to \textYYY{…} the same as from lang to babel-lang?

<span lang="XX"> to \textYYY{…} is for polyglossia, but: yes, AFAIK all the mappings for in-text commands and environments match those currently in LaTeX.hs, both for babel and polyglossia – with one exception for the latter:

“… for Arabic one cannot use the environment arabic, as \arabic is defined internally by LaTeX. In this case we need to use the environment Arabic.” (polyglossia manual)

I edited the table above to include the babel syntax; I have no idea about LuaTeX and ConTeXt, though.

EDIT: It seems we might be able to use the babel syntax for polyglossia, too:

“Some macros defined in babel’s hyphen.cfg (and thus usually compiled into the XeLaTeX and LuaLaTeX format) are redefined, but keep a similar behaviour, namely \selectlanguage, \foreignlanguage, and the environment otherlanguage.” (polyglossia manual)

I haven’t tested this, however, and I’m not sure how well this works for language variants etc.

EDIT 2: As I suspected: With polyglossia/xelatex, \foreignlanguage{english}{\today} and \foreignlanguage{german}{\today} work; \foreignlanguage{british}{\today} and \foreignlanguage{austrian}{\today} don’t (no output at all); see report: reutenauer/polyglossia#112.

@njbart

njbart commented Oct 6, 2015

… or we could define our own commands using BCP47 tags, e.g. \IETFlang{en-GB}{Blah.}.

These could be defined in the latex template, using xparse, along these lines (this is an example for polyglossia only):

% Requires xparse (which loads expl3); \str_case:nnF replaces the
% deprecated \str_case:nnn.
\ExplSyntaxOn
\NewDocumentCommand{\IETFlang}{ m m }
 {
  % #1 = BCP 47 tag, #2 = text to typeset in that language
  \str_case:nnF { #1 }
   {
    { ar    } { \textarabic{#2} }
    { de-DE } { \textgerman{#2} }
    { de-AT } { \textgerman[variant=austrian]{#2} }
    { en-US } { \textenglish{#2} }
    { en-GB } { \textenglish[variant=british]{#2} }
    { fr-FR } { \textfrench{#2} }
        % others
   }
   {
    #2 % I~don't~know~what~to~do~with~`#1'
   }
 }
\ExplSyntaxOff

@ghost

ghost commented Oct 6, 2015

If I recall correctly, in ConTeXt you use

\language[es]
¡hola!
\language[en]

@mb21
Collaborator

mb21 commented Oct 6, 2015

It would be better to not pollute the templates with too much redefining and we already have the BCP47-to-polyglossia/babel functions in the LaTeX writer. But the problem with the other LaTeX engines is bothersome. Maybe we'll just have to decide that when people start writing <span lang="es">... in their markdown documents they'll have to go with XeLaTeX and polyglossia, since that's what everyone using multiple languages etc. does anyhow.

@njbart

njbart commented Oct 6, 2015

It would be better to not pollute the templates …

I agree. But what else can we do if we want pandoc to create engine-agnostic LaTeX documents?

Maybe we'll just have to decide that when people start writing <span lang="es">... in their markdown documents they'll have to go with XeLaTeX and polyglossia …

Fine with me. And if I understand it correctly, LuaLaTeX works with polyglossia, too. So we’d just map from <span lang="es">... to the polyglossia syntax.

@mb21
Collaborator

mb21 commented Oct 8, 2015

I tend to favour the approach of just outputting the polyglossia commands. The only major downside I can think of is that if someone runs pandoc -f html -o out.pdf and there are lang attributes on spans/divs in that HTML, then the PDF generation will fail (rather unexpectedly), since the default latex-engine is pdflatex. Was there some discussion on changing the default PDF engine? Maybe it's time to switch?

@mb21
Collaborator

mb21 commented Oct 10, 2015

Thoughts anyone?

Also, should the otherlangs variable be automatically populated with the values of all lang attributes in all spans and divs, or is it better to leave it as it is and have authors specify it manually in the YAML metadata? The first option means less manual work, the second potentially more control (although I guess we could still allow the variable to be overridden by the YAML).

@jgm
Owner

jgm commented Oct 10, 2015

+++ mb21 [Oct 10 15 03:55 ]:

Thoughts anyone?

I'm torn. I like the idea of emitting the commands
directly, rather than defining new commands in the template.
But that's only going to work if we are willing to have it
break for people not using xelatex. Since we have language
support with babel for non-xelatex, I'm reluctant to do
that.

So that inclines me towards the idea of defining new
commands in the preamble. They could be defined using the
\ifxetex command, so that they had different definitions
for babel and polyglossia.

Arguments against this: (a) it clutters up the
templates, (b) it clutters up the preamble of generated
documents, (c) it makes fragments less self-contained.

I think we can eliminate concern (a) by generating the
macro definitions inside the LaTeX writer, and just having
a single variable in the template that gets filled with
these macro definitions. If we do that, we can also reduce
concern (b) by generating only the definitions we need,
given the languages actually used in the document. So,
this is how I'm leaning currently.
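A minimal sketch of that idea, in Python rather than the Haskell of the actual LaTeX writer: build the macro definitions only for the languages a document uses and hand them to the template as one string. The macro shape, the function name and the name table are assumptions for illustration, and only the babel side is shown.

```python
def babel_shims(used, polyglossia_to_babel):
    """Return \\newcommand lines for the languages in `used` only."""
    lines = []
    for name in used:
        babel = polyglossia_to_babel[name]
        # One polyglossia-style inline command per language, implemented
        # on top of babel's \foreignlanguage.
        lines.append(
            "\\newcommand{\\text%s}[2][]{\\foreignlanguage{%s}{#2}}"
            % (name, babel)
        )
    return "\n".join(lines)


# Hypothetical name table: polyglossia name -> babel name.
NAMES = {"french": "french", "german": "ngerman"}

# Only German occurs in this imaginary document, so only one macro is emitted.
print(babel_shims(["german"], NAMES))
```

The returned string would fill the single template variable, so a document that uses no extra languages gets an empty preamble addition.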

Also, should the otherlangs variable be automatically populated with
the values of all lang attributes in all spans and divs, or is it
better to leave it as it is and have authors specify it manually in the
YAML metadata? The first option means less manual work, the second
potentially more control (although I guess we could still allow the
variable to be overridden by the YAML).

Maybe we could use the following approach: set a default
value for otherlangs based on the tags in the document, but
let it be overridden by an explicit value in metadata.
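The default-with-override rule could look roughly like this Python sketch. The function name and the plain-dict metadata are hypothetical stand-ins for whatever the writer actually uses.

```python
def resolve_otherlangs(metadata, document_langs):
    """Explicit metadata wins; otherwise derive the list from the document."""
    if "otherlangs" in metadata:
        return list(metadata["otherlangs"])
    main = metadata.get("lang")
    seen = []
    for lang in document_langs:
        # Skip the main language and duplicates, keeping first-seen order.
        if lang != main and lang not in seen:
            seen.append(lang)
    return seen


print(resolve_otherlangs({"lang": "en"}, ["de", "en", "de", "es"]))
print(resolve_otherlangs({"lang": "en", "otherlangs": ["fr"]}, ["de"]))
```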

@njbart

njbart commented Oct 11, 2015

@jgm: Sounds all good to me.

@mb21
Collaborator

mb21 commented Oct 12, 2015

Since we have language support with babel for non-xelatex.

Well, we have support for the main language with babel, but not for multilingual documents. But yeah... if we emit/define commands to map between babel and polyglossia, the question is which set to emit and which to define. It seems cleaner to emit polyglossia commands and introduce some hacks for babel (which users who are serious about languages won't use anyway). On the other hand, the polyglossia manual seems to imply that they aim to support babel's commands, but this doesn't work right now?

@nickbart1980 do you know which way has the easier/simpler LaTeX definitions? \begin{otherlanguage}{british} <-> \begin{english}[variant=british]

@njbart

njbart commented Oct 12, 2015

| markdown | babel | polyglossia |
| --- | --- | --- |
| <span lang="en-GB">…</span> | \foreignlanguage{british}{…} | \textenglish[variant=british]{…} |
| <div lang="en-GB">…</div> | \begin{otherlanguage}{british}… \end{otherlanguage} | \begin{english}[variant=british]… \end{english} |

I’m not sure how to define polyglossia wrappers for the babel commands though.

What would work, relatively easily and transparently, is defining our own commands, e.g. \textbcpfourseven{en-GB}{…}; see my ideas using xparse above.

The latex writer could put all required definitions into one pandoc variable, so they won’t clutter up the template (though they will of course appear in the document itself).

@mb21
Collaborator

mb21 commented Oct 14, 2015

I've started on this... @jgm, is there a way to extract all lang attributes from both divs and spans without resorting to query extractDivLang blocks ++ query extractSpanLang blocks? The problem is that there is no supertype a for both Block and Inline that would let me write extractLang :: a -> [String].

@jgm
Owner

jgm commented Oct 14, 2015

It's annoying. If I'd just made Attr a newtype, we could
write a function Attr -> [String]. This should probably
be changed next time we mess with pandoc-types, but as you
know that's a painful thing to do.

So for now I think you need to do two queries. But it
would probably be more efficient to do the inline queries
inside the block query, rather than doing them separately
and appending. I'm thinking something like:

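extractLang :: Block -> [String]
-- a Div contributes its own lang values; nested blocks are left to query
extractLang (Div (_,_,kvs) _) = [v | ("lang", v) <- kvs]
extractLang (Para ils)        = extractLangInlines ils
extractLang _                 = []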
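For comparison, the same collection step can be sketched in Python against pandoc's JSON representation of a document, where the lack of a common supertype does not matter because one generic walk visits blocks and inlines alike. The function name is hypothetical; the only assumption about the JSON is that `Div` and `Span` nodes carry `[identifier, classes, key-value pairs]` as the first element of their `c` field.

```python
def collect_langs(node):
    """Collect every lang attribute on Div and Span nodes, in document order."""
    langs = []
    if isinstance(node, dict):
        if node.get("t") in ("Div", "Span"):
            _ident, _classes, keyvals = node["c"][0]
            langs += [value for key, value in keyvals if key == "lang"]
        # Recurse into all fields so nested blocks and inlines are visited.
        for child in node.values():
            langs += collect_langs(child)
    elif isinstance(node, list):
        for child in node:
            langs += collect_langs(child)
    return langs


doc = {"blocks": [
    {"t": "Div", "c": [["", [], [["lang", "es"]]], [
        {"t": "Para", "c": [
            {"t": "Str", "c": "hola"},
            {"t": "Span", "c": [["", [], [["lang", "de"]]],
                                [{"t": "Str", "c": "Lesbarkeit"}]]},
        ]},
    ]]},
]}

print(collect_langs(doc))
```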


mb21 added a commit to mb21/pandoc that referenced this issue Oct 15, 2015
Also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
@mb21
Collaborator

mb21 commented Oct 15, 2015

Thanks jgm, makes sense...

About the mapping... I implemented outputting the polyglossia commands, now trying to come up with LaTeX mappings from polyglossia to babel. I have it working for most languages with e.g.:

\newcommand{\textspanish}[2][]{\foreignlanguage{spanish}{#2}}
\newenvironment{spanish}[1]{\begin{otherlanguage}{spanish}}{\end{otherlanguage}}

until I realized that for Spanish, babel itself already defines a \textspanish command, and for some reason using \renewcommand goes into infinite recursion (TeX capacity exceeded, sorry [grouping levels=255].) If someone with LaTeX skills can get this to work I'd appreciate that. Otherwise I guess we'll have to go with @nickbart1980's proposal of rolling our own textBCP47{en-GB} command and defining it for both babel and polyglossia users...

mb21 added a commit to mb21/pandoc that referenced this issue Oct 15, 2015
For LaTeX, also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
@njbart

njbart commented Oct 15, 2015

FWIW, a 2009 version of the babel manual, http://www.pvv.ntnu.no/~berland/latex/docs/babel.pdf contains references to \textspanish (and \textgalician, too). I didn’t spot any other \textLANG commands in that document so far. The most recent babel manual (Version 3.9m 2015/08/03) seems to contain no such references.

@mb21
Collaborator

mb21 commented Oct 15, 2015

Good tip, I seem to have babel 2014/03/24 3.9k The Babel package installed (from MacTex 2014), gonna give MacTex 2015 a try :)

@njbart

njbart commented Oct 15, 2015

I’m afraid \textspanish and \textgalician are still in texlive 2015 (just not in the babel manual): in
/usr/local/texlive/2015/texmf-dist/tex/generic/babel-spanish/spanish.ldf and
/usr/local/texlive/2015/texmf-dist/tex/generic/babel-galician/galician.ldf

@njbart

njbart commented Oct 17, 2015

Maybe you could use \textSpanish and \textGalician, then just two additional \newcommands are needed when using polyglossia.

@mb21
Collaborator

mb21 commented Oct 17, 2015

@nickbart1980 yeah, although I still think it should be possible to renewcommand textspanish etc... Can you reproduce the error with pdflatex? which version/distribution? (see example doc: I asked at tex.stackexchange and was told it works for him).

Edit: never mind, got a great answer over there that solves the issue.

mb21 added a commit to mb21/pandoc that referenced this issue Oct 17, 2015
For LaTeX, also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
mb21 added a commit to mb21/pandoc that referenced this issue Oct 17, 2015
For LaTeX, also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
mb21 added a commit to mb21/pandoc that referenced this issue Oct 18, 2015
For LaTeX, also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
mb21 added a commit to mb21/pandoc that referenced this issue Oct 18, 2015
For LaTeX, also collect lang and dir attributes on spans and divs to set the lang,
otherlangs and dir variables if they aren’t set already. See jgm#895.
@njbart

njbart commented Oct 25, 2015

pandoc now emits \textArabic – this is wrong: only the environment name should be capitalised (\begin{Arabic} … \end{Arabic}), but not the command, which has to be \textarabic{…}.

@mb21
Collaborator

mb21 commented Oct 25, 2015

@nickbart1980 thanks for the correction, it is in pull request #2481

@njbart

njbart commented Oct 25, 2015

For non-Latin scripts, it seems we need to add sensible font defaults and methods for overriding them.

I’ve just been looking at xelatex/polyglossia so far: The default font used by xelatex (some form of Computer Modern, it seems), e.g., supports neither Greek nor Arabic - example:

pandoc --latex-engine=xelatex -o test.pdf << EOT

العَرَبِية

---
lang: ar
...
EOT

Result:

! Package polyglossia Error: The current roman font does not contain the Arabic script!
(polyglossia)                Please define \arabicfont with \newfontfamily.

See the polyglossia package documentation for explanation.
Type  H <return>  for immediate help.
...  

l.69 \begin{document}

pandoc: Error producing PDF from TeX source

Adding, e.g.,

\newfontfamily\arabicfont[Script=Arabic]{Amiri}

to default.latex fixes this.

Interestingly enough, the scrartcl class also needs \newfontfamily\arabicfontsf[Script=Arabic]{Amiri} when the input document contains headers.

My suggestion is to solve this issue by adding sensible defaults and also introducing pandoc variables like arabicfont or greekfont.

For Arabic, which does not have separate sf and tt variants, we might use something like the following in default.latex:

$if(arabicfont)$
\newfontfamily\arabicfont[Script=Arabic]{$arabicfont$}
\newfontfamily\arabicfontsf[Script=Arabic]{$arabicfont$}
\newfontfamily\arabicfonttt[Script=Arabic]{$arabicfont$}
$else$
\newfontfamily\arabicfont[Script=Arabic]{Amiri}
\newfontfamily\arabicfontsf[Script=Arabic]{Amiri}
\newfontfamily\arabicfonttt[Script=Arabic]{Amiri}
$endif$

For other scripts we will probably need separate variables, e.g., greekfont, greekfontsf and greekfonttt.

We could also try to use, by default, fonts that support more scripts; Times New Roman, e.g., seems to contain both Greek and Arabic (though here, again, an extra definition \newfontfamily\arabicfontsf[Script=Arabic]{Times New Roman} was needed with scrartcl).

I’m not sure why scrartcl seems to require an extra arabicfontsf definition, or how to simplify this in general. Any hints are appreciated.

@lemzwerg

scrartcl needs a sans-serif font for headings. Your suggestion for how to set up $arabicfont$ in your previous comment looks fine to me.

BTW, my solution to the script problem in XeTeX (I'm still using pandoc 1.12.3.3) is to use the ucharclasses package, which automatically switches the script depending on the Unicode block – no need to explicitly specify a script! For example, I have this in my LaTeX template file:

\ifxetex
  \usepackage{ucharclasses}

  \newfontfamily{\arabicfont}[Script=Arabic]{Amiri}
  \newfontfamily{\devanagarifont}[Script=Devanagari]{FreeSerif}
  \newfontfamily{\laofont}[Script=Lao]{NotoSerifLao}
  \newfontfamily{\telugufont}[Script=Telugu]{Pothana2000}
  \newfontfamily{\thaifont}[Script=Thai]{FreeSerif}

  \setTransitionTo{Arabic}{\begingroup\arabicfont}
  \setTransitionFrom{Arabic}{\endgroup}
  \setTransitionTo{Devanagari}{\begingroup\devanagarifont}
  \setTransitionFrom{Devanagari}{\endgroup}
  \setTransitionTo{Lao}{\begingroup\laofont}
  \setTransitionFrom{Lao}{\endgroup}
  \setTransitionTo{Telugu}{\begingroup\telugufont}
  \setTransitionFrom{Telugu}{\endgroup}
  \setTransitionTo{Thai}{\begingroup\thaifont}
  \setTransitionFrom{Thai}{\endgroup}
\fi

I'm mentioning this just for reference, since a solution that understands and uses language tags is preferable for many reasons. (Note that I don't have non-Latin scripts in the headers, so there are no commands for setting up sans-serif fonts.)

If you want to see this in action, have a look at the ttfautohint package (the PDF and the constructed pandoc input files are part of the release tarball only).

@mb21
Collaborator

mb21 commented Oct 26, 2015

For monolingual documents, setting the mainfont to a font that supports all the characters in the doc should solve the issue, right? And everyone who's serious about typesetting bilingual documents will need to manually select some fonts that work together anyway and can include the necessary \newfontfamily definitions in the header-includes variable, like:

---
lang: ar
mainfont: ArialUnicodeMS
header-includes: "\\newfontfamily\\arabicfont[Script=Arabic]{Amiri}"
---

العَرَبِية

This may not be ideal (maybe we should mention it in the README). But if we were to include something like \newfontfamily\arabicfont[Script=Arabic]{$arabicfont$} for every variation of script/language/bold/italics/bold-italics/typewriter etc.—that would be well over a hundred lines polluting the template. And we cannot generate those since we wouldn't know what $arabicfont$ should be (I think template variable substitution happens only once). We could generate \newfontfamily definitions for the languages used and set it to some default font, but the user couldn't change that then.

For ConTeXt it seems we could specify fallback fonts for certain unicode ranges. Unfortunately, while ucharclasses provides the same functionality for Polyglossia, it seems the \newfontfamily definitions are still necessary?

So, I'm not sure there's much we can do (except mentioning this in the README, and maybe set up something for ConTeXt), although recommendations are welcome.

@njbart

njbart commented Oct 26, 2015

And everyone who's serious about typesetting bilingual documents will need to manually select some fonts that work together anyway …

I’d agree as far as serious work is concerned, but I’m more than a little worried if casual users trying to use a non-Latin script do not get any output but just an error message. In other words, I’d expect pandoc to generate some at least halfway decent output even if a user does not actively specify any non-Latin fonts at all.

Couldn’t we introduce at least one polyglossia-fonts variable to which the latex writer adds sensible default font definitions for non-Latin scripts, based on the language tags found in the document?

So, whenever a document contains the language tag ar,

\newfontfamily\arabicfont[Script=Arabic]{Amiri}
\newfontfamily\arabicfontsf[Script=Arabic]{Amiri}
\newfontfamily\arabicfonttt[Script=Arabic]{Amiri}

would be added to polyglossia-fonts; and $if(polyglossia-fonts)$$polyglossia-fonts$$endif$ would of course be included in default.latex.

These definitions could then still be overridden by, e.g.,

---
header-includes: "\\newfontfamily\\arabicfont[Script=Arabic]{Scheherazade}\\newfontfamily\\arabicfontsf[Script=Arabic]{Scheherazade}"
...
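A sketch of how such a polyglossia-fonts value might be assembled, written in Python for illustration only. The `DEFAULT_FONTS` table and the function name are hypothetical; the single Arabic entry reuses the Amiri example above, and any further defaults would still have to be chosen.

```python
# Hypothetical defaults: language tag -> (polyglossia name, script, font).
DEFAULT_FONTS = {
    "ar": ("arabic", "Arabic", "Amiri"),
}


def polyglossia_fonts(langs):
    """Emit \\newfontfamily lines for every tagged language with a default."""
    lines = []
    for lang in langs:
        if lang not in DEFAULT_FONTS:
            continue  # e.g. Latin-script languages need no extra font
        name, script, font = DEFAULT_FONTS[lang]
        # Roman, sans-serif and typewriter variants all get the same font.
        for suffix in ("", "sf", "tt"):
            lines.append(
                "\\newfontfamily\\%sfont%s[Script=%s]{%s}"
                % (name, suffix, script, font)
            )
    return "\n".join(lines)


print(polyglossia_fonts(["en", "ar"]))
```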

@mb21
Collaborator

mb21 commented Oct 26, 2015

I’d expect pandoc to generate some at least halfway decent output even if a user does not actively specify any non-Latin fonts at all.

Indeed that would be nice, but it is currently not the case either: e.g. Arabic characters require you to specify --latex-engine=xelatex (otherwise you get ! Package inputenc Error: Unicode char \u8:ا not set up for use with LaTeX.), and even then the PDF is currently empty for me since the default font doesn't have the required characters. Personally, I think sooner or later we should change the default PDF engine to either XeLaTeX, ConTeXt or even plain TeX anyway...

Couldn’t we introduce at least one polyglossia-fonts variable to which the latex writer adds sensible default font definitions for non-Latin scripts, based on the language tags found in the document? [...]
These definitions could then still be overridden by, e.g., header-includes: "\\newfontfamily...

I see, indeed if they can be overridden (and they can, just tested) then I'm in favour as well. So I suppose now we need a good list of fonts widely available for lots of languages...

@adunning
Contributor

@nickbart1980, the default font in XeTeX and LuaTeX (Latin Modern) is set by fontspec. Perhaps an issue for adding defaults for other languages should be added to its repository, to see whether a broader solution can be found?

@ousia
Contributor Author

ousia commented Nov 15, 2015

Could we also discuss the syntax for the language attribute?

:lang is fine for me. It is borrowed from Textile, but it is fine, since it is the CSS selector for pseudo-classes (and languages in CSS are a pseudo-class).

If we could agree on this, the language special attribute could already be implemented in the elements that allow it. These are mainly titles and code.

At least for code, when writing technical documents in languages other than English, it is extremely useful to be able to tag inline code as being written in English. Otherwise hyphenation for that part will very probably be wrong.

And when issue #168 is solved, we would benefit a lot from the special syntax for language attributes in text divisions and spans.

So, could we reach an agreement about the syntax for the language attribute?

@ousia
Contributor Author

ousia commented Feb 17, 2016

@jgm, after issue #168 is fixed, could we discuss this issue?

This issue is older than #168. It comes from #162, which I originally reported at https://code.google.com/archive/p/pandoc/issues/201 (more than six years ago).

I think that :lang is the natural syntax for languages, since it is the CSS selector. It would be consistent with the syntax for #id and .class.

And a comment on the issue: it is about the syntax (or at least, that was my original report). When I reported the original issue at Google Code, you discarded it because it looked like recreating LaTeX in pandoc (see comment 7 there).

Well, time flies. And the vast majority of comments in this issue explain how LaTeX (its babel and polyglossia packages) deals with languages.

I think we have to set a special language syntax first.

@ousia ousia mentioned this issue Feb 18, 2016
@ousia
Contributor Author

ousia commented Feb 28, 2016

I have tried to discuss it at the mailing list, but I guess this should be the proper place to discuss it (since I got no reply there).

Special language syntax is needed to have different document sections in different languages, such as in:

# The US Constitution {:en}

[English text]

# Das deutsche Grundgesetz {:de}

[deutscher Text]

I think it is clear we need that special syntax for the language attribute. It is essential for multilingual documents.

@ickc
Contributor

ickc commented Apr 21, 2016

I want to echo the use of ucharclasses that @lemzwerg mentioned. Basically, what I was trying to do is modify the default pandoc LaTeX template, together with some new YAML variables, to provide a general method to set up ucharclasses in the YAML front matter. But it doesn't work so far, and would probably require some change in the pandoc program itself:

Example Preamble

For example, I want this in the preamble:

\usepackage[Latin, Greek, Hebrew]{ucharclasses}
\usepackage{xltxtra,xunicode}
\usepackage{unicode-math}

\newcommand{\latinfont}{\renewcommand\rmdefault{lmr}\renewcommand\sfdefault{lmss}\renewcommand\ttdefault{lmtt}\defaultfontfeatures[\rmfamily,\sffamily]{Ligatures=TeX}}
\setTransitionsForLatin{\latinfont}{}

\newfontfamily\greekfont{Cardo} % Download at http://scholarsfonts.net/cardofnt.html
\setTransitionsForGreek{\greekfont}{}
\newfontfamily\hebrewfont{Cardo} % same as above
\setTransitionsFor{Hebrew}{\hebrewfont\setRTL}{\setLTR}

Modifying Pandoc Template for LaTeX

I modified the default latex template starting from line 85, in the section relevant to lang, like this (basically inserted $if(ucharclasses)$):

\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[shorthands=off,$for(babel-otherlangs)$$babel-otherlangs$,$endfor$main=$babel-lang$]{babel}
$if(babel-newcommands)$
  $babel-newcommands$
$endif$
\else
$if(ucharclasses)$
  \usepackage[Latin,$for(polyglossia-otherlangs)$$polyglossia-otherlangs.name$$sep$,$endfor$]{ucharclasses}
  \usepackage{xltxtra,xunicode}
  \usepackage{unicode-math}
    \newcommand{\latinfont}{\renewcommand\rmdefault{lmr}\renewcommand\sfdefault{lmss}\renewcommand\ttdefault{lmtt}\defaultfontfeatures[\rmfamily,\sffamily]{Ligatures=TeX}}
  \setTransitionsForLatin{\latinfont}{}
$for(polyglossia-otherlangs)$
  \newfontfamily\$polyglossia-otherlangs.name$font{$$polyglossia-otherlangs.name$font$}
  \setTransitionsFor{$polyglossia-otherlangs.name$}{\$polyglossia-otherlangs.name$font}{}
$endfor$
$else$
  \usepackage{polyglossia}
  \setmainlanguage[$polyglossia-lang.options$]{$polyglossia-lang.name$}
$for(polyglossia-otherlangs)$
  \setotherlanguage[$polyglossia-otherlangs.options$]{$polyglossia-otherlangs.name$}
$endfor$
$endif$
\fi

YAML Front Matter

I then put the following in the front matter of the pandoc file:

lang:   en
otherlangs: [el,he]
ucharclasses:   true
greekfont:  Cardo
hebrewfont: Cardo

Generated TeX

The resulting generated TeX file is:

\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[shorthands=off,greek,hebrew,main=english]{babel}
\else
  \usepackage[Latin,greek,hebrew]{ucharclasses}
  \usepackage{xltxtra,xunicode}
  \usepackage{unicode-math}
    \newcommand{\latinfont}{\renewcommand\rmdefault{lmr}\renewcommand\sfdefault{lmss}\renewcommand\ttdefault{lmtt}\defaultfontfeatures[\rmfamily,\sffamily]{Ligatures=TeX}}
  \setTransitionsForLatin{\latinfont}{}
  \newfontfamily\greekfont{$polyglossia-otherlangs.name}
  \setTransitionsFor{greek}{\greekfont}{}
  \newfontfamily\hebrewfont{$polyglossia-otherlangs.name}
  \setTransitionsFor{hebrew}{\hebrewfont}{}
\fi

Problems

"Nested" Pandoc Variables?

Comparing [Example Preamble] to [Generated TeX], the \newfontfamily lines don't work.

What I wanted to do is use $polyglossia-otherlangs.name$ to get the name of the language first, e.g. greek, then append font to it and look up a variable named $greekfont$. I tried to do it with the following code: \newfontfamily\$polyglossia-otherlangs.name$font{$$polyglossia-otherlangs.name$font$}, but nested variables don't work because $$ means a literal $.

Any idea how to use nested variables?
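One possible workaround, sketched here in Python purely as an illustration, is to resolve the font names before the template sees them, so that each entry of the language list carries its own font and no nested lookup is needed. The function name and the extra `font` field are hypothetical and are not part of pandoc's current variables.

```python
def attach_fonts(otherlangs, metadata):
    """Pair each language name with the value of its `<name>font` variable."""
    result = []
    for name in otherlangs:
        entry = {"name": name}
        font = metadata.get(name + "font")
        if font is not None:
            entry["font"] = font  # hypothetical extra field for the template
        result.append(entry)
    return result


meta = {"greekfont": "Cardo", "hebrewfont": "Cardo"}
print(attach_fonts(["greek", "hebrew"], meta))
```

With a list like this, the loop in the template could read the font from the current entry instead of composing a variable name.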

polyglossia-otherlangs.name

I'm using polyglossia-otherlangs.name, but I'm not sure how many of those names are the same as the ones used by ucharclasses, so a ucharclasses-otherlangs.name is probably needed. E.g. for the greek and hebrew used here, ucharclasses is case-sensitive and requires Greek and Hebrew with the capitalization.

RTL/LTR

Maybe something similar to the $polyglossia-lang.options$ and $dir$ variables is needed, so that if the ucharclasses-otherlangs.name is RTL, \setRTL and \setLTR are inserted:

\setTransitionsFor{Hebrew}{\hebrewfont\setRTL}{\setLTR}
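A small Python sketch of how such a line could be generated once the ucharclasses block name, the font macro and the text direction are known. The function and its parameters are hypothetical; how to derive the block name and direction from a language tag is left open here.

```python
def transition_line(block, font_macro, rtl=False):
    """Build one \\setTransitionsFor line, adding direction switches for RTL."""
    if rtl:
        return "\\setTransitionsFor{%s}{%s\\setRTL}{\\setLTR}" % (
            block, font_macro)
    return "\\setTransitionsFor{%s}{%s}{}" % (block, font_macro)


print(transition_line("Hebrew", "\\hebrewfont", rtl=True))
```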

ucharclasses Issue

If one wants to implement ucharclasses, beware of issue Pomax/ucharclasses#7 and do not follow the example in the official documentation of ucharclasses. I'm not sure what the best practice is, but mine above in [Example Preamble] works fine.

Edit: the issue has been fixed by the newest update in ucharclasses v2.1. The documented official suggestion is good now.

@jgm
Owner

jgm commented Mar 9, 2017

Obsoleted by #3451
