YAML lang attribute not working in ODT #1667

Closed
ousia opened this issue Oct 5, 2014 · 26 comments

@ousia
Contributor

ousia commented Oct 5, 2014

From #1614, I think that the lang attribute should set the document language in (at least) ConTeXt, .docx and .odt.

Sample file:


---
title: Titel
lang: de
...

# Kapitel

Mein Text

At least with version 1.12.3.3, the language attribute isn’t set for ConTeXt, ODF or OOXML documents.

I think this should be improved.

Many thanks for your excellent work.

@ousia ousia changed the title YAML lang attribute in ConTeXt, docx and ODF YAML lang attribute not working in ConTeXt, docx and ODF Oct 5, 2014
@mpickering mpickering added the bug label Dec 5, 2014
@ousia
Contributor Author

ousia commented Jan 8, 2015

@jgm, @mpickering, just in case it might help, I attach a list that contains the equivalences between ISO-639 language codes used in HTML and ConTeXt language codes (they differ from LaTeX ones).

af -> af
af-ZA -> af
ar -> ar
ar-DZ -> ar-dz
ar-IQ -> ar-iq
ar-JO -> ar-jo
ar-LB -> ar-lb
ar-MA -> ar-ma
ar-SY -> ar-sy
bg -> bg
bg-BG -> bg
ca -> ca
ca-ES -> ca
cy -> cy
cy-UK -> cy
cz -> cz
cz-CZ -> cz
da -> da
da-DK -> da
de -> de
de-1901 -> deo
de-AT -> de-at
de-CH -> de-ch
de-DE -> de-de
el -> gr
en -> en
en-UK -> uk
en-US -> en
es -> es
es-ES -> es
et -> et
et-EE -> et
eu -> eu
eu-ES -> eu
fi -> fi
fi-FI -> fi
fr -> fr
fr-CA -> fr
fr-FR -> fr
grc -> agr
he -> il
he-IL -> il
hr -> hr
hr-HR -> hr
hu -> hu
hu-HU -> hu
is -> is
is-IS -> is
it -> it
it-IT -> it
jp -> ja
jp-JP -> ja
la -> la
nb -> nb
nb-NO -> nb
nl -> nl
nl-NL -> nl
nn -> nn
nn-NO -> nn
no -> no
no-NO -> no
pl -> pl
pl-PL -> pl
pt -> pt
pt-BR -> pt
pt-PT -> pt
ro -> ro
ro-RO -> ro
ru -> ru
ru-RU -> ru
sk -> sk
sk-SK -> sk
sl -> sl
sl-SL -> sl
sv -> sv
sv-SE -> sv
tr -> tr
tr-TR -> tr
uk -> ua
uk-UA -> ua
vi -> vn
vi-VN -> vn
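
A lookup over this list could be sketched as follows. This is only an illustration, not pandoc's code: the name `toContextLang` is hypothetical, only a few rows are copied from the list above, and the fallback order (full tag, then primary subtag, then the tag unchanged) is an assumption.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)

-- A few rows copied from the list above: (HTML language tag, ConTeXt code).
contextCodes :: [(String, String)]
contextCodes =
  [ ("de", "de"), ("de-1901", "deo"), ("de-AT", "de-at"), ("de-CH", "de-ch")
  , ("el", "gr"), ("en", "en"), ("en-US", "en"), ("grc", "agr")
  , ("he", "il"), ("uk", "ua"), ("vi", "vn")
  ]

-- Hypothetical helper: try the full tag first, then its primary subtag,
-- and otherwise pass the tag through unchanged.
toContextLang :: String -> String
toContextLang tag =
  fromMaybe tag (lookup tag contextCodes <|> lookup primary contextCodes)
  where
    primary = takeWhile (/= '-') tag

main :: IO ()
main = mapM_ (putStrLn . toContextLang) ["de-AT", "el", "uk-UA", "fr"]
```

Falling back to the primary subtag keeps the table short, since most regional variants in the list map to the same ConTeXt code as their base language.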

@ousia
Contributor Author

ousia commented Jan 8, 2015

A workaround for different language names in ConTeXt is to define language synonyms with their XML variants (I added only the required ones):

\installlanguage [af-ZA] [af]
\installlanguage [ar-DZ] [ar-dz]
\installlanguage [ar-IQ] [ar-iq]
\installlanguage [ar-JO] [ar-jo]
\installlanguage [ar-LB] [ar-lb]
\installlanguage [ar-MA] [ar-ma]
\installlanguage [ar-SY] [ar-sy]
\installlanguage [bg-BG] [bg]
\installlanguage [ca-ES] [ca]
\installlanguage [cy-UK] [cy]
\installlanguage [cz-CZ] [cz]
\installlanguage [da-DK] [da]
\installlanguage [de-1901] [deo]
\installlanguage [de-AT] [de-at]
\installlanguage [de-CH] [de-ch]
\installlanguage [de-DE] [de-de]
\installlanguage [el] [gr]
\installlanguage [en-UK] [uk]
\installlanguage [en-US] [en]
\installlanguage [es-ES] [es]
\installlanguage [et-EE] [et]
\installlanguage [eu-ES] [eu]
\installlanguage [fi-FI] [fi]
\installlanguage [fr-CA] [fr]
\installlanguage [fr-FR] [fr]
\installlanguage [grc] [agr]
\installlanguage [he] [il]
\installlanguage [he-IL] [il]
\installlanguage [hr-HR] [hr]
\installlanguage [hu-HU] [hu]
\installlanguage [is-IS] [is]
\installlanguage [it-IT] [it]
\installlanguage [jp] [ja]
\installlanguage [jp-JP] [ja]
\installlanguage [nb-NO] [nb]
\installlanguage [nl-NL] [nl]
\installlanguage [nn-NO] [nn]
\installlanguage [no-NO] [no]
\installlanguage [pl-PL] [pl]
\installlanguage [pt-BR] [pt]
\installlanguage [pt-PT] [pt]
\installlanguage [ro-RO] [ro]
\installlanguage [ru-RU] [ru]
\installlanguage [sk-SK] [sk]
\installlanguage [sl-SL] [sl]
\installlanguage [sv-SE] [sv]
\installlanguage [tr-TR] [tr]
\installlanguage [uk] [ua]
\installlanguage [uk-UA] [ua]
\installlanguage [vi] [vn]
\installlanguage [vi-VN] [vn]

@tolot27
Contributor

tolot27 commented Mar 30, 2015

There is a good article about setting the language in docx: https://social.msdn.microsoft.com/Forums/office/en-US/22e59387-8b00-4436-aa70-8372b3fc560a/how-to-change-openxml-word-document-language-culture-info?forum=oxmlsdk

I've changed my template.docx accordingly and it works great. Setting w:activeWritingStyle was not necessary. I've just changed the w:lang in the w:docDefaults part of styles.xml, removed all style lang tags, changed w:themeFontLang in settings.xml (which is IMHO not required) and changed w:lang in the <w:sdt> part of document.xml, if present.

@lierdakil
Contributor

@tolot27, could you explain why you need to set the docx language? I just use the default one for documents in other locales and have had no problems whatsoever (though maybe I just haven't encountered them), since Word's language autodetection is pretty good, at least in my case.

@nkalvi

nkalvi commented Mar 30, 2015

I agree that Word does a reasonably good job in detecting language and checking spelling/grammar.

However, I notice a couple of things when the default language is not set to the 'actual' default:

  • When the document is opened in Word, the language displayed in the status bar will be English.
    (As an example, I use a sample with Norwegian text in the Norwegian edition of Word 2013):
    [screenshots: word-us vs. word-no]
  • If the document is spell checked and saved, Word will insert language tags for all non-US-English paragraphs.
            <w:r w:rsidRPr="00C74311">
                <w:rPr>
                    <w:lang w:val="nb-NO"/>
                </w:rPr>
                <w:t>Midt på treet Tirsdag og onsdag ser det til at været blir midt på treet de fleste steder i landet. Riktignok starter tirsdagen finfint i Nord-Norge, men utover dagen skyer det til. – Et lavtrykk blir liggende over Nord-Norge, og selv om det ikke er så kraf</w:t>

With default language set correctly (i.e. <w:lang w:val="nb-NO" w:eastAsia="en-US" w:bidi="ar-SA"/>):

<w:r>
                <w:t>Midt på treet Tirsdag og onsdag ser det til at været blir midt på treet de fleste steder i landet. Riktignok starter tirsdagen finfint i Nord-Norge, men utover dagen skyer det til. – Et lavtrykk blir liggende over Nord-Norge, og selv om det ikke er så kraf</w:t>
            </w:r>

I'm not sure how many users are affected adversely by these issues.

@tolot27
Contributor

tolot27 commented Mar 31, 2015

@nkalvi explained it comprehensively.

For my private documents I keep language autodetection on, but for all my scientific and technical documents I switch it off because it does not work well.

@lierdakil
Contributor

I should probably take a stab at this before 1.14 release...

@mb21
Collaborator

mb21 commented Sep 25, 2015

This is fixed in ConTeXt, LaTeX and HTML with #2369, but not for ODT, docx, etc.

@jgm jgm changed the title YAML lang attribute not working in ConTeXt, docx and ODF YAML lang attribute not working in docx and ODF Sep 25, 2015
@jgm
Owner

jgm commented Nov 24, 2015

Confirming that for docx, we need to substitute the document language for en-US in the docDefaults section of styles.xml.

Similarly in odt: in odt/styles.xml, there's a style:default-style element with

      <style:text-properties style:use-window-font-color="true"
      fo:font-size="12pt" fo:language="en" fo:country="US"

Note that language and country need to be separated out here.
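
Separating the two parts could look roughly like this. It is a sketch only: `splitLangTag` and `langAttrs` are hypothetical names, and the code assumes the second subtag is always a country, which is not true for tags such as de-1901.

```haskell
-- Hypothetical helper: split a tag such as "de-DE" into its language and an
-- optional country. Assumes the second subtag, if present, is a country code.
splitLangTag :: String -> (String, Maybe String)
splitLangTag tag =
  case break (== '-') tag of
    (lang, '-' : rest) | not (null rest) -> (lang, Just (takeWhile (/= '-') rest))
    (lang, _)                            -> (lang, Nothing)

-- Render the attributes in the form used by style:text-properties above.
langAttrs :: String -> String
langAttrs tag =
  case splitLangTag tag of
    (lang, Just country) ->
      "fo:language=\"" ++ lang ++ "\" fo:country=\"" ++ country ++ "\""
    (lang, Nothing) ->
      "fo:language=\"" ++ lang ++ "\""

main :: IO ()
main = mapM_ (putStrLn . langAttrs) ["en-US", "de"]
```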

@mb21
Collaborator

mb21 commented Jan 28, 2017

I started looking into this but got only about this far:

let lang = case lookupMeta "lang" meta of
             Just (MetaString s) -> splitBy (=='-') s
             _                   -> []

Couldn't figure out how to appropriately modify styledoc for now... (e.g. are we supposed to use Text.XML.Light or Text.Pandoc.Readers.Odt.Generic.XMLConverter?) Maybe @jkr can have a look..?

@jgm
Owner

jgm commented Jan 29, 2017

@mb21 Looks like styledoc is parsed by Text.XML.Light's parseXml, so you'd use that library to manipulate it.

mb21 added a commit to mb21/pandoc that referenced this issue Feb 22, 2017
@jgm jgm added this to the pandoc 2.0 milestone Mar 15, 2017
@jgm
Owner

jgm commented Mar 16, 2017

@mb21 are you still working on this?

@mb21
Collaborator

mb21 commented Mar 16, 2017

@jgm Feel free to take over, I won't have time for at least another week... you can see where I got stuck battling Text.XML.Light in the referenced commit above...

@jgm
Owner

jgm commented Mar 16, 2017

I got this far (it compiles but doesn't seem to work; I don't have time to investigate why right now):

   -- styles
+  let lang = case lookupMeta "lang" meta of
+          Just (MetaString s) -> Just s
+          _                   -> Nothing
+
+  let addLang :: Element -> Element
+      addLang e = case lang of
+                       Nothing -> e
+                       Just l  ->
+                         case XML.toTree . go l . XML.fromElement $ e of
+                              Elem e' -> e'
+                              _       -> e -- return original
+       where go :: String -> Cursor -> Cursor
+             go l cursor = case XML.findRec (isLangElt . current) cursor of
+                                Nothing -> cursor
+                                Just t  -> XML.modifyContent (setval l) t
+             setval :: String -> Content -> Content
+             setval l (Elem e') = Elem $ e'{ elAttribs = map (setvalattr l) $
+                                              elAttribs e' }
+             setval _ x = x
+             setvalattr :: String -> XML.Attr -> XML.Attr
+             setvalattr l (XML.Attr qn@(QName "val" _ _) _) = XML.Attr qn l
+             setvalattr _ x = x
+             isLangElt (Elem e') = qName (elName e') == "lang"
+             isLangElt _ = False
+
   let stylepath = "word/styles.xml"
-  styledoc <- parseXml refArchive distArchive stylepath
+  styledoc <- addLang <$> parseXml refArchive distArchive stylepath

You also need to add import Text.XML.Light.Cursor as XML.

mb21 added a commit to mb21/pandoc that referenced this issue Mar 19, 2017
@mb21
Collaborator

mb21 commented Mar 19, 2017

@jgm Looks like I got your code to work. Mainly, MetaString s should have been MetaInlines [Str s]...
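
To illustrate the mismatch: a YAML value like `lang: de` arrives as a list of inlines rather than as a plain string, so a pattern on MetaString alone never matches. The sketch below uses small stand-in types that only mimic the shape of pandoc's MetaValue and Inline (they are not the real definitions), and `langFromMeta` is a hypothetical name.

```haskell
-- Stand-in types for illustration only; the real ones live in pandoc-types.
data Inline = Str String | Space
data MetaValue = MetaString String | MetaInlines [Inline]

-- Hypothetical helper: accept both shapes, so the lookup works whether the
-- value was supplied as a plain string or parsed from a YAML block.
langFromMeta :: Maybe MetaValue -> Maybe String
langFromMeta (Just (MetaInlines [Str s])) = Just s
langFromMeta (Just (MetaString s))        = Just s
langFromMeta _                            = Nothing

main :: IO ()
main = mapM_ (print . langFromMeta)
  [ Just (MetaInlines [Str "de"])                 -- shape produced by the YAML block
  , Just (MetaString "de")                        -- plain string shape
  , Just (MetaInlines [Str "de", Space, Str "x"]) -- not a single word: rejected
  , Nothing                                       -- no lang set
  ]
```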

mb21 added a commit to mb21/pandoc that referenced this issue Mar 19, 2017
jgm pushed a commit that referenced this issue Mar 20, 2017
@jgm
Owner

jgm commented Mar 20, 2017

Now this is fixed for docx. (I think - testing always welcome!)
Now only ODT remains to be done.

@jgm jgm changed the title YAML lang attribute not working in docx and ODF YAML lang attribute not working in ~~docx~~ and ODT Mar 20, 2017
@jgm jgm changed the title YAML lang attribute not working in ~~docx~~ and ODT YAML lang attribute not working in docx and ODT Mar 20, 2017
jgm pushed a commit that referenced this issue Mar 21, 2017
@jgm jgm changed the title YAML lang attribute not working in docx and ODT YAML lang attribute not working in ODT Apr 22, 2017
@jgm jgm removed the format:Docx label Apr 22, 2017
@jgm
Owner

jgm commented Jun 23, 2017

Does anyone know what needs to be done to set the document language in ODT?
@MarLinn?

@MarLinn
Contributor

MarLinn commented Jun 24, 2017

I just looked into the spec.
The language is set as a "natural language identifier as defined by RFC5646 or its successor".
It is set as text content of a <dc:language> element inside the <office:meta> element.
The latter in turn may be in one of two places:

  • In a standalone XML file it is a direct child of the root element <office:document>.
  • In a typical, packaged file it is a direct child of the root element <office:document-meta> of the meta.xml file.
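
As a rough sketch of the element described above, the fragment could be rendered like this. The names `officeMeta` and `escapeXml` are hypothetical, namespace declarations are left out, and this is not how the pandoc writer builds its output.

```haskell
-- Escape the characters that are not allowed verbatim in XML text content.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- Hypothetical helper: an <office:meta> fragment carrying the language tag.
-- The xmlns declarations on the enclosing root element are omitted.
officeMeta :: String -> String
officeMeta lang =
  "<office:meta><dc:language>" ++ escapeXml lang ++ "</dc:language></office:meta>"

main :: IO ()
main = putStrLn (officeMeta "de-DE")
```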

@MarLinn
Contributor

MarLinn commented Jun 24, 2017

If I understand correctly, this is only about setting the language in the writer, right? I've never touched that one. But if I remember correctly the ODT reader

  1. does not handle standalone documents at all
  2. does not even look for meta.xml, let alone parse it
  3. was developed after the ODT writer and independently from it

So any part of the reader is unlikely to help here. Or to put it positively, there's nothing restricting anyone to that XMLConverter-stuff.

@imz

imz commented Jun 24, 2017

The German Wikipedia entry for OpenDocument has an example[1] for explaining the ODT format. In the meta.xml file there, you can see the language specified (as German) by

		<dc:language>de-DE</dc:language>

...hope that helps. -- answer by Wolf


[1] Permanent link:
https://de.wikipedia.org/w/index.php?title=OpenDocument&oldid=166672125#meta.xml-Datei

@ousia
Contributor Author

ousia commented Jun 24, 2017

Just a comment. ODT languages (either for the whole document or for a part of it) require both language and country.

I mean, these are recognized:

en-US es-ES de-DE fr-FR it-IT

But these aren’t recognized:

en es de fr it

Only to name a few, in both cases.

@jgm jgm closed this as completed in 083a224 Jun 25, 2017
jgm added a commit that referenced this issue Jun 25, 2017
This improves on the last commit, which didn't work in
some important ways.

See #1667.
@mb21
Collaborator

mb21 commented Jun 25, 2017

dublincore:language specifies to use BCP 47 (which is currently basically RFC 5646)... so according to the spec the country shouldn't be needed, right? Maybe real-life implementations require it though...

@mikolysz

Be aware that this is important for accessibility. Blind users rely on screen-reader software that speaks the on-screen text aloud, and by default such software switches the speech to whatever language the document declares. So when a Polish document that declares English is opened on a Polish system with a Polish synthesizer set as the default, it is read with English pronunciation, making it completely impossible to understand. Advanced users can turn this off, but not everyone is aware that the option exists.

@jgm
Owner

jgm commented Oct 20, 2019

@devil418 yes it's important -- but this issue has been closed as fixed.
Is there still a problem in recent pandoc?
