Punctuation merge at the end of citation suffix doesn’t work with unicode last characters #33

Open
pripple opened this issue Nov 23, 2020 · 12 comments


pripple commented Nov 23, 2020

Hi, out of gratitude for this great piece of software I have tried to understand a little of how Haskell works, but I’m still not very familiar with it … I think this is the piece of code that decides whether a final dot should be added to a citation, isn’t it? Is it possible that it doesn’t catch cases where the final character is a non-ASCII Unicode character?

https://github.com/jgm/pandoc/blob/68b298ed9aee405033da9a2b44ae86f2241a123d/src/Text/Pandoc/Citeproc.hs#L394-L405

I think the merging of punctuation at the end of a citation suffix doesn’t work when the last character is a non-ASCII Unicode character, although it works with ASCII characters. Maybe the pattern (d:c:_) doesn’t allow _ to be a Unicode character? At least in the LuaLaTeX/BibLaTeX/Biber pipeline I don’t get extra dots after footnote citations ending with a curly quote character.

It is difficult to provide a small MWE, because it depends on the CSL – the standard output is in parentheses, without a dot suffix. For the MWE, I will use the CSL provided at https://www.zotero.org/styles?q=id%3Auniversitat-freiburg-geschichte and call the file footnote_style.csl.

For testing purposes, I am running the following LaTeX code through the following command, using yesterday’s nightly build, because in the current version, 2.11.2, there would even be brackets around the suffix (resolved in 9a40976 and 7db2cf5).

This doesn’t have an additional dot.\cite[Vgl.][3, and this is important!]{introduction}
But this has.\cite[Vgl.][3: „some words from the citation.“]{introduction}
pandoc -f latex -C --bibliography pandoc_test.bib --csl footnote_style.csl -t native
I am getting this output …

[Para [Str "This",Space,Str "doesn\8217t",Space,Str "have",Space,Str "an",Space,Str "additional",Space,Str "dot.",Cite [Citation {citationId = "introduction", citationPrefix = [Str "Vgl."], citationSuffix = [Str "3,",Space,Str "and",Space,Str "this",Space,Str "is",Space,Str "important!"], citationMode = NormalCitation, citationNoteNum = 0, citationHash = 0}] [Note [Para [Str "Vgl.",Space,Strong [Str "Editor,",Space,Str "Emil"],Str ":",Space,Str "Introduction",Space,Str "to",Space,Str "the",Space,Str "Essays,",Space,Str "in:",Space,Emph [Str "Herausgeber,",Space,Str "Herbert/Editor,",Space,Str "Emil",Space,Str "(Hgg.)"],Str ":",Space,Str "The",Space,Str "Ultimate",Space,Str "TeXnic",Space,Str "Bibliographer,",Space,Str "Edinburgh",Space,Str "2000,",Space,Str "S.",Space,Str "3\8211&17,",Str "",Space,Str "S.",Space,Str "3,",Space,Str "and",Space,Str "this",Space,Str "is",Space,Str "important",Str "!"]]],Space,Str "But",Space,Str "this",Space,Str "has.",Cite [Citation {citationId = "introduction", citationPrefix = [Str "Vgl."], citationSuffix = [Str "3:",Space,Str "\8222some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation.\8220"], citationMode = NormalCitation, citationNoteNum = 0, citationHash = 0}] [Note [Para [Str "Vgl.",Space,Str "ebd.,",Space,Str "S.",Space,Str "3:",Space,Str "\8222",Str "some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation",Str ".",Str "\8220."]]]]
,Div ("refs",["references","csl-bib-body"],[])
[Div ("ref-introduction",["csl-entry"],[])
[Para [Strong [Str "Editor,",Space,Str "Emil"],Str ":",Space,Str "Introduction",Space,Str "to",Space,Str "the",Space,Str "Essays,",Space,Str "in:",Space,Emph [Str "Herausgeber,",Space,Str "Herbert/Editor,",Space,Str "Emil",Space,Str "(Hgg.)"],Str ":",Space,Str "The",Space,Str "Ultimate",Space,Str "TeXnic",Space,Str "Bibliographer,",Space,Str "Edinburgh",Space,Str "2000,",Space,Str "S.",Space,Str "3\8211&17."]]]]

The important aspect is that the second suffix is taken as ,Str "\8222",Str "some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation",Str ".",Str "\8220."
I would expect there not to be a dot after \8220, parallel to the way BibLaTeX treats the situation: ,Str "\8222",Str "some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation",Str ".",Str "\8220"
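
For readers less used to pandoc's native output: escapes such as \8222 and \8220 are decimal Unicode code points as Haskell prints them. The following is a minimal Python sketch for decoding them; the helper name `decode_decimal_escapes` is made up for illustration, and it only handles the plain backslash-plus-digits form.

```python
import re
import unicodedata


def decode_decimal_escapes(text: str) -> str:
    """Replace Haskell-style decimal escapes such as \\8222 with the character."""
    # Hypothetical helper: covers only a backslash followed by digits,
    # not named escapes or the \& separator Haskell may print.
    return re.sub(r"\\(\d+)", lambda m: chr(int(m.group(1))), text)


for escape in (r"\8222", r"\8220", r"\8217", r"\8211"):
    ch = decode_decimal_escapes(escape)
    # Print the escape, its code point in U+XXXX form, and its Unicode name.
    print(escape, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

So \8222 is „ (U+201E) and \8220 is “ (U+201C), which is why the unwanted output ends in “ followed by a period.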

pandoc_test.bib
@incollection{introduction,
	author = {Emil Editor},
	crossref = {collection},
	pages = {3–17},
	title = {Introduction to the Essays}}

@collection{collection,
	address = {Edinburgh},
	editor = {Herbert Herausgeber and Emil Editor},
	title = {The Ultimate \TeX nic Bibliographer},
	year = {2000}}
Owner

jgm commented Nov 23, 2020

We're not trying to detect final punctuation (which we could do in a unicode-friendly way using isPunctuation).
We're looking for certain specific punctuation marks. We don't want to remove periods after ), for example.
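
The difference between the two tests can be sketched in Python. This is only an illustration under assumptions: the set `SENTENCE_FINAL` is a guess for demonstration, and the marks actually checked are the ones in the linked Citeproc.hs.

```python
import unicodedata

# Assumption: an illustrative set of marks after which no period is added.
SENTENCE_FINAL = {".", "!", "?"}


def is_punctuation(ch: str) -> bool:
    """Unicode-aware test: true for any character in a P* general category."""
    return unicodedata.category(ch).startswith("P")


def suppresses_period(ch: str) -> bool:
    """Narrow test: only the specific marks listed above count."""
    return ch in SENTENCE_FINAL


for ch in ("!", ")", "\u201c"):  # exclamation mark, closing paren, “
    print(repr(ch), is_punctuation(ch), suppresses_period(ch))
```

Both `)` and `“` are punctuation in the Unicode sense, yet neither is in the narrow set, so a period is still added after them.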

What about curly quotes? Yes, in general, we want a period, since there is no way of knowing ahead of time whether we have e.g. just a quoted word; also in some styles we allow dots after quotes. There is a different part of the code that is responsible for moving periods inside quotes (when that is desired).

This code knows to check inside Quoted elements; the problem is that pandoc won't create a Quoted element with this style of quotes, so you just get the literal quote characters, and the algorithm isn't currently smart enough to look inside that.
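
A toy model of that behaviour, written in Python rather than Haskell. The `Str` and `Quoted` classes and the function `ends_with_final_mark` are simplified stand-ins invented for this sketch, not pandoc's or citeproc's actual types.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Str:
    text: str


@dataclass
class Quoted:
    content: List["Inline"]


Inline = Union[Str, Quoted]

FINAL_MARKS = ".!?"  # assumption: illustrative set only


def ends_with_final_mark(inlines: List[Inline]) -> bool:
    """True if the last character, looking inside a trailing Quoted, is a final mark."""
    if not inlines:
        return False
    last = inlines[-1]
    if isinstance(last, Quoted):
        # Recurse into the quotation: its closing mark is not part of the content.
        return ends_with_final_mark(last.content)
    return bool(last.text) and last.text[-1] in FINAL_MARKS


english = [Str("3:"), Quoted([Str("some words from the citation.")])]
german = [Str("3:"), Str("\u201esome words from the citation.\u201c")]
print(ends_with_final_mark(english))  # True
print(ends_with_final_mark(german))   # False
```

With a Quoted element the inner period is found; with literal „…“ characters the last character is “, so the check fails and a period gets appended.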

Owner

jgm commented Nov 23, 2020

Some possible solutions:

  • modify the LaTeX reader so that the pair \8222...\8220 triggers a Quoted element. This should be pretty safe, but I'm not sure if there are locales that use these symbols differently.
  • alternatively, do this only if the metadata lang is (what? German?)
  • modify citeproc so that the movePunctuationInsideQuotes function is locale-sensitive and can thus recognize a closing quotation mark appropriate to the locale.
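
As a rough Python sketch of the first option (recognizing the pair in the reader): the function `split_quotes` and its tuple output are hypothetical and only handle the German low-high pairs „…“ and ‚…‘, without nesting.

```python
import re
from typing import List, Tuple

# „ ... “ (U+201E ... U+201C) or ‚ ... ‘ (U+201A ... U+2018)
QUOTE_PAIR = re.compile("\u201e([^\u201c]*)\u201c|\u201a([^\u2018]*)\u2018")


def split_quotes(text: str) -> List[Tuple[str, str]]:
    """Split text into ("Str", s) and ("DoubleQuote" / "SingleQuote", inner) pieces."""
    out: List[Tuple[str, str]] = []
    pos = 0
    for m in QUOTE_PAIR.finditer(text):
        if m.start() > pos:
            out.append(("Str", text[pos:m.start()]))
        if m.group(1) is not None:
            out.append(("DoubleQuote", m.group(1)))
        else:
            out.append(("SingleQuote", m.group(2)))
        pos = m.end()
    if pos < len(text):
        # Unmatched opening marks simply stay in the plain text.
        out.append(("Str", text[pos:]))
    return out


print(split_quotes("3: \u201esome words from the citation.\u201c"))
```

Once the quotation is a separate piece, its inner final period is visible to whatever decides about the trailing dot.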

Author

pripple commented Nov 23, 2020

Thank you, John, for the further explanation and the suggestions!

> This should be pretty safe, but I'm not sure if there are locales that use these symbols differently.

I have checked the summary table on quotation marks on Wikipedia, and Hebrew seems to be the only language that uses „ (U+201E) or ‚ (U+201A) on the right side of a quotation; however, Hebrew is written from right to left. So treating those characters as opening a quotation shouldn’t be wrong in any case.

And yes, I am primarily thinking of German, with the primary quotation marks „…“ | ‚…‘ (but there are also »…« | ›…‹).

As they don’t conflict, I would prefer if they were always enabled, because I am not sure whether pandoc reads the babel option from the LaTeX preamble. At least in LibreOffice, all German text is flagged by the spellchecker, so it seems the document uses English as the default language.

So always accepting those combinations should do the trick, as this new MWE with another CSL shows with yesterday’s nightly build. You can tell from the plain text output (which I chose for readability) that it works in the case where pandoc recognizes the quotation. With the CSL I used for my earlier test, it somehow didn’t work in this case either, so I thought that wouldn’t be of any use.

This doesn’t have an additional dot.\cite[See][3, and this is important!]{introduction}
But this has.\cite[See][3: „some words from the citation.“]{introduction}
This works again as expected.\cite[See][3: “some words from the citation.”]{introduction}
pandoc -f latex -C --bibliography pandoc_test.bib --csl footnote_style.csl -t plain

This doesn’t have an additional dot.[1] But this has.[2] This works
again as expected.[3]

Editor, Emil. “Introduction to the Essays.” In The Ultimate TeXnic
Bibliographer, edited by Herbert Herausgeber and Emil Editor, 3–17.
Edinburgh, 2000.

[1] See Emil Editor, “Introduction to the Essays,” in The Ultimate
TeXnic Bibliographer, ed. Herbert Herausgeber and Emil Editor
(Edinburgh, 2000), 3, and this is important!

[2] See ibid., 3: „some words from the citation.“.

[3] See ibid. “some words from the citation.”

(By the way, you can tell from the output that it treats the colon as part of the page number … In BibLaTeX, I can specify the page number part with \pnfmt and wrap the rest in a \passifpages command, but I haven’t found a way here.)

pandoc_test.bib
@incollection{introduction,
	author = {Emil Editor},
	crossref = {collection},
	pages = {3–17},
	title = {Introduction to the Essays}}

@collection{collection,
	address = {Edinburgh},
	editor = {Herbert Herausgeber and Emil Editor},
	title = {The Ultimate \TeX nic Bibliographer},
	year = {2000}}

Owner

jgm commented Nov 24, 2020

> At least in LibreOffice, all German text is marked by the spellchecker, so it seems the document uses English as the default language.

This is likely something that can be fixed in the odt/opendocument writer. If you use -t native -s, does the metadata portion contain a value for lang? If so, then fixing this may just be a matter of making the odt/opendocument writer sensitive to this. Open another issue about that if you like.

> you can tell from the output that it treats the colon as part of the page number

Why do you say so? I don't think it is.

> I can specify the page number part with \pnfmt and wrap the rest in a \passifpages command, but I haven’t found a way here.

See the manual here:

pandoc will use heuristics to distinguish the locator
from the suffix. In complex cases, the locator can be enclosed
in curly braces:

[@smith{ii, A, D-Z}, with a suffix]
[@smith, {pp. iv, vi-xi, (xv)-(xvii)} with suffix here]
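
To make the brace convention concrete, here is a small Python sketch that separates a braced locator from the rest of the suffix. It is not pandoc's parser and does not reproduce its heuristics; `split_braced_locator` is a hypothetical helper that only looks for a leading `{...}` group in the text following the citation key.

```python
import re
from typing import Optional, Tuple

# Optional comma and whitespace, then one brace group, then the remaining suffix.
BRACED = re.compile(r"\s*,?\s*\{([^{}]*)\}(.*)", flags=re.DOTALL)


def split_braced_locator(after_key: str) -> Tuple[Optional[str], str]:
    """Return (locator, suffix); locator is None when no braced locator leads the text."""
    m = BRACED.match(after_key)
    if m is None:
        return None, after_key
    return m.group(1), m.group(2)


print(split_braced_locator("{ii, A, D-Z}, with a suffix"))
print(split_braced_locator(", {pp. iv, vi-xi, (xv)-(xvii)} with suffix here"))
```

Everything inside the braces is taken as the locator, so commas and colons in it no longer have to be guessed at.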

Author

pripple commented Nov 24, 2020

Thank you for your help in narrowing focus on this issue, John! 🤓

So, the first part of the reply is about my suggestion to parse „ (U+201E) or ‚ (U+201A) on the left side of a quotation in all languages, isn’t it? If the language selection works correctly, you can have pandoc be strict.

I understand that you might want pandoc to be strict here. However, as typing UTF-8 quotes doesn’t trigger a “Have I selected the correct language for these quotes in the metadata/for this sentence?” question automatically, I would still appreciate if it could just parse non-conflicting quotation marks in all languages.

At least when a certain language is selected in metadata, all corresponding Unicode quotation marks should be recognized as quotation elements, I suggest. There is an appropriate summary table on quotation marks on Wikipedia.

Actually, if you want to be strict, English curly quotation marks shouldn’t be parsed as quotation environments then. But, as I explained above, I prefer a more relaxed approach here 😉

> > At least in LibreOffice, all German text is marked by the spellchecker, so it seems the document uses English as the default language.
>
> This is likely something that can be fixed in the odt/opendocument writer. If you use -t native -s, does the metadata portion contain a value for lang? If so, then fixing this may just be a matter of making the odt/opendocument writer sensitive to this. Open another issue about that if you like.

No, it doesn’t include lang. But maybe this is on purpose? (If not, I will open an issue on that.) When I include --metadata lang=de-DE in the pandoc command instead of relying on it being parsed from the LaTeX (Babel) option, German text isn’t flagged by the spell checker in LibreOffice, so the ODT writer works correctly. However, German curly quotes still aren’t recognized as quotes in the -t native output.

MWE
% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[ngerman,british]{babel}

\begin{document}
	
Wird der Text als Deutsch markiert? Werden „deutsche Anführungszeichen“ akzeptiert? \foreignlanguage{british}{And what about “English parts”?}

\end{document}
pandoc -f latex --metadata lang=de-DE -o test.odt

> > you can tell from the output that it treats the colon as part of the page number
>
> Why do you say so? I don't think it is.

I say so because if it didn’t treat the colon as part of the page number, imho it should put out in the MWE from my last comment: [3] See ibid.: “some words from the citation.”

> > I can specify the page number part with \pnfmt and wrap the rest in a \passifpages command, but I haven’t found a way here.
>
> See the manual here:
>
> pandoc will use heuristics to distinguish the locator
> from the suffix. In complex cases, the locator can be enclosed
> in curly braces:
>
> [@smith{ii, A, D-Z}, with a suffix]
> [@smith, {pp. iv, vi-xi, (xv)-(xvii)} with suffix here]

I have tried with an MWE, and pandoc doesn’t parse the \pnfmt command (which should do the trick in BibLaTeX) the way the manual says it parses curly braces in Markdown citations. I am willing to read the manual, of course, but it’s not always obvious where to find the information on a very specific situation like this, sorry! 🙈

Owner

jgm commented Nov 25, 2020

> I say so because if it didn’t treat the colon as part of the page number, imho it should put out in the MWE from my last comment: [3] See ibid.: “some words from the citation.”

This is really style-dependent. I don't know about the csl you're using (it's not in the main repository), but I tried your example with both chicago-fullnote-bibliography and chicago-fullnote-bibliography-with-ibid. I got the ibid with the latter, but not with the former. Try a different CSL file that uses ibid.

Author

pripple commented Nov 25, 2020

Hi, thank you! Actually, I am using chicago-fullnote-bibliography-with-ibid from the zotero.org repository for these MWEs. Anyway, I will try to figure out those side issues and, if appropriate, open separate issues for them.


I would like to come back to the solutions you suggested for the main topic of this issue in your comment https://github.com/jgm/pandoc/issues/6879#issuecomment-732318725:

> Some possible solutions:
>
>   • modify the LaTeX reader so that the pair \8222...\8220 triggers a Quoted element. This should be pretty safe, but I'm not sure if there are locales that use these symbols differently.

I have checked the summary table on quotation marks on Wikipedia, and Hebrew seems to be the only language that uses „ (U+201E/\8222) or ‚ (U+201A/\8218) on the right side of a quotation; however, it is written from right to left.

However, as the MWE below shows, the parsing has to be strictly language-specific, because otherwise, e.g., an input of „…“ in an English context would be put out again as “…”, since the Quoted DoubleQuote environment doesn’t store the originally used quotation marks.

>   • alternatively, do this only if the metadata lang is (what? German?)

Yes, I am primarily thinking of German, with the primary quotation marks „…“ | ‚…‘ (but there are also »…« | ›…‹). This is currently not supported. Below, I am providing an MWE which shows that „…“ isn’t recognized as a quotation in an explicitly German environment and that “…” is recognized even in a German environment (even though these aren’t correct quotation marks in German). This would lead to wrong output as described above. In a German context, only „…“ and »…« (not “…”) should be recognized as Quoted DoubleQuote environments, and all Quoted DoubleQuote environments in a German context should be put out as „…“. The same applies to ‚…‘ and ›…‹ and Quoted SingleQuote environments in a German context.

MWE

I am using the current version, 2.11.2, and the output is the same with today’s fresh nightly build.

% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[british,ngerman]{babel}

\begin{document}
Verwendung „deutscher“ Anführungszeichen.
Usage of “English” quotation marks.
\foreignlanguage{ngerman}{Verwendung „deutscher“ Anführungszeichen im deutschen Kontext}
\foreignlanguage{ngerman}{Usage of “English” quotation marks in German context.}

\end{document}

Run through the command pandoc -f latex -t native, this yields: (I have set the relevant parts in bold and italicised the beginnings of the language environments.)

[Para [Str "Verwendung",Space,Str "\8222deutscher\8220",Space,Str "Anf\252hrungszeichen.",SoftBreak,Str "Usage",Space,Str "of",Space,Quoted DoubleQuote [Str "English"],Space,Str "quotation",Space,Str "marks.",SoftBreak,Span ("",[],[("lang","de-DE")]) [Str "Verwendung",Space,Str "\8222deutscher\8220",Space,Str "Anf\252hrungszeichen",Space,Str "im",Space,Str "deutschen",Space,Str "Kontext"],SoftBreak,Span ("",[],[("lang","de-DE")]) [Str "Usage",Space,Str "of",Space,Quoted DoubleQuote [Str "English"],Space,Str "quotation",Space,Str "marks",Space,Str "in",Space,Str "German",Space,Str "context."]]]

I would expect, if quotation marks were treated strictly language-specific: (Again, I have set the relevant parts in bold and italicised the beginnings of the language environments.)

[Para [Str "Verwendung",Space,Str "\8222deutscher\8220",Space,Str "Anf\252hrungszeichen.",SoftBreak,Str "Usage",Space,Str "of",Space,Quoted DoubleQuote [Str "English"],Space,Str "quotation",Space,Str "marks.",SoftBreak,Span ("",[],[("lang","de-DE")]) [Str "Verwendung",Space,Quoted DoubleQuote [Str "deutscher"],Space,Str "Anf\252hrungszeichen",Space,Str "im",Space,Str "deutschen",Space,Str "Kontext"],SoftBreak,Span ("",[],[("lang","de-DE")]) [Str "Usage",Space,Str "of",Space,Str "\8220English\8221",Space,Str "quotation",Space,Str "marks",Space,Str "in",Space,Str "German",Space,Str "context."]]]

This would, however, also need a fix of the Quotation writers, because running this expected output back through pandoc -f native -t plain yields at the moment:

Verwendung „deutscher“ Anführungszeichen. Usage of “English” quotation
marks. Verwendung “deutscher” Anführungszeichen im deutschen Kontext
Usage of “English” quotation marks in German context.

Notice how in the back conversion, even in the explicitly German context, the Quoted DoubleQuote environment is put out as English style quotation marks.
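
The writer-side behaviour asked for here could look roughly like the following Python sketch. The table `QUOTE_MARKS` and the function `render_quoted` are assumptions for illustration; a real implementation would presumably take the marks from locale data instead of hard-coding them.

```python
# Assumption: a tiny hand-written table covering only German and English.
QUOTE_MARKS = {
    "de": {"double": ("\u201e", "\u201c"), "single": ("\u201a", "\u2018")},  # „…“ ‚…‘
    "en": {"double": ("\u201c", "\u201d"), "single": ("\u2018", "\u2019")},  # “…” ‘…’
}


def render_quoted(inner: str, kind: str = "double", lang: str = "en") -> str:
    """Wrap inner in the quotation marks of lang, falling back to English."""
    primary = lang.split("-")[0].lower()  # "de-DE" -> "de"
    opening, closing = QUOTE_MARKS.get(primary, QUOTE_MARKS["en"])[kind]
    return f"{opening}{inner}{closing}"


print(render_quoted("deutscher", lang="de-DE"))  # „deutscher“
print(render_quoted("English", lang="en"))       # “English”
```

The language would come from the enclosing lang attribute (the Span in the native output above) or from the document metadata.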

If this route is chosen, this issue becomes “Reading and Writing of Non-English Curly Quotation Marks”.


But, you have suggested another solution in the same comment, https://github.com/jgm/pandoc/issues/6879#issuecomment-732318725:

>   • modify citeproc so that the movePunctuationInsideQuotes function is locale-sensitive and can thus recognize a closing quotation mark appropriate to the locale.

This would be fine also and maybe a much simpler (intermediate) solution, as no full handling of reading and writing of international curly quotation marks has to be supported. In German, the possible closing quotation marks are “, ‘, « and ‹ (\8220, \8216, \171 and \8249).
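
A sketch of that idea in Python, under stated assumptions: `add_final_period` and both tables are hypothetical, and citeproc's real logic works on its own data structures, not on strings.

```python
FINAL_MARKS = ".!?"  # assumption: illustrative set only

# Closing quotation marks per language; the German set follows the list above.
CLOSING_QUOTES = {
    "de": "\u201c\u2018\u00ab\u2039",  # “ ‘ « ‹
    "en": "\u201d\u2019",              # ” ’
}


def add_final_period(suffix: str, lang: str = "en") -> str:
    """Append a period unless the suffix already ends in a final mark,
    ignoring any closing quotation marks of the given language."""
    closers = CLOSING_QUOTES.get(lang.split("-")[0].lower(), CLOSING_QUOTES["en"])
    core = suffix.rstrip(closers)  # look past trailing closing quotes
    if core and core[-1] in FINAL_MARKS:
        return suffix
    return suffix + "."


german = "3: \u201esome words from the citation.\u201c"
print(add_final_period(german, "de-DE") == german)       # True: no extra dot
print(add_final_period(german, "en").endswith("\u201c."))  # True: the reported behaviour
```

With the German table the inner period is found and nothing is appended; with the English table “ is not a closing mark, which reproduces the extra dot.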

Owner

jgm commented Nov 25, 2020

One note: the default de-DE locale for citeproc has punctuation-in-quote=false, and this default is not overridden in your universitat-freiburg-geschichte csl file. So punctuation moving would not be activated in this case anyway. To be consistent with this style, shouldn't you use : „some words from the citation“. or simply omit the period?
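
What the punctuation-in-quote setting changes can be shown with a small Python sketch. `place_period` is a hypothetical helper working on plain strings; it does not check whether the quotation already ends in a period.

```python
def place_period(quoted_text: str, closing: str, punctuation_in_quote: bool) -> str:
    """Attach a final period to text that may end with a closing quotation mark."""
    if not quoted_text.endswith(closing):
        return quoted_text + "."
    body = quoted_text[: -len(closing)]
    if punctuation_in_quote:
        return f"{body}.{closing}"  # period moves inside the quotes
    return f"{body}{closing}."      # period stays outside the quotes


text = "\u201cdifferent situation\u201d"
print(place_period(text, "\u201d", True))   # “different situation.”
print(place_period(text, "\u201d", False))  # “different situation”.
```

With the setting false, as in the de-DE locale, the period is placed after the closing mark.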

Author

pripple commented Nov 26, 2020

> One note: the default de-DE locale for citeproc has punctuation-in-quote=false, and this default is not overridden in your universitat-freiburg-geschichte csl file. So punctuation moving would not be activated in this case anyway. To be consistent with this style, shouldn't you use : „some words from the citation“. or simply omit the period?

I am using a custom CSL and I am quite free in how to style my citations as long as it doesn’t change within the publication. Generally, I am trying to make an exact duplicate of the output of the BibLaTeX footnote-dw style. I have lots of citations where I give the original (English) text in the footnote, so these are full sentences. When the end of the sentence with its full stop is part of the direct citation, I want the full stop to be inside the curly quotes. I don’t want to have another dot after the curly quotes in those occasions; only when I cite a single word or a phrase that isn’t at the end of the sentence, I won’t put a full stop myself and rely on the footnote style to add it at the end of the suffix.

Having a look at that punctuation-in-quote setting, I realize that also in English, the behaviour is different from what I get from the BibLaTeX footnote-dw style – it always puts the full stop inside the curly quotes, no matter if there was one or if it is added.

% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[style=footnote-dw]{biblatex}
\addbibresource{mwe.bib}
\begin{document}

This is where I would like to have the full stop inside the curly quotes, also in German.\cite[See][3, consider especially: “It can be difficult to deal with full stops and curly quotes.”]{introduction}

This is where the full stop should go at the very end.\cite[See][4, “different situation”]{introduction}

\end{document}

Running this with LuaLaTeX and Biber results in the following text in the resulting PDF file:

This is where I would like to have the full stop inside the curly quotes, also in German.1
This is where the full stop should go at the very end.2

1See John Doe: A Bibliographer’s TEXnic Inquiry, in: The Pandoc Journal 25.3 (1989), pp. 393–396, 3, consider especially: “It can be difficult to deal with full stops and curly quotes.”
2See ibid., 4, “different situation”.

Running it through pandoc mwe.tex -C --bibliography mwe.bib --csl chicago-fullnote-bibliography-with-ibid.csl -t plain puts the full stop inside the quotes in both footnotes (yesterday’s nightly build gives the same output as the current version, 2.11.2):

This is where I would like to have the full stop inside the curly
quotes, also in German.[1]

This is where the full stop should go at the very end.[2]

Doe, John. “A Bibliographer’s TeXnic Inquiry.” The Pandoc Journal 25,
no. 3 (1989): 393–96.

[1] See John Doe, “A Bibliographer’s TeXnic Inquiry,” The Pandoc Journal
25, no. 3 (1989): 3, consider especially: “It can be difficult to deal
with full stops and curly quotes.”

[2] See ibid., 4, “different situation.”

I begin to realize that I will probably just have to go through all footnotes manually and do those adjustments … Maybe I’ll do an intermediate step with a text format so I can scope regex substitutions to the footnotes.
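
Such a footnote-scoped substitution might look like this in Python. It is a sketch that assumes pandoc's plain output as shown above, where footnotes are separate paragraphs starting with a [n] marker; `fix_footnotes` is a made-up name.

```python
import re

# A period directly after a closing quote that itself follows . ! or ?
# Closing quotes covered: “ ” ‘ ’ « ‹
EXTRA_DOT = re.compile(r"([.!?][\u201c\u201d\u2018\u2019\u00ab\u2039])\.")

# Paragraphs that begin with a footnote marker such as "[2] "
FOOTNOTE_START = re.compile(r"\[\d+\]\s")


def fix_footnotes(plain_text: str) -> str:
    """Drop the redundant period, but only inside footnote paragraphs."""
    paragraphs = plain_text.split("\n\n")
    fixed = [
        EXTRA_DOT.sub(r"\1", p) if FOOTNOTE_START.match(p) else p
        for p in paragraphs
    ]
    return "\n\n".join(fixed)


sample = "But this has.[2]\n\n[2] See ibid., 3: \u201esome words from the citation.\u201c."
print(fix_footnotes(sample))
```

Body paragraphs are left untouched, so a deliberate period after a quotation in the running text survives.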

Owner

jgm commented Nov 26, 2020

You shouldn't hope for exact duplication of what biblatex does.
But I'd still like to think more about how to avoid the extra dot in Str ".",Str "\8220.".

Author

pripple commented Nov 30, 2020

> You shouldn't hope for exact duplication of what biblatex does.

Yes, I understand. I’m experimenting a little bit with make4ht -l -f odt output.tex now which might be better suited for someone who is just fine with the BibLaTeX output and has to hand in something in a document format.

> But I'd still like to think more about how to avoid the extra dot in Str ".",Str "\8220.".

Is there something further I can do to help you with that?

Owner

jgm commented Nov 30, 2020

Actually I see now that the issue comes up in citeproc itself, not in anything pandoc does.
The period is added by citeproc.
So I'll transfer this. We should be able to create a test case using just citeproc.

@jgm jgm transferred this issue from jgm/pandoc Nov 30, 2020