Citation reader in Word document issue #9116

Closed
estedeahora opened this issue Oct 2, 2023 · 11 comments

@estedeahora

Explain the problem.

I'm trying to convert a file from Word to Markdown, but I'm getting some problems with converting citations that don't include the author. In the output the citation is interpreted as having an author. I suspect the problem is in the reader, because I can identify the problem in the intermediate AST. I think this is a bug and I'm pretty sure it didn't happen in previous versions of pandoc.

I attach a minimal example (EX.zip) with a Word document, as well as the file with the reference (.json).

Thanks!

The command line I am using is:

pandoc +RTS -K512m -RTS EX.docx --to markdown+footnotes+citations+smart+link_attributes --from docx+citations --output EX.md -s --extract-media=./ --wrap=none --citeproc --bibliography=./EX_biblio.json 

The result obtained is:

---
bibliography: ./EX_biblio.json
references:
- abstract: "Ecological analysis is a promising approach to the study of urban social stratification, for differences in the residential distributions of occupations groups are found to parallel the differences among them in socio-economic status and recruitment. The occupation groups at the extremes of the socioeconomic scale are the most segregated. Residential concentration in low-rent areas and residential centralization are inversely related to socioeconomic status. Inconsistencies in the ranking of occupation groups according to residential patterns occur at points where there is evidence of status disequilibrium. CR - Copyright © 1955 The University of Chicago Press"
  author:
  - family: Duncan
    given: Otis Dudley
  - family: Duncan
    given: Beverly
  citation-key: Duncan1955a
  container-title: American Journal of Sociology
  DOI: 10.1086/221609
  id: 1580
  ISSN: 0002-9602
  issue: 5
  issued: 1955
  note: "305 citations (Crossref) \\[2023-05-01\\] Citation Key: Duncan1955a"
  page: 493-503
  title: Residential Distribution and Occupational Stratification
  type: article-journal
  volume: 60
title: This is the title of my article
---

# Introduction

This is a reference to Duncan [@1580].

# References {#references .unnumbered}

But I would expect to get this in the body:

# Introduction

This is a reference to Duncan [-@1580].

Pandoc version?
Pandoc 3.1.8

estedeahora added the bug label Oct 2, 2023
@jgm
Owner

jgm commented Oct 2, 2023

Nothing has changed in this code for a long time, so I doubt previous versions behaved differently, but if you can confirm it that would be interesting.

@estedeahora
Author

Hello!

It is most likely that I am mistaken about past behaviour.

However, I think this could then be handled as a change in the writer. When I look at the XML inside the docx (see image below), I see that the citation has a '"suppress-author":true' field, which looks promising. I don't know Haskell, but it seems to me that a possible modification would be to add a "-" whenever this field appears in the XML citation (that is, the output would change from '[@1580]' to '[-@1580]').

Do you see this as a possible change?

image
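
To make the field described above concrete, here is a minimal Python sketch. It assumes the Word citation is stored as a Zotero field instruction of the form `ADDIN ZOTERO_ITEM CSL_CITATION {...}` and that the flag sits on each entry of `citationItems`. The sample payload is hand-written and trimmed to the relevant keys (it is not copied from EX.docx), and `suppressed_ids` is a hypothetical helper name.

```python
import json

# Assumed shape of a Zotero field instruction inside word/document.xml.
# The payload is a hand-written sample, not the content of EX.docx.
FIELD = (
    'ADDIN ZOTERO_ITEM CSL_CITATION '
    '{"citationItems": [{"id": 1580, "suppress-author": true}, {"id": 42}]}'
)

def suppressed_ids(instr):
    """Return the ids of citation items flagged with "suppress-author"."""
    start = instr.find("{")
    if start == -1:
        # No JSON payload in this field instruction.
        return []
    data = json.loads(instr[start:])
    return [
        item["id"]
        for item in data.get("citationItems", [])
        if item.get("suppress-author") is True
    ]

print(suppressed_ids(FIELD))  # [1580]
```

Items without the flag (id 42 in the sample) are treated as normal citations, which matches the observation that the field is simply absent in that case.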

@jgm
Owner

jgm commented Oct 2, 2023

Yes, it should be possible!

@jgm
Owner

jgm commented Oct 3, 2023

I believe suppress-author is a newly added field in itemData, which citeproc doesn't support yet. So first we'd need to add support for it in jgm/citeproc, and then we could make a small change at
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L526

@jgm
Owner

jgm commented Oct 3, 2023

Hm, I can't find any documentation in the CSL spec for suppress-author. It may be that an external tool is including this non-supported field...but I don't think we should be supporting random unsupported stuff.

@jgm
Owner

jgm commented Oct 3, 2023

@bdarcus would probably know more about the current state of things.

@bdarcus

bdarcus commented Oct 3, 2023

> Hm, I can't find any documentation in the CSL spec for suppress-author. It may be that an external tool is including this non-supported field...but I don't think we should be supporting random unsupported stuff.

It's not part of the CSL spec; perhaps coming from Zotero/citeproc-js?

@estedeahora
Author

> Hm, I can't find any documentation in the CSL spec for suppress-author. It may be that an external tool is including this non-supported field...but I don't think we should be supporting random unsupported stuff.

I'm asking in order to better understand the problem. As far as I can see, Pandoc "cite" objects have a "citationMode" field that can be set to "SuppressAuthor" (see image). So the problem would be that the field in question is not supported by CSL, but it is supported by Pandoc?

image

> It's not part of the CSL spec; perhaps coming from Zotero/citeproc-js?

I found this. I understand that it is not at all convenient to add random fields. But would this specification be enough to justify its inclusion in Pandoc? I think it could be quite a useful feature, and I can't think of any other way to handle it (for example, with a custom reader or a filter).

I'm seeing that citeproc-pandoc handles page numbers as suffixes. What is the reason for this? In citeproc-js the 'label' and 'locator' fields are used for this. I understand that most of the time this is not a problem, but it can be a drawback when using other languages (as in my case). However, this is a lower priority, because it can be easily handled with a filter. (However, perhaps this is another issue.)

Thank you very much for your work and dedication.

@jgm
Owner

jgm commented Oct 4, 2023

So this appears to be a citeproc-js add-on.
I don't think we should modify citeproc itself. We could add an extra parse step just to look for the extra suppress-author field.

Page numbers as suffixes: although pandoc's AST doesn't separate locators and labels from the rest of the suffix, there is code in pandoc that does this before calling citeproc. So it should work properly. Standard locator label abbreviations (as defined in citeproc for your locale) should work, but make sure lang is set in your metadata.
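
The mapping discussed above (an extra step that looks at `suppress-author`, plus locators and labels ending up in the suffix) can be sketched in Python. This is an illustration only, not pandoc's Haskell implementation: `to_pandoc_cite` and the `LABELS` table are hypothetical, the item keys follow the citeproc-js cite-item fields mentioned in this thread, and real label abbreviations would come from the CSL locale.

```python
# Hypothetical abbreviation table; real values depend on the CSL locale.
LABELS = {"page": "p.", "chapter": "chap."}

def to_pandoc_cite(item):
    """Render one citeproc-js style cite-item as pandoc citation syntax."""
    # A leading "-" is pandoc's markdown notation for a suppressed author.
    key = ("-@" if item.get("suppress-author") else "@") + str(item["id"])
    parts = []
    if item.get("prefix"):
        parts.append(item["prefix"].strip() + " ")
    parts.append(key)
    locator = item.get("locator")
    if locator:
        # Assumption: a missing label is treated as "page".
        label = LABELS.get(item.get("label", "page"), "")
        parts.append(", " + (label + " " if label else "") + str(locator))
    if item.get("suffix"):
        parts.append(", " + item["suffix"].strip())
    return "[" + "".join(parts) + "]"

print(to_pandoc_cite({"id": 1580, "suppress-author": True}))  # [-@1580]
print(to_pandoc_cite({"id": 1580, "prefix": "see", "locator": "493", "label": "page"}))
# [see @1580, p. 493]
```

The locator and its label are written into the text after the key, which is where pandoc later separates them out again before calling citeproc.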

@estedeahora
Author

estedeahora commented Nov 27, 2023

I've been trying to understand a bit about how this works. As I said, although I've been reading about Haskell, I must confess that I'm having a hard time understanding it.

To understand how this aspect works, I've been looking at the commits 73fe7c1, 0011c95, 9ef8650, 60caa0a and e07c0e7. Related issue #7840.

In the docx file, the "suppress-author" field (set to true) contains the necessary information. This field does not appear when the citation is "normal". Since, as @jgm notes, the relevant code is at line 526 of Docx.hs, I think adding something similar to this could help:

-- citationItemSuppressAuthor is a hypothetical Bool accessor, not an existing field
, citationMode = if citationItemSuppressAuthor item then SuppressAuthor else NormalCitation

I don't know how citationItemSuppressAuthor itself would be handled. I haven't managed to test this locally either, so I'm sharing the idea here.

@estedeahora
Author

estedeahora commented Apr 18, 2024

I tried at the time to modify the Haskell code, but was unsuccessful. So I developed a small workaround to use until someone who understands Haskell better can tackle it.

The trick is to use a Lua filter that looks at the rendered text of each pandoc.Cite and detects the citations that start with a number (or with the prefix followed by a number). For this to work, Zotero in Word must be using an (author, year) citation style.

Here is the filter code; I hope it helps anyone who comes across the same problem.

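local stringify = pandoc.utils.stringify

-- Escape Lua pattern magic characters so the prefix is matched literally.
local function escape_pattern(s)
    return (s:gsub('%p', '%%%0'))
end

function Cite(cite)
    -- Rendered citation text as stored in the Word document, e.g. "(1955)".
    local cite_text = stringify(cite)

    for _, citation in ipairs(cite.citations) do
        -- Declared local so it does not leak into the global environment.
        local prefix = stringify(citation.prefix)
        if #prefix > 0 then
            prefix = escape_pattern(prefix) .. ' '
        end
        -- "(" followed directly by a digit means no author name was printed.
        if cite_text:match('^%(' .. prefix .. '[0-9]') then
            citation.mode = 'SuppressAuthor'
        end
    end
    return cite
end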
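
To show what the heuristic in the filter accepts and rejects, here is the same check as a small Python sketch. `looks_author_suppressed` is a hypothetical name, and the sample strings are invented author-year renderings, not output taken from EX.docx.

```python
import re

def looks_author_suppressed(cite_text, prefix=""):
    """True when the text starts with "(" + optional prefix + a digit."""
    # re.escape keeps characters such as "." in the prefix literal.
    lead = re.escape(prefix) + " " if prefix else ""
    return re.match(r"\(" + lead + r"[0-9]", cite_text) is not None

print(looks_author_suppressed("(1955)"))             # True
print(looks_author_suppressed("(Duncan 1955)"))      # False
print(looks_author_suppressed("(see 1955)", "see"))  # True
```

A numeric style would render every citation as something like "(1)", so the check would fire for all of them; this is why the workaround depends on an author-year style.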
