Docx reader: metadata recognition is blocked if other elements come before title #8986

Open
StephanMeijer opened this issue Aug 7, 2023 · 13 comments
@StephanMeijer
Contributor

StephanMeijer commented Aug 7, 2023

Explain the problem.

How does the .docx reader in Pandoc determine the top style, such as Title, and what implications does this approach have for international documents? Specifically, in Dutch (NL) documents, the top style for Title is often named Title but has a Style ID of Titel (the Dutch translation for Title).

I believe this might result in the Title being converted to a regular paragraph.

Pandoc version?

pandoc 3.1.6
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Example

This example has been anonymized and therefore contains gibberish.

[Image: visual representation of the document in Microsoft Word]

Expected

Expected "Znzxar txfnfdcestx turpfmdrhpff" to be marked as the Title of this document as seen in the screenshot above.

Actual

The text "Znzxar txfnfdcestx turpfmdrhpff" is marked as a regular text paragraph, not as the Title of the document.

Sources:

Input: input.docx

Output: none/expected.html


Our internal ID: NLDOC-837

@StephanMeijer StephanMeijer changed the title from "Issue with .docx reader's handling of top styles in international documents" to ".docx reader's handling of top styles in international documents" Aug 7, 2023
@StephanMeijer
Contributor Author

StephanMeijer commented Aug 7, 2023

I'm currently unable to provide examples, but I think the HeadingInPairs attribute in docProps/app.xml might have something to do with this.

image

@StephanMeijer
Contributor Author

ChatGPT guesses that this Word file was created with Microsoft Word 2016.
image

@StephanMeijer
Contributor Author

Example added.

@jgm
Owner

jgm commented Aug 7, 2023

You are right: we just look at the style.

metaStyles :: M.Map ParaStyleName T.Text
metaStyles = M.fromList [ ("Title", "title")
                        , ("Subtitle", "subtitle")
                        , ("Author", "author")
                        , ("Date", "date")
                        , ("Abstract", "abstract")]

Paragraphs with these styles turn into metadata values.
I'm not sure how to deal with the full variety of style names in other languages.
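
As a rough illustration of the lookup described above, here is a minimal sketch. It assumes plain `String` keys in place of pandoc's `ParaStyleName` and `T.Text`, and `metaFieldFor` is a hypothetical helper, not a function from pandoc's source. The real reader may normalise names differently.

```haskell
import qualified Data.Map as M

-- Simplified stand-in for pandoc's metaStyles: plain Strings replace
-- ParaStyleName and T.Text (an assumption, to keep the sketch self-contained).
metaStyles :: M.Map String String
metaStyles = M.fromList
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- Hypothetical helper: the metadata field a paragraph style maps to, if any.
metaFieldFor :: String -> Maybe String
metaFieldFor styleName = M.lookup styleName metaStyles
```

In this simplified model, `metaFieldFor "Title"` yields `Just "title"`, while a localized name such as `"Titel"` yields `Nothing` and the paragraph would stay ordinary body text.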

@jgm
Owner

jgm commented Aug 7, 2023

I would have assumed that the style ID would stay the same in localizations, while the style name changes, but you are reporting the reverse. It would be good to have more information here from others using localized versions of Word.

EDIT: Also, the style names above are compared against style names, not ids, so it should work if your style name is really "Title".

@jgm
Owner

jgm commented Aug 7, 2023

OK, this doesn't have anything to do with the style ID or with the language.

Pandoc looks for "metadata" paragraphs only at the beginning of the document. Since your document begins with another element (an image of a cat), the paragraph is not treated as metadata. Removing the cat picture causes the text to be treated as a title.
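
The "only at the beginning" rule can be pictured with a toy model. This is a sketch under stated assumptions, not pandoc's actual reader code: the `Block` type and the `isMeta` and `splitLeadingMeta` helpers are hypothetical, and the style table is copied from `metaStyles` as a plain association list.

```haskell
import Data.Maybe (isJust)

-- Toy model of a document body element; pandoc's real docx types are richer.
data Block
  = StyledPara String String   -- style name, paragraph text
  | Image String               -- e.g. a picture placed before the title
  deriving (Eq, Show)

-- Style-to-field table, mirroring metaStyles with plain Strings.
metaStyles :: [(String, String)]
metaStyles =
  [ ("Title", "title"), ("Subtitle", "subtitle"), ("Author", "author")
  , ("Date", "date"), ("Abstract", "abstract") ]

-- A block counts as metadata only if it is a paragraph with a listed style.
isMeta :: Block -> Bool
isMeta (StyledPara style _) = isJust (lookup style metaStyles)
isMeta _                    = False

-- Hypothetical helper: split off the leading run of metadata paragraphs.
-- Everything from the first non-metadata element onward is treated as body.
splitLeadingMeta :: [Block] -> ([Block], [Block])
splitLeadingMeta = span isMeta

-- The situation from this issue: an image precedes the title.
withCat :: [Block]
withCat = [Image "cat", StyledPara "Title" "Znzxar txfnfdcestx turpfmdrhpff"]
```

With this model, `splitLeadingMeta withCat` returns an empty metadata list and leaves the title paragraph in the body, whereas `splitLeadingMeta (drop 1 withCat)` picks the title up as metadata.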

@StephanMeijer
Contributor Author

@jgm Is this an issue that ought to be resolved in Pandoc itself, or is it better to do some preprocessing on our side?

This anonymized example is based on a real-life document we received, so I assume it might be best to fix it in Pandoc. Or is a setup like this invalid per the standard?

@jgm
Owner

jgm commented Aug 7, 2023

The pandoc behavior is intentional, but it could be changed. The current approach is conservative: we don't want to pick up a style "Date" that occurs deep in the body of the document as the metadata date... Changing it might produce some unexpected effects.

@StephanMeijer
Contributor Author

Shouldn't specific styles always be considered 'metadata', such as "Title" or "Subtitle"?

@jgm
Owner

jgm commented Aug 8, 2023

Who knows? Word has a Date style. Is it intended for the document date only? Or is it something one might use for other dates in the body of the document? In fact, a user could use it either way. If they did the latter we'd be picking up dates from the body of the document and treating them as metadata.

I'm tempted to change things so that these styles are always considered metadata, even if they don't come at the beginning, but I'm also resisting the temptation, because it might have bad results -- and these would only become evident after the change was made. I think it was probably done this way for a reason.
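
To make the trade-off concrete, here is a sketch of the alternative "always metadata" rule applied to a made-up document. Everything here is hypothetical: `Para`, `extractAnywhere`, and the `sample` data are illustrations, not pandoc code or a real-world file.

```haskell
import Data.List (partition)

-- (style name, paragraph text): a deliberately minimal paragraph model.
type Para = (String, String)

metaStyleNames :: [String]
metaStyleNames = ["Title", "Subtitle", "Author", "Date", "Abstract"]

-- Hypothetical "always metadata" rule: pull out every paragraph carrying a
-- metadata style, regardless of where it occurs in the document.
extractAnywhere :: [Para] -> ([Para], [Para])
extractAnywhere = partition ((`elem` metaStyleNames) . fst)

-- Made-up document in which the Date style is used inside the body text.
sample :: [Para]
sample =
  [ ("Title",  "Annual report")
  , ("Normal", "The meeting took place on")
  , ("Date",   "3 March")
  , ("Normal", "and was well attended.")
  ]
```

Here `fst (extractAnywhere sample)` contains both the title and the in-body "3 March" paragraph, so the date would be lifted out of the running text and reported as document metadata. That is the kind of unexpected effect the conservative rule avoids.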

@StephanMeijer
Contributor Author

Could you give me examples of metadata that are not metadata in different contexts?

@jgm
Owner

jgm commented Aug 8, 2023

See my previous comment on Date. Do I have a real-world example? No. I try to deal with Word documents as little as possible. But as I said: anyone can apply the Date style anywhere in the document they wish! So, maybe there are lots of documents where the Date style is not used for metadata. That would be my guess, anyway.

@StephanMeijer
Contributor Author

I will try to compose some examples and strategies for extracting certain parts of metadata.

@jgm jgm changed the title from ".docx reader's handling of top styles in international documents" to "Docx reader: metadata recognition is blocked if other elements come before title" Aug 10, 2023
StephanMeijer added a commit to StephanMeijer/pandoc that referenced this issue Oct 31, 2023
StephanMeijer added a commit to StephanMeijer/pandoc that referenced this issue Nov 5, 2023
StephanMeijer added a commit to StephanMeijer/pandoc that referenced this issue Nov 6, 2023