Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docx reader: common class names don't work in different languages. #1692

Closed
lierdakil opened this issue Oct 14, 2014 · 16 comments
Closed

Docx reader: common class names don't work in different languages. #1692

lierdakil opened this issue Oct 14, 2014 · 16 comments

Comments

@lierdakil
Copy link
Contributor

Same as issue #1607, but for docx reader. There is limited internationalization support in reader with prefix list, but this approach is rather clunky.
I advocate use of <name val="heading #"> fallback for non-english locales.

lierdakil added a commit to lierdakil/pandoc that referenced this issue Oct 14, 2014
@lierdakil
Copy link
Contributor Author

I would be thankful for tips on how to avoid using CharStyleMap for paragraph styles (since most fields are unused in this case) with minimal changes to code base. Since haskell is not really my primary (or even secondary) language, this task is beyond my comprehension.

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

Well, CharStyleMap is just an alias for an M.Map String RunStyle. It doesn't really cost anything to define another alias for M.Map String ParagraphStyle.

I don't really like the string argument to archiveToStyles. In haskell, you can do this in a more typesafe way by defining a type data Foo = This | That | TheOther.

Beyond that, though, I'd say that the reuse of CharStyleMap, and the functions leading to it, leads to more trouble than it's worth. You could have a ParStyleMap, and then have archiveToStyles output both sorts, i.e.

archiveToStyles :: Archive -> (CharStyleMap, ParStyleMap)

It would be clearer, and avoid the extra parameter.

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

I'd be happy to try to build off this patch to implement those changes. It might take a day or two for me to get to it, though.

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

And thanks for your work on this!

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

By the way, I take it you have an internationalized version of word, right? What about block quotes?

@lierdakil
Copy link
Contributor Author

Yes, I do have internationalized version of Word. Russian, to be precise. My workflow is not using blockquotes, however, so I never checked. Let me check and get back to you.

Also, please see my patches for #1607 when you have some time. I think those turned out a little better.

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

I'll take a look at the writer patches soon -- though probably not till
Thursday or so. A brief glance at the discussion suggests I'd probably
say the same things that Matt was saying, but I'll dig in further when I
can.

I'll pull these changes into a branch of mine, and then see if I can
simplify them a little. If you find out about block quotes, I'll try to
integrate that as well.

Lists are actually something else that should be dealt with here, but I
have this big rewrite of the list code that I haven't had time to
finish, and I want to see if I can get to that first.

Yes, I do have internationalized version of Word. Russian, to be
precise. My workflow is not using blockquotes, however, so I never
checked. Let me check and get back to you.

Also, please see my patches for #1607 when you have some time. I think
those turned out a little better.

@lierdakil
Copy link
Contributor Author

Okay, about block quotes. Same as with headers, styleId is mangled beyond recognition (e.g. Quote style has styleId="20", believe it or not), but <name> child has english-language val, although no "Block Quote"s as one might expect. Regrettably, I have no access to US version of Word, so I can't compare against that.
As I said, there is actually no <name val="Block Quote"> style in word-created document. There are, however "Block Text" and "Quote" (interpreted as "Цитата" and "Цитата2" in word GUI -- that is "Quote" and "Quote2" in Russian). So "BlockQuote" style is treated as custom. As such, reader does not recognize "Block Text" or "Quote" as block quotes, and writer simply creates new custom style (which is fine for writer, in my opinion).

@jkr
Copy link
Collaborator

jkr commented Oct 14, 2014

Could you email me your Russian doc? Email address in the header of the docx reader source.

@lierdakil
Copy link
Contributor Author

Sure. No idea what you might want besides quotes and headings, but those are included.

jkr pushed a commit to jkr/pandoc that referenced this issue Oct 14, 2014
Signed-off-by: Jesse Rosenthal <jrosenthal@jhu.edu>
lierdakil added a commit to lierdakil/pandoc that referenced this issue Oct 22, 2014
@lierdakil
Copy link
Contributor Author

Okay, I made my code a bit more clean and extensible in separate branch. I dislike the fact that I had to either use ad-hoc polymorphism, or have two variants of each function called from archiveToStyles, but at least this code is easily extended for other special cases.

@lierdakil
Copy link
Contributor Author

Or here is another bright idea: we could probably just "demangle" paragraph styleIds based on a lookup table, e.g. if given paragraph has style with id = "Titre1", and it's name is "heading 1", we could just write "Heading1" instead of "Titre1" into ParagraphStyle pStyle. Although this sounds unnecessarily hacky to me, it saves from a headache of maintaining special fields in ParagraphStyle.

@jkr
Copy link
Collaborator

jkr commented Oct 25, 2014

I much prefer the approach in 1692-alt, and it passes tests -- but it doesn't pick up on the block quotes, because getBlockQuote tests for styleId before it tests for styleName. So if it has some name like "21" or "a4" (as in your Russian doc), it gives a "False." The most straightforward to take care of this would be to change the order of the guards:

getBlockQuote :: NameSpaces -> Element -> Maybe Bool
getBlockQuote ns element
  | Just styleName <- findChild (elemName ns "w" "name") element >>=
                      findAttr (elemName ns "w" "val")
                    = Just $ styleName `elem` blockQuoteStyleNames
  | Just styleId <- findAttr (elemName ns "w" "styleId") element
                  = Just $ styleId `elem` blockQuoteStyleIds
getBlockQuote _ _ = Nothing

There are some refinements (do we combine two consecutive blockquotes with different styles the way we do two normal blockquotes, or not?) but this will be an improvement.

Also: there should be a more infomative commit message. Could you repush (or just squash and push --force to the same branch) with the above fix and a more informative commit? No need to do a pull request -- just let me know it's up. Then I'll merge it and add a test case based on the Russian file you sent me.

@lierdakil
Copy link
Contributor Author

Right. I should have these bools as guards. Not sure what I was thinking when pushing. Just reordering is no-go, since that won't catch inherited styles. I will push in a minute.

lierdakil added a commit to lierdakil/pandoc that referenced this issue Oct 25, 2014
This patch builds paragraph styles tree, then checks if paragraph has
style.styleId or style/name.val matching predetermined patterns.
Works with "Heading#" (name.val="heading #") for headings and
"Quote"|"BlockQuote"|"BlockQuotation" (name.val="Quote"|"Block Text")
for block quotes.
@lierdakil
Copy link
Contributor Author

Ok, all done. Squashed into 16a51f7

jkr pushed a commit that referenced this issue Oct 25, 2014
This patch builds paragraph styles tree, then checks if paragraph has
style.styleId or style/name.val matching predetermined patterns.
Works with "Heading#" (name.val="heading #") for headings and
"Quote"|"BlockQuote"|"BlockQuotation" (name.val="Quote"|"Block Text")
for block quotes.
@jkr
Copy link
Collaborator

jkr commented Oct 25, 2014

Pushed. Thanks so much for keeping on top of this! I'll close this issue, since any further refinements should probably be its own thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants