BITS reader #7740

Open
kamoe opened this issue Dec 8, 2021 · 13 comments

Comments

@kamoe
Contributor

kamoe commented Dec 8, 2021

New BITS reader

Support for BITS XML, the book extension of JATS XML.

As part of an academic project, I am exploring ways to develop a tool to transform BITS XML into DOCX. This is relevant for the use case of academic book publishing, where XML archives of previous editions need to be transformed into DOCX so that authors can work on the new edition. This is a recurrent scenario, and academic publishers today spend considerable time and money on third-party conversions that could be handled easily and efficiently in-house.

Since this is a scheduled project with time and deadlines assigned to it (full or partial completion by September 2022 at the latest), I will develop a full or partial version of the tool.

As per recent discussion (https://groups.google.com/g/pandoc-discuss/c/E5J9-qevSEk) this seems to be a relevant and welcome addition to Pandoc.

Alternatives
I have explored OxGarage and transpect as well, and also the option of a completely standalone Java tool developed from scratch. A pandoc BITS reader (and later a pandoc BITS writer) seems to be the easiest and most straightforward solution as of now.

@jgm
Owner

jgm commented Dec 8, 2021

If BITS is an extension of JATS, then it might be good to explore developing this capacity as a modification of the current JATS writer, rather than a new module. (That avoids lots of duplicated code.) Note that the JATS writer already exports several functions for different JATS variants; the same strategy could be used, perhaps, for BITS?

(Just to be clear, I wouldn't want to merge a separate BITS module if BITS is too similar to JATS; that just makes maintenance difficult going forward.)

@kamoe
Contributor Author

kamoe commented Dec 8, 2021 via email

@jgm
Owner

jgm commented Dec 8, 2021

If BITS is strictly an extension of JATS, then they could be handled in the same reader. The reader could have something in State, for example, that tells it whether to allow BITS extensions. It could export a separate function readBits that enables this.

@kamoe
Contributor Author

kamoe commented Dec 9, 2021 via email

@jgm
Owner

jgm commented Dec 9, 2021

Even if BITS isn't strictly a superset of JATS, it still might make sense to implement it as a variant in the JATS module -- it depends on the extent of the divergences, I guess.

Another alternative would be to extract some of the common code into an internal module.

@kamoe
Contributor Author

kamoe commented Mar 29, 2023

From what I now understand of the JATS reader (15 months after my first comment!), it seems to me that the easiest thing to do would be to just enhance the existing JATS reader (to also support BITS). Just by adding the cases:

    "collection-meta" -> parseMetadata e
    "book-meta" -> parseMetadata e

to the parseBlock function, the reader would already start supporting BITS metadata elements without much further effort. Of course, that is just the beginning, and it will be necessary to add more than a few other cases, and further functions to fully support all essential elements, but as a quick start, that is what I would do... then I would see if it makes sense to split into two readers/common modules later on?
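To illustrate the idea of routing the two BITS metadata elements through the same handler, here is a minimal, self-contained sketch. It is not pandoc's code: the `Element` and `Output` types, the toy `parseMetadata`, and the `"article-meta"` and `"p"` cases are stand-ins assumed only for this example.

```haskell
-- Toy stand-ins, assumed for illustration; these are not pandoc's types.
data Element = Element
  { elName :: String  -- tag name, e.g. "book-meta"
  , elText :: String  -- flattened text content
  }

data Output
  = Metadata String   -- element was routed to the metadata handler
  | Paragraph String  -- element was treated as body text
  | Skipped String    -- element name is not handled
  deriving (Eq, Show)

-- Hypothetical metadata handler: it only records the element's text.
parseMetadata :: Element -> Output
parseMetadata e = Metadata (elText e)

-- Dispatch on the element name. The two BITS cases reuse the handler
-- already used for the (assumed) JATS metadata case.
parseBlock :: Element -> Output
parseBlock e = case elName e of
  "article-meta"    -> parseMetadata e  -- assumed existing JATS case
  "collection-meta" -> parseMetadata e  -- BITS
  "book-meta"       -> parseMetadata e  -- BITS
  "p"               -> Paragraph (elText e)
  other             -> Skipped other
```

In GHCi, `parseBlock (Element "book-meta" "My Book")` evaluates to `Metadata "My Book"`, the same shape of result that the `"article-meta"` case produces.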

@jgm
Owner

jgm commented Mar 29, 2023

If BITS is basically just JATS plus a few extra elements, then I think that's definitely the way to go.

One way to handle this is to have a parameter in the JATSState that controls the "variant" -- settings could be BITS and JATS.

The reader could check this variant in places where the behavior would diverge.

The module could then export two functions, readJATS and readBITS, which set this state parameter differently but otherwise do the same thing.
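A rough sketch of that shape, under heavy simplification: the input is just a list of element names, the output a list of labels, and the names `Variant`, `ReaderState`, `readWith`, `readJATSSketch`, and `readBITSSketch` are hypothetical. Treating `book-part` as a BITS-only element that the JATS variant ignores is an assumption made for the example, not a statement about how the real reader behaves.

```haskell
-- Which flavour of the format the reader accepts.
data Variant = JATS | BITS
  deriving (Eq, Show)

-- Hypothetical reader state; a real state record would hold much more.
newtype ReaderState = ReaderState { rsVariant :: Variant }

-- Shared worker: both entry points run exactly this code and differ
-- only in the state they start from.
readWith :: ReaderState -> [String] -> [String]
readWith st = concatMap (convert (rsVariant st))

-- The variant is consulted only where behaviour diverges.
convert :: Variant -> String -> [String]
convert _    "sec"       = ["Header"]  -- common to both variants
convert BITS "book-part" = ["Header"]  -- assumed BITS-only element
convert _    _           = []          -- everything else is ignored

-- Two exported entry points that set the variant differently.
readJATSSketch, readBITSSketch :: [String] -> [String]
readJATSSketch = readWith (ReaderState JATS)
readBITSSketch = readWith (ReaderState BITS)
```

With this arrangement, `readJATSSketch ["sec", "book-part"]` gives `["Header"]`, while `readBITSSketch ["sec", "book-part"]` gives `["Header", "Header"]`; the shared worker is untouched when a new variant-specific case is added.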

@kamoe
Contributor Author

kamoe commented Apr 8, 2023

Makes sense. Just to summarize and double check the proposed approach:

  1. Modify the existing JATSState to add a "variant" parameter with value BITS or JATS
  2. Modify the existing readJATS function, to set the new variant to JATS
  3. Write a new readBITS function derived from readJATS, and that sets the new variant to BITS
  4. Modify the parseBlock function to consider additional cases to accommodate completely new BITS elements (that never occur in JATS)
  5. Modify the parseBlock function to check the variant value in those cases where behavior diverges for BITS, and provide alternative/additional behavior for those cases

Am I getting this right?

@kamoe
Contributor Author

kamoe commented Apr 25, 2023

Actually, I just realized there is already a boolean "variant" parameter in the JATSState: jatsBook:

    instance Default JATSState where
      def = JATSState{ jatsSectionLevel = 0
                     , jatsQuoteType = DoubleQuote
                     , jatsMeta = mempty
                     , jatsBook = False
                     , jatsFootnotes = mempty
                     , jatsContent = []
                     , jatsInFigure = False }

Given that the JATS reader was written based on the DocBook reader, and that the DocBook spec supports both articles and books, it makes sense that a boolean variant flag exists there (called dbBook).

In DocBook, when the reader encounters book-only content, this flag is set to True:

    "chapter" -> modify (\st -> st{ dbBook = True}) >> sect 0
    "part" -> modify (\st -> st{ dbBook = True}) >> sect (-1)

    "book" -> modify (\st -> st{ dbBook = True }) >>
                addMetadataFromElement e >> getBlocks e

And when dealing with article content, it is set to False:

    "article" -> modify (\st -> st{ dbBook = False }) >>
                   addMetadataFromElement e >> getBlocks e
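The pattern in these DocBook snippets, where the flag is driven by the elements encountered rather than by the entry point, can be sketched without a state monad as a left fold over element names. The `St` type and the `step` and `scan` functions are hypothetical names for this example; only the element names and the True/False transitions come from the snippets above.

```haskell
-- Hypothetical stand-in for the reader state; only the book flag is kept.
newtype St = St { isBook :: Bool }
  deriving (Eq, Show)

-- Update the flag when a book-only or article-only element is seen,
-- mirroring the `modify` calls quoted above.
step :: St -> String -> St
step st name
  | name `elem` ["book", "chapter", "part"] = st { isBook = True }
  | name == "article"                       = st { isBook = False }
  | otherwise                               = st

-- Walk the element names in document order, starting from "not a book".
scan :: [String] -> St
scan = foldl step (St False)
```

For instance, `isBook (scan ["book", "chapter", "para"])` is `True`, and `isBook (scan ["article", "para"])` is `False`.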

It seems dbBook was copied into the JATS reader as jatsBook, but it is never really used, since it is never set to True. This makes sense, since JATS only supports articles. The first two lines of the sect function thus serve no purpose; n' is always n:

    sect n = do isbook <- gets jatsBook
                let n' = if isbook || n == 0 then n + 1 else n

Until now. Seems like we could use this to start to model BITS for jatsBook = true...
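To see what flipping the flag would change, the level computation from `sect` can be pulled out into a pure function. This is only a sketch: `sectLevel` is a hypothetical name, and the expression is copied from the snippet above with the state lookup replaced by an explicit `Bool` argument.

```haskell
-- Pure version of the level adjustment in `sect`; the Bool plays the
-- role of the jatsBook flag.
sectLevel :: Bool -> Int -> Int
sectLevel isBook n = if isBook || n == 0 then n + 1 else n

-- Adjusted levels for a few nesting depths, for both flag settings.
articleLevels, bookLevels :: [Int]
articleLevels = map (sectLevel False) [0, 1, 2, 3]
bookLevels    = map (sectLevel True)  [0, 1, 2, 3]
```

Here `articleLevels` is `[1, 1, 2, 3]` and `bookLevels` is `[1, 2, 3, 4]`. With the flag off, the result equals `n` for every input except 0 (the thread does not say whether the JATS reader ever calls `sect` with 0); with the flag on, every section is pushed one level down, which leaves the top level free for book-level divisions.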

@kamoe
Contributor Author

kamoe commented Aug 23, 2023

@jgm After a thorough look, I believe it is possible to have a minimal BITS-enabled reader purely by adding a few lines to the JATS reader. I think this is the simplest way to do it.

The main point is to make use of the existing jatsBook variant in the JATSState, as I explain here.

I created a first draft for you to have an idea of what I mean here: #9016

This should already produce a decent AST from a BITS document, but it is by no means definitive. I would add a few more lines to account for a few additional BITS-only elements, via alternative treatment relying on the jatsBook boolean value. If we find that this starts to diverge significantly down the road, then we could envisage the creation of a separate BITS.hs file, but as it is now, I think it makes sense to have both formats incorporated in the one JATS.hs reader.

What do you think?

@jgm
Owner

jgm commented Aug 23, 2023

Agreed, this plan makes sense.

@kamoe
Contributor Author

kamoe commented Oct 15, 2023

Update: I have written a new clean-slate PR here. It incorporates the minimal BITS behaviours required for an equivalent BITS reader (equivalent coverage to JATS, same limitations, etc.). It should still be consistent with older JATS behaviours, but I cannot guarantee that until I have finished the unit tests I'd like to add (hence still marking it as draft). I will try to complete those this week, and then I think this should be in good shape for a first review.

@kamoe
Contributor Author

kamoe commented Oct 24, 2023

@jgm All unit tests are finished and passing. See my latest comment on the PR.


3 participants