Page-break in other output formats than LaTeX #1934

Open
todd-a-jacobs opened this issue Feb 10, 2015 · 31 comments
Comments

@todd-a-jacobs

Pagebreaks Don't Work for Most Output Formats

I have a Markdown file that is supposed to have pagebreaks between certain sections. However, Pandoc 1.10.1 isn't honoring the \newpage or \pagebreak commands when rendering RTF, DOCX, or ODT formatted files. The commands I'm using to invoke pandoc are:

for format in rtf docx odt; do
    pandoc \
        --smart \
        --normalize \
        --standalone \
        --self-contained \
        -f markdown \
        -t $format \
        --output="${FILE/markdown/$format}" \
        "$FILE"
    echo "Created ${FILE/markdown/$format}"
done

PDF Seems to Work

However, the PDF format (which requires a slightly different invocation because it doesn't respect the -t flag) seems to respect the pagebreak requests. For example:

pandoc \
    --standalone \
    --normalize \
    --smart \
    --self-contained \
    --from=markdown \
    --output="${FILE/markdown/pdf}" \
    "$FILE"
echo "Created ${FILE/markdown/pdf}"
@jgm
Owner

jgm commented Feb 10, 2015 via email

@todd-a-jacobs
Author

A PageBreak element would be great, but I'd be happy to use a filter in the meantime. However, I'm not sure what's entailed in doing so. How would I generate a DOCX with forced page breaks using a filtering mechanism?

@jkr
Collaborator

jkr commented Feb 19, 2015

@CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

@s7726

s7726 commented Mar 20, 2015

@CodeGnome If your page breaks happen to be prior to a given heading level, you can just set the page break before property for that heading style.

@Hi-Angel

I am also voting for this feature to be added: many formats have something corresponding to a page break (even CSS has properties like page-break-*).

@Hi-Angel

Hi-Angel commented Aug 1, 2015

Hi, I'm just looking through the code in the hope of adding the pagebreak, and some features, and I found, well… Has @jgm noticed the two-year-old pull request?

@jgm
Owner

jgm commented Aug 1, 2015

+++ Hi-Angel [Aug 01 15 08:28 ]:

Hi, I'm just looking through the code in the hope of adding the pagebreak,
and some features, and I found, well… Has @jgm noticed the two-year-old
pull request?

Adding a NewPage element to the definition and builder is trivial.
But then you need to support it in every reader and writer;
that's a lot more work.

@oadam

oadam commented Oct 14, 2016

If a pull request adding support for NewPage were submitted (including support in every reader and writer), would it be accepted?
I really need this feature and I'm ready to spend time on this.

@jgm
Owner

jgm commented Oct 14, 2016

Yes, I'd accept it if it's of good quality.

Note, it requires a breaking change in pandoc-types.
I'd like to make a new release soon of pandoc-types
(which already has breaking changes) and pandoc.
If you plan to do this soon I could wait a bit.

How do you propose to treat output formats with
nothing corresponding to a page break?

Would it make sense, perhaps, to render it as a

Div ("",["pagebreak"],[]) []

which could at least be intercepted in filters?
This could even be a native pandoc way of creating it.
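
A minimal sketch of how such a Div could be intercepted by a filter, written in Python against pandoc's JSON AST. It assumes the current JSON layout (objects with "t" and "c" keys) and that docx is the target, so the replacement is a raw OpenXML block. The names replace_pagebreaks and is_pagebreak_div are made up for this illustration and are not part of pandoc.

```python
import json

# Raw OpenXML for a paragraph that holds only a page break.
PAGEBREAK_XML = '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'


def is_pagebreak_div(node):
    """True for an empty Div whose classes include "pagebreak"."""
    if not (isinstance(node, dict) and node.get("t") == "Div"):
        return False
    (_ident, classes, _attrs), blocks = node["c"]
    return "pagebreak" in classes and not blocks


def replace_pagebreaks(node):
    """Rebuild the JSON tree, swapping pagebreak Divs for raw OpenXML blocks."""
    if is_pagebreak_div(node):
        return {"t": "RawBlock", "c": ["openxml", PAGEBREAK_XML]}
    if isinstance(node, dict):
        return {key: replace_pagebreaks(value) for key, value in node.items()}
    if isinstance(node, list):
        return [replace_pagebreaks(value) for value in node]
    return node


# Hand-written sample document (assumed shape, not produced by pandoc here).
SAMPLE = {
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [{"t": "Str", "c": "before"}]},
        {"t": "Div", "c": [["", ["pagebreak"], []], []]},
        {"t": "Para", "c": [{"t": "Str", "c": "after"}]},
    ],
}

print(json.dumps(replace_pagebreaks(SAMPLE)))
```

To use this as a real filter, the script would read the document with json.load(sys.stdin), write the result with json.dump to sys.stdout, and be passed to pandoc via --filter.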

@oadam

oadam commented Oct 14, 2016

I'll follow whatever recommendation you give :-)

If your code snippet means empty div with a pagebreak css class then yes that might be a good idea (it could be parsed as well by the html reader).

Maybe the writer could even add an inline style attribute with page-break-after: always?

No need to wait for this before pushing your breaking change. To be honest, I won't look into it before at least a few weeks but it's definitely something that is on my business' road-map.

@s7726

s7726 commented Oct 14, 2016

Putting a class on an empty div won't work (or at least be portable).

http://www.w3schools.com/cssref/pr_print_pageba.asp

Note: You cannot use this property on an empty <div> or on absolutely positioned elements.

I recently found the page-break-inside: avoid setting. I applied it to <li>'s that contained figures that needed to stay with that particular step in a procedure.

    @tarleb
    Collaborator

    tarleb commented Oct 14, 2016

    MDN states on page-break-before (emphasis mine):

    It won't apply on an empty <div> that won't generate a box.

    I guess with a little bit of CSS hackery, the div could still be made to generate a box.

    @jgm
    Owner

    jgm commented Oct 16, 2016

    OK, that's good to know. So implementing a page break in
    the HTML writer might be nontrivial...but it's also not
    really essential -- I think it would be okay if we just
    supported formats that typically produce paginated output
    (latex, docx, etc.).

    +++ Gavin S [Oct 14 16 11:44 ]:

    Putting a class on an empty div won't work (or at least be portable).

    http://www.w3schools.com/cssref/pr_print_pageba.asp

    Note: You cannot use this property on an empty <div> or on
    absolutely positioned elements.

    I recently found the page-break-inside: avoid setting. I applied it to <li>'s
    that contained figures that needed to stay with that particular step in
    a procedure.


    @Jmuccigr
    Contributor

    Would definitely like to see this.

    And really would like to see printed html handle this too, but that's probably out of scope for pandoc.

    @mb21 changed the title from "Pandoc doesn't honor \newpage or \pagebreak except for PDF output files." to "Page-break in other output formats than LaTeX" on Jan 22, 2017
    @mb21
    Collaborator

    mb21 commented Jan 22, 2017

    Some observations on how different formats handle page breaks:

    From the perspective of HTML/CSS, page breaking is about layout, not structure, and is thus implemented in CSS (with the page-break-before and page-break-after properties, as supported by wkhtmltopdf – note that they might be superseded by break-before and break-after but browser support is not forthcoming). As has been noted, these can only be applied to block level elements and the intended usage is to apply them to headers or section divs.

    In some reStructuredText processors, a pagebreak can apparently also be achieved by a block level directive.

    On the other hand, in more imperative document models (ODT, docx, etc.), a pagebreak usually seems to be an inline element. The pandoc AST already has inline LineBreak and SoftBreak elements, and one possible implementation would be to replace them with an inline Break element that has an attribute type=line, type=soft, type=page, type=column, etc. Note that implementing a native pandoc pagebreak element as an inline is more general than a block element, since the block element can always be simulated by wrapping an inline in an otherwise empty paragraph.
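
The last point can be sketched in Python using pandoc's JSON AST. Since there is no native Break element, this illustration substitutes a raw OpenXML run for the inline page break, which is an assumption that only makes sense for docx output. The helper name as_block is hypothetical.

```python
import json

# Raw OpenXML run containing a page break. It stands in for the proposed
# inline Break element with type=page, which pandoc's AST does not have.
INLINE_PAGEBREAK = {
    "t": "RawInline",
    "c": ["openxml", '<w:r><w:br w:type="page"/></w:r>'],
}


def as_block(inline):
    """Simulate a block-level break by wrapping an inline in an otherwise empty paragraph."""
    return {"t": "Para", "c": [inline]}


print(json.dumps(as_block(INLINE_PAGEBREAK)))
```

The reverse direction does not work: a block-only element could not be placed in the middle of a paragraph, which is why the inline form is the more general one.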

    Finally, from the perspective of markdown, I would probably use something like this:

    ------- {.pagebreak}
    

    @fabtho

    fabtho commented Feb 6, 2017

    I would like to see this implemented. I just tried to write a pandoc filter to get pagebreaks when converting md to ODT, but had no success. (I used the source on Google Groups, as mentioned above.)

    @link2xt
    Collaborator

    link2xt commented Jun 27, 2017

    Muse format also has pagebreaks: http://amusewiki.org/library/manual#toc7

    @mb21
    Collaborator

    mb21 commented Sep 8, 2017

    btw, iA Writer pagebreak syntax is:

    +++
    

    which produces:

    <div style="page-break-before: always;"></div>
    

    which webkit-based browsers seem to understand.

    @autotel

    autotel commented Oct 1, 2018

    another nice workaround:

    • insert a horizontal line -----------------
    • format the "horizontal line" style to break a page and be invisible, using the text editor (libre office in my case)

    @grenade

    grenade commented Jul 9, 2019

    @CodeGnome : see this thread for some hints on setting up a filter for pagebreaks in docx output:

    https://groups.google.com/forum/#!searchin/pandoc-discuss/pagebreak/pandoc-discuss/FzLrhk0vVbU/GtSHaI0jddAJ

    thanks for this! i went down this rabbit hole today. it was my first foray into haskell and i'm pleased to say that i am now standing next to a completely bald yak¹. here's what happened:

    the problem:

    i have a github gist containing markdown files. i have a react app that transforms these markdown files into an html web page. i wanted a way to transform the same markdown files into a hosted google doc that has built in docx and pdf output formats.

    the solution:

    write some bash that combines all of the gist's markdown files into a single markdown file and use pandoc to transform the markdown into docx format that can be uploaded as a google doc.

    the implementation:

    • use jq and the github gist api to produce a file containing the combined markdown
      • the trick here is to insert a separator (\n\n\\newpage\n\n) between the individual markdown files that pandoc can interpret as a block paragraph containing only a page-break.
    • run pandoc against the combined markdown file to convert it into docx format
      • here the trick is to correctly interpret the page-break separator tokens and use a filter to replace them with the correct docx xml separator syntax (<w:p><w:r><w:br w:type="page"/></w:r></w:p>).
      • create a haskell code file (docx-page-filter.hs) containing the filter (thank you Joel Allen and John MacFarlane):
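    {-# LANGUAGE OverloadedStrings #-}
    -- docx-page-filter.hs: replace paragraphs that consist solely of \newpage
    -- with a raw OpenXML page break (only the docx writer emits it).
    import Text.Pandoc.JSON
    
    -- A paragraph holding a single page-break run, as raw OpenXML.
    -- OverloadedStrings lets the literals work whether the installed
    -- pandoc-types uses String or Text for these fields.
    pagebreakBlock :: Block
    pagebreakBlock =
      RawBlock (Format "openxml") "<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>"
    
    -- Swap the marker paragraph for the page break; leave other blocks alone.
    blockSwapper :: Block -> Block
    blockSwapper (Para [Str "\\newpage"]) = pagebreakBlock
    blockSwapper blk = blk
    
    main :: IO ()
    main = toJSONFilter blockSwapper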
    • the code above requires compiling but ghc --make -v docx-page-filter.hs throws an error about not being able to import Text.Pandoc.JSON. i don't know what version of ghc was already installed on my fedora-30 system or where it came from.

      • download and install the distro build tools, the package manager and the pandoc dependencies:

        sudo dnf install ghc
        sudo dnf install cabal-install
        cabal update
        cabal install pandoc
        
      • go have a coffee now. maybe even go for a run or mow the lawn. you have some time...

    • if everything compiles, you can run a command like this to perform the conversion:

      pandoc combined.md --from gfm --filter docx-page-filter --to docx --output converted.docx
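
The separator trick from the first step can be sketched in Python instead of jq. This is only an illustration under the assumption that the individual markdown files are already available as strings. The function name combine_markdown is made up, and fetching the gist is left out.

```python
# A standalone paragraph containing only \newpage, with blank lines around it
# so that pandoc parses it as its own block.
PAGE_SEPARATOR = "\n\n\\newpage\n\n"


def combine_markdown(documents):
    """Join markdown sources so that each one starts on a new page."""
    # Strip surrounding newlines first so the separator is always framed
    # by exactly one blank line on each side.
    return PAGE_SEPARATOR.join(doc.strip("\n") for doc in documents) + "\n"


combined = combine_markdown(["# One\n\nfirst file\n", "# Two\n\nsecond file\n"])
print(combined)
```

The combined text can then be written to combined.md and passed to the pandoc command above.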
      

    @tarleb
    Collaborator

    tarleb commented Jul 23, 2019

    The Lua filters repository has a pagebreak filter which converts raw \newpage commands into page breaks for most formats.

    @ghost

    ghost commented Sep 12, 2019

    I wanted to note that Epub3 supports page breaks as well, although for possibly different use cases.

    A page list and page break indicators allow users in mixed print-digital environments to coordinate their positions.

    This is nice for preserving information about page numbers (e.g. for citations, printing, or accessibility such as audio cues) without interfering with the document layout.

    It supports both in-line and block page breaks.

    An empty span element identifies a page break inside a block element. It is identified as a page break using the role attribute with the value doc-pagebreak. The aria-label attribute provides an announceable value.

    <p><span role="doc-pagebreak" id="pg24" aria-label="24"/>
       …
    </p>

    A div element identifies a page break where inline elements are not allowed. This example shows an example of a page number that is intended to be visible in the content.

        <div role="doc-pagebreak" id="pg24">24</div>

    Some notes:

    • would need to keep a counter to mark the page numbers
    • intended to be placed at page beginnings, rather than endings
    • cannot be placed inside lists
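
A rough sketch of the counter idea in Python, producing the same markup as the two examples above. The function names pagebreak_marker and with_page_markers are hypothetical, and the sketch assumes the content has already been split into per-page chunks of HTML.

```python
from html import escape


def pagebreak_marker(page_number, inline=True):
    """Return an EPUB 3 page-break marker for the given page number."""
    label = escape(str(page_number), quote=True)
    if inline:
        # Empty span, for use inside a block element such as <p>.
        return f'<span role="doc-pagebreak" id="pg{label}" aria-label="{label}"/>'
    # Block form, with the page number visible in the content.
    return f'<div role="doc-pagebreak" id="pg{label}">{label}</div>'


def with_page_markers(pages, first_page=1):
    """Put a block marker at the beginning of each page, counting upwards."""
    out = []
    for number, body in enumerate(pages, start=first_page):
        out.append(pagebreak_marker(number, inline=False))
        out.append(body)
    return out


for line in with_page_markers(["<p>first page</p>", "<p>second page</p>"], first_page=24):
    print(line)
```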

    My personal preference is for formfeed chars to be interpreted as page breaks, at least in markdown. I use the pdftotext CLI to produce formfeed-delimited text files that can be turned into markdown for pandoc, and it would be great if those could be preserved.
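
Until something like that exists, one possible workaround is to preprocess the text before it reaches pandoc. The Python sketch below assumes pdftotext separates pages with a form feed character and also emits one after the last page. It rewrites each form feed as a standalone \newpage paragraph, which LaTeX output honors and which the pagebreak Lua filter mentioned earlier in this thread can convert for other formats. The function name formfeeds_to_newpage is made up.

```python
def formfeeds_to_newpage(text):
    """Replace each form feed (U+000C) with a standalone \\newpage paragraph."""
    pages = text.split("\f")
    # Drop the empty chunk left behind by a trailing form feed.
    if len(pages) > 1 and not pages[-1].strip():
        pages.pop()
    return "\n\n\\newpage\n\n".join(page.strip("\n") for page in pages) + "\n"


print(formfeeds_to_newpage("page one\n\fpage two\n\f"))
```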

    @jeffmcneill

    This might be somewhat related. Pagebreaks seem to be automatically supported in markdown->pdf in terms of H1s being recognized as new section headers, using:

      \usepackage{titlesec} 
      \newcommand{\sectionbreak}{\clearpage} 
    

    Also, when markdown->epub the same section headers H1 are recognized and page breaks are implemented. All fine and dandy.

    I'm wondering if it is possible somehow to have H2s recognized as section breaks as well. The main reason is because I need to have both H1 and H2 act as section breaks (page breaks).


    Ok, I've worked through these issues, and here is how I've dealt with them, so far: I've added \pagebreak before each new H2, that takes care of the latex/pdf side. For epub, I added the style:

    h2 {display: block;
        page-break-before: always; /* CSS 2 */
        break-before: page;   /* CSS 3+ */ }
    

    That seems to take care of the epub side.

    If anyone has additional suggestions/options especially for the latex/pdf side, that would be great, but otherwise I've got it working.

    @jgm
    Owner

    jgm commented Nov 25, 2019

    Try the same thing with \subsectionbreak?

    @jeffmcneill

    @jgm Excellent! It also suppresses a page break if an H2 directly follows an H1, which is what I want. I can't seem to do that with Epub/CSS, but an extra page in an ebook is less of an issue, whereas one has to pay for each page in print.

      \usepackage{titlesec} 
      \newcommand{\sectionbreak}{\clearpage} 
      \newcommand{\subsectionbreak}{\clearpage} 
    

    Here is documentation of the various section commands that can be used with package titlesec. http://tug.ctan.org/tex-archive/macros/latex/contrib/titlesec/titlesec.pdf

    @SandeepNaidu

    This still does not work for pandoc export to docx!

    @gmile

    gmile commented Jul 1, 2020

    Had to introduce page breaks to html files that are being converted to .docx, ended up with this script in Lua:

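    -- Replace any paragraph consisting solely of the word "Pagebreak"
    -- with a raw OpenXML page break (only the docx writer emits it).
    function Para (el)
      local first = el.content[1]
      if #el.content == 1 and first.t == "Str" and first.text == "Pagebreak" then
        return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
      end
    end
    
    return {
      {Para = Para}
    }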

    Given the following input:

    <html>
      <body>
        <p>Page 1</p>
        <p>Pagebreak</p>
        <p>Page 2</p>
        <p>Pagebreak</p>
        <p>Page 3</p>
      </body>
    </html>

    It can be used like this:

    pandoc input.html \
      --standalone \
      --lua-filter pagebreak.lua \
      --reference-doc my_styles.docx \
      --output output.docx
    

    @tarikgraba
    Contributor

    Hi there,

    Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader?
    This processing instruction is generated by asciidoctor/asciidoc when inserting a page break.

    It would be great to be able to convert DocBook to LaTeX without losing this info.

    @dwojtas

    dwojtas commented Feb 10, 2022

    Hi,
    I see no response to the <?asciidoc-pagebreak?> support request for the docbook reader, I would also benefit from this.
    I am processing documents

    • from asciidoc to docbook using asciidoctor
    • from docbook to docx using pandoc with custom docx template.

    The results are beautiful, but I must always post-process them by hand with Ctrl+Return to insert a page break before each new chapter.

    @jgm
    Owner

    jgm commented Feb 10, 2022

    Can support for <?asciidoc-pagebreak?> be added to the XML DocBook reader?

    There's no native AST element corresponding to a page break.

    @leogama

    leogama commented Mar 17, 2022

    The R package rmarkdown has a good page break filter: https://github.com/rstudio/rmarkdown/blob/main/inst/rmarkdown/lua/pagebreak.lua
