epub noterefs across files not properly converted #5531

Open
alibou99 opened this issue May 29, 2019 · 15 comments

@alibou99

alibou99 commented May 29, 2019

PS C:\files\dev\Pandoc> pandoc --version
pandoc.exe 2.7.2
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Issue:
When I convert EPUB files to md, docx, or html, some of the text is missing. It happens when there is a note call (noteref).
Here is an example:

<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-002-1" href="p1chap2.xhtml#ntb-002">1</a>. 
Text2
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text3
</p>
<p class="txt_courant_justif">
Text4
</p>

In this example, Text2 and Text3 are missing from the output.

@jgm
Owner

jgm commented May 29, 2019

Simpler way to reproduce this:

% pandoc -f html+epub_html_exts -t native
<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-002-1" href="p1chap2.xhtml#ntb-002">1</a>. 
Text2
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text3
</p>
<p class="txt_courant_justif">
Text4
</p>
^D
[Para [Span ("page_36",[],[("type","pagebreak"),("title","36")]) [],SoftBreak,Str "Text1",SoftBreak,Str ""]
,Para [Str "Text4"]]

@alibou99
Author

I'm new to Git, so excuse me if I do not understand everything, but what does this mean?

@alibou99
Author

My source document is an EPUB 3.0 file, not an HTML file.

@jgm
Owner

jgm commented May 29, 2019

This gives a way to reproduce the underlying issue in a simpler way, without actually producing an epub (because the epub reader uses the html reader plus a special extension under the hood). It's really a "note to self" for me to diagnose this.

@alibou99
Author

Thank you very much. I just tested the conversion with Calibre and there is no problem: I get the whole text. However, with Calibre the notes are not recognized as notes.

@jgm
Owner

jgm commented May 29, 2019

Yes, pandoc is stumbling on notes that refer to another file, such as href="p1chap2.xhtml#ntb-002".
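
To make the distinction concrete, here is a minimal Python sketch (not pandoc's actual code, which is written in Haskell) that splits a noteref href into its file part and fragment and reports whether it points outside the current file. The function name `classify_noteref`, the `current_file` parameter, and the file name `p1chap1.xhtml` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urldefrag


def classify_noteref(href, current_file):
    """Return (kind, target_file, fragment) for a noteref href.

    kind is "same-file" when the href is a bare fragment or names
    current_file explicitly, and "cross-file" otherwise.
    """
    # urldefrag splits "file.xhtml#id" into ("file.xhtml", "id").
    target, fragment = urldefrag(href)
    if target == "" or target == current_file:
        return ("same-file", current_file, fragment)
    return ("cross-file", target, fragment)


# A bare fragment stays within the file being read.
print(classify_noteref("#fn1", "p1chap1.xhtml"))
# An href with a file part may point at another file in the EPUB.
print(classify_noteref("p1chap2.xhtml#ntb-002", "p1chap1.xhtml"))
```

Treating an href that names the current file as "same-file" is a design choice of this sketch; the thread does not say how pandoc handles that case.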

@jgm
Owner

jgm commented May 29, 2019

With the commit I just pushed, we now get:

[Para [Span ("page_36",[],[("type","pagebreak"),("title","36")]) [],SoftBreak,Str "Text1",SoftBreak,Link ("ap_ntb-002-1",["apnb"],[]) [Str "1"] ("p1chap2.xhtml#ntb-002",""),Str ".",SoftBreak,Str "Text2",SoftBreak,Link ("ap_ntb-003-1",["apnb"],[]) [Str "2"] ("p1chap2.xhtml#ntb-003",""),SoftBreak,Str "Text3"]
,Para [Str "Text4"]]

which is an improvement. The missing text is no longer missing. However, the noterefs are being parsed as links rather than proper noterefs, so there is still work to do.

@alibou99
Author

I work for a non-profit organization. We prepare books for digital braille so that they can be used by blind readers. For this, our pivot format is docx or RTF. For the moment Pandoc mostly does the job, but because of this missing-text problem I am reviewing the whole procedure with a view to switching to another tool. I hope that we will find a quick solution.

@jgm
Owner

jgm commented May 29, 2019

By tonight there should be a nightly available in pandoc-nightlies; this will at least solve the missing text problem.

@jgm changed the title from "issue converting from epub document" to "epub noterefs across files not properly converted" on May 29, 2019
@alibou99
Author

very good news, how can I benefit from this corrected version as quickly as possible ?

@alibou99
Author

I installed pandoc via chocolatey

@alibou99
Author

This is my first post on GitHub. Is there a specific command to update Pandoc on my computer and take advantage of the fix?
Thank you.

@jgm
Owner

jgm commented May 29, 2019

Here's a binary of the latest Windows build: https://ci.appveyor.com/project/jgm/pandoc/build/job/gy92q5at64l3e68q/artifacts

@jgm removed this from the 2.7.3 milestone on May 29, 2019
@alibou99
Author

alibou99 commented May 30, 2019

Thank you very much, it works very well and I'm no longer missing text.
Now I am looking at the footnote problem. Here are two examples. With the first one, the EPUB to docx conversion produces a Word document that recognizes the footnotes; with the second one, it does not.
example1: good one

<p class="nonindentb">Text1<a epub:type="noteref" class="noteref" id="fn-1" href="#fn1">1</a> Text2</p>

<div epub:type="footnote" id="fn1">
<p class="noindent0"><a class="link" href="#fn-1"><span style="color: #000000;">1</span></a>. Text...</p>
</div>

example2 : bad one

<p class="txt_courant_justif"><span epub:type="pagebreak" id="page_36" title="36"/>
Text1
<a class="apnb" epub:type="noteref" id="ap_ntb-003-1" href="p1chap2.xhtml#ntb-003">2</a>
Text2
</p>
<p class="txt_courant_justif">
Text4
</p>

<section class="defnotes" epub:type="footnotes">
<!--note--><aside class="ntb" epub:type="footnote" id="ntb-003">
<p class="txt_justif"><a href="p1chap2.xhtml#ap_ntb-003-1">2</a>. Text...</p></aside>
<!--note--></section></section>

I greatly appreciate your help

@jgm
Owner

jgm commented May 30, 2019

Yes, the problem is that pandoc currently will only pick up footnotes that are defined in the same file. In your second example the note is in a different file.
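
As a rough way to spot affected notes before converting, the following Python sketch scans one XHTML file and lists the noterefs that cannot be resolved to a footnote inside that same file. It is only an illustration of the limitation described above, not pandoc's implementation; the names `NoteScanner` and `unresolved_noterefs` are hypothetical, and the sketch assumes that any href carrying a file part counts as unresolved.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class NoteScanner(HTMLParser):
    """Collect noteref hrefs and footnote element ids from one file."""

    def __init__(self):
        super().__init__()
        self.noterefs = []         # href values of epub:type="noteref" anchors
        self.footnote_ids = set()  # ids of elements with epub:type="footnote"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # epub:type may hold several space-separated values.
        types = (attrs.get("epub:type") or "").split()
        if tag == "a" and "noteref" in types and attrs.get("href"):
            self.noterefs.append(attrs["href"])
        if "footnote" in types and attrs.get("id"):
            self.footnote_ids.add(attrs["id"])


def unresolved_noterefs(xhtml):
    """Return noteref hrefs with no matching footnote id in this text."""
    scanner = NoteScanner()
    scanner.feed(xhtml)
    scanner.close()
    missing = []
    for href in scanner.noterefs:
        target, fragment = urldefrag(href)
        # Assumption: an href with a file part is treated as cross-file.
        if target or fragment not in scanner.footnote_ids:
            missing.append(href)
    return missing
```

Applied to the first example above, the function returns an empty list because `#fn1` matches the footnote `div` in the same text. Applied to the paragraph from the second example, it reports `p1chap2.xhtml#ntb-003`.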
