Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fb2 to html5 loose images after conversion #5847

Open
XCanG opened this issue Oct 23, 2019 · 3 comments
Open

fb2 to html5 loose images after conversion #5847

XCanG opened this issue Oct 23, 2019 · 3 comments

Comments

@XCanG
Copy link

XCanG commented Oct 23, 2019

pandoc 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1


I have fb2 with images, however I need to create html5 file to be able to read it later, but the problem, what image data are lost after conversion.

Code example:

<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
...
<body>
<title>...</title>
<image l:href="#img_10887"/>
<p>...</p>
...
</body>
<binary id="img_10887" content-type="image/jpeg">...</binary><binary id="img_13353" content-type="image/jpeg">...</binary>...
</FictionBook>

between binary tags there a lot of data, image itself, and there several images used there.

Tags are converted into this:

<img src="img_10887" />

But image info are not copied.

What you can do here is paste base64 images straight into src tag, example 1x1px transparent png:

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==">
@jgm
Copy link
Owner

jgm commented Oct 24, 2019

Have you tried with --extract-media=PATH?

@XCanG
Copy link
Author

XCanG commented Oct 25, 2019

No, I didn't know about that parameter. With even --extract-media=. it would work, but it not create extension for image which is a bit of a problem.

Is there any way to convert and put base64 into src="" argument itself?

For part data:image/XXX;base64, you may determine by reading binary data of first characters and check which format is encoding here.

I have some example from other language

This is python language and it determine a lot types of media, however if you remove and leave only to, probably, jpg, png, gif that would be fine (webp also image, but it probably wont support in other formats, tiff not used as embed image as well).

Built-in libraries from code example: struct and os

  def get_image_metadata(file_path: str):
    size = os.path.getsize(file_path)

    with open(file_path, "rb") as input:
        width, height = -1, -1
        data = input.read(32)

        if (size >= 10) and data[:6] in (b"GIF87a", b"GIF89a"):
            # GIF
            type, format = "img", GIF
            w, h = struct.unpack("<HH", data[6:10])
            width, height = int(w), int(h)
        elif ((size >= 24) and data.startswith(b"\211PNG\r\n\032\n") and (data[12:16] == b"IHDR")):
            # PNG
            type, format = "img", PNG
            w, h = struct.unpack(">LL", data[16:24])
            width, height = int(w), int(h)
        elif (size >= 16) and data.startswith(b"\211PNG\r\n\032\n"):
            # older PNG
            type, format = "img", PNG
            w, h = struct.unpack(">LL", data[8:16])
            width, height = int(w), int(h)
        elif (size >= 2) and data.startswith(b"\377\330"):
            # JPEG
            type, format = "img", JPEG
            input.seek(0)
            input.read(2)
            b = input.read(1)
            try:
                while (b and ord(b) != 0xDA):
                    while (ord(b) != 0xFF):
                        b = input.read(1)
                    while (ord(b) == 0xFF):
                        b = input.read(1)
                    if (ord(b) >= 0xC0 and ord(b) <= 0xC3):
                        input.read(3)
                        h, w = struct.unpack(">HH", input.read(4))
                        break
                    else:
                        input.read(int(struct.unpack(">H", input.read(2))[0]) - 2)
                    b = input.read(1)
                width, height = int(w), int(h)
            except struct.error as e:
                return None # UnknownImageFormat("StructError" + msg)
            except ValueError as e:
                return None # UnknownImageFormat("ValueError" + msg)
            except Exception as e:
                return None # UnknownImageFormat(e.__class__.__name__)
        elif (size >= 26) and data.startswith(b"BM"):
            # BMP
            type, format = "img", BMP
            headersize = struct.unpack("<I", data[14:18])[0]
            if headersize == 12:
                w, h = struct.unpack("<HH", data[18:22])
                width, height = int(w), int(h)
            elif headersize >= 40:
                w, h = struct.unpack("<ii", data[18:26])
                width, height = int(w), abs(int(h)) # as h is negative when stored upside down
            else:
                return None # UnknownImageFormat("Unkown DIB header size:" + str(headersize))
        elif (size >= 8) and data[:4] in (b"II\052\000", b"MM\000\052"):
            # Standard TIFF, big- or little-endian
            # BigTIFF and other different but TIFF-like formats are not
            # supported currently
            type, format = "img", TIFF
            byteOrder = data[:2]
            boChar = ">" if byteOrder == "MM" else "<"
            # maps TIFF type id to size (in bytes)
            # and python format char for struct
            tiffTypes = {
                1: (1, boChar + "B"),   # BYTE
                2: (1, boChar + "c"),   # ASCII
                3: (2, boChar + "H"),   # SHORT
                4: (4, boChar + "L"),   # LONG
                5: (8, boChar + "LL"),  # RATIONAL
                6: (1, boChar + "b"),   # SBYTE
                7: (1, boChar + "c"),   # UNDEFINED
                8: (2, boChar + "h"),   # SSHORT
                9: (4, boChar + "l"),   # SLONG
                10: (8, boChar + "ll"), # SRATIONAL
                11: (4, boChar + "f"),  # FLOAT
                12: (8, boChar + "d")   # DOUBLE
            }
            ifdOffset = struct.unpack(boChar + "L", data[4:8])[0]
            try:
                countSize = 2
                input.seek(ifdOffset)
                ec = input.read(countSize)
                ifdEntryCount = struct.unpack(boChar + "H", ec)[0]
                # 2 bytes: TagId + 2 bytes: type + 4 bytes: count of values + 4
                # bytes: value offset
                ifdEntrySize = 12
                for i in range(ifdEntryCount):
                    entryOffset = ifdOffset + countSize + i * ifdEntrySize
                    input.seek(entryOffset)
                    tag = input.read(2)
                    tag = struct.unpack(boChar + "H", tag)[0]
                    if (tag == 256 or tag == 257):
                        # if type indicates that value fits into 4 bytes, value
                        # offset is not an offset but value itself
                        type = input.read(2)
                        type = struct.unpack(boChar + "H", type)[0]
                        if type not in tiffTypes:
                            return None # UnknownImageFormat("Unkown TIFF field type:" + str(type))
                        typeSize, typeChar = tiffTypes[type][0:2]
                        input.seek(entryOffset + 8)
                        value = input.read(typeSize)
                        value = int(struct.unpack(typeChar, value)[0])
                        if tag == 256:
                            width = value
                        else:
                            height = value
                    if width > -1 and height > -1:
                        break
            except Exception as e:
                return None
        elif size >= 0 and data[4:8] == b"ftyp" and data[8:12] in [b"isom", b"mp42", b"mp41", b"qt "]:
            # MP4
            type, format = "video", MP4
        elif size >= 0 and data[24:32] == b"matroska":
            # MKV
            type, format = "video", MKV
        elif size >= 0 and data[26:30] == b"webm":
            # WEBM
            type, format = "video", WEBM
        elif size >= 0 and data[0:4] == b"RIFF" and data[8:12] == b"WEBP":
            # WEBP https://developers.google.com/speed/webp/docs/riff_container
            type, format = "img", WEBP
            # static
            if data[12:16] == b"VP8X":
                width = int.from_bytes(data[24:27], byteorder="little") + 1
                height = int.from_bytes(data[27:30], byteorder="little") + 1
            elif data[12:16] == b"ANMF":
                width = int.from_bytes(data[18:21], byteorder="little") + 1
                height = int.from_bytes(data[21:24], byteorder="little") + 1
            else:
                return None
        elif size >= 2: #correct that
            # see http://en.wikipedia.org/wiki/ICO_(file_format)
            type, format = "img", ICO
            input.seek(0)
            reserved = input.read(2)
            if 0 != struct.unpack("<H", reserved)[0]:
                return None
            format = input.read(2)
            if 1 != struct.unpack("<H", format)[0]:
                return None
            num = input.read(2)
            num = struct.unpack("<H", num)[0]
            if num > 1:
                import warnings
                warnings.warn("ICO File contains more than one image")
            # http://msdn.microsoft.com/en-us/library/ms997538.aspx
            w, h = input.read(1), input.read(1)
            width, height = ord(w), ord(h)
        else:
            return None

    return {"path": file_path, "type": type, "format": format, "file_size": size, "width": width, "height": height}

@wiscat
Copy link

wiscat commented Jan 5, 2021

Also, this option can be useful, to built-in images into the result:
https://pandoc.org/MANUAL.html#option--self-contained

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants