Pandoc alters the identifiers parsed from headers in HTML: lowercased, with -top appended #10882

@rafmartom

Explain the problem.

I have a script that runs against the identifiers of most of the element types found in an HTML file:

`parse-identifiers-issue.lua`

function parse_identifier(elem, elem_type)
    -- Discard elements with no id
    if elem.identifier == '' or elem.identifier == nil then
        return elem
    end

    print('[DEBUG] elem.identifier : ' .. elem.identifier) -- DEBUGGING
    print('[DEBUG] elem_type : ' .. elem_type) -- DEBUGGING

    return elem
end


return {
    { CodeBlock = function(e) return parse_identifier(e, "CodeBlock") end },
    { Div = function(e) return parse_identifier(e, "Div") end },
    { Figure = function(e) return parse_identifier(e, "Figure") end },
    { Header = function(e) return parse_identifier(e, "Header") end },
    { Table = function(e) return parse_identifier(e, "Table") end },
    { Code = function(e) return parse_identifier(e, "Code") end },
    { Image = function(e) return parse_identifier(e, "Image") end },
    { Link = function(e) return parse_identifier(e, "Link") end },
    { Span = function(e) return parse_identifier(e, "Span") end },
    { Cell = function(e) return parse_identifier(e, "Cell") end },
    { TableFoot = function(e) return parse_identifier(e, "TableFoot") end },
    { TableHead = function(e) return parse_identifier(e, "TableHead") end },
    { Para = function(e) return parse_identifier(e, "Para") end },
    { BlockQuote = function(e) return parse_identifier(e, "BlockQuote") end },
    { BulletList = function(e) return parse_identifier(e, "BulletList") end },
    { OrderedList = function(e) return parse_identifier(e, "OrderedList") end }
}

Run against the following web page:

curl -s https://www.man7.org/linux/man-pages/man0/aio.h.0p.html | pandoc -f html -t plain -o /dev/null -L ./parse-identifiers-issue.lua

it produces the following output:

[DEBUG] elem.identifier : aio.h0p-linux-manual-page
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : prolog-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : name-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : synopsis-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : description-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : application-usage-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : rationale-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : future-directions-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : see-also-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : copyright-top
[DEBUG] elem_type : Header
[DEBUG] elem.identifier : PROLOG
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : NAME
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : SYNOPSIS
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : DESCRIPTION
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : APPLICATION_USAGE
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : RATIONALE
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : FUTURE_DIRECTIONS
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : SEE_ALSO
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : COPYRIGHT
[DEBUG] elem_type : Link
[DEBUG] elem.identifier : top_of_page
[DEBUG] elem_type : Span

Where is pandoc getting prolog-top, name-top, etc. from?
All I can see in the .html is the following:

<h2><a id="PROLOG" href="#PROLOG"></a>PROLOG  &nbsp; &nbsp; &nbsp; &nbsp; <a href="#top_of_page"><span class="top-link">top</span></a></h2><pre>
<h2><a id="NAME" href="#NAME"></a>NAME  &nbsp; &nbsp; &nbsp; &nbsp; <a href="#top_of_page"><span class="top-link">top</span></a></h2><pre>

What is the point of generating them that way? Can someone break down how pandoc computed these identifiers?
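
For reference, the Header values in the output above are consistent with pandoc's documented `auto_identifiers` extension: when a header has no id of its own, one is derived from the header's text (here including the trailing "top" link text). The following is a minimal sketch of that rule in plain Lua. It is an ASCII-only approximation, not pandoc's implementation, and `header_text_to_id` is a hypothetical helper name.

```lua
-- ASCII-only approximation of the documented auto_identifiers rule.
-- header_text_to_id is a hypothetical helper, not part of pandoc's API.
local function header_text_to_id(text)
  -- Treat non-breaking spaces (&nbsp;) as ordinary spaces.
  local s = text:gsub("\u{00A0}", " ")
  -- Lowercase ASCII letters.
  s = s:lower()
  -- Drop everything except alphanumerics, whitespace, "_", "-" and ".".
  s = s:gsub("[^%w%s_%-%.]", "")
  -- Join the remaining words with single hyphens.
  local words = {}
  for w in s:gmatch("%S+") do
    words[#words + 1] = w
  end
  s = table.concat(words, "-")
  -- Remove everything before the first letter.
  s = s:gsub("^[^%a]+", "")
  -- Fall back to "section" when nothing is left.
  if s == "" then
    s = "section"
  end
  return s
end

print(header_text_to_id("PROLOG \u{00A0} \u{00A0} top"))   --> prolog-top
print(header_text_to_id("APPLICATION USAGE \u{00A0} top")) --> application-usage-top
```

Under this reading, the uppercase ids (PROLOG, NAME, ...) belong to the empty `<a>` elements, which is why they are reported as Link rather than Header.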

Pandoc version?

$ pandoc --version
pandoc 3.1.11.1
Features: -server +lua
Scripting engine: Lua 5.4
User data directory: /home/fakuve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Using Debian Trixie; pandoc compiled from source on an x64 machine.
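
A possible workaround, sketched as a Lua filter: copy the id of the anchor nested in each header onto the header itself. This assumes the anchor is a direct child of the header, as in the HTML shown earlier, and it leaves the Link's own id untouched, so both elements end up carrying the same identifier. Disabling the extension on the reader (`-f html-auto_identifiers`) should also stop the ids from being generated, though that is untested against the version above.

```lua
-- Sketch: reuse the id of an anchor that sits directly inside a header.
-- Assumes the <a id="..."> element is a direct child of the header.
function Header(h)
  for _, el in ipairs(h.content) do
    if el.t == "Link" and el.identifier ~= "" then
      -- Replace the generated id (e.g. prolog-top) with the anchor's id.
      h.identifier = el.identifier
      break
    end
  end
  return h
end
```

Headers without such an anchor keep whatever identifier pandoc assigned.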