-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: x/net/html: add offset & line tracking, case preservation #34302
Comments
/cc @nigeltao @namusyaka |
The short answer is that I'm not enthusiastic. Re Token offsets, you might already be able to report the line/column of a token, for error tracking: #28343 (comment) Node offsets (as opposed to Token offsets) are a tricker concept. A large part of the HTML5 parsing algorithm is fixing up HTML that wouldn't be well formed XML, such as dealing with improperly nested tags. This can create nodes that have no corresponding token. What offset should we give those? Also, some nodes are formed form multiple tokens: e.g. separate start and end tags. Will Node.Offset be sufficient, or would you also need Node.StartOffset and Node.EndOffset? Re OrigData, what's the garbage cost of all the extra strings? In any case, HTML is case insensitive. If you're trying to use HTML in a case sensitive way, then you're not really using HTML in an HTML5 way, and this package is about HTML as specified by HTML5. If you really must have case sensitive names, then you could possibly have a separate processing step to find all the |
Thanks @nigeltao, I've looked at this in even more detail now and it seems you're right - the things I'm asking can in fact be done with the existing package, although I would argue it's very not-obvious that this is the case. Following your suggestion from here #28343 and tracking offset := 0
for tt := z.Next(); tt != html.ErrorToken {
offset += len(z.Raw())
} In this case is The other issue is that this line https://github.com/golang/net/blob/master/html/token.go#L1158 lower cases the attribute key in-place before returning it. So while this is solvable in various ways (for example making a copy of EDIT: Bold meant only for readability of the key points. |
You're right. I sent out https://go-review.googlesource.com/c/net/+/198357 |
I'm going to close this issue. Re lower-casing keys, I'll repeat that HTML5 is specified to be case-insensitive, and in any case, it sounds like you can work around it. If you can't, feel free to re-open this issue with further thoughts. |
Thanks @nigeltao for the documentation update, appreciated! That handles the offset issue. Regarding the lower-casing, I'll split my reply up into three clear parts - the issue (what remains of it), the proposal and the motivation. I don't see a way for me to re-open this issue. I can make a new one if that's useful. Remaining IssueUsing ProposalHow about adding the following method:
It would have the same behavior as TagAttr except it would return the key and val as-is, without calling
MotivationI fully understand that the logic that per the HTML spec, attribute keys are case-insensitive, thus lower-casing them should have no negative impact on programs using the parser. If the lower-cased version is semantically the same, why not make it consistent - I get that. My main counter argument is that if allowing access to the original input does not muck up the library too much, then I don't see a reason to disallow callers to get at it. I don't have a way of knowing how many other people run into this issue, but I can summarize my specific use case: I am the primary author of Vugu (https://github.com/vugu/vugu), which is a library that (among other things) takes a hand-edited HTML document and writes corresponding Go code. This Go code runs in various environments (there's a in-browser WebAssembly case and server-side static file output case) to output a document (based on the original input) in that environment. Certain attributes, e.g. Documents are 100% valid HTML, Vugu just derives additional meaning and performs additional actions based on certain elements and attributes. I know this isn't the most common use case of this library, but I also don't think it's the most obscure. -- Hopefully this proposal is more workable. It is a much lower-impact change, just a single method added with no modifications to tokenizer state, no struct fields added, nothing like that. Please let me know what you think. |
Should the second
This program prints:
and e.g. the "bpi" and "BPI" cases are preserved. The html package lower-cases the attribute keys but not the values.
Can you give an example, then, if I looked at https://www.vugu.org/doc/files/markup but didn't see anything that needed case-sensitive attribute keys. I'm assuming that the |
@nigeltao Sorry about the confusion - the docs have not yet been updated to correspond with the latest code on master and this issue is coming up as part of a substantial refactor I did there. The relevant info is here: https://github.com/vugu/vugu/wiki/Refactor-Notes---Sep-2019 Example HTML from new root.vugu: <html><body><ul>
<main:DemoLine vg-for="i := 0; i < c.ItemCount; i++" vg-key="i" :Num="i"></main:DemoLine>
</ul></body></html> In this case I need |
I'll re-open the issue to consider adding a Nonetheless, I would recommend something like
|
Thanks, I get what you mean on the naming. Finding syntax for the various Vugu features that is concise, readable, works with Go identifier conventions, doesn't conflict with other HTML and is also internally consistent is an interesting challenge. I'll be reviewing this and taking these points into account. |
I am making surgical alterations to HTML, similar (it turns out) to vugu, but for translation purposes. Using a full parse + re-render is overkill. (I also some hit early implementation issues with it, but sadly I no longer recall what they were.) To move beyond text nodes to things like input placeholders, I too need to know the precise byte offsets of each attribute's name and value.
Maybe this proposal can be revived? |
After looking at #31312 closer and trying some things out, I wanted to open a new issue with a more specific and succinct proposal:
Features
At a high level, the desired features are:
Motivation
The motivation for 1 is cases where a caller needs to know the context of a token or node so this can be reported to the user. Indicating the position of an error, or where in the source HTML a particular element comes from, etc.
The motivation for 2 is for situations where the caller simply needs to know what was originally there before lower-casing. It could be useful in error reporting to show the original element name for example. (The case for me specifically is I'm doing code generation based on HTML and so allowing the user to use mixed case in a tag name means they can specify the exact name of a Go struct with CamelCase - the documents in question are mostly standard HTML but of course these elements are treated differently.)
Observations
Tracking the byte offset of each token and node is simple and unambiguous.
Line and column information can be derived from this offset, but raises other questions like the definition of a line ending, how do tabs impact the reported column position, multi-byte characters, etc.
Thus these concerns would seem to be better handled outside the parser - if the parser reports the byte offset into the original input, that's enough.
Providing an option to keep the original case of an element in the
Data
field seems, in retrospect, dangerous. I can't prove that this won't cause things to break (the parser does many weird and wonderful things internally to handle special cases with different tags - for which I have a new found respect after diving deeper into the source :) ). A new field seems much more sensible.Both tracking offset information and preserving the original text require fields to be added to the appropriate structs (see below). The performance impact of them seems negligible. I thus opted to not implement them as ParserOptions, as it would provide little benefit to disable these features - the fields would still be there just unpopulated.
Changes
Here's a summary of how this could work (how it works in the prototype):
Add
OrigData string; Offset int
toToken
. OrigData is the same as data but before lower casing. (If they are the same they point to the exact same bytes.)Add
OrigData string; Offset int
toNode
, same behavior (fields are just copied from Token to Node).Add
OrigKey
toAttribute
. Original Key before lower casing. (In retrospect Attribute probably could have an Offset too, I might add that.)And basically just the book keeping to fill out those fields.
Separately, a LineCounter struct is added which is wraps an io.Reader and records the positions of every line ending and provides a call to go from an offset to line number and line start offset. Callers can use this to reconstruct "line:col" given a byte offset from Node.Offset, etc. Notice that the question of how to deal with tabs or multi-byte characters becomes the caller's responsibility; this is intentional. (I didn't bother to implement alternate line endings, it assumes \n for now, but this could be done easily.) I'm not sure if it is appropriate to have a LineCounter inside the HTML package, it could just as well be somewhere else.
Working Prototype
The working code with these modifications is here: https://github.com/vugu/html .
The main commit that touches the internal stuff is: vugu/html@da33d26
Some basic tests are in there, I'm sure more could be done. (I also haven't looked at code that writes out HTML to see how the originally-cased data can be used there too or if this would be an option or what.) I also understand if some of this needs to be split up, this proposal is a few different things lumped together.
I believe these features could be generally useful and so I wanted to see if this is something that could potentially end up being merged, and if so discuss what's required to get to that point.
The text was updated successfully, but these errors were encountered: