Script Tags Cause Incorrect Chunked Parsing With Closing Body and HTML Tags #154

Closed
schrodingersket opened this issue Jul 26, 2018 · 2 comments


schrodingersket commented Jul 26, 2018

I've noticed that including a script tag together with closing body or html tags causes the serialized HTML document to be malformed when the input is parsed in chunks. Without the script tag, or when the closing body and html tags are omitted, parsing works as expected. A slight modification of the HTML chunks in chunks_high_level.c illustrates the issue:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <myhtml/api.h>

mystatus_t serialization_callback(const char* data, size_t len, void* ctx)
{
    printf("%.*s", (int)len, data);
    return MyCORE_STATUS_OK;
}

int main(int argc, const char * argv[])
{
    char html[][64] = {
            "<!DOCT",
            "YPE htm",
            "l>",
            "<html><head>",
            "<script>console.log('Hello, world!');</script>",
            "<ti",
            "tle>HTML chun",
            "ks parsing</",
            "title>",
            "</head><bod",
            "y><div cla",
            "ss=",
            "\"bestof",
            "class",
            "\">",
            "good for me",
            "</div>",
            "</body>",
            "</html>",
        "\0"
    };
    
    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    
    // init tree
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    
    myhtml_encoding_set(tree, MyENCODING_UTF_8);
    
    for(size_t i = 0; html[i][0]; i++)
    {
        printf("Parse chunk: %s\n", html[i]);
        
        // parse html
        myhtml_parse_chunk(tree, html[i], strlen(html[i]));
    }
    
    // signal the end of the chunked input
    myhtml_parse_chunk_end(tree);
    
    // serialize the whole document tree to stdout
    myhtml_serialization_tree_callback(myhtml_tree_get_document(tree), serialization_callback, NULL);
    
    // release resources
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    
    return 0;
}

This outputs the following:

Parse chunk: <!DOCT
Parse chunk: YPE htm
Parse chunk: l>
Parse chunk: <html><head>
Parse chunk: <script>console.log('Hello, world!');</script>
Parse chunk: <ti
Parse chunk: tle>HTML chun
Parse chunk: ks parsing</
Parse chunk: title>
Parse chunk: </head><bod
Parse chunk: y><div cla
Parse chunk: ss=
Parse chunk: "bestof
Parse chunk: class
Parse chunk: ">
Parse chunk: good for me
Parse chunk: </div>
Parse chunk: </body>
Parse chunk: </html>
<!DOCTYPE html><html><head><script>console.log('Hello, world!');</script><title>HTML chunks parsing</title></head><body><div class="bestofclass">good for me</div></body></html></script></head><body></body></html>
Process finished with exit code 0

You'll notice the extraneous </script></head><body></body></html> string at the end, as though the initial script tag was never closed.
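
To make the duplication easy to check mechanically, something like the following could be run over the serialized output. It is only a sketch in plain C with no myhtml dependency, and `count_occurrences` is a hypothetical helper name, not part of the library. For a well-formed result, each of `</html>`, `</body>`, and `</script>` should appear exactly once for this input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of myhtml): count non-overlapping
 * occurrences of `needle` in the NUL-terminated string `haystack`. */
static size_t count_occurrences(const char *haystack, const char *needle)
{
    size_t count = 0;
    size_t step = strlen(needle);

    /* An empty needle would match everywhere; treat it as "no matches". */
    if (step == 0)
        return 0;

    for (const char *p = strstr(haystack, needle);
         p != NULL;
         p = strstr(p + step, needle))
    {
        count++;
    }

    return count;
}
```

Run against the output shown above, the closing tags are each found twice rather than once, which matches the extra tail.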

It's also quite possible that I'm misunderstanding how myhtml_parse_chunk is supposed to be used - if so, clarification would be greatly appreciated.
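
One way to rule out a usage mistake on my side would be to feed the same bytes to the parser in a single call and compare the two serializations. The sketch below only covers the joining step; `join_chunks` is a hypothetical helper, and it assumes the same `char [][64]` table and empty-chunk sentinel used in the loop above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of myhtml): concatenate the chunk table
 * into one NUL-terminated buffer. Stops at the first empty chunk, the
 * same sentinel the parsing loop uses. Returns the joined length, or 0
 * if the result (plus terminator) would not fit in `cap` bytes. */
static size_t join_chunks(char chunks[][64], char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; chunks[i][0] != '\0'; i++) {
        size_t len = strlen(chunks[i]);

        /* Leave room for the terminating NUL. */
        if (used + len + 1 > cap) {
            out[0] = '\0';
            return 0;
        }

        memcpy(out + used, chunks[i], len);
        used += len;
        out[used] = '\0';
    }

    return used;
}
```

The joined buffer could then be passed to `myhtml_parse(tree, MyENCODING_UTF_8, buf, len)` on a fresh tree instead of the `myhtml_parse_chunk` loop. If that output is clean while the chunked output is not, the difference would point at the chunked path rather than at the input.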

Thanks in advance for your time and attention!

schrodingersket changed the title from "Script Tags Cause Incorrect Chunked Parsing With Trailing" to "Script Tags Cause Incorrect Chunked Parsing With Closing Body and HTML Tags" on Jul 26, 2018

lexborisov (Owner) commented

Hi @schrodingersket
Sorry that I did not reply for a while. The problem is resolved.

Thank you!

schrodingersket (Author) commented

Fantastic - thank you so much!
