Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with \r #91

Closed
foobar27 opened this issue Feb 18, 2017 · 6 comments
Closed

Problem with \r #91

foobar27 opened this issue Feb 18, 2017 · 6 comments

Comments

@foobar27
Copy link

When a text node terminates with \r there is some non-deterministic behaviour (sometimes random characters in the text).

#include <iostream>
#include <cstring>

#include "myhtml/api.h"

void traverse(myhtml_tree_node_t *root) {
    if (!root) {
        return;
    }
    myhtml_tag_id_t tag = myhtml_node_tag_id(root);
    switch (tag) {
    case MyHTML_TAG__END_OF_FILE:
    case MyHTML_TAG__UNDEF:
         return;

    case MyHTML_TAG__TEXT: {
        const char* txt = myhtml_node_text(root, NULL);
        std::cout << "length: " << strlen(txt) << std::endl;
        return;
    }
    default:
        break;
    }
    myhtml_tree_node_t* child = myhtml_node_child(root);
    while (child != NULL) {
        traverse(child);
        child = myhtml_node_next(child);
    }
}

int main() {
    const char* input = "<div>Foo\r</div>";
    myhtml_t* myhtml = myhtml_create();
    if (!myhtml) {
        std::cerr << "myhtml_create failed" << std::endl;
        return 0;
    }
    myhtml_status_t res = myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    if (MYHTML_FAILED(res)) {
        std::cerr << "myhtml_init failed with " << res << std::endl;
        return 0;
    }
    size_t inputLength = strlen(input);
    myhtml_tree_t* tree = myhtml_tree_create();
    if (!tree) {
        std::cerr << "error constructing tree" << std::endl;
        return 1;
    }

    myhtml_status_t tree_res = myhtml_tree_init(tree, myhtml);
    if (MYHTML_FAILED(tree_res)) {
        std::cerr << "myhtml_tree_init failed with " << res << std::endl;
        myhtml_tree_destroy(tree);
        return 1;
    }

    // parse html
    myhtml_parse(tree, MyHTML_ENCODING_UTF_8, input, inputLength);

    traverse(myhtml_tree_get_node_html(tree));

    // release resources
    myhtml_tree_destroy(tree);
    std::cout << "finished" << std::endl;
    myhtml_destroy(myhtml);
}

When I compile the above program with the clang memory sanitizer I get a warning about an uninitialized variable at the location where I do strlen().

$ clang++ -g -fsanitize=memory -I/path/to/myhtml/include -L/path/to/myhtml/lib -lmyhtml -lpthread -o scratchpad scratchpad.cpp`
$ LD_LIBRARY_PATH=/path/to/myhtml/lib/ ./scratchpad
Uninitialized bytes in __interceptor_strlen at offset 0 inside [0x752000000008, 5)
==2044==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x490b6d in traverse(myhtml_tree_node*) scratchpad.cpp:18:36
    #1 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #2 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #3 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #4 0x491571 in main scratchpad.cpp:60:5
    #5 0x7f35eecc8290 in __libc_start_main (/usr/lib/libc.so.6+0x20290)
    #6 0x419f49 in _start (scratchpad+0x419f49)

SUMMARY: MemorySanitizer: use-of-uninitialized-value scratchpad.cpp:18:36 in traverse(myhtml_tree_node*)
Exiting

I think it's somehow related to the following part of the code:

str->length++;

Temporarily removing lines 238 and 239 avoids the sanitizer warning, but I don't dare to do a pull request right now because I don't fully understand the intention of the code. My guess is that it tries to replace \r\n by \n, and it mistakenly increments the pointer when the string ends in \r.

@lexborisov
Copy link
Owner

Fixed! Please, try to test this.
Thanks for the help!

@foobar27
Copy link
Author

Thanks for the quick fix.
My original page is now parsing properly.
The sanitizer still complains for the test program above but I think it's a false alarm (strlen() accesses behind the 0-character, probably to leverage SIMD instructions).

@lexborisov
Copy link
Owner

You're welcome!
Could you give me the page on which the parser did not work?

@foobar27
Copy link
Author

foobar27 commented Feb 18, 2017

Sure: http://www.romedic.ro/forum/intarzierea-menstruatiei-6857 (before the text "Ceea ce aveti..." there is a br, and before the br there is a \r).

@lexborisov
Copy link
Owner

Thanks!

@lexborisov
Copy link
Owner

and you can:

        size_t length;
        const char* txt = myhtml_node_text(root, &length);
        std::cout << "length: " << length << std::endl;
        return;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants