Problem with \r #91

foobar27 · 2017-02-18T19:51:34Z

When a text node terminates with \r there is some non-deterministic behaviour (sometimes random characters in the text).

#include <iostream>
#include <cstring>

#include "myhtml/api.h"

void traverse(myhtml_tree_node_t *root) {
    if (!root) {
        return;
    }
    myhtml_tag_id_t tag = myhtml_node_tag_id(root);
    switch (tag) {
    case MyHTML_TAG__END_OF_FILE:
    case MyHTML_TAG__UNDEF:
         return;

    case MyHTML_TAG__TEXT: {
        const char* txt = myhtml_node_text(root, NULL);
        std::cout << "length: " << strlen(txt) << std::endl;
        return;
    }
    default:
        break;
    }
    myhtml_tree_node_t* child = myhtml_node_child(root);
    while (child != NULL) {
        traverse(child);
        child = myhtml_node_next(child);
    }
}

int main() {
    const char* input = "<div>Foo\r</div>";
    myhtml_t* myhtml = myhtml_create();
    if (!myhtml) {
        std::cerr << "myhtml_create failed" << std::endl;
        return 0;
    }
    myhtml_status_t res = myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    if (MYHTML_FAILED(res)) {
        std::cerr << "myhtml_init failed with " << res << std::endl;
        return 0;
    }
    size_t inputLength = strlen(input);
    myhtml_tree_t* tree = myhtml_tree_create();
    if (!tree) {
        std::cerr << "error constructing tree" << std::endl;
        return 1;
    }

    myhtml_status_t tree_res = myhtml_tree_init(tree, myhtml);
    if (MYHTML_FAILED(tree_res)) {
        std::cerr << "myhtml_tree_init failed with " << res << std::endl;
        myhtml_tree_destroy(tree);
        return 1;
    }

    // parse html
    myhtml_parse(tree, MyHTML_ENCODING_UTF_8, input, inputLength);

    traverse(myhtml_tree_get_node_html(tree));

    // release resources
    myhtml_tree_destroy(tree);
    std::cout << "finished" << std::endl;
    myhtml_destroy(myhtml);
}

When I compile the above program with the clang memory sanitizer I get a warning about an uninitialized variable at the location where I do strlen().

$ clang++ -g -fsanitize=memory -I/path/to/myhtml/include -L/path/to/myhtml/lib -lmyhtml -lpthread -o scratchpad scratchpad.cpp`
$ LD_LIBRARY_PATH=/path/to/myhtml/lib/ ./scratchpad
Uninitialized bytes in __interceptor_strlen at offset 0 inside [0x752000000008, 5)
==2044==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x490b6d in traverse(myhtml_tree_node*) scratchpad.cpp:18:36
    #1 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #2 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #3 0x490cf6 in traverse(myhtml_tree_node*) scratchpad.cpp:26:9
    #4 0x491571 in main scratchpad.cpp:60:5
    #5 0x7f35eecc8290 in __libc_start_main (/usr/lib/libc.so.6+0x20290)
    #6 0x419f49 in _start (scratchpad+0x419f49)

SUMMARY: MemorySanitizer: use-of-uninitialized-value scratchpad.cpp:18:36 in traverse(myhtml_tree_node*)
Exiting

I think it's somehow related to the following part of the code:

myhtml/source/myhtml/mystring.c

Line 238 in 371b307

str->length++;

Temporarily removing lines 238 and 239 avoids the sanitizer warning, but I don't dare to do a pull request right now because I don't fully understand the intention of the code. My guess is that it tries to replace \r\n by \n, and it mistakenly increments the pointer when the string ends in \r.

The text was updated successfully, but these errors were encountered:

lexborisov · 2017-02-18T20:15:29Z

Fixed! Please, try to test this.
Thanks for the help!

foobar27 · 2017-02-18T20:48:25Z

Thanks for the quick fix.
My original page is now parsing properly.
The sanitizer still complains for the test program above but I think it's a false alarm (strlen() accesses behind the 0-character, probably to leverage SIMD instructions).

lexborisov · 2017-02-18T20:51:24Z

You're welcome!
Could you give me the page on which the parser did not work?

foobar27 · 2017-02-18T20:55:01Z

Sure: http://www.romedic.ro/forum/intarzierea-menstruatiei-6857 (before the text "Ceea ce aveti..." there is a br, and before the br there is a \r).

lexborisov · 2017-02-18T20:58:29Z

Thanks!

lexborisov · 2017-02-18T21:04:59Z

and you can:

        size_t length;
        const char* txt = myhtml_node_text(root, &length);
        std::cout << "length: " << length << std::endl;
        return;

lexborisov added a commit that referenced this issue Feb 18, 2017

Added forgot '\0' if text node ends with '\r' #91

58dfb5e

foobar27 closed this as completed Feb 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with \r #91

Problem with \r #91

foobar27 commented Feb 18, 2017

lexborisov commented Feb 18, 2017

foobar27 commented Feb 18, 2017

lexborisov commented Feb 18, 2017

foobar27 commented Feb 18, 2017 •

edited

Loading

lexborisov commented Feb 18, 2017

lexborisov commented Feb 18, 2017

Problem with \r #91

Problem with \r #91

Comments

foobar27 commented Feb 18, 2017

lexborisov commented Feb 18, 2017

foobar27 commented Feb 18, 2017

lexborisov commented Feb 18, 2017

foobar27 commented Feb 18, 2017 • edited Loading

lexborisov commented Feb 18, 2017

lexborisov commented Feb 18, 2017

foobar27 commented Feb 18, 2017 •

edited

Loading