Setting text contents of a node? #8

Closed
CrimsonVex opened this issue May 7, 2015 · 7 comments

@CrimsonVex

I'm assuming this wasn't intended, but would it be possible to create a way to set the text contents of a CNode? I'm in a situation where I need to update parts of a DOM on the fly, and I need such a feature.

If it were to be implemented, I'd imagine overloading .text() for a CNode to accept a std::string would work well and be similar to the jQuery function .html().

@TechnikEmpire

This isn't possible because of the nature of gumbo itself. All of the node data exposed to you is managed internally by gumbo. If you modify it at all, you're likely to introduce bugs or even crash your program, because you're tampering with memory that another object manages. This is the very clear contract gumbo gives you: if you want to own something, you need to copy it.

@CrimsonVex
Author

I'd assume then a suitable option is to perform replacements on the original string given to CDocument and re-parse it? (I suppose that's not so bad)

@TechnikEmpire

I made heavy modifications to gumbo-query just to be able to perform the simplest modifications of nodes at a good speed. These modifications included providing a Get() method to expose the underlying GumboNode of a CDocument/CNode. I then wrote several helper functions, the most important of which generates a unique node ID string, like so:

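#include <string>
#include "gumbo.h"

// Builds an ID from the node's sibling index followed by each ancestor's
// index, walking up to the root. The '.' separator keeps multi-digit indices
// from running together (without it, 1 then 12 and 11 then 2 both give "112").
std::string SerializeUtil::getUniqueNodeId(GumboNode* node)
{
    std::string nodeId = std::to_string(node->index_within_parent);

    GumboNode* parent = node->parent;

    while (parent != nullptr)
    {
        nodeId.append(1, '.');
        nodeId.append(std::to_string(parent->index_within_parent));

        parent = parent->parent;
    }

    return nodeId;
}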

Using this unique node ID, I could then keep track of the nodes I wanted to manipulate by storing them in a simple std::unordered_map<std::string, int>. The int value represents the manipulation to perform on the node while it is being rendered: for example remove, modify, and so on. I then heavily modified https://github.com/google/gumbo-parser/blob/master/examples/serialize.cc to take an optional pointer to such a map, so that while it renders the GumboOutput back to an HTML string, it can apply the modifications by checking the unique ID of each node against the provided unordered_map as it begins to render that node.
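
To make the idea above concrete, here is a minimal sketch of an ID-keyed operation map driving a serializer. It is not the modified serialize.cc from this thread. The `Node` struct is a hypothetical stand-in for GumboNode so the example compiles without gumbo, the operation codes `kRemove` and `kUnwrap` are assumptions (the thread only says the int stands for an operation), and the IDs use a '.' between indices so they stay unambiguous.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for GumboNode; not part of gumbo or gumbo-query.
struct Node {
    std::string tag;                       // empty for a text node
    std::string text;                      // used only by text nodes
    Node* parent = nullptr;
    std::size_t index_within_parent = 0;
    std::vector<std::unique_ptr<Node>> children;

    // Appends a child and records its parent and sibling index.
    Node* add(std::string childTag, std::string childText = "") {
        auto child = std::make_unique<Node>();
        child->tag = std::move(childTag);
        child->text = std::move(childText);
        child->parent = this;
        child->index_within_parent = children.size();
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// Assumed operation codes for illustration only.
enum Op : int { kRemove = 1, kUnwrap = 2 };

// Sibling indices from the node up to the root, separated by '.'.
std::string nodeId(const Node* node) {
    std::string id;
    for (const Node* n = node; n != nullptr; n = n->parent) {
        if (!id.empty()) id += '.';
        id += std::to_string(n->index_within_parent);
    }
    return id;
}

// Renders the tree back to markup, applying any operation registered for a
// node's ID. Passing no map renders the tree unchanged.
std::string serialize(const Node* node,
                      const std::unordered_map<std::string, int>* ops = nullptr) {
    int op = 0;
    if (ops != nullptr) {
        auto it = ops->find(nodeId(node));
        if (it != ops->end()) op = it->second;
    }
    if (op == kRemove) return "";              // drop the node and its subtree
    if (node->tag.empty()) return node->text;  // text node

    std::string inner;
    for (const auto& child : node->children) inner += serialize(child.get(), ops);

    if (op == kUnwrap) return inner;           // keep children, drop the wrapper
    return "<" + node->tag + ">" + inner + "</" + node->tag + ">";
}
```

The tree itself is never modified; the changes exist only in the rendered string, which fits gumbo's rule that parsed data is read-only.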

So yeah, not too bad, but there is a lot involved in doing these modifications. For me, this approach was necessary because I'm doing modifications to HTML in real-time as users browse, so speed was of the utmost importance.

@CrimsonVex
Author

In my case speed isn't an issue. I'm making some POST requests, analysing the response and then making subsequent POST requests. I haven't tried it yet, but I'm assuming my simple idea of using the Replace function on my System::Strings should work (that particular replace function is quite fast), as I probably need to replace a couple of tags, each containing a few thousand or so characters, after each POST. It's not optimal but it might be okay. Thanks for clarifying that for me though.
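
For readers using standard C++ rather than System::String, the replace-then-re-parse idea could look roughly like the sketch below. `replaceAll` is a hypothetical helper, not part of gumbo-query; the edited string would then be handed to CDocument again for a fresh parse.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` in `text` with `to`, in place.
// Returns the number of replacements made. An empty `from` is ignored,
// since it would otherwise match at every position.
std::size_t replaceAll(std::string& text, const std::string& from,
                       const std::string& to) {
    if (from.empty()) return 0;

    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text, never inside it
        ++count;
    }
    return count;
}
```

Matching on the full old fragment only works when that fragment appears in the source exactly as written, so it is worth checking the returned count before re-parsing.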

@TechnikEmpire

Look at the code behind the text() methods and such in gumbo-query. They are just convenience functions that copy data from the parsed HTML, which resides exclusively in, and is owned by, GumboOutput. So if you change the text that you get back from node.text(), this will have absolutely no effect on the actual document that you parsed. gumbo-parser and gumbo-query only give you read-only access to traverse parsed HTML. Maybe I'm not understanding your use case; maybe the only text you need is what gets copied to you when you call text() on your node. But I want to make it clear that if you're expecting to get an HTML response, replace the text() of one element and end up with the whole response including your modifications, this simply isn't possible out of the box.

@CrimsonVex
Author

I'm thinking more along the lines of having a global string variable. Every time I make a new request that responds with pieces of HTML, I merge them into the global string by replacing the current CNode.text() with the HTML piece, and pass this global string to CDocument to be analysed again before making further requests.

@lazytiger
Owner

I think this feature can be implemented with CNode::startPos and CNode::endPos.
You can replace the data from startPos to endPos with whatever you want.
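
A minimal sketch of that suggestion, assuming startPos and endPos are byte offsets into the original HTML string and that the range is half-open (endPos is one past the last byte to replace). `spliceRange` is a hypothetical helper, not part of gumbo-query; the offsets are passed in as plain numbers so the example compiles without the library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns a copy of `html` with the bytes in [startPos, endPos) replaced by
// `replacement`. The input string is left untouched.
std::string spliceRange(const std::string& html, std::size_t startPos,
                        std::size_t endPos, const std::string& replacement) {
    if (startPos > endPos || endPos > html.size()) {
        throw std::out_of_range("spliceRange: invalid [startPos, endPos) range");
    }
    std::string result = html;
    result.replace(startPos, endPos - startPos, replacement);
    return result;
}
```

After a splice, any offsets taken from the old parse no longer line up with the new string, so the result needs to be parsed again before further edits. When changing several nodes from one parse, applying the splices from the highest offset to the lowest avoids shifting the ranges that are still pending.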

@lazytiger lazytiger reopened this Jun 8, 2015