New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSQuery Exception when ID of HTML Element contains space #5

Closed
muchio7 opened this Issue Jun 7, 2012 · 6 comments

Comments

Projects
None yet
2 participants
@muchio7

muchio7 commented Jun 7, 2012

CSQuery causes an exception when an ID of an html element contains a space Ex. <img id="image test" src="url"/>. I know that spaces in the ID are not allowed in valid IDs but it would be nice if CSQuery gracefully recovered from it. Not sure what the optimal behavior should be whether it should be just not found if searched on or behind the scenes setting a delimter and translating between the two automatically Ex: query for ('img test') converted to ('img_test') when parsed and when also searched on. The only downfall I see is this may create a duplicate ID in the DOM.

@jamietre

This comment has been minimized.

Show comment
Hide comment
@jamietre

jamietre Jun 7, 2012

Owner

I just created this test and it passes:

        var dom = CQ.Create("<body><div></div><img id=\"image test\" src=\"url\"/>content</div></body>");
        var res = dom["img"];

        Assert.AreEqual(1, res.Length);
        Assert.AreEqual("image test", res[0].Id);

Can you show me some specific failing code or point me to a URL that won't parse?

There's no reason that an ID couldn't have a space -- the ID attribute is treated the same as any other; anything inside the quotes should become its value.

However it is not possible to make that selector work as you'd like because "img test" is a valid selector that looks for an element "test" that is a child of elements "img". That is, because the space already has a specific meaning inside selectors (search all the children of the selection so far) it doesn't make sense to try to anticipate this situation. In theory I could try to check for tag selectors that aren't valid tag names and treat them differently but that could create a host of unexpected situations since using invalid (or "custom" as we like to say) tag names is common.

There are actually valid IDs that can't be selected either; for example, the period (dot) is a legal part of an ID name, but it also has a different meaning in a selector, which is to select class names. e.g.

<div id="test.id"></div>

$('#test.id') ==> no results (either jQuery of CsQuery)
Owner

jamietre commented Jun 7, 2012

I just created this test and it passes:

        var dom = CQ.Create("<body><div></div><img id=\"image test\" src=\"url\"/>content</div></body>");
        var res = dom["img"];

        Assert.AreEqual(1, res.Length);
        Assert.AreEqual("image test", res[0].Id);

Can you show me some specific failing code or point me to a URL that won't parse?

There's no reason that an ID couldn't have a space -- the ID attribute is treated the same as any other; anything inside the quotes should become its value.

However it is not possible to make that selector work as you'd like because "img test" is a valid selector that looks for an element "test" that is a child of elements "img". That is, because the space already has a specific meaning inside selectors (search all the children of the selection so far) it doesn't make sense to try to anticipate this situation. In theory I could try to check for tag selectors that aren't valid tag names and treat them differently but that could create a host of unexpected situations since using invalid (or "custom" as we like to say) tag names is common.

There are actually valid IDs that can't be selected either; for example, the period (dot) is a legal part of an ID name, but it also has a different meaning in a selector, which is to select class names. e.g.

<div id="test.id"></div>

$('#test.id') ==> no results (either jQuery of CsQuery)
@muchio7

This comment has been minimized.

Show comment
Hide comment
@muchio7

muchio7 Jun 7, 2012

Hi James,

Thanks for getting back with me. Here is the html that will not parse:

When I run the following code:

Dim csq As CsQuery.CQ = CsQuery.CQ.Create(HTMLElement)

Where HTMLElmenet equals the html above I get the following exception:

System.Exception : Unexpected character found at position 9: "

at CsQuery.Engine.SelectorParser.Parse(String selector)

at CsQuery.Engine.SelectorChain..ctor(String selector)

at CsQuery.Implementation.DomRoot.GetElementById(String id)

at CsQuery.Implementation.NodeList.Add(IDomObject item)

at CsQuery.CQ.Load(Char[] html)

at CsQuery.CQ.Load(String html)

Let me know if there is anything else I can do to help track down the
problem. When I removed the id attribute it works so I was assuming it was
with the space in the id. I have not dug into csquery's code yet but if
you need my help tracking it down let me know.

Thanks,

Brandon

On Thu, Jun 7, 2012 at 5:35 AM, James Treworgy <
reply@reply.github.com

wrote:

I just created this test and it passes:

       var dom = CQ.Create("<body><div></div><img id=\"image test\"

src="url"/>content");
var res = dom["img"];

       Assert.AreEqual(1, res.Length);
       Assert.AreEqual("image test", res[0].Id);

Can you show me some specific failing code or point me to a URL that won't
parse?

There's no reason that an ID couldn't have a space -- the ID attribute is
treated the same as any other; anything inside the quotes should become its
value.

However it is not possible to make that selector work as you'd like
because "img test" is a valid selector that looks for an element named
"test" that is a child of elements "img". That is, because the space
already has a specific meaning inside selectors (search all the children of
the selection so far) it doesn't make sense to try to anticipate this
situation.

There are actually valid IDs that can't be selected either; for example,
the period (dot) is a legal part of an ID name, but it also has a different
meaning in a selector, which is to select class names. e.g.

$('#test.id') ==> no results (either jQuery of CsQuery)


Reply to this email directly or view it on GitHub:
#5 (comment)

muchio7 commented Jun 7, 2012

Hi James,

Thanks for getting back with me. Here is the html that will not parse:

When I run the following code:

Dim csq As CsQuery.CQ = CsQuery.CQ.Create(HTMLElement)

Where HTMLElmenet equals the html above I get the following exception:

System.Exception : Unexpected character found at position 9: "

at CsQuery.Engine.SelectorParser.Parse(String selector)

at CsQuery.Engine.SelectorChain..ctor(String selector)

at CsQuery.Implementation.DomRoot.GetElementById(String id)

at CsQuery.Implementation.NodeList.Add(IDomObject item)

at CsQuery.CQ.Load(Char[] html)

at CsQuery.CQ.Load(String html)

Let me know if there is anything else I can do to help track down the
problem. When I removed the id attribute it works so I was assuming it was
with the space in the id. I have not dug into csquery's code yet but if
you need my help tracking it down let me know.

Thanks,

Brandon

On Thu, Jun 7, 2012 at 5:35 AM, James Treworgy <
reply@reply.github.com

wrote:

I just created this test and it passes:

       var dom = CQ.Create("<body><div></div><img id=\"image test\"

src="url"/>content");
var res = dom["img"];

       Assert.AreEqual(1, res.Length);
       Assert.AreEqual("image test", res[0].Id);

Can you show me some specific failing code or point me to a URL that won't
parse?

There's no reason that an ID couldn't have a space -- the ID attribute is
treated the same as any other; anything inside the quotes should become its
value.

However it is not possible to make that selector work as you'd like
because "img test" is a valid selector that looks for an element named
"test" that is a child of elements "img". That is, because the space
already has a specific meaning inside selectors (search all the children of
the selection so far) it doesn't make sense to try to anticipate this
situation.

There are actually valid IDs that can't be selected either; for example,
the period (dot) is a legal part of an ID name, but it also has a different
meaning in a selector, which is to select class names. e.g.

$('#test.id') ==> no results (either jQuery of CsQuery)


Reply to this email directly or view it on GitHub:
#5 (comment)

@jamietre

This comment has been minimized.

Show comment
Hide comment
@jamietre

jamietre Jun 7, 2012

Owner

Confirmed bug. As it turns out this is not actually the parser - what happens is that when you add a new element to the DOM, it verifies that its ID is unique, and removes it if not. It's breaking when trying to select the invalid ID using GetElementByID.

I think this logic is flawed, even though it's designed to prevent invalid HTML, it would result in an incorrect representation of what it was fed. Since browsers allow it (even though it's invalid) CsQuery should too. Shouldn't take long to fix.

Owner

jamietre commented Jun 7, 2012

Confirmed bug. As it turns out this is not actually the parser - what happens is that when you add a new element to the DOM, it verifies that its ID is unique, and removes it if not. It's breaking when trying to select the invalid ID using GetElementByID.

I think this logic is flawed, even though it's designed to prevent invalid HTML, it would result in an incorrect representation of what it was fed. Since browsers allow it (even though it's invalid) CsQuery should too. Shouldn't take long to fix.

jamietre added a commit that referenced this issue Jun 7, 2012

@jamietre

This comment has been minimized.

Show comment
Hide comment
@jamietre

jamietre Jun 7, 2012

Owner

Fixed - and the element can be selected by ID now using GetElementById. Duplicate IDs are also kept when parsing HTML.

It will still remove the ID attribute if you try to add a duplicate ID to an existing DOM. It looks like jQuery does not do this automatically from a quick google on the subject. It seems like desirable behavior to me so I am going to leave it this way.

Test:

string html = @"<img alt="""" id=""Picture 7"" src=""Image.aspx?imageId=26381""
            style=""border-top-width: 1px; border-right-width: 1px; border-bottom-width:
            1px; border-left-width: 1px; border-top-style: solid; border-right-style:
            solid; border-bottom-style: solid; border-left-style: solid; margin-left:
            1px; margin-right: 1px; width: 950px; height: 451px; "" />";

var dom = CQ.Create(html);
var res = dom["img"];
Assert.AreEqual(1, res.Length);
Assert.AreEqual("Picture 7", res[0].Id);

var img = dom.Document.GetElementById("Picture 7");
Assert.IsNotNull(img);
Assert.AreEqual("Picture 7", img.Id);
Owner

jamietre commented Jun 7, 2012

Fixed - and the element can be selected by ID now using GetElementById. Duplicate IDs are also kept when parsing HTML.

It will still remove the ID attribute if you try to add a duplicate ID to an existing DOM. It looks like jQuery does not do this automatically from a quick google on the subject. It seems like desirable behavior to me so I am going to leave it this way.

Test:

string html = @"<img alt="""" id=""Picture 7"" src=""Image.aspx?imageId=26381""
            style=""border-top-width: 1px; border-right-width: 1px; border-bottom-width:
            1px; border-left-width: 1px; border-top-style: solid; border-right-style:
            solid; border-bottom-style: solid; border-left-style: solid; margin-left:
            1px; margin-right: 1px; width: 950px; height: 451px; "" />";

var dom = CQ.Create(html);
var res = dom["img"];
Assert.AreEqual(1, res.Length);
Assert.AreEqual("Picture 7", res[0].Id);

var img = dom.Document.GetElementById("Picture 7");
Assert.IsNotNull(img);
Assert.AreEqual("Picture 7", img.Id);

@jamietre jamietre closed this Jun 7, 2012

@muchio7

This comment has been minimized.

Show comment
Hide comment
@muchio7

muchio7 Jun 7, 2012

Thanks James. I tested the latest version and it works. I like the idea
of not being create bad html except for when parsing. Any site which deals
with user created content or poorly code sites have the possibility of bad
html. Thanks for your quick response. I really like the CSQuery project
and am glad there is a library which incorporates the power of jquery on
the server. Got tired of saying "if only I could write this part in
jquery".

Brandon

On Thu, Jun 7, 2012 at 10:32 AM, James Treworgy <
reply@reply.github.com

wrote:

Fixed - and the element can be selected by ID now using GetElementById.
Duplicate IDs are also ok.

It will still fail if you try to add a duplicate ID to an existing DOM
that's already been built from HTML, which I think is correct.. CsQuery
should stop you from creating bad HTML (except when parsing).

Test:

string html = @"<img alt="""" id=""Picture 7""
src=""Image.aspx?imageId=26381""
style=""border-top-width: 1px; border-right-width: 1px;
border-bottom-width:
1px; border-left-width: 1px; border-top-style: solid;
border-right-style:
solid; border-bottom-style: solid; border-left-style:
solid; margin-left:
1px; margin-right: 1px; width: 950px; height: 451px; ""
/>";

var dom = CQ.Create(html);
var res = dom["img"];
Assert.AreEqual(1, res.Length);
Assert.AreEqual("Picture 7", res[0].Id);

var img = dom.Document.GetElementById("Picture 7");
Assert.IsNotNull(img);
Assert.AreEqual("Picture 7", img.Id);


Reply to this email directly or view it on GitHub:
#5 (comment)

muchio7 commented Jun 7, 2012

Thanks James. I tested the latest version and it works. I like the idea
of not being create bad html except for when parsing. Any site which deals
with user created content or poorly code sites have the possibility of bad
html. Thanks for your quick response. I really like the CSQuery project
and am glad there is a library which incorporates the power of jquery on
the server. Got tired of saying "if only I could write this part in
jquery".

Brandon

On Thu, Jun 7, 2012 at 10:32 AM, James Treworgy <
reply@reply.github.com

wrote:

Fixed - and the element can be selected by ID now using GetElementById.
Duplicate IDs are also ok.

It will still fail if you try to add a duplicate ID to an existing DOM
that's already been built from HTML, which I think is correct.. CsQuery
should stop you from creating bad HTML (except when parsing).

Test:

string html = @"<img alt="""" id=""Picture 7""
src=""Image.aspx?imageId=26381""
style=""border-top-width: 1px; border-right-width: 1px;
border-bottom-width:
1px; border-left-width: 1px; border-top-style: solid;
border-right-style:
solid; border-bottom-style: solid; border-left-style:
solid; margin-left:
1px; margin-right: 1px; width: 950px; height: 451px; ""
/>";

var dom = CQ.Create(html);
var res = dom["img"];
Assert.AreEqual(1, res.Length);
Assert.AreEqual("Picture 7", res[0].Id);

var img = dom.Document.GetElementById("Picture 7");
Assert.IsNotNull(img);
Assert.AreEqual("Picture 7", img.Id);


Reply to this email directly or view it on GitHub:
#5 (comment)

@jamietre

This comment has been minimized.

Show comment
Hide comment
@jamietre

jamietre Jun 7, 2012

Owner

Thanks for the positive feedback. "if only I could write this part in jquery" ... yeah that's pretty much why I started this!!

Btw. just added to NuGet, will be a little below the radar until I finish the missing css selectors.

Owner

jamietre commented Jun 7, 2012

Thanks for the positive feedback. "if only I could write this part in jquery" ... yeah that's pretty much why I started this!!

Btw. just added to NuGet, will be a little below the radar until I finish the missing css selectors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment