Implement XML parsing using stdlib #155
…/xml to parse xml
Hey, thank you, how is it going?

Not going well. I am not sure what I am failing to capture in that big XML file. Any ideas, @CorentinB?
if strings.HasPrefix(value.(string), "http") {
URL, err := url.Parse(value.(string))
switch tok := tok.(type) {
case xml.StartElement:
case xml.StartElement:
	startElement = tok
	currentNode = &LeafNode{Path: startElement.Name.Local}
	for _, attr := range tok.Attr {
		if strings.HasPrefix(attr.Value, "http") {
			parsedURL, err := url.Parse(attr.Value)
			if err == nil {
				URLs = append(URLs, parsedURL)
			}
		}
	}

This fixed the Huge sitemap test by extracting XML attributes.
Now, the URLs' size and content match the previous tests. :)
Right, opening tags also contain URLs in some cases.
Now the only one left is TestXMLBodyReadError, gotta look at it.
I think the TestXMLBodyReadError test is invalid(?), since NopCloser certainly won't return an EOF error on xmlBody, err := io.ReadAll(resp.Body).
Possible, but they did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode one token up front and catch the error (and then seek back for the loop).
_, err = decoder.Token()
if err != nil {
	return nil, sitemap, err
}
// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)

Catching this in the loop won't work cleanly, since I want to know whether the EOF was somewhere in the middle of the file (invalid XML) or at the start (this error).
I will push these changes
Co-Authored-By: yzqzss <30341059+yzqzss@users.noreply.github.com>
Thanks guys!

You're welcome
This PR replaces github.com/clbanning/mxj/v2 and uses encoding/xml xml.Decoder to parse XML and extract the URLs within.
EDIT: All tests pass now. The PR is complete.
Co-Author: @yzqzss
Only two tests fail, I am trying to fix those
Tests
Closes #84