Skip to content

Implement XML parsing using stdlib#155

Merged
CorentinB merged 5 commits intointernetarchive:mainfrom
HarshNarayanJha:feature/xml-processing-stdlib
Nov 11, 2024
Merged

Implement XML parsing using stdlib#155
CorentinB merged 5 commits intointernetarchive:mainfrom
HarshNarayanJha:feature/xml-processing-stdlib

Conversation

@HarshNarayanJha
Copy link
Copy Markdown
Contributor

@HarshNarayanJha HarshNarayanJha commented Nov 2, 2024

This PR replaces github.com/clbanning/mxj/v2 and uses encoding/xml xml.Decoder to parse xml and extract urls within.

EDIT: All tests pass now. The PR is complete

Co-Author: @yzqzss

Only two tests fail, I am trying to fix those

Tests
go: downloading git.archive.org/wb/gocrawlhq v1.2.13
internal/pkg/crawl/config.go:11:2: unrecognized import path "git.archive.org/wb/gocrawlhq": https fetch: Get "https://git.archive.org/wb/gocrawlhq?go-get=1": dial tcp 207.241.235.124:443: i/o timeout
=== RUN   TestJSON
=== RUN   TestJSON/Valid_JSON_with_URLs
=== RUN   TestJSON/Invalid_JSON
=== RUN   TestJSON/JSON_with_no_URLs
=== RUN   TestJSON/JSON_with_URLs_in_various_fields
=== RUN   TestJSON/JSON_with_array_of_URLs
--- PASS: TestJSON (0.00s)
    --- PASS: TestJSON/Valid_JSON_with_URLs (0.00s)
    --- PASS: TestJSON/Invalid_JSON (0.00s)
    --- PASS: TestJSON/JSON_with_no_URLs (0.00s)
    --- PASS: TestJSON/JSON_with_URLs_in_various_fields (0.00s)
    --- PASS: TestJSON/JSON_with_array_of_URLs (0.00s)
=== RUN   TestXML
=== RUN   TestXML/Valid_XML_with_URLs
=== RUN   TestXML/Empty_XML
=== RUN   TestXML/Invalid_XML
=== RUN   TestXML/XML_with_invalid_URL
=== RUN   TestXML/Huge_sitemap
    xml_test.go:88: XML() gotURLs count = 10000, want 100002
--- FAIL: TestXML (0.76s)
    --- PASS: TestXML/Valid_XML_with_URLs (0.00s)
    --- PASS: TestXML/Empty_XML (0.00s)
    --- PASS: TestXML/Invalid_XML (0.00s)
    --- PASS: TestXML/XML_with_invalid_URL (0.00s)
    --- FAIL: TestXML/Huge_sitemap (0.66s)
=== RUN   TestXMLBodyReadError
    xml_test.go:127: XML() expected error, got nil
--- FAIL: TestXMLBodyReadError (0.00s)
FAIL
FAIL	github.com/internetarchive/Zeno/internal/pkg/crawl/extractor	0.780s
FAIL

Closes #84

@CorentinB
Copy link
Copy Markdown
Collaborator

Hey, thank you, how is it going?

@HarshNarayanJha
Copy link
Copy Markdown
Contributor Author

HarshNarayanJha commented Nov 8, 2024

Not going well. I am not sure what am I failing to capture in that big XML file. Any ideas @CorentinB ?

if strings.HasPrefix(value.(string), "http") {
URL, err := url.Parse(value.(string))
switch tok := tok.(type) {
case xml.StartElement:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

		case xml.StartElement:
			startElement = tok
			currentNode = &LeafNode{Path: startElement.Name.Local}
			for _, attr := range tok.Attr {
				if strings.HasPrefix(attr.Value, "http") {
					parsedURL, err := url.Parse(attr.Value)
					if err == nil {
						URLs = append(URLs, parsedURL)
					}
				}
			}

this fixed the Huge sitemap test by extracting XML attributes.
Now, the URLs' size and content match the previous tests. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, opening tags also have urls in some cases.

Now the only left is TestXMLBodyReadError, gotta look at it

Copy link
Copy Markdown
Collaborator

@yzqzss yzqzss Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TestXMLBodyReadError test is invalid(?) since NopCloser certainly won't return an EOF error on xmlBody, err := io.ReadAll(resp.Body)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible, but they did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode the Token once and catch the error. (and seek back for the loop)

_, err = decoder.Token()
if err != nil {
	return nil, sitemap, err
}

// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)

Catching this in the loop won't work cleanly, since I want to know if EOF was somewhere in-between the file (invalid XML), or at the start (this error)

I will push these changes

@HarshNarayanJha HarshNarayanJha marked this pull request as ready for review November 8, 2024 07:57
Comment thread internal/pkg/crawl/extractor/xml.go Outdated
Comment thread internal/pkg/crawl/extractor/xml.go Outdated
@yzqzss yzqzss mentioned this pull request Nov 9, 2024
@CorentinB CorentinB added the enhancement New feature or request label Nov 10, 2024
@CorentinB CorentinB requested a review from yzqzss November 10, 2024 09:12
Copy link
Copy Markdown
Collaborator

@yzqzss yzqzss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@CorentinB CorentinB merged commit 803fe6a into internetarchive:main Nov 11, 2024
@CorentinB
Copy link
Copy Markdown
Collaborator

Thanks guys!

@HarshNarayanJha
Copy link
Copy Markdown
Contributor Author

You're welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace github.com/clbanning/mxj/v2 with stdlib

3 participants