Yet Another Krawler
I found, to my dismay, that there don't seem to be many link crawlers written
in golang, so maybe this project is mistitled. It is, in fact, a link crawler
that accepts a custom walkFn in the spirit of filepath.Walk, which is called
with each HTML node in the parse tree. In this way, one can easily decide what
constitutes an "interesting node" for some particular definition of
interesting.
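The project's actual walkFn signature may differ; this is a guess at the shape, assuming parse trees from golang.org/x/net/html:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

// walkFn is called once per node in the parse tree,
// in the spirit of filepath.Walk's WalkFunc.
type walkFn func(n *html.Node)

// walk visits n and all of its descendants, depth-first, left to right.
func walk(n *html.Node, fn walkFn) {
	fn(n)
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		walk(c, fn)
	}
}

func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Print the tag of every element node -- one possible
	// definition of "interesting".
	walk(doc, func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data)
		}
	})
}
```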
At its core, the crawler runs the following loop (a minimal sketch appears after the list):

1. Start with a single URL and add it to a queue
2. Dequeue a work item
3. Fetch its content and build a parse tree from it
4. Scan the parse tree for interesting HTML nodes, where "interesting" means:
   - images
   - javascript
   - external links (links that point somewhere other than the host of the original URL in step 1)
   - stylesheets
   - child links (links hosted on the same host as the URL in step 1)
5. If there are any child links, enqueue them for later processing
6. Go back to step 2 until the queue is empty
7. Output each page's URL and its assets by walking the page tree from top to bottom, left to right
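Here is a minimal, self-contained sketch of that loop, assuming golang.org/x/net/html for parsing. It only tracks child links and prints each page's URL; the scan for images, scripts, and stylesheets is omitted to keep the sketch short:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"

	"golang.org/x/net/html"
)

// collectLinks returns the href value of every <a> element in the tree.
func collectLinks(n *html.Node) []string {
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					links = append(links, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(n)
	return links
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: crawl URL")
		os.Exit(1)
	}
	start, err := url.Parse(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	queue := []*url.URL{start} // step 1: seed the queue
	visited := map[string]bool{}
	for len(queue) > 0 {
		u := queue[0] // step 2: dequeue a work item
		queue = queue[1:]
		if visited[u.String()] {
			continue
		}
		visited[u.String()] = true
		resp, err := http.Get(u.String()) // step 3: fetch and parse
		if err != nil {
			continue
		}
		doc, err := html.Parse(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}
		fmt.Println(u) // step 7: report the page (assets omitted here)
		for _, raw := range collectLinks(doc) {
			link, err := u.Parse(raw) // resolve relative references
			if err != nil || link.Host != start.Host {
				continue // external link; a real scan would record it
			}
			queue = append(queue, link) // step 5: enqueue child links
		}
	}
}
```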
To build and run it:

```
$ go get github.com/mrallen1/yak
$ cd $GOPATH/src/github.com/mrallen1/yak
$ go test -cover
$ go install
$ cd $GOPATH/bin
$ ./yak http://www.example.com
```
The current implementation is a depth-first recursive tree walk, and it suffers from head-of-line blocking: each slow fetch holds up everything queued behind it. One way to solve that would be to implement a pool of workers that concurrently dequeue work items, but I haven't mastered golang channels and concurrency idioms well enough for that yet. That's just a matter of further practice and mentoring, though.
The sketch below shows one common golang idiom that could be implemented: a fixed pool of worker goroutines consumes URLs from a channel, while a single coordinating loop owns the visited set so no mutex is needed. Here extractChildLinks is a hypothetical stand-in for the fetch-parse-scan steps above, and the counter n tracks how many batches of links are still in flight so the loop knows when to stop.
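```go
func crawlConcurrently(start string, workers int) {
	worklist := make(chan []string) // batches of links, possibly duplicated
	unseen := make(chan string)     // de-duplicated URLs for the workers

	// Workers: fetch a page and send its child links back to the worklist.
	for i := 0; i < workers; i++ {
		go func() {
			for u := range unseen {
				links := extractChildLinks(u) // hypothetical helper
				go func() { worklist <- links }()
			}
		}()
	}

	// Coordinator: owns the visited map and feeds unseen URLs to workers.
	go func() { worklist <- []string{start} }()
	visited := map[string]bool{}
	for n := 1; n > 0; n-- { // n counts batches still expected on worklist
		for _, u := range <-worklist {
			if !visited[u] {
				visited[u] = true
				n++ // one more batch of children will come back
				unseen <- u
			}
		}
	}
	close(unseen) // let the workers exit
}
```

Wrapping each worklist send in its own goroutine keeps a worker from blocking while the coordinator is busy, which is what prevents deadlock between the two channels.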
Sometimes URLs have a / on the end and sometimes they don't. Since my map of
visited pages is keyed on the raw URL string, the two forms look like
different pages, so I sometimes revisit pages I have in fact already parsed.
I don't know the net/url package well enough yet to say what the cleanest fix
is.
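One plausible fix, sketched below, is to normalize each URL before using it as a map key: parse it with net/url, trim the trailing slash from the path, and drop the fragment. This is a guess at where the duplicates come from, not a confirmed diagnosis.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize canonicalizes a URL so that variants naming
// the same page produce the same map key.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""                          // #anchors point into the same page
	u.Path = strings.TrimSuffix(u.Path, "/") // trailing-slash variant
	return u.String(), nil
}

func main() {
	a, _ := normalize("http://www.example.com/about/")
	b, _ := normalize("http://www.example.com/about#team")
	fmt.Println(a == b) // true: both normalize to the same key
}
```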
At the moment, there is no maximum crawl depth: the crawler keeps following child links until the queue empties. A sketch of one way to add a cap follows.
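One way a cap could be added, assuming work items are plain URLs today: carry a depth alongside each URL and refuse to enqueue children past a limit. Both the workItem type and the maxDepth constant here are hypothetical.

```go
// workItem pairs a URL with how many hops it is from the start URL.
// Both this type and maxDepth are hypothetical; yak's queue holds bare URLs.
type workItem struct {
	url   string
	depth int
}

const maxDepth = 5 // hypothetical cap on hops from the start URL

// enqueueChildren appends child links to the queue, but only while the
// parent is still under the depth limit.
func enqueueChildren(queue []workItem, parent workItem, links []string) []workItem {
	if parent.depth >= maxDepth {
		return queue // children would be too deep; drop them
	}
	for _, l := range links {
		queue = append(queue, workItem{url: l, depth: parent.depth + 1})
	}
	return queue
}
```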