I'm loving colly so far. I'm new to the Go programming language and I've just been messing around with your scraping library and found a weird bug.
I was just testing out scraping my website, and then allowing the scraper to scrape Medium. I end up with this error:
(I'm using Go v 1.9 on Linux x86).
This is the code:
package main

import (
	"fmt"

	"github.com/asciimoo/colly"
)

func main() {
	scraper := colly.NewCollector()
	scraper.AllowedDomains = []string{"onslow.io", "medium.com"}

	scraper.OnHTML("a[href]", func(element *colly.HTMLElement) {
		link := element.Attr("href")
		// Print link
		fmt.Printf("Link found: %q -> %s\n", element.Text, link)
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		go scraper.Visit(element.Request.AbsoluteURL(link))
	})

	scraper.OnError(func(request *colly.Response, err error) {
		fmt.Println("Request URL:", request.Request.URL, "failed with response:", request, "\nError:", err)
	})

	scraper.OnRequest(func(request *colly.Request) {
		fmt.Println("Visiting", request.URL.String())
	})

	scraper.Visit("http://onslow.io")
	scraper.Wait()
}
From what I've gathered, it has to do with the Goroutines possibly not syncing properly?
If you have any other ideas on the cause of this, it'd be great to hear them!
Cheers
@AwolDes thanks for the detailed bug report and the kind words. Hopefully 2665d14 fixes the bug.
One minor piece of advice on the code above: the concurrency is unbounded, which is not advisable (you can DoS the target sites or get "Too Many Requests" errors). Use LimitRules to control the allowed parallelism per domain, or spawn a fixed number of goroutines.
Limit example:
Thanks for the quick fix @asciimoo! And thank you for pointing out the Limit() function that colly has. I wasn't aware of this, I'll be sure to implement it.