
Runtime Error: invalid memory address or nil pointer dereference #35

Closed
AwolDes opened this issue Oct 18, 2017 · 2 comments

Comments


AwolDes commented Oct 18, 2017

Hey mate!

I'm loving colly so far. I'm new to the Go programming language and I've just been messing around with your scraping library and found a weird bug.

I was just testing out scraping my website, and then allowing the scraper to scrape Medium. I end up with this error:
(screenshot of the panic output: runtime error: invalid memory address or nil pointer dereference)
(I'm using Go v 1.9 on Linux x86).

This is the code:

package main

import (
    "fmt"

    "github.com/asciimoo/colly"
)

func main() {
    scraper := colly.NewCollector()
    scraper.AllowedDomains = []string{"onslow.io", "medium.com"}

    scraper.OnHTML("a[href]", func(element *colly.HTMLElement) {
        link := element.Attr("href")
        // Print link
        fmt.Printf("Link found: %q -> %s\n", element.Text, link)
        // Visit link found on page.
        // Only links within AllowedDomains are visited.
        go scraper.Visit(element.Request.AbsoluteURL(link))
    })

    scraper.OnError(func(response *colly.Response, err error) {
        fmt.Println("Request URL:", response.Request.URL, "failed with response:", response, "\nError:", err)
    })

    scraper.OnRequest(func(request *colly.Request) {
        fmt.Println("Visiting", request.URL.String())
    })

    scraper.Visit("http://onslow.io")
    scraper.Wait()
}

From what I've gathered, it may have to do with the goroutines not synchronizing properly?

If you have any other ideas on the cause of this, it'd be great to hear them!

Cheers

@asciimoo
Member

@AwolDes thanks for the detailed bug report and the nice words. Hopefully 2665d14 fixes the bug.
A minor piece of advice for the above code: the concurrency is unlimited, which is not advised (you could DoS the targets or get "Too Many Requests" errors). Use LimitRules to control the allowed parallelism per domain, or spawn a fixed number of goroutines.
Limit example:

scraper.Limit(&colly.LimitRule{DomainGlob: "onslow.io", Parallelism: 2})
scraper.Limit(&colly.LimitRule{DomainGlob: "medium.com", Parallelism: 5})

@AwolDes
Author

AwolDes commented Oct 19, 2017

Thanks for the quick fix @asciimoo! And thank you for pointing out colly's Limit() function. I wasn't aware of it; I'll be sure to use it.
