Add support for robots.txt #1
Comments
Is it still open for contribution?
Hi, it is, but I do not use this project anymore. I'd recommend sjdirect/abot or another crawler that uses the new async HttpClient.
It would be much appreciated if you could tell me which pattern you used for handling threads and for fetching links from websites. That would make it easier for me to dig into this project and update it with the latest C# concepts.
Sorry for the late reply, I've been away. There is not much to it really. I am using the APM model (https://msdn.microsoft.com/en-us/library/ms228963(v=vs.110).aspx) in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Downloader.cs. Then there are two threads which control all Downloaders in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/DownloaderQueue.cs. Links are scraped using the HtmlAgilityPack in the Page class (https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Page.cs). But please, do not build on this code. It is much better to use modern techniques such as async/await and System.Net.Http.HttpClient. I can spike out an example for you if you'd like?
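For illustration, a rough sketch of what the async/await approach could look like. This is not code from Harvest: `AsyncDownloader`, `FetchAllAsync` and `maxConcurrency` are hypothetical names, and the semaphore is just one possible stand-in for the two queue threads described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Harvest.
public static class AsyncDownloader
{
    // Downloads every distinct URL, running at most maxConcurrency requests
    // at a time. URLs whose request fails are left out of the result.
    public static async Task<Dictionary<Uri, string>> FetchAllAsync(
        HttpClient client, IEnumerable<Uri> urls, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        async Task<(Uri Url, string Body)> FetchOneAsync(Uri url)
        {
            await gate.WaitAsync();
            try
            {
                return (url, await client.GetStringAsync(url));
            }
            catch (HttpRequestException)
            {
                // Network errors and non-success status codes end up here.
                return (url, null);
            }
            finally
            {
                gate.Release();
            }
        }

        // Start all downloads; the semaphore limits how many run at once.
        var results = await Task.WhenAll(urls.Distinct().Select(FetchOneAsync));

        return results.Where(r => r.Body != null)
                      .ToDictionary(r => r.Url, r => r.Body);
    }
}
```

A single `HttpClient` instance would normally be shared across calls. Timeouts surface as `TaskCanceledException`, which this sketch deliberately does not catch.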
We should have built in features to handle robots.txt.
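As a starting point, a minimal sketch of what such a check might look like. `RobotsRules`, `Parse` and `IsAllowed` are hypothetical names, not existing Harvest APIs, and the sketch assumes plain prefix rules only: it reads `User-agent`, `Allow` and `Disallow` lines and ignores wildcards, `Crawl-delay` and `Sitemap`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Harvest.
public sealed class RobotsRules
{
    private readonly List<(bool Allow, string Prefix)> _rules;

    private RobotsRules(List<(bool Allow, string Prefix)> rules) { _rules = rules; }

    // Collects the rules from groups naming userAgent, or from the "*" groups
    // when no group names it.
    public static RobotsRules Parse(string robotsTxt, string userAgent)
    {
        var specific = new List<(bool Allow, string Prefix)>();
        var star = new List<(bool Allow, string Prefix)>();
        var targets = new List<List<(bool Allow, string Prefix)>>();
        bool matchedSpecific = false, lastWasAgent = false;

        foreach (var raw in robotsTxt.Split('\n'))
        {
            var line = raw;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);   // strip comments
            int colon = line.IndexOf(':');
            if (colon <= 0) continue;                        // blank or malformed line

            var field = line.Substring(0, colon).Trim().ToLowerInvariant();
            var value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                if (!lastWasAgent) targets.Clear();          // a new group starts here
                lastWasAgent = true;
                if (value == "*")
                {
                    targets.Add(star);
                }
                else if (value.Length > 0 &&
                         userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    targets.Add(specific);
                    matchedSpecific = true;
                }
            }
            else if (field == "allow" || field == "disallow")
            {
                lastWasAgent = false;
                if (value.Length == 0) continue;             // empty rule matches nothing
                foreach (var target in targets) target.Add((field == "allow", value));
            }
        }
        return new RobotsRules(matchedSpecific ? specific : star);
    }

    // The longest matching prefix wins; Allow wins a tie; no match means allowed.
    public bool IsAllowed(string path)
    {
        bool allowed = true;
        int best = -1;
        foreach (var (allow, prefix) in _rules)
        {
            if (!path.StartsWith(prefix, StringComparison.Ordinal)) continue;
            if (prefix.Length > best || (prefix.Length == best && allow))
            {
                best = prefix.Length;
                allowed = allow;
            }
        }
        return allowed;
    }
}
```

A crawler would fetch `/robots.txt` once per host, call `Parse` with its own user-agent string, and skip any URL whose path fails `IsAllowed` before queuing it.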