Add support for robots.txt #1

Open
pinscript opened this issue Feb 25, 2014 · 4 comments

Comments

@pinscript
Owner

We should have built-in support for handling robots.txt.
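
As a rough illustration of what such support could mean, here is a minimal sketch of a robots.txt check. It is not code from Harvest: the `RobotsRules` class and its methods are hypothetical names, and the parser is deliberately simplified. It assumes only `User-agent: *` groups matter and treats every `Disallow:` value as a plain path prefix, ignoring `Allow`, wildcards, and `Crawl-delay`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Harvest. Simplified on purpose:
// only "User-agent: *" groups and prefix-style "Disallow:" rules.
public sealed class RobotsRules
{
    private readonly List<string> _disallowed = new List<string>();

    public static RobotsRules Parse(string robotsTxt)
    {
        var rules = new RobotsRules();
        bool inWildcardGroup = false;

        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip trailing comments and surrounding whitespace (including '\r').
            var line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;

            var field = line.Substring(0, colon).Trim().ToLowerInvariant();
            var value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
                inWildcardGroup = value == "*";
            else if (field == "disallow" && inWildcardGroup && value.Length > 0)
                rules._disallowed.Add(value); // an empty Disallow means "allow all"
        }
        return rules;
    }

    // True when no recorded Disallow prefix matches the start of the path.
    public bool IsAllowed(string path)
    {
        foreach (var prefix in _disallowed)
            if (path.StartsWith(prefix, StringComparison.Ordinal))
                return false;
        return true;
    }
}
```

A crawler would fetch `/robots.txt` once per host, parse it, and consult the result before queueing each URL:

```csharp
var rules = RobotsRules.Parse("User-agent: *\nDisallow: /private/\n");
Console.WriteLine(rules.IsAllowed("/private/page.html")); // False
Console.WriteLine(rules.IsAllowed("/index.html"));        // True
```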

@adeelkhalid992

Is it still open for contribution?

@pinscript
Owner Author

Hi,

It is, but I do not use this project anymore. I'd recommend sjdirect/abot or another crawler that uses the new async HttpClient.

@adeelkhalid992

It would be much appreciated if you could tell me which pattern you used for handling threads and for fetching links from websites, so that it is easier for me to dig into this project and update it with the latest C# concepts.

@pinscript
Owner Author

Sorry for the late reply, I've been away.

There is not much to it really. I am using the Asynchronous Programming Model (APM) (https://msdn.microsoft.com/en-us/library/ms228963(v=vs.110).aspx) in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Downloader.cs.

Then there are two threads that control all the Downloaders, in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/DownloaderQueue.cs.

Links are scraped using the HtmlAgilityPack in the Page class (https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Page.cs).

But please, do not build on this code. It is much better to use modern techniques such as async/await and System.Net.Http.HttpClient.

I can spike out an example for you if you'd like?
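
A rough sketch of the async/await approach, to give an idea of the shape. This is not how Harvest works and not a finished design: `AsyncCrawlSketch`, `ExtractLinks`, `CrawlAsync`, and `maxParallel` are hypothetical names, and a `SemaphoreSlim` is swapped in here as one possible replacement for the two controller threads. It assumes the HtmlAgilityPack NuGet package is referenced and leaves out error handling, so a failed request will throw out of `CrawlAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package; the library named above for link scraping

public static class AsyncCrawlSketch
{
    // One shared HttpClient instead of per-request APM Begin/End calls.
    private static readonly HttpClient Client = new HttpClient();

    // Pure helper: collect absolute link targets from an HTML string.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();
        // SelectNodes returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            // Resolve relative hrefs against the page URL; skip unparsable ones.
            if (Uri.TryCreate(baseUri, href, out var absolute))
                links.Add(absolute);
        }
        return links;
    }

    // Download pages with at most maxParallel requests in flight at once.
    public static async Task<Dictionary<Uri, List<Uri>>> CrawlAsync(
        IEnumerable<Uri> urls, int maxParallel)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Distinct().Select(async url =>
            {
                await gate.WaitAsync();
                try
                {
                    var html = await Client.GetStringAsync(url);
                    return new KeyValuePair<Uri, List<Uri>>(url, ExtractLinks(html, url));
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            var results = await Task.WhenAll(tasks);
            return results.ToDictionary(r => r.Key, r => r.Value);
        }
    }
}
```

The semaphore caps concurrency without dedicating threads to scheduling: while a download is pending, no thread is blocked waiting on it.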
