Check visitors for the google bots' user agent strings #2
Comments
Please don't do that. I use the Googlebot user agent all the time; it lets me view forum content that other users would need to register for. Either use a reverse DNS lookup or an unofficial IP list. Otherwise, if I visited the site with my browser, it would get deleted.
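A minimal sketch of the reverse DNS check suggested above, in Python for illustration only (it is not taken from this project). It assumes the commonly documented verification procedure: the reverse lookup of the visitor's IP should yield a hostname under `googlebot.com` or `google.com`, and a forward lookup of that hostname should return the same IP. The function name and the injectable `reverse`/`forward` parameters are hypothetical, added so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Hostname suffixes assumed to identify Google's crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if reverse and forward DNS both agree the IP is Google's."""
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        # No PTR record or the lookup failed: treat as unverified.
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup guards against a spoofed PTR record.
        return ip in forward(host)
    except OSError:
        return False
```

With this kind of check, a browser that merely sends the Googlebot user agent would fail verification, because its IP would not resolve to a Google hostname. The default forward lookup here only handles IPv4.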
Obviously, don't use the user agent as the only indicator. But it could tell you when to check the index.
Visited by Googlebot !== indexed in Google.
@dfyx meant that a page cannot be indexed by Google unless a Google bot has visited it, not vice versa. Checking whether the page is indexed should start only after a Google bot visits it. That makes sense to me.
Got it, thanks.
@ErtugKaya exactly. The page needs to be visited at least once by a crawler to be indexed. After that it gets tricky, because a crawler could also index other search engines, so you might get a visit from search engine A while your site also turns up in search engine B.
So this is just an optimization? Once the site has been visited, it might take a while to appear in the index, so you'll do many queries anyway.
If you query search engines, you can do this once every 12 hours. Also, most browsers send the referrer header to your site: if a referrer points to a search engine, run the query for your site, and if it appears, delete the site.
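The referrer idea above could look roughly like the following Python sketch. Everything named here is hypothetical: the domain list is illustrative and incomplete (country-specific domains such as `google.de` are not covered), and `should_query_index` is just one way to combine the 12-hour schedule with a referrer-triggered check.

```python
import time
from typing import Optional
from urllib.parse import urlparse

# Illustrative list only; a real deployment would need a fuller set.
SEARCH_ENGINE_DOMAINS = ("google.com", "bing.com", "duckduckgo.com")
QUERY_INTERVAL_SECONDS = 12 * 60 * 60  # the 12-hour interval from the comment


def is_search_referrer(referrer: Optional[str]) -> bool:
    """True if the Referer header points at one of the listed search engines."""
    if not referrer:
        return False
    host = (urlparse(referrer).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SEARCH_ENGINE_DOMAINS)


def should_query_index(
    referrer: Optional[str],
    last_query_at: Optional[float],
    now: Optional[float] = None,
) -> bool:
    """Decide whether to run a search-engine query for the site right now."""
    now = time.time() if now is None else now
    # A visitor arriving from a search engine is strong evidence of indexing,
    # so check immediately regardless of the schedule.
    if is_search_referrer(referrer):
        return True
    # Otherwise fall back to the periodic check.
    return last_query_at is None or now - last_query_at >= QUERY_INTERVAL_SECONDS
```

The caller would record the time of each query and delete the site once a query actually finds it, as described above.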
This would be an interesting technical optimization, but in this particular case the choice to Google constantly was an intentional conceptual decision, not a technical one (part of what made the idea enjoyable for me was the humor of "a website that googles itself constantly"). If someone would like to build this and others would find it useful, please open a Pull Request and I will either merge it into an alternate branch or link to the fork in the README.
Instead of constantly googling (which needs a lot of resources and gives you a delayed result), you could just check all visitors for the user agent strings Google uses: https://support.google.com/webmasters/answer/1061943?hl=en
Once you see a bot, you could probably still search for the page just to confirm that it made it into the index (I'd guess it takes a few hours, but I'm not sure).
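A rough Python sketch of this proposal, for illustration only. It assumes that a case-insensitive match on the `Googlebot` token is enough to flag a candidate crawler visit; the `IndexWatcher` class and its method names are hypothetical. As discussed in the thread, the user agent is easy to spoof, so a match here only signals when to start checking the index and should never trigger deletion by itself.

```python
from typing import Optional


def looks_like_googlebot(user_agent: Optional[str]) -> bool:
    """Cheap, spoofable hint: does the User-Agent header mention Googlebot?"""
    return bool(user_agent) and "googlebot" in user_agent.lower()


class IndexWatcher:
    """Remembers whether a crawler-like visitor has been seen yet."""

    def __init__(self) -> None:
        self.bot_seen = False

    def record_visit(self, user_agent: Optional[str]) -> bool:
        """Record one request; return True once index checks should be running."""
        if looks_like_googlebot(user_agent):
            self.bot_seen = True
        return self.bot_seen
```

Example usage: call `record_visit` for every request and only begin searching for the page once it returns `True`.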