
Check visitors for the google bots' user agent strings #2

Closed
dfyx opened this issue Mar 9, 2015 · 9 comments

Comments

@dfyx

dfyx commented Mar 9, 2015

Instead of constantly googling (which uses a lot of resources and gives you a delayed result), you could just check all visitors for the user agent strings Google uses: https://support.google.com/webmasters/answer/1061943?hl=en

Once you see a bot, you could probably still search for the page just to confirm that it made it into the index (I'd guess it takes a few hours, but I'm not sure).
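
The trigger described above could be sketched roughly as follows. This is a hypothetical illustration, not code from this project; the substrings are taken from Google's published crawler user-agent list, and as noted later in the thread, a matching user agent is only a hint, since the header is trivially spoofed.

```python
import re

# Matches the tokens Google documents for its crawlers (e.g. "Googlebot",
# "Googlebot-Image"). A match means "maybe a Google crawler", nothing more.
GOOGLE_BOT_PATTERN = re.compile(r"Googlebot", re.IGNORECASE)

def looks_like_googlebot(user_agent: str) -> bool:
    """Return True if the User-Agent header resembles a Google crawler."""
    return bool(GOOGLE_BOT_PATTERN.search(user_agent or ""))
```

A site could call this on each request and only start polling the search index after the first match, instead of querying on a fixed schedule.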

@AyrA

AyrA commented Mar 9, 2015

Please don't do that. I use the Googlebot user agent all the time. It allows me to view some content on forums for which other users would need to register. Either use a reverse DNS lookup or an unofficial IP list. If I visited the site with my browser, it would get deleted.
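
The reverse DNS approach mentioned here is the verification method Google itself documents: do a reverse lookup on the visitor's IP, check that the hostname ends in googlebot.com or google.com, then do a forward lookup on that hostname and confirm it resolves back to the same IP. A minimal sketch using only the standard library:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for Googlebot.

    1. Reverse-resolve the IP to a hostname (PTR record).
    2. Require the hostname to end in googlebot.com or google.com.
    3. Forward-resolve that hostname and require it to map back to the IP.
    A spoofed User-Agent fails step 2 or 3, since the attacker does not
    control Google's DNS zones.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        addrs = socket.gethostbyname_ex(host)[2]     # forward lookup
        return ip in addrs
    except (socket.herror, socket.gaierror):
        return False
```

This costs two DNS queries per candidate visitor, so in practice one would only run it for requests whose user agent already claims to be Googlebot.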

@dfyx
Author

dfyx commented Mar 9, 2015

Obviously don't use the user agent as the only indicator. But it could tell you when to check the index.

@fabiosussetto

visited by googlebot !== indexed in google

@ErtugKaya

@dfyx meant that unless the page is visited by a Google bot, it cannot be indexed by Google. Not vice versa.

Checking whether the page is indexed should start only after a Google bot visits it. That makes sense to me.

@fabiosussetto

Got it, thanks.

@AyrA

AyrA commented Mar 10, 2015

@ErtugKaya exactly. The page needs to be visited at least once by a crawler to be indexed. After that it gets tricky, because a crawler can also feed other search engines' indexes: you might see a visit from search engine A, but your site might also turn up in search engine B.

Microsoft did that at least once

@remram44

So this is just an optimization? Once the site has been visited, it might take a while to appear in the index, so you'll do many queries anyway.

@AyrA

AyrA commented Mar 11, 2015

If you query search engines, you can do this once every 12 hours. Also, most browsers send the Referer header to your site. If a referrer points to a search engine, run the query for your site, and if it appears, delete the site.
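
The referrer trigger could look something like this. A hypothetical sketch, not part of the project: the host list is an assumption, and the Referer header is optional and spoofable, so this only decides *when* to run the index query, not whether the site is indexed.

```python
from urllib.parse import urlparse

# Hosts whose visits suggest the visitor arrived from a search results page.
# Country-specific domains (google.de, etc.) would need more entries.
SEARCH_ENGINE_DOMAINS = ("google.com", "bing.com", "duckduckgo.com")

def came_from_search_engine(referer: str) -> bool:
    """Return True if the Referer header points at a known search engine."""
    host = urlparse(referer or "").hostname or ""
    return any(host == d or host.endswith("." + d)
               for d in SEARCH_ENGINE_DOMAINS)
```

On a match, the site would fire its "are we indexed yet?" query immediately instead of waiting for the next 12-hour poll.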

@mroth
Owner

mroth commented Mar 13, 2015

This would be an interesting technical optimization -- but in this particular case the choice to Google constantly was an intentional conceptual decision, not a technical one. (i.e. part of the crux of what made the idea enjoyable for me was the "a website that googles itself constantly" humor)

If someone would like to build this and others will find it useful, please open a Pull Request and I will either merge it to an alternate branch or link to the fork in the README.

@mroth mroth closed this as completed Mar 13, 2015