Prevent google crawling, or make it faster. #24

Closed
jasononeil opened this issue Mar 18, 2015 · 3 comments

Comments

@jasononeil
Owner

We just had Googlebot start crawling "preview.lib.haxe.org". (Still not sure where it scraped the URL from, but oh well.)

It hit the File Browser, which currently displays a source file by opening the haxelib zip, unpacking the file, rendering it, and sending it to the client. Needless to say, with the tens (hundreds?) of thousands of files, this was causing significant strain on the server.

I've turned the preview site off for now until I fix this, either with a faster (cached?) implementation, or by using robots.txt to block Google from the file browser section.
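
For the robots.txt option, here is a minimal sketch of what such a rule could look like, checked with Python's standard `urllib.robotparser`. The `/files/` path prefix is a hypothetical placeholder, not the site's actual file browser route, and `is_crawlable` is just an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: "/files/" stands in for wherever the file browser lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /files/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that robots.txt is advisory: well-behaved crawlers honor it, but it does nothing to make the file browser itself cheaper to serve.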

@markknol
Collaborator

Ah that's odd. Google reads our mail!
Since most content is static per lib version, you could just render the stuff out once (if it doesn't exist yet) to plain HTML, store that in cache or on disk, and serve that?
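
A rough sketch of that render-once idea, assuming the rendered output can be keyed by lib name, version and file path. The function name `cached_render`, the cache directory layout and the `render` callback are all made up for illustration; the real site is not written in Python.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_render(cache_dir: Path, lib: str, version: str, file_path: str,
                  render: Callable[[], str]) -> str:
    """Return rendered HTML for one file of one lib version, rendering at most once."""
    # A released version's content does not change, so (lib, version, path)
    # is a stable cache key. Hashing it avoids unsafe characters in file names.
    key = hashlib.sha256(f"{lib}\0{version}\0{file_path}".encode("utf-8")).hexdigest()
    cached = cache_dir / f"{key}.html"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    html = render()  # the expensive step: open the zip, unpack the file, render it
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html
```

The first request for a file pays the unzip-and-render cost; every later request, crawler or not, is a plain file read.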

@jasononeil
Owner Author

Yes, I think that's a suitable solution for text files; we could cache them
in the DB. Images and binaries we can perhaps block from web crawlers, as
they won't be valuable to search results and might be too large to
suitably cache, especially some of the ndll files etc.
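
To make the text-versus-binary split concrete, here is a small sketch that uses SQLite as a stand-in for "the DB". The extension list, table name and function names are assumptions for illustration only.

```python
import sqlite3
from typing import Callable, Optional

# Assumed set of extensions worth caching; anything else is treated as binary.
TEXT_EXTENSIONS = {".hx", ".hxml", ".json", ".xml", ".md", ".txt"}


def is_text_file(path: str) -> bool:
    """Crude check by extension; binaries such as .ndll come back False."""
    dot = path.rfind(".")
    return dot != -1 and path[dot:].lower() in TEXT_EXTENSIONS


def open_cache(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database and create the table on first use."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rendered ("
        "lib TEXT, version TEXT, path TEXT, html TEXT, "
        "PRIMARY KEY (lib, version, path))"
    )
    return conn


def get_rendered(conn: sqlite3.Connection, lib: str, version: str, path: str,
                 render: Callable[[], str]) -> Optional[str]:
    """Return cached HTML for a text file, rendering once; None for binaries."""
    if not is_text_file(path):
        return None  # never cached; these paths would be blocked from crawlers instead
    row = conn.execute(
        "SELECT html FROM rendered WHERE lib = ? AND version = ? AND path = ?",
        (lib, version, path),
    ).fetchone()
    if row is not None:
        return row[0]
    html = render()
    conn.execute("INSERT INTO rendered VALUES (?, ?, ?, ?)", (lib, version, path, html))
    conn.commit()
    return html
```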


@markknol
Collaborator

markknol commented Apr 9, 2015

What is the state of this?

jasononeil added a commit that referenced this issue May 8, 2015
See #24

This mostly solves it, though I should still do some DB caching.