
robots.txt doesn't behave as expected #1948

Closed

watson opened this issue Dec 18, 2018 · 11 comments

Comments

@watson (Member) commented Dec 18, 2018

Our robots.txt file currently contains this:

User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/

I'm not sure of the reason for disallowing /docs/, but whatever the case, I don't think it has the intended effect. Instead of removing it from Google, it just seems to remove Google's ability to show any meaningful content for the link, but Google still links to pages under /docs/.

Example: A search for "node.js util.inherits" shows this:

[screenshot: the search result appears with no page description]

If you follow the "Learn why" link, you're told that:

[...] the website prevented Google from creating a page description, but didn't actually hide the page from Google.
[...] You are seeing this result because the page is blocked by a robots.txt file on your website. (robots.txt tells Google not to read your page; if you block the page to us, we can't create a page description in search results.)
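
For context: robots.txt only controls crawling, not indexing. The usual way to actually keep a page out of Google's index is a noindex signal on the page itself, which Google can only see if robots.txt does not block the page. A minimal sketch:

```html
<!-- Sketch: a noindex directive in a page's <head>. Only effective if
     robots.txt allows Google to crawl the page and read this tag. -->
<meta name="robots" content="noindex">
```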

@fhemberger (Contributor)

Hmmm … 🤔 I have no idea why we've set it that way. Maybe @rvagg remembers?
I think this goes back to the early website days and hasn't been touched ever since.

@rvagg (Member) commented Dec 18, 2018

Nope! And the use of /dist/ instead of /download/ also rattles my bones; I desperately want /dist/ to be deprecated.

I think robots.txt is up for a complete revision: someone suggest a new one that makes sense and make it so. https://github.com/nodejs/nodejs.org/blob/master/static/robots.txt

We had a discussion recently about the best entry point to the docs, and I think we had differing opinions. Some people prefer /docs/, some go through /api/ (I do). The docs themselves suggest that /docs/latest-*/api/ is the "official" way.

@carlos-ds

Hi,

I believe the correct steps to ensure proper indexing are:

  1. Create a sitemap
  2. Upload the sitemap to the root directory
  3. Modify robots.txt and refer to the sitemap in it (see the sketch after this list)
  4. Submit robots.txt via Google Search Console or wait for Google to re-crawl the site
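
A minimal sketch of step 3, assuming the sitemap ends up at the site root (the Allow/Disallow rules themselves are still up for revision, per the discussion above):

```
User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/

Sitemap: https://nodejs.org/sitemap.xml
```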

With your permission, I could make a proposal for an XML sitemap based on the output of one of the third-party tools Google suggests at https://support.google.com/webmasters/answer/183668?hl=en (although most of them seem to be paid or dead).

Please let me know, I'd be happy to help.

@alexandrtovmach (Contributor)

@carlos-ds Thank you for the help and initiative. It'd be great if you could investigate it and create a PR.

@carlos-ds

OK, thanks @alexandrtovmach.

I suggest the following approach:

  • I'm going to use the sitemap-generator CLI (https://www.npmjs.com/package/sitemap-generator) to generate an XML sitemap for https://nodejs.org (see the sketch after this list). I'm testing the tool right now.
  • I will go through the generated sitemap manually to review the result, adjust the generator config accordingly, then re-run the generator and validate again until the result looks right
  • Someone should validate this XML sitemap
  • Once the sitemap is validated, I can get to work on making changes to robots.txt
  • Someone should validate the changes to robots.txt
  • Robots.txt should be modified on the production server and someone with access to Google Search Console of https://nodejs.org should submit the XML sitemap
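
A minimal sketch of the generation step, based on the documented API of the sitemap-generator package (the option values are assumptions to be tuned, not final config):

```js
// generate-sitemap.js -- sketch using the sitemap-generator package
const SitemapGenerator = require('sitemap-generator');

// Crawl the live site and write the result to ./sitemap.xml
const generator = SitemapGenerator('https://nodejs.org', {
  filepath: './sitemap.xml', // where the generated XML is written
  stripQuerystring: true,    // collapse URLs that differ only by query string
});

generator.on('error', (error) => console.error(error)); // log crawl errors
generator.on('done', () => console.log('sitemap.xml written'));

generator.start(); // begin crawling
```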

Does that sound like a good approach for this issue? I'd appreciate your feedback.

@ovflowd (Member) commented Mar 12, 2023

@richardlau I was rechecking this issue and nodejs.org/robots.txt gives a 404, even though the file is referenced here (https://github.com/nodejs/nodejs.org/blob/main/public/robots.txt). Is this an nginx bug? Opening nodejs.org/manifest.json works (https://github.com/nodejs/nodejs.org/blob/main/public/manifest.json)

@richardlau (Member) commented Mar 13, 2023

> I was rechecking this issue and nodejs.org/robots.txt gives a 404 [...] Is this an nginx bug?

@ovflowd It's currently aliased -- I presume it has moved as part of the Next.js rewrite?
https://github.com/nodejs/build/blob/faf52ba2a17983598637dc0f1c918451299a38ad/ansible/www-standalone/resources/config/nodejs.org?plain=1#L328-L331
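
(For readers without access to that file: the linked lines are nginx alias rules. A representative sketch with a hypothetical filesystem path, not the verbatim production config:)

```nginx
# Illustrative only -- the real config lives in nodejs/build (link above).
# Serves robots.txt from the old static directory, bypassing the site build.
location = /robots.txt {
    alias /path/to/www/static/robots.txt;  # hypothetical path
}
```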

@ovflowd (Member) commented Mar 13, 2023

True, I moved it from the static folder to the root of the public folder. Can we maybe remove those aliases from there?

@ovflowd (Member) commented Mar 13, 2023

I made an update in this PR (nodejs/build#3139). I still believe we could try that PR out, to see if everything is ✅.

@ovflowd (Member) commented Mar 13, 2023

We could also, for now, hot-fix the nginx config and remove those aliases. Either way, I feel confident enough that the new nginx config is working. You can create a temporary file and use nginx -t to check whether the config is valid.
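
A quick sketch of that check; -t tests the configuration and -c points nginx at a specific file (the path here is hypothetical):

```sh
# Validate a candidate config without touching the running server
sudo nginx -t -c /tmp/nodejs-org-test.conf
```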

@ovflowd (Member) commented Mar 15, 2023

Closing as fixed.

@ovflowd closed this as completed Mar 15, 2023