
robots.txt doesn't behave as expected #1948

Closed

watson opened this issue Dec 18, 2018 · 11 comments

Comments

@watson (Member) commented Dec 18, 2018

Our robots.txt file currently contains this:

User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/

I'm not sure of the reason for disallowing /docs/, but whatever the case, I don't think it has the intended effect. Instead of removing it from Google, it just seems to remove Google's ability to show any meaningful content for the link, but Google still links to pages under /docs/.

Example: A search for "node.js util.inherits" shows this:

[screenshot: the search result appears with no page description]

If you follow the "Learn why" link, you're told that:

[...] the website prevented Google from creating a page description, but didn't actually hide the page from Google.
[...] You are seeing this result because the page is blocked by a robots.txt file on your website. (robots.txt tells Google not to read your page; if you block the page to us, we can't create a page description in search results.)
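
For context: robots.txt only controls crawling, not indexing. The usual way to actually keep a page out of Google's index is a noindex signal on the page itself, which Google can only see if robots.txt does not block the page. A minimal sketch:

```html
<!-- Sketch: a noindex directive in a page's <head>. Only effective if
     robots.txt allows Google to crawl the page and read this tag. -->
<meta name="robots" content="noindex">
```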

@fhemberger (Contributor)

Hmmm … 🤔 I have no idea why we've set it that way. Maybe @rvagg remembers?
I think this goes back to the early website days and hasn't been touched ever since.

@rvagg (Member) commented Dec 18, 2018

Nope! And the use of /dist/ instead of /download/ also rattles my bones; I desperately want /dist/ to be deprecated.

I think robots.txt is up for a complete revision: someone suggest a new one that makes sense and make it so. https://github.com/nodejs/nodejs.org/blob/master/static/robots.txt

We had a discussion recently about the best entry point to the docs, and I think we had differing opinions. Some people prefer /docs/, some go through /api/ (I do). The docs themselves suggest that /docs/latest-*/api/ is the "official" way.

@carlos-ds

Hi,

I believe the correct steps to ensure proper indexing are:

  1. Create a sitemap
  2. Upload the sitemap to the root directory
  3. Modify robots.txt and refer to the sitemap in it (see the sketch after this list)
  4. Submit robots.txt via Google Search Console or wait for Google to re-crawl the site
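
A minimal sketch of step 3, assuming the sitemap ends up at the site root (the Allow/Disallow rules themselves are still up for revision, per the discussion above):

```
User-Agent: *
Disallow: /dist/
Disallow: /docs/
Allow: /dist/latest/
Allow: /dist/latest/docs/api/
Allow: /api/

Sitemap: https://nodejs.org/sitemap.xml
```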

With your permission, I could make a proposal for an XML sitemap based on the output of one of the third-party tools Google suggests at https://support.google.com/webmasters/answer/183668?hl=en (although most of them seem to be paid or dead).

Please let me know, I'd be happy to help.

@alexandrtovmach (Contributor)

@carlos-ds Thank you for the help and initiative. It'd be great if you could investigate it and create a PR.

@carlos-ds

OK, thanks @alexandrtovmach.

I suggest the following approach:

  • I'm going to use the sitemap-generator CLI (https://www.npmjs.com/package/sitemap-generator) to generate an XML sitemap for https://nodejs.org (see the sketch after this list). I'm testing the tool right now.
  • I will go through the generated sitemap manually to review the result, adjust the generator config accordingly, then re-run the generator and validate again until the result looks right
  • Someone should validate this XML sitemap
  • Once the sitemap is validated, I can get to work on making changes to robots.txt
  • Someone should validate the changes to robots.txt
  • Robots.txt should be modified on the production server and someone with access to Google Search Console of https://nodejs.org should submit the XML sitemap
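
A minimal sketch of the generation step, based on the documented API of the sitemap-generator package (the option values are assumptions to be tuned, not final config):

```js
// generate-sitemap.js -- sketch using the sitemap-generator package
const SitemapGenerator = require('sitemap-generator');

// Crawl the live site and write the result to ./sitemap.xml
const generator = SitemapGenerator('https://nodejs.org', {
  filepath: './sitemap.xml', // where the generated XML is written
  stripQuerystring: true,    // collapse URLs that differ only by query string
});

generator.on('error', (error) => console.error(error)); // log crawl errors
generator.on('done', () => console.log('sitemap.xml written'));

generator.start(); // begin crawling
```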

Does that sound like a good approach for this issue? I'd appreciate your feedback.

@ovflowd (Member) commented Mar 12, 2023

@richardlau I was rechecking this issue and nodejs.org/robots.txt gives a 404, even though the file is referenced here (https://github.com/nodejs/nodejs.org/blob/main/public/robots.txt). Is this an nginx bug? Opening nodejs.org/manifest.json works (https://github.com/nodejs/nodejs.org/blob/main/public/manifest.json)

@richardlau (Member) commented Mar 13, 2023

> I was rechecking this issue and nodejs.org/robots.txt gives a 404 [...] Is this an nginx bug?

@ovflowd It's currently aliased -- I presume it has moved as part of the Next.js rewrite?
https://github.com/nodejs/build/blob/faf52ba2a17983598637dc0f1c918451299a38ad/ansible/www-standalone/resources/config/nodejs.org?plain=1#L328-L331
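
(For readers without access to that file: the linked lines are nginx alias rules. A representative sketch with a hypothetical filesystem path, not the verbatim production config:)

```nginx
# Illustrative only -- the real config lives in nodejs/build (link above).
# Serves robots.txt from the old static directory, bypassing the site build.
location = /robots.txt {
    alias /path/to/www/static/robots.txt;  # hypothetical path
}
```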

@ovflowd (Member) commented Mar 13, 2023

True, I moved it from the static folder to the root of the public folder. Can we maybe remove those aliases from there?

@ovflowd (Member) commented Mar 13, 2023

I made an update in this PR (nodejs/build#3139). I still believe we could try that PR out, to see if everything is ✅.

@ovflowd (Member) commented Mar 13, 2023

We could also, for now, hot-fix the nginx config and remove those aliases. Either way, I feel confident enough that the new nginx config is working. You can create a temporary file and use nginx -t to check whether the config is valid.
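
A quick sketch of that check; -t tests the configuration and -c points nginx at a specific file (the path here is hypothetical):

```sh
# Validate a candidate config without touching the running server
sudo nginx -t -c /tmp/nodejs-org-test.conf
```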

@ovflowd (Member) commented Mar 15, 2023

Closing as fixed.

@ovflowd closed this as completed Mar 15, 2023