obey robots.txt is not working #5

Open
notacoder-ui opened this issue Dec 21, 2020 · 6 comments

Comments

@notacoder-ui

Hi,

I have set these rules in robots.txt:
Disallow: /mailto:%20iasdf%66o%40r%65%69asdfdf%2ede
Disallow: /news-letter/unsub

I then started the cron job to index the site, but the job still indexed both of the URLs above.

How can I keep certain URLs from being indexed in the sitemap?
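
For reference, here is a minimal sketch of how such Disallow rules are evaluated by Python's standard `urllib.robotparser`. It is not the extension's code, and the extension may behave differently. The `User-agent: *` line and the `/about` path are assumptions added for illustration; they do not appear in this thread. The helper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if `url` may be fetched according to the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


rules = [
    "User-agent: *",  # assumed; the excerpt in this issue shows only the Disallow lines
    "Disallow: /news-letter/unsub",
]

# The disallowed path is rejected, an unrelated (hypothetical) path is accepted.
print(is_allowed(rules, "https://www.xyz.de/news-letter/unsub"))
print(is_allowed(rules, "https://www.xyz.de/about"))

# Without a preceding User-agent line, this parser ignores the Disallow rule.
print(is_allowed(rules[1:], "https://www.xyz.de/news-letter/unsub"))
```

With this particular parser, Disallow lines that are not preceded by a User-agent line have no effect, so it may be worth checking that the complete robots.txt contains one.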

@notacoder-ui
Author

I also got this in my TYPO3 error log:

Mon, 21 Dec 2020 05:56:00 +0000 [ERROR] request="e909aaa824fb3" component="INM.InmGooglesitemap.Generators.SitemapGenerator": Extension inm_googlesitemap: Error Code: 5 --- Reason: Socket-stream timed out (timeout set to 5 sec).

This error caused the site to return a 503 error; restarting the php-fpm service brought the site back.

Please check this too.

@merzilla
Owner

Hi @notacoder-ui ,
it may be that other rules override the entries from your robots.txt. Please provide more information: TYPO3 version, PHP version, etc. The settings you made in the Scheduler task are also important.

@notacoder-ui
Author

Okay.

TYPO3 version: 9.5.19
PHP version: 7.2.34
Settings in the Scheduler:

settings

@merzilla
Owner

merzilla commented Dec 21, 2020

Okay. Adding mailto to regexDirectoryExclude will not help you: that would exclude something like https://foo.tld/mailto/something .
You could add news-letter there instead of mailto to exclude that path.
You can also shorten linkExtractionTags: update the field so that it contains only href.
I know that mailto links also sit in an href.
Let me know whether that improves things. If not, I will have to check why mailto is not omitted by default.
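
To illustrate the difference described above, here is a rough sketch of a directory-based exclusion. It is an assumption about how such a filter could work, not the actual implementation behind regexDirectoryExclude; the function name `excluded_by_directory` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def excluded_by_directory(url, directory_names):
    """Return True if the URL path contains one of the names as a full directory segment."""
    path = urlsplit(url).path
    # Build one pattern such as /(?:mailto|news-letter)/ from the escaped names.
    alternatives = "|".join(re.escape(name) for name in directory_names)
    pattern = re.compile(r"/(?:%s)/" % alternatives)
    return bool(pattern.search(path))


# "mailto" as a directory matches a real path segment ...
print(excluded_by_directory("https://foo.tld/mailto/something", ["mailto"]))

# ... but not a path that merely starts with "mailto:".
print(excluded_by_directory(
    "https://www.xyz.de/mailto:%20%69n%66%6f%40%72eise%6cinie%2e%64e", ["mailto"]))

# "news-letter" is a real directory, so this path is excluded.
print(excluded_by_directory("https://www.xyz.de/news-letter/unsub", ["news-letter"]))
```

Under this assumption, a directory rule can catch /news-letter/unsub but never a mailto: link, which is why the two cases need different handling.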

@notacoder-ui
Author

Hi @merzilla

I updated the settings as you suggested and ran the cron job.
The site started returning 503 errors, and I found this in the error log:

Tue, 22 Dec 2020 05:05:01 +0000 [ERROR] request="08db8edc7ac5b" component="INM.InmGooglesitemap.Generators.SitemapGenerator": Extension inm_googlesitemap: Response Header not correct. Got HTTP Status Code 302 for URL https://www.xyz.de/mailto:%20%69n%66%6f%40%72eise%6cinie%2e%64e --- Complete Response Header: HTTP/1.1 302 Found
Date: Tue, 22 Dec 2020 05:05:01 GMT
Server: Apache
X-Powered-By: PHP/7.2.34
location: /404fehler
X-Powered-By: PleskLin
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Cache-Control: public, no-transform, must-revalidate
Last-modified: Mon, 14 Dec 2020 10:10:10 GMT
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
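
The logged URL looks like a mailto: link that was treated as a path on the site. As a hedged sketch of one way a crawler could skip such links before requesting them (an assumption for illustration, not the extension's code; `resolve_crawlable` and `CRAWLABLE_SCHEMES` are hypothetical names):

```python
from urllib.parse import urljoin, urlsplit

# Only these schemes are worth requesting when building a sitemap.
CRAWLABLE_SCHEMES = {"http", "https"}


def resolve_crawlable(base_url, href):
    """Return the absolute URL for `href`, or None if its scheme is not crawlable."""
    href = href.strip()
    scheme = urlsplit(href).scheme.lower()
    # An explicit non-HTTP scheme (mailto:, javascript:, ...) is skipped.
    if scheme and scheme not in CRAWLABLE_SCHEMES:
        return None
    # Relative links and http(s) links are resolved against the page URL.
    return urljoin(base_url, href)


print(resolve_crawlable("https://www.xyz.de/", "mailto:%20%69n%66%6f%40%72eise%6cinie%2e%64e"))
print(resolve_crawlable("https://www.xyz.de/", "/news-letter/unsub"))
```

Checking the scheme before joining the link onto the base URL avoids building an address like the one in the log above.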

@notacoder-ui
Author

Hi @merzilla

I need to change the settings so that certain links are not indexed when the sitemap.xml file is generated.
Is there a setting I can use to exclude such URLs?

Alternatively, a working "obey robots.txt" option would also be fine for me: I could list the URLs there with Disallow so that they are not indexed when a new sitemap.xml is generated.
