cumulativefilter vs crawler #2157

Closed
asaage opened this issue Jul 29, 2020 · 10 comments

@asaage

asaage commented Jul 29, 2020

I recently added a cumulativefilter to a shop page, and since then the crawler/search indexer no longer works properly.
There are ~150 pages/products in total, but the number of pages to be crawled grows to absurd amounts (57000+) and the crawler never catches up.
I suspect the URL parameter, because with the filter unpublished the crawler works as expected.
Filter URLs look like this:
foo.html?cumulativefilter=ODQ7YWRkO2ZpbHRlcl9hdHRyaWJ1dGVzOzQ1OQ==
This is then somehow translated into
foo.html?isorc=1728
Any ideas how to fix this? I'm not totally sure if adding a rel="noindex" would be appropriate.
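
For anyone reading along: the cumulativefilter value above is plain Base64, so it can be decoded to see what a single filter link carries. This is only a small sketch. The helper name `decode_filter` is made up for illustration, and the meaning of the individual fields (something like a module ID, an action, a filter field and a value) is a guess from the decoded string, not something confirmed in this thread.

```python
import base64


def decode_filter(value: str) -> list[str]:
    """Base64-decode a cumulativefilter value and split it on ';'.

    Hypothetical helper for illustration only; not part of the module.
    """
    return base64.b64decode(value).decode("utf-8").split(";")


# The parameter value quoted in the report above.
parts = decode_filter("ODQ7YWRkO2ZpbHRlcl9hdHRyaWJ1dGVzOzQ1OQ==")
print(parts)  # ['84', 'add', 'filter_attributes', '459']
```

If each link really encodes one action on one filter value, then every filtered page exposes a fresh set of links for a crawler to follow, which would fit the page count growing far beyond the ~150 real pages.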

@Toflar
Member

Toflar commented Jul 29, 2020

The same would be the case for any crawler. So I guess it just doesn't make sense for any crawler to even crawl these URLs, which means they should all get rel="nofollow".

@asaage
Author

asaage commented Jul 29, 2020

Makes sense.
The module doesn't use a dedicated nav_ template though.
I will create one from nav_default and adjust it accordingly.
Maybe you could set this in the module already, to avoid template adjustments in the future?
I guess somewhere in here.

@asaage
Author

asaage commented Jul 29, 2020

Actually, neither nofollow nor noindex seems to solve this 😢

@Toflar
Member

Toflar commented Jul 29, 2020

The debug log should tell you where the URL was found.

@asaage
Author

asaage commented Jul 29, 2020

Can I get that live if I use the console?
When I run it in the back end I don't get a debug log, because crawling doesn't finish.

@Toflar
Member

Toflar commented Jul 29, 2020

Commands are self-documenting :) Just run contao:crawl --help :)

@asaage
Author

asaage commented Jul 29, 2020

Well, I'm stuck here...
I don't know how to interpret this. 🤷‍♂️
Despite "rel-nofollow", some pages are still logged as: Forwarded to the search indexer. Was indexed successfully.
Other occurrences show: Do not request because when the crawl URI was found, the "rel" attribute contained "nofollow".
I have tons of categoryXY.html?isorc=XY&cumulativefilter=XY combinations.

@Toflar
Member

Toflar commented Jul 29, 2020

Do not request because when the crawl URI was found, the "rel" attribute contained "nofollow".

That's exactly what you want. This means it's not going to be requested and thus also not indexed. But maybe it's found elsewhere again? Try finding the entry where the URL is requested. You should see where it was found, then you have to fix it on this page as well.
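
To illustrate the "found elsewhere again" case, here is a minimal sketch in plain Python. It is not the Contao crawler's actual code; `LinkCollector`, `urls_to_request` and the sample pages are made up for illustration. It only shows the general idea that a nofollow link on one page does not stop the same URL from being requested if another page links to it without nofollow.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, has_nofollow) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def urls_to_request(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each URL that would be requested to the pages linking it without nofollow."""
    found: dict[str, list[str]] = {}
    for page_url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href, nofollow in collector.links:
            if not nofollow:
                found.setdefault(href, []).append(page_url)
    return found


# Hypothetical pages: the filter link is nofollow on one page but not on the other.
pages = {
    "category.html": '<a href="category.html?isorc=1" rel="nofollow">Red</a>',
    "other.html": '<a href="category.html?isorc=1">Red</a>',
}
result = urls_to_request(pages)
print(result)  # {'category.html?isorc=1': ['other.html']}
```

In this toy example the URL is still requested, and the output names the page that needs fixing, which is the same information the debug log gives you.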

@asaage
Author

asaage commented Jul 30, 2020

There is only one cumulativefilter module, and all links with the cumulativefilter parameter get 303-redirected to the same page with the isorc parameter.
Also, every such URL appears twice in the crawl log, first with a
Do not request because when the crawl URI was found, the "rel" attribute contained "nofollow".
message, followed by a
Forwarded to the search indexer. Did not index because of the following reason: Was explicitly marked "noSearch" in page settings.
although I didn't set this.
From what I can see, there is always a "contao:noSearch":false JSON-LD entry on the category page.
That is needed (but should probably be changed based on the isorc or cumulativefilter parameter being present; I think a canonical tag would be helpful here as well).
Although the crawler keeps crawling endlessly, the indexer manages to fill tl_search with all needed URLs plus a couple extra.
Maybe leave this open until someone runs into the same issue. I don't think I configured anything wrong.
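
As a rough sketch of the canonical idea: dropping the two filter parameters maps every filtered variant back to the plain category URL. This is just an illustration in Python using the standard library. The function name `canonical_url` and the example.com URLs are assumptions, and whether the canonical should really point at the unfiltered page is a judgment call for the shop.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters named in this thread that only carry filter state.
FILTER_PARAMS = {"isorc", "cumulativefilter"}


def canonical_url(url: str) -> str:
    """Return the URL without filter-state parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FILTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical filtered variants of one category page.
variants = [
    "https://example.com/categoryXY.html?isorc=1728",
    "https://example.com/categoryXY.html?cumulativefilter=ODQ7YWRkO2ZpbHRlcl9hdHRyaWJ1dGVzOzQ1OQ==",
    "https://example.com/categoryXY.html?isorc=1729&cumulativefilter=abc",
]
canonicals = {canonical_url(u) for u in variants}
print(canonicals)  # {'https://example.com/categoryXY.html'}
```

Other parameters (pagination, for example) are kept, so only the filter state is collapsed.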

@aschempp aschempp added the bug label Oct 12, 2020
@aschempp aschempp added this to the 2.6.14 milestone Oct 12, 2020
@aschempp
Member

I have added nofollow to the filter items in 5792c49 anyway because I think that makes sense.
