[4] Modern routing, nonSEF & SEF urls alias manipulation #32880

Closed
PhilETaylor opened this issue Mar 26, 2021 · 49 comments

@PhilETaylor
Contributor

PhilETaylor commented Mar 26, 2021

Steps to reproduce the issue

forked from #32879 about #32490

There is one more case I wanted to address but ran out of time for: with the new Modern routing it's possible to generate a URL like:

https://example.com/?view=article&id=3:my-article&catid=9

where my-article is the alias of the My Article (id:3) article. This can also be manipulated like:

https://example.com/?view=article&id=3:HAHA&catid=9

and that url will still work correctly. This still needs addressing.

Expected result

https://example.com/?view=article&id=3:HAHA&catid=9 is a 404 as HAHA has no part to play here

Or could just remove the alias part from the id...
https://example.com/?view=article&id=3&catid=9

Actual result

https://example.com/?view=article&id=3:HAHA&catid=9 is a valid URL that loads the article with id 3, and HAHA is not checked to see if it's the same as the article alias (it's not)
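
To make the reported behaviour concrete, here is a minimal Python sketch of what checking the alias part of a non-SEF `id` value could look like. It is an illustration only, not Joomla's router code: the `ARTICLE_ALIASES` dictionary stands in for the content table, and the function names and return labels are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical stand-in for the content table: article id -> stored alias.
ARTICLE_ALIASES = {3: "my-article"}


def split_id_and_alias(raw_id):
    """Split an 'id:alias' query value into (int id, alias or None)."""
    id_part, _, alias = raw_id.partition(":")
    return int(id_part), (alias or None)


def check_non_sef_url(url):
    """Classify a non-SEF article URL as 'ok', 'alias-mismatch' or 'not-found'."""
    query = parse_qs(urlsplit(url).query)
    article_id, alias = split_id_and_alias(query["id"][0])
    stored = ARTICLE_ALIASES.get(article_id)
    if stored is None:
        return "not-found"
    # A bare id (no alias part) is accepted; a wrong alias is flagged.
    if alias is not None and alias != stored:
        return "alias-mismatch"
    return "ok"
```

With this sketch, `id=3:my-article` and `id=3` are both accepted, while `id=3:HAHA` is flagged; whether a flagged URL should then become a 404 or a redirect is a separate decision.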

Additionally, with SEF on and "Remove IDs from URLs" turned off

In Joomla 4 with "Remove IDs from URLs" turned off, the same bug exists with SEF URLs that #32887 aimed to fix in Joomla 3:

A url of http://127.0.0.1:4444/bottom-most/1-my-article is generated, and can be manipulated like http://127.0.0.1:4444/bottom-most/1-HAHAHAHAHAHAH without a redirect/404 being generated :-(
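
The SEF case comes down to a path segment of the form `<id>-<alias>`. A minimal Python sketch of splitting such a segment and comparing the alias could look like the following; the regular expression, the function names and the `aliases` mapping are assumptions for illustration, not Joomla's implementation.

```python
import re

# Assumed segment shape: leading numeric id, a dash, then the alias.
SEGMENT_PATTERN = re.compile(r"^(\d+)-(.+)$")


def parse_sef_segment(segment):
    """Return (id, alias) for a segment like '1-my-article', else None."""
    match = SEGMENT_PATTERN.match(segment)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def segment_matches(segment, aliases):
    """True only when the alias in the segment equals the stored alias."""
    parsed = parse_sef_segment(segment)
    if parsed is None:
        return False
    article_id, alias = parsed
    return aliases.get(article_id) == alias
```

Here `segment_matches("1-HAHAHAHAHAHAH", {1: "my-article"})` is false, which is the condition the router currently never evaluates.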

Additional comments

@ReLater
Contributor

ReLater commented Mar 27, 2021

Just a question: Is there a reason that the alias is included in id value? Does the router code need it somewhere?

Generally speaking: I personally am happy that the alias is not checked and that it doesn't matter if there is one or not or the wrong one. (I also understand all the discussions about SEO problems but know how to avoid them.)

@PhilETaylor

This comment was marked as abuse.

@Bakual
Contributor

Bakual commented Mar 27, 2021

Google will eventually associate that new fake url with the nasty stuff.

Is there any proof that this will happen? Because I actually think Google is smart enough to not do that. After all, it's a single link from an external site and all internal links to the same content have a different URL. For Google that is a mild form of duplicate content and they almost certainly give priority to the same-site links when it comes to which URL is correct.
Correctly formed canonical links would also solve the issue, as they tell Google the correct URL for that content.
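
As a rough illustration of the canonical-link idea, the following Python sketch builds a `<link rel="canonical">` tag from the stored id and alias rather than from whatever alias arrived in the request. The URL layout (`/<category path>/<id>-<alias>`) mirrors the example URLs in this issue; the function name and parameters are hypothetical.

```python
from html import escape


def canonical_link_tag(base_url, category_path, article_id, alias):
    """Build a canonical link tag from stored values, not from the request."""
    href = f"{base_url.rstrip('/')}/{category_path}/{article_id}-{alias}"
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'
```

Because the tag is derived from the database alias, a request for a manipulated alias would still advertise the correct URL to crawlers.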

@PhilETaylor

This comment was marked as abuse.

@Bakual
Contributor

Bakual commented Mar 27, 2021

I just checked some major news sites here in Switzerland (20min.ch) and Germany (bild.de and spiegel.de). You can manipulate the URL of any of their articles as well, but they then redirect (301) to the correct page. Which is what I would expect as a user and site owner. As a site owner, I don't like presenting a 404 to my users, even if they misspelled the URL.

So imho the best solution would be to check the incoming URL and if it's not correct, automatically redirect to the correct one.
Plainly showing a 404 just because the alias doesn't match is not a good solution.
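
The redirect approach described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not Joomla code: `resolve_article_request`, its parameters and the returned `(status, location)` tuple are hypothetical, and the URL layout follows the SEF examples in this issue.

```python
def resolve_article_request(article_id, requested_alias, aliases, category_path):
    """Return (status, location): 200 for an exact match, 301 to the
    correct URL for a wrong alias, 404 for an unknown article id."""
    stored = aliases.get(article_id)
    if stored is None:
        return 404, None
    if requested_alias != stored:
        # Wrong or manipulated alias: send the visitor to the real URL.
        return 301, f"/{category_path}/{article_id}-{stored}"
    return 200, None
```

For example, a request for `1-HAHAHAHAHAHAH` in category `bottom-most` would yield `(301, "/bottom-most/1-my-article")`, while an unknown id still yields a 404.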

@PhilETaylor

This comment was marked as abuse.

@Bakual
Contributor

Bakual commented Mar 27, 2021

A 404 is literally the correct response to an invalid url!

Technically yes. But most sites actually like it when users find the correct page even if they misspelled the URL. So a redirect 301 to the correct URL is also an absolutely correct response.

I agree that the current state is not a desired behavior. However I think it's still better than showing a 404. And yes I think the other PR is not a correct solution.
But then, I'm just one single guy. I have no permissions to merge or reject a PR (I used to, but have not wanted them for a long time). I am not speaking for the whole project at all.

@PhilETaylor

This comment was marked as abuse.

@Ruud68
Contributor

Ruud68 commented Mar 27, 2021

Google will eventually associate that new fake url with the nasty stuff.

Is there any proof that this will happen? Because I actually think Google is smart enough to not do that. After all, it's a single link from an external site and all internal links to the same content have a different URL. For Google that is a mild form of duplicate content and they almost certainly give priority to the same-site links when it comes to which URL is correct.
Correctly formed canonical links would also solve the issue, as they tell Google the correct URL for that content.

@Bakual here you go: dm me if you need the domainname so you can see what google is actually showing in the index #32490 (comment)

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Member

Hackwar commented Mar 27, 2021

No, this has not been changed in 4.0. When the URL is non-SEF, the router generally doesn't really change anything in it.

@PhilETaylor

This comment was marked as abuse.

@PhilETaylor PhilETaylor changed the title Modern routing, nonSEF alias manipulation [4] Modern routing, nonSEF alias manipulation Mar 27, 2021
@PhilETaylor PhilETaylor reopened this Mar 27, 2021
@Hackwar
Member

Hackwar commented Mar 27, 2021

Are you planning to just check this during parsing? Because I would be hesitant to get the aliases from the database each time URLs are built. In the worst case you add a few hundred queries for a single page by that.

@PhilETaylor

This comment was marked as abuse.

@PhilETaylor

This comment was marked as abuse.

@Ruud68
Contributor

Ruud68 commented Mar 27, 2021

Exactly! that is also why I said it is a show stopper and pinged @wilsonge on this.
We can discuss a lot, but as you said 'if there is no interest', well, time can be better spent

@Ruud68
Contributor

Ruud68 commented Mar 27, 2021

Are you planning to just check this during parsing? Because I would be hesitant to get the aliases from the database each time URLs are built. In the worst case you add a few hundred queries for a single page by that.

This is only for parsing, not for building. So that would involve only 1 changed query, not an additional one: you are already querying for the id, and that query would be extended to cover both the id and the alias.

@PhilETaylor PhilETaylor changed the title [4] Modern routing, nonSEF alias manipulation [4] Modern routing, nonSEF & SEF urls alias manipulation Mar 27, 2021
@Hackwar
Member

Hackwar commented Mar 27, 2021

In Joomla 4 with "Remove IDs from URLs" turned off, the same bug exists with SEF URLs that #32887 aimed to fix in Joomla 3:

A url of http://127.0.0.1:4444/bottom-most/1-my-article is generated, and can be manipulated like http://127.0.0.1:4444/bottom-most/1-HAHAHAHAHAHAH without a redirect/404 being generated :-(

That is what I described above. If you have a page with a hundred URLs, you get at least 100 additional queries to check the alias, which is why I'm asking to do this not in build, but in parse.

This is only for parsing, not for building. So that would involve only 1 changed query, not an additional one: you are already querying for the id, and that query would be extended to cover both the id and the alias.

We are not validating the ID during parsing of the URL. That is something that the component has to do later on. So it would be an additional query. But one additional query shouldn't really worry us. I'm just trying to bring up all the things that we have to keep in mind.
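
The cost discussed here, a single lookup at parse time instead of one lookup per generated link at build time, can be sketched with an in-memory SQLite database. The `content` table, its columns and the function names are assumptions made for this example, not Joomla's schema or router API.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory table standing in for the content table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY, alias TEXT)")
    conn.execute("INSERT INTO content (id, alias) VALUES (1, 'my-article')")
    return conn


def alias_is_valid(conn, article_id, alias):
    """One query per parsed request: fetch the stored alias and compare."""
    row = conn.execute(
        "SELECT alias FROM content WHERE id = ?", (article_id,)
    ).fetchone()
    return row is not None and row[0] == alias
```

The check runs once for the incoming URL, so it does not scale with the number of links rendered on the page.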

Generally: neither I nor anyone on the production team is your enemy. Quite the opposite. We are very grateful for your work. To me, you are coming across as if you think you have to fight against us. We have common goals here and we are all trying our best to reach these goals. We currently only differ on certain rules which we put up.

@MacJoom
Contributor

MacJoom commented Apr 8, 2021

I totally disagree with the production team - this issue is a major flaw. No one these days is going to type in a URL... everyone clicks on a link - a link that can be manipulated and will show up in Google search results. Google will eventually change the link if it's a 301, but Google will surely delete the false link if it's a 404.
So this has to be fixed - in every version of joomla!

@Ruud68
Contributor

Ruud68 commented Apr 8, 2021

I totally disagree with the production team - this issue is a major flaw. No one these days is going to type in a URL... everyone clicks on a link - a link that can be manipulated and will show up in Google search results. Google will eventually change the link if it's a 301, but Google will surely delete the false link if it's a 404.
So this has to be fixed - in every version of joomla!

Agree, it is only a matter of time (but maybe that is already happening) before Google categorizes / labels sites that in its eyes serve p*rn links. So when you are running a legit business, you will no longer show on page 1 when somebody searches for your household equipment, because you are categorized as running a completely different business.

@Hackwar
Member

Hackwar commented Apr 8, 2021

We don't disagree that this is a major flaw. However we disagree that this can be fixed in a backwards compatible way. Fixing this in Joomla 3 will break thousands of websites and thus we can't fix this in the 3.x major version. Instead, we already have fixed it in 4.0.

@Hackwar
Member

Hackwar commented Apr 8, 2021

Just FYI: I was already trying to fix this back in Joomla 1.6, then spent 6 years pushing for the last big changes to the routing, which at least partially fixed this, and yet another year fixing it in 4.0.

@MacJoom
Contributor

MacJoom commented Apr 9, 2021

I cannot comment on the b/c topic - which is surely very important - but this issue is a problem which can badly affect the status of Joomla as a reliable CMS. Even if the system was not hacked, a hacker can make it look like it was - at least in the URL. Some people won't notice the technical difference between a URL and the content - they think the URL is coming from the site. So I can only urge fixing this for 99% of the Joomla sites (at the moment).

@Hackwar
Member

Hackwar commented Apr 9, 2021

You mean that something that has been like this for 15 years needs to be fixed now, definitely breaking thousands of websites and requiring development work from them? At the same time breaking the SemVer promises we made? I can guarantee you that if we change this now, the Joomla project would lose half of its user base. Not everyone would even be directly affected by this, but the break in trust would be devastating.

I can guarantee you, that no one in charge in the production part of the project will support this change in the 3.x branch of Joomla, especially since you can partially fix this by using modern routing without IDs and additional fixes have already also been deployed to Joomla 4.0.

We had such a change in another area in 2014, where someone thought it would be necessary to change the hashing of passwords and we still get people complaining about how unreliable we are because of that release 7 years ago.

@Ruud68
Contributor

Ruud68 commented Apr 10, 2021

@Hackwar The same logic applies to security issues that have been in core for 15+ years: you fix them when you find out about them. Ignoring them (besides security by obfuscation) is no security at all. We (the Joomla community) trust that these matters will be dealt with when they arise.
When looking at the usage stats you see that only 30% of the sites make the move from an older version, so it is already losing 70% of its user base. Even more sites will (imo) not make the move to 4.0; that is what I hear and see.
So for me this should be fixed in 4.0, and if that means that it breaks b/c with 3.10 then so be it. Provide people with a migration path / instructions and that soothes the pain.
People need to invest in the upgrade; investing in URL changes can also be added to the list.
I did a PR addressing (part of) this issue as a proof of concept for J4. It adds a toggle [loose|strict], where loose parses a URL as it is parsed now, and strict checks not only the id but also the alias of the article. That at least gives people a choice in the matter instead of making the choice for them.
The PR stranded due to lack (none) of interest; the only comments I got were the (very motivating) code styling changes. Maybe you can have a look at it and see if this is a route to pursue?
#32500
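
The loose/strict toggle described above could be sketched in Python as follows. This is not the code from the referenced PR; the `AliasMode` enum, the function name and the `aliases` mapping are hypothetical and only illustrate the two behaviours.

```python
from enum import Enum


class AliasMode(Enum):
    LOOSE = "loose"    # current behaviour: only the id has to exist
    STRICT = "strict"  # the id has to exist and the alias has to match


def accept_request(article_id, alias, aliases, mode):
    """Decide whether a parsed (id, alias) pair is accepted under a mode."""
    stored = aliases.get(article_id)
    if stored is None:
        return False
    if mode is AliasMode.STRICT:
        return alias == stored
    return True
```

In loose mode `accept_request(3, "HAHA", {3: "my-article"}, AliasMode.LOOSE)` is accepted, matching today's behaviour; in strict mode the same request is rejected.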

@brianteeman
Contributor

When looking at the usage stats you see that only 30% of the sites make the move from an older version, so it is already losing 70% of its user base

The stats are useless. If you set the "send once" option then you only know about the first version installed.

@HLeithner
Member

You can always add a ? or a & to the URL and add whatever text you want.

https://example.com/?view=article&id=3:HAHA&catid=9

would also work as

https://example.com/?HAHA=mytext+and+it+still+works&view=article&id=3:my-article&catid=9

There is nothing we can do against this, even for

https://example.com/my-unique-alias?HAHA=mytext

In addition, Google has been trying for years to remove the complete URL, and with a monopolist's market share it won't take long anymore

just my 2 cents

@PhilETaylor

This comment was marked as abuse.

@PhilETaylor

This comment was marked as abuse.

@Ruud68
Contributor

Ruud68 commented Apr 16, 2021

also, if this is "not really a problem" how come that the URL I reported (which was perfectly valid, returning a 200 OK and loading the Joomla 3.9.25 release news), has now been "fixed" to show a 404...

looking forward to the PR for this, maybe one of the maintainers can share it here.

The fake URL in my blog informing my customers about this issue is still resolving okay though, so 'I have tested this PR unsuccessful' #lol

@PhilETaylor

This comment was marked as abuse.

@brianteeman
Contributor

@Bakual
Contributor

Bakual commented Apr 19, 2021

Just out of curiosity, what is the real disadvantage of such URLs?
I get it that Google indexes it, and it gets found if you search for the URL. But Google is smart enough to not show these fake URLs in regular searches (with search terms, not URLs). Google knows the correct address and will not show the fake one. (I've tested that with the site Ruud mentioned.)

So imho it's more that it scares the owner of the site when he looks at the Analytics, but it doesn't affect customers.

Or am I missing something?

@Ruud68
Contributor

Ruud68 commented Apr 19, 2021

It definitely impacts users, as it was the customers who brought this to the attention of the site owner. It's a (business) vulnerability. Just like a security issue, there is no issue until you get hacked... Or in this case your business gets linked by your customers to, for example, anti-semitism or other nasty stuff.

@Bakual
Contributor

Bakual commented Apr 19, 2021

Still wondering how the site visitors were impacted. How did they get to see those fake URLs?
I couldn't get them to show in the Google search results using search terms. I always got the correct URL in the search results, not the fake ones.

I'm not saying this shouldn't be fixed, don't get me wrong. I'm just wondering what the severity is.

@Ruud68
Contributor

Ruud68 commented Apr 19, 2021

from google itself, hover the 'correct' url and you see the bad url, but also from news aggregation sites that use the url structure to create an index, where the category in the url is used as 'container'.
image

@MacJoom
Contributor

MacJoom commented Apr 19, 2021

According to SEO experts, Google takes working URLs (non-404) from other websites (backlinks) into consideration for the ranking of a website. So if keywords in the URL do not appear in the content, this could lead to downgrading - especially if it's in a highly contested segment.

@brianteeman
Contributor

SEO and Expert are two words that should never be used in the same sentence

@Bakual
Contributor

Bakual commented Apr 19, 2021

@Ruud68 Without knowing what search words you used, that doesn't mean much. Did you search for a URL or for a keyword?
As I said I know that if you search for the URL, you get the results. But if you search eg for "Scorpio Gold Reports", you don't get that URL.
That's why I'm curious.

@MacJoom
Contributor

MacJoom commented Apr 19, 2021

I think that Ruud fixed this for his client some time ago (with his own patch) - the website shows 404 or 410 now, so Google removed the URL. Imho it doesn't matter how you find it - it was in the search results, so 1) people could see it - maybe while searching for porn - and 2) maybe Google was downgrading the original website because of the fake URL.

@Ruud68
Contributor

Ruud68 commented Apr 23, 2021

@Ruud68 Without knowing what search words you used, that doesn't mean much. Did you search for a URL or for a keyword?
As I said I know that if you search for the URL, you get the results. But if you search eg for "Scorpio Gold Reports", you don't get that URL.
That's why I'm curious.

I don't know what they type in Google, they will not tell me.
I have worked as a hired project manager for a large bank. Before giving external (but also internal) contractors a contract, they are required by law to do a background check on the person they are hiring. How and what they check is classified, but I feel confident that if they 'stumbled' across references on Google linking my Joomla site with god knows what illegal activities / content, I would not get hired: again, a business vulnerability.
And again I think, just like a security fix or the Google FLoC PR, this should be fixed before it hits you.

@Hackwar
Member

Hackwar commented Apr 23, 2021

And as I've said before, we are happy to fix this, but not in Joomla 3. It has been like this for 16 years now and it is a rather well known issue. It is not something that we can properly fix in a backwards compatible way and thus we will not fix it in Joomla 3. You are welcome to provide a PR for Joomla 4 to fix this.

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Member

Hackwar commented Apr 23, 2021

Ok, let me rephrase: Fixing this in a backwards compatible way would require yet another option in the GUI and I would consider that as a new feature. I'm very much against adding yet more options unless absolutely necessary. In addition, new features can only be added in a minor release and that could only be Joomla 3.10. We decided quite some time ago (and communicated that as well) that Joomla 3.10 will only be a compatibility release to ease the migration to Joomla 4 and will not contain any additional new features. Thus this will not be fixed in Joomla 3.

I would really prefer if instead of arguing about this here, we could concentrate on Joomla 4, fix this there and finally get this release out the door.

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Member

Hackwar commented Apr 23, 2021

Teaches me to ever volunteer to execute a decision by the PLT.

@MacJoom
Contributor

MacJoom commented Apr 23, 2021

What about writing the option directly into configuration.php, so no GUI option is needed? For me it is still fixing a bug and not a new feature.

@brianteeman
Contributor

I would really prefer if instead of arguing about this here, we could concentrate on Joomla 4, fix this there and finally get this release out the door.

Wouldn't we all. Now if you only replied to comments specifically addressed to you we might make some progress.

@PhilETaylor

This comment was marked as abuse.
