Add internal URI handling API #19073

Open: wants to merge 11 commits into base: master

Conversation

kocsismate
Member

No description provided.

Member

@TimWolla TimWolla left a comment

Some first remarks. Did not yet look at everything.

Comment on lines 111 to 114
if (uri_handler_name == NULL) {
    return uri_handler_by_name("parse_url", sizeof("parse_url") - 1);
}

Member

Defaulting to parse_url in a new API is probably not a good idea. Instead the “legacy” users should just pass "parse_url" explicitly.

Member Author

Defaulting to parse_url here works because that is indeed the default wherever php_uri_get_handler() is called; the other "backends" can only be used if the config is explicitly passed (not null).

The other reason why I opted for this approach is that it would be inconvenient to create and free a new zend_string when the legacy implementation is needed, and I wanted to avoid adding a known string just for this purpose, or exposing the C string based uri_handler_by_name function instead.
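
The trade-off in this thread can be sketched in plain C. This is a minimal illustration, not the code from the PR: every `demo_*` name, the handler struct, and the second registry entry are hypothetical. Only the `"parse_url"` name and the "NULL selects the legacy parser" rule come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down handler record (the real one has callbacks too). */
typedef struct {
    const char *name;
} demo_uri_handler;

/* Hypothetical registry; "demo_other" stands in for any non-legacy backend. */
static const demo_uri_handler demo_handlers[] = {
    { "parse_url" },
    { "demo_other" },
};

/* Look up a handler by a pointer-length name; returns NULL when nothing matches. */
static const demo_uri_handler *demo_handler_by_name(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++) {
        if (strlen(demo_handlers[i].name) == len
                && memcmp(demo_handlers[i].name, name, len) == 0) {
            return &demo_handlers[i];
        }
    }
    return NULL;
}

/* Variant A (what the PR author describes): NULL silently selects the legacy parser. */
static const demo_uri_handler *demo_get_handler_defaulting(const char *name, size_t len)
{
    if (name == NULL) {
        return demo_handler_by_name("parse_url", sizeof("parse_url") - 1);
    }
    return demo_handler_by_name(name, len);
}

/* Variant B (what the reviewer suggests): callers must always name the parser. */
static const demo_uri_handler *demo_get_handler_explicit(const char *name, size_t len)
{
    assert(name != NULL);
    return demo_handler_by_name(name, len);
}
```

With variant B, a legacy caller would write `demo_get_handler_explicit("parse_url", sizeof("parse_url") - 1)`, which needs no heap-allocated string as long as the lookup takes a pointer-length pair.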

Member

@TimWolla TimWolla left a comment

I've looked at this again and I must say that I'm having trouble meaningfully reviewing this. It adds a large amount of code with unclear purpose and confusing (to me) naming.

Member

@nielsdos nielsdos left a comment

Preliminary review round

Member

@nielsdos nielsdos left a comment

The switch from zend_string to the pointer-length pair seems to have been a good idea

@TimWolla TimWolla self-requested a review July 22, 2025 20:06
@TimWolla
Member

I'll try to take another look tomorrow.

@kocsismate
Member Author

@nielsdos Do you see anything that I should fix before merging it? I'd like to implement some of the cleanups that we discussed.

Member

@nielsdos nielsdos left a comment

Looks mostly good, FWIW I agree with Tim on UNREACHABLE

ZEND_ASSERT(uri_handler->parse_uri != NULL);
ZEND_ASSERT(uri_handler->clone_uri != NULL);
ZEND_ASSERT(uri_handler->uri_to_string != NULL);
ZEND_ASSERT(uri_handler->clone_uri != NULL || strcmp(uri_handler->name, URI_PARSER_PHP) == 0);
Member Author

I wanted to keep these assertions even though two handlers of the legacy parser became NULL. So I excluded this URI handler from the checks.

Member

Assertions with exceptions to the rule are a bit iffy IMHO

Member Author

Yes, I can understand your POV. Do you think it's better to get rid of the latter two special-cased assertions?
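
To make the two options concrete, here is a rough sketch of a uniform check next to a check with a carve-out for the legacy parser. It is an illustration under assumptions, not the PR's code: the struct, the callback signatures, and all `demo_*` names are invented; boolean return values stand in for `ZEND_ASSERT`; and it assumes that the two NULL callbacks of the legacy parser are `clone_uri` and `uri_to_string`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PARSER_LEGACY "parse_url" /* hypothetical stand-in for URI_PARSER_PHP */

/* Hypothetical handler table; the callback signatures are invented for the sketch. */
typedef struct {
    const char *name;
    int (*parse_uri)(const char *str, size_t len);
    int (*clone_uri)(int uri);
    int (*uri_to_string)(int uri);
} demo_uri_handler;

/* Uniform rule: every handler must provide every callback. */
static bool demo_handler_valid_strict(const demo_uri_handler *h)
{
    return h->name != NULL
        && h->parse_uri != NULL
        && h->clone_uri != NULL
        && h->uri_to_string != NULL;
}

/* Rule with an exception: the legacy parser may omit clone_uri and uri_to_string. */
static bool demo_handler_valid_with_exception(const demo_uri_handler *h)
{
    if (h->name == NULL || h->parse_uri == NULL) {
        return false;
    }
    bool is_legacy = strcmp(h->name, DEMO_PARSER_LEGACY) == 0;
    return is_legacy || (h->clone_uri != NULL && h->uri_to_string != NULL);
}

/* Stub callbacks so the two sample handlers below can be constructed. */
static int demo_parse(const char *str, size_t len) { (void) str; return (int) len; }
static int demo_clone(int uri) { return uri; }
static int demo_to_string(int uri) { return uri; }

static const demo_uri_handler demo_full = { "demo_full", demo_parse, demo_clone, demo_to_string };
static const demo_uri_handler demo_legacy = { DEMO_PARSER_LEGACY, demo_parse, NULL, NULL };
```

The strict check rejects `demo_legacy`, while the check with the exception accepts it. That is the "iffy" part: the second rule no longer states one invariant for all handlers.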


ZEND_ASSERT(uri_handler->name != NULL);
ZEND_ASSERT(uri_handler->name != NULL && (strlen(uri_handler->name) > 0 || strcmp(uri_handler->name, URI_PARSER_PHP) == 0));
Member Author

As I renamed the legacy implementation to "", I wanted to make sure that other implementations cannot use the same handler.

That said, I think this assertion doesn't make much sense, especially because module initialization will surely fail anyway because of the name collision.

Member

Huh? Why did you change the name to "" though?

Member Author

Mainly because I wanted to prevent this from working:

filter_var("qwe", FILTER_VALIDATE_URL, ["uri_parser_class" => "parse_url"])

even though null is the suggested value to use for the "uri_parser_class" config. As "" is the most similar string value to null that can be stored in a uri_handlers hash table, I figured it's a better choice than parse_url.

Or do you have any better solutions in mind?

Member

I think "" is weird from a user point of view.
I still don't think I quite understand the problem though.

Member Author

I just found it weird that one can pass parse_url to use the legacy parser, while null is already dedicated to that purpose. Defaulting to null is convenient for the internal API: #19073 (comment).

Since "" is very close to null, I figured that it would be a better choice than an arbitrary name like parse_url. It's unlikely that users will ever have to use this value (we should only document null), but I have to choose some string value for the name to be able to store the handler in the HashTable of the handlers. I could choose not to store it at all, and instead add an extra if to uri_handler_by_name (rather than php_uri_get_handler) for retrieving the legacy handler, but doing so would slightly slow down all usages, not just the ones that come from outside of ext/uri.

I hope I managed to describe my problem 🤔 In any case, I don't have any hard feelings against any name, so I can live with any choice.

Member

So what if we want to change the default later or even drop parse_url?

Member Author

Hmm... so you're suggesting that the implementation should have a name so that it can be referenced even when it's not the default one. I guess that makes sense. Thinking about it long-term, a phase-out will likely mean that some other implementation is made the default (likely RFC 3986), and then a few years later parse_url is removed. I guess I get your point now.
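
The point about changing the default later can be sketched as follows. This is only an illustration: the `demo_*` names and the `"demo_rfc3986"` identifier are hypothetical, and the real lookup goes through a HashTable rather than plain C strings. The idea is that "no parser requested" resolves through a separate default setting, while the legacy parser keeps its own name.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PARSER_LEGACY  "parse_url"
#define DEMO_PARSER_RFC3986 "demo_rfc3986" /* hypothetical name for another backend */

/* The current default is data, not an identity: it can be re-pointed later. */
static const char *demo_default_parser = DEMO_PARSER_LEGACY;

/* NULL means "use whatever the default currently is"; a name always means that parser. */
static const char *demo_resolve_parser_name(const char *requested)
{
    return requested != NULL ? requested : demo_default_parser;
}
```

If the default is later switched to another backend, callers that pass NULL follow the new default, while callers that explicitly pass `"parse_url"` keep getting the legacy parser until it is removed.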

Member

@nielsdos nielsdos left a comment

I think this is getting pretty close to being merge-able


@kocsismate
Member Author

I'll fix the CI probably tomorrow night

@@ -615,7 +615,7 @@ void php_filter_validate_url(PHP_INPUT_FILTER_PARAM_DECL) /* {{{ */
}

/* Parse the URI - if it fails, we return NULL */
php_uri *uri = php_uri_parse_to_struct(uri_handler, Z_STRVAL_P(value), Z_STRLEN_P(value), URI_COMPONENT_READ_NORMALIZED_ASCII, true);
Member Author

I noticed that the normalized URI components were still retrieved in multiple places where the raw version would have made more sense, so I changed these.

Generally, I think the raw format should be used internally whenever possible. I am also considering returning an error if an unsupported format is requested (e.g. the WHATWG implementation doesn't support normalization, except for the host, whose Unicode representation can be returned).

}

php_raw_url_decode(Z_STRVAL_P(zv), Z_STRLEN_P(zv));
Member Author

I realized that php_raw_url_decode() conforms to RFC 3986.

Member

Can you clarify what you mean by this remark/change?

Member Author

Maybe you missed that I initially used php_url_decode(). A bit later I realized that this function does not conform to RFC 3986 URL encoding, due to this line:

if (*data == '+') {

So I changed it to php_raw_url_decode(), which does conform to it.
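
The difference comes down to how `+` is treated. The following is a self-contained sketch of an in-place percent-decoder with a switch for that behaviour. It is not the source of php_url_decode() or php_raw_url_decode(); the `demo_*` functions and the `plus_as_space` flag are hypothetical and exist only for this comparison.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns the numeric value of a hex digit, or -1 if c is not a hex digit. */
static int demo_hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    c = (unsigned char) tolower(c);
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10;
    }
    return -1;
}

/*
 * Decodes str in place and returns the new length. The buffer must hold
 * len + 1 bytes because a terminating NUL is written.
 * plus_as_space != 0 mimics form-style decoding ('+' becomes ' ');
 * plus_as_space == 0 leaves '+' untouched, as RFC 3986 percent-decoding does.
 */
static size_t demo_percent_decode(char *str, size_t len, int plus_as_space)
{
    char *dest = str;
    const char *data = str;
    const char *end = str + len;

    while (data < end) {
        if (plus_as_space && *data == '+') {
            *dest = ' ';
        } else if (*data == '%' && end - data >= 3
                && demo_hex_value((unsigned char) data[1]) >= 0
                && demo_hex_value((unsigned char) data[2]) >= 0) {
            /* Combine the two hex digits into one byte and skip past them. */
            *dest = (char) (demo_hex_value((unsigned char) data[1]) * 16
                          + demo_hex_value((unsigned char) data[2]));
            data += 2;
        } else {
            *dest = *data;
        }
        data++;
        dest++;
    }
    *dest = '\0';
    return (size_t) (dest - str);
}
```

For the input `a+b%20c`, the call with `plus_as_space = 0` yields `a+b c`, whereas `plus_as_space = 1` yields `a b c`. Since `+` is an ordinary sub-delimiter in RFC 3986, only the first behaviour preserves the URI component.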

Member

I understand. (It's also a bit hard keeping track of changes in large PRs)
All fine then.

@kocsismate
Member Author

@nielsdos @TimWolla could you please have another look at this? I think we should really merge this soon.

Member

@nielsdos nielsdos left a comment

Please see my last question, otherwise looks fine to me. Let the cleanup begin.

@nielsdos nielsdos requested a review from a team August 16, 2025 12:09
Member

@edorian edorian left a comment

RM wise: 👍
