Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getElementsByTagName() is O(N^2) #11308

Closed
tstarling opened this issue May 24, 2023 · 4 comments · Fixed by #11330
Closed

getElementsByTagName() is O(N^2) #11308

tstarling opened this issue May 24, 2023 · 4 comments · Fixed by #11330

Comments

@tstarling
Copy link
Contributor

Description

DOMDocument::getElementsByTagName() and DOMElement::getElementsByTagName() return an iterator which does a linear time search for the current position, each time the position is moved to the next node. So completing the iteration of the node list takes O(N^2) time where N is the number of nodes in the list.

For example:

for ( $n = 1000; $n <= 16000; $n *= 2 ) {
	$doc = new DOMDocument;
	$root = $doc->createElement( 'root' );
	$doc->appendChild( $root );
	for ( $i = 0; $i < $n; $i++ ) {
		$root->appendChild( $doc->createElement( 'e' ) );
	}
	$t = -microtime( true );
	foreach ( $doc->getElementsByTagName( 'e' ) as $node );
	$t += microtime( true );
	print "$n\t$t\n";
}

resulted in this output:

1000    0.0031359195709229
2000    0.012067079544067
4000    0.050027132034302
8000    0.20300078392029
16000   0.76775693893433

It was likely faster prior to 3084e72. Since that commit in 2003, getElementsByTagName() is similar to the following pseudocode:

function getElementsByTagName( $root, $name ) {
	$i = 0;
	do {
		$node = findNextNode( $root, $name );
		for ( $j = 0; $node && $j < $i; $j++ ) {
			$node = findNextNode( $node, $name );
		}
		
		if ( $node ) {
			yield $node;
		}
		$i++;
	} while ( $node );
}

That commit was motivated by standards compliance. The standard says that the node list returned by getElementsByTagName() is live, that is, it must immediately reflect changes to the document. The standard has no concept of iteration, it only has an integer offset which can be passed to item(). The pseudocode above does correctly reproduce the quirks implied by the standard. For example:

$doc = new DOMDocument;
$doc->loadXML( '<root><e i="1"/><e i="2"/><e i="3"/><e i="4"/><e i="5"/><e i="6"/><e i="7"/><e i="8"/><e i="9"/><e i="10"/></root>' );
$root = $doc->documentElement;
foreach ( $doc->getElementsByTagName( 'e' ) as $node ) {
	print $node->getAttribute( 'i' ) . ' ';
	$root->removeChild( $node );
}
print "\n";

produces

1 3 5 7 9 

WebKit implements the standard efficiently by having an item cache and a length cache. The item cache allows efficient retrieval of an item which has an index close to the previously requested item. These caches are invalidated on tree mutation.

PHP has no concept of tree mutation event callbacks, so implementing this in PHP may be somewhat tedious.

PHP Version

PHP 8.2.6

Operating System

No response

@nielsdos
Copy link
Member

Side note: I don't know what the other core people think, but I consider this as a feature request and as too risky to go into stable branches.

@tstarling
I created a proof-of-concept patch against the master branch that implements a coarse-grained caching strategy. It's nothing fancy. I just keep a counter that tracks how many times the document has been modified. Upon creation of an iterator, I store the current document's modification counter inside the iterator. If upon iteration the counter of the iterator and document match, then we know no modification took place and we can use the cached node. Otherwise, we'll fall back to the old slow codepath.
The reason I chose a counter is for performance: this is relatively lightweight, much more lightweight than the actual DOM modification themselves.
You might be wondering about counter overflow, this would indeed be a problem. I fixed this by using a saturating counter. If the counter is saturated then we will always take the old slow codepath. This means that even if 2^64 modifications occur on a 64-bit system, then the result will still be safe (albeit resulting in slower iteration).

I believe this is an acceptable trade-off. A more fine-grained approach is to have counters for each element tag type, but that has more impact on performance. The approach I propose is quite lightweight I think.
As you said, the libxml2 API does not expose a modification event callback, so I had to go through all the functions manually to see where an invalidation needs to occur.

There's still some TODOs for the patch to be ready, but you can find an early preview PR here: nielsdos#20
Let me know if this works for your use case.

@tstarling
Copy link
Contributor Author

Side note: I don't know what the other core people think, but I consider this as a feature request and as too risky to go into stable branches.

Yes, I assume this would only be in master/8.3. #7478 was only allowed in PHP 8.1 which was then in RC status, and that was a fair bit simpler.

You might be wondering about counter overflow, this would indeed be a problem. I fixed this by using a saturating counter. If the counter is saturated then we will always take the old slow codepath. This means that even if 2^64 modifications occur on a 64-bit system, then the result will still be safe (albeit resulting in slower iteration).

At 1µs per modification it would take 585,000 years to overflow the counter, so I'm not terribly worried about overflow. I'm more worried about some unexpected code path causing tree modification without the counter being incremented -- it's fine for that to cause incorrect results, but it would be nice if we could have memory safety. Is it possible to keep the current node alive by incrementing the reference count in its dom_object?

Let me know if this works for your use case.

Today's use case needs foreach, and it also uses item() called with consecutive indexes although that could be replaced with foreach. I found out we have https://phabricator.wikimedia.org/T317255 which is a general tracking bug for "remove every usage of getElementsByTagName and fail CI if we ever see it again". For getElementsByTagName to be usable by naïve developers familiar with JS, we will need O(N) foreach, O(1) item() for consecutive indexes, and O(1) length.

But item() and length can be added later, building on your counter concept.

@tstarling
Copy link
Contributor Author

There's still some TODOs for the patch to be ready, but you can find an early preview PR here: nielsdos#20
Let me know if this works for your use case.

I compiled it, it seems to work. Even without touching those item() calls, it reduces the time taken for my test case from 10.1s to 2.0s.

@nielsdos nielsdos self-assigned this May 25, 2023
@nielsdos
Copy link
Member

At 1µs per modification it would take 585,000 years to overflow the counter, so I'm not terribly worried about overflow. I'm more worried about some unexpected code path causing tree modification without the counter being incremented -- it's fine for that to cause incorrect results, but it would be nice if we could have memory safety. Is it possible to keep the current node alive by incrementing the reference count in its dom_object?

You're right. I amended it such that on 64-bit the overflow isn't checked for performance reasons.
The current node is already kept alive in the current code, so no use-after-free is possible.

Today's use case needs foreach, and it also uses item() called with consecutive indexes although that could be replaced with foreach. I found out we have https://phabricator.wikimedia.org/T317255 which is a general tracking bug for "remove every usage of getElementsByTagName and fail CI if we ever see it again". For getElementsByTagName to be usable by naïve developers familiar with JS, we will need O(N) foreach, O(1) item() for consecutive indexes, and O(1) length.

But item() and length can be added later, building on your counter concept.

I see the importance of improving the time complexity.
I'll work on item() and length too.

nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 27, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 29, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit to nielsdos/php-src that referenced this issue May 31, 2023
…iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see phpGH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes phpGH-11308.
nielsdos added a commit that referenced this issue Jun 2, 2023
…iteration (#11330)

* Implement iteration cache, item cache and length cache for node list iteration

The current implementation follows the spec requirement that the list
must be "live". This means that changes in the document must be
reflected in the existing node lists without requiring the user to
refetch the node list.
The consequence is that getting any item, or the length of the list,
always starts searching from the root element of the node list. This
results in O(n) time to get any item or the length. If there's a for
loop over the node list, this means the iterations will take O(n²) time
in total. This causes real-world performance issues with potential for
downtime (see GH-11308 and its references for details).

We fix this by introducing a caching strategy. We cache the last
iterated object in the iterator, the last requested item in the node
list, and the last length computation. To invalidate the cache, we
simply count the number of modifications made to the containing
document. If the modification number does not match what the number was
during caching, we know the document has been modified and the cache is
invalid. If this ever overflows, we saturate the modification number and
don't do any caching anymore. Note that we don't check for overflow on
64-bit systems because it would take hundreds of years to overflow.

Fixes GH-11308.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants