getElementsByTagName() is O(N^2) #11308

tstarling · 2023-05-24T03:45:54Z

Description

DOMDocument::getElementsByTagName() and DOMElement::getElementsByTagName() return an iterator which does a linear time search for the current position, each time the position is moved to the next node. So completing the iteration of the node list takes O(N^2) time where N is the number of nodes in the list.

For example:

for ( $n = 1000; $n <= 16000; $n *= 2 ) {
	$doc = new DOMDocument;
	$root = $doc->createElement( 'root' );
	$doc->appendChild( $root );
	for ( $i = 0; $i < $n; $i++ ) {
		$root->appendChild( $doc->createElement( 'e' ) );
	}
	$t = -microtime( true );
	foreach ( $doc->getElementsByTagName( 'e' ) as $node );
	$t += microtime( true );
	print "$n\t$t\n";
}

resulted in this output:

1000    0.0031359195709229
2000    0.012067079544067
4000    0.050027132034302
8000    0.20300078392029
16000   0.76775693893433

It was likely faster prior to 3084e72. Since that commit in 2003, getElementsByTagName() is similar to the following pseudocode:

function getElementsByTagName( $root, $name ) {
	$i = 0;
	do {
		$node = findNextNode( $root, $name );
		for ( $j = 0; $node && $j < $i; $j++ ) {
			$node = findNextNode( $node, $name );
		}
		
		if ( $node ) {
			yield $node;
		}
		$i++;
	} while ( $node );
}

That commit was motivated by standards compliance. The standard says that the node list returned by getElementsByTagName() is live, that is, it must immediately reflect changes to the document. The standard has no concept of iteration, it only has an integer offset which can be passed to item(). The pseudocode above does correctly reproduce the quirks implied by the standard. For example:

$doc = new DOMDocument;
$doc->loadXML( '<root><e i="1"/><e i="2"/><e i="3"/><e i="4"/><e i="5"/><e i="6"/><e i="7"/><e i="8"/><e i="9"/><e i="10"/></root>' );
$root = $doc->documentElement;
foreach ( $doc->getElementsByTagName( 'e' ) as $node ) {
	print $node->getAttribute( 'i' ) . ' ';
	$root->removeChild( $node );
}
print "\n";

produces

1 3 5 7 9

WebKit implements the standard efficiently by having an item cache and a length cache. The item cache allows efficient retrieval of an item which has an index close to the previously requested item. These caches are invalidated on tree mutation.

PHP has no concept of tree mutation event callbacks, so implementing this in PHP may be somewhat tedious.

PHP Version

PHP 8.2.6

Operating System

No response

The text was updated successfully, but these errors were encountered:

nielsdos · 2023-05-24T21:05:29Z

Side note: I don't know what the other core people think, but I consider this as a feature request and as too risky to go into stable branches.

@tstarling
I created a proof-of-concept patch against the master branch that implements a coarse-grained caching strategy. It's nothing fancy. I just keep a counter that tracks how many times the document has been modified. Upon creation of an iterator, I store the current document's modification counter inside the iterator. If upon iteration the counter of the iterator and document match, then we know no modification took place and we can use the cached node. Otherwise, we'll fall back to the old slow codepath.
The reason I chose a counter is for performance: this is relatively lightweight, much more lightweight than the actual DOM modification themselves.
You might be wondering about counter overflow, this would indeed be a problem. I fixed this by using a saturating counter. If the counter is saturated then we will always take the old slow codepath. This means that even if 2^64 modifications occur on a 64-bit system, then the result will still be safe (albeit resulting in slower iteration).

I believe this is an acceptable trade-off. A more fine-grained approach is to have counters for each element tag type, but that has more impact on performance. The approach I propose is quite lightweight I think.
As you said, the libxml2 API does not expose a modification event callback, so I had to go through all the functions manually to see where an invalidation needs to occur.

There's still some TODOs for the patch to be ready, but you can find an early preview PR here: nielsdos#20
Let me know if this works for your use case.

tstarling · 2023-05-25T00:08:32Z

Side note: I don't know what the other core people think, but I consider this as a feature request and as too risky to go into stable branches.

Yes, I assume this would only be in master/8.3. #7478 was only allowed in PHP 8.1 which was then in RC status, and that was a fair bit simpler.

You might be wondering about counter overflow, this would indeed be a problem. I fixed this by using a saturating counter. If the counter is saturated then we will always take the old slow codepath. This means that even if 2^64 modifications occur on a 64-bit system, then the result will still be safe (albeit resulting in slower iteration).

At 1µs per modification it would take 585,000 years to overflow the counter, so I'm not terribly worried about overflow. I'm more worried about some unexpected code path causing tree modification without the counter being incremented -- it's fine for that to cause incorrect results, but it would be nice if we could have memory safety. Is it possible to keep the current node alive by incrementing the reference count in its dom_object?

Let me know if this works for your use case.

Today's use case needs foreach, and it also uses item() called with consecutive indexes although that could be replaced with foreach. I found out we have https://phabricator.wikimedia.org/T317255 which is a general tracking bug for "remove every usage of getElementsByTagName and fail CI if we ever see it again". For getElementsByTagName to be usable by naïve developers familiar with JS, we will need O(N) foreach, O(1) item() for consecutive indexes, and O(1) length.

But item() and length can be added later, building on your counter concept.

tstarling · 2023-05-25T00:46:01Z

There's still some TODOs for the patch to be ready, but you can find an early preview PR here: nielsdos#20
Let me know if this works for your use case.

I compiled it, it seems to work. Even without touching those item() calls, it reduces the time taken for my test case from 10.1s to 2.0s.

nielsdos · 2023-05-25T21:27:42Z

At 1µs per modification it would take 585,000 years to overflow the counter, so I'm not terribly worried about overflow. I'm more worried about some unexpected code path causing tree modification without the counter being incremented -- it's fine for that to cause incorrect results, but it would be nice if we could have memory safety. Is it possible to keep the current node alive by incrementing the reference count in its dom_object?

You're right. I amended it such that on 64-bit the overflow isn't checked for performance reasons.
The current node is already kept alive in the current code, so no use-after-free is possible.

Today's use case needs foreach, and it also uses item() called with consecutive indexes although that could be replaced with foreach. I found out we have https://phabricator.wikimedia.org/T317255 which is a general tracking bug for "remove every usage of getElementsByTagName and fail CI if we ever see it again". For getElementsByTagName to be usable by naïve developers familiar with JS, we will need O(N) foreach, O(1) item() for consecutive indexes, and O(1) length.

But item() and length can be added later, building on your counter concept.

I see the importance of improving the time complexity.
I'll work on item() and length too.

…iteration The current implementation follows the spec requirement that the list must be "live". This means that changes in the document must be reflected in the existing node lists without requiring the user to refetch the node list. The consequence is that getting any item, or the length of the list, always starts searching from the root element of the node list. This results in O(n) time to get any item or the length. If there's a for loop over the node list, this means the iterations will take O(n²) time in total. This causes real-world performance issues with potential for downtime (see phpGH-11308 and its references for details). Fixes phpGH-11308.

…iteration The current implementation follows the spec requirement that the list must be "live". This means that changes in the document must be reflected in the existing node lists without requiring the user to refetch the node list. The consequence is that getting any item, or the length of the list, always starts searching from the root element of the node list. This results in O(n) time to get any item or the length. If there's a for loop over the node list, this means the iterations will take O(n²) time in total. This causes real-world performance issues with potential for downtime (see phpGH-11308 and its references for details). We fix this by introducing a caching strategy. We cache the last iterated object in the iterator, the last requested item in the node list, and the last length computation. To invalidate the cache, we simply count the number of modifications made to the containing document. If the modification number does not match what the number was during caching, we know the document has been modified and the cache is invalid. If this ever overflows, we saturate the modification number and don't do any caching anymore. Note that we don't check for overflow on 64-bit systems because it would take hundreds of years to overflow. Fixes phpGH-11308.

…iteration (#11330) * Implement iteration cache, item cache and length cache for node list iteration The current implementation follows the spec requirement that the list must be "live". This means that changes in the document must be reflected in the existing node lists without requiring the user to refetch the node list. The consequence is that getting any item, or the length of the list, always starts searching from the root element of the node list. This results in O(n) time to get any item or the length. If there's a for loop over the node list, this means the iterations will take O(n²) time in total. This causes real-world performance issues with potential for downtime (see GH-11308 and its references for details). We fix this by introducing a caching strategy. We cache the last iterated object in the iterator, the last requested item in the node list, and the last length computation. To invalidate the cache, we simply count the number of modifications made to the containing document. If the modification number does not match what the number was during caching, we know the document has been modified and the cache is invalid. If this ever overflows, we saturate the modification number and don't do any caching anymore. Note that we don't check for overflow on 64-bit systems because it would take hundreds of years to overflow. Fixes GH-11308.

tstarling added Bug Status: Needs Triage labels May 24, 2023

nielsdos added Extension: dom Status: Verified and removed Status: Needs Triage labels May 24, 2023

nielsdos self-assigned this May 25, 2023

nielsdos mentioned this issue May 27, 2023

Implement iteration cache, item cache and length cache for node list iteration #11330

Merged

nielsdos closed this as completed in #11330 Jun 2, 2023

TimWolla mentioned this issue Jul 12, 2023

Bad performance of DOMDocument::getElementsByTagName #11689

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getElementsByTagName() is O(N^2) #11308

getElementsByTagName() is O(N^2) #11308

tstarling commented May 24, 2023

nielsdos commented May 24, 2023

tstarling commented May 25, 2023

tstarling commented May 25, 2023

nielsdos commented May 25, 2023

getElementsByTagName() is O(N^2) #11308

getElementsByTagName() is O(N^2) #11308

Comments

tstarling commented May 24, 2023

Description

PHP Version

Operating System

nielsdos commented May 24, 2023

tstarling commented May 25, 2023

tstarling commented May 25, 2023

nielsdos commented May 25, 2023