The Verifier code in JDOM is critical to JDOM's performance, especially during parsing, so it makes sense to ensure this code is as fast as possible.
A number of attempts have been made in the past to improve the performance of this code.
"Canadian Wilf" has made the latest of these attempts, and by combining some of his ideas with others presented in the past, it makes sense to take another stab at improving Verifier performance.
There are two basic issues with the current verifier:
The current Verifier suffers from unscalable performance: some characters take longer to verify than others. The current Verifier uses 'sieve' logic, testing a character against a series of range checks until it falls through to a match, so characters matched late in the sieve take longer than those matched early. The way the code is designed, the 'large' characters (those higher in the Unicode range) are slower than the small characters. Thus, documents with worst-case characters in the upper reaches of the Unicode character specification are slower to parse than simple ASCII documents.
Because the Unicode characters associated with the 'East Asian' languages (Chinese, Korean, and Japanese) are in those upper ranges, these languages are much slower to process than other languages.
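The sieve behaviour can be illustrated with a small sketch. The ranges and method name here are illustrative, not JDOM's actual code:

```java
// Hypothetical sketch of sieve-style classification (not JDOM's actual
// code): each range test runs in order, so a character matched late in
// the sieve pays for every comparison before it.
public class SieveSketch {
    static boolean isNameStart(char c) {
        if (c >= 'a' && c <= 'z') return true;  // ASCII letters match early
        if (c >= 'A' && c <= 'Z') return true;
        if (c == '_' || c == ':') return true;
        if (c >= 0x00C0 && c <= 0x00D6) return true; // Latin-1 letters
        // ...many more range checks in the real sieve...
        // A CJK character such as U+4E2D falls through everything above
        // before finally matching here.
        if (c >= 0x4E00 && c <= 0x9FFF) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isNameStart('a'));      // fast: matches the first check
        System.out.println(isNameStart('\u4E2D')); // slow: falls through to the last
    }
}
```

Both calls return true, but the CJK character does many times more comparisons to get there, which is exactly the scalability problem.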
The 'is' methods (like isXMLNameCharacter(char)) on the Verifier are the ones used in the tightest loops, so small improvements in these methods can have a large impact on overall performance. In reality, though, these 'small' methods are mostly called from three 'check' methods: checkElementName(), checkAttributeName(), and checkCharacterData(). These three methods are the entry point for almost all of the Verifier's activity, and they are therefore what the performance test harness benchmarks. Any improvements to the 'is' methods will be reflected in the 'check' methods.
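A hedged sketch of how a 'check' method drives the 'is' methods in a tight loop. The structure is assumed from the description above; the simplified character classes and error messages here stand in for the real XML rules:

```java
// Sketch of a check method's loop structure (assumed; simplified
// character classes stand in for the real XML name rules).
public class CheckSketch {
    static boolean isXMLNameStartCharacter(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }
    static boolean isXMLNameCharacter(char c) {
        return isXMLNameStartCharacter(c) || Character.isDigit(c)
                || c == '-' || c == '.';
    }

    // Returns null when the name is valid, an error message otherwise.
    static String checkElementName(String name) {
        if (name == null || name.isEmpty()) {
            return "Element names cannot be null or empty";
        }
        if (!isXMLNameStartCharacter(name.charAt(0))) {
            return "Element names cannot begin with '" + name.charAt(0) + "'";
        }
        // This is the tight loop: one 'is' call per character, so the
        // per-character cost of the 'is' method dominates.
        for (int i = 1; i < name.length(); i++) {
            if (!isXMLNameCharacter(name.charAt(i))) {
                return "Element names cannot contain '" + name.charAt(i) + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(checkElementName("chapter"));  // null: valid
        System.out.println(checkElementName("7chapter")); // an error message
    }
}
```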
The Verifier code is typically 'buried' in an XML parse, and there's no easy way to measure just the Verifier's time independently of the rest of JDOM. To get around this, a custom test system was built to exercise just the Verifier. The three methods on the Verifier considered most performance-critical are tested directly by the harness.
The bigger challenge for the test harness is getting the right data to test with. To get representative data, a selection of XML documents was 'harvested' for Element names, Attribute names, and general XML character data. These three data types were stored in a 'repository' file and are used to feed the three methods directly.
You can download the test data here.
You can inspect the performance in the contrib section here: org.jdom2.contrib.perf.PerfVerifier
The test harness pumps all the test data through the three respective methods and times how long it takes. It does this many times, then selects the fastest third of the runs, averages those times, and reports the result. This yields a consistent measure of performance.
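The fastest-third averaging described above can be sketched as follows. This is a minimal reconstruction of the strategy, not the actual harness code, which lives in org.jdom2.contrib.perf.PerfVerifier:

```java
import java.util.Arrays;

// Sketch of the harness's timing strategy (reconstructed from the
// description): run the workload many times, keep only the fastest
// third of the runs, and average those to suppress JIT/GC noise.
public class FastestThird {
    static double averageFastestThird(long[] nanos) {
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int keep = Math.max(1, sorted.length / 3); // fastest third of the runs
        long sum = 0;
        for (int i = 0; i < keep; i++) {
            sum += sorted[i];
        }
        return sum / (double) keep;
    }

    public static void main(String[] args) {
        long[] runs = {30, 10, 20, 12, 11, 40};
        System.out.println(averageFastestThird(runs)); // averages the 2 fastest: 10.5
    }
}
```

Averaging only the fastest runs discards warm-up iterations and GC pauses, which is why repeated runs of the harness report stable numbers.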
It is not possible to compare performance numbers from one machine with those from another, so all results must be taken on the same machine using the same JVM, etc.
All the results on this page were taken on my laptop, a Core i7 with 6 GB of memory.
First, the performance baseline:
Validating took: att=7.492ms emt=10.447ms char=27.966ms mem=82.367KB
This indicates that checkAttributeName() took 7.5ms, checkElementName() took 10.5ms, and checkCharacterData() took 28ms. The baseline memory usage is 82KB.
The goal is to do better than halve these times, so we expect 'att' to come in under 3.7ms, 'emt' under 5.2ms, and 'char' under 14ms.
The first phase of the process was to replace the sieve-based is* methods with bitmask-based versions. A number of different bitmask strategies were tried, but the 'byte-per-char' system was found to have excellent performance characteristics and the smallest impact on memory. The byte-per-char system pre-calculates the 'roles' of each character and stores the roles as a bitmask in a single byte value. There are 7 'roles' calculated for each char, and thus 7 bits are used in each byte. Since there are 65,536 characters, the memory impact of the lookup byte-array is 64KB.
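The byte-per-char idea can be sketched as follows. The flag names and bit assignments here are illustrative, not JDOM's actual constants, and only a few table entries are populated for the example:

```java
// Sketch of the byte-per-char lookup: one byte of role flags per char,
// so every 'is' test becomes a single array load plus a mask test.
public class ByteMaskSketch {
    static final byte NAME_START = 0x01;
    static final byte NAME_CHAR  = 0x02;
    static final byte CHAR_DATA  = 0x04;
    // 65,536 entries at one byte each: the 64KB memory cost.
    static final byte[] ROLES = new byte[1 << 16];

    static {
        // Populate a few illustrative entries; the real table is generated
        // offline from the full XML character classifications.
        for (char c = 'a'; c <= 'z'; c++) {
            ROLES[c] = (byte) (NAME_START | NAME_CHAR | CHAR_DATA);
        }
        for (char c = '0'; c <= '9'; c++) {
            ROLES[c] = (byte) (NAME_CHAR | CHAR_DATA);
        }
    }

    // Constant time for every character, ASCII or CJK alike.
    static boolean isNameStart(char c) {
        return (ROLES[c] & NAME_START) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isNameStart('a')); // true
        System.out.println(isNameStart('7')); // false: digits cannot start a name
    }
}
```

Because the lookup is a single indexed load, every character takes the same time to classify, which is what removes the East Asian penalty described earlier.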
A new class was created in the contrib area: VerifierBuilder.
This class uses the 'old' mechanisms from Verifier to calculate the roles for every character, and it builds a dataset used to create the lookup table in the new Verifier.
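A minimal sketch of the builder approach. The method names and bit layout here are hypothetical; the real VerifierBuilder uses the old Verifier's full classification logic for all 7 roles:

```java
// Sketch of a table builder: classify every char once with the old
// sieve-style logic, fold the results into a bitmask, and emit the
// table as Java source ready to paste into the new Verifier.
public class BuilderSketch {
    // Stand-ins for the old Verifier's classification methods.
    static boolean oldIsNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }
    static boolean oldIsNameChar(char c) {
        return oldIsNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    // Combine the role bits for one character.
    static int roleOf(char c) {
        int mask = 0;
        if (oldIsNameStart(c)) mask |= 0x01;
        if (oldIsNameChar(c))  mask |= 0x02;
        return mask;
    }

    public static void main(String[] args) {
        // Emit the first few entries of the generated source as a taste;
        // the real output covers all 65,536 characters.
        StringBuilder src = new StringBuilder("static final byte[] ROLES = {");
        for (int i = 0; i < 128; i++) {
            src.append(roleOf((char) i)).append(',');
        }
        src.append(" /* ... */ };");
        System.out.println(src);
    }
}
```

The generated source is then pasted into the Verifier class, which is the manual step discussed at the end of this page.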
The first completed version of this byte-per-char system has the following performance metrics (with the baseline above it):
Validating took: att=7.492ms emt=10.447ms char=27.966ms mem=82.367KB
Validating took: att=2.684ms emt=3.329ms char=18.648ms mem=149.297KB
The Verifier was further tweaked with a number of optimizations to reduce the execution time (and memory footprint) some more, and finally the checkCharacterData() method was restructured to reduce the number of variables in it, which in turn allows it to run faster (again, with the baseline above it):
Validating took: att=7.492ms emt=10.447ms char=27.966ms mem=82.367KB
Validating took: att=2.621ms emt=3.139ms char=11.470ms mem=148.773KB
As you can see, the Verifier now runs in about a third of the time it used to for all the critical methods, at the cost of about 64KB of memory.
The performance goals have been met, but the code is significantly more complicated. Character usage in the XML spec does not change, so this is not code that will need much maintenance, and it was already a 'black box' of sorts; changing it from one form of black box to another is not really an issue. The only problem is that changing the character classifications now requires first editing the VerifierBuilder class, running it, and then pasting the results into the Verifier class. This is cumbersome, but in theory it should never be necessary again now that it has been done once (unless there is a mistake in a character classification, which is highly unlikely).
The performance gains are worth the maintenance challenges.