Optimizations in UDMF parser can cause data loss #622

biwa · 2021-09-28T12:36:21Z

In commit 0366f13 (nearly 8 years ago) some optimizations were added to the UDMF parser, which - according to the commit message - made parsing 35% faster. These optimizations are quite clever, unfortunately they heavily rely on the formatting of the TEXTMAP. But we can not rely on a certain formatting as per the UDMF specs.

How the optimization works

The optimization works by not parsing identical lines that have been parsed before, and thus have the UDMF field's name and value known. Take this snippet:

thing // 0
{
    x = 0;
    y = 64;
    type = 1;
}

thing // 1
{
    x = 0;
    y = 64;
    type = 1;
}

Let's say the parser finished parsing x = 0; of thing 0. It then creates a UniversalEntry with the key x and the value 0 and adds it to the current UniversalCollection. Then it adds the line x = 0; to a cache (the variable is called matches) dictionary, where the line is the key and the just parsed UniversalEntry is the value.
For each new line it checks if that line is already in the cache, and if it is, it just adds that cached UniversalEntry to the current UniversalCollection, goes to the next line, checks if it's in the cache etc.

Why this is problematic

This works excellently - as long as the lines are formattet as in the above snippet. If they aren't (which is perfectly legal as per the UDMF specs) things go awry. Take this snippet:

thing // 0
{
    x = 0; y = 64;
    type = 1;
}

thing // 1
{
    x = 0; y = 64;
    type = 1;
}

Here the x and y definitions are on the same line. Now the following happens:

parsing thing 0 starts
x is parsed. The resulting UniversalEntry is added to the current UniversalCollection
the line x = 0; y = 64; is added to the cache, with the UniversalEntry having the key x and the value 0
y is parsed. The resulting UniversalEntry is added to the current UniversalCollection
the line x = 0; y = 64; is already in the cache, so nothing is added to the cache
parsing continues as normal
parsing thing 1 starts
the parser checks if the line x = 0; y = 64; is in the cache
it is! So it adds the associated UniversalEntry, which has the key x and the value 0, to the current UniversalCollection
then nothing further is done to the line x = 0; y = 64;, the parser moves to the next line

This means that the y value for thing 1 is completely skipped. In this particular case UDB will actually complain about that, because there's no valid default value for the position, but other fields do have a valid default value (like angle), where UDB will be completely silent.

Consequences of reverting the optimization

Relatively speaking reverting the change is quite significant. For example parsing the TEXTMAP of Bastion of Chaos, which has 4.14 million lines, goes from ~1.9 seconds to ~3.3 seconds on my 7 year old Core i5 4670K, which is a increate of ~73%. Absolutely speaking this isn't too significant, and obviously risking data loss isn't an option, so that's something we have to live with.

The text was updated successfully, but these errors were encountered:

biwa added the bug label Sep 28, 2021

biwa self-assigned this Sep 28, 2021

biwa closed this as completed in ef18385 Sep 28, 2021

biwa added the fixed label Sep 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizations in UDMF parser can cause data loss #622

Optimizations in UDMF parser can cause data loss #622

biwa commented Sep 28, 2021

Optimizations in UDMF parser can cause data loss #622

Optimizations in UDMF parser can cause data loss #622

Comments

biwa commented Sep 28, 2021