Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizations in UDMF parser can cause data loss #622

Closed
biwa opened this issue Sep 28, 2021 · 0 comments
Closed

Optimizations in UDMF parser can cause data loss #622

biwa opened this issue Sep 28, 2021 · 0 comments
Assignees

Comments

@biwa
Copy link
Collaborator

biwa commented Sep 28, 2021

In commit 0366f13 (nearly 8 years ago) some optimizations were added to the UDMF parser, which - according to the commit message - made parsing 35% faster. These optimizations are quite clever, unfortunately they heavily rely on the formatting of the TEXTMAP. But we can not rely on a certain formatting as per the UDMF specs.

How the optimization works

The optimization works by not parsing identical lines that have been parsed before, and thus have the UDMF field's name and value known. Take this snippet:

thing // 0
{
    x = 0;
    y = 64;
    type = 1;
}

thing // 1
{
    x = 0;
    y = 64;
    type = 1;
}

Let's say the parser finished parsing x = 0; of thing 0. It then creates a UniversalEntry with the key x and the value 0 and adds it to the current UniversalCollection. Then it adds the line x = 0; to a cache (the variable is called matches) dictionary, where the line is the key and the just parsed UniversalEntry is the value.
For each new line it checks if that line is already in the cache, and if it is, it just adds that cached UniversalEntry to the current UniversalCollection, goes to the next line, checks if it's in the cache etc.

Why this is problematic

This works excellently - as long as the lines are formattet as in the above snippet. If they aren't (which is perfectly legal as per the UDMF specs) things go awry. Take this snippet:

thing // 0
{
    x = 0; y = 64;
    type = 1;
}

thing // 1
{
    x = 0; y = 64;
    type = 1;
}

Here the x and y definitions are on the same line. Now the following happens:

  1. parsing thing 0 starts
  2. x is parsed. The resulting UniversalEntry is added to the current UniversalCollection
  3. the line x = 0; y = 64; is added to the cache, with the UniversalEntry having the key x and the value 0
  4. y is parsed. The resulting UniversalEntry is added to the current UniversalCollection
  5. the line x = 0; y = 64; is already in the cache, so nothing is added to the cache
  6. parsing continues as normal
  7. parsing thing 1 starts
  8. the parser checks if the line x = 0; y = 64; is in the cache
  9. it is! So it adds the associated UniversalEntry, which has the key x and the value 0, to the current UniversalCollection
  10. then nothing further is done to the line x = 0; y = 64;, the parser moves to the next line

This means that the y value for thing 1 is completely skipped. In this particular case UDB will actually complain about that, because there's no valid default value for the position, but other fields do have a valid default value (like angle), where UDB will be completely silent.

Consequences of reverting the optimization

Relatively speaking reverting the change is quite significant. For example parsing the TEXTMAP of Bastion of Chaos, which has 4.14 million lines, goes from ~1.9 seconds to ~3.3 seconds on my 7 year old Core i5 4670K, which is a increate of ~73%. Absolutely speaking this isn't too significant, and obviously risking data loss isn't an option, so that's something we have to live with.

@biwa biwa added the bug label Sep 28, 2021
@biwa biwa self-assigned this Sep 28, 2021
@biwa biwa closed this as completed in ef18385 Sep 28, 2021
@biwa biwa added the fixed label Sep 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant