Skip to content

Check filesystem for existing corpora on creation, import and deletion#313

Merged
thomaskrause merged 1 commit intokorpling:mainfrom
matthias-stemmler:bugfix/corpus-existence-check-fs
Feb 17, 2025
Merged

Check filesystem for existing corpora on creation, import and deletion#313
thomaskrause merged 1 commit intokorpling:mainfrom
matthias-stemmler:bugfix/corpus-existence-check-fs

Conversation

@matthias-stemmler
Copy link
Copy Markdown
Contributor

I found an issue with the check whether a corpus with a given name exists when trying to create an empty corpus, import a corpus or delete a corpus: The current implementation only checks if there is a cache entry for the given name, but not whether a corpus actually exists on disk. This can be a problem in two cases:

  • When a corpus has never been added to the cache
  • When a corpus has been evicted from the cache because the cache grew too large

We originally stumbled upon this in matthias-stemmler/annimate#471, where it caused an interesting case of data corruption. The sequence of events was presumably as follows:

  • The user imported all corpora of ReM 1.0.
  • They loaded all corpora, but due to limited memory, some got evicted from the cache.
  • They tried to import all corpora of ReM 2.x.
    • Since the names are identical in both versions and I set overwrite_existing to false, this should have failed. However, since existence is checked against the cache, the import went through, at least for those corpora that had been evicted.
  • Additionally, the old data weren't deleted, but instead the new data were imported on top of them.
    • This means that components that exist in ReM 1.0, but not in ReM 2.x, remained on disk. In this case it was Coverage/default_ns/, Coverage/VIRTUAL/ and Ordering/default_ns/.
  • Now node ids from the ReM 1.0 components were reinterpreted in the context of ReM 2.x, causing nonsensical edges such as coverage between nodes of different documents.
  • This made the subgraph method return wrong results, causing a broken export.

We could of course fix it by deleting and reimporting the corpora.

When looking through the code, I found that the same issue as for the import also applies to corpus creation and deletion. All other operations seem fine to me.

For the fix, I'm now checking whether a corpus exists on disk in addition to the cache check. I kept the latter to avoid changing the behavior in case a corpus exists only in the cache and not on disk (which can happen, for instance, when the info method is called on a nonexistent corpus).
I also noticed that the deletion logic acquires an exclusive lock on the cache entry before deleting from disk, so I figured the import should do the same in case the corpus already exists.

Thanks to @S-Oppermann for reporting the original issue and providing the data for analysis.

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 60.71429% with 11 lines in your changes missing coverage. Please review.

Project coverage is 67.86%. Comparing base (c368b15) to head (67e17c9).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
graphannis/src/annis/db/corpusstorage.rs 60.71% 5 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #313      +/-   ##
==========================================
+ Coverage   67.70%   67.86%   +0.15%     
==========================================
  Files          76       76              
  Lines       18958    18968      +10     
  Branches    18958    18968      +10     
==========================================
+ Hits        12835    12872      +37     
+ Misses       4662     4633      -29     
- Partials     1461     1463       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@thomaskrause thomaskrause self-assigned this Feb 17, 2025
Copy link
Copy Markdown
Member

@thomaskrause thomaskrause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and especially the thorough test cases. The change looks good to me, I would add a changelog entry later to the main branch.

@thomaskrause thomaskrause merged commit 9fb79dd into korpling:main Feb 17, 2025
@matthias-stemmler matthias-stemmler deleted the bugfix/corpus-existence-check-fs branch February 17, 2025 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants