Currently Mastodon provides two storage formats: as a folder or as a *.mastodon file. Both file formats have potential problems when used inside a git repository:
When text files are used: a git merge might introduce inconsistencies in the text files when merging two changes to the same dataset. Conflicting changes, like removing a spot and adding an edge to that spot, might not be detected as conflicts.
When binary files are used: committing many different versions of a large binary file might blow up the repository size, and size limitations in git or GitHub might ruin the attempt to version Mastodon files.
Possible solutions:
Text based
Binary files divided into blocks (preferred)
Make sure to always change one small file whenever there is a change to the Mastodon project. This will cause a git merge conflict and prevent git from destroying the dataset during merge attempts, i.e. no automatic merge will be done by git.
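The "always change one small file" idea can be sketched as follows. This is a hypothetical illustration (file name `version.txt` and class name are made up, not part of any existing format): every save writes a fresh random UUID into the same small file, so two diverging saves are guaranteed to conflict there and git refuses to auto-merge.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch: touch a small "conflict marker" file on every save.
// Two diverging saves always write different UUIDs to this file, so git
// reports a merge conflict instead of silently auto-merging the dataset.
public class ConflictMarker {
    public static void touch(Path projectDir) throws IOException {
        Files.writeString(projectDir.resolve("version.txt"),
                UUID.randomUUID().toString());
    }
}
```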
Binary files divided into blocks
Introduce a key
Mastodon rewrites the spot ids when saving a project. Non-constant spot ids are problematic: a small change in the ModelGraph can easily change a large number of spot ids. This is a problem for efficient storage (delta compression) of multiple versions with git. It is therefore necessary to have a key value that normally doesn't change.
The data in a Mastodon project could be expressed in two tables:
spot table:
------------
1. spot-key (unique and constant)
2. timepoint
3. x
4. y
5. z
6. label
7. covariance matrix
?? are there more attributes
link table
-----------
1. link-key (unique and constant)
2. source spot-key
3. target spot-key
4. outgoing edge index
5. incoming edge index
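The two table schemas above can be sketched as Java records. This is only an illustration of the proposed columns (the record and field names are made up, not existing Mastodon classes); the point is that `spotKey`/`linkKey` are the stable keys that must not change between saves.

```java
// Hypothetical sketch of the two tables as Java records; the fields follow
// the columns listed above. spotKey/linkKey are the unique, constant keys.
public record SpotRow(long spotKey, int timepoint,
        double x, double y, double z,
        String label, double[] covariance) {}

public record LinkRow(long linkKey, long sourceSpotKey, long targetSpotKey,
        int outgoingEdgeIndex, int incomingEdgeIndex) {}
```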
Additionally, tables for properties should be created. These cover tagsets and features.
property tables:
------------
1. spot-key or link-key (spot vs link can currently be derived from the filename)
2. value (e.g. feature value, tag value as boolean)
These tables could be stored in a simple chunked binary format (easiest solution using DataOutputStream, which would be Java-specific) or as ZARR tables (which could potentially also be read from Python scripts). A table should be sorted by the key and be chunked: for example, keys 0-999 are written into the first file, keys 1000-1999 into the second file, etc.
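A minimal sketch of the DataOutputStream variant, under the chunking scheme just described (1000 keys per file; the class and file names are illustrative, not the actual Masgitoff code). Chunk boundaries depend only on the key, so an unchanged key range produces a byte-identical file, which is what makes git's delta compression effective.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch: write a key-sorted table into fixed chunks of 1000 keys per file.
// Keys 0-999 go into chunk_0.bin, keys 1000-1999 into chunk_1.bin, etc.
public class ChunkedTableWriter {
    static final int CHUNK_SIZE = 1000;

    public static void write(Path dir, SortedMap<Integer, double[]> table)
            throws IOException {
        Files.createDirectories(dir);
        DataOutputStream out = null;
        int currentChunk = -1;
        for (Map.Entry<Integer, double[]> e : table.entrySet()) {
            int chunk = e.getKey() / CHUNK_SIZE; // chunk index from key only
            if (chunk != currentChunk) {         // key range crossed: new file
                if (out != null) out.close();
                out = new DataOutputStream(new BufferedOutputStream(
                        Files.newOutputStream(dir.resolve("chunk_" + chunk + ".bin"))));
                currentChunk = chunk;
            }
            out.writeInt(e.getKey());
            for (double v : e.getValue()) out.writeDouble(v);
        }
        if (out != null) out.close();
    }
}
```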
Using UUID as key value
advantage:
avoiding clashes.
disadvantages:
unevenly distributed
not dense, wasting memory
The data would need to be stored in a kind of chunked hash table, which is pretty complicated. Unsolved question: which UUID entries would be stored in which chunk?
Possible alternative to UUID: use a simple counter (integer or long). Each new spot receives an id from this counter. When spots are deleted, their ids will never be assigned again.
advantages:
less memory required
quicker iteration
may potentially be used as indices into arrays (with "holes")
disadvantage:
clashes may occur when merging edits from two users; the merge tool would need to resolve this and assign new unused ids
If spots are deleted, the resulting "holes" are not filled. Deleted spots subsequently need to be removed from the link tables, tag tables and feature tables.
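The counter scheme described above can be sketched in a few lines (class name illustrative): each new spot gets the next integer, and ids of deleted spots are never reused, so a key stays constant for the lifetime of the project.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the counter-based id scheme: ids are handed out monotonically
// and never reused, so holes left by deleted spots stay empty forever.
public class IdAllocator {
    private final AtomicInteger next;

    public IdAllocator(int highestUsedId) {
        // On load, continue after the highest id ever handed out,
        // even if that spot has since been deleted.
        this.next = new AtomicInteger(highestUsedId + 1);
    }

    public int newId() {
        return next.getAndIncrement();
    }
}
```

When merging edits from two users, the merge tool would additionally have to detect id clashes and re-assign fresh ids from one shared allocator, as noted in the disadvantage above.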
Text format using UUID (not-preferred)
Each spot and link gets a UUID. The spot and link tables are stored in text files, one row per entry.
git merge would cause false positive conflicts and, more critically, false negatives: real conflicts that go undetected.
It's probably still necessary to divide the text file into chunks in order to save memory when using git.
TODOs:
Decide if ZARR can be used to store the tables. Check if strings can be stored in ZARR tables.
Decide if a UUID or a simple counter should be used as key.
Write a test class that pressure-tests the file format while being used together with git. Simulate the iterative generation of a dataset and the commits along the way. Measure the size of the git history. My idea is to open one of Mette's datasets and copy the spots and links in batches to a new dataset. Along the way, delete some parts of the dataset and add them back later.
Write a test that grows a graph, stores intermediate stages in the Masgitoff format, and makes several commits. Finally, measure the size of the git history. Compare the test to a run with the standard Mastodon format.
Result: Standard format explodes the git history. Masgitoff format works well.
Make the test more realistic by not only growing the graph by adding spots, but also removing some spots from time to time. -> still works well.
Make the test more realistic by not only writing to Masgitoff, but also closing the Mastodon instance, reopening it, and then growing further.
Pressure test with renaming spots.
Pressure test with changing tags.
Write code to launch and open Mastodon from a Masgitoff dataset.
Make notes on the design decisions made.
Integrate the Masgitoff file format into the Mastodon collaboration plugin.
Currently, label set ids are rewritten on every save operation; make sure that's not a problem.
What about tag ids? How do they behave if tags are renamed, added, or deleted? Do we need to consider that?
Is the edge order correctly recovered when opening a Mastodon dataset?
Compare file reading performance: UUID vs int32
The results were very counter-intuitive. Reading 128 bits from a file plus a lookup in a HashMap&lt;UUID, Object&gt; is almost as performant as reading a 32-bit int plus a lookup in an ArrayList&lt;Object&gt;. So the result of my experiments is: don't hesitate to use UUIDs, the read performance can be the same as with 32-bit ints. But it's very important to use a BufferedInputStream and BufferedOutputStream.
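A sketch of the I/O part of that measurement setup (class and method names are illustrative): each UUID is written as two longs (128 bit) and read back through buffered streams. Without BufferedInputStream/BufferedOutputStream every read or write becomes a separate syscall, which dominates the timing and makes UUIDs look far slower than they are.

```java
import java.io.*;
import java.util.UUID;

// Sketch: serialize UUIDs as two longs and read them back, always going
// through buffered streams so per-call overhead does not dominate.
public class UuidIo {
    public static void writeUuids(File f, UUID[] ids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (UUID id : ids) {
                out.writeLong(id.getMostSignificantBits());
                out.writeLong(id.getLeastSignificantBits());
            }
        }
    }

    public static UUID[] readUuids(File f, int count) throws IOException {
        UUID[] ids = new UUID[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            for (int i = 0; i < count; i++)
                ids[i] = new UUID(in.readLong(), in.readLong());
        }
        return ids;
    }
}
```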
This is a part of: #12