Add support for OWL/ZIP #375
I remember thinking the same; it would be quite good to have. @matentzn you'll be available to consult on this, right? ;-)
Excellent. It would be good to standardise on something. The basic idea is to have a required ontology …
You'll be surprised to know I have a certain amount of interest in, and have been doing a certain amount of research on, this issue (compression, packing, atom-splitting, etc.) :-) I needed a lot of small atoms to illustrate a major point. Note: all sizes below are those reported by du, which takes into account disk block size. The decomposition finds 56,240 atoms.

**Baseline (size on disk, contents catted to a single file)**

Before we look at the effects of archiving and compressing, we should get a measure of the size of the raw contents of the files. A simple way of doing this is to cat all the individual atom files together. We can also see the effect of applying compression to the raw contents, and note once again the effect of lexical sorting on OWL FSS (the individual files use the new sorted FSS output, which sorts by entity type, then entity, then axiom).
**Size on disk (one file per atom, directory size)**

Here we see the significance of the size of disk blocks when measuring the size of files on disk. The disk space used by the compressed files is the same as that used by the uncompressed files.
**Files stored in a single archive, no compression**

Just storing the uncompressed files in a single archive cuts the disk space requirements in half or more. We can see that zip has less overhead than tar.
**Files compressed, then stored in a single archive**

When we store individually compressed files, we see a difference. Once the disk block overhead is removed, we see the effect of compression.
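The gap between compressing atoms individually and compressing them together can be sketched with the JDK's DeflaterOutputStream. This is an illustration with synthetic atoms, not the actual benchmark above; the axiom strings are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class CompressionOrder {

    // Size of the deflated form of a byte array.
    static int deflatedSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(data);
        }
        return bos.size();
    }

    // Returns {sum of individually deflated atoms, one deflated stream of all atoms}.
    static int[] measure(int atoms) throws IOException {
        StringBuilder all = new StringBuilder();
        int individualTotal = 0;
        for (int i = 0; i < atoms; i++) {
            // Many small, similar "atom" files, as produced by a decomposition.
            String atom = "SubClassOf(<http://example.org/A" + i + "> <http://example.org/B" + i + ">)\n";
            individualTotal += deflatedSize(atom.getBytes(StandardCharsets.UTF_8));
            all.append(atom);
        }
        return new int[] {individualTotal, deflatedSize(all.toString().getBytes(StandardCharsets.UTF_8))};
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = measure(1000);
        System.out.println("sum of individually compressed atoms: " + sizes[0]);
        System.out.println("single compressed stream of all atoms: " + sizes[1]);
    }
}
```

With many small, similar files, the single stream comes out several times smaller, because redundancy across atoms is only visible to the compressor when they share one stream.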
**Files stored in a single archive, then compressed**

Now we reverse the order: instead of compressing, then archiving, we create the archive, then compress.
The difference is an order of magnitude.

**Trade-off: random access to contents of compressed files**

Because each file in a zip archive is independently compressed, and because the zip format keeps a directory of contents at the end of the file, it is quite easy to access files in random order. gzip files are one big block; it is difficult to randomly access their contents.

**Possible alternative**

It might be interesting to experiment with some of the text formats to see if they can be adapted to handle multiple ontologies in a single form. There's an obvious hack using RDF formats that support named graphs, whereby each ontology could be stored in a named graph in the same file. However, that involves RDF. It might be possible to define an annotation that specifies which ontology/component an axiom belongs to (the partitioning could be done in a post-loading stage). This is similar to the work that I'm doing for @cmungall, which involves keeping a cache of inferred axioms stored in the same file as the axioms from which they are derived. These axioms are marked with annotations (both the is_inferred annotation that comes from OBO, and more detailed provenance information: import versions, transforms, and reasoner information). The advantage of using an annotation-style approach is that it is backwards compatible; the disadvantage is that it is harder to split out individual pieces. For imports of specific, versioned ontologies, things are easy. For imports of dynamic ontologies, one must check the cached imported axioms on load.
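The random-access property is easy to see with plain java.util.zip (a sketch with made-up entry names, not OWL API code): ZipFile reads the central directory at the end of the archive and inflates only the requested entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipRandomAccess {

    // Build a small archive with several independently deflated entries.
    static Path createSample() throws IOException {
        Path zip = Files.createTempFile("atoms", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (String name : new String[] {"A.owl", "B.owl", "C.owl"}) {
                out.putNextEntry(new ZipEntry(name));
                out.write(("Ontology(<http://example.org/" + name + ">)")
                        .getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return zip;
    }

    // ZipFile consults the central directory, so a single entry can be
    // located and inflated without decompressing the others.
    static String readEntry(Path zip, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile());
             InputStream in = zf.getInputStream(zf.getEntry(name))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSample();
        System.out.println(readEntry(zip, "B.owl"));
        Files.delete(zip);
    }
}
```

A gzipped tar offers no equivalent shortcut: reaching the last member means inflating the whole stream up to it.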
Thanks all, this sounds like it would be very useful.
Not sure if this is mixing completely opposing concerns, but might it be useful to make sure this is compatible with the GitHub release mechanism?
I would like to propose a slight modification to the owl/zip proposal. Currently an owl/zip archive must have a particular file. Instead I propose that there is a separate way to indicate what the root file to load should be. This could be as simple as a file with an agreed name. With this in place, it will be very easy for ontology developers to use the GitHub release mechanism to make owl/zip files. Not only that, but external services that archive GitHub releases, such as Zenodo, will also generate valid owl/zip files 'for free'. Here is an example of an ontology release using the GitHub mechanism: https://github.com/obophenotype/cephalopod-ontology/releases/tag/release-2015-03-15 We'd be willing to make changes in conventions such as directory layout, but I'd rather do this sooner rather than later.
How does OWL/ZIP compare with http://www.rdfhdt.org/ when it comes to compression and speed of reconstitution?
HDT seems quite interesting, but from a first skim it looks like a plugin for it would need a JNI bridge; that would be tough to support in the core OWLAPI, but would make a useful separate plugin.
There's Java and C++ code. However, I don't think it is ideal for OWL, …
…O build. Historically we have built chebi centrally due to ebi-chebi/ChEBI#3202, but this issue is now resolved. Note I'm also introducing two new top-levels (maybe this should be on a new PR?). We don't have rules on new top-levels, but I assume $ONT.SUFFIX is allowed for reasonable suffixes within $ONT. CHEBI provides gzips, which is something I think we should encourage -- although it's fiddly with imports. See: owlcs/owlapi#375
I've taken a look at what YAML would allow. It seems like a good idea; I'm not entirely sure yet whether the power is worth the extra dependency (e.g., snakeyaml or another library). What I'm thinking is as follows:
A properties file could be sufficient for this purpose, with the necessary escapes in place. In terms of creating an owl/zip file, this property file is pretty simple to put together. New …
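As a sketch of how such an index could work, using only the JDK: the owlzip.properties key convention shown here (plain IRI-to-path entries) is an assumption for illustration, not the final format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class OwlZipIndex {

    // Build an archive containing a hypothetical owlzip.properties index
    // mapping an ontology IRI to a path inside the archive.
    static Path createSample() throws IOException {
        Path zip = Files.createTempFile("owlzip", ".zip");
        // ':' in a properties key must be escaped as '\:'.
        String index = "http\\://test.org/A.owl=somefolder/A.owl\n";
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("owlzip.properties"));
            out.write(index.getBytes(StandardCharsets.ISO_8859_1));
            out.closeEntry();
        }
        return zip;
    }

    // Only the index entry is read; no ontology needs to be parsed to
    // learn which IRIs the archive can resolve.
    static Properties readIndex(Path zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile());
             InputStream in = zf.getInputStream(zf.getEntry("owlzip.properties"))) {
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSample();
        System.out.println(readIndex(zip).getProperty("http://test.org/A.owl"));
        Files.delete(zip);
    }
}
```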
Just thought of something else. One of the space-wasting factors, when saving each ontology/ontology fragment to its own file, is the large number of files and the block size overhead. Hence one of the advantages of OWL/ZIP is removing that waste, even before compression enters the picture.

However, suppose I have two ontologies A and B, importing ontologies C, D, E, which in turn import F, G... Z. If I keep these all in one folder and resolve with AutoIRIMapper, the space used and the bytes read from disk match the size of A...Z (plus wasted block space). If I use OWL/ZIP for A and for B, each zip file will contain a compressed copy of C...Z. If I move C...Z to a separate OWL/ZIP file and leave A and B in their OWL/ZIP files, leaving imports resolution to sort things out, I end up with three files; if the network of dependencies is more complex and includes a few large ontologies, I might need to repeat this split of the zipped files, theoretically to its worst-case scenario, where each OWL/ZIP file contains just one file (1) - or, going the other way, where each file contains a copy of the same files (n). (1) has the disadvantage that local import resolution is lost, and we again have large numbers of files, with block space waste (plus any waste due to downloading large numbers of files rather than large files, when talking about remote access).

Loading a few ontologies in the same manager should not differ substantially between the two approaches, as the ontology IRIs are stored in the property files, so one would know which file to load before having to open the file.

I've not seen any mention of updates to imported ontologies above - I suppose we are considering this as a mechanism to group together ontologies that are not modified often, or where we intentionally want a snapshot and not an evolving set of ontologies.
Proposal for the above: allow multiple roots in the property file. An example would look like this:
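Purely as an illustration (the key conventions here are hypothetical, not the thread's settled format), a properties file with multiple roots might look like:

```properties
# Hypothetical layout: a 'roots' key listing entry paths, plus one
# IRI-to-path mapping per ontology (':' escaped as '\:' in keys).
roots=D.owl someotherfolder/B.owl
http\://test.org/complexImports/D.owl=D.owl
http\://test.org/complexontologies/B.owl=someotherfolder/B.owl
```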
I've not included version information in the entry description; should it be included?
When I implemented something similar in the past I included version information:
Allow the contents of a zip file to be mapped to ontology IRIs as an IRI mapper for use in an ontology manager. Also extend AutoIRIMapper to do the same. owlzip.properties and catalog*.xml are supported to find ontology IRIs without parsing the actual files. Lacking these index files, the same logic as AutoIRIMapper applies. Also implements #621 (Import ontology from jar), where AutoIRIMapper can now map the contents of a jar or zip file.
+1 for yaml
Yams it is, I'll make soup.
Example YAML file:

```yaml
roots:
  - D.owl
  - someotherfolder/B.owl
D:
  physical: D.owl
  logical: http://test.org/complexImports/D.owl
  version:
A:
  physical: somefolder/A.owl
  logical: http://test.org/compleximports/A.owl
  version:
B:
  physical: someotherfolder/B.owl
  logical: http://test.org/complexontologies/B.owl
  version:
C:
  physical: someotherfolder/C.owl
  logical: http://test.org/compleximports/C.owl
  version:
```
YAML example:
Current state:
To do:
For mapping objects how about

```yaml
iri: http://test.org/complexImports/D.owl
versionIri: http://test.org/complexImports/D-1.3.owl
path: D.owl
```

i.e. reuse the notion of IRI and version IRI. Then, the roots object should refer to the ontology id, e.g.

```yaml
roots:
  - iri: http://test.org/complexImports/D.owl
    versionIri: ...
  - iri: http://test.org/complexImports/D.owl
```
OWLZipYaml can be used directly by the savvy developer in order to access OWLOntologyID objects for each ontology. The IRI mapping mechanism does not use ontology versions at present, so that part of the code is unchanged.

```yaml
D: &D { iri: 'http://test.org/complexImports/D.owl', versionIri:, path: D.owl }
A: { iri: 'http://test.org/compleximports/A.owl', versionIri:, path: somefolder/A.owl }
B: &B { iri: 'http://test.org/complexontologies/B.owl', versionIri:, path: someotherfolder/B.owl }
C: { iri: 'http://test.org/compleximports/C.owl', versionIri:, path: someotherfolder/C.owl }
roots:
  - <<: *D
  - <<: *B
```
Thanks @ignazio1977 and @matthewhorridge |
I wonder if some ideas from BagIt or BDBag could be incorporated, like the md5 checksum. Standards like these are likely to be adopted in the context of projects like the NIH Data Commons. This maybe doesn't make much sense and is certainly not worth delaying release and adoption of owlzip, as the use cases are largely different, but it seems that it would be possible to have a compatible optional profile of owlzip that is also compatible with bdbag/bagit-RO. For example, we could have by convention the incorporation of a provenance JSON-LD file like this. We've been discussing the generation of prov files as part of robot. BagIt also has a manifest file like this. I prefer the yaml above (and I was the one that originally requested yaml), but it's worth considering if there is value in adopting this. @stain may have some insights
Thanks everyone who has contributed to this. It's going to be very useful. @rctauber and I are trying it out now (knocean/ontofetch#19), and would be happy to contribute if there's anything you need. I was looking at the example YAML in this commit b583351 and wondering if it could be simpler. In the example below I've eliminated the YAML aliases/anchors, allowing for a number of other simplifications:
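The simplified YAML itself is not reproduced above; for illustration only, one simplification along those lines (the `ontologies` and `root` keys here are hypothetical) could fold the root flag into each entry and drop the anchors:

```yaml
ontologies:
  - iri: http://test.org/complexImports/D.owl
    path: D.owl
    root: true
  - iri: http://test.org/compleximports/A.owl
    path: somefolder/A.owl
  - iri: http://test.org/complexontologies/B.owl
    path: someotherfolder/B.owl
    root: true
  - iri: http://test.org/compleximports/C.owl
    path: someotherfolder/C.owl
```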
YAML format not included in version 5
YAML format not included in version 4
Add support for OWL/ZIP.
http://ceur-ws.org/Vol-1265/owled2014_submission_6.pdf
We might think about the conventions used, but the basic ideas seem sensible. This would require some twiddling to bring imports in from the right place (a temp override of the ontology IRI mapper).
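One way such an override could work (a sketch, not the OWL API implementation; entry names and IRIs are made up) is to remap an import IRI to a jar: URL pointing into the archive, which the JDK resolves natively:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class JarUrlImport {

    // Build an archive holding one imported ontology document.
    static Path createSample() throws IOException {
        Path zip = Files.createTempFile("imports", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("C.owl"));
            out.write("Ontology(<http://test.org/C.owl>)".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zip;
    }

    // An IRI mapper could redirect an ontology IRI to a jar: URL of the
    // form jar:file:/path/to/archive.zip!/entry, read here via the JDK's
    // built-in jar URL handler.
    static String resolve(Path zip, String entry) throws IOException {
        URL url = new URL("jar:" + zip.toUri() + "!/" + entry);
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSample();
        System.out.println(resolve(zip, "C.owl"));
    }
}
```

The temporary-override idea then amounts to installing such a mapping before loading and removing it afterwards.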