
Add support for OWL/ZIP #375

Closed · matthewhorridge opened this issue Mar 13, 2015 · 24 comments

@matthewhorridge (Contributor)

Add support for OWL/ZIP.

http://ceur-ws.org/Vol-1265/owled2014_submission_6.pdf

We might think about the conventions used, but the basic ideas seem sensible. This would require some twiddling to bring imports in from the right place (a temp override of the ontology IRI mapper).

@ignazio1977 (Contributor)

I remember thinking the same; it would be quite good to have. @matentzn you'll be available to consult on this, right? ;-)

@matthewhorridge (Contributor, Author)

Excellent. It would be good to standardise on something. The basic idea is to have a required ontology root.owl and then optionally add imports, encoding them using the directory structure (rather than some kind of manifest). How this last part works isn't totally clear to me at the moment.

@sesuncedu (Contributor)

You'll be surprised to know I have a certain amount of interest in, and have been doing a certain amount of research on, this issue (compression, packing, atom-splitting, etc.) :-) I needed a lot of small atoms to illustrate a major point.
Annoyingly, I didn't have one prepared, so I generated atoms from a recent GO (w/imports).

Note: all sizes below are those reported by du, which takes disk block size into account.

The decomposition finds 56,240 atoms.

Baseline (size on disk, contents catted to a single file)

Before we look at the effects of archiving and compressing, we should get a measure of the size of the raw contents of the files. A simple way of doing this is to cat all the individual atom files together. We can also see the effect of applying compression to the raw contents, and note once again the effect of lexical sorting on OWL FSS (the individual files use the new sorted FSS output, which sorts by entity type, then entity, then axiom).

size format/transforms
72M cat
1.9M cat + xz
0.69M cat + sort + xz

Size on disk (one file per atom, directory size)

Here we see the significance of disk block size when measuring files on disk: because each of the 56,240 files occupies a whole number of blocks, the disk space used by the compressed files is the same as that used by the uncompressed files.

size format/transforms
220M none
220M xz (individual files)

Files stored in single archive, no compression

Just storing the uncompressed files in a single archive cuts the disk space requirements in half or more. We can see that the zip has less overhead than tar.

size format/transforms
112M tar
82M zip -0 (store)

Files compressed, then stored in a single archive

When we store individually compressed files, the picture changes: once the disk block overhead is removed, the effect of compression becomes visible.
We also see the limitations of compressing each small file separately.

size format/transforms
59M xz (individual files) + tar
33M xz (individual files) + zip (store)
29M zip (deflate)

Files stored in single archive, then compressed

Now we reverse the order: instead of compressing, then archiving, we create the archive, then compress.

size format/transforms
5.2M zip -0 + gz
4.3M tar + gz
3.0M zip + xz
2.5M tar + xz

The difference is an order of magnitude (2.5M for tar + xz, versus 29M for the best compress-then-archive result).

Trade-off: random access to contents of compressed files.

Because each file in a zip archive is independently compressed, and because the zip format keeps a directory of contents at the end of the file, it is quite easy to access files in random order.

A gzip file is a single stream; random access to its contents is difficult.
xz allows a file to contain multiple streams, and a stream to contain multiple blocks. However, there is a trade-off between the size of the compressed blocks and the amount of data that must be read and decompressed in order to reach a particular section.
A block size of 32 MB would split our tar file into just four chunks (plus a separate block at the end to hold the index). The I/O savings would be trivial in this case, as the total compressed file size is just 2.5MB.
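To make the random-access point concrete, here is a minimal Java sketch (the archive and entry names are invented for the example) that extracts a single atom via the zip central directory, without decompressing any other entry:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipRandomAccess {
    public static void main(String[] args) throws IOException {
        // ZipFile reads the central directory stored at the end of the archive,
        // so locating one entry does not scan or decompress the other atoms.
        try (ZipFile archive = new ZipFile("go-atoms.zip")) {
            ZipEntry atom = archive.getEntry("atoms/atom-00042.ofn");
            try (InputStream in = archive.getInputStream(atom)) {
                in.transferTo(System.out); // inflates only this one entry
            }
        }
    }
}
```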

Possible alternative

It might be interesting to experiment with some of the text formats to see whether they can be adapted to handle multiple ontologies in a single file.

There's an obvious hack using rdf formats that support named graphs, whereby each ontology could be stored in a named graph in the same file. However, that involves RDF.

It might be possible to define an annotation that specifies which ontology/component an axiom belongs to (the partitioning could be done in a post-loading stage).
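As a hedged sketch of that idea (the fromOntology annotation property and the IRIs below are invented for illustration), tagging an axiom with its source component via the OWL API could look like this:

```java
import java.util.Collections;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class AxiomTagging {
    public static void main(String[] args) {
        OWLDataFactory df = OWLManager.getOWLDataFactory();
        // Hypothetical annotation property recording which component an axiom belongs to
        OWLAnnotationProperty fromOntology =
            df.getOWLAnnotationProperty(IRI.create("http://example.org/fromOntology"));
        OWLAnnotation tag = df.getOWLAnnotation(fromOntology,
            IRI.create("http://example.org/imported-component.owl"));
        OWLClass a = df.getOWLClass(IRI.create("http://example.org/A"));
        OWLClass b = df.getOWLClass(IRI.create("http://example.org/B"));
        // The annotation travels with the axiom, so a post-loading pass can
        // partition the ontology by inspecting each axiom's annotations.
        OWLAxiom tagged = df.getOWLSubClassOfAxiom(a, b, Collections.singleton(tag));
        System.out.println(tagged);
    }
}
```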

This is similar to the work that I'm doing for @cmungall, which involves keeping a cache of inferred axioms stored in the same file as the axioms from which they are derived. These axioms are marked with annotations (both the is_inferred annotation that comes from OBO, and more detailed provenance information: import versions, transforms, and reasoner information).
To check whether a cached inference is still valid, I need to split the relevant cached axioms from the asserted axioms, i.e. partition the ontology based on axiom annotations.

The advantage of using an annotation-style approach is that it is backwards compatible; the disadvantage is that it is harder to split out individual pieces. For imports of specific, versioned ontologies, things are easy. For imports of dynamic ontologies, one must check the cached imported axioms on load.
One must also be careful not to record this metadata directly on each axiom, so that otherwise unchanged cached imports do not pick up an annotation change; this interacts very badly with version control systems. On the other hand, github and bitbucket aren't too fond of very large files, so embedding imports introduces other problems.
DisjointClasses(:Swings :Roundabouts)

@cmungall (Member)

Thanks all, this sounds like it would be v useful.
@sesuncedu - what tool do you use to decompose GO into atoms?

@cmungall (Member)

Not sure if this is mixing completely opposing concerns, but might it be useful to make sure this is compatible with the github release mechanism?

@cmungall (Member) commented Apr 8, 2015

Willing to be a beta tester on this when ready.

Also, I was thinking of submitting some ontologies to ORE with @dosumis and @hdietze; we have some big files, and it would be useful if we could standardize on a way of packaging these.

@cmungall (Member)

I would like to propose a slight modification to the owl/zip proposal.

Currently an owl/zip archive must have a file root.owl. This is a little unnatural. In the obo library world, all ontology filenames match the end of the ontology URI (e.g. go.owl, cl.owl). Of course it would be possible for root.owl to consist solely of a single import statement, but this complicates things. E.g. where does the versionIRI tag go? Both places?

Instead I propose that there is a separate way to indicate what the root file to load should be. This could be as simple as a file called ROOT_ONTOLOGY that contains a single name with the local path to the root file. Or it could be a kind of manifest, as @matthewhorridge suggests. If we go the manifest route this may be the time to replace catalog xml with yaml.

With this in place, it will be very easy for ontology developers to use the github release mechanism to make owl/zip files. Not only that, but external services that archive github releases such as zenodo will also generate valid owl/zip files 'for free'.

Here is an example of an ontology release using the github mechanism:

https://github.com/obophenotype/cephalopod-ontology/releases/tag/release-2015-03-15

We'd be willing to make changes in conventions such as directory layout, but I'd rather do this sooner rather than later.

@cmungall (Member)

How does OWL/ZIP compare with http://www.rdfhdt.org/ when it comes to compression and speed of reconstitution?

@ignazio1977 (Contributor)

HDT seems quite interesting, but from a first skimming it looks like a plugin for it would need a JNI bridge; it would be tough to support in the core OWLAPI, but would make a useful separate plugin.

@sesuncedu (Contributor)

There's Java and C++ code. However, I don't think it is ideal for OWL, because RDF is not ideal for OWL.

cmungall added a commit to OBOFoundry/purl.obolibrary.org that referenced this issue Feb 21, 2017
…O build

Historically we have built chebi centrally due to this
ebi-chebi/ChEBI#3202

But this issue is now resolved.

Note I'm also introducing two new top-levels (maybe this should be on a new PR?). We don't have rules on new top-levels, but I assume $ONT.SUFFIX is allowed for reasonable suffixes within $ONT. CHEBI provides gzips, which is something I think we should encourage, although it's fiddly with imports. See: owlcs/owlapi#375
ignazio1977 added this to Will Do in OWLAPI 6 Project Mar 18, 2017
ignazio1977 moved this from Will Do to Doing in OWLAPI 6 Project May 6, 2017
@ignazio1977 (Contributor)

> I propose that there is a separate way to indicate what the root file to load should be. This could be as simple as a file called ROOT_ONTOLOGY that contains a single name with the local path to the root file. Or it could be a kind of manifest, as @matthewhorridge suggests. If we go the manifest route this may be the time to replace catalog xml with yaml.

I've taken a look at what YAML would allow. Seems like a good idea; I'm not entirely sure yet whether the power is worth the extra dependency (e.g., snakeyaml or another library).

What I'm thinking is as follows:

  • one file to act as index; could be in the zip root or it could be in a META-INF folder (good for a JAR option)
  • the file needs to contain the path to the root ontology
  • would be useful for the file to contain a list of ontology IRIs and their corresponding path in the zip file; this gets around issues with ontology IRIs that do not match valid file names (IRIs ending with # or / or urn IRIs, etc. etc.). This would allow for an IRI mapper to be created in a straightforward way.

A properties file could be sufficient for this purpose, with the necessary escapes in place.

In terms of creating an owl/zip file, this property file is pretty simple to put together. New OWLOntologyDocumentSource and OWLOntologyDocumentTarget would take care of doing this through the OWL API.
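As a concrete (but purely illustrative) sketch of such a mapper, assuming an index file named owlzip.properties whose keys are zip entry paths and whose values are ontology IRIs, with a roots key listing entry points:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.zip.ZipFile;

import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntologyIRIMapper;

/** Hypothetical mapper: maps ontology IRIs to documents inside a zip archive. */
public class ZipIndexIRIMapper implements OWLOntologyIRIMapper {
    private final Map<IRI, IRI> iriToDocument = new HashMap<>();

    public ZipIndexIRIMapper(File zip) throws IOException {
        try (ZipFile archive = new ZipFile(zip)) {
            Properties index = new Properties();
            try (InputStream in = archive.getInputStream(archive.getEntry("owlzip.properties"))) {
                index.load(in);
            }
            for (String path : index.stringPropertyNames()) {
                if ("roots".equals(path)) continue; // not a path-to-IRI mapping
                IRI ontologyIRI = IRI.create(index.getProperty(path));
                // jar: URLs address individual entries inside the archive
                IRI documentIRI = IRI.create("jar:" + zip.toURI() + "!/" + path);
                iriToDocument.put(ontologyIRI, documentIRI);
            }
        }
    }

    @Override
    public IRI getDocumentIRI(IRI ontologyIRI) {
        return iriToDocument.get(ontologyIRI); // null lets other mappers have a go
    }
}
```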

@ignazio1977 (Contributor)

Just thought of something else.

One of the space-wasting factors, when saving each ontology or ontology fragment to its own file, is the large number of files and the block size overhead. Hence one of the advantages of OWL/ZIP is removing that waste, even before compression enters the picture.

However, suppose I have two ontologies A and B, importing ontologies C, D, E, which in turn import F, G... Z.

If I keep these all in one folder and resolve with AutoIRIMapper, the space used and the bytes read from disk match the size of A...Z (plus wasted block space).

If I use OWL/ZIP for A and for B, each zip file will contain a compressed copy of C...Z.

If I move C...Z to a separate OWL/ZIP file and leave A and B in their own OWL/ZIP files, leaving imports resolution to sort things out, I end up with three files. If the network of dependencies is more complex and includes a few large ontologies, I might need to repeat this split, theoretically down to one of two worst cases: each OWL/ZIP file contains just one file (1), or, going the other way, each file contains a copy of the same files (n).

(1) has the disadvantage that local import resolution is lost, and we again have large numbers of files, with block space waste (plus any waste due to downloading large numbers of files rather than large files, when talking about remote access)
(n) has the disadvantage that disk space is wasted, and a change to one ontology requires n files to be updated.

Loading a few ontologies in the same manager should not differ substantially between the two approaches, as the ontology IRIs are stored in the property files; so one would know which file to load before having to open it.

I've not seen any mention of updates to imported ontologies above; I suppose we are considering this as a mechanism to group together ontologies that are not modified often, or where we intentionally want a snapshot rather than an evolving set of ontologies.

@ignazio1977 (Contributor)

Proposal for the above: allow multiple roots in the property file. An example would look like this:

roots=D.owl, someotherfolder/B.owl
D.owl=http://test.org/complexImports/D.owl
somefolder/A.owl=http://test.org/compleximports/A.owl
someotherfolder/B.owl=http://test.org/complexontologies/B.owl
someotherfolder/C.owl=http://test.org/compleximports/C.owl

I've not included version information in the entry descriptions; should it be included?

@ansell (Member) commented Nov 26, 2017

When I implemented something similar in the past I included version information:

https://github.com/podd/podd-ontologies/blob/master/src/main/resources/default-podd-schema-manifest.ttl

ignazio1977 added a commit that referenced this issue Nov 28, 2017
Allow the content of a zip file to be mapped to ontology IRIs
as an IRI mapper for use in an ontology manager.
Also extend AutoIRIMapper to do the same.

owlzip.properties and catalog*.xml supported to find ontology 
IRIs without parsing the actual files. Lacking these index files,
same logic as AutoIRIMapper applies.

Also implements Import ontology from jar #621 where 
AutoIRIMapper can now map the contents of a jar or zip file.
@matthewhorridge (Contributor, Author) commented Nov 28, 2017

> If we go the manifest route this may be the time to replace catalog xml with yaml.

and

> I've taken a look at what YAML would allow. Seems like a good idea

+1 for yaml

@ignazio1977 (Contributor)

Yams it is, I'll make soup.

ignazio1977 added a commit that referenced this issue Nov 29, 2017
Example YAML file:

roots:
 - D.owl
 - someotherfolder/B.owl
D:
 physical: D.owl
 logical: http://test.org/complexImports/D.owl
 version:
A:
 physical: somefolder/A.owl
 logical: http://test.org/compleximports/A.owl
 version:
B:
 physical: someotherfolder/B.owl
 logical: http://test.org/complexontologies/B.owl
 version:
C:
 physical: someotherfolder/C.owl
 logical: http://test.org/compleximports/C.owl
 version:
@ignazio1977 (Contributor)

Current state (a usage sketch follows the to-do list below):

  • Zip files can have an index file or no index file
  • If there is no index file then any file named root.owl in the root of the zip file will be considered the root
  • There is no requirement for any file to be in any specific folder in the zip file
  • owlzip.properties and owlzip.yaml allow more than one root to be specified
  • catalog*.xml is supported but there is no root concept there
  • All files, including the root, will be mapped with the standard IRI mapper mechanism
  • An index file means we know the logical IRIs without needing to parse the files
  • No index file means we use the same parsing as done in AutoIRIMapper to find out the ontology IRIs
  • Incidentally, AutoIRIMapper can now handle zip files

To do:

  • Index file does not need to be inside the zip file; e.g., if we have already released zipped folders, there is no need to edit them just to add an index file
  • AutoIRIMapper does not support all syntaxes; that would be good to improve upon
  • No utility class for writing an index file yet
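For anyone wanting to try this, a minimal usage sketch (paths and IRIs are placeholders, and the exact API may differ):

```java
import java.io.File;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.util.AutoIRIMapper;

public class LoadFromOwlZip {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        // Scan a folder recursively; per the commits above, zip/jar contents are
        // mapped too, using owlzip.properties or catalog*.xml when present.
        man.getIRIMappers().add(new AutoIRIMapper(new File("ontologies"), true));
        OWLOntology root = man.loadOntology(IRI.create("http://test.org/complexImports/D.owl"));
        System.out.println(root.getOntologyID());
    }
}
```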

@matthewhorridge (Contributor, Author) commented Nov 29, 2017

For mapping objects how about

iri: http://test.org/complexImports/D.owl
versionIri: http://test.org/complexImports/D-1.3.owl
path: D.owl

i.e. reuse the notion of iri and version iri.

Then, the roots object should refer to the ontology id, e.g.

roots:
  - iri: http://test.org/complexImports/D.owl
    versionIri:  ...
  - iri: http://test.org/complexImports/D.owl

ignazio1977 added a commit that referenced this issue Dec 4, 2017
OWLZipYaml can be used directly by the savvy developer in order
to access OWLOntologyID objects for each ontology. The IRI mapping
mechanism does not use ontology versions at present, so that
part of the code is unchanged.

D: &D {
   iri: 'http://test.org/complexImports/D.owl',
   versionIri:,
   path: D.owl
   }
A: {
   iri: 'http://test.org/compleximports/A.owl',
   versionIri:,
   path: somefolder/A.owl
   }
B: &B {
   iri: 'http://test.org/complexontologies/B.owl',
   versionIri:,
   path: someotherfolder/B.owl
   }
C: {
   iri: 'http://test.org/compleximports/C.owl',
   versionIri:,
   path: someotherfolder/C.owl
   }
roots:
 - <<: *D
 - <<: *B
@cmungall (Member) commented Dec 8, 2017

Thanks @ignazio1977 and @matthewhorridge.
cc @balhoff @kltm @dougli1sqrd: this will solve a lot of our issues with network timeouts when initializing servers from large importer ontologies.

@cmungall (Member) commented Dec 8, 2017

I wonder if some ideas from BagIt or BDBag could be incorporated, like the md5 checksum. Standards like these are likely to be adopted in the context of projects like the NIH Data Commons.

This maybe doesn't make much sense, and it's certainly not worth delaying the release and adoption of owlzip since the use cases are largely different, but it seems it would be possible to have an optional profile of owlzip that is also compatible with bdbag/bagit-RO.

For example, we could by convention incorporate a provenance json-ld file like this. We've been discussing the generation of prov files as part of robot.

BagIt also has a manifest file like this. I prefer the yaml above (and I was the one who originally requested yaml), but it's worth considering whether there is value in adopting this.

@stain may have some insights

@jamesaoverton

Thanks everyone who has contributed to this. It's going to be very useful. @rctauber and I are trying it out now (knocean/ontofetch#19), and would be happy to contribute if there's anything you need.

I was looking at the example YAML in this commit b583351 and wondering if it could be simpler. In the example below I've eliminated the YAML aliases/anchors, allowing for a number of other simplifications:

  • eliminate YAML aliases/anchors in favour of a root key
  • I think the ontologies can be a sequence rather than a map
  • make the top-level structure a map with an ontologies key
    • this "future proofs" the structure, allowing for extensions with other keys
    • "ontologies" might not be the best name
    • this suggestion is independent of the others: the top-level could just be the sequence of ontologies
  • remove curly braces, quotes, and commas to make the example look more like YAML and less like JSON
  • omit the versionIRI keys for now, as they aren't yet supported
ontologies:
- iri: http://test.org/complexImports/D.owl
  path: D.owl
  root: true
- iri: http://test.org/compleximports/A.owl
  path: somefolder/A.owl
- iri: http://test.org/complexontologies/B.owl
  path: someotherfolder/B.owl
  root: true
- iri: http://test.org/compleximports/C.owl
  path: someotherfolder/C.owl

@ignazio1977 (Contributor)

OWLZipEntry can handle the version without forcing users to do anything about it, so I'm leaving that in; if there is no version, that's no biggie. Enabling versions to be part of the imports closure resolution is another kettle of fish entirely: I don't believe the standard syntaxes support it at all. However, for applications that wish to use versions, there will be no obstacle.
