
Add support for OWL/ZIP #375

Closed

matthewhorridge opened this issue Mar 13, 2015 · 24 comments

@matthewhorridge
Contributor

Add support for OWL/ZIP.

http://ceur-ws.org/Vol-1265/owled2014_submission_6.pdf

We might think about the conventions used, but the basic ideas seem sensible. This would require some twiddling to bring imports in from the right place (a temp override of the ontology IRI mapper).
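
(For illustration only: a minimal sketch of such a temporary override, assuming the archive-entry locations of the imports are already known. The class and map contents are hypothetical; OWLOntologyIRIMapper and its getDocumentIRI method are the real OWL API extension point.)

import java.util.Map;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntologyIRIMapper;

// Hypothetical mapper that redirects ontology IRIs to entries inside the
// archive while it is being loaded; remove it from the manager afterwards.
class ZipScopedIRIMapper implements OWLOntologyIRIMapper {
    private final Map<IRI, IRI> logicalToArchiveEntry;

    ZipScopedIRIMapper(Map<IRI, IRI> logicalToArchiveEntry) {
        this.logicalToArchiveEntry = logicalToArchiveEntry;
    }

    @Override
    public IRI getDocumentIRI(IRI ontologyIRI) {
        // Returning null lets the manager fall back to its other mappers.
        return logicalToArchiveEntry.get(ontologyIRI);
    }
}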

@ignazio1977
Contributor

I remember thinking the same, would be quite good to have. @matentzn you'll be available to consult on this, right? ;-)

@matthewhorridge
Contributor Author

Excellent. It would be good to standardise on something. The basic idea is to have a required ontology root.owl and then optionally add imports, encoding them using directory structure (rather than some kind of manifest). How this last part works isn't totally clear to me at the moment.

@sesuncedu
Contributor

You'll be surprised to know I have a certain amount of interest, and have been doing a certain amount of research, on this issue (compression, packing, atom-splitting, etc.) :-) I needed a lot of small atoms to illustrate a major point.
Annoyingly, I didn't have one prepared, so I generated atoms from a recent GO (w/imports).

Note: all sizes below are those reported by du, which takes disk block size into account.

The decomposition finds 56,240 atoms.

Baseline (size on disk, contents catted to a single file)

Before we look at the effects of archiving and compressing, we should get a measure of the size of the raw contents of the files. A simple way of doing this is to cat all the individual atom files together. We can also see the effect of applying compression to the raw contents, and note once again the effect of lexical sorting on OWL FSS (the individual files are using the new sorted FSS output, which sorts by entity type, then entity, then axiom.)

size    format/transforms
72M     cat
1.9M    cat + xz
0.69M   cat + sort + xz

Size on disk (one file per atom, directory size)

Here we see the significance of the size of disk blocks when measuring the size of files on disk. The disk space used by the compressed files is the same as that used by the uncompressed files.

size    format/transforms
220M    none
220M    xz (individual files)

Files stored in single archive, no compression

Just storing the uncompressed files in a single archive cuts the disk space requirements in half or more. We can see that the zip has less overhead than tar.

size    format/transforms
112M    tar
82M     zip -0 (store)

Files compressed, then stored in a single archive

When we store individually compressed files, we see a difference. Once the disk block overhead is removed, we see the effect of compression.
We also see the limitations of compressing individual files.

size    format/transforms
59M     xz (individual files) + tar
33M     xz (individual files) + zip (store)
29M     zip (deflate)

Files stored in single archive, then compressed

Now we reverse the order - instead of compressing, then archiving, we create the archive, then compress.

size    format/transforms
5.2M    zip -0 + gz
4.3M    tar + gz
3.0M    zip + xz
2.5M    tar + xz

The difference is an order of magnitude.

Trade-off: random access to contents of compressed files.

Because each file in a zip archive is independently compressed, and because the zip format keeps a directory of contents at the end of the file, it is quite easy to access files in random order.
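
(To make the random-access point concrete, a small sketch using the standard java.util.zip.ZipFile API; the archive and entry names here are made up.)

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipRandomAccess {
    public static void main(String[] args) throws IOException {
        // ZipFile reads the central directory at the end of the archive,
        // then seeks straight to the requested entry; no other entry is
        // read or decompressed.
        try (ZipFile zip = new ZipFile("atoms.zip")) {
            ZipEntry entry = zip.getEntry("atoms/atom-00042.ofn");
            if (entry != null) {
                try (InputStream in = zip.getInputStream(entry)) {
                    System.out.println(entry.getName() + ": " + entry.getSize() + " bytes");
                }
            }
        }
    }
}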

gzip files are one big compressed stream; it is difficult to access their contents at random.
xz allows for files to contain multiple streams of data, and for streams to contain multiple blocks. However, there is a trade-off between the size of the compressed blocks and the amount of data that must be read and decompressed in order to read a particular section.
A block size of 32 MB would split our tar file into just four chunks (plus a separate block at the end to hold the index). The I/O savings would be trivial in this case, as the total compressed file size is just 2.5MB.

Possible alternative

It might be interesting to experiment with some of the text formats to see whether they can be adapted to handle multiple ontologies in a single form.

There's an obvious hack using rdf formats that support named graphs, whereby each ontology could be stored in a named graph in the same file. However, that involves RDF.

It might be possible to define an annotation that specifies which ontology/component an axiom belongs to (the partitioning could be done in a post-loading stage).

This is similar to the work that I'm doing for @cmungall, which involves keeping a cache of inferred axioms stored in the same file as the axioms from which they are derived. These axioms are marked with annotations (both the is_inferred annotation that comes from OBO, and more detailed provenance information: import versions, transforms, and reasoner information).
To check whether a cached inference is still valid, I need to split the relevant cached axioms from the asserted axioms - i.e. partitioning an ontology based on axiom annotations.

The advantage of using an annotation-style approach is that it is backwards compatible; the disadvantage is that it is harder to split out individual pieces. For imports of specific, versioned ontologies, things are easy. For imports of dynamic ontologies, one must check the cached imported axioms on load.
One must also be careful not to record this metadata directly onto each axiom, so that otherwise unchanged cached imports do not see an annotation change - this interacts very badly with version control systems. On the other hand, GitHub and Bitbucket aren't too fond of very large files, so embedding imports introduces other problems.
DisjointClasses(:Swings :Roundabouts)

@cmungall
Member

Thanks all, this sounds like it would be v useful.
@sesuncedu - what tool do you use to decompose GO into atoms?

@cmungall
Member

Not sure if this is mixing completely opposing concerns, but might it be useful to make sure this is compatible with the GitHub release mechanism?

@cmungall
Member

cmungall commented Apr 8, 2015

Willing to be a beta tester on this when ready.

Also, I was thinking of submitting some ontologies to ORE with @dosumis and @hdietze - we have some big files, and it would be useful if we could standardize on a way of packaging these.

@cmungall
Member

I would like to propose a slight modification to the owl/zip proposal.

Currently an owl/zip archive must have a file root.owl. This is a little unnatural. In the obo library world, all ontology filenames match the end of the ontology URI (e.g. go.owl, cl.owl). Of course it would be possible for root.owl to consist solely of a single import statement, but this complicates things. E.g. where does the versionIRI tag go? Both places?

Instead I propose that there is a separate way to indicate what the root file to load should be. This could be as simple as a file called ROOT_ONTOLOGY that contains a single name with the local path to the root file. Or it could be a kind of manifest, as @matthewhorridge suggests. If we go the manifest route this may be the time to replace catalog xml with yaml.

With this in place, it will be very easy for ontology developers to use the github release mechanism to make owl/zip files. Not only that, but external services that archive github releases such as zenodo will also generate valid owl/zip files 'for free'.

Here is an example of an ontology release using the github mechanism:

https://github.com/obophenotype/cephalopod-ontology/releases/tag/release-2015-03-15

We'd be willing to make changes in conventions such as directory layout, but I'd rather do this sooner rather than later.

@cmungall
Member

How does OWL/ZIP compare with http://www.rdfhdt.org/ when it comes to compression and speed to reconstitution?

@ignazio1977
Contributor

HDT seems quite interesting but from a first skimming it looks like a plugin for it would need a JNI bridge; would be tough to support in the core OWLAPI, but would make a useful separate plugin.

@sesuncedu
Contributor

There's Java and C++ code. However, I don't think it is ideal for OWL, because RDF is not ideal for OWL.

cmungall added a commit to OBOFoundry/purl.obolibrary.org that referenced this issue Feb 21, 2017
…O build

Historically we have built chebi centrally due to this
ebi-chebi/ChEBI#3202

But this issue is now resolved.

Note I'm also introducing two new top-levels (maybe this should be on a new PR?). We don't have rules on new top-levels, but I assume $ONT.SUFFIX is allowed for reasonable suffixes within $ONT. CHEBI provides gzips, which is something I think we should encourage -- although it's fiddly with imports. See: owlcs/owlapi#375
@ignazio1977
Contributor

I propose that there is a separate way to indicate what the root file to load should be. This could be as simple as a file called ROOT_ONTOLOGY that contains a single name with the local path to the root file. Or it could be a kind of manifest, as @matthewhorridge suggests. If we go the manifest route this may be the time to replace catalog xml with yaml.

I've taken a look at what YAML would allow. Seems like a good idea, I'm not entirely sure yet whether the power is worth the extra dependency (e.g., snakeyaml or another library).

What I'm thinking is as follows:

  • one file to act as index; could be in the zip root or it could be in a META-INF folder (good for a JAR option)
  • the file needs to contain the path to the root ontology
  • would be useful for the file to contain a list of ontology IRIs and their corresponding path in the zip file; this gets around issues with ontology IRIs that do not match valid file names (IRIs ending with # or / or urn IRIs, etc. etc.). This would allow for an IRI mapper to be created in a straightforward way.

A properties file could be sufficient for this purpose, with the necessary escapes in place.

In terms of creating an owl/zip file, this property file is pretty simple to put together. New OWLOntologyDocumentSource and OWLOntologyDocumentTarget would take care of doing this through the OWL API.
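
(As a sketch of what such a document target might do internally, a hypothetical writer built on nothing but java.util.zip and java.util.Properties; the class and method are invented, and the owlzip.properties name matches the index file adopted later in this thread.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class OwlZipWriterSketch {
    // entries maps archive paths to ontology IRIs; sources maps archive
    // paths to the local files holding each ontology document.
    static void write(Path zipPath, String rootEntry,
                      Map<String, String> entries, Map<String, Path> sources) throws IOException {
        Properties index = new Properties();
        index.setProperty("roots", rootEntry);
        entries.forEach(index::setProperty);
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            // First entry is the index, so readers can find it cheaply.
            out.putNextEntry(new ZipEntry("owlzip.properties"));
            index.store(out, "owl/zip index");
            out.closeEntry();
            for (Map.Entry<String, Path> e : sources.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                Files.copy(e.getValue(), out);
                out.closeEntry();
            }
        }
    }
}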

@ignazio1977
Contributor

Just thought of something else.

One of the space wasting factors, when saving each ontology/ontology fragment to their own file, is the large number of files and the block size overhead. Hence one of the advantages of OWL/ZIP is removing that waste, even before compression enters the picture.

However, suppose I have two ontologies A and B, importing ontologies C,D, E, which in turn import F, G... Z.

If I keep these all in one folder and resolve with AutoIRIMapper, the space used and the bytes read from disk match the size of A...Z (plus wasted block space).

If I use OWL/ZIP for A and for B, each zip file will contain a compressed copy of C...Z.

If I move C...Z to a separate OWL/ZIP file and leave A and B in their own OWL/ZIP file, leaving imports resolution to sort things out, I end up with three files; if the network of dependencies is more complex and includes a few large ontologies, I might need to repeat this split of the zipped files, theoretically to its worst-case scenario, where each OWL/ZIP file contains just one file (1) - or, going the other way, each file contains a copy of the same files (n).

(1) has the disadvantage that local import resolution is lost, and we again have large numbers of files, with block space waste (plus any waste due to downloading large numbers of files rather than large files, when talking about remote access)
(n) has the disadvantage that disk space is wasted, and a change to one ontology requires n files to be updated.

Loading a few ontologies in the same manager should not differ substantially between the two approaches, as the ontology IRIs are stored in the property files; so one would know which file to load before having to open it.

I've not seen any mention of updates to imported ontologies above - I suppose we are considering this as a mechanism to group together ontologies that are not modified often, or where we intentionally want a snapshot and not an evolving set of ontologies.

@ignazio1977
Contributor

Proposal for the above: allow multiple roots in the property file. An example would look like this:

roots=D.owl, someotherfolder/B.owl
D.owl=http://test.org/complexImports/D.owl
somefolder/A.owl=http://test.org/compleximports/A.owl
someotherfolder/B.owl=http://test.org/complexontologies/B.owl
someotherfolder/C.owl=http://test.org/compleximports/C.owl

I've not included version information in the entry description; should this be included?
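
(A sketch of how such a property file could be turned into logical-to-physical IRI mappings, using standard jar: URLs so entries can be read in place; the class and method names are hypothetical, IRI is the OWL API type, and error handling is elided.)

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.zip.ZipFile;
import org.semanticweb.owlapi.model.IRI;

class OwlZipIndexReaderSketch {
    // Inverts the index into logical IRI -> document IRI.
    static Map<IRI, IRI> readIndex(String zipPath) throws IOException {
        Properties index = new Properties();
        try (ZipFile zip = new ZipFile(zipPath);
             InputStream in = zip.getInputStream(zip.getEntry("owlzip.properties"))) {
            index.load(in);
        }
        Map<IRI, IRI> mapping = new HashMap<>();
        for (String entryPath : index.stringPropertyNames()) {
            if ("roots".equals(entryPath)) {
                continue; // roots lists entry paths, not an IRI mapping
            }
            mapping.put(IRI.create(index.getProperty(entryPath)),
                        IRI.create("jar:file:" + zipPath + "!/" + entryPath));
        }
        return mapping;
    }
}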

@ansell
Member

ansell commented Nov 26, 2017

When I implemented something similar in the past I included version information:

https://github.com/podd/podd-ontologies/blob/master/src/main/resources/default-podd-schema-manifest.ttl

ignazio1977 added a commit that referenced this issue Nov 28, 2017
Allow the content of a zip file to be mapped to ontology IRIs
as an IRI mapper for use in an ontology manager.
Also extend AutoIRIMapper to do the same.

owlzip.properties and catalog*.xml supported to find ontology 
IRIs without parsing the actual files. Lacking these index files,
same logic as AutoIRIMapper applies.

Also implements Import ontology from jar #621 where 
AutoIRIMapper can now map the contents of a jar or zip file.
@matthewhorridge
Contributor Author

matthewhorridge commented Nov 28, 2017

If we go the manifest route this may be the time to replace catalog xml with yaml.

and

I've taken a look at what YAML would allow. Seems like a good idea

+1 for yaml

@ignazio1977
Contributor

Yams it is, I'll make soup.

ignazio1977 added a commit that referenced this issue Nov 29, 2017
Example YAML file:

roots:
 - D.owl
 - someotherfolder/B.owl
D:
 physical: D.owl
 logical: http://test.org/complexImports/D.owl
 version:
A:
 physical: somefolder/A.owl
 logical: http://test.org/compleximports/A.owl
 version:
B:
 physical: someotherfolder/B.owl
 logical: http://test.org/complexontologies/B.owl
 version:
C:
 physical: someotherfolder/C.owl
 logical: http://test.org/compleximports/C.owl
 version:
@ignazio1977
Contributor

Current state:

  • Zip files can have an index file or no index file
  • If there is no index file then any file named root.owl in the root of the zip file will be considered the root
  • There is no requirement for any file to be in any specific folder in the zip file
  • owlzip.properties and owlzip.yaml allow more than one root to be specified
  • catalog*.xml is supported but there is no root concept there
  • All files, including the root, will be mapped with the standard IRI mapper mechanism
  • Index file means we know the logical IRI without needing to attempt parsing
  • No index file means we use the same parsing as done in AutoIRIMapper to find out the ontology IRIs
  • Incidentally, AutoIRIMapper can now handle zip files

To do:

  • Index file does not need to be inside the zip file; e.g., if we have already released zipped folders, there is no need to edit them just to add an index file
  • AutoIRIMapper does not support all syntaxes; that would be good to improve upon
  • No utility class for writing an index file yet
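
(To make the current state concrete, loading through the extended AutoIRIMapper might look roughly like this; the folder name and ontology IRI are placeholders taken from the examples above.)

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.util.AutoIRIMapper;

public class LoadFromZipFolder {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // With the changes above, scanning this folder also indexes the
        // contents of any zip/jar archives found in it, so the import
        // below can resolve to an entry inside an archive.
        manager.getIRIMappers().add(new AutoIRIMapper(new File("ontologies"), true));
        OWLOntology o = manager.loadOntology(IRI.create("http://test.org/complexImports/D.owl"));
        System.out.println(o.getOntologyID());
    }
}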

@matthewhorridge
Contributor Author

matthewhorridge commented Nov 29, 2017

For mapping objects how about

iri: http://test.org/complexImports/D.owl
versionIri: http://test.org/complexImports/D-1.3.owl
path: D.owl

i.e. reuse the notion of iri and version iri.

Then, the roots object should refer to the ontology id, e.g.

roots:
  - iri: http://test.org/complexImports/D.owl
    versionIri:  ...
  - iri: http://test.org/complexImports/D.owl

ignazio1977 added a commit that referenced this issue Dec 4, 2017
OWLZipYaml can be used directly by the savvy developer in order
to access OWLOntologyID objects for each ontology. The IRI mapping
mechanism does not use ontology versions at present, so that
part of the code is unchanged.

D: &D {
   iri: 'http://test.org/complexImports/D.owl',
   versionIri:,
   path: D.owl
   }
A: {
   iri: 'http://test.org/compleximports/A.owl',
   versionIri:,
   path: somefolder/A.owl
   }
B: &B {
   iri: 'http://test.org/complexontologies/B.owl',
   versionIri:,
   path: someotherfolder/B.owl
   }
C: {
   iri: 'http://test.org/compleximports/C.owl',
   versionIri:,
   path: someotherfolder/C.owl
   }
roots:
 - <<: *D
 - <<: *B
@cmungall
Member

cmungall commented Dec 8, 2017

Thanks @ignazio1977 and @matthewhorridge
cc @balhoff @kltm @dougli1sqrd this will solve a lot of our issues with network timeouts when initializing servers from large importer ontologies

@cmungall
Member

cmungall commented Dec 8, 2017

I wonder if some ideas from BagIt or BDBag could be incorporated, like the md5 checksum. Standards like these are likely to be adopted in the context of projects like the NIH Data Commons.

This maybe doesn't make much sense and is certainly not worth delaying the release and adoption of owlzip for, as the use cases are largely different, but it seems that it would be possible to have a compatible optional profile of owlzip that is also compatible with bdbag/bagit-RO.

For example, we could have by convention the incorporation of a provenance json-ld file like this. We've been discussing the generation of prov files as part of robot.

BagIt also has a manifest file like this. I prefer the yaml above (and I was the one that originally requested yaml), but it's worth considering if there is value in adopting this.

@stain may have some insights

@jamesaoverton

Thanks everyone who has contributed to this. It's going to be very useful. @rctauber and I are trying it out now (knocean/ontofetch#19), and would be happy to contribute if there's anything you need.

I was looking at the example YAML in this commit b583351 and wondering if it could be simpler. In the example below I've eliminated the YAML aliases/anchors, allowing for a number of other simplifications:

  • eliminate YAML aliases/anchors in favour of a root key
  • I think the ontologies can be a sequence rather than a map
  • make the top-level structure a map with an ontologies key
    • this "future proofs" the structure, allowing for extensions with other keys
    • "ontologies" might not be the best name
    • this suggestion is independent of the others: the top-level could just be the sequence of ontologies
  • remove curly braces, quotes, and commas to make the example look more like YAML and less like JSON
  • omit the versionIRI keys for now, as they aren't yet supported
ontologies:
- iri: http://test.org/complexImports/D.owl
  path: D.owl
  root: true
- iri: http://test.org/compleximports/A.owl
  path: somefolder/A.owl
- iri: http://test.org/complexontologies/B.owl
  path: someotherfolder/B.owl
  root: true
- iri: http://test.org/compleximports/C.owl
  path: someotherfolder/C.owl
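
(This simplified layout is also easy to consume with SnakeYAML, mentioned earlier in the thread; a minimal sketch, assuming the index above is stored as owlzip.yaml.)

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;

public class SimpleIndexReader {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("owlzip.yaml")) {
            Map<String, Object> doc = (Map<String, Object>) new Yaml().load(in);
            List<Map<String, Object>> ontologies = (List<Map<String, Object>>) doc.get("ontologies");
            for (Map<String, Object> ont : ontologies) {
                // An absent "root" key reads as null, so only an explicit
                // root: true marks a root ontology.
                boolean isRoot = Boolean.TRUE.equals(ont.get("root"));
                System.out.println((isRoot ? "[root] " : "       ")
                    + ont.get("iri") + " -> " + ont.get("path"));
            }
        }
    }
}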

@ignazio1977
Contributor

OWLZipEntry can handle the version without forcing the users to do anything about it, so I'm leaving that in - if there is no version, that's no biggie. Enabling versions to be part of the imports closure resolution is a whole other kettle of fish - I don't believe the standard syntaxes support it at all. However, for applications that wish to use versions, there will be no obstacle.
