
Introduce compression #11

Merged
merged 11 commits into master from ox/compress2 on May 31, 2019

Conversation

oxinabox
Member

This PR will close #7

Right now all it does is rename :serialize to :julia_native,
and make sure we are all set up to handle version changes.

We should probably add a sample file to the repo (it can be very small)
to test that we can load formats serialized by old versions.

@codecov

codecov bot commented May 24, 2019

Codecov Report

Merging #11 into master will decrease coverage by 7.31%.
The diff coverage is 86.04%.


@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
- Coverage     100%   92.68%   -7.32%     
==========================================
  Files           5        5              
  Lines          54       82      +28     
==========================================
+ Hits           54       76      +22     
- Misses          0        6       +6
Impacted Files Coverage Δ
src/JLSO.jl 100% <ø> (ø) ⬆️
src/file_io.jl 100% <100%> (ø) ⬆️
src/metadata.jl 100% <100%> (ø) ⬆️
src/JLSOFile.jl 93.75% <75%> (-6.25%) ⬇️
src/serialization.jl 83.87% <82.14%> (-16.13%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6cc609...4386519.

@iamed2
Member

iamed2 commented May 24, 2019

Why was it renamed? It's not clear that :julia_native means "uses serialize".

@rofinn
Member

rofinn commented May 24, 2019

I'd suggest :julia_serialize if we're going to rename it.

@oxinabox
Member Author

oxinabox commented May 24, 2019

The problem with :serialize is that it doesn't actually specify which serialization algorithm.
It is ambiguous with the general concept of serialization:
all of these formats are serializers.

So I figured that since we are changing the file version, this is a chance to correct it.
Other options could be :julia_base_serialize, :base_serialize, etc.
It needs to convey that it is the specific serialization format from the Julia Base library.

@oxinabox
Member Author

Anyway, as of now this has all the machinery to work with TranscodingStreams compressors.
Which do we want to support, and with which compression levels?
I also think we should make compressed the default,
because disk time (especially if pushing to S3) is much more expensive than the time taken for lightweight compression. (Also, later we may end up doing this in another thread. This kind of thing is ideal for fork-based parallelism, but sadly that is not an option.)
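The TranscodingStreams approach being discussed can be sketched like this; a minimal sketch assuming CodecZlib.jl (which supplies the gzip codecs on top of TranscodingStreams) — the `data` value is illustrative:

```julia
using Serialization   # Base serializer, i.e. the :julia_serialize format
using CodecZlib       # gzip codecs built on TranscodingStreams

data = rand(100)

# Wrap a plain buffer in a compressing stream and serialize through it.
buffer = IOBuffer()
stream = GzipCompressorStream(buffer)
serialize(stream, data)

# Closing the stream lets the codec flush its end-of-stream trailer;
# after that the raw gzip bytes sit in `buffer`.
close(stream)
compressed = buffer.data

# Round-trip: decompress and deserialize.
roundtripped = deserialize(GzipDecompressorStream(IOBuffer(compressed)))
@assert roundtripped == data
```

Note that the bytes must be read via `buffer.data` after the close, since `take!` is no longer usable once the wrapping stream has closed the buffer.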

@oxinabox
Member Author

oxinabox commented May 24, 2019

Right, here are some stats, running on an existing financial_data.jlso file.
Timing is for the second of 2 calls, and includes the time to write to disk on my laptop.
Which should be the default compression?
I am leaning towards :gzip as the default, which is still faster than :none, because the smaller file size means fewer slow writes to disk.

┌ Info: Original
└   size_kb = 626
┌ Info: Whole Compressed
│   compression = :gzip
│   time = 0.025815
└   size_kb = 45

┌ Info: 
│   format = :julia_serialize
│   compression = :none
│   time = 0.055351557
└   size_kb = 606
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip
│   time = 0.046644187
└   size_kb = 51
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip_fastest
│   time = 0.035092568
└   size_kb = 54
┌ Info: 
│   format = :julia_serialize
│   compression = :gzip_smallest
│   time = 0.073955492
└   size_kb = 50
┌ Info: 
│   format = :bson
│   compression = :none
│   time = 0.702540057
└   size_kb = 4840
┌ Info: 
│   format = :bson
│   compression = :gzip
│   time = 0.726010517
└   size_kb = 182
┌ Info: 
│   format = :bson
│   compression = :gzip_fastest
│   time = 0.667815146
└   size_kb = 218
┌ Info: 
│   format = :bson
│   compression = :gzip_smallest
│   time = 0.871860139
└   size_kb = 183
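A harness along these lines could have produced the numbers above. This is a hypothetical sketch: the `JLSO.save` keyword names (`format`, `compression`) are assumptions about the API under discussion, not a confirmed signature.

```julia
using JLSO

objects = JLSO.load("financial_data.jlso")

for format in (:julia_serialize, :bson),
        compression in (:none, :gzip, :gzip_fastest, :gzip_smallest)

    path = tempname()
    # Warm up, then time the second of two calls (as in the stats above).
    JLSO.save(path, objects; format=format, compression=compression)
    elapsed = @elapsed JLSO.save(path, objects; format=format, compression=compression)
    size_kb = filesize(path) ÷ 1024
    @info "" format compression time=elapsed size_kb
end
```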

@oxinabox oxinabox changed the title [WIP] Introduce compression Introduce compression May 25, 2019
jlso.objects[name] = take!(io)
# Need to close the stream so any compression codec can write its end-of-body data.
close(compressing_buffer)
jlso.objects[name] = buffer.data  # can't use take! as the stream is now closed
@oxinabox
Member Author
Default is now :gzip.
The JLSO file format is now v2.0; we can still load v1.0 but can no longer write it.
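A version gate of roughly this shape would support that read-old/write-new policy; a hypothetical sketch — `check_version` and the returned symbols are illustrative, not JLSO's actual internals:

```julia
# Readers accept v1 and v2; writers only ever emit WRITE_VERSION.
const WRITE_VERSION = v"2.0.0"

function check_version(version::VersionNumber)
    version.major == 2 && return :current
    version.major == 1 && return :legacy   # still loadable, never written
    error("Unsupported JLSO version $version")
end

@assert check_version(v"2.0.0") == :current
@assert check_version(v"1.0.0") == :legacy
```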

bson = (
    deserialize! = first ∘ values ∘ BSON.load,
    serialize! = (io, value) -> bson(io, Dict("object" => value)),
)
Member

Why "object"?

@oxinabox
Member Author
oxinabox commented May 28, 2019
Because it is an element of JLSOFile.objects,
and we only put it into a Dict here because that is how the BSON API likes it.
This Dict is never visible to the user, except when they access the BSON directly.

Member
That sounds fine, but what changed to make this only necessary now?

Member Author
I made bson and julia_serialize work with the same interface.
BSON used to do Dict(name => value),
which was redundant, since the name is already stored as the key in the parent of this.
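The wrap/unwrap pair being discussed can be sketched as follows (assuming, as the code in this PR does, that BSON.jl's `bson` and `BSON.load` accept an `IO`):

```julia
using BSON

value = [1, 2, 3]

# Writing: BSON's top-level API wants a Dict, so the single value is
# wrapped under a fixed key the user never sees.
io = IOBuffer()
bson(io, Dict("object" => value))

# Reading: unwrap by taking the only value back out, ignoring the key.
seekstart(io)
@assert (first ∘ values ∘ BSON.load)(io) == value
```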

Member
Ah, does bson only accept a Dict?

Member Author
Correct. There is probably some internal function we could call instead,
but that would also make it much harder to keep loading v1 JLSO files, so...

Co-Authored-By: Eric Davies <iamed2@gmail.com>
Member

@rofinn rofinn left a comment

These changes seem reasonable, but I do have a suggestion for code readability.

@rofinn
Member

rofinn commented May 28, 2019

Looks like the tests are failing because of the sample legacy files being loaded. I'd recommend just hard-coding the legacy metadata in the test Julia code rather than saving files that will be Julia-version- and architecture-dependent.

@oxinabox
Member Author

I have tests that directly check the legacy metadata, but those would not catch changes in how it is interpreted (e.g. changing :bson to not pre-encode objects as 1-element Dicts).

Not testing that we can still load files feels bad.
I would rather make a list of allowed_failures that depends on architecture and Julia version.

@rofinn
Member

rofinn commented May 29, 2019

You can still test all of those things without needing to save binary files in the git repo. Just manually serialize the old structure to an IOBuffer and try loading it. I feel like the only thing that the binary files might catch are changes to the backend serialization format (e.g., saved using an old version of the bson library).
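The suggestion is roughly the following; a hypothetical sketch — the field names inside `legacy` are illustrative, not JLSO's actual v1 on-disk schema:

```julia
using Serialization

# Hand-build something shaped like the old (v1) structure in memory...
legacy = Dict(
    "metadata" => Dict("version" => v"1.0.0", "format" => :serialize),
    "objects"  => Dict("x" => [1, 2, 3]),
)
io = IOBuffer()
serialize(io, legacy)

# ...then check it round-trips, with no binary file committed to the repo.
seekstart(io)
loaded = deserialize(io)
@assert loaded["metadata"]["version"] == v"1.0.0"
@assert loaded["objects"]["x"] == [1, 2, 3]
```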

@oxinabox
Member Author

oxinabox commented May 29, 2019

Generating the old structure would be super easy to screw up, though.
It would share a lot of the code with the package itself.

An end-to-end integration test gives me much more confidence that it is correct.

@oxinabox
Member Author

oxinabox commented May 29, 2019

Also, I would really like to know which things, when serialized on one platform, can't be loaded on another.
So I am strongly inclined to keep those real tests, so I know what breaks.
I am going to push a branch that turns some into warnings, and then we can reassess.

@oxinabox
Member Author

Ok, I am pretty sure this is actually a bug in the JLSO format.
#12

So for now I have allowed failures on x86.
(I could do this in the script rather than in the config, if preferred.)

But in any case, this convinces me that it is completely worth having these kinds of tests.

@oxinabox
Member Author

On testing BSON with data from 32-bit and 64-bit systems, it handles them fine.
So this is not #12.

@rofinn
Member

rofinn commented May 29, 2019

An end-to-end integration test gives me much more confidence that it is correct.

In that case, I'd recommend saving all the different permutations in a DataDep just for testing, rather than storing binary files in the repo.

@oxinabox
Member Author

In that case, I'd recommend saving all the different permutations in a datadeps just for testing rather than storing binary files in the repo.

I would agree if they were a bit larger, but they are pretty small really:
200 KB-ish.
(GitHub blocks files over 100 MB and warns above 50 MB.)
For simplicity's sake it might be better just to have them in the repo.

I am still chasing down what is breaking on 32bit.

@rofinn
Member

rofinn commented May 29, 2019

I would agree if they were a bit larger, but they are pretty small really.

I'm inclined to do it out of principle :) It's easy enough for someone to accidentally slip in a large binary file (especially if we want to have automated benchmarks), so using DataDeps would make the best practice explicit regardless of file size.

@oxinabox
Member Author

Ok, I am now satisfied that this is a weird BSON.jl error.

@oxinabox
Member Author

The BSON error now has a PR open to fix it.

@rofinn how keen are you on having that test data in a DataDep?
I'd rather not, because it is one more thing to do.
Also, if you insist, where should I store those?

@rofinn
Member

rofinn commented May 30, 2019

If you're willing to make an issue to add DataDeps then I think I'm fine to approve. We should probably host these test files in a public S3 bucket.

@oxinabox
Member Author

oxinabox commented May 30, 2019

Done #16

Member

@rofinn rofinn left a comment

This seems fine to me for now

@oxinabox oxinabox merged commit 9b42fca into master May 31, 2019
@ararslan ararslan deleted the ox/compress2 branch June 6, 2019 20:02

Successfully merging this pull request may close these issues.

Introduce compressed format: .JLSO.gz
5 participants