Omusubi - Nums compression library.

Omusubi is a compression library for Java. It compresses arrays of numbers. Currently only int and long arrays are supported.

Sample code

IntDZBP

Using IntDZBP (int delta-zigzag binary packing).

import net.kaoriya.omusubi.IntDZBP;

// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 0, 1, 2, ... });

// Decompress.
int[] decompressed = IntDZBP.fromBytes(compressed);
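
To illustrate what "delta-zigzag" refers to, here is a minimal sketch of the two transforms applied to a plain int array. It is not Omusubi's implementation: the class and method names are hypothetical, the binary packing step is omitted, and the first delta is taken against 0 rather than stored in a header.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Omusubi.
final class DeltaZigZagSketch {
    // Zigzag maps signed values to small non-negative ones: 0->0, -1->1, 1->2, -2->3, ...
    static int zigzagEncode(int n) {
        return (n << 1) ^ (n >> 31);
    }

    static int zigzagDecode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Replace each value with the zigzag-encoded difference from its predecessor.
    static int[] encode(int[] values) {
        int[] out = new int[values.length];
        int prev = 0; // assumption of this sketch: the first delta is taken against 0
        for (int i = 0; i < values.length; i++) {
            out[i] = zigzagEncode(values[i] - prev);
            prev = values[i];
        }
        return out;
    }

    // Undo the zigzag step, then accumulate the deltas back into values.
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        int prev = 0;
        for (int i = 0; i < encoded.length; i++) {
            prev += zigzagDecode(encoded[i]);
            out[i] = prev;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] encoded = encode(new int[] { 101, 55, 298, 300 });
        System.out.println(Arrays.toString(encoded));         // [202, 91, 486, 4]
        System.out.println(Arrays.toString(decode(encoded))); // [101, 55, 298, 300]
    }
}
```

Neighboring values that are close together produce small encoded numbers, which need fewer bits when packed.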

LongDZBP

Using LongDZBP (long delta-zigzag binary packing).

import net.kaoriya.omusubi.LongDZBP;

// Compress.
byte[] compressed = LongDZBP.toBytes(new long[] { 0, 1, 2, ... });

// Decompress.
long[] decompressed = LongDZBP.fromBytes(compressed);

IntAscSDBP

Using IntAscSDBP (int ascending sorted delta binary packing).

import net.kaoriya.omusubi.IntAscSDBP;

// Compress. (The input array must be sorted.)
byte[] compressed = IntAscSDBP.toBytes(new int[] { 0, 1, 2, ... });

// Decompress.
int[] decompressed = IntAscSDBP.fromBytes(compressed);
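
The requirement for sorted input follows from the "ascending sorted delta" idea: when the values never decrease, every difference between neighbors is non-negative, so no sign handling is needed. The sketch below shows only that delta step on a plain array. The class and method names are hypothetical and this is not the library's code.

```java
// Hypothetical illustration class, not part of Omusubi.
final class AscSortedDeltaSketch {
    // Convert ascending values into gaps between neighbors.
    // The first element is kept as is (its delta is taken against 0).
    static int[] toDeltas(int[] sorted) {
        int[] out = new int[sorted.length];
        int prev = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0 && sorted[i] < prev) {
                throw new IllegalArgumentException("input must be sorted in ascending order");
            }
            out[i] = sorted[i] - prev;
            prev = sorted[i];
        }
        return out;
    }

    // Accumulate the gaps to recover the original values.
    static int[] fromDeltas(int[] deltas) {
        int[] out = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            out[i] = prev;
        }
        return out;
    }
}
```

For example, `toDeltas(new int[] { 10, 20, 30, 45 })` yields `{ 10, 10, 10, 15 }`, and `fromDeltas` restores the original array.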

LongAscSDBP

Using LongAscSDBP (long ascending sorted delta binary packing).

import net.kaoriya.omusubi.LongAscSDBP;

// Compress. (The input array must be sorted.)
byte[] compressed = LongAscSDBP.toBytes(new long[] { 0, 1, 2, ... });

// Decompress.
long[] decompressed = LongAscSDBP.fromBytes(compressed);

decodeLength, decodeFirstValue

import net.kaoriya.omusubi.IntDZBP;

// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 101, 55, 298, 300 });

// This will return 4.
int len = IntDZBP.decodeLength(compressed);

// This will return 101.
int firstValue = IntDZBP.decodeFirstValue(compressed);

All four classes (IntDZBP, LongDZBP, IntAscSDBP, and LongAscSDBP) provide both decodeLength and decodeFirstValue.

Set operations

Both IntAscSDBP and LongAscSDBP provide methods for set operations: union, intersect, and difference.

union example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the union in compressed form.
byte[] r = IntAscSDBP.union(set1, set2, set3);

// Decompress, it should be [1, 2, 3, 4, 5].
int[] array = IntAscSDBP.fromBytes(r);

intersect example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the intersection in compressed form.
byte[] r = IntAscSDBP.intersect(set1, set2, set3);

// Decompress, it should be [3].
int[] array = IntAscSDBP.fromBytes(r);

difference example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the difference in compressed form.
byte[] r = IntAscSDBP.difference(set1, set2, set3);

// Decompress, it should be [2, 4].
int[] array = IntAscSDBP.fromBytes(r);
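
For readers who want to check the expected results above by hand, the following sketch implements the same three operations on plain, uncompressed arrays with a standard two-pointer merge. It assumes strictly ascending inputs without duplicates. The class is hypothetical and says nothing about how Omusubi performs these operations on compressed data.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Omusubi.
// All inputs are assumed to be strictly ascending (sorted, no duplicates).
final class SortedSetOpsSketch {
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) out[n++] = a[i++];
            else if (a[i] > b[j]) out[n++] = b[j++];
            else { out[n++] = a[i++]; j++; } // equal: emit once
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }

    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i++]; j++; } // present in both
        }
        return Arrays.copyOf(out, n);
    }

    // Elements of a that do not appear in b.
    static int[] difference(int[] a, int[] b) {
        int[] out = new int[a.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length) {
            if (j >= b.length || a[i] < b[j]) out[n++] = a[i++];
            else if (a[i] > b[j]) j++;
            else { i++; j++; } // present in b: skip
        }
        return Arrays.copyOf(out, n);
    }
}
```

Applying `difference` twice, first with {3} and then with {1, 3, 5}, reduces {1, 2, 3, 4} to {2, 4}, which matches the three-argument example above.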

Support ByteBuffer

All set operation methods also accept ByteBuffer in place of byte[].

For example, to compute the union of two files directly, you can write:

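import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import net.kaoriya.omusubi.IntAscSDBP;

// Map a whole file into memory as a read-only buffer.
private static MappedByteBuffer mapFile(File file) throws IOException {
    FileInputStream s = new FileInputStream(file);
    try {
        FileChannel c = s.getChannel();
        // The mapping stays valid after the stream and channel are closed.
        return c.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
    } finally {
        s.close();
    }
}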

public static byte[] unionFiles(File a, File b) throws IOException {
    return IntAscSDBP.union(mapFile(a), mapFile(b));
}

See ExampleByteBufferTest.java for a complete sample.

Iterator

IntAscSDBP#toIterable and LongAscSDBP#toIterable create a java.lang.Iterable from a compressed byte[]. The example below shows how to use it.

byte[] b = IntAscSDBP.toBytes(new int[]{10, 20, 30, 40, 50});
for (int n : IntAscSDBP.toIterable(b)) {
    System.out.println(n);
}

This prints each value on its own line:

10
20
30
40
50

An Iterable instance is reusable:

byte[] b = LongAscSDBP.toBytes(new long[]{10, 20, 30, 40, 50});
Iterable<Long> iterable = LongAscSDBP.toIterable(b);

for (long n : iterable) {
    // do first iteration.
}

for (long n : iterable) {
    // do second iteration.
}
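
Reuse works for any Iterable whose iterator() method returns a fresh Iterator with its own decoding state. The sketch below shows that pattern over an array of deltas. It is a hypothetical class written for illustration, not the object that toIterable returns.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration class, not part of Omusubi.
final class DeltaIterableSketch implements Iterable<Integer> {
    private final int[] deltas;

    DeltaIterableSketch(int[] deltas) {
        this.deltas = deltas.clone();
    }

    // Each call creates an independent iterator, so the Iterable can be looped over repeatedly.
    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int index = 0;
            private int current = 0; // running sum of deltas seen so far

            @Override
            public boolean hasNext() {
                return index < deltas.length;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                current += deltas[index++];
                return current;
            }
        };
    }
}
```

Values are decoded lazily, one per call to next(), so the full array never has to be materialized.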

Jaccard Index

IntAscSDBP#jaccard and LongAscSDBP#jaccard calculate the Jaccard index directly from two compressed byte[] values. The example below shows how to get the Jaccard index.

byte[] b1 = IntAscSDBP.toBytes(new int[]{1, 3, 5});
byte[] b2 = IntAscSDBP.toBytes(new int[]{2, 3, 4});

double ji = IntAscSDBP.jaccard(b1, b2);

System.out.println(ji); // should be "0.2" (= 1/5)

See the Wikipedia article on the Jaccard index for details.
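
The Jaccard index of two sets is the size of their intersection divided by the size of their union. As a cross-check of the 1/5 above, here is a minimal sketch that computes it on plain sorted arrays. The class is hypothetical, inputs are assumed to be strictly ascending, and returning 0.0 for two empty sets is a convention chosen for this sketch only.

```java
// Hypothetical illustration class, not part of Omusubi.
final class JaccardSketch {
    // Both arrays are assumed to be strictly ascending (sorted, no duplicates).
    static double jaccard(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { common++; i++; j++; }
        }
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = a.length + b.length - common;
        // Assumption of this sketch: two empty sets give 0.0.
        return unionSize == 0 ? 0.0 : (double) common / unionSize;
    }
}
```

For {1, 3, 5} and {2, 3, 4} the intersection is {3} and the union is {1, 2, 3, 4, 5}, so the result is 1/5.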

Formats

IntDZBP, IntAscSDBP

Header

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Length     |  First value  |                               |
+---+---+---+---+---+---+---+---+                               |
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Length: length of original array.
First value: First value of original array.
Chunks: compressed chunks.
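
Given this layout, the length and first value sit in the first 8 bytes, which suggests they can be read without decompressing any chunk. The sketch below reads the two fields with java.nio.ByteBuffer. The byte order is not stated in this document, so the sketch simply uses whatever order the buffer is set to (big-endian by default), and the class name is hypothetical. In real code, prefer the library's decodeLength and decodeFirstValue.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class, not part of Omusubi.
final class IntHeaderSketch {
    // Bytes +0..+3: length of the original array.
    static int length(ByteBuffer buf) {
        return buf.getInt(buf.position()); // absolute read, position is not moved
    }

    // Bytes +4..+7: first value of the original array.
    static int firstValue(ByteBuffer buf) {
        return buf.getInt(buf.position() + 4);
    }

    public static void main(String[] args) {
        // Build an 8-byte header by hand: length 4, first value 101.
        ByteBuffer header = ByteBuffer.allocate(8);
        header.putInt(4).putInt(101);
        header.flip();
        System.out.println(length(header));     // 4
        System.out.println(firstValue(header)); // 101
    }
}
```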

Chunk

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Header     |                                               |
+---+---+---+---+                                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Header: holds the length of each of the 4 blocks.
Block: contains compressed data; its size is a multiple of 4 bytes (0-128 bytes).
- A block holds 32 int values, so a chunk holds 128 (= 32 * 4) int values.
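
One reading that is consistent with these numbers, though not confirmed here, is that all 32 values of a block are packed with a single bit width b between 0 and 32, giving 32 * b / 8 = 4 * b bytes per block. The sketch below works through that arithmetic. The class and method names are hypothetical.

```java
// Hypothetical illustration class, not part of Omusubi.
final class BlockSizeSketch {
    // Smallest bit width that can hold every value in the block (0 when all values are 0).
    static int bitsNeeded(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v;
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    // Bytes used when valuesPerBlock values are packed at bitsPerValue bits each.
    static int packedBytes(int valuesPerBlock, int bitsPerValue) {
        return valuesPerBlock * bitsPerValue / 8;
    }

    public static void main(String[] args) {
        System.out.println(packedBytes(32, 0));  // 0
        System.out.println(packedBytes(32, 5));  // 20
        System.out.println(packedBytes(32, 32)); // 128
    }
}
```

Under this assumption a block of 32 values that all fit in 5 bits takes 20 bytes instead of 128.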

LongDZBP, LongAscSDBP

Header

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Length             |          First value          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Length: length of original array.
First value: First value of original array.
Chunks: compressed chunks.

Chunk

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Header             |                               |
+---+---+---+---+---+---+---+---+                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Header: holds the length of each of the 4 blocks.
Block: contains compressed data; its size is a multiple of 8 bytes (0-128 bytes).
- A block holds 16 long values, so a chunk holds 64 (= 16 * 4) long values.

Utility methods

IntDZBP
- IntDZBP#toBytes
- IntDZBP#fromBytes
LongDZBP
- LongDZBP#toBytes
- LongDZBP#fromBytes
IntBitPacking
- IntBitPacking#toBytes
- IntBitPacking#fromBytes
LongBitPacking
- LongBitPacking#toBytes
- LongBitPacking#fromBytes
IntJustCopy
- IntJustCopy#toBytes
- IntJustCopy#fromBytes
LongJustCopy
- LongJustCopy#toBytes
- LongJustCopy#fromBytes

License

This library is distributed under the Apache License 2.0. See LICENSE for details.
