koron/omusubi

Omusubi - Nums compression library.

Omusubi is a compression library for Java. It compresses arrays of numbers. Currently only int and long arrays are supported.


Sample code

IntDZBP

Using IntDZBP (int delta-zigzag binary packing).

import net.kaoriya.omusubi.IntDZBP;

// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 0, 1, 2, ... });

// Decompress.
int[] decompressed = IntDZBP.fromBytes(compressed);
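
To show what the "delta-zigzag" part of the name refers to, here is a minimal sketch in plain Java. It assumes the usual meaning of those terms: store the difference between neighbouring values, then zigzag-map each difference so that small negative numbers become small non-negative codes that need few bits. The class DeltaZigZagSketch and its methods are hypothetical and not part of Omusubi, and the bit packing step is omitted, so this is not the library's byte format.

```java
import java.util.Arrays;

// Illustrative sketch only; not Omusubi's implementation.
class DeltaZigZagSketch {

    // Zigzag: 0, -1, 1, -2, 2, ... maps to 0, 1, 2, 3, 4, ...
    static int zigzagEncode(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Inverse of zigzagEncode.
    static int zigzagDecode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Delta against the previous value (0 for the first element), then zigzag.
    static int[] encode(int[] values) {
        int[] codes = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            codes[i] = zigzagEncode(values[i] - prev);
            prev = values[i];
        }
        return codes;
    }

    // Undo zigzag, then rebuild the values by a running sum of the deltas.
    static int[] decode(int[] codes) {
        int[] values = new int[codes.length];
        int prev = 0;
        for (int i = 0; i < codes.length; i++) {
            prev += zigzagDecode(codes[i]);
            values[i] = prev;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] input = {101, 55, 298, 300};
        int[] codes = encode(input);
        System.out.println(Arrays.toString(codes));         // [202, 91, 486, 4]
        System.out.println(Arrays.toString(decode(codes))); // [101, 55, 298, 300]
    }
}
```

The negative step from 101 to 55 (a delta of -46) becomes the small code 91 rather than a 32-bit two's complement value, which is what makes the later packing step worthwhile.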

LongDZBP

Using LongDZBP (long delta-zigzag binary packing).

import net.kaoriya.omusubi.LongDZBP;

// Compress.
byte[] compressed = LongDZBP.toBytes(new long[] { 0, 1, 2, ... });

// Decompress.
long[] decompressed = LongDZBP.fromBytes(compressed);

IntAscSDBP

Using IntAscSDBP (int ascending sorted delta binary packing).

import net.kaoriya.omusubi.IntAscSDBP;

// Compress. (The input array must be sorted in ascending order.)
byte[] compressed = IntAscSDBP.toBytes(new int[] { 0, 1, 2, ... });

// Decompress.
int[] decompressed = IntAscSDBP.fromBytes(compressed);
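
Because the input must be ascending, every gap between neighbouring values is non-negative, so closely spaced values need only a few bits each once the gaps are stored instead of the values. The sketch below illustrates that idea by computing the smallest bit width that holds every gap. SortedDeltaSketch and bitsForDeltas are hypothetical names, and this is an assumption about why sorted delta packing helps, not Omusubi's actual code.

```java
// Illustrative sketch only; not Omusubi's implementation.
class SortedDeltaSketch {

    // Smallest bit width that can represent every gap between neighbours.
    static int bitsForDeltas(int[] sorted) {
        long or = 0;
        for (int i = 1; i < sorted.length; i++) {
            // Use long arithmetic so a gap wider than Integer.MAX_VALUE cannot overflow.
            long delta = (long) sorted[i] - sorted[i - 1];
            if (delta < 0) {
                throw new IllegalArgumentException("input must be sorted in ascending order");
            }
            or |= delta;
        }
        // The highest set bit of the OR of all gaps decides the width.
        return 64 - Long.numberOfLeadingZeros(or);
    }

    public static void main(String[] args) {
        System.out.println(bitsForDeltas(new int[] {10, 20, 30, 40, 50}));        // 4
        System.out.println(bitsForDeltas(new int[] {1000000, 1000001, 1000003})); // 2
    }
}
```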

LongAscSDBP

Using LongAscSDBP (long ascending sorted delta binary packing).

import net.kaoriya.omusubi.LongAscSDBP;

// Compress. (The input array must be sorted in ascending order.)
byte[] compressed = LongAscSDBP.toBytes(new long[] { 0, 1, 2, ... });

// Decompress.
long[] decompressed = LongAscSDBP.fromBytes(compressed);

decodeLength, decodeFirstValue

import net.kaoriya.omusubi.IntDZBP;

// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 101, 55, 298, 300 });

// This will return 4.
int len = IntDZBP.decodeLength(compressed);

// This will return 101.
int firstValue = IntDZBP.decodeFirstValue(compressed);

All four classes (IntDZBP, LongDZBP, IntAscSDBP and LongAscSDBP) provide both decodeLength and decodeFirstValue.

Set operations

Both IntAscSDBP and LongAscSDBP provide methods for set operations. Those methods are union, intersect and difference.

union example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the union in compressed form.
byte[] r = IntAscSDBP.union(set1, set2, set3);

// Decompress, it should be [1, 2, 3, 4, 5].
int[] array = IntAscSDBP.fromBytes(r);

intersect example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the intersection in compressed form.
byte[] r = IntAscSDBP.intersect(set1, set2, set3);

// Decompress, it should be [3].
int[] array = IntAscSDBP.fromBytes(r);

difference example.

import net.kaoriya.omusubi.IntAscSDBP;

byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});

// Get the difference in compressed form.
byte[] r = IntAscSDBP.difference(set1, set2, set3);

// Decompress, it should be [2, 4].
int[] array = IntAscSDBP.fromBytes(r);
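
For comparison, the same three results can be reproduced on uncompressed arrays with java.util.TreeSet. The sketch below is a plain-Java reference for the semantics the examples above imply, in particular that difference removes every later argument from the first one. SetOpsReference is a hypothetical helper and not part of Omusubi.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Plain-Java reference for the expected results; not Omusubi code.
class SetOpsReference {

    private static TreeSet<Integer> toSet(int[] values) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int v : values) {
            set.add(v);
        }
        return set;
    }

    private static int[] toArray(TreeSet<Integer> set) {
        // TreeSet iterates in ascending order, so the result is sorted.
        return set.stream().mapToInt(Integer::intValue).toArray();
    }

    // Values that appear in any of the inputs.
    static int[] union(int[] first, int[]... rest) {
        TreeSet<Integer> acc = toSet(first);
        for (int[] r : rest) {
            acc.addAll(toSet(r));
        }
        return toArray(acc);
    }

    // Values that appear in every input.
    static int[] intersect(int[] first, int[]... rest) {
        TreeSet<Integer> acc = toSet(first);
        for (int[] r : rest) {
            acc.retainAll(toSet(r));
        }
        return toArray(acc);
    }

    // Values of the first input that appear in none of the others.
    static int[] difference(int[] first, int[]... rest) {
        TreeSet<Integer> acc = toSet(first);
        for (int[] r : rest) {
            acc.removeAll(toSet(r));
        }
        return toArray(acc);
    }

    public static void main(String[] args) {
        int[] set1 = {1, 2, 3, 4};
        int[] set2 = {3};
        int[] set3 = {1, 3, 5};
        System.out.println(Arrays.toString(union(set1, set2, set3)));      // [1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(intersect(set1, set2, set3)));  // [3]
        System.out.println(Arrays.toString(difference(set1, set2, set3))); // [2, 4]
    }
}
```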

Support ByteBuffer

All set operation methods also accept ByteBuffer instead of byte[].

For example, to calculate union() directly on files, you can write:

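import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import net.kaoriya.omusubi.IntAscSDBP;

// Map a whole file into memory as a read-only buffer.
private static MappedByteBuffer mapFile(File file) throws IOException {
    // The mapping stays valid after the stream and its channel are closed.
    try (FileInputStream s = new FileInputStream(file)) {
        FileChannel c = s.getChannel();
        return c.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
    }
}

// Union of two files that each contain IntAscSDBP-compressed data.
public static byte[] unionFiles(File a, File b) throws IOException {
    return IntAscSDBP.union(mapFile(a), mapFile(b));
}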

See ExampleByteBufferTest.java for the complete sample code.

Iterator

IntAscSDBP#toIterable and LongAscSDBP#toIterable create a java.lang.Iterable from a compressed byte[]. The example below shows how to use it.

byte[] b = IntAscSDBP.toBytes(new int[]{10, 20, 30, 40, 50});
for (int n : IntAscSDBP.toIterable(b)) {
    System.out.println(n);
}

This prints the following output.

10
20
30
40
50

An Iterable instance is reusable, so it can be iterated more than once.

byte[] b = LongAscSDBP.toBytes(new long[]{10, 20, 30, 40, 50});
Iterable<Long> iterable = LongAscSDBP.toIterable(b);

for (long n : iterable) {
    // do first iteration.
}

for (long n : iterable) {
    // do second iteration.
}

Jaccard Index

IntAscSDBP#jaccard and LongAscSDBP#jaccard calculate the Jaccard index directly from two compressed byte[] values. The example below shows how to get it.

byte[] b1 = IntAscSDBP.toBytes(new int[]{1, 3, 5});
byte[] b2 = IntAscSDBP.toBytes(new int[]{2, 3, 4});

double ji = IntAscSDBP.jaccard(b1, b2);

System.out.println(ji); // should be "0.2" (= 1/5)

See the Wikipedia article on the Jaccard index for details.
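
As a cross-check of the 1/5 figure, the Jaccard index (size of the intersection divided by size of the union) can be computed on the uncompressed arrays with a two-pointer merge. This sketch assumes both inputs are strictly ascending with no duplicates. JaccardSketch is a hypothetical name, and returning 0.0 for two empty sets is an arbitrary choice here, not documented library behaviour.

```java
// Illustrative sketch only; not Omusubi's implementation.
class JaccardSketch {

    // Jaccard index of two strictly ascending arrays: |A ∩ B| / |A ∪ B|.
    static double jaccard(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int common = 0;
        // Walk both arrays in step and count the values they share.
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        int union = a.length + b.length - common;
        // Two empty sets: defined as 0.0 in this sketch (an arbitrary choice).
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        // Intersection {3} has 1 element, union {1, 2, 3, 4, 5} has 5.
        System.out.println(jaccard(new int[] {1, 3, 5}, new int[] {2, 3, 4})); // 0.2
    }
}
```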


Formats

IntDZBP, IntAscSDBP

Header

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Length     |  First value  |                               |
+---+---+---+---+---+---+---+---+                               |
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  • Length: length of original array.
  • First value: First value of original array.
  • Chunks: compressed chunks.
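
The layout above suggests that the length and first value can be read from the first 8 bytes without touching the chunks. The sketch below reads both fields with java.nio.ByteBuffer. It assumes big-endian byte order (ByteBuffer's default), which this README does not state, and IntHeaderSketch is a hypothetical class, not the library's decodeLength or decodeFirstValue.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; byte order is an assumption (big-endian).
class IntHeaderSketch {

    // Bytes +0..+3: length of the original array.
    static int length(byte[] data) {
        return ByteBuffer.wrap(data).getInt(0);
    }

    // Bytes +4..+7: first value of the original array.
    static int firstValue(byte[] data) {
        return ByteBuffer.wrap(data).getInt(4);
    }

    public static void main(String[] args) {
        // Hand-built 8-byte header for an array of 4 values starting at 101.
        byte[] header = ByteBuffer.allocate(8).putInt(4).putInt(101).array();
        System.out.println(length(header));     // 4
        System.out.println(firstValue(header)); // 101
    }
}
```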

Chunk

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Header     |                                               |
+---+---+---+---+                                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  • Header: holds the length of each of the 4 blocks.
  • Block: contains compressed data; its size is a multiple of 4 bytes. (0-128 bytes)
    • A block holds 32 int values, so a chunk holds 128 (=32*4) int values.
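
The size limits above are consistent with packing all 32 values of a block at one common bit width b, which gives 32 * b / 8 = 4 * b bytes: always a multiple of 4, and between 0 and 128 bytes for b from 0 to 32. The sketch below works through that arithmetic. The single-bit-width assumption and the BlockSizeSketch name are illustrative and not confirmed by this README.

```java
// Illustrative arithmetic only; assumes one bit width per block.
class BlockSizeSketch {

    static final int VALUES_PER_BLOCK = 32;

    // Bytes needed to pack 32 values at the given bit width (0 to 32).
    static int blockBytes(int bitWidth) {
        if (bitWidth < 0 || bitWidth > 32) {
            throw new IllegalArgumentException("bit width must be in 0..32");
        }
        return VALUES_PER_BLOCK * bitWidth / 8;
    }

    public static void main(String[] args) {
        System.out.println(blockBytes(0));  // 0
        System.out.println(blockBytes(4));  // 16
        System.out.println(blockBytes(32)); // 128
    }
}
```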

LongDZBP, LongAscSDBP

Header

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Length             |          First value          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  • Length: length of original array.
  • First value: First value of original array.
  • Chunks: compressed chunks.

Chunk

  +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Header             |                               |
+---+---+---+---+---+---+---+---+                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  • Header: holds the length of each of the 4 blocks.
  • Block: contains compressed data; its size is a multiple of 8 bytes. (0-128 bytes)
    • A block holds 16 long values, so a chunk holds 64 (=16*4) long values.

Utility methods

  • IntDZBP
    • IntDZBP#toBytes
    • IntDZBP#fromBytes
  • LongDZBP
    • LongDZBP#toBytes
    • LongDZBP#fromBytes
  • IntBitPacking
    • IntBitPacking#toBytes
    • IntBitPacking#fromBytes
  • LongBitPacking
    • LongBitPacking#toBytes
    • LongBitPacking#fromBytes
  • IntJustCopy
    • IntJustCopy#toBytes
    • IntJustCopy#fromBytes
  • LongJustCopy
    • LongJustCopy#toBytes
    • LongJustCopy#fromBytes

License

This library is distributed under the Apache License 2.0. See LICENSE for details.