Omusubi is a compression library for Java. It compresses arrays of numbers. Currently only int and long arrays are supported.
Using IntDZBP (int delta-zigzag binary packing).
import net.kaoriya.omusubi.IntDZBP;
// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 0, 1, 2, ... });
// Decompress.
int[] decompressed = IntDZBP.fromBytes(compressed);
Using LongDZBP (long delta-zigzag binary packing).
import net.kaoriya.omusubi.LongDZBP;
// Compress.
byte[] compressed = LongDZBP.toBytes(new long[] { 0, 1, 2, ... });
// Decompress.
long[] decompressed = LongDZBP.fromBytes(compressed);
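The name DZBP combines three steps: delta encoding, zigzag encoding and binary packing. As a rough illustration of the first two, the sketch below applies the general delta-zigzag technique to an int array. It is not Omusubi's implementation: the class `DeltaZigzagSketch` and its methods are hypothetical names, and the binary packing step is left out.

```java
import java.util.Arrays;

// Hypothetical illustration of delta + zigzag encoding; not part of Omusubi.
class DeltaZigzagSketch {
    // Zigzag maps small signed values to small non-negative ones:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int zigzagEncode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    static int zigzagDecode(int v) {
        return (v >>> 1) ^ -(v & 1);
    }

    // Replace each value with the zigzag-encoded difference from its predecessor.
    static int[] encode(int[] src) {
        int[] out = new int[src.length];
        int prev = 0;
        for (int i = 0; i < src.length; i++) {
            out[i] = zigzagEncode(src[i] - prev);
            prev = src[i];
        }
        return out;
    }

    // Undo zigzag, then accumulate the differences to restore the values.
    static int[] decode(int[] enc) {
        int[] out = new int[enc.length];
        int prev = 0;
        for (int i = 0; i < enc.length; i++) {
            prev += zigzagDecode(enc[i]);
            out[i] = prev;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] src = {101, 55, 298, 300};
        int[] enc = encode(src);
        System.out.println(Arrays.toString(enc));         // [202, 91, 486, 4]
        System.out.println(Arrays.toString(decode(enc))); // [101, 55, 298, 300]
    }
}
```

Small encoded values need few bits each, which is what makes a later bit-packing step effective.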
Using IntAscSDBP (int ascending sorted delta binary packing).
import net.kaoriya.omusubi.IntAscSDBP;
// Compress. (input array must be sorted)
byte[] compressed = IntAscSDBP.toBytes(new int[] { 0, 1, 2, ... });
// Decompress.
int[] decompressed = IntAscSDBP.fromBytes(compressed);
Using LongAscSDBP (long ascending sorted delta binary packing).
import net.kaoriya.omusubi.LongAscSDBP;
// Compress. (input array must be sorted)
byte[] compressed = LongAscSDBP.toBytes(new long[] { 0, 1, 2, ... });
// Decompress.
long[] decompressed = LongAscSDBP.fromBytes(compressed);
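One plausible reason for the sorted-input requirement: with ascending input, every difference between neighbours is non-negative, so no zigzag step is needed and the deltas tend to be small. The sketch below illustrates only that general idea. It is an assumption that this is the motivation behind the "ascending sorted delta" name, and `SortedDeltaSketch` is a hypothetical class, not part of Omusubi.

```java
import java.util.Arrays;

// Hypothetical illustration of delta encoding on ascending input; not part of Omusubi.
class SortedDeltaSketch {
    // Differences between neighbours; the first value is assumed to be stored separately.
    static int[] deltas(int[] sorted) {
        if (sorted.length == 0) {
            return new int[0];
        }
        int[] out = new int[sorted.length - 1];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] < sorted[i - 1]) {
                throw new IllegalArgumentException("input must be sorted in ascending order");
            }
            out[i - 1] = sorted[i] - sorted[i - 1];
        }
        return out;
    }

    // Number of bits needed to represent every (non-negative) value in the array.
    static int bitsNeeded(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40, 50};
        int[] d = deltas(src);
        System.out.println(Arrays.toString(d)); // [10, 10, 10, 10]
        System.out.println(bitsNeeded(src));    // 6
        System.out.println(bitsNeeded(d));      // 4
    }
}
```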
import net.kaoriya.omusubi.IntDZBP;
// Compress.
byte[] compressed = IntDZBP.toBytes(new int[] { 101, 55, 298, 300 });
// This will return 4.
int len = IntDZBP.decodeLength(compressed);
// This will return 101.
int firstValue = IntDZBP.decodeFirstValue(compressed);
All four classes (IntDZBP, LongDZBP, IntAscSDBP and LongAscSDBP) provide both the decodeLength and decodeFirstValue methods.
Both IntAscSDBP and LongAscSDBP provide methods for set operations. Those methods are union, intersect and difference.
import net.kaoriya.omusubi.IntAscSDBP;
byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});
// Get the union in compressed form.
byte[] r = IntAscSDBP.union(set1, set2, set3);
// Decompress, it should be [1, 2, 3, 4, 5].
int[] array = IntAscSDBP.fromBytes(r);
import net.kaoriya.omusubi.IntAscSDBP;
byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});
// Get the intersection in compressed form.
byte[] r = IntAscSDBP.intersect(set1, set2, set3);
// Decompress, it should be [3].
int[] array = IntAscSDBP.fromBytes(r);
import net.kaoriya.omusubi.IntAscSDBP;
byte[] set1 = IntAscSDBP.toBytes(new int[] {1, 2, 3, 4});
byte[] set2 = IntAscSDBP.toBytes(new int[] {3});
byte[] set3 = IntAscSDBP.toBytes(new int[] {1, 3, 5});
// Get the difference in compressed form.
byte[] r = IntAscSDBP.difference(set1, set2, set3);
// Decompress, it should be [2, 4].
int[] array = IntAscSDBP.fromBytes(r);
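Sorted input is what makes these operations cheap: two ascending sequences can be combined in a single merge pass. The sketch below shows that merge idea on plain, uncompressed int arrays. It is not how Omusubi works internally (the library operates on compressed data), and `SortedSetOpsSketch` is a hypothetical class. Inputs are assumed to be ascending with no duplicates.

```java
import java.util.Arrays;

// Hypothetical merge-based set operations on sorted arrays; not part of Omusubi.
class SortedSetOpsSketch {
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                out[n++] = b[j++];
            } else { // equal: emit once, advance both
                out[n++] = a[i++];
                j++;
            }
        }
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return Arrays.copyOf(out, n);
    }

    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i++];
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    // Elements of a that do not appear in b.
    static int[] difference(int[] a, int[] b) {
        int[] out = new int[a.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length) {
            if (j >= b.length || a[i] < b[j]) {
                out[n++] = a[i++];
            } else if (a[i] > b[j]) {
                j++;
            } else {
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] s1 = {1, 2, 3, 4}, s2 = {3}, s3 = {1, 3, 5};
        System.out.println(Arrays.toString(union(union(s1, s2), s3)));           // [1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(intersect(intersect(s1, s2), s3)));   // [3]
        System.out.println(Arrays.toString(difference(difference(s1, s2), s3))); // [2, 4]
    }
}
```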
All set operation methods also accept ByteBuffer instead of byte[]. For example, to calculate union() on files directly, you can write:
private static MappedByteBuffer mapFile(File file) throws IOException {
    FileInputStream s = new FileInputStream(file);
    try {
        FileChannel c = s.getChannel();
        return c.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
    } finally {
        if (s != null) {
            s.close();
        }
    }
}

public static byte[] unionFiles(File a, File b) throws IOException {
    return IntAscSDBP.union(mapFile(a), mapFile(b));
}
See ExampleByteBufferTest.java for the complete sample code.
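On Java 7 or later, the same file mapping can be written with try-with-resources. The sketch below is a self-contained variant that uses only standard `java.nio` calls and does not touch Omusubi; the class name `MapFileSketch` and the helper `demoRemaining` are hypothetical. It relies on the documented behaviour of `FileChannel.map` that a mapping stays valid after the channel is closed, which is also why the original example can close the stream before returning the buffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper showing a read-only file mapping; not part of Omusubi.
class MapFileSketch {
    // Map the whole file read-only. The channel is closed on return,
    // but the returned buffer remains usable.
    static MappedByteBuffer mapFile(Path path) throws IOException {
        try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
            return c.map(FileChannel.MapMode.READ_ONLY, 0, c.size());
        }
    }

    // Write three bytes to a temporary file, map it, and report the mapped size.
    static int demoRemaining() {
        try {
            Path tmp = Files.createTempFile("mapfile-sketch", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[] {1, 2, 3});
            return mapFile(tmp).remaining();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoRemaining()); // 3
    }
}
```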
IntAscSDBP#toIterable and LongAscSDBP#toIterable create a java.lang.Iterable object from a compressed byte[]. The example below shows how to use the Iterable.
byte[] b = IntAscSDBP.toBytes(new int[]{10, 20, 30, 40, 50});
for (int n : IntAscSDBP.toIterable(b)) {
    System.out.println(n);
}
This prints the following output.
10
20
30
40
50
Of course, an instance of Iterable is reusable.
byte[] b = LongAscSDBP.toBytes(new long[]{10, 20, 30, 40, 50});
Iterable<Long> iterable = LongAscSDBP.toIterable(b);
for (long n : iterable) {
    // do first iteration.
}
for (long n : iterable) {
    // do second iteration.
}
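An Iterable is reusable when each call to `iterator()` starts a fresh decoding pass over the same underlying data. The sketch below shows that pattern on a first value plus an array of deltas, decoding lazily one value at a time. It is an illustration only: `DeltaIterableSketch` and its signature are hypothetical, and nothing here is taken from Omusubi's source.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical lazy decoder over delta-encoded values; not part of Omusubi.
class DeltaIterableSketch {
    static Iterable<Integer> toIterable(final int first, final int[] deltas) {
        // Each call to iterator() creates independent state, so the
        // Iterable can be traversed any number of times.
        return () -> new Iterator<Integer>() {
            private int pos = 0;         // number of values already returned
            private int current = first; // last value returned (or the first value)

            @Override
            public boolean hasNext() {
                return pos <= deltas.length;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                if (pos > 0) {
                    current += deltas[pos - 1];
                }
                pos++;
                return current;
            }
        };
    }

    public static void main(String[] args) {
        Iterable<Integer> it = toIterable(10, new int[] {10, 10, 10, 10});
        for (int n : it) {
            System.out.println(n); // 10, 20, 30, 40, 50 (one per line)
        }
        for (int n : it) {
            System.out.println(n); // the same five values again
        }
    }
}
```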
IntAscSDBP#jaccard and LongAscSDBP#jaccard calculate the Jaccard index directly from two compressed byte[]. The example below shows how to get the Jaccard index.
byte[] b1 = IntAscSDBP.toBytes(new int[]{1, 3, 5});
byte[] b2 = IntAscSDBP.toBytes(new int[]{2, 3, 4});
double ji = IntAscSDBP.jaccard(b1, b2);
System.out.println(ji); // should be "0.2" (= 1/5)
See Wikipedia: Jaccard index for details.
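The Jaccard index of two sets is the size of their intersection divided by the size of their union. For the example above that is |{3}| / |{1, 2, 3, 4, 5}| = 1/5. The sketch below computes it on plain sorted int arrays with a single merge pass. It is not the library's implementation; `JaccardSketch` is a hypothetical class, and returning 1.0 for two empty sets is a convention chosen here, not something the library documents.

```java
// Hypothetical Jaccard index on sorted, duplicate-free int arrays; not part of Omusubi.
class JaccardSketch {
    static double jaccard(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        // Count the elements present in both arrays.
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = a.length + b.length - common;
        // Assumption: two empty sets are treated as identical.
        return unionSize == 0 ? 1.0 : (double) common / unionSize;
    }

    public static void main(String[] args) {
        System.out.println(jaccard(new int[] {1, 3, 5}, new int[] {2, 3, 4})); // 0.2
    }
}
```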
 +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Length     |  First value  |                               |
+---+---+---+---+---+---+---+---+                               |
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
- Length: length of original array.
- First value: First value of original array.
- Chunks: compressed chunks.
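Read literally, the layout above puts a 4-byte length at offset 0 and a 4-byte first value at offset 4, which would explain how the length and first value can be read without decompressing anything. The sketch below builds and reads such a header with `java.nio.ByteBuffer`. The byte order is not stated in this document, so big-endian (the ByteBuffer default) is an assumption, and `HeaderSketch` with its methods is hypothetical rather than Omusubi's code.

```java
import java.nio.ByteBuffer;

// Hypothetical reader for an 8-byte (length, first value) header; not part of Omusubi.
class HeaderSketch {
    // Build an 8-byte header. Big-endian byte order is an assumption.
    static byte[] buildHeader(int length, int firstValue) {
        return ByteBuffer.allocate(8).putInt(length).putInt(firstValue).array();
    }

    // Bytes +0..+3: length of the original array.
    static int readLength(byte[] data) {
        return ByteBuffer.wrap(data).getInt(0);
    }

    // Bytes +4..+7: first value of the original array.
    static int readFirstValue(byte[] data) {
        return ByteBuffer.wrap(data).getInt(4);
    }

    public static void main(String[] args) {
        byte[] header = buildHeader(4, 101);
        System.out.println(readLength(header));     // 4
        System.out.println(readFirstValue(header)); // 101
    }
}
```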
 +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|    Header     |                                               |
+---+---+---+---+                                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
- Header: holds the length of each of the 4 blocks.
- Block: contains compressed data; its size is a multiple of 4 bytes (0-128 bytes).
- A block holds 32 int values, so a chunk holds 128 (=32*4) int values.
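The block sizes listed above are consistent with ordinary binary packing, where all 32 values in a block share one bit width w: the block then takes 32 * w / 8 = 4 * w bytes, which is always a multiple of 4 and ranges from 0 (w = 0) to 128 (w = 32). The sketch below just performs that arithmetic. That each block uses a single bit width is an assumption, and `ChunkMathSketch` is a hypothetical class.

```java
// Hypothetical size arithmetic for int blocks and chunks; not part of Omusubi.
class ChunkMathSketch {
    static final int VALUES_PER_BLOCK = 32;
    static final int BLOCKS_PER_CHUNK = 4;

    // Bytes used by one block when every value is stored in bitWidth bits.
    static int blockBytes(int bitWidth) {
        if (bitWidth < 0 || bitWidth > 32) {
            throw new IllegalArgumentException("bitWidth must be in 0..32");
        }
        return VALUES_PER_BLOCK * bitWidth / 8;
    }

    // Number of int values covered by one chunk.
    static int valuesPerChunk() {
        return VALUES_PER_BLOCK * BLOCKS_PER_CHUNK;
    }

    public static void main(String[] args) {
        System.out.println(blockBytes(0));    // 0
        System.out.println(blockBytes(5));    // 20
        System.out.println(blockBytes(32));   // 128
        System.out.println(valuesPerChunk()); // 128
    }
}
```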
 +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Length             |          First value          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|                           (Chunks)                            |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
- Length: length of original array.
- First value: First value of original array.
- Chunks: compressed chunks.
 +0  +1  +2  +3  +4  +5  +6  +7  +8  +9  +A  +B  +C  +D  +E  +F
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|            Header             |                               |
+---+---+---+---+---+---+---+---+                               |
|                          (Block * 4)                          |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
- Header: holds the length of each of the 4 blocks.
- Block: contains compressed data; its size is a multiple of 8 bytes (0-128 bytes).
- A block holds 16 long values, so a chunk holds 64 (=16*4) long values.
- IntDZBP
- IntDZBP#toBytes
- IntDZBP#fromBytes
- LongDZBP
- LongDZBP#toBytes
- LongDZBP#fromBytes
- IntBitPacking
- IntBitPacking#toBytes
- IntBitPacking#fromBytes
- LongBitPacking
- LongBitPacking#toBytes
- LongBitPacking#fromBytes
- IntJustCopy
- IntJustCopy#toBytes
- IntJustCopy#fromBytes
- LongJustCopy
- LongJustCopy#toBytes
- LongJustCopy#fromBytes
This library is distributed under Apache License 2.0. See LICENSE for details.