feature request:support lz4 compression #29
Comments
One of the main advantages of the vertical (columnar) model is better and faster compression than the traditional horizontal model. For example, PostgreSQL uses compression for TOAST data. But do not consider compression a magic tool that can reduce storage demands ten times for free.

First of all, compression ratio greatly depends on the data you are compressing. All compression algorithms are based on redundancy in the data. For truly random data, all compression algorithms (including Infobright's) will reach a ratio of about 1, not ten. For text documents the usual compression ratio is about 3.

Compression ratio also depends on the size of the compression window, i.e. the size of the compressed unit. In a DBMS the unit of compression is usually a page, and typical page sizes (4/8 KB) are small enough to limit the reachable compression ratio. So even if Infobright is able to reduce the size of some file ten times, the results will most likely be less exciting if you split this file into 4 KB pages and compress each page individually.

A high compression ratio can be reached only if the input data contains a lot of duplicate values. RLE is able to compress only consecutive duplicate values; more sophisticated compression algorithms can locate arbitrary duplicated sequences of bits. Some algorithms are optimized for compressing similar values by first calculating the difference between subsequent values and encoding that delta.

IMCS is first of all oriented towards financial data, i.e. stock quotes. If you look at such data, you will notice that even for stable symbols the price tends to vary a little, so it is unlikely that a price stays unchanged. Also, the ask, bid, open, close, high, and low prices are usually stored as floating-point numbers, and even very close floating-point values can differ in a large number of bits, so encoding them with delta-based algorithms is problematic. Compression has the most effect on character data (like the symbol name), but there are more "radical" approaches for that.
First of all, you can use the symbol name as the timeseries identifier, so you do not have to repeat it in every record. If that is not possible, or if the string can be very long, it is possible to place it in a separate table (a dictionary) and use an integer identifier instead of the string.

One more point: compression produces records of varying length. To allocate and free space for such records we need a sophisticated memory allocator, and PostgreSQL does not provide one — ShmemAlloc does not allow releasing memory at all! IMCS maintains its own list of fixed-size pages to make it possible to reclaim space; with varying-size pages this trick will not work.

Compression also adds significant CPU overhead. Look, for example, at LZ4: its compression speed is about 400 MB/sec, and modern SSDs provide an even higher write speed. This means that instead of trying to use compression to fit data in RAM, we could just write it to disk at the same speed! If you want, we can try to perform experiments with real data you have and estimate the possible effect of compressing it with LZ4, RLE, or any other compression algorithm...
We use IMCS as an ad hoc reporting system. One of our tables looks like this: `log_time:timestamp` (when the data is produced; selective). We get billions of rows like this appended to IMCS every day, and we would like the data of the latest 14 days, or even the latest 30 days, to stay in memory where it can be accessed quickly. LZ4 compression speed is 400 MB/s, but decompression speed is 1.8 GB/s; for a mostly write-once, read-many OLAP system, write speed is not as critical as read speed.

As I recall, some file systems — maybe ZFS or btrfs — can judge at run time whether data is compressible. For example, if you put a jpg/rmvb/zip file on a file system with the compression flag set, it can skip compressing those files, which are already compressed: it tries to compress the first 10% of a data block, and if the compression ratio is not good, it does not compress the whole block. This is sophisticated, and we could begin with a simple all-or-nothing approach.
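The "probe before you compress" heuristic described above can be sketched as follows. This is a hypothetical illustration, not code from any real filesystem: it uses a cheap byte-level RLE as a stand-in for a real compressor, and the 10% probe fraction and 1.25x threshold are assumptions chosen for the example:

```cpp
// Sketch: probe the first 10% of a block and only compress the whole
// block if the probe achieves a worthwhile ratio (here, at least 1.25x).
#include <cstddef>
#include <cstdint>
#include <vector>

// Cheap stand-in for a real compressor: returns the RLE-encoded size,
// with each run stored as a (byte, count) pair.
static size_t rle_size(const uint8_t* data, size_t n) {
    size_t runs = 0;
    for (size_t i = 0; i < n; ++i)
        if (i == 0 || data[i] != data[i - 1]) ++runs;
    return runs * 2;
}

// Decide whether a block is worth compressing by probing its first 10%.
bool worth_compressing(const std::vector<uint8_t>& block) {
    if (block.empty()) return false;
    size_t probe = block.size() / 10;
    if (probe == 0) probe = block.size();
    size_t compressed = rle_size(block.data(), probe);
    return compressed * 5 < probe * 4;  // require ratio > 1.25x
}
```

A block of repeated bytes passes the probe, while an alternating (incompressible-by-RLE) pattern fails it, so the expensive full-block compression is skipped.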
Certainly, for such a schema, compression of country, province, city, device, net, and os can be very useful. But IMHO it will be more efficient to replace them with integers, i.e. to have separate tables (dictionaries) with a string->id mapping. Certainly it may increase insertion time, but that can be addressed by keeping an in-memory copy of the dictionaries (std::map in C++, HashMap in Java, ...). In this case insert speed may even increase (smaller record size).

Certainly I can implement compression of these fields at the page level, but it will be significantly less efficient. Even if there are 1000 different devices, it can happen that among the ~100 device values on a single page there are 50 unique devices; in that case the compression ratio will be less than 2. With the dictionary approach the ratio is fixed: average size of the name / size of the identifier.

I expect that the distribution of this data is far from uniform, so among 1000 devices, 90% of the values will belong to the 10 most popular devices. In this case RLE compression can be efficient — but it only takes consecutive values into account, and if we have a sequence of values like {"iPhone 4.0", "iPhone 5.0", "iPhone 4.0", "iPhone 5.0", ...}, then RLE can do nothing. All alternative compression schemes significantly complicate lookup of an element by position: we first have to decompress the page in order to locate the element at the specified position. But I will think more about this feature.
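The dictionary approach described above can be sketched in a few lines. This is a hypothetical illustration of the idea (an in-memory std::map assigning each distinct string a small integer id), not the IMCS implementation; the class and method names are invented for the example:

```cpp
// Sketch: dictionary-encode a string column as 16-bit integer ids.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

class Dictionary {
    std::map<std::string, uint16_t> ids_;  // string -> id (insert path)
    std::vector<std::string> names_;       // id -> string (lookup path)
public:
    // Return the id for `name`, assigning a fresh one on first sight.
    uint16_t encode(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        uint16_t id = static_cast<uint16_t>(names_.size());
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }
    // Translate an id back to the original string.
    const std::string& decode(uint16_t id) const { return names_[id]; }
};
```

Encoding the sequence {"iPhone 4.0", "iPhone 5.0", "iPhone 4.0"} yields the ids {0, 1, 0}: the column then stores two bytes per value regardless of how the values are ordered, which is exactly the fixed ratio (average name size / identifier size) mentioned above.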
Now IMCS is able to load unlimited varying-size strings. They are converted into integer identifiers using a dictionary. The dictionary size should be larger than the total capacity of all such columns. If the dictionary size is less than 64 KB, then identifiers are stored as 16-bit integers, so instead of the original string value you are storing just two bytes. Below is an example of usage:

```sql
create table CrashLog(  -- column list did not survive in this copy of the thread

insert into CrashLog values (timestamp('12-Apr-2014 11:54'), 10000001, 'USA',
                             'New York', 'iPhone4', 'wifi', 'iOS6.x');

select cs_translate(group_by, 1) as country,
       cs_translate(group_by, 2) as device,
       agg_val as counter
  from (select (cs_project_agg(cs_hash_count(country||device))).* from CrashLog_get()) s;
```

A recoverable fragment of the sample timeseries output (truncated in the original):

```
timestamp:{2014-04-12 11:33:14.414614,2014-04-12 11:33:14.417347,2014-04-12 11:33:14.419531} | int8:{10000001,10000002,10000003} | varchar:{USA,Japan
```
I see RLE in IMCS; it is great, but I think the benefit is limited for real-world data compared with common compression algorithms such as LZ4, LZO, or Snappy.
Infobright can reach a 10:1 to 40:1 compression ratio; maybe IMCS can do this too.
I would like to see IMCS support LZ4 compression; it could greatly save memory and disk I/O.
Is it possible?
I would appreciate it if this feature could be added. If you are too busy to implement it, maybe I can contribute some code (I have 4 years of Java coding experience and learned C and C++ at university; my idea is to compress only the leaf pages, so the overhead is small)?
ref:
http://code.google.com/p/lz4/
https://www.infobright.com/index.php/products/compression/