
Partial/incremental decompression of clusters #411

Closed
veloman-yunkan wants to merge 13 commits into master from streaming_decompression

Conversation

@veloman-yunkan (Collaborator) commented Aug 28, 2020

Fixes #78
Fixes #394
Enables #395 for compressed clusters

  1. Added the IDataStream interface for sequential reading of data and its 3 implementations (a minimal sketch of the interface is shown below the list):

    • BufDataStream: reading data from a buffer
    • DecodedDataStream: for on-the-fly decompression of data coming from another data stream
    • ReaderDataStreamWrapper: a wrapper around zim::Reader presenting it as IDataStream
  2. Converted Cluster into an abstract base class with two implementations:

    • NonCompressedCluster
    • CompressedCluster
  3. Implemented partial/incremental decompression of compressed clusters using DecodedDataStream.

It is better to review this PR commit-by-commit.
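
For orientation, here is a minimal sketch of the IDataStream interface, reconstructed from the snippets quoted later in this conversation; the exact signatures in the PR may differ.

#include <cstddef>
#include <memory>

// Sketch only: a sequential data source that can hand out ref-counted blobs.
class IDataStream
{
public: // types
  class Blob
  {
  public:
    Blob(const std::shared_ptr<const char>& data, size_t size)
      : data_(data), size_(size) {}
    const char* data() const { return data_.get(); }
    size_t size() const { return size_; }
  private:
    std::shared_ptr<const char> data_;
    size_t size_;
  };

public: // functions
  virtual ~IDataStream() = default;

  // Reads exactly nbytes bytes of data, advancing the stream.
  void read(void* buf, size_t nbytes) { readImpl(buf, nbytes); }

  // Reads the next size bytes into a freshly allocated Blob.
  Blob readBlob(size_t size) { return readBlobImpl(size); }

private: // virtual methods
  virtual void readImpl(void* buf, size_t nbytes) = 0;

  // Default implementation: allocate a buffer and fill it via readImpl().
  virtual Blob readBlobImpl(size_t size)
  {
    std::shared_ptr<char> buf(new char[size], std::default_delete<char[]>());
    readImpl(buf.get(), size);
    return Blob(buf, size);
  }
};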


Benchmarking data:

These changes are most beneficial to kiwix-serve. The benchmark was performed using the benchmark_kiwix_serve script as follows:

./benchmark_kiwix_serve master 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim
./benchmark_kiwix_serve partial_decompression 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim

That script starts a kiwix server and fetches about 1000 articles from it (more precisely, every 509th article, which amounts to 1002 articles) using `wget --recursive -l 1`.

| ZIM file | size (MB) | article count | cluster count | fetched item count | wget runtime (master) | wget runtime (this PR) |
|---|---|---|---|---|---|---|
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 5677 | 59s | 37s |

(It was verified that the output directories created by wget from both runs are identical, with the exception of /random)

zimcheck -A exercises a flow that should not benefit from this optimization. In fact, there is a slight (about 3%) slowdown:

| ZIM file | size (MB) | article count | cluster count | zimcheck -A runtime (master) | zimcheck -A runtime (this PR) |
|---|---|---|---|---|---|
| wikipedia_en_climate_change_nopic_2020-01.zim | 31 | 7646 | 51 | 5.663s | 5.694s |
| wikipedia_hy_all_mini_2020-07.zim | 560 | 509325 | 1518 | 6m6.325s | 6m17.350s |
| wikipedia_hy_all_mini_2020-08.zim | 563 | 509611 | 1526 | 6m7.287s | 6m19.535s |

codecov bot commented Aug 28, 2020

Codecov Report

Merging #411 into master will increase coverage by 1.20%.
The diff coverage is 96.23%.

@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   47.17%   48.37%   +1.20%     
==========================================
  Files          66       71       +5     
  Lines        3305     3390      +85     
  Branches     1422     1442      +20     
==========================================
+ Hits         1559     1640      +81     
- Misses       1746     1750       +4     
| Impacted Files | Coverage Δ |
|---|---|
| src/cluster.cpp | 92.92% <94.59%> (+0.32%) ⬆️ |
| src/decodeddatastream.h | 96.00% <96.00%> (ø) |
| src/bufdatastream.h | 100.00% <100.00%> (ø) |
| src/cluster.h | 100.00% <100.00%> (ø) |
| src/idatastream.cpp | 100.00% <100.00%> (ø) |
| src/idatastream.h | 100.00% <100.00%> (ø) |
| src/readerdatastreamwrapper.h | 100.00% <100.00%> (ø) |
| src/buffer.cpp | 83.33% <0.00%> (-8.34%) ⬇️ |

... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79e6500...1ac51bd.

@veloman-yunkan force-pushed the streaming_decompression branch 3 times, most recently from b850bfd to 445424d, on August 30, 2020 at 14:02.
@mgautierfr (Collaborator)

That is a big change (I've just quickly read it, not made a review). But as said in #395 (comment), it would have been nice to have some measurements to know where we are going, instead of making such a big change for potentially nothing (or worse).

Can you explain the global strategy for this PR?

@veloman-yunkan (Collaborator, Author)

This PR is only about partial/incremental decompression (it doesn't include caching at the article/item level, though it is, of course, an advance in that direction). As a result of this change, clusters are decompressed lazily. When the i'th blob is first accessed, the cluster is decompressed up to and including blob #i. If blob #(i+j) is later accessed (before that cluster is evicted from the cache), decompression continues from the state at which it was suspended. In a random-access scenario with a limited cluster cache size (and assuming that the articles/items are not grouped into clusters in a smart way), this approach should result, on average, in about a 2x speed-up of cluster decompression (at least in terms of the latency of blob access), as well as a 2x reduction in memory usage.
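
To illustrate the access pattern, here is a hypothetical usage sketch (Cluster::getBlob() and blob_index_t are the names used in the snippets quoted below; the example itself is not code from the PR):

// Illustrative only: out-of-order blob access on a cached compressed cluster.
void example(const Cluster& cluster)
{
  Blob b2 = cluster.getBlob(blob_index_t(2)); // decompresses blobs 0..2
  Blob b5 = cluster.getBlob(blob_index_t(5)); // resumes; decompresses only blobs 3..5
  Blob b1 = cluster.getBlob(blob_index_t(1)); // already decompressed; served from storage
}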

Some minor sources of slowdown are the following:

  1. Cluster is now a polymorphic object (non-compressed and compressed clusters being two sub-classes)
  2. Access to blobs is guarded with a mutex
  3. Some other implementation-specific shortcuts that can be eliminated.

I am confident that the potential performance improvement will greatly exceed the slowdown; that's why I didn't start with measurements.

@mgautierfr (Collaborator)

When the i'th blob is first accessed, the cluster is decompressed up to and including blob #i. If blob #(i+j) is later accessed (before that cluster is evicted from the cache), decompression continues from the state at which it was suspended

How do you handle the need for more memory as you decompress more data, while keeping existing pointers valid?

@veloman-yunkan (Collaborator, Author)

@mgautierfr

Blobs of known size are read from a data stream that is decompressed on the fly:

libzim/src/cluster.cpp

Lines 278 to 283 in a22773c

void
CompressedCluster::ensureBlobIsDecompressed(blob_index_t n) const
{
  for ( size_t i = blobs_.size(); i <= n.v; ++i )
    blobs_.push_back(ds_->readBlob(getBlobSize(blob_index_t(i)).v));
}

DecodedDataStream, which is in charge of the on-the-fly decompression, easily decompresses the requested number of bytes (look at the function readImpl() first)

void readNextChunk()
{
  const size_t n = std::min(size_t(CHUNK_SIZE), inputBytesLeft_);
  encodedDataChunk_ = encodedDataStream_->readBlob(n);
  inputBytesLeft_ -= n;
  // XXX: ugly C-style cast (casting away constness) on the next line
  decoderState_.next_in = (unsigned char*)encodedDataChunk_.data();
  decoderState_.avail_in = encodedDataChunk_.size();
}

CompStatus decodeMoreBytes()
{
  CompStep step = CompStep::STEP;
  if ( decoderState_.avail_in == 0 )
  {
    if ( inputBytesLeft_ == 0 )
      step = CompStep::FINISH;
    else
      readNextChunk();
  }
  return Decoder::stream_run_decode(&decoderState_, step);
}

void readImpl(void* buf, size_t nbytes) override
{
  decoderState_.next_out = (unsigned char*)buf;
  decoderState_.avail_out = nbytes;
  while ( decoderState_.avail_out != 0 )
  {
    decodeMoreBytes();
  }
}

into a buffer that is created for it by its superclass:

IDataStream::Blob
IDataStream::readBlobImpl(size_t size)
{
  std::shared_ptr<char> buf(new char[size], std::default_delete<char[]>());
  readImpl(buf.get(), size);
  return Blob(buf, size);
}

@veloman-yunkan changed the base branch from master to incremental_decompression_groundwork on August 31, 2020 at 10:08.
@veloman-yunkan force-pushed the streaming_decompression branch 2 times, most recently from e351c3f to c610baa, on September 2, 2020 at 10:23.
Base automatically changed from incremental_decompression_groundwork to master on September 2, 2020 at 13:30.
Commits pushed to the branch (commit messages):

  • CompressedCluster got its own collection of blobs. The latter is filled by simply breaking down the fully uncompressed data with the help of ReaderDataStreamWrapper. However, the IDataStream object used at this point is still a ReaderDataStreamWrapper reading from the fully uncompressed data.
  • Cluster is now an abstract base class for NonCompressedCluster and CompressedCluster.
  • From now on, compressed clusters are decompressed using the streaming decompression (DecodedDataStream) API. However, a cluster is fully decompressed (i.e. all blobs are extracted and stored inside the cluster).
@veloman-yunkan changed the title from "[WIP] Partial/incremental decompression of clusters" to "Partial/incremental decompression of clusters" on Sep 3, 2020.
@veloman-yunkan marked this pull request as ready for review on September 3, 2020 at 10:07.
@mgautierfr (Collaborator) left a comment

This is an interesting PR. There are a few things I like and some others I don't :)

I mainly disagree with the new code architecture this change introduces.

Let me first explain the global architecture of the classes of libzim.

  • Buffer represents a memory region. There are several kinds of buffers, as the memory region can be managed differently (not managed, simply deleted at destruction, or mmap-backed).
  • Reader is an interface to read data. It can read data directly from a (potentially split) file (FileReader) or from a memory zone (BufferReader).
  • Other classes (Dirent, FileImpl and Cluster) use a reader to get data. They are about interpreting the data they read (the logic), not about how the data is stored/read (the implementation). [1]
  • FileImpl is the smart part that creates the appropriate Reader/Buffer depending on the context.
  • Blob is the end of the chain. It is the (size-aware) data returned to the user. It represents a memory zone (like a Buffer). Technically it could be replaced by a Buffer, but Buffer is internal to libzim while Blob is public (and Blob was here long before Buffer, and I didn't want to break the API when I introduced Buffer).

In our case, this architecture allows:

  • Splitting the reading of the data (Reader/Buffer) from its interpretation (Cluster/Dirent), using an interface (Reader) between the two. [1]
  • FileImpl to create the "best" reader to read some cluster data (if not compressed: a FileReader; if compressed: a BufferReader on a decompressed memory region).
  • Cluster not having to care about how data is stored; it only cares about reading the data and interpreting it.

What's missing is a way to have lazy decompression. This is mostly because the decompression algorithm is not associated with a reader; it is just a way to create a buffer. So FileImpl has to decompress the whole cluster data to pass a Reader to Cluster.

What you do in this PR complicates the global architecture a lot:

  • You split the Cluster implementation in two and use different algorithms and storage depending on how the data is read. You break the separation between two different problems.
  • You create new kinds of intermediate classes (IDataStream, IDataStream::Blob).
  • You add a somewhat complex relation between zim::Blob, IDataStream::Blob and Buffer.

I don't think we need a dataStream at all. We "simply" need to create a DecompressReader that lazily decompresses the data just when we need (read) it.
The cluster would have to change a bit. It would have to store the buffers of the already-read blobs and make sure that all buffers below idx have been read when getting blob idx.

Here is pseudo-code for the DecompressReader and the changes to be made to Cluster.

class DecompressReader : public Reader
{
  void read(char* dest, offset_t offset, zsize_t size) const {
    // The following code should not be used, as we are smart on the cluster side.
    // But we need it for this to be fully functional as a reader.
    if (last_read_offset > offset) {
       last_read_offset = 0;
    }
    // decompress from last_read_offset to offset (and drop the decompressed memory);
    decompress(offset-last_read_offset, nullptr);

    // Here is the useful decompression
    decompress(size, dest);
    last_read_offset = offset+size;
  }

  void decompress(zsize_t size, char* dest) {
    // reuse the code of DecodedDataStream
  }

  std::shared_ptr<const Buffer> get_buffer(offset_t offset, zsize_t size) const {
    auto ret_buffer = std::make_shared<MemoryBuffer>(size);
    read(ret_buffer->buf(), offset, size);
    return ret_buffer;
  }

  Buffer/*or Reader*/ compresseddata;
  compression_stream  stream;
  offset_t     last_read_offset;
};


class Cluster //updated code
{
  Blob getBlob(blob_index_t n) const
  {
    auto nb_read_buffers = already_read_buffers.size();
    for (auto i=nb_read_buffers; i<=n; i++) {
      // If the cluster is not compressed, reader is a `BufferReader` and the created buffer is a sub-buffer without memory copy.
      // If the cluster is compressed, reader is a `DecompressReader` and the created buffer is a "just in time" allocated and decompressed buffer.
      auto buffer = reader->get_buffer(startOffset+offsets[blob_index_type(i)], getBlobSize(i));
      already_read_buffers.push_back(buffer);
    }
    return Blob(already_read_buffers[n]);
  }
  std::vector<std::shared_ptr<const Buffer>> already_read_buffers;
};

[1] Yes, Dirent uses a Buffer and not a Reader. But this is because I haven't done things correctly. We should change this to make Dirent use a Reader, potentially creating a buffered reader (not to be confused with BufferReader) to avoid too many syscalls when reading each dirent's fields. Directly using a Buffer is more the wrong implementation of a lazy developer (me) than a design choice.

Comment on lines 49 to 63
class Blob
{
private: // types
  typedef std::shared_ptr<const char> DataPtr;

public: // functions
  Blob(const DataPtr& data, size_t size) : data_(data), size_(size) {}

  const char* data() const { return data_.get(); }
  size_t size() const { return size_; }

private: // data
  DataPtr data_;
  size_t size_;
};
Collaborator:

Why not use zim::Blob?

Collaborator Author:

My intention is to promote IDataStream::Blob to zim::Blob, so eventually we will use zim::Blob here 😉

This version of the blob implementation has several advantages over the current version of zim::Blob:

  1. It is a little more lightweight,
  2. and it can still replace, after minor enhancements, both zim::Blob and zim::Buffer.

Comment on lines +45 to +46
const char* data_;
size_t size_;
Collaborator:

Why not use a zim::Buffer associated with an offset?

Collaborator Author:

I have wicked plans of getting rid of zim::Buffer

Collaborator:

Please stop the wicked plans :) Let's talk together first.

Other review threads (resolved): test/readerdatastreamwrapper.cpp (outdated), src/readerdatastreamwrapper.h (outdated), src/cluster.h (outdated), src/decodeddatastream.h, test/decodeddatastream.cpp.
namespace
{

class IDSBlobBuffer : public Buffer
Collaborator:

A buffer is a memory zone. It somehow makes no sense to back it by a data stream.

Collaborator Author:

I guess there is some misunderstanding here. This is a temporary helper class used to wrap an IDataStream::Blob in a zim::Blob.

This is probably the ugliest part of the new code; it will be eliminated when IDataStream::Blob takes over zim::Blob.


Blob idsBlob2zimBlob(const IDataStream::Blob& blob, size_t offset, size_t size)
{
  return Blob(std::make_shared<IDSBlobBuffer>(blob, offset, size));
}
Collaborator:

When the user asks for a blob, I think it is time to stop being lazy and load the data.

Collaborator Author:

The data is already loaded at this point (and blob points to it).

namespace zim
{

class BufDataStream : public IDataStream
Collaborator:

This class never seems to be used at all (except in tests).

Collaborator Author:

You are right. I thought it was used in an intermediate state but it only enables unit-testing DecodedDataStream.

@veloman-yunkan (Collaborator, Author) commented Sep 3, 2020

Let me first explain the global architecture of the classes of libzim.

  • Buffer represents a memory region. There are several kinds of buffers, as the memory region can be managed differently (not managed, simply deleted at destruction, or mmap-backed).

The fact that zim::Buffer always represents a real memory region (rather than emulating a memory region while possibly serving data from a remote resource or otherwise producing it on demand) undermines the role of all the polymorphic Buffer classes. The same can be achieved with a class very similar (if not identical) to IDataStream::Blob and a family of factory functions:

Buffer makeMemoryBuffer(size_t size)
{
  return Buffer(std::shared_ptr<const char>(new char[size], std::default_delete<char[]>()), size);
}

Buffer makeMemoryViewBuffer(const char* start, size_t size)
{
  return Buffer(std::shared_ptr<const char>(start, NoDelete()), size);
}

Buffer makeMmapBuffer(const char* start, size_t size)
{
  // pseudocode
  Mmapping mmapping(start, size);
  return Buffer(std::shared_ptr<const char>(start, mmapping.getUnmmapper()), size);
}
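
(For completeness: the NoDelete deleter used above would be a no-op functor along the following lines. This is a hypothetical sketch, not code from the PR.)

// No-op deleter for shared_ptr-wrapped memory that is owned elsewhere.
struct NoDelete
{
  void operator()(const char*) const { /* intentionally do nothing */ }
};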

That's why I think that IDataStream::Blob is a step in the right direction, allowing us to get rid of Buffer and its sub-classes (after we replace zim::Blob with such an implementation).

@veloman-yunkan (Collaborator, Author) commented Sep 3, 2020

What you do in this PR complicates the global architecture a lot:

  • You split the Cluster implementation in two and use different algorithms and storage depending on how the data is read. You break the separation between two different problems.

  • You create new kinds of intermediate classes (IDataStream, IDataStream::Blob).

  • You add a somewhat complex relation between zim::Blob, IDataStream::Blob and Buffer.

You are right that the new code adds some complexity. A far-reaching intent is to arrive at a simpler architecture as outlined in the previous comment (as well as noted in #402 (comment)).

@veloman-yunkan (Collaborator, Author)

I don't think we need a dataStream at all. We "simply" need to create a DecompressReader that lazily decompresses the data just when we need (read) it.
The cluster would have to change a bit. It would have to store the buffers of the already-read blobs and make sure that all buffers below idx have been read when getting blob idx.

Here is pseudo-code for the DecompressReader and the changes to be made to Cluster.

class DecompressReader : public Reader
{
  void read(char* dest, offset_t offset, zsize_t size) const {
    // The following code should not be used, as we are smart on the cluster side.
    // But we need it for this to be fully functional as a reader.
    if (last_read_offset > offset) {
       last_read_offset = 0;
    }
    // decompress from last_read_offset to offset (and drop the decompressed memory);
    decompress(offset-last_read_offset, nullptr);

    // Here is the useful decompression
    decompress(size, dest);
    last_read_offset = offset+size;
  }

Yes, that will work. But it achieves the result by using a more complex tool (the random-access Reader) in a situation where a simpler tool (a sequential reader, modeled in this PR by IDataStream) will do. Of course, there is the problematic tradeoff between having a single general-purpose tool or having multiple specialized tools, which doesn't have an easy answer. My point of view is that sequential scanning of data has proven its usefulness in a number of areas, and we'd better have and utilize that abstraction in our code.

@veloman-yunkan (Collaborator, Author)

On the one hand, you write:

What you do in this PR complicates the global architecture a lot:

  • You split the Cluster implementation in two and use different algorithms and storage depending on how the data is read. You break the separation between two different problems.

On the other hand, you propose:

class Cluster //updated code
{
  Blob getBlob(blob_index_t n) const
  {
    auto nb_read_buffers = already_read_buffers.size();
    for (auto i=nb_read_buffers; i<=n; i++) {
      // If the cluster is not compressed, reader is a `BufferReader` and the created buffer is a sub-buffer without memory copy.
      // If the cluster is compressed, reader is a `DecompressReader` and the created buffer is a "just in time" allocated and decompressed buffer.
      auto buffer = reader->get_buffer(startOffset+offsets[blob_index_type(i)], getBlobSize(i));
      already_read_buffers.push_back(buffer);
    }
    return Blob(already_read_buffers[n]);
  }
  std::vector<std::shared_ptr<const Buffer>> already_read_buffers;
};

which is sub-optimal for non-compressed clusters. The distinction between non-compressed and compressed clusters (if we strive for the most optimal solution in each case) is that between random and sequential access, which has no solution other than treating them differently.

Commit: Added a unit-test covering the scenario where the loop body in `DecodedDataStream::readImpl()` has to be executed more than once.
@veloman-yunkan (Collaborator, Author)

Added benchmarking data in the description of this PR.

@mgautierfr (Collaborator)

The fact that zim::Buffer always represents a real memory region (rather than emulating a memory region while possibly serving data from a remote resource or otherwise producing it on demand) undermines the role of all the polymorphic Buffer classes.

Having something that reads data, possibly served from a remote resource, is the job of Reader.
It is not a flaw of the polymorphic design if a class designed to be A cannot be B.

A far-reaching intent is to arrive at a simpler architecture as outlined in the previous comment (as well as noted in #402 (comment))

Please talk with me before making plans for libzim and implementing things.

Yes, that will work. But it achieves the result by using a more complex tool (the random-access Reader) in a situation where a simpler tool (a sequential reader, modeled in this PR by IDataStream) will do.

I'm not sure a sequential reader is simpler. But anyway, we would end up with two tools instead of one. And the global situation would definitely not be simpler.

My point of view is that sequential scanning of data has proven its usefulness in a number of areas, and we'd better have and utilize that abstraction in our code.

How? The only use case we have is the decompression stream.
We are not designing a generic "reading system". We are creating a sub-system for libzim and libzim only.

which is sub-optimal for non-compressed clusters.

How is it sub-optimal? Uncompressed clusters still have no memory copy and no big data allocation (no more than before). The only things that are created/copied are a few shared_ptrs. And this is a price I totally accept for a simple design.

@mgautierfr (Collaborator) commented Sep 7, 2020

Another important point is that this must be thread-safe.
This is the case with the current code, as the decompression is done in one step and then we are just reading const data.
But here we must protect all read accesses, as we may decompress data.
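
(For illustration, the locking this implies on the compressed-cluster side could look like the sketch below; the method names follow the snippets quoted earlier, but blobAccessMutex_ is an assumed member and this is not code from the PR.)

// Hypothetical sketch: serialize blob access so that concurrent readers
// do not race on the shared decompression state (requires <mutex>).
Blob CompressedCluster::getBlob(blob_index_t n) const
{
  std::lock_guard<std::mutex> lock(blobAccessMutex_);
  ensureBlobIsDecompressed(n);
  return idsBlob2zimBlob(blobs_[n.v], 0, blobs_[n.v].size());
}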

@kelson42 (Contributor)

Superseded by #421.

@kelson42 closed this on Sep 23, 2020.
@kelson42 deleted the streaming_decompression branch on May 26, 2022 at 18:13.