Streaming approach for reading data in fread #1958

Open
st-pasha opened this issue Aug 9, 2019 · 0 comments
Labels

  • design-doc: Generic discussion / roadmap how some major new functionality can be implemented
  • fread: Issues related to parsing any input files via fread function
  • new feature: Feature requests for new functionality

See also: #1843, #1950

This issue concerns reading data from sources other than a plain file. Such cases could include:

  • reading from a Python file-like object;
  • downloading from a URL;
  • reading from a shell pipe;
  • reading from a socket;
  • reading and uncompressing a file;
  • reading a file and decoding its encoding;
  • reading and decrypting a file;
  • combinations of the sources listed above.

Currently we handle such use cases by first dumping the content into a file, and then reading it via fread as normal. This approach, however, is suboptimal:

  • it forgoes the possible parallelism of receiving and parsing data simultaneously;
  • it incurs expensive disk I/O;
  • it is wasteful if the max_nrows= parameter is used, since the entire input is written to disk even though only the first rows are needed;
  • if multiple steps are combined, multiple temporary files are created, which is even more wasteful.
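
The max_nrows= point can be illustrated with a small standard-library sketch. It does not use fread; `CountingSource`, `read_via_tempfile` and `read_streaming` are hypothetical names invented here, and the "parsing" is a plain split on commas. The only purpose is to count how much of a sequential source each approach has to consume.

```python
import itertools
import tempfile


class CountingSource:
    """Hypothetical sequential source that records how many lines were pulled."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.pulled = 0

    def __iter__(self):
        for i in range(self.n_lines):
            self.pulled += 1
            yield f"{i},{i * i}\n"


def parse(line):
    # Stand-in for real CSV parsing.
    return line.rstrip("\n").split(",")


def read_via_tempfile(source, max_nrows):
    # Staged approach: the whole input is written to disk before parsing starts.
    with tempfile.TemporaryFile("w+") as tmp:
        for line in source:
            tmp.write(line)
        tmp.seek(0)
        return [parse(line) for line in itertools.islice(tmp, max_nrows)]


def read_streaming(source, max_nrows):
    # Streaming approach: stop pulling from the source once enough rows are read.
    return [parse(line) for line in itertools.islice(source, max_nrows)]


staged = CountingSource(1000)
rows_staged = read_via_tempfile(staged, max_nrows=5)

streamed = CountingSource(1000)
rows_streamed = read_streaming(streamed, max_nrows=5)

print(staged.pulled, streamed.pulled)
```

Both functions return the same five rows, but the staged version drains the whole source first, while the streaming version stops after the rows it needs.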

Suggested Implementation

In most of the cases listed above, reading the data is inherently a sequential task, so it has to run in single-threaded mode (in many cases with access to Python). The suggestion is therefore to dedicate one thread to reading the data, while all other threads are busy parsing it.
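
As a rough illustration of this thread layout (not the proposed implementation), the sketch below uses one input thread that is the only one touching the source, and several worker threads that parse whatever it delivers. The names `input_thread`, `worker`, `CHUNK_SIZE` and `N_WORKERS` are assumptions made for the example, a `queue.Queue` stands in for the buffer management, and counting newlines stands in for parsing.

```python
import io
import queue
import threading

CHUNK_SIZE = 64      # assumed chunk size, kept tiny for the demo
N_WORKERS = 3        # assumed number of parsing threads


def input_thread(src, chunks):
    # The only thread that reads from the sequential source.
    index = 0
    while True:
        data = src.read(CHUNK_SIZE)
        if not data:
            break
        chunks.put((index, data))
        index += 1
    for _ in range(N_WORKERS):
        chunks.put(None)            # one end-of-input marker per worker


def worker(chunks, results, lock):
    while True:
        item = chunks.get()
        if item is None:
            break
        index, data = item
        parsed = data.count(b"\n")  # stand-in for real parsing
        with lock:
            results[index] = parsed


src = io.BytesIO(b"".join(b"%d,%d\n" % (i, i * i) for i in range(200)))
chunks = queue.Queue(maxsize=8)     # bounded, so the reader cannot run far ahead
results = {}
lock = threading.Lock()

threads = [threading.Thread(target=input_thread, args=(src, chunks))]
threads += [
    threading.Thread(target=worker, args=(chunks, results, lock))
    for _ in range(N_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_lines = sum(results.values())
print(total_lines)
```

Chunks are cut at arbitrary byte offsets here, which is harmless for counting newlines; a real parser would have to handle rows that straddle chunk boundaries.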

  • The Input thread will maintain one or more internal buffers where the data will be stored.
  • As the data arrives, it is stored in the current buffer, provided that buffer is not yet full.
  • Once the current buffer gets full, a "gc" step is run in order to determine whether the oldest buffer can be reused. If yes, it becomes the new current buffer; otherwise a new buffer is allocated.
  • If the number of allocated buffers is already too high, the Input thread should wait until some of the older buffers get freed up.
  • "gc step": check the list of data ranges that are marked as "processed", and find the largest range [0:P] of data that is no longer in use. If the oldest buffer has data entirely within that range, mark the buffer as ready for reuse.
  • Other threads may query the Input object for chunks of data at any time. If the chunk is currently available, it is returned. Otherwise, the thread must wait until the data becomes available.
  • Once a worker thread is done with a particular chunk of data, it informs the Input object and that chunk is marked as "processed", allowing its memory to be eventually reclaimed.
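
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the actual design: `StreamInput`, `feed`, `get_chunk` and `mark_processed` are hypothetical names, buffers have a fixed size, and every requested chunk is assumed to fit within the total buffer capacity.

```python
import threading


class StreamInput:
    """Hypothetical sketch of the Input object; all names are illustrative."""

    def __init__(self, buffer_size=16, max_buffers=4):
        self.buffer_size = buffer_size
        self.max_buffers = max_buffers
        self.buffers = []      # (start_offset, bytearray) pairs, oldest first
        self.received = 0      # total number of bytes stored so far
        self.processed = []    # (start, end) ranges marked as "processed"
        self.eof = False
        self.cond = threading.Condition()

    def _processed_prefix(self):
        # gc step: largest P such that all of [0:P] has been processed.
        p = 0
        for start, end in sorted(self.processed):
            if start > p:
                break
            p = max(p, end)
        return p

    def _new_current_buffer(self):
        # Called with the lock held, when there is no room in the current buffer.
        while True:
            p = self._processed_prefix()
            if self.buffers and self.buffers[0][0] + self.buffer_size <= p:
                _, buf = self.buffers.pop(0)   # the oldest buffer can be reused
                buf.clear()
                break
            if len(self.buffers) < self.max_buffers:
                buf = bytearray()              # allocate a new buffer
                break
            self.cond.wait()                   # too many buffers: wait for workers
        self.buffers.append((self.received, buf))

    def feed(self, data):
        # Called by the input thread only.
        with self.cond:
            pos = 0
            while pos < len(data):
                if not self.buffers or len(self.buffers[-1][1]) == self.buffer_size:
                    self._new_current_buffer()
                buf = self.buffers[-1][1]
                n = min(len(data) - pos, self.buffer_size - len(buf))
                buf.extend(data[pos:pos + n])
                pos += n
                self.received += n
                self.cond.notify_all()         # wake workers waiting for data

    def close(self):
        with self.cond:
            self.eof = True
            self.cond.notify_all()

    def get_chunk(self, start, end):
        # Called by workers: blocks until [start:end] has arrived (or input ended).
        with self.cond:
            while self.received < end and not self.eof:
                self.cond.wait()
            end = min(end, self.received)
            out = bytearray()
            for buf_start, buf in self.buffers:
                lo = max(start, buf_start) - buf_start
                hi = min(end, buf_start + len(buf)) - buf_start
                if lo < hi:
                    out += buf[lo:hi]
            return bytes(out)

    def mark_processed(self, start, end):
        with self.cond:
            self.processed.append((start, end))
            self.cond.notify_all()             # the input thread may now reuse memory


CHUNK = 8
data = bytes(i % 251 for i in range(203))
inp = StreamInput(buffer_size=16, max_buffers=4)
results, next_chunk, claim_lock = {}, [0], threading.Lock()


def producer():
    for pos in range(0, len(data), 10):
        inp.feed(data[pos:pos + 10])
    inp.close()


def worker():
    while True:
        with claim_lock:                       # claim chunk indices in order
            k = next_chunk[0]
            next_chunk[0] += 1
        chunk = inp.get_chunk(k * CHUNK, (k + 1) * CHUNK)
        if not chunk:
            return
        results[k] = chunk
        inp.mark_processed(k * CHUNK, (k + 1) * CHUNK)


threads = [threading.Thread(target=producer)]
threads += [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
reassembled = b"".join(results[k] for k in sorted(results))
```

In this demo 203 bytes pass through at most four 16-byte buffers, so the input thread has to reuse buffers and occasionally wait for the workers. A single condition variable is used for simplicity; a real implementation would likely want finer-grained synchronization.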

The input thread must also be ready to receive a signal from the worker threads to pause receiving data. At that point the input thread must exit its current task, leaving the Input object in such a state that it can later resume receiving data from the point where it left off.
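
One possible shape for this pause/resume behavior is sketched below. It assumes the signal is a `threading.Event` and that the source can be polled with a timeout; `ResumableInput` and its attributes are hypothetical names, and a `queue.Queue` stands in for a socket or pipe. All state lives on the object rather than in the thread, so a fresh thread can pick up where the previous one stopped.

```python
import queue
import threading
import time


class ResumableInput:
    """Hypothetical Input object; the names here are illustrative only."""

    def __init__(self, source):
        self.source = source              # queue.Queue standing in for a socket/pipe
        self.received = bytearray()       # state that survives between runs
        self.finished = False
        self.pause_requested = threading.Event()

    def run(self):
        # Body of the input thread: receive until end of input or a pause signal.
        while not self.pause_requested.is_set():
            try:
                piece = self.source.get(timeout=0.01)
            except queue.Empty:
                continue                  # nothing arrived; re-check the signal
            if piece is None:             # None marks end of input in this sketch
                self.finished = True
                return
            self.received += piece


def run_in_thread(inp):
    t = threading.Thread(target=inp.run)
    t.start()
    return t


source = queue.Queue()
inp = ResumableInput(source)

for piece in (b"a,b\n", b"1,2\n", b"3,4\n"):
    source.put(piece)
t = run_in_thread(inp)
while len(inp.received) < 12:             # wait until the three pieces have arrived
    time.sleep(0.001)
inp.pause_requested.set()                 # a worker asks the input thread to pause
t.join()                                  # the input thread exits its task
received_at_pause = bytes(inp.received)

source.put(b"5,6\n")
source.put(None)
inp.pause_requested.clear()
t = run_in_thread(inp)                    # resume from where the input left off
t.join()
```

The timeout on `get` is what lets the input thread notice the signal while no data is arriving; with a blocking read that cannot time out, some other interruption mechanism would be needed.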

@st-pasha st-pasha added fread Issues related to parsing any input files via fread function new feature Feature requests for new functionality labels Aug 9, 2019
@st-pasha st-pasha added this to the Release 0.10.0 milestone Aug 9, 2019
@st-pasha st-pasha self-assigned this Aug 9, 2019
@st-pasha st-pasha added this to To Do in fread via automation Aug 9, 2019
@st-pasha st-pasha added the design-doc Generic discussion / roadmap how some major new functionality can be implemented label Aug 12, 2019
@st-pasha st-pasha removed this from the Release 0.10.0 milestone Nov 21, 2019
@st-pasha st-pasha mentioned this issue Jan 4, 2020
27 tasks
@st-pasha st-pasha moved this from To Do to In progress in fread Jun 17, 2020
@st-pasha st-pasha removed their assignment Sep 24, 2020