
Insert batches #450

Merged: chicco785 merged 7 commits from batch-inserts into master on Mar 9, 2021
Conversation

@c0c0n3 (Member) commented Feb 10, 2021

Proposed changes

This PR enables splitting the SQL rows to insert into batches. If the size of the list of rows the Translator has to insert exceeds a configured value M, the rows get split into smaller batches (lists), each of size no greater than M, and each batch gets inserted separately, i.e. the Translator issues a separate SQL bulk insert for each batch. We do this because some backends (e.g. Crate) limit how much data you can shovel into a single SQL (bulk) insert statement---see #445 about it.
Splitting happens as explained in the notes below, using a cost function to compute how much data each row to insert holds in memory and a maximum batch size M (a cost in bytes) read from the env---see the configuration notes below.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

Splitting spec

We split a stream into batches so the cumulative cost of each batch is within a set cost goal. Given an input stream s and a cost function c, we want to produce a sequence of streams b such that joining the b streams yields s and, for each b stream of length > 1, mapping c to each element and summing the costs yields a value ≤ M, where M is a configured cost goal. In symbols:

    (1)    s = b[0] + b[1] + ...
    (2)    b[k] = [x1, x2, ...] of length > 1 ⟹ M ≥ c(x1) + c(x2) + ...

Notice it can happen that, to make batches satisfying (1) and (2), some b[k] contains just one element x with c(x) > M, since that doesn't violate (1) or (2).
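
To see this with concrete numbers (illustrative only, not taken from the PR), suppose M = 10 and identify each row with its cost, so s = [4, 5, 3, 12, 2]. Then

    b = [ [4, 5], [3], [12], [2] ]

is a valid split: joining the batches gives back s, the only multi-element batch costs 4 + 5 = 9 ≤ 10, and [12] is a legitimate singleton batch even though its cost exceeds M.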

Implementation

We use Python streams to process data in constant space and linear time. Working with Python streams is anything but easy in my opinion, so the implementation looks quite involved, but the concept is fairly simple. In fact, for the mathematically inclined soul out there, the Python implementation is doing what this recursively defined function does (using Haskell-y syntax for lists), only in a more obscure way:

ϕ []          = []
ϕ [x]         = [[x]]

ϕ [x, y, ...] = [ x:t, u, ...]       if c(x) + Σ c(t[i]) ≤ M
ϕ [x, y, ...] = [ [x], t, u, ...]    otherwise

   where  [t, u, ...] = ϕ [y, ...]
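
For a rough idea of what this boils down to, here's a minimal Python sketch that satisfies the same spec by greedily accumulating from the left. The names batch, cost and max_cost are made up for illustration; the actual PR code streams data as described above and may batch differently, since ϕ folds from the right.

    from typing import Callable, Iterable, Iterator, List, TypeVar

    T = TypeVar('T')

    def batch(stream: Iterable[T], cost: Callable[[T], int],
              max_cost: int) -> Iterator[List[T]]:
        """Yield batches whose cumulative cost stays within max_cost.

        A lone element costing more than max_cost still gets its own
        singleton batch, as the spec allows.
        """
        current: List[T] = []
        current_cost = 0
        for x in stream:
            x_cost = cost(x)
            if current and current_cost + x_cost > max_cost:
                yield current            # flush the batch built so far
                current, current_cost = [], 0
            current.append(x)
            current_cost += x_cost
        if current:
            yield current                # flush the leftovers

For instance, list(batch([4, 5, 3, 12, 2], cost=lambda x: x, max_cost=10)) returns [[4, 5], [3], [12], [2]], the split from the earlier example. It runs in a single pass, keeping at most one batch in memory at a time.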

Notice this isn't a solution to #193 but is certainly one piece of the puzzle if we want to put together a stream-based architecture. Why should we care? Well, even if we split the insert into batches, we still have two huge datasets in memory: the Python representation of the input NGSI JSON doc and its translation to tabular format. Ouch, not exactly a big-data friendly design. In an ideal world, the notify endpoint would work in constant space and linear time...

Configuration

There's a new INSERT_MAX_SIZE env var to turn on splitting into batches. If set, this variable limits how much data you can shovel into a single SQL bulk insert to a value M---see above for the details of how data gets split into batches of cost at most M. We read this variable in on each API call to the notify endpoint, so it's sort of dynamic that way: changing it affects every later insert operation. Accepted values are sizes in bytes (B) or 2^10 multiples (KiB, MiB, GiB), e.g. 10 B, 1.2 KiB, 0.9 GiB. (Technically, anything bitmath can digest will do, e.g. MB, kB, and friends.) If the variable isn't set, or the set value isn't valid, the Translator processes SQL inserts as usual, without splitting data into batches.
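
Just to illustrate the mechanics (read_insert_max_size is a hypothetical helper, not necessarily what the PR actually defines), reading the variable could look something like:

    import os
    from typing import Optional

    import bitmath  # the size-parsing library mentioned above

    def read_insert_max_size() -> Optional[int]:
        """Read INSERT_MAX_SIZE from the env and return the limit in
        bytes, or None (= don't batch) if unset or invalid.
        """
        raw = os.environ.get('INSERT_MAX_SIZE')
        if not raw:
            return None
        try:
            # e.g. '1.2 KiB' -> 1228.8 bytes; raises ValueError on junk.
            return int(bitmath.parse_string(raw).bytes)
        except ValueError:
            return None

A None return would then make the Translator fall back to a plain, unsplit bulk insert.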

github-actions bot (Contributor) commented Feb 10, 2021

CLA Assistant Lite bot: All contributors have signed the CLA ✍️

@amotl mentioned this pull request Feb 10, 2021
@c0c0n3 requested a review from @chicco785 on Mar 9, 2021 at 17:25
@c0c0n3 (Member, Author) commented Mar 9, 2021

@chicco785 need your approval to merge :-)

@chicco785 (Contributor) left a comment:

LGTM

@chicco785 chicco785 merged commit 41bb25f into master Mar 9, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2021
@c0c0n3 c0c0n3 deleted the batch-inserts branch March 9, 2021 18:03