Streams #17

Closed
numberoverzero opened this issue Jul 21, 2015 · 13 comments

@numberoverzero

New features are now GA.

How can we expose streams elegantly? Ideally we could leverage the Type system to load values as they are retrieved from the stream.

A very brief review of the GetRecords operation suggests we could load the new and old values into two objects. StreamRecord is encouraging on this point: all attributes are included, noting "The item in the DynamoDB table as it appeared after it was modified." and "The item in the DynamoDB table as it appeared before it was modified."
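
For reference, here is roughly what reading a single shard looks like with raw boto3; the dynamodbstreams client calls are the real API, but the single-shard walk and the function name are just for illustration:

import boto3

streams = boto3.client("dynamodbstreams")

def read_first_shard(stream_arn):
    # Walk the first shard from its oldest available record
    shard = streams.describe_stream(
        StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    for record in streams.get_records(ShardIterator=iterator)["Records"]:
        images = record["dynamodb"]
        # OldImage/NewImage presence depends on the stream's view type
        print(record["eventName"], images.get("OldImage"), images.get("NewImage"))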

@numberoverzero

While thinking about inheritance I came up with a draft for one solution to streaming across multiple engines:

# ------------------------------------------------------
# User function that is called when a change passes through the streaming API.
# Signals a stop to processing once 100 changes are observed.
# Return truthy values to stop processing, falsy values to continue.
# ------------------------------------------------------

changes = 0
def handle_update(engine, model, old, new):
    global changes
    changes += 1
    # Stop processing after 100 changes
    return changes > 99

# Blocking call that monitors all tables registered for any number of engines
bloop.stream.monitor(handle_update, legacy_engine, engine_compat)

# ------------------------------------------------------
# bloop.streaming.py
# ------------------------------------------------------

import collections


def monitor(callback, *engines):
    tables = collections.defaultdict(list)
    for engine in engines:
        for model in engine.models:
            tables[model.Meta.table_name].append((engine, model))
    _listen_for(callback, tables)


def _listen_for(callback, tables):
    # client.listen_to is some infinite generator that
    # yields tuples of changes for a given list of table names
    for change in client.listen_to(tables.keys()):
        table, old_value, new_value = change

        # Engines should be unordered - don't terminate until all
        # engines for a given change have had a chance to process it
        terminate = False
        for engine, model in tables[table]:
            old = engine._load(model, old_value)
            new = engine._load(model, new_value)
            terminate |= bool(callback(engine, model, old, new))
        if terminate:
            return

@numberoverzero commented Jul 7, 2016

The stuff above was trying to do too many things that aren't flexible:

  1. multiple models
  2. baked in polling strategy
  3. messy callback hell

Keeping the call pattern closer to the streaming API means fewer surprises, and the user can customize everything above. This reduces the goals of engine.stream to (in priority order, most important first):

  1. Yield instances of the model being streamed, instead of attribute blobs.
  2. Engine.stream(model) returns a simple iterator for events (new, old, new+old?)
  3. Automate shard + position information so callers can just next(stream)
  4. Token is optional in case users want to continue a previous stream, etc.
  5. Yield None when the poll has no results, so that the user can write their own backoff policy

This is the design I've wanted for a while but for some reason got caught up with callbacks.

I still need to come up with a way to switch between streaming modes (I think there are old-object, new-object, and old+new-object modes), which means another engine config value, and make sure that the yielded objects are consistent between streaming modes (so we don't have old->obj, new->obj, old+new->dict).

That probably means yielding a dict for each event, like:

{
    "old": instance_of_model,
    "new": None
}

A rough sketch of that design:

class Stream:
    def __init__(self, *, engine, model, token):
        self.engine = engine
        self.model = model
        self.token = token

    def __iter__(self):
        if self.token is None:
            self._generate_token()
        while True:
            events = self._advance_stream()
            if events:
                for event in events:
                    yield self._unpack(event)
            else:
                # Caught up to head of the stream, explicit None lets caller
                # decide how long to wait before next poll.
                # We don't need this for the backlog case (every call returns items)
                # because it means we aren't at the head of the stream yet, and there
                # are unprocessed events.
                yield None

    def _generate_token(self):
        # TODO some boto3 calls to establish shard, etc
        pass

    def _advance_stream(self):
        # TODO result = boto3.some_call(unpack self.token)
        # TODO update self.token from result
        # return None on empty result
        pass

    def _unpack(self, event):
        # TODO handle old + new
        blob = some_unpacking_here(event)

        obj = self.engine._instance(self.model)
        self.engine._update(
            obj, blob, obj.Meta.columns,
            context={"engine": self.engine})
        return obj


class Engine:
    def stream(self, model, *, token=None):
        # TODO abstract check
        return iter(Stream(engine=self, model=model, token=token))

# ===============================================================

import time

stream = engine.stream(MyModel)

processed = 0
while processed < 100:
    obj = next(stream)
    if obj is not None:
        process(obj)
        processed += 1
    else:
        # At head of stream, no new events.  Back off polling.
        time.sleep(1)

@numberoverzero

The event dict will also need to include the type of modification performed. From the Record this is eventName, one of INSERT, MODIFY, REMOVE.

For a new-object-only stream, an event would look like:

{
    "type": "MODIFY",
    "old": None,
    "new": instance_of_model
}
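
A hedged sketch of unpacking a raw Record into that dict, reusing the engine._instance and engine._update calls from the earlier draft (unpack_event itself is a hypothetical helper, not settled API):

def unpack_event(engine, model, record):
    # record is a raw DynamoDB Streams Record:
    # {"eventName": ..., "dynamodb": {"OldImage": ..., "NewImage": ...}}
    def load(image):
        if image is None:
            return None
        obj = engine._instance(model)
        engine._update(
            obj, image, model.Meta.columns,
            context={"engine": engine})
        return obj

    images = record["dynamodb"]
    return {
        "type": record["eventName"],
        "old": load(images.get("OldImage")),
        "new": load(images.get("NewImage")),
    }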

Users may want to start at the beginning (replay the whole stream), or start at the end (process only the events that occur after they make the stream object), or start somewhere in the middle (token from previous bloop stream).

In the DynamoDB API this is exposed when getting an iterator for a single shard, through the ShardIteratorType parameter of GetShardIterator.

User intent, though, is at the stream level, and needs to expose three options (a possible signature is sketched after this list): existing token, start, or end.

  1. If a token is provided, unpack that into a shard iterator on a stream and start from there. No need to pick between AT_SEQUENCE_NUMBER and AFTER_SEQUENCE_NUMBER since that is part of the shard iterator stored in the token.
  2. Otherwise, the user wants to start at the beginning or end of the stream. This needs to be easy to specify. For example, don't require another import and an enum of Token.Start or Token.End.
  3. This needs to be a required parameter. There's no clear winner for a default between start and end; the wrong assumption for either would be surprising and possibly dangerous (start leads to double processing, end means the cache doesn't rebuild).
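
A minimal sketch of a signature that satisfies all three; the parameter name, the "start"/"end" strings, and the extra Stream keyword are illustrative, not settled:

class Engine:
    def stream(self, model, position):
        # position is required: "start", "end", or a token dict
        # captured earlier from a stream
        if isinstance(position, dict):
            return Stream(engine=self, model=model, token=position)
        if position in ("start", "end"):
            # assumes Stream grows a position keyword for the no-token case
            return Stream(engine=self, model=model, token=None, position=position)
        raise ValueError("position must be 'start', 'end', or a stream token")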

@numberoverzero

The example above has Engine.stream returning iter(Stream(...)), which assumes there is no useful information a user could extract from the stream besides events. However, for the token parameter to be at all useful, the user needs some way of getting one. This should be a property of the Stream object, changing the interface to (ignoring the start/end issue in the above comment):

class Stream:
    ...

    @property
    def token(self):
        # Pack the streamArn, shardId, and shardIterator into a dict
        ...

    def __iter__(self):
        return self

    def __next__(self):
        # __iter__ code from the previous example
        ...


class Engine:
    ...

    def stream(self, model, *, token=None):
        return Stream(engine=self, model=model, token=token)

Making the Stream class an iterable and its own iterator means users can still call next() on the returned object, but can now get the token, too:

stream = engine.stream(MyModel)

processed = 0
while processed < 100:
    obj = next(stream)
    if obj is not None:
        process(obj)
        processed += 1
    else:
        # At head of stream, no new events.  Back off polling.
        time.sleep(1)
    # See the new token after every call
    print(stream.token)

There's no need to make a Stream resettable, since the class is a lightweight wrapper over a few list/describe/get calls. Users would just create a new stream with the same parameters.

There's a little work to make two streams start at the same point when the first starts at the head of the stream, though. The user would need something like:

first_stream = engine.stream(MyModel, whatever_indicates_start)
snapshot_start_token = first_stream.token

# iterate through first_stream here
...

# Use the token that was computed at the start of the first stream to
# construct another bloop.Stream that starts at the same point
second_stream = engine.stream(MyModel, token=snapshot_start_token)

@numberoverzero

This will make it into the 1.0.0 release; it's behind the docs rewrite (#13) which has been coming along in small bursts.

@numberoverzero commented Jul 26, 2016

Spent some time thinking about how to specify the stream view type when the model is declared. Simple and intuitive are my highest priorities, followed much further down by extensibility.

Here's the enum that bloop will map options to: StreamViewType

Preference (tl;dr)

Leaning towards a dict-based specification, primarily to avoid the split types and for easy room to grow. I don't like the extra overhead for specifying a single view:

stream = {"new": True}  # (vs) stream = "new"

but I do like that it mirrors what I expect the event format will be:

{
    "type": string-based-enum,
    "new": obj,
    "old": obj
}

Criteria

Any option needs to fulfill the following criteria:

  • no imports - while there's some precedent with GlobalSecondaryIndex
    and LocalSecondaryIndex, those encapsulate a lot of behavior. There is
    only one real option for a table stream (for now) and that's the view type.
    An import for what roughly amounts to a switch is overkill.
  • no enums - specifying the view type shouldn't require users to remember
    StreamViewType.Keys or anything. Requests is a great example here. The
    verify option can be effectively used with just True or False. The
    configuration for a stream isn't yet complex enough to demand its own class
    just to declare one on a model.
  • easy to remember - a dict shouldn't need more than True/False values;
    inclusion or exclusion should be sufficient for an iterable. Users shouldn't
    have to go to the docs to specify one of a constrained set of values.
  • immediate validation - the shape must be unambiguous (and simple)
    enough that the value can be quickly checked on model declaration.
  • easy fix - the shape must be simple enough that a validation error
    message can explain what went wrong. Don't make users go to the
    docs to correct their mistake.
  • room to grow - DynamoDB Streams is a different service from
    an api standpoint; it's reasonable to expect the team(s) backing it to add
    features in the future. The shape should expand gracefully as features are
    released; this provides time for users to transition to a Stream class or
    other more complex model. The shape shouldn't be invalidated when the
    first feature comes out.

Options

These aren't exhaustive, but they make use of the most common Python constructs (lists/tuples and dicts). I expect iterations on these options, but new orthogonal shapes are unlikely.

String or Tuple of Strings

class User(Base):
    class Meta:
        # Default if not specified
        stream = None

        # string or tuple
        stream = "keys"
        stream = "new"
        stream = "old"
        stream = "new", "old"

        # invalid, can't specify new and keys
        stream = "new", "keys"

        # invalid, unknown view type "foo"
        stream = "old", "foo"

    id = Column(Integer, hash_key=True, name="h")
    sequence = Column(Integer, range_key=True, name="r")
    data = Column(Binary)
    some_gsi = Column(Integer)
    by_gsi = bloop.GlobalSecondaryIndex(projection="keys", hash_key="data")

Dict of Options

class User(Base):
    class Meta:
        # Default if not specified
        stream = None

        stream = {
            "new": True,
            "old": True
        }

        # options default to false
        stream = {
            "old": True
        }

        # invalid, can't specify new and keys
        stream = {
            "new": True,
            "old": False,
            "keys": True
        }

        # valid; unexpected keys are ignored
        stream = {
            "new": True,
            "foo": "bar"
        }

    id = Column(Integer, hash_key=True, name="h")
    sequence = Column(Integer, range_key=True, name="r")
    data = Column(Binary)
    some_gsi = Column(Integer)
    by_gsi = bloop.GlobalSecondaryIndex(projection="keys", hash_key="data")

Tradeoffs

In either case, stream will be None if omitted; an explicit None is also allowed to represent a lack of stream.

Pros of strings:

  • Succinct
  • Close mapping to api specification
  • Reflects that options are effectively boolean
  • Easy to fit in an error message
  • Simple validation

Cons of strings:

  • Has to support tuples for exactly one case
  • Anyone inspecting Meta.stream has to handle string/tuple duality
  • Very limited extensibility

Pros of dicts:

  • Extremely common data structure for configuration, options, etc.
  • A False default for omitted values isn't surprising
  • Very extensible; "view_type" can make room for other stream configuration
  • Single type regardless of view

Cons of dicts:

  • Yet another level of indentation to specify a model:

    class Model(new_base()):
      class Meta:
          stream = {
              "new": True,
              "old": True
          }
      ...
  • Looks excessive for single options:

    stream = {"new": True}
  • Error message gets cluttered, may have to direct users to docs

numberoverzero self-assigned this Aug 29, 2016
@numberoverzero

To try to minimize deprecation + migration if/when new stream features are available, this will be the syntax for specifying a stream:

# No stream (default if missing)
stream = None

# Basic keys-only stream
stream = {
    "include": ["keys"]
}

# new and old, with an expected stream label
stream = {
    "include": ["new", "old"],
    "label": "2016-08-29T03:26:22.376"
}
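
A minimal validation sketch for this shape; validate_stream and VALID_INCLUDE are hypothetical names, not bloop's actual code:

VALID_INCLUDE = {"keys", "new", "old"}

def validate_stream(stream):
    # None (or omitted) means no stream; anything else must be a dict
    if stream is None:
        return
    if not isinstance(stream, dict):
        raise ValueError("stream must be None or a dict")
    include = set(stream.get("include", []))
    if not include or not include <= VALID_INCLUDE:
        raise ValueError(
            "stream['include'] must be a non-empty subset of "
            + str(VALID_INCLUDE))
    if "keys" in include and len(include) > 1:
        raise ValueError("'keys' cannot be combined with 'new' or 'old'")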

@numberoverzero

Debugging info from some exploratory DescribeTable calls. When there is no stream, none of the following are included in the response:

LatestStreamArn
LatestStreamLabel
StreamSpecification

When a stream exists, all three of these are returned. The UI shows the table as "updating" but the status is immediately ACTIVE after adding or removing a stream. The fields look like this:

'LatestStreamArn': 'arn:aws:dynamodb:us-west-2:525897010715:table/stream_both/stream/2016-08-29T03:26:22.376',
'LatestStreamLabel': '2016-08-29T03:26:22.376',
'StreamSpecification': {
    'StreamEnabled': True,
    'StreamViewType': 'NEW_IMAGE'
}

The combination of account, table, and stream label is guaranteed unique, but a label can be the same across tables.
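
That makes checking for a stream straightforward; a minimal sketch with the standard DescribeTable call (the helper name is illustrative):

import boto3

dynamodb = boto3.client("dynamodb")

def latest_stream_arn(table_name):
    # All three stream fields are absent when no stream exists,
    # so .get() returns None in that case
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    return table.get("LatestStreamArn")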

@numberoverzero

Stream creation and validation are done! Next, debugging some stream calls.

numberoverzero added a commit that referenced this issue Aug 29, 2016
The declaration looks a bit weird now, since "include" is the
only key, and a required one.  I'd be shocked, though, if there
are never any new configuration options for a Stream, and never
any bloop-specific configuration options for one.
numberoverzero added a commit that referenced this issue Oct 16, 2016
The inner `was_active = shard in self.active` was redundant,
because the only shards considered for removal were *from*
`self.active`.
numberoverzero added a commit that referenced this issue Oct 16, 2016
This isn't a bug fix, just an optimization.

Previously, if an active shard without a sequence_number became
exhausted, heartbeat() would still try to get records out of it.
Because Shard.__next__ returns immediately when self.exhausted,
this wasn't an issue.  The shard wouldn't be removed from active
on future heartbeats.  It would be removed from the active list
and its children would be promoted on the next Coordinator.next
when the buffer is empty.

Now, exhausted active shards will be cleaned up as part of a
heartbeat.  This means shards in the active list will never be
exhausted[0].  In the future, that distinction could be relevant,
or the assumption could be made that exhausted shards aren't part
of the active list.

[0] Won't be exhausted locally.  The actual ShardIterator may be
    the last for the Shard, but that isn't known yet.
numberoverzero added a commit that referenced this issue Oct 17, 2016
The first thing shard.seek_to does is jump_to("trim_horizon"),
so there's no need to do an extra jump within the shard walking
of _move_stream_time.
numberoverzero added a commit that referenced this issue Oct 17, 2016
It's easy to lose track of the frames involved in the recursive
validation.  Hopefully a visual clears up the overlap between
token shards and actual shards.
@numberoverzero

I should get around to writing a sample of streaming records without bloop:

https://gist.github.com/numberoverzero/8f95c5b21c50d2140347b43d63dd1488

That's going to suck.
