Distributed change data capture (CDC) platform for Google BigQuery
Change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Instead of continuously polling a database for changes (which is costly if you do it often and inaccurate if you don't), rilly
uses the log-based approach (as does debezium
and all other major CDC frameworks).
There is currently no CDC plug-in for BigQuery that I am aware of, and certainly none for Python. The goal of this package is to be as simple and non-opinionated as possible to allow developers to have full control over how they want to stream and parse their change events.
pip install rilly
This library uses Google's PubSub and Stackdriver APIs, so follow the authentication process here.
Say you want to track all update/delete/insert events in your BigQuery dataset. After authenticating the Google Python Client APIs:
from rilly import logging, stream
#create a PubSub topic to send your change events to
stream.create_pubsub_topic('my-project-id', 'pubsub-topic')
#create sink to send logs to PubSub topic
logging.create_sink('sink-id', 'my-project-id', 'my-dataset-id', pubsub_topic='pubsub-topic')
#custom callback function to perform some action on each event
def custom_callback(message: str) -> str:
print('Received message data: {}'.format(message))
return message
#create subscription to PubSub topic, apply custom_callback() to each streamed log
stream.subscribe('my-project-id', 'pubsub-topic', 'cdc-subscription', 30, custom_callback)