A simple example of using Spark Structured Streaming for an imaginary case: aggregating customers' bank transactions
Our customers make debit and credit transactions. The task is to aggregate them all in one-hour windows and save the aggregates to a database. "To aggregate" means to group transactions by customer, time window, and type, and then sum the amounts. Transactions arrive as CSV files. Apache Cassandra is used as the aggregate storage.
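For illustration, a single transaction might be modeled as the case class below; the field names are assumptions rather than the project's actual schema.

```scala
import java.sql.Timestamp

// Hypothetical shape of a single transaction; the real field names may differ.
case class Transaction(
  customerId: String,    // who made the transaction
  transactionId: String, // unique id, needed for deduplication
  kind: String,          // "debit" or "credit"
  amount: BigDecimal,    // must be non-zero for a valid record
  time: Timestamp        // event time, used for windowing
)
```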
A few complications:
- Transactions may be incorrect (malformed CSV, a zero amount).
- A transaction may be duplicated within the hour, and the same transaction must not be summed twice.
- Transactions may arrive a few hours late, but no more than 24 hours late; a watermark covers this, as the sketch after this list shows.
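A minimal sketch of how the last two complications can be handled together, assuming the columns from the hypothetical schema above: a 24-hour watermark bounds how late events may arrive (and how much deduplication state Spark keeps), and dropDuplicates then discards repeats of the same transaction.

```scala
import org.apache.spark.sql.DataFrame

// A 24-hour watermark bounds both lateness and deduplication state;
// dropDuplicates on the id plus the event-time column then discards
// repeats of the same transaction. Column names are assumptions.
def deduplicated(transactions: DataFrame): DataFrame =
  transactions
    .withWatermark("time", "24 hours")
    .dropDuplicates("transactionId", "time")
```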
The plan for the CSV source (a full pipeline sketch follows the list):
- Read the stream of CSV files and filter off incorrect records.
- Apply a time-window aggregation.
- Write batches of aggregates to Cassandra via spark-cassandra-connector.
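A self-contained sketch of the CSV pipeline under stated assumptions: the schema, input and checkpoint paths, Cassandra host, and keyspace/table names are all hypothetical, and foreachBatch needs Spark 2.4 or later. It illustrates the approach, not the project's exact code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object TransactionAggregator {

  // Flatten the window struct and append one micro-batch to Cassandra
  // through spark-cassandra-connector's DataFrame API.
  def writeToCassandra(batch: DataFrame, batchId: Long): Unit =
    batch
      .withColumn("window_start", col("window.start"))
      .withColumn("window_end", col("window.end"))
      .drop("window")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "bank", "table" -> "aggregates")) // hypothetical names
      .mode("append")
      .save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("transaction-aggregator")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
      .getOrCreate()

    val schema = StructType(Seq(
      StructField("customerId", StringType),
      StructField("transactionId", StringType),
      StructField("kind", StringType),
      StructField("amount", DecimalType(18, 2)),
      StructField("time", TimestampType)
    ))

    val aggregates = spark.readStream
      .schema(schema)
      .option("mode", "DROPMALFORMED")          // drop malformed CSV lines
      .csv("/data/transactions")                // hypothetical input directory
      .filter(col("amount") =!= 0)              // zero-amount records are incorrect
      .withWatermark("time", "24 hours")        // tolerate up to 24 hours of lateness
      .dropDuplicates("transactionId", "time")  // count each transaction once
      .groupBy(
        window(col("time"), "1 hour"),          // one-hour tumbling windows
        col("customerId"),
        col("kind"))
      .agg(sum("amount").as("total"))

    aggregates.writeStream
      .outputMode("update")
      .option("checkpointLocation", "/data/checkpoints") // hypothetical path
      .foreachBatch(writeToCassandra _)
      .start()
      .awaitTermination()
  }
}
```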
The second part solves the same task, but transactions arrive as JSON messages from Kafka:
- Read the stream of JSON messages via spark-sql-kafka-0-10_2.11 (see the sketch after this list).
- Use the same time-window aggregation as before.
- Write batches of aggregates to Cassandra via spark-cassandra-connector, as before.
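The Kafka part only swaps the source; the aggregation and the Cassandra writer stay as above. A sketch, assuming a local broker, a hypothetical topic name, and JSON fields matching the earlier schema:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Replace the CSV reader with a Kafka reader; downstream code is unchanged.
def kafkaTransactions(spark: SparkSession, schema: StructType): DataFrame =
  spark.readStream
    .format("kafka")                                      // provided by spark-sql-kafka-0-10
    .option("kafka.bootstrap.servers", "localhost:9092")  // assumption: local broker
    .option("subscribe", "transactions")                  // hypothetical topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).as("t")) // parse the JSON body
    .select("t.*")                                        // same columns as the CSV source
```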