Skip to content

nakamasato/apache-beam-training

Repository files navigation

Apache Beam Training

  1. Java
  2. Python

Examples

Java:

  1. First Apache Beam Application
  2. ParDo and DoFn: Parallel Processing
  3. KV + GroupByKey: Aggregation
  4. MapElement.via(new SimpleFunction) <-> ParDo + DoFn
  5. KV with Custom Class and GroupIntoBatches
  6. MultiOutput: Failure Handling
  7. MultiOutput: with differnt types
  8. Read from Google PubSub

Python:

  1. Read and write PubSub
  2. Read and write PubSub proto message
  3. Read and write PubSub with deduplication (ToDo)

References

  1. Study Resource:
    1. https://www.youtube.com/c/ApacheBeamYT/videos
    2. Streaming Engine: Execution Model for Highly-Scalable, Low-Latency Data Processing
  2. Error Handling:
    1. https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
    2. https://www.linuxdeveloper.space/retry-apache-beam-flink/
    3. https://medium.com/@bravnic/apache-beam-fundamentals-765ea5b59565
    4. https://stackoverflow.com/questions/53392311/apache-beam-retrytransienterrors-neverretry-does-not-respect-table-not-found-err
  3. Examples:
    1. https://github.com/apache/beam/tree/master/examples/java
    2. https://medium.com/google-cloud/bigtable-beam-dataflow-cryptocurrencies-gcp-terraform-java-maven-4e7873811e86
    3. https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/googlecloud-to-elasticsearch/docs/PubSubToElasticsearch
  4. IO:
    1. https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/introduction.html
    2. https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/elasticsearch-common/src/main/java/com/google/cloud/teleport/v2/elasticsearch/utils/ElasticsearchIO.java#L1100
  5. Coder
    1. apache/beam#663
    2. GoogleCloudPlatform/DataflowJavaSDK#298
    3. https://timbrowndatablog.medium.com/apache-beam-coder-performance-4415cd0a1030
    4. https://stackoverflow.com/questions/28032063/how-to-fix-dataflow-unable-to-serialize-my-dofn
  6. KafkaIO+Protobuf: https://selectfrom.dev/apache-beam-python-dataflow-kafkaio-for-protobuf-message-streaming-f349119850ad