# Apache Beam - PubSub

PubSub is Google's cloud-based messaging system. It can be used as a source or a sink of data in a Beam pipeline. In this notebook we explore some of the PubSub capabilities available in Beam.


First, we define the dependencies that we wish to load from the Maven repositories.

Since our notebook uses the Google Cloud SDK JARs, we must include them in our dependencies. Specifically, we need:

```
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>2.43.0</version>
</dependency>
```

Normally we would load our dependencies using the IJava Jupyter cell magic called `%%loadFromPom`.  Unfortunately, this doesn't work ([issue](https://github.com/SpencerPark/IJava/issues/139)).  A workaround is to download the dependencies outside of Jupyter and then launch Jupyter with the downloaded dependencies in the classpath.

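```
# Download all dependencies declared in pom.xml into ./target/dependency
mvn dependency:copy-dependencies
# Make the downloaded JARs visible to the IJava kernel
export IJAVA_CLASSPATH="./target/dependency/*"
jupyter notebook
```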
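Once Jupyter has been launched this way, it can be worth confirming that the Beam classes are actually visible to the kernel before running anything else. The following is a minimal sketch of such a check. `ClasspathCheck` and `isAvailable` are hypothetical names introduced here for illustration and are not part of Beam or IJava, and we assume that classes defined in a notebook cell are loaded by a class loader that can see the `IJAVA_CLASSPATH` entries.

```java
// Hypothetical helper (not part of Beam or IJava): reports whether a class
// can be located by the class loader that loaded this helper.
final class ClasspathCheck {
  static boolean isAvailable(String className) {
    try {
      // initialize=false: locate the class without running its static initializers
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // Missing class, or a class whose own dependencies could not be linked
      return false;
    }
  }
}
```

In a notebook cell, evaluating `ClasspathCheck.isAvailable("org.apache.beam.sdk.io.gcp.pubsub.PubsubIO")` should yield `true` when the workaround has taken effect.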

See also:
* [JavaDoc: PubsubIO](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html)
* [PubSub to BigQuery: How to Build a Data Pipeline Using Dataflow, Apache Beam, and Java](https://www.datobra.com/posts/pubsub_to_bigquery_dataflow_pipeline/)

Next, we define the imports required for execution and create the pipeline options.

In [1]:
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.apache.beam.runners.direct.DirectOptions;
import org.joda.time.Instant;

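// CHANGE THE FOLLOWING: point tempLocation at your own bucket
String[] args = new String[] {"--tempLocation=gs://kolban-tmp"};
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

// Do not block in run() on the direct runner, so that the cell regains
// control and the pipeline can be cancelled through its PipelineResult
options.as(DirectOptions.class).setBlockOnRun(false);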

## Reading from a subscription
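We will start with the simplest case: reading from a PubSub subscription. `PubsubIO` provides several read transforms that start a pipeline by reading from a subscription and producing a PCollection of the messages received. We have:

* PubsubIO.readStrings() - payloads decoded as UTF-8 strings
* PubsubIO.readProtos() - payloads parsed as protocol buffer messages of a given class
* PubsubIO.readAvros() - payloads parsed as Avro records of a given class
* PubsubIO.readMessagesWithMessageId() - PubsubMessage with payload and message ID
* PubsubIO.readMessagesWithAttributesAndMessageId() - PubsubMessage with payload, attributes and message ID
* PubsubIO.readMessagesWithAttributes() - PubsubMessage with payload and attributes
* PubsubIO.readMessages() - PubsubMessage with payload only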

In the following, we create a pipeline that reads messages from a subscription.  The output is a `PCollection<PubsubMessage>`.

See also:
* [JavaDoc: PubsubMessage](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubMessage.html)
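The subscription is identified by a fully qualified path of the form `projects/<project>/subscriptions/<name>`. As a small sketch, the helper below builds such a path and rejects obviously malformed ones. `SubscriptionPaths` is a hypothetical class introduced for illustration and is not part of Beam, and the pattern is only a loose structural check, not the full PubSub naming rules.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Beam): builds and loosely validates
// subscription paths of the form projects/<project>/subscriptions/<name>.
final class SubscriptionPaths {
  // Structural check only: two non-empty segments in the expected positions
  private static final Pattern PATH =
      Pattern.compile("projects/[^/]+/subscriptions/[^/]+");

  static boolean isValid(String path) {
    return path != null && PATH.matcher(path).matches();
  }

  static String of(String project, String name) {
    String path = "projects/" + project + "/subscriptions/" + name;
    if (!isValid(path)) {
      throw new IllegalArgumentException("Malformed subscription path: " + path);
    }
    return path;
  }
}
```

For example, `SubscriptionPaths.of("test1-305123", "beam_sub")` produces the same string that is assigned to `subscription` in the next cell.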

In [2]:
// CHANGE THE FOLLOWING
var subscription = "projects/test1-305123/subscriptions/beam_sub";

//-----------------

public static class LogPane {
  public static void log(PaneInfo pane) {
    System.out.println("index: " + pane.getIndex()
      + ", nonSpeculativeIndex: " + pane.getNonSpeculativeIndex()
      + ", Timing: " + pane.getTiming()
      + ", isFirst: " + pane.isFirst()
      + ", isLast: " + pane.isLast()
      + ", isUnknown: " + pane.isUnknown());
  }
}

public class LoggingDoFn extends DoFn<PubsubMessage, PubsubMessage>  {
  @ProcessElement
  public void processElement(
    @Element PubsubMessage element, OutputReceiver<PubsubMessage> out, ProcessContext context)
  {
    System.out.println(element);
    System.out.println(" Event timestamp:      " + context.timestamp());
    System.out.println(" Processing timestamp: " + Instant.now());
    // Decode explicitly as UTF-8 (assumes a text payload) instead of the platform default charset
    System.out.println(" Payload: \"" + new String(element.getPayload(), java.nio.charset.StandardCharsets.UTF_8) + "\"");
    // LogPane.log(context.pane());
    out.output(element);
  }
}

var pipeline = Pipeline.create(options);
pipeline
  .apply("Read subscription", PubsubIO.readMessages().fromSubscription(subscription))
  .apply("Print elements", ParDo.of(new LoggingDoFn()));
System.out.println("About to run pipeline");
PipelineResult result = pipeline.run();
System.out.println("Pipeline running");
result.waitUntilFinish(); // We can interrupt this statement and then cancel.

About to run pipeline
Pipeline running


EvaluationInterruptedException: Evaluator was interrupted while executing: 'result.waitUntilFinish(); // We can interrupt this statement and then cancel.'

In [3]:
// Stop the pipeline
result.cancel();
System.out.println(Instant.now());

2022-12-28T15:16:57.693Z


In [4]:
result

org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult@6a337055