# Apache Beam - Testing

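Beam provides utilities for testing pipelines, including `GenerateSequence` for producing synthetic elements and `TestStream` for simulating an unbounded source. The following resources give useful background: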


* [Testing Unbounded Pipelines in Apache Beam](https://beam.apache.org/blog/test-stream/)
* [Testing in Apache Beam Part 2: Stream](https://medium.com/@asitkovets/testing-in-apache-beam-part-2-stream-2a9950ba2bc7)
* [Video: Understanding exactly-once processing and windowing in streaming pipelines](https://www.youtube.com/watch?v=DraQGkARegE)
* [JavaDoc: Class TestStream](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/testing/TestStream.html)
* [JavaDoc: Class GenerateSequence](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/io/GenerateSequence.html)
* [GitHub: json-data-generator](https://github.com/vincentrussell/json-data-generator)
* [GitHub: beam-late-data](https://github.com/iht/beam-late-data)


First, we define the dependencies that we wish to load from the Maven repositories.

In [1]:
%%loadFromPOM

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.43.0</version>
</dependency>

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.43.0</version>
  <scope>runtime</scope>
</dependency>

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>2.0.6</version>
</dependency>

Next, we define the imports required for execution and create the pipeline options.

In [74]:
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.schemas.JavaBeanSchema;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.values.TimestampedValue;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.beam.sdk.testing.TestStream;
import org.joda.time.Duration;
import org.joda.time.Instant;


String[] args = new String[] {};
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

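We can generate a `PCollection<Long>` containing a sequence of numbers using `GenerateSequence`. The elements of a `PCollection` have no guaranteed order, so the printed values may appear out of sequence.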

In [18]:
public class LoggingDoFn<T> extends DoFn<T, T>  {
  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out) {
    System.out.println(element);
    out.output(element);
  }
} // End of LoggingDoFn

var pipeline = Pipeline.create(options);
pipeline
  .apply("Generate elements", GenerateSequence.from(0).to(5))
  .apply("Print elements",ParDo.of(new LoggingDoFn<>()));
pipeline.run().waitUntilFinish();

3
4
0
1
2


DONE

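We can also generate an unbounded sequence by omitting `to(...)`. Here `withRate` limits output to one element per second and `withMaxReadTime` stops generating elements after 20 seconds.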

In [26]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Generate elements", GenerateSequence
    .from(0)
    .withMaxReadTime(Duration.standardSeconds(20))
    .withRate(1, Duration.standardSeconds(1)))
  .apply("Print elements",ParDo.of(new LoggingDoFn<>()));
pipeline.run().waitUntilFinish();

EvaluationInterruptedException: Evaluator was interrupted while executing: 'pipeline.run().waitUntilFinish();'

## TestStream
Beam provides a testing source called `TestStream` that emits a scripted set of specific elements as an unbounded source. There are two ways to create a `TestStream`:

* `TestStream.create(Coder<T> coder)`
* `TestStream.create(Schema schema)`

The first produces a `PCollection<T>`, while the second produces a `PCollection<Row>`. After calling `create`, we can add elements by calling `addElements(...)`. There are two flavors of this:

* `addElements(T element, ...)`
* `addElements(TimestampedValue<T> element, ...)`

The first adds elements with no explicit event timestamp, while the second adds elements with a given timestamp. In the example below, the second argument to `TimestampedValue.of` is the event timestamp; the `Instant` passed to the `Sale` constructor is only a data field of the element:

```
addElements(TimestampedValue.of(new Sale("blue", 10.0, Instant.parse("2022-12-11")), Instant.now()))
```

See also:
* [JavaDoc: Class TimestampedValue](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/values/TimestampedValue.html)
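
The event timestamp attached through `TimestampedValue` is what a windowing strategy such as `FixedWindows` uses to place an element in a window. As a rough illustration only, the following standalone plain-Java sketch computes which one-day window a timestamp would fall into, assuming epoch-aligned windows with zero offset. It is not Beam's implementation: the class `FixedWindowSketch` and its methods are hypothetical names introduced here, and it uses `java.time` rather than the Joda-Time types used in this notebook, so it is best run as a separate program.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Beam: models fixed-window assignment
// under the assumption of epoch-aligned windows with zero offset.
class FixedWindowSketch {

  // Inclusive start of the window that contains the given event timestamp.
  static Instant windowStart(Instant timestamp, Duration size) {
    long ts = timestamp.toEpochMilli();
    long sizeMillis = size.toMillis();
    // floorMod keeps the result correct for timestamps before the epoch.
    return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
  }

  // Exclusive end of the window that contains the given event timestamp.
  static Instant windowEnd(Instant timestamp, Duration size) {
    return windowStart(timestamp, size).plus(size);
  }

  public static void main(String[] args) {
    Duration oneDay = Duration.ofDays(1);
    Instant eventTime = Instant.parse("2022-12-11T10:15:30Z");
    System.out.println(windowStart(eventTime, oneDay)); // 2022-12-11T00:00:00Z
    System.out.println(windowEnd(eventTime, oneDay));   // 2022-12-12T00:00:00Z
  }
}
```

Under these assumptions, an element stamped with `Instant.now()` lands in the window for the current day, regardless of any date stored inside the element itself.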

In [75]:
public class LoggingDoFn<T> extends DoFn<T, T>  {
  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out, ProcessContext context) {
    System.out.println(element + " ts: " + context.timestamp() + ", pane: " + context.pane());
    out.output(element);
  }
} // End of LoggingDoFn

public class Sale implements Serializable {
  private static final Schema saleSchema = Schema
    .builder()
    .addStringField("item")
    .addDoubleField("amount")
    .addDateTimeField("time")
    .build();
  private String item;
  private Double amount;
  private Instant time;
  
  public Sale(String item, Double amount, Instant time) {
    this.item = item;
    this.amount = amount;
    this.time = time;
  }
  
  public String getItem() { return item; }
  public Double getAmount() { return amount; }
  public Instant getTime() { return time; }
  public String toString() {
    return "item: " + item + ", amount: " + amount + ", time: " + time;
  }
  public Row getRow() {
    Row row = Row.withSchema(saleSchema)
      .withFieldValue("item", item)
      .withFieldValue("amount", amount)
      .withFieldValue("time", time)
      .build();
    return row;
  }
  
  public static final Schema getSchema() {
    return saleSchema;
  }
}


var pipeline = Pipeline.create(options);
pipeline
  .apply(TestStream.create(SerializableCoder.of(Sale.class))
    .addElements(TimestampedValue.of(new Sale("blue", 10.0, Instant.parse("2022-12-11")), Instant.now()))
    .advanceWatermarkToInfinity())
  /*
  .apply("Create elements", Create.of(
    new Sale("blue", 10.0, Instant.parse("2022-12-11")).getRow(),
    new Sale("green", 12.0, Instant.parse("2022-12-11")).getRow(),    
    new Sale("red", 15.0, Instant.parse("2022-12-12")).getRow(),
    new Sale("yellow", 14.0, Instant.parse("2022-12-13")).getRow()
  ))
  */
  .apply("Window", Window
    .<Sale>into(FixedWindows.of(Duration.standardDays(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes()
  )  
  .apply("Print elements", ParDo.of(new LoggingDoFn<>()));
  
pipeline.run().waitUntilFinish();

EvalException: Default values are not supported in Combine.globally() if the input PCollection is not windowed by GlobalWindows. Instead, use Combine.globally().withoutDefaults() to output an empty PCollection if the input PCollection is empty, or Combine.globally().asSingletonView() to get the default output of the CombineFn if the input PCollection is empty.