# Apache Beam - Combine

At the highest level, the Combine transform takes the values in a PCollection and *combines* them together to produce a single value.  Some common examples:

* Summation - Imagine a PCollection of \[1,2,3,4\]. If we sum the values, we end up with 1+2+3+4 = 10.
* Average - Imagine a PCollection of \[1,2,3,4\]. If we average the values, we end up with (1+2+3+4)/4 = 2.5.

Generically, we are looking for some function that can combine the elements in a PCollection.   If we were in a simple programming language on a single machine, we might be tempted to code:

```
let sum = 0
for each element in PCollection
  sum = sum + element
return sum
```

However, this is Beam, and we may be processing millions of elements, so we want to take advantage of parallelism.  For this, Beam provides a concept called the `CombineFn`.  The `CombineFn` requires that we implement four methods:

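* `public AccumT createAccumulator()` - Return a new, empty accumulator
* `public AccumT addInput(AccumT mutableAccumulator, InputT input)` - Add one input element to an accumulator
* `public AccumT mergeAccumulators(Iterable<AccumT> accumulators)` - Merge several accumulators into one
* `public OutputT extractOutput(AccumT accumulator)` - Convert the final accumulator into the result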
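To make the four methods concrete, here is a minimal plain-Java sketch (no Beam dependency) of how a runner might call them to compute the average from the earlier example. The `MiniCombineFn` interface, the `AverageFn` class, and the `combine` helper that walks over pre-split bundles are hypothetical names used only for illustration; they mirror the shape of Beam's `CombineFn` but are not part of the SDK, and a real runner decides for itself how to split and merge the work.

```java
import java.util.ArrayList;
import java.util.List;

class CombineLifecycleSketch {
  // Simulates a runner: one accumulator per bundle, then merge, then extract.
  static <I, A, O> O combine(MiniCombineFn<I, A, O> fn, List<List<I>> bundles) {
    List<A> partials = new ArrayList<>();
    for (List<I> bundle : bundles) {
      A acc = fn.createAccumulator();
      for (I element : bundle) {
        acc = fn.addInput(acc, element);
      }
      partials.add(acc);
    }
    return fn.extractOutput(fn.mergeAccumulators(partials));
  }

  public static void main(String[] args) {
    // Two bundles, as if two workers each processed half of [1,2,3,4].
    System.out.println(combine(new AverageFn(), List.of(List.of(1, 2), List.of(3, 4))));
  }
}

// Hypothetical stand-in that mirrors the four CombineFn methods; not Beam's class.
interface MiniCombineFn<InputT, AccumT, OutputT> {
  AccumT createAccumulator();
  AccumT addInput(AccumT accumulator, InputT input);
  AccumT mergeAccumulators(Iterable<AccumT> accumulators);
  OutputT extractOutput(AccumT accumulator);
}

// An average needs an accumulator richer than its output: {sum, count}.
class AverageFn implements MiniCombineFn<Integer, long[], Double> {
  public long[] createAccumulator() {
    return new long[] {0, 0};
  }

  public long[] addInput(long[] acc, Integer input) {
    acc[0] += input; // running sum
    acc[1]++;        // running count
    return acc;
  }

  public long[] mergeAccumulators(Iterable<long[]> accs) {
    long[] merged = createAccumulator();
    for (long[] a : accs) {
      merged[0] += a[0];
      merged[1] += a[1];
    }
    return merged;
  }

  public Double extractOutput(long[] acc) {
    return acc[1] == 0 ? Double.NaN : (double) acc[0] / acc[1];
  }
}
```

Running `main` prints `2.5`. The result is the same however the four elements are split into bundles, because the accumulator carries both the sum and the count rather than a partial average.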

This works only if the combination we are applying is associative and commutative, because Beam may combine the elements in any order and in any grouping.  Associative means that `function(function(a,b), c) === function(a, function(b,c))`, and commutative means that `function(a,b) === function(b,a)`.

We can see that addition conforms:

`(1+2)+3 = 1+(2+3)`

and

`1+2 = 2+1`

However, subtraction does not; for example, it is not commutative:

`1-2 != 2-1`
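The two properties can be spot-checked in plain Java. The sketch below is only an illustration: the class and method names (`CombinePropertiesSketch`, `isAssociative`, `isCommutative`) are made up for this example, and checking one set of values can disprove a property but cannot prove it in general.

```java
import java.util.function.IntBinaryOperator;

class CombinePropertiesSketch {
  // True if f(f(a,b),c) == f(a,f(b,c)) for these particular values.
  static boolean isAssociative(IntBinaryOperator f, int a, int b, int c) {
    return f.applyAsInt(f.applyAsInt(a, b), c) == f.applyAsInt(a, f.applyAsInt(b, c));
  }

  // True if f(a,b) == f(b,a) for these particular values.
  static boolean isCommutative(IntBinaryOperator f, int a, int b) {
    return f.applyAsInt(a, b) == f.applyAsInt(b, a);
  }

  public static void main(String[] args) {
    IntBinaryOperator add = (x, y) -> x + y;
    IntBinaryOperator subtract = (x, y) -> x - y;

    System.out.println(isAssociative(add, 1, 2, 3));      // true
    System.out.println(isCommutative(add, 1, 2));         // true
    System.out.println(isAssociative(subtract, 1, 2, 3)); // false: (1-2)-3 = -4, 1-(2-3) = 2
    System.out.println(isCommutative(subtract, 1, 2));    // false: 1-2 = -1, 2-1 = 1
  }
}
```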

When we wish to combine a PCollection, we have some variants:

* Combine Globally - Combine **all** the elements in a PCollection
* Combine Per Key - Combine all the elements associated with a key for each key
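* Combine Values - Combine values that have already been grouped by key, such as the `KV<K, Iterable<V>>` output of a `GroupByKey` (see `Combine.groupedValues`)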
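The difference between combining globally and combining per key can be sketched in plain Java with streams. This is not Beam code: the class `CombineVariantsSketch`, its methods, and the sample key/value pairs are assumptions made up for illustration, with `Map.Entry` standing in for Beam's `KV`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class CombineVariantsSketch {
  // "Globally": every value contributes to one single result.
  static int sumGlobally(List<Map.Entry<String, Integer>> pairs) {
    return pairs.stream().mapToInt(Map.Entry::getValue).sum();
  }

  // "Per key": one result for each distinct key.
  static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> pairs) {
    return pairs.stream()
        .collect(Collectors.groupingBy(
            Map.Entry::getKey, TreeMap::new, Collectors.summingInt(Map.Entry::getValue)));
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> pairs =
        List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));

    System.out.println(sumGlobally(pairs)); // 6
    System.out.println(sumPerKey(pairs));   // {a=4, b=2}
  }
}
```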


* [JavaDoc: Class Combine](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/Combine.html)
* [JavaDoc: Class Combine.CombineFn](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/Combine.CombineFn.html)
* [Combine](https://beam.apache.org/documentation/transforms/java/aggregation/combine/)



First, we define the dependencies that we wish to load from the Maven repositories.

In [1]:
%%loadFromPOM

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.40.0</version>
</dependency>

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.40.0</version>
  <scope>runtime</scope>
</dependency>

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>2.0.6</version>
</dependency>

Next we define our imports required for execution.

In [2]:
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PDone;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.Filter;

String[] args = new String[] {};
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

Here is a full Combine example that performs a summation.  The sum of 3, 4 and 5 is 12.

In [12]:
public class LoggingDoFn<T> extends DoFn<T, T>  {
  @ProcessElement
  public void processElement(
    @Element T element,
    OutputReceiver<T> out) {
    System.out.println(element);
    out.output(element);
  }
}

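// Simple CombineFn implementation that sums the elements of the input
// PCollection. The type parameters are <input, accumulator, output>.
public class SumFn extends CombineFn<Integer, Integer, Integer> {
  // Start each accumulator at 0, the identity value for addition.
  @Override
  public Integer createAccumulator() {
    return 0;
  }

  // Fold one input element into an accumulator.
  @Override
  public Integer addInput(Integer accum, Integer input) {
    return accum + input;
  }

  // Merge the partial sums that were built up in parallel.
  @Override
  public Integer mergeAccumulators(Iterable<Integer> accums) {
    int total = 0;
    for (Integer partial : accums) {
      total += partial;
    }
    return total;
  }

  // The accumulator already holds the final sum.
  @Override
  public Integer extractOutput(Integer accum) {
    return accum;
  }
} // End of SumFn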

var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5))
  .apply("Combine sum", Combine.globally(new SumFn()))
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

12


DONE

For simple combinations where the input and output types are the same, we can skip the full `CombineFn` and implement a [SerializableFunction](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/SerializableFunction.html) instead.  The function receives an `Iterable` of elements and returns their combined value.  Beam may call it on a subset of the elements and then again on the intermediate results, so this works only when partial results can be combined in the same way as the original elements, as is the case for a sum.

In [4]:
public class SumFn2 implements SerializableFunction<Iterable<Integer>, Integer> {
  @Override
  public Integer apply(Iterable<Integer> input) {
    int sum = 0;
    for (int item : input) {
      sum += item;
    }
    return sum;
  }
} // End of SumFn2

var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5))
  .apply("Combine sum", Combine.globally(new SumFn2()))
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

12


DONE
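Why the function must tolerate being applied to partial results can be seen in a plain-Java sketch. The class `PartialCombineSketch` and its `combineInBundles` helper are hypothetical and only imitate the idea of combining each bundle first and then combining the intermediate results; they are not how any particular runner is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PartialCombineSketch {
  // Applies fn to each bundle, then applies fn again to the partial results.
  static double combineInBundles(
      Function<Iterable<Double>, Double> fn, List<List<Double>> bundles) {
    List<Double> partials = new ArrayList<>();
    for (List<Double> bundle : bundles) {
      partials.add(fn.apply(bundle));
    }
    return fn.apply(partials);
  }

  static Double sum(Iterable<Double> values) {
    double total = 0;
    for (double v : values) {
      total += v;
    }
    return total;
  }

  static Double mean(Iterable<Double> values) {
    double total = 0;
    int count = 0;
    for (double v : values) {
      total += v;
      count++;
    }
    return total / count;
  }

  public static void main(String[] args) {
    List<List<Double>> bundles = List.of(List.of(3.0), List.of(4.0, 5.0));

    // A sum of partial sums is still the overall sum.
    System.out.println(combineInBundles(PartialCombineSketch::sum, bundles));  // 12.0
    // A mean of partial means is not the overall mean (which is 4.0).
    System.out.println(combineInBundles(PartialCombineSketch::mean, bundles)); // 3.75
  }
}
```

A mean therefore needs the full `CombineFn` form, where the accumulator can carry both a sum and a count.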

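Beam provides some pre-built combiners, including min, max and sum.  For example, `Sum.ofIntegers()` returns a ready-made combine function that we can pass to `Combine.globally`.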

In [5]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5))
  .apply("Combine sum", Combine.globally(Sum.ofIntegers()))
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

12


DONE

The [Sum](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/Sum.html) class offers an even shorter form: `Sum.integersGlobally()` returns a complete transform, so we do not need to wrap it in `Combine.globally` ourselves.

In [6]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5))
  .apply("Calculate Sum", Sum.integersGlobally())
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

12


DONE

## Count - Counting elements
We can count the number of items in a PCollection by using `Count.globally()`.

In [13]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5))
  .apply("Calculate count", Count.globally())
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

3


DONE

We can also count how many times each distinct value occurs by using `Count.perElement()`, which emits one key/value pair (element, count) per distinct value:

In [14]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5,3,4,8))
  .apply("Calculate Count per element", Count.perElement())
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

KV{4, 2}
KV{3, 2}
KV{5, 1}
KV{8, 1}


DONE
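As a cross-check of the counts above, the same per-element counting can be sketched in plain Java with streams. This is not Beam code; the class `CountPerElementSketch` is a made-up name, and a `TreeMap` is used here only so that the printed order is stable, whereas the pipeline output above arrives in no particular order.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class CountPerElementSketch {
  // Groups equal elements together and counts the size of each group.
  static Map<Integer, Long> countPerElement(List<Integer> elements) {
    return elements.stream()
        .collect(Collectors.groupingBy(
            Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    System.out.println(countPerElement(List.of(3, 4, 5, 3, 4, 8))); // {3=2, 4=2, 5=1, 8=1}
  }
}
```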

## Distinct - Unique elements
We can find distinct values using `Distinct`.  That transform takes an input PCollection and returns a new PCollection that contains only the distinct elements from the original PCollection.

In [15]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5,3,4,8))
  .apply("Calculate Distinct", Distinct.create())
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

5
4
8
3


DONE

## Filter - Filtering elements
We can filter a PCollection by applying a predicate function to each of its elements.  If the function returns true, the element is kept; otherwise, it is discarded.  We pass the predicate function to the `Filter` PTransform.

In [10]:
// Predicate that keeps only the even integers.
public class FilterFn implements SerializableFunction<Integer, Boolean> {
  @Override
  public Boolean apply(Integer input) {
    return input % 2 == 0;
  }
}
  
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(3,4,5,6,7,8))
  .apply("Filter even elements", Filter.by(new FilterFn()))
  .apply("Print elements", ParDo.of(new LoggingDoFn()));

pipeline.run().waitUntilFinish();

6
8
4


DONE
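Because the predicate is an ordinary function, its logic can be checked outside a pipeline. The plain-Java sketch below uses `java.util.function.Predicate` as a stand-in for the `SerializableFunction` above; the class `FilterPredicateSketch` and its members are hypothetical names for illustration only.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class FilterPredicateSketch {
  // Same test as the even-number predicate used in the pipeline above.
  static final Predicate<Integer> IS_EVEN = n -> n % 2 == 0;

  // Keeps the elements for which the predicate returns true, in input order.
  static List<Integer> keepMatching(List<Integer> elements, Predicate<Integer> predicate) {
    return elements.stream().filter(predicate).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(keepMatching(List.of(3, 4, 5, 6, 7, 8), IS_EVEN)); // [4, 6, 8]
  }
}
```

Unlike this list-based sketch, the pipeline makes no promise about output order, which is why the run above printed 6, 8, 4.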