# Apache Beam - CoGroupByKey

The `CoGroupByKey` transform performs a database-style join across a collection of PCollections. Each input PCollection must contain `KV` elements, and the join is performed on the keys of those elements, so all inputs must share the same key type. The input to a `CoGroupByKey` is a `KeyedPCollectionTuple`, which takes a bit of explanation. Consider two PCollections:

The first we call `emailPCollection` which is a `PCollection<KV<String, String>>` that contains `<name, emailAddress>` elements.

The second we call `agePCollection` which is a `PCollection<KV<String, Integer>>` that contains `<name, age>` elements.

We want to create a `KeyedPCollectionTuple` that contains both the `emailPCollection` and the `agePCollection`. To do so, we tag each input PCollection with a `TupleTag`, which later identifies that input's values in the result.

```
final TupleTag<String> emailsTag = new TupleTag<>();
final TupleTag<Integer> agesTag = new TupleTag<>();
```

Now we can create our KeyedPCollectionTuple:

```
KeyedPCollectionTuple<String> myKeyedPCollectionTuple =
  KeyedPCollectionTuple.of(emailsTag, emailPCollection)
    .and(agesTag, agePCollection);
```

The `CoGroupByKey` is a `PTransform` that takes as input a `KeyedPCollectionTuple<K>` and returns a `PCollection<KV<K, CoGbkResult>>`.

Now we have to understand the `CoGbkResult`. It holds one iterable per input PCollection, each associated with that input's tag. Given a `CoGbkResult` instance, calling `getAll(tag)` returns the iterable of values for that tag; the iterable is empty if the key had no values in that input.
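
To make the shape of the result concrete, here is a minimal sketch in plain Java collections. It does not use the Beam API: the `CoGroupSketch` class, its `Grouped` holder (a stand-in for a two-input `CoGbkResult`), and the sample ages are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoGroupSketch {

  /** Hypothetical stand-in for a two-input CoGbkResult: one list of values per input. */
  public static final class Grouped {
    public final List<String> emails = new ArrayList<>();
    public final List<Integer> ages = new ArrayList<>();
  }

  /**
   * Groups both inputs by key. A key that appears in only one input
   * still gets an entry, with an empty list for the other input.
   */
  public static Map<String, Grouped> coGroup(
      List<Map.Entry<String, String>> emails,
      List<Map.Entry<String, Integer>> ages) {
    // TreeMap only to make the printed order predictable in this sketch.
    Map<String, Grouped> result = new TreeMap<>();
    for (Map.Entry<String, String> e : emails) {
      result.computeIfAbsent(e.getKey(), k -> new Grouped()).emails.add(e.getValue());
    }
    for (Map.Entry<String, Integer> a : ages) {
      result.computeIfAbsent(a.getKey(), k -> new Grouped()).ages.add(a.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample data, not taken from the notebook.
    List<Map.Entry<String, String>> emails = List.of(
        Map.entry("amy", "amy@example.com"),
        Map.entry("carl", "carl@example.com"));
    List<Map.Entry<String, Integer>> ages = List.of(
        Map.entry("amy", 31),
        Map.entry("james", 45));

    coGroup(emails, ages).forEach((name, g) ->
        System.out.println(name + " -> emails=" + g.emails + ", ages=" + g.ages));
  }
}
```

In this model, reading `g.emails` or `g.ages` plays the role that `getAll(emailsTag)` or `getAll(agesTag)` plays on a real `CoGbkResult`.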


* [JavaDoc: Class CoGroupByKey](https://beam.apache.org/releases/javadoc/2.42.0/index.html?org/apache/beam/sdk/transforms/join/CoGroupByKey.html)
* [JavaDoc: Class TupleTag](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/values/TupleTag.html)
* [JavaDoc: Class KeyedPCollectionTuple](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/join/KeyedPCollectionTuple.html)
* [JavaDoc: Class CoGbkResult](https://beam.apache.org/releases/javadoc/2.42.0/org/apache/beam/sdk/transforms/join/CoGbkResult.html)
* [CoGroupByKey](https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/)


First, we define the dependencies that we wish to load from the Maven repositories.

In [1]:
%%loadFromPOM

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.40.0</version>
</dependency>

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.40.0</version>
  <scope>runtime</scope>
</dependency>

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>2.0.6</version>
</dependency>

Next we define our imports required for execution.

In [2]:
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PDone;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;

String args[] = new String[] {};
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

And now we perform a CoGroupByKey. Each output element pairs a key with the values from both inputs that share that key: first the emails, then the phones. A key that is missing from one input gets an empty list for that input.

In [4]:
final List<KV<String, String>> emailsList =
    Arrays.asList(
        KV.of("amy", "amy@example.com"),
        KV.of("carl", "carl@example.com"),
        KV.of("julia", "julia@example.com"),
        KV.of("carl", "carl@email.com"));

final List<KV<String, String>> phonesList =
    Arrays.asList(
        KV.of("amy", "111-222-3333"),
        KV.of("james", "222-333-4444"),
        KV.of("amy", "333-444-5555"),
        KV.of("carl", "444-555-6666"));

public class LoggingDoFn<T> extends DoFn<T, T>  {
  @ProcessElement
  public void processElement(
    @Element T element,
    OutputReceiver<T> out) {
    System.out.println(element);
    out.output(element);
  }
}

var pipeline = Pipeline.create(options);
PCollection<KV<String, String>> emails = pipeline.apply("CreateEmails", Create.of(emailsList));
PCollection<KV<String, String>> phones = pipeline.apply("CreatePhones", Create.of(phonesList));
final TupleTag<String> emailsTag = new TupleTag<>();
final TupleTag<String> phonesTag = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> results =
  KeyedPCollectionTuple.of(emailsTag, emails)
    .and(phonesTag, phones)
    .apply("Join on emails and phones", CoGroupByKey.create())
    .apply("Print elements", ParDo.of(new LoggingDoFn<>()));

pipeline.run().waitUntilFinish();

KV{julia, [[julia@example.com], []]}
KV{carl, [[carl@example.com, carl@email.com], [444-555-6666]]}
KV{amy, [[amy@example.com], [111-222-3333, 333-444-5555]]}
KV{james, [[], [222-333-4444]]}


DONE
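
The co-grouped output keeps every key, including keys present in only one input. If what you need is inner-join rows, one option is to pair every email with every phone for each key. The sketch below shows that step in plain Java, outside of Beam; the `InnerJoinSketch` class and its `innerJoin` helper are hypothetical names, and the lists are copied from the pipeline output above.

```java
import java.util.ArrayList;
import java.util.List;

public class InnerJoinSketch {

  /** Pairs every email with every phone for one key; empty if either side is empty. */
  public static List<String> innerJoin(String name, List<String> emails, List<String> phones) {
    List<String> rows = new ArrayList<>();
    for (String email : emails) {
      for (String phone : phones) {
        rows.add(name + ", " + email + ", " + phone);
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    // amy has one email and two phones, so two rows are produced.
    innerJoin("amy", List.of("amy@example.com"), List.of("111-222-3333", "333-444-5555"))
        .forEach(System.out::println);

    // julia has no phone, so an inner join produces no rows for her.
    innerJoin("julia", List.of("julia@example.com"), List.of())
        .forEach(System.out::println);
  }
}
```

Inside a pipeline, the same nested loop could sit in a `DoFn` that reads the two iterables with `getAll(emailsTag)` and `getAll(phonesTag)`.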