# Apache Beam - SQL

Apache Beam lets us express pipeline transforms as SQL queries through the `SqlTransform` class.


Since our notebook is going to use the Google Cloud SDK JARs, we must include them in our dependencies.  Specifically, we need to include:

```
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>2.43.0</version>
</dependency>
```

Normally we would load our dependencies using the IJava Jupyter cell magic called `%%loadFromPom`.  Unfortunately, this doesn't work ([issue](https://github.com/SpencerPark/IJava/issues/139)).  A workaround is to download the dependencies outside of Jupyter and then launch Jupyter with the downloaded dependencies in the classpath.

```
mvn dependency:copy-dependencies
export IJAVA_CLASSPATH="./target/dependency/*"
jupyter notebook
```

When we write SQL in Beam, a question arises: *which* SQL?  What syntax and structure does the dialect follow?  Beam provides two SQL dialects: Calcite SQL and ZetaSQL.  The default is Calcite.  We can change the dialect through the pipeline options ([BeamSqlPipelineOptions](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/extensions/sql/impl/BeamSqlPipelineOptions.html#setPlannerName-java.lang.String-)) by calling the `setPlannerName()` method.  Passing in `org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner` switches to ZetaSQL.
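As a minimal sketch of the planner switch described above, the helper below sets the planner name on `BeamSqlPipelineOptions`.  The class name `PlannerSelection` and the method `zetaSqlOptions` are hypothetical names chosen for illustration, and actually running ZetaSQL queries assumes the `beam-sdks-java-extensions-sql-zetasql` artifact is on the classpath.

```java
import org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical helper class, for illustration only.
public class PlannerSelection {
  // Builds pipeline options and selects the ZetaSQL planner
  // instead of the default Calcite planner.
  public static PipelineOptions zetaSqlOptions(String... args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.as(BeamSqlPipelineOptions.class)
      .setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");
    return options;
  }
}
```

Leaving the planner name unset keeps the default Calcite dialect, which is what the rest of this notebook uses.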

See also:
* [Beam SQL Overview](https://beam.apache.org/documentation/dsls/sql/overview/)
* [Exploring Beam SQL on Google Cloud Platform](https://servian.dev/exploring-beam-sql-on-google-cloud-platform-b6c77f9b4af4)
* [Data processing with Dataflow SQL (part 1/2)](https://medium.com/syntio/data-processing-with-dataflow-sql-part-1-2-fe57e47f4bb0)
* [Data processing with Dataflow SQL (part 2/2)](https://medium.com/syntio/data-processing-with-dataflow-sql-part-2-2-3f1d507b6297)
* [JavaDoc: Row](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/values/Row.html)
* [JavaDoc: Schema](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/schemas/Schema.html)
* [JavaDoc: SqlTransform](https://beam.apache.org/releases/javadoc/2.43.0/org/apache/beam/sdk/extensions/sql/SqlTransform.html)

Next, we import the classes we need and create the pipeline options.

In [1]:
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.joda.time.Instant;

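// tempLocation points at a GCS bucket for temporary files; substitute your own bucket.
String[] args = new String[] {"--tempLocation=gs://kolban-tmp"};
var options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

// Don't block inside run() on the direct runner; we call waitUntilFinish() explicitly.
options.as(DirectOptions.class).setBlockOnRun(false);
// Uncomment to switch the SQL planner from the default (Calcite) to ZetaSQL:
//options.as(org.apache.beam.sdk.extensions.sql.impl.BeamSqlPipelineOptions.class).setPlannerName("org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner");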

Before we can talk about Beam SQL, we need to understand the `Row` data type.  A Row is an element in a PCollection that represents the data that the SQL will work upon.  Consider a logical table called `Sale` that contains:

* item: String - What was sold
* amount: Double - How much was it sold for
* time: Instant - When was it sold

To create a Row that represents an instance of a `Sale`, we must first start by describing the *schema*.

In [2]:
Schema saleSchema = Schema
  .builder()
  .addStringField("item")
  .addDoubleField("amount")
  .addDateTimeField("time")
  .build();

Besides composite types (ARRAY, ITERABLE, MAP, ROW) and logical types, a field can have one of these primitive types:

* BOOLEAN
* BYTE
* BYTES
* DATETIME
* DECIMAL
* DOUBLE
* FLOAT
* INT16
* INT32
* INT64
* STRING
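Each of these types has a matching method on the schema builder.  The sketch below builds a hypothetical `Order` schema (the class, method, and field names are invented for illustration) that also uses `addNullableField` for an optional value and `addArrayField` for a repeated one.  Fields added with the plain `addXxxField` methods are non-nullable.

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// Hypothetical example class, for illustration only.
public class SchemaFieldTypes {
  public static Schema orderSchema() {
    return Schema.builder()
      .addInt64Field("orderId")                      // INT64
      .addStringField("customer")                    // STRING
      .addBooleanField("paid")                       // BOOLEAN
      .addDecimalField("total")                      // DECIMAL
      .addNullableField("coupon", FieldType.STRING)  // STRING that may be null
      .addArrayField("items", FieldType.STRING)      // ARRAY of STRING
      .build();
  }
}
```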

Now that we have a schema, we can create an instance of a Row.

In [3]:
Row row = Row.withSchema(saleSchema)
  .withFieldValue("item", "blue")
  .withFieldValue("amount", 10.12)
  .withFieldValue("time", Instant.now())
  .build();
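Reading values back out of a Row uses typed getters keyed by field name.  The following sketch repeats the sale schema so that it is self-contained; the class `RowAccess` and its methods are hypothetical names.  It also shows `addValues`, which supplies the values positionally in schema order as an alternative to `withFieldValue`.

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Instant;

// Hypothetical example class, for illustration only.
public class RowAccess {
  public static final Schema SALE_SCHEMA = Schema.builder()
    .addStringField("item")
    .addDoubleField("amount")
    .addDateTimeField("time")
    .build();

  // Builds a Row by position: values must follow the schema's field order.
  public static Row sampleRow() {
    return Row.withSchema(SALE_SCHEMA)
      .addValues("blue", 10.12, Instant.parse("2022-12-11T00:00:00Z"))
      .build();
  }

  // Reads each field back with the getter that matches its schema type.
  public static String describe(Row row) {
    String item = row.getString("item");
    Double amount = row.getDouble("amount");
    long millis = row.getDateTime("time").getMillis();
    return item + " sold for " + amount + " at epoch millis " + millis;
  }
}
```

Calling a getter whose type does not match the field's schema type fails at runtime, so the getter must agree with the schema definition.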

Let's put it together.  Here is a POJO called `Sale` that can produce a Row from an instance.

In [4]:
public class Sale implements java.io.Serializable {
  private static final Schema saleSchema = Schema
    .builder()
    .addStringField("item")
    .addDoubleField("amount")
    .addDateTimeField("time")
    .build();
  private String item;
  private Double amount;
  private Instant time;
  
  public Sale(String item, Double amount, Instant time) {
    this.item = item;
    this.amount = amount;
    this.time = time;
  }
  
  public String getItem() { return item; }
  public Double getAmount() { return amount; }
  public Instant getTime() { return time; }
  @Override
  public String toString() {
    return "item: " + item + ", amount: " + amount + ", time: " + time;
  }
  // Convert this Sale into a Beam Row that conforms to saleSchema.
  public Row getRow() {
    return Row.withSchema(saleSchema)
      .withFieldValue("item", item)
      .withFieldValue("amount", amount)
      .withFieldValue("time", time)
      .build();
  }
  
  public static final Schema getSchema() {
    return saleSchema;
  }
}

Now we can create a PCollection of Row objects to work with.  The `LoggingDoFn` below prints each element and passes it through unchanged:

In [5]:
public class LoggingDoFn<T> extends DoFn<T, T>  {
  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out) {
    System.out.println(element);
    out.output(element);
  }
} // LoggingDoFn

var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(
    new Sale("blue", 10.0, Instant.parse("2022-12-11")).getRow(),
    new Sale("green", 12.0, Instant.parse("2022-12-11")).getRow(),    
    new Sale("red", 15.0, Instant.parse("2022-12-12")).getRow(),
    new Sale("yellow", 14.0, Instant.parse("2022-12-13")).getRow()
  ))
  .apply("Print elements", ParDo.of(new LoggingDoFn<>()));
  
pipeline.run().waitUntilFinish();

Row: 
item:red
amount:15.0
time:2022-12-12T00:00:00.000Z

Row: 
item:green
amount:12.0
time:2022-12-11T00:00:00.000Z

Row: 
item:yellow
amount:14.0
time:2022-12-13T00:00:00.000Z

Row: 
item:blue
amount:10.0
time:2022-12-11T00:00:00.000Z



DONE

Now we can invoke a SQL transform.  Notice the call to `setRowSchema()`: we must associate a schema with the PCollection before we can run SQL against it.  Within the query, the single input PCollection is referred to by the table name `PCOLLECTION`.

In [7]:
var pipeline = Pipeline.create(options);
pipeline
  .apply("Create elements", Create.of(
    new Sale("blue", 10.0, Instant.parse("2022-12-11")).getRow(),
    new Sale("green", 12.0, Instant.parse("2022-12-11")).getRow(),    
    new Sale("red", 15.0, Instant.parse("2022-12-12")).getRow(),
    new Sale("yellow", 14.0, Instant.parse("2022-12-13")).getRow()
  ))
  .setRowSchema(Sale.getSchema())
  .apply("SQL", SqlTransform.query("SELECT SUM(amount) as total_sum FROM PCOLLECTION"))
  .apply("Print elements", ParDo.of(new LoggingDoFn<>()));
  
pipeline.run().waitUntilFinish();

Row: 
total_sum:51.0



DONE
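The same pattern extends to filtering and grouping.  The sketch below, which assumes the default Calcite dialect, totals the amounts per item after discarding small sales.  The class `SalesByItem`, the helper `sale`, and the trimmed two-field schema are hypothetical and exist only to keep the example short.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Hypothetical example class, for illustration only.
public class SalesByItem {
  public static final Schema SALE_SCHEMA = Schema.builder()
    .addStringField("item")
    .addDoubleField("amount")
    .build();

  private static Row sale(String item, double amount) {
    return Row.withSchema(SALE_SCHEMA).addValues(item, amount).build();
  }

  // Adds the source and the SQL step to the pipeline and returns the query result.
  public static PCollection<Row> totals(Pipeline pipeline) {
    return pipeline
      .apply("Create sales", Create.of(
        sale("blue", 10.0), sale("blue", 5.0), sale("red", 15.0), sale("green", 2.0)))
      .setRowSchema(SALE_SCHEMA)
      // WHERE filters rows before GROUP BY aggregates them, so the
      // "green" sale (amount 2.0) never reaches the SUM.
      .apply("Totals per item", SqlTransform.query(
        "SELECT item, SUM(amount) AS total FROM PCOLLECTION "
          + "WHERE amount > 3 GROUP BY item"));
  }
}
```

The returned PCollection carries its own schema with the fields `item` and `total`, so it can be fed straight into another `SqlTransform` or into the `LoggingDoFn` used earlier before calling `pipeline.run()`.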

## Dataflow SQL
Dataflow SQL supports Cloud Storage (GCS), Pub/Sub, and BigQuery as sources, and Pub/Sub and BigQuery as sinks.