<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Implementing-SQL-Operations:-Aggregates" data-toc-modified-id="Implementing-SQL-Operations:-Aggregates-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Implementing SQL Operations: Aggregates</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Prerequisites" data-toc-modified-id="Prerequisites-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Prerequisites</a></span></li><li><span><a href="#Initialization" data-toc-modified-id="Initialization-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Initialization</a></span><ul class="toc-item"><li><span><a href="#Ensure-database-is-running" data-toc-modified-id="Ensure-database-is-running-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Ensure database is running</a></span></li><li><span><a href="#Download-and-install-additional-components." 
data-toc-modified-id="Download-and-install-additional-components.-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Download and install additional components.</a></span></li></ul></li><li><span><a href="#Connect-to-database-and-populate-test-data" data-toc-modified-id="Connect-to-database-and-populate-test-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Connect to database and populate test data</a></span></li><li><span><a href="#Create-a-secondary-index" data-toc-modified-id="Create-a-secondary-index-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Create a secondary index</a></span></li></ul></li><li><span><a href="#Execution-model" data-toc-modified-id="Execution-model-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Execution model</a></span></li><li><span><a href="#Simple-Map-Reduce-Aggregates" data-toc-modified-id="Simple-Map-Reduce-Aggregates-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Simple Map-Reduce Aggregates</a></span><ul class="toc-item"><li><span><a href="#Create-User-Defined-Function-(UDF)" data-toc-modified-id="Create-User-Defined-Function-(UDF)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create User Defined Function (UDF)</a></span></li><li><span><a href="#Register-UDF" data-toc-modified-id="Register-UDF-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Register UDF</a></span></li><li><span><a href="#Execute-UDF" data-toc-modified-id="Execute-UDF-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Execute UDF</a></span></li></ul></li><li><span><a href="#Using-Aggregate-Operator" data-toc-modified-id="Using-Aggregate-Operator-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Using Aggregate Operator</a></span></li><li><span><a href="#Stream-Partitioning-with-GROUP-BY" data-toc-modified-id="Stream-Partitioning-with-GROUP-BY-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Stream Partitioning with GROUP BY</a></span><ul class="toc-item"><li><span><a href="#Filtering-Partitions:-HAVING" 
data-toc-modified-id="Filtering-Partitions:-HAVING-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Filtering Partitions: HAVING</a></span></li><li><span><a href="#Sorting-Partitions:-ORDER-BY" data-toc-modified-id="Sorting-Partitions:-ORDER-BY-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Sorting Partitions: ORDER BY</a></span></li></ul></li><li><span><a href="#More-Aggregates:-Distinct-and-Top-N" data-toc-modified-id="More-Aggregates:-Distinct-and-Top-N-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>More Aggregates: Distinct and Top N</a></span></li><li><span><a href="#Takeaways-and-Conclusion" data-toc-modified-id="Takeaways-and-Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Takeaways and Conclusion</a></span></li><li><span><a href="#Clean-up" data-toc-modified-id="Clean-up-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Clean up</a></span></li><li><span><a href="#Further-Exploration-and-Resources" data-toc-modified-id="Further-Exploration-and-Resources-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Further Exploration and Resources</a></span><ul class="toc-item"><li><span><a href="#Next-steps" data-toc-modified-id="Next-steps-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Next steps</a></span></li></ul></li></ul></div>

# Implementing SQL Operations: Aggregates
This tutorial describes how to implement SQL aggregate queries in Aerospike.

This notebook requires the Aerospike database running on localhost. The code cells use a Java kernel (IJava) and the Aerospike Java client, which is installed in the Initialization section below. Visit [Aerospike notebooks repo](https://github.com/aerospike-examples/interactive-notebooks) for additional details and the docker container.

## Introduction
In this notebook, we will see how specific aggregate statements in SQL can be implemented in Aerospike. 

SQL is a widely known data access language. If you have used SQL, the examples in this notebook should provide a pattern for implementing specific SQL aggregate queries. 

This notebook is the second in the SQL Operations series that consists of the following notebooks:
- Implementing SQL Operations: SELECT
- Implementing SQL Operations: Aggregates (this notebook)
- Implementing SQL Operations: UPDATE, CREATE, and DELETE

The specific topics and aggregate statements we discuss include:
- Execution model
    - Stream processing using UDFs
    - Four types of operators: Filter, Map, Aggregate, and Reduce
    - Two phases of reduce: on server nodes and on the client
- Simple Map-Reduce Aggregations:
    - MIN
    - MAX
    - AVERAGE
    - SUM
    - COUNT
- Using Aggregate Operator
    - AVERAGE
- Stream Partitioning with GROUP BY
    - Filtering partitions: HAVING
    - Sorting partitions: ORDER BY
- Other aggregates
    - DISTINCT
    - TOP N

The purpose of this notebook is to illustrate the Aerospike implementation of specific SQL operations. Check out [Aerospike Presto Connector](https://www.aerospike.com/docs/connect/access/presto/index.html) for ad-hoc SQL access to Aerospike data.

## Prerequisites
This tutorial assumes familiarity with the following topics:
- [Hello World](hello_world.ipynb)
- [Implementing SQL Operations: SELECT](sql_select.ipynb)

## Initialization

### Ensure database is running
This notebook requires that the Aerospike database is running.

In [1]:
import io.github.spencerpark.ijava.IJava;
import io.github.spencerpark.jupyter.kernel.magic.common.Shell;
IJava.getKernelInstance().getMagics().registerMagics(Shell.class);
%sh asd

### Download and install additional components.
Install the Java client.

In [2]:
%%loadFromPOM
<dependencies>
  <dependency>
    <groupId>com.aerospike</groupId>
    <artifactId>aerospike-client</artifactId>
    <version>5.0.0</version>
  </dependency>
</dependencies>

## Connect to database and populate test data
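The test data has ten records with user keys "id-1" through "id-10" and two integer bins (fields), "bin1" and "bin2", in the namespace "test", in both the set "sql-select-small" and the null set. The set "sql-select-large" has 1000 similarly structured records, each with an additional string bin "bin3" that holds a randomly assigned group label from "A" through "E".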

In [3]:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;
import java.util.Random;

// Group labels for bin3; the fixed seed makes the random assignment repeatable.
String[] groups = {"A", "B", "C", "D", "E"};
Random rand = new Random(1);

AerospikeClient client = new AerospikeClient("localhost", 3000);
System.out.println("Initialized the client and connected to the cluster.");

String Namespace = "test";
String SmallSet = "sql-select-small";
String LargeSet = "sql-select-large";
String NullSet = "";

WritePolicy wpolicy = new WritePolicy();
wpolicy.sendKey = true; // Store the user key with each record.

// Ten records in the small set: bin1 = i, bin2 = 1000 + i.
for (int i = 1; i <= 10; i++) {
    Key key = new Key(Namespace, SmallSet, "id-" + i);
    Bin bin1 = new Bin("bin1", i);
    Bin bin2 = new Bin("bin2", 1000 + i);
    client.put(wpolicy, key, bin1, bin2);
}
// The same ten records in the null set.
for (int i = 1; i <= 10; i++) {
    Key key = new Key(Namespace, NullSet, "id-" + i);
    Bin bin1 = new Bin("bin1", i);
    Bin bin2 = new Bin("bin2", 1000 + i);
    client.put(wpolicy, key, bin1, bin2);
}
// 1000 records in the large set, each with a random group label in bin3.
for (int i = 1; i <= 1000; i++) {
    Key key = new Key(Namespace, LargeSet, "id-" + i);
    Bin bin1 = new Bin("bin1", i);
    Bin bin2 = new Bin("bin2", 1000 + i);
    Bin bin3 = new Bin("bin3", groups[rand.nextInt(groups.length)]);
    client.put(wpolicy, key, bin1, bin2, bin3);
}

System.out.println("Test data populated.");

Initialized the client and connected to the cluster.
Test data populated.

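## Create a secondary index
The cell below creates a numeric secondary index on bin "bin1" in the set "sql-select-small", so that queries can use an index predicate on that bin.

In [4]: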
import com.aerospike.client.policy.Policy;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.task.IndexTask;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.ResultCode;

String IndexName = "test_small_bin1_number_idx";

Policy policy = new Policy();
policy.socketTimeout = 0; // Do not timeout on index create.

try {
    IndexTask task = client.createIndex(policy, Namespace, SmallSet, IndexName, 
                                        "bin1", IndexType.NUMERIC);
    task.waitTillComplete();
}
catch (AerospikeException ae) {
    if (ae.getResultCode() != ResultCode.INDEX_ALREADY_EXISTS) {
        throw ae;
    }
}

System.out.format("Created index %s on ns=%s set=%s bin=%s.", 
                                    IndexName, Namespace, SmallSet, "bin1");;

Created index test_small_bin1_number_idx on ns=test set=sql-select-small bin=bin1.

# Execution model
Key points
- Stream processing using UDFs.
    - Stream UDFs must be used with the query operation. A stream function specifies the pipeline of operators that processes all records returned by the query.
- Four types of operators: Filter, Map, Aggregate, and Reduce
    - The operators can be specified in any order and fall into four categories: filter (object -> boolean), map (object -> object), aggregate ((state, object) -> new state), and reduce ((value1, value2) -> value).
- Two phases of reduce: on server nodes and on the client
    - The server nodes execute all operators up to and including the first reduce operation in the pipeline. The client then processes the node results with the remaining operators, starting with that same first reduce operation. Thus the first reduce operation, if the pipeline specifies one, is executed on every server node as well as on the client.
- Post-aggregation processing.
    - Post-aggregation processing for sorting and filtering must happen on the client side, typically with a map operator.
    
To be confirmed:
- Every server node processes the pipeline only up to and including the first reduce.
- The client processes the rest of the pipeline, starting with and including the first reduce.
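
The two phases of reduce can be illustrated with a small local simulation. This is only a sketch in plain Java under stated assumptions: it does not call the Aerospike client or run a UDF, the class and method names (`TwoPhaseReduceSketch`, `runOnNode`, `runOnClient`) are hypothetical, and each inner list stands in for the records held by one server node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Local simulation only: no Aerospike client or server is involved.
public class TwoPhaseReduceSketch {

    // Server phase: one node runs filter -> map -> reduce over its own records.
    // The result is empty when no record on the node passes the filter.
    static Optional<Long> runOnNode(List<Long> nodeRecords, Predicate<Long> filter,
                                    Function<Long, Long> map, BinaryOperator<Long> reduce) {
        return nodeRecords.stream().filter(filter).map(map).reduce(reduce);
    }

    // Client phase: the same reduce is applied again, now to the per-node partial results.
    static Optional<Long> runOnClient(List<Optional<Long>> nodeResults,
                                      BinaryOperator<Long> reduce) {
        return nodeResults.stream()
                .filter(Optional::isPresent)
                .map(Optional::get)
                .reduce(reduce);
    }

    // Simulates SELECT SUM(v) ... WHERE v > threshold with records spread across nodes.
    static long sumWhereGreaterThan(List<List<Long>> recordsByNode, long threshold) {
        Predicate<Long> where = v -> v > threshold;
        Function<Long, Long> identity = v -> v;
        BinaryOperator<Long> sum = Long::sum;

        List<Optional<Long>> partials = new ArrayList<>();
        for (List<Long> nodeRecords : recordsByNode) {
            partials.add(runOnNode(nodeRecords, where, identity, sum));
        }
        return runOnClient(partials, sum).orElse(0L);
    }

    public static void main(String[] args) {
        List<List<Long>> recordsByNode = Arrays.asList(
                Arrays.asList(1L, 2L, 3L),
                Arrays.asList(4L, 5L, 6L),
                Arrays.asList(7L, 8L, 9L, 10L));
        // Node partials are empty, 15, and 34; the client reduce adds them up.
        System.out.println(sumWhereGreaterThan(recordsByNode, 3)); // prints 49
    }
}
```

Because the same reduce function runs in both phases, it has to accept its own output as input, which is why sum, min, and max fit this model directly.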

# Simple Map-Reduce Aggregates
`SELECT aggregate(col) FROM namespace.set WHERE condition`

Examples:
- MIN
- MAX
- AVERAGE
- SUM
- COUNT

`void Statement::setAggregateFunction(String udfModule, String udfFunction, Value... udfArgs)`

`ResultSet rs = AerospikeClient::queryAggregate(QueryPolicy policy, Statement stmt);`
 
Key points
- Simple aggregates requiring a numeric state can be implemented using map and reduce.
- The WHERE clause must be implemented using either query's index predicate or UDF's stream filter. 

To be confirmed:
- Whether expression filters are ignored in aggregate queries.
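
As a rough illustration of the map and reduce pattern for simple aggregates, the sketch below computes SUM, COUNT, MIN, and MAX in plain Java. It is a local simulation, not the Aerospike API: the `Rec` class and all method names are hypothetical, the WHERE clause is modeled as a stream filter, and the records mimic the "sql-select-small" test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Local simulation only: no Aerospike client or server is involved.
public class SimpleAggregatesSketch {

    // Hypothetical stand-in for a record with the two integer bins of the test data.
    static final class Rec {
        final long bin1;
        final long bin2;
        Rec(long bin1, long bin2) { this.bin1 = bin1; this.bin2 = bin2; }
    }

    // Generic pipeline: filter (WHERE) -> map (pick a number) -> reduce (combine two numbers).
    static Optional<Long> aggregate(List<Rec> records, Predicate<Rec> where,
                                    Function<Rec, Long> map, BinaryOperator<Long> reduce) {
        return records.stream().filter(where).map(map).reduce(reduce);
    }

    // SUM(bin2): map each record to bin2, reduce by addition.
    static Optional<Long> sumBin2(List<Rec> records, Predicate<Rec> where) {
        return aggregate(records, where, r -> r.bin2, Long::sum);
    }

    // COUNT(*): map each record to 1, reduce by addition.
    static Optional<Long> countAll(List<Rec> records, Predicate<Rec> where) {
        return aggregate(records, where, r -> 1L, Long::sum);
    }

    // MIN(bin2): map each record to bin2, reduce by keeping the smaller value.
    static Optional<Long> minBin2(List<Rec> records, Predicate<Rec> where) {
        return aggregate(records, where, r -> r.bin2, Long::min);
    }

    // MAX(bin2): map each record to bin2, reduce by keeping the larger value.
    static Optional<Long> maxBin2(List<Rec> records, Predicate<Rec> where) {
        return aggregate(records, where, r -> r.bin2, Long::max);
    }

    // Ten records shaped like the "sql-select-small" set: bin1 = i, bin2 = 1000 + i.
    static List<Rec> smallSet() {
        List<Rec> records = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            records.add(new Rec(i, 1000 + i));
        }
        return records;
    }
}
```

For example, with the filter `r -> r.bin1 >= 4 && r.bin1 <= 7`, `sumBin2` yields 4022 and `countAll` yields 4. When no record passes the filter, every method returns an empty `Optional`, so a caller that wants COUNT to be 0 in that case has to supply the default.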

# Using Aggregate Operator
Example:
- AVERAGE

Key points
- Computing the average requires keeping track of two quantities as records are processed in a stream: the running sum and the count. The average is computed at the end from the final sum and count.
- We need the aggregate operator, which can hold more complex state, for example a map with "sum" and "count" values that is updated as the stream is processed.
- The reduce function merges two partial stream aggregates into one by adding their "sum" and "count" values. The final phase of reduce happens on the client to arrive at the final sum and count.
- A map operator can then take the aggregate (map) as input and output the average value.
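
The aggregate, reduce, and final map steps for AVERAGE can be sketched as follows. This is a plain Java simulation under stated assumptions rather than the UDF itself: the class and method names are hypothetical, each inner list stands in for the values seen by one server node, and the per-node reduce is omitted because each simulated node produces a single state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local simulation only: no Aerospike client or server is involved.
public class AverageAggregateSketch {

    // aggregate operator: (state, value) -> new state, where the state is {"sum", "count"}.
    static Map<String, Long> aggregate(Map<String, Long> state, long value) {
        state.merge("sum", value, Long::sum);
        state.merge("count", 1L, Long::sum);
        return state;
    }

    // reduce operator: merges two partial states by adding their "sum" and "count" values.
    static Map<String, Long> reduce(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<>(a);
        b.forEach((key, value) -> merged.merge(key, value, Long::sum));
        return merged;
    }

    // Final map operator: turns the merged state into the average (NaN for an empty stream).
    static double toAverage(Map<String, Long> state) {
        long sum = state.getOrDefault("sum", 0L);
        long count = state.getOrDefault("count", 0L);
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    // Runs aggregate once per simulated node, then merges the partial states as the client would.
    static double average(List<List<Long>> valuesByNode) {
        Map<String, Long> total = new HashMap<>();
        for (List<Long> nodeValues : valuesByNode) {
            Map<String, Long> partial = new HashMap<>();
            for (long value : nodeValues) {
                partial = aggregate(partial, value);
            }
            total = reduce(total, partial);
        }
        return toAverage(total);
    }
}
```

Averaging the per-node averages would give a wrong answer whenever nodes hold different numbers of records, which is why the state carries the sum and the count separately until the last step.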

# Stream Partitioning with GROUP BY

`SELECT bin1, agg(bin2) FROM namespace.set WHERE inner-condition GROUP BY bin1`

Note that the inner filter "inner-condition" (the WHERE clause) can be specified using any bins in the record, whereas the outer filter (HAVING) and ORDER BY must use selected (aggregated) bins from the query.

Key points:
- The aggregate state uses a map of maps. The outer map is keyed by the unique values of the GROUP BY bin, and each second-level map belongs to one group and stores the aggregated column value for that group.
- Reduce uses map-merge to merge partial aggregates.
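
The map-of-maps state and the map-merge reduce can be sketched in plain Java as below. This is a local simulation with hypothetical names (`GroupBySketch`, `Rec`, `aggregateNode`), it uses SUM as the example aggregate, and it does not involve the Aerospike client.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local simulation only: no Aerospike client or server is involved.
public class GroupBySketch {

    // Hypothetical stand-in for a record: one GROUP BY bin and one aggregated bin.
    static final class Rec {
        final String group;
        final long value;
        Rec(String group, long value) { this.group = group; this.value = value; }
    }

    // aggregate operator: the outer map is keyed by group value,
    // and each inner map holds that group's running "sum".
    static Map<String, Map<String, Long>> aggregate(Map<String, Map<String, Long>> state, Rec rec) {
        state.computeIfAbsent(rec.group, g -> new HashMap<>())
             .merge("sum", rec.value, Long::sum);
        return state;
    }

    // reduce operator: map-merge of two partial states, adding values for groups present in both.
    static Map<String, Map<String, Long>> reduce(Map<String, Map<String, Long>> a,
                                                 Map<String, Map<String, Long>> b) {
        Map<String, Map<String, Long>> merged = new HashMap<>();
        for (Map<String, Map<String, Long>> partial : Arrays.asList(a, b)) {
            partial.forEach((group, aggregates) ->
                aggregates.forEach((name, value) ->
                    merged.computeIfAbsent(group, g -> new HashMap<>())
                          .merge(name, value, Long::sum)));
        }
        return merged;
    }

    // Folds all records of one simulated node into a single partial state.
    static Map<String, Map<String, Long>> aggregateNode(List<Rec> nodeRecords) {
        Map<String, Map<String, Long>> state = new HashMap<>();
        for (Rec rec : nodeRecords) {
            state = aggregate(state, rec);
        }
        return state;
    }
}
```

Calling `reduce(aggregateNode(node1), aggregateNode(node2))` yields one entry per distinct group value, regardless of which node saw which records.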

## Filtering Partitions: HAVING
`SELECT bin1, agg(bin2) FROM namespace.set WHERE inner-condition GROUP BY bin1 HAVING outer-condition`

Processing for the HAVING clause can be done using a filter operator after reduce.
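
A minimal sketch of the HAVING step, assuming the reduce phase has already produced one aggregated value per group. It is plain Java with hypothetical names and does not use the Aerospike API; the outer condition is passed in as a predicate on the aggregated value.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Local simulation only: no Aerospike client or server is involved.
public class HavingSketch {

    // Keeps only the groups whose aggregated value satisfies the outer (HAVING) condition.
    static Map<String, Long> having(Map<String, Long> reducedGroups,
                                    Predicate<Long> outerCondition) {
        Map<String, Long> kept = new TreeMap<>();
        reducedGroups.forEach((group, value) -> {
            if (outerCondition.test(value)) {
                kept.put(group, value);
            }
        });
        return kept;
    }
}
```

The condition is evaluated only after the partial aggregates are merged, since a group's partial value on a single node says nothing about whether its final value passes.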

## Sorting Partitions: ORDER BY
`SELECT bin1, agg(bin2) FROM namespace.set WHERE inner-condition GROUP BY bin1 HAVING outer-condition ORDER BY bin`

Processing for the ORDER BY clause can be done using a map operator at the end of the pipeline that outputs an ordered list.
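
The final ordering step can be sketched as a function from the reduced group map to an ordered list of rows. This is a plain Java illustration with hypothetical names, ordering by the aggregated value in ascending order and breaking ties by group name, which is an assumption made here for determinism.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Local simulation only: no Aerospike client or server is involved.
public class OrderBySketch {

    // Final map step: converts the reduced group map into a list of (group, value) rows
    // sorted by value, then by group name.
    static List<Map.Entry<String, Long>> orderByAggregate(Map<String, Long> reducedGroups) {
        List<Map.Entry<String, Long>> rows = new ArrayList<>();
        reducedGroups.forEach((group, value) ->
                rows.add(new AbstractMap.SimpleImmutableEntry<>(group, value)));

        Comparator<Map.Entry<String, Long>> byValue = Map.Entry.comparingByValue();
        Comparator<Map.Entry<String, Long>> byKey = Map.Entry.comparingByKey();
        rows.sort(byValue.thenComparing(byKey));
        return rows;
    }
}
```

Sorting is left to the very end because a map has no inherent order, and any order imposed before the last reduce would be lost when partial maps are merged.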

# More Aggregates: Distinct and Top N
DISTINCT can be processed by storing all values in a map (as the aggregate state) that is keyed on the value(s) of the bin(s), so that only unique values are retained.

TOP N can be processed by retaining the top N values in a list (as the aggregate state). Both the aggregate operator and the reduce operator, which performs a list-merge, trim the list back to N values.
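
Both ideas can be sketched in plain Java as follows. This is a local simulation with hypothetical names, not the Aerospike API: the DISTINCT state is a map keyed on the value, and the TOP N state is a list that is re-sorted and trimmed after every addition or merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local simulation only: no Aerospike client or server is involved.
public class DistinctTopNSketch {

    // DISTINCT aggregate: the map key is the value itself, so duplicates collapse.
    static Map<String, Boolean> distinctAggregate(Map<String, Boolean> state, String value) {
        state.put(value, Boolean.TRUE);
        return state;
    }

    // DISTINCT reduce: map-merge of two partial key sets.
    static Map<String, Boolean> distinctReduce(Map<String, Boolean> a, Map<String, Boolean> b) {
        Map<String, Boolean> merged = new TreeMap<>(a);
        merged.putAll(b);
        return merged;
    }

    // TOP N aggregate: add the value, then keep only the n largest.
    static List<Long> topNAggregate(List<Long> state, long value, int n) {
        List<Long> next = new ArrayList<>(state);
        next.add(value);
        return keepTop(next, n);
    }

    // TOP N reduce: list-merge of two partial lists, then keep only the n largest.
    static List<Long> topNReduce(List<Long> a, List<Long> b, int n) {
        List<Long> merged = new ArrayList<>(a);
        merged.addAll(b);
        return keepTop(merged, n);
    }

    // Sorts in descending order and returns a copy of the first n elements (or fewer).
    private static List<Long> keepTop(List<Long> values, int n) {
        values.sort(Comparator.reverseOrder());
        return new ArrayList<>(values.subList(0, Math.min(n, values.size())));
    }
}
```

Trimming in the aggregate step keeps each partial state at no more than N values, so the lists that are merged in reduce stay small no matter how many records are scanned.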

# Takeaways and Conclusion
Many developers who are familiar with SQL would like to see how SQL operations translate to Aerospike. We looked at how to implement various aggregate statements using stream UDFs built from filter, map, aggregate, and reduce operators. These patterns should be generally useful irrespective of the reader's SQL knowledge. While the examples here use synchronous execution, many operations can also be performed asynchronously.

# Clean up
Remove tutorial data and close connection.

In [5]:
client.dropIndex(null, Namespace, SmallSet, IndexName);
client.truncate(null, Namespace, null, null);
client.close();
System.out.println("Removed tutorial data and closed server connection.");

Removed tutorial data and closed server connection.


# Further Exploration and Resources
Here are some links for further exploration.

Resources
- Related notebooks
    - [Queries](https://github.com/aerospike/aerospike-dev-notebooks.docker/blob/main/notebooks/python/query.ipynb)
    - Other notebooks in the SQL series on 1) [SELECT](sql_select.ipynb) and 2) UPDATE, CREATE, and DELETE.
- [Aerospike Presto Connector](https://www.aerospike.com/docs/connect/access/presto/index.html)
- Blog post
    - [Introducing Aerospike JDBC Connector](https://medium.com/aerospike-developer-blog/introducing-aerospike-jdbc-driver-fe46d9fc3b4d)
- Aerospike Developer Hub
    - [Java Developers Resources](https://developer.aerospike.com/java-developers)
- Github repos
    - [Java code examples](https://github.com/aerospike/aerospike-client-java/tree/master/examples/src/com/aerospike/examples)
- Documentation
    - [Java Client](https://www.aerospike.com/docs/client/java/index.html)
    - [Java API Reference](https://www.aerospike.com/apidocs/java/)

## Next steps

Visit [Aerospike notebooks repo](https://github.com/aerospike-examples/interactive-notebooks) to run additional Aerospike notebooks. To run a different notebook, download the notebook from the repo to your local machine, and then click on File->Open, and select Upload.