## ArcGIS Notebook demonstrating the usage of the built-in [Spark](https://spark.apache.org/) engine.


This notebook will demonstrate the spatial binning of AIS data around the port of Miami using Apache Spark.

The AIS broadcast data is in a file geodatabase that can be downloaded from [here](https://marinecadastre.gov/ais).

Make sure to create a new conda environment and activate it before starting this notebook, as follows:

- Start a `Python Command Prompt` from `Start > ArcGIS`.

- Execute the following:

```
conda create --yes --name spark_esri --clone arcgispro-py3
activate spark_esri
# pip install spark_esri-0.1-py3-none-any.whl
pip install pyarrow
proswap spark_esri
```

- Install the Esri Spark module.

```
git clone https://github.com/mraad/spark-esri
cd spark-esri
python setup.py install
```
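
Before moving on, it can help to confirm that the new environment actually exposes the modules this notebook imports. The following is a minimal sketch; the helper name `missing_modules` is hypothetical and is not part of `spark_esri`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules this notebook relies on after the setup steps above.
required = ["arcpy", "pyarrow", "spark_esri"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported as missing, re-check that `spark_esri` is the active environment before starting the notebook.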

### Import the required modules.

In [None]:
import os
import arcpy
from spark_esri import spark_start, spark_stop

### Start a Spark instance.

Note the `config` argument, which is used to [configure the Spark instance](https://spark.apache.org/docs/latest/configuration.html).

In [None]:
config = {
    "spark.driver.memory":"2G",
    "spark.jars": os.path.join("C:",os.sep,"bdt","bdt.jar"),
    "spark.submit.pyFiles": os.path.join("C:",os.sep,"bdt","bdt.egg")
}
spark = spark_start(config=config)
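
The paths in `config` are specific to one machine. As a sketch only, the dictionary could be assembled by a small helper that adds the BDT entries when the files are present, so it never points at a path that is not there. The helper name `build_config` and its parameters are assumptions made for illustration; only the three Spark property names come from the cell above.

```python
import os


def build_config(bdt_dir, driver_memory="2G"):
    """Build a Spark config dict, adding the BDT jar/egg entries only if the files exist."""
    config = {"spark.driver.memory": driver_memory}
    jar = os.path.join(bdt_dir, "bdt.jar")
    egg = os.path.join(bdt_dir, "bdt.egg")
    if os.path.isfile(jar):
        config["spark.jars"] = jar
    if os.path.isfile(egg):
        config["spark.submit.pyFiles"] = egg
    return config


# Same folder as in the cell above; adjust to your own installation.
config = build_config(os.path.join("C:", os.sep, "bdt"))
print(config)
```

The resulting dictionary can be passed to `spark_start(config=config)` exactly as before.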

### Read the selected Broadcast feature shapes in WebMercator SR.

It is assumed that you have added the `Broadcast` point feature class from the downloaded `miami.gdb` to the map.

Note that the `SearchCursor` honors the user-selected features and any active definition query in the layer properties. For example, set the definition query to `Status = 0: Under way using engine` to get the locations of all moving ships, so that the result is a "heat map" of the port movement.

In [None]:
sp_ref = arcpy.SpatialReference(3857)
data = arcpy.da.SearchCursor("Broadcast",["SHAPE@X","SHAPE@Y"],spatial_reference=sp_ref)
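
The cursor performs the reprojection itself, so no manual math is needed. Still, to get a feel for the values that end up in `x` and `y`, here is a sketch of the standard spherical Web Mercator (EPSG:3857) forward formula. The function name `to_web_mercator` and the sample coordinate near Miami are illustrative assumptions, not part of the notebook.

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis in meters, as used by EPSG:3857


def to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS * math.radians(lon_deg)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4.0 + math.radians(lat_deg) / 2.0))
    return x, y


# Approximate location of Miami (assumed sample point).
x, y = to_web_mercator(-80.19, 25.77)
print(f"x = {x:.0f} m, y = {y:.0f} m")
```

Note that `x` is negative for any location west of the prime meridian, which matters when coordinates are later divided into integer cell indices.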

### Create a Spark data frame of the read data, and create a view named 'v0'.

In [None]:
spark\
    .createDataFrame(data,"x double,y double")\
    .createOrReplaceTempView("v0")

### Aggregate the data into 200x200 meter bins.

The aggregation is performed by Spark as a SQL statement in a parallel shared-nothing way, and the resulting bins are collected back into the `rows` list variable.

This is a nested SQL expression. The inner expression maps the input `x` and `y` into `q` and `r` cell locations given a user-defined bin size, and the outer expression counts the points in each `q` and `r` pair, capping the count at 1000. Finally, `q` and `r` are mapped back to the `x` and `y` of the cell center to enable the placement on a map.

In [None]:
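cell0 = 200.0 # bin size in meters
cell1 = cell0 * 0.5 # half a bin, used to place each point at its bin center

# floor() rounds toward negative infinity, so negative Web Mercator
# coordinates (e.g. x west of Greenwich) land in the correct bin;
# a plain cast would truncate toward zero and shift them by one cell.
rows = spark\
    .sql(f"""
select q*{cell0}+{cell1} x,r*{cell0}+{cell1} y,least(count(1),1000) as pop
from
(select cast(floor(x/{cell0}) as long) q,cast(floor(y/{cell0}) as long) r from v0)
group by q,r
""")\
    .collect()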
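
To make the binning logic concrete outside of Spark, here is a minimal pure-Python sketch of the same idea: map each point to an integer cell, count the points per cell, cap the count, and map the cell back to its center. The function name `bin_points` and the sample points are illustrative assumptions. The sketch uses `math.floor`, which rounds toward negative infinity, whereas a truncating cast rounds toward zero and would place negative coordinates one cell over.

```python
import math
from collections import Counter


def bin_points(points, cell=200.0, cap=1000):
    """Count points per square cell and return (center_x, center_y, capped_count) tuples."""
    half = cell * 0.5
    # Inner step: map every (x, y) to its integer cell index (q, r).
    counts = Counter(
        (math.floor(x / cell), math.floor(y / cell)) for x, y in points
    )
    # Outer step: map (q, r) back to the cell center and cap the count.
    return [
        (q * cell + half, r * cell + half, min(n, cap))
        for (q, r), n in counts.items()
    ]


# Hypothetical sample points in meters; two of them share the same cell.
points = [(10.0, 10.0), (150.0, 190.0), (250.0, 10.0), (-50.0, 10.0)]
for x, y, pop in sorted(bin_points(points)):
    print(x, y, pop)
```

Spark performs the same grouping, but distributes the counting across partitions instead of running it in a single Python process.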

### Create an in-memory point feature class of the collected bins.

The variable `rows` is a list of Spark `Row` objects of the form `[[x0,y0,pop0],[x1,y1,pop1],...,[xN,yN,popN]]`.

In [None]:
ws = "memory"
nm = "Bins"

fc = os.path.join(ws,nm)

arcpy.management.Delete(fc)

sp_ref = arcpy.SpatialReference(3857)
arcpy.management.CreateFeatureclass(ws,nm,"POINT",spatial_reference=sp_ref)
arcpy.management.AddField(fc, "POP", "LONG")

with arcpy.da.InsertCursor(fc, ["SHAPE@X","SHAPE@Y", "POP"]) as cursor:
    for row in rows:
        cursor.insertRow(row)

### Apply a graduated colors symbology to highlight the bins.

In [None]:
_ = arcpy.ApplySymbologyFromLayer_management(fc, f"{nm}.lyrx")

### Stop the Spark instance.

In [None]:
spark_stop()