# Ingestion and query of spatial dimensions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This notebook demonstrates the spatial dimension capabilities of Apache Druid.
In this notebook, you perform the following tasks: 
- Ingest spatial data
- Query spatial dimensions
- Use spatial filters to efficiently find events within a rectangle, a radius, or a polygon shape  


## Prerequisites

This tutorial works with Druid 29.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   
   

## Initialization

The following cells prepare the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import json  # used by the helper functions and the request cells below
import os

import druidapi

# Use the DRUID_HOST environment variable when it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display_client = druid.display
sql_client = druid.sql
status_client = druid.status
rest_client = druid.rest

# client for Data Generator API
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")

status_client.version

### Set up helper functions

Run the next two cells to set up the following helper functions:
- `wait_for_datagen`: probes the data generator for the status of a job until it completes
- `wait_for_task`: probes indexer task status until it completes or fails
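
Both helpers poll once per second and keep looping until they see a terminal state. As a hedged sketch of the same pattern with an upper time limit, the following block shows a generic polling function. The names `poll_until`, `probe`, `is_done`, `interval_s`, and `timeout_s` are hypothetical and are not part of `druidapi`.

```python
import time


def poll_until(probe, is_done, interval_s=1.0, timeout_s=300.0):
    """Call probe() repeatedly until is_done(result) is true or the timeout expires.

    Hypothetical helper for illustration; not part of druidapi.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no terminal state after {timeout_s} seconds")
        time.sleep(interval_s)
```

With the notebook's clients, `probe` could wrap the `rest_client.get_json` status call used in `wait_for_task`, and `is_done` could check that the returned status is no longer `RUNNING`.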

In [None]:
def wait_for_datagen( job_name:str):
    import time
    from IPython.display import clear_output
    # wait for the messages to be fully published 
    done = False
    while not done:
        result = datagen.get_json(f"/status/{job_name}",'')
        clear_output(wait=True)
        print(json.dumps(result, indent=2))
        if result["status"] == 'COMPLETE':
            done = True
        else:
            time.sleep(1)

In [None]:
def wait_for_task( task_id):
    import time
    from IPython.display import clear_output
    # poll the task status until the task leaves the RUNNING state
    done = False
    probe_count=0
    while not done:
        result = rest_client.get_json(f"/druid/indexer/v1/task/{task_id}/status",'')
        clear_output(wait=True)
        print(json.dumps(result, indent=2))
        if result["status"]["status"] != 'RUNNING':
            done = True
        else:
            probe_count+=1
            print(f'Sleeping... probe # {probe_count}')
            time.sleep(1)

## Generate sample data
Run the following cell to create 10k rows of user data that include latitude and longitude coordinates for each user.

In [None]:
from datetime import datetime, timedelta
# start time two hours in the past; the request below does not pass it to the generator
gen_hours=2
gen_now = datetime.now() - timedelta(hours=gen_hours)
gen_start_time = gen_now.strftime("%Y-%m-%d %H:%M:%S")

headers = {
  'Content-Type': 'application/json'
}

datagen_request = {
    "name": "users",
    "target": { "type": "file", "path":"users.json"  },
    "config_file": "clickstream/users_init.json", 
    "concurrency":100,
    "total_events":10000 
}
datagen.post("/start", json.dumps(datagen_request), headers=headers)
wait_for_datagen('users')

To ingest the latitude and longitude as a spatially indexed dimension, you must use a native ingestion spec. SQL-based ingestion does not support spatial dimensions.

The following native `index_parallel` spec includes a `"spatialDimensions"` property within `"dimensionsSpec"` that describes the spatial index creation:
```
    "dimensionsSpec": {
      "dimensions": [
          ...
          {
            "name": "address_lat",
            "type": "double"
          },
          {
            "name": "address_long",
            "type": "double"
          },
          ...
      ],
      "spatialDimensions": [
          {
             "dimName": "address_spatial",
             "dims": [
               "address_lat",
               "address_long"
             ]
          }
      ]
    }
    ...
```

The `"spatialDimensions"` array contains the list of spatial dimensions. Each entry has a name (`"dimName"`) and a `"dims"` list that names the dimensions supplying the coordinates of a point in space. The dimensions that form the point come from the dimensions defined in the ingestion spec.
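
As a local sanity check of that layout, the following sketch lists any coordinate dimension that a spatial dimension refers to but that the `"dimensions"` list does not declare. The function name `missing_spatial_inputs` is hypothetical, and the check follows this tutorial's description of the spec; it is not a rule that this sketch claims Druid enforces.

```python
def missing_spatial_inputs(dimensions_spec):
    """Map each spatial dimension name to the coordinate dims missing from "dimensions".

    Hypothetical helper for illustration only.
    """
    declared = set()
    for dim in dimensions_spec.get("dimensions", []):
        # A dimension is either a plain string or an object with a "name" key.
        declared.add(dim if isinstance(dim, str) else dim["name"])

    missing = {}
    for spatial in dimensions_spec.get("spatialDimensions", []):
        absent = [d for d in spatial["dims"] if d not in declared]
        if absent:
            missing[spatial["dimName"]] = absent
    return missing


example = {
    "dimensions": [
        "user_id",
        {"name": "address_lat", "type": "double"},
        {"name": "address_long", "type": "double"},
    ],
    "spatialDimensions": [
        {"dimName": "address_spatial", "dims": ["address_lat", "address_long"]}
    ],
}
print(missing_spatial_inputs(example))  # prints {}
```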

The ingestion spec is submitted using a REST API call which returns a `task_id`. The `task_id` is then used to monitor the job using a status REST API call inside the `wait_for_task` helper function.

In [None]:
import json

spatial_index_spec = {
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "http",
        "uris": [
          "http://datagen:9999/file/users.json"
        ]
      },  
      "inputFormat": {
        "type": "json"
      },
      "appendToExisting": False
    },
    "dataSchema": {
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": False
      },
      "dataSource": "example-spatial-index",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "user_id",
          "first_name",
          "last_name",
          "dob",
          {
            "name": "address_lat",
            "type": "double"
          },
          {
            "name": "address_long",
            "type": "double"
          },
          "marital_status",
          {
            "name": "income",
            "type": "double"
          },
          "signup_ts"
        ],
        "spatialDimensions": [
          {
             "dimName": "address_spatial",
             "dims": [
               "address_lat",
               "address_long"
             ]
          }
        ]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    }
  }
}
headers = {
  'Content-Type': 'application/json'
}

task = rest_client.post("/druid/indexer/v1/task", json.dumps(spatial_index_spec), headers=headers)
task_id = json.loads(task.text)['task']
wait_for_task(task_id)

## Filter spatial data 
To query spatial data that uses indexed filtering, you must use the native query API. 
You can specify spatial filters as follows:

```
"filter": {
  "type": "spatial",
  "dimension": <name_of_spatial_dimension>,
  "bound": <bound_type>
}
```

The following are the types of bounds for spatial filters:
- rectangular: matches any point that falls within the specified rectangle 
- radius: matches any point that falls within the circle defined by center coordinates and a radius 
- polygon: matches any point within the area of the polygon as defined by a set of points
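
To make the field names of each bound type concrete, the following sketch builds the three bound objects and wraps one in a spatial filter. The function names are hypothetical helpers; the JSON keys are the ones used in this notebook's examples.

```python
import json


def rectangular_bound(min_coords, max_coords):
    """Bound that matches points between two opposite corners."""
    return {"type": "rectangular", "minCoords": list(min_coords), "maxCoords": list(max_coords)}


def radius_bound(center, radius):
    """Bound that matches points within a circle around center."""
    return {"type": "radius", "coords": list(center), "radius": float(radius)}


def polygon_bound(corners):
    """Bound that matches points inside a polygon given as (first, second) coordinate pairs."""
    return {
        "type": "polygon",
        "abscissa": [corner[0] for corner in corners],
        "ordinate": [corner[1] for corner in corners],
    }


def spatial_filter(dimension, bound):
    """Wrap a bound in a spatial filter on the named spatial dimension."""
    return {"type": "spatial", "dimension": dimension, "bound": bound}


print(json.dumps(spatial_filter("address_spatial", radius_bound([-10.0, 20.0], 10.0)), indent=2))
```

Each helper returns a plain dictionary, so the result can be placed directly under the `"filter"` key of a native query.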


### Query with spatial data
Use the following TopN query to count the users found in the filtered geographical area and to aggregate their minimum and maximum income.
The query is the same across all three examples. Each example uses a different spatial filter bound type: `"rectangular"`, `"radius"`, or `"polygon"`.
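
Because only the `"filter"` block changes between the three examples, the shared parts of the query can be factored into a function. The following is a minimal sketch; `income_topn_query` is a hypothetical helper name, and the cells in the following sections spell the query out in full instead.

```python
import json


def income_topn_query(spatial_filter):
    """Return the tutorial's topN query with the given spatial filter plugged in.

    Hypothetical helper for illustration only.
    """
    return {
        "queryType": "topN",
        "dataSource": {"type": "table", "name": "example-spatial-index"},
        "dimension": {
            "type": "default",
            "dimension": "marital_status",
            "outputName": "marital_status",
            "outputType": "STRING",
        },
        "metric": {"type": "dimension", "ordering": {"type": "lexicographic"}},
        "threshold": 1001,
        "intervals": {
            "type": "intervals",
            "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],
        },
        "filter": spatial_filter,
        "granularity": {"type": "all"},
        "aggregations": [
            {"type": "doubleMin", "name": "min_income", "fieldName": "income"},
            {"type": "doubleMax", "name": "max_income", "fieldName": "income"},
            {"type": "count", "name": "user_count"},
        ],
        "context": {"sqlOuterLimit": 1001, "useNativeQueryExplain": False},
    }


rectangular_filter = {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {"type": "rectangular", "minCoords": [10.0, 20.0], "maxCoords": [30.0, 40.0]},
}
query = income_topn_query(rectangular_filter)
print(json.dumps(query["filter"], indent=2))
```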

### Filter using a rectangular area
The next cell is an example of a rectangular bound.
The two corner points of the rectangle (`"minCoords"`, `"maxCoords"`) define the filter. Anything that falls inside the rectangle is included in the results:
```
"filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "rectangular",
        "minCoords": [10.0, 20.0],
        "maxCoords": [30.0, 40.0]
      }
  }
```
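
To illustrate what "inside the rectangle" means for a single point, the following sketch checks each coordinate against the matching minimum and maximum. It runs locally and does not call Druid. Treating points exactly on the edge as inside is an assumption of this sketch, and `in_rectangle` is a hypothetical name.

```python
def in_rectangle(point, min_coords, max_coords):
    """True when every coordinate of point lies between the matching min and max."""
    return all(lo <= value <= hi for value, lo, hi in zip(point, min_coords, max_coords))


# address_spatial is built from [address_lat, address_long], so the first number is the latitude.
print(in_rectangle([15.0, 25.0], [10.0, 20.0], [30.0, 40.0]))  # True
print(in_rectangle([15.0, 45.0], [10.0, 20.0], [30.0, 40.0]))  # False
```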

Run the following cell to test a rectangular filter:

In [None]:
rectangular_filter_query = {
  "queryType": "topN",
  "dataSource": {
    "type": "table",
    "name": "example-spatial-index"
  },
  "dimension": {
    "type": "default",
    "dimension": "marital_status",
    "outputName": "marital_status",
    "outputType": "STRING"
  },
  "metric": {
    "type": "dimension",
    "ordering": {
      "type": "lexicographic"
    }
  },
  "threshold": 1001,
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "rectangular",
        "minCoords": [10.0, 20.0],
        "maxCoords": [30.0, 40.0]
      }
  },
  "granularity": {
    "type": "all"
  },
  "aggregations": [
    {
      "type": "doubleMin",
      "name": "min_income",
      "fieldName": "income"
    },
    {
      "type": "doubleMax",
      "name": "max_income",
      "fieldName": "income"
    },
    {
      "type": "count",
      "name": "user_count"
    }
  ],
  "context": {
    "sqlOuterLimit": 1001,
    "useNativeQueryExplain": False
  }
}

headers = {
  'Content-Type': 'application/json'
}

result = rest_client.post("/druid/v2", json.dumps(rectangular_filter_query), headers=headers)

json.loads(result.text)[0]['result']


### Filter using a radius area
The following example shows a radius bound filter.
A `"radius"` bound filter is defined by a center point (`"coords"`) and the `"radius"` of the circle, given as a single number. Anything within the circle is included in the results:
```
"filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "radius",
        "coords": [-10.0, 20.0],
        "radius": 10.0
      }
  }
```
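
To illustrate the circle test for a single point, the following sketch compares the straight-line distance from the center with the radius, using the center and radius from the query cell. It assumes the distance is measured in the same units as the coordinates (degrees here) and ignores the curvature of the Earth. How Druid measures the distance is not confirmed by this sketch, and `in_radius` is a hypothetical name.

```python
import math


def in_radius(point, center, radius):
    """True when the straight-line distance from center to point is at most radius."""
    return math.dist(point, center) <= radius


print(in_radius([-5.0, 20.0], [-10.0, 20.0], 10.0))  # True: distance is 5.0
print(in_radius([5.0, 20.0], [-10.0, 20.0], 10.0))   # False: distance is 15.0
```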
Run the following cell to test a radius filter:

In [None]:
radius_filter_query = {
  "queryType": "topN",
  "dataSource": {
    "type": "table",
    "name": "example-spatial-index"
  },
  "dimension": {
    "type": "default",
    "dimension": "marital_status",
    "outputName": "marital_status",
    "outputType": "STRING"
  },
  "metric": {
    "type": "dimension",
    "ordering": {
      "type": "lexicographic"
    }
  },
  "threshold": 1001,
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "radius",
        "coords": [-10.0, 20.0],
        "radius": 10.0
      }
  },
  "granularity": {
    "type": "all"
  },
  "aggregations": [
    {
      "type": "doubleMin",
      "name": "min_income",
      "fieldName": "income"
    },
    {
      "type": "doubleMax",
      "name": "max_income",
      "fieldName": "income"
    },
    {
      "type": "count",
      "name": "user_count"
    }
  ],
  "context": {
    "sqlOuterLimit": 1001,
    "useNativeQueryExplain": False
  }
}

headers = {
  'Content-Type': 'application/json'
}

result = rest_client.post("/druid/v2", json.dumps(radius_filter_query), headers=headers)

json.loads(result.text)[0]['result']


### Filter using a polygon area
The following is an example of a polygon bound.
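A polygon bound describes a closed shape. Any point that falls within the shape is included in the results. Two parallel arrays list the corner coordinates:
- `"abscissa"`: the first (horizontal) coordinate of each corner
- `"ordinate"`: the second (vertical) coordinate of each corner

Because `address_spatial` is built from `address_lat` followed by `address_long`, the values in this example read as latitudes in `"abscissa"` and longitudes in `"ordinate"`.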

This example uses a rough outline of Africa:

```
"filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "polygon",
        "abscissa": [30.942493, 36.917158, 22.799200, 3.914715, -0.286389, -35.494284, 11.654848, 30.942493 ],
        "ordinate": [32.434730, 9.758948 ,-16.959803,-8.524121,  7.766978,  19.844799, 51.309644, 32.434730 ]
      }
  }
```
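
To illustrate how a point can be tested against a polygon given as `"abscissa"` and `"ordinate"` arrays, the following sketch uses the even-odd (ray casting) rule. This is a common textbook technique and not necessarily the algorithm Druid uses; points that lie exactly on an edge may be classified either way. The name `in_polygon` is hypothetical, and the example uses a simple square so that the result is easy to verify by hand.

```python
def in_polygon(point, abscissa, ordinate):
    """Even-odd (ray casting) test: True when point is inside the polygon."""
    x, y = point
    inside = False
    j = len(abscissa) - 1  # index of the previous corner
    for i in range(len(abscissa)):
        xi, yi = abscissa[i], ordinate[i]
        xj, yj = abscissa[j], ordinate[j]
        # Does the edge from corner j to corner i straddle the horizontal line through the point?
        if (yi > y) != (yj > y):
            # First coordinate at which the edge crosses that line
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


square_abscissa = [0.0, 10.0, 10.0, 0.0]
square_ordinate = [0.0, 0.0, 10.0, 10.0]
print(in_polygon([5.0, 5.0], square_abscissa, square_ordinate))   # True
print(in_polygon([15.0, 5.0], square_abscissa, square_ordinate))  # False
```

A repeated closing corner, as in the Africa outline above, adds a zero-length edge that the straddle check skips.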

Run the following cell to test a polygon filter:

In [None]:
# this polygon filter is a rough outline of Africa
polygon_filter_query = {
  "queryType": "topN",
  "dataSource": {
    "type": "table",
    "name": "example-spatial-index"
  },
  "dimension": {
    "type": "default",
    "dimension": "marital_status",
    "outputName": "marital_status",
    "outputType": "STRING"
  },
  "metric": {
    "type": "dimension",
    "ordering": {
      "type": "lexicographic"
    }
  },
  "threshold": 1001,
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "filter": {
    "type": "spatial",
    "dimension": "address_spatial",
    "bound": {
        "type": "polygon",
        "abscissa": [30.942493, 36.917158, 22.799200, 3.914715, -0.286389, -35.494284, 11.654848, 30.942493 ],
        "ordinate": [32.434730, 9.758948 ,-16.959803,-8.524121,  7.766978,  19.844799, 51.309644, 32.434730 ]
      }
  },
  "granularity": {
    "type": "all"
  },
  "aggregations": [
    {
      "type": "doubleMin",
      "name": "min_income",
      "fieldName": "income"
    },
    {
      "type": "doubleMax",
      "name": "max_income",
      "fieldName": "income"
    },
    {
      "type": "count",
      "name": "user_count"
    }
  ],
  "context": {
    "sqlOuterLimit": 1001,
    "useNativeQueryExplain": False
  }
}

headers = {
  'Content-Type': 'application/json'
}

result = rest_client.post("/druid/v2", json.dumps(polygon_filter_query), headers=headers)

json.loads(result.text)[0]['result']


## Cleanup 
The following cell removes the example datasource from Druid.

In [None]:
print(f"Drop datasource: [{druid.datasources.drop('example-spatial-index')}]")


## Learn more

See [Spatial filters](https://druid.apache.org/docs/latest/querying/geo/) for more information.
