# Working with extensions contributed by the community

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
This tutorial demonstrates the use of a community extension called [Spectator Histogram](https://druid.apache.org/docs/latest/development/extensions-contrib/spectator-histogram) which was [contributed by Ben Sykes to the Apache Druid project](https://github.com/apache/druid/pull/15340).

In a nutshell, this new type of approximation provides the same capabilities as a [QuantilesSketch](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html), but has a much smaller data storage footprint. It has some limitations though:
- Supports only non-negative long integer values in the range [0, 2^53). Negative values are coerced to 0.
- Does not support decimals.
- Does not support Druid SQL queries, only native queries.
- Does not support vectorized queries.
- Generates 276 fixed buckets with increasing bucket widths.
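
The value rules in the list above can be illustrated with a few lines of Python. This is only a sketch of the stated constraints, not the extension's own code: the helper name `coerce_for_spectator` is hypothetical, and raising an error for decimals or for values at or above 2^53 is an assumption made here, since the list only says such values are unsupported.

```python
MAX_EXCLUSIVE = 2 ** 53  # upper bound of the supported range [0, 2^53)

def coerce_for_spectator(value):
    """Apply the documented value rules to one input (illustrative only)."""
    if isinstance(value, bool) or not isinstance(value, int):
        # Assumption: reject decimals, because they are listed as unsupported.
        raise TypeError("only integer values are supported")
    if value < 0:
        return 0  # negatives are coerced to 0
    if value >= MAX_EXCLUSIVE:
        # Assumption: treat out-of-range values as an error.
        raise ValueError("value is outside the supported range [0, 2^53)")
    return value

print(coerce_for_spectator(-17))
print(coerce_for_spectator(1234))
```

A check like this can be useful before ingestion to find out whether a column such as `added` fits the supported range.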

In practice, the authors observed that the error of computed percentiles ranges from 0.1% to 3%, exclusive.

In this tutorial, you use the Spectator Histogram and compare it to a Quantiles Sketch in both its results and in its storage footprint.

## Prerequisites

This tutorial works with Druid 29.0.1 or later.

### Run using Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

Add the `druid-spectator-histogram` extension to the `druid_extensions_loadList` parameter in the `environment` file.

The `environment` file is located in the root of the learn-druid clone directory:
```
...
druid_extensions_loadList=["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage", "druid-multi-stage-query", "druid-kafka-indexing-service", "druid-spectator-histogram"]
...
```
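
Editing the file by hand is enough, but the change can also be scripted. The following is a minimal sketch that assumes the parameter sits on a single line of the form `druid_extensions_loadList=[...]` with a JSON array, as in the snippet above. The helper name `add_extension` is hypothetical.

```python
import json

PREFIX = "druid_extensions_loadList="

def add_extension(env_text, extension):
    """Return env_text with `extension` appended to the load list if it is missing."""
    out = []
    for line in env_text.splitlines():
        if line.startswith(PREFIX):
            extensions = json.loads(line[len(PREFIX):])
            if extension not in extensions:
                extensions.append(extension)
            line = PREFIX + json.dumps(extensions)
        out.append(line)
    return "\n".join(out)

sample = 'druid_extensions_loadList=["druid-histogram", "druid-datasketches"]'
print(add_extension(sample, "druid-spectator-histogram"))
```

The function leaves every other line untouched and does nothing if the extension is already listed, so it is safe to run more than once.
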
Then restart the Docker Compose services:

```
docker compose --profile druid-jupyter up -d
```

After the command completes, return to this tutorial. You may need to wait for the services to reinitialize before continuing.
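
One way to wait for the services is to poll Druid's `/status/health` endpoint until it answers. The sketch below uses only the standard library. The helper names, the retry count, and the delay are illustrative choices, and the URL assumes the router is reachable on `localhost:8888`.

```python
import time
import urllib.error
import urllib.request

def druid_is_healthy(base_url):
    """Return True if GET {base_url}/status/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/status/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until(check, attempts=60, delay=5.0):
    """Call check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

# Example (requires a running Druid router):
# wait_until(lambda: druid_is_healthy("http://localhost:8888"))
```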

## Initialization

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [8]:
import druidapi
import os
import json

# Use the DRUID_HOST environment variable when it is set; otherwise default to localhost.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status
rest_client = druid.rest

status_client.version

Opening a connection to http://router:8888.


'29.0.1'
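
Since the tutorial needs Druid 29.0.1 or later, the reported version can be checked in code. The sketch below compares version strings numerically. The helper names are hypothetical, and stripping a suffix such as `-SNAPSHOT` is an assumption about how development builds report their version.

```python
def version_tuple(version):
    """Turn '29.0.1' (or '30.0.0-SNAPSHOT') into a tuple of ints for comparison."""
    core = version.split("-")[0]  # assumption: drop any '-SNAPSHOT' style suffix
    return tuple(int(part) for part in core.split("."))

def meets_minimum(version, minimum="29.0.1"):
    """Return True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)

print(meets_minimum("29.0.1"))
print(meets_minimum("28.0.1"))
```

In the notebook, the argument would be the value returned by `status_client.version`.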

### Set up helper functions

Run the next cell to set up the following helper function:
- `wait_for_task`: polls the indexer task status until the task completes or fails

In [9]:
def wait_for_task(task_id):
    import time
    from IPython.display import clear_output
    # Poll the task status endpoint once per second until the task leaves the RUNNING state.
    done = False
    probe_count=0
    while not done:
        result = rest_client.get_json(f"/druid/indexer/v1/task/{task_id}/status",'')
        clear_output(wait=True)
        print(json.dumps(result, indent=2))
        if result["status"]["status"] != 'RUNNING':
            done = True
        else:
            probe_count+=1
            print(f'Sleeping... probe # {probe_count}')
            time.sleep(1)
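
The helper above loops for as long as the task is `RUNNING` and does not distinguish success from failure. A possible variant, sketched here with a hypothetical name and illustrative defaults, adds a timeout and raises when the task ends in any state other than `SUCCESS`. It takes a callable so that it does not depend on a live Druid connection.

```python
import time

def wait_for_status(get_status, timeout_s=600, poll_s=1.0):
    """Poll get_status() until it stops returning 'RUNNING'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != "RUNNING":
            if status != "SUCCESS":
                raise RuntimeError(f"task ended with status {status}")
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        time.sleep(poll_s)

# In the notebook, get_status could wrap the same call that wait_for_task uses:
# get_status = lambda: rest_client.get_json(
#     f"/druid/indexer/v1/task/{task_id}/status", '')["status"]["status"]
```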

### Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following two cells to ingest the Wikipedia sample data, once with a Quantiles Sketch and once with a Spectator Histogram, using the `added` column as the source:

In [10]:
sketch_ingest = {
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "http",
        "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "minute",
        "rollup": True
      },
      "dataSource": "example-wikipedia-quantiles-sketch",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [
          "isRobot",
          "channel",
          "comment",
          "user",
          "region",
          "countryName",
          "regionIsoCode",
          "countryIsoCode",
          "regionName"
        ]
      },
      "metricsSpec": [
        { "name": "count", "type": "count" },
        { "name": "sum_added", "type": "longSum", "fieldName": "added" },
        {
          "name": "hist_added",
          "type": "quantilesDoublesSketch",
          "fieldName": "added"
        }
      ]
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": { "type": "hashed" },
      "forceGuaranteedRollup": True
    }
  }
}


headers = {
  'Content-Type': 'application/json'
}

task = rest_client.post("/druid/indexer/v1/task", json.dumps(sketch_ingest), headers=headers)
task_id = json.loads(task.text)['task']
wait_for_task(task_id)
sql_client.wait_until_ready('example-wikipedia-quantiles-sketch')

{
  "task": "index_parallel_example-wikipedia-quantiles-sketch_dcbioleh_2024-04-12T13:32:22.485Z",
  "status": {
    "id": "index_parallel_example-wikipedia-quantiles-sketch_dcbioleh_2024-04-12T13:32:22.485Z",
    "groupId": "index_parallel_example-wikipedia-quantiles-sketch_dcbioleh_2024-04-12T13:32:22.485Z",
    "type": "index_parallel",
    "createdTime": "2024-04-12T13:32:22.488Z",
    "queueInsertionTime": "1970-01-01T00:00:00.000Z",
    "statusCode": "SUCCESS",
    "status": "SUCCESS",
    "runnerStatusCode": "WAITING",
    "duration": 3899,
    "location": {
      "host": "172.19.0.10",
      "port": 8100,
      "tlsPort": -1
    },
    "dataSource": "example-wikipedia-quantiles-sketch",
    "errorMsg": null
  }
}


In [11]:
spectator_ingest = {
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "http",
        "uris": ["https://druid.apache.org/data/wikipedia.json.gz"]
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "minute",
        "rollup": True
      },
      "dataSource": "example-wikipedia-spectator-histogram",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": [
          "isRobot",
          "channel",
          "comment",
          "user",
          "region",
          "countryName",
          "regionIsoCode",
          "countryIsoCode",
          "regionName"
        ]
      },
      "metricsSpec": [
        { "name": "count", "type": "count" },
        { "name": "sum_added", "type": "longSum", "fieldName": "added" },
        {
          "name": "hist_added",
          "type": "spectatorHistogram",
          "fieldName": "added"
        }
      ]
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": { "type": "hashed" },
      "forceGuaranteedRollup": True
    }
  }
}


headers = {
  'Content-Type': 'application/json'
}

task = rest_client.post("/druid/indexer/v1/task", json.dumps(spectator_ingest), headers=headers)
task_id = json.loads(task.text)['task']
wait_for_task(task_id)
sql_client.wait_until_ready('example-wikipedia-spectator-histogram')

{
  "task": "index_parallel_example-wikipedia-spectator-histogram_igdnodic_2024-04-12T13:32:32.548Z",
  "status": {
    "id": "index_parallel_example-wikipedia-spectator-histogram_igdnodic_2024-04-12T13:32:32.548Z",
    "groupId": "index_parallel_example-wikipedia-spectator-histogram_igdnodic_2024-04-12T13:32:32.548Z",
    "type": "index_parallel",
    "createdTime": "2024-04-12T13:32:32.550Z",
    "queueInsertionTime": "1970-01-01T00:00:00.000Z",
    "statusCode": "SUCCESS",
    "status": "SUCCESS",
    "runnerStatusCode": "WAITING",
    "duration": 3800,
    "location": {
      "host": "172.19.0.10",
      "port": 8100,
      "tlsPort": -1
    },
    "dataSource": "example-wikipedia-spectator-histogram",
    "errorMsg": null
  }
}
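
The two ingestion specs above differ only in the `dataSource` name and the type of the `hist_added` metric. As a sketch, both can be generated from one function, which makes that difference explicit. The function name `build_ingest_spec` is hypothetical; all field values are copied from the cells above.

```python
def build_ingest_spec(datasource, hist_type):
    """Build the ingestion spec shown above for one datasource and one metric type."""
    return {
        "type": "index_parallel",
        "spec": {
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "http",
                    "uris": ["https://druid.apache.org/data/wikipedia.json.gz"],
                },
                "inputFormat": {"type": "json"},
            },
            "dataSchema": {
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "minute",
                    "rollup": True,
                },
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": [
                        "isRobot", "channel", "comment", "user", "region",
                        "countryName", "regionIsoCode", "countryIsoCode", "regionName",
                    ]
                },
                "metricsSpec": [
                    {"name": "count", "type": "count"},
                    {"name": "sum_added", "type": "longSum", "fieldName": "added"},
                    # The only metric that differs between the two datasources.
                    {"name": "hist_added", "type": hist_type, "fieldName": "added"},
                ],
            },
            "tuningConfig": {
                "type": "index_parallel",
                "partitionsSpec": {"type": "hashed"},
                "forceGuaranteedRollup": True,
            },
        },
    }

sketch_spec = build_ingest_spec("example-wikipedia-quantiles-sketch", "quantilesDoublesSketch")
spectator_spec = build_ingest_spec("example-wikipedia-spectator-histogram", "spectatorHistogram")
```

Keeping everything else identical is what makes the later size and accuracy comparison fair.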


## Look at the difference in sizes

Look at the data sources view in the [Druid console](http://localhost:8888/unified-console.html#datasources) and compare the sizes of the segments.
Even in this small example, you can see the difference in segment sizes.
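
The same comparison can be made in code by querying the `sys.segments` metadata table. The sketch below defines the SQL and a small helper that turns the rows into a size ratio. The helper name `size_ratio` is hypothetical, and the commented call assumes that the `druidapi` SQL client returns rows as a list of dictionaries.

```python
SEGMENT_SIZE_SQL = """
SELECT "datasource", SUM("size") AS total_bytes, SUM("num_rows") AS total_rows
FROM sys.segments
WHERE "datasource" LIKE 'example-wikipedia-%'
GROUP BY 1
"""

def size_ratio(rows, smaller, larger):
    """Return total_bytes of `smaller` divided by total_bytes of `larger`."""
    sizes = {row["datasource"]: row["total_bytes"] for row in rows}
    return sizes[smaller] / sizes[larger]

# In the notebook (assumption: sql_client.sql returns a list of dicts):
# rows = sql_client.sql(SEGMENT_SIZE_SQL)
# size_ratio(rows, "example-wikipedia-spectator-histogram",
#            "example-wikipedia-quantiles-sketch")
```

A ratio below 1 means that the first datasource occupies less segment storage than the second.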

## Queries 
The following two cells issue the same query using Spectator Histogram and Quantiles Sketch.
Notice that the results are almost identical.

In [12]:
spectator_query={
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "example-wikipedia-spectator-histogram"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "isRobot",
      "outputName": "isRobot",
      "outputType": "STRING"
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "total_added",
      "fieldName": "sum_added"
    },
    {
        "type":"longSum",
        "name":"histogram_population",
        "fieldName":"hist_added"
    },
    {
      "type": "spectatorHistogram",
      "name": "agg_hist_added",
      "fieldName": "hist_added"
    }
  ],
  "postAggregations": [
    {
      "type": "percentileSpectatorHistogram",
      "name": "medianAdded",
      "field": {
        "type": "fieldAccess",
        "fieldName": "agg_hist_added"
      },
      "percentile": "50.0"
    },
    {
      "type": "percentilesSpectatorHistogram",
      "name": "iqr_added",
      "field": {
        "type": "fieldAccess",
        "fieldName": "agg_hist_added"
      },
      "percentiles": [25,75]
    }  
  ],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 1001
  }
}

result = rest_client.post("/druid/v2", json.dumps(spectator_query), headers=headers)

spectator_result_json = json.loads(result.text)
spectator_result_json

[{'version': 'v1',
  'timestamp': '-146136543-09-08T08:23:32.096Z',
  'event': {'medianAdded': 21.795454545454547,
   'isRobot': 'false',
   'total_added': 4787206,
   'iqr_added': [0.8400023062730627, 134.843625498008],
   'agg_hist_added': {'0': 4336,
    '33': 498,
    '58': 3,
    '25': 384,
    '50': 17,
    '49': 32,
    '8': 109,
    '9': 119,
    '67': 1,
    '57': 8,
    '24': 626,
    '59': 2,
    '26': 325,
    '16': 374,
    '17': 376,
    '23': 192,
    '18': 341,
    '35': 222,
    '60': 10,
    '39': 124,
    '56': 6,
    '27': 251,
    '47': 35,
    '48': 37,
    '51': 60,
    '55': 12,
    '52': 36,
    '31': 146,
    '32': 156,
    '34': 311,
    '46': 41,
    '38': 106,
    '36': 155,
    '42': 268,
    '44': 83,
    '45': 80,
    '40': 75,
    '41': 71,
    '43': 133,
    '70': 1,
    '29': 193,
    '4': 300,
    '37': 142,
    '12': 85,
    '54': 17,
    '13': 86,
    '53': 16,
    '61': 9,
    '69': 1,
    '30': 156,
    '28': 212,
    '62': 5,
    '19': 343,
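
The response nests each group's values under `event`. A short sketch of pulling out the median and the interquartile range follows. The helper name `summarize_event` is hypothetical, and the sample event copies only the scalar values from the output above and leaves out the bucket counts.

```python
def summarize_event(event):
    """Extract the median and interquartile range from one groupBy result event."""
    q1, q3 = event["iqr_added"]
    return {
        "isRobot": event["isRobot"],
        "median": event["medianAdded"],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

sample_event = {
    "medianAdded": 21.795454545454547,
    "isRobot": "false",
    "total_added": 4787206,
    "iqr_added": [0.8400023062730627, 134.843625498008],
}
summary = summarize_event(sample_event)
print(summary["median"], round(summary["iqr"], 2))
```

In the notebook, the same helper could be applied to every row with `[summarize_event(row["event"]) for row in spectator_result_json]`.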
    

In [14]:
quantile_query={
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "example-wikipedia-quantiles-sketch"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "isRobot",
      "outputName": "isRobot",
      "outputType": "STRING"
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "total_added",
      "fieldName": "sum_added"
    },
    {
        "type":"longSum",
        "name":"histogram_population",
        "fieldName":"hist_added"
    },
    {
      "type": "quantilesDoublesSketch",
      "name": "agg_hist_added",
      "fieldName": "hist_added"
    }
  ],
  "postAggregations": [
    {
      "type": "quantilesDoublesSketchToQuantile",
      "name": "medianAdded",
      "field": {
        "type": "fieldAccess",
        "fieldName": "agg_hist_added"
      },
      "fraction": 0.50
    },
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "iqr_added",
      "field": {
        "type": "fieldAccess",
        "fieldName": "agg_hist_added"
      },
      "fractions": [0.25, 0.75]
    }
  ],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 1001
  }
}

result = rest_client.post("/druid/v2", json.dumps(quantile_query), headers=headers)

quantile_result_json = json.loads(result.text)
quantile_result_json

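
To put a number on how close the two approaches are, the percentile values of matching groups can be compared by relative difference. This is a minimal sketch: the helper names are hypothetical, it assumes both result events expose a `medianAdded` value and an `iqr_added` pair, and the two sample events contain made-up numbers purely to show the calculation.

```python
def relative_difference(a, b):
    """Relative difference between two estimates, as a fraction of the larger magnitude."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else abs(a - b) / scale

def compare_percentiles(spectator_event, quantile_event):
    """Compare medianAdded and the two iqr_added bounds of two result events."""
    diffs = {
        "median": relative_difference(
            spectator_event["medianAdded"], quantile_event["medianAdded"]
        )
    }
    pairs = zip(("q1", "q3"), spectator_event["iqr_added"], quantile_event["iqr_added"])
    for label, s, q in pairs:
        diffs[label] = relative_difference(s, q)
    return diffs

# Made-up values, not tutorial results.
event_a = {"medianAdded": 100.0, "iqr_added": [50.0, 200.0]}
event_b = {"medianAdded": 98.0, "iqr_added": [50.0, 190.0]}
print(compare_percentiles(event_a, event_b))
```

Rows from the two queries should be matched on `isRobot` before their events are compared.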
  

## Cleanup

Run the following cell to remove both datasources created in this tutorial.

In [15]:
druid.datasources.drop("example-wikipedia-spectator-histogram")
druid.datasources.drop("example-wikipedia-quantiles-sketch")

## Summary

* The Spectator Histogram is a good alternative to a Quantiles Sketch when the values are non-negative integers.
* The Spectator Histogram provides very similar accuracy with a lower storage footprint.