# Working with network data using IP functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This short notebook shows examples of [IP functions](https://druid.apache.org/docs/27.0.0/querying/sql-scalar#ip-address-functions) applied to a sample dataset.

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter Notebook tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).

## Initialization

The following cells set up the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable when it is set; otherwise connect to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"
    
print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Run the following cell to create a table called `example-koalas-ip`. The query ingests only the columns this notebook needs: the timestamp, `forwarded_for`, `remote_address`, and `session`.

When the ingestion completes, the cell displays a description of the new table.

In [None]:
sql='''
REPLACE INTO "example-koalas-ip" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "forwarded_for",
  "remote_address",
  "session"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-ip')
display.table('example-koalas-ip')

## Filtering query results against subnets

The `IPV4_MATCH` function tests whether an IPv4 address falls within a subnet given in CIDR notation. Use it to filter rows or to filter individual aggregations.
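
To make the idea of a subnet match concrete, here is a minimal local sketch using Python's standard `ipaddress` module. It does not call Druid; the helper name `ipv4_match` is hypothetical and only mirrors the behaviour described above. Treating an invalid address as a non-match is an assumption of this sketch.

```python
import ipaddress


def ipv4_match(address: str, cidr: str) -> bool:
    """Return True when `address` is an IPv4 address inside the `cidr` block."""
    try:
        return ipaddress.IPv4Address(address) in ipaddress.IPv4Network(cidr)
    except ValueError:
        # Assumption for this sketch: anything that is not a valid
        # IPv4 address simply does not match.
        return False


print(ipv4_match("68.12.34.56", "68.0.0.0/8"))  # True
print(ipv4_match("69.12.34.56", "68.0.0.0/8"))  # False
```

A `/8` block fixes only the first octet, so every address that starts with `68.` matches `68.0.0.0/8`.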

Run the following query to count the distinct sessions in the first hour of the data whose `forwarded_for` address falls within the `68.0.0.0/8` block.

In [None]:
sql='''
SELECT COUNT(DISTINCT "session") AS "sessions"
FROM "example-koalas-ip"
WHERE IPV4_MATCH("forwarded_for",'68.0.0.0/8')
AND TIME_IN_INTERVAL("__time",'2019-08-25/PT1H')
'''

display.sql(sql)

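A filter can also be applied to individual aggregations.

Run the following query to count unique sessions in hourly buckets, with a separate column for each of three `/8` subnets. Because the interval starts at 01:30, the first and last hourly buckets each cover only 30 minutes.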

In [None]:
sql='''
SELECT
  TIME_FLOOR(__time,'PT1H') AS "timebucket",
  COUNT(DISTINCT "session") FILTER (WHERE IPV4_MATCH("forwarded_for",'172.0.0.0/8')) AS "sessions_172",
  COUNT(DISTINCT "session") FILTER (WHERE IPV4_MATCH("forwarded_for",'174.0.0.0/8')) AS "sessions_174",
  COUNT(DISTINCT "session") FILTER (WHERE IPV4_MATCH("forwarded_for",'64.0.0.0/8')) AS "sessions_64"
FROM "example-koalas-ip"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T01:30/PT6H')
GROUP BY 1
'''

display.sql(sql)
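
The same shape of result can be sketched locally in plain Python: one set of sessions per hour and per subnet, so each output column applies its own filter. The rows are made up for illustration, and Python sets stand in for `COUNT(DISTINCT ...)`. This is only a sketch of the logic, not how Druid executes the query.

```python
import ipaddress
from collections import defaultdict
from datetime import datetime

# Made-up rows standing in for the table: (timestamp, forwarded_for, session).
rows = [
    (datetime(2019, 8, 25, 1, 35), "172.16.0.1", "s1"),
    (datetime(2019, 8, 25, 1, 50), "172.16.0.2", "s1"),
    (datetime(2019, 8, 25, 1, 55), "174.1.2.3", "s2"),
    (datetime(2019, 8, 25, 2, 5), "172.20.0.9", "s3"),
    (datetime(2019, 8, 25, 2, 10), "64.9.9.9", "s4"),
]

# One output column per subnet, keyed by its CIDR string.
subnets = {
    cidr: ipaddress.IPv4Network(cidr)
    for cidr in ("172.0.0.0/8", "174.0.0.0/8", "64.0.0.0/8")
}

# hour bucket -> CIDR -> set of distinct sessions seen in that subnet
buckets = defaultdict(lambda: {cidr: set() for cidr in subnets})
for ts, address, session in rows:
    hour = ts.replace(minute=0, second=0, microsecond=0)  # hourly floor
    per_hour = buckets[hour]  # the bucket exists even when no subnet matches
    for cidr, network in subnets.items():
        if ipaddress.IPv4Address(address) in network:
            per_hour[cidr].add(session)

counts = {
    hour: {cidr: len(sessions) for cidr, sessions in columns.items()}
    for hour, columns in sorted(buckets.items())
}
for hour, columns in counts.items():
    print(hour.isoformat(), columns)
```

Session `s1` appears twice in the first hour but is counted once, which is the effect of `DISTINCT` in the query.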

## Extracting elements of an IP address

Use the `IPV4_STRINGIFY` function to convert an IPv4 address to its dotted-decimal string representation. Its counterpart, `IPV4_PARSE`, converts an address to an integer.

The example below combines `IPV4_STRINGIFY` with a simple regular expression in `REGEXP_EXTRACT` to find the top 10 first octets in the table, ranked by number of sessions.

In [None]:
# A raw string keeps the regex backslashes intact and avoids Python escape warnings.
sql=r'''
SELECT
  REGEXP_EXTRACT(IPV4_STRINGIFY("forwarded_for"),'(\d+)\.(\d+)\.(\d+)\.(\d+)',1) AS "firstOctet",
  COUNT(DISTINCT "session") AS "sessions"
FROM "example-koalas-ip"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T01:30/PT10M')
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'''

display.sql(sql)
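
The conversions and the regular expression used above can be tried locally before running them in Druid. The following sketch uses Python's standard `ipaddress` and `re` modules as stand-ins. The helper name `first_octet` is hypothetical, and using `re.search` to approximate `REGEXP_EXTRACT` is an assumption of this sketch.

```python
import ipaddress
import re

# Same pattern as in the query: four dot-separated groups of digits.
OCTETS = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def first_octet(address: str):
    """Return capture group 1 of the pattern, or None when nothing matches."""
    match = OCTETS.search(address)
    return match.group(1) if match else None


# Integer form (the idea behind parsing an address) and back to a string.
as_int = int(ipaddress.IPv4Address("68.12.34.56"))
print(as_int)
print(str(ipaddress.IPv4Address(as_int)))

print(first_octet("68.12.34.56"))
```

Changing `group(1)` to `group(2)` returns the second octet, which corresponds to passing `2` as the group index in the SQL function.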

Use these functions, or their native (JSON) equivalents, at ingestion time to enrich the data before it is queried.

An important technique, especially at scale, is to use Apache DataSketches for high-performance, approximate operations on network data.

Run the ingestion task below to create a summarised table of the source data.

- Each row in the table represents a 15-minute bucket for one combination of first and second octet in `forwarded_for`
- Every row includes a Theta sketch for each of the `forwarded_for` and `remote_address` IP addresses
- Every row also includes a HyperLogLog (HLL) sketch of the session IDs
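
The shape of that rollup can be sketched locally in plain Python. The rows are made up, the helper `floor_15_minutes` is hypothetical, and exact Python sets stand in for the Theta and HLL sketches, so this shows only how rows are grouped and not how sketches are stored.

```python
from collections import defaultdict
from datetime import datetime


def floor_15_minutes(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 15-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


# Made-up rows: (timestamp, forwarded_for, remote_address, session).
rows = [
    (datetime(2019, 8, 25, 1, 31), "68.12.34.56", "10.0.0.1", "s1"),
    (datetime(2019, 8, 25, 1, 44), "68.12.99.1", "10.0.0.2", "s1"),
    (datetime(2019, 8, 25, 1, 46), "68.12.34.56", "10.0.0.1", "s2"),
    (datetime(2019, 8, 25, 1, 47), "68.13.0.7", "10.0.0.3", "s3"),
]

# (time bucket, first octet, second octet) -> exact sets standing in for sketches
rollup = defaultdict(
    lambda: {"forwarded_for": set(), "remote_address": set(), "session": set()}
)
for ts, forwarded_for, remote_address, session in rows:
    first, second = forwarded_for.split(".")[:2]
    key = (floor_15_minutes(ts), first, second)
    rollup[key]["forwarded_for"].add(forwarded_for)
    rollup[key]["remote_address"].add(remote_address)
    rollup[key]["session"].add(session)

for (bucket, first, second), columns in sorted(rollup.items()):
    print(bucket.isoformat(), first, second, {k: len(v) for k, v in columns.items()})
```

Four input rows collapse into three rollup rows here. On real data the reduction depends on how many events share a time bucket and an octet pair.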

In [None]:
# A raw string keeps the regex backslashes intact and avoids Python escape warnings.
sql=r'''
REPLACE INTO "example-koalas-ip-rollup" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_FLOOR(TIME_PARSE("timestamp"),'PT15M') AS "__time",
  REGEXP_EXTRACT(IPV4_STRINGIFY("forwarded_for"),'(\d+)\.(\d+)\.(\d+)\.(\d+)',1) AS "forwarded_for_1",
  REGEXP_EXTRACT(IPV4_STRINGIFY("forwarded_for"),'(\d+)\.(\d+)\.(\d+)\.(\d+)',2) AS "forwarded_for_2",
  DS_THETA("forwarded_for") AS "forwarded_for_theta",
  DS_THETA("remote_address") AS "remote_address_theta",
  DS_HLL("session") AS "session_HLL"
FROM "ext"
GROUP BY 1, 2, 3
PARTITIONED BY DAY
'''

req = sql_client.sql_request(sql)
req.add_context("finalize", "false")
req.add_context("finalizeAggregations", "false")

display.run_task(req)
sql_client.wait_until_ready('example-koalas-ip-rollup')
display.table('example-koalas-ip-rollup')

This summarised table can then be queried as in the following example, which uses the pre-calculated `forwarded_for_1` column and the approximation algorithms behind the [HyperLogLog](https://druid.apache.org/docs/27.0.0/querying/sql-aggregations#hll-sketch-functions) and [Theta](https://druid.apache.org/docs/27.0.0/querying/sql-aggregations#theta-sketch-functions) sketch functions `APPROX_COUNT_DISTINCT_DS_HLL` and `APPROX_COUNT_DISTINCT_DS_THETA`.

In [None]:
sql='''
SELECT
  "forwarded_for_1",
  APPROX_COUNT_DISTINCT_DS_THETA("forwarded_for_theta") AS "approx-unique-forwarded-for",
  APPROX_COUNT_DISTINCT_DS_HLL("session_HLL") AS "approx-unique-sessions"
FROM "example-koalas-ip-rollup"
WHERE TIME_IN_INTERVAL("__time",'2019-08-25T01:30/PT10M')
GROUP BY 1
ORDER BY 3 DESC
LIMIT 10
'''

display.sql(sql)
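
The query above works because sketches stored in separate rows can be merged before a distinct count is estimated. The following is a simplified K-Minimum-Values (KMV) sketch written for illustration. It is not the Apache DataSketches implementation and does not reflect Druid internals; the class `KMVSketch`, the helper `addresses`, and the choice of `k` are all assumptions of this sketch.

```python
import hashlib
import ipaddress


class KMVSketch:
    """Keep the k smallest hash values seen and estimate distinct counts from them."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.hashes = set()  # at most k values in [0, 1)

    @staticmethod
    def _hash(value: str) -> float:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value: str) -> None:
        self.hashes.add(self._hash(value))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # drop the largest, keep k smallest

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k values: count is exact
        return (self.k - 1) / max(self.hashes)


def addresses(start: int, stop: int) -> list:
    """Generate made-up, consecutive IPv4 addresses for the demonstration."""
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    return [str(ipaddress.IPv4Address(base + i)) for i in range(start, stop)]


bucket_a, bucket_b, direct = KMVSketch(), KMVSketch(), KMVSketch()
for address in addresses(0, 3000):      # addresses seen in one time bucket
    bucket_a.add(address)
for address in addresses(2000, 5000):   # overlapping addresses in another bucket
    bucket_b.add(address)
for address in addresses(0, 5000):      # everything, sketched in one pass
    direct.add(address)

merged = bucket_a.merge(bucket_b)
print("true distinct: 5000, estimate:", round(merged.estimate()))
print("merge equals single pass:", merged.hashes == direct.hashes)
```

Merging the two per-bucket sketches yields the same state as sketching all addresses in one pass, which is why pre-aggregated rows can be combined at query time without double-counting addresses that appear in more than one bucket.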

## Clean up

Run the following cell to remove the two tables used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-ip")
druid.datasources.drop("example-koalas-ip-rollup")

## Summary

* IP functions in Druid allow filtering and parsing of IPv4 addresses at both ingestion time and query time
* For network data at scale, Apache DataSketches help minimise table sizes and query processing by replacing exact distinct counts with approximations

## Learn more

* Refer to the documentation on the available IP functions for both [SQL](https://druid.apache.org/docs/27.0.0/querying/sql-scalar#ip-address-functions) and [native](https://druid.apache.org/docs/latest/querying/math-expr#ip-address-functions) queries
* Try the notebook on solving for [COUNT DISTINCT](./03-approxCountDistinct.ipynb) at scale using Apache DataSketches
* Run through the notebook on creating Apache DataSketches [at ingestion time](../02-ingestion/03-sketchIngestion.ipynb)