# Work with string data using scalar functions
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This notebook walks through some examples of [scalar string functions](https://druid.apache.org/docs/latest/querying/sql-scalar#string-functions) being used in queries and during ingestion.

## Prerequisites

This tutorial works with Druid 27.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and the learning environment.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Fall back to localhost when DRUID_HOST is not set in the environment.
if 'DRUID_HOST' not in os.environ:
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

### Load example data

Run the following cell to create a table called `example-koalas-strings`. Only the specific dimensions that we need for this tutorial are ingested.

When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-koalas-strings" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "referrer",
  "event_type",
  "event_subtype",
  "city",
  "os",
  "continent",
  "country",
  "browser",
  "session",
  "session_length",
  "screen",
  "loaded_image"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-strings')
display.table('example-koalas-strings')

### Import additional modules

Run the following cell to import additional Python modules that you will use as part of the notebook.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

## Filter rows using string functions

In this part of the notebook, you'll see the use of:

* Pattern matches with `LIKE` and `REGEXP_LIKE`.
* Searches with `CONTAINS_STRING` and `ICONTAINS_STRING`.

Run the next cell to count the events recorded in the table whose referrer contains "google", using the `LIKE` filter.

In [None]:
sql='''
SELECT
  COUNT(*) AS "events"
FROM "example-koalas-strings"
WHERE "referrer" LIKE '%google%'
AND TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
'''

display.sql(sql)
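
To make the wildcard semantics concrete, the sketch below mirrors a `LIKE` pattern in plain Python, outside Druid. It assumes the standard SQL wildcards (`%` for any run of characters, `_` for exactly one character), is case-sensitive, and ignores `ESCAPE` clauses. The helper `like_to_regex` and the sample referrers are hypothetical and are not part of the Druid client.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a regular expression (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    # DOTALL lets % span newlines; callers use fullmatch because LIKE tests the whole value.
    return re.compile("".join(parts), re.DOTALL)

# Illustrative values only; they are not taken from the data set.
referrers = ["https://www.google.com/", "https://example.com/", "Direct"]
google_like = like_to_regex("%google%")
matches = [r for r in referrers if google_like.fullmatch(r)]
print(matches)
```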

Alternatively, use the `CONTAINS_STRING` (case-sensitive) and `ICONTAINS_STRING` (case-insensitive) functions.

Behind the scenes, these two functions use the native [`search`-type filter](https://druid.apache.org/docs/latest/querying/filters/#search-filter).

The cell below uses `CONTAINS_STRING` to produce two session counts: one where the referrer contains "google", and one where it does not.

In [None]:
sql='''
SELECT
  COUNT(DISTINCT "session") FILTER (WHERE CONTAINS_STRING("referrer",'google')) AS "google_referred_sessions",
  COUNT(DISTINCT "session") FILTER (WHERE NOT(CONTAINS_STRING("referrer",'google'))) AS "not_google_referred_sessions"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
'''

display.sql(sql)
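
The difference between the two functions comes down to case handling, which the following plain-Python sketch illustrates. The function names simply echo the SQL ones, the sample referrers are made up, and treating a missing value as "no match" is an assumption of the sketch; how Druid treats null referrers depends on its null-handling configuration.

```python
def contains_string(value, needle):
    """Case-sensitive substring test, analogous in spirit to CONTAINS_STRING."""
    # Assumption: a missing value never matches.
    return value is not None and needle in value

def icontains_string(value, needle):
    """Case-insensitive substring test, analogous in spirit to ICONTAINS_STRING."""
    return value is not None and needle.lower() in value.lower()

# Hypothetical referrer values for illustration.
referrers = ["https://www.Google.com/", "https://www.google.com/", None]
sensitive = [contains_string(r, "google") for r in referrers]
insensitive = [icontains_string(r, "google") for r in referrers]
print(sensitive)
print(insensitive)
```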

Advanced filtering is possible using regular expressions via `REGEXP_LIKE`.

The next cell contains a SQL statement that uses regular expressions to count error events that reference an insecure web or file URI, and separately those that reference a secure (HTTPS) URI. The counts are grouped by hour.

In [None]:
sql='''
SELECT
  TIME_FLOOR(__time,'PT1H') AS "time",
  COUNT(*) FILTER (WHERE REGEXP_LIKE("event_subtype",'ReferenceError:.*(http|file):.*')) AS "suspicious_errors",
  COUNT(*) FILTER (WHERE REGEXP_LIKE("event_subtype",'ReferenceError:.*(https):.*')) AS "secure_suspicious_errors"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T12/PT6H')
GROUP BY 1
'''

display.sql(sql)
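
The two patterns differ only in the scheme they accept, and because each scheme must be followed directly by `:`, an `https:` URI does not satisfy the `http|file` pattern. The sketch below checks this with Python's `re` module, assuming that `REGEXP_LIKE` may match anywhere inside the value (which `re.search` mirrors). The sample `event_subtype` values are invented for illustration.

```python
import re

# The two patterns from the query above.
insecure = re.compile(r"ReferenceError:.*(http|file):.*")
secure = re.compile(r"ReferenceError:.*(https):.*")

# Hypothetical event_subtype values, made up for illustration.
samples = [
    "ReferenceError: http://example.com/a.js is not defined",
    "ReferenceError: https://example.com/a.js is not defined",
    "ReferenceError: file:///tmp/a.js is not defined",
    "TypeError: http://example.com/a.js failed",
]

# Each tuple is (matches the insecure pattern, matches the secure pattern).
results = [(bool(insecure.search(s)), bool(secure.search(s))) for s in samples]
for sample, flags in zip(samples, results):
    print(flags, sample)
```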

## Manipulate string values

In this part of the notebook, you'll see the use of:

* Manipulation with `UPPER`, `LOWER`, and `REVERSE`.
* Concatenation with the `||` operator and the `CONCAT` and `TEXTCAT` functions.
* Replacements with `REPLACE` and `REGEXP_REPLACE`.
* Padding with `LPAD` and `RPAD`.
* Generating new text with `REPEAT`.
* Trimming text with `TRIM`.

Run the cell below to see some examples of simple string manipulation.

In [None]:
sql='''
SELECT DISTINCT
  UPPER("city") AS "CITY",
  LOWER("os") AS "os",
  REVERSE("country") AS "yrtnuoc"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/PT2S')
'''

display.sql(sql)

Run this cell for some examples of `CONCAT` and `TEXTCAT` being used, as well as the `||` operator, to concatenate field values.

Interestingly, these SQL functions all use the same underlying native `concat` function.

In [None]:
sql='''
SELECT DISTINCT
  UPPER("continent") || ' saw ' || COUNT(*) || ' events.' AS "i-am-an-operator",
  CONCAT(LOWER("continent"), ' saw ',COUNT(*),' events.') AS "i-am-a-function",
  TEXTCAT('Continent: ',"continent") AS "and-i-only-take-two-arguments"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/P1D')
AND "continent" IS NOT NULL
GROUP BY "continent"
'''

display.sql(sql)

A Java string format pattern can be applied to the data by using `STRING_FORMAT`.

Run the cell below to see how this function can be applied to the results of a `GROUP BY`.

The function's first parameter is the format to apply. Here, the `%S` format applies upper-case formatting, and `%,d` adds locale-specific grouping separators to a number. Then come the arguments, `continent` and `COUNT(*)`, to which these two formats are applied.

In [None]:
sql='''
SELECT DISTINCT
  STRING_FORMAT('%S saw %,d events.',"continent",COUNT(*)) AS "results"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/P1D')
GROUP BY "continent"
'''

display.sql(sql)
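
For readers more familiar with Python than with Java format strings, the following sketch produces the same shape of output. It is an analogy rather than what Druid runs: Python's `,` format option always inserts commas, whereas Java's `,` flag follows the locale. The `describe` helper and its sample arguments are hypothetical.

```python
def describe(continent: str, events: int) -> str:
    # %S  -> the string argument in upper case
    # %,d -> the integer argument with grouping separators
    return f"{continent.upper()} saw {events:,d} events."

# Made-up values, not results from the data set.
print(describe("Oceania", 1234567))
```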

Using `REGEXP_REPLACE` and `REPLACE` you can change the contents of string dimensions at query time or at ingestion time.

The following query uses a simple `REPLACE` to change "IE" to "Internet Explorer". The results are loaded to a dataframe and then plotted.

In [None]:
sql='''
SELECT
  REPLACE("browser",'IE','Internet Explorer') AS "browser",
  COUNT(DISTINCT "session") AS "sessions"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/P1D')
GROUP BY 1
ORDER BY 2 DESC
'''

df = pd.DataFrame(sql_client.sql(sql))
df.plot.bar(x='browser', y='sessions')
plt.xticks(rotation=45, ha='right')
plt.yscale("log")
plt.gca().get_legend().remove()
plt.show()

Using `REGEXP_REPLACE` you can apply more advanced replacements.

The query below uses `REGEXP_REPLACE` to extract portions of the URL and to construct a new string value.

The regular expression contains a number of capture groups, and the replacement string refers to these using `$`.

In [None]:
sql='''
SELECT
  REGEXP_REPLACE("loaded_image",'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?','Path $5 was requested over $2') AS "image",
  COUNT(*) AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/PT2H')
AND "loaded_image" != 'Custom image'
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)
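
To see which capture group holds which part of the URL, it can help to run the same pattern locally. The sketch below uses Python's `re` module, where group references are written `\2` and `\5` rather than `$2` and `$5`. The `describe_image` helper and the sample URL are hypothetical.

```python
import re

# Same URI-splitting pattern as the query above, written as a Python raw string.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def describe_image(uri: str) -> str:
    # Group 2 is the scheme (for example "https") and group 5 is the path.
    return URI_PATTERN.sub(r"Path \5 was requested over \2", uri, count=1)

# A made-up URL for illustration.
print(describe_image("https://example.com/images/koala.jpg"))
```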

The `LPAD` and `RPAD` functions pad a string on the left or right, respectively, until it reaches a given length.

The query below uses `LPAD` in the `ORDER BY` clause to sort the results of the query.

In the data set, the `event_subtype` field records, in a string, the amount of the image that was cleared on [Koalas to the max](https://www.koalastothemax.com/). This is recorded in the data as an event with the type "PercentClear".

`LPAD` is used to construct a three-character value for the percentage so that sorting the strings gives the same order as sorting the numbers. The first parameter is the field to pad, the second is the length of the returned string, and the third is the character to use for padding.

`COALESCE` ensures that rows with no recorded percentage sort as though the value were 0.

In [None]:
sql='''
SELECT
  "event_subtype" AS "Percent Clear",
  COUNT(*) AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/P1D')
AND "event_type" = 'PercentClear'
GROUP BY 1
ORDER BY LPAD(COALESCE("event_subtype",'0'),3,'0')
'''

df = pd.DataFrame(sql_client.sql(sql))
df.plot.bar(x='Percent Clear', y='events')
plt.gca().get_legend().remove()
plt.show()
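
The reason padding matters is that strings sort character by character, so "100" comes before "25". The sketch below reproduces the effect in plain Python with made-up values, using `str.rjust` as a stand-in for `LPAD` and `value or "0"` as a stand-in for `COALESCE`. One difference to keep in mind: `rjust` only pads, while `LPAD` also truncates values longer than the requested length.

```python
# Hypothetical event_subtype values; None stands in for a missing percentage.
values = ["100", None, "25", "5", "10"]

def sort_key(value):
    # COALESCE(value, '0') followed by LPAD(..., 3, '0').
    return (value or "0").rjust(3, "0")

# Plain string sort: compares character by character.
plain = sorted(v for v in values if v is not None)
# Padded sort: every key has three characters, so string order equals numeric order.
padded = sorted(values, key=sort_key)

print(plain)
print(padded)
```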

The next cell uses `REPEAT` to generate a fun table of results.

Here, the `SUBSTRING` function (also available as `SUBSTR`) is used to get a particular portion of a string based on the hour in the timestamp of the event.

In [None]:
sql='''
SELECT
  REPEAT(SUBSTRING('EO',MOD(TIME_EXTRACT(__time, 'HOUR'),2)+1,1),TIME_EXTRACT(__time, 'HOUR')) AS "nice",
  COUNT(*) AS "Count"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/P1D')
GROUP BY 1
ORDER BY 2 DESC
'''

display.sql(sql)
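
The nested expression is easier to follow when unpacked. In the sketch below, written in plain Python, the hypothetical `nice` function takes the hour and returns the string that the `REPEAT(SUBSTRING(...))` expression builds: `E` for even hours and `O` for odd hours, repeated once per hour. Note that `SUBSTRING` positions start at 1 while Python indexes start at 0.

```python
def nice(hour: int) -> str:
    # SUBSTRING('EO', MOD(hour, 2) + 1, 1): position 1 is 'E' (even), position 2 is 'O' (odd).
    letter = "EO"[hour % 2]
    # REPEAT(letter, hour) corresponds to string multiplication in Python.
    return letter * hour

for hour in (0, 3, 4):
    print(hour, repr(nice(hour)))
```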

Let's use `REPEAT` with `TRIM` to see how characters can be removed from the beginning and/or end of a string.

Run the cell below to re-ingest the example data, adding a new dimension that uses `REPEAT`.

In [None]:
sql='''
REPLACE INTO "example-koalas-strings" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "agent_category" VARCHAR, "agent_type" VARCHAR, "browser" VARCHAR, "browser_version" VARCHAR, "city" VARCHAR, "continent" VARCHAR, "country" VARCHAR, "version" VARCHAR, "event_type" VARCHAR, "event_subtype" VARCHAR, "loaded_image" VARCHAR, "adblock_list" VARCHAR, "forwarded_for" VARCHAR, "language" VARCHAR, "number" VARCHAR, "os" VARCHAR, "path" VARCHAR, "platform" VARCHAR, "referrer" VARCHAR, "referrer_host" VARCHAR, "region" VARCHAR, "remote_address" VARCHAR, "screen" VARCHAR, "session" VARCHAR, "session_length" BIGINT, "timezone" VARCHAR, "timezone_offset" VARCHAR, "window" VARCHAR))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "referrer",
  "event_type",
  "event_subtype",
  "city",
  "os",
  "continent",
  "country",
  REPEAT('X',5) || "country" || REPEAT('X',5) AS "XXXXXcountryXXXXX",
  "browser",
  "session",
  "session_length",
  "screen",
  "loaded_image"
FROM "ext"
PARTITIONED BY DAY
'''

display.run_task(sql)
sql_client.wait_until_ready('example-koalas-strings')
display.table('example-koalas-strings')

The SQL below shows the three variants of `TRIM` in action on the new `XXXXXcountryXXXXX` column.

In [None]:
sql='''
SELECT
  TRIM(LEADING 'X' FROM "XXXXXcountryXXXXX") AS "leadingTrim",
  TRIM(TRAILING 'X' FROM "XXXXXcountryXXXXX") AS "trailingTrim",
  TRIM(BOTH 'X' FROM "XXXXXcountryXXXXX") AS "bothTrim",
  COUNT(*) AS "Count"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T01/PT30S')
GROUP BY 1, 2, 3
'''

display.sql(sql)
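
Python's `lstrip`, `rstrip`, and `strip` behave much like the three `TRIM` variants, so they make a convenient way to preview the result. The sketch below uses a made-up country value wrapped in the same way as the ingested column.

```python
# Build a value shaped like the XXXXXcountryXXXXX column; "Mexico" is just an example.
value = "X" * 5 + "Mexico" + "X" * 5

leading = value.lstrip("X")   # TRIM(LEADING 'X' FROM ...)
trailing = value.rstrip("X")  # TRIM(TRAILING 'X' FROM ...)
both = value.strip("X")       # TRIM(BOTH 'X' FROM ...)

print(leading, trailing, both)
```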

## Use parts of a string

In this part of the notebook, you'll see the use of:

* Extracting portions of a string with `RIGHT`, `LEFT`, `SUBSTRING`, and `REGEXP_EXTRACT`.
* Finding text with `POSITION`.

The next cell uses the `POSITION`, `RIGHT`, and `LEFT` functions to find the horizontal and vertical screen size of the user.

In [None]:
sql='''
SELECT
  LEFT("screen",POSITION('x' in "screen")-1) AS "x-size",
  RIGHT("screen",LENGTH("screen")-POSITION('x' in "screen")) AS "y-size"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
LIMIT 10
'''

display.sql(sql)
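
The arithmetic in the `LEFT` and `RIGHT` calls can be checked with ordinary string slicing. The sketch below is plain Python with a hypothetical `split_screen` helper; it assumes the value always contains an `x` and raises `ValueError` otherwise. `POSITION` counts from 1, so the sketch adds 1 to Python's zero-based index.

```python
def split_screen(screen: str):
    # POSITION('x' IN screen): 1-based position of the separator.
    position = screen.index("x") + 1
    # LEFT(screen, position - 1): everything before the separator.
    x_size = screen[: position - 1]
    # RIGHT(screen, LENGTH(screen) - position): everything after the separator.
    right_length = len(screen) - position
    y_size = screen[len(screen) - right_length:]
    return x_size, y_size

# A made-up screen value for illustration.
print(split_screen("1920x1080"))
```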

Alternatively, you might use a regular expression.

Here, `REGEXP_EXTRACT` is combined with `STRING_FORMAT` to display the average screen size by hour.

In [None]:
sql='''
SELECT
  TIME_FLOOR("__time",'PT1H') AS "interval",
  STRING_FORMAT('%.3f x %.3f',AVG(REGEXP_EXTRACT("screen",'([0-9]*)x([0-9]*)',1)),AVG(REGEXP_EXTRACT("screen",'([0-9]*)x([0-9]*)',2))) AS "size-average"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T0/PT12H')
GROUP BY 1
'''

display.sql(sql)
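
The third argument of `REGEXP_EXTRACT` selects a capture group, which corresponds to `match.group(n)` in Python. The sketch below repeats the calculation locally on made-up screen values. The query leaves the conversion of the extracted strings to numbers to Druid, while the sketch does it explicitly with `int`. The `average_size` helper is hypothetical and assumes at least one value can be parsed.

```python
import re

# Same pattern as the query: group 1 is the width, group 2 is the height.
SCREEN = re.compile(r"([0-9]*)x([0-9]*)")

def average_size(screens):
    widths, heights = [], []
    for screen in screens:
        match = SCREEN.search(screen)
        # Skip values where either side of the 'x' is missing.
        if match and match.group(1) and match.group(2):
            widths.append(int(match.group(1)))
            heights.append(int(match.group(2)))
    # Same '%.3f x %.3f' format as the STRING_FORMAT call.
    return "%.3f x %.3f" % (sum(widths) / len(widths), sum(heights) / len(heights))

print(average_size(["1920x1080", "1280x720"]))
```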

Run the following cell to see another regular expression example, here returning the filename from the image URL in the data.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))') AS "filename",
  COUNT(*) AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
GROUP BY 1
'''

df = pd.DataFrame(sql_client.sql(sql))
df.plot.barh(x='filename', y='events')
plt.show()
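
The filename pattern relies on a lookahead: it accepts a `name.ext` run only when it is followed by a query string or by the end of the value, which is why host names such as `example.com` are skipped. The sketch below tries an equivalent pattern, rewritten as a Python raw string without the redundant escapes, on made-up URLs. The `filename` helper is hypothetical.

```python
import re

# Equivalent to the pattern in the query, written as a raw string.
FILENAME = re.compile(r"[^/&?]+\.\w{3,4}(?=([?&].*$|$))")

def filename(uri: str):
    match = FILENAME.search(uri)
    # With no group index, REGEXP_EXTRACT returns the whole match, i.e. group 0.
    return match.group(0) if match else None

# Made-up values for illustration.
for uri in ("https://example.com/images/koala.jpg?size=large",
            "https://example.com/images/koala.jpg",
            "Custom image"):
    print(uri, "->", filename(uri))
```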

The next cell contains a SQL statement that uses a regular expression with multiple capture groups, extracting a different group in each column.

In [None]:
sql='''
SELECT
  REGEXP_EXTRACT("loaded_image",'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?',2) AS "scheme",
  REGEXP_EXTRACT("loaded_image",'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?',5) AS "path",
  COUNT(DISTINCT "browser") AS "events"
FROM "example-koalas-strings"
WHERE TIME_IN_INTERVAL(__time,'2019-08-25T14/PT1H')
GROUP BY 1, 2
'''

df = pd.DataFrame(sql_client.sql(sql))
df_group=df.groupby(['path','scheme']).sum().unstack()
df_group.plot.bar(stacked=True)
plt.xticks(rotation=45, ha='right')
plt.show()

## Clean up

Run the following cell to remove the table used in this notebook from the database.

In [None]:
druid.datasources.drop("example-koalas-strings")

## Summary

* You can use scalar functions in your `SELECT` and `WHERE` clauses at query time and in SQL-based ingestion.
* SQL functions have native equivalents that you can use in JSON-based ingestion.

## Learn more

* Read the documentation for the full list of [scalar string functions](https://druid.apache.org/docs/latest/querying/sql-scalar#string-functions).
* Look for some common string functions in your queries and create a table where these functions have been applied at ingestion time.