# PIVOT and UNPIVOT
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->


This tutorial demonstrates how to work with [PIVOT and UNPIVOT SQL Operators](https://druid.apache.org/docs/latest/querying/sql#pivot). 

PIVOT and UNPIVOT are experimental features introduced in Druid 29.0.0.

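In this tutorial you perform the following tasks:

- Ingest sample Wikipedia data.
- Summarize robotic and human changes by channel with a regular aggregate query.
- Use PIVOT to turn the values of a column into separate columns.
- Use UNPIVOT to turn columns into rows.
- Clean up the table that the tutorial creates.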

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

The following cells prepare the notebook and learning environment for use.

### Set up and connect to the learning environment

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If the connection succeeds, the output shows the Druid version number.

In [None]:
import druidapi
import os
import json
import time
from datetime import datetime, timedelta

# Get the Druid host from the environment if available
if 'DRUID_HOST' not in os.environ.keys():
    druid_host = "http://localhost:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

# Get the Kafka host from the environment if available (not used in this notebook).
# A Kafka bootstrap address is host:port, without a URL scheme.
if 'KAFKA_HOST' not in os.environ.keys():
    kafka_host = "localhost:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

print(f"Opening a connection to {druid_host}.")

# Set up the Druid API clients
druid = druidapi.jupyter_client(druid_host)
display_client = druid.display
sql_client = druid.sql
status_client = druid.status
rest_client = druid.rest
status_client.version

## Ingest sample data to pivot

The following cell ingests the sample Wikipedia data into the table "example-pivot-unpivot":

In [None]:
sql='''
REPLACE INTO "example-pivot-unpivot" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "flags",
  "isUnpatrolled",
  "page",
  "diffUrl",
  "added",
  "comment",
  "commentLength",
  "isNew",
  "isMinor",
  "delta",
  "isAnonymous",
  "user",
  "deltaBucket",
  "deleted",
  "namespace",
  "cityName",
  "countryName",
  "regionIsoCode",
  "metroCode",
  "countryIsoCode",
  "regionName"
FROM "ext"
PARTITIONED BY DAY
'''
display_client.run_task(sql)
sql_client.wait_until_ready('example-pivot-unpivot')
display_client.table('example-pivot-unpivot')

## Robotic updates to Wikipedia by channel
To determine which channels receive the most robotic changes, you could use a regular aggregate query like this one:

In [None]:
sql='''
SELECT
  "channel",
  "isRobot",
  SUM(added) total_added,
  SUM(deleted) total_deleted
FROM "example-pivot-unpivot" 
GROUP BY 1, 2
'''
display_client.sql(sql)

## PIVOT example
The prior result is long, which makes it hard to see which channels have the most updates while still distinguishing between additions to and deletions from the Wikipedia pages.
One way to shorten the result while keeping all the detail is to pivot it, so that the rows for each value of a particular column, the pivot column, become columns.

PIVOT is a SQL operator that follows the table in the FROM clause:

```
    FROM "example-pivot-unpivot" 
    
    PIVOT ( SUM(added) as added, 
            SUM(deleted) as deleted 
            
            FOR "isRobot" IN ('true' as robot, 'false' as human)
          )
```
- It lists a set of aggregations to calculate: `SUM(added) as added, SUM(deleted) as deleted`
- The FOR clause indicates the pivot column: `"isRobot"`
- The IN clause lists the pivot values of the column and what to call them: `IN ('true' as robot, 'false' as human)`

The operation produces one column for each combination of pivot value and aggregation, named in the form `<pivot_value_name>_<aggregation_name>`. In this case the resulting pivot columns are:
```
"robot_added"
"human_added"
"robot_deleted"
"human_deleted"
```
These columns can now form part of the SELECT clause in the query.
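
To make the naming of the pivot columns concrete, the following plain-Python sketch emulates the PIVOT fragment above on a few in-memory rows. It is only an illustration, not how Druid evaluates the operator: the `toy_rows` data and the `pivot_sum` helper are hypothetical, and the sketch groups by a single column for simplicity.

```python
# Hypothetical toy rows for illustration; they are not taken from the Wikipedia sample data.
toy_rows = [
    {"channel": "#en", "isRobot": "true", "added": 10, "deleted": 2},
    {"channel": "#en", "isRobot": "false", "added": 5, "deleted": 1},
    {"channel": "#en", "isRobot": "false", "added": 7, "deleted": 0},
    {"channel": "#fr", "isRobot": "false", "added": 3, "deleted": 4},
]


def pivot_sum(rows, group_col, pivot_col, pivot_values, measures):
    """Sum each measure per group and per pivot value (hypothetical helper).

    pivot_values maps a pivot value to its alias, for example {"true": "robot"}.
    """
    result = {}
    for row in rows:
        # Every output row starts with all pivot columns set to None (SQL NULL).
        group = result.setdefault(
            row[group_col],
            {f"{alias}_{m}": None for alias in pivot_values.values() for m in measures},
        )
        alias = pivot_values.get(row[pivot_col])
        if alias is None:
            continue  # pivot values not listed in IN (...) are not aggregated
        for m in measures:
            column = f"{alias}_{m}"  # <pivot_value_name>_<aggregation_name>
            group[column] = (group[column] or 0) + row[m]
    return result


pivoted = pivot_sum(
    toy_rows,
    group_col="channel",
    pivot_col="isRobot",
    pivot_values={"true": "robot", "false": "human"},
    measures=["added", "deleted"],
)
for channel, columns in pivoted.items():
    print(channel, columns)
```

In these toy rows the `#fr` channel has no robot edits, so its `robot_added` and `robot_deleted` values stay `None`, the Python counterpart of a SQL NULL.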


The following query uses the results of the pivot to aggregate totals for the pivot columns by channel. A pivoted aggregation is NULL wherever no rows with "isRobot"='true' (or 'false') contribute to it, so the query applies COALESCE to all the pivoted metrics before using them. It then calculates the ratio of human changes to total changes for each channel and sorts the result on this ratio in ascending order, so that the channels with the highest proportion of robot updates are listed first.

The result is much cleaner than the prior query and it is easy to determine which channels have the most robotic activity.
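
The NULL handling and the ratio described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: `coalesce` and `safe_divide` are hypothetical stand-ins for the SQL functions, the sample row is made up, and `safe_divide` is assumed here to return `None` when the denominator is zero.

```python
def coalesce(value, default=0):
    """Return default when value is None, like SQL COALESCE(value, default)."""
    return default if value is None else value


def safe_divide(numerator, denominator):
    """Divide, but return None instead of raising when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


# Hypothetical pivoted row for a channel that has no robot edits.
row = {"robot_added": None, "robot_deleted": None, "human_added": 3, "human_deleted": 4}

# Replace None with 0 first; adding None to a number would fail in Python,
# just as NULL would propagate through the sums in SQL.
human = coalesce(row["human_added"]) + coalesce(row["human_deleted"])
robot = coalesce(row["robot_added"]) + coalesce(row["robot_deleted"])

human_ratio = safe_divide(human * 1.0, human + robot)
print(human_ratio)
```

A channel edited only by humans gets a ratio of 1.0, so it sorts last when the result is ordered by the ratio in ascending order.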

In [None]:
sql='''
SELECT "channel", 
        SUM("robot_added")       AS "added_robot", 
        SUM("robot_deleted")     AS "deleted_robot", 
        SUM("human_added")       AS "added_human",
        SUM("human_deleted")     AS "deleted_human", 
        
        SAFE_DIVIDE( SUM("human_deleted" + "human_added") * 1.0 , 
                     SUM( "robot_deleted" + "robot_added" + "human_deleted" + "human_added" )
                     
                    )            AS "human_ratio"
                    
FROM
(
    SELECT
      "channel", 
       COALESCE("robot_added",0) AS "robot_added", 
       COALESCE("human_added",0) AS "human_added", 
       COALESCE("robot_deleted",0) AS "robot_deleted", 
       COALESCE("human_deleted",0) AS "human_deleted"
    FROM "example-pivot-unpivot" 
    
    PIVOT ( SUM(added) as added, 
            SUM(deleted) as deleted 
            
            FOR "isRobot" IN ('true' as robot, 'false' as human)
          )
)x
GROUP BY 1
ORDER BY 6
'''
display_client.sql(sql)

## UNPIVOT example
Given the results of the prior query, we can now investigate the sources of the updates in the channels with the most robotic updates.
```
  UNPIVOT ( "changes"  FOR "action" IN ("added", "deleted") )
```
This example uses UNPIVOT to:
- take the values of multiple columns: `"added","deleted"`
- incorporate them into a single column called the "values column": `"changes"`
- while expanding the result with a "names column" containing the names of the unpivoted columns: `"action"`

The column names become row values.
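
The effect of the UNPIVOT fragment above can be sketched in plain Python on a few in-memory rows. This is only an illustration of the reshaping, not of how Druid evaluates the operator: the `wide_rows` data and the `unpivot` helper are hypothetical.

```python
# Hypothetical toy rows for illustration; they are not taken from the Wikipedia sample data.
wide_rows = [
    {"user": "bot_a", "added": 10, "deleted": 2},
    {"user": "bot_b", "added": 7, "deleted": 5},
]


def unpivot(rows, value_col, name_col, source_cols):
    """Turn each source column into its own row (hypothetical helper)."""
    result = []
    for row in rows:
        # Columns that are not unpivoted are copied to every output row.
        kept = {k: v for k, v in row.items() if k not in source_cols}
        for col in source_cols:
            # The column name becomes a value in the names column,
            # and the column's value moves to the values column.
            result.append({**kept, name_col: col, value_col: row[col]})
    return result


long_rows = unpivot(
    wide_rows,
    value_col="changes",
    name_col="action",
    source_cols=["added", "deleted"],
)
for row in long_rows:
    print(row)
```

Each input row yields one output row per unpivoted column, so the two toy rows become four rows, each with a `"user"`, an `"action"`, and a `"changes"` value.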

The following query finds the most active robot users, measured by the total text added to or deleted from the Wikipedia pages, in the channel `'#it.wikipedia'`, which had a human ratio of about 0.50 (about half of the changes came from robots):

In [None]:
sql='''
SELECT "user",
       "action",
       SUM("changes") "total_changes"
FROM "example-pivot-unpivot"

UNPIVOT ( "changes"  FOR "action" IN ("added", "deleted") )

WHERE "channel"='#it.wikipedia' AND "isRobot"='true'
GROUP BY 1,2
ORDER BY 3 DESC 
'''
display_client.sql(sql)

## Clean up

Run the following cell to remove the table created in this notebook from the database.

In [None]:
druid.datasources.drop("example-pivot-unpivot")


## Summary

You learned how to use the experimental PIVOT and UNPIVOT operators.