# PIVOT and UNPIVOT
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->


This tutorial demonstrates how to work with the [PIVOT](https://druid.apache.org/docs/latest/querying/sql#pivot) and [UNPIVOT](https://druid.apache.org/docs/latest/querying/sql#unpivot) SQL operators.

Note: PIVOT and UNPIVOT are [experimental features](https://druid.apache.org/docs/latest/development/experimental).

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable if it is set; otherwise use localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display_client = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called "example-wiki-pivot-unpivot". When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-wiki-pivot-unpivot" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "added",
  "user",
  "deleted"
FROM "ext"
PARTITIONED BY DAY
'''
display_client.run_task(sql)
sql_client.wait_until_ready('example-wiki-pivot-unpivot')
display_client.table('example-wiki-pivot-unpivot')

## PIVOT

PIVOT is a SQL operator that reduces rows in a result set by converting rows into columns.
```
    FROM <data source>
    
    PIVOT ( <list of aggregations>
            
            FOR <pivot_column> IN (<list of values in pivot_column>)
          )
```
- \<list of aggregations\>: a list of aggregate expressions, for example `SUM(added) as added, SUM(deleted) as deleted`
- \<pivot_column\>: the column whose values are turned into new columns
- \<list of values in pivot_column\>: the values to pivot on, in the form `<literal value> as <value_name>,...`, for example `'true' as robot, 'false' as human`

The operation produces columns named \<value_name\>_\<aggregation_name\> for every combination of the specified values and aggregations, for example "robot_added".
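
As a rough illustration of this naming pattern, the following plain-Python sketch builds the column names used by the queries in this tutorial (for example `robot_added`). It does not call Druid, the helper name `pivot_column_names` is hypothetical, and the order in which Druid returns the columns may differ.

```python
from itertools import product

def pivot_column_names(value_names, aggregation_names):
    """Build one column name per (value_name, aggregation_name) pair,
    joined by an underscore, as in "robot_added"."""
    return [f"{value}_{agg}" for value, agg in product(value_names, aggregation_names)]

# Aliases taken from the PIVOT clause used in this tutorial.
columns = pivot_column_names(["robot", "human"], ["added", "deleted"])
print(columns)
```

Two value aliases combined with two aggregation aliases yield four pivoted columns, which matches the four columns selected in the PIVOT query in this tutorial.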

Try this out. 
The next cell runs without PIVOT: 

In [None]:
sql='''
    SELECT
      "channel", "isRobot", "added", "deleted"
    FROM "example-wiki-pivot-unpivot" 
    WHERE TIME_IN_INTERVAL(__time, '2016-06-27/P1D')
    LIMIT 10
'''
display_client.sql(sql)

The next cell uses PIVOT with the same information as above to demonstrate how PIVOT reorganizes it:

In [None]:
sql='''
    SELECT
      "channel", 
       "robot_added", 
       "human_added", 
       "robot_deleted", 
       "human_deleted"
    FROM "example-wiki-pivot-unpivot" 
    
    PIVOT ( SUM(added) as added, 
            SUM(deleted) as deleted 
            
            FOR "isRobot" IN ('true' as robot, 'false' as human)
          )
    WHERE TIME_IN_INTERVAL(__time, '2016-06-27/P1D')
    LIMIT 10
'''
display_client.sql(sql)

Notice that the PIVOT operation:
- moved the "added" and "deleted" values of rows where "isRobot"='true' into two columns called "robot_added" and "robot_deleted"
- moved the "added" and "deleted" values of rows where "isRobot"='false' into two columns called "human_added" and "human_deleted"

Also notice the NULL values wherever the pivot column does not hold the value assigned to the corresponding column name. For example, "robot_added" only has values where "isRobot" is 'true'.
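
To see where the NULL values come from, here is a minimal pure-Python sketch that emulates a PIVOT with SUM aggregations over a few made-up rows. It is not how Druid executes the query: the function `pivot_sum` and the sample rows are hypothetical, and for simplicity it groups by a single key column, whereas the query above also keeps other columns of the table in the grouping.

```python
# Hypothetical sample rows, not taken from the Wikipedia dataset.
rows = [
    {"channel": "#en.wikipedia", "isRobot": "true", "added": 10, "deleted": 1},
    {"channel": "#en.wikipedia", "isRobot": "true", "added": 5, "deleted": 0},
    {"channel": "#fr.wikipedia", "isRobot": "false", "added": 7, "deleted": 2},
]

def pivot_sum(rows, group_key, pivot_column, values, metrics):
    """Emulate PIVOT(SUM(metric) AS metric ... FOR pivot_column IN (values)).

    `values` maps a literal in pivot_column to its alias, e.g. {"true": "robot"}.
    Each group starts with None in every pivoted column, so a group with no
    rows for a given value keeps None (NULL) in that value's columns.
    """
    result = {}
    for row in rows:
        group = result.setdefault(
            row[group_key],
            {f"{alias}_{metric}": None for alias in values.values() for metric in metrics},
        )
        alias = values.get(row[pivot_column])
        if alias is None:
            continue  # value not listed in the IN (...) clause: nothing to add
        for metric in metrics:
            name = f"{alias}_{metric}"
            group[name] = (group[name] or 0) + row[metric]
    return result

pivoted = pivot_sum(
    rows, "channel", "isRobot", {"true": "robot", "false": "human"}, ["added", "deleted"]
)
for channel, columns in pivoted.items():
    print(channel, columns)
```

In this sketch, the group that contains only robot rows ends up with `None` in both human columns, and the group that contains only human rows has `None` in both robot columns.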

## Robotic updates to Wikipedia by channel using normal aggregation

The following example demonstrates PIVOT with aggregation.

To illustrate, first determine which channels receive the most robotic changes by running a normal aggregation query:

In [None]:
sql='''
SELECT
  "channel",
  "isRobot",
  SUM(added) total_added,
  SUM(deleted) total_deleted
FROM "example-wiki-pivot-unpivot" 
WHERE TIME_IN_INTERVAL(__time, '2016-06-27/P1D')
GROUP BY 1, 2
'''
display_client.sql(sql)

## Transform rows with PIVOT 
The query above produces a lengthy output, making it hard to discern which channels have the most updates while still seeing the distinction between additions and deletions to the Wikipedia pages.
You can use the PIVOT operator to transform rows into columns for the distinct values of the pivot column.
PIVOT helps to reorganize the result into fewer rows and more columns while still keeping all the detailed values.

The following query uses aggregation on top of the pivoted columns in order to merge results into one row per channel.

Run the following cell to calculate totals of human and robotic additions and deletions by channel with a ratio of human to total changes. 

In [None]:
sql='''
SELECT "channel", 
        SUM("robot_added")       AS "added_robot", 
        SUM("robot_deleted")     AS "deleted_robot", 
        SUM("human_added")       AS "added_human",
        SUM("human_deleted")     AS "deleted_human", 
        
        SAFE_DIVIDE( SUM("human_deleted" + "human_added") * 1.0 , 
                     SUM( "robot_deleted" + "robot_added" + "human_deleted" + "human_added" )
                     
                    )            AS "human_ratio"
                    
FROM
(
    SELECT
      "channel", 
       COALESCE("robot_added",0) AS "robot_added", 
       COALESCE("human_added",0) AS "human_added", 
       COALESCE("robot_deleted",0) AS "robot_deleted", 
       COALESCE("human_deleted",0) AS "human_deleted"
    FROM "example-wiki-pivot-unpivot" 
    
    PIVOT ( SUM(added) as added, 
            SUM(deleted) as deleted 
            
            FOR "isRobot" IN ('true' as robot, 'false' as human)
          )
    WHERE TIME_IN_INTERVAL(__time, '2016-06-27/P1D')
)x
GROUP BY 1
ORDER BY 6
LIMIT 20
'''
display_client.sql(sql)

The result shows the 20 channels with the lowest "human_ratio", which are the channels with the highest proportion of robot updates.

As the earlier PIVOT example showed, pivoted values can be NULL. The query above calculates the ratio of human to total changes for each channel and sorts the result on this ratio, so the channels with the highest proportion of robot updates are listed first. To calculate this metric even in the presence of NULLs, the query applies COALESCE to all the pivoted metrics.

The result is much cleaner than the output of the prior query, and it is easy to determine which channels have the most robotic activity.

Notice the use of SAFE_DIVIDE in the "human_ratio" calculation, which guards against division by zero for channels with no updates. See [SAFE_DIVIDE](https://druid.apache.org/docs/latest/querying/sql-functions#safe_divide) for more information.
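
The "human_ratio" arithmetic can be sketched for a single row in plain Python. The helpers `coalesce`, `safe_divide`, and `human_ratio` below are hypothetical stand-ins for the SQL functions, not Druid APIs. Returning `None` for a zero denominator is an assumption: depending on Druid's null-handling configuration, SAFE_DIVIDE may return 0 instead.

```python
def coalesce(value, default=0):
    """Return `default` when value is None, like SQL COALESCE(value, default)."""
    return default if value is None else value

def safe_divide(numerator, denominator):
    """Return None instead of raising when the denominator is zero.
    Assumption: models SAFE_DIVIDE under SQL-compatible null handling."""
    if denominator == 0:
        return None
    return numerator / denominator

def human_ratio(robot_added, robot_deleted, human_added, human_deleted):
    """Ratio of human changes to total changes, treating NULL metrics as 0."""
    human = coalesce(human_added) + coalesce(human_deleted)
    total = human + coalesce(robot_added) + coalesce(robot_deleted)
    return safe_divide(human * 1.0, total)

print(human_ratio(15, 1, None, None))       # channel with robot changes only
print(human_ratio(None, None, 7, 2))        # channel with human changes only
print(human_ratio(None, None, None, None))  # channel with no changes at all
```

Without the COALESCE step, adding a NULL metric to a number would make the whole sum NULL in SQL, so a channel updated only by robots would lose its ratio instead of reporting 0.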


## Transform columns with UNPIVOT 
UNPIVOT does the opposite of PIVOT. UNPIVOT turns columns into rows by using the names of the columns being removed as values for a new column called the "names column".
The values of the removed columns are combined into a single column called the "values column".

Given the prior query, you can investigate the sources of the updates in the most robotically updated channels.
The following cell uses the UNPIVOT operation to do just that.

Run the following cell to find the most active robot users, as measured by total adding or deleting activity, within the channel `'#it.wikipedia'`, which has a human ratio of ~0.50 (about half robots). The SQL statement uses the following `UNPIVOT` operator:
```
  UNPIVOT ( "changes"  FOR "action" IN ("added", "deleted") )
```
- takes the values of multiple columns: `"added","deleted"`
- combines them into a single column, the "values column": `"changes"`
- stores the names of the removed columns in the "names column": `"action"`

In [None]:
sql='''
SELECT "user",
       "action",
       SUM("changes") "total_changes"
FROM "example-wiki-pivot-unpivot"

UNPIVOT ( "changes"  FOR "action" IN ("added", "deleted") )

WHERE TIME_IN_INTERVAL(__time, '2016-06-27/P1D')
       AND "channel"='#it.wikipedia' 
       AND "isRobot"='true'
GROUP BY 1,2
ORDER BY 3 DESC 
LIMIT 10
'''
display_client.sql(sql)

The result lists the users that made the most additions or deletions in the channel `'#it.wikipedia'` and are identified as `isRobot='true'`.

The query merges the values of the columns "added" and "deleted" into the column "changes", which is summed into "total_changes".
The names of the original columns "added" and "deleted" are now values in the "action" column, so you can still see the detail.
By using UNPIVOT in this fashion, you can sort on the largest total addition or deletion by a user and easily find the robot users that most affect the channel.
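
The reshaping that UNPIVOT performs can be sketched in plain Python on a few made-up rows. The function `unpivot` and the sample users are hypothetical, and the sketch does not model how UNPIVOT treats NULL values.

```python
# Hypothetical sample rows, not taken from the Wikipedia dataset.
rows = [
    {"user": "BotA", "added": 120, "deleted": 4},
    {"user": "BotB", "added": 30, "deleted": 75},
]

def unpivot(rows, values_column, names_column, columns):
    """Emit one output row per input row and per unpivoted column.

    The unpivoted columns are removed; each name goes into `names_column`
    and the matching value goes into `values_column`.
    """
    result = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in columns}
        for column in columns:
            result.append({**kept, names_column: column, values_column: row[column]})
    return result

unpivoted = unpivot(rows, "changes", "action", ["added", "deleted"])

# Sort on the single values column, like ORDER BY ... DESC in the query above.
top = sorted(unpivoted, key=lambda row: row["changes"], reverse=True)
for row in top:
    print(row)
```

Because additions and deletions now share one column, a single sort ranks both kinds of change together, which is what the query above relies on.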


## Clean up

Run the following cell to remove the table created for this notebook from the database.

In [None]:
druid.datasources.drop("example-wiki-pivot-unpivot")


## Summary

PIVOT converts row values into columns with aggregate results.

UNPIVOT converts columns into rows by merging the values from multiple columns into a single column.
