# Using INSERT with EXTERN to export data (experimental)
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

Ingesting data into Druid usually adds value to the raw data by transforming or aggregating it.
Downstream systems can benefit from these ingestion-time and query-time transformations when you export the results of a query asynchronously.

Note: [Exporting data](https://druid.apache.org/docs/latest/multi-stage-query/reference/#extern-to-export-to-a-destination) is an [experimental feature](https://druid.apache.org/docs/latest/development/experimental).

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

#### Run with Docker

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).
   

## Initialization

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the Druid version number will be shown in the output.

In [None]:
import druidapi
import os

# Use the DRUID_HOST environment variable if it is set; otherwise fall back to localhost.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)

display_client = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

## Load example data

Once your Druid environment is up and running, ingest the sample data for this tutorial.

Run the following cell to create a table called "example-wiki-export". When completed, you'll see a description of the final table.

In [None]:
sql='''
REPLACE INTO "example-wiki-export" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
      '{"type":"json"}'
    )
  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "isRobot",
  "channel",
  "added",
  "user",
  "deleted"
FROM "ext"
PARTITIONED BY DAY
'''
display_client.run_task(sql)
sql_client.wait_until_ready('example-wiki-export')
display_client.table('example-wiki-export')

## Exporting data

[Exporting data](https://druid.apache.org/docs/latest/multi-stage-query/reference/#extern-to-export-to-a-destination) uses MSQ INSERT to an EXTERN table function to define the output location and format of the data.

The SQL statement takes the form:
```
INSERT INTO
  EXTERN(<destination function>)
AS CSV
SELECT
  ...
```

- Destination function: [S3 destination function](https://druid.apache.org/docs/latest/multi-stage-query/reference#s3), [local destination function](https://druid.apache.org/docs/latest/multi-stage-query/reference#local)
- AS CSV is the format of the exported file(s). New formats are expected as this experimental function evolves.
- SELECT statement can be any MSQ query including transformations, joins and aggregations
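
The statement shape above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_export_sql` (it is not part of `druidapi`) that wraps any SELECT statement in an export to the local destination function:

```python
def build_export_sql(export_path: str, select_sql: str) -> str:
    """Wrap a SELECT statement in an INSERT INTO EXTERN(...) AS CSV export.

    Hypothetical helper for illustration only; it just builds a string.
    """
    # Double any single quotes so the path stays a valid SQL string literal.
    safe_path = export_path.replace("'", "''")
    return (
        "INSERT INTO\n"
        f"  EXTERN(local(exportPath => '{safe_path}'))\n"
        "AS CSV\n"
        f"{select_sql.strip()}\n"
    )


export_sql = build_export_sql(
    "/opt/shared/exports/example-wiki-export",
    'SELECT * FROM "example-wiki-export"',
)
print(export_sql)
```

The resulting string can be passed to `display_client.run_task` in the same way as the hand-written statements in this notebook.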

In this learning environment, the `druid_export_storage_baseDir` property is set to `/opt/shared/exports` in the Docker Compose "environment" file.
The volume `/opt/shared` is accessible to all Druid processes and to the JupyterLab container.

Note: This tutorial demonstrates how to export the results of an asynchronous query to a local file system. In a cluster environment, local storage will not be a good choice; [export to S3](https://druid.apache.org/docs/latest/multi-stage-query/reference/#s3) instead.


The output folder must exist and be empty.
Run the following cell to remove the folder if it exists and then recreate it under `/opt/shared/exports`:

In [None]:
!rm -rf /opt/shared/exports/example-wiki-export
!mkdir -p /opt/shared/exports/example-wiki-export
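
If you prefer to prepare the folder from Python rather than shell commands, the same "remove, then recreate" step might look like the sketch below. The helper name `reset_export_dir` is hypothetical, and the demonstration runs against a temporary directory rather than `/opt/shared/exports` so that it works anywhere:

```python
import shutil
import tempfile
from pathlib import Path


def reset_export_dir(path: str) -> Path:
    """Delete the folder if it exists, then recreate it empty."""
    folder = Path(path)
    shutil.rmtree(folder, ignore_errors=True)  # comparable to `rm -rf`
    folder.mkdir(parents=True, exist_ok=True)  # comparable to `mkdir -p`
    return folder


# Demonstration on a throwaway directory (a stand-in for the export base dir).
demo_root = tempfile.mkdtemp()
demo_dir = reset_export_dir(f"{demo_root}/exports/example-wiki-export")
(demo_dir / "stale.csv").write_text("old data\n")

# Resetting again removes the stale file and leaves an empty folder.
demo_dir = reset_export_dir(str(demo_dir))
print(list(demo_dir.iterdir()))
```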

Run the following cell to export all data in the `example-wiki-export` datasource:

In [None]:
sql='''
    INSERT INTO 
       EXTERN ( local(exportPath => '/opt/shared/exports/example-wiki-export') )
       AS CSV
    SELECT
      *
    FROM "example-wiki-export" 
'''
display_client.run_task(sql)

Run the following command to list the contents of the exports folder.

Each filename contains:

- a query ID, which identifies the specific query that generated the file(s)
- a worker task ID, which identifies the MSQ worker that generated the file
- a partition number, with one partition for each file generated by each worker


In [None]:
!ls /opt/shared/exports/example-wiki-export

You can see the content of the file with:

In [None]:
!head /opt/shared/exports/example-wiki-export/*
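
A downstream consumer will typically read the exported files rather than just preview them. The sketch below shows one way to load every file in an export folder with Python's standard `csv` module, assuming each file starts with a header row. The helper name `read_export`, the file names, and the rows are all made up for illustration; the demonstration writes its own small files to a temporary directory instead of reading the real export:

```python
import csv
import tempfile
from pathlib import Path


def read_export(folder: str) -> list:
    """Read every file in an export folder into a list of row dictionaries."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            # DictReader treats the first line of each file as column names.
            rows.extend(csv.DictReader(handle))
    return rows


# Stand-in files with made-up rows; real export file names will differ.
demo = Path(tempfile.mkdtemp())
(demo / "part-0.csv").write_text("channel,added,deleted\n#en.wikipedia,31,0\n")
(demo / "part-1.csv").write_text("channel,added,deleted\n#fr.wikipedia,5,2\n")

rows = read_export(str(demo))
print(len(rows), rows[0]["channel"])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion such as `int(row["added"])`.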

## Exporting transformed data
Export works for any SQL SELECT statement.

Run the following cell to export the results of an aggregate query:

In [None]:
!rm -rf /opt/shared/exports/example-wiki-export-agg
!mkdir -p /opt/shared/exports/example-wiki-export-agg

sql='''
INSERT INTO 
       EXTERN ( local(exportPath => '/opt/shared/exports/example-wiki-export-agg') )
       AS CSV

SELECT "user" as "user",
       "channel" as "channel",
       SUM("added"+"deleted") as "total_changes"
FROM "example-wiki-export"
GROUP BY 1,2
'''
display_client.run_task(sql)

In [None]:
!head /opt/shared/exports/example-wiki-export-agg/*

## Controlling the size of the output files
You can control the size of the output files by using the `rowsPerPage` parameter.
Since there are only 25k rows in this dataset, the following example uses 5000 rows per file to show how this works:

In [None]:
!rm -rf /opt/shared/exports/example-wiki-export-parts
!mkdir -p /opt/shared/exports/example-wiki-export-parts
sql='''
INSERT INTO 
       EXTERN ( local(exportPath => '/opt/shared/exports/example-wiki-export-parts') )
       AS CSV

SELECT * 
FROM "example-wiki-export"
'''
req = sql_client.sql_request(sql)
req.add_context("rowsPerPage", 5000)
display_client.run_task(req)

The following cell shows the 5 files that were generated:

In [None]:
!ls /opt/shared/exports/example-wiki-export-parts

Run the next cell to count the number of rows in each file. You'll see that the files are roughly the same size and close to the 5000-row target:

In [None]:
!wc -l /opt/shared/exports/example-wiki-export-parts/*
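
Because `wc -l` counts every line, its totals include each file's header line. As a rough sketch, the hypothetical helper `data_rows_per_file` below counts only data rows per file. It assumes one header line per file and no line breaks embedded inside CSV fields, and it is demonstrated on made-up files in a temporary directory:

```python
import tempfile
from pathlib import Path


def data_rows_per_file(folder: str) -> dict:
    """Map each file name to its number of data rows (header line excluded)."""
    counts = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open() as handle:
                line_count = sum(1 for _ in handle)
            # Subtract the header line; never report a negative count.
            counts[path.name] = max(line_count - 1, 0)
    return counts


# Stand-in files; real export file names and contents will differ.
demo = Path(tempfile.mkdtemp())
(demo / "part-0.csv").write_text("channel,added\na,1\nb,2\nc,3\n")
(demo / "part-1.csv").write_text("channel,added\nd,4\ne,5\n")

counts = data_rows_per_file(str(demo))
print(counts, sum(counts.values()))
```

Summing the per-file counts gives the total number of exported rows, which you can compare against a `COUNT(*)` on the source table.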

The next cell shows the beginning of each file. Notice that each file has the column headers:

In [None]:
!head -3 /opt/shared/exports/example-wiki-export-parts/*

## Clean up

Run the following cell to remove the table created for this notebook from the database.

In [None]:
# Only "example-wiki-export" was created as a datasource in this notebook;
# "example-wiki-export-parts" is an export folder, not a table.
druid.datasources.drop("example-wiki-export")


## Summary

Druid can export the results of a query to external files. 
It parallelizes the process with MSQ worker tasks, each of which writes its own output file(s).

Learn more about this experimental function in the [Apache Druid documentation](https://druid.apache.org/docs/latest/multi-stage-query/reference/#extern-to-export-to-a-destination).