# Incorporate data from source systems into table data
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

When you ingest data from multiple sources, it can be helpful to add information about the source of each row to the table.

For example, you might parse a source system URI with a [datetime function](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) and use the result as the primary `__time` timestamp.

This tutorial demonstrates how to incorporate `systemFields` with batch ingestion to add this information to rows in your table.
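To make the URI-parsing idea concrete, here is a minimal Python sketch of the same logic outside Druid. It assumes a hypothetical URI whose path contains a date in `YYYY-MM-DD` form; the URI, the function name `timestamp_from_uri`, and the date layout are illustrative assumptions, not part of this tutorial's data.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlparse


def timestamp_from_uri(uri: str) -> datetime:
    """Return the UTC date embedded in the path of `uri`.

    Assumes the path contains a date formatted as YYYY-MM-DD.
    """
    path = urlparse(uri).path
    match = re.search(r"\d{4}-\d{2}-\d{2}", path)
    if match is None:
        raise ValueError(f"No date found in {uri!r}")
    parsed = datetime.strptime(match.group(0), "%Y-%m-%d")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical URI, not one of the files used in this tutorial.
example_uri = "https://example.com/exports/2024-01-15/trips.csv.gz"
print(timestamp_from_uri(example_uri).isoformat())
```

In Druid itself, the equivalent step would be a SQL expression over a system field rather than Python code.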

## Prerequisites

This tutorial works with Druid 29.0.0 or later.

Launch this tutorial and all prerequisites using the `druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

The following cells set up the notebook and learning environment.

### Set up a connection to Apache Druid

Run the next cell to set up the Druid Python client's connection to Apache Druid.

If successful, the output shows the Druid version number.

In [2]:
import druidapi
import os

druid_headers = {'Content-Type': 'application/json'}

# Default to localhost when the DRUID_HOST environment variable is not set.
druid_host = f"http://{os.environ.get('DRUID_HOST', 'localhost')}:8888"

print(f"Opening a connection to {druid_host}.")
druid = druidapi.jupyter_client(druid_host)
display = druid.display
sql_client = druid.sql
status_client = druid.status

status_client.version

Opening a connection to http://router:8888.


'31.0.0'

## Create a table with system fields using batch ingestion

The `systemFields` property can be added to a number of [`inputSource` definitions](https://druid.apache.org/docs/latest/ingestion/input-sources) in the EXTERN clause.
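As a rough sketch of what such a definition looks like, the following Python builds the JSON for an HTTP input source that requests two system fields. The helper name `http_input_source` is hypothetical; the property names and URIs mirror the ingestion statement used in this tutorial.

```python
import json


def http_input_source(uris, system_fields):
    """Build the JSON string for an HTTP inputSource that requests system fields."""
    spec = {
        "type": "http",
        "systemFields": list(system_fields),
        "uris": list(uris),
    }
    return json.dumps(spec)


input_source = http_input_source(
    uris=[
        "https://static.imply.io/example-data/trips/trips_xaa.csv.gz",
        "https://static.imply.io/example-data/trips/trips_xac.csv.gz",
    ],
    system_fields=["__file_uri", "__file_path"],
)
print(input_source)
```

Building the JSON with `json.dumps` avoids hand-quoting mistakes; the resulting string can then be placed inside the single-quoted first argument of EXTERN.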

Run the cell to ingest two files. Notice:

* The EXTERN statement includes the `systemFields` property, with an array containing the two fields to add.
* The EXTEND clause includes a type definition for both of the additional fields.
* The SELECT explicitly names both of the system fields, alongside the fields to ingest.
* The REGEXP_EXTRACT function applies a simple regular expression to the `__file_path` to store only the filename as `__file_name`.
* Only rows with a `passenger_count` of 5 are ingested.

When completed, you'll see a description of the final table.
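To see what the regular expression in the list above does, here is a minimal Python sketch that applies the same pattern with the `re` module. It assumes that `__file_path` holds the path portion of the file's URI, and the helper name `file_name_from_path` is hypothetical. Java and Python regular expressions behave the same way for this simple pattern.

```python
import re
from urllib.parse import urlparse

# Same pattern as in the ingestion SQL: skip everything up to the last "/",
# then capture the remainder as group 1.
FILE_NAME_PATTERN = re.compile(r"(?:.+/)(.+)")


def file_name_from_path(file_path):
    """Return the text after the last "/" in `file_path`, or None if no match."""
    match = FILE_NAME_PATTERN.search(file_path)
    return match.group(1) if match else None


uri = "https://static.imply.io/example-data/trips/trips_xaa.csv.gz"
file_path = urlparse(uri).path  # Assumed to resemble the __file_path value.
print(file_name_from_path(file_path))
```

Because `.+` is greedy, the non-capturing group consumes everything through the final slash, so only the file name lands in the capture group.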

In [3]:
table_name = "example-taxitrips-systemfields"

sql = '''
REPLACE INTO "''' + table_name + '''" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{
          "type":"http",
          "systemFields":["__file_uri","__file_path"],
          "uris":
              ["https://static.imply.io/example-data/trips/trips_xaa.csv.gz",
              "https://static.imply.io/example-data/trips/trips_xac.csv.gz"]}',
      '{"type":"csv","findColumnsFromHeader":false,"columns":["trip_id","vendor_id","pickup_datetime","dropoff_datetime","store_and_fwd_flag","rate_code_id","pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","passenger_count","trip_distance","fare_amount","extra","mta_tax","tip_amount","tolls_amount","ehail_fee","improvement_surcharge","total_amount","payment_type","trip_type","pickup","dropoff","cab_type","precipitation","snow_depth","snowfall","max_temperature","min_temperature","average_wind_speed","pickup_nyct2010_gid","pickup_ctlabel","pickup_borocode","pickup_boroname","pickup_ct2010","pickup_boroct2010","pickup_cdeligibil","pickup_ntacode","pickup_ntaname","pickup_puma","dropoff_nyct2010_gid","dropoff_ctlabel","dropoff_borocode","dropoff_boroname","dropoff_ct2010","dropoff_boroct2010","dropoff_cdeligibil","dropoff_ntacode","dropoff_ntaname","dropoff_puma"]}'
    )
  ) EXTEND ("__file_uri" VARCHAR, "__file_path" VARCHAR,
    "trip_id" BIGINT, "vendor_id" BIGINT, "pickup_datetime" VARCHAR, "dropoff_datetime" VARCHAR, "store_and_fwd_flag" VARCHAR, "rate_code_id" BIGINT, "pickup_longitude" DOUBLE, "pickup_latitude" DOUBLE, "dropoff_longitude" DOUBLE, "dropoff_latitude" DOUBLE, "passenger_count" BIGINT, "trip_distance" DOUBLE, "fare_amount" DOUBLE, "extra" DOUBLE, "mta_tax" DOUBLE, "tip_amount" DOUBLE, "tolls_amount" DOUBLE, "ehail_fee" VARCHAR, "improvement_surcharge" VARCHAR, "total_amount" DOUBLE, "payment_type" BIGINT, "trip_type" VARCHAR, "pickup" VARCHAR, "dropoff" VARCHAR, "cab_type" VARCHAR, "precipitation" DOUBLE, "snow_depth" BIGINT, "snowfall" DOUBLE, "max_temperature" BIGINT, "min_temperature" BIGINT, "average_wind_speed" DOUBLE, "pickup_nyct2010_gid" BIGINT, "pickup_ctlabel" BIGINT, "pickup_borocode" BIGINT, "pickup_boroname" VARCHAR, "pickup_ct2010" BIGINT, "pickup_boroct2010" BIGINT, "pickup_cdeligibil" VARCHAR, "pickup_ntacode" VARCHAR, "pickup_ntaname" VARCHAR, "pickup_puma" BIGINT, "dropoff_nyct2010_gid" BIGINT, "dropoff_ctlabel" BIGINT, "dropoff_borocode" BIGINT, "dropoff_boroname" VARCHAR, "dropoff_ct2010" BIGINT, "dropoff_boroct2010" BIGINT, "dropoff_cdeligibil" VARCHAR, "dropoff_ntacode" VARCHAR, "dropoff_ntaname" VARCHAR, "dropoff_puma" BIGINT)
)
SELECT
  TIME_PARSE(TRIM("pickup_datetime")) AS "__time",
  "__file_uri",
  "__file_path",
  REGEXP_EXTRACT("__file_path",'(?:.+/)(.+)',1) AS "__file_name",
  "trip_id",
  "vendor_id",
  "dropoff_datetime",
  "rate_code_id",
  "passenger_count",
  "trip_distance",
  "fare_amount",
  "extra",
  "mta_tax",
  "tip_amount",
  "tolls_amount",
  "total_amount",
  "payment_type"
FROM "ext"
WHERE "passenger_count" = 5
PARTITIONED BY DAY
'''

request = druid.sql.sql_request(sql)
request.add_context('maxNumTasks', 4)

display.run_task(request)
sql_client.wait_until_ready(table_name)
display.table(table_name)


Run the next cell to query the table and count the ingested rows for each source file.

In [None]:
sql=f'''
SELECT
    "__file_uri",
    "__file_path",
    "__file_name",
    COUNT(*) AS "total_file_rows"
FROM "{table_name}"
GROUP BY 1, 2, 3
'''

druid.display.sql(sql)

## Clean up

Run the following cell to drop the table.

In [None]:
print(f"Drop table: [{druid.datasources.drop(table_name)}]")

## Summary

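* You learned how to use the `systemFields` property of an `inputSource` in EXTERN to add `__file_uri` and `__file_path` to each ingested row.
* Remember to declare each system field in EXTEND and name it in the SELECT. Functions such as REGEXP_EXTRACT can derive further columns from system fields.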

## Learn more

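* Try this out on your own data.
* Read about the [`inputSource` definitions](https://druid.apache.org/docs/latest/ingestion/input-sources) that support `systemFields`.
* Explore the [date and time functions](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) that you can apply to system fields, for example to derive `__time`.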