Skip to content

BH Projection Framework

Paul Rogers edited this page Jan 13, 2018 · 10 revisions

Drill provides a Project operator to handle the common projection tasks:

  • Remove an unused column
  • Rename a column
  • Promote a column from inside an array or map to top level.
  • Reorder columns.
  • Create a new column as an expression of existing columns and constants.

Originally, readers load all columns, then the Project operator removed those not needed. The team quickly realized that a performance improvement can be had by asking the reader to omit unprojected columns. As a result, each reader has implemented projection-pushdown its own way. The projection framework provides a common scan projection-pushdown framework so that we can test it once and use it everywhere. The projection framework is coupled with the result set loader to form the overall scan framework.

Projection in Scan

Projection in a scan (AKA "projection push down") must handle several key tasks:

  • Parse the project list in the physical plan (provided as a list of SchemaPath elements)
  • If the projection is a wildcard ("*", AKA "star query"), create the scan output based on the schema of the data source
  • If the projection is explicit (a list of columns), create the scan output based on that list, making up null columns as necessary
  • Handle file metadata (AKA "implicit") columns: file name, file path, dirx, and so on.

Projection provides information for other parts of the scan:

  • The list of desired columns for the reader.
  • The list of null columns for the null column builder.
  • The list of file metadata columns for the file metadata builder.

Building the Output

We've noted three groups of columns: "table" columns, null columns and file metadata columns. The scan framework builds these separately. (Each reader need not worry about the null and file metadata columns: there is only one right way to create them.) Suppose we have the following query:

SELECT a, b, dir0, filename, c, d

Suppose that the table can provides columns (c, e, a). Columns (dir0, filename) are file metadata columns. This means that columns (b, d) are null. We create the three groups. Then, we must project them into the output in the order specified by the query:

   Table                              Output
+----------+                      +----------+
| c        |-----+  +------------>| a        |
| e        | X   |  |  +--------->| b        |
| a        |--------+  |  +------>| dir      |
+----------+     |     |  | +---->| filename |
File Metadata    +-----|--|-|---->| c        |
+----------+           |  | |  +->| d        |
| filename |-----------|--|-+  |  +----------+
| dir0     |-----------|--+    |
+----------+           |       |
   Nulls               |       |
+----------+           |       |
| b        |-----------+       |
| d        |-------------------+
+----------+

Types of Projection

Table (Reader) Projection

Null Column Handling

File Metadata (AKA "Implicit") Columns

Specialized Projection

  • CSV
  • Files

Column Reordering

Arrays

Maps

Vector Persistence

"Schema Smoothing"

Components

SchemaPath Parsing

Wildcard Expansion

Explicit Projection Resolution

Clone this wiki locally