-
Notifications
You must be signed in to change notification settings - Fork 1
BH Projection Framework
Drill provides a Project operator to handle the common projection tasks:
- Remove an unused column
- Rename a column
- Promote a column from inside an array or map to top level.
- Reorder columns.
- Create a new column as an expression of existing columns and constants.
Originally, readers load all columns, then the Project operator removed those not needed. The team quickly realized that a performance improvement can be had by asking the reader to omit unprojected columns. As a result, each reader has implemented projection-pushdown its own way. The projection framework provides a common scan projection-pushdown framework so that we can test it once and use it everywhere. The projection framework is coupled with the result set loader to form the overall scan framework.
Projection in a scan (AKA "projection push down") must handle several key tasks:
- Parse the project list in the physical plan (provided as a list of
SchemaPath
elements) - If the projection is a wildcard ("*", AKA "star query"), create the scan output based on the schema of the data source
- If the projection is explicit (a list of columns), create the scan output based on that list, making up null columns as necessary
- Handle file metadata (AKA "implicit") columns: file name, file path, dirx, and so on.
Projection provides information for other parts of the scan:
- The list of desired columns for the reader.
- The list of null columns for the null column builder.
- The list of file metadata columns for the file metadata builder.
We've noted three groups of columns: "table" columns, null columns and file metadata columns. The scan framework builds these separately. (Each reader need not worry about the null and file metadata columns: there is only one right way to create them.) Suppose we have the following query:
SELECT a, b, dir0, filename, c, d
Suppose that the table can provides columns (c, e, a). Columns (dir0, filename) are file metadata columns. This means that columns (b, d) are null. We create the three groups. Then, we must project them into the output in the order specified by the query:
Table Output
+----------+ +----------+
| c |-----+ +------------>| a |
| e | X | | +--------->| b |
| a |--------+ | +------>| dir |
+----------+ | | | +---->| filename |
File Metadata +-----|--|-|---->| c |
+----------+ | | | +->| d |
| filename |-----------|--|-+ | +----------+
| dir0 |-----------|--+ |
+----------+ | |
Nulls | |
+----------+ | |
| b |-----------+ |
| d |-------------------+
+----------+
- CSV
- Files