Support NULL and MISSING value in response #667

penghuo · 2020-08-06T00:50:29Z

Description of changes

Auto add FIELDS * for PPL command.
In Analyzer, expend SELECT * to SELECT all fields.
Extract type info from QueryPlan and add to QueryResponse.
Support NULL and MISSING value in response. https://github.com/penghuo/sql/blob/pr-select-all/docs/user/general/values.rst

Problem Statements

Background

Before explain the current issue, firstly, let’s setup the context.

Sample Data

Let’s explain the problem with an example, the bank index which has 2 fields.

    "mappings" : {
      "properties" : {
        "account_number" : {
          "type" : "long"
        },
        "age" : {
          "type" : "integer"
        }
      }
    }

Then we add some data to the index.

POST bank/_doc/1
{"account_number":1,"age":31}

// age is null
POST bank/_doc/2
{"account_number":2,"age":null}

// age is missing
POST bank/_doc/3
{"account_number":3}

JDBC and JSON data format

Then, we define the response data format for query “SELECT account_number FROM bank” as follows.

JDBC Format. There there mandatory fields, “field name”, “field type” and “data”. e.g.

{
  "schema": [{
    "name": "account_number",
    "type": "long"
  }],
  "total": 3,
  "datarows": [
    [1],
    [2],
    [3]
  ],
  "size": 3
}

JSON Format, comparing with JDBC format, it doesn’t have schema field

{
  "datarows": [
    {"account_number": 1},
    {"account_number": 2},
    {"account_number": 3}
  ]
}

Issue 1. Represent NULL and MISSING in Response

With these sample data and response data format in mind, let go through more query and examine their results.
Considering the query:** SELECT age, account_number FROM bank. **
The JDBC format doesn’t have MISSING value. If the field exist in the schema but missed in the document, it should considered as NULL value.
The JSON format could represent the MISSING value properly.

* JDBC Format
{
  "schema": [
    {
      "name": "age",
      "type": "integer"
    },
    {
      "name": "account_number",
      "type": "long"
    }
  ],
  "total": 3,
  "datarows": [
    [
      31,
      1
    ],
    [
      null,
      2
    ],
    [
      null,
      3
    ],
  ],
  "size": 3
}

* JSON Format
{
  "datarows": [
    {"age": 1, "account_number": 1},
    {"age": null, "account_number": 2},
    {"account_number": 3}
  ]
}

Issue 2. ExprValue to JDBC format

Based on our current impementation, all the SQL operator is translated to chain of PhysicalOpeartor. Each PhysicalOpeartor provide the ExprValue as the return value. The protocol pull the result from PhysicalOperator and transalte to the expected format. e.g. Taking the above query as example, the result of the PhysicalOpeartor is a list of ExprValues.

[
    {"age": ExprIntegerValue(1), "account_number": ExprIntegerValue(1)},
    {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
    {"account_number": ExprIntegerValue(3)}
]

The current solution is extract field name and field type from the data itself. This solution has two problems

It is impossible to derive the type from NULL value.
If the field is missing in the ExprValue, there is no way to derive it.

Issue 3. The Type info is missing

In current design, the Protocol is a seperate module which work independently with QueryEngine. The Protocol module receive the list of ExprTupleValue from QueryEngine, then the Protocol module format the result based on the type of ExprValue. the problem is ExprNullValue and ExprMissingValue doesn’t have type assosicate with it. thus the Protocol module can’t derive the type info from input ExprTupleValue directly.

Issue 4. What is *(all field) means in SELECT

In current design, the SELECT * clause ingored in the AST builder logic, because it means select all the data from input operator. The issue is similar as Issue 3 that if the input operator produce NULL or MISSING value, then the Protocol have no idea to derive type info from it.

Requirements

The JDBC format should been supported. The MISSING and NULL value should been representd as NULL.
The JSON format should been supported.
The Protocol module should receive the QueryResponse which include schema and data.

Solution

Includ NULL and MISSING value in the QueryResult (Issue 1, 2)

The SELECT operator will be translated to PhysicalOpeartor with a list of expression to resolve ExprValue from input data. With the above example, when handling NULL and MISSING value, the expected output data should be as follows.

[
    {"age": ExprIntegerValue(1), "account_number": ExprIntegerValue(1)},
    {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
    {"age": ExprMissingValue(), "account_number": ExprIntegerValue(3)}
]

An aditionial list of Schema is also required to when protocol is JDBC.

{
  "schema": [
    {
      "name": "age",
      "type": "integer"
    },
    {
      "name": "account_number",
      "type": "long"
    }
  ]    
}

Then the protocol module could easily translate the JDBC format or JSON format.

Expend SELECT * to SELECT ...fields (Issue 4)

In our current implementation, in SQL, the SELECT * is ignored and in PPL there even no fields * command. This solution works fine for JSON format which doesn’t require schema, but it doens’s works for JDBC format.
The proposal in here is

Automatically add the fields command to PPL query
Expand SELECT * to SELECT ...fields.

Automatically add fields * to PPL query

Comparing with SQL, the PPL grammer doesn’t require the Fields command is the last command. Thus, the fields * command should been automatically added.
The automatically added logic is if the last operator is not Fields command, the Fields * command will been added.

Expand SELECT * to SELECT ...fields

In Analyzer, we should expend the * to all fields in the current scope. There are two issues we need to address,

No all the fields in the current scope should been used to expand *. The original scope is from Elasticsearch mapping which include nested mapping. In current design, only the top fields will be retrived from the current scope, all the nested fields will been ignored.
The scope should been dynamtic maintain in the Analyzer. For example, the stats command will define the new scope.

Retrive Type Info from ProjectOperator and Expose to Protocol (Issue 3)

After expending the * and automatically add fields, the type info could been retrived from ProjectOperator. Then the Protocol could get schema and data from QueryEngine.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

dai-chen

LGTM. Thanks!

...rc/main/java/com/amazon/opendistroforelasticsearch/sql/planner/physical/ProjectOperator.java

chloe-zh

LGTM, thanks for the changes!

Support NULL and MISSING value in response

f549205

penghuo requested review from chloe-zh and dai-chen August 6, 2020 00:50

penghuo self-assigned this Aug 6, 2020

penghuo added SQL PPL labels Aug 6, 2020

penghuo marked this pull request as ready for review August 6, 2020 16:17

dai-chen approved these changes Aug 8, 2020

View reviewed changes

...rc/main/java/com/amazon/opendistroforelasticsearch/sql/planner/physical/ProjectOperator.java Outdated Show resolved Hide resolved

address comments

6974a90

chloe-zh approved these changes Aug 10, 2020

View reviewed changes

penghuo merged commit 6021470 into opendistro-for-elasticsearch:develop Aug 10, 2020

dai-chen mentioned this pull request Sep 11, 2020

Not able to fetch null value result for doctest due to ExpressionEvaluationException #583

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support NULL and MISSING value in response #667

Support NULL and MISSING value in response #667

penghuo commented Aug 6, 2020 •

edited

dai-chen left a comment

chloe-zh left a comment

Support NULL and MISSING value in response #667

Support NULL and MISSING value in response #667

Conversation

penghuo commented Aug 6, 2020 • edited

Description of changes

Problem Statements

Background

Sample Data

JDBC and JSON data format

Issue 1. Represent NULL and MISSING in Response

Issue 2. ExprValue to JDBC format

Issue 3. The Type info is missing

Issue 4. What is *(all field) means in SELECT

Requirements

Solution

Includ NULL and MISSING value in the QueryResult (Issue 1, 2)

Expend SELECT * to SELECT ...fields (Issue 4)

Automatically add fields * to PPL query

Expand SELECT * to SELECT ...fields

Retrive Type Info from ProjectOperator and Expose to Protocol (Issue 3)

dai-chen left a comment

Choose a reason for hiding this comment

chloe-zh left a comment

Choose a reason for hiding this comment

penghuo commented Aug 6, 2020 •

edited