Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnest array of row to multiple columns #10883

Merged
merged 1 commit into from
Jul 24, 2018

Conversation

rongrong
Copy link
Contributor

This partially resolves #8151 (for array type).

@rongrong rongrong requested review from martint and haozhun June 21, 2018 04:35
@rongrong rongrong force-pushed the unnest_array_row branch 4 times, most recently from 425e667 to df7d46a Compare June 21, 2018 04:51
Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good to me. But my knowledge about analyzer and planner is very limited.

@martint , @kokosing would you mind also taking look? Thank you !

private int position;
private int positionCount;

public ArrayRowUnnester(Type elementType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just make constructor to take List<Type> fieldTypes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. But setBlock expects a RowBlock, so it would be really strange for the constructor to take an arbitrary list of types. There are pros and cons for both. I think keeping the constructor consistent with other unnesters and make it obvious that this should be only used for RowType is better.

implements Unnester
{
private final List<Type> fieldTypes;
private ColumnarRow block;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just call it columnarRow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, both makes sense. I don't really care.

@@ -112,7 +113,12 @@ public UnnestOperator(OperatorContext operatorContext, List<Integer> replicateCh
this.unnesters = new ArrayList<>(unnestTypes.size());
for (Type type : unnestTypes) {
if (type instanceof ArrayType) {
unnesters.add(new ArrayUnnester(((ArrayType) type).getElementType()));
if (((ArrayType) type).getElementType() instanceof RowType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might consider put a local variable elementType = ((ArrayType) type).getElementType() and being reused ?

@@ -575,6 +575,11 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery("SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])", "SELECT * FROM VALUES (1, 2, 'a', true), (3, 4, 'b', false)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add comment explaining these three tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No other test cases have comments. These test cases should be at least as self-explanatory for what they are testing as the others in the test. Do we have any criteria on when to add comment vs not?

@@ -650,7 +661,17 @@ protected RelationPlan visitUnnest(Unnest node, Void context)
Symbol inputSymbol = symbolAllocator.newSymbol(rewritten, type);
argumentSymbols.add(inputSymbol);
if (type instanceof ArrayType) {
unnestSymbols.put(inputSymbol, ImmutableList.of(unnestedSymbolsIterator.next()));
Type elementType = ((ArrayType) type).getElementType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 663 - 680 is same as line 562 - 580. And they are both doing unnest.

Would it make sense to refactor them? @martint @kokosing

Copy link
Contributor Author

@rongrong rongrong Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite like how unnest is done through out the code base. the

if (type instanceof ArrayType)
...
else if (type instanceof MapType)
...

is everywhere. If I were to do a refactoring, I'd prefer something doesn't require that. But I didn't see an easy way to do so. Would be more than happy to take suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract the outer if as a static method.

I'm not sure what part of the following snippet you don't like

if (type instanceof ArrayType)
...
else if (type instanceof MapType)
...

You could have done switch(type.getTypeSignature().getBase(). But that makes little difference.

@@ -0,0 +1,63 @@
package com.facebook.presto.operator;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright

import static com.google.common.base.Preconditions.checkState;
import static java.util.Objects.requireNonNull;

public class ArrayRowUnnester
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayOfRowsUnnester?

@@ -575,6 +575,11 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery("SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])", "SELECT * FROM VALUES (1, 2, 'a', true), (3, 4, 'b', false)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please format this as:

assertQuery(
   actualQuery,
   expectedQuery);
```?

@@ -575,6 +575,11 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery("SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])", "SELECT * FROM VALUES (1, 2, 'a', true), (3, 4, 'b', false)");
assertQuery("SELECT x, y FROM (VALUES (ARRAY[ROW(1, 2), ROW(3, 4)])) AS t(a) CROSS JOIN UNNEST(a) t(x, y)", "SELECT * FROM VALUES (1, 2), (3, 4)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the output is not:

SELECT * FROM VALUES (1, 2, 3, 4)

There is only one array so I was expecting only one output row.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one array of two rows. Each array element will become one row in the final result right? So the first element ROW(1, 2) becomes (1, 2), and second element ROW(3, 4) becomes (3, 4).

Copy link
Contributor

@kokosing kokosing Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will be produced for:

SELECT x, y FROM (VALUES (ARRAY[ROW(1, 2), ROW('varchar', false])) AS t(a) CROSS JOIN UNNEST(a) t(x, y)

?

Also why the test

assertQuery("SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])", "SELECT * FROM VALUES (1, 2, 'a', true), (3, 4, 'b', false)");

does not produce 4 rows (one per each row in arrays)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the semantic correctly, when you unnest an array, each element in the array will become 1 row. If you unnest two arrays, first element of both arrays will become 1 row, second element of both arrays will become the second row, etc. Things would be a little more obvious with "WITH ORDINALITY". Previously, we'd always unnest an array into one column, so if it's array, we'd unnest that to one column of type ROW. This PR would change it so it unnests the array to multiple columns defined by the fields in the row.

In your example, ARRAY[ROW(1, 2), ROW('varchar', false]) is not a valid array.

@@ -575,6 +575,11 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery("SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])", "SELECT * FROM VALUES (1, 2, 'a', true), (3, 4, 'b', false)");
assertQuery("SELECT x, y FROM (VALUES (ARRAY[ROW(1, 2), ROW(3, 4)])) AS t(a) CROSS JOIN UNNEST(a) t(x, y)", "SELECT * FROM VALUES (1, 2), (3, 4)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also below queries:

SELECT x, y FROM (VALUES (ARRAY[ROW(1, 2), ROW('varchar', false)])) AS t(a) CROSS JOIN UNNEST(a) t(x, y)

SELECT x, y FROM (VALUES (ARRAY[ROW(1)], ARRAY[ROW(2, 3)]) AS t(a) CROSS JOIN UNNEST(a) t(x, y)

SELECT x, y FROM (VALUES (ARRAY[ROW(1)], ARRAY[ROW('varchar')]) AS t(a) CROSS JOIN UNNEST(a) t(x, y)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These queries not valid, so they won't test unnest array of rows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but they should fail in some reasonable way. IMO this should be tested, but maybe not here.

@rongrong rongrong force-pushed the unnest_array_row branch 2 times, most recently from 926d499 to 0f5db0f Compare June 21, 2018 06:55
@@ -575,6 +575,15 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery(
"SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of

SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)])

what are the column names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They shouldn’t have one, just like every other place in the engine that produces anonymous rows or columns.

outputFields.add(Field.newUnqualified(Optional.empty(), ((ArrayType) expressionType).getElementType()));
Type elementType = ((ArrayType) expressionType).getElementType();
if (elementType instanceof RowType) {
outputFields.addAll(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting a variable will make this more readable:

ImmutableList<Field> fields = elementType.getTypeParameters().stream()
        .map(type -> Field.newUnqualified(Optional.empty(), type))
        .collect(toImmutableList());
outputFields.addAll(fields);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. But I thought we prefer inlining variables if they are only used once? Happy to go either way. But can we have a consistent standard? (I can take "when it affects readability, use a variable, otherwise inline" as a standard, though that would leave some room for debates about what's "readable". So it's probably better to provide a clearer guideline.

Copy link
Contributor

@haozhun haozhun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments below.

In addition, please add tests to TestUnnestOperator.

@@ -575,6 +575,15 @@ public void testUnnest()
assertQuery("SELECT b FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5]) t(a, b)", "SELECT * FROM VALUES 4, 5, NULL");
assertQuery("SELECT count(*) FROM UNNEST(ARRAY[1, 2, 3], ARRAY[4, 5])", "SELECT 3");
assertQuery("SELECT a FROM UNNEST(ARRAY['kittens', 'puppies']) t(a)", "SELECT * FROM VALUES ('kittens'), ('puppies')");
assertQuery(
"SELECT * FROM UNNEST(ARRAY[ROW(1, 2), ROW(3, 4)], ARRAY[ROW('a', true), ROW('b', false)])",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please cover double and structural type in this test case

@@ -650,7 +661,17 @@ protected RelationPlan visitUnnest(Unnest node, Void context)
Symbol inputSymbol = symbolAllocator.newSymbol(rewritten, type);
argumentSymbols.add(inputSymbol);
if (type instanceof ArrayType) {
unnestSymbols.put(inputSymbol, ImmutableList.of(unnestedSymbolsIterator.next()));
Type elementType = ((ArrayType) type).getElementType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract the outer if as a static method.

I'm not sure what part of the following snippet you don't like

if (type instanceof ArrayType)
...
else if (type instanceof MapType)
...

You could have done switch(type.getTypeSignature().getBase(). But that makes little difference.

outputFields.addAll(
elementType.getTypeParameters().stream()
.map(type -> Field.newUnqualified(Optional.empty(), type))
.collect(toImmutableList()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer using a for-loop here. If you prefer stream, you could do .forEach(outputFields::add). This way, you wouldn't need outputFields.addAll any more. You can change your code either way.

{
this.columnarRow = ColumnarRow.toColumnarRow(block);
this.position = 0;
this.positionCount = block == null ? 0 : block.getPositionCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPE at line 72. ColumnarRow.toColumnarRow throws for null argument

Copy link
Contributor

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Please look at test failures, too.

true),
booleanSessionProperty(
LEGACY_UNNEST,
"Using legacy unnest semantic, where unnest(array<row>) will create one column of type row",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array(row) is the canonical form to refer to a parametric type.

@@ -115,6 +115,7 @@
private DataSize filterAndProjectMinOutputPageSize = new DataSize(500, KILOBYTE);
private int filterAndProjectMinOutputPageRowCount = 256;
private int maxGroupingSets = 2048;
private boolean legacyUnnestArrayRows = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this false by default. The old (broken) behavior should be available but not how it works out of the box.

@Test(singleThreaded = true)
public class TestLegacyUnnestArrayRows
{
private H2QueryRunner h2QueryRunner;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for H2 here. Take a look at how com.facebook.presto.sql.query.TestXXX work.

}

@Test
public void testUnnestArrayRowsLegacyBehavorDisabled()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this test to a TestUnnest class in com.facebook.presto.sql.query, similar to TestGrouping, TestJoinUsing, etc. and leave TestLegacyUnnest in a separate class (like TestLegacyJoinUsing). It's easier to follow the code, keep concerns separated and when it's time to remove the feature, it's just a matter of deleting the class.

@rongrong rongrong force-pushed the unnest_array_row branch 2 times, most recently from 440382e to 97179e0 Compare July 21, 2018 00:09
@rongrong
Copy link
Contributor Author

@martint I addressed the comments. Mind take another look? Thanks!

Copy link
Contributor

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, but otherwise looks good.

@Test
public void testLegacyUnnestArrayRows()
{
QueryAssertions assertions = new QueryAssertions(testSessionBuilder().build(), new FeaturesConfig().setLegacyUnnestArrayRows(true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be closed:

try (QueryAssertions queryAssertions = new QueryAssertions(...)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you could just do this instead of adding FeaturesConfig to the QueryAssertions:

testSessionBuilder()
    .setSystemProperty(LEGACY_UNNEST, "true")
    .build()

Although, admittedly, that doesn't test the whole config->session->engine path.

@Test
public void testUnnestArrayRowsLegacyBehavorDisabled()
{
QueryAssertions assertions = new QueryAssertions();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

public class TestUnnestArrayRows
{
@Test
public void testUnnestArrayRowsLegacyBehavorDisabled()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just call this testUnnestArrayOfRows


import org.testng.annotations.Test;

public class TestUnnestArrayRows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this TestUnnest (this is where all tests for unnest should eventually go)

@rongrong rongrong closed this Jul 24, 2018
@rongrong rongrong deleted the unnest_array_row branch July 24, 2018 18:45
@rongrong rongrong merged commit f3f7aa3 into prestodb:master Jul 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UNNEST of collections of row types should produce multiple columns
8 participants