-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support blocking sections with multiple exploded columns #143
Conversation
Instead of trying to select() out all of the columns we need in each iteration, we can use withColumn() to add the exploded columns one by one. There is a weird bit of logic where we need to do an extra select only if there are exploded columns. I added a comment about that and will do a bit more looking into it.
…ting them out Previously we were selecting with a set, so the columns got all mixed up. Let's sort them so that they are easier to work with. The order of the columns should not affect the results.
Check the size of the output tables to confirm that rows are being exploded correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me . But I don't know how this change is going to affect the future steps.
) | ||
for c in all_column_names | ||
] | ||
explode_col_expr = explode(col(exploding_column_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this exploding_column_name and not derived_from_column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good documentation on the PR.
I don't feel like I really understand the code in general so can't comment in depth but what you wrote explains + justifies the changes well.
Fixes #142.
This PR fixes a bug in the matching explode step which seems like it has been around for a long time. When a user set
explode = true
for more than one blocking column, they got an error like this:This error happened in the loop in the
_explode()
function. The loop constructed each exploded column one-by-one and then selected it and the other columns out ofexploded_df
. The problem was that each iteration tried to select all of the blocking columns out ofexploded_df
, even though the loop hadn't run for all of the exploded columns yet. This is why the first iteration of the loop threw an unresolved column error.To fix this, I've switched the loop to add the exploded columns to
exploded_df
one-by-one withwithColumn()
. This actually ended up simplifying the loop pretty substantially because we can focus on a single column at a time instead of handling all of the columns to select out with list comprehensions. The results are the same.I found another possible bug when working on this. In the previous implementation, we selected
all_column_names
out ofexploded_df
if and only if there was at least one exploded column. The tests depend on this behavior. So I've added some logic to replicate the behavior and a comment explaining it. Changing this behavior is probably technically a breaking change because it changes the columns of the exploded_df_a and exploded_df_b tables. This might have ramifications for later tasks as well. One thing I did change is the order of the columns in exploded_df_a and exploded_df_b, since previously we were selecting with an unordered set. I've sorted the columns so that they aren't in a random order in the output tables.