Support blocking sections with multiple exploded columns #143

Merged on Aug 14, 2024: 5 commits into main from multiple_exploded_columns

Conversation

riley-harper

Fixes #142.

This PR fixes a bug in the matching explode step that seems to have been around for a long time. When a user set explode = true for more than one blocking column, they got an error like this:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `<exploded_column>` cannot be resolved.

This error happened in the loop in the _explode() function. The loop constructed each exploded column one-by-one and then selected it and the other columns out of exploded_df. The problem was that each iteration tried to select all of the blocking columns out of exploded_df, even though the loop hadn't run for all of the exploded columns yet. This is why the first iteration of the loop threw an unresolved column error.
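
The failure mode described above can be reproduced in a small plain-Python model. This is only a sketch under stated assumptions, not the project's code: lists of dicts stand in for a Spark DataFrame, select() stands in for DataFrame.select(), a KeyError stands in for the UNRESOLVED_COLUMN error, and the function and column names are made up for illustration.

```python
from typing import Any

Row = dict[str, Any]


def select(rows: list[Row], names: list[str]) -> list[Row]:
    """Stand-in for DataFrame.select(): raises KeyError if a column is missing."""
    return [{name: row[name] for name in names} for row in rows]


def buggy_explode(
    rows: list[Row], blocking_columns: list[str], exploded: dict[str, str]
) -> list[Row]:
    """Model of the old loop. `exploded` maps new column name -> source list column."""
    for new_name, source in exploded.items():
        # Build the current exploded column: one output row per list element.
        expanded = [{**row, new_name: value} for row in rows for value in row[source]]
        # Bug: blocking_columns also names exploded columns that later
        # iterations have not created yet, so this select can fail.
        rows = select(expanded, blocking_columns)
    return rows


rows = [{"id": 1, "age_range": [29, 30], "birthyr_range": [1900, 1901]}]

# One exploded column: every requested column exists, so the loop succeeds.
one = buggy_explode(rows, ["id", "age"], {"age": "age_range"})

# Two exploded columns: the first iteration already asks for "birthyr".
try:
    buggy_explode(
        rows,
        ["id", "age", "birthyr"],
        {"age": "age_range", "birthyr": "birthyr_range"},
    )
    error = None
except KeyError as exc:
    error = exc.args[0]
```

With a single exploded column the model runs cleanly, which may be why the bug went unnoticed for so long.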

To fix this, I've switched the loop to add the exploded columns to exploded_df one-by-one with withColumn(). This actually ended up simplifying the loop pretty substantially because we can focus on a single column at a time instead of handling all of the columns to select out with list comprehensions. The results are the same.
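
The shape of the fix can be sketched in the same plain-Python style. Again this is an illustrative model rather than the merged code: with_exploded_column() is a hypothetical stand-in for df.withColumn(new_name, explode(col(source))), and the column names are invented.

```python
from typing import Any

Row = dict[str, Any]


def with_exploded_column(rows: list[Row], new_name: str, source: str) -> list[Row]:
    """Stand-in for df.withColumn(new_name, explode(col(source)))."""
    # Keep every existing column and add one row per element of the source list.
    return [{**row, new_name: value} for row in rows for value in row[source]]


def explode_columns(rows: list[Row], exploded: dict[str, str]) -> list[Row]:
    """Add the exploded columns one at a time; no column is requested early."""
    for new_name, source in exploded.items():
        rows = with_exploded_column(rows, new_name, source)
    return rows


rows = [{"id": 1, "age_range": [29, 30, 31], "birthyr_range": [1900, 1901]}]
result = explode_columns(rows, {"age": "age_range", "birthyr": "birthyr_range"})

# Each input row becomes one row per combination of exploded values.
pairs = [(r["age"], r["birthyr"]) for r in result]
```

In this model the single input row turns into 3 x 2 = 6 output rows, so checking the row count is a quick way to confirm that both columns were exploded.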

I found another possible bug when working on this. In the previous implementation, we selected all_column_names out of exploded_df if and only if there was at least one exploded column. The tests depend on this behavior. So I've added some logic to replicate the behavior and a comment explaining it. Changing this behavior is probably technically a breaking change because it changes the columns of the exploded_df_a and exploded_df_b tables. This might have ramifications for later tasks as well. One thing I did change is the order of the columns in exploded_df_a and exploded_df_b, since previously we were selecting with an unordered set. I've sorted the columns so that they aren't in a random order in the output tables.
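
The preserved behavior and the new column ordering can be sketched as follows. This is a hedged plain-Python model, not the actual implementation: finalize_columns() and its any_exploded flag are hypothetical names, and dicts stand in for DataFrame rows.

```python
from typing import Any

Row = dict[str, Any]


def finalize_columns(
    rows: list[Row], all_column_names: set[str], any_exploded: bool
) -> list[Row]:
    """Select all_column_names only when at least one column was exploded."""
    if not any_exploded:
        # Replicates the earlier behavior: no extra select without exploded columns.
        return rows
    # Sorting gives a stable column order instead of arbitrary set order.
    ordered = sorted(all_column_names)
    return [{name: row[name] for name in ordered} for row in rows]


rows = [{"id": 1, "namefrst": "ann", "birthyr": 1900, "extra": "x"}]
names = {"namefrst", "id", "birthyr"}

kept = finalize_columns(rows, names, any_exploded=True)
untouched = finalize_columns(rows, names, any_exploded=False)
```

Here kept contains only the named columns in alphabetical order, while untouched still carries every original column, which mirrors the asymmetry the tests depend on.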

Instead of trying to select() out all of the columns we need in each iteration,
we can use withColumn() to add the exploded columns one by one. There is a
weird bit of logic where we need to do an extra select only if there are
exploded columns. I added a comment about that and will do a bit more looking
into it.
…ting them out

Previously we were selecting with a set, so the columns got all mixed up. Let's
sort them so that they are easier to work with. The order of the columns should
not affect the results.
Check the size of the output tables to confirm that rows are being exploded
correctly.

@anpumn anpumn left a comment

Looks good to me. But I don't know how this change is going to affect the future steps.

)
for c in all_column_names
]
explode_col_expr = explode(col(exploding_column_name))
Comment from the PR author:

Why is this exploding_column_name and not derived_from_column?

@ccdavis ccdavis left a comment

Really good documentation on the PR.

I don't feel like I really understand the code in general, so I can't comment in depth, but what you wrote explains and justifies the changes well.

@riley-harper riley-harper merged commit 21e9007 into main Aug 14, 2024
6 checks passed
@riley-harper riley-harper deleted the multiple_exploded_columns branch August 14, 2024 20:51
Successfully merging this pull request may close these issues.

Configurations with multiple exploded blocking columns cause errors in Matching step 0 - explode