Support blocking sections with multiple exploded columns #143

riley-harper · 2024-08-13T16:25:09Z

Fixes #142.

This PR fixes a bug in the matching explode step that appears to have been around for a long time. When a user set explode = true for more than one blocking column, they got an error like this:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `<exploded_column>` cannot be resolved.

This error happened in the loop in the _explode() function. The loop constructed each exploded column one-by-one and then selected it and the other columns out of exploded_df. The problem was that each iteration tried to select all of the blocking columns out of exploded_df, even though the loop hadn't run for all of the exploded columns yet. This is why the first iteration of the loop threw an unresolved column error.

To fix this, I've switched the loop to add the exploded columns to exploded_df one-by-one with withColumn(). This actually ended up simplifying the loop pretty substantially because we can focus on a single column at a time instead of handling all of the columns to select out with list comprehensions. The results are the same.
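To make the one-column-at-a-time idea concrete, here is a minimal sketch in plain Python rather than PySpark. It models a table as a list of dicts. The helpers `explode_column` and `explode_all` and the sample column names are hypothetical and are not hlink's code. Each pass replaces a single list-valued column with its elements, which is roughly the effect of adding one exploded column with withColumn(), so no pass depends on a column that a later pass has yet to produce.

```python
from typing import Any

Row = dict[str, Any]


def explode_column(rows: list[Row], name: str) -> list[Row]:
    """Emit one output row per element of the list stored in column `name`."""
    exploded = []
    for row in rows:
        for value in row[name]:
            new_row = dict(row)    # copy the other columns unchanged
            new_row[name] = value  # replace the list with a single element
            exploded.append(new_row)
    return exploded


def explode_all(rows: list[Row], names: list[str]) -> list[Row]:
    """Explode several columns by handling exactly one column per pass."""
    for name in names:
        rows = explode_column(rows, name)
    return rows


# Hypothetical input: one record with two list-valued blocking columns.
rows = [{"id": 1, "birthyr_range": [1900, 1901], "age_range": [10, 11, 12]}]
result = explode_all(rows, ["birthyr_range", "age_range"])
print(len(result))  # 2 * 3 = 6 rows
```

Checking the row count, as the new test in this PR does for the output tables, is a cheap way to confirm that every exploded column actually multiplied the rows.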

I found another possible bug when working on this. In the previous implementation, we selected all_column_names out of exploded_df if and only if there was at least one exploded column. The tests depend on this behavior. So I've added some logic to replicate the behavior and a comment explaining it. Changing this behavior is probably technically a breaking change because it changes the columns of the exploded_df_a and exploded_df_b tables. This might have ramifications for later tasks as well. One thing I did change is the order of the columns in exploded_df_a and exploded_df_b, since previously we were selecting with an unordered set. I've sorted the columns so that they aren't in a random order in the output tables.
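The column-selection behavior described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in link_step_explode.py: the function `output_columns` and the sample column names are hypothetical, and only `all_column_names` is a name taken from this PR.

```python
def output_columns(
    df_columns: list[str],
    all_column_names: set[str],
    exploded_columns: list[str],
) -> list[str]:
    """Pick the columns of the output table (hypothetical helper)."""
    if exploded_columns:
        # Sorting the set gives a deterministic column order.
        return sorted(all_column_names)
    # With nothing exploded, the extra select is skipped and the
    # input columns pass through unchanged.
    return list(df_columns)


df_columns = ["id", "namelast", "birthyr", "extra"]
all_column_names = {"namelast", "id", "birthyr"}

with_explode = output_columns(df_columns, all_column_names, ["birthyr"])
without_explode = output_columns(df_columns, all_column_names, [])
```

In this sketch `with_explode` holds only the sorted names from `all_column_names`, while `without_explode` keeps every input column in its original order, which mirrors why the two cases produce tables with different columns.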

Instead of trying to select() out all of the columns we need in each iteration, we can use withColumn() to add the exploded columns one by one. There is a weird bit of logic where we need to do an extra select only if there are exploded columns. I added a comment about that and will do a bit more looking into it.

anpumn

Looks good to me. But I don't know how this change is going to affect the future steps.

riley-harper · 2024-08-14T14:30:47Z

hlink/linking/matching/link_step_explode.py

-                    )
-                    for c in all_column_names
-                ]
+                explode_col_expr = explode(col(exploding_column_name))


Why is this exploding_column_name and not derived_from_column?

ccdavis

Really good documentation on the PR.

I don't feel like I really understand the code in general, so I can't comment in depth, but what you wrote explains and justifies the changes well.

riley-harper added 5 commits August 12, 2024 16:17

[#142] Add a test that fails because it has multiple exploded blockin…

6be2405

…g columns

[#142] Remove redundant aliasing

e57ac93

[#142] Sort the columns in exploded_df_a and exploded_df_b when selec…

71b9db5

…ting them out Previously we were selecting with a set, so the columns got all mixed up. Let's sort them so that they are easier to work with. The order of the columns should not affect the results.

[#142] Tweak the new blocking-explode test

1ae9a69

Check the size of the output tables to confirm that rows are being exploded correctly.

riley-harper requested review from anpumn and ccdavis August 13, 2024 17:09

anpumn approved these changes Aug 14, 2024

riley-harper commented Aug 14, 2024

ccdavis approved these changes Aug 14, 2024

riley-harper merged commit 21e9007 into main Aug 14, 2024
6 checks passed

riley-harper deleted the multiple_exploded_columns branch August 14, 2024 20:51
