Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$join() with join_nulls = TRUE doesn't produce expected result with DataFrames #945

Closed
george-wood opened this issue Mar 19, 2024 · 5 comments · Fixed by #949
Closed

$join() with join_nulls = TRUE doesn't produce expected result with DataFrames #945

george-wood opened this issue Mar 19, 2024 · 5 comments · Fixed by #949
Labels
bug Something isn't working

Comments

@george-wood
Copy link
Contributor

I was using the join_nulls argument in $join()$, added in #716. Joining LazyFrames with join_nulls = TRUE produces the expected result. However, join_nulls seems to be ignored when joining DataFrames. I'd be happy to look into this if it might be a good first issue?

library(polars)

df1 <-
  pl$DataFrame(
    x = "John",
    y = NULL
  )

df2 <-
  pl$DataFrame(
    x = "John",
    y = NULL,
    value = 1
  )

df1$
  join(
    other = df2,
    on = c("x", "y"),
    how = "left",
    join_nulls = TRUE
  )
#> shape: (1, 3)
#> ┌──────┬──────┬───────┐
#> │ x    ┆ y    ┆ value │
#> │ ---  ┆ ---  ┆ ---   │
#> │ str  ┆ null ┆ f64   │
#> ╞══════╪══════╪═══════╡
#> │ John ┆ null ┆ null  │
#> └──────┴──────┴───────┘

# lazy
df1$lazy()$
  join(
    other = df2$lazy(),
    on = c("x", "y"),
    how = "left",
    join_nulls = TRUE
  )$
  collect()
#> shape: (1, 3)
#> ┌──────┬──────┬───────┐
#> │ x    ┆ y    ┆ value │
#> │ ---  ┆ ---  ┆ ---   │
#> │ str  ┆ null ┆ f64   │
#> ╞══════╪══════╪═══════╡
#> │ John ┆ null ┆ 1.0   │
#> └──────┴──────┴───────┘

Created on 2024-03-19 with reprex v2.1.0

There is a test for join_nulls = TRUE using LazyFrames, but not DataFrames:

test_that("argument 'join_nulls' works", {

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 19, 2024

Thanks for reporting!
Since this could be an upstream (Rust) issue, could you please check if you can reproduce it in Python if possible?

If it does not reproduce in Python, it is a problem with this repository and I would be happy for you to work on it; if you get the same result in Python, we will have to report the problem upstream.

@eitsupi eitsupi added the bug Something isn't working label Mar 19, 2024
@george-wood
Copy link
Contributor Author

george-wood commented Mar 19, 2024

Thanks! I checked in Python and it returned the same (expected) output for LazyFrames and DataFrames, so it seems to be a problem with r-polars.

import polars as pl

df1 = pl.DataFrame(
  {
    "x": ["John"],
    "y": [None]
  }
)

df2 = pl.DataFrame(
  {
    "x": ["John"],
    "y": [None],
    "value": [1]
  }
)

print(
  df1
  .join(
    other = df2,
    how = "left", 
    on = ["x", "y"],
    join_nulls = True
  )
)
#> shape: (1, 3)
#> ┌──────┬──────┬───────┐
#> │ x    ┆ y    ┆ value │
#> │ ---  ┆ ---  ┆ ---   │
#> │ str  ┆ null ┆ i64   │
#> ╞══════╪══════╪═══════╡
#> │ John ┆ null ┆ 1     │
#> └──────┴──────┴───────┘

# lazy
print(
  df1.lazy()
  .join(
    other = df2.lazy(),
    how = "left", 
    on = ["x", "y"],
    join_nulls = True
  )
  .collect()
)
#> shape: (1, 3)
#> ┌──────┬──────┬───────┐
#> │ x    ┆ y    ┆ value │
#> │ ---  ┆ ---  ┆ ---   │
#> │ str  ┆ null ┆ i64   │
#> ╞══════╪══════╪═══════╡
#> │ John ┆ null ┆ 1     │
#> └──────┴──────┴───────┘

Created at 2024-03-19 16:43:14 GMT by reprexlite v0.5.0

@etiennebacher
Copy link
Collaborator

The bug is actually very simple: the user can provide the join_nulls arg but it is never passed to the lazy method:

r-polars/R/dataframe__frame.R

Lines 1008 to 1029 in d30005b

DataFrame_join = function(
other,
on = NULL,
how = c("inner", "left", "outer", "semi", "anti", "cross", "outer_coalesce"),
...,
left_on = NULL,
right_on = NULL,
suffix = "_right",
validate = "m:m",
join_nulls = FALSE,
allow_parallel = TRUE,
force_parallel = FALSE) {
if (!is_polars_df(other)) {
Err_plain("`other` must be a DataFrame.") |>
unwrap("in $join():")
}
.pr$DataFrame$lazy(self)$join(
other = other$lazy(), left_on = left_on, right_on = right_on,
on = on, how = how, suffix = suffix, allow_parallel = allow_parallel,
force_parallel = force_parallel
)$collect()
}

@george-wood I think this is a good first issue, do you want to fix this and add a test? Feel free to open an incomplete PR if you struggle somewhere and we can see how to carry it to the finish line

@george-wood
Copy link
Contributor Author

@etiennebacher I'd be happy to.

@george-wood
Copy link
Contributor Author

I noticed that the validate arg is also not passed to the lazy method. I'll fix this too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants