Polars Course Missing Data Notebook #96
Haleshot
left a comment
Great first notebook contrib! Left some comments as part of the review.
Haleshot
left a comment
Left some comments mainly for stacking & labelling of outputs.
etrotta
left a comment
Taking a look, I feel like there are some more things we may want to mention to make it more complete so left some comments, but most importantly: For me the NaN explanations read like "NaN is the same as null, just used in its place for floats" which is incorrect in multiple ways.
If you think it would take too much space to explain it better, it might be better to only talk about null and tell the readers to read the official explanation if they must work with NaN
You may also want to mention the nulls_equal keyword argument on df.join, explaining the default behavior and how it changes using it, and perhaps mention how null values are treated on aggregations (both as part of the column(s) to group by as well as the values being aggregated)
polars/11_missing_data.py
Outdated
|
|
||
| @app.cell | ||
| def _(mo): | ||
| mo.md("""Polars datatype specific features for missing data are NaN(for float values) and null(for everything else).""") |
null and NaN are completely different things in polars. Not a Number (NaN) is not a missing value in polars; it is a float value like 0.0 or inf.
There are some methods dedicated specially to dealing with it like expr.is_nan() (similar to but distinct from expr.is_null()), and you can have both NaN and nulls in float columns.
That confusion might arise from the way NaN is used by other libraries like numpy and pandas?
Co-authored-by: Srihari Thyagarajan <57552973+Haleshot@users.noreply.github.com>
|
@folicks Hey Felix, let me know when I can review this PR (no rush). |
I think you mixed these up?
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
pl.col("score").eq_missing(pl.col("B")).alias("A_eq_missing_B")
Maybe try to use a slightly more realistic dataset rather than having columns literally called A and B
edit; oh wait I didn't realize it was set to Draft, never mind
|
|
||
| # Compare using == operator (same as .eq()) | ||
| eq_result = df.with_columns([ | ||
| (pl.col("fruits") == pl.col("fruits")).alias("A_eq_B") |
That alias is wrong, or at least extremely misleading
@folicks;
Agree with @etrotta here;
You're comparing fruits to itself but labeling it A_eq_B, then comparing score to B but calling it A_eq_missing_B. It also doesn't show the difference between == and eq_missing().
I think something like this works:
# Compare fruits column to itself
fruits_eq = df.with_columns([
(pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])
# Compare with eq_missing - shows true for null == null
fruits_eq_missing = df.with_columns([
pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])
| def _(df, mo, pl): | ||
| mean_age = df.select(pl.col("age").mean()).item() | ||
| imputed = df.with_columns( | ||
| pl.col("age").fill_null(mean_age).alias("age_imputed") |
You could use pl.col("age").fill_null(pl.col("age").mean()) directly, or even pl.col("age").fill_null(strategy="mean")
|
Haleshot
left a comment
Thanks for addressing the past comments/feedback, appreciate it 🎉; would recommend addressing @etrotta's comments and the minor nits mentioned in the latest review.
| r""" | ||
| # Handling Missing Data in Polars | ||
|
|
||
| _By Felix Najera (https://github.com/folicks)._ |
| _By Felix Najera (https://github.com/folicks)._ | |
| _By [Felix Najera](https://github.com/folicks)._ |
| def _(mo): | ||
| mo.md( | ||
| r""" | ||
| /// details | Disclamier |
| /// details | Disclamier | |
| /// details | Disclaimer |
|
|
||
| @app.cell(hide_code=True) | ||
| def _(mo): | ||
| mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""") |
| mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""") | |
| mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all peculiarities that may come from alternative dataframes found in other packages such as Pandas.""") |
@folicks Just dropping by to follow up on whether you've had a chance to look at this PR again (& the review comments posted above).
📝 Summary
https://github.com//issues/40
📋 Checklist
--sandboxREADME.md