Polars Course Missing Data Notebook #96

folicks · 2025-05-01T23:20:54Z

📝 Summary

https://github.com//issues/40

📋 Checklist

I have included package dependencies in the notebook file using --sandbox
If adding a course, include a README.md
Keep language direct and simple.

used --sandbox

Haleshot

Great first notebook contrib! Left some comments as part of the review.

polars/11_missing_data.py

Haleshot · 2025-05-07T04:25:41Z

I'd also rephrase your above PR description (on GitHub) to reflect:

A summary of the notebook and its contents (you can refer to other PRs for reference, e.g. this and this).
There is no README being added here, so I'd uncheck the box.

Haleshot

Left some comments mainly for stacking & labelling of outputs.

polars/11_missing_data.py

etrotta

Taking a look, I feel there are a few more things we may want to mention to make the notebook more complete, so I left some comments. Most importantly: to me, the NaN explanations read like "NaN is the same as null, just used in its place for floats", which is incorrect in multiple ways.

If you think explaining it properly would take too much space, it might be better to talk only about null and point readers to the official explanation if they must work with NaN.

You may also want to mention the nulls_equal keyword argument on df.join, explaining the default behavior and how it changes when the argument is set, and perhaps mention how null values are treated in aggregations (both in the column(s) being grouped by and in the values being aggregated).

etrotta · 2025-05-08T17:00:33Z

polars/11_missing_data.py

+
+@app.cell
+def _(mo):
+    mo.md("""Polars datatype specific features for missing data are NaN(for float values) and null(for everything else).""")


null and NaN are completely different things in Polars. NaN (Not a Number) is not a missing value; it is a float value like 0.0 or inf.
There are some methods dedicated to dealing with it, like expr.is_nan() (similar to but distinct from expr.is_null()), and you can have both NaN and nulls in float columns.

That confusion might arise from the way NaN is used by other libraries like numpy and pandas?

polars/11_missing_data.py

Co-authored-by: Srihari Thyagarajan <57552973+Haleshot@users.noreply.github.com>

Haleshot · 2025-05-28T05:49:36Z

@folicks Hey Felix, let me know when I can review this PR (no rush).

etrotta

I think you mixed these up?

(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
pl.col("score").eq_missing(pl.col("B")).alias("A_eq_missing_B")

Maybe try to use a slightly more realistic dataset rather than having columns literally called A and B

edit; oh wait I didn't realize it was set to Draft, never mind

etrotta · 2025-05-28T14:57:16Z

polars/11_missing_data.py

+
+    # Compare using == operator (same as .eq())
+    eq_result = df.with_columns([
+        (pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")


That alias is wrong or at least extremely misleading

etrotta · 2025-05-28T14:59:33Z

polars/11_missing_data.py

+def _(df, mo, pl):
    mean_age = df.select(pl.col("age").mean()).item()
    imputed = df.with_columns(
        pl.col("age").fill_null(mean_age).alias("age_imputed")


You could use pl.col("age").fill_null(pl.col("age").mean()) directly, or even pl.col("age").fill_null(strategy="mean")

folicks · 2025-05-29T02:00:06Z

> @folicks Hey Felix, let me know when I can review this PR (no rush).

yes I think so

Haleshot

Thanks for addressing the past comments/feedback, appreciate it 🎉; I'd recommend addressing @etrotta's comments and some minor nits mentioned in the latest review.

Haleshot · 2025-05-29T12:20:05Z

polars/11_missing_data.py

+        r"""
+    # Handling Missing Data in Polars
+
+    _By Felix Najera (https://github.com/folicks)._  


Suggested change

_By Felix Najera (https://github.com/folicks)._

_By [Felix Najera](https://github.com/folicks)._

Haleshot · 2025-05-29T12:23:06Z

polars/11_missing_data.py

+def _(mo):
+    mo.md(
+        r"""
+    /// details | Disclamier


Suggested change

/// details | Disclamier

/// details | Disclaimer

Haleshot · 2025-05-29T13:04:31Z

polars/11_missing_data.py

+
+    # Compare using == operator (same as .eq())
+    eq_result = df.with_columns([
+        (pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")


@folicks;
Agree with @etrotta here;

You're comparing fruits to itself but labeling it A_eq_B, then comparing score to B but calling it A_eq_missing_B. It also doesn't show the difference between == and eq_missing().

I think something like this works:

# Compare fruits column to itself
fruits_eq = df.with_columns([
    (pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])

# Compare with eq_missing - shows true for null == null
fruits_eq_missing = df.with_columns([
    pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])

Haleshot · 2025-05-29T13:23:26Z

polars/11_missing_data.py

+
+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""")


Suggested change

mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""")

mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all peculiarities that may come from alternative dataframes found in other packages such as Pandas.""")

Haleshot · 2025-06-25T11:51:41Z

@folicks Just dropping by to follow-up on if you had a chance to look at this PR again (& the review comments posted above).

folicks and others added 3 commits May 1, 2025 15:10

quick tutorial

2b39989

added the polars missing data course notebook

c286938

Missing Data Notebook

1ee5742

used --sandbox

Haleshot reviewed May 7, 2025

etrotta suggested changes May 8, 2025

folicks and others added 3 commits May 13, 2025 17:49

better tldr block polars/11_missing_data.py

5c8f61b

Co-authored-by: Srihari Thyagarajan <57552973+Haleshot@users.noreply.github.com>

thank you

ed17f32

Co-authored-by: Srihari Thyagarajan <57552973+Haleshot@users.noreply.github.com>

Update polars/11_missing_data.py

e6648d0

Co-authored-by: Srihari Thyagarajan <57552973+Haleshot@users.noreply.github.com>

folicks marked this pull request as draft May 28, 2025 03:11

folicks added 2 commits May 27, 2025 20:14

all comments

78cae59

Merge branch 'main' of https://github.com/folicks/missing-data-polars

fe87aae

folicks marked this pull request as ready for review May 28, 2025 03:16

folicks marked this pull request as draft May 28, 2025 03:17

etrotta reviewed May 28, 2025

Haleshot reviewed May 29, 2025

folicks closed this Sep 11, 2025

etrotta mentioned this pull request Sep 14, 2025

Add Missing Data notebook to Polars course #121

Merged

3 tasks
