Add missing filter options #331

mkfreeman · 2022-11-30T01:23:27Z

This is a small code change that warrants a detailed explanation. For context, the __table function creates JavaScript array filters that mimic SQL filter functions. This PR is concerned with the way we think about NULL values. The SQL code that gets constructed by the Table cell reads:

// If an "IS NULL" filter is applied to a column, add a SQL "IS NULL" statement
case "n":
        appendSql(` IS NULL`, args);
        return;

The corresponding JavaScript code currently reads:

// If an "IS NULL" filter is applied to a column, filter down to JavaScript `null` OR undefined values
case "n": {
        source = source.filter((d) => d[column] == null)

While this makes sense at face value, the concept of null means something different in JavaScript and SQL. In SQL, "A field with a NULL value is a field with no value." (source). In JavaScript, null is a primitive value which may appear in the dataset.

Currently, the way the __table function checks for null values doesn't align with the way the Observable Data Table cell checks for NULL/EMPTY. When computing summaries for the Summary Charts, the Data Table cell uses the following check:

 isMissing = (value) => {
  // Note that we use Number.isNaN because isNaN can return unexpected results
  // for non-numeric values.
  return value == null || value === "" || Number.isNaN(value);
};

Note, it checks if the value is null or undefined in the same way the standard library currently does, and also includes two additional checks: empty string (value === "") and NaN Number.isNaN(value). These checks importantly identify common JavaScript data patterns that are uncommon (or not possible?) in SQL.

A remaining question is: should we make any additional changes to the SQL code to ensure comparable behaviors?

See this notebook for additional context and interactive examples.

mbostock · 2022-11-30T01:46:36Z

We shouldn’t change the behavior of the existing n and nn operators. They have a clear meaning—is null and is not null—and the empty string is not the same as null in both SQL and JavaScript. Moreover, we shouldn’t change the definition of existing operators since doing so would retroactively affect existing notebooks (and it’s difficult to argue that this is merely fixing a bug).

I think if you want a query that matches NULL (null), the empty string, and NaN in JavaScript (and maybe also Invalid Date objects? I don’t recall how we represent those), it should probably be an in query, or perhaps an entirely new operator so that we can avoid changing the behavior of existing ones. I think it’s reasonable to map the JavaScript NaN value to be NULL in SQL, but I don’t think we should do the reverse because these are distinct types in JavaScript unlike SQL.

mkfreeman · 2022-11-30T02:16:29Z

Thanks, @mbostock - how would you feel about adding new missing (m) and notMissing (nm) queries in the __table function that perform the suggested checks, e.g.:

case "m": {
        source = source.filter(
          (d) =>
            d[column] == null || d[column] === "" || Number.isNaN(d[column])
        );
        break;
      }
case "nm": {
        source = source.filter(
          (d) =>
            d[column] != null && d[column] !== "" && !Number.isNaN(d[column])
        );

If we add those, it would surface two questions:

What are the corresponding SQL queries (if any)?
If there are no corresponding SQL queries, do we conditionally show different filter options in the Data Table filter menu depending on the Data Source type (e.g., JavaScript shows the new is missing and is not missing options)?

mkfreeman · 2022-11-30T02:36:14Z

What are the corresponding SQL queries (if any)?

Perhaps a missing m check for SQL is just NULL or '' (empty string):

case "m":
        appendSql(` IS NULL`, args);
        appendSql(` = ""`, args); // not sure if we need to use appendOperand before this
        break;
case "nm":
        appendSql(` IS NOT NULL`, args);
        appendSql(` !=""`, args);

mbostock · 2022-12-01T00:32:45Z

The SQL for the m query would presumably have to look like (x IS NULL OR x = '' OR x = float 'NaN')?

(I’m tempted to use d for “defined” and nd for “not defined” to avoid a double negative, but it doesn’t matter much since those names are only used internally.)

mkfreeman · 2022-12-01T17:37:04Z

That makes sense -- for the sake of conversation (see this notebook in progress) I've added the d and nd options to this PR (and changed the PR title).

mkfreeman · 2022-12-06T18:39:10Z

A brief explanation of Only check that the first value is a primitive type:

Because we want to support columns that put unmatched values into a "missing" bin, we no longer want to check that the first n (20) values are of the same type. For example, we want to support the array of strings ["a", "b", 1] that indicates 1 value not string. Without the above change, we would get an Invalid table data error pushed up.

mbostock

You appear to be removing some of the intentional “guard rails” for the data table cell here; this will allow the data table cell to be used on primitive arrays of mixed types (for example, a mix of Dates and objects, or a mix of booleans and numbers). Why do we need to do this? Is this type of mixed data something we expect to see in practice, and that we want to show in a data table cell? I’m not sure why we’re removing the consistency requirements that we previously implemented and that seem like a good idea (from my perspective).

mkfreeman · 2022-12-07T18:06:41Z

Important questions! I wanted to write out my thoughts before chatting. Feel free to comment here / in the notebook, or just wait until our meeting.

I'd say we're just moving the guard rails somewhere else 😅 .

mkfreeman · 2022-12-07T20:34:04Z

Corresponding monorepo PR that uses these filters.

src/table.js

mkfreeman · 2022-12-08T15:05:48Z

Here's a notebook with the isValid function tested against a number of values. Two observations that I found initially surprising (but I think are fine):

object columns: array, new ArrayBuffer, and date are all valid (which makes sense, they're objects)
other columns: everything except null and undefined is valid (makes sense! I guess they're valid forms of "other")

mbostock

Let me know when you’re ready for me to take a look.

mbostock · 2022-12-08T15:35:38Z

src/table.js

+    case "boolean":
+      return typeof value === colType;
+    case "number":
+      return typeof value === colType && !Number.isNaN(value);


You can simplify here, too, since at this point we already know that value is a number and hence will not require coerce.

Suggested change

return typeof value === colType && !Number.isNaN(value);

return typeof value === colType && !isNaN(value);

Maybe I'm mistaken, but I thought that isNaN does the coercion, so here is used Number.isNaN to avoid unnecessarily doing coercion.

That’s correct. You can use either here and get the same result. The only difference is that one is shorter to type—that’s all I was saying. You’re welcome to leave it as-is.

Ah, by "simplify", I thought you meant simplify the computation.

Number.isNaN seems "simpler" in terms of the operation (doesn't do the additionally complex step of coercion)

isNaN is simpler in terms of number of characters 😄

This reverts commit b701de1.

mbostock

Functionally, this looks great!

Implementation-wise, I think there’s an easy improvement we can make. This is currently doing the switch (colType) for every value; it’d be faster to restructure the code to have separate functions for each type, e.g.,

function isValidNumber(value) {
  return typeof value === "number" && !Number.isNaN(value);
}
function isValidDate(value) {
  return value instanceof Date && !isNaN(value);
}

and then have a function for getting a validator of the specified column type like so:

function validator(colType) {
  switch (colType) {
    case "number": return isValidNumber;
    case "date": return isValidDate;
    // etc.
    default: throw new Error(`unknown type: ${colType}`); // maybe? see comment below
  }
}

And lastly you could then apply the validator like so:

const isValid = validator(colType);
source = source.filter(d => isValid(d[column]));

That way we only have to check colType once.

And also we can throw an error if we see a colun type we don’t expect… though according to the DatabaseClient specification, I think we should probably treat an unknown value the same way as we treat "other" rather than throwing an error.

mkfreeman · 2022-12-08T19:23:12Z

Great point -- appreciate you took the time to write that out, will make that change soon and re-request a review!

annie

LGTM based on the previous discussion in this PR!

perhaps we should also add some unit testing for getValidator, either in this or a later PR?

src/table.js

test/table-test.js

Co-authored-by: Annie Zhang <annie@observablehq.com>

Add empty string and NaN checks to null filter

a7da246

mkfreeman requested review from annie and mbostock November 30, 2022 01:26

mkfreeman changed the title ~~Add empty string and NaN checks to null filter~~ Add missing filter options Dec 1, 2022

Add filters for defined and not defined

6358131

Only check that the first value is a primitive type

b701de1

mbostock reviewed Dec 6, 2022

View reviewed changes

Add valid and not valid filters

318c280

Add filters for dates

2b13fe4

mbostock requested changes Dec 7, 2022

View reviewed changes

src/table.js Outdated Show resolved Hide resolved

src/table.js Outdated Show resolved Hide resolved

Add isValid function and export it

738cd97

mbostock reviewed Dec 8, 2022

View reviewed changes

mkfreeman added 5 commits December 8, 2022 10:48

Pass the correct value to isValid

4689506

Remove unused nd and d filters

ff24745

Remove prettierignore file

6950053

Revert "Only check that the first value is a primitive type"

1f38439

This reverts commit b701de1.

Remove old comment

f6f1b12

mkfreeman requested a review from mbostock December 8, 2022 18:22

mbostock requested changes Dec 8, 2022

View reviewed changes

Extract validators into separate functions to avoid redundant checks

f386070

mkfreeman requested a review from mbostock December 8, 2022 21:21

annie approved these changes Dec 12, 2022

View reviewed changes

mbostock approved these changes Dec 12, 2022

View reviewed changes

src/table.js Outdated Show resolved Hide resolved

mkfreeman added 2 commits December 12, 2022 14:35

Rename getValidator getTypeValidator

0ac2c9f

Add getValidator tests

afdf9b7

mkfreeman requested a review from annie December 12, 2022 20:06

Create null/not null filters for valid/invalid column values

d7ac9d8

annie reviewed Dec 12, 2022

View reviewed changes

test/table-test.js Outdated Show resolved Hide resolved

mkfreeman and others added 2 commits December 12, 2022 15:21

Don't repeatedly re-define source

d3cc14e

Co-authored-by: Annie Zhang <annie@observablehq.com>

Merge branch 'main' into mkfreeman/nan-values

ec986a8

mkfreeman merged commit bcd86dc into main Dec 13, 2022

mkfreeman deleted the mkfreeman/nan-values branch December 13, 2022 18:32

	return typeof value === colType && !Number.isNaN(value);
	return typeof value === colType && !isNaN(value);

Add missing filter options #331

Add missing filter options #331

Uh oh!

Conversation

mkfreeman commented Nov 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbostock commented Nov 30, 2022

Uh oh!

mkfreeman commented Nov 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkfreeman commented Nov 30, 2022

Uh oh!

mbostock commented Dec 1, 2022

Uh oh!

mkfreeman commented Dec 1, 2022

Uh oh!

mkfreeman commented Dec 6, 2022

Uh oh!

mbostock left a comment

Choose a reason for hiding this comment

Uh oh!

mkfreeman commented Dec 7, 2022

Uh oh!

mkfreeman commented Dec 7, 2022

Uh oh!

Uh oh!

Uh oh!

mkfreeman commented Dec 8, 2022

Uh oh!

mbostock left a comment

Choose a reason for hiding this comment

Uh oh!

mbostock Dec 8, 2022

Choose a reason for hiding this comment

Uh oh!

mkfreeman Dec 8, 2022

Choose a reason for hiding this comment

Uh oh!

mbostock Dec 8, 2022

Choose a reason for hiding this comment

Uh oh!

mkfreeman Dec 8, 2022

Choose a reason for hiding this comment

Uh oh!

mbostock left a comment

Choose a reason for hiding this comment

Uh oh!

mkfreeman commented Dec 8, 2022

Uh oh!

annie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mkfreeman commented Nov 30, 2022 •

edited

Loading

mkfreeman commented Nov 30, 2022 •

edited

Loading