Skip to content

Add missing filter options #331

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Dec 13, 2022
Merged

Add missing filter options #331

merged 17 commits into from
Dec 13, 2022

Conversation

mkfreeman
Copy link
Contributor

@mkfreeman mkfreeman commented Nov 30, 2022

This is a small code change that warrants a detailed explanation. For context, the __table function creates JavaScript array filters that mimic SQL filter functions. This PR is concerned with the way we think about NULL values. The SQL code that gets constructed by the Table cell reads:

// If an "IS NULL" filter is applied to a column, add a SQL "IS NULL" statement
case "n":
        appendSql(` IS NULL`, args);
        return;

The corresponding JavaScript code currently reads:

// If an "IS NULL" filter is applied to a column, filter down to JavaScript `null` OR undefined values
case "n": {
        source = source.filter((d) => d[column] == null)

While this makes sense at face value, the concept of null means something different in JavaScript and SQL. In SQL, "A field with a NULL value is a field with no value." (source). In JavaScript, null is a primitive value which may appear in the dataset.

Currently, the way the __table function checks for null values doesn't align with the way the Observable Data Table cell checks for NULL/EMPTY. When computing summaries for the Summary Charts, the Data Table cell uses the following check:

 isMissing = (value) => {
  // Note that we use Number.isNaN because isNaN can return unexpected results
  // for non-numeric values.
  return value == null || value === "" || Number.isNaN(value);
};

Note, it checks if the value is null or undefined in the same way the standard library currently does, and also includes two additional checks: empty string (value === "") and NaN Number.isNaN(value). These checks importantly identify common JavaScript data patterns that are uncommon (or not possible?) in SQL.

A remaining question is: should we make any additional changes to the SQL code to ensure comparable behaviors?

See this notebook for additional context and interactive examples.

@mkfreeman mkfreeman requested review from annie and mbostock November 30, 2022 01:26
@mbostock
Copy link
Member

We shouldn’t change the behavior of the existing n and nn operators. They have a clear meaning—is null and is not null—and the empty string is not the same as null in both SQL and JavaScript. Moreover, we shouldn’t change the definition of existing operators since doing so would retroactively affect existing notebooks (and it’s difficult to argue that this is merely fixing a bug).

I think if you want a query that matches NULL (null), the empty string, and NaN in JavaScript (and maybe also Invalid Date objects? I don’t recall how we represent those), it should probably be an in query, or perhaps an entirely new operator so that we can avoid changing the behavior of existing ones. I think it’s reasonable to map the JavaScript NaN value to be NULL in SQL, but I don’t think we should do the reverse because these are distinct types in JavaScript unlike SQL.

@mkfreeman
Copy link
Contributor Author

mkfreeman commented Nov 30, 2022

Thanks, @mbostock - how would you feel about adding new missing (m) and notMissing (nm) queries in the __table function that perform the suggested checks, e.g.:

case "m": {
        source = source.filter(
          (d) =>
            d[column] == null || d[column] === "" || Number.isNaN(d[column])
        );
        break;
      }
case "nm": {
        source = source.filter(
          (d) =>
            d[column] != null && d[column] !== "" && !Number.isNaN(d[column])
        );

If we add those, it would surface two questions:

  • What are the corresponding SQL queries (if any)?
  • If there are no corresponding SQL queries, do we conditionally show different filter options in the Data Table filter menu depending on the Data Source type (e.g., JavaScript shows the new is missing and is not missing options)?

@mkfreeman
Copy link
Contributor Author

What are the corresponding SQL queries (if any)?

Perhaps a missing m check for SQL is just NULL or '' (empty string):

case "m":
        appendSql(` IS NULL`, args);
        appendSql(` = ""`, args); // not sure if we need to use appendOperand before this
        break;
case "nm":
        appendSql(` IS NOT NULL`, args);
        appendSql(` !=""`, args);

@mbostock
Copy link
Member

mbostock commented Dec 1, 2022

The SQL for the m query would presumably have to look like (x IS NULL OR x = '' OR x = float 'NaN')?

(I’m tempted to use d for “defined” and nd for “not defined” to avoid a double negative, but it doesn’t matter much since those names are only used internally.)

@mkfreeman mkfreeman changed the title Add empty string and NaN checks to null filter Add missing filter options Dec 1, 2022
@mkfreeman
Copy link
Contributor Author

That makes sense -- for the sake of conversation (see this notebook in progress) I've added the d and nd options to this PR (and changed the PR title).

@mkfreeman
Copy link
Contributor Author

A brief explanation of Only check that the first value is a primitive type:

Because we want to support columns that put unmatched values into a "missing" bin, we no longer want to check that the first n (20) values are of the same type. For example, we want to support the array of strings ["a", "b", 1] that indicates 1 value not string. Without the above change, we would get an Invalid table data error pushed up.

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You appear to be removing some of the intentional “guard rails” for the data table cell here; this will allow the data table cell to be used on primitive arrays of mixed types (for example, a mix of Dates and objects, or a mix of booleans and numbers). Why do we need to do this? Is this type of mixed data something we expect to see in practice, and that we want to show in a data table cell? I’m not sure why we’re removing the consistency requirements that we previously implemented and that seem like a good idea (from my perspective).

@mkfreeman
Copy link
Contributor Author

Important questions! I wanted to write out my thoughts before chatting. Feel free to comment here / in the notebook, or just wait until our meeting.

I'd say we're just moving the guard rails somewhere else 😅 .

@mkfreeman
Copy link
Contributor Author

Corresponding monorepo PR that uses these filters.

@mkfreeman
Copy link
Contributor Author

Here's a notebook with the isValid function tested against a number of values. Two observations that I found initially surprising (but I think are fine):

  • object columns: array, new ArrayBuffer, and date are all valid (which makes sense, they're objects)
  • other columns: everything except null and undefined is valid (makes sense! I guess they're valid forms of "other")

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know when you’re ready for me to take a look.

src/table.js Outdated
case "boolean":
return typeof value === colType;
case "number":
return typeof value === colType && !Number.isNaN(value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify here, too, since at this point we already know that value is a number and hence will not require coerce.

Suggested change
return typeof value === colType && !Number.isNaN(value);
return typeof value === colType && !isNaN(value);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm mistaken, but I thought that isNaN does the coercion, so here is used Number.isNaN to avoid unnecessarily doing coercion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s correct. You can use either here and get the same result. The only difference is that one is shorter to type—that’s all I was saying. You’re welcome to leave it as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, by "simplify", I thought you meant simplify the computation.

  • Number.isNaN seems "simpler" in terms of the operation (doesn't do the additionally complex step of coercion)
  • isNaN is simpler in terms of number of characters 😄

@mkfreeman mkfreeman requested a review from mbostock December 8, 2022 18:22
Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, this looks great!

Implementation-wise, I think there’s an easy improvement we can make. This is currently doing the switch (colType) for every value; it’d be faster to restructure the code to have separate functions for each type, e.g.,

function isValidNumber(value) {
  return typeof value === "number" && !Number.isNaN(value);
}
function isValidDate(value) {
  return value instanceof Date && !isNaN(value);
}

and then have a function for getting a validator of the specified column type like so:

function validator(colType) {
  switch (colType) {
    case "number": return isValidNumber;
    case "date": return isValidDate;
    // etc.
    default: throw new Error(`unknown type: ${colType}`); // maybe? see comment below
  }
}

And lastly you could then apply the validator like so:

const isValid = validator(colType);
source = source.filter(d => isValid(d[column]));

That way we only have to check colType once.

And also we can throw an error if we see a colun type we don’t expect… though according to the DatabaseClient specification, I think we should probably treat an unknown value the same way as we treat "other" rather than throwing an error.

@mkfreeman
Copy link
Contributor Author

Great point -- appreciate you took the time to write that out, will make that change soon and re-request a review!

@mkfreeman mkfreeman requested a review from mbostock December 8, 2022 21:21
Copy link
Contributor

@annie annie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on the previous discussion in this PR!

perhaps we should also add some unit testing for getValidator, either in this or a later PR?

@mkfreeman mkfreeman requested a review from annie December 12, 2022 20:06
mkfreeman and others added 2 commits December 12, 2022 15:21
@mkfreeman mkfreeman merged commit bcd86dc into main Dec 13, 2022
@mkfreeman mkfreeman deleted the mkfreeman/nan-values branch December 13, 2022 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants