Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coerce to the scale’s type #532

Merged
merged 9 commits into from Sep 7, 2021
Merged

Conversation

mbostock
Copy link
Member

@mbostock mbostock commented Sep 5, 2021

Once the scale type is known, we can coerce the domain and channel values to match. This makes it easier to correct for untyped data by setting the scale type explicitly, rather than relying on the user to fix their data; it also makes Plot behave more consistently and predictably.

For example, consider a simple line plot. If we forget to pass typed data and the Date and Close columns are strings, they are interpreted as ordinal values rather than temporal and quantitative, respectively. Oops!

Screen Shot 2021-09-05 at 10 12 25 AM

Plot.plot({
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

We might try to fix this by setting the scale types explicitly. However, since the data are still strings, and since Plot uses d3.min and d3.max to compute the quantitative domain, it doesn’t work very well! The inferred domains are ["2013-05-13", "2018-05-11"] = [Invalid Date, Invalid Date] and ["100.110001", "99.989998"] = [100.110001, 99.989998] respectively. Oops!

Screen Shot 2021-09-05 at 10 15 34 AM

Plot.plot({
  x: {
    type: "utc"
  },
  y: {
    type: "linear"
  },
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

We can correct by setting a scale.transform, but this is a relatively obscure feature.

Screen Shot 2021-09-05 at 10 15 53 AM

Plot.plot({
  x: {
    transform: x => new Date(x)
  },
  y: {
    transform: y => +y
  },
  marks: [
    Plot.line(aapl, {x: "Date", y: "Close"})
  ]
})

With this PR, the input data are automatically coerced to match the scale’s type, giving a more obvious fix to the common case of trying to use strings as temporal or quantitative data. (I.e., with this PR, the second example above produces the desired chart shown in the third example.) In the future, we could even consider requiring an explicitly-specified scale type when the inferred type is ordinal and the inferred domain has high cardinality (say, forty or more); this would cause the first chart to error, at which point you could enter the scale types to correct the problem, or better, type the data before passing it to Plot.

Related #74 #513.

@mbostock mbostock requested a review from Fil September 5, 2021 17:24
@mbostock
Copy link
Member Author

mbostock commented Sep 5, 2021

(There’s a small potential optimization here where if we know the scale type will be quantitative or temporal before we’ve computed the channel values, say because the scale type was specified explicitly, then we can avoid the extra copy and instead fold the coercion and typed array into the channel value construction. But we can’t do this in general because we need to know the channel values in order to infer the scale type. In any case hopefully an extra copy won’t add too much overhead, and I think it’s worth the cost for predictability regardless.)

Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely useful.

I would like to clarify the handling of dates. The regexp doesn’t match dates with a space instead of a T, like "2014-12-30 00:00:00+00:00", which happen in the wild and are a potential source of error (as you mention in https://observablehq.com/@mbostock/plotting-bens-google-fit-data : this format is parsed by Chrome but not Safari). Shouldn't we in this case replace the space by a T, rather than default to NaN?

Also, since there are many flavors of ISO 8601, we might want to document precisely the formats we're expecting with this regexp, and those we're not covering.

Paraphrasing the RegExp we have:

  • year: YYYY (and for years outside 0000-9999, -YYYYYY or +YYYYYY)
  • month: MM
  • day: DD
  • hour: hh
  • minutes: mm
  • seconds: ss
  • milliseconds: mmm
  • timezone: empty (local time), Z (UTC +00:00), +hh:mm (UTC +hh:mm)

for the date, the year is mandatory, optionally followed by month, and day:

  • (±YY)YYYY
  • (±YY)YYYY-MM
  • (±YY)YYYY-MM-DD

hour:minutes is optional, and optionally followed by seconds, milliseconds

  • hh:mm
  • hh:mm:ss
  • hh:mm:ss.mmm

timezone is optional

overall this makes a lot of combinations: (year)(Ttime)?(tz)?

Some parts of ISO 8601 that are not covered: time parts, week, day-of-year, 10th of seconds…

Here's a list of examples that are matched:

  • 2021-09-06T06:57:38 => 2021-09-06T06:57:38 local time
  • 2021-09-06T06:57:38+00:00 => 2021-09-06T06:57:38 UTC
  • 2021-09-06T06:57:38Z => 2021-09-06T06:57:38 UTC
  • 2021-09-06T06:57:38.123+00:00 => 2021-09-06T06:57:38.123 UTC
  • +012000-09-01 => +012000-09-01 00:00 UTC
  • -000100-09-01T02:02 => -000100-09-01T01:52:39.000Z // don't expect hours for far-away dates to work?
  • 1402 => 1402-01-01 00:00 UTC
  • 2008 => 2008-01-01 00:00 UTC
  • 2008-03 => 2008-03-01 00:00 UTC
  • 2021-09-06 => 2021-09-06 00:00 UTC
  • 2021T06:57:38Z => 2021-01-01T06:57:38.000Z UTC

How can we document this in README without doing a lot of "paraphrasing"? Is there a reference page we can point to that would cover exactly what is covered here?

@mbostock
Copy link
Member Author

mbostock commented Sep 6, 2021

I think we’re going for I think this comes from RFC 3339 section 5.6, the “internet date/time format” (which allows a space or lowercase T unlike ISO 8601): https://datatracker.ietf.org/doc/html/rfc3339#section-5.6

@mbostock
Copy link
Member Author

mbostock commented Sep 6, 2021

I don’t feel like we need to document this exhaustively. We can just say it’s a ISO 8601 date or date-time.

Shouldn't we in this case replace the space by a T, rather than default to NaN?

I’m not sure we want to broaden scope here beyond ISO 8601 into RFC 3339. If we do that, it’s more likely someone wouldn’t notice when their string is not compatible with the Date constructor in all browsers. The goal here isn’t to parse all possible date strings automatically; it’s to recommend a standard interchange format that works reliably. Formats outside that recommendation will need to be handled specially either by parsing yourself or using the timeFormat option as suggested in #535.

@Fil
Copy link
Contributor

Fil commented Sep 6, 2021

If I read RFC 3339 section 5.6 correctly, it doesn't seem to include "+" 6DIGIT years, and is more lax on "." 1*DIGIT sec-frac (we only accept "." 3DIGIT)? (I'm fine with that choice, I'm just looking for a reference to point to.)

But it makes me wonder, shouldn't we adopt d3.isoParse instead—as the default timeFormat function of #535? Though the code is a bit different, the results are identical in all my tests of what I understand the format to cover*. This way, documentation and tests would be centralized in d3-time-format.

[*] A difference is that isoParse("99-02") returns a date (1st Feb 1999), and isoParse("12") returns (1st January 1912)—this regexp doesn't cover these cases, and the parser returns respectively NaN and 12 (12ms after EPOCH).

@mbostock
Copy link
Member Author

mbostock commented Sep 6, 2021

d3.isoParse doesn’t do any validation—it just passes through to the Date constructor. So I don’t advise that because the goal here again is to recommend an interchange format for dates.

I do think it’d be worthwhile to have a parse function that goes along with isoformat if we want to make this code available outside of Plot.

@mbostock
Copy link
Member Author

mbostock commented Sep 6, 2021

Here’s parse for isoformat, WDYT? mbostock/isoformat#1

@mbostock
Copy link
Member Author

mbostock commented Sep 6, 2021

I have merged and published isoformat 0.2.

@mbostock mbostock requested a review from Fil September 6, 2021 20:12
Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a paragraph to README and CHANGELOG—not sure it's enough (do we want to repeat some of the guidance (about typing data), or details (such as "any non-string values are coerced to number first and treated as milliseconds since UNIX epoch")?

Comment on lines +198 to +201
// browsers. (In the future, this could be made more liberal if desired, though
// it is still generally preferable to do date parsing yourself explicitly,
// rather than rely on Plot.) Any non-string values are coerced to number first
// and treated as milliseconds since UNIX epoch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these comments could be reflected in the README?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a pass at the README before I merge. Thank you! 👍

src/scales.js Outdated
function coerceDate(x) {
return x instanceof Date ? x
: typeof x === "string" ? isoParse(x)
: new Date(x == null ? NaN : +x);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a Date instance for undefined or NaN input feels wrong here, especially because our defined helper considers invalid dates to be defined. We should probably return undefined instead of an invalid date.

@mbostock mbostock merged commit 0bbf70e into main Sep 7, 2021
@mbostock mbostock deleted the mbostock/coerce-to-scale-type branch September 7, 2021 02:23
@mbostock mbostock mentioned this pull request Feb 11, 2022
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants