Define NaN/Null/Inf handling #6

kylebrandt · 2019-09-06T16:35:45Z

Handling edge-case values is important to get right in the design; otherwise it will be difficult to build functions on top of the data we are working with.

For example, in the current model a time series consists of a Vector of *time.Time and a Vector of *float64 for its values (within a data.Frame). For now it is assumed that those vectors are of equal length.

Given this, we have the following logical situations to handle (or prevent):

The time is Null
The float value is Null
The float value is NaN
The float value is Inf- or Inf+
The series has no values (Vectors of Len 0). (In terms of implementation, the vectors could be Null, but I currently think of that as an implementation issue and not a logical one.)
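As a rough illustration of the per-point cases above, here is a minimal Go sketch. The names `PointKind`, `Classify`, `at`, and `f` are hypothetical and not part of any existing package; it only assumes the pointer-based vectors described earlier.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// PointKind is a hypothetical label for the logical cases listed above.
type PointKind int

const (
	Normal PointKind = iota
	NullTime
	NullValue
	NaNValue
	InfValue
)

// Classify reports which case a single (time, value) pair falls into.
// A null time takes precedence over anything wrong with the value.
func Classify(t *time.Time, v *float64) PointKind {
	switch {
	case t == nil:
		return NullTime
	case v == nil:
		return NullValue
	case math.IsNaN(*v):
		return NaNValue
	case math.IsInf(*v, 0): // sign 0 matches both +Inf and -Inf
		return InfValue
	default:
		return Normal
	}
}

// at and f are small helpers for building pointer values in examples.
func at(sec int64) *time.Time { t := time.Unix(sec, 0).UTC(); return &t }
func f(v float64) *float64    { return &v }

func main() {
	fmt.Println(Classify(nil, f(1)) == NullTime)
	fmt.Println(Classify(at(0), nil) == NullValue)
	fmt.Println(Classify(at(0), f(math.NaN())) == NaNValue)
	fmt.Println(Classify(at(0), f(math.Inf(-1))) == InfValue)
}
```

The empty-series case is a property of the whole vector (length 0) rather than of a single point, so it is not covered by `Classify`.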

(Note: We could maybe have fewer options by making floats in particular not nullable, but integers and other types would still need to be pointers to be nullable (in Go), so in the big picture it is probably easier to keep float a pointer so it is like everything else. Unless we want to have mask arrays like Arrow and approach nulls differently.)

Given these situations, they can generally be handled in one of the following ways (depending on the situation):

Keeping it as is (null stays null, NaN stays NaN, etc.)
Dropping the datapoint (from both vectors)
Replacing the value with a constant
Replacing the value with some relative value
Raising an Error
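To make these options concrete, here is a minimal sketch of how they might be applied to null values in a series. Everything here (`Policy`, `HandleNulls`, `ErrNullValue`, the helpers) is a hypothetical name for illustration, and "relative value" is interpreted as "the previous non-null value", which is only one possible reading.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Policy is a hypothetical enumeration of the handling options above.
type Policy int

const (
	Keep            Policy = iota // null stays null
	Drop                          // remove the point from both vectors
	ReplaceConst                  // substitute a constant
	ReplacePrevious               // substitute the previous non-null value
	RaiseError                    // fail on the first null
)

var ErrNullValue = errors.New("series contains a null value")

// HandleNulls applies one policy to null values only. NaN and Inf would
// need the same kind of treatment; they are left out to keep this short.
func HandleNulls(times []*time.Time, values []*float64, p Policy, fill float64) ([]*time.Time, []*float64, error) {
	if len(times) != len(values) {
		return nil, nil, errors.New("time and value vectors differ in length")
	}
	outT := make([]*time.Time, 0, len(times))
	outV := make([]*float64, 0, len(values))
	var prev *float64
	for i, v := range values {
		if v == nil {
			switch p {
			case Drop:
				continue // skip the point in both vectors
			case ReplaceConst:
				c := fill
				v = &c
			case ReplacePrevious:
				v = prev // stays null if there is no earlier value
			case RaiseError:
				return nil, nil, ErrNullValue
			}
		}
		if v != nil {
			prev = v
		}
		outT = append(outT, times[i])
		outV = append(outV, v)
	}
	return outT, outV, nil
}

// Helpers for building example data.
func f(v float64) *float64 { return &v }

func sampleTimes(n int) []*time.Time {
	out := make([]*time.Time, n)
	for i := range out {
		t := time.Unix(int64(i), 0).UTC()
		out[i] = &t
	}
	return out
}

func main() {
	vals := []*float64{f(1), nil, f(3)}
	_, dropped, _ := HandleNulls(sampleTimes(3), vals, Drop, 0)
	fmt.Println(len(dropped))
	_, _, err := HandleNulls(sampleTimes(3), vals, RaiseError, 0)
	fmt.Println(err)
}
```

Note that `Drop` is the option that can turn a non-empty series into an empty one, which is one of the logical situations listed earlier.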

In the context of data processing, it also matters when these values occur. For example, they could come from the data source query, or they could be the result of an operation performed on that data.

For GEL I imagine each UI node will have a dropdown that will have some options on how to handle these. This will impact how that node handles these values.

We don't want more options than are necessary, as that will just confuse us and the users. In general the options fall somewhere on a scale from "strict" handling to "best effort" handling.

Some examples of when these values matter:

sorting a series by time or value when times and/or values are null
reducing values that contain NaN/null values
dividing by zero
doing series arithmetic (joins of two series by time) resulting in null values
resampling resulting in null values (to maybe be filled)
empty series arising from dropping values
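The divide-by-zero case is a good example of edge values arising from an operation rather than from the data source. In Go, float64 division follows IEEE 754, so dividing by a zero-valued variable does not panic; it yields ±Inf or NaN. A minimal sketch, where `Divide` and `f` are hypothetical helpers and "null in, null out" is an assumed rule rather than a decided one:

```go
package main

import (
	"fmt"
	"math"
)

func f(v float64) *float64 { return &v }

// Divide returns null if either operand is null (an assumption for this
// sketch). Otherwise it performs plain float64 division, so a zero
// divisor produces +Inf, -Inf, or NaN instead of an error.
func Divide(a, b *float64) *float64 {
	if a == nil || b == nil {
		return nil
	}
	q := *a / *b
	return &q
}

func main() {
	fmt.Println(math.IsInf(*Divide(f(1), f(0)), 1))   // 1/0 is +Inf
	fmt.Println(math.IsInf(*Divide(f(-1), f(0)), -1)) // -1/0 is -Inf
	fmt.Println(math.IsNaN(*Divide(f(0), f(0))))      // 0/0 is NaN
	fmt.Println(Divide(nil, f(2)) == nil)             // null propagates
}
```

A "strict" option could turn these results into an error, while a "best effort" option would pass them through for a later node to deal with.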

ryantxu · 2019-09-06T17:32:07Z

I don't have a strong opinion on this. The only comments I will add are that on the frontend, null and undefined are essentially equivalent.

The relevant frontend setting is the NullValuesMode -- this is used for graph display and reducers:

  Null >> do whatever you do with nulls
  Ignore >> just skip it, pretend the null was not in the list to begin with
  AsZero >> replace any null with 0

Did more testing... with reducers, when the mode is Null, it behaves the same as Ignore. The mean value of [null, 10, null, 20, null] is 15.
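For comparison, here is a minimal Go sketch of a mean reducer that mirrors the three modes as described above. It is not the frontend code; `NullMode`, `Mean`, and the constant names are hypothetical, and treating the Null mode like Ignore follows the observation in this comment.

```go
package main

import "fmt"

// NullMode loosely mirrors the frontend NullValuesMode options.
type NullMode int

const (
	NullAsNull NullMode = iota // for reducers, observed to act like Ignore
	NullIgnore                 // skip nulls entirely
	NullAsZero                 // count each null as 0
)

// Mean returns the mean of values under the given mode, or null when
// there is nothing to average.
func Mean(values []*float64, mode NullMode) *float64 {
	sum, n := 0.0, 0
	for _, v := range values {
		switch {
		case v != nil:
			sum += *v
			n++
		case mode == NullAsZero:
			n++ // contributes 0 to the sum but counts toward n
		}
	}
	if n == 0 {
		return nil
	}
	m := sum / float64(n)
	return &m
}

func f(v float64) *float64 { return &v }

func main() {
	vals := []*float64{nil, f(10), nil, f(20), nil}
	fmt.Println(*Mean(vals, NullIgnore)) // (10+20)/2 = 15
	fmt.Println(*Mean(vals, NullAsZero)) // (10+20)/5 = 6
}
```

With AsZero the same input gives 6 rather than 15, so the choice of mode changes reducer results noticeably.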

kylebrandt added the needs design/spec label Sep 6, 2019

kylebrandt added this to TODO in GEL TODOs Sep 6, 2019

kylebrandt changed the title ~~Define NaN/Null/Inf Handling~~ Define NaN/Null/Inf handling Sep 6, 2019

kylebrandt moved this from TODO to In progress in GEL TODOs Sep 26, 2019

marefr assigned kylebrandt Sep 26, 2019
