This repository has been archived by the owner on Nov 19, 2020. It is now read-only.

Define NaN/Null/Inf handling #6

Open
kylebrandt opened this issue Sep 6, 2019 · 1 comment

@kylebrandt
Contributor

kylebrandt commented Sep 6, 2019

Handling "edge case" values needs a reasonable design, or it will be difficult to build functions on top of the data we are working with.

For example, a time series in the current model consists of a Vector of *time.Time and a Vector of *float64 for the values (within a data.Frame). For now it is assumed those vectors are of equal length.

Given this, we have the following logical situations to handle (or prevent):

  • The time is Null
  • The float value is Null
  • The float value is NaN
  • The float value is -Inf or +Inf
  • The series has no values (Vectors of Len 0). (In terms of implementation, the vectors could be Null, but I currently think of that as an implementation issue and not a logical one.)

(Note: We could maybe have fewer options by making the particular case of a float not nullable, but integers and other types will still need to be pointers to be nullable (in Go), so it is probably easier in the big picture to keep it a pointer so it is like everything else. Unless we want to have mask arrays like Arrow and approach nulls differently.)
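To make the list above concrete, here is a minimal Go sketch of the two-vector model with a function that classifies each point. The `Series` type, the `PointKind` names, and `sample` are hypothetical and only for illustration; they are not the data.Frame API.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Series is a hypothetical stand-in for the two equal-length vectors
// described above; it is not the real data.Frame type.
type Series struct {
	Times  []*time.Time
	Values []*float64
}

// PointKind names the logical situations from the list above.
type PointKind int

const (
	Normal PointKind = iota
	NullTime
	NullValue
	NaNValue
	InfValue
)

var kindNames = [...]string{"Normal", "NullTime", "NullValue", "NaNValue", "InfValue"}

// Classify reports which situation the point at index i falls into.
// A null time takes precedence over the state of the value.
func Classify(s Series, i int) PointKind {
	switch {
	case s.Times[i] == nil:
		return NullTime
	case s.Values[i] == nil:
		return NullValue
	case math.IsNaN(*s.Values[i]):
		return NaNValue
	case math.IsInf(*s.Values[i], 0): // sign 0 matches both +Inf and -Inf
		return InfValue
	default:
		return Normal
	}
}

// IsEmpty covers the "no values" case; nil and zero-length vectors are treated alike.
func (s Series) IsEmpty() bool { return len(s.Values) == 0 }

// sample builds one point for each situation, in the order of the constants
// above except that the null-time point comes last.
func sample() Series {
	now := time.Now()
	v, nan, inf := 1.5, math.NaN(), math.Inf(-1)
	return Series{
		Times:  []*time.Time{&now, &now, &now, &now, nil},
		Values: []*float64{&v, nil, &nan, &inf, &v},
	}
}

func main() {
	s := sample()
	for i := range s.Values {
		fmt.Println(i, kindNames[Classify(s, i)])
	}
}
```

With pointers, "null" and "NaN" stay distinct states; a mask-array approach would track the null state separately from the values instead.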

Given these situations, they can generally be handled by (depending on the situation):

  • Keeping it as is (null stays null, NaN stays NaN, etc.)
  • Dropping the datapoint (from both vectors)
  • Replacing the value with a constant
  • Replacing the value with some relative value
  • Raising an Error
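As a rough sketch of how those five options could look in Go, applied to the value vector only: the `Strategy` names, `Apply`, and `ErrBadValue` are made up for illustration, and "previous good value" is just one possible choice of relative value.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Strategy mirrors the options listed above; the names are hypothetical.
type Strategy int

const (
	Keep Strategy = iota
	Drop
	ReplaceConst
	ReplacePrev // one possible "relative value": the last good value seen
	Fail
)

// ErrBadValue is returned by the Fail strategy.
var ErrBadValue = errors.New("null, NaN, or Inf value encountered")

func isBad(v *float64) bool {
	return v == nil || math.IsNaN(*v) || math.IsInf(*v, 0)
}

// Apply returns a new slice with bad values handled per the strategy.
// Only the value vector is shown; a real Drop would also have to remove
// the matching entry from the time vector.
func Apply(vals []*float64, s Strategy, fill float64) ([]*float64, error) {
	out := make([]*float64, 0, len(vals))
	var prev *float64
	for _, v := range vals {
		if !isBad(v) {
			out = append(out, v)
			prev = v
			continue
		}
		switch s {
		case Keep:
			out = append(out, v)
		case Drop:
			// Skip the point entirely.
		case ReplaceConst:
			f := fill
			out = append(out, &f)
		case ReplacePrev:
			out = append(out, prev) // stays nil if there is no earlier good value
		case Fail:
			return nil, ErrBadValue
		}
	}
	return out, nil
}

func main() {
	ten, twenty := 10.0, 20.0
	in := []*float64{nil, &ten, nil, &twenty}

	dropped, _ := Apply(in, Drop, 0)
	fmt.Println("after Drop:", len(dropped), "points")

	_, err := Apply(in, Fail, 0)
	fmt.Println("after Fail:", err)
}
```

Note that Drop can leave an empty series, and ReplacePrev still leaves a leading null in place, so the strategies interact with the other edge cases.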

In the context of data processing, it also matters when these values occur. For example, they could come from the data source query, or they could be the result of an operation done on that data.

For GEL I imagine each UI node will have a dropdown that will have some options on how to handle these. This will impact how that node handles these values.

We don't want more options than are necessary as it will just confuse us and the users. In general the options are somewhere on a scale of "strict" handling and "best effort" handling.

Some examples of when these values matter:

  • sorting a series by time or value when either time and/or values are null
  • taking the reduction of values that contain nan/null values
  • dividing by zero
  • doing series arithmetic (joins of two series by time) resulting in null values
  • resampling resulting in null values (to maybe be filled)
  • empty series arising from dropping values
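
For the divide-by-zero and reduction cases, a small Go sketch of what happens with no special handling at all (the `Divide` and `Sum` helpers are hypothetical): float64 division by zero follows IEEE 754 and produces Inf or NaN rather than an error, and a single NaN poisons a sum.

```go
package main

import (
	"fmt"
	"math"
)

// Divide performs plain float64 division with no special handling.
func Divide(a, b float64) float64 { return a / b }

// Sum adds the values with no special handling; one NaN makes the result NaN.
func Sum(vals []float64) float64 {
	total := 0.0
	for _, v := range vals {
		total += v
	}
	return total
}

func main() {
	fmt.Println(Divide(1, 0))  // +Inf
	fmt.Println(Divide(-1, 0)) // -Inf
	fmt.Println(Divide(0, 0))  // NaN

	fmt.Println(Sum([]float64{1, math.NaN(), 3})) // NaN
}
```

So values that were clean coming out of the data source query can still turn into Inf or NaN as the result of an operation.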
@kylebrandt kylebrandt added this to TODO in GEL TODOs Sep 6, 2019
@kylebrandt kylebrandt changed the title Define NaN/Null/Inf Handling Define NaN/Null/Inf handling Sep 6, 2019
@ryantxu
Member

ryantxu commented Sep 6, 2019

I don't have a strong opinion on this. The only comments I will add are that on the frontend, null and undefined are essentially equivalent.

The relevant frontend setting is the NullValuesMode -- this is used for graph display and reducers:

  Null >> do whatever you do with nulls
  Ignore >> just skip it, pretend the null was not in the list to begin with
  AsZero >> replace any null with 0

Did more testing... with reducers, when the value is Null, it behaves the same as Ignore. The mean value of [null, 10, null, 20, null] is 15.
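
A Go sketch of that reducer behavior, for comparison on the backend side. The `NullMode` type and `Mean` function are hypothetical names, not the frontend code; the only thing taken from the testing above is that Null behaves like Ignore for reducers.

```go
package main

import "fmt"

// NullMode mirrors the three frontend settings named above.
// The Go names are hypothetical.
type NullMode int

const (
	ModeNull NullMode = iota
	ModeIgnore
	ModeAsZero
)

// Mean averages vals under the given mode. ModeNull is treated like
// ModeIgnore, matching the reducer behavior described above. The second
// result is false when no values remain to average.
func Mean(vals []*float64, mode NullMode) (float64, bool) {
	sum, n := 0.0, 0
	for _, v := range vals {
		switch {
		case v != nil:
			sum += *v
			n++
		case mode == ModeAsZero:
			n++ // the null counts as a sample with value 0
		}
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	ten, twenty := 10.0, 20.0
	vals := []*float64{nil, &ten, nil, &twenty, nil}

	fmt.Println(Mean(vals, ModeIgnore)) // 15 true
	fmt.Println(Mean(vals, ModeAsZero)) // 6 true
}
```

The same input gives 15 or 6 depending on the mode, which is the kind of difference a per-node option would need to make visible.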

@kylebrandt kylebrandt moved this from TODO to In progress in GEL TODOs Sep 26, 2019