Handling NaNs when calculating mean or sum #200

Closed
schajee opened this issue May 11, 2021 · 3 comments · Fixed by #210

schajee commented May 11, 2021

Describe the bug
When calculating the mean() or sum() of a DataFrame, NaNs are not ignored, so the output contains NaN for any column that has a missing value.

According to issue #144, this behavior was addressed from 0.2.0 onwards.

To Reproduce

  1. data = [[11, 20, 3], [null, 15, 6], [2, 30, 40], [2, 89, 78]]
  2. let df = new dfd.DataFrame(data)
  3. df.mean().print() or df.sum().print()

Current behavior

Mean    Sum
NaN     NaN
38.5    154
31.75   127

Expected behavior

Mean    Sum
5       15
38.5    154
31.75   127

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: 0.2.5

schajee commented May 11, 2021

Additionally, when I apply a custom function to the dataframe to filter out NaNs...

function mean_vals(x) {
    return x.dropna().mean()
}

df.apply({ axis: 1, callable: mean_vals })

I get...

Callable Error: You can only apply JavaScript functions on DataFrames when axis is not specified. This operation is applied on all element, and returns a DataFrame of the same shape.

The same call works when mean_vals does not use .dropna().

risenW commented May 30, 2021

@schajee Thanks for raising this issue.

I just realized that I fixed the issue in the Series class only.

In the case of a DataFrame, there are some concerns.
First, we compute the mean of a DataFrame using the TensorFlow.js (tfjs) .mean function. This function, like tfjs arithmetic operations in general, returns NaN if any value it reduces over is NaN or undefined, so a single missing value turns the result for its whole column into NaN.
For example:

const tf = require('@tensorflow/tfjs')

const a = tf.tensor([ [ 11, 20, 3 ],
                      [ NaN, 15, 6 ],
                      [ 2, 30, 40 ],
                      [ 2, 89, 78 ]])
a.print()
// reduce over axis 0, i.e. one mean per column
const b = a.mean(0)
b.print()
//outputs
// Tensor
//     [[11 , 20, 3 ],
//      [NaN, 15, 6 ],
//      [2  , 30, 40],
//      [2  , 89, 78]]
// Tensor
//     [NaN, 38.5, 31.75]
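
The NaN in the first column is not specific to tfjs: any arithmetic reduction that touches a NaN yields NaN. A minimal plain-JavaScript sketch of the same effect (the variable names are only illustrative, and no tfjs or danfo.js call is involved):

```javascript
// First column of the example data, with the missing value as NaN.
const firstColumn = [11, NaN, 2, 2];

// A naive sum: once NaN enters the accumulator it never leaves.
const naiveSum = firstColumn.reduce((acc, v) => acc + v, 0);

// The mean inherits the NaN from the sum.
const naiveMean = naiveSum / firstColumn.length;

console.log(naiveSum, naiveMean); // NaN NaN
```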

If we instead convert all NaNs to null before calculating the mean, tfjs internally sets every null value to 0. That skews averages such as the mean, because the zero is still counted as an observation when we divide by the total number of observations.

For example, if we do the following in tfjs:

const tf = require('@tensorflow/tfjs')

const a = tf.tensor([ [ 11, 20, 3 ],
                      [ null, 15, 6 ],
                      [ 2, 30, 40 ],
                      [ 2, 89, 78 ]])
a.print()
// reduce over axis 0, i.e. one mean per column
const b = a.mean(0)
b.print()
//outputs
// Tensor
//     [[11, 20, 3 ],
//      [0 , 15, 6 ],
//      [2 , 30, 40],
//      [2 , 89, 78]]
// Tensor
//     [3.75, 38.5, 31.75]

So there are two options: compute the mean while counting missing observations, or compute it without counting them.

To be consistent with the Series implementation and the pandas API in general, we'll remove all NaNs before computation. If that isn't the result you want, it is better to replace all missing values in the DataFrame before calling the mean or sum operation.
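
To make the two behaviours concrete, here is a rough sketch in plain JavaScript of column-wise sum and mean that skip missing values, plus the "replace first" alternative. The helpers `isMissing`, `columnValues`, `nanSum`, `nanMean` and `fillMissing` are hypothetical names made up for this illustration; they are not danfo.js or tfjs functions, and this is not necessarily how the fix in #210 is implemented.

```javascript
// Example data from the issue; null marks the missing observation.
const data = [
  [11, 20, 3],
  [null, 15, 6],
  [2, 30, 40],
  [2, 89, 78],
];

// Assumption: null, undefined and NaN all count as missing.
const isMissing = (v) => v === null || v === undefined || Number.isNaN(v);

// For each column, keep only the non-missing values.
function columnValues(rows) {
  const nCols = rows[0].length;
  return Array.from({ length: nCols }, (_, j) =>
    rows.map((row) => row[j]).filter((v) => !isMissing(v))
  );
}

// Column-wise sum over the non-missing values.
function nanSum(rows) {
  return columnValues(rows).map((col) => col.reduce((acc, v) => acc + v, 0));
}

// Column-wise mean, dividing by the number of non-missing values.
// A column with no valid values yields NaN.
function nanMean(rows) {
  return columnValues(rows).map((col) =>
    col.length === 0 ? NaN : col.reduce((acc, v) => acc + v, 0) / col.length
  );
}

// Alternative: replace missing values up front so they are counted.
const fillMissing = (rows, fill) =>
  rows.map((row) => row.map((v) => (isMissing(v) ? fill : v)));

console.log(nanSum(data));                  // [ 15, 154, 127 ]
console.log(nanMean(data));                 // [ 5, 38.5, 31.75 ]
console.log(nanMean(fillMissing(data, 0))); // [ 3.75, 38.5, 31.75 ]
```

Skipping the missing value divides the first column's sum of 15 by 3 observations, while filling with 0 divides it by 4, which is where the 5 versus 3.75 difference comes from.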

PS: I'll start a fix for this.

@risenW risenW self-assigned this May 30, 2021
risenW added a commit that referenced this issue May 30, 2021
Fixes #200 remove NaNs before computing mean or sum

risenW commented May 30, 2021

FIXED IN #210
