WIP/ENH: add weights kw to numeric aggregation functions #15039

Closed
jreback wants to merge 3 commits

@jreback jreback commented Jan 2, 2017

closes #10030
alt to #15031

In [5]: df = DataFrame({'A': [1, 1, 2, 2],
   ...:                'B': [1, 2, 3, 4]})

In [6]: df
Out[6]: 
   A  B
0  1  1
1  1  2
2  2  3
3  2  4

In [7]: df.mean()
Out[7]: 
A    1.5
B    2.5
dtype: float64

In [8]: df.mean(weights='A')
Out[8]: 
A    0.416667
B    0.708333
dtype: float64
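
The values in Out[8] can be reproduced with released pandas. This is a minimal sketch, assuming (inferred from the output, not confirmed from the diff) that the branch normalizes the weights to sum to 1, multiplies each column by them row-wise, and then takes the ordinary mean:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})

# Assumption: weights are normalized to sum to 1 before use.
w = df["A"] / df["A"].sum()

# Scale every column row-wise by the normalized weights, then take the plain mean.
reweighted_mean = df.mul(w, axis=0).mean()
print(reweighted_mean)
```

Under that assumption the result is 5/12 for A and 17/24 for B, matching the 0.416667 and 0.708333 shown above.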

New signatures

In [9]: Series.mean?
Signature: Series.mean(self, axis=None, skipna=None, level=None, weights=None, numeric_only=None, **kwargs)
Docstring:
Return the mean of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a scalar
weights : str or ndarray-like, optional
    Default 'None' results in equal probability weighting.

    If passed a Series, will align with target object on index.
    Index values in weights not found in the target object
    will be ignored and index values in the target object
    not in weights will be assigned weights of zero.

    If called on a DataFrame, will accept the name of a column
    when axis = 0.

    Unless weights are a Series, weights must be same length
    as axis of the target object.

    If weights do not sum to 1, they will be normalized to sum to 1.

    Missing values in the weights column will be treated as zero.
    inf and -inf values not allowed.
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for Series.

Returns
-------
mean : scalar or Series (if level specified)
File:      ~/pandas/pandas/core/generic.py
Type:      function
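
The rules in the `weights` docstring can be sketched as a standalone helper. `prepare_weights` is a hypothetical name, not part of pandas or of this branch, and the zero-sum check is an added assumption the docstring does not mention:

```python
import numpy as np
import pandas as pd


def prepare_weights(obj, weights):
    """Hypothetical helper mirroring the rules in the docstring above."""
    if isinstance(weights, str):
        # Column name: only meaningful for a DataFrame along axis=0.
        weights = obj[weights]
    if isinstance(weights, pd.Series):
        # Align on the index: labels missing from `weights` become NaN,
        # labels that are not in `obj` are dropped.
        weights = weights.reindex(obj.index)
    else:
        # Anything else must already match the length of the axis;
        # pandas raises ValueError here on a length mismatch.
        weights = pd.Series(np.asarray(weights, dtype="float64"), index=obj.index)
    weights = weights.astype("float64")
    if np.isinf(weights).any():
        raise ValueError("weights may not contain inf or -inf")
    weights = weights.fillna(0.0)  # missing weights count as zero
    total = weights.sum()
    if total == 0:
        # Assumption: refuse to normalize an all-zero weight vector.
        raise ValueError("weights sum to zero")
    return weights / total  # normalize so the weights sum to 1
```

For example, `prepare_weights(pd.Series([10., 20., 30.], index=list("abc")), pd.Series({"a": 1., "b": 3., "z": 5.}))` ignores the label `z`, gives `c` a weight of zero, and returns weights 0.25, 0.75, 0.0.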

@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 2, 2017

jreback commented Jan 2, 2017

The biggest API question: should

df.mean(weights='A') exclude 'A' from the result set?

Similarly (not implemented ATM), should

df.groupby('B').mean(weights='A') exclude 'A'?

e.g.

# remove the column if it's referenced by name like this
In [2]: df.mean(weights='A')
Out[2]: 
B    0.708333
dtype: float64

# we don't remove it if it's a function / array / Series (this is how groupby works)
In [3]: df.mean(weights=df.A)
Out[3]: 
A    0.416667
B    0.708333
dtype: float64
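
The two behaviours shown in [2] and [3] could be emulated outside the branch roughly as follows. `reweighted_mean` is a hypothetical stand-in for `df.mean(weights=...)`, and the arithmetic (normalize, multiply, plain mean) is inferred from the outputs in this thread rather than taken from the diff:

```python
import pandas as pd


def reweighted_mean(df, weights):
    """Hypothetical stand-in for df.mean(weights=...) as proposed above."""
    if isinstance(weights, str):
        # Referenced by name: use the column as weights and drop it from the result.
        w = df[weights]
        values = df.drop(columns=weights)
    else:
        # Passed as data: keep every column, as groupby does with array keys.
        w = pd.Series(weights, index=df.index)
        values = df
    w = w / w.sum()  # assumption: weights are normalized to sum to 1
    return values.mul(w, axis=0).mean()


df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
by_name = reweighted_mean(df, "A")      # only B in the result
by_data = reweighted_mean(df, df["A"])  # both A and B in the result
```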

codecov-io commented Jan 2, 2017

Current coverage is 84.77% (diff: 92.70%)

No coverage report found for master at 74e20a0.

Powered by Codecov. Last update 74e20a0...ca423c1

jreback commented Jan 2, 2017

In [1]: df = DataFrame({'A': [1, 2, 3, 4],
   ...:                'B': [1, 2, 3, 4], 'C':[1, 1, 2, 3]})

In [2]: df
Out[2]: 
   A  B  C
0  1  1  1
1  2  2  1
2  3  3  2
3  4  4  3

In [3]: df.groupby('C').B.mean(weights='A')
Out[3]: 
C
1    0.25
2    0.90
3    1.60
Name: B, dtype: float64

In [4]: df.groupby('C').B.mean()
Out[4]: 
C
1    1.5
2    3.0
3    4.0
Name: B, dtype: float64

In [5]: df.groupby('C').mean(weights='A')
Out[5]: 
      A     B
C            
1  0.25  0.25
2  0.90  0.90
3  1.60  1.60

In [6]: df.mean(weights='A')
Out[6]: 
B    0.750
C    0.525
dtype: float64

groupby is almost entirely working (the exception is [5], which still includes the weighting column)
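
The grouped numbers in Out[3] are consistent with normalizing the weights over the whole frame, not within each group. A minimal sketch with released pandas, under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [1, 1, 2, 3]})

# Assumption: weights are normalized over the whole frame, not per group.
w = df["A"] / df["A"].sum()

# Reweight B, then take an ordinary per-group mean.
per_group = (df["B"] * w).groupby(df["C"]).mean()
print(per_group)
```

This gives 0.25, 0.90 and 1.60 for the three groups. A conventional weighted mean computed within the group C == 1 would instead be (1*1 + 2*2) / (1 + 2), roughly 1.67.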

chris-b1 commented Jan 3, 2017

I don't feel strongly either way, but for consideration - R doesn't seem to normalize weights by default, nor does numpy.

In [3]: pd.Series([1, 2, 3]).mean(weights=[1, 1, 2])
Out[3]: 0.75

In [4]: np.average([1, 2, 3], weights=[1, 1, 2])
Out[4]: 2.25

R

> weighted.mean(x=c(1, 2, 3), w=c(1, 1, 2))
[1] 2.25
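
The gap between 0.75 and 2.25 can be reproduced with NumPy alone. `np.average` computes sum(w * x) / sum(w); the 0.75 appears to come from scaling the values by the normalized weights and then also dividing by the number of elements (an inference from the outputs above, not from the diff):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 2.0])

# np.average (and R's weighted.mean): sum(w * x) / sum(w)
conventional = np.average(x, weights=w)

# Assumed branch behaviour: scale x by normalized weights, then take the plain mean.
reweighted = (x * (w / w.sum())).mean()

print(conventional, reweighted)
```

The two results differ by exactly a factor of `len(x)`.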

jreback commented Jan 3, 2017

so if we are going to add 'options' to weighting, then .weightby is much more attractive, e.g.

df.weightby(weights=...., weights_type=...., normalize=True).mean()

chris-b1 commented Jan 4, 2017

For std and friends, there is some code in statsmodels that might be a useful reference:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py

I'm assuming the current implementation (re-weighting the values) isn't what most people would expect? Although maybe that gets into the different kinds of weights, which I don't fully understand.
#10030 (comment)

In [41]: s = pd.Series([1, 1, 2])

In [42]: from statsmodels.stats.weightstats import DescrStatsW

In [43]: stats = DescrStatsW(s.values, weights=[2, 2, 6], ddof=0)

In [44]: stats.std
Out[44]: 0.4898979485566356

In [45]: s.std(weights=[2, 2, 6], ddof=0)
Out[45]: 0.47140452079103168
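
Both numbers can be reproduced with NumPy alone, without statsmodels or this branch. This sketch treats the weights as frequency-style weights with ddof=0, and assumes the branch reweights the values before taking a plain std:

```python
import numpy as np

x = np.array([1.0, 1.0, 2.0])
w = np.array([2.0, 2.0, 6.0])

# Weighted standard deviation with ddof=0:
# weighted mean first, then the weighted mean of squared deviations.
mean_w = np.average(x, weights=w)
std_w = np.sqrt(np.average((x - mean_w) ** 2, weights=w))

# Assumed branch behaviour: scale the values by normalized weights, then plain std.
std_reweighted = (x * (w / w.sum())).std(ddof=0)

print(std_w, std_reweighted)
```

The first value agrees with the `DescrStatsW` result in Out[44], the second with Out[45].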

jreback commented Jan 4, 2017

another possibility is to allow weights to take an object like this:

df.mean(weights=pd.Weightby(weights=..., aweights=...., normalize=True))

making this similar to what pd.Grouper does, allowing us to encode multiple parameters in the single argument.
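
A rough illustration of the idea of packing several weighting options into one argument. `WeightSpec` and `weighted_mean` are hypothetical names, not pandas API, and the meaning given to `normalize=False` (use the weights exactly as supplied) is an assumption made for this sketch:

```python
from dataclasses import dataclass
from typing import Any

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class WeightSpec:
    """Hypothetical bundle of weighting options; not a pandas class."""
    weights: Any
    normalize: bool = True


def weighted_mean(s: pd.Series, spec: WeightSpec) -> float:
    x = s.to_numpy(dtype="float64")
    w = np.asarray(spec.weights, dtype="float64")
    if spec.normalize:
        # Conventional weighted mean: sum(w * x) / sum(w).
        return float(np.average(x, weights=w))
    # Assumption: the caller's weights are used as given, with no rescaling.
    return float((x * w).sum())


s = pd.Series([1, 2, 3])
normalized = weighted_mean(s, WeightSpec([1, 1, 2]))
as_given = weighted_mean(s, WeightSpec([0.25, 0.25, 0.5], normalize=False))
```

The single `spec` argument plays the role that `pd.Grouper` plays for `groupby`: new options can be added to the spec without changing the signature of every aggregation.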

jreback commented Feb 27, 2017

closing for now

Heuertje commented Aug 8, 2018

Would be nice to get this opened again!

@MaxGhenis

Agreed this would be great to have, even if just on a Series to start, since that skips the questions around DataFrames and groupby.

I've added a few weighted functions to my microdf package, e.g. weighted_mean(df, 'val', 'weight'), but would love to retire them in favor of weights in Series operations.
