Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: time grouper differs when passes as a list and as a scalar #17530

Closed
ruiann opened this issue Sep 14, 2017 · 12 comments · Fixed by #17587
Closed

BUG: time grouper differs when passes as a list and as a scalar #17530

ruiann opened this issue Sep 14, 2017 · 12 comments · Fixed by #17587
Labels
Milestone

Comments

@ruiann
Copy link
Contributor

ruiann commented Sep 14, 2017

Code Sample, a copy-pastable example if possible

data_frame.groupby(pd.TimeGrouper('D', convention='start', key=time_column)).count()
data_frame.groupby([pd.TimeGrouper('D', convention='start', key=time_column)]).count()

give below two different outputs

            location  price  profit  type
time                                     
2017-07-23         1      1       1     1
2017-07-24         0      0       0     0
2017-07-25         1      1       1     1
2017-07-26         1      1       1     1
2017-07-27         3      3       3     3
2017-07-28         6      6       6     6
            location  price  profit  type
time                                     
2017-07-23         1      1       1     1
2017-07-25         1      1       1     1
2017-07-26         1      1       1     1
2017-07-27         3      3       3     3
2017-07-28         6      6       6     6

Problem description

The origin data frame doesn't contain a record in 2017-07-24, you can see if I use time grouper with convention='start' lonely, the 2017-07-24 item has been filled, but if by group list, that item hasn't been filled

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Sep 14, 2017

need a copy-pastable example

@ruiann
Copy link
Contributor Author

ruiann commented Sep 15, 2017

df = pd.DataFrame(np.random.randn(3), index=pd.Series(['2017-08-09 13:32:23', '2017-08-11 23:23:15', '2017-08-11 22:23:15'], dtype='datetime64[ns]'))

grouper = pd.TimeGrouper('D')
grouped = df.groupby(grouper)
data = grouped.count()
grouped = df.groupby([grouper])
data = grouped.count()

you can see the result is different

Sorry that convention parameter doesn't influence this scenario, but the result is what you can see, there is no way to resample and group by another categorical columns in the same time I think?

@ruiann
Copy link
Contributor Author

ruiann commented Sep 15, 2017

    def _get_binner_for_grouping(self, obj):
        # return an ordering of the transformed group labels,
        # suitable for multi-grouping, e.g the labels for
        # the resampled intervals
        binner, grouper, obj = self._get_grouper(obj)

        l = []
        for key, group in grouper.get_iterator(self.ax):
            l.extend([key] * len(group))
        pdb.set_trace()
        if isinstance(self.ax, PeriodIndex):
            grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
        else:
            # resampling causes duplicated values, specifying freq is invalid
            grouper = binner.__class__(l, name=binner.name)

        # since we may have had to sort
        # may need to reorder groups here
        if self.indexer is not None:
            indexer = self.indexer.argsort(kind='quicksort')
            grouper = grouper.take(indexer)
        return grouper

the second grouper generated by this method, while the first way generate grouper by

binner, grouper, obj = self._get_grouper(obj)

it seems that the second way get the same grouper of TimeGrouper as the first way and then sample it once more, which only uses the existed time slices in group axis by

l.extend([key] * len(group))

and then do some other operations

@ruiann
Copy link
Contributor Author

ruiann commented Sep 15, 2017

So what is the expected behavior? If follow the SQL rule, the missing time slices shouldn't be filled for TimeGrouper, only the resample method can get a complete time axis index. As the resample method call the TimeGrouper to implement the time axis index, may a new parameter needed to control TimeGrouper's behavior?

My target is to group by a categorical column and then resample to get a complete statistic in time dimension, I cannot do this in a groupby operation.

@jreback
Copy link
Contributor

jreback commented Sep 15, 2017

ok, this is a duplicate of #16746, but I like your example better, slightly formatted:

should all result the same [6] and [8] are correct, [7] is the bug.

In [5]: df = pd.DataFrame(np.random.randn(3), index=pd.Series(['2017-08-09 13:32:23', '2017-08-11 23:23:15', '2017-08-11 22:23:15'], dtype='datetime64[ns]'))
   ...: 
   ...: grouper = pd.TimeGrouper('D')
   ...: grouped1 = df.groupby(grouper)
   ...: data1 = grouped1.count()
   ...: grouped2 = df.groupby([grouper])
   ...: data2 = grouped2.count()
   ...: 

In [6]: data1
Out[6]: 
            0
2017-08-09  1
2017-08-10  0
2017-08-11  2

In [7]: data2
Out[7]: 
            0
2017-08-09  1
2017-08-11  2

In [8]: df.resample('D').count()
Out[8]: 
            0
2017-08-09  1
2017-08-10  0
2017-08-11  2

Should be straightforward to fix if you'd like to dig in.

@jreback jreback added this to the Next Major Release milestone Sep 15, 2017
@jreback jreback changed the title time grouper with convention doesn't work for groupby list situtation BUG: time grouper differs when passes as a list and as a scalar Sep 15, 2017
@ruiann
Copy link
Contributor Author

ruiann commented Sep 15, 2017

@jreback I think 7 is also correct if follow the SQL rule, missed data can not be queried, maybe a new parameter needed? After hours debug, I think this situation is very complex and hard to fix, when by scalar TimeGrouper returns a BinGrouper, while by list returns a BaseGrouper that may have other grouping. I'll take some effort in this bug but can't promise to fix it out.

@jreback
Copy link
Contributor

jreback commented Sep 15, 2017

you are diving too deep. look more around here

[7] is not a resample, rather a straight groupby by date

In [42]: df.groupby(df.index.normalize()).count()
Out[42]: 
            0
2017-08-09  1
2017-08-11  2

you could add this example to the groupby docs, but another parameter is not needed.

@ruiann
Copy link
Contributor Author

ruiann commented Sep 16, 2017

I think the situation is complex than your thought.
For list groupers, groupings are built for each grouper to make a BaseGrouper, and Grouping class only get binner by _get_binner_for_grouping of each grouper, the group_info of BaseGrouper property generated by the labels and group_index of its groupings, and each grouping get them by the binner, which is the label list of each row.

# the labels and group_index of Grouping class, with them the BaseGrouper generate group_info
    @property
    def labels(self):
        if self._labels is None:
            self._make_labels()
        return self._labels

    @property
    def group_index(self):
        if self._group_index is None:
            self._make_labels()
        return self._group_index

    def _make_labels(self):
        if self._labels is None or self._group_index is None:
            pdb.set_trace()
            # self.grouper is the label of each row
            labels, uniques = algorithms.factorize(
                self.grouper, sort=self.sort)
            uniques = Index(uniques, name=self.name)
            self._labels = labels
            self._group_index = uniques

But for BinGrouper, the group_info property generated in different way, which directly map the labels and group_index.

    def group_info(self):
        pdb.set_trace()
        ngroups = self.ngroups
        obs_group_ids = np.arange(ngroups)
        rep = np.diff(np.r_[0, self.bins])

        rep = _ensure_platform_int(rep)
        if ngroups == len(self.bins):
            comp_ids = np.repeat(np.arange(ngroups), rep)
        else:
            comp_ids = np.repeat(np.r_[-1, np.arange(ngroups)], rep)

        return comp_ids.astype('int64', copy=False), \
            obs_group_ids.astype('int64', copy=False), ngroups

So the bug occurs, the binner can only provide the labels of each row in selected axis, and then unique them to get unique labels and group_index, so the missed date time cannot be generated.

I think the way to fix this bug is to provide group_info in _get_binner_for_grouping of grouper to get more info? I think this logic change may be bigger? I'll take further look.

@jreback
Copy link
Contributor

jreback commented Sep 17, 2017

@ruiann again, you can simply have a len-1 list-like grouper case turn into the same case as the scalar passed case. its just going down the wrong path.

@ruiann
Copy link
Contributor Author

ruiann commented Sep 17, 2017

Then how about the result of ['categorical_axis', TimeGrouper('D', key='time_axis')]? It can not get all time slices for each category item.
In fact I think the

    def _get_binner_for_grouping(self, obj):
        # return an ordering of the transformed group labels,
        # suitable for multi-grouping, e.g the labels for
        # the resampled intervals
        binner, grouper, obj = self._get_grouper(obj)

        l = []
        for key, group in grouper.get_iterator(self.ax):
            l.extend([key] * len(group))
        pdb.set_trace()
        if isinstance(self.ax, PeriodIndex):
            grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
        else:
            # resampling causes duplicated values, specifying freq is invalid
            grouper = binner.__class__(l, name=binner.name)

        # since we may have had to sort
        # may need to reorder groups here
        if self.indexer is not None:
            indexer = self.indexer.argsort(kind='quicksort')
            grouper = grouper.take(indexer)
        return grouper

method of BaseGrouper (and BinGrouper) class works wrong, the group_info can be get from the BaseGrouper/BinGrouper instance generated by _get_grouper method, but not to read and sort them again item by item.

I'm trying to return the BaseGrouper/BinGrouper instance generated by _get_grouper method if the Grouping instance is given by Grouper instance, and get labels, group_index from the group_info, but there are still some bugs reported by test case, I'm now trying to fix them.

At last, I think groupby related logics need some refactor (I mean the name, alias, call stack sometimes confuse me, also some duplicated calls happened, maybe convert them all to list groupers could be better, with which only one implement need to deal).

@jreback
Copy link
Contributor

jreback commented Sep 17, 2017

At last, I think groupby related logics need some refactor (I mean the name, alias, call stack sometimes confuse me, also some duplicated calls happened, maybe convert them all to list groupers could be better, with which only one implement need to deal).

If you want to attempt that, sure. Pls do so w/o adding any features / fixing any bugs, (though you can add xfailing tests). Try to make things pass everything that currently works, then in different commits / PR add things.

@ruiann
Copy link
Contributor Author

ruiann commented Sep 19, 2017

@jreback
I've fixed this bug in this PR

The fix is to allow the Grouping to use grouper generated by _get_grouper of passed TimeGrouper\OtherGrouper item of the list directly, and provide indexer for BinGrouper to reorder them correctly for the values of the grouped axis

There are some PEP8 errors, which not caused by my change, and hard to change 😢

Also the Contributing guide doesn't work for me
When I try to use conda to build the development env

conda create -n pandas_dev --file ci/requirements_dev.txt

shell said

Fetching package metadata .........

PackageNotFoundError: Package missing in current osx-64 channels: 
  - moto

Close matches found; did you mean one of these?

    moto: boto

When I build env by setup.py and run all tests

pytest pandas

same bug occurs in some test cases

pandas/tests/io/test_excel.py:2403: in <module>
    pytest.param('xlwt',
E   AttributeError: 'module' object has no attribute 'param'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants