Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Fix pandas.Series.resample docstring #23197

Merged
merged 17 commits into from
Nov 10, 2018

Conversation

jmrr
Copy link
Contributor

@jmrr jmrr commented Oct 17, 2018

  • closes DOC: Fix the docstring of resample in pandas/core/generic.py #22894

  • tests added / passed:

    • ./scripts/validate_docstrings.py pandas.Series.resample
    • flake8 --doctests pandas/core/generic.py
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

  • whatsnew entry --> Needed?

  • Followed pandas docstring conventions and general reST.

  • Added deprecation directive on some parameters. Had some doubts but after investigation the recommended API changed after v0.18.0 for these parameters.

Please let me know if I can further improve this.

Thanks!

@pep8speaks
Copy link

Hello @jmrr! Thanks for submitting the PR.

@jmrr jmrr changed the title Fix docstring resample DOC: Fix pandas.Series.resample docstring Oct 19, 2018
@jschendel jschendel added the Docs label Oct 21, 2018
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments.

to the on or level keyword.
series. Object must have a datetime-like index (``DatetimeIndex``,
``PeriodIndex``, or ``TimedeltaIndex``), or pass datetime-like values
to the ``on`` or ``level`` keyword.

Parameters
----------
rule : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the Python types (str instead of string...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done. The module is full of string by the way. Should I create another issue for that? Do these values str, int get picked up by Sphinx?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to fix one whole docstring at a time, than fixing a problem in all docstrings.

Sphinx doesn't care about the types, but we try to be consistent, and will start validating the types soon (not sure if validate_docstrings.py already conmplains about these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, atomic improvements around each docstring make total sense. Pushing the amendments now.


.. deprecated:: 0.18.0
The new syntax is ``.resample(...).mean()``, or
``.resample(...).apply(<func>)``
axis : int, optional, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check other docstrings to see how the axis type is usually documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example is:

apply to each column (``axis=0`` or ``'index'``)
or to each row (``axis=1`` or ``'columns'``) or    

The problem is that for Series the axis is that along which the operation is computed, which in this case it's always along the rows (aka columnwise).

Maybe a more verbose but informative doc would be

 Which axis to use for up- or down-sampling. For ``Series`` this
 will default to 0, i.e. `along the rows`. Must be
 ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.

I have added this tentatively to the PR but let me know if something less verbose would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear. There is not much standardization of the descriptions, use whatever makes more sense to you. But the used type is somehow standard, that's what I think we should copy from another docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmrr can you make this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @datapythonista not your fault at all. I somewhat skipped the word type in your request and I thought you meant how axis is usually documented. Regarding the type and values there's no much standardization either. From set_axis() in pandas/core/generic.py I like the following:

{0 or 'index', 1 or 'columns'}, default 0

If we extrapolate this to Series then that line should be:

axis : {0 or 'index'}, default 0

I will make this change now, hope it sounds good.

@@ -7385,18 +7401,22 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
for all frequency offsets except for 'M', 'A', 'Q', 'BM',
'BA', 'BQ', and 'W' which all have a default of 'right'.
convention : {'start', 'end', 's', 'e'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some default values are missing, like in this case.

kind: {'timestamp', 'period'}, optional
For ``PeriodIndex`` only, controls whether to use the start or
end of ``rule``.
kind : {'timestamp', 'period'} optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing comma before optional

freq='A',
periods=2))
>>> s = pd.Series([1, 2], index=pd.period_range(
... '2012-01-01',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep the date in the previous line

@@ -7577,9 +7598,9 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,

>>> time = pd.date_range('1/1/2000', periods=5, freq='T')
>>> df2 = pd.DataFrame(data=10*[range(4)],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 issue, missing spaces around operator.

@codecov
Copy link

codecov bot commented Oct 27, 2018

Codecov Report

Merging #23197 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23197      +/-   ##
==========================================
- Coverage   92.25%   92.25%   -0.01%     
==========================================
  Files         161      161              
  Lines       51262    51277      +15     
==========================================
+ Hits        47292    47305      +13     
- Misses       3970     3972       +2
Flag Coverage Δ
#multiple 90.63% <ø> (-0.01%) ⬇️
#single 42.32% <ø> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.81% <ø> (ø) ⬆️
pandas/util/testing.py 86.71% <0%> (-0.08%) ⬇️
pandas/core/arrays/datetimelike.py 95.9% <0%> (-0.03%) ⬇️
pandas/core/frame.py 97.03% <0%> (ø) ⬆️
pandas/io/parquet.py 84.76% <0%> (+1.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efd1844...109e414. Read the comment docs.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, added some comments, mainly to the original code, not your changes.


.. deprecated:: 0.18.0
The new syntax is ``.resample(...).mean()``, or
``.resample(...).apply(<func>)``
axis : int, optional, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmrr can you make this change?

freq='A',
periods=2))
... freq='A',
... periods=2))
>>> s
2012 1
2013 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the following examples, when using resample, it once uses .head() and the other time it shows 12 months. I don't like any of them. I'd use something like resampling a year to quarters, or a quarter to months... So we can show all the data, and it's just few rows that readers can quickly check.

Copy link
Contributor Author

@jmrr jmrr Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created two examples, one sampling a year into quarters using convention='start' and another sampling quarters into months using the convention='end'. If it's too much we can remove one of them.

@@ -7563,7 +7584,8 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
For DataFrame objects, the keyword ``on`` can be used to specify the
column instead of the index for resampling.

>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
>>> df = pd.DataFrame(data=9 * [range(4)],
... columns=['a', 'b', 'c', 'd'])
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df.resample('3T', on='time').sum()
a b c d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should show the data before resampling, otherwise it's difficult to see what resample is doing. Also, we can probably have just two columns instead of 4, name them with something meaningful (like price and volume instead of a and b), and create the DataFrame in a nicer way (in a single command, with no range, may be just a dict).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've revamped this example. Pushing now altogether with the rest of the changes. Agree more meaningful examples make sure everyone understands. Just a bit conscious of being too verbose. But looking forward to your feedback.

pandas/core/generic.py Outdated Show resolved Hide resolved
>>> df2 = pd.DataFrame(data=10 * [range(4)],
... columns=['a', 'b', 'c', 'd'],
... index=pd.MultiIndex.from_product([time, [1, 2]])
... )
>>> df2.resample('3T', level=0).sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as in the previous example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, also with a different, more realistic approach. Please let me know your comments. I also struggled a bit with the formatting of long lines, opted for styling lines 7627-7632 using black (also compatible with PEP8) but let me know it breaks consistency and should be reverted to the old style. Pushing now.

@jmrr
Copy link
Contributor Author

jmrr commented Nov 7, 2018

Thanks for the great feedback @datapythonista. I've pushed some changes as per your comments, I've been a bit creative with some examples, let me know if I went too far :)

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments, and I think we're ready to merge.

.. deprecated:: 0.18.0
The new syntax is ``.resample(...).mean()``, or
``.resample(...).apply(<func>)``
axis : {0 or 'index'}, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

methods in generic.py are reused for both in Series and DataFrame. I guess for DataFrame we can also use axis='columns'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, I wasn't sure, I'll add it. Thinking about it now perhaps we could've added more examples with DataFrame.

axis : {0 or 'index'}, default 0
Which axis to use for up- or down-sampling. For ``Series`` this
will default to 0, i.e. `along the rows`. Must be
``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the must be. I don't think axis must be DatetimeIndex...

Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the must be. I don't think axis must be DatetimeIndex...

I took it from the main function description. As rule must be an offset string which are time-related magnitudes I decided to be explicit here. Also I ran some tests and got TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex.

In fact, following this logic, on should also have that sentence: Must be ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.

For instance:

>>> df = pd.DataFrame({'price': [10, 11, 9, 13, 14, 18, 17, 19], 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df['time'] = pd.date_range('01/01/2018', periods=8, freq='W')
>>> df
   price  volume       time
0     10      50 2018-01-07
1     11      60 2018-01-14
2      9      40 2018-01-21
3     13     100 2018-01-28
4     14      50 2018-02-04
5     18     100 2018-02-11
6     17      40 2018-02-18
7     19      50 2018-02-25
>>> df.resample('M', on='time').sum()
            price  volume
time                     
2018-01-31     43     250
2018-02-28     68     240

I hope I'm not missing anything here... resample is not a function I've used heavily personally.

Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.

Slip of keyboard, I meant normal textual quotes ("). Changing.

kind: {'timestamp', 'period'}, optional
convention : {'start', 'end', 's', 'e'}, default 'start'
For ``PeriodIndex`` only, controls whether to use the start or
end of ``rule``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single backticks, in both cases, double backticks are for code (including possible values of variables like NaN). But for classes, variables, parameters... we use single backticks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, wasn't clear from the docstring docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would None be with double backtick in examples like "default None"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I think it probably should, but we don't have it anywhere, so let's leave it without backticks in the default for now

loffset : timedelta, default None
Adjust the resampled time labels.
limit : int, default None
Maximum size gap when reindexing with ``fill_method``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

For a MultiIndex, level (name or number) to use for
resampling. Level must be datetime-like.
resampling. ``level`` must be datetime-like.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -7422,6 +7443,10 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
To learn more about the offset strings, please see `this link
<http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases>`__.

See Also
--------
groupby : Group by mapping, function, label, or list of labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is for Series and DataFrame, I'd add both here too. One of the links will be self-referencing, but the other will point to the equivalent of the other class, which is useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

to the on or level keyword.
series. Object must have a datetime-like index (``DatetimeIndex``,
``PeriodIndex``, or ``TimedeltaIndex``), or pass datetime-like values
to the ``on`` or ``level`` keyword.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single backticks.

rule : str
The offset string or object representing target conversion.
how : str
Method for down-/re-sampling, default to ‘mean’ for downsampling.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the possible values? If they are a fixed set, can we have {'mean',...}? If it's a problem, leave it like it is, as this will go away soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible values are any valid string function that does some form of aggregation. Interestingly, this would also work with the example dataframe I shared above:

>>> df
   price  volume       time
0     10      50 2018-01-07
1     11      60 2018-01-14
2      9      40 2018-01-21
3     13     100 2018-01-28
4     14      50 2018-02-04
5     18     100 2018-02-11
6     17      40 2018-02-18
7     19      50 2018-02-25

>>> df.resample('M', on='time', how={'price': min, 'volume': sum})
            price  volume
time                     
2018-01-31      9     250
2018-02-28     14     240
>>> df.resample('M', on='time', how=['min', 'sum'])
           price     volume     
             min sum    min  sum
time                            
2018-01-31     9  43     40  250
2018-02-28    14  68     40  240

Happy to change the description of the parameter to something like:

how : str, dict, list
       Method for down-/re-sampling, default to `mean` for downsampling. Accepted combinations are:
       * string function name
       * list of string function names
       * dict of column names -> string function names (or list of string function names) 

I got inspiration from pandas.DataFrame.aggregate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f the name needs to be a string with the function name, I assume it can't be any arbitrary value. But as it's deprecated, let's not spend time on this.

@jmrr
Copy link
Contributor Author

jmrr commented Nov 9, 2018

Thanks @datapythonista. I've addressed the changes but it's good you highlighted that as these docs are shared between Series and DataFrame some parameters might accept quite different values depending on the data structure.

Please see my comments and suggestions. Thanks!

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, @jbrockmendel do you mind taking a look, and merge if you're happy?

@datapythonista
Copy link
Member

@jmrr looks like the error in the CI is in this PR, a non-ascii character somewhere. Can you take a look and leave the CI green, so we can merge?

@jmrr
Copy link
Contributor Author

jmrr commented Nov 9, 2018

Sure @datapythonista let me have a look. Thanks for the review!

@jmrr
Copy link
Contributor Author

jmrr commented Nov 9, 2018

Looks like it's still failing, but it's an issue with CI setup. The branch is encountering a pytest installation that is too new.

@datapythonista
Copy link
Member

That's weird, I think that happened before and it was fixed. Can you merge master into your branch? May be it's based in the old failing version.


>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more space?

@jbrockmendel
Copy link
Member

Aside from one missing space, LGTM


>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})

@datapythonista
Copy link
Member

@jmrr can you run ./scripts/validate_docstrings.py pandas.Series.resample I think it should capture the error in the indentation (and other formatting problems).

Also, can you remove the -resample in the doctests part of ci/code_checks.sh? doctests should pass for this now, and we don't need to skip it anymore.

@jmrr
Copy link
Contributor Author

jmrr commented Nov 10, 2018

Thanks for the review @jbrockmendel and @datapythonista. Previous to the branch update with latest master, all tests passed when running ./scripts/validate_docstrings.py pandas.Series.resample.

Now after the update I encountered some errors, but they're flake8 ones. In particular:

2 Errors found:
        flake8 error: E302 expected 2 blank lines, found 0
        flake8 error: E305 expected 2 blank lines after class or function definition, found 0

I spent some time looking but I think this validation is trying to lint the example code as if this wasn't REPL code. For instance, the E302 disappears if I trick the linter with:

        >>>
        ...
        ... def custom_resampler(array_like):
        ...     return np.sum(array_like) + 5
        ...
        ...
        ...
        >>> series.resample('3T').apply(custom_resampler)

So there's two lines before a function.

Should I spend some time with this and commit the change or this is something that will get improved with time (e.g. by ignoring that error code).

@datapythonista
Copy link
Member

Not sure what's the best solution here, but the problem is in how we validate flake8, we surely don't want to leave blank lines that way in here.

We'll address that in a separate PR in any case.

Thanks for the work on this @jmrr

CC: @alimcmaster1

@datapythonista datapythonista merged commit b930303 into pandas-dev:master Nov 10, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018
…fixed

* upstream/master:
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
  ENH: Support for partition_cols in to_parquet (pandas-dev#23321)
  TST: Use intp as expected dtype in IntervalIndex indexing tests (pandas-dev#23609)
thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018
* upstream/master:
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: Fix the docstring of resample in pandas/core/generic.py
5 participants