Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC/ENH] How to best handle repeating docstrings #9414

Closed
jklymak opened this issue Oct 15, 2017 · 28 comments
Closed

[DOC/ENH] How to best handle repeating docstrings #9414

jklymak opened this issue Oct 15, 2017 · 28 comments

Comments

@jklymak
Copy link
Member

jklymak commented Oct 15, 2017

Issue

In #9324 I refactored the docstrings for legend.Legend.__init__, figure.Figure.legend and axes.Axes.legend. In particular, all three have the same kwargs because the two artist versions passthrough kwargs to legend. All three had different and contradictory/incomplete docstrings.

To solve this, I used string interpolation. The developer docs say:

Since matplotlib uses a lot of pass-through kwargs, e.g., in every function that creates 
a line (plot(), semilogx(), semilogy(), etc…), it can be difficult for the new user to know which
kwargs are supported. Matplotlib uses a docstring interpolation scheme to support 
documentation of every function that takes a **kwargs. 

and the code has numerous places where this is used. I used a simple form of this that hard-coded the kwarg doc string and then invoked it in the three docstrings:

'''
Parameters
-----------

%(_legend_kw_doc)s
'''

Reviewers didn't like this (@NelleV, @anntzer, @tacaswell ) because apparently this sort of interpolation is fragile and has caused issues with the docs in the past:

#7095, or http://matplotlib.org/2.0.0/api/_as_gen/matplotlib.axes.Axes.semilogy.html#matplotlib.axes.Axes.semilogy, are example of issues caused by uncontrolled interpolation.

For #9324 it was decided that we should just cut-and-paste the kwarg docs into the three docstrings. I also implemented a test in test_legend.py that insured that the docstrings were the same to stop future drift of the docstrings.

Proper solution?

So, what is the proper solution to the documentation problem of repeating passthrough kwargs? I didn't see that the substitution decorator was so bad, but I've not seen all the problems.

Cut-and-paste:

perhaps with unit tests to make sure the docstrings don't drift.

Structure to hold docstrings:

@anntzer suggested:

I think we should either use a proper ("structured") interpolation machinery or none at all.

i.e., he suggested a "structured" docstring whereby each section is an OrderedDict entry, and each kwarg is an OrderedDict entry. This allows referencing of this information in other docstrings. See below.

Functions to hold docstrings:

@efiring suggested what is also essentially suggested in MEP10 which is that specialized documentation functions be used that have the repeated kwargs, and then the main method's documentation references this.

Documentation issue

I think the developer docs should make the deprecation of the interpolation machinery explicit if it truly is deprecated.

Affected functions:

  • figure.legend = axes.legend = legend.Legend
  • colorbar?
  • text properties
@anntzer
Copy link
Contributor

anntzer commented Oct 15, 2017

https://github.com/anntzer/structured-docstrings/blob/master/example.py is basically the idea I had in mind: write each section as separate strings, have some machinery to assemble the parts, and keep the parts available separately. Note that I explicitly don't allow mixing this style with a preexisting docstring -- it should be much more robust to have the entire assembly process under control. Obviously support for all numpydoc sections should be added.

Right now the implementation (in structdoc.py) does some weird tricks because it delays the rendering until someone actually uses the docstring (to shave off some time at import). Depending on how fast things are, we may be able to just render everything eagerly in the decorator and not worry about the magic.

In fact the original approach I thought of was to add the possibility to write some docstrings "normally" and have a minimal parser that would allow one to retrieve the sections individually, but that seems far too complicated/brittle to be worth it (and parsing would make it lazy-rendering more necessary).

Just to clarify, in the case of legend/figlegend/etc., this you would document the parameters on one of the functions (e.g. Axes.legend), and on the others you would just pass the same parameters (parameters=Axes.legend.__doc__.parameters -- of course you can use normal expressions, e.g. [(k, Axes.legend.__doc__.parameters[k]) for k in [...]] if you only want to reuse certain keys).

@NelleV
Copy link
Member

NelleV commented Oct 15, 2017

or we could simply stop hacking our docstring, and make them simpler to contribute too ,and simpler to maintain.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

@NelleV if you pass through arguments and want to document them in multiple places, I’d argue that it’s not easy to maintain if you have to remember to edit the other X docstrings.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

@anntzer I don't get much of a vote, but I think a system like what you are proposing is good. I think making sure the decorators are sort of self-documenting is great. i.e. so long as the arguments point towards the file the original doc came from it should allay the concern that its hard to follow. I'd also hope the result show up in IDEs (which I assume they do).

@NelleV
Copy link
Member

NelleV commented Oct 15, 2017

@jklymak Somehow scikit-learn manages this just fine, despite literally having identical API for all of the estimators, and between each of the functions and their associated class. (Matplotlib's case is harder, as the functions often have slightly different parameters, that take different values, while scikit-learn has extremely consistent API).

This problem of having identical parameters over many functions is far from being unique to Matplotlib, yet many projects manages just fine without docstring hacking. Sure, there is a risk of the docstrings evolving separately, but is it worse than having hard to detect bugs, cryptic error messages or worse, no contributors to the documentation?

Docstring hacking gives the illusion to ease maintenance. In practice, so far, it has caused more headaches than any of the advantages you bring forward. It also breaks many of the points brought up by the Zen of Python:

The zen of Python

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.

@efiring
Copy link
Member

efiring commented Oct 15, 2017

Ideally, sorting out this doc strategy problem might warrant a full MEP, with clear statements of goals and viable alternatives. But to begin here, the goals might be:

  1. Generate docstrings that are as helpful to the user as possible, regardless of the method used to view them: directly in the source code, in an interactive ipython session, e.g. using the trailing question mark, and in the online sphinx-generated docs.

  2. Avoid making our codebase more complex and convoluted than it already is.

  3. Related to (2), make it as easy as possible to improve the documentation.

  4. Observe the DRY principle when possible.

I think we should approach this first by looking carefully at criterion 1: what should the user see with each of the methods of documentation access? One of the big problems here is clutter: Too Much Information (or pseudo-information, like my favorite example, the infamous "agg_filter: unknown" in http://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D).

Does everything have to go directly into a single docstring? I don't think so. Instead of interpolating the property kwargs, can we have something like "for additional keyword arguments, see Line2D_properties"? In other words, use some dedicated functions that serve only for documentation, some of which can be used for more than one function or method. In the actual function or method docstring, put the most generally useful subset of all possible information, with references to the remainder.

@anntzer, I understand the logic of your suggested example, but I think it makes the actual code unacceptably ugly and adds unnecessary complexity.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

@efiring My concern with referring to other functions is that I can't access those directly i.e.% ax.text? gives me

Other parameters
----------------
kwargs : `~matplotlib.text.Text` properties.
    Other miscellaneous text parameters.

which now I must google, unless I am away from the internet (on a ship), in which case I am out of luck...

@efiring
Copy link
Member

efiring commented Oct 15, 2017

I don't understand the problem. If the docstring for ax.text includes "For a table of all available keyword arguments see ~matplotlib.text.Text_kwargs." then if you are in ipython, you invoke matplotlib.text.Text_kwargs?; if you are looking at the docs on the web, you click the link; if you are looking at the source code in an editor, you load Text_kwargs.py (or whatever we decide to call it).

@efiring
Copy link
Member

efiring commented Oct 15, 2017

The documentation-only functions I am suggesting could also print their own docstrings if invoked as functions.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

@efiring I understand now. That seems like a reasonable route as well!

@anntzer
Copy link
Contributor

anntzer commented Oct 15, 2017

I am fairly strongly against doc-only functions -- I consider them to be a subversion of the docstring system... (but agree with @efiring that as things are right now, matplotlib.text.Text? should be "enough" (or should be made to be enough -- i.e. the "agg_filter: unknown" should be fixed)).

I believe that there's definitely some value in having all the relevant parameters directly available (without having to follow a link, either hypertext or via IPython). Note that numpydoc defines an "other parameters" section that can be used advantageously.

The syntax of the proof of concept may be a bit ugly (because I hacked it yesterday evening...) but note that it can trivially be converted into e.g.

@process_docstring(
    summary="""
    Some great docstring.""",
    parameters=["""
    foo : int
        The description of foo.""", """
    bar : str
        The description of bar."""
    ],
)

(getting the parameter name from, well, the first word)
at which point I would seriously consider the source to be just as legible as a plain docstring would be (especially with syntax highlighting, that can highlight the keywords ("summary", "parameters", ...) in a different color.

While I agree that the previous system was brittle (a definite point against it), I am also curious about @NelleV's experience about the difficulty in contributing to the docstrings. Would you not consider an example as the one above essentially self-documenting? What kind of issues were raised to you? Obviously you need to know what the numpydoc sections are, but this is just as much a problem with plain docstrings.

@NelleV
Copy link
Member

NelleV commented Oct 15, 2017

@anntzer I should be able to contribute docstring patches to any project in less than 10minutes. That means following conventions and using standard tools. You are suggesting editing code to contribute a docstring, using decorators, a concept that a lot of (non-scientific) developers struggle with.

Most of the headaches come from all the docstring hacking we have: finding where to edit the docstring, indentation problems because of interpolation, setting up a dev environment and building the documentation (which took a couple of weeks to fix on @choldgraf's computer last year). These are the "usual" ones. The documentation often breaks in unusual ways as well.

FYI, I've contributed code patches, that included bug fixes, tests and documentation to most major scientific python packages in less time than it takes me to edit a docstring on Matplotlib. And I know all the major headaches of Matplotlib way better than I know any other scientific python package (except maybe sklearn).

@NelleV
Copy link
Member

NelleV commented Oct 15, 2017

@tacaswell wrote a MEP a very long time ago that tackles most of the issue we have with docstrings, including the interpolation and duplication one (and the makefile, and other problem we have).
Someone should probably have a look at it and update it, and the MEP should be discussed and accepted in order to avoid re discussing the problems and solutions over again.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

@NelleV MEP10?

@NelleV
Copy link
Member

NelleV commented Oct 15, 2017

@jklymak Yep, that one. A lot of the items have been accomplished without even realizing it by @choldgraf and myself during the last year.

@jklymak
Copy link
Member Author

jklymak commented Oct 15, 2017

FWIW MEP10 specifically outlines the soln suggested by @efiring.

So as an aside I’m not sure about how we are meant to discuss MEPs.

@efiring
Copy link
Member

efiring commented Oct 16, 2017

Massive cut/paste with subsequent synchronization via hand-editing each instance would be a major violation of DRY. I don't think it would end well. We should look for ways of factoring out common elements that are relatively simple, and no more intrusive than necessary. Hence my suggestion of using references to documentation functions. I don't understand the "subversion" objection. It might not have been exactly what Guido had in mind when he designed the language, but I suspect he would have no objection to it. I think this approach is easily understood by users and by coders.

I'm not ruling out documentation-generation machinery, and in fact I am using such machinery in another project, but I am not convinced that it is needed, or the most appropriate solution, for matplotlib.

@anntzer
Copy link
Contributor

anntzer commented Oct 16, 2017

Looking at this again, it should always be possible to tuck the kwargs on the relevant object itself (e.g., Text.__doc__ or Text.__init__.__doc__ -- which is already what is done right now), which I am fine with if we decide not to choose the generation route (I'm only mildly annoyed by the linking to another docstring, but don't think it's a big issue, and as mentioned in the thread, there are also advantages in being concise).

Basically, there should be a "natural" place (a module docstring, a class docstring, a constructor docstring, a function docstring, ...) to document the signature of a callable, and I feel (sorry, nothing objective here) that something's "wrong" if we have to create a mock object just to host that signature.

On the other hand, I definitely agree with @efiring that massive cut/paste as in the case of legend/figlegend should be avoided as much as possible.

@jklymak
Copy link
Member Author

jklymak commented Oct 16, 2017

WRT a concise docstring that links, 98% of the time I am looking at the docstring (usually in Jupyter notebook), I’m doing so to figure out the available kwargs. And usually I’m doing that by tab completion on an artist I’ve created. HAving to leave the notebook and go to the interpreter console and query another object is not very convenient. It’s really nice when the kwargs are available from the method I’m trying to use.

@anntzer anntzer mentioned this issue Oct 17, 2017
6 tasks
@joelostblom
Copy link
Contributor

Thanks for having this discussion!
My usage scenario is similar to what @jklymak described in his most recent comment, and I find it helpful when the **kwargs are listed in the docstring I am reading, rather than referring to another docstring. In my opinion, this makes the documentation easier to access, especially for less experienced Matplotlib users. I don't quite understand the costs for the different methods for implementing this, so I can't speak to that.

@dopplershift
Copy link
Contributor

My $0.02 is to nuke completely the docstring generation code we have now, and not replace it with a different kind of generation code. In addition to the complexity for user contributions, generated docstrings are unsupported by pretty much every other tool out there that doesn't actually execute the code. This means our docstrings are useless in e.g. PyCharm.

As far as subversion of the doc mechanism by creating doc-only functions is concerned, I'm pretty sure computing docstrings at runtime is equally subversive, so that particular argument doesn't hold much water for me.

I'm a fan of having one canonical place for a set of documentation, be it on a class or a standalone function. Other places should reference those liberally. The location should be readily accessible; if matplotlib.text.Text is too much to ask users to type, then let's figure out something to make it easier. (Hell, why not improve the notebook help functionality to make those links actually clickable?)

@jklymak
Copy link
Member Author

jklymak commented Oct 20, 2017

Clickable links would be cool. Except when you don’t have the internet...

Just to be clear, in the notebook there is no way to type matplotlib.text.Text. Well that’s a bit of a lie - there is now a little widget that gives command line access to the kernel. I suppose that’s ok but still nowhere near as nice as tab completion.

I’m still leaning towards repeating the doc string unless they are huge and having a test to make sure devs remember to duplicate them. But I’m not claiming my vote matters very much.

It should also be “easy” to have a small script that can automate docstring copies before commit/push. The only logistical difficulty is knowing which changed docstring the dev means to be the canonical one.

@dopplershift
Copy link
Contributor

dopplershift commented Oct 20, 2017

@jklymak Two things:

  1. I didn't mean clickable to web docs. If sphinx can figure out how to reference between them (which it does), then there's nothing technically hard about having a link in the notebook that resolves to another local docstring (not saying it'd be simple, but we're not trying to solve an NP-hard problem here)
import matplotlib.text
matplotlib.text.Text?

Doesn't work?

@NelleV
Copy link
Member

NelleV commented Oct 20, 2017

That's how numpy does it:

In [1]: X = np.arange(10)

In [2]: X.compress?
Docstring:
a.compress(condition, axis=None, out=None)

Return selected slices of this array along given axis.

Refer to `numpy.compress` for full documentation.

See Also
--------
numpy.compress : equivalent function

And the rendered html links to np.compress: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.compress.html#numpy.ndarray.compress

@jklymak
Copy link
Member Author

jklymak commented Oct 20, 2017

Sure that works. But you need to create a new cell execute it get what you want go back to the previous cell and then delete the spare cell.

Conversely the tab completion pops up a little widget that you can expand to get the scrollable docstring. The whole thing dismisses then you are done. Not sure what following a docstring should do. Maybe it could repopulate that widget with the new docstring. I don’t know the tech of that docstring widget so maybe it’s easy. Maybe they already did it ;-)

@jklymak
Copy link
Member Author

jklymak commented Oct 20, 2017

@NelleV. I didn’t say it was optimal for numpy either ;-). But numpy doesn’t use pass through kwargs like we do. At least with the above I know what the kwargs are and can make a decent guess as to their purpose

@anntzer
Copy link
Contributor

anntzer commented Oct 20, 2017

@dopplershift To be honest I don't understand the point of tools that try so hard not to import dependent modules, and then complain that Python is too dynamic. For example, with something as simple as

from functools import wraps


def func1():
    """Some docstring."""


@wraps(func1)
def func2():
    pass

PyCharm (2017.2 pro) appears to be unable to report the docstring of func2 (which is the same as func1. Note that this is completely stdlib python...

The main objections I've heard against importing are 1) it's too slow (well, it's not as if pylint or pycharm's indexing were fast as a rocket either, and just like pycharm's index, the result of the docstrings can be cached -- and yes cache invalidation is a difficult problem for everyone), and 2) it can have side effects (that's... exceptionally rare for well written libraries; if you're worried that you're developing against some "nasty" library then it being executed by your code editor should probably be the last of your worries...)

Now this does not address whether dynamic docstring generation is more or less perverse than docstring-only functions, for sure...

@bcolsen
Copy link

bcolsen commented Oct 23, 2017

I'm looking at this issue with Jedi right now (spyder-ide/spyder#5522) and it (like pycharm) also can't return doc strings that need to be executed.

Jedi is used in Jupyter and this needs to be considered for Jupterlab doc lookup as well.

FYI as of matplotlib 2.1(thanks to numpydoc doc strings) and jedi 11 you can now get completions for:

import matplotlib.plyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pl

Jedi will give completions of plot and plot_date

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants