Add setting for switching between "production" and "interactive" mode #13862

Open
bbirand opened this Issue Aug 1, 2016 · 9 comments

@bbirand
commented Aug 1, 2016

It seems that future development in Pandas creates a distinction between "production" and "interactive" use. I frequently use Pandas to do interactive data analyses. As such, things like indexing shortcuts are very handy. However, I also can see that some of this magic can create very subtle bugs in production systems.

One potential solution is to have a global configuration switch. When an analysis project starts, many of these "shortcut" functions could be made available. Once the code is ready, the user can change this setting to "production" which would issue warnings for functions deemed unsafe.
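As a rough illustration of the switch described above, here is a minimal pure-Python sketch. Everything in it is hypothetical: `set_mode`, `interactive_only`, and `parse_range` are made-up names for this example, and pandas itself offers no such option.

```python
import functools
import warnings

# Hypothetical global setting; pandas has no such option.
_mode = "interactive"


def set_mode(mode):
    """Switch between the two hypothetical modes."""
    global _mode
    if mode not in ("interactive", "production"):
        raise ValueError(f"unknown mode: {mode!r}")
    _mode = mode


def interactive_only(func):
    """Mark a shortcut as unsafe: warn when it is called in production mode."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if _mode == "production":
            warnings.warn(
                f"{func.__name__} is an interactive-only shortcut",
                UserWarning,
                stacklevel=2,
            )
        return func(*args, **kwargs)
    return wrapper


@interactive_only
def parse_range(text):
    """Hypothetical shortcut: split 'start:end' into a (start, end) tuple."""
    start, end = text.split(":")
    return start, end
```

Calling `set_mode("production")` once the code is ready would make every call to a decorated shortcut emit a warning, while the shortcut itself keeps working.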

Any thoughts on this?

This issue came up while talking to @jreback on #13548 .

@shoyer

Member

commented Aug 2, 2016

My immediate inclination is that we should only have a "production" mode in the hypothetical pandas 2.0, and features suitable only for "interactive" mode should be dropped or changed to be unambiguous by only depending on types rather than values. I'd love to see some examples of important "interactive only" features, though.

@bbirand

Author

commented Aug 3, 2016

You are right, probably the best way to move forward is to figure out which features would be "production" and which would be "interactive-only".

In my opinion, the most important feature when doing interactive work is indexing, especially with dates (which was the main point of #13548). Being able to do this with the fewest additional function calls would be great.

features suitable only for "interactive" mode should be dropped or changed to be unambiguous by only depending on types rather than values

What do you mean by depending on types? Similar to the way the "Grouper" object is used, for instance?

df.groupby([pd.Grouper(freq='1M',key='Date'),'Buyer']).sum()

As a side note, having this kind of discussion may actually help in creating new "interactive-only" features that were perhaps left out of the API because they were considered unsafe. The first thing that comes to mind is single-letter shortcuts, similar to numpy's r_ and c_, for replacing frequently used longer functions (maybe like pd.date_range()).

Another example would be features that are implemented in external libraries, for instance:
http://pythonhosted.org/pandas-ply/

@shoyer

Member

commented Aug 4, 2016

What do you mean by depending on types?

I want methods that produce an output with a predictable type/dtype that depends only on the types/dtypes of the input, i.e., with a sane type signature that we could (and perhaps should) specify with PEP 484.

Indexing is a prime example of violating this principle of good software design. Depending on whether the indexing value matches multiple items in the index or not, the result of indexing a DataFrame is either another DataFrame (multiple rows) or a Series (single row). We might fix this in either of two ways:

  1. Require that all indexes have unique values.
  2. Change indexing methods to always return a DataFrame, even when the result is currently a Series representing a single row.

I'm somewhat partial to (2), as a Series with dtype=object for representing a single row is not very useful. But obviously this would be a major break in backwards compatibility.
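A small sketch of the value-dependent return type described above, using made-up data. The list-based lookup at the end is an existing way to always get a DataFrame back; it is shown for comparison and is not the proposed change itself.

```python
import pandas as pd

# Two frames with identical column and index dtypes; only the index values differ.
unique = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
duplicated = pd.DataFrame({"x": [1, 2]}, index=["a", "a"])

# Same call, same input types, different output types:
row = unique.loc["a"]        # one match, so a Series
rows = duplicated.loc["a"]   # two matches, so a DataFrame

# Passing a list of labels yields a DataFrame regardless of the number of matches.
always_frame = unique.loc[["a"]]

print(type(row).__name__, type(rows).__name__, type(always_frame).__name__)
```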

@bbirand

Author

commented Aug 13, 2016

Yes, I certainly see the benefits of having a more consistent API. And a consistent, predictable return type would most definitely be extremely helpful. Between the two options you give, I'd also prefer (2).

But to get back to this issue: making certain things explicit means writing longer statements. It means one needs to use the specific indexing methods (which ideally don't have the kinds of surprises that you describe). It also means creating objects with functions (like pd.date_range) that are then passed on to these more specific methods. All in all, these are great for having a safe, deterministic environment. But they also make code that much more painful to write and harder to read.

For interactive environments, some "magic" should be acceptable. For instance, when doing date manipulations, giving a string of "2015/12:2016/2" as a single string value could do the correct slicing.

A configuration option can decide whether these magic commands should be allowed or not. Thoughts?

@jorisvandenbossche

Member

commented Aug 13, 2016

I agree with @shoyer that there should only be one mode. Every feature in pandas should have well defined behaviour IMO (note the 'should', it should be our aim, but there are certainly areas for improvement).

But that does not mean that some features cannot be convenient. For example, datetime indexing with strings is very convenient for interactive usage (e.g. df['2012-01':'2012-03']), but it should still have well defined behaviour.
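For reference, a minimal sketch of that kind of partial-string slicing on a made-up daily frame. It uses .loc rather than plain [] to stay on the explicit indexer; the frame and column name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily data covering the first half of 2012.
index = pd.date_range("2012-01-01", "2012-06-30", freq="D")
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Partial-string slice: each endpoint expands to its whole month,
# and both endpoints are inclusive.
subset = df.loc["2012-01":"2012-03"]

print(subset.index.min(), subset.index.max())
```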

I would still like to see an example of something specific for an 'interactive mode' that would not have well defined behaviour (so you would want to be able to turn this off).

If something is painful to write and hard to read, we should try to do something about that, looking for solutions to make it easier while still keeping a sane API. But that can still be done in 'one mode' IMO.

@jorisvandenbossche

Member

commented Aug 13, 2016

Regarding your specific example of

For instance, when doing date manipulations, giving a string of "2015/12:2016/2" as a single string value could do the correct slicing.

I personally don't find it problematic to write it as two strings instead of one ("2015/12":"2016/2"); it's only a two-character difference.

@TomAugspurger

Contributor

commented Aug 13, 2016

The only example I could think of is assigning a column using attribute access (.).

In [1]: df = pd.DataFrame({'a': [1, 2]})

In [2]: df.b = [3, 4]

In [3]: df
Out[3]:
   a
0  1
1  2

In [4]: df.a = [3, 4]

In [5]: df
Out[5]:
   a
0  3
1  4

Even then, I don't know what the "production" mode would be. Just disallowing all __setattr__ doesn't feel all that pythonic.
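The session above can be reproduced as a script, with item assignment added for comparison. This is only a sketch: the column "c" is an addition for illustration, and the warning filter is there because newer pandas versions emit a warning on the first assignment.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer pandas versions warn on this line
    df.b = [3, 4]      # sets a plain Python attribute, not a column

df.a = [3, 4]          # "a" already exists, so this updates the column
df["c"] = [5, 6]       # item assignment creates (or updates) a column

print(list(df.columns))
```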

@shoyer

Member

commented Aug 13, 2016

Even then, I don't know what the "production" mode would be. Just disallowing all __setattr__ doesn't feel all that pythonic.

I would be happy with this. We made this choice in xarray and I have heard few complaints about it.

@TomAugspurger

Contributor

commented Aug 13, 2016

That's a useful datapoint. In that case, I'd be tempted to just disallow it altogether. At least then we can raise a helpful error message 😄.

@shoyer referenced this issue Jan 23, 2017

Merged

ENH: add Series & DataFrame .agg/.aggregate #14668
