Confusing API for constructing Series from DataFrame #20658

Open
e-pet opened this issue Apr 11, 2018 · 2 comments
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Error Reporting Incorrect or improved errors from pandas

Comments

@e-pet

e-pet commented Apr 11, 2018

In my opinion, the current API for constructing a Series from a DataFrame containing just a single column is quite confusing. The following illustrates my experience trying to transform a DataFrame containing just a single column into a Series object. This may or may not be representative of other users' experience.

  1. I know there's a to_frame (why not to_dataframe?) method for Series, so I guess there's also a to_series method for dataframes...? [There isn't.]
  2. Well, then I'll just call the Series constructor using the DataFrame: pd.Series(df). From a user perspective, the only sensible thing that this could possibly do is return the Series corresponding to the (single) column. I'm not exactly sure what it does, but it definitely doesn't do what I expected it to do and the result doesn't look very useful in any case:
pd.Series(pd.DataFrame({'a':[1,2,3]}))
Out[20]: 
0    (a,)
1    (a,)
2    (a,)
dtype: object
  3. Some googling reveals that squeeze appears to be the almost-inverse of the to_frame method, except that it returns an int instead of a Series in the 1x1 case:
type(pd.DataFrame({'a':[1]}).squeeze())
Out[22]: numpy.int64
  4. Further googling yields a solution that technically does exactly what I want but to my eyes somewhat obscures what's happening: df.iloc[:, 0].

Specific questions:

  1. Why is it to_frame and not to_dataframe? Is there any reason not to introduce the latter as an alias to the former?
  2. Could there be a to_series method on dataframes that raises an exception if there's more than one column and otherwise works like squeeze except that it always returns a Series?
  3. Why does the first example above using the Series constructor do what it does, and can't this be changed so that it does something useful for this use case?
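The behaviour asked for in question 2 could be sketched as a standalone helper. Note that to_series here is hypothetical: pandas has no such DataFrame method, and the function name, signature and error message below are assumptions made purely for illustration.

```python
import pandas as pd


def to_series(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper (not part of pandas): return the only column of df."""
    if df.shape[1] != 1:
        raise ValueError(
            f"expected a DataFrame with exactly one column, got {df.shape[1]}"
        )
    # Positional selection yields a Series even for a 1x1 frame,
    # unlike squeeze() without an axis argument.
    return df.iloc[:, 0]


single = to_series(pd.DataFrame({"a": [1]}))
print(type(single).__name__)  # Series
```

Raising on more than one column keeps the helper from silently picking a column, which is the strictness the question asks for.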

(I'm not sure if this has been discussed before; I couldn't find any related issue. Sorry if this is a duplicate post!)

On a side note, this issue is quite representative of my experiences with pandas so far: it is almost always capable of doing what I'd like it to do, but it often takes me longer than expected to get things to work due to (for me, at least) unintuitive and partly inconsistent API design. That being said, it's still immensely useful and I use it on an almost daily basis. Thanks a lot to everyone involved!

@Dr-Irv
Contributor

Dr-Irv commented Apr 11, 2018

If you have a DataFrame, then each column is a Series. So, in your example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a':[1,2,3]})

In [3]: df['a']
Out[3]:
0    1
1    2
2    3
Name: a, dtype: int64

In [4]: type(df['a'])
Out[4]: pandas.core.series.Series

In [5]: df[df.columns[0]]
Out[5]:
0    1
1    2
2    3
Name: a, dtype: int64

The last example shows that you don't even need to know the column name.

So there really isn't a need to convert a DataFrame to a Series, since each column is already a Series.

@jorisvandenbossche
Member

As @Dr-Irv notes, one way to go about this is "selecting the single column". There are several ways to do that (as @Dr-Irv shows), and your df.iloc[:, 0] is another one. You might find it obscure (probably because you are thinking in terms of converting the DataFrame to a Series rather than indexing), but in terms of indexing logic it is exactly what you want: give me all rows of the first column.
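As a minimal sketch of the selection approaches mentioned so far, the snippet below checks that label-based and positional selection return the same Series for a single-column frame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

by_label = df["a"]                  # select by column name
by_first_label = df[df.columns[0]]  # select by the first column's label
by_position = df.iloc[:, 0]         # all rows of the first column

# All three are Series with the same values, index and name.
print(by_label.equals(by_position))
print(type(by_position).__name__)
```

One difference worth keeping in mind: if column labels are duplicated, label-based selection can return a DataFrame, whereas df.iloc[:, 0] always selects exactly one column by position.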

That said, on the other points:

  • I am personally not sure it is worth adding a to_series method on DataFrame, given the other ways to do the same. Adding new methods to DataFrame should be done with care, and the added value is of course always subjective.

  • The result of pd.Series(pd.DataFrame({'a':[1,2,3]})) is for sure a bug, as it is indeed nonsensical. I tried it on the latest master, and there it raises an error ("ValueError: Wrong number of items passed 1, placement implies 3"). The error is still confusing (a better message would be good), but at least an error is better than the previous behaviour, I think.
    Whether it should ideally convert the DataFrame to a Series instead of erroring, I am not fully sure. E.g., if you do pd.Series(df.values) (so passing a 2D array, as a DataFrame is a 2D data structure), you get an error saying "Exception: Data must be 1-dimensional". So it would also be good to have consistency with that.

  • The squeeze method is inherited from numpy, and we follow their behaviour. I personally don't use it, but you can actually achieve what you want (also in the len-1 case) by specifying you only want to squeeze the second axis (columns), and not the first (rows): pd.DataFrame({'a':[1]}).squeeze(axis=1)
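The last two points can be tried out with a short sketch. It contrasts squeeze() with squeeze(axis=1) on a 1x1 frame, and shows that passing 2D values to the Series constructor raises. The exact exception type and message are not assumed here, since the comment above reports a bare Exception and this may differ between pandas versions.

```python
import pandas as pd

one_by_one = pd.DataFrame({"a": [1]})

scalar = one_by_one.squeeze()        # squeezes both axes, leaving a scalar
column = one_by_one.squeeze(axis=1)  # squeezes only the columns axis

print(type(column).__name__)  # Series

# Passing a 2D array to the Series constructor raises instead of guessing
# which dimension to keep.
try:
    pd.Series(pd.DataFrame({"a": [1, 2, 3]}).values)
except Exception as exc:  # broad on purpose: the concrete type is not assumed
    print(type(exc).__name__, exc)
```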

BTW, thanks for raising this issue and for the feedback. It's always interesting to hear such experiences. Although intuitiveness is often subjective, there are certainly many things that can be improved (although a lot of things also have historical reasons and backwards compatibility constraints).

@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@mroeschke mroeschke added Error Reporting Incorrect or improved errors from pandas Enhancement and removed API Design Usage Question labels Jun 28, 2020