New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Named tuple fields as column names #11181

Closed
max-sixty opened this Issue Sep 23, 2015 · 8 comments

Comments

Projects
None yet
3 participants
@max-sixty
Contributor

max-sixty commented Sep 23, 2015

Currently, passing a list of namedtuples (or SQLAlchemy results, which is our specific case) to DataFrame.from_records doesn't resolve the field names.

I think it would make sense to, at least if the names are all the same. Happy to do a PR if people agree.

In [17]: from collections import namedtuple

In [18]: Record = namedtuple('Record',('date','value'))

In [19]: records = [Record(d, v) for (d, v) in zip(pd.date_range('2000',freq='D', periods=20), range(20))]

In [20]: records
Out[20]: 
[Record(date=Timestamp('2000-01-01 00:00:00', offset='D'), value=0),
 Record(date=Timestamp('2000-01-02 00:00:00', offset='D'), value=1),
...
 Record(date=Timestamp('2000-01-19 00:00:00', offset='D'), value=18),
 Record(date=Timestamp('2000-01-20 00:00:00', offset='D'), value=19)]

In [21]: pd.DataFrame.from_records(records)
Out[21]: 
            0   1
0  2000-01-01   0
1  2000-01-02   1
...
18 2000-01-19  18
19 2000-01-20  19

Desired behavior:

Out[21]: 
            date   value
0  2000-01-01   0
1  2000-01-02   1
...
18 2000-01-19  18
19 2000-01-20  19
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 24, 2015

Contributor

You could do this, but you would have to then deal with potentially multiple names here (for a single column), and should raise in that case. This is unambiguous with a recarray, but you are passing a list of records-likes here. FYI this is the same for DataFrame(records). Further you would have to be careful about perf here.

Contributor

jreback commented Sep 24, 2015

You could do this, but you would have to then deal with potentially multiple names here (for a single column), and should raise in that case. This is unambiguous with a recarray, but you are passing a list of records-likes here. FYI this is the same for DataFrame(records). Further you would have to be careful about perf here.

@jreback jreback added the API Design label Sep 24, 2015

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 24, 2015

Contributor

any reason you are not simply using pd.read_sql which already does this?

Contributor

jreback commented Sep 24, 2015

any reason you are not simply using pd.read_sql which already does this?

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Sep 24, 2015

Contributor

@jreback - we're using SQLAlchemy as an ORM, so we have a SQLAlchemy Query object, which encapsulates joins & filtering.

My understanding is that those methods take a SELECT string - are we missing something, and they'll work with the ORM? We're using this fairly prevalently so I'm keen to know if we're doing something inefficient.

Thanks

Contributor

max-sixty commented Sep 24, 2015

@jreback - we're using SQLAlchemy as an ORM, so we have a SQLAlchemy Query object, which encapsulates joins & filtering.

My understanding is that those methods take a SELECT string - are we missing something, and they'll work with the ORM? We're using this fairly prevalently so I'm keen to know if we're doing something inefficient.

Thanks

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Sep 24, 2015

Contributor

I hear you re performance, if you have to check every item.

I had thought that if the first item a tuple with ._fields, it would be a reasonable default to set the appropriate axis index to those fields. The alternative is integers, so there's not a lot of competition, and the user is free to supply columns / index as ever.

Without that convenience, it does seem like a lot of lifting for pandas vs the benefit.

Contributor

max-sixty commented Sep 24, 2015

I hear you re performance, if you have to check every item.

I had thought that if the first item a tuple with ._fields, it would be a reasonable default to set the appropriate axis index to those fields. The alternative is integers, so there's not a lot of competition, and the user is free to supply columns / index as ever.

Without that convenience, it does seem like a lot of lifting for pandas vs the benefit.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 24, 2015

Contributor

this is prob not that crazy to add
we obviously already support lists of ndarrays and such
so it's just another path

Contributor

jreback commented Sep 24, 2015

this is prob not that crazy to add
we obviously already support lists of ndarrays and such
so it's just another path

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 24, 2015

Contributor

side issue: #4916

love to remove from_records entirely (and just have it figured out in the constructor)

Contributor

jreback commented Sep 24, 2015

side issue: #4916

love to remove from_records entirely (and just have it figured out in the constructor)

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 24, 2015

Member

@MaximilianR regarding the sqlalchemy Query object. At the moment this will indeed not work with read_sql_query, but would that be useful to add this functionality?
I think you can try it out by providing the selectable attribute (http://docs.sqlalchemy.org/en/rel_1_0/orm/query.html#sqlalchemy.orm.query.Query.selectable), something like: pd.read_sql_query(Query().selectable, engine)

Member

jorisvandenbossche commented Sep 24, 2015

@MaximilianR regarding the sqlalchemy Query object. At the moment this will indeed not work with read_sql_query, but would that be useful to add this functionality?
I think you can try it out by providing the selectable attribute (http://docs.sqlalchemy.org/en/rel_1_0/orm/query.html#sqlalchemy.orm.query.Query.selectable), something like: pd.read_sql_query(Query().selectable, engine)

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Sep 24, 2015

Contributor

@jorisvandenbossche Yes, that works, thanks!

I asked a Q to SQLAlchemy, to see if those guys had any guidance. I could imagine a few options - leave it as-is, allow read_sql to take a Query object as a very simple wrapper, or something more coupled with SQLAlchemy if there are optimizations
https://bitbucket.org/zzzeek/sqlalchemy/issues/3542/design-for-pandas-access-to-query

Contributor

max-sixty commented Sep 24, 2015

@jorisvandenbossche Yes, that works, thanks!

I asked a Q to SQLAlchemy, to see if those guys had any guidance. I could imagine a few options - leave it as-is, allow read_sql to take a Query object as a very simple wrapper, or something more coupled with SQLAlchemy if there are optimizations
https://bitbucket.org/zzzeek/sqlalchemy/issues/3542/design-for-pandas-access-to-query

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment