ENH: Named tuple fields as column names #11181

Closed
MaximilianR opened this Issue Sep 23, 2015 · 8 comments

Contributor

MaximilianR commented Sep 23, 2015

Currently, passing a list of namedtuples (or SQLAlchemy results, which is our specific case) to DataFrame.from_records doesn't resolve the field names.

I think it would make sense to, at least when the field names are all the same. Happy to do a PR if people agree.

In [17]: from collections import namedtuple

In [18]: Record = namedtuple('Record',('date','value'))

In [19]: records = [Record(d, v) for (d, v) in zip(pd.date_range('2000',freq='D', periods=20), range(20))]

In [20]: records
Out[20]: 
[Record(date=Timestamp('2000-01-01 00:00:00', offset='D'), value=0),
 Record(date=Timestamp('2000-01-02 00:00:00', offset='D'), value=1),
...
 Record(date=Timestamp('2000-01-19 00:00:00', offset='D'), value=18),
 Record(date=Timestamp('2000-01-20 00:00:00', offset='D'), value=19)]

In [21]: pd.DataFrame.from_records(records)
Out[21]: 
            0   1
0  2000-01-01   0
1  2000-01-02   1
...
18 2000-01-19  18
19 2000-01-20  19

Desired behavior:

Out[21]: 
          date  value
0   2000-01-01      0
1   2000-01-02      1
...
18  2000-01-19     18
19  2000-01-20     19
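For reference, a workaround is available today: pass the field names explicitly via columns. A minimal sketch (reusing the Record namedtuple from above, shortened to three rows for brevity):

```python
from collections import namedtuple

import pandas as pd

Record = namedtuple('Record', ('date', 'value'))
records = [Record(d, v)
           for d, v in zip(pd.date_range('2000', freq='D', periods=3), range(3))]

# Passing the namedtuple's _fields explicitly yields the desired column names
df = pd.DataFrame.from_records(records, columns=Record._fields)
print(df.columns.tolist())  # prints ['date', 'value']
```

The ask in this issue is for from_records to pick up those names automatically, so the columns argument becomes unnecessary in the common case.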
Contributor

jreback commented Sep 24, 2015

You could do this, but you would then have to deal with potentially multiple names (for a single column), and should raise in that case. This is unambiguous with a recarray, but here you are passing a list of record-likes. FYI, this is the same for DataFrame(records). Further, you would have to be careful about performance here.

jreback added the API Design label Sep 24, 2015

Contributor

jreback commented Sep 24, 2015

Any reason you are not simply using pd.read_sql, which already does this?

Contributor

MaximilianR commented Sep 24, 2015

@jreback - we're using SQLAlchemy as an ORM, so we have a SQLAlchemy Query object, which encapsulates joins & filtering.

My understanding is that those methods take a SELECT string. Are we missing something, and will they work with the ORM? We're using this fairly heavily, so I'm keen to know if we're doing something inefficient.

Thanks

Contributor

MaximilianR commented Sep 24, 2015

I hear you re performance, if you have to check every item.

I had thought that if the first item is a tuple with ._fields, it would be a reasonable default to set the appropriate axis index to those fields. The alternative is integer labels, so there's not a lot of competition, and the user is free to supply columns / index as ever.

Without that convenience, it does seem like a lot of lifting for pandas vs the benefit.
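A rough sketch of that default in plain Python (the helper name infer_columns is invented for illustration; this is not the pandas implementation):

```python
from collections import namedtuple


def infer_columns(records):
    """Return the namedtuple field names if all records agree, else None.

    Hypothetical helper mirroring the proposal above: use ._fields from the
    first record, fall back to integer labels otherwise, and raise when
    records carry conflicting names.
    """
    if not records:
        return None
    fields = getattr(records[0], '_fields', None)
    if fields is None:
        return None
    # Guard against mixed record types with conflicting field names
    if any(getattr(r, '_fields', None) != fields for r in records):
        raise ValueError("records have conflicting field names")
    return list(fields)


Record = namedtuple('Record', ('date', 'value'))
print(infer_columns([Record(1, 2), Record(3, 4)]))  # prints ['date', 'value']
print(infer_columns([(1, 2), (3, 4)]))              # prints None
```

The all-records check is where the performance concern above comes in; a cheaper variant would inspect only the first record.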

Contributor

jreback commented Sep 24, 2015

this is prob not that crazy to add
we obviously already support lists of ndarrays and such
so it's just another path

Contributor

jreback commented Sep 24, 2015

side issue: pydata#4916

love to remove from_records entirely (and just have it figured out in the constructor)

Contributor

jorisvandenbossche commented Sep 24, 2015

@MaximilianR Regarding the SQLAlchemy Query object: at the moment this will indeed not work with read_sql_query, but would it be useful to add this functionality?
I think you can try it out by providing the selectable attribute (http://docs.sqlalchemy.org/en/rel_1_0/orm/query.html#sqlalchemy.orm.query.Query.selectable), something like: pd.read_sql_query(Query().selectable, engine)
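A hypothetical end-to-end sketch of that suggestion, using an in-memory SQLite database (the model, table, and data are invented for illustration; API shapes assume SQLAlchemy 1.4+):

```python
import pandas as pd
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class User(Base):
    # Hypothetical ORM model, invented for illustration
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([User(name="alice"), User(name="bob")])
    session.commit()
    # An ORM Query, which may encapsulate joins and filtering
    query = session.query(User).filter(User.name == "alice")
    # .selectable exposes the underlying SELECT, which read_sql_query accepts
    df = pd.read_sql_query(query.selectable, engine)

print(len(df))  # prints 1
```

Note that the column labels on the resulting frame follow the selectable's labeling, which may differ from the bare ORM attribute names.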

Contributor

MaximilianR commented Sep 24, 2015

@jorisvandenbossche Yes, that works, thanks!

I asked a question over at SQLAlchemy, to see if they had any guidance. I could imagine a few options: leave it as-is; allow read_sql to take a Query object as a very thin wrapper; or something more tightly coupled with SQLAlchemy if there are optimizations to be had.
https://bitbucket.org/zzzeek/sqlalchemy/issues/3542/design-for-pandas-access-to-query

jreback added the Compat label Oct 23, 2015

jreback added this to the 0.17.1 milestone Oct 23, 2015

jreback closed this in #11416 Oct 23, 2015
