MultiIndex DataFrame to_csv() ignores date_format #7791

Closed
Poquaruse opened this Issue Jul 18, 2014 · 7 comments

Comments

Projects
None yet
3 participants

Hi all,

I'm running pandas 0.14.1:

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: DE

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

I'm experiencing a bug which might be related to pydata#7622

How to reproduce the bug:
At first, a new DataFrame is created

dt_rng = pd.date_range(start='2014-01-01 00:00', periods = 1000, freq='1s')
df = pd.DataFrame({'a':np.random.randn(1000), 'b': np.random.randn(1000)},index = dt_rng)
df['b'] = df['b'].round()

Then, this DataFrame is exported as csv with a custom date_format:

df.to_csv(date_format='%Y demo')

This does work: ',a,b\n2014 demo,-0.5582228932333034,1.0\n2014 demo,[...]'

Now, the other scenario. Same DataFrame but a groupby('b') and a resample:

df.groupby(df['b']).resample('1min').to_csv(date_format='%y demo')

Here, the custom date_format is not used:
'b,,a\n-4.0,2014-01-01 00:12:00,1.571130069273494[...]'

Thanks and best regards

Contributor

jreback commented Jul 18, 2014

  • use the format below for grouping (this is new in 0.14.0), as its more flexible (e.g. the resample could be on a column).
  • this is a bug, but I think for another reason, the index is not being constructured 100% correctly, xref #7793
In [5]: df.groupby(['b',pd.Grouper(freq='1h')]).mean()
Out[5]: 
                      a
b                      
-4 2014-01-01  0.236842
-3 2014-01-01 -0.196441
-2 2014-01-01  0.084048
-1 2014-01-01 -0.079776
 0 2014-01-01  0.059750
 1 2014-01-01 -0.010162
 2 2014-01-01 -0.124559
 3 2014-01-01 -0.009600

In [6]: df.groupby(['b',pd.Grouper(freq='1h')]).mean().to_csv('test.csv',mode='w',date_format='%Y')

In [7]: !cat test.csv
b,,a
-4.0,2014-01-01 00:00:00,0.23684162290873984
-3.0,2014-01-01 00:00:00,-0.1964410632489068
-2.0,2014-01-01 00:00:00,0.08404773044520912
-1.0,2014-01-01 00:00:00,-0.07977596435616328
0.0,2014-01-01 00:00:00,0.05975025407287002
1.0,2014-01-01 00:00:00,-0.010162014106808491
2.0,2014-01-01 00:00:00,-0.1245593183893703
3.0,2014-01-01 00:00:00,-0.00959968236480397

jreback added this to the 0.15.0 milestone Jul 18, 2014

Thanks! :-)

Contributor

jreback commented Jul 19, 2014

ok, since #7746 fix was just merged, this is a separate issue.

I think that the multi-index to_native_types is not passing thru the date_format (and maybe other formatters) to the sub-index formatting.

@jreback jreback modified the milestone: 0.15.1, 0.15.0 Sep 10, 2014

@jreback jreback modified the milestone: 0.16.0, Next Major Release Mar 6, 2015

Contributor

nbonnotte commented Nov 8, 2015

@jreback i've been looking at that issue, and i would not say the difficulty is "novice"...

The issue I'm struggling with is that (if i understand things correctly) a MultiIndex is a MultiIndex whatever the type of the content, while a 1D index with datetimes is a DatetimeIndex.

So when we need to format a DatetimeIndex, DatetimeIndex._format_native_types is there and does what it has to do, knowing its values are Timestamps https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/tseries/index.py#L737

On the other hand, a MultiIndex has only a generic _format_native_types method, which basically does nothing but returning its content https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/core/index.py#L4329

I could try to go through the individual values of the MultiIndex and format all Timestamp that I encounter, but I'm not sure that's such a good idea. Probably finding a way to reuse DatetimeIndex._format_native_types would be better.

Is it possible to transform a MultiIndex into a list of regular indexes (e.g. DatetimeIndex) back and forth?

Contributor

jreback commented Nov 8, 2015

I'll bump the difficulty up 1 :)

here's what you have to do

redefine _format_native_types for a MulitIndex

for each level, call _format_native_types, passing thru the kwargs. these are the new levels,
then reconstruct a MultiIndexIndex and call .values on it.

In [1]: mi = pd.MultiIndex.from_product([range(3),list('abc'),date_range('20130101 09:00:00',periods=3)])
In [4]: mi.levels

Out[4]: FrozenList([[0, 1, 2], [u'a', u'b', u'c'], [2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 09:00:00]])

In [5]: mi.levels[0]
Out[5]: Int64Index([0, 1, 2], dtype='int64')

In [6]: mi.levels[1]
Out[6]: Index([u'a', u'b', u'c'], dtype='object')

In [7]: mi.levels[2]
Out[7]: DatetimeIndex(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype='datetime64[ns]', freq='D')

In [8]: mi.levels[2].to_native_types()
Out[8]: array(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype=object)

In [9]: mi.levels[2].to_native_types(date_format='%Y%m%d')
Out[9]: array(['20130101', '20130102', '20130103'], dtype=object)

something like this:

In [10]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False)
Out[10]: 
MultiIndex(levels=[[u'0', u'1', u'2'], [u'a', u'b', u'c'], [u'2013-01-01 09:00:00', u'2013-01-02 09:00:00', u'2013-01-03 09:00:00']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [11]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False).values
Out[11]: 
array([('0', 'a', '2013-01-01 09:00:00'),
       ('0', 'a', '2013-01-02 09:00:00'),
       ('0', 'a', '2013-01-03 09:00:00'),
       ('0', 'b', '2013-01-01 09:00:00'),
       ('0', 'b', '2013-01-02 09:00:00'),
       ('0', 'b', '2013-01-03 09:00:00'),
       ('0', 'c', '2013-01-01 09:00:00'),
       ('0', 'c', '2013-01-02 09:00:00'),
       ('0', 'c', '2013-01-03 09:00:00'),
       ('1', 'a', '2013-01-01 09:00:00'),
       ('1', 'a', '2013-01-02 09:00:00'),
       ('1', 'a', '2013-01-03 09:00:00'),
       ('1', 'b', '2013-01-01 09:00:00'),
       ('1', 'b', '2013-01-02 09:00:00'),
       ('1', 'b', '2013-01-03 09:00:00'),
       ('1', 'c', '2013-01-01 09:00:00'),
       ('1', 'c', '2013-01-02 09:00:00'),
       ('1', 'c', '2013-01-03 09:00:00'),
       ('2', 'a', '2013-01-01 09:00:00'),
       ('2', 'a', '2013-01-02 09:00:00'),
       ('2', 'a', '2013-01-03 09:00:00'),
       ('2', 'b', '2013-01-01 09:00:00'),
       ('2', 'b', '2013-01-02 09:00:00'),
       ('2', 'b', '2013-01-03 09:00:00'),
       ('2', 'c', '2013-01-01 09:00:00'),
       ('2', 'c', '2013-01-02 09:00:00'), ('2', 'c', '2013-01-03 09:00:00')], dtype=object)
Contributor

nbonnotte commented Nov 8, 2015

Nice! I didn't know the concept of "level", and this is exactly what i was missing, thx ;)

Here we go

@jreback jreback modified the milestone: 0.17.1, Next Major Release Nov 8, 2015

Contributor

jreback commented Nov 11, 2015

closed by #11151

jreback closed this Nov 11, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment