New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiIndex DataFrame to_csv() ignores date_format #7791

Closed
Poquaruse opened this Issue Jul 18, 2014 · 7 comments

Comments

Projects
None yet
3 participants
@Poquaruse

Poquaruse commented Jul 18, 2014

Hi all,

I'm running pandas 0.14.1:

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: DE

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

I'm experiencing a bug which might be related to #7622

How to reproduce the bug:
At first, a new DataFrame is created

dt_rng = pd.date_range(start='2014-01-01 00:00', periods = 1000, freq='1s')
df = pd.DataFrame({'a':np.random.randn(1000), 'b': np.random.randn(1000)},index = dt_rng)
df['b'] = df['b'].round()

Then, this DataFrame is exported as csv with a custom date_format:

df.to_csv(date_format='%Y demo')

This does work: ',a,b\n2014 demo,-0.5582228932333034,1.0\n2014 demo,[...]'

Now, the other scenario. Same DataFrame but a groupby('b') and a resample:

df.groupby(df['b']).resample('1min').to_csv(date_format='%y demo')

Here, the custom date_format is not used:
'b,,a\n-4.0,2014-01-01 00:12:00,1.571130069273494[...]'

Thanks and best regards

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jul 18, 2014

Contributor
  • use the format below for grouping (this is new in 0.14.0), as its more flexible (e.g. the resample could be on a column).
  • this is a bug, but I think for another reason, the index is not being constructured 100% correctly, xref #7793
In [5]: df.groupby(['b',pd.Grouper(freq='1h')]).mean()
Out[5]: 
                      a
b                      
-4 2014-01-01  0.236842
-3 2014-01-01 -0.196441
-2 2014-01-01  0.084048
-1 2014-01-01 -0.079776
 0 2014-01-01  0.059750
 1 2014-01-01 -0.010162
 2 2014-01-01 -0.124559
 3 2014-01-01 -0.009600

In [6]: df.groupby(['b',pd.Grouper(freq='1h')]).mean().to_csv('test.csv',mode='w',date_format='%Y')

In [7]: !cat test.csv
b,,a
-4.0,2014-01-01 00:00:00,0.23684162290873984
-3.0,2014-01-01 00:00:00,-0.1964410632489068
-2.0,2014-01-01 00:00:00,0.08404773044520912
-1.0,2014-01-01 00:00:00,-0.07977596435616328
0.0,2014-01-01 00:00:00,0.05975025407287002
1.0,2014-01-01 00:00:00,-0.010162014106808491
2.0,2014-01-01 00:00:00,-0.1245593183893703
3.0,2014-01-01 00:00:00,-0.00959968236480397
Contributor

jreback commented Jul 18, 2014

  • use the format below for grouping (this is new in 0.14.0), as its more flexible (e.g. the resample could be on a column).
  • this is a bug, but I think for another reason, the index is not being constructured 100% correctly, xref #7793
In [5]: df.groupby(['b',pd.Grouper(freq='1h')]).mean()
Out[5]: 
                      a
b                      
-4 2014-01-01  0.236842
-3 2014-01-01 -0.196441
-2 2014-01-01  0.084048
-1 2014-01-01 -0.079776
 0 2014-01-01  0.059750
 1 2014-01-01 -0.010162
 2 2014-01-01 -0.124559
 3 2014-01-01 -0.009600

In [6]: df.groupby(['b',pd.Grouper(freq='1h')]).mean().to_csv('test.csv',mode='w',date_format='%Y')

In [7]: !cat test.csv
b,,a
-4.0,2014-01-01 00:00:00,0.23684162290873984
-3.0,2014-01-01 00:00:00,-0.1964410632489068
-2.0,2014-01-01 00:00:00,0.08404773044520912
-1.0,2014-01-01 00:00:00,-0.07977596435616328
0.0,2014-01-01 00:00:00,0.05975025407287002
1.0,2014-01-01 00:00:00,-0.010162014106808491
2.0,2014-01-01 00:00:00,-0.1245593183893703
3.0,2014-01-01 00:00:00,-0.00959968236480397
@Poquaruse

This comment has been minimized.

Show comment
Hide comment
@Poquaruse

Poquaruse Jul 19, 2014

Thanks! :-)

Poquaruse commented Jul 19, 2014

Thanks! :-)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jul 19, 2014

Contributor

ok, since #7746 fix was just merged, this is a separate issue.

I think that the multi-index to_native_types is not passing thru the date_format (and maybe other formatters) to the sub-index formatting.

Contributor

jreback commented Jul 19, 2014

ok, since #7746 fix was just merged, this is a separate issue.

I think that the multi-index to_native_types is not passing thru the date_format (and maybe other formatters) to the sub-index formatting.

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 10, 2014

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Nov 8, 2015

Contributor

@jreback i've been looking at that issue, and i would not say the difficulty is "novice"...

The issue I'm struggling with is that (if i understand things correctly) a MultiIndex is a MultiIndex whatever the type of the content, while a 1D index with datetimes is a DatetimeIndex.

So when we need to format a DatetimeIndex, DatetimeIndex._format_native_types is there and does what it has to do, knowing its values are Timestamps https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/tseries/index.py#L737

On the other hand, a MultiIndex has only a generic _format_native_types method, which basically does nothing but returning its content https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/core/index.py#L4329

I could try to go through the individual values of the MultiIndex and format all Timestamp that I encounter, but I'm not sure that's such a good idea. Probably finding a way to reuse DatetimeIndex._format_native_types would be better.

Is it possible to transform a MultiIndex into a list of regular indexes (e.g. DatetimeIndex) back and forth?

Contributor

nbonnotte commented Nov 8, 2015

@jreback i've been looking at that issue, and i would not say the difficulty is "novice"...

The issue I'm struggling with is that (if i understand things correctly) a MultiIndex is a MultiIndex whatever the type of the content, while a 1D index with datetimes is a DatetimeIndex.

So when we need to format a DatetimeIndex, DatetimeIndex._format_native_types is there and does what it has to do, knowing its values are Timestamps https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/tseries/index.py#L737

On the other hand, a MultiIndex has only a generic _format_native_types method, which basically does nothing but returning its content https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/core/index.py#L4329

I could try to go through the individual values of the MultiIndex and format all Timestamp that I encounter, but I'm not sure that's such a good idea. Probably finding a way to reuse DatetimeIndex._format_native_types would be better.

Is it possible to transform a MultiIndex into a list of regular indexes (e.g. DatetimeIndex) back and forth?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 8, 2015

Contributor

I'll bump the difficulty up 1 :)

here's what you have to do

redefine _format_native_types for a MulitIndex

for each level, call _format_native_types, passing thru the kwargs. these are the new levels,
then reconstruct a MultiIndexIndex and call .values on it.

In [1]: mi = pd.MultiIndex.from_product([range(3),list('abc'),date_range('20130101 09:00:00',periods=3)])
In [4]: mi.levels

Out[4]: FrozenList([[0, 1, 2], [u'a', u'b', u'c'], [2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 09:00:00]])

In [5]: mi.levels[0]
Out[5]: Int64Index([0, 1, 2], dtype='int64')

In [6]: mi.levels[1]
Out[6]: Index([u'a', u'b', u'c'], dtype='object')

In [7]: mi.levels[2]
Out[7]: DatetimeIndex(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype='datetime64[ns]', freq='D')

In [8]: mi.levels[2].to_native_types()
Out[8]: array(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype=object)

In [9]: mi.levels[2].to_native_types(date_format='%Y%m%d')
Out[9]: array(['20130101', '20130102', '20130103'], dtype=object)

something like this:

In [10]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False)
Out[10]: 
MultiIndex(levels=[[u'0', u'1', u'2'], [u'a', u'b', u'c'], [u'2013-01-01 09:00:00', u'2013-01-02 09:00:00', u'2013-01-03 09:00:00']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [11]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False).values
Out[11]: 
array([('0', 'a', '2013-01-01 09:00:00'),
       ('0', 'a', '2013-01-02 09:00:00'),
       ('0', 'a', '2013-01-03 09:00:00'),
       ('0', 'b', '2013-01-01 09:00:00'),
       ('0', 'b', '2013-01-02 09:00:00'),
       ('0', 'b', '2013-01-03 09:00:00'),
       ('0', 'c', '2013-01-01 09:00:00'),
       ('0', 'c', '2013-01-02 09:00:00'),
       ('0', 'c', '2013-01-03 09:00:00'),
       ('1', 'a', '2013-01-01 09:00:00'),
       ('1', 'a', '2013-01-02 09:00:00'),
       ('1', 'a', '2013-01-03 09:00:00'),
       ('1', 'b', '2013-01-01 09:00:00'),
       ('1', 'b', '2013-01-02 09:00:00'),
       ('1', 'b', '2013-01-03 09:00:00'),
       ('1', 'c', '2013-01-01 09:00:00'),
       ('1', 'c', '2013-01-02 09:00:00'),
       ('1', 'c', '2013-01-03 09:00:00'),
       ('2', 'a', '2013-01-01 09:00:00'),
       ('2', 'a', '2013-01-02 09:00:00'),
       ('2', 'a', '2013-01-03 09:00:00'),
       ('2', 'b', '2013-01-01 09:00:00'),
       ('2', 'b', '2013-01-02 09:00:00'),
       ('2', 'b', '2013-01-03 09:00:00'),
       ('2', 'c', '2013-01-01 09:00:00'),
       ('2', 'c', '2013-01-02 09:00:00'), ('2', 'c', '2013-01-03 09:00:00')], dtype=object)
Contributor

jreback commented Nov 8, 2015

I'll bump the difficulty up 1 :)

here's what you have to do

redefine _format_native_types for a MulitIndex

for each level, call _format_native_types, passing thru the kwargs. these are the new levels,
then reconstruct a MultiIndexIndex and call .values on it.

In [1]: mi = pd.MultiIndex.from_product([range(3),list('abc'),date_range('20130101 09:00:00',periods=3)])
In [4]: mi.levels

Out[4]: FrozenList([[0, 1, 2], [u'a', u'b', u'c'], [2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 09:00:00]])

In [5]: mi.levels[0]
Out[5]: Int64Index([0, 1, 2], dtype='int64')

In [6]: mi.levels[1]
Out[6]: Index([u'a', u'b', u'c'], dtype='object')

In [7]: mi.levels[2]
Out[7]: DatetimeIndex(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype='datetime64[ns]', freq='D')

In [8]: mi.levels[2].to_native_types()
Out[8]: array(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype=object)

In [9]: mi.levels[2].to_native_types(date_format='%Y%m%d')
Out[9]: array(['20130101', '20130102', '20130103'], dtype=object)

something like this:

In [10]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False)
Out[10]: 
MultiIndex(levels=[[u'0', u'1', u'2'], [u'a', u'b', u'c'], [u'2013-01-01 09:00:00', u'2013-01-02 09:00:00', u'2013-01-03 09:00:00']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [11]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False).values
Out[11]: 
array([('0', 'a', '2013-01-01 09:00:00'),
       ('0', 'a', '2013-01-02 09:00:00'),
       ('0', 'a', '2013-01-03 09:00:00'),
       ('0', 'b', '2013-01-01 09:00:00'),
       ('0', 'b', '2013-01-02 09:00:00'),
       ('0', 'b', '2013-01-03 09:00:00'),
       ('0', 'c', '2013-01-01 09:00:00'),
       ('0', 'c', '2013-01-02 09:00:00'),
       ('0', 'c', '2013-01-03 09:00:00'),
       ('1', 'a', '2013-01-01 09:00:00'),
       ('1', 'a', '2013-01-02 09:00:00'),
       ('1', 'a', '2013-01-03 09:00:00'),
       ('1', 'b', '2013-01-01 09:00:00'),
       ('1', 'b', '2013-01-02 09:00:00'),
       ('1', 'b', '2013-01-03 09:00:00'),
       ('1', 'c', '2013-01-01 09:00:00'),
       ('1', 'c', '2013-01-02 09:00:00'),
       ('1', 'c', '2013-01-03 09:00:00'),
       ('2', 'a', '2013-01-01 09:00:00'),
       ('2', 'a', '2013-01-02 09:00:00'),
       ('2', 'a', '2013-01-03 09:00:00'),
       ('2', 'b', '2013-01-01 09:00:00'),
       ('2', 'b', '2013-01-02 09:00:00'),
       ('2', 'b', '2013-01-03 09:00:00'),
       ('2', 'c', '2013-01-01 09:00:00'),
       ('2', 'c', '2013-01-02 09:00:00'), ('2', 'c', '2013-01-03 09:00:00')], dtype=object)
@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Nov 8, 2015

Contributor

Nice! I didn't know the concept of "level", and this is exactly what i was missing, thx ;)

Here we go

Contributor

nbonnotte commented Nov 8, 2015

Nice! I didn't know the concept of "level", and this is exactly what i was missing, thx ;)

Here we go

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 11, 2015

Contributor

closed by #11151

Contributor

jreback commented Nov 11, 2015

closed by #11151

@jreback jreback closed this Nov 11, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment