API: df.rolling(..).corr()/cov() when pairwise=True to return MI DataFrame #15677

jreback · 2017-03-13T20:35:42Z

Unfortunately I don't see an easy way to even deprecate this and we simply have to switch. Good news is this will simply fail fast in accessing, as the Panels have a different access pattern (names of indices and indexing) than MI DataFrames (and another reason to remove them :>).

codecov-io · 2017-03-13T21:45:22Z

Codecov Report

Merging #15677 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15677      +/-   ##
==========================================
- Coverage   90.97%   90.96%   -0.01%     
==========================================
  Files         145      145              
  Lines       49483    49487       +4     
==========================================
+ Hits        45015    45018       +3     
- Misses       4468     4469       +1

Flag	Coverage Δ
#multiple	`88.73% <100%> (-0.01%)`	⬇️
#single	`40.62% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/window.py	`96.2% <100%> (+0.02%)`	⬆️
pandas/indexes/multi.py	`96.59% <100%> (-0.01%)`	⬇️
pandas/core/indexing.py	`94.01% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da0523a...0f5092c. Read the comment docs.

jreback · 2017-03-14T13:51:20Z

@chrisaycock thoughts?

chrisaycock · 2017-03-14T15:28:12Z

I didn't understand the "pairwise=True" part of the PR title, but the actual documentation is pretty straightforward. You are just returning an MI DataFrame by transposing the Panel result under the hood. That all makes sense.

jreback · 2017-03-22T20:14:52Z

any comments?

jorisvandenbossche · 2017-03-27T20:03:03Z

I find it at first sight a bit strange to get a frame with multi-indexed columns instead of a frame with multi-indexed index.
If you have a multi-index columns, accessing one correlation matrix would also be simpler, as you don't need the unstack

(never use this, so I don't speak out of experience)

jreback · 2017-03-27T20:04:46Z

> I find it at first sight a bit strange to get a frame with multi-indexed columns instead of a frame with multi-indexed index.
> If you have a multi-index columns, accessing one correlation matrix would also be simpler, as you don't need the unstack

what's your example?

jorisvandenbossche · 2017-03-27T20:08:57Z

So example from the whatsnew:

In [3]: df = DataFrame(np.random.rand(100, 2))

In [4]: res = df.rolling(12).corr()

In [5]: res
Out[5]: 
major    0                   1     
minor    0         1         0    1
0      NaN       NaN       NaN  NaN
1      NaN       NaN       NaN  NaN
2      NaN       NaN       NaN  NaN
3      NaN       NaN       NaN  NaN
4      NaN       NaN       NaN  NaN
5      NaN       NaN       NaN  NaN
6      NaN       NaN       NaN  NaN
7      NaN       NaN       NaN  NaN
8      NaN       NaN       NaN  NaN
9      NaN       NaN       NaN  NaN
10     NaN       NaN       NaN  NaN
11     1.0  0.131988  0.131988  1.0
12     1.0  0.115938  0.115938  1.0
13     1.0  0.142035  0.142035  1.0
14     1.0  0.160646  0.160646  1.0
15     1.0 -0.011628 -0.011628  1.0
16     1.0  0.480531  0.480531  1.0
17     1.0  0.317300  0.317300  1.0
18     1.0  0.297592  0.297592  1.0

I expected something like:

          0         1     
0  0    NaN       NaN 
   1    NaN       NaN 
1  0    NaN       NaN   
   1    NaN       NaN   
2  0    NaN       NaN   
   1    NaN       NaN 
...

But as I said, don't have experience with what would be the most practical to work with further.

jreback · 2017-03-27T20:10:13Z

ok same as my example.

The reason for returning it this way is that the index is the same as the original. Otherwise it's actually a transpose. I guess this is kind of arbitrary. But IMHO this makes more sense.

In [14]: np.random.seed(1234)
    ...: df = DataFrame(np.random.rand(100, 2), columns=list('AB'))
    ...: 
    ...: 

In [15]: df.head()
Out[15]: 
          A         B
0  0.191519  0.622109
1  0.437728  0.785359
2  0.779976  0.272593
3  0.276464  0.801872
4  0.958139  0.875933

In [16]: df.rolling(12).corr().head()
Out[16]: 
major   A       B    
minor   A   B   A   B
0     NaN NaN NaN NaN
1     NaN NaN NaN NaN
2     NaN NaN NaN NaN
3     NaN NaN NaN NaN
4     NaN NaN NaN NaN

jreback · 2017-03-27T20:14:26Z

ahh, you want this?

In [11]: df.rolling(12).corr().stack('minor', dropna=False).head()
Out[11]: 
major     A   B
  minor        
0 A     NaN NaN
  B     NaN NaN
1 A     NaN NaN
  B     NaN NaN
2 A     NaN NaN
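
For readers comparing the two layouts discussed above, here is a minimal sketch of moving between them. It assumes a pandas version in which `df.rolling(..).corr()` already returns a DataFrame with a MultiIndex on the rows (0.20 or later); the variable names `long_res` and `wide_res` are illustrative only. It uses `unstack` rather than `stack` because the `dropna` argument of `stack` has changed across pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(100, 2), columns=list("AB"))

# Long layout: one row per (original row label, column label) pair.
long_res = df.rolling(12).corr()

# Wide layout: move the inner index level into the columns, giving one
# row per original row label and MultiIndex columns.
wide_res = long_res.unstack(level=-1)

print(long_res.shape)  # (200, 2)
print(wide_res.shape)  # (100, 4)
```

Both frames hold the same numbers; only the position of the second label level differs.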

jorisvandenbossche · 2017-03-27T22:11:07Z

yes, that is what I had in mind (the whatsnew example is maybe a bit confusing with its integer column names)

jreback · 2017-03-27T22:14:52Z

ok, let me see what I can do.

jreback · 2017-03-27T23:18:05Z

updated. and here's the new example

In [1]: pd.options.display.max_rows=12

In [2]:    np.random.seed(1234)
   ...:    df = DataFrame(np.random.rand(100, 2),
   ...:                  columns=['A', 'B'],
   ...:                  index=pd.date_range('20160101', periods=100, freq='D'))
   ...:    df
   ...: 
Out[2]: 
                   A         B
2016-01-01  0.191519  0.622109
2016-01-02  0.437728  0.785359
2016-01-03  0.779976  0.272593
2016-01-04  0.276464  0.801872
2016-01-05  0.958139  0.875933
2016-01-06  0.357817  0.500995
...              ...       ...
2016-04-04  0.475567  0.344417
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

[100 rows x 2 columns]

In [3]: df.rolling(12).corr()
Out[3]: 
                         A         B
major      minor                    
2016-01-01 A           NaN       NaN
           B           NaN       NaN
2016-01-02 A           NaN       NaN
           B           NaN       NaN
2016-01-03 A           NaN       NaN
           B           NaN       NaN
...                    ...       ...
2016-04-07 A      1.000000 -0.132090
           B     -0.132090  1.000000
2016-04-08 A      1.000000 -0.145775
           B     -0.145775  1.000000
2016-04-09 A      1.000000  0.119645
           B      0.119645  1.000000

[200 rows x 2 columns]
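
With this shape, pulling out the correlation matrix for a single date is a plain `.loc` on the outer index level. A minimal sketch, assuming a pandas version that returns the MultiIndex DataFrame shown above; `corr_last` and `expected` are illustrative names:

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(
    np.random.rand(100, 2),
    columns=["A", "B"],
    index=pd.date_range("20160101", periods=100, freq="D"),
)
res = df.rolling(12).corr()

# Selecting on the outer (date) level returns that date's 2x2 matrix.
last_day = df.index[-1]
corr_last = res.loc[last_day]

# Cross-check against a plain corr() over the same 12-row window.
expected = df.iloc[-12:].corr()
print(np.allclose(corr_last.to_numpy(), expected.to_numpy()))
```

No `unstack` is needed for this access pattern, which was the concern raised earlier in the thread.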

jreback · 2017-03-27T23:25:59Z

I think maybe we should completely zonk the index level names. I am not sure what to do with them w/o making it look weird. The problem is that the 2nd level AND the columns are named the same which, when you print it is odd. Could name the 1st level though.

jorisvandenbossche · 2017-03-28T07:08:51Z

Yes, the 'major' and 'minor' do not necessarily make sense anymore, as this is rather Panel-specific terminology.

@chrisaycock Do you think this shape makes sense? (compared to the initial proposal in the PR?)

> Unfortunately I don't see an easy way to even deprecate this and we simply have to switch.

@jreback We could easily add a keyword for switching this behaviour, and raise a deprecation warning on the default value, indicating they can change behaviour + suppress warning by specifying the keyword.
But, this approach is always a bit ugly. Not sure if it is needed in this case.
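
A rough sketch of the keyword-based deprecation pattern described above, in plain Python. Everything here is hypothetical: `pairwise_corr` and its `as_frame` keyword are made-up names for illustration and are not part of pandas, and the return values are placeholder strings standing in for the two result types.

```python
import warnings

# Sentinel that lets the function tell "keyword not passed" apart from
# any real value the caller might supply.
_UNSET = object()


def pairwise_corr(data, as_frame=_UNSET):
    """Hypothetical function; `as_frame` is not a real pandas keyword."""
    if as_frame is _UNSET:
        warnings.warn(
            "The default return type will change in a future version; "
            "pass as_frame=True to opt in now, or as_frame=False to keep "
            "the old behaviour and silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        as_frame = False  # keep the old default during the deprecation period
    return "frame" if as_frame else "panel"


# Passing the keyword explicitly selects a behaviour without a warning.
print(pairwise_corr([1, 2], as_frame=True))
```

The cost noted above is visible here: the keyword exists only to manage the transition and has to be deprecated itself later.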

chrisaycock · 2017-03-28T14:10:32Z

The major/minor names are weird since they don't come from the user. It's hard to follow what those are from just looking at a code sample.

jreback · 2017-03-28T15:11:56Z

so here is an easy thing to do. I have annotated names to make this explicit

In [8]:    pd.options.display.max_rows=12
   ...:    np.random.seed(1234)
   ...:    df = pd.DataFrame(np.random.rand(100, 2),
   ...:                      columns=pd.Index(['A', 'B'], name='bar'),
   ...:                      index=pd.date_range('20160101',
   ...:                                          periods=100, freq='D', name='foo'))
   ...:    df2 = df.copy()
   ...:    df2.columns = pd.Index(['A', 'B'], name='bar2')
   ...:    df2.index = pd.date_range('20160101', periods=100, freq='D', name='foo2')
   ...: 
   ...: 

In [9]: df.rolling(12).corr(df2, pairwise=True)
Out[9]: 
bar2                   A         B
bar        foo                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

I would do this (but maybe zonk the column name). This can actually be confusing if you do the typical cross-corr.

In [10]: df.rolling(12).corr()
Out[10]: 
bar                    A         B
bar        foo                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

jreback · 2017-03-28T15:52:33Z

ok latest push gives [10].

jorisvandenbossche · 2017-03-29T09:00:00Z

In the [10] above, shouldn't the index level names 'bar' and 'foo' be switched?

jreback · 2017-03-29T12:25:45Z

@jorisvandenbossche this is on latest. I had them switched (in error) before when I did that example.

result.index.names = [index name, columns name] of the source df
result.columns.name = None

In [2]: pd.options.display.max_rows=12
   ...: np.random.seed(1234)
   ...: df = pd.DataFrame(np.random.rand(100, 2),
   ...:                         columns=pd.Index(['A', 'B'], name='bar'),
   ...:                         index=pd.date_range('20160101',
   ...:                                              periods=100, freq='D', name='foo'))
   ...: df2 = df.copy()
   ...: df2.columns = pd.Index(['A', 'B'], name='bar2')
   ...: df2.index = pd.date_range('20160101', periods=100, freq='D', name='foo2')
   ...:     

In [3]: df.rolling(12).corr()
Out[3]: 
                       A         B
foo        bar                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

In [4]: df.rolling(12).corr(df2, pairwise=True)
Out[4]: 
                       A         B
foo        bar                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]
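
The naming rule stated at the top of this comment can also be applied by hand, which may be useful because the names a given pandas release assigns by default can differ from what is shown here. A minimal sketch under that assumption; it only uses `MultiIndex.set_names` and `Index.rename`:

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(
    np.random.rand(100, 2),
    columns=pd.Index(["A", "B"], name="bar"),
    index=pd.date_range("20160101", periods=100, freq="D", name="foo"),
)
res = df.rolling(12).corr()

# Apply the rule by hand: index level names come from the source
# frame's index and columns; the result's column name is cleared.
res.index = res.index.set_names([df.index.name, df.columns.name])
res.columns = res.columns.rename(None)

print(res.index.names)
print(res.columns.name)
```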


jreback · 2017-04-07T00:36:04Z

closing in favor of #15601 (which incorporates these commits)

jorisvandenbossche · 2017-04-07T07:38:09Z

@jreback In your Out[3], the columns lose their name. Keeping the name would duplicate the 'bar' in this case, but I think that is OK (losing the name seems worse?)

jreback · 2017-04-07T12:51:51Z

> @jreback In your Out[3], the columns lose their name. Keeping the name would duplicate the 'bar' in this case, but I think that is OK (losing the name seems worse?)

The problem is that it will always duplicate if you have the same frame (e.g. in df.rolling(12).corr() it's against itself). In the case of using another frame,
df.rolling(12).corr(df2), I can see this.

I am explicitly setting this to None. I'll push a change (on the deprecate PR now) with this soon.

jorisvandenbossche · 2017-04-07T13:20:20Z

Yes, you will always get a double name in that case, but I don't think that is much worse (the actual column labels are also duplicated, so that seems only consistent).

It would also be consistent with the non-rolling corr:

In [3]: df = pd.DataFrame(np.random.randn(10,2), columns=['A', 'B'])

In [5]: df.corr()
Out[5]: 
          A         B
A  1.000000 -0.089014
B -0.089014  1.000000

In [6]: df.columns.name = 'name'

In [7]: df.corr()
Out[7]: 
name         A         B
name                    
A     1.000000 -0.089014
B    -0.089014  1.000000

closes #13563
on top of #15677

Author: Jeff Reback <jeff@reback.net>

Closes #15601 from jreback/panel and squashes the following commits:

04104a7 [Jeff Reback] fine grained catching warnings in tests
f8800dc [Jeff Reback] add numpy reference for searchsorted
fa136dd [Jeff Reback] doc correction
c39453a [Jeff Reback] add perf optimization in searchsorted for FrozenNDArray
0e9c4a4 [Jeff Reback] fix docs as per review & column name changes
3df0abe [Jeff Reback] remove Panel from doc-strings, catch internal warning on Panel construction
755606d [Jeff Reback] more docs
d04db2e [Jeff Reback] add deprecate_panel section to docs
538b8e8 [Jeff Reback] pep fix
912d523 [Jeff Reback] TST: separate out test_append_to_multiple_dropna to two tests; when drop=False this is sometimes failing
a2625ba [Jeff Reback] remove most Term references in test_pytables.py
cd5b6b8 [Jeff Reback] DEPR: Panel deprecated
6b20ddc [Jeff Reback] fix names on return structure
f41d3df [Jeff Reback] API: df.rolling(..).corr()/cov() when pairwise=True to return MI DataFrame
84e788b [Jeff Reback] BUG/PERF: handle a slice correctly in get_level_indexer

stanleyng8 · 2017-06-24T12:04:16Z

Apologies if this is not the right place to ask a question about the above change. Pairwise rolling correlation used to give a Panel. Asking for the shape gives three numbers. But with the above change, it returns a 2-dimensional DataFrame. My question is: what is the easiest way to update legacy code? For example, panel_corr.ix[:, 0, 0] and panel_corr.ix[i, :, :]. What do they look like in the DataFrame language? I could try coming up with some arithmetic to somehow translate from a 3-dimensional Panel to a 2-dimensional DataFrame. But that seems rather inefficient and error-prone. Can I turn the new multi-index DataFrame output into a 3-dimensional object so that the legacy code would work?

jreback · 2017-06-24T13:18:21Z

you should read the whatsnew note and section on how to index
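
As a rough illustration of the kind of translation asked about, the sketch below shows possible MultiIndex equivalents of the two Panel expressions. It assumes the old Panel was laid out as (time, row label, column label) and that the new result has one block of rows per time step, as in the outputs earlier in this thread; the names `matrix_i`, `element_ts` and `cube` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(
    np.random.rand(100, 2),
    columns=["A", "B"],
    index=pd.date_range("20160101", periods=100, freq="D"),
)
res = df.rolling(12).corr()
n_rows, n_cols = df.shape

# Analogue of panel_corr.ix[i, :, :]: the full matrix at position i.
i = 50
matrix_i = res.loc[df.index[i]]

# Analogue of panel_corr.ix[:, 0, 0]: one matrix element over time.
first = df.columns[0]
element_ts = res.xs(first, level=1)[first]

# A plain 3-D NumPy array indexed as [time position, row, column].
# This relies on the rows being ordered by time first, then by label.
cube = res.to_numpy().reshape(n_rows, n_cols, n_cols)
```

The `cube` array loses the labels, so it suits positional legacy code only; label-based code is probably better served by `.loc` and `.xs` on the DataFrame itself.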

jreback added the API Design, Numeric Operations, and Reshaping labels Mar 13, 2017

jreback added this to the 0.20.0 milestone Mar 13, 2017

jreback force-pushed the corr branch 3 times, most recently from a259eb5 to db9f2c0 Compare March 13, 2017 21:44

jreback force-pushed the corr branch 5 times, most recently from da60531 to 0ee6303 Compare March 22, 2017 19:41

jreback mentioned this pull request Mar 22, 2017

DEPR: Panel deprecated #15601

Closed

jreback force-pushed the corr branch from 0ee6303 to e33dadd Compare March 24, 2017 21:23

jreback force-pushed the corr branch from e33dadd to 3ee62d3 Compare March 27, 2017 23:17

jreback force-pushed the corr branch from 3ee62d3 to 12ae250 Compare March 28, 2017 15:52

jreback force-pushed the corr branch from 12ae250 to f44012e Compare March 28, 2017 16:48

jreback force-pushed the corr branch from f44012e to ff96134 Compare April 2, 2017 22:57

jreback added 3 commits April 3, 2017 17:53

BUG/PERF: handle a slice correctly in get_level_indexer

dbd1322

API: df.rolling(..).corr()/cov() when pairwise=True to return MI Data…

ed7d927

…Frame xref pandas-dev#15601

fix names on return structure

0f5092c

jreback force-pushed the corr branch from ff96134 to 0f5092c Compare April 3, 2017 21:54

jreback closed this Apr 7, 2017
