Rolling Pearson correlation counterintuitive #248

SmokinCaterpillar · 2020-08-16T08:40:10Z

Hey, first and foremost, dtale is a great and super useful application. Thanks a lot for this nice tool!

I have two suggestions for improvements:

To my mind the temporal correlation plot behaves quite unintuitively and too much magic happens in the background.
The documentation says:

"When the data being viewed in D-Tale has date or timestamp columns but for each date/timestamp value there is only one row of data the behavior of the Correlations popup is a little different. Instead of a timeseries correlation chart the user is given a rolling correlation chart which can have the window (default: 10) altered."

To me having rolling windows is almost always the desired behavior. But you cannot have that in case by any chance two data points share the same timestamp (my data sets have this quite often). So you cannot have any rolling analysis unless you filter duplicates before using the correlation tab.
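For reference, the duplicate filtering mentioned above could look roughly like the following. This is only a sketch on made-up data; the column names mirror the placeholders used in the exported code further down, and whether to keep the last row per timestamp or to average the rows is an assumption that depends on the data.

```python
import pandas as pd

# Made-up example frame: the first timestamp occurs twice.
df = pd.DataFrame({
    "date_col": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"]),
    "some_col": [1.0, 3.0, 2.0, 4.0],
    "some_other_col": [10.0, 30.0, 20.0, 40.0],
})

# Option 1: keep only the last row for each timestamp.
deduped = df.drop_duplicates(subset="date_col", keep="last")

# Option 2: average all rows that share a timestamp.
averaged = df.groupby("date_col", as_index=False).mean()

# Either frame has one row per timestamp and could then be passed to dtale.show.
```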

Proposal: Make the rolling view always the default behavior and add a toggle to switch between rolling windows and the grouping by date behavior. Now dtale's behavior no longer depends implicitly on the dataframe data, but on the user's selection in the dtale frontend.

Moreover, the rolling behavior is hard to grasp as well. In case you export the code, you get:

# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'

import pandas as pd

if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

corr_ts = df[['date_col', 'some_col', 'some_other_col']].set_index('date_col')
corr_ts = corr_ts[['some_col', 'some_other_col']].rolling(3).corr().reset_index()
corr_ts = corr_ts.dropna()
corr_ts = corr_ts[corr_ts['level_1'] == 'some_col'][['date_col', 'some_other_col']]
corr_ts.columns = ['date', 'corr']

The date column is set as the index, however the rolling function is not taken over a time interval, but simply over the last n data points. This produces weird behavior if the data is not sampled at regular intervals. In this case correlation over time is really misleading. Moreover, since no min_periods is set for pandas' rolling function, the whole correlation analysis breaks down once you have NaN values in your data and you increase the rolling window, as then the correlations become all NaN for increasing window sizes.
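Both effects can be reproduced outside of D-Tale with a few lines of pandas. The snippet below is a minimal sketch on made-up data (the timestamps, values and window sizes are assumptions chosen for illustration). The first part contrasts an integer window with an offset window on irregularly sampled data; the second part shows how the default min_periods turns every rolling correlation into NaN once each window contains a missing value.

```python
import numpy as np
import pandas as pd

# Irregularly sampled timestamps: three consecutive days, then long gaps.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-03-01", "2020-03-02", "2020-06-01"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Integer window: always the last 3 rows, however far apart they are in time.
count_based = s.rolling(3).sum()

# Offset window: only the rows that fall within the trailing 3 days.
time_based = s.rolling("3D").sum()

# Time actually covered by each 3-row window (current row minus the row 2 before).
span_of_count_window = pd.Series(idx, index=idx).diff(2)

# Second part: a missing value every 4th row, so every 5-row window has a NaN.
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=20))
b = pd.Series(rng.normal(size=20))
a.iloc[::4] = np.nan

# For an integer window, min_periods defaults to the window size.
strict = a.rolling(5).corr(b)

# Allowing fewer valid observations per window keeps the correlation defined.
lenient = a.rolling(5, min_periods=3).corr(b)
```

In this example the 3-row window at 2020-03-01 reaches back to early January, while the "3D" window at the same timestamp contains only that single row. In the second part, strict is NaN everywhere, whereas lenient yields values as soon as three valid pairs are available.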

Proposal: Always take rolling windows over time intervals and not over n data points. Let the user choose the length t and the unit of the time interval (e.g. days, seconds, hours, etc.). Let the user specify the min_periods of the pandas rolling function.

Eager to hear your opinion about these two suggestions.
Best,
Robert


aschonfeld · 2020-08-24T02:13:39Z

@SmokinCaterpillar Thanks so much for your feedback, this is great! The correlations functionality was one of the first pieces I added to this tool and it was one of those things where it fit the type of data we were using (universes of securities over time => multiple values for each date) real well and we didn't stray from that formula so I was a little near-sighted in how I put it together.

I'm currently finishing up implementing treemaps and I'll be working on these changes next. The only thing about them that might be a little complex is specifying time intervals. So what I will do is, in addition to the current functionality of # of values, add the ability to specify a time interval string supported by pandas.

Hope this is fine for you 🙏 thanks again

aschonfeld · 2020-08-25T17:45:56Z

@SmokinCaterpillar quick question on making the "rolling" correlations available to datasets where there are multiple data points for each date. Just trying to figure out how one can show the output in a chart. For example I ran the following code:

def test_data(rows, columns, no_of_dates=364):
    import pandas as pd
    import numpy as np
    import random
    from past.utils import old_div
    from pandas.tseries.offsets import Day
    from dtale.utils import dict_merge
    import string
    import datetime
    now = pd.Timestamp(pd.Timestamp('now').date())
    dates = pd.date_range(now - Day(no_of_dates), now)
    num_of_securities = max(old_div(rows, len(dates)), 1)  # always have at least one security
    securities = [
        dict(security_id=100000 + sec_id, int_val=random.randint(1, 100000000000),
             str_val=random.choice(string.ascii_letters) * 5)
        for sec_id in range(num_of_securities)
    ]
    data = pd.concat([
        pd.DataFrame([dict_merge(dict(date=date), sd) for sd in securities])
        for date in dates
    ], ignore_index=True)[['date', 'security_id', 'int_val', 'str_val']]
    col_names = ['Col{}'.format(c) for c in range(columns)]
    data = pd.concat([data, pd.DataFrame(np.random.randn(len(data), columns), columns=col_names)], axis=1)
    data.loc[data['security_id'] == 100000, 'str_val'] = np.nan
    data.loc[:, 'bool_val'] = data.index % 2 == 0
    data.loc[:, 'category_val'] = data['str_val'].astype('category')

    def pp(start, end, n):
        start_u = start.value//10**9
        end_u = end.value//10**9
        return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

    data.loc[:, 'ts_val'] = pp(pd.Timestamp('19600101'), pd.Timestamp('20500101'), len(data))
    data.loc[:, 'timedelta_val'] = [datetime.timedelta(seconds=random.randint(1, 100000000)) for _ in range(len(data))]
    return data

df = test_data(10000, 5)
corrs = df.groupby('date')[['Col0', 'Col1']].rolling(10).corr(method='pearson')
corrs.groupby(level=[0,1]).last()['Col0']

But that produces the following output:

date            
2019-08-27  0            NaN
            1            NaN
            2            NaN
            3            NaN
            4            NaN
                      ...   
2020-08-25  9850    0.060373
            9851    0.204974
            9852    0.265396
            9853   -0.113846
            9854   -0.148245
Name: Col0, Length: 9855, dtype: float64

For a timeseries chart I need one data point per day. The only other option would be to have a timeseries chart with multiple lines. One for each point in the rolling window...

Let me know if you have any thoughts, thanks.

SmokinCaterpillar · 2020-08-26T14:59:26Z

I just realized that df.set_index('date').rolling('10D').corr() (note the 10D window) does not create a pearson correlation coefficient over 10 days, but simply seems to be broken :(. So probably there's no way to get around df.set_index('date').rolling(10).corr() and simply look at the last n samples. In this case you end up having the problem with multiple datapoints per date.

I guess in this case just take some aggregation, e.g. the mean.

So I would change the second part of my proposal to:
Always take rolling windows over the last n data points. Let the user choose the number of data points n. Also let the user specify the min_periods of the pandas rolling function. In case multiple points per date exist, simply aggregate by taking the mean correlation.
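A rough sketch of this revised proposal, under stated assumptions: the helper name rolling_corr_by_date, its parameters and the demo data are hypothetical, and this is not necessarily how D-Tale ends up implementing it. It computes the rolling correlation over the last n rows and then averages the values that share a date, so a chart gets one point per date.

```python
import numpy as np
import pandas as pd


def rolling_corr_by_date(df, date_col, col_a, col_b, window=10, min_periods=None):
    """Hypothetical helper: count-based rolling correlation, averaged per date."""
    # Stable sort so rows sharing a date keep their original relative order.
    ordered = df.sort_values(date_col, kind="mergesort")
    # Rolling Pearson correlation over the last `window` rows.
    corr = ordered[col_a].rolling(window, min_periods=min_periods).corr(ordered[col_b])
    # Collapse the rolling values of each date into a single mean value.
    return corr.groupby(ordered[date_col]).mean()


# Made-up demo data: 5 dates with 4 rows each.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "date_col": pd.date_range("2020-01-01", periods=5).repeat(4),
    "some_col": rng.normal(size=20),
    "some_other_col": rng.normal(size=20),
})

per_date = rolling_corr_by_date(
    demo, "date_col", "some_col", "some_other_col", window=3, min_periods=2
)
```

The result is a Series indexed by date. Since mean skips NaN by default, a date only ends up as NaN if none of its rows has a defined rolling correlation.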

aschonfeld · 2020-08-26T16:29:07Z

Awesome! I can definitely set that up. Glad I wasn’t going crazy with the groupby-rolling-corr issues 🙂

aschonfeld · 2020-09-03T01:43:29Z

@SmokinCaterpillar Sorry it's taken so long, I've been moving. Here's a demo of what I've got. Let me know what you think and I'll put together a release soon.

SmokinCaterpillar · 2020-09-03T08:11:22Z

Wow, this is nice, great, thanks!

aschonfeld · 2020-09-04T13:34:17Z

Added in v1.15.2

SmokinCaterpillar changed the title ~~Rolling Pearson Correlation very counterintuitive~~ Rolling Pearson correlation counterintuitive Aug 16, 2020

aschonfeld added a commit that referenced this issue Sep 3, 2020

#248: Correlations updates

09b3e48

aschonfeld added a commit that referenced this issue Sep 4, 2020

#248: Correlations updates

957aea3

aschonfeld added a commit that referenced this issue Sep 4, 2020

#248: Correlations updates

d9fb720

aschonfeld closed this as completed Sep 4, 2020
