Rolling Pearson correlation counterintuitive #248

Closed
SmokinCaterpillar opened this issue Aug 16, 2020 · 7 comments

@SmokinCaterpillar

Hey, first and foremost, dtale is a great and super useful application. Thanks a lot for this nice tool!

I have two suggestions for improvements:

To my mind, the temporal correlation plot behaves quite unintuitively, and too much magic happens in the background.
The documentation says:

"When the data being viewed in D-Tale has date or timestamp columns but for each date/timestamp vlaue there is only one row of data the behavior of the Correlations popup is a little different. Instead of a timeseries correlation chart the user is given a rolling correlation chart which can have the window (default: 10) altered."

To me, having rolling windows is almost always the desired behavior. But you cannot get them if any two data points happen to share the same timestamp (which my data sets do quite often). So you cannot run any rolling analysis unless you filter out duplicates before using the correlation tab.

Proposal: Make the rolling view the default behavior and add a toggle to switch between rolling windows and the group-by-date behavior. That way, dtale's behavior no longer depends implicitly on the dataframe data, but on the user's selection in the dtale frontend.

Moreover, the rolling behavior is hard to grasp as well. In case you export the code, you get:

# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'

import pandas as pd

if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

corr_ts = df[['date_col', 'some_col', 'some_other_col']].set_index('date_col')
corr_ts = corr_ts[['some_col', 'some_other_col']].rolling(3).corr().reset_index()
corr_ts = corr_ts.dropna()
corr_ts = corr_ts[corr_ts['level_1'] == 'some_col'][['date_col', 'some_other_col']]
corr_ts.columns = ['date', 'corr']

The date column is set as the index; however, the rolling function is not taken over a time interval, but simply over the last n data points. This produces weird behavior if the data is not sampled at regular intervals, and in that case the correlation over time is really misleading. Moreover, since no min_periods is set for pandas' rolling function, the whole correlation analysis breaks down once you have NaN values in your data and you increase the rolling window: the larger the window, the more of the correlations become NaN.
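
To make the min_periods point concrete, here is a minimal sketch on hypothetical toy data (the column names `a` and `b` and their values are made up for illustration and are not taken from the exported D-Tale code). It compares pandas' default for integer windows, where min_periods equals the window size, with a relaxed setting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two perfectly correlated columns, one with a single gap.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
})

# Default: min_periods equals the window size, so every window that
# touches the NaN row produces NaN.
strict = df["a"].rolling(4).corr(df["b"])

# Relaxed: a window only needs 2 valid pairs to produce a value.
relaxed = df["a"].rolling(4, min_periods=2).corr(df["b"])

print(pd.DataFrame({"strict": strict, "relaxed": relaxed}))
```

With the default setting, a single missing value blanks out every window that contains it, so larger windows lose more rows. Lowering min_periods trades that for correlations computed on fewer points.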

Proposal: Always take rolling windows over time intervals and not over n data points. Let the user choose the length t and the unit of the time interval (e.g. days, seconds, hours, etc.). Let the user specify the min_periods of the pandas rolling function.
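
The difference between the two window types can be sketched as follows. This is only an illustration on a hypothetical, irregularly sampled series; it uses `count()` rather than `corr()` so that it shows nothing more than which observations fall into each window.

```python
import pandas as pd

# Hypothetical irregular sampling: three consecutive days, then a one-month gap.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-03", "2020-02-04"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Count-based window: always the last 3 rows, no matter how old they are.
by_count = s.rolling(3, min_periods=1).count()

# Time-based window: only rows within the last 3 days of each timestamp.
by_time = s.rolling("3D").count()

print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```

After the gap, the count-based window still mixes in January observations, while the time-based window starts over with only the February ones.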

Eager to hear your opinion about these two suggestions.
Best,
Robert

@SmokinCaterpillar changed the title from "Rolling Pearson Correlation very counterintuitive" to "Rolling Pearson correlation counterintuitive" on Aug 16, 2020

@aschonfeld
Collaborator

@SmokinCaterpillar Thanks so much for your feedback, this is great! The correlations functionality was one of the first pieces I added to this tool. It fit the type of data we were using (universes of securities over time => multiple values for each date) really well, and we didn't stray from that formula, so I was a little near-sighted in how I put it together.

I'm currently finishing up implementing treemaps and I'll be working on these changes next. The only thing about them that might be a little complex is specifying time intervals. So what I will do is, in addition to the current functionality of # of values, add the ability to specify a time interval string supported by pandas.
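
As a rough sketch of how a single input could accept either a number of values or a pandas time-interval string, something like the following might work. The helper name `parse_window` is hypothetical and this is not D-Tale's actual implementation; it only relies on `pandas.tseries.frequencies.to_offset` to check that pandas can parse the string.

```python
from pandas.tseries.frequencies import to_offset


def parse_window(value):
    """Hypothetical helper: return an int for count-based windows,
    or the offset string if pandas can parse it."""
    text = str(value).strip()
    if text.isdigit():
        return int(text)
    to_offset(text)  # raises ValueError for strings pandas cannot parse
    return text


print(parse_window("10"))   # count-based window
print(parse_window("10D"))  # time-based window
```

Note that passing this check does not guarantee `rolling` will accept the string: time-based windows need a fixed-length offset (such as days, hours or seconds) and a datetime-like index.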

Hope this is fine for you 🙏 thanks again

@aschonfeld
Collaborator

aschonfeld commented Aug 25, 2020

@SmokinCaterpillar quick question on making the "rolling" correlations available to datasets where there are multiple data points for each date. I'm just trying to figure out how one can show the output in a chart. For example, I ran the following code:

def test_data(rows, columns, no_of_dates=364):
    import pandas as pd
    import numpy as np
    import random
    from past.utils import old_div
    from pandas.tseries.offsets import Day
    from dtale.utils import dict_merge
    import string
    import datetime
    now = pd.Timestamp(pd.Timestamp('now').date())
    dates = pd.date_range(now - Day(no_of_dates), now)
    num_of_securities = max(old_div(rows, len(dates)), 1)  # always have at least one security
    securities = [
        dict(security_id=100000 + sec_id, int_val=random.randint(1, 100000000000),
             str_val=random.choice(string.ascii_letters) * 5)
        for sec_id in range(num_of_securities)
    ]
    data = pd.concat([
        pd.DataFrame([dict_merge(dict(date=date), sd) for sd in securities])
        for date in dates
    ], ignore_index=True)[['date', 'security_id', 'int_val', 'str_val']]
    col_names = ['Col{}'.format(c) for c in range(columns)]
    data = pd.concat([data, pd.DataFrame(np.random.randn(len(data), columns), columns=col_names)], axis=1)
    data.loc[data['security_id'] == 100000, 'str_val'] = np.nan
    data.loc[:, 'bool_val'] = data.index % 2 == 0
    data.loc[:, 'category_val'] = data['str_val'].astype('category')

    def pp(start, end, n):
        start_u = start.value//10**9
        end_u = end.value//10**9
        return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

    data.loc[:, 'ts_val'] = pp(pd.Timestamp('19600101'), pd.Timestamp('20500101'), len(data))
    data.loc[:, 'timedelta_val'] = [datetime.timedelta(seconds=random.randint(1, 100000000)) for _ in range(len(data))]
    return data

df = test_data(10000, 5)
corrs = df.groupby('date')[['Col0', 'Col1']].rolling(10).corr(method='pearson')
corrs.groupby(level=[0,1]).last()['Col0']

But that produces the following output:

date            
2019-08-27  0            NaN
            1            NaN
            2            NaN
            3            NaN
            4            NaN
                      ...   
2020-08-25  9850    0.060373
            9851    0.204974
            9852    0.265396
            9853   -0.113846
            9854   -0.148245
Name: Col0, Length: 9855, dtype: float64

For a timeseries chart I need one data point per day. The only other option would be to have a timeseries chart with multiple lines. One for each point in the rolling window...

Let me know if you have any thoughts, thanks.

@SmokinCaterpillar
Author

I just realized that df.set_index('date').rolling('10D').corr() (note the 10D window) does not create a pearson correlation coefficient over 10 days, but simply seems to be broken :(. So probably there's no way to get around df.set_index('date').rolling(10).corr() and simply look at the last n samples. In this case you end up having the problem with multiple datapoints per date.

I guess in this case just take some aggregation, e.g. mean for example.

So I would change the second part of my proposal to:
Always take rolling windows over the last n data points. Let the user choose the number of data points n. Also let the user specify the min_periods of the pandas rolling function. In case multiple points per date exist, simply aggregate by taking the mean correlation.
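
A minimal sketch of this revised proposal, assuming a frame with a `date` column and two numeric columns, could look like the following. The function name `rolling_corr_by_date` and the random test data are hypothetical, and this is not necessarily how D-Tale ended up implementing it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 dates with 4 rows each, similar in shape to the test data above.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-08-01", periods=5, freq="D").repeat(4)
df = pd.DataFrame({
    "date": dates,
    "Col0": rng.standard_normal(len(dates)),
    "Col1": rng.standard_normal(len(dates)),
})


def rolling_corr_by_date(data, x, y, window=10, min_periods=None):
    """Rolling Pearson correlation over the last `window` rows,
    averaged per date so that each date yields one chart point."""
    data = data.sort_values("date", kind="stable")
    corr = data[x].rolling(window, min_periods=min_periods).corr(data[y])
    # mean() skips NaN, so dates with at least one valid window keep a value
    return corr.groupby(data["date"]).mean().dropna().rename("corr")


result = rolling_corr_by_date(df, "Col0", "Col1", window=6, min_periods=3)
print(result)
```

Averaging per date is only one possible aggregation; taking the last value per date would be another option if the most recent window matters most.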

@aschonfeld
Collaborator

Awesome! I can definitely set that up. Glad I wasn’t going crazy with the groupby-rolling-corr issues 🙂

@aschonfeld
Collaborator

@SmokinCaterpillar Sorry it's taken so long, I've been moving. Here's a demo of what I've got. Let me know what you think and I'll put together a release soon.

aschonfeld added a commit that referenced this issue Sep 3, 2020
@SmokinCaterpillar
Author

Wow, this is nice, great, thanks!

aschonfeld added a commit that referenced this issue Sep 4, 2020
aschonfeld added a commit that referenced this issue Sep 4, 2020
@aschonfeld
Collaborator

Added in v1.15.2
