
raw_to_Xy doesn't handle gaps in data #71

Closed
gmgreg opened this issue Oct 20, 2020 · 7 comments

gmgreg commented Oct 20, 2020

raw_to_Xy appears to handle regular gaps in data (e.g. weekend days) but cannot handle irregular gaps such as holidays.

When fed trading data similar to the example at https://deepdow.readthedocs.io/en/latest/source/data_loading.html but covering an entire trading year, it gets out of sync at every holiday, e.g. a Monday that would typically trade but does not because of a holiday such as Jan 20, 2020.

The result is that the assertion assert timestamps[0] == raw_df.index[lookback] fails.

This, and likely other data-formatting issues, causes an error when executing history = run.launch(30): RuntimeError: mat1 and mat2 shapes cannot be multiplied
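A hypothetical minimal sketch (mine, not from the issue) of why the assertion can fail: pandas' 'B' frequency covers Monday through Friday only and has no notion of market holidays, so an index regenerated with freq='B' drifts from real trading data at every holiday.

```python
import pandas as pd

# Jan 20, 2020 was a Monday (MLK Day) -- US markets were closed,
# so real trading data has no row for it.
index_b = pd.date_range(start="2020-01-15", end="2020-01-24", freq="B")

# 'B' still includes the holiday, so the regenerated index has one
# more entry than the actual trading calendar.
print(pd.Timestamp("2020-01-20") in index_b)  # True
```

Once the regenerated index and the real data disagree by even one day, positional checks like timestamps[0] == raw_df.index[lookback] break.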

@jankrepl
Owner

Hey there!

Could you please share some minimal reproducible raw_df that leads to errors? I am not sure what the main problem is. Note that you can remove all potentially private information (column names, valid values, etc.).

In general, I encourage you to check the implementation of raw_to_Xy and rewrite it in a way that suits your use case.

def raw_to_Xy(raw_data, lookback=10, horizon=10, gap=0, freq='B', included_assets=None, included_indicators=None,

Additionally, check any of the end-to-end examples where the raw_to_Xy was not used and the X, y were created from scratch: https://deepdow.readthedocs.io/en/latest/auto_examples/index.html#end-to-end


gmgreg commented Oct 20, 2020

Thank you for your response; I realize my question may not have been very clear. I took a look at the implementation and noticed raw_to_Xy is calling pandas date_range with freq='B' by default (this wasn't clear to me from the documentation).

I believe I've been able to address this particular issue by using:

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())

...and then calling raw_to_Xy with that custom frequency:

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)

I'm still having issues and will provide sample data and additional information.


gmgreg commented Oct 20, 2020

You can use the below code and the attached csv file (sample_raw_df.txt; GitHub does not allow .csv attachments). You'll notice the data has 19 rows (timesteps), and if we use a 5-day lookback, 0 gap, and 1 horizon it should yield 14 windowed samples. When running it through raw_to_Xy we end up with 13.

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

from deepdow.utils import raw_to_Xy

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
# MLK holiday is around Jan 19th and results in a gap if not accounted for in a custom freq for pandas

raw_df = pd.read_csv('./sample_raw_df.txt',
                     parse_dates = ['Date'], 
                     index_col=['Date'])
raw_df = raw_df.sort_values(by=['Date', 'Ticker'])

raw_df = raw_df.pivot_table(index=['Date'],
                            columns='Ticker',
                            aggfunc='sum',
                            fill_value=0).swaplevel(axis=1).sort_index(axis=1)


assert isinstance(raw_df.columns, pd.MultiIndex)
assert isinstance(raw_df.index, pd.DatetimeIndex)

n_timesteps = len(raw_df)  # 19
n_channels = len(raw_df.columns.levels[1])  # 5
n_assets = len(raw_df.columns.levels[0])  # 2

lookback, gap, horizon = 5, 0, 1

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)

n_samples =  n_timesteps - lookback - horizon - gap + 1  # 14

print(f'Timesteps: {n_timesteps}, Samples: {n_samples}, X.shape {X.shape}')

assert timestamps[0] == raw_df.index[lookback]
assert X.shape == (n_samples, n_channels, lookback, n_assets) # X.shape: (13, 5, 5, 2), should be (14, 5, 5, 2)
assert asset_names == list(raw_df.columns.levels[0])
assert indicators == list(raw_df.columns.levels[1])

sample_raw_df.txt


jankrepl commented Oct 21, 2020

Thank you for the example!

I would guess that the thing that confused you (I blame the documentation; see #72 for a fix) is that the true value of n_samples is not always equal to len(raw_df) - lookback - horizon - gap + 1. It worked out that way in the documentation; however, if there were a different number of missing timestamps in raw_df or a different freq, it could be a totally different number.

The raw_to_Xy creates its own DateTimeIndex in the following way (see code for more details):

index = pd.date_range(start=raw_data.index[0], end=raw_data.index[-1], freq=freq)

So it does not really matter what happens between the start and the end timestamp - the new index is just generated from scratch based on the frequency and the endpoints. In your example, you changed the frequency to a custom one:

index_custom = pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq=bday_us)
index_default = pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq='B')

print(len(index_custom), len(index_default), set(index_default) - set(index_custom))
# 19 20 {Timestamp('2016-01-18 00:00:00', freq='B')}

That means that just by providing your custom frequency you will lose 1 sample with respect to the default one.

> You'll notice the data has 19 rows (timesteps) and if we use a 5 day lookback, 0 gap, and 1 horizon it should be 14 windowed samples. When running it through raw_to_Xy we end up with 13.

I think you forgot to factor in the fact that the raw_to_Xy function actually computes 1-step returns in the background, so the first time step is dropped (see code).
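A back-of-the-envelope check of the count (an assumed formula based on the explanation above, not deepdow's exact code path): once the first timestamp is dropped for the 1-step returns, the windowing arithmetic gives 13, not 14.

```python
# 19 timestamps match the custom (holiday-aware) index in the example.
n_timestamps = 19
lookback, gap, horizon = 5, 0, 1

# Taking 1-step returns consumes the first timestamp.
n_return_steps = n_timestamps - 1  # 18

# Standard sliding-window count over the remaining steps.
n_samples = n_return_steps - lookback - horizon - gap + 1
print(n_samples)  # 13, matching the observed X.shape[0]
```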


gmgreg commented Oct 21, 2020

Thanks again for the feedback. As background, I just wanted to quickly test out deepdow with a limited dataset, so I was following the getting_started.ipynb notebook and simply replacing the generated data with a sample of my own, closer to the format noted in Data Loading.

I'm used to creating windowed training datasets as is typical for LSTMs, e.g. 3D numpy arrays with (samples, lookback, features) and a matching target array y. Using a toy dataset fed to raw_to_Xy caused several assertions to fail, which I mistook as critical.

I think it may be easier to take your earlier advice and create X and y from scratch. Looking at the generated data in the end-to-end examples is a start, though it only has a single feature (channel).
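A minimal sketch of what such from-scratch windowing could look like (my own toy construction, not deepdow's implementation; the dimensions mirror the sample data above and the (n_samples, n_channels, lookback, n_assets) convention from the earlier assertion):

```python
import numpy as np

n_timesteps, n_channels, n_assets = 19, 5, 2
lookback, gap, horizon = 5, 0, 1

# Toy data standing in for per-asset, per-channel returns.
data = np.random.rand(n_timesteps, n_channels, n_assets)

n_samples = n_timesteps - lookback - gap - horizon + 1  # 14

# Each sample: lookback window as input, the horizon after the gap as target.
X = np.stack([data[i:i + lookback] for i in range(n_samples)])
y = np.stack([data[i + lookback + gap:i + lookback + gap + horizon]
              for i in range(n_samples)])

# Move the channel axis in front of the time axis.
X = X.transpose(0, 2, 1, 3)  # (n_samples, n_channels, lookback, n_assets)
y = y.transpose(0, 2, 1, 3)  # (n_samples, n_channels, horizon, n_assets)
print(X.shape, y.shape)  # (14, 5, 5, 2) (14, 5, 1, 2)
```

Note this counts 14 samples because it windows the raw values directly; computing 1-step returns first would drop one timestamp and give 13.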

At this point I've still not been able to get a toy dataset successfully trained (currently seeing RuntimeError: mat1 and mat2 shapes cannot be multiplied, no doubt due to something wrong in the dataset I'm loading).

Thanks for your patience.


gmgreg commented Oct 22, 2020

After more experimentation, the relationship between the dataset shape and the network is now clearer. I had assumed the dataset and network were generic, but now I see that different networks expect different dataset shapes (e.g. number of channels). I had assumed the errors I was seeing when attempting to train were due to something in my dataset construction; in actuality it was a mismatch between what the network was expecting (e.g. 1 channel or multiple channels) and what I was feeding it.

gmgreg closed this as completed Oct 22, 2020
@jankrepl
Owner

Well, I hope you managed to do what you wanted! Feel free to ask any other questions at any point!

Cheers!
