pd.concat loses frequency attribute for 'continuous' DataFrame appends #3232
I have a main DataFrame of time series data that I periodically extend with new update DataFrames. These new update DataFrames are of the same frequency and contain data that is 'continuous' in time (i.e., they pick up right where the last timestamp left off). Ultimately I would like to append this new data to the existing DataFrame while preserving the main DataFrame's frequency attribute. I tried using pd.concat, but the frequency is lost.
```python
import pandas as pd
import numpy as np

dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)
df.index
```

```
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: 50L, Timezone: UTC
```
These guys look good:
```python
# Preserves frequency
print(df[:50].index)
print(df[50:].index)
```

```
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:02.450000]
Length: 50, Freq: 50L, Timezone: UTC
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:02.500000, ..., 2013-01-01 00:00:04.950000]
Length: 50, Freq: 50L, Timezone: UTC
```
However, these guys, together, forget where they came from:
```python
# Loses frequency
pd.concat([df[:50], df[50:]]).index
```

```
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: None, Timezone: UTC
```
I currently get around this by manually re-setting the frequency on the resulting index after the concat.
@nehalecky as an aside, this is a good case for appending your data with an HDFStore.
Thanks @jreback for the note. I do use HDFStore when I persist data to my local machine, and it works great for that, however, the application I am referring to above is persisting data to a remote store, which will eventually have to scale horizontally. For both those reasons, the use of hdf5 isn't an option, unfortunately. :(
Actually, for preprocessing analysis, I am storing the raw record data as heavily compressed hdf5 binary in the db now (via pandas HDFStore). This allows me to retrieve individual records and load them directly to DataFrame, tying directly into my analysis stack, which is nice. I am really looking forward to whatever solutions are implemented for binary storage of data frame (#686 and all), but this is how I'm rolling for now. ;)
BTW, this writeup (http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables) is awesomeness, and your SO answer for pandas workflow is off the charts (http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas), thank you!
glad to hear the docs are useful!
Here's another resource: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore
If you have the time, http://msgpack.org is probably a reasonable format (though it doesn't support compression directly); I think it's a good choice, probably simple to implement, and storable in a db.
In the general case concat can join any two index types, or indices that are not contiguous, so the frequency can't be carried over automatically.
Another angle would rely on the fact that although concat drops the frequency, the timestamps themselves are unchanged, so the frequency can be re-set or inferred on the result afterwards.
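To illustrate why the general case is hard, a small sketch (the dates here are illustrative): when the concatenated pieces are not contiguous, no single frequency describes the result, and re-setting one fails pandas' conformance check:

```python
import pandas as pd

a = pd.date_range('2013-01-01', periods=10, freq='50ms', tz='UTC')
b = pd.date_range('2013-01-02', periods=10, freq='50ms', tz='UTC')  # one-day gap

idx = a.append(b)        # concatenating the two indices drops the freq
assert idx.freq is None

# The combined timestamps don't form a single regular 50ms range,
# so pandas refuses to accept the frequency:
try:
    idx.freq = '50ms'
    conforming = True
except ValueError:
    conforming = False
```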
Hey @y-p, thanks for the tips.
Agreed it's a very special case, and right now it isn't a major performance issue; however, when we begin to scale, it could be. In the meantime, I'll try to implement your suggestions and keep you posted on how this performs when things begin to get bigger. :)
Still an issue in 0.23.4.
So setting the frequency is about 13k times faster than resample and about 1.6k times faster than reindexing. If it is not known whether the indices are contiguous, I'd thus go with reindex. Any opinions/advice on this?
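For reference, the three approaches being compared can be sketched like this on toy data similar to the issue's example (the `'ms'` alias is used in place of `'L'`; the timing numbers above are not reproduced here, only the mechanics):

```python
import numpy as np
import pandas as pd

dr = pd.date_range('2013-01-01', periods=1000, freq='50ms', tz='UTC')
df = pd.DataFrame(np.random.randn(1000, 2), index=dr)
parts = [df[:500], df[500:]]

# 1) Set the frequency directly: cheapest, but raises ValueError
#    if the pieces turn out not to be contiguous.
direct = pd.concat(parts)
direct.index.freq = '50ms'

# 2) Resample: rebuilds the time grid; safe but much slower.
resampled = pd.concat(parts).resample('50ms').first()

# 3) Reindex against a freshly built DatetimeIndex: also safe with
#    gaps (missing slots become NaN) and the result carries the freq.
cat = pd.concat(parts)
grid = pd.date_range(cat.index[0], cat.index[-1], freq='50ms')
reindexed = cat.reindex(grid)
```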