Missing last chunk in CHUNK_STORE #976

@atamkapoor

Arctic Version

1.80.5

Arctic Store

# ChunkStore

Platform and version

Python 3.8.5

Description of problem and/or code sample that reproduces the issue

I noticed that if I save a dataframe whose timestamps roll over to the next day when converted to UTC, most functions (reverse_iterator, get_chunk_ranges, get_info, ...) don't return the chunk for the new date. The following example makes this clear (Jupyter notebook attached in the zip file):

Set Up

import pandas as pd
from arctic import Arctic, CHUNK_STORE
store = Arctic("localhost")
store.initialize_library("scratch_lib", lib_type=CHUNK_STORE)

lib = store["scratch_lib"]

Create an Index with some times that will change dates when converted to UTC

ind = pd.Index([pd.Timestamp("20121208T16:00", tz="US/Eastern"), pd.Timestamp("20121208T18:00", tz="US/Eastern"), 
                pd.Timestamp("20121208T20:00", tz="US/Eastern"), pd.Timestamp("20121208T22:00", tz="US/Eastern")], name="date")
print(ind)

Output:

DatetimeIndex(['2012-12-08 16:00:00-05:00', '2012-12-08 18:00:00-05:00', '2012-12-08 20:00:00-05:00', '2012-12-08 22:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name='date', freq=None)

print(ind.tz_convert("UTC"))

Output

DatetimeIndex(['2012-12-08 21:00:00+00:00', '2012-12-08 23:00:00+00:00', '2012-12-09 01:00:00+00:00', '2012-12-09 03:00:00+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)
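The date rollover can also be checked without pandas. The following is a minimal sketch using only the standard library; it assumes US/Eastern is on standard time (UTC-5) in December 2012 and uses a fixed offset in place of the named zone. The names `EASTERN`, `local_days`, and `utc_days` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Eastern is UTC-5 in December 2012 (standard time),
# so a fixed offset stands in for the named zone.
EASTERN = timezone(timedelta(hours=-5))

# The same four timestamps as in the index above.
local_times = [datetime(2012, 12, 8, hour, tzinfo=EASTERN) for hour in (16, 18, 20, 22)]
utc_times = [t.astimezone(timezone.utc) for t in local_times]

# Distinct calendar days before and after the conversion.
local_days = sorted({t.date() for t in local_times})
utc_days = sorted({t.date() for t in utc_times})

print(local_days)  # a single calendar day in local time
print(utc_days)    # two calendar days once converted to UTC
```

All four rows fall on one local day but on two UTC days, which is why I would expect two daily chunks.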

Create dataframe, write it to the library, and read it back out

df = pd.DataFrame([1, 2, 3, 4], index=ind, columns=["col"])
lib.write("example_df", df, chunk_size="D")
df_read = lib.read("example_df")
print(df_read)

Output

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2
2012-12-09 01:00:00 3
2012-12-09 03:00:00 4

The index comes back as timezone-naive UTC timestamps instead of the US/Eastern times I wrote. Is this behavior expected?

lib.get_info("example_df")

Output

{'chunk_count': 1,
'len': 4,
'appended_rows': 0,
'metadata': {'columns': ['date', 'col']},
'chunker': 'date',
'chunk_size': 'D',
'serializer': 'FrameToArray'}

>> expected chunk_count = 2, not 1

list(lib.get_chunk_ranges("example_df"))

Output

[(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000')]

>> expected [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000'), (b'2012-12-09 00:00:00', b'2012-12-09 23:59:59.999000')]
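To spell out where the expected ranges come from, here is a rough sketch that derives one range per distinct UTC day from the timestamps that were read back. `expected_daily_ranges` is a hypothetical helper written for this report, not part of Arctic, and it may not match how the date chunker computes ranges internally.

```python
from datetime import datetime, timedelta

# Naive UTC timestamps, as they appear in the data read back from the library.
utc_index = [
    datetime(2012, 12, 8, 21), datetime(2012, 12, 8, 23),
    datetime(2012, 12, 9, 1), datetime(2012, 12, 9, 3),
]

def expected_daily_ranges(timestamps):
    """Hypothetical helper: one (start, end) pair per distinct calendar day."""
    ranges = []
    for day in sorted({t.date() for t in timestamps}):
        start = datetime(day.year, day.month, day.day)
        # End of day at millisecond resolution, matching the range format shown above.
        end = start + timedelta(days=1) - timedelta(milliseconds=1)
        ranges.append((start, end))
    return ranges

ranges = expected_daily_ranges(utc_index)
for start, end in ranges:
    print(start, end)
```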

iterator = lib.reverse_iterator("example_df")
while True:
    data = next(iterator, None)
    if data is None:
        break
    print(data)

Output

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2

>> expected the following:
date col
2012-12-09 01:00:00 3
2012-12-09 03:00:00 4

date col
2012-12-08 21:00:00 1
2012-12-08 23:00:00 2
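The ordering I expected from reverse_iterator can be sketched as grouping rows by UTC day and yielding the newest day first. `reverse_daily_chunks` is a hypothetical helper for illustration only and says nothing about how Arctic implements the iterator.

```python
from collections import defaultdict
from datetime import datetime

# (timestamp, col) rows as read back from the library, in naive UTC.
rows = [
    (datetime(2012, 12, 8, 21), 1),
    (datetime(2012, 12, 8, 23), 2),
    (datetime(2012, 12, 9, 1), 3),
    (datetime(2012, 12, 9, 3), 4),
]

def reverse_daily_chunks(rows):
    """Hypothetical helper: yield per-day groups of rows, newest day first."""
    chunks = defaultdict(list)
    for timestamp, value in rows:
        chunks[timestamp.date()].append((timestamp, value))
    for day in sorted(chunks, reverse=True):
        yield chunks[day]

chunks = list(reverse_daily_chunks(rows))
for chunk in chunks:
    print([value for _, value in chunk])
```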

arctic_issue_example.zip
