Retrieving large frames with sparse data from hdf5 - 'NoneType' object is not iterable error #2299

@Vankisa

I have been using pandas in my scripts for some time now, mainly to store large data sets in an easily accessible way. I stumbled upon this problem a couple of days ago and have not been able to solve it so far.

The problem is that after I store a huge data frame in an HDF5 file and later load it back, one or more columns (only object-type columns) are sometimes completely inaccessible, and accessing them raises the "'NoneType' object is not iterable" error.

As long as I use the frame in memory there are no problems, even with data sets moderately larger than the example below. It is worth mentioning that the frame contains either multiple datetime columns or multiple VMS timestamps (http://labs.hoffmanlabs.com/node/735), as well as string, char and integer columns. All non-object columns can and do have missing values.
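For readers unfamiliar with VMS timestamps, here is a minimal sketch of the conversion in both directions, assuming the convention used in the reproduction script below: 100-nanosecond ticks counted from 17 November 1858. The helper names `to_vms` and `from_vms` are hypothetical and exist only for this illustration.

```python
import datetime

# VMS base date (17-Nov-1858), the same start date the reproduction script uses.
VMS_EPOCH = datetime.datetime(1858, 11, 17)
TICKS_PER_SECOND = 10000000  # one tick is 100 ns


def to_vms(moment):
    """Convert a naive datetime to VMS ticks (hypothetical helper)."""
    delta = moment - VMS_EPOCH
    # Integer arithmetic avoids float rounding at this magnitude.
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def from_vms(ticks):
    """Convert VMS ticks back to a naive datetime; the sub-microsecond part is dropped."""
    return VMS_EPOCH + datetime.timedelta(microseconds=int(ticks) // 10)


stamp = datetime.datetime(2012, 11, 20, 12, 30, 15, 250000)
ticks = to_vms(stamp)
print(ticks, from_vms(ticks))
```

The script below instead keeps the value as a float (`total_seconds() * 10000000`) and adds random noise to it, which is why those columns end up as float columns rather than integers.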

At first I thought the cause was that I was saving 'NA' values in one of the object-type columns. Then I updated to the latest pandas version (0.9.1); I was using 0.9.0 when the problem first occurred. Neither seems to be the solution.
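One way to check the 'NA' hypothesis before storing a frame is to list each column's dtype next to its count of missing values. The following is only a sketch on a small stand-in frame; `missing_by_column` is a hypothetical helper, not part of pandas, and the exact dtypes reported depend on the pandas version.

```python
import pandas as pd

# Small stand-in for the sparse frame: None appears in string and numeric columns.
frame = pd.DataFrame(
    [
        ("A", "XXXXXXXXXX", 1.0, 5),
        ("B", None, None, 7),
        (None, "XXXXX", 3.0, None),
    ],
    columns=["column_1", "column_2", "column_3", "column_4"],
)


def missing_by_column(df):
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "missing": df.isna().sum()}
    )


report = missing_by_column(frame)
print(report)
```

Columns that show a non-zero count would be the first candidates to inspect after a round trip through the store.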

I have been able to reproduce the error with the following code:

import pandas as pd
import numpy as np
import datetime

# Get the current time as a VMS timestamp (100 ns ticks since 17-Nov-1858)
time_now = datetime.datetime.today()
start_vms = datetime.datetime(1858, 11, 17)
t_delta = (time_now - start_vms)
vms_time = t_delta.total_seconds() * 10000000

# Generate Test Frame (dense)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
for i in range(2000000):
    vms_time1 += 15 * np.random.randn()
    vms_time2 += 25 * np.random.randn()
    vms_time_diff = vms_time2 - vms_time1
    string1 = 'XXXXXXXXXX'
    string2 = 'XXXXXXXXXX'
    string3 = 'XXXXX'
    string4 = 'XXXXX'
    char1 = 'A'
    char2 = 'B'
    char3 = 'C'
    char4 = 'D'
    number1 = np.random.randint(1,10)
    number2 = np.random.randint(1,100)
    number3 = np.random.randint(1,1000)
    test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))

df = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

# Generate Test Frame (sparse)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
count = 0
for i in range(2000000):
    if (count%23 == 0):
        vms_time1 += 15 * np.random.randn()
        string1 = 'XXXXXXXXXX'
        string2 = ' '
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = None
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, None, None, number2, char3, string3, None, number3, None, string4))
    else:
        vms_time1 += 15 * np.random.randn()
        vms_time2 += 25 * np.random.randn()
        vms_time_diff = vms_time2 - vms_time1
        string1 = 'XXXXXXXXXX'
        string2 = 'XXXXXXXXXX'
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = np.random.randint(1,10)
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
    count += 1

df1 = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

# Note: the store keys do not match the variable names
store_loc = "foo.h5"
h5_store = pd.HDFStore(store_loc)
h5_store['df1'] = df   # dense frame
h5_store['df2'] = df1  # sparse frame
h5_store.close()

When I now load from this store, 'df1' (the dense frame) behaves normally, but 'df2' (the sparse frame) produces the following error:

TypeError: 'NoneType' object is not iterable
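To narrow down which columns are affected after loading, each column can be probed separately. This is a rough sketch: `find_unreadable_columns` is a hypothetical helper, and which operation actually triggers the error is an assumption, so every column is both converted to a list and rendered with `repr()`.

```python
import pandas as pd


def find_unreadable_columns(frame):
    """Map column name -> error message for columns that raise TypeError on access."""
    failures = {}
    for name in frame.columns:
        try:
            column = frame[name]
            column.tolist()  # materialise every value
            repr(column)     # rendering is another common trigger
        except TypeError as exc:
            failures[name] = str(exc)
    return failures


# An in-memory frame with missing values is expected to pass cleanly.
healthy = pd.DataFrame({"column_1": ["A", None], "column_4": [1.0, None]})
print(find_unreadable_columns(healthy))
```

Running the same function on the frame read back from the store (the 'df2' key above) should return the names of the broken columns, if the error is raised by one of the probed operations.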

Additionally, I just tried to reproduce this error on pandas 0.8.1, and it does not seem to be present there. So is it perhaps connected with the I/O changes introduced in 0.9.0?
