read_sas with chunksize/iterator raises ValueError #14734

Closed
pijucha opened this Issue Nov 25, 2016 · 9 comments

pijucha commented Nov 25, 2016

read_sas doesn't work well with chunksize or iterator parameters.

Code Sample and Problem Description

The following test data file in the repository has 32 rows.

import pandas as pd

sasfile = 'pandas/io/tests/sas/data/airline.sas7bdat'
pd.read_sas(sasfile).shape
Out[18]: (32, 6)

When we carefully read the file with chunksize/iterator, all's well:

reader = pd.read_sas(sasfile, chunksize=16)
df = reader.read()
df.shape
Out[31]: (16, 6)
df = reader.read()
df.shape
Out[33]: (16, 6)

or

reader = pd.read_sas(sasfile, iterator=True)
df = reader.read(30)
df.shape
Out[37]: (30, 6)
df = reader.read(2)
df.shape
Out[39]: (2, 6)
df = reader.read(2)
type(df)
Out[41]: NoneType

But if we don't know the length of the data, we easily run into an exception and can't read the whole file, which is painful with large files.

reader = pd.read_sas(sasfile, chunksize=20)
df = reader.read()
df.shape
Out[45]: (20, 6)
df = reader.read()
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-46-c5d811b93ac1>", line 1, in <module>
    df = reader.read()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

or

reader = pd.read_sas(sasfile, iterator=True)
reader.read(30).shape
Out[51]: (30, 6)
reader.read(30).shape
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-52-5d757f713808>", line 1, in <module>
    reader.read(30).shape
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
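
On affected versions, one possible workaround is to never request more rows than remain, since the examples above show that reading exactly the remaining rows succeeds. A minimal sketch, assuming the reader exposes the total row count (the sas7bdat reader's `row_count` attribute, mentioned later in this thread); `read_in_capped_chunks` and `read_sas_capped` are hypothetical helper names, not part of pandas:

```python
def read_in_capped_chunks(reader, total_rows, chunksize):
    """Yield chunks from reader.read(n), never asking for more rows than remain."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    rows_read = 0
    while rows_read < total_rows:
        # Cap the request so the last chunk asks only for the remaining rows.
        n = min(chunksize, total_rows - rows_read)
        chunk = reader.read(n)
        if chunk is None or len(chunk) == 0:
            break  # the reader ended earlier than total_rows suggested
        rows_read += len(chunk)
        yield chunk


def read_sas_capped(path, chunksize):
    """Read a whole SAS file chunk by chunk (hypothetical helper)."""
    import pandas as pd  # imported lazily; the generator above needs no pandas

    reader = pd.read_sas(path, iterator=True)
    # Assumption: the reader object exposes `row_count`.
    chunks = list(read_in_capped_chunks(reader, reader.row_count, chunksize))
    return pd.concat(chunks)
```

With the 32-row test file and `chunksize=20`, this would request 20 rows and then 12, instead of 20 twice.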

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: 75b606a
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)i5-2520M_CPU@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+112.g75b606a
nose: 1.3.7
pip: 9.0.1
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0

@kshedden kshedden referenced this issue Nov 25, 2016

Merged

SAS chunksize / iteration issues #14743

3 of 3 tasks complete

kshedden commented Nov 25, 2016

@pijucha, thanks for the report. I've PR'd a possible fix.


pijucha commented Nov 25, 2016

@kshedden Yes, this should be it. I see you probably also solved #13654. Very nice. Thanks.

@jreback jreback added this to the 0.19.2 milestone Nov 25, 2016

jorisvandenbossche added a commit that referenced this issue Nov 28, 2016

jorisvandenbossche added a commit that referenced this issue Dec 15, 2016

boulund commented May 2, 2017

Is this issue solved? I just got this trying to iterate through a large sas7bdat file (using pandas 0.19.2 via conda)

Traceback (most recent call last):                                                                                   
  File "./extract_subset_of_columns.py", line 35, in <module>                                                        
    extract_columns_from_sas(lmed_file, columns=["lpnr", "KON", "atc", "EDATUM"], output_csv=lmed_file+".csv")       
  File "./extract_subset_of_columns.py", line 25, in extract_columns_from_sas                                        
    for count, chunk in enumerate(reader, start=1):                                                                  
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 229, in __next__           
    da = self.read(nrows=self.chunksize or 1)                                                                        
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read               
    rslt = self._chunk_to_dataframe()                                                                                
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
    rslt[name] = self._string_chunk[js, :]                                                                           
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__            
    self._set_item(key, value)                                                                                       
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item              
    value = self._sanitize_column(key, value)                                                                        
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column       
    value = _sanitize_index(value, self.index, copy=False)                                                           
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index       
    raise ValueError('Length of values does not match length of ' 'index')                                           
ValueError: Length of values does not match length of index                                                          

The file is 27 GB, and I was iterating with chunksize=100000. It failed approximately 70% of the way through the file.
column_count is 49, row_count is reported as 89065305.

Is this related to the error referenced in this issue?
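
To pin down where in a large file such a failure happens, one option is to track how many rows were read before the exception. A minimal sketch, assuming the reader is iterable and yields chunks that support `len()`; `chunks_with_offsets` is a hypothetical helper, not part of pandas:

```python
def chunks_with_offsets(reader):
    """Yield (start_row, chunk) pairs; on ValueError, report rows read so far."""
    rows_read = 0
    it = iter(reader)
    while True:
        try:
            chunk = next(it)
        except StopIteration:
            return  # normal end of the file
        except ValueError as exc:
            # Keep the original error as the cause, but add the row offset.
            raise RuntimeError(
                "reader failed after {} rows".format(rows_read)
            ) from exc
        yield rows_read, chunk
        rows_read += len(chunk)
```

Used as `for start, chunk in chunks_with_offsets(pd.read_sas(path, chunksize=100000)): ...`, the reported offset can be compared with the reader's row count to locate the failing chunk.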

kshedden commented May 2, 2017

boulund commented May 2, 2017

@kshedden I get the following from the compression attribute of the iterator:

In [4]: iter.compression
Out[4]: b'SASYZCRL'

Which I interpret as some kind of compression. So, yes, I guess?
It's interesting that the error first occurred somewhere past 70% into the file. All the data up to that point was extracted without issue.

Edit: I actually got another error for one of my other files. Maybe it's related?

Traceback (most recent call last):                                                                              
  File "pandas/io/sas/saslib.pyx", line 29, in pandas.io.sas.saslib.rle_decompress (pandas/io/sas/saslib.c:2540)
ValueError: Unexpected non-zero end_of_first_byte
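
For reference, `SASYZCRL` is the marker that pandas' SAS reader associates with run-length (RLE) compression, which is consistent with the `rle_decompress` frame in the traceback above; `SASYZCR2` marks RDC (Ross Data Compression). A small sketch for labelling the value; `describe_sas_compression` is a hypothetical helper, and treating an empty marker as "no compression" is an assumption:

```python
# Compression markers as defined in pandas/io/sas/sas_constants.py.
SAS_COMPRESSION_NAMES = {
    b"SASYZCRL": "RLE (run-length encoding)",
    b"SASYZCR2": "RDC (Ross Data Compression)",
}


def describe_sas_compression(marker):
    """Return a readable name for a sas7bdat compression marker."""
    if not marker:
        # Assumption: an empty marker means the file is not compressed.
        return "none"
    if isinstance(marker, str):
        marker = marker.encode("ascii")
    return SAS_COMPRESSION_NAMES.get(marker, "unknown ({!r})".format(marker))
```

Called on the reader's `compression` attribute shown above, this would return the RLE label.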

kshedden commented May 2, 2017

boulund commented May 2, 2017

I see. Really appreciate your effort!

Unfortunately I don't think I can generate the file without compression, but I'll look into it (I don't have access to the source data).

kshedden commented May 2, 2017
