Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_sas fails due to unclear problems in SAS dataset #16615

Open
Grinvydas opened this issue Jun 6, 2017 · 7 comments

Comments

Projects
None yet
4 participants
@Grinvydas
Copy link

commented Jun 6, 2017

Problem description

I was trying to read a SAS dataset with pandas 0.19.2. It was not successful, with an error: ValueError('Length of values does not match length of ' 'index').

After some research I came up with the idea, that new line symbol in one of the character values creates this error.

I removed new line and carriage return symbols from column values in SAS data and read_sas finished without errors after that. I assume that read_sas treats any new line symbol it encounters as new line of a table.

Expected Output

read_sas could translate new line symbols found in column values to space and finish without an error.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2017

Can you provide a data set that reproduces the issue? Either something existing online, or you can attach to the issue.

@chris-b1 chris-b1 added the IO SAS label Jun 6, 2017

@Grinvydas

This comment has been minimized.

Copy link
Author

commented Jun 6, 2017

As always, when I try to re-create, it reads the file fine. I'll try to fetch the original file tomorrow and update.

@Grinvydas

This comment has been minimized.

Copy link
Author

commented Jun 7, 2017

After some further investigation I think the problem could be elsewhere, not in New line or carriage return symbol. Actually all I needed is just to re-create the file with simple data step and after that the new data set is read properly by read_sas

data new_file;
set old;
run;

I'm attaching the problematic file which gives this error below:

Traceback (most recent call last):
File "isItemDsUpdated.py", line 27, in
status_log = pd.read_sas(sasfile, encoding='latin1').dropna()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sasreader.py", line 61, in read_sas
data = reader.read()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
rslt = self._chunk_to_dataframe()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
rslt[name] = self._string_chunk[js, :]
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in setitem
self._set_item(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

load_log.zip

@Grinvydas Grinvydas changed the title read_sas fails when values in SAS dataset contains a new line symbol read_sas fails due to unclear problems in SAS dataset Jun 8, 2017

@theianrobertson

This comment has been minimized.

Copy link

commented Jun 15, 2017

Dug into this a bit because I was seeing a similar issue. I think it's something to do with unexpected bytes - starting in row 1806 in your file there's a bunch of odd-looking bytes which the parser is choking on somehow. I can't get something working, but as far as I can see:

import numpy as np
from pandas.io.sas.sas7bdat import SAS7BDATReader
from pandas.io.sas._sas import Parser
reader = SAS7BDATReader('load_log.sas7bdat', index=None, encoding=None, chunksize=None)
print(reader.row_count)
#2097
nd = (reader.column_types == b'd').sum()
ns = (reader.column_types == b's').sum()
nrows = reader.row_count
reader._string_chunk = np.empty((ns, nrows), dtype=np.object)
reader._byte_chunk = np.empty((nd, 8 * nrows), dtype=np.uint8)
reader._current_row_in_chunk_index = 0
p = Parser(reader)
p.read(nrows)
print(reader._current_row_in_chunk_index)
#1805
print(reader._current_row_in_file_index)
#1805
  • SAS7BDATReader._string_chunk is initialized with 2097 rows x 7 string columns (datetime not included) here
  • this call only reads up to line 1805 in the file (but silently returns that data), so when the 2097 rows of the first string column are matched up, that's why the exception is thrown
  • I think this is what's returning early. I'm not too familiar with how the byte patterns work, but is there maybe a control byte that's confusing the pagination functions?

Iterating through the read_sas call actually works, and you can see the bytes that are being read:

import pandas as pd

rows = list(pd.read_sas('load_log.sas7bdat', iterator=True))

print(len(rows))
#2097
print(rows[1804]['libname'])
#1804    b'TRANS'
print(rows[1805]['libname'])
#1805    b'\x00\x00\x00\x00\x00\x00\x00\x00'
odd_bytes = rows[1805]['libname'].iloc[0]
print(odd_bytes)
#b'\x00\x00\x00\x00\x00\x00\x00\x00'
print(odd_bytes.decode('latin-1'))
#
print(len(odd_bytes.decode('latin-1')))
#8
@Grinvydas

This comment has been minimized.

Copy link
Author

commented Jun 15, 2017

Thank you Ian, it seems \x00 is a NULL character.

Its interesting that SAS does not mind having NULL in the dataset, but it removes it during regular dataset rewrite.

So the question that's left is why read_sas behaves badly when encountering NULL character in the data.

@sasutils

This comment has been minimized.

Copy link

commented Mar 3, 2018

That sample SAS dataset has 9 deleted observations. At observation number 69,70.71.72,97,1218,1219,1220 and 1221. Does this python package understand how to skip the deleted observations?

@sasutils

This comment has been minimized.

Copy link

commented Mar 3, 2018

There are no null characters in any of the character variables in that SAS dataset. Unless they are in the 9 deleted observations.

troels added a commit to troels/pandas that referenced this issue Sep 7, 2018

BUG: Fix parsing of sas7bdat files with oddly flagged data pages (pan…
…das-dev#16615)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 7, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 7, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 8, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 9, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 11, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.

troels added a commit to troels/pandas that referenced this issue Sep 16, 2018

BUG: Fix parsing of sas7bdat files with odd data pages (pandas-dev#16615
)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.