read_sas fails due to unclear problems in SAS dataset #16615
I was trying to read a SAS dataset with pandas 0.19.2. It failed with the error: ValueError('Length of values does not match length of index').
After some research I suspected that a newline symbol in one of the character values causes this error.
I removed the newline and carriage return symbols from the column values in the SAS data, and read_sas then finished without errors. I assume that read_sas treats any newline symbol it encounters as the start of a new table row.
read_sas could translate newline symbols found in column values to spaces and finish without an error.
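To make the suggestion concrete, here is a minimal sketch of the kind of translation I mean. It is not how read_sas works; `replace_line_breaks` is a hypothetical helper that operates on one raw character value at a time.

```python
def replace_line_breaks(value: bytes, filler: bytes = b" ") -> bytes:
    """Replace CR/LF bytes in a raw character value with a filler byte.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Handle CRLF first so a Windows line ending becomes one filler, not two.
    return (
        value.replace(b"\r\n", filler)
        .replace(b"\n", filler)
        .replace(b"\r", filler)
    )


cleaned = replace_line_breaks(b"first line\r\nsecond line\nthird")
print(cleaned)
```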
After some further investigation I think the problem could be elsewhere, not in the newline or carriage return symbols. All I actually needed to do was re-create the file with a simple data step; the new dataset is then read properly by read_sas.
I'm attaching the problematic file, which gives the error below:
Traceback (most recent call last):
Dug into this a bit because I was seeing a similar issue. I think it's something to do with unexpected bytes: starting at row 1806 in your file there's a bunch of odd-looking bytes that the parser is choking on somehow. I haven't got a fix working, but as far as I can see:
    import numpy as np
    from pandas.io.sas.sas7bdat import SAS7BDATReader
    from pandas.io.sas._sas import Parser

    reader = SAS7BDATReader('load_log.sas7bdat', index=None, encoding=None, chunksize=None)
    print(reader.row_count)  # 2097

    nd = (reader.column_types == b'd').sum()
    ns = (reader.column_types == b's').sum()
    nrows = reader.row_count
    reader._string_chunk = np.empty((ns, nrows), dtype=np.object)
    reader._byte_chunk = np.empty((nd, 8 * nrows), dtype=np.uint8)
    reader._current_row_in_chunk_index = 0

    p = Parser(reader)
    p.read(nrows)
    print(reader._current_row_in_chunk_index)  # 1805
    print(reader._current_row_in_file_index)  # 1805
Iterating through the
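    import pandas as pd

    # With iterator=True and no chunksize, each item is a one-row DataFrame.
    rows = list(pd.read_sas('load_log.sas7bdat', iterator=True))
    print(len(rows))  # 2097

    # The list indices 1804 and 1805 are inferred from the row labels in the printed output.
    print(rows[1804]['libname'])  # 1804 b'TRANS'
    print(rows[1805]['libname'])  # 1805 b'\x00\x00\x00\x00\x00\x00\x00\x00'

    odd_bytes = rows[1805]['libname'].iloc[0]
    print(odd_bytes)  # b'\x00\x00\x00\x00\x00\x00\x00\x00'
    print(odd_bytes.decode('latin-1'))  # prints eight unprintable characters
    print(len(odd_bytes.decode('latin-1')))  # 8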
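For anyone who wants to locate such rows without eyeballing the output, a rough sketch of the check is below. The `sample` list is made up and only stands in for the raw `libname` values; `first_all_nul_index` is a hypothetical helper, not a pandas function.

```python
def first_all_nul_index(values):
    """Return the index of the first non-empty value made up only of NUL bytes.

    Returns None when no such value exists. Hypothetical helper, not pandas API.
    """
    for i, value in enumerate(values):
        # Stripping NUL bytes from an all-NUL value leaves an empty bytes object.
        if value and value.strip(b"\x00") == b"":
            return i
    return None


# Made-up stand-in for the raw bytes read from a character column.
sample = [b"TRANS", b"WORK", b"\x00" * 8, b"TRANS"]
print(first_all_nul_index(sample))  # 2
```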
Thank you, Ian. It seems \x00 is a NULL character.
It's interesting that SAS does not mind having NULL characters in the dataset, but removes them during a regular dataset rewrite.
So the remaining question is why read_sas behaves badly when it encounters a NULL character in the data.
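Until that is answered, values that do come through (for example when reading row by row with iterator=True, as in the snippet above) could be cleaned after the fact. This is only a sketch of such post-processing, assuming latin-1 data; `clean_char_value` is a hypothetical helper and does nothing about the parser itself.

```python
def clean_char_value(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode a raw character value, dropping NUL bytes and trailing spaces.

    Hypothetical post-processing helper; the encoding default is an assumption.
    """
    # Remove NUL bytes before decoding, then trim trailing whitespace.
    return raw.replace(b"\x00", b"").decode(encoding).rstrip()


print(repr(clean_char_value(b"\x00" * 8)))
print(repr(clean_char_value(b"TRANS   ")))
```

An all-NUL value collapses to an empty string here, which roughly mirrors what the SAS data step rewrite appears to do with these values.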