
read_sas fails due to unclear problems in SAS dataset #16615

Open
Grinvydas opened this issue Jun 6, 2017 · 14 comments
Labels
Bug IO SAS SAS: read_sas

Comments

@Grinvydas

Problem description

I was trying to read a SAS dataset with pandas 0.19.2. It failed with the error: ValueError: Length of values does not match length of index.

After some research I came to the conclusion that a newline character in one of the character values causes this error.

I removed the newline and carriage-return characters from the column values in the SAS data, and read_sas then finished without errors. I assume that read_sas treats any newline character it encounters as the start of a new table row.

Expected Output

read_sas could translate newline characters found in column values to spaces and finish without an error.
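A minimal sketch of the requested behaviour (illustrative only; the sample string is made up, and read_sas has no such option today):

```python
# Illustrative only: what "translate newline characters to spaces"
# would look like for a single character value.
raw = "first line\r\nsecond line"
cleaned = raw.replace("\r", " ").replace("\n", " ")

# No line-break characters survive, so the value can no longer be
# mistaken for the start of a new table row.
assert "\r" not in cleaned and "\n" not in cleaned
```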

@chris-b1
Contributor

chris-b1 commented Jun 6, 2017

Can you provide a data set that reproduces the issue? Either something existing online, or you can attach to the issue.

@chris-b1 chris-b1 added the IO SAS SAS: read_sas label Jun 6, 2017
@Grinvydas
Author

As always, when I try to re-create the problem, the file reads fine. I'll try to fetch the original file tomorrow and update.

@Grinvydas
Author

Grinvydas commented Jun 7, 2017

After some further investigation I think the problem could be elsewhere, not in the newline or carriage-return characters. In fact, all I needed was to re-create the file with a simple DATA step; read_sas then reads the new dataset properly:

data new_file;
set old;
run;

I'm attaching the problematic file, which gives the error below:

Traceback (most recent call last):
File "isItemDsUpdated.py", line 27, in
status_log = pd.read_sas(sasfile, encoding='latin1').dropna()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sasreader.py", line 61, in read_sas
data = reader.read()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
rslt = self._chunk_to_dataframe()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
rslt[name] = self._string_chunk[js, :]
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

load_log.zip

@Grinvydas Grinvydas changed the title read_sas fails when values in SAS dataset contains a new line symbol read_sas fails due to unclear problems in SAS dataset Jun 8, 2017
@theianrobertson

Dug into this a bit because I was seeing a similar issue. I think it's something to do with unexpected bytes: starting at row 1806 in your file there's a run of odd-looking bytes that the parser is choking on somehow. I can't get a fix working, but as far as I can see:

import numpy as np
from pandas.io.sas.sas7bdat import SAS7BDATReader
from pandas.io.sas._sas import Parser
reader = SAS7BDATReader('load_log.sas7bdat', index=None, encoding=None, chunksize=None)
print(reader.row_count)
#2097
nd = (reader.column_types == b'd').sum()
ns = (reader.column_types == b's').sum()
nrows = reader.row_count
reader._string_chunk = np.empty((ns, nrows), dtype=np.object)
reader._byte_chunk = np.empty((nd, 8 * nrows), dtype=np.uint8)
reader._current_row_in_chunk_index = 0
p = Parser(reader)
p.read(nrows)
print(reader._current_row_in_chunk_index)
#1805
print(reader._current_row_in_file_index)
#1805
  • SAS7BDATReader._string_chunk is initialized here with 2097 rows x 7 string columns (the datetime column not included)
  • this Parser.read call only reads up to row 1805 of the file (but silently returns that data), so when the 2097-row index is matched against the first string column, the lengths differ and the exception is thrown
  • I think this is what's returning early. I'm not too familiar with how the byte patterns work, but is there maybe a control byte that's confusing the pagination functions?

Iterating through the read_sas call actually works, and you can see the bytes that are being read:

import pandas as pd

rows = list(pd.read_sas('load_log.sas7bdat', iterator=True))

print(len(rows))
#2097
print(rows[1804]['libname'])
#1804    b'TRANS'
print(rows[1805]['libname'])
#1805    b'\x00\x00\x00\x00\x00\x00\x00\x00'
odd_bytes = rows[1805]['libname'].iloc[0]
print(odd_bytes)
#b'\x00\x00\x00\x00\x00\x00\x00\x00'
print(odd_bytes.decode('latin-1'))
#
print(len(odd_bytes.decode('latin-1')))
#8

@Grinvydas
Author

Thank you Ian, it seems \x00 is the NUL character.

It's interesting that SAS does not mind having NULs in the dataset, but removes them during a regular dataset rewrite.

So the remaining question is why read_sas misbehaves when it encounters NUL characters in the data.
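For reference, the NUL padding behaves like this when decoded. A small sketch (the byte value is taken from Ian's output above; the strip call is one possible cleanup, not anything read_sas does):

```python
# The eight \x00 bytes seen in rows[1805]['libname'] decode without
# error under latin-1, but the NUL characters are still there, just
# invisible when printed.
odd_bytes = b"\x00\x00\x00\x00\x00\x00\x00\x00"
decoded = odd_bytes.decode("latin-1")
assert len(decoded) == 8  # invisible, but not empty

# One possible cleanup: strip NULs (and the spaces SAS pads with).
cleaned = decoded.strip("\x00 ")
assert cleaned == ""
```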

@sasutils

sasutils commented Mar 3, 2018

That sample SAS dataset has 9 deleted observations, at observation numbers 69, 70, 71, 72, 97, 1218, 1219, 1220 and 1221. Does this Python package understand how to skip the deleted observations?

@sasutils

sasutils commented Mar 3, 2018

There are no NUL characters in any of the character variables in that SAS dataset, unless they are in the 9 deleted observations.

troels added a commit to troels/pandas that referenced this issue Sep 7, 2018
…das-dev#16615)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.
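The page-type test described in the commit message can be sketched as follows (the constant name and helper function are illustrative, not pandas internals):

```python
# Per the commit message: bit 8 (256, i.e. 0x100) marks a data page,
# while bit 7 (128, i.e. 0x80) may also be set and should be ignored.
PAGE_DATA_BIT = 0x100  # illustrative name, not a pandas constant

def is_data_page(page_type: int) -> bool:
    """Treat the page as a data page whenever bit 8 is set."""
    return bool(page_type & PAGE_DATA_BIT)

assert is_data_page(0x100)      # plain data page
assert is_data_page(0x180)      # data page with bit 7 also set
assert not is_data_page(0x080)  # bit 7 alone is not a data page
```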
@mroeschke mroeschke added the Bug label Apr 5, 2020
@evnb

evnb commented Jul 1, 2020

Any fixes or suggestions for this? I'm running into this error and don't know much about SAS.

@Xiaoping777

I got the same issue when the specific field (a text field) contains more than 8,000 characters. Any suggestions for dealing with this?

@MaxGhenis

MaxGhenis commented Oct 25, 2021

I'm also hitting this:

pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat")

ValueError: Length of values (3061064) does not match length of index (1615763)

As a workaround, importing with the sas7bdat package (per this SO answer) worked:

from sas7bdat import SAS7BDAT
# After downloading file:
df = SAS7BDAT("spm_pu_2018.sas7bdat").to_data_frame()

@Xiaoping777

Hi MaxGhenis,
I tested the method you mentioned, but it is not perfect: records with text length >= 8192 come back as blank strings, while pd.read_sas fails outright with the error.

Hopefully the pandas or sas7bdat packages can fix these issues.

@AndreaPasqualini

Using pandas 1.3.5 on Windows 10 through Anaconda.

A bunch of datasets where the same ValueError occurs can be found at https://www.fdic.gov/foia/ris/index.html. I'm currently experiencing this issue with precisely these datasets. Conveniently, the data source also provides corresponding CSV files, which may be useful for troubleshooting. For my use case, using the SAS files would save me a ton of time, because they already contain data type and format information, while the CSV files would require a non-trivial amount of cleaning on my part. Meanwhile, Stata's import sas works without issues on these SAS files.

For example, for the file fts0003.sas7bdat, which is found in https://www.fdic.gov/foia/ris/risdata/sas/ris9803-ris0512-sas.zip within the folder RIS0003,

import pandas as pd
df = pd.read_sas("fts0003.sas7bdat")

returns

Warning: column count mismatch (587 + 1023 != 3337)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18836/2218533414.py in <module>
----> 2 df = pd.read_sas("fts0003.sas7bdat")

~\.miniconda3\lib\site-packages\pandas\io\sas\sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
    159 
    160     with reader:
--> 161         return reader.read()

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in read(self, nrows)
    754         p.read(nrows)
    755 
--> 756         rslt = self._chunk_to_dataframe()
    757         if self.index is not None:
    758             rslt = rslt.set_index(self.index)

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in _chunk_to_dataframe(self)
    798 
    799             if self._column_types[j] == b"d":
--> 800                 rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
    801                 rslt[name] = np.asarray(rslt[name], dtype=np.float64)
    802                 if self.convert_dates:

~\.miniconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3610         else:
   3611             # set column
-> 3612             self._set_item(key, value)
   3613 
   3614     def _setitem_slice(self, key: slice, value):

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3782         ensure homogeneity.
   3783         """
-> 3784         value = self._sanitize_column(value)
   3785 
   3786         if (

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, value)
   4507 
   4508         if is_list_like(value):
-> 4509             com.require_length_match(value, self.index)
   4510         return sanitize_array(value, self.index, copy=True, allow_2d=True)
   4511 

~\.miniconda3\lib\site-packages\pandas\core\common.py in require_length_match(data, index)
    529     """
    530     if len(data) != len(index):
--> 531         raise ValueError(
    532             "Length of values "
    533             f"({len(data)}) "

ValueError: Length of values (10275) does not match length of index (10243)

From the column-count mismatch, I'd suspect the issue has to do with recognizing missing values, which these data files are rich in. However, I know close to nothing about SAS, so I'm having difficulty providing good insights.

@Xiaoping777

Hi AndreaPasqualini,

the method I used was to limit the column length to < 8192 in SAS, so you have to modify the columns in SAS. I still cannot find a good solution on the Python side.

xp

@AndreaPasqualini

Thank you for the suggestion, but that is something I cannot do: I have no access to SAS and am only a consumer of these data. I have provided data to help the developers troubleshoot the problem.
