
read_sas fails due to unclear problems in SAS dataset #16615

Open
Grinvydas opened this issue Jun 6, 2017 · 14 comments
Labels
Bug IO SAS SAS: read_sas

Comments

@Grinvydas

Problem description

I was trying to read a SAS dataset with pandas 0.19.2. It failed with the error: ValueError: Length of values does not match length of index.

After some research I came to the conclusion that a newline character in one of the character values causes this error.

I removed the newline and carriage-return characters from the column values in the SAS data, and read_sas then finished without errors. I assume that read_sas treats any newline character it encounters as the start of a new table row.

Expected Output

read_sas could translate newline characters found in column values to spaces and finish without an error.
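A minimal sketch of the requested behaviour (illustrative only; the sample string is made up, and read_sas has no such option today):

```python
# Illustrative only: what "translate newline characters to spaces"
# would look like for a single character value.
raw = "first line\r\nsecond line"
cleaned = raw.replace("\r", " ").replace("\n", " ")

# No line-break characters survive, so the value can no longer be
# mistaken for the start of a new table row.
assert "\r" not in cleaned and "\n" not in cleaned
```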

@chris-b1
Contributor

chris-b1 commented Jun 6, 2017

Can you provide a data set that reproduces the issue? Either something existing online, or you can attach to the issue.

@chris-b1 chris-b1 added the IO SAS SAS: read_sas label Jun 6, 2017
@Grinvydas
Author

As always, when I try to re-create the problem, the file reads fine. I'll try to fetch the original file tomorrow and update.

@Grinvydas
Author

Grinvydas commented Jun 7, 2017

After some further investigation I think the problem could be elsewhere, not in the newline or carriage-return characters. In fact, all I needed was to re-create the file with a simple DATA step; read_sas then reads the new dataset properly:

data new_file;
set old;
run;

I'm attaching the problematic file, which gives the error below:

Traceback (most recent call last):
File "isItemDsUpdated.py", line 27, in
status_log = pd.read_sas(sasfile, encoding='latin1').dropna()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sasreader.py", line 61, in read_sas
data = reader.read()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
rslt = self._chunk_to_dataframe()
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
rslt[name] = self._string_chunk[js, :]
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/projects/karegr01/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

load_log.zip

@Grinvydas Grinvydas changed the title read_sas fails when values in SAS dataset contains a new line symbol read_sas fails due to unclear problems in SAS dataset Jun 8, 2017
@theianrobertson

Dug into this a bit because I was seeing a similar issue. I think it's something to do with unexpected bytes: starting at row 1806 in your file there's a run of odd-looking bytes that the parser is choking on somehow. I can't get a fix working, but as far as I can see:

import numpy as np
from pandas.io.sas.sas7bdat import SAS7BDATReader
from pandas.io.sas._sas import Parser
reader = SAS7BDATReader('load_log.sas7bdat', index=None, encoding=None, chunksize=None)
print(reader.row_count)
#2097
nd = (reader.column_types == b'd').sum()
ns = (reader.column_types == b's').sum()
nrows = reader.row_count
reader._string_chunk = np.empty((ns, nrows), dtype=np.object)
reader._byte_chunk = np.empty((nd, 8 * nrows), dtype=np.uint8)
reader._current_row_in_chunk_index = 0
p = Parser(reader)
p.read(nrows)
print(reader._current_row_in_chunk_index)
#1805
print(reader._current_row_in_file_index)
#1805
  • SAS7BDATReader._string_chunk is initialized here with 2097 rows x 7 string columns (the datetime column not included)
  • this Parser.read call only reads up to row 1805 of the file (but silently returns that data), so when the 2097-row index is matched against the first string column, the lengths differ and the exception is thrown
  • I think this is what's returning early. I'm not too familiar with how the byte patterns work, but is there maybe a control byte that's confusing the pagination functions?

Iterating through the read_sas call actually works, and you can see the bytes that are being read:

import pandas as pd

rows = list(pd.read_sas('load_log.sas7bdat', iterator=True))

print(len(rows))
#2097
print(rows[1804]['libname'])
#1804    b'TRANS'
print(rows[1805]['libname'])
#1805    b'\x00\x00\x00\x00\x00\x00\x00\x00'
odd_bytes = rows[1805]['libname'].iloc[0]
print(odd_bytes)
#b'\x00\x00\x00\x00\x00\x00\x00\x00'
print(odd_bytes.decode('latin-1'))
#
print(len(odd_bytes.decode('latin-1')))
#8

@Grinvydas
Author

Thank you Ian, it seems \x00 is the NUL character.

It's interesting that SAS does not mind having NULs in the dataset, but removes them during a regular dataset rewrite.

So the remaining question is why read_sas misbehaves when it encounters NUL characters in the data.
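For reference, the NUL padding behaves like this when decoded. A small sketch (the byte value is taken from Ian's output above; the strip call is one possible cleanup, not anything read_sas does):

```python
# The eight \x00 bytes seen in rows[1805]['libname'] decode without
# error under latin-1, but the NUL characters are still there, just
# invisible when printed.
odd_bytes = b"\x00\x00\x00\x00\x00\x00\x00\x00"
decoded = odd_bytes.decode("latin-1")
assert len(decoded) == 8  # invisible, but not empty

# One possible cleanup: strip NULs (and the spaces SAS pads with).
cleaned = decoded.strip("\x00 ")
assert cleaned == ""
```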

@sasutils

sasutils commented Mar 3, 2018

That sample SAS dataset has 9 deleted observations, at observation numbers 69, 70, 71, 72, 97, 1218, 1219, 1220 and 1221. Does this Python package understand how to skip the deleted observations?

@sasutils

sasutils commented Mar 3, 2018

There are no NUL characters in any of the character variables in that SAS dataset, unless they are in the 9 deleted observations.

troels added a commit to troels/pandas that referenced this issue Sep 7, 2018
…das-dev#16615)

SAS can apparently generate data pages having bit 7 (128) set on
the page type.
It seems that the presence of bit 8 (256) determines whether it's
a data page or not. So treat page as a data page if bit 8 is set and
don't mind the lower bits.
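The page-type test described in the commit message can be sketched as follows (the constant name and helper function are illustrative, not pandas internals):

```python
# Per the commit message: bit 8 (256, i.e. 0x100) marks a data page,
# while bit 7 (128, i.e. 0x80) may also be set and should be ignored.
PAGE_DATA_BIT = 0x100  # illustrative name, not a pandas constant

def is_data_page(page_type: int) -> bool:
    """Treat the page as a data page whenever bit 8 is set."""
    return bool(page_type & PAGE_DATA_BIT)

assert is_data_page(0x100)      # plain data page
assert is_data_page(0x180)      # data page with bit 7 also set
assert not is_data_page(0x080)  # bit 7 alone is not a data page
```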
@mroeschke mroeschke added the Bug label Apr 5, 2020
@evnb

evnb commented Jul 1, 2020

Any fixes or suggestions for this? I'm running into this error and don't know much about SAS.

@Xiaoping777

I got the same issue when the specific field (a text field) contains more than 8,000 characters. Any suggestions for dealing with this?

@MaxGhenis

MaxGhenis commented Oct 25, 2021

I'm also hitting this:

pd.read_sas("https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_pu_2018.sas7bdat")

ValueError: Length of values (3061064) does not match length of index (1615763)

As a workaround, importing with the sas7bdat package (per this SO answer) worked:

from sas7bdat import SAS7BDAT
# After downloading file:
df = SAS7BDAT("spm_pu_2018.sas7bdat").to_data_frame()

@Xiaoping777

Hi MaxGhenis,
I tested the method you mentioned, but it is not perfect: records with text length >= 8192 come back as blank strings, while pd.read_sas fails outright with the error.

Hopefully the pandas or sas7bdat packages can fix these issues.

@AndreaPasqualini

Using pandas 1.3.5 on Windows 10 through Anaconda.

A bunch of datasets where the same ValueError occurs can be found at https://www.fdic.gov/foia/ris/index.html. I'm currently experiencing this issue with precisely these datasets. Conveniently, the data source also provides corresponding CSV files, which may be useful for troubleshooting. For my use case, using the SAS files would save me a ton of time, because they already contain data type and format information, while the CSV files would require a non-trivial amount of cleaning on my part. Meanwhile, Stata's import sas works without issues on these SAS files.

For example, for the file fts0003.sas7bdat, which is found in https://www.fdic.gov/foia/ris/risdata/sas/ris9803-ris0512-sas.zip within the folder RIS0003,

import pandas as pd
df = pd.read_sas("fts0003.sas7bdat")

returns

Warning: column count mismatch (587 + 1023 != 3337)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18836/2218533414.py in <module>
----> 2 df = pd.read_sas("fts0003.sas7bdat")

~\.miniconda3\lib\site-packages\pandas\io\sas\sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
    159 
    160     with reader:
--> 161         return reader.read()

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in read(self, nrows)
    754         p.read(nrows)
    755 
--> 756         rslt = self._chunk_to_dataframe()
    757         if self.index is not None:
    758             rslt = rslt.set_index(self.index)

~\.miniconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in _chunk_to_dataframe(self)
    798 
    799             if self._column_types[j] == b"d":
--> 800                 rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
    801                 rslt[name] = np.asarray(rslt[name], dtype=np.float64)
    802                 if self.convert_dates:

~\.miniconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3610         else:
   3611             # set column
-> 3612             self._set_item(key, value)
   3613 
   3614     def _setitem_slice(self, key: slice, value):

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3782         ensure homogeneity.
   3783         """
-> 3784         value = self._sanitize_column(value)
   3785 
   3786         if (

~\.miniconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, value)
   4507 
   4508         if is_list_like(value):
-> 4509             com.require_length_match(value, self.index)
   4510         return sanitize_array(value, self.index, copy=True, allow_2d=True)
   4511 

~\.miniconda3\lib\site-packages\pandas\core\common.py in require_length_match(data, index)
    529     """
    530     if len(data) != len(index):
--> 531         raise ValueError(
    532             "Length of values "
    533             f"({len(data)}) "

ValueError: Length of values (10275) does not match length of index (10243)

From the column-count mismatch, I'd suspect the issue has to do with recognizing missing values, which these data files are rich in. However, I know close to nothing about SAS, so I'm having difficulty providing good insights.

@Xiaoping777

Hi AndreaPasqualini,

the method I used was to limit the column length to < 8192 in SAS, so you have to modify the columns in SAS. I still cannot find a good solution on the Python side.

xp

@AndreaPasqualini

Thank you for the suggestion, but that is something I cannot do: I have no access to SAS and am only a consumer of these data. I have provided data to help the developers troubleshoot the problem.
