BUG: xl.parse index_col ignoring skiprows #50953

musshorn · 2023-01-24T03:38:40Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

xl = pd.ExcelFile("TestPandas.xlsx")

df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])
print(df)

Issue Description

Run the sample code on this excel sheet. Parsing is expected to begin from row 9 however in the index column data is pulled from rows that should've been skipped.
TestPandas.xlsx

Expected Behavior

Parsing to begin from the first non-skipped row

Installed Versions

>>> pd.show_versions()

INSTALLED VERSIONS

commit : 7cb7592
python : 3.10.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Australia.1252

pandas : 2.0.0.dev0+1147.g7cb7592523
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 57.4.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
zstandard : 0.17.0
tzdata : None
qtpy : 2.0.1
pyqt5 : None

The text was updated successfully, but these errors were encountered:

axeleschholz · 2023-01-25T18:38:08Z

I reproduced the issue. It seems like the headers or labels are being propagated across multiple rows before they are phased out with skiprows. Honestly the best solution would be to implement the skiprows functionality at the highest level possible in the process, but that might require some refactoring. So instead what needs to happen is the skiprows offset needs to be taken into account when assigning headers. I think the issue might be present around here:

pandas/pandas/io/excel/_base.py

Lines 794 to 833 in 2e218d1

    
               for row in header: 
        
                   if is_integer(skiprows): 
        
                       assert isinstance(skiprows, int) 
        
                       row += skiprows 
        
                   if row > len(data) - 1: 
        
                       raise ValueError( 
        
                           f"header index {row} exceeds maximum index " 
        
                           f"{len(data) - 1} of data.", 
        
                       ) 
        
                   data[row], control_row = fill_mi_header(data[row], control_row) 
        
                   if index_col is not None: 
        
                       header_name, _ = pop_header_name(data[row], index_col) 
        
                       header_names.append(header_name) 
        
           # If there is a MultiIndex header and an index then there is also 
        
           # a row containing just the index name(s) 
        
           has_index_names = False 
        
           if is_list_header and not is_len_one_list_header and index_col is not None: 
        
               index_col_list: Sequence[int] 
        
               if isinstance(index_col, int): 
        
                   index_col_list = [index_col] 
        
               else: 
        
                   assert isinstance(index_col, Sequence) 
        
                   index_col_list = index_col 
        
               # We have to handle mi without names. If any of the entries in the data 
        
               # columns are not empty, this is a regular row 
        
               assert isinstance(header, Sequence) 
        
               if len(header) < len(data): 
        
                   potential_index_names = data[len(header)] 
        
                   potential_data = [ 
        
                       x 
        
                       for i, x in enumerate(potential_index_names) 
        
                       if not control_row[i] and i not in index_col_list 
        
                   ] 
        
                   has_index_names = all(x == "" or x is None for x in potential_data)

I'll keep doing some digging on this and let you know if I find a fix.

pacificdragon · 2023-01-28T19:23:27Z

Hi,

I did some debugging and analysis on this issue. Below are my findings :

Issue 1: The argument passed in the sample code provided by @musshorn to parse the excel is incorrect as the header argument is spelled as "headers".

Due to this, the skip rows are non-functional in the above case.

However, if I read the same data using the correct parameters. This data looks meaningful.

Issue 2: Even if the header is not provided to the input, it should allow us to skip rows for reading the data frame.

So, for that, we may need to apply some fixes.

@axeleschholz @rhshadrach - Let me know your thoughts on this. Also, can I work on this issue?

rhshadrach · 2023-01-29T12:55:03Z

If I remove the first 8 rows of the excel file and then run

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", header=[0, 1, 2], index_col=[0, 1])
print(df)

I get

          Unnamed: 2_level_0 WEST COAST                       
          Unnamed: 2_level_1 Merchant 1       Merchant 2      
Sale#                  Sale#  Comission Price  Comission Price
Fridges 1                NaN         82    76         64    89
        2                NaN         66    98         18    54
        3                NaN          6    48         48    81
        4                NaN         13    56         53    33
        5                NaN         60    69          7    48

This appears the same as the output in #50953 (comment), except for the numbers (the numbers here agree with those in the OP file, so maybe they got changed?). This makes me think skiprows is not an issue here, but I'm not certain.

In any case, passing the invalid argument header should raise instead of being ignored.

@pacificdragon

Also, can I work on this issue?

Sure! Just comment /take on the issue and it should be assigned to you.

pacificdragon · 2023-01-30T23:38:19Z

/take

pacificdragon · 2023-01-31T02:00:53Z

Issue 1 : The reason why this is being ignored is that the call to parse function is having keyword arguments (**kwds) which allows passing the headers parameter.

The sample code used to reproduce the bug directly uses the parse function to read the excel.

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])

However, the recommended way to perform the same action is :

pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1',header=[0, 1, 2], index_col=[0, 1],skiprows=8)

The read_excel function handles all the unexpected arguments. But still, if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

Validate allowed arguments using provided list inside the parse function.
Make the parse function internal to the class as _parse.

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

Sample Code to reproduce this :

df1=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8)
print(df1)
df2=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8,header=[0,1,2])
print(df2)

After running the above code if we check the data frame, the NORTH AMERICA and Test text 1 strings are visible in df1 which should not happen if skiprows is provided.

But the reason it's happening is due to the implementation of forward-filling values if the MultiIndex index is provided.

pandas/pandas/io/excel/_base.py

Lines 835 to 862 in 2e218d1

    
           if is_list_like(index_col): 
        
               # Forward fill values for MultiIndex index. 
        
               if header is None: 
        
                   offset = 0 
        
               elif isinstance(header, int): 
        
                   offset = 1 + header 
        
               else: 
        
                   offset = 1 + max(header) 
        
               # GH34673: if MultiIndex names present and not defined in the header, 
        
               # offset needs to be incremented so that forward filling starts 
        
               # from the first MI value instead of the name 
        
               if has_index_names: 
        
                   offset += 1 
        
               # Check if we have an empty dataset 
        
               # before trying to collect data. 
        
               if offset < len(data): 
        
                   assert isinstance(index_col, Sequence) 
        
                   for col in index_col: 
        
                       last = data[offset][col] 
        
                       for row in range(offset + 1, len(data)): 
        
                           if data[row][col] == "" or data[row][col] is None: 
        
                               data[row][col] = last 
        
                           else: 
        
                               last = data[row][col]

@rhshadrach Let me know if this makes sense. Which one will take higher priority in this case index_cols or skiprows.

rhshadrach · 2023-01-31T04:19:32Z

take

rhshadrach · 2023-01-31T04:25:21Z

Sorry @pacificdragon - there is no / in the command. I've assigned the issue to you (but in the future, just take will work).

if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in #50953 (comment) I get the same result when removing the top eight rows and the skiprows argument.

pacificdragon · 2023-01-31T17:05:19Z

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

No its not used anywhere, the kwargs are passed from parse function to TextParser which further are passed to TextFileReader in io/parsers/readers.py

pandas/pandas/io/excel/_base.py

Lines 866 to 889 in 2e218d1

    
           parser = TextParser( 
        
               data, 
        
               names=names, 
        
               header=header, 
        
               index_col=index_col, 
        
               has_index_names=has_index_names, 
        
               squeeze=squeeze, 
        
               dtype=dtype, 
        
               true_values=true_values, 
        
               false_values=false_values, 
        
               skiprows=skiprows, 
        
               nrows=nrows, 
        
               na_values=na_values, 
        
               skip_blank_lines=False,  # GH 39808 
        
               parse_dates=parse_dates, 
        
               date_parser=date_parser, 
        
               thousands=thousands, 
        
               decimal=decimal, 
        
               comment=comment, 
        
               skipfooter=skipfooter, 
        
               usecols=usecols, 
        
               mangle_dupe_cols=mangle_dupe_cols, 
        
               **kwds, 
        
           )

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in #50953 (comment) I get the same result when removing the top eight rows and the skiprows argument.

Yes, you are correct if we compare that is same because issue happens only when header is not provided and skiprows is provided. But in #50953 (comment) you are providing header argument.

@rhshadrach - Let me know if its clear now.

rhshadrach · 2023-02-01T22:52:40Z

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

pacificdragon · 2023-02-05T18:24:53Z

Just to align on the changes. I did these tests again.

File : TestPandas.xlsx

Case 1:

df = xl.parse("Sheet1", skiprows=8)

Comments:

skiprows working fine, data got skipped.
Doesn't change source data (eg. no forward filling)

Case 2:

df = xl.parse("Sheet1", skiprows=8, header=[0, 1, 2])

df.columns :

MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2'),
('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Sale#'),
('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Sale#'),
( 'WEST COAST', 'Merchant 1', 'Comission'),
( 'WEST COAST', 'Merchant 1', 'Price'),
( 'WEST COAST', 'Merchant 2', 'Comission'),
( 'WEST COAST', 'Merchant 2', 'Price')],
)

Comments:

Forward filling happened but only after skipping rows that is why we can see some 'Unnamed' indexes in starting.
Skiprows taking the highest priority than 'header' (Looks good).

Case 3:

df = xl.parse("Sheet1", skiprows=8, index_col=[0, 1])

df.index

MultiIndex([('NORTH AMERICA', 'Test text 1'),
('NORTH AMERICA', 'Sale#'),
( 'Fridges', 1),
( 'Fridges', 2),
( 'Fridges', 3),
( 'Fridges', 4),
( 'Fridges', 5)],
names=['NORTH AMERICA', 'Test text 1'])

Comments :

Source Data Got Changed as index_col is specified. Forward Filling for Multiindex brought "Test text 1", "NORTH AMERICA" to row 9th.
Skiprows were applied after forward filling happened.

This makes me think maybe it's correct if indexes can't be blank strings so bringing value from the top is needed.
But this shows index_col parameter is taking higher priority than skip rows.

@rhshadrach - Please suggest if my understanding is correct and if implementing skip rows at the top of the hierarchy will be a good option or not.

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

Sure, creating a PR for this.

musshorn added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2023

rhshadrach added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 25, 2023

github-actions bot assigned rhshadrach Jan 31, 2023

rhshadrach assigned pacificdragon and unassigned rhshadrach Jan 31, 2023

This was referenced Feb 5, 2023

Removed **kwds from parse function to handle unexpected arguments #51181

Closed

Removed **kwds from parse function in read_excel #51218

Closed

BUG : Add Deprecation FutureWarning for parse function call #51331

Closed

pacificdragon mentioned this issue Feb 16, 2023

BUG : Add Deprecation FutureWarning for parse function call in read_excel #51437

Closed

5 tasks

rhshadrach mentioned this issue Apr 13, 2024

DEPR: ExcelFile.parse #58247

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: xl.parse index_col ignoring skiprows #50953

BUG: xl.parse index_col ignoring skiprows #50953

musshorn commented Jan 24, 2023

INSTALLED VERSIONS

axeleschholz commented Jan 25, 2023

pacificdragon commented Jan 28, 2023

rhshadrach commented Jan 29, 2023

pacificdragon commented Jan 30, 2023

pacificdragon commented Jan 31, 2023

rhshadrach commented Jan 31, 2023

rhshadrach commented Jan 31, 2023

pacificdragon commented Jan 31, 2023

rhshadrach commented Feb 1, 2023

pacificdragon commented Feb 5, 2023