Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: xl.parse index_col ignoring skiprows #50953

Open
3 tasks done
musshorn opened this issue Jan 24, 2023 · 10 comments
Open
3 tasks done

BUG: xl.parse index_col ignoring skiprows #50953

musshorn opened this issue Jan 24, 2023 · 10 comments
Assignees
Labels
Bug IO Excel read_excel, to_excel

Comments

@musshorn
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

xl = pd.ExcelFile("TestPandas.xlsx")

df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])
print(df)

Issue Description

Run the sample code on this excel sheet. Parsing is expected to begin from row 9 however in the index column data is pulled from rows that should've been skipped.
TestPandas.xlsx

Expected Behavior

Parsing to begin from the first non-skipped row

Installed Versions

>>> pd.show_versions()

INSTALLED VERSIONS

commit : 7cb7592
python : 3.10.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Australia.1252

pandas : 2.0.0.dev0+1147.g7cb7592523
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 57.4.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
zstandard : 0.17.0
tzdata : None
qtpy : 2.0.1
pyqt5 : None

@musshorn musshorn added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2023
@axeleschholz
Copy link

I reproduced the issue. It seems like the headers or labels are being propagated across multiple rows before they are phased out with skiprows. Honestly the best solution would be to implement the skiprows functionality at the highest level possible in the process, but that might require some refactoring. So instead what needs to happen is the skiprows offset needs to be taken into account when assigning headers. I think the issue might be present around here:

for row in header:
if is_integer(skiprows):
assert isinstance(skiprows, int)
row += skiprows
if row > len(data) - 1:
raise ValueError(
f"header index {row} exceeds maximum index "
f"{len(data) - 1} of data.",
)
data[row], control_row = fill_mi_header(data[row], control_row)
if index_col is not None:
header_name, _ = pop_header_name(data[row], index_col)
header_names.append(header_name)
# If there is a MultiIndex header and an index then there is also
# a row containing just the index name(s)
has_index_names = False
if is_list_header and not is_len_one_list_header and index_col is not None:
index_col_list: Sequence[int]
if isinstance(index_col, int):
index_col_list = [index_col]
else:
assert isinstance(index_col, Sequence)
index_col_list = index_col
# We have to handle mi without names. If any of the entries in the data
# columns are not empty, this is a regular row
assert isinstance(header, Sequence)
if len(header) < len(data):
potential_index_names = data[len(header)]
potential_data = [
x
for i, x in enumerate(potential_index_names)
if not control_row[i] and i not in index_col_list
]
has_index_names = all(x == "" or x is None for x in potential_data)

I'll keep doing some digging on this and let you know if I find a fix.

@rhshadrach rhshadrach added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 25, 2023
@pacificdragon
Copy link

Hi,

I did some debugging and analysis on this issue. Below are my findings :

Issue 1: The argument passed in the sample code provided by @musshorn to parse the excel is incorrect as the header argument is spelled as "headers".

image

Due to this, the skip rows are non-functional in the above case.

image

However, if I read the same data using the correct parameters. This data looks meaningful.

image

Issue 2: Even if the header is not provided to the input, it should allow us to skip rows for reading the data frame.

So, for that, we may need to apply some fixes.

@axeleschholz @rhshadrach - Let me know your thoughts on this. Also, can I work on this issue?

@rhshadrach
Copy link
Member

If I remove the first 8 rows of the excel file and then run

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", header=[0, 1, 2], index_col=[0, 1])
print(df)

I get

          Unnamed: 2_level_0 WEST COAST                       
          Unnamed: 2_level_1 Merchant 1       Merchant 2      
Sale#                  Sale#  Comission Price  Comission Price
Fridges 1                NaN         82    76         64    89
        2                NaN         66    98         18    54
        3                NaN          6    48         48    81
        4                NaN         13    56         53    33
        5                NaN         60    69          7    48

This appears the same as the output in #50953 (comment), except for the numbers (the numbers here agree with those in the OP file, so maybe they got changed?). This makes me think skiprows is not an issue here, but I'm not certain.

In any case, passing the invalid argument header should raise instead of being ignored.

@pacificdragon

Also, can I work on this issue?

Sure! Just comment /take on the issue and it should be assigned to you.

@pacificdragon
Copy link

/take

@pacificdragon
Copy link

Issue 1 : The reason why this is being ignored is that the call to parse function is having keyword arguments (**kwds) which allows passing the headers parameter.

The sample code used to reproduce the bug directly uses the parse function to read the excel.

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])

However, the recommended way to perform the same action is :

pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1',header=[0, 1, 2], index_col=[0, 1],skiprows=8)

The read_excel function handles all the unexpected arguments. But still, if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

  1. Validate allowed arguments using provided list inside the parse function.
  2. Make the parse function internal to the class as _parse.

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

Sample Code to reproduce this :

df1=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8)
print(df1)
df2=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8,header=[0,1,2])
print(df2)

After running the above code if we check the data frame, the NORTH AMERICA and Test text 1 strings are visible in df1 which should not happen if skiprows is provided.

But the reason it's happening is due to the implementation of forward-filling values if the MultiIndex index is provided.

if is_list_like(index_col):
# Forward fill values for MultiIndex index.
if header is None:
offset = 0
elif isinstance(header, int):
offset = 1 + header
else:
offset = 1 + max(header)
# GH34673: if MultiIndex names present and not defined in the header,
# offset needs to be incremented so that forward filling starts
# from the first MI value instead of the name
if has_index_names:
offset += 1
# Check if we have an empty dataset
# before trying to collect data.
if offset < len(data):
assert isinstance(index_col, Sequence)
for col in index_col:
last = data[offset][col]
for row in range(offset + 1, len(data)):
if data[row][col] == "" or data[row][col] is None:
data[row][col] = last
else:
last = data[row][col]

@rhshadrach Let me know if this makes sense. Which one will take higher priority in this case index_cols or skiprows.

@rhshadrach
Copy link
Member

take

@rhshadrach
Copy link
Member

Sorry @pacificdragon - there is no / in the command. I've assigned the issue to you (but in the future, just take will work).

if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in #50953 (comment) I get the same result when removing the top eight rows and the skiprows argument.

@pacificdragon
Copy link

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

No its not used anywhere, the kwargs are passed from parse function to TextParser which further are passed to TextFileReader in io/parsers/readers.py

parser = TextParser(
data,
names=names,
header=header,
index_col=index_col,
has_index_names=has_index_names,
squeeze=squeeze,
dtype=dtype,
true_values=true_values,
false_values=false_values,
skiprows=skiprows,
nrows=nrows,
na_values=na_values,
skip_blank_lines=False, # GH 39808
parse_dates=parse_dates,
date_parser=date_parser,
thousands=thousands,
decimal=decimal,
comment=comment,
skipfooter=skipfooter,
usecols=usecols,
mangle_dupe_cols=mangle_dupe_cols,
**kwds,
)

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in #50953 (comment) I get the same result when removing the top eight rows and the skiprows argument.

Yes, you are correct if we compare that is same because issue happens only when header is not provided and skiprows is provided. But in #50953 (comment) you are providing header argument.

@rhshadrach - Let me know if its clear now.

@rhshadrach
Copy link
Member

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

@pacificdragon
Copy link

Just to align on the changes. I did these tests again.

File : TestPandas.xlsx

Case 1:

df = xl.parse("Sheet1", skiprows=8)

Comments:

  1. skiprows working fine, data got skipped.
  2. Doesn't change source data (eg. no forward filling)

Case 2:

df = xl.parse("Sheet1", skiprows=8, header=[0, 1, 2])

df.columns :

MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2'),
('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Sale#'),
('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Sale#'),
( 'WEST COAST', 'Merchant 1', 'Comission'),
( 'WEST COAST', 'Merchant 1', 'Price'),
( 'WEST COAST', 'Merchant 2', 'Comission'),
( 'WEST COAST', 'Merchant 2', 'Price')],
)

Comments:

  1. Forward filling happened but only after skipping rows that is why we can see some 'Unnamed' indexes in starting.
  2. Skiprows taking the highest priority than 'header' (Looks good).

Case 3:

df = xl.parse("Sheet1", skiprows=8, index_col=[0, 1])

df.index

MultiIndex([('NORTH AMERICA', 'Test text 1'),
('NORTH AMERICA', 'Sale#'),
( 'Fridges', 1),
( 'Fridges', 2),
( 'Fridges', 3),
( 'Fridges', 4),
( 'Fridges', 5)],
names=['NORTH AMERICA', 'Test text 1'])

Comments :

  1. Source Data Got Changed as index_col is specified. Forward Filling for Multiindex brought "Test text 1", "NORTH AMERICA" to row 9th.
  2. Skiprows were applied after forward filling happened.
  • This makes me think maybe it's correct if indexes can't be blank strings so bringing value from the top is needed.
  • But this shows index_col parameter is taking higher priority than skip rows.

@rhshadrach - Please suggest if my understanding is correct and if implementing skip rows at the top of the hierarchy will be a good option or not.

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

Sure, creating a PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment