
BUG: InvalidRange Error while reading csv file after modifications #47402

Closed · 2 of 3 tasks
kamalsharma2 opened this issue Jun 17, 2022 · 4 comments
Labels: Bug, Closing Candidate, IO Network, Needs Info

@kamalsharma2

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import time

storage_options = {'account_name': 'ADLS Gen2 account', 'account_key': 'some key'}

while True:
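    # the file at path_to_file is modified externally (rows removed) between iterations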
    df = pd.read_csv("abfs://path_to_file", storage_options=storage_options)
    print(df)
    time.sleep(100)

Issue Description

If a file is read twice from an ADLS Gen2 storage account (the only cloud storage I've tested with) in a single run and the file has changed between the two reads, we get an InvalidRange error, pasted below. This has been tested with CSV and Excel files; the number of rows was reduced between the two consecutive reads.
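For context, the file modification between the two reads happens outside of pandas. A hypothetical sketch of that step using the Azure SDK (connection string, container, and blob path below are placeholders, not the values actually used):

from azure.storage.blob import BlobServiceClient

conn_str = "some_conn_str"
service = BlobServiceClient.from_connection_string(conn_str)
blob_client = service.get_blob_client(container="container_name", blob="path/to/file.csv")

# Overwrite the blob with a shorter CSV so its size shrinks between the two reads.
blob_client.upload_blob("a,b\n1,2\n", overwrite=True)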

Error:

Ran into a deserialization error. Ignoring since this is failsafe deserialization
Traceback (most recent call last):
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/msrest/serialization.py", line 1501, in failsafe_deserialize
    return self(target_obj, data, content_type=content_type)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/msrest/serialization.py", line 1367, in __call__
    data = self._unpack_content(response_data, content_type)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/msrest/serialization.py", line 1535, in _unpack_content
    raise ValueError("This pipeline didn't have the RawDeserializer policy; can't deserialize")
ValueError: This pipeline didn't have the RawDeserializer policy; can't deserialize
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    df = pd.read_csv("abfs://synapsemlfs@synapsemladlsgen2.dfs.core.windows.net/dataset/64mb_test.csv", storage_options=storage_options)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1943, in pandas._libs.parsers.raise_parser_error
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/adlfs/spec.py", line 1895, in _async_fetch_range
    stream = await self.container_client.download_blob(
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py", line 79, in wrapper_use_tracer
    return await func(*args, **kwargs)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/storage/blob/aio/_container_client_async.py", line 1011, in download_blob
    return await blob_client.download_blob(
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py", line 79, in wrapper_use_tracer
    return await func(*args, **kwargs)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/storage/blob/aio/_blob_client_async.py", line 494, in download_blob
    await downloader._setup()  # pylint: disable=protected-access
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/storage/blob/aio/_download_async.py", line 254, in _setup
    self._response = await self._initial_request()
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/storage/blob/aio/_download_async.py", line 340, in _initial_request
    process_storage_error(error)
  File "/home/kamal/anaconda3/lib/python3.8/site-packages/azure/storage/blob/_shared/response_handlers.py", line 181, in process_storage_error
    exec("raise error from None")   # pylint: disable=exec-used # nosec
  File "<string>", line 1, in <module>
azure.core.exceptions.HttpResponseError: The range specified is invalid for the current size of the resource.
RequestId:982c7a7f-701e-011d-0635-8209e7000000
Time:2022-06-17T10:35:16.3421106Z
ErrorCode:InvalidRange
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidRange</Code><Message>The range specified is invalid for the current size of the resource.
RequestId:982c7a7f-701e-011d-0635-8209e7000000
Time:2022-06-17T10:35:16.3421106Z</Message></Error>

Expected Behavior

Pandas should be able to read the file even after the changes.
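A possible workaround sketch (not part of the original report, and assuming the stale file size comes from fsspec/adlfs caching filesystem metadata between reads): clear the cached metadata before each read so the blob size is re-fetched. Account details and path are placeholders.

import time

import fsspec
import pandas as pd

storage_options = {'account_name': 'ADLS Gen2 account', 'account_key': 'some key'}

while True:
    # fsspec caches filesystem instances keyed by protocol and storage options, so this
    # should be the same "abfs" instance that pd.read_csv uses below.
    fs = fsspec.filesystem("abfs", **storage_options)
    # Clear the cached listing/info metadata so the next open re-fetches the blob size.
    fs.invalidate_cache()

    df = pd.read_csv("abfs://path_to_file", storage_options=storage_options)
    print(df)
    time.sleep(100)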

Installed Versions

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.72-microsoft-standard-WSL2
Version : #1 SMP Wed Oct 28 23:40:43 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2022.5.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

kamalsharma2 added the Bug and Needs Triage labels Jun 17, 2022
@twoertwein
Member

Thank you for posting the traceback! I think this might be an issue with the server/adlfs, as the error happens when we try to read from the fsspec file handle (which then uses adlfs).

I assume there might be a race condition: pandas opens the file (probably the old file), then the file on the server is updated (presumably the "request id" changes), the file handle no longer points to the same file, pandas tries to read, and the error happens.
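One way to test this hypothesis (a sketch, assuming the long-running process holds a cached "abfs" filesystem with stale blob metadata; credentials and path are placeholders): compare the size the cached filesystem reports with the size after clearing its cache, once the blob has been modified.

import fsspec

storage_options = {'account_name': 'ADLS Gen2 account', 'account_key': 'some key'}
path = "abfs://path_to_file"

# Run inside the long-running process, after the blob has been changed on the server.
fs = fsspec.filesystem("abfs", **storage_options)
cached_size = fs.info(path)["size"]   # may be served from the cached listing

fs.invalidate_cache()
fresh_size = fs.info(path)["size"]    # forces a new metadata request

print(f"cached={cached_size} fresh={fresh_size} stale={cached_size != fresh_size}")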

twoertwein added the IO Network label Jun 17, 2022
@kamalsharma2
Author

kamalsharma2 commented Jun 23, 2022

@twoertwein I ran a couple of tests to check whether that could be the case: downloading the file with the Azure SDK (which adlfs uses to download files) and then reading it with pandas does not throw this error when the file is changed between runs. This rules out the possibility that the file handle no longer points to the same file. Here is a snippet of the script I ran:

from io import BytesIO

import pandas as pd


async def blob_snapshots_async(self):
    from azure.storage.blob.aio import BlobServiceClient

    conn_str = "some_conn_str"
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)

    # Instantiate a ContainerClient and download the whole blob via the Azure SDK
    async with blob_service_client:
        container_client = blob_service_client.get_container_client("container_name")

        async with container_client.get_blob_client("/path/to/file.csv") as bc:
            stream = await bc.download_blob()
            data = await stream.readall()
            bio = BytesIO(data)
            bio.seek(0)
            df = pd.read_csv(bio)
            print(df)

@rhshadrach
Member

It doesn't appear to me that there is anything pandas can do here; as far as I know, we merely take the path and storage_options as provided by the user and make a request with them. @kamalsharma2 - can you try reproducing using:

import urllib.request

urllib.request.Request("abfs://path_to_file", headers=storage_options)

rhshadrach added the Needs Info and Closing Candidate labels and removed the Needs Triage label Jan 7, 2024
@mroeschke
Member

Closing as this doesn't seem to be an issue with pandas.
