URL treated as local file for read_feather #29055

mccarthyryanc · 2019-10-17T16:02:49Z

Not sure if this is a pandas issue or pyarrow, but when I try to read from a URL:

import pandas as pd
pd.read_feather("https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather")

I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pandas/io/feather_format.py", line 119, in read_feather
    return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pyarrow/feather.py", line 214, in read_feather
    reader = FeatherReader(source)
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pyarrow/feather.py", line 40, in __init__
    self.open(source)
  File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
  File "pyarrow/io.pxi", line 1406, in pyarrow.lib.get_reader
  File "pyarrow/io.pxi", line 1395, in pyarrow.lib._get_native_file
  File "pyarrow/io.pxi", line 788, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 751, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Failed to open local file 'https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather', error: No such file or directory

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-64-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.4.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.0
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

TomAugspurger · 2019-10-17T16:15:32Z

Does this work using arrow directly?

mccarthyryanc · 2019-10-17T16:30:10Z

A similar call with pyarrow:

import pyarrow.feather  # the feather submodule must be imported explicitly
pyarrow.feather.read_feather('https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather')

This fails with the same error. The pandas docs say you can pass a URL, but the arrow help says it only takes a file path or file-like object:

read_feather(source, columns=None, use_threads=True)
    Read a pandas.DataFrame from Feather format

    Parameters
    ----------
    source : string file path, or file-like object
    columns : sequence, optional
        Only read a specific set of columns. If not provided, all columns are
        read
    use_threads: bool, default True
        Whether to parallelize reading using multiple threads

    Returns
    -------
    df : pandas.DataFrame

So I was assuming pandas was doing extra work to parse the URL.

mccarthyryanc · 2019-10-23T22:40:56Z

@TomAugspurger, is this just an error in the docs and a feature request to pyarrow?

TomAugspurger · 2019-10-24T12:30:49Z

Seems to be an issue with the pandas docs.

mccarthyryanc · 2019-10-25T23:28:17Z

Looks like this feature won't be coming to pyarrow. So, until this gets added on the pandas side, here is a workaround:

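import pandas as pd
import requests
import io

# Download the file over HTTP as a raw stream
resp = requests.get(
    'https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather',
    stream=True
)
# Let urllib3 undo any gzip/deflate content encoding while reading
resp.raw.decode_content = True
# Copy the bytes into a seekable in-memory file and hand that to pandas
mem_fh = io.BytesIO(resp.raw.read())
pd.read_feather(mem_fh)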
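
The same idea can be sketched without the requests dependency, using only the standard library. This is a minimal sketch, not part of pandas or pyarrow: the helper name `fetch_to_buffer` and its `timeout` parameter are hypothetical, and it assumes the whole file fits in memory.

```python
import io
from urllib.request import urlopen


def fetch_to_buffer(url, timeout=30):
    """Download `url` and return its bytes as a seekable in-memory file.

    Hypothetical helper for illustration; it reads the entire response
    into memory, so it is only suitable for files that fit in RAM.
    """
    with urlopen(url, timeout=timeout) as resp:
        return io.BytesIO(resp.read())
```

The returned buffer is a file-like object, so it can be passed where a path would otherwise go, for example `pd.read_feather(fetch_to_buffer(url))`.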

darshit-doshi · 2020-02-07T14:56:37Z

@mccarthyryanc, but if we want to fetch data from an AWS S3 bucket, this workaround is of no use. Any idea what needs to be done in that case?

mccarthyryanc · 2020-02-07T16:00:30Z

@darshit-doshi if your data is in a public S3 bucket this method will work, you just need the full object URL.

If not in a public bucket I would use something like s3fs:

import pandas as pd
import s3fs

# Credentials are picked up from the usual AWS configuration
fs = s3fs.S3FileSystem()
fh = fs.open('s3://bucketname/filename.feather')
df = pd.read_feather(fh)

alimcmaster1 · 2020-04-23T23:40:52Z

It may be a case of us needing to call get_filepath_or_buffer here, as other readers (e.g. parquet) do. That function infers the filesystem to use based on the URL.
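
To illustrate what "inferring the filesystem from the URL" means, here is a rough sketch of scheme-based dispatch. It is not the pandas implementation of get_filepath_or_buffer: the function name `classify_source` and the `REMOTE_SCHEMES` set are made up for this example.

```python
from urllib.parse import urlparse

# Illustrative set of schemes treated as remote; an assumption, not pandas' list.
REMOTE_SCHEMES = {"http", "https", "ftp", "s3"}


def classify_source(source):
    """Return 'buffer', 'remote', or 'local' for a read_feather-style input."""
    # File-like objects (anything with .read) are passed through untouched.
    if hasattr(source, "read"):
        return "buffer"
    # Otherwise look at the URL scheme; plain paths have none (or a drive letter).
    scheme = urlparse(str(source)).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return "remote"
    return "local"
```

A reader that skips this step hands the URL string straight to the underlying library, which then tries to open it as a local file, as in the traceback at the top of this issue.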

mccarthyryanc · 2020-04-26T20:34:07Z

@alimcmaster1 and @jreback, thanks for the fix and merge!

jbrockmendel added the IO Parquet (parquet, feather) label Dec 1, 2019

mroeschke added the Docs label Apr 19, 2020

alimcmaster1 self-assigned this Apr 25, 2020

alimcmaster1 mentioned this issue Apr 26, 2020

IO: Fix feather s3 and http paths #33798

Merged

5 tasks

jreback added this to the 1.1 milestone Apr 26, 2020

jreback closed this as completed in #33798 Apr 26, 2020
