URL treated as local file for read_feather #29055

Closed
mccarthyryanc opened this issue Oct 17, 2019 · 9 comments · Fixed by #33798
@mccarthyryanc

Not sure if this is a pandas issue or pyarrow, but when I try to read from a URL:

import pandas as pd
pd.read_feather("https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather")

I get the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pandas/io/feather_format.py", line 119, in read_feather
    return feather.read_feather(path, columns=columns, use_threads=bool(use_threads))
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pyarrow/feather.py", line 214, in read_feather
    reader = FeatherReader(source)
  File "/home/ubuntu/miniconda3/envs/pandas/lib/python3.7/site-packages/pyarrow/feather.py", line 40, in __init__
    self.open(source)
  File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
  File "pyarrow/io.pxi", line 1406, in pyarrow.lib.get_reader
  File "pyarrow/io.pxi", line 1395, in pyarrow.lib._get_native_file
  File "pyarrow/io.pxi", line 788, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 751, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Failed to open local file 'https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather', error: No such file or directory

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-64-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.4.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.0
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@TomAugspurger
Contributor

TomAugspurger commented Oct 17, 2019 via email

@mccarthyryanc
Author

A similar call with pyarrow:

import pyarrow
pyarrow.feather.read_feather('https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather')

Fails with the same error. The pandas docs say you can pass a URL, but the arrow help says it only takes a path or file-like object:

read_feather(source, columns=None, use_threads=True)
    Read a pandas.DataFrame from Feather format

    Parameters
    ----------
    source : string file path, or file-like object
    columns : sequence, optional
        Only read a specific set of columns. If not provided, all columns are
        read
    use_threads: bool, default True
        Whether to parallelize reading using multiple threads

    Returns
    -------
    df : pandas.DataFrame

So I was assuming pandas was doing extra work to parse the URL.

@mccarthyryanc
Author

@TomAugspurger, is this just an error in the docs and a feature request to pyarrow?

@TomAugspurger
Contributor

Seems to be an issue with the pandas docs.

@mccarthyryanc
Author

Looks like this feature won't be coming to pyarrow. So, until this gets added on the pandas side, here is a workaround:

import pandas as pd
import requests
import io

resp = requests.get(
    'https://github.com/wesm/feather/raw/master/R/inst/feather/iris.feather',
    stream=True
)
resp.raw.decode_content = True
mem_fh = io.BytesIO(resp.raw.read())
pd.read_feather(mem_fh)
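
The same idea can be sketched with only the standard library. This is a rough illustration, not something pandas or pyarrow provides: `fetch_to_buffer` is a hypothetical helper name, and the `opener` parameter is an assumption added so the function can be tried without network access.

```python
import io
import urllib.request


def fetch_to_buffer(url, opener=urllib.request.urlopen):
    """Download `url` fully into memory and return a seekable BytesIO.

    `opener` is any callable that takes the URL and returns a context
    manager with a `.read()` method (hypothetical hook for testing);
    it defaults to urllib.request.urlopen.
    """
    with opener(url) as resp:
        # Read the whole body so the result supports seek(), unlike a socket.
        return io.BytesIO(resp.read())
```

The returned buffer is a file-like object, so it should be usable the same way as mem_fh above, e.g. pd.read_feather(fetch_to_buffer(url)). The whole file is held in memory, which is fine for small files like iris.feather but may not suit large ones.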

@jbrockmendel jbrockmendel added the IO Parquet parquet, feather label Dec 1, 2019
@darshit-doshi

@mccarthyryanc, but if we want to fetch data from an AWS S3 bucket, this workaround is of no use. Any idea what needs to be done in that case?

@mccarthyryanc
Author

@darshit-doshi if your data is in a public S3 bucket, this method will work; you just need the full object URL.

If not in a public bucket I would use something like s3fs:

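import pandas as pd
import s3fs

# Credentials are picked up from the environment or AWS config
fs = s3fs.S3FileSystem()
# Open the object as a file-like handle and let pandas read from it
fh = fs.open('s3://bucketname/filename.feather')
df = pd.read_feather(fh)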

@mroeschke mroeschke added the Docs label Apr 19, 2020
@alimcmaster1
Member

It's maybe a case of us calling get_filepath_or_buffer here, as other readers (e.g. parquet) do; this infers the filesystem to use based on the URL.
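
To make "infers the filesystem based on the URL" concrete, here is a minimal sketch of scheme-based dispatch. It is not pandas' actual get_filepath_or_buffer: `classify_source` and the two scheme sets are hypothetical names and values chosen for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical scheme groups for illustration; pandas' real list may differ.
HTTP_SCHEMES = {"http", "https", "ftp"}
OBJECT_STORE_SCHEMES = {"s3", "gs", "gcs"}


def classify_source(source):
    """Return 'buffer', 'http', 'object-store' or 'local' for a read target."""
    # Anything with a .read() method is treated as an already-open buffer.
    if hasattr(source, "read"):
        return "buffer"
    # urlparse lowercases the scheme; plain paths yield an empty scheme.
    scheme = urlparse(str(source)).scheme
    if scheme in HTTP_SCHEMES:
        return "http"
    if scheme in OBJECT_STORE_SCHEMES:
        return "object-store"
    # Everything else (including Windows drive letters) falls back to local.
    return "local"
```

A reader built this way would download "http" sources into a buffer and open "object-store" sources through something like s3fs before handing a file-like object to pyarrow, instead of passing the URL string straight through as a local path.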

@alimcmaster1 alimcmaster1 self-assigned this Apr 25, 2020
@jreback jreback added this to the 1.1 milestone Apr 26, 2020
@mccarthyryanc
Author

@alimcmaster1 and @jreback, thanks for the fix and merge!
