Skip to content

Allow read_csv to take URLs #970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Apr 2, 2012
Merged

Conversation

jseabold
Copy link
Contributor

This allows read_csv to take URLs. The tests are going to need to be modified after it's merged to point the the new repo URL or if you want to host a test file somewhere else. I also have no idea what the file:// path should be on non-posix systems, so this path might need some adjustment. Not sure. There's no test case for ftp, but I don't see why it wouldn't work the same as http.

@takluyver
Copy link
Contributor

On Python 3, the HTTP response produces bytes, but the parser implementation expects strings (i.e. unicode). I think the correct way would be to wrap the response in an io.TextIOWrapper, using the encoding passed to read_csv.

@jseabold
Copy link
Contributor Author

what about numpy.compat.asstr or similar? io doesn't exist in python 2.5

@takluyver
Copy link
Contributor

codecs.StreamReader should exist in 2.5, but I think whatever we use needs an if PY3 anyway, because the machinery for handling CSV files on Python 2 expects a bytes-mode file-like object.

numpy.compat.asstr looks like it's hardcoded to Latin-1, so I don't think that's quite right.

@jseabold
Copy link
Contributor Author

Is this what you have in mind?

@takluyver
Copy link
Contributor

Yes, that looks sensible, although I haven't tested it yet. I'd also specify errors='replace' when creating the wrapper, so that it will tolerate an incorrectly guessed encoding.

@jseabold
Copy link
Contributor Author

Hmm. Who guesses the encoding? The user? Or is there some BOM checking somewhere?

If the user, it seems to me I'd want to know if I'm not working with the encoding I think I am. Either way, you're probably going to have to either fix your text or pass another encoding. Replace makes it such that you wouldn't discover this until later right?

@takluyver
Copy link
Contributor

The user has the option of doing so, but I guess most of the time they're going to leave it at the default, which I believe follows the encoding specified by locale (UTF-8 for Mac & most Linux, a particular code page for Windows).

In general, I'd agree with failing early and loudly, but with encoding, either it's a tiny detail, and it's a pain to have to keep guessing at encodings when you don't really care, or the text will be obviously gibberish if you get it wrong.

I think for opening files, we use a compromise - if the user specifies an encoding, we use errors='strict' so it fails if they're wrong, but if they leave it at the default, we use errors='replace'. We could do the same here.

@jseabold
Copy link
Contributor Author

Added the error handling.

@takluyver
Copy link
Contributor

Sorry, having tested it, it turns out that TextIOWrapper doesn't play nicely with an http response - if you try to read it line by line, it closes after the first read, and the next line fails with ValueError: I/O operation on closed file. I've worked out what needs doing, I'll make a mini-PR against this PR.

@jseabold
Copy link
Contributor Author

Thanks. Updated the PR.

@takluyver
Copy link
Contributor

Great, then this is alright as far as I'm concerned, so I'll ping @wesm and @adamklein to look at it.

@wesm
Copy link
Member

wesm commented Apr 2, 2012

@jseabold things have become a bit of a merging mess after PR #952. would you mind taking a crack at merging this work onto pydata/master?

@jseabold
Copy link
Contributor Author

jseabold commented Apr 2, 2012

Yeah, I'll have a look.

@jseabold
Copy link
Contributor Author

jseabold commented Apr 2, 2012

Rebased on master and force pushed. Should be okay now as long as it doesn't screw up @takluyver

@wesm wesm merged commit 2e3f7f4 into pandas-dev:master Apr 2, 2012
@wesm
Copy link
Member

wesm commented Apr 2, 2012

thanks dude. everything looks OK (haven't run tests on py3 yet but will soon)

wesm added a commit that referenced this pull request Apr 2, 2012
@wesm
Copy link
Member

wesm commented Apr 2, 2012

might want to add some logic (at soem point) to skip the url test in some cases, but maybe no big deal. i pointed it at pydata/pandas now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants