Allow read_csv to take URLs #970

jseabold · 2012-03-26T18:32:52Z

This allows read_csv to take URLs. The tests will need to be modified after it's merged to point to the new repo URL, or to wherever else you want to host a test file. I also have no idea what the file:// path should be on non-posix systems, so this path might need some adjustment. Not sure. There's no test case for ftp, but I don't see why it wouldn't work the same as http.
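For readers following along, here is a minimal sketch of how a reader function might decide whether its argument is a URL, using the standard library's `urllib.parse` (Python 3). The helper name `looks_like_url` and the set of accepted schemes are assumptions made for this example. They are not taken from the code in this PR.

```python
from urllib.parse import urlparse

# Assumed scheme list for illustration; the thread mentions http, https, ftp and file.
_URL_SCHEMES = {"http", "https", "ftp", "file"}


def looks_like_url(path_or_buf):
    """Return True if the argument is a string with a recognised URL scheme.

    Hypothetical helper, not the PR's implementation.
    """
    if not isinstance(path_or_buf, str):
        return False  # open file handles and buffers are never treated as URLs
    return urlparse(path_or_buf).scheme in _URL_SCHEMES
```

On the file:// question for non-posix systems, `pathlib` can build the URI from a path: the standard library documentation gives `PureWindowsPath('c:/Windows').as_uri()` returning `'file:///c:/Windows'`.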

takluyver · 2012-03-26T20:55:20Z

On Python 3, the HTTP response produces bytes, but the parser implementation expects strings (i.e. unicode). I think the correct way would be to wrap the response in an io.TextIOWrapper, using the encoding passed to read_csv.
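A minimal sketch of the wrapping idea on Python 3, with `io.BytesIO` standing in for the HTTP response so the example runs without network access. The function name `text_stream` is hypothetical, and the default encoding is an assumption for this example.

```python
import io


def text_stream(raw, encoding="utf-8"):
    """Wrap a binary file-like object so that reads return str instead of bytes."""
    return io.TextIOWrapper(raw, encoding=encoding)


# io.BytesIO is a stand-in for the bytes-mode response body.
raw = io.BytesIO("a,b\n1,é\n".encode("utf-8"))
lines = list(text_stream(raw))  # iterating yields decoded text lines
```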

jseabold · 2012-03-26T21:01:51Z

What about numpy.compat.asstr or similar? io doesn't exist in Python 2.5.

takluyver · 2012-03-26T22:25:27Z

codecs.StreamReader should exist in 2.5, but I think whatever we use needs an if PY3 anyway, because the machinery for handling CSV files on Python 2 expects a bytes-mode file-like object.

numpy.compat.asstr looks like it's hardcoded to Latin-1, so I don't think that's quite right.

jseabold · 2012-03-29T22:39:10Z

Is this what you have in mind?

takluyver · 2012-03-30T11:34:23Z

Yes, that looks sensible, although I haven't tested it yet. I'd also specify errors='replace' when creating the wrapper, so that it will tolerate an incorrectly guessed encoding.

jseabold · 2012-03-30T14:27:15Z

Hmm. Who guesses the encoding? The user? Or is there some BOM checking somewhere?

If the user, it seems to me I'd want to know if I'm not working with the encoding I think I am. Either way, you're probably going to have to either fix your text or pass another encoding. With replace, you wouldn't discover this until later, right?

takluyver · 2012-03-30T22:05:08Z

The user has the option of doing so, but I guess most of the time they're going to leave it at the default, which I believe follows the encoding specified by locale (UTF-8 for Mac & most Linux, a particular code page for Windows).

In general, I'd agree with failing early and loudly, but with encoding, either it's a tiny detail, and it's a pain to have to keep guessing at encodings when you don't really care, or the text will be obviously gibberish if you get it wrong.

I think for opening files, we use a compromise - if the user specifies an encoding, we use errors='strict' so it fails if they're wrong, but if they leave it at the default, we use errors='replace'. We could do the same here.
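The compromise described above could be sketched roughly as follows. The name `wrap_with_policy` is hypothetical, and using `locale.getpreferredencoding` as the default is an assumption based on the description in this comment rather than on the actual code.

```python
import io
import locale


def wrap_with_policy(raw, encoding=None):
    """Wrap a binary stream, choosing the error handler from the caller's intent.

    Hypothetical helper illustrating the policy discussed in this thread.
    """
    if encoding is None:
        # No encoding given: fall back to the locale default and tolerate bad bytes.
        encoding = locale.getpreferredencoding(False)
        errors = "replace"
    else:
        # Encoding given explicitly: fail loudly if the data does not match it.
        errors = "strict"
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors)
```

With an explicit `encoding="utf-8"`, undecodable bytes raise `UnicodeDecodeError` on read. With no encoding given, the same bytes are decoded with replacement characters where needed.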

jseabold · 2012-03-31T18:56:24Z

Added the error handling.

takluyver · 2012-03-31T20:14:37Z

Sorry, having tested it, it turns out that TextIOWrapper doesn't play nicely with an http response - if you try to read it line by line, it closes after the first read, and the next line fails with ValueError: I/O operation on closed file. I've worked out what needs doing, I'll make a mini-PR against this PR.
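One common way around this kind of problem is to read the whole response into memory first and wrap an `io.BytesIO` instead of the live response. This is only a sketch of that general workaround, with a hypothetical function name. It is not necessarily what the mini-PR does.

```python
import io


def buffered_text_stream(response, encoding="utf-8", errors="strict"):
    """Read the full response body into memory, then wrap it for text-mode reads."""
    data = response.read()  # pull every byte up front, before the response closes
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding, errors=errors)
```

The trade-off is that the entire file is held in memory, which is usually acceptable for CSV files of modest size but gives up streaming.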

jseabold · 2012-03-31T20:27:46Z

Thanks. Updated the PR.

takluyver · 2012-03-31T20:58:40Z

Great, then this is alright as far as I'm concerned, so I'll ping @wesm and @adamklein to look at it.

wesm · 2012-04-02T19:22:32Z

@jseabold things have become a bit of a merging mess after PR #952. would you mind taking a crack at merging this work onto pydata/master?

jseabold · 2012-04-02T19:47:27Z

Yeah, I'll have a look.

jseabold · 2012-04-02T20:01:18Z

Rebased on master and force pushed. Should be okay now as long as it doesn't screw up @takluyver

wesm · 2012-04-02T20:05:23Z

thanks dude. everything looks OK (haven't run tests on py3 yet but will soon)

wesm · 2012-04-02T20:07:03Z

might want to add some logic (at some point) to skip the url test in some cases, but maybe no big deal. i pointed it at pydata/pandas now

jseabold and others added 9 commits April 2, 2012 15:52

ENH: Allow read_csv to take a URL

749c8f8

TST: Add test data for URL io

65701f3

ENH: Allow https in url

dca72cc

REF: Go ahead and import urlparse

60922a3

ENH: Only give strings to _is_url

03e59a5

TST: Add tests for read_table with URL

f3788a0

ENH: Py3 compatibility for reading URLs

1d1292b

ENH: Improve encoding error handling for URLs in Py3

0dcab67

Various fixes so test_url() passes on Python 3.

2e3f7f4

wesm merged commit 2e3f7f4 into pandas-dev:master Apr 2, 2012

wesm added a commit that referenced this pull request Apr 2, 2012

TST: point test url to github for #970

a5a2a04

jseabold mentioned this pull request Sep 24, 2012

genfromdta does not accept file:// URI schemes. statsmodels/statsmodels#475

Closed
