Minor change to csv reading #146

Closed
wants to merge 3 commits into from

@MLnick
commented Sep 16, 2011

Hi Wes

Firstly, congrats on such an amazing project! I love prototyping in python / numpy / ipython, but I always envy some of R's features. I tried pandas out about 9 months ago and although it was interesting, it seemed very rough around the edges. Now, however, it is looking really polished and I've been using it for prototyping and testing some trading models, and everything works extremely well. I hope it keeps growing, together with the integration with scikits statsmodels/timeseries and maybe even scikits.learn in future ...

Anyway, as I was starting to dive into the code, I came across the read_csv function and noticed that its code is fully duplicated in read_table. Python's csv module has full support for arbitrary delimiters, so there is no need for the duplication. There is also csv.Sniffer().sniff(sample), which attempts to sniff out the delimiter automatically. This commit tries to "magically" handle any arbitrary CSV file without needing a separator to be specified, whether the fields are separated by blank spaces, tabs, commas, semicolons or other weird separators (I have a file at work with "^" separators :). If that doesn't work, one can fall back on specifying the separator (so read_csv looks more like read_table). In future it could make sense to simply have one read_data or read_table function.
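To make the idea concrete, here is a minimal sketch using only the standard library. It is not the code from this commit; the helper name `sniff_delimiter` and the sample data are made up for illustration.

```python
import csv
import io


def sniff_delimiter(sample):
    """Hypothetical helper: ask csv.Sniffer to guess the delimiter of `sample`."""
    # sniff() raises csv.Error when it cannot settle on a delimiter.
    return csv.Sniffer().sniff(sample).delimiter


# Made-up sample with an unusual "^" separator.
caret_text = (
    "date^open^close\n"
    "2011-09-16^10.5^10.7\n"
    "2011-09-17^10.7^10.9\n"
)

sep = sniff_delimiter(caret_text)

# The same csv.reader handles any single-character delimiter,
# so one code path can serve comma-, tab- or caret-separated input.
rows = list(csv.reader(io.StringIO(caret_text), delimiter=sep))
```

If the guess is wrong for a given file, the caller can skip `sniff_delimiter` and pass the separator to `csv.reader` directly.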

Incidentally, csv.Sniffer() also tries to sniff out other things like quote escaping and double quoting, but this commit effectively only uses it for the delimiter. If users run into problems with quote / string escaping, one could always let the sniffer try to figure out the full dialect.
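As a rough sketch of what using the full dialect might look like (again not part of this commit, and with made-up sample data), the object returned by `sniff()` can be handed to `csv.reader` as a whole rather than only reading its `delimiter` attribute:

```python
import csv
import io

# Made-up sample in which quoted fields contain the delimiter.
quoted_text = (
    "id,name,city\n"
    '1,"Smith, John",London\n'
    '2,"Doe, Jane",Paris\n'
)

# sniff() returns a Dialect subclass carrying the guessed delimiter,
# quotechar, doublequote and skipinitialspace settings.
dialect = csv.Sniffer().sniff(quoted_text)

# Passing the whole dialect lets the reader apply the sniffed quoting rules too.
quoted_rows = list(csv.reader(io.StringIO(quoted_text), dialect))
```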

@MLnick

Owner

commented on 11ccfab Sep 16, 2011

The csv module already supports arbitrary delimiters (separators), so only one function is actually needed, read_csv. read_table can simply be a convenience wrapper for tab-separated files. In addition, csv.Sniffer() attempts to determine the separator automatically given a sample string (in this case the first line, which usually contains the headers). So read_csv should now work on any standard CSV file without having to specify a separator, whether tab-, comma-, semicolon- or otherwise separated. The sep parameter can still be used to control this if necessary.
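The shape described above could be sketched roughly as follows. This is an illustration under assumptions, not the pandas implementation: `read_rows` and `read_tab_rows` are hypothetical names, they return plain lists of rows, and the buffer is assumed to be seekable.

```python
import csv
import io


def read_rows(buf, sep=None):
    """Hypothetical reader: sniff the separator from the first line unless given."""
    if sep is None:
        first_line = buf.readline()
        buf.seek(0)  # rewind so the header line is parsed as well (needs a seekable buffer)
        # A single line gives the sniffer little to go on; csv.Error is raised
        # if it cannot decide, and callers can always pass sep explicitly.
        sep = csv.Sniffer().sniff(first_line).delimiter
    return list(csv.reader(buf, delimiter=sep))


def read_tab_rows(buf):
    """Hypothetical convenience wrapper for tab-separated input."""
    return read_rows(buf, sep="\t")


# Made-up inputs: one sniffed, one via the wrapper, one with an explicit sep.
semicolon_rows = read_rows(io.StringIO("date;open;close\n2011-09-16;10.5;10.7\n"))
tab_rows = read_tab_rows(io.StringIO("a\tb\n1\t2\n"))
caret_rows = read_rows(io.StringIO("a^b\n1^2\n"), sep="^")
```

Keeping `sep` as an optional argument means the explicit path stays available for files where a one-line sample is not enough to guess from.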

@wesm

Member

commented Sep 17, 2011

Rebased into wesm/master. Thanks for these changes, a big help!

@wesm wesm closed this Sep 17, 2011
