Support categorical variables with CSVs #10153

Closed
esafak opened this Issue May 16, 2015 · 4 comments

esafak commented May 16, 2015

It would be nice to be able to read CSVs with categorical variables using read_csv's dtype parameter instead of casting the columns after the fact.

Contributor

TomAugspurger commented May 16, 2015

I'm not opposed to this in principle, but I think the API will necessarily be clunky. Would we require (or allow) the user to specify all categories in the call to read_csv?

@esafak we do support categoricals in read/write_hdf if that's an option for you (it may not be).

esafak commented May 16, 2015

Can't we already declare the dtypes of selected columns? I thought the problem was limited to categoricals, but if not, please expand my request to all dtypes.

Contributor

TomAugspurger commented May 16, 2015

You can specify the types. I was just thinking

pd.read_csv('file.csv', dtype={'A': np.int64, 'B': pd.CategoricalDtype(['cat1', 'cat2', 'cat3'])})

which means you'd need to know all the categories up front. Or we infer them and you'll need to check that there aren't any surprising categories.
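
For reference, both options can be tried in later pandas releases, where read_csv accepts categorical dtypes through its dtype parameter (this was not available when the comment above was written). A minimal sketch, assuming a recent pandas version; the in-memory CSV text and the `expected` list are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'file.csv'.
csv_text = "A,B\n1,cat1\n2,cat2\n3,cat1\n"
expected = ["cat1", "cat2", "cat3"]

# Option 1: declare every category up front.
declared = pd.CategoricalDtype(expected)
df_declared = pd.read_csv(
    io.StringIO(csv_text), dtype={"A": "int64", "B": declared}
)

# Option 2: let pandas infer the categories from the data,
# then check for anything that was not expected.
df_inferred = pd.read_csv(
    io.StringIO(csv_text), dtype={"A": "int64", "B": "category"}
)
unexpected = set(df_inferred["B"].cat.categories) - set(expected)

print(list(df_declared["B"].cat.categories))
print(sorted(df_inferred["B"].cat.categories))
print(unexpected)
```

With declared categories, a category that never occurs in the file ('cat3' here) is still part of the dtype. With inferred categories, only the values actually seen appear, so the set difference is one way to spot surprises.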

Member

sinhrks commented Jul 11, 2015

Nice workaround, but I think it would still be nice to support a category argument.

As a first step, how about converting the specified columns to Categorical after parsing? Though it would be very nice to have optimized IO logic...
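
The "convert after parsing" first step could look roughly like the sketch below. The helper name `read_csv_with_categories` and its `category_columns` parameter are hypothetical and not part of pandas; the helper parses the file as usual and then casts the named columns, so it does not give the optimized IO path mentioned above.

```python
import io

import pandas as pd


def read_csv_with_categories(source, category_columns, **kwargs):
    """Hypothetical helper: parse normally, then cast columns to category."""
    df = pd.read_csv(source, **kwargs)
    for column in category_columns:
        # The cast happens after parsing, so the column is first
        # materialized in its default parsed dtype.
        df[column] = df[column].astype("category")
    return df


# Made-up in-memory data for illustration.
df = read_csv_with_categories(io.StringIO("A,B\n1,x\n2,y\n3,x\n"), ["B"])
print(df.dtypes)
```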
