ENH: allow usecols to be case insensitive #14154

Closed
aechase opened this Issue Sep 5, 2016 · 9 comments

Comments

Projects
None yet
3 participants
Contributor

aechase commented Sep 5, 2016

I have a wide delimited file that includes columns with predictable names but unpredictable capitalisation, as shown in this toy code.

import pandas as pd
from io import StringIO

data = """1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4
5,5,5,5"""
names = ['col1', 'col2', 'COL3', 'Col4']
df = pd.read_csv(StringIO(data), names=names, usecols=None)
df.head()
col1 col2 COL3 Col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
4 5 5 5 5

I want to pass a list of column names to usecols when importing the file, but I won't be able to unless I match the names exactly:

df = pd.read_csv(StringIO(data), names=names, usecols=['col3', 'col4'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-92d038c5ac8c> in <module>()
      1 cols = ['col3', 'col4']
----> 2 df1 = pd.read_csv('test.csv', usecols=['col3', 'col4'])
      3 df1.head()

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    313 
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316 
    317     if (nrows is not None) and (chunksize is not None):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644 
--> 645         self._make_engine(self.engine)
    646 
    647     def close(self):

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1255 
   1256             if len(self.names) < len(self.usecols):
-> 1257                 raise ValueError("Usecols do not match names.")
   1258 
   1259         self._set_noconvert_columns()

ValueError: Usecols do not match names.

I see two problems here. First, the error text would be more helpful if it specified what column name(s) is/are wrong. That's pretty easy to fix. Second, if I know that capitalisation might be a problem, I have to check every column name individually before I can pass a list to usecols. One possible solution would be to modify this part and this part of parsers.py to case-transform the lists of column names before matching. This behaviour could be made the default, or it could be optional via an ignore_col_case keyword in the various read functions.


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Contributor

jreback commented Sep 5, 2016

a better soln would be to allow usecols to take a callable. Then you can validate however you want, e.g.

df = pd.read_csv(StringIO(data), names=names, usecols=lambda x: x.lower() in ['col3', 'col4'])

Contributor

jreback commented Sep 5, 2016

jreback added this to the Next Major Release milestone Sep 5, 2016

Contributor

jreback commented Sep 5, 2016

yes a better error message would be nice as well.

Contributor

aechase commented Sep 5, 2016

@jreback should passing a callable to usecols implicitly rename the columns as well, or should it preserve the original formatting?

Contributor

jreback commented Sep 5, 2016

no, it takes a single column and returns a boolean. These should match the names,

In [1]: def usecols(x):
   ...:     return x.lower() in ['col3', 'col4']
   ...: 

In [2]: [usecols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[2]: [False, False, True, True]
Contributor

jreback commented Sep 5, 2016 edited

this would also easily allow #10882

In [3]: def skipcols(x):
   ...:     return x.lower() not in ['col3', 'col4']

In [4]: [skipcols(x) for x in ['col1', 'col2', 'Col3', 'col4']]
Out[4]: [True, True, False, False]

though skipcols could be a separate kw.

Member

gfyoung commented Sep 6, 2016

This indeed would be a nice enhancement! However, perhaps a better name (that would encompass everything) would be colfilter?

Contributor

aechase commented Sep 6, 2016

@jreback I'll need a few days to make all these changes. In the meantime, are you interested in a PR that just updates the error message?

Contributor

jreback commented Sep 6, 2016 edited

sure a separate PR for the error message would be be great

@jreback jreback modified the milestone: 0.20.0, Next Major Release Dec 4, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment