Infer datetime format #6021

Merged
merged 2 commits into from Jan 24, 2014
34 changes: 34 additions & 0 deletions doc/source/io.rst
@@ -500,6 +500,40 @@ a single date rather than the entire array.

.. _io.dayfirst:


+Inferring Datetime Format
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If you have `parse_dates` enabled for some or all of your columns, and your
Review comment (Member):

In the previous sections, parse_dates is always written with double backticks (``parse_dates``). Double backticks render as code, while single backticks render as italic.

+datetime strings are all formatted the same way, you may get a large speed
+up by setting `infer_datetime_format=True`. If set, pandas will attempt
+to guess the format of your datetime strings, and then use a faster means
+of parsing the strings. Parsing speed-ups of 5-10x have been observed. pandas
+will fall back to the usual parsing if either the format cannot be guessed
+or the format that was guessed cannot properly parse the entire column
+of strings. So in general, `infer_datetime_format` should not have any
+negative consequences if enabled.

+Here are some examples of datetime strings that can be guessed (all
+representing December 30th, 2011 at 00:00:00):
+
+"20111230"
+"2011/12/30"
+"20111230 00:00:00"
+"12/30/2011 00:00:00"
+"30/Dec/2011 00:00:00"
+"30/December/2011 00:00:00"
Review comment (Member):

Can you make this a list (just put - before each item), or otherwise a code-block, as you like? A single line break on its own will be disregarded by Sphinx.


+`infer_datetime_format` is sensitive to `dayfirst`. With `dayfirst=True`, it
+will guess "01/12/2011" to be December 1st. With `dayfirst=False` (default),
+it will guess "01/12/2011" to be January 12th.
+
+.. ipython:: python
+
+   # Try to infer the format for the index column
+   df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
Review comment (Member):

foo.csv has already been removed by this point (under 'Specifying date columns'), so you will have to move that removal below this block.

+                    infer_datetime_format=True)


International Date Formats
~~~~~~~~~~~~~~~~~~~~~~~~~~
While US date formats tend to be MM/DD/YYYY, many international formats use
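The format guessing and the `dayfirst` sensitivity described in the io.rst text above can be illustrated with a small standard-library sketch. This is not how pandas implements inference: the fixed candidate lists and the `guess_format` helper are hypothetical names introduced here, and `datetime.strptime` stands in for the faster parser.

```python
from datetime import datetime

# Hypothetical candidate formats for this sketch only; pandas' real
# inference inspects the string itself instead of trying a fixed list.
UNAMBIGUOUS = ["%Y%m%d %H:%M:%S", "%Y%m%d", "%Y/%m/%d",
               "%d/%b/%Y %H:%M:%S", "%d/%B/%Y %H:%M:%S"]
MONTH_FIRST = ["%m/%d/%Y %H:%M:%S", "%m/%d/%Y"]
DAY_FIRST = ["%d/%m/%Y %H:%M:%S", "%d/%m/%Y"]


def guess_format(sample, dayfirst=False):
    """Return the first candidate format that parses `sample`, or None."""
    # dayfirst only changes which reading of an ambiguous string is tried first.
    ambiguous = DAY_FIRST + MONTH_FIRST if dayfirst else MONTH_FIRST + DAY_FIRST
    for fmt in UNAMBIGUOUS + ambiguous:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        return fmt
    return None


# The six example strings from the documentation text
# (month names assume the default C/English locale).
samples = ["20111230", "2011/12/30", "20111230 00:00:00",
           "12/30/2011 00:00:00", "30/Dec/2011 00:00:00",
           "30/December/2011 00:00:00"]
parsed = [datetime.strptime(s, guess_format(s)) for s in samples]
```

With these candidates, `guess_format("01/12/2011", dayfirst=True)` returns `"%d/%m/%Y"` (December 1st), while the default returns `"%m/%d/%Y"` (January 12th). A string that matches no candidate yields `None`, which is the point at which a caller would fall back to the usual parsing.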
14 changes: 14 additions & 0 deletions doc/source/v0.13.1.txt
@@ -107,6 +107,20 @@ Enhancements
result
result.loc[:,:,'ItemA']

+- Added optional `infer_datetime_format` to `read_csv`, `Series.from_csv` and
+  `DataFrame.from_csv` (:issue:`5490`)
+
+  If `parse_dates` is enabled and this flag is set, pandas will attempt to
+  infer the format of the datetime strings in the columns, and if it can
+  be inferred, switch to a faster method of parsing them. In some cases
+  this can increase the parsing speed by ~5-10x.
+
+  .. ipython:: python
+
+     # Try to infer the format for the index column
+     df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
Review comment (Member):

foo.csv will not be known here (it was generated and removed in io.rst itself). So either make this a code-block (so the code won't be executed), or add the creation of foo.csv. I think the first option is the easiest.

+                      infer_datetime_format=True)

Experimental
~~~~~~~~~~~~

10 changes: 8 additions & 2 deletions pandas/core/frame.py
@@ -947,7 +947,8 @@ def _from_arrays(cls, arrays, columns, index, dtype=None):

     @classmethod
     def from_csv(cls, path, header=0, sep=',', index_col=0,
-                 parse_dates=True, encoding=None, tupleize_cols=False):
+                 parse_dates=True, encoding=None, tupleize_cols=False,
+                 infer_datetime_format=False):
         """
         Read delimited file into DataFrame

@@ -966,6 +967,10 @@ def from_csv(cls, path, header=0, sep=',', index_col=0,
         tupleize_cols : boolean, default False
             write multi_index columns as a list of tuples (if True)
             or new (expanded format) if False)
+        infer_datetime_format : boolean, default False
+            If True and `parse_dates` is True for a column, try to infer the
+            datetime format based on the first datetime string. If the format
+            can be inferred, there often will be a large parsing speed-up.

         Notes
         -----
@@ -980,7 +985,8 @@ def from_csv(cls, path, header=0, sep=',', index_col=0,
         from pandas.io.parsers import read_table
         return read_table(path, header=header, sep=sep,
                           parse_dates=parse_dates, index_col=index_col,
-                          encoding=encoding, tupleize_cols=tupleize_cols)
+                          encoding=encoding, tupleize_cols=tupleize_cols,
+                          infer_datetime_format=infer_datetime_format)

     def to_sparse(self, fill_value=None, kind='block'):
         """
9 changes: 7 additions & 2 deletions pandas/core/series.py
@@ -2178,7 +2178,7 @@ def between(self, left, right, inclusive=True):

     @classmethod
     def from_csv(cls, path, sep=',', parse_dates=True, header=None,
-                 index_col=0, encoding=None):
+                 index_col=0, encoding=None, infer_datetime_format=False):
         """
         Read delimited file into Series

@@ -2197,6 +2197,10 @@ def from_csv(cls, path, sep=',', parse_dates=True, header=None,
         encoding : string, optional
             a string representing the encoding to use if the contents are
             non-ascii, for python versions prior to 3
+        infer_datetime_format : boolean, default False
+            If True and `parse_dates` is True for a column, try to infer the
+            datetime format based on the first datetime string. If the format
+            can be inferred, there often will be a large parsing speed-up.

         Returns
         -------
@@ -2205,7 +2209,8 @@ def from_csv(cls, path, sep=',', parse_dates=True, header=None,
         from pandas.core.frame import DataFrame
         df = DataFrame.from_csv(path, header=header, index_col=index_col,
                                 sep=sep, parse_dates=parse_dates,
-                                encoding=encoding)
+                                encoding=encoding,
+                                infer_datetime_format=infer_datetime_format)
         result = df.icol(0)
         result.index.name = result.name = None
         return result
35 changes: 28 additions & 7 deletions pandas/io/parsers.py
@@ -16,6 +16,7 @@
 from pandas.core.config import get_option
 from pandas.io.date_converters import generic_parser
 from pandas.io.common import get_filepath_or_buffer
+from pandas.tseries import tools

 from pandas.util.decorators import Appender

@@ -143,6 +144,9 @@
 warn_bad_lines: boolean, default True
     If error_bad_lines is False, and warn_bad_lines is True, a warning for each
     "bad line" will be output. (Only valid with C parser).
+infer_datetime_format : boolean, default False
+    If True and parse_dates is enabled for a column, attempt to infer
+    the datetime format to speed up the processing

 Returns
 -------
@@ -262,6 +266,7 @@ def _read(filepath_or_buffer, kwds):
     'compression': None,
     'mangle_dupe_cols': True,
     'tupleize_cols': False,
+    'infer_datetime_format': False,
 }


@@ -349,7 +354,8 @@ def parser_f(filepath_or_buffer,
                  encoding=None,
                  squeeze=False,
                  mangle_dupe_cols=True,
-                 tupleize_cols=False):
+                 tupleize_cols=False,
+                 infer_datetime_format=False):

         # Alias sep -> delimiter.
         if delimiter is None:
@@ -408,7 +414,8 @@ def parser_f(filepath_or_buffer,
                     low_memory=low_memory,
                     buffer_lines=buffer_lines,
                     mangle_dupe_cols=mangle_dupe_cols,
-                    tupleize_cols=tupleize_cols)
+                    tupleize_cols=tupleize_cols,
+                    infer_datetime_format=infer_datetime_format)

         return _read(filepath_or_buffer, kwds)

@@ -665,9 +672,13 @@ def __init__(self, kwds):
         self.true_values = kwds.get('true_values')
         self.false_values = kwds.get('false_values')
         self.tupleize_cols = kwds.get('tupleize_cols', False)
+        self.infer_datetime_format = kwds.pop('infer_datetime_format', False)

-        self._date_conv = _make_date_converter(date_parser=self.date_parser,
-                                               dayfirst=self.dayfirst)
+        self._date_conv = _make_date_converter(
+            date_parser=self.date_parser,
+            dayfirst=self.dayfirst,
+            infer_datetime_format=self.infer_datetime_format
+        )

         # validate header options for mi
         self.header = kwds.get('header')
@@ -1178,6 +1189,10 @@ def TextParser(*args, **kwds):
         Encoding to use for UTF when reading/writing (ex. 'utf-8')
     squeeze : boolean, default False
         returns Series if only one column
+    infer_datetime_format : boolean, default False
+        If True and `parse_dates` is True for a column, try to infer the
+        datetime format based on the first datetime string. If the format
+        can be inferred, there often will be a large parsing speed-up.
     """
     kwds['engine'] = 'python'
     return TextFileReader(*args, **kwds)
@@ -1870,13 +1885,19 @@ def _get_lines(self, rows=None):
         return self._check_thousands(lines)


-def _make_date_converter(date_parser=None, dayfirst=False):
+def _make_date_converter(date_parser=None, dayfirst=False,
+                         infer_datetime_format=False):
     def converter(*date_cols):
         if date_parser is None:
             strs = _concat_date_cols(date_cols)
             try:
-                return tslib.array_to_datetime(com._ensure_object(strs),
-                                               utc=None, dayfirst=dayfirst)
+                return tools.to_datetime(
+                    com._ensure_object(strs),
+                    utc=None,
+                    box=False,
+                    dayfirst=dayfirst,
+                    infer_datetime_format=infer_datetime_format
+                )
             except:
                 return lib.try_parse_dates(strs, dayfirst=dayfirst)
         else:
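The shape of `_make_date_converter` in the hunk above (a factory that captures its options in a closure, joins the date columns, tries a fast path, and falls back on failure) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pandas code: `make_date_converter` and `parse_with_any` are hypothetical names, `datetime.strptime` stands in for both `tools.to_datetime` and `lib.try_parse_dates`, and only two formats are recognised.

```python
from datetime import datetime


def make_date_converter(date_parser=None, dayfirst=False,
                        infer_datetime_format=False):
    # Formats known to this sketch, ordered by the dayfirst preference.
    formats = ["%m/%d/%Y %H:%M:%S", "%d/%m/%Y %H:%M:%S"]
    if dayfirst:
        formats.reverse()

    def parse_with_any(text):
        """Slow path: return (value, format) for the first format that fits."""
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt), fmt
            except ValueError:
                continue
        raise ValueError("unrecognized datetime string: %r" % (text,))

    def converter(*date_cols):
        # Join parallel columns row by row, e.g. a date column and a time column.
        strs = [" ".join(parts) for parts in zip(*date_cols)]
        if date_parser is not None:
            return [date_parser(s) for s in strs]
        if infer_datetime_format and strs:
            try:
                # Guess the format from the first string, then reuse it for all.
                _, fmt = parse_with_any(strs[0])
                return [datetime.strptime(s, fmt) for s in strs]
            except ValueError:
                pass  # some string did not fit the guess; use the slow path
        return [parse_with_any(s)[0] for s in strs]

    return converter


convert = make_date_converter(dayfirst=True, infer_datetime_format=True)
result = convert(["01/12/2011", "02/12/2011"], ["00:00:00", "12:30:00"])
```

Because the options are captured once when the converter is built, the parser only has to thread `infer_datetime_format` through to the factory call, which is what the `__init__` hunk above does.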