
API deprecate date_parser, add date_format #50601

Closed
MarcoGorelli opened this issue Jan 6, 2023 · 33 comments · Fixed by #51019
Labels
API Design · Datetime (Datetime data dtype) · IO CSV (read_csv, to_csv)

Comments

@MarcoGorelli
Member

MarcoGorelli commented Jan 6, 2023

TLDR the conversation here goes on for a bit, but to summarise, the suggestion is:

  • deprecate date_parser, because it always hurts performance (counter examples welcome!)
  • add date_format, because that can boost performance
  • for anything else, amend the docs to make clear that users should first read in the data as object, and then apply their parsing

Performance-wise, this would only be an improvement to the status quo
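
For concreteness, the "read in as object, then parse" pattern from the last bullet can be sketched like this (a minimal self-contained example; the data is made up for illustration, not taken from the issue):

```python
import io

import pandas as pd

# Hypothetical CSV with a year-day-month timestamp column.
data = "ts_col\n2000-31-12 23:59:59\n2001-01-06 12:00:00\n"

# Pass 1: read the column in as plain strings (object dtype).
df = pd.read_csv(io.StringIO(data))

# Pass 2: parse explicitly, with full control over the format.
df["ts_col"] = pd.to_datetime(df["ts_col"], format="%Y-%d-%m %H:%M:%S")
```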


As far as I can tell, date_parser is a net negative and only ever slows things down

In the best case, it only results in a slight degradation:

import io

import pandas as pd

timestamp_format = '%Y-%d-%m %H:%M:%S'

date_index = pd.date_range(start='1900', end='2000')

dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
data = dates_df.to_csv()
In [6]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data),
   ...:     date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
   ...:     parse_dates=['ts_col']
   ...: )
   ...: 
   ...: 
111 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data),
   ...:     #date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
   ...:     #parse_dates=['ts_col']
   ...: )
   ...: df['ts_col'] = pd.to_datetime(df['ts_col'], format=timestamp_format)
   ...: 
   ...: 
75.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Parsing element-by-element (du_parse here is dateutil.parser.parse) is also slower than just using .apply:

In [21]: %%timeit
    ...: df = pd.read_csv(io.StringIO(data),
    ...:     #date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
    ...:     #parse_dates=['ts_col']
    ...: )
    ...: df['ts_col'].apply(lambda x: du_parse(x, dayfirst=True))
    ...: 
    ...: 
1.13 s ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %%timeit
    ...: df = pd.read_csv(io.StringIO(data),
    ...:     date_parser=lambda x: du_parse(x, dayfirst=True),
    ...:     parse_dates=['ts_col']
    ...: )
    ...: 
    ...: 
1.19 s ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In the worst case, it results in 65x performance degradation, see #50586 (comment) (and this gets way worse for larger datasets)

My suggestion is:

  • deprecate date_parser
  • introduce date_format, which would actually deliver a performance improvement:
In [1]: timestamp_format = '%Y-%d-%m %H:%M:%S'
   ...: 
   ...: date_index = pd.date_range(start='1900', end='2000')
   ...: 
   ...: dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
   ...: data = dates_df.to_csv()

In [2]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data),
   ...:     date_format=timestamp_format,
   ...:     parse_dates=['ts_col']
   ...: )
   ...: 
   ...: 
19.5 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's over 3 times as fast!

@sm-Fifteen

The main drawback I can see of switching from a callback to a format string is that the callback can inspect Series.name and apply a different pattern to each date column, though that could be remedied by allowing a dict that maps column names to date formats.

Is there concern that the other parameters of to_datetime may end up having to be added to read_csv, like utc=, unit=, or origin=, or are these not a problem for datetime column parsing (at read time, anyway)?

At any rate, I certainly feel that this would be a pretty sizable improvement to date parsing, not just in terms of performance gains but also in terms of usability.
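
As it happens, the per-column mapping suggested here is expressible with the date_format argument that was eventually added: per the current docs (quoted later in this thread), date_format accepts either a string or a dict of column -> format. A minimal sketch, assuming pandas >= 2.0 and made-up data:

```python
import io

import pandas as pd

data = (
    "date_euro,date_us\n"
    "31/12/1999 00:00:00,12/31/1999 00:00:00\n"
    "01/06/2000 00:00:00,06/01/2000 00:00:00\n"
)

df = pd.read_csv(
    io.StringIO(data),
    parse_dates=["date_euro", "date_us"],
    date_format={
        "date_euro": "%d/%m/%Y %H:%M:%S",  # day-first
        "date_us": "%m/%d/%Y %H:%M:%S",    # month-first
    },
)
```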

@MarcoGorelli
Member Author

Thanks for your input, much appreciated

how about to_datetime_kwargs? then dayfirst could be deprecated as well. so then someone can run

read_csv(data, parse_dates=['ts_col'], to_datetime_kwargs={'format': datetime_format, 'utc': False, 'dayfirst': True})

and it would be performant, because to_datetime wouldn't need to be called twice in

result = tools.to_datetime(
date_parser(*date_cols), errors="ignore", cache=cache_dates
)

Instead, the kwargs could go directly to

return tools.to_datetime(
ensure_object(strs),
utc=False,
dayfirst=dayfirst,
errors="ignore",
cache=cache_dates,
).to_numpy()

@MarcoGorelli
Member Author

The main drawback I can see of switching from a callback to a format string is that the callback enables you to use Series.name to use different patterns for each date column

not sure I see what you mean - is this currently supported? do you have an example please?

@sm-Fifteen

The main drawback I can see of switching from a callback to a format string is that the callback enables you to use Series.name to use different patterns for each date column

not sure I see what you mean - is this currently supported? do you have an example please?

Something like this:

import pandas as pd
from io import StringIO

date_euro = pd.date_range(start='1900', end='2000', freq='1D').strftime("%d/%m/%Y %H:%M:%S").to_list()
date_us = pd.date_range(start='1900', end='2000', freq='1D').strftime("%m/%d/%Y %H:%M:%S").to_list()

date_df_pre = pd.DataFrame({
    'date_euro': date_euro,
    'date_us': date_us,
})

csv_buf = StringIO()
date_df_pre.to_csv(csv_buf)
csv_buf.seek(0)

date_df_pre

(day-first vs month-first)

0      01/01/1900 00:00:00  01/01/1900 00:00:00
1      02/01/1900 00:00:00  01/02/1900 00:00:00
2      03/01/1900 00:00:00  01/03/1900 00:00:00
3      04/01/1900 00:00:00  01/04/1900 00:00:00
4      05/01/1900 00:00:00  01/05/1900 00:00:00
...                    ...                  ...
36520  28/12/1999 00:00:00  12/28/1999 00:00:00
36521  29/12/1999 00:00:00  12/29/1999 00:00:00
36522  30/12/1999 00:00:00  12/30/1999 00:00:00
36523  31/12/1999 00:00:00  12/31/1999 00:00:00
36524  01/01/2000 00:00:00  01/01/2000 00:00:00

[36525 rows x 2 columns]
from typing import Union

def parse_all_dates(col: Union[str,pd.Series]):
    if not isinstance(col, pd.Series): return col
    if not pd.api.types.is_string_dtype(col.dtype): return col

    if col.name == 'date_euro': return pd.to_datetime(col, format="%d/%m/%Y %H:%M:%S")
    if col.name == 'date_us': return pd.to_datetime(col, format="%m/%d/%Y %H:%M:%S")
    return col

new_df = pd.read_csv(csv_buf, usecols=['date_euro', 'date_us'], parse_dates=['date_euro', 'date_us'], date_parser=parse_all_dates)
new_df
       date_euro    date_us
0     1900-01-01 1900-01-01
1     1900-02-01 1900-01-02
2     1900-03-01 1900-01-03
3     1900-04-01 1900-01-04
4     1900-05-01 1900-01-05
...          ...        ...
36520 1999-12-28 1999-12-28
36521 1999-12-29 1999-12-29
36522 1999-12-30 1999-12-30
36523 1999-12-31 1999-12-31
36524 2000-01-01 2000-01-01

[36525 rows x 2 columns]

A bit contrived, but I'm sure there are real-world use-cases that would need something like this, where different date formats are used per column.

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 6, 2023

IIUC that's not currently supported by date_parser either right?

if so, let's keep that to a separate discussion (you're welcome to open a new issue and I'll gladly take a look)

@sm-Fifteen

Thanks for your input, much appreciated

how about to_datetime_kwargs? then dayfirst could be deprecated as well. so then someone can run

I can't say, I haven't had to use these parameters before and can't tell whether they're actually useful for CSV parsing, I'm just pointing it out for the record, in case there's a use case with these that I'm missing.

If the other args for to_datetime actually matter for file parsing, then I guess kwargs is probably the most extensible and least "future-intrusive" solution.

@sm-Fifteen

sm-Fifteen commented Jan 6, 2023

IIUC that's not currently supported by date_parser either right?

if so, let's keep that to a separate discussion (you're welcome to open a new issue and I'll gladly take a look)

Oh, no, that code snippet actually works and uses date_parser to parse both columns differently. It strikes me as a potentially useful feature (on paper anyway, I can't recall ever needing to do this) that would be lost if date_format were to only accept a flat string, and not, say, a dict of formats mapped by column names as well.

@MarcoGorelli
Member Author

ah I see, sorry, I should've actually tried running it. and if it's possible, then there's definitely someone out there using it 😄

a dict mapping column names to kwargs would allow for that functionality to be kept though, right? like

read_csv(data, to_datetime_kwargs={'date_euro': {'dayfirst': True}, 'date_us': {'dayfirst': False}})

So, to_datetime_kwargs could either be a dict, or a dict of dicts

@sm-Fifteen

I can't speak for the ergonomics of such a solution (the list of parameters for read_csv is already pretty long and complex), but it would at least preserve the functionality of the callback, as far as I can tell. Looking at the current docs, there are cases where specific use of to_datetime is needed (such as mixed-timezone columns), so that wouldn't be lost, at least.


Truth be told, the more I look at the extended use-cases, the less confident I am about having this sort of highly tunable functionality (date-parsing kwargs, that is) baked directly into the read_csv function, instead of having the user do the conversion as a second pass. Pola-rs, for instance, only has a switch for automatic date parsing, where it tries to figure out your date format in an undocumented way based on the first value of a column by testing a bunch of big-endian date formats, and its guide says you should leave that switch off and do the string-to-date conversion yourself if you have anything more complicated than that.

Leaving read_csv with just the simple knobs (i.e. just parse_dates= + date_format= and that's it) and having the docs recommend a two-pass approach for the rest (like I was doing in #35296) might be preferable, but I'm far from having sufficient knowledge of user expectations for Pandas to tell whether that would be a good change or not.

@MarcoGorelli
Member Author

Thanks for your thoughts

Yes, I think you're right, I'd be +1 for simple knobs, and doc change

So, to summarise:

  • deprecate date_parser, because it always hurts performance (counter examples welcome!)
  • add date_format, because that can boost performance
  • for anything else, amend the docs to make clear that users should first read in the data as object, and then apply their parsing

Performance-wise, this would only be an improvement to the status quo

@lithomas1
Member

+1 to this.

@jbrockmendel
Member

to_datetime_kwargs

for use cases like different-behavior-per-column, I'd rather tell users to handle that after read_csv. read_csv is way too complicated already.

@MarcoGorelli
Member Author

Agreed

I just checked, and if the format can be guessed, then just parse_dates is enough to offer a speed-up:

In [2]: %%timeit
   ...: df = pd.read_csv(io.StringIO(data), parse_dates=['ts_col'])
   ...: 
   ...: 
18.2 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, I'd suggest going further - just deprecate date_parser, and don't add anything new. For anything non-standard, users can read in the file and then apply to_datetime

Since PDEP4, the original issue which motivated adding the date_parser argument isn't relevant anymore anyway
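
That two-pass approach, sketched for a compact format that pandas cannot guess (the data here is made up for illustration):

```python
import io

import pandas as pd

# A compact timestamp format that pandas cannot guess on its own.
data = "ts_col\n199912312359\n200001010000\n"

# Read the column as strings so the digits aren't interpreted as integers...
df = pd.read_csv(io.StringIO(data), dtype={"ts_col": str})

# ...then parse with an explicit format.
df["ts_col"] = pd.to_datetime(df["ts_col"], format="%Y%m%d%H%M")
```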

@sm-Fifteen

sm-Fifteen commented Jan 9, 2023

Leaving out non-automatic date parsing is something I'd normally be in favor of, but I should point out that manual date conversion gets pretty convoluted (in comparison to the normal case; I'm looking at this from the perspective of a new user or someone migrating code) when it comes to index columns and especially MultiIndex.

# Roughly copied from the SO answer above
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])

idx_mydatetime = df.index.get_level_values('mydatetime')
idx_num = df.index.get_level_values('num')
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])

There might be simpler ways of handling this in a "post-parsing" manner that I'm not aware of, but that's still roughly what I'd imagine index parsing would look like if read_csv ends up with no date-parsing utilities besides parse_dates=. Pola-rs can get away with this because they don't have indexes on their dataframes (the underlying Arrow dataframes don't either), but in Pandas, index manipulation unfortunately comes with extra steps like these, which the documentation would need to account for and which might cause some friction for users.

@MarcoGorelli
Member Author

I think that code looks fine, and rare enough not to need a special place in the documentation

@sm-Fifteen

rare enough not to need a special place in the documentation

Time-series data with an unconventional/ambiguous date format and one or more extra key columns (sensor name, histogram bucket, dataset name if there are multiple reports in one file, etc.)? I personally see these quite a lot; ISO 8601 is unfortunately not as universal as I would like. This is CSV we're talking about, people can't even agree whether the "C" stands for "Tab", "Semi-Colon" or "Space". Date formats are anybody's guess.

@MarcoGorelli
Member Author

Yes, and if it's not a format which pandas can guess, then people can read in the column and then run to_datetime with their given format

Do you have an example of something that people can currently do with date_parser, which would become more difficult without date_parser? Because the example you posted above

df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])

doesn't use date_parser, and so would continue working

parse_dates is here to stay, I'm only suggesting that date_parser be deprecated

@sm-Fifteen

Because the example you posted above

df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])

doesn't use date_parser, and so would continue working

On its own, yes, but I would imagine most people have it written like this instead:

df = pd.read_csv(
    infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'],
    date_parser=lambda x: pd.to_datetime(x, format=timestamp_format)
)

...and converting it to work without date_parser or a date_format replacement would mean adding the extra 4 lines I quoted above, which I'm sure most users would see as a downgrade in that regard.

df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])

idx_mydatetime = df.index.get_level_values('mydatetime')
idx_num = df.index.get_level_values('num')
idx_mydatetime = pd.to_datetime(idx_mydatetime, format=timestamp_format)
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])

Meanwhile, if date_format is introduced as a replacement for simple cases:

df = pd.read_csv(
    infile,
    parse_dates={'mydatetime': ['date', 'time']},
    date_format={'mydatetime': timestamp_format},
    index_col=['mydatetime', 'num']
)

That would cover the vast majority of uses of date_parser and make the transition a lot smoother for users.

@MarcoGorelli
Member Author

Thanks - not sold on this to be honest. In pandas 2.0.0 most common date formats should be guessable, and if anyone has something really unusual and has a MultiIndex in which the date is one of the levels (which itself strikes me as exceedingly rare), then it seems fine to require them to add an extra few lines. I don't think all this complexity is warranted for such a rare use-case

@staticdev
Contributor

staticdev commented May 26, 2023

@MarcoGorelli I am struggling a bit to convert to Pandas 2.0 recommended way.

I have the following (full code here):

import datetime
import pandas as pd
def date_parse(value: str) -> datetime.datetime:
    return datetime.datetime.strptime(value.strip(), "%d/%m/%y")

df = pd.read_excel(
      filename,
      usecols="B:K",
      parse_dates=["Data Negócio"],
      date_parser=date_parse,
      skipfooter=4,
      skiprows=10,
  )

In this case, my final dataframe has a column "Data Negócio" with type datetime64[ns]. But if I change it to:

import datetime
import pandas as pd
DATE_FORMAT = "%d/%m/%y"

df = pd.read_excel(
      filename,
      usecols="B:K",
      parse_dates=["Data Negócio"],
      date_format=DATE_FORMAT,
      skipfooter=4,
      skiprows=10,
  )

My column is now object dtype and nothing is parsed. I get values such as ' 01/12/21'. Is it a bug, or what am I missing?

@MarcoGorelli
Member Author

hi @staticdev - could you share a reproducible example please?

@MarcoGorelli
Member Author

(and also open a new issue if this is a bug - thanks!)

@ciscorucinski

There doesn't seem to be a lot of documentation for users who have a working date_parser implementation, are running into the FutureWarning, and need to go the object route; especially if they don't use pandas much.

The issue I am having is that date_format is insufficient in my case because I unfortunately have multiple date formats in the dataset and it will only parse one or the other.

  1. YYYY-MM-DD HH:MM:SS+HH:MM - 2023-05-25 17:29:44+02:00
  2. YYYY-MM-DDTHH:MM:SS+HH:MM - 2023-05-26T11:07:14+02:00

I am using a custom parser that uses pd.Timestamp, which also nicely handles timezone conversion to UTC +00:00.

df = pd.read_csv(data, sep=",", parse_dates=['designation_date'], date_parser=lambda dt: pd.Timestamp(dt))

@MarcoGorelli
Member Author

MarcoGorelli commented May 27, 2023

you can do

df = pd.read_csv(data, sep=",", parse_dates=['designation_date'], date_format='mixed')

or something like

df = pd.read_csv(data, sep=",")
df['designation_date'] = pd.to_datetime(df['designation_date'], format='mixed', utc=True)

@ciscorucinski

Thank you for this suggestion. It is working with the first suggestion. Now just dealing with NaT values (#11953), which I might be able to deal with as-is.

However, I do still think extra documentation beyond just stating the alternative method within the FutureWarning would be beneficial. I don't think the FutureWarning message needs to be modified, but there is very little guidance searching online assuming a starting point of a working date_parser implementation. But I do assume it is probably way easier than I am thinking.

@MarcoGorelli
Member Author

sure, pull requests to improve the docs would be very welcome https://pandas.pydata.org/docs/dev/development/contributing_docstring.html

weiji14 added a commit to GenericMappingTools/pygmt that referenced this issue Jun 9, 2023
Use `pd.read_csv(..., date_format="ISO8601")` to silence `UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.`. Xref pandas-dev/pandas#50601
@arkanoid87

current docs

date_format : str or dict of column -> format, default None

    If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply [to_datetime()](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) as-needed.

it doesn't say how to declare the format, what the type of the output is, whether it's timezone-aware or not, or what the underlying function is

please don't deprecate functions until there is a valid alternative

@MarcoGorelli
Member Author

It's the same as the 'format' argument in 'to_datetime', a PR to clarify this would be welcome

@brendan-m-murphy

Date parser could be used to combine multiple columns into a single datetime (e.g. columns for 'year', 'month', 'day', 'hour', 'minute', 'second' could be turned into a single datetime column). Is this no longer possible?

@sm-Fifteen

sm-Fifteen commented Nov 6, 2023

Date parser could be used to combine multiple columns into a single datetime (e.g. columns for 'year', 'month', 'day', 'hour', 'minute', 'second' could be turned into a single datetime column). Is this no longer possible?

@brendan-m-murphy: I believe you can still do this with parse_dates={'full_datetime': ['my_year', 'my_month', 'my_day', 'my_hour', 'my_minute', 'my_second']} together with date_format={'full_datetime': '%Y %m %d %H %M %S'}.
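
A two-pass sketch of the same combine-columns idea that sidesteps the nested parse_dates form entirely, using the documented ability of to_datetime to assemble a datetime from 'year'/'month'/'day' (and optionally 'hour', etc.) component columns; the data is made up for illustration:

```python
import io

import pandas as pd

data = "year,month,day,hour\n1999,12,31,23\n2000,1,1,0\n"

df = pd.read_csv(io.StringIO(data))

# to_datetime assembles a datetime column directly from suitably named
# component columns ('year', 'month', 'day', and optionally 'hour', ...).
df["full_datetime"] = pd.to_datetime(df[["year", "month", "day", "hour"]])
df = df.drop(columns=["year", "month", "day", "hour"])
```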

@gsgxnet

gsgxnet commented Dec 22, 2023

Would be great if the new parser were able to convert POSIX times (epoch seconds). The Python strftime documentation referenced does not say it supports %s, and so an attempt fails with:

ValueError: 's' is a bad directive in format '%s'

KeyError                                 
Traceback (most recent call last)
File strptime.pyx:248, in pandas._libs.tslibs.strptime.array_strptime()

File ~/miniconda3/envs/ds/lib/python3.12/_strptime.py:263, in TimeRE.compile(self, format)
    262 """Return a compiled re object for the format string."""
--> 263 return re_compile(self.pattern(format), IGNORECASE)

Linux strftime(3) manpage:

       %s     The number of seconds since the Epoch, 1970-01-01 00:00:00
              +0000 (UTC). (TZ) (Calculated from mktime(tm).)

And sqlite seems to support it too: https://www.sqlite.org/lang_datefunc.html
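
For the record, one workaround (not discussed further in this thread) is to skip strptime-style formats entirely and convert epoch seconds with to_datetime's unit='s' argument; the column name and values below are made up:

```python
import io

import pandas as pd

# Epoch-seconds column; '%s' is not a valid strptime directive, so use unit='s'.
data = "ts_col\n946684800\n946771200\n"

df = pd.read_csv(io.StringIO(data))
df["ts_col"] = pd.to_datetime(df["ts_col"], unit="s", utc=True)
```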

@MarcoGorelli
Member Author

@gsgxnet could you open a new issue please?

@sm-Fifteen

Date parser could be used to combine multiple columns into a single datetime (e.g. columns for 'year', 'month', 'day', 'hour', 'minute', 'second' could be turned into a single datetime column). Is this no longer possible?

@brendan-m-murphy: I believe you can still do this with parse_dates={'full_datetime': ['my_year', 'my_month', 'my_day', 'my_hour', 'my_minute', 'my_second']} together with date_format={'full_datetime': '%Y %m %d %H %M %S'}.

I guess that's not going to hold true in the near future, since it looks like that feature is set to get deprecated in the next Pandas version, as per PR #56569 and issue #55569.


9 participants