BUG: Series.map passes DatetimeIndex as first argument on some series #44392

Open
3 tasks done
pierremonico opened this issue Nov 11, 2021 · 9 comments
Labels: Bug, Series

Comments

@pierremonico

pierremonico commented Nov 11, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

from datetime import datetime
from random import randint

import pandas as pd


def mapping_func(ts):
    assert type(ts) == pd.Timestamp


# --- works as expected ---

ts_list = [
    datetime(2021, randint(1, 12), randint(1, 28))
    for _ in range(5)
]

df01 = pd.DataFrame(ts_list, columns=["timestamps"])
df01["timestamps"].map(mapping_func)


# --- doesn't work as expected ---


ts_list_02 = [
    '2020-09-01T15:47:33.953000+00:00',
    '2020-09-02T16:57:16.547000+00:00',
    '2020-09-02T16:57:16.887000+00:00',
    '2020-09-02T16:57:12.377000+00:00',
    '2020-09-02T16:57:12.667000+00:00',
]

df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
df02["timestamps"] = pd.to_datetime(df02["timestamps"])
df02["timestamps"].map(mapping_func)


# --- Type checks ---

assert isinstance(df01["timestamps"], pd.Series)
assert isinstance(df02["timestamps"], pd.Series)

assert df01["timestamps"].dtype == df02["timestamps"].dtype
# fails because (respectively) dtype('<M8[ns]') != datetime64[ns, UTC]

Issue Description

I am seeing a weird issue when using Series.map(arg=fn) on a series containing Timestamps:

  • In some cases the first time fn is called by map, the argument is a DatetimeIndex containing all timestamps in the series. Further calls pass every individual member of the series to fn as expected.
  • In other cases it passes every individual member of the series to fn as expected.

I was able to reproduce it by:

  • creating one series from a list of Python datetime instances (df01 in the example; works as expected).
  • creating one series from a list of ISO timestamp strings and converting it with pd.to_datetime (df02 in the example; doesn't work as expected, i.e. it passes a full DatetimeIndex the first time it calls fn).
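The two constructions above differ in two ways at once: df02 goes through string parsing, and it ends up timezone-aware (datetime64[ns, UTC]) while df01 is naive. A minimal sketch to separate the two, assuming the trigger might be timezone awareness rather than parsing (this is a guess to check, not something established in this report). The helper name argument_types is hypothetical; it records what the callback receives instead of asserting inside it.

```python
from datetime import datetime, timezone

import pandas as pd


def argument_types(series):
    """Return the type name of every argument Series.map passes to the callback."""
    seen = []
    # list.append returns None, so the callback itself returns None,
    # just like mapping_func in the report.
    series.map(lambda value: seen.append(type(value).__name__))
    return seen


# Both series are built from datetime objects, so string parsing is ruled out
# and only timezone awareness differs between them.
naive = pd.Series([datetime(2021, 1, day) for day in range(1, 6)])
aware = pd.Series(
    [datetime(2021, 1, day, tzinfo=timezone.utc) for day in range(1, 6)]
)

for label, series in [("naive", naive), ("aware", aware)]:
    # The printed lists depend on the installed pandas version.
    print(label, series.dtype, argument_types(series))
```

If the aware series shows an extra leading entry and the naive one does not, the dtype is the more likely trigger than the way the values were parsed.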

Expected Behavior

def mapping_func(ts):
    assert type(ts) == pd.Timestamp

ts_list = [
    datetime(2021, randint(1, 12), randint(1, 28))
    for _ in range(5)
]

df01 = pd.DataFrame(ts_list, columns=["timestamps"])
df01["timestamps"].map(mapping_func)

This works as expected: the ts parameter of mapping_func is always a Timestamp instance.

def mapping_func(ts):
    assert type(ts) == pd.Timestamp

ts_list_02 = [
    '2020-09-01T15:47:33.953000+00:00',
    '2020-09-02T16:57:16.547000+00:00',
    '2020-09-02T16:57:16.887000+00:00',
    '2020-09-02T16:57:12.377000+00:00',
    '2020-09-02T16:57:12.667000+00:00',
]

df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
df02["timestamps"] = pd.to_datetime(df02["timestamps"])
df02["timestamps"].map(mapping_func)

This should not raise an error, but it does, because the first time mapping_func is called, the ts parameter is passed a DatetimeIndex.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.7.9.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : None.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.11.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.26
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None

pierremonico added the Bug and Needs Triage labels on Nov 11, 2021
@mroeschke
Member

mroeschke commented Nov 14, 2021

Hmm I don't see an error on master and don't really follow how map applies?

In [2]: from datetime import datetime
   ...: from random import randint
   ...:
   ...: import pandas as pd

In [3]: def mapping_func(ts):
   ...:     assert type(ts) == pd.Timestamp
   ...:
   ...: ts_list = [
   ...:     datetime(2021, randint(1, 12), randint(1, 28))
   ...:     for _ in range(5)
   ...: ]
   ...:
   ...: df01 = pd.DataFrame(ts_list, columns=["timestamps"])
   ...: df01["timestamps"].map(mapping_func)
Out[3]:
0    None
1    None
2    None
3    None
4    None
Name: timestamps, dtype: object

In [4]: def mapping_func(ts):
   ...:     assert type(ts) == pd.Timestamp
   ...:
   ...: ts_list_02 = [
   ...:     '2020-09-01T15:47:33.953000+00:00',
   ...:     '2020-09-02T16:57:16.547000+00:00',
   ...:     '2020-09-02T16:57:16.887000+00:00',
   ...:     '2020-09-02T16:57:12.377000+00:00',
   ...:     '2020-09-02T16:57:12.667000+00:00',
   ...: ]
   ...:
   ...: df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
   ...: df02["timestamps"] = pd.to_datetime(df02["timestamps"])
   ...: df02["timestamps"].map(mapping_func)
Out[4]:
0    None
1    None
2    None
3    None
4    None
Name: timestamps, dtype: object

mroeschke added the Needs Info label and removed the Needs Triage label on Nov 14, 2021
@pierremonico
Author

def mapping_func(ts):
    print(ts)
ts_list = [
    datetime(2021, randint(1, 12), randint(1, 28))
    for _ in range(5)
]

df01 = pd.DataFrame(ts_list, columns=["timestamps"])
df01["timestamps"].map(mapping_func)

Output:

2021-12-03 00:00:00
2021-06-23 00:00:00
2021-10-13 00:00:00
2021-07-06 00:00:00
2021-04-03 00:00:00

ts_list_02 = [
    '2020-09-01T15:47:33.953000+00:00',
    '2020-09-02T16:57:16.547000+00:00',
    '2020-09-02T16:57:16.887000+00:00',
    '2020-09-02T16:57:12.377000+00:00',
    '2020-09-02T16:57:12.667000+00:00',
]

df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
df02["timestamps"] = pd.to_datetime(df02["timestamps"])
df02["timestamps"].map(mapping_func)

Output:

DatetimeIndex(['2020-09-01 15:47:33.953000+00:00',
               '2020-09-02 16:57:16.547000+00:00',
               '2020-09-02 16:57:16.887000+00:00',
               '2020-09-02 16:57:12.377000+00:00',
               '2020-09-02 16:57:12.667000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
2020-09-01 15:47:33.953000+00:00
2020-09-02 16:57:16.547000+00:00
2020-09-02 16:57:16.887000+00:00
2020-09-02 16:57:12.377000+00:00
2020-09-02 16:57:12.667000+00:00

As you can see in the second example, before running the actual five iterations, mapping_func is called once with the whole DatetimeIndex as its argument.

@pierremonico
Author

I also reproduced it in this notebook in case it helps.

@mroeschke
Member

I see. Thanks, the print call shows the issue much more clearly.

mroeschke added the Series label and removed the Needs Info label on Nov 15, 2021
@Mikhaylov-yv
Contributor

I can confirm that this issue is present on the master branch.

Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.4.0.dev0+1143.gbe1ef62bec'
>>> def mapping_func(ts):
...     print(ts)
... 
>>> ts_list_02 = [
...     '2020-09-01T15:47:33.953000+00:00',
...     '2020-09-02T16:57:16.547000+00:00',
...     '2020-09-02T16:57:16.887000+00:00',
...     '2020-09-02T16:57:12.377000+00:00',
...     '2020-09-02T16:57:12.667000+00:00',
... ]
>>> 
>>> df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
>>> df02["timestamps"] = pd.to_datetime(df02["timestamps"])
>>> df02["timestamps"].map(mapping_func)
DatetimeIndex(['2020-09-01 15:47:33.953000+00:00',
               '2020-09-02 16:57:16.547000+00:00',
               '2020-09-02 16:57:16.887000+00:00',
               '2020-09-02 16:57:12.377000+00:00',
               '2020-09-02 16:57:12.667000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
2020-09-01 15:47:33.953000+00:00
2020-09-02 16:57:16.547000+00:00
2020-09-02 16:57:16.887000+00:00
2020-09-02 16:57:12.377000+00:00
2020-09-02 16:57:12.667000+00:00
0    None
1    None
2    None
3    None
4    None
Name: timestamps, dtype: object

I'll try to take this issue. I'll start by writing a test that demonstrates it.

@Mikhaylov-yv
Contributor

I'm trying to catch the exception with this test:

from pandas import Series, to_datetime


def test_series_datetime_index():
    tm_arr = [f"2021-01-0{i} 12:24:34.024322+00:00" for i in range(1, 5)]
    ser = Series(to_datetime(tm_arr))

    def f(x):
        assert type(x).__name__ == "Timestamp"

    ser.map(f)
    ser.apply(f)

The test passes without showing anything. In this case the function returns nothing, and the problem is only visible when using print().
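Since an assert inside the callback does not surface here, a test could instead record what the callback receives and check the record after map or apply has returned. A rough sketch of that idea follows; the helper names recorded_types and unexpected_types are hypothetical and not part of pandas or its test suite.

```python
import pandas as pd


def recorded_types(ser, method_name):
    """Call ser.map or ser.apply and return the type names the callback received."""
    seen = []

    def record(x):
        # Only record; nothing is raised or returned from inside the callback.
        seen.append(type(x).__name__)

    getattr(ser, method_name)(record)
    return seen


def unexpected_types(ser, method_name):
    """Return every recorded type name other than 'Timestamp'."""
    return [name for name in recorded_types(ser, method_name) if name != "Timestamp"]


tm_arr = [f"2021-01-0{i} 12:24:34.024322+00:00" for i in range(1, 5)]
ser = pd.Series(pd.to_datetime(tm_arr))

for method_name in ("map", "apply"):
    # On a version showing the behaviour reported in this issue, the list
    # would contain 'DatetimeIndex'; a regression test would assert it is empty.
    print(method_name, unexpected_types(ser, method_name))
```

Because the check happens outside the callback, it does not depend on whether exceptions raised inside the callback are propagated.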

@pierremonico
Author

Very weird: I can't reproduce it either in the notebook I linked above, but I can reproduce it locally on my machine with 1.3.4. There, assert type(ts) == pd.Timestamp raises an exception because the first argument passed is a DatetimeIndex instance.

@mzeitlin11
Member

Maybe related to #43940?

@pierremonico
Author

Maybe related to #43940?

Absolutely, thanks for pointing that out!
I tried it out by just running the script in the console instead of debugging in VS Code, and the assert statement doesn't raise an error. The print statement, however, always prints a DatetimeIndex.

To sum it up:

  • print in the mapping function always prints a DatetimeIndex before the iterations over the Series, whether the script is run under the debugger or simply from the console.
  • when running under the debugger, that first call with a DatetimeIndex additionally raises an error from the assert, before the iterations run.

def mapping_func(ts):
    print(ts)
    assert isinstance(ts, pd.Timestamp)

ts_list_02 = [
    '2020-09-01T15:47:33.953000+00:00',
    '2020-09-02T16:57:16.547000+00:00',
    '2020-09-02T16:57:16.887000+00:00',
    '2020-09-02T16:57:12.377000+00:00',
    '2020-09-02T16:57:12.667000+00:00',
]

df02 = pd.DataFrame(ts_list_02, columns=["timestamps"])
df02["timestamps"] = pd.to_datetime(df02["timestamps"])
df02["timestamps"].map(mapping_func)

Output of debug:

DatetimeIndex(['2020-09-01 15:47:33.953000+00:00',
               '2020-09-02 16:57:16.547000+00:00',
               '2020-09-02 16:57:16.887000+00:00',
               '2020-09-02 16:57:12.377000+00:00',
               '2020-09-02 16:57:12.667000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

and raises an error from the assert.

Output of running the script from console:

DatetimeIndex(['2020-09-01 15:47:33.953000+00:00',
               '2020-09-02 16:57:16.547000+00:00',
               '2020-09-02 16:57:16.887000+00:00',
               '2020-09-02 16:57:12.377000+00:00',
               '2020-09-02 16:57:12.667000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
2020-09-01 15:47:33.953000+00:00
2020-09-02 16:57:16.547000+00:00
2020-09-02 16:57:16.887000+00:00
2020-09-02 16:57:12.377000+00:00
2020-09-02 16:57:12.667000+00:00

with no error raised.
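The thread does not pin down the cause. One pattern that would produce exactly these two symptoms is "try the function on the whole container first, and fall back to element-wise calls if anything goes wrong". The toy model below is purely illustrative: map_with_fallback is a hypothetical function written for this sketch, not pandas source code, and plain lists and ints stand in for DatetimeIndex and Timestamp.

```python
def map_with_fallback(values, func):
    """Toy model: try func on the whole container, else fall back to element-wise."""
    try:
        result = func(values)
        if not isinstance(result, list):
            raise TypeError("whole-container call did not return a list")
        return result
    except Exception:
        # Any exception from the first attempt is swallowed here, including
        # an AssertionError raised inside func.
        return [func(value) for value in values]


calls = []


def mapping_func(value):
    calls.append(type(value).__name__)
    assert isinstance(value, int)


result = map_with_fallback([1, 2, 3], mapping_func)
print(calls)   # ['list', 'int', 'int', 'int']
print(result)  # [None, None, None]
```

In this model the side effect of the first call (the recorded 'list', or a print) is always visible, while the failed assert is silently caught. A debugger configured to stop on exceptions as they are raised or as they leave user code would still halt on it, which would match the difference between the debug run and the console run described above.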


4 participants