Unexpected behaviour in pandas.DataFrame.first_valid_index (and last...) #20499

Closed
mmngreco opened this Issue Mar 27, 2018 · 1 comment

mmngreco commented Mar 27, 2018

Code Sample

import pandas as pd
idx_w_freq = pd.date_range('20100101', periods=3, freq='B')

# Series
# ======

# this works fine
s = pd.Series([1,2,3], index=idx_w_freq)
s.first_valid_index()
Out[5]: Timestamp('2010-01-01 00:00:00', freq='B')

# this works fine
s_nan = pd.Series([None,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[7]: Timestamp('2010-01-05 00:00:00', freq='B')

# this works fine
s_nan = pd.Series([1,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[9]: Timestamp('2010-01-01 00:00:00', freq='B')

# DataFrame (here is the problem)
# =========

# this works fine
df = pd.DataFrame([1,2,3], index=idx_w_freq)
df.first_valid_index()
Out[11]: Timestamp('2010-01-01 00:00:00', freq='B')

# this works fine
df_nan = pd.DataFrame([None,2,3], index=idx_w_freq)
df_nan.first_valid_index()
Out[13]: Timestamp('2010-01-04 00:00:00', freq='B')

# this works fine
df_nan = pd.DataFrame([[None,None], [None, None], [None, 3]], index=idx_w_freq)
df_nan.first_valid_index()
Out[15]: Timestamp('2010-01-05 00:00:00', freq='B')

# UNEXPECTED OUTPUT WITHOUT FREQUENCY
df_w_holes = pd.DataFrame([[1,None], [None, None], [3, 3]], index=idx_w_freq)
df_w_holes.first_valid_index()
Out[17]: Timestamp('2010-01-01 00:00:00')

Problem description

The method implemented in pandas.Series works fine in all cases. However, the method implemented in pandas.DataFrame returns a Timestamp without the frequency when there are holes in the values (a row that is entirely null between valid rows).

This is because _get_valid_indices() selects the valid labels by fancy-indexing the index with a plain boolean mask, as shown below:

# pandas.core.frame.DataFrame#_get_valid_indices
def _get_valid_indices(self):
    is_valid = self.count(1) > 0
    return self.index[is_valid]
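The effect of the mask can be reproduced on the index alone. The following is a minimal sketch, not the pandas internals: the names `holes`, `selected`, `first_pos` and `label` are made up for illustration, and the mask mirrors the `df_w_holes` case above. It shows that selecting with a boolean mask that has a gap yields an index with no frequency, while looking up the first valid position directly does not need the intermediate selection at all.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2010-01-01", periods=3, freq="B")

# Same validity pattern as df_w_holes: first and last rows valid, middle row null.
holes = np.array([True, False, True])

# Boolean-mask selection with a gap cannot be expressed as a slice,
# so the resulting index no longer carries a frequency.
selected = idx[holes]
print(selected.freq)

# Alternative: find the integer position of the first True value
# and take that single label from the original index.
first_pos = int(holes.argmax())
label = idx[first_pos]
print(label)
```

Whether the scalar returned by `idx[first_pos]` exposes a `freq` attribute depends on the pandas version, so the sketch only prints the label.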

The problem has been present since 0.19.1.

Output of pd.show_versions()

pd.show_versions()
INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: es_ES.UTF-8
pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

Possible solution

I would like to know whether this is actually a bug. If it is, my suggestion is to add the following functions as methods of the NDFrame class, replacing the current methods, since they work with both Series and DataFrames. I would be happy to open a pull request if that is OK.

def first_valid_index(data):
    """Return label for first non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    first valid index: index
    """
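    if len(data) == 0:
        return None
    mask = data.count(1) > 0  # rows with at least one non-null value
    i = mask._values.argmax()  # integer position of the first True value
    if not mask.iat[i]:  # no valid values in data
        return None
    else:
        return data.index[i]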


def last_valid_index(data):
    """Return index label for last non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    index_label: type of input index
        Index label for the last non-NA/null value.
    """
    if len(data) == 0:
        return None

    mask = data.count(1) > 0  # count number of non-null values per row, if
    # result is greater than 0, then the row is valid
    i = mask._values[::-1].argmax()  # find the integer index of the first True
    # value starting from the end
    if not mask.iat[len(data) - i - 1]:  # no valid values in data
        return None
    else:
        return data.index[len(data) - i - 1]
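
The two functions above share most of their logic, and `data.count(1)` assumes a two-dimensional object. One possible way to cover both Series and DataFrame with a single helper is sketched below. This is an assumption-laden illustration, not the code that was merged: `find_valid_index` and its `how` parameter are hypothetical names, and the sketch relies only on `notna()`, `.values` and NumPy's `argmax`.

```python
import pandas as pd


def find_valid_index(data, how="first"):
    """Hypothetical helper: label of the first or last row with a non-null value."""
    if len(data) == 0:
        return None

    # Boolean array of non-null entries; 1-D for a Series, 2-D for a DataFrame.
    is_valid = data.notna().values
    if is_valid.ndim == 2:
        # A DataFrame row counts as valid if any of its columns is non-null.
        is_valid = is_valid.any(axis=1)

    if not is_valid.any():
        return None

    if how == "first":
        pos = is_valid.argmax()
    else:
        # Search from the end, then convert back to a forward position.
        pos = len(data) - 1 - is_valid[::-1].argmax()

    # Positional lookup on the original index, no boolean-mask selection.
    return data.index[pos]


idx = pd.date_range("2010-01-01", periods=3, freq="B")
df_w_holes = pd.DataFrame([[1, None], [None, None], [3, 3]], index=idx)
print(find_valid_index(df_w_holes, "first"))
print(find_valid_index(df_w_holes, "last"))
```

Because the label is taken from the original index by position, the result does not depend on how a masked selection of the index handles its frequency.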

jreback commented Mar 30, 2018

hmm, this seems reasonable. i suspect we don't have great testing on the DataFrame side for this. would take a PR for this (and add all of those cases as parameterized tests).
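
As a rough idea of what such parameterized tests could look like, here is a sketch built from the cases in the code sample above. The names `IDX`, `CASES` and `test_first_valid_index_label` are made up for illustration and are not taken from the pandas test suite. The test only compares labels; a check on the returned Timestamp's frequency would be specific to pandas versions where Timestamp carries a `freq` attribute, so it is left out here.

```python
import pandas as pd
import pytest

IDX = pd.date_range("2010-01-01", periods=3, freq="B")

# (input object, expected label of the first row with a non-null value)
CASES = [
    (pd.Series([1, 2, 3], index=IDX), IDX[0]),
    (pd.Series([None, None, 3], index=IDX), IDX[2]),
    (pd.Series([1, None, 3], index=IDX), IDX[0]),
    (pd.DataFrame([1, 2, 3], index=IDX), IDX[0]),
    (pd.DataFrame([None, 2, 3], index=IDX), IDX[1]),
    (pd.DataFrame([[None, None], [None, None], [None, 3]], index=IDX), IDX[2]),
    (pd.DataFrame([[1, None], [None, None], [3, 3]], index=IDX), IDX[0]),
]


@pytest.mark.parametrize("data, expected", CASES)
def test_first_valid_index_label(data, expected):
    # Series and DataFrame should agree on the label they report.
    assert data.first_valid_index() == expected
```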

@jreback jreback added this to the Next Major Release milestone Mar 30, 2018

mmngreco added a commit to mmngreco/pandas that referenced this issue Mar 31, 2018

BUG: Fixed first last valid index (pandas-dev#20499)
Removed old methods from Series and DF
Added new methods into NDFrame
Created new convenient method _find_first_valid

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Mar 31, 2018