DOC: update the pandas.DataFrame.replace docstring #20271

Merged
merged 8 commits into from Apr 22, 2018

@math-and-data
Contributor

math-and-data commented Mar 11, 2018

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Note: Just did a minor improvement, not a full change!

Still a few verification errors:

  • Errors in parameters section
    • Parameter "to_replace" description should start with capital letter
    • Parameter "axis" description should finish with "."
  • Examples do not pass tests
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in 'to_replace' with 'value'.

Values of the DataFrame or a Series are being replaced with
other values. One or several values can be replaced with one
or several values.

Parameters
----------
to_replace : str, regex, list, dict, Series, numeric, or None

    * numeric, str or regex:

        - numeric: numeric values equal to ``to_replace`` will be
          replaced with ``value``
        - str: string exactly matching ``to_replace`` will be replaced
          with ``value``
        - regex: regexes matching ``to_replace`` will be replaced with
          ``value``

    * list of str, regex, or numeric:

        - First, if ``to_replace`` and ``value`` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for ``value`` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the ``value``
          parameter should be ``None``.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
          the value 'z' in column 'b' and replaces these values with
          whatever is specified in ``value``. The ``value`` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          {'a': {'b': np.nan}}, are read as follows: look in column 'a'
          for the value 'b' and replace it with NaN. The ``value``
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the ``regex`` argument must be a string,
          compiled regular expression, or list, dict, ndarray or Series
          of such elements. If ``value`` is also ``None`` then this
          **must** be a nested dictionary or ``Series``.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching ``to_replace`` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, perform the replacement in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
    Whether to interpret ``to_replace`` and/or ``value`` as regular
    expressions. If this is ``True`` then ``to_replace`` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    ``to_replace`` must be ``None``.
method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad'
    The method to use for replacement, when ``to_replace`` is a
    scalar, list or tuple and ``value`` is None.
axis : None
    Deprecated.

    .. versionchanged:: 0.23.0
        Added to DataFrame

See Also
--------
DataFrame.fillna : Fill NA/NaN values
DataFrame.where : Replace values based on boolean condition

Returns
-------
DataFrame
    Some values have been substituted for new values.

Raises
------
AssertionError
    * If ``regex`` is not a ``bool`` and ``to_replace`` is not
      ``None``.
TypeError
    * If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If ``to_replace`` is ``None`` and ``regex`` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to ``to_replace`` do not match the type of the
      value being replaced
ValueError
    * If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.

Examples
--------

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of
``s.replace('a', None)`` and ``s.replace({'a': None})`` to understand
the peculiarities of the ``to_replace`` parameter.
``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter to do the replacement.
So this is why the 'a' values are being replaced by 30 in rows 3 and 4
and 'b' in row 6 in this case. However, this behaviour does not occur
when you use a dict as the ``to_replace`` value. In this case, it is
like the value(s) in the dict are equal to the value parameter.

>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
>>> print(s)
0    10
1    20
2    30
3     a
4     a
5     b
6     a
dtype: object
>>> print(s.replace('a', None))
0    10
1    20
2    30
3    30
4    30
5     b
6     b
dtype: object
>>> print(s.replace({'a': None}))
0      10
1      20
2      30
3    None
4    None
5       b
6    None
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "to_replace" description should start with capital letter
                Parameter "axis" description should finish with "."
        Examples do not pass tests

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 229, in pandas.DataFrame.replace
Failed example:
    df.replace({'a string': 'new value', True: False})  # raises
Exception raised:
    Traceback (most recent call last):
      File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
        df.replace({'a string': 'new value', True: False})  # raises
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5208, in replace
        limit=limit, regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5257, in replace
        regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
        return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
        b=type_names[1]))
    TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
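
The "Exception raised" report above is what doctest prints when an example raises but its expected output does not begin with the traceback header. As a minimal sketch of the expected-exception format (standard library only; `check_types` is a hypothetical stand-in for the pandas call, not part of this PR):

```python
import doctest


def check_types(values, key):
    """Hypothetical helper: reject a key whose type differs from the data.

    >>> check_types([True, False], True)
    True
    >>> check_types([True, False], 'a string')
    Traceback (most recent call last):
        ...
    TypeError: Cannot compare types 'bool' and 'str'
    """
    for v in values:
        if type(v) is not type(key):
            raise TypeError(
                "Cannot compare types {!r} and {!r}".format(
                    type(v).__name__, type(key).__name__))
    return key in values


# Parse and run the docstring examples explicitly, so the sketch does not
# depend on which module it is executed from.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    check_types.__doc__, {"check_types": check_types}, "check_types", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

With the header line and the final exception line in place, doctest ignores the traceback body, so the example counts as a pass rather than an unexpected exception.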
.. versionchanged:: 0.23.0
Added to DataFrame
.. versionchanged:: 0.23.0

@WillAyd

WillAyd Mar 11, 2018

Member

This isn't what you want to do - make sure you keep the versionchanged directive below the method argument as that's what was added in v0.23

regex : bool or same types as ``to_replace``, default False
Whether to interpret ``to_replace`` and/or ``value`` as regular
expressions. If this is ``True`` then ``to_replace`` *must* be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
``to_replace`` must be ``None``.
-method : string, optional, {'pad', 'ffill', 'bfill'}
+method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad'

@WillAyd

WillAyd Mar 11, 2018

Member

method : {'pad', 'ffill', 'bfill', `None`}

The method to use for replacement, when ``to_replace`` is a
scalar, list or tuple and ``value`` is None.
axis : None
Deprecated.

@WillAyd

WillAyd Mar 11, 2018

Member

Warning says this will be removed in v0.13? Woof...I guess OK to document for this change but should have a follow up change to actually go ahead and remove - care to take a stab at that?

@jreback

jreback Mar 11, 2018

Contributor

@WillAyd where is this warning?

@math-and-data

math-and-data Mar 12, 2018

Contributor

I'm happy to take a stab at this - always nice when I can remove code too

@WillAyd

WillAyd Mar 12, 2018

Member

@math-and-data awesome thanks! Can you open a separate issue for this?

@math-and-data

math-and-data Mar 14, 2018

Contributor

will do.

@math-and-data

math-and-data Apr 21, 2018

Contributor

@WillAyd I was waiting for this PR to be approved, then I would open a new request where I change the relevant code (remove the 'axis' reference) and edit the documentation accordingly. Is there anything else I had missed in this PR (other than the suggestion of breaking out the DataFrame and Series examples)?

@@ -4869,6 +4869,10 @@ def bfill(self, axis=None, inplace=False, limit=None, downcast=None):
_shared_docs['replace'] = ("""
Replace values given in 'to_replace' with 'value'.
Values of the DataFrame or a Series are being replaced with

@WillAyd

WillAyd Mar 11, 2018

Member

Not sure this extended description is adding much. Better served to make mention of how this can replace values with a dynamic set of inputs like dicts

@math-and-data

math-and-data Mar 14, 2018

Contributor

done, thank you for the suggestion
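
To make the "dynamic set of inputs like dicts" point concrete, here is a small sketch based on the dict examples already in the docstring. The frame is a shortened, made-up variant of the one used there, and the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [5, 0, 7]})

# Flat dict: each key is a value to look for anywhere in the frame,
# and the matching dict value is its replacement.
flat = df.replace({0: 10, 1: 100})

# Nested dict: the outer key names the column to search in, so the 0 in
# column 'B' is left alone.
nested = df.replace({'A': {0: 100}})
```

In both calls the `value` argument is left at its default, which is the condition the docstring gives for using a dict in this way.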

Values of the DataFrame or a Series are being replaced with
other values. One or several values can be replaced with one
or several values.
Parameters
----------
to_replace : str, regex, list, dict, Series, numeric, or None

@WillAyd

WillAyd Mar 11, 2018

Member

Say int, float instead of numeric (if float is even valid?)
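
On whether float is valid: a quick check along the following lines suggests that both int and float scalars are accepted as `to_replace`. This is only a sketch with made-up data, not a statement about every dtype combination:

```python
import pandas as pd

# Integer scalar as to_replace, mirroring the first docstring example.
ints = pd.Series([0, 1, 2]).replace(0, 5)

# Float scalar as to_replace on a float Series.
floats = pd.Series([0.5, 1.5, 2.5]).replace(0.5, 9.0)
```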

the peculiarities of the ``to_replace`` parameter.
``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or

@WillAyd

WillAyd Mar 11, 2018

Member

This is interesting as I was not aware of this behavior. Certainly great to have it documented, though I would move the majority of the writing into the Notes section and shorten the blurb introducing the comparison here.

``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter to do the replacement.
So this is why the 'a' values are being replaced by 30 in rows 3 and 4

@WillAyd

WillAyd Mar 11, 2018

Member

Maybe just reinforce that it's the fill behavior that is really replacing values here
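
One way to see that it is the fill step doing the work is to spell the pad behaviour out by hand. The sketch below does not call `replace` at all, because the handling of `value=None` with `method` may differ between pandas versions; it masks the matches and forward-fills them, which is an assumed equivalent of what the docstring describes for `method='pad'`:

```python
import pandas as pd

s = pd.Series([10, 'a', 'a', 'b', 'a'])

# Step 1: blank out every value equal to 'a' (they become NaN).
masked = s.where(s != 'a')

# Step 2: forward-fill the gaps from the last value that was kept.
padded = masked.ffill()
```

Each 'a' ends up holding whatever came before it (10 for rows 1 and 2, 'b' for row 4), which matches the output shown for ``s.replace('a', None)`` in this PR.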

like the value(s) in the dict are equal to the value parameter.
>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
>>> print(s)

@WillAyd

WillAyd Mar 11, 2018

Member

This Series is simple enough where you don't need to explicitly print it - the constructor shows you everything of interest

@math-and-data

math-and-data Mar 14, 2018

Contributor

I personally have found the visual of inspecting the changes before/after easier for such replacements (both in vertical positions). You have more experience and I'll rely on your suggestion and make the change.

5 b
6 b
dtype: object
>>> print(s.replace({'a': None}))

@WillAyd

WillAyd Mar 11, 2018

Member

I would put this example first as it is (from my perspective) the behavior most would expect. Having it first makes it a better segue into the nuance that you want to describe with the other example

when you use a dict as the ``to_replace`` value. In this case, it is
like the value(s) in the dict are equal to the value parameter.
>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])

@WillAyd

WillAyd Mar 11, 2018

Member

Just to keep things concise why don't you get rid of 10 and 20 in this example? They don't serve any real purpose but make the documentation longer. Can also replace 30 with 1

@math-and-data

math-and-data Mar 14, 2018

Contributor

great suggestion of simplifying.

@jorisvandenbossche jorisvandenbossche added Docs and removed Docs labels Mar 11, 2018

5 b
6 a
dtype: object
>>> print(s.replace('a', None))

@jreback

jreback Mar 11, 2018

Contributor

you don't need the prints; use a blank line between cases. Having an explanation for each case is also nice.

@math-and-data

math-and-data Mar 14, 2018

Contributor

done.

@math-and-data

Contributor

math-and-data commented Mar 14, 2018

  • Docstring validation not passing
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in 'to_replace' with 'value'.

Values of the DataFrame or a Series are being replaced with
other values in a dynamic way. Instead of replacing values in a
specific cell (row/column combination), this method allows for more
flexibility with replacements. For instance, values can be replaced
by specifying lists of values and replacements separately or
with a dynamic set of inputs like dicts.

Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
    * numeric, str or regex:

        - numeric: numeric values equal to ``to_replace`` will be
          replaced with ``value``
        - str: string exactly matching ``to_replace`` will be replaced
          with ``value``
        - regex: regexes matching ``to_replace`` will be replaced with
          ``value``

    * list of str, regex, or numeric:

        - First, if ``to_replace`` and ``value`` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for ``value`` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the ``value``
          parameter should be ``None``.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
          the value 'z' in column 'b' and replaces these values with
          whatever is specified in ``value``. The ``value`` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          {'a': {'b': np.nan}}, are read as follows: look in column
          'a' for the value 'b' and replace it with NaN. The ``value``
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the ``regex`` argument must be a string,
          compiled regular expression, or list, dict, ndarray or
          Series of such elements. If ``value`` is also ``None`` then
          this **must** be a nested dictionary or ``Series``.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching ``to_replace`` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, perform the replacement in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
    Whether to interpret ``to_replace`` and/or ``value`` as regular
    expressions. If this is ``True`` then ``to_replace`` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    ``to_replace`` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
    The method to use for replacement, when ``to_replace`` is a
    scalar, list or tuple and ``value`` is `None`.
    .. versionchanged:: 0.23.0
        Added to DataFrame.
axis : None
    Deprecated.

See Also
--------
DataFrame.fillna : Fill `NaN` values
DataFrame.where : Replace values based on boolean condition

Returns
-------
DataFrame
    Object after replacement.

Raises
------
AssertionError
    * If ``regex`` is not a ``bool`` and ``to_replace`` is not
      ``None``.
TypeError
    * If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If ``to_replace`` is ``None`` and ``regex`` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to ``to_replace`` do not match the type of the
      value being replaced
ValueError
    * If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.
* When a dict is used as the ``to_replace`` value, the key(s) in the
  dict act as the ``to_replace`` part and the value(s) in the dict act
  as the ``value`` parameter.

Examples
--------

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the ``to_replace`` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the ``to_replace`` value, the value(s) in
the dict take the place of the ``value`` parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter (default 'pad') to do the
replacement. So this is why the 'a' values are being replaced by 10
in rows 1 and 2 and 'b' in row 4 in this case.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
        Errors in parameters section
                Parameter "to_replace" description should start with capital letter
        Examples do not pass tests

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 233, in pandas.DataFrame.replace
Failed example:
    df.replace({'a string': 'new value', True: False})  # raises
Exception raised:
    Traceback (most recent call last):
      File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
        df.replace({'a string': 'new value', True: False})  # raises
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5205, in replace
        limit=limit, regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
        method=method, axis=axis)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5254, in replace
        regex=regex)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
        masks = [comp(s) for i, s in enumerate(src_list)]
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
        return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
      File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
        b=type_names[1]))
    TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
Updates.
Section headers.

Consistent quoting.

Formatting.

Traceback.
@codecov

codecov bot commented Mar 15, 2018

Codecov Report

Merging #20271 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20271      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         152      153       +1     
  Lines       49248    49305      +57     
==========================================
+ Hits        45222    45286      +64     
+ Misses       4026     4019       -7
Flag         Coverage Δ
#multiple    90.24% <100%> (+0.02%) ⬆️
#single      41.89% <53.84%> (ø) ⬆️

Impacted Files                       Coverage Δ
pandas/core/generic.py               95.94% <100%> (+0.08%) ⬆️
pandas/io/clipboard/clipboards.py    30.58% <0%> (-1.6%) ⬇️
pandas/core/config_init.py           99.24% <0%> (-0.76%) ⬇️
pandas/core/arrays/categorical.py    95.78% <0%> (-0.41%) ⬇️
pandas/core/nanops.py                96.3% <0%> (-0.4%) ⬇️
pandas/util/_decorators.py           82.25% <0%> (-0.15%) ⬇️
pandas/plotting/_core.py             82.39% <0%> (-0.12%) ⬇️
pandas/io/pytables.py                92.41% <0%> (-0.05%) ⬇️
pandas/core/frame.py                 97.16% <0%> (-0.02%) ⬇️
pandas/tseries/offsets.py            97% <0%> (-0.01%) ⬇️
... and 27 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TomAugspurger

Contributor

TomAugspurger commented Mar 15, 2018

Updated

################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################

Replace values given in `to_replace` with `value`.

Values of the DataFrame are replaced with other values dynamically.
This differs from updating with ``.loc`` or ``.iloc``, which require
you to specify a location to update with some value.

Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
    How to find the values that will be replaced.

    * numeric, str or regex:

        - numeric: numeric values equal to `to_replace` will be
          replaced with `value`
        - str: string exactly matching `to_replace` will be replaced
          with `value`
        - regex: regexes matching `to_replace` will be replaced with
          `value`

    * list of str, regex, or numeric:

        - First, if `to_replace` and `value` are both lists, they
          **must** be the same length.
        - Second, if ``regex=True`` then all of the strings in **both**
          lists will be interpreted as regexes; otherwise they will match
          directly. This doesn't matter much for `value` since there
          are only a few possible substitution regexes you can use.
        - str, regex and numeric rules apply as above.

    * dict:

        - Dicts can be used to specify different replacement values
          for different existing values. For example,
          ``{'a': 'b', 'y': 'z'}`` replaces the value 'a' with 'b' and
          'y' with 'z'. To use a dict in this way the `value`
          parameter should be `None`.
        - For a DataFrame a dict can specify that different values
          should be replaced in different columns. For example,
          ``{'a': 1, 'b': 'z'}`` looks for the value 1 in column 'a'
          and the value 'z' in column 'b' and replaces these values
          with whatever is specified in `value`. The `value` parameter
          should not be ``None`` in this case. You can treat this as a
          special case of passing two lists except that you are
          specifying the column to search in.
        - For a DataFrame nested dictionaries, e.g.,
          ``{'a': {'b': np.nan}}``, are read as follows: look in column
          'a' for the value 'b' and replace it with NaN. The `value`
          parameter should be ``None`` to use a nested dict in this
          way. You can nest regular expressions as well. Note that
          column names (the top-level dictionary keys in a nested
          dictionary) **cannot** be regular expressions.

    * None:

        - This means that the `regex` argument must be a string,
          compiled regular expression, or list, dict, ndarray or
          Series of such elements. If `value` is also ``None`` then
          this **must** be a nested dictionary or Series.

    See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
    Value to replace any values matching `to_replace` with.
    For a DataFrame a dict of values can be used to specify which
    value to use for each column (columns not in the dict will not be
    filled). Regular expressions, strings and lists or dicts of such
    objects are also allowed.
inplace : boolean, default False
    If True, replace in place. Note: this will modify any
    other views on this object (e.g. a column from a DataFrame).
    Returns the caller if this is True.
limit : int, default None
    Maximum size gap to forward or backward fill.
regex : bool or same types as `to_replace`, default False
    Whether to interpret `to_replace` and/or `value` as regular
    expressions. If this is ``True`` then `to_replace` *must* be a
    string. Alternatively, this could be a regular expression or a
    list, dict, or array of regular expressions in which case
    `to_replace` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
    The method to use for replacement, when `to_replace` is a
    scalar, list or tuple and `value` is ``None``.

    .. versionchanged:: 0.23.0
        Added to DataFrame.
axis : None
    Deprecated.

See Also
--------
DataFrame.fillna : Fill `NaN` values
DataFrame.where : Replace values based on boolean condition
Series.str.replace : Simple string replacement.

Returns
-------
DataFrame
    Object after replacement.

Raises
------
AssertionError
    * If `regex` is not a ``bool`` and `to_replace` is not
      ``None``.
TypeError
    * If `to_replace` is a ``dict`` and `value` is not a ``list``,
      ``dict``, ``ndarray``, or ``Series``
    * If `to_replace` is ``None`` and `regex` is not compilable
      into a regular expression or is a list, dict, ndarray, or
      Series.
    * When replacing multiple ``bool`` or ``datetime64`` objects and
      the arguments to `to_replace` do not match the type of the
      value being replaced.
ValueError
    * If a ``list`` or an ``ndarray`` is passed to `to_replace` and
      `value` but they are not the same length.

Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
  rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
  cannot provide, for example, a regular expression matching floating
  point numbers and expect the columns in your frame that have a
  numeric dtype to be matched. However, if those floating point
  numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
  and play with this method to gain intuition about how it works.
* When a dict is used as the `to_replace` value, the key(s) in the
  dict act as the `to_replace` part and the value(s) in the dict act
  as the `value` parameter.

Examples
--------

**Scalar `to_replace` and `value`**

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

**List-like `to_replace`**

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

**dict-like `to_replace`**

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

**Regular expression `to_replace`**

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the `to_replace` parameter must match the data
type of the value being replaced:

>>> df = pd.DataFrame({'A': [True, False, True],
...                    'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False})  # raises
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'

This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.

Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the `to_replace` parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When a dict is used as the `to_replace` value, the value(s) in the
dict act as the `value` parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When ``value=None`` and `to_replace` is a scalar, list or
tuple, `replace` uses the method parameter (default 'pad') to do the
replacement. So this is why the 'a' values are being replaced by 10
in rows 1 and 2 and 'b' in row 4 in this case.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:

>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.replace" correct. :)
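
Two points from the Notes section of the docstring above can be checked with a short sketch: regular expressions only substitute on strings, and substitution follows the same rules as ``re.sub``. The frame, column names, and patterns below are made up for illustration and are not part of the docstring or the PR.

```python
import re

import pandas as pd

# Hypothetical frame: the same digits stored once as floats and once as strings.
df = pd.DataFrame({"num": [1.5, 2.5], "txt": ["1.5", "2.5"]})

# The pattern is only applied to string cells; the float column is left alone.
out = df.replace(to_replace=r"^1\.5$", value="X", regex=True)

# Backreferences in the replacement behave as they do in re.sub.
s = pd.Series(["12-34", "56-78"])
swapped = s.replace(to_replace=r"(\d+)-(\d+)", value=r"\2-\1", regex=True)

# Same substitution done directly with re.sub, for comparison.
expected = [re.sub(r"(\d+)-(\d+)", r"\2-\1", item) for item in s]

print(out)
print(swapped.tolist() == expected)
```

If the numbers in ``num`` should be matched by a pattern, they would first need to be converted to strings, for example with ``astype(str)``.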



@jorisvandenbossche

Member

jorisvandenbossche commented Mar 15, 2018

I would personally split this docstring in separate ones for series and dataframe, it's becoming quite a monster :)

@WillAyd

One very minor edit but otherwise lgtm

See Also
--------
- %(klass)s.fillna : Fill NA/NaN values
+ %(klass)s.fillna : Fill `NaN` values

@WillAyd

WillAyd Apr 21, 2018

Member

This would be better as Fill NA values since it is talking about the concept of missing data and not necessarily the NaN value itself
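
A minimal sketch of the distinction, assuming three hypothetical series that are not taken from the PR: ``fillna`` treats ``np.nan``, ``None`` and ``pd.NaT`` alike as missing, so the description covers more than the literal NaN value.

```python
import numpy as np
import pandas as pd

# Hypothetical series, each with a different missing-value marker.
floats = pd.Series([1.0, np.nan])
objects = pd.Series(["a", None], dtype=object)
dates = pd.Series([pd.Timestamp("2018-04-21"), pd.NaT])

# All three markers are reported as missing.
missing = [s.isna().tolist() for s in (floats, objects, dates)]

# fillna fills each of them, regardless of which marker was used.
filled_floats = floats.fillna(0.0)
filled_objects = objects.fillna("missing")
filled_dates = dates.fillna(pd.Timestamp("2018-04-22"))

print(missing)
print(filled_objects.tolist())
```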

math-and-data and others added some commits Apr 21, 2018

@TomAugspurger

Fixed the linting failure. Let's get this merged when that passes.

@TomAugspurger TomAugspurger merged commit 4de2e9b into pandas-dev:master Apr 22, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@TomAugspurger

Contributor

TomAugspurger commented Apr 22, 2018

Thanks @math-and-data!
