diff --git a/doc/source/user_guide/text.rst b/doc/source/user_guide/text.rst
index 11d5ab86e76ef..d82c2744fbb85 100644
--- a/doc/source/user_guide/text.rst
+++ b/doc/source/user_guide/text.rst
@@ -2,21 +2,27 @@
 {{ header }}
 
-======================
+######################
 Working with text data
-======================
+######################
+
+.. versionchanged:: 3.0
+    The inference and behavior of strings changed significantly in pandas 3.0.
+    See the :ref:`string_migration_guide`.
 
 .. _text.types:
 
 Text data types
---------------
+===============
 
 There are two ways to store text data in pandas:
 
-1. ``object`` dtype NumPy array.
-2. :class:`StringDtype` extension type.
+1. :class:`StringDtype` extension type.
+2. ``object`` dtype NumPy array.
 
-We recommend using :class:`StringDtype` to store text data.
+We recommend using :class:`StringDtype` to store text data via the alias
+``dtype="str"`` (the default when the dtype is inferred for string data); see
+below for more details.
 
 Prior to pandas 1.0, ``object`` dtype was the only option. This was unfortunate
 for many reasons:
@@ -29,45 +35,44 @@ for many reasons:
 3. When reading code, the contents of an ``object`` dtype array is less clear
    than ``'string'``.
 
-Currently, the performance of ``object`` dtype arrays of strings and
-:class:`arrays.StringArray` are about the same. We expect future enhancements
+When using :class:`StringDtype` with PyArrow as the storage (see below),
+users will see large improvements in memory usage and in the runtime of
+certain operations when compared to ``object`` dtype arrays. When
+not using PyArrow as the storage, the performance of :class:`StringDtype`
+is about the same as that of ``object``. We expect future enhancements
 to significantly increase the performance and lower the memory overhead of
-:class:`~arrays.StringArray`.
-
-.. warning::
+:class:`StringDtype` in this case.
 
-   ``StringArray`` is currently considered experimental. The implementation
-   and parts of the API may change without warning.
+.. versionchanged:: 3.0
 
-For backwards-compatibility, ``object`` dtype remains the default type we
-infer a list of strings to:
+    The default when pandas infers the dtype of a collection of
+    strings is to use ``dtype='str'``. This will use ``np.nan``
+    as its NA value and be backed by a PyArrow string array when
+    PyArrow is installed, or backed by a NumPy ``object`` array
+    when PyArrow is not installed.
 
 .. ipython:: python
 
    pd.Series(["a", "b", "c"])
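+
+For example, you can inspect the inferred dtype and how missing values are
+represented (a minimal illustration; the storage backend you see depends on
+whether PyArrow is installed in your environment):
+
+.. ipython:: python
+
+    ser = pd.Series(["a", "b", None])
+    ser
+    ser.dtype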
-To explicitly request ``string`` dtype, specify the ``dtype``:
+Specifying :class:`StringDtype` explicitly
+==========================================
 
-.. ipython:: python
-
-   pd.Series(["a", "b", "c"], dtype="string")
-   pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
-
-Or ``astype`` after the ``Series`` or ``DataFrame`` is created:
+When you want to specify the dtype explicitly, we generally recommend
+using the alias ``dtype="str"`` if you want ``np.nan`` as the NA
+value, or the alias ``dtype="string"`` if you want ``pd.NA`` as
+the NA value.
 
 .. ipython:: python
 
-   s = pd.Series(["a", "b", "c"])
-   s
-   s.astype("string")
-
+    pd.Series(["a", "b", None], dtype="str")
+    pd.Series(["a", "b", None], dtype="string")
 
-You can also use :class:`StringDtype`/``"string"`` as the dtype on non-string data and
-it will be converted to ``string`` dtype:
+Specifying either alias will also convert non-string data to strings:
 
 .. ipython:: python
 
-    s = pd.Series(["a", 2, np.nan], dtype="string")
+    s = pd.Series(["a", 2, np.nan], dtype="str")
     s
     type(s[1])
@@ -81,59 +86,159 @@ or convert from existing pandas data:
     s2
     type(s2[0])
 
+However, there are four distinct :class:`StringDtype` variants that can be used.
+
+Python storage with ``np.nan`` values
+-------------------------------------
+
+.. note::
+    This is the same as ``dtype='str'`` *when PyArrow is not installed*.
+
+The implementation uses a NumPy object array, which directly stores the
+Python string objects, which is why the storage here is called ``'python'``.
+NA values in this array are stored using ``np.nan``.
+
+.. ipython:: python
+
+    pd.Series(
+        ["a", "b", None, np.nan, pd.NA],
+        dtype=pd.StringDtype(storage="python", na_value=np.nan)
+    )
+
+Notice that the last three values are all inferred by pandas as being
+NA values, and hence stored as ``np.nan``.
+
+PyArrow storage with ``np.nan`` values
+--------------------------------------
+
+.. note::
+    This is the same as ``dtype='str'`` *when PyArrow is installed*.
+
+The implementation uses a PyArrow array; however, NA values in this array
+are stored using ``np.nan``.
+
+.. ipython:: python
+
+    pd.Series(
+        ["a", "b", None, np.nan, pd.NA],
+        dtype=pd.StringDtype(storage="pyarrow", na_value=np.nan)
+    )
+
+Notice that the last three values are all inferred by pandas as being
+NA values, and hence stored as ``np.nan``.
+
+Python storage with ``pd.NA`` values
+------------------------------------
+
+.. note::
+    This is the same as ``dtype='string'`` *when PyArrow is not installed*.
+
+The implementation uses a NumPy object array, which directly stores the
+Python string objects, which is why the storage here is called ``'python'``.
+NA values in this array are stored using ``pd.NA``.
+
+.. ipython:: python
+
+    pd.Series(
+        ["a", "b", None, np.nan, pd.NA],
+        dtype=pd.StringDtype(storage="python", na_value=pd.NA)
+    )
+
+Notice that the last three values are all inferred by pandas as
+being NA values, and hence stored as ``pd.NA``.
+
+PyArrow storage with ``pd.NA`` values
+-------------------------------------
+
+.. note::
+    This is the same as ``dtype='string'`` *when PyArrow is installed*.
+
+The implementation uses a PyArrow array. NA values in this array are
+stored using ``pd.NA``.
+
+.. ipython:: python
+
+    pd.Series(
+        ["a", "b", None, np.nan, pd.NA],
+        dtype=pd.StringDtype(storage="pyarrow", na_value=pd.NA)
+    )
+
+Notice that the last three values are all inferred by pandas as being NA
+values, and hence stored as ``pd.NA``.
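+
+Which of these four variants a given ``Series`` uses can be inspected through
+its dtype; a minimal illustration, assuming the ``storage`` and ``na_value``
+attributes exposed by :class:`StringDtype`:
+
+.. ipython:: python
+
+    ser = pd.Series(["a", "b", None], dtype="str")
+    ser.dtype
+    ser.dtype.storage
+    ser.dtype.na_value
+
+    ser2 = pd.Series(["a", "b", None], dtype="string")
+    ser2.dtype.storage
+    ser2.dtype.na_value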
 
 .. _text.differences:
 
 Behavior differences
-^^^^^^^^^^^^^^^^^^^^
+====================
 
-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype:
+``StringDtype`` with ``np.nan`` NA values
+-----------------------------------------
 
-1. For ``StringDtype``, :ref:`string accessor methods`
-   that return **numeric** output will always return a nullable integer dtype,
-   rather than either int or float dtype, depending on the presence of NA values.
-   Methods returning **boolean** output will return a nullable boolean dtype.
+1. Like ``dtype="object"``, :ref:`string accessor methods`
+   that return **integer** output will return a NumPy array that is
+   either dtype int or float depending on the presence of NA values.
+   Methods returning **boolean** output will return a NumPy array that is
+   dtype bool, with the value ``False`` when an NA value is encountered.
 
    .. ipython:: python
 
-      s = pd.Series(["a", None, "b"], dtype="string")
+      s = pd.Series(["a", None, "b"], dtype="str")
       s
       s.str.count("a")
       s.dropna().str.count("a")
 
-   Both outputs are ``Int64`` dtype. Compare that with object-dtype:
+   When NA values are present, the output dtype is float64. However,
+   **boolean** output results in ``False`` for the NA values.
 
    .. ipython:: python
 
-      s2 = pd.Series(["a", None, "b"], dtype="object")
-      s2.str.count("a")
-      s2.dropna().str.count("a")
+      s.str.isdigit()
+      s.str.match("a")
+
+2. Some string methods, like :meth:`Series.str.decode`, are not
+   available because the underlying array can only contain
+   strings, not bytes.
+3. Comparison operations will return a NumPy array with dtype bool. Missing
+   values will always compare as unequal, just as :attr:`numpy.nan` does.
 
-   When NA values are present, the output dtype is float64. Similarly for
-   methods returning boolean values.
+``StringDtype`` with ``pd.NA`` NA values
+----------------------------------------
+
+1. :ref:`String accessor methods`
+   that return **integer** output will always return a nullable integer dtype,
+   rather than either int or float dtype (depending on the presence of NA values).
+   Methods returning **boolean** output will return a nullable boolean dtype.
+
+   .. ipython:: python
+
+      s = pd.Series(["a", None, "b"], dtype="string")
+      s
+      s.str.count("a")
+      s.dropna().str.count("a")
+
+   Both outputs are ``Int64`` dtype. Similarly for methods returning boolean values.
 
    .. ipython:: python
 
       s.str.isdigit()
       s.str.match("a")
 
-2. Some string methods, like :meth:`Series.str.decode` are not available
-   on ``StringArray`` because ``StringArray`` only holds strings, not
-   bytes.
-3. In comparison operations, :class:`arrays.StringArray` and ``Series`` backed
-   by a ``StringArray`` will return an object with :class:`BooleanDtype`,
-   rather than a ``bool`` dtype object. Missing values in a ``StringArray``
-   will propagate in comparison operations, rather than always comparing
+2. Some string methods, like :meth:`Series.str.decode`, are not available
+   because the underlying array can only contain strings, not bytes.
+3. Comparison operations will return an object with :class:`BooleanDtype`,
+   rather than a ``bool`` dtype object. Missing values will propagate
+   in comparison operations, rather than always comparing
    unequal like :attr:`numpy.nan`.
 
-Everything else that follows in the rest of this document applies equally to
-``string`` and ``object`` dtype.
+
+.. important::
+    Everything that follows in the rest of this document applies equally to
+    ``'str'``, ``'string'``, and ``object`` dtype.
 
 .. _text.string_methods:
 
 String methods
--------------
+==============
 
 Series and Index are equipped with a set of string processing methods
 that make it easy to operate on each element of the array. Perhaps most
@@ -144,7 +249,8 @@ the equivalent (scalar) built-in string methods:
 
 .. ipython:: python
 
     s = pd.Series(
-        ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
+        ["A", "B", "C", "Aaba", np.nan, "dog", "cat"],
+        dtype="str",
     )
     s.str.lower()
     s.str.upper()
@@ -164,7 +270,9 @@ leading or trailing whitespace:
 
 .. ipython:: python
 
     df = pd.DataFrame(
-        np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
+        np.random.randn(3, 2),
+        columns=[" Column A ", " Column B "],
+        index=range(3),
     )
     df
@@ -212,13 +320,13 @@ and replacing any remaining whitespaces with underscores:
 ..
_text.split: Splitting and replacing strings -------------------------------- +=============================== Methods like ``split`` return a Series of lists: .. ipython:: python - s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string") + s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="str") s2.str.split("_") Elements in the split lists can be accessed using ``get`` or ``[]`` notation: @@ -257,19 +365,18 @@ i.e., from the end of the string to the beginning of the string: s3 = pd.Series( ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"], - dtype="string", + dtype="str", ) s3 s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True) .. versionchanged:: 2.0 - -Single character pattern with ``regex=True`` will also be treated as regular expressions: + Single character pattern with ``regex=True`` will also be treated as regular expressions: .. ipython:: python - s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string") + s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="str") s4 s4.str.replace(".", "a", regex=True) @@ -279,7 +386,7 @@ character. In this case both ``pat`` and ``repl`` must be strings: .. ipython:: python - dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string") + dollars = pd.Series(["12", "-$10", "$10,000"], dtype="str") # These lines are equivalent dollars.str.replace(r"-\$", "-", regex=True) @@ -297,7 +404,7 @@ positional argument (a regex object) and return a string. def repl(m): return m.group(0)[::-1] - pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace( + pd.Series(["foo 123", "bar baz", np.nan], dtype="str").str.replace( pat, repl, regex=True ) @@ -307,7 +414,7 @@ positional argument (a regex object) and return a string. def repl(m): return m.group("two").swapcase() - pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace( + pd.Series(["Foo Bar Baz", np.nan], dtype="str").str.replace( pat, repl, regex=True ) @@ -335,8 +442,6 @@ regular expression object will raise a ``ValueError``. ``removeprefix`` and ``removesuffix`` have the same effect as ``str.removeprefix`` and ``str.removesuffix`` added in `Python 3.9 `__: -.. versionadded:: 1.4.0 - .. ipython:: python s = pd.Series(["str_foo", "str_bar", "no_prefix"]) @@ -348,19 +453,19 @@ regular expression object will raise a ``ValueError``. .. _text.concatenate: Concatenation -------------- +============= There are several ways to concatenate a ``Series`` or ``Index``, either with itself or others, all based on :meth:`~Series.str.cat`, resp. ``Index.str.cat``. Concatenating a single Series into a string -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------- The content of a ``Series`` (or ``Index``) can be concatenated: .. ipython:: python - s = pd.Series(["a", "b", "c", "d"], dtype="string") + s = pd.Series(["a", "b", "c", "d"], dtype="str") s.str.cat(sep=",") If not specified, the keyword ``sep`` for the separator defaults to the empty string, ``sep=''``: @@ -373,12 +478,12 @@ By default, missing values are ignored. Using ``na_rep``, they can be given a re .. 
ipython:: python - t = pd.Series(["a", "b", np.nan, "d"], dtype="string") + t = pd.Series(["a", "b", np.nan, "d"], dtype="str") t.str.cat(sep=",") t.str.cat(sep=",", na_rep="-") Concatenating a Series and something list-like into a Series -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------------------------ The first argument to :meth:`~Series.str.cat` can be a list-like object, provided that it matches the length of the calling ``Series`` (or ``Index``). @@ -394,7 +499,7 @@ Missing values on either side will result in missing values in the result as wel s.str.cat(t, na_rep="-") Concatenating a Series and something array-like into a Series -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------------------------- The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the lengths of the calling ``Series`` (or ``Index``). @@ -406,7 +511,7 @@ The parameter ``others`` can also be two-dimensional. In this case, the number o s.str.cat(d, na_rep="-") Concatenating a Series and an indexed object into a Series, with alignment -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +-------------------------------------------------------------------------- For concatenation with a ``Series`` or ``DataFrame``, it is possible to align the indexes before concatenation by setting the ``join``-keyword. @@ -414,7 +519,7 @@ the ``join``-keyword. .. ipython:: python :okwarning: - u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string") + u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="str") s u s.str.cat(u) @@ -425,7 +530,7 @@ In particular, alignment also means that the different lengths do not need to co .. ipython:: python - v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string") + v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="str") s v s.str.cat(v, join="left", na_rep="-") @@ -441,7 +546,7 @@ The same alignment can be used when ``others`` is a ``DataFrame``: s.str.cat(f, join="left", na_rep="-") Concatenating a Series and many objects into a Series -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +----------------------------------------------------- Several array-like items (specifically: ``Series``, ``Index``, and 1-dimensional variants of ``np.ndarray``) can be combined in a list-like container (including iterators, ``dict``-views, etc.). @@ -470,7 +575,7 @@ the union of these indexes will be used as the basis for the final concatenation s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-") Indexing with ``.str`` ----------------------- +====================== .. _text.indexing: @@ -481,19 +586,19 @@ of the string, the result will be a ``NaN``. .. ipython:: python s = pd.Series( - ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string" + ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="str" ) s.str[0] s.str[1] Extracting substrings ---------------------- +===================== .. _text.extract: Extract first match in each subject (extract) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------------------------------------------- The ``extract`` method accepts a `regular expression `__ with at least one @@ -506,7 +611,7 @@ DataFrame with one column per group. 
pd.Series( ["a1", "b2", "c3"], - dtype="string", + dtype="str", ).str.extract(r"([ab])(\d)", expand=False) Elements that do not match return a row filled with ``NaN``. Thus, a @@ -520,7 +625,7 @@ Named groups like .. ipython:: python - pd.Series(["a1", "b2", "c3"], dtype="string").str.extract( + pd.Series(["a1", "b2", "c3"], dtype="str").str.extract( r"(?P[ab])(?P\d)", expand=False ) @@ -530,7 +635,7 @@ and optional groups like pd.Series( ["a1", "b2", "3"], - dtype="string", + dtype="str", ).str.extract(r"([ab])?(\d)", expand=False) can also be used. Note that any capture group names in the regular @@ -542,20 +647,20 @@ with one column if ``expand=True``. .. ipython:: python - pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True) + pd.Series(["a1", "b2", "c3"], dtype="str").str.extract(r"[ab](\d)", expand=True) It returns a Series if ``expand=False``. .. ipython:: python - pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False) + pd.Series(["a1", "b2", "c3"], dtype="str").str.extract(r"[ab](\d)", expand=False) Calling on an ``Index`` with a regex with exactly one capture group returns a ``DataFrame`` with one column if ``expand=True``. .. ipython:: python - s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string") + s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="str") s s.index.str.extract("(?P[a-zA-Z])", expand=True) @@ -592,7 +697,7 @@ first row) +--------+---------+------------+ Extract all matches in each subject (extractall) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------------ .. _text.extractall: @@ -600,7 +705,7 @@ Unlike ``extract`` (which returns only the first match), .. ipython:: python - s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string") + s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="str") s two_groups = "(?P[a-z])(?P[0-9])" s.str.extract(two_groups, expand=True) @@ -618,7 +723,7 @@ When each subject string in the Series has exactly one match, .. ipython:: python - s = pd.Series(["a3", "b3", "c2"], dtype="string") + s = pd.Series(["a3", "b3", "c2"], dtype="str") s then ``extractall(pat).xs(0, level='match')`` gives the same result as @@ -639,11 +744,11 @@ same result as a ``Series.str.extractall`` with a default index (starts from 0). pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups) - pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups) + pd.Series(["a1a2", "b1", "c1"], dtype="str").str.extractall(two_groups) Testing for strings that match or contain a pattern ---------------------------------------------------- +=================================================== You can check whether elements contain a pattern: @@ -652,7 +757,7 @@ You can check whether elements contain a pattern: pattern = r"[0-9][a-z]" pd.Series( ["1", "2", "3a", "3b", "03c", "4dx"], - dtype="string", + dtype="str", ).str.contains(pattern) Or whether elements match a pattern: @@ -661,14 +766,14 @@ Or whether elements match a pattern: pd.Series( ["1", "2", "3a", "3b", "03c", "4dx"], - dtype="string", + dtype="str", ).str.match(pattern) .. ipython:: python pd.Series( ["1", "2", "3a", "3b", "03c", "4dx"], - dtype="string", + dtype="str", ).str.fullmatch(pattern) .. note:: @@ -692,21 +797,21 @@ True or False: .. 
ipython:: python s4 = pd.Series( - ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string" + ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="str" ) s4.str.contains("A", na=False) .. _text.indicator: Creating indicator variables ----------------------------- +============================ You can extract dummy variables from string columns. For example if they are separated by a ``'|'``: .. ipython:: python - s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string") + s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="str") s.str.get_dummies(sep="|") String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``. @@ -719,7 +824,7 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``. See also :func:`~pandas.get_dummies`. Method summary --------------- +============== .. _text.summary: