# Operations on `Series`

In the [introduction](pandas_series) on Pandas `Series` we noted that they are objects and as such have inner state given by the attributes (e.g., `.index` and `.values`) and methods to query and modify this inner state. So far we have touched on methods only more or less accidentally (cf. the [`NaNs` in `Series`](nans-in-series) section).

In this section we will spend some time exploring different types of methods available for Pandas `Series`. As we will learn, calling methods, and in particular chaining several method calls (so-called method chaining), is a quite common pattern when working with Pandas.

An exhaustive list of available methods can be found in the [`Series` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

As usual we start with some imports.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

f"Pandas version: {pd.__version__ = }, Numpy version: {np.__version__ = }"

"Pandas version: pd.__version__ = '2.2.3', Numpy version: np.__version__ = '2.2.3'"

(series-statistics)=
## Statistics in `Series`

`Series` have a number of methods for performing basic statistics. The result of the corresponding method calls may vary: Some (may) yield other `Series` while others yield scalar values.

### Reductions

Reductions are operations that map the content of a `Series` to a single scalar value. The principle is illustrated in the following sketch.

![`Series` reductions: Summation, mean value, and standard deviation](../../_build_img/Reductions-1.png)

:::{note} We note that reductions are most commonly used with numerical data. In particular, the mean value, the standard deviation, or the median are strange quantities in the context of non-numerical data such as strings.
:::

In [76]:
s = pd.Series(range(1, 10))
s

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64

Sum all elements in the `Series` to yield a scalar value. The result usually is of the same type as the `dtype` of the `Series`.

In [77]:
s.sum(), type(s.sum())

(np.int64(45), numpy.int64)

The mean value is the sum over all elements divided by the size of the `Series`. Due to the division operation the result has a floating point type.

$$
\mu = \frac{1}{N}\sum_{i = 0}^{N - 1}s_i,
$$

where $N$ is the size of the `Series` and $s_i$ are the elements of the `Series`, $ i = 0, \dots, N - 1$. 

In [78]:
s.mean(), s.sum() / s.size

(np.float64(5.0), np.float64(5.0))

The computation of the standard deviation deserves a bit of explanation. In *Pandas* it is implemented like

$$
\sigma = \sqrt{\frac{1}{N - \Delta} \sum_{i = 0}^{N - 1}\left(s_i - \mu\right)^2},
$$

with $\Delta = 1$. The value of $\Delta$ is important; it is called the "delta degrees of freedom" (the `ddof` parameter). NumPy, in contrast, computes the standard deviation with $\Delta = 0$ by default. When using the [`np.std`](https://numpy.org/doc/stable/reference/generated/numpy.std.html) function we have to specify `ddof=1` to obtain the same result as obtained from calling `.std()` on a `Series`.

In [79]:
s.std()

np.float64(2.7386127875258306)
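
The difference between the two conventions can be checked directly. The following is a minimal sketch; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(1, 10))

# Pandas divides by N - 1 by default (ddof=1)
std_pandas = s.std()

# NumPy divides by N by default (ddof=0) ...
std_numpy_default = np.std(s.to_numpy())

# ... and matches Pandas only when ddof=1 is passed explicitly
std_numpy_ddof1 = np.std(s.to_numpy(), ddof=1)

# Conversely, Pandas reproduces the NumPy default with ddof=0
std_pandas_ddof0 = s.std(ddof=0)

print(std_pandas, std_numpy_default, std_numpy_ddof1, std_pandas_ddof0)
```

Since the NumPy default divides by the larger number $N$, it yields a slightly smaller value than the Pandas default.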

We note that the variance is the standard deviation squared, $\sigma^2$.

In [80]:
s.var(), s.std() ** 2

(np.float64(7.5), np.float64(7.5))

Another important statistic to be computed is the median.

In [81]:
s.median()

np.float64(5.0)

The [`.agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.agg.html) method allows computing multiple reductions at once. In contrast to the specialized methods used so far, the result *may be* another `Series` object.[^agg-returns-a-series] The method also accepts a `list[str]`, where each `str`ing is a valid name of a reduction (e.g. `"mean"` for the mean value, or `"median"` for the median value).

[^agg-returns-a-series]: `.agg()` will return a `Series` if the argument `func` is of type `list` (more generally, an iterable). Even if the `list` has just a single element (like in `s.agg(["mean",])`) the result will be a new `Series` of size equal to 1.

In [82]:
s.agg(['sum', 'mean', 'std', 'median'])

sum       45.000000
mean       5.000000
std        2.738613
median     5.000000
dtype: float64
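
The dependence of the return type on the argument type can be seen in a small sketch (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(range(1, 10))

as_scalar = s.agg("mean")    # a single string yields a scalar
as_series = s.agg(["mean"])  # a list yields a Series, even with one entry

print(type(as_scalar))
print(as_series)
```

The index of the resulting `Series` holds the names of the reductions, so individual results can be looked up with `.loc[]`, e.g. `as_series.loc["mean"]`.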

Actually, it is also possible to pass a `list[callable]` (a `list` of `callable` functions) where each of the functions must compute some sort of reduction. In practice, these will often be NumPy functions like [`np.mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html), or [`np.std`](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

In [83]:
s.agg([np.sum, np.mean, np.std, np.median])
s.describe()


count    9.000000
mean     5.000000
std      2.738613
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max      9.000000
dtype: float64

As a final example for reductions we consider passing custom functions in a `dict`. The keys specify the index label of the result in the resulting `Series`.

In [84]:
s.agg(
    {
        "sum of squares": lambda x: (x ** 2).sum(),
        "max of squares": lambda x: (x ** 2).max()
    }
)

sum of squares    285
max of squares     81
dtype: int64

Each of the functions defined inside the `dict` operates on the whole `Series` on which the `.agg()` method is called. The expression `lambda x: (x ** 2).sum()` defines an anonymous `lambda` function that takes a `Series` as argument `x`, squares it `(x ** 2)`[^sum-of-squares-with-method-chaining] (this is another `Series`!), and then calls `.sum()` on the new `Series` resulting from the previous operation. The result of the summation is returned and is the entry associated with the label `"sum of squares"`.

[^sum-of-squares-with-method-chaining]: An alternative to compute the sum of squares is to chain appropriate method calls: `lambda x: x.pow(2).sum()`. `x` is a `Series`, on which we call `pow(2)` (compute each element to the power of 2) which returns a *new* `Series`, on which we call `.sum()` to compute the actual reduction.

### Methods yielding other `Series` (or something similar)

Apart from reductions there also exist "statistical" methods that may return a new `Series`. 

In [85]:
s = pd.Series([1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 5])
s

0     1
1     1
2     1
3     2
4     3
5     3
6     4
7     4
8     4
9     4
10    5
11    5
dtype: int64

The `.value_counts()` method returns a `Series` with the frequency of values. The index of the resulting `Series` contains the unique entries of the original `Series` object, while the data it holds is the count of each of the unique values.

In [86]:
s.value_counts()

4    4
1    3
3    2
5    2
2    1
Name: count, dtype: int64

The unique values can either be obtained from the index of the previous result or we use the `.unique()` method.

:::{note} The `.unique()` method returns its result as a `np.ndarray`.
:::

In [87]:
s.unique()

array([1, 2, 3, 4, 5])
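
Both routes to the unique values can be compared in a short sketch. Note that the ordering differs: `.value_counts()` sorts by frequency, whereas `.unique()` keeps the order of first appearance. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 5])

# value_counts() orders by frequency, so sort the index before comparing
from_counts = sorted(s.value_counts().index)

# unique() keeps the order of first appearance
from_unique = s.unique().tolist()

# number of distinct values
n_unique = s.nunique()

print(from_counts, from_unique, n_unique)
```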

As the last method discussed in this section we take the [`.duplicated()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.duplicated.html) method. This method returns a new `Series` with `dtype == bool` that has `True` at the position of each duplicate value. By default, the first occurrence of every value is *not* considered a duplicate and is marked with `False`, even if the value has multiplicity > 1.

In [88]:
s = pd.Series(index=[f"a{idx}" for idx in range(len(s))], data=s.to_numpy())
s

a0     1
a1     1
a2     1
a3     2
a4     3
a5     3
a6     4
a7     4
a8     4
a9     4
a10    5
a11    5
dtype: int64

In [89]:
s.duplicated()

a0     False
a1      True
a2      True
a3     False
a4     False
a5      True
a6     False
a7      True
a8      True
a9      True
a10    False
a11     True
dtype: bool

We can use this `Series` as a boolean mask to extract the unique values from the `Series`. The `~` in front of `s.duplicated()` *negates* all entries in the `Series`; for a boolean `Series` this gives the same result as calling [`np.logical_not`](https://numpy.org/doc/stable/reference/generated/numpy.logical_not.html).[^numpy-logical-functions] We note that the index of the `Series` we obtain contains the labels at which `~s.duplicated()` has `True` entries.

[^numpy-logical-functions]: An exhaustive list of NumPy's logical functions can be found on: https://numpy.org/doc/stable/reference/routines.logic.html.

In [90]:
s.loc[~s.duplicated()]

a0     1
a3     2
a4     3
a6     4
a10    5
dtype: int64

In [91]:
s.loc[np.logical_not(s.duplicated())]

a0     1
a3     2
a4     3
a6     4
a10    5
dtype: int64
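
Which occurrence counts as "the original" is controlled by the `keep` parameter of `.duplicated()`. The following sketch uses a shortened `Series` (an assumption made to keep the output small) and also shows `.drop_duplicates()`, which gives the same result as the boolean mask used above.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3], index=["a0", "a1", "a2", "a3", "a4"])

first = s.duplicated()            # default keep="first": first occurrence is False
last = s.duplicated(keep="last")  # last occurrence is False
every = s.duplicated(keep=False)  # every value occurring more than once is True

# drop_duplicates() gives the same result as s.loc[~s.duplicated()]
deduplicated = s.drop_duplicates()

print(first.tolist(), last.tolist(), every.tolist())
print(deduplicated)
```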

Before closing this section we will demonstrate that the `.value_counts()` method can also be used for non-numerical data, `str` for example.

:::{note} The `dtype` of a `Series` containing `str` objects is `object`.
:::

In [92]:
words = pd.Series("In Ulm und um Ulm und um Ulm herum".split())
words

0       In
1      Ulm
2      und
3       um
4      Ulm
5      und
6       um
7      Ulm
8    herum
dtype: object

In [93]:
words.value_counts()

Ulm      3
um       2
und      2
In       1
herum    1
Name: count, dtype: int64

In [94]:
words.unique()

array(['In', 'Ulm', 'und', 'um', 'herum'], dtype=object)

(series-methods-manipulation)=
## `Series` manipulation

Much of the work with `Pandas` data structures is done by calling appropriate methods on objects. Indeed, chaining method calls is a common pattern in Pandas workflows.

In the following we will discuss some helpful methods used to manipulate content of a `Series` object. 

:::{warning} Essentially all methods discussed here return *new* objects. That is, they either return a *new* `Series` (e.g. by transforming the content of another `Series`) or new scalar values (e.g. resulting from an aggregation).
:::

The principle of calling methods on objects is sketched in the following figure (we will deal with the `.transform()` and the `.apply()` method below).

![Calling methods on Pandas `Series` objects.](../../_build_img/SeriesMethods-1.png)

### `.replace()` and `.map()`

The methods [`.map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) and [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html) are used to replace different values according to a replacement rule.

We start with `.replace()`. You will soon realize that methods called on `Series` (and `DataFrame`s as well) accept quite a lot of different types of arguments (e.g., `str`, `list`, `dict`, and `callable`s). This is also true for `.replace()` (refer to the `to_replace` parameter) as we will now see.

In [95]:
strings = pd.Series("Er sah das Wasser as".split())
strings

0        Er
1       sah
2       das
3    Wasser
4        as
dtype: object

The following method call will replace the entry `'as'` with `'an'`: We use two arguments (`to_replace` and `value`) to replace any occurrence of the first with the second. It is important to note that an entry must *exactly* match the string `'as'`: no characters to the left or the right are allowed. As a result, the substrings in `'das'` or `'Wasser'` will *not* be replaced with `'an'`.

In [96]:
(
    strings
    .replace(to_replace="as", value="an") # replace entries matching 'as' exactly with 'an'
)

0        Er
1       sah
2       das
3    Wasser
4        an
dtype: object

If we want `'as'` to be interpreted as a pattern that shall be replaced we must use [regular expressions](https://docs.python.org/3/howto/regex.html). We will, however, not dwell too long on "regexps" as this is a topic of its own. Suffice it to say that regular expressions are often very helpful when working with strings.

In [97]:
(
    strings
    .replace(to_replace="as", value="an", regex=True)
)

0        Er
1       sah
2       dan
3    Wanser
4        an
dtype: object
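
Regular expressions also allow restricting *where* inside an entry the pattern may match. The following sketch uses the standard anchors `^` (start of string) and `$` (end of string); the variable names are illustrative.

```python
import pandas as pd

strings = pd.Series("Er sah das Wasser as".split())

# `as$` only matches 'as' at the end of an entry
at_end = strings.replace(to_replace=r"as$", value="an", regex=True)

# `^as$` only matches entries that consist of 'as' and nothing else
whole_entry = strings.replace(to_replace=r"^as$", value="an", regex=True)

print(at_end.tolist())
print(whole_entry.tolist())
```

With `as$` the entry `'das'` is changed while `'Wasser'` is left alone; with `^as$` the result coincides with the exact-match replacement shown earlier.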

In [24]:
# Fill the gap!

Python `dict`s are good for expressing replacement rules: The keys describe what to replace (old values) while the values describe what to fill in instead (new values).

:::{note} Replacements are made for (old) values that are *explicitly* specified while all others are *ignored*. As a result the size of the old and the new `Series` will be the same. Needless to say, they also have the same index.
:::

In [100]:
integers = pd.Series((0, 10, 20, 30))
integers

0     0
1    10
2    20
3    30
dtype: int64

In [101]:
(
    integers
    .replace({
        0: 1000, # replace 0 with 1000
        20: 100, # replace 20 with 100
    })
)

0    1000
1      10
2     100
3      30
dtype: int64

The same result can also be achieved by passing a `Series` as argument.

In [103]:
integers_with_replaced_entries = (
    integers
    .replace(pd.Series({0: 1000, 20: 100})) 
)
integers_with_replaced_entries

0    1000
1      10
2     100
3      30
dtype: int64

To quickly demonstrate that a call to `.replace()` actually returns a *new* `Series` (one that does *not* share memory with the original `Series` object it was created from) we inspect the content of the original `Series`:

In [104]:
integers

0     0
1    10
2    20
3    30
dtype: int64

As we can see the content of the original `Series` is unchanged. In fact, calling `.replace()` on `integers_with_replaced_entries` would yield another `Series` independent of the one it was created from (and so forth).

The `.map()` method is also available for making replacements. The semantics are slightly different, however. Let's start with the example from above where we used a `dict` to specify the replacement rules:

In [106]:
(
    (
        integers
        .map({0: 1000, 20: 100})
    ),
    (
        integers
        .replace({0: 1000, 20: 100})
    )
)

(0    1000.0
 1       NaN
 2     100.0
 3       NaN
 dtype: float64,
 0    1000
 1      10
 2     100
 3      30
 dtype: int64)

We note that *all* values not captured by the replacement rules have been replaced with `NaN`. As a result the `dtype` has been changed to `float64` (remember that `NaN` is a special floating point value). If we do not want the `NaN` values we can use a [`defaultdict`](https://docs.python.org/3/library/collections.html#collections.defaultdict) with a suitable (whatever your current situation demands) default value.

In [110]:
def make_replacement_with_default(default_value: int | float, replacements: dict):
    from collections import defaultdict

    dd = defaultdict(lambda: default_value)
    for key, value in replacements.items():
        dd[key] = value

    return dd

make_replacement_with_default(-1000, {0: 1000, 20: 100})

defaultdict(<function __main__.make_replacement_with_default.<locals>.<lambda>()>,
            {0: 1000, 20: 100})

Instead of letting values that are not covered by the replacement specification be converted to `NaN`s, we specify a default replacement value for them (-1000 in this case).

In [113]:
(
    integers
    .map(make_replacement_with_default(-1000, {0: 1000, 20: 100}))
)

0    1000
1   -1000
2     100
3   -1000
dtype: int64
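
As a more compact alternative to the helper function, a `defaultdict` can be built directly from an existing `dict`. This is a minimal sketch, assuming the replacement rules are already available as a plain `dict`; the name `replacements` is illustrative.

```python
from collections import defaultdict

import pandas as pd

integers = pd.Series((0, 10, 20, 30))

# first argument: factory producing the default value,
# second argument: initial content of the dictionary
replacements = defaultdict(lambda: -1000, {0: 1000, 20: 100})

mapped = integers.map(replacements)
print(mapped)
```

Keep in mind that looking up a missing key in a `defaultdict` also inserts that key into the dictionary, so `replacements` may grow as a side effect of the call.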

Finally, we also note that `.map()` accepts a `callable` that will be applied element-wise to the `Series` (it operates on one entry at a time). The `callable` can either be an anonymous `lambda` function or a named function (defined with the `def` keyword).

In [115]:
(
    integers 
    .map(lambda x: x * 1000) # this returns a *new* Series object!
)

0        0
1    10000
2    20000
3    30000
dtype: int64

In [117]:
def mul_by_value(x, value=1000):
    return x * value

In [121]:
# (
#     integers
#     .map(lambda x: mul_by_value(x, value=-1000))
# )

(
    integers
    .map(lambda x: "divisible by 20" if x % 20 == 0 else "not divisible by 20")
)

0        divisible by 20
1    not divisible by 20
2        divisible by 20
3    not divisible by 20
dtype: object

In summary, `.replace()` and `.map()` *can* do similar things. `.map()` is more generic: It can be used to replace values in a `Series`, but at the same time it does more than just replace specified values. Unspecified values will be replaced as well (e.g. with `NaN`) and, when passing a `callable`, we can specify operations to be applied to every element of the `Series`. As a result the `dtype` of the resulting `Series` can be different from that of the `Series` on which `.map()` has been called. This makes `.map()` more difficult to reason about and a more detailed inspection of the code may be required to fully understand the intent of calling it.

### Quiz

<span style="display:none" id="2_SeriesOperations:1">W3sidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJfY29scyI6IDEsICJxdWVzdGlvbiI6ICJXaGF0IGlzIGEgdmFsaWQgYm9vbGVuIGV4cHJlc3Npb24gd2hlbiBgYWAgYW5kIGBiYCBhcmUgTnVtUHkgYXJyYXlzPyIsICJhbnN3ZXJzIjogW3siY29ycmVjdCI6IHRydWUsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIihhICUgMiA9PSAwKSAmIChiIDwgNDIpIn0sIHsiY29ycmVjdCI6IHRydWUsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogImEgLSBiIDwgMCJ9LCB7ImNvcnJlY3QiOiBmYWxzZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAiKGEgJSAyID09IDEpIGFuZCAoYiAqIGEgPT0gbnAucGkpIn0sIHsiY29ycmVjdCI6IGZhbHNlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICIoYiA+IDApIG9yIChhIDwgMCkifSwgeyJjb3JyZWN0IjogdHJ1ZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAiKGEgPiBiKSB8IChhIDwgYikifV19XQ==</span>

In [2]:
import jupyterquiz
jupyterquiz.display_quiz("#2_SeriesOperations:1")

<IPython.core.display.Javascript object>

You are given a `Series` with name `s` with index labels `["a", "g", "h", "p", "q", "b", "t"]`.

<span style="display:none" id="2_SeriesOperations:2">W3sidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJfY29scyI6IDEsICJxdWVzdGlvbiI6ICJXaGF0IGFyZSBwb3NzaWJsZSB3YXlzIHRvIGFjY2VzcyB0aGUgbGFiZWxzIGAnZydgLCBgJ2gnYCwgYCdwJ2AsIGAncSdgPyIsICJhbnN3ZXJzIjogW3siY29ycmVjdCI6IHRydWUsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogInMubG9jW1tcImdcIiwgXCJoXCIsIFwicFwiLCBcInFcIl1dIn0sIHsiY29ycmVjdCI6IGZhbHNlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJzLmxvY1tcImdcIiwgXCJoXCIsIFwicFwiLCBcInFcIl0ifSwgeyJjb3JyZWN0IjogdHJ1ZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAicy5sb2NbXCJnXCI6XCJxXCJdIn0sIHsiY29ycmVjdCI6IGZhbHNlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJzLmlsb2NbXCJnXCI6XCJxXCJdIn0sIHsiY29ycmVjdCI6IGZhbHNlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJzLmlsb2NbW1wiZ1wiLCBcImhcIiwgXCJwXCIsIFwicVwiXV0ifSwgeyJjb3JyZWN0IjogdHJ1ZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAic1tbXCJnXCIsIFwiaFwiLCBcInBcIiwgXCJxXCJdXSJ9XX1d</span>

In [3]:

jupyterquiz.display_quiz("#2_SeriesOperations:2")

<IPython.core.display.Javascript object>

<span style="display:none" id="2_SeriesOperations:3">W3sidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJfY29scyI6IDEsICJxdWVzdGlvbiI6ICJXaGF0IGFyZSB2YWxpZCB3YXlzIHRvIGdldCB0aGUgbWVhbiwgbWVkaWFuLCBhbmQgc3RhbmRhcmQgZGV2aWF0aW9uIGZyb20gYSBgU2VyaWVzYCBhbmQgcmV0dXJuIHRoZW0gaW4gYSBgU2VyaWVzYD8iLCAiYW5zd2VycyI6IFt7ImNvcnJlY3QiOiB0cnVlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJzLmFnZyhbXCJtZWFuXCIsIFwibWVkaWFuXCIsIFwic3RkXCJdKSJ9LCB7ImNvcnJlY3QiOiB0cnVlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJwZC5TZXJpZXMoW3MubWVhbigpLCBzLm1lZGlhbigpLCBzLnN0ZCgpXSkifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogInMubWVhbigpLCBzLm1lZGlhbigpLCBzLnN0ZCgpIn0sIHsiY29ycmVjdCI6IHRydWUsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogInMuYWdnKHtcIm1lYW5cIjogbGFtYmRhIHg6IHgubWVhbigpLCBcIm1lZGlhblwiOiBsYW1iZGEgeDogeC5tZWRpYW4oKSwgXCJzdGRcIjogbGFtYmRhIHg6IHguc3RkKCl9KSJ9LCB7ImNvcnJlY3QiOiBmYWxzZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAicy5yZXBsYWNlKHtcIm1lYW5cIjogbGFtYmRhIHg6IHgubWVhbigpLCBcIm1lZGlhblwiOiBsYW1iZGEgeDogeC5tZWRpYW4oKSwgXCJzdGRcIjogbGFtYmRhIHg6IHguc3RkKCl9KSJ9XX1d</span>

In [4]:

jupyterquiz.display_quiz("#2_SeriesOperations:3")

<IPython.core.display.Javascript object>

You are given two `Series`, one with index `["a", "a", "b"]`, and the other with index `["a", "a", "d", "b"]`.

<span style="display:none" id="2_SeriesOperations:4">W3sidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJfY29scyI6IDEsICJxdWVzdGlvbiI6ICJXaGF0IGlzIHRoZSBjb250ZW50IG9mIHRoZSBpbmRleCBmb3IgYSBgU2VyaWVzYCByZXN1bHRpbmcgZnJvbSBhbiBvcGVyYXRpb24gYmV0d2VlbiB0aGUgdHdvIChlLmcuIGFkZGl0aW9uKT8gKG9yZGVyIGRvZXMgbm90IG1hdHRlciBoZXJlISkiLCAiYW5zd2VycyI6IFt7ImNvcnJlY3QiOiB0cnVlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJbXCJhXCIsIFwiYVwiLCBcImFcIiwgXCJhXCIsIFwiYlwiLCBcImRcIl0ifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIltcImFcIiwgXCJhXCIsIFwiYlwiLCBcImRcIl0ifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIltcImFcIiwgXCJhXCIsIFwiYVwiLCBcImFcIiwgXCJiXCIsIFwiYlwiLCBcImRcIl0ifV19XQ==</span>

In [5]:

jupyterquiz.display_quiz("#2_SeriesOperations:4")

<IPython.core.display.Javascript object>

<span style="display:none" id="2_SeriesOperations:5">W3sidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJfY29scyI6IDEsICJxdWVzdGlvbiI6ICJXaGljaCB2YWx1ZSBpcyB1c2VkIGFzIGEgcmVzdWx0IGluIHBvc2l0aW9ucyB3aGVyZSB0aGUgaW5kZXggb2YgdHdvIGBTZXJpZXNgIG9wZXJhbmRzIGRvIG5vdCBtYXRjaCBpbiBhbiBvcGVyYXRpb24/IiwgImFuc3dlcnMiOiBbeyJjb3JyZWN0IjogdHJ1ZSwgImFuc3dlciI6ICIgIiwgImNvZGUiOiAibmFuIn0sIHsiY29ycmVjdCI6IGZhbHNlLCAiYW5zd2VyIjogIiAiLCAiY29kZSI6ICJpbmYifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIi1pbmYifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIk5vbmUifSwgeyJjb3JyZWN0IjogZmFsc2UsICJhbnN3ZXIiOiAiICIsICJjb2RlIjogIjQyIn1dfV0=</span>

In [6]:

jupyterquiz.display_quiz("#2_SeriesOperations:5")

<IPython.core.display.Javascript object>

(series-operations-transform-and-apply)=
### `.transform()` and `.apply()`

The [`.transform()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.transform.html) as well as the [`.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method are used to "apply" a function to the values of a `Series`. As we will soon see, `.transform()` more clearly conveys its intent, which is "transforming" one `Series` into another, while `.apply()` is more generic (and harder to understand).

We start with `.transform()`: It expects a `callable` that often is a user-defined (anonymous or named) function that operates on the whole `Series`. Some examples will best illustrate common usage of this method.

In [7]:
integers = pd.Series(data=range(10), index=[f"a{idx}" for idx in range(10)])
integers

a0    0
a1    1
a2    2
a3    3
a4    4
a5    5
a6    6
a7    7
a8    8
a9    9
dtype: int64

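In our first example we query whether each value is even. `.transform()` returns a new `Series` of the same length with `dtype == bool` and the same entries in the index. The subsequent `.loc[lambda s: s]` uses this boolean `Series` as a mask, so only the entries that are `True` remain in the output.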

In [13]:
(
    integers
    .transform(lambda s: s % 2 == 0)
    .loc[lambda s: s] # capture current Series after call to `transform` method
)


a0    True
a2    True
a4    True
a6    True
a8    True
dtype: bool

Of course, the same result could have been achieved by writing a `bool`ean expression that directly uses the `Series` but often using `.transform()` is better as it nicely "paves the way" for calling multiple methods in sequence (method chaining).

In [16]:
(integers % 2 == 0)

a0     True
a1    False
a2     True
a3    False
a4     True
a5    False
a6     True
a7    False
a8     True
a9    False
dtype: bool

It is important to understand that the `.transform()` method expects a function to actually "transform", i.e., reductions (like summing all values) are not allowed because they do *not* return a `Series` but rather a single scalar value. Have a look at the following where we attempt to compute the mean of all values with the `.transform()` method.

```python
>>> integers.transform(np.mean)
...
ValueError: Function did not transform
```
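
If such a call may occur in a longer script, the error can be caught explicitly. The following sketch passes the reduction by its string name `"mean"` instead of `np.mean` (an assumption made here to keep the example independent of how a particular Pandas version treats NumPy callables).

```python
import pandas as pd

integers = pd.Series(data=range(10), index=[f"a{idx}" for idx in range(10)])

message = None
try:
    integers.transform("mean")  # "mean" is a reduction, not a transformation
except ValueError as err:
    message = str(err)
    print(f"transform failed: {message}")

# use .agg() (or the specialized .mean()) for reductions instead
mean_value = integers.agg("mean")
print(mean_value)
```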

In [18]:
# integers.transform(np.mean) # uncomment to get detailed output

If the function used to transform the values has more than a single parameter, we can pass values for these parameters as keyword arguments to the `.transform()` method. The following function shifts a value by `min_value` and divides it by the range `max_value - min_value`. We use it to scale all values to the interval $[0, 1]$.

$$
s_i^{(\mathrm{scaled})} = \frac{s_i - s_\mathrm{min}}{s_\mathrm{max} - s_\mathrm{min}}, \quad s_\mathrm{max} = \max_i s_i, \quad s_\mathrm{min} = \min_i s_i, \quad i = 0, \dots, N - 1
$$

In [19]:
def scale_minmax(x, min_value, max_value):
    return (x - min_value) / (max_value - min_value)

In [20]:
(
    integers
    .transform(
        scale_minmax, # callable
        min_value=integers.min(), # 1st function parameter
        max_value=integers.max(), # 2nd function parameter
    )
)

a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64

The same result can be achieved by either using a `lambda` function or using [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial).

In [23]:
(
    integers
    .transform(
        lambda x: scale_minmax(
            x, 
            min_value=integers.min(), 
            max_value=integers.max(),
        )
    )
)

a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64

`functools.partial` returns a new function where some of its arguments have been fixed to concrete values. In our case we fix the `min_value` / `max_value` parameters to the smallest / largest value of `integers`. The resulting function expects a single argument, which is the element of the `Series` to transform.

In [29]:
from functools import partial

f_specialized = partial(
    scale_minmax,
    min_value=integers.min(),
    max_value=integers.max(),
)

# f_specialized(integers.loc["a9"])
(
    integers
    .transform(f_specialized)
)

a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64
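
To see what `functools.partial` does independently of Pandas, the specialized function can be called on plain numbers. In this sketch the bounds 0 and 9 are hard-coded and the name `scale_0_to_9` is illustrative.

```python
from functools import partial


def scale_minmax(x, min_value, max_value):
    return (x - min_value) / (max_value - min_value)


# fix two of the three parameters; only `x` remains open
scale_0_to_9 = partial(scale_minmax, min_value=0, max_value=9)

smallest = scale_0_to_9(0)
largest = scale_0_to_9(9)

# equivalent direct call without partial
same = scale_minmax(3, min_value=0, max_value=9)

print(smallest, largest, scale_0_to_9(3), same)
```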

Finally, we also take note of the following cases. Firstly, we can pass one of NumPy's universal functions that operates on the whole `Series`.

In [30]:
integers.transform(np.square)

a0     0
a1     1
a2     4
a3     9
a4    16
a5    25
a6    36
a7    49
a8    64
a9    81
dtype: int64

Secondly, it is also possible to have a function that is called on one element at a time.

In [33]:
def f(x):
    print(type(x))  # inspect the type of the argument
    if x % 2 == 0:
        return "divisible by 2"
    elif x % 3 == 0:
        return "divisible by 3"
    elif x % 4 == 0:
        return "divisible by 4"
    else:
        return "something else"


(
    integers
    .iloc[:5]  # iloc for shorter output
    .transform(f)
)

<class 'int'>
<class 'int'>
<class 'int'>
<class 'int'>
<class 'int'>


a0    divisible by 2
a1    something else
a2    divisible by 2
a3    divisible by 3
a4    divisible by 2
dtype: object

We briefly contrast this with the case in which the whole `Series` is the argument of the callable.

In [34]:
def f(s):
    output = (s - s.min()) / (s.max() - s.min())
    print(type(s))  # inspect the type of the argument
    return output


integers.transform(lambda s: f(s))

<class 'pandas.core.series.Series'>


a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64

We now turn to the `.apply()` method. Let's first look at an example where `.apply()` can be used in just the same manner as `.transform()`.

In [35]:
(
    integers
    .apply(f_specialized)
)

a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64

In [37]:
(
    integers
    .apply(scale_minmax, min_value=integers.min(), max_value=integers.max())
)

a0    0.000000
a1    0.111111
a2    0.222222
a3    0.333333
a4    0.444444
a5    0.555556
a6    0.666667
a7    0.777778
a8    0.888889
a9    1.000000
dtype: float64

So far, so good. We have seen that (at least in the case of transformations) `.transform()` and `.apply()` can be used interchangeably.

:::{note} If you want to transform a `Series` (and later also `DataFrame`s) we recommend using the `.transform()` method as it more clearly expresses the intent of what you actually want to achieve with this particular operation.
:::

Let's revisit the case in which we earlier obtained a `ValueError: Function did not transform`, now with a slight modification of the method call. When using `.apply()` with `by_row=False` the function will be applied to the *whole* `Series`. If the function happens to be a reduction, like computing the sum or the median of all elements, the result of the reduction is returned.

In [41]:
(
    integers
    .apply(np.mean, by_row=False)
)

np.float64(4.5)

The reduction above is equivalent to calling `.agg()` in the following way:

In [42]:
integers.mean(), integers.agg("mean")

(np.float64(4.5), np.float64(4.5))

In fact, we can mix transformations and reductions in a single call to `.apply()`. The transformation is the square operation and the addition, the reduction is computing the sum of the transformed result.

In [45]:
(
    integers
    .apply(lambda s: (s ** 2 + 100).sum(), by_row=False)
    # .apply(lambda s: s ** 2 + 100)
    # .apply(lambda s: s.sum(), by_row=False)
)

np.int64(1285)

A more readable way to write this is using method chaining:

In [47]:
(
    integers
    .transform(lambda s: s.pow(2) + 100)
    .sum()
)

np.int64(1285)

Yet another way to write the expression above is the following. We would, however, prefer the call to `.transform()`, which allows merging the two transformations into a single one and hence is more compact.

In [50]:
(
    integers
    .pow(2)    # new Series object
    .add(100)  # yet a new Series object
    .sum()
)

np.int64(1285)


:::{note} Use `.transform()` if you want to transform and `.agg()` (or a specialized version like `.sum()` or `.mean()`) if you want to aggregate. While `.apply()` can do both, code using methods with explicit names more clearly conveys your intent. The more complex your workflow, the more you will come to appreciate expressive code.
:::

### `.where()`

The `.where()` method is a useful tool to replace values based on a condition.

In [51]:
integers = pd.Series(range(10))

In this example we keep all values < 5 and replace all values for which the condition `s < 5` is *not* true with 0; i.e., all values >= 5 will be set to 0.

In [53]:
integers.where(
    lambda s: s < 5,  # boolean condition
    0                 # value to plug in where condition is False
)

0    0
1    1
2    2
3    3
4    4
5    0
6    0
7    0
8    0
9    0
dtype: int64
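
The counterpart of `.where()` is `.mask()`, which replaces values where the condition is `True`. The sketch below shows that both give the same result when the condition is inverted, and what happens if no replacement value is given. The variable names are illustrative.

```python
import pandas as pd

integers = pd.Series(range(10))

kept = integers.where(lambda s: s < 5, 0)    # replace where condition is False
masked = integers.mask(lambda s: s >= 5, 0)  # replace where condition is True

# without a replacement value, the affected entries become NaN
with_nan = integers.where(lambda s: s < 5)

print(kept.equals(masked))
print(with_nan)
```

Because `NaN` is a floating point value, `with_nan` has `dtype` `float64` although the original `Series` holds integers.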

In [57]:
(
    (
        integers
        .where(lambda s: s < 5, 0)
        .sum()
    ),
    (
        integers
        .transform(lambda x: x if x < 5 else 0)
        .sum()
    ),
    (
        integers
        .loc[lambda s: s < 5]
        .sum()
    )
)

(np.int64(10), np.int64(10), np.int64(10))

Let's come up with a more realistic (still very contrived) example: We want to sum all values smaller than a certain threshold. Based on what we have learned so far we may indeed come up with multiple ways to achieve this.

The following figure sketches one possible way of doing this and also showcases the workflow of chaining method calls (method chaining).

![Chaining method calls on Pandas `Series` objects](../../_build_img/SeriesMethodChaining-1.png)

In [60]:
threshold = 100
s = pd.Series(range(1_000))

Set all values greater than or equal to `threshold` to 0 as they will then make no contribution to the sum.

In [59]:
(
    s
    .where(lambda s: s < threshold, 0)
    .agg("sum")
)

np.int64(4950)

Alternatively, return a `Series` containing only the values which are smaller than the threshold and sum them up.

In [60]:
s.loc[s < threshold].sum()

np.int64(4950)

Wrap the example using `.loc[]` from above with `.apply()`.

In [62]:
(
    s
    .apply(
        lambda s: s.loc[s.lt(threshold)].sum(),
        by_row=False,
    )
)

np.int64(4950)
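
The value 4950 can be verified by hand: the sum of the integers $0, 1, \dots, t - 1$ is $t(t - 1)/2$, which gives $100 \cdot 99 / 2 = 4950$ for $t = 100$. The following sketch checks several of the approaches against this closed form; the variable names are illustrative.

```python
import pandas as pd

threshold = 100
s = pd.Series(range(1_000))

# closed form for 0 + 1 + ... + (threshold - 1)
expected = threshold * (threshold - 1) // 2

via_where = s.where(lambda x: x < threshold, 0).sum()
via_loc = s.loc[s < threshold].sum()
via_product = (s * (s < threshold)).sum()  # True/False act as 1/0

print(expected, via_where, via_loc, via_product)
```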

In [67]:
s = pd.Series(range(1_000_000))

threshold = 50_000

%timeit s.where(lambda s: s < threshold, 0).sum()
%timeit s.transform(lambda x: x if x < threshold else 0).sum()
%timeit (s * (s < threshold)).sum()
%timeit s.apply(lambda s: s.loc[s.lt(threshold)].sum(), by_row=False)
%timeit s.loc[s.lt(threshold)].sum()

1.58 ms ± 6.72 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
193 ms ± 469 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.22 ms ± 11.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
641 μs ± 23 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
630 μs ± 2.32 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
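
In the timings above, the element-wise `.transform()` variant is by far the slowest. It calls a Python function once per element, whereas the other variants operate on the whole `Series` at once. A timing comparison is only meaningful if all variants compute the same number, which the following sketch checks on a smaller `Series` (the names `s_small` and `limit` and their values are arbitrary choices made here, and the `.apply(..., by_row=False)` variant is left out).

```python
import pandas as pd

s_small = pd.Series(range(10_000))
limit = 500  # arbitrary threshold for this check

results = {
    "where": s_small.where(lambda s: s < limit, 0).sum(),
    "transform": s_small.transform(lambda x: x if x < limit else 0).sum(),
    "multiply": (s_small * (s_small < limit)).sum(),
    "loc": s_small.loc[s_small.lt(limit)].sum(),
}

# All variants should agree with the closed-form sum 0 + 1 + ... + (limit - 1).
expected = limit * (limit - 1) // 2
print({name: int(value) for name, value in results.items()}, expected)
```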


(series-operations-exercise-poll)=
## Exercises

In [68]:
rng = np.random.default_rng(seed=42)

Suppose you have conducted an anonymous poll in which, amongst other things, you asked participants about their employment status. Annoyingly, the online form used to gather the data contained a free-text field in which people could write arbitrary text (maybe the number of characters was limited) instead of a drop-down menu offering a fixed set of answers to choose from. As a result, some of the answers are not really suitable for your research (e.g., "Working with the Avengers"). You have to apply some of the techniques you have just learned for manipulating `Series` to bring the data into a form suitable for further processing.

To keep things simple, let's assume the answers you were hoping for are `"Employed"` (or `"employed"`; yes, capitalization can also get in your way here ;-)) and `"Unemployed"` (or `"unemployed"`). We consider these 'usable', while the rest is 'unusable' (we assume that the poll does not contain further information that would allow us to make them 'usable').


:::{note} In this exercise you are meant to practise modifying the content of `Series` with suitable method calls. We strongly advise using method chaining.
:::

In [69]:
data = np.array(
    ["Employed"] * rng.choice(range(50, 200))
    + ["employed"] * rng.choice(range(10, 60))
    + ["Unemployed"] * rng.choice(range(10, 100))
    + ["unemployed"] * rng.choice(range(50, 70))
    + ["Working with the Avengers"] * rng.choice(range(2, 10))
    + ["Geht dich nix an"] * rng.choice(range(1, 5))
    + ["Rate mal"] * rng.choice(range(2, 8))
    + ["Having fun all day"] * rng.choice(range(5, 10))
    + ["geht dich nix an"] * rng.choice(range(10, 20))
)
rng.shuffle(data)

poll = pd.Series(data=data)
poll

0                       Employed
1                     Unemployed
2                     unemployed
3                       Employed
4                       Employed
                 ...            
263                     Employed
264                     Employed
265                   Unemployed
266                   Unemployed
267    Working with the Avengers
Length: 268, dtype: object

### Statistics

How many instances of each category (every answer that has been given) are contained in the dataset? What is the proportion of each category in percent?

In [70]:
poll.value_counts()

Unemployed                   68
Employed                     63
unemployed                   58
employed                     48
geht dich nix an             12
Having fun all day            8
Working with the Avengers     5
Geht dich nix an              4
Rate mal                      2
Name: count, dtype: int64

In [75]:
(
    poll
    # .value_counts()
    # .div(poll.size)
    # .mul(100)
    .value_counts(normalize=True)
    .mul(100)
)

Unemployed                   25.373134
Employed                     23.507463
unemployed                   21.641791
employed                     17.910448
geht dich nix an              4.477612
Having fun all day            2.985075
Working with the Avengers     1.865672
Geht dich nix an              1.492537
Rate mal                      0.746269
Name: proportion, dtype: float64
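
The commented-out lines in the cell above hint at a manual route to the same percentages: divide the counts by the number of entries and multiply by 100. The following sketch compares both routes on a small, hypothetical `Series` (the toy answers are made up here and are not the poll data).

```python
import pandas as pd

# Hypothetical toy answers, not the poll data from above.
answers = pd.Series(["yes", "no", "yes", "yes", "no", "maybe", "yes", "no"])

manual = answers.value_counts().div(answers.size).mul(100)
builtin = answers.value_counts(normalize=True).mul(100)

print(builtin)
```

The two routes only agree when there are no missing values: `.value_counts()` drops `NaN` by default while `.size` counts them.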

What is the number of 'usable' answers?

In [84]:
(
    poll.str.lower()
    .loc[poll.str.endswith("ployed")]
    .value_counts()
)

# (
#     poll.str.lower()
#     .value_counts()
#     .loc[["unemployed", "employed"]]
# )

unemployed    126
employed      111
Name: count, dtype: int64
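
Matching on the suffix `"ployed"` works for this dataset because no unusable answer happens to end that way. An exact membership test with `.isin()` is stricter. The sketch below illustrates the difference on hypothetical answers (the entry `"Self-employed"` is made up here to expose it and does not occur in the poll).

```python
import pandas as pd

# Hypothetical answers chosen to expose the difference; not part of the poll data.
answers = pd.Series(["Employed", "unemployed", "Self-employed", "Rate mal"])
lowered = answers.str.lower()

by_suffix = lowered.str.endswith("ployed")
by_membership = lowered.isin(["employed", "unemployed"])

# Summing a boolean mask counts the True entries.
print(int(by_suffix.sum()), int(by_membership.sum()))
```

Which of the two is appropriate depends on whether answers such as `"Self-employed"` should count as usable.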

What is the number of 'unusable' answers? 

*Hint*: There are multiple free fields to fill in a solution because there are multiple ways to solve this.

In [91]:
(
    poll
    .loc[~poll.str.contains("ployed")]
    .size
    # .value_counts()
    # .sum()
)

31

In [93]:
(
    poll
    .loc[poll.transform(lambda x: "ployed" not in x)]
    .size
)

31
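
Counting the rows of a filtered `Series` is one option; summing the negated boolean mask or subtracting the number of matches from the total gives the same count. The sketch below shows these routes on hypothetical answers (made up here, not the poll data) and also illustrates that `.str.contains()` interprets its pattern as a regular expression by default.

```python
import pandas as pd

# Hypothetical answers, not the poll data from above.
answers = pd.Series(["employed", "Rate mal", "unemployed", "n.a."])

mask = answers.str.contains("ployed")

via_loc = answers.loc[~mask].size               # filter, then count rows
via_sum = int((~mask).sum())                    # count True entries of the negated mask
via_difference = answers.size - int(mask.sum()) # total minus matches

print(via_loc, via_sum, via_difference)

# `.str.contains()` treats the pattern as a regular expression by default,
# so "." matches any character unless regex=False is passed.
regex_hits = int(answers.str.contains(".").sum())
literal_hits = int(answers.str.contains(".", regex=False).sum())
print(regex_hits, literal_hits)
```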

### Cleaning

Now that you have an overview of how many usable and unusable answers you have received, it is time to bring the data into a cleaner form: Replace all entries you consider 'unusable' with `"unknown"`. Furthermore, change all 'usable' entries to lowercase. Finally, print the counts per (cleaned) category.

You are asked to solve this in 3 different ways, each built around one particular method (you will probably need other methods as well, but the named method must be used!).

1. Use the [`.where()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html) method.

In [104]:
(
    poll.str.lower()
    .where(lambda s: s.str.endswith("ployed"), "unknown")
    .value_counts()
)

unemployed    126
employed      111
unknown        31
Name: count, dtype: int64

In [107]:
(
    poll.str.lower()
    .where(lambda s: s.isin(["unemployed", "employed"]), "unknown")
    .value_counts()
)

unemployed    126
employed      111
unknown        31
Name: count, dtype: int64

2. Use the [`.transform()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.transform.html) *or* the [`.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method.

In [97]:
(
    poll.str.lower()
    .transform(lambda x: "unknown" if "ployed" not in x else x)
    .value_counts()
)

unemployed    126
employed      111
unknown        31
Name: count, dtype: int64

3. Use the [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html) method.

In [103]:
(
    poll.str.lower()
    .replace({
        answer: "unknown"
        for answer in poll.str.lower().unique()
        if ("ployed" not in answer)
    })
    .value_counts()
)

unemployed    126
employed      111
unknown        31
Name: count, dtype: int64

In [102]:
{
    answer: "unknown"
    for answer in poll.str.lower().unique()
    if ("ployed" not in answer)
}

{'having fun all day': 'unknown',
 'geht dich nix an': 'unknown',
 'working with the avengers': 'unknown',
 'rate mal': 'unknown'}
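
As a cross-check, the three cleaning strategies can be run side by side on a small example. The following sketch uses hypothetical answers and an explicit list `usable` (both introduced here for illustration, not taken from the poll) and confirms that `.where()`, `.transform()`, and `.replace()` produce the same cleaned values.

```python
import pandas as pd

# Hypothetical answers, not the poll data from above.
answers = pd.Series(["Employed", "Rate mal", "unemployed", "rate mal", "Having fun"])
lowered = answers.str.lower()
usable = ["employed", "unemployed"]

# Keep usable entries, replace everything else with "unknown".
via_where = lowered.where(lowered.isin(usable), "unknown")
via_transform = lowered.transform(lambda x: x if x in usable else "unknown")
via_replace = lowered.replace(
    {answer: "unknown" for answer in lowered.unique() if answer not in usable}
)

print(via_replace.value_counts())
```

Building the replacement dictionary from `.unique()` keeps it small: it holds one entry per distinct unusable answer rather than one per row.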