In [None]:
import pandas as pd
import numpy as np

# Operating on Data in Pandas

NumPy universal functions (ufuncs) work directly on Pandas objects, and they are useful there because Pandas automatically preserves index and column labels during the operation. For binary operations like addition, Pandas will also align data by index, which helps prevent errors when combining data from different sources.

## Ufuncs: Index Preservation

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

In [None]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:

In [None]:
print(ser)
print(np.exp(ser))

Or, for a slightly more complex calculation:

In [None]:
print(df)
print(np.sin(df * np.pi / 4))

Any unary ufunc in NumPy can be used in this manner.
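
As a small illustration of the point above, the following sketch applies a unary ufunc to a Series with string labels. The name `values` and its contents are made up for this example; the idea is only to show that the labels carry through unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with string labels instead of the default integer index
values = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# A unary ufunc is applied element-wise and returns a new Series
result = np.exp(values)

print(type(result).__name__)   # Series
print(list(result.index))      # ['a', 'b', 'c']
```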

## Ufuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices
in the process of performing the operation. This is very convenient when you are
working with incomplete data, as we’ll see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are combining two different data sources, and find only
the top three US states by area and the top three US states by population:

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

print(area)
print()
print(population)

Let’s see what happens when we divide these to compute the population density:

In [None]:
population / area

The resulting Series contains the union of indices of the two input Series, which we
could determine using the `union` method of these indices (the equivalent of a set union):

In [None]:
print(area.index)
print(population.index)
print(area.index.union(population.index))

When a binary operation is applied to two Pandas `Series` objects, any index label that appears in only one of them gets a `NaN` value at that position in the resulting `Series` object.

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
C = A + B

print(C)

If `NaN` is not desired, use the corresponding arithmetic method (here `add()`) instead of the operator; it has a parameter called `fill_value` specifying the value to substitute for an entry that is missing in one of the inputs.

In [None]:
A.add(B, fill_value=0)

### Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames:

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

In [None]:
A + B

Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the
associated object’s arithmetic method and pass any desired `fill_value` to be used in place
of missing entries. Here we’ll fill with the mean of all values in A (which we compute
by first stacking the rows of A):

In [None]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

Table 3-1 lists Python operators and their equivalent Pandas object methods.

| Python operator | Pandas method(s) |
|-----------------|------------------|
| ``+``           | ``add()``        |
| ``-``           | ``sub()``, ``subtract()`` |
| ``*``           | ``mul()``, ``multiply()`` |
| ``/``           | ``truediv()``, ``div()``, ``divide()`` |
| ``//``          | ``floordiv()``   |
| ``%``           | ``mod()``        |
| ``**``          | ``pow()``        |
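
To make the correspondence in the table concrete, here is a minimal sketch comparing a few operators with their method forms. It reuses the two small Series from the earlier alignment example; the variable names `same_sum`, `same_div`, and `filled` are just for illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The operator and its method form give the same aligned result
# (Series.equals treats NaNs in the same positions as equal)
same_sum = (A + B).equals(A.add(B))
same_div = (A / B).equals(A.div(B)) and (A / B).equals(A.truediv(B))
print(same_sum, same_div)   # True True

# Only the method form accepts fill_value
filled = A.sub(B, fill_value=0)
print(filled)               # 0: 2.0, 1: 3.0, 2: 3.0, 3: -5.0
```

The method forms are mainly worth reaching for when you need an extra argument such as `fill_value` or `axis`; otherwise the operators read more naturally.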

## Ufuncs: Operations Between DataFrame and Series

When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and one-dimensional
NumPy array. Consider one common operation, where we find the difference of a
two-dimensional array and one of its rows:

In [None]:
A = rng.randint(10, size=(3, 4))
A

In [None]:
A - A[0]

According to NumPy’s broadcasting rules (see “Computation on Arrays: Broadcasting”),
subtraction between a two-dimensional array and one of its rows is
applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [None]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:

In [None]:
df.subtract(df['R'], axis=0)

Note that these DataFrame/Series operations, like the operations discussed before,
will automatically align indices between the two elements:

In [None]:
halfrow = df.iloc[0, ::2]
halfrow

In [None]:
df - halfrow

This preservation and alignment of indices and columns means that operations on
data in Pandas will always maintain the data context, which prevents the types of silly
errors that might come up when you are working with heterogeneous and/or
misaligned data in raw NumPy arrays.
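
The kind of error meant here can be sketched with two made-up Series that hold values for the same labels in different orders. The names and numbers below are hypothetical; the point is the contrast between positional and label-based addition.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for the same three people, recorded in different orders
first = pd.Series([10, 20, 30], index=['ann', 'bob', 'cat'])
second = pd.Series([3, 1, 2], index=['cat', 'ann', 'bob'])

# Raw NumPy arrays add by position, silently pairing different people
by_position = first.to_numpy() + second.to_numpy()
print(by_position)   # [13 21 32]

# Pandas adds by label, so each person's values are matched up
by_label = first + second
print(by_label['ann'], by_label['bob'], by_label['cat'])   # 11 22 33
```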

## Other

A quick note on the `.idxmax()` method (and its counterpart, `.idxmin()`): it's very useful for finding the data associated with a maximum or minimum value, whether in a Series or a DataFrame.

For example, given a Series with meaningful index labels, `.idxmax()` returns the index label of the maximum value (the first one, if the maximum occurs more than once).
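
A minimal sketch of the Series case, using a made-up `sales` Series (the name and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: sales figures labeled by region
sales = pd.Series([120, 340, 90], index=['north', 'south', 'east'])

top_region = sales.idxmax()   # label of the largest value
low_region = sales.idxmin()   # label of the smallest value

print(top_region)   # south
print(low_region)   # east
```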

For DataFrames, `.idxmax()` can still be used to find the location of the row with a max in a specific column (via `df['col_name'].idxmax()`), which can then be paired with `.loc` to retrieve that row.
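
The pairing with `.loc` might look like the sketch below. The frame `example` and the names `best_label` and `best_row` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Hypothetical frame with labeled rows
example = pd.DataFrame({
    'col_1': [10, 40, 5, 25],
    'col_2': [50, 30, 95, 15]
}, index=['w', 'x', 'y', 'z'])

# Label of the row holding the maximum of col_1
best_label = example['col_1'].idxmax()

# Use that label to pull out the whole row as a Series
best_row = example.loc[best_label]

print(best_label)          # x
print(best_row['col_2'])   # 30
```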

When called directly on a DataFrame, `.idxmax()` returns a Series. Depending on the axis (the default, `axis=0`, works down the rows of each column), that Series holds either the index labels or the column labels where the maximums occur.

In [None]:
df = pd.DataFrame({
    'col_1': [10, 40, 5, 25],
    'col_2': [50, 30, 95, 15]
}, index=['w', 'x', 'y', 'z'])

print("Original DataFrame:")
print(df)

# Find the index of the max value for each column
max_indices_cols = df.idxmax() # axis=0 is the default

print("\nIndex of max value in each column:")
print(max_indices_cols)
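
As a complement to the cell above, the same call with `axis=1` works across the columns of each row and returns column labels instead. This sketch rebuilds the same frame so that it runs on its own; `max_cols_per_row` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [10, 40, 5, 25],
    'col_2': [50, 30, 95, 15]
}, index=['w', 'x', 'y', 'z'])

# For each row, find the column label holding that row's maximum
max_cols_per_row = df.idxmax(axis=1)

print(max_cols_per_row)   # w: col_2, x: col_1, y: col_2, z: col_1
```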