# Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and
with more sophisticated operations (trigonometric functions, exponential and logarithmic
functions, etc.). Pandas inherits much of this functionality from NumPy, and
the ufuncs (Universal Functions) are key to this.

## Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects. Let’s start by defining a simple Series and
DataFrame on which to demonstrate this:

In [82]:
import numpy as np 
import pandas as pd 

In [83]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [84]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas
object with the indices preserved:

In [85]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [86]:
# Or, for a slightly more complex calculation:
np.sin(df * np.pi / 4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
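
The default integer index used above makes the preservation easy to overlook. The following minimal sketch repeats the idea with string labels; the `labeled` Series and its values are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: string labels make the preserved index easy to spot
labeled = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

result = np.exp(labeled)

# The ufunc returns a Series rather than a bare ndarray,
# and the labels of the input carry over unchanged
print(type(result).__name__)
print(list(result.index))
```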


## Ufuncs: Index Alignment
For binary operations on two Series or DataFrame objects, Pandas will align indices
in the process of performing the operation. This is very convenient when you are
working with incomplete data, as we’ll see in some of the examples that follow.

### Index alignment in Series
As an example, suppose we are combining two different data sources, and find only
the top three products by revenue and the top three products by volume:

In [87]:
revenue = pd.Series({'Bananas': 99999233, 'Tomatoes': 869995662,
'Onions': 742443967}, name='revenue')
volume = pd.Series({'Bananas': 3833252, 'Tomatoes': 26448193,
'Onions': 19651127}, name='volume')
# Let’s see what happens when we divide these to compute the unit price:
revenue / volume

Bananas     26.087310
Tomatoes    32.894333
Onions      37.781241
dtype: float64

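The resulting Series contains the union of the indices of the two inputs, which we
can determine with the union method of the index objects (the | operator, which older
Pandas versions treated as a set union, is an element-wise logical operation in recent versions):

In [88]:
revenue.index.union(volume.index)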

Index(['Bananas', 'Tomatoes', 'Onions'], dtype='object')
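
Because the two Series above share the same labels in the same order, the example cannot show that alignment works by label rather than by position. The sketch below, using two small made-up Series `x` and `y`, illustrates the difference under that assumption.

```python
import pandas as pd

# Hypothetical data: the same labels, stored in a different order
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([10, 20], index=['b', 'a'])

# Pandas pairs 'a' with 'a' and 'b' with 'b' before adding
total = x + y
print(total)

# Adding the underlying arrays pairs values by position instead
positional = x.to_numpy() + y.to_numpy()
print(positional)
```

Here the label-aligned sum gives 21 for 'a' and 12 for 'b', whereas the positional sum gives 11 and 22.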

Any item for which one or the other does not have an entry is marked with NaN, or
“Not a Number,” which is how Pandas marks missing data:

In [89]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
# arithmetic operations
print(A + B)
print(A.add(B))

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64


In [90]:
print(A - B)
print(A.subtract(B))

0    NaN
1    3.0
2    3.0
3    NaN
dtype: float64
0    NaN
1    3.0
2    3.0
3    NaN
dtype: float64


If using NaN values is not the desired behavior, we can modify the fill value using
appropriate object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but allows optional explicit specification of the fill value
for any elements in A or B that might be missing:

In [91]:
# filling the NaN value with appropriate object method
A.add(B, fill_value = 0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
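
One detail worth checking is what happens when a value is missing on both sides. In the sketch below (the Series `p` and `q` are hypothetical), the fill value replaces a missing entry only when the other operand has a value at that label; where both are missing, the result stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label 1 is NaN in both Series, label 2 exists only in q
p = pd.Series([1.0, np.nan], index=[0, 1])
q = pd.Series([np.nan, np.nan, 5.0], index=[0, 1, 2])

filled = p.add(q, fill_value=0)

# label 0: only q is missing, so it is treated as 0
# label 1: missing on both sides, so the result is still NaN
# label 2: absent from p, so p is treated as 0
print(filled)
```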

### Index alignment in DataFrame
A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames:

In [92]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [93]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [94]:
# addition
print(A + B)
# we can fill the NaN values with 0
print(A.add(B, fill_value = 0))

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
      A     B    C
0   1.0  15.0  9.0
1  13.0   6.0  0.0
2   2.0   9.0  6.0


In [95]:
# subtraction
# the operator and the method forms are equivalent
print(A - B)
# or
print(A.sub(B))
# subtract is an alias of sub; here missing entries are filled with 0 first
print(A.subtract(B, fill_value = 0))

     A    B   C
0  1.0  7.0 NaN
1 -3.0 -4.0 NaN
2  NaN  NaN NaN
     A    B   C
0  1.0  7.0 NaN
1 -3.0 -4.0 NaN
2  NaN  NaN NaN
     A    B    C
0  1.0  7.0 -9.0
1 -3.0 -4.0  0.0
2 -2.0 -9.0 -6.0


In [96]:
# multiplication
print(A * B)
# or
print(A.mul(B))
# multiply is an alias of mul; here missing entries are filled with 0 first
print(A.multiply(B, fill_value = 0))

      A     B   C
0   0.0  44.0 NaN
1  40.0   5.0 NaN
2   NaN   NaN NaN
      A     B   C
0   0.0  44.0 NaN
1  40.0   5.0 NaN
2   NaN   NaN NaN
      A     B    C
0   0.0  44.0  0.0
1  40.0   5.0  0.0
2   0.0   0.0  0.0


In [97]:
# True Division
print(A / B)
# or
print(A.truediv(B))
# or
print(A.div(B))
# or
print(A.divide(B))

       A     B   C
0    inf  2.75 NaN
1  0.625  0.20 NaN
2    NaN   NaN NaN
       A     B   C
0    inf  2.75 NaN
1  0.625  0.20 NaN
2    NaN   NaN NaN
       A     B   C
0    inf  2.75 NaN
1  0.625  0.20 NaN
2    NaN   NaN NaN
       A     B   C
0    inf  2.75 NaN
1  0.625  0.20 NaN
2    NaN   NaN NaN


In [98]:
# floor division
print(A // B)
# or
print(A.floordiv(B))

     A    B   C
0  inf  2.0 NaN
1  0.0  0.0 NaN
2  NaN  NaN NaN
     A    B   C
0  inf  2.0 NaN
1  0.0  0.0 NaN
2  NaN  NaN NaN


In [99]:
# modulo
print(A % B)
# or
print(A.mod(B))

     A    B   C
0  NaN  3.0 NaN
1  5.0  1.0 NaN
2  NaN  NaN NaN
     A    B   C
0  NaN  3.0 NaN
1  5.0  1.0 NaN
2  NaN  NaN NaN


In [100]:
# power
print(A ** B)
# or
print(A.pow(B))

          A        B    C
0       1.0  14641.0  NaN
1  390625.0      1.0  1.0
2       NaN      NaN  NaN
          A        B    C
0       1.0  14641.0  NaN
1  390625.0      1.0  1.0
2       NaN      NaN  NaN


Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted.
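
The ordering behavior can be checked on a small case. The frames `left` and `right` below are hypothetical and exist only to show that columns are matched by name and that the resulting column labels come out sorted.

```python
import pandas as pd

# Hypothetical frames whose columns appear in different orders
left = pd.DataFrame([[1, 2]], columns=['x', 'y'])
right = pd.DataFrame([[10, 20, 30]], columns=['z', 'y', 'x'])

summed = left + right

# 'x' is paired with 'x' and 'y' with 'y', regardless of position;
# 'z' exists only in right, so it becomes NaN
print(summed)
print(list(summed.columns))
```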

In [101]:
# fill with the mean of all values in A (computed by first stacking the rows of A)
fill = A.stack().mean()
A.add(B, fill_value = fill)

      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5


## Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and one-dimensional
NumPy array. Consider one common operation, where we find the difference of a
two-dimensional array and one of its rows:

In [102]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [103]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

In [104]:
A[0]

array([3, 8, 2, 4])

According to NumPy’s broadcasting rules, subtraction between a two-dimensional array and
one of its rows is applied row-wise. In Pandas, the convention similarly operates row-wise
by default:

In [105]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4
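
As a quick check that the row-wise Pandas result agrees with NumPy broadcasting, the sketch below rebuilds the same values by hand (the names `arr`, `frame`, and `by_rows` are chosen for this illustration) and compares the two results.

```python
import numpy as np
import pandas as pd

# Same values as the array A shown above
arr = np.array([[3, 8, 2, 4],
                [2, 6, 4, 8],
                [6, 1, 3, 8]])
frame = pd.DataFrame(arr, columns=list('QRST'))

# The Series index (Q, R, S, T) is matched against the DataFrame columns,
# and the subtraction is then applied to every row
by_rows = frame - frame.iloc[0]

print(np.array_equal(by_rows.to_numpy(), arr - arr[0]))
```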


If you would instead like to operate column-wise, you can use the object methods
while specifying the axis keyword:

In [106]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
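
For comparison, the column-wise operation corresponds in NumPy to subtracting a column that has been kept two-dimensional so that it broadcasts across the columns. The sketch below rebuilds the same values (the variable names are illustrative only) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Same values as the array A shown above
arr = np.array([[3, 8, 2, 4],
                [2, 6, 4, 8],
                [6, 1, 3, 8]])
frame = pd.DataFrame(arr, columns=list('QRST'))

# axis=0 matches the Series index against the row labels instead of the columns
by_cols = frame.subtract(frame['R'], axis=0)

# NumPy counterpart: arr[:, [1]] has shape (3, 1), so it broadcasts across columns
expected = arr - arr[:, [1]]
print(np.array_equal(by_cols.to_numpy(), expected))
```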


Note that these DataFrame/Series operations, like the operations discussed before,
will automatically align indices between the two elements:

In [107]:
halfrow = df.iloc[0, ::2]
halfrow

Q    3
S    2
Name: 0, dtype: int32

In [108]:
df.subtract(halfrow)

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN
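
If the NaN columns are not wanted, one possible approach (assuming that zero is a sensible default for the missing labels) is to reindex the Series against the DataFrame columns before subtracting. The sketch below rebuilds the data by hand; `padded` is a name introduced here for illustration.

```python
import pandas as pd

# Same values as the DataFrame df used above
frame = pd.DataFrame([[3, 8, 2, 4],
                      [2, 6, 4, 8],
                      [6, 1, 3, 8]], columns=list('QRST'))
halfrow = frame.iloc[0, ::2]  # only the labels Q and S

# Give the Series an entry for every column; labels it lacks become 0
padded = halfrow.reindex(frame.columns, fill_value=0)

# Columns R and T are now left unchanged instead of becoming NaN
result = frame.subtract(padded)
print(result)
```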


This preservation and alignment of indices and columns means that operations on
data in Pandas will always maintain the data context, which prevents the types of silly
errors that might come up when you are working with heterogeneous and/or misaligned
data in raw NumPy arrays.