# Operating on Data in Pandas


## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Let's start by defining a simple Series and DataFrame on which to demonstrate this:

In [2]:

import numpy as np
import pandas as pd

In [4]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [10]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  5  1  9  1
1  9  3  7  6
2  8  7  4  1


If we apply a NumPy ufunc to either of these objects, the result will be another Pandas object with the indices preserved:

In [11]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

Or, for a slightly more complex calculation:

In [12]:
np.sin(df * np.pi / 4)

               A          B             C         D
0     -0.7071068   0.707107     0.7071068  0.707107
1      0.7071068   0.707107    -0.7071068      -1.0
2  -2.449294e-16  -0.707107  1.224647e-16  0.707107
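As a quick check of the index-preservation behavior, the following sketch applies a ufunc to a Series with a non-default index. The labels 'a', 'b', 'c' and the variable names are made up for illustration and do not come from the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels (illustrative values only)
labeled = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# Applying a NumPy ufunc returns a Series rather than a bare array
roots = np.sqrt(labeled)

print(type(roots).__name__)                # the result is still a pandas object
print(roots.index.equals(labeled.index))   # the labels carry over unchanged
```

The same check should work for a DataFrame, where both the row index and the column labels are carried over.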


## Ufuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:

In [13]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')


Let's see what happens when we divide these to compute the population density:

In [14]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting array contains the *union* of indices of the two input arrays, which can be computed with the union method of the Index (older Pandas versions also accepted the set operator | here, but recent versions no longer treat it as a set union):

In [15]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in Handling Missing Data). This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [16]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
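One way to picture this behavior is to align the two Series explicitly before adding them. The sketch below uses Series.align, which by default performs an outer join on the two indices; it is meant as an illustration of the observable result, not as a description of how Pandas implements the operators internally.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# align() reindexes both Series to the union of their indices,
# inserting NaN wherever a label is missing from one of the inputs
A_aligned, B_aligned = A.align(B)

print(A_aligned)
print(B_aligned)

# Adding the explicitly aligned Series gives the same result as A + B
print((A_aligned + B_aligned).equals(A + B))
```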

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for an element in A or B that might be missing.

In [17]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
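A detail worth keeping in mind is that fill_value only substitutes for a value that is missing on one side. The sketch below uses two made-up Series (left and right are illustrative names, not from the text) in which label 2 is NaN in both inputs; in that case the result at that label stays NaN.

```python
import numpy as np
import pandas as pd

# Illustrative inputs: label 2 is NaN in both Series
left = pd.Series([2, 4, np.nan], index=[0, 1, 2])
right = pd.Series([1, np.nan, 5], index=[1, 2, 3])

result = left.add(right, fill_value=0)

# Label 0 exists only in left and label 3 only in right, so 0 is filled in.
# Label 2 is missing on both sides, so it remains NaN.
print(result)
```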

### Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:

In [18]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

    A   B
0  14  12
1   8  14


In [20]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  0  8  6
1  8  7  0
2  7  7  2


In [21]:
A + B

      A     B   C
0  22.0  12.0 NaN
1  15.0  22.0 NaN
2   NaN   NaN NaN


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with *Series*, we can use the associated object's arithmetic method and pass any desired *fill_value* to be used in place of missing entries. Here we'll fill with the mean of all values in A (computed by first stacking the rows of A):

In [22]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

      A     B     C
0  22.0  12.0  18.0
1  15.0  22.0  12.0
2  19.0  19.0  14.0


In [26]:
# A.stack() pivots the column labels into an inner index level,
# returning a Series with one value per (row, column) pair
print(A.stack())
print(A)

0  A    14
   B    12
1  A     8
   B    14
dtype: int32
    A   B
0  14  12
1   8  14
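To make the role of stack() in the fill computation concrete, here is a small sketch that rebuilds the DataFrame A with the values displayed above and compares the stacked mean with the mean of the underlying NumPy array. The name same is introduced only for this illustration.

```python
import pandas as pd

# Same values as the DataFrame A displayed above
A = pd.DataFrame([[14, 12], [8, 14]], columns=list('AB'))

stacked = A.stack()           # Series indexed by (row label, column label)
fill = stacked.mean()         # mean over all four values
same = A.to_numpy().mean()    # equivalent: mean of the flattened array

print(stacked.index.tolist())
print(fill, same)
```

Stacking is just one way to reduce over every element; taking the mean of the underlying array is an equivalent alternative for purely numeric data.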


## Ufuncs: Operations Between DataFrame & Series

When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array. Consider one common operation, where we find the difference between a two-dimensional array and one of its rows:

In [115]:
A = rng.randint(10, size=(3, 4))
A

array([[6, 2, 5, 1],
       [9, 8, 4, 5],
       [3, 9, 6, 8]])

In [116]:
A - A[0]

array([[ 0,  0,  0,  0],
       [ 3,  6, -1,  4],
       [-3,  7,  1,  7]])

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [117]:
df = pd.DataFrame(A, columns=list('QRST'))

df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  3  6 -1  4
2 -3  7  1  7
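The correspondence with NumPy broadcasting can be checked directly. The sketch below rebuilds the array shown above under the name arr (chosen here to avoid reusing A) and compares the Pandas result with the NumPy one.

```python
import numpy as np
import pandas as pd

arr = np.array([[6, 2, 5, 1],
                [9, 8, 4, 5],
                [3, 9, 6, 8]])
df = pd.DataFrame(arr, columns=list('QRST'))

# df.iloc[0] is a Series indexed by the column labels Q, R, S, T;
# the subtraction matches those labels against the columns of df
row_diff = df - df.iloc[0]

# The values agree with NumPy broadcasting the first row across all rows
print(np.array_equal(row_diff.to_numpy(), arr - arr[0]))
```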


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:


In [118]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0  4  0  3 -1
1  1  0 -4 -3
2 -6  0 -3 -1
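For comparison, the column-wise operation corresponds in NumPy to broadcasting a column of shape (3, 1) across the array. The following sketch, again using the illustrative name arr for the array shown earlier, checks that the two agree; sub is used here as the shorter alias of subtract.

```python
import numpy as np
import pandas as pd

arr = np.array([[6, 2, 5, 1],
                [9, 8, 4, 5],
                [3, 9, 6, 8]])
df = pd.DataFrame(arr, columns=list('QRST'))

# axis=0 matches the Series index against the row index of df
col_diff = df.sub(df['R'], axis=0)

# NumPy equivalent: keep column 1 as a (3, 1) array so it
# broadcasts across the columns instead of across the rows
expected = arr - arr[:, [1]]

print(np.array_equal(col_diff.to_numpy(), expected))
```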


In [119]:
df

   Q  R  S  T
0  6  2  5  1
1  9  8  4  5
2  3  9  6  8


Note that these DataFrame/Series operations, like the operations discussed above, will automatically align indices between the two elements:

In [122]:
halfrow = df.iloc[0, ::2]

halfrow

Q    6
S    5
Name: 0, dtype: int32

In [123]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  3.0 NaN -1.0 NaN
2 -3.0 NaN  1.0 NaN
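If the NaN columns are not wanted, one possible approach is to expand the Series to all of the DataFrame's columns before subtracting. The sketch below does this with Series.reindex; the choice of 0 as the fill value and the name fullrow are assumptions made for illustration, and whether 0 is appropriate depends on the data.

```python
import pandas as pd

df = pd.DataFrame([[6, 2, 5, 1],
                   [9, 8, 4, 5],
                   [3, 9, 6, 8]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]    # only columns Q and S

# Expand halfrow to every column of df, using 0 for the missing labels
fullrow = halfrow.reindex(df.columns, fill_value=0)

result = df - fullrow
print(result)
```

With this fill, columns R and T are left unchanged by the subtraction, while Q and S are offset by the first row as before.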


This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the kinds of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.