## Arithmetic with Series and DataFrames

pandas uses **broadcasting** when we apply an arithmetic operation to a Series or DataFrame: the operation is applied to every element. For example, here is the RT data from two participants that we loaded earlier:

In [1]:
import pandas as pd

filenames = ['s1.csv', 's2.csv', 's3.csv']
df_list = [pd.read_csv(f) for f in filenames]
df = df_list[0]
df = df.rename(columns={'RT':'s1'})
df['s2'] = df_list[1]['RT']
df

,trial,s1,s2
0,1,0.508971,0.433094
1,2,0.389858,0.392526
2,3,0.404175,0.396831
3,4,0.26952,0.417988
4,5,0.437765,0.37181
5,6,0.368142,0.659228
6,7,0.400544,0.411051
7,8,0.335198,0.40958
8,9,0.341722,0.486828
9,10,0.439583,0.468912


RT is in seconds, but often we want to report RT in milliseconds. To do this, we could multiply the entire DataFrame by 1000 (since there are 1000 ms in 1 s), but this would apply to all columns, including trial number:

In [2]:
df * 1000

,trial,s1,s2
0,1000,508.971072,433.093893
1,2000,389.857974,392.526034
2,3000,404.175466,396.830804
3,4000,269.520309,417.987737
4,5000,437.764713,371.810078
5,6000,368.141756,659.228422
6,7000,400.544278,411.051235
7,8000,335.198066,409.580168
8,9000,341.722042,486.828076
9,10000,439.583357,468.912134


So instead, we can apply the math only to specified columns:

In [3]:
df[['s1','s2']] * 1000

,s1,s2
0,508.971072,433.093893
1,389.857974,392.526034
2,404.175466,396.830804
3,269.520309,417.987737
4,437.764713,371.810078
5,368.141756,659.228422
6,400.544278,411.051235
7,335.198066,409.580168
8,341.722042,486.828076
9,439.583357,468.912134
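Note that multiplying a selection of columns returns a new DataFrame and leaves `df` unchanged. A minimal sketch of keeping the converted values, using a small made-up DataFrame (`demo`, `rt_cols`, and the numbers are illustrative, not the participant data above):

```python
import pandas as pd

# Made-up RT values in seconds, for illustration only
demo = pd.DataFrame({
    'trial': [1, 2, 3],
    's1': [0.5, 0.25, 0.75],
    's2': [0.125, 0.5, 0.25],
})

rt_cols = ['s1', 's2']

# Multiplying a selection returns a new DataFrame; demo itself is unchanged
rt_ms = demo[rt_cols] * 1000

# Assign the result back to keep the converted values alongside 'trial'
demo[rt_cols] = demo[rt_cols] * 1000
print(demo)
```

After the assignment, `trial` still counts 1, 2, 3 while `s1` and `s2` are in milliseconds.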


pandas also provides methods for applying some common arithmetic operations to DataFrames. These include simple operations like addition (`.add()`) and multiplication (`.multiply()`), as well as more complex "convenience functions" like `.mean()`:

In [4]:
df[['s1','s2']].mean()

s1    0.389548
s2    0.444785
dtype: float64

Note that this produces two values, the mean of each column. While the default is to apply the function column-wise, the `axis` argument lets us compute the mean of each row instead:

In [5]:
df[['s1','s2']].mean(axis=1)

0    0.471032
1    0.391192
2    0.400503
3    0.343754
4    0.404787
5    0.513685
6    0.405798
7    0.372389
8    0.414275
9    0.454248
dtype: float64
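To see how the arithmetic methods relate to the operators, and what `axis` does, here is a small sketch on made-up values (the `demo` DataFrame is hypothetical and chosen so the means are easy to check by hand):

```python
import pandas as pd

# Made-up RT values in seconds, for illustration only
demo = pd.DataFrame({
    's1': [0.5, 0.25, 0.75],
    's2': [0.25, 0.75, 0.5],
})

# .multiply() with a scalar gives the same result as the * operator
via_method = demo.multiply(1000)
via_operator = demo * 1000
print(via_method.equals(via_operator))

# axis=0 (the default) collapses the rows: one mean per column
col_means = demo.mean()

# axis=1 collapses the columns: one mean per row
row_means = demo.mean(axis=1)
print(col_means)
print(row_means)
```

The method forms are mainly useful when we need their extra arguments, which the operators cannot take.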

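By default, aggregation methods like `.mean()` skip any `NaN` values. Element-wise methods like `.add()` behave differently: a `NaN` in either input produces a `NaN` in the result, unless we tell pandas to replace `NaN`s with some other value. For example, let's reload the rat data, which had `NaN` values: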

In [6]:
import pandas as pd

maze_files = ['maze_data_1.csv', 'maze_data_2.csv', 'maze_data_3.csv']

days_list = [['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8'], 
             ['day1', 'day2', 'day3', 'day5', 'day6', 'day7', 'day8'],
             ['day1', 'day2', 'day4', 'day5', 'day6', 'day7']
            ]

maze_list = []
for counter, filename in enumerate(maze_files):
    maze_list.append(pd.read_csv(filename))
    maze_list[counter]['days'] = days_list[counter]
    maze_list[counter] = maze_list[counter].set_index('days')

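# Combine the three rats into one DataFrame, one column per rat.
# Assignment aligns on the 'days' index, so days a rat missed become NaN.
rat_df = maze_list[0]
rat_df = rat_df.rename(columns={'maze_time':'r1'})
rat_df['r2'] = maze_list[1]['maze_time']
rat_df['r3'] = maze_list[2]['maze_time']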
rat_df

days,r1,r2,r3
day1,6.0,7.32,2.55
day2,7.56,4.12,4.0
day3,2.17,6.28,
day4,2.39,,6.0
day5,5.6,4.2,8.38
day6,8.94,2.11,6.53
day7,2.95,4.98,3.01
day8,3.3,7.44,


If we wanted to sum the data from rats 1 and 2 for each day, we would get `NaN` results if any inputs were `NaN` (like day4):

In [7]:
rat_df['r1'].add(rat_df['r2'])

days
day1    13.32
day2    11.68
day3     8.45
day4      NaN
day5     9.80
day6    11.05
day7     7.93
day8    10.74
dtype: float64

But we could use `fill_value=0` to tell pandas to treat these as zeros instead:

In [8]:
rat_df['r1'].add(rat_df['r2'], fill_value=0)

days
day1    13.32
day2    11.68
day3     8.45
day4     2.39
day5     9.80
day6    11.05
day7     7.93
day8    10.74
dtype: float64
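The sketch below contrasts the two kinds of `NaN` handling on made-up Series (`a` and `b` are hypothetical, not the rat data). It also shows one detail worth knowing: `fill_value` only helps when one side is missing, so a position that is `NaN` in both inputs stays `NaN`.

```python
import numpy as np
import pandas as pd

days = ['day1', 'day2', 'day3', 'day4']

# Made-up maze times; NaN marks a missing session
a = pd.Series([1.0, np.nan, 3.0, np.nan], index=days)
b = pd.Series([10.0, 20.0, np.nan, np.nan], index=days)

# Element-wise: NaN in either input gives NaN in the result
plain = a.add(b)

# fill_value=0 substitutes 0 when only one side is missing;
# day4 is missing in both inputs, so it stays NaN
filled = a.add(b, fill_value=0)

# Aggregations skip NaN by default; skipna=False turns that off
mean_skip = a.mean()                # mean of 1.0 and 3.0
mean_strict = a.mean(skipna=False)  # NaN, because a contains NaN

print(plain)
print(filled)
print(mean_skip, mean_strict)
```

Whether zero is a sensible replacement depends on the data: for maze times, treating a missed day as 0 seconds makes the sum look like a very fast run, so it should be used with care.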