In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.DataFrame(np.random.rand(20,5))
data.shape

(20, 5)

# Regular for loop
A Python for loop iterates directly over the items in a container, rather than counting through index positions like in MATLAB. Here `data[0]` selects the column labeled 0, so if I want to go through every value in that first column I can just iterate over that one column like so:

In [3]:
for n in data[0]:
    print(n)

0.27439672119826375
0.09699903202479743
0.410812247697549
0.7280063989094548
0.7321530215856412
0.3506980105353622
0.4032623977695008
0.323190410844836
0.17569811256108792
0.1730873834636002
0.7227471411077043
0.22582535204625487
0.5211345406718816
0.26293754427879
0.28750634130213704
0.4103534797448758
0.14747431013587187
0.07700007841573653
0.6879804172838697
0.4873982891477249


For reference, here's what that column of data looks like (first 10 rows shown):

In [4]:
data[[0]]

Unnamed: 0,0
0,0.274397
1,0.096999
2,0.410812
3,0.728006
4,0.732153
5,0.350698
6,0.403262
7,0.32319
8,0.175698
9,0.173087
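
To make the label-versus-position point concrete, here is a minimal sketch. The frame `demo` and its column names `"a"` and `"b"` are made up for illustration (they are not part of the notebook's `data`), chosen so that labels and positions are visibly different.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame with string labels so labels and positions differ
demo = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# Looping over a DataFrame yields its column labels, not its rows
labels = [label for label in demo]

# Looping over one column (a Series) yields the values themselves
values = [value for value in demo["a"]]

# Label-based and position-based selection reach the same column here
by_label = demo["a"]
by_position = demo.iloc[:, 0]

print(labels)                        # ['a', 'b']
print(values)                        # [0, 2, 4]
print(by_label.equals(by_position))  # True
```

In the notebook's `data` the labels happen to be 0 through 4, so `data[0]` looks like positional indexing even though it is a label lookup.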


Say I also want to keep a running count of where I am. I can wrap my iterable (aka the thing I want to loop through) in the enumerate() function, which hands back a (position, value) pair on each pass:

In [5]:
for idx, n in enumerate(data[0]):
    print(idx, n)

0 0.27439672119826375
1 0.09699903202479743
2 0.410812247697549
3 0.7280063989094548
4 0.7321530215856412
5 0.3506980105353622
6 0.4032623977695008
7 0.323190410844836
8 0.17569811256108792
9 0.1730873834636002
10 0.7227471411077043
11 0.22582535204625487
12 0.5211345406718816
13 0.26293754427879
14 0.28750634130213704
15 0.4103534797448758
16 0.14747431013587187
17 0.07700007841573653
18 0.6879804172838697
19 0.4873982891477249
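
One thing worth knowing: enumerate() counts positions from 0, which only matches the pandas index when that index is also 0, 1, 2, and so on. The sketch below uses a made-up Series `s` with index labels 10, 20, 30 (an assumption for illustration) to show the difference between enumerate() and `Series.items()`.

```python
import pandas as pd

# Hypothetical Series whose index labels are not 0, 1, 2
s = pd.Series([0.5, 0.1, 0.9], index=[10, 20, 30])

# enumerate() counts positions starting from 0
positions = [pos for pos, value in enumerate(s)]

# Series.items() yields (index label, value) pairs
index_labels = [label for label, value in s.items()]

print(positions)     # [0, 1, 2]
print(index_labels)  # [10, 20, 30]
```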


Say I wanted to loop through more than one column. In MATLAB, you might do a nested for loop. You can do that here too:

In [6]:
# data.columns holds the column labels, so looping over it visits each column.
# Here it is a RangeIndex, which works like range(): it runs from start up to (but not including) stop
data.columns

RangeIndex(start=0, stop=5, step=1)

In [7]:
for col in data.columns:
    print()
    print('column:')
    print(col)
    print()
    for n in data[col]:
        print(n)


column:
0

0.27439672119826375
0.09699903202479743
0.410812247697549
0.7280063989094548
0.7321530215856412
0.3506980105353622
0.4032623977695008
0.323190410844836
0.17569811256108792
0.1730873834636002
0.7227471411077043
0.22582535204625487
0.5211345406718816
0.26293754427879
0.28750634130213704
0.4103534797448758
0.14747431013587187
0.07700007841573653
0.6879804172838697
0.4873982891477249

column:
1

0.05896467159868268
0.09417369971035672
0.9184360355685851
0.1490431727404673
0.6371796211221116
0.7216088274402561
0.07559126755849266
0.4752673849978929
0.3579600045293301
0.8636978433398043
0.32476661914621574
0.19870832433171048
0.8730462675197428
0.39966221555532777
0.786132282261481
0.6823426728494734
0.4847525108586156
0.09713816736123804
0.6543540376932752
0.03758301993055446

column:
2

0.7014469124042736
0.5323515106423637
0.23406088186041607
0.8508128680241287
0.3432710000314979
0.9431699956232313
0.8680904332447513
0.24530063933228974
0.8780630613984854
0.857170404787786
0.7

For dataframe reference:

In [8]:
data

Unnamed: 0,0,1,2,3,4
0,0.274397,0.058965,0.701447,0.425146,0.987868
1,0.096999,0.094174,0.532352,0.460996,0.256847
2,0.410812,0.918436,0.234061,0.160211,0.065362
3,0.728006,0.149043,0.850813,0.663227,0.888791
4,0.732153,0.63718,0.343271,0.726732,0.742553
5,0.350698,0.721609,0.94317,0.272385,0.550245
6,0.403262,0.075591,0.86809,0.606123,0.483321
7,0.32319,0.475267,0.245301,0.472825,0.043853
8,0.175698,0.35796,0.878063,0.675739,0.302277
9,0.173087,0.863698,0.85717,0.18636,0.150995
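
As a variation on the nested loop, `DataFrame.items()` hands back each column label together with the column itself, so the outer loop does not need a separate `data[col]` lookup. This is only a sketch on a small made-up frame called `demo`, not the notebook's `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `data`
demo = pd.DataFrame(np.arange(6).reshape(3, 2))

# DataFrame.items() yields (column label, column as a Series) pairs,
# so the outer loop gets the label and the values together
column_totals = {}
for col, series in demo.items():
    total = 0
    for n in series:  # inner loop over the values in this column
        total += n
    column_totals[col] = total

print(column_totals)  # {0: 6, 1: 9}
```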


# Comprehensions

Since Python makes it easy to wrap logic in functions, you may want to compute something more involved for each point in your data. Say I want to take the mean of all the other columns in a row and express each value in column 0 as a percent difference from that mean (kept as a fraction here, so 0.19 means 19%). Written directly in a loop over column 0, that looks like this:

In [9]:
for idx, n in enumerate(data[0]):
    # mean of row not in column 0
    row_mean = data.iloc[idx].loc[data.columns != 0].mean()
    
    # percent difference from data point n
    print((n - row_mean)/row_mean)

-0.49499692510962817
-0.7113915688377397
0.19242791090244812
0.14113235132255442
0.19548095897416468
-0.43604269203319795
-0.20661593435243344
0.04487070939646801
-0.6825744975328332
-0.66361784935939
0.1604160259404509
-0.49075193399093087
-0.19874650318089937
0.1777160794343758
-0.4306631554393492
-0.2702646863921165
-0.48381375984490643
-0.8179209475311235
0.8231506697440838
0.6438145409637492


I could just turn the body of the for loop into a function like this:

In [10]:
def percent_diff(idx, n, col):
    # mean of row not in column col
    row_mean = data.iloc[idx].loc[data.columns != col].mean()
    
    # percent difference from data point n
    return (n - row_mean)/row_mean

In [11]:
# So if we want col 0, first value
n = data[0][0]
percent_diff(0, n, 0)

-0.49499692510962817

Then I could use this in the for loop like this:

In [12]:
for idx, n in enumerate(data[0]):
    # function does all the work for us now
    print(percent_diff(idx, n, 0))

-0.49499692510962817
-0.7113915688377397
0.19242791090244812
0.14113235132255442
0.19548095897416468
-0.43604269203319795
-0.20661593435243344
0.04487070939646801
-0.6825744975328332
-0.66361784935939
0.1604160259404509
-0.49075193399093087
-0.19874650318089937
0.1777160794343758
-0.4306631554393492
-0.2702646863921165
-0.48381375984490643
-0.8179209475311235
0.8231506697440838
0.6438145409637492


So, alternatively, if you have a function that returns one value per item, like above, you can use a list comprehension to build the whole list in one expression. It is more compact, and it is usually a bit faster than appending to a list inside a for loop. It basically just takes the for statement you normally write and puts it at the end. Then you take your function call and write that in the front. Sort of like:
```python
[do_stuff(x) for x in values]
```
like so:

In [13]:
[percent_diff(idx, n, 0) for idx, n in enumerate(data[0])]

[-0.49499692510962817,
 -0.7113915688377397,
 0.19242791090244812,
 0.14113235132255442,
 0.19548095897416468,
 -0.43604269203319795,
 -0.20661593435243344,
 0.04487070939646801,
 -0.6825744975328332,
 -0.66361784935939,
 0.1604160259404509,
 -0.49075193399093087,
 -0.19874650318089937,
 0.1777160794343758,
 -0.4306631554393492,
 -0.2702646863921165,
 -0.48381375984490643,
 -0.8179209475311235,
 0.8231506697440838,
 0.6438145409637492]
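
To see the rearrangement side by side, here is a small self-contained sketch. The list `values` and the helper `double()` are made up for illustration; the last line also shows the optional trailing `if`, which the notebook does not use but which follows the same pattern.

```python
# Hypothetical values, used only for illustration
values = [0.2, 0.5, 0.8]

def double(x):
    return x * 2

# Traditional loop: create a list, then append to it on every pass
looped = []
for v in values:
    looped.append(double(v))

# Comprehension: the expression comes first, the `for` clause goes last
comprehended = [double(v) for v in values]

# An optional trailing `if` filters items before the expression runs
large_only = [double(v) for v in values if v > 0.4]

print(looped == comprehended)  # True
print(large_only)              # [1.0, 1.6]
```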

We can time each version and see how it does:

In [14]:
%%time

for idx, n in enumerate(data[0]):
    # mean of row not in column 0
    row_mean = data.iloc[idx].loc[data.columns != 0].mean()
    
    # percent difference from data point n
    print((n - row_mean)/row_mean)

-0.49499692510962817
-0.7113915688377397
0.19242791090244812
0.14113235132255442
0.19548095897416468
-0.43604269203319795
-0.20661593435243344
0.04487070939646801
-0.6825744975328332
-0.66361784935939
0.1604160259404509
-0.49075193399093087
-0.19874650318089937
0.1777160794343758
-0.4306631554393492
-0.2702646863921165
-0.48381375984490643
-0.8179209475311235
0.8231506697440838
0.6438145409637492
Wall time: 13 ms


In [15]:
%%time

for idx, n in enumerate(data[0]):
    # function does all the work for us now
    print(percent_diff(idx, n, 0))

-0.49499692510962817
-0.7113915688377397
0.19242791090244812
0.14113235132255442
0.19548095897416468
-0.43604269203319795
-0.20661593435243344
0.04487070939646801
-0.6825744975328332
-0.66361784935939
0.1604160259404509
-0.49075193399093087
-0.19874650318089937
0.1777160794343758
-0.4306631554393492
-0.2702646863921165
-0.48381375984490643
-0.8179209475311235
0.8231506697440838
0.6438145409637492
Wall time: 20.1 ms


In [16]:
%%time 
[percent_diff(idx, n, 0) for idx, n in enumerate(data[0])]

Wall time: 10 ms


[-0.49499692510962817,
 -0.7113915688377397,
 0.19242791090244812,
 0.14113235132255442,
 0.19548095897416468,
 -0.43604269203319795,
 -0.20661593435243344,
 0.04487070939646801,
 -0.6825744975328332,
 -0.66361784935939,
 0.1604160259404509,
 -0.49075193399093087,
 -0.19874650318089937,
 0.1777160794343758,
 -0.4306631554393492,
 -0.2702646863921165,
 -0.48381375984490643,
 -0.8179209475311235,
 0.8231506697440838,
 0.6438145409637492]

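So in this run the list comprehension did what the function-based for loop does in about half the wall time (10 ms vs. 20.1 ms)! Keep in mind that both for loops also print every value, and a single %%time measurement is noisy, so treat the gap as a rough indication rather than a benchmark.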
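
For a steadier comparison, one option is the standard-library `timeit` module, which repeats each call many times, with the printing left out. The sketch below is an assumption-laden rerun rather than the notebook's own measurement: it builds a fresh seeded frame called `frame`, and it adds a third, vectorized version (whole-column arithmetic) that does not appear above. No timings are quoted because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
frame = pd.DataFrame(rng.random((20, 5)))  # hypothetical stand-in for `data`

def percent_diff(idx, n, col):
    # mean of row idx over every column except col
    row_mean = frame.iloc[idx].loc[frame.columns != col].mean()
    return (n - row_mean) / row_mean

def with_loop():
    out = []
    for idx, n in enumerate(frame[0]):
        out.append(percent_diff(idx, n, 0))
    return out

def with_comprehension():
    return [percent_diff(idx, n, 0) for idx, n in enumerate(frame[0])]

def vectorized():
    # same calculation on whole columns at once, no Python-level loop
    row_mean = frame.drop(columns=0).mean(axis=1)
    return (frame[0] - row_mean) / row_mean

# All three produce the same numbers (up to floating-point rounding)
assert np.allclose(with_loop(), with_comprehension())
assert np.allclose(with_comprehension(), vectorized())

for fn in (with_loop, with_comprehension, vectorized):
    seconds = timeit.timeit(fn, number=50) / 50
    print(f"{fn.__name__}: {seconds * 1000:.3f} ms per run")
```

In a notebook, the `%timeit` magic does the same kind of repeated measurement for a single line.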

# pandas apply  
You can also use pandas to do the looping for you. Say you had a function that detected some range of values to be important. You could write a function for that:

In [17]:
def flagger(value):
    # flag values inside the closed interval [0.4, 0.6]
    if 0.4 <= value <= 0.6:
        return 1
    else:
        return 0

In [18]:
# run flagger on every value in column 0
[flagger(x) for x in data[0]]

[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

So our flagger works. We could do this for each column:

In [19]:
[flagger(x) for col in data.columns for x in data[col]]

[0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

But since we're in pandas, it's usually cleaner to let pandas do the looping. `Series.apply()` calls the function on every value and returns a Series that keeps the original index:

In [20]:
data[0].apply(flagger)

0     0
1     0
2     1
3     0
4     0
5     0
6     1
7     0
8     0
9     0
10    0
11    0
12    1
13    0
14    0
15    1
16    0
17    0
18    0
19    1
Name: 0, dtype: int64
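
For a simple range check like this one, pandas also has a built-in that skips the per-value function call entirely. The sketch below compares `Series.apply(flagger)` with `Series.between()` on a made-up Series `s` whose values sit on and around the 0.4 and 0.6 boundaries; the compact `flagger` here is a restatement of the one above.

```python
import pandas as pd

def flagger(value):
    return 1 if 0.4 <= value <= 0.6 else 0

# Hypothetical values chosen to sit on and around the boundaries
s = pd.Series([0.39, 0.4, 0.5, 0.6, 0.61])

applied = s.apply(flagger)                    # calls flagger once per value
vectorized = s.between(0.4, 0.6).astype(int)  # both ends inclusive by default

print(applied.tolist())     # [0, 1, 1, 1, 0]
print(vectorized.tolist())  # [0, 1, 1, 1, 0]
```

`apply()` stays the more general tool, since it accepts any Python function, while `between()` only covers this one kind of test.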

We could combine this with a for loop to do it for the entire dataset:

In [21]:
flagged = {}
for col in data.columns:
    # keep each column's result instead of throwing it away
    flagged[col] = data[col].apply(flagger)
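
If the goal is a full table of flags, the column loop can also be folded into a single expression. This is a sketch on a small made-up frame `demo`, showing two options that should give the same result: nesting `Series.apply` inside `DataFrame.apply`, or comparing the whole frame at once.

```python
import pandas as pd

def flagger(value):
    return 1 if 0.4 <= value <= 0.6 else 0

# Hypothetical frame standing in for `data`
demo = pd.DataFrame({0: [0.1, 0.5], 1: [0.45, 0.9]})

# Option 1: DataFrame.apply passes each column as a Series,
# and Series.apply then runs flagger on every value in it
per_column = demo.apply(lambda col: col.apply(flagger))

# Option 2: comparisons work on the whole frame at once
whole_frame = ((demo >= 0.4) & (demo <= 0.6)).astype(int)

print(per_column.values.tolist())   # [[0, 1], [1, 0]]
print(whole_frame.values.tolist())  # [[0, 1], [1, 0]]
```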

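We can also write a convenience function and hand it to the dataframe's own apply(), instead of writing the for loop ourselves. With axis=1, pandas passes each row to the function, so this version returns 1 if any value in the row falls in the flagged range: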

In [22]:
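def pd_flagger(row):
    # with axis=1, `row` is one row of the dataframe (a Series)
    flags = [flagger(val) for val in row]
    # 1 if at least one value in the row was flagged
    if np.sum(flags) > 0:
        return 1
    else:
        return 0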

In [23]:
df_flags = data.apply(pd_flagger, axis=1)
df_flags

0     1
1     1
2     1
3     0
4     0
5     1
6     1
7     1
8     0
9     0
10    0
11    1
12    1
13    0
14    1
15    1
16    1
17    1
18    0
19    1
dtype: int64
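
The `axis` argument is easy to mix up, so here is a quick sketch of what the function receives in each case. The frame `demo` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: 2 rows, 3 columns
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

# axis=0 (the default): the function receives each column as a Series
col_sums = demo.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = demo.apply(lambda row: row.sum(), axis=1)

print(col_sums.tolist())  # [5, 7, 9]  -> one result per column
print(row_sums.tolist())  # [6, 15]    -> one result per row
```

That is why `data.apply(pd_flagger, axis=1)` gives back 20 values: one per row of the 20 x 5 frame.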

So there's a flag for each row! We can fine tune pd_flagger() to tell us which columns were flagged instead of just giving a binary.

Also, here, I'm introducing a slight variation on list comprehensions: dictionary comprehensions! They work basically the same way, except they build a dictionary instead of a list. I can use the dictionary to map each column position to whether its value was flagged (1) or not (0).

In [24]:
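def pd_flagger2(row):
    # map each column position in the row to its flag (1 or 0)
    flags = {idx: flagger(val) for idx, val in enumerate(row)}
    # keep only the positions that were flagged
    return [k for k, v in flags.items() if v == 1]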

In [25]:
df_flags2 = data.apply(pd_flagger2, axis=1)
df_flags2

0        [3]
1     [2, 3]
2        [0]
3         []
4         []
5        [4]
6     [0, 4]
7     [1, 3]
8         []
9         []
10        []
11       [3]
12       [0]
13        []
14    [2, 3]
15    [0, 2]
16       [1]
17       [2]
18        []
19       [0]
dtype: object
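
Here is the dictionary comprehension on its own, outside of apply(). The row below is a made-up Series with named columns (an assumption for illustration), and it uses `Series.items()` so the dictionary keys are the column labels rather than the positions that enumerate() gives.

```python
import pandas as pd

def flagger(value):
    return 1 if 0.4 <= value <= 0.6 else 0

# Hypothetical single row with named columns
row = pd.Series({"a": 0.5, "b": 0.9, "c": 0.45})

# {key: value for ...} builds a dict; Series.items() supplies (label, value) pairs
flags = {label: flagger(val) for label, val in row.items()}

# Keep only the labels whose flag is 1
flagged_labels = [label for label, flag in flags.items() if flag == 1]

print(flags)           # {'a': 1, 'b': 0, 'c': 1}
print(flagged_labels)  # ['a', 'c']
```

With the notebook's `data`, labels and positions are both 0 through 4, so either approach gives the same lists.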

We can also just directly add this to our dataframe as its own column:

In [26]:
data['flags'] = data.apply(pd_flagger2, axis=1)
data

Unnamed: 0,0,1,2,3,4,flags
0,0.274397,0.058965,0.701447,0.425146,0.987868,[3]
1,0.096999,0.094174,0.532352,0.460996,0.256847,"[2, 3]"
2,0.410812,0.918436,0.234061,0.160211,0.065362,[0]
3,0.728006,0.149043,0.850813,0.663227,0.888791,[]
4,0.732153,0.63718,0.343271,0.726732,0.742553,[]
5,0.350698,0.721609,0.94317,0.272385,0.550245,[4]
6,0.403262,0.075591,0.86809,0.606123,0.483321,"[0, 4]"
7,0.32319,0.475267,0.245301,0.472825,0.043853,"[1, 3]"
8,0.175698,0.35796,0.878063,0.675739,0.302277,[]
9,0.173087,0.863698,0.85717,0.18636,0.150995,[]
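
One gotcha once the new column exists: running the same apply() again would feed the list-valued `flags` cells to flagger(), which compares against numbers and would fail on a list. A possible workaround, sketched here on a small made-up frame `demo`, is to restrict the apply() to numeric columns with `select_dtypes`.

```python
import pandas as pd

def flagger(value):
    return 1 if 0.4 <= value <= 0.6 else 0

def pd_flagger2(row):
    flags = {idx: flagger(val) for idx, val in enumerate(row)}
    return [k for k, v in flags.items() if v == 1]

# Hypothetical frame standing in for `data`
demo = pd.DataFrame({0: [0.5, 0.1], 1: [0.9, 0.45]})
demo["flags"] = demo.apply(pd_flagger2, axis=1)

# Restrict to numeric columns so the list-valued 'flags' column is skipped
numeric = demo.select_dtypes(include="number")
demo["flags"] = numeric.apply(pd_flagger2, axis=1)

print(demo["flags"].tolist())  # [[0], [1]]
```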
