# Iterating a DataFrame

- https://towardsdatascience.com/python-pandas-iterating-a-dataframe-eb7ce7db62f8

In [14]:
import math
import numpy as np
import pandas as pd

In [7]:

data = {'column_a': pd.Series(np.random.randint(0, 100, 5)),
        'column_b': pd.Series(np.random.randint(0, 100, 5)),
        'column_c': pd.Series(np.random.randint(0, 100, 5))}
df = pd.DataFrame(data)
df.index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5']
print(df)

       column_a  column_b  column_c
row_1        53        62        87
row_2        45        86        98
row_3        84        93        21
row_4         1        45        21
row_5         0        49        50


In [8]:

for index, row in df.iterrows():
    print(f'Index: {index}, row: {row.values}')

Index: row_1, row: [53 62 87]
Index: row_2, row: [45 86 98]
Index: row_3, row: [84 93 21]
Index: row_4, row: [ 1 45 21]
Index: row_5, row: [ 0 49 50]


In [10]:
for index, row in df.iterrows():
    # row.get() returns the default (here 0) if the column is missing,
    # whereas row['column_a'] would raise a KeyError
    print(f'Index: {index}, column_a: {row.get("column_a", 0)}')

Index: row_1, column_a: 53
Index: row_2, column_a: 45
Index: row_3, column_a: 84
Index: row_4, column_a: 1
Index: row_5, column_a: 0


In [12]:
for row in df.itertuples():
    print(row)
    # each row is a namedtuple, as the type shows
    print(type(row))

Pandas(Index='row_1', column_a=53, column_b=62, column_c=87)
<class 'pandas.core.frame.Pandas'>
Pandas(Index='row_2', column_a=45, column_b=86, column_c=98)
<class 'pandas.core.frame.Pandas'>
Pandas(Index='row_3', column_a=84, column_b=93, column_c=21)
<class 'pandas.core.frame.Pandas'>
Pandas(Index='row_4', column_a=1, column_b=45, column_c=21)
<class 'pandas.core.frame.Pandas'>
Pandas(Index='row_5', column_a=0, column_b=49, column_c=50)
<class 'pandas.core.frame.Pandas'>


Iterating over a DataFrame with .iterrows() and .itertuples() is convenient, but it is generally discouraged because it is slow on larger DataFrames. Most of the time, people iterate over a DataFrame to add a calculated column or to reformat an existing one. Pandas supports this directly through its built-in .apply() method, which is usually a more efficient way to update a DataFrame.
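
To make the comparison concrete, here is a minimal sketch that builds the same calculated column twice, once by looping with .iterrows() and once with .apply(). The DataFrame `sample` and its column `value` are hypothetical names chosen for illustration, and no timings are claimed here; wrapping each approach in `timeit` on your own data is the way to measure the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name 'value' is an assumption.
rng = np.random.default_rng(seed=0)
sample = pd.DataFrame({'value': rng.integers(0, 100, size=1000)})

# Approach 1: collect the new values by looping with .iterrows().
doubled_loop = []
for _, row in sample.iterrows():
    doubled_loop.append(row['value'] * 2)
sample['doubled_loop'] = doubled_loop

# Approach 2: let .apply() call the function on each element of the column.
sample['doubled_apply'] = sample['value'].apply(lambda v: v * 2)

# Both approaches produce the same values.
print((sample['doubled_loop'] == sample['doubled_apply']).all())
```

The two columns hold identical values; the difference lies only in how the work is expressed and how long it takes.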

Pandas Apply for Power Users provides an in-depth look at Pandas .apply().
https://towardsdatascience.com/pandas-apply-for-power-users-f44d0e0025ce

## Pandas Apply

In [20]:
def squared_dataframe(size):
    output = [{'square': np.square(np.random.randint(0, 1000))} for _ in range(size)]
    return output

In [21]:
# Generate DataFrame with square numbers.
df = pd.DataFrame(squared_dataframe(10000))

In [22]:
print(df.shape)
print(df.head(10))

(10000, 1)
   square
0    2025
1  811801
2  173056
3   45796
4  674041
5   99225
6  201601
7  329476
8  407044
9  731025


In [23]:
def get_square_root(squared_number):
    return math.sqrt(squared_number)

## Calculate square roots using different methods.

In [27]:
# apply NumPy's sqrt function directly
df['square_root_numpy'] = df['square'].apply(np.sqrt)

In Python, an anonymous function is a function without a name. The def keyword defines a normal, named function; the lambda keyword defines an anonymous one. It has the following syntax:

Syntax: lambda arguments: expression

A lambda function can take any number of arguments but is syntactically restricted to a single expression, which is evaluated and returned.
Lambda functions can be used wherever a function object is required.
They are commonly passed as short, throwaway arguments to functions such as filter(), map() and reduce().

https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/
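
As a small sketch of the points above (several arguments, a single expression, and use wherever a function object is expected), the following uses only Python built-ins. The names `add` and `words` are made up for illustration.

```python
# A lambda may take several arguments but contains a single expression.
add = lambda a, b: a + b
print(add(2, 3))

# A lambda can be passed wherever a function object is expected,
# for example as the key argument of sorted().
words = ['pandas', 'np', 'math']
print(sorted(words, key=lambda w: len(w)))

# map() and filter() also accept a lambda as their first argument.
squares = list(map(lambda n: n * n, [1, 2, 3]))
evens = list(filter(lambda n: n % 2 == 0, range(6)))
print(squares, evens)
```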


In [30]:
# Python code to illustrate the cube of a number,
# showing the difference between def and lambda.
def cube(y):
    return y*y*y

lambda_cube = lambda y: y*y*y

# using the normally
# defined function
print(cube(5))

# using the lambda function
print(lambda_cube(5))

125
125


In [32]:
# apply a "nameless" lambda function
df['square_root_lambda'] = df['square'].apply(lambda x: math.sqrt(x))

In [33]:
# use the function we defined above
df['square_root_function'] = df['square'].apply(get_square_root)

In [15]:
print(df.head(10))

   square  square_root_numpy  square_root_lambda  square_root_function
0  146689              383.0               383.0                 383.0
1   11664              108.0               108.0                 108.0
2   62500              250.0               250.0                 250.0
3   74529              273.0               273.0                 273.0
4      49                7.0                 7.0                   7.0
5   23409              153.0               153.0                 153.0
6  391876              626.0               626.0                 626.0
7  421201              649.0               649.0                 649.0
8   20736              144.0               144.0                 144.0
9  692224              832.0               832.0                 832.0


## Using apply with multiple columns

In [34]:
# function with two arguments!
def f(n_1, n_2):
    return n_1*n_2

In [39]:
# apply the function to two columns
df['new_square'] = df.apply(lambda x: f(x.square_root_numpy, x.square_root_lambda), axis=1)

In [38]:
print(df[['square', 'new_square']].head(10))

   square  new_square
0    2025      2025.0
1  811801    811801.0
2  173056    173056.0
3   45796     45796.0
4  674041    674041.0
5   99225     99225.0
6  201601    201601.0
7  329476    329476.0
8  407044    407044.0
9  731025    731025.0
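
As a further sketch of row-wise .apply(), the example below passes a named function instead of a lambda and checks the result against plain arithmetic on whole columns. The DataFrame `shapes`, its columns `width` and `height`, and the function `area` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical data; 'width' and 'height' are illustrative column names.
shapes = pd.DataFrame({'width': [2, 3, 4], 'height': [5, 6, 7]})

def area(row):
    # With axis=1, each call receives one row as a Series indexed by column name.
    return row['width'] * row['height']

shapes['area_apply'] = shapes.apply(area, axis=1)

# The same values computed with arithmetic on whole columns,
# without calling a Python function once per row.
shapes['area_columns'] = shapes['width'] * shapes['height']

print(shapes)
```

For a simple product like this, the column arithmetic gives the same numbers; row-wise .apply() is the more general tool when the per-row logic cannot be written as column operations.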
