### Practice with Apply

apply takes a function and "applies" it (i.e. runs it) across each row or column of the dataframe. It is usually more concise than writing an explicit loop in Python, although pandas' built-in vectorized operations are often faster still.

Let's create a dataframe

In [1]:
import pandas as pd

df = pd.DataFrame({'a':[10,20,30],
                  'b':[20,30,40]})
df

Unnamed: 0,a,b
0,10,20
1,20,30
2,30,40


In [4]:
df['a']**2

0    100
1    400
2    900
Name: a, dtype: int64

We can apply our functions over a Series (i.e. an individual row or column).
Let's first write a function.

In [5]:
def my_sq(x):
    """Squares a given value
    """
    return x**2


### Apply over a Series
For example, we can apply our square function to the 'a' column.



In [6]:
sq = df['a'].apply(my_sq)
sq

0    100
1    400
2    900
Name: a, dtype: int64
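
As a quick sanity check, the sketch below squares column 'a' three ways: with a plain Python loop, with apply, and with the vectorized `**` operator. It rebuilds the same small dataframe so it runs on its own. For simple arithmetic like this the vectorized form is usually the quickest, though actual timings depend on the data.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def my_sq(x):
    """Square a single value."""
    return x ** 2

# 1. explicit Python loop over the values
looped = pd.Series([my_sq(v) for v in df['a']], index=df.index, name='a')

# 2. apply calls my_sq once per element of the Series
applied = df['a'].apply(my_sq)

# 3. the vectorized operator works on the whole column at once
vectorized = df['a'] ** 2

print(applied.equals(looped), applied.equals(vectorized))
```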

Suppose we create a function with two arguments.
To apply the function, **the first argument to apply is the function itself**; any extra arguments the function needs are passed after it as keyword arguments (here, e=2).

In [7]:
def my_exp(x,e):
    """Raises x to the power e
    """
    return x**e
ex = df['a'].apply(my_exp,e=2)
ex



0    100
1    400
2    900
Name: a, dtype: int64

In [8]:
ex = df['a'].apply(my_exp,e=3)
ex

0     1000
1     8000
2    27000
Name: a, dtype: int64
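
Besides keyword arguments, Series.apply also accepts extra positional arguments through its `args` parameter, and a lambda can wrap the call as well. The sketch below shows that the three forms give the same result for the cube example above.

```python
import pandas as pd

s = pd.Series([10, 20, 30], name='a')

def my_exp(x, e):
    """Raise x to the power e."""
    return x ** e

by_keyword = s.apply(my_exp, e=3)             # extra argument passed by name
by_position = s.apply(my_exp, args=(3,))      # extra positional arguments as a tuple
by_lambda = s.apply(lambda x: my_exp(x, 3))   # the lambda fixes e itself

print(by_keyword.tolist())
print(by_position.tolist())
print(by_lambda.tolist())
```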

### Apply over a Dataframe

When we apply a function over a dataframe, we first need to specify which axis to apply the function over, e.g. column by column or row by row.

Let's write a function that prints the square of whatever it is given. When used with apply on a dataframe, it receives an entire column (or row) at a time rather than a single value.

In [11]:
def print_me(x):
    print(x**2)

To work column by column, we pass axis=0 to apply (this is also the default). If we want the function to work row-wise, we can pass axis=1 instead.

### Column-wise Operations

In [12]:
df.apply(print_me,axis=0)

0    100
1    400
2    900
Name: a, dtype: int64
0     400
1     900
2    1600
Name: b, dtype: int64


a    None
b    None
dtype: object
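
The `None` values above appear because print_me prints but does not return anything, so apply has nothing to collect. To see what apply actually hands to the function in each direction, here is a small sketch with a hypothetical helper, `labels_seen`, that returns the index labels of the Series it receives.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def labels_seen(s):
    """Return the index labels of the Series passed in, joined as text."""
    return ",".join(str(label) for label in s.index)

# axis=0: each column is passed in, so the labels are the row index 0, 1, 2
by_col = df.apply(labels_seen, axis=0)

# axis=1: each row is passed in, so the labels are the column names a, b
by_row = df.apply(labels_seen, axis=1)

print(by_col)
print(by_row)
```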

If we want to take the average of each column, we can do so as below. Note that this function assumes each column has exactly three values.

In [14]:
def avg_3apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x+y+z)/3

print(df.apply(avg_3apply))

a    20.0
b    30.0
dtype: float64
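
The function above is hard-coded for three values. A minimal sketch of a version that works for a column of any length, alongside the built-in `mean` method (the name `avg_apply` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_apply(col):
    """Average of a column of any length."""
    return col.sum() / len(col)

via_apply = df.apply(avg_apply)   # function is called once per column
via_mean = df.mean()              # built-in column means (axis=0 is the default)

print(via_apply)
print(via_mean)
```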


### Row-wise Operations
They work like column-wise operations. The part that differs is the axis. We will now use axis=1 in the apply method. 
Thus instead of the entire column being passed into the first argument of the function, the entire row is used as the first argument.

In [16]:
def avg_2apply(row):
    # each row arrives as a Series labelled 'a' and 'b'; .iloc picks by position
    x = row.iloc[0]
    y = row.iloc[1]
    return  (x+y)/2

print(df.apply(avg_2apply,axis=1))

0    15.0
1    25.0
2    35.0
dtype: float64
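
With axis=1, each row arrives as a Series whose labels are the column names, so the values can also be picked by name. The sketch below does that and compares the result with the built-in row mean; `avg_row` is a hypothetical name used only here.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [20, 30, 40]})

def avg_row(row):
    """Average of columns 'a' and 'b' for a single row."""
    return (row['a'] + row['b']) / 2

row_avg = df.apply(avg_row, axis=1)   # function is called once per row
row_mean = df.mean(axis=1)            # built-in row means

print(row_avg)
print(row_mean)
```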


### Apply - More advanced

In [17]:
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


This data set has 891 rows and 15 columns. Almost all of the cells have a value in them. Of the 891 rows, age has 714 non-null values and deck has 203. One way we can use **apply** is to calculate how many null or NaN values there are in our data, as well as the proportion of complete cases across each column or across each row. Let's write some functions:

**1. Number of Missing Values, Proportion of Missing Values and Proportion of Complete Values**

In [20]:
# we'll use the numpy sum function
import numpy as np

def count_missing(vec):
    """Counts the number of missing values in a vector
    """
    # get a vector of True/False values
    # depending whether the value is missing
    null_vec = pd.isnull(vec)
    
    # take the sum of null_vec
    # True counts as 1 and False as 0, so the sum is the number of missing values
    null_count = np.sum(null_vec)
    
    # return the number of missing values in the vector
    return null_count

#1. Proportion of missing values
def prop_missing(vec):
    """Proportion of missing values in a vector
    """
    # numerator: Number of missing values
    # We can use the count_missing function we just wrote!
    num = count_missing(vec)
    
    # denominator: total number of values in the vector
    # vec.size counts every element, including the missing ones
    dem = vec.size
    
    # return the proportion of missing values
    return num/dem

#2. Proportion of complete values
def prop_complete(vec):
    """Proportion of nonmissing values in a vector
    """
    
    # we can utilize the prop_missing function we just wrote
    # by subtracting its value from 1
    return 1-prop_missing(vec)

These functions take a whole vector (a Series) as input, so they work on a column or row of any length.
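
For a single Series, the same three quantities can also be read off with built-in methods. This is a minimal sketch on a made-up four-element vector, not the titanic data:

```python
import numpy as np
import pandas as pd

vec = pd.Series([1.0, np.nan, 3.0, np.nan])

n_missing = vec.isnull().sum()        # True counts as 1, so the sum is the count
frac_missing = vec.isnull().mean()    # the mean of True/False is the proportion
frac_complete = vec.notnull().mean()  # proportion of non-missing values

print(n_missing, frac_missing, frac_complete)
```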

### Column-wise Operations

In [21]:
cmis_col = titanic.apply(count_missing)
pmis_col = titanic.apply(prop_missing)
pcom_col= titanic.apply(prop_complete)

In [22]:
cmis_col

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [23]:
pmis_col

survived       0.000000
pclass         0.000000
sex            0.000000
age            0.198653
sibsp          0.000000
parch          0.000000
fare           0.000000
embarked       0.002245
class          0.000000
who            0.000000
adult_male     0.000000
deck           0.772166
embark_town    0.002245
alive          0.000000
alone          0.000000
dtype: float64

In [24]:
print(pcom_col)

survived       1.000000
pclass         1.000000
sex            1.000000
age            0.801347
sibsp          1.000000
parch          1.000000
fare           1.000000
embarked       0.997755
class          1.000000
who            1.000000
adult_male     1.000000
deck           0.227834
embark_town    0.997755
alive          1.000000
alone          1.000000
dtype: float64
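
The column-wise summaries can also be produced without apply by chaining `isnull` with `sum` or `mean`. The sketch below uses a small made-up frame (`toy`, with hypothetical values) so it runs without downloading the titanic data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age':  [22.0, np.nan, 26.0, np.nan],
    'deck': [None, 'C', None, None],
    'fare': [7.25, 71.28, 7.92, 8.05],
})

missing_per_col = toy.isnull().sum()       # count of missing values per column
prop_missing_col = toy.isnull().mean()     # proportion missing per column
prop_complete_col = toy.notnull().mean()   # proportion complete per column

print(missing_per_col)
print(prop_missing_col)
print(prop_complete_col)
```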


We can check whether the values are missing at random or for some other reason. For example, let's look at the rows where embark_town is missing.


In [26]:
titanic.loc[pd.isnull(titanic.embark_town),:]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
61,1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True
829,1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True


### Row-wise Operations

In [28]:
cmis_row = titanic.apply(count_missing,axis=1)
pmis_row = titanic.apply(prop_missing,axis =1)
pcom_row = titanic.apply(prop_complete,axis = 1)

print(cmis_row.head())
print(pmis_row.head())
print(pcom_row.head())

# One thing we can do with this analysis is to see if any rows in our data have multiple missing values.

print(cmis_row.value_counts())

0    1
1    0
2    1
3    0
4    1
dtype: int64
0    0.066667
1    0.000000
2    0.066667
3    0.000000
4    0.066667
dtype: float64
0    0.933333
1    1.000000
2    0.933333
3    1.000000
4    0.933333
dtype: float64
1    549
0    182
2    160
dtype: int64


Since we are using apply in a row-wise manner, we can create a new column containing these values.

In [29]:
titanic['num_missing'] = titanic.apply(count_missing,axis=1)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,num_missing
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,1
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,1
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1
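
The per-row count can also be computed without apply by summing the boolean mask across columns. A minimal sketch on a made-up frame (`toy`, hypothetical values), showing that both routes agree:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age':  [22.0, 38.0, np.nan, np.nan],
    'deck': [None, 'C', None, 'E'],
    'fare': [7.25, 71.28, 7.92, 8.05],
})

# row-wise apply: the lambda receives one row at a time
via_apply = toy.apply(lambda row: row.isnull().sum(), axis=1)

# same count from the boolean mask, summed across columns
toy['num_missing'] = toy.isnull().sum(axis=1)

print(toy)
```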


We can then look at the rows with multiple missing values. Since there are too many such rows to display, let's look at a random sample of 10.

In [30]:
titanic.loc[titanic.num_missing >1 ,:].sample(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,num_missing
613,0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True,2
648,0,3,male,,0,0,7.55,S,Third,man,True,,Southampton,no,True,2
241,1,3,female,,1,0,15.5,Q,Third,woman,False,,Queenstown,yes,False,2
656,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True,2
181,0,2,male,,0,0,15.05,C,Second,man,True,,Cherbourg,no,True,2
32,1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True,2
490,0,3,male,,1,0,19.9667,S,Third,man,True,,Southampton,no,False,2
375,1,1,female,,1,0,82.1708,C,First,woman,False,,Cherbourg,yes,False,2
667,0,3,male,,0,0,7.775,S,Third,man,True,,Southampton,no,True,2
601,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True,2
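
Because `sample` draws at random, the rows shown will differ from run to run. If a repeatable sample is wanted, `random_state` can be set, as in this sketch on a made-up `num_missing` column:

```python
import pandas as pd

toy = pd.DataFrame({'num_missing': [0, 2, 1, 2, 2, 0, 2, 1]})
multi = toy.loc[toy['num_missing'] > 1, :]

# random_state fixes the seed, so the same rows come back on every run
first = multi.sample(n=2, random_state=42)
second = multi.sample(n=2, random_state=42)

print(first.index.tolist() == second.index.tolist())
```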


### More Practice with Apply

Using apply on rows:

Let's say you have a DataFrame with columns "A" and "B", and you want to calculate the sum of values in each row:

In [31]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

def sum_row(row):
    return row['A'] + row['B']

df['sum'] = df.apply(sum_row, axis=1)
print(df)


   A  B  sum
0  1  4    5
1  2  5    7
2  3  6    9


Using apply on columns:

Suppose you want to calculate the square of each value in a DataFrame:

In [32]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

def square_column(column):
    return column.apply(lambda x: x**2)

df_square = df.apply(square_column)
print(df_square)


   A   B
0  1  16
1  4  25
2  9  36


1. Applying a Custom Function on Columns:

Suppose you want to normalize the values in each column by subtracting the mean:

In [33]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

def normalize_column(column):
    return column - column.mean()

df_normalized = df.apply(normalize_column)
print(df_normalized)


     A    B
0 -1.0 -1.0
1  0.0  0.0
2  1.0  1.0
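
The same centering can be written without apply, because subtracting a Series from a DataFrame aligns the Series labels with the column names. A short sketch comparing the two:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# apply: the lambda receives one column at a time
centered_apply = df.apply(lambda col: col - col.mean())

# broadcasting: df.mean() is a Series indexed by column name
centered_broadcast = df - df.mean()

print(centered_apply.equals(centered_broadcast))
```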


2. Applying a Function with Arguments:

You can also use the apply function with a custom function that takes additional arguments:

In [34]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

def add_values(column, value):
    return column + value

df_added = df.apply(add_values, args=(10,))
print(df_added)


    A   B
0  11  14
1  12  15
2  13  16
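
DataFrame.apply forwards extra keyword arguments to the function too, so the value can be passed by name instead of through `args`. For a simple addition like this, plain arithmetic on the dataframe gives the same result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def add_values(column, value):
    """Add a constant to every element of a column."""
    return column + value

by_args = df.apply(add_values, args=(10,))   # positional extra argument
by_keyword = df.apply(add_values, value=10)  # keyword extra argument
by_operator = df + 10                        # vectorized equivalent

print(by_keyword)
```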


3. Applying a Function Element-Wise:

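You can use the applymap method to apply a function element-wise to each cell in the DataFrame. Note that applymap is deprecated as of pandas 2.1, where the same operation is available as DataFrame.map: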

In [35]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

df_squared = df.applymap(lambda x: x**2)
print(df_squared)


   A   B
0  1  16
1  4  25
2  9  36
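
Since `DataFrame.map` only exists from pandas 2.1 onward, code that has to run on both older and newer versions can pick whichever method is available. This is one possible sketch; the variable name `elementwise` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# use DataFrame.map when it exists (pandas >= 2.1), else fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap

df_squared = elementwise(lambda x: x ** 2)
print(df_squared)
```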


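4. Applying a Function on Selected Rows:

You can apply a function to specific rows by first filtering the DataFrame with a boolean condition and then calling apply. Rows that do not match the condition get NaN in the new column, which is why the column becomes float: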

In [36]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

def sum_row(row):
    return row['A'] + row['B']

df['sum'] = df[df['A'] > 1].apply(sum_row, axis=1)
print(df)


   A  B  sum
0  1  4  NaN
1  2  5  7.0
2  3  6  9.0
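
The NaN in the first row comes from index alignment: the filtered result has no entry for row 0. If a different fill value is preferred, one option is `Series.where`, sketched below with hypothetical column names `sum_nan` and `sum_zero`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

mask = df['A'] > 1           # True for the rows we want to keep
total = df['A'] + df['B']    # sum computed for every row

df['sum_nan'] = total.where(mask)       # NaN where the mask is False
df['sum_zero'] = total.where(mask, 0)   # 0 where the mask is False

print(df)
```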
