# Basics - Apply, Map and Vectorised Functions

In [2]:
import pandas as pd
import numpy as np

data = np.round(np.random.normal(size=(4, 3)), 2)
df = pd.DataFrame(data, columns=["A", "B", "C"])
df.head()

Unnamed: 0,A,B,C
0,0.25,1.0,1.24
1,-0.19,-1.44,1.28
2,-1.61,0.28,0.73
3,-2.38,-0.48,-1.49


## Apply

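Used to execute an arbitrary function against an entire dataframe, or a subsection of it. The function receives each column (or each row, with `axis=1`) as a whole Series, so it should be written with vectorised operations.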

In [3]:
df.apply(lambda x: 1 + np.abs(x))

Unnamed: 0,A,B,C
0,1.25,2.0,2.24
1,1.19,2.44,2.28
2,2.61,1.28,1.73
3,3.38,1.48,2.49


In [4]:
# To run on a single column
df.A.apply(np.abs)

0    0.25
1    0.19
2    1.61
3    2.38
Name: A, dtype: float64

In [6]:
# This doesn't work: apply passes in a whole column (a Series), and `if x > 0` is NOT vectorised
#def double_if_positive(x):
#    if x > 0:
#        return 2 * x
#    return x
#
#df.apply(double_if_positive)
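
The commented-out cell fails because `if x > 0` needs a single True/False, while `apply` hands the function a whole column. The sketch below reproduces the error on a small hypothetical frame (`demo` and `double_if_positive_scalar` are illustrative names, not part of the notebook) and shows one element-wise alternative with `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
demo = pd.DataFrame({"A": [0.25, -0.19], "B": [1.0, -1.44]})

def double_if_positive_scalar(x):
    # Written for a single number; breaks when x is a whole Series.
    if x > 0:
        return 2 * x
    return x

try:
    demo.apply(double_if_positive_scalar)
    error_message = None
except ValueError as exc:
    # pandas refuses to reduce a boolean Series to one truth value.
    error_message = str(exc)

# Element-wise alternative: take 2 * value where the condition holds.
doubled = pd.DataFrame(
    np.where(demo > 0, 2 * demo, demo),
    columns=demo.columns,
    index=demo.index,
)
print(error_message)
print(doubled)
```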

In [5]:
# This works, but it's not ideal because it modifies our dataframe in place
def double_if_positive(x):
    x[x > 0] *= 2
    return x

df.apply(double_if_positive)

Unnamed: 0,A,B,C
0,0.5,2.0,2.48
1,-0.19,-1.44,2.56
2,-1.61,0.56,1.46
3,-2.38,-0.48,-1.49


In [6]:
df

Unnamed: 0,A,B,C
0,0.5,2.0,2.48
1,-0.19,-1.44,2.56
2,-1.61,0.56,1.46
3,-2.38,-0.48,-1.49


In [7]:
# This is okay because we're returning a manipulated copy and not mutating the original data source
def double_if_positive(x):
    x = x.copy()
    x[x > 0] *= 2
    return x

# Use raw=True if function is pure numpy for performance increase
df.apply(double_if_positive, raw=True)

Unnamed: 0,A,B,C
0,1.0,4.0,4.96
1,-0.19,-1.44,5.12
2,-1.61,1.12,2.92
3,-2.38,-0.48,-1.49
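
As a quick check of the copy-based version, the sketch below runs it on a small hypothetical frame (`demo` is an illustrative name) and confirms the input is left untouched. It also compares the result with `DataFrame.where`, which is one built-in way to express the same rule without `apply`.

```python
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
demo = pd.DataFrame({"A": [0.25, -0.19], "B": [1.0, -1.44]})
before = demo.copy()

def double_if_positive(x):
    x = x.copy()      # work on a copy so the caller's data is not modified
    x[x > 0] *= 2
    return x

result = demo.apply(double_if_positive, raw=True)

# The input frame still holds its original values.
unchanged = demo.equals(before)

# where() keeps values where the condition is True and substitutes elsewhere.
same_as_where = result.equals(demo.where(demo <= 0, demo * 2))
print(unchanged, same_as_where)
```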


## Map

Similar to apply, but operates on a Series, element by element, and accepts a dictionary of key-value replacements (or a function) rather than working on a whole array of values at once.


In [8]:
series = pd.Series(["Steve", "Alex", "Jess", "Mark"])

In [9]:
# Executes like a find and replace with key-value pairs; values with no matching key become NaN
series.map({"Steve": "Stephen"})

0    Stephen
1        NaN
2        NaN
3        NaN
dtype: object
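
Because unmatched values become NaN, a dictionary `map` is not quite a find and replace. If the goal is to keep the values that have no key, two options are sketched below: fill the gaps from the original Series, or use `Series.replace`. The names `names` and `lookup` are illustrative.

```python
import pandas as pd

names = pd.Series(["Steve", "Alex", "Jess", "Mark"])
lookup = {"Steve": "Stephen"}

# map: anything missing from the dictionary turns into NaN.
mapped = names.map(lookup)

# Option 1: fall back to the original value where map produced NaN.
kept = names.map(lookup).fillna(names)

# Option 2: replace only touches values that appear as keys.
replaced = names.replace(lookup)

print(kept.tolist())
```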

In [10]:
# Can also pass in functions instead of dictionaries
series.map(lambda d: f"I am {d}")

0    I am Steve
1     I am Alex
2     I am Jess
3     I am Mark
dtype: object

## Vectorised functions

Pandas and NumPy have a large number of vectorised functions built in. Here are some examples:

In [11]:
display(df, df.abs())

Unnamed: 0,A,B,C
0,0.5,2.0,2.48
1,-0.19,-1.44,2.56
2,-1.61,0.56,1.46
3,-2.38,-0.48,-1.49


Unnamed: 0,A,B,C
0,0.5,2.0,2.48
1,0.19,1.44,2.56
2,1.61,0.56,1.46
3,2.38,0.48,1.49


In [12]:
series = pd.Series(["Obi-Wan Kenobi", "Luke Skywalker", "Han Solo", "Leia Organa"])

In [13]:
"Luke Skywalker".split()

['Luke', 'Skywalker']

In [18]:
series.str.split(expand=True)

Unnamed: 0,0,1
0,Obi-Wan,Kenobi
1,Luke,Skywalker
2,Han,Solo
3,Leia,Organa


In [15]:
series.str.contains("Skywalker")

0    False
1     True
2    False
3    False
dtype: bool

In [20]:
series.str.upper().str.split()

0    [OBI-WAN, KENOBI]
1    [LUKE, SKYWALKER]
2          [HAN, SOLO]
3       [LEIA, ORGANA]
dtype: object

## User defined functions

Let's investigate a very simple example: finding the hypotenuse given x and y distances.


In [21]:
data2 = np.random.normal(10, 2, size=(100000, 2))
df2 = pd.DataFrame(data2, columns=["x", "y"])

In [22]:
hypot = (df2.x**2 + df2.y**2)**0.5
print(hypot[0])

13.418962226391836


In [23]:
def hypot1(x, y):
    return np.sqrt(x**2 + y**2)

h1 = []
for index, (x, y) in df2.iterrows():
    h1.append(hypot1(x, y))
print(h1[0])

13.418962226391836


In [24]:
def hypot2(row):
    return np.sqrt(row.x**2 + row.y**2)

h2 = df2.apply(hypot2, axis=1)
print(h2[0])

13.418962226391836


In [25]:
def hypot3(xs, ys):
    return np.sqrt(xs**2 + ys**2)
h3 = hypot3(df2.x, df2.y)
print(h3[0])

13.418962226391836


Vectorising everything you can is the key to speeding up your code. Once you've done that, use profiling tools to investigate what remains. PyCharm Professional has a great optimisation tool built in, and Jupyter can use the %lprun (line profiler) command, which you can find here: https://github.com/rkern/line_profiler
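
To see the difference on your own machine, a rough timing sketch with the standard-library `timeit` module is shown below. It uses a smaller hypothetical frame (`small`, 2000 rows) so the row-wise version finishes quickly. The function names are illustrative, and the timings will vary between machines, so no numbers are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.normal(10, 2, size=(2000, 2)), columns=["x", "y"])

def hypot_rowwise(row):
    # Called once per row by apply(axis=1).
    return np.sqrt(row.x**2 + row.y**2)

def hypot_vectorised(xs, ys):
    # Called once on whole columns.
    return np.sqrt(xs**2 + ys**2)

t_apply = timeit.timeit(lambda: small.apply(hypot_rowwise, axis=1), number=3)
t_vec = timeit.timeit(lambda: hypot_vectorised(small.x, small.y), number=3)
print(f"apply(axis=1): {t_apply:.4f}s  vectorised: {t_vec:.4f}s")

# Both approaches should produce the same values.
agree = np.allclose(
    small.apply(hypot_rowwise, axis=1),
    hypot_vectorised(small.x, small.y),
)
```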

### Recap

* apply
* map
* .str & similar


* apply runs in a vectorised fashion, row by row or column by column as specified
* map operates element by element, using either dictionary key-value replacement or a function, and is slower
* apply can be used on series/dataframes; map is only for series
* vectorised implementations (the built-in dataframe and series methods) are much faster