# Apply (or, thinking and doing functionally)
- We've talked at length about the benefits of broadcasting and vectorization, but sometimes you need to do something that NumPy just can't do
- We can still think and act functionally, and pandas provides two mechanisms for this: `apply` and `applymap`

In [1]:
%load_ext fakermaker
import fakermaker

- `applymap` is a method on a `DataFrame` which takes a single parameter: some function to apply to every cell in the `DataFrame`
- The return value is, likewise, simple: a new `DataFrame` holding the results
- Let's take a look

In [8]:
import pandas as pd

def sample_func(x):
    return x*x

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)
new_df = df.applymap(sample_func)
print(new_df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
   A   B   C
0  1  16  49
1  4  25  64
2  9  36  81
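
A hedged aside on naming: in pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, which does the same element-wise job. The sketch below picks whichever method the installed version offers. The helper name `elementwise` is hypothetical and not part of pandas.

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every cell of df.

    Uses DataFrame.map when it exists (pandas >= 2.1) and falls back
    to DataFrame.applymap on older versions.
    """
    method = df.map if hasattr(df, "map") else df.applymap
    return method(func)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
squared = elementwise(df, lambda x: x * x)
print(squared)
```

For plain arithmetic like this, `df * df` gives the same values without calling a Python function per cell, so element-wise mapping is best kept for logic that cannot be vectorized.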


- In the end, I almost never use `applymap`, but I use `apply` in **every** data cleaning task
- `apply` looks almost the same, but hands the function a whole column (`axis=0`, the default) or a whole row (`axis=1`) instead of a single cell
- **very very powerful**
- The result of an `apply` is a `Series` or `DataFrame` object
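
To make the row/column distinction concrete, here is a minimal sketch on a small made-up frame. The function names `spread` and `min_and_max` are hypothetical. It shows that a function returning a scalar gives back a `Series`, while a function returning a `Series` gives back a `DataFrame`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def spread(values):
    # values is a Series: one column when axis=0, one row when axis=1
    return values.max() - values.min()

per_column = df.apply(spread)          # axis=0 is the default: one result per column
per_row = df.apply(spread, axis=1)     # one result per row

def min_and_max(values):
    # returning a Series makes apply build a DataFrame from the pieces
    return pd.Series({'min': values.min(), 'max': values.max()})

summary = df.apply(min_and_max, axis=1)

print(per_column)
print(per_row)
print(summary)
```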

In [5]:
%%fakermaker
people
------
name
address

In [6]:
people.head()

Unnamed: 0,name,address
0,Pamela Clark,"5722 Williamson Falls\nGrahamchester, OK 81070"
1,Jane Mitchell,"94816 Turner Meadow\nWest Annetteshire, WA 54126"
2,Stephen Caldwell,"480 Matthew Neck\nEast Lorettahaven, HI 12607"
3,James Weiss,"37312 Jenkins Brooks\nSanchezport, KS 54026"
4,Danielle Pearson,"708 Amy Mountains Suite 319\nMasseyfort, OK 41062"


- So, how should we clean this?
- Let's break everything apart!

In [32]:
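# my implementation
import re
def grab_zip(x):
    # with the default axis=0, x is an entire column (a Series),
    # so each i below is a single cell of that column
    pattern = r'\d{5}'
    for i in x:
        stuff = re.findall(pattern, i)
        if len(stuff) > 0:
            # printing shows the zip code but returns nothing,
            # so apply() ends up with None for each column
            print(stuff[-1])

people.apply(grab_zip)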

81070
54126
12607
54026
41062
56663
75195
90430
12628
49096
18623
37528
89941
06728
80476
13425
97474
50641
68914
92753
36717
90430
89652
83256
81880
13425
50641
96286
18623
90430
01543
50641
09385
89652
86523
04052
19558
54026
95902
08313
21140
04670
85000
03058
20699
12628
23577
81114
55671
03596
08355
04670
85000
58538
27422
17860
31835
01070
76623
16459
09865
83126
94196
38011
45318
60420
38011
58538
74999
12750
11326
51249
95902
60406
94659
85000
58538
74999
80476
84977
18390
81970
39083
89941
06278
23577
50598
85000
90430
18623
60228
38997
31233
67278
87566
74729
49096
09865
10794


name       None
address    None
dtype: object
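
The `None` values above appear because `grab_zip` prints its matches instead of returning them, and because `apply` with the default `axis=0` hands the function one whole column at a time. Below is a minimal sketch of a returning variant applied to the `address` column only. It assumes a small hand-built frame (the first three rows shown earlier) in place of the `fakermaker` data, and the names `people_demo` and `last_zip` are hypothetical.

```python
import re
import pandas as pd

# stand-in for the generated `people` table (first three rows from above)
people_demo = pd.DataFrame({
    'name': ['Pamela Clark', 'Jane Mitchell', 'Stephen Caldwell'],
    'address': [
        '5722 Williamson Falls\nGrahamchester, OK 81070',
        '94816 Turner Meadow\nWest Annetteshire, WA 54126',
        '480 Matthew Neck\nEast Lorettahaven, HI 12607',
    ],
})

def last_zip(address):
    # the street number may also be five digits, so keep the last match
    matches = re.findall(r'\d{5}', address)
    return matches[-1] if matches else None

# Series.apply calls the function once per cell and collects the return values
zips = people_demo['address'].apply(last_zip)
print(zips)
```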

In [33]:
import re

def get_zip(row):
    '''This function expects a parameter called row which is of type Series.
    The expectation is that the Series object has two fields in it, one for
    the name of a person and one for the address of the person. This function
    will pull out and return the zip code for the address'''
    # little tests to make sure our row is formatted properly
    assert "name" in row
    assert "address" in row

    # Let's use a regex to find the zip code!
    pattern = r'\d{5}'
    zips = re.findall(pattern, row['address'])
    # street numbers can also be five digits, so the zip code is the last match
    return zips[-1]

people['zip codes'] = people.apply(get_zip, axis=1)
people.head()

Unnamed: 0,name,address,zip codes
0,Pamela Clark,"5722 Williamson Falls\nGrahamchester, OK 81070",81070
1,Jane Mitchell,"94816 Turner Meadow\nWest Annetteshire, WA 54126",54126
2,Stephen Caldwell,"480 Matthew Neck\nEast Lorettahaven, HI 12607",12607
3,James Weiss,"37312 Jenkins Brooks\nSanchezport, KS 54026",54026
4,Danielle Pearson,"708 Amy Mountains Suite 319\nMasseyfort, OK 41062",41062
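
As an alternative to `apply` (not the approach used above), pandas string methods can pull the zip code out of a whole column in one call. This is a minimal sketch assuming the zip code is always the last five digits of the address. The first two addresses come from the table above, and the third is a made-up row included to show what happens when nothing matches.

```python
import pandas as pd

addresses = pd.Series([
    '37312 Jenkins Brooks\nSanchezport, KS 54026',
    '708 Amy Mountains Suite 319\nMasseyfort, OK 41062',
    'no zip here',  # made-up row with no zip code
])

# one capture group, anchored to the end of the string;
# expand=False returns a Series instead of a one-column DataFrame
zip_codes = addresses.str.extract(r'(\d{5})\s*$', expand=False)
print(zip_codes)
```

Rows without a match come back as missing values rather than raising an error. Because the results stay strings, zip codes with leading zeros such as `06728` keep their full five digits.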


In [None]:
people

Let's try one more example...with even more addresses!

In [34]:
%%fakermaker
person_info
-----------
address as home
address as work
address as school
ssn

In [35]:
person_info.head()

Unnamed: 0,home,work,school,ssn
0,"20298 Harris Coves\nPort Kim, MD 51809","45283 Matthew Isle\nEast William, OK 01639","88247 Rodgers Burgs\nSouth Stephanie, OR 89883",149-02-1628
1,"46436 Reed Trace\nEast Brian, OH 16994","5723 Diana Squares\nPort Jefferyview, TX 12997","259 Neal Mill\nNorth Angela, NJ 24946",865-16-1263
2,"PSC 6900, Box 9807\nAPO AE 13267","41806 Austin Forges Apt. 535\nAngelastad, MN 1...","00047 Chris Summit\nBrownmouth, TN 47772",745-88-0600
3,"986 Turner Point Suite 470\nAyalabury, MD 16167","24748 Nicholas Mews\nSouth Kyleborough, ID 25912",Unit 9513 Box 1120\nDPO AP 68386,833-71-8763
4,USCGC Duke\nFPO AE 88972,"90173 Troy Mews\nSouth Lisaside, IA 41242","073 Nguyen Turnpike\nWest Kristinhaven, MA 26604",833-71-8763


In [43]:
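import re
def get_state(x):
    # x is one row (axis=1); only the home address is examined here
    home = x['home']
    # two consecutive capitals; the last match is the state (or military region) code
    pattern = '[A-Z]{2}'
    state = re.findall(pattern, home)
    if len(state) > 0:
        # print() displays the code, but the function itself returns None for every row
        print(state[-1])

person_info.apply(get_state, axis=1)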

MD
OH
AE
MD
AE
MI
MN
VT
IN
NC
NM
ME
PA
KY
UT
UT
ME
AP
ID
AE
WV
CO
AZ
CO
VA
ID
NV
MN
AZ
NJ
AP
TN
AL
VT
CO
MI
MO
CO
ND
CO
IL
NM
DE
IL
CO
AK
MI
KS
WA
MD
NM
NV
NV
GA
ME
AA
IA
GA
DE
WY
OK
MT
WV
UT
NC
OK
NM
NE
NY
NC
AZ
NV
SD
IL
CO
IN
CO
AA
SD
MA
NY
TX
VT
CO
IA
TN
AA
CT
RI
SD
SC
NE
LA
AE
MA
OK
SD
OR
PA


0     None
1     None
2     None
3     None
4     None
      ... 
94    None
95    None
96    None
97    None
98    None
Length: 99, dtype: object
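
The `Series` of `None` values appears because `get_state` prints instead of returning. One possible next step, sketched here under assumptions, returns the state for all three address columns at once, so that `apply` builds a new `DataFrame`. The frame `person_demo` is a hand-built stand-in using rows 0 and 4 from the table above, and `get_states` and `STATE_BEFORE_ZIP` are hypothetical names. The pattern assumes every address ends in a two-letter code followed by a five-digit zip.

```python
import re
import pandas as pd

# stand-in for the generated `person_info` table (rows 0 and 4 from above)
person_demo = pd.DataFrame({
    'home': ['20298 Harris Coves\nPort Kim, MD 51809',
             'USCGC Duke\nFPO AE 88972'],
    'work': ['45283 Matthew Isle\nEast William, OK 01639',
             '90173 Troy Mews\nSouth Lisaside, IA 41242'],
    'school': ['88247 Rodgers Burgs\nSouth Stephanie, OR 89883',
               '073 Nguyen Turnpike\nWest Kristinhaven, MA 26604'],
})

# two capital letters standing alone, right before the trailing zip code
STATE_BEFORE_ZIP = re.compile(r'\b([A-Z]{2}) \d{5}\s*$')

def get_states(row):
    def state_of(address):
        match = STATE_BEFORE_ZIP.search(address)
        return match.group(1) if match else None
    # returning a Series per row makes apply produce a DataFrame
    return pd.Series({col + '_state': state_of(row[col])
                      for col in ['home', 'work', 'school']})

states = person_demo.apply(get_states, axis=1)
print(states)
```

Anchoring the pattern to the zip code avoids picking up other capital-letter pairs, such as the `US` in `USCGC`.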