# The apply, map, and str.replace methods

In Python it is common to loop through a database table and perform an operation on each record individually.

This is possible in GeoPandas as well. For instance, we could use the loc or iloc indexers to loop through the rows, access each field value in that row, and perform some operation on it.

In Pandas and GeoPandas, however, it is far more common to work on an entire column or DataFrame at once rather than row by row. One way to do this is to apply a function to an entire Series using the apply method.

In [1]:
%matplotlib inline
import geopandas as gpd

buowl = gpd.read_file("data/BUOWL_Habitat.shp")
raptor = gpd.read_file("data/Raptor_Nests.shp")
linear = gpd.read_file("data/Linear_Projects.shp")
eagles = gpd.read_file("data/BAEA_Nests.shp")

Let's say that we want to create a new field that contains the length (number of characters) of the recentstat field. If we were new to pandas we might try something like this:

In [2]:
raptor['stat_len'] = len(raptor['recentstat'])
raptor.head()

Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (-104.79595 40.29891),879
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (-104.78897 40.22089),879
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (-105.50223 38.68694),879
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (-104.84889 40.35215),879
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (-104.74466 40.18571),879


This clearly didn't work. The length of a Series is the number of rows in the DataFrame (879), not the number of characters in each row's value, and that single number was assigned to every row. Our next step might be to loop through each row and assign the len of the recentstat field for every row. This is how we would typically do things in plain Python.

In [3]:
import time
ts = time.time()
for idx in raptor.index:
    raptor.at[idx, 'stat_len'] = len(raptor.at[idx, 'recentstat'])
te = time.time()

method1 = te-ts
print("Total Time: {:10.5f}".format(method1))
raptor

Total Time:    0.13034


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (-104.79595 40.29891),13
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (-104.78897 40.22089),13
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (-105.50223 38.68694),13
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (-104.84889 40.35215),13
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (-104.74466 40.18571),11
...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (-104.98394 40.00297),13
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (-104.84766 39.96975),13
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (-104.95039 40.24432),13
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (-104.94502 40.24443),13


This works, but it is a brute-force approach that is rarely the best choice in pandas. A better way is to use the apply method, as follows:

In [4]:
ts = time.time()
raptor['stat_len']=raptor['recentstat'].apply(len)
te = time.time()
method2 = te-ts
print("Total Time: {:10.5f}".format(method2))
print(method1/method2)
raptor.tail()

Total Time:    0.00237
54.938398150939605


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len
874,911.0,40.00695,-104.89437,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (-104.98394 40.00297),13
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (-104.84766 39.96975),13
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (-104.95039 40.24432),13
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (-104.94502 40.24443),13
878,,40.243865,-104.93717,,RTHA,FLEDGED NEST,1002,POINT (-104.93717 40.24387),12


Wow, notice that in this run apply is roughly 55x faster. It may not make a big difference in this case, with a simple function on a small dataset, but with a more complex function on a larger dataset it could make a significant difference. Think 1 minute vs. 55 minutes, or 1 hour vs. more than 2 days.
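A single time.time() measurement is noisy, so here is a minimal sketch of how the comparison could be repeated with timeit. The status Series is a made-up stand-in for the recentstat column, and loop_len is a hypothetical helper that mirrors the row-by-row loop. The sketch also includes the built-in str.len accessor, which is another way to get string lengths without writing a function at all.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the recentstat column, repeated so timings are visible.
status = pd.Series(["INACTIVE NEST", "ACTIVE NEST", "FLEDGED NEST"] * 1000)


def loop_len(s):
    """Row-by-row version, mirroring the explicit loop with .at."""
    out = pd.Series(0, index=s.index)
    for idx in s.index:
        out.at[idx] = len(s.at[idx])
    return out


looped = loop_len(status)
applied = status.apply(len)      # call len once per value
vectorized = status.str.len()    # pandas' built-in string length accessor

# All three approaches produce the same values.
assert looped.tolist() == applied.tolist() == vectorized.tolist()

# timeit runs each call several times, which smooths out some of the noise.
for name, fn in [
    ("loop", lambda: loop_len(status)),
    ("apply", lambda: status.apply(len)),
    ("str.len", lambda: status.str.len()),
]:
    print(f"{name:8s} {timeit.timeit(fn, number=5):.5f} s")
```

The exact numbers will differ from machine to machine, so treat the ratios as a rough guide rather than a fixed result.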

Another option would be to apply a lambda function. Lambdas in Python are small unnamed functions that take an input and return the value of a single expression.

In [5]:
ts = time.time()
raptor['stat_len']=raptor['recentstat'].apply(lambda x: int(len(x)/2))
te = time.time()
method3 = te-ts
print("Total Time: {:10.5f}".format(method3))
print(method1/method3)
raptor.tail()

Total Time:    0.00181
72.00895679662803


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len
874,911.0,40.00695,-104.89437,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (-104.98394 40.00297),6
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (-104.84766 39.96975),6
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (-104.95039 40.24432),6
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (-104.94502 40.24443),6
878,,40.243865,-104.93717,,RTHA,FLEDGED NEST,1002,POINT (-104.93717 40.24387),6


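This also works and is a lot faster than the brute-force method. In this run it was about as fast as applying the len function directly; differences this small are within run-to-run noise. Note that this lambda does slightly more than len: it halves the length and truncates the result to an integer, which is why stat_len now shows 6.

Still another method would be to write our own custom function. This is overkill in this case, as the len function already exists, but as we'll see in a bit, in other cases it may be our only option. The advantage of this method in more complex cases is that we can include as many lines of code as we need.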

In [6]:
def return_len(s):
    return len(s)

ts = time.time()
raptor['stat_len']=raptor['recentstat'].apply(return_len)
te = time.time()
method3 = te-ts
print("Total Time: {:10.5f}".format(method3))
print(method1/method3)

Total Time:    0.00202
64.62080378250592
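As an illustration of why a multi-line custom function can be worth having, here is a small sketch that handles a case plain len cannot: len raises a TypeError on a missing value. The data and the clean_len function are hypothetical and are not part of the raptor dataset.

```python
import pandas as pd

# Hypothetical values; None stands in for a missing status.
status = pd.Series(["INACTIVE NEST", None, "  ACTIVE NEST "])


def clean_len(value):
    """Length of a status string after trimming whitespace; 0 when missing."""
    if pd.isna(value):
        return 0
    return len(value.strip())


lengths = status.apply(clean_len)
print(lengths.tolist())  # [13, 0, 11]
```

Returning 0 for missing values is only one possible choice; returning None or NaN instead would keep the gap visible in the new column.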


Now let's think about the reason I interrupted this series on spatial functions to discuss the apply method in the first place. We want to create buffers for the raptor points, but the buffers will vary according to the species. For Swainsons Hawks the buffer distance will be 333 meters, and for everything else the buffer distance will be 667 meters.

First we need to convert the raptors GeoDataFrame to UTM coordinates so we can work with meters as the units.

In [7]:
raptor.to_crs(epsg=26913, inplace=True)
raptor.head()

Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11


Next we want to create a new column called buf_dist that will hold the distance we will buffer the point by. Since there isn't a pre-existing function to do this, we will have to write our own. We could use a lambda function with a conditional expression (Python's ternary operator), which lets us write a simple if-else in a single line.

In [8]:
ts = time.time()
raptor['buf_dist'] = raptor['recentspec'].apply(lambda s: 333 if s=='Swainsons Hawk' else 667)
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13,333
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13,333
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13,333
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13,333
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11,333
...,...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,667
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,667


That works, but I think it is a bit confusing, and it gets hard to read with more than a single condition. Instead, we can write our own custom function and apply it, and we can use it to evaluate as many conditions as we want.

In [9]:
def calc_raptor_buffer(spec):
    if spec=='Swainsons Hawk':
        return 333
    elif spec=='Northern Harrier':
        return 500
    else:
        return 667
    
ts = time.time()
raptor['buf_dist'] = raptor['recentspec'].apply(calc_raptor_buffer)
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13,333
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13,333
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13,333
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13,333
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11,333
...,...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,667


This also works. It is more flexible, it's easier to understand, and for a lookup this simple it should run about as fast, so it's a win, win, win.

There is another method that can be used in these situations where we are "coding" a new column based on values in another column. The map method takes a dictionary as a parameter and, for each value in the Series, returns the dictionary value whose key matches.

First let's create the dictionary.

In [10]:
species_buffer = {"Swainsons Hawk":333, "Red-tail Hawk":667, "Northern Harrier":500}

Now let's use it to map a buffer distance value to the proper species.

In [11]:
ts = time.time()
raptor['buf_dist2']=raptor['recentspec'].map(species_buffer)
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist,buf_dist2
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13,333,333.0
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13,333,333.0
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13,333,333.0
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13,333,333.0
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11,333,333.0
...,...,...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667,667.0
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667,667.0
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500,500.0
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,667,


Notice, however, that the SWHA and RTHA rows are NaN in buf_dist2 because they didn't match any key in the dictionary. We can set a default value using a defaultdict. This is a dict subclass whose first constructor argument is a function (the default factory) that is called to supply a value whenever a key is missing; here we pass a lambda.

To create a defaultdict we have to import the defaultdict class from the collections module. Our lambda function takes no parameters because it always returns the same value, in this case 1000.

In [12]:
from collections import defaultdict

species_buffer = defaultdict(lambda: 1000, {'Swainsons Hawk': 333, 'Red-tail Hawk': 667, 'Northern Harrier': 500})
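To make the defaultdict behavior concrete, here is a small standalone sketch. The buffers dictionary and the species Series are illustrative stand-ins, not the tutorial's data. One detail worth knowing: looking up a missing key with square brackets stores the default in the dictionary, while .get() bypasses the factory.

```python
from collections import defaultdict

import pandas as pd

# Same shape as species_buffer; 1000 is the fallback distance.
buffers = defaultdict(lambda: 1000, {"Swainsons Hawk": 333, "Red-tail Hawk": 667})

print("RTHA" in buffers)    # False: the key has never been looked up
print(buffers["RTHA"])      # 1000: the factory supplies the default
print("RTHA" in buffers)    # True: the lookup stored the default as a side effect
print(buffers.get("XXXX"))  # None: .get() does not call the factory

# Series.map uses the default for values with no matching key instead of NaN.
species = pd.Series(["Swainsons Hawk", "SWHA", "Red-tail Hawk"])
mapped = species.map(buffers)
print(mapped.tolist())      # [333, 1000, 667]
```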

In [13]:
ts = time.time()
raptor['buf_dist']=raptor['recentspec'].map(species_buffer)
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist,buf_dist2
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13,333,333.0
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13,333,333.0
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13,333,333.0
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13,333,333.0
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11,333,333.0
...,...,...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667,667.0
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667,667.0
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500,500.0
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,1000,


Now notice that the SWHA and RTHA have a buffer distance of 1000.
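If a defaultdict feels like more machinery than the job needs, a common alternative is to map with a plain dictionary and then fill the unmatched rows. The sketch below uses a small made-up Series rather than the raptor data, and it assumes 1000 is still the fallback we want.

```python
import pandas as pd

species = pd.Series(["Swainsons Hawk", "SWHA", "Northern Harrier"])
species_buffer = {"Swainsons Hawk": 333, "Red-tail Hawk": 667, "Northern Harrier": 500}

# map() leaves NaN where no key matches, which turns the column into floats;
# fillna() supplies the fallback and astype(int) restores whole numbers.
buf = species.map(species_buffer).fillna(1000).astype(int)
print(buf.tolist())  # [333, 1000, 500]
```

This keeps the dictionary itself unchanged, and the intermediate NaN values make it easy to inspect which rows failed to match before filling them.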

Really though, we don't want SWHA and RTHA in the recentspec column. They were just entered by mistake by lazy people who didn't want to type out the full name, and they should be fixed. First let's make sure that these are the only unwanted values using the unique method.

In [14]:
raptor['recentspec'].unique()

array(['Swainsons Hawk', 'Red-tail Hawk', 'Northern Harrier', 'SWHA',
       'RTHA'], dtype=object)

It looks like that is the case. We can use the str.replace method to change them to their proper values.

In [15]:
ts = time.time()
raptor['recentspec'].str.replace('SWHA', 'Swainsons Hawk')
raptor['recentspec'].str.replace('RTHA', 'Red-tail Hawk')
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor.tail()


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist,buf_dist2
874,911.0,40.00695,-104.89437,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667,667.0
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667,667.0
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500,500.0
877,,,,2020-05-05,SWHA,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,1000,
878,,40.243865,-104.93717,,RTHA,FLEDGED NEST,1002,POINT (505344.097 4454825.953),12,1000,


Great, except that our values haven't changed. Remember that most pandas methods return a new object and leave the original untouched unless you set the inplace parameter to True. The str.replace method, however, has no inplace parameter.

But we can still make it permanent by assigning the result of the replace method to the original column.

In [16]:
ts = time.time()
raptor['recentspec'] = raptor['recentspec'].str.replace('SWHA', 'Swainsons Hawk')
raptor['recentspec'] = raptor['recentspec'].str.replace('RTHA', 'Red-tail Hawk')
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor.tail()


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist,buf_dist2
874,911.0,40.00695,-104.89437,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667,667.0
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667,667.0
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500,500.0
877,,,,2020-05-05,Swainsons Hawk,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,1000,
878,,40.243865,-104.93717,,Red-tail Hawk,FLEDGED NEST,1002,POINT (505344.097 4454825.953),12,1000,
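Since str.replace works on substrings, it would also rewrite "SWHA" wherever it appeared inside a longer value. When the goal is to swap whole cell values, as it is here, Series.replace with a dictionary is another option. The following is a sketch on a made-up Series; the abbreviations dictionary is a name introduced only for this example.

```python
import pandas as pd

species = pd.Series(["Red-tail Hawk", "SWHA", "RTHA", "Northern Harrier"])

# Hypothetical lookup from abbreviation to full species name.
abbreviations = {"SWHA": "Swainsons Hawk", "RTHA": "Red-tail Hawk"}

# Series.replace swaps cell values that are equal to a dictionary key,
# handling both abbreviations in a single call.
fixed = species.replace(abbreviations)
print(fixed.tolist())
# ['Red-tail Hawk', 'Swainsons Hawk', 'Red-tail Hawk', 'Northern Harrier']

# str.replace matches substrings; regex=False makes the match a literal string.
also_fixed = species.str.replace("SWHA", "Swainsons Hawk", regex=False)
print(also_fixed.tolist())
# ['Red-tail Hawk', 'Swainsons Hawk', 'RTHA', 'Northern Harrier']
```

As with str.replace, the result has to be assigned back to the column for the change to stick.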


In [17]:
ts = time.time()
raptor['buf_dist']=raptor['recentspec'].map(species_buffer)
te = time.time()
print("Total Time: {:10.5f}".format(te - ts))  # elapsed time for this cell
raptor


Unnamed: 0,postgis_fi,lat_y_dd,long_x_dd,lastsurvey,recentspec,recentstat,Nest_ID,geometry,stat_len,buf_dist,buf_dist2
0,361.0,40.267502,-104.870872,2012-03-16,Swainsons Hawk,INACTIVE NEST,361,POINT (517341.522 4460953.719),13,333,333.0
1,362.0,40.264321,-104.860255,2012-03-16,Swainsons Hawk,INACTIVE NEST,362,POINT (517955.324 4452295.260),13,333,333.0
2,1.0,38.650081,-105.494251,2014-07-28,Swainsons Hawk,INACTIVE NEST,1,POINT (456319.858 4282156.305),13,333,333.0
3,2.0,40.309574,-104.932604,2011-01-06,Swainsons Hawk,INACTIVE NEST,2,POINT (512832.261 4466854.171),13,333,333.0
4,3.0,40.219343,-104.729246,2014-07-03,Swainsons Hawk,ACTIVE NEST,3,POINT (521736.624 4448400.393),11,333,333.0
...,...,...,...,...,...,...,...,...,...,...,...
874,911.0,40.006950,-104.894370,2015-08-18,Red-tail Hawk,INACTIVE NEST,911,POINT (501370.881 4428086.730),13,667,667.0
875,912.0,39.998876,-104.900128,2015-09-01,Red-tail Hawk,INACTIVE NEST,912,POINT (513009.440 4424410.603),13,667,667.0
876,,,,2020-05-08,Northern Harrier,INACTIVE NEST,9991,POINT (504219.463 4454875.391),13,500,500.0
877,,,,2020-05-05,Swainsons Hawk,INACTIVE NEST,1001,POINT (504676.732 4454887.748),13,333,


Although the apply method is much faster than an explicit loop, it still calls a Python function once per value, so further performance gains are possible. If you are working with large datasets and need improved performance, you can look into vectorizing pandas operations. This blog post on [optimizing pandas code for speed](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) provides a very good place to start.
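As a taste of what vectorizing can look like, here is a minimal sketch of the buffer-distance logic written with numpy.select, which evaluates each condition on the whole column at once instead of calling a function per row. The species Series is a small made-up example, and the distances are the ones used in calc_raptor_buffer.

```python
import numpy as np
import pandas as pd

species = pd.Series(["Swainsons Hawk", "Northern Harrier", "Red-tail Hawk", "SWHA"])

# One boolean array per condition, checked in order; the first match wins.
conditions = [
    (species == "Swainsons Hawk").to_numpy(),
    (species == "Northern Harrier").to_numpy(),
]
choices = [333, 500]

# default plays the role of the final else branch.
buf_dist = pd.Series(np.select(conditions, choices, default=667), index=species.index)
print(buf_dist.tolist())  # [333, 500, 667, 667]
```

Whether this is noticeably faster than apply or map depends on the size of the data, so it is worth timing on your own dataset before restructuring working code.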