In [1]:
import pandas as pd

parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")

parks_df.head()

Unnamed: 0,relevanceScore,designation,weatherInfo,addresses,operatingHours,entrancePasses,name,description,directionsUrl,fees,...,activities,url,longitude,id,images,directionsInfo,fullName,parkCode,latLong,latitude
0,1,National Memorial,http://forecast.weather.gov/MapClick.php?CityN...,"[{'type': 'Physical', 'line2': '', 'line1': '1...","[{'name': 'Hours of Operation', 'standardHours...",[],Federal Hall,"Here on Wall Street, George Washington took th...",http://www.nps.gov/feha/planyourvisit/directio...,[],...,"[{'name': 'Arts and Culture', 'id': '09DF0950-...",https://www.nps.gov/feha/index.htm,-74.010256,2337D255-2D32-4997-957A-D461EEA03AF8,[{'url': 'https://www.nps.gov/common/uploads/s...,The main entrance of Federal Hall is located a...,Federal Hall National Memorial,feha,"lat:40.70731192, long:-74.01025636",40.707312
1,1,National Historic Trail,"In winter, watch for ice on trails and sidewal...","[{'type': 'Physical', 'line2': '', 'line1': '6...","[{'name': 'Visitor Center Hours', 'standardHou...",[],Lewis & Clark,The Lewis and Clark National Historic Trail wi...,https://www.nps.gov/lecl/,[],...,"[{'name': 'Auto and ATV', 'id': '5F723BAD-7359...",https://www.nps.gov/lecl/index.htm,-95.924515,5D443C5F-19A0-4A06-9CE4-30534A3DD81A,[{'url': 'https://www.nps.gov/common/uploads/s...,Lewis & Clark National Historic Trail Headquar...,Lewis & Clark National Historic Trail,lecl,"lat:41.2646141052, long:-95.9245147705",41.264614
2,1,,"Summers are generally hot and humid, with dayt...","[{'type': 'Physical', 'line2': '', 'line1': '1...",[{'name': 'National Capital Parks-East Headqua...,[],National Capital Parks-East,Welcome to National Capital Parks-East. We inv...,http://www.nps.gov/nace/planyourvisit/directio...,[],...,"[{'name': 'Biking', 'id': '7CE6E935-F839-4FEC-...",https://www.nps.gov/nace/index.htm,-76.994,BA3C1A1D-AA6A-49EB-9237-0222CEEE2670,[{'url': 'https://www.nps.gov/common/uploads/s...,DC295 South to the Exit for I-694/I-395/Capito...,National Capital Parks-East,nace,"lat:38.8659, long:-76.994",38.8659
3,1,National Historical Park,"Be prepared for hot, humid weather. The histor...","[{'type': 'Physical', 'line2': '', 'line1': '1...","[{'name': 'Visitor Center', 'standardHours': {...",[{'description': 'Adams National Historical Pa...,Adams,From the sweet little farm at the foot of Penn...,http://www.nps.gov/adam/planyourvisit/directio...,[],...,"[{'name': 'Guided Tours', 'id': 'B33DC9B6-0B7D...",https://www.nps.gov/adam/index.htm,-71.011604,E4C7784E-66A0-4D44-87D0-3E072F5FEF43,[{'url': 'https://www.nps.gov/common/uploads/s...,"Traveling on U.S. Interstate 93, take exit 7 -...",Adams National Historical Park,adam,"lat:42.2553961, long:-71.01160356",42.255396
4,1,Memorial Parkway,Summers on the parkway are generally hot and h...,"[{'type': 'Physical', 'line2': '700 George Was...",[{'name': 'George Washington Memorial Parkway ...,[],George Washington,The George Washington Memorial Parkway was des...,http://www.nps.gov/gwmp/planyourvisit/directio...,[],...,"[{'name': 'Arts and Culture', 'id': '09DF0950-...",https://www.nps.gov/gwmp/index.htm,-77.1495,E6D5BB41-3251-469F-ABDA-7B43B966F0CF,[{'url': 'https://www.nps.gov/common/uploads/s...,Directions to Parkway Headquarters From the so...,George Washington Memorial Parkway,gwmp,"lat:38.9628, long:-77.1495",38.9628


The `apply()` method in pandas is used to apply a function along an axis of a DataFrame or Series. It allows you to perform custom operations on the data within each row or column, depending on how you specify the axis parameter. Here's how it works:

1. **DataFrame.apply()**:
   - When applied to a DataFrame, `df.apply(func, axis=0)` passes each column to `func` (this is the default), while `df.apply(func, axis=1)` passes each row.
   - The `func` parameter can be a built-in function, a lambda function, or a custom function defined by the user.
   - For example, a custom function that adds two columns together needs to see one row at a time, so you would call `df.apply(custom_function, axis=1)`; a function that summarizes a whole column would use `axis=0`.

2. **Series.apply()**:
   - When applied to a Series, `series.apply(func)` applies the function `func` to each element in the Series.
   - As with `DataFrame.apply()`, the `func` parameter can be a built-in function, a lambda function, or a custom function.
   - For instance, you can use `series.apply(lambda x: x * 2)` to multiply each element in the Series by 2.

Conceptually, `apply()` iterates over the rows or columns of a DataFrame, or the elements of a Series, calls the specified function on each one, and collects the results. It is a powerful tool for complex transformations and calculations, and for building the boolean masks used in filtering.
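
To make the three cases above concrete, here is a minimal sketch on a tiny made-up DataFrame. The names `toy_df`, `col_ranges`, `row_sums`, and `doubled` are illustrative only and are not part of the parks dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not from the parks dataset)
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = toy_df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = toy_df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives each element
doubled = toy_df["a"].apply(lambda x: x * 2)

print(col_ranges.to_dict())  # {'a': 2, 'b': 20}
print(row_sums.tolist())     # [11, 22, 33]
print(doubled.tolist())      # [2, 4, 6]
```

Note that the column-wise call returns one value per column, while the row-wise call returns one value per row.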

For example, each entry in `parks_df['addresses']` is a list of JSON-like records (dictionaries). What if we only wanted the city and state from the first listed address? We could do it with `apply`:

In [2]:
parks_df["city_state"] = parks_df["addresses"].apply(
    lambda x: f"{x[0]['city']}, {x[0]['stateCode']}"
)

parks_df[["name", "city_state"]]

Unnamed: 0,name,city_state
0,Federal Hall,"New York, NY"
1,Lewis & Clark,"Omaha, NE"
2,National Capital Parks-East,"Washington, DC"
3,Adams,"Quincy, MA"
4,George Washington,"McLean, VA"
...,...,...
466,Navajo,"Shonto, AZ"
467,Cabrillo,"San Diego, CA"
468,Golden Spike,"Promontory, Utah, UT"
469,Fort Union Trading Post,"Williston, ND"



Note the syntax, which can be tricky: `lambda x: f"{x[0]['city']}, {x[0]['stateCode']}"`. You can think of `lambda x` as saying "for each x," so this command is saying _for each x in the column, build an f-string from the city and state code of the element at index 0_.
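
If the lambda is hard to read, the same logic can be written as a named function. The sketch below is one possible version, not the notebook's own code: `first_city_state` and `toy_addresses` are hypothetical names, the data is made up, and the guard for an empty address list is an added assumption (the lambda would raise an `IndexError` on an empty list).

```python
import pandas as pd


def first_city_state(addresses):
    """Return 'City, ST' from the first address, or None if there is none."""
    # len() works for both Python lists and NumPy arrays
    if addresses is None or len(addresses) == 0:
        return None
    first = addresses[0]
    return f"{first['city']}, {first['stateCode']}"


# Toy stand-in for parks_df["addresses"] (invented values)
toy_addresses = pd.Series(
    [
        [{"city": "New York", "stateCode": "NY"}],
        [{"city": "Omaha", "stateCode": "NE"}, {"city": "Lincoln", "stateCode": "NE"}],
        [],  # a record with no addresses at all
    ]
)

city_state = toy_addresses.apply(first_city_state)
print(city_state.tolist())
```

Passing the function by name (`apply(first_city_state)`, with no parentheses after the name) is equivalent to wrapping it in a lambda.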

And we can easily count the cities with the most parks:

In [3]:
parks_df["city_state"].value_counts()

city_state
Washington, DC        28
New York, NY          12
Santa Fe, NM          10
Philadelphia, PA       5
San Francisco, CA      4
                      ..
Harkers Island, NC     1
Buxton, NC             1
St. Joe, AR            1
Bryce, UT              1
Nicodemus, KS          1
Name: count, Length: 370, dtype: int64
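
The counts above are per city/state pair. If a per-state tally is what you are after, one option is to extract only the `stateCode` and count that instead. This is a sketch on made-up rows; `toy_parks`, the `state` column, and `state_counts` are hypothetical names, and it counts only the state of each park's first listed address.

```python
import pandas as pd

# Toy rows shaped like the parks data (invented for illustration)
toy_parks = pd.DataFrame(
    {
        "name": ["Federal Hall", "Adams", "Castle Clinton"],
        "addresses": [
            [{"city": "New York", "stateCode": "NY"}],
            [{"city": "Quincy", "stateCode": "MA"}],
            [{"city": "New York", "stateCode": "NY"}],
        ],
    }
)

# Keep only the state code of the first listed address
toy_parks["state"] = toy_parks["addresses"].apply(lambda x: x[0]["stateCode"])

state_counts = toy_parks["state"].value_counts()
print(state_counts.to_dict())  # {'NY': 2, 'MA': 1}
```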

In the above example, we only used `apply` on one column, but it can also be used on entire rows. For example:

In [4]:
parks_df.apply(
    lambda row: f"{row['fullName']}, {row['addresses'][0]['city']}, {row['addresses'][0]['stateCode']}",
    axis=1,
)

0           Federal Hall National Memorial, New York, NY
1       Lewis & Clark National Historic Trail, Omaha, NE
2            National Capital Parks-East, Washington, DC
3             Adams National Historical Park, Quincy, MA
4         George Washington Memorial Parkway, McLean, VA
                             ...                        
466                 Navajo National Monument, Shonto, AZ
467            Cabrillo National Monument, San Diego, CA
468    Golden Spike National Historical Park, Promont...
469    Fort Union Trading Post National Historic Site...
470      Nicodemus National Historic Site, Nicodemus, KS
Length: 471, dtype: object

But note the differences: the function now receives a whole row, so we have to pick out each _column_ by name inside it, _and_ we have to specify `axis=1`. Of course, the argument doesn't have to be called `x` or `row`; any valid name works:

In [5]:
# why not? 🐦

parks_df.apply(
    lambda bird: f"{bird['fullName']}, {bird['addresses'][0]['city']}, {bird['addresses'][0]['stateCode']}",
    axis=1,
)

0           Federal Hall National Memorial, New York, NY
1       Lewis & Clark National Historic Trail, Omaha, NE
2            National Capital Parks-East, Washington, DC
3             Adams National Historical Park, Quincy, MA
4         George Washington Memorial Parkway, McLean, VA
                             ...                        
466                 Navajo National Monument, Shonto, AZ
467            Cabrillo National Monument, San Diego, CA
468    Golden Spike National Historical Park, Promont...
469    Fort Union Trading Post National Historic Site...
470      Nicodemus National Historic Site, Nicodemus, KS
Length: 471, dtype: object

But your coworkers and collaborators will like you better if you use descriptive names. The power of `apply` comes from being able to apply arbitrary functions; you can even define your own.

In [6]:
def cubed_len(input_string):
    return len(input_string) ** 3


parks_df["city_state"].apply(cubed_len)

0      1728
1       729
2      2744
3      1000
4      1000
       ... 
466    1000
467    2197
468    8000
469    2197
470    2197
Name: city_state, Length: 471, dtype: int64

Lambdas work the same way. Here we split each park's comma-separated `states` string into a list:

In [7]:
parks_df["states_list"] = parks_df["states"].apply(lambda x: x.split(","))

parks_df[["name", "states_list"]]

Unnamed: 0,name,states_list
0,Federal Hall,[NY]
1,Lewis & Clark,"[IA, ID, IL, IN, KS, KY, MO, MT, NE, ND, OH, O..."
2,National Capital Parks-East,[DC]
3,Adams,[MA]
4,George Washington,"[DC, MD, VA]"
...,...,...
466,Navajo,[AZ]
467,Cabrillo,[CA]
468,Golden Spike,[UT]
469,Fort Union Trading Post,"[MT, ND]"


The thing to be most aware of is that `apply` calls a Python function once _per row_ (or once per element, for a Series). On massive datasets, that can make it a _very_ expensive operation. For those workloads, built-in vectorized pandas methods, column-oriented SQL, or frameworks like Polars are likely a better fit, but you'll only need to worry about that in the hundred-thousand to million row range.
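
As a rough illustration of the alternative, some `apply` calls can be replaced by pandas' built-in `.str` methods, which avoid calling your own Python function for every element and are often (though not always) faster. The sketch below uses a made-up two-row frame named `toy`; it only checks that both approaches give the same answers and makes no timing claims.

```python
import pandas as pd

# Toy data with the same column names as the examples above (invented values)
toy = pd.DataFrame(
    {
        "city_state": ["New York, NY", "Omaha, NE"],
        "states": ["NY", "IA,ID,IL"],
    }
)


def cubed_len(input_string):
    return len(input_string) ** 3


# apply-based versions
via_apply = toy["city_state"].apply(cubed_len)
split_apply = toy["states"].apply(lambda x: x.split(","))

# Equivalent versions using the built-in string accessor
via_str = toy["city_state"].str.len() ** 3
split_str = toy["states"].str.split(",")

print(via_apply.tolist())  # [1728, 729]
print(via_str.tolist())    # [1728, 729]
print(split_str.tolist())  # [['NY'], ['IA', 'ID', 'IL']]
```

When no built-in method matches what you need, as with the nested `addresses` records, `apply` remains a reasonable choice.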

So we come to another pattern: using `apply` to perform row-wise transformations in pandas. This is useful for complex transformations, but can be slow.