![Many pancakes vs single pancake](images/pancakes.png)

#### How might our camping application scale?
* More states
* Additional data sources
* ?

In [85]:
from csv import DictReader
import geopandas as gpd
import json
import pandas as pd
import itertools

from camping.mocks.request import RequestsMock
from camping.util.scraper import Scraper
from camping.util.distance import distance_merge

def max_col_width(w=100):
    pd.set_option('display.max_colwidth', w)

ridb_facilities_url = "https://ridb.recreation.gov/api/v1/facilities"

### Scaling our prototype code

What are some aspects that might not scale?   
*   
  
Opportunities to parallelize?  
*   

#### Exploration code for reference

In [None]:
params = {"activity_id":9, "state":"OR"}
headers = {"accept": "application/json", "apikey": "key"}

# Get RIDB facilities data
response = RequestsMock.get(ridb_facilities_url, params, headers=headers)
camping_json  = json.loads(response.text)
df_ridb_camping = pd.DataFrame(camping_json['RECDATA'])

# Get RIDB campground data for each facility
campground_info = pd.DataFrame()
for facility in camping_json['RECDATA']:
    if facility.get('FacilityID') is not None:
        campground_url = f"{ridb_facilities_url}/{facility['FacilityID']}/campsites"
        resp = RequestsMock.get(campground_url, headers=headers)
        if resp.status_code != 200:
            continue
        
        campsites = json.loads(resp.text)
        if len(campsites['RECDATA']) > 0:
            df_campsites = pd.DataFrame(campsites['RECDATA'])
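            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            campsite_cols = df_campsites[['FacilityID', 'CampsiteID', 'CampsiteName', 'ATTRIBUTES']]
            campground_info = pd.concat([campground_info, campsite_cols.merge(df_ridb_camping, on='FacilityID', how='left')])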

# Get NF website data
nf_data = []
with open('../data/NF_sites/OR_sitelist.csv') as f:
    reader = DictReader(f)
    for row in reader:
        sc = Scraper(row['site_url'], row['site_name'])
        nf_data.append(sc.scrape())
nf_df = pd.DataFrame(nf_data) 

# Merge RIDB and NF data
merged = distance_merge(nf_df, campground_info, 2000, 'ridb', 'nf')


### Pipelines
Data pipelines split data processing into discrete steps. To get a sense of how to break our code into steps, let's take a look at the data processing flow of our prototype:

A single pancake takes so long!

![Prototype Flow](images/prototype_flow.png)


Data processing steps in our prototype:
* Extracting the data from source
* Transforming campsite data
* Merging data into a single table
* Storing the table in a database (not pictured)

Consider...
* What happens if converting attributes fails for one campsite?
* How long will it take to run if a facility has a large number of campsites?
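
One way to keep a single bad campsite from stopping the whole run is to handle errors per record. The following is a minimal sketch, assuming records shaped like the RIDB campsite dictionaries used in this notebook. The names `convert_attributes` and `process_all` and the sample records are hypothetical and not part of the camping package.

```python
def convert_attributes(campsite):
    """Flatten the ATTRIBUTES list into a name -> value dictionary."""
    return {
        item["AttributeName"]: item["AttributeValue"]
        for item in campsite["ATTRIBUTES"]
    }


def process_all(campsites):
    """Convert every campsite, setting failures aside instead of raising."""
    processed, failed = [], []
    for campsite in campsites:
        try:
            processed.append(convert_attributes(campsite))
        except (KeyError, TypeError) as err:
            # Missing or malformed ATTRIBUTES: record it and keep going.
            failed.append((campsite.get("CampsiteID"), repr(err)))
    return processed, failed


# Made-up sample records for illustration only.
sample = [
    {"CampsiteID": "1", "ATTRIBUTES": [{"AttributeName": "Shade", "AttributeValue": "Yes"}]},
    {"CampsiteID": "2"},  # no ATTRIBUTES key
    {"CampsiteID": "3", "ATTRIBUTES": None},  # malformed value
]
processed, failed = process_all(sample)
print(len(processed), "processed;", len(failed), "failed")
```

The failed list can be logged or retried later, while the good records continue through the pipeline.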


Pipelines enable scaling up of data processing
* Reduced latency through parallel processing
* Ability to take advantage of "near limitless" cloud compute
* Generalized pipeline steps can be reused 
* Resiliency - data can be saved to disk or cloud storage after a stage to allow retry 
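
The resiliency point can be sketched as a small checkpointing helper: each stage writes its output to disk, and a rerun picks up the saved result instead of repeating the work. This is a sketch under stated assumptions, not how any particular pipeline framework does it. `run_stage` and `uppercase_names` are hypothetical names, and the stage output is assumed to be JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, inputs, checkpoint_dir):
    """Run a pipeline stage, reusing a saved result if one exists."""
    checkpoint = Path(checkpoint_dir) / f"{name}.json"
    if checkpoint.exists():
        # A previous run already finished this stage, so skip the work.
        return json.loads(checkpoint.read_text())
    result = func(inputs)
    checkpoint.write_text(json.dumps(result))
    return result


# Hypothetical stage: stands in for a slow extract or transform step.
calls = []

def uppercase_names(names):
    calls.append(names)
    return [n.upper() for n in names]


with tempfile.TemporaryDirectory() as tmp:
    first = run_stage("uppercase", uppercase_names, ["lemolo", "magone"], tmp)
    # The second call finds the checkpoint and does not call the stage again.
    second = run_stage("uppercase", uppercase_names, ["lemolo", "magone"], tmp)
```

In a cloud setting the checkpoint location would more likely be object storage than a local directory, but the idea is the same.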


What can we parallelize?
* Process campsite data in batches
* Data source extraction - NF can run independently of RIDB

Configuration to scale up without code modifications
* State
* NF URLs
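
As a rough illustration of configuration-driven scaling, the sketch below expands a small configuration into a list of jobs, so adding a state means editing configuration rather than code. The configuration layout, the `plan_jobs` helper, and the WA site list path are assumptions made for this example. Only the OR path appears in this notebook.

```python
import json

# Hypothetical configuration; in practice this could live in its own file.
CONFIG_JSON = """
{
  "states": ["OR", "WA"],
  "nf_site_lists": {
    "OR": "../data/NF_sites/OR_sitelist.csv",
    "WA": "../data/NF_sites/WA_sitelist.csv"
  }
}
"""


def plan_jobs(config):
    """Expand the configuration into one (source, state, site_list) job per input."""
    jobs = []
    for state in config["states"]:
        # RIDB is queried by state, so it needs no site list.
        jobs.append(("ridb", state, None))
        site_list = config["nf_site_lists"].get(state)
        if site_list is not None:
            jobs.append(("nf", state, site_list))
    return jobs


config = json.loads(CONFIG_JSON)
jobs = plan_jobs(config)
for job in jobs:
    print(job)
```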


#### Let's start by breaking our data processing code into discrete steps

In [86]:
# Start with getting facilities
def get_facilities(state):
    params = {'state':state}
    response = RequestsMock.get(ridb_facilities_url, params, headers=headers)
    if response.status_code == 200:
        print(f"Getting facilities for {state}")
        result = json.loads(response.text)
        return result['RECDATA']
    print(f"Unable to get result for state {state}, got {response.reason}")
    return {}

In [87]:
def get_campsites(facility_id):
    # Braces are needed so the URL variable is interpolated rather than used as literal text
    campsite_details_url = f"{ridb_facilities_url}/{facility_id}/campsites"
    response = RequestsMock.get(campsite_details_url, headers=headers)
    if response.status_code == 200:
        campsites = json.loads(response.text)
        if len(campsites['RECDATA']) > 0:
            return campsites['RECDATA']
        else:
            return {}
    print(f"Unable to get result for facility_id {facility_id}, got {response.status_code} {response.reason}")
    return {}

In [88]:
def process_campsites(campsite_data):
    # Another benefit of breaking data processing code into steps is encapsulation -
    # if we need to do additional processing on campsite data at a later time
    # we only need to modify this method. Also helpful for unit testing.
    campsite_data['AttributeDict'] = {item['AttributeName']: item['AttributeValue'] for item in campsite_data['ATTRIBUTES']}
    return {key: campsite_data.get(key) for key in ['FacilityID', 'CampsiteID', 'CampsiteName', 'AttributeDict']}

#### Batch processing pipeline

Let's say we got facility IDs 9000 - 9999. Here is an example of how we could run parallel batches.  

Big data engines like Spark can do this for you if you take advantage of distributed data structures.  
Keep in mind the impact of parallel API calls!

Many pancakes...

![Batch processing](images/campsite_batch.png)

In [79]:
def process_campsite_batch(facility_ids):
    campsite_data = []
    for facility_id in facility_ids:  # avoid shadowing the built-in id()
        # In a data pipeline get_campsites and process_campsites would be different stages.
        # To illustrate running the batch we call both in this method
        raw_data = get_campsites(facility_id)
        if not raw_data:
            # Nothing returned for this facility (empty result or failed request)
            continue
        for site in raw_data:
            campsite_data.append(process_campsites(site))
    return campsite_data

In [78]:
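states = ['OR']
facilities_json = get_facilities(states[0])
facility_ids = [f['FacilityID'] for f in facilities_json]
num_facilities = len(facility_ids)
# Aim for roughly 10 batches; max() keeps the step at 1 or more so range() never gets a zero step
step = max(1, num_facilities // 10)
campsite_data = []
for index in range(0, num_facilities, step):
    # Keep the results of each batch instead of discarding them
    campsite_data.extend(process_campsite_batch(facility_ids[index:index+step]))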

Getting facilities for OR
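
The loop above runs its batches one after another. A minimal sketch of running batches concurrently with the standard library's `ThreadPoolExecutor` follows. Threads are a reasonable fit here because the work is mostly waiting on API responses. `chunk` and `fake_process_batch` are hypothetical stand-ins so the example runs without the camping package, and the small `max_workers` value is one simple way to limit the number of simultaneous API calls.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fake_process_batch(facility_ids):
    # Stand-in for process_campsite_batch; returns one record per facility.
    return [{"FacilityID": fid} for fid in facility_ids]


facility_ids = list(range(9000, 9025))
batches = chunk(facility_ids, 10)

# A small max_workers keeps the number of concurrent API calls low.
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the input batches.
    batch_results = list(pool.map(fake_process_batch, batches))

# Flatten the per-batch lists into a single list of campsite records.
campsites = [record for batch in batch_results for record in batch]
print(f"{len(batches)} batches, {len(campsites)} records")
```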


#### How about forest service data?
Is there anything you would do differently here?

In [80]:
def get_nf_data(source_file):
    nf_data = []
    with open(source_file) as f:
        reader = DictReader(f)
        for row in reader:
            # Can parallelize this as well
            sc = Scraper(row['site_url'], row['site_name'])
            nf_data.append(sc.scrape())
    return nf_data

![RIDB pipeline](images/full_pipeline.png)

In [None]:
# What stages might we want to be able to retry?
# Where might we want to save our progress?
# How about scheduling? Might not be time for this.
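
For the retry question, one common approach is to wrap a stage in a helper that retries on transient errors with exponential backoff. The sketch below is illustrative only. `with_retry` and `flaky_extract` are hypothetical names, and treating `ConnectionError` as the retryable error is an assumption.

```python
import time


def with_retry(func, attempts=3, base_delay=1.0, retry_on=(ConnectionError,)):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky stage: fails twice, then succeeds.
attempts_seen = []

def flaky_extract():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("temporary failure")
    return ["facility-1", "facility-2"]


# base_delay=0 keeps the demo instant; a real pipeline would wait between tries.
result = with_retry(flaky_extract, attempts=3, base_delay=0)
```

Extraction stages that call external services are the most likely candidates for this kind of wrapper, since their failures are often temporary.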

In [52]:
# Start with what we want as parameterized inputs and build from there
NF_sites = [
    ("East Lemolo Campground", "https://www.fs.usda.gov/recarea/umpqua/recarea/?recid=63492"),
    ("Magone Lake Campground", "https://www.fs.usda.gov/recarea/malheur/recarea/?recid=39964")
]
states = ['OR', 'WA', 'CA']

In [84]:
# Reuse - propane canister
def get_ridb_data(url, params, headers):
    response = RequestsMock.get(url, params, headers=headers)
    if response.status_code == 200:
        result = json.loads(response.text)
        return result['RECDATA']
    return {}
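
To show how a generalized step like `get_ridb_data` could be reused, the sketch below builds the facility and campsite lookups on top of a single shared fetch function. Passing the fetch function as an argument is a design choice made for this example, and `fake_fetch` is a hypothetical stand-in so the code runs without network access or the mock library.

```python
RIDB_FACILITIES_URL = "https://ridb.recreation.gov/api/v1/facilities"


def get_facilities(fetch, state, headers):
    """Facilities for one state, via the shared fetch function."""
    return fetch(RIDB_FACILITIES_URL, {"state": state}, headers)


def get_campsites(fetch, facility_id, headers):
    """Campsites for one facility, via the same shared fetch function."""
    url = f"{RIDB_FACILITIES_URL}/{facility_id}/campsites"
    return fetch(url, {}, headers)


# Stand-in for get_ridb_data; it records requests and returns a canned result.
requested = []

def fake_fetch(url, params, headers):
    requested.append((url, params))
    return [{"FacilityID": "1234"}]


headers = {"accept": "application/json", "apikey": "key"}
facilities = get_facilities(fake_fetch, "OR", headers)
campsites = get_campsites(fake_fetch, facilities[0]["FacilityID"], headers)
```

Because both lookups take the fetch function as a parameter, swapping in `get_ridb_data` or a test double requires no change to the lookups themselves.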