![Many pancakes vs single pancake](images/pancakes.png)

#### How might our camping application scale?
* More states
* Additional data sources
* ?

In [100]:
from csv import DictReader
import geopandas as gpd
import json
import numpy as np
import pandas as pd
import itertools

from camping.mocks.request import RequestsMock
from camping.util.scraper import Scraper
from camping.util.distance import distance_merge

def max_col_width(w=100):
    pd.set_option('display.max_colwidth', w)

ridb_facilities_url = "https://ridb.recreation.gov/api/v1/facilities"

### Scaling our prototype code

What are some aspects that might not scale?   
*   
  
Opportunities to parallelize?  
*   

#### Exploration code for reference

In [108]:
ridb_facilities_url = "https://ridb.recreation.gov/api/v1/facilities"
params = {"activity_id":9, "state":"OR"}
headers = {"accept": "application/json", "apikey": "key"}
campground_info = pd.DataFrame()

def transform_campsites(campsite_json):
    # First, translate the ATTRIBUTES list into a list of dict of {AttributeName: AttributeValue}
    for i in range(len(campsite_json)):
        campsite_json[i]['AttributeDict'] = [{item['AttributeName']: item['AttributeValue']} for item in campsite_json[i]['ATTRIBUTES']]

    # Next, convert the AttributeDict to columns
    df = pd.DataFrame(campsite_json)[['ATTRIBUTES', 'CampsiteID', 'CampsiteName', 'FacilityID']]
    df['AttributeDict'] = df['ATTRIBUTES'].apply(lambda x: {item['AttributeName']: item['AttributeValue'] for item in x})
    norm = pd.json_normalize(df['AttributeDict'])
    df = df[['CampsiteID','CampsiteName','FacilityID']].join(norm)
    return df.replace({np.nan: ""})

# Get RIDB Facilities with camping
response = RequestsMock.get(ridb_facilities_url, params, headers=headers)
camping_json  = json.loads(response.text)
df_ridb_camping = pd.DataFrame(camping_json['RECDATA'])

# Get campsite specific information
for facility in camping_json['RECDATA']:
    campground_url = f"{ridb_facilities_url}/{facility['FacilityID']}/campsites"
    resp = RequestsMock.get(campground_url, headers=headers)
    campsites = json.loads(resp.text)
    if len(campsites['RECDATA']) > 0:
        df_campsites = transform_campsites(campsites['RECDATA'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
        campground_info = pd.concat([campground_info, df_campsites.merge(df_ridb_camping, on='FacilityID', how='left')])

# Get NF website data
nf_data = []
with open('../data/NF_sites/OR_sitelist.csv') as f:
    reader = DictReader(f)
    for row in reader:
        sc = Scraper(row['site_url'], row['site_name'])
        nf_data.append(sc.scrape())
nf_df = pd.DataFrame(nf_data) 

# Merge RIDB and NF data
merged = distance_merge(nf_df, campground_info, 2000, 'ridb', 'nf')
merged = merged.replace(np.nan, '')



In [109]:
merged

Unnamed: 0,CampsiteID,CampsiteName,FacilityID,Location Rating,Grills/Fire Ring,Picnic Table,Capacity/Size Rating,Site Rating,Checkout Time,Checkin Time,...,FacilityStatus,FacilityLatitude_nf,FacilityLongitude_nf,FacilityElevation,Conditions,Reservations,FacilityName_nf,Water,Restroom,Open Season
0,98358,008,251894,Good,Y,Y,Single,Standard,1:00 PM,2:00 PM,...,,,,,,,,,,
1,98441,014,251894,Prime,Y,Y,Single,Prime,1:00 PM,2:00 PM,...,,,,,,,,,,
2,98438,004,251894,Prime,Y,Y,Single,Prime,1:00 PM,2:00 PM,...,,,,,,,,,,
3,98389,006,251894,Good,Y,Y,Single,Standard,1:00 PM,2:00 PM,...,,,,,,,,,,
4,98359,005,251894,Prime,Y,Y,Single,Prime,1:00 PM,2:00 PM,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45,96303,F001,251434,Prime,Y,Y,Single,Preferred,10:00 AM,3:00 PM,...,Closed,45.50080,-121.81641,3200,CLOSED FOR THE SEASON\n \n**Lost Lake is curre...,Reservations can be made by visiting Recreatio...,Lost Lake Campground Resort and Day Use Area,Drinking Water,Vault Toilet (18),
46,96053,B011,251434,Prime,,Y,Single,Prime,10:00 AM,3:00 PM,...,Closed,45.50080,-121.81641,3200,CLOSED FOR THE SEASON\n \n**Lost Lake is curre...,Reservations can be made by visiting Recreatio...,Lost Lake Campground Resort and Day Use Area,Drinking Water,Vault Toilet (18),
47,96013,B002,251434,Good,Y,Y,Single,Preferred,10:00 AM,3:00 PM,...,Closed,45.50080,-121.81641,3200,CLOSED FOR THE SEASON\n \n**Lost Lake is curre...,Reservations can be made by visiting Recreatio...,Lost Lake Campground Resort and Day Use Area,Drinking Water,Vault Toilet (18),
48,96009,D004,251434,Good,Y,Y,Single,Standard,10:00 AM,3:00 PM,...,Closed,45.50080,-121.81641,3200,CLOSED FOR THE SEASON\n \n**Lost Lake is curre...,Reservations can be made by visiting Recreatio...,Lost Lake Campground Resort and Day Use Area,Drinking Water,Vault Toilet (18),



### Pipelines
Data pipelines split data processing into discrete steps. To get a sense of how to break our code into steps, let's take a look at the data processing flow of our prototype:

Like cooking one pancake at a time, the prototype runs every step in sequence, so it takes a long time!

![Prototype Flow](images/prototype_flow.png)


Data processing steps in our prototype:
* Extracting the data from source
* Transforming campsite data
* Merging data into a single table
* Storing the table in a database (not pictured)

Consider...
* What happens if there is an error converting attributes for one campsite?
* How long will it take to run if a facility has a large number of campsites?
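
One way to keep a single bad campsite from failing the whole run is to handle errors per record. The following is a minimal sketch, assuming each campsite is a dict with an `ATTRIBUTES` list as in the RIDB response. `convert_attributes` and `convert_all` are hypothetical helper names, not part of the prototype, and the sample data is made up:

```python
def convert_attributes(campsite):
    """Flatten one campsite's ATTRIBUTES list into a {name: value} dict."""
    return {a["AttributeName"]: a["AttributeValue"] for a in campsite["ATTRIBUTES"]}


def convert_all(campsites):
    """Convert every campsite, collecting failures instead of aborting."""
    converted, failed = [], []
    for campsite in campsites:
        try:
            converted.append(convert_attributes(campsite))
        except (KeyError, TypeError) as err:
            # Record the failing CampsiteID so it can be inspected or retried later
            failed.append((campsite.get("CampsiteID"), repr(err)))
    return converted, failed


# Hypothetical sample data: the second record is missing its ATTRIBUTES key
sample = [
    {"CampsiteID": "1", "ATTRIBUTES": [{"AttributeName": "Picnic Table", "AttributeValue": "Y"}]},
    {"CampsiteID": "2"},
]
converted, failed = convert_all(sample)
```

With this shape, the good records continue through the pipeline while the failed IDs can be logged or reprocessed separately.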


Pipelines enable scaling up of data processing
* Reduced latency through parallel processing
* Ability to take advantage of "near limitless" cloud compute
* Generalized pipeline steps can be reused 
* Resiliency - data can be saved to disk or cloud storage after a stage to allow retry 


What can we parallelize?
* Process campsite data in batches
* Data source extraction - NF web scraper can run independently of RIDB API calls
* Run multiple states in parallel
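
As a rough illustration of the last two points, the sketch below runs the two extractions for a state concurrently and then maps over states. It uses Python's standard `concurrent.futures` thread pools, which suit I/O-bound work such as API calls and scraping. `extract_ridb`, `extract_nf`, and `extract_state` are hypothetical stand-ins, not functions from the `camping` package:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_ridb(state):
    # Stand-in for the RIDB API calls (hypothetical)
    return f"ridb-{state}"


def extract_nf(state):
    # Stand-in for the NF web scraper (hypothetical)
    return f"nf-{state}"


def extract_state(state):
    """Run both extractions for one state concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ridb_future = pool.submit(extract_ridb, state)
        nf_future = pool.submit(extract_nf, state)
        return {"ridb": ridb_future.result(), "nf": nf_future.result()}


states = ["OR", "WA"]
# States are independent too, so they can be mapped over a second pool
with ThreadPoolExecutor(max_workers=len(states)) as pool:
    results = dict(zip(states, pool.map(extract_state, states)))
```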

Configuration to scale up without code modifications
* State
* NF URLs


### Scaling up - data structures and formats

**Ask yourself** 
* Are the data structures I'm using optimized for large scale?
* The data pipeline is a means to an end: what is acting on the data being produced? For example, APIs surface JSON to front-end frameworks, so do you really need to put all of this information in a table?


##### Getting facilities data
We're converting the facilities to a dataframe so we can explore it, but it's not necessary for the pipeline:
```python
# Get RIDB Facilities with camping
response = RequestsMock.get(ridb_facilities_url, params, headers=headers)
camping_json  = json.loads(response.text)
> df_ridb_camping = pd.DataFrame(camping_json['RECDATA'])
```
  
---------  
  
##### Campsite transformation code
We're doing some expensive operations to convert the ATTRIBUTES list of dicts into columns. This was really useful for data exploration, but it is unnecessarily expensive in a pipeline.  
  
Also, surfacing the attributes as table columns makes our system brittle. If new attributes are added and we had to convert them to columns, our pipeline output would require a database migration.

```python
def transform_campsites(campsite_json):
    # First, translate the ATTRIBUTES list into a list of dict of {AttributeName: AttributeValue}
    for i in range(len(campsite_json)):
        campsite_json[i]['AttributeDict'] = [{item['AttributeName']: item['AttributeValue']} for item in campsite_json[i]['ATTRIBUTES']]

    # Next, convert the AttributeDict to columns
>    df = pd.DataFrame(campsite_json)[['ATTRIBUTES', 'CampsiteID', 'CampsiteName', 'FacilityID']]
>    df['AttributeDict'] = df['ATTRIBUTES'].apply(lambda x: {item['AttributeName']: item['AttributeValue'] for item in x})
>    norm = pd.json_normalize(df['AttributeDict'])
>    df = df[['CampsiteID','CampsiteName','FacilityID']].join(norm)
    return df.replace({np.nan: ""})
```

**Keep in mind**  
Keeping data in its original format will help your pipeline be robust to data source changes.


In [110]:
# Eliminate expensive apply and join, improve resiliency to data source changes
def transform(campsite_json):
    # Iterate over the records directly instead of indexing by position
    for campsite in campsite_json:
        campsite['AttributeDict'] = [{item['AttributeName']: item['AttributeValue']} for item in campsite['ATTRIBUTES']]
    return campsite_json
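
To make the resulting shape concrete, here is a small usage sketch. The function body repeats the logic of `transform` so the example runs on its own, and the sample record is hypothetical data in the shape of a RIDB campsite:

```python
# Same logic as `transform` above, repeated so the example is self-contained
def transform(campsite_json):
    for campsite in campsite_json:
        campsite['AttributeDict'] = [
            {item['AttributeName']: item['AttributeValue']}
            for item in campsite['ATTRIBUTES']
        ]
    return campsite_json


# Hypothetical record in the shape of a RIDB campsite
sample = [{
    'CampsiteID': '98358',
    'ATTRIBUTES': [
        {'AttributeName': 'Picnic Table', 'AttributeValue': 'Y'},
        {'AttributeName': 'Site Rating', 'AttributeValue': 'Standard'},
    ],
}]

result = transform(sample)
print(result[0]['AttributeDict'])
# [{'Picnic Table': 'Y'}, {'Site Rating': 'Standard'}]
```

Note that the records are modified in place and the same list is returned, so no copy of the data is made.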
        

### Scaling up through parallel processing - Infrastructure

**Ask yourself:** What parts of my data processing flow are independent?

* We can process the NF web data in parallel with RIDB
* We can also process each state independently

The figure below shows independent pipelines running for Oregon and Washington

**Keep in mind**
* How will you track API use in a distributed system like this?

![RIDB pipeline](images/OR_WA_pipeline.png)

### Scaling up through parallel processing - Distributed data structures

**Ask yourself**  
What processes operate over a large number of the same data structures?

[Spark](https://spark.apache.org/) and [Dask](https://docs.dask.org/en/latest/) are examples of big data processing engines that have distributed data frame structures, enabling you to spread computation across many compute nodes to speed up runtime.

![Batch processing](images/campsite_batch.png)
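
The partition-and-map idea behind those engines can be sketched locally. The example below is not Spark or Dask code: it uses a standard-library thread pool as a stand-in for a cluster, and `batches` and `count_attributes` are hypothetical helpers operating on made-up campsite records:

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_attributes(batch):
    # Stand-in for a per-batch transformation (hypothetical)
    return [len(campsite["ATTRIBUTES"]) for campsite in batch]


# Hypothetical campsites: campsite i has i % 3 attributes
campsites = [{"CampsiteID": i, "ATTRIBUTES": [{}] * (i % 3)} for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves batch order, so results line up with the input
    per_batch = list(pool.map(count_attributes, batches(campsites, 4)))

# Flatten the per-batch results back into one list
counts = [n for batch in per_batch for n in batch]
```

A distributed engine applies the same pattern, but places each batch (partition) on a different worker node rather than a local thread.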

### Scaling up - building in resiliency

**Ask yourself** What are potential points of failure in my pipeline, and how could caching data help reduce time and expenses?

When thinking about separating steps into independent pipeline tasks:
* Similar to thinking about object-oriented principles when coding, think about single responsibility and encapsulation when breaking data processing code into pipelines
* Consider persisting data that takes a long time to generate
* Separating pipeline stages into isolated steps also lets a failed stage be retried without rerunning the stages before it

![Retry on fail](images/retry_on_fail.png)
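
A minimal sketch of this kind of caching, assuming stage outputs are JSON-serializable and a local directory stands in for disk or cloud storage. `run_stage` and `slow_extract` are hypothetical names introduced for illustration:

```python
import json
import os
import tempfile


def run_stage(name, compute, cache_dir):
    """Return the cached output of a stage if present, else compute and save it."""
    path = os.path.join(cache_dir, f"{name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def slow_extract():
    # Stand-in for a slow API extraction (hypothetical)
    calls.append("extract")
    return [{"FacilityID": "251434"}]


cache_dir = tempfile.mkdtemp()
first = run_stage("facilities_OR", slow_extract, cache_dir)
# A retry after a downstream failure reuses the saved file instead of re-extracting
second = run_stage("facilities_OR", slow_extract, cache_dir)
```

Here the extraction runs only once; the second call reads the saved file, which is what makes retrying a later stage cheap.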

### Scaling up with configurable components

**Ask yourself** 
* If I were thinking about my data processing code as a function, what would make sense to parameterize?
* How might this system scale in breadth? For example, perhaps we want to find a place to camp at a facility that also has boating (`activity_id: 6`). It might also be interesting to know what events are going on at the facility, available at the `/facilities/{facilityId}/events` endpoint. 

Airflow can help you set up these kinds of operations.
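
Independently of any orchestrator, parameterization can be sketched in plain Python. The example below generates one config entry per state and activity, mirroring the `params` dicts passed to the RIDB API in this notebook. `build_config` and the `label` format are assumptions made for illustration:

```python
from itertools import product


def build_config(states, activity_ids):
    """Build one pipeline config entry per (state, activity) pair."""
    return [
        {"label": f"{state}-{activity_id}",
         "params": {"state": state, "activity_id": activity_id}}
        for state, activity_id in product(states, activity_ids)
    ]


# 9 and 6 are the camping and boating activity ids used in this notebook
config = build_config(["OR", "WA"], [9, 6])
```

Adding a state or an activity then becomes a change to the inputs rather than to the processing code.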

In [127]:
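def get_ridb_data(url, headers, params=None):
    # Avoid a mutable default argument; pass headers by keyword as in the prototype
    response = RequestsMock.get(url, params or {}, headers=headers)
    if response.status_code == 200:
        result = json.loads(response.text)
        return result['RECDATA']
    # Return an empty list so callers can always iterate over the result
    return []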

# These could be environment variables
RIDB_FACILITIES_URL = "https://ridb.recreation.gov/api/v1/facilities"
HEADERS = {"accept": "application/json", "apikey": "key"}

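def get_campsite_data(facilities):
    campsite_data = []
    for facility in facilities:
        # Build the per-facility campsites endpoint from the base URL
        url = f"{RIDB_FACILITIES_URL}/{facility['FacilityID']}/campsites"
        campsites = get_ridb_data(url, HEADERS)
        if campsites:
            # Keep the JSON structure, using the lightweight transform defined earlier
            campsite_data.extend(transform(campsites))
    return campsite_data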

pipeline_config = [
    {'label': 'OR', 'nf_sites': '../data/NF_sites/OR_sitelist.csv', 'params':{'state': 'OR', 'activity_id': 9}},
    {'label': 'WA', 'nf_sites': '../data/NF_sites/OR_sitelist.csv', 'params':{'state': 'WA', 'activity_id': 9}}]

def merge_data(facilities, campsite_data, nf_data):
    pass

# For demonstrating config
def run_pipeline(config):
    results = {}
    for item in config:
        facilities = get_ridb_data(RIDB_FACILITIES_URL, HEADERS, item['params'])
        campsite_data = get_campsite_data(facilities)
        nf_data = get_nf_data(item['nf_sites'])
        results[item['label']] = merge_data(facilities, campsite_data, nf_data)
    return results
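
`run_pipeline` calls `get_nf_data`, which is not defined in the cells shown here. A minimal sketch, assuming it wraps the prototype's CSV-plus-scraper loop, could look like the following. The `scraper_cls` parameter and `FakeScraper` are assumptions added so the example runs without the `camping` package; in the notebook you would pass `Scraper` (or make it the default):

```python
import csv
import os
import tempfile


def get_nf_data(site_list_path, scraper_cls):
    """Scrape every site listed in a CSV with site_url and site_name columns."""
    nf_data = []
    with open(site_list_path, newline="") as f:
        for row in csv.DictReader(f):
            scraper = scraper_cls(row["site_url"], row["site_name"])
            nf_data.append(scraper.scrape())
    return nf_data


class FakeScraper:
    """Hypothetical stand-in for camping.util.scraper.Scraper."""

    def __init__(self, url, name):
        self.url, self.name = url, name

    def scrape(self):
        return {"FacilityName": self.name, "url": self.url}


# Write a tiny site list to a temporary file to exercise the function
path = os.path.join(tempfile.mkdtemp(), "sites.csv")
with open(path, "w", newline="") as f:
    f.write("site_url,site_name\nhttps://example.com/lost-lake,Lost Lake\n")

nf_data = get_nf_data(path, FakeScraper)
```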

#### Let's start by breaking our data processing code into discrete steps

In [126]:
run_pipeline(pipeline_config)

{'OR': None, 'WA': None}