# Introduction 

The Chesapeake Bay Program [DataHub](https://datahub.chesapeakebay.net/Home) contains many datasets for the Chesapeake Bay. The Water Quality Data is still being updated and covers many field and lab parameters, including phosphorus, nitrogen, carbon, various other lab parameters (suspended solids, dissolved solids, chlorophyll-a, alkalinity, etc.), dissolved oxygen, pH, salinity, turbidity, water temperature, and climate conditions.

See [Guide to Using Chesapeake Bay Program Water Quality Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/wq_data_userguide_10feb12_mod.pdf) for more information.
 
There is also a [DataHub API](https://datahub.chesapeakebay.net/API) which we will use to access the data.

## What data do we want?

The Chesapeake Bay segments are based on the circulation and salinity properties of different areas of the Bay. We want the segments in the Chesapeake Bay proper, which have a `CBSeg2003Name` that starts with "CB". See the [map of the segments](https://www.chesapeakebay.net/what/maps/chesapeake-bay-2003-segmentation-scheme-codes).

The Water Quality data contains quite a lot of data, so we will define functions to download 5 years of data at once. This method will also work for other large datasets, although the plankton databases seem to download over long time periods without issue.

In [70]:
from datetime import datetime
from dateutil.relativedelta import relativedelta

def generate_date_ranges(start_date, end_date, delta_years=5):
    # Generate 'MM-DD-YYYY/MM-DD-YYYY/' date ranges of at most delta_years
    # between start_date and end_date.
    date_list = []
    current_date = start_date
    while current_date < end_date:
        # Clamp the final range so it never runs past end_date
        next_date = min(current_date + relativedelta(years=delta_years), end_date)
        date_list.append(current_date.strftime('%m-%d-%Y') + '/'
                         + next_date.strftime('%m-%d-%Y') + '/')
        current_date = next_date
    return date_list
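
As a quick check of the chunking logic, the sketch below reimplements the same idea with only the standard library and prints the ranges for a 20-year window. The name `five_year_chunks` is hypothetical, and the year arithmetic assumes the start date is not February 29.

```python
from datetime import datetime

def five_year_chunks(start_date, end_date, delta_years=5):
    """Return 'MM-DD-YYYY/MM-DD-YYYY/' strings covering start_date..end_date."""
    chunks = []
    current = start_date
    while current < end_date:
        # Assumption: start_date is not Feb 29, so replace(year=...) is always valid
        nxt = min(current.replace(year=current.year + delta_years), end_date)
        chunks.append(f"{current:%m-%d-%Y}/{nxt:%m-%d-%Y}/")
        current = nxt
    return chunks

ranges = five_year_chunks(datetime(2004, 7, 29), datetime(2024, 7, 29))
print(ranges)
```

Consecutive ranges share a boundary date, so if the API treats both ends of a range as inclusive, samples from that day could appear in two chunks. It may be worth checking the combined file for duplicate rows.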

In [89]:
import requests
import csv

def fetch_and_save_data_by_date(base_url, url_idValues, start_date, end_date, output_file):
    # Generate list of date ranges
    date_list = generate_date_ranges(start_date, end_date)
    
    # Open the output file in write mode (any existing file is overwritten)
    with open(output_file, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        header_written = False
        
        for dates in date_list:
            
            # Define the API endpoint URL with CSV format
            api_url = base_url + dates + url_idValues
            # Send a GET request to the API endpoint
            response = requests.get(api_url)
            
            # Check if the request was successful
            if response.status_code == 200:
                # Read the CSV response
                csv_data = response.text.splitlines()
                csv_reader = csv.reader(csv_data)
                
                # Each response is assumed to start with a header row;
                # consume it every time, but write it only once
                header = next(csv_reader, None)
                if header is not None and not header_written:
                    csv_writer.writerow(header)
                    header_written = True
                
                for row in csv_reader:
                    csv_writer.writerow(row)
                
                print(f"Data from {dates} saved to {output_file}")
            else:
                # Handle the error
                print(f"Failed to retrieve data from {dates}: {response.status_code}")
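
The header handling is the subtle part of combining chunks: each CSV response presumably starts with its own header row, and only the first should reach the output. Here is a minimal in-memory sketch of that pattern, with two made-up chunks standing in for API responses (the helper name `merge_csv_chunks`, the column names, and the values are illustrative only):

```python
import csv
import io

def merge_csv_chunks(chunks, out):
    """Write several CSV texts to `out`, keeping only the first header row."""
    writer = csv.writer(out)
    header_written = False
    for text in chunks:
        reader = csv.reader(text.splitlines())
        header = next(reader, None)   # each chunk begins with a header row
        if header is None:
            continue                  # empty chunk: nothing to write
        if not header_written:
            writer.writerow(header)
            header_written = True
        writer.writerows(reader)      # remaining rows are data

# Hypothetical chunks, standing in for two API responses
chunk_a = "Station,MeasureValue\nA1,1.0\n"
chunk_b = "Station,MeasureValue\nB2,2.5\n"

buffer = io.StringIO()
merge_csv_chunks([chunk_a, chunk_b], buffer)
merged_lines = buffer.getvalue().splitlines()
print(merged_lines)
```

If the header of a later chunk were written as data, it would show up as a stray text row in the middle of the file, which can make numeric columns load with mixed types.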

### Water Quality

The Water Quality URL format is `http://datahub.chesapeakebay.net/api.{format}/WaterQuality/<Start-Date>/<End-Date>/<Data-Stream-Value>/<Program-Id>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`.

[Data-Stream-Value list](https://datahub.chesapeakebay.net/api.json/DataStreams): we want all data, `0,1`.

[Program-Id list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Programs): we want all three, `2,4,6`.

With these choices and example dates filled in, the URL so far is `https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/7-29-2014/7-29-2024/0,1/2,4,6/`.

The `fetch_and_save_data_by_date` function defined above calls the API for our entire desired time frame and writes the data to one CSV. For each database, it needs a `base_url`, which points to the desired database and output format; a `url_idValues` string, which tells the API which values to download (its length depends on the database); and an `output_file`.
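
To make the pieces concrete, here is a small sketch of how one request URL is put together. The helper name `build_url` and the `example_*` variables are hypothetical, and the id string is deliberately shortened to the data streams and programs; a real request would continue with project, geography, and substance ids.

```python
def build_url(base_url, date_range, url_idValues):
    """Concatenate the URL pieces; each piece is expected to end with '/'."""
    return base_url + date_range + url_idValues

example_base = "https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/"
example_dates = "07-29-2014/07-29-2019/"   # format produced by generate_date_ranges
example_ids = "0,1/2,4,6/"                 # data streams and programs only (illustrative)

example_url = build_url(example_base, example_dates, example_ids)
print(example_url)
```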

In [92]:
# Define base URL
base_url = 'https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/'


We can update our url to `https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/<Start-Date>/<End-Date>/0,1/2,4,6/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`. We will deal with the date last.

In [93]:
# Define the id values that go after <Start-Date>/<End-Date>
url_idValues = '0,1/2,4,6/'

Let's get all project ids from the [Project-Id list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects). Note that some projects might not have any water quality data for the desired segments, but including them does not seem to cause the API to time out.

In [94]:
import requests

# Define the URL with the Projects list
projectList_url = "https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects"

# Send a GET request to the list
response = requests.get(projectList_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Extract the ProjectId of every project
    projectIds = [project['ProjectId'] for project in data]
    
    # Append ids to idValues
    for projectId in projectIds:
        url_idValues = url_idValues + str(projectId) +','

    # Remove final comma, append /
    url_idValues = url_idValues[:-1] + '/'
    print(url_idValues)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

0,1/2,4,6/12,13,14,15,35,36,2,3,11,7,33,34,23,24,16/


Now we need to retrieve the geographical ids (the `<Attribute-Id>` values) from the [Geographical-Attribute CBSeg2003 list](https://datahub.chesapeakebay.net/api.json/CBSeg2003). Since we only want the segments in the Bay proper, we search for the segment names that start with `CB`.

In [95]:
# Add CBSeg2003 to idValues to specify the type of geographic id
url_idValues = url_idValues +'CBSeg2003/'

# Define the URL with the CBSeg2003 list
CBSeg2003_url = "http://datahub.chesapeakebay.net/api.json/CBSeg2003"

# Send a GET request to the list
response = requests.get(CBSeg2003_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find CBSeg2003Name that start with "CB"
    filtered_segments = [segment for segment in data if segment['CBSeg2003Name'].startswith('CB')]
    # Extract the CBSeg2003Id for the filtered results
    segmentIds = [segment['CBSeg2003Id'] for segment in filtered_segments]

    # Append ids to idValues
    for segmentId in segmentIds:
        url_idValues = url_idValues + str(segmentId) +','

    # Remove final comma, append /
    url_idValues = url_idValues[:-1] + '/'
    print(url_idValues)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

0,1/2,4,6/12,13,14,15,35,36,2,3,11,7,33,34,23,24,16/CBSeg2003/10,11,12,13,14,15,16,17/
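
The filter-and-join step can be tried offline. The sketch below uses a few hand-written records in the shape the code above expects (`CBSeg2003Id`, `CBSeg2003Name`). The ids and the non-matching name are placeholders rather than values taken from the API, and the helper name `segment_id_string` is hypothetical.

```python
def segment_id_string(records, prefix="CB"):
    """Return a comma-separated id string for segments whose name starts with prefix."""
    ids = [rec["CBSeg2003Id"] for rec in records
           if rec["CBSeg2003Name"].startswith(prefix)]
    return ",".join(str(i) for i in ids)

# Placeholder records; real ones come from the CBSeg2003 endpoint
sample_records = [
    {"CBSeg2003Id": 1, "CBSeg2003Name": "AAAAA"},
    {"CBSeg2003Id": 2, "CBSeg2003Name": "CB1TF"},
    {"CBSeg2003Id": 3, "CBSeg2003Name": "CB5MH"},
]

segment_part = "CBSeg2003/" + segment_id_string(sample_records) + "/"
print(segment_part)  # CBSeg2003/2,3/
```

Using `",".join(...)` avoids having to strip a trailing comma afterwards.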


Now we need to append the `Substance-Id` values. Many substances are not measured in the dataset and regions we want. Using the online download form, the largest possible list appears to be `21,30,31,35,36,49,55,60,63,65,67,71,73,74,77,78,82,83,85,87,88,94,104,105,109,111,114,116,121,123,33,76,113,34,119`. I don't see a more systematic way to do this step, since different stations measure different things.

In addition to updating `url_idValues`, let's also look up the names and descriptions of these substances.

In [96]:
substanceId_list = [21,30,31,33,34,35,36,49,55,60,63,65,67,71,73,74,76,77,78,82,83,85,87,88,94,104,105,109,111,113,114,116,119,121,123]


# Define the URL with the substance list
substanceId_url = "https://datahub.chesapeakebay.net/api.json/Substances"

# Send a GET request to the list
response = requests.get(substanceId_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find substances with SubstanceId in substanceId_list
    filtered_substances = [substance for substance in data if substance['SubstanceId'] in substanceId_list]

    # Print the filtered substances
    for substance in filtered_substances:
        print(substance)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

{'SubstanceId': 21, 'SubstanceIdentificationName': 'CHLA', 'SubstanceIdentificationDescription': 'Active Chlorophyll-A'}
{'SubstanceId': 30, 'SubstanceIdentificationName': 'DIN', 'SubstanceIdentificationDescription': 'Dissolved Inorganic Nitrogen'}
{'SubstanceId': 31, 'SubstanceIdentificationName': 'DO', 'SubstanceIdentificationDescription': 'Dissolved Oxygen In MG/L'}
{'SubstanceId': 33, 'SubstanceIdentificationName': 'DO_SAT_P', 'SubstanceIdentificationDescription': 'DO Saturation Using Probe Units In Percent'}
{'SubstanceId': 34, 'SubstanceIdentificationName': 'DOC', 'SubstanceIdentificationDescription': 'Dissolved Organic Carbon'}
{'SubstanceId': 35, 'SubstanceIdentificationName': 'DON', 'SubstanceIdentificationDescription': 'Dissolved Organic Nitrogen'}
{'SubstanceId': 36, 'SubstanceIdentificationName': 'DOP', 'SubstanceIdentificationDescription': 'Dissolved Organic Phosphorus'}
{'SubstanceId': 49, 'SubstanceIdentificationName': 'FSS', 'SubstanceIdentificationDescription': 'Fixed 
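
For later use it can be handy to keep these records as a lookup table. Below is a minimal sketch that uses three of the records printed above as sample input. The variable names are assumptions, not part of the original notebook. The `Parameter` column of the downloaded data appears to use the short names (for example `CHLA`), so the table is keyed by name.

```python
# Three records copied from the output above
sample_substances = [
    {'SubstanceId': 21, 'SubstanceIdentificationName': 'CHLA',
     'SubstanceIdentificationDescription': 'Active Chlorophyll-A'},
    {'SubstanceId': 30, 'SubstanceIdentificationName': 'DIN',
     'SubstanceIdentificationDescription': 'Dissolved Inorganic Nitrogen'},
    {'SubstanceId': 31, 'SubstanceIdentificationName': 'DO',
     'SubstanceIdentificationDescription': 'Dissolved Oxygen In MG/L'},
]

# Map each short name to its human-readable description
description_by_name = {
    s['SubstanceIdentificationName']: s['SubstanceIdentificationDescription']
    for s in sample_substances
}
print(description_by_name['CHLA'])
```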

In [97]:
# Append the substance ids as a comma-separated list (no trailing slash)
url_idValues = url_idValues + ','.join(str(substanceId) for substanceId in substanceId_list)

Finally, we call the API.

In [98]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)
output_file = '../data/plankton-patrol_ChesapeakeWaterQuality.csv'

fetch_and_save_data_by_date(base_url, url_idValues, start_date, end_date, output_file)

Data from 07-29-2004/07-29-2009/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2009/07-29-2014/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2014/07-29-2019/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2019/07-29-2024/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv


Clean the data with the help of the VSCode Data Wrangler extension.

In [99]:
import pandas as pd

def clean_data(df):
    # Drop empty columns: 'PrecisionPC', 'BiasPC'
    # Drop columns that are almost all empty or NaN: 'Details', 'Problem'
    df = df.drop(columns=['PrecisionPC','BiasPC','Details','Problem'])
    return df

# Load the combined CSV written by fetch_and_save_data_by_date (path relative to the notebook)
df = pd.read_csv('../data/plankton-patrol_ChesapeakeWaterQuality.csv')

df_clean = clean_data(df.copy())
df_clean.head()



Unnamed: 0,CBSeg2003,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,SampleReplicateType,Parameter,Qualifier,MeasureValue,Unit,Method,Lab,Latitude,Longitude,TierLevel
0,CB1TF,88268,BAY475,TWQM,MAIN,MDDNR,MDDNR,CB2.1,2/19/2008,10:18:00,...,S1,CHLA,,1.495,UG/L,L01,MDHMH,39.44149,-76.02599,T3
1,CB1TF,88268,BAY475,TWQM,MAIN,MDDNR,MDDNR,CB2.1,2/19/2008,10:18:00,...,S1,CHLA,,,UG/L,L01,MDHMH,39.44149,-76.02599,T3
2,CB1TF,130482,BAY477,TWQM,MAIN,MDDNR,MDDNR,CB2.1,3/21/2008,12:33:00,...,S1,CHLA,,,UG/L,L01,MDHMH,39.44149,-76.02599,T3
3,CB1TF,130482,BAY477,TWQM,MAIN,MDDNR,MDDNR,CB2.1,3/21/2008,12:33:00,...,S1,CHLA,,5.233,UG/L,L01,MDHMH,39.44149,-76.02599,T3
4,CB1TF,130623,BAY478,SWM,DFLO,MDDNR,MDDNR,XJH8658,4/3/2008,08:51:00,...,S1,CHLA,,1.495,UG/L,L01,MDHMH,39.4767,-76.0709,T3
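
Rather than naming the columns by hand, the mostly-empty ones can also be found programmatically. This is a sketch on a tiny made-up frame. The 0.9 threshold and the helper name `drop_sparse_columns` are arbitrary choices, not part of the original workflow.

```python
import pandas as pd

def drop_sparse_columns(df, max_missing=0.9):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    missing_fraction = df.isna().mean()   # per-column share of missing values
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep]

# Made-up frame: 'Problem' is entirely missing, 'MeasureValue' is half missing
toy = pd.DataFrame({
    "Parameter": ["CHLA", "CHLA"],
    "MeasureValue": [1.495, None],
    "Problem": [None, None],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```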


### BioMass

There are many different parts of the Living Resources database. Let's start by looking at BioMass, which is part of the `TidalBenthic` dataset. The URL format is `http://datahub.chesapeakebay.net/api.{format}/LivingResources/TidalBenthic/BioMass/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>`. It appears the `Attribute-Id` is optional.

Note that the API does not allow combining projects like we did in the Water Quality dataset, but it can download the entire timeframe without timeout errors. We define a new function to retrieve the data. This function can be used for any of the Living Resources datasets.

In [117]:
import requests
import csv

def fetch_and_save_data_by_project(base_url, url_idValues, start_date, end_date, project_list, output_file):
    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')
    
    # Open the output file in write mode (any existing file is overwritten)
    with open(output_file, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        header_written = False
        
        # Each project is a [ProjectName, ProjectId] pair
        for project in project_list:
            
            # Define the API endpoint URL with CSV format
            api_url = f"{base_url}{start_str}/{end_str}/{project[1]}/{url_idValues}"
            # Send a GET request to the API endpoint
            response = requests.get(api_url)
            
            # Check if the request was successful
            if response.status_code == 200:
                # Read the CSV response
                csv_data = response.text.splitlines()
                csv_reader = csv.reader(csv_data)
                
                # Each response is assumed to start with a header row;
                # consume it every time, but write it only once
                header = next(csv_reader, None)
                if header is not None and not header_written:
                    csv_writer.writerow(header)
                    header_written = True
                
                for row in csv_reader:
                    csv_writer.writerow(row)
                
                print(f"Data from {project[0]} saved to {output_file}")
            else:
                # Handle the error (report the project name, not a date range)
                print(f"Failed to retrieve data from {project[0]}: {response.status_code}")
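
For reference, the URL requested for each project has the shape sketched below. The values are the ones that appear in the cells that follow (project ids 1 and 32, segment ids 10 through 17). The helper name `living_resources_url` and the `example_*` variables are hypothetical.

```python
from datetime import datetime

def living_resources_url(base_url, start_date, end_date, project_id, url_idValues):
    """Build one request URL: base / start / end / project id / remaining id values."""
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')
    return f"{base_url}{start_str}/{end_str}/{project_id}/{url_idValues}"

example_base = "http://datahub.chesapeakebay.net/api.csv/LivingResources/TidalBenthic/BioMass/"
example_ids = "CBSeg2003/10,11,12,13,14,15,16,17/"

# One URL per project, since projects cannot be combined in a single request
urls = [
    living_resources_url(example_base, datetime(2004, 7, 29), datetime(2024, 7, 29),
                         pid, example_ids)
    for pid in (1, 32)
]
for url in urls:
    print(url)
```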


We define a new `base_url`

In [112]:
base_url = "http://datahub.chesapeakebay.net/api.csv/LivingResources/TidalBenthic/BioMass/"

We create the project list. From the online download menu, the relevant projects are BEN-Tidal Benthic Monitoring and SBEN-Special Tidal Benthic Monitoring. We may also want to add CBEN-Coastal Bays Benthic Monitoring, which covers bays adjacent to the Chesapeake.

In [113]:
# Initialize url_idValues
url_idValues = ""

projectIdentifier_list = ['BEN','SBEN']


# Define the URL with the project list
projectIdentifier_url = "https://datahub.chesapeakebay.net/api.json/LivingResources/Projects"

# Send a GET request to the list
response = requests.get(projectIdentifier_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find projects with ProjectIdentifier in projectIdentifier_list
    filtered_projects = [project for project in data if project['ProjectIdentifier'] in projectIdentifier_list]

    # Extract the ProjectID for the filtered results
    projectIds = [[project['ProjectName'],project['ProjectId']] for project in filtered_projects]


    print(projectIds)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

[['Tidal Benthic Monitoring', 1], ['Special Tidal Benthic Monitoring', 32]]


The `Geographic-Id` is the same as for Water Quality

In [114]:
# Add CBSeg2003 to idValues to specify the type of geographic id
url_idValues = url_idValues +'CBSeg2003/'

# Define the URL with the CBSeg2003 list
CBSeg2003_url = "http://datahub.chesapeakebay.net/api.json/CBSeg2003"

# Send a GET request to the list
response = requests.get(CBSeg2003_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find CBSeg2003Name that start with "CB"
    filtered_segments = [segment for segment in data if segment['CBSeg2003Name'].startswith('CB')]
    # Extract the CBSeg2003Id for the filtered results
    segmentIds = [segment['CBSeg2003Id'] for segment in filtered_segments]

    # Append ids to idValues
    for segmentId in segmentIds:
        url_idValues = url_idValues + str(segmentId) +','

    # Remove final comma, append /
    url_idValues = url_idValues[:-1] + '/'
    print(url_idValues)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

CBSeg2003/10,11,12,13,14,15,16,17/


We do not need any additional id values, so let's call the API.

In [118]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)
output_file = '../data/plankton-patrol_ChesapeakeBioMass.csv'

fetch_and_save_data_by_project(base_url, url_idValues, start_date, end_date, projectIds,output_file)

Data from Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBioMass.csv
Data from Special Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBioMass.csv


Clean up with VSCode Data Wrangler

In [119]:
import pandas as pd

def clean_data(df):
    # Drop column: 'SiteType' with only one value
    df = df.drop(columns=['SiteType'])
    return df

# Load the combined CSV written by fetch_and_save_data_by_project (path relative to the notebook)
df = pd.read_csv('../data/plankton-patrol_ChesapeakeBioMass.csv')

df_clean = clean_data(df.copy())
df_clean.head()

Unnamed: 0,CBSeg2003,CBSeg2003Description,FieldActivityId,BiologicalEventId,Source,SampleDate,Latitude,Longitude,Station,TotalDepth,SampleTime,SampleReplicate,IBIParameter,IBIValue
0,CB5MH,Chesapeake Bay-Mesohaline Region,215580,68852,VERSAR/EME/BEL,8/31/2004,38.3081,-76.3727,11513,8.1,06:58:00,S1,PCT_CARN_OMN,21.57
1,CB5MH,Chesapeake Bay-Mesohaline Region,215580,68852,VERSAR/EME/BEL,8/31/2004,38.3081,-76.3727,11513,8.1,06:58:00,S1,PCT_DEPO,15.69
2,CB5MH,Chesapeake Bay-Mesohaline Region,215580,68852,VERSAR/EME/BEL,8/31/2004,38.3081,-76.3727,11513,8.1,06:58:00,S1,PCT_PI_ABUND,11.76
3,CB5MH,Chesapeake Bay-Mesohaline Region,215580,68852,VERSAR/EME/BEL,8/31/2004,38.3081,-76.3727,11513,8.1,06:58:00,S1,PCT_PI_BIO,4.62
4,CB5MH,Chesapeake Bay-Mesohaline Region,215580,68852,VERSAR/EME/BEL,8/31/2004,38.3081,-76.3727,11513,8.1,06:58:00,S1,PCT_PS_ABUND,9.8
