# Introduction 

The Chesapeake Bay Program [DataHub](https://datahub.chesapeakebay.net/Home) contains many datasets for the Chesapeake Bay. The Water Quality Data is still updating and measures many field and lab parameters including: phosphorus, nitrogen, carbon, various other lab parameters (suspended solids, disolved solids, chlorophyll-a, alkalinkity, etc), dissolved oxygen, pH, salinity, turbitity, water temperature, and climate condition. 

See [Guide to Using Chesapeake Bay Program Water Quality Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/wq_data_userguide_10feb12_mod.pdf) for more information.
 
There is also a [DataHub API](https://datahub.chesapeakebay.net/API) which we will use to access the data.

## What data do we want?

The Chesapeake Bay segements are based on circulation and salinity properties of different areas of the Bay. We want the segements in the Chesapeake Bay proper, which haveCBSeg2003Name that starts "CB".
[Map of the segments](https://www.chesapeakebay.net/what/maps/chesapeake-bay-2003-segmentation-scheme-codes).

Since we want the same geographic information for every dataset, let's go ahead and generate that now. We need to retrieve the `Geographical-Id`s from the [Geographical-Attribute, CBSeg2003 list](https://datahub.chesapeakebay.net/api.json/CBSeg2003). Since we only want the segments in the Bay proper, we will search for the segment names that start with `CB`. We will then add the sounds and bays that adjoin the Chesapeake Bay. We will not include the segments for bays on the other side of the Eastern Shore from the Chesapeake.

In [44]:
import requests

# Add CBSeg2003 to idValues to specify type of geographic ID
GeographicID_values = 'CBSeg2003/'

# Define the URL with the CBSeg2003 list
CBSeg2003_url = "http://datahub.chesapeakebay.net/api.json/CBSeg2003"

# Send a GET request to the list
response = requests.get(CBSeg2003_url)
filtered_segments=[]
if response.status_code == 200:
    try:
        # Parse the JSON response
        data = response.json()

        # Filter the results to find CBSeg2003Name that start with "CB"
        filtered_segments = [
            segment['CBSeg2003Id'] for segment in data
            if segment.get('CBSeg2003Name', '').startswith('CB') or 
                # Add the adjacent bays and Tangier sound
               segment.get('CBSeg2003Name', '') in ['EASMH ', 'MOBPH ', 'TANMH ']
        ]

        # Append ids to idValues
        if filtered_segments:
            GeographicID_values += ','.join(map(str, filtered_segments)) + '/'
        else:
            print("No matching segments found.")
        
        print(GeographicID_values)
    
    except ValueError as e:
        print(f"Failed to parse JSON data: {e}")
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")


CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/


### Water Quality

CBP Water Quality Data (1984-present): measured and calculated physical and nutrient parameters

The long list of parameters includes phosphorus, nitrogen, carbon, suspended solids, disolved solids, chlorophyll_a, pH, salinity, turbitidy, water temperature, and atmospheric conditions.

The Water Quality data contains quite a lot of data, so we will define functions to download 5 years of data at once. This method will also work for other large datasets, although the plankton databases seem to download over long time periods without issue.

In [53]:
from datetime import datetime
from dateutil.relativedelta import relativedelta

def generate_date_ranges(start_date, end_date, delta_years=5):
    # Generate date ranges of specified years between start_date and end_date.
    date_list =[]
    current_date = start_date
    while current_date < end_date:
        next_date = current_date + relativedelta(years=delta_years)
        date_list.append(current_date.strftime('%m-%d-%Y') +'/' 
                         + next_date.strftime('%m-%d-%Y') +'/') 
        current_date = next_date
    return(date_list)

We also define a function to call the API for our entire desired time frame and output the data in one CSV. Note that for each database, we will need a `base_url` which points to the desired database and output format, a `url_idValues` which tells the API which values to download (the length of this url also depends on the database), and an `output_file`.

In [62]:
import csv
import pandas as pd

def api_to_csv_by_date(base_url, url_idValues, start_date, end_date, output_file):
    # Generate list of date ranges
    date_list = generate_date_ranges(start_date, end_date)
    
    # Open the output file in append mode
    with open(output_file, 'w', newline='') as csvfile:
        csv_writer = None
        
        for dates in date_list:
            
            # Define the API endpoint URL with CSV format
            api_url = f"{base_url}{dates}{url_idValues}"

            try:
                response = requests.get(api_url)  # Added timeout for reliability
                response.raise_for_status()  # Raise an exception for HTTP errors
            except requests.RequestException as e:
                # Handle the error
                print(f"Failed to retrieve data from {api_url}: {e}")
                continue
            
            # Read the CSV response
            csv_data = response.text.splitlines()
            csv_reader = csv.reader(csv_data)
            
            # Write the CSV data to the file
            if csv_writer is None:
                # Write the header only once
                csv_writer = csv.writer(csvfile)
                csv_writer.writerow(next(csv_reader))  # Write header
            
            for row in csv_reader:
                csv_writer.writerow(row)
            
            print(f"Data from {dates} saved to {output_file}")


def api_to_dataframe_by_date(base_url, url_idValues, start_date, end_date):
    # Generate list of date ranges
    date_list = generate_date_ranges(start_date, end_date)
    
    # Initialize an empty DataFrame
    all_data = pd.DataFrame()
    
    for dates in date_list:
        # Define the API endpoint URL with CSV format
        api_url = base_url + dates + url_idValues
        # Send a GET request to the API endpoint
        response = requests.get(api_url)
        
        # Check if the request was successful
        if response.status_code == 200:
            # Read the CSV response into a DataFrame
            data = pd.read_csv(api_url)
            
            # Append the data to the all_data DataFrame
            all_data = pd.concat([all_data, data], ignore_index=True)
            
            print(f"Data from {dates} added to DataFrame")
        else:
            # Handle the error
            print(f"Failed to retrieve data from {dates}: {response.status_code}")

The format for the API url for 
Water Quality is: `http://datahub.chesapeakebay.net/api.{format}/WaterQuality/WaterQuality/<Start-Date>/<End-Date>/<Data-Stream-Value>/<Program-Id>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`

We will start with defining the `base_url` which is everything before the dates.

In [55]:
# Define base URL
base_url = 'https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/'

We want all [Data-Stream-Value](https://datahub.chesapeakebay.net/api.json/DataStreams) data, so that part of our url is `0,1`

We also want all three programs in the [Program-Id list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Programs) `2,4,6`

We can update our url to `https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/<Start-Date>/<End-Date>/0,1/2,4,6/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`. We will deal with the date last.

In [56]:
# Define the id values that go after <Start-Date>/<End-Date>
url_idValues = '0,1/2,4,6/'

Let's get all project ids 
[Project-ID list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects).
Note that some projects might not have any water quality data for the desired segments, but that does not seem to cause a problem with the API timing out.

In [57]:
# Define the URL with the Projects list
projectList_url = "https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects"

# Send a GET request to the list
response = requests.get(projectList_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Find ProjectId
    projectIds = [segment['ProjectId'] for segment in data]
    
    # Append ids to idValues
    for projectId in projectIds:
        url_idValues = url_idValues + str(projectId) +','

    # Remove final comma, append /
    url_idValues = url_idValues[:-1] + '/'
    print(url_idValues)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

0,1/2,4,6/12,13,14,15,35,36,2,3,11,7,33,34,23,24,16/


Now we append the geographic information

In [58]:
url_idValues =url_idValues + GeographicID_values

Now we need to append the `Substance-Id`. Now, there are many substances that are not measured in the dataset and regions we want. Using the online download form, it looks like the largest possible list is `21,30,31,35,36,49,55,60,63,65,67,71,73,74,77,78,82,83,85,87,88,94,104,105,109,111,114,116,121,123,33,76,113,34,119`. I don't see a more systematic way to do this step, since different stations are measuring different things.

In addition to updating the `url_idValues`, let's also pull a dictionary for these substances

In [59]:
substanceId_list = [21,30,31,33,34,35,36,49,55,60,63,65,67,71,73,74,76,77,78,82,83,85,87,88,94,104,105,109,111,113,114,116,119,121,123]


# Define the URL with the substance list
substanceId_url = "https://datahub.chesapeakebay.net/api.json/Substances"

# Send a GET request to the list
response = requests.get(substanceId_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find substances with SubstanceId in substanceId_list
    filtered_substances = [substance for substance in data if substance['SubstanceId'] in substanceId_list]

    # Print the filtered substances
    for substance in filtered_substances:
        print(substance)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

{'SubstanceId': 21, 'SubstanceIdentificationName': 'CHLA', 'SubstanceIdentificationDescription': 'Active Chlorophyll-A'}
{'SubstanceId': 30, 'SubstanceIdentificationName': 'DIN', 'SubstanceIdentificationDescription': 'Dissolved Inorganic Nitrogen'}
{'SubstanceId': 31, 'SubstanceIdentificationName': 'DO', 'SubstanceIdentificationDescription': 'Dissolved Oxygen In MG/L'}
{'SubstanceId': 33, 'SubstanceIdentificationName': 'DO_SAT_P', 'SubstanceIdentificationDescription': 'DO Saturation Using Probe Units In Percent'}
{'SubstanceId': 34, 'SubstanceIdentificationName': 'DOC', 'SubstanceIdentificationDescription': 'Dissolved Organic Carbon'}
{'SubstanceId': 35, 'SubstanceIdentificationName': 'DON', 'SubstanceIdentificationDescription': 'Dissolved Organic Nitrogen'}
{'SubstanceId': 36, 'SubstanceIdentificationName': 'DOP', 'SubstanceIdentificationDescription': 'Dissolved Organic Phosphorus'}
{'SubstanceId': 49, 'SubstanceIdentificationName': 'FSS', 'SubstanceIdentificationDescription': 'Fixed 

In [60]:
url_idValues = url_idValues +'21,30,31,33,34,35,36,49,55,60,63,65,67,71,73,74,76,77,78,82,83,85,87,88,94,104,105,109,111,113,114,116,119,121,123'

And call the API

In [63]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)
output_file = '../data/plankton-patrol_ChesapeakeWaterQuality.csv'

# For CSV
api_to_csv_by_date(base_url, url_idValues, start_date, end_date, output_file)

# As dataframe
# waterQuality = api_to_dataframe_by_date(base_url, url_idValues, start_date, end_date)

Data from 07-29-2004/07-29-2009/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2009/07-29-2014/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2014/07-29-2019/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv
Data from 07-29-2019/07-29-2024/ saved to ../data/plankton-patrol_ChesapeakeWaterQuality.csv


Clean the data with help of VSCode Data Wrangle extension, then save the new csv. Since we are removing empty (or almost empty) columns, we will save over the previous csv.

In [None]:
def clean_waterQuality(df):
    # Drop empty columns: 'PrecisionPC', 'BiasPC'
    # Drop columns that are almost all empty or nan
    df = df.drop(columns=['PrecisionPC','BiasPC','Details','Problem'])
    return df

# Loaded variable 'df' from output_file
df = pd.read_csv(output_file)

df_clean = clean_waterQuality(df.copy())

# Same output_file as before
df_clean.to_csv(output_file, index=False)  

Since the `csv` is so large, let's add a print statement for the column headers.

In [5]:
import pandas as pd

output_file = '../data/plankton-patrol_ChesapeakeWaterQuality.csv'

df = pd.read_csv(output_file)

df.head()


  df = pd.read_csv(output_file)


Unnamed: 0,CBSeg2003,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,SampleReplicateType,Parameter,Qualifier,MeasureValue,Unit,Method,Lab,Latitude,Longitude,TierLevel
0,CB1TF,20021,BAY451,TWQM,MAIN,MDDNR,MDDNR,CB1.1,12/13/2006,10:58:00,...,FS1,CHLA,,0.598,UG/L,L01,MDHMH,39.54794,-76.08481,T3
1,CB1TF,20021,BAY451,TWQM,MAIN,MDDNR,MDDNR,CB1.1,12/13/2006,10:58:00,...,FS2,CHLA,,0.748,UG/L,L01,MDHMH,39.54794,-76.08481,T3
2,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,FS1,CHLA,,3.489,UG/L,L01,MDHMH,39.54794,-76.08481,T3
3,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,FS2,CHLA,,3.987,UG/L,L01,MDHMH,39.54794,-76.08481,T3
4,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,S1,CHLA,,4.984,UG/L,L01,MDHMH,39.54794,-76.08481,T3


### Living Resources

The Living Resources database has three data sources and we must access each separately. Also, we cannot access multiple projects at once like we could with the Water Quality database. We will write a function to fetch all projects for a given data source and save in the same csv.

In [None]:
import requests

# Define the URL with the substance list
projectIdentifier_url = "https://datahub.chesapeakebay.net/api.json/LivingResources/Projects"

# Function to get the project IDs
def get_project_ids(url, projectIdentifiers):

    # Send a GET request to the list
    response = requests.get(url)

    if response.status_code == 200:
        try:
            # Parse the JSON response
            data = response.json()
            
            # Filter the results to find projects with ProjectIdentifier in projectIdentifiers
            filtered_projects = [
                project for project in data 
                if project.get('ProjectIdentifier') in projectIdentifiers
            ]
            
            # Extract the ProjectName and ProjectId for the filtered results
            project_ids = [
                [project['ProjectName'], project['ProjectId']] for project in filtered_projects
            ]
            
            return project_ids

        except ValueError as e:
            print(f"Failed to parse JSON data: {e}")
            return []
    else:
        # Handle the error
        print(f"Failed to retrieve data from {url}: {response.status_code} - {response.text}")
        return []

And a general function for producting `csv`s from the project list

In [None]:
def api_to_csv_by_project(base_url, url_idValues, start_date, end_date, project_list, output_file):
    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')
    
    # Open the output file in write mode
    with open(output_file, 'w', newline='') as csvfile:
        csv_writer = None
        
        for project in project_list:
            project_name, project_id = project

            # Define the API endpoint URL with CSV format
            api_url = f"{base_url}{start_str}/{end_str}/{project_id}/{url_idValues}"
            
            # Send a GET request to the API endpoint
            response = requests.get(api_url)
            
            if response.status_code == 200:
                # Read the CSV response
                csv_data = response.text.splitlines()
                csv_reader = csv.reader(csv_data)
                
                # Initialize CSV writer and write header if not already done
                if csv_writer is None:
                    csv_writer = csv.writer(csvfile)
                    # Write header
                    csv_writer.writerow(next(csv_reader)) 
                
                # Write data rows
                for row in csv_reader:
                    csv_writer.writerow(row)
                
                print(f"Data from {project_name} saved to {output_file}")
            else:
                # Provide detailed error message
                print(f"Failed to retrieve data from {api_url}: {response.status_code} - {response.text}")

Also, the datasets might be missing columns, so we will need to create some dictionaries for the various stations. 
- Tidal Plankton does not have CBSeg2003, Latitude, or Longitude. This can be fixed with `https://datahub.chesapeakebay.net/api.JSON/LivingResources/TidalPlankton/Station/CBSeg2003`
- Tidal Benthic does not have CBSeg2003, Latitude, or Longitude. This can be fixed with `https://datahub.chesapeakebay.net/api.JSON/LivingResources/TidalBenthic/MonitorEvent/6-29-2010/6-29-2015/1/CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/` and `https://datahub.chesapeakebay.net/api.JSON/LivingResources/TidalBenthic/MonitorEvent/6-29-2010/6-29-2015/32/CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/`
- Nontidal Benthic does not have a CBSeg2003 option for data. It has a different station list, but does include latitude and longitude.

This will be updated as I download the files.

In [None]:
import pandas as pd

# URLs to fetch the data from
url1 = "https://datahub.chesapeakebay.net/api.csv/LivingResources/TidalBenthic/MonitorEvent/7-29-2004/6-29-2015/32/CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/"
url2 = "https://datahub.chesapeakebay.net/api.csv/LivingResources/TidalBenthic/MonitorEvent/7-29-2004/6-29-2015/1/CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/"

# Read in data from Benthic Monitor Events
df1 = pd.read_csv(url1)
df2 = pd.read_csv(url2)

# Combine the DataFrames
combined_df = pd.concat([df1, df2], ignore_index=True)

Drop columns that aren't useful for our dictionary using code from Data Wrangler and find the total number of stations.

In [None]:
def create_station_dictionary(df):
    # Drop columns that do not contain location data
    df = df.drop(columns=['ProjectIdentifier','FieldActivityId', 'Source', 'SampleType', 'SampleDate', 'Layer', 'PDepth', 'Salzone', 'SampleVolume', 'Units', 'TotalDepth', 'SampleTime'])

    # Remove rows with 'Station' empty
    df = df[df['Station'].notna()]

    # Remove duplicate rows
    df = df.drop_duplicates()

    return df

station_dictionary = create_station_dictionary(combined_df.copy())

print("Number of stations: ", station_dictionary.shape[0])

Number of stations:  750


#### Tidal Plankton

We should only need the data type `Reported`. The API has the form `TidalPlankton/Reported/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>`, where `<Attribute-Id>` is optional.

We define our `base_url`

In [19]:
base_url = 'http://datahub.chesapeakebay.net/api.csv/LivingResources/TidalPlankton/Reported/'

There are four projects: `MEZ`, `MIZ`, `PHYTP`, and `PICOP`

In [20]:
projectList_TidalPlankton = ['MEZ','MIZ','PHYTP','PICOP']

projectIds_TidalPlankton = get_project_ids(projectIdentifier_url,projectList_TidalPlankton)

And now we generate the `csv`

In [21]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)
output_file = '../data/plankton-patrol_ChesapeakeTidalPlankton.csv'

api_to_csv_by_project(base_url, GeographicID_values, start_date, end_date, projectIds_TidalPlankton, output_file)

Data from Tidal Phytoplankton Monitoring saved to ../data/plankton-patrol_ChesapeakeTidalPlankton.csv
Data from Tidal Picoplankton Monitoring saved to ../data/plankton-patrol_ChesapeakeTidalPlankton.csv
Data from Tidal Mesozooplankton Monitoring saved to ../data/plankton-patrol_ChesapeakeTidalPlankton.csv
Data from Tidal Microzooplankton Monitoring saved to ../data/plankton-patrol_ChesapeakeTidalPlankton.csv


In [22]:
tidalPlankton = pd.read_csv(output_file)

  tidalPlankton = pd.read_csv(output_file)


In [23]:
unique_values = tidalPlankton['Station'].unique()
print(f"Unique values in Station': {unique_values}")

Unique values in Station': ['CB5.2' 'CB7.4' 'CB7.3E' 'CB4.3C' 'CB3.3C' 'CB2.2' 'WE4.2' 'CB6.4'
 'CB6.1' 'CB1.1' 'EE3.2' 'EE3.1' 'EE1.1' nan 'Station']


#### Tidal Benthic Data

There are four (possibly five) relevant data types for the Tidal Benthic Data. Since Indicator of Benthic Integrity (IBI) is calculated, we will ignore it for now.

Tidal Benthic Data (1971-2013): taxonomic abundance and composition, biomass, sediment, water quality and Indicator of Benthic Integrity (IBI)

Each of these data types uses the projects BEN - Tidal Benthic Monitoring and SBEN - Special Tidal Benthic Monitoring.

The url API format is `http://datahub.chesapeakebay.net/api.{format}/LivingResources/TidalBenthic/<Data-Type>/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>`. It appears the `Attribut-Id` is optional.

In [24]:
base_url = "http://datahub.chesapeakebay.net/api.csv/LivingResources/TidalBenthic/"

Now we will define the list of data types and projects.

In [25]:
tidalBenthic_dataTypes = ['Sediment', 'BioMass','Taxonomic','WaterQuality']
tidalBenthic_projectIds = ['BEN','SBEN']

Now we loop through the data types let's run the API.

In [33]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)

projectIds = get_project_ids(projectIdentifier_url,tidalBenthic_projectIds)

for dataType in tidalBenthic_dataTypes:
    print("Data Type: ", dataType)
    new_base_url = f"{base_url}{dataType}/"
    output_file = f"../data/plankton-patrol_ChesapeakeBenthic{dataType}.csv"

    # For CSV
    api_to_csv_by_project(new_base_url, GeographicID_values, start_date, end_date, projectIds,output_file)

Data Type:  Sediment
Data from Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicSediment.csv
Data from Special Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicSediment.csv
Data Type:  BioMass
Data from Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicBioMass.csv
Data from Special Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicBioMass.csv
Data Type:  Taxonomic
Data from Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicTaxonomic.csv
Data from Special Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicTaxonomic.csv
Data Type:  WaterQuality
Data from Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicWaterQuality.csv
Data from Special Tidal Benthic Monitoring saved to ../data/plankton-patrol_ChesapeakeBenthicWaterQuality.csv
