# Introduction 

The Chesapeake Bay Program [DataHub](https://datahub.chesapeakebay.net/Home) contains many datasets for the Chesapeake Bay. 

The Water Quality dataset is still being updated and covers many field and lab parameters, including phosphorus, nitrogen, carbon, various other lab parameters (suspended solids, dissolved solids, chlorophyll-a, alkalinity, etc.), dissolved oxygen, pH, salinity, turbidity, water temperature, and climate condition. See [Guide to Using Chesapeake Bay Program Water Quality Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/wq_data_userguide_10feb12_mod.pdf) for more information.

The Living Resources database includes biological monitoring data from the Chesapeake Bay Program. From [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf):
>All Chesapeake Bay phytoplankton, historic zooplankton (including microzooplankton, mesozooplankton and gelatinous zooplankton) and benthos monitoring data and data documentation for Maryland and Virginia from 1984 to present can be obtained directly from the ... Living Resources Data Manager.

There is also a [DataHub API](https://datahub.chesapeakebay.net/API) which we will use to access the data.

These are the imports we need in the notebook. Collecting them here allows sections or subsections to be run without repeating the import statements.

In [1]:
import requests
from datetime import datetime
import csv
import pandas as pd

# Geographic restriction

The Chesapeake Bay segments are based on the circulation and salinity properties of different areas of the Bay.
Since we want the same geographic information for every dataset, let's go ahead and generate that now. We need to retrieve the `Geographical-Id`s from the [Geographical-Attribute, CBSeg2003 list](https://datahub.chesapeakebay.net/api.json/CBSeg2003). Since we only want the segments in the Bay proper, we will search for the segment names that start with `CB`. We will then add the sounds and bays that adjoin the Chesapeake Bay. We will not include the segments for bays on the other side of the Eastern Shore from the Chesapeake.

In [2]:
# Add CBSeg2003 to idValues to specify type of geographic ID
GeographicID_values = 'CBSeg2003/'

# Define the URL with the CBSeg2003 list
CBSeg2003_url = "http://datahub.chesapeakebay.net/api.json/CBSeg2003"

# Send a GET request to the list
response = requests.get(CBSeg2003_url)
filtered_segments=[]
if response.status_code == 200:
    try:
        # Parse the JSON response
        data = response.json()

        # Filter the results to find CBSeg2003Name that start with "CB"
        filtered_segments = [
            segment['CBSeg2003Id'] for segment in data
            if segment.get('CBSeg2003Name', '').startswith('CB') or 
               # Add the adjacent bays and Tangier Sound
               segment.get('CBSeg2003Name', '') in ['EASMH ', 'MOBPH ', 'TANMH ']
        ]

        # Append ids to idValues
        if filtered_segments:
            GeographicID_values += ','.join(map(str, filtered_segments)) + '/'
        else:
            print("No matching segments found.")
        
        print(GeographicID_values)
    
    except ValueError as e:
        print(f"Failed to parse JSON data: {e}")
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")


CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/
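The same filter can be written as a small pure function, which makes it easy to check without calling the API. This is only a minimal sketch: `build_geographic_id` and the sample records are hypothetical (the ids below are made up for illustration), and it assumes, as the trailing spaces in the name list above suggest, that segment names may carry trailing whitespace, so it strips them before comparing.

```python
def build_geographic_id(segments, extra_names=("EASMH", "MOBPH", "TANMH")):
    """Return a 'CBSeg2003/<id>,<id>,.../' string for the selected segments."""
    selected = []
    for segment in segments:
        # Strip whitespace so 'EASMH ' and 'EASMH' are treated alike
        name = segment.get("CBSeg2003Name", "").strip()
        if name.startswith("CB") or name in extra_names:
            selected.append(str(segment["CBSeg2003Id"]))
    return "CBSeg2003/" + ",".join(selected) + "/"


# Hypothetical records shaped like the API response (ids are made up)
sample_segments = [
    {"CBSeg2003Id": 1, "CBSeg2003Name": "CB1TF "},
    {"CBSeg2003Id": 2, "CBSeg2003Name": "EASMH "},
    {"CBSeg2003Id": 3, "CBSeg2003Name": "POTMH "},
]
sample_geo_id = build_geographic_id(sample_segments)
print(sample_geo_id)
```

Keeping the selection logic separate from the HTTP request means the filter can be tested on a handful of records before it is applied to the full list.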


# Living Resources

The Living Resources database consists of three databases, and we must access each one separately. See [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf) for details.

That guide is out of date for the Plankton Database, but the Plankton Database does not require merging except by project. The Tidal Benthic Database requires merging by data type as well. We will use the monitoring event files by project before merging.

First, let's write a function that fetches the relevant project IDs and returns them in a dictionary. The function takes a list of project identifiers (as strings) and queries the relevant URL. The default API URL is the Living Resources Projects list in JSON format.

Note that `ProjectIdentifier` is the abbreviation for the project name, while `ProjectId` is the number we need for the API.

In [3]:
# Function to get the project IDs
def get_project_dict(projectIdentifiers,url= "https://datahub.chesapeakebay.net/api.json/LivingResources/Projects"):

    # Send a GET request to the list
    response = requests.get(url)

    if response.status_code == 200:
        try:
            # Parse the JSON response
            data = response.json()
            
            # Filter the results to find projects with ProjectIdentifier in projectIdentifiers
            filtered_projects = [
                project for project in data 
                if project.get('ProjectIdentifier') in projectIdentifiers
            ]
            
            # Extract the ProjectName and ProjectId for the filtered results
            project_info = {
                project['ProjectId']: {
                     project['ProjectIdentifier'] : project['ProjectName']
                } for project in filtered_projects
            }
            
            return project_info

        except ValueError as e:
            print(f"Failed to parse JSON data: {e}")
            return {}
    else:
        # Handle the error
        print(f"Failed to retrieve data from {url}: {response.status_code} - {response.text}")
        return {}
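The dictionary returned by `get_project_dict` is nested as `{ProjectId: {ProjectIdentifier: ProjectName}}`. The sketch below shows one way to unpack that shape. The ids and names in `project_info` are made up for illustration and are not values returned by the API.

```python
# Hypothetical return value of get_project_dict (ids and names are made up)
project_info = {
    101: {"PHYTP": "Example phytoplankton project"},
    102: {"PICOP": "Example picoplankton project"},
}

# Flatten {ProjectId: {ProjectIdentifier: ProjectName}} into tuples
rows = [
    (project_id, identifier, name)
    for project_id, info in project_info.items()
    for identifier, name in info.items()
]

# Reverse lookup: ProjectIdentifier -> ProjectId
id_by_identifier = {identifier: project_id for project_id, identifier, _ in rows}
print(id_by_identifier)
```

A reverse lookup like this can be handy when only the abbreviation is known and the numeric id is needed for a URL.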

Now let's write a function to get the monitoring event data set for each project. The data will be stored in a dictionary with the ProjectIdentifier as the key.

This function takes the same list of identifiers as `get_project_dict`, a start date and an end date (as `datetime` objects, which are formatted as MM-DD-YYYY for the URL), and the geographic attribute list. The default API URL points to the Living Resources CSV endpoint, but the function works for other monitoring events. The default geographic identifier is `GeographicID_values`.

The general form for the API url is: `http://datahub.chesapeakebay.net/api.csv/LivingResources/<Source>/MonitorEvent/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>`
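To make the URL pattern concrete, here is a minimal sketch that assembles one MonitorEvent URL from its parts. The helper name `build_monitor_event_url`, the project id `7`, and the shortened geographic attribute are assumptions chosen for illustration only.

```python
from datetime import datetime


def build_monitor_event_url(source, start_date, end_date, project_id, geographic_id,
                            base_url="http://datahub.chesapeakebay.net/api.csv/LivingResources/"):
    """Assemble a MonitorEvent URL following the general form above."""
    # The API expects dates in MM-DD-YYYY form
    start_str = start_date.strftime("%m-%d-%Y")
    end_str = end_date.strftime("%m-%d-%Y")
    return f"{base_url}{source}/MonitorEvent/{start_str}/{end_str}/{project_id}/{geographic_id}"


example_url = build_monitor_event_url(
    "TidalPlankton",
    datetime(2004, 1, 1),
    datetime(2024, 8, 3),
    7,                     # hypothetical ProjectId
    "CBSeg2003/10,11/",    # shortened geographic attribute for illustration
)
print(example_url)
```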

In [4]:
# Function to get the monitor event data
def fetch_monitor_data_by_project(source,projectIdentifiers,start_date,end_date,base_url="http://datahub.chesapeakebay.net/api.csv/LivingResources/",geograhic_id=GeographicID_values):
    # get project-id list
    projects = get_project_dict(projectIdentifiers)

    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')

    # Dictionary to store DataFrames for each project
    project_dataframes = {}

    # create API url for each project
    # create dataframe for each project
    for project_id, info in projects.items():
        project_abr, project_name = next(iter(info.items()))

        api_url=f"{base_url}{source}/MonitorEvent/{start_str}/{end_str}/{project_id}/{geograhic_id}"

        # Fetch data from the URL, skipping totals row
        df = pd.read_csv(api_url, skipfooter=1, engine='python')

        # Add ProjectIdentifier column if it does not exist
        if 'ProjectIdentifier' not in df.columns:
            df['ProjectIdentifier'] = project_abr

        # Store the DataFrame in the dictionary using the project abbreviation as the key
        project_dataframes[project_abr] = df

    return project_dataframes
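The `skipfooter=1` argument is what drops the trailing totals row, and it is the reason the function sets `engine='python'`: pandas does not support `skipfooter` with the default C parser. The sketch below illustrates the effect on a small in-memory CSV. The CSV text, including the wording of its last line, is made up and only stands in for a real API response.

```python
import io

import pandas as pd

# Hypothetical CSV text with a trailing summary row (made up for illustration)
csv_text = (
    "Station,SampleDate,TotalDepth\n"
    "CB6.1,1/12/2004,12.5\n"
    "CB6.4,1/12/2004,9.5\n"
    "Total records: 2\n"
)

# skipfooter is only supported by the Python parsing engine
events = pd.read_csv(io.StringIO(csv_text), skipfooter=1, engine="python")
print(events.shape)
```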

Finally, we need functions to download the project data and combine it with the monitor data. The latter is a bit trickier for the Tidal Benthic database, as it does not use consistent column names across the different datasets.

The general form for the API url is: `http://datahub.chesapeakebay.net/api.csv/LivingResources/<Source>/<Data-Type>/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>`

In [5]:
# function to get the data record
def fetch_recorded_data_by_project(source,data_type,projectIdentifiers,start_date,end_date,base_url="http://datahub.chesapeakebay.net/api.csv/LivingResources/",geograhic_id=GeographicID_values):
    # get project-id list
    projects = get_project_dict(projectIdentifiers)

    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')

    # Dictionary to store DataFrames for each project
    project_dataframes = {}

    # create API url for each project
    # create dataframe for each project
    for project_id, info in projects.items():
        project_abr, project_name = next(iter(info.items()))

        api_url=f"{base_url}{source}/{data_type}/{start_str}/{end_str}/{project_id}/{geograhic_id}"

        # Fetch data from the URL, skipping totals row
        df = pd.read_csv(api_url, skipfooter=1, engine='python')

        # Add ProjectIdentifier column if it does not exist
        if 'ProjectIdentifier' not in df.columns:
            df['ProjectIdentifier'] = project_abr

        # Store the DataFrame in the dictionary using the project abbreviation as the key
        project_dataframes[source + project_abr] = df

    return project_dataframes

We will handle the column naming discrepancies by renaming columns in the data records. The function below will:
- Pair the monitor event DataFrame and the data record DataFrame that belong to the same project
- Use the monitor event data as a lookup table keyed on the values of the columns that appear in both DataFrames, and fill in any columns that exist in the monitor event data but not in the data records
- Combine the merged DataFrames for all projects into one DataFrame
- Save the data as a CSV file
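The lookup step in the list above can be pictured with a small example. Note that this sketch swaps in `DataFrame.merge` with `how="left"` rather than the row-by-row lookup used in this notebook; the two should agree only when each combination of key values matches at most one monitor event. All values in the toy tables are made up.

```python
import pandas as pd

# Toy monitor event table: one row per sampling event (values are made up)
monitor_df = pd.DataFrame({
    "Station": ["CB6.1", "CB6.4"],
    "SampleDate": ["1/12/2004", "1/12/2004"],
    "TotalDepth": [12.5, 9.5],
})

# Toy data records: several rows per event, without TotalDepth
records_df = pd.DataFrame({
    "Station": ["CB6.1", "CB6.1", "CB6.4"],
    "SampleDate": ["1/12/2004", "1/12/2004", "1/12/2004"],
    "ReportingValue": [58.0, 97.0, 77.0],
})

# Columns shared by both tables act as the lookup key
common_columns = [col for col in records_df.columns if col in monitor_df.columns]

# A left merge keeps every data record and copies in the event-level columns
merged = records_df.merge(monitor_df, on=common_columns, how="left")
print(merged)
```

Each data record keeps its own `ReportingValue` and gains the `TotalDepth` of the event it belongs to.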

In [6]:
def remove_columns_from_dict(dictionary, columns_to_remove):

    # Iterate over the dictionary
    for key, df in dictionary.items():
        # Drop specified columns
        if all(col in df.columns for col in columns_to_remove):
            df.drop(columns=columns_to_remove, inplace=True)
        else:
            print(f"Some columns to remove were not found in DataFrame for key: {key}")
            missing_cols = [col for col in columns_to_remove if col not in df.columns]
            if missing_cols:
                print(f"Missing columns: {missing_cols}")
    
    return dictionary
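Note that `remove_columns_from_dict` leaves a DataFrame untouched when any of the listed columns is missing. If dropping whichever columns are present is preferred, one possible alternative is the `errors="ignore"` option of `DataFrame.drop`. The sketch below uses a made-up dictionary of DataFrames and, unlike the function above, returns new DataFrames instead of modifying them in place.

```python
import pandas as pd

# Toy dictionary of DataFrames (keys and values are made up)
frames = {
    "A": pd.DataFrame({"Station": ["CB6.1"], "CBSeg2003": [None]}),
    "B": pd.DataFrame({"Station": ["CB6.4"]}),
}

columns_to_remove = ["CBSeg2003", "CBSeg2003Description"]

# errors="ignore" drops the listed columns that exist and skips the rest
cleaned = {
    key: df.drop(columns=columns_to_remove, errors="ignore")
    for key, df in frames.items()
}
print({key: list(df.columns) for key, df in cleaned.items()})
```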

In [21]:
def merge_and_save_data(monitor_event_data, data_records, output_csv_path):
    # Dictionary to store merged DataFrames for each project
    merged_dataframes = {}

    for monitor_key in monitor_event_data.keys():
        # Find the corresponding key in data_records that contains the monitor_key as a substring
        # data_records keys also include data source
        data_record_key = next((key for key in data_records.keys() if monitor_key in key), None)
        
        if data_record_key is None:
            print(f"No matching data record found for monitor key: {monitor_key}")
            continue

        # Get the corresponding dataframes for the project
        monitor_df = monitor_event_data[monitor_key]
        data_record_df = data_records[data_record_key]

        # Find common columns to merge on
        common_columns = list(set(monitor_df.columns).intersection(set(data_record_df.columns)))
        

        # Find "missing" columns
        missing_columns = list(set(monitor_df.columns) - set(data_record_df.columns))


        # Ensure common columns have the same data type
        for col in common_columns:
            if monitor_df[col].dtype != data_record_df[col].dtype:
                target_dtype = monitor_df[col].dtype
                print(data_record_key,col, "type converted")
                try:
                    data_record_df[col] = data_record_df[col].astype(target_dtype)
                except ValueError:
                    monitor_df[col] = monitor_df[col].astype(str)
                    data_record_df[col] = data_record_df[col].astype(str)


        # Adding missing columns to data_record_df with default values
        for col in missing_columns:
            data_record_df[col] = ''

        # Group monitor_df by key columns
        grouped = monitor_df.groupby(common_columns)

        # Apply lambda to convert each group into a list of dictionaries
        monitor_dict = grouped.apply(lambda x: x[missing_columns].to_dict('records'),include_groups=False).to_dict()

        # Iterate over rows in data_record_df
        for index, row in data_record_df.iterrows():
            # Create a key tuple from the row's key columns
            key = tuple(row[col] for col in common_columns)
            if key in monitor_dict:
                # Iterate over records corresponding to the key
                for record in monitor_dict[key]:
                    # Update data_record_df with values from the record
                    for col in missing_columns:
                        data_record_df.at[index, col] = record[col]

        # Store the merged dataframe in the dictionary
        merged_dataframes[data_record_key] = data_record_df

    # Combine all merged dataframes into one
    combined_df = pd.concat(merged_dataframes.values(), ignore_index=True)

    # Print the shape of the combined dataframe
    print("Shape of combined_df:", combined_df.shape)

    #Reorder the columns
    desired_order=['CBSeg2003','CBSeg2003Description','Station','Latitude','Longitude','SampleType','FieldActivityId','SampleDate','SampleTime','Layer','TotalDepth','Parameter','ReportingValue','ReportingUnit']

    # Filter out columns in desired_order that do not exist in the DataFrame
    valid_order = [col for col in desired_order if col in combined_df.columns]

    # Get the columns that are not in the desired_order
    remaining_columns = [col for col in combined_df.columns if col not in valid_order]

    # Combine valid_order with remaining_columns to maintain the desired order
    final_order = valid_order + remaining_columns

    # Reorder the DataFrame columns
    combined_df = combined_df[final_order]

    # Save the combined dataframe to a CSV file
    combined_df.to_csv(output_csv_path, index=False, encoding='utf-8')

    return combined_df

## Plankton

As mentioned above, the plankton database is much smaller than the Users Guide suggests. Let's look at the monitoring event data for the relevant regions.

In [8]:
start_date = datetime(2004, 1, 1)
end_date = datetime(2024, 8, 3)
projectList_TidalPlankton = ['MEZ','MIZ','PHYTP','PICOP']
plankton_monitor_events_dict = fetch_monitor_data_by_project("TidalPlankton",projectList_TidalPlankton,start_date,end_date)

In [9]:
plankton_combined_monitor_events = pd.concat(plankton_monitor_events_dict, ignore_index=True)

plankton_combined_monitor_events.shape


(4738, 17)

Next, read in the plankton data records as a dictionary of DataFrames.

In [10]:
plankton_records_dict = fetch_recorded_data_by_project("TidalPlankton","Reported",projectList_TidalPlankton,start_date,end_date)

Now we have a problem: the `CBSeg2003` and `CBSeg2003Description` columns exist in the data records, but all of their values are missing. We need to remove these columns before merging.
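One way to confirm that a column holds no values before dropping it is to combine `isna()` with `all()`. The sketch below runs that check on a toy DataFrame that stands in for one entry of `plankton_records_dict`; the values are made up.

```python
import pandas as pd

# Toy stand-in for one entry of plankton_records_dict (values are made up)
records = pd.DataFrame({
    "CBSeg2003": [None, None],
    "CBSeg2003Description": [None, None],
    "Station": ["CB6.1", "CB6.4"],
})

columns_to_check = ["CBSeg2003", "CBSeg2003Description"]

# True for a column only when every value in it is missing
all_missing = records[columns_to_check].isna().all()
print(all_missing)
```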

In [11]:
plankton_records_dict = remove_columns_from_dict(plankton_records_dict,['CBSeg2003','CBSeg2003Description'])

In [19]:
output_file ="../data/plank_ChesapeakeTidalPlankton.csv"
merge_and_save_data(plankton_monitor_events_dict, plankton_records_dict, output_file)


Shape of combined_df: (93467, 30)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,SampleType,FieldActivityId,SampleDate,SampleTime,Layer,...,Method,NODCCode,SPECCode,SerialNumber,ProjectIdentifier,DataType,SampleVolume,Units,Salzone,PDepth
0,MOBPH,Mobjack Bay-Polyhaline Region,WE4.2,37.24181,-76.38634,C,170820,1/12/2004,10:28:00,AP,...,PH102,07020301,58.0,20041122WE4.,PHYTP,PHYTP,15.0,Liter,M,3.0
1,MOBPH,Mobjack Bay-Polyhaline Region,WE4.2,37.24181,-76.38634,C,170820,1/12/2004,10:28:00,AP,...,PH102,07030501,97.0,20041122WE4.,PHYTP,PHYTP,15.0,Liter,M,3.0
2,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,170822,1/12/2004,12:25:00,AP,...,PH102,07030101,77.0,20041122CB6.,PHYTP,PHYTP,15.0,Liter,M,3.0
3,MOBPH,Mobjack Bay-Polyhaline Region,WE4.2,37.24181,-76.38634,C,170820,1/12/2004,10:28:00,BP,...,PH102,07020205,156.0,20041122WE4.,PHYTP,PHYTP,15.0,Liter,M,12.5
4,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,170822,1/12/2004,12:25:00,AP,...,PH102,0703100114,105.0,20041122CB6.,PHYTP,PHYTP,15.0,Liter,M,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93462,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.1,37.58847,-76.16216,C,644643,12/8/2021,13:20:00,BP,...,PP101,AUTO_PICO,1148,20211208CB6.1,PICOP,PICOP,15.0,Liter,M,12.5
93463,CB7PH,Chesapeake Bay-Polyhaline Region,CB7.3E,37.22875,-76.05383,C,644649,12/9/2021,10:32:00,BP,...,PP101,AUTO_PICO,1148,20211209CB7.3E,PICOP,PICOP,15.0,Liter,P,18.5
93464,CB7PH,Chesapeake Bay-Polyhaline Region,CB7.3E,37.22875,-76.05383,C,644649,12/9/2021,10:32:00,AP,...,PP101,AUTO_PICO,1148,20211209CB7.3E,PICOP,PICOP,15.0,Liter,P,1.0
93465,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,644647,12/9/2021,13:57:00,BP,...,PP101,AUTO_PICO,1148,20211209CB6.4,PICOP,PICOP,15.0,Liter,P,9.5
