# Introduction 

The Chesapeake Bay Program [DataHub](https://datahub.chesapeakebay.net/Home) contains many datasets for the Chesapeake Bay. 

The Water Quality Data is still updating and measures many field and lab parameters including: phosphorus, nitrogen, carbon, various other lab parameters (suspended solids, disolved solids, chlorophyll-a, alkalinkity, etc), dissolved oxygen, pH, salinity, turbitity, water temperature, and climate condition. See [Guide to Using Chesapeake Bay Program Water Quality Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/wq_data_userguide_10feb12_mod.pdf) for more information.

The Living Resources database includes biological monitoring data from the Chesapeake Bay Program. From the [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf):
>All Chesapeake Bay phytoplankton, historic zooplankton (including microzooplankton, mesozooplankton and gelatinous zooplankton) and benthos monitoring data and data documentation for Maryland and Virginia from 1984 to present can be obtained directly from the ... Living Resources Data Manager.

There is also a [DataHub API](https://datahub.chesapeakebay.net/API) which we will use to access the data.

Imports we need in the notebook. This allows sections or subsections to be run without needing to repeat import statements

In [1]:
import requests
from datetime import datetime
from dateutil.relativedelta import relativedelta

import numpy as np
import pandas as pd

# Geographic restriction

The Chesapeake Bay segements are based on circulation and salinity properties of different areas of the Bay.
Since we want the same geographic information for every dataset, let's go ahead and generate that now. We need to retrieve the `Geographical-Id`s from the [Geographical-Attribute, CBSeg2003 list](https://datahub.chesapeakebay.net/api.json/CBSeg2003). Since we only want the segments in the Bay proper, we will search for the segment names that start with `CB`. We will then add the sounds and bays that adjoin the Chesapeake Bay. We will not include the segments for bays on the other side of the Eastern Shore from the Chesapeake.

In [2]:
# Add CBSeg2003 to idValues to specify type of geographic ID
GeographicID_values = 'CBSeg2003/'

# Define the URL with the CBSeg2003 list
CBSeg2003_url = "http://datahub.chesapeakebay.net/api.json/CBSeg2003"

# Send a GET request to the list
response = requests.get(CBSeg2003_url)
filtered_segments=[]
if response.status_code == 200:
    try:
        # Parse the JSON response
        data = response.json()

        # Filter the results to find CBSeg2003Name that start with "CB"
        filtered_segments = [
            segment['CBSeg2003Id'] for segment in data
            if segment.get('CBSeg2003Name', '').startswith('CB') or 
                # Add the aadjacent bays and Tangier sound
               segment.get('CBSeg2003Name', '') in ['EASMH ', 'MOBPH ', 'TANMH ']
        ]

        # Append ids to idValues
        if filtered_segments:
            GeographicID_values += ','.join(map(str, filtered_segments)) + '/'
        else:
            print("No matching segments found.")
        
        print(GeographicID_values)
    
    except ValueError as e:
        print(f"Failed to parse JSON data: {e}")
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")


CBSeg2003/10,11,12,13,14,15,16,17,28,49,84/


# Water Quality

CBP Water Quality Data (1984-present): measured and calculated physical and nutrient parameters

The long list of parameters includes phosphorus, nitrogen, carbon, suspended solids, disolved solids, chlorophyll_a, pH, salinity, turbitidy, water temperature, and atmospheric conditions.

The Water Quality data contains quite a lot of data, so we will define functions to download 5 years of data at once. This method will also work for other large datasets, although the Living Resources database can download without it.

In [3]:
def generate_date_ranges(start_date, end_date, delta_years=5):
    # Generate date ranges of specified years between start_date and end_date.
    date_list =[]
    current_date = start_date
    while current_date < end_date:
        next_date = current_date + relativedelta(years=delta_years)
        date_list.append(current_date.strftime('%m-%d-%Y') +'/' 
                         + next_date.strftime('%m-%d-%Y') +'/') 
        current_date = next_date
    return(date_list)

We also define a function to call the API for our entire desired time frame and output the data in one CSV. Note that for each database, we will need a `base_url` which points to the desired database and output format, a `url_idValues` which tells the API which values to download (the length of this url also depends on the database), and an `output_file`.

In [4]:
def api_to_csv_by_date(base_url, url_idValues, start_date, end_date, output_file):
    # Generate list of date ranges
    date_list = generate_date_ranges(start_date, end_date)
    
    # Open the output file in append mode
    with open(output_file, 'w', newline='') as csvfile:
        csv_writer = None
        
        for dates in date_list:
            
            # Define the API endpoint URL with CSV format
            api_url = f"{base_url}{dates}{url_idValues}"

            try:
                response = requests.get(api_url)  # Added timeout for reliability
                response.raise_for_status()  # Raise an exception for HTTP errors
            except requests.RequestException as e:
                # Handle the error
                print(f"Failed to retrieve data from {api_url}: {e}")
                continue
            
            # Read the CSV response
            csv_data = response.text.splitlines()
            csv_reader = csv.reader(csv_data)
            
            # Write the CSV data to the file
            if csv_writer is None:
                # Write the header only once
                csv_writer = csv.writer(csvfile)
                csv_writer.writerow(next(csv_reader))  # Write header
            
            for row in csv_reader:
                csv_writer.writerow(row)
            
            print(f"Data from {dates} saved to {output_file}")
    # Generate list of date ranges
    date_list = generate_date_ranges(start_date, end_date)
    
    # Initialize an empty DataFrame
    all_data = pd.DataFrame()
    
    for dates in date_list:
        # Define the API endpoint URL with CSV format
        api_url = base_url + dates + url_idValues
        # Send a GET request to the API endpoint
        response = requests.get(api_url)
        
        # Check if the request was successful
        if response.status_code == 200:
            # Read the CSV response into a DataFrame
            data = pd.read_csv(api_url)
            
            # Check if the last row contains 'Total_Records:'
            if data.iloc[-1].astype(str).str.contains('Total_Records:').any():
                data = data[:-1]  # Remove the last row
            
            # Accumulate data
            all_data = pd.concat([all_data, data], ignore_index=True)
            
            print(f"Data from {dates} added to DataFrame")
        else:
            # Handle the error
            print(f"Failed to retrieve data from {dates}: {response.status_code}")

The format for the API url for 
Water Quality is: `http://datahub.chesapeakebay.net/api.{format}/WaterQuality/WaterQuality/<Start-Date>/<End-Date>/<Data-Stream-Value>/<Program-Id>/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`

We will start with defining the `base_url` which is everything before the dates.

In [5]:
# Define base URL
base_url = 'https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/'

We want all [Data-Stream-Value](https://datahub.chesapeakebay.net/api.json/DataStreams) data, so that part of our url is `0,1`

We also want all three programs in the [Program-Id list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Programs) `2,4,6`

We can update our url to `https://datahub.chesapeakebay.net/api.CSV/WaterQuality/WaterQuality/<Start-Date>/<End-Date>/0,1/2,4,6/<Project-Id>/<Geographical-Attribute>/<Attribute-Id>/<Substance-Id>`. We will deal with the date last.

In [6]:
# Define the id values that go after <Start-Date>/<End-Date>
url_idValues = '0,1/2,4,6/'

Let's get all project ids 
[Project-ID list for water quality](https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects).
Note that some projects might not have any water quality data for the desired segments, but that does not seem to cause a problem with the API timing out.

In [7]:
# Define the URL with the Projects list
projectList_url = "https://datahub.chesapeakebay.net/api.json/WaterQuality/Projects"

# Send a GET request to the list
response = requests.get(projectList_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Find ProjectId
    projectIds = [segment['ProjectId'] for segment in data]
    
    # Append ids to idValues
    for projectId in projectIds:
        url_idValues = url_idValues + str(projectId) +','

    # Remove final comma, append /
    url_idValues = url_idValues[:-1] + '/'
    print(url_idValues)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

0,1/2,4,6/12,13,14,15,35,36,2,3,11,7,33,34,23,24,16/


Now we append the geographic information

In [8]:
url_idValues =url_idValues + GeographicID_values

Now we need to append the `Substance-Id`. Now, there are many substances that are not measured in the dataset and regions we want. Using the online download form, it looks like the largest possible list is `21,30,31,35,36,49,55,60,63,65,67,71,73,74,77,78,82,83,85,87,88,94,104,105,109,111,114,116,121,123,33,76,113,34,119`. I don't see a more systematic way to do this step, since different stations are measuring different things.

In addition to updating the `url_idValues`, let's also pull a dictionary for these substances

In [9]:
substanceId_list = [21,30,31,33,34,35,36,49,55,60,63,65,67,71,73,74,76,77,78,82,83,85,87,88,94,104,105,109,111,113,114,116,119,121,123]


# Define the URL with the substance list
substanceId_url = "https://datahub.chesapeakebay.net/api.json/Substances"

# Send a GET request to the list
response = requests.get(substanceId_url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    # Filter the results to find substances with SubstanceId in substanceId_list
    filtered_substances = [substance for substance in data if substance['SubstanceId'] in substanceId_list]

    # Print the filtered substances
    for substance in filtered_substances:
        print(substance)
else:
    # Handle the error
    print(f"Failed to retrieve data: {response.status_code}")

{'SubstanceId': 21, 'SubstanceIdentificationName': 'CHLA', 'SubstanceIdentificationDescription': 'Active Chlorophyll-A'}
{'SubstanceId': 30, 'SubstanceIdentificationName': 'DIN', 'SubstanceIdentificationDescription': 'Dissolved Inorganic Nitrogen'}
{'SubstanceId': 31, 'SubstanceIdentificationName': 'DO', 'SubstanceIdentificationDescription': 'Dissolved Oxygen In MG/L'}
{'SubstanceId': 33, 'SubstanceIdentificationName': 'DO_SAT_P', 'SubstanceIdentificationDescription': 'DO Saturation Using Probe Units In Percent'}
{'SubstanceId': 34, 'SubstanceIdentificationName': 'DOC', 'SubstanceIdentificationDescription': 'Dissolved Organic Carbon'}
{'SubstanceId': 35, 'SubstanceIdentificationName': 'DON', 'SubstanceIdentificationDescription': 'Dissolved Organic Nitrogen'}
{'SubstanceId': 36, 'SubstanceIdentificationName': 'DOP', 'SubstanceIdentificationDescription': 'Dissolved Organic Phosphorus'}
{'SubstanceId': 49, 'SubstanceIdentificationName': 'FSS', 'SubstanceIdentificationDescription': 'Fixed 

In [10]:
url_idValues = url_idValues +'21,30,31,33,34,35,36,49,55,60,63,65,67,71,73,74,76,77,78,82,83,85,87,88,94,104,105,109,111,113,114,116,119,121,123'

And call the API

In [11]:
start_date = datetime(2004, 7, 29)
end_date = datetime(2024, 7, 29)
output_file = '../data/plank_ChesapeakeWaterQuality.csv'

# For CSV
api_to_csv_by_date(base_url, url_idValues, start_date, end_date, output_file)

# As dataframe
# waterQuality = api_to_dataframe_by_date(base_url, url_idValues, start_date, end_date)

NameError: name 'csv' is not defined

Clean the data with help of VSCode Data Wrangle extension, then save the new csv. Since we are removing empty (or almost empty) columns, we will save over the previous csv.

In [None]:
def clean_waterQuality(df):
    # Drop empty columns: 'PrecisionPC', 'BiasPC'
    # Drop columns that are almost all empty or nan
    df = df.drop(columns=['PrecisionPC','BiasPC','Details','Problem'])
    return df

# Loaded variable 'df' from output_file
df = pd.read_csv(output_file)

df_clean = clean_waterQuality(df.copy())

# Same output_file as before
df_clean.to_csv(output_file, index=False)  

Since the `csv` is so large, let's add a print statement for the column headers.

In [None]:
import pandas as pd

output_file = '../data/plank_ChesapeakeWaterQuality.csv'

df = pd.read_csv(output_file)

df.head()


  df = pd.read_csv(output_file)


Unnamed: 0,CBSeg2003,EventId,Cruise,Program,Project,Agency,Source,Station,SampleDate,SampleTime,...,SampleReplicateType,Parameter,Qualifier,MeasureValue,Unit,Method,Lab,Latitude,Longitude,TierLevel
0,CB1TF,20021,BAY451,TWQM,MAIN,MDDNR,MDDNR,CB1.1,12/13/2006,10:58:00,...,FS1,CHLA,,0.598,UG/L,L01,MDHMH,39.54794,-76.08481,T3
1,CB1TF,20021,BAY451,TWQM,MAIN,MDDNR,MDDNR,CB1.1,12/13/2006,10:58:00,...,FS2,CHLA,,0.748,UG/L,L01,MDHMH,39.54794,-76.08481,T3
2,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,FS1,CHLA,,3.489,UG/L,L01,MDHMH,39.54794,-76.08481,T3
3,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,FS2,CHLA,,3.987,UG/L,L01,MDHMH,39.54794,-76.08481,T3
4,CB1TF,20236,BAY452,TWQM,MAIN,MDDNR,MDDNR,CB1.1,1/12/2007,13:00:00,...,S1,CHLA,,4.984,UG/L,L01,MDHMH,39.54794,-76.08481,T3


# Living Resources

The Living Resources database has three databases and we must access each separately. The [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf) 

This document is out of date for the Plankton Database, but the Plankton Database does not require merging except by project. The Tidal Benthic Database requires merging by data type, as well. We will use the monitoring event files by project as a dictionary for the data reporting files.

### Common Functions

First let's write a function to fetch all relevant project ids and create a dictionary. This function takes in a list of identifiers (as strings) and parses the relevent url. The default API url is the Living Resources Projects list JSON.

Note that `ProjectIdentifier` is the abbreviation for the project name, while `ProjectId` is the number we need for the API.

In [5]:
# Function to get the project IDs
def get_project_dict(projectIdentifiers,url= "https://datahub.chesapeakebay.net/api.json/LivingResources/Projects"):

    # Send a GET request to the list
    response = requests.get(url)

    if response.status_code == 200:
        try:
            # Parse the JSON response
            data = response.json()
            
            # Filter the results to find projects with ProjectIdentifier in projectIdentifiers
            filtered_projects = [
                project for project in data 
                if project.get('ProjectIdentifier') in projectIdentifiers
            ]
            
            # Extract the ProjectName and ProjectId for the filtered results
            project_info = {
                project['ProjectId']: {
                     project['ProjectIdentifier'] : project['ProjectName']
                } for project in filtered_projects
            }
            
            return project_info

        except ValueError as e:
            print(f"Failed to parse JSON data: {e}")
            return {}
    else:
        # Handle the error
        print(f"Failed to retrieve data from {url}: {response.status_code} - {response.text}")
        return {}

Now let's write a function to get the monitoring event data set for each project. The data will be stored in a dictionary with the ProjectIdentifier as the key.

This function will take in the same list of identifiers as `get_project_dict`, a start date in MM-DD-YYYY form, an end date in MM-DD-YYYY form, and the geographic attribute list. The default API url is the Living Resources csv file, but the function works for other monitoring events. The default geographic identifier is `GeographicID_values`

The general form for the API url is: `http://datahub.chesapeakebay.net/api.csv/LivingResources/<Source>/MonitorEvent/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>`

In [6]:
# Function to get the monitor event data
def fetch_monitor_data_by_project(source,projectIdentifiers,start_date,end_date,base_url="http://datahub.chesapeakebay.net/api.csv/LivingResources/",geograhic_id=GeographicID_values):
    # get project-id list
    projects = get_project_dict(projectIdentifiers)

    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')

    # Dictionary to store DataFrames for each project
    project_dataframes = {}

    # create API url for each project
    # create dataframe for each project
    for project_id, info in projects.items():
        project_abr, project_name = next(iter(info.items()))

        api_url=f"{base_url}{source}/MonitorEvent/{start_str}/{end_str}/{project_id}/{geograhic_id}"

        # Fetch data from the URL, skipping totals row
        df = pd.read_csv(api_url, skipfooter=1, engine='python')

        # Add ProjectIdentifier column if it does not exist
        if 'ProjectIdentifier' not in df.columns:
            df['ProjectIdentifier'] = project_abr

        # Store the DataFrame in the dictionary using the project abbreviation as the key
        project_dataframes[project_abr] = df

    return project_dataframes

Finally, we need functions to download the project data and combine with the monitor data. The later is a bit trickier for the Tidal Benthic database, as it does not use consistent column names for the different datasets.

The general form for the API url is: `http://datahub.chesapeakebay.net/api.csv/LivingResources/<Source>/<Data-Type>/<Start-Date>/<End-Date>/<Project-Id>/<Geographical-Attribute>`

In [7]:
def fetch_recorded_data_by_project(source,data_type,projectIdentifiers,start_date,end_date,base_url="http://datahub.chesapeakebay.net/api.csv/LivingResources/",geograhic_id=GeographicID_values):
    # get project-id list
    projects = get_project_dict(projectIdentifiers)

    # Format the dates
    start_str = start_date.strftime('%m-%d-%Y')
    end_str = end_date.strftime('%m-%d-%Y')

    # Dictionary to store DataFrames for each project
    project_dataframes = {}

    # create API url for each project
    # create dataframe for each project
    for project_id, info in projects.items():
        project_abr, project_name = next(iter(info.items()))

        api_url=f"{base_url}{source}/{data_type}/{start_str}/{end_str}/{project_id}/{geograhic_id}"

        # Fetch data from the URL, skipping totals row
        df = pd.read_csv(api_url, skipfooter=1, engine='python')

        # Add ProjectIdentifier column if it does not exist
        if 'ProjectIdentifier' not in df.columns:
            df['ProjectIdentifier'] = project_abr

        # Store the DataFrame in the dictionary using the project abbreviation as the key
        project_dataframes[source + project_abr] = df

    return project_dataframes

We will handle the column naming discrepencies by renaming columns in the data records. The function below will:
- Use the dictionaries of monitor event data and data records to combine dataframes from the same project
- The monitor event data will serve as a dictionary, where the keys are the values of the columns that are in both dataframes. This dictionary will be used to create any columns that exist in the monitor event data and not the data records
- Merge the dataframes for each project into one dataframe
- Save a csv of the data

In [8]:
def remove_columns_from_dict(dictionary, columns_to_remove):

    # Iterate over the dictionary
    for key, df in dictionary.items():
        # Drop specified columns
        if all(col in df.columns for col in columns_to_remove):
            df.drop(columns=columns_to_remove, inplace=True)
        else:
            print(f"Some columns to remove were not found in DataFrame for key: {key}")
            missing_cols = [col for col in columns_to_remove if col not in df.columns]
            if missing_cols:
                print(f"Missing columns: {missing_cols}")
    
    return dictionary

In [9]:
def merge_and_save_data(monitor_event_data, data_records, output_csv_path):
    # Dictionary to store merged DataFrames for each project
    merged_dataframes = {}

    for monitor_key in monitor_event_data.keys():
        # Find the corresponding key in data_records that contains the monitor_key as a substring
        data_record_key = next((key for key in data_records.keys() if monitor_key in key), None)
        
        if data_record_key is None:
            print(f"No matching data record found for monitor key: {monitor_key}")
            continue

        # Get the corresponding dataframes for the project
        monitor_df = monitor_event_data[monitor_key].copy()
        data_record_df = data_records[data_record_key].copy()

        # Print the shapes of the dataframes before processing
        print(f"Shape of monitor_df for project {data_record_key}:", monitor_df.shape)
        print(f"Shape of data_record_df for project {data_record_key}:", data_record_df.shape)

        # Find common columns to merge on
        common_columns = list(set(monitor_df.columns).intersection(set(data_record_df.columns)))
        print(len(common_columns), "Common columns:", common_columns)

        # Find "missing" columns
        missing_columns = list(set(monitor_df.columns) - set(data_record_df.columns))

        # Ensure common columns have the same data type
        for col in common_columns:
            if monitor_df[col].dtype != data_record_df[col].dtype:
                target_dtype = monitor_df[col].dtype
                print(data_record_key, col, "type converted")
                try:
                    data_record_df[col] = data_record_df[col].astype(target_dtype)
                except ValueError:
                    data_record_df[col] = data_record_df[col].astype(str)
                    monitor_df[col] = monitor_df[col].astype(str)

        # Adding missing columns to data_record_df with default values
        for col in missing_columns:
            data_record_df[col] = ''

        # Group monitor_df by key columns
        grouped_monitor_df = monitor_df.groupby(common_columns)

        # Apply lambda to convert each group into a list of dictionaries
        monitor_dict = grouped_monitor_df.apply(lambda group: group[missing_columns].to_dict('records'), include_groups=False).to_dict()

        # Create a new DataFrame to hold updated records
        updated_data_record_df = data_record_df.copy()

        # Iterate over rows in data_record_df
        for idx, row in updated_data_record_df.iterrows():
            # Create a key tuple from the row's key columns
            key = tuple(row[col] for col in common_columns)
            if key in monitor_dict:
                # Iterate over records corresponding to the key
                for record in monitor_dict[key]:
                    # Update updated_data_record_df with values from the record
                    for col in missing_columns:
                        if pd.isna(updated_data_record_df.at[idx, col]) or updated_data_record_df.at[idx, col] == '':
                            updated_data_record_df.at[idx, col] = record[col]

        print(f"Updated shape of data_record_df for project {data_record_key}:", updated_data_record_df.shape)


        # Store the updated dataframe in the dictionary
        merged_dataframes[data_record_key] = updated_data_record_df

    # Combine all merged dataframes into one
    combined_df = pd.concat(merged_dataframes.values(), ignore_index=True)

    # Print the shape of the combined dataframe
    print("Shape of combined_df:", combined_df.shape)

    # Reorder the columns
    desired_order = ['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude', 
                      'SampleType', 'FieldActivityId', 'SampleDate', 'SampleTime', 'Layer', 
                      'TotalDepth', 'Parameter', 'ReportingValue', 'ReportingUnit']

    # Filter out columns in desired_order that do not exist in the DataFrame
    valid_order = [col for col in desired_order if col in combined_df.columns]

    # Get the columns that are not in the desired_order
    remaining_columns = [col for col in combined_df.columns if col not in valid_order]

    # Combine valid_order with remaining_columns to maintain the desired order
    final_order = valid_order + remaining_columns

    # Reorder the DataFrame columns
    reordered_combined_df = combined_df[final_order]

    # Save the combined dataframe to a CSV file
    reordered_combined_df.to_csv(output_csv_path, index=False, encoding='utf-8')

    return reordered_combined_df

## Plankton

As mentioned above, the plankton database is much smaller than suggested by the Users Guide. Let's look at the monitoring event data for the relevent regions.

In [118]:
start_date = datetime(2004, 1, 1)
end_date = datetime(2024, 8, 3)
projectList_TidalPlankton = ['MEZ','MIZ','PHYTP','PICOP']
plankton_monitor_events_dict = fetch_monitor_data_by_project("TidalPlankton",projectList_TidalPlankton,start_date,end_date)

In [220]:
plankton_combined_monitor_events = pd.concat(plankton_monitor_events_dict, ignore_index=True)

plankton_combined_monitor_events.shape

  plankton_combined_monitor_events = pd.concat(plankton_monitor_events_dict, ignore_index=True)


(4738, 17)

In [221]:
plankton_combined_monitor_events.columns

Index(['CBSeg2003', 'CBSeg2003Description', 'DataType', 'Source', 'SampleType',
       'SampleDate', 'Layer', 'Latitude', 'Longitude', 'PDepth', 'Salzone',
       'SampleVolume', 'Units', 'Station', 'TotalDepth', 'SampleTime',
       'ProjectIdentifier'],
      dtype='object')

Read in Plankton data dictionary

In [120]:
plankton_records_dict = fetch_recorded_data_by_project("TidalPlankton","Reported",projectList_TidalPlankton,start_date,end_date)

Now, we have a problem because `CBSeg2003` and `CBSeg2003Description` columns exist, but their values are all missing. We need to remove these columns.

In [121]:
plankton_records_dict = remove_columns_from_dict(plankton_records_dict,['CBSeg2003','CBSeg2003Description'])

In [122]:
output_file ="../data/plank_ChesapeakeTidalPlankton.csv"
merge_and_save_data(plankton_monitor_events_dict, plankton_records_dict, output_file)

Shape of monitor_df for project TidalPlanktonPHYTP: (2580, 17)
Shape of data_record_df for project TidalPlanktonPHYTP: (91309, 19)
6 Common columns: ['SampleDate', 'SampleType', 'ProjectIdentifier', 'Layer', 'Source', 'Station']
Updated shape of data_record_df for project TidalPlanktonPHYTP: (91309, 30)
Expected shape: rows: 91309 columns: 41
Shape of monitor_df for project TidalPlanktonPICOP: (2158, 17)
Shape of data_record_df for project TidalPlanktonPICOP: (2158, 19)
6 Common columns: ['SampleDate', 'SampleType', 'ProjectIdentifier', 'Layer', 'Source', 'Station']
Updated shape of data_record_df for project TidalPlanktonPICOP: (2158, 30)
Expected shape: rows: 2158 columns: 41
Shape of monitor_df for project TidalPlanktonMEZ: (0, 17)
Shape of data_record_df for project TidalPlanktonMEZ: (0, 19)
6 Common columns: ['SampleDate', 'SampleType', 'ProjectIdentifier', 'Layer', 'Source', 'Station']
Updated shape of data_record_df for project TidalPlanktonMEZ: (0, 30)
Expected shape: rows: 0 c

  combined_df = pd.concat(merged_dataframes.values(), ignore_index=True)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,SampleType,FieldActivityId,SampleDate,SampleTime,Layer,...,Method,NODCCode,SPECCode,SerialNumber,ProjectIdentifier,Units,DataType,SampleVolume,PDepth,Salzone
0,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,170822,1/12/2004,12:25:00,BP,...,PH102,0702100108,123.0,20041122CB6.,PHYTP,Liter,PHYTP,15.0,10.5,M
1,MOBPH,Mobjack Bay-Polyhaline Region,WE4.2,37.24181,-76.38634,C,170820,1/12/2004,10:28:00,BP,...,PH102,1203020107,321.0,20041122WE4.,PHYTP,Liter,PHYTP,15.0,12.5,M
2,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,170822,1/12/2004,12:25:00,AP,...,PH102,0703010802,673.0,20041122CB6.,PHYTP,Liter,PHYTP,15.0,3.0,M
3,MOBPH,Mobjack Bay-Polyhaline Region,WE4.2,37.24181,-76.38634,C,170820,1/12/2004,10:28:00,AP,...,PH102,12041005,355.0,20041122WE4.,PHYTP,Liter,PHYTP,15.0,3.0,M
4,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,170822,1/12/2004,12:25:00,BP,...,PH102,12040103,337.0,20041122CB6.,PHYTP,Liter,PHYTP,15.0,10.5,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93462,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.1,37.58847,-76.16216,C,644643,12/8/2021,13:20:00,BP,...,PP101,AUTO_PICO,1148,20211208CB6.1,PICOP,Liter,PICOP,15.0,12.5,M
93463,CB7PH,Chesapeake Bay-Polyhaline Region,CB7.3E,37.22875,-76.05383,C,644649,12/9/2021,10:32:00,BP,...,PP101,AUTO_PICO,1148,20211209CB7.3E,PICOP,Liter,PICOP,15.0,18.5,P
93464,CB7PH,Chesapeake Bay-Polyhaline Region,CB7.3E,37.22875,-76.05383,C,644649,12/9/2021,10:32:00,AP,...,PP101,AUTO_PICO,1148,20211209CB7.3E,PICOP,Liter,PICOP,15.0,1.0,P
93465,CB6PH,Chesapeake Bay-Polyhaline Region,CB6.4,37.23653,-76.20799,C,644647,12/9/2021,13:57:00,BP,...,PP101,AUTO_PICO,1148,20211209CB6.4,PICOP,Liter,PICOP,15.0,9.5,P


## Tidal Benthic

As mentioned above, the Tidal Benthic Database requires merging by data type, as well as project. Since it does not use consistent column names for the different datasets (data types), we will have to handle the cases individually. 

#### Monitoring Event data

First we download the common monitoring event data. Each of the data types uses the projects BEN - Tidal Benthic Monitoring and SBEN - Special Tidal Benthic Monitoring.

In [10]:
start_date = datetime(2004, 1, 1)
end_date = datetime(2024, 8, 3)
projectList_TidalBenthic = ['BEN','SBEN']

benthic_monitor_events_dict = fetch_monitor_data_by_project("TidalBenthic",projectList_TidalBenthic,start_date,end_date)

Since `SampleType` is not one of the values used to combine datasets, we will drop it. Every  `SampleType` in the monitoring data and Taxanomic Counts is `D` (which is not in the user guide). Every `SampleType` in the Sediment database is `B` - Bottom, and every `SampleType` in the Water Quality dataset is `ISM` - In-situ measurement at depth, no sample collected.

In [11]:
benthic_monitor_events_dict = remove_columns_from_dict(benthic_monitor_events_dict,['SampleType'])

And let's check how many records are in the dataset, along with the columns.

In [12]:
benthic_combined_monitor_events = pd.concat(benthic_monitor_events_dict, ignore_index=True)
benthic_combined_monitor_events.shape

(990, 16)

In [13]:
benthic_combined_monitor_events.columns

Index(['CBSeg2003', 'CBSeg2003Description', 'ProjectIdentifier',
       'FieldActivityId', 'Source', 'Station', 'SampleDate', 'Layer',
       'Latitude', 'Longitude', 'PDepth', 'Salzone', 'SampleVolume', 'Units',
       'TotalDepth', 'SampleTime'],
      dtype='object')

#### Sediment

We will read in the sediment data, the determine how best to combine it with the monitoring event data.

In [208]:
sediment_data_dict = fetch_recorded_data_by_project('TidalBenthic','Sediment',projectList_TidalBenthic,start_date,end_date)

for key, value in sediment_data_dict.items():
    df = sediment_data_dict[key]
    # Remove columns with only empty string values
    df = df.loc[:, ~(df.isin(['', np.nan, None]).all(axis=0))]
    # Update the dictionary with the modified DataFrame
    sediment_data_dict[key] = df
    print(df.columns)

Index(['CBSeg2003', 'CBSeg2003Description', 'EventId', 'Source', 'SampleType',
       'Station', 'TotalDepth', 'SampleReplicate', 'SampleDate',
       'ReportingParameter', 'ReportedValue', 'ReportingUnits',
       'ProjectIdentifier'],
      dtype='object')
Index(['CBSeg2003', 'CBSeg2003Description', 'EventId', 'Source', 'SampleType',
       'Station', 'TotalDepth', 'SampleReplicate', 'SampleDate',
       'ReportingParameter', 'ReportedValue', 'ReportingUnits',
       'ProjectIdentifier'],
      dtype='object')


This looks great -- it seems that `EventId` and `FieldActivityId` are different numbering schemes, but `EventId` might be helpfulin combining datasets. Again we remove `SampleType`.

In [209]:
sediment_data_dict = remove_columns_from_dict(sediment_data_dict,['SampleType'])

In [210]:
output_file ="../data/plank_ChesapeakeBenthicSediment.csv"
merge_and_save_data(benthic_monitor_events_dict, sediment_data_dict, output_file)

Shape of monitor_df for project TidalBenthicBEN: (913, 16)
Shape of data_record_df for project TidalBenthicBEN: (4567, 12)
7 Common columns: ['Station', 'CBSeg2003Description', 'TotalDepth', 'ProjectIdentifier', 'Source', 'CBSeg2003', 'SampleDate']
Updated shape of data_record_df for project TidalBenthicBEN: (4567, 21)
Shape of monitor_df for project TidalBenthicSBEN: (77, 16)
Shape of data_record_df for project TidalBenthicSBEN: (158, 12)
7 Common columns: ['Station', 'CBSeg2003Description', 'TotalDepth', 'ProjectIdentifier', 'Source', 'CBSeg2003', 'SampleDate']
Updated shape of data_record_df for project TidalBenthicSBEN: (158, 21)
Shape of combined_df: (4725, 21)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,FieldActivityId,SampleDate,SampleTime,Layer,TotalDepth,...,Source,SampleReplicate,ReportingParameter,ReportedValue,ReportingUnits,ProjectIdentifier,Units,SampleVolume,PDepth,Salzone
0,CB3MH,Chesapeake Bay-Mesohaline Region,024,39.12201,-76.35528,214917,5/10/2004,11:11:00,B,6.5,...,VERSAR/EME/BEL,S1,MOIST,63.6065,PCT,BEN,Centimeter,14.0,6.5,O
1,CB3MH,Chesapeake Bay-Mesohaline Region,024,39.12201,-76.35528,214917,5/10/2004,11:11:00,B,6.5,...,VERSAR/EME/BEL,S1,SAND,3.9911,PCT,BEN,Centimeter,14.0,6.5,O
2,CB3MH,Chesapeake Bay-Mesohaline Region,024,39.12201,-76.35528,214917,5/10/2004,11:11:00,B,6.5,...,VERSAR/EME/BEL,S1,TC,3.6000,PCT,BEN,Centimeter,14.0,6.5,O
3,CB3MH,Chesapeake Bay-Mesohaline Region,024,39.12201,-76.35528,214917,5/10/2004,11:11:00,B,6.5,...,VERSAR/EME/BEL,S1,TIC,0.0400,PCT,BEN,Centimeter,14.0,6.5,O
4,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,VERSAR/EME/BEL,S1,MOIST,56.3773,PCT,BEN,Centimeter,16.0,3.9,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4720,CB7PH,Chesapeake Bay-Polyhaline Region,VA13-036,37.55186,-75.86085,225398,8/14/2013,10:30:00,B,1.0,...,ODU/BEL,S1,SILTCLAY,98.3500,PCT,SBEN,Centimeter,7,1.0,HM
4721,CB7PH,Chesapeake Bay-Polyhaline Region,VA13-020,37.72762,-75.80521,225656,8/27/2013,10:00:00,B,2.5,...,ODU/BEL,S1,SAND,4.4800,PCT,SBEN,Centimeter,7,2.5,P
4722,CB7PH,Chesapeake Bay-Polyhaline Region,VA13-020,37.72762,-75.80521,225656,8/27/2013,10:00:00,B,2.5,...,ODU/BEL,S1,SILTCLAY,95.5200,PCT,SBEN,Centimeter,7,2.5,P
4723,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14126,-76.38061,225677,8/28/2013,09:00:00,B,0.5,...,ODU/BEL,S1,SAND,31.1100,PCT,SBEN,Centimeter,7,0.5,P


#### BioMass

We will read in the bio mass data, the determine how best to combine it with the monitoring event data.

In [199]:
biomass_data_dict = fetch_recorded_data_by_project('TidalBenthic','BioMass',projectList_TidalBenthic,start_date,end_date)

for key, value in biomass_data_dict.items():
    df = biomass_data_dict[key]
    # Remove columns with only empty string values
    df = df.loc[:, ~(df.isin(['', np.nan, None]).all(axis=0))]
    # Update the dictionary with the modified DataFrame
    biomass_data_dict[key] = df
    print(df.columns)

Index(['CBSeg2003', 'CBSeg2003Description', 'FieldActivityId',
       'BiologicalEventId', 'Source', 'SampleDate', 'Latitude', 'Longitude',
       'Station', 'TotalDepth', 'SampleTime', 'SampleReplicate',
       'IBIParameter', 'IBIValue', 'ProjectIdentifier'],
      dtype='object')
Index(['CBSeg2003', 'CBSeg2003Description', 'FieldActivityId',
       'BiologicalEventId', 'Source', 'SampleDate', 'Latitude', 'Longitude',
       'Station', 'TotalDepth', 'SampleTime', 'SampleReplicate',
       'IBIParameter', 'IBIValue', 'ProjectIdentifier'],
      dtype='object')


There appears to be some bookkeping rows that have no measuremeant data. Let's also remove those.

In [200]:
for key, value in biomass_data_dict.items():
    df = biomass_data_dict[key]
    # Filter rows based on column: 'IBIParameter'
    df = df[df['IBIParameter'].notna()]
    biomass_data_dict[key] = df

In [201]:
output_file ="../data/plank_ChesapeakeBenthicBioMass.csv"
merge_and_save_data(benthic_monitor_events_dict, biomass_data_dict, output_file)

Shape of monitor_df for project TidalBenthicBEN: (913, 16)
Shape of data_record_df for project TidalBenthicBEN: (25487, 15)
11 Common columns: ['Latitude', 'SampleDate', 'CBSeg2003Description', 'TotalDepth', 'ProjectIdentifier', 'Source', 'FieldActivityId', 'Longitude', 'SampleTime', 'CBSeg2003', 'Station']
Updated shape of data_record_df for project TidalBenthicBEN: (25487, 20)
Shape of monitor_df for project TidalBenthicSBEN: (77, 16)
Shape of data_record_df for project TidalBenthicSBEN: (1417, 15)
11 Common columns: ['Latitude', 'SampleDate', 'CBSeg2003Description', 'TotalDepth', 'ProjectIdentifier', 'Source', 'FieldActivityId', 'Longitude', 'SampleTime', 'CBSeg2003', 'Station']
Updated shape of data_record_df for project TidalBenthicSBEN: (1417, 20)
Shape of combined_df: (26904, 20)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,FieldActivityId,SampleDate,SampleTime,Layer,TotalDepth,BiologicalEventId,Source,SampleReplicate,IBIParameter,IBIValue,ProjectIdentifier,Units,SampleVolume,PDepth,Salzone
0,CB8PH,Chesapeake Bay-Polyhaline Region,11M27,36.98756,-76.19004,214998,7/15/2004,08:15:00,B,10.36,68560,ODU/BEL,S1,PCT_CARN_OMN,19.40298,BEN,Centimeter,7.0,10.36,P
1,CB8PH,Chesapeake Bay-Polyhaline Region,11M27,36.98756,-76.19004,214998,7/15/2004,08:15:00,B,10.36,68560,ODU/BEL,S1,PCT_PI_BIO,7.84314,BEN,Centimeter,7.0,10.36,P
2,CB8PH,Chesapeake Bay-Polyhaline Region,11M27,36.98756,-76.19004,214998,7/15/2004,08:15:00,B,10.36,68560,ODU/BEL,S1,PCT_PS_BIO,70.58823,BEN,Centimeter,7.0,10.36,P
3,CB8PH,Chesapeake Bay-Polyhaline Region,11M27,36.98756,-76.19004,214998,7/15/2004,08:15:00,B,10.36,68560,ODU/BEL,S1,SW,3.30875,BEN,Centimeter,7.0,10.36,P
4,CB8PH,Chesapeake Bay-Polyhaline Region,11M27,36.98756,-76.19004,214998,7/15/2004,08:15:00,B,10.36,68560,ODU/BEL,S1,TOT_ABUND,7613.63623,BEN,Centimeter,7.0,10.36,P
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26899,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14125,-76.38060,225677,8/28/2013,09:00:00,,0.50,73481,ODU/BEL,S1,PCT_PS_BIO,50.00000,SBEN,,,,
26900,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14125,-76.38060,225677,8/28/2013,09:00:00,,0.50,73481,ODU/BEL,S1,SW,2.93000,SBEN,,,,
26901,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14125,-76.38060,225677,8/28/2013,09:00:00,,0.50,73481,ODU/BEL,S1,TOT_ABUND,1272.59998,SBEN,,,,
26902,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14125,-76.38060,225677,8/28/2013,09:00:00,,0.50,73481,ODU/BEL,S1,TOT_BIOMASS_G,0.32320,SBEN,,,,


#### Taxonomic Counts


We will read in the taxanomic data, the determine how best to combine it with the monitoring event data.

In [14]:
taxonomic_data_dict = fetch_recorded_data_by_project('TidalBenthic','Taxonomic','BEN',start_date,end_date)

for key, value in taxonomic_data_dict.items():
    df = taxonomic_data_dict[key]
    # Remove columns with only empty string values
    df = df.loc[:, ~(df.isin(['', np.nan, None]).all(axis=0))]
    # Update the dictionary with the modified DataFrame
    taxonomic_data_dict[key] = df
    print(df.columns)

Index(['CBSeg2003', 'CBSeg2003Description', 'EventId', 'Source', 'SampleType',
       'Station', 'SampleDate', 'GMethod', 'TSN', 'LifeStageDescription',
       'LatinName', 'ReportingValue', 'ReportingUnit', 'ProjectIdentifier'],
      dtype='object')


We don't need `HUC8`, this is a different way to characterize location with larger regions the `CBSeg2003`. We can leave `CatalogingUnitDescription` for cleaning. We will remove `SampleType`

In [203]:
taxonomic_data_dict = remove_columns_from_dict(taxonomic_data_dict,['HUC8','CatalogingUnitDescription'])

Some columns to remove were not found in DataFrame for key: TidalBenthicBEN
Missing columns: ['HUC8', 'CatalogingUnitDescription']
Some columns to remove were not found in DataFrame for key: TidalBenthicSBEN
Missing columns: ['HUC8', 'CatalogingUnitDescription']


In [15]:
output_file ="../data/plank_ChesapeakeBenthicTaxonomic.csv"

merge_and_save_data(benthic_monitor_events_dict, taxonomic_data_dict, output_file)

Shape of monitor_df for project TidalBenthicBEN: (913, 16)
Shape of data_record_df for project TidalBenthicBEN: (26398, 14)
6 Common columns: ['Station', 'CBSeg2003Description', 'SampleDate', 'CBSeg2003', 'ProjectIdentifier', 'Source']
Updated shape of data_record_df for project TidalBenthicBEN: (26398, 24)
No matching data record found for monitor key: SBEN
Shape of combined_df: (26398, 24)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,SampleType,FieldActivityId,SampleDate,SampleTime,Layer,...,Source,GMethod,TSN,LifeStageDescription,LatinName,ProjectIdentifier,Units,Salzone,PDepth,SampleVolume
0,CB1TF,Chesapeake Bay-Tidal Fresh Region,11623,39.43749,-76.05629,D,215670,9/1/2004,15:23:00,B,...,VERSAR/EME/BEL,97,70493.0,89.0,Hydrobiidae,BEN,Centimeter,TF,4.1,10.0
1,CB1TF,Chesapeake Bay-Tidal Fresh Region,11623,39.43749,-76.05629,D,215670,9/1/2004,15:23:00,B,...,VERSAR/EME/BEL,97,81427.0,79.0,Musculium,BEN,Centimeter,TF,4.1,10.0
2,CB1TF,Chesapeake Bay-Tidal Fresh Region,11623,39.43749,-76.05629,D,215670,9/1/2004,15:23:00,B,...,VERSAR/EME/BEL,97,128010.0,79.0,Coelotanypus,BEN,Centimeter,TF,4.1,10.0
3,CB1TF,Chesapeake Bay-Tidal Fresh Region,11622,39.43029,-76.0716,D,215673,9/1/2004,15:32:00,B,...,VERSAR/EME/BEL,97,68585.0,248.0,Tubificidae,BEN,Centimeter,TF,3.4,10.0
4,CB1TF,Chesapeake Bay-Tidal Fresh Region,11623,39.43749,-76.05629,D,215670,9/1/2004,15:23:00,B,...,VERSAR/EME/BEL,97,573739.0,89.0,Marenzelleria viridis,BEN,Centimeter,TF,4.1,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26393,TANMH,Tangier Sound-Mesohaline Region,20504,38.07889,-75.9346,D,225956,9/19/2013,09:12:00,B,...,VERSAR/EME/BEL,97,555698.0,89.0,Podarkeopsis levifuscina,BEN,Centimeter,HM,3.3,10.0
26394,TANMH,Tangier Sound-Mesohaline Region,20501,37.93309,-75.8938,D,225965,9/19/2013,10:35:00,B,...,VERSAR/EME/BEL,97,92627.0,89.0,Edotea triloba,BEN,Centimeter,HM,1.9,8.0
26395,TANMH,Tangier Sound-Mesohaline Region,20506,38.11139,-75.9579,D,225950,9/19/2013,08:27:00,B,...,VERSAR/EME/BEL,97,66949.0,89.0,Scolelepis texana,BEN,Centimeter,HM,2.5,10.0
26396,TANMH,Tangier Sound-Mesohaline Region,20501,37.93309,-75.8938,D,225965,9/19/2013,10:35:00,B,...,VERSAR/EME/BEL,97,68585.0,248.0,Tubificidae,BEN,Centimeter,HM,1.9,8.0


#### Water Quality (Tidal Benthic)

We will read in the water quality data from the tidal bentic database, the determine how best to combine it with the monitoring event data.


In [211]:
water_data_dict = fetch_recorded_data_by_project('TidalBenthic','WaterQuality',projectList_TidalBenthic,start_date,end_date)

for key, value in water_data_dict.items():
    df = water_data_dict[key]
    # Remove columns with only empty string values
    df = df.loc[:, ~(df.isin(['', np.nan, None]).all(axis=0))]
    # Update the dictionary with the modified DataFrame
    water_data_dict[key] = df
    print(df.columns)

Index(['CBSeg2003', 'CBSeg2003Description', 'EventId', 'Source', 'SampleType',
       'Station', 'SampleDate', 'SampleDepth', 'SampleReplicate',
       'ReportedParameter', 'ReportedValue', 'ReportedUnits', 'WQMethod',
       'ProjectIdentifier'],
      dtype='object')
Index(['CBSeg2003', 'CBSeg2003Description', 'EventId', 'Source', 'SampleType',
       'Station', 'SampleDate', 'SampleDepth', 'SampleReplicate',
       'ReportedParameter', 'ReportedValue', 'ReportedUnits', 'WQMethod',
       'ProjectIdentifier'],
      dtype='object')


This looks great -- it seems that `EventId` and `FieldActivityId` are different numbering schemes, but `EventId` might be helpfulin combining datasets. Again we remove `SampleType`.

In [212]:
water_data_dict = remove_columns_from_dict(water_data_dict,['SampleType'])

In [213]:
output_file ="../data/plank_ChesapeakeBenthicWaterQuality.csv"
merge_and_save_data(benthic_monitor_events_dict, water_data_dict, output_file)

Shape of monitor_df for project TidalBenthicBEN: (913, 16)
Shape of data_record_df for project TidalBenthicBEN: (8196, 13)
6 Common columns: ['Station', 'CBSeg2003Description', 'ProjectIdentifier', 'Source', 'CBSeg2003', 'SampleDate']
Updated shape of data_record_df for project TidalBenthicBEN: (8196, 23)
Shape of monitor_df for project TidalBenthicSBEN: (77, 16)
Shape of data_record_df for project TidalBenthicSBEN: (231, 13)
6 Common columns: ['Station', 'CBSeg2003Description', 'ProjectIdentifier', 'Source', 'CBSeg2003', 'SampleDate']
Updated shape of data_record_df for project TidalBenthicSBEN: (231, 23)
Shape of combined_df: (8427, 23)


Unnamed: 0,CBSeg2003,CBSeg2003Description,Station,Latitude,Longitude,FieldActivityId,SampleDate,SampleTime,Layer,TotalDepth,...,SampleReplicate,ReportedParameter,ReportedValue,ReportedUnits,WQMethod,ProjectIdentifier,Units,SampleVolume,PDepth,Salzone
0,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,M1,PH,7.90,SU,F01,BEN,Centimeter,16.0,3.9,O
1,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,M1,WTEMP,20.65,DEG C,F01,BEN,Centimeter,16.0,3.9,O
2,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,M1,PH,8.33,SU,F01,BEN,Centimeter,16.0,3.9,O
3,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,M1,WTEMP,20.11,DEG C,F01,BEN,Centimeter,16.0,3.9,O
4,CB2OH,Chesapeake Bay-Oligohaline Region,026,39.27151,-76.28998,214920,5/10/2004,12:13:00,B,3.9,...,M1,PH,8.17,SU,F01,BEN,Centimeter,16.0,3.9,O
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8422,CB7PH,Chesapeake Bay-Polyhaline Region,VA13-020,37.72762,-75.80521,225656,8/27/2013,10:00:00,B,2.5,...,M1,SALINITY,18.75,PSU,F01,SBEN,Centimeter,7,2.5,P
8423,CB7PH,Chesapeake Bay-Polyhaline Region,VA13-020,37.72762,-75.80521,225656,8/27/2013,10:00:00,B,2.5,...,M1,WTEMP,24.39,DEG C,F01,SBEN,Centimeter,7,2.5,P
8424,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14126,-76.38061,225677,8/28/2013,09:00:00,B,0.5,...,M1,DO,5.60,MG/L,F01,SBEN,Centimeter,7,0.5,P
8425,MOBPH,Mobjack Bay-Polyhaline Region,VA13-023,37.14126,-76.38061,225677,8/28/2013,09:00:00,B,0.5,...,M1,SALINITY,20.56,PSU,F01,SBEN,Centimeter,7,0.5,P
