# Introduction

This notebook works with monitoring data from the Chesapeake Bay.

The CSV from the API (combined with station data) has the columns 
`CBSeg2003`, `CBSeg2003Description`, `Station`, `Latitude`, `Longitude`,
`FieldActivityId`, `Source`, `SampleType`, `SampleDate`, `Layer`,
`SampleNumber`, `GMethod`, `TSN`, `LatinName`, `Size`, `Method`,
`Parameter`, `ReportingValue`, `ReportingUnit`, `NODCCode`, `SPECCode`,
`SerialNumber`

Here is a description of the columns, from [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf):
- `CBSeg2003` 2003 Chesapeake Bay Segment Designation. Divided into regions based on circulation and salinity properties. We used 8 from the Bay proper, 2 adjoining Bays, and 1 adjoining sound.
- `CBSeg2003Description` 2003 Chesapeake Bay Segment Designation Description in the format Location-Salinity. The locations are Chesapeake Bay, Eastern Bay, Mobjack Bay, and Tangier Sound. The salinity levels are tidal fresh (0.0 - 0.5 parts per thousand), oligohaline (0.5 - 5.0 parts per thousand), mesohaline (5.0 - 18.0 parts per thousand), and polyhaline (greater than 18.0 parts per thousand).
- `Station` the sampling station
- `Latitude` and `Longitude`, the latitude and longitude of the sampling station
- `FieldActivityId` is not included in the database user guide
- `Source` Data Collection Agency
- `SampleType`  Collection Type. However, in this dataset all are C, composite sample, made up of subsamples from multiple depths.
- `SampleDate` Sampling date (MM/DD/YYYY). We downloaded 8/9/2004 through 12/9/2021
- `Layer` Layer of Water Column in Which Sample Was Taken. In this dataset
     - S, Surface
     - AP, Above pycnocline
     - WC, Whole water column
- `SampleNumber` number assigned to the sample
- `GMethod` Chesapeake Bay Program Gear Method Code. Codes represent information relating to the type of field gear used to collect samples for all analysis. In this dataset all are 7, sediment Pump
- `TSN` ITIS Taxon Serial Number, unique to the species. When used in conjunction with the NODC, the TSN overcomes the problem of numeric changes in the NODC code whenever species are reclassified.
- `LatinName` Species Latin Name 
- `Size` Cell Size Groupings when taken. Some species have different measurements for different sizes. 
- `Method` Chesapeake Bay Program Sample Analysis Code. In January of 2005 in Maryland and October 2005 in Virginia, the following enumeration technique was instituted for all Chesapeake Bay Program supported phytoplankton enumerations. In this sample, the codes are
     - PH101, MSU/ANS Phytoplankton Enumeration Method
     - PH102, ODU Phytoplankton Enumeration Method
     - PH102M, ODU Phytoplankton Enumeration Method-2005 Modification
     - PH103, Uniform Chesapeake Bay Program Phytoplankton Enumeration Method
     - PP101, ODU Picoplankton Enumeration Method
     - PP102, MSU/ANS Picoplankton Enumeration Method 
- `Parameter` Sampling Parameter. In this dataset, all are COUNT,  the number of cells per liter
- `ReportingValue` the value of the count
- `ReportingUnit` This parameter describes the units in which a substance is measured. In this dataset, all are L.
- `NODCCode` National Oceanographic Data Center Species Code. All species on the list have been assigned at least a partial National Oceanographic Data Center (NODC) code.
- `SPECCode` Many of the agencies reporting data containing species information have developed their own in-house species codes. All of these codes are found in the SPECCODE column of a given data type. Codes will vary by agency and data type. The agency code column in most cases has been given the agency name code in the data documentation.
- `SerialNumber` Sample serial number. However, multiple dates and locations have the same serial number.

Since there are more unique Latin names than TSNs (563 vs. 519), we will use `LatinName` to identify species. Many `NODCCode` and `SPECCode` values are missing. In theory, these four columns all encode the same information.

The main task is to create one column for each of the 563 unique Latin names and then put each count in the row for its sample.
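
A minimal sketch of that reshaping on a tiny made-up table, not the real download. The station names, species names, and counts are invented, and two choices are assumptions: using `Station` and `SampleDate` as the row key (the real key would likely need more columns, such as `Layer` or `SampleNumber`), and summing counts when a species appears more than once in a sample (for example under several `Size` groupings).

```python
import pandas as pd

# Toy long-format data: one row per (sample, species) count. Values are invented.
long_df = pd.DataFrame({
    "Station": ["ST1", "ST1", "ST2", "ST2"],
    "SampleDate": ["8/9/2004", "8/9/2004", "8/9/2004", "8/9/2004"],
    "LatinName": ["Species A", "Species B", "Species A", "Species C"],
    "ReportingValue": [100.0, 250.0, 30.0, 75.0],
})

# One column per Latin name, one row per sample.
# Species not observed in a sample are filled with 0 (an assumption).
wide_df = long_df.pivot_table(
    index=["Station", "SampleDate"],
    columns="LatinName",
    values="ReportingValue",
    aggfunc="sum",
    fill_value=0,
).reset_index()
wide_df.columns.name = None  # drop the leftover "LatinName" axis label

print(wide_df)
```

Whether an unobserved species should be recorded as 0 or left missing depends on how the counts were taken, so `fill_value=0` is worth revisiting against the user guide.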

## Sediment

## BioMass

In [41]:
import pandas as pd

biomass = pd.read_csv('../data/plank_ChesapeakeBenthicBioMass.csv')

print(biomass.shape,biomass.columns)

(26059, 15) Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'FieldActivityId', 'BiologicalEventId', 'Source', 'SampleDate',
       'SiteType', 'TotalDepth', 'SampleTime', 'SampleReplicate',
       'IBIParameter', 'IBIValue'],
      dtype='object')


### Reduce dataframe size

First, let's remove the excess columns:
- `Source` is a bookkeeping value, but we keep it because it is important for merging
- `SiteType` is empty, so we drop it

In [70]:
biomass_clean = biomass.drop(columns=['SiteType'])

biomass_clean.shape

(26059, 14)

A few rows that carry no information were missed by the cleaning done at download. Let's drop the rows where `IBIParameter` is missing (inspection in DataWrangler confirmed that these are the right rows to drop).

In [71]:
biomass_clean = biomass_clean.dropna(subset=['IBIParameter'])
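
As a hedged illustration of this step, the sketch below counts the rows with a missing `IBIParameter` before dropping them, so the number removed can be confirmed. The toy frame and its values are invented.

```python
import pandas as pd

# Toy data: the middle row carries no parameter and no value.
toy = pd.DataFrame({
    "Station": ["ST1", "ST1", "ST2"],
    "IBIParameter": ["PARAM_A", None, "PARAM_B"],
    "IBIValue": [1.5, None, 2.0],
})

n_missing = int(toy["IBIParameter"].isna().sum())
cleaned = toy.dropna(subset=["IBIParameter"])

# The number of rows removed should equal the number of missing parameters.
rows_removed = len(toy) - len(cleaned)
print(n_missing, rows_removed, cleaned.shape)
```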

### Creating feature columns

Now we will pivot the `IBIParameter` column and the `IBIValue` column.

In [73]:
# Reset index to use row numbers as the index
df_reset = biomass_clean.reset_index(drop=True)

# Pivot the DataFrame while preserving non-pivoted columns
biomass_pivoted = df_reset.pivot_table(index=df_reset.index, columns='IBIParameter', values='IBIValue', aggfunc='first')

# Combine pivoted result with the original DataFrame columns not involved in the pivot
biomass_pivoted= df_reset.drop(columns=['IBIParameter','IBIValue']).join(biomass_pivoted)


In [74]:
biomass_pivoted.shape

(25785, 138)
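
To make the shape of this intermediate result easier to picture, here is a small sketch of the same pivot-on-row-number pattern using invented data. Because the pivot index is the row number, every input row stays its own output row, with a single non-missing parameter column.

```python
import pandas as pd

# Toy long-format data (invented values).
toy = pd.DataFrame({
    "Station": ["ST1", "ST1", "ST2"],
    "IBIParameter": ["PARAM_A", "PARAM_B", "PARAM_A"],
    "IBIValue": [1.0, 2.0, 3.0],
}).reset_index(drop=True)

# Pivot on the row number: one output row per input row.
wide = toy.pivot_table(index=toy.index, columns="IBIParameter",
                       values="IBIValue", aggfunc="first")

# Re-attach the identifying columns by row number.
result = toy.drop(columns=["IBIParameter", "IBIValue"]).join(wide)
print(result)
```

The result is wide but sparse, which is why the rows still need to be combined per sample afterwards.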

Now we combine rows if the following all agree:
`CBSeg2003`, `CBSeg2003Description`, `Station`, `Latitude`, `Longitude`, `FieldActivityId`, `BiologicalEventId`, `Source`, `SampleDate`, `TotalDepth`, `SampleTime`, `SampleReplicate`.

In [86]:
columns_to_group = ['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude', 'FieldActivityId', 'BiologicalEventId','Source','SampleDate','TotalDepth', 'SampleTime', 'SampleReplicate']

# Check unique combinations;
# this lets us verify that we haven't lost data
unique_combinations = biomass_pivoted[columns_to_group].drop_duplicates()


unique_combinations.shape

(832, 12)

In [87]:
# check we haven't lost unique values
for col in columns_to_group:
    print(col, unique_combinations[col].unique().size == 
          biomass_clean[col].unique().size)

CBSeg2003 True
CBSeg2003Description True
Station True
Latitude True
Longitude True
FieldActivityId True
BiologicalEventId True
Source True
SampleDate True
TotalDepth True
SampleTime True
SampleReplicate True


In [88]:
# Create a copy of the DataFrame for processing
biomass_processed = biomass_pivoted.copy()

# Create a unique identifier for each group based on the columns to match
biomass_processed['UniqueID'] = biomass_processed[columns_to_group].astype(str).agg('-'.join, axis=1)

# Group by the unique identifier
biomass_combined = biomass_processed.groupby('UniqueID', as_index=False).first()

# Drop the UniqueID column and remove duplicates
biomass_really_clean = biomass_combined.drop(columns='UniqueID').drop_duplicates()
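
For comparison, the same collapse can be sketched without building a string key, by grouping on the key columns directly. `GroupBy.first()` takes the first non-missing value in each column per group, which is what merges the sparse rows. Passing `dropna=False` (pandas 1.1 or later) keeps groups whose key contains a missing value; the string-key version keeps them too, since `astype(str)` converts missing values to text. The toy data is invented.

```python
import pandas as pd

# Toy sparse table: two rows describe the same sample (invented values).
sparse = pd.DataFrame({
    "Station": ["ST1", "ST1", "ST2"],
    "SampleDate": ["8/9/2004", "8/9/2004", "8/9/2004"],
    "PARAM_A": [1.0, None, 3.0],
    "PARAM_B": [None, 2.0, None],
})
key_cols = ["Station", "SampleDate"]

# One row per key; each parameter column keeps its first non-missing value.
combined = sparse.groupby(key_cols, dropna=False).first().reset_index()
print(combined)
```

Grouping on the columns themselves also avoids any ambiguity from joining values with `-`, which can appear inside values such as negative longitudes.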


In [89]:
# Check we haven't lost unique non-empty values
for col in columns_to_group:
    # Filter out empty values
    combined_nonempty = biomass_really_clean[col].dropna()
    original_nonempty = biomass_clean[col].dropna()

    # Check if the number of unique non-empty values matches
    unique_check = combined_nonempty.nunique() == original_nonempty.nunique()
    
    # Print the results
    print(col, unique_check)

CBSeg2003 True
CBSeg2003Description True
Station True
Latitude True
Longitude True
FieldActivityId True
BiologicalEventId True
Source True
SampleDate True
TotalDepth True
SampleTime True
SampleReplicate True


In [90]:
biomass_really_clean.shape

(832, 138)
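
The per-column checks compare counts of distinct values one column at a time. A stricter condition is that the combined frame has exactly one row per distinct key combination, which is what the matching row counts of `unique_combinations` (832) and `biomass_really_clean` (832) suggest. The helper below, `one_row_per_key`, is a hypothetical name introduced for illustration, and the toy frames are invented.

```python
import pandas as pd

def one_row_per_key(before: pd.DataFrame, after: pd.DataFrame, key_cols: list) -> bool:
    """Return True if `after` has exactly one row per distinct key in `before`."""
    n_keys = len(before[key_cols].drop_duplicates())
    no_repeats = not after.duplicated(subset=key_cols).any()
    return len(after) == n_keys and no_repeats

# Toy frames: `before` is sparse, `after` is the collapsed version.
before = pd.DataFrame({
    "Station": ["ST1", "ST1", "ST2"],
    "SampleDate": ["8/9/2004", "8/9/2004", "8/9/2004"],
    "PARAM_A": [1.0, None, 3.0],
    "PARAM_B": [None, 2.0, None],
})
after = pd.DataFrame({
    "Station": ["ST1", "ST2"],
    "SampleDate": ["8/9/2004", "8/9/2004"],
    "PARAM_A": [1.0, 3.0],
    "PARAM_B": [2.0, None],
})

print(one_row_per_key(before, after, ["Station", "SampleDate"]))
```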

Write to CSV

In [92]:
biomass_really_clean.to_csv('../data/plank_ChesapeakeBayBioMass_clean.csv',index=False)

## WaterQuality

In [93]:
import pandas as pd

water = pd.read_csv('../data/plank_ChesapeakeBenthicWaterQuality.csv')

print(water.shape,water.columns)

(8193, 15) Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'EventId', 'Source', 'SampleType', 'SampleDate', 'SampleDepth',
       'SampleReplicate', 'ReportedParameter', 'ReportedValue',
       'ReportedUnits', 'WQMethod'],
      dtype='object')


### Reduce dataframe size

First, let's remove the excess columns:
- `Source` is a bookkeeping value, but we keep it because it is important for merging
- `SampleType`, `SampleReplicate`, and `WQMethod` are the same (or empty) for every entry, so we drop them

In [94]:
water_clean = water.drop(columns=['SampleType','SampleReplicate','WQMethod'])

water_clean.shape

(8193, 12)

### Creating feature columns

First, let's combine each parameter with its units.

In [95]:
water_clean['ReportedParameter'] = water_clean['ReportedParameter'] + ' '+ water_clean['ReportedUnits']

water_clean = water_clean.drop(columns=['ReportedUnits'])

water_clean.shape

(8193, 11)
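
One thing to watch with string concatenation in pandas: if a unit is missing, `parameter + ' ' + units` yields a missing value, so the parameter name is lost as well. Whether the real data has missing units is not shown here, so the sketch below is only a defensive variant on invented data. `PARAM_X` is a made-up parameter name.

```python
import pandas as pd

# Toy data; the last row has no unit recorded.
toy = pd.DataFrame({
    "ReportedParameter": ["DO", "PH", "WTEMP", "PARAM_X"],
    "ReportedUnits": ["MG/L", "SU", "DEG C", None],
    "ReportedValue": [7.2, 8.0, 21.5, 1.0],
})

# Treat a missing unit as an empty string, then trim the trailing space.
units = toy["ReportedUnits"].fillna("")
toy["ReportedParameter"] = (toy["ReportedParameter"] + " " + units).str.strip()
toy = toy.drop(columns=["ReportedUnits"])

labels = toy["ReportedParameter"].tolist()
print(labels)
```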

Now we will pivot the `ReportedParameter` column and `ReportedValue` column.

In [96]:
# Reset index to use row numbers as the index
df_reset = water_clean.reset_index(drop=True)

# Pivot the DataFrame while preserving non-pivoted columns
water_pivoted = df_reset.pivot_table(index=df_reset.index, columns='ReportedParameter', values='ReportedValue', aggfunc='first')

# Combine pivoted result with the original DataFrame columns not involved in the pivot
water_pivoted= df_reset.drop(columns=['ReportedParameter','ReportedValue']).join(water_pivoted)


In [97]:
print(water_pivoted.shape,water_pivoted.columns)

(8193, 15) Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'EventId', 'Source', 'SampleDate', 'SampleDepth', 'DO MG/L',
       'DO_SAT_P PCT', 'PH SU', 'SALINITY PSU', 'SPCOND UMHOS/CM',
       'WTEMP DEG C'],
      dtype='object')


Now we combine rows if the following all agree:
`CBSeg2003`, `CBSeg2003Description`, `Station`, `Latitude`, `Longitude`,`EventId`,`Source`, `SampleDate`, `SampleDepth`.

In [98]:
columns_to_group = ['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude', 'EventId','Source','SampleDepth','SampleDate']

#check unique combinations
#this allows us to check that we havent lost data
unique_combinations = water_pivoted[columns_to_group].drop_duplicates()


unique_combinations.shape

(1542, 9)

In [99]:
# check we haven't lost unique values
for col in columns_to_group:
    print(col, unique_combinations[col].unique().size == 
          water_clean[col].unique().size)

CBSeg2003 True
CBSeg2003Description True
Station True
Latitude True
Longitude True
EventId True
Source True
SampleDepth True
SampleDate True


In [100]:
# Create a copy of the DataFrame for processing
water_processed = water_pivoted.copy()

# Create a unique identifier for each group based on the columns to match
water_processed['UniqueID'] = water_processed[columns_to_group].astype(str).agg('-'.join, axis=1)

# Group by the unique identifier
water_combined = water_processed.groupby('UniqueID', as_index=False).first()

# Drop the UniqueID column and remove duplicates
water_really_clean = water_combined.drop(columns='UniqueID').drop_duplicates()


In [101]:
# Check we haven't lost unique non-empty values
for col in columns_to_group:
    # Filter out empty values
    combined_nonempty = water_really_clean[col].dropna()
    original_nonempty = water[col].dropna()

    # Check if the number of unique non-empty values matches
    unique_check = combined_nonempty.nunique() == original_nonempty.nunique()
    
    # Print the results
    print(col, unique_check)

CBSeg2003 True
CBSeg2003Description True
Station True
Latitude True
Longitude True
EventId True
Source True
SampleDepth True
SampleDate True


Write to CSV

In [102]:
water_really_clean.to_csv('../data/plank_ChesapeakeBayBenthicWater_clean.csv',index=False)

## Combining Datasets

Now we combine on `CBSeg2003`, `CBSeg2003Description`, `Station`, `Latitude`, `Longitude`, `EventId`, `Source`, `SampleDate`, and `SampleDepth`.

In [106]:
print('Sediment columns:', sediment_clean.columns)
print('BioMass columns:', biomass_clean.columns)
print('Water columns', water_clean.columns)

Sediment columns: Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'EventId', 'TotalDepth', 'SampleReplicate', 'SampleDate',
       'ReportingParameter', 'ReportedValue'],
      dtype='object')
BioMass columns: Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'FieldActivityId', 'BiologicalEventId', 'Source', 'SampleDate',
       'TotalDepth', 'SampleTime', 'SampleReplicate', 'IBIParameter',
       'IBIValue'],
      dtype='object')
Water columns Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'EventId', 'Source', 'SampleDate', 'SampleDepth', 'ReportedParameter',
       'ReportedValue'],
      dtype='object')
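
The column listings show that the three tables do not share every key named above: the biomass table has `FieldActivityId` and `BiologicalEventId` rather than `EventId`, the sediment table has no `Source`, and only the water table has `SampleDepth`. One possible approach, sketched here on invented data, is to merge on whichever candidate keys two tables actually have in common. The toy frames, the outer join, and the `indicator` column are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

candidate_keys = ["CBSeg2003", "CBSeg2003Description", "Station", "Latitude",
                  "Longitude", "EventId", "Source", "SampleDate", "SampleDepth"]

# Toy wide tables (invented values) standing in for the cleaned outputs.
biomass_toy = pd.DataFrame({
    "Station": ["ST1", "ST2"],
    "Source": ["AGENCY", "AGENCY"],
    "SampleDate": ["8/9/2004", "8/9/2004"],
    "PARAM_A": [1.0, 3.0],
})
water_toy = pd.DataFrame({
    "Station": ["ST1", "ST3"],
    "Source": ["AGENCY", "AGENCY"],
    "SampleDate": ["8/9/2004", "8/9/2004"],
    "SampleDepth": [0.5, 0.5],
    "DO MG/L": [7.2, 6.8],
})

# Keep only the candidate keys present in both tables.
shared_keys = [c for c in candidate_keys
               if c in biomass_toy.columns and c in water_toy.columns]

# Outer join keeps unmatched rows; `_merge` records where each row came from.
merged = biomass_toy.merge(water_toy, on=shared_keys, how="outer", indicator=True)
print(shared_keys)
print(merged)
```

Inspecting the `_merge` column is a quick way to see how many samples matched across tables before deciding whether an inner or outer join is appropriate.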
