# Introduction

This is a database of the plankton in the Chesapeake Bay.

The CSV from the API (combined with station data) has the columns
`CBSeg2003`, `CBSeg2003Description`, `Station`, `Latitude`, `Longitude`,
`FieldActivityId`, `Source`, `SampleType`, `SampleDate`, `Layer`,
`SampleNumber`, `GMethod`, `TSN`, `LatinName`, `Size`, `Method`,
`Parameter`, `ReportingValue`, `ReportingUnit`, `NODCCode`, `SPECCode`,
`SerialNumber`.

Here is a description of the columns, from [The 2012 Users Guide to CBP Biological Monitoring Data](https://d18lev1ok5leia.cloudfront.net/chesapeakebay/documents/guide2012_final.pdf):
- `CBSeg2003` 2003 Chesapeake Bay Segment Designation. The Bay is divided into regions based on circulation and salinity properties. We used 8 segments from the Bay proper, plus 2 adjoining bays and 1 adjoining sound.
- `CBSeg2003Description` 2003 Chesapeake Bay Segment Designation Description in the format Location-Salinity. The locations are Chesapeake Bay, Eastern Bay, Mobjack Bay, and Tangier Sound. The salinity levels are tidal fresh (0.0 - 0.5 parts per thousand), oligohaline (0.5 - 5.0 parts per thousand), mesohaline (5.0 - 18.0 parts per thousand), and polyhaline (greater than 18.0 parts per thousand).
- `Station` the sampling station
- `Latitude` and `Longitude` the latitude and longitude of the sampling station
- `FieldActivityId` is not described in the user guide
- `Source` Data Collection Agency
- `SampleType` Collection Type. In this dataset all are C: a composite sample made up of subsamples from multiple depths.
- `SampleDate` Sampling date (MM/DD/YYYY). We downloaded samples dated 8/9/2004 through 12/9/2021.
- `Layer` Layer of Water Column in Which Sample Was Taken. In this dataset:
     - S, Surface
     - AP, Above pycnocline
     - WC, Whole water column
- `SampleNumber` number assigned to the sample
- `GMethod` Chesapeake Bay Program Gear Method Code. Codes represent information relating to the type of field gear used to collect samples for all analysis. In this dataset all are 7, Plankton Pump.
- `TSN` ITIS Taxon Serial Number, unique to the species. When used in conjunction with the NODC code, the TSN overcomes the problem of numeric changes in the NODC code whenever species are reclassified.
- `LatinName` Species Latin Name 
- `Size` Cell Size Groupings when taken. Some species have different measurements for different sizes. 
- `Method` Chesapeake Bay Program Sample Analysis Code. According to the user guide, an enumeration technique was instituted for all Chesapeake Bay Program supported phytoplankton enumerations in January 2005 in Maryland and October 2005 in Virginia. In this dataset, the codes are:
     - PH101, MSU/ANS Phytoplankton Enumeration Method
     - PH102, ODU Phytoplankton Enumeration Method
     - PH102M, ODU Phytoplankton Enumeration Method-2005 Modification
     - PH103, Uniform Chesapeake Bay Program Phytoplankton Enumeration Method
     - PP101, ODU Picoplankton Enumeration Method
     - PP102, MSU/ANS Picoplankton Enumeration Method 
- `Parameter` Sampling Parameter. In this dataset, all are COUNT, the number of cells per liter.
- `ReportingValue` the value of the count
- `ReportingUnit` This parameter describes the units in which a substance is measured. In this dataset, all are L.
- `NODCCode` National Oceanographic Data Center Species Code. All species on the list have been assigned at least a partial National Oceanographic Data Center (NODC) code.
- `SPECCode` Many of the agencies reporting data containing species information have developed their own in-house species codes. All of these codes are found in the SPECCODE column of a given data type. Codes will vary by agency and data type. The agency code column in most cases has been given the agency name code in the data documentation.
- `SerialNumber` Sample serial number. However, multiple dates and locations have the same serial number.

Since there are more unique Latin names than TSNs (563 vs 519), we will use `LatinName` to identify species. Many `NODCCode` and `SPECCode` values are missing. In theory, all four of these columns encode the same information.
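
One way to see where the two counts diverge is to look for TSN values that are attached to more than one Latin name. The sketch below does this on a toy table; the names and TSN values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real table; names and TSN values are made up.
toy = pd.DataFrame({
    "LatinName": ["Alpha one", "Alpha two", "Beta one", "Beta one"],
    "TSN": [101, 101, 202, 202],
})

# Number of distinct values in each identifier column
n_names = toy["LatinName"].nunique()
n_tsn = toy["TSN"].nunique()

# For each TSN, count how many distinct Latin names it carries
names_per_tsn = toy.groupby("TSN")["LatinName"].nunique()
shared_tsn = names_per_tsn[names_per_tsn > 1]

print(n_names, n_tsn)
print(shared_tsn)
```

Running the same two steps on `plankton` would list the TSNs responsible for the gap between the counts.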

The main task is to create a column for each of the 563 unique Latin names and put each count in the correct row.
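
As a picture of the target shape, here is a minimal sketch on a toy table with invented values. It uses `pivot_table` keyed on two sample columns, which is a shortcut for illustration only and not the route taken later in this notebook.

```python
import pandas as pd

# Toy long-format table: one row per (sample, species). Values are invented.
long_toy = pd.DataFrame({
    "Station": ["S1", "S1", "S2"],
    "SampleDate": ["8/9/2004", "8/9/2004", "8/9/2004"],
    "LatinName": ["Alpha one", "Beta one", "Alpha one"],
    "ReportingValue": [10.0, 20.0, 30.0],
})

# Target wide format: one row per sample, one column per species
wide_toy = long_toy.pivot_table(
    index=["Station", "SampleDate"],
    columns="LatinName",
    values="ReportingValue",
    aggfunc="first",
).reset_index()

print(wide_toy)
```

Species that were not counted in a sample show up as missing values in that sample's row.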

In [2]:
import pandas as pd

plankton = pd.read_csv('../data/plank_ChesapeakeTidalPlankton.csv',low_memory=False)

print(plankton.shape)

(93467, 30)


## Reduce dataframe size

First, let's remove the excess columns:
- `Source` is a bookkeeping value. Since we do not need to combine datasets, we can remove it.
- `SampleType`, `GMethod`, `Parameter`, and `ReportingUnit` are the same (or empty) for every entry.
- `NODCCode` and `SPECCode` encode the same information as `TSN` and `LatinName`, with a lot of missing values. The User Guide says `TSN` is preferable to `NODCCode`.

In [17]:
plankton_clean = plankton.drop(columns=['Source','SampleType','GMethod', 'Parameter', 'ReportingUnit','NODCCode', 'SPECCode'])
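
The claim that a column is "the same (or empty) for every entry" can be checked directly. A minimal sketch on a toy table with invented values: `nunique()` ignores missing values by default, so a column with one value plus some blanks still counts as constant.

```python
import pandas as pd

# Toy table; the values are invented for illustration.
toy = pd.DataFrame({
    "GMethod": [7, 7, 7],
    "Parameter": ["COUNT", None, "COUNT"],
    "Layer": ["S", "AP", "WC"],
})

# Columns with at most one distinct non-missing value carry no information
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)
```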


There might also be rows that carry no information and were missed by the cleaning done at download. Let's drop the rows where `LatinName` is missing (inspecting them in DataWrangler confirmed that these are the right rows to remove).

In [18]:
plankton_clean = plankton_clean.dropna(subset=['LatinName'])
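
Before a drop like this, the same inspection can be done in code by confirming that the rows with a missing `LatinName` are blank everywhere else too. This is a sketch on a toy table with invented values, not output from the real data.

```python
import pandas as pd

# Toy table with one fully blank row; values are invented.
toy = pd.DataFrame({
    "Station": ["S1", None, "S2"],
    "LatinName": ["Alpha one", None, "Beta one"],
    "ReportingValue": [10.0, None, 30.0],
})

# Rows that would be removed by dropna(subset=['LatinName'])
missing_name = toy[toy["LatinName"].isna()]

# True only if every cell in every such row is missing
all_blank = bool(missing_name.isna().all().all())
print(len(missing_name), all_blank)
```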

In [19]:
plankton_clean.shape

(93466, 23)

In [20]:
plankton_clean.columns

Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'FieldActivityId', 'SampleDate', 'SampleTime', 'Layer', 'TotalDepth',
       'ReportingValue', 'SampleNumber', 'TSN', 'LatinName', 'Size', 'Method',
       'SerialNumber', 'ProjectIdentifier', 'Units', 'DataType',
       'SampleVolume', 'PDepth', 'Salzone'],
      dtype='object')

## Creating features columns

First, let's handle the fact that some species are measured at various sizes. We will combine the `LatinName` and `Size` columns, removing `Not Applicable`.

In [21]:
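# Append the size class to the name. 'Not Applicable' becomes an empty suffix,
# and str.strip() removes the trailing space that would otherwise be left behind.
plankton_clean['LatinName'] = (plankton_clean['LatinName'] + ' ' + plankton_clean['Size'].replace('Not Applicable', '')).str.strip()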

plankton_clean = plankton_clean.drop(columns=['Size'])

plankton_clean.shape

(93466, 22)
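
Two edge cases are worth keeping in mind when joining a name and a size with a space: an empty suffix leaves a trailing space after the name, and a missing `Size` turns the whole result into a missing value. The sketch below guards against both on a toy table. The species names and size labels are invented, and whether the real `Size` column contains missing values is not established here.

```python
import pandas as pd

# Toy table; names and size labels are invented.
toy = pd.DataFrame({
    "LatinName": ["Alpha one", "Alpha one", "Beta one", "Gamma one"],
    "Size": ["Size A", "Size B", "Not Applicable", None],
})

# Blank out the placeholder and any missing sizes, then join and trim
size_suffix = toy["Size"].replace("Not Applicable", "").fillna("")
toy["NameWithSize"] = (toy["LatinName"] + " " + size_suffix).str.strip()

print(toy["NameWithSize"].tolist())
```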

In [22]:
plankton_clean.columns

Index(['CBSeg2003', 'CBSeg2003Description', 'Station', 'Latitude', 'Longitude',
       'FieldActivityId', 'SampleDate', 'SampleTime', 'Layer', 'TotalDepth',
       'ReportingValue', 'SampleNumber', 'TSN', 'LatinName', 'Method',
       'SerialNumber', 'ProjectIdentifier', 'Units', 'DataType',
       'SampleVolume', 'PDepth', 'Salzone'],
      dtype='object')

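Now we will pivot the `LatinName` and `ReportingValue` columns, so that each species name becomes a column holding its count. We pivot on `LatinName` rather than `TSN` (the other option would be to pivot on `TSN` and `ReportingValue`). Note that `TSN` is not actually removed by the code below, so it remains as a column in the pivoted table.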

In [9]:
# Keep the other columns the same (for now)
other_columns = [col for col in plankton_clean.columns if col not in ['LatinName', 'ReportingValue']]

# Reset index to use row numbers as the index
df_reset = plankton_clean.reset_index(drop=True)

# Pivot the DataFrame while preserving non-pivoted columns
plankton_pivoted = df_reset.pivot_table(index=df_reset.index, columns='LatinName', values='ReportingValue', aggfunc='first')

# Combine pivoted result with the original DataFrame columns not involved in the pivot
plankton_pivoted= df_reset.drop(columns=['LatinName','ReportingValue']).join(plankton_pivoted)


In [10]:
plankton_pivoted.shape

(93466, 672)
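
Because the pivot in the cell above is keyed on the row number, the result has as many rows as the input, and each row holds a value in only one species column. A minimal sketch on a toy table with invented values shows this sparse intermediate shape:

```python
import pandas as pd

# Toy long-format table; values are invented.
long_toy = pd.DataFrame({
    "Station": ["S1", "S1", "S2"],
    "LatinName": ["Alpha one", "Beta one", "Alpha one"],
    "ReportingValue": [10.0, 20.0, 30.0],
})

# Pivot on the row number: each input row stays its own row
sparse = long_toy.pivot_table(
    index=long_toy.index,
    columns="LatinName",
    values="ReportingValue",
    aggfunc="first",
)

# Count the non-missing species values in each row
filled_per_row = sparse.notna().sum(axis=1)
print(sparse)
print(filled_per_row.tolist())
```

Rows belonging to the same sample still have to be merged, which is what the grouping step that follows is for.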

Now we combine rows that agree on all of the columns in the monitoring data dictionary (except those that we dropped).

In [24]:
columns_to_group = ['CBSeg2003', 'CBSeg2003Description', 'DataType',
       'SampleDate', 'Layer', 'Latitude', 'Longitude', 'PDepth', 'Salzone',
       'SampleVolume', 'Units', 'Station', 'TotalDepth', 'SampleTime',
       'ProjectIdentifier']

# Check the unique combinations of the grouping columns.
# This lets us check that we haven't lost data.
unique_combinations = plankton_pivoted[columns_to_group].drop_duplicates()


unique_combinations.shape

(4737, 15)

In [25]:
# check we haven't lost unique values
for col in columns_to_group:
    print(col, unique_combinations[col].unique().size == 
          plankton_clean[col].unique().size)

CBSeg2003 True
CBSeg2003Description True
DataType True
SampleDate True
Layer True
Latitude True
Longitude True
PDepth True
Salzone True
SampleVolume True
Units True
Station True
TotalDepth True
SampleTime True
ProjectIdentifier True


In [27]:
# Create a copy of the DataFrame for processing
plankton_processed = plankton_pivoted.copy()

# Create a unique identifier for each group based on the columns to match
plankton_processed['UniqueID'] = plankton_processed[columns_to_group].astype(str).agg('-'.join, axis=1)

# Group by the unique identifier
plankton_combined = plankton_processed.groupby('UniqueID', as_index=False).first()

# Drop the UniqueID column and remove duplicates
plankton_really_clean = plankton_combined.drop(columns='UniqueID').drop_duplicates()
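
One reason a string `UniqueID` is convenient is that `groupby` drops any group whose key contains a missing value by default, while the string `'nan'` is an ordinary key. An alternative, shown here only as a sketch and not as the method this notebook uses, is to group on the columns directly with `dropna=False` (available in pandas 1.1 and later). The toy values are invented.

```python
import numpy as np
import pandas as pd

keys = ["Station", "PDepth"]

# Toy pivoted table; station S2 has a missing PDepth. Values are invented.
toy = pd.DataFrame({
    "Station": ["S1", "S1", "S2", "S2"],
    "PDepth": [2.0, 2.0, np.nan, np.nan],
    "Alpha one": [10.0, np.nan, 30.0, np.nan],
    "Beta one": [np.nan, 20.0, np.nan, 40.0],
})

# Default behaviour: the S2 group is silently dropped
default_groups = toy.groupby(keys, as_index=False).first()

# dropna=False keeps it; first() takes the first non-missing value per column
kept_groups = toy.groupby(keys, as_index=False, dropna=False).first()

print(len(default_groups), len(kept_groups))
```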


In [28]:
# Check we haven't lost unique non-empty values
for col in columns_to_group:
    # Filter out empty values
    df_combined_nonempty = plankton_really_clean[col].dropna()
    plankton_pivoted_nonempty = plankton_pivoted[col].dropna()
    
    # Check if the number of unique non-empty values matches
    unique_check = df_combined_nonempty.unique().size == plankton_pivoted_nonempty.unique().size
    
    # Print the results
    print(col, unique_check)

CBSeg2003 True
CBSeg2003Description True
DataType True
SampleDate True
Layer True
Latitude True
Longitude True
PDepth True
Salzone True
SampleVolume True
Units True
Station True
TotalDepth True
SampleTime True
ProjectIdentifier True
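
The checks above confirm that no grouping value disappeared, but they do not show whether a sample reported the same species more than once, in which case `first()` keeps only one of the values. A possible extra check, sketched on a toy long-format table with invented values, counts rows per sample and species:

```python
import pandas as pd

keys = ["Station", "SampleDate"]

# Toy long-format table; S1 reports 'Alpha one' twice. Values are invented.
long_toy = pd.DataFrame({
    "Station": ["S1", "S1", "S1", "S2"],
    "SampleDate": ["8/9/2004"] * 4,
    "LatinName": ["Alpha one", "Alpha one", "Beta one", "Alpha one"],
    "ReportingValue": [10.0, 12.0, 20.0, 30.0],
})

# More than one row per (sample, species) means first() would discard a value
rows_per_cell = long_toy.groupby(keys + ["LatinName"], dropna=False).size()
conflicts = rows_per_cell[rows_per_cell > 1]

print(conflicts)
```

On the real data the analogous call would use `columns_to_group` as the keys and `plankton_clean` as the table; an empty result would mean nothing was discarded.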


Finally, write the cleaned table to CSV.

In [29]:
plankton_really_clean.to_csv('../data/plank_ChesapeakeBayPlankton_clean.csv',index=False)