# Processing FIB data for the Safe to Swim map (v2)

### Introduction
The following code processes fecal indicator bacteria data (FIB) for the Safe to Swim map (v2), which is currently in development. It sources FIB data from the [BeachWatch](https://beachwatch.waterboards.ca.gov/) and [California Environmental Data Exchange Network (CEDEN)](https://ceden.org/) databases, both of which are managed by the [State Water Resources Control Board](https://www.waterboards.ca.gov/). It combines the two datasets and calculates the rolling 30-day and 6-week geometric mean values for each data point. The FIB data used in this script includes sampling data for E. coli, Enterococcus, Fecal Coliform, and Total Coliform.

### Requirements
To run the following code, you will need Python 3.x installed along with the Python packages, pandas and pyodbc. You will also need access to the internal BeachWatch and CEDEN data tables via internal data mart or some other access point.

### Instructions
Run the following code cells in sequential order. You can run them manually cell by cell or run them all in one go. Do not skip any steps or cells. Depending on your computer and/or internet connection, it can take about two hours to run the script in its entirety.

### 1. Import the required Python packages

In [1]:
from datetime import datetime
import numpy as np
import os
import pandas as pd
import pyodbc # Used for connecting to the internal data marts
from scipy.stats.mstats import gmean

### 2. Download FIB data from BeachWatch and CEDEN

#### 2.1 BeachWatch
Define the variables for connecting to BeachWatch. These are private login credentials. The code block below will not run unless the environment variables on your machine are set up similarly.

In [2]:
BW_SERVER1 = os.environ.get('S2S_Server')
BW_DATABASE = os.environ.get('S2S_DB')
BW_TABLE = os.environ.get('S2S_Table')
BW_UID = os.environ.get('S2S_User')
BW_PWD = os.environ.get('S2S_Pass')

Define and run a function for connecting to BeachWatch, querying all data records from BeachWatch, and returning the data as a pandas dataframe.

In [3]:
# Define the date columns for both BeachWatch and CEDEN to ensure that date values get parsed correctly
date_cols = ['SampleDate', 'CalibrationDate', 'CollectionTime', 'PrepPreservationDate', 'DigestExtractDate', 'AnalysisDate']

def get_bw_data():
    cnxn = pyodbc.connect(Driver='SQL Server', Server=BW_SERVER1, Database=BW_DATABASE, uid=BW_UID, pwd=BW_PWD)
    sql =  "SELECT * FROM %s" % BW_TABLE
    df = pd.read_sql_query(sql, cnxn, parse_dates=date_cols, dtype={'ResultReplicate': np.int16, 'CollectionReplicate': np.int16})
    return df

bw_df = get_bw_data() 
print("Count of rows:", bw_df.shape[0])

# Add a field for identifying the database source of the data
bw_df['DataSource'] = 'BeachWatch'

pd.set_option('display.max_columns', None)
bw_df.head()

  df = pd.read_sql_query(sql, cnxn, parse_dates=date_cols, dtype={'ResultReplicate': np.int16, 'CollectionReplicate': np.int16})


Count of rows: 2201087


Unnamed: 0,ProgramName,ParentProjectName,ProjectName,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,UnitName,Result,Observation,MDL,RL,ResQualCode,QACode,BatchVerificationCode,ComplianceCode,SampleComments,LabCollectionComments,LabResultComments,BatchComments,EventCode,ProtocolCode,AgencyCode,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceName,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,LabSubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,cfu/100mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch
1,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-17,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/17/2009,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch
2,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/25/2009,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch
3,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-08,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/08/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch
4,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-14,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/14/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,20,,0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch


Some of the BeachWatch columns have slightly different names compared to the CEDEN columns. Because we will be joining these two datasets, we want all of the column names to match.

In [4]:
# Dictionary for mapping the names of BeachWatch fields to CEDEN fields
bw_to_ceden_fields = {
    'ProgramName': 'Program',
    'ParentProjectName': 'ParentProject',
    'ProjectName': 'Project',
    'UnitName': 'Unit',
    'ResQualCode': 'ResultQualCode',
    'BatchVerificationCode': 'BatchVerification',
    'LabCollectionComments': 'CollectionComments',
    'LabResultComments': 'ResultsComments',
    'AgencyCode': 'SampleAgency',
    'CollectionDeviceName': 'CollectionDeviceDescription',
    'LabSubmissionCode': 'SubmissionCode',
    'ResultReplicate': 'ResultsReplicate'
}

bw_df = bw_df.rename(columns=bw_to_ceden_fields)
bw_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,cfu/100mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch
1,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-17,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/17/2009,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch
2,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/25/2009,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch
3,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-08,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/08/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch
4,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-14,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/14/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,20,,0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch


#### 2.2 CEDEN
Define the variables for connecting to CEDEN. Like for the BeachWatch data above, these are private login credentials. 

In [5]:
CEDEN_SERVER1 = os.environ.get('SERVER1')
CEDEN_UID = os.environ.get('UID')
CEDEN_PWD = os.environ.get('PWD')
CEDEN_TABLE = os.environ.get('TABLE')
CEDEN_SITE_DATUM_TABLE = os.environ.get('SITE_DATUM_TABLE') # Used for getting site datum data
CEDEN_SITE_TABLE = os.environ.get('SITE_TABLE') # Used for getting site region number

Define and run a function for connecting to the CEDEN data mart and returning the data as a pandas dataframe. This query includes all data for E. coli, Enterococcus, Fecal Coliform, and Total Coliform, but at the same time it excludes all records where Program == BeachWatch. There is a lot of duplicate BeachWatch data in CEDEN from the time when BeachWatch data was copied over into CEDEN. We want to exclude the duplicate BeachWatch data from our query.

In [6]:
def get_ceden_data():
    cnxn = pyodbc.connect(Driver='SQL Server', Server=CEDEN_SERVER1, uid=CEDEN_UID, pwd=CEDEN_PWD)
    sql = "SELECT * FROM %s WHERE (Analyte in ('E. coli', 'Enterococcus', 'Coliform, Total', 'Coliform, Fecal') AND Program != 'BeachWatch')" % CEDEN_TABLE
    df = pd.read_sql_query(sql, cnxn, parse_dates=date_cols, dtype={'ResultsReplicate': np.int16, 'CollectionReplicate': np.int16})
    return df

ceden_df = get_ceden_data()
print("Count of rows:", ceden_df.shape[0])

# Add data source field
ceden_df['DataSource'] = 'CEDEN'

ceden_df.head()

  df = pd.read_sql_query(sql, cnxn, parse_dates=date_cols, dtype={'ResultsReplicate': np.int16, 'CollectionReplicate': np.int16})


Count of rows: 390359


Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,HydroMod,HydroModLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,isQA,DataSource
0,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Ocean WQ Station 2103,2103,2012-07-19,1899-12-30 13:47:00,Not Recorded,30,m,Not Recorded,1,1,,,samplewater,Colilert-18,E. coli,MPN/100 mL,10,,,,=,,NR,NR,,,,,WQ,Not Recorded,OCSD,,Field Method,33.58482,-117.94463,Not Recorded,1899-12-30,Subsurface,,NaT,,NaT,NaT,,,,,,,,,,,,,,,,,,,,,,,E. coli,False,CEDEN
1,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Ocean WQ Station 2103,2103,2012-07-19,1899-12-30 13:47:00,Not Recorded,40,m,Not Recorded,1,1,,,samplewater,Enterolert,Enterococcus,MPN/100 mL,10,,,,=,,NR,NR,,,,,WQ,Not Recorded,OCSD,,Field Method,33.58482,-117.94463,Not Recorded,1899-12-30,Subsurface,,NaT,,NaT,NaT,,,,,,,,,,,,,,,,,,,,,,,Enterococcus,False,CEDEN
2,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Ocean WQ Station 2103,2103,2012-07-19,1899-12-30 13:47:00,Not Recorded,50,m,Not Recorded,1,1,,,samplewater,Colilert-18,E. coli,MPN/100 mL,10,,,,=,,NR,NR,,,,,WQ,Not Recorded,OCSD,,Field Method,33.58482,-117.94463,Not Recorded,1899-12-30,Subsurface,,NaT,,NaT,NaT,,,,,,,,,,,,,,,,,,,,,,,E. coli,False,CEDEN
3,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Ocean WQ Station 2103,2103,2012-08-02,1899-12-30 08:47:00,Not Recorded,1,m,Not Recorded,1,1,,,samplewater,Colilert-18,"Coliform, Total",MPN/100 mL,10,,,,=,,NR,NR,,,,,WQ,Not Recorded,OCSD,,Field Method,33.58482,-117.94463,Not Recorded,1899-12-30,Subsurface,,NaT,,NaT,NaT,,,,,,,,,,,,,,,,,,,,,,,"Coliform, Total",False,CEDEN
4,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Core Ocean Monitoring Program,OCSD Ocean WQ Station 2103,2103,2012-08-02,1899-12-30 08:47:00,Not Recorded,10,m,Not Recorded,1,1,,,samplewater,Colilert-18,E. coli,MPN/100 mL,10,,,,=,,NR,NR,,,,,WQ,Not Recorded,OCSD,,Field Method,33.58482,-117.94463,Not Recorded,1899-12-30,Subsurface,,NaT,,NaT,NaT,,,,,,,,,,,,,,,,,,,,,,,E. coli,False,CEDEN


### 3. Combine the BeachWatch and CEDEN datasets
The BeachWatch and CEDEN datasets have similar data structures, allowing us to combine the two datasets and work on both of them at the same time.

In [7]:
combined_df = pd.concat([bw_df, ceden_df],  ignore_index=True)
print("Count of rows:", combined_df.shape[0])

combined_df.head()

Count of rows: 2591446


Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,cfu/100mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,
1,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-17,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/17/2009,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,
2,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/25/2009,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,
3,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-08,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/08/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,
4,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-14,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/14/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,20,,0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,


### 4. Create the SampleDateTime column
For CEDEN, the sample date and collection time are stored in two different columns, SampleDate and CollectionTime, respectively. CollectionTime has a recorded date along with a time, but the paired date is not usable. Create a new column by separating out the time value from the CollectionTime column and combine it with the date value in the SampleDate column.

In [8]:
# Extract the time value from CollectionTime field and copy to a new field
combined_df['CollectionTimeOnly'] = combined_df['CollectionTime'].dt.time

# Combine the date and time values into a new SampleDateTime field
combined_df['SampleDateTime'] = pd.to_datetime(combined_df['SampleDate']) + pd.to_timedelta(combined_df['CollectionTimeOnly'].astype(str))

combined_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,cfu/100mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25
1,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-17,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/17/2009,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2009-08-17
2,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-21-San Clemente State Beach, Orange",S-21,2009-08-25,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,SRRA-08/25/2009,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.4051,-117.607,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2009-08-25
3,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-08,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/08/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,2,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2005-02-08
4,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S13-Laguna Beach, Orange",S13,2005-02-14,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,AWMA-02/14/2005,Not Recorded,samplewater,SM 9222 D,"Coliform, Fecal",cfu/100mL,20,,0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.5168,-117.761,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,AWMA,AWMA,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2005-02-14


### 5. Dropping duplicate records
Even though we excluded BeachWatch records when pulling data from CEDEN (Step 2.2), there are still some duplicate BeachWatch records in CEDEN because these records are submitted to CEDEN under a different program name (i.e., not BeachWatch). 

An example of this is StationCode == 'Wharf-East' for Total coliform, sample taken on 9/5/2019. There are three data points for the same result, one in BeachWatch and two in CEDEN. They mostly have the same values in every column except for Program, ResultQualCode, and QACode. The Program value in BeachWatch is "BeachWatch" whereas the Program values in CEDEN are "BeachWatch" and "Santa Cruz City Environmental Program". The BeachWatch record was copied over into CEDEN from the BeachWatch database, and the other record was submitted to CEDEN under a different program name. Because the SQL query used in Step 2.2 only excludes records that have a Program value of "BeachWatch", the latter record would still make it into the combined dataset.

A list of columns, defined below in the variable "duplicate_cols", is used to identify and drop the remaining duplicate records. When comparing one record to another, the code is looking for at least one unique value across all of these columns. If the values for both records across all columns are the same, then it is considered a duplicate record. This list of columns can be changed, as needed.

In [9]:
# Sort the dataframe by the DataSource column so that all BeachWatch records are positioned before the CEDEN records. 
# This is to ensure that BeachWatch records are kept by default if there happens to be the same record from btoh BeachWatch and CEDEN
combined_df = combined_df.sort_values(by='DataSource')

combined_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,cfu/100mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00
1467401,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20,,0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00
1467400,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10,,0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00
1467399,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700,,10,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00
1467398,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800,,10,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00


In [10]:
# Define the columns used to identify duplicate records
# 10/1/24 - I removed 'QACode' and 'ResultQualCode' from this list because it appears that some duplicate records across BeachWatch and CEDEN have different QACode and ResultQualCode values 
# See StationCode == 'Wharf-East' for Total coliform, samples taken on 9/5/2019 (QACode) and 9/23/2019 (ResultQualCode)
duplicate_cols = ['Analyte', 'MatrixName', 'SampleDateTime', 'CollectionReplicate', 'ResultsReplicate', 'MethodName', 'Result', 'Unit']

# Select the identified duplicate records from the combined dataset and copy them to a new dataframe
# These records will later be added to the rejected_records csv file output
duplicates_df = combined_df.loc[combined_df.duplicated(subset=duplicate_cols, keep='first')]
duplicates_df['Comments'] = 'Duplicate record'

print('Count of duplicate records:', duplicates_df.shape[0])
duplicates_df.head() 

Count of duplicate records: 824285


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  duplicates_df['Comments'] = 'Duplicate record'


Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,Comments
1466002,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-050-non-accessible or restricted access sho...,PL-050,1999-08-09,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-08/09/1999,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6794,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,1999-08-09,Duplicate record
1466835,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"SE-060-Cardiff State Beach, San Diego",SE-060,2002-07-02,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-07/02/2002,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,20,,10,10,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,33.0184,-117.284,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2002-07-02,Duplicate record
1469566,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2000-07-06,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-07/06/2000,Not Recorded,samplewater,SM 9222 B,Enterococcus,cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2000-07-06,Duplicate record
1469686,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2002-06-26,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-06/26/2002,Not Recorded,samplewater,SM 9222 B,Enterococcus,cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2002-06-26,Duplicate record
1469655,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2002-05-09,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-05/09/2002,Not Recorded,samplewater,SM 9222 B,"Coliform, Fecal",cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2002-05-09,Duplicate record


In [11]:
print('Count of rows before dropping duplicates:', combined_df.shape[0])

# Drop the duplicate records from the combined dataset; if there are duplicates, keep the first duplicate record found (BeachWatch)
combined_df = combined_df.drop_duplicates(subset=duplicate_cols, keep='first')

print('Count of rows after removing duplicates:', combined_df.shape[0])

Count of rows before dropping duplicates: 2591446
Count of rows after removing duplicates: 1767161


### 6. Clean and process data

#### 6.1 Strip special characters and whitespace characters. Check null/missing values for compatability with the open data portal.

In [12]:
# Strip special characters. These characters can cause issues when reading, parsing, or writing the data
combined_df.replace(r'\t',' ', regex=True, inplace=True) # tab
combined_df.replace(r'\r',' ', regex=True, inplace=True) # carriage return
combined_df.replace(r'\n',' ', regex=True, inplace=True) # newline
combined_df.replace(r'\f',' ', regex=True, inplace=True) # formfeed
combined_df.replace(r'\v',' ', regex=True, inplace=True) # vertical tab
combined_df.replace(r'\|', ' ', regex=True, inplace=True) # pipe
combined_df.replace(r'\"', ' ', regex=True, inplace=True) # quotes

# Process the data to make sure the fields are compatible with the portal’s data type definition. 
# For numeric, make sure that all values can be recognized as a number. Missing values have to be encoded as "NaN". 
# For dates, the data has to be formatted as YYYY-MM-DD (you can also add a time to that - YYYY-MM-DD HH:MM:SS), and missing values have to be encoded as an empty text string ("").
# Check numeric columns

numeric_cols = ['CollectionDepth', 'CollectionReplicate', 'ResultsReplicate', 'Result']
for col in numeric_cols:
    try:
        combined_df[col].fillna('NaN')
    except:
        print('%s field does not exist for dataframe' % col)

# Cast data type for Result and MDL columns to numeric. Must be done here, not in the import data section
combined_df['Result'] = pd.to_numeric(combined_df['Result'], errors='coerce')
combined_df['MDL'] = pd.to_numeric(combined_df['MDL'], errors='coerce')

  combined_df.replace(r'\t',' ', regex=True, inplace=True) # tab


#### 6.2 Check latitude and longitude values.

In [13]:
def check_latitude(val):
    try:
        lat = float(val)
        return lat
    except TypeError:
        # a missing latitude value (and non-numeric values) should throw an error
        # missing values should be encoded as 'NaN' to define data type as numeric on open data portal
        return 'NaN'
    except ValueError:
        return 'NaN'

# Sometimes the Longitude gets entered as 119 instead of -119...
# Make sure Longitude value is negative and less than 10000 (could be projected)
# Check for missing and non-numeric values, replace with 'NaN'
def check_longitude(val):
    try:
        long = float(val)
        if 0. < long < 10000.0:
            val = -long
        return val
    except TypeError:
        # a missing latitude value (and non-numeric values) should throw an error
        # missing values should be encoded as 'NaN' to define data type as numeric on open data portal
        return 'NaN'
    except ValueError:
        return 'NaN'

combined_df['TargetLatitude'] = combined_df['TargetLatitude'].map(check_latitude).fillna('')
combined_df['TargetLongitude'] = combined_df['TargetLongitude'].map(check_longitude).fillna('')

#### 6.3 Drop records that do not have valid Result and MDL values
These records cannot be used even if we try to substitute the original value with 1/2 the MDL.

In [14]:
# Copy non-ND records that have a negative or null Result and a negative or null MDL value to a new dataframe
# These records will later be added to the rejected_records csv file output
rejected1_df = combined_df[((pd.isna(combined_df['Result'])) | (combined_df['Result'] < 0)) & ((pd.isna(combined_df['MDL'])) | (combined_df['MDL'] < 0)) & (combined_df['ResultQualCode'] != 'ND')]
rejected1_df['Comments'] = 'Result is null or negative; MDL is null or negative'
print('Count of unusable records to be dropped:', rejected1_df.shape[0])

# Drop the records from the dataset
combined_df = combined_df.drop(rejected1_df.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rejected1_df['Comments'] = 'Result is null or negative; MDL is null or negative'


Count of unusable records to be dropped: 233


#### 6.4 Drop replicate records

In [15]:
# Copy replicate records to a new dataframe
# These records will later be added to the rejected_records csv file output
replicate_df = combined_df[(combined_df['ResultsReplicate'] != 1) | (combined_df['CollectionReplicate'] != 1)]
replicate_df['Comments'] = 'Replicate data'
print('Count of replicate records to be dropped:', replicate_df.shape[0])

combined_df = combined_df.drop(replicate_df.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  replicate_df['Comments'] = 'Replicate data'


Count of replicate records to be dropped: 44300


#### 6.5 Standardize unit values and drop unneeded records
There is inconsistency, mainly in the CEDEN database, with how the unit values are named. Later on, when calculating the geomeans, we will want to be able to group records by common unit values, so these values should match exactly.

In [16]:
# Rename units with abbreviations to have all capitalized letters
combined_df['Unit'] = combined_df['Unit'].replace('cfu/100mL', 'CFU/100 mL') 
combined_df['Unit'] = combined_df['Unit'].replace('mpn/100mL', 'MPN/100 mL') 

# Filter for specific units to be included in the dataset; copy all other records to new dataframe
units_keep = ['MPN/100 mL', 'CFU/100 mL', 'copies/100 mL']
rejected_units_df = combined_df[~combined_df['Unit'].isin(units_keep)]
print('Count of unit records to filter out:', rejected_units_df.shape[0])

Count of unit records to filter out: 4388


#### 6.6 Categorize records into unit groups based on the unit name
This is based on the assumption that results reported in MPN (most probable number) are equivalent to results reported in CFU (colony forming units). Result values reported in "copies/100 mL" are associated with ddPCR methods. They are not equivalent to either MPN/CFU and should be handled separately. 

In [17]:
# Assign a numeric value to each record based on the UnitName value
unit_map = { 'MPN/100 mL': 1, 'CFU/100 mL': 1, 'copies/100 mL': 2}
combined_df['UnitGroup'] = combined_df['Unit'].map(unit_map)  

combined_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0
1467401,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0
1467400,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0
1467399,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0
1467398,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0


### 7. Add Datum column to the dataset
The data quality estimator tool (used in Step 8) requires the Datum field. This field is not included with the BeachWatch and CEDEN datasets by default, so we must get it from another CEDEN table and then join the values to the working dataset.

In [18]:
# Define a function used to get all records from the CEDEN table with datum data
def get_datum_data():
    try:
        sql = "SELECT StationCode, Datum FROM %s ;" % CEDEN_SITE_DATUM_TABLE
        cnxn = pyodbc.connect(Driver='SQL Server', Server=CEDEN_SERVER1, uid=CEDEN_UID, pwd=CEDEN_PWD)
        df = pd.read_sql(sql, cnxn)
        return df
    except:
        print("Couldn't connect to %s." % CEDEN_SERVER1)

datum_df = get_datum_data()
datum_df.head()

  df = pd.read_sql(sql, cnxn)


Unnamed: 0,StationCode,Datum
0,000JCBOYL,WGS84
1,000POSR1,WGS84
2,000SR01xx,NAD83
3,01_AC_US,NAD83
4,01_FC_DS,NAD83


In [19]:
# Join the datum data to the combined dataset on common StationCode IDs
data_df = pd.merge(combined_df, datum_df, on='StationCode', how='left')

# Fill empty datum values with 'NR'. This is an important step for the data quality estimator, used later
data_df = data_df.fillna(value={'Datum': 'NR'})

data_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup,Datum
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0,NR
1,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0,NR
2,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0,NR
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0,NR
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0,NR


### 8. Add a RegionNumber column to the dataset
This is a requested column to identify the Regional Board area where the site is located. We have to get data from another CEDEN stations table and join it to this dataset. This CEDEN table is a different table than the one used in Step 7. Unfortunately, the RB number values from this table are not complete. There will be some null values and other non-standard values in the dataset.

In [20]:
# Define a function that gets all records from the CEDEN station table, used to join region values.
def get_ceden_site_data():
    cnxn = pyodbc.connect(Driver='SQL Server', Server=CEDEN_SERVER1, uid=CEDEN_UID, pwd=CEDEN_PWD)
    sql = "SELECT StationLUCode, rb_number FROM %s" % CEDEN_SITE_TABLE
    df = pd.read_sql_query(sql, cnxn)
    return df

site_data = get_ceden_site_data()
site_data.head()

  df = pd.read_sql_query(sql, cnxn)


Unnamed: 0,StationLUCode,rb_number
0,0,8
1,0_Rapid_Microbial_2009,
2,000AGRR01,0
3,000BBC011,0
4,000JCBOYL,OOS


In [21]:
# Join the Region number to the combined dataset
data_df = data_df.merge(site_data, how='left', left_on='StationCode', right_on='StationLUCode')
data_df = data_df.rename(columns={'rb_number': 'RegionNumber'})

data_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup,Datum,StationLUCode,RegionNumber
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0,NR,S-23,9
1,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0,NR,PB3,3
2,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0,NR,PB3,3
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0,NR,MB-170,9
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0,NR,MB-170,9


### 9. Add data quality columns to the dataset
The OIMA data quality estimator tool adds two columns, DataQuality and DataQualityIndicator, to the dataset.

DataQuality: Describes the overall quality of the record by taking the QACode, ResulualQACode, ComplicanceCode, BatchVerificationCode, and special circumstances into account to assign it to one of the following categories: Passed, Some review needed, Spatial accuracy unknown, Extensive review needed, Unknown data quality, Reject record, Error in data, Metadata. The assignments and categories are provisional. A working explanation of the data quality ranking can be found in this Google Doc: https://docs.google.com/spreadsheets/d/1q-tGulvO9jyT2dR9GGROdy89z3W6xulYaci5-ezWAe0/edit?usp=sharing

DataQualityIadic:tor: Explains the reason for the DataQuality value by indicating which quality assurance check the data did not pass (e.g. BatchVerificationCode, ResultQACode, etc.).

The function "add_data_quality" used to add these two columns is imported into this notebook from another Python script. Re for the ful and dictionaries. e.c.).

The code for the data quality estimator is hosted on GitHub here: https://github.com/mmtang/data-quality-estimator.
- The function *add_data_quality*: https://github.com/mmtang/data-quality-estimator/blob/master/data_quality.py
- The dictionaries for QACodes, ResultQualCodes, ComplianceCodes, etc. and their associated data quality values: https://github.com/mmtang/data-quality-estimator/blob/master/dq_constants.py

In [22]:
# Import Python file with the data quality estimator functions
import sys
sys.path.append('../data-quality-estimator')  # Path contains data_quality_utils.py

import data_quality

In [23]:
# Add the DataQuality and DataQualityIndicator columns
data_df = data_quality.add_data_quality(data_df, 'chemistry')

data_df.head()

H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key in QA_Code_list
H6 not a valid key i

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup,Datum,StationLUCode,RegionNumber,DataQuality,DataQualityIndicator
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0,NR,S-23,9,Unknown data quality,QACode:NR; BatchVerification:NR
1,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR
2,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR


### 10. Drop records with a DataQuality score of "Reject record" or "Metadata"

In [24]:
# Copy records with a DataQuality score of 'Reject record' or 'MetaData to a new dataframe
# These records will later be added to the rejected_records csv file output
dq_filter = ['Reject record', 'MetaData']
reject_dq_df = data_df[data_df['DataQuality'].isin(dq_filter)]

# Drop these records from the dataset
data_df = data_df[~data_df['DataQuality'].isin(dq_filter)]

### 11. Clean null values
For compatability with the open data portal

In [25]:
# We have to make a distinction between None, 'None', and ''
# 'None' and '' are used specifically in the datasets, but None gets translated to 'None' unless we replace it with '' explicitly
data_df.fillna('')

  data_df.fillna('')


Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup,Datum,StationLUCode,RegionNumber,DataQuality,DataQualityIndicator
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.395800,-117.600000,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01 00:00:00,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0,NR,S-23,9,Unknown data quality,QACode:NR; BatchVerification:NR
1,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.135900,-120.643000,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01 00:00:00,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR
2,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.135900,-120.643000,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01 00:00:00,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.769600,-117.248000,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01 00:00:00,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.769600,-117.248000,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01 00:00:00,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1722623,Riverside County NPDES MS4 Monitoring Program,San Diego Region,Santa Margarita River General Monitoring,Adobe Creek,902ADB848,2014-09-10,1899-12-30 08:20:00,Not Recorded,0.1,m,Grab,1,1,14I1523,B4I1032-01,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1600.0,,2.0,2,>=,,NR,NR,,,,,WQ,Not Recorded,RCFC,,Water_Grab,33.513020,-117.268580,Not Recorded,NaT,Not Applicable,,1950-01-01,,1950-01-01,2014-09-10 14:15:00,1,,Babcock,AMEC,NR,,,,,,,,,,,,,,,,,1415-D1-848-01G,"Coliform, Fecal",CEDEN,,,False,08:20:00,2014-09-10 08:20:00,1.0,WGS84,902ADB848,9,Unknown data quality,BatchVerification:NR
1722624,Point Loma Ocean Outfall Monitoring,Point Loma Ocean Outfall Monitoring,Point Loma Ocean Outfall Monitoring,City of San Diego Station A7,CSD_A7,2022-10-25,1899-12-30 08:33:00,Nearshore,1,m,Grab,1,1,20221025-1-CSD-EMTS,,samplewater,SM 9222 B,"Coliform, Total",CFU/100 mL,2.0,,-88.0,-88,<,,NR,NR,Fishing vessel on station; Kelp Debris; Lobste...,,,,WQ,QAPP_for_Coastal_Receiving_Waters_Monitoring_2020,CSD,,Water_Grab,32.675500,-117.266833,Not Recorded,NaT,Subsurface,Not Recorded,1950-01-01,Not Recorded,1950-01-01,2022-10-26 13:40:00,1,,EMTS,WSP_SD_SkyPkwy,NR,,,,,,,,,,,,,,,,,2210252864,"Coliform, Total",CEDEN,,,False,08:33:00,2022-10-25 08:33:00,1.0,NAD83,CSD_A7,9,Unknown data quality,BatchVerification:NR
1722625,Riverside County NPDES MS4 Monitoring Program,San Diego Region,Santa Margarita River General Monitoring,Lower Murrieta Creek,902LMC778,2016-01-05,1899-12-30 16:30:00,Not Recorded,0.1,m,Grab,1,1,6A05142,B6A0356-01,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,8000.0,,200.0,200,=,,NR,NR,,,RL for analyte does not meet the SWAMP/ CTR re...,,WQ,Not Recorded,RCFC,,Water_Grab,33.477890,-117.142120,Not Recorded,NaT,Not Applicable,,1950-01-01,,1950-01-01,2016-01-05 20:20:00,100,,Babcock,AMEC,NR,,,,,,,,,,,,,,,,,1516-W2-778-01,"Coliform, Fecal",CEDEN,,,False,16:30:00,2016-01-05 16:30:00,1.0,WGS84,902LMC778,9,Unknown data quality,BatchVerification:NR
1722626,Point Loma Ocean Outfall Monitoring,Point Loma Ocean Outfall Monitoring,Point Loma Ocean Outfall Monitoring,City of San Diego Station A7,CSD_A7,2022-09-13,1899-12-30 08:49:00,Nearshore,1,m,Grab,1,1,20220913-1-CSD-EMTS,,samplewater,SM 9222 B,"Coliform, Total",CFU/100 mL,2.0,,-88.0,-88,=,"J,UF",NR,NR,Redo used 2nd cast; Kelp Debris,,,,WQ,QAPP_for_Coastal_Receiving_Waters_Monitoring_2020,CSD,,Water_Grab,32.675500,-117.266833,Not Recorded,NaT,Subsurface,Not Recorded,1950-01-01,Not Recorded,1950-01-01,2022-09-14 14:10:00,1,,EMTS,WSP_SD_SkyPkwy,NR,,,,,,,,,,,,,,,,,2209131355,"Coliform, Total",CEDEN,,,False,08:49:00,2022-09-13 08:49:00,1.0,NAD83,CSD_A7,9,Unknown data quality,BatchVerification:NR


### 12. Export a CSV file of all the dropped records. This includes:

- Step 5: Dropped duplicate records (duplicates_df)
- Step 6.3: Dropped records with unusable Result and MDL values (rejected1_df)
- Step 6.4: Dropped replicate records (replicate_df)
- Step 6.5: Dropped records with unit values we are not using (rejected_units_df)
- Step 10: Dropped data quality records (reject_dq_df)

In [26]:
# Define fields to be included in file export
reject_export_fields = [
    'Program',
    'ParentProject',
    'Project',
    'StationName',
    'StationCode',
    'SampleDate',
    'CollectionTime',
    'LocationCode',
    'CollectionDepth',
    'UnitCollectionDepth',
    'SampleTypeCode',
    'CollectionReplicate',
    'ResultsReplicate',
    'LabBatch',
    'LabSampleID',
    'MatrixName',
    'MethodName',
    'Analyte',
    'Unit',
    'Result',
    'Observation',
    'MDL',
    'RL',
    'ResultQualCode',
    'QACode',
    'BatchVerification',
    'ComplianceCode',
    'SampleComments',
    'CollectionComments',
    'ResultsComments',
    'BatchComments',
    'EventCode',
    'ProtocolCode',
    'SampleAgency',
    'GroupSamples',
    'CollectionMethodName',
    'TargetLatitude',
    'TargetLongitude',
    'CollectionDeviceDescription',
    'CalibrationDate',
    'PositionWaterColumn',
    'PrepPreservationName',
    'PrepPreservationDate',
    'DigestExtractMethod',
    'DigestExtractDate',
    'AnalysisDate',
    'DilutionFactor',
    'ExpectedValue',
    'LabAgency',
    'SubmittingAgency',
    'SubmissionCode',
    'OccupationMethod',
    'StartingBank',
    'DistanceFromBank',
    'UnitDistanceFromBank',
    'StreamWidth',
    'UnitStreamWidth',
    'StationWaterDepth',
    'UnitStationWaterDepth',
    'HydroMod',
    'HydroModLoc',
    'LocationDetailWQComments',
    'ChannelWidth',
    'UpstreamLength',
    'DownStreamLength',
    'TotalReach',
    'LocationDetailBAComments',
    'SampleID',
    'DW_AnalyteName',
    'UnitGroup',
    'Datum',
    'DataSource',
    'SampleDateTime',
    'RegionNumber',
    'DataQuality',
    'DataQualityIndicator',
    'Comments'
]

# Merge all dataframes into a single dataframe
all_dropped_records_df = pd.concat([duplicates_df, rejected1_df, replicate_df, rejected_units_df, reject_dq_df], ignore_index=True)
all_dropped_records_df = all_dropped_records_df[reject_export_fields]

all_dropped_records_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,HydroMod,HydroModLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,UnitGroup,Datum,DataSource,SampleDateTime,RegionNumber,DataQuality,DataQualityIndicator,Comments
0,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-050-non-accessible or restricted access sho...,PL-050,1999-08-09,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-08/09/1999,Not Recorded,samplewater,SM 9222 B,"Coliform, Total",cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6794,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",,,BeachWatch,1999-08-09,,,,Duplicate record
1,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"SE-060-Cardiff State Beach, San Diego",SE-060,2002-07-02,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-07/02/2002,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,20,,10,10,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,33.0184,-117.284,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",,,BeachWatch,2002-07-02,,,,Duplicate record
2,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2000-07-06,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-07/06/2000,Not Recorded,samplewater,SM 9222 B,Enterococcus,cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,,,BeachWatch,2000-07-06,,,,Duplicate record
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2002-06-26,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-06/26/2002,Not Recorded,samplewater,SM 9222 B,Enterococcus,cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,,,BeachWatch,2002-06-26,,,,Duplicate record
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,PL-060-non-accessible or restricted access sho...,PL-060,2002-05-09,1900-01-01,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-05/09/2002,Not Recorded,samplewater,SM 9222 B,"Coliform, Fecal",cfu/100mL,2,,2,2,<,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.6934,-117.261,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",,,BeachWatch,2002-05-09,,,,Duplicate record


In [27]:
# Export all rejected records as a CSV file
all_dropped_records_df.to_csv('SafeToSwim_rejected_records.csv', index=False)

### 13. Handle non-detect (ND) records and assign substitute Result values
If a record is flagged as non-detect (ResultQualCode == 'ND'), substitute the Result value with either half the original Result value (if the Result > 0) or half the MDL (if the Result <= 0).

Also substitute half the MDL for records that are not flagged as non-detect but for some reason have a zero, null, or negative Result value. There shouldn't be very many (if any) of these records at this point, but I've left the code here just in case any slip through.

In [28]:
# Define a function for assigning substitute Result values
def subResult(row):
    if (row['ResultQualCode'] == 'ND'):
        if (row['Result'] > 0):
            return pd.Series([(0.5 * row['Result']), 'Nondetect: result substituted with half the result value'])
        elif (row['MDL'] > 0):
            return pd.Series([(0.5 * row['MDL']), 'Nondetect: result substituted with half the MDL'])
        else:
            return pd.Series([row['Result'], 'No substitution'])
    elif ((row['Result'] == 0) or (pd.isna(row['Result'])) or (row['Result'] < 0)):
        if (row['MDL'] > 0):
            return pd.Series([(0.5 * row['MDL']), 'Result substituted with half the MDL'])
        else:
            return pd.Series([row['Result'], 'No substitution'])
    else:
        return pd.Series([row['Result'], 'No substitution'])

# Apply the function to the entire dataframe and save the subbed and non-subbed Result values to a new dataframe
sub_values = data_df.apply(lambda x: subResult(x), axis=1)

# Copy over the values and comments to the original dataframe as a new column "ResultSub". The original "Result" column is left untouched for reference.
data_df['ResultSub'], data_df['ResultSubComments'] = sub_values[0], sub_values[1]

data_df.head()

Unnamed: 0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,SampleDateTime,UnitGroup,Datum,StationLUCode,RegionNumber,DataQuality,DataQualityIndicator,ResultSub,ResultSubComments
0,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"S-23-San Clemente State Beach, Orange",S-23,2004-10-25,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,SRRA-10/25/2004,Not Recorded,samplewater,EPA 1600,Enterococcus,CFU/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.3958,-117.6,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SRRA,SRRA,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,00:00:00,2004-10-25 00:00:00,1.0,NR,S-23,9,Unknown data quality,QACode:NR; BatchVerification:NR,10.0,No substitution
1,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-09,1900-01-01 09:50:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/09/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,20.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,09:50:00,2007-07-09 09:50:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR,20.0,No substitution
2,BeachWatch,BeachWatch_San Luis Obispo County,BeachWatch_San Luis Obispo County,"PB3-Pismo State Beach, San Luis Obispo",PB3,2007-07-02,1900-01-01 11:45:00,SurfZone,-88.0,NR,Grab,1,1,SLOCPHDEH-07/02/2007,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,0.0,0,<,NR,NR,NR,,,,,wq,Not Recorded,SLOCPHDEH,,Water_Grab,35.1359,-120.643,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,SLOCPHDEH,SLOCPHDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,11:45:00,2007-07-02 11:45:00,1.0,NR,PB3,3,Unknown data quality,QACode:NR; BatchVerification:NR,10.0,No substitution
3,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-13,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/13/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Fecal",MPN/100 mL,1700.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,00:00:00,2000-12-13 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR,1700.0,No substitution
4,BeachWatch,BeachWatch_San Diego County,BeachWatch_San Diego County,"MB-170-Mission Bay, Mariners Basin, San Diego",MB-170,2000-12-12,1900-01-01 00:00:00,SurfZone,-88.0,NR,Grab,1,1,CSDDEH-12/12/2000,Not Recorded,samplewater,SM 9221 E,"Coliform, Total",MPN/100 mL,800.0,,10.0,10,=,NR,NR,NR,,,,,wq,Not Recorded,CSDDEH,,Water_Grab,32.7696,-117.248,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,CSDDEH,CSDDEH,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,00:00:00,2000-12-12 00:00:00,1.0,NR,MB-170,9,Unknown data quality,QACode:NR; BatchVerification:NR,800.0,No substitution


### 14. Calculate the geometric mean values

#### 14.1 Required data prep before calculating the geometric mean

In [29]:
# Ensure that SampleDateTime values are cast as datetime objects
data_df['SampleDateTime'] = data_df['SampleDateTime'].astype('datetime64[ns]')

# Set SampleDateTime as the index. This is more efficient for the grouping operations
data_df.set_index('SampleDateTime', inplace=True) 

# Drop records that have a null/NaT SampleDate value. As of 6-18-24, this is just one record.
data_df = data_df.loc[data_df.index.notnull()] 

# Sort records based on ascending SampleDate. A bit counterintuitive, but this is the setup for calculating 
# the rolling geometric starting from the most recent sample date working backwards using the rolling function
data_df.sort_index(ascending=True, inplace=True) 

data_df.head()

Unnamed: 0_level_0,Program,ParentProject,Project,StationName,StationCode,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Analyte,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,UnitGroup,Datum,StationLUCode,RegionNumber,DataQuality,DataQualityIndicator,ResultSub,ResultSubComments
SampleDateTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1
1900-01-02 08:36:00,BeachWatch,BeachWatch_Ventura County,BeachWatch_Ventura County,"1000-Rincon Beach, Ventura",1000,1900-01-02,1900-01-01 08:36:00,SurfZone,-88.0,NR,Grab,1,1,VCEHD-01/02/1900,Not Recorded,samplewater,Colilert-18,E. coli,MPN/100 mL,98.0,,1.0,1,=,NR,NR,NR,,,,,wq,Not Recorded,VCEHD,,Water_Grab,34.3733,-119.477,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,0.0,0,VCEHD,VCEHD,NR,,,,,,,,,,,,,,,,,Not Recorded,E. coli,BeachWatch,,,,08:36:00,1.0,NR,1000,3,Unknown data quality,QACode:NR; BatchVerification:NR,98.0,No substitution
1900-01-02 08:36:00,BeachWatch,BeachWatch_Ventura County,BeachWatch_Ventura County,"1000-Rincon Beach, Ventura",1000,1900-01-02,1900-01-01 08:36:00,SurfZone,-88.0,NR,Grab,1,1,VCEHD-01/02/1900,Not Recorded,samplewater,Colilert-18,"Coliform, Total",MPN/100 mL,1956.0,,1.0,1,=,NR,NR,NR,,,,,wq,Not Recorded,VCEHD,,Water_Grab,34.3733,-119.477,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,0.0,0,VCEHD,VCEHD,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Total",BeachWatch,,,,08:36:00,1.0,NR,1000,3,Unknown data quality,QACode:NR; BatchVerification:NR,1956.0,No substitution
1900-01-02 08:36:00,BeachWatch,BeachWatch_Ventura County,BeachWatch_Ventura County,"1000-Rincon Beach, Ventura",1000,1900-01-02,1900-01-01 08:36:00,SurfZone,-88.0,NR,Grab,1,1,VCEHD-01/02/1900,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,53.0,,1.0,1,=,NR,NR,NR,,,,,wq,Not Recorded,VCEHD,,Water_Grab,34.3733,-119.477,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,0.0,0,VCEHD,VCEHD,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,08:36:00,1.0,NR,1000,3,Unknown data quality,QACode:NR; BatchVerification:NR,53.0,No substitution
1900-01-02 08:53:00,BeachWatch,BeachWatch_Ventura County,BeachWatch_Ventura County,"4000-Oil Piers Beach, Ventura",4000,1900-01-02,1900-01-01 08:53:00,SurfZone,-88.0,NR,Grab,1,1,VCEHD-01/02/1900,Not Recorded,samplewater,Enterolert,Enterococcus,MPN/100 mL,10.0,,1.0,1,=,NR,NR,NR,,,,,wq,Not Recorded,VCEHD,,Water_Grab,34.3522,-119.428,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,0.0,0,VCEHD,VCEHD,NR,,,,,,,,,,,,,,,,,Not Recorded,Enterococcus,BeachWatch,,,,08:53:00,1.0,NR,4000,4,Unknown data quality,QACode:NR; BatchVerification:NR,10.0,No substitution
1900-01-02 08:53:00,BeachWatch,BeachWatch_Ventura County,BeachWatch_Ventura County,"4000-Oil Piers Beach, Ventura",4000,1900-01-02,1900-01-01 08:53:00,SurfZone,-88.0,NR,Grab,1,1,VCEHD-01/02/1900,Not Recorded,samplewater,Colilert-18,E. coli,MPN/100 mL,10.0,,1.0,1,=,NR,NR,NR,,,,,wq,Not Recorded,VCEHD,,Water_Grab,34.3522,-119.428,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,0.0,0,VCEHD,VCEHD,NR,,,,,,,,,,,,,,,,,Not Recorded,E. coli,BeachWatch,,,,08:53:00,1.0,NR,4000,4,Unknown data quality,QACode:NR; BatchVerification:NR,10.0,No substitution


#### 14.2 Group records and calculate the geometric mean
This code block adds four new columns:

- 30DayGeoMean: The rolling geometric mean value looking back 30 days from the recorded sample date.
- 30DayCount: The number of distinct sample result values included in the 30 day date range and used in the geometric mean calculation.
- 6WeekGeoMean: The rolling geometric mean value looking back 6 weeks (42 days) from the recorded sample date.
- 6WeekCount: The number of distinct sample result values included in the 6 week date range and used in the geometric mean calculation.

In [30]:
# Function for calculating and adding the geometric mean columns to a grouped dataframe
def process_group(df):
    # Nested function for calculating the geometric mean         
    def calculateGeometricMean(x):
        # Need to group records again or else the Result values are duplicated in the calculation
        x = x.groupby(level=0).mean()
        g_value = gmean(x, nan_policy='omit') # gmean is a SciPy function
        return g_value

    # It is not recommended to mutate the object we're iterating on, thus the copy:
    # https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods
    df = df.copy() 

    # Calculate 30 day rolling geomean
    df['30DayGeoMean'] = df['ResultSub'].rolling(window='30D', min_periods=1, closed='both').apply(calculateGeometricMean).round(3) 
    df['30DayCount'] = df['ResultSub'].rolling(window='30D', min_periods=1, closed='both').apply(lambda x: len(x.groupby(level=0)))

    # Calculate 6 week (42 days) rolling geomean
    df['6WeekGeoMean'] = df['ResultSub'].rolling(window='42D', min_periods=1, closed='both').apply(calculateGeometricMean).round(3)
    df['6WeekCount'] = df['ResultSub'].rolling(window='42D', min_periods=1, closed='both').apply(lambda x: len(x.groupby(level=0))) 

    # Drop duplicate records
    df = df.groupby(level=0).last()
    return df

# Calculate new geometric mean values for all FIB records based on the SampleDateTime index and common column values as defined in group_cols
# Set allow_duplicates=True to reinsert index columns into the dataframe and allow columns with the same name
group_cols = ['Analyte', 'StationCode', 'UnitGroup']
grouped_df = data_df.groupby(group_cols).apply(process_group).reset_index(allow_duplicates=True)

# Drop duplicate columns. There might be some duplicate columns leftover after the new geomean columns are inserted back into the dataframe
grouped_df = grouped_df.loc[:,~grouped_df.columns.duplicated()]

grouped_df.head()

  grouped_df = data_df.groupby(group_cols).apply(process_group).reset_index(allow_duplicates=True)


Unnamed: 0,Analyte,StationCode,UnitGroup,SampleDateTime,Program,ParentProject,Project,StationName,SampleDate,CollectionTime,LocationCode,CollectionDepth,UnitCollectionDepth,SampleTypeCode,CollectionReplicate,ResultsReplicate,LabBatch,LabSampleID,MatrixName,MethodName,Unit,Result,Observation,MDL,RL,ResultQualCode,QACode,BatchVerification,ComplianceCode,SampleComments,CollectionComments,ResultsComments,BatchComments,EventCode,ProtocolCode,SampleAgency,GroupSamples,CollectionMethodName,TargetLatitude,TargetLongitude,CollectionDeviceDescription,CalibrationDate,PositionWaterColumn,PrepPreservationName,PrepPreservationDate,DigestExtractMethod,DigestExtractDate,AnalysisDate,DilutionFactor,ExpectedValue,LabAgency,SubmittingAgency,SubmissionCode,OccupationMethod,StartingBank,DistanceFromBank,UnitDistanceFromBank,StreamWidth,UnitStreamWidth,StationWaterDepth,UnitStationWaterDepth,Hydromod,HydromodLoc,LocationDetailWQComments,ChannelWidth,UpstreamLength,DownStreamLength,TotalReach,LocationDetailBAComments,SampleID,DW_AnalyteName,DataSource,HydroMod,HydroModLoc,isQA,CollectionTimeOnly,Datum,StationLUCode,RegionNumber,DataQuality,DataQualityIndicator,ResultSub,ResultSubComments,30DayGeoMean,30DayCount,6WeekGeoMean,6WeekCount
0,"Coliform, Fecal",0,1.0,1998-03-02 08:00:00,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"0-Huntington State Beach, Orange",1998-03-02,1900-01-01 08:00:00,SurfZone,-88.0,NR,Grab,1,1,OC-03/02/1998,Not Recorded,samplewater,SM 9221 E,MPN/100 mL,700.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.6293,-117.96,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,OC,OC,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,08:00:00,NR,0,8,Unknown data quality,QACode:NR; BatchVerification:NR,700.0,No substitution,700.0,1.0,700.0,1.0
1,"Coliform, Fecal",0,1.0,1998-03-03 08:00:00,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"0-Huntington State Beach, Orange",1998-03-03,1900-01-01 08:00:00,SurfZone,-88.0,NR,Grab,1,1,OC-03/03/1998,Not Recorded,samplewater,SM 9221 E,MPN/100 mL,230.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.6293,-117.96,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,OC,OC,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,08:00:00,NR,0,8,Unknown data quality,QACode:NR; BatchVerification:NR,230.0,No substitution,401.248,2.0,401.248,2.0
2,"Coliform, Fecal",0,1.0,1998-03-04 08:00:00,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"0-Huntington State Beach, Orange",1998-03-04,1900-01-01 08:00:00,SurfZone,-88.0,NR,Grab,1,1,OC-03/04/1998,Not Recorded,samplewater,SM 9221 E,MPN/100 mL,80.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.6293,-117.96,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,OC,OC,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,08:00:00,NR,0,8,Unknown data quality,QACode:NR; BatchVerification:NR,80.0,No substitution,234.408,3.0,234.408,3.0
3,"Coliform, Fecal",0,1.0,1998-03-13 08:00:00,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"0-Huntington State Beach, Orange",1998-03-13,1900-01-01 08:00:00,SurfZone,-88.0,NR,Grab,1,1,OC-03/13/1998,Not Recorded,samplewater,SM 9221 E,MPN/100 mL,40.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.6293,-117.96,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,OC,OC,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,08:00:00,NR,0,8,Unknown data quality,QACode:NR; BatchVerification:NR,40.0,No substitution,150.659,4.0,150.659,4.0
4,"Coliform, Fecal",0,1.0,1998-03-17 08:00:00,BeachWatch,BeachWatch_Orange County,BeachWatch_Orange County,"0-Huntington State Beach, Orange",1998-03-17,1900-01-01 08:00:00,SurfZone,-88.0,NR,Grab,1,1,OC-03/17/1998,Not Recorded,samplewater,SM 9221 E,MPN/100 mL,230.0,,0.0,0,=,NR,NR,NR,,,,,wq,Not Recorded,OCEHD,,Water_Grab,33.6293,-117.96,Not Recorded,NaT,Not Recorded,Not Recorded,1950-01-01,Not Recorded,1950-01-01,1950-01-01,1.0,0,OC,OC,NR,,,,,,,,,,,,,,,,,Not Recorded,"Coliform, Fecal",BeachWatch,,,,08:00:00,NR,0,8,Unknown data quality,QACode:NR; BatchVerification:NR,230.0,No substitution,163.961,5.0,163.961,5.0


### 14. Export the geomean dataset as a CSV file

#### 14.1 Export the dataset with all columns

In [31]:
all_fields = [
    'Program',
    'ParentProject',
    'Project',
    'StationName',
    'StationCode',
    'SampleDate',
    'CollectionTime',
    'LocationCode',
    'CollectionDepth',
    'UnitCollectionDepth',
    'SampleTypeCode',
    'CollectionReplicate',
    'ResultsReplicate',
    'LabBatch',
    'LabSampleID',
    'MatrixName',
    'MethodName',
    'Analyte',
    'Unit',
    'Result',
    'Observation',
    'MDL',
    'RL',
    'ResultQualCode',
    'QACode',
    'BatchVerification',
    'ComplianceCode',
    'SampleComments',
    'CollectionComments',
    'ResultsComments',
    'BatchComments',
    'EventCode',
    'ProtocolCode',
    'SampleAgency',
    'GroupSamples',
    'CollectionMethodName',
    'TargetLatitude',
    'TargetLongitude',
    'CollectionDeviceDescription',
    'CalibrationDate',
    'PositionWaterColumn',
    'PrepPreservationName',
    'PrepPreservationDate',
    'DigestExtractMethod',
    'DigestExtractDate',
    'AnalysisDate',
    'DilutionFactor',
    'ExpectedValue',
    'LabAgency',
    'SubmittingAgency',
    'SubmissionCode',
    'OccupationMethod',
    'StartingBank',
    'DistanceFromBank',
    'UnitDistanceFromBank',
    'StreamWidth',
    'UnitStreamWidth',
    'StationWaterDepth',
    'UnitStationWaterDepth',
    'HydroMod',
    'HydroModLoc',
    'LocationDetailWQComments',
    'ChannelWidth',
    'UpstreamLength',
    'DownStreamLength',
    'TotalReach',
    'LocationDetailBAComments',
    'SampleID',
    'DW_AnalyteName',
    #'UnitGroup',
    'Datum',
    #'CollectionTimeOnly',
    'DataSource',
    'SampleDateTime',
    'RegionNumber',
    'DataQuality',
    'DataQualityIndicator',
    'ResultSub',
    'ResultSubComments',
    #'ResultAvg',
    '30DayGeoMean',
    '30DayCount',
    '6WeekGeoMean',
    '6WeekCount'
]

# Order columns
grouped_df_full = grouped_df[all_fields]

# Export dataframe as a CSV file
grouped_df_full.to_csv('SafeToSwim_geomeans.csv', index=False)

#### 14.2 Dataset with select columns (for testing)
Expprt an shortened version of the dataset (fewer columns) for testing. 

In [32]:
test_fields = [
    'Program',
    'ParentProject',
    'Project',
    'StationName',
    'StationCode',
    'SampleDate',
    'CollectionTime',
    #'LocationCode',
    #'CollectionDepth',
    #'UnitCollectionDepth',
    #'SampleTypeCode',
    #'CollectionReplicate',
    #'ResultsReplicate',
    'LabBatch',
    #'LabSampleID',
    'MatrixName',
    'MethodName',
    'Analyte',
    'Unit',
    'Result',
    #'Observation',
    'MDL',
    'RL',
    'ResultQualCode',
    #'QACode',
    #'BatchVerification',
    #'ComplianceCode',
    #'SampleComments',
    #'CollectionComments',
    #'ResultsComments',
    #'BatchComments',
    #'EventCode',
    #'ProtocolCode',
    #'SampleAgency',
    #'GroupSamples',
    #'CollectionMethodName',
    #'TargetLatitude',
    #'TargetLongitude',
    #'CollectionDeviceDescription',
    #'CalibrationDate',
    #'PositionWaterColumn',
    #'PrepPreservationName',
    #'PrepPreservationDate',
    #'DigestExtractMethod',
    #'DigestExtractDate',
    #'AnalysisDate',
    #'DilutionFactor',
    #'ExpectedValue',
    #'LabAgency',
    #'SubmittingAgency',
    #'SubmissionCode',
    #'OccupationMethod',
    #'StartingBank',
    #'DistanceFromBank',
    #'UnitDistanceFromBank',
    #'StreamWidth',
    #'UnitStreamWidth',
    #'StationWaterDepth',
    #'UnitStationWaterDepth',
    #'HydroMod',
    #'HydroModLoc',
    #'LocationDetailWQComments',
    #'ChannelWidth',
    #'UpstreamLength',
    #'DownStreamLength',
    #'TotalReach',
    #'LocationDetailBAComments',
    #'SampleID',
    #'DW_AnalyteName',
    #'Datum',
    #'CollectionTimeOnly',
    'DataSource',
    'SampleDateTime',
    'RegionNumber',
    'DataQuality',
    'DataQualityIndicator',
    'ResultSub',
    'ResultSubComments',
    #'ResultAvg',
    '30DayGeoMean',
    '30DayCount',
    '6WeekGeoMean',
    '6WeekCount'
]

# Order columns
grouped_df_test = grouped_df[test_fields]

# Export dataframe as a CSV file
grouped_df_test.to_csv('SafeToSwim_geomeans_short.csv', index=False)