# 1. Read and pre-process the data

## 1.1 Read data

This first part of the code checks and cleans the data. The data come from the Alberta Water Quality Data Portal, restricted to the water matrix (0) and to long-term and tributary monitoring stations.
Data source: https://environment.extranet.gov.ab.ca/apps/WaterQuality/dataportal/DataDownload/Index/

In [2]:
import pandas as pd

In [75]:
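# Name of the date column (not passed to read_csv below; SampleDateTime is read as a string and converted in the next cell):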
parse_dates = ['SampleDateTime']

# Now read the data
data = pd.read_csv(
 '../data/Water Quality-2025-03-08 172848.csv',
 dtype={
 'ProjectNumber': 'string',
 'SampleNumber': 'string',
 'ContinentalRiverBasinCode': 'category',
 'RiverBasinCode': 'category',
 'RiverSubBasinCode': 'category',
 'StationTypeCode': 'string',
 'StationNumber': 'category',
 'Station': 'category',
 'LatitudeDecimalDegrees': 'float64',
 'LongitudeDecimalDegrees': 'float64',
 'SampleMatrixCode': 'category',
 'SampleTypeCode': 'category',
 'CollectionCode': 'category',
 'QCSampleFlag': 'category',
 'SampleComment': 'string',
 'SampleDateTime': 'string',
 'VmvCode': 'string',
 'VariableCode': 'string',
 'VariableName': 'category',
 'MeasurementFlag': 'category',
 'MeasurementValue': 'float64',
 'UnitCode': 'category',
 'SampleDetectLimit': 'string',
 'MeasurementComment': 'string',
 'MeasurementQualifier': 'string',
 'MeasurementQualifierDescription': 'string',
 'MeasurementQualifierComment': 'string',
 'MethodCode': 'string',
 'MethodDetectionLimit': 'float64',
 'LabCode': 'category'
 },
 na_values=['', 'NaN', 'NULL', 'N/A', 'NA', 'null'])

In [76]:
# Convert the SampleDateTime column to datetime; values that do not match the format become NaT
data['SampleDateTime'] = pd.to_datetime(data['SampleDateTime'], format='%m/%d/%Y %H:%M:%S', errors='coerce')

# Convert SampleDetectLimit to numeric
data['SampleDetectLimit'] = pd.to_numeric(data['SampleDetectLimit'], errors='coerce')
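
Because `errors='coerce'` silently turns unparseable entries into `NaT`, it can help to count how many non-missing raw values failed to parse. The following is a minimal sketch on a toy series; the values and the names `raw`, `parsed`, and `n_failed` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the raw SampleDateTime strings (hypothetical values)
raw = pd.Series(['03/08/2025 17:28:48', '13/45/2025 00:00:00', None])

# Same format and error handling as in the conversion cell
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M:%S', errors='coerce')

# Entries that were present in the raw column but could not be parsed
n_failed = int((parsed.isna() & raw.notna()).sum())
print(f'Unparseable timestamps: {n_failed}')
```

A non-zero count would suggest inspecting the offending strings before relying on the converted column. The same idea applies to `pd.to_numeric` on `SampleDetectLimit`.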

## 1.2 Filter the data

In [77]:
# Check missingness for each column and arrange in ascending order
data.isnull().sum().sort_values()

ProjectNumber                            0
VmvCode                                  0
SampleDateTime                           0
MeasurementValue                         0
QCSampleFlag                             0
MethodCode                               0
SampleTypeCode                           0
SampleMatrixCode                         0
VariableCode                             0
LongitudeDecimalDegrees                  0
Station                                  0
StationNumber                            0
StationTypeCode                          0
RiverSubBasinCode                        0
RiverBasinCode                           0
ContinentalRiverBasinCode                0
SampleNumber                             0
LatitudeDecimalDegrees                   0
VariableName                             0
LabCode                                288
UnitCode                             73616
CollectionCode                      200683
MethodDetectionLimit                386180
SampleComme

**Note:** Whether to keep records without a unit code is the user's choice. I chose to drop them because a measurement without a unit is ambiguous and hard to use. As a more general rule, rows missing SampleDateTime, MeasurementValue, StationNumber, or VariableName are dropped as well.

In [97]:
data = data.dropna(subset = ['UnitCode', 'MeasurementValue', 'SampleDateTime', 'StationNumber', 'VariableName'])
data = data.dropna(how = 'all') # drop rows where all elements are NaN

In [None]:
duplicates = data.duplicated(subset=['SampleDateTime', 'StationNumber', 'VariableName', 'MeasurementValue', 
                                     'UnitCode', 'VmvCode', 'SampleNumber', 'LabCode'], 
                             keep=False)
                             
# write duplicate rows to csv   
# data[duplicates].to_csv('../output/duplicate_rows.csv', index=False)

The duplicate check found no duplicate rows for this combination of columns.
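
One way to confirm this kind of result is to sum the boolean mask returned by `duplicated`. The sketch below uses a small toy frame with a reduced column subset; the data and the names `toy`, `mask`, and `n_dup` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with one repeated record (hypothetical values)
toy = pd.DataFrame({
    'StationNumber': ['A', 'A', 'B'],
    'VariableName': ['pH', 'pH', 'pH'],
    'MeasurementValue': [7.1, 7.1, 8.0],
})

# keep=False flags every member of a duplicate group, not only the repeats
mask = toy.duplicated(subset=['StationNumber', 'VariableName', 'MeasurementValue'],
                      keep=False)

# Number of rows that belong to any duplicate group
n_dup = int(mask.sum())
print(f'Rows involved in duplicates: {n_dup}')
```

A count of zero on the real data would match the conclusion above. With `keep=False`, a single repeated record contributes two rows to the count.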

## 1.3 EDA

More to do: harmonize the units for each parameter.
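
A possible starting point for the harmonization is to list the parameters that are reported in more than one unit. This is a minimal sketch on toy data; the values and the names `toy`, `units_per_var`, and `mixed` are illustrative assumptions.

```python
import pandas as pd

# Toy frame: one parameter reported in two units (hypothetical values)
toy = pd.DataFrame({
    'VariableName': ['Chloride', 'Chloride', 'pH', 'pH'],
    'UnitCode': ['mg/L', 'ug/L', 'pH units', 'pH units'],
})

# Count distinct units per parameter; observed=True skips unused
# categories when the grouping column has the 'category' dtype
units_per_var = toy.groupby('VariableName', observed=True)['UnitCode'].nunique()

# Parameters that would need unit harmonization
mixed = units_per_var[units_per_var > 1]
print(mixed)
```

Only the parameters listed in `mixed` would need conversion factors; the rest already use a single unit.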

In [78]:
data.describe()

| | LatitudeDecimalDegrees | LongitudeDecimalDegrees | SampleDateTime | MeasurementValue | SampleDetectLimit | MethodDetectionLimit |
|---|---|---|---|---|---|---|
| count | 2283487.0 | 2283487.0 | 2283487 | 2283487.0 | 1676152.0 | 1897307.0 |
| mean | 52.50656 | -113.6527 | 2012-11-14 19:27:48.970451456 | 56.97865 | 0.428427 | 0.4913207 |
| min | 49.02673 | -118.8047 | 1959-02-20 14:00:00 | -213.0 | 0.0 | 0.0 |
| 25% | 50.3531 | -114.4871 | 2007-06-18 08:55:00 | 0.02 | 0.005 | 0.006 |
| 50% | 52.08902 | -113.4421 | 2016-03-15 09:30:00 | 0.5 | 0.05 | 0.05 |
| 75% | 54.01297 | -112.4759 | 2020-08-04 12:30:00 | 8.5 | 0.2 | 0.3 |
| max | 58.44722 | -110.0297 | 2024-12-12 14:45:00 | 1330000.0 | 400.0 | 400.0 |
| std | 2.508057 | 1.9937 | | 1787.225 | 1.851364 | 1.648276 |
