In [None]:
#! conda install -y matplotlib pandas numpy

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Separate ML model training data

The respiration rates are bimodally distributed. ML models trained on the whole data set tend to predict values near the mean of the data, which does not help predict whether a site is a hot spot or a cold spot (i.e. which side of the distribution it falls on).

One approach to this issue is the `QuantileRegressor`. By default it minimizes the error with respect to the median (the 50% quantile), whereas `LinearRegression` minimizes the error with respect to the mean. For a bimodal distribution, then, you could train `QuantileRegressor`s on the 25% and 75% quantiles, thus roughly hitting the two peaks of the bimodal distribution.
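To make the mean-versus-quantile distinction concrete, here is a minimal NumPy sketch. It does not use scikit-learn and is not the model used in this notebook. It fits only a constant prediction to a synthetic bimodal sample, and the sample values, the `pinball_loss` helper, and the brute-force search are all assumptions made for illustration.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Mean pinball (quantile) loss of a constant prediction at quantile q."""
    diff = y - pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

# Synthetic bimodal sample (made-up values, not the ICON-ModEx data):
# a "cold" mode near -10 and a "hot" mode near -1000, equally populated.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-10, 3, 500), rng.normal(-1000, 100, 500)])

# Brute-force search for the best constant prediction under each loss
candidates = np.linspace(y.min(), y.max(), 2001)

def best_constant(loss):
    return candidates[np.argmin([loss(c) for c in candidates])]

mean_fit = best_constant(lambda c: np.mean((y - c) ** 2))    # squared error
q25_fit = best_constant(lambda c: pinball_loss(y, c, 0.25))  # 25% quantile
q75_fit = best_constant(lambda c: pinball_loss(y, c, 0.75))  # 75% quantile

print(f"squared error: {mean_fit:.0f}, q=0.25: {q25_fit:.0f}, q=0.75: {q75_fit:.0f}")
```

In this toy case the squared-error fit lands between the two modes, while the two quantile fits land near the hot and cold modes. That only works out so neatly because the synthetic modes are equally populated; with unequal shares, the quantiles would need to be chosen to match.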

However, this requires creating two different ML models and then selecting which is best. We could add different `QuantileRegressor`s to the stacked ensemble of SuperLearner submodels. However, at the end of the day, those models' results are all blended/averaged together; how would those weighted-average results really be any different from the results of a single `LinearRegression`?

Another approach is motivated by the fact that the `combined.metric` used to guide which sites to sample is composed of two things: how different a site is from other sites (PCA analysis) and the estimated error in the prediction at that site (which is strongly correlated with the magnitude of the respiration rate prediction at that site). This means that higher priority (HP) sites tend to be sites where the ML model predicts high respiration rates. Since the model does a reasonable job predicting the order of magnitude of respiration rates, the set of HP sites contains all the respiration rate hot spots along with some cold spots, while the lower priority (LP) sites contain only cold spots.

ML models trained on the HP sites tend to predict the mean value in the middle of the bimodal distribution. However, ML models trained on LP sites, although their scores may be lower (see below), are able to predict a bimodal distribution of respiration rates! This leads to the hypothesis that an ML model trained on cold spots only, when asked to predict the respiration rates at hot spots, can detect that there are two dramatically different types of data points. Even though these hot spot predictions are extrapolative, they appear to be the right order of magnitude.

Here, we want to test this hypothesis explicitly by removing all `N_h` hot spots from the training set and then pulling out an equal number of randomly selected cold spots (`N_c = N_h`). This combined, equally weighted hot/cold spot data set will be the fully independent testing set. The remaining points are used for training the ML model.
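The split described above can be sketched as a small helper. This is an illustrative version under stated assumptions, not necessarily what the notebook ends up running: the function name `split_hot_cold`, the `seed` argument, and the synthetic `demo` table are hypothetical. The `-500.0` threshold is the one used in the cell further down.

```python
import numpy as np
import pandas as pd

def split_hot_cold(df, rate_col, id_col, threshold=-500.0, seed=0):
    """Return (train, test); test holds all hot spots plus an equal number of random cold spots."""
    hot = df[df[rate_col] < threshold]
    cold = df[df[rate_col] >= threshold]
    # N_c = N_h: draw as many cold spots as there are hot spots (fixed seed for repeatability)
    cold_sample = cold.sample(n=len(hot), random_state=seed)
    test = pd.concat([hot, cold_sample])
    # Every row whose ID is not in the testing set is kept for training
    train = df[~df[id_col].isin(test[id_col])]
    return train, test

# Synthetic stand-in for the real table (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Sample_ID": [f"S{i:03d}" for i in range(100)],
    "rate": np.concatenate([rng.normal(-10, 3, 90), rng.normal(-1000, 100, 10)]),
})
train, test = split_hot_cold(demo, rate_col="rate", id_col="Sample_ID")
print(len(train), len(test))
```

Fixing `random_state` makes the cold spot draw repeatable across runs. Note that rows with a missing rate fail both comparisons, so in this sketch they would stay in the training set.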

In [2]:
src_file="ICON-ModEx_Data_Nov-2023.csv"

# Load data
src=pd.read_csv(src_file)



In [11]:
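# Column holding the normalized respiration rate
rate_col = 'Normalized_Respiration_Rate_mg_DO_per_H_per_L_sediment'

# Get the hot spots
hot_spots = src[src[rate_col] < -500.0]

# Check that there are 94 hot spots
print(np.shape(hot_spots))

# Get an equal number (94) of randomly selected cold spots
cold_spots = src[src[rate_col] > -500.0].sample(n=len(hot_spots))

# Join the hot and cold spots into a single testing set
test = pd.concat([hot_spots, cold_spots])

# Get all the Sample_ID for the hot and cold spots
test_ids = test['Sample_ID']

# Use sample ID to drop all the spots from src
train = src[~src['Sample_ID'].isin(test_ids)]

# Write out the remaining rows as training, spots as testing
# (output file names are placeholders, not taken from the original notebook)
train.to_csv(src_file.replace('.csv', '_train.csv'), index=False)
test.to_csv(src_file.replace('.csv', '_test.csv'), index=False)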

(94, 57)


In [13]:
cold_spots

Unnamed: 0,Sample_Kit_ID,Sample_ID,Date,Median_Time_Unix,Median_Time_Local,Local_Timezone,Mean_DO_mg_per_L,DO_sd,Mean_Temp_Deg_C,Temp_sd,...,Canopy_Cover,Macrophyte_Coverage,Algal_Mat_Coverage,Sediment_Collection_Depth_cm,MiniDot_Notes,Additional_Sampling_Notes,Hydrograph_Online,Hydrograph_Other,Water_volume_Flag,Notes
131,CM_052,CM_052-3,2023-03-20,1.679350e+09,15:02:30,Pacific Daylight Time (UTC-7),12.28,0.01,4.67,0.02,...,Partial coverage,Partial coverage,Partial coverage,1-3 cm as described in the protocol,Downstream of coarse wood.,,https://data.neonscience.org/data-products/DP4...,Yes,False,
597,S19S_0069,S19S_0069-D,8/12/2019,,,PDT,,,15.00,,...,Partial direct sunlight (50-80% canopy cover),No,No,,,Upstream site is above newly formed log jam. R...,,,False,
366,SSS015,SSS015-1,2022-08-08,1.659998e+09,14:34:30,Pacific Standard Time (UTC-8),10.24,0.01,19.08,0.08,...,,,Low (5-30%),1-3 cm as described in the protocol,,"Lost chamber, swept away by river when no one ...",,,False,
91,CM_038,CM_038-2,2022-10-31,1.667223e+09,09:27:00,Eastern Daylight Time (UTC-4),6.74,0.23,10.06,0.03,...,Partial coverage,Partial coverage,No coverage,1-3 cm as described in the protocol,,Had trouble with getting the sediment passed f...,https://water.weather.gov/ahps2/hydrograph.php...,,False,
221,CM_083,CM_083-3,2023-08-14,1.692030e+09,11:18:30,Central Daylight Time (UTC-5),7.80,0.11,31.07,0.06,...,Partial coverage,No coverage,Partial coverage,More than 10 cm in some places,~14 cm sensor deployment; site is immediately ...,My apologies - I only have a pic of the place ...,,No,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42,CM_015,CM_015-1,2022-06-06,1.654549e+09,14:53:30,Mountain Daylight Time (UTC-6),7.99,0.03,11.05,0.16,...,Partial coverage,No coverage,No coverage,1-3 cm as described in the protocol,,"Dominant vegetation = willow, aspen, some fir;...",https://waterdata.usgs.gov/monitoring-location...,,False,
702,S19S_0052,S19S_0052-D,8/20/2019,,,EST,,,20.30,,...,Partial direct sunlight (50-80% canopy cover),No,No,,,We remarked the M and D sediment jars to indic...,,,False,
219,CM_083,CM_083-1,2023-08-14,1.692030e+09,11:18:30,Central Daylight Time (UTC-5),7.80,0.11,31.07,0.06,...,Partial coverage,No coverage,Partial coverage,More than 10 cm in some places,~14 cm sensor deployment; site is immediately ...,My apologies - I only have a pic of the place ...,,No,False,
210,CM_080,CM_080-4,2023-03-20,1.679334e+09,12:44:30,Central Daylight Time (UTC-5),2.34,0.11,15.67,0.51,...,Partial coverage,Partial coverage,No coverage,1-3 cm as described in the protocol,,Due to high levels of sediments we prioritized...,https://waterdata.usgs.gov/monitoring-location...,,False,
