# Notebook for making predictions with an ensemble of SuperLearner machine learning models.

This Jupyter notebook is an example of how to reuse an archived ML model trained to predict sediment respiration rates. The general concepts apply elsewhere. The core operations are based on `sl_core/predict.py`, available at [this link](https://github.com/parallelworks/sl_core/blob/main/predict.py).

## Dependencies

To use the ML model, you first need the Python packages required to run it. SuperLearners are currently stored in `.pkl` format, which is sensitive to the exact versions of Python and scikit-learn in the active environment. (Future work will move SuperLearners to ONNX format, which is more portable.) The SuperLearner automatically stores `.yaml` files that define its run environment, but to keep those environments lightweight and minimize install time, they do not contain the packages needed to display Jupyter notebooks. This repository therefore contains a `.yaml` file that can be used with the following command:
```
conda env update --name <your-env-name> -f fig07-08-notebook-conda-env.yaml
```

This file was created from an automatically generated SuperLearner environment definition file with the following commands:
```
conda create -y --name superlearner python=3.9
conda activate superlearner
conda env update --name superlearner -f requirements.yaml
conda install -y -c conda-forge requests
conda install -y -c anaconda jinja2
conda install -y -c conda-forge ipykernel
conda env export --name superlearner > fig07-08-notebook-conda-env.yaml
```
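
Because the `.pkl` format is sensitive to the Python and scikit-learn versions, it can help to record what the active environment actually provides before loading a model. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `environment_report` are hypothetical and not part of the SuperLearner code.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(package_names):
    """Collect the Python version and the versions of the named packages."""
    report = {'python': '.'.join(str(part) for part in sys.version_info[:3])}
    for name in package_names:
        report[name] = installed_version(name)
    return report


# Print one line per entry, for comparison against the archived .yaml file.
report = environment_report(['scikit-learn', 'pandas', 'numpy'])
for name, version in report.items():
    print(name, version)
```

The printed versions can be compared by eye with those pinned in the environment file; a mismatch in Python or scikit-learn is a likely cause of `pickle.load` failures.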

The `.yaml` files here are stored as gzipped `.yaml.gz` files because GitHub otherwise reads the `requirements.yaml` files and prints security warnings when they reference out-of-date packages. Since this code is for limited-duration scientific use (i.e., not production), I have ignored these warnings.
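
The gzipped files need to be decompressed before `conda` can read them. A minimal sketch of doing this from Python is shown here; the helper name `gunzip_file` is hypothetical, and running `gunzip` in a shell works equally well.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress gz_path next to itself and return the output path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix('')  # drops the trailing '.gz'
    # Stream the data so large files are not held in memory all at once.
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `gunzip_file('ml_models/sl_0/requirements.yaml.gz')` would write `ml_models/sl_0/requirements.yaml` and leave the compressed file in place.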

In [1]:
import pickle
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler
import sys

## Specify repository and branch to work on

In [36]:
# Using ~ here for $HOME will cause pickle.load failures later
# so must use absolute path.
repo_prefix = '/home/sfgary/tmp/'
repo_name = 'sl-archive-whondrs'
repo_url = 'https://github.com/parallelworks/'+repo_name
branch = 'S19S-SSS-log10-extrap-r02'

# Grab the data and get onto the branch if not already there
! mkdir -p {repo_prefix}
! cd {repo_prefix}; git clone {repo_url}
! cd {repo_prefix}/{repo_name}; git checkout {branch}

fatal: destination path 'sl-archive-whondrs' already exists and is not an empty directory.
M	ml_models/sl_0/requirements.yaml.gz
Already on 'S19S-SSS-log10-extrap-r02'
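
The `fatal: destination path ... already exists` message above is harmless; it only means the repository was cloned on an earlier run. One way to avoid it is to clone only when the working copy is missing. The sketch below assumes `git` is on the `PATH`; the function name `ensure_checkout` and its `run` parameter are hypothetical and exist only for illustration.

```python
import os
import subprocess


def ensure_checkout(repo_prefix, repo_name, repo_url, branch, run=subprocess.run):
    """Clone repo_url under repo_prefix if needed, then check out branch."""
    repo_dir = os.path.join(repo_prefix, repo_name)
    os.makedirs(repo_prefix, exist_ok=True)
    # Only clone when there is no existing working copy.
    if not os.path.isdir(os.path.join(repo_dir, '.git')):
        run(['git', 'clone', repo_url, repo_dir], check=True)
    # 'git -C <dir>' runs the command inside repo_dir without changing directory.
    run(['git', '-C', repo_dir, 'checkout', branch], check=True)
    return repo_dir
```

With the variables defined in the cell above, this would be called as `ensure_checkout(repo_prefix, repo_name, repo_url, branch)`.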


## Load data that will be used to make predictions

In [45]:
# There are two zipped files.  Decompress and load each one
!gunzip -c grdb_step_03b_output_RiverATLAS_v10_na.xyz.7411.csv.gz > grdb_step_03b_output_RiverATLAS_v10_na.xyz.7411.csv
!gunzip -c grdb_step_03b_output_RiverATLAS_v10_na.xyz.7412.csv.gz > grdb_step_03b_output_RiverATLAS_v10_na.xyz.7412.csv

# Do not specify ID as the index when loading the data because
# we will want to remove it later.
list_df = []
list_df.append(pd.read_csv('grdb_step_03b_output_RiverATLAS_v10_na.xyz.7411.csv')) #, index_col='RA_ID'))
list_df.append(pd.read_csv('grdb_step_03b_output_RiverATLAS_v10_na.xyz.7412.csv')) #, index_col='RA_ID'))

# Clean up
!rm grdb_step_03b_output_RiverATLAS_v10_na.xyz.7411.csv
!rm grdb_step_03b_output_RiverATLAS_v10_na.xyz.7412.csv

# Concatenate
predict_df = pd.concat(list_df,axis=0)

# Some river segments are so short they only have one coordinate point.
# Replace any missing (lon2,lat2) with (lon1,lat1) for uniformity.
# Add a very very small displacement to each so all segments have
# very small but non-zero length.
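# Assign the result back rather than using inplace=True on a column
# selection, which newer pandas versions may not propagate to predict_df.
predict_df['lon2'] = predict_df['lon2'].fillna(predict_df['lon1']+0.000001)
predict_df['lat2'] = predict_df['lat2'].fillna(predict_df['lat1']+0.000001)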

# Store the ID, lon, and lat in a separate dataframe for integration later
# These values are NOT used by the ML model and as such should be removed
# from the DF that is going to be used to make predictions.
predict_ixy = pd.DataFrame(columns=['RA_ID','lon1','lat1','lon2','lat2'])
predict_ixy['RA_ID'] = predict_df.pop('RA_ID')
predict_ixy['lon1'] = predict_df.pop('lon1')
predict_ixy['lat1'] = predict_df.pop('lat1')
predict_ixy['lon2'] = predict_df.pop('lon2')
predict_ixy['lat2'] = predict_df.pop('lat2')

# There is exactly one NaN value remaining at one site for the stream
# depth. For generality with other data sets, replace any NaN
# with the mean value of the whole column.
predict_df.fillna(predict_df.mean(), inplace = True)
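
To make the cleaning steps in the cell above easier to follow, here is a minimal sketch of the same three operations on a toy table. The values and the feature column name `depth` are invented for illustration and are not taken from the RiverATLAS files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the input table (all values are made up).
toy_df = pd.DataFrame({
    'RA_ID': [1, 2, 3],
    'lon1': [-120.0, -119.5, -119.0],
    'lat1': [45.0, 45.5, 46.0],
    'lon2': [-119.9, np.nan, -118.9],
    'lat2': [45.1, np.nan, 46.1],
    'depth': [0.5, np.nan, 1.5],
})

# 1. Fill missing end points from the start points plus a tiny offset.
toy_df['lon2'] = toy_df['lon2'].fillna(toy_df['lon1'] + 0.000001)
toy_df['lat2'] = toy_df['lat2'].fillna(toy_df['lat1'] + 0.000001)

# 2. Split the ID and coordinate columns from the model features.
id_cols = ['RA_ID', 'lon1', 'lat1', 'lon2', 'lat2']
toy_ixy = toy_df[id_cols].copy()
toy_features = toy_df.drop(columns=id_cols)

# 3. Replace any remaining NaN with the mean of its column.
toy_features = toy_features.fillna(toy_features.mean())

print(toy_ixy)
print(toy_features)
```

Selecting and dropping columns by name, as in step 2, is an alternative to the repeated `pop` calls above and should give the same split. Separately, `pd.read_csv` infers gzip compression from a `.gz` file extension by default, so the explicit decompress and remove steps are probably optional.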

In [46]:
# Set number of SuperLearner ensemble members
num_sl = 10

# Initialize data frame list to hold output
sl_predict_output_df_list = []

# Loop over all ensemble members
for ll in range(0,num_sl):
    
    print("Working on SuperLearner ensemble member "+str(ll))
    
    # Load the SuperLearner model from .pkl
    model_dir = repo_prefix+repo_name+"/ml_models/sl_"+str(ll)
    sys.path.append(model_dir)
    with open(model_dir+'/SuperLearners.pkl','rb') as file_object:
        superlearner = pickle.load(file_object)

    # Name of the output variable to predict (used by the predict call below)
    predict_var = 'Normalized_Respiration_Rate_mg_DO_per_H_per_L_sediment'

    # OPTIONAL: For a given output variable, list the models:
    #print("Submodels within SuperLearner and their weights:")
    #list_models = list(superlearner[predict_var].named_estimators_.keys())
    #print(list_models)
    
    # OPTIONAL: The following only works for the scipy.optimize.nnls
    # stacking regressor, not the sklearn stacking regressors.
    #print(superlearner[predict_var].final_estimator_.weights_)
    
    # Make predictions
    sl_predict_output_df_list.append(superlearner[predict_var].predict(predict_df))

Working on SuperLearner ensemble member 0
Working on SuperLearner ensemble member 1
Working on SuperLearner ensemble member 2

MemoryError: Unable to allocate 43.3 GiB for an array with shape (86054, 67525) and data type float64
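
The size in the error message follows directly from the array shape: 86054 × 67525 float64 values at 8 bytes each is about 43.3 GiB. One possible mitigation is to predict on smaller blocks of rows and join the results. The sketch below shows the idea; it assumes the large array scales with the number of rows passed to `predict`, which this notebook does not confirm, so it may or may not avoid the error. The helper names `array_gib` and `predict_in_batches` and the default `batch_size` are hypothetical.

```python
import numpy as np


def array_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * bytes_per_value / 2**30


def predict_in_batches(model, X, batch_size=10000):
    """Call model.predict on consecutive row blocks of X and join the results."""
    pieces = []
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        # DataFrames are sliced by position with .iloc, arrays directly.
        block = X.iloc[start:stop] if hasattr(X, 'iloc') else X[start:stop]
        pieces.append(np.asarray(model.predict(block)))
    return np.concatenate(pieces)


# Size of the array named in the error message, in GiB.
print(round(array_gib(86054, 67525), 1))
```

In the loop above, the prediction line could then be written as `predict_in_batches(superlearner[predict_var], predict_df)`, with `batch_size` chosen to fit the available memory.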