# Modeling Agricultural Variables (Incomplete)

This notebook contains the modeling approach used by the 2023 MOSAIKS team as part of the University of California, Santa Barbara's Bren School of the Environment & Management Masters of Environmental Data Science (MEDS) Program's Capstone Project. 

This notebook immediately follows the [feature preprocessing](https://github.com/mosaiks-capstone/Modeling/blob/main/feature_preprocessing.ipynb) notebook used to aggregate and join featurized satellite imagery (generated [here](https://github.com/mosaiks-capstone/Featurization)) and ground truth data (in our case, Crop Forecast Survey (CFS) data collected by the Zambian Ministry of Agriculture). Our approach accommodates different sampling methods: bootstrapping and block sampling, used in combination with RidgeCV's 5-fold cross-validation. These additional sampling methods can be used to evaluate model performance. Ensembling the individual bootstrapped models into a final model that generates predictions is incomplete and can be expanded on in future work.

To use our modeling approach to make predictions, please see the `model_predictions` notebook.


## Python modules

In [12]:
import warnings
warnings.filterwarnings('ignore')
import time
import os
import random

import dask
from dask.distributed import Client

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import multiprocessing as mp

import pyarrow

from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.metrics import mean_squared_error, confusion_matrix, r2_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingRegressor
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr
from scipy.linalg import LinAlgWarning
from scipy.stats import pearsonr
from sklearn.utils import check_random_state, resample
from joblib import Parallel, delayed

import math
import seaborn as sns

# Read in Data

We first read in the aggregated features and ground-truth data joined in `feature_preprocessing.ipynb`.

The joined data being read in should take on the following form:

| spatial_identifier | year | target_1 | target_2 | feature1 | feature2 | feature3 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 1 | 2016 | 72 | 13 | 1.23 | 3.25 | 0.123 |
| 2 | 2016 | 50 | 7.5 | 0.78 | 1.2 | 2.4 |



In our case, our unique spatial_identifiers are `sea_unq`. This enables us to regress `target_1` and `target_2` on our features, using the following equation:

$y_{1} = \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \dots + \beta_{n}x_{n}$
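
As a rough illustration of the equation above, the sketch below estimates the coefficients $\beta$ with a ridge penalty using the closed-form solution in NumPy. The toy data, the `ridge_fit` helper, and the omission of an intercept are assumptions made for illustration only; the notebook itself relies on scikit-learn's `RidgeCV`.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Toy data: 50 observations, 3 features, known coefficients (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta  # noiseless target, so alpha=0 recovers true_beta

beta_unpenalized = ridge_fit(X, y, alpha=0.0)
beta_penalized = ridge_fit(X, y, alpha=100.0)  # coefficients shrink toward zero
```

A larger `alpha` trades some bias for lower variance, which matters when the number of features is large relative to the number of observations, as it is here.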

In [13]:
## Insert path to joined ground data + features
path = "/capstone/mosaiks/repos/modeling/data/model_directory/SEA_averaged_features_simple_impute_mean.csv" ## Your path here

grouped_features = pd.read_csv(path)
grouped_features

Unnamed: 0,year,sea_unq,0_1,0_2,0_3,0_4,0_5,0_6,0_7,0_8,...,log_sweetpotatoes,log_groundnuts,log_soybeans,loss_ind,drought_loss_ind,flood_loss_ind,animal_loss_ind,pest_loss_ind,lat,lon
0,2016.0,1,0.000000,0.000863,0.000783,0.000000,0.000000,0.000000,0.000000,6.157999e-06,...,6.364023,5.935403,6.565149,0.0,0.0,0.0,0.0,0.0,-13.659357,27.807993
1,2016.0,2,0.000069,0.000863,0.000783,0.000000,0.000002,0.000014,0.000047,6.299240e-05,...,6.364023,5.935403,6.565149,0.0,0.0,0.0,0.0,0.0,-13.493902,27.959205
2,2016.0,7,0.001141,0.000863,0.000783,0.000329,0.000000,0.000000,0.000000,1.008277e-03,...,0.689155,5.935403,6.565149,1.0,1.0,0.0,0.0,0.0,-13.772690,28.634660
3,2016.0,9,0.001131,0.000863,0.000783,0.000006,0.000004,0.000010,0.000014,2.590917e-05,...,6.364023,-1.408767,6.565149,1.0,0.0,0.0,0.0,0.0,-12.905428,27.406446
4,2016.0,10,0.001131,0.000863,0.000783,0.000000,0.000000,0.000000,0.000000,3.113844e-07,...,2.525729,3.354421,6.565149,1.0,0.0,0.0,0.0,0.0,-12.962298,27.381719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1364,2021.0,388,0.001131,0.000076,0.000106,0.000075,0.000046,0.000112,0.000144,8.372727e-04,...,9.367183,8.098897,7.336848,0.0,0.0,0.0,0.0,0.0,-14.652084,25.116478
1365,2021.0,389,0.001131,0.000863,0.000018,0.000104,0.000202,0.000424,0.000395,7.513932e-04,...,6.364023,8.048788,6.565149,1.0,1.0,0.0,0.0,0.0,-13.966394,22.794290
1366,2021.0,390,0.000821,0.000345,0.000180,0.000227,0.000306,0.000457,0.000474,9.321970e-04,...,7.863267,8.154788,6.565149,1.0,0.0,0.0,0.0,0.0,-14.240607,23.101535
1367,2021.0,391,0.001131,0.000863,0.000353,0.000323,0.000244,0.000222,0.000311,1.540414e-03,...,6.364023,8.065208,6.565149,1.0,0.0,1.0,0.0,0.0,-16.485957,24.338360


### Select Features and Outcomes

We then select all observations for each of the columns containing the features, and do the same for our outcome/target variables.

In [14]:
# Select features and outcomes:
features = grouped_features.iloc[:,2:12002] # adjust to select all feature columns
outcomes = grouped_features.iloc[:,12003:12041].copy() #adjust to select all target/outcome variables

## Ensure each of the binary target variables are of categorical data types
## (astype returns a new Series, so the result must be assigned back)
binary_targets = ["loss_ind", "drought_loss_ind", "pest_loss_ind", "animal_loss_ind", "flood_loss_ind"]
for col in binary_targets:
    outcomes[col] = outcomes[col].astype('category')

# Gut-check
outcomes.head()
features.head()

Unnamed: 0,0_1,0_2,0_3,0_4,0_5,0_6,0_7,0_8,0_9,0_10,...,999_3,999_4,999_5,999_6,999_7,999_8,999_9,999_10,999_11,999_12
0,0.0,0.000863,0.000783,0.0,0.0,0.0,0.0,6.157999e-06,0.000207,0.001568,...,0.060421,1.0,0.274676,1.0,0.115388,0.002708,0.001319,0.002867,0.003866,1.0
1,6.9e-05,0.000863,0.000783,0.0,2e-06,1.4e-05,4.7e-05,6.29924e-05,0.000168,0.001568,...,0.060421,0.939709,0.049106,0.039969,0.004752,0.002671,0.002439,0.002867,0.003866,0.071531
2,0.001141,0.000863,0.000783,0.000329,0.0,0.0,0.0,0.001008277,0.00136,0.002211,...,0.060421,0.006789,1.0,1.0,1.0,0.000517,0.000343,0.000396,0.003866,0.071531
3,0.001131,0.000863,0.000783,6e-06,4e-06,1e-05,1.4e-05,2.590917e-05,0.00011,0.001568,...,0.060421,0.005561,0.006391,0.004212,0.003235,0.001937,0.001683,0.002867,0.003866,0.071531
4,0.001131,0.000863,0.000783,0.0,0.0,0.0,0.0,3.113844e-07,1.2e-05,0.001568,...,0.060421,0.00557,0.006739,0.003991,0.002857,0.001979,0.001435,0.002867,0.003866,0.071531
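
Positional slices such as `iloc[:, 2:12002]` break silently if a column is added or reordered. One possible alternative, sketched below on a made-up frame, is to select feature columns by name. The regular expression, the `id_cols` list, and the toy values are assumptions; they would need to be checked against the real column names before use.

```python
import pandas as pd

# Toy frame mimicking the layout of the joined data (values are made up)
df = pd.DataFrame({
    "year": [2016, 2016],
    "sea_unq": [1, 2],
    "0_1": [0.0, 0.000069],
    "0_2": [0.000863, 0.000863],
    "999_12": [1.0, 0.071531],
    "log_maize": [7.1, 6.9],
    "loss_ind": [0.0, 1.0],
})

# Feature columns are assumed to be named "<digits>_<digits>", e.g. "0_1" or "999_12"
is_feature = df.columns.str.match(r"^\d+_\d+$")
feature_cols = list(df.columns[is_feature])

# Everything that is neither an identifier nor a feature is treated as an outcome
id_cols = ["year", "sea_unq"]
outcome_cols = [c for c in df.columns if c not in feature_cols and c not in id_cols]

features_by_name = df[feature_cols]
outcomes_by_name = df[outcome_cols]
```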


# Helper Functions 

### 1. Confusion Matrix for Categorical Variables
`calculate_confusion_matrix`:
To evaluate the performance of our categorical variables, we need to use a confusion matrix instead of r-squared. This function calculates the confusion matrix for binary classification problems based on the given true labels (`y_true`), predicted values (`y_pred`), and a decision boundary (`decision_boundary`) to assign a class to the binary target. 

In [15]:
# helper function to calculate confusion matrix for categorical variables
def calculate_confusion_matrix(y_true, y_pred, decision_boundary):
    y_pred_adj = np.where(y_pred >= decision_boundary, 1, 0)
    cm = confusion_matrix(y_true, y_pred_adj)
    if cm.shape == (1, 1):
        if y_true.iloc[0] == 0:
            tn, fp, fn, tp = cm[0, 0], 0, 0, 0
        else:
            tn, fp, fn, tp = 0, 0, 0, cm[0, 0]
    elif cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
    else:
        print("Unexpected confusion matrix:")
        print(cm)
        raise ValueError('Unexpected confusion matrix shape.')
    return tn, fp, fn, tp
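
To make the thresholding step concrete, here is a small dependency-light sketch that counts the four confusion-matrix cells directly with NumPy. The `confusion_counts` helper and the toy labels and scores are hypothetical and are not part of the notebook; the sketch assumes labels are coded 0/1.

```python
import numpy as np

def confusion_counts(y_true, y_score, decision_boundary):
    """Threshold continuous scores and count (tn, fp, fn, tp) for 0/1 labels."""
    y_true = np.asarray(y_true).astype(int)
    y_hat = (np.asarray(y_score) >= decision_boundary).astype(int)
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    return tn, fp, fn, tp

y_true = [0, 0, 1, 1, 0]
y_score = [0.1, 0.7, 0.8, 0.3, 0.4]  # e.g. continuous ridge predictions
tn, fp, fn, tp = confusion_counts(y_true, y_score, decision_boundary=0.5)

# Guard against a validation set that contains no negatives
false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else float("nan")
```

Counting the cells explicitly always yields all four values, even when only one class is present, so no special case for a 1x1 matrix is needed.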

### 2. Custom Function for Block Sampling 

Here, we define a helper function to block sample across our training, validation, and testing sets on our unique spatial identifier (`sea_unq`). The `n_seas_to_hold_out_for_val` and `n_seas_to_hold_out_for_test` arguments specify how many values of `sea_unq` to hold out for the validation and test sets. Block sampling allows us to evaluate model performance on unseen spatial areas by holding out all observations for a specified number of unique values in the `sea_unq` column.


In [16]:
def get_train_val_test_indices(sea_ids, n_seas_to_hold_out_for_val, n_seas_to_hold_out_for_test, random_state):
    unique_seas = np.unique(sea_ids)
    np.random.seed(random_state)
    np.random.shuffle(unique_seas)

    # Hold out some SEAs for testing
    test_seas = unique_seas[:n_seas_to_hold_out_for_test]
    remaining_seas = unique_seas[n_seas_to_hold_out_for_test:]

    # Hold out some SEAs for validation
    val_seas = remaining_seas[:n_seas_to_hold_out_for_val]
    train_seas = remaining_seas[n_seas_to_hold_out_for_val:]

    # Convert boolean indices to integer indices
    train_indices = np.where(sea_ids.isin(train_seas))[0]
    val_indices = np.where(sea_ids.isin(val_seas))[0]
    test_indices = np.where(sea_ids.isin(test_seas))[0]

    return train_indices, val_indices, test_indices, train_seas, val_seas, test_seas
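
The sketch below shows the same idea on a toy panel so the key property of block sampling, that no SEA appears in more than one set, can be checked directly. `block_split` is a hypothetical, condensed variant that uses a local random generator instead of the global NumPy seed; the SEA IDs are made up.

```python
import numpy as np
import pandas as pd

def block_split(sea_ids, n_val, n_test, seed):
    """Assign whole SEAs to train/val/test so no SEA appears in two sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(sea_ids))
    test_seas = shuffled[:n_test]
    val_seas = shuffled[n_test:n_test + n_val]
    train_seas = shuffled[n_test + n_val:]

    def rows_for(seas):
        # Positional row indices of all observations belonging to these SEAs
        return np.flatnonzero(sea_ids.isin(seas).to_numpy())

    return rows_for(train_seas), rows_for(val_seas), rows_for(test_seas)

# Toy panel: 6 SEAs observed 3 times each (hypothetical IDs)
sea_ids = pd.Series(np.repeat([1, 2, 7, 9, 10, 12], 3))
train_idx, val_idx, test_idx = block_split(sea_ids, n_val=1, n_test=2, seed=50)

# SEAs present in each set
train_set = set(sea_ids.iloc[train_idx])
val_set = set(sea_ids.iloc[val_idx])
test_set = set(sea_ids.iloc[test_idx])
```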


# Modeling Agricultural Variables using different sampling approaches

## Description 

The `train_and_evaluate_models` function is designed to use cross-validated ridge regression to regress our `outcomes` (ground-truthed Crop Forecast Survey (CFS) data from the Zambian Ministry of Agriculture) on our `features` (Sentinel-2 satellite imagery featurized using the [MOSAIKS](https://www.nature.com/articles/s41467-021-24638-z) process). We have labelled data at the Survey Enumeration Area (SEA)/year level for 2015-2022, amounting to roughly 1300 observations before imputation steps. We are interested in using 4 different sampling approaches to evaluate our models' performance:
1. **5-fold CV using sklearn's RidgeCV**
    - Data is split using `train_test_split` into training and testing sets. The training data is again split into a validation set. A RidgeCV model is trained for each target outcome selected. To choose the penalty coefficient alpha for each model, RidgeCV searches over a logspace of 75 values from $10^{-8}$ to $10^8$. The trained model's performance is evaluated on the validation set. 
    
    - This approach serves as the foundation for our other methods.
    
2. **Bootstrapping + 5-fold CV (incomplete)**
    - In this approach, `n_bootstraps` samples are drawn with replacement from the training + validation data. Each bootstrap sample is split into training and validation subsets, and a RidgeCV model is fit to the training subset, using 5-fold CV internally to choose alpha. Each model is then scored on its validation subset, giving one performance estimate per bootstrap sample. The bootstrapping process is run in parallel. We then average the performance across all bootstrap samples to get a final estimate of model performance. This approach gave us our best results.
    
    - We've marked this method as incomplete because, while we can evaluate model performance, we have yet to ensemble the individual bootstrap models into a single model that can be used to output predictions. We hypothesize that the best approach to this ensemble would be to weight each individual model for each variable by its performance ($R^2$).
    
3. **Block Sampling + 5-fold CV (incomplete)**
    - Block sampling allows us to evaluate model performance on unseen spatial areas by holding out all observations of a specified number of unique values in the unique spatial identifier (here, `sea_unq`) column. This operates similarly to our first approach. The only difference is that, instead of randomly selecting observations to train/validate/test on, we hold out all observations for `n_seas_held_out_val` and `n_seas_held_out_test` distinct values of `sea_unq`. The model can then be evaluated on unseen spatial areas using the validation/test sets.
    
    - This was, unsurprisingly, our worst-performing approach. We noticed high variability in performance depending on which set of SEAs was used for training and testing. Since our targets vary widely across SEAs (some SEAs are much larger than others, lie in very different agricultural regions, etc.), our models were poor overall but tended to perform better on SEAs with representative target values.
    - More exploration can be conducted here, and we mark this as incomplete because we again have yet to complete the infrastructure to make predictions.
    
4. **Bootstrapping, Block Sampling and 5-fold CV (incomplete)**
    - Our last approach was intended to combine bootstrapping and block sampling to improve our bootstrapping performance. The idea for this approach would be to block on unique SEAs, then generate bootstrap samples of our training/validation data with the remaining SEAs using our bootstrapping approach. A model would then be fit to each resample using 5-fold CV. 
    
    - This approach was also incomplete. The infrastructure to split the data as we intended is still unfinished, so we have yet to evaluate model performances using this approach. 
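
The bootstrapping approach and the $R^2$-weighted ensemble hypothesized above could be sketched roughly as follows. This is not the notebook's implementation: the synthetic data, the use of out-of-bag rows as each model's validation set (rather than splitting the bootstrap sample), and the clipping of negative weights to zero are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and one outcome (not the CFS data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=200)

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.1, random_state=50
)
alphas = np.logspace(-8, 8, num=75)  # same penalty grid as described above
n_bootstraps = 5

models, val_scores = [], []
for i in range(n_bootstraps):
    boot_rng = np.random.default_rng(50 + i)  # one seed per bootstrap sample
    idx = boot_rng.choice(len(X_trainval), size=len(X_trainval), replace=True)
    # Rows never drawn ("out-of-bag") serve as this model's validation set
    oob = np.setdiff1d(np.arange(len(X_trainval)), idx)
    model = RidgeCV(alphas=alphas, cv=5).fit(X_trainval[idx], y_trainval[idx])
    models.append(model)
    val_scores.append(model.score(X_trainval[oob], y_trainval[oob]))

# Hypothetical ensemble: weight each model's prediction by its validation R^2
weights = np.clip(val_scores, 0, None)  # ignore models with negative R^2
weights = weights / weights.sum()
ensemble_pred = sum(w * m.predict(X_test) for w, m in zip(weights, models))
```

Using out-of-bag rows keeps duplicated observations from appearing in both the training and validation subsets of a bootstrap sample, which would otherwise inflate the validation score.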
    

Before any results are printed, our function prints out several parameters selected by the user to ensure the proper parameters are being employed. The parameters output by the function before the models run are: the target columns, the validation and test sizes, whether or not bootstrapping and block sampling were used, and the random state used. We measured the accuracy of our training, validation, and testing sets primarily using the $R^2$ metric. Below are the arguments passed to our function (in a dictionary). 
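
Because the function reports both $R^2$ and Pearson's correlation coefficient, it can help to see how the two differ. The sketch below computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand on made-up numbers; the `r_squared` helper is hypothetical and stands in for scikit-learn's `r2_score`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 2.0  # perfectly correlated, but biased by a constant offset

r2 = r_squared(y_true, y_pred)               # -2.2: worse than predicting the mean
pearson = np.corrcoef(y_true, y_pred)[0, 1]  # 1.0: correlation ignores the offset
```

A high Pearson coefficient alongside a low or negative $R^2$ therefore points to systematic bias or scaling error in the predictions rather than a lack of signal.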


### Arguments

`args`: This is a dictionary that contains all the arguments that are necessary to run the function train_and_evaluate_models.

1. `target_columns` (list of strings): contains the names of the columns in the data that are considered as the target variables in the model training process.
2. `test_size` (float, optional): represents the proportion of the data to include in the test split. The default value is 0.1, meaning that 10% of the data will be used for testing.
3. `categorical_columns` (list of strings): contains the names of the columns in the data that are categorical variables.
4. `decision_boundaries` (list of floats): defines the decision boundaries for each categorical target variable. 
5. `sea_ids` (pandas Series of integers): contains the spatial identifier (`sea_unq`) of the Survey Enumeration Area (SEA) for each observation.
6. `validation_size` (float, optional): represents the proportion of the data to include in the validation split. The default value is 0.1, meaning that 10% of the data will be used for validation.
7. `bootstrap` (boolean, optional): indicates whether to use bootstrap sampling in the model training process. The default value is False, meaning that bootstrap sampling is not used by default.
8. `n_bootstraps` (integer, optional): specifies the number of bootstrap samples to use if bootstrap sampling is enabled. It only has an effect if bootstrap is True.
9. `block_sample` (boolean, optional): indicates whether to use block sampling for splitting the data into training, validation, and testing datasets. The default value is False.
10. `random_state` (integer, optional): seed used by the random number generator. Setting this value ensures that the splits that we generate are reproducible. The default value is 1.
11. `n_seas_held_out_val` (integer, optional): specifies the number of Survey Enumeration Areas (SEAs) to hold out for validation. This is only relevant when using block sampling. The default value is 30.
12. `n_seas_held_out_test` (integer, optional): specifies the number of Survey Enumeration Areas (SEAs) to hold out for testing. This is only relevant when using block sampling. The default value is 10.
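13. `features`: DataFrame containing the features we defined at the top of the notebook. The function currently re-reads these from `path` rather than taking them from `args`.
14. `outcomes`: DataFrame containing the outcomes we defined at the top of the notebook. These are likewise re-read inside the function.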
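
One detail worth keeping in mind when filling in the dictionary: `dict.get(key, default)` only falls back to the default when the key is absent, so a key explicitly set to `None` stays `None`. The sketch below illustrates this with a hypothetical `resolve_args` helper; the `DEFAULTS` values mirror the fallbacks used by `train_and_evaluate_models`, but neither name exists in the notebook.

```python
# Fallback values as used by train_and_evaluate_models (helper names are hypothetical)
DEFAULTS = {
    "test_size": 0.1,
    "validation_size": 0.1,
    "bootstrap": False,
    "n_bootstraps": 0,
    "block_sample": False,
    "random_state": 1,
    "n_seas_held_out_val": 30,
    "n_seas_held_out_test": 10,
}

def resolve_args(args):
    """Fill in defaults for keys that are missing or explicitly set to None."""
    resolved = dict(DEFAULTS)
    resolved.update({k: v for k, v in args.items() if v is not None})
    return resolved

user_args = {"block_sample": True, "n_seas_held_out_val": None, "random_state": 50}

# An explicit None is returned as-is; the default of 30 is not applied
raw_value = user_args.get("n_seas_held_out_val", 30)

# The helper treats None the same as a missing key
resolved = resolve_args(user_args)
```

When block sampling is enabled, it is therefore safer to either omit the `n_seas_held_out_*` keys or set them to integers than to leave them as `None`.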

In [30]:
# Prepare the arguments as a dictionary
args = {
    'target_columns': ['frac_area_harv', 'log_maize'], 
    'test_size': 0.1,
    'categorical_columns':[],
    'decision_boundaries': [0.5],
    'sea_ids': grouped_features['sea_unq'],
    'validation_size' : 0.1,
    'bootstrap' : True,
    'n_bootstraps': 10,
    'block_sample': False,
    'n_seas_held_out_val': None,
    'n_seas_held_out_test': None,
    'random_state': 50,
}

In [35]:
def train_and_evaluate_models(args):
    """
    This function accepts an `args` dictionary and performs training and evaluation of RidgeCV regression models. 
    It handles different sampling methods: bootstrap and block sampling and calculates various metrics to evaluate the performance of the models.
    
    Args:
        args (dict): dictionary containing different parameters needed for model training and evaluation.
    Returns:
        tuple: (predictions_df, metrics_df), the validation-set predictions and the performance metrics for each target.
    """

    # Extract values from the args dictionary
    target_columns = args['target_columns']  # The target columns in the dataframe
    test_size = args.get('test_size', 0.1)  # The proportion of the dataset to include in the test set
    categorical_columns = args['categorical_columns']  # The categorical columns in the dataframe
    decision_boundaries = args['decision_boundaries']  # The decision boundaries to use when evaluating models
    sea_ids = args['sea_ids']  # The unique spatial IDs
    validation_size = args.get('validation_size', 0.1)  # The proportion of the training set to include in the validation set
    bootstrap = args.get('bootstrap', False)  # Whether to bootstrap the data or not
    n_bootstraps = args.get('n_bootstraps', 0)  # The number of bootstrap samples to create
    block_sample = args.get('block_sample', False)  # Whether to use block sampling or not
    random_state = args.get('random_state', 1)  # The seed for the random number generator
    n_seas_held_out_val = args.get('n_seas_held_out_val', 30)  # The number of seas to hold out for validation
    n_seas_held_out_test = args.get('n_seas_held_out_test', 10)  # The number of seas to hold out for testing

    # Read in data
    path = "/capstone/mosaiks/repos/modeling/data/model_directory/SEA_averaged_features_simple_impute_mean.csv" ## Your path here
    grouped_features = pd.read_csv(path)
    features = grouped_features.iloc[:,2:12002] # adjust to select all feature columns
    outcomes = grouped_features.iloc[:,12003:12041] #adjust to select all target/outcome variables

    # Initialize empty dataframes to store metrics and predictions
    predictions_df = pd.DataFrame()
    metrics_df = pd.DataFrame(columns=['target_column', 'train_score', 'val_score', 'pearson_coeff', 'fpr', 'roc_auc'])
    
    # Print out our parameters
    print(f"\nRunning model with the following parameters:")
    print(f"Target columns: {target_columns}")
    print(f"Test size: {test_size}", f"Validation size: {validation_size}")
    print(f"Bootstrap: {bootstrap}")
    print(f"Random State: {random_state}")
    if bootstrap:
        print(f"Number of bootstrapped samples: {n_bootstraps}")
    print(f"Block sample: {block_sample}\n")
    if block_sample:
        train_indices, val_indices, test_indices, train_seas, val_seas, test_seas = get_train_val_test_indices(sea_ids, n_seas_to_hold_out_for_val=n_seas_held_out_val, n_seas_to_hold_out_for_test=n_seas_held_out_test, random_state=random_state)
        print(f"Number of seas held out for validation: {n_seas_held_out_val}\n")
        print(f"Training SEAs: {train_seas}")
        print(f"Validation SEAs: {val_seas}")
        print(f"Testing SEAs: {test_seas}")
        
    
    ## Body to generate models for each target column
    for target_column in target_columns: 
        # Block Sampling + Bootstrapping switch for sampling methods
        if bootstrap and block_sample:
            X_train, X_test = features.iloc[train_indices], features.iloc[test_indices]
            y_train, y_test = outcomes[target_column].iloc[train_indices], outcomes[target_column].iloc[test_indices]
        # Bootstrap sampling method switch    
        elif bootstrap:
            # Split the data into training and test sets
            X_train, X_test, y_train, y_test = train_test_split(features, outcomes[target_column], test_size=test_size, random_state=random_state)
        # Block sampling method switch
        elif block_sample:
            # Use the already obtained train_indices, val_indices, and test_indices
            # Create train, validation and test sets using the indices
            X_train, X_val, X_test = features.iloc[train_indices], features.iloc[val_indices], features.iloc[test_indices]
            y_train, y_val, y_test = outcomes[target_column].iloc[train_indices], outcomes[target_column].iloc[val_indices], outcomes[target_column].iloc[test_indices]
        # else just use regular CV     
        else:
            X_train, X_test, y_train, y_test = train_test_split(features, outcomes[target_column], test_size=test_size, random_state = random_state)
            # Split the training data again to create a validation set
            X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_size, random_state = random_state)
        sea_ids_train = sea_ids[X_train.index]  # Update the sea_ids for the training set
        sea_ids_test = sea_ids[X_test.index]  
        
        # Nested function for bootstrap samples
        if bootstrap:
            # Set the random seed for reproducibility
            rng = np.random.default_rng(random_state)
            # Define function for each bootstrap iteration
            def bootstrap_iteration(i):
                # Create a unique seed for this bootstrap iteration
                seed = random_state + i
                # Initialize the random number generator with the unique seed
                rng = np.random.default_rng(seed)
                # Create a bootstrapped training dataset
                indices = rng.choice(len(X_train), size=len(X_train), replace=True)
                X_train_bootstrap = X_train.iloc[indices]
                y_train_bootstrap = y_train.iloc[indices]
                X_train_bootstrap, X_val_bootstrap, y_train_bootstrap, y_val_bootstrap = train_test_split(X_train_bootstrap, y_train_bootstrap, test_size=validation_size, random_state=random_state)
                # To ensure bootstraps are created properly run the following two lines of code:
               # print(f"Bootstrap sample {i}:")
               # print(pd.concat([X_train_bootstrap, y_train_bootstrap], axis=1).head())
                
                # Fit the model to the bootstrapped data
                cv = 5
                ridge_cv = RidgeCV(cv=cv, alphas=np.logspace(-8, 8, base=10, num=75))
                ridge_cv.fit(X_train_bootstrap, y_train_bootstrap)

                # Store the coefficients and validation scores
                train_score = ridge_cv.score(X_train_bootstrap, y_train_bootstrap)
                val_score = r2_score(y_val_bootstrap, ridge_cv.predict(X_val_bootstrap))
                pearson_coeff, _ = pearsonr(y_val_bootstrap, ridge_cv.predict(X_val_bootstrap)) 
                
                # Calculate false positive rate and AUC-ROC if the target variable is categorical
                if target_column in categorical_columns:
                    y_val_pred = ridge_cv.predict(X_val_bootstrap)
                    fpr = 0
                    auc_roc = 0
                    for decision_boundary in decision_boundaries:
                        # Calculate confusion matrix
                        tn, fp, fn, tp = calculate_confusion_matrix(y_val_bootstrap, y_val_pred, decision_boundary)
                        # Calculate the false positive rate
                        fpr += fp / (fp + tn)
                        # Calculate AUC-ROC
                        auc_roc += roc_auc_score(y_val_bootstrap, y_val_pred)
                    fpr /= len(decision_boundaries)  # Get average false positive rate
                    auc_roc /= len(decision_boundaries)  # Get average AUC-ROC
                    return ridge_cv.coef_, train_score, val_score, pearson_coeff, fpr, auc_roc
                else:
                    return ridge_cv.coef_, train_score, val_score, pearson_coeff, None, None

            # Run bootstrap iterations in parallel
            results = Parallel(n_jobs=-1)(delayed(bootstrap_iteration)(i) for i in range(n_bootstraps))
            # Unpack results
            coefs, train_scores, val_scores, pearson_coeffs, false_positive_rates, auc_rocs = zip(*results)
            # Calculate the average coefficients and validation scores
            avg_coefs = np.mean(coefs, axis=0)
            avg_train_score = np.mean(train_scores)
            avg_val_score = np.mean(val_scores)
            avg_pearson_coeff = np.mean(pearson_coeffs)
            
            # Calculate the average false positive rate and AUC-ROC for categorical variables
            if target_column in categorical_columns:
                avg_false_positive_rate = np.nanmean(false_positive_rates)
                avg_auc_roc = np.nanmean(auc_rocs)
                print(f"Average false positive rate: {avg_false_positive_rate:0.2f}")
                print(f"Average AUC-ROC: {avg_auc_roc:0.2f}")
                
            print(f"Target variable: {target_column}")
            print(f"Average training R2 score: {avg_train_score:0.2f}")
            print(f"Average validation R2 score: {avg_val_score:0.2f}")
            print(f"Average Pearson's correlation coefficient: {avg_pearson_coeff:0.2f}")
            print()

        else:
            cv = 5
            ridge_cv = RidgeCV(cv=cv, alphas=np.logspace(-8, 8, base=10, num=75))
            ridge_cv.fit(X_train, y_train)
            
            # Make predictions on the validation data
            y_val_pred = ridge_cv.predict(X_val)
            # Update the predictions DataFrame with the new predictions
            predictions_df[target_column] = y_val_pred
            
            if target_column in categorical_columns:
                for decision_boundary in decision_boundaries:
                    # Calculate confusion matrix
                    tn, fp, fn, tp = calculate_confusion_matrix(y_val, y_val_pred, decision_boundary)
                    # Calculate the false positive rate
                    false_positive_rate = fp / (fp + tn)
                    # Calculate AUC-ROC
                    auc_roc = roc_auc_score(y_val, y_val_pred)
                    print(f"Target variable: {target_column} (Categorical)")
                    print(f"Decision boundary: {decision_boundary}")
                    print(f"False positive rate: {false_positive_rate:0.2f}")
                    print(f"AUC-ROC: {auc_roc:0.2f}")
                    print()

                # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
                # (this records the metrics for the last decision boundary evaluated)
                metrics_df = pd.concat([metrics_df, pd.DataFrame([{
                    'target_column': target_column,
                    'fpr': false_positive_rate,
                    'roc_auc': auc_roc}])], ignore_index=True)
                
            else:
                # Calculate Pearson's correlation coefficient
                pearson_coeff, _ = pearsonr(y_val, y_val_pred)
                # Calculate training R squared
                train_r_squared = ridge_cv.score(X_train, y_train)
                # Calculate validation R squared
                val_r_squared = ridge_cv.score(X_val, y_val)
                # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
                metrics_df = pd.concat([metrics_df, pd.DataFrame([{
                    'target_column': target_column,
                    'train_score': train_r_squared,
                    'val_score': val_r_squared,
                    'pearson_coeff': pearson_coeff}])], ignore_index=True)
                    
                print()
                print(f"Target variable: {target_column}")
                print(f"Estimated regularization parameter: {ridge_cv.alpha_}")
                print(f"Training R2 performance: {train_r_squared:0.2f}")
                print(f"Validation R2 performance: {val_r_squared:0.2f}")
                print(f"Pearson's correlation coefficient: {pearson_coeff:0.2f}")
                print()

    return predictions_df, metrics_df

In [36]:
# The function returns a (predictions_df, metrics_df) tuple, so unpack both
predictions_df, metrics_df = train_and_evaluate_models(args)


Running model with the following parameters:
Target columns: ['frac_area_harv', 'log_maize']
Test size: 0.1 Validation size: 0.1
Bootstrap: True
Random State: 50
Number of bootstrapped samples: 10
Block sample: False

Target variable: frac_area_harv
Average training R2 score: 0.77
Average validation R2 score: 0.54
Average Pearson's correlation coefficient: 0.74

Target variable: log_maize
Average training R2 score: 0.90
Average validation R2 score: 0.72
Average Pearson's correlation coefficient: 0.85

