# Predicting Agricultural Variables

This notebook contains the modeling approach used by the 2023 MOSAIKS team for its Capstone Project in the Master of Environmental Data Science (MEDS) program at the Bren School of Environmental Science & Management, University of California, Santa Barbara.

This notebook immediately follows the [feature preprocessing](https://github.com/mosaiks-capstone/Modeling/blob/main/feature_preprocessing.ipynb) notebook, which aggregates and joins featurized satellite imagery (generated [here](https://github.com/mosaiks-capstone/Featurization)) with ground-truth data (in our case, Crop Forecast Survey (CFS) data collected by the Zambian Ministry of Agriculture). This notebook regresses each of our outcomes on our features and generates predictions for each outcome. For now, it only supports generating predictions with our 5-fold RidgeCV modeling approach.

To evaluate model performance using different sampling methods, please see the [model_performance_eval](https://github.com/mosaiks-capstone/Modeling/blob/main/model_performance_eval.ipynb) notebook.

## Python modules

In [10]:
import warnings
import time
import os
import random

import dask
from dask.distributed import Client

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import ipywidgets as widgets
from IPython.display import clear_output

import geopandas as gpd
import pyarrow

from IPython.display import display
from joblib import Parallel, delayed
from matplotlib.axes import Axes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.metrics import mean_squared_error, confusion_matrix, r2_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingRegressor
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr
from scipy.linalg import LinAlgWarning
from scipy.stats import pearsonr
from sklearn.utils import check_random_state, resample


import math
import seaborn as sns

## Read in Data

We first read in the aggregated features and ground-truth data that were joined in `feature_preprocessing.ipynb`.

The joined data should have the following form:

| spatial_identifier | year | target_1 | target_2 | feature_1 | feature_2 | feature_3 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2016 | 72 | 13 | 1.23 | 3.25 | 0.123 |
| 2 | 2016 | 50 | 7.5 | 0.78 | 1.2 | 2.4 |



In our case, the unique spatial identifier is `sea_unq`. This layout lets us regress each target (for example, `target_1`) on our features using a linear model of the following form, where $x_{1}, \dots, x_{n}$ are the features:

$y_{1} = \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \dots + \beta_{n}x_{n}$
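
As a minimal sketch of how a linear model of this form can be fit with a ridge penalty (the approach used in the Model Implementation section of this notebook), the snippet below solves the closed-form ridge problem with NumPy on synthetic data. The helper `ridge_coefficients`, the synthetic `X`, `y`, and `true_beta`, and the omission of an intercept are all assumptions made for illustration; the notebook itself relies on sklearn's `RidgeCV`.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y.

    Hypothetical helper for illustration only.
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic, noise-free data generated from known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta

# alpha = 0 reduces to ordinary least squares and recovers true_beta
beta_unpenalized = ridge_coefficients(X, y, alpha=0.0)

# A positive alpha shrinks the coefficient vector toward zero
beta_penalized = ridge_coefficients(X, y, alpha=10.0)

print("OLS coefficients:  ", np.round(beta_unpenalized, 3))
print("Ridge coefficients:", np.round(beta_penalized, 3))
```

The penalty strength `alpha` trades fit against coefficient size, which matters here because the number of features is far larger than the number of observations.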

In [11]:
## Insert path to joined ground data + features
path = "/capstone/mosaiks/repos/modeling/data/model_directory/SEA_averaged_features_simple_impute_mean.csv" ## Your path here

grouped_features = pd.read_csv(path)
grouped_features

Unnamed: 0,year,sea_unq,0_1,0_2,0_3,0_4,0_5,0_6,0_7,0_8,...,log_sweetpotatoes,log_groundnuts,log_soybeans,loss_ind,drought_loss_ind,flood_loss_ind,animal_loss_ind,pest_loss_ind,lat,lon
0,2016.0,1,0.000000,0.000863,0.000783,0.000000,0.000000,0.000000,0.000000,6.157999e-06,...,6.364023,5.935403,6.565149,0.0,0.0,0.0,0.0,0.0,-13.659357,27.807993
1,2016.0,2,0.000069,0.000863,0.000783,0.000000,0.000002,0.000014,0.000047,6.299240e-05,...,6.364023,5.935403,6.565149,0.0,0.0,0.0,0.0,0.0,-13.493902,27.959205
2,2016.0,7,0.001141,0.000863,0.000783,0.000329,0.000000,0.000000,0.000000,1.008277e-03,...,0.689155,5.935403,6.565149,1.0,1.0,0.0,0.0,0.0,-13.772690,28.634660
3,2016.0,9,0.001131,0.000863,0.000783,0.000006,0.000004,0.000010,0.000014,2.590917e-05,...,6.364023,-1.408767,6.565149,1.0,0.0,0.0,0.0,0.0,-12.905428,27.406446
4,2016.0,10,0.001131,0.000863,0.000783,0.000000,0.000000,0.000000,0.000000,3.113844e-07,...,2.525729,3.354421,6.565149,1.0,0.0,0.0,0.0,0.0,-12.962298,27.381719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1364,2021.0,388,0.001131,0.000076,0.000106,0.000075,0.000046,0.000112,0.000144,8.372727e-04,...,9.367183,8.098897,7.336848,0.0,0.0,0.0,0.0,0.0,-14.652084,25.116478
1365,2021.0,389,0.001131,0.000863,0.000018,0.000104,0.000202,0.000424,0.000395,7.513932e-04,...,6.364023,8.048788,6.565149,1.0,1.0,0.0,0.0,0.0,-13.966394,22.794290
1366,2021.0,390,0.000821,0.000345,0.000180,0.000227,0.000306,0.000457,0.000474,9.321970e-04,...,7.863267,8.154788,6.565149,1.0,0.0,0.0,0.0,0.0,-14.240607,23.101535
1367,2021.0,391,0.001131,0.000863,0.000353,0.000323,0.000244,0.000222,0.000311,1.540414e-03,...,6.364023,8.065208,6.565149,1.0,0.0,1.0,0.0,0.0,-16.485957,24.338360


### Select Features and Outcomes

We then select all observations for each of the columns containing the features, and do the same for our outcome/target variables.

In [12]:
# Select features and outcomes:
features = grouped_features.iloc[:, 2:12002]  # adjust to select all feature columns
# .copy() avoids a SettingWithCopyWarning when the dtypes are changed below
outcomes = grouped_features.iloc[:, 12003:12041].copy()  # adjust to select all target/outcome variables

## Ensure each of the binary target variables are of categorical data types
## (astype returns a new Series, so the result must be assigned back)
binary_targets = ["loss_ind", "drought_loss_ind", "pest_loss_ind", "animal_loss_ind", "flood_loss_ind"]
for col in binary_targets:
    outcomes[col] = outcomes[col].astype('category')

# Gut-check
outcomes.head()
features.head()

Unnamed: 0,0_1,0_2,0_3,0_4,0_5,0_6,0_7,0_8,0_9,0_10,...,999_3,999_4,999_5,999_6,999_7,999_8,999_9,999_10,999_11,999_12
0,0.0,0.000863,0.000783,0.0,0.0,0.0,0.0,6.157999e-06,0.000207,0.001568,...,0.060421,1.0,0.274676,1.0,0.115388,0.002708,0.001319,0.002867,0.003866,1.0
1,6.9e-05,0.000863,0.000783,0.0,2e-06,1.4e-05,4.7e-05,6.29924e-05,0.000168,0.001568,...,0.060421,0.939709,0.049106,0.039969,0.004752,0.002671,0.002439,0.002867,0.003866,0.071531
2,0.001141,0.000863,0.000783,0.000329,0.0,0.0,0.0,0.001008277,0.00136,0.002211,...,0.060421,0.006789,1.0,1.0,1.0,0.000517,0.000343,0.000396,0.003866,0.071531
3,0.001131,0.000863,0.000783,6e-06,4e-06,1e-05,1.4e-05,2.590917e-05,0.00011,0.001568,...,0.060421,0.005561,0.006391,0.004212,0.003235,0.001937,0.001683,0.002867,0.003866,0.071531
4,0.001131,0.000863,0.000783,0.0,0.0,0.0,0.0,3.113844e-07,1.2e-05,0.001568,...,0.060421,0.00557,0.006739,0.003991,0.002857,0.001979,0.001435,0.002867,0.003866,0.071531
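
Positional slices such as `iloc[:, 2:12002]` break silently if the column layout of the input file changes. As a hedged alternative, the sketch below selects feature columns by name, assuming that feature columns follow the `<number>_<number>` pattern seen in the output above (for example `0_1` or `999_12`). The toy DataFrame, its values, and the `non_target_cols` set are made up for illustration.

```python
import re

import pandas as pd

# Toy frame mimicking the layout of the joined data (values are made up)
toy = pd.DataFrame({
    "year": [2016, 2016],
    "sea_unq": [1, 2],
    "0_1": [0.0, 0.000069],
    "0_2": [0.000863, 0.000863],
    "999_12": [1.0, 0.071531],
    "log_maize": [6.1, 5.9],
    "loss_ind": [0.0, 1.0],
    "lat": [-13.66, -13.49],
    "lon": [27.81, 27.96],
})

# Feature columns are assumed to be named "<digits>_<digits>"
feature_pattern = re.compile(r"^\d+_\d+$")
feature_cols = [c for c in toy.columns if feature_pattern.match(c)]

# Identifier and coordinate columns that are neither features nor targets (assumed)
non_target_cols = {"year", "sea_unq", "lat", "lon"}
outcome_cols = [
    c for c in toy.columns
    if c not in feature_cols and c not in non_target_cols
]

features_by_name = toy[feature_cols]
outcomes_by_name = toy[outcome_cols]
print(feature_cols)
print(outcome_cols)
```

Selecting by name also makes it easier to confirm that the training and prediction files use the same feature set.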


### Helper function
`calculate_confusion_matrix`:
To evaluate predictions for our binary (categorical) target variables, we use a confusion matrix instead of $R^2$. This function thresholds the continuous predictions (`y_pred`) at a decision boundary (`decision_boundary`) to assign each observation to class 0 or 1, compares the result with the true labels (`y_true`), and returns the counts of true negatives, false positives, false negatives, and true positives.

In [13]:
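def calculate_confusion_matrix(y_true, y_pred, decision_boundary):
    """Return (tn, fp, fn, tp) after thresholding y_pred at decision_boundary."""
    # Assign class 1 to predictions at or above the decision boundary, else class 0
    y_pred_adj = np.where(y_pred >= decision_boundary, 1, 0)
    cm = confusion_matrix(y_true, y_pred_adj)
    if cm.shape == (1, 1):
        # Only one class appears in both the true labels and the thresholded
        # predictions, so every observation is a true negative or a true positive
        if y_true.iloc[0] == 0:
            tn, fp, fn, tp = cm[0, 0], 0, 0, 0
        else:
            tn, fp, fn, tp = 0, 0, 0, cm[0, 0]
    elif cm.shape == (2, 2):
        # Standard binary case: sklearn orders the cells as tn, fp, fn, tp
        tn, fp, fn, tp = cm.ravel()
    else:
        print("Unexpected confusion matrix:")
        print(cm)
        raise ValueError('Unexpected confusion matrix shape.')
    return tn, fp, fn, tp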
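
To illustrate how the choice of decision boundary changes the confusion matrix, here is a small self-contained sketch on made-up labels and scores. It thresholds the scores the same way as the helper above, but passes `labels=[0, 1]` to sklearn's `confusion_matrix` so the result is always 2x2; that is a simplification assumed for this example, not the notebook's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up binary labels and continuous model outputs
y_true = pd.Series([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.4, 0.8, 0.35])

results = {}
for boundary in (0.3, 0.5, 0.7):
    # Predictions at or above the boundary are assigned to class 1
    y_hat = np.where(y_score >= boundary, 1, 0)
    # labels=[0, 1] forces a 2x2 matrix even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    results[boundary] = (int(tn), int(fp), int(fn), int(tp))
    print(f"boundary={boundary}: tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```

Raising the boundary reduces false positives at the cost of more false negatives, which is why several boundaries are evaluated for each binary target.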

### Model Implementation

#### Description

This function uses cross-validated ridge regression to regress our `outcomes` (ground-truth Crop Forecast Survey (CFS) data from the Zambian Ministry of Agriculture) on our `features` (Sentinel-2 satellite imagery featurized using the [MOSAIKS](https://www.nature.com/articles/s41467-021-24638-z) process). We have labelled data at the Survey Enumeration Area (SEA)/year level for 2015-2022, amounting to approximately 1,300 observations before imputation.

The approach employs **5-fold CV using sklearn's RidgeCV**:

- Data is split using `train_test_split` into training and testing sets. The training data is then split again to hold out a validation set.
- A RidgeCV model is trained for each selected target outcome. To choose the penalty coefficient alpha for each model, RidgeCV searches over 75 logarithmically spaced values from $10^{-8}$ to $10^8$.
- The trained model's performance is evaluated on the validation set.
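
The steps above can be sketched on synthetic data as follows. This is an illustrative example under stated assumptions, not the notebook's function: the random `X`, `beta`, and `y`, the sample size, and the number of features are made up, while the split proportions, the alpha grid, and `cv=5` mirror the description.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem (stand-in for features and one target)
rng = np.random.default_rng(50)
X = rng.normal(size=(300, 20))
beta = rng.normal(size=20)
y = X @ beta + rng.normal(scale=0.5, size=300)

# First split: hold out 10% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=50
)
# Second split: hold out 10% of the remaining data as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=50
)

# 75 candidate penalties, log-spaced between 1e-8 and 1e8
alphas = np.logspace(-8, 8, num=75)

# 5-fold cross-validation on the training set chooses alpha
model = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)

val_r2 = model.score(X_val, y_val)
print(f"Selected alpha: {model.alpha_:.3g}")
print(f"Validation R2: {val_r2:.2f}")
```

Note that the cross-validation folds are drawn only from the training portion, so the validation and test sets play no part in choosing alpha.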
    

Before any results are printed, the function prints the parameters selected by the user so that they can be checked. We measure performance on our training, validation, and testing sets primarily using the $R^2$ metric. Below are the arguments passed to our function (in a dictionary).


### Arguments

`args`: a dictionary containing all the arguments needed to run the function `train_and_evaluate_models`.

1. `target_columns` (list of strings): names of the columns in the data that are used as target variables during model training.
2. `test_size` (float, optional): proportion of the data to include in the test split. The default value is 0.1, meaning that 10% of the data is used for testing.
3. `categorical_columns` (list of strings): names of the target columns that are binary (categorical) variables.
4. `decision_boundaries` (list of floats): decision boundaries used to convert continuous predictions into classes. Each boundary is evaluated for every categorical target variable.
5. `sea_ids` (list of integers): unique spatial identifiers of each Survey Enumeration Area (SEA).
6. `validation_size` (float, optional): proportion of the training data to include in the validation split. The default value is 0.1, meaning that 10% of the training data is used for validation.
7. `random_state` (integer, optional): seed used by the random number generator. Setting this value makes the generated splits reproducible. The default value is 1.


In [14]:
# Prepare the arguments as a dictionary
args = {
    'target_columns': outcomes.columns,
    'test_size': 0.1,
    'categorical_columns':['loss_ind','drought_loss_ind', 'flood_loss_ind','animal_loss_ind','pest_loss_ind'],
    'decision_boundaries': [0.3,0.5,0.7],
    'sea_ids': grouped_features['sea_unq'],
    'validation_size' : 0.1,
    'random_state': 50
}

In [15]:
def train_and_evaluate_models(args):
    # Extracting input parameters
    target_columns = args['target_columns']
    test_size = args.get('test_size', 0.1)
    categorical_columns = args['categorical_columns']
    decision_boundaries = args['decision_boundaries']
    sea_ids = args['sea_ids']
    validation_size = args.get('validation_size', 0.1)
    random_state = args.get('random_state', 1)  # default of 1, as documented in the arguments list
    
    # Read the grouped features from a CSV file
    grouped_features = pd.read_csv("/capstone/mosaiks/repos/modeling/data/model_directory/SEA_averaged_features_manual_impute_bfill_modeltrain.csv")

    # Extract the relevant features, outcomes, and year columns
    features = grouped_features.iloc[:, 5:12005]
    outcomes = grouped_features.iloc[:, 12006:]
    year = grouped_features.iloc[:, 0]
    
    # Initialize data structures to store metrics and results
    metrics_df = pd.DataFrame(columns=['target_column', 'train_score', 'val_score', 'pearson_coeff'])
    models = {}
    X_trains = {}
    X_tests = {}
    y_trains = pd.DataFrame()
    y_tests = pd.DataFrame()
    y_year = pd.DataFrame()
    
    # Print the model parameters
    print(f"\nRunning model with the following parameters:")
    print(f"Target columns: {target_columns}")
    print(f"Test size: {test_size}", f"Validation size: {validation_size}")
    print(f"Random State: {random_state}")

    # Iterate over each target column
    for target_column in target_columns:
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(features, outcomes[target_column], test_size=test_size, random_state = random_state)
        
        # Split the training data again to create a validation set
        X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_size, random_state = random_state)
        
        # Store the training and testing data for each target column
        X_trains[target_column] = X_train
        X_tests[target_column] = X_test
        y_trains[target_column] = y_train
        y_tests[target_column] = y_test
        y_year[target_column] = year.loc[y_trains.index]

        # Train a RidgeCV model with cross-validation
        cv = 5
        ridge_cv = RidgeCV(cv=cv, alphas=np.logspace(-8, 8, base=10, num=75))
        ridge_cv.fit(X_train, y_train)
        
        # Store the trained model for each target column
        models[target_column] = ridge_cv
        
        # Make predictions on the training and validation data
        y_val_pred = ridge_cv.predict(X_val)
        y_train_pred = ridge_cv.predict(X_train)

        # Perform evaluation for categorical target columns
        if target_column in categorical_columns:
            for decision_boundary in decision_boundaries:
                # Calculate confusion matrix
                tn, fp, fn, tp = calculate_confusion_matrix(y_val, y_val_pred, decision_boundary)

                # Calculate the false positive rate (NaN if the validation set contains no negatives)
                false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else float('nan')

                # Calculate AUC-ROC
                auc_roc = roc_auc_score(y_val, y_val_pred)

                # Print evaluation metrics for categorical columns
                print(f"Target variable: {target_column} (Categorical)")
                print(f"Decision boundary: {decision_boundary}")
                print(f"False positive rate: {false_positive_rate:0.2f}")
                print(f"AUC-ROC: {auc_roc:0.2f}")
                print()
        else:
            # Calculate Pearson's correlation coefficient
            pearson_coeff, _ = pearsonr(y_val, y_val_pred)

            # Calculate training R squared
            train_r_squared = ridge_cv.score(X_train, y_train)

            # Calculate validation R squared
            val_r_squared = ridge_cv.score(X_val, y_val)
            
            # Append metrics to the metrics DataFrame
            # (DataFrame.append was removed in pandas 2.0, so the row is added with .loc;
            # values follow the column order used when metrics_df was created)
            metrics_df.loc[len(metrics_df)] = [
                target_column, train_r_squared, val_r_squared, pearson_coeff]
                
            # Print evaluation metrics for non-categorical columns
            print()
            print(f"Target variable: {target_column}")
            print(f"Estimated regularization parameter: {ridge_cv.alpha_}")
            print(f"Training R2 performance: {train_r_squared:0.2f}")
            print(f"Validation R2 performance: {val_r_squared:0.2f}")
            print(f"Pearson's correlation coefficient: {pearson_coeff:0.2f}")
            print()

    # Return the collected data and results
    return X_trains, X_tests, y_trains, y_tests, metrics_df, models, y_year


In [None]:
X_trains, X_tests, y_trains, y_tests, metrics_df, models, y_year  = train_and_evaluate_models(args)


Running model with the following parameters:
Target columns: Index(['total_area_harv_ha', 'total_area_lost_ha', 'total_harv_kg',
       'yield_kgha', 'frac_area_harv', 'frac_area_loss', 'area_lost_fire',
       'maize', 'groundnuts', 'mixed_beans', 'popcorn', 'sorghum', 'soybeans',
       'sweet_potatoes', 'bunding', 'monocrop', 'mixture', 'frac_loss_drought',
       'frac_loss_flood', 'frac_loss_animal', 'frac_loss_pests',
       'frac_loss_soil', 'frac_loss_fert', 'prop_till_plough',
       'prop_till_ridge', 'prop_notill', 'prop_hand', 'prop_mono', 'prop_mix',
       'log_maize', 'log_sweetpotatoes', 'log_groundnuts', 'log_soybeans',
       'loss_ind', 'drought_loss_ind', 'flood_loss_ind', 'animal_loss_ind',
       'pest_loss_ind'],
      dtype='object')
Test size: 0.1 Validation size: 0.1
Random State: 50


### Train Set

After training a model for each target variable in `target_columns`, we use these models to generate and store predictions and $R^2$ scores for each target column on our training data. Predictions are clipped at zero before scoring. Our training data has been aggregated by survey enumeration area (SEA) and year, so each of the 436 rows of `y_pred_train` is a prediction for a particular SEA in a particular year.

In [None]:
# Initialize empty dataframes for storing the predicted values and R2 scores
y_pred_train = pd.DataFrame()
r2_train = pd.DataFrame()

# Iterate over the keys in models dictionary
for target_column in models.keys():
    # Get the corresponding trained model for the target column
    model = models[target_column]
    
    # Get the training data for the target column
    X_train_column = X_trains[target_column]
    y_train_column = y_trains[target_column]
    
    # Make predictions for the target column
    y_pred_train_column = np.maximum(model.predict(X_train_column), 0)
    
    # Compute the R2 score for the target column
    r2_train_column = r2_score(y_train_column, y_pred_train_column)
    
    # Store the predicted values and R2 score in their respective DataFrames
    y_pred_train[target_column] = y_pred_train_column
    r2_train[target_column] = [r2_train_column]

In [None]:
y_pred_train.head()
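
The cell above clips predictions at zero with `np.maximum` before computing $R^2$. The sketch below, on made-up numbers, shows how clipping can change the score in either direction: it helps when the target cannot be negative, but it can hurt for targets that legitimately take negative values, such as the log-transformed crop variables (one `log_groundnuts` value in the data preview above is negative). All arrays here are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Case 1: a non-negative target where the raw model predicts below zero
y_true = np.array([0.0, 1.0, 2.0, 3.0])
raw_pred = np.array([-0.5, 1.0, 2.0, 3.0])
clipped_pred = np.maximum(raw_pred, 0)

r2_raw = r2_score(y_true, raw_pred)
r2_clipped = r2_score(y_true, clipped_pred)
print(f"Non-negative target: raw R2 = {r2_raw:.2f}, clipped R2 = {r2_clipped:.2f}")

# Case 2: a log-scale target that can be negative; predictions match it exactly
y_true_log = np.array([-1.4, 2.5, 6.4])
perfect_pred_log = y_true_log.copy()
clipped_pred_log = np.maximum(perfect_pred_log, 0)

r2_log_raw = r2_score(y_true_log, perfect_pred_log)
r2_log_clipped = r2_score(y_true_log, clipped_pred_log)
print(f"Log target: raw R2 = {r2_log_raw:.2f}, clipped R2 = {r2_log_clipped:.2f}")
```

If this matters for a given analysis, one option is to apply the clipping only to targets known to be non-negative.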

### Visualize Performance of Train Set 

We visualize performance on the training set with scatterplots of predicted values versus ground-truth values. Each scatterplot includes a 1:1 reference line (points on the line are perfect predictions) and displays the $R^2$ value for the selected variable.

In [None]:
# Create a list of variable names from the dataframes
variable_names = list(y_pred_train.columns)

# Create the dropdown widget
variable_dropdown = widgets.Dropdown(options=variable_names, description='Variable:')

# create a container widget to hold the dropdown and the plot
container = widgets.VBox(children=[variable_dropdown])

# Create an output widget to display the plot
plot_output = widgets.Output()

# Define a function to update the plot based on the selected variable
def update_plot_train(variable):
    with plot_output:
        clear_output(wait=True)
        # Create the scatterplot
        fig, ax = plt.subplots()
        ax.scatter(y_pred_train[variable], y_trains[variable])
        ax.axline([0, 0], [1, 1], c="k")

        # Extract the R2 value from the r2_train dataframe
        # (.iloc[0] pulls the scalar out of the one-row column so the title shows a number)
        r2_value = r2_train[variable].iloc[0]
        r2_value = round(r2_value, 2)

        # Set the title with the current title as a subtitle and the new title as "Variable: [variable]"
        sub_title = f"Model applied to train data n = {len(y_trains)}, R$^2$ = {r2_value}"
        title = f"Variable: {variable}"
        plt.title(sub_title, fontsize=12, y=1.0, loc='left')
        plt.title(title, fontsize=14, y=1.15, loc='center')

        # Set x and y axis labels
        ax.set_xlabel("Predicted", fontsize=15)
        ax.set_ylabel("Ground Truth", fontsize=15)

        # Display the plot
        plt.show()

# Define a function to update the dropdown options when the variable names change
def update_dropdown_options(change):
    variable_dropdown.options = variable_names

# Call the update_plot_train function with the initial value of the dropdown
update_plot_train(variable_dropdown.value)

# Register the event handler to update the dropdown options
variable_dropdown.observe(update_dropdown_options, 'options')

# Set up the interaction between the dropdown and the plot
def dropdown_eventhandler(change):
    variable = change.new
    update_plot_train(variable)

variable_dropdown.observe(dropdown_eventhandler, 'value')

# Display the dropdown and the plot
display(widgets.VBox([variable_dropdown, plot_output]))

### Test Set 

Next, we use these models to generate and store predictions and $R^2$ scores for each target column on our testing data. Again, our testing data has been aggregated by survey enumeration area (SEA) and year, so each row of `y_pred_test` is a prediction for a particular SEA in a particular year.

In [None]:
# Initialize empty DataFrames for storing the predicted values and R2 scores
y_pred_test = pd.DataFrame()
r2_test = pd.DataFrame()

# Iterate over the keys in models dictionary
for target_column in models.keys():
    # Get the corresponding trained model for the target column
    model = models[target_column]
    
    # Get the testing data for the target column
    X_test_column = X_tests[target_column]
    y_test_column = y_tests[target_column]
    
    # Make predictions for the target column
    y_pred_test_column = np.maximum(model.predict(X_test_column), 0)
    
    # Compute the R2 score for the target column
    r2_test_column = r2_score(y_test_column, y_pred_test_column)
    
    # Store the predicted values and R2 score in their respective DataFrames
    y_pred_test[target_column] = y_pred_test_column
    r2_test[target_column] = [r2_test_column]

In [None]:
y_pred_test
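
Once both `r2_train` and `r2_test` exist, it can help to view them side by side, since a large gap between training and testing $R^2$ suggests overfitting. The sketch below assumes both objects are one-row DataFrames with one column per target, as built in the cells above. The numbers are placeholders for illustration, not results from this analysis.

```python
import pandas as pd

# Placeholder one-row score tables (values are made up)
r2_train = pd.DataFrame({"yield_kgha": [0.80], "log_maize": [0.75]})
r2_test = pd.DataFrame({"yield_kgha": [0.40], "log_maize": [0.35]})

# Stack the two tables, label the rows, and transpose to one row per target
summary = (
    pd.concat([r2_train, r2_test], keys=["train", "test"])
    .droplevel(1)
    .T
)

# Difference between training and testing performance for each target
summary["gap"] = summary["train"] - summary["test"]
print(summary.sort_values("gap", ascending=False))
```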

### Visualize Performance of Test Set 

In [None]:
# Create a list of variable names from the dataframes
variable_names = list(y_pred_test.columns)

# Create the dropdown widget
variable_dropdown = widgets.Dropdown(options=variable_names, description='Variable:')

# Create a container widget to hold the dropdown and the plot
# (defined after the dropdown so it wraps the new widget, not the one from the train cell)
container = widgets.VBox(children=[variable_dropdown])

# Create an output widget to display the plot
plot_output = widgets.Output()


# Define a function to update the plot based on the selected variable
def update_plot_test(variable):
    with plot_output:
        clear_output(wait=True)
        # Create the scatterplot
        fig, ax = plt.subplots()
        ax.scatter(y_pred_test[variable], y_tests[variable])
        ax.axline([0, 0], [1, 1], c="k")

        # Extract the R2 value from the r2_test dataframe
        # (.iloc[0] pulls the scalar out of the one-row column so the title shows a number)
        r2_value = r2_test[variable].iloc[0]
        r2_value = round(r2_value, 2)

        # Set the title with the current title as a subtitle and the new title as "Variable: [variable]"
        sub_title = f"Model applied to test data n = {len(y_tests)}, R$^2$ = {r2_value}"
        title = f"Variable: {variable}"
        plt.title(sub_title, fontsize=12, y=1.0, loc='left')
        plt.title(title, fontsize=14, y=1.15, loc='center')

        # Set x and y axis labels
        ax.set_xlabel("Predicted", fontsize=15)
        ax.set_ylabel("Ground Truth", fontsize=15)

        # Display the plot
        plt.show()

# Define a function to update the dropdown options when the variable names change
def update_dropdown_options(change):
    variable_dropdown.options = variable_names

# Call the update_plot_test function with the initial value of the dropdown
update_plot_test(variable_dropdown.value)

# Register the event handler to update the dropdown options
variable_dropdown.observe(update_dropdown_options, 'options')

# Set up the interaction between the dropdown and the plot
def dropdown_eventhandler(change):
    variable = change.new
    update_plot_test(variable)

variable_dropdown.observe(dropdown_eventhandler, 'value')

# Display the dropdown and the plot
display(widgets.VBox([variable_dropdown, plot_output]))

### Apply Model to Ungrouped SEA Features

In [None]:
features_sea_ungrouped = pd.read_feather("/capstone/mosaiks/repos/modeling/data/model_directory/SEA_ungroup_features_simple_impute_mean.feather")

In [None]:
features_sea = features_sea_ungrouped.iloc[:, 2:12002]

In [None]:
# Initialize an empty DataFrame for storing the predicted values
y_pred_sea = pd.DataFrame()

# Iterate over the keys in models dictionary
for target_column in models.keys():
    # Get the corresponding trained model for the target column
    model = models[target_column]
    
    # Make predictions for the target column
    y_pred_sea_column = np.maximum(model.predict(features_sea), 0)
    
    # Store the predicted values in the predictions DataFrame
    y_pred_sea[target_column] = y_pred_sea_column

In [None]:
# Select the columns from features
selected_columns_sea = features_sea_ungrouped[['lat', 'lon', 'year']]

# Concatenate selected_columns with y_preds
sea_preds = pd.concat([selected_columns_sea, y_pred_sea], axis=1)

# Display the combined dataframe
sea_preds

In [None]:
sea_preds.to_feather("/capstone/mosaiks/repos/modeling/data/predictions/SEA_predictions_ungrouped.feather")

### Apply Model to Zambia 10% Data


In [None]:
zambia = pd.read_feather("/capstone/mosaiks/repos/modeling/data/model_directory/zambia_10percent_features_simple_impute_modelpredict.feather")

In [None]:
zambia_features = zambia.iloc[:,2:12002]
zambia_features.head()

In [None]:
# Initialize an empty DataFrame for storing the predicted values
y_pred_zambia = pd.DataFrame()

# Iterate over the keys in models dictionary
for target_column in models.keys():
    # Get the corresponding trained model for the target column
    model = models[target_column]
    
    # Make predictions for the target column
    y_pred_zambia_column = np.maximum(model.predict(zambia_features), 0)
    
    # Store the predicted values in the predictions DataFrame
    y_pred_zambia[target_column] = y_pred_zambia_column

In [None]:
y_pred_zambia.head()

In [None]:
# Select the columns from features
selected_columns_zambia = zambia[['lat', 'lon', 'year']]

# Concatenate selected_columns with y_preds
zambia_preds = pd.concat([selected_columns_zambia, y_pred_zambia], axis=1)

# Display the combined dataframe
zambia_preds
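
One caveat when joining coordinates to predictions: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The cell above works when both frames carry the same default index, which is assumed to be the case for data freshly read from a feather file. The sketch below, using made-up values, shows what happens when the indexes differ and how `reset_index(drop=True)` restores positional alignment.

```python
import pandas as pd

# Coordinates with a non-default index (for example, after filtering rows)
coords = pd.DataFrame(
    {"lat": [-13.6, -13.5], "lon": [27.8, 27.9]}, index=[10, 11]
)
# Predictions built from a NumPy array get a default 0..n-1 index
preds = pd.DataFrame({"yield_kgha": [1.0, 2.0]})

# Index labels do not match, so concat pads both sides with NaN
misaligned = pd.concat([coords, preds], axis=1)

# Resetting the index aligns the rows by position instead
aligned = pd.concat([coords.reset_index(drop=True), preds], axis=1)

print(misaligned)
print(aligned)
```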

In [None]:
zambia_preds.to_feather("/capstone/mosaiks/repos/modeling/data/predictions/zambia_10perc_predictions.feather")

In [None]:
# Read back the SEA predictions saved above (written as a feather file)
sea_preds = pd.read_feather("/capstone/mosaiks/repos/modeling/data/predictions/SEA_predictions_ungrouped.feather")


### Congratulations on completing this analysis!