MOSAIKS Modeling Repository

Purpose

This Modeling repository is the next repository a user should review after generating features from satellite imagery with the Featurization repository. It contains code for using the features created there, or features downloaded from the MOSAIKS API. The Random Convolutional Featurization (RCF) process conducted in the Featurization repository produces features that are agnostic to the task a user is interested in modeling. In this repository, the MOSAIKS team has focused on using RCF to predict [insert final interest variables here], but the same features could be used to predict forest cover, population, or many other variables visible from space.

The key element in building a model is the ground truth data, which the user must supply. This data should be a standard tabular data frame that includes a spatial component as columns (latitude and longitude), because it is joined to the feature data spatially. In this repository, the MOSAIKS team pairs feature data with agricultural survey data for Zambia. Due to data availability and privacy restrictions, this agricultural data is not provided with these notebooks. Regardless, the code and documentation in this repository are designed to guide the user through spatially joining the features to their own ground truth data, executing the linear regression step to train the model, applying the trained model to "out-of-bag" data, and statistically analyzing the results.

Datasets

No feature data or crop data is hosted directly in this repository. To request access to the Zambia feature data files, please contact the MOSAIKS team (individual GitHub accounts with contact information are included at the end of the main organization README here). Below we describe the feature data and agricultural data as they are used in this repository. A user can apply the same workflow to their own ground truth data and contact the MOSAIKS team with any questions regarding data substitution.

1. Features

Random Convolutional Features are a means of encoding geospatial locations with information drawn from satellite imagery. These features capture a broad range of detail, such as the color palettes of landscapes, the delineation between colors (e.g., the boundaries between fields, forests, or buildings that are visible from space), and color combinations (e.g., blue next to green). In a feature data frame, each row corresponds to an image and each column corresponds to a feature. Each cell contains a numerical value for that feature at that specific location. During the modeling process, these values are statistically related to the agricultural data employed by the MOSAIKS team, or to other relevant data provided by the user.
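As a rough illustration of the layout described above, the following sketch builds a tiny synthetic feature data frame. The column names (`lat`, `lon`, `feature_0`, ...) and all values are assumptions made for this example, not the actual schema of the MOSAIKS feature files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_images, n_features = 5, 4  # real feature files typically have far more of both

# Hypothetical image-centre coordinates (roughly within Zambia's bounding box).
coords = pd.DataFrame({
    "lat": rng.uniform(-17.0, -9.0, n_images),
    "lon": rng.uniform(22.0, 33.0, n_images),
})

# One numeric column per random convolutional feature; values are synthetic here.
feature_cols = [f"feature_{i}" for i in range(n_features)]
features = pd.concat(
    [coords, pd.DataFrame(rng.random((n_images, n_features)), columns=feature_cols)],
    axis=1,
)

# Rows are images; columns are lat, lon, and one column per feature.
print(features.shape)
print(features.head())
```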

Random Convolutional Features can either be generated with the Featurization repository in this organization or downloaded from the MOSAIKS API. For further information on featurization and the MOSAIKS pipeline, readers are encouraged to refer to this paper by Rolf et al. (2021).

2. Agricultural data (labels)

Due to data distribution restrictions, the survey enumeration area (SEA) level agricultural data used in our model is not available for public download. Coarse-resolution, province-level crop data for 1987 to 2017 is available for download at the Zambia Data Portal.

The following is the complete list of variables selected from the preprocessed Zambian agricultural survey dataset. These variables were used as target variables in the modeling process.

| Variable name | Description | Data type |
| --- | --- | --- |
| sea_unq | SEA unique identifier | string |
| year | Date | date |
| total_area_planted_ha | Area planted in hectares | numeric |
| total_area_harv_ha | Total area harvested in hectares | numeric |
| total_area_lost_ha | Total area of crops lost in hectares | numeric |
| total_harv_kg | Total kilograms harvested | numeric |
| yield_kgha | Total kilograms maize per area planted maize | numeric |
| frac_area_harv | Total area harvested divided by total area planted in hectares | numeric |
| frac_area_loss | Total area lost divided by total area planted in hectares | numeric |
| area_lost_fire | Total area lost due to fire in hectares | numeric |
| maize | Total maize harvested in kilograms | numeric |
| groundnuts | Total groundnuts harvested in kilograms | numeric |
| mixed_beans | Total mixed beans harvested in kilograms | numeric |
| popcorn | Total popcorn harvested in kilograms | numeric |
| sorghum | Total sorghum harvested in kilograms | numeric |
| soybeans | Total soybeans harvested in kilograms | numeric |
| sweet_potatoes | Total sweet potatoes harvested in kilograms | numeric |
| bunding | Total area tilled using bunding in hectares | numeric |
| frac_loss_drought | Drought loss divided by area planted in hectares | numeric |
| frac_loss_flood | Flood loss divided by area planted in hectares | numeric |
| frac_loss_animal | Animal loss divided by area planted in hectares | numeric |
| frac_loss_pests | Pest loss divided by area planted in hectares | numeric |
| frac_loss_soil | Soil loss divided by area planted in hectares | numeric |
| frac_loss_fert | Fertilizer loss divided by area planted in hectares | numeric |
| prop_till_plough | Area ploughed divided by area planted in hectares | numeric |
| prop_till_ridge | Area ridged divided by area planted in hectares | numeric |
| prop_notill | Area not tilled divided by area planted in hectares | numeric |
| prop_hand | Area not hand tilled divided by area planted in hectares | numeric |
| monocrop | Area monocropped in hectares | numeric |
| mixed_crop | Area mixed cropped in hectares | numeric |
| prop_mono | Proportion of area monocropped | numeric |
| prop_mixed | Proportion of area mixed cropped | numeric |
| log_maize | Log of total kilograms maize per area maize | numeric |
| log_sweetpotatoes | Log of total kilograms sweet potatoes per area sweet potatoes | numeric |
| log_groundnuts | Log of total kilograms groundnuts per area groundnuts | numeric |
| log_soybeans | Log of total kilograms soybeans per area soybeans | numeric |
| loss_ind | Binary area loss indicator | numeric |
| drought_loss_ind | Binary drought area loss indicator | numeric |
| flood_loss_ind | Binary flood area loss indicator | numeric |
| animal_loss_ind | Binary animal area loss indicator | numeric |
| pest_loss_ind | Binary pest area loss indicator | numeric |
| geometry | Geospatial geometry of polygon | polygon |

These data were collected by the Central Statistics Office of Zambia (CSO). They are forecasts based on survey data collected in May, preceding the harvest season (July-August). The forecast model is run in Stata, a general-purpose statistical software package, and adjusted with post-harvest survey data. Because the modeling process and its parameters are unknown, the forecast data carry an unknown degree of uncertainty. The crop data are at SEA-level spatial resolution; higher-resolution data can be used and will likely improve model performance.
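Several of the listed variables are ratios or indicators derived from other columns. The sketch below shows one plausible way such columns could be computed with pandas. It is inferred from the variable descriptions only, the values are synthetic, and it is not the team's preprocessing code. The third record is deliberately inconsistent (harvested area with zero planted area) to show how ratio variables can produce infinite or missing values.

```python
import numpy as np
import pandas as pd

# Synthetic SEA-level records; only the column names come from the variable list.
labels = pd.DataFrame({
    "sea_unq": ["a", "b", "c"],
    "total_area_planted_ha": [10.0, 4.0, 0.0],
    "total_area_harv_ha": [8.0, 4.0, 1.5],
    "total_area_lost_ha": [2.0, 0.0, 0.0],
})

planted = labels["total_area_planted_ha"]

# Ratio variables: a zero denominator yields inf (x / 0) or NaN (0 / 0).
labels["frac_area_harv"] = labels["total_area_harv_ha"] / planted
labels["frac_area_loss"] = labels["total_area_lost_ha"] / planted

# Binary indicator: 1 if any area was lost, else 0.
labels["loss_ind"] = (labels["total_area_lost_ha"] > 0).astype(int)

print(labels[["sea_unq", "frac_area_harv", "frac_area_loss", "loss_ind"]])
```

Values like these are one reason a preprocessing step needs to convert infinities to NA before modeling.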

Compute Requirements

Most modern personal computers will be able to run the modeling notebook once access to the user's labeled data is obtained. However, a minimum amount of comfort with Python is needed to use or adapt this code, including installing Python and managing environments.

Using personal computer

A properly configured Python environment can be created from the provided environment.yml file. To build this environment, open a terminal and run

conda env create -f environment.yml

Then activate the environment with

conda activate mosaiks

Finally open this repository in Jupyter Lab by running

jupyter lab

Using tsosie.bren.ucsb.edu

For UCSB students in the Bren School of Environmental Science & Management who are using tsosie.bren.ucsb.edu, installing an environment is more involved, but the general principle is the same. The following code should be run one line at a time.

bash                                                   # this will open bash and allow you to navigate directories more easily
cd <dir with environment file>                         # navigate to the directory with this repository clone
conda env create -f environment.yml                    # create new anaconda environment
conda env list                                         # show available environments
conda activate <env name>                              # activate the new environment
conda install ipykernel                                # install ipykernel into new environment
ipython kernel install --user --name=<name_for_kernel> # create and name the kernel
conda deactivate                                       # deactivate environment

When you open a notebook, the new kernel <name_for_kernel> will be available to use.

Getting Started

There are two primary options for getting started. If you have access to the Master of Environmental Data Science server tsosie.bren.ucsb.edu, that is the preferred platform. Otherwise, you will need to use a personal computer.

1. Usage on Tsosie

The notebooks are currently configured for use on Tsosie, with file paths that lead to persistent data storage containing features and agricultural data. If necessary, the file paths can be adjusted at the top of the notebook to reflect your data directory location.

2. Usage on Personal Compute

To use a personal computer, first clone this repository and configure your environment as described above. Then adjust the file paths at the top of the notebook to reflect your data directory location. This repository maintains the structure of the data subfolders, and it is recommended that you use this structure for your own data. If you are unable to produce your own features, pre-compiled features can be downloaded from the MOSAIKS API.
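As a minimal sketch of what adjusting the file paths might look like, the snippet below defines the data locations in one place with `pathlib`. The directory and file names (`mosaiks/data`, `features`, `labels`, `ground_truth.csv`, `model_output`) are hypothetical placeholders, not the names used in the notebooks.

```python
from pathlib import Path

# Hypothetical root: change DATA_DIR to wherever your data lives.
DATA_DIR = Path.home() / "mosaiks" / "data"

# Hypothetical subfolder and file names; match them to your own data folder.
FEATURE_DIR = DATA_DIR / "features"
LABEL_PATH = DATA_DIR / "labels" / "ground_truth.csv"
OUTPUT_DIR = DATA_DIR / "model_output"

# Report anything missing up front instead of failing midway through a notebook.
for path in (FEATURE_DIR, LABEL_PATH.parent):
    if not path.exists():
        print(f"Missing expected directory: {path}")
```

Keeping every path derived from a single root means only one line has to change when moving between the server and a personal computer.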

Constraints

Although this analysis can generally run on most personal computers, users may still face constraints due to the scope of their analysis, including the number of features, points, and length of time being modeled. These factors can potentially exhaust the resources of a personal computer. Therefore, system monitoring may be necessary, and aggressive memory management strategies may need to be implemented. For instance, users may need to modify existing code chunks to execute computationally heavy code in smaller steps.
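One way to run a heavy computation in smaller steps is to accumulate the quantities a linear model needs chunk by chunk, so the full feature matrix is never held in memory at once. The sketch below does this for the sums `X^T X` and `X^T y` using synthetic data. The function names are made up for this example, and in practice the chunks might come from reading a file in pieces (for instance with the `chunksize` argument of `pandas.read_csv`).

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    """Yield consecutive row blocks of (X, y)."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

def accumulate_normal_equations(chunks):
    """Sum X^T X and X^T y over chunks so only one chunk is in memory at a time."""
    xtx, xty = None, None
    for X_chunk, y_chunk in chunks:
        if xtx is None:
            n_features = X_chunk.shape[1]
            xtx = np.zeros((n_features, n_features))
            xty = np.zeros(n_features)
        xtx += X_chunk.T @ X_chunk
        xty += X_chunk.T @ y_chunk
    return xtx, xty

# Synthetic stand-in for a large feature matrix and label vector.
rng = np.random.default_rng(0)
X_full = rng.random((1000, 8))
y_full = rng.random(1000)

xtx, xty = accumulate_normal_equations(iter_chunks(X_full, y_full, chunk_size=250))
```

The accumulated sums match what a single pass over the full matrix would give, while peak memory scales with the chunk size rather than the number of rows.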

Notebooks and Folders

Starting Notebook for Data Preprocessing & Imputation: [insert preprocessing notebook here]

This notebook serves as a guide for users to preprocess and impute data in preparation for modeling. The following steps are outlined:

  1. Import, process, and merge feature data and agricultural ground truth data. The resolution of the user-supplied data may be lower than that of the feature data. Agricultural data used in this MOSAIKS pipeline application pertains to the SEA-level.
  2. Process NA values (convert infinity values to NA, impute NA values, and drop NA values).
  3. Perform statistical and visual data checks during the process.
  4. Export processed and merged geodataframe to be utilized in subsequent modeling stages.
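The steps above can be sketched with pandas on synthetic data. This is an illustration under stated assumptions, not the notebook's code: it assumes each feature row has already been tagged with the SEA it falls in (the spatial join itself would typically be done with a tool such as `geopandas.sjoin`), and it uses column-mean imputation for features, which is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Synthetic image-level features, already tagged with their SEA (assumption).
features = pd.DataFrame({
    "sea_unq": ["a", "a", "b", "b", "c"],
    "feature_0": [0.1, 0.3, 0.5, np.inf, 0.9],
    "feature_1": [1.0, 3.0, 2.0, 4.0, np.nan],
})
labels = pd.DataFrame({
    "sea_unq": ["a", "b", "c", "d"],
    "yield_kgha": [1500.0, 1200.0, np.nan, 900.0],
})
feature_cols = ["feature_0", "feature_1"]

# Step 1: aggregate features to the SEA level, then merge with the labels.
sea_features = features.groupby("sea_unq", as_index=False)[feature_cols].mean()
merged = labels.merge(sea_features, on="sea_unq", how="inner")

# Step 2: convert infinities to NA, impute feature NAs with the column mean,
# and drop rows whose label is still missing.
merged[feature_cols] = merged[feature_cols].replace([np.inf, -np.inf], np.nan)
merged[feature_cols] = merged[feature_cols].fillna(merged[feature_cols].mean())
merged = merged.dropna(subset=["yield_kgha"]).reset_index(drop=True)

# Step 3: a quick statistical check before exporting.
print(merged.describe())
```

SEA `d` drops out of the merge because it has no features, and SEA `c` is dropped because its label is missing, leaving two complete rows for modeling.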

Secondary Notebook for Modeling: [insert modeling notebook here]

This notebook serves as a guide for users to model after data preprocessing and imputation:

  1. Split the data into distinct train and test sets so that overfitting can be detected and performance is estimated on data the model has not seen.
  2. Train the model, produce prediction maps, and visually analyze the model's performance.
  3. Statistically analyze the model's performance with residual maps and metrics such as Pearson's r and the R^2 coefficient.
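The modeling steps can be sketched end to end with NumPy on synthetic data. Note the assumptions: this README only says "linear regression", so the ridge penalty used here is a choice made for this example. The 80/20 random split and the penalty value `lam` are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the merged data: 200 locations, 20 features.
n, k = 200, 20
X = rng.normal(size=(n, k))
true_w = rng.normal(size=k)
y = X @ true_w + rng.normal(scale=0.5, size=n)

# Step 1: shuffle the rows and hold out 20% as a test set.
order = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Step 2: fit ridge regression in closed form on centred training data.
lam = 1.0  # hypothetical penalty; in practice chosen by cross-validation
x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
Xc = X_train - x_mean
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ (y_train - y_mean))

def predict(X_new):
    """Apply the fitted coefficients, restoring the training means."""
    return (X_new - x_mean) @ w + y_mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Step 3: evaluate on the held-out rows.
y_pred = predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_r = np.corrcoef(y_test, y_pred)[0, 1]
residuals = y_test - y_pred  # these could be mapped back to locations
print(f"test R^2 = {test_r2:.2f}, Pearson r = {test_r:.2f}")
```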

Data folder

This folder and its subfolders are largely placeholders for where to put data when working on a local computer.

Future Work

As previously stated, the MOSAIKS pipeline is task-agnostic, meaning that it can be applied to any spatial data of interest, allowing for the training of a model and prediction of both temporal and spatial outcomes. Each model's performance is assessed using Pearson's r and R^2 metrics. For our purposes, we use a validation R^2 score of 0.4 as the minimum threshold for useful predictive performance. Below is a testing summary for each of the aforementioned variables that meet this minimum threshold:

| Variable name | Training R^2 | Validation R^2 | Pearson's r | Testing R^2 |
| --- | --- | --- | --- | --- |
| total_area_harv_ha | 0.71 | 0.46 | 0.69 | 0.63 |
| total_area_lost_ha | 0.75 | 0.50 | 0.72 | 0.58 |
| total_harv_kg | 0.86 | 0.45 | 0.71 | 0.37 |
| yield_kgha | 0.74 | 0.62 | 0.80 | 0.58 |
| frac_area_harv | 0.64 | 0.46 | 0.71 | 0.2 |
| frac_area_loss | 0.64 | 0.46 | 0.71 | 0.2 |
| maize | 0.77 | 0.61 | 0.79 | 0.6 |
| groundnuts | 0.52 | 0.44 | 0.67 | 0.33 |
| log_maize | 0.77 | 0.71 | 0.84 | 0.65 |
| log_groundnuts | 0.58 | 0.42 | 0.66 | 0.38 |
| prop_till_plough | 0.78 | 0.71 | 0.85 | 0.85 |
| prop_till_ridge | 0.77 | 0.54 | 0.74 | 0.61 |
| prop_mono | 0.9 | 0.56 | 0.76 | 0.68 |
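The metrics in this table are not defined in this README. Assuming the standard definitions, with observed values y_i, predictions \hat{y}_i, and their respective means, the coefficient of determination and Pearson's correlation coefficient are:

```latex
R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},
\qquad
r = \frac{\sum_i \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
         {\sqrt{\sum_i \left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_i \left(\hat{y}_i - \bar{\hat{y}}\right)^2}}
```

Under these definitions, R^2 penalizes systematic bias and scale errors in the predictions, whereas r measures only linear association, so for the same set of observations and predictions R^2 cannot exceed r^2.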

The model performance for the following indicator variables is assessed using False Positive Rate (FPR) and AUC-ROC metrics:

| Variable name | FPR | AUC-ROC |
| --- | --- | --- |
| loss_ind | 0 | 0.84 |
| drought_loss_ind | 0 | 0.79 |
| flood_loss_ind | 0 | 0.46 |
| animal_loss_ind | 0 | 0.42 |
| pest_loss_ind | 0 | 0.42 |
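For readers unfamiliar with these two metrics, the sketch below computes them from scratch with NumPy on a small synthetic example. It assumes the standard definitions (FPR as false positives over all true negatives, AUC-ROC as the probability that a random positive is scored above a random negative). The 0.5 classification threshold is an assumption made for this example, not a documented setting of the notebooks.

```python
import numpy as np

def false_positive_rate(y_true, y_pred_label):
    """FP / (FP + TN): share of true negatives that were flagged as positive."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred_label = np.asarray(y_pred_label).astype(bool)
    negatives = ~y_true
    return np.sum(y_pred_label & negatives) / np.sum(negatives)

def auc_roc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Synthetic labels and model scores.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_label = [s >= 0.5 for s in y_score]  # hypothetical 0.5 threshold

fpr = false_positive_rate(y_true, y_label)
auc = auc_roc(y_true, y_score)
print(f"FPR = {fpr:.2f}, AUC-ROC = {auc:.2f}")
```

An AUC-ROC of 0.5 corresponds to ranking positives and negatives no better than chance. FPR depends on the chosen threshold and is 0 whenever no true negative is flagged, which includes the case where a model never predicts the positive class.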

The performance of these models demonstrates that this technique can be used to predict the above variables. Many of the variables that did not meet the performance threshold also had known, significant data quality issues. Further testing is recommended to better understand how feasible it is to predict specific variables with these techniques.

Although these values demonstrate robust predictive performance compared to other time-series models, there remains ample opportunity for further refinement. To enhance the training dataset and improve model performance, consider the following recommendations:

  • Increase the number of training samples by incorporating more years of satellite data. This can be accomplished by extracting features from additional satellites, such as Landsat 5, which extends further back in time.

  • If agricultural data beyond 2022 becomes available, incorporate these additional years so that the model can capture more recent trends.

  • Investigate if the agricultural predictions for Zambia prior to 2022 can be used to detect agricultural fluctuations resulting from known climatic anomalies like drought. A thorough evaluation of the model's predictive ability for these years can reveal significant correlations with precipitation and temperature data, further enhancing the tool's accuracy. By achieving this, the model can be utilized by governments, community leaders, farmers, and food security initiatives to anticipate future [insert appropriate] for Zambia. The tool's potential was already demonstrated by generating predictions for all years used to train the model. Ultimately, the goal is to provide more foresight into agricultural yields and indicators before harvest, enabling farmers and leaders to adjust crop imports, exports, and costs accordingly.

  • Produce a report presenting a correlational analysis between estimated agricultural yields/indicators and high-resolution, publicly available climate indicators (i.e., temperature and precipitation). This report is distinct from the previous suggestion as it focuses on general climate data rather than anomalies. To ensure accessibility, the report should be made available in all prominent languages of Zambia, to cater to local farmers and leaders who may not be proficient in English.

  • Stack additional Sentinel-2 bands on top of the visible spectrum (bands 2, 3, and 4). Examples include the short-wave infrared combination (bands 12, 8, and 4) and the red edge bands.

  • Use the notebook for 0.01 degree grid cells (an equal-angle grid).

  • Increase the cloud cover limit from 10% to 15% or more.

  • Filter cloud cover at the resolution being featurized (0.01 degree) rather than at the image level.
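To illustrate the last recommendation, the sketch below compares an image-level cloud fraction with per-cell cloud fractions on a synthetic cloud mask. The 40x40 pixel scene, the 10-pixel cell size, and the function name are assumptions made for this example; the real pixel footprint of a 0.01 degree cell depends on the imagery.

```python
import numpy as np

def cell_cloud_fraction(cloud_mask, cell_px):
    """Mean cloud cover within each non-overlapping cell_px x cell_px block."""
    h, w = cloud_mask.shape
    rows, cols = h // cell_px, w // cell_px
    # Trim any partial cells at the edges, then view the image as a grid of blocks.
    trimmed = cloud_mask[: rows * cell_px, : cols * cell_px].astype(float)
    blocks = trimmed.reshape(rows, cell_px, cols, cell_px)
    return blocks.mean(axis=(1, 3))

# Synthetic scene: a 4x4 grid of cells, each 10x10 pixels.
cloud_mask = np.zeros((40, 40), dtype=bool)
cloud_mask[:10, :10] = True      # one fully clouded cell
cloud_mask[10:15, 10:20] = True  # one half-clouded cell

image_level = cloud_mask.mean()
per_cell = cell_cloud_fraction(cloud_mask, cell_px=10)
usable = per_cell <= 0.10  # the same 10% limit, applied per cell

print(f"image-level cloud fraction: {image_level:.3f}")
print(f"cells passing the per-cell limit: {usable.sum()} of {usable.size}")
```

In this example the whole image passes a 10% image-level limit even though two cells are heavily clouded. Filtering per cell removes only those two cells, and by the same logic it would retain the clear cells of an image that fails the image-level limit.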

Contributing

This project was completed on June 9th, 2023. However, we welcome and encourage suggestions for improvements to the code or documentation. Please submit any questions, comments, or code changes via issues or pull requests on either of the repositories. To communicate with the data scientists responsible for this project, please refer to their personal GitHub accounts listed at the bottom of the organization's README and feel free to contact them via email. You may also contact the authors of the MOSAIKS paper, Rolf et al. (2021), with any questions regarding the process.

The agricultural data used in this MOSAIKS approach for Zambia was generously provided by the Kathy Baylis lab at UC Santa Barbara, for which Protensia Hadunka offered exceptional support and contextual information on how this model could be helpful to Zambia's governing bodies. It should be noted that this agricultural data is not publicly available. However, this MOSAIKS pipeline and model are generalizable, and can be applied to any data that can be spatially joined with the feature data.

We express our gratitude to the 2022 CropMOSAIKS team for developing the base code that served as the foundation for the MOSAIKS team to extend the MOSAIKS approach. We would also like to thank Cullen Molitor for providing continuous support to the MOSAIKS team throughout the duration of this project.