# Tracking crops from space:
## Trying to answer the million hectare question

#### Student: Mike Petrut

## Background

Dryland winter cropping refers to the cultivation of mostly commercial broadacre crops, including wheat, barley, canola, lupins and pulses. Dryland crops are not irrigated and depend on rainfall from late autumn through winter and into early spring. This project examines what predictive value remote sensing data can add to crop yield forecasting, using more than 30 years of imagery to track the relationship between vegetation growth and crop performance from 1989 to 2020. There is a growing body of academic research correlating these variables at the field or local regional level (such as this article: [Predicting Wheat Yield at the Field Scale by Combining High-Resolution Sentinel-2 Satellite Imagery and Crop Modelling](https://www.mdpi.com/2072-4292/12/6/1024)), but far less that looks at large areas of land cover where primary ground truth data is not available.

This project will in effect test whether cloud processing of high volumes of remote sensing data can enable predictive insight into crop performance at the broad geographical level where the aggregated production totals are openly available. 

To access the data in this blog, and for instructions on how to set up the environment, clone the repository and run the model, visit [the project GitHub page](https://github.com/mike-petrut/dryland-crop-performance-modelling-project).

### Area of Interest

The area of interest for this project is South Australia (SA). I chose it as the test case because of the quality of the land cover data provided by the SA state government, which breaks out the polygons by cereals and oilseeds. As this project works only with open-source data, it is not possible to accurately classify the different agricultural crop families present across the SA Wheatbelt in the absence of quality ground truth data. Dryland cropping in SA is mostly cereals (wheat, barley and triticale), which removes the potential for noise caused by large areas of oilseeds such as canola, which respond differently from cereals in satellite imagery at different times of year. With this in mind, this iteration of the model focuses only on cereal crops.

#### South Australian Government Land Use Shapefile for Cereals Cropping
![](SA1.JPG)
![](SA2.JPG)




## Data Collection & Processing

### Google Earth Engine

The remote sensing data used in this model is sourced from Google Earth Engine (GEE). I chose the GEE Python API for its speed and its capacity to process a large volume of imagery in the cloud over the 30-year study period. The code creates two collections: Landsat 5 data from 1989 to 1999, and Landsat 7 data from 1999 to 2020.

In the repo, the custom functions and workflow can be found in `earth_engine.py`.
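As a rough illustration of how the study period splits across the two sensors, the sketch below builds one date window per year covering the August to October months used later in the methodology. It is not the code in `earth_engine.py`: the `SENSOR_YEARS` mapping and `season_windows` helper are hypothetical names, and assigning the overlapping year 1999 to Landsat 7 is an assumption made only for this example. Each window uses an exclusive end date, which is the convention Earth Engine's `filterDate` follows.

```python
from datetime import date

# Assumed split of the study period between sensors. The post states
# Landsat 5 for 1989-1999 and Landsat 7 for 1999-2020; giving 1999 to
# Landsat 7 is an assumption for this sketch only.
SENSOR_YEARS = {
    "Landsat 5": range(1989, 1999),
    "Landsat 7": range(1999, 2021),
}


def season_windows(start_month=8, end_month=10):
    """Return (year, sensor, start, end) tuples, one per season.

    The end date is exclusive (first day of the month after end_month).
    This simple version assumes end_month is before December.
    """
    windows = []
    for sensor, years in SENSOR_YEARS.items():
        for year in years:
            start = date(year, start_month, 1)
            end = date(year, end_month + 1, 1)
            windows.append((year, sensor, start, end))
    return windows


windows = season_windows()
for year, sensor, start, end in windows[:3]:
    print(year, sensor, start.isoformat(), end.isoformat())
```

Each tuple could then drive one date-filtered query against the matching image collection.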

### ABARES 

The Australian Bureau of Agricultural and Resource Economics and Sciences (ABARES) publishes annual production and area planted data for all states and crops. I use this data as the historical actuals to calculate yield (production divided by hectares planted).

In the repo, the workflow for downloading, wrangling and formatting the ABARES data can be found in `model_data_setup.py`.
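The yield calculation itself is a simple ratio. The sketch below is a minimal illustration, not the code in `model_data_setup.py`: the `average_yield` helper, the field names and the units (kilotonnes and thousands of hectares) are assumptions, and the figures are made up rather than taken from ABARES.

```python
def average_yield(production_kt, area_kha):
    """Yield in tonnes per hectare.

    Assumes production in kilotonnes and area planted in thousands of
    hectares, so the ratio is already in t/ha.
    """
    if area_kha <= 0:
        raise ValueError("area planted must be positive")
    return production_kt / area_kha


# Hypothetical records for illustration only; these are not ABARES values.
records = [
    {"year": 2001, "production_kt": 4000.0, "area_kha": 2000.0},
    {"year": 2002, "production_kt": 1500.0, "area_kha": 1000.0},
]

yields = {
    r["year"]: average_yield(r["production_kt"], r["area_kha"])
    for r in records
}
print(yields)
```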

Once the feature extraction and formatting of the raw Excel data is complete, we can visualize the historical data using the Python plotting libraries.

#### South Australian Cereal Cropping Production, Area Planted and Average Yield, 1989 - 2020
![](sa_data.jpg)

## Methodology

This model tests the relationship between EVI and NDVI values over the key months of August, September and October and the final harvest crop yield. It is a linear regression model that uses the vegetation index as the exogenous variable (X variable) to predict crop yield (Y variable).
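For readers unfamiliar with the two indices, the sketch below computes them from surface reflectance using their standard published formulas. The function names and the sample reflectance values are hypothetical, and the project itself derives these indices in Earth Engine rather than with this code.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil=1.0):
    """Enhanced Vegetation Index with the commonly used coefficients."""
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + soil)


# Hypothetical reflectance values for a healthy crop pixel (scale 0-1).
nir, red, blue = 0.5, 0.1, 0.05

ndvi_value = ndvi(nir, red)
evi_value = evi(nir, red, blue)
print(round(ndvi_value, 3), round(evi_value, 3))
```

EVI adds a blue-band correction and a soil adjustment term, which makes it less prone than NDVI to saturating over dense canopy.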

The model is expressed as a standard linear regression equation:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>y</mi>
  <mo>=</mo>
  <mi>a</mi>
  <mo>+</mo>
  <mi>b</mi>
  <mi>x</mi>
  <mo>+</mo>
  <mtext>error</mtext>
</math>

where: <br>
y is the crop yield, <br>
x is the vegetation index, <br>
a is the intercept, and <br>
b is the slope
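To make the equation concrete, here is a minimal ordinary least squares fit written in plain Python. It is a sketch under stated assumptions, not the project's model code: `fit_line` is a hypothetical helper, and the index and yield values are invented for illustration, not taken from the study.

```python
def fit_line(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns the intercept a, the slope b and the R-squared of the fit.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, r_squared


# Hypothetical seasonal mean index values and yields (t/ha), illustration only.
index_values = [0.2, 0.3, 0.4, 0.5]
yield_values = [1.0, 1.5, 1.4, 2.1]

a, b, r_squared = fit_line(index_values, yield_values)
print(f"a={a:.3f}, b={b:.3f}, R2={r_squared:.3f}")
```

A statistics library would also report a p-value for the slope; the closed-form version here is only meant to show where a, b and R² come from.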

#### EVI vs. Cereal Cropping Yield in South Australia, 1989 - 2020

![](reg_plot.jpg)

### Findings

The regression shows that EVI correlates with yield, with an R² of 0.434 and a p-value of 0.04E-05. In other words, EVI explains roughly 43% of the variation in yield, and we reject the null hypothesis in favour of the alternative hypothesis that there is a relationship between EVI and cereal yield.


However, the correlation over the 30 years for SA is not strong enough for a robust predictive model. The key takeaways are:

*	There is a relationship between NDVI/EVI and crop yield that can be examined further at different geographical scales to test for greater predictive value
*	There is a case to explore more modeling methodologies to test this hypothesis using more data inputs and different programming techniques
*	This modeling technique can give a high-level indication of the range yield is likely to fall in, but its predictive value is not high enough to be a basis for prescriptive actions.

## Future Work

*	Source more local and regional time-series data from government and industry groups to test the model hypothesis across multiple regions incorporating soil data, elevation and other geographic variables.
*	Experiment with random forest models to further evaluate the impact each month throughout the year has on the final harvest yield. 
*	Use remote sensing technology to evaluate where nitrogen fertilizers may have been applied across the cropping regions (more rainfall generally means more nitrogen fertilizer) 


