## Estimate crop area based on crop mask (single year)
**Authors**: Hannah Kerner (hkerner@umd.edu) and Adebowale Daniel Adebayo (aadebowaledaniel@gmail.com)

**Description:** 

This notebook performs the following steps: 

1. Copies an existing crop map from Google Cloud Storage
1. Clips crop map to a regional boundary (admin1 shape or user-defined bounding box)
1. Thresholds the crop map to a binary mask of 0 (noncrop) or 1 (crop)
1. Creates a random stratified sample from the crop mask for labeling in CEO
1. Computes the confusion matrix between the labeled reference sample and the crop mask
1. Calculates the crop and noncrop area and accuracy estimates based on Olofsson et al., 2014

To be added in the future:
* Code for sub-regional estimates (subsetting the reference sample according to admin2 bounds, e.g.), probably as a separate notebook

## Note:
This notebook can be used either on [Google Colab](https://colab.research.google.com/github/nasaharvest/crop-mask/blob/area-estimation/notebooks/crop_area_estimation.ipynb) or on your local computer. If you are using your local computer, skip the Colab Setup step and start with the Local Setup section.

If your map is larger than 7 GB, consider running this notebook on your personal computer or on a virtual machine with more than 12 GB of RAM.

In [None]:
# Clone the crop-mask repository
# Skip this step if you have already cloned the repository or are running locally
email = input("GitHub email: ")
username = input("GitHub username: ")

# Set the git identity: the email goes to user.email, the username to user.name
!git config --global user.email $email
!git config --global user.name $username

from getpass import getpass
token = getpass('Github Personal Access Token:')
!git clone https://$username:$token@github.com/nasaharvest/crop-mask.git
%cd crop-mask

## Colab Setup
* Note: You must be logged into Colab with the same account that you will use to authenticate.
* You need to authenticate your Google account in order to access the cloud storage where the map is saved.

In [None]:
# Authenticate Google Cloud
from google.colab import auth
print("Logging into Google Cloud")
auth.authenticate_user()

## Local Setup
* Note: You need to install gsutil in order to access Google Cloud Storage. Follow the instructions [here](https://cloud.google.com/storage/docs/gsutil_install) to install gsutil.

In [None]:
# Install required packages
# Skip this step if you have already installed the packages in your local environment
!pip install geopandas -q
!pip install rasterio -q
!pip install cartopy==0.19.0.post1 -q

In [None]:
# Import libraries
import os
import sys
from shapely.geometry import box
import geopandas as gpd

In [None]:
# Import crop area estimation functions
sys.path.insert(0, "..")
from src.area_utils import (
    load_ne,
    load_raster,
    binarize,
    cal_map_area_class,
    estimate_num_sample_per_class,
    generate_ref_samples,
    reference_sample_agree,
    compute_confusion_matrix,
    compute_area_estimate,
    plot_area,
)

Paste the map gsutil URI (file path in the cloud storage) to download/copy the map into local storage in Colab or your personal computer.

In [None]:
# Download the map from the cloud storage by providing bucket URI
# Example: gs://crop-mask-final-maps/2016/China/epsg32652_Heilongjiang_2016.tif
import ipywidgets as widgets
bucket_uri = widgets.Text(description="Bucket URI:", placeholder="Paste the crop map bucket uri or file path: gs://", layout=widgets.Layout(height="5em", width="60%"))
bucket_uri

In [None]:
!gsutil du -h $bucket_uri.value

In [None]:
# Download the map
!gsutil cp $bucket_uri.value .

## Load Region of Interest (ROI)
If you do not have the shapefile for your ROI downloaded already, you can run the following steps to download one (note: this functionality is only available for admin1 level boundaries).

If you want to use the dimensions of a bounding box instead of a shapefile, you will have the opportunity to do that later. 

In [None]:
country_iso_code = 'CHN' # Can be found https://www.iso.org/obp/ui/#search under the Alpha-3 code column
region_of_interest = ['Heilongjiang']
roi = load_ne(country_iso_code, region_of_interest)

In [None]:
roi.plot()

In [None]:
# Optionally specify bounding box boundaries to clip to
# Note that these boundaries must be in the same CRS as the raster
# You can get this from bboxfinder, e.g.: http://bboxfinder.com/#10.277000,36.864900,10.835100,37.191000

def getFeatures(gdf):
    """Function to parse features from GeoDataFrame in such a manner that rasterio wants them"""
    import json
    return [json.loads(gdf.to_json())['features'][0]['geometry']]

# Example bounds; replace with your own optional bbox bounds
minx, miny, maxx, maxy = 249141.6217, 840652.3433, 272783.1953, 855138.2342
target_crs = "EPSG:XXXXX"  # replace XXXXX with the EPSG code of the raster's CRS
bbox = box(minx, miny, maxx, maxy)
geodf = gpd.GeoDataFrame({'geometry': bbox}, index=[0], crs=target_crs)
roi = getFeatures(geodf)

## Load the crop mask

* Loads the map from the .tif file as a numpy array. If a region of interest (roi) is specified above, a masked array clipped to the roi is returned; otherwise the whole map extent is returned as a numpy array.

* To make sure your raster is projected using the local UTM zone (e.g., EPSG:326XX, where XX is the 2-digit UTM zone), you will be prompted to input the EPSG code for the region of interest if the map has not already been projected (i.e., the map CRS is EPSG:4326).
* The projected map will be saved as `prj_{base name of the map}.tif`.

In [None]:
map_path = os.path.basename(bucket_uri.value)

In [None]:
map_array, map_meta = load_raster(map_path, roi) 

## Calculate the mapped area for each class

In [None]:
# uses 0.5 threshold by default
binary_map = binarize(map_array, map_meta)
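
The thresholding step can be illustrated with a minimal sketch. It assumes the map holds crop probabilities in [0, 1], that masked pixels outside the ROI are represented as `None`, and that values equal to the threshold count as crop. The helper `binarize_sketch` is hypothetical and is not the repository's `binarize`, which works on the raster array and its metadata.

```python
from typing import List, Optional


def binarize_sketch(
    prob_map: List[List[Optional[float]]], threshold: float = 0.5
) -> List[List[Optional[int]]]:
    """Map each probability to 1 (crop) or 0 (noncrop); keep masked pixels as None."""
    return [
        [None if p is None else int(p >= threshold) for p in row]
        for row in prob_map
    ]


# Toy 2x3 probability map with one masked pixel (assumed values)
probs = [[0.9, 0.2, None], [0.5, 0.49, 0.7]]
binary = binarize_sketch(probs)
```

Whether the boundary value belongs to crop or noncrop is a design choice; check the `binarize` source if it matters for your map.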

In [None]:
crop_area_px, noncrop_area_px = cal_map_area_class(binary_map, unit='pixels')
crop_area_ha, noncrop_area_ha = cal_map_area_class(binary_map, unit='ha')

In [None]:
crop_area_frac, noncrop_area_frac = cal_map_area_class(binary_map, unit='fraction')
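
As a rough sketch of how the three units relate, the example below counts pixels per class and converts the counts to hectares and fractions. It assumes a projected raster whose pixel size is known in metres (which is why the map is reprojected to UTM) and represents masked pixels as `None`. The function `area_by_class` is hypothetical, not the repository's `cal_map_area_class`.

```python
def area_by_class(binary_map, pixel_width_m, pixel_height_m):
    """Count crop/noncrop pixels, then convert the counts to hectares and fractions."""
    # Flatten the map and drop masked pixels
    flat = [v for row in binary_map for v in row if v is not None]
    crop_px = sum(1 for v in flat if v == 1)
    noncrop_px = sum(1 for v in flat if v == 0)
    ha_per_px = pixel_width_m * pixel_height_m / 10_000  # 1 ha = 10,000 m^2
    total = crop_px + noncrop_px
    return {
        "pixels": (crop_px, noncrop_px),
        "ha": (crop_px * ha_per_px, noncrop_px * ha_per_px),
        "fraction": (crop_px / total, noncrop_px / total),
    }


# Toy map with 10 m x 10 m pixels (assumed resolution)
areas = area_by_class([[1, 0, None], [1, 0, 1]], 10.0, 10.0)
```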

## Create random stratified reference sample from crop map strata following best practices

First we need to determine the number of total samples we want to label for our reference dataset.

We use the method identified by Olofsson et al. in Good practices for estimating area and assessing accuracy of land change (eq 13) to determine sample size:

$$n \approx \left( \frac{\sum_i W_i S_i}{S(\hat{O})} \right)^2$$

| Where         |                                                      |
|---------------|------------------------------------------------------|
| W<sub>i</sub> | Mapped proportion of class i                         |
| S<sub>i</sub> | Standard deviation √(U<sub>i</sub>(1-U<sub>i</sub>)) |
| U<sub>i</sub> | Expected user's accuracy for class i                 |
| S(Ô)          | Desired standard error of overall accuracy           |
| n             | Sample size                                          |

If you have already used an independent validation or test set to estimate the user's accuracy for each class, you can plug those values into this equation. If you have not, you will need to make a guess. A conservative (lower) guess is safer, since overestimating the user's accuracy may yield fewer points than are actually needed to achieve low standard errors. The example calculation below uses a user's accuracy of 0.7 for both classes and a standard error of 0.02.
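
To make the formula concrete, here is a minimal sketch that evaluates it directly. The mapped proportions (0.3 crop, 0.7 noncrop) are hypothetical values chosen for illustration; in the notebook they come from the map. The function `total_sample_size` is also hypothetical and only computes the total n, not the per-class allocation returned by `estimate_num_sample_per_class`.

```python
import math


def total_sample_size(weights, users_accuracies, target_stderr):
    """Evaluate n ≈ (sum_i W_i * S_i / S(O))^2 with S_i = sqrt(U_i * (1 - U_i))."""
    weighted_sd = sum(
        w * math.sqrt(u * (1 - u)) for w, u in zip(weights, users_accuracies)
    )
    return (weighted_sd / target_stderr) ** 2


# Hypothetical mapped proportions; expected user's accuracy 0.7 for both classes
w_crop, w_noncrop = 0.3, 0.7
n = total_sample_size([w_crop, w_noncrop], [0.7, 0.7], 0.02)

# Round away floating-point noise before taking the ceiling
n_total = math.ceil(round(n, 6))
```

When both classes share the same expected user's accuracy U and the proportions sum to 1, the formula reduces to U(1 - U) / S(Ô)², so here n is 0.21 / 0.0004 = 525 regardless of the mapped proportions.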

In [None]:
u_crop = 0.7
u_noncrop = 0.7
stderr = 0.02

In [None]:
n_crop_sample, n_noncrop_sample = estimate_num_sample_per_class(crop_area_frac, noncrop_area_frac, u_crop, u_noncrop)

Now we can randomly draw sample locations using this allocation from each of the map strata. 

In [None]:
# Draw the stratified random sample from the binary map for labeling in CEO
generate_ref_samples(binary_map, map_meta, n_crop_sample, n_noncrop_sample)
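
The idea behind stratified random sampling can be sketched as follows: group pixel positions by map class, then draw the requested number of positions from each group without replacement. This sketch assumes masked pixels are `None` and returns only (row, column) positions; the hypothetical `stratified_sample` does not convert positions to coordinates or write the shapefile used for labeling.

```python
import random


def stratified_sample(binary_map, n_crop, n_noncrop, seed=0):
    """Draw (row, col) positions at random, without replacement, within each map class."""
    strata = {0: [], 1: []}
    for r, row in enumerate(binary_map):
        for c, v in enumerate(row):
            if v in strata:  # skips masked pixels (None)
                strata[v].append((r, c))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        "crop": rng.sample(strata[1], n_crop),
        "noncrop": rng.sample(strata[0], n_noncrop),
    }


# Toy map: 4 crop pixels, 6 noncrop pixels, 2 masked pixels
demo_map = [[1, 0, None, 1], [0, 0, 1, 0], [1, 0, 0, None]]
points = stratified_sample(demo_map, n_crop=2, n_noncrop=3)
```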

### Label the reference samples in CEO

This step is done in Collect Earth Online. First you need to create a labeling project with the shapefile we just created (two copies for consensus). Once all of the points in both sets have been labeled, come back to the next step.

See the instructions for labeling planted area points [here](https://docs.google.com/presentation/d/18bJHMX5M1jIR9NBWIdYeJyo3tG4CL3dNO5vvxOpz5-4/edit#slide=id.p).

## Load the labeled reference samples and get the mapped class for each of the reference samples

There should be two sets of labels for the reference sample. We compare the two sets and discard any points on which the labelers disagreed, so that we can be confident in the label of every remaining point.

Upload the labeled reference sample and paste the relative paths.

In [None]:
# paths to the labeled reference samples
ceo_set_1 = 'ceo-Heilongjiang-2016-(Set-1)---v2-sample-data-2022-09-08.csv'
ceo_set_2 = 'ceo-Heilongjiang-2016-(Set-2)---v2-sample-data-2022-09-08.csv'

In [None]:
ceo_geom = reference_sample_agree(binary_map, map_meta, ceo_set_1, ceo_set_2)
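
The agreement filter described above can be sketched in a few lines. The record layout (`point_id` and `label` keys) is an assumption made for illustration and does not reflect the actual CEO CSV columns; `keep_agreeing_labels` is a hypothetical helper, not the repository's `reference_sample_agree`, which also looks up the mapped class of each point.

```python
def keep_agreeing_labels(set_1, set_2):
    """Keep only the points that both labelers assigned the same label."""
    labels_2 = {row["point_id"]: row["label"] for row in set_2}
    return [
        row
        for row in set_1
        if row["point_id"] in labels_2 and labels_2[row["point_id"]] == row["label"]
    ]


# Toy label sets: the labelers disagree on point 2 only
set_1 = [
    {"point_id": 1, "label": "crop"},
    {"point_id": 2, "label": "noncrop"},
    {"point_id": 3, "label": "crop"},
]
set_2 = [
    {"point_id": 1, "label": "crop"},
    {"point_id": 2, "label": "crop"},
    {"point_id": 3, "label": "crop"},
]
agreed = keep_agreeing_labels(set_1, set_2)
```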

In [None]:
ceo_geom.head(10)

## Compute the confusion matrix between the mapped classes and reference labels

In [None]:
confusion_matrix = compute_confusion_matrix(ceo_geom)
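
For intuition, a binary confusion matrix is a count of (mapped class, reference class) pairs over the reference sample. The sketch below assumes classes are coded 1 (crop) and 0 (noncrop); `confusion_counts` is a hypothetical helper, and the output layout of `compute_confusion_matrix` may differ.

```python
def confusion_counts(mapped, reference):
    """Count (mapped, reference) pairs for a binary crop (1) / noncrop (0) map."""
    counts = {(m, r): 0 for m in (0, 1) for r in (0, 1)}
    for m, r in zip(mapped, reference):
        counts[(m, r)] += 1
    return counts


# Toy sample of eight reference points
mapped = [1, 1, 1, 0, 0, 0, 0, 1]
reference = [1, 1, 0, 0, 0, 1, 0, 1]
cm = confusion_counts(mapped, reference)
```

Here `cm[(1, 0)]` counts points mapped as crop but labeled noncrop, and `cm[(0, 1)]` counts points mapped as noncrop but labeled crop.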

## Adjust mapped area using confusion matrix to compute area estimates

In [None]:
summary = compute_area_estimate(crop_area_px, noncrop_area_px, confusion_matrix, map_meta)

In [None]:
summary

In [None]:
plot_area(summary)
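
The area adjustment can be sketched with the stratified estimators from Olofsson et al. (2014): the estimated proportion of reference class j is the sum over map classes i of W_i · n_ij / n_i, with a standard error built from the same terms. The counts, weights, and total area below are made-up illustration values, and `area_estimate` is a hypothetical function; `compute_area_estimate` may report additional quantities and use a different layout.

```python
import math


def area_estimate(counts, weights, total_area_ha):
    """Stratified area estimate per reference class.

    counts[i][j]: number of sample points with map class i and reference class j.
    weights[i]: mapped proportion W_i of map class i.
    """
    n_i = [sum(row) for row in counts]
    k = len(counts)
    results = []
    for j in range(k):
        # Estimated proportion of class j: sum_i W_i * n_ij / n_i
        prop = sum(weights[i] * counts[i][j] / n_i[i] for i in range(k))
        # Variance of that proportion under stratified random sampling
        var = sum(
            weights[i] ** 2
            * (counts[i][j] / n_i[i])
            * (1 - counts[i][j] / n_i[i])
            / (n_i[i] - 1)
            for i in range(k)
        )
        results.append(
            {
                "area_ha": prop * total_area_ha,
                "ci95_ha": 1.96 * math.sqrt(var) * total_area_ha,
            }
        )
    return results


# Rows: mapped crop, mapped noncrop. Columns: reference crop, reference noncrop.
counts = [[90, 10], [20, 80]]
estimates = area_estimate(counts, weights=[0.4, 0.6], total_area_ha=1000.0)
```

With these toy numbers the mapped crop area is 400 ha, but the adjusted estimate is 480 ha, because 20% of the points mapped as noncrop were labeled crop in the reference sample.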