# Generating CEO labeling project
**Authors**: Hannah Kerner (hkerner@umd.edu) and Ivan Zvonkov (izvonkov@umd.edu)

**Description**: This notebook contains:
1. Code to select a region of interest from a shapefile
2. Code for creating a shapefile with points randomly sampled inside the region of interest
3. Instructions for creating a CEO (Collect Earth Online) project using the random sample shapefile.

In [None]:
from shapely.geometry import Point
from pathlib import Path
import cartopy.io.shapereader as shpreader
import geopandas as gpd
import numpy as np
import random
import matplotlib.pyplot as plt
import shutil

## 1. Selecting the region(s) of interest
The priority list for regions of interest can be found here: https://docs.google.com/spreadsheets/d/1y94ZV2z2biW8IX6SoPDNJ4PFXK_27_rxsOqY_jlbzhc/edit?usp=sharing

You should update four variables in the cell below:
1. `project_name`: Use the format **[Region_Name]_StartYear** (e.g., Kenya_2022 or Sudan_Blue_Nile_2019)
2. `country_code`: Enter the 3-letter country code for the region of interest
3. `regions_of_interest_{adm1,adm2}`: Add the admin1 or admin2 region names if your region of interest is at the admin1 or admin2 scale. Multiple names can be added to the list. If you leave the list empty, the region of interest will include the entire country. If you write 'all' (e.g., `regions_of_interest_adm1 = ['all']`), then all of the admin1 or admin2 regions will be used. This is useful if you want to stratify your sample by admin zone. 
4. `use_dissolved`: If set to True, all shapes in the resulting boundary will be merged into a single (dissolved) boundary before sampling. If set to False, `sample_amount` samples will be generated in each polygon contained in the boundary dataframe.

You should not have to change the `sample_amount`, unless you want to sample fewer or more points than we typically use in projects.
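
As a rough sketch of how `use_dissolved` affects the size of the sample, the snippet below counts the points that the sampling step would generate. The helper `expected_point_count` and the polygon count of 7 are hypothetical and only mirror the behavior described above; they are not part of this notebook.

```python
def expected_point_count(sample_amount, num_polygons, use_dissolved):
    """Return the total number of points the sampling step would produce.

    Hypothetical helper for illustration only.
    """
    if num_polygons < 1:
        raise ValueError("The boundary must contain at least one polygon.")
    # A dissolved boundary is a single shape, so sample_amount points in total
    if use_dissolved:
        return sample_amount
    # Otherwise sample_amount points are drawn inside every polygon
    return sample_amount * num_polygons


# Example with an assumed boundary of 7 admin2 polygons
print(expected_point_count(1500, 7, use_dissolved=True))   # 1500
print(expected_point_count(1500, 7, use_dissolved=False))  # 10500
```

With `use_dissolved = False` the total grows with the number of regions, so it may be worth lowering `sample_amount` when many admin2 regions are selected.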

In [None]:
################################################################################
# THIS IS THE ONLY CELL THAT SHOULD BE EDITED WHEN RUNNING THIS NOTEBOOK
################################################################################
sample_amount = 1500
project_name = "Ethiopia_Tigray_2019"
country_code = "ETH" # Can be found at https://www.iso.org/obp/ui/#search under the Alpha-3 code column
regions_of_interest_adm1 = ['Tigray']
regions_of_interest_adm2 = ['all']
use_dissolved = False

# Alternatively, instead of specifying the country code and regions above, you can provide the path to a custom shapefile
custom_shapefile = ""

In [None]:
if custom_shapefile:
    boundary = gpd.read_file(custom_shapefile)
    boundary.plot();
else:
    # Load in shapefile from GADM
    gadm2_path = f'https://geodata.ucdavis.edu/gadm/gadm4.1/json/gadm41_{country_code}_2.json.zip'
    gadm_gdf = gpd.read_file(gadm2_path)
    
    if len(regions_of_interest_adm1) > 0:
        # Check regions
        available_regions = gadm_gdf.query('GID_0 == @country_code')['NAME_1'].tolist()
        if len(regions_of_interest_adm1) == 1 and regions_of_interest_adm1[0] == 'all':
            regions_of_interest_adm1 = available_regions
        regions_not_found = [region for region in regions_of_interest_adm1 if region not in available_regions]

        if len(regions_not_found) > 0:
            condition = 'GID_0 == @country_code'
            boundary = None
            print(f"WARNING: {regions_not_found} not found. Please select only regions shown in the plot below.")
        else:
            condition = 'NAME_1 in @regions_of_interest_adm1'
            boundary = gadm_gdf.query('NAME_1 in @regions_of_interest_adm1').copy()
            print("All admin1 regions found!")
    
            if len(regions_of_interest_adm2) > 0:
                # Check regions
                available_regions = boundary.query('NAME_1 in @regions_of_interest_adm1')['NAME_2'].tolist()
                if len(regions_of_interest_adm2) == 1 and regions_of_interest_adm2[0] == 'all':
                    regions_of_interest_adm2 = available_regions
                regions_not_found = [region for region in regions_of_interest_adm2 if region not in available_regions]

                if len(regions_not_found) > 0:
                    condition = 'NAME_1 in @regions_of_interest_adm1'
                    boundary = None
                    print(f"WARNING: {regions_not_found} not found. Please select only regions shown in the plot below.")
                else:
                    condition = 'NAME_2 in @regions_of_interest_adm2'
                    boundary = boundary.query('NAME_2 in @regions_of_interest_adm2').copy()
                    print("All admin2 regions found!")
    else:
        # use entire country
        condition = 'GID_0 == @country_code'
        boundary = gadm_gdf.query(condition).copy()
    
    gadm_gdf.query(condition).plot(
        column=condition.split(' ')[0], 
        legend=True, 
        legend_kwds={'loc': 'lower right'}, 
        figsize=(10,10)
    );

In [None]:
# Verify boundary is set
assert boundary is not None, "Boundary was not set in above cell, most likely due to not all regions found."

# Make sure the shapefile has EPSG:4326, otherwise convert it
print('Boundary shapefile CRS is %s' % boundary.crs)
if boundary.crs is None:
    boundary = boundary.set_crs('epsg:4326')
    print('Boundary shapefile set to %s' % boundary.crs)
if boundary.crs != 'epsg:4326':
    boundary = boundary.to_crs('epsg:4326')
    print('Boundary shapefile converted to %s' % boundary.crs)

## 2. Creating shapefile with points in each region
To evaluate cropland mapping methods, a random sample of labeled points can be used to estimate a map's user's accuracy and producer's accuracy.
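
As a minimal sketch of how these accuracies could be computed once the sampled points are labeled, the example below derives them from a confusion matrix. The matrix values and the helper `accuracy_metrics` are hypothetical and chosen only for illustration. The calculation assumes a simple random sample; a sample drawn separately in each polygon (`use_dissolved = False`) would likely need area weighting, which is not shown here.

```python
import numpy as np

# Hypothetical confusion matrix for illustration (not real results).
# Rows: class predicted by the map, columns: reference label from CEO.
# Class order: [crop, non-crop]
confusion = np.array([
    [40, 10],  # mapped as crop: 40 labeled crop, 10 labeled non-crop
    [5, 45],   # mapped as non-crop: 5 labeled crop, 45 labeled non-crop
])


def accuracy_metrics(cm):
    """Return overall, user's, and producer's accuracy from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)
    overall = correct.sum() / cm.sum()
    # User's accuracy: correct points divided by all points mapped as that class
    users = correct / cm.sum(axis=1)
    # Producer's accuracy: correct points divided by all reference points of that class
    producers = correct / cm.sum(axis=0)
    return overall, users, producers


overall, users, producers = accuracy_metrics(confusion)
print(f"Overall accuracy: {overall:.2f}")
print("User's accuracy [crop, non-crop]:", users)
print("Producer's accuracy [crop, non-crop]:", producers)
```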

In [None]:
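def create_shapefile_zip(gdf, filename):
    """Write gdf as a shapefile under ../data/shapefiles and zip it."""
    p = Path("../data/shapefiles") / filename
    gdf.to_file(p, index=False)
    # Zip the shapefile directory, then remove the unzipped copy
    shutil.make_archive(str(p), 'zip', p)
    shutil.rmtree(p)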

In [None]:
boundary["roi"] = True
dissolved_boundary = boundary.dissolve(by="roi")

if use_dissolved:
    create_shapefile_zip(dissolved_boundary, f"{project_name}_boundary")
    dissolved_boundary.plot()
else:
    create_shapefile_zip(boundary, f"{project_name}_boundary")
    boundary.plot()

In [None]:
# Function for sampling random points. 
# From https://gis.stackexchange.com/questions/294394/randomly-sample-from-geopandas-dataframe-in-python
def random_points_in_polygon(num_points, polygon):
    points = []
    min_x, min_y, max_x, max_y = polygon.bounds
    i = 0
    while i < num_points:
        point = Point(random.uniform(min_x, max_x), random.uniform(min_y, max_y))
        if polygon.contains(point):
            points.append(point)
            i += 1
    return points  # returns list of shapely points

# Sample n points within the shapefile

if use_dissolved:
    points = random_points_in_polygon(sample_amount, dissolved_boundary.iloc[0].geometry)
else:
    points = []
    for idx, polygon in boundary.iterrows():
        points += random_points_in_polygon(sample_amount, polygon.geometry)
    
# Convert the list of points to a geodataframe
points_gdf = gpd.GeoDataFrame([], geometry=gpd.points_from_xy(x=[p.x for p in points], 
                                                                  y=[p.y for p in points]))

# Plot the random points
fig, ax = plt.subplots(1, figsize=(20,20))
ax.set_title("Sampled Points")
boundary.plot(ax=ax)
points_gdf.plot(ax=ax, color="orange");

In [None]:
# Add columns for CEO formatting
points_gdf['PLOTID'] = points_gdf.index
points_gdf['SAMPLEID'] = points_gdf.index

# Set the data type of the IDs to be integers
points_gdf['SAMPLEID'] = points_gdf['SAMPLEID'].astype(np.int64)
points_gdf['PLOTID'] = points_gdf['PLOTID'].astype(np.int64)

# Set crs
points_gdf.crs = 'epsg:4326'

In [None]:
# Save the file as a new shapefile
create_shapefile_zip(points_gdf[['geometry', 'PLOTID', 'SAMPLEID']], f"{project_name}_random_sample_ceo")

The above cells should have generated two shapefiles (boundary and random sample ceo) inside the `crop-mask/data/shapefiles` directory.
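
One way to confirm that both files were written is sketched below. The helper `missing_shapefile_zips` is hypothetical; it assumes the zip names produced by `create_shapefile_zip` above and that the code runs from the notebook's directory, so the relative path `../data/shapefiles` resolves correctly.

```python
from pathlib import Path


def missing_shapefile_zips(project_name, shapefile_dir="../data/shapefiles"):
    """Return the expected zip paths that do not exist (hypothetical helper)."""
    expected = [
        Path(shapefile_dir) / f"{project_name}_boundary.zip",
        Path(shapefile_dir) / f"{project_name}_random_sample_ceo.zip",
    ]
    return [p for p in expected if not p.exists()]


# Example using the project name from the configuration cell
missing = missing_shapefile_zips("Ethiopia_Tigray_2019")
print("Missing files:", [str(p) for p in missing] if missing else "none")
```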

## 3. Creating CEO project using shapefile
The CEO project is the interface which labelers use to label points as crop or non-crop.

3.1. Navigate to NASA Harvest's CEO page: https://app.collect.earth/review-institution?institutionId=1493

3.2. Select "Create New Project" (if no such button exists, email izvonkov@umd.edu for admin permissions)

3.3. Input project title in the following format: **[Region name] [season start month year] - [season end month year] (Set 1)**. (Use the crop calendar to determine month range in title. You can find the crop calendar at this URL if you replace `NAM` with your 3-letter country code: https://www.fao.org/giews/countrybrief/country.jsp?code=NAM)

![ceo-project-overview](../assets/ceo-project-overview.png)

3.4. Select "Planet Monthly Mosaics" as default imagery and also select "Google Satellite Layer" and "Sentinel-2"

![ceo-imagery-selection](../assets/ceo-imagery-selection.png)

3.5. Upload the created shapefile zip located in `crop-mask/data/shapefiles` in Plot Design only

![ceo-plot-design](../assets/ceo-plot-design.png)

3.6. Under Sample Generation - Spatial Distribution, select "Center".

3.7. Create survey question

![ceo-survey-question](../assets/ceo-survey-question.png)

3.8. Click next through Survey Rules and select Create Project. (The project will not be visible to non-admins until it is Published).

![ceo-complete-project](../assets/ceo-complete-project.png)

3.9. Verify the project configuration by sending a Slack message to Ivan or Hannah.

3.10. Create Set 2 version by selecting "Create New Project" again but this time navigating to "Select Template" and selecting the previously made project and clicking "Load". Update the title from Set 1 to Set 2 and click Review.

![ceo-load-template](../assets/ceo-load-template.png)

3.11. Publish both projects by selecting the Publish Project button on the review page.

![ceo-publish-project](../assets/ceo-publish-project.png)

3.12. Add both new projects to the Google Sheet: https://docs.google.com/spreadsheets/d/124Ona841vhMI1FQjzuBerKxTwK_CWuyeWj1E6j3iUaM/edit?usp=sharing

## 4. Pushing new shapefiles to GitHub

Storing the random sample shapefile makes the CEO project reproducible, and storing the boundary shapefile makes it possible to create a map later on. We'll store both files directly inside the repository.

4.1. Push the changes to GitHub using the following commands:
```bash
git checkout -b 'namibia-shapefile-data'
git add data/shapefiles/*
git commit -m 'New Namibia shapefile data'
# Assumes the remote is named "origin"
git push --set-upstream origin namibia-shapefile-data
```

4.2. Create a Pull Request into master by navigating to this page: https://github.com/nasaharvest/crop-mask/compare and selecting the branch you just pushed ("namibia-shapefile-data" in this case).