# COSC 526 - Assignment 11
### April 9, 2021
---

## A - Problem Overview

In this notebook, we provide you with instructions for completing the assignment.  *Be sure to answer all the questions in this notebook.*  Each member of the group must upload their own work (i.e., a notebook file) to GitHub.

Last week, we looked at an example of processing and analysing data in the domain of soil science. Soil moisture is a critical variable that links climate dynamics with water and food security, and soil moisture information is a key input for precision agriculture. Currently, soil moisture data over large areas come from remote sensing (i.e., satellites with radar sensors), which provides daily, nearly global coverage. However, satellite soil moisture datasets have a major shortcoming: they are limited to coarse spatial resolution (generally no finer than tens of kilometers). Other geographic datasets (e.g., climatic, geological, and topographic) that are closely related to soil moisture exist at higher resolution and can be paired with soil moisture data to downscale (i.e., increase the resolution of) the original soil moisture product.

Last assignment, we performed some early stages of data downscaling using a common and powerful Python package for data analysis: [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). 

This week, we execute a fully developed workflow to reproduce scientific experiments using powerful machine learning methods (i.e., KNN, Surrogate-based Model, HYPPO, and Random Forest).

#### Setup of the SOMOSPIE workflow

This process consists of two major steps: 
* (1) creating a new virtual machine in Jetstream from a custom VM image; 
* (2) cloning the SOMOSPIE GitHub repository on your new VM.

Your assignment is to: 
* Set up and run the full SOMOSPIE workflow;
* Use SOMOSPIE to partially reproduce a published study on one region with multiple machine learning methods;
* Use SOMOSPIE to downscale soil moisture using one method across multiple regions;
* Use Pandas to analyze the reported accuracies ($R^2$ and $RMSE$) of the predictions across multiple regions and months.

You will report the results of your experiments within this Notebook. 

## IMPORTANT:  Do **not** push a copy of the SOMOSPIE repo to your class repo.

---

## B - SOMOSPIE Setup

### Step 1: Create a new VM

In Jetstream (https://use.jetstream-cloud.org), launch an m1.medium Virtual Machine (VM) using the SOMOSPIE image found here: https://use.jetstream-cloud.org/application/images/946. You can launch an instance based on an image with the launch button in the top right -- just make sure you pick the project you want to have the instance in! Consult the JetstreamGuide.ipynb for instructions on launching a VM.

### Step 2: Clone the repo

**Inside your new virtual machine**, clone the SOMOSPIE repository:
* `git clone --recursive https://github.com/TauferLab/SOMOSPIE`

### Step 3: Set up environment  

Before opening the SOMOSPIE notebook, we need to update the `.bashrc`:
* `cd SOMOSPIE`
* `make bash`
* `source ~/.bashrc`

### Step 4: Load data and run simple test

Now open the notebook `SOMOSPIE.ipynb` and in the `Cell` menu select `Run All`. 

IMPORTANT: The first run may take 45 minutes because the notebook loads the data from public datasets.

---
## C - Reproducible Science

The paper on SOMOSPIE ([SOMOSPIE.pdf](SOMOSPIE.pdf); Rorabaugh, Guevara, Llamas, Kitson, Vargas, and Taufer, _SOMOSPIE: A modular SOil MOisture SPatial Inference Engine based on data-driven decisions_) was presented at the IEEE eScience 2019 conference and involves a case study using the following data:
* Soil Moisture Information
  * Data: Spatial pixels--Longitude, Latitude, Soil Moisture Ratio
  * Source: ESA-CCI satellites (https://www.esa-soilmoisture-cci.org)
  * Spatial: Global, 0.25 degree (~25 km) resolution
  * Temporal: Monthly means for April 2016
* Topography Information
  * Data: Spatial pixels--Longitude, Latitude, 15 parameters
  * Source: Derived with SAGA GIS (https://www.hydroshare.org/resource/b8f6eae9d89241cf8b5904033460af61)
  * Spatial: Conterminous United States (CONUS), 1 km resolution
  * Temporal: n/a (elevation is relatively stable across time)
* Region Information
  * Data: Spatial shapefile (boundaries)
  * Source: CEC Level III ecoregions (cec.org/tools-and-resources/map-files/terrestrial-ecoregions-level-iii)
  * Spatial: Boundary of ecoregion 8.5.1, Middle Atlantic Coastal Plains
  * Temporal: n/a (ecoregion boundaries are relatively stable across time)
  
![Delaware_ecoregion.jpg](Delaware_ecoregion.jpg)

The next 3 problems take you through reproducing the study for other scenarios (e.g., different years, different terrain parameters, and different regions of interest).

---
  
## Problem 1
  
Since the time of that paper, updated data and modeling scripts have become available. Use the SOMOSPIE Jupyter Notebook to repeat the above case study, except with 2017 soil moisture data and only a couple of the 15 topographic parameters.

To accomplish this, the following things need to be changed in the Notebook you ran at the end of setup:

**Before proceeding, you must restart the kernel and clear all output**

**Run cell by cell and select the desired options**
* In the Data Loading section:
  * Select 2 topographic parameters (also called covariates by the experts in the field) of your choice.
* In the Data Selection section:
  * In the Region Data subsection, uncheck the default selection and check CEC.
  * Make sure your downloaded topographic parameters are both checked.
  * Uncheck the default region and check region 8.5.1, which is in CEC level 3.
* In the Data Processing Decisions section:
  * Change the MONTHS entry from [1] to [4] (for April).
* In the Machine Learning section:
  * Select all three methods used in the paper -- KKNN, HYPPO, RF.
  
**Question 1a:** What 2 topographic covariates did you use?  
**Question 1b:** What $R^2$ and $RMSE$ values do you get for the three methods?

1a. I chose Aspect and Slope.

1b.

* KKNN: $RMSE$ = 0.034992, $R^2$ = 0.547494
* HYPPO: $RMSE$ = 0.064726, $R^2$ = 0.108650
* RF: $RMSE$ = 0.032542, $R^2$ = 0.604472
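
For reference, the two accuracy metrics reported above can be computed from observed and predicted values as in the minimal sketch below. It uses the common definitions ($RMSE$ as the root of the mean squared residual, $R^2$ as one minus the ratio of residual to total sum of squares); the function names and the sample arrays are illustrative assumptions, and SOMOSPIE's own scripts may compute these metrics differently.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration only (not SOMOSPIE output).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

example_rmse = rmse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
print(f"rmse={example_rmse:.6f}, r2={example_r2:.6f}")
```

A lower $RMSE$ and a higher $R^2$ both indicate a better fit, which is why the two are expected to move in opposite directions.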


---
  
## Problem 2

Having tested multiple methods on one region, let's test one method on multiple regions. We will look at a rectangular region (6° of longitude $\times$ 4° of latitude) in the middle of the country, sliced into six sub-boxes (each 2° of longitude $\times$ 2° of latitude).
![CONUS_with_6boxes.png](CONUS_with_6boxes.png)

To accomplish this, the following things need to be changed in the SOMOSPIE Notebook:

**Before proceeding, you must restart the kernel and clear all output**

**Run cell by cell and select the desired options**
* In the Data Loading section:
  * In the Region Data subsection, uncheck the default selection and check BOX.
* In the Data Selection section:
  * Fill in the following BOX parameters: x1=-102, x2=-96, y1=36, y2=40, nx=3, ny=2.
* In the Machine Learning section:
  * Select one method, either KKNN or RF.
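
The BOX parameters above describe a rectangle split into `nx` columns and `ny` rows. The sketch below shows one way to enumerate the resulting sub-boxes. The helper names `split_box` and `box_label` are hypothetical, and the label format (`x1_x2_y1_y2`) is inferred from the region names that appear in the notebook output; this is not SOMOSPIE's actual code.

```python
def split_box(x1, x2, y1, y2, nx, ny):
    """Split the rectangle [x1, x2] x [y1, y2] into nx * ny equal sub-boxes.

    Returns a list of (x_min, x_max, y_min, y_max) tuples, ordered west to
    east and, within each column, south to north.
    """
    dx = (x2 - x1) / nx
    dy = (y2 - y1) / ny
    boxes = []
    for i in range(nx):
        for j in range(ny):
            boxes.append((x1 + i * dx, x1 + (i + 1) * dx,
                          y1 + j * dy, y1 + (j + 1) * dy))
    return boxes


def box_label(box):
    """Format a sub-box as 'x1_x2_y1_y2' (assumed naming convention)."""
    return "_".join(str(int(v)) for v in box)


# Parameters from Problem 2.
sub_boxes = split_box(x1=-102, x2=-96, y1=36, y2=40, nx=3, ny=2)
labels = [box_label(b) for b in sub_boxes]
for label in labels:
    print(label)
```

With `nx=3` and `ny=2`, each sub-box spans 2° in both directions, which gives the six regions used in this problem.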
  
**Question 2a:** What method did you choose?  
**Question 2b:** What $R^2$ and $RMSE$ values do you get for the six regions?

2a. I chose KKNN.

2b. 
  
Region: -102_-100_36_38 rmse:0.043552 r2:0.235974

Region: -102_-100_38_40 rmse:0.006762 r2:0.734957

Region: -100_-98_36_38  rmse:0.045055 r2:0.301988

Region: -100_-98_38_40  rmse:0.010422 r2:0.244522

Region: -98_-96_36_38   rmse:0.023054 r2:0.493496

Region: -98_-96_38_40   rmse:0.013564 r2:0.235974
   

---
  
## Problem 3

Having tested one method on multiple regions, let's run the same test across multiple months. We can save time by using the same regions and telling SOMOSPIE not to cut out those regions from the data file a second time.

To accomplish this, the following things need to be changed in the SOMOSPIE Notebook:

**Before proceeding, you must restart the kernel and clear all output**

**Run cell by cell and select the desired options**
* In the Data Processing Decisions section:
  * Change the MONTHS entry from [4] to a list of three integers, each from 1 to 12 (e.g., [2, 5, 4]).
  * Uncheck the MAKE_T_E option to reuse the existing training and evaluation files from Problem 2.
  
**Question 3a:** What 3 months did you choose?    
Use pandas to manipulate the final DataFrame `accuracy` and answer the remaining questions:  
**Question 3b:** What is the average $R^2$ value for each month (averaged across the six boxes)?  
**Question 3c:** What is the average $RMSE$ value for each box (averaged across the three months)?  
**Question 3d:** Across all 18 predictions, what is the correlation between $R^2$ and $RMSE$? (HINT: Reference 1.)  
**Question 3e:** What does your answer to **3d** suggest about the relationship between $R^2$ and $RMSE$? (HINT: Reference 2.)  

3a. March, June and September.

3b. 

Month: March Average r2:0.387408

Month: June  Average r2:0.321556

Month: September   Average r2:0.257099





3c.

Region:-100_-98_36_38     Average rmse:0.040574

Region:-100_-98_38_40     Average rmse:0.012832

Region:-102_-100_36_38    Average rmse:0.025720

Region:-102_-100_38_40    Average rmse:0.008220

Region:-98_-96_36_38      Average rmse:0.025737

Region:-98_-96_38_40      Average rmse:0.015059


3d. The correlation coefficient between $R^2$ and $RMSE$ is -0.6159914612401693.

3e. Since the correlation coefficient lies between -1 and 0, $R^2$ and $RMSE$ are negatively correlated: when $R^2$ increases, $RMSE$ tends to decrease, and vice versa.
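
The pandas operations behind Questions 3b through 3d can be sketched as follows. The DataFrame here is a small toy stand-in: the column names (`region`, `month`, `r2`, `rmse`) and all values are assumptions made for illustration, so the real `accuracy` DataFrame produced by SOMOSPIE may need different column names.

```python
import pandas as pd

# Toy stand-in for the `accuracy` DataFrame. Column names and values are
# illustrative assumptions, not SOMOSPIE output.
accuracy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "month": [3, 6, 3, 6],
    "r2": [0.8, 0.6, 0.4, 0.2],
    "rmse": [0.01, 0.02, 0.03, 0.04],
})

# 3b: average R^2 per month (averaged across regions).
mean_r2_by_month = accuracy.groupby("month")["r2"].mean()

# 3c: average RMSE per region (averaged across months).
mean_rmse_by_region = accuracy.groupby("region")["rmse"].mean()

# 3d: Pearson correlation between R^2 and RMSE over all rows.
r2_rmse_corr = accuracy["r2"].corr(accuracy["rmse"])

print(mean_r2_by_month)
print(mean_rmse_by_region)
print(r2_rmse_corr)
```

`Series.corr` uses the Pearson method by default (Reference 1). In the toy table, `rmse` is an exact decreasing linear function of `r2`, so the correlation comes out at -1; real results will generally fall somewhere between -1 and 1.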

**References:**
- [1: pandas.Series.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html)
- [2: correlation_coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)

### Assignment Questions:
**Answer the following questions, in a couple of sentences each, in the cells provided below**
* List the key tasks you accomplished during this assignment.
* Describe the challenges you faced in addressing these tasks and how you overcame them.
* Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.

1. In this assignment, everything was already coded, and we only needed to run it while changing the default parameters. We learned to use different machine learning methods and saw how the results change from one method to another.
2. By following the instructions given in the assignment and watching the video prepared for Problem 1, solving the problems was straightforward.
3. No.