<a href="https://colab.research.google.com/github/nunocesarsa/SENSECO_School_2021/blob/main/ColabNotebooks/SENSECO_07_NEON_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notice

This part of the script has many "data engineering" steps that are convoluted to explain. 

The steps consist of:

- first loading the field LAI data and then generating a continuous daily time series using linear interpolation.

- extracting the values of this interpolated dataset on the dates of the remote sensing observations

- extracting the values from the raster datasets at the sample points and then comparing them with the field values



## GDrive

In [1]:
#mounting google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Installing packages

In [None]:
!pip install geopandas
#!pip install pyrsgis
!pip install earthpy

# Loading packages

In [3]:
#General purpose: 
import matplotlib.pyplot as plt
import glob

#the beautiful R-like data frame
import pandas as pd
import geopandas as gpd

#raster stuff
#from pyrsgis import raster
import earthpy.plot as ep
import rasterio
from rasterio.plot import show

from sklearn.metrics import mean_absolute_error

# Step 1: NEON data

- this data was collected from NEON as an example for this workshop
- it provides geolocated samples of Leaf Area Index which we will use to compare against our results

In [None]:
neon_data = pd.read_csv('/content/drive/MyDrive/SENSECO_S2Data/CLBJ_Final_csv12.csv',sep=";",decimal=".")

print(neon_data.shape)

## and then we select the plots with repeated observations
## (.copy() avoids pandas' SettingWithCopyWarning when new columns are added later)
neon_data = neon_data[neon_data['plotID'].isin(["CLBJ_001","CLBJ_002","CLBJ_003"])].copy()

print(neon_data.shape)

## Setting up NEON data

Some of the steps are just to simplify the use of the dataset later on

In [5]:
#creating a year and month column
neon_data["Year"]  = neon_data.startDate.str[0:4]
neon_data["Month"] = neon_data.startDate.str[5:7]

#the dataset records when each image was acquired, but we will ignore that here and just use the 15th of each month
neon_data["Day"] = str(15)

#creating a compact column
neon_data["YMD"] =  neon_data["Year"] +  "-" + neon_data["Month"] + "-" +  neon_data["Day"]

#creating a datetime type column
neon_data["datetime"] = pd.to_datetime(neon_data.YMD)

### Exploring

- maybe some of you have ideas on how to plot this nicely

In [None]:
neon_sel = neon_data[["datetime","Final_LAI","plotID"]]
neon_sel.plot(x='datetime',style='k.',figsize=(15,10))


## Interpolating missing data

Following this example: https://towardsdatascience.com/how-to-interpolate-time-series-data-in-apache-spark-and-python-pandas-part-1-pandas-cff54d76a2ea
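
To make the idea concrete before the real data, here is a minimal sketch on a made-up series. The dates and LAI values are invented for illustration only; the pattern (resample to daily steps, then interpolate) is the same one applied to the NEON plots in the next cell.

```python
import pandas as pd

# Hypothetical monthly LAI samples (values invented for illustration only)
obs = pd.DataFrame(
    {"Final_LAI": [1.0, 2.0, 1.5]},
    index=pd.to_datetime(["2019-01-15", "2019-02-15", "2019-03-15"]),
)

# Upsample to one row per day; days without a sample become NaN
daily = obs.resample("D").mean()

# Fill the gaps; the default method is linear between known neighbours
daily["Final_LAI"] = daily["Final_LAI"].interpolate()

# A few days around the end of January
print(daily.loc[pd.Timestamp("2019-01-30"):pd.Timestamp("2019-02-01")])
```

Because the resampled index is evenly spaced (one row per day), the default linear interpolation gives the same result as interpolating against time.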

In [None]:
#selecting only the columns I'm interested in
neon_sel = neon_data[["datetime","plotID","Final_LAIe","Final_LAI"]]

neon_sel.index = neon_sel['datetime']
del neon_sel['datetime']

#this could also be done on the original dataset, but separate tables per plot are clearer
neon_sel_001 = neon_sel[neon_sel["plotID"]=="CLBJ_001"]
neon_sel_002 = neon_sel[neon_sel["plotID"]=="CLBJ_002"]
neon_sel_003 = neon_sel[neon_sel["plotID"]=="CLBJ_003"]

#peeking at an incomplete series
neon_sel_001.plot(style='.',figsize=(15,5), title = "Original CLBJ 001")

#only the numeric LAI columns are resampled, since plotID (text) cannot be averaged
lai_cols = ["Final_LAIe","Final_LAI"]

#resample each plot to daily steps, then interpolate linearly between observations
neon_sel_001 = neon_sel_001[lai_cols].resample('D').mean()
neon_sel_001["Final_LAIe"] = neon_sel_001["Final_LAIe"].interpolate()
neon_sel_001["Final_LAI"]  = neon_sel_001["Final_LAI"].interpolate()
neon_sel_001.plot(style='.',figsize=(15,5), title = "Interpolated CLBJ 001")

neon_sel_002 = neon_sel_002[lai_cols].resample('D').mean()
neon_sel_002["Final_LAIe"] = neon_sel_002["Final_LAIe"].interpolate()
neon_sel_002["Final_LAI"]  = neon_sel_002["Final_LAI"].interpolate()

neon_sel_003 = neon_sel_003[lai_cols].resample('D').mean()
neon_sel_003["Final_LAIe"] = neon_sel_003["Final_LAIe"].interpolate()
neon_sel_003["Final_LAI"]  = neon_sel_003["Final_LAI"].interpolate()


#and now we can save the interpolated data
neon_sel_001.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_001_csv.csv")
neon_sel_001.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_001_csv_Nuno.csv",sep=";",decimal=".")

neon_sel_002.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_002_csv.csv")
neon_sel_002.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_002_csv_Nuno.csv",sep=";",decimal=".")

neon_sel_003.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_003_csv.csv")
neon_sel_003.to_csv("/content/drive/MyDrive/SENSECO/Outputs/Tables/CLBJ_003_csv_Nuno.csv",sep=";",decimal=".")

# Step 2: RS data

Here we will load the RS data and see how "close" our prediction was to the field data.

But first we must extract the raster values at the sample points, following this tutorial:

https://hatarilabs.com/ih-en/extract-point-value-from-a-raster-file-with-python-geopandas-and-rasterio-tutorial
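
Extracting a value at a point boils down to converting map coordinates into a row and column of the raster array. The sketch below shows that arithmetic in plain Python, assuming a north-up raster with square pixels and no rotation. The function name `xy_to_rowcol` and the grid corner and pixel size are hypothetical, chosen only for illustration; in the cells below rasterio does this conversion for us.

```python
import math

def xy_to_rowcol(x, y, x_min, y_max, pixel_size):
    """Row/col of the pixel containing (x, y) in a north-up raster."""
    # Columns grow eastwards from the left edge of the raster
    col = math.floor((x - x_min) / pixel_size)
    # Rows grow southwards from the top edge of the raster
    row = math.floor((y_max - y) / pixel_size)
    return row, col

# Hypothetical 10 m grid whose upper-left corner is at (600000, 3700000)
row, col = xy_to_rowcol(600025.0, 3699985.0,
                        x_min=600000.0, y_max=3700000.0, pixel_size=10.0)
print(row, col)
```

The point lies 25 m east and 15 m south of the upper-left corner, so it falls in the third column (index 2) of the second row (index 1).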

## Loading Sample points

In [None]:
#loading the shapefile and keeping only the columns we need
shp_NEON = gpd.read_file("/content/drive/MyDrive/SENSECO/Shapefiles/NEON_DHP_Centroids_UTM14N.shp")
shp_NEON = shp_NEON[['geometry','plotID']]

#loading a sample image with rasterio
sample_img = rasterio.open("/content/drive/MyDrive/SENSECO/Outputs/Rasters/S2A_MSIL2A_20190321_0030_LAI.tif")

#show point and raster on a matplotlib plot
fig, ax = plt.subplots(figsize=(12,12))
shp_NEON.plot(ax=ax, color='orangered')
show(sample_img, ax=ax)

#example of how to get the points
band = sample_img.read(1)  #read the band once instead of once per point
for point in shp_NEON['geometry']:
    x = point.xy[0][0]
    y = point.xy[1][0]
    row, col = sample_img.index(x,y)
    print("Point corresponds to row, col: %d, %d"%(row,col))
    print("Raster value at point: %.2f \n"%band[row,col])

## Extracting raster values

In [9]:
#create an object with the paths to files
path2files = glob.glob("/content/drive/MyDrive/SENSECO/Outputs/Rasters/*LAI.tif")

#an object to store an output path
outpath = "/content/drive/MyDrive/SENSECO/Outputs/Tables/"

import os  #used to strip the folder from each path

#a list to collect one record per (image, plot); it becomes a DataFrame after the loop
#(DataFrame.append was removed in pandas 2.0)
records = []

for temp_path in path2files:

  #file name without the folder, e.g. S2A_MSIL2A_20190321_0030_LAI.tif
  #the slices below assume this naming pattern; adapt them if your file names differ
  temp_name = os.path.basename(temp_path)

  #fetches date in yyyymmdd format
  temp_date = temp_name[11:19]

  #fetches model used
  temp_mdl  = temp_name[20:24]

  #loads the raster and reads the band once per image
  with rasterio.open(temp_path) as tmp_img:
    band = tmp_img.read(1)

    #adds the data
    for j in range(shp_NEON.shape[0]):

      #fetching the info
      point = shp_NEON['geometry'][j]
      x = point.xy[0][0]
      y = point.xy[1][0]
      row, col = tmp_img.index(x,y)

      #adding one record per point
      records.append({"plotID":shp_NEON['plotID'][j],
                      "Model":temp_mdl,
                      "ImageName":temp_name,
                      "YYYYMMDD":temp_date,
                      "LAI":band[row,col]})

out_df = pd.DataFrame(records, columns=["plotID","Model","ImageName","YYYYMMDD","LAI"])

out_df.to_csv(outpath +  "LAI_RS_csv.csv")
out_df.to_csv(outpath +  "LAI_RS_csvN.csv",decimal=".",sep=";")

## Setting up

In [10]:
rs_data = out_df

#creating a date time
rs_data["ymd"] = rs_data.YYYYMMDD.str[0:4] + "-" + rs_data.YYYYMMDD.str[4:6]  + "-" + rs_data.YYYYMMDD.str[6:] 
rs_data["datetime"] = pd.to_datetime(rs_data.ymd)

rs_data.index = rs_data['datetime']
del rs_data['datetime']

In [None]:
rs_data

In [None]:
#optional plotting
#rs_data.groupby(["plotID","Model"]).plot(y="LAI",style='.',figsize=(15,5), title = "RS data")
rs_data.groupby(["plotID"])["LAI"].plot(y="LAI",style='.',figsize=(15,5))

#again, I'm sure some of you will know how to improve these plots xP

# Step 3: Comparing both

## Set up

In [13]:
#creating list of dates of RS data
rs_date_list = rs_data.ymd.drop_duplicates().tolist()

#selecting from the list of values
pd_NEON001 = neon_sel_001[neon_sel_001.index.to_series().dt.date.astype(str).isin(rs_date_list)]
pd_NEON002 = neon_sel_002[neon_sel_002.index.to_series().dt.date.astype(str).isin(rs_date_list)]
pd_NEON003 = neon_sel_003[neon_sel_003.index.to_series().dt.date.astype(str).isin(rs_date_list)]  

#now we split the RS table by plot so it can be joined to the field values
pd_RS001 = rs_data[rs_data['plotID']=="CLBJ_001"]
pd_RS002 = rs_data[rs_data['plotID']=="CLBJ_002"]
pd_RS003 = rs_data[rs_data['plotID']=="CLBJ_003"]

#merging dataframes
pd_M001 = pd.merge(pd_NEON001,pd_RS001,how="left",left_index=True, right_index=True)
pd_M002 = pd.merge(pd_NEON002,pd_RS002,how="left",left_index=True, right_index=True)
pd_M003 = pd.merge(pd_NEON003,pd_RS003,how="left",left_index=True, right_index=True)
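
The date matching and merge above can be hard to picture, so here is a minimal sketch on invented data. All values are hypothetical; the column names mirror the ones used in this notebook. It assumes both tables are indexed by date, as in the cell above.

```python
import pandas as pd

# Hypothetical daily field series (values invented)
field = pd.DataFrame(
    {"Final_LAIe": [1.0, 1.1, 1.2, 1.3]},
    index=pd.date_range("2019-03-20", periods=4, freq="D"),
)

# Hypothetical satellite estimates: two models for one acquisition date
rs = pd.DataFrame(
    {"Model": ["0030", "0300"], "LAI": [1.4, 1.6]},
    index=pd.to_datetime(["2019-03-21", "2019-03-21"]),
)

# Keep only the field days that have a satellite acquisition
field_on_rs_dates = field[field.index.isin(rs.index)]

# Left merge on the index: the field row is repeated once per matching RS row
merged = pd.merge(field_on_rs_dates, rs, how="left",
                  left_index=True, right_index=True)
print(merged)
```

Each field value ends up paired with every model's estimate for the same date, which is the shape the plots and error measures below expect.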


## Plotting results

- Version 1:

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15, 5))

colors = {'rest':'tab:blue','0300':'tab:orange', '0030':'tab:green'}

ref_data = 'Final_LAIe'

#grouping by a plain string (not a list) keeps each key a string, which colors[key] needs
grouped = pd_M001.groupby("Model")
for key, group in grouped:
  group.plot(ax=ax1, kind='scatter', x=ref_data, y='LAI', label=key, color=colors[key],xlim=(0.5,2),ylim=(0.5,4),title="CLBJ_001")
#plt.show()

grouped = pd_M002.groupby("Model")
for key, group in grouped:
  group.plot(ax=ax2, kind='scatter', x=ref_data, y='LAI', label=key, color=colors[key],xlim=(0.5,2),ylim=(0.5,4),title="CLBJ_002")
#plt.show()

grouped = pd_M003.groupby("Model")
for key, group in grouped:
  group.plot(ax=ax3, kind='scatter', x=ref_data, y='LAI', label=key, color=colors[key],xlim=(0.5,2),ylim=(0.5,4),title="CLBJ_003")
plt.show()



- Version 2:

In [None]:
import seaborn as sns

#combines all data
comb_data = pd.concat([pd_M001,pd_M002,pd_M003])

#create the FacetGrid
g = sns.FacetGrid(comb_data, 
                  col="plotID", 
                  row='Model',
                  sharey=False,
                  hue='Model')
                  #hue='Model', 
                  #col_wrap=3, # here it means 2 columns depending on the position you want
                  #legend_out=True) 

g.map(sns.scatterplot, 'Final_LAIe', 'LAI').add_legend()

## Measuring error

- https://scikit-learn.org/stable/modules/model_evaluation.html
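
The metric used below is the mean absolute error, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ is the field value and $\hat{y}_i$ the satellite estimate. As a small worked example (the numbers are invented, not taken from the NEON or Sentinel-2 data), it can be computed by hand with NumPy:

```python
import numpy as np

# Hypothetical reference and predicted LAI values (invented for illustration)
y_true = np.array([1.0, 1.5, 2.0, 2.5])
y_pred = np.array([1.2, 1.4, 2.5, 2.0])

# Mean absolute error: the average size of the errors, ignoring their sign
mae = np.mean(np.abs(y_true - y_pred))

print(round(float(mae), 4))
```

The absolute errors are 0.2, 0.1, 0.5 and 0.5, so the MAE is 1.3 / 4 = 0.325, in the same units as LAI.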

In [None]:
#combining all
pd_MALL = pd.concat([pd_M001,pd_M002,pd_M003]) 

mdl_list    = pd_MALL.Model.drop_duplicates().tolist()
plotID_list = pd_MALL.plotID.drop_duplicates().tolist()  

ref_lai = "Final_LAIe"


for i in range(len(mdl_list)):

  
  for j in range(len(plotID_list)):

    pd_temp = pd_MALL[(pd_MALL['Model'] == mdl_list[i]) & (pd_MALL['plotID'] ==plotID_list[j])]

    #mean absolute error between the field reference and the RS estimate
    tmp_mae = mean_absolute_error(y_true=pd_temp[ref_lai],y_pred=pd_temp["LAI"])

    print("Model: " + mdl_list[i] + " plotID: " + plotID_list[j] + " Mean absolute error:  " + str(round(tmp_mae,4)) )

