# Create static features from monthly data
- Goal of the notebook
  - Create a DataFrame of static features from the monthly data. It is used for the baseline.
- Edited by Rumi Nakagawa
- Spring 2023 Capstone


# 0. Preparation

## Mount google drive
- Note that only the user's own Drive is accessible (files in shared folders are not)

In [320]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [321]:
%cd /content/drive/MyDrive/

/content/drive/MyDrive


## Import libraries

In [322]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [323]:
# !pip install dython

In [324]:
!pip install geopandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [325]:
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [326]:
# Import a Spark function from library
from pyspark.sql.functions import col

In [327]:
from pyspark.ml.classification import LogisticRegression
import pandas as pd
import numpy as np
import os

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from pyspark.sql.functions import desc

import geopandas as gpd
import folium

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import MinMaxScaler
from pyspark.mllib.evaluation import MulticlassMetrics, BinaryClassificationMetrics
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from pyspark.ml.classification import RandomForestClassifier
import time
# from dython import nominal

# Apply the default theme
sns.set_theme()


## (to be updated) Get access to blob storage
- Adapted from the W261 setup (see `mount_path` below)

In [328]:
# Put this at the top of any notebook that stores data in blob storage

# from pyspark.sql.functions import col, max

# blob_container = "team06" # The name of your container created in https://portal.azure.com
# storage_account = "apatel" # The name of your Storage account created in https://portal.azure.com
# secret_scope = "team06" # The name of the scope created in your local computer using the Databricks CLI
# secret_key = "team06" # The name of the secret key created in your local computer using the Databricks CLI 
# blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net"
# mount_path = "/mnt/mids-w261"

## Import csv
Sample CSV

In [329]:
# Copy of the original file; each user needs to store it in their own MyDrive
static_raw_df = pd.read_csv("static_features_month_df_raw.csv")

In [330]:
static_raw_df

Unnamed: 0,SITE_ID,SITE_IGBP,month,TA_F_avg,VPD_F_avg,P_F_avg,NETRAD_avg,NEE_VUT_REF_avg,NEE_VUT_REF_QC_avg,NEE_CUT_REF_avg,...,CO2_concentration_avg,dataset,MODIS_LC,MODIS_IGBP,MODIS_PFT,koppen_sub,koppen,hemisphere,LOCATION_LAT,LOCATION_LONG
0,AR-SLu,MF,1,27.8660,22.5575,1.3420,189.434640,-5.630970,0.957661,-5.609750,...,388.2825,FLUXNET,7,OSH,SH,BSk,Arid,S,-33.4648,-66.4598
1,AR-SLu,MF,2,25.6745,13.8210,3.1785,144.707204,-4.059005,0.970610,-4.047950,...,388.6475,FLUXNET,7,OSH,SH,BSk,Arid,S,-33.4648,-66.4598
2,AR-SLu,MF,3,24.2735,14.1460,0.6440,128.891734,-4.032335,0.908938,-4.034525,...,389.0650,FLUXNET,7,OSH,SH,BSk,Arid,S,-33.4648,-66.4598
3,AR-SLu,MF,4,18.4500,9.1850,0.1000,71.500693,-3.111590,0.962500,-3.107050,...,388.9050,FLUXNET,7,OSH,SH,BSk,Arid,S,-33.4648,-66.4598
4,AR-SLu,MF,5,13.4930,5.8230,1.8520,41.249149,-1.716330,0.895833,-1.559850,...,389.3200,FLUXNET,7,OSH,SH,BSk,Arid,S,-33.4648,-66.4598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2776,ZM-Mon,DBF,8,21.9250,19.9020,0.0060,100.482875,0.904340,0.916667,0.898666,...,383.9350,FLUXNET,10,GRA,GRA,Cwa,Temperate,S,-15.4391,23.2525
2777,ZM-Mon,DBF,9,26.4800,29.9270,0.0000,101.451880,1.812900,0.927778,1.822220,...,383.5950,FLUXNET,10,GRA,GRA,Cwa,Temperate,S,-15.4391,23.2525
2778,ZM-Mon,DBF,10,26.6045,23.7095,1.0920,129.757934,2.010180,0.910618,2.002460,...,382.6150,FLUXNET,10,GRA,GRA,Cwa,Temperate,S,-15.4391,23.2525
2779,ZM-Mon,DBF,11,23.4820,10.6405,1.7335,139.433737,0.466745,0.953472,0.496900,...,383.2475,FLUXNET,10,GRA,GRA,Cwa,Temperate,S,-15.4391,23.2525


# 1. Impute missing values

## Table of missing features

Missing values

| feature           | NaN count     | feature          | NaN count     |
|-------------------|---------------|------------------|---------------|
| P_F_avg           | 2             | b3_avg           | 91            |
| NETRAD_avg        | 178           | b4_avg           | 91            |
| ET_avg            | 6             | b5_avg           | 91            |
| CSIF-SIFdaily_avg | 24            | b6_avg           | 91            |
| CSIF-SIFinst_avg  | 24            | b7_avg           | 91            |
| PET_avg           | 12            | EVI_avg          | 118           |
| Ts_avg            | 12            | GCI_avg          | 101           |
| Tmean_avg         | 12            | NDVI_avg         | 102           |
| prcp_avg          | 12            | NDWI_avg         | 91            |
| vpd_avg           | 12            | NIRv_avg         | 102           |
| prcp-lag3_avg     | 12            | kNDVI_avg        | 91            |
| ESACCI-sm_avg     | 236           | Percent_Snow_avg | 28            |
| b1_avg            | 91            | Fpar_avg         | 136           |
| b2_avg            | 86            | Lai_avg          | 136           |

In [331]:
print(len(static_raw_df.columns))
static_raw_df.columns

60


Index(['SITE_ID', 'SITE_IGBP', 'month', 'TA_F_avg', 'VPD_F_avg', 'P_F_avg',
       'NETRAD_avg', 'NEE_VUT_REF_avg', 'NEE_VUT_REF_QC_avg',
       'NEE_CUT_REF_avg', 'NEE_CUT_REF_QC_avg', 'GPP_NT_VUT_REF_avg',
       'GPP_DT_VUT_REF_avg', 'GPP_NT_CUT_REF_avg', 'GPP_DT_CUT_REF_avg',
       'RECO_NT_VUT_REF_avg', 'RECO_DT_VUT_REF_avg', 'RECO_NT_CUT_REF_avg',
       'RECO_DT_CUT_REF_avg', 'ET_avg', 'BESS-PAR_avg', 'BESS-PARdiff_avg',
       'BESS-RSDN_avg', 'CSIF-SIFdaily_avg', 'CSIF-SIFinst_avg', 'PET_avg',
       'Ts_avg', 'Tmean_avg', 'prcp_avg', 'vpd_avg', 'prcp-lag3_avg',
       'ESACCI-sm_avg', 'b1_avg', 'b2_avg', 'b3_avg', 'b4_avg', 'b5_avg',
       'b6_avg', 'b7_avg', 'EVI_avg', 'GCI_avg', 'NDVI_avg', 'NDWI_avg',
       'NIRv_avg', 'kNDVI_avg', 'Percent_Snow_avg', 'Fpar_avg', 'Lai_avg',
       'LST_Day_avg', 'LST_Night_avg', 'CO2_concentration_avg', 'dataset',
       'MODIS_LC', 'MODIS_IGBP', 'MODIS_PFT', 'koppen_sub', 'koppen',
       'hemisphere', 'LOCATION_LAT', 'LOCATION_LONG'],

In [332]:
# Features used for this analysis

key = ['SITE_ID']

# other options
output_related_var = ['NEE_VUT_REF_avg', 'NEE_CUT_REF_avg', 'GPP_NT_VUT_REF_avg',
                      'GPP_DT_VUT_REF_avg', 'GPP_NT_CUT_REF_avg','GPP_DT_CUT_REF_avg',
                      'RECO_NT_VUT_REF_avg', 'RECO_DT_VUT_REF_avg', 
                      'RECO_NT_CUT_REF_avg', 'RECO_DT_CUT_REF_avg']

# predictor variables
pred_var_numeric = ['TA_F_avg', 'VPD_F_avg', 'P_F_avg', 'NETRAD_avg','ET_avg',
                    'BESS-PAR_avg', 'BESS-PARdiff_avg','BESS-RSDN_avg', 
                    'CSIF-SIFdaily_avg', 'CSIF-SIFinst_avg','PET_avg', 'Ts_avg', 'Tmean_avg',
                    'prcp_avg', 'vpd_avg', 'prcp-lag3_avg', 'ESACCI-sm_avg',
                    'b1_avg', 'b2_avg', 'b3_avg','b4_avg', 'b5_avg', 'b6_avg', 'b7_avg', 
                    'EVI_avg', 'GCI_avg', 'NDVI_avg', 'NDWI_avg', 'NIRv_avg', 'kNDVI_avg',
                    'Percent_Snow_avg', 'Fpar_avg', 'Lai_avg', 'LST_Day_avg', 'LST_Night_avg', 
                    'CO2_concentration_avg']

pred_var_categorical = ['SITE_IGBP', 'MODIS_LC', 'MODIS_IGBP','MODIS_PFT', 
                        'koppen_sub', 'koppen', 'hemisphere']

ordinal_var =  ['month', 'LOCATION_LAT', 'LOCATION_LONG']

qc_flags = ['NEE_VUT_REF_QC_avg', 'NEE_CUT_REF_QC_avg'] 

others = ['dataset']

NA_list = []

len(key + output_related_var + pred_var_numeric + pred_var_categorical 
    + ordinal_var + qc_flags + NA_list + others)

# No TIMESTAMP, year and time. hemisphere is added

60

In [333]:

# Check that every column belongs to one of the feature groups above, and vice versa
total = key + output_related_var + pred_var_numeric + pred_var_categorical + ordinal_var + qc_flags + NA_list + others

# Columns in the DataFrame that are not assigned to any group
for i in static_raw_df.columns:
  if i not in total:
    print(i)

# Grouped names that do not exist in the DataFrame
for i in total:
  if i not in static_raw_df.columns:
    print(i)
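
The two loops above print nothing when the groups and the columns agree, but they do not catch a name listed in two groups. A minimal sketch of a set-based check that reports all three cases is shown here; `check_column_groups` and the toy column names are hypothetical and only illustrate the idea.

```python
from collections import Counter


def check_column_groups(columns, groups):
    """Compare DataFrame columns against named feature groups.

    Returns three sorted lists: columns not assigned to any group,
    grouped names absent from the columns, and names assigned more than once.
    """
    assigned = [name for members in groups.values() for name in members]
    counts = Counter(assigned)
    unassigned = sorted(set(columns) - set(assigned))
    unknown = sorted(set(assigned) - set(columns))
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unassigned, unknown, duplicated


# Toy example (hypothetical subset, not the full 60-column schema)
columns = ["SITE_ID", "month", "TA_F_avg", "dataset"]
groups = {
    "key": ["SITE_ID"],
    "ordinal_var": ["month"],
    "pred_var_numeric": ["TA_F_avg", "VPD_F_avg"],
}
unassigned, unknown, duplicated = check_column_groups(columns, groups)
print(unassigned, unknown, duplicated)
```

With the notebook's lists, the call would take `static_raw_df.columns` and a dict built from `key`, `output_related_var`, and the other groups.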


# Preprocess features

In [334]:
# Original columns
print(len(static_raw_df.columns))
static_raw_df.columns

60


Index(['SITE_ID', 'SITE_IGBP', 'month', 'TA_F_avg', 'VPD_F_avg', 'P_F_avg',
       'NETRAD_avg', 'NEE_VUT_REF_avg', 'NEE_VUT_REF_QC_avg',
       'NEE_CUT_REF_avg', 'NEE_CUT_REF_QC_avg', 'GPP_NT_VUT_REF_avg',
       'GPP_DT_VUT_REF_avg', 'GPP_NT_CUT_REF_avg', 'GPP_DT_CUT_REF_avg',
       'RECO_NT_VUT_REF_avg', 'RECO_DT_VUT_REF_avg', 'RECO_NT_CUT_REF_avg',
       'RECO_DT_CUT_REF_avg', 'ET_avg', 'BESS-PAR_avg', 'BESS-PARdiff_avg',
       'BESS-RSDN_avg', 'CSIF-SIFdaily_avg', 'CSIF-SIFinst_avg', 'PET_avg',
       'Ts_avg', 'Tmean_avg', 'prcp_avg', 'vpd_avg', 'prcp-lag3_avg',
       'ESACCI-sm_avg', 'b1_avg', 'b2_avg', 'b3_avg', 'b4_avg', 'b5_avg',
       'b6_avg', 'b7_avg', 'EVI_avg', 'GCI_avg', 'NDVI_avg', 'NDWI_avg',
       'NIRv_avg', 'kNDVI_avg', 'Percent_Snow_avg', 'Fpar_avg', 'Lai_avg',
       'LST_Day_avg', 'LST_Night_avg', 'CO2_concentration_avg', 'dataset',
       'MODIS_LC', 'MODIS_IGBP', 'MODIS_PFT', 'koppen_sub', 'koppen',
       'hemisphere', 'LOCATION_LAT', 'LOCATION_LONG'],

## Check how much NA/None exists in each column

In [335]:
static_raw_df.isna().sum()

SITE_ID                    0
SITE_IGBP                  0
month                      0
TA_F_avg                   0
VPD_F_avg                  0
P_F_avg                    2
NETRAD_avg               178
NEE_VUT_REF_avg            0
NEE_VUT_REF_QC_avg         0
NEE_CUT_REF_avg            0
NEE_CUT_REF_QC_avg         0
GPP_NT_VUT_REF_avg         0
GPP_DT_VUT_REF_avg         0
GPP_NT_CUT_REF_avg         0
GPP_DT_CUT_REF_avg         0
RECO_NT_VUT_REF_avg        0
RECO_DT_VUT_REF_avg        0
RECO_NT_CUT_REF_avg        0
RECO_DT_CUT_REF_avg        0
ET_avg                     6
BESS-PAR_avg               0
BESS-PARdiff_avg           0
BESS-RSDN_avg              0
CSIF-SIFdaily_avg         24
CSIF-SIFinst_avg          24
PET_avg                   12
Ts_avg                    12
Tmean_avg                 12
prcp_avg                  12
vpd_avg                   12
prcp-lag3_avg             12
ESACCI-sm_avg            236
b1_avg                    91
b2_avg                    86
b3_avg        

In [336]:
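# Count NaNs per column (the counts land in a single column named 0)
static_raw_df_countNA = pd.DataFrame(static_raw_df.isna().sum())

# Columns without missing values, and columns with at least one missing value
NA0_list = list(static_raw_df_countNA[static_raw_df_countNA[0] == 0].index)
NA_list = list(static_raw_df_countNA[static_raw_df_countNA[0] != 0].index)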
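
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal pandas sketch, assuming a helper named `missing_summary` (hypothetical) and a toy frame with made-up values, that returns both the NaN count and the percentage for columns with missing data.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return NaN count and percentage for each column that has missing values."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one NaN, largest count first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame with hypothetical values
toy = pd.DataFrame({
    "SITE_ID": ["A", "A", "B", "B"],
    "P_F_avg": [1.0, np.nan, 2.0, 3.0],
    "NETRAD_avg": [np.nan, np.nan, 5.0, 6.0],
})
summary = missing_summary(toy)
print(summary)
```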

In [337]:
len(static_raw_df)

2781

In [338]:
static_raw_df_countNA[static_raw_df_countNA[0] != 0]

Unnamed: 0,0
P_F_avg,2
NETRAD_avg,178
ET_avg,6
CSIF-SIFdaily_avg,24
CSIF-SIFinst_avg,24
PET_avg,12
Ts_avg,12
Tmean_avg,12
prcp_avg,12
vpd_avg,12


In [339]:
static_raw_df_countNA[static_raw_df_countNA[0] != 0].index

Index(['P_F_avg', 'NETRAD_avg', 'ET_avg', 'CSIF-SIFdaily_avg',
       'CSIF-SIFinst_avg', 'PET_avg', 'Ts_avg', 'Tmean_avg', 'prcp_avg',
       'vpd_avg', 'prcp-lag3_avg', 'ESACCI-sm_avg', 'b1_avg', 'b2_avg',
       'b3_avg', 'b4_avg', 'b5_avg', 'b6_avg', 'b7_avg', 'EVI_avg', 'GCI_avg',
       'NDVI_avg', 'NDWI_avg', 'NIRv_avg', 'kNDVI_avg', 'Percent_Snow_avg',
       'Fpar_avg', 'Lai_avg'],
      dtype='object')

# Use parquet -> Spark df as preparation for global data

## DF to Parquet format
- The pandas DataFrame is written to Parquet so that the code below can be reused with the global data in AWS

In [340]:
type(static_raw_df)

pandas.core.frame.DataFrame

### Method 1

In [341]:
!pip install fastparquet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [342]:
static_raw_df.to_parquet('static_raw_df_pq', engine='fastparquet')

### Method 2

In [343]:
# Convert the DataFrame to Parquet with pyarrow
import pyarrow as pa
import pyarrow.parquet as pq

# First, convert the DataFrame to an Apache Arrow Table
table = pa.Table.from_pandas(static_raw_df)
# Second, write the table to a Parquet file.
# No compression argument is passed, so pyarrow's default codec (Snappy) is used;
# pass compression='BROTLI' to pq.write_table for Brotli compression.
pq.write_table(table, 'static_raw_df_pq.parquet')

## parquet to Spark df 

In [344]:
# Read the Parquet file written in Method 1
# When running Spark in AWS with access to Azure, update the location and file name
static_monthly_sdf = spark.read.parquet('static_raw_df_pq')

In [345]:
static_monthly_sdf.printSchema()

root
 |-- SITE_ID: string (nullable = true)
 |-- SITE_IGBP: string (nullable = true)
 |-- month: long (nullable = true)
 |-- TA_F_avg: double (nullable = true)
 |-- VPD_F_avg: double (nullable = true)
 |-- P_F_avg: double (nullable = true)
 |-- NETRAD_avg: double (nullable = true)
 |-- NEE_VUT_REF_avg: double (nullable = true)
 |-- NEE_VUT_REF_QC_avg: double (nullable = true)
 |-- NEE_CUT_REF_avg: double (nullable = true)
 |-- NEE_CUT_REF_QC_avg: double (nullable = true)
 |-- GPP_NT_VUT_REF_avg: double (nullable = true)
 |-- GPP_DT_VUT_REF_avg: double (nullable = true)
 |-- GPP_NT_CUT_REF_avg: double (nullable = true)
 |-- GPP_DT_CUT_REF_avg: double (nullable = true)
 |-- RECO_NT_VUT_REF_avg: double (nullable = true)
 |-- RECO_DT_VUT_REF_avg: double (nullable = true)
 |-- RECO_NT_CUT_REF_avg: double (nullable = true)
 |-- RECO_DT_CUT_REF_avg: double (nullable = true)
 |-- ET_avg: double (nullable = true)
 |-- BESS-PAR_avg: double (nullable = true)
 |-- BESS-PARdiff_avg: double (nullabl

In [346]:
static_monthly_sdf.show(truncate=False)

+-------+---------+-----+--------+---------+-------+------------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+---------+------------+----------------+-------------+-----------------+----------------+-------------+----------+----------+------------+----------+-------------+-------------+------------+-----------+------------+-----------+-----------+-----------+-----------+----------+----------+-----------+------------+-----------+-----------+----------------+--------+-------+-----------+-------------+---------------------+-------+--------+----------+---------+----------+---------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg  |NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP

In [347]:
display(static_monthly_sdf)

DataFrame[SITE_ID: string, SITE_IGBP: string, month: bigint, TA_F_avg: double, VPD_F_avg: double, P_F_avg: double, NETRAD_avg: double, NEE_VUT_REF_avg: double, NEE_VUT_REF_QC_avg: double, NEE_CUT_REF_avg: double, NEE_CUT_REF_QC_avg: double, GPP_NT_VUT_REF_avg: double, GPP_DT_VUT_REF_avg: double, GPP_NT_CUT_REF_avg: double, GPP_DT_CUT_REF_avg: double, RECO_NT_VUT_REF_avg: double, RECO_DT_VUT_REF_avg: double, RECO_NT_CUT_REF_avg: double, RECO_DT_CUT_REF_avg: double, ET_avg: double, BESS-PAR_avg: double, BESS-PARdiff_avg: double, BESS-RSDN_avg: double, CSIF-SIFdaily_avg: double, CSIF-SIFinst_avg: double, PET_avg: double, Ts_avg: double, Tmean_avg: double, prcp_avg: double, vpd_avg: double, prcp-lag3_avg: double, ESACCI-sm_avg: double, b1_avg: double, b2_avg: double, b3_avg: double, b4_avg: double, b5_avg: double, b6_avg: double, b7_avg: double, EVI_avg: double, GCI_avg: double, NDVI_avg: double, NDWI_avg: double, NIRv_avg: double, kNDVI_avg: double, Percent_Snow_avg: double, Fpar_avg: dou

In [348]:
print(type(static_monthly_sdf))

<class 'pyspark.sql.dataframe.DataFrame'>


## Rename columns to avoid hyphens

In [349]:
static_monthly_sdf = static_monthly_sdf.withColumnRenamed('BESS-PAR_avg','BESS_PAR_avg')\
              .withColumnRenamed('BESS-RSDN_avg','BESS_RSDN_avg')\
              .withColumnRenamed('BESS-PARdiff_avg','BESS_PARdiff_avg')\
              .withColumnRenamed('CSIF-SIFdaily_avg','CSIF_SIFdaily_avg')\
              .withColumnRenamed('CSIF-SIFinst_avg','CSIF_SIFinst_avg')\
              .withColumnRenamed('prcp-lag3_avg','prcp_lag3_avg')\
              .withColumnRenamed('ESACCI-sm_avg','ESACCI_sm_avg')
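
The chain above lists each hyphenated column by hand. As a sketch of an alternative, the mapping could be derived from the column names themselves; `hyphen_free_names` is a hypothetical helper, and the collision check is an added safeguard that the notebook does not perform.

```python
def hyphen_free_names(columns):
    """Map each column name containing a hyphen to an underscore version."""
    mapping = {c: c.replace("-", "_") for c in columns if "-" in c}
    # Guard against collisions with existing names or among the new names
    targets = list(mapping.values())
    existing = set(columns) - set(mapping)
    if len(set(targets)) != len(targets) or existing & set(targets):
        raise ValueError("renaming would create duplicate column names")
    return mapping


columns = ["SITE_ID", "BESS-PAR_avg", "prcp-lag3_avg", "ESACCI-sm_avg"]
mapping = hyphen_free_names(columns)
print(mapping)

# With a Spark DataFrame `sdf`, the mapping could then be applied as:
# for old, new in mapping.items():
#     sdf = sdf.withColumnRenamed(old, new)
```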

# Imputation with pyspark

In [350]:
# mean is required by the imputation cells below
from pyspark.sql.functions import col, mean

In [351]:
# Count the number of nulls in each feature
from pyspark.sql.functions import isnull, when, count, col

static_monthly_sdf.select([count(when(isnull(c), c)).alias(c) for c in static_monthly_sdf.columns]).show()

+-------+---------+-----+--------+---------+-------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------+------------+----------------+-------------+-----------------+----------------+-------+------+---------+--------+-------+-------------+-------------+------+------+------+------+------+------+------+-------+-------+--------+--------+--------+---------+----------------+--------+-------+-----------+-------------+---------------------+-------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP_DT_CUT_REF_avg|RECO_NT_VUT_REF_avg|RECO_DT_VUT_REF_avg|RECO_NT_CUT_REF_avg|RECO_DT

In [352]:

impute_features = ['P_F_avg', 'NETRAD_avg', 'ET_avg', 'CSIF_SIFdaily_avg', 'CSIF_SIFinst_avg',
       'PET_avg', 'Ts_avg', 'Tmean_avg', 'prcp_avg',
       'vpd_avg', 'prcp_lag3_avg', 'ESACCI_sm_avg', 'b1_avg', 'b2_avg',
       'b3_avg', 'b4_avg', 'b5_avg', 'b6_avg', 'b7_avg', 'EVI_avg', 'GCI_avg',
       'NDVI_avg', 'NDWI_avg', 'NIRv_avg', 'kNDVI_avg', 'Percent_Snow_avg',
       'Fpar_avg', 'Lai_avg']


# impute_features = ['P_F_avg', 'NETRAD_avg', 'ET_avg', 'CSIF_SIFdaily_avg', 'CSIF_SIFinst_avg',
#        'PET_avg', 'Ts_avg', 'Tmean_avg', 'prcp_avg',
#        'vpd_avg', 'prcp-lag3_avg', 'ESACCI-sm_avg', 'b1_avg', 'b2_avg',
#        'b3_avg', 'b4_avg', 'b5_avg', 'b6_avg', 'b7_avg', 'EVI_avg', 'GCI_avg',
#        'NDVI_avg', 'NDWI_avg', 'NIRv_avg', 'kNDVI_avg', 'Percent_Snow_avg',
#        'Fpar_avg', 'Lai_avg']

## `P_F_avg`

In [353]:
static_monthly_sdf.filter(static_monthly_sdf.P_F_avg.isNull()).show()

+-------+---------+-----+--------+---------+-------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-----------+------------+----------------+-------------+-----------------+----------------+-----------+---------+---------+-----------+-----------+-------------+-------------+---------+---------+---------+----------+----------+----------+-----------+---------+----------+----------+---------+-----------+-----------+----------------+--------+-------+-----------+-------------+---------------------+--------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP_DT_CUT_REF_avg|RECO_NT_

### Impute with average in `FI-Ken`

In [354]:
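# Fill missing P_F_avg with the FI-Ken site mean.
# fillna applies to every null in P_F_avg, so this assumes all of them belong to FI-Ken.
static_monthly_sdf = static_monthly_sdf.fillna(
    value = static_monthly_sdf
    .filter(static_monthly_sdf.SITE_ID == 'FI-Ken')
    .select(mean('P_F_avg')).head()[0],
    subset = ['P_F_avg'])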
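
Because `fillna` with `subset` fills every null in the column with the same value, the cell above is site-specific only as long as all missing rows come from one site. A minimal pandas sketch of per-site mean imputation with a global-mean fallback follows. It is a small-data analogue rather than the notebook's Spark code, and `impute_site_mean` plus the site IDs `XX-All` and `YY-Ok` are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_site_mean(df, column, site_col="SITE_ID"):
    """Fill NaNs in `column` with the mean of the same site, else the global mean."""
    site_mean = df.groupby(site_col)[column].transform("mean")
    filled = df[column].fillna(site_mean)
    # A site whose values are all missing has a NaN site mean; use the global mean there
    return filled.fillna(df[column].mean())


toy = pd.DataFrame({
    "SITE_ID": ["FI-Ken", "FI-Ken", "FI-Ken", "XX-All", "YY-Ok"],
    "P_F_avg": [1.0, np.nan, 3.0, np.nan, 8.0],
})
toy["P_F_avg"] = impute_site_mean(toy, "P_F_avg")
print(toy)
```

In the toy frame, the `FI-Ken` gap is filled from that site's own values, while `XX-All` has no observed value and falls back to the mean over all observed rows.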

In [355]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'FI-Ken').show()

+-------+---------+-----+------------------+-----------------+-------+----------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------------+----------------+----------------+----------------+------------------+-----------------+-------------------+----------------+----------------+------------------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+-----------------+----------------+-----------------+-----------------+-----------------+-----------------+----------------+-----------------+----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|

## `NETRAD_avg`

In [356]:
static_monthly_sdf.filter(static_monthly_sdf.NETRAD_avg.isNull()).show()

+-------+---------+-----+--------+---------+----------------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------+------------+----------------+-------------+-----------------+----------------+--------------+-----------+-----------+-------------+------------+-------------+-------------+-------------+------------+-------------+-------------+------------+-------------+-------------+------------+-----------+------------+-------------+-------------+------------+----------------+--------+-------+-----------+-------------+---------------------+---------+--------+----------+---------+----------+---------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|         P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_av

### Impute with average across sites

In [357]:
static_monthly_sdf.select(mean('NETRAD_avg')).head()[0]

83.82560482347512

In [358]:
# Fill missing NETRAD_avg with the mean across all sites
static_monthly_sdf = static_monthly_sdf.fillna(
    value = static_monthly_sdf
    .select(mean('NETRAD_avg')).head()[0],
    subset = ['NETRAD_avg'])
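
The same global-mean fill is repeated for several features in this notebook. As a sketch, the pattern could be applied to a list of columns in one step. The example uses pandas on toy data rather than Spark, and `impute_global_mean` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_global_mean(df, columns):
    """Fill NaNs in each listed column with that column's mean over all rows."""
    means = df[columns].mean()  # Series indexed by column name
    out = df.copy()
    out[columns] = out[columns].fillna(means)
    return out, means.to_dict()


toy = pd.DataFrame({
    "SITE_ID": ["A", "B", "C"],
    "NETRAD_avg": [80.0, np.nan, 100.0],
    "CSIF_SIFdaily_avg": [0.25, 0.75, np.nan],
})
filled, used_means = impute_global_mean(toy, ["NETRAD_avg", "CSIF_SIFdaily_avg"])
print(filled)
print(used_means)
```

Returning the means alongside the filled frame keeps a record of the values that were substituted. In Spark, `pyspark.ml.feature.Imputer` with `strategy='mean'` offers comparable column-wise filling; whether it suits this pipeline has not been checked here.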

In [359]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'BE-Maa').show() 

+-------+---------+-----+--------+---------+----------------+-----------------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------+------------+----------------+-------------+-----------------+----------------+--------------+-----------+-----------+-------------+------------+-------------+-------------+-------------+------------+-------------+-------------+------------+-------------+-------------+------------+-----------+------------+-------------+-------------+------------+----------------+--------+-------+-----------+-------------+---------------------+--------+--------+----------+---------+----------+---------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|         P_F_avg|       NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_

In [360]:
static_monthly_sdf.filter(static_monthly_sdf.NETRAD_avg.isNull()).show()

+-------+---------+-----+--------+---------+-------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------+------------+----------------+-------------+-----------------+----------------+-------+------+---------+--------+-------+-------------+-------------+------+------+------+------+------+------+------+-------+-------+--------+--------+--------+---------+----------------+--------+-------+-----------+-------------+---------------------+-------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP_DT_CUT_REF_avg|RECO_NT_VUT_REF_avg|RECO_DT_VUT_REF_avg|RECO_NT_CUT_REF_avg|RECO_DT

## `ET_avg`

In [361]:
static_monthly_sdf.filter(static_monthly_sdf.ET_avg.isNull()).show()

+-------+---------+-----+-----------------+----------------+----------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------+----------------+----------------+----------------+------------------+------------------+-------------------+----------------+----------------+------------------+------------------+------------------+-------------+-----------------+-----------------+-----------------+-----------------+-----------------+------------------+------------------+----------+--------+----------+-----------------+----------+------------------+----------------+-----------------+-----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|         TA_F_avg

### Impute with average in `CH-Aws`

In [362]:
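# Fill missing ET_avg with the CH-Aws site mean.
# fillna applies to every null in ET_avg, so this assumes all of them belong to CH-Aws.
static_monthly_sdf = static_monthly_sdf.fillna(
    value = static_monthly_sdf
    .filter(static_monthly_sdf.SITE_ID == 'CH-Aws')
    .select(mean('ET_avg')).head()[0],
    subset = ['ET_avg'])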

In [363]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CH-Aws').show()

+-------+---------+-----+-----------------+----------------+----------------+-----------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+-----------------+----------------+----------------+----------------+-------------------+-------------------+-------------------+----------------+----------------+------------------+------------------+------------------+-----------------+------------------+-----------------+------------------+------------------+-----------------+------------------+------------------+-----------------+----------------+-----------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+------+----------+------------+----

## `CSIF_SIFdaily_avg`

In [364]:
static_monthly_sdf.filter(static_monthly_sdf.CSIF_SIFdaily_avg.isNull()).show()

+-------+---------+-----+----------------+----------------+----------------+-----------------+-------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------------+----------------+----------------+----------------+-----------------+----------------+-------------------+----------------+----------------+------------------+-----------------+------------------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+---------+----------+------------+--

### Impute with average across sites

In [365]:
static_monthly_sdf.select(mean('CSIF_SIFdaily_avg')).head()[0]

0.14615525132880283

In [366]:
# Fill missing CSIF_SIFdaily_avg with the mean across all sites
static_monthly_sdf = static_monthly_sdf.fillna(
    value = static_monthly_sdf
    .select(mean('CSIF_SIFdaily_avg')).head()[0],
    subset = ['CSIF_SIFdaily_avg'])

In [367]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'DE-Hte').show() 

+-------+---------+-----+----------------+----------------+----------------+-----------------+-------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------------+----------------+----------------+----------------+-------------------+----------------+-------------------+----------------+----------------+------------------+-----------------+------------------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+---------+----------+------------+

In [368]:
static_monthly_sdf.filter(static_monthly_sdf.CSIF_SIFdaily_avg.isNull()).show()

+-------+---------+-----+--------+---------+-------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------+------------+----------------+-------------+-----------------+----------------+-------+------+---------+--------+-------+-------------+-------------+------+------+------+------+------+------+------+-------+-------+--------+--------+--------+---------+----------------+--------+-------+-----------+-------------+---------------------+-------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP_DT_CUT_REF_avg|RECO_NT_VUT_REF_avg|RECO_DT_VUT_REF_avg|RECO_NT_CUT_REF_avg|RECO_DT

## `CSIF_SIFinst_avg`

In [369]:
static_monthly_sdf.filter(static_monthly_sdf.CSIF_SIFinst_avg.isNull()).show()

+-------+---------+-----+----------------+----------------+----------------+-----------------+-------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------------------+----------------+----------------+----------------+-------------------+----------------+-------------------+----------------+----------------+------------------+-----------------+------------------+-----------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+-----------------+-----------------+----------------+----------------+---------------------+--------+--------+----------+---------+----------+---------+----------+------------+

### Impute with average across sites

In [370]:
static_monthly_sdf.select(mean('CSIF_SIFinst_avg')).head()[0]

0.4089560298315824

In [371]:
# Fill missing CSIF_SIFinst_avg values with the mean across all sites.
csif_mean = static_monthly_sdf.select(mean('CSIF_SIFinst_avg')).head()[0]
static_monthly_sdf = static_monthly_sdf.fillna(
    value = csif_mean,
    subset = ['CSIF_SIFinst_avg'])
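
As a plain-Python illustration of what the `fillna` call above does, the sketch below computes a column's mean over its non-missing values and writes that constant into every gap. The toy records and the helper names `column_means` and `fill_with_means` are made up for this example and are not part of the notebook; it assumes `None` marks a missing value.

```python
from statistics import fmean

def column_means(rows, columns):
    """Mean of each column over its non-missing (non-None) values only."""
    means = {}
    for c in columns:
        observed = [r[c] for r in rows if r[c] is not None]
        if observed:  # skip columns that have no observed value at all
            means[c] = fmean(observed)
    return means

def fill_with_means(rows, means):
    """Return new rows where missing entries are replaced by the column mean."""
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in r.items()}
        for r in rows
    ]

# Toy records (hypothetical values, not taken from the dataset)
rows = [
    {"SITE_ID": "A", "month": 1, "x": 0.2},
    {"SITE_ID": "A", "month": 2, "x": None},
    {"SITE_ID": "B", "month": 1, "x": 0.6},
]
means = column_means(rows, ["x"])
filled = fill_with_means(rows, means)
```

Every gap in a column receives the same value regardless of site or month, which is simple enough for a baseline but flattens seasonal and between-site differences. In PySpark, `DataFrame.fillna` also accepts a dict that maps column names to fill values, so several columns can be filled in a single call.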

In [372]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'FI-Qvd').show() 

[Output truncated: wide table]

In [373]:
static_monthly_sdf.filter(static_monthly_sdf.CSIF_SIFinst_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `PET_avg`, `Ts_avg`, `Tmean_avg`, `prcp_avg`, `vpd_avg`, `prcp_lag3_avg`


In [374]:
static_monthly_sdf.filter(static_monthly_sdf.PET_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

### Insights for IT-Noe
`IT-Noe` might not be an appropriate site to use for training, since `PET_avg`, `Ts_avg`, `Tmean_avg`, `prcp_avg`, `vpd_avg`, and `prcp_lag3_avg` are all missing for it.
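
One way to make this kind of decision systematic is to list every site for which a whole group of features has no observed value in any month. The sketch below does this in plain Python on toy records; the helper name `sites_missing_all`, the site labels `S1`/`S2`, and the values are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict

def sites_missing_all(rows, features, site_key="SITE_ID"):
    """Return sites for which every listed feature is None in every row."""
    has_value = defaultdict(bool)  # site -> True once any feature is observed
    for r in rows:
        site = r[site_key]
        observed_here = any(r[f] is not None for f in features)
        has_value[site] = has_value[site] or observed_here
    return sorted(s for s, ok in has_value.items() if not ok)

# Toy records (hypothetical sites and values)
rows = [
    {"SITE_ID": "S1", "month": 1, "PET_avg": -0.01, "Ts_avg": 280.0},
    {"SITE_ID": "S1", "month": 2, "PET_avg": None, "Ts_avg": 281.5},
    {"SITE_ID": "S2", "month": 1, "PET_avg": None, "Ts_avg": None},
    {"SITE_ID": "S2", "month": 2, "PET_avg": None, "Ts_avg": None},
]
candidates_to_drop = sites_missing_all(rows, ["PET_avg", "Ts_avg"])
```

Sites returned by such a check carry no site-specific information for those features, so after mean imputation their values are identical to the cross-site average.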

### Impute each feature with average across sites

In [375]:
# Print the mean across all sites for each of the six features.
for c in ['PET_avg', 'Ts_avg', 'Tmean_avg', 'prcp_avg', 'vpd_avg', 'prcp_lag3_avg']:
    print(f"average {c}")
    print(static_monthly_sdf.select(mean(c)).head()[0])

average PET_avg
-0.008036543982703521
average Ts_avg
284.1564649144834
average Tmean_avg
284.0645280718195
average prcp_avg
0.0024562933771170438
average vpd_avg
0.5754164067491089
average prcp_lag3_avg
0.007351570059445253


In [376]:
# Fill each feature's missing values with that feature's mean across all sites.
for c in ['PET_avg', 'Ts_avg', 'Tmean_avg', 'prcp_avg', 'vpd_avg', 'prcp_lag3_avg']:
    static_monthly_sdf = static_monthly_sdf.fillna(
        value = static_monthly_sdf.select(mean(c)).head()[0],
        subset = [c])

In [377]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'IT-Noe').show() 

[Output truncated: wide table]

## `ESACCI_sm_avg`
- 236 missing values

In [378]:
static_monthly_sdf.filter(static_monthly_sdf.ESACCI_sm_avg.isNull()).show()

[Output truncated: wide table]

In [379]:
static_monthly_sdf.filter(static_monthly_sdf.ESACCI_sm_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    4|
| CA-TPD|    1|
| US-GLE|    2|
| SE-Ros|    2|
| IT-Tor|    1|
| US-Ivo|    6|
| SE-Deg|    2|
| CA-Oas|    2|
| FI-Var|    2|
| CN-HaM|    4|
| SJ-Adv|    1|
| CA-Qc2|    1|
| US-UMB|    4|
| FI-Qvd|    1|
| US-UMd|    4|
| CH-Aws|    6|
| CA-TP3|   12|
| FI-Lom|    5|
| IT-Noe|   12|
+-------+-----+
only showing top 20 rows
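
Because the listing above is unordered and cut off at 20 rows, the sites with the most gaps may not be visible. A minimal plain-Python sketch of counting missing rows per site and sorting the worst sites first is given below; the helper name `missing_by_site`, the site labels, and the values are hypothetical.

```python
from collections import Counter

def missing_by_site(rows, column, site_key="SITE_ID"):
    """Count rows per site where `column` is missing, largest count first."""
    counts = Counter(r[site_key] for r in rows if r[column] is None)
    return counts.most_common()

# Toy records (hypothetical sites and values)
rows = [
    {"SITE_ID": "S1", "month": 1, "ESACCI_sm_avg": None},
    {"SITE_ID": "S1", "month": 2, "ESACCI_sm_avg": None},
    {"SITE_ID": "S2", "month": 1, "ESACCI_sm_avg": None},
    {"SITE_ID": "S2", "month": 2, "ESACCI_sm_avg": 0.25},
    {"SITE_ID": "S3", "month": 1, "ESACCI_sm_avg": 0.31},
]
ranking = missing_by_site(rows, "ESACCI_sm_avg")
```

In PySpark, a similar ordering can be obtained by chaining `.orderBy(desc("count"))` before `.show()` on the grouped result.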



### Impute with average across sites

In [380]:
static_monthly_sdf.select(mean('ESACCI_sm_avg')).head()[0]

0.24943720478288073

In [381]:
static_monthly_sdf = static_monthly_sdf.fillna(
    value = static_monthly_sdf
    .select(mean('ESACCI_sm_avg')).head()[0],
    subset = ['ESACCI_sm_avg'])

In [382]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CA-TP3').show() 

[Output truncated: wide table]

### WIP - Fill `CA-TP3` and `IT-Noe` with the global average

In [383]:
print("average ESACCI_sm_avg")
print(static_monthly_sdf.select(mean('ESACCI_sm_avg')).head()[0])

global_avg_ESACCI_sm_avg =  static_monthly_sdf.select(mean('ESACCI_sm_avg')).head()[0]

average ESACCI_sm_avg
0.24943720478288203


In [384]:
# static_monthly_sdf = static_monthly_sdf.fillna(
#     value = static_monthly_sdf
#     .select(mean('ESACCI_sm_avg')).head()[0],
#     subset = ['ESACCI_sm_avg'])


# `CA-TP3` and `IT-Noe`: replace nulls in ESACCI_sm_avg with the global average
# from pyspark.sql.functions import coalesce, col, lit, when

# static_monthly_sdf_2 = static_monthly_sdf.withColumn(
#     'ESACCI_sm_avg',
#     when(col('SITE_ID').isin('CA-TP3', 'IT-Noe'),
#          coalesce(col('ESACCI_sm_avg'), lit(global_avg_ESACCI_sm_avg)))
#     .otherwise(col('ESACCI_sm_avg')))

In [385]:
# from pyspark.sql.functions import *

# static_monthly_sdf.withColumn('ESACCI_sm_avg',coalesce(col('ESACCI_sm_avg'),lit(global_avg_ESACCI_sm_avg))).show()

In [386]:
# from pyspark.sql import functions as F
# from pyspark.sql.window import Window

# static_monthly_sdf.withColumn(
#     'ESACCI_sm_avg',
#     F.coalesce(
#         F.col('ESACCI_sm_avg'),
#         F.avg('ESACCI_sm_avg').over(Window.partitionBy('SITE_ID')),
#     ),
# ).show()

### WIP - Fill the remaining sites with their site average
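
A minimal plain-Python sketch of this two-level plan follows: a missing value is replaced by the mean of the same site's observed months, and sites with no observed value at all fall back to the global mean. The helper name `fill_site_then_global`, the site labels, and the values are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import fmean

def fill_site_then_global(rows, column, site_key="SITE_ID"):
    """Fill None in `column` with the site mean, else with the global mean."""
    observed = defaultdict(list)  # site -> observed values of `column`
    for r in rows:
        if r[column] is not None:
            observed[r[site_key]].append(r[column])

    site_mean = {s: fmean(vals) for s, vals in observed.items()}
    # Global mean over all observed values (needs at least one observation)
    global_mean = fmean(v for vals in observed.values() for v in vals)

    filled = []
    for r in rows:
        new = dict(r)
        if new[column] is None:
            # Sites without any observation are absent from site_mean
            new[column] = site_mean.get(new[site_key], global_mean)
        filled.append(new)
    return filled

# Toy records (hypothetical sites and values)
rows = [
    {"SITE_ID": "S1", "month": 1, "ESACCI_sm_avg": 0.2},
    {"SITE_ID": "S1", "month": 2, "ESACCI_sm_avg": None},
    {"SITE_ID": "S1", "month": 3, "ESACCI_sm_avg": 0.4},
    {"SITE_ID": "S2", "month": 1, "ESACCI_sm_avg": None},
    {"SITE_ID": "S3", "month": 1, "ESACCI_sm_avg": 0.6},
]
filled = fill_site_then_global(rows, "ESACCI_sm_avg")
```

In PySpark the same idea could be expressed roughly as `F.coalesce(F.col(c), F.avg(c).over(Window.partitionBy("SITE_ID")), F.lit(global_mean))`. Such a step would have to run before the global fill in `In [381]`, because that cell already replaces every null in `ESACCI_sm_avg`.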

## `b1_avg` - `b7_avg`

In [387]:
static_monthly_sdf.filter(static_monthly_sdf.b1_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

In [388]:
static_monthly_sdf.filter(static_monthly_sdf.b1_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| US-Ivo|    3|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    2|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    3|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
| US-Atq|    3|
+-------+-----+
only showing top 20 rows



In [389]:
static_monthly_sdf.filter(static_monthly_sdf.b2_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    3|
| SE-Ros|    2|
| PA-SPn|    5|
| US-Ivo|    3|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    2|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    3|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
| US-Atq|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [390]:
# Fill missing values in b1_avg through b7_avg with each band's mean across all sites.
for c in [f'b{i}_avg' for i in range(1, 8)]:
    static_monthly_sdf = static_monthly_sdf.fillna(
        value = static_monthly_sdf.select(mean(c)).head()[0],
        subset = [c])

In [391]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `EVI_avg`


In [392]:
static_monthly_sdf.filter(static_monthly_sdf.EVI_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

In [393]:
static_monthly_sdf.filter(static_monthly_sdf.EVI_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| IT-Tor|    1|
| US-Ivo|    6|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    4|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    5|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [394]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('EVI_avg')).head()[0],subset = ['EVI_avg'])

In [395]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `GCI_avg`

In [396]:
static_monthly_sdf.filter(static_monthly_sdf.GCI_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

In [397]:
static_monthly_sdf.filter(static_monthly_sdf.GCI_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| IT-Tor|    1|
| US-Ivo|    6|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    4|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    5|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [398]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('GCI_avg')).head()[0],subset = ['GCI_avg'])

In [399]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `NDVI_avg`

In [400]:
static_monthly_sdf.filter(static_monthly_sdf.NDVI_avg.isNull()).show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

In [401]:
static_monthly_sdf.filter(static_monthly_sdf.NDVI_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| IT-Tor|    1|
| US-Ivo|    6|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    4|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    5|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [402]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('NDVI_avg')).head()[0],subset = ['NDVI_avg'])

In [403]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `NDWI_avg`

In [404]:
static_monthly_sdf.filter(static_monthly_sdf.NDWI_avg.isNull()).show()

[Output truncated: wide table]

### Impute with average across sites

In [405]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('NDWI_avg')).head()[0],subset = ['NDWI_avg'])

In [406]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table with columns SITE_ID, SITE_IGBP, month, ...]

## `NIRv_avg`

In [407]:
static_monthly_sdf.filter(static_monthly_sdf.NIRv_avg.isNull()).show()

[Output truncated: wide table]

In [408]:
static_monthly_sdf.filter(static_monthly_sdf.NIRv_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| IT-Tor|    1|
| US-Ivo|    6|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    4|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    5|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [409]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('NIRv_avg')).head()[0],subset = ['NIRv_avg'])

In [410]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table]

## `kNDVI_avg`

In [411]:
static_monthly_sdf.filter(static_monthly_sdf.kNDVI_avg.isNull()).show()

[Output truncated: wide table]

In [412]:
static_monthly_sdf.filter(static_monthly_sdf.kNDVI_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| CN-Cha|    1|
| SE-Svb|    1|
| BR-Sa3|    6|
| SE-Ros|    2|
| PA-SPn|    5|
| US-Ivo|    3|
| CG-Tch|   11|
| SE-Deg|    1|
| SE-Lnn|    1|
| FI-Var|    3|
| GH-Ank|    8|
| SJ-Adv|    2|
| US-UMB|    1|
| FI-Qvd|    1|
| US-UMd|    1|
| CH-Aws|    3|
| CN-Din|    1|
| US-Syv|    2|
| FI-Lom|    3|
| US-Atq|    3|
+-------+-----+
only showing top 20 rows



### Impute with average across sites

In [413]:
static_monthly_sdf = static_monthly_sdf.fillna(value = static_monthly_sdf
    .select(mean('kNDVI_avg')).head()[0],subset = ['kNDVI_avg'])

In [414]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'CG-Tch').show()

[Output truncated: wide table]

## `Percent_Snow_avg`

In [415]:
static_monthly_sdf.filter(static_monthly_sdf.Percent_Snow_avg.isNull()).show()


In [416]:
static_monthly_sdf.filter(static_monthly_sdf.Percent_Snow_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| SE-Svb|    1|
| SE-Ros|    1|
| US-Ivo|    3|
| SE-Deg|    1|
| FI-Var|    2|
| SJ-Adv|    1|
| FI-Qvd|    1|
| FI-Lom|    2|
| US-Atq|    3|
| FI-Sod|    2|
| RU-Che|    1|
| FI-Ken|    2|
| US-Prr|    2|
| FI-Hyy|    1|
| FI-Let|    1|
| US-Uaf|    1|
| FI-Sii|    1|
| SE-Nor|    1|
| FI-Jok|    1|
+-------+-----+



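### Impute with the mean across all sites and months

In [417]:
# Fill missing Percent_Snow_avg values with the column mean over all rows
snow_mean = static_monthly_sdf.select(mean('Percent_Snow_avg')).head()[0]
static_monthly_sdf = static_monthly_sdf.fillna(value=snow_mean, subset=['Percent_Snow_avg'])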

In [418]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'SE-Svb').show()


## `Fpar_avg`

In [419]:
static_monthly_sdf.filter(static_monthly_sdf.Fpar_avg.isNull()).show()


In [420]:
static_monthly_sdf.filter(static_monthly_sdf.Fpar_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| SE-Svb|    3|
| DE-Lnf|    1|
| DE-RuW|    1|
| CA-Ca2|    1|
| SE-Ros|    3|
| DE-Tha|    1|
| US-Ivo|    7|
| CZ-RAJ|    1|
| RU-Fy2|    2|
| SE-Deg|    3|
| CA-Oas|    1|
| SE-Lnn|    3|
| BE-Maa|    1|
| FI-Var|    5|
| SJ-Adv|    3|
| CA-Qc2|    1|
| NL-Hor|    1|
| DE-HoH|    1|
| FI-Qvd|    3|
| BE-Dor|    1|
+-------+-----+
only showing top 20 rows



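### Impute with the mean across all sites and months

In [421]:
# Fill missing Fpar_avg values with the column mean over all rows
fpar_mean = static_monthly_sdf.select(mean('Fpar_avg')).head()[0]
static_monthly_sdf = static_monthly_sdf.fillna(value=fpar_mean, subset=['Fpar_avg'])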

In [422]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'SE-Svb').show()


## `Lai_avg`

In [423]:
static_monthly_sdf.filter(static_monthly_sdf.Lai_avg.isNull()).show()


In [424]:
static_monthly_sdf.filter(static_monthly_sdf.Lai_avg.isNull()).groupBy("SITE_ID").count().show()

+-------+-----+
|SITE_ID|count|
+-------+-----+
| SE-Svb|    3|
| DE-Lnf|    1|
| DE-RuW|    1|
| CA-Ca2|    1|
| SE-Ros|    3|
| DE-Tha|    1|
| US-Ivo|    7|
| CZ-RAJ|    1|
| RU-Fy2|    2|
| SE-Deg|    3|
| CA-Oas|    1|
| SE-Lnn|    3|
| BE-Maa|    1|
| FI-Var|    5|
| SJ-Adv|    3|
| CA-Qc2|    1|
| NL-Hor|    1|
| DE-HoH|    1|
| FI-Qvd|    3|
| BE-Dor|    1|
+-------+-----+
only showing top 20 rows



### Impute with the mean across all sites and months

In [425]:
# Fill missing Lai_avg values with the column mean over all rows
lai_mean = static_monthly_sdf.select(mean('Lai_avg')).head()[0]
static_monthly_sdf = static_monthly_sdf.fillna(value=lai_mean, subset=['Lai_avg'])

In [426]:
static_monthly_sdf.filter(static_monthly_sdf.SITE_ID == 'SE-Svb').show()


# Export PySpark DataFrame to CSV (for baseline)

### Check that there are no missing values in the PySpark DataFrame

In [427]:
# Count the nulls in each column (isnull flags null values only, not NaN)
from pyspark.sql.functions import isnull, when, count, col

static_monthly_sdf.select([count(when(isnull(c), c)).alias(c) for c in static_monthly_sdf.columns]).show()
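
With roughly sixty columns, the one-row output of this check is hard to read by eye. A minimal sketch of turning it into a pass/fail check follows; the helper `columns_with_nulls` and the example numbers are hypothetical. In the notebook, the input dict would come from calling `.first().asDict()` on the one-row result above.

```python
def columns_with_nulls(null_counts: dict) -> dict:
    """Return only the columns whose null count is greater than zero."""
    return {name: n for name, n in null_counts.items() if n > 0}


# In the notebook, the dict would be built from the Spark result, e.g.:
#   null_counts = static_monthly_sdf.select(
#       [count(when(isnull(c), c)).alias(c) for c in static_monthly_sdf.columns]
#   ).first().asDict()
example_counts = {"SITE_ID": 0, "kNDVI_avg": 0, "Lai_avg": 2}  # made-up numbers
remaining = columns_with_nulls(example_counts)
print(remaining)  # {'Lai_avg': 2}
```

An `assert not remaining` placed before the export step would then stop the notebook if any column still contains nulls.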

+-------+---------+-----+--------+---------+-------+----------+---------------+------------------+---------------+------------------+------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-------------------+------+------------+----------------+-------------+-----------------+----------------+-------+------+---------+--------+-------+-------------+-------------+------+------+------+------+------+------+------+-------+-------+--------+--------+--------+---------+----------------+--------+-------+-----------+-------------+---------------------+-------+--------+----------+---------+----------+------+----------+------------+-------------+
|SITE_ID|SITE_IGBP|month|TA_F_avg|VPD_F_avg|P_F_avg|NETRAD_avg|NEE_VUT_REF_avg|NEE_VUT_REF_QC_avg|NEE_CUT_REF_avg|NEE_CUT_REF_QC_avg|GPP_NT_VUT_REF_avg|GPP_DT_VUT_REF_avg|GPP_NT_CUT_REF_avg|GPP_DT_CUT_REF_avg|RECO_NT_VUT_REF_avg|RECO_DT_VUT_REF_avg|RECO_NT_CUT_REF_avg|RECO_DT

In [428]:
!pwd

/content/drive/MyDrive


In [432]:
# Write the DataFrame to CSV (Spark creates a directory of part files at this path)
static_monthly_sdf.write.option("header",True).csv("./static_monthly_features_v1")
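
Because Spark writes a directory containing one `part-*.csv` file per partition rather than a single file, downstream baseline code needs to stitch the parts together. The sketch below shows one way to do that with pandas; the helper name `read_spark_csv_dir` is hypothetical, and it assumes every part file carries its own header row, as produced by `option("header", True)`.

```python
import glob
import os

import pandas as pd


def read_spark_csv_dir(path: str) -> pd.DataFrame:
    """Concatenate every part-*.csv file that Spark wrote under `path`."""
    # Spark also writes marker files such as _SUCCESS; the glob pattern skips them.
    part_files = sorted(glob.glob(os.path.join(path, "part-*.csv")))
    if not part_files:
        raise FileNotFoundError(f"no part-*.csv files found in {path}")
    return pd.concat((pd.read_csv(f) for f in part_files), ignore_index=True)


# Possible usage in this notebook:
#   baseline_df = read_spark_csv_dir("./static_monthly_features_v1")
```

If the cell above is rerun, the write fails because the target directory already exists. Adding `.mode("overwrite")` to the writer avoids this, and `static_monthly_sdf.coalesce(1)` before `.write` yields a single part file at the cost of writing from one task.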

# APPENDIX

| feature_name      |                       source                       | definition                                                                                                                                                                              | var_type    |   |
|-------------------|:--------------------------------------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|---|
| TA_F              | FLUXNET                                            | Air temperature, consolidated from TA_F_MDS and TA_ERA                                                                                                                                  | numeric     |   |
| VPD_F             | FLUXNET                                            | Vapor Pressure Deficit consolidated from VPD_F_MDS and VPD_ERA                                                                                                                          | numeric     |   |
| P_F               | FLUXNET                                            | Precipitation consolidated from P and P_ERA                                                                                                                                             | numeric     |   |
| NETRAD            | FLUXNET                                            | Net radiation                                                                                                                                                                           | numeric     |   |
| BESS-PAR          | BESS PAR                                           | Photosynthetic Active Radiation (PAR)                                                                                                                                                   | numeric     |   |
| BESS-PARdiff      | BESS PAR                                           | Diffuse PAR                                                                                                                                                                             | numeric     |   |
| BESS-RSDN         | BESS PAR                                           | Shortwave downwelling radiation                                                                                                                                                         | numeric     |   |
| CSIF-SIFdaily     | CSIF                                               | All-sky daily average SIF                                                                                                                                                               | numeric     |   |
| CSIF-SIFinst      | NA                                                 | NA                                                                                                                                                                                      | numeric     |   |
| PET               | ERA5-Land                                          | Potential ET                                                                                                                                                                            | numeric     |   |
| Ts                | ERA5-Land                                          | NA                                                                                                                                                                                      | numeric     |   |
| Tmean             | ERA5-Land                                          | Air temperature                                                                                                                                                                         | numeric     |   |
| prcp              | ERA5-Land                                          | Precipitation                                                                                                                                                                           | numeric     |   |
| vpd               | ERA5-Land                                          | Vapor pressure deficit                                                                                                                                                                  | numeric     |   |
| prcp-lag3         | ERA5-Land                                          | Precipitation 3-month lag                                                                                                                                                               | numeric     |   |
| ESACCI-sm         | ERA5-Land                                          | Soil moisture                                                                                                                                                                           | numeric     |   |
| MODIS_LC          | NA                                                 | MODIS land cover (MODIS LC)                                                                                                                                                             | categorical |   |
| b1                | MCD43C4                                            | Surface reflectance Band 1 (red)                                                                                                                                                        | numeric     |   |
| b2                | MCD43C4                                            | Surface reflectance Band 2 (NIR)                                                                                                                                                        | numeric     |   |
| b3                | MCD43C4                                            | Surface reflectance Band 3 (blue)                                                                                                                                                       | numeric     |   |
| b4                | MCD43C4                                            | Surface reflectance Band 4 (green)                                                                                                                                                      | numeric     |   |
| b5                | MCD43C4                                            | Surface reflectance Band 5 (SWIR1)                                                                                                                                                      | numeric     |   |
| b6                | MCD43C4                                            | Surface reflectance Band 6 (SWIR2)                                                                                                                                                      | numeric     |   |
| b7                | MCD43C4                                            | Surface reflectance Band 7 (SWIR3)                                                                                                                                                      | numeric     |   |
| EVI               | MCD43C4                                            | Enhanced Vegetation Index (EVI)                                                                                                                                                         | numeric     |   |
| GCI               | MCD43C4                                            | Green chlorophyll index (CIgreen)                                                                                                                                                       | numeric     |   |
| NDVI              | MCD43C4                                            | Normalized Difference Vegetation Index (NDVI)                                                                                                                                           | numeric     |   |
| NDWI              | MCD43C4                                            | Normalized Difference Water Index (NDWI)                                                                                                                                                | numeric     |   |
| NIRv              | MCD43C4                                            | Near-infrared reflectance of vegetation (NIRv)                                                                                                                                          | numeric     |   |
| kNDVI             | MCD43C4                                            | Kernel NDVI (kNDVI)                                                                                                                                                                     | numeric     |   |
| Percent_Snow      | MCD43C4                                            | Percentage of snow cover                                                                                                                                                                | numeric     |   |
| Fpar              | MCD15A3H (after 2002/07) MOD15A2H (before 2002/07) | Fraction of photosynthetically active radiation (fPAR)                                                                                                                                  | numeric     |   |
| Lai               | MCD15A3H (after 2002/07) MOD15A2H (before 2002/07) | Leaf area index (LAI)                                                                                                                                                                   | numeric     |   |
| LST_Day           | MYD11A1 (after 2002/07) MOD11A1 (before 2002/07)   | Daytime land surface temperature                                                                                                                                                        | numeric     |   |
| LST_Night         | MYD11A1 (after 2002/07) MOD11A1 (before 2002/07)   | Nighttime land surface temperature                                                                                                                                                      | numeric     |   |
| MODIS_IGBP        | NA                                                 | NA                                                                                                                                                                                      | categorical |   |
| MODIS_PFT         | NA                                                 | NA                                                                                                                                                                                      | categorical |   |
| koppen_sub        | NA                                                 | NA                                                                                                                                                                                      | categorical |   |
| koppen            | Koppen-Geiger                                      | Climate zone (one-hot encoding)                                                                                                                                                         | categorical |   |
| CO2_concentration | ESRL                                               | Atmospheric CO2 concentration                                                                                                                                                           | numeric     |   |
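
For reference, several of the MCD43C4-derived indices in the table have standard closed-form definitions. The sketch below computes them from band reflectances, taking b1 as red (MODIS band numbering) and b2, b3, b4 as NIR, blue, and green as listed in the table. It uses the commonly published formulas and constants. Whether the data providers used exactly these forms is an assumption and is not stated in this notebook.

```python
import math


def ndvi(red: float, nir: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red)."""
    return (nir - red) / (nir + red)


def nirv(red: float, nir: float) -> float:
    """Near-infrared reflectance of vegetation: NDVI * NIR."""
    return ndvi(red, nir) * nir


def kndvi(red: float, nir: float) -> float:
    """Kernel NDVI in its simplified form: tanh(NDVI^2)."""
    return math.tanh(ndvi(red, nir) ** 2)


def evi(red: float, nir: float, blue: float) -> float:
    """Enhanced Vegetation Index with the usual MODIS coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def gci(green: float, nir: float) -> float:
    """Green chlorophyll index (CIgreen): NIR / green - 1."""
    return nir / green - 1.0


# Illustrative reflectances only, not values from the dataset.
b1_red, b2_nir, b3_blue, b4_green = 0.1, 0.5, 0.05, 0.1
print(ndvi(b1_red, b2_nir), nirv(b1_red, b2_nir), kndvi(b1_red, b2_nir))
```

NDWI is left out because the table does not say which shortwave-infrared band is paired with NIR.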