# Process Globe-LFMC samples and extract DEM Data
Extracts the samples for Australian locations from the Globe-LFMC spreadsheet and adds the normalised DEM, climate zone, and other auxiliary data. DEM data is from the GEE SRTM DEM product, extracted using the MODIS projection and scale. Sites within the same MODIS pixel are merged. The following files are created:
- `LFMC_Australia.csv`: Australian data extracted from the Globe-LFMC dataset
- `LFMC_australia_sites.csv`: sites extracted from the Globe-LFMC Australian data and augmented with normalised DEM and location data
- `LFMC_australia_samples.csv`: Globe-LFMC Australian sample data augmented with auxiliary variables

### Notes
1. `Globe-LFMC-v2.xlsx` should exist in the `INPUT_DIR` directory (by default, a sub-directory of `DATA_DIR`).
2. The tiff containing the Koppen climate zone data (`Beck_KG_V1_present_0p0083.tif` available from https://figshare.com/articles/dataset/Present_and_future_K_ppen-Geiger_climate_classification_maps_at_1-km_resolution/6396959/2) should also be in `INPUT_DIR`, as should either the `legend.txt` file (available from the same site) or `Climate_zones.csv`. If `Climate_zones.csv` doesn't exist, it needs to be created from `legend.txt` by uncommenting and running the first code cell under "Climate zone processing".
3. `EXTRACT_NAME` is a sub-directory of `DATA_DIR`. It will be created if it doesn't exist. All data files created by this and other data extraction notebooks will be located in sub-directories of this directory.
4. `LFMC_Australia.csv` is created in the `INPUT_DIR` directory.
5. All other created files are CSVs and stored in the `SAMPLE_DIR` directory, by default a sub-directory of `DATA_DIR/EXTRACT_NAME`.
6. The samples data output by this code is further processed by the MODIS extraction code to remove the snow samples.

In [1]:
import os
import numpy as np
import pandas as pd
import time
from datetime import datetime
from datetime import timedelta

import initialise
import common
from data_extract_utils import normalise_dem
from data_extract_utils import extract_koppen_data
from data_prep_utils import normalise

Define input and output files

In [2]:
# Globe-LFMC file and sheet name
GLOBE_LFMC = os.path.join(common.SOURCE_DIR, "Globe-LFMC-v2.xlsx")
SHEET_NAME = "LFMC data"
# GLOBE_LFMC = os.path.join(common.SOURCE_DIR, "Globe-LFMC.csv")

# File Names
LFMC_RAW = os.path.join(common.SOURCE_DIR, "LFMC_Australia.csv")               # CSV of Australian data extracted from the Globe-LFMC dataset
KOPPEN_TIF = os.path.join(common.SOURCE_DIR, 'Beck_KG_V1_present_0p0083.tif')  # Tiff of Koppen climate zone values
LEGEND_FILE = os.path.join(common.SOURCE_DIR, 'legend.txt')                    # Text file with Koppen climate zone legend
KOPPEN_LEGEND = os.path.join(common.SOURCE_DIR, 'Climate_zones.csv')           # CSV of Koppen climate zone legend

SITES_OUTPUT = os.path.join(common.DATASETS_DIR, 'LFMC_australia_sites.csv')
SAMPLES_OUTPUT = os.path.join(common.DATASETS_DIR, 'LFMC_australia_samples.csv')

if not os.path.exists(common.DATASETS_DIR):
    os.makedirs(common.DATASETS_DIR)


Other constants/parameters

In [3]:
# DEM Product, projection and resolution
DEM_PRODUCT = 'USGS/SRTMGL1_003'
DEM_PROJ = "EPSG:4326"
DEM_SCALE = 30

# Floating point precision
FLOAT_PRE = 5

Initialise Google Earth Engine

In [4]:
import ee
ee.Initialize()

## Point-based Processing
Extracts the DEM data from GEE using the native DEM projection and resolution. Keeps the sample site latitude and longitude, and adds the elevation/slope/aspect.
- Parameter:
 - sites: Dataframe of sample sites
- Returns: Dataframe of sites, latitude and longitude and the added elevation/slope/aspect attributes

In [5]:
def sites_by_point(sites):
    dem_image = ee.Terrain.products(ee.Image(DEM_PRODUCT))
    points = [ee.Geometry.Point(site.Longitude, site.Latitude) for x, site in sites.iterrows()]
    dem_col = ee.ImageCollection(dem_image)
    col_list = [dem_col.getRegion(point, DEM_SCALE, DEM_PROJ) for point in points]
    dem_list = ee.List(col_list).getInfo()
    dem_data = pd.DataFrame([item[1] for item in dem_list], columns=dem_list[0][0])
    dem_data.id = sites.Site
    dem_data.rename(columns={"id": "Site"}, inplace=True)
    dem_df = sites.merge(dem_data[['Site', 'elevation', 'slope', 'aspect']])
    dem_df.columns = ['Site', 'Latitude', 'Longitude', 'Elevation', 'Slope', 'Aspect']
    return dem_df
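
The reshaping of the `getInfo()` result can be tried offline, without a GEE session. The sketch below uses a hand-written stand-in for the nested list that the function above expects from `getRegion` (a header row followed by one data row per point). The values, the `"0"` image id and the names `mock_dem_list` and `site_names` are invented for illustration and are not real GEE output.

```python
# Hypothetical stand-in for ee.List(col_list).getInfo(): one [header, row] pair per site.
header = ["id", "longitude", "latitude", "time", "elevation", "slope", "aspect", "hillshade"]
mock_dem_list = [
    [header, ["0", 144.2213, -37.6352, None, 500, 2.4, 147.0, 180]],
    [header, ["0", 149.8032, -35.4064, None, 685, 2.6, 126.0, 181]],
]
site_names = ["C10_1", "C10_2"]

# Same idea as the DataFrame construction above: take the data row of each
# result and label it with the header row.
records = [dict(zip(item[0], item[1])) for item in mock_dem_list]

# Replace the image id with the site name and keep only the terrain attributes.
dem_rows = [
    {"Site": site, "Elevation": rec["elevation"], "Slope": rec["slope"], "Aspect": rec["aspect"]}
    for site, rec in zip(site_names, records)
]
print(dem_rows[0])
```

Note that the site names are matched to the results purely by position, so the order of `sites` must be the order in which the points were submitted.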

## Pixel-based Processing
Extracts the DEM data at the requested projection and resolution. Terrain.products adds the slope and aspect. A reducer is used so terrain product info is added before resampling.
- Parameters:
 - sites: dataframe of sampling sites
 - scale/proj: the required scale/proj (e.g. MODIS scale/proj - or map scale/proj)
 - maxPixels: Reducer parameter specifying the maximum number of DEM pixels to use to compute each down-sampled pixel. Doesn't need to be exact but make sure it's large enough - 512 is good for MODIS
- Returns: Dataframe of sites with latitude and longitude set to the pixel centroid as returned by GEE and the added elevation/slope/aspect attributes

In [6]:
def sites_by_pixel(sites, scale, proj, maxPixels):
    dem_image = ee.Terrain.products(ee.Image(DEM_PRODUCT)).reduceResolution(ee.Reducer.mean(), maxPixels=maxPixels)
    points = [ee.Geometry.Point(site.Longitude, site.Latitude) for x, site in sites.iterrows()]
    dem_col = ee.ImageCollection(dem_image)
    col_list = [dem_col.getRegion(point, scale, proj) for point in points]
    dem_list = ee.List(col_list).getInfo()
    dem_data = pd.DataFrame([item[1] for item in dem_list], columns=dem_list[0][0])
    dem_data.id = sites.Site
    dem_data.columns = ['Site', 'Longitude', 'Latitude', 'time', 'Elevation', 'Slope', 'Aspect', 'hillshade']
    dem_df = dem_data.drop(columns=["time", "hillshade"])
    return dem_df
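
Why the terrain products are computed before the reducer is applied can be seen in a toy example. The sketch below is a one-dimensional analogue with invented elevations, not GEE's resampling algorithm: it compares averaging the fine-scale slopes with computing the slope of an averaged elevation profile.

```python
# Illustrative 1-D elevation profile (metres) on a 30 m grid; values are invented.
elevation = [0, 10, 0, 10, 0, 10, 0, 10]
cell_size = 30
factor = 2  # number of fine cells averaged into each coarse cell


def slopes(values, spacing):
    """Absolute rise over run between neighbouring cells."""
    return [abs(b - a) / spacing for a, b in zip(values, values[1:])]


def block_mean(values, size):
    """Average consecutive blocks of `size` values."""
    return [sum(values[i:i + size]) / size for i in range(0, len(values), size)]


# Terrain first, then average: the rough terrain is still visible.
fine_slopes = slopes(elevation, cell_size)
slope_then_mean = sum(fine_slopes) / len(fine_slopes)

# Average first, then terrain: the coarse profile is flat, so the slope vanishes.
coarse = block_mean(elevation, factor)
mean_then_slope = slopes(coarse, cell_size * factor)

print(round(slope_then_mean, 3), mean_then_slope)
```

In this toy case the first ordering keeps a mean slope of about one third, while the second reports flat terrain.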

## Main Processing
- If the `LFMC_RAW` file already exists, load it.
- Otherwise extract the Globe-LFMC data from the Excel workbook sheet and save it to the `LFMC_RAW` file.

In [8]:
if os.path.exists(LFMC_RAW):
    LFMC_data = pd.read_csv(LFMC_RAW, index_col=0, float_precision="high", parse_dates=["Sampling date"],
                           dtype={8: str, 10: np.int32, 11: np.int16, 14: np.int16, 23: str})
else:    
    LFMC_data = pd.read_excel(GLOBE_LFMC, SHEET_NAME).dropna(how="all")
    # LFMC_data = pd.read_csv(GLOBE_LFMC).dropna(how="all")
    LFMC_data = LFMC_data[(LFMC_data.Country.isin(['Australia']))
                          & (LFMC_data["Sampling date"] >= common.START_DATE)]
    LFMC_data.to_csv(LFMC_RAW)
    LFMC_data = LFMC_data.astype(dtype={'Sampling year': np.int32, 'Protocol': np.int16, 'Units': np.int16})
LFMC_data

Unnamed: 0,ID,Contact,Sitename,State/Region,Country,Latitude,Longitude,Sampling time,Sampling date,Sampling year,...,Units,NDVI SD min,NDVI SD max,NDVI CV min,NDVI CV max,Species collected,Elevation(m.a.s.l),Slope(%),Reference,Name of picture file
159972,C10_1_1,Newnham,Ballan,Victoria,Australia,-37.635167,144.221317,,2008-10-20,2008,...,1,0.032105,0.123081,0.087244,0.196893,"Lolium multiflorum, Trifolium sp.",,,"Newnham, G.J., Verbesselt, J., Grant, I.F., An...",
159973,C10_1_2,Newnham,Ballan,Victoria,Australia,-37.635167,144.221317,,2008-11-10,2008,...,1,0.032105,0.123081,0.087244,0.196893,"Lolium multiflorum, Trifolium sp.",,,"Newnham, G.J., Verbesselt, J., Grant, I.F., An...",
159974,C10_1_3,Newnham,Ballan,Victoria,Australia,-37.635167,144.221317,,2008-12-01,2008,...,1,0.032105,0.123081,0.087244,0.196893,"Lolium multiflorum, Trifolium sp.",,,"Newnham, G.J., Verbesselt, J., Grant, I.F., An...",
159975,C10_1_4,Newnham,Ballan,Victoria,Australia,-37.635167,144.221317,,2009-01-19,2009,...,1,0.032105,0.123081,0.087244,0.196893,"Lolium multiflorum, Trifolium sp.",,,"Newnham, G.J., Verbesselt, J., Grant, I.F., An...",
159976,C10_2_1,Newnham,Braidwood-A,New South Wales,Australia,-35.406367,149.803217,,2006-01-05,2006,...,1,0.033170,0.105659,0.065878,0.164409,Unknown grass,,,"Newnham, G.J., Verbesselt, J., Grant, I.F., An...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161712,C18_3_32,Yebra,Namadgi-2,Australian Capital Territory,Australia,-35.607070,148.865740,,2016-09-02,2016,...,1,0.033528,0.081472,0.047433,0.125154,"Leptospermum myrtifolium, Oxylobium ellipticum...",1000.0,,"Yebra, M., Quan, X.W., Riano, D., Rozas Larrao...",
161713,C18_3_33,Yebra,Namadgi-2,Australian Capital Territory,Australia,-35.607070,148.865740,,2016-09-02,2016,...,1,0.033528,0.081472,0.047433,0.125154,"Eucalyptus dalrympleana, Eucalyptus pauciflora...",1000.0,,"Yebra, M., Quan, X.W., Riano, D., Rozas Larrao...",
161714,C18_3_34,Yebra,Namadgi-2,Australian Capital Territory,Australia,-35.607070,148.865740,,2016-11-01,2016,...,1,0.033528,0.081472,0.047433,0.125154,"Poa labillardierei, Themeda triandra",1000.0,,"Yebra, M., Quan, X.W., Riano, D., Rozas Larrao...",
161715,C18_3_35,Yebra,Namadgi-2,Australian Capital Territory,Australia,-35.607070,148.865740,,2016-11-01,2016,...,1,0.033528,0.081472,0.047433,0.125154,"Leptospermum myrtifolium, Oxylobium ellipticum...",1000.0,,"Yebra, M., Quan, X.W., Riano, D., Rozas Larrao...",


### Site processing
Extract the unique sites from the Globe-LFMC data

In [9]:
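# Site id is the sample ID without its trailing sample number; `n` is passed by keyword,
# as recent pandas versions no longer accept it positionally
LFMC_data["Site"] = LFMC_data.ID.str.rsplit("_", n=1, expand=True)[0]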
sites = LFMC_data[["Site", "Latitude", "Longitude"]].drop_duplicates().reset_index(drop=True)
sites

Unnamed: 0,Site,Latitude,Longitude
0,C10_1,-37.635167,144.221317
1,C10_2,-35.406367,149.803217
2,C10_3,-35.42,149.793167
3,C10_4,-38.225677,145.5633
4,C10_5,-33.680889,117.610167
5,C10_6,-35.367667,149.0535
6,C10_7,-42.848333,147.4895
7,C10_8,-38.089457,145.430665
8,C10_9,-26.162889,121.558778
9,C10_10,-35.27775,149.196583


Retrieve the DEM data from GEE - run either `sites_by_pixel` (pixel mode) or `sites_by_point` (point mode)

In [10]:
dem_df = sites_by_pixel(sites, common.SCALE, common.PROJ, 512)
dem_df

Unnamed: 0,Site,Longitude,Latitude,Elevation,Slope,Aspect
0,C10_1,144.221027,-37.635415,499.988969,2.388163,146.934708
1,C10_2,149.801513,-35.406249,685.427163,2.627604,126.444113
2,C10_3,149.788961,-35.418749,677.982848,3.020142,188.612368
3,C10_4,145.56676,-38.227082,12.068741,2.06069,147.484675
4,C10_5,117.611528,-33.681249,313.28297,2.759327,130.758276
5,C10_6,149.057405,-35.368749,637.531455,4.19513,179.442269
6,C10_7,147.486277,-42.847915,5.658302,1.874102,118.096442
7,C10_8,145.430175,-38.089582,22.020871,1.285042,128.280466
8,C10_9,121.562633,-26.164582,505.903014,1.639325,154.308535
9,C10_10,149.194742,-35.277082,578.080353,2.12938,177.009241


Normalise the DEM data. The sites data is saved later, once the climate zones have been added. Note: sites with the same latitude/longitude are *not* merged yet

In [11]:
dem_norm = normalise_dem(dem_df.set_index('Site'), input_columns=['Longitude', 'Latitude', 'Elevation', 'Slope', 'Aspect'], precision=FLOAT_PRE)
dem_norm = dem_norm.reset_index()
dem_norm

Unnamed: 0,Site,Longitude,Latitude,Elevation,Slope,Aspect,Long_sin,Long_cos,Lat_norm,Aspect_sin,Aspect_cos
0,C10_1,144.22103,-37.63542,0.08333,0.02654,146.93471,0.58466,-0.81128,0.29091,-0.54559,0.83805
1,C10_2,149.80151,-35.40625,0.11424,0.0292,126.44411,0.503,-0.86429,0.3033,-0.80444,0.59404
2,C10_3,149.78896,-35.41875,0.113,0.03356,188.61237,0.50319,-0.86418,0.30323,0.14975,0.98872
3,C10_4,145.56676,-38.22708,0.00201,0.0229,147.48467,0.56545,-0.82479,0.28763,-0.53753,0.84325
4,C10_5,117.61153,-33.68125,0.05221,0.03066,130.75828,0.88611,-0.46347,0.31288,-0.75747,0.65287
5,C10_6,149.0574,-35.36875,0.10626,0.04661,179.44227,0.51418,-0.85768,0.30351,-0.00973,0.99995
6,C10_7,147.48628,-42.84792,0.00094,0.02082,118.09644,0.5375,-0.84326,0.26196,-0.88216,0.47096
7,C10_8,145.43017,-38.08958,0.00367,0.01428,128.28047,0.56741,-0.82344,0.28839,-0.78499,0.61951
8,C10_9,121.56263,-26.16458,0.08432,0.01821,154.30854,0.85207,-0.52343,0.35464,-0.43352,0.90114
9,C10_10,149.19474,-35.27708,0.09635,0.02366,177.00924,0.51212,-0.85891,0.30402,-0.05217,0.99864
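
The source of `normalise_dem` is not shown in this notebook, so the following is only a sketch of transformations that appear consistent with the table above. The function name `normalise_site`, the divisors (6000 for elevation, 90 for slope) and the half-turn shift applied to the aspect are all inferred from the displayed numbers and should be treated as assumptions.

```python
import math


def normalise_site(longitude, latitude, elevation, slope, aspect, precision=5):
    """Hypothetical re-creation of the normalised columns; the constants are inferred."""
    lon_rad = math.radians(longitude)          # -180..180 degrees -> -pi..pi
    asp_rad = math.radians(aspect) - math.pi   # 0..360 degrees -> -pi..pi
    values = {
        "Elevation": elevation / 6000,         # assumed elevation range of 0..6000 m
        "Slope": slope / 90,                   # slope in degrees, 0..90
        "Long_sin": math.sin(lon_rad),
        "Long_cos": math.cos(lon_rad),
        "Lat_norm": (latitude + 90) / 180,     # -90..90 degrees -> 0..1
        "Aspect_sin": math.sin(asp_rad),
        "Aspect_cos": math.cos(asp_rad),
    }
    return {name: round(value, precision) for name, value in values.items()}


# Site C10_1, using the values from the pixel-mode DEM output
site = normalise_site(144.221027, -37.635415, 499.988969, 2.388163, 146.934708)
print(site)
```

Encoding longitude and aspect as sine/cosine pairs keeps values on either side of the wrap-around point close together, which a single linear scale would not.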


### Date processing
Create dataframe with dates and normalised day-of-year

In [12]:
days = pd.date_range(common.START_DATE, common.END_DATE, inclusive="left")
doy = pd.Series(normalise(days.dayofyear, method='range', data_range=(1, 366), scaled_range=(-np.pi, np.pi)))
days_df = pd.DataFrame({"Date": days, 
                        "Day_sin": doy.transform(np.sin).round(FLOAT_PRE),
                        "Day_cos": doy.transform(np.cos).round(FLOAT_PRE)})
days_df

Unnamed: 0,Date,Day_sin,Day_cos
0,2000-03-01,-0.85876,-0.51237
1,2000-03-02,-0.86746,-0.49751
2,2000-03-03,-0.87589,-0.48251
3,2000-03-04,-0.88407,-0.46736
4,2000-03-05,-0.89198,-0.45207
...,...,...,...
6875,2018-12-27,0.08596,-0.99630
6876,2018-12-28,0.06880,-0.99763
6877,2018-12-29,0.05162,-0.99867
6878,2018-12-30,0.03442,-0.99941
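
The day-of-year encoding can be checked by hand. The sketch below assumes that `normalise(..., method='range')` is a plain linear mapping of the range 1..366 onto -π..π (its source is not shown here); `day_sin_cos` is a hypothetical helper written in plain Python for illustration.

```python
import math
from datetime import date


def day_sin_cos(day, precision=5):
    """Map day-of-year 1..366 linearly onto -pi..pi, then take sine and cosine."""
    doy = day.timetuple().tm_yday
    angle = (doy - 1) / (366 - 1) * 2 * math.pi - math.pi
    return round(math.sin(angle), precision), round(math.cos(angle), precision)


first_day = day_sin_cos(date(2000, 3, 1))
print(first_day)
```

Under that assumption the result agrees with the first row of the table above. The sine/cosine pair places 31 December next to 1 January, which a raw day number would not.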


### Climate zone processing

#### Create the Koppen legend csv file
If the `KOPPEN_LEGEND` file doesn't exist, uncomment and run the following cell. This will create it from the `legend.txt` file that can be downloaded with the climate zones tiff.

In [12]:
# legend = {}
# count = 0
# with open(LEGEND_FILE) as fp:
#     for ln in fp:
#         line = ln.split(':')
#         number = line[0].strip()
#         if number.isnumeric():
#             count += 1
#             key = int(line[0].strip())
#             parts = line[1].split('[')
#             colour = parts[1].strip().strip(']').split(' ')
#             code = parts[0].strip()[:3]
#             descr = parts[0].strip()[5:]
#             value = {'Number': number, 'Code': code, 'Description': descr, 'Red': colour[0], 'Green': colour[1], 'Blue': colour[2]}
#             legend[key] = value
# legend_df = pd.DataFrame.from_dict(legend, orient='index')
# legend_df.to_csv(KOPPEN_LEGEND, index=False)

#### Extract climate zones for sites
Extract the climate zone for each site and add to the sites data.

In [13]:
cz_columns = ['Czone1', 'Czone2', 'Czone3']
extract_koppen_data(KOPPEN_TIF, KOPPEN_LEGEND, sites, loc_columns=['Longitude', 'Latitude'], cz_columns=cz_columns)
dem_norm = dem_norm.merge(sites[['Site', 'Czone1', 'Czone2', 'Czone3']], on='Site')
dem_norm.to_csv(SITES_OUTPUT, index=False)
dem_norm

Unnamed: 0,Site,Longitude,Latitude,Elevation,Slope,Aspect,Long_sin,Long_cos,Lat_norm,Aspect_sin,Aspect_cos,Czone1,Czone2,Czone3
0,C10_1,144.22103,-37.63542,0.08333,0.02654,146.93471,0.58466,-0.81128,0.29091,-0.54559,0.83805,C,Cf,Cfb
1,C10_2,149.80151,-35.40625,0.11424,0.0292,126.44411,0.503,-0.86429,0.3033,-0.80444,0.59404,C,Cf,Cfb
2,C10_3,149.78896,-35.41875,0.113,0.03356,188.61237,0.50319,-0.86418,0.30323,0.14975,0.98872,C,Cf,Cfb
3,C10_4,145.56676,-38.22708,0.00201,0.0229,147.48467,0.56545,-0.82479,0.28763,-0.53753,0.84325,C,Cf,Cfb
4,C10_5,117.61153,-33.68125,0.05221,0.03066,130.75828,0.88611,-0.46347,0.31288,-0.75747,0.65287,C,Cs,Csb
5,C10_6,149.0574,-35.36875,0.10626,0.04661,179.44227,0.51418,-0.85768,0.30351,-0.00973,0.99995,C,Cf,Cfb
6,C10_7,147.48628,-42.84792,0.00094,0.02082,118.09644,0.5375,-0.84326,0.26196,-0.88216,0.47096,C,Cf,Cfb
7,C10_8,145.43017,-38.08958,0.00367,0.01428,128.28047,0.56741,-0.82344,0.28839,-0.78499,0.61951,C,Cf,Cfb
8,C10_9,121.56263,-26.16458,0.08432,0.01821,154.30854,0.85207,-0.52343,0.35464,-0.43352,0.90114,B,BW,BWh
9,C10_10,149.19474,-35.27708,0.09635,0.02366,177.00924,0.51212,-0.85891,0.30402,-0.05217,0.99864,C,Cf,Cfb


### Sample processing
Create the auxiliary dataset from the samples

##### Step 1: Merge sites and sample data to add the site longitude and latitude to the samples

In [14]:
samples = dem_norm[["Site", "Longitude", "Latitude"]].merge(
    LFMC_data[["ID", "Site", "Sampling date", "Sampling year", "Land Cover", "LFMC value"]])
samples

Unnamed: 0,Site,Longitude,Latitude,ID,Sampling date,Sampling year,Land Cover,LFMC value
0,C10_1,144.22103,-37.63542,C10_1_1,2008-10-20,2008,Grassland,260.570000
1,C10_1,144.22103,-37.63542,C10_1_2,2008-11-10,2008,Grassland,162.340000
2,C10_1,144.22103,-37.63542,C10_1_3,2008-12-01,2008,Grassland,132.660000
3,C10_1,144.22103,-37.63542,C10_1_4,2009-01-19,2009,Grassland,95.810000
4,C10_2,149.80151,-35.40625,C10_2_1,2006-01-05,2006,Mosaic cropland (>50%) / natural vegetation (t...,63.000000
...,...,...,...,...,...,...,...,...
668,C18_3,148.86310,-35.60625,C18_3_32,2016-09-02,2016,"Tree cover, broadleaved, evergreen, closed to ...",176.192379
669,C18_3,148.86310,-35.60625,C18_3_33,2016-09-02,2016,"Tree cover, broadleaved, evergreen, closed to ...",148.072003
670,C18_3,148.86310,-35.60625,C18_3_34,2016-11-01,2016,"Tree cover, broadleaved, evergreen, closed to ...",97.565123
671,C18_3,148.86310,-35.60625,C18_3_35,2016-11-01,2016,"Tree cover, broadleaved, evergreen, closed to ...",166.557840


##### Step 2: Merge samples for same latitude/longitude/date

In [15]:
# Generate a common site id for each site with the same latitude and longitude
merge_columns = ["Latitude", "Longitude"]
sites_temp = dem_norm[merge_columns + ["Site"]].groupby(merge_columns, as_index=False).min()
# Merge samples for same date and location
samples = samples.merge(sites_temp, on=merge_columns, suffixes=("_x", None))
groupby_cols = ["Latitude", "Longitude", "Sampling date"]
data_cols = {"ID": "min",                                    # Unique sample ID is the first ID of the merged samples
             "Sampling year": "min",                         # They should all be the same, but need to select one
             "Land Cover": lambda x: pd.Series.mode(x)[0],   # Most common land cover value
             "LFMC value": "mean",                           # mean LFMC value
             "Site": "min"}                                  # Site id from sites_temp
samples = samples[groupby_cols + list(data_cols.keys())].groupby(groupby_cols, as_index=False).\
              agg(data_cols).round({"LFMC value": FLOAT_PRE})
samples

Unnamed: 0,Latitude,Longitude,Sampling date,ID,Sampling year,Land Cover,LFMC value,Site
0,-42.84792,147.48628,2006-03-01,C10_7_1,2006,Shrubland deciduous,35.04517,C10_7
1,-38.22708,145.56676,2008-01-15,C10_4_1,2008,Grassland,58.84256,C10_4
2,-38.22708,145.56676,2008-01-22,C10_4_2,2008,Grassland,49.42399,C10_4
3,-38.22708,145.56676,2008-01-29,C10_4_3,2008,Grassland,49.74228,C10_4
4,-38.22708,145.56676,2008-02-05,C10_4_4,2008,Grassland,45.46927,C10_4
...,...,...,...,...,...,...,...,...
385,-15.58542,128.23378,2006-09-05,C10_16_10,2006,Mosaic tree and shrub (>50%) / herbaceous cove...,28.39000,C10_16
386,-15.58542,128.23378,2006-12-19,C10_16_11,2006,Mosaic tree and shrub (>50%) / herbaceous cove...,161.20000,C10_16
387,-15.58542,128.23378,2007-04-26,C10_16_12,2007,Mosaic tree and shrub (>50%) / herbaceous cove...,74.10000,C10_16
388,-15.58542,128.23378,2007-05-21,C10_16_13,2007,Mosaic tree and shrub (>50%) / herbaceous cove...,56.50000,C10_16
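
The aggregation rules used in this step can be illustrated on a few invented rows. The sketch below re-creates the grouping in plain Python rather than pandas so that it runs without the notebook's data. The site and sample names are hypothetical, and ties in land cover are resolved alphabetically, which is how taking element `[0]` of the sorted `pd.Series.mode` result behaves.

```python
from collections import Counter, defaultdict
from statistics import mean

# Invented samples: the first two share a pixel centroid and date, so they merge.
rows = [
    {"ID": "X1_2_1", "Latitude": -35.0, "Longitude": 149.0, "Sampling date": "2008-01-15",
     "Land Cover": "Grassland", "LFMC value": 60.0},
    {"ID": "X1_1_1", "Latitude": -35.0, "Longitude": 149.0, "Sampling date": "2008-01-15",
     "Land Cover": "Grassland", "LFMC value": 50.0},
    {"ID": "X1_1_2", "Latitude": -35.0, "Longitude": 149.0, "Sampling date": "2008-01-22",
     "Land Cover": "Grassland", "LFMC value": 80.0},
]

# Group by pixel location and sampling date.
groups = defaultdict(list)
for row in rows:
    groups[(row["Latitude"], row["Longitude"], row["Sampling date"])].append(row)

merged = []
for (lat, lon, day), members in sorted(groups.items()):
    counts = Counter(m["Land Cover"] for m in members)
    top = max(counts.values())
    merged.append({
        "Latitude": lat, "Longitude": lon, "Sampling date": day,
        "ID": min(m["ID"] for m in members),                             # smallest ID
        "Land Cover": min(lc for lc, n in counts.items() if n == top),   # most common value
        "LFMC value": round(mean(m["LFMC value"] for m in members), 5),  # mean LFMC
    })

for row in merged:
    print(row)
```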


##### Step 3: Add the normalised auxiliary variables (day-of-year, location and DEM) to the samples

In [16]:
aux_df = samples[["ID", "Latitude", "Longitude", "Sampling date", "Sampling year", "Land Cover", "LFMC value", "Site"]
                ].merge(days_df, left_on="Sampling date", right_on = "Date").drop(columns="Date").\
                merge(dem_norm.drop(columns=["Longitude", "Latitude"]), on="Site").sort_values("ID")
aux_df = aux_df[['ID', 'Latitude', 'Longitude', 'Sampling date', 'Sampling year', 'Land Cover', 'LFMC value', 'Site',
                 'Czone1', 'Czone2', 'Czone3',
                 'Day_sin', 'Day_cos',
                 'Long_sin', 'Long_cos', 'Lat_norm', 'Elevation', 'Slope', 'Aspect_sin', 'Aspect_cos']]
aux_df.to_csv(SAMPLES_OUTPUT, index=False)
aux_df

Unnamed: 0,ID,Latitude,Longitude,Sampling date,Sampling year,Land Cover,LFMC value,Site,Czone1,Czone2,Czone3,Day_sin,Day_cos,Long_sin,Long_cos,Lat_norm,Elevation,Slope,Aspect_sin,Aspect_cos
80,C10_10_1,-35.27708,149.19474,2005-08-16,2005,Mosaic cropland (>50%) / natural vegetation (t...,48.00000,C10_10,C,Cf,Cfb,0.69328,0.72067,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
88,C10_10_10,-35.27708,149.19474,2006-12-13,2006,Mosaic cropland (>50%) / natural vegetation (t...,10.16000,C10_10,C,Cf,Cfb,0.32127,-0.94699,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
89,C10_10_11,-35.27708,149.19474,2007-01-25,2007,Mosaic cropland (>50%) / natural vegetation (t...,5.90000,C10_10,C,Cf,Cfb,-0.40149,-0.91586,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
53,C10_10_12,-35.27708,149.19474,2007-10-17,2007,Mosaic cropland (>50%) / natural vegetation (t...,91.81000,C10_10,C,Cf,Cfb,0.96574,-0.25951,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
54,C10_10_13,-35.27708,149.19474,2007-10-29,2007,Mosaic cropland (>50%) / natural vegetation (t...,128.80000,C10_10,C,Cf,Cfb,0.89198,-0.45207,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,C18_3_28,-35.60625,148.86310,2016-02-16,2016,"Tree cover, broadleaved, evergreen, closed to ...",136.38340,C18_3,C,Cf,Cfb,-0.71166,-0.70253,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
188,C18_3_31,-35.60625,148.86310,2016-09-02,2016,"Tree cover, broadleaved, evergreen, closed to ...",145.09527,C18_3,C,Cf,Cfb,0.88001,0.47495,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
189,C18_3_34,-35.60625,148.86310,2016-11-01,2016,"Tree cover, broadleaved, evergreen, closed to ...",137.89202,C18_3,C,Cf,Cfb,0.85876,-0.51237,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709
179,C18_3_4,-35.60625,148.86310,2015-10-12,2015,"Tree cover, broadleaved, evergreen, closed to ...",116.44435,C18_3,C,Cf,Cfb,0.98447,-0.17553,0.51708,-0.85593,0.30219,0.21129,0.16830,-0.98912,-0.14709


In [23]:
land_cover = sorted(aux_df['Land Cover'].unique())
land_cover

['Cropland, rainfed',
 'Grassland',
 'Herbaceous cover',
 'Mosaic cropland (>50%) / natural vegetation (tree, shrub, herbaceous cover) (<50%)',
 'Mosaic natural vegetation (tree, shrub, herbaceous cover) (>50%) / cropland (<50%) ',
 'Mosaic tree and shrub (>50%) / herbaceous cover (<50%)',
 'Shrubland deciduous',
 'Tree cover, broadleaved, evergreen, closed to open (>15%)',
 'Urban areas']

In [24]:
landcover_groups = {
    'Agriculture': [0, 3, 4],
    'Forest': [5, 7],
    'Grassland': [1, 2],
    'Shrubland': [6],
    'Other': [8],
}
lc_labels = list(landcover_groups.keys())

lc_summary = aux_df.groupby(['Land Cover', 'Site'], as_index=False
                           ).size().groupby(['Land Cover']).agg({'size': 'sum', "Site": "count"})
lc_summary['landcover_group'] = ''
for group, classes in landcover_groups.items():
    lc = [land_cover[c] for c in classes]
    lc_summary.loc[lc, 'landcover_group'] = group
lc_summary = lc_summary.reset_index()
lc_summary.columns = ['Land Cover', '#Samples', '#Sites', 'LC Category']
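
The grouping above refers to land cover classes by their position in the sorted `land_cover` list. A small sketch of the same lookup, inverted into a class-name-to-category dictionary, makes the assignment easier to inspect; `class_to_group` is a name introduced here for illustration.

```python
# Sorted class names as displayed above (one name carries a trailing space in the data).
land_cover = [
    'Cropland, rainfed',
    'Grassland',
    'Herbaceous cover',
    'Mosaic cropland (>50%) / natural vegetation (tree, shrub, herbaceous cover) (<50%)',
    'Mosaic natural vegetation (tree, shrub, herbaceous cover) (>50%) / cropland (<50%) ',
    'Mosaic tree and shrub (>50%) / herbaceous cover (<50%)',
    'Shrubland deciduous',
    'Tree cover, broadleaved, evergreen, closed to open (>15%)',
    'Urban areas',
]
landcover_groups = {
    'Agriculture': [0, 3, 4],
    'Forest': [5, 7],
    'Grassland': [1, 2],
    'Shrubland': [6],
    'Other': [8],
}

# Invert the index-based grouping into an explicit name -> category lookup.
class_to_group = {
    land_cover[index]: group
    for group, indices in landcover_groups.items()
    for index in indices
}
for name, group in class_to_group.items():
    print(f"{group:12s} {name}")
```

Because the indices depend on the sorted order of the classes present in the data, they would need to be reviewed if the set of land cover classes changes.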

In [25]:
cols = list(aux_df.columns)
cols.insert(5, 'LC Category')
aux_df = aux_df.merge(lc_summary[['LC Category', 'Land Cover']])[cols]
aux_df

Unnamed: 0,ID,Latitude,Longitude,Sampling date,Sampling year,LC Category,Land Cover,LFMC value,Site,Czone1,...,Czone3,Day_sin,Day_cos,Long_sin,Long_cos,Lat_norm,Elevation,Slope,Aspect_sin,Aspect_cos
0,C10_10_1,-35.27708,149.19474,2005-08-16,2005,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,48.00000,C10_10,C,...,Cfb,0.69328,0.72067,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
1,C10_10_10,-35.27708,149.19474,2006-12-13,2006,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,10.16000,C10_10,C,...,Cfb,0.32127,-0.94699,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
2,C10_10_11,-35.27708,149.19474,2007-01-25,2007,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,5.90000,C10_10,C,...,Cfb,-0.40149,-0.91586,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
3,C10_10_12,-35.27708,149.19474,2007-10-17,2007,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,91.81000,C10_10,C,...,Cfb,0.96574,-0.25951,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
4,C10_10_13,-35.27708,149.19474,2007-10-29,2007,Agriculture,Mosaic cropland (>50%) / natural vegetation (t...,128.80000,C10_10,C,...,Cfb,0.89198,-0.45207,0.51212,-0.85891,0.30402,0.09635,0.02366,-0.05217,0.99864
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,C10_20_7,-35.41875,148.95045,2007-03-21,2007,Agriculture,"Cropland, rainfed",336.00000,C10_20,C,...,Cfb,-0.97785,-0.20931,0.51578,-0.85672,0.30323,0.10795,0.04367,-0.99353,0.11358
386,C10_20_8,-35.41875,148.95045,2007-04-16,2007,Agriculture,"Cropland, rainfed",108.60000,C10_20,C,...,Cfb,-0.97212,0.23449,0.51578,-0.85672,0.30323,0.10795,0.04367,-0.99353,0.11358
387,C10_20_9,-35.41875,148.95045,2007-05-02,2007,Agriculture,"Cropland, rainfed",145.80000,C10_20,C,...,Cfb,-0.87171,0.49003,0.51578,-0.85672,0.30323,0.10795,0.04367,-0.99353,0.11358
388,C10_21_1,-35.20625,149.02369,2005-08-30,2005,Other,Urban areas,46.00000,C10_21,C,...,Cfb,0.84525,0.53437,0.51468,-0.85738,0.30441,0.09647,0.03897,-0.95610,0.29304


In [26]:
aux_df.to_csv(SAMPLES_OUTPUT, index=False)

In [20]:
aux_df.groupby(['Sampling year']).size()

Sampling year
2005    11
2006    57
2007    48
2008    64
2009    23
2012     3
2013    56
2014    67
2015    41
2016    20
dtype: int64

In [21]:
aux_df.groupby(['Czone3']).size()

Czone3
Aw       9
BSh     27
BSk     18
BWh     11
Cfa     43
Cfb    250
Csa      9
Csb     23
dtype: int64

In [27]:
aux_df.groupby(['Site', 'Sampling year']).size().unstack()

Site,2005,2006,2007,2008,2009,2012,2013,2014,2015,2016
C10_1,,,,3.0,1.0,,,,,
C10_10,2.0,8.0,6.0,13.0,11.0,,,,,
C10_11,1.0,,,,,,,,,
C10_12,5.0,,,,,,,,,
C10_13,,5.0,4.0,,,,,,,
C10_14,,,,9.0,,,,,,
C10_15,,,,9.0,,,,,,
C10_16,,11.0,3.0,,,,,,,
C10_17,,4.0,,,,,,,,
C10_18,,5.0,4.0,,,,,,,
