# LecoSpec Data Munging

In [2]:
#source("Functions/lecospectR.R", echo = FALSE)
packageVersion("tidyverse")

[1] '2.0.0'

In [3]:
# notebooks use their location as their working directory, so
# if we are in a subfolder, move to the main folder.  
# This however can safely be run multiple times
#setwd(M:/lecospec/lecospec)
if(!dir.exists("Functions/")){
    setwd("../../")
    if(!dir.exists("Functions")){
        setwd("M:/lecospec/lecospec/")
    }
}
source("Functions/lecospectR.R", echo = FALSE)



## Notation

Throughout the notebook, variables starting with `img_` are UAV image-based information (data, filepaths, etc).  Similarly, variables beginning with `grd_` relate to data collected on the ground.  

Also, some other naming conventions for variables with data transformations:
* `robust` in a variable name refers to data centered on the median and scaled by the inter-quartile range (à la sklearn's RobustScaler)
* `minmax` (and its ilk) are min-max scaled data, i.e. scaled to the interval [0,1] by subtracting the minimum and dividing by the range.
* `standard(ized)` refers to data treated with the z-score transform, i.e. centring on the mean and scaling by the standard deviation (like sklearn's StandardScaler)
* `corrected` means that a linear transformation has been applied to account for differences in sensor calibration.
* `raw` refers to having no transformations applied
* `clipped` means that outliers have been clipped to the upper and lower fence values based on the Inter-Quartile Range method. 
* `imputed` means that outliers have been removed and imputed
* `dropped` means that dataframe rows containing outliers have been removed

Example: `img_robust_indices` refers to vegetation indices from the UAV images treated with the robust scaler. 
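
As a rough illustration of the three center/scale conventions above, here is a minimal base-R sketch on a toy data frame. The helper names (`robust_scale`, `min_max_scale`, `z_scale`) are hypothetical and are not the lecospectR implementations; they only show what each naming convention is meant to describe.

```r
# Hypothetical helpers illustrating the naming conventions; not lecospectR code.
robust_scale <- function(x) {
    # center on the median, scale by the inter-quartile range
    (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}
min_max_scale <- function(x) {
    # map to [0, 1]: subtract the minimum, divide by the range
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
}
z_scale <- function(x) {
    # center on the mean, scale by the standard deviation
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# toy data: column a contains one obvious outlier
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))
toy_robust <- as.data.frame(lapply(toy, robust_scale))
toy_minmax <- as.data.frame(lapply(toy, min_max_scale))
toy_standard <- as.data.frame(lapply(toy, z_scale))
print(toy_robust)
```

Note how the outlier in column `a` barely affects the robust scaling of the other rows, while it squeezes them toward zero under min-max scaling.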

## Define data locations


In [4]:
# spectral library
grd_base_path <- "./Output/C_001_SC3_Cleaned_SpectralLib.csv"
grd_speclib <- read.csv(grd_base_path, header = TRUE)
#grd_index_path <- ./Data/D_002_SpecLib_Derivs.csv
#grd_indices <- read.csv(grd_index_path)
# this data has some lines that have no labels, so we remove them 
grd_speclib <- grd_speclib[!is.na(grd_speclib$Functional_group1),]
head(grd_speclib)

Unnamed: 0_level_0,X,ScanID,Area,Code_name,Species_name,Functional_group1,Functional_group2,Species_name_Freq,Functional_group1_Freq,Functional_group2_Freq,⋯,Radiometric.Calibration,Units,Latitude,Longitude,Altitude,GPS.Time,Satellites,Calibrated.Reference.Correction.File,Channels,ScanNum
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,⋯,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<int>,<int>
1,1,aleoch_Murph_061,Murphy,aleoch,Alectoria ochroleuca,Lichen,LightTerrestrialMacrolichen,6,453,118,⋯,,,,,,,,,,
2,2,aleoch_Murph_063,Murphy,aleoch,Alectoria ochroleuca,Lichen,LightTerrestrialMacrolichen,6,453,118,⋯,,,,,,,,,,
3,3,aleoch_Murph_064,Murphy,aleoch,Alectoria ochroleuca,Lichen,LightTerrestrialMacrolichen,6,453,118,⋯,,,,,,,,,,
4,4,aleoch_Murph_065,Murphy,aleoch,Alectoria ochroleuca,Lichen,LightTerrestrialMacrolichen,6,453,118,⋯,,,,,,,,,,
5,5,aleoch_Murph_066,Murphy,aleoch,Alectoria ochroleuca,Lichen,LightTerrestrialMacrolichen,6,453,118,⋯,,,,,,,,,,
6,6,alnfru_00003,Yukon_Delta,alnfru,Alnus sp.,ShrubDecid,ShrubAlder,82,360,82,⋯,,,,,,,,,,


In [5]:
img_base_path <- "Data/Ground_Validation/PFT_image_spectra/PFT_Image_SpectralLib_Clean.csv"
img_speclib <- read.csv(img_base_path)

# currently, not using the old pre-processing scheme and just doing it here.
#img_index_path <- Data/D_002_Image_SpecLib_Derivs.csv
#img_speclib <- read.csv(img_base_path)
head(img_speclib)

Unnamed: 0_level_0,X,UID,ScanNum,sample_name,PFT,FncGrp1,Site,X398,X399,X400,⋯,X990,X991,X992,X993,X994,X995,X996,X997,X998,X999
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,BisonGulchPFTsBetula1,1,spec_1,Betula,TreeBroadleaf,BisonGulch,0.05814769,0.05926529,0.06028869,⋯,0.6815182,0.681166,0.689047,0.7040298,0.7249807,0.7507566,0.7801884,0.8121027,0.8453261,0.8786852
2,2,BisonGulchPFTsBetula1,1,spec_2,Betula,TreeBroadleaf,BisonGulch,0.04456014,0.04778814,0.05079318,⋯,0.6706666,0.6683159,0.6786394,0.7000307,0.7308801,0.7695067,0.8140391,0.8625739,0.9132079,0.9640378
3,3,BisonGulchPFTsBetula1,1,spec_3,Betula,TreeBroadleaf,BisonGulch,0.03929324,0.04265593,0.04557066,⋯,0.5152525,0.5091915,0.5178217,0.5395294,0.5726982,0.6156166,0.6663192,0.7227978,0.7830447,0.845052
4,4,BisonGulchPFTsBetula1,1,spec_4,Betula,TreeBroadleaf,BisonGulch,0.13230228,0.11122692,0.09129034,⋯,0.5120581,0.511388,0.5348292,0.5745538,0.6227243,0.6723311,0.718586,0.7570701,0.7833644,0.7930498
5,5,BisonGulchPFTsBetula1,1,spec_5,Betula,TreeBroadleaf,BisonGulch,0.05211388,0.05565497,0.05878525,⋯,0.6863419,0.6680365,0.6509006,0.634445,0.6181806,0.6017555,0.5851848,0.5685449,0.5519121,0.5353626
6,6,BisonGulchPFTsBetula1,1,spec_6,Betula,TreeBroadleaf,BisonGulch,0.06955397,0.06788242,0.06631141,⋯,0.7354495,0.7371508,0.7445194,0.7567953,0.7732173,0.7930235,0.8154512,0.8397375,0.8651196,0.8908347


Okay, there are some metadata columns that should not be there for the next step - let's remove them with `subset`.

In [6]:
RawUID<- img_speclib %>% 
  dplyr::select(UID) %>% as.data.frame()

SiteNames<-str_split(RawUID[,1], "PFT") %>% 
  as.data.frame() %>% 
  t %>% 
  as.data.frame() %>%
  dplyr::rename(Site = V1) %>% 
  dplyr::select(Site)
print(unique(SiteNames))

                                      Site
c..BisonGulch....sBetula1..     BisonGulch
c..Chatanika....sBetula_nana1..  Chatanika
c..EightMile....sBetula_nana1..  EightMile
c..Bonanza....sLarix1..            Bonanza
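
The odd row names in the output above come from transposing a list of split strings. A minimal sketch of a simpler route, assuming every UID has the form `<Site>PFT<rest>` as in the table shown earlier (the toy UIDs below are reconstructed from that output):

```r
# Toy UIDs in the same shape as the UID column shown above.
uid <- c("BisonGulchPFTsBetula1", "ChatanikaPFTsBetula_nana1",
         "EightMilePFTsBetula_nana1", "BonanzaPFTsLarix1")
# Drop everything from the first "PFT" onward to keep only the site name.
site <- sub("PFT.*$", "", uid)
print(unique(site))
```

This returns a plain character vector, so it can be compared directly against the existing `Site` column.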


In [7]:
bg_speclib <- img_speclib[img_speclib$Site == "BisonGulch",]
ch_speclib <- img_speclib[img_speclib$Site == "Chatanika",]
em_speclib <- img_speclib[img_speclib$Site == "EightMile",]
bz_speclib <- img_speclib[img_speclib$Site == "Bonanza",]

In [10]:
unique(bz_speclib$FncGrp1)
unique(bg_speclib$FncGrp1)
unique(em_speclib$FncGrp1)
unique(ch_speclib$FncGrp1)

In [None]:
img_bands <- subset(
    img_speclib, 
    select=-c(
        X,
    	UID,
        ScanNum,
    	sample_name,
    	PFT,
    	FncGrp1,
        Site
    ))


grd_bands <- subset(
    grd_speclib, 
    select=-c(
        X,
        ScanID,
        Area,
        Code_name,
        Species_name,
        Functional_group1,
        Functional_group2,
        Species_name_Freq,
        Functional_group1_Freq,
        Functional_group2_Freq,
        Genus,
        Version,
        File.Name,
        Instrument,
        Detectors,
        Measurement,
        Date,
        Time,
        Battery.Voltage,
        Averages,
        Integration1,
        Integration2,
        Integration3,
        Dark.Mode,
        Foreoptic,
        Radiometric.Calibration,
        Units,
        Latitude,
        Longitude,
        Altitude,
        GPS.Time,
        Satellites,
        Calibrated.Reference.Correction.File,
        Channels,
        ScanNum
    )
)

bg_bands <- subset(
    bg_speclib, 
    select=-c(
        X,
    	UID,
        ScanNum,
    	sample_name,
    	PFT,
    	FncGrp1,
        Site
    ))


em_bands <- subset(
    em_speclib, 
    select=-c(
        X,
    	UID,
        ScanNum,
    	sample_name,
    	PFT,
    	FncGrp1,
        Site
    ))
    
bz_bands <- subset(
    bz_speclib, 
    select=-c(
        X,
    	UID,
        ScanNum,
    	sample_name,
    	PFT,
    	FncGrp1,
        Site
    ))
    
ch_bands <- subset(
    ch_speclib, 
    select=-c(
        X,
    	UID,
        ScanNum,
    	sample_name,
    	PFT,
    	FncGrp1,
        Site
    ))
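
The `subset` calls above all drop the same image metadata columns, so a small helper could do this by name. This is only a sketch: `drop_metadata` and `img_meta_cols` are hypothetical names, and the toy data frame stands in for `img_speclib`.

```r
# Hypothetical helper: drop metadata columns by name, keep the band columns.
drop_metadata <- function(df, meta_cols) {
    df[, !(colnames(df) %in% meta_cols), drop = FALSE]
}
img_meta_cols <- c("X", "UID", "ScanNum", "sample_name", "PFT", "FncGrp1", "Site")

# Toy stand-in for img_speclib with two band columns.
toy_speclib <- data.frame(
    X = 1:2,
    UID = c("BisonGulchPFTsBetula1", "BisonGulchPFTsBetula1"),
    ScanNum = c("1", "1"),
    sample_name = c("spec_1", "spec_2"),
    PFT = c("Betula", "Betula"),
    FncGrp1 = c("TreeBroadleaf", "TreeBroadleaf"),
    Site = c("BisonGulch", "BisonGulch"),
    X398 = c(0.058, 0.044),
    X399 = c(0.059, 0.047)
)
toy_bands <- drop_metadata(toy_speclib, img_meta_cols)
print(colnames(toy_bands))
```

Unlike `subset(select = -c(...))`, this does not fail if one of the listed columns happens to be missing from a particular data frame.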


Calculate the vegetation indices from the spectral libraries - it's easy with lecospectR!

Note that the image-based spectra are normalized from zero to one, and the ground spectra are on the range zero to one hundred.  

In [None]:
img_indices <- get_vegetation_indices(img_bands, NULL)# should have a default of NULL, you know?
grd_indices <- get_vegetation_indices(grd_bands, NULL)
bg_indices <- get_vegetation_indices(bg_bands, NULL)
ch_indices <- get_vegetation_indices(ch_bands, NULL)
bz_indices <- get_vegetation_indices(bz_bands, NULL)
em_indices <- get_vegetation_indices(em_bands, NULL)
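
The difference in scale between the image spectra (zero to one) and the ground spectra (zero to one hundred) does not affect every index equally. As a sketch, here is a hand-rolled NDVI, (NIR - Red) / (NIR + Red), on toy data. The `ndvi` helper and the band choices (670 nm for red, 800 nm for NIR) are assumptions for illustration; which indices `get_vegetation_indices` computes, and how, is not shown here.

```r
# Hypothetical NDVI helper; the band choices (670 nm red, 800 nm NIR) are assumptions.
ndvi <- function(bands, red = "X670", nir = "X800") {
    (bands[[nir]] - bands[[red]]) / (bands[[nir]] + bands[[red]])
}
toy_img <- data.frame(X670 = c(0.05, 0.10), X800 = c(0.45, 0.30))  # reflectance in [0, 1]
toy_grd <- 100 * toy_img                                           # same spectra in [0, 100]
print(ndvi(toy_img))
# a normalized-difference index is unchanged by a constant scale factor
print(all.equal(ndvi(toy_img), ndvi(toy_grd)))
```

Indices built from plain differences or sums of bands would not share this property, so the scale of the input matters for those.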

In [None]:
write.csv(img_indices, file="Data/gs/x_train/img_indices_only.csv")

write.csv(grd_indices, file="Data/gs/x_train/grd_indices_only.csv")

write.csv(bg_indices, file = "Data/gs/x_train/bison_gulch_indices.csv")

write.csv(ch_indices, file = "Data/gs/x_train/chatanika_indices.csv")

write.csv(em_indices, file = "Data/gs/x_train/eight_mile_indices.csv")

write.csv(bz_indices, file = "Data/gs/x_train/bonanza_indices.csv")

In [None]:
head(img_indices)
head(grd_indices)

The vegetation indices alone are actually enough to start training models.  Instead of doing that right away, let's transform the data and write it to file.  Then we will proceed to creating the model corrections, etc.

In [None]:
img_resampled_bands <- resample_df(img_bands, drop_existing=TRUE)
grd_resampled_bands <- resample_df(0.01*grd_bands, drop_existing=TRUE)# the 0.01 factor corrects the scale difference (poorly)
bg_resampled_bands <- resample_df(bg_bands, drop_existing=TRUE)
ch_resampled_bands <- resample_df(ch_bands, drop_existing=TRUE)
bz_resampled_bands <- resample_df(bz_bands, drop_existing=TRUE)
em_resampled_bands <- resample_df(em_bands, drop_existing=TRUE)

head(img_resampled_bands)
head(grd_resampled_bands)

In [None]:
img_raw_with_na <- cbind(img_resampled_bands, img_indices)
grd_raw_with_na <- cbind(grd_resampled_bands, grd_indices)
bg_raw_with_na <- cbind(bg_resampled_bands, bg_indices)
ch_raw_with_na <- cbind(ch_resampled_bands, ch_indices)
em_raw_with_na <- cbind(em_resampled_bands, em_indices)
bz_raw_with_na <- cbind(bz_resampled_bands, bz_indices)

In [None]:
img_raw <- impute_spectra(img_raw_with_na)
grd_raw <- impute_spectra(inf_to_na(grd_raw_with_na))# note also dropping an Inf (likely div by 0 in veg index)
bg_raw <- impute_spectra(bg_raw_with_na)
bz_raw <- impute_spectra(bz_raw_with_na)
em_raw <- impute_spectra(em_raw_with_na)
ch_raw <- impute_spectra(ch_raw_with_na)
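
To make the Inf handling and imputation step concrete, here is a minimal sketch on toy data. The two helpers are hypothetical stand-ins for `inf_to_na` and `impute_spectra`; in particular, filling with the column median is an assumption, and the lecospectR functions may impute differently.

```r
# Hypothetical stand-ins for inf_to_na() and impute_spectra(); the median fill is an assumption.
inf_to_na_sketch <- function(df) {
    df[] <- lapply(df, function(col) {
        col[is.infinite(col)] <- NA   # e.g. an index that divided by zero
        col
    })
    df
}
impute_median_sketch <- function(df) {
    df[] <- lapply(df, function(col) {
        col[is.na(col)] <- median(col, na.rm = TRUE)
        col
    })
    df
}
toy <- data.frame(idx_a = c(1, Inf, 3), idx_b = c(NA, 2, 4))
toy_clean <- impute_median_sketch(inf_to_na_sketch(toy))
print(toy_clean)
```

Converting Inf to NA first matters: an infinite value would otherwise survive the NA check and distort any later scaling.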

In [None]:
write.csv(bg_raw, file="Data/gs/x_train/bison_gulch.csv")
write.csv(as.data.frame(bg_speclib$FncGrp1), file="Data/gs/y_train/bison_gulch.csv")
write.csv(bz_raw, file="Data/gs/x_train/bonanza.csv")
write.csv(as.data.frame(bz_speclib$FncGrp1), file="Data/gs/y_train/bonanza.csv")
write.csv(ch_raw, file="Data/gs/x_train/chatanika.csv")
write.csv(as.data.frame(ch_speclib$FncGrp1), file="Data/gs/y_train/chatanika.csv")
write.csv(em_raw, file="Data/gs/x_train/eight_mile.csv")
write.csv(as.data.frame(em_speclib$FncGrp1), file="Data/gs/y_train/eight_mile.csv")

Apply the outlier transforms

In [None]:
grd_clipped <- clip_outliers(grd_raw)
grd_imputed <- impute_outliers_and_na(grd_raw)
grd_dropped <- grd_raw[detect_outliers_columnwise(grd_raw),]
img_clipped <- clip_outliers(img_raw)
img_imputed <- impute_outliers_and_na(img_raw)
img_dropped <- img_raw[detect_outliers_columnwise(img_raw),]
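
For reference, the `clipped` and `dropped` treatments can be sketched with Inter-Quartile Range fences in base R. The helper names are hypothetical, and the 1.5 * IQR multiplier is the common convention rather than something confirmed for `clip_outliers` or `detect_outliers_columnwise`.

```r
# Hypothetical helpers; the 1.5 * IQR fence multiplier is an assumption.
iqr_fences <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    c(lower = q[1] - k * (q[2] - q[1]), upper = q[2] + k * (q[2] - q[1]))
}
clip_to_fences <- function(x) {
    f <- iqr_fences(x)
    pmin(pmax(x, f[["lower"]]), f[["upper"]])
}
rows_without_outliers <- function(df) {
    # TRUE for rows where every column lies inside its own fences
    inside <- lapply(df, function(col) {
        f <- iqr_fences(col)
        col >= f[["lower"]] & col <= f[["upper"]]
    })
    Reduce(`&`, inside)
}
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))
toy_clipped <- as.data.frame(lapply(toy, clip_to_fences))  # outlier pulled in to the fence
toy_dropped <- toy[rows_without_outliers(toy), ]           # outlier row removed
print(toy_clipped)
print(toy_dropped)
```

Clipping keeps the row count unchanged, while dropping shortens the data frame, which is why the dropped data needs its own matching set of labels.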

Now the center/scale transforms

In [None]:
grd_raw_robust <- columnwise_robust_scale(grd_raw)
img_raw_robust <- columnwise_robust_scale(img_raw)
grd_raw_minmax <- columnwise_min_max_scale(grd_raw)
img_raw_minmax <- columnwise_min_max_scale(img_raw)
grd_raw_standard <- standardize_df(grd_raw)
img_raw_standard <- standardize_df(img_raw)

grd_clipped_robust <- columnwise_robust_scale(grd_clipped)
grd_imputed_robust <- columnwise_robust_scale(grd_imputed)
grd_dropped_robust <- columnwise_robust_scale(grd_dropped)
img_clipped_robust <- columnwise_robust_scale(img_clipped)
img_imputed_robust <- columnwise_robust_scale(img_imputed)
img_dropped_robust <- columnwise_robust_scale(img_dropped)

grd_clipped_minmax <- columnwise_min_max_scale(grd_clipped)
grd_imputed_minmax <- columnwise_min_max_scale(grd_imputed)
grd_dropped_minmax <- columnwise_min_max_scale(grd_dropped)
img_clipped_minmax <- columnwise_min_max_scale(img_clipped)
img_imputed_minmax <- columnwise_min_max_scale(img_imputed)
img_dropped_minmax <- columnwise_min_max_scale(img_dropped)

grd_clipped_standard <- standardize_df(grd_clipped)
grd_imputed_standard <- standardize_df(grd_imputed)
grd_dropped_standard <- standardize_df(grd_dropped)
img_clipped_standard <- standardize_df(img_clipped)
img_imputed_standard <- standardize_df(img_imputed)
img_dropped_standard <- standardize_df(img_dropped)


Now, let's save all these data to disk

In [None]:
BASE_PATH <- "Data/gs/"
X_TRAIN_PATH <- paste0(BASE_PATH, "x_train/")
Y_TRAIN_PATH <- paste0(BASE_PATH, "y_train/")

X_TEST_PATH <- paste0(BASE_PATH, "x_test/")
Y_TEST_PATH <- paste0(BASE_PATH, "y_test/")

if(!dir.exists(BASE_PATH)){
    dir.create(BASE_PATH)
}
if(!dir.exists(X_TRAIN_PATH)){
    dir.create(X_TRAIN_PATH)
}
if(!dir.exists(Y_TRAIN_PATH)){
    dir.create(Y_TRAIN_PATH)
}
if(!dir.exists(X_TEST_PATH)){
    dir.create(X_TEST_PATH)
}
if(!dir.exists(Y_TEST_PATH)){
    dir.create(Y_TEST_PATH)
}
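
The same folder setup can be written as a loop. This sketch uses a temporary directory so it can run anywhere; the folder names mirror the `Data/gs/` layout above, and `base_path` and `sub_dirs` are illustrative names.

```r
# Sketch using a temporary directory; the layout mirrors Data/gs/.
base_path <- file.path(tempdir(), "gs")
sub_dirs <- file.path(base_path, c("x_train", "y_train", "x_test", "y_test"))
for (d in sub_dirs) {
    # recursive = TRUE also creates base_path; showWarnings = FALSE keeps reruns quiet
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
print(dir.exists(sub_dirs))
```

Like the `dir.exists` checks above, this can safely be run multiple times.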


In [None]:
write.csv(grd_clipped, file=paste0(X_TRAIN_PATH, "grd_clipped_raw.csv"))
write.csv(grd_clipped_minmax, file=paste0(X_TRAIN_PATH, "grd_clipped_minmax.csv"))
write.csv(grd_clipped_robust, file=paste0(X_TRAIN_PATH, "grd_clipped_robust.csv"))
write.csv(grd_clipped_standard, file=paste0(X_TRAIN_PATH, "grd_clipped_standard.csv"))

write.csv(grd_imputed, file=paste0(X_TRAIN_PATH, "grd_imputed_raw.csv"))
write.csv(grd_imputed_minmax, file=paste0(X_TRAIN_PATH, "grd_imputed_minmax.csv"))
write.csv(grd_imputed_robust, file=paste0(X_TRAIN_PATH, "grd_imputed_robust.csv"))
write.csv(grd_imputed_standard, file=paste0(X_TRAIN_PATH, "grd_imputed_standard.csv"))

write.csv(grd_dropped, file=paste0(X_TRAIN_PATH, "grd_dropped_raw.csv"))
write.csv(grd_dropped_minmax, file=paste0(X_TRAIN_PATH, "grd_dropped_minmax.csv"))
write.csv(grd_dropped_robust, file=paste0(X_TRAIN_PATH, "grd_dropped_robust.csv"))
write.csv(grd_dropped_standard, file=paste0(X_TRAIN_PATH, "grd_dropped_standard.csv"))

write.csv(grd_raw, file=paste0(X_TRAIN_PATH, "grd_raw_raw.csv"))
write.csv(grd_raw_minmax, file=paste0(X_TRAIN_PATH, "grd_raw_minmax.csv"))
write.csv(grd_raw_robust, file=paste0(X_TRAIN_PATH, "grd_raw_robust.csv"))
write.csv(grd_raw_standard, file=paste0(X_TRAIN_PATH, "grd_raw_standard.csv"))
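
With this many treatment/scaling combinations, one `write.csv` call per file is easy to get wrong. A possible alternative, sketched here with toy data frames and a temporary output folder (both assumptions for illustration), is to keep the data frames in a named list and derive each file name from the list name.

```r
# Sketch: write every treatment/scaling combination in a loop.
# The toy data frames and the temporary output folder are assumptions.
out_dir <- file.path(tempdir(), "x_train_sketch")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
datasets <- list(
    grd_clipped_raw = toy,
    grd_clipped_minmax = toy / max(toy),
    grd_raw_raw = toy
)
for (name in names(datasets)) {
    # the list name doubles as the file name, so data and file cannot drift apart
    write.csv(datasets[[name]], file = file.path(out_dir, paste0(name, ".csv")))
}
written <- list.files(out_dir)
print(written)
```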

In [None]:
write.csv(grd_raw[,colnames(grd_indices)], file=paste0(X_TRAIN_PATH, "grd_indices_only.csv"))


## Labels for the above Data

In [None]:
img_targets <- img_speclib$FncGrp1 %>% as.factor()
grd_targets <- grd_speclib$Functional_group1 %>% as.factor()

In [None]:
write.csv(img_targets, file="Data/gs/y_train/img_indices_only.csv")
write.csv(grd_targets, file="Data/gs/y_train/grd_indices_only.csv")

In [None]:
img_targets %>% table()

In [None]:
grd_targets %>% table()

In [None]:
# drop entries with outliers to match training data
img_targets_dropped <- img_targets[detect_outliers_columnwise(img_raw)]
grd_targets_dropped <- grd_targets[detect_outliers_columnwise(grd_raw)]

In [None]:
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_clipped_raw.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_clipped_minmax.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_clipped_robust.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_clipped_standard.csv"))

write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_imputed_raw.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_imputed_minmax.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_imputed_robust.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_imputed_standard.csv"))

write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_raw_raw.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_raw_minmax.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_raw_robust.csv"))
write.csv(grd_targets, file=paste0(Y_TRAIN_PATH, "grd_raw_standard.csv"))

write.csv(grd_targets_dropped, file=paste0(Y_TRAIN_PATH, "grd_dropped_raw.csv"))
write.csv(grd_targets_dropped, file=paste0(Y_TRAIN_PATH, "grd_dropped_minmax.csv"))
write.csv(grd_targets_dropped, file=paste0(Y_TRAIN_PATH, "grd_dropped_robust.csv"))
write.csv(grd_targets_dropped, file=paste0(Y_TRAIN_PATH, "grd_dropped_standard.csv"))

In [None]:
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_clipped_raw.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_clipped_minmax.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_clipped_robust.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_clipped_standard.csv"))

write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_imputed_raw.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_imputed_minmax.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_imputed_robust.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_imputed_standard.csv"))

write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_raw_raw.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_raw_minmax.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_raw_robust.csv"))
write.csv(img_targets, file=paste0(Y_TRAIN_PATH, "img_raw_standard.csv"))

write.csv(img_targets_dropped, file=paste0(Y_TRAIN_PATH, "img_dropped_raw.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TRAIN_PATH, "img_dropped_minmax.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TRAIN_PATH, "img_dropped_robust.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TRAIN_PATH, "img_dropped_standard.csv"))

## Test Data

Build the test data, and save it with the same names as the training data

In [None]:
set.seed(61718L)

permutation <-  permute::shuffle(length(img_targets))
sample <- create_stratified_sample(
    img_targets, 
    permutation = permutation,
    samples_per_pft = 15)
# split the data based on the above sample
img_targets_test <- img_targets[permutation][sample]
img_targets_train <- img_targets[permutation][-sample]
img_raw_test <- img_raw[permutation,][sample,]
img_raw_train <- img_raw[permutation,][-sample,]
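
As a rough picture of what a stratified split does, here is a base-R sketch that draws a fixed number of rows per class for the test set and keeps the rest for training. `stratified_indices` is a hypothetical stand-in; it does not reproduce the `permutation` argument or the internals of `create_stratified_sample`.

```r
# Hypothetical stand-in for create_stratified_sample(): pick up to n rows per class.
stratified_indices <- function(labels, n_per_class) {
    by_class <- split(seq_along(labels), labels)
    picked <- lapply(by_class, function(idx) {
        # classes with fewer than n_per_class rows contribute all of their rows
        idx[sample.int(length(idx), min(n_per_class, length(idx)))]
    })
    sort(unlist(picked, use.names = FALSE))
}

set.seed(61718L)
toy_targets <- factor(rep(c("Lichen", "ShrubDecid", "TreeBroadleaf"), times = c(10, 6, 2)))
toy_x <- data.frame(value = seq_along(toy_targets))

test_idx <- stratified_indices(toy_targets, n_per_class = 3)
toy_x_test <- toy_x[test_idx, , drop = FALSE]
toy_x_train <- toy_x[-test_idx, , drop = FALSE]
toy_y_test <- toy_targets[test_idx]
print(table(toy_y_test))
```

The important part, as in the cell above, is that the same indices are used to split both the features and the labels.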


In [None]:
img_targets_test %>% as.factor() %>% table()

In [None]:
# create the subsampled data and save them for each processing type/treatment

# clipped
img_clipped_train <- img_clipped[permutation,][-sample,]
img_clipped_test <- img_clipped[permutation,][sample,]
img_clipped_minmax_train <- img_clipped_minmax[permutation,][-sample,]
img_clipped_minmax_test <- img_clipped_minmax[permutation,][sample,]
img_clipped_robust_train <- img_clipped_robust[permutation,][-sample,]
img_clipped_robust_test <- img_clipped_robust[permutation,][sample,]
img_clipped_standard_train <- img_clipped_standard[permutation,][-sample,]
img_clipped_standard_test <- img_clipped_standard[permutation,][sample,]

# raw (note one is done in the previous cell)
img_raw_minmax_train <- img_raw_minmax[permutation,][-sample,]
img_raw_minmax_test <- img_raw_minmax[permutation,][sample,]
img_raw_robust_train <- img_raw_robust[permutation,][-sample,]
img_raw_robust_test <- img_raw_robust[permutation,][sample,]
img_raw_standard_train <- img_raw_standard[permutation,][-sample,]
img_raw_standard_test <- img_raw_standard[permutation,][sample,]

#imputed
img_imputed_train <- img_imputed[permutation,][-sample,]
img_imputed_test <- img_imputed[permutation,][sample,]
img_imputed_minmax_train <- img_imputed_minmax[permutation,][-sample,]
img_imputed_minmax_test <- img_imputed_minmax[permutation,][sample,]
img_imputed_robust_train <- img_imputed_robust[permutation,][-sample,]
img_imputed_robust_test <- img_imputed_robust[permutation,][sample,]
img_imputed_standard_train <- img_imputed_standard[permutation,][-sample,]
img_imputed_standard_test <- img_imputed_standard[permutation,][sample,]



In [None]:
print(length(img_targets_test))
print(nrow(img_clipped_robust_test))

### Image-based Training Data

In [None]:
write.csv(img_clipped_train, file=paste0(X_TRAIN_PATH, "img_clipped_raw.csv"))
write.csv(img_clipped_minmax_train, file=paste0(X_TRAIN_PATH, "img_clipped_minmax.csv"))
write.csv(img_clipped_robust_train, file=paste0(X_TRAIN_PATH, "img_clipped_robust.csv"))
write.csv(img_clipped_standard_train, file=paste0(X_TRAIN_PATH, "img_clipped_standard.csv"))

write.csv(img_imputed_train, file=paste0(X_TRAIN_PATH, "img_imputed_raw.csv"))
write.csv(img_imputed_minmax_train, file=paste0(X_TRAIN_PATH, "img_imputed_minmax.csv"))
write.csv(img_imputed_robust_train, file=paste0(X_TRAIN_PATH, "img_imputed_robust.csv"))
write.csv(img_imputed_standard_train, file=paste0(X_TRAIN_PATH, "img_imputed_standard.csv"))

write.csv(img_dropped, file=paste0(X_TRAIN_PATH, "img_dropped_raw.csv"))
write.csv(img_dropped_minmax, file=paste0(X_TRAIN_PATH, "img_dropped_minmax.csv"))
write.csv(img_dropped_robust, file=paste0(X_TRAIN_PATH, "img_dropped_robust.csv"))
write.csv(img_dropped_standard, file=paste0(X_TRAIN_PATH, "img_dropped_standard.csv"))

write.csv(img_raw_train, file=paste0(X_TRAIN_PATH, "img_raw_raw.csv"))
write.csv(img_raw_minmax_train, file=paste0(X_TRAIN_PATH, "img_raw_minmax.csv"))
write.csv(img_raw_robust_train, file=paste0(X_TRAIN_PATH, "img_raw_robust.csv"))
write.csv(img_raw_standard_train, file=paste0(X_TRAIN_PATH, "img_raw_standard.csv"))

In [None]:
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_clipped_raw.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_clipped_minmax.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_clipped_robust.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_clipped_standard.csv"))

write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_imputed_raw.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_imputed_minmax.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_imputed_robust.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_imputed_standard.csv"))

#write.csv(img_dropped, file=paste0(X_TRAIN_PATH, "img_dropped_raw.csv"))
#write.csv(img_dropped_minmax, file=paste0(X_TRAIN_PATH, "img_dropped_minmax.csv"))
#write.csv(img_dropped_robust, file=paste0(X_TRAIN_PATH, "img_dropped_robust.csv"))
#write.csv(img_dropped_standard, file=paste0(X_TRAIN_PATH, "img_dropped_standard.csv"))

write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_raw_raw.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_raw_minmax.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_raw_robust.csv"))
write.csv(img_targets_train, file=paste0(Y_TRAIN_PATH, "img_raw_standard.csv"))

### Image Based Test Data
Note: this image-based test set is used for all the models (ground included)

In [None]:
write.csv(img_clipped_test, file=paste0(X_TEST_PATH, "img_clipped_raw.csv"))
write.csv(img_clipped_minmax_test, file=paste0(X_TEST_PATH, "img_clipped_minmax.csv"))
write.csv(img_clipped_robust_test, file=paste0(X_TEST_PATH, "img_clipped_robust.csv"))
write.csv(img_clipped_standard_test, file=paste0(X_TEST_PATH, "img_clipped_standard.csv"))

write.csv(img_imputed_test, file=paste0(X_TEST_PATH, "img_imputed_raw.csv"))
write.csv(img_imputed_minmax_test, file=paste0(X_TEST_PATH, "img_imputed_minmax.csv"))
write.csv(img_imputed_robust_test, file=paste0(X_TEST_PATH, "img_imputed_robust.csv"))
write.csv(img_imputed_standard_test, file=paste0(X_TEST_PATH, "img_imputed_standard.csv"))

write.csv(img_dropped, file=paste0(X_TEST_PATH, "img_dropped_raw.csv"))
write.csv(img_dropped_minmax, file=paste0(X_TEST_PATH, "img_dropped_minmax.csv"))
write.csv(img_dropped_robust, file=paste0(X_TEST_PATH, "img_dropped_robust.csv"))
write.csv(img_dropped_standard, file=paste0(X_TEST_PATH, "img_dropped_standard.csv"))

write.csv(img_raw_test, file=paste0(X_TEST_PATH, "img_raw_raw.csv"))
write.csv(img_raw_minmax_test, file=paste0(X_TEST_PATH, "img_raw_minmax.csv"))
write.csv(img_raw_robust_test, file=paste0(X_TEST_PATH, "img_raw_robust.csv"))
write.csv(img_raw_standard_test, file=paste0(X_TEST_PATH, "img_raw_standard.csv"))

### Ground test (from the images)

In [None]:
write.csv(img_clipped_test, file=paste0(X_TEST_PATH, "grd_clipped_raw.csv"))
write.csv(img_clipped_minmax_test, file=paste0(X_TEST_PATH, "grd_clipped_minmax.csv"))
write.csv(img_clipped_robust_test, file=paste0(X_TEST_PATH, "grd_clipped_robust.csv"))
write.csv(img_clipped_standard_test, file=paste0(X_TEST_PATH, "grd_clipped_standard.csv"))

write.csv(img_imputed_test, file=paste0(X_TEST_PATH, "grd_imputed_raw.csv"))
write.csv(img_imputed_minmax_test, file=paste0(X_TEST_PATH, "grd_imputed_minmax.csv"))
write.csv(img_imputed_robust_test, file=paste0(X_TEST_PATH, "grd_imputed_robust.csv"))
write.csv(img_imputed_standard_test, file=paste0(X_TEST_PATH, "grd_imputed_standard.csv"))

write.csv(img_dropped, file=paste0(X_TEST_PATH, "grd_dropped_raw.csv"))
write.csv(img_dropped_minmax, file=paste0(X_TEST_PATH, "grd_dropped_minmax.csv"))
write.csv(img_dropped_robust, file=paste0(X_TEST_PATH, "grd_dropped_robust.csv"))
write.csv(img_dropped_standard, file=paste0(X_TEST_PATH, "grd_dropped_standard.csv"))

write.csv(img_raw_test, file=paste0(X_TEST_PATH, "grd_raw_raw.csv"))
write.csv(img_raw_minmax_test, file=paste0(X_TEST_PATH, "grd_raw_minmax.csv"))
write.csv(img_raw_robust_test, file=paste0(X_TEST_PATH, "grd_raw_robust.csv"))
write.csv(img_raw_standard_test, file=paste0(X_TEST_PATH, "grd_raw_standard.csv"))

In [None]:
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_clipped_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_clipped_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_clipped_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_clipped_standard.csv"))

write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_imputed_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_imputed_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_imputed_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_imputed_standard.csv"))

write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_raw_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_raw_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_raw_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "grd_raw_standard.csv"))

write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "grd_dropped_raw.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "grd_dropped_minmax.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "grd_dropped_robust.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "grd_dropped_standard.csv"))

In [None]:
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_clipped_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_clipped_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_clipped_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_clipped_standard.csv"))

write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_imputed_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_imputed_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_imputed_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_imputed_standard.csv"))

write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_raw_raw.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_raw_minmax.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_raw_robust.csv"))
write.csv(img_targets_test, file=paste0(Y_TEST_PATH, "img_raw_standard.csv"))

write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "img_dropped_raw.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "img_dropped_minmax.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "img_dropped_robust.csv"))
write.csv(img_targets_dropped, file=paste0(Y_TEST_PATH, "img_dropped_standard.csv"))

In [None]:
bg_raw <- read.csv("Data/gs/x_train/bison_gulch.csv", header = TRUE)
bg_targets <- read.csv("Data/gs/y_train/bison_gulch.csv")$bg_speclib.FncGrp1

bz_raw <- read.csv("Data/gs/x_train/bonanza.csv", header = TRUE)
bz_targets <- read.csv("Data/gs/y_train/bonanza.csv")$bz_speclib.FncGrp1

ch_raw <- read.csv("Data/gs/x_train/chatanika.csv", header = TRUE)
ch_targets <- read.csv("Data/gs/y_train/chatanika.csv")$ch_speclib.FncGrp1

em_raw <- read.csv("Data/gs/x_train/eight_mile.csv", header = TRUE)
em_targets <- read.csv("Data/gs/y_train/eight_mile.csv")$em_speclib.FncGrp1

In [None]:
print(bg_targets)

In [None]:
print(length(bg_targets))
print(length(bz_targets))
print(length(ch_targets))
print(length(em_targets))

In [None]:
bg_permutation <-  permute::shuffle(length(bg_targets)) %>% as.vector()
bg_sample <- create_stratified_sample(
    bg_targets, 
    permutation = bg_permutation,
    samples_per_pft = 18)

bz_permutation <-  permute::shuffle(length(bz_targets)) %>% as.vector()
bz_sample <- create_stratified_sample(
    bz_targets, 
    permutation = bz_permutation,
    samples_per_pft = 18)

ch_permutation <-  permute::shuffle(length(ch_targets)) %>% as.vector()
ch_sample <- create_stratified_sample(
    ch_targets, 
    permutation = ch_permutation,
    samples_per_pft = 18)

em_permutation <-  permute::shuffle(length(em_targets)) %>% as.vector()
em_sample <- create_stratified_sample(
    em_targets, 
    permutation = em_permutation,
    samples_per_pft = 18)

In [None]:
print(bg_permutation)

In [None]:
bg_targets_test <- bg_targets[bg_permutation][bg_sample]
bg_targets_train <- bg_targets[bg_permutation][-bg_sample]
bg_raw_test <- bg_raw[bg_permutation,][bg_sample,]
bg_raw_train <- bg_raw[bg_permutation,][-bg_sample,]

bz_targets_test <- bz_targets[bz_permutation][bz_sample]
bz_targets_train <- bz_targets[bz_permutation][-bz_sample]
bz_raw_test <- bz_raw[bz_permutation,][bz_sample,]
bz_raw_train <- bz_raw[bz_permutation,][-bz_sample,]

ch_targets_test <- ch_targets[ch_permutation][ch_sample]
ch_targets_train <- ch_targets[ch_permutation][-ch_sample]
ch_raw_test <- ch_raw[ch_permutation,][ch_sample,]
ch_raw_train <- ch_raw[ch_permutation,][-ch_sample,]

em_targets_test <- em_targets[em_permutation][em_sample]
em_targets_train <- em_targets[em_permutation][-em_sample]
em_raw_test <- em_raw[em_permutation,][em_sample,]
em_raw_train <- em_raw[em_permutation,][-em_sample,]

In [None]:
bg_targets_test %>% as.factor() %>% table()
bz_targets_test %>% as.factor() %>% table()
ch_targets_test %>% as.factor() %>% table()
em_targets_test %>% as.factor() %>% table()

In [None]:
write.csv(bg_targets_test, file=paste0(Y_TRAIN_PATH, "bison_gulch_stratified.csv"), row.names=FALSE )
write.csv(bg_raw_test, file=paste0(X_TRAIN_PATH, "bison_gulch_stratified.csv"), row.names = FALSE )

write.csv(bz_targets_test, file=paste0(Y_TRAIN_PATH, "bonanza_stratified.csv"), row.names=FALSE )
write.csv(bz_raw_test, file=paste0(X_TRAIN_PATH, "bonanza_stratified.csv"), row.names = FALSE )

write.csv(ch_targets_test, file=paste0(Y_TRAIN_PATH, "chatanika_stratified.csv"), row.names=FALSE )
write.csv(ch_raw_test, file=paste0(X_TRAIN_PATH, "chatanika_stratified.csv"), row.names = FALSE )

write.csv(em_targets_test, file=paste0(Y_TRAIN_PATH, "eight_mile_stratified.csv"), row.names=FALSE )
write.csv(em_raw_test, file=paste0(X_TRAIN_PATH, "eight_mile_stratified.csv"), row.names = FALSE )

# the rows left out of the stratified sample go to the test folders
write.csv(bg_raw_train, file=paste0(X_TEST_PATH, "bison_gulch.csv"))
write.csv(bg_targets_train, file=paste0(Y_TEST_PATH, "bison_gulch.csv"))

write.csv(bz_raw_train, file=paste0(X_TEST_PATH, "bonanza.csv"))
write.csv(bz_targets_train, file=paste0(Y_TEST_PATH, "bonanza.csv"))

write.csv(ch_raw_train, file=paste0(X_TEST_PATH, "chatanika.csv"))
write.csv(ch_targets_train, file=paste0(Y_TEST_PATH, "chatanika.csv"))

write.csv(em_raw_train, file=paste0(X_TEST_PATH, "eight_mile.csv"))
write.csv(em_targets_train, file=paste0(Y_TEST_PATH, "eight_mile.csv"))



In [None]:
# need to write the targets for training
clip_transform <- create_clip_transform(
    img_raw
)

In [None]:
save(clip_transform, file="./mle/clip_transform.rda")

In [None]:
clipped_2 <- clip_transform(img_raw)# clipped 2
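
Since `clip_transform` is called like a function, it presumably stores the fences learned from `img_raw` and reapplies them to new data. A minimal sketch of that pattern as a closure follows; `make_clip_transform` is a hypothetical name, and the 1.5 * IQR fences are an assumption rather than the lecospectR implementation.

```r
# Hypothetical sketch of a clip transform built as a closure; not the lecospectR implementation.
make_clip_transform <- function(train_df, k = 1.5) {
    # compute the fences once, from the training data only
    fences <- lapply(train_df, function(col) {
        q <- quantile(col, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
        c(q[1] - k * (q[2] - q[1]), q[2] + k * (q[2] - q[1]))
    })
    function(new_df) {
        # reuse the stored fences on any data frame with the same columns
        for (name in names(fences)) {
            new_df[[name]] <- pmin(pmax(new_df[[name]], fences[[name]][1]), fences[[name]][2])
        }
        new_df
    }
}
train <- data.frame(a = c(1, 2, 3, 4, 5))
clip_sketch <- make_clip_transform(train)
clipped_new <- clip_sketch(data.frame(a = c(-50, 3, 50)))
print(clipped_new)
```

Because the fences live in the closure's environment, saving the function with `save()` also saves the fences, so new data can be clipped later without access to the training set.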

## Sensor Correction

In this section, we build the sensor-correction models (with some data transforms along the way) and use them to create the corrected data.  This is done three times, once per outlier treatment.

We do this first for the raw (including outliers) data.

In [None]:
grd_resampled_to_match_img_bands <- resample_df(
    grd_bands,
    min_wavelength = 398,
    max_wavelength = 999,
    delta=1,
    drop_existing = TRUE
)
head(grd_resampled_to_match_img_bands)
head(img_bands)
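
Resampling the ground spectra onto the image bands (398 to 999 nm in 1 nm steps) amounts to interpolating each spectrum onto a new wavelength grid. Here is a sketch for a single toy spectrum using linear interpolation with `stats::approx`. `resample_spectrum` is a hypothetical helper, and `resample_df` may well use a different method.

```r
# Hypothetical sketch: resample one spectrum onto a regular grid by linear interpolation.
# resample_df() may use a different method; stats::approx() is a stand-in.
resample_spectrum <- function(wavelengths, reflectance, min_wavelength, max_wavelength, delta = 1) {
    target <- seq(min_wavelength, max_wavelength, by = delta)
    approx(x = wavelengths, y = reflectance, xout = target)$y
}
src_wl <- c(396, 400, 404)          # toy source wavelengths (nm)
src_refl <- c(0.10, 0.14, 0.18)     # toy reflectance values
out <- resample_spectrum(src_wl, src_refl, min_wavelength = 398, max_wavelength = 402)
print(out)
```

Once both data sets share the same wavelength grid, their columns can be matched one to one, which is what the column renaming in the next cell relies on.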

In [None]:
colnames(grd_resampled_to_match_img_bands) <- colnames(img_bands)

In [None]:
grd_resampled_to_match_img_bands$targets <- grd_targets
img_bands_with_targets <- img_bands
img_bands_with_targets$targets <- img_targets

In [None]:
matched_data <- create_matched_data(
    img_bands_with_targets,
    grd_resampled_to_match_img_bands,
    cols=c("targets","targets")# assumes joining on columns named "targets" in each data.frame
)
head(matched_data$left)
head(matched_data$right)

In [None]:
correction_model <- build_columnwise_sensor_correction_model(
    matched_data$left,
    matched_data$right,
    grouping_variables =c("targets","targets")
)
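
As described in the Notation section, `corrected` data has had a linear transformation applied to account for sensor calibration. A column-wise version of that idea can be sketched as one simple linear regression per band, mapping ground values onto image values. The two helpers below are hypothetical and ignore the grouping variables; they are not the lecospectR functions.

```r
# Hypothetical sketch: one linear model per band mapping ground values onto image values.
fit_columnwise_correction <- function(img_df, grd_df) {
    bands <- colnames(img_df)
    lapply(setNames(bands, bands), function(band) {
        # returns c(intercept, slope) for this band
        unname(coef(lm(img_df[[band]] ~ grd_df[[band]])))
    })
}
apply_columnwise_correction <- function(models, grd_df) {
    for (band in names(models)) {
        grd_df[[band]] <- models[[band]][1] + models[[band]][2] * grd_df[[band]]
    }
    grd_df
}
# toy matched data: the "image" sensor reads 0.5 * ground + 0.1 in both bands
toy_grd <- data.frame(X398 = c(0.2, 0.4, 0.6, 0.8), X399 = c(0.1, 0.3, 0.5, 0.9))
toy_img <- 0.5 * toy_grd + 0.1
models <- fit_columnwise_correction(toy_img, toy_grd)
toy_corrected <- apply_columnwise_correction(models, toy_grd)
print(all.equal(toy_corrected, toy_img))
```

In this toy case the relationship is exactly linear, so the corrected ground data reproduces the image data; with real spectra the fit only removes the average offset and gain difference per band.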

In [None]:
print(correction_model)

In [None]:
grd_corrected_bands <- apply_sensor_correction_model(
    correction_model,
    grd_resampled_to_match_img_bands,
    ignore_cols=c("targets")
)
head(grd_corrected_bands)

In [None]:
grd_corrected_indices <- get_vegetation_indices(grd_corrected_bands, NULL)
head(grd_corrected_indices)

In [None]:
grd_corrected_resampled_bands <- resample_df(grd_corrected_bands, drop_existing=TRUE)
head(grd_corrected_resampled_bands)

In [None]:
write.csv(
    cbind(grd_corrected_resampled_bands, grd_corrected_indices), 
    file=paste0(X_TRAIN_PATH, "grd_raw_corrected.csv")
    )

# save labels also
write.csv(
    grd_resampled_to_match_img_bands$targets,
    file=paste0(Y_TRAIN_PATH, "grd_raw_corrected.csv"))

Now that that is done, we will move on to the clipped data

In [None]:
grd_resampled_to_img_clipped <- resample_df(
    clip_outliers(grd_bands),
    min_wavelength = 398,
    max_wavelength = 999,
    delta=1,
    drop_existing = TRUE
)

colnames(grd_resampled_to_img_clipped) <- colnames(img_bands)

grd_resampled_to_img_clipped$targets <- grd_targets
img_bands_with_targets <- img_bands
img_bands_with_targets$targets <- img_targets

matched_data_clipped <- create_matched_data(
    img_bands_with_targets,
    grd_resampled_to_img_clipped,
    cols=c("targets","targets")# assumes joining on columns named "targets" in each data.frame
)

correction_model <- build_columnwise_sensor_correction_model(
    matched_data_clipped$left,
    matched_data_clipped$right
)
grd_corrected_clipped_bands <- apply_sensor_correction_model(
    correction_model,
    grd_resampled_to_img_clipped,
    ignore_cols=c("targets")
)
grd_corrected_clipped_indices <- get_vegetation_indices(grd_corrected_clipped_bands, NULL)
grd_corrected_clipped_resampled_bands <- resample_df(grd_corrected_clipped_bands, drop_existing=TRUE)




In [None]:
write.csv(
    cbind(
        grd_corrected_clipped_resampled_bands,
        grd_corrected_clipped_indices
    ),
    file=paste0(
        X_TRAIN_PATH,
        "grd_clipped_corrected.csv"
    )
)

# save labels also
write.csv(
    grd_resampled_to_match_img_bands$targets,
    file=paste0(Y_TRAIN_PATH, "grd_clipped_corrected.csv"))

And finally the dropped outlier one

notes for later: should probably try PCA here.  clip -> scale -> PCA -> subset (and scale again for models like SVM and kNN)
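
A possible shape for that PCA idea, sketched on toy data with `stats::prcomp`. The 95% explained-variance cutoff and the toy columns are assumptions, and the clipping step is left out for brevity.

```r
# Sketch of the scale -> PCA -> subset idea on toy data (clipping omitted for brevity).
set.seed(1L)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy$d <- toy$a + 0.01 * rnorm(50)                     # nearly redundant column
pca <- prcomp(toy, center = TRUE, scale. = TRUE)       # z-score scaling, then PCA
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(explained >= 0.95)[1]                  # fewest components reaching 95% (assumed cutoff)
toy_pcs <- as.data.frame(pca$x[, seq_len(n_keep), drop = FALSE])
toy_pcs_scaled <- as.data.frame(scale(toy_pcs))        # scale again for distance-based models
print(n_keep)
```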

...to be continued