# Compare speed and similarity of results for two cross-validation schemes

Cross-validation serves two purposes here: 1) tuning the alpha and lambda parameters of elastic net models, and 2) evaluating the performance of the tuned models with those parameters held fixed.

We want a cross-validation scheme that gives the best trade-off between computational expense and model quality.
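
To make the two roles concrete, here is a minimal sketch of nested cross-validation in base R. It is not the CpGWAS implementation: ridge regression with a closed-form solution stands in for the elastic net so the example has no dependencies, and `ridge_fit`, `cv_mse`, and the simulated data are hypothetical. The inner loop tunes lambda using training samples only; the outer loop scores the tuned model on held-out folds.

```r
set.seed(1)

# Simulated data (placeholder, not from the study): 100 samples, 10 predictors
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:3] %*% c(1, -1, 0.5) + rnorm(n))

# Hypothetical helper: closed-form ridge coefficients
# (no intercept, which is adequate for this zero-mean simulation)
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Hypothetical helper: cross-validated mean squared error for one lambda,
# using a fixed fold assignment so all lambdas see the same splits
cv_mse <- function(X, y, lambda, fold) {
  mean(sapply(sort(unique(fold)), function(i) {
    keep <- fold != i
    beta <- ridge_fit(X[keep, , drop = FALSE], y[keep], lambda)
    mean((y[!keep] - X[!keep, , drop = FALSE] %*% beta)^2)
  }))
}

lambdas <- c(0.01, 0.1, 1, 10, 100)
n_fold <- 5
outer_fold <- sample(rep(seq_len(n_fold), length.out = n))

outer_mse <- sapply(seq_len(n_fold), function(i) {
  train <- outer_fold != i
  X_tr <- X[train, , drop = FALSE]
  y_tr <- y[train]

  # Inner loop: tune lambda using the outer-training samples only
  inner_fold <- sample(rep(seq_len(n_fold), length.out = nrow(X_tr)))
  inner <- sapply(lambdas, function(l) cv_mse(X_tr, y_tr, l, inner_fold))
  best_lambda <- lambdas[which.min(inner)]

  # Outer loop: refit with the chosen lambda, score on the held-out fold
  beta <- ridge_fit(X_tr, y_tr, best_lambda)
  mean((y[!train] - X[!train, , drop = FALSE] %*% beta)^2)
})

# Performance estimate in which no sample is used for both tuning and evaluation
mean(outer_mse)
```

The cost of this design is the number of model fits: with 5 outer folds, 5 inner folds, and 5 candidate lambdas, the sketch fits 125 tuning models plus 5 evaluation models, which is why the nesting depth matters for runtime.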

## Prepare inputs

In [1]:
library(CpGWAS)

Let's run these tests over a very small chunk on one chromosome.

In [None]:
args <- list(
    outdir = "../output/",
    chunk1 = 10^6,
    chunk2 = 10^6 + 100,
    snp_data_path = "../../mwas/gwas/libd_chr1.pgen",
    methylation_data_path = "../../mwas/pheno/dlpfc/out/chr1_AA.rda")

Load our `BSseq` object (`BSobj2`), which contains bisulfite sequencing data and covariates

In [None]:
load(args$methylation_data_path)

Organize inputs into an object of our class `MethylationInput`

In [None]:
methInput <- new("MethylationInput",
                 BSseq_obj = BSobj2,
                 snp_data_path = args$snp_data_path,
                 args = args)

Define window sizes for SNPs to be extracted surrounding each methylation site

In [None]:
window_sizes <- c(1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000)
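
As a rough illustration of what a window means here, the sketch below selects SNPs within a given distance of one methylation site. The positions and the `snps_in_window` helper are made up for the example; the package may define window boundaries differently (for example, treating the window size as a total width rather than a flanking distance).

```r
# Hypothetical SNP and CpG positions (base pairs), for illustration only
snp_pos <- c(998500, 999400, 999900, 1000050, 1000800, 1001900, 1004000)
cpg_pos <- 1000000

# Hypothetical helper: indices of SNPs within `window` bp on either side of the site
snps_in_window <- function(snp_pos, cpg_pos, window) {
  which(abs(snp_pos - cpg_pos) <= window)
}

# Number of SNPs captured by each of the two smallest window sizes
sapply(c(1000, 2000), function(w) length(snps_in_window(snp_pos, cpg_pos, w)))
```

Larger windows pull in more SNPs per site, so model-fitting time can be expected to grow with window size.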

In [None]:
scaffoldIdentifier_prefix <- paste0(tools::file_path_sans_ext(basename(args$snp_data_path)),
                                    "-",
                                    tools::file_path_sans_ext(basename(args$methylation_data_path)))

## Try triple-nested CV scheme

In [None]:
start_time <- Sys.time() 

In [None]:
scaffoldIdentifier_1 <- paste0(scaffoldIdentifier_prefix, "_scheme1")

In [None]:
scaffold_models_1 <- build_prediction_model(
  BSobj = BSobj2,
  methInput = methInput,
  window_sizes = c(1000, 2000),
  chunk1 = args$chunk1,
  chunk2 = args$chunk2,
  n_fold = 5,
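  cv_nesting = "triple",  # assumed value matching this section's heading; check the options accepted by build_prediction_model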
  scaffoldIdentifier = scaffoldIdentifier_1,
  outdir = "output/",
  record_runtime = TRUE
)

df_1 <- as.data.frame(scaffold_models_1)

In [None]:
end_time <- Sys.time()  # End time capture
total_runtime <- end_time - start_time
total_runtime_seconds <- as.numeric(total_runtime, units = "secs")
hours <- total_runtime_seconds %/% 3600
minutes <- (total_runtime_seconds %% 3600) %/% 60
seconds <- total_runtime_seconds %% 60

# Report the runtime
cat(sprintf("Processed chunks %d through %d in %d hours, %d minutes and %d seconds.\n",
            args$chunk1, args$chunk2, as.integer(hours), as.integer(minutes), as.integer(seconds)))
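
Since the same runtime arithmetic is needed for each scheme, it could be wrapped in a small helper. `format_runtime` is a hypothetical name, not a CpGWAS function; this is only a sketch of one way to avoid repeating the calculation.

```r
# Hypothetical helper: format the elapsed time between two timestamps
# as whole hours, minutes, and seconds
format_runtime <- function(start_time, end_time = Sys.time()) {
  secs <- as.numeric(difftime(end_time, start_time, units = "secs"))
  sprintf("%d hours, %d minutes and %d seconds",
          as.integer(secs %/% 3600),
          as.integer((secs %% 3600) %/% 60),
          as.integer(secs %% 60))
}

# Example with a fixed, made-up interval of 3725 seconds
t0 <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")
format_runtime(t0, t0 + 3725)
```

Each scheme would then record its own `start_time` immediately before the call to `build_prediction_model()` and pass it to the helper afterwards.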

## Try double-nested, two-step CV scheme

In [None]:
start_time <- Sys.time()  # restart the timer so scheme 1's runtime is not counted here
scaffoldIdentifier_2 <- paste0(scaffoldIdentifier_prefix, "_scheme2")

In [None]:
scaffold_models_2 <- build_prediction_model(
  BSobj = BSobj2,
  methInput = methInput,
  window_sizes = c(1000, 2000),
  chunk1 = args$chunk1,
  chunk2 = args$chunk2,
  n_fold = 5,
  cv_nesting = "double",
  scaffoldIdentifier = scaffoldIdentifier_2,
  outdir = "output/",
  record_runtime = TRUE
)

df_2 <- as.data.frame(scaffold_models_2)

In [None]:
end_time <- Sys.time()  # End time capture
total_runtime <- end_time - start_time
total_runtime_seconds <- as.numeric(total_runtime, units = "secs")
hours <- total_runtime_seconds %/% 3600
minutes <- (total_runtime_seconds %% 3600) %/% 60
seconds <- total_runtime_seconds %% 60

# Report the runtime
cat(sprintf("Processed chunks %d through %d in %d hours, %d minutes and %d seconds.\n",
            args$chunk1, args$chunk2, as.integer(hours), as.integer(minutes), as.integer(seconds)))

## Compare results across the two schemes
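
One possible way to set up this comparison is sketched below. The column names (`cg`, `window_size`, `cor`) and the toy values are placeholders, since the structure of `df_1` and `df_2` is not shown here; substitute the columns actually returned by `as.data.frame()` on the model objects.

```r
# Toy stand-ins for df_1 and df_2 (placeholder values, not real results)
toy_1 <- data.frame(cg = c(1000010, 1000020, 1000030),
                    window_size = 1000,
                    cor = c(0.30, 0.10, 0.45))
toy_2 <- data.frame(cg = c(1000010, 1000020, 1000030),
                    window_size = 1000,
                    cor = c(0.28, 0.12, 0.44))

# Pair up models for the same site and window, one column per scheme
both <- merge(toy_1, toy_2, by = c("cg", "window_size"),
              suffixes = c("_scheme1", "_scheme2"))

# Agreement between schemes: Pearson correlation and mean absolute difference
agreement <- c(
  pearson = cor(both$cor_scheme1, both$cor_scheme2),
  mean_abs_diff = mean(abs(both$cor_scheme1 - both$cor_scheme2))
)
agreement
```

Merging on the site and window keys keeps the comparison paired, so differences reflect the cross-validation scheme rather than which models happened to be fit.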