# Compare speed and similarity in results for two cross-validation schemes

Cross-validation serves two purposes here: 1) tuning the alpha and lambda parameters of elastic net models, and 2) evaluating the performance of the tuned models with those parameters held fixed.

We want the cross-validation scheme that gives the best trade-off between computational expense and model quality.

## Prepare inputs

In [1]:
library(CpGWAS)

Let's run these tests over a very small chunk on one chromosome.

In [2]:
args <- list(
    outdir = "../output/",
    chunk1 = 10^6,
    chunk2 = 10^6 + 1000,
    snp_data_path = "../../mwas/gwas/libd_chr1.pgen",
    methylation_data_path = "../../mwas/pheno/dlpfc/out/chr1_AA.rda")

Load our `BSseq` object containing bisulfite sequencing data and covariates

In [5]:
load(args$methylation_data_path)

Organize inputs into an object of our class `MethylationInput`

In [6]:
methInput <- new("MethylationInput",
                 BSseq_obj = BSobj2,
                 snp_data_path = args$snp_data_path,
                 args = args)

Dimensions of methylations:  111 2202819 
Dimensions of cov_matrix:  111 5 
Dimensions of pseudoinv:  5 111 


Define window sizes for SNPs to be extracted surrounding each methylation site

In [None]:
window_sizes <- c(1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000)
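To make the role of a window concrete, here is a minimal sketch with made-up SNP positions. It assumes a window of size `w` is centered on the methylation site and spans `w/2` base pairs on either side. That convention, the helper `snps_in_window`, and the toy positions are all assumptions for illustration and may not match how CpGWAS defines its windows.

```r
# Hypothetical toy data: SNP positions (bp) and one methylation site
snp_positions <- c(990500, 999200, 999900, 1000400, 1003000, 1040000)
meth_position <- 1000000

# Assumption: a window of size w spans w/2 bp on either side of the site
snps_in_window <- function(snp_pos, site_pos, window_size) {
  which(abs(snp_pos - site_pos) <= window_size / 2)
}

# Number of SNPs captured as the window grows
n_snps <- sapply(c(1000, 2000, 5000, 10000), function(w) {
  length(snps_in_window(snp_positions, meth_position, w))
})
print(n_snps)
```

Larger windows can only add SNPs, so each site gets one candidate model per window size, with a predictor set that grows or stays the same.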

In [None]:
scaffoldIdentifier_prefix <- paste0(tools::file_path_sans_ext(basename(args$snp_data_path)),
                                    "-",
                                    tools::file_path_sans_ext(basename(args$methylation_data_path)))

## Try triple-nested CV scheme

In [None]:
start_time <- Sys.time() 

In [None]:
scaffoldIdentifier_1 <- paste0(scaffoldIdentifier_prefix, "_scheme1")

In [None]:
scaffold_models_1 <- suppressWarnings(build_prediction_model(
  BSobj = BSobj2,
  methInput = methInput,
  window_sizes = window_sizes,
  chunk1 = args$chunk1,
  chunk2 = args$chunk2,
  n_fold = 5,
  cv_nesting = "triple",
  scaffoldIdentifier = scaffoldIdentifier_1,
  outdir = args$outdir,
  record_runtime = TRUE
))

df_1 <- convertToDataFrame(scaffold_models_1)

In [None]:
df_1

In [None]:
end_time <- Sys.time()  # End time capture
total_runtime <- end_time - start_time
total_runtime_seconds <- as.numeric(total_runtime, units = "secs")
hours <- total_runtime_seconds %/% 3600
minutes <- (total_runtime_seconds %% 3600) %/% 60
seconds <- total_runtime_seconds %% 60

# Report the runtime
cat(sprintf("Processed chunks %d through %d in %d hours, %d minutes and %d seconds.\n",
            args$chunk1, args$chunk2, as.integer(hours), as.integer(minutes), as.integer(seconds)))
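Since the same timing arithmetic is needed for each scheme, it could be wrapped in a small helper. The sketch below is one way to do that in base R; `format_runtime` and `t0` are hypothetical names and not part of CpGWAS.

```r
# Hypothetical helper (not part of CpGWAS): format elapsed seconds as h/m/s
format_runtime <- function(elapsed_secs) {
  elapsed_secs <- round(elapsed_secs)
  sprintf("%d hours, %d minutes and %d seconds",
          as.integer(elapsed_secs %/% 3600),
          as.integer((elapsed_secs %% 3600) %/% 60),
          as.integer(elapsed_secs %% 60))
}

t0 <- Sys.time()
# ... run the model-building step here ...
elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
cat("Elapsed:", format_runtime(elapsed), "\n")
```

Using `difftime()` with explicit `units = "secs"` avoids depending on the unit that R picks automatically when two time stamps are subtracted.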

## Try double-nested, two-step CV scheme

In [None]:
start_time <- Sys.time() 

In [None]:
scaffoldIdentifier_2 <- paste0(scaffoldIdentifier_prefix, "_scheme2")

In [None]:
scaffold_models_2 <- suppressWarnings(build_prediction_model(
  BSobj = BSobj2,
  methInput = methInput,
  window_sizes = window_sizes,
  chunk1 = args$chunk1,
  chunk2 = args$chunk2,
  n_fold = 5,
  cv_nesting = "double",
  scaffoldIdentifier = scaffoldIdentifier_2,
  outdir = args$outdir,
  record_runtime = TRUE
))

df_2 <- convertToDataFrame(scaffold_models_2)

In [None]:
end_time <- Sys.time()  # End time capture
total_runtime <- end_time - start_time
total_runtime_seconds <- as.numeric(total_runtime, units = "secs")
hours <- total_runtime_seconds %/% 3600
minutes <- (total_runtime_seconds %% 3600) %/% 60
seconds <- total_runtime_seconds %% 60

# Report the runtime
cat(sprintf("Processed chunks %d through %d in %d hours, %d minutes and %d seconds.\n",
            args$chunk1, args$chunk2, as.integer(hours), as.integer(minutes), as.integer(seconds)))

## Compare results across the two schemes

In [None]:
df_1 <- convertToDataFrame(scaffold_models_1)
df_2 <- convertToDataFrame(scaffold_models_2)

In [None]:
dim(df_1)
dim(df_2)

In [None]:
df_1$scaffoldIdentifier <- df_2$scaffoldIdentifier <- NULL

In [None]:
identical(df_1, df_2)

Make sure the rows are in the same order in both data frames by comparing the first three (metadata) columns

In [None]:
df_1_metadata <- df_1[,1:3]
df_2_metadata <- df_2[,1:3]

In [None]:
identical(df_1_metadata, df_2_metadata)
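`identical()` is strict: any floating-point difference makes it return `FALSE`. For numeric columns a tolerance-based check with `all.equal()` can be more informative. The sketch below uses toy data frames `a` and `b` as stand-ins for `df_1` and `df_2`; their columns are invented for illustration.

```r
# Hypothetical toy data frames standing in for df_1 and df_2
a <- data.frame(pos = c(1, 2, 3), cor = c(0.10, 0.25, NA))
b <- data.frame(pos = c(1, 2, 3), cor = c(0.10, 0.25 + 1e-10, NA))

# Exact comparison fails on the tiny numeric difference
exact_match <- identical(a, b)

# all.equal() compares numerics up to a tolerance (about 1.5e-8 by default);
# it returns TRUE or a character description, so wrap it in isTRUE()
near_match <- isTRUE(all.equal(a, b))

print(c(exact = exact_match, near = near_match))
```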

Compare correlations between predicted and observed values across the two nesting schemes

In [None]:
library(ggplot2)
library(ggpubr)

# Merge the two data frames (assuming they have the same number of rows)
combined_df <- data.frame(triple_nesting = df_1$cor, double_nesting = df_2$cor)

In [None]:
combined_df

Note: dropout (no terms kept in the model) appears to occur much more often under the triple-nested scheme than under the double-nested one.
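The impression of more frequent dropout can be checked by counting missing correlations per scheme before the incomplete rows are dropped. This sketch assumes dropout shows up as `NA` in the correlation column, and it uses a toy data frame `toy` with invented values rather than the real results.

```r
# Hypothetical toy correlations; NA marks a model with no terms retained
toy <- data.frame(triple_nesting = c(0.2, NA, NA, 0.4, NA),
                  double_nesting = c(0.3, 0.1, NA, 0.5, 0.2))

# Number of dropouts per scheme
dropout_counts <- colSums(is.na(toy))

# Cross-tabulate: where does one scheme drop out while the other does not?
dropout_table <- table(triple_na = is.na(toy$triple_nesting),
                       double_na = is.na(toy$double_nesting))

print(dropout_counts)
print(dropout_table)
```

The cross-tabulation separates sites where both schemes fail from sites where only one does, which a pair of totals cannot show.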

In [None]:
combined_df <- na.omit(combined_df)

In [None]:
combined_df

In [None]:
# Convert columns to numeric if they are not already
combined_df$triple_nesting <- as.numeric(as.character(combined_df$triple_nesting))
combined_df$double_nesting <- as.numeric(as.character(combined_df$double_nesting))

# Create the scatter plot
ggplot(combined_df, aes(x = triple_nesting, y = double_nesting)) +
  geom_point() +  # Add points
  geom_smooth(method = "lm", se = TRUE, color = "blue") +  # Add regression line and CI
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +  # Diagonal line
  stat_regline_equation(aes(label = after_stat(rr.label)), label.x.npc = "left") +  # Add R²
  labs(x = "Triple-Nesting", y = "Double-Nesting")  # Axis titles


In [None]:
mean(na.omit(combined_df$triple_nesting))

In [None]:
mean(na.omit(combined_df$double_nesting))

## Evaluate overall performance

For each methylation site, keep only the test for the `window_size` giving the greatest correlation (`cor`) between predicted and observed values

In [None]:
library(dplyr)

# For each methylation site, keep the row(s) with the highest cor (scheme 1)
result_df_1 <- df_1 %>%
  group_by(methylationPosition) %>%
  filter(cor == max(cor, na.rm = TRUE))

# View the resulting data frame
print(result_df_1)
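Filtering on `cor == max(cor, na.rm = TRUE)` keeps every tied row and warns for sites where all correlations are `NA`. A base-R sketch that returns at most one row per site is shown below. The toy data frame and the tie-breaking rule (prefer the smallest window) are assumptions for illustration, not the package's behavior.

```r
# Hypothetical toy results: three sites, several window sizes each
toy <- data.frame(
  methylationPosition = c(100, 100, 100, 200, 200, 300),
  window_size = c(1000, 2000, 5000, 1000, 2000, 1000),
  cor = c(0.1, 0.4, 0.4, NA, 0.2, NA)
)

# Drop rows without a correlation (sites with only NA disappear entirely)
toy_ok <- toy[!is.na(toy$cor), ]

# Sort by site, then by descending cor, then by ascending window size
toy_ok <- toy_ok[order(toy_ok$methylationPosition,
                       -toy_ok$cor,
                       toy_ok$window_size), ]

# Keep the first row per site: highest cor, smallest window among ties
best <- toy_ok[!duplicated(toy_ok$methylationPosition), ]
print(best)
```

With exactly one row per site, the mean of `cor` weights each methylation site equally, which is not guaranteed when ties are kept.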


In [None]:
mean(result_df_1$cor)

In [None]:
# For each methylation site, keep the row(s) with the highest cor (scheme 2)
result_df_2 <- df_2 %>%
  group_by(methylationPosition) %>%
  filter(cor == max(cor, na.rm = TRUE))

# View the resulting data frame
print(result_df_2)

Note: for this sample subset, alpha is 0 throughout, so the elastic net always reduces to pure ridge regression.

In [None]:
mean(result_df_2$cor)