In [1]:
library("phenopath")
library("reticulate")
np <- import("numpy")

In [2]:
phenopath_defaults <- function(observations, covariates, ...) {
    return(suppressWarnings(phenopath(observations, covariates, model_mu=TRUE, 
                                      maxiter=50, thin=10, verbose=FALSE, ...)))
}

The gLV models have intercepts $\mu$, and it makes conceptual sense for the model to include them, since they can represent 'intrinsic growth rates'. For that reason `model_mu=TRUE` is used even though it is not the default setting. Very preliminary investigation also suggested that it is likely to give better results. That is what one would expect theoretically as well: the default `model_mu=FALSE` is the special case where $\mu = 0$ is assumed, so allowing other values for that parameter can only increase the fit. (One can then argue about overfitting, but that seems unlikely to matter much since this model is clearly heavily mis-specified anyway, and the extra parameter not only increases the fit but is also amenable to biological interpretation in this context.)
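
The nesting argument can be illustrated outside phenopath with ordinary least squares: a no-intercept fit is the intercept fit constrained to $\mu = 0$, so its residual sum of squares can never be smaller. This is only a minimal sketch on simulated data (the variable names and simulated values are made up for illustration), and it says nothing about how phenopath's own fitting procedure behaves.

```r
set.seed(1)
z <- rnorm(200)                          # stand-in for a latent trajectory
y <- 2 + 0.5 * z + rnorm(200, sd = 0.3)  # data generated with a non-zero intercept

fit_with_mu    <- lm(y ~ z)      # intercept estimated (analogous to model_mu = TRUE)
fit_without_mu <- lm(y ~ 0 + z)  # intercept fixed at zero (analogous to model_mu = FALSE)

rss_with_mu    <- sum(resid(fit_with_mu)^2)
rss_without_mu <- sum(resid(fit_without_mu)^2)

# The constrained model cannot fit better than the model that contains it.
print(rss_with_mu <= rss_without_mu)
```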

As for the small number of iterations, it seems to give decent enough results in practice, all the more so because in this context one is more interested in qualitative fit, especially correct signs and sparsity, than in precise numerical values. In other words, any resulting quantitative inaccuracy will hopefully not correspond to substantially decreased qualitative accuracy, especially given that 50 iterations is still a decent amount. From a practical perspective, producing results for so many estimations requires a reasonable upper bound on the runtime of each, and that is only possible with a relatively small maximum number of iterations. In practice an experimenter, who only needs to analyze the results of a single experiment, should probably not use such a low maximum number of iterations.

In [3]:
spearman <- function(x,y){return(cor(x,y,method='spearman'))}

In [4]:
size = 11
seed = 42
number_droplets = 100000
number_batches = 5
results_dirname = 'results'

base_filename = paste(results_dirname, '/', size, '_strains.seed_', 
                      seed, '.', format(number_droplets, scientific=FALSE), '_droplets.iteration_',
                      '#', '.npz', sep='')

In [5]:
get_results <- function(phenopath_results, true_times) {
    uncensored_results <- interaction_effects(phenopath_results)
    # zero out the interactions that phenopath does not flag as significant
    censored_results <- significant_interactions(phenopath_results) * uncensored_results
    pseudotimes <- trajectory(phenopath_results)
    # absolute values: same level of support if pseudotimes are flipped
    pearson_abs <- abs(cor(true_times, pseudotimes))
    spearman_abs <- abs(spearman(true_times, pseudotimes))
    
    results <- list("uncensored_results" = uncensored_results, 
                   "censored_results" = censored_results,
                   "pearson" = pearson_abs,
                   "spearman" = spearman_abs)
    return(results)
}
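
The reason for taking absolute values can be checked on toy data: negating a pseudotime vector flips the sign of both correlations but leaves their magnitude unchanged. The sketch below uses a made-up noisy pseudotime as a stand-in for the output of `trajectory()`; the helper names `abs_pearson` and `abs_spearman` are hypothetical.

```r
set.seed(1)
true_times <- rep(1:5, each = 20)
pseudotime <- scale(true_times)[, 1] + rnorm(100, sd = 0.5)  # noisy stand-in for trajectory()
flipped    <- -pseudotime                                    # same ordering, reversed direction

abs_pearson  <- function(x, y) abs(cor(x, y))
abs_spearman <- function(x, y) abs(cor(x, y, method = "spearman"))

# Both comparisons should report equality up to floating point tolerance.
print(all.equal(abs_pearson(true_times, pseudotime),  abs_pearson(true_times, flipped)))
print(all.equal(abs_spearman(true_times, pseudotime), abs_spearman(true_times, flipped)))
```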

In [6]:
get_results_filename <- function(base_dir, scaling, iteration_number) {
    iteration_filename = paste('iteration_', iteration_number, '.npz', sep='')
    # file.path joins its arguments with '/', so no manual pasting of separators is needed
    results_filename = file.path(base_dir, scaling, iteration_filename)
    return(results_filename)
}

save_results <- function(results_filename, results) {
    np$savez_compressed(results_filename,
    uncensored_results = results$uncensored_results,
    censored_results = results$censored_results,
    pearson = results$pearson,
    spearman = results$spearman)
}

The questions that I seek to answer are, at a broad level: 'How can phenopath best be applied to this problem?' and 'What settings will allow phenopath to work best?'. Admittedly, not all possible choices of settings are being considered (e.g. alternative choices of `z_init`, like the true time values or the exponentials thereof, or centering but not scaling the data). The reasoning is that in practice the experimenter is only likely to be willing (or in some cases only able) to use settings which do not deviate too much from the default values. It is also reasonable to think that default choices of e.g. `z_init` were well thought out by the developers and are likely to be useful in practice, even if they are not the most useful theoretically possible for this context (which would be impossible to find, at least without a lot of probably intractable theory, due to e.g. the infinite space of options for many settings).

So the specific questions reduce to: (1) does it make sense to scale the data in this context (even though the data is PCR-bias adjusted and so roughly absolute counts), (2) what makes the most sense to use as the covariates for the model, and (3) does censoring values considered "insignificant" by the phenopath model lead to increased fit (i.e. by reducing false positives)? As an additional fun question: does the correlation (Pearson and/or Spearman) of the computed pseudotimes with the true times of the batches have any predictive value for the performance? That is, do the true time values have any "importance" in that sense?
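
To make question (3) concrete, one way to score qualitative fit is the fraction of entries whose sign (positive, negative, or zero) matches a known interaction matrix. The sketch below is purely illustrative: the matrices are invented, the threshold is a made-up stand-in for what `significant_interactions()` would return, and `sign_agreement` is a hypothetical helper rather than the metric used elsewhere in this analysis.

```r
# Hypothetical ground truth for 3 strains: mostly zeros, i.e. sparse.
true_A <- matrix(c( 0, -1,  0,
                    1,  0,  0,
                    0,  0,  0), nrow = 3, byrow = TRUE)

# Hypothetical uncensored estimates: the right large entries plus small spurious ones.
uncensored <- matrix(c( 0.02, -0.90,  0.10,
                        0.70,  0.01, -0.05,
                        0.03,  0.04,  0.00), nrow = 3, byrow = TRUE)

# Made-up stand-in for the logical matrix from significant_interactions().
significant <- abs(uncensored) > 0.5
censored <- significant * uncensored   # same censoring step as in get_results()

# Fraction of entries whose sign matches the truth.
sign_agreement <- function(estimate, truth) mean(sign(estimate) == sign(truth))

print(sign_agreement(uncensored, true_A))
print(sign_agreement(censored, true_A))
```

In this toy case censoring removes the small spurious entries, so the sign agreement with the sparse truth goes up; whether that also holds for the real results is exactly what question (3) asks.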

As for always using the log-transformed data as the values: that is what is recommended both in the phenopath paper and in its documentation, and it is also what makes the (generative) model for phenopath most closely resemble the (integral version of the) gLV equations.
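
For reference, the integral version of the gLV equations can be written out as follows. The symbols $x_i$ (abundance of strain $i$) and $A_{ij}$ (interaction coefficients) are notation assumed here for illustration; only $\mu$ appears in the text above. Starting from the standard gLV form

$$\frac{dx_i}{dt} = x_i \Big( \mu_i + \sum_j A_{ij} x_j \Big),$$

dividing by $x_i$ gives $\frac{d \log x_i}{dt} = \mu_i + \sum_j A_{ij} x_j$, and integrating from $0$ to $t$ gives

$$\log x_i(t) = \log x_i(0) + \mu_i t + \sum_j A_{ij} \int_0^t x_j(s)\, ds.$$

The right-hand side is linear in the growth rates $\mu_i$ and in the interaction coefficients $A_{ij}$, which is why a model that is linear in the log-transformed counts is the natural point of comparison.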

Create directories to store the results in.

In [7]:
all_results_dir = 'phenopath_results'
if (!dir.exists(all_results_dir)) {dir.create(all_results_dir)}

binary_results_dir = file.path(all_results_dir, 'binary_covariates')
if (!dir.exists(binary_results_dir)) {dir.create(binary_results_dir)}

counts_cov_results_dir = file.path(all_results_dir, 'count_covariates')
if (!dir.exists(counts_cov_results_dir)) {dir.create(counts_cov_results_dir)}

# add 'scaled' and 'unscaled' subdirectories to every covariate directory created above
for (subdirectory in list.files(path=all_results_dir, full.names=TRUE)) {
    scaled_results_dir = file.path(subdirectory, 'scaled')
    if (!dir.exists(scaled_results_dir)) {dir.create(scaled_results_dir)}
    
    unscaled_results_dir = file.path(subdirectory, 'unscaled')
    if (!dir.exists(unscaled_results_dir)) {dir.create(unscaled_results_dir)}
}

Loop through the iterations of stored results.

In [8]:
for (iteration_number in 1:100) {
    
    filename = gsub("#", iteration_number, base_filename)
    npzfile = np$load(filename)

    read_log_counts = npzfile[["read_log_counts"]]
    read_init_vectors = npzfile[["read_init_vectors"]]
    # recover counts from log counts; entries whose log count is exactly 0 map to a count of 0, not exp(0) = 1
    read_counts = exp(read_log_counts)*(read_log_counts != 0)    
    
    merged_droplets_per_batch <- dim(read_log_counts)[1]/number_batches
    # batch index (1..number_batches) repeated once per droplet in that batch, used as the "true time"
    true_times <- rep(1:number_batches, each=merged_droplets_per_batch)
    
    start_time <- proc.time()
    binary_scaled <- phenopath_defaults(read_log_counts, read_init_vectors, scale_y=TRUE)
    results <- get_results(binary_scaled, true_times)
    save_results(get_results_filename(binary_results_dir, 'scaled', iteration_number), results)
    run_time <- proc.time() - start_time; print(run_time)
    
    start_time <- proc.time()
    binary_unscaled <- phenopath_defaults(read_log_counts, read_init_vectors, scale_y=FALSE)
    results <- get_results(binary_unscaled, true_times)
    save_results(get_results_filename(binary_results_dir, 'unscaled', iteration_number), results)
    run_time <- proc.time() - start_time; print(run_time)
    
    start_time <- proc.time()
    counts_scaled <- phenopath_defaults(read_log_counts, read_counts, scale_y=TRUE)
    results <- get_results(counts_scaled, true_times)
    save_results(get_results_filename(counts_cov_results_dir, 'scaled', iteration_number), results)
    run_time <- proc.time() - start_time; print(run_time)
    
    start_time <- proc.time()
    counts_unscaled <- phenopath_defaults(read_log_counts, read_counts, scale_y=FALSE)
    results <- get_results(counts_unscaled, true_times)
    save_results(get_results_filename(counts_cov_results_dir, 'unscaled', iteration_number), results)
    run_time <- proc.time() - start_time; print(run_time)

    # this is supposed to be an embarrassingly parallel for loop, so memory usage should not change with number of iterations
    # but system monitor shows memory usage continually increasing. Hadley Wickham seems to have said that calling `gc`
    # manually for garbage collection should never be necessary, but honestly at this point I don't trust R so...
    gc()    
}



   user  system elapsed 
578.808  40.469 616.007 
   user  system elapsed 
655.871  53.642 708.781 
   user  system elapsed 
609.292  46.813 649.827 
   user  system elapsed 
601.935  45.446 640.358 
   user  system elapsed 
615.453  44.580 653.335 
   user  system elapsed 
600.692  49.633 644.267 
   user  system elapsed 
595.652  47.068 636.187 
   user  system elapsed 
595.279  47.350 635.374 
   user  system elapsed 
604.513  44.688 643.507 
   user  system elapsed 
598.752  49.352 642.918 
   user  system elapsed 
595.934  48.551 638.260 
   user  system elapsed 
595.137  45.667 633.389 
   user  system elapsed 
610.917  41.211 644.635 
   user  system elapsed 
601.256  47.271 642.019 
   user  system elapsed 
597.178  44.118 634.382 
   user  system elapsed 
603.378  45.562 641.312 
   user  system elapsed 
604.535  42.183 639.504 
   user  system elapsed 
595.430  44.377 632.023 
   user  system elapsed 
598.875  44.162 634.990 
   user  system elapsed 
595.987  43.845 632.727 
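
The 20 timings shown above (4 fits for each of 5 iterations) give a rough idea of the runtime bound discussed earlier. The sketch below copies the elapsed seconds from that output and extrapolates to the full loop, assuming the remaining fits take a similar amount of time; the variable names are made up for illustration.

```r
# Elapsed seconds copied from the timing output above.
elapsed <- c(616.007, 708.781, 649.827, 640.358, 653.335,
             644.267, 636.187, 635.374, 643.507, 642.918,
             638.260, 633.389, 644.635, 642.019, 634.382,
             641.312, 639.504, 632.023, 634.990, 632.727)

mean_minutes_per_fit <- mean(elapsed) / 60

# The loop runs 100 iterations with 4 phenopath fits each.
fits_planned <- 100 * 4
projected_hours <- mean(elapsed) * fits_planned / 3600

print(round(c(minutes_per_fit = mean_minutes_per_fit,
              projected_hours = projected_hours), 1))
```

Each fit takes on the order of ten minutes even with `maxiter=50`, which is why a larger iteration cap was not practical for this many estimations.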
