Reconcile Gene Aggregation Feature with Memory Footprint Overhaul #99

warrenmcg · 2017-04-20T23:41:09Z

Hi @pimentel and Sleuth team,

I went ahead and made changes to reconcile the gene aggregation feature, currently available in the most recent stable release, with the overhaul you've already done to improve the memory footprint of Sleuth. I started from commit 8fba175 on the gene_agg branch and accepted all of the commits through to the current stable release (commit 048f055). I then added the two commits from the gene_memory branch (through commit b0d4731). This was to help facilitate an automatic merge. The rest is my work.

Here is a summary of what has changed from that point onward:

I updated the reads_per_base_transform function to use data.table throughout, which gives a several-fold speed boost. The function is used to process the bootstrap matrix and the bootstrap TPM counts (about 20 seconds each for the matrix and TPM calculations in one of my own datasets, compared to several minutes each per sample using the old code). Also, because of how I implemented this, the TPM counts are now computed before the estimated counts summary.
I added code to the for loop within sleuth_prep to process the bootstrap data at the gene level. This also uses data.table throughout for fast processing (total processing is <1 min per sample with my own full datasets).
I also added code within sleuth_prep to update the sleuth object features to gene-level data, not just bs_summary: filter_df, filter_bool, obs_norm, and obs_norm_filt now report gene-level information. obs_raw remains untouched, so it still has the transcript-level data.
I added a new transform_function option to let people use another transformation function (e.g. log2 instead of ln, or an offset other than 0.5). I extended the primitive $<- function and added a transform_synced item to the sleuth_model object, to make sure that users can't update the transformation function after having run sleuth_prep, which protects reproducibility when sharing sleuth results.
I added tests to make sure gene aggregation works as expected, and also added tests to quickly check whether the whole sleuth pipeline is working properly. I precomputed sleuth results at the transcript and gene level, and the tests compare these precomputed results to newly run results to make sure the current version of sleuth does not change the results significantly from before. These tests can be updated in future releases as other features are added or redesigned. They can also be a starting point for deciding when to increment major releases.
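To illustrate the transform option and the sync flag described above, here is a minimal sketch in base R. It is not the code from this pull request: the `toy_sleuth` class, its constructor, and the warning text are hypothetical stand-ins, and the real sleuth object is more involved. The sketch only shows the general idea of an S3 `$<-` method that marks existing fits as out of sync when the transformation function is replaced.

```r
# Default transformation (natural log with a 0.5 offset) and a log2 alternative.
default_transform <- function(x) log(x + 0.5)
log2_transform <- function(x) log2(x + 0.5)

# Hypothetical toy object; NOT the real sleuth object structure.
new_toy_sleuth <- function(transform_fxn = default_transform) {
  structure(list(transform_fxn = transform_fxn, fits = list()),
            class = "toy_sleuth")
}

# S3 method for `$<-`: replacing transform_fxn flags every existing fit.
`$<-.toy_sleuth` <- function(x, name, value) {
  obj <- unclass(x)
  if (identical(name, "transform_fxn") && length(obj$fits) > 0) {
    warning("transform_fxn changed: rerun the prep and fit steps",
            call. = FALSE)
    obj$fits <- lapply(obj$fits, function(fit) {
      fit$transform_synced <- FALSE
      fit
    })
  }
  obj[[name]] <- value
  structure(obj, class = "toy_sleuth")
}

obj <- new_toy_sleuth()
obj$fits <- list(full = list(transform_synced = TRUE))

# The warning is suppressed here only to keep the example output quiet.
suppressWarnings(obj$transform_fxn <- log2_transform)
```

After the last assignment, the stored fit carries `transform_synced = FALSE`, which downstream steps could check before running a test on a stale fit.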

If this work is accepted, the gene_summary function will be obsolete.
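For readers unfamiliar with the approach, the following is a rough sketch of gene-level aggregation of bootstrap counts with data.table. It is an illustration under assumptions, not the code in this pull request: the toy values, the `gene_id` column name (standing in for the aggregation column), and the plain sum are all made up for the example, and the actual reads_per_base_transform step does more than this.

```r
library(data.table)

# Toy transcript-level bootstrap counts (made-up values).
tidy_bs <- data.table(
  target_id  = c("tx1", "tx2", "tx3", "tx1", "tx2", "tx3"),
  sample     = c(1L, 1L, 1L, 2L, 2L, 2L),  # bootstrap number
  est_counts = c(10, 5, 7, 12, 3, 8)
)

# Toy transcript-to-gene mapping; "gene_id" is a hypothetical column name.
target_mapping <- data.table(
  target_id = c("tx1", "tx2", "tx3"),
  gene_id   = c("geneA", "geneA", "geneB")
)

# Attach gene IDs, then sum counts within each gene and bootstrap.
merged <- merge(tidy_bs, target_mapping, by = "target_id")
gene_bs <- merged[, .(est_counts = sum(est_counts)), by = .(gene_id, sample)]
setorder(gene_bs, gene_id, sample)
```

Keeping both the merge and the grouped sum inside data.table avoids converting between data frame types for every sample, which is where a speedup of this kind would plausibly come from.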

Other features that are not related:

The features from previous pull requests are included, specifically #71, #94, #95, and #96. If those are all good, you can just focus on this pull request. If they are not good, we can discuss and make changes accordingly.

Plan:

I plan to move the code within the for loop to its own function, and then set things up so that the bootstrap reading can be done in parallel, so we can add back in the num_cores option. On my machine, this would reduce what previously took almost three hours with the old code (current release) down to 1-2 minutes (new code with parallel options).
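The plan above could look roughly like the sketch below, which uses `parallel::mclapply` (the function named in the earlier num_cores commits). Everything else is assumed for illustration: `summarize_sample` is a hypothetical stand-in for the per-sample bootstrap summary, and the bootstrap matrices are simulated.

```r
library(parallel)

# Hypothetical stand-in for the per-sample bootstrap summary step:
# transform the bootstrap matrix, then take the variance of each column.
summarize_sample <- function(bs_mat) {
  transformed <- log(bs_mat + 0.5)
  apply(transformed, 2, var)
}

# Simulated bootstrap matrices: 20 bootstraps x 5 targets for 4 samples.
set.seed(1)
bs_list <- lapply(1:4, function(i) {
  matrix(rpois(20 * 5, lambda = 10), nrow = 20, ncol = 5)
})
names(bs_list) <- paste0("sample", 1:4)

# mclapply relies on forking, which is unavailable on Windows,
# so mc.cores must be 1 there.
num_cores <- if (.Platform$OS.type == "windows") 1L else 2L
bs_summary <- mclapply(bs_list, summarize_sample, mc.cores = num_cores)
```

Moving the loop body into its own function is what makes this possible: once each sample is processed by a single self-contained call, swapping `lapply` for `mclapply` is a one-line change.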

Let me know what you think!

temporarily disable filtering at gene level #vc stash gene mode #vc work in progress #vc add functionality to filter by target id #vc update aggregation #vc comment out a section in aggregation #vc deal with intersection #vc first cut at empirical bayes #vc fix small bug in empirical bayes #vc testing out pmax #vc empirical bayes and propagate filter #vc norm by effective length massive cleanup more cleaning refactor out check target mapping remove filter by target_id update warnings in check_target_mapping more cleanup

warrenmcg · 2017-04-22T23:27:22Z

Update: the proposed parallelization code is finished. I added back in the num_cores option from #94.

pimentel

hi @warrenmcg

I'm really sorry this took so long. so far this is looking great. I've done a bit of testing locally and found no issues. I'm going through and making comments to myself (no need for more action on your part) and will be making some minor modifications in the coming days.

Thanks so much for your hard work on this!

pimentel · 2017-05-13T01:17:31Z

R/bootstrap.R

+    # for backwards compatibility
+    tidy_bs <- dplyr::select(tidy_bs, target_id, 
+                             est_counts, sample = bootstrap_num)
+    tidy_bs <- merge(data.table::as.data.table(tidy_bs), 


make all explicit by using scope operator.

done

pimentel · 2017-05-13T01:18:18Z

R/bootstrap.R

+  }
+
+  if (extra_bootstrap_summary) {
+    bs_quant_est_counts <- aperm(apply(bs_mat, 2, 


look into whether or not there is a faster matrixStats function. I think I tested the one that exists there and this is actually faster. will check soon

pimentel · 2017-05-13T01:19:24Z

R/bootstrap.R

+    ret$bs_quants[[samp_name]]$est_counts <- bs_quant_est_counts
+  }
+
+  bs_mat <- transform_fxn(bs_mat)


change to transform_fun for consistency

done

warrenmcg · 2017-05-13T01:54:06Z

Don't worry about it at all!! Glad to see you back in action! :)

pimentel · 2017-05-18T04:46:56Z

merged in -- thanks again for all the hard work! will be making it into the next version scheduled for sometime next week!

pimentel and others added 30 commits January 20, 2016 22:04

wip gene_summary #vc

8fba175

Updated documentation

cc678f5

For sleuth_gene_table, sleuth_results, and sleuth_to_matrix

Further documentation

734f992

Updated sleuth_gene_table, sleuth_prep, sleuth_wt, sleuth_fit, and models.

Updated sleuth_gene_table

3cde4f6

More precise definition of num_transcripts

modifications to suggestions from @map222 (major thanks!) #vc

17d1579

Merge branch 'map222-master' (pull request Updated documentation pach…

a7e64fc

…terlab#64)

fix gene mode and pca bug

07ca1d9

make interface nicer and update vignette

a5c91b1

update version

659e36f

fix to plot_pc_variance

bd811f4

update vignette

58913fc

merge in conflicts

2b86978

Merge pull request pachterlab#74 from pachterlab/hjp/geneagg

ed5ad30

Gene aggregation

update vignette

9d4a67a

Merge pull request pachterlab#75 from pachterlab/hjp/geneagg

048f055

update vignette

switch to column wise reading of bootstraps #vc

846673e

some cleanup and massive speedup using matrixStats #vc

b0d4731

modified mclapply usage in gene_summary to reduce memory footprint; a…

932fc47

…dded sleuth_prep option to select number of cores for mclapply

corrected default value for num_cores, and strengthened check conditi…

90b5b99

…ons for num_cores to throw informative error

fixed 'give a design matrix' test, which failed because the design ma…

161d231

…trix did not have dim names that matched the formula or the sample ids

now lintr clean

66cea3a

clean a few lints that I missed

c321f90

added code from pull request pachterlab#71, which drops unused levels…

59ff84c

… of s2c

add in code from pull request pachterlab#92, which adds in 'drop = F'…

62cfab4

… option into 'spread_abundance_by' to prevent downstream error when preparing just one sample

change 'give a design matrix' test to match changes on 'devel' branch

d6d1d2c

add code to 'sleuth_results' to add in gene annotations, to address p…

780b8dc

…achterlab#86

Merge branch 'issue80_81'

d77401a

Merge branch 'issue86'

bbd8859

add grave accent ` to sample expressions for plot_scatter so any arbi…

a5f10e3

…trary string can be used as a sample name; proposed solution for pachterlab#89

warrenmcg added 22 commits March 24, 2017 02:26

extend primitive '$<-' and add a replacement function for transform_fxn

bb57ce8

+ any change to transform_fxn leads to all existing fits have 'FALSE' transform_synced statuses + the user is warned that all existing fits need to be redone.

Merge branch 'issue89' after typo correction

bf82d58

update documentation for transform argument and functions; also fix misc

60a4c49

+ switch order of @example and @export to prevent error (newbie mistake) + comment out reference to 'num_cores' for time being (not currently used)

fix bug where bootstrap summary overwrites tpm summary

e64abe5

Merge branch 'merge' after fixing bug

4d077d1

fixed attribute bug introduced by reusing obs_norm; created a new var…

2fe6fa7

…iable for obs_norm_gene and tpm_norm_gene

speed boost by converting gene aggregation code to use just data.tabl…

f16ff40

…e syntax rather than dplyr

fix bug that leads to incongruent tpm_norm_gene and obs_norm_gene tables

29453d0

improve memory usage by removing objects sooner, and switching to use…

08d6410

… pared down target_mapping table

fix bug where target_id names with non-synatically valid R column nam…

ba3b374

…es leads to downstream errors when doing gene aggregation

warn user when target_ids are missing annotations from aggregation_co…

ffcb4a4

…lumn

fix bug introduced if aggregation column has NA values

325c267

+ make sure blank entries are treated as NA values + fix behavior of overall list of gene IDs for gene-level filter_bool

add gene mappings to Ellahi test data, and update README

db4d209

add processed sleuth data for testing purposes

c005aa4

+ this processed data uses the full target mappings + also includes TPM calculations for the bootstraps and the extra_bootstrap_summary + update Ellahi dataset README with info on the preprocessed results

load preprocessed results for comparison during testing

9763f7a

add code to test gene-level aggregation

482110f

add tests to make sure details of sleuth prep are unchanged

2e89826

add tests for fit and Wald Test aspects of sleuth pipeline

d53d14f

move bootstrap summary code to separate function in bootstrap.R in pr…

6401590

…eparation for parallelization

modify code to parallelize the bootstrap summary step

a1e4765

add back in num_cores option; correct typo

08edf11

warrenmcg mentioned this pull request Apr 22, 2017

sleuth_to_matrix reports transcripts even when an aggregation_column was provided to sleuth_prep #84

Closed

pimentel reviewed May 13, 2017

pimentel merged commit 08edf11 into pachterlab:devel May 18, 2017

pimentel mentioned this pull request May 29, 2017

Release v0.29.0 #110

Merged

warrenmcg deleted the gene_testing branch June 3, 2017 03:33
