Reconcile Gene Aggregation Feature with Memory Footprint Overhaul #99

Merged
pimentel merged 69 commits into pachterlab:devel from gene_testing on May 18, 2017

Collaborator

@warrenmcg warrenmcg commented Apr 20, 2017

Hi @pimentel and Sleuth team,

I went ahead and made changes to reconcile the gene aggregation feature available in the most recent stable release with the overhaul you've already done to improve the memory footprint of Sleuth. I started from commit 8fba175 on the gene_agg branch and accepted all of the commits through to the current stable release (commit 048f055). I then added the two commits from the gene_memory branch (through commit b0d4731). This was to facilitate an automatic merge. The rest is my work.

Here is a summary of what has changed from that point onward:

  • I updated the reads_per_base_transform function to use data.table throughout, which gives a significant speed boost. It is used to process the bootstrap matrix and the bootstrap TPM counts, and it is several-fold faster (about 20 seconds each for the matrix and TPM calculations in one of my own datasets, compared to several minutes each per sample with the old code). Also, because of how I implemented this, the TPM counts are computed before the est. counts summary.
  • I added code to the for loop within sleuth_prep to process the bootstrap data at the gene level. This also uses data.table throughout for fast processing (total processing is <1 min per sample with my own full datasets).
  • I also added code within sleuth_prep to update the sleuth object's features to gene-level data, not just bs_summary: filter_df, filter_bool, obs_norm, and obs_norm_filt now report gene-level information. obs_raw remains untouched, so it still has the transcript-level data.
  • I added a new transform_function option so that people can use another transformation function (e.g. log2 instead of ln, or an offset other than 0.5). I also extended the primitive $<- function and added a transform_synced item to the sleuth_model object, so that users can't update the transformation function after having run sleuth_prep; this protects reproducibility when sharing sleuth results.
  • I added tests to make sure gene aggregation works as expected, and also tests to quickly check whether the whole sleuth pipeline is working properly. I precomputed sleuth results at the transcript and gene level, and the tests compare these precomputed results to newly run results to make sure the current version of sleuth does not change the results significantly. These can be updated in future releases as other features are added or redesigned, and can also serve as a starting point for deciding when to increment major releases.
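
To make the gene-level step concrete, here is a minimal sketch of aggregating transcript-level values to genes with data.table and then applying a user-chosen transformation. It is not the code from this pull request: the toy tables, the column names, and the log2_transform helper are all assumptions made for illustration, and the real reads_per_base_transform may normalize and scale differently.

```r
library(data.table)

# Hypothetical transcript-level abundances for one bootstrap of one sample.
abund <- data.table(
  target_id  = c("tx1", "tx2", "tx3", "tx4"),
  est_counts = c(100, 50, 30, 0),
  eff_len    = c(1000, 500, 300, 200),
  tpm        = c(10, 10, 10, 0)
)

# Hypothetical transcript-to-gene mapping; blank gene IDs are treated as NA.
t2g <- data.table(
  target_id = c("tx1", "tx2", "tx3", "tx4"),
  gene_id   = c("geneA", "geneA", "geneB", "")
)
t2g[gene_id == "", gene_id := NA_character_]

# Attach gene IDs and drop transcripts without a gene.
merged <- merge(abund, t2g, by = "target_id")
merged <- merged[!is.na(gene_id)]

# Aggregate per gene: counts are normalized by effective length before
# summing (an assumption about the approach), and TPMs are summed directly.
gene_level <- merged[, .(
  reads_per_base = sum(est_counts / eff_len),
  tpm            = sum(tpm)
), by = gene_id]

# Hypothetical alternative transformation: log2 with a configurable offset.
log2_transform <- function(x, offset = 0.5) log2(x + offset)
gene_level[, tpm_log2 := log2_transform(tpm)]

print(gene_level)
```

A grouped sum like this runs as a single pass over the merged table, which is the kind of operation where data.table tends to be much faster than looping over genes.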

If this work is accepted, the gene_summary function will be obsolete.

Other features that are not related:

The features from previous pull requests are included, specifically #71, #94, #95, and #96. If those are all good, you can just focus on this pull request. If they are not good, we can discuss and make changes accordingly.

Plan:

I plan to move the code within the for loop to its own function, and then set things up so that the bootstrap reading can be done in parallel, so we can add back in the num_cores option. On my machine, this would reduce what previously took almost three hours with the old code (current release) down to 1-2 minutes (new code with parallel options).
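
As a rough illustration of that plan, the sketch below moves the per-sample work into its own function and runs it over samples with parallel::mclapply. The process_sample function and its simulated bootstrap matrix are hypothetical stand-ins, not sleuth code, and the num_cores handling shown here is only one possible choice.

```r
library(parallel)

# Hypothetical per-sample worker: in real code this would read and
# summarize the bootstraps for one sample.
process_sample <- function(samp_name, n_boot = 5, n_targets = 4) {
  # Stand-in for a bootstrap matrix (rows = bootstraps, columns = targets).
  bs_mat <- matrix(seq_len(n_boot * n_targets), nrow = n_boot)
  list(
    sample = samp_name,
    mean   = colMeans(bs_mat),
    var    = apply(bs_mat, 2, var)
  )
}

sample_names <- c("s1", "s2", "s3")

# mclapply relies on forking, which is unavailable on Windows, so fall
# back to a single core there. detectCores() can return NA.
num_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, min(2L, parallel::detectCores(), na.rm = TRUE))
}

results <- parallel::mclapply(sample_names, process_sample,
                              mc.cores = num_cores)
names(results) <- sample_names

str(results[["s1"]])
```

Because each sample is processed independently, the loop body needs no shared state, which is what makes pulling it out into a function a prerequisite for parallelizing it.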

Let me know what you think!

pimentel and others added 30 commits January 20, 2016 22:04
For sleuth_gene_table, sleuth_results, and sleuth_to_matrix
Updated sleuth_gene_table, sleuth_prep, sleuth_wt, sleuth_fit, and
models.
More precise definition of num_transcripts
temporarily disable filtering at gene level #vc

stash gene mode #vc

work in progress #vc

add functionality to filter by target id #vc

update aggregation #vc

comment out a section in aggregation #vc

deal with intersection #vc

first cut at empirical bayes #vc

fix small bug in empirical bayes #vc

testing out pmax #vc

empirical bayes and propagate filter #vc

norm by effective length

massive cleanup

more cleaning

refactor out check target mapping

remove filter by target_id

update warnings in check_target_mapping

more cleanup
…dded sleuth_prep option to select number of cores for mclapply
…ons for num_cores to throw informative error
…trix did not have dim names that matched the formula or the sample ids
… option into 'spread_abundance_by' to prevent downstream error when preparing just one sample
…trary string can be used as a sample name; proposed solution for pachterlab#89
+ any change to transform_fxn leads to all existing fits have 'FALSE' transform_synced statuses
+ the user is warned that all existing fits need to be redone.
+ switch order of @example and @export to prevent error (newbie mistake)
+ comment out reference to 'num_cores' for time being (not currently used)
+ User is now warned to re-do both sleuth_prep and sleuth_fit, since changing transform_fxn also affects bootstrap summaries
+ Bug 1: transform function was applied twice; now only applied in sleuth_prep
+ Bug 2: checks for transform sync status in sleuth_lrt referred to wrong variable
…es leads to downstream errors when doing gene aggregation
+ make sure blank entries are treated as NA values
+ fix behavior of overall list of gene IDs for gene-level filter_bool
+ this processed data uses the full target mappings
+ also includes TPM calculations for the bootstraps and the extra_bootstrap_summary
+ update Ellahi dataset README with info on the preprocessed results
@warrenmcg
Collaborator Author

Update: the proposed parallelization code is finished. I added back in the num_cores option from #94.

Collaborator

@pimentel pimentel left a comment

hi @warrenmcg

I'm really sorry this took so long. so far this is looking great. I've done a bit of testing locally and so far no issues. I'm going through and making comments to myself (no need for more action on your part) and will be making some minor modifications in the coming days.

Thanks so much for your hard work on this!

# for backwards compatibility
tidy_bs <- dplyr::select(tidy_bs, target_id,
est_counts, sample = bootstrap_num)
tidy_bs <- merge(data.table::as.data.table(tidy_bs),
Collaborator

@pimentel pimentel May 13, 2017

make all explicit by using scope operator.

done

}

if (extra_bootstrap_summary) {
bs_quant_est_counts <- aperm(apply(bs_mat, 2,
Collaborator

look into whether or not there is a faster matrixStats function. I think I tested the one that exists there and this is actually faster. will check soon
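
For reference, a comparison along those lines could be sketched as follows. This is only an illustration under assumptions: the bootstrap matrix is simulated, the quantile probabilities are made up, and t() is used in place of aperm() since the result is two-dimensional. It checks that the two approaches agree; timing them (e.g. with system.time) on real data would settle which is faster.

```r
# Hypothetical bootstrap matrix: rows are bootstraps, columns are targets.
set.seed(42)
bs_mat <- matrix(as.numeric(rpois(100 * 6, lambda = 20)), nrow = 100,
                 dimnames = list(NULL, paste0("tx", 1:6)))
probs <- c(0, 0.25, 0.5, 0.75, 1)

# Base R: apply() returns one column per target, so transpose to get
# one row per target.
quant_apply <- t(apply(bs_mat, 2, stats::quantile, probs = probs))

# matrixStats alternative, only run if the package is installed.
if (requireNamespace("matrixStats", quietly = TRUE)) {
  quant_ms <- matrixStats::colQuantiles(bs_mat, probs = probs)
  # Compare values only, ignoring any differences in dimnames.
  print(all.equal(unname(quant_apply), unname(quant_ms)))
}
```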

ret$bs_quants[[samp_name]]$est_counts <- bs_quant_est_counts
}

bs_mat <- transform_fxn(bs_mat)
Collaborator

@pimentel pimentel May 13, 2017

change to transform_fun for consistency

done

@warrenmcg
Collaborator Author

Don't worry about it at all!! Glad to see you back in action! :)

@pimentel pimentel merged commit 08edf11 into pachterlab:devel May 18, 2017
@pimentel
Collaborator

merged in -- thanks again for all the hard work! will be making it into the next version scheduled for sometime next week!

@pimentel pimentel mentioned this pull request May 29, 2017
@warrenmcg warrenmcg deleted the gene_testing branch June 3, 2017 03:33
3 participants