Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plot updates, vignette & bug fixes, etc. #85

Merged
merged 12 commits into from
Apr 29, 2022
Merged

plot updates, vignette & bug fixes, etc. #85

merged 12 commits into from
Apr 29, 2022

Conversation

mattssca
Copy link
Contributor

@mattssca mattssca commented Apr 14, 2022

Changes in this PR include the following updates:

  1. fancy_cnlohbar - updated colours (to match gambl palette for CN states), optional parameter to include CN state = 2, adding a second y-axis to annotate the number of nucleotides affected by each CN state. Both y-axis are now in log10 (since CN = 2 is now included). nNucleotides in each CN state are now plotted as geom_points over the corresponding CN count bar.

  2. fancy_sv_chrdistplot - Now uses the newly added palette for DEL and INS.

  3. fancy_snv_ chrdistplot - Name and description of this plot have now been adjusted for SNVs (previously stated SNPs).

  4. fancy_svbar - Now uses the newly added palette for DEL and INS.

  5. fancy_vplot - Now uses the newly added palette for DEL and INS.

  6. fancy_ideogram - Added in this PR. Colours have also been updated to match the defined palette for CN states. Previously, the plot annotated CN states up to 10+, but since the defined colour palette only goes up to CN state = 6, thus all CN states => 6 are annotated the same.

  7. New palette added to get_gambl_colours, all_colours[["indels"]] = c("DEL" = "#53B1FC", "INS" = "#FC9C6D").

  8. Updating get_codin_ssm to have these_samples_metadata parameter

  9. Fixing bug in get_ssm_bt_region (caused an error in ashm_rainbow_plot in the vignette). Were Streamlined = TRUE was wrongly embedded inside the function, causing the function to always return only two columns. This was removed, streamlined and basic_columns were also set to FALSE by default.

  10. lollipopPlot example in the vignette was updated to use get_coding_ssm (instead of get_ssm_by_gene), now works the intended way again.

  11. Regenerated package documentation.

  12. A lot of changes related to removing trailing white space in all scripts (no new code in portal.R, preprocessing_io.R and web.R).

Outstanding things that will be addressed in my next PR are examples in the vignette that are relying on the database for reading data. Examples of newly incorporated plotting functions will also be added.

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for all PRs

Required

  • I tested the new code for my use case (please provide a reproducible example of how you tested the new functionality)

  • I ensured all dplyr functions that commonly conflict with other packages are fully qualified.

This can be checked and addressed by running check_functions.pl and responding to the prompts. Test your code after you do this.

  • I generated the documentation and checked for errors relating to the new function (e.g. devtools::document()) and added NAMESPACE and all other modified files in the root directory and under man.

Optional but preferred with PRs

  • I updated and/or successfully knitted a vignette that relies on the modified code (which ones?)

Checklist for New Functions

Required

  • I documented my function using ROxygen style.)

  • All parameters for the function are described in the documentation and the function has a decriptive title.

Example:

#' Use GISTIC2.0 scores output to reproduce maftools::chromoplot with more flexibility
#'
#' @param scores output file scores.gistic from the run of GISTIC2.0
#' @param genes_to_label optional. Provide a data frame of genes to label (if mutated). The first 3 columns must contain chromosome, start, and end coordinates. Another required column must contain gene names and be named `gene`. (truncated for example)
#' @param cutoff optional. Used to determine which regions to color as aberrant. Must be float in the range [0-1]. (truncated for example)
  • My function uses a library that isn't already a dependency of GAMBLR and I made the package aware of this dependency using the function documentation import statment.

Example:

#' @return nothing
#' @export
#' @import tidyverse ggrepel

Checklist for changes to existing code

  • I added/removed arguments to a function and updated documentation for all changed/new arguments

  • I tested the new code for compatability with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)

…new plotting functions to vignette, package doc etc.
@mattssca
Copy link
Contributor Author

In this commit the following updates have been made:

  • Removed LOH from ideogram (replaced with CN0 and CN1)
  • New ideogram plotting function for visualizing multisample CN segments. This function is based and expanded on the previous ideogram function. The function currently accepts 2, 3 or 4 samples.
  • New examples were added to the vignette to showcase the latest additions in terms of plotting functions within GAMBL
  • Examples in the vignette that are using database data have been updated to read from flat files instead.
  • Examples in the vignette that calls no longer available functions have been removed or replaced.

@Kdreval Kdreval self-requested a review April 28, 2022 04:19
Copy link
Contributor

@Kdreval Kdreval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Adam! Some small updates here, mostly regarding converting colors and coordinates to call of GAMBLR functions or objects.

@@ -51,9 +51,9 @@ annotate_ssm_blacklist = function(mutations_df,
for(b in blacklist_files){
full_path = paste0(project_base,b)
lifted_blacklist = read_tsv(full_path, col_names = c("chrpos", "blacklist_count"), show_col_types = FALSE)
lifted_blacklist = lifted_blacklist %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this file, there are only space/tab edits - is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Sir!

R/database.R Outdated

all_meta = get_gambl_metadata(from_flatfile = from_flatfile)
all_meta = get_gambl_metadata(from_flatfile = from_flatfile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the line I mentioned here: #87. Can you please add seq_type=seq_type for this call?
Also, it looks like it will overwrite whatever the metadata user provided in these_samples_metadata handled in lines 1646-1651 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line 1653 needs to be removed, this was a mistake on my end. Ryan asked me to add this code and I should've probably removed the mentioned line. to not overwrite.

@@ -34,100 +34,100 @@ setup_study = function(short_name = "GAMBL",
project_name = "gambl_minus_icgc",
description = "GAMBL data without ICGC",
cancer_type = "mixed",
data_cancer_type = "brca",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the portal.R is also just space/tab related I think

@@ -998,7 +998,7 @@ liftover_bedpe = function(bedpe_file,
output = rtracklayer::export(first_ok, format = "bed") %>%
read_tsv(col_names = c("chrom", "start", "end", "score", "strand", "nothing", "s1", "e1", "junk", "more", "stuff", "nada")) %>%
dplyr::select("chrom", "start", "end")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same goes here in this file, only space/tabs. Just making sure I did not miss relevant changes 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, same thing here. Just space/tab edits :)

R/utilities.R Outdated
cn_segments = subset(cn_segments, CN != 2)

#update CN annotations (if present in cn_segment data).
if("0" %in% cn_segments$CN){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this can be simplified to a unified call:

cn_segments$CN = paste0("cn_", cn_segments$CN , "_sample", samplen)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed! Nice, good suggestion.

R/viz.R Outdated
l_cn_seg = maf
l_cn_seg$lenght = l_cn_seg$end - l_cn_seg$start
l_cn_seg$CN = as.factor(l_cn_seg$CN)
l_cn_seg = dplyr::filter(l_cn_seg, CN != 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this filtering on neutral segments without CNV? Then this needs to be CN != 2, since this is not log ratios but the absolute CN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good catch, when I developed these functions I wrongly assumed that CN states 0 and 1 always would be interpreted as LOH events. I've made updates to this and the plots (i.e ideograms) now show CN states 0 and 1. So, this line needs to be updated so that CN0 is not removed.

R/viz.R Outdated
@@ -3045,8 +3066,396 @@ fancy_vplot = function(this_sample,
labs(title = paste0(this_sample), subtitle = plot_subtitle, x = "", y = "Variant Size") +
geom_violin(trim = FALSE, scale = "width", color = NA) +
stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black") +
scale_fill_manual("", values = c("DEL" = "#E6856F", "INS" = "#3A8799")) +
scale_fill_manual("", values = c("DEL" = "#53B1FC", "INS" = "#FC9C6D")) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also needs conversion to a function call here

R/viz.R Outdated
}

#grch37 coordinates
grch37_end = c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are chromosome arms coordinates objects not as part of GAMBLR, maybe this should be converted to using those instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I could add ref/lcr-modules-references/genomes/grch37/genome_fasta/main_chromosomes.bed to the config.yml and then get the coordinates from the bed directly. This would also allow for use of different genome projections. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already added to GAMBLR 😃 You can check these out:

GAMBLR::chromosome_arms_grch37
GAMBLR::chromosome_arms_hg38

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nice! That's perfect, thanks.

R/viz.R Outdated
geom_segment(data = segment_data, aes(x = chr_start, xend = chr_end, y = y + 0.5, yend = yend + 0.5, label = chr), color = "white", lineend = "butt", size = 2, stat = "identity", position = position_dodge()) + #white space between chromosome groups (bottom)
annotate(geom = "text", x = anno_dist, y = segment_data$yend, label = segment_data$chr, color = "black", size = 7, hjust = 1) + #chr labels
labs(title = plot_title, subtitle = plot_sub) + #plot titles
scale_colour_manual(name = "", values = c(CN0 = "#4393C3", CN1 = "#92C5DE", CN3 = "#D6604DFF", CN4 = "#B2182BFF", CN5 = "#67001FFF", CN6_or_more = "#380015FF")) + #colours and legend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we convert it here to get_gambl_colors call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed!

R/viz.R Outdated
}

#plot
ggplot() +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function does not seem to return any object. Sometimes it makes it difficult if you want to store a plot in a variable and then combine multiple plots with something like ggpubr:: ggarrange when arranging a multiplot figure. Like the next function below fancy_multisamp_ideogram can we store the plot here in something like p and then return it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this function should also be updated to use the newly added helper function to subset cn states for the sample.

Copy link
Contributor Author

@mattssca mattssca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your comments Kostia, I will address the changes requested, as well as some other minor things I just caught. There will be a new commit later today (or early tomorrow) in regards to this. Let me know if there. are any questions at your end, related to my replies.

@@ -51,9 +51,9 @@ annotate_ssm_blacklist = function(mutations_df,
for(b in blacklist_files){
full_path = paste0(project_base,b)
lifted_blacklist = read_tsv(full_path, col_names = c("chrpos", "blacklist_count"), show_col_types = FALSE)
lifted_blacklist = lifted_blacklist %>%
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Sir!

R/database.R Outdated

all_meta = get_gambl_metadata(from_flatfile = from_flatfile)
all_meta = get_gambl_metadata(from_flatfile = from_flatfile)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line 1653 needs to be removed, this was a mistake on my end. Ryan asked me to add this code and I should've probably removed the mentioned line. to not overwrite.

@@ -998,7 +998,7 @@ liftover_bedpe = function(bedpe_file,
output = rtracklayer::export(first_ok, format = "bed") %>%
read_tsv(col_names = c("chrom", "start", "end", "score", "strand", "nothing", "s1", "e1", "junk", "more", "stuff", "nada")) %>%
dplyr::select("chrom", "start", "end")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, same thing here. Just space/tab edits :)

R/utilities.R Outdated
cn_segments = subset(cn_segments, CN != 2)

#update CN annotations (if present in cn_segment data).
if("0" %in% cn_segments$CN){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed! Nice, good suggestion.

}

#convert to factor.
cn_segments$CN = as.factor(cn_segments$CN)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, the order of the levels doesn't matter, more so that they in fact are factors.

R/viz.R Outdated
l_cn_seg = maf
l_cn_seg$lenght = l_cn_seg$end - l_cn_seg$start
l_cn_seg$CN = as.factor(l_cn_seg$CN)
l_cn_seg = dplyr::filter(l_cn_seg, CN != 0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good catch, when I developed these functions I wrongly assumed that CN states 0 and 1 always would be interpreted as LOH events. I've made updates to this and the plots (i.e ideograms) now show CN states 0 and 1. So, this line needs to be updated so that CN0 is not removed.

R/viz.R Outdated
}

#grch37 coordinates
grch37_end = c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I could add ref/lcr-modules-references/genomes/grch37/genome_fasta/main_chromosomes.bed to the config.yml and then get the coordinates from the bed directly. This would also allow for use of different genome projections. What do you think?

R/viz.R Outdated
}

#plot
ggplot() +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely!

R/viz.R Outdated
geom_segment(data = segment_data, aes(x = chr_start, xend = chr_end, y = y + 0.5, yend = yend + 0.5, label = chr), color = "white", lineend = "butt", size = 2, stat = "identity", position = position_dodge()) + #white space between chromosome groups (bottom)
annotate(geom = "text", x = anno_dist, y = segment_data$yend, label = segment_data$chr, color = "black", size = 7, hjust = 1) + #chr labels
labs(title = plot_title, subtitle = plot_sub) + #plot titles
scale_colour_manual(name = "", values = c(CN0 = "#4393C3", CN1 = "#92C5DE", CN3 = "#D6604DFF", CN4 = "#B2182BFF", CN5 = "#67001FFF", CN6_or_more = "#380015FF")) + #colours and legend
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed!

R/viz.R Outdated
}

#plot
ggplot() +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this function should also be updated to use the newly added helper function to subset cn states for the sample.

@mattssca
Copy link
Contributor Author

mattssca commented Apr 28, 2022

Changes in this commit include fixes to issues/suggestions raised in the recent PR review.

  1. Use global colours - The following plotting function have been updated to read palette from get_gambl_colours(): fancy_sv_chrdistplot, fancy_sv_bar, fancy_vplot, fancy_ideogram, fancy_multisamp_ideogram.

  2. Renaming of variable returned after calling get_sample_cn_segments in fancy_cnbar (not a maf returned)

  3. New parameter was added to fancy_cnbar, allowing the user to specify the cut-off value for maximum CN states to be retrieved.

  4. CN0 is now retained in fancy_cnbar.

  5. Chromosome tables for both ideograms are updated, now using GAMBLR::chromosome_arms_grch37 to retrieve this information.

  6. fancy_ideogram now stores the plot inside a variable (p) allowing the user to combine multiple plots when arranging a multiplot figure.

  7. fancy_ideogram is now also using the newly added helper function (subset_cnstates).

  8. Above mentioned helper function is also simplified to a unified call.

  9. Overwriting line (all_meta) to database.R has been removed.

  10. Package documentation has been regenerated to reflect added/updated parameters.

@mattssca mattssca requested a review from Kdreval April 28, 2022 21:52
R/viz.R Outdated
@@ -2942,48 +2947,48 @@ fancy_svbar = function(this_sample,
fancy_cnbar = function(this_sample,
plot_subtitle = "n CNV Segments (barplots, left y-axis), n Affected bases for each CN state",
chr_select = paste0("chr", c(1:22)),
cutoff = 160,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a typo or intended to be 160 for the CN state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just put a really high number as a placeholder and forgot to change it back to something more realistic.

Copy link
Contributor

@Kdreval Kdreval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Adam!

@mattssca mattssca merged commit 90d4c67 into master Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants