Isolated immune cell reconstruction evaluation #5
only minor comments
gs.file <- file.path("data", "expression_data",
                     "E-MTAB-2452_hugene11st_SCANfast_with_GeneSymbol.pcl")
exprs.df <- readr::read_tsv(gs.file)
exprs.mat <- as.matrix(exprs.df[, 3:ncol(exprs.df)])
Why start at column 3? Maybe add a comment about what the first 2 columns are.
    height = 11, width = 8.5)
```

### E-MTAB-2452 Boxplots
The x axis tick labels seem to bleed a bit into one another. Is it possible to rename them? For example, when there is an n = in the label, I tend to put it on a new line.
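A minimal sketch of the relabeling idea, using base R only. The tick labels here are hypothetical stand-ins, not the notebook's actual labels:

```r
# Hypothetical tick labels; the real labels in the notebook may differ
tick.labels <- c("CD4 T cells, n = 12", "CD14 monocytes, n = 12")

# Move the "n = ..." part onto its own line by replacing the
# separator before it with a newline character
wrapped.labels <- sub(",? n =", "\nn =", tick.labels)

cat(wrapped.labels, sep = "\n\n")
```

The wrapped labels could then be supplied to the plot, for example through the `labels` argument of `ggplot2::scale_x_discrete()`.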
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```

## Summary
Are all datasets trained together? Or are different models trained on each individually? Or, are the models trained using a single dataset and the other datasets are transformed into this space?
Or, are the models trained using a single dataset and the other datasets are transformed into this space?
This one -- a single PLIER model is trained on the recount2 dataset, which includes SRP045500. E-MTAB-2452 is transformed into the recount2 PLIER space.
I'm working with LVs from this recount2 PLIER model exclusively, but in some cases I'm using only LVs that are significantly associated with a pathway or only those LVs that are not associated with a pathway.
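To make this setup concrete, here is a minimal sketch in R of transforming a new dataset into an existing latent space and then reconstructing it from all LVs or from a subset. It assumes the model approximates expression as `Z %*% B` (gene loadings times LV values) and uses a ridge-style projection. The simulated matrices, the `lambda` value, and the LV subset are all hypothetical, and this is not necessarily how the notebook computes it:

```r
set.seed(1)
n.genes <- 50
n.lvs <- 5
n.samples <- 8

# Hypothetical gene loadings from a trained model (genes x LVs)
Z <- matrix(abs(rnorm(n.genes * n.lvs)), nrow = n.genes)
# Hypothetical new expression dataset (genes x samples)
exprs.new <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)
# Hypothetical L2 penalty; a real model would supply its own value
lambda <- 1

# Project the new data into the latent space:
# B = (Z'Z + lambda * I)^-1 Z'Y
B.new <- solve(crossprod(Z) + lambda * diag(n.lvs),
               crossprod(Z, exprs.new))

# Reconstruct expression using all LVs
recon.all <- Z %*% B.new

# Reconstruct using only a subset of LVs
# (e.g., hypothetical "pathway-associated" LVs 1 and 3)
lv.subset <- c(1, 3)
recon.subset <- Z[, lv.subset, drop = FALSE] %*%
  B.new[lv.subset, , drop = FALSE]
```

Comparing `recon.all` and `recon.subset` against `exprs.new` is then what a per-subset reconstruction evaluation would measure.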
small doc change, relabel boxplot x axis ticks
Looks good! Only some minor comments.
  dplyr::mutate(MASE = as.numeric(as.character(MASE)),
                `Spearman correlation` =
                  as.numeric(as.character(`Spearman correlation`)))
I don't quite understand why MASE needs the as.numeric(as.character(MASE)) conversion.
When binding together the columns, MASE and the Spearman correlation end up as factors.
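A small illustration of why the two-step conversion matters, using made-up values rather than the notebook's data. Calling `as.numeric()` directly on a factor returns the integer level codes, not the original numbers:

```r
# Hypothetical values; cbind() on a character and a numeric vector
# yields a character matrix
metrics <- cbind(gene = c("A", "B", "C"), MASE = c(0.5, 1.25, 0.75))

# stringsAsFactors = TRUE is set explicitly here; it was the default
# before R 4.0
metrics.df <- as.data.frame(metrics, stringsAsFactors = TRUE)
is.factor(metrics.df$MASE)

# Converting the factor directly gives the level codes
wrong <- as.numeric(metrics.df$MASE)

# Going through character first recovers the original values
right <- as.numeric(as.character(metrics.df$MASE))
```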
  ggplot2::theme_bw() +
  ggplot2::scale_fill_manual(values = c("white", "gray50", "black")) +
  ggplot2::ggtitle(paste("All, n =", ncol(z.matrix)))
```
In the plot, does recount2 mean reconstructing the gene expression of the recount2 samples based on the recount2 PLIER model?
Yes, that is correct.
                       "E-MTAB-2452_reconstruction_error_recount2_model.pdf")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
Maybe add statistics to show that the distributions are significantly different (t-test or ANOVA)? One benefit is that it provides a quantitative way to support the conclusion (e.g., that the pre- and post-reconstruction correlation values are much more similar between the two datasets), but it is up to you since the difference is clear from the plot.
pairwise.t.test coming up in the next commit.
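For reference, a minimal sketch of what such a comparison could look like with `stats::pairwise.t.test()`. The data frame, group names, and simulated correlation values are hypothetical, and the Bonferroni adjustment is an assumption rather than the choice made in the notebook:

```r
set.seed(42)

# Hypothetical per-sample Spearman correlations for three LV subsets
recon.df <- data.frame(
  correlation = c(rnorm(20, mean = 0.9, sd = 0.02),
                  rnorm(20, mean = 0.8, sd = 0.02),
                  rnorm(20, mean = 0.5, sd = 0.05)),
  lv.set = factor(rep(c("All", "Pathway-associated",
                        "Not pathway-associated"), each = 20))
)

# Pairwise t-tests between LV subsets with multiple-testing correction
test.res <- pairwise.t.test(recon.df$correlation, recon.df$lv.set,
                            p.adjust.method = "bonferroni")

# Matrix of adjusted p-values for each pair of groups
test.res$p.value
```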
This PR adds 04-isolated_immune_cell_reconstruction.
In this notebook, I evaluate the reconstruction of the sorted leukocyte microarray dataset introduced in 03-isolated_cell_type_populations (E-MTAB-2452) as compared to a sorted leukocyte RNA-seq dataset that is included in the recount2 dataset (SRP045500), and, therefore, the training set for the PLIER model under consideration.
During #3, the idea of using only high-weight genes for reconstruction came up (see #4). I've chosen not to explore this at this time because my goal was/is to test how the subset of latent variables used for reconstruction (all vs. only pathway-associated vs. only those LVs that are not significantly associated with any gene sets, which I assume capture variation from technical factors) affects reconstruction, rather than to improve the reconstruction performance. I think exploration of improving reconstruction performance would probably require a deeper dive than makes sense for this particular project. Please let me know what you think.
Here's the notebook HTML file for easy viewing: 04-isolated_immune_cell_reconstruction.nb.html.zip
I've made a few changes upstream of this notebook in 02-recount2_PLIER_exploration to save the recount2 reconstructed expression data and associated evaluation metrics, as well.