Figure: Statistical Techniques #106

Closed
cgreene opened this issue Sep 1, 2020 · 16 comments
@cgreene
Collaborator

cgreene commented Sep 1, 2020

It would be grand to get a figure on the statistical techniques we discuss and to link those to how they address challenges in the rare disease space.

@jaybee84 jaybee84 self-assigned this Sep 17, 2020
@jaybee84
Owner

This is a tentative sketch for the figure depicting key takeaways from the newly rewritten "Manage model complexity" section.

cc: @jaclyn-taroni @allaway @cgreene

We can use this as a starting point for the final figure for this section based on your comments. Happy to modify as needed.

@jaybee84
Owner

jaybee84 commented Nov 9, 2020

What do we intend to communicate via the figure?
-- The main message is that applying machine learning to rare disease datasets can lead to complex, easily misinterpreted models due to the scarcity of data points and other challenges associated with this kind of data. However, we can draw on various statistical techniques to build simple, stable models that capture the essential and relevant patterns in rare disease data. A tentative sketch of the figure is presented in the comment above. The "person" in the figure represents patients; the small spheres and ovals represent features (e.g., genes, variants, symptoms) associated with a patient sample.

If there are multiple pieces of information we are trying to present (see the first point), what is the one core piece of information that the audience should walk away with?
-- Using specific statistical techniques that can mitigate the challenges posed by small and heterogeneous datasets is essential for the successful application of machine learning to rare diseases.

A list of, or brief description of, concepts to become familiar with
-- The "Manage model complexity" section of the manuscript outlines the main strategies captured in this figure.

@dvenprasad
Collaborator

I split this up into individual methods for the purposes of illustrating them. We can combine them once we are happy with how each individual method is presented.

They are very generic. As we iterate, we can make them more specific.

Bootstrapping

bootstrapping

Ensemble Learning

I have annotated it with my comments.

ensemble-learning

Regularization

For this I used the same illustration as dimension reduction. It seemed to me that both reduce the feature space, and that was what we wanted to show. Also, someone (maybe Robert? not sure) mentioned on the call today the idea of having two levels of abstraction for #116, and maybe this is a good place to present a more abstract figure?

regularization

One-class-at-a time

I have annotated it with my comments.

one-class-at-a-time

@jaybee84
Owner

jaybee84 commented Nov 30, 2020

Thanks @dvenprasad ! Below are my first thoughts:

  1. Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping resamples the data points (i.e., takes repeated draws of the same samples, like picking balls from a bag with replacement) while the features are left untouched.
    -- Following the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).
  2. Ensemble learning: would the open circles be considered features that the model is considering?
  3. Regularization decreases the feature space by penalizing models that consider too many features, thus favoring models that rely on a few important features. Taking the rectangular box analogy again, it helps pick the models that use a square box (where each side is equal to the short side of the rectangular box) instead of the ones that try to use the whole rectangular box.
  4. One class at a time: here I think it may be helpful to divide each cube into 4 or more colored sections (a discrete gradient of 4 shades), each shade depicting a "class", and then use the simple model to separate one sub-color (e.g., the darkest shade) from the rest of the data at each step, instead of a complex model separating all shades at the same time.
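The one-class-at-a-time idea in point 4 is essentially a one-vs-rest decomposition. A minimal sketch (class labels and function name are hypothetical, just for illustration): each step relabels the data so a simple binary model separates a single class from everything else.

```python
def one_vs_rest_labels(labels, target):
    """Relabel a multi-class problem as binary: 1 for `target`, 0 for everything else."""
    return [1 if y == target else 0 for y in labels]

# Hypothetical class labels; one simple binary model per class replaces
# one complex model that separates all classes at once.
labels = ["A", "B", "C", "A", "B"]
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
# binary_problems["A"] == [1, 0, 0, 1, 0]
```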

@allaway
Collaborator

allaway commented Dec 3, 2020

Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping resamples the data points (i.e., takes repeated draws of the same samples, like picking balls from a bag with replacement) while the features are left untouched.
-- Following the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).

I agree with @jaybee84 here, and it's important to note that the sampling is with replacement (at least for all of the bootstrapping I'm familiar with :) ), so you typically end up with replicates of some samples in every "bootstrap".
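Sampling with replacement can be sketched in a few lines of Python (sample names and function name are hypothetical): each bootstrap draw is the same length as the original sample set, so some samples typically repeat within a draw.

```python
import random

def bootstrap_draws(samples, n_draws, seed=0):
    """Draw n_draws bootstrap resamples, each the same size as `samples`.
    Sampling is with replacement, so a draw usually repeats some samples,
    while the features attached to each sample travel with it untouched."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in samples] for _ in range(n_draws)]

draws = bootstrap_draws(["s1", "s2", "s3", "s4"], n_draws=3)
# Every draw has 4 entries, all taken from the original samples.
```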

@dvenprasad
Collaborator

Bootstrapping:
Yes, I was shuffling features. I've redone them with your notes. I'm shuffling the colors on the cube to indicate the resampling of data points (I do want to add circles within the squares as data points, and we can change the colors of those instead of the sides of the cube). I'm not sure how to show that each side is the length of the longer side.

bootstrapping

Ensemble learning
Yes, each circle could be considered as a feature.

Regularization
PXL_20201207_183721344

Model A has learned from a rectangular box and Model B has learned from a cube. So is regularization taking these two models and ranking them? Would the end result be that Model B is ranked higher than Model A?
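For what it's worth, the trade-off can be sketched as a toy training objective (all numbers hypothetical): the penalty is part of the loss that is minimized during training, rather than an after-the-fact ranking of finished models.

```python
def penalized_loss(fit_error, weights, lam=1.0):
    """Regularized training objective: data-fit error plus an L1 penalty.
    Models that spread weight across many features pay a larger penalty."""
    return fit_error + lam * sum(abs(w) for w in weights)

# Hypothetical numbers: Model A leans on many features, Model B on a few,
# with a similar fit to the data.
loss_a = penalized_loss(fit_error=1.0, weights=[0.5] * 20)  # 1.0 + 10.0
loss_b = penalized_loss(fit_error=1.2, weights=[0.9, 0.8])  # 1.2 + about 1.7
# loss_b < loss_a: minimizing this objective steers training toward the
# sparser model rather than comparing trained models afterwards.
```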

@dvenprasad
Collaborator

One class at a time
I've depicted the output as the model being able to tell which class it has learned, treating the rest of the classes as a single "class I don't know".

one-class-at-a-time

@dvenprasad
Collaborator

dvenprasad commented Dec 9, 2020

Okay, took another pass at bootstrapping and ensemble learning after Monday's discussion.

Bootstrapping

Couple of notes:

  • Shapes indicate the dataset source; colors indicate classes
  • For the aggregate and resampled datasets, I kept the colors together because it makes it easier to track what is changing. The number of classes is the same, but the samples can be from another source (I hope this is correct)

bootstrapping

Ensemble Learning

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and still get a good-health combined model? Do we want to show that kind of variation?

ensemble-learning

@allaway
Collaborator

allaway commented Dec 9, 2020

bootstrapping figure:

I really like this! My only thought would be to add the individual models that are created during each bootstrap to help the reader understand how each bootstrap is contributing some knowledge to the final aggregate model.

here's a sloppy mock of what I was thinking:
bootstrap

@allaway
Collaborator

allaway commented Dec 9, 2020

re ensemble modeling:

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and still get a good-health combined model? Do we want to show that kind of variation?

Here's an example of actually combining models:

Screen Shot 2020-12-09 at 9 02 55 AM

The first box is our "best" model alone. The 2nd is the "best + 2nd best", the 3rd is "best + 2nd best + 3rd best", and so on.

The blue and yellow boxes are anything substantially "better health" than the best model alone. The red boxes are statistically indistinguishable from the best model alone. So, if we combine 'good health' models with each other, the resulting ensemble is better, but if we start adding in too many models of 'poor health' we don't actually get a model that's better than the sum of its parts.

I'm not sure this is worth conveying in the figure; it's probably beyond the scope of this manuscript because it likely varies from problem to problem, and I doubt it is specific to rare disease modeling.

Also, I'm a little unsure of what the "runs" indicate. This looks like an ensemble of ensembles (which is still an ensemble, but might be unnecessary for conveying the concept?)
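As a toy illustration of why combining members helps (numbers and names hypothetical, not tied to the figure's specific models): when the members err in different directions, the unweighted average cancels part of each member's error.

```python
def ensemble_predict(models, x):
    """Unweighted ensemble: average the member models' predictions."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Toy members whose errors point in different directions (true value is 10):
members = [lambda x: x + 2, lambda x: x - 3, lambda x: x + 1]
combined = ensemble_predict(members, 10)  # -> 10.0, closer than any member alone
```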

@jaybee84
Owner

jaybee84 commented Dec 9, 2020

Re: bootstrap

suggest changing "aggregate" to "harmonize" :)

@jaybee84
Owner

jaybee84 commented Dec 10, 2020

Re: ensemble
image

@jaybee84
Owner

@dvenprasad I added a rough sketch re: ensemble learning in the above comment. Please let me know if you cannot view it or have questions.

@dvenprasad
Collaborator

Regularization

regularization

@jaybee84
Owner

I like the above figure... just a few notes:

  1. We might not want to use circles as a dataset indicator, since we are already using circles as features
  2. To be consistent with the dimension reduction figure (and the heatmap panel), we might want the datasets to be rectangles in portrait mode, i.e., many features as rows and few samples as columns
  3. It may just be me, but in this figure it seems like the top two rectangles are making the unhealthy model and the bottom two rectangles are making the healthy model. It would be ideal to show that all 4 datasets together lead to the top model when not regularized, and to the bottom model when regularized.

@dvenprasad
Collaborator

dvenprasad commented Dec 18, 2020

Changed yellow to purple because the contrast with white was really bad.

Bootstrapping
Updated it based on feedback: added in sad models for each resampling of the dataset.

bootstrapping

Ensemble Learning
Re-did it based on @jaybee84's sketch.
ensemble-learning

Regularization
Made changes based on #106 (comment)

regularization

One class at a time

  • Greyed-out shapes are items that the model does not recognize.
  • The sad model misclassifies some samples and fails to recognize others.

one-class-at-a-time
