Figure: Statistical Techniques #106

Closed
cgreene opened this issue Sep 1, 2020 · 16 comments
@cgreene
Collaborator

cgreene commented Sep 1, 2020

It would be grand to get a figure on the statistical techniques we discuss and to link those to how they address challenges in the rare disease space.

@jaybee84 jaybee84 self-assigned this Sep 17, 2020
@jaybee84
Owner

This is a tentative sketch for the figure depicting key takeaways from the newly rewritten "Manage model complexity" section.

cc: @jaclyn-taroni @allaway @cgreene

We can use this as a starting point for the final figure for this section based on your comments. Happy to modify as needed.

@jaybee84
Owner

jaybee84 commented Nov 9, 2020

What do we intend to communicate via the figure?
-- The main message is that applying machine learning to rare disease datasets can lead to complex, easily misinterpreted models due to the scarcity of data points and other challenges associated with this kind of data. However, we can draw on various statistical techniques to build simple, stable models that capture the essential and relevant patterns in rare disease data. A tentative sketch of the figure is presented in the comment above. The "person" in the figure represents patients; the small spheres and ovals represent features (e.g., genes, variants, symptoms) associated with a patient sample.

If there are multiple pieces of information we are trying to present (see the first point), what is the one core piece of information that the audience should walk away with?
-- Using specific statistical techniques that can mitigate the challenges posed by small and heterogeneous datasets is essential for the successful application of machine learning to rare diseases.

A list of, or brief description of, concepts to become familiar with
-- The "Manage model complexity" section of the manuscript outlines the main strategies captured in this figure.

@dvenprasad
Collaborator

I split this up into individual methods for the purposes of illustrating them. We can combine them once we are happy with how each individual method is presented.

They are very generic. As we iterate, we can make them more specific.

Bootstrapping

bootstrapping

Ensemble Learning

I have annotated it with my comments.

ensemble-learning

Regularization

For this I used the same illustration as dimension reduction. It seemed to me that both reduce the feature space, and that was what we wanted to show. Also, someone (maybe Robert? not sure) mentioned on the call today the idea of having two levels of abstraction for #116, and maybe this is a good place to present a more abstract figure?

regularization

One-class-at-a time

I have annotated it with my comments.

one-class-at-a-time

@jaybee84
Owner

jaybee84 commented Nov 30, 2020

Thanks @dvenprasad ! Below are my first thoughts:

  1. Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping resamples the data points (i.e., takes repeated draws of the same samples, like picking balls from a bag with replacement) while the features are left untouched.
    -- Following the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).
  2. Ensemble learning: would the open circles be considered features that the model is considering?
  3. Regularization decreases the feature space by penalizing models that consider too many features, thus favoring models that rely on a few important features. Taking the rectangular box analogy again, it helps pick the models that use a square box (where each side is equal to the short side of the rectangular box) instead of the ones that try to use the whole rectangular box.
  4. One class at a time: here I think it may be helpful to divide each cube into 4 or more colored sections (a discrete gradient of 4 shades), each shade depicting a "class", and then use the simple model to separate one sub-color (e.g., the darkest shade) from the rest of the data at each step, instead of a complex model separating all shades at the same time.
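The one-class-at-a-time idea in point 4 is essentially a one-vs-rest decomposition. A minimal sketch (class labels and function name are hypothetical, just for illustration): each step relabels the data so a simple binary model separates a single class from everything else.

```python
def one_vs_rest_labels(labels, target):
    """Relabel a multi-class problem as binary: 1 for `target`, 0 for everything else."""
    return [1 if y == target else 0 for y in labels]

# Hypothetical class labels; one simple binary model per class replaces
# one complex model that separates all classes at once.
labels = ["A", "B", "C", "A", "B"]
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
# binary_problems["A"] == [1, 0, 0, 1, 0]
```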

@allaway
Collaborator

allaway commented Dec 3, 2020

Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping resamples the data points (i.e., takes repeated draws of the same samples, like picking balls from a bag with replacement) while the features are left untouched.
-- Following the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).

I agree with @jaybee84 here, and it's important to note that the sampling is with replacement (at least for all of the bootstrapping I'm familiar with :) ), so you typically end up with replicates of some samples in every "bootstrap".
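Sampling with replacement can be sketched in a few lines of Python (sample names and function name are hypothetical): each bootstrap draw is the same length as the original sample set, so some samples typically repeat within a draw.

```python
import random

def bootstrap_draws(samples, n_draws, seed=0):
    """Draw n_draws bootstrap resamples, each the same size as `samples`.
    Sampling is with replacement, so a draw usually repeats some samples,
    while the features attached to each sample travel with it untouched."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in samples] for _ in range(n_draws)]

draws = bootstrap_draws(["s1", "s2", "s3", "s4"], n_draws=3)
# Every draw has 4 entries, all taken from the original samples.
```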

@dvenprasad
Collaborator

Bootstrapping:
Yes, I was shuffling features. I've redone them with your notes. I'm shuffling the colors on the cube to indicate the resampling of data points (I do want to add circles within the squares as data points, and we can change the colors of those instead of the sides of the cube). I'm not sure how to show that each side is the length of the longer side.

bootstrapping

Ensemble learning
Yes, each circle could be considered as a feature.

Regularization
PXL_20201207_183721344

Model A has learned from a rectangular box and Model B has learned from a cube. So is regularization taking these two models and ranking them? Would the end result be that Model B is ranked higher than Model A?
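For what it's worth, the trade-off can be sketched as a toy training objective (all numbers hypothetical): the penalty is part of the loss that is minimized during training, rather than an after-the-fact ranking of finished models.

```python
def penalized_loss(fit_error, weights, lam=1.0):
    """Regularized training objective: data-fit error plus an L1 penalty.
    Models that spread weight across many features pay a larger penalty."""
    return fit_error + lam * sum(abs(w) for w in weights)

# Hypothetical numbers: Model A leans on many features, Model B on a few,
# with a similar fit to the data.
loss_a = penalized_loss(fit_error=1.0, weights=[0.5] * 20)  # 1.0 + 10.0
loss_b = penalized_loss(fit_error=1.2, weights=[0.9, 0.8])  # 1.2 + about 1.7
# loss_b < loss_a: minimizing this objective steers training toward the
# sparser model rather than comparing trained models afterwards.
```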

@dvenprasad
Collaborator

One class at a time
I've depicted the output as the model being able to tell which class it has learned, treating the rest of the classes as a single "class I don't know".

one-class-at-a-time

@dvenprasad
Collaborator

dvenprasad commented Dec 9, 2020

Okay, took another pass at bootstrapping and ensemble learning after Monday's discussion.

Bootstrapping

Couple of notes:

  • Shapes indicate the dataset source; colors indicate classes
  • For the aggregate and resampled datasets, I kept the colors together because it makes it easier to track what is changing. The number of classes is the same, but the samples can be from another source (I hope this is correct)

bootstrapping

Ensemble Learning

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and still get a good-health combined model? Do we want to show that kind of variation?

ensemble-learning

@allaway
Collaborator

allaway commented Dec 9, 2020

bootstrapping figure:

I really like this! My only thought would be to add the individual models that are created during each bootstrap to help the reader understand how each bootstrap is contributing some knowledge to the final aggregate model.

here's a sloppy mock of what I was thinking:
bootstrap

@allaway
Collaborator

allaway commented Dec 9, 2020

re ensemble modeling:

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and still get a good-health combined model? Do we want to show that kind of variation?

Here's an example of actually combining models:

Screen Shot 2020-12-09 at 9 02 55 AM

The first box is our "best" model alone. The 2nd is the "best + 2nd best", the 3rd is "best + 2nd best + 3rd best", and so on.

The blue and yellow boxes are anything substantially "better health" than the best model alone. The red boxes are statistically indistinguishable from the best model alone. So, if we combine 'good health' models with each other, the resulting ensemble is better, but if we start adding in too many models of 'poor health' we don't actually get a model that's better than the sum of its parts.

I'm not sure this is worth conveying in the figure; it's probably beyond the scope of this manuscript because it likely varies from problem to problem, and I doubt it is specific to rare disease modeling.

Also, I'm a little unsure of what the "runs" indicate. This looks like an ensemble of ensembles (which is still an ensemble, but might be unnecessary for conveying the concept?)
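As a toy illustration of why combining members helps (numbers and names hypothetical, not tied to the figure's specific models): when the members err in different directions, the unweighted average cancels part of each member's error.

```python
def ensemble_predict(models, x):
    """Unweighted ensemble: average the member models' predictions."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Toy members whose errors point in different directions (true value is 10):
members = [lambda x: x + 2, lambda x: x - 3, lambda x: x + 1]
combined = ensemble_predict(members, 10)  # -> 10.0, closer than any member alone
```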

@jaybee84
Owner

jaybee84 commented Dec 9, 2020

Re: bootstrap

suggest changing "aggregate" to "harmonize" :)

@jaybee84
Owner

jaybee84 commented Dec 10, 2020

Re: ensemble
image

@jaybee84
Owner

@dvenprasad I added a rough sketch re: ensemble learning in the above comment. Please let me know if you cannot view it or have questions.

@dvenprasad
Collaborator

Regularization

regularization

@jaybee84
Owner

I like the above figure... just a few notes:

  1. We might not want to use circles as a dataset indicator, since we are already using circles as features
  2. To be consistent with the dimension reduction figure (and the heatmap panel), we might want the datasets to be rectangles in portrait mode, i.e., many features as rows and few samples as columns
  3. It may just be me, but in this figure it seems like the top two rectangles are making the unhealthy model and the bottom two rectangles are making the healthy model. It would be ideal to show that all 4 datasets together lead to the top model when not regularized, and to the bottom model when regularized.

@dvenprasad
Collaborator

dvenprasad commented Dec 18, 2020

Changed yellow to purple because the contrast with white was really bad.

Bootstrapping
Updated it based on feedback: added in sad models for each resampling of the dataset.

bootstrapping

Ensemble Learning
Re-did it based on @jaybee84's sketch.
ensemble-learning

Regularization
Made changes based on #106 (comment)

regularization

One class at a time

  • Greyed-out shapes are items that the model does not recognize.
  • The sad model misclassifies some samples and fails to recognize others.

one-class-at-a-time
