
Small subset #28

Merged · 9 commits into greenelab:master on Dec 2, 2020
Conversation

ben-heil (Contributor):

This PR finds the performance crossover point between a three-layer neural network and logistic regression by looking at small training datasets.

Interestingly, we find that performance on tb increases to a point and then flatlines, while sepsis seems to take more data to saturate. It also looks like the training of deep_net is very broken, which is unsurprising given past results.

Changes:

  • Normal snakefile changes to add small_subsets.py
  • Update benchmark results notebook with figures showing the crossover point
  • Update the model code to patch the case where the training set is too small to produce a good tune set. There is probably a more elegant way to handle this, but I'll clean it up when I implement early stopping
  • Implement small_subsets.py

@jjc2718 (Member) left a comment:


Looks good to me!

@@ -537,6 +538,13 @@ def fit(self, dataset: LabeledDataset) -> "PytorchSupervised":
        train_dataset, tune_dataset = dataset.train_test_split(train_fraction=train_fraction,
                                                               train_study_count=train_count,
                                                               seed=seed)
        # For very small training sets the tune dataset will be empty
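A minimal sketch of one way to handle this case (assuming LabeledDataset supports len(), and that falling back to the training data is acceptable; the actual patch in this PR may differ):

    # If the study-level split left nothing to tune on, reuse the training data
    # as the tune set so the run can proceed. The tune metric will be optimistic,
    # but tiny-subset runs won't crash.
    if len(tune_dataset) == 0:
        tune_dataset = train_dataset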
jjc2718 (Member):

Do you have to hold out entire studies for your tune dataset? I get why that's important for your test data (you want to evaluate whether your model generalizes to unseen studies), but for your tune data maybe it's okay to split up studies, at least when your training set is small.

Just an idea, not necessary to change anything now.

ben-heil (Contributor, Author):

That's a good idea, but given the differences between studies I imagine I would end up overfitting if I did early stopping using the tune loss. I could probably do the splitting differently, though. Maybe I should just designate a single study at the beginning of the run to be the tuning set, shrug.

I'll open an issue and add your comment to it so we can think about this in the future.
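A hypothetical sketch of the within-study alternative: split individual samples for the tune set instead of holding out whole studies. This uses sklearn's stratified split with toy data standing in for a saged LabeledDataset; the real code would need to go through the dataset API.

    from sklearn.model_selection import train_test_split

    # Toy stand-ins for expression samples and binary labels
    samples = [[0.1 * i] for i in range(100)]
    labels = [i % 2 for i in range(100)]

    # Sample-level split that ignores study boundaries; stratifying keeps the
    # label ratio the same in the train and tune portions
    train_samples, tune_samples, train_labels, tune_labels = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=42)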

from saged import utils, datasets, models


def subset_to_equal_ratio(train_data: datasets.LabeledDataset,
jjc2718 (Member):

This is similar to the function in saged/keep_ratios.py, right? Could you combine them into a single function?

ben-heil (Contributor, Author):

Right, it's an updated version of the one in saged/keep_ratios.py. When I refactor, I'm moving this one to utils.py and changing both benchmarks to just call the utils version.
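A hypothetical sketch of what the shared helper in utils.py could look like after that refactor; the balancing logic here is a guess, not the actual saged implementation:

    import random

    def subset_to_equal_ratio(samples, labels, seed=42):
        """Downsample so every label is equally represented (illustrative only)."""
        rng = random.Random(seed)
        by_label = {}
        for sample, label in zip(samples, labels):
            by_label.setdefault(label, []).append(sample)
        smallest = min(len(group) for group in by_label.values())
        subset = [(sample, label)
                  for label, group in by_label.items()
                  for sample in rng.sample(group, smallest)]
        rng.shuffle(subset)
        return subset

Both benchmark scripts would then import it (e.g. from saged import utils) instead of keeping their own copies.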

print(plot)

# %%
plot = ggplot(tb_metrics, aes(x='train_count', y='balanced_accuracy', color='supervised'))
jjc2718 (Member):

You could try plotting this with error bars (e.g. standard error) to emphasize that the crossover isn't really significant, if you want. (At least I assume it won't be, looking at the data points)
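A sketch of the error-bar version, assuming tb_metrics is a pandas DataFrame with the columns used above; it precomputes mean ± standard error per training set size and model rather than relying on a stat layer:

    from plotnine import ggplot, aes, geom_line, geom_errorbar

    # Mean and standard error of balanced accuracy for each (train_count, model) pair
    summary = (tb_metrics
               .groupby(['train_count', 'supervised'])['balanced_accuracy']
               .agg(mean_acc='mean', sem_acc='sem')
               .reset_index())

    plot = (ggplot(summary, aes(x='train_count', y='mean_acc', color='supervised'))
            + geom_line()
            + geom_errorbar(aes(ymin='mean_acc - sem_acc', ymax='mean_acc + sem_acc'),
                            width=0.1))
    print(plot)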

ben-heil (Contributor, Author):

I think geom_smooth is supposed to plot confidence intervals by default, but it didn't here for some reason. Maybe it's because the x axis isn't really continuous?
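One way to test that guess (an assumption: if train_count is being read as a discrete column, casting it to a numeric dtype and using a smoothing method that reports standard errors should bring the ribbon back):

    from plotnine import ggplot, aes, geom_point, geom_smooth

    # geom_smooth can only fit, and shade a confidence band, over a numeric axis
    tb_metrics = tb_metrics.assign(train_count=tb_metrics['train_count'].astype(float))
    plot = (ggplot(tb_metrics, aes(x='train_count', y='balanced_accuracy', color='supervised'))
            + geom_point()
            + geom_smooth(method='lm', se=True))  # se=True draws the confidence band
    print(plot)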

@ben-heil mentioned this pull request on Dec 2, 2020
@ben-heil merged commit 1e11d40 into greenelab:master on Dec 2, 2020