Small subset #28
Conversation
Looks good to me!
@@ -537,6 +538,13 @@ def fit(self, dataset: LabeledDataset) -> "PytorchSupervised":
        train_dataset, tune_dataset = dataset.train_test_split(train_fraction=train_fraction,
                                                               train_study_count=train_count,
                                                               seed=seed)
+       # For very small training sets the tune dataset will be empty
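The lines that handle the empty case aren't shown in this excerpt; one plausible shape for them (the `len()` check and the fall-back-to-train behavior are guesses, not the PR's actual code):

```python
train_dataset, tune_dataset = dataset.train_test_split(train_fraction=train_fraction,
                                                       train_study_count=train_count,
                                                       seed=seed)
# For very small training sets the tune dataset will be empty
if len(tune_dataset) == 0:
    # Hypothetical fallback: reuse the training data for tuning so
    # early stopping still has a loss to monitor (it will overfit,
    # but it keeps fit() from failing on tiny subsets)
    tune_dataset = train_dataset
```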
Do you have to hold out entire studies for your tune dataset? I get why that's important for your test data (you want to evaluate if your model generalizes to unseen studies), but for your tune data maybe it's okay to split up studies, at least in the case when your training set is small.
Just an idea, not necessary to change anything now.
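Something like this, sketched with scikit-learn (`expression` and `labels` are hypothetical stand-ins for the real dataset objects):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the dataset's expression matrix and labels
expression = np.random.rand(100, 50)
labels = np.random.randint(0, 2, size=100)

# Split individual samples rather than holding out whole studies,
# keeping the label ratio comparable across the two splits
train_X, tune_X, train_y, tune_y = train_test_split(
    expression, labels, test_size=0.2, stratify=labels, random_state=42)
```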
That's a good idea, but given the differences between studies I imagine I would end up overfitting if I did early stopping on the tune loss. I could probably do the splitting differently, though. Maybe I should just designate a single study at the beginning of the run to be the tuning set *shrug*.
I'll open an issue and add your comment to it so we can think about it in the future.
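Something like this, roughly (`studies` is just a stand-in for however the samples are grouped, not the project's API):

```python
import random

# Hypothetical mapping from study accession to its sample indices
studies = {'SRP001': [0, 1, 2], 'SRP002': [3, 4], 'SRP003': [5, 6, 7]}

rng = random.Random(42)                    # fixed seed: same tune study every run
tune_study = rng.choice(sorted(studies))   # designate one study up front
train_studies = [s for s in studies if s != tune_study]
```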
from saged import utils, datasets, models


def subset_to_equal_ratio(train_data: datasets.LabeledDataset,
This is similar to the function in `saged/keep_ratios.py`, right? Could you combine them into a single function?
Right, it's an updated version of the one in `saged/keep_ratios.py`. When I refactor I'll move this one to `utils.py` and change both benchmarks to just call the utils version.
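Judging from the name alone, a sketch of what such a function might do (operating on plain label arrays here; the real version takes `datasets.LabeledDataset` objects and may match a different split's ratio instead):

```python
import numpy as np

def subset_to_equal_ratio(labels: np.ndarray, seed: int = 42) -> np.ndarray:
    """Return indices that downsample the majority class until both
    classes appear in equal numbers (a guess based only on the name)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    keep = counts.min()
    chosen = [rng.choice(np.where(labels == c)[0], size=keep, replace=False)
              for c in classes]
    return np.sort(np.concatenate(chosen))

# Example: 8 healthy vs. 2 disease samples -> 2 of each survive
print(subset_to_equal_ratio(np.array([0] * 8 + [1] * 2)))
```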
print(plot)

# %%
plot = ggplot(tb_metrics, aes(x='train_count', y='balanced_accuracy', color='supervised'))
You could try plotting this with error bars (e.g. standard error) to emphasize that the crossover isn't really significant, if you want. (At least I assume it won't be, looking at the data points)
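For example, aggregating to mean ± standard error first and drawing those (a sketch, reusing `tb_metrics` and its column names from the snippet above):

```python
from plotnine import ggplot, aes, geom_point, geom_errorbar

# Aggregate to mean +/- standard error per (train_count, model) cell
summary = (tb_metrics
           .groupby(['train_count', 'supervised'])['balanced_accuracy']
           .agg(['mean', 'sem'])
           .reset_index())

plot = (ggplot(summary, aes(x='train_count', y='mean', color='supervised'))
        + geom_point()
        + geom_errorbar(aes(ymin='mean - sem', ymax='mean + sem')))
print(plot)
```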
I think `geom_smooth` is supposed to plot confidence intervals by default, but it didn't here for some reason. Maybe it's because the x axis isn't really continuous?
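One way to check that guess (a sketch, assuming `tb_metrics` from above): cast the x column to float so plotnine treats it as continuous, and request the ribbon explicitly:

```python
from plotnine import ggplot, aes, geom_smooth

# Force a continuous x axis; geom_smooth can't fit over a categorical one
tb_metrics['train_count'] = tb_metrics['train_count'].astype(float)

plot = (ggplot(tb_metrics, aes(x='train_count', y='balanced_accuracy',
                               color='supervised'))
        + geom_smooth(method='lm', se=True))  # se=True draws the confidence ribbon
print(plot)
```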
This PR finds the performance crossover point between a three-layer neural network and logistic regression by looking at small training datasets.

Interestingly, we find that tb performance increases to a point and then flatlines, while sepsis seems to take more data to saturate. Also, it looks like the training of `deep_net` is very broken, which is unsurprising given past results.

Changes:
- `small_subsets.py`