
Summary of classification model performance on Task V576 DNAse #6

annashcherbina opened this issue May 16, 2018 · 22 comments

@annashcherbina

Google spreadsheet with all information:

https://docs.google.com/spreadsheets/d/1gfbolLoB1o_oRHjGdV6ht5eTIjFHtTjIivlxsLugvp4/edit?usp=sharing

Performance matrix of models:

Yellow highlighting indicates performance on training dataset. Absence of yellow highlighting indicates performance on test dataset.

Blue text indicates performance on "version 1" of the data labels (i.e. negatives from peaks present in other CRC samples, but absent in current CRC sample).

Green text indicates performance on "version 2" of the data labels (i.e. negatives from ENCODE DNAse summits minus colon-specific data)
image

Loss curves for most promising models in the data matrix:

image

Interestingly, it appears the baseline models are actually outperforming the models with GC-balanced negative sets, dinucleotide-balanced negative sets, and reverse-complement augmentation.

Next steps: Try regression models
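(For anyone reproducing this: GC-balancing of negatives can be done by binned sampling, roughly as below. This is a minimal sketch with hypothetical inputs, not the dataset-generation code used for these experiments.)

```python
import random
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_balanced_negatives(positives, candidates, bins=20, seed=0):
    """Sample one negative per positive from the same GC-content bin.

    `positives` and `candidates` are lists of DNA strings; candidates are a
    pool of putative negative regions. Returns negatives whose GC
    distribution roughly matches the positives'.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for seq in candidates:
        pool[int(gc_fraction(seq) * bins)].append(seq)
    negatives = []
    for seq in positives:
        b = int(gc_fraction(seq) * bins)
        if pool[b]:
            negatives.append(pool[b].pop(rng.randrange(len(pool[b]))))
    return negatives
```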

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

Yes, it seems to basically be overfitting to the training data.
Hence I am hoping that some of the other architectures (i.e. those from Surag & Jacob) will be less prone to the overfitting problem.

I will move away from Basset for this dataset.

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

I've triple-checked by running the same code on the DMSO & het data and reproducing the top-scoring performance on those models with this workflow & the basic Basset architecture.
Hence, I'm inclined to think it's not a bug.

I can use Surag's code to train the model; that is the next thing on my to-do list.

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

annashcherbina commented May 17, 2018

The weird thing is that the baseline performance is actually quite good. And the dataset for the baseline model was generated with the same approach for determining negatives as the het & DMSO data -- those performance values actually match quite closely.

The new negative set is what's primarily driving the huge drop in auPRC & recallAtFDR50.
Is it possible that the new negative set is actually harder to learn than the old one?

@akundaje

akundaje commented May 17, 2018 via email

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

Here's what I'm getting for the het model losses with the same code & basic Basset.
I used ENCODE initializations for the model, and it was a multi-tasked model:
image

And for DMSO with the same code & basic Basset.
I also used ENCODE initializations and a multi-tasked model:

image

The average performance values across tasks were:

Hets

image
image

DMSO

image
image
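(For context on the ENCODE initializations: warm-starting amounts to copying pretrained weights into layers whose names and shapes match, analogous to Keras's `load_weights(..., by_name=True)`. The sketch below is a generic illustration of that idea, not the training code used here.)

```python
import numpy as np

def warm_start(target_weights, pretrained_weights):
    """Copy pretrained arrays into target layers where name and shape match.

    Both arguments are dicts mapping layer name -> numpy weight array.
    Layers missing from the pretrained dict, or with mismatched shapes,
    keep their random initialization. Returns the names that were copied.
    """
    initialized = []
    for name, w in target_weights.items():
        src = pretrained_weights.get(name)
        if src is not None and src.shape == w.shape:
            target_weights[name] = src.copy()
            initialized.append(name)
    return initialized
```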

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

Yes, but it's the best performance we've gotten on these 2 datasets...
(also for het, training stopped at epoch 6, i.e. early stopping meant epoch 6 was the last one used).

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

By best performance, I mean the best performance from all hyperparameter searches & applications of Basset to the data. This is why I am a fan of moving on to other architectures.

@akundaje

akundaje commented May 17, 2018 via email

@annashcherbina

annashcherbina commented May 18, 2018

The models trained on the "easy" datasets achieve near-perfect performance (auPRC values in the range 0.93 - 0.97, recall at FDR50 in the range 0.97 - 1.00):
image

The corresponding loss curves are:
image

The loss curves have similar behavior to those we observed previously.

What this is telling me is that a single epoch of training is sufficient to learn the data --- especially with ENCODE initializations. The ENCODE initializations prove to be very beneficial on these toy datasets, as they increase the auPRC for the 1 neg: 1 pos GC-balanced dataset from 0.84 to 0.97.
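(For reference, the two metrics quoted throughout this thread can be computed roughly as below. This is a simplified sketch that ignores score ties, not the project's evaluation code.)

```python
def average_precision(y_true, scores):
    """auPRC via the average-precision estimator: mean precision at each
    true-positive hit, sweeping thresholds from the highest score down."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, 1):
        if y_true[i]:
            tp += 1
            ap += tp / rank
    return ap / total_pos

def recall_at_fdr(y_true, scores, fdr=0.5):
    """Max recall achievable at false-discovery rate <= fdr.

    At each cut, FDR = FP / (TP + FP) and recall = TP / P; return the best
    recall among cuts meeting the FDR bound (recallAtFDR50 uses fdr=0.5).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp = fp = 0
    best = 0.0
    for i in order:
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= fdr:
            best = max(best, tp / total_pos)
    return best
```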

@annashcherbina

These are the loss curves when I plot the loss with fewer examples per epoch of training (i.e. one epoch = 5 batches of size 1000).

Are these closer to what we'd expect to see?
image

If so, then the reason the previous curves look different is that I use large epochs (i.e. 1 epoch = 700 batches of size 1000). With the large epochs, one pass is usually enough to learn the data, and anything further just leads to overfitting.
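(The effect of the epoch definition on the plotted curve can be illustrated by re-binning the same per-batch loss trace at different granularities; a minimal sketch with made-up losses:)

```python
def rebin_losses(per_batch_losses, batches_per_epoch):
    """Average per-batch losses into per-'epoch' points.

    With large epochs (e.g. 700 batches) the rapid early decay collapses
    into the first point, so the curve looks flat afterwards; with small
    epochs (e.g. 5 batches) the same trace shows the familiar exponential
    decay. The optimizer sees identical updates either way.
    """
    chunks = [per_batch_losses[i:i + batches_per_epoch]
              for i in range(0, len(per_batch_losses), batches_per_epoch)]
    return [sum(c) / len(c) for c in chunks]
```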

And just for completeness, the test set accuracy of the model:
image

@annashcherbina

This would explain why the performance (auPRC/recallAtFDR50) was the best we've attained to date, even though the loss curves show no improvement beyond the first epoch.

@akundaje

akundaje commented May 18, 2018 via email

@annashcherbina

I thought that the model achieving auPRC ~ 1 on all four toy datasets is an indication that everything is working fine. So I guess I'm confused about what problem we are trying to solve. What do we expect the loss curves to look like? I'm not sure I understand why the ones we are observing are different from what usually shows up in the literature.

Early stopping is triggered by no drop in validation loss for 5 consecutive epochs.

The performance I am reporting is based on early stopping (i.e. generally after the first full epoch, i.e. a pass through 700,000 training examples).

The learning rate used for the "toy" models was 0.001. I have tried learning rates ranging from 0.00001 to 0.01, with no major change in auPRC for the GECCO datasets. I can post performance values for those, but they are within a few percent of the ones above.
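(For reference, the patience-5 early-stopping rule described above can be sketched as a small tracker like this; a generic illustration, not the actual training code:)

```python
class EarlyStopper:
    """Stop when validation loss hasn't improved for `patience` epochs,
    mirroring the patience-5 rule used in these experiments."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True when it's time
        to stop (no new best loss for `patience` consecutive epochs)."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```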

@annashcherbina

For example, all params kept constant, but learning rate is varied:

image

Graph on the left is for LR = 0.001
Graph on the right is for LR = 0.00001

(This is on the full dataset, not the toy datasets).
The curves don't improve after the first epoch because 700,000 examples are enough to train the model. If we use smaller epochs, we get more traditional-looking learning curves with the exponential-decay pattern.

@annashcherbina

annashcherbina commented May 21, 2018

Updated learning curves for the negative datasets. The blue box on the learning curve graphs indicates the stopping epoch for which "Trained" accuracy is reported:

GC-balanced, 1 neg: 1 pos

image

GC-balanced, 5 neg: 1 pos

I tried lr = 0.001 and lr = 0.0001 for this negative set; although the curve for the lower lr looks more like what we'd like to see, the auPRC is higher for the higher lr.
image

Dinucleotide-balanced, 1 neg: 1 pos

image

Dinucleotide-balanced, 5 neg: 1 pos

image

Negatives from (ENCODE - CRC cell types), 10 neg: 1 pos

image

Negatives for V576 DNAse from Scacheri sample matrix, 10 neg: 1 pos

image

@akundaje

akundaje commented May 21, 2018 via email
