
Data/code sharing and data limitations #367

Merged
merged 11 commits from the sharing branch into greenelab:master on May 5, 2017

Conversation

@agitter (Collaborator) commented Apr 30, 2017

I'm trying to finish a complete draft of the Discussion section. These sub-sections could definitely use feedback. I confirmed we'll be getting a draft of the evaluation sub-section soon so I didn't write that. I also removed other Discussion prompts.

@brettbj I'd like to briefly touch on the diet networks paper (#140) here. Can you please help me describe this in a sentence or two? What I have is not very clear.

@cgreene I'm closing #172 and we can include any relevant papers in this sub-section.

If there are build errors, I'll clean them up later today.

@cgreene (Member) left a comment

Looks good to me! I had some thoughts and identified potential options to highlight, but this looks great. Once you're happy with it, I'm happy with it.

training instances can be just as problematic. In genomics, for example,
labeled data may be derived from an experimental assay that has known and
unknown technical artifacts, biases, and error profiles. It is possible to
weight training examples or construct Bayesian models to account for uncertainty
Member

We didn't account for uncertainty, but we have corrected for non-independence in our Bayesian integrations, which I think is related to the point you are making. For example, in 10.1038/ng.3259 we had to deal with correlated noise, primarily from the experimental platform (e.g., array type). We had a very heuristic approach for this: we down-weighted datasets where the negative training examples (i.e., gene pairs with no relationships to each other) had high mutual information with other already-included datasets.

I imagine that lots of people do this to avoid finding trivial results. However, it may not often be clear why something like this was done. I think this is sort of the practical how-to knowledge that often gets lost when a paper is written and revised because it's not as splashy as the "methods" bits.
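To make that heuristic concrete, here is a minimal sketch. It assumes each dataset has been summarized as a vector of discretized scores over a shared set of negative gene pairs; the function, the `max_mi` normalizer, and the linear weight mapping are all illustrative, not the actual 10.1038/ng.3259 implementation:

```python
# Sketch of the down-weighting heuristic described above (assumption:
# each dataset is a vector of discretized scores over the same set of
# negative gene pairs; the linear mapping to [0, 1] is illustrative).
import numpy as np
from sklearn.metrics import mutual_info_score

def dataset_weight(candidate, selected, max_mi=1.0):
    """Down-weight a candidate dataset by its redundancy with already
    included datasets, measured on negative training examples only."""
    if not selected:
        return 1.0
    # Redundancy = highest mutual information with any included dataset.
    redundancy = max(mutual_info_score(candidate, s) for s in selected)
    # High redundancy -> low weight.
    return float(np.clip(1.0 - redundancy / max_mi, 0.0, 1.0))
```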

Member

As I re-read the paragraph - this falls into the weight training examples part.

Member

If you want me to write this specific example, feel free to make a quick issue and I can edit after this PR. It will be a light revision on the example above.

Collaborator Author

I made #377 for you to add these lines. You'll do a better job of getting the example correct.

This is along the lines of what I meant by "weight training examples", but I didn't explain that phrase well.
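For concreteness, "weight training examples" can be as simple as passing per-example weights into the loss. A minimal sketch with scikit-learn, where the per-label reliability scores are hypothetical stand-ins for assay-derived confidence estimates:

```python
# Minimal sketch: down-weight examples whose labels come from a noisier
# assay. The reliability scores are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.random.rand(100, 20)                          # features
y = np.random.randint(0, 2, size=100)                # labels from a noisy assay
reliability = np.random.uniform(0.5, 1.0, size=100)  # per-label confidence

model = LogisticRegression()
model.fit(X, y, sample_weight=reliability)           # noisier labels count less
```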

For some types of data, especially images, it is straightforward to augment
training datasets by splitting one labeled example into multiple examples. An
image can easily be rotated or inverted and retain its label `TODO: does someone
have examples to cite here? Issue #163?`. 3D MRI and 4D fMRI (with time as a
Member

Yes - #163 is perfect for this.

Collaborator Author

Added #163
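The augmentation being discussed is simple to implement. A minimal NumPy sketch of the eight label-preserving rotation/reflection variants of a 2-D image:

```python
# Minimal sketch: rotations and flips preserve an image's label, so one
# labeled example yields up to 8 distinct training examples.
import numpy as np

def augment(image):
    """Yield the 8 rotation/reflection variants of a 2-D image array."""
    for k in range(4):
        rotated = np.rot90(image, k)
        yield rotated
        yield np.fliplr(rotated)

examples = list(augment(np.arange(9).reshape(3, 3)))
assert len(examples) == 8
```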

full dataset [@tag:Arvaniti2016_rare_subsets].

Simulated or semi-synthetic training data has also been employed in multiple
biomedical domains. `TODO: simulated data: #5 #99 #293, maybe #117 and #197.
Member

10.1093/bioinformatics/btw243 trains on synthetic data. We don't have an issue for it. I think it's worth mentioning for this point as well.

Collaborator Author

I opened #378 for that paper. I'll see if I need it.

[@tag:Romero2017_diet], that drastically reduce the number of free parameters
by predicting parameters (weights) as a function of each SNP's embedding
in some feature space instead of directly optimizing a weight for each SNP.
`TODO: not very familiar with this paper, can someone describe it better?`
Member

cc @brettbj @traversc : can you help with this?
#140

Contributor

How's this? (I tried to keep it brief without getting heavily into embeddings.)

Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data limitations to some degree. There are also emerging network architectures, such as Diet Networks for high-dimensional SNP data [@tag:Romero2017_diet]. Diet Networks use multiple networks to drastically reduce the number of free parameters by first flipping the problem and tasking a network to predict parameters (weights) for each input (SNP) to learn a feature embedding. This embedding (i.e. PCA, per class histograms, x2vec style) can be learned directly from the input data, but has the advantage that it can also be learned from other datasets or from domain knowledge. Additionally, in this task, the features are the examples, an important fact when it is typical to have 500k+ SNPs and only a few thousand patients. Finally, this embedding is of a much lower dimension, allowing for a large reduction in the number of free parameters. In the example given, the authors reduced the number of free parameters from 30M to 50K, a factor of 600.
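A rough sketch of the parameter-prediction trick, with illustrative dimensions (much smaller than 500k SNPs) and a single linear layer standing in for the auxiliary network; this is only an interpretation of [@tag:Romero2017_diet], not the authors' code:

```python
# Sketch: an auxiliary network predicts the fat input layer's weights
# from per-SNP embeddings, so the free parameters scale with the
# embedding size rather than the number of SNPs. Dimensions are
# illustrative; real data would have 500k+ SNPs.
import numpy as np

n_patients, n_snps = 500, 10_000
n_embed, n_hidden = 100, 100

X = np.random.randint(0, 3, size=(n_patients, n_snps)).astype(np.float32)
E = np.random.rand(n_snps, n_embed).astype(np.float32)  # per-SNP embedding
                                                        # (e.g., per-class histograms)

# Auxiliary layer: n_embed * n_hidden = 10K free parameters, instead of
# directly learning all n_snps * n_hidden = 1M weights of the fat layer.
W_aux = (np.random.randn(n_embed, n_hidden) * 0.01).astype(np.float32)

W_fat = E @ W_aux            # predicted (n_snps, n_hidden) weight matrix
hidden = np.tanh(X @ W_fat)  # forward pass through the fat layer
print(hidden.shape)          # (500, 100)
```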

Collaborator Author

@brettbj thanks, I added this. I want to rephrase the x2vec style or remove the example embedding types and will use everything else as-is.

also recent techniques to mitigate these concerns. Furthermore, in some
domains, some of the best training data has been generated privately, for
example, high-throughput chemical screening data at pharmaceutical companies.
There is little expectation or incentive for this private data to be shared.
Member

What about: "One perspective is that there is little expectation or incentive for this private data to be shared. However, data are not inherently valuable. Instead, the insights that we glean from them are where the value lies. Private companies may establish a competitive advantage by releasing sufficient data for improved methods to be developed."

Basically a good benchmark dataset gets computer scientists to do lots of hard work for you for free.

Collaborator Author

I used this. There is an example where Merck provided data for a Kaggle competition, but I don't want to get into too much detail.


Code sharing and open source licensing are essential for continued progress in
this domain. Even though some journals have improved requirements about code
availability at the time of publication, there remain many opportunities to do
Member

Not sure if you want to cite this, but it was an encouraging step:
http://blogs.nature.com/methagora/2014/02/guidelines-for-algorithms-and-software-in-nature-methods.html

@agitter (Collaborator Author) commented May 5, 2017

I don't think I want to focus on journal requirements, so I'll rephrase to state that we encourage best practices for software regardless of what is or isn't required by publishers.

Now citing http://science.sciencemag.org/content/354/6317/1240.full and shortening my discussion text.

data cleaning (see above) and hyperparameter optimization. These improve
reproducibility and serve as documentation of the detailed decisions that impact
model performance but may not be exhaustively captured in a manuscript's methods
text. `TODO: there are many other things that could be added here (e.g.
Member

I agree - this could be an entire perspective article. I think that what's here is sufficient. It is very surprising to me how little deep learning code gets shared. I feel like comp bio is ahead of the game on this - or maybe just the genomics field. Maybe someone sees this and decides to write a perspective on this in a similar manner? :)
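One small way to capture those decisions in shared code rather than prose: commit the hyperparameter search itself and persist every configuration tried. A hedged sketch with scikit-learn (the grid values are arbitrary):

```python
# Minimal sketch: committing the hyperparameter search documents
# decisions that rarely fit in a methods section. Values are illustrative.
import json
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
grid = {"n_estimators": [100, 500], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

# Persist every configuration tried, not just the winner.
with open("hyperparameter_search.json", "w") as f:
    json.dump({"best": search.best_params_,
               "all_scores": search.cv_results_["mean_test_score"].tolist()}, f)
```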

that used pre-trained image classifier?` "Model zoos", collections of
pre-trained models, are not yet common in biomedical domains but have started to
appear in genomics applications [@tag:Angermueller2016_single_methyl
@tag:Dragonn]. `TODO: other examples, possibly from Categorize or Treat?`
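As background for the model-zoo point, the typical reuse pattern looks like the sketch below; Keras's ImageNet VGG16 stands in for a domain-specific pre-trained model, and the binary classification head is arbitrary:

```python
# Sketch of reusing a pre-trained model from a "model zoo": freeze the
# published weights and train only a small task-specific head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the shared pre-trained weights fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new head for a binary task
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```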
Member

From categorize we need to have a sentence mentioning the possibility of attacking the model to identify training examples. Something like: For models trained on data where privacy is required, we discuss the potential for information leakage in section ### and methods to mitigate this concern.

Member

Section is:
"Data sharing is hampered by standardization and privacy considerations"

It's brutally long. Not sure what the most efficient way is to get people there.

@agitter (Collaborator Author) commented May 5, 2017

@cgreene two paragraphs above I have "As discussed in the Categorize section, there are complex privacy and legal issues involved in sharing patient data and deep learning models trained with it but also recent techniques to mitigate these concerns." Do you suggest I move that to this paragraph?

Member

What about changing above to:
"There are complex privacy and legal issues involved in sharing patient data that cannot be ignored."

Then here: "Deep learning models can also be attacked to identify examples used in training. We discuss this issue as well as recent techniques to mitigate these concerns... [this refers to categorize]"

I just want to make sure that nobody drops in here, reads that they should share their model, and then shares a model that leads to inadvertent release of something they thought was private.
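To illustrate why a shared model can leak private training data: the simplest membership-inference baseline just thresholds the model's per-example loss, since memorized training examples tend to have unusually low loss. A hedged sketch, assuming a scikit-learn-style classifier; the threshold would be calibrated on held-out data:

```python
# Minimal sketch of a loss-thresholding membership-inference baseline:
# unusually low loss on (x, y) suggests the example was in the training set.
import numpy as np

def looks_like_training_example(model, x, y, threshold):
    """Guess training-set membership from the model's confidence on the true label."""
    prob = model.predict_proba(x.reshape(1, -1))[0, y]
    loss = -np.log(max(prob, 1e-12))
    return loss < threshold
```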

@agitter (Collaborator Author) commented May 5, 2017

I'll make this edit (reworded slightly), then merge.


DeepChem [@tag:AltaeTran2016_one_shot @tag:DeepChem @tag:Wu2017_molecule_net]
and DragoNN [@tag:Dragonn] exemplify the benefits of sharing code under an open
source license and pre-trained models. `TODO: it would be great to add more
@cgreene (Member) commented Apr 30, 2017

Source code:

  • #213 (tagging @spoilt333) - don't think the raw data is here (looks like a /data/ path but not 100% sure); did not see a license
  • #266 (tagging @zhangyan32) - looks like source but not models + data; did not see a license
  • #217 (tagging @uci-cbcl) - source code, data can be downloaded; has license (GPL)
  • #171 (tagging @CampagneLaboratory) - this one has some docs, but not data/models if I'm seeing things correctly; has license (Apache)
  • #155 (tagging @traversc) - code yes, with docs; don't think data. Just a quick note: as I look through the source code, I think some .pyc files made it in; not sure if that's intended. No license
  • #150 (tagging @raghavchalapathy) - contains pre-trained models; did not see a license
  • #32 - also has source code from @uci-cbcl + example file + data download + very nice readme; custom license
  • #24 - source code + data [not version controlled]

That last one highlights to me that we don't mention version control here but we probably should recommend that explicitly.


My repo is source code, and we are still working on it now.

Collaborator Author

#171 is a good example with tutorials, license, etc. I think #24 does have a GitHub repo https://github.com/uci-cbcl/D-GEX

@agitter (Collaborator Author) commented Apr 30, 2017

@cgreene thanks for all the helpful suggestions. I should be able to work on these Monday.

@brettbj (Contributor) commented May 1, 2017

@agitter Will do tonight

@brettbj (Contributor) commented May 2, 2017

@agitter @cgreene - needed a full reread last night, will add the couple of sentences after a few meetings this morning

@agitter (Collaborator Author) commented May 2, 2017

@brettbj no rush, I won't get to this until today or tomorrow most likely because I shifted my efforts to drafting #370.

@agapow (Contributor) left a comment

Looks good; any comments are just suggestions.

Code sharing and open source licensing are essential for continued progress in
this domain. Even though some journals have improved requirements about code
availability at the time of publication, there remain many opportunities to do
better. Preprints are firmly established in many areas of machine learning and
Contributor

Hmmm. I feel that preprints - great as they are - is perhaps getting a little out of scope and are qualitatively different to code and data sharing. Just trying to pick out the deep learning specific material here.

Collaborator Author

Perhaps it is out of scope. I brought it up because it has been a point of personal frustration. More than once, there has been an exciting preprint on some biomedical deep learning topic that my group wanted to test out or build upon. If the code isn't available, we're stuck. Sometimes we don't even want to run the code, we just want to understand minute details of the training that aren't written down in the paper. Without code, we're left guessing.

Part of what I wanted in this section is a call to go beyond the accepted practices in the community. That includes sharing code when not required and highlighting example repositories that make the extra effort to provide pre-trained models, data pre-processing scripts, tutorials, etc.

Contributor

I get it. My feeling is that it's a broader issue and might diffuse any DL-specific message. I'll leave this to you. Maybe see how it fits when the paper is more complete and polished.

Contributor

While it might be slightly out of scope, I think we should keep it. It's an important point and it flows well with the discussion. A couple of additional thoughts:

  • maybe say explicitly that this recommendation is a general one, not specific to deep learning;
  • we could go as far as recommending code sharing platforms such as GitHub (maybe even Zenodo?)

Collaborator Author

@enricoferrero I support this. I think to avoid diffusing the DL-specific message further, we should briefly mention these things but then refer to an existing paper. There is probably a suitable PLoS Comp Bio 10 simple rules or similar article I can cite.

@agapow if it seems out of place when we polish the full Discussion section, I'm willing to drop it. I prefer to keep it in the initial draft.

Collaborator Author

@agapow I revised this paragraph to try to keep it more focused on the core element of code sharing. I also cite http://science.sciencemag.org/content/354/6317/1240.full to expand on best practices so we don't get too off topic here.

@cgreene and @enricoferrero, please see if you still like the overall message (once I push).

deep learning can make the preprocessing code (e.g., Basset
[@tag:Kelley2016_basset]) and cleaned data (e.g., MoleculeNet
[@tag:Wu2017_molecule_net]) publicly available to catalyze further research. As
discussed in the Categorize section, there are complex privacy and legal issues
Contributor

Do we actually refer to that section as "the Categorize section" in the paper?

Collaborator Author

Probably not 😄

I'm used to that jargon from our issues here and should fix it.

@agitter (Collaborator Author) commented May 5, 2017

I believe I addressed all of the helpful comments above. Let me know if I missed something.

@cgreene we can do a final review and merge now if everything looks okay. I have one TODO remaining regarding the simulated data paragraph, but I can make a separate small pull request for that.

@cgreene (Member) commented May 5, 2017

This LGTM 👍. The outstanding issues appear to be noted, and we can go through with more fine grained revisions after a merge if necessary.

@agitter agitter merged commit c7d3db6 into greenelab:master May 5, 2017
@agitter agitter deleted the sharing branch May 5, 2017 19:49
dhimmel pushed a commit that referenced this pull request May 5, 2017

This build is based on c7d3db6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/229261127
https://travis-ci.org/greenelab/deep-review/jobs/229261128

[ci skip]

The full commit message that triggered this build is copied below:

Data/code sharing and data limitations (#367)

* Remove wide data sub-section

* Remove prompts

* Brief discussion intro

* Expand data sharing

* Full draft of sharing sub-section

* Partial data limitations sub-section

* Fix URL tags and minor data splitting edit

* Add diet networks text from @brettbj and respond to other comments

* Update code sharing discussion

* Example code repositories and pre-trained models

* Caution about sharing models with patient data