
Data/code sharing and data limitations #367

Merged
merged 11 commits from the sharing branch into greenelab:master on May 5, 2017

Conversation

@agitter (Collaborator) commented Apr 30, 2017

I'm trying to finish a complete draft of the Discussion section. These sub-sections could definitely use feedback. I confirmed we'll be getting a draft of the evaluation sub-section soon so I didn't write that. I also removed other Discussion prompts.

@brettbj I'd like to briefly touch on the diet networks paper (#140) here. Can you please help me describe this in a sentence or two? What I have is not very clear.

@cgreene I'm closing #172 and we can include any relevant papers in this sub-section.

If there are build errors, I'll clean them up later today.

@cgreene (Member) left a comment

Looks good to me! I had some thoughts and identified potential options to highlight, but this looks great. Once you're happy with it, I'm happy with it.

training instances can be just as problematic. In genomics, for example,
labeled data may be derived from an experimental assay that has known and
unknown technical artifacts, biases, and error profiles. It is possible to
weight training examples or construct Bayesian models to account for uncertainty
Member

We didn't account for uncertainty, but we have corrected for non-independence in our Bayesian integrations, which I think is related to the point you are making. For example, in 10.1038/ng.3259 we had to deal with correlated noise, primarily from the experimental platform (e.g., array type). We had a very heuristic approach for this: we down-weighted datasets where the negative training examples (i.e., gene pairs with no relationships to each other) had high mutual information with other already-included datasets.

I imagine that lots of people do this to avoid finding trivial results. However, it may not often be clear why something like this was done. I think this is sort of the practical how-to knowledge that often gets lost when a paper is written and revised because it's not as splashy as the "methods" bits.
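To make that heuristic concrete, here is a minimal sketch. It assumes each dataset has been summarized as a vector of discretized scores over a shared set of negative gene pairs; the function, the `max_mi` normalizer, and the linear weight mapping are all illustrative, not the actual 10.1038/ng.3259 implementation:

```python
# Sketch of the down-weighting heuristic described above (assumption:
# each dataset is a vector of discretized scores over the same set of
# negative gene pairs; the linear mapping to [0, 1] is illustrative).
import numpy as np
from sklearn.metrics import mutual_info_score

def dataset_weight(candidate, selected, max_mi=1.0):
    """Down-weight a candidate dataset by its redundancy with already
    included datasets, measured on negative training examples only."""
    if not selected:
        return 1.0
    # Redundancy = highest mutual information with any included dataset.
    redundancy = max(mutual_info_score(candidate, s) for s in selected)
    # High redundancy -> low weight.
    return float(np.clip(1.0 - redundancy / max_mi, 0.0, 1.0))
```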

Member

As I re-read the paragraph - this falls into the weight training examples part.

Member

If you want me to write this specific example, feel free to make a quick issue and I can edit after this PR. It will be a light revision on the example above.

Collaborator Author

I made #377 for you to add these lines. You'll do a better job of getting the example correct.

This is along the lines of what I meant by "weight training examples", but I didn't explain that phrase well.
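For concreteness, "weight training examples" can be as simple as passing per-example weights into the loss. A minimal sketch with scikit-learn, where the per-label reliability scores are hypothetical stand-ins for assay-derived confidence estimates:

```python
# Minimal sketch: down-weight examples whose labels come from a noisier
# assay. The reliability scores are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.random.rand(100, 20)                          # features
y = np.random.randint(0, 2, size=100)                # labels from a noisy assay
reliability = np.random.uniform(0.5, 1.0, size=100)  # per-label confidence

model = LogisticRegression()
model.fit(X, y, sample_weight=reliability)           # noisier labels count less
```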

For some types of data, especially images, it is straightforward to augment
training datasets by splitting one labeled example into multiple examples. An
image can easily be rotated or inverted and retain its label `TODO: does someone
have examples to cite here? Issue #163?`. 3D MRI and 4D fMRI (with time as a
Member

Yes - #163 is perfect for this.

Collaborator Author

Added #163
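The augmentation being discussed is simple to implement. A minimal NumPy sketch of the eight label-preserving rotation/reflection variants of a 2-D image:

```python
# Minimal sketch: rotations and flips preserve an image's label, so one
# labeled example yields up to 8 distinct training examples.
import numpy as np

def augment(image):
    """Yield the 8 rotation/reflection variants of a 2-D image array."""
    for k in range(4):
        rotated = np.rot90(image, k)
        yield rotated
        yield np.fliplr(rotated)

examples = list(augment(np.arange(9).reshape(3, 3)))
assert len(examples) == 8
```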

full dataset [@tag:Arvaniti2016_rare_subsets].

Simulated or semi-synthetic training data has also been employed in multiple
biomedical domains. `TODO: simulated data: #5 #99 #293, maybe #117 and #197.
Member

10.1093/bioinformatics/btw243 trains on synthetic data. We don't have an issue for it. I think it's worth mentioning for this point as well.

Collaborator Author

I opened #378 for that paper. I'll see if I need it.

[@tag:Romero2017_diet], that drastically reduce the number of free parameters
by predicting parameters (weights) as a function of each SNP's embedding
in some feature space instead of directly optimizing a weight for each SNP.
`TODO: not very familiar with this paper, can someone describe it better?`
Member

cc @brettbj @traversc : can you help with this?
#140

Contributor

How's this? (I tried to keep it brief without getting heavily into embeddings.)

Multimodal, multi-task, and transfer learning, discussed in detail below, can also combat data limitations to some degree. There are also emerging network architectures, such as Diet Networks for high-dimensional SNP data [@tag:Romero2017_diet]. Diet Networks use multiple networks to drastically reduce the number of free parameters by first flipping the problem and tasking a network to predict parameters (weights) for each input (SNP) to learn a feature embedding. This embedding (i.e. PCA, per class histograms, x2vec style) can be learned directly from the input data, but has the advantage that it can also be learned from other datasets or from domain knowledge. Additionally, in this task, the features are the examples, an important fact when it is typical to have 500k+ SNPs and only a few thousand patients. Finally, this embedding is of a much lower dimension, allowing for a large reduction in the number of free parameters. In the example given, the authors reduced the number of free parameters from 30M to 50K, a factor of 600.
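A rough sketch of the parameter-prediction trick, with illustrative dimensions (much smaller than 500k SNPs) and a single linear layer standing in for the auxiliary network; this is only an interpretation of [@tag:Romero2017_diet], not the authors' code:

```python
# Sketch: an auxiliary network predicts the fat input layer's weights
# from per-SNP embeddings, so the free parameters scale with the
# embedding size rather than the number of SNPs. Dimensions are
# illustrative; real data would have 500k+ SNPs.
import numpy as np

n_patients, n_snps = 500, 10_000
n_embed, n_hidden = 100, 100

X = np.random.randint(0, 3, size=(n_patients, n_snps)).astype(np.float32)
E = np.random.rand(n_snps, n_embed).astype(np.float32)  # per-SNP embedding
                                                        # (e.g., per-class histograms)

# Auxiliary layer: n_embed * n_hidden = 10K free parameters, instead of
# directly learning all n_snps * n_hidden = 1M weights of the fat layer.
W_aux = (np.random.randn(n_embed, n_hidden) * 0.01).astype(np.float32)

W_fat = E @ W_aux            # predicted (n_snps, n_hidden) weight matrix
hidden = np.tanh(X @ W_fat)  # forward pass through the fat layer
print(hidden.shape)          # (500, 100)
```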

Collaborator Author

@brettbj thanks, I added this. I want to rephrase the x2vec style or remove the example embedding types and will use everything else as-is.

also recent techniques to mitigate these concerns. Furthermore, in some
domains, some of the best training data has been generated privately, for
example, high-throughput chemical screening data at pharmaceutical companies.
There is little expectation or incentive for this private data to be shared.
Member

What about: "One perspective is that there is little expectation or incentive for this private data to be shared. However, data are not inherently valuable. Instead, the insights that we glean from them are where the value lies. Private companies may establish a competitive advantage by releasing sufficient data for improved methods to be developed."

Basically a good benchmark dataset gets computer scientists to do lots of hard work for you for free.

Collaborator Author

I used this. There is an example where Merck provided data for a Kaggle competition, but I don't want to get into too much detail.


Code sharing and open source licensing are essential for continued progress in
this domain. Even though some journals have improved requirements about code
availability at the time of publication, there remain many opportunities to do
Member

Not sure if you want to cite this, but it was an encouraging step:
http://blogs.nature.com/methagora/2014/02/guidelines-for-algorithms-and-software-in-nature-methods.html

@agitter (Collaborator Author) commented May 5, 2017

I don't think I want to focus on journal requirements, so I'll rephrase to state that we encourage best practices for software regardless of what is or isn't required by publishers.

Now citing http://science.sciencemag.org/content/354/6317/1240.full and shortening my discussion text.

data cleaning (see above) and hyperparameter optimization. These improve
reproducibility and serve as documentation of the detailed decisions that impact
model performance but may not be exhaustively captured in a manuscript's methods
text. `TODO: there are many other things that could be added here (e.g.
Member

I agree - this could be an entire perspective article. I think that what's here is sufficient. It is very surprising to me how little deep learning code gets shared. I feel like comp bio is ahead of the game on this - or maybe just the genomics field. Maybe someone sees this and decides to write a perspective on this in a similar manner? :)
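One small way to capture those decisions in shared code rather than prose: commit the hyperparameter search itself and persist every configuration tried. A hedged sketch with scikit-learn (the grid values are arbitrary):

```python
# Minimal sketch: committing the hyperparameter search documents
# decisions that rarely fit in a methods section. Values are illustrative.
import json
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
grid = {"n_estimators": [100, 500], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

# Persist every configuration tried, not just the winner.
with open("hyperparameter_search.json", "w") as f:
    json.dump({"best": search.best_params_,
               "all_scores": search.cv_results_["mean_test_score"].tolist()}, f)
```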

that used pre-trained image classifier?` "Model zoos", collections of
pre-trained models, are not yet common in biomedical domains but have started to
appear in genomics applications [@tag:Angermueller2016_single_methyl
@tag:Dragonn]. `TODO: other examples, possibly from Categorize or Treat?`
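As background for the model-zoo point, the typical reuse pattern looks like the sketch below; Keras's ImageNet VGG16 stands in for a domain-specific pre-trained model, and the binary classification head is arbitrary:

```python
# Sketch of reusing a pre-trained model from a "model zoo": freeze the
# published weights and train only a small task-specific head.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the shared pre-trained weights fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new head for a binary task
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```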
Member

From categorize we need to have a sentence mentioning the possibility of attacking the model to identify training examples. Something like: For models trained on data where privacy is required, we discuss the potential for information leakage in section ### and methods to mitigate this concern.

Member

Section is:
"Data sharing is hampered by standardization and privacy considerations"

It's brutally long. Not sure what the most efficient way is to get people there.

@agitter (Collaborator Author) commented May 5, 2017

@cgreene two paragraphs above I have "As discussed in the Categorize section, there are complex privacy and legal issues involved in sharing patient data and deep learning models trained with it but also recent techniques to mitigate these concerns." Do you suggest I move that to this paragraph?

Member

What about changing above to:
"There are complex privacy and legal issues involved in sharing patient data that cannot be ignored."

Then here: "Deep learning models can also be attacked to identify examples used in training. We discuss this issue as well as recent techniques to mitigate these concerns... [this refers to categorize]"

I just want to make sure that nobody drops in here, reads that they should share their model, and then shares a model that leads to inadvertent release of something they thought was private.
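To illustrate why a shared model can leak private training data: the simplest membership-inference baseline just thresholds the model's per-example loss, since memorized training examples tend to have unusually low loss. A hedged sketch, assuming a scikit-learn-style classifier; the threshold would be calibrated on held-out data:

```python
# Minimal sketch of a loss-thresholding membership-inference baseline:
# unusually low loss on (x, y) suggests the example was in the training set.
import numpy as np

def looks_like_training_example(model, x, y, threshold):
    """Guess training-set membership from the model's confidence on the true label."""
    prob = model.predict_proba(x.reshape(1, -1))[0, y]
    loss = -np.log(max(prob, 1e-12))
    return loss < threshold
```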

@agitter (Collaborator Author) commented May 5, 2017

I'll make this edit (reworded slightly), then merge.


DeepChem [@tag:AltaeTran2016_one_shot @tag:DeepChem @tag:Wu2017_molecule_net]
and DragoNN [@tag:Dragonn] exemplify the benefits of sharing code under an open
source license and pre-trained models. `TODO: it would be great to add more
@cgreene (Member) commented Apr 30, 2017

Source code:

  • #213 (tagging @spoilt333) - don't think the raw data is here (looks like a /data/ path but not 100% sure); did not see a license
  • #266 (tagging @zhangyan32) - looks like source but not models + data; did not see a license
  • #217 (tagging @uci-cbcl) - source code, data can be downloaded; has license (GPL)
  • #171 (tagging @CampagneLaboratory) - this one has some docs, but not data/models if I'm seeing things correctly; has license (Apache)
  • #155 (tagging @traversc) - code yes, with docs; don't think data. Just a quick note: as I look through the source code, I think some .pyc files made it in; not sure if that's intended. No license
  • #150 (tagging @raghavchalapathy) - contains pre-trained models; did not see a license
  • #32 - also has source code from @uci-cbcl + example file + data download + very nice readme; custom license
  • #24 - source code + data [not version controlled]

That last one highlights to me that we don't mention version control here but we probably should recommend that explicitly.


My repo is source code, and we are still working on it now.

Collaborator Author

#171 is a good example with tutorials, license, etc. I think #24 does have a GitHub repo https://github.com/uci-cbcl/D-GEX

@agitter (Collaborator Author) commented Apr 30, 2017

@cgreene thanks for all the helpful suggestions. I should be able to work on these Monday.

@brettbj (Contributor) commented May 1, 2017

@agitter Will do tonight

@brettbj (Contributor) commented May 2, 2017

@agitter @cgreene - needed a full reread last night, will add the couple of sentences after a few meetings this morning

@agitter (Collaborator Author) commented May 2, 2017

@brettbj no rush, I won't get to this until today or tomorrow most likely because I shifted my efforts to drafting #370.

@agapow (Contributor) left a comment

Looks good; any comments are just suggestions.

Code sharing and open source licensing are essential for continued progress in
this domain. Even though some journals have improved requirements about code
availability at the time of publication, there remain many opportunities to do
better. Preprints are firmly established in many areas of machine learning and
Contributor

Hmmm. I feel that preprints - great as they are - is perhaps getting a little out of scope and are qualitatively different to code and data sharing. Just trying to pick out the deep learning specific material here.

Collaborator Author

Perhaps it is out of scope. I brought it up because it has been a point of personal frustration. More than once, there has been an exciting preprint on some biomedical deep learning topic that my group wanted to test out or build upon. If the code isn't available, we're stuck. Sometimes we don't even want to run the code, we just want to understand minute details of the training that aren't written down in the paper. Without code, we're left guessing.

Part of what I wanted in this section is a call to go beyond the accepted practices in the community. That includes sharing code when not required and highlighting example repositories that make the extra effort to provide pre-trained models, data pre-processing scripts, tutorials, etc.

Contributor

I get it. My feeling is that it's a broader issue and might diffuse any DL-specific message. I'll leave this to you. Maybe see how it fits when the paper is more complete and polished.

Contributor

While it might be slightly out of scope, I think we should keep it. It's an important point and it flows well with the discussion. A couple of additional thoughts:

  • maybe say explicitly that this recommendation is a general one, not specific to deep learning;
  • we could go as far as recommending code sharing platforms such as GitHub (maybe even Zenodo?)

Collaborator Author

@enricoferrero I support this. I think to avoid diffusing the DL-specific message further, we should briefly mention these things but then refer to an existing paper. There is probably a suitable PLoS Comp Bio 10 simple rules or similar article I can cite.

@agapow if it seems out of place when we polish the full Discussion section, I'm willing to drop it. I prefer to keep it in the initial draft.

Collaborator Author

@agapow I revised this paragraph to try to keep it more focused on the core element of code sharing. I also cite http://science.sciencemag.org/content/354/6317/1240.full to expand on best practices so we don't get too off topic here.

@cgreene and @enricoferrero, please see if you still like the overall message (once I push).

deep learning can make the preprocessing code (e.g., Basset
[@tag:Kelley2016_basset]) and cleaned data (e.g., MoleculeNet
[@tag:Wu2017_molecule_net]) publicly available to catalyze further research. As
discussed in the Categorize section, there are complex privacy and legal issues
Contributor

Do we actually refer to that section as "the Categorize section" in the paper?

Collaborator Author

Probably not 😄

I'm used to that jargon from our issues here and should fix it.

@agitter (Collaborator Author) commented May 5, 2017

I believe I addressed all of the helpful comments above. Let me know if I missed something.

@cgreene we can do a final review and merge now if everything looks okay. I have one TODO remaining regarding the simulated data paragraph, but I can make a separate small pull request for that.

@cgreene (Member) commented May 5, 2017

This LGTM 👍. The outstanding issues appear to be noted, and we can go through with more fine grained revisions after a merge if necessary.

@agitter agitter merged commit c7d3db6 into greenelab:master May 5, 2017
@agitter agitter deleted the sharing branch May 5, 2017 19:49
dhimmel pushed a commit that referenced this pull request May 5, 2017

This build is based on c7d3db6.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/deep-review/builds/229261127
https://travis-ci.org/greenelab/deep-review/jobs/229261128

[ci skip]

The full commit message that triggered this build is copied below:

Data/code sharing and data limitations (#367)

* Remove wide data sub-section

* Remove prompts

* Brief discussion intro

* Expand data sharing

* Full draft of sharing sub-section

* Partial data limitations sub-section

* Fix URL tags and minor data splitting edit

* Add diet networks text from @brettbj and respond to other comments

* Update code sharing discussion

* Example code repositories and pre-trained models

* Caution about sharing models with patient data