
Problem with "wide-matrix" datasets for deep learning #95

Closed
w9 opened this issue Sep 15, 2016 · 1 comment
w9 commented Sep 15, 2016

My name is Xun Zhu and I'm from Dr. Lana Garmire's lab. I agree with Dr. Greene when he said in #88 that we don't want to publish a "cheer-leading" paper. I would like to discuss with you a major concern of mine regarding the current application of Deep Learning in medicine/biology.

Currently, many types of datasets are still "wide-matrices"; that is, the number of features dwarfs the number of samples, and the number of samples is still far from what would be considered large. This makes me wonder whether applying complex models such as a deep neural network is going to be successful. "You cannot make a model more complex than your data."

Many types of next-generation sequencing data fall into this category, including single-cell RNA sequencing (scRNA-Seq) data. Currently, a scRNA-Seq dataset considered "large-scale" consists of a few thousand samples, which is grossly insufficient relative to its typical 20,000 to 30,000 features. To put these numbers in perspective, the ImageNet dataset, on which Deep Learning algorithms have been quite successful, contains ten million pictures, the majority of which can be fully represented by a 4096-pixel (64-by-64) thumbnail. That is roughly 2,400 times as many samples as features. Not to mention that one million of these pictures are manually tagged with highly accurate labels, and that the local structure of the pixels is simplified both by their high correlation and by their well-understood, common-sense nature (edges, corners, etc.).
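To make the mismatch concrete, here is a rough back-of-the-envelope sketch; the layer sizes are purely illustrative assumptions, not a proposed architecture:

```python
# Back-of-the-envelope comparison of parameter count vs. sample count for a
# hypothetical fully connected network on scRNA-Seq-scale data.
# Layer sizes are illustrative assumptions only.
n_features = 25_000          # genes, within the typical 20,000-30,000 range
n_samples = 3_000            # a "large-scale" scRNA-Seq study

layer_sizes = [n_features, 512, 128, 1]
n_params = sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

print(f"parameters: {n_params:,}")                            # ~12.9 million
print(f"samples:    {n_samples:,}")
print(f"parameters per sample: {n_params / n_samples:,.0f}")  # ~4,300
```

Even this modest network has thousands of parameters per available sample, which is the sense in which the model is "more complex than the data."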

There are perhaps two possible directions one could pursue to remedy this: reducing the complexity of the model or the number of features, or somehow increasing the number of samples.

To reduce the complexity of the model, one could build neural networks with more specialized structures, based on domain-specific knowledge. This was a very common technique before the current generation of Deep Learning, back when end-to-end (from raw features to final output) model building was not computationally realistic. Similarly, a small number of features can be picked either manually or following a mathematically defined rule, again with the help of domain-specific knowledge. Shalini Ananda referred to this as "specialized deep learning", although it is an open question at what point the model becomes too restricted to count as a "deep learning" model.
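As a minimal sketch of the feature-reduction route (the expression matrix and the cutoff of 2,000 genes are made-up placeholders; in practice the selection rule would come from domain knowledge rather than raw variance), one could keep only the most variable genes before fitting any network:

```python
import numpy as np

# Minimal sketch: keep only the k most variable genes before model fitting.
# X is a synthetic placeholder for a (samples x genes) expression matrix.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(1_000, 20_000))

k = 2_000                                  # assumed cutoff
top_idx = np.argsort(X.var(axis=0))[::-1][:k]
X_reduced = X[:, top_idx]                  # matrix handed to a (much smaller) model

print(X_reduced.shape)                     # (1000, 2000)
```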

Alternatively, one might seek to combine a large number of studies with the same type of data to train the model. This might bump up the number of samples by one or two orders of magnitude (and the pool will continue to grow as publications in the field accumulate). The advantage of this approach is that the complexity of the model is not compromised, and it may even allow for end-to-end learning. However, it might be tricky to ask a question that is pertinent across all studies when those studies have different designs. The model also needs to be able to effectively normalize across datasets, since they may vary in protocol and quality.
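A minimal sketch of what pooling studies might look like, assuming nothing more sophisticated than z-scoring each gene within each study (real cross-study work would need proper batch correction; the study matrices below are synthetic placeholders):

```python
import numpy as np

# Minimal sketch of pooling several studies after per-study z-scoring.
# Each synthetic "study" has its own shift to mimic protocol differences.
rng = np.random.default_rng(1)
studies = [rng.normal(loc=offset, size=(n, 5_000))
           for offset, n in [(0.0, 800), (2.0, 1_200), (-1.0, 500)]]

def zscore_per_study(X, eps=1e-8):
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X_pooled = np.vstack([zscore_per_study(X) for X in studies])
print(X_pooled.shape)   # (2500, 5000)
```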

I would like to hear your comments on dealing with "wide-matrix" datasets in deep learning.

w9 changed the title from "Problem "wide-matrix" datasets for deep learning" to "Problem with "wide-matrix" datasets for deep learning" on Sep 15, 2016

agitter commented Sep 15, 2016

Thanks for these comments; I'm very happy to see a few new people joining in. I think this could be worth discussing in the review. Single-cell RNA-seq, and gene expression data more generally, is certainly a case where the wide data matrix creates challenges.

There are some biomedical applications where there are many instances relative to the number of features, so we could contrast these settings. For example, in #55 there are tens or hundreds of thousands of instances per task, and each instance is represented with about a thousand features. And in contexts like regulatory genomics (e.g., #13), splitting a single dataset (e.g., a ChIP-seq experiment) into many instances (1 kb windows of the genome) leads to more instances than features. This is alluded to in the review #47, which discusses how convolutional architectures make it possible to use fewer parameters even when the number of input features is large.
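A minimal sketch of that splitting idea (the sequence, window size, and filter sizes below are illustrative assumptions, not drawn from any of the cited papers):

```python
import numpy as np

# Minimal sketch: one long labelled region becomes many fixed-size instances.
rng = np.random.default_rng(2)
genome = rng.integers(0, 4, size=10_000_000)   # placeholder sequence, A/C/G/T as 0-3
window = 1_000

n_windows = len(genome) // window
windows = genome[:n_windows * window].reshape(n_windows, window)
print(windows.shape)   # (10000, 1000) -> far more instances than features

# Parameter efficiency of convolutions (hand-computed, illustrative numbers):
# 64 filters of width 8 over a 4-channel one-hot encoding use
# 4 * 8 * 64 + 64 = 2,112 parameters regardless of window length, whereas a
# dense layer from the 4,000 one-hot inputs to 64 units would use 256,064.
```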

From the specialized deep learning post:

> But to train an algorithm to diagnose a disease from an MRI image, the data scientist has to determine inconspicuous features to extract and hand craft their network to train these complex features.

Is that the case? I don't work in medical imaging, but I would have liked to see some references to support that claim. One MRI paper on our list (#86) appears to be applying standard deep learning strategies without manual feature extraction. They actually employ the "splitting" technique as well, dividing 3D images into many 2D instances. And #79 draws multiple samples of single cells from the same original pool to create instances. Perhaps this is a commonality across domains for overcoming the problem of limited data?
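As a minimal sketch of that splitting pattern (the volume shape is an assumption, not taken from the paper in #86), a single 3D volume can be turned into many 2D instances by slicing:

```python
import numpy as np

# Minimal sketch: one 3D volume -> many 2D training instances via axial slices.
volume = np.zeros((155, 240, 240), dtype=np.float32)  # assumed (slices, height, width)
slices = [volume[i] for i in range(volume.shape[0])]  # 155 separate 2D instances
print(len(slices), slices[0].shape)                   # 155 (240, 240)
```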

> ImageNet dataset... manually tagged with highly accurate labels

Quality of the training labels is another topic that will be good to discuss in the review.
