
Accurate prediction of single-cell DNA methylation states using deep learning #39

Closed
cgreene opened this issue Aug 5, 2016 · 14 comments



cgreene commented Aug 5, 2016

Published: https://doi.org/10.1186/s13059-017-1189-z
Preprint: https://dx.doi.org/10.1101/055715


gwaybio commented Aug 24, 2016

Very well-written article predicting binary methylation status (0: hypomethylated, 1: hypermethylated) in single-cell bisulfite sequencing experiments (scBS-seq). A secondary goal is to visualize the DNA motifs contributing to methylation status and to cellular methylation heterogeneity.

Biology

The authors use scBS-seq data from 32 mouse embryonic stem cells to build their deep network. The features of the network are described in detail and consist of DNA sequence elements and nearby methylation states of the target cell and other assayed cells. Since scBS-seq experiments cover only 20-40% of CpG sites because of low DNA yields, models that can impute methylation states in missing regions are extremely important. The authors also show variable predictive performance of their model depending on the sequence context of the target CpG (e.g. TSS, exon, promoter, CpG island, etc.).

Computational Aspects

There are three deep networks in the model, all of which are convolutional neural networks (CNNs) with one hidden convolutional layer using max pooling and ReLU activations. Some aspects of the architecture were difficult to decipher (e.g. stride of the convolution, feature map size).

  1. DNA module
    • Uses sequence elements +/- 250 bp from the given CpG
      • The authors did test shorter sequence lengths and report decreased performance
      • It is unclear if larger, or more biologically informed, windows would improve performance
    • Convolution in 1 dimension - akin to scanning with position-specific scoring matrices (PSSMs)
  2. CpG module
    • Binary methylation states of +/- 25 neighboring CpGs in the target cell and in other assayed cells
    • Convolution in 2 dimensions taking into account other cells that may have the target CpG measured
  3. Fusion module
    • Receives the CNN output from both the DNA and CpG modules
    • Fully connected with one output node
      • Sigmoid activation on the output layer to predict a binary ŷ ∈ {0, 1}
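A minimal numpy sketch of what the DNA module's 1-D convolution is doing (filter length, pool width, and random weights are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hot encode a toy DNA window (+/- 250 bp around the target CpG -> 501 bp).
bases = "ACGT"
seq = rng.choice(list(bases), size=501)
one_hot = np.zeros((501, 4))
one_hot[np.arange(501), [bases.index(b) for b in seq]] = 1.0

# One convolutional filter acts like a PSSM: at each offset it scores the
# local subsequence with a position-weighted sum over the four bases.
filt = rng.normal(scale=0.1, size=(10, 4))
scores = np.array([(one_hot[i:i + 10] * filt).sum() for i in range(501 - 10 + 1)])

# ReLU followed by max pooling (width 4) summarizes where the motif matches.
activation = np.maximum(scores, 0.0)
pooled = activation.reshape(-1, 4).max(axis=1)
```

The real module uses many filters and learns them by backpropagation; the point here is only the PSSM analogy for a single filter.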

Model trained with dropout, Glorot-initialized weights, and Adam adaptive learning with early stopping. What is especially nice is the availability of all code used to implement the model.
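Two of these training tricks can be sketched in a few lines of numpy (layer sizes and the dropout rate are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def glorot_uniform(fan_in, fan_out):
    # Glorot/Xavier initialization: scale the uniform range by fan-in and
    # fan-out to keep activation variance roughly constant across layers.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def dropout(x, rate, training=True):
    # Inverted dropout: zero a random subset of units during training and
    # rescale the survivors, so no change is needed at test time.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

W = glorot_uniform(128, 64)        # hypothetical hidden-layer weight matrix
h = dropout(np.ones(128), rate=0.5)
```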

Why we should include it in our review

  1. Deep learning for epigenetics - I buy this one more than Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks #68
  2. Deep learning in single cells
    1. This data is huge and will only continue to grow - one area where deep learning could have a more profound impact
  3. Produces nice interpretations/visualizations (PSSM motifs) of what the DNA module is actually learning in the convolution (with added interpretations of the heterogeneity of single cells)
    1. One example of overcoming the black box (although the black box remains for the CpG and Fusion modules)

I am tagging the first author of the article @cangermueller to make sure I didn't miss anything and/or to add on to this summary.


agitter commented Aug 24, 2016

@gwaygenomics What did they do to create the 2D input for the CpG module if the single cells are initially unordered? Did they create a cell-cell similarity matrix? This relates to the discussion in #79.


gwaybio commented Aug 24, 2016

@agitter Yeah, I stared at this bit for a while - still not sure if I'm understanding correctly. From the supplement:

The methylation state and distance of observed neighbouring CpG sites are inputs to a 2d-convolutional layer. Importantly, this layer convolves each cell separately with the same convolutional filters to unlink the number of model parameters from the number of cells, which can be large.

It looks like the convolutions are only at the single-cell level, but weights are shared across cells. This makes more sense, since any structure across cells would be artificially imposed.


agitter commented Aug 25, 2016

@gwaygenomics You're right, and that makes a lot more sense. They say:

A 2d-convolution layer convolves the CpG neighbourhood of cells t independently at every position i by using filters w_f of dimension 1 x L x D and length L

There is still something interesting that they are doing with the distances between neighboring CpG sites that I need to look at further.

@cangermueller

Hi guys,

sorry for the late reply, I was traveling. I am happy to hear that you want to review DeepCpG.

I did not use a 2d convolutional kernel of size C x L to learn dependencies between C cells and L CpG sites, since here the information flow between cells would depend on the ordering of rows (=cells) in the input tensor. Instead, I used a 2d convolution with kernel size 1 x L to only learn dependencies between CpG sites. Dependencies between cells are learnt afterwards by fusion modules, i.e. hidden layers that are connected to all output neurons of the CpG module and the DNA module. This is the same as scanning the CpG neighbourhood of each cell with 1d convolutions, sharing their weights, and connecting the resulting hidden layers; however, this would be slower. Does this make sense?
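This design can be sketched in numpy: convolving each cell's CpG neighbourhood with the same 1 x L filter keeps the parameter count independent of the number of cells, and permuting the cells merely permutes the output rows, so no cell ordering is imposed (toy sizes and a single filter/channel are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

n_cells, n_sites, L = 5, 50, 3  # toy sizes; the real data has far more cells/sites

# Per-cell input: binary methylation states of neighbouring CpG sites
# (the paper also feeds in distances; one channel here for simplicity).
x = rng.integers(0, 2, size=(n_cells, n_sites)).astype(float)
filt = rng.normal(size=L)  # one shared filter, i.e. a 1 x L kernel

def conv1d(row, w):
    # Valid 1-D convolution of one cell's neighbourhood with the shared filter.
    return np.array([row[i:i + len(w)] @ w for i in range(len(row) - len(w) + 1)])

# A 2-D convolution with a 1 x L kernel equals running this 1-D convolution
# on every cell with shared weights: parameters do not grow with cell count.
out = np.stack([conv1d(x[c], filt) for c in range(n_cells)])

# Permuting the cells permutes the output rows identically, so the module
# cannot learn spurious structure from the arbitrary row order.
perm = rng.permutation(n_cells)
out_perm = np.stack([conv1d(x[c], filt) for c in perm])
assert np.allclose(out[perm], out_perm)
```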

Concerning point '1. DNA module': Prediction performance only increased slightly by using a window wider than +/- 250bp. As a trade-off between compute costs and performance, I therefore decided to use +/- 250bp.

Concerning point 3. ‘Why we should include it’: I tried to make the model interpretable by

  • Visualizing DNA motifs (weights of convolutional filters)
  • Correlating activations of convolutional filters with predicted CpG methylation states
  • Using learnt DNA motifs to predict cell-to-cell variability
  • Quantifying the influence of base-pair mutations and neighboring CpG sites by gradient back-propagation
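The last point can be illustrated with a toy stand-in for the real network, a logistic model on a one-hot sequence, where the input gradient has a closed form (all names and sizes here are hypothetical, not DeepCpG's):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy differentiable predictor: logistic regression on a one-hot DNA window.
seq_len = 20
w = rng.normal(size=(seq_len, 4))
x = np.zeros((seq_len, 4))
x[np.arange(seq_len), rng.integers(0, 4, size=seq_len)] = 1.0

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x * w).sum()))

# Gradient of the sigmoid output w.r.t. the one-hot input: p * (1 - p) * w.
# Back-propagating this scores how strongly each possible base at each
# position would push the predicted methylation state up or down.
p = predict(x)
grad = p * (1.0 - p) * w

# First-order estimate of mutating position i to base b, vs. recomputing.
i, b = 5, 2
x_mut = x.copy()
x_mut[i] = 0.0
x_mut[i, b] = 1.0
estimate = grad[i, b] - grad[i, x[i].argmax()]
exact = predict(x_mut) - predict(x)
```

For a deep network the gradient is obtained by back-propagation instead of a closed form, but the interpretation of the scores is the same.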

Let me know if I can help you with anything else.

Best,
Christof


cgreene commented Aug 29, 2016

@cangermueller Thank you for providing context for your paper! Regarding point 1, what kind of computational costs would have been required to go to a larger window (say 1 kb)? Are there any practical concerns (e.g. the examples become somewhat more unique with a larger window, and thus more training data are required)?

I could easily see some discussion of the computational costs associated with scaling these methods included in the review. If you want to pitch in on the full review (via #2 and #88), we'd love to get your perspective.

@cangermueller

Twice the window size means twice as much GPU memory and compute time. The main concern is the memory bottleneck of GPUs, e.g. the cluster I used only had GPUs with 4 GB.
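A back-of-envelope calculation along these lines (filter count, batch size, and padding scheme are illustrative assumptions, not DeepCpG's actual configuration):

```python
# Activation memory of the first conv layer grows linearly with the input
# window, so doubling the window roughly doubles GPU memory for activations.
def conv_activation_bytes(window_bp, n_filters=128, batch=512, bytes_per_float=4):
    # With 'same' padding there is one output per input position per filter.
    return window_bp * n_filters * batch * bytes_per_float

mem_small = conv_activation_bytes(501)   # +/- 250 bp window, ~125 MB
mem_large = conv_activation_bytes(1001)  # +/- 500 bp window
ratio = mem_large / mem_small            # ~2x
```

Weights, gradients, and optimizer state add further overhead on top of this, which is why a 4 GB card fills up quickly at larger windows.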

I'll have a look at the entire review.


agitter commented Aug 29, 2016

In the quest for common themes across papers, note that the authors of #24 also wrote that memory was a limiting factor.

@cangermueller if you do decide you want to contribute more, I'd be interested in your thoughts on what topics weren't covered in your recent review #47. We all thought that was an excellent overview and aim to provide a different perspective here, as described in #2 and #88.


agitter commented Feb 17, 2017

As noted in #244, this preprint was updated this month. I haven't checked the differences, but there was mention of updated code at https://github.com/cangermueller/deepcpg/

We may consider highlighting the software as one example of a project that provides good documentation, IPython notebook examples, pre-trained models ("model zoo"), etc.

@cangermueller

The main differences are:

  • Different model architecture
    • DNA module has two conv layers instead of one
    • DNA module operates on a 1001 bp window instead of a 501 bp window
    • CpG module is a bidirectional GRU instead of a CNN
  • Extended evaluation
    • Five instead of two cell types, including human and mouse cells
    • Comparison of scBS-seq vs. scRRBS-seq
    • Evaluation of predicted mutation effects on known mQTLs
  • Results
    • New model architecture is more accurate
    • Performance gain is highest for scRRBS-seq profiled cells
    • Predicted mutation effects are higher for known mQTLs
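To make the CpG-module change concrete, here is a minimal numpy sketch of a bidirectional GRU over a CpG neighbourhood (the gate equations are the standard GRU; sizes, inputs, and initialization are toy assumptions, not the published model):

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h, T = 2, 8, 25  # per-CpG input (state, distance), hidden size, neighbourhood length

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    # Standard GRU gates; the same weights are applied at every position.
    def __init__(self):
        self.Wz, self.Wr, self.Wh = (
            rng.normal(scale=0.1, size=(d_in + d_h, d_h)) for _ in range(3)
        )

    def step(self, h, x):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)                              # update gate
        r = sigmoid(xh @ self.Wr)                              # reset gate
        h_cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)  # candidate state
        return (1 - z) * h + z * h_cand

def run(cell, xs):
    h = np.zeros(d_h)
    for x in xs:
        h = cell.step(h, x)
    return h

# Bidirectional: one GRU reads the neighbourhood left-to-right, a second
# reads it right-to-left; their final states are concatenated downstream.
xs = rng.normal(size=(T, d_in))
fwd, bwd = GRUCell(), GRUCell()
h_bi = np.concatenate([run(fwd, xs), run(bwd, xs[::-1])])
```

Unlike the fixed-width convolution it replaced, the recurrence can in principle integrate over a variable-length neighbourhood.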

As you noted, I have also refactored the code base of DeepCpG and provided pre-trained models and notebooks. However, it is not yet perfect; I am still extending the documentation and notebooks.

Let me know if anything is still unclear!

@cangermueller

What is not mentioned in the manuscript: batch normalization yielded worse results, so it is not used. I also evaluated a couple of different architectures for the DNA module, including convolutional-recurrent models, ResNets, and dilated convolutions. However, a quite simple CNN with two conv layers and one FC layer with 128 units performed best.


agitter commented Feb 18, 2017

@cangermueller thanks for updating us here. It sounds like some major improvements.

I really like the runnable examples and effort to make the software reusable.


agitter commented Apr 11, 2017

I edited the original post with the DOI of the published version.

@cangermueller

Thanks!

dhimmel added a commit to dhimmel/deep-review that referenced this issue Nov 3, 2017
@cgreene cgreene closed this as completed Mar 12, 2018