HLTCHKUST/snp2vec

SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

SNP2Vec is a scalable self-supervised pre-training approach for understanding SNP patterns in genomic sequences. SNP2Vec has been evaluated on predicting Alzheimer's disease risk in a Chinese cohort, where it significantly outperforms existing polygenic risk score methods and all other deep learning baselines that are trained on haploid sequences.

Research Paper

SNP2Vec was accepted at BioNLP 2022, and you can find the details in our paper. If you use any component of SNP2Vec in your work, including the token mapping resources, the cached chromosome matrix, or the Alzheimer's disease risk dataset, please cite the following paper:

@inproceedings{cahyawijaya-etal-2022-snp2vec,
    title = "{SNP}2{V}ec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study",
    author = "Cahyawijaya, Samuel  and
      Yu, Tiezheng  and
      Liu, Zihan  and
      Zhou, Xiaopu  and
      Mak, Tze Wing Tiffany  and
      Ip, Yuk Yu Nancy  and
      Fung, Pascale",
    booktitle = "Proceedings of the 21st Workshop on Biomedical Language Processing",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.bionlp-1.14",
    doi = "10.18653/v1/2022.bionlp-1.14",
    pages = "140--154",
    abstract = "Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics has also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability towards understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which is crucial for genome-wide association study. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNP. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer{'}s disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences.",
}

Repository Structure

We provide the code and resources for generating the SNP pre-training dataset, which we use to build the Dipformer model in our paper.

Chromosome Matrix

We provide two pre-processed chromosome matrices, for Chromosome-19 and Chromosome-21, which are built from GRCh37 and dbSNP 153.

To generate other chromosome matrices, see the gen_chromosome_matrix.ipynb notebook provided in this repo.
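As a rough illustration of combining a reference sequence with SNP records, here is a minimal sketch. It is not the code in gen_chromosome_matrix.ipynb, and the matrix layout is an assumption: one row per position, one column per nucleotide, each cell holding an allele frequency. The function name, the input format, and the toy data are all hypothetical. Consult the notebook for the actual format used in this repo.

```python
# Hypothetical sketch only: the real matrix format is defined by
# gen_chromosome_matrix.ipynb, not by this example.
NUCLEOTIDES = "ACGT"


def build_chromosome_matrix(reference, snps):
    """Return one row per position with a frequency for each of A, C, G, T.

    reference: reference sequence string (assumed to come from GRCh37).
    snps: dict mapping a 0-based position to {alternate_allele: frequency}
          (assumed to be derived from dbSNP records).
    """
    matrix = []
    for pos, ref_base in enumerate(reference.upper()):
        row = [0.0] * len(NUCLEOTIDES)
        alt_freqs = snps.get(pos, {})
        # Place each alternate allele's frequency in its nucleotide column.
        for allele, freq in alt_freqs.items():
            row[NUCLEOTIDES.index(allele)] = freq
        # The reference base takes the remaining probability mass.
        # Positions with an unknown base (e.g. "N") stay all zeros.
        if ref_base in NUCLEOTIDES:
            row[NUCLEOTIDES.index(ref_base)] = 1.0 - sum(alt_freqs.values())
        matrix.append(row)
    return matrix


# Toy data: a 4-base reference with one SNP (C>T at position 1, frequency 0.25).
toy_reference = "ACGT"
toy_snps = {1: {"T": 0.25}}
toy_matrix = build_chromosome_matrix(toy_reference, toy_snps)
```

In this toy layout, positions without a known SNP carry all of their mass on the reference base, while the SNP position splits its mass between the reference and alternate alleles.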

Alzheimer's Disease Risk Dataset

To access the Alzheimer's disease risk dataset used for evaluating the model in our paper, you need to request and sign a Data Use Agreement (DUA) by contacting Tiffany T.W MAK (tiffanytze@ust.hk) or Xiaopu Zhou (xpzhou@ust.hk).
