awosome-bioinformatics

Abstract: A curated list of resources for learning bioinformatics. Some of the resources in this repo were collected by the BioInstaller project. You can use BioInstaller to download the source code or database files directly, or fetch the meta information with BioInstaller::get.meta()$item.

Purpose:

  • Provide bioinformatics learning resources for beginners
  • Provide an overview of the bioinformatics field

Field:

  • Next generation sequencing (NGS)
  • Bioinformatics Data Analysis

Table of contents


Resources

General

Journal

Sequencing Technology

This section is mainly copied from enseqlopedia.

Thanks to this work: Hadfield, J. & Retief, J. A profusion of confusion in NGS methods naming. Nat Methods 15, 7-8 (2018).

RNA Sequencing Methods

Low-Level RNA Detection
RNA Modifications
RNA Structure
RNA Transcription
RNA-Protein Interactions

DNA Sequencing Methods

Protein-Protein Interaction
Sequence Rearrangements
DNA Break Mapping
DNA Protein Interactions
Epigenetics
Low-Level DNA Detection

Tools

Package management

Web Application Development Framework

Web-based Service

  • UCSC
  • NCBI
  • ExPASy
  • EMBL-EBI
  • TCGA
  • COSMIC
    • COSMIC-3D: a comprehensive integration of cancer mutations with protein structure across the human genome and structural proteome, seeking to support the identification and characterization of protein targets for novel drug design in precision oncology
  • St. Jude PeCan Data Portal
  • BIG Data Center
  • DAVID Bioinformatics Resources
  • cBioPortal
  • Oncotator
  • QIAGEN Analysis Platform
  • Wordcloud
  • Omictools
  • iCoMut
  • UniProt
  • Pfam
  • SMART
  • STRING
  • DiseaseEnhancer
  • SEECancer
  • eQTL Browser
  • Cistrome Project
  • VarCards
  • superdrug2
  • MeDReaders
  • ECOdrug
  • rSNPBase3.0
  • MNDR
  • MSDD
  • funcoup
  • proteinatlas
  • DGIdb
  • Drugbank
  • InterPro
  • ncbi-biosystems
  • denovo-db
  • The Human Phenotype Ontology (HPO)
  • FANTOM
  • dbNSFP
  • regSNP-intron
  • RADAR
  • DARNED
  • REDIportal
  • LNCediting
  • EggNOG
  • MiSTIC
  • DTMiner
  • PDBFlex
  • Cancer3d
  • Dsysmap
  • CBS Prediction Servers
  • wANNOVAR: Public web service of ANNOVAR
  • Harmonizome: Search for genes or proteins and their functional terms extracted and organized from over a hundred publicly available resources
  • GDA: A web-based tool that combines the uniquely large NCI60 drug sensitivity dataset with CCLE and NCI60 gene mutation and expression profiles
  • CLUE: Unravel biology with the world’s largest perturbation-driven gene expression dataset
  • CMAP: The Connectivity Map (also known as cmap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes.
  • pssmsearch: a web application to discover novel protein motifs (SLiMs, mORFs, miniMotifs) and PTM sites
  • bammmotif: Bayesian Markov Models (BaMMs), a web server for de-novo motif discovery and regulatory sequence analysis
  • LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis
  • GeNets: a unified web platform for network-based genomic analyses
  • HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization
  • paintomics: a web resource for the pathway analysis and visualization of multi-omics data
  • kinact: a computational approach for predicting activating missense mutations in protein kinases
  • VAReporter: VAReporter can provide comprehensive annotation by integrating a wide variety of biomedical databases
  • SNPnexus: SNPnexus was designed to simplify and assist in the selection of functionally relevant Single Nucleotide Polymorphisms (SNP) for large-scale genotyping studies of multifactorial disorders
  • Oncoscape: an online open-access data analysis and visualization platform that empowers researchers and clinicians to discover novel patterns and relationships between linked clinical and molecular data
  • cellmarker: a manually curated resource of cell markers in human and mouse
  • awesome: a database of SNPs that affect protein post-translational modifications
  • hmdb: an online database of small molecule metabolites found in the human body, which facilitates human metabolomics research including the identification and characterization of human metabolites using NMR and MS
  • redoxdb: a curated database of protein oxidative modification
  • instruct: a database of 3D protein interactome networks with structural resolution
  • consensuspathdb: integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways
  • phosphonetworks: a database for experimentally determined kinase-substrate relationships
  • dbsno: protein S-nitrosylation (SNO) is a reversible post-translational modification (PTM) and involves the covalent attachment of nitric oxide (NO) to the thiol group of cysteine (Cys) residues. Given the increasing number of proteins reported to be regulated by this modification, S-nitrosylation is considered to act, in a manner analogous to phosphorylation, as a pleiotropic regulator that elicits dual effects to regulate diverse pathophysiological processes by altering protein function, stability, and conformation change in various cancers and human disorders
  • hpdi: Human Protein-DNA Interactome (hPDI)
  • islandviewer: an integrated interface for computational identification and visualization of genomic islands
  • appris: a system that deploys a range of computational methods to provide annotations of alternative splice isoforms and identify principal isoforms for vertebrate species
  • rbpdb: a collection of RNA-binding proteins linked to a curated database of published observations of RNA binding
  • type2diabetesgenetics: providing data and tools to promote understanding and treatment of type 2 diabetes and its complications
  • pepquery: a peptide-centric search engine for novel peptide identification and validation
  • Gene Info eXtension (GIX): a browser extension that allows you to retrieve information about a gene product directly on any webpage simply by double clicking an official gene name, synonym or supported accession.
  • cancermine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer.
  • gpcrdb: contains data, diagrams and web tools for G protein-coupled receptors (GPCRs). Users can browse all GPCR structures and the largest collections of receptor mutants. Diagrams can be produced and downloaded to illustrate receptor residues (snake-plot and helix box diagrams) and relationships (phylogenetic trees). Reference (structure) structure-based sequence alignments take into account helix bulges and constrictions, display statistics of amino acid conservation and have been assigned generic residue numbering for equivalent residues in different receptors.
  • FPbase: a free, open-source, web-based, community-editable database for fluorescent proteins (FPs) and their properties.
  • Image Data Resource (IDR): Image Data Resource (IDR) is a public repository of image datasets from published scientific studies, where the community can submit, search and access high-quality bio-image data.
  • Allen Brain Atlases and Data: The Allen Institute for Brain Science uses a unique approach to generate data, tools and knowledge for researchers to explore the biological complexity of the mammalian brain. This portal provides access to high quality data and web-based applications created for the benefit of the global research community.
  • Allen Cell Explorer: a python-based, open-source toolkit that combines classic 3D image segmentation with artificial intelligence to detect cellular structures.
  • Mitotic Cell Atlas: Provides a comprehensive and quantitative 4D model of the mitotic protein localization network in a dividing human cell. Mitotic Cell Atlas is an integrated experimental and computational framework that provides a standardized yet dynamic spatio-temporal reference system for the mitotic cell. It can be used to integrate quantitative information on any number of protein distributions sampled in thousands of different experiments.
  • Broad Bioimage Benchmark Collection: a collection of freely downloadable microscopy image sets. In addition to the images themselves, each set includes a description of the biological application and some type of "ground truth" (expected results).
  • Cell Image Library: a repository for images and movies of cells from a variety of organisms. It demonstrates cellular architecture and functions with high quality images, videos, and animations. This comprehensive and easily accessible Library is designed as a public resource first and foremost for research, and secondarily as a tool for education. The long-term goal is the construction of a library of images that will serve as primary data for research.
  • Mitocheck: the goal of this resource is to integrate information on cellular functions of human genes while also giving access to supporting information such as microscopy images of phenotypes. Although its primary focus is on the biology of mitosis, the resource also integrates data relevant to many other cellular functions.
  • ssbd: Systems Science of Biological Dynamics (SSBD) database provides a rich set of open resources for analyzing quantitative data and microscopy images of biological objects, such as single-molecule, cell, gene expression nuclei, etc. Quantitative biological data and microscopy image are collected from a variety of species, sources and methods. These include data obtained from both experiment and computational simulation.
  • IMPC: the International Mouse Phenotyping Consortium (IMPC) is an international effort by 19 research institutions to identify the function of every protein-coding gene in the mouse genome. The entire genome of many species has now been published and whole genome sequencing is becoming relatively quick and cheap to complete. Despite these advancements the function of the majority of genes remains unknown.
  • elixir: ELIXIR unites Europe’s leading life science organisations in managing and safeguarding the increasing volume of data being generated by publicly funded research. It coordinates, integrates and sustains bioinformatics resources across its member states and enables users in academia and industry to access services that are vital for their research.
  • Global BioImaging Project: the imaging landscape has changed significantly in the last 10 years as the concept of open user access to cutting-edge technologies became valued and well recognized. In Europe, imaging experts from 25 countries joined forces and drew up the vision of a pan-European imaging infrastructure, which gave momentum to the project of founding a Euro-BioImaging European Research Infrastructure Consortium (the EuBI ERIC).
Clinical Annotation
  • CIViC
  • DoCM
  • ClinVar
  • Intogen
  • Cancer Hotspots
  • DisGeNET
  • Cancer Biomarkers database
  • OncoKB: Precision Oncology Knowledge Base
  • LncRNADisease: both a resource that curates experimentally supported lncRNA-disease association data and a platform that integrates tool(s) for predicting novel lncRNA-disease associations
  • fusiongdb: fusion gene annotation DataBase, which collected 48 117 FGs across pan-cancer from three representative fusion gene resources: the improved database of chimeric transcripts and RNA-seq data (ChiTaRS 3.1), an integrative resource for cancer-associated transcript fusions (TumorFusions), and The Cancer Genome Atlas (TCGA) fusions by Gao et al.
  • sedb: the comprehensive human Super-Enhancer database.
  • pmkb: the cancer precision medicine knowledge base for structured clinical-grade mutations and interpretations
  • ewasdb: epigenome-wide association study database
  • dcdb: the Drug Combination Database (DCDB). Accumulating scientific and clinical evidence suggests that drug combinations are a safe and effective approach to treating complicated and refractory diseases. DCDB is devoted to the research and development of multi-component drugs. The current version of DCDB collects 1363 drug combinations (330 approved and 1033 investigational, including 237 unsuccessful usages), involving 904 individual drugs and 805 targets
Noncoding RNA Related Database
  • CSCD
  • AtCircDB
  • CircNet
  • circBase
  • circRNADb
  • exoRBase
  • EVLncRNAs
  • NONCODE: an integrated knowledge database dedicated to non-coding RNAs (excluding tRNAs and rRNAs)
  • MiTranscriptome: a catalog of human long poly-adenylated RNA transcripts derived from computational analysis of high-throughput RNA sequencing (RNA-Seq) data from over 6,500 samples spanning diverse cancer and tissue types
  • FANTOM CAT: an atlas of human long non-coding RNAs with accurate 5’ ends
  • lnc2cancer2: an updated database that provides comprehensive experimentally supported associations between lncRNAs and human cancers
  • sm2mir: a manually curated database which collects and incorporates the experimentally validated effects of small molecules on miRNA expression in 20 species from published papers. Each entry contains detailed information about small molecules, miRNAs and their relationships, including species, small molecule name, DrugBank Accession number, PubChem CID, approved by FDA or not, miRNA name, miRBase Accession number, expression pattern of miRNA, experimental detection method, tissues or conditions for detection, evidence in the reference, PubMed ID and the published year of the reference
  • oncomirdb: a Database for Oncogenic & Tumor-Suppressive MicroRNAs
  • mircancer: provides a comprehensive collection of microRNA (miRNA) expression profiles in various human cancers, automatically extracted from the published literature in PubMed. It utilizes text mining techniques for information collection. Manual revision is applied after auto-extraction to provide 100% precision
  • lncipedia: a public database for long non-coding RNA (lncRNA) sequence and annotation. The current release contains 127,802 transcripts and 56,946 genes
  • mirnest: an integrative collection of animal, plant and virus microRNA data
  • mirtarbase: the experimentally validated microRNA-target interactions database
  • mirdb: an online resource for microRNA target prediction and functional annotations
eQTL Related Database

Sequencing Data Portal

Local tools

Quality Control
Alignment And Assembly
Variant Detection (SNVs, INDELs, SVs)
  • GATK
  • MuTect
  • lofreq
  • VarScan2
  • freebayes
  • TVC
  • SomaticSniper
  • speedseq
  • FusionCatcher
  • svtoolkit
  • pindel
  • breakdancer
  • delly
  • CNVkit
  • GRIDSS
  • PancanQTL
  • TumorFusions
  • SVScore
  • SVTools
  • RDDpred
  • iseq
  • deepvariant
  • SV2
  • facets
  • MutScan
  • svaba: structural variation and indel detection by local assembly
  • manta: structural variant and indel caller using mapped sequencing data
  • JAFFA: a multi-step pipeline that takes either raw RNA-Seq reads, or pre-assembled transcripts, then searches for gene fusions
  • Picky: structural variants pipeline for long reads
  • CREST: an algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data
  • Control-FREEC: a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data
  • Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs
  • GISTIC2: facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers
  • BreaKmer: A method to identify structural variation from sequencing data in target regions
  • deTiN: DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.
  • vadir: an integrated approach to Variant Detection in RNA
  • CN_Learn: a framework to integrate Copy Number Variant (CNV) predictions made by multiple algorithms using exome sequencing datasets
  • SVseq2
  • SoftSV: a tool for the detection of small and large deletions, inversions, tandem duplications and translocations from paired-end sequencing data.
  • wham: consists of two programs, wham and whamg. wham, the original tool, is a very sensitive method with a high false discovery rate. The second program, whamg, is more accurate and better suited for general structural variant (SV) discovery.
Variant Annotation
Variant Visualization (SNVs, INDELs, SVs)
Variant Screen
  • LARVA
  • DANN
  • NCBoost: Classifier of pathogenic non-coding variants in Mendelian diseases
Alternative Splicing
Gene Expression Data Analysis
  • Cufflinks
  • DESeq2
  • edgeR
  • HTSeq
  • RSEM: RNA-Seq by Expectation-Maximization, accurate quantification of gene and isoform expression from RNA-Seq data.
  • sRNAnalyzer
  • mrnn: an implementation of a Gated Recurrent Unit (GRU) network for classification of transcripts as either coding or noncoding
  • prada: pipeline for RNA-Sequencing Data Analysis
  • ballgown: a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly.
  • subread: comprises a suite of software programs for processing next-gen sequencing read data, i.e. featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins. High-performance read alignment, quantification and mutation discovery.
  • kallisto: a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment.
  • salmon: a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses new algorithms (specifically, coupling the concept of quasi-mapping with a two-phase inference procedure) to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.
  • mixcr: a universal software for fast and accurate extraction of T- and B- cell receptor repertoires from any type of sequencing data. Free for academic use only
  • trust: Tcr Receptor Utilities for Solid Tissue (TRUST) is a computational tool to analyze TCR and BCR sequences using unselected RNA sequencing data, profiled from solid tissues, including tumors. TRUST performs de novo assembly on the hypervariable complementarity-determining region 3 (CDR3) and reports contigs containing the CDR3 DNA and amino acid sequences. TRUST then realigns the contigs to IMGT reference gene sequences to report the corresponding variable (V) or joining (J) genes.
  • topconfects: intended for RNA-seq or microarray differential expression analysis and similar, where we are interested in placing confidence bounds on many effect sizes (one per gene) from few samples.
  • PLIER: Pathway-Level Information Extractor (PLIER): a generative model for gene expression data.
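
The kallisto entry above describes pseudoalignment as determining which targets a read is compatible with, without aligning it. The following is a minimal toy sketch of that idea, not kallisto's actual implementation or API: the function names, the toy transcripts, and the plain dictionary index are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read.

    An empty set means the read is shorter than k, contains a k-mer
    absent from the index, or no single transcript shares all k-mers.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Hypothetical toy transcriptome: two isoforms sharing the same 5' sequence.
transcripts = {
    "tx1": "ATGGCGTACGTTAGC",
    "tx2": "ATGGCGTACCCGATA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ATGGCGTAC", index, k=5)  # read from the shared region
unique = pseudoalign("GTACGTTAG", index, k=5)  # read spanning the tx1-specific part
print(shared, unique)
```

A read from the shared region stays compatible with both isoforms, while a read covering isoform-specific sequence narrows the set to one. A quantifier would then resolve the ambiguous reads statistically.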
Virus and Microbial Related
  • viral-ngs
  • qap
  • ROP: discovering the source of all RNA-seq reads, including those originating from repeat sequences, recombinant B and T cell receptors, and microbial communities
  • ViFi: pipeline for identifying viral integration and fusion mRNA reads from NGS data
  • hgtid: an efficient and sensitive workflow to detect human-viral insertion sites using next-generation sequencing data
  • MicroPro: software for profiling both known and unknown microbial organisms in metagenomic datasets.
  • FEAST: a scalable algorithm for quantifying the origins of complex microbial communities.
  • mcorr: inferring bacterial recombination rates from large-scale sequencing datasets.
  • VirusFinder2: a new software tool for characterizing intra-host viruses through next generation sequencing (NGS) data.
  • VirusSeq: an algorithmic tool for detecting known viruses and their integration sites using next-generation sequencing of human cancer tissue.
  • BatVI: a fast and sensitive method to determine viral integrations.
Single Cell
  • seurat
  • SCnorm
  • dropClust
  • scran: batch effect adjustment
  • trendsceek: spatial expression trends in single-cell gene expression data
  • scRNA-tools: a database of software tools for the analysis of single-cell RNA-seq data.
  • awesome-single-cell: list of software packages (and the people developing these methods) for single-cell data analysis, including RNA-seq, ATAC-seq, etc.
  • SAVER: SAVER (Single-cell Analysis Via Expression Recovery) implements a regularized regression prediction and empirical Bayes method to recover the true gene expression profile in noisy and sparse single-cell RNA-seq data.
  • CellSIUS: an R package enabling the identification and characterization of (rare) cell sub-populations from complex scRNA-seq datasets: it takes as input expression values of N cells grouped into M(>1) clusters. Within each cluster, genes with a bimodal distribution are selected and only genes with cluster-specific expression are retained. Among these candidate marker genes, sets with correlated expression patterns are identified by graph-based clustering. Finally, cells are assigned to subgroups based on their average expression of each gene set. The CellSIUS algorithm output provides the rare/ sub cell types by cell indices and their transcriptomic signatures.
  • SCRABBLE: Single Cell RNA-Seq imputAtion constrained By BuLk RNAsEq data (SCRABBLE)
  • Melissa: a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.
  • paga: mapping out the coarse-grained connectivity structures of complex manifolds.
  • clonealign: Bayesian inference of clone-specific gene expression estimates by integrating single-cell RNA-seq and single-cell DNA-seq data
  • CellFishing.jl: (cell finder via hashing) is a tool to find similar cells of query cells based on their transcriptome expression profiles.
  • VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies.
  • scgen: a tensorflow implementation of scGen. scGen is a generative model to predict single-cell perturbation response across cell types, studies and species.
  • conos: a package to wire together large collections of single-cell RNA-seq datasets. It focuses on uniform mapping of homologous cell types across heterogeneous sample collections. For instance, a collection of dozens of peripheral blood samples from cancer patients, combined with dozens of controls. And perhaps also including samples of a related tissue, such as lymph nodes.
  • MAGIC: Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm for denoising high-dimensional data most commonly applied to single-cell RNA sequencing data. MAGIC learns the manifold data, using the resultant graph to smooth the features and restore the structure of the data.
  • zinbwave: a zero-inflated negative binomial model for single-cell RNA-seq data, with latent factors.
  • SIMLR_PY: Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning.
  • dca: a deep count autoencoder network to denoise scRNA-seq data and remove the dropout effect by taking the count structure, overdispersed nature and sparsity of the data into account using a deep autoencoder with zero-inflated negative binomial (ZINB) loss function.
  • scVI: deep generative modeling for single-cell transcriptomics.
  • PhenoGraph: a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") representing phenotypic similarities between cells and then identifying communities in this graph.
  • splatter: simulation of Single-cell RNA sequencing data.
  • DeepNovo-DIA: de novo peptide sequencing for DDA and DIA by deep learning.
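
Several entries in this list (MAGIC, for example) describe denoising by building a graph over cells and smoothing expression values along it. The sketch below illustrates that general idea only. It is not MAGIC's implementation: the dense Gaussian kernel, the `sigma` and `t` parameters, and the toy matrix are assumptions chosen to keep the example short.

```python
import math


def diffusion_smooth(data, sigma=1.0, t=2):
    """Smooth a cells x genes matrix by diffusing values over a cell-cell graph."""
    n = len(data)
    n_genes = len(data[0])
    # Gaussian affinity between every pair of cells (dense, for clarity).
    affinity = [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
                / (2 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Row-normalise the affinities into a Markov transition matrix.
    markov = [[w / sum(row) for w in row] for row in affinity]
    smoothed = [list(row) for row in data]
    # Each step replaces a cell's profile with a weighted average of its neighbours.
    for _ in range(t):
        smoothed = [
            [sum(markov[i][j] * smoothed[j][g] for j in range(n)) for g in range(n_genes)]
            for i in range(n)
        ]
    return smoothed


# Hypothetical toy data: three similar cells (one with a dropout zero in the
# third gene) and two cells from a different, non-expressing population.
data = [
    [5.0, 5.0, 5.0],
    [5.0, 5.0, 0.0],  # dropout in the third gene
    [5.0, 5.0, 5.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
smoothed = diffusion_smooth(data, sigma=3.0, t=2)
print(smoothed[1])
```

The dropout value is pulled towards the values of the neighbouring cells. The trade-off is that too many diffusion steps blur real differences between populations, so the number of steps has to be chosen with care.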
Protein Data Related
Expression Quantitative Trait Loci, eQTL
ChIP-seq analysis
Primer Design
Work flow
Unclassified
  • biopython
  • IRanges
  • org.Hs.eg.db
  • Biobase
  • GenomicAlignments
  • GenomicRanges
  • Rsamtools
  • jvarkit
  • htslib
  • samtools
  • bedtools
  • bedops: a suite of tools to address common questions raised in genomic studies — mostly with regard to overlap and proximity relationships between data sets. It aims to be scalable and flexible, facilitating the efficient and accurate analysis and management of large-scale genomic data.
  • vcftools
  • bcftools
  • bamtools
  • maftools
  • bamUtil
  • vcflib
  • samstat
  • seqtk
  • sratools
  • bcl2fastq2
  • ucsc_utils
  • MeQA
  • IdCheck
  • SAMBLASTER
  • ngstk
  • BioInstaller
  • ChromHMM
  • ABSOLUTE
  • HAPSEG
  • Atlas-SNP, Atlas2 Suite
  • Beagle
  • CIBERSORT
  • biobloom
  • APAtrap
  • phenopredict: predicting phenotype sample information using gene expression
  • recount
  • bart: predicting functional transcription factors using gene set or a ChIP-seq dataset as input
  • LSMM (Latent Sparse Mixed Model): integrating functional annotations with genome-wide association studies
  • vcf2maf: Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms
  • r2d3: R Interface to D3 Visualizations
  • liteq: Serverless R message queue using SQLite
  • ReLaXed: Create PDF documents using web technologies
  • dash: RStudio Addin to Run a Selection as a Background Job
  • threadpool: Parallel Processing in R using a Thread Pool
  • marina: master Regulator Inference Algorithm
  • paradigm: PAthway Representation and Analysis by Direct Inference on Graphical Models
  • hupan: a pan-genome analysis pipeline for human genomes.
  • RaPID: an ultra-fast tool for the identification of identity-by-descent segments among genotyped individuals.
  • gemini: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens.
  • CONFINED: for the purpose of capturing replicable sources of biological variability in methylation data. These sources include, for example, age, sex, and cell-type composition. Importantly, the variation captured by CONFINED does not include any variability from technical or batch effects.
  • marginPhase: a program for simultaneous haplotyping and genotyping.
  • osca: (OmicS-data-based Complex trait Analysis) is a software tool written in C/C++ for the analysis of complex traits using multi-omics data.
  • ChiCMaxima: a pipeline for analyzing and identifying chromatin loops in promoter CHi-C data.
  • circBrain: Detection of circular RNA expression and related quantitative trait loci in the human dorsolateral prefrontal cortex.
  • bazam: A read extraction and realignment tool for next generation sequencing data.
  • DegNorm: short for degradation normalization, is a bioinformatics pipeline designed to correct for bias due to the heterogeneous patterns of transcript degradation in RNA-seq data. DegNorm helps improve the accuracy of the differential expression analysis by accounting for this degradation.
  • conbase: software for unsupervised discovery of clonal somatic mutations in single cells through read phasing
  • 3DChromatin_ReplicateQC: Software to compute reproducibility and quality scores for Hi-C data.
  • rnbeads: an R package for comprehensive analysis of DNA methylation data obtained with any experimental protocol that provides single-CpG resolution. Supported assays include Infinium and EPIC microarrays and bisulfite sequencing protocols, and also MeDIP-seq and MBD-seq once the data have been preprocessed with DNA methylation level inference software.
  • I-Boost: a statistical boosting method that integrates multiple types of high-dimensional genomics data with clinical data for predicting survival time.
  • bin3C: extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
  • dStruct: method for identifying differential reactive regions from RNA structurome profiling data.
  • Skmer: a fast tool for estimating distances between genomes from low-coverage sequencing reads (genome-skims), without needing any assembly or alignment step.
  • iGUIDE: a pipeline written in snakemake for processing and analyzing double-strand DNA break events. These events may be induced, such as by designer nucleases like Cas9, or spontaneous, as produced through DNA replication or ionizing radiation.
  • plyranges: provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data manipulation based on dplyr and the Bioconductor packages IRanges, GenomicRanges, and rtracklayer.
  • FORGe: tool for ranking variants and building an optimal graph genome.
  • SE-MEI: tools for finding mobile element insertions from single-end datasets.
  • Anchor: trans-cell Type Prediction of Transcription Factor Binding Sites
  • adVNTR: a tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data. It works with both NGS short reads (Illumina HiSeq) and SMRT reads (PacBio) and finds diploid repeating counts for VNTRs and identifies possible mutations in the VNTR sequences.
  • ldsc: a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
  • BigStitcher: ImgLib2/BDV implementation of Stitching for large datasets.
  • ivtnmr: In Vitro Transcription NMR. Protocol, code and examples for the co-transcriptional RNA folding network reconstruction.
  • DIVERS: (Decomposition of Variance Using Replicate Sampling), including absolute abundance estimation from spike-in sequencing and the variance/covariance decomposition of absolute bacterial abundances.
  • prosit: offers high quality MS2 predicted spectra for any organism and protease as well as iRT prediction
  • DeepCell: Software library for deep-learning-enabled single-cell analysis in the cloud. Users manage their own cloud deployment; model training and deployment are performed through a web interface.
  • CDeep3M: Amazon machine image for training and deploying deep learning models for 2D and 3D image segmentation
  • U-Net: ImageJ plug-in for single-cell image segmentation with U-Net.
  • CellProfiler: Python-based software for single-cell segmentation and morphological profiling. Single-cell segmentation with U-Net available through a REST API.
  • Mask R-CNN: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow.
  • CellCognition Explorer: an open-source image processing tool for the analysis of cellular phenotypes in microscopy. CellCognition Explorer enables phenotype classification by supervised machine learning. To detect rare phenotypes, outlier morphologies can be automatically found by novelty detection methods. A key feature of CellCognition Explorer is an improved classifier training procedure based on automated pre-processing of the full data set into cell gallery images, which can be automatically sorted based on phenotype similarity for efficient iterative classifier training.
  • DeepLabCut: a toolbox for markerless pose estimation of animals performing various tasks.
  • LEAP: LEAP Estimates Animal Pose, a framework for animal body part position estimation via deep learning.
  • idtracker.ai: software that tracks and identifies animals in collectives from videos.
  • In silico labeling: Predicting fluorescent labels in unlabeled images.
  • Image restoration: a toolbox for Content-aware Image Restoration (CARE).
  • trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data
  • cistopic: probabilistic modelling of cis-regulatory topics from single cell epigenomics data
  • selene: a framework for training sequence-level deep learning networks.
  • sirius: a rapid tool for turning tandem mass spectra into metabolite structure information.
  • SDA: Segmental Duplication Assembler.
  • fmriprep: a robust and easy-to-use pipeline for preprocessing of diverse fMRI data. The transparent workflow dispenses with manual intervention, thereby ensuring the reproducibility of the results.
  • unifrac: for high-performance phylogenetic diversity calculations
Statistical and Visualization
Text editor and IDE
Remote Connection (SSH)
Remote Connection (Desktop)
Other

Books&Tutorial

R

Linux&Shell

Python

C/C++

JAVA

Statistics and Deep learning

│  李航.统计学习方法.pdf
│  机器学习及其应用.pdf
│  All of Statistics - A Concise Course in Statistical Inference - Larry Wasserman - Springer.pdf
│  Machine Learning - Tom Mitchell.pdf
│  PRML.pdf
│  PRML读书会合集打印版.pdf
│  Programming Collective Intelligence.pdf
│  [奥莱理] Machine Learning for Hackers.pdf
│  [机器学习]Tom.Mitchell.pdf
│  《大数据:互联网大规模数据挖掘与分布式处理》迷你书.pdf
│  推荐系统实践.pdf
│  数据挖掘-实用机器学习技术(中文第二版).pdf
│  数据挖掘_概念与技术.pdf
│  机器学习-Mitchell-中文-清晰版.pdf
│  机器学习导论.pdf
│  模式分类第二版中文版Duda.pdf(全).pdf
│  深入搜索引擎--海量信息的压缩、索引和查询.pdf
│  矩阵分析.美国 Roger.A.Horn.扫描版.pdf
│  统计学习基础 数据挖掘、推理与预测.pdf
│  
├─机器学习实战
│      machinelearninginaction.zip
│      机器学习实战 单页.pdf
│      机器学习实战.pdf
│      
└─论文文集
    └─LDA
            LDA-wangyi.pdf
            LDA数学八卦.pdf
            text-est.pdf

Git

Cloud

Bioinformatics

Paper

Basic of High-throughput sequencing technology

  • Hadfield, J. & Retief, J. A profusion of confusion in NGS methods naming. Nat Methods 15, 7-8 (2018): http://enseqlopedia.com/enseqlopedia/
  • Schuster S C. Next-generation sequencing transforms today's biology[J]. Nature methods, 2008, 5(1): 16-18.
  • Ozsolak F, Milos P M. RNA sequencing: advances, challenges and opportunities.[J]. Nature Reviews Genetics, 2011, 12(2):87-98.
  • Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years[J]. Nature Reviews Genetics, 2019, 20(11): 631-656
  • Ansorge W J. Next-generation DNA sequencing techniques[J]. New biotechnology, 2009, 25(4): 195-203.
  • Heather J M, Chain B. The sequence of sequencers: The history of sequencing DNA[J]. Genomics, 2016, 107(1): 1-8.
  • Schneider G F, Dekker C. DNA sequencing with nanopores[J]. Nature biotechnology, 2012, 30(4): 326.
  • Restrepo-Pérez L, Joo C, Dekker C. Paving the way to single-molecule protein sequencing[J]. Nature nanotechnology, 2018, 13(9): 786-796.

Large research project

  • Cancer Genome Atlas Research Network, et al., The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p. 1113-20.
  • International Cancer Genome Consortium, et al., International network of cancer genome projects. Nature, 2010. 464(7291): p. 993-8.
  • GTEx Consortium, The Genotype-Tissue Expression (GTEx) project. Nat Genet, 2013. 45(6): p. 580-5.
  • eGTEx Project, Enhancing GTEx by bridging the gaps between genotype, gene expression, and disease. Nat Genet, 2017. 49(12): p. 1664-1670.
  • GTEx Consortium, et al., Genetic effects on gene expression across human tissues. Nature, 2017. 550(7675): p. 204-213.

Precision medicine

  • Byron, S.A., et al., Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet, 2016. 17(5): p. 257-71.
  • Price, N.D., et al., A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat Biotechnol, 2017. 35(8): p. 747-756.
  • Kumar-Sinha, C. and A.M. Chinnaiyan, Precision oncology in the age of integrative genomics. Nat Biotechnol, 2018. 36(1): p. 46-60.
  • Torkamani, A., N.E. Wineinger, and E.J. Topol, The personal and clinical utility of polygenic risk scores. Nat Rev Genet, 2018.
  • Berdasco, M. and M. Esteller, Clinical epigenetics: seizing opportunities for translation. Nat Rev Genet, 2018.

Tumor biology

  • Stratton, M.R., P.J. Campbell, and P.A. Futreal, The cancer genome. Nature, 2009. 458(7239): p. 719-24.
  • Sanchez-Vega, F., et al., Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell, 2018. 173(2): p. 321-337 e10.
  • Huang, K.L., et al., Pathogenic Germline Variants in 10,389 Adult Cancers. Cell, 2018. 173(2): p. 355-370 e14.
  • Kahles, A., et al., Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. Cancer Cell, 2018.
  • Castro-Giner, F., P. Ratcliffe, and I. Tomlinson, The mini-driver model of polygenic cancer evolution. Nat Rev Cancer, 2015. 15(11): p. 680-5.
  • Salk, J.J., M.W. Schmitt, and L.A. Loeb, Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet, 2018.
  • Winters, I.P., C.W. Murray, and M.M. Winslow, Towards quantitative and multiplexed in vivo functional cancer genomics. Nat Rev Genet, 2018. 19(12): p. 741-755.
  • Pesavento, P.A., et al., Cancer in wildlife: patterns of emergence. Nat Rev Cancer, 2018.
  • Maman, S. and I.P. Witz, A history of exploring cancer in context. Nat Rev Cancer, 2018. 18(6): p. 359-376.
  • Hamidi, H. and J. Ivaska, Every step of the way: integrins in cancer progression and metastasis. Nat Rev Cancer, 2018.
  • Archetti, M. and K.J. Pienta, Cooperation among cancer cells: applying game theory to cancer. Nat Rev Cancer, 2018.

Bioinformatics databases and tools

  • Ding, L., et al., Expanding the computational toolbox for mining cancer genomes. Nat Rev Genet, 2014. 15(8): p. 556-70.
  • Cheng, F., J. Zhao, and Z. Zhao, Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Brief Bioinform, 2016. 17(4): p. 642-56.
  • Zhang, Z., et al., A survey and evaluation of Web-based tools/databases for variant analysis of TCGA data. Brief Bioinform, 2018.
  • Casper J, Zweig A S, Villarreal C, et al. The UCSC genome browser database: 2018 update[J]. Nucleic acids research, 2017, 46(D1): D762-D769.
  • Afgan, E., et al., The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res, 2018. 46(W1): p. W537-W544.
  • Sondka, Z., et al., The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat Rev Cancer, 2018. 18(11): p. 696-705.

Application of machine learning on bioinformatics

  • Zou, J., et al., A primer on deep learning in genomics. Nat Genet, 2019. 51(1): p. 12-18.
  • Eraslan, G., et al., Deep learning: new computational modelling techniques for genomics. Nat Rev Genet, 2019.
  • Wainberg, M., et al., Deep learning in biomedicine. Nat Biotechnol, 2018. 36(9): p. 829-838.
  • Ching, T., et al., Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface, 2018. 15(141).
  • Min, S., B. Lee, and S. Yoon, Deep learning in bioinformatics. Brief Bioinform, 2017. 18(5): p. 851-869.
  • Jones, W., et al., Computational biology: deep learning. Emerging Topics in Life Sciences, 2017. 1(3): p. 257-274.
  • Angermueller, C., et al., Deep learning for computational biology. Mol Syst Biol, 2016. 12(7): p. 878.
  • Zhou, J., et al., Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet, 2018. 50(8): p. 1171-1179.
  • Sundaram, L., et al., Predicting the clinical impact of human mutation with deep neural networks. Nat Genet, 2018.
  • Libbrecht, M.W. and W.S. Noble, Machine learning applications in genetics and genomics. Nat Rev Genet, 2015. 16(6): p. 321-32.
  • Camacho, D.M., et al., Next-Generation Machine Learning for Biological Networks. Cell, 2018. 173(7): p. 1581-1592.

Whole-genome sequencing

  • Kosugi, Shunichi, et al. "Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing." Genome biology 20.1 (2019): 117.

Single cell sequencing

  • Kiselev, V.Y., T.S. Andrews, and M. Hemberg, Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet, 2019. 20(5): p. 273-282.
  • McInnes, L., J. Healy, and J. Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, 2018.
  • van der Maaten, L. and G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
  • Lake, B.B., et al., Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol, 2018. 36(1): p. 70-80.
  • Cusanovich, D.A., et al., A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell, 2018. 174(5): p. 1309-1324 e18.
  • Haghverdi, L., et al., Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol, 2018.
  • Raj, B., et al., Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat Biotechnol, 2018. 36(5): p. 442-450.
  • Edsgard, D., P. Johnsson, and R. Sandberg, Identification of spatial expression trends in single-cell gene expression data. Nat Methods, 2018. 15(5): p. 339-342.

Non-coding region and synonymous mutation

  • Fredriksson, N.J., et al., Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat Genet, 2014. 46(12): p. 1258-63.
  • Weinhold, N., et al., Genome-wide analysis of noncoding regulatory mutations in cancer. Nat Genet, 2014. 46(11): p. 1160-5.
  • Uszczynska-Ratajczak, B., et al., Towards a complete map of the human long non-coding RNA transcriptome. Nat Rev Genet, 2018.
  • Chamary J V, Parmley J L, Hurst L D. Hearing silence: non-neutral evolution at synonymous sites in mammals[J]. Nature Reviews Genetics, 2006, 7(2): 98.
  • Sauna Z E, Kimchi-Sarfaty C. Understanding the contribution of synonymous mutations to human disease[J]. Nature Reviews Genetics, 2011, 12(10): 683.
  • Supek F, Miñana B, Valcárcel J, et al. Synonymous mutations frequently act as driver mutations in human cancers[J]. Cell, 2014, 156(6): 1324-1335.
  • Sharma, Y., et al., A pan-cancer analysis of synonymous mutations. Nat Commun, 2019. 10(1): p. 2569.

Pan-genome

  • Li, R., et al., Building the sequence map of the human pan-genome. Nat Biotechnol, 2010. 28(1): p. 57-63.
  • Duan Z, Qiao Y, Lu J, et al. HUPAN: a pan-genome analysis pipeline for human genomes[J]. Genome biology, 2019, 20(1): 149.

3D genome

  • Spielmann, M., D.G. Lupianez, and S. Mundlos, Structural variation in the 3D genome. Nat Rev Genet, 2018. 19(7): p. 453-467.

Skills

Programming language

Statistics

Code Management

Organization

Google Summer of Code Registered

  • Open Bioinformatics Foundation: Promoting practice & philosophy of OSS & Open Science in biological research.
  • National Resource for Network Biology (NRNB): Organizes the development of free, open source software to enable network-based visualization, analysis, and biomedical discovery.
  • INCF: INCF advances data reuse and reproducibility in brain research by coordinating the development of Open, FAIR, and Citable tools and resources for neuroscience.
  • Computational Biology @ University of Nebraska-Lincoln: Our organization develops tools for bioinformatics and computational biology research. Our goal is to further knowledge in health through data visualization and analysis.
  • Biomedical Informatics, Emory University: Big Data for Healthcare and Biomedical Research
  • Ensembl: The Ensembl project maintains and updates databases that annotate a wide number of genome sequences and distributes them freely to the worldwide research community.
  • R project for statistical computing: R provides a wide variety of statistical and graphical techniques, and is highly extensible. R is often the tool of choice for research in statistical methodology.
  • InterMine: InterMine integrates biological data sources and makes it easy to query, visualise, and analyse the data via a graphical user interface or via APIs in Python, R, Perl, and more.
  • NumFOCUS: NumFOCUS supports and promotes world-class, innovative, open source scientific software.
  • PEcAn Project: PEcAn is an integrated ecoinformatics toolbox that consists of a set of scientific workflows that wrap around ecosystem models and manage flow of information in and out of models

Project-based community

  • galaxyproject: Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
  • bioconda: A channel for the conda package manager specializing in bioinformatics software.
  • biopython: An international association of developers of freely available Python tools for computational molecular biology.
  • samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data.
  • opengene: Open source tools for NGS data analysis.
  • MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report.
  • GATK: GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.
  • nextflow: A bioinformatics workflow manager that enables the development of portable and reproducible workflows.
  • spack: A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
  • omicX: Reap the rewards of a biological insight engine.

Communication-based community

Institute or business company

People

Blog

Contributors
