Shinya Oki edited this page Jul 31, 2018 · 66 revisions

ChIP-Atlas / Documents

Documents for computational processing in ChIP-Atlas.

Table of Contents

  1. Data source
  2. Primary processing
  3. Data Annotation
  4. Peak Browser
  5. Target Genes
  6. Colocalization
  7. Enrichment Analysis
  8. Downloads
  9. External Genome Browser

1. Data source

Currently, most academic journals require authors of studies that include high-throughput sequencing to submit their raw sequence data as SRAs (Sequence Read Archives) to public repositories (NCBI, DDBJ or ENA). Each experiment is assigned an ID, called an experiment accession, beginning with SRX, DRX, or ERX (hereafter ‘SRXs’). ChIP-Atlas refers to the corresponding ‘experiment’ and ‘biosample’ metadata in XML format (available from the NCBI FTP site) and uses the SRXs that meet the following criteria:

  • LIBRARY_STRATEGY == "ChIP-seq" or "DNase-Hypersensitivity"
  • LIBRARY_SOURCE == "GENOMIC"
  • INSTRUMENT_MODEL ~ "Illumina"
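
As a rough illustration, the three criteria can be written as a predicate over the metadata of one experiment. This is only a sketch: the dictionary keys mirror the XML field names listed above, but the `passes_criteria` helper and the dict-based record are hypothetical, and `~` is read here as "contains", which is an assumption.

```python
def passes_criteria(meta: dict) -> bool:
    """Return True if one experiment record meets the three selection criteria."""
    strategy_ok = meta.get("LIBRARY_STRATEGY") in ("ChIP-seq", "DNase-Hypersensitivity")
    source_ok = meta.get("LIBRARY_SOURCE") == "GENOMIC"
    # '~' is interpreted as a substring match (assumption, not confirmed by the source).
    instrument_ok = "Illumina" in meta.get("INSTRUMENT_MODEL", "")
    return strategy_ok and source_ok and instrument_ok


# Hypothetical record, for illustration only.
record = {
    "LIBRARY_STRATEGY": "ChIP-seq",
    "LIBRARY_SOURCE": "GENOMIC",
    "INSTRUMENT_MODEL": "Illumina HiSeq 2000",
}
print(passes_criteria(record))
```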

2. Primary processing

Introduction

Raw sequence data for the SRXs selected above were aligned to reference genomes with Bowtie 2, then processed into coverage tracks in BigWig format and peak calls in BED format.

Methods

  1. Binary raw sequence data (.sra) for each SRX were downloaded and decoded into Fastq format with the fastq-dump command of SRA Toolkit (ver 2.3.2-4) in default mode, except for paired-end reads, which were decoded with the --split-files option. For an SRX comprising multiple runs, the decoded Fastq files were concatenated into a single file.

  2. Fastq files were then aligned with Bowtie 2 (ver 2.2.2) in default mode, except for paired-end reads, for which the two Fastq files were specified with the -1 and -2 options. The following genome assemblies were used for the alignment and subsequent processing:

    • hg19 (H. sapiens)
    • mm9 (M. musculus)
    • rn6 (R. norvegicus)
    • dm3 (D. melanogaster)
    • ce10 (C. elegans)
    • sacCer3 (S. cerevisiae)
  3. Resultant SAM-formatted files were then binarized into BAM format with SAMtools (ver 0.1.19; samtools view) and sorted (samtools sort) before removing PCR duplicates (samtools rmdup).

  4. BedGraph-formatted coverage scores were calculated with bedtools (ver 2.17.0; genomeCoverageBed) in RPM (Reads Per Million mapped reads) units with the -scale 1000000/N option, where N is the number of mapped reads after removing PCR duplicates in step 3.

  5. BedGraph files were binarized into BigWig format with the UCSC bedGraphToBigWig tool (ver 4). Peaks were called from the BAM files made in step 3 with MACS2 (ver 2.1.0; macs2 callpeak) and saved in BED4 format. The Q-value threshold was set with -q 1e-05, 1e-10, or 1e-20, and the genome size options were as follows:

    • hg19: -g hs
    • mm9: -g mm
    • rn6: -g 2.15e9
    • dm3: -g dm
    • ce10: -g ce
    • sacCer3: -g 12100000

    Each row in the BED4 files includes the genomic location in the 1st to 3rd columns and MACS2 score (-10*Log10[MACS2 Q-value]) in the 4th column.

  6. BED4 files were binarized into BigBed format with UCSC bedToBigBed tool (ver 2.5).
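
The two numeric conventions used in steps 4 and 5 (the RPM scale factor 1000000/N and the MACS2 score -10*Log10[Q-value]) can be sketched as follows. The function names are hypothetical and chosen only for this example; the arithmetic follows the formulas stated above.

```python
import math


def rpm_scale(mapped_reads: int) -> float:
    """Scale factor for the -scale option: 1,000,000 / N."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return 1_000_000 / mapped_reads


def macs2_score(q_value: float) -> float:
    """Score stored in column 4 of the BED4 files: -10 * log10(Q-value)."""
    return -10 * math.log10(q_value)


# The three Q-value thresholds expressed on the score scale.
for q in (1e-05, 1e-10, 1e-20):
    print(q, round(macs2_score(q)))
```

On this scale, the three Q-value thresholds correspond to scores of 50, 100, and 200.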

3. Data Annotation

Introduction

Experimental materials used for each SRX were manually annotated to allow for extracting data via keywords for antigens and cell types.

Methods

  1. Sample metadata for all SRXs (biosample_set.xml) were downloaded from the NCBI FTP site to extract the attributes for antigens and antibodies (see here) as well as cell types and tissues (see here).

  2. Based on the attribute values ascribed to each SRX, the antigens and cell types used were manually annotated by curators fully trained in molecular and developmental biology. Each annotation has a ‘Class’ and ‘Subclass’ as shown in antigenList.tab (Download, Table schema) and celltypeList.tab (Download, Table schema).

  3. Guidelines for antigens annotation:

    • Histones
      Based on Brno nomenclature (PMID: 15702071).
      (e.g., H3K4me3, H3K27ac)
    • Gene-encoded proteins
      • Gene symbols were recorded according to the following gene nomenclature databases:

        • HGNC (H. sapiens)
        • MGI (M. musculus)
        • RGD (R. norvegicus)
        • FlyBase (D. melanogaster)
        • WormBase (C. elegans)
        • SGD (S. cerevisiae)
          (e.g., OCT3/4 → POU5F1; p53 → TP53)
      • Modifications such as phosphorylation were ignored.
        (e.g., phospho-SMAD3 → SMAD3)

      • If an antibody recognizes multiple molecules in a family, the first in an ascending order was chosen.
        (e.g., Anti-SMAD2/3 antibody → SMAD2)

  4. Criteria for cell types annotation:

    • H. sapiens, M. musculus and R. norvegicus
      Cell types were mainly classified by their tissue of origin. As an exception, ES and iPS cells were classified in the ‘Pluripotent stem cell’ class.

      Cell-type class          Cell type
      Blood                    K-562; CD4-Positive T-Lymphocytes
      Breast                   MCF-7; T-47D
      Pluripotent stem cell    hESC H1; iPS cells

    • D. melanogaster
      Cell types were mainly classified by cell lines and developmental stages.
    • C. elegans
      Mainly classified by developmental stages.
    • S. cerevisiae
      Classified by yeast strains.
    • Standardized Nomenclatures
      Nomenclatures of cell lines and tissue names were standardized according to the following frameworks and resources:
      • Supplementary Table S2 in Yu et al. 2015 (PMID: 25877200), proposing unified cell-line names
      • ATCC, a nonprofit repository of cell lines
      • MeSH (Medical Subject Headings) for tissue names
      • FlyBase for cell lines of D. melanogaster
        (e.g., MDA-231, MDA231, MDAMB231 → MDA-MB-231)
  5. Antigens or cell types were classified in the ‘Uncategorized’ class if the curators could not interpret the attribute values.

  6. Antigens or cell types were classified in ‘No description’ class if there was no attribute value.

4. Peak Browser

ChIP-Atlas Peak Browser allows users to search for proteins bound to given genomic loci on the genome browser IGV. This is useful for predicting cis-regulatory elements, as well as for finding regulatory proteins and the epigenetic status of given regions. BED4-formatted peak-call data from section 2.5 were concatenated and converted to BED9 + GFF3 format for browsing in IGV. The BED9 files can be downloaded from the Peak Browser web site, and the table schema is as follows:

Column       Description                  Example
Header       Track name and link URL      (Strings)
Column 1     Chromosome                   chr12
Column 2     Begin                        1234
Column 3     End                          5678
Column 4*    Sample metadata              (Strings)
Column 5     -10*Log10(MACS2 Q-value)     345
Column 6     .                            .
Column 7     Begin (= Column 2)           1234
Column 8     End (= Column 3)             5678
Column 9**   Color code                   255,61,0

  • *Column 4
    Sample metadata described in GFF3 format to show annotated antigens and cell types on IGV. Furthermore, mousing over a peak displays accession number, title, and all attribute values described in Biosample metadata for the SRX.
  • **Column 9
    Heatmap color codes for Column 5.
    (If Column 5 is 0, 500, or 1000, then colors are blue, green, or red, respectively.)
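
A minimal sketch of reading one data row of such a BED9 file is given below. It assumes that Column 4 is a GFF3-style attribute string (`key=value` pairs separated by `;`, with percent-encoding); the `Bed9Peak` class, the `parse_bed9_line` helper, and the attribute keys in the sample line are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Bed9Peak:
    chrom: str
    start: int
    end: int
    metadata: dict   # Column 4, decoded from the GFF3-style attribute string
    score: int       # Column 5, -10*Log10(MACS2 Q-value)
    color: tuple     # Column 9, (R, G, B)


def parse_bed9_line(line: str) -> Bed9Peak:
    """Parse one tab-delimited BED9 data row (not the track header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    chrom, start, end, meta, score, _dot, _begin, _end, rgb = fields
    attrs = {}
    for pair in meta.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)
    color = tuple(int(c) for c in rgb.split(","))
    return Bed9Peak(chrom, int(start), int(end), attrs, int(score), color)


# Sample row built from the example values in the table; the metadata keys are made up.
sample = "chr12\t1234\t5678\tID=SRX097088;Name=GATA2%20(K-562)\t345\t.\t1234\t5678\t255,61,0"
print(parse_bed9_line(sample))
```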

To find the URLs of the BED9 files, see the “Assembled Peak-call data used in ‘Peak Browser’” section of 8. Downloads.

5. Target Genes

Introduction

The ChIP-Atlas Target Genes feature predicts genes directly regulated by given proteins, based on binding profiles of all public ChIP-seq data for particular gene loci. Target genes were accepted if the peak-call intervals of a given protein overlapped with a transcription start site (TSS) ± N kb (N = 1, 5, or 10).

Methods

  1. Peak-call data:
    BED4-formatted peak-call data of each SRX made in section 2.5 were used (MACS2 Q-value < 1E-05; antigen class = ‘TFs and others’).
  2. Preparation of TSS library:
    TSS locations and gene symbols were taken from refFlat files (at the UCSC FTP site); only protein-coding genes were used for this analysis.
  3. Preparation of STRING library:
    STRING is a comprehensive database recording protein-protein and protein-gene interactions based on experimental evidence. A file describing all interactions was downloaded from protein.actions.v10.txt.gz, and the protein IDs were converted to gene symbols with protein.aliases.v10.txt.gz.
  4. Processing:
    The bedtools window command (bedtools ver 2.17.0) was used to search for target genes by comparing the peak-call data (5.1) against the TSS library (5.2) with a window size option (-w 1000, 5000, or 10000). Peak-call data of the same antigens were collected, and MACS2 scores (-10*Log10[MACS2 Q-value]) were indicated as heatmap colors on the web browser (MACS2 score = 0, 500, 1000 → color = blue, green, red) (see example). If a gene intersected with multiple peaks of a single SRX, the highest MACS2 score was chosen for the color indication. The ‘Average’ column at the far left of the table shows the mean of the MACS2 scores in the same row. The ‘STRING’ column on the far right indicates the STRING scores for the protein-gene interaction according to the STRING library (5.3). In more detail, protein-gene pairs in the protein.actions.v10.txt.gz file were extracted when they met the following conditions:
    • 1st column (item_id_a) == Query antigen
    • 2nd column (item_id_b) == Target gene
    • 3rd column (mode) == "expression"
    • 5th column (a_is_acting) == "1"
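
The window search and the "highest score per gene" rule described in 5.4 can be sketched in plain Python. This is an approximation of what bedtools window reports, not the actual pipeline: it treats each TSS as a single 0-based position, uses half-open peak intervals, and the function name and tuple layouts are hypothetical.

```python
from collections import defaultdict


def target_gene_scores(peaks, tss_sites, window_bp):
    """Return {gene: highest MACS2 score} for peaks of one SRX within TSS ± window_bp.

    peaks:     iterable of (chrom, start, end, score), half-open intervals
    tss_sites: iterable of (chrom, position, gene_symbol)
    """
    tss_by_chrom = defaultdict(list)
    for chrom, pos, gene in tss_sites:
        tss_by_chrom[chrom].append((pos, gene))

    best = {}
    for chrom, start, end, score in peaks:
        for pos, gene in tss_by_chrom.get(chrom, []):
            # Peak [start, end) overlaps the window [pos - w, pos + 1 + w).
            if start < pos + 1 + window_bp and end > pos - window_bp:
                best[gene] = max(score, best.get(gene, score))
    return best


# Made-up coordinates and gene names, for illustration only.
peaks = [("chr1", 900, 1100, 120), ("chr1", 5000, 5200, 300)]
tss = [("chr1", 1500, "GENE_A"), ("chr1", 9000, "GENE_B")]
print(target_gene_scores(peaks, tss, 1000))
print(target_gene_scores(peaks, tss, 5000))
```

A nested loop is used for readability; for genome-scale data an interval index or bedtools itself would be the practical choice.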

6. Colocalization

Introduction

Many transcription factors (TFs) form complexes to promote or enhance transcriptional activity (e.g., Pou5f1, Nanog, and Sox2 in mouse ES cells). ChIP-seq profiles of such TFs are often similar, showing colocalization on multiple genomic regions. The ChIP-Atlas Colocalization predicts colocalization partners of given TFs, evaluated through comprehensive and combinatorial similarity analyses of all public ChIP-seq data.

Algorithms

BED4-formatted peak-call data made in section 2.5 were analyzed to evaluate the similarities to other peak-call data in identical cell-type classes. Their similarities were analyzed with CoLo, a tool to evaluate the colocalization of transcription factors (TFs) with multiple ChIP-seq peak-call data. Advantages of CoLo are:

(a) it compensates for biases derived from different experimental conditions.
(b) it adjusts the difference of the peak numbers and distributions coming from innate characteristics of the TFs.

Function (a) is programmed so that the MACS2 scores in each BED4 file are fitted to a Gaussian distribution, and the peaks of each BED4 file are divided into three groups:

  • H (High binding; Z-score > 0.5)
  • M (Middle binding; -0.5 ≤ Z-score ≤ 0.5)
  • L (Low binding; Z-score < -0.5)

These three groups are used as independent data to evaluate similarity through the function (b). Thus, CoLo evaluates the similarity of two SRXs (e.g., SRX_1 and SRX_2) with nine combinations:

[H/M/L of SRX_1] x [H/M/L of SRX_2]

Eventually, a set of nine Boolean results (similar or not) is returned to indicate the similarity of SRX_1 and SRX_2.
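
The H/M/L grouping can be illustrated with a short sketch. It assumes that "fitting to a Gaussian distribution" amounts to standardizing the scores with their mean and population standard deviation; the fitting actually performed by CoLo may differ, and the `split_by_zscore` helper is hypothetical.

```python
import statistics


def split_by_zscore(scores):
    """Split MACS2 scores of one BED4 file into H / M / L groups by Z-score."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    groups = {"H": [], "M": [], "L": []}
    for s in scores:
        z = 0.0 if sd == 0 else (s - mean) / sd
        if z > 0.5:
            groups["H"].append(s)      # High binding
        elif z < -0.5:
            groups["L"].append(s)      # Low binding
        else:
            groups["M"].append(s)      # Middle binding
    return groups


print(split_by_zscore([10, 20, 30, 40, 50]))
```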

Methods

  1. Peak-call data: Same as (5.1).
  2. STRING library: Same as (5.3).
  3. Processing:
    Peak-call data in identical cell-type classes were processed through CoLo. The score between two BED files was calculated by multiplying the values assigned to their groups, H (= 3), M (= 2), or L (= 1), as follows:

    SRX_1   SRX_2   Score
    H       H       9
    H       M       6
    H       L       3
    M       H       6
    M       M       4
    M       L       2
    L       H       3
    L       M       2
    L       L       1

If multiple H/M/L combinations were returned from SRX_1 and SRX_2, the highest score was adopted. The scores (1 to 9) were colored from blue through green to red, or gray if all nine H/M/L combinations were false (see example). The ‘Average’ column on the far left of the table shows the mean of the CoLo scores in the same row. The ‘STRING’ column on the far right indicates the STRING scores for the protein-protein interaction (6.2). In more detail, protein-protein pairs in the protein.actions.v10.txt.gz file were extracted if they met all the following conditions:

  • 1st column (item_id_a) == query antigen
  • 2nd column (item_id_b) == co-association partner
  • 3rd column (mode) == "binding"
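
The CoLo scoring rule from 6.3 (H = 3, M = 2, L = 1, multiplied, with the highest score adopted) can be sketched as below. The `colo_score` helper and its input format, a set of group pairs that CoLo judged similar, are hypothetical; returning `None` stands in for the gray "all false" case.

```python
WEIGHT = {"H": 3, "M": 2, "L": 1}


def colo_score(similar_pairs):
    """Highest product of group weights among the similar (SRX_1, SRX_2) group pairs.

    similar_pairs: iterable of (group_of_SRX_1, group_of_SRX_2), e.g. {("H", "M")}
    Returns None if no combination was judged similar.
    """
    products = [WEIGHT[a] * WEIGHT[b] for a, b in similar_pairs]
    return max(products) if products else None


print(colo_score({("H", "M"), ("L", "L")}))
print(colo_score(set()))
```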

7. Enrichment Analysis

Introduction

ChIP-Atlas Enrichment Analysis accepts users’ data in the following three formats:

  • Genomic regions in BED format (to search proteins bound to the regions)
  • Sequence motif (to search proteins bound to the motif)
  • Gene list (to search proteins bound to the genes)

In addition, the following analyses are possible by specifying the data for comparison on the submission form of Enrichment Analysis:

Data in panel 4   Data in panel 5       Aims and analyses
BED               Random permutation    Proteins bound to BED intervals more often than by chance.
BED               BED                   Proteins differentially bound between the two sets of BED intervals.
Motif             Random permutation    Proteins bound to a sequence motif more often than by chance.
Motif             Motif                 Proteins differentially bound between the two motifs.
Genes             RefSeq coding genes   Proteins bound to genes more often than other RefSeq genes.
Genes             Genes                 Proteins differentially bound between the two sets of gene lists.

Requirements and acceptable data

  • Reference peak-call data (upper panels (1 to 3) of the submission form):
    Comprehensive peak-call data as described above (4. Peak browser). The result will be returned more quickly if the classes of antigens and cell-types are specified.

  • BED (lower panels (4 and 5) of the submission form):
    UCSC BED format, minimally requiring three tab-delimited columns describing chromosome, and starting and ending positions.

    chr1<tab>1435385<tab>1436458
    chrX<tab>4634643<tab>4635798

    A header and column 4 or later can be included, but they are ignored in the analysis. Note that only BED files in the following genome assemblies are accepted:

    • hg19 (H. sapiens)
    • mm9 (M. musculus)
    • rn6 (R. norvegicus)
    • dm3 (D. melanogaster)
    • ce10 (C. elegans)
    • sacCer3 (S. cerevisiae)

    If the BED file is in another genome assembly, convert it to a suitable one with the UCSC liftOver tool.

  • Motif (lower panels (4 and 5) of the submission form):
    A sequence motif described in IUPAC nucleic acid notation. In addition to normal codes (ATGC), ambiguity codes are also acceptable (WSMKRYBDHVN).

  • Gene list (lower panels (4 and 5) of the submission form):
    Gene symbols must be entered according to the following nomenclatures:

    (e.g., OCT3/4 → POU5F1; p53 → TP53)

    If the gene list is described in any other format (e.g., gene IDs in RefSeq or Ensembl format), use a batch conversion tool such as DAVID (convert into OFFICIAL_GENE_SYMBOL with the Gene ID Conversion Tool).
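
To illustrate the IUPAC notation accepted for motifs, the sketch below expands the standard ambiguity codes into a regular expression and reports perfect matches on the forward strand of a sequence. It is only meant to show how the codes are read: ChIP-Atlas itself searches the genome with Bowtie, and the helper names here are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "M": "[AC]", "K": "[GT]",
    "R": "[AG]", "Y": "[CT]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif such as 'GATAR' into a regex pattern string."""
    try:
        return "".join(IUPAC[base] for base in motif.upper())
    except KeyError as exc:
        raise ValueError(f"unsupported IUPAC code: {exc.args[0]}") from None


def find_motif(sequence: str, motif: str):
    """0-based start positions of all (possibly overlapping) forward-strand matches."""
    pattern = re.compile(f"(?=({iupac_to_regex(motif)}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


print(iupac_to_regex("GATAR"))
print(find_motif("AGATAAGGATAG", "GATAR"))
```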

Methods

  1. Submitted data are converted to BED files depending on the data types.
  • BED
    Submitted BED files are used as they are for further processing. If ‘Random permutation’ is set for the comparison, the submitted BED intervals are relocated to a random position on a random chromosome the specified number of times with the bedtools shuffle command (bedtools; ver 2.17.0).

  • Motif
    Genomic locations perfectly matching the submitted sequence are searched for with Bowtie (ver 0.12.8) and converted to BED format. If ‘Random permutation’ is set for the comparison, the BED is used for random permutation as described above.

  • Gene list
    Unique TSSs of submitted genes are defined with xxxCanonical.txt.gz* library distributed from UCSC FTP site.
    * xxx is a placeholder for:

    • “known” (H. sapiens and M. musculus)
    • “flyBase” (D. melanogaster)
    • “sanger” (C. elegans)
    • “sgd” (S. cerevisiae)

    Unique TSSs of R. norvegicus genes are defined with a gene list distributed from RGD.

    The locations of TSSs are converted to BED format with the addition of widths specified in ‘Distance range from TSS’ on the submission form. If ‘RefSeq coding gene’ is set for the comparison, RefSeq coding genes excluding those in submitted list are processed to BED format as mentioned above.

  2. The overlaps between the BED (originating from panels 4 and 5 of the submission form) and the reference peak-call data (specified in upper panels 1 to 3 of the submission form) are counted with the bedtools intersect command (BedTools2; ver 2.23.0).
  3. P-values are calculated with the two-tailed Fisher’s exact probability test (see example). The null hypothesis is that the reference peaks intersect the submitted data in panel 4 in the same proportion as the data in panel 5 of the submission form. Q-values are calculated with the Benjamini & Hochberg method.
  4. Fold enrichment is calculated as (column 6) / (column 7) of the same row. If the ratio > 1, the rightmost column is ‘TRUE’, meaning that the proteins in column 3 bind to the data of panel 4 in a greater proportion than to those of panel 5 specified in the submission form.
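
The statistics named in the steps above can be sketched with the standard library alone. The code below is an independent illustration, not the ChIP-Atlas implementation: the 2 x 2 layout (a, b = overlapping and non-overlapping items from panel 4; c, d = the same for panel 5) and the reading of columns 6 and 7 as the two overlap fractions are assumptions, and all function names are hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, col1, col2 = a + b, a + c, b + d
    total = a + b + c + d

    def weight(x):
        # Unnormalized hypergeometric probability of seeing x in the top-left cell.
        return comb(col1, x) * comb(col2, row1 - x)

    observed = weight(a)
    lo, hi = max(0, row1 - col2), min(row1, col1)
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / comb(total, row1)


def benjamini_hochberg(p_values):
    """Benjamini & Hochberg adjusted P-values (Q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def fold_enrichment(a, b, c, d):
    """Overlap fraction of panel 4 divided by that of panel 5 (assumed meaning)."""
    frac4, frac5 = a / (a + b), c / (c + d)
    return float("inf") if frac5 == 0 else frac4 / frac5


# Made-up counts: 8 of 10 intervals overlap in panel 4, 1 of 6 in panel 5.
print(fisher_exact_two_tailed(8, 2, 1, 5))
print(fold_enrichment(8, 2, 1, 5))
```

Integer binomial coefficients are compared directly so that ties with the observed table are handled exactly; for very large counts a log-space implementation would be faster.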

8. Downloads

Data for each SRX

All ChIP-seq experiments recorded in ChIP-Atlas are described in experimentList.tab (Download, Table schema)

  • BigWig
    Download URL:
    http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/eachData/bw/Experimental_ID.bw

    Example:
    http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/eachData/bw/SRX097088.bw

  • Peak-call (BED)
    Download URL:
    http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/eachData/bedThreshold/Experimental_ID.Threshold.bed
    (Threshold = 05, 10, or 20)

    Example:
    http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/eachData/bed05/SRX097088.05.bed
    (Peak-call data of SRX097088 with Q-value < 1E-05.)

  • Peak-call (BigBed)
    Download URL:
    http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/eachData/bbThreshold/Experimental_ID.Threshold.bb
    (Threshold = 05, 10, or 20)

    Example:
    http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/eachData/bb05/SRX097088.05.bb
    (Peak-call data of SRX097088 with Q-value < 1E-05.)
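
The URL patterns above can be assembled programmatically. The sketch below only does string formatting based on the templates and examples in this section; the helper names are hypothetical, and it does not check that a given file exists on the server.

```python
BASE = "http://dbarchive.biosciencedbc.jp/kyushu-u"
THRESHOLDS = ("05", "10", "20")


def bigwig_url(genome: str, experiment_id: str) -> str:
    """URL of the BigWig coverage file for one SRX."""
    return f"{BASE}/{genome}/eachData/bw/{experiment_id}.bw"


def peak_url(genome: str, experiment_id: str, threshold: str, fmt: str = "bed") -> str:
    """URL of the peak-call file; fmt is 'bed' (BED) or 'bb' (BigBed)."""
    if threshold not in THRESHOLDS:
        raise ValueError(f"threshold must be one of {THRESHOLDS}")
    if fmt not in ("bed", "bb"):
        raise ValueError("fmt must be 'bed' or 'bb'")
    return f"{BASE}/{genome}/eachData/{fmt}{threshold}/{experiment_id}.{threshold}.{fmt}"


print(bigwig_url("hg19", "SRX097088"))
print(peak_url("hg19", "SRX097088", "05"))
print(peak_url("hg19", "SRX097088", "05", fmt="bb"))
```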


Assembled Peak-call data used in “Peak Browser”

Download URL:
http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/assembled/File_name.bed
(Genome and File_name are listed in fileList.tab [Download, Table schema])

Example:
http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/assembled/Oth.ALL.05.GATA2.AllCell.bed
(All peak-call data of GATA2 in all cell types with Q-value < 1E-05.)

Note:
As the assembled peak-call data used in “Peak Browser” are very large, we recommend downloading the lighter version of all peak-call data (see the URLs and table schema below) and joining the SRXs with the sample metadata described in experimentList.tab (Download, Table schema) on a command-line interface.


  • Table schema of the lighter version of all peak-call data:

    Column     Description                Example
    Column 1   Chromosome                 chr12
    Column 2   Begin                      1234
    Column 3   End                        5678
    Column 4   SRX                        SRX344646
    Column 5   -10*Log10(MACS2 Q-value)   345
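
The join recommended in the note above can also be done in a few lines of Python instead of shell tools. The sketch assumes that both files are tab-delimited without header lines and that experimentList.tab follows the column order given in its table schema (ID, genome assembly, antigen class, antigen, cell type class, cell type); the function names are hypothetical, and in-memory strings stand in for the downloaded files.

```python
import csv
import io


def load_experiments(handle):
    """Map each experimental ID to its annotation (first six columns of experimentList.tab)."""
    table = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 6:
            table[row[0]] = {
                "genome": row[1], "antigen_class": row[2], "antigen": row[3],
                "cell_type_class": row[4], "cell_type": row[5],
            }
    return table


def annotate_peaks(peaks_handle, experiments):
    """Yield peaks of the lighter BED file with antigen and cell type appended."""
    for row in csv.reader(peaks_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        chrom, start, end, srx, score = row[:5]
        meta = experiments.get(srx, {})
        yield (chrom, int(start), int(end), srx, int(score),
               meta.get("antigen"), meta.get("cell_type"))


# Tiny in-memory stand-ins for the two downloaded files.
experiment_text = "SRX097088\thg19\tTFs and others\tGATA2\tBlood\tK-562\n"
peak_text = "chr12\t1234\t5678\tSRX097088\t345\n"
experiments = load_experiments(io.StringIO(experiment_text))
for peak in annotate_peaks(io.StringIO(peak_text), experiments):
    print(peak)
```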

Analyzed data used in “Target Genes”

Download URL:
http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/target/Protein.Distance.tsv
(Proteins are listed in analysisList.tab [Download, Table schema])
(Distance = 1, 5, or 10, indicating the distance [kb] from TSS.)

Example:
http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/target/POU5F1.5.tsv
(TSV file describing the genes bound by POU5F1 at TSS ± 5 kb.)


Analyzed data used in “Colocalization”

Download URL:
http://dbarchive.biosciencedbc.jp/kyushu-u/Genome/colo/Protein.Cell_type_class.tsv
(Protein and Cell_type_class are listed in analysisList.tab [Download, Table schema])

Example:
http://dbarchive.biosciencedbc.jp/kyushu-u/hg19/colo/POU5F1.Pluripotent_stem_cell.tsv
(TSV file describing the proteins colocalizing with POU5F1 in the ‘Pluripotent stem cell’ class.)
(Spaces in the name of cell type class must be replaced with underscores _.)
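
The URL templates for the assembled, Target Genes, and Colocalization files can be filled in the same way. This sketch is pure string formatting derived from the templates and examples above, including the replacement of spaces with underscores; the helper names are hypothetical.

```python
BASE = "http://dbarchive.biosciencedbc.jp/kyushu-u"


def assembled_url(genome: str, file_name: str) -> str:
    """Assembled peak-call data; file_name as listed in fileList.tab."""
    return f"{BASE}/{genome}/assembled/{file_name}.bed"


def target_url(genome: str, protein: str, distance_kb: int) -> str:
    """Target Genes result for TSS ± distance_kb (1, 5, or 10)."""
    if distance_kb not in (1, 5, 10):
        raise ValueError("distance_kb must be 1, 5, or 10")
    return f"{BASE}/{genome}/target/{protein}.{distance_kb}.tsv"


def colo_url(genome: str, protein: str, cell_type_class: str) -> str:
    """Colocalization result; spaces in the class name become underscores."""
    return f"{BASE}/{genome}/colo/{protein}.{cell_type_class.replace(' ', '_')}.tsv"


print(assembled_url("hg19", "Oth.ALL.05.GATA2.AllCell"))
print(target_url("hg19", "POU5F1", 5))
print(colo_url("hg19", "POU5F1", "Pluripotent stem cell"))
```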


Tables summarizing metadata and files

  • experimentList.tab (Download)
    All ChIP-seq experiments recorded in ChIP-Atlas.
    Column   Description                       Example
    1        Experimental ID (SRX, ERX, DRX)   SRX097088
    2        Genome assembly                   hg19
    3        Antigen class                     TFs and others
    4        Antigen                           GATA2
    5        Cell type class                   Blood
    6        Cell type                         K-562
    7        Cell type description             Primary Tissue=Blood|Tissue Diagnosis=Leukemia Chronic Myelogenous
    8        Processing logs*                  30180878,82.3,42.1,6691
    9        Title                             GSM722415: GATA2 K562bmp r1 110325 3
    10-      Metadata submitted by authors     source_name=GATA2 ChIP-seq K562 BMP
                                               cell line=K562
                                               chip antibody=GATA2
                                               antibody catalog number=Santa Cruz SC-9008
    * # of reads, % mapped, % duplicates, # of peaks [Q < 1E-05]

  • fileList.tab (Download)
    All assembled peak-call data used in Peak Browser.
    Column   Description                 Example
    1        File name                   Oth.ALL.05.GATA2.AllCell
    2        Genome assembly             hg19
    3        Antigen class               TFs and others
    4        Antigen                     GATA2
    5        Cell type class             All cell types
    6        Cell type                   -
    7        Threshold                   05 (indicating Q-value < 1E-05)
    8        Experimental IDs included   SRX070877,SRX150427,SRX092303,SRX070876,SRX150668,...

  • analysisList.tab (Download)
    All proteins shown in “Target Genes” and “Colocalization”.
    Column   Description                               Example
    1        Antigen                                   POU5F1
    2        Cell type class in Colocalization         Epidermis,Pluripotent stem cell
    3        Recorded (+) or not (-) in Target Genes   +
    4        Genome assembly                           hg19

  • antigenList.tab (Download)
    All antigens recorded in ChIP-Atlas.
    Column   Description                 Example
    1        Genome assembly             hg19
    2        Antigen class               TFs and others
    3        Antigen                     POU5F1
    4        Number of experiments       24
    5        Experimental IDs included   SRX011571,SRX011572,SRX017276,SRX021069,SRX021070,...


  • celltypeList.tab (Download)
    All cell types recorded in ChIP-Atlas.
    Column   Description                 Example
    1        Genome assembly             hg19
    2        Cell type class             Prostate
    3        Cell type                   VCaP
    4        Number of experiments       185
    5        Experimental IDs included   SRX020917,SRX020918,SRX020919,SRX020920,SRX020921,...
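
As a small example of working with these tables, one row of experimentList.tab can be unpacked as follows. The sketch assumes that the row is tab-delimited, that column 8 holds four comma-separated numbers in the order given in the schema, and that columns 10 onward are separate `key=value` fields; the last point is an assumption drawn from the example values, and the `parse_experiment_row` helper is hypothetical.

```python
def parse_experiment_row(line: str) -> dict:
    """Parse one tab-delimited row of experimentList.tab into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(cols)}")
    reads, pct_mapped, pct_duplicates, peaks = cols[7].split(",")
    # Columns 10 onward: author-submitted metadata, assumed to be key=value fields.
    metadata = dict(item.split("=", 1) for item in cols[9:] if "=" in item)
    return {
        "experiment_id": cols[0],
        "genome": cols[1],
        "antigen_class": cols[2],
        "antigen": cols[3],
        "cell_type_class": cols[4],
        "cell_type": cols[5],
        "cell_type_description": cols[6],
        "reads": int(reads),
        "pct_mapped": float(pct_mapped),
        "pct_duplicates": float(pct_duplicates),
        "peaks_q05": int(peaks),
        "title": cols[8],
        "metadata": metadata,
    }


# Row assembled from the example values in the experimentList.tab schema.
row = "\t".join([
    "SRX097088", "hg19", "TFs and others", "GATA2", "Blood", "K-562",
    "Primary Tissue=Blood|Tissue Diagnosis=Leukemia Chronic Myelogenous",
    "30180878,82.3,42.1,6691", "GSM722415: GATA2 K562bmp r1 110325 3",
    "source_name=GATA2 ChIP-seq K562 BMP", "cell line=K562",
])
print(parse_experiment_row(row))
```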

9. External Genome Browser

BigBed and BigWig files in the ChIP-Atlas database can now be browsed on the UCSC Genome Browser. Use the links below to jump to the UCSC Genome Browser.
