Skip to content
/ CTCF Public

Genomic coordinates of FIMO-predicted CTCF binding sites using JASPAR and other PWMs, human and mouse genome assemblies including mm39 and T2T. Also included experimentally derived ENCODE SCREEN CTCF-bound CREs.

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

mdozmorov/CTCF

Repository files navigation

CTCF

Lifecycle: stable

CTCF defines an AnnotationHub resource representing genomic coordinates of FIMO-predicted CTCF binding sites for human and mouse genomes, including the Telomere-to-Telomere and mm39 genome assemblies. It also includes experimentally defined CTCF-bound cis-regulatory elements from ENCODE SCREEN.

TL;DR - for human hg38 genome assembly, use hg38.MA0139.1.RData (“AH104729”). For mouse mm10 genome assembly, use mm10.MA0139.1.RData (“AH104755”). For ENCODE SCREEN data, use hg38.SCREEN.GRCh38_CTCF.RData (“AH104730”) or mm10.SCREEN.mm10_CTCF.RData (“AH104756”) objects.

The CTCF GRanges are named as <assembly>.<Database>. The FIMO-predicted data includes extra columns with motif name, score, p-value, q-value, and the motif sequence.

Installation instructions

Install the latest release of R, then get the latest version of Bioconductor by starting R and entering the commands:

# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(version = "3.16")

Then, install additional packages using the following code:

# BiocManager::install("AnnotationHub", update = FALSE) 
# BiocManager::install("GenomicRanges", update = FALSE)
# BiocManager::install("plyranges", update = FALSE)

Example

suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
query_data <- subset(ah, preparerclass == "CTCF")
# Explore the AnnotationHub object
query_data
#> AnnotationHub with 51 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022, CTCFBSDB 2.0, SwissRegulon, Jolma 2013, HOCOMO...
#> # $species: Homo sapiens, Mus musculus
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH104716"]]' 
#> 
#>              title                                                 
#>   AH104716 | T2T.CIS_BP_2.00_Homo_sapiens.RData                    
#>   AH104717 | T2T.CTCFBSDB_PWM.RData                                
#>   AH104718 | T2T.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData     
#>   AH104719 | T2T.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104720 | T2T.Jolma2013.RData                                   
#>   ...        ...                                                   
#>   AH104762 | mm9.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104763 | mm9.Jolma2013.RData                                   
#>   AH104764 | mm9.MA0139.1.RData                                    
#>   AH104765 | mm9.SwissRegulon_human_and_mouse.RData                
#>   AH104766 | mm8.CTCFBSDB.CTCF_predicted_mouse.RData
# Get the list of data providers
query_data$dataprovider %>% table()
#> .
#>           CIS-BP     CTCFBSDB 2.0 ENCODE SCREEN v3     HOCOMOCO v11 
#>                6               12                2                6 
#>      JASPAR 2022       Jolma 2013     SwissRegulon 
#>               13                6                6

We can find CTCF sites identified using JASPAR 2022 database in hg38 human genome

subset(query_data, species == "Homo sapiens" & 
                   genome == "hg38" & 
                   dataprovider == "JASPAR 2022")
#> AnnotationHub with 2 records
#> # snapshotDate(): 2024-04-30
#> # $dataprovider: JASPAR 2022
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH104727"]]' 
#> 
#>              title                                                  
#>   AH104727 | hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData
#>   AH104729 | hg38.MA0139.1.RData
# Same for mm10 mouse genome
# subset(query_data, species == "Mus musculus" & genome == "mm10" & dataprovider == "JASPAR 2022")

The hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2 object contains CTCF sites detected using the all three CTCF PWMs. To retrieve, we’ll use:

# hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2
CTCF_hg38_all <- query_data[["AH104727"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("GenomicRanges")
#> Warning: package 'S4Vectors' was built under R version 4.4.1
#> Warning: package 'IRanges' was built under R version 4.4.1
CTCF_hg38_all
#> GRanges object with 3093041 ranges and 5 metadata columns:
#>             seqnames            ranges strand |                   name
#>                <Rle>         <IRanges>  <Rle> |            <character>
#>         [1]     chr1       11212-11246      + | JASPAR2022_CORE_vert..
#>         [2]     chr1       11399-11432      + | JASPAR2022_CORE_vert..
#>         [3]     chr1       11414-11432      + | JASPAR2022_CORE_vert..
#>         [4]     chr1       12373-12406      + | JASPAR2022_CORE_vert..
#>         [5]     chr1       13507-13541      + | JASPAR2022_CORE_vert..
#>         ...      ...               ...    ... .                    ...
#>   [3093037]     chrY 57215115-57215148      - | JASPAR2022_CORE_vert..
#>   [3093038]     chrY 57215146-57215164      - | JASPAR2022_CORE_vert..
#>   [3093039]     chrY 57215146-57215179      - | JASPAR2022_CORE_vert..
#>   [3093040]     chrY 57215332-57215366      - | JASPAR2022_CORE_vert..
#>   [3093041]     chrY 57216319-57216352      - | JASPAR2022_CORE_vert..
#>                 score    pvalue    qvalue               sequence
#>             <numeric> <numeric> <numeric>            <character>
#>         [1]   7.77064  5.25e-05     0.459 gtgctgtgccagggcgcccc..
#>         [2]  18.48780  2.54e-07     0.118 cagcacgcccacctgctggc..
#>         [3]   9.11475  5.65e-05     0.555    ctggcagctggggacactg
#>         [4]   9.21951  5.25e-05     0.421 CAGCAGGTCTGGCTTTGGCC..
#>         [5]   9.71560  2.24e-05     0.397 GTGCCCTTCCTTTGCTCTGC..
#>         ...       ...       ...       ...                    ...
#>   [3093037]   8.06504  9.11e-05     0.614 CTGCTGGGCCCTCTTGCTCC..
#>   [3093038]   9.11475  5.65e-05     0.726    CTGGCAGCTGGGGACACTG
#>   [3093039]  17.91870  3.72e-07     0.246 CAGCACGCCCGCCTGCTGGC..
#>   [3093040]   7.77064  5.25e-05     0.584 GTGCTGTGCCAGGGCGCCCC..
#>   [3093041]   8.63415  6.96e-05     0.595 CTGCATTTGCGTTCCGACGC..
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

The hg38.MA0139.1 object contains CTCF sites detected using the most popular MA0139.1 CTCF PWM. To retrieve:

# hg38.MA0139.1
CTCF_hg38 <- query_data[["AH104729"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
CTCF_hg38
#> GRanges object with 887980 ranges and 5 metadata columns:
#>            seqnames            ranges strand |        name     score    pvalue
#>               <Rle>         <IRanges>  <Rle> | <character> <numeric> <numeric>
#>        [1]     chr1       11414-11432      + |    MA0139.1   9.11475  5.65e-05
#>        [2]     chr1       14316-14334      + |    MA0139.1   7.83607  9.71e-05
#>        [3]     chr1       15439-15457      + |    MA0139.1   8.00000  9.08e-05
#>        [4]     chr1       16603-16621      + |    MA0139.1   8.04918  8.89e-05
#>        [5]     chr1       16651-16669      + |    MA0139.1  11.42620  1.97e-05
#>        ...      ...               ...    ... .         ...       ...       ...
#>   [887976]     chrY 57209918-57209936      - |    MA0139.1  11.42620  1.97e-05
#>   [887977]     chrY 57209966-57209984      - |    MA0139.1   8.04918  8.89e-05
#>   [887978]     chrY 57211133-57211151      - |    MA0139.1   8.00000  9.08e-05
#>   [887979]     chrY 57212256-57212274      - |    MA0139.1   7.83607  9.71e-05
#>   [887980]     chrY 57215146-57215164      - |    MA0139.1   9.11475  5.65e-05
#>               qvalue            sequence
#>            <numeric>         <character>
#>        [1]     0.555 ctggcagctggggacactg
#>        [2]     0.601 GGACCAACAGGGGCAGGAG
#>        [3]     0.599 TAGCCTCCAGAGGCCTCAG
#>        [4]     0.597 CCACCTGAAGGAGACGCGC
#>        [5]     0.504 TGGCCTACAGGGGCCGCGG
#>        ...       ...                 ...
#>   [887976]     0.648 TGGCCTACAGGGGCCGCGG
#>   [887977]     0.770 CCACCTGAAGGAGACGCGC
#>   [887978]     0.770 TAGCCTCCAGAGGCCTCAG
#>   [887979]     0.770 GGACCAACAGGGGCAGGAG
#>   [887980]     0.726 CTGGCAGCTGGGGACACTG
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

It is always advisable to sort GRanges objects and keep standard chromsomes:

suppressMessages(library(plyranges))
CTCF_hg38_all <- CTCF_hg38_all %>% keepStandardChromosomes() %>% sort()
CTCF_hg38 <- CTCF_hg38 %>% keepStandardChromosomes() %>% sort()

Save the data in a BED file, if needed.

# Note that rtracklayer::import and rtracklayer::export perform unexplained
# start coordinate conversion, likely related to 0- and 1-based coordinate
# system. We recommend converting GRanges to a data frame and save tab-separated
write.table(CTCF_hg38_all %>% sort() %>% as.data.frame(), 
            file = "CTCF_hg38_all.bed",
            sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(CTCF_hg38 %>% sort() %>% as.data.frame(), 
            file = "CTCF_hg38.bed",
            sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Create an IGV XML session file out of the saved BED files using the tracktables package. See vignette("tracktables", package = "tracktables") for more details.

library(tracktables) # BiocManager::install("tracktables")
# Sample sheet metadata
SampleSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"),
                          Description = c("All CTCF matrices from JASPAR2022",
                                          "MA0139.1 CTCF matrix from JASPAR2022"))
# File sheet linking files with sample names
FileSheet <- data.frame(SampleName = c("CTCF all", "CTCF MA0139.1"),
                        bigwig = c(NA, NA),
                        interval = c("CTCF_hg38_all.bed", "CTCF_hg38.bed"),
                        bam = c(NA, NA))
# Creating an IGV session XML file
MakeIGVSession(SampleSheet, FileSheet, 
               igvdirectory = getwd(), "CTCF_from_JASPAR2022", "hg38")

Note that the FIMO tool detects CTCF binding sites using the 1e-4 p-value threshold by default (the more significant p-value corresponds to the more confidently detected CTCF motif). We found that this threshold may be too permissive. Using the ENCODE SCREEN database as ground truth, we found 1e-6 as the optimal threshold providing approximately 80% true positive rate. However, less significant CTCF motifs may be cell type-specific or have weaker CTCF binding and therefore be missed by conventional peak callers. If cell type-specific CTCF binding is of interest, we recommend exploring less significant CTCF sites.

To filter the GRanges object and keep high-confidence CTCF sites, use:

# Check length before filtering
print(paste("Number of CTCF motifs at the default 1e-4 threshold:", length(CTCF_hg38)))
#> [1] "Number of CTCF motifs at the default 1e-4 threshold: 887980"
# Filter and check length after filtering
CTCF_hg38_filtered <- CTCF_hg38 %>% plyranges::filter(pvalue < 1e-6)
print(paste("Number of CTCF motifs at the 1e-6 threshold:", length(CTCF_hg38_filtered)))
#> [1] "Number of CTCF motifs at the 1e-6 threshold: 21671"
# Similarly, filter
CTCF_hg38_all_filtered <- CTCF_hg38_all %>% plyranges::filter(pvalue < 1e-6)

Given some databases provide multiple CTCF PWMs, one CTCF site may be detected multiple times resulting in overlapping CTCF sites. For example, the proportion of overlapping CTCF sites in the CTCF_hg38_all_filtered object containing CTCF sites detected by three matrices nearly 40%:

# Proportion of overlapping enrtries
tmp <- findOverlaps(CTCF_hg38_all, CTCF_hg38_all)
prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE)
sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38_all)
#> [1] 0.375093

The proportion of overlapping CTCF sites in the CTCF_hg38_filtered object containing CTCF sites detected by the MA0139.1 matrix is less than 2.5%

tmp <- findOverlaps(CTCF_hg38, CTCF_hg38)
prop_overlap <- sort(table(queryHits(tmp)) %>% table(), decreasing = TRUE)
sum(prop_overlap[which(names(prop_overlap) != "1")]) / length(CTCF_hg38)
#> [1] 0.0233485

Reducing them (merging overlapping CTCF sites), combined with 1E-6 cutoff filtering, yields the number of CTCF sites comparable to previously reported.

print(paste("Number of CTCF_hg38 motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_filtered %>% reduce())))
#> [1] "Number of CTCF_hg38 motifs at the 1e-6 threshold AND reduced: 21652"
print(paste("Number of CTCF_hg38_all motifs at the 1e-6 threshold AND reduced:", length(CTCF_hg38_all_filtered %>% reduce())))
#> [1] "Number of CTCF_hg38_all motifs at the 1e-6 threshold AND reduced: 63572"

However, regulatory elements with CTCF proteins co-occupying adjacent/overlapping CTCF binding motifs were shown to be functionally and structurally different from those with single CTCF motifs. We provide non-reduced CTCF data and advise considering overlap of CTCF sites depending on the study’s goal.

liftOver of CTCF coordinates

As genome assemblies for model organisms continue to improve, CTCF sites for previous genome assemblies become obsolete. Typically, the actual genome sequence changes little, leading to changes in genomic coordinates. The liftOver method allows for conversion of genomic coordinates between genome assemblies.

Some carefully curated CTCF sites are available only for older genome assemblies. Examples include the data from CTCFBSDB, available for hg18 and mm8 genome assemblies.

To investigate whether liftOver of CTCF sites from older genome assemblies is a viable option, we tested for overlap between CTCF sites directly detected in specific genome assemblies with those lifted over. We detected CTCF sites using the MA0139.1 PWM from JASPAR 2022 database in hg18, hg19, hg38, and T2T genome assemblies and converted their genomic coordinates using the corresponding liftOver chains (download_liftOver.sh and convert_liftOver.sh scripts). We observed high Jaccard overlap among CTCF sites detected in the original genome assemblies or lifted over.

Jaccard overlaps among CTCF binding sites detected in the original and liftOver human genome assemblies. CTCF sites were detected using JASPAR 2022 MA0139.1 PWM. The correlogram was clustered using Euclidean distance and Ward.D clustering . White-red gradient indicate low-to-high Jaccard overlaps. Jaccard values are shown in the corresponding cells.

Our results suggest that liftOver is a viable alternative to obtain CTCF genomic annotations for different genome assemblies. We provide CTCFBSDB data converted to hg19 and hg38 genome assemblies.

CTCF Position Weight Matrices

CTCF PWM information. “Motif” - individual motif IDs or the total number of motifs per database; “Length” - motif length of the range of lengths; “URL” - direct links to motif pages. Jaspar, Hocomoco, Jolma 2013 PWMs were downloaded from the MEME database.

Motif Length..bp. URL
Jaspar2022
MA0139.1 19 https://jaspar2022.genereg.net/matrix/MA0139.1
MA1929.1 34 http://jaspar2022.genereg.net/matrix/MA1929.1
MA1930.1 35 http://jaspar2022.genereg.net/matrix/MA1930.1
HOCOMOCO v11
CTCF_HUMAN.H11MO.0.A 19 http://hocomoco.autosome.ru/motif/CTCF_HUMAN.H11MO.0.A
CTCF_MOUSE.H11MO.0.A 20 http://hocomoco.autosome.ru/motif/CTCF_MOUSE.H11MO.0.A
SwissRegulon
CTCF.p2 20 https://swissregulon.unibas.ch/wm/?wm=CTCF.p2&org=hg18
Jolma 2013
CTCF_full 17 http://floresta.eead.csic.es/footprintdb/index.php?db=HumanTF:1.0&motif=CTCF_full
CTCFBSDB
EMBL_M1, EMBL_M2, MIT_LM2, MIT_LM7, MIT_LM23, REN_20 9-20 https://insulatordb.uthsc.edu/download/CTCFBSDB_PWM.mat
CIS-BP
83 CTCF (Homo sapiens) C2H2 ZF 11-21 http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T094831_2.00
2 Ctcf (Mus musculus) C2H2 ZF 15, 20 http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T100985_2.00

CTCF motif logos. PWMs from (A) MEME, (B) CTCFBSDB, (C) CIS-BP human, and (D) CIS-BP mouse databases. Clustering and alignment of motifs was performed using the rBiocStyle::Biocpkg(“motifStack”)` R package.

See inst/scripts/make-data.R how the CTCF GRanges objects were created.

CTCF predicted and experimental data

Predefined CTCF binding data. “Database” - source of data; “Number” - number of binding sites; “Assembly” - genome assembly; “URL” - direct link to data download.

Database Number Assembly URL
CTCFBSDB 2.0 NA
Predicted human CTCF binding sites 13401 hg18 https://insulatordb.uthsc.edu/download/allcomp.txt.gz
Predicted mouse CTCF binding sites 5504 mm8 https://insulatordb.uthsc.edu/download/allcomp.txt.gz
SCREEN ENCODE NA
Human CTCF-bound cCREs 450641 hg38 https://api.wenglab.org/screen_v13/fdownloads/cCREs/GRCh38-CTCF.bed
Mouse CTCF-bound cCREs 82777 mm10 https://api.wenglab.org/screen_v13/fdownloads/cCREs/mm10-CTCF.bed

All GRanges objects included in the package

Summary of CTCF binding data provided in the package. CTCF sites for each genome assembly and PWM combination were detected using FIMO. “ID” - object names formatted as <assembly>.<database name>; “Assembly” - genome assembly, T2T - telomere to telomere (GCA_009914755.4) genome assembly; “All (p-value threshold Xe-Y)” - the total number of CTCF binding sites in the corresponding BED file at the Xe-Y threshold; “Non-overlapping (p-value threshold Xe-Y)” - number of non-overlapping CTCF binding sites (overlapping regions are merged) at the Xe-Y threshold.

ID Description Genome Species Taxonomy Data.provider All..p.value.threshold.1e.4. Non.overlapping..p.value.threshold.1e.4. All..p.value.threshold.1e.6. Non.overlapping..p.value.threshold.1e.6.
T2T.CIS_BP_2.00_Homo_sapiens.RData T2T CTCF motifs detected using human PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO T2T Homo sapiens 9606 CIS-BP 85,004,288 11,610,275 1,642,859 308,511
T2T.CTCFBSDB_PWM.RData T2T CTCF motifs detected using PWM matrices from https://insulatordb.uthsc.edu/, by FIMO T2T Homo sapiens 9606 CTCFBSDB 2.0 5,385,530 3,452,150 98,088 61,030
T2T.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData T2T CTCF motifs detected using human PWM matrices from https://hocomoco11.autosome.org/, by FIMO T2T Homo sapiens 9606 HOCOMOCO v11 927,477 909,247 21,491 21,450
T2T.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData T2T CTCF motifs detected using human PWM matrices from https://jaspar.genereg.net/, by FIMO T2T Homo sapiens 9606 JASPAR 2022 3,196,774 2,538,458 75,126 64,767
T2T.Jolma2013.RData T2T CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO T2T Homo sapiens 9606 Jolma 2013 342,765 341,669 6,857 6,857
T2T.MA0139.1.RData T2T CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO T2T Homo sapiens 9606 JASPAR 2022 916,829 905,231 22,149 22,131
T2T.SwissRegulon_human_and_mouse.RData T2T CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO T2T Homo sapiens 9606 SwissRegulon 1,106,234 1,094,142 23,050 23,006
hg38.CIS_BP_2.00_Homo_sapiens.RData hg38 CTCF motifs detected using human PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO hg38 Homo sapiens 9606 CIS-BP 82,804,664 11,502,445 1,612,042 303,446
hg38.CTCFBSDB.CTCF_predicted_human.RData hg38 CTCF predicted motifs from https://insulatordb.uthsc.edu/ hg38 Homo sapiens 9606 CTCFBSDB 2.0 5,212,674 3,343,301 97,491 60,648
hg38.CTCFBSDB_PWM.RData hg38 CTCF motifs detected using PWM matrices from https://insulatordb.uthsc.edu/, by FIMO hg38 Homo sapiens 9606 CTCFBSDB 2.0 13,357 13,278 - -
hg38.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData hg38 CTCF motifs detected using human PWM matrices from https://hocomoco11.autosome.org/, by FIMO hg38 Homo sapiens 9606 HOCOMOCO v11 891,816 875,038 21,020 20,975
hg38.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData hg38 CTCF motifs detected using human PWM matrices from https://jaspar.genereg.net/, by FIMO hg38 Homo sapiens 9606 JASPAR 2022 3,093,041 2,456,234 73,545 63,572
hg38.Jolma2013.RData hg38 CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO hg38 Homo sapiens 9606 Jolma 2013 330,892 329,834 6,823 6,823
hg38.MA0139.1.RData hg38 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO hg38 Homo sapiens 9606 JASPAR 2022 887,980 876,938 21,671 21,652
hg38.SCREEN.GRCh38_CTCF.RData hg38 CTCF-bound human cis-regulatory elements from https://screen.encodeproject.org/ hg38 Homo sapiens 9606 ENCODE SCREEN v3 450,641 444,379 - -
hg38.SwissRegulon_human_and_mouse.RData hg38 CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO hg38 Homo sapiens 9606 SwissRegulon 1,079,055 1,067,050 22,569 22,521
hg19.CIS_BP_2.00_Homo_sapiens.RData hg19 CTCF motifs detected using human PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO hg19 Homo sapiens 9606 CIS-BP 81,569,702 11,355,221 1,595,787 298,489
hg19.CTCFBSDB.CTCF_predicted_human.RData hg19 CTCF motifs detected using PWM matrices from https://insulatordb.uthsc.edu/, by FIMO hg19 Homo sapiens 9606 CTCFBSDB 2.0 5,118,501 3,294,516 93,692 57,137
hg19.CTCFBSDB_PWM.RData hg19 CTCF predicted motifs from https://insulatordb.uthsc.edu/ hg19 Homo sapiens 9606 CTCFBSDB 2.0 13,355 13,294 - -
hg19.HOCOMOCOv11_core_HUMAN_mono_meme_format.RData hg19 CTCF motifs detected using human PWM matrices from https://hocomoco11.autosome.org/, by FIMO hg19 Homo sapiens 9606 HOCOMOCO v11 876,822 860,212 20,853 20,811
hg19.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData hg19 CTCF motifs detected using human PWM matrices from https://jaspar.genereg.net/, by FIMO hg19 Homo sapiens 9606 JASPAR 2022 3,035,951 2,414,604 71,825 61,940
hg19.Jolma2013.RData hg19 CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO hg19 Homo sapiens 9606 Jolma 2013 315,914 314,858 6,734 6,734
hg19.MA0139.1.RData hg19 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO hg19 Homo sapiens 9606 JASPAR 2022 871,136 860,252 21,511 21,491
hg19.SwissRegulon_human_and_mouse.RData hg19 CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO hg19 Homo sapiens 9606 SwissRegulon 1,061,972 1,050,085 22,325 22,282
hg18.CTCFBSDB.CTCF_predicted_human.RData hg18 CTCF predicted motifs from https://insulatordb.uthsc.edu/ hg18 Homo sapiens 9606 CTCFBSDB 2.0 13,401 13,343 - -
hg18.MA0139.1.RData hg18 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO hg18 Homo sapiens 9606 JASPAR 2022 869,463 858,603 21,436 21,416
mm39.CIS_BP_2.00_Mus_musculus.RData mm39 CTCF motifs detected using mouse PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO mm39 Mus musculus 10090 CIS-BP 1,811,582 1,308,209 101,066 80,637
mm39.CTCFBSDB_PWM.RData mm39 CTCF predicted motifs from https://insulatordb.uthsc.edu/ mm39 Mus musculus 10090 CTCFBSDB 2.0 5,476,211 3,237,788 164,967 104,883
mm39.HOCOMOCOv11_core_MOUSE_mono_meme_format.RData mm39 CTCF motifs detected using mouse PWM matrices from https://hocomoco11.autosome.org/, by FIMO mm39 Mus musculus 10090 HOCOMOCO v11 740,710 726,641 27,620 27,582
mm39.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData mm39 CTCF motifs detected using mouse PWM matrices from https://jaspar.genereg.net/, by FIMO mm39 Mus musculus 10090 JASPAR 2022 2,940,147 2,280,509 109,427 90,585
mm39.Jolma2013.RData mm39 CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO mm39 Mus musculus 10090 Jolma 2013 423,521 422,552 12,800 12,797
mm39.MA0139.1.RData mm39 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO mm39 Mus musculus 10090 JASPAR 2022 962,332 949,270 32,210 32,199
mm39.SwissRegulon_human_and_mouse.RData mm39 CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO mm39 Mus musculus 10090 SwissRegulon 1,094,643 1,080,229 40,549 40,523
mm10.CIS_BP_2.00_Mus_musculus.RData mm10 CTCF motifs detected using mouse PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO mm10 Mus musculus 10090 CIS-BP 1,810,213 1,307,193 101,011 80,592
mm10.CTCFBSDB.CTCF_predicted_mouse.RData mm10 CTCF motifs detected using PWM matrices from https://insulatordb.uthsc.edu/, by FIMO mm10 Mus musculus 10090 CTCFBSDB 2.0 5,472,184 3,235,312 164,855 104,827
mm10.CTCFBSDB_PWM.RData mm10 CTCF predicted motifs from https://insulatordb.uthsc.edu/ mm10 Mus musculus 10090 CTCFBSDB 2.0 5,502 5,491 - -
mm10.HOCOMOCOv11_core_MOUSE_mono_meme_format.RData mm10 CTCF motifs detected using mouse PWM matrices from https://hocomoco11.autosome.org/, by FIMO mm10 Mus musculus 10090 HOCOMOCO v11 740,100 726,052 27,604 27,566
mm10.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData mm10 CTCF motifs detected using mouse PWM matrices from https://jaspar.genereg.net/, by FIMO mm10 Mus musculus 10090 JASPAR 2022 2,938,085 2,278,849 109,363 90,532
mm10.Jolma2013.RData mm10 CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO mm10 Mus musculus 10090 Jolma 2013 423,194 422,225 12,789 12,786
mm10.MA0139.1.RData mm10 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO mm10 Mus musculus 10090 JASPAR 2022 961,635 948,581 32,188 32,177
mm10.SCREEN.mm10_CTCF.RData mm10 CTCF-bound mouse cis-regulatory elements from https://screen.encodeproject.org/ mm10 Mus musculus 10090 ENCODE SCREEN v3 82,777 82,100 - -
mm10.SwissRegulon_human_and_mouse.RData mm10 CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO mm10 Mus musculus 10090 SwissRegulon 1,093,870 1,079,467 40,525 40,499
mm9.CIS_BP_2.00_Mus_musculus.RData mm9 CTCF motifs detected using mouse PWM matrices from http://cisbp.ccbr.utoronto.ca/, by FIMO mm9 Mus musculus 10090 CIS-BP 1,770,520 1,278,081 99,197 79,218
mm9.CTCFBSDB.CTCF_predicted_mouse.RData mm9 CTCF motifs detected using PWM matrices from https://insulatordb.uthsc.edu/, by FIMO mm9 Mus musculus 10090 CTCFBSDB 2.0 5,335,475 3,150,407 161,014 102,880
mm9.CTCFBSDB_PWM.RData mm9 CTCF predicted motifs from https://insulatordb.uthsc.edu/ mm9 Mus musculus 10090 CTCFBSDB 2.0 5,502 5,491 - -
mm9.HOCOMOCOv11_core_MOUSE_mono_meme_format.RData mm9 CTCF motifs detected using mouse PWM matrices from https://hocomoco11.autosome.org/, by FIMO mm9 Mus musculus 10090 HOCOMOCO v11 725,199 711,236 27,103 27,065
mm9.JASPAR2022_CORE_vertebrates_non_redundant_v2.RData mm9 CTCF motifs detected using mouse PWM matrices from https://jaspar.genereg.net/, by FIMO mm9 Mus musculus 10090 JASPAR 2022 2,879,784 2,230,170 107,506 89,172
mm9.Jolma2013.RData mm9 CTCF motifs detected using PWM matrices from DOI:10.1016/j.cell.2012.12.009, by FIMO mm9 Mus musculus 10090 Jolma 2013 411,345 410,398 12,309 12,306
mm9.MA0139.1.RData mm9 CTCF motifs detected using MA0139.1 PWM matrix from https://jaspar.genereg.net/, by FIMO mm9 Mus musculus 10090 JASPAR 2022 938,919 925,939 31,446 31,435
mm9.SwissRegulon_human_and_mouse.RData mm9 CTCF motifs detected using PWM matrices from https://swissregulon.unibas.ch/sr/, by FIMO mm9 Mus musculus 10090 SwissRegulon 1,066,676 1,052,374 39,510 39,484
mm8.CTCFBSDB.CTCF_predicted_mouse.RData mm8 CTCF predicted motifs from https://insulatordb.uthsc.edu/ mm8 Mus musculus 10090 CTCFBSDB 2.0 5,504 5,493 - -

Future development

  • High-resolution CTCF footprinting, K562 cell line. CTCF MNase HiChIP data, long fragments correspond to nucleosome-protected DNA whereas short fragments arise from CTCF-protected DNA. FactorFinder tool for near-base-pair resolution CTCF detection, uses the distribution of short fragments around a CTCF binding site. Benchmarking against CTCF motif location, ChIP-seq peaks, loop anchors. A model of topological regulation by partially extruded CTCF loops. Data to be available.
    Paper Sept, Corriene E., Y. Esther Tak, Christian G. Cerda-Smith, Haley M. Hutchinson, Viraat Goel, Marco Blanchette, Mital S. Bhakta, et al. “High-Resolution CTCF Footprinting Reveals Impact of Chromatin State on Cohesin Extrusion Dynamics.” Preprint. Genomics, October 23, 2023. .
  • Divergent CTCF sites are enriched at boundaries, convergent CTCF sites mark the interior of TADs, short loops in the 5-100kb range. Good intro about TADs. CTCF orientation is not linked to the direction of transcription. Definition of eight orientation patterns. Supplementary data: Table S1 - CTCF coordinates with directionality, Table S3 - consensus CTCF coordinates.
    Paper Nanni, Luca, Stefano Ceri, and Colin Logie. “Spatial patterns of CTCF sites define the anatomy of TADs and their boundaries.” Genome biology 21 (2020): 1-25.

Citation

Below is the citation output from using citation('CTCF') in R. Please run this yourself to check for any updates on how to cite CTCF.

print(citation("CTCF"), bibtex = TRUE)

This package was developed using biocthis.

Code of Conduct

Please note that the CTCF project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

About

Genomic coordinates of FIMO-predicted CTCF binding sites using JASPAR and other PWMs, human and mouse genome assemblies including mm39 and T2T. Also included experimentally derived ENCODE SCREEN CTCF-bound CREs.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages