# __ctcf-footprint workflow__

### positive data

1. find CTCF motifs with fimo
<br>```run-find-motifs-with-fimo.sh```
2. intersect Merged_CTCF_motifs.bed with CTCF-ChIP-peaks-and-DNase-hotspot.bed, keeping only motifs that lie entirely inside a peak/hotspot interval (`-u -f 1`)
<br>```bedtools intersect -u -f 1 -a fimo/Merged_CTCF_motifs.bed -b data/CTCF-ChIP-peaks-and-DNase-hotspot.bed > candidate_footprints/CTCF_candidate_footprints_positive.bed```
3. separate by motif type (run once per type; `(M,L,XL)` is a placeholder for M, L, or XL, not shell syntax)
<br>```grep CTCF_(M,L,XL) CTCF_candidate_footprints_positive.bed > CTCF_candidate_footprints_(M,L,XL)_positive.bed```
4. get Fiber-seq reads within 100 bp of CTCF motifs (the `../data` path assumes this is run from a subdirectory such as candidate_footprints/)
<br>```ft center -d 100 ../data/GM12878.aligned.bam <(cut -f 1,2,3,6 CTCF_candidate_footprints_(M,L,XL)_positive.bed) > CTCF_m6a_fiberseq_(M,L,XL)_positive.txt```
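
Step 3 is written with `(M,L,XL)` as a placeholder, so it has to be run once per motif type. A minimal sketch of that step as a loop, assuming (as the `grep` above does) that each line contains `CTCF_M`, `CTCF_L`, or `CTCF_XL`. The function name `split_by_motif_type` is hypothetical and not part of this repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a candidate-footprint BED file by motif type.
# Usage: split_by_motif_type <input.bed> <positive|negative>
split_by_motif_type() {
    local bed="$1" label="$2" motif
    for motif in M L XL; do
        # Same pattern as the manual grep in step 3.
        # "|| true" keeps the loop going when a type has no matching lines.
        grep "CTCF_${motif}" "$bed" \
            > "CTCF_candidate_footprints_${motif}_${label}.bed" || true
    done
}
```

For example, `split_by_motif_type CTCF_candidate_footprints_positive.bed positive` writes the three per-type files into the current directory; step 4 can then be run once per file. The same call with `negative` as the label covers the negative data.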

### negative data

1. find CTCF motifs with fimo
<br>```run-find-motifs-with-fimo.sh```
2. INVERSE intersect Merged_CTCF_motifs.bed with CTCF-ChIP-peaks-and-DNase-hotspot.bed, keeping only motifs with no overlap at all (`-v`)
<br>```bedtools intersect -v -a fimo/Merged_CTCF_motifs.bed -b data/CTCF-ChIP-peaks-and-DNase-hotspot.bed > candidate_footprints/CTCF_candidate_footprints_negative.bed```
3. separate by motif type (optional; `(M,L,XL)` is a placeholder for M, L, or XL, not shell syntax)
<br>```grep CTCF_(M,L,XL) CTCF_candidate_footprints_negative.bed > CTCF_candidate_footprints_(M,L,XL)_negative.bed```
4. get Fiber-seq reads within 100 bp of CTCF motifs (the `../data` path assumes this is run from a subdirectory such as candidate_footprints/)
<br>```ft center -d 100 ../data/GM12878.aligned.bam <(cut -f 1,2,3,6 CTCF_candidate_footprints_(M,L,XL)_negative.bed) > CTCF_m6a_fiberseq_(M,L,XL)_negative.txt```
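
The positive set requires a motif to lie entirely inside an interval (`-u -f 1`), while the negative set requires no overlap at all (`-v`), so motifs that only partially overlap an interval end up in neither file. A minimal sketch of a line-count sanity check, assuming one motif per line in each BED file. The function name `count_partial_overlaps` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of motifs that fall in neither the positive
# nor the negative set (i.e. motifs that only partially overlap a peak).
# Usage: count_partial_overlaps <all_motifs.bed> <positive.bed> <negative.bed>
count_partial_overlaps() {
    local all="$1" pos="$2" neg="$3"
    local n_all n_pos n_neg
    n_all=$(wc -l < "$all")
    n_pos=$(wc -l < "$pos")
    n_neg=$(wc -l < "$neg")
    # Should never be negative if both sets come from the same motif file.
    echo $(( n_all - n_pos - n_neg ))
}
```

Called with fimo/Merged_CTCF_motifs.bed and the two candidate_footprints files, a negative result would indicate that the inputs do not belong together.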

### test data set

1. subsample 5% of the reads in the BAM
<br>```samtools view -s 0.05 -b GM12878.aligned.bam > GM12878_small_5.aligned.bam```
2. make index
<br>```samtools index GM12878_small_5.aligned.bam```

## data files

### candidate_footprints

__CTCF_candidate_footprints__ (independent of bam size)
* CTCF_candidate_footprints_(positive).bed: CTCF motifs within CTCF-ChIP peaks and DNase I hotspots
* CTCF_candidate_footprints_(M,L,XL).bed: &uarr; separated by motif type

__CTCF_m6a_fiberseq__ (dependent on bam size)
* CTCF_m6a_fiberseq_(M,L,XL)_(positive).txt - full-genome output of Fiber-seq reads, separated by motif size
* CTCF_m6a_fiberseq_(M,L,XL)\_100bp_(positive).txt - reads within the 100 bp flank of motif sites, separated by motif size
* CTCF_m6a_fiberseq_merged_100bp_positive.txt - reads within the 100 bp flank of ALL positive motif sites
* CTCF_m6a_fiberseq_merged_100bp_small_negative.txt - NEGATIVE [1% BAM subsample]
* CTCF_m6a_fiberseq_merged_100bp_small_5_negative.txt - NEGATIVE [5% BAM subsample]

<br>test files (`test` renamed to `subset` in the file names)
* (old) CTCF_m6a_fiberseq_(M,L,XL)_100bp_test_negative.txt - 10M-line subsamples for __testing__
    * renamed to: CTCF_m6a_fiberseq_(M,L,XL)_100bp_subset_negative.txt
* (old) CTCF_m6a_fiberseq_XL_100bp_test_positive.txt - 10k-line subsample for __testing__
    * renamed to: CTCF_m6a_fiberseq_XL_100bp_subset_positive.txt

### feature_data

__most up to date__
* positive: CTCF_m6a_fiberseq_merged_100bp_positive_features.{txt,pin}
* negative: CTCF_m6a_fiberseq_merged_100bp_small_5_negative_features.{txt,pin}

* CTCF_m6a_fiberseq_L_100bp_positive_features.txt - positive features of L motifs
* CTCF_m6a_fiberseq_merged_100bp_positive_features.txt - positive features of __all__ motifs
* CTCF_m6a_fiberseq_L_100bp_small_negative_features.txt - negative features of L motifs [1% BAM subsample]
    * 25,975,750 rows (small)
    * 129,968,381 rows (small 5)
* (__negative__) CTCF_negative_m6a_fiberseq_100bp_small_5_features-test.txt

__pin formatted features__
* CTCF_m6a_fiberseq_merged_100bp_features.pin - combined positive & negative pin file (input to mokapot)
    * CTCF_m6a_fiberseq_merged_100bp_features-{file_root}.pin - same file with an optional string (file_root) appended to the name
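
How the combined pin file is built is not shown here. A minimal sketch of one way to do it, assuming both inputs are tab-delimited pin files with an identical single header line and that labels are already set in each file. The function name `combine_pin` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: concatenate a positive and a negative pin file,
# keeping only one copy of the header line.
# Usage: combine_pin <positive.pin> <negative.pin> > combined.pin
combine_pin() {
    local pos="$1" neg="$2"
    # Refuse to combine files whose header lines differ.
    if [ "$(head -n 1 "$pos")" != "$(head -n 1 "$neg")" ]; then
        echo "error: header mismatch between $pos and $neg" >&2
        return 1
    fi
    cat "$pos"            # header + positive rows
    tail -n +2 "$neg"     # negative rows without their header
}
```

The header check is there because a silent column mismatch between the two feature sets would otherwise only surface later, inside mokapot.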

future work
* filter by p-value
* get negative control (use same data but inverse)
    * i.e. m6a methylation NOT within ChIP or DNase I peaks

### feature collection

potential features
* [x] read length [query_length]
* [ ] 3-mer counts within canonical motif (exclude 3-mers containing no A or T)
* [ ] m6a count within each k-mer
* [x] m6a count for: (motif, 40 bp flank left, 40 bp flank right)
* [x] AT count for: (motif, flank left, flank right)
* [x] proportion of bases that are AT per element (0-1)
* [ ] proportion of methylated ATs per element (motif, flank left, flank right) (0-1)
> $$ \text{m6a prop} = \frac{\text{m6a count}}{\text{AT count}} $$
> &uarr; open question: should this be the proportion of ATs that are methylated (m6a count / AT count) or the density of methylated ATs over the region (m6a count / region length)?
* [ ] total methylation across the read (% of ATs that were methylated across read)
* [x] MSP size
* [ ] maybe do a run length encoding?
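
The two candidate definitions of the m6a proportion differ only in the denominator, so both can be computed side by side and compared later. A minimal sketch for one element per row, assuming a tab-delimited input without a header and with the hypothetical column order m6a count, AT count, region length. The function name `m6a_proportions` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: append both proportion definitions to each row.
# Input columns (assumed): m6a_count, AT_count, region_length
# Output columns: the three inputs, m6a/AT, m6a/region_length
m6a_proportions() {
    awk -F'\t' -v OFS='\t' '{
        m6a = $1; at = $2; len = $3
        # Guard against division by zero (e.g. an element with no A/T bases).
        per_at  = (at  > 0) ? m6a / at  : "NA"
        per_len = (len > 0) ? m6a / len : "NA"
        print m6a, at, len, per_at, per_len
    }' "$@"
}
```

The function reads from the files given as arguments, or from standard input when none are given. Rows with an AT count of zero get `NA` rather than 0, so that "no methylatable bases" stays distinguishable from "no methylation".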

## Feb. 23, 2023 (working list)

* [ ] merge mokapot output and ft-center output
    * [x] add unique ID to both (combined motif_name/query_name)
* [ ] re-collect features for merged positive dataset
    * [ ] re-format to pin
* [ ] make density plot of m6a's identified as positive & negative in mokapot
    * filter mokapot q-value < 0.05
    * set q-value of negatives (False labels) to 1
* [ ] find most informative feature in tide_conf
* [ ] downsample so both sets have ~65k observations
    * rerun mokapot (recalc scale_pos_weight)
* [x] make script(s) to collect features
* [x] collect features for 5% negative data set
* [ ] make dummy presentation for 1-on-1 meeting
* [ ] add run length encoding (bottom priority)
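
For the downsampling item in the list above, a minimal sketch that draws a fixed number of random data rows while keeping the header, plus the class-weight calculation. It assumes the feature file has a single header line, and it computes `scale_pos_weight` as negatives divided by positives, which is the ratio the XGBoost documentation suggests; whether this workflow uses exactly that ratio is an assumption. Both function names are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep the header and n randomly chosen data rows
# (reservoir sampling, so the file is read only once).
# Usage: downsample_rows <file> <n> [seed]
downsample_rows() {
    local file="$1" n="$2" seed="${3:-1}"
    awk -v n="$n" -v seed="$seed" '
        BEGIN { srand(seed) }
        NR == 1 { print; next }              # pass the header through
        {
            i = NR - 1                       # index of this data row
            if (i <= n) {
                keep[i] = $0                 # fill the reservoir first
            } else {
                j = int(rand() * i) + 1      # uniform integer in 1..i
                if (j <= n) { keep[j] = $0 } # replace with probability n/i
            }
        }
        END {
            m = (i < n) ? i : n
            for (k = 1; k <= m; k++) print keep[k]
        }' "$file"
}

# Hypothetical helper: scale_pos_weight = (# negatives) / (# positives).
# Usage: scale_pos_weight <n_negative> <n_positive>
scale_pos_weight() {
    awk -v neg="$1" -v pos="$2" 'BEGIN {
        if (pos <= 0) { exit 1 }             # avoid division by zero
        printf "%.4f\n", neg / pos
    }'
}
```

With both sets downsampled to the same size the weight comes out at 1, so recalculating it mainly matters when the two sets end up with different row counts.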