**ChIP-seq** data for a particular transcription factor.

# Topical Intro
DNA transcription, and thus the regulation of gene expression, is mediated by specific proteins called transcription factors (TFs), which recognize specific DNA elements.
TFs can act as activators or repressors: they promote or block the recruitment of RNA polymerase, the enzyme that transcribes DNA into RNA.

A defining feature of a TF is its DNA-binding domain, i.e. a region of the protein that binds to a specific DNA sequence. This sequence is adjacent to the gene the TF regulates.

Chromatin immunoprecipitation (ChIP) is a form of immunoprecipitation (IP).

**Immunoprecipitation (IP)** is a method to isolate a specific target protein. It uses antibodies that are attached to a solid substrate and have binding sites specific to this target protein. With this setup, the target protein can be 'captured' by binding it to the solid substrate via the antibodies.

**Chromatin immunoprecipitation (ChIP)** follows the same approach as IP, but the target protein is (part of) a protein that has formed covalent bonds with the DNA sequence it is bound to, also called 'cross-linked' to DNA. Cross-linking is typically induced chemically (e.g. with formaldehyde); the DNA is then broken into fragments, e.g. by sonication.
From the captured and purified target protein complex (i.e. including the bound DNA fragment), the DNA sequence can then be isolated.

To 'read' the isolated DNA fragments, **ChIP-seq** sequences them.


# Questions
- How long is a typical sequence of DNA a TF binds to?

  From https://pubmed.ncbi.nlm.nih.gov/22887818/
  
  The length distribution somewhat resembles a Poisson distribution centered at roughly 10 bp:
  
  ![973fig1.jpg](attachment:973fig1.jpg)

# The Data

- Peak data: 1000 sequences for the top 1000 peaks
- Shuffled data: a shuffled set of sequences that preserves properties of the peak sequences (e.g. GC content).
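
Whether the shuffled set really preserves GC content can be checked directly. The following is a minimal sketch, assuming the sequences are plain A/C/G/T strings; `gc_content` and `mean_gc` are hypothetical helper names and not part of the original notebook.

```python
from typing import Iterable


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid division by zero for empty sequences
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_gc(seqs: Iterable[str]) -> float:
    """Return the average per-sequence GC fraction over a collection."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(gc_content(s) for s in seqs) / len(seqs)
```

If the shuffling conserves GC content, `mean_gc` over the peak sequences and over the shuffled sequences should come out close to each other.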

### Raw data exhibits

**Peak data**

In [10]:
!cat data/peak_data.txt

>>hg19_chrX_152665682_152665785_reg1000053.p1_25.097_+
CTGAAGCCTTTCCAATTCCACCTGTAGGGGGCCAGGCTCCTCACGGCACTGCCACGCCGACAGCCTCCAGGGGGCGCTCACCGTCGCCTGTACCACTGGCCGCC
>>hg19_chrX_100023865_100023966_reg1000044.p1_24.868_+
GAGCCCCACCTGGTGTCTATCTGTAGTACTGCAGCCTCGCTGGTGGTTTAATAAGACATAGCACTGGAGCGCCGCCTAGTGTCTATCTGTGATAATGCACTC
>>hg19_chrX_132926613_132926706_reg1000071.p2_24.241_+
TCCAGCTTTCGGCACCTCTCCGAGCTGGACACCAGAGGGCGGTATACCCACACTTACTGCTTCCATCCAGTAGAGGGCCAGAAGAAAGGGGAAT
>>hg19_chrX_153763152_153763262_reg1000063.p1_23.996_+
TGCATTCGCAGAGCAAGGCTGCCACCCTGCGGCCTGGCCGGGCCTTTGGGGAAGCAGAGCGGAAAGGCGGTGTTTCGTGGAGCAACGCTGCCACCTTGTGGTCCCGCTGGG
>>hg19_chrX_73770289_73770396_reg1000123.p2_23.278_+
AGGCGGGTTCGCGCGCCCCCTGCTGGCCACCAGAGACAGTGCGAGATTCTCAGCGCTCACTGTGTAGCGGTAGCCGTGCGCCCCCTCCTGGCTGTCCTGCTGAATCAA
>>hg19_chrX_12965021_12965133_reg1000303.p1_22.615_+
GGATGGGCCTAGCTGGAATTTACCACCAGGGGGCTTGGACTCTTTCTCATGCGTCAACACCAATCACTGCCAATAGGGGGCAGCAACAATTTGCCTACAGCCTTTAGGCTTTG
>>hg19_chrX_

**Shuffled data**

In [3]:
!tail data/shuffled_data.txt

>>hg19_182891140_257637474
CGAATGCTCCGCCGAGTTCTGGCTATTGTACGGTACCAACGCAGCGTAGCACCCAGGCGGTGGTCTCAGCAGGGGGTCGTTGCTCGCTTTCGT
>>hg19_497119174_187902088
CTCCATGGGGGAGGAGACCGGGCTACTTCCCTCTACTTCGTATCCTGCCCGACTGGCTGAGCCATAACGATCAACCACAGGTGGCCTACATGCTGTC
>>hg19_226604158_473800962
CGGCAGGGCCGTTCGGGATCGCGACAACTCAGGTCTCGTCGACAAACAACTAATCGCCGAAGGCCAGTCAGAATGACAGGACATGGGCAACTAGCTTTGGCGATGCGCCAACACCCTCCC
>>hg19_447524963_453957589
ACGACGGTAGGCTCCCTTTGCGAGGGAATGCGTACAGAGACATGACCTCGGCCTCCTCGTCGGGCAGGGTTTGGGCTCAACTGGCCCGGCGGCTTCCCTA
>>hg19_81132884_273121648
TTGTGAGCGTACCGCCTGCTGACATCCAGGTGGGGACCACGAGTGCGAAGCCCGAGACGGCTAATGGAGAGCACGGCGATGTCCGAGTCTAGTTGCC
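
Both files alternate between a header line starting with `>>` and a sequence line, so they can be read with a small FASTA-style parser. This is a sketch under that assumption; `read_records` is a hypothetical helper name, and it also tolerates sequences wrapped over several lines, which may not occur in these files.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from lines in the '>>header' layout."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line.lstrip(">")
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

A possible usage, with the path taken from the cell above:

```python
with open("data/shuffled_data.txt") as handle:
    records = list(read_records(handle))
```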


# Assumptions
The header lines of the peak data resemble the "BED" format (see e.g. https://learn.gencore.bio.nyu.edu/ngs-file-formats/bed-format/), so I assume:

- `hg19` is the genome assembly
- `chr...` identifies the chromosome
- `reg...` is the name of this sequence
- `p1_...` is the binding 'strength' in some unknown unit
- `+` indicates that this is a forward read
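
These assumptions can be written down as a regular expression that splits a peak header into named fields. This is only a sketch: `parse_peak_header` and the field names are hypothetical, and reading the two integers after the chromosome as start and end coordinates is a further assumption that the list above does not cover.

```python
import re
from typing import Any, Dict, Optional

# Layout assumed from the headers shown above, e.g.
# hg19_chrX_152665682_152665785_reg1000053.p1_25.097_+
HEADER_RE = re.compile(
    r"^>*(?P<genome>[^_]+)_(?P<chrom>chr[^_]+)_(?P<start>\d+)_(?P<end>\d+)"
    r"_(?P<name>reg\d+)\.(?P<peak>p\d+)_(?P<score>\d+(?:\.\d+)?)_(?P<strand>[+-])$"
)


def parse_peak_header(header: str) -> Optional[Dict[str, Any]]:
    """Split a peak header into fields; return None if it does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    fields: Dict[str, Any] = match.groupdict()
    fields["start"] = int(fields["start"])  # assumed start coordinate
    fields["end"] = int(fields["end"])      # assumed end coordinate
    fields["score"] = float(fields["score"])  # 'strength', unit unknown
    return fields
```

Headers that do not follow this layout, such as those of the shuffled data, yield `None` rather than raising an error, which makes it easy to count how many lines the assumptions actually cover.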
