# ChIP-seq analysis

Fragment the genome into short sequences, align the reads to the reference genome, and look at how they are distributed.

## macs2 - peak caller

The output is a BED file with chromosome, start, end, and strand information. This is where we come in.
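For reference, ENCODE peak files use the narrowPeak layout (BED6+4). The following is a minimal sketch of loading such a file with pandas. The two rows are made-up values for illustration, not real peaks, and the snake_case column names are my own labels for the narrowPeak fields.

```python
import io

import pandas as pd

# Column labels for the ten narrowPeak (BED6+4) fields.
NARROWPEAK_COLUMNS = [
    "chrom", "start", "end", "name", "score",
    "strand", "signal_value", "p_value", "q_value", "summit_offset",
]

# Two made-up, tab-separated rows standing in for a real .bed.gz file.
example = io.StringIO(
    "chr1\t1000\t1200\t.\t900\t.\t35.2\t-1\t4.1\t100\n"
    "chr2\t5000\t5400\t.\t500\t.\t12.8\t-1\t2.3\t180\n"
)

peaks = pd.read_csv(example, sep="\t", header=None, names=NARROWPEAK_COLUMNS)

# BED coordinates are 0-based and half-open, so length is simply end - start.
peaks["length"] = peaks["end"] - peaks["start"]
print(peaks[["chrom", "start", "end", "length"]])
```

For a real download, passing the `.bed.gz` path to `pd.read_csv` should work the same way, since pandas infers gzip compression from the file extension.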

We love [ENCODE](https://www.encodeproject.org), the Encyclopedia of DNA Elements. It catalogs many non-coding DNA elements, which are often regulatory.

How do we know if a DNA region is regulatory? Just because a protein binds to it does not necessarily mean it is biologically meaningful. 

We want to narrow down to TF ChIP-seq experiments.

For example, https://www.encodeproject.org/experiments/ENCSR000DRZ/

- The BAM file has the alignments.
- bigWig lets you visualize the data in a genome browser (nice for presentations). Shush and Amber have been doing this with ChIP-seq peaks.

We are most interested in BED files. We want merged data (replicates 1,2, for example), so go with "conservative IDR thresholded peaks" (though there are many more options).

## todo:
- [ ] generate about 10 datasets.

## Get reference genome(s)

In [2]:
%%bash
mkdir -p references
cd references
if [ ! -f hg19.fa ]; then
    curl -fL https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz | gunzip > hg19.fa
fi
if [ ! -f hg38.fa ]; then
    curl -fL https://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz | gunzip > hg38.fa
fi

## Process positives and negative peaks

Go from BED file to one-hot encoded output.
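As a rough picture of the final step, here is a minimal sketch of one-hot encoding with the `ACGT` alphabet. The function `one_hot_encode` is hypothetical and written for illustration; it is not the implementation in `chipseq_utils`.

```python
import numpy as np


def one_hot_encode(sequences, alphabet="ACGT"):
    """One-hot encode equal-length sequences.

    Returns a uint8 array of shape (n_sequences, length, len(alphabet)).
    Letters outside the alphabet (for example N) raise a KeyError, so such
    sequences should be filtered out beforehand.
    """
    lookup = {letter: i for i, letter in enumerate(alphabet)}
    length = len(sequences[0])
    encoded = np.zeros((len(sequences), length, len(alphabet)), dtype=np.uint8)
    for i, seq in enumerate(sequences):
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for j, letter in enumerate(seq.upper()):
            encoded[i, j, lookup[letter]] = 1
    return encoded


encoded = one_hot_encode(["ACGT", "GGCA"])
print(encoded.shape)
```

With this layout, the last axis follows the alphabet order, so index 1 is C and index 2 is G.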

In [3]:
from collections import namedtuple
from pathlib import Path

import chipseq_utils
import h5py
import numpy as np

datasets_dir = Path("datasets")
references_dir = Path("references")
reference_hg19 = references_dir / "hg19.fa"
reference_hg38 = references_dir / "hg38.fa"

assert reference_hg19.exists()
assert reference_hg38.exists()

dataset = namedtuple(
    "Dataset",
    [
        "path",
        "positive_encode_url",
        "positive_bed_url",
        "negative_encode_url",
        "negative_bed_url",
        "reference_genome_fasta",
        "summary",
    ],
)

datasets = [
    dataset(
        path=datasets_dir / "CTCF" / "hg19" / "ENCSR000DRZ_ENCSR000EMT",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DRZ/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF963PJY/@@download/ENCFF963PJY.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EMT/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF097LEF/@@download/ENCFF097LEF.bed.gz",
        reference_genome_fasta=reference_hg19,
        summary="""\
Positive reads are from CTCF ChIP-seq on human GM12878.
Negative reads are from DNase-seq on human GM12878.
Reference is hg19.""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR000DZN_ENCSR000EMT",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DZN/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF796WRU/@@download/ENCFF796WRU.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EMT/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF195QAV/@@download/ENCFF195QAV.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human GM12878.
Negative reads are from DNase-seq on human GM12878.
Reference is GRCh38 (hg38).""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR000AKB_ENCSR000EMT",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000AKB/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF017XLW/@@download/ENCFF017XLW.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EMT/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF195QAV/@@download/ENCFF195QAV.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human GM12878.
Negative reads are from DNase-seq on human GM12878.
Reference is GRCh38 (hg38).""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg19" / "ENCSR000AKB_ENCSR000EMT",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000AKB/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF096AKZ/@@download/ENCFF096AKZ.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EMT/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF097LEF/@@download/ENCFF097LEF.bed.gz",
        reference_genome_fasta=reference_hg19,
        summary="""\
Positive reads are from CTCF ChIP-seq on human GM12878.
Negative reads are from DNase-seq on human GM12878.
Reference is hg19.""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR000DKV_ENCSR000EMT",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DKV/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF258AFQ/@@download/ENCFF258AFQ.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EMT/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF195QAV/@@download/ENCFF195QAV.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human GM12878.
Negative reads are from DNase-seq on human GM12878.
Reference is GRCh38 (hg38).""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR560BUE_ENCSR000EPH",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR560BUE/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF203DXT/@@download/ENCFF203DXT.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EPH/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF886OJN/@@download/ENCFF886OJN.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human MCF-7.
Negative reads are from DNase-seq on human MCF-7 treated with estradiol at 100nM for 0 hour (control).
Reference is GRCh38 (hg38).""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg19" / "ENCSR560BUE_ENCSR000EPH",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR560BUE/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF990LUT/@@download/ENCFF990LUT.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EPH/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF846DFL/@@download/ENCFF846DFL.bed.gz",
        reference_genome_fasta=reference_hg19,
        summary="""\
Positive reads are from CTCF ChIP-seq on human MCF-7.
Negative reads are from DNase-seq on human MCF-7 treated with estradiol at 100nM for 0 hour (control).
Reference is hg19.""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR000DWH_ENCSR000EPH",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DWH/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF742FBX/@@download/ENCFF742FBX.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EPH/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF886OJN/@@download/ENCFF886OJN.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human MCF-7.
Negative reads are from DNase-seq on human MCF-7 treated with estradiol at 100nM for 0 hour (control).
Reference is GRCh38 (hg38).""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg19" / "ENCSR000DWH_ENCSR000EPH",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DWH/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF720OXG/@@download/ENCFF720OXG.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EPH/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF846DFL/@@download/ENCFF846DFL.bed.gz",
        reference_genome_fasta=reference_hg19,
        summary="""\
Positive reads are from CTCF ChIP-seq on human MCF-7.
Negative reads are from DNase-seq on human MCF-7 treated with estradiol at 100nM for 0 hour (control).
Reference is hg19.""",
    ),
    dataset(
        path=datasets_dir / "CTCF" / "hg38" / "ENCSR000DMV_ENCSR000EPH",
        positive_encode_url="https://www.encodeproject.org/experiments/ENCSR000DMV/",
        positive_bed_url="https://www.encodeproject.org/files/ENCFF663NFF/@@download/ENCFF663NFF.bed.gz",
        negative_encode_url="https://www.encodeproject.org/experiments/ENCSR000EPH/",
        negative_bed_url="https://www.encodeproject.org/files/ENCFF886OJN/@@download/ENCFF886OJN.bed.gz",
        reference_genome_fasta=reference_hg38,
        summary="""\
Positive reads are from CTCF ChIP-seq on human MCF-7.
Negative reads are from DNase-seq on human MCF-7 treated with estradiol at 100nM for 0 hour (control).
Reference is GRCh38 (hg38).""",
    ),
]

In [4]:
# constants
max_read_length = 250
new_read_length = 200
alphabet = "ACGT"
nonsense_letters = "N"
hdf5_path = Path("chip-seq-datasets.h5")

for d in datasets:
    positive_output, negative_output = chipseq_utils.bed_to_fasta_to_one_hot(
        dataset_dir=d.path,
        positive_bed_url=d.positive_bed_url,
        negative_bed_url=d.negative_bed_url,
        reference_genome_fasta=d.reference_genome_fasta,
        max_read_length=max_read_length,
        new_read_length=new_read_length,
        alphabet=alphabet,
        nonsense_letters=nonsense_letters,
    )

    # Sample negatives so GC content is similar to positives.
    print("sampling negatives for similar GC content...")
    positive_gc_content = positive_output.one_hot[:, :, 1:3].any(-1).mean(1)
    negative_gc_content = negative_output.one_hot[:, :, 1:3].any(-1).mean(1)
    size = min(positive_gc_content.shape[0], negative_gc_content.shape[0])
    inds = chipseq_utils.sample_b_matched_to_a(
        positive_gc_content, negative_gc_content, size=size, seed=42
    )

    # Save to hdf5.
    print("saving to hdf5 ...")
    features = np.concatenate((positive_output.one_hot, negative_output.one_hot[inds]))
    n_positives = positive_output.one_hot.shape[0]
    n_negatives = inds.shape[0]
    n_total = n_positives + n_negatives
    print(n_total, "total sequences")
    labels = np.zeros(n_total, dtype=np.uint8)
    labels[:n_positives] = 1
    dataset_features = str(d.path / "features")
    dataset_labels = str(d.path / "labels")
    print(f"features dataset  {dataset_features}")
    print(f"labels dataset    {dataset_labels}")

    with h5py.File(hdf5_path, mode="a") as f:
        f.create_dataset(dataset_features, data=features, compression="gzip")
        f.create_dataset(dataset_labels, data=labels, compression="gzip")

skipping download step...
reading data...
getting peak length...
filtering...
  saved to 'datasets/CTCF/hg19/ENCSR000DRZ_ENCSR000EMT/positive_peaks_filtered.bed.gz'
converting chip-seq data to fasta...
  saved to 'datasets/CTCF/hg19/ENCSR000DRZ_ENCSR000EMT/positive_peaks_filtered_extracted.bed.fa'
loading fasta...
filtering nonsense...
  found 0 sequences with nonsense letters
one-hot encoding...
skipping download step...
reading data...
getting peak length...
filtering...
  saved to 'datasets/CTCF/hg19/ENCSR000DRZ_ENCSR000EMT/negative_peaks_nonintersect_filtered.bed.gz'
converting chip-seq data to fasta...
  saved to 'datasets/CTCF/hg19/ENCSR000DRZ_ENCSR000EMT/negative_peaks_nonintersect_filtered_extracted.bed.fa'
loading fasta...
filtering nonsense...
  found 1 sequences with nonsense letters
one-hot encoding...
sampling negatives for similar GC content...
saving to hdf5 ...
72876 total sequences
features dataset  datasets/CTCF/hg19/ENCSR000DRZ_ENCSR000EMT/features
labels dataset    

downloading...
reading data...
getting peak length...
filtering...
  saved to 'datasets/CTCF/hg19/ENCSR000DWH_ENCSR000EPH/positive_peaks_filtered.bed.gz'
converting chip-seq data to fasta...
  saved to 'datasets/CTCF/hg19/ENCSR000DWH_ENCSR000EPH/positive_peaks_filtered_extracted.bed.fa'
loading fasta...
filtering nonsense...
  found 0 sequences with nonsense letters
one-hot encoding...
downloading...
reading data...
getting peak length...
filtering...
  saved to 'datasets/CTCF/hg19/ENCSR000DWH_ENCSR000EPH/negative_peaks_nonintersect_filtered.bed.gz'
converting chip-seq data to fasta...
  saved to 'datasets/CTCF/hg19/ENCSR000DWH_ENCSR000EPH/negative_peaks_nonintersect_filtered_extracted.bed.fa'
loading fasta...
filtering nonsense...
  found 2 sequences with nonsense letters
one-hot encoding...
sampling negatives for similar GC content...
saving to hdf5 ...
43184 total sequences
features dataset  datasets/CTCF/hg19/ENCSR000DWH_ENCSR000EPH/features
labels dataset    datasets/CTCF/hg19/ENC

# scratch space below

## match by GC content

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(10, 5))
axes = axes.ravel()

positive_gc_content = positive_output.one_hot[:, :, 1:3].any(-1).mean(1)
axes[0].hist(positive_gc_content, bins=25, range=(0, 1))
axes[0].set_title("GC in positives")

negative_gc_content = negative_output.one_hot[:, :, 1:3].any(-1).mean(1)
axes[1].hist(negative_gc_content, bins=25, range=(0, 1))
axes[1].set_title("GC in negatives")

plt.tight_layout()
plt.show()

In [None]:
inds = chipseq_utils.sample_b_matched_to_a(positive_gc_content, negative_gc_content)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(10, 5))
axes = axes.ravel()
axes[0].hist(positive_gc_content, bins=25, range=(0, 1))
axes[0].set_title("GC in positives")

axes[1].hist(negative_gc_content[inds], bins=25, range=(0, 1))
axes[1].set_title("GC in sampled negatives")
plt.tight_layout()
plt.show()
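One way such matching could work is per-bin sampling: for each GC bin, draw as many negatives as there are positives in that bin. The sketch below is an assumption about the general idea, not the actual implementation of `chipseq_utils.sample_b_matched_to_a`; the function `sample_matched` and the beta-distributed test data are made up for illustration.

```python
import numpy as np


def sample_matched(a, b, n_bins=25, seed=None):
    """Return indices into `b` whose histogram over [0, 1] mimics that of `a`.

    For each bin, draw (without replacement) as many values from `b` as `a`
    has in that bin, capped at the number available in `b`.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, 1, n_bins + 1)
    # Using only the interior edges maps every value to a bin in 0..n_bins-1.
    a_bins = np.digitize(a, edges[1:-1])
    b_bins = np.digitize(b, edges[1:-1])
    chosen = []
    for k in range(n_bins):
        candidates = np.flatnonzero(b_bins == k)
        n_wanted = min(int((a_bins == k).sum()), candidates.size)
        if n_wanted:
            chosen.append(rng.choice(candidates, size=n_wanted, replace=False))
    if not chosen:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(chosen))


rng = np.random.default_rng(0)
a = rng.beta(5, 3, size=500)   # made-up stand-in for positive GC content
b = rng.beta(2, 2, size=5000)  # made-up stand-in for negative GC content
inds = sample_matched(a, b, seed=42)
print(inds.size, round(a.mean(), 3), round(b[inds].mean(), 3))
```

Per-bin sampling needs no density estimate, but it can return fewer samples than requested when the negatives are sparse in a bin that the positives occupy.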

## Save to HDF5

In [None]:
import h5py
import numpy as np

features = np.concatenate((positive_output.one_hot, negative_output.one_hot[inds]))

n_positives = positive_output.one_hot.shape[0]
n_negatives = inds.shape[0]
n_total = n_positives + n_negatives
print(n_total, "total sequences")

labels = np.zeros(n_total, dtype=np.uint8)
labels[:n_positives] = 1

dataset_features = str(dataset_dir / "features")
dataset_labels = str(dataset_dir / "labels")
print(f"features dataset  {dataset_features}")
print(f"labels dataset    {dataset_labels}")

hdf5_path = Path("chip-seq-datasets.h5")

with h5py.File(hdf5_path, mode="w") as f:
    f.create_dataset(dataset_features, data=features, compression="gzip")
    f.create_dataset(dataset_labels, data=labels, compression="gzip")
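To check what was written, the arrays can be read back with h5py. This is a minimal sketch using a temporary file and made-up arrays; the group name `datasets/CTCF/example` is hypothetical.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Made-up stand-ins for the real arrays: 6 sequences, length 200, 4 letters.
features = np.zeros((6, 200, 4), dtype=np.uint8)
labels = np.array([1, 1, 1, 0, 0, 0], dtype=np.uint8)
group = "datasets/CTCF/example"  # hypothetical group name

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.h5"

    # mode="w" truncates an existing file; mode="a" keeps what is already there.
    with h5py.File(path, mode="w") as f:
        f.create_dataset(f"{group}/features", data=features, compression="gzip")
        f.create_dataset(f"{group}/labels", data=labels, compression="gzip")

    # Slicing with [:] loads a dataset into memory as a NumPy array.
    with h5py.File(path, mode="r") as f:
        loaded_features = f[f"{group}/features"][:]
        loaded_labels = f[f"{group}/labels"][:]

print(loaded_features.shape, loaded_labels.tolist())
```

Because the dataset names contain slashes, h5py creates the intermediate groups automatically, which is why a directory-like path can double as the dataset name.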

## Get positive peak data

- Stamatoyannopoulos - Univ of Washington
- CTCF ChIP-seq on human GM12878
- https://www.encodeproject.org/experiments/ENCSR000DRZ/
- conservative IDR thresholded peaks  1,2  hg19

In [None]:
input_bed_file_url = (
    "https://www.encodeproject.org/files/ENCFF963PJY/@@download/ENCFF963PJY.bed.gz"
)
input_bed_file = datasets_dir / "CTCF" / "Stamatoyannopoulos" / "positive_peaks.bed.gz"
input_bed_file.parent.mkdir(parents=True, exist_ok=True)

In [None]:
if not input_bed_file.exists():
    print("downloading...")
    _ = chipseq_utils.download(
        url=input_bed_file_url,
        output_path=input_bed_file,
        force=True,
    )

output = chipseq_utils._bed_to_fasta_to_onehot(
    bed_file=input_bed_file,
    max_read_length=max_read_length,
    new_read_length=new_read_length,
    reference_genome_fasta=reference_hg19,
    alphabet=alphabet,
    nonsense_letters=nonsense_letters,
    bedtools_exe="bedtools",
)

## Get negative peak data

- Stamatoyannopoulos - Univ of Washington
- DNase-seq on human GM12878
- https://www.encodeproject.org/experiments/ENCSR000EMT/

In [None]:
input_bed_file_url = (
    "https://www.encodeproject.org/files/ENCFF097LEF/@@download/ENCFF097LEF.bed.gz"
)
input_bed_file = datasets_dir / "CTCF" / "Stamatoyannopoulos" / "negative_peaks.bed.gz"
input_bed_file.parent.mkdir(parents=True, exist_ok=True)

if not input_bed_file.exists():
    print("downloading...")
    _ = chipseq_utils.download(
        url=input_bed_file_url,
        output_path=input_bed_file,
        force=True,
    )

In [None]:
input_file_file_nonintersect = chipseq_utils.add_str_before_suffixes(
    input_bed_file, "_nonintersect"
)

_ = chipseq_utils.bedtools_intersect(
    a=input_bed_file,
    b=output.bed_file_filtered,
    output_bedfile=input_file_file_nonintersect,
    write_a=True,
    invert_match=True,
)

In [None]:
output = chipseq_utils._bed_to_fasta_to_onehot(
    bed_file=input_file_file_nonintersect,
    max_read_length=max_read_length,
    new_read_length=new_read_length,
    reference_genome_fasta=reference_hg19,
    alphabet=alphabet,
    nonsense_letters=nonsense_letters,
    bedtools_exe="bedtools",
)

## download ChIP-seq data

In [None]:
from pathlib import Path

# save all things here
data_dir = Path("datasets") / "CTCF_Stamatoyannopoulos"
data_dir.mkdir(parents=True, exist_ok=True)

chipseq_bedfile = data_dir / "positive_peaks.bed.gz"

In [None]:
# Stamatoyannopoulos - Univ of Washington
# CTCF ChIP-seq on human GM12878
# https://www.encodeproject.org/experiments/ENCSR000DRZ/
# conservative IDR thresholded peaks  1,2  hg19
!wget -O $chipseq_bedfile https://www.encodeproject.org/files/ENCFF963PJY/@@download/ENCFF963PJY.bed.gz

In [None]:
!ls $data_dir

## inspect ChIP-seq data

In [None]:
import pandas as pd

df = pd.read_csv(chipseq_bedfile, delimiter="\t", header=None)
df.head()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

lengths = df.loc[:, 2] - df.loc[:, 1]
plt.hist(lengths, bins=20)
plt.title("Distribution of peak length in ChIP-seq data")
plt.show()

## filter ChIP-seq data

In [None]:
import chipseq_utils

In [None]:
df = pd.read_csv(chipseq_bedfile, delimiter="\t", header=None)
df = chipseq_utils.filter_bed_by_max_length(df, max_length=250)
df = chipseq_utils.transform_bed_to_constant_size(df, new_length=250)

chipseq_bedfile_filtered = chipseq_utils.add_str_before_suffixes(
    chipseq_bedfile, string="_filtered"
)
df.to_csv(chipseq_bedfile_filtered, sep="\t", index=False, header=False)
print(f"Saved to '{chipseq_bedfile_filtered}'")
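The two filtering steps could look roughly like this. It is a sketch under assumptions: `filter_by_max_length` and `resize_around_center` are hypothetical stand-ins, and the real `chipseq_utils` functions may choose the window differently (for example around the peak summit rather than the midpoint).

```python
import pandas as pd


def filter_by_max_length(df, max_length):
    """Keep rows whose interval (column 2 minus column 1) is at most max_length."""
    lengths = df[2] - df[1]
    return df.loc[lengths <= max_length].copy()


def resize_around_center(df, new_length):
    """Replace each interval with one of new_length centered on its midpoint."""
    out = df.copy()
    center = (out[1] + out[2]) // 2
    out[1] = center - new_length // 2
    out[2] = out[1] + new_length
    return out


# Made-up intervals with integer column labels, as produced by header=None.
bed = pd.DataFrame(
    {0: ["chr1", "chr1", "chr2"], 1: [1000, 5000, 300], 2: [1200, 5600, 480]}
)
resized = resize_around_center(filter_by_max_length(bed, max_length=250), new_length=200)
print(resized)
```

Note that resizing can push a start coordinate below zero for peaks near the beginning of a chromosome, so a real pipeline may need to clip or drop those rows.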

## extract FASTA sequences for the peaks

### get reference genome

In this case, the reference genome is in `.2bit` format, so we must convert it to FASTA.

In [None]:
# get reference genome
!wget -N -nv --show-progress https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

In [None]:
# get program that converts twobit to fasta format
!wget -N -nv --show-progress http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
!chmod +x twoBitToFa

In [None]:
# convert 2bit to fasta.
reference_fasta = "hg19.fa"
process = chipseq_utils.twobit_to_fasta("hg19.2bit", reference_fasta)
print(process)

In [None]:
# Install bedtools if the program is not found.
![[ $(command -v bedtools) ]] || sudo apt-get install --yes --quiet bedtools

In [None]:
# create fasta file from the bedfile.

chipseq_fasta = chipseq_utils.add_str_before_suffixes(
    chipseq_bedfile_filtered, "_hg19"
).with_suffix(".fa")

process = chipseq_utils.bedtools_getfasta(
    input_fasta=reference_fasta,
    output_fasta=chipseq_fasta,
    bed_file=chipseq_bedfile_filtered,
    force_strandedness=True,
)
print(process)
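Under the hood this step presumably shells out to `bedtools getfasta`. Here is a minimal sketch of building that command with the standard `-fi`, `-bed`, `-fo`, and `-s` options. The helper `getfasta_command` and the file names are hypothetical, and how `chipseq_utils.bedtools_getfasta` actually invokes bedtools is an assumption.

```python
import shutil
import subprocess
from pathlib import Path


def getfasta_command(reference_fasta, bed_file, output_fasta, stranded=True):
    """Build a `bedtools getfasta` command as an argument list."""
    cmd = [
        "bedtools", "getfasta",
        "-fi", str(reference_fasta),  # input reference FASTA
        "-bed", str(bed_file),        # intervals to extract
        "-fo", str(output_fasta),     # output FASTA
    ]
    if stranded:
        cmd.append("-s")  # reverse-complement intervals on the minus strand
    return cmd


# Hypothetical file names, for illustration only.
cmd = getfasta_command("hg19.fa", "peaks_filtered.bed", "peaks_filtered.fa")
print(" ".join(cmd))

# Run only if bedtools is installed and the input files are present.
inputs_exist = Path("hg19.fa").exists() and Path("peaks_filtered.bed").exists()
if shutil.which("bedtools") and inputs_exist:
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.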

In [None]:
!head -n 4 $chipseq_fasta

In [None]:
# load sequences
descriptions, sequences = chipseq_utils.parse_fasta(chipseq_fasta)

# filter out nonsense
nonsense = chipseq_utils.get_nonsense_sequence_mask(sequences, nonsense_letters="N")
print(f"Found {nonsense.sum()} sequences with nonsense letters")
descriptions = descriptions[~nonsense]
sequences = sequences[~nonsense]

# one-hot encode
sequences_onehot = chipseq_utils.one_hot(sequences)
print("Shape of one-hot encoded data:", sequences_onehot.shape)

## get GC content per sequence

In [None]:
# This assumes that GC are in slices 1:3 of the one-hot encoded data.
# shape of this is (n_sequences,)
gc_content_pos = sequences_onehot[:, :, 1:3].any(-1).mean(1)

plt.hist(gc_content_pos, bins=25, range=(0, 1))
plt.title("Histogram of GC content among positive sequences")
plt.show()
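To see why the `1:3` slice gives GC content, here is a tiny worked example on three made-up sequences, assuming the one-hot channels follow the `ACGT` order.

```python
import numpy as np

alphabet = "ACGT"
sequences = ["GGCC", "ATAT", "ACGT"]  # made-up sequences

# Build the one-hot array by comparing each letter against the alphabet.
letters = np.array([list(s) for s in sequences])          # shape (n, length)
one_hot = letters[..., None] == np.array(list(alphabet))  # shape (n, length, 4)

# Channels 1 and 2 are C and G, so `any` over them marks the GC positions,
# and the mean over the length axis is the GC fraction per sequence.
gc_content = one_hot[:, :, 1:3].any(-1).mean(1)
print(gc_content)
```

The three sequences come out as 1.0, 0.0, and 0.5. If the alphabet order were different, the slice would need to change accordingly.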

## create negative data

Get the negative peaks that do not overlap any positive peak.

Negative examples should be sampled from the same distribution as the positives but lack the pattern we are interested in. We use DNase-seq from the same cell type as our negative control, which gives us accessible regions.
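The non-overlap filter is what `bedtools intersect -v` provides: it reports entries in A that have no overlap with B. A plain-Python sketch of the same logic on made-up intervals (the helper names are hypothetical, and the quadratic loop is only suitable for small inputs):

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def non_overlapping(candidates, positives):
    """Keep candidates that overlap none of the positives."""
    return [c for c in candidates if not any(overlaps(c, p) for p in positives)]


positives = [("chr1", 100, 300)]
candidates = [
    ("chr1", 250, 400),  # overlaps the positive peak
    ("chr1", 300, 500),  # only touches it; BED intervals are half-open
    ("chr2", 100, 300),  # same coordinates, different chromosome
]
negatives = non_overlapping(candidates, positives)
print(negatives)
```

bedtools is the better choice for genome-scale files; the sketch is only meant to make the filtering rule explicit.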

In [None]:
neg_chipseq_bedfile = data_dir / "negative_peaks.bed.gz"

In [None]:
# Stamatoyannopoulos - Univ of Washington
# DNase-seq on human GM12878
# https://www.encodeproject.org/experiments/ENCSR000EMT/
!wget -O $neg_chipseq_bedfile https://www.encodeproject.org/files/ENCFF097LEF/@@download/ENCFF097LEF.bed.gz

In [None]:
neg_chipseq_bedfile_nonoverlap = data_dir / "neg_nonoverlap.bed.gz"

In [None]:
!bedtools intersect -v -wa -a $neg_chipseq_bedfile -b $chipseq_bedfile_filtered | gzip > $neg_chipseq_bedfile_nonoverlap

In [None]:
process = chipseq_utils.bedtools_intersect(
    neg_chipseq_bedfile,
    chipseq_bedfile_filtered,
    output_bedfile="output.bed",
    write_a=True,
    invert_match=True,
)

In [None]:
# TODO:
# do the same processing as above for negative peaks. in the end, you want
# one-hot representation of the negatives.

In [None]:
import pandas as pd

df = pd.read_csv(neg_chipseq_bedfile_nonoverlap, delimiter="\t", header=None)
df.head()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

lengths = df.loc[:, 2] - df.loc[:, 1]
plt.hist(lengths, bins=20)
plt.title("Distribution of peak length in negative reads")
plt.show()

In [None]:
df = pd.read_csv(neg_chipseq_bedfile_nonoverlap, delimiter="\t", header=None)
df = chipseq_utils.filter_bed_by_max_length(df, max_length=250)
df = chipseq_utils.transform_bed_to_constant_size(df, new_length=250)

neg_chipseq_bedfile_nonoverlap_filtered = chipseq_utils.add_str_before_suffixes(
    neg_chipseq_bedfile_nonoverlap, string="_filtered"
)
df.to_csv(neg_chipseq_bedfile_nonoverlap_filtered, sep="\t", index=False, header=False)
print(f"Saved to '{neg_chipseq_bedfile_nonoverlap_filtered}'")

In [None]:
reference_fasta = "hg19.fa"

neg_chipseq_bedfile_nonoverlap_filtered_fasta = chipseq_utils.add_str_before_suffixes(
    neg_chipseq_bedfile_nonoverlap_filtered, "_hg19"
).with_suffix(".fa")
neg_chipseq_bedfile_nonoverlap_filtered_fasta

In [None]:
process = chipseq_utils.bedtools_getfasta(
    input_fasta=reference_fasta,
    output_fasta=neg_chipseq_bedfile_nonoverlap_filtered_fasta,
    bed_file=neg_chipseq_bedfile_nonoverlap_filtered,
    force_strandedness=True,
)
print(process)

In [None]:
# load sequences
descriptions, sequences = chipseq_utils.parse_fasta(
    neg_chipseq_bedfile_nonoverlap_filtered_fasta
)

# filter out nonsense
nonsense = chipseq_utils.get_nonsense_sequence_mask(sequences, nonsense_letters="N")
print(f"Found {nonsense.sum()} sequences with nonsense letters")
descriptions = descriptions[~nonsense]
sequences = sequences[~nonsense]

# one-hot encode
sequences_onehot = chipseq_utils.one_hot(sequences)
print("Shape of one-hot encoded data:", sequences_onehot.shape)

In [None]:
# This assumes that GC are in slices 1:3 of the one-hot encoded data.
# shape of this is (n_sequences,)
gc_content_neg = sequences_onehot[:, :, 1:3].any(-1).mean(1)

plt.hist(gc_content_neg, bins=25, range=(0, 1))
plt.title("Histogram of GC content among negative sequences")
plt.show()

In [None]:
# try to match positives and negatives by GC content.
# use the one-hot encoded array.
# we have to downsample negative peaks.
#
# do this after filtering with bedtools intersect.
#
# then try to balance dataset. how much downsampling of negative data?

# also give labels of 0 or 1.

# save to hdf5

# and then train on a model, and look at saliency map.
# CTCF dataset is niiiiice. use that. Saliency map should show CTCF nicely.

In [None]:
# https://datascience.stackexchange.com/questions/67645/how-to-resample-one-dataset-to-conform-to-the-distribution-of-another-dataset
# https://stackoverflow.com/questions/41495240/how-to-sample-data-based-off-the-distribution-of-another-dataset-in-r

In [None]:
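# First attempt (does not work): `p` needs one entry per element of
# `gc_content_neg` and must sum to 1, so the positive GC values cannot be used directly.
# np.random.choice(gc_content_neg, replace=False, p=gc_content_pos)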

In [None]:
import scipy.stats

In [None]:
kde = scipy.stats.gaussian_kde(gc_content_pos)

In [None]:
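# np.random.choice requires probabilities that sum to 1, so normalize the KDE values.
p = kde(gc_content_neg)
p = p / p.sum()
neg_chosen = np.random.choice(
    gc_content_neg, size=gc_content_pos.shape, replace=False, p=p
)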
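One caveat with weighting negatives by the positive KDE alone: the sample then follows roughly the product of the positive and negative densities rather than the positive density. A possible alternative, sketched here on made-up beta-distributed data, is to weight each negative by the ratio of the two densities. The variable names are hypothetical and this is an illustration, not a tested replacement for the cell above.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
gc_pos = rng.beta(5, 3, size=500)   # made-up stand-in for positive GC content
gc_neg = rng.beta(2, 2, size=5000)  # made-up stand-in for negative GC content

# Weight each negative by target density / source density, evaluated at its
# own GC value, then normalize so the weights form a probability vector.
weights = gaussian_kde(gc_pos)(gc_neg) / gaussian_kde(gc_neg)(gc_neg)
weights /= weights.sum()

# Sample indices (not values) so the same rows can be taken from the one-hot array.
chosen = rng.choice(gc_neg.size, size=gc_pos.size, replace=False, p=weights)
print(round(gc_pos.mean(), 3), round(gc_neg.mean(), 3), round(gc_neg[chosen].mean(), 3))
```

Sampling without replacement can still deplete the rare high-weight negatives, so the match gets worse as the requested sample size approaches the number of negatives.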