We'll continue our work with ChIP-Seq peaks. First, we'll load the BED file of summits.

In [None]:
import pandas as pd

summits = pd.read_csv("ChIP_1M_summits.bed",
                      sep="\t", header=None,
                      names=["chrom", "start", "end", "name", "score"])
summits.head()

Now, we want to sort the summits by score in descending order, so the highest-scoring peaks are at the top.

In [None]:
summits_sorted = summits.sort_values(by="score", ascending=False)
print(summits_sorted.head())

Then, we can take the 25 highest-scoring summits from the top of the sorted list using `head`.

In [None]:
summits25 = summits_sorted.head(25)
summits25
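As an aside, pandas can combine the sort and the `head` into one call with `nlargest`. The sketch below uses a small made-up table rather than the real summit file (all coordinates, names, and scores are invented for illustration) and checks that both routes pick the same rows.

```python
import pandas as pd

# Toy stand-in for the summits table; every value here is invented.
toy = pd.DataFrame({
    "chrom": ["chrA", "chrA", "chrB", "chrC"],
    "start": [100, 500, 250, 900],
    "end":   [101, 501, 251, 901],
    "name":  ["peak_1", "peak_2", "peak_3", "peak_4"],
    "score": [12.5, 48.0, 7.1, 30.2],
})

# Two-step version, as in the cells above: sort, then take the top rows.
top2_sorted = toy.sort_values(by="score", ascending=False).head(2)

# One-step version: keep the 2 rows with the largest score.
top2_nlargest = toy.nlargest(2, "score")

print(top2_nlargest)
```

Either form works for the real data; `nlargest` just saves the intermediate sorted data frame.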

Set up pybedtools again. (The install should be faster after the first time!)

In [None]:
!pip3 install pybedtools
import pybedtools
#pybedtools.helpers.set_bedtools_path("/home/jovyan/mcb200-2019/bedtools2/bin/")

Convert the top-25 summits data frame into a BedTool object of genome locations.

In [None]:
summits25bed = pybedtools.BedTool.from_dataframe(summits25)

Create another BedTool object holding the genes from the yeast genome.

In [None]:
genes = pybedtools.BedTool('../S288C_R64-2-1/saccharomyces_cerevisiae_R64-2-1_20150113_mrna.bed')
genes.head()

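Find the genes that are closest to the high-scoring summits -- these are potential Hsf1 targets. We sort both BedTool objects first, because `closest` expects sorted input, and we ask for the two nearest genes per summit (`k=2`).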


In [None]:
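# closest() expects both inputs sorted by chromosome and start position
summits25bed = summits25bed.sort()
genes = genes.sort()
# k=2 reports the two nearest genes for each summit
neighbors_bed = summits25bed.closest(genes, k=2)
print(neighbors_bed)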
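To build some intuition for what `closest` with `k=2` reports, here is a rough pure-Python sketch on invented genes. It is a simplification and not how bedtools is implemented: it handles a single chromosome, ignores strand and ties, and the helper names `gap_distance` and `k_closest` are made up for this example.

```python
def gap_distance(pos, start, end):
    """Bases between a 1-nt summit at `pos` and the half-open interval [start, end)."""
    if start <= pos < end:
        return 0              # summit lies inside the gene
    if pos < start:
        return start - pos    # gene starts to the right of the summit
    return pos - end + 1      # gene ends to the left of the summit


def k_closest(pos, genes, k=2):
    """Return the k genes nearest to `pos` as (name, distance) pairs."""
    ranked = sorted(genes, key=lambda g: gap_distance(pos, g[1], g[2]))
    return [(name, gap_distance(pos, start, end)) for name, start, end in ranked[:k]]


# Invented genes on one chromosome: (name, start, end)
toy_genes = [("geneA", 100, 400), ("geneB", 600, 900), ("geneC", 1500, 1800)]

print(k_closest(500, toy_genes))
print(k_closest(700, toy_genes))
```

A summit that falls inside a gene gets distance 0 for that gene, so it will always be listed first.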


Convert the BedTool result back into a data frame.

In [None]:
neighbors = neighbors_bed.to_dataframe()
neighbors.head()

We'd like to know a bit about these genes without needing to look them all up individually on SGD.

We can get this information from another table in the yeast genome database, `SGD_features.tab`, which maps systematic names (e.g., YAL005C) to standard names (e.g., SSA1) and includes a brief synopsis of each gene's function.

In [None]:
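sgd = pd.read_csv("../S288C_R64-2-1/SGD_features.tab",
                  sep="\t", header=None)
# Keep the systematic name (3), standard name (4), and description (15)
sgd = sgd[ [3,4,15] ]
# Keep the summit location and score, plus the gene's systematic name
neighbors = neighbors[ ['chrom', 'start', 'end', 'score', 'itemRgb'] ]
sgd.head(20)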

We merge the data frame of neighboring genes with the data frame of gene names.

We use a _left_ join because we want to keep every row in the left data frame, `neighbors`.

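We join on the systematic gene name, which is in column `itemRgb` of `neighbors` and column `3` of `sgd`. The name `itemRgb` is misleading here: the column labels in `neighbors` are generic BED field names and do not describe what the columns hold after `closest`.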

In [None]:
neighbors.merge(sgd, left_on="itemRgb", right_on=3, how="left")
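The effect of a left join is easiest to see on a tiny example. The sketch below uses invented tables with hypothetical column names (`gene`, `systematic`, `standard`), and `YXX000X` is a deliberately fake gene name with no match. The `indicator=True` option adds a `_merge` column showing which rows found a partner.

```python
import pandas as pd

# Toy version of `neighbors`: one gene name has no entry in the lookup table.
toy_neighbors = pd.DataFrame({
    "chrom": ["chrA", "chrA", "chrB"],
    "gene":  ["YAL005C", "YXX000X", "YLL026W"],   # YXX000X is made up
})

# Toy version of `sgd`: systematic name -> standard name.
toy_sgd = pd.DataFrame({
    "systematic": ["YAL005C", "YLL026W"],
    "standard":   ["SSA1", "HSP104"],
})

# how="left" keeps every row of toy_neighbors, matched or not.
merged = toy_neighbors.merge(toy_sgd, left_on="gene", right_on="systematic",
                             how="left", indicator=True)
print(merged)
```

The unmatched row is kept, with `NaN` in the columns that come from the right-hand table. With an inner join it would have been dropped silently.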

It looks like we have some genes with the same underlying function -- but are there more than we expect by chance?

We can print a list of the systematic names of these potential Hsf1 targets and search Gene Ontology for enrichment.

In [None]:
print(neighbors['itemRgb'].to_string(index=False))
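Because we asked for two genes per summit, the same gene can appear more than once when neighboring summits share it. A minimal sketch of removing repeats before pasting the list into an enrichment tool, using an invented list of names in place of `neighbors['itemRgb']`:

```python
import pandas as pd

# Toy column of systematic names containing repeats (illustrative only).
toy_names = pd.Series(["YAL005C", "YLL026W", "YAL005C", "YLL026W", "YBR072W"])

# Keep the first occurrence of each name, preserving the original order.
unique_names = toy_names.drop_duplicates()

# One name per line, ready to paste into a web form.
gene_list_text = "\n".join(unique_names)
print(gene_list_text)
```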

## Motif Searching

We can also extract the DNA sequences around the Hsf1 summits to look for an enriched DNA motif.

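First, we need to define the regions of interest. The summit itself is just 1 nt long, so we extend it 100 bases in each direction using the pybedtools `slop` method. The `g` argument points to a file of chromosome sizes, so that extended regions do not run past the ends of a chromosome.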

In [None]:
motif_regions = summits25bed.slop(b=100, g="../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113_genome.txt")
print(motif_regions)
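For intuition, the arithmetic behind `slop` can be sketched in plain pandas: subtract from the start, add to the end, and clip to the chromosome bounds. This is an illustration on invented coordinates and chromosome sizes, not a replacement for the pybedtools call, and `slop_like` is a hypothetical helper name.

```python
import pandas as pd

# Invented 1-nt summits and chromosome sizes.
toy_summits = pd.DataFrame({
    "chrom": ["chrA", "chrA", "chrB"],
    "start": [50, 980, 300],
    "end":   [51, 981, 301],
})
toy_sizes = {"chrA": 1000, "chrB": 5000}


def slop_like(df, sizes, b):
    """Extend each interval by b bases on both sides, clipped to [0, chromosome size]."""
    out = df.copy()
    out["start"] = (out["start"] - b).clip(lower=0)
    chrom_end = out["chrom"].map(sizes)           # per-row upper limit
    out["end"] = (out["end"] + b).clip(upper=chrom_end)
    return out


toy_regions = slop_like(toy_summits, toy_sizes, b=100)
print(toy_regions)
```

The first two rows show the clipping: a summit near either end of a chromosome yields a region shorter than 201 bases.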

Now, we can extract the sequence for these motif regions. The result is written to a temporary FASTA file, whose path is stored in `res.seqfn`.

In [None]:
res = motif_regions.sequence(fi="../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa")
print(res.seqfn)
print(open(res.seqfn).read())
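To work with these sequences in Python rather than just print them, we can read the FASTA text into a dictionary. The sketch below is a minimal parser run on an invented FASTA string; `parse_fasta` is a hypothetical helper, and the `chrom:start-end` headers are an assumption about how the regions are labeled in the output.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Invented sequences, with one record split over two lines.
toy_fasta = ">chrA:0-12\nACGTTT\nCTAGAA\n>chrB:5-13\nGGAACGTT\n"
seqs = parse_fasta(toy_fasta)

for region, seq in seqs.items():
    print(region, len(seq), seq)
```

On the real data, the same function could be applied to the text read from `res.seqfn`.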