We'll continue our work with ChIP-Seq peaks. First, we'll load the BED file of summits.

In [1]:
import pandas as pd

summits = pd.read_csv("ChIP_1M_summits.bed",
                      sep="\t", header=None,
                      names=["chrom", "start", "end", "name", "score"])
summits.head()

Unnamed: 0,chrom,start,end,name,score
0,chrI,73174,73175,MACS_peak_1,13.0
1,chrI,101289,101290,MACS_peak_2,19.0
2,chrI,141772,141773,MACS_peak_3,105.0
3,chrI,223145,223146,MACS_peak_4,20.0
4,chrII,137670,137671,MACS_peak_5,17.0


Now, we want to sort the summits by score in descending order, so the highest-scoring peaks are at the top.

In [2]:
summits_sorted = summits.sort_values(by="score", ascending=False)
summits_sorted.head()
#summits_sorted.tail()

Unnamed: 0,chrom,start,end,name,score
2,chrI,141772,141773,MACS_peak_3,105.0
59,chrXII,490220,490221,MACS_peak_60,86.0
39,chrVII,915019,915020,MACS_peak_40,59.0
31,chrVI,210370,210371,MACS_peak_32,53.0
56,chrXII,97708,97709,MACS_peak_57,51.0


Then, we can take the 25 highest-scoring summits from the top of the sorted table using `head`.

In [3]:
top25 = summits_sorted.head(25)
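
As an aside, the sort-then-`head` pattern can be collapsed into one call. The sketch below uses a toy table (`toy_summits`, a made-up name) built from the five rows shown in the first output; it is an illustration, not a replacement for the cells above. When scores are tied, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Toy stand-in for the summits table; names and scores copied from the first output above.
toy_summits = pd.DataFrame({
    "name": ["MACS_peak_1", "MACS_peak_2", "MACS_peak_3", "MACS_peak_4", "MACS_peak_5"],
    "score": [13.0, 19.0, 105.0, 20.0, 17.0],
})

# Two-step version, as in the cells above: sort, then take the top rows.
top_by_sort = toy_summits.sort_values(by="score", ascending=False).head(3)

# One-step version: nlargest returns the n highest-scoring rows, already ordered.
top_by_nlargest = toy_summits.nlargest(3, "score")

print(top_by_nlargest)
```

Both versions keep the original row index, so rows can still be traced back to their position in the unsorted table.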

Set up pybedtools again (it should be faster after the first time!)

In [4]:
!pip3 install pybedtools
import pybedtools
pybedtools.helpers.set_bedtools_path("/home/jovyan/mcb200-2019/bedtools2/bin/")



Convert the top-25 summits data frame into a BedTool object of genome locations

In [5]:
top25bed = pybedtools.BedTool.from_dataframe(top25)

Create another BedTool object of genes from the yeast genome

In [11]:
genes = pybedtools.BedTool('../S288C_R64-2-1/saccharomyces_cerevisiae_R64-2-1_20150113_mrna.bed')
genes.head()
#genes.to_dataframe().head()

chrmt	3951	4338	Q0010	1	+
 chrmt	4253	4415	Q0017	1	+
 chrmt	11666	11957	Q0032	1	+
 chrmt	13817	26701	Q0045	1	+
 chrmt	13817	16322	Q0050	1	+
 chrmt	13817	18830	Q0055	1	+
 chrmt	13817	19996	Q0060	1	+
 chrmt	13817	21935	Q0065	1	+
 chrmt	13817	23167	Q0070	1	+
 chrmt	24155	25255	Q0075	1	+
 

Find the genes that are closest to the high-scoring BED summits -- these are potential Hsf1 targets.


In [18]:
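# closest() expects both inputs sorted by chromosome and start position
top25bed = top25bed.sort()
genes = genes.sort()
#top25bed.head()
#genes.head()
# k=2 reports the two nearest genes for each summit
nearest = top25bed.closest(genes, k=2)
print(nearest)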

chrI	141772	141773	MACS_peak_3	105.0	chrI	139502	141431	YAL005C	1	-
chrI	141772	141773	MACS_peak_3	105.0	chrI	140759	141407	YAL004W	1	+
chrII	444847	444848	MACS_peak_7	29.0	chrII	443820	444693	YBR101C	1	-
chrII	444847	444848	MACS_peak_7	29.0	chrII	445061	447323	YBR102C	1	-
chrII	477469	477470	MACS_peak_8	31.0	chrII	477670	479047	YBR118W	1	+
chrII	477469	477470	MACS_peak_8	31.0	chrII	474391	476437	YBR117C	1	-
chrIV	974483	974484	MACS_peak_22	30.0	chrIV	974630	975782	YDR259C	1	-
chrIV	974483	974484	MACS_peak_22	30.0	chrIV	971807	974243	YDR258C	1	-
chrIV	1357491	1357492	MACS_peak_24	29.0	chrIV	1357579	1358902	YDR449C	1	-
chrIV	1357491	1357492	MACS_peak_24	29.0	chrIV	1356064	1357369	YDR448W	1	+
chrIX	387157	387158	MACS_peak_26	32.0	chrIX	385563	385701	YIR018C-A	1	-
chrIX	387157	387158	MACS_peak_26	32.0	chrIX	384608	385346	YIR018W	1	+
chrV	86498	86499	MACS_peak_27	35.0	chrV	86178	86598	YEL033W	1	+
chrV	86498	86499	MACS_peak_27	35.0	chrV	85612	86215	YEL034C-A	1	-
chrV	364353	364354	MACS_peak
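
To make the idea of "closest" concrete, here is a minimal pure-Python sketch of a nearest-gene search for the two chrII summits. It is a simplification under stated assumptions: the helper names `gap` and `k_closest` are made up, the gene list contains only the four chrII genes that appear in the output above, and the distance is a plain gap in bases, which may not match the convention bedtools uses internally. Chromosome matching, strand, and tie handling are ignored.

```python
# Genes near the two chrII summits, copied from the output above: (name, start, end).
chrII_genes = [
    ("YBR101C", 443820, 444693),
    ("YBR102C", 445061, 447323),
    ("YBR117C", 474391, 476437),
    ("YBR118W", 477670, 479047),
]

def gap(a_start, a_end, b_start, b_end):
    """Bases separating two half-open intervals; 0 if they overlap or touch."""
    return max(0, a_start - b_end, b_start - a_end)

def k_closest(summit_start, summit_end, genes, k=2):
    """Return the names of the k genes with the smallest gap to the summit, nearest first."""
    ranked = sorted(genes, key=lambda g: gap(summit_start, summit_end, g[1], g[2]))
    return [name for name, _, _ in ranked[:k]]

peak_7 = k_closest(444847, 444848, chrII_genes)  # MACS_peak_7
peak_8 = k_closest(477469, 477470, chrII_genes)  # MACS_peak_8
print(peak_7, peak_8)
```

For these two summits the sketch picks the same gene pairs that `closest` reports above.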

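Convert the BedTool result back into a data frame. pybedtools labels the columns with default BED field names, so the six gene columns appear under `strand` through `blockSizes`, and the systematic gene name ends up in `itemRgb`.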

In [19]:
nearest_df = nearest.to_dataframe()
print(nearest_df.head())

   chrom   start     end         name  score strand  thickStart  thickEnd  \
0   chrI  141772  141773  MACS_peak_3  105.0   chrI      139502    141431   
1   chrI  141772  141773  MACS_peak_3  105.0   chrI      140759    141407   
2  chrII  444847  444848  MACS_peak_7   29.0  chrII      443820    444693   
3  chrII  444847  444848  MACS_peak_7   29.0  chrII      445061    447323   
4  chrII  477469  477470  MACS_peak_8   31.0  chrII      477670    479047   

   itemRgb  blockCount blockSizes  
0  YAL005C           1          -  
1  YAL004W           1          +  
2  YBR101C           1          -  
3  YBR102C           1          -  
4  YBR118W           1          +  
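
Because the default column labels are misleading for the gene half of each row, one option is to rename them. The sketch below does this on a two-row toy copy of the table; `toy_nearest` and the new column names such as `gene_id` are hypothetical, and the rest of this notebook keeps the default names.

```python
import pandas as pd

# Two rows of the closest() result, typed in by hand with the default BED-style names.
default_names = ["chrom", "start", "end", "name", "score", "strand",
                 "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes"]
toy_nearest = pd.DataFrame(
    [["chrI", 141772, 141773, "MACS_peak_3", 105.0, "chrI", 139502, 141431, "YAL005C", 1, "-"],
     ["chrI", 141772, 141773, "MACS_peak_3", 105.0, "chrI", 140759, 141407, "YAL004W", 1, "+"]],
    columns=default_names,
)

# Hypothetical descriptive names for the six gene columns.
gene_columns = {
    "strand": "gene_chrom", "thickStart": "gene_start", "thickEnd": "gene_end",
    "itemRgb": "gene_id", "blockCount": "gene_score", "blockSizes": "gene_strand",
}
toy_renamed = toy_nearest.rename(columns=gene_columns)
print(toy_renamed[["name", "gene_id", "gene_strand"]])
```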


We'd like to know a bit about these genes without needing to look them all up individually on SGD.

We can get this information from another SGD data table that maps systematic names (e.g., YAL005C) to standard names (e.g., SSA1) and includes a brief synopsis of the gene function.

In [24]:
sgd = pd.read_csv("../S288C_R64-2-1/SGD_features.tab", sep="\t", header=None)
# Keep only the columns we need: 3 = systematic name, 4 = standard name, 15 = description
sgd = sgd[[3, 4, 15]]
sgd.head(20)

Unnamed: 0,3,4,15
0,YAL069W,,Dubious open reading frame; unlikely to encode...
1,,,
2,YAL068W-A,,Dubious open reading frame; unlikely to encode...
3,,,
4,ARS102,,Autonomously Replicating Sequence
5,TEL01L,,Telomeric region on the left arm of Chromosome...
6,,,Terminal telomeric repeats on the left arm of ...
7,,,Telomeric X element Core sequence on the left ...
8,,,Telomeric X element combinatorial repeat on th...
9,YAL068C,PAU8,Protein of unknown function; member of the ser...


We merge the data frame of neighboring genes with the data frame of gene names.

We use a _left_ join because we want to keep every row in the left data frame, `nearest_df`, even when a gene has no match in `sgd`.

We join on `itemRgb` from `nearest_df` and column `3` from `sgd`; both contain systematic names.

In [25]:
nearest_df.merge(sgd, how="left", left_on="itemRgb", right_on=3)

Unnamed: 0,chrom,start,end,name,score,strand,thickStart,thickEnd,itemRgb,blockCount,blockSizes,3,4,15
0,chrI,141772,141773,MACS_peak_3,105.0,chrI,139502,141431,YAL005C,1,-,YAL005C,SSA1,ATPase involved in protein folding and NLS-dir...
1,chrI,141772,141773,MACS_peak_3,105.0,chrI,140759,141407,YAL004W,1,+,YAL004W,,Dubious open reading frame; unlikely to encode...
2,chrII,444847,444848,MACS_peak_7,29.0,chrII,443820,444693,YBR101C,1,-,YBR101C,FES1,Hsp70 (Ssa1p) nucleotide exchange factor; requ...
3,chrII,444847,444848,MACS_peak_7,29.0,chrII,445061,447323,YBR102C,1,-,YBR102C,EXO84,Exocyst subunit with dual roles in exocytosis ...
4,chrII,477469,477470,MACS_peak_8,31.0,chrII,477670,479047,YBR118W,1,+,YBR118W,TEF2,Translational elongation factor EF-1 alpha; in...
5,chrII,477469,477470,MACS_peak_8,31.0,chrII,474391,476437,YBR117C,1,-,YBR117C,TKL2,Transketolase; catalyzes conversion of xylulos...
6,chrIV,974483,974484,MACS_peak_22,30.0,chrIV,974630,975782,YDR259C,1,-,YDR259C,YAP6,Basic leucine zipper (bZIP) transcription fact...
7,chrIV,974483,974484,MACS_peak_22,30.0,chrIV,971807,974243,YDR258C,1,-,YDR258C,HSP78,Oligomeric mitochondrial matrix chaperone; coo...
8,chrIV,1357491,1357492,MACS_peak_24,29.0,chrIV,1357579,1358902,YDR449C,1,-,YDR449C,UTP6,Nucleolar protein; component of the small subu...
9,chrIV,1357491,1357492,MACS_peak_24,29.0,chrIV,1356064,1357369,YDR448W,1,+,YDR448W,ADA2,Transcription coactivator; component of the AD...
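
The difference between join types only shows up when a gene is missing from the lookup table. The toy example below is a sketch with made-up frames (`toy_nearest`, `toy_sgd`) and one invented systematic name, `YXX999W`; pandas defaults to an inner join when `how` is omitted, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Left table: two real systematic names from the output above plus one invented ID.
toy_nearest = pd.DataFrame({"itemRgb": ["YAL005C", "YBR101C", "YXX999W"]})
# Right table: a two-row stand-in for the SGD columns 3 (systematic) and 4 (standard).
toy_sgd = pd.DataFrame({3: ["YAL005C", "YBR101C"], 4: ["SSA1", "FES1"]})

# Default how="inner": rows without a match are silently dropped.
inner = toy_nearest.merge(toy_sgd, left_on="itemRgb", right_on=3)

# how="left": every row of the left table survives; unmatched rows get NaN.
left = toy_nearest.merge(toy_sgd, how="left", left_on="itemRgb", right_on=3,
                         indicator=True)

print(len(inner), len(left))
print(left[["itemRgb", 4, "_merge"]])
```

Filtering on `_merge == "left_only"` is a quick way to list genes that have no annotation.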


It looks like we have some genes with the same underlying function -- but are there more than we expect by chance?

We can print the systematic names of all the candidate Hsf1 target genes and submit that list to a Gene Ontology enrichment search.

In [29]:
print(nearest_df['itemRgb'].to_string(index=False))

YAL005C
  YAL004W
  YBR101C
  YBR102C
  YBR118W
  YBR117C
  YDR259C
  YDR258C
  YDR449C
  YDR448W
YIR018C-A
  YIR018W
  YEL033W
YEL034C-A
  YER103W
  YER102W
  YFR028C
  YFR029W
  YGL222C
  YGL223C
  YGL073W
  YGL074C
  YGR034W
  YGR033C
  YGR141W
  YGR142W
  YGR211W
  YGR210C
  YHR071W
YHR070C-A
  YJR115W
  YJR116W
  YLL038C
  YLL039C
  YLL027W
  YLL026W
  YLL024C
  YLL023C
YLR162W-A
  YLR162W
  YLR347C
  YLR346C
  YNL141W
  YNL142W
  YNL006W
  YNL007C
YOR298C-A
  YOR299W
  YPL250C
YPL249C-A
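
The "more than we expect by chance" question is commonly framed as a hypergeometric test: if we draw our gene list at random from the genome, how likely is it to contain at least this many genes with a given annotation? The sketch below shows that calculation with the standard library only. The function name `hypergeom_tail` is made up, all four input numbers are hypothetical placeholders rather than results from this analysis, and a given Gene Ontology tool may use a different test or apply multiple-testing corrections.

```python
from math import comb

def hypergeom_tail(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes without replacement from
    `population` genes, of which `annotated` carry the annotation."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, x) * comb(population - annotated, sample - x)
        for x in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical placeholders: genome size, annotated genes, list size, annotated genes in list.
p_value = hypergeom_tail(population=6000, annotated=60, sample=50, hits=5)
print(f"{p_value:.3g}")
```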


## Motif Searching

We can also extract the DNA sequences around the Hsf1 summits to look for an enriched DNA motif.

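First, we need to define the regions of interest. The summit itself is just 1 nt long, so we extend it by 100 bases in each direction using the pybedtools `slop` method. The genome file passed as `g` lists chromosome sizes, so intervals are not extended past the chromosome ends.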

In [8]:
#top25bed.head()
motif_bed = top25bed.slop(b=100, 
                         g="../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113_genome.txt")
motif_bed.head()

chrI	141672	141873	MACS_peak_3	105.0
 chrXII	490120	490321	MACS_peak_60	86.0
 chrVII	914919	915120	MACS_peak_40	59.0
 chrVI	210270	210471	MACS_peak_32	53.0
 chrXII	97608	97809	MACS_peak_57	51.0
 chrVII	772032	772233	MACS_peak_38	48.0
 chrXIV	619805	620006	MACS_peak_72	45.0
 chrXII	88266	88467	MACS_peak_56	45.0
 chrXVI	74997	75198	MACS_peak_84	40.0
 chrXIV	359262	359463	MACS_peak_68	39.0
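
The arithmetic behind this step is simple enough to sketch by hand. The function below (given the same name, `slop`, purely for illustration) widens a half-open BED interval and clips it to the chromosome. The chrI length is an assumed value; in the real call it comes from the genome file. The second summit is invented to show the clipping.

```python
def slop(start, end, b, chrom_size):
    """Extend a half-open BED interval by b bases on each side, clipped to the chromosome."""
    return max(0, start - b), min(chrom_size, end + b)

CHRI_SIZE = 230218  # assumed length of chrI; the real value is read from the genome file

# The top summit from the table above: chrI 141772-141773.
print(slop(141772, 141773, b=100, chrom_size=CHRI_SIZE))  # (141672, 141873)

# A made-up summit close to the chromosome start shows the clipping at 0.
print(slop(40, 41, b=100, chrom_size=CHRI_SIZE))  # (0, 141)
```

Each unclipped region is 201 bases long: the summit base plus 100 on each side.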
 

Now, we can extract the DNA sequence of each motif region from the reference genome FASTA file.

In [14]:
res = motif_bed.sequence(fi="../S288C_R64-2-1/S288C_reference_sequence_R64-2-1_20150113.fsa")
# res.seqfn is the path of the FASTA file that sequence() wrote
with open(res.seqfn) as fasta:
    print(fasta.read())

>chrI:141672-141873
GCCGATGGAACGTTCTGGAAAAAGAAGAATAATTTAATTACTTTCTCAACTAAAATCTGGAGAAAAAACGCAAATGACAGCTTCTAAACGTTCCGTGTGCTTTCTTTCTAGAATGTTCTGGAAAGTTTACAACAATCCACAAGAACGAAAATGCCGTTGACAATGATGAAACCATCATCCACACACCGCGCACACGTGCTT
>chrXII:490120-490321
TCTCGTACTAAGTTCAATTACTATTGCGGTAACATTCATCAGTAGGGTAAAACTAACCTGTCTCACGACGGTCTAAACCCAGCTCACGTTCCCTATTAGTGGGTGAACAATCCAACGCTTACCGAATTCTGCTTCGGTATGATAGGAAGAGCCGACATCGAAGAATCAAAAAGCAATGTCGCTATGAACGCTTGACTGCCA
>chrVII:914919-915120
CTATTCGAGAAAAAAAAAAAAAGGCATCGAGTGAATTTTTCACCTTGATAAAAAAGCCCTTACTAACCCTACAATAAATTGTGCCGAAACCCTCTGGAGTTTTCTAGAATATTCTAGCCCCATCAGGGCTAGAATATCCTAAAAGTTTATAGTTGACGAAAATTTTTCAGCGATGAGATGCACATTTATAATGCTATGATG
>chrVI:210270-210471
TAGCAATGGCCTTCAAATGCATATCTCTACTATCGGCTAAAAAACGAATGACTCACGTTATCAGGCTCATAGCTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGATTGTTGTTCTAGTCGCTTGCTTTATAAAGTAACGACACTTTCTGGTGCCAATATGTGAAAACGC
>chrXII:97608-97809
GCGGCTTATATATAACAATTCGTCCACACCTTCCCATAGTGCTTAAGAATGAAATTTCGTCAAAACCTCGAAGCTTTTTTTTTCGAA
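
As a first, rough look for enriched motifs, one could count short subsequences (k-mers) across the extracted regions. The sketch below is a minimal standard-library illustration, not a substitute for a dedicated motif finder. The helper names `parse_fasta` and `count_kmers` are made up, and the two toy sequences are invented rather than taken from the output above.

```python
from collections import Counter

def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

def count_kmers(sequences, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Invented toy input; in the notebook the text would come from the file at res.seqfn.
toy_fasta = """>region_1
TTCTAGAA
>region_2
GGTTCTAG
"""
records = parse_fasta(toy_fasta)
kmer_counts = count_kmers(records.values(), k=4)
print(kmer_counts.most_common(3))
```

Raw counts favor k-mers that are common everywhere in the genome, so a real analysis would compare them against background regions.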