BED files are tab-delimited text files, so we can read them in using Pandas.

In [1]:
import pandas as pd

We can read in this data using the `read_csv()` function in Pandas, with `sep="\t"` because it's tab-delimited.

By default, `read_csv()` uses the first line of the file for column names. A BED file has no header line, so the first peak ends up as the column names instead of as a row of data.

In [4]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t")
peaks.head()

Unnamed: 0,chrI,72280,73586,MACS_peak_1,70.72
0,chrI,100877,101489,MACS_peak_2,74.7
1,chrI,141083,142278,MACS_peak_3,574.05
2,chrI,222800,223346,MACS_peak_4,90.32
3,chrII,137479,138048,MACS_peak_5,71.61
4,chrII,145391,146109,MACS_peak_6,112.91
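
As a minimal sketch of what goes wrong here, the snippet below reads two made-up BED records from an in-memory string rather than from `ChIP_1M_peaks.bed`. The records and the names `bed_text` and `default_read` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Two made-up BED records (hypothetical data, for illustration only)
bed_text = "chrI\t100\t200\tpeak_a\t5.0\nchrI\t300\t450\tpeak_b\t7.5\n"

# Default behaviour: the first line is treated as the header
default_read = pd.read_csv(io.StringIO(bed_text), sep="\t")

print(len(default_read))           # number of data rows that survived
print(list(default_read.columns))  # the first record, misread as column names
```

Only one of the two records comes back as data; the other has been used up as the header.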


We can use `header=None` to indicate that there is no header.

In [5]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t", header=None)
peaks.head()

Unnamed: 0,0,1,2,3,4
0,chrI,72280,73586,MACS_peak_1,70.72
1,chrI,100877,101489,MACS_peak_2,74.7
2,chrI,141083,142278,MACS_peak_3,574.05
3,chrI,222800,223346,MACS_peak_4,90.32
4,chrII,137479,138048,MACS_peak_5,71.61


This gives us numbered columns. We can give the columns our own names instead with the `names=[...]` parameter.

The first five columns of a BED file are:
1. `chrom` is the name of the chromosome
2. `start` is the starting position of the feature in 0-based coordinates
3. `end` is the ending position of the feature, exclusive as in a Python slice: the first position _after_ the last base of the feature
4. `name` is the name of the feature
5. `score` can be used to provide a numeric score. MACS uses this to score peaks.

In [14]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t", header=None,
                    names=["chrom", "start", "end", "name", "score"])
peaks.head()

Unnamed: 0,chrom,start,end,name,score
0,chrI,72280,73586,MACS_peak_1,70.72
1,chrI,100877,101489,MACS_peak_2,74.7
2,chrI,141083,142278,MACS_peak_3,574.05
3,chrI,222800,223346,MACS_peak_4,90.32
4,chrII,137479,138048,MACS_peak_5,71.61
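
Because `start` is 0-based and `end` is exclusive, the length of a feature is simply `end - start`, just like the length of a Python slice. The sketch below works this out for the first two peaks shown above, read from an in-memory string; the names `peaks_demo`, `length`, and `start_1based` are assumptions introduced for illustration.

```python
import io
import pandas as pd

# First two peaks from the output above, as tab-delimited text
bed_text = (
    "chrI\t72280\t73586\tMACS_peak_1\t70.72\n"
    "chrI\t100877\t101489\tMACS_peak_2\t74.7\n"
)
peaks_demo = pd.read_csv(io.StringIO(bed_text), sep="\t", header=None,
                         names=["chrom", "start", "end", "name", "score"])

# Half-open interval: length is end minus start, with no +1 needed
peaks_demo["length"] = peaks_demo["end"] - peaks_demo["start"]

# In 1-based coordinates the first base is start + 1;
# the last base keeps the same number as the BED end
peaks_demo["start_1based"] = peaks_demo["start"] + 1

print(peaks_demo[["name", "length", "start_1based", "end"]])
```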


The peaks are listed in order of their genomic position. We want to look at the "best" peaks, and so we want to sort them by score, with the highest score first.

In [17]:
peaks_sorted = peaks.sort_values(by="score", ascending=False)
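
As a sketch of how the sorted table might be used, the snippet below sorts the five peaks shown earlier and keeps the top three with `head()`. It reads the records from an in-memory string, and the names `peaks_demo`, `sorted_demo`, and `top3` are assumptions introduced for illustration.

```python
import io
import pandas as pd

# The five peaks shown earlier, as tab-delimited text
bed_text = (
    "chrI\t72280\t73586\tMACS_peak_1\t70.72\n"
    "chrI\t100877\t101489\tMACS_peak_2\t74.7\n"
    "chrI\t141083\t142278\tMACS_peak_3\t574.05\n"
    "chrI\t222800\t223346\tMACS_peak_4\t90.32\n"
    "chrII\t137479\t138048\tMACS_peak_5\t71.61\n"
)
peaks_demo = pd.read_csv(io.StringIO(bed_text), sep="\t", header=None,
                         names=["chrom", "start", "end", "name", "score"])

# Highest score first; sort_values returns a new DataFrame
sorted_demo = peaks_demo.sort_values(by="score", ascending=False)

# Keep the three best peaks; the index still shows the original row positions
top3 = sorted_demo.head(3)
print(top3)
```

Note that sorting does not renumber the rows, so the index on the left records where each peak sat in the position-ordered file.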

We can also look at the "summits", which are 1-base-wide features within the peaks.

In [18]:
summits = pd.read_csv("ChIP_1M_summits.bed", sep="\t", header=None,
                      names=["chrom", "start", "end", "name", "score"])
summits.head()

Unnamed: 0,chrom,start,end,name,score
0,chrI,73174,73175,MACS_peak_1,13.0
1,chrI,101289,101290,MACS_peak_2,19.0
2,chrI,141772,141773,MACS_peak_3,105.0
3,chrI,223145,223146,MACS_peak_4,20.0
4,chrII,137670,137671,MACS_peak_5,17.0
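
In the output above, each summit carries the same `name` as its peak. Assuming that holds throughout the files, one way to check that every summit is 1 base wide and to see where it falls inside its peak is to merge the two tables on `name`. The sketch below does this for the first three records, read from in-memory strings; `peaks_demo`, `summits_demo`, `merged`, and `summit_offset` are names introduced for illustration.

```python
import io
import pandas as pd

cols = ["chrom", "start", "end", "name", "score"]

# First three peaks and summits from the outputs above
peaks_text = (
    "chrI\t72280\t73586\tMACS_peak_1\t70.72\n"
    "chrI\t100877\t101489\tMACS_peak_2\t74.7\n"
    "chrI\t141083\t142278\tMACS_peak_3\t574.05\n"
)
summits_text = (
    "chrI\t73174\t73175\tMACS_peak_1\t13.0\n"
    "chrI\t101289\t101290\tMACS_peak_2\t19.0\n"
    "chrI\t141772\t141773\tMACS_peak_3\t105.0\n"
)
peaks_demo = pd.read_csv(io.StringIO(peaks_text), sep="\t",
                         header=None, names=cols)
summits_demo = pd.read_csv(io.StringIO(summits_text), sep="\t",
                           header=None, names=cols)

# Every summit should span exactly one base (end - start == 1)
all_one_base = ((summits_demo["end"] - summits_demo["start"]) == 1).all()
print(all_one_base)

# Pair each peak with its summit; shared column names get suffixes
merged = peaks_demo.merge(summits_demo, on="name",
                          suffixes=("_peak", "_summit"))

# Distance of the summit from the start of its peak, in bases
merged["summit_offset"] = merged["start_summit"] - merged["start_peak"]
print(merged[["name", "start_peak", "end_peak", "summit_offset"]])
```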


In [19]:
!pip3 install pybedtools

Collecting pybedtools
Collecting pysam (from pybedtools)
Installing collected packages: pysam, pybedtools
Successfully installed pybedtools-0.8.0 pysam-0.15.3


In [20]:
import pybedtools
pybedtools.helpers.set_bedtools_path("/home/jovyan/mcb200-2019/bedtools2/bin")

In [22]:
summits_bed = pybedtools.BedTool.from_dataframe(summits)

In [23]:
summits_bed

<BedTool(/tmp/pybedtools.dym908cp.tmp)>

In [24]:
print(summits_bed)

chrI	73174	73175	MACS_peak_1	13.0
chrI	101289	101290	MACS_peak_2	19.0
chrI	141772	141773	MACS_peak_3	105.0
chrI	223145	223146	MACS_peak_4	20.0
chrII	137670	137671	MACS_peak_5	17.0
chrII	145916	145917	MACS_peak_6	25.0
chrII	444847	444848	MACS_peak_7	29.0
chrII	477469	477470	MACS_peak_8	31.0
chrII	478725	478726	MACS_peak_9	19.0
chrIII	259	260	MACS_peak_10	26.0
chrIII	57066	57067	MACS_peak_11	20.0
chrIII	90767	90768	MACS_peak_12	13.0
chrIII	137449	137450	MACS_peak_13	19.0
chrIII	227880	227881	MACS_peak_14	12.0
chrIV	149140	149141	MACS_peak_15	13.0
chrIV	369270	369271	MACS_peak_16	19.0
chrIV	417047	417048	MACS_peak_17	17.0
chrIV	465445	465446	MACS_peak_18	16.0
chrIV	678033	678034	MACS_peak_19	19.0
chrIV	806357	806358	MACS_peak_20	23.0
chrIV	892702	892703	MACS_peak_21	26.0
chrIV	974483	974484	MACS_peak_22	30.0
chrIV	1165315	1165316	MACS_peak_23	16.0
chrIV	1357491	1357492	MACS_peak_24	29.0
chrIX	325731	325732	MACS_peak_25	11.0
chrIX	387157	387158	MACS_peak_26	32.0
chrV	86498	86499	MACS_peak_