A BED file is a tab-delimited text file with a fixed column layout, so we can read it in using Pandas.

In [2]:
import pandas as pd

We can read in this data using the `read_csv()` function in Pandas, with `sep="\t"` because it's tab-delimited.

By default, `read_csv()` uses the first line of the file for column names. BED files have no header line, so the first peak ends up as the column names instead of as a row of data.

In [4]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t")
print(peaks.head())

    chrI   72280   73586  MACS_peak_1   70.72
0   chrI  100877  101489  MACS_peak_2   74.70
1   chrI  141083  142278  MACS_peak_3  574.05
2   chrI  222800  223346  MACS_peak_4   90.32
3  chrII  137479  138048  MACS_peak_5   71.61
4  chrII  145391  146109  MACS_peak_6  112.91
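
One way to confirm that a row has gone missing is to compare the number of lines in the file with the number of rows in the data frame. The sketch below uses a small in-memory sample (`bed_text`, a hypothetical stand-in built from the first three peaks shown above) rather than the real file.

```python
import io
import pandas as pd

# Hypothetical three-line BED sample, copied from the first peaks shown above.
bed_text = (
    "chrI\t72280\t73586\tMACS_peak_1\t70.72\n"
    "chrI\t100877\t101489\tMACS_peak_2\t74.70\n"
    "chrI\t141083\t142278\tMACS_peak_3\t574.05\n"
)

n_lines = len(bed_text.splitlines())

# Default behaviour: the first line is consumed as the header.
with_default = pd.read_csv(io.StringIO(bed_text), sep="\t")

print(n_lines)                     # number of peaks in the text
print(len(with_default))           # one fewer: the first peak became the header
print(list(with_default.columns))  # the first peak's values, as strings
```

With the default settings the data frame has one row fewer than the file has lines, and the column names are the values of the first peak.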


We can use `header=None` to indicate that there is no header, so that the first line is read as data.

In [5]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t", header=None)
print(peaks.head())

       0       1       2            3       4
0   chrI   72280   73586  MACS_peak_1   70.72
1   chrI  100877  101489  MACS_peak_2   74.70
2   chrI  141083  142278  MACS_peak_3  574.05
3   chrI  222800  223346  MACS_peak_4   90.32
4  chrII  137479  138048  MACS_peak_5   71.61


This gives us numbered columns. We can give the columns our own names instead with the `names=[...]` parameter.

The first five columns of a BED file are:
1. `chrom` is the name of the chromosome
1. `start` is the starting position of the feature in 0-based coordinates
1. `end` is the ending position of the feature in Python style: the first position _after_ the end, so the interval from `start` to `end` is half-open
1. `name` is the name of the feature
1. `score` can be used to provide a numeric score. MACS uses this to score peaks.
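
The coordinate convention is easiest to see with a small worked example. This sketch uses the first peak from the output above; `sequence` is a made-up string standing in for a chromosome sequence.

```python
# First peak from the output above: chrI 72280 73586
start, end = 72280, 73586

# Because `end` is the first position after the feature,
# the length is simply end - start.
length = end - start
print(length)  # 1306

# The same feature in 1-based, inclusive coordinates
# (the convention used by GFF files, for example).
one_based_start, one_based_end = start + 1, end
print(one_based_start, one_based_end)  # 72281 73586

# BED coordinates can be used directly as a Python slice.
sequence = "ACGTACGTAC"  # hypothetical sequence
print(sequence[2:5])     # GTA (3 bases = 5 - 2)
```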

In [8]:
peaks = pd.read_csv(
    "ChIP_1M_peaks.bed",
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "name", "score"],
)
print(peaks.head())

   chrom   start     end         name   score
0   chrI   72280   73586  MACS_peak_1   70.72
1   chrI  100877  101489  MACS_peak_2   74.70
2   chrI  141083  142278  MACS_peak_3  574.05
3   chrI  222800  223346  MACS_peak_4   90.32
4  chrII  137479  138048  MACS_peak_5   71.61


The peaks are listed in order of their genomic position. To look at the "best" peaks, we sort them by score, with the highest score first.

In [10]:
peaks_sorted = peaks.sort_values(by="score", ascending=False)
print(peaks_sorted.head())

     chrom   start     end          name   score
59  chrXII  489239  490823  MACS_peak_60  926.51
2     chrI  141083  142278   MACS_peak_3  574.05
31   chrVI  210170  210861  MACS_peak_32  353.14
37  chrVII  771714  772791  MACS_peak_38  315.88
34  chrVII  368400  371541  MACS_peak_35  309.39
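
Note that the index on the left (59, 2, 31, ...) still shows each peak's original row position. As a minimal sketch, assuming we only need the top few peaks, `nlargest()` gives the same ordering without sorting the whole table, and `reset_index(drop=True)` renumbers a sorted frame from 0. The `peaks_sample` frame here is a hypothetical five-row stand-in built from the values shown earlier.

```python
import pandas as pd

# Hypothetical sample: names and scores of the first five peaks shown earlier.
peaks_sample = pd.DataFrame({
    "name": ["MACS_peak_1", "MACS_peak_2", "MACS_peak_3",
             "MACS_peak_4", "MACS_peak_5"],
    "score": [70.72, 74.70, 574.05, 90.32, 71.61],
})

# The two highest-scoring peaks; the original index labels are kept.
top2 = peaks_sample.nlargest(2, "score")
print(top2)

# Full sort with the index renumbered, so row 0 is the best peak.
ranked = peaks_sample.sort_values(by="score", ascending=False).reset_index(drop=True)
print(ranked.head(2))
```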


We can also look at the "summits", which are 1-base-wide features within the peaks.

In [13]:
summits = pd.read_csv(
    "ChIP_1M_summits.bed",
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "name", "score"],
)
summits_sorted = summits.sort_values(by="score", ascending=False)
print(summits_sorted.head())

     chrom   start     end          name  score
2     chrI  141772  141773   MACS_peak_3  105.0
59  chrXII  490220  490221  MACS_peak_60   86.0
39  chrVII  915019  915020  MACS_peak_40   59.0
31   chrVI  210370  210371  MACS_peak_32   53.0
56  chrXII   97708   97709  MACS_peak_57   51.0
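
Since each summit carries the name of its peak, the two tables can be joined on `name` to see where each summit falls within its peak. This is a sketch under that assumption, using small hypothetical frames (`peaks_sample`, `summits_sample`) built from three of the rows shown above rather than the full files.

```python
import pandas as pd

# Hypothetical samples copied from the peak and summit outputs above.
peaks_sample = pd.DataFrame({
    "chrom": ["chrI", "chrVI", "chrXII"],
    "start": [141083, 210170, 489239],
    "end": [142278, 210861, 490823],
    "name": ["MACS_peak_3", "MACS_peak_32", "MACS_peak_60"],
})
summits_sample = pd.DataFrame({
    "name": ["MACS_peak_3", "MACS_peak_32", "MACS_peak_60"],
    "summit": [141772, 210370, 490220],  # the summit's `start` column
})

# Join on the shared peak name.
merged = peaks_sample.merge(summits_sample, on="name")

# Distance of the summit from the left edge of its peak.
merged["offset"] = merged["summit"] - merged["start"]

# Half-open check: start <= summit < end.
merged["inside"] = (merged["start"] <= merged["summit"]) & (merged["summit"] < merged["end"])

print(merged[["name", "offset", "inside"]])
```

For these three peaks the summit offsets are 689, 200, and 981 bases from the peak start, and each summit lies inside its peak.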
