# Reading BED files into Pandas

BED format files are just a special kind of tab-delimited data file, and so we can read them in using Pandas. Start by importing pandas with `import pandas as pd`.

In [1]:
import pandas as pd

We can read in the peaks file, `ChIP_1M_peaks.bed`, using the `read_csv()` function in Pandas, with `sep="\t"` because it is tab-delimited.

```
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t")
```


In [3]:
pd.read_csv("ChIP_1M_peaks.bed", sep="\t")

| | chrI | 72280 | 73586 | MACS_peak_1 | 70.72 |
|---|---|---|---|---|---|
| 0 | chrI | 100877 | 101489 | MACS_peak_2 | 74.70 |
| 1 | chrI | 141083 | 142278 | MACS_peak_3 | 574.05 |
| 2 | chrI | 222800 | 223346 | MACS_peak_4 | 90.32 |
| 3 | chrII | 137479 | 138048 | MACS_peak_5 | 71.61 |
| 4 | chrII | 145391 | 146109 | MACS_peak_6 | 112.91 |
| ... | ... | ... | ... | ... | ... |
| 82 | chrXVI | 74544 | 75658 | MACS_peak_84 | 182.91 |
| 83 | chrXVI | 98512 | 99488 | MACS_peak_85 | 155.72 |
| 84 | chrXVI | 345490 | 347056 | MACS_peak_86 | 87.23 |
| 85 | chrXVI | 352005 | 352789 | MACS_peak_87 | 93.79 |


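By default, `read_csv()` uses the first line of the file for column names. BED files have no header line, so here the first peak (`MACS_peak_1`) has been used as the column names instead of being read as a row of data.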

We can use `header=None` to indicate that there is no header.

In [4]:
pd.read_csv("ChIP_1M_peaks.bed", sep="\t", header=None)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | chrI | 72280 | 73586 | MACS_peak_1 | 70.72 |
| 1 | chrI | 100877 | 101489 | MACS_peak_2 | 74.70 |
| 2 | chrI | 141083 | 142278 | MACS_peak_3 | 574.05 |
| 3 | chrI | 222800 | 223346 | MACS_peak_4 | 90.32 |
| 4 | chrII | 137479 | 138048 | MACS_peak_5 | 71.61 |
| ... | ... | ... | ... | ... | ... |
| 83 | chrXVI | 74544 | 75658 | MACS_peak_84 | 182.91 |
| 84 | chrXVI | 98512 | 99488 | MACS_peak_85 | 155.72 |
| 85 | chrXVI | 345490 | 347056 | MACS_peak_86 | 87.23 |
| 86 | chrXVI | 352005 | 352789 | MACS_peak_87 | 93.79 |
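
The effect of `header=None` can also be checked without the data file. The following is a minimal sketch that assumes the first three lines of the file look like the rows shown in the outputs above; the names `bed_text`, `with_default`, and `with_no_header` are made up for this illustration.

```python
import io

import pandas as pd

# Three tab-delimited lines, typed in from the outputs above (an assumption
# about the raw file contents), held in memory instead of read from disk.
bed_text = (
    "chrI\t72280\t73586\tMACS_peak_1\t70.72\n"
    "chrI\t100877\t101489\tMACS_peak_2\t74.70\n"
    "chrI\t141083\t142278\tMACS_peak_3\t574.05\n"
)

# Default behaviour: the first line is consumed as column names.
with_default = pd.read_csv(io.StringIO(bed_text), sep="\t")

# header=None: every line is data, and the columns are numbered from 0.
with_no_header = pd.read_csv(io.StringIO(bed_text), sep="\t", header=None)

print(len(with_default), list(with_default.columns))
print(len(with_no_header), list(with_no_header.columns))
```

With the default settings the table has one row fewer, because the first peak has been turned into column labels.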


This gives us numbered columns. We can give the columns our own names instead with the `names=[...]` parameter.

The first five columns of a BED file are:
1. `chrom` is the name of the chromosome
1. `start` is the starting position of the feature in 0-based coordinates
1. `end` is the ending position of the feature, exclusive as in a Python slice: it is the first position _after_ the end
1. `name` is the name of the feature
1. `score` can be used to provide a numeric score. MACS uses this to score peaks.

We can use the argument `names=["chrom", "start", "end", "name", "score"]` to set these names.

In [8]:
peaks = pd.read_csv("ChIP_1M_peaks.bed", sep="\t", header=None, names=["chrom", "start", "end", "name", "score"])
peaks

| | chrom | start | end | name | score |
|---|---|---|---|---|---|
| 0 | chrI | 72280 | 73586 | MACS_peak_1 | 70.72 |
| 1 | chrI | 100877 | 101489 | MACS_peak_2 | 74.70 |
| 2 | chrI | 141083 | 142278 | MACS_peak_3 | 574.05 |
| 3 | chrI | 222800 | 223346 | MACS_peak_4 | 90.32 |
| 4 | chrII | 137479 | 138048 | MACS_peak_5 | 71.61 |
| ... | ... | ... | ... | ... | ... |
| 83 | chrXVI | 74544 | 75658 | MACS_peak_84 | 182.91 |
| 84 | chrXVI | 98512 | 99488 | MACS_peak_85 | 155.72 |
| 85 | chrXVI | 345490 | 347056 | MACS_peak_86 | 87.23 |
| 86 | chrXVI | 352005 | 352789 | MACS_peak_87 | 93.79 |
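
Because `start` is 0-based and `end` is the first position after the feature, the length of a feature is simply `end - start`, with no `+ 1` correction. A small sketch of this, using the first two peaks typed in by hand; the `peaks_demo` table and the `length` column are names invented for the example.

```python
import pandas as pd

# The first two peaks from the table above, entered by hand so the
# example runs without the BED file.
peaks_demo = pd.DataFrame(
    {
        "chrom": ["chrI", "chrI"],
        "start": [72280, 100877],
        "end": [73586, 101489],
        "name": ["MACS_peak_1", "MACS_peak_2"],
        "score": [70.72, 74.70],
    }
)

# Half-open interval [start, end): the length is end minus start.
peaks_demo["length"] = peaks_demo["end"] - peaks_demo["start"]
print(peaks_demo[["name", "length"]])
```

The same line should work on the full `peaks` table, since it has the same column names.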


The peaks are listed in order of their genomic position. We want to look at the "best" peaks, and so we want to sort them by score, with the highest score first.

```
peaks_sorted = peaks.sort_values(by="score", ascending=False)
```

In [10]:
peaks_sorted = peaks.sort_values(by="score", ascending=False)
peaks_sorted

| | chrom | start | end | name | score |
|---|---|---|---|---|---|
| 59 | chrXII | 489239 | 490823 | MACS_peak_60 | 926.51 |
| 2 | chrI | 141083 | 142278 | MACS_peak_3 | 574.05 |
| 31 | chrVI | 210170 | 210861 | MACS_peak_32 | 353.14 |
| 37 | chrVII | 771714 | 772791 | MACS_peak_38 | 315.88 |
| 34 | chrVII | 368400 | 371541 | MACS_peak_35 | 309.39 |
| ... | ... | ... | ... | ... | ... |
| 29 | chrVI | 30953 | 31582 | MACS_peak_30 | 54.41 |
| 75 | chrXV | 370587 | 371406 | MACS_peak_76 | 52.26 |
| 72 | chrXIV | 762851 | 763388 | MACS_peak_73 | 52.18 |
| 52 | chrXII | 10859 | 11638 | MACS_peak_53 | 52.14 |
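
If only the few best peaks are needed, one option is to take the top of the sorted table with `head()`, or to ask for them directly with `nlargest()`. This is a sketch on four rows copied from the tables above; `peaks_demo`, `top_sorted`, and `top_nlargest` are illustrative names, and the choice of two rows is arbitrary.

```python
import pandas as pd

# Four peaks copied from the tables above.
peaks_demo = pd.DataFrame(
    {
        "chrom": ["chrI", "chrI", "chrXII", "chrVI"],
        "start": [72280, 141083, 489239, 30953],
        "end": [73586, 142278, 490823, 31582],
        "name": ["MACS_peak_1", "MACS_peak_3", "MACS_peak_60", "MACS_peak_30"],
        "score": [70.72, 574.05, 926.51, 54.41],
    }
)

# Sort by score, highest first, then keep the first two rows.
top_sorted = peaks_demo.sort_values(by="score", ascending=False).head(2)

# nlargest() selects the two highest-scoring rows in one step.
top_nlargest = peaks_demo.nlargest(2, "score")

print(top_sorted)
print(top_nlargest)
```

Both approaches keep the original row labels, so the selected peaks can still be traced back to their position in the unsorted table.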


We can also look at the "summits", which are 1-base-wide features within the peaks. Reading them works the same way as reading the peaks, just with a different filename:
```
summits = pd.read_csv("ChIP_1M_summits.bed", ...)
```

In [13]:
summits = pd.read_csv("ChIP_1M_summits.bed", sep="\t", header=None, names=["chrom", "start", "end", "name", "score"])
summits_sorted = summits.sort_values(by="score", ascending=False)
summits_sorted

| | chrom | start | end | name | score |
|---|---|---|---|---|---|
| 2 | chrI | 141772 | 141773 | MACS_peak_3 | 105.0 |
| 59 | chrXII | 490220 | 490221 | MACS_peak_60 | 86.0 |
| 39 | chrVII | 915019 | 915020 | MACS_peak_40 | 59.0 |
| 31 | chrVI | 210370 | 210371 | MACS_peak_32 | 53.0 |
| 56 | chrXII | 97708 | 97709 | MACS_peak_57 | 51.0 |
| ... | ... | ... | ... | ... | ... |
| 13 | chrIII | 227880 | 227881 | MACS_peak_14 | 12.0 |
| 85 | chrXVI | 345902 | 345903 | MACS_peak_86 | 12.0 |
| 63 | chrXIII | 24669 | 24670 | MACS_peak_64 | 11.0 |
| 29 | chrVI | 31287 | 31288 | MACS_peak_30 | 11.0 |
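
Since both files are read with the same arguments, the call could be wrapped in a small helper, and because each summit carries the name of its peak, the two tables can be paired up on `name`. The sketch below is one possible way to do this: `read_bed` is a hypothetical helper (not part of pandas), and the in-memory text assumes the raw lines match the rows shown in the outputs above.

```python
import io

import pandas as pd

BED_COLUMNS = ["chrom", "start", "end", "name", "score"]


def read_bed(path_or_buffer):
    """Hypothetical helper: read a headerless, 5-column BED file."""
    return pd.read_csv(path_or_buffer, sep="\t", header=None, names=BED_COLUMNS)


# Three peaks and their summits, typed in from the tables above.
peaks_text = (
    "chrI\t141083\t142278\tMACS_peak_3\t574.05\n"
    "chrVI\t30953\t31582\tMACS_peak_30\t54.41\n"
    "chrXII\t489239\t490823\tMACS_peak_60\t926.51\n"
)
summits_text = (
    "chrI\t141772\t141773\tMACS_peak_3\t105.0\n"
    "chrVI\t31287\t31288\tMACS_peak_30\t11.0\n"
    "chrXII\t490220\t490221\tMACS_peak_60\t86.0\n"
)

peaks_demo = read_bed(io.StringIO(peaks_text))
summits_demo = read_bed(io.StringIO(summits_text))

# Pair each peak with its summit by name; the suffixes keep the two
# sets of coordinate columns apart.
paired = peaks_demo.merge(summits_demo, on="name", suffixes=("_peak", "_summit"))

# With half-open coordinates, a summit lies inside its peak when
# start_peak <= start_summit and end_summit <= end_peak.
paired["inside"] = (paired["start_peak"] <= paired["start_summit"]) & (
    paired["end_summit"] <= paired["end_peak"]
)

# Distance of the summit from the left edge of the peak.
paired["offset"] = paired["start_summit"] - paired["start_peak"]

print(paired[["name", "offset", "inside"]])
```

With the real files, `read_bed("ChIP_1M_peaks.bed")` and `read_bed("ChIP_1M_summits.bed")` would take the place of the in-memory text.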
