## Hsf1 binding motif

We have a table of Hsf1 ChIP-Seq peaks, including the name of the chromosome and the position of the "summit" within the peak.

In earlier classes, we learned how to read in the yeast genome sequence.

Today, we'll look up the genome sequence around each ChIP-Seq summit and write our own Fasta-format file that we can use as input to MEME, the motif search tool.

We'll use pandas quite a bit here.

We saw last time that the "peaks" file from MACS contained all the information we wanted, with nicely named columns.

We'll read in the peaks file `ChIP_1M_peaks.xls` as a pandas `DataFrame` and store it in the `peaks` variable.
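Here is a minimal sketch of that step, assuming the MACS peaks file is tab-delimited text with `#`-prefixed comment lines at the top. To keep it runnable anywhere, it reads from an in-memory string with made-up values; the column names are assumptions and may differ from the real file. In class, you would pass the path to `ChIP_1M_peaks.xls` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Stand-in for the MACS peaks file: made-up values, assumed column names.
example_text = """# This file is generated by MACS
# (comment lines start with '#')
chr\tstart\tend\tabs_summit\tfold_enrichment\tname
chrI\t141600\t141950\t141773\t12.5\tpeak_1
chrII\t5000\t5400\t5210\t30.1\tpeak_2
chrIV\t90000\t90300\t90120\t7.8\tpeak_3
"""

# sep="\t" splits on tabs; comment="#" skips the comment lines.
peaks = pd.read_csv(io.StringIO(example_text), sep="\t", comment="#")
print(peaks)
```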

Use `sort_values()` to sort peaks according to their significance. Put the most significant peak first by setting the `ascending` parameter to `False`.
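As a small sketch of the sort, the toy table below uses a `fold_enrichment` column as the measure of significance. That column name is an assumption; use whichever significance column your `peaks` table actually contains.

```python
import pandas as pd

# Toy table with made-up values; `fold_enrichment` is an assumed column name.
peaks = pd.DataFrame({
    "name": ["peak_1", "peak_2", "peak_3"],
    "fold_enrichment": [12.5, 30.1, 7.8],
})

# ascending=False puts the largest (most significant) value first.
peaks = peaks.sort_values("fold_enrichment", ascending=False)
print(peaks)
```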

We can use the `head` function to take only the "best" peaks. We'll test our code on the best 5 peaks from the mini ChIP-Seq data set here in class today.
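For example, on a toy table that is already sorted (made-up values, assumed column names), `head(5)` keeps just the first five rows:

```python
import pandas as pd

# Toy table, already sorted from most to least significant (made-up values).
peaks = pd.DataFrame({
    "name": [f"peak_{i}" for i in range(1, 8)],
    "fold_enrichment": [40.0, 33.2, 30.1, 21.7, 12.5, 9.9, 7.8],
})

# head(5) returns a new DataFrame holding only the first five rows.
best_peaks = peaks.head(5)
print(best_peaks)
```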

We will extract the DNA sequence in a somewhat arbitrary window of ~100 base pairs, 50 on each side of the actual summit.

To do this, define two new columns in the `peaks` data frame, one containing the start of the motif search region and the other containing the end of the region. We'll call these `motif_start` and `motif_end`.
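One way to sketch this uses column arithmetic, which pandas applies to every row at once. The summit column is assumed here to be called `abs_summit`, and the values are made up for illustration.

```python
import pandas as pd

# Toy table; `abs_summit` is an assumed name for the summit position column.
peaks = pd.DataFrame({
    "chr": ["chrI", "chrII"],
    "abs_summit": [141773, 5210],
})

half_window = 50  # base pairs on each side of the summit

# Arithmetic on a column is applied to every row.
peaks["motif_start"] = peaks["abs_summit"] - half_window
peaks["motif_end"] = peaks["abs_summit"] + half_window
print(peaks)
```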

We want to _loop_ over each row of the `DataFrame`.

The `itertuples()` method gives us an iterator over the rows. Each row is a `namedtuple` with named fields taken from the column names in the DataFrame.

To illustrate this, loop over each peak and print the name and the location (chromosome and start to end coordinates).
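A possible version of this loop is sketched below on a toy table. The column names `name` and `chr` are assumptions, and each one is read as an attribute of the row.

```python
import pandas as pd

# Toy table with made-up values and assumed column names.
peaks = pd.DataFrame({
    "name": ["peak_1", "peak_2"],
    "chr": ["chrI", "chrII"],
    "motif_start": [141723, 5160],
    "motif_end": [141823, 5260],
})

locations = []
for row in peaks.itertuples():
    # Each column is available as a named field on the row.
    location = f"{row.name}: {row.chr}:{row.motif_start}-{row.motif_end}"
    locations.append(location)
    print(location)
```

Note that `itertuples()` can only use a column name as a field name when it is a valid Python identifier; other columns are given positional names instead.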

To extract the sequences from these regions of the genome, we need to refer back to the yeast genome sequence.

Below, we import `SeqIO` from the `Bio` package.

I've also provided the name of the fasta file of chromosome sequences.
```
chrom_file = '/home/jovyan/shared/MCB280A_data/S288C_R64-3-1/S288C_reference_sequence_R64-3-1_20210421.fsa'
```

In [None]:
from Bio import SeqIO

chrom_file = '/home/jovyan/shared/MCB280A_data/S288C_R64-3-1/S288C_reference_sequence_R64-3-1_20210421.fsa'

As a starting point, we'll extract the sequence for one _manually defined_ genomic region.
```
motif_chr = 'chrI'
motif_start = 141723
motif_end = 141823
```

We'll loop through each chromosome, test whether its name (the `.name` field) matches our chromosome of interest, and then slice out the desired region from the sequence (in the `.seq` field).
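The loop might look like the sketch below. To stay self-contained it parses a tiny made-up Fasta text from memory, with short sequences and small coordinates chosen for illustration; with the real genome you would pass `chrom_file` to `SeqIO.parse()` and use the coordinates given above.

```python
import io

from Bio import SeqIO

# Tiny made-up "genome" standing in for the real chromosome file.
toy_fasta = ">chrI\nACGTACGTACGTACGTACGT\n>chrII\nTTTTGGGGCCCCAAAATTTT\n"

# Illustrative region on the toy genome (not the real coordinates).
motif_chr = "chrII"
motif_start = 4
motif_end = 12

region = None
for chrom in SeqIO.parse(io.StringIO(toy_fasta), "fasta"):
    if chrom.name == motif_chr:
        # Python slices are 0-based and exclude the end position.
        region = chrom.seq[motif_start:motif_end]

print(region)
```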

In [None]:
motif_chr = 'chrI'
motif_start = 141723
motif_end = 141823

Next, instead of manually defining the region, we'll loop over the rows in `peaks` and use the values in the table for each sequence.

This works okay, but it's inefficient because we read the genome file once for each peak. Instead, we can read the genome just once and store the sequence in a `dict`. We'll use the chromosome names as the _key_ and the sequence as the _value_.

Build this dictionary in a variable named `genome`.
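A minimal sketch of building the dictionary is shown below, again using a tiny made-up Fasta text in place of `chrom_file` so that it runs on its own.

```python
import io

from Bio import SeqIO

# Tiny made-up "genome" standing in for the real chromosome file.
toy_fasta = ">chrI\nACGTACGTACGTACGTACGT\n>chrII\nTTTTGGGGCCCCAAAATTTT\n"

genome = {}
for chrom in SeqIO.parse(io.StringIO(toy_fasta), "fasta"):
    # Key: chromosome name. Value: the full chromosome sequence.
    genome[chrom.name] = chrom.seq

print(list(genome.keys()))
```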

Now, we can use the `.get()` method on the `genome` dictionary to look up any chromosome sequence we want, very quickly.

Adapt the loop above to use this more efficient approach.
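One possible adaptation is sketched here with toy data. Plain Python strings stand in for the chromosome sequences (slicing works the same way), and the `peaks` column names are assumptions.

```python
import pandas as pd

# Toy stand-ins (made-up data) for the `genome` dictionary and `peaks` table.
genome = {
    "chrI": "ACGTACGTACGTACGTACGT",
    "chrII": "TTTTGGGGCCCCAAAATTTT",
}
peaks = pd.DataFrame({
    "name": ["peak_1", "peak_2"],
    "chr": ["chrI", "chrII"],
    "motif_start": [2, 4],
    "motif_end": [10, 12],
})

sequences = {}
for row in peaks.itertuples():
    # .get() returns None when the chromosome name is not in the dictionary.
    chrom_seq = genome.get(row.chr)
    if chrom_seq is None:
        continue
    sequences[row.name] = chrom_seq[row.motif_start:row.motif_end]
    print(row.name, sequences[row.name])
```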

Now, let's write these sequences into our own Fasta file.

We'll need to create our own `Seq` and `SeqRecord` objects. To do this, we'll need to import some more parts of biopython:
```
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
```

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

After doing this, we can create our own sequence object by writing `Seq(...)`

We can also create our own sequence record: a sequence object plus a _name_ in the `id` field.
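For instance, with a made-up sequence and a made-up name:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A sequence object built from a plain string (made-up sequence).
site_seq = Seq("TTCTAGAAGCTTCCA")

# A sequence record: the sequence plus a name in the `id` field.
# description="" keeps the Fasta header to just the id when written out.
site_record = SeqRecord(site_seq, id="example_site", description="")

print(site_record.id, site_record.seq, len(site_record.seq))
```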

Finally, we can use the `SeqIO.write()` function to write a list of `SeqRecord`s into a file. This function has three parameters:
```
SeqIO.write(list-of-sequences, filename, format)
```
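As a small sketch of the call, the example below writes two made-up records to a file in a temporary directory and reads the text back. The file name `example_sites.fa` is hypothetical.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records to write out.
records = [
    SeqRecord(Seq("TTCTAGAAGCTTCCA"), id="site_1", description=""),
    SeqRecord(Seq("GGAACGTTCTAGAAC"), id="site_2", description=""),
]

# Hypothetical output file in a temporary directory.
out_file = os.path.join(tempfile.mkdtemp(), "example_sites.fa")

# SeqIO.write() returns the number of records it wrote.
n_written = SeqIO.write(records, out_file, "fasta")

with open(out_file) as handle:
    fasta_text = handle.read()
print(fasta_text)
```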

Now, we will construct a list called `sites` containing a `SeqRecord` for each ChIP-Seq peak. Then, we'll write the list into a file called `"Hsf1_sites.fa"`. 
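Putting the pieces together, one possible sketch follows. It uses a toy `genome` dictionary and a toy `peaks` table with made-up data and assumed column names, and it writes into a temporary directory so that it runs anywhere. In class, you would use the real `genome` and `peaks` and write straight to `"Hsf1_sites.fa"`.

```python
import os
import tempfile

import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy stand-ins (made-up data) for the `genome` dictionary and `peaks` table.
genome = {
    "chrI": Seq("ACGTACGTACGTACGTACGT"),
    "chrII": Seq("TTTTGGGGCCCCAAAATTTT"),
}
peaks = pd.DataFrame({
    "name": ["peak_1", "peak_2"],
    "chr": ["chrI", "chrII"],
    "motif_start": [2, 4],
    "motif_end": [10, 12],
})

sites = []
for row in peaks.itertuples():
    chrom_seq = genome.get(row.chr)
    if chrom_seq is None:
        continue  # skip peaks on chromosomes missing from the dictionary
    site_seq = chrom_seq[row.motif_start:row.motif_end]
    # Use the peak name as the record id; an empty description keeps headers short.
    sites.append(SeqRecord(site_seq, id=row.name, description=""))

out_file = os.path.join(tempfile.mkdtemp(), "Hsf1_sites.fa")
n_written = SeqIO.write(sites, out_file, "fasta")
print(n_written, "sequences written to", out_file)
```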