<a name="top"></a>
# Introduction to Python Programming for Bioinformatics. Lesson 11

<details>
<summary>
About this notebook
</summary>

This notebook was originally written by [Marc Cohen](https://github.com/mco-gh), an engineer at Google. The original source can be found on [Marc's short link service](https://mco.fyi/) and starts with [Python lesson 0](https://mco.fyi/py0). I encourage you to work through that notebook if you find some details missing here.

Rob Edwards edited the notebook, adapted it for bioinformatics, using some simple geneticy examples, condensed it into a single notebook, and rearranged some of the lessons, so if some of it does not make sense, it is Rob's fault!

It is intended as a hands-on companion to an in-person course, and if you would like Rob to teach this course (or one of the other courses) don't hesitate to get in touch with him.

</details>
<details>
<summary>
Using this notebook
</summary>

You can download the original version of this notebook from [GitHub](https://linsalrob.github.io/ComputationalGenomicsManual/Python/Python_Lesson_9.ipynb) and from [Rob's Google Drive]()

**You should make your own copy of this notebook by selecting File->Save a copy in Drive from the menu bar above, and then you can edit the code and run it as your own**

There are several lessons, and you can do them in any order. I've tried to organise them in the order I think most appropriate, but you may disagree!

</details>

<a name="lessons"></a>
# Lesson Links

* [Lesson 11 - BioPython](#Lesson-11---BioPython)

Previous Lesson: [GitHub](Python_Lesson_9.ipynb) | [Google Colab](https://colab.research.google.com/drive/1JGRJpUPKkkVukyNvtfEJYVVCcdpkyRLZ)

Next Lesson: [GitHub](Python_Lesson_12.ipynb) | [Google Colab](https://colab.research.google.com/drive/1Qbff17-gZbktQliV6TaFFupDpfUiAnBE)

<!-- #region id="qXu_bY7yPpsS" -->

# Lesson 11 - BioPython

Earlier (in lessons 8 and 9), we talked about modules and using other people's code. One of the most important libraries for bioinformatics is called [BioPython](https://biopython.org/). There is a [complete tutorial on BioPython](https://biopython.org/DIST/docs/tutorial/Tutorial.html), and the BioPython group also provides [an excellent cookbook of recipes](https://biopython.org/wiki/Category%3ACookbook) that will help you out!

BioPython is designed to help with common biological problems, and is particularly good at:

* Parsing files
  * fasta
  * fastq
  * GenBank
* Manipulating sequences
  * Reverse complement
  * Translating
  * Aligning (wrappers to aligners)
  * Slicing
* Connecting to biological databases

**Before you carry on!** Make sure you have installed biopython by uploading the `requirements.txt` file and running the installation command:


In [None]:
!pip install -r requirements.txt

Now we can `import` biopython and create a new sequence object.
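
As a minimal sketch of what a sequence object gives you, the snippet below creates a `Seq` from a short made-up sequence (an assumption for illustration, not course data) and then slices and reverse complements it.

```python
from Bio.Seq import Seq

# A short, made-up sequence used only for illustration
dna = Seq("ATGAAATAG")

# Slicing works like slicing a Python string and returns another Seq
first_codon = dna[0:3]

# Reverse complement: complement each base, then reverse the order
revcomp = dna.reverse_complement()

print(first_codon)
print(revcomp)
```

A `Seq` behaves much like a string (you can take its `len()`, slice it, and loop over it), but it also carries biology-aware methods such as `reverse_complement()`.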



## Translating a DNA sequence

Here is how we would translate a DNA sequence:

In [None]:
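from Bio.Seq import Seq

# Wrap the DNA string in a Seq object so we can use BioPython's sequence methods
dna = Seq("TCGCGCACGCTGATCGTGGGGTGA")
# translate() uses the standard genetic code by default; a stop codon is shown as *
dna.translate()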
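
The `translate()` method also takes optional arguments. The sketch below uses a made-up sequence (not one of the exercise sequences) to show two of them: `to_stop=True`, which ends the protein at the first stop codon, and `table`, which selects an NCBI genetic code by number. Table 11 is the bacterial, archaeal and plant plastid code; choosing it here is only an illustration.

```python
from Bio.Seq import Seq

# Made-up sequence: ATG AAA TAG CCC (the third codon is a stop codon)
dna = Seq("ATGAAATAGCCC")

# Default: translate every codon; stop codons appear as *
full = dna.translate()

# Stop at the first in-frame stop codon and leave the * out
up_to_stop = dna.translate(to_stop=True)

# Pick a genetic code by its NCBI table number
bacterial = dna.translate(table=11)

print(full, up_to_stop, bacterial)
```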

Can you use BioPython to answer the question from [Lesson 10](https://colab.research.google.com/drive/1trXzcwT0VnmdnVQY_Wj9b__pXVY8_7GJ) and translate these sequences?

```
TCGCGCACGCTGATCGTGGGGTGA
AGTAAAACTTTAATTGTTGGTTAA
```


## Reading a fasta file

Previously, we read a fasta file using code that we wrote ourselves. Here is a way to do it using BioPython:

In [None]:
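from Bio import SeqIO
import gzip

# Open the gzip-compressed file in text mode ('rt') so SeqIO receives text, not bytes
with gzip.open('Bc01.fasta.gz', 'rt') as handle:
    # SeqIO.parse() yields one record per fasta entry
    for sequence in SeqIO.parse(handle, 'fasta'):
        print(f"{sequence.id} is {len(sequence.seq)} bp")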
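
If you do not have the file to hand, here is a small sketch that parses fasta text held in memory instead. The two sequences and their headers are made up for illustration. It shows what each record contains: `id` is the first word of the header line, `description` is the whole header, and `seq` joins all the sequence lines together.

```python
from io import StringIO
from Bio import SeqIO

# Made-up fasta text; the second sequence is split across two lines
fasta_text = """>seq1 first made-up sequence
ATGAAATAG
>seq2 second made-up sequence
ATGCCC
GGGTAA
"""

# StringIO makes a string behave like an open file handle
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for record in records:
    print(record.id, "|", record.description, "|", len(record.seq))
```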


## Reading a fastq file

Here is some simple code to read a fastq file.

Note that we only read the first 10 sequences (records) from this fastq file! Often, fastq files are _huge_ and so this just provides a glimpse of the sequences and their qualities.


In [None]:
import gzip
from Bio import SeqIO

i = 0
with gzip.open('barcode01.fastq.gz', 'rt') as handle:
    for sequence in SeqIO.parse(handle, 'fastq'):
        i += 1
        print(sequence.seq)
        # One Phred quality score (an integer) per base in the sequence
        print(sequence.letter_annotations["phred_quality"])
        # Stop after the first 10 records
        if i == 10:
            break
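
To see where the numbers in `phred_quality` come from, here is a plain-Python sketch that does not use BioPython. It assumes the common Sanger-style encoding, where each quality character's ASCII code minus 33 is the Phred score, and a score Q corresponds to an error probability of 10^(-Q/10). The quality string and the helper names `phred_scores` and `error_probability` are made up for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a fastq quality string to a list of integer Phred scores."""
    return [ord(character) - offset for character in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Made-up quality string: 'I' is Q40, '5' is Q20, '+' is Q10, '!' is Q0
scores = phred_scores("II5+!")
mean_quality = sum(scores) / len(scores)

print(scores)
print(mean_quality)
print(error_probability(20))
```

So a base with a Phred score of 20 has about a 1 in 100 chance of being wrong, and a score of 40 means about 1 in 10,000.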

## Reading GenBank Files

One of the areas where using a mature toolset like BioPython _really_ helps is parsing complex file types, like GenBank files.

The GenBank file format is [defined at NCBI](https://www.ncbi.nlm.nih.gov/genbank/).

Features and segments in GenBank files are defined by the number of spaces at the start of a line, just like blocks of code are in Python!

However, BioPython makes going through a GenBank file very easy.

Here is a link to our [Bc01 phage genome](https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/Python/tikkala.gbk.gz), which is in a file called `tikkala.gbk.gz` and is compressed with `gzip`, like the files we used previously.

Download the file, and then upload it to Google Colab using the folder icon on the left.

(_Note_: If you have a regular, uncompressed file, you can either replace `gzip.open()` with `open()` in this example, or skip that line completely and give `SeqIO.parse()` the filename directly.)



In [None]:
import gzip
from Bio import SeqIO

genbank_file = 'tikkala.gbk.gz'
with gzip.open(genbank_file, 'rt') as handle:
    for record in SeqIO.parse(handle, 'genbank'):
        print(f"Processing record: {record.id}")
        for feature in record.features:
            # BioPython positions are 0-based and end-exclusive, like Python slices
            start = feature.location.start
            end = feature.location.end
            # Qualifier values are lists, and not every feature has a 'product'
            product = feature.qualifiers.get('product')
            if product:
                product = product[0]
            print(f"Feature: {feature.type}, Product: {product}, Start: {start}, End: {end}")
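
Once you have a feature, you will often want its sequence. The sketch below builds a tiny record in memory so it runs without the GenBank file. The sequence, the two `CDS` features and their product names are all made up for illustration. It uses `feature.extract()` to pull out each feature's DNA, which also reverse complements features on the minus strand, and then translates it.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record: a tiny forward gene followed by its own reverse complement
record = SeqRecord(
    Seq("ATGAAATAGCTATTTCAT"),
    id="demo",
    features=[
        SeqFeature(FeatureLocation(0, 9, strand=1), type="CDS",
                   qualifiers={"product": ["forward example"]}),
        SeqFeature(FeatureLocation(9, 18, strand=-1), type="CDS",
                   qualifiers={"product": ["reverse example"]}),
    ],
)

proteins = {}
for feature in record.features:
    if feature.type != "CDS":
        continue
    # extract() slices the parent sequence and reverse complements it
    # when the feature is on the minus strand
    cds = feature.extract(record.seq)
    product = feature.qualifiers.get("product", ["unknown"])[0]
    proteins[product] = str(cds.translate(to_stop=True))
    print(product, cds, proteins[product])
```

The same loop body should work on the records from `tikkala.gbk.gz`, although real genomes can contain features (such as pseudogenes) that do not translate cleanly.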