# Biopython 1: Introduction to Biopython

## What is Biopython?

Biopython is a set of Python tools developed for common tasks in computational molecular biology. Because it is open-source, its capabilities are continually being extended. Some common uses are:
* Parsing bioinformatics files into usable data structures
* Conveniently iterating over multiple data files
* Interfacing with online resources such as ENTREZ
* Working with sequences to perform analyses

In this workshop, I hope you will become familiar with Biopython and its basic utilities, and that you will then be able to apply what you have learned to your own research.

## Check if you have Biopython installed

Run the following code. If the first line fails, you do not have Biopython installed. You can install it through the command line by running `pip install biopython` or get it through Anaconda. If the second line fails, you have a very old version installed and you will need to update it.

In [4]:
import Bio
print(Bio.__version__)

1.78
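
If you would rather get a readable message than a traceback, the same check can be wrapped in a small helper. This is only a sketch: the function name `biopython_version` is made up for this example, and the fallback string for releases without a `__version__` attribute is an assumption about how you might want to report that case.

```python
import importlib


def biopython_version():
    """Return the installed Biopython version string, or None if it is missing.

    Hypothetical helper for this workshop; it is not part of Biopython.
    """
    try:
        bio = importlib.import_module("Bio")
    except ImportError:
        # Corresponds to "the first line fails" in the cell above.
        return None
    # Corresponds to "the second line fails": no __version__ attribute.
    return getattr(bio, "__version__", "unknown (very old release)")


version = biopython_version()
if version is None:
    print("Biopython is not installed; try: pip install biopython")
else:
    print("Biopython version:", version)
```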


## Example 1: Let's dive in!
### A usage example from the official Biopython Tutorial

In this example, we will be parsing files for a Ladyslipper Orchid. This data was obtained by searching the nucleotide database at NCBI and downloading the results in FASTA (`.fasta`) and GenBank (`.gbk`) formats. Later on, we will learn how to search and download data programmatically using Biopython.

The data is saved in the folder called `sample-data`. 

In [3]:
from Bio import SeqIO
for seq_record in SeqIO.parse("sample-data/ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
gi|2765648|emb|Z78523.1|CHZ78523
Seq('CGTAACCAGGTTTCCGT

Let's "unpack" this code and take a look at some of the objects and functions within Biopython.

`from Bio import SeqIO`

First, we import `SeqIO` from `Bio`. `Bio` is the name under which the Biopython package is imported, and `SeqIO` is a module within it that handles sequence input and output (reading and writing sequence files), as the name suggests.

Then, we iterate through a for loop to extract the individual sequences contained within the data file. The magic happens in the line:

`SeqIO.parse("sample-data/ls_orchid.fasta", "fasta")`

Here, we use the `parse()` function to read the data file. Its arguments are the file path and the file format, given as a lowercase string (in this case, `"fasta"`). `parse()` returns an iterator that yields one `SeqRecord` object per sequence in the file. Each record has attributes such as the sequence identifier (`id`) and the sequence itself (`seq`), and calling `len()` on a record gives the sequence length.
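
To see what "one record at a time" means, here is a simplified pure-Python sketch of a FASTA reader written as a generator. It is not Biopython's implementation, and the names `SimpleRecord` and `parse_fasta` are made up for this illustration. It only mimics the three things printed in the loop above (`id`, the sequence, and the length), using a tiny in-memory example instead of the orchid file.

```python
from dataclasses import dataclass
from io import StringIO


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for Biopython's SeqRecord (illustration only)."""
    id: str
    seq: str

    def __len__(self):
        # Like a SeqRecord, the length of a record is the length of its sequence.
        return len(self.seq)


def parse_fasta(handle):
    """Yield one SimpleRecord per '>' header line in a FASTA-formatted handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if record_id is not None:
                yield SimpleRecord(record_id, "".join(chunks))
            # Use the first whitespace-delimited word of the header as the id.
            words = line[1:].split(None, 1)
            record_id, chunks = (words[0] if words else ""), []
        elif record_id is not None:
            # Sequence data may span several lines; collect and join them.
            chunks.append(line)
    if record_id is not None:
        yield SimpleRecord(record_id, "".join(chunks))


# Toy data, not taken from ls_orchid.fasta.
example = StringIO(">seq1 first toy record\nACGT\nAC\n>seq2 second toy record\nGGC\n")
records = list(parse_fasta(example))
for rec in records:
    print(rec.id, rec.seq, len(rec))
```

Because the function is a generator, records are produced only as the loop asks for them, so the whole file never has to be held in memory at once. Wrapping the call in `list()`, as done here, is convenient for small files when you want to index into the records.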