# Introduction to Biopython, Jupyter, and GitHub

# Section 1: Control Flow 

[**Control flow**][cflow-wiki] refers to the order in which a program's statements are executed, and to the statements that determine that order.  As a basic example, suppose you want to find the sequences shorter than 200 nucleotides (nt) in a FASTA file and save them in another file.  In order to do this, you would:

* Read through each sequence in the file
* Determine whether the sequence is shorter than 200 nucleotides.  If it is, write it to another file.

Each of these is a *control statement*: the first is an example of a loop and the second is a conditional.  We'll start with `for` loops.
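The two steps above can be sketched in plain Python.  This is a minimal illustration, not the approach the rest of this tutorial takes: the FASTA text, the `read_fasta` helper, and the small `MAX_LENGTH` cutoff (standing in for 200) are all made up for the example, and an in-memory buffer stands in for the output file.

```python
import io

# Hypothetical FASTA content, held in memory so the sketch is self-contained.
fasta_text = """>seq1
ACGTACGTAC
>seq2
ACGT
>seq3
ACGTACGTACGTACGT
"""

MAX_LENGTH = 8  # stand-in for the 200-nucleotide cutoff, kept small for toy data


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


output = io.StringIO()  # stands in for the second file
for header, sequence in read_fasta(io.StringIO(fasta_text)):  # the loop
    if len(sequence) < MAX_LENGTH:                             # the conditional
        output.write(">{}\n{}\n".format(header, sequence))

print(output.getvalue())
```

Only `seq2` is shorter than the cutoff, so it is the only record written.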

[cflow-wiki]: https://en.wikipedia.org/wiki/Control_flow

## Section 1.1: Installing Biopython

### Exercise 1.1

## Section 1.2: Working With Sequence Objects


Sequence objects, or `Seq`s, are the foundation upon which Biopython is built.  A `Seq` object is made up of a `sequence` and an `alphabet`.

In [52]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())

In [53]:
print(my_seq)

GATCGATGGGCCTATATAGGATCGAAAATCGC


`Seq` objects are not subclasses of `str`, but they support many of the same operations and methods as strings.  For example:

In [54]:
len(my_seq)

32

In [55]:
my_seq[0]

'G'

In [56]:
my_seq[-1]

'C'

In [57]:
print(my_seq[0:20])

GATCGATGGGCCTATATAGG


In [58]:
for base in my_seq[:5]:
    print(base)

G
A
T
C
G


In [59]:
print(my_seq.lower())

gatcgatgggcctatataggatcgaaaatcgc


In [19]:
'ATC' in my_seq

True

We can also count the number of (non-overlapping) occurrences of a substring:

In [22]:
my_seq.count("ATC")

3
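To see what "non-overlapping" means, here is a small sketch using a plain Python string, whose `count()` method also counts non-overlapping matches.  The `count_overlapping` helper is a hypothetical function written for this example.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume the search one character later so overlapping hits are found.
        start = text.find(sub, start + 1)
    return count


dna = "AAAA"
non_overlapping = dna.count("AA")           # matches at positions 0 and 2
overlapping = count_overlapping(dna, "AA")  # matches at positions 0, 1 and 2
print(non_overlapping, overlapping)
```

The two counts differ only when a match can begin inside a previous match, as with repeats like `AA`.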

We can use the `count()` method to calculate the GC content of a sequence:

In [25]:
(my_seq.count("G") + my_seq.count("C")) / len(my_seq)

0.46875

Calculating the GC content of a sequence in this way doesn't account for the ambiguous nucleotide code S, which stands for either G or C.  The `GC` function in `Bio.SeqUtils` can handle such cases.  Note that `GC` returns a percentage rather than a fraction.

In [36]:
ambig_string = Seq("ATGCRAGCTSGSTRSTGCGGCGASSGAGSARRRGSSA", IUPAC.ambiguous_dna)
(ambig_string.count("G") + ambig_string.count("C")) / len(ambig_string)

0.3783783783783784

In [37]:
from Bio.SeqUtils import GC
GC(ambig_string)

59.45945945945946

In [74]:
GC(my_seq)

46.875
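The difference between the two calculations comes down to whether S is counted.  The sketch below is a plain-Python illustration of that idea using ordinary strings; `gc_percent` is a hypothetical helper written for this example and is not Biopython's implementation.

```python
def gc_percent(sequence):
    """Return GC content as a percentage, counting S (G or C) as GC."""
    if not sequence:
        return 0.0
    gc_like = sum(1 for base in sequence.upper() if base in "GCS")
    return 100.0 * gc_like / len(sequence)


ambig = "ATGCRAGCTSGSTRSTGCGGCGASSGAGSARRRGSSA"
plain = "GATCGATGGGCCTATATAGGATCGAAAATCGC"

print(gc_percent(ambig))  # 14 G/C plus 8 S, out of 37 positions
print(gc_percent(plain))  # 15 G/C out of 32 positions
```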

You can also slice `Seq` objects as you would a string.  For instance, we can get the first ten characters:

In [38]:
ambig_string[:10]

Seq('ATGCRAGCTS', IUPACAmbiguousDNA())

We can print out every other character:

In [40]:
ambig_string[::2]

Seq('AGRGTGTSGGCASASRRSA', IUPACAmbiguousDNA())

Every third position:

In [42]:
ambig_string[::3]

Seq('ACGSTTGGSGRGA', IUPACAmbiguousDNA())

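By changing the start index, we can take every third position beginning at the second or third base, which corresponds to shifting our reading frame on the sequence: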

In [43]:
ambig_string[1::3]

Seq('TRCGRGGAGSRS', IUPACAmbiguousDNA())

In [44]:
ambig_string[2::3]

Seq('GATSSCCSAARS', IUPACAmbiguousDNA())
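Stepped slices like these pick out single positions.  If you want whole codons for a given reading frame instead, one option is to slice in windows of three.  The sketch below uses a plain string (the first ten characters of `ambig_string`), and `codons` is a hypothetical helper written for this example.

```python
def codons(sequence, frame=0):
    """Split sequence into complete 3-letter codons starting at offset frame."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


dna = "ATGCRAGCTS"

every_third = dna[1::3]   # single characters at indices 1, 4, 7
frame0 = codons(dna, 0)   # windows starting at index 0
frame1 = codons(dna, 1)   # windows starting at index 1

print(every_third)
print(frame0)
print(frame1)
```

Trailing bases that do not fill a complete codon are dropped.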

We can also reverse the sequence:

In [45]:
ambig_string[::-1]

Seq('ASSGRRRASGAGSSAGCGGCGTSRTSGSTCGARCGTA', IUPACAmbiguousDNA())

Note that slicing a `Seq` returns another `Seq`, so anything you can do with a `Seq` you can do with a slice of a `Seq`.  

In [39]:
GC(ambig_string[:10])

50.0

There may be times when you just need a plain string version of a sequence, for instance when writing to a file or a database.  Passing a `Seq` directly to `write()` fails, so convert it with `str()` first:

In [46]:
with open('test', 'w') as fout:
    fout.write(ambig_string)

TypeError: write() argument must be str, not Seq

In [47]:
with open('test', 'w') as fout:
    fout.write(str(ambig_string))

Note, though, that using string formatting avoids this problem since `format` coerces `Seq` objects to strings.

In [48]:
with open('test', 'w') as fout:
    fout.write("See? \n{}".format(ambig_string))
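The same behaviour can be reproduced with any class that defines `__str__` but is not a `str` subclass.  In the sketch below, `ToySeq` is a hypothetical stand-in written for this example, and an in-memory buffer replaces the file on disk.

```python
import io


class ToySeq:
    """Hypothetical stand-in for a sequence class that is not a str subclass."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data


seq = ToySeq("ATGC")
buffer = io.StringIO()  # in-memory text file

error_message = None
try:
    buffer.write(seq)           # rejected: write() only accepts str
except TypeError as err:
    error_message = str(err)

buffer.write("{}".format(seq))  # format() falls back to str(seq)
buffer.write(f"\n{seq}")        # f-strings do the same
print(buffer.getvalue())
```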

We can also concatenate sequences with `+` just as you would strings.  As you would expect, the result is also a sequence object.  We'll concatenate `my_seq`, which has the `IUPACUnambiguousDNA` alphabet, and `ambig_string`, which has the `IUPACAmbiguousDNA` alphabet.  Which alphabet will their concatenation have?  

In [62]:
my_seq + ambig_string

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGCATGCRAGCTSGSTRSTGCGGCG...SSA', IUPACAmbiguousDNA())

Since their concatenation has ambiguous bases, its alphabet is `IUPACAmbiguousDNA`.

Alphabets ensure that we only concatenate compatible sequences.  For example, let's create a protein sequence and try to concatenate it to `my_seq`:

In [64]:
my_prot = Seq("EVRNAK", IUPAC.protein)
my_prot

Seq('EVRNAK', IUPACProtein())

In [65]:
my_prot + my_seq

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
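To make the idea concrete, here is a toy sketch of how a type check on concatenation might look.  `TypedSeq` and its `kind` attribute are hypothetical and much simpler than Biopython's alphabets; for instance, this sketch has no notion of an unambiguous alphabet being compatible with an ambiguous one.

```python
class TypedSeq:
    """Hypothetical toy class: a string tagged with a molecule type."""

    def __init__(self, data, kind):
        self.data = data
        self.kind = kind  # e.g. "dna" or "protein"

    def __add__(self, other):
        # Refuse to join sequences of different molecule types.
        if self.kind != other.kind:
            raise TypeError(
                "Incompatible types {} and {}".format(self.kind, other.kind)
            )
        return TypedSeq(self.data + other.data, self.kind)


dna_a = TypedSeq("GATC", "dna")
dna_b = TypedSeq("ATGC", "dna")
protein = TypedSeq("EVRNAK", "protein")

combined = dna_a + dna_b  # same type, so this succeeds

try:
    protein + dna_a       # different types, so this raises
    mixed_error = None
except TypeError as err:
    mixed_error = str(err)

print(combined.data)
print(mixed_error)
```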

We can use a `for` loop to concatenate multiple sequences together:

In [68]:
sequences = [Seq("AGCGATGTTACGCATCAGGGCAGTCGCCCTAAAACAAAGTTAGGCCGC", IUPAC.unambiguous_dna),
            Seq("CCGGTTGGTAACGGCGCAGTGGCGGTTTTCAT", IUPAC.unambiguous_dna),
            Seq("CACAGCGGTTTTCAT", IUPAC.unambiguous_dna),
            Seq("AACGGCGCATGGCGGTTTTCAT", IUPAC.unambiguous_dna),
            Seq("AGTGGCGGTTTTCAT", IUPAC.unambiguous_dna)]

concatenated_sequences = ''
for seq in sequences:
    concatenated_sequences += seq
    
concatenated_sequences

Seq('AGCGATGTTACGCATCAGGGCAGTCGCCCTAAAACAAAGTTAGGCCGCCCGGTT...CAT', IUPACUnambiguousDNA())
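For comparison, here is the same accumulation pattern with plain Python strings, alongside `str.join`, which is the usual idiom when every item is already a `str`.  The fragments are made up for this example.  Note that `str.join` only accepts `str` items, so a list of `Seq` objects would have to be converted with `str()` first.

```python
fragments = ["AGCGATG", "CCGGTT", "CACAGC"]  # hypothetical toy fragments

# Accumulate with a for loop, starting from an empty string.
looped = ""
for fragment in fragments:
    looped += fragment

# Equivalent one-liner for plain strings.
joined = "".join(fragments)

print(looped)
print(looped == joined)
```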

### Exercise 1.2

## Section 1.3: SeqRecord Objects

### Exercise 1.3

## Section 1.4: Working With Seq and SeqRecord Objects

### Exercise 1.4

## Section 1.5: Multiple Sequence Alignments

### Exercise 1.5