## Putting it together - strings!

Assign the string 'acctgtagctgaatcgtgtgttcgatcgat' to a variable called ```my_seq```. If you use the ```dir``` function on ```my_seq```, you'll see that it has methods called ```upper``` and ```count```.
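
As a quick illustration of what ```dir``` reports, the sketch below checks for the two method names on a short placeholder string. The names `example` and `method_names` are illustrative and not part of the exercise, so the work on ```my_seq``` is still left to you.

```python
# A short placeholder string; any str object exposes the same methods.
example = 'acgt'

# dir() returns a list of attribute names, each one a string.
method_names = dir(example)

# The 'in' operator checks whether a name appears in that list.
print('upper' in method_names)
print('count' in method_names)
```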

Use the ```upper``` method to print an uppercase version of ```my_seq```. Use the ```count``` method to print the number of cytosine residues and, separately, the number of guanine residues. Use the ```help``` function to see the syntax for these methods if you need to.

Use string formatting to print an informative message about these numbers to the user.

Bonus: Can you subset this string to print out the substring 'GAAT'?

In [1]:
my_seq = 'acctgtagctgaatcgtgtgttcgatcgat' # create var
my_seq = my_seq.upper() # use upper() method
cy = my_seq.count('C') # use count method
gu = my_seq.count('G') 
print('The number of cytosines is %d; the number of guanines is %d.' %(cy, gu))

print(my_seq[10:14]) # slice by position: counting starts at zero and the end index (14) is not included

# or be sly
# 'GAAT' in my_seq # is the string GAAT in there
# strt = my_seq.find('GAAT') # where does GAAT begin
# end = strt + len('GAAT') # we know the start position, add the length to get the end position
# my_seq[strt:end] # subset out the string

The number of cytosines is 6; the number of guanines is 8.
GAAT
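
The commented-out alternative in the solution can also be run as its own cell. Below is a minimal sketch, assuming you want the slice located by searching rather than by counting. The names `motif`, `start` and `end` are illustrative, and the check for `-1` is an addition: `find` returns `-1` when the substring is absent, which would otherwise give a misleading slice.

```python
my_seq = 'acctgtagctgaatcgtgtgttcgatcgat'.upper()
motif = 'GAAT'

start = my_seq.find(motif)       # index of the first match, or -1 if absent
if start == -1:
    print('%s was not found in the sequence.' % motif)
else:
    end = start + len(motif)     # slicing excludes the end index
    print('%s starts at index %d: %s' % (motif, start, my_seq[start:end]))
```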


## Putting it together - numbers!

The GC percentage is a common statistic generated for DNA sequences. 

Apply the ```count``` method and ```len``` function to your DNA character string, ```my_seq``` ('acctgtagctgaatcgtgtgttcgatcgat'), to calculate the GC percentage. Assign this percentage to a variable ```gc_perc``` and print the value out using string formatting. Assign the AT percentage to a variable ```at_perc``` and print this out as well.

Bonus: Using the ```len``` function and the modulo operator, decide whether ```my_seq``` has an odd or even number of DNA bases.

In [3]:
my_seq = 'acctgtagctgaatcgtgtgttcgatcgat'

cyt = my_seq.count('c')
gn = my_seq.count('g')

gc_perc = (cyt+gn)/len(my_seq)*100 # simple arithmetic to get %
at_perc = 100-gc_perc

print('The GC is %.2f. The AT is %.2f' % (gc_perc, at_perc))

seq_l = len(my_seq)
# next use boolean to check whether sequence length is odd/even
seq_l%2==0 # if True then seq_l must be even i.e. divides cleanly by 2 (no remainder)

The GC is 46.67. The AT is 53.33


True
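
The same arithmetic can be wrapped up so that it works whichever case the sequence is typed in. This is a sketch rather than part of the solution above: `gc_percent` is a hypothetical helper name, and lower-casing the input first is an assumption about how mixed-case sequences should be handled.

```python
def gc_percent(seq):
    """Return the percentage of characters in seq that are G or C."""
    seq = seq.lower()                         # count 'G' and 'g' alike
    gc_count = seq.count('g') + seq.count('c')
    return gc_count / len(seq) * 100          # true division in Python 3

my_seq = 'acctgtagctgaatcgtgtgttcgatcgat'
gc_perc = gc_percent(my_seq)
at_perc = 100 - gc_perc    # only valid if the sequence contains just A, C, G and T

print('GC: %.2f%%  AT: %.2f%%' % (gc_perc, at_perc))

# The modulo operator gives the remainder after division by 2:
# 0 means an even length, 1 means an odd length.
if len(my_seq) % 2 == 0:
    print('my_seq has an even number of bases (%d).' % len(my_seq))
else:
    print('my_seq has an odd number of bases (%d).' % len(my_seq))
```

Note that `%%` inside a format string prints a literal percent sign.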

## Homework

The DNA sequence below is the [FASTA](http://en.wikipedia.org/wiki/FASTA_format) representation of the human [PPARG](http://www.ncbi.nlm.nih.gov/nuccore/NM_005037.5) gene. As you know, [translation](https://en.wikipedia.org/wiki/Translation_%28biology%29) from DNA/RNA to protein is brought about by a triplet code, whereby each triplet of nucleotide bases codes for a particular amino acid in the protein sequence. Furthermore, translation almost always starts at the 'ATG' codon, which codes for the amino acid methionine (M). As you can see in the sequence below, the first codon is not ATG.

For this assignment you'll need to look up the help for the ```index()``` method for strings.
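
One detail worth knowing before you start: ```index``` and ```find``` return the same position when the substring is present, but behave differently when it is absent. The sketch below uses a short made-up string (`toy`) rather than the homework sequence.

```python
toy = 'CCGATGAAA'   # made-up example string, not the PPARG sequence

print(toy.index('ATG'))   # position of the first match
print(toy.find('ATG'))    # same position

print(toy.find('TTT'))    # find returns -1 when there is no match

try:
    toy.index('TTT')      # index raises ValueError when there is no match
except ValueError:
    print('index raised ValueError because TTT is absent')
```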

Cut and paste the sequence (only the sequence, not the FASTA header, i.e. paste from GGC onwards) into a string variable. Note that the sequence will span several lines, so think about how to deal with that. Use the ```index``` method to find the position of the first ATG codon. Print out that position. Use Python to extract the next three triplet codons and print these along with the amino acids in the protein sequence that these code for. You don't have to use Python for the amino acid lookup: there is a genetic code table [here](http://www.google.co.uk/imgres?imgurl=http://www.geek.com/wp-content/uploads/2013/12/genetic-code.jpg&imgrefurl=http://www.geek.com/science/scientists-discover-a-second-genetic-code-except-not-really-1579496/&h=1000&w=1172&tbnid=pvhXPeyS1CPIdM:&zoom=1&tbnh=156&tbnw=184&usg=__--K3O7YCY8zGaQbDTw400OngOhM=&docid=m1SWW2CQ_miS9M&itg=1).

Include brief comments in the code explaining what it does.

Use string formatting to print out informative messages from this analysis. 

In [4]:
# one potential solution
# simple copy paste into triple quotes, which lets the string span several lines
# (note that the newline characters are kept inside the string)
seq = '''GGCGCCCGCGCCCGCCCCCGCGCCGGGCCCGGCTCGGCCCGACCCGGCTCCGCCGCGGGCAGGCGGGGCC
CAGCGCACTCGGAGCCCGAGCCCGAGCCGCAGCCGCCGCCTGGGGCGCTTGGGTCGGCCTCGAGGACACC
GGAGAGGGGCGCCACGCCGCCGTGGCCGCAGAAATGACCATGGTTGACACAGAGATGCCATTCTGGCCCA
CCAACTTTGGGATCAGCTCCGTGGATCTCTCCGTAATGGAAGACCACTCCCACTCCTTTGATATCAAGCC
CTTCACTACTGTTGACTTCTCCAGCATTTCTACTCCACATTACGAAGACATTCCATTCACAAGAACAGAT
CCAGTGGTTGCAGATTACAAGTATGACCTGAAACTTCAAGAGTACCAAAGTGCAATCAAAGTGGAGCCTG
CATCTCCACCTTATTATTCTGAGAAGACTCAGCTCTACAATAAGCCTCATGAAGAGCCTTCCAACTCCCT
CATGGCAATTGAATGTCGTGTCTGTGGAGATAAAGCTTCTGGATTTCACTATGGAGTTCATGCTTGTGAA
GGATGCAAGGGTTTCTTCCGGAGAACAATCAGATTGAAGCTTATCTATGACAGATGTGATCTTAACTGTC
GGATCCACAAAAAAAGTAGAAATAAATGTCAGTACTGTCGGTTTCAGAAATGCCTTGCAGTGGGGATGTC
TCATAATGCCATCAGGTTTGGGCGGATGCCACAGGCCGAGAAGGAGAAGCTGTTGGCGGAGATCTCCAGT
GATATCGACCAGCTGAATCCAGAGTCCGCTGACCTCCGGGCCCTGGCAAAACATTTGTATGACTCATACA
TAAAGTCCTTCCCGCTGACCAAAGCAAAGGCGAGGGCGATCTTGACAGGAAAGACAACAGACAAATCACC
ATTCGTTATCTATGACATGAATTCCTTAATGATGGGAGAAGATAAAATCAAGTTCAAACACATCACCCCC
CTGCAGGAGCAGAGCAAAGAGGTGGCCATCCGCATCTTTCAGGGCTGCCAGTTTCGCTCCGTGGAGGCTG
TGCAGGAGATCACAGAGTATGCCAAAAGCATTCCTGGTTTTGTAAATCTTGACTTGAACGACCAAGTAAC
TCTCCTCAAATATGGAGTCCACGAGATCATTTACACAATGCTGGCCTCCTTGATGAATAAAGATGGGGTT
CTCATATCCGAGGGCCAAGGCTTCATGACAAGGGAGTTTCTAAAGAGCCTGCGAAAGCCTTTTGGTGACT
TTATGGAGCCCAAGTTTGAGTTTGCTGTGAAGTTCAATGCACTGGAATTAGATGACAGCGACTTGGCAAT
ATTTATTGCTGTCATTATTCTCAGTGGAGACCGCCCAGGTTTGCTGAATGTGAAGCCCATTGAAGACATT
CAAGACAACCTGCTACAAGCCCTGGAGCTCCAGCTGAAGCTGAACCACCCTGAGTCCTCACAGCTGTTTG
CCAAGCTGCTCCAGAAAATGACAGACCTCAGACAGATTGTCACGGAACACGTGCAGCTACTGCAGGTGAT
CAAGAAGACGGAGACAGACATGAGTCTTCACCCGCTCCTGCAGGAGATCTACAAGGACTTGTACTAGCAG
AGAGTCCTGAGCCACTGCCAACATTTCCCTTCTTCCAGTTGCACTATTCTGAGGGAAAATCTGACACCTA
AGAAATTTACTGTGAAAAAGCATTTTAAAAAGAAAAGGTTTTAGAATATGATCTATTTTATGCATATTGT
TTATAAAGACACATTTACAATTTACTTTTAATATTAAAAATTACCATATTATGAAATTGCTGATAGTA'''

# get ATG position (zero-based; the count includes the two newline characters that come before it)
print('The ATG codon begins at %d.' % seq.index('ATG')) # or you could use seq.find('ATG')

# the ATG occupies indices 175-177, so the next 9 bases (3 codons) are at indices 178-186
three_codons = seq[178:187]

# subsetting gets us the codons we want
codon1 = three_codons[:3]   
codon2 = three_codons[3:6]
codon3 = three_codons[6:]        

# print stuff out, string formatting
print('The next three codons are: %s, %s, %s.' % (codon1, codon2, codon3))
print('The amino acids are Thr, Met and Val.') # just looked these up in table, better method next week!

The ATG codon begins at 175.
The next three codons are: ACC, ATG, GTT.
The amino acids are Thr, Met and Val.
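
The position reported above counts the two newline characters that sit inside the triple-quoted string before the ATG. If you want positions that refer to bases only, one option is to strip the newlines first. The sketch below does this on just the first three lines of the sequence to keep it short; `clean_seq`, `after_atg` and `codon_table` are illustrative names, and `codon_table` is a deliberately partial lookup holding only the three codons needed here, not the full genetic code.

```python
# First three lines of the sequence only, to keep the example short.
seq = '''GGCGCCCGCGCCCGCCCCCGCGCCGGGCCCGGCTCGGCCCGACCCGGCTCCGCCGCGGGCAGGCGGGGCC
CAGCGCACTCGGAGCCCGAGCCCGAGCCGCAGCCGCCGCCTGGGGCGCTTGGGTCGGCCTCGAGGACACC
GGAGAGGGGCGCCACGCCGCCGTGGCCGCAGAAATGACCATGGTTGACACAGAGATGCCATTCTGGCCCA'''

# Remove the newline characters so that positions refer to bases only.
clean_seq = seq.replace('\n', '')

start = clean_seq.index('ATG')   # zero-based position of the first ATG
print('With newlines kept, ATG is found at %d.' % seq.index('ATG'))
print('With newlines removed, ATG is found at %d.' % start)

# Locate the three codons after the ATG from start, rather than counting by hand.
after_atg = start + 3
codon1 = clean_seq[after_atg:after_atg + 3]
codon2 = clean_seq[after_atg + 3:after_atg + 6]
codon3 = clean_seq[after_atg + 6:after_atg + 9]

# Partial lookup table: only the codons needed here.
codon_table = {'ACC': 'Thr', 'ATG': 'Met', 'GTT': 'Val'}

print('The next three codons are: %s, %s, %s.' % (codon1, codon2, codon3))
print('The amino acids are %s, %s and %s.'
      % (codon_table[codon1], codon_table[codon2], codon_table[codon3]))
```

Computing the slice boundaries from `start` means the code keeps working if the sequence is edited, whereas hard-coded indices would need recounting.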
